Circuits and Systems, 2013, 4, 299-303
http://dx.doi.org/10.4236/cs.2013.43041 Published Online July 2013 (http://www.scirp.org/journal/cs)
Power and Time Efficient IP Lookup Table Design Using
Partitioned TCAMs
Youngjung Ahn, Yongsuk Lee, Gyungho Lee
Department of Computer Science & Engineering, Korea University, Seoul, South Korea
Email: yjahn@formal.korea.ac.kr, duchi@korea.ac.kr, ghlee@korea.ac.kr
Received May 7, 2013; revised June 7, 2013; accepted June 14, 2013
Copyright © 2013 Youngjung Ahn et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ABSTRACT
This paper proposes a power and time efficient scheme for designing IP lookup tables. The proposed scheme uses parti-
tioned Ternary Content Addressable Memories (TCAMs) that store IP lookup tables. The proposed scheme enables O(1)
time penalty for updating an IP looku p table. The partitioned TCAMs allow an update don e by a simple insertion with-
out the need for routing table sorting. The organization of the routing table of the proposed scheme is based on a parti-
tion with respect to the output port for routing with a smaller priority encoder. The proposed scheme still preserves a
similar storage requirement and clock rate to those of existing designs. Furthermore, this scheme reduces power con-
sumption due to using a p artitioned routing table.
Keywords: IP Lookup Device; Routi n g Tab l e; TCAMs; In serti on
1. Introduction
IP routers have been growing more complex with each
passing year. These routers must continually support new
features such as Quality of Service and Service Level
Agreement monitoring, and they must perform these ser-
vices at an increasing pace to match new connection
speeds. The central design issue for these devices is per-
forming IP lookup and classification at wire speed. Many
routers now include a co-processor to perform the crucial
task of IP lookup. The most critical issues are related to
performance degradation, which occurs when a new en-
try is inserted into the table and due to the limitations
imposed by the large encoder logic. In the worst case, the
insertion will take O(N) time to be completed, where N
represents the number of entries in the routing table. Ta-
ble size is expected to grow in the future at an exponen-
tial rate, and the update time will linearly grow with table
size [1]. On an average day, this same router may spend
up to 10% of the time updating the table. No lookups are
possible while an insertion is occurring. The insertion
time has a dramatic effect on the performance of routers
with large tables [2,3].
TCAMs allow a third matching state of “X” or “Don’t
Care” for one or more bits in the stored data word, thus
adding flexibility to the search [4]. TCAMs have a very
high cost/den sity ratio, and they have to search the entire
table for each lookup that causes high power consump-
tion [5]. Therefore, TCAMs are used in specialized ap-
plications such as packet classification and routing in
high performance routers, where the searching speed
cannot be accomplished using a less costly method.
TCAMs are currently used to perform the task of the
Longest Prefix Match (LPM) in high-end routers because
they are able to search a table in parallel in one cycle. In
such high end routers, it is crucial to have a time and
power efficient scheme for updating IP lookup tables
with new entry insertions. This paper proposes a novel
solution that eliminates the need for routing tab le sorting
per prefix length. As a result, the time penalty for insert-
ing a new entry in the IP routing table is O(1). Moreover,
it significantly reduces power consumption because of its
partitioned TCAMs. This paper describes the proposed
design and implementation details along with the simula-
tion results and design evaluation.
2. Related Works
There have been many attempts to address the insertion
problem in IP routing tables [6]. A commonly used tech-
nique involves adding empty space after each prefix
group. If there is an empty slot after the required prefix
group, the TCAM may keep a few empty memory loca-
tions every x non-empty memory locations (x < N, where
N is the table size). The average case update time im-
proves to O(x), but it degenerates to O(N) if the interme-
C
opyright © 2013 SciRes. CS
Y. AHN ET AL.
300
diate empty spaces are filled up [7]. A better technique is
called the L-Algorithm, which can create an empty space
in a TCAM in no more than L memory shifts [7].
L-Algorithm reduces the worst-case to O(L) where L is
the length of the IP address. This means the algorithm
would have a worst-case insertion time of O(32) for IPv4
and O(128) for IPv6. This algorithm takes advantage of
the fact that sorting within each length group is unneces-
sary, so only one entry from each group needs to be
moved. This provides a great improvement over the O(N).
However, even with the L-Algorithm, problems can still
occur in an actual router. A routing table update due to a
new entry insertion may still need 32 or 128 cycles; this
is undesirable in high-end routers. The design may also
require a large overhead to manage the free space in the
table. Also, when a burst of updates is received by the
router, the router will not be able to route packets while
the insertion is taking place. This means that the incom-
ing packets need to be buffered. If the buffer size is not
sufficient to hold all the received packets, then some will
be dropped. This hurts the major selling point of routers,
level of the worst-case performance and reliability. If the
insertion time is reduced to O(1), then the need for addi-
tional buffering would be unnecessary since the inser-
tions can happen at lookup speed.
TCAMs are currently used to perform the task of the
Longest Prefix Match (LPM) in high-end routers because
they are able to search a table in parallel in one cycle.
However, TCAMs are inefficient in terms of power con-
sumption even though the search time is fast. The size of
TCAMs in IP lookup devices is increasing. There are
currently many studies on reducing the power consumed
by IP lookup devices by dividing routing tables [8,9].
3. Design Approach and Implementation
Details
In order to keep the table sorted, the table must be up-
dated, and the result would take O(N) moves. In the pro-
posed design, the need for sorting is removed, and thus
insertion time is improved to O(1). Figure 1 shows the
proposed design. The output port divides the table, so it
is partitioned into smaller tables. The number of tables
generated equals the number of output ports of the router,
and each table holds a collection of all the entries that
map to the output port that it corresponds with. All en-
tries in a partitioned table map to the same output port,
making it unnecessary to sort the entries in the table.
When searching, each TCAM checks the IP address in
parallel. Each table outputs the matched lengths to a se-
lection logic. The selection logic chooses the longest
length, and the packet is sent to the output port based on
the table that had the longest prefix match. When inser-
tion is need, the output port matching the entry is c hec ked
by analyzing the entry. After figuring out the matching
output port, the entry is inserted into an open location in
the corresponding table. Note that the entry goes to “any”
open location in the corresponding table.
Figure 2 shows the modified design for the partitioned
table. Typical TCAM cells storing the network address
prefix are shown on the left, and memory cells storing
the lengths with the priority encoder removed are shown
on the right. When the logic selects the table containing
the LPM, the information of the output port is known,
and there no longer a need for SRAM memory to store
such information. Since the TCAM cells can be con-
nected to the corresponding SRAM memory cells be-
cause they are directly related, the priority encoder and
address decoders are also not needed. In addition, more
than one row can be asserted at once. This is possible
because there can only be one match per length. If a
lookup results in more than one match of the same length,
then this would mean that duplicate entries have been
stored. Therefore, the maximum number of matches is
32-bits, the length of an IP address. Lengths are stored in
a one-cold encoding. Since lengths are being stored in-
stead of output ports in the SRAM cells, the number of
SRAM cells needed is more than that in a typical router.
Since the TCAM cells are directly connected to the
SRAMS cells, the proposed design reduces the complex-
ity of the traditional design. For each entry, 32 TCAM
cells are used to store the prefix, and 32 SRAM cells are
used to store the prefix length. If more than one entry
match is found, then each asserts a corresponding output
line to reveal the prefix length. The selection logic
checks the length output to determine the table contain-
ing the longest prefix match. This can be used to choose
the longest match since there can only be a maximum of
Figure 1. Proposed architecture.
Figure 2. Modified routing table.
Copyright © 2013 SciRes. CS
Y. AHN ET AL. 301
one matching entry for each prefix length between 1 and
32.
For the proposed design, th e SRAM cell word lines are
directly connected to the TCAM match lines. During
normal SRAM operation, only one word line should be
activated at once because if multiple word lines are acti-
vated, then the neighboring SRAM cells may become
corrupt due to the shared bit line. The TCAM search may
have multiple matches, and it can cause multiple SRAM
word lines to be activated. Additional access transistors
are added to each SRAM cell to activate the SRAM cell
only if a “0” is stored in the cell. Only one SRAM cell
will be activated for each column since each match can
correspond to only one leng th, and the length is stored in
a one-cold encoding. To enable access to the cell for
writing, two additional transistors are needed and four
transistors in total are added for each 6T SRAM cell.
This is shown in Figure 3. The standard 6T SRAM cell
is presented in the dotted line. The four extra transistors
will increase the area and latency of the SRAM cell.
The size of each partition of the table should still be
fairly large because the total table size is very large and
the number of output ports is usually quite low. Long
lines will significantly increase the latency [10]. This
requires each partition to be further divided into smaller
blocks, as is shown in Figure 3. The outputs from each
block are combined in an OR scheme. Since there can
only be one match for each length, there will be no con-
flicts.
Normally, the distribution of the prefixes will not be
balanced across p artitions. In some cases, more than 90%
of the prefixes forward to the same port [11]. Many con-
figurations are possible since this proposed design subdi-
vides the partitions further to reduce the bit line length.
Figure 3. SRAM cell design and par t itioned layout.
By combining the outputs of the blocks, i.e., sub-tables,
the OR scheme controls the boundaries of each partition.
One end of the boundary connects to the selection logic,
while the other end connects to a column pull-up. A basic
switching scheme allows the partition sizes to be pro-
grammable. The number of possible configurations can
be determined by the minimum grouping size. Due to
hardware complexity limitations, switching connections
cannot be available at all sub-table boundaries.
The selection logic takes inputs from the output lines
of the partitions of the tab le. The logic uses 32 inputs for
each output port, so if the router has four output ports,
then 128 inputs are connected. The selection logic first
determines the longest length that was found. Then, the
output port is determined based on the table that the LPM
is from. The logic finds the port with which the highest
length is associated. Then, the packet can be forwarded
to the correct port. This is a very simple logic and is
similar in complexity to a small p riority encoder.
4. Evaluation of the Design
Improving insertion time is only beneficial if the lookup
latency is not slow er than that of the original d esign. The
proposed design is compared to a typical co-processor
design. The typical design for this comparison contains a
standard TCAM that is capable of holding up to 256 K
entries. There are four and sixteen output ports in typical
core routers. By analyzing this, four output ports have
been used. This latency of the new and old designs could
be broken down into several parts and could also be
separately compared as shown in Table 1. As the table
shows, both designs have similar latencies. Although the
SRAM reading time is longer than that of the proposed
design, the selection logic is much simpler than the logic
of the priority encoder. The first portion of both designs
begins with the TCAM search. The design of the TCAM
cells remains unchanged so that the latency of the TCAM
search could remain identical. In the proposed de sign , th e
next step is to access the SRAM cells that correspond to
the matched prefixes from the TCAM search to deter-
mine the length of the prefix.
The overall size of the SRAM cell has increased due to
the addition of access transistors. This increases the
length of each bit line causing the capacitance to become
larger, and this increases the necessity time to pull down
Table 1. The latency.
TCAM
Search Priority
Encoder SRAM
Read Total
Original
Design ~6.5 ns ~3.6 ns 2.7 ns ~12.8 ns
TCAM
Search SRAM
Read Selection
Logic Total
Proposed
Design ~6.5 ns 4.1 ns ~0.98 ns ~11.6 ns
Copyright © 2013 SciRes. CS
Y. AHN ET AL.
302
the line. Since the bit lines of one large table are too long,
the partition table is further broken into sub-tables. Each
table outputs the bits corresponding to the lengths that
were found in the sub-table. The results from each
sub-table and each length are combined using an OR
scheme. The optimal configuration was found by com-
paring the delay due to the size of each sub-table and the
delay of combining the results from each sub-table. Ca-
dence tools in a 0.25 um technology were used to simu-
late the access time of various SRAM configurations to
compare the standard of SRAM cells. The capacitances
of the bit lines were estimated based on the area increase
of the additional transistors. These results show that the
original SRAM performs approximately 33% faster than
the new SRAM in all configurations. A full custom lay-
out would provide more accurate results. Larger size is
preferred in order to reduce the number of outputs that
should be combined tog ether. Table 2 shows the simula-
tion result.
In old designs, the delay of the major logic is from the
priority encoder. This circuit becomes larger when it
handles the input from all TCAM search lines. The pro-
posed design directly connects the TCAM cells to the
SRAM cells so that the priority encoder could be re-
moved. The address decoder of the SRAM from the old
design can be removed too. The major logic delay ac-
cording to the proposed design is from the selection logic.
First, the logic chooses the longest length, and it then
determines the length of the partition table that was al-
ready found. The circuit is similar to a very small priority
encoder.
The logic was designed by using Mentor Graphics
ModelSim in Verilog. The design was synthesized using
a 0.50 um technology and simulated by using Mentor
Graphics LenardoSpecturm. The results show that the
selection logic is much easier than the priority encoder
and has a smaller delay. The results are shown in Table
3. To evaluate the scalability of the proposed design, the
differences between the two designs were compared. The
Table 2. SRAM simulations.
Size Original SRAM Proposed SRAM
128 1.4 ns 2.0 ns
256 2.0 ns 3.0 ns
512 2.7 ns 4.1 ns
Table 3. Logic simulations.
Priority Encoder Selection Logic
Gates 27,668 3015
Delay 14.4 ns 3.8 ns
TCAM cells have not been modified, so they will scale in
the same way. When the SRAM cells were modified,
they have shown about 33% difference in performance.
Although the proposed design continuously has a differ-
ence, it will also scale in the same way. The largest dif-
ference between the two designs is found in the differ-
ence in logic. The selection logic scales logarithmically
depend on three things. The length of the IP address, the
number of output ports on the router and the size of the
routing table. Since the length of the IP address is unex-
pected to increase after IPv6 and the number of output
ports to change, the most important factor is the number
of entries in the routing table. A priority encoder’s delay
will also scale logarithmically with respect to the number
of entries in the table [12]. Its delay is not affected by the
number of output ports or length of the IP address be-
cause there are no dependencies. The address decoder in
SRAM will have a logN scaling factor. Based on the
above analysis, both designs are expected to scale simi-
larly as the routing table size grows. The power con-
sumption in the partitioned routing table is shown as be-
low. When the length of prefix is 32-bits while the num-
ber of entry of routing table is n, the power consumption
of TCAMs is denoted as (1) [13]. P(n) consists of one
TCAMs and describes the power consumption while the
number of entry is n.
22
Pnn0.5log n10.5log n 
search : match17 :83

(1)
In IP lookup, the behavior of TCAMs can be divided
by IP searching and notification of matched Prefix. Ac-
cording to [14], the ratio of power consumption of two
functions is shown as (2). The IP search in TCAMs is
advanced in all partitio ned routing tab les. In addition, the
notification of match ed Prefix is progressed in one parti-
tioned routing tables.
(2)
The proposed scheme divides the routing table corre-
sponding to the output port. The number of entry of each
partitioned table is n/x wh en the number of outpu t port is
x. By using this property and (2), the total power con-
sumption of partitioned routing table is shown as (3).

total nn
P1 0.17x P1 0.83p
xx
 
 
 
 
(3)
Table 4 describes power consumption and efficiency
corresponding to the number of entry, when the number
of output port is sixteen. When the routing table can be
divided into sixteen tables, efficiency is increased by
approximately 28% compared to the power consumption
of existing TCAMs. When the total number of entries is
increased, the efficiency of power consumption is de-
creased because the number of x entries of each parti-
tioned routing table is increased.
Copyright © 2013 SciRes. CS
Y. AHN ET AL.
Copyright © 2013 SciRes. CS
303
Table 4. Power consumption.
Power Consumption
# of
Entry
# of
Output
Port TCAMs Partitioned
TCAMs
Efficiency
128 K 16 3447.87 2432.89 29.44
256 K 16 5129.00 3660.05 28.64
512 K 16 7612.31 5488.88 27.89
1024 K 16 11274.00 8207.60 27.20
5. Conclusion
The purpose of IP router is to make a decision on a rout-
ing path to use and to forward a packet corresponding to
the decided route. An existing router, which stores a pre-
fix in the routing table, has a difficulty to meet QoS re-
quirement from rapidly expanding internet environmen t—
keep speeding up and adding new routes. The contribu-
tion of the proposed scheme is mainly to reduce routing
table updating time and also the power consumption at
the same time. The new design improves the routing ta-
ble updating time by storing new prefix in routing table
in unsorted manner: The worst case updating time in ex-
isting design O(N) reduces to O(1). In order to do this,
the routing table is partitioned per output port, while the
SRAM that stores a prefix length is directly connected to
each partition of the routing table. This allows partitioned
TCAMs to be employed in the design for shorter delay
and lower power consumption. The logic of priority en-
coder and another logic related to existing SRAM are
replaced to simple selection logic. This removes not only
the needs for ordering the table by prefix length but also
the lookup process for finding an output port.
6. Acknowledgements
This work was supported in part by the Basic Science
Research Program through the National Research Foun-
dation of Korea (NRF) funded by the Ministry of Educa-
tion, Science an d Technology (2012R1A1A2004615).
REFERENCES
[1] “Latest Version of AS65000-BGP Routing Table Statis-
tics Analysis Report.”
http://bgp.potaroo.net/as2.0/bgp-active.html
[2] V. Srinivasan, B. Nataraj and S. Khanna, “Methods for
Longest Prefix Matching In a Content Addressable Mem-
ory,” US Patent 6237061, 1999.
[3] R. Guo and J. G. Delgado-Frias, “IP Routing Table Com-
paction and Sampling Schemes to Enhance TCAM Cache
Performance,” Journal of Systems Architecture, Vol. 55,
No. 1, 2009, pp. 61-69. doi:10.1016/j.sysarc.2008.08.001
[4] K. Pagiamtzis and A. Sheikholeslami, “Content-Addressable
Memory (CAM) Circuits and Architectures: A Tutorial
and Survey,” IEEE Journal of Solid-State Circuits, Vol.
41, No. 3, 2006, pp. 712-727.
doi:10.1109/JSSC.2005.864128
[5] S. Kaxiras and G. Keramidas, “IPStash: A Power-Effi-
cient Memory Architecture for IP-Lookup,” 36th Interna-
tional Proceedings of Symposium on Microarchitecture,
San Diego, 3-5 December 2003, pp. 361-372.
[6] M. J. Akhbarizadeh and M. Nourani, “An IP Packet For-
warding Technique Based on Partitioned Lookup Table,”
IEEE International Conference on Communications, Vol.
4, 2002, pp. 2263-2267.
[7] D. Shah and P. Gupta, “Fast Updating Algorithms for
TCAMs,” IEEE Micro, Vol. 21, No. 1, 2001, pp. 36-47.
doi:10.1109/40.903060
[8] T. Kocak and F. Basci, “A Power-Efficient TCAM Ar-
chitecture for Network Forwarding Tables,” Journal of
Systems Architecture, Vol. 52, No. 5, 2006, pp. 307-314.
doi:10.1016/j.sysarc.2005.12.001
[9] V. C. Ravikumar, R. N. Mahapatra and L. N. Bhuyan,
“EaseCAM: An Energy and Storage Efficient TCAM-
Based Router Architecture for IP Lookup,” IEEE Trans-
actions of Computer, Vol. 54, No. 5, 2005, pp. 521-533.
doi:10.1109/TC.2005.78
[10] B. S. Amrutur and M. A. Horowitz, “Speed and Power
Scaling of SRAMs,” IEEE Transactions on Solid-State
Circuits, Vol. 35, No. 2, 2000, pp. 175-185.
doi:10.1109/4.823443
[11] The Internet Performance Measurement and Analysis Pro-
ject. http://ftp.chg.ru/pub/network/routing/ipma/Manual/
[12] C. H. Huang and J. S. Wang, “High-Performance and
Power-Efficient CMOS Comparators,” IEEE Journal of
Solid-State Circuits, Vol. 38, No. 2, 2003, pp. 254-262.
doi:10.1109/JSSC.2002.807409
[13] W. Lu and S. Sahni, “Low-Power TCAMs for Very Large
Forwarding Tables,” IEEE/ACM Transactions on Net-
working, Vol. 18, No. 3, 2010, pp. 948-959.
doi:10.1109/TNET.2009.2034143
[14] B. Agrawal and T. Sherwood, “Ternary CAM Power and
Delay Model: Extensions and Uses,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, Vol. 16,
No. 5, 2008, pp. 554-564.