Power and Time Efficient IP Lookup Table Design Using Partitioned TCAMs

doi:10.4236/cs.2013.43041

Paper Menu >>

Journal Menu >>

Circuits and Systems, 2013, 4, 299-303

http://dx.doi.org/10.4236/cs.2013.43041 Published Online July 2013 (http://www.scirp.org/journal/cs)

Power and Time Efficient IP Lookup Table Design Using

Partitioned TCAMs

Youngjung Ahn, Yongsuk Lee, Gyungho Lee

Department of Computer Science & Engineering, Korea University, Seoul, South Korea

Email: yjahn@formal.korea.ac.kr, duchi@korea.ac.kr, ghlee@korea.ac.kr

Received May 7, 2013; revised June 7, 2013; accepted June 14, 2013

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

This paper proposes a power and time efficient scheme for designing IP lookup tables. The proposed scheme uses parti-

tioned Ternary Content Addressable Memories (TCAMs) that store IP lookup tables. The proposed scheme enables O(1)

time penalty for updating an IP looku p table. The partitioned TCAMs allow an update don e by a simple insertion with-

out the need for routing table sorting. The organization of the routing table of the proposed scheme is based on a parti-

tion with respect to the output port for routing with a smaller priority encoder. The proposed scheme still preserves a

similar storage requirement and clock rate to those of existing designs. Furthermore, this scheme reduces power con-

sumption due to using a p artitioned routing table.

Keywords: IP Lookup Device; Routi n g Tab l e; TCAMs; In serti on

1. Introduction

IP routers have been growing more complex with each

passing year. These routers must continually support new

features such as Quality of Service and Service Level

Agreement monitoring, and they must perform these ser-

vices at an increasing pace to match new connection

speeds. The central design issue for these devices is per-

forming IP lookup and classification at wire speed. Many

routers now include a co-processor to perform the crucial

task of IP lookup. The most critical issues are related to

performance degradation, which occurs when a new en-

try is inserted into the table and due to the limitations

imposed by the large encoder logic. In the worst case, the

insertion will take O(N) time to be completed, where N

represents the number of entries in the routing table. Ta-

ble size is expected to grow in the future at an exponen-

tial rate, and the update time will linearly grow with table

size [1]. On an average day, this same router may spend

up to 10% of the time updating the table. No lookups are

possible while an insertion is occurring. The insertion

time has a dramatic effect on the performance of routers

with large tables [2,3].

TCAMs allow a third matching state of “X” or “Don’t

Care” for one or more bits in the stored data word, thus

adding flexibility to the search [4]. TCAMs have a very

high cost/den sity ratio, and they have to search the entire

table for each lookup that causes high power consump-

tion [5]. Therefore, TCAMs are used in specialized ap-

plications such as packet classification and routing in

high performance routers, where the searching speed

cannot be accomplished using a less costly method.

TCAMs are currently used to perform the task of the

Longest Prefix Match (LPM) in high-end routers because

they are able to search a table in parallel in one cycle. In

such high end routers, it is crucial to have a time and

power efficient scheme for updating IP lookup tables

with new entry insertions. This paper proposes a novel

solution that eliminates the need for routing tab le sorting

per prefix length. As a result, the time penalty for insert-

ing a new entry in the IP routing table is O(1). Moreover,

it significantly reduces power consumption because of its

partitioned TCAMs. This paper describes the proposed

design and implementation details along with the simula-

tion results and design evaluation.

2. Related Works

There have been many attempts to address the insertion

problem in IP routing tables [6]. A commonly used tech-

nique involves adding empty space after each prefix

group. If there is an empty slot after the required prefix

group, the TCAM may keep a few empty memory loca-

tions every x non-empty memory locations (x < N, where

N is the table size). The average case update time im-

proves to O(x), but it degenerates to O(N) if the interme-

Y. AHN ET AL.

300

diate empty spaces are filled up [7]. A better technique is

called the L-Algorithm, which can create an empty space

in a TCAM in no more than L memory shifts [7].

L-Algorithm reduces the worst-case to O(L) where L is

the length of the IP address. This means the algorithm

would have a worst-case insertion time of O(32) for IPv4

and O(128) for IPv6. This algorithm takes advantage of

the fact that sorting within each length group is unneces-

sary, so only one entry from each group needs to be

moved. This provides a great improvement over the O(N).

However, even with the L-Algorithm, problems can still

occur in an actual router. A routing table update due to a

new entry insertion may still need 32 or 128 cycles; this

is undesirable in high-end routers. The design may also

require a large overhead to manage the free space in the

table. Also, when a burst of updates is received by the

router, the router will not be able to route packets while

the insertion is taking place. This means that the incom-

ing packets need to be buffered. If the buffer size is not

sufficient to hold all the received packets, then some will

be dropped. This hurts the major selling point of routers,

level of the worst-case performance and reliability. If the

insertion time is reduced to O(1), then the need for addi-

tional buffering would be unnecessary since the inser-

tions can happen at lookup speed.

TCAMs are currently used to perform the task of the

Longest Prefix Match (LPM) in high-end routers because

they are able to search a table in parallel in one cycle.

However, TCAMs are inefficient in terms of power con-

sumption even though the search time is fast. The size of

TCAMs in IP lookup devices is increasing. There are

currently many studies on reducing the power consumed

by IP lookup devices by dividing routing tables [8,9].

3. Design Approach and Implementation

Details

In order to keep the table sorted, the table must be up-

dated, and the result would take O(N) moves. In the pro-

posed design, the need for sorting is removed, and thus

insertion time is improved to O(1). Figure 1 shows the

proposed design. The output port divides the table, so it

is partitioned into smaller tables. The number of tables

generated equals the number of output ports of the router,

and each table holds a collection of all the entries that

map to the output port that it corresponds with. All en-

tries in a partitioned table map to the same output port,

making it unnecessary to sort the entries in the table.

When searching, each TCAM checks the IP address in

parallel. Each table outputs the matched lengths to a se-

lection logic. The selection logic chooses the longest

length, and the packet is sent to the output port based on

the table that had the longest prefix match. When inser-

tion is need, the output port matching the entry is c hec ked

by analyzing the entry. After figuring out the matching

output port, the entry is inserted into an open location in

the corresponding table. Note that the entry goes to “any”

open location in the corresponding table.

Figure 2 shows the modified design for the partitioned

table. Typical TCAM cells storing the network address

prefix are shown on the left, and memory cells storing

the lengths with the priority encoder removed are shown

on the right. When the logic selects the table containing

the LPM, the information of the output port is known,

and there no longer a need for SRAM memory to store

such information. Since the TCAM cells can be con-

nected to the corresponding SRAM memory cells be-

cause they are directly related, the priority encoder and

address decoders are also not needed. In addition, more

than one row can be asserted at once. This is possible

because there can only be one match per length. If a

lookup results in more than one match of the same length,

then this would mean that duplicate entries have been

stored. Therefore, the maximum number of matches is

32-bits, the length of an IP address. Lengths are stored in

a one-cold encoding. Since lengths are being stored in-

stead of output ports in the SRAM cells, the number of

SRAM cells needed is more than that in a typical router.

Since the TCAM cells are directly connected to the

SRAMS cells, the proposed design reduces the complex-

ity of the traditional design. For each entry, 32 TCAM

cells are used to store the prefix, and 32 SRAM cells are

used to store the prefix length. If more than one entry

match is found, then each asserts a corresponding output

line to reveal the prefix length. The selection logic

checks the length output to determine the table contain-

ing the longest prefix match. This can be used to choose

the longest match since there can only be a maximum of

Figure 1. Proposed architecture.

Figure 2. Modified routing table.

Y. AHN ET AL. 301

one matching entry for each prefix length between 1 and

32.

For the proposed design, th e SRAM cell word lines are

directly connected to the TCAM match lines. During

normal SRAM operation, only one word line should be

activated at once because if multiple word lines are acti-

vated, then the neighboring SRAM cells may become

corrupt due to the shared bit line. The TCAM search may

have multiple matches, and it can cause multiple SRAM

word lines to be activated. Additional access transistors

are added to each SRAM cell to activate the SRAM cell

only if a “0” is stored in the cell. Only one SRAM cell

will be activated for each column since each match can

correspond to only one leng th, and the length is stored in

a one-cold encoding. To enable access to the cell for

writing, two additional transistors are needed and four

transistors in total are added for each 6T SRAM cell.

This is shown in Figure 3. The standard 6T SRAM cell

is presented in the dotted line. The four extra transistors

will increase the area and latency of the SRAM cell.

The size of each partition of the table should still be

fairly large because the total table size is very large and

the number of output ports is usually quite low. Long

lines will significantly increase the latency [10]. This

requires each partition to be further divided into smaller

blocks, as is shown in Figure 3. The outputs from each

block are combined in an OR scheme. Since there can

only be one match for each length, there will be no con-

flicts.

Normally, the distribution of the prefixes will not be

balanced across p artitions. In some cases, more than 90%

of the prefixes forward to the same port [11]. Many con-

figurations are possible since this proposed design subdi-

vides the partitions further to reduce the bit line length.

Figure 3. SRAM cell design and par t itioned layout.

By combining the outputs of the blocks, i.e., sub-tables,

the OR scheme controls the boundaries of each partition.

One end of the boundary connects to the selection logic,

while the other end connects to a column pull-up. A basic

switching scheme allows the partition sizes to be pro-

grammable. The number of possible configurations can

be determined by the minimum grouping size. Due to

hardware complexity limitations, switching connections

cannot be available at all sub-table boundaries.

The selection logic takes inputs from the output lines

of the partitions of the tab le. The logic uses 32 inputs for

each output port, so if the router has four output ports,

then 128 inputs are connected. The selection logic first

determines the longest length that was found. Then, the

output port is determined based on the table that the LPM

is from. The logic finds the port with which the highest

length is associated. Then, the packet can be forwarded

to the correct port. This is a very simple logic and is

similar in complexity to a small p riority encoder.

4. Evaluation of the Design

Improving insertion time is only beneficial if the lookup

latency is not slow er than that of the original d esign. The

proposed design is compared to a typical co-processor

design. The typical design for this comparison contains a

standard TCAM that is capable of holding up to 256 K

entries. There are four and sixteen output ports in typical

core routers. By analyzing this, four output ports have

been used. This latency of the new and old designs could

be broken down into several parts and could also be

separately compared as shown in Table 1. As the table

shows, both designs have similar latencies. Although the

SRAM reading time is longer than that of the proposed

design, the selection logic is much simpler than the logic

of the priority encoder. The first portion of both designs

begins with the TCAM search. The design of the TCAM

cells remains unchanged so that the latency of the TCAM

search could remain identical. In the proposed de sign , th e

next step is to access the SRAM cells that correspond to

the matched prefixes from the TCAM search to deter-

mine the length of the prefix.

The overall size of the SRAM cell has increased due to

the addition of access transistors. This increases the

length of each bit line causing the capacitance to become

larger, and this increases the necessity time to pull down

Table 1. The latency.

TCAM

Search Priority

Encoder SRAM

Read Total

Original

Design ~6.5 ns ~3.6 ns 2.7 ns ~12.8 ns

TCAM

Search SRAM

Read Selection

Logic Total

Proposed

Design ~6.5 ns 4.1 ns ~0.98 ns ~11.6 ns

Y. AHN ET AL.

302

the line. Since the bit lines of one large table are too long,

the partition table is further broken into sub-tables. Each

table outputs the bits corresponding to the lengths that

were found in the sub-table. The results from each

sub-table and each length are combined using an OR

scheme. The optimal configuration was found by com-

paring the delay due to the size of each sub-table and the

delay of combining the results from each sub-table. Ca-

dence tools in a 0.25 um technology were used to simu-

late the access time of various SRAM configurations to

compare the standard of SRAM cells. The capacitances

of the bit lines were estimated based on the area increase

of the additional transistors. These results show that the

original SRAM performs approximately 33% faster than

the new SRAM in all configurations. A full custom lay-

out would provide more accurate results. Larger size is

preferred in order to reduce the number of outputs that

should be combined tog ether. Table 2 shows the simula-

tion result.

In old designs, the delay of the major logic is from the

priority encoder. This circuit becomes larger when it

handles the input from all TCAM search lines. The pro-

posed design directly connects the TCAM cells to the

SRAM cells so that the priority encoder could be re-

moved. The address decoder of the SRAM from the old

design can be removed too. The major logic delay ac-

cording to the proposed design is from the selection logic.

First, the logic chooses the longest length, and it then

determines the length of the partition table that was al-

ready found. The circuit is similar to a very small priority

encoder.

The logic was designed by using Mentor Graphics

ModelSim in Verilog. The design was synthesized using

a 0.50 um technology and simulated by using Mentor

Graphics LenardoSpecturm. The results show that the

selection logic is much easier than the priority encoder

and has a smaller delay. The results are shown in Table

3. To evaluate the scalability of the proposed design, the

differences between the two designs were compared. The

Table 2. SRAM simulations.

Size Original SRAM Proposed SRAM

128 1.4 ns 2.0 ns

256 2.0 ns 3.0 ns

512 2.7 ns 4.1 ns

Table 3. Logic simulations.

Priority Encoder Selection Logic

Gates 27,668 3015

Delay 14.4 ns 3.8 ns

TCAM cells have not been modified, so they will scale in

the same way. When the SRAM cells were modified,

they have shown about 33% difference in performance.

Although the proposed design continuously has a differ-

ence, it will also scale in the same way. The largest dif-

ference between the two designs is found in the differ-

ence in logic. The selection logic scales logarithmically

depend on three things. The length of the IP address, the

number of output ports on the router and the size of the

routing table. Since the length of the IP address is unex-

pected to increase after IPv6 and the number of output

ports to change, the most important factor is the number

of entries in the routing table. A priority encoder’s delay

will also scale logarithmically with respect to the number

of entries in the table [12]. Its delay is not affected by the

number of output ports or length of the IP address be-

cause there are no dependencies. The address decoder in

SRAM will have a logN scaling factor. Based on the

above analysis, both designs are expected to scale simi-

larly as the routing table size grows. The power con-

sumption in the partitioned routing table is shown as be-

low. When the length of prefix is 32-bits while the num-

ber of entry of routing table is n, the power consumption

of TCAMs is denoted as (1) [13]. P(n) consists of one

TCAMs and describes the power consumption while the

number of entry is n.









Pnn0.5log n10.5log n 

search : match17 :83



(1)

In IP lookup, the behavior of TCAMs can be divided

by IP searching and notification of matched Prefix. Ac-

cording to [14], the ratio of power consumption of two

functions is shown as (2). The IP search in TCAMs is

advanced in all partitio ned routing tab les. In addition, the

notification of match ed Prefix is progressed in one parti-

tioned routing tables.

(2)

The proposed scheme divides the routing table corre-

sponding to the output port. The number of entry of each

partitioned table is n/x wh en the number of outpu t port is

x. By using this property and (2), the total power con-

sumption of partitioned routing table is shown as (3).



total nn

P1 0.17x P1 0.83p

 

 

 

 

(3)

Table 4 describes power consumption and efficiency

corresponding to the number of entry, when the number

of output port is sixteen. When the routing table can be

divided into sixteen tables, efficiency is increased by

approximately 28% compared to the power consumption

of existing TCAMs. When the total number of entries is

increased, the efficiency of power consumption is de-

creased because the number of x entries of each parti-

tioned routing table is increased.

Y. AHN ET AL.

303

Table 4. Power consumption.

Power Consumption

# of

Entry

# of

Output

Port TCAMs Partitioned

TCAMs

Efficiency

128 K 16 3447.87 2432.89 29.44

256 K 16 5129.00 3660.05 28.64

512 K 16 7612.31 5488.88 27.89

1024 K 16 11274.00 8207.60 27.20

5. Conclusion

The purpose of IP router is to make a decision on a rout-

ing path to use and to forward a packet corresponding to

the decided route. An existing router, which stores a pre-

fix in the routing table, has a difficulty to meet QoS re-

quirement from rapidly expanding internet environmen t—

keep speeding up and adding new routes. The contribu-

tion of the proposed scheme is mainly to reduce routing

table updating time and also the power consumption at

the same time. The new design improves the routing ta-

ble updating time by storing new prefix in routing table

in unsorted manner: The worst case updating time in ex-

isting design O(N) reduces to O(1). In order to do this,

the routing table is partitioned per output port, while the

SRAM that stores a prefix length is directly connected to

each partition of the routing table. This allows partitioned

TCAMs to be employed in the design for shorter delay

and lower power consumption. The logic of priority en-

coder and another logic related to existing SRAM are

replaced to simple selection logic. This removes not only

the needs for ordering the table by prefix length but also

the lookup process for finding an output port.

6. Acknowledgements

This work was supported in part by the Basic Science

Research Program through the National Research Foun-

dation of Korea (NRF) funded by the Ministry of Educa-

tion, Science an d Technology (2012R1A1A2004615).

REFERENCES

[1] “Latest Version of AS65000-BGP Routing Table Statis-

tics Analysis Report.”

http://bgp.potaroo.net/as2.0/bgp-active.html

[2] V. Srinivasan, B. Nataraj and S. Khanna, “Methods for

Longest Prefix Matching In a Content Addressable Mem-

ory,” US Patent 6237061, 1999.

[3] R. Guo and J. G. Delgado-Frias, “IP Routing Table Com-

paction and Sampling Schemes to Enhance TCAM Cache

Performance,” Journal of Systems Architecture, Vol. 55,

No. 1, 2009, pp. 61-69. doi:10.1016/j.sysarc.2008.08.001

[4] K. Pagiamtzis and A. Sheikholeslami, “Content-Addressable

Memory (CAM) Circuits and Architectures: A Tutorial

and Survey,” IEEE Journal of Solid-State Circuits, Vol.

41, No. 3, 2006, pp. 712-727.

doi:10.1109/JSSC.2005.864128

[5] S. Kaxiras and G. Keramidas, “IPStash: A Power-Effi-

cient Memory Architecture for IP-Lookup,” 36th Interna-

tional Proceedings of Symposium on Microarchitecture,

San Diego, 3-5 December 2003, pp. 361-372.

[6] M. J. Akhbarizadeh and M. Nourani, “An IP Packet For-

warding Technique Based on Partitioned Lookup Table,”

IEEE International Conference on Communications, Vol.

4, 2002, pp. 2263-2267.

[7] D. Shah and P. Gupta, “Fast Updating Algorithms for

TCAMs,” IEEE Micro, Vol. 21, No. 1, 2001, pp. 36-47.

doi:10.1109/40.903060

[8] T. Kocak and F. Basci, “A Power-Efficient TCAM Ar-

chitecture for Network Forwarding Tables,” Journal of

Systems Architecture, Vol. 52, No. 5, 2006, pp. 307-314.

doi:10.1016/j.sysarc.2005.12.001

[9] V. C. Ravikumar, R. N. Mahapatra and L. N. Bhuyan,

“EaseCAM: An Energy and Storage Efficient TCAM-

Based Router Architecture for IP Lookup,” IEEE Trans-

actions of Computer, Vol. 54, No. 5, 2005, pp. 521-533.

doi:10.1109/TC.2005.78

[10] B. S. Amrutur and M. A. Horowitz, “Speed and Power

Scaling of SRAMs,” IEEE Transactions on Solid-State

Circuits, Vol. 35, No. 2, 2000, pp. 175-185.

doi:10.1109/4.823443

[11] The Internet Performance Measurement and Analysis Pro-

ject. http://ftp.chg.ru/pub/network/routing/ipma/Manual/

[12] C. H. Huang and J. S. Wang, “High-Performance and

Power-Efficient CMOS Comparators,” IEEE Journal of

Solid-State Circuits, Vol. 38, No. 2, 2003, pp. 254-262.

doi:10.1109/JSSC.2002.807409

[13] W. Lu and S. Sahni, “Low-Power TCAMs for Very Large

Forwarding Tables,” IEEE/ACM Transactions on Net-

working, Vol. 18, No. 3, 2010, pp. 948-959.

doi:10.1109/TNET.2009.2034143

[14] B. Agrawal and T. Sherwood, “Ternary CAM Power and

Delay Model: Extensions and Uses,” IEEE Transactions

on Very Large Scale Integration (VLSI) Systems, Vol. 16,

No. 5, 2008, pp. 554-564.