
This paper presents a comparative study of the performance of arithmetic units based on different number systems, namely the Residue Number System (RNS), Double Base Number System (DBNS), Triple Base Number System (TBNS) and Mixed Number System (MNS), for DSP applications. The performance analysis is carried out in terms of hardware utilization, timing complexity and efficiency. The arithmetic units based on these number systems were employed in designing a modulation scheme, namely a Binary Frequency Shift Keying (BFSK) modulator/demodulator. The analysis of the performance of the proposed modulator on the above mentioned number systems indicates the superiority of the other number systems over the binary number system.

With the advent of high speed DSP applications, where the basic requirements are high data rates and fast adders and multipliers, binary adders and multipliers are limited by their carry propagation chain. For real time applications and DSP related problems, fast arithmetic units, particularly adders and multipliers, are required for enhanced performance of the processors.

To eliminate this carry chain and obtain carry-free operation, the residue number system came into popularity. Residue Number System (RNS) [

In Double Base Number System (DBNS) [

Triple Base Number System (TBNS) [

Mixed Number System (MNS) [

In residue number system [

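The moduli are pairwise relatively prime, i.e. $\gcd(m_i, m_j) = 1$ for $i \neq j$.

Now if M is the dynamic range (where X < M), then

$$M = \prod_{i} m_i$$

and the residues are given as

$$x_i = X \bmod m_i$$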

In this paper, the forward converter [

where m is the modulus. This equation is implemented by storing all the possible values of each block term directly in LUTs whose locations are accessed by the value of B_{j}. The size of each LUT is given by (p × log_{2}m), where p is the number of bits in each block. As it is a parallel converter, it requires k LUTs for every modulus, followed by a multi-operand modulo adder which adds the output bits from the k LUTs. As a result of these parallel operations, this method of forward conversion is faster than other means of conversion.

In this case an 8-bit binary number is taken as input and partitioned into two blocks, i.e. k = 2, of 4 bits each (p = 4). The moduli used are 5, 7 and 8. Depending on the moduli set and the partitioned numbers, the residues repeat with a period which is less than m − 1 (m is the modulus). As a result, the LUT size is greatly reduced. A typical implementation of the following architecture is given in
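The parallel forward conversion described above can be sketched in software as follows. This is a minimal illustration, not the authors' hardware: it assumes that LUT j for modulus m stores the residue of the weighted block, |2^(jp) · B|_m, and the names `MODULI`, `build_luts` and `forward_convert` are hypothetical.

```python
# Sketch of LUT-based parallel forward conversion (binary -> RNS).
# Assumption: LUT j for modulus m holds |2^(j*P) * B|_m for each block value B.
MODULI = (5, 7, 8)   # moduli set used in the text
P = 4                # bits per block
K = 2                # number of blocks for an 8-bit input


def build_luts(m):
    """Build the K look-up tables for one modulus m."""
    return [[((1 << (j * P)) * b) % m for b in range(1 << P)] for j in range(K)]


LUTS = {m: build_luts(m) for m in MODULI}


def forward_convert(x):
    """Return the residues of an 8-bit x, one per modulus in MODULI."""
    blocks = [(x >> (j * P)) & ((1 << P) - 1) for j in range(K)]
    residues = []
    for m in MODULI:
        # Multi-operand modulo adder: add the K LUT outputs, reduce mod m.
        total = sum(LUTS[m][j][blocks[j]] for j in range(K))
        residues.append(total % m)
    return tuple(residues)


print(forward_convert(200))  # residues of 200 for moduli (5, 7, 8)
```

Each modulus uses its own set of k tables, so in hardware the three residue channels can be evaluated at the same time.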

In Residue Number System, addition [ x_{i} and y_{i} be the inputs to an RNS adder and z_{i} be their output result then it can be mathematically expressed as,

where

Several architectures have been proposed for implementing modulo addition on the basis of these equations. A simple modulo adder is given here (
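A simple modulo adder of this kind can be sketched as below, assuming the common structure in which the plain sum and the sum minus the modulus are both formed and one of them is selected. The function names are illustrative and the selection logic is an assumption, since the figure is not available here.

```python
def modulo_add(x, y, m):
    """Add two residues x and y (each in 0..m-1) modulo m."""
    s = x + y        # plain binary sum
    d = s - m        # the same sum with the modulus subtracted
    # Multiplexer: a non-negative d means the sum reached or passed m.
    return d if d >= 0 else s


def rns_add(xs, ys, moduli):
    """Add two RNS numbers channel by channel; no carry crosses channels."""
    return tuple(modulo_add(x, y, m) for x, y, m in zip(xs, ys, moduli))


print(rns_add((3, 4, 5), (4, 6, 7), (5, 7, 8)))
```

Because each channel only depends on its own residues, the carry chain is limited to the width of a single modulus.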

RNS multiplication [ x_{rns} and y_{rns} are applied to the inputs of the array of AND gates which generate the partial products. The sum of partial products is given mathematically as,

where M_{k} is the partial product which is given as

Now, after the bit-by-bit multiplication using AND gate arrays, the partial products are stored in the LUTs in the form of

The outputs of the LUTs are then applied to the array of carry save adders, and the final assimilation is done using a normal ripple carry adder. The two MSB outputs are then mapped onto a LUT. The output of the LUT is then added with the other bits to generate the final multiplication result. The LUT contains the mapping function given as,

where Ir is the 6 bit output of the Wallace tree multiplier unit. The architecture of the proposed multiplier is given in

Reverse conversion [

individually are applied as addresses to the LUTs, which generate,

is the range of the residue digits. The outputs are then added by a modulo M adder to get the converted binary output. The results lie within the range M (
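A LUT-based reverse conversion followed by a modulo M adder can be sketched as below. The sketch assumes the standard Chinese Remainder Theorem form, in which each LUT holds the weighted contribution of one residue digit; the exact LUT contents used by the authors are not recoverable from the text, so the table construction and all names here are assumptions.

```python
from math import prod

MODULI = (5, 7, 8)
M = prod(MODULI)  # dynamic range, 5 * 7 * 8 = 280


def build_reverse_luts(moduli):
    """One LUT per modulus, addressed by the residue digit (assumed CRT form)."""
    big_m = prod(moduli)
    luts = []
    for m in moduli:
        m_hat = big_m // m
        inv = pow(m_hat, -1, m)  # multiplicative inverse of m_hat modulo m
        luts.append([(r * inv * m_hat) % big_m for r in range(m)])
    return luts


REVERSE_LUTS = build_reverse_luts(MODULI)


def reverse_convert(residues):
    """Add the LUT outputs with a modulo M adder to recover the binary value."""
    return sum(lut[r] for lut, r in zip(REVERSE_LUTS, residues)) % M


print(reverse_convert((0, 4, 0)))
```

The result always lies in the range 0 to M − 1, which matches the statement that the results lie within the range M.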

Double Base Number System [

Therefore, the binary system is a special case of the above representation. From these expressions it is clear that when a binary number is converted into the double base number system (DBNS), it is represented as a number consisting of several (i, j) pairs. These (i, j) pairs are known as DBNS indices.

The greedy algorithm is an iterative approach for computing these indices. Since each iteration finds one index, the number of iterations equals the number of ones (1) in the DBNS table, which are often referred to as active cells. The value given in each box of the DBNS table is the weight of the corresponding active cell. The maximum decimal number that can be represented by a DBNS system with m × n cells is obtained by adding the weights of all the m × n cells. Hence, using a 4 × 4 DBNS table, the maximum decimal number that can be represented is 600, as in the following
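The greedy search and the 4 × 4 table limit can be illustrated with a short sketch. It assumes the usual greedy rule (take the largest table weight that does not exceed the remainder); the names `CELLS` and `greedy_dbns` are hypothetical.

```python
# 4 x 4 DBNS table: the weight of cell (i, j) is 2^i * 3^j, with i, j in 0..3.
CELLS = sorted(((2 ** i * 3 ** j, (i, j)) for i in range(4) for j in range(4)),
               reverse=True)


def greedy_dbns(x):
    """Return the (i, j) indices found by the greedy algorithm for x."""
    indices = []
    while x > 0:
        # One iteration finds one index: the largest weight not exceeding x.
        weight, ij = next(c for c in CELLS if c[0] <= x)
        indices.append(ij)
        x -= weight
    return indices


# With all 16 cells active, the represented value is the sum of all weights.
max_value = sum(weight for weight, _ in CELLS)

print(max_value)         # sum of all 16 cell weights
print(greedy_dbns(255))  # indices for the largest 8-bit input
```

For example, 255 is decomposed as 216 + 36 + 3, i.e. three active cells.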

The conversion of binary to DBNS [

To implement binary to DBNS conversion different approaches are used like Binary Search Tree (BST) [

Improved Range

To implement this architecture

a) First the data is passed through the 8:3 priority encoder, whose inputs are D7-D0 and outputs are Y2-Y0 and V (valid bit);

b) The three bit output is sent to control signal generation unit which checks the condition;

c) The control signal generation unit then sends the appropriate number to compare with the incoming data;

d) The hybrid approach has been used here for the functioning of control signal unit.

Architecture [

Serial No. | 8-bit data: D7 | D6 | D5 | D4 | D3 | D2 | D1 | D0 | Encoder output: Y2 | Y1 | Y0 | V |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | X | X | X | X | X | X | X | 0 | 0 | 0 | 1 |
2 | 0 | 1 | X | X | X | X | X | X | 0 | 0 | 1 | 1 |
3 | 0 | 0 | 1 | X | X | X | X | X | 0 | 1 | 0 | 1 |
4 | 0 | 0 | 0 | 1 | X | X | X | X | 0 | 1 | 1 | 1 |
5 | 0 | 0 | 0 | 0 | 1 | X | X | X | 1 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 | 0 | 1 | X | X | 1 | 0 | 1 | 1 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | X | 1 | 1 | 0 | 1 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | X | X | X | 0 |
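The truth table of the 8:3 priority encoder can be expressed as a small behavioural model. The function name is illustrative, and the don't-care outputs of the all-zero row are modelled here as `None`, which is a choice made for this sketch only.

```python
def priority_encoder_8to3(data):
    """Return (Y2, Y1, Y0, V) for an 8-bit input; D7 has the highest priority."""
    for row, bit in enumerate(range(7, -1, -1)):  # row 0 <-> D7, row 7 <-> D0
        if (data >> bit) & 1:
            # The 3-bit code is the table row index of the leading one.
            return (row >> 2) & 1, (row >> 1) & 1, row & 1, 1
    # All-zero input: the code bits are don't-care and the valid bit is 0.
    return None, None, None, 0


print(priority_encoder_8to3(0b00010110))  # leading one at D4
```

The valid bit V is what allows a zero input to be detected without extra circuitry.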

of 108 is the 1st pair of the (i, j) for X. If X < 72 in the 1st comparison, then the upper input of the multiplexer is enabled and X is compared with 54, i.e. the 1st pair of (i, j) will be the co-ordinate of 54 or 72.

So, at least one and at most two comparisons are required for extracting the indices, i.e. the (i, j) pairs. Then the subtraction is done and the result is sent to the next PE. If zero is encountered, it is easily detected by the valid bit of the priority encoder. No additional circuitry is required for this case.

It has been proved that, for a 4 × 4 DBNS table, a maximum of five (5) (i, j) pairs are needed to represent an 8-bit binary number. So, a maximum of five Processing Elements (PE) are employed, which are connected in cascade. The block diagram of such a configuration is shown in

When the first binary data enters stage one of the binary to DBNS converter, a maximum of 5 cycles will be required to extract the (i, j) pairs. When the partially converted data for the first input enters the second stage, the second input binary data enters the first stage of the pipeline, and so on. So, after 5 clock cycles, each binary data will effectively take 1 clock cycle to be represented as the corresponding DBNS based number.
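The cascade of processing elements can be modelled behaviourally as below. Each PE is assumed to extract one index by picking the largest table weight not exceeding its input, which abstracts away the priority encoder and comparator details; `processing_element` and `convert_with_cascade` are hypothetical names.

```python
# Weights of the 4 x 4 DBNS table, each paired with its (i, j) co-ordinate.
WEIGHTS = [(2 ** i * 3 ** j, (i, j)) for i in range(4) for j in range(4)]


def processing_element(x):
    """One PE: extract one (i, j) index and pass the remainder on."""
    if x == 0:
        return None, 0  # valid bit low: nothing left to extract
    weight, ij = max(c for c in WEIGHTS if c[0] <= x)
    return ij, x - weight


def convert_with_cascade(x, stages=5):
    """Pass x through a cascade of PEs; return the indices and the remainder."""
    indices = []
    for _ in range(stages):
        ij, x = processing_element(x)
        if ij is not None:
            indices.append(ij)
    return indices, x


# Every 8-bit input should leave a zero remainder after five stages.
worst = max(len(convert_with_cascade(x)[0]) for x in range(256))
print(worst, convert_with_cascade(161))
```

In this model 161 = 108 + 36 + 12 + 4 + 1 is one of the inputs that needs all five stages. In a pipelined implementation a new input can enter the first PE every clock cycle once the cascade is full.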

DBNS multiplication [

Multiplication of two DBNS numbers is expressed by the following equation

$$2^{a_1} 3^{b_1} \times 2^{a_2} 3^{b_2} = 2^{a_1 + a_2}\, 3^{b_1 + b_2}$$

To perform the above operation, two adders, one Look Up Table (LUT) and a barrel shifter are used. The value of 3^{b1+b2} is stored in the LUT. Then the data is passed through the barrel shifter with the help of the first addition result. How many bits will be shifted is decided by the first addition result.

The architecture to perform the task of the above expression is shown in
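A single-term DBNS multiplication can be sketched as two exponent additions, one LUT read and one shift. The LUT below has 8 entries so that it can be addressed by a 3-bit exponent sum; this sizing and the names `POW3_LUT` and `dbns_term_multiply` are assumptions made for illustration.

```python
# LUT holding 3^s for every 3-bit address s (exponent sums up to 3 + 3 = 6 fit).
POW3_LUT = [3 ** s for s in range(8)]


def dbns_term_multiply(a1, b1, a2, b2):
    """Multiply 2^a1 * 3^b1 by 2^a2 * 3^b2 without a binary multiplier."""
    a_sum = a1 + a2                   # first adder: exponent of 2
    b_sum = b1 + b2                   # second adder: exponent of 3
    return POW3_LUT[b_sum] << a_sum   # LUT read, then barrel shift by a_sum


print(dbns_term_multiply(3, 3, 2, 2))  # 216 * 36
```

The shift amount is the value of the first addition result, so the multiplication reduces to addition, table look-up and shifting.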

Let us take two binary numbers K and L. After converting them into DBNS they can be expressed as,

Now, when they are multiplied the result can be written as,

From the expression it is clear that each term in the DBNS expression of one number has to be multiplied with each term of the 2nd number. After that, the multiplication results are added together to get the final result. To implement this, two stages are required: 1) multiplication stage; 2) addition stage. In the multiplication stage 25 multipliers are required, which produce 25 multiplication results. They are stored in the LUT locations. The block diagram of this multiplication stage is shown in

Because 25 multipliers are required in the previous stage, 25 outputs are generated. These 25 outputs are added using 24 adders in 5 stages as shown in
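The adder count and stage count can be checked with a small model of a pairwise adder tree. This is a sketch under the assumption that the 25 products are summed two at a time, level by level; the function names and the term format (exponent of 2, exponent of 3) are illustrative.

```python
def adder_tree(values):
    """Sum values pairwise, level by level; return (total, adders, stages)."""
    adders = stages = 0
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:      # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return (level[0] if level else 0), adders, stages


def dbns_multiply(k_terms, l_terms):
    """Multiply two DBNS numbers given as lists of (a, b) exponent pairs."""
    products = [(3 ** (b1 + b2)) << (a1 + a2)
                for a1, b1 in k_terms for a2, b2 in l_terms]
    return adder_tree(products)


# 161 = 108 + 36 + 12 + 4 + 1 uses five terms, so 25 partial products arise.
terms_161 = [(2, 3), (2, 2), (2, 1), (2, 0), (0, 0)]
print(dbns_multiply(terms_161, terms_161))
```

With 25 inputs the levels shrink as 25, 13, 7, 4, 2, 1, which gives 12 + 6 + 3 + 2 + 1 = 24 adders in 5 stages.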

The hardware requirements for implementing DBNS multiplier are summarized in

Sl No. | Subject | Values |
---|---|---|
1 | LUT size | 8 × 12 bits |
2 | Number of multiplication units | 25 |
3 | Adder size | 15 bits |
4 | Number of adders | 24 |
5 | Number of stages required | 5 |

The Triple Base Number System (TBNS) [

It is clear that when j = k = 0, the number system reduces to the binary number system. The TBNS can be considered as a three dimensional geometric representation which is suitable for implementing FIR, DFT, linear convolution, etc. of a signal with less hardware and design complexity. TBNS can perform arithmetic using a logarithmic-like computational unit. The structure of a TBNS table where i, j and k vary from 0 to 3 is like a (4 × 4 × 4) cube whose largest cell weight is 2^{3}3^{3}5^{3} = 27,000. The TBNS table where i, j and k vary from 0 to 3 is as follows (

Conversion of a number from its binary to TBNS [

Here IRTS [

The architecture of the CPE is described by the following algorithm (

a) First the data is passed through an 8:3 priority encoder, whose inputs are D7-D0 and outputs are Y2, Y1, Y0 and V (valid bit);

b) The Control Signal Generation Unit (CSGU) checks the priority encoder output;

c) The co-ordinate from the Comparator Cum Subtractor (CCS) will be the first (i, j, k) index;

d) The CSGU sends a value to the input of the CCS and the comparison is done. The same method mentioned in the previous step is repeated;

e) If the condition is not satisfied, then the second set of values is sent from the CSGU;

f) Here at least one and at most six comparisons are required to extract an (i, j, k) index;

g) Then the subtraction is done and results are sent to the next CPE (

A maximum of three conversion processing elements (CPE) are required, which are connected in cascade, i.e. in pipelined fashion. The block diagram is shown in
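The claim that three cascaded CPEs suffice for an 8-bit input can be checked with a greedy model of the conversion. The sketch assumes each CPE selects the largest weight of the 4 × 4 × 4 table that does not exceed its input, ignoring the comparator-level details; `TBNS_CELLS` and `greedy_tbns` are hypothetical names.

```python
from itertools import product

# 4 x 4 x 4 TBNS table: the weight of cell (i, j, k) is 2^i * 3^j * 5^k.
TBNS_CELLS = sorted(((2 ** i * 3 ** j * 5 ** k, (i, j, k))
                     for i, j, k in product(range(4), repeat=3)), reverse=True)


def greedy_tbns(x):
    """Return the (i, j, k) indices found greedily, one per CPE."""
    indices = []
    while x > 0:
        weight, ijk = next(c for c in TBNS_CELLS if c[0] <= x)
        indices.append(ijk)
        x -= weight
    return indices


# Largest number of indices needed by any 8-bit input in this model.
worst_case = max(len(greedy_tbns(x)) for x in range(256))
print(worst_case, greedy_tbns(67))
```

For instance, 67 = 60 + 6 + 1 needs three indices, while the DBNS conversion of the same range can need up to five.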

(i, j) co-ordinates | N = 2^{i}3^{j} | N × 5^{k}, k = 0 | N × 5^{k}, k = 1 | N × 5^{k}, k = 2 | N × 5^{k}, k = 3 |
---|---|---|---|---|---|
(0, 0) | 1 | 1 | 5 | 25 | 125 |
(1, 0) | 2 | 2 | 10 | 50 | 250 |
(0, 1) | 3 | 3 | 15 | 75 | 375 |
(2, 0) | 4 | 4 | 20 | 100 | 500 |
(1, 1) | 6 | 6 | 30 | 150 | 750 |
(3, 0) | 8 | 8 | 40 | 200 | 1000 |
(0, 2) | 9 | 9 | 45 | 225 | 1125 |
(2, 1) | 12 | 12 | 60 | 300 | 1500 |
(1, 2) | 18 | 18 | 90 | 450 | 2250 |
(3, 1) | 24 | 24 | 120 | 600 | 3000 |
(0, 3) | 27 | 27 | 135 | 675 | 3375 |
(2, 2) | 36 | 36 | 180 | 900 | 4500 |
(1, 3) | 54 | 54 | 270 | 1350 | 6750 |
(3, 2) | 72 | 72 | 360 | 1800 | 9000 |
(2, 3) | 108 | 108 | 540 | 2700 | 13,500 |
(3, 3) | 216 | 216 | 1080 | 5400 | 27,000 |

Sr. No. | 8-bit data: D7 | D6 | D5 | D4 | D3 | D2 | D1 | D0 | N | Candidate weights for (i, j, k) |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | X | X | X | X | X | X | X | 128 ≤ N | 125 or 135 or 150 or 180 or 200 or 216 or 225 or 250 |
2 | 0 | 1 | X | X | X | X | X | X | 64 ≤ N < 128 | 60 or 72 or 75 or 90 or 100 or 108 or 120 or 125 |
3 | 0 | 0 | 1 | X | X | X | X | X | 32 ≤ N < 64 | 30 or 36 or 40 or 45 or 50 or 54 or 60 |
4 | 0 | 0 | 0 | 1 | X | X | X | X | 16 ≤ N < 32 | 15 or 18 or 20 or 24 or 25 or 27 or 30 |
5 | 0 | 0 | 0 | 0 | 1 | X | X | X | 8 ≤ N < 16 | 8 or 9 or 10 or 12 or 15 |
6 | 0 | 0 | 0 | 0 | 0 | 1 | X | X | 4 ≤ N < 8 | 4 or 5 or 6 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | X | 2 ≤ N < 4 | 2 or 3 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | N = 1 | 1 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | N = 0 | 0 |

Control signal | Value | Operations performed |
---|---|---|
Flag | 0 | 1st set of values is generated (for CSGU) |
Flag | 1 | 2nd set of values is generated (for CSGU) |
Flag | 2 | 3rd set of values is generated (for CSGU) |
Flag | 3 | XXX |
Flag0 | 0 | Subtraction is not performed (for CCS) |
Flag0 | 1 | Subtraction is performed (for CCS) |
CS0' | 0 | CSGU enabled |
CS2' | 0 | Comparator enabled |

After converting a given binary number to TBNS, we have to perform TBNS addition & multiplication [

As far as TBNS multiplication is concerned, the (i, j, k) indices are added as exponents of 2, 3 and 5. So, naturally, the complexity of multiplication is reduced. This gives a great advantage to TBNS multiplication as compared to binary multiplication.

The expression for TBNS single-bit multiplication is shown in the following equation

$$2^{i} 3^{j} 5^{k} \times 2^{m} 3^{n} 5^{p} = 2^{i+m}\, 3^{j+n}\, 5^{k+p}$$

At first, the addition i + m, j + n and k + p will be performed using the binary adders as shown in
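The exponent additions can be illustrated with a single-term TBNS multiplication sketch. It assumes, by analogy with the DBNS unit, that the product of the odd bases is read from a LUT addressed by the two exponent sums and then shifted by the exponent of 2; the LUT organisation and the names `LUT_3_5` and `tbns_term_multiply` are assumptions, not details confirmed by the text.

```python
# LUT addressed by the exponent sums (j + n, k + p), each assumed to be in 0..6.
LUT_3_5 = [[3 ** s * 5 ** t for t in range(7)] for s in range(7)]


def tbns_term_multiply(i, j, k, m, n, p):
    """Multiply 2^i 3^j 5^k by 2^m 3^n 5^p using only adders, a LUT and a shift."""
    two_exp = i + m      # binary adder for the exponent of 2
    three_exp = j + n    # binary adder for the exponent of 3
    five_exp = k + p     # binary adder for the exponent of 5
    return LUT_3_5[three_exp][five_exp] << two_exp


print(tbns_term_multiply(2, 1, 1, 1, 1, 0))  # 60 * 6
```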

The hardware requirements for implementing TBNS multiplier are summarized in

Figure 13. TBNS multiplication unit.

Sr. No. | Subject | Value |
---|---|---|
1 | LUT size | 64 × 29 bits |
2 | No. of multiplication units | 9 |
3 | Adder size | 32 |
4 | No. of adders | 8 |
5 | No. of stages required | 4 |

Considering the advantageous parallel addition concept of the Residue Number System and the faster multiplication of the Double-Base Number System, a new concept of Mixed Number System (MNS) [

For implementing MNS five units are required. These are

A) DBNS conversion unit;

B) DBNS multiplier unit;

C) Binary to RNS conversion unit;

D) RNS adder unit;

E) RNS to binary conversion unit.

The architecture for implementing MNS is shown in

Here a particular modulation scheme has been chosen, to which the different number systems have been applied, and their performances have been compared. The architecture of the BFSK (Binary Frequency Shift Keying) modulator is depicted in

The demodulator corresponding to the above mentioned modulator is given in

For analyzing the performance of the different advanced number systems on this architecture, some binary arithmetic units have been replaced with RNS, DBNS and TBNS arithmetic units. The combinations of changes are listed below.

A) Binary adder and multiplier are replaced by RNS adder and multiplier;

B) Binary multiplier is replaced by DBNS multiplier;

C) Binary multiplier is replaced by TBNS multiplier;

D) Binary adder and multiplier are replaced by RNS adder and DBNS multiplier respectively (MNS).

The modified architectures have been validated on a Xilinx Virtex IV FPGA using Xilinx ISE version 9.1i, and their performances have been compared.

All the designs of the modulator and demodulator with arithmetic units using different number systems have been simulated and synthesized using Xilinx ISE version 9.1i and validated on a Xilinx Virtex IV FPGA.

Device utilization for modulator with | Binary adder and multiplier | RNS adder and multiplier | DBNS multiplier | TBNS multiplier | RNS adder and DBNS multiplier (MNS) |
---|---|---|---|---|---|
No. of Slices | 48 | 254 | 3128 | 1797 | 3361 |
No. of Slice Flip Flops | 37 | 336 | 3405 | 896 | 3836 |
No. of 4 input LUTs | 89 | 454 | 5590 | 3000 | 6122 |
No. of IOs | 29 | 18 | 34 | 49 | 29 |
No. of bonded IOBs | 29 | 18 | 34 | 49 | 29 |
No. of GCLKs | 1 | 1 | 1 | 1 | 1 |
No. of DSP48s | 2 | × | × | × | × |
No. used as logic | × | 451 | 5589 | × | 6120 |
No. used as shift registers | × | 2 | 1 | × | 2 |

Device utilization for demodulator with | Binary adder and multiplier | RNS adder and multiplier | DBNS multiplier | TBNS multiplier | RNS adder and DBNS multiplier (MNS) |
---|---|---|---|---|---|
No. of Slices | 50 | 289 | 3536 | 1832 | 3549 |
No. of Slice Flip Flops | 39 | 368 | 3859 | 1214 | 3989 |
No. of 4 input LUTs | 89 | 531 | 5590 | 3178 | 6428 |
No. of IOs | 26 | 18 | 36 | 52 | 32 |
No. of bonded IOBs | 26 | 18 | 36 | 52 | 32 |
No. of GCLKs | 1 | 1 | 1 | 1 | 1 |
No. of DSP48s | 2 | × | × | × | × |
No. used as logic | × | 524 | 5712 | × | 6120 |
No. used as shift registers | × | 7 | 1 | × | 2 |

Timing summary for modulator with | Binary adder and multiplier | RNS adder and multiplier | DBNS multiplier | TBNS multiplier | RNS adder and DBNS multiplier (MNS) |
---|---|---|---|---|---|
Minimum period (ns) | 1.395 | 3.728 | 4.040 | 4.877 | 4.040 |
Minimum input arrival time before clock (ns) | 2.410 | 3.259 | 4.410 | 5.168 | 4.410 |
Maximum output required time after clock (ns) | 14.39 | 5.722 | 8.042 | 7.125 | 6.567 |

Using the data from the synthesis reports, the performance of the modulator and demodulator with arithmetic units of different number systems has been compared, and the results are given by the following graphs (Figures 17-20).

Timing summary for demodulator with | Binary adder and multiplier | RNS adder and multiplier | DBNS multiplier | TBNS multiplier | RNS adder and DBNS multiplier (MNS) |
---|---|---|---|---|---|
Minimum period (ns) | 1.333 | 5.545 | 8.212 | 8.911 | 4.103 |
Minimum input arrival time before clock (ns) | 2.215 | 2.909 | 7.848 | 8.137 | 4.170 |
Maximum output required time after clock (ns) | 15.03 | 4.677 | 5.846 | 4.724 | 6.927 |

From the graphs it is clear that:

a) Application of any of these number systems provides better performance compared to the binary number system;

b) Replacement of the binary multiplier by the DBNS multiplier improves the speed of the architecture, but the hardware complexity increases substantially, especially due to the conversion stage;

c) Although the TBNS multiplier improves the performance of the modulator and demodulator further compared to the DBNS multiplier, it suffers from additional hardware complexity for conversion;

d) Application of the RNS adder and multiplier gives the best result among them, also providing lower hardware complexity due to the LUT based forward and reverse conversion stages;

e) MNS should provide better results compared to both DBNS and RNS. According to the results it performs better than DBNS, but its maximum output required time after clock is more than that of RNS because of the three conversion stages required for its implementation, which is the maximum among all of them. If the DBNS multiplier stage can be modified to produce results directly in RNS, then the application of MNS can be faster than the RNS application.

In this paper, high performance arithmetic units and architectures for applying them in DSP applications have first been analyzed and validated on a Xilinx Virtex IV FPGA using Xilinx ISE 9.1i. To study their performances, arithmetic units of different number systems, namely RNS, DBNS, TBNS and MNS, have been applied to a BFSK modulator as well as demodulator and validated similarly. By analyzing their performances it can be concluded that application of any of these number systems provides better performance compared with the binary number system. Comparing the performances of all those number systems, it is seen that the RNS provides the highest speed and that TBNS utilizes the least amount of hardware in case of this architecture. MNS also provides higher speed but suffers from greater hardware complexity. If the DBNS conversion stages can be implemented fully based on LUTs, then the speed of DBNS and MNS will improve further. Future work can proceed if MNS can be implemented using a TBNS multiplier and an RNS adder. VLSI implementation can also be done.