
We propose the design of a reconfigurable multi-butterfly parallel radix-r FFT (Fast Fourier Transform) processor. The FFT is widely used in signal processing, where applications demand real-time, high-performance computation; however, most traditional designs are limited to power-of-two sizes, which wastes buffers and multipliers on large data sets. To address this problem, we improve the parallel FFT algorithm by designing a reconfigurable control machine combined with the buffer and multiplier resources, and derive a cost function whose inputs are the radix, the number of points, the number of padded zeros, and the computation time. With the numbers of buffers and multipliers as constraints, the optimal radix and point count are computed from the cost function, and the solution space of computing performance versus hardware cost is presented. The proposed design guarantees real-time performance with better flexibility than previous work, and the comparison also confirms the effectiveness of the design.

The Fourier Transform is the basic algorithm for time-frequency domain processing and a necessary tool for digital spectrum analysis. The Fast Fourier Transform (FFT), as a fundamental algorithm in signal processing, is widely used in research on electromagnetic characteristics, satellite navigation, communications, and radar signal processing [

S. A. Salehi proposed a pipeline architecture for the FFT that is limited to power-of-two sizes [^{2}]. Traditional FFT designs are basically restricted to radix-2/4 and the corresponding architectures, so buffers and multipliers are easily wasted in big-data processing; moreover, the previous literature often optimizes the computation on fixed parallel architectures, which makes it difficult to balance performance against resources.

In previous works, [

The reconfigurable multi-butterfly parallel radix-r FFT processor is proposed on the basis of the improved parallel strategy. The parallel algorithm is improved by designing a configurable controller combined with hardware resources such as the cache and multipliers. A design cost function of the FFT is given, with the radix, the number of points, the number of padded zeros, and the computation time as inputs. In the actual design process, taking the cache and multiplier resources as constraints, the optimal FFT architecture is obtained by computing the radix and the number of points under the optimal cost function. The solution space of the design's computing performance and resource occupation is presented. Because a processor architecture is used, the algorithm can be adjusted to the resources available in the actual hardware design, which ensures both the flexibility of the design and the real-time requirements of parallel computing.

The architecture of the parallel FFT processor is given in Section 2, including the internal sub-module designs and their interconnections; Section 3 analyzes performance on the parallel architecture and derives the improved parallel algorithm for the designed processor; Section 4 compares the design with the previous literature; Section 5 concludes.

The number of points is configured through the external input interface, and the optimal radix and the parallel degree of the multipliers are derived from it. These parameters then determine the FFT architecture.

The whole architecture in

The processor receives external input data and outputs the results after computation. The workflow is as follows: first, the internal cache receives the external input data, and the register set saves the external input and output control information; then the input data are processed; finally, the results are written to the output cache, the register set is refreshed, and the processor waits for new external data.

According to the workflow of the processor, the internal hardware structure of the processor in

The processor mainly includes 8 sub-modules:

1) State Processing Unit (SPU);

2) Inline Bus (ILB);

3) Integer Processing Unit (IPU);

4) Floating-Point Processing Unit Group (FPUG);

5) Register Set (RS);

6) Input Memory (IM);

7) Middle Memory (MM);

8) Output Memory (OM).

Specifically, the SPU handles all the state control generated by the algorithm; the ILB connects the three-tier cache with the floating-point processing units and exchanges data with the SPU; the IPU supports the integer computing required by the SPU; the FPUG performs the floating-point computation; the RS receives the input control information and handles internal state interaction; the IM stores the input data, the MM stores the intermediate data, and the OM stores the results.

The state processing unit (SPU) is used to control and command all working components. Its function is to extract the state code from the state memory, send it to the state code register, and then enter the state decoder for decoding. According to the state code information, all the internal information needed for various operations is updated, so that all parts can coordinate their work and complete the various operations specified by the state code.

The SPU includes timing control logic, state code memory, state code register, state code decoder, state counter SC (State Count), state address register, state code pointer register SPTR and stack pointer register SP. Its internal structure is shown below.

1) Timing Control Logic

When the FPGA is started, the SPU is controlled to take out the state code and increase the state count.

2) State Code Register

Store the status code currently being executed.

3) State Code Decoder

When the state code is fed into the state decoder, the state code is decoded by the decoder, that is, the state code is converted into various specific operations, so that the state machine can correctly perform various functions required.

4) State Address Register

This register stores the address of the next status code to be executed. When a status code is fetched from the status code memory at the address pointed to by SC, SC automatically advances by 1, 2, or 3 to the next status code. On reset, (SC) = 0, so system status code addressing must start from unit 0.

There are two ways to form a status code address: one is sequential execution, advancing SC by 1, 2, or 3; the other changes the execution sequence, in which case a transfer-class status code forms the transfer address, which is sent to the status code address register as the next status code address.

5) State Code Pointer Register

SPTR is used to address external data to form an external data address pointer.

6) Stack Pointer

SP stores the address of the top of the stack. The stack accesses data on the last-in, first-out principle.

The inline bus ILB in

1) Data transmission from input cache IM to intermediate cache MM;

2) Data transmission from intermediate cache MM to intermediate cache MM;

3) Data transmission from intermediate cache MM to output cache OM;

4) Data transmission from intermediate buffer MM to floating point processing unit group FPUG;

5) Data transmission from FPUG to intermediate cache MM.

The main function of IPU module in

Floating point processing unit group FPUG in

The number of units in each module varies with the task requirements of the parallel FFT computation and the amount of remaining resources on the current FPGA. Once the design requirements are fixed, the number of the four modules can no longer change.

Internal cache has three parts: input cache IM, intermediate cache MM and output cache OM. All three caches are dual-port RAM.

Suppose the number of points is N, where N = r^M and r, M are positive integers. Then the radix-r FFT can be represented as [

$$
\begin{bmatrix}
X_m(j_1) \\ X_m(j_2) \\ \vdots \\ X_m(j_r)
\end{bmatrix}
=
\begin{bmatrix}
W_N^{r^{M-1}(0 \cdot 0)} & W_N^{r^{M-1}(0 \cdot 1)} & \cdots & W_N^{r^{M-1}[0 \cdot (r-1)]} \\
W_N^{r^{M-1}(1 \cdot 0)} & W_N^{r^{M-1}(1 \cdot 1)} & \cdots & W_N^{r^{M-1}[1 \cdot (r-1)]} \\
\vdots & \vdots & \ddots & \vdots \\
W_N^{r^{M-1}[(r-1) \cdot 0]} & W_N^{r^{M-1}[(r-1) \cdot 1]} & \cdots & W_N^{r^{M-1}[(r-1)(r-1)]}
\end{bmatrix}
\begin{bmatrix}
X_{m-1}(j_1) \\ X_{m-1}(j_2)\, W_N^{p} \\ \vdots \\ X_{m-1}(j_r)\, W_N^{(r-1)p}
\end{bmatrix}
\tag{1}
$$

$$ j_1 + k_{m-1} = l \cdot r^{M-m+1} + k_{m-1} \cdot r^{M-m} + q \tag{2} $$

$$ l = \sum_{i=0}^{m-2} k_i \cdot r^{m-2-i} \tag{3} $$

$$ q = \sum_{i=0}^{M-m-1} n_i \cdot r^{i} \tag{4} $$
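As a sketch of how Equations (2)-(4) combine the digits into an operand address, the following Python fragment evaluates the three formulas directly. The function name and the digit-list ordering are our illustrative assumptions, not notation from the paper.

```python
def butterfly_address(r, M, m, k, n):
    """Operand address j1 + k_{m-1} for stage m, per Equations (2)-(4).
    k = [k_0, ..., k_{m-1}] are the already-processed digits and
    n = [n_0, ..., n_{M-m-1}] the remaining digits (assumed ordering)."""
    l = sum(k[i] * r ** (m - 2 - i) for i in range(m - 1))     # Equation (3)
    q = sum(n[i] * r ** i for i in range(M - m))               # Equation (4)
    return l * r ** (M - m + 1) + k[m - 1] * r ** (M - m) + q  # Equation (2)
```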

For the N_{0}-point parallel FFT (N_{0} a positive integer), the following pre-work should be done:

1) Choose the current optimal radix r and FFT point count N;

2) Design the parallel computing flow according to the current numbers of hardware adders and multipliers.

This section describes the parameter calculation method and implementation process.

1) Number of Zero Complementation Points

For the N_{0}-point FFT, the actual number of FFT points N is calculated from the radix r as

$$ N = r^{\lceil \log_r N_0 \rceil} \tag{5} $$

where ⌈·⌉ is the ceiling operation.

The number of zeros to be padded is

$$ \Delta(r) = r^{\lceil \log_r N_0 \rceil} - N_0 \tag{6} $$

where, r ∈ [ 2 , ⌊ r ⌈ log r N 0 ⌉ ⌋ ] .
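Equations (5) and (6) can be checked with a few lines of Python. This is a sketch with our own function names; exact integer powers are used to avoid the floating-point rounding that a direct `log` would introduce at exact powers of r.

```python
def fft_size(N0: int, r: int) -> int:
    """Actual FFT length N = r^ceil(log_r N0), per Equation (5)."""
    N = 1
    while N < N0:
        N *= r  # smallest power of r that is >= N0
    return N

def zero_padding(N0: int, r: int) -> int:
    """Number of padded zeros, Delta(r) = N - N0, per Equation (6)."""
    return fft_size(N0, r) - N0
```

For N_{0} = 171 this reproduces the padding column of the table in this section, e.g. `zero_padding(171, 2)` gives 85 and `zero_padding(171, 3)` gives 72.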

2) Time Consuming

For the radix-r N-point FFT, the time consumption is:

$$ T = \beta\, r^{2.5}\, r^{\lceil \log_r N_0 \rceil - 1} \lceil \log_r N_0 \rceil = \beta\, r^{1.5 + \lceil \log_r N_0 \rceil} \lceil \log_r N_0 \rceil \tag{7} $$

where β ≥ 1 represents the time coefficient caused by branch jumps, data access, and computing-module delays in the current computing system. When the hardware structure is fixed, β is fixed. In this paper, β = 15.7.

3) Optimal Parameter Selection Based on Cost Function

In the FFT computation, the choice of parameters must take into account not only the computation speed but also the current occupancy of storage resources. Therefore, the computation time and the number of padded zeros should be planned jointly.

Let γ (γ > 0) be the ratio of computing-time cost to buffer cost. Then the total cost function is

$$ \delta = \Delta + \gamma T \tag{8} $$

The parameter r is determined by finding the minimum of δ(r), i.e. $r^{*} = \arg\min_{r} \delta(r)$.
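The whole parameter search of Equations (6)-(8) fits in a short Python sketch, with β = 15.7 and γ = 10⁻³ as in the worked example of this section; the helper names are ours.

```python
def padding(N0, r):
    # Delta(r) = r^ceil(log_r N0) - N0  (Equation (6)), via exact integer powers
    N = 1
    while N < N0:
        N *= r
    return N - N0

def cycles(N0, r, beta=15.7):
    # T = beta * r^(1.5 + M) * M with M = ceil(log_r N0)  (Equation (7))
    M, N = 0, 1
    while N < N0:
        N *= r
        M += 1
    return beta * r ** (1.5 + M) * M

def cost(N0, r, gamma=1e-3):
    # delta = Delta + gamma * T  (Equation (8))
    return padding(N0, r) + gamma * cycles(N0, r)

# Radix that minimizes the cost function for the 171-point example
best_r = min(range(2, 14), key=lambda r: cost(171, r))
```

For N_{0} = 171 the minimum cost is reached at r = 3, matching the table in this section.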

Examples are as follows:

For the N_{0} = 171 point FFT, the numbers of padded zeros for different radices are as in

The computing cycles for different radix are as in

Suppose γ = 10 − 3 , then the cost function δ ( r ) is as in

From

Radix | Zeros Padded | Computing Cycles | Cost Function
---|---|---|---
2 | 85 | 0.0909e6 | 0.1759e3
3 | 72 | 0.0991e6 | 0.1711e3
4 | 85 | 0.1286e6 | 0.2136e3
5 | 454 | 0.4388e6 | 0.8928e3
6 | 45 | 0.1495e6 | 0.1945e3
7 | 172 | 0.2992e6 | 0.4712e3
8 | 341 | 0.5457e6 | 0.8867e3
9 | 558 | 0.9271e6 | 1.4851e3
10 | 829 | 1.4894e6 | 2.3184e3
11 | 1160 | 2.2871e6 | 3.4471e3
12 | 1557 | 3.3833e6 | 4.9403e3
13 | 2026 | 4.8503e6 | 6.8763e3

4) The Implementation of Parameters Computing

Generally, logarithms are computed by lookup tables or the CORDIC algorithm. Because the parameters are computed during initialization, the real-time requirement on parameter calculation is not high, whereas lookup tables and CORDIC would occupy large amounts of memory or logic resources. In this design, a simple integer logarithm method is obtained by locally optimizing the Taylor-series expansion combined with a small lookup table (256 entries).

The optimal radix can be deduced according to (4)-(8), in which transcendental functions appear frequently.

The logarithm of traditional Taylor series expansion is as follows

$$ \ln(1+x) = \sum_{i=1}^{\infty} (-1)^{i-1} \frac{x^i}{i} \tag{9} $$

At any finite order, the error grows as x approaches 1, so a high order is required to guarantee accuracy over the whole range.

The derivation is as follows,

$$ \ln(1+x) = \ln\frac{2(1+x)}{2} = \ln 2 + \ln\big(1 + 0.5(x-1)\big) \tag{10} $$

The following improvements are made to Taylor series expansion

$$ \ln(1+x) = \begin{cases} \ln(1+x), & x \in (0, 0.5] \\ \ln 2 + \ln\big(1 + 0.5(x-1)\big), & x \in (0.5, 1] \end{cases} \tag{11} $$

The error of formula (11) is less than 10^{−4}, as shown in the following

In parameter calculation, for practical use the point count N_{0} is less than 65,536 and can be expressed as a 16-bit integer. The radix r then does not exceed 256 and can be expressed as an 8-bit integer.

The calculation of Formulas (4)-(8) mainly reduces to the logarithm of N_{0} to base r, which is derived as follows.

$$ \ln N_0 = \ln(256 N_{0H} + N_{0L}) = \ln 256 + \ln N_{0H} + \ln\left(1 + \frac{N_{0L}}{256 N_{0H}}\right) \tag{12} $$

And,

$$ \log_r N_0 = \frac{\ln N_0}{\ln r} \tag{13} $$

where $0 < \frac{N_{0L}}{256 N_{0H}} < 1$ and $N_{0H} < 256$.

For N_{0}, the logarithm to base r is thus computed in three steps: 1/ln r and ln N_{0H} are obtained from 8-bit, 256-entry lookup tables; ln(1 + N_{0L}/(256 N_{0H})) is computed according to (11); the results are summed with the constant ln 256 and then divided by ln r, giving log_r N_0.
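The three steps above can be sketched in Python. Here `math.log` stands in for the 256-entry hardware lookup tables, and a 16-term truncation is assumed for the series of Equation (11); the function names are ours.

```python
import math

def ln1p_series(x, terms=16):
    # Truncated Taylor series of Equation (9)
    return sum((-1) ** (i - 1) * x ** i / i for i in range(1, terms + 1))

def ln1p_improved(x, terms=16):
    # Equation (11): keep the series argument small so few terms suffice
    if x <= 0.5:
        return ln1p_series(x, terms)
    return math.log(2) + ln1p_series(0.5 * (x - 1), terms)

def log_r(N0, r):
    # Equations (12)-(13): split N0 into a high byte N0H and a low byte N0L;
    # ln(N0H) and 1/ln(r) would come from lookup tables in hardware
    N0H, N0L = N0 >> 8, N0 & 0xFF
    if N0H == 0:
        ln_N0 = math.log(N0)  # small N0: direct table lookup
    else:
        ln_N0 = math.log(256) + math.log(N0H) + ln1p_improved(N0L / (256 * N0H))
    return ln_N0 / math.log(r)
```

With 16 terms the error of `ln1p_improved` over (0, 1] stays below the 10⁻⁴ bound claimed for Equation (11).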

In any radix-r FFT, the butterfly computation matrix W is computed first; it depends only on the number of stages M and the radix r. Therefore, after the parameters are calculated, W is computed once and stored, and each butterfly computation simply reads from it. Arranging the input data in r-ary reversed order guarantees that the output is produced in the correct order. Each iteration requires N/r butterfly computations. Before each butterfly, the twiddle factors are needed; in fact, they can be selected from the stored matrix W, so no extra computation is required. Each butterfly then computes the operand addresses, fetches r operands, multiplies them by the twiddle factors and the butterfly matrix to complete the butterfly computation, and stores the results at the computed addresses.

Because Singleton’s fixed structure is adopted in the design of the algorithm, X and Y caches are used to store the input and output of each fixed structure. The pseudo code of the whole algorithm is as follows.

For the case of only one multiplier and one adder, the whole algorithm flow needs to be serially operated according to the above pseudo-code, and the calculation time is as described above.

As can be seen from the algorithm above, parallel computing can improve it in two ways:

1) Parallel computation is carried out for the matrix operation in the butterfly computation of line 15 of the algorithm.

2) Parallel computing is carried out for N/r butterfly computing units.

The radix used in practical application is generally less than r = 10, so Winograd's second-order matrix multiplication is not suitable. Therefore, the traditional matrix multiplication structure (mid-product algorithm) is used for the calculation.

In the butterfly computation, the r data A(fetch_idx) are first multiplied by r twiddle factors; the subsequent multiplication of the r × r butterfly matrix with the resulting r × 1 vector then requires r^{2} multiplications and r(r − 1) additions.
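A direct Python sketch of this butterfly (the function name is ours; the DFT matrix is built with `cmath`) makes the two steps explicit:

```python
import cmath

def butterfly(op, tw, r):
    # Step 1: r multiplications by the twiddle factors
    scaled = [op[j] * tw[j] for j in range(r)]
    # Step 2: r x r DFT matrix times r x 1 vector ->
    # r^2 multiplications and r*(r-1) additions
    W = [[cmath.exp(-2j * cmath.pi * i * j / r) for j in range(r)]
         for i in range(r)]
    return [sum(W[i][j] * scaled[j] for j in range(r)) for i in range(r)]
```

With unit twiddle factors this reduces to a plain r-point DFT, which gives an easy correctness check.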

The parallel computing architectures for different resource levels are discussed as follows:

1) When the number of multipliers n_{m} is less than r^{2}, and the number of additions n_{a} is less than r(r − 1);

The twiddle factors must be pre-processed, at a time cost of ⌈r/n_m⌉; the time cost of the matrix multiplication is max{⌈r^2/n_m⌉, ⌈r(r − 1)/n_a⌉}. The parallel process uses n_m computing threads. For convenience, assume n_a ≥ n_m + 1.

Take r = 5 and n_m = 3 as an example to illustrate the parallel design of the butterfly. Denote the operands as op_i, the twiddle factors as tf_i, and their products as op′_i; the butterfly matrix entries are w_{i,j} and the results are opo_i, where i, j ∈ [0, r − 1]. The parallel procedure is as follows in

2) When the number of multipliers n_{m} is not less than r^{2}, and the number of additions n_{a} is not less than r(r − 1).

The adder and multiplier resources exceed the requirement of one fully parallel butterfly, so multiple butterflies can be configured in parallel to enhance performance. The parallel degree is P = min{⌈n_m/r^2⌉, ⌈n_a/(r(r − 1))⌉}. When n_a ≥ n_m + 1, P = ⌈n_m/r^2⌉. Each group contains Q = N/(rP) butterflies.

Improving Algorithm 1 yields the multi-butterfly parallel FFT Algorithm 2, shown in

According to the performance of the parallel algorithm in Section 3.2.2, the gap pipelines within the single butterfly and across the multi-butterfly groups are adjusted to enhance parallelism. The time cost of the butterflies in each stage is derived as:

1) When the number of multipliers n_{m} is less than r^{2}, and the number of additions n_{a} is less than r(r − 1);

$$ T_{bf} = \left\lceil \frac{r^2}{n_m} \right\rceil \frac{N}{r} \tag{14} $$

2) When the number of multipliers n_{m} is not less than r^{2}, and the number of additions n_{a} is not less than r(r − 1).

$$ T_{bf} = \frac{N}{r \left\lceil n_m / r^2 \right\rceil} \tag{15} $$

Then, the total time cost is

$$ T_{fft} = \begin{cases} \log_r N \cdot \left\lceil \dfrac{r^2}{n_m} \right\rceil \dfrac{N}{r}, & n_m < r^2 \\[2mm] \dfrac{\log_r N \cdot N}{r \left\lceil n_m / r^2 \right\rceil}, & n_m \ge r^2 \end{cases} \tag{16} $$
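Equations (14)-(16) translate directly into a short Python sketch; the function names are ours, and the parallel degree P follows the min expression from the multi-butterfly case above.

```python
import math

def stage_cycles(N, r, n_m, n_a):
    # Equations (14)-(15): time for the N/r butterflies of one stage
    if n_m < r * r:
        return math.ceil(r * r / n_m) * (N // r)   # serial within one butterfly
    P = min(math.ceil(n_m / (r * r)),              # multiplier-limited degree
            math.ceil(n_a / (r * (r - 1))))        # adder-limited degree
    return (N // r) / P

def fft_cycles(N, r, n_m, n_a):
    # Equation (16): log_r(N) stages in total
    M = round(math.log(N) / math.log(r))
    return M * stage_cycles(N, r, n_m, n_a)
```

For example, a 243-point radix-3 FFT with n_m = 4 multipliers needs 5 · ⌈9/4⌉ · 81 = 1215 butterfly cycles, while n_m = 18 allows P = 2 parallel butterflies.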

The buffers of the 8 modules in the FFT processor are built from block RAM, while the SPU/ILB/IPU/RS/FPUG occupy slices. The resource consumption of the first four modules is the same for any parallel degree, while the resources of the FPUG grow with the parallel degree.

Taking the Xilinx Virtex-II 3000 as an example, the SPU/ILB/IPU/RS cost 2973 slices; a floating-point adder costs 273 slices; a floating-point multiplier costs 75 slices plus 4 fixed-point multipliers. The total slice cost is then

$$ S_{rCost} = 2973 + 273 n_m + 75 n_a \tag{17} $$
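Equation (17) amounts to a one-line Python check (coefficients as printed above for the Virtex-II 3000; the function name is ours):

```python
def slice_cost(n_m: int, n_a: int) -> int:
    # Equation (17): 2973 fixed slices plus per-unit floating-point costs
    return 2973 + 273 * n_m + 75 * n_a
```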

According to (17), the parameter space is depicted in

The design method in this paper is compared with the previous literature [

This paper presents a design method for a configurable multi-butterfly parallel radix-r FFT processor. In information processing, the FFT is widely used, in great demand, and subject to strict real-time requirements. Existing design methods are mainly limited to radix-2/4 and the corresponding parallel architectures, which easily waste storage and multiplier resources under large data with varying point counts and multiplier counts. To solve this problem, the parallel FFT algorithm is improved by designing a configurable

Design | Radix | Flexibility | Performance
---|---|---|---
[ | 4 | Constant Arch. | Constant
[ | 2^{n} | Constant Arch. | Constant
[ | 2^{2} | Constant Arch. | Constant
[ | r | Constant Arch. | Constant
Proposed | r | Reconfigurable | Adaptive

controller combined with hardware resources such as buffers and multipliers. An FFT design cost function with the radix, number of points, number of padded zeros, and computing time as inputs is given. In the actual design process, with the buffer and multiplier resources as constraints, the optimal FFT architecture is obtained by computing the number of points and the radix under the optimal cost function, and the solution space of computing performance and resource occupancy is presented. The design method has good flexibility, and its parallel computing architecture also guarantees real-time computation. The comparison with the previous literature shows that the design method is effective under the same design parameters.

This work was carried out by Professor Dan Huang and Professor Zong Qi (Chongqing University of Technology). We gratefully acknowledge their invaluable cooperation in preparing this application note.

The authors declare no conflicts of interest regarding the publication of this paper.

Yu, J.Y., Cheng, B.W., Li, Z.L., Liu, W.W. and Wang, L.Y. (2019) Reconfigurable Multi-Butterfly Parallel Radix-r FFT Processor. Journal of Data Analysis and Information Processing, 7, 91-107. https://doi.org/10.4236/jdaip.2019.73006