FPGA Implementation of Approximate 2D Discrete Cosine Transforms

The discrete cosine transform (DCT) is frequently used in image and video signal processing due to its high energy compaction property. Humans are able to perceive and identify the information in slightly erroneous images, so it is enough to produce approximate outputs rather than exact outputs, which in turn reduces circuit complexity. A number of applications, such as image and video processing, need higher-dimensional DCT algorithms. The existing architectures of one-dimensional (1D) approximate DCTs are therefore reviewed and extended to two-dimensional (2D) approximate DCTs. Approximate 2D multiplier-free DCT architectures are coded in Verilog, simulated in ModelSim to verify correctness, synthesized to evaluate performance, and implemented on a Virtex-E field-programmable gate array (FPGA) kit. A comparative analysis of the approximate 2D DCT architectures is carried out in terms of speed and area.


Introduction
The increasing use of computers has increased the use of digital signal processing (DSP). In DSP, three domains are used to represent signals: the time or spatial domain (for one-dimensional and multidimensional signals, respectively), the frequency domain, and the wavelet domain. A signal can be represented in whichever domain best captures its essential characteristics. Frequency-domain analysis, also called spectral analysis, partitions a signal into spectral components to give a small and meaningful representation. There are many frequency-domain transformations. Owing to its strong "energy compaction" property, the DCT is frequently used in signal and image processing, and it is also used in a multitude of compression standards. For multimedia applications, video processing systems such as High Efficiency Video Coding (HEVC) need fast and compact blocks. Advances in fast algorithms have made approximation of the DCT efficient.
A discrete cosine transform (DCT) [1] expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. The DCT is a Fourier-related transform, similar to the discrete Fourier transform (DFT) but using only real numbers.
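The DCT can be expressed as

$$X(k) = \alpha(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1,$$

where $\alpha(0) = \sqrt{1/N}$ and $\alpha(k) = \sqrt{2/N}$ for $k \geq 1$.

The 1D-DCT is used for transforming one-dimensional signals such as audio, but image and video signals need the 2D-DCT. The number of applications requiring higher-dimensional DCT calculations is increasing, so much importance is given to algorithms that can be readily extended to higher dimensions.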
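As a minimal sketch of a direct 1D-DCT evaluation, assuming the standard orthonormal DCT-II definition (the function name `dct_1d` and the test input are illustrative, not taken from the paper):

```python
import math

def dct_1d(x):
    """Direct orthonormal DCT-II of a sequence x (N*N multiply-accumulate steps)."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        # Scale factor: sqrt(1/N) for the DC term, sqrt(2/N) otherwise.
        alpha = math.sqrt(1.0 / n_points) if k == 0 else math.sqrt(2.0 / n_points)
        acc = 0.0
        for n in range(n_points):
            acc += x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
        out.append(alpha * acc)
    return out

# A flat (constant) 8-point input, chosen only for illustration.
coeffs = dct_1d([10.0] * 8)
```

For a flat input, all of the signal energy lands in the first (DC) coefficient and the remaining coefficients vanish up to rounding error, which is a simple illustration of energy compaction.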
A multidimensional DCT is obtained by applying the 1D-DCT separately along each dimension. For example, applying the 1D-DCT along the rows and then along the columns of an image gives its 2D-DCT. Computing the 2D-DCT from 1D-DCTs across all dimensions in this way is known as the row-column algorithm.
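The row-column idea can be sketched as follows, assuming the standard orthonormal DCT-II as the 1D building block (the names `dct_1d` and `dct_2d_row_column` are illustrative only):

```python
import math

def dct_1d(x):
    """Direct orthonormal DCT-II of a sequence x."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        alpha = math.sqrt(1.0 / n_points) if k == 0 else math.sqrt(2.0 / n_points)
        acc = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
                  for n in range(n_points))
        out.append(alpha * acc)
    return out

def dct_2d_row_column(block):
    """2D DCT of a square block using the row-column method."""
    # Step 1: 1D DCT of every row.
    rows = [dct_1d(row) for row in block]
    # Step 2: 1D DCT of every column of the intermediate result.
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols[j] holds transformed column j; transpose back to row-major order.
    return [list(r) for r in zip(*cols)]

# Illustrative 4x4 block in which every row is identical.
block = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
result = dct_2d_row_column(block)
```

Because every row of this example block is the same, the column pass sees constant columns, so only the first row of `result` is non-zero.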
The expression for the 2D-DCT is given by

$$X(k, l) = \alpha(k)\,\alpha(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m, n) \cos\left[\frac{(2m+1)k\pi}{2N}\right] \cos\left[\frac{(2n+1)l\pi}{2N}\right], \quad k = 0, \ldots, N-1;\; l = 0, \ldots, N-1,$$

where $\alpha(0) = \sqrt{1/N}$ and $\alpha(k) = \sqrt{2/N}$ for $k \geq 1$.

In the conventional DCT, an 8-point 1D-DCT requires 64 multiplications and 56 additions, and an 8-point 2D-DCT, computed as 16 such 1D-DCTs with the row-column algorithm, requires 1024 multiplications and 896 additions. It is computation intensive and also occupies more area, so approximate DCTs are preferred. Humans are able to perceive and identify the information in slightly erroneous images, so it is enough to produce approximate outputs rather than exact outputs.
The contribution of this paper is three-fold: first, the architectures of approximate DCTs are reviewed; second, the architecture of the 1D approximate DCT is extended to the 2D approximate DCT; and third, the 2D approximate DCT is implemented on a Virtex-E FPGA. The workflow is shown in Figure 1.
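The operation counts quoted for the conventional DCT can be reproduced with a short calculation. This is a sketch under the assumption of a direct matrix-vector evaluation (N multiplications and N - 1 additions per output) and of 2N one-dimensional transforms in the row-column method; the function name `direct_dct_cost` is hypothetical:

```python
def direct_dct_cost(n):
    """Operation counts for a direct N-point 1D DCT and its row-column 2D form."""
    mults_1d = n * n        # each of the N outputs needs N multiplications
    adds_1d = n * (n - 1)   # and N - 1 additions to sum the N products
    transforms = 2 * n      # N row transforms plus N column transforms
    return {
        "1d": (mults_1d, adds_1d),
        "2d": (transforms * mults_1d, transforms * adds_1d),
    }

cost = direct_dct_cost(8)
print(cost)  # {'1d': (64, 56), '2d': (1024, 896)}
```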

Methods and Materials
The idea of using an approximation algorithm for the DCT is to eliminate computation-intensive and power-consuming operations such as multiplications while still obtaining a meaningful estimate of the DCT. It is especially suitable for large DCTs, whose computational cost grows rapidly with transform size. The available methods are not suitable for extension, but sizes such as 16-point and 32-point DCTs are needed for many image processing applications like biomedical signal processing, satellite communication, etc.
Approximate DCT transforms have been proposed that incur no multiplication cost while maintaining good compression performance. Realizing these approximations in digital VLSI hardware requires only additions and subtractions, which reduces chip area and power consumption compared with conventional DCT transforms. The 8-point approximate DCT computation requires only additions and no multiplications, so the computational complexity is brought down. A reconfigurable video standard like HEVC can make use of the best available DCT approximation. The cost of a transformation matrix is equal to the number of arithmetic operations required to compute it. The number of retained coefficients in the transform domain is the main constraint of the image compression process. The performance of a DCT approximation is often a trade-off between the accuracy and the computational complexity of a given algorithm.
Reduced computational complexity, orthogonality, small error energy, and extensibility to larger sizes are the main features of approximate DCTs.
An approximate DCT is typically written as the product of a diagonal matrix and a low-complexity matrix. The diagonal matrix typically includes irrational numbers of the form $1/\sqrt{m}$, where m is a small positive integer. Normally, the irrational numbers in the diagonal matrix require more computations. Since the elements of the low-complexity matrix are only zero or signed powers of two, {0, ±1/2, ±1, ±2}, no multiplication is required.
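To illustrate why coefficients restricted to {0, ±1/2, ±1, ±2} need no multiplier, the sketch below computes one output of a transform row using only shifts, negations, and additions on integer samples. The helper names and the example row are hypothetical and are not taken from any specific transform in this paper:

```python
def scale_by_power_of_two(value, coeff):
    """Scale an integer sample by a coefficient in {0, +/-1/2, +/-1, +/-2} with shifts only."""
    if coeff == 0:
        return 0
    magnitude = abs(coeff)
    if magnitude == 2:
        shifted = value << 1   # times 2 is a one-bit left shift
    elif magnitude == 1:
        shifted = value        # times 1 is a plain wire
    elif magnitude == 0.5:
        shifted = value >> 1   # times 1/2 is a one-bit arithmetic right shift
    else:
        raise ValueError("coefficient is not in {0, +/-1/2, +/-1, +/-2}")
    return shifted if coeff > 0 else -shifted

def multiplierless_dot(samples, coeffs):
    """One transform output: a sum of shifted and negated samples."""
    return sum(scale_by_power_of_two(s, c) for s, c in zip(samples, coeffs))

# Hypothetical coefficient row and even-valued samples, for illustration only.
row = [1, 0.5, -0.5, -1, -1, -0.5, 0.5, 1]
samples = [8, 6, 4, 2, 2, 4, 6, 8]
output = multiplierless_dot(samples, row)  # 14
```

For odd samples the right shift discards the least significant bit, so a fixed-point hardware design would need to account for that rounding.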

One-Dimensional Digital Architectures of Approximate DCT
In [2], a low-complexity approximation was introduced by Bouguezel et al.; it is shown in Figure 2 and is called the BAS-2008 approximation. The computation requires only 18 additions and 2 shifts.
An 8-point orthogonal DCT transform proposed by Bouguezel, Ahmad, and Swamy in 2011 [3] contains a single parameter "a". It is shown in Figure 3.

Two Dimensional Approximate Transform
For real-time implementation of approximate algorithms, the proposed digital architectures are custom designed. The 1D approximate DCT block is implemented using a suitable algorithm chosen from the existing architectures [9]-[11]. A row-wise transformation of the input image, followed by a column-wise transformation of the intermediate result, forms the 2D-DCT, as shown in Figure 9.
A multidimensional DCT is obtained from 1D DCTs by a row-wise 1D DCT of the input image followed by a column-wise 1D DCT of the resulting row-transformed data. Alternatively, the vertical (column-wise) transform can be applied before the horizontal (row-wise) one. The row-wise and column-wise transforms can be any of the existing DCT approximations. In other words, the row-wise and column-wise transforms are not required to be the same. However, for simplicity, identical transforms are adopted for both steps. The design employs two parallel realizations of the DCT approximation block, as shown in Figure 10.
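The row-then-column structure can be sketched in matrix form. This is a minimal illustration, assuming each 1D transform is given as a square matrix; `transform_2d`, `matmul`, `transpose`, and the 2-point matrix `H` are hypothetical placeholders rather than any of the approximations discussed here:

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def transform_2d(block, t_row, t_col=None):
    """Apply t_row to every row of block, then t_col to every column."""
    if t_col is None:
        t_col = t_row  # identical transforms for both steps
    row_transformed = matmul(block, transpose(t_row))  # each row r becomes t_row * r
    return matmul(t_col, row_transformed)              # each column c becomes t_col * c

# Placeholder 2-point transform and input block, for illustration only.
H = [[1, 1], [1, -1]]
I2 = [[1, 0], [0, 1]]
X = [[1, 2], [3, 4]]
same = transform_2d(X, H)        # identical row and column transforms
mixed = transform_2d(X, H, I2)   # different row and column transforms
```

Because matrix multiplication is associative, applying the column transform first and the row transform second gives the same result, which matches the remark that the order can be reversed.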

Transposition Buffer
Between the approximate DCT blocks, a real-time row-parallel transposition buffer circuit is required. This block ensures data ordering by converting the row-transformed data from the first DCT approximation circuit into the transposed format required by the column transform circuit. The transposition buffer block is detailed in Figure 11.
From Table 2, it is evident that the Vaithyanathan et al. [8] transform requires fewer hardware resources and has the highest frequency of operation among the approximations. The delay and computational complexity are reduced in this transform.
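The data reordering performed by such a buffer can be modelled in software. The class below is only a behavioral sketch, assuming one full row is written per step and the columns are read once the buffer is full; it does not model the registers, clocking, or control logic of the actual circuit, and the class and method names are hypothetical:

```python
class TranspositionBuffer:
    """Behavioral model: accepts one N-element row per step, returns columns."""

    def __init__(self, size):
        self.size = size
        self.rows = []

    def write_row(self, row):
        """Store one row-transformed vector (all N values arrive in parallel)."""
        if len(row) != self.size:
            raise ValueError("row length must equal the buffer size")
        if self.full():
            raise RuntimeError("buffer is full; read the columns first")
        self.rows.append(list(row))

    def full(self):
        return len(self.rows) == self.size

    def read_columns(self):
        """Return the stored data column by column and clear the buffer."""
        if not self.full():
            raise RuntimeError("buffer is not full yet")
        columns = [[self.rows[i][j] for i in range(self.size)]
                   for j in range(self.size)]
        self.rows = []
        return columns

buf = TranspositionBuffer(3)
for row in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    buf.write_row(row)
columns = buf.read_columns()  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

Each returned column can then be fed to the second (column-wise) transform block.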

Conclusion
In this paper, we proposed 1) VLSI architectures for approximate 2D DCTs and 2) a hardware implementation of the approximate 2D DCT transforms. In terms of image compression, the approximate transforms could outperform the conventional transforms.
Hence, the proposed transforms are the best approximations of the DCT in terms of computational complexity and speed. The introduced implementations address 2D approximate DCTs. All the approximations were simulated in ModelSim, synthesized with Xilinx tools, and prototyped on a Virtex-E FPGA kit. The proposed architectures are suitable for image and video processing, and are candidates for improvements in several standards, including HEVC. In future, approximate versions of the 16-, 32-, and 64-point DCT will be developed.
requires only 16 additions. In CB-2011 [4], a DCT approximation was obtained by rounding off the elements of the exact DCT matrix. The resulting matrix is orthogonal and contains elements only in {0, ±1}. It has very low arithmetic complexity. The CB-2011 architecture is shown in Figure 4.
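The rounding-off construction can be sketched as follows. The text does not state a scaling factor, so the sketch assumes the exact orthonormal 8-point DCT matrix is multiplied by 2 before rounding (without some scaling, every entry would round to zero); the function names are illustrative:

```python
import math

def exact_dct_matrix(n=8):
    """Exact orthonormal DCT-II matrix, one row per frequency index k."""
    rows = []
    for k in range(n):
        alpha = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([alpha * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def rounded_dct_matrix(n=8, scale=2.0):
    """Round each scaled entry to the nearest integer.

    ASSUMPTION: scale = 2 is not stated in the text; it is chosen here so
    that the rounded entries fall in {0, +1, -1}.
    """
    return [[int(round(scale * c)) for c in row] for row in exact_dct_matrix(n)]

T = rounded_dct_matrix()
# Gram matrix T * T^T: off-diagonal zeros mean the rows are mutually orthogonal.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in T] for r1 in T]
```

Under this assumption the rows of `T` are mutually orthogonal but do not all have the same norm, which is the role a diagonal scaling matrix would play in making the transform orthonormal.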

Figure 11. Detailed circuit of the transposition buffer block.

Figure 12. Comparison of number of slices for different 2D DCTs.

Figure 15. Comparison of maximum frequency for different 2D DCTs.
DCT transforms. All proposed approximate 2D DCT transforms perform well. However, the Vaithyanathan et al. transform offers lower computational complexity and is faster than all the other transforms.