
This paper describes the design and Field Programmable Gate Array (FPGA) based implementation of a 4 × 4 breadth heuristic Multiple-Input Multiple-Output (MIMO) decoder using 16 and 64 Quadrature Amplitude Modulation (QAM) schemes. The aim of this work is to evaluate the performance of the Candidate Execution with Low Latency Approach (CELLA) for a soft MIMO detector on FPGA. Smart Ordering and Candidate Adding (SOCA), Parallel Candidate Adding (PCA) and Backward Candidate Adding (BCA) perform well in terms of either Bit Error Rate (BER) or chip-level performance. CELLA is developed in this work to attain both BER and FPGA-level performance in a single system. Simulation and experimental results demonstrate the effectiveness of the proposed work for a 4 × 4 MIMO-OFDM system employing 16 QAM and 64 QAM. The proposed design is implemented on a Xilinx Virtex 5 XC5VSX240T. At the FPGA level, it achieves a 76% slice reduction, a 58.76% throughput improvement, a 75% power reduction and an 87% latency reduction. The BER performance is observed and compared with that of the conventional algorithms. Overall, the proposed work outperforms the conventional work.

In today’s data-rich world, MIMO has become an essential element of wireless communication standards for high data rate communications. MIMO is a method for multiplying link capacity by using multiple transmit and receive antennas to exploit multipath propagation. A MIMO system that incorporates a turbo code is called a turbo coded MIMO system. Using the sphere decoder, a simple method can detect and decode any linear space-time mapping combined with any channel code using “soft” inputs and outputs [

In [

Michael et al. have introduced a low-cost parallel programmable co-processor, which can achieve high throughput in ASIC/FPGA designs [

The system model specification is similar to that of Takuma et al. [ N_{t} denotes the number of transmit antennas and N_{r} the number of receive antennas (N_{r} ≥ N_{t}). These antennas are identically distributed. The binary input is passed through an LDPC encoder of rate γ_{c} and an interleaver. The resulting bits are mapped into symbols. These symbols are demultiplexed into N_{t} sub-modules and passed through the OFDM section. The vector consisting of the N_{r} received symbols

where

σ^{2} = N_{0}, and H is the N_{r} × N_{t} Rayleigh fading channel matrix, whose element h_{ij} is the complex transfer function from transmitter j to receiver i with

The receiver consists of a MIMO detector, a deinterleaver and an LDPC decoder. The A Posteriori Probability (APP) of each coded bit b_{(j,k)} is computed in the form of a Log Likelihood Ratio (LLR). From [

The list size affects the complexity of the system, since the computational load increases linearly with the list size. To avoid this issue, the insertion of the list is removed from [
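The dependence of the soft-output cost on the list size can be illustrated with a minimal max-log LLR sketch. This is not the authors' detector: the function name, the sign convention (a positive LLR favors bit 1), the absence of a priori information, and the use of L_max as a clipping value when a bit hypothesis is missing from the list are all assumptions made for illustration. Each candidate is a pair of a bit tuple and its squared Euclidean distance ‖y − Hs‖².

```python
def max_log_llrs(candidates, n0, l_max=8.0):
    """Max-log LLR per coded bit from a candidate list (illustrative sketch).

    candidates: list of (bits, dist), bits a tuple of 0/1 and
                dist the squared Euclidean distance ||y - Hs||^2.
    n0:         noise variance (sigma^2 = N0).
    l_max:      clipping value, assumed to apply when the list holds
                no counterhypothesis for a bit.
    """
    n_bits = len(candidates[0][0])
    llrs = []
    for k in range(n_bits):
        # One pass over the list per bit: cost grows linearly with list size.
        d0 = min((d for b, d in candidates if b[k] == 0), default=None)
        d1 = min((d for b, d in candidates if b[k] == 1), default=None)
        if d0 is None:        # every candidate has bit k = 1
            llr = l_max
        elif d1 is None:      # every candidate has bit k = 0
            llr = -l_max
        else:
            llr = max(-l_max, min(l_max, (d0 - d1) / n0))
        llrs.append(llr)
    return llrs


if __name__ == "__main__":
    demo = [((0, 1), 1.0), ((1, 1), 3.0)]
    print(max_log_llrs(demo, n0=1.0))
```

In this sketch the second bit has no candidate with value 0, so its LLR is clipped; adding counterhypotheses to the list is what removes the need for such clipping.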

The conventional algorithm for visited nodes and the list size are listed in

Introducing the QRM concept reduces the above values: the number of visited nodes is 784 and the list size is 16 for 16 QAM, whereas for 64 QAM the number of visited nodes is 5824 and the list size is 30. The visited nodes and list sizes of LFSD, PCA, BCA and the combinations QRM-BCA and PCA-BCA are also listed for both 16 QAM and 64 QAM. In comparison, the proposed methodology visits only 34 nodes with a list size of 16 for 16 QAM. For 64 QAM, on the other hand, the visited nodes and list size do not change as drastically relative to the methodologies described above. This minimizes the size of Look Up

Algorithm | Visited nodes | List size |
---|---|---|
MLD [ | | |
QRM [ | M | |
LFSD [ | | |
PCA [ | | |
BCA [ | | |
QRM-BCA [ | | |
PCA-BCA [ | | |
CELLA | | |

Algorithm | Visited nodes (16-QAM) | List size (16-QAM) | Visited nodes (64-QAM) | List size (64-QAM) |
---|---|---|---|---|
MLD [ | 69,904 | 65,536 | NR | NR |
QRM [ | 784 | 16 | 5824 | 30 |
LFSD [ | 176 | 64 | 704 | 64 |
PCA [ | 88 | 28 | 292 | 82 |
BCA [ | 88 | 28 | 292 | 82 |
QRM-BCA [ | 808 | 28 | 5860 | 48 |
PCA-BCA [ | 112 | 40 | 328 | 100 |
CELLA | 34 | 16 | 136 | 32 |
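To make the comparison in the table concrete, the following sketch computes the fraction of visited nodes that CELLA saves relative to each listed algorithm. The numbers are copied from the table; the dictionary layout and the function name are illustrative assumptions, and entries reported as NR are represented by `None`.

```python
# (visited nodes, list size) per algorithm, copied from the table above.
# None marks entries reported as NR.
TABLE = {
    "MLD":     {"16-QAM": (69904, 65536), "64-QAM": (None, None)},
    "QRM":     {"16-QAM": (784, 16),      "64-QAM": (5824, 30)},
    "LFSD":    {"16-QAM": (176, 64),      "64-QAM": (704, 64)},
    "PCA":     {"16-QAM": (88, 28),       "64-QAM": (292, 82)},
    "BCA":     {"16-QAM": (88, 28),       "64-QAM": (292, 82)},
    "QRM-BCA": {"16-QAM": (808, 28),      "64-QAM": (5860, 48)},
    "PCA-BCA": {"16-QAM": (112, 40),      "64-QAM": (328, 100)},
    "CELLA":   {"16-QAM": (34, 16),       "64-QAM": (136, 32)},
}


def visited_node_saving(algorithm, modulation):
    """Fraction of visited nodes CELLA saves relative to `algorithm`.

    Returns None when the baseline entry is not reported (NR).
    """
    baseline = TABLE[algorithm][modulation][0]
    if baseline is None:
        return None
    return 1.0 - TABLE["CELLA"][modulation][0] / baseline


if __name__ == "__main__":
    for modulation in ("16-QAM", "64-QAM"):
        for name in TABLE:
            if name == "CELLA":
                continue
            saving = visited_node_saving(name, modulation)
            text = "NR" if saving is None else f"{100 * saving:.1f}%"
            print(f"{modulation:7s} {name:8s} {text}")
```

Relative to PCA, for example, this gives a saving of about 61% of the visited nodes for 16 QAM and about 53% for 64 QAM.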

SOCA scheme is used to identify the partial MAP using parent nodes [

The counterhypotheses of conventional SOCA, PCA and BCA algorithms are added in parallel with other candidates. From the hardware aspects, Look-Up

Converting the Simulink model to a Xilinx implementation consumes additional LUTs and memory. Because of this overhead, the conventional architecture suffers performance degradation at the FPGA level. To avoid this issue, the conventional architecture is modeled directly in VHDL using Xilinx ISE Design Suite 12.1.

The pseudo-code of the CELLA scheme is given in

only 4 cycles to obtain the entire list for every transmitted vector after a single latency, while the detector achieves a throughput of 400 Mbps at a clock frequency of 100 MHz.
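The 400 Mbps figure is consistent with a simple rate calculation, sketched below. The formula is an assumption rather than something stated in the text: it supposes 4 transmit antennas with 16 QAM (4 bits per symbol, hence 16 coded bits per transmitted vector) and one vector completed every 4 clock cycles. The function name is hypothetical.

```python
def detector_throughput_mbps(n_tx, bits_per_symbol, f_clk_hz, cycles_per_vector):
    """Detected coded bits per second, in Mbps (illustrative formula).

    Assumes one transmitted vector of n_tx symbols is completed
    every `cycles_per_vector` clock cycles once the pipeline is full.
    """
    bits_per_vector = n_tx * bits_per_symbol
    return bits_per_vector * f_clk_hz / cycles_per_vector / 1e6


if __name__ == "__main__":
    # 4 antennas x 4 bits (16 QAM), 100 MHz clock, 4 cycles per vector.
    print(detector_throughput_mbps(4, 4, 100e6, 4))
```

Under these assumptions, throughput scales inversely with the number of cycles spent per vector, which is why reducing latency cycles matters for the data rate.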

The functional modules of the MSU are a comparator and decision logic. The MSU identifies the local MAP node, and the architecture is arranged so that each PE processes 4 nodes sequentially. In PCA, 3 MSUs are employed to handle the counterhypotheses list, which contains (4 × 3) + (4 × 2) + 4 = 24 nodes, and 6 PEs are required to process it. In the proposed scheme, only one MSU and a single PE are required to handle the counterhypotheses list, which contains (4 × 1) + (4 × 1) + 4 = 12 nodes. As a result of this hardware elimination (removal of 4 PEs and MSU) in the proposed work, the throughput reaches 970 Mbps at 16 QAM and 830 Mbps at 64 QAM.
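The node counts quoted above, and the role of the MSU, can be sketched in a few lines. This is an illustrative software model, not the VHDL design: reading the factors 3, 2, 1 and 1, 1, 1 as per-stage multipliers is an interpretation of the arithmetic in the text, and treating the local MAP node as the node with the smallest distance metric is an assumption. All names are hypothetical.

```python
NODES_PER_GROUP = 4  # each PE processes 4 nodes sequentially (from the text)


def counterhypothesis_nodes(multipliers):
    """Total nodes in the counterhypotheses list for the given stage multipliers."""
    return sum(NODES_PER_GROUP * m for m in multipliers)


PCA_NODES = counterhypothesis_nodes((3, 2, 1))    # (4*3) + (4*2) + 4
CELLA_NODES = counterhypothesis_nodes((1, 1, 1))  # (4*1) + (4*1) + 4


def msu_select(metrics):
    """Comparator-style scan: index and value of the smallest metric.

    Ties keep the earliest node, as a sequential comparator would.
    """
    best_index, best_metric = 0, metrics[0]
    for index in range(1, len(metrics)):
        if metrics[index] < best_metric:
            best_index, best_metric = index, metrics[index]
    return best_index, best_metric


if __name__ == "__main__":
    print(PCA_NODES, CELLA_NODES)
    print(msu_select([2.5, 0.7, 1.9, 0.7]))
```

The sketch shows that the proposed list holds half as many counterhypothesis nodes as the PCA list, which is the source of the hardware saving described above.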

The clock frequency obtained is about 150 MHz for both modulation architectures. The power consumption is about 40 mW for 16 QAM and 141 mW for 64 QAM. Compared with the previously proposed schemes, these features bring reduced latency, reduced slice usage and improved data throughput. The most important gain is the reduction in power, which sets CELLA apart from all the previously mentioned schemes. The reduction in latency cycles also contributes substantially to the throughput of CELLA for both the 16 QAM and 64 QAM architectures. This FPGA-level performance is independent of the noise level.

The performance results are obtained in simulation using MATLAB and on the FPGA platform using Xilinx ISE Design Suite 12.1. The Bit Error Rate (BER) results of SOCA, PCA, BCA, PCA + BCA, and CELLA are compared.

The BER performance has been obtained by computer simulation using 64-subcarrier OFDM per transmit antenna. Block Rayleigh fading is used for the channel model, so the ordering is required only once at the beginning of each received block. The rate of the LDPC code is chosen as γ_{c} = 1/2, and the length of the code word is N = 3072 bits, with a maximum number of sum-product iterations T = 40 and L_{max} = 8. First, (E_{b}/N_{0}) of sibling nodes and the difference between the BER with the proposed and also the conventional algorithms.
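As a rough consistency check on these simulation parameters, the sketch below works out how one code word maps onto the 4 × 4 MIMO-OFDM frame. The mapping itself is an assumption, since the text does not describe it: one code word is taken to fill consecutive transmit vectors, with one vector per subcarrier per OFDM symbol. The function name and the returned keys are hypothetical.

```python
from fractions import Fraction


def frame_layout(codeword_bits, n_tx, bits_per_symbol, n_subcarriers,
                 code_rate=Fraction(1, 2)):
    """Information bits, transmit vectors and OFDM symbols per code word."""
    bits_per_vector = n_tx * bits_per_symbol
    vectors, leftover_bits = divmod(codeword_bits, bits_per_vector)
    ofdm_symbols = -(-vectors // n_subcarriers)  # ceiling division
    return {
        "info_bits": int(codeword_bits * code_rate),
        "vectors": vectors,
        "leftover_bits": leftover_bits,
        "ofdm_symbols": ofdm_symbols,
    }


if __name__ == "__main__":
    print(frame_layout(3072, n_tx=4, bits_per_symbol=4, n_subcarriers=64))  # 16 QAM
    print(frame_layout(3072, n_tx=4, bits_per_symbol=6, n_subcarriers=64))  # 64 QAM
```

Under this assumed mapping, a 3072-bit code word divides evenly into transmit vectors for both modulations, with no leftover bits.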

In

Papers | [ | [ | [ | [ | [ | | | Proposed | Proposed |
---|---|---|---|---|---|---|---|---|---|
Algorithm | Optimized FSD-B | K-Best SD | PIPSD | FSD 2 | LFSD (16,221) | SOCA | PCA | CELLA | CELLA |
FPGA | Xilinx XC2VP70 | Xilinx XC2VP30 | Virtex 6 XC6VLX240T | Xilinx XC2VP70 | Xilinx Virtex 5 XC5VSX240T | Xilinx Virtex 5 XC5VSX240T | Xilinx Virtex 5 XC5VSX240T | Xilinx Virtex 5 XC5VSX240T | Xilinx Virtex 5 XC5VSX240T |
Modulation | 16 QAM | 16 QAM | 16 QAM | 16 QAM | 16 QAM | 16 QAM | 16 QAM | 16 QAM | 64 QAM |
Slices (available) | 24,815 (33,088) | 8778 (13,696) | NR | 12,721 (33,088) | 13,577 (37,440) | 9570 (37,440) | 9471 (37,440) | 2215 (37,440) | 8913 (37,440) |
FFs (available) | 39,800 (66,176) | 6274 (27,392) | NR | 15,332 (66,176) | 32,934 (149,760) | 23,133 (149,760) | 22,814 (149,760) | 4918 (149,760) | 16,850 (149,760) |
LUTs (available) | 31,759 (66,176) | 13,417 (27,392) | NR | 16,119 (66,176) | 34,378 (149,760) | 24,523 (149,760) | 24,126 (149,760) | 4982 (149,760) | 16,132 (149,760) |
DSP (available) | NR | 48 (136) | NR | NR | 348 (1056) | 174 (1056) | 174 (1056) | 38 (1056) | 151 (1056) |
BRAM (available) | NR | NR | NR | 82 (328) | 99 (516) | 70 (516) | 70 (516) | 18 (516) | 67 (516) |
Latency (cycles) | 78 | NR | NR | NR | 63 | 133 | 109 | 14 | 53 |
F_{clock} (MHz) | 150 | NR | 178 | 150 | 150 | 150 | 150 | 150 | 150 |
Throughput at 20 dB (Mbps) | 450 | 732 | 356 | 600 | NR | NR | 400 | 970 | 830 |
Power (mW) | NR | 165 | NR | NR | NR | NR | NR | 40 | 141 |

CELLA. It uses 2215 slices, 4918 FFs and 4982 LUTs with a latency of 14 cycles for 16 QAM; the corresponding results for 64 QAM are obtained similarly. The proposed method consumes 40 mW and 141 mW for 16 QAM and 64 QAM, respectively. The proposed system is implemented on the Xilinx Virtex 5 XC5VSX240T FPGA platform. The performance analysis at the FPGA level shows a 76% slice reduction, a 58.76% throughput improvement at 20 dB, a 75% power reduction and an 87% latency reduction compared with the conventional work. The BER performance is also computed for the proposed and conventional algorithms. The proposed scheme minimizes the noise level and improves performance at 20 dB.
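The percentages quoted above can be reproduced from the table values, as the sketch below shows. The choice of baselines is an inference, not something the text states: the slice, latency and throughput figures match a comparison of CELLA at 16 QAM against the PCA column, and the power figure matches the K-Best SD column (the only other reported power value). The throughput figure matches only when the gain is normalized by CELLA's own throughput. Function names are hypothetical.

```python
def reduction_pct(baseline, proposed):
    """Percentage reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def gain_pct_of_proposed(baseline, proposed):
    """Gain expressed as a percentage of the proposed (larger) value."""
    return 100.0 * (proposed - baseline) / proposed


if __name__ == "__main__":
    # Table values: PCA vs CELLA (16 QAM); power: K-Best SD vs CELLA (16 QAM).
    print(f"slices:     {reduction_pct(9471, 2215):.2f}%")
    print(f"latency:    {reduction_pct(109, 14):.2f}%")
    print(f"power:      {reduction_pct(165, 40):.2f}%")
    print(f"throughput: {gain_pct_of_proposed(400, 970):.2f}%")
```

Measured against PCA's 400 Mbps instead, the same throughput change would be a considerably larger percentage, so the normalization matters when comparing with other published figures.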

In this work, a candidate execution with low latency approach has been introduced for a turbo coded MIMO system. The proposed system identifies partial MAP nodes during the online process and adds counterhypotheses in parallel within the parent candidate, which avoids the use of child nodes. Child nodes are not included here in order to minimize the ED metric computations. The proposed system is implemented on the Xilinx Virtex 5 XC5VSX240T platform. The performance analysis at the FPGA level shows a 76% slice reduction, a 58.76% throughput improvement at 20 dB, a 75% power reduction and an 87% latency reduction compared with the conventional work. The BER performance is also computed for the proposed and conventional algorithms. The proposed scheme minimizes the noise level and improves performance at 20 dB. This shows that the nodes are executed in a confined region compared with other algorithms. Thus, CELLA achieves better performance than the conventional work.

Erulappan Sakthivel, Kasinathan Pounraj, Veluchamy Malathi, Muruganantham Arunraja, Govindaraj Perumalvignnesh (2016) CELLA: FPGA Based Candidate Execution with Low Latency Approach for Soft MIMO Detector. Circuits and Systems, 07, 1760-1768. doi: 10.4236/cs.2016.78152