Wake-Up-Word Feature Extraction on FPGA

Abstract

The Wake-Up-Word Speech Recognition (WUW-SR) task is computationally very demanding, particularly in the feature extraction stage, whose output is decoded against corresponding Hidden Markov Models (HMMs) in the back-end of the WUW-SR system. The state-of-the-art WUW-SR system is based on three different sets of features: Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding coefficients (LPC), and Enhanced Mel-Frequency Cepstral Coefficients (ENH_MFCC). In "Front-End of Wake-Up-Word Speech Recognition System Design on FPGA" [1], we presented an experimental FPGA design and implementation of a novel architecture for a real-time spectrogram extraction processor that generates MFCC, LPC, and ENH_MFCC spectrograms simultaneously. In this paper, we present the details of converting these three sets of spectrograms to their equivalent features. In the WUW-SR system, the recognizer's front-end is located at the terminal, which is typically connected over a data network to a remote back-end recognizer (e.g., a server). The WUW-SR system is shown in Figure 1. The three sets of speech features are extracted at the front-end, then compressed and transmitted to the server over a dedicated channel, where they are subsequently decoded.
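To make the front-end pipeline concrete, the following is a minimal software sketch of the standard MFCC computation chain (framing, windowing, power spectrum, mel filterbank, log, DCT-II), not the paper's FPGA implementation; all parameter values (sample rate, frame length, filter and coefficient counts) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_filters=26, n_ceps=12):
    # Frame the signal with overlap and apply a Hamming window
    frames = [signal[s:s + frame_len]
              for s in range(0, len(signal) - frame_len + 1, hop)]
    frames = np.array(frames) * np.hamming(frame_len)
    # Power spectrum of each frame
    n_fft = 256
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel filterbank energies
    fb = mel_filterbank(n_filters, n_fft, sr)
    energies = np.log(np.maximum(power @ fb.T, 1e-10))
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return energies @ basis.T
```

In a distributed WUW-SR deployment such as the one described above, the matrix returned by `mfcc` (one row of cepstral coefficients per frame) is what would be compressed and transmitted to the back-end server for HMM decoding.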

Share and Cite:

V. Këpuska, M. Eljhani and B. Hight, "Wake-Up-Word Feature Extraction on FPGA," World Journal of Engineering and Technology, Vol. 2 No. 1, 2014, pp. 1-12. doi: 10.4236/wjet.2014.21001.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Këpuska, V.Z., Eljhani, M.M. and Hight, B.H. (2013) Front-end of wake-up-word speech recognition system design on FPGA. Journal of Telecommunications System & Management, 2, 108.
[2] Këpuska, V.Z. and Klein, T.B. (2009) A novel wake-up-word speech recognition system, wake-up-word recognition task, technology and evaluation. Nonlinear Analysis: Theory, Methods & Applications, 71, e2772-e2789.
[3] Tuzun, O.B., Demirekler, M. and Bora, K. (1994) Comparison of parametric and non-parametric representations of speech for recognition. Proceedings of the 7th Mediterranean Electrotechnical Conference, Antalya, 12-14 April 1994, 65-68.
[4] Openshaw, J.P., Sun, Z.P. and Mason, J.S. (1993) A comparison of composite features under degraded speech in speaker recognition. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2, 371-374.
http://dx.doi.org/10.1109/ICASSP.1993.319316
[5] Vergin, R., O'Shaughnessy, D. and Gupta, V. (1996) Compensated mel frequency cepstrum coefficients. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Atlanta, 7-10 May 1996, 323-326.
[6] Davis, S. and Mermelstein, P. (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28, 357-366.
http://dx.doi.org/10.1109/TASSP.1980.1163420
[7] Combrinck, H. and Botha, E. (1996) On the mel-scaled cepstrum.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.1382&rep=rep1&type=pdf
[8] Schroeder, M.R. (1982) Linear prediction, extremal entropy and prior information in speech signal analysis and synthesis. Speech Communication, 1, 9-20.
http://dx.doi.org/10.1016/0167-6393(82)90004-8
[9] Paliwal, K.K. and Kleijn, W.B. (1995) Quantization of LPC parameters. In: Speech Coding and Synthesis, Elsevier Science Publishers, Amsterdam, 433-466.

Copyright © 2023 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.