Wake-Up-Word Speech Recognition (WUW-SR) is a computationally very demanding task, particularly the feature extraction stage, whose output is decoded with corresponding Hidden Markov Models (HMMs) in the back-end stage of the WUW-SR system. The state-of-the-art WUW-SR system is based on three different sets of features: Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding Coefficients (LPC), and Enhanced Mel-Frequency Cepstral Coefficients (ENH_MFCC). In "Front-End of Wake-Up-Word Speech Recognition System Design on FPGA" [1], we presented an experimental FPGA design and implementation of a novel architecture of a real-time spectrogram extraction processor that generates MFCC, LPC, and ENH_MFCC spectrograms simultaneously. In this paper, the details of converting the three sets of spectrograms, 1) Mel-Frequency Cepstral Coefficients (MFCC), 2) Linear Predictive Coding Coefficients (LPC), and 3) Enhanced Mel-Frequency Cepstral Coefficients (ENH_MFCC), to their equivalent features are presented. In the WUW-SR system, the recognizer's front-end is located at the terminal, which is typically connected over a data network to a remote back-end recognizer (e.g., a server). The WUW-SR system is shown in Figure 1. The three sets of speech features are extracted at the front-end. These extracted features are then compressed and transmitted to the server via a dedicated channel, where they are subsequently decoded.

In general, an automatic speech recognition system needs to be activated manually (push-to-talk), which requires hand movement and hence a mixed multi-modal interface. However, for people who use hands-busy applications, hand movement may be limited or impractical. This research work presents an alternative solution: a speech-only interface. The proposed solution is called Wake-Up-Word Speech Recognition (WUW-SR) [

Feature extraction is one of the most important issues in the field of speech recognition. There are two dominant acoustic measurements of the speech signal. One is the parametric modeling approach, which is developed to closely match the resonant structure of the human vocal tract that produces the corresponding speech sound. It is mainly derived from linear predictive analysis, such as the LPC-based cepstrum (LPCC). The other is the nonparametric modeling method, which originates from the human auditory perception system. Mel-frequency cepstral coefficients (MFCCs) are utilized for this purpose [

As shown in

The audio signal processing module accepts raw audio samples and produces spectral representations of short-time signals. The feature-extraction module generates features from this spectral representation, which are decoded with the corresponding hidden Markov models (HMMs). The individual feature scores are classified using support vector machines (SVMs) into in-vocabulary (INV) and out-of-vocabulary (OOV) speech.

As shown in

1) Analog-to-Digital Converter (ADC).

2) DC Filtering.

3) Serial to 32-Bit Parallel Converter.

4) Integer to Floating-Point Converter.

5) Pre-Emphasis Filtering.

6) Window Advance Buffering.

7) Hamming Window.

The six yellow-colored modules represent the Linear Predictive Coding (LPC) stage and are used to generate 12 LPC features.

1) Autocorrelation Linear Predictive Coding.

2) Fast Fourier Transform (FFT).

3) LPC Spectrogram.

4) Mel-Scale Filtering.

5) Discrete Cosine Transform (DCT).

6) LPC Feature.

The five brown-colored modules represent the MFCC stage and are used to generate 12 MFCC features.

1) Fast Fourier Transform (FFT).

2) MFCC Spectrogram.

3) Mel-Scale Filtering.

4) Discrete Cosine Transform (DCT).

5) MFCC Feature.

The five green-colored modules represent the ENH-MFCC stage and are used to generate 12 enhanced MFCC features.

1) Enhanced Spectrum (ENH).

2) Enhanced MFCC Spectrogram.

3) Mel-Scale Filtering.

4) Discrete Cosine Transform (DCT).

5) ENH-MFCC Feature.

Feature extraction involves identifying the formants in the speech, which represent the frequency locations of energy concentrations in the speaker's vocal tract. Many different approaches are used: Mel-scale Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Reflection Coefficients (RC). Among these, MFCC has been found to be more robust in the presence of background noise than the other algorithms [

Intensive efforts have been made to achieve a high-performance front-end. Converting a speech waveform into a form suitable for processing by the decoder requires several stages, as shown in

1) Filtration: The waveform is passed through a low-pass filter, typically 4 kHz to 8 kHz. The telephone system's bandwidth of around 4 kHz is evidence that this is sufficient for comprehension, and it is used as the minimum bandwidth required for telephony transmission.

2) Analog-to-Digital Conversion: The process of digitizing and quantizing an analog speech waveform begins at this stage. Recall that the first step in processing speech is to convert the analog representations (first air pressure, and then analog electric signals from a microphone) into a digital signal.

3) Sampling Rate: The resulting waveform is sampled. Sampling theory requires a sampling (Nyquist) rate of at least double the maximum frequency (so 8 to 16 kHz, as appropriate). A sampling rate of 8 kHz was used in our front-end (we used a CODEC chip to perform the first, second, and third stages).

4) Serial-to-Parallel Converter: This module receives the serial digital signal from the CODEC and converts it to 32-bit parallel words.

5) Integer to Floating-Point Converter: This module converts 32-bit signed integer data to single-precision (32-bit) floating-point values. The input data is routed through the int_2_float megafunction core named ALTFP_CONVERT.

6) Pre-Emphasis: The digitized speech signal s(n) is put through a first-order digital filter to spectrally flatten the signal and to make it less susceptible to finite-precision effects later in the signal processing. The filter is represented by:

H(z) = 1 − αz^(−1)

where we have chosen the value of PRE_EMPH_FACTOR (α) as 0.975.

7) Window Buffering: A 32-bit, 256-deep dual-port RAM (DPRAM) stores 256 input samples. A state machine handles moving audio data into the RAM and pulling data (40 samples at a time) out of the RAM to be multiplied by the Hamming coefficients, which are stored in a ROM.

8) Windowing: The Hamming window function smooths the input audio data with a Hamming curve prior to the FFT. This stage slices the input signal into discrete time segments, using a window that is typically 25 ms wide (200 samples). A Hamming window size of 25 ms (200 samples at an 8 kHz sampling frequency) with a 5 ms frame shift (40 samples) was picked for our front-end windowing.
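The ROM coefficients above can be generated from the standard Hamming formula; a brief C++ sketch (illustrative, not the RTL) is:

```cpp
#include <cmath>
#include <vector>

// Hamming window coefficients: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1)).
// In this front-end, N = 200 (25 ms at 8 kHz) with a 40-sample (5 ms) frame shift.
std::vector<float> hamming(int N) {
    const float PI = std::acos(-1.0f);
    std::vector<float> w(N);
    for (int n = 0; n < N; ++n)
        w[n] = 0.54f - 0.46f * std::cos(2.0f * PI * n / (N - 1));
    return w;
}
```

Each 200-sample frame is multiplied element-wise by these coefficients before being fed to the FFT.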

9) Fast Fourier Transform: In order to map the audio data from the time domain to the frequency domain, the Altera IP megafunction FFT module is used. The module is configured to produce a 256-point FFT. This function can take streaming data input in natural order and output the transformed data in natural order, with a maximum latency of 256 clock cycles once all 256 data samples have been received.
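For reference, the transform the core computes is the standard DFT; a naive O(N²) C++ version (shown only to define the math, not as a substitute for the megafunction) is:

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive O(N^2) DFT: X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N).
// The hardware uses Altera's 256-point FFT megafunction to compute this efficiently.
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const double PI = std::acos(-1.0);
    const size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::exp(std::complex<double>(0.0, -2.0 * PI * k * n / N));
    return X;
}
```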

10) Spectrogram: This module takes the complex data generated by the FFT and computes 20*log10(sqrt(fft_real^2 + fft_imag^2)), the log magnitude of the spectrum. We designed the spectrogram to show how the spectral density of a signal varies with time, and we used the spectrogram module to identify phonetic sounds. Digitally sampled time-domain data are broken up into chunks, which usually overlap, and Fourier transformed to calculate the magnitude of the frequency spectrum for each chunk. Each chunk then corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time. The spectra are then "laid side by side" to form the image surface.
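One spectrogram column can be sketched in C++ as follows (illustrative software model; the identity 10*log10(re² + im²) = 20*log10(|X(k)|) avoids an explicit square root):

```cpp
#include <cmath>
#include <vector>

// Log-magnitude spectrum in dB from the FFT's real/imaginary outputs.
// 10*log10(re^2 + im^2) equals 20*log10(|X(k)|); a tiny epsilon avoids log10(0).
std::vector<float> log_magnitude(const std::vector<float>& re,
                                 const std::vector<float>& im) {
    std::vector<float> db(re.size());
    for (size_t k = 0; k < re.size(); ++k) {
        float power = re[k] * re[k] + im[k] * im[k];
        db[k] = 10.0f * std::log10(power + 1e-12f);
    }
    return db;
}
```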

11) Mel-Scale Filtering: While the FFT spectrum contains information at each frequency on a linear scale, human hearing is less sensitive to frequencies above 1000 Hz. This also has a direct effect on the performance of ASR systems; therefore, the spectrum is warped using the logarithmic Mel scale. To create this effect on the FFT spectrum, a bank of filters is constructed, with filters distributed equally below 1000 Hz and spaced logarithmically above 1000 Hz.
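The filter placement follows the common Mel formula m = 2595·log10(1 + f/700); a C++ sketch of the center-frequency computation (filter count and band edges here are illustrative) is:

```cpp
#include <cmath>
#include <vector>

// Mel scale conversions: m = 2595 * log10(1 + f/700), and its inverse.
float hz_to_mel(float f) { return 2595.0f * std::log10(1.0f + f / 700.0f); }
float mel_to_hz(float m) { return 700.0f * (std::pow(10.0f, m / 2595.0f) - 1.0f); }

// Center frequencies of n_filt triangular filters spaced evenly on the mel scale
// between f_lo and f_hi (e.g., 0 Hz to 4000 Hz for an 8 kHz sampling rate).
std::vector<float> mel_centers(int n_filt, float f_lo, float f_hi) {
    std::vector<float> c(n_filt);
    float m_lo = hz_to_mel(f_lo), m_hi = hz_to_mel(f_hi);
    for (int i = 0; i < n_filt; ++i)
        c[i] = mel_to_hz(m_lo + (m_hi - m_lo) * (i + 1) / (n_filt + 1));
    return c;
}
```

Even spacing on the mel axis maps to nearly linear spacing below 1000 Hz and logarithmic spacing above it, matching the description above.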

12) Discrete Cosine Transform: The DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but it uses only real numbers. DCTs are equivalent to DFTs of roughly twice the length operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even). A DCT expresses a sequence of data points as a sum of cosine functions oscillating at various frequencies. Performing a DCT on the Mel-scale output is motivated by the extraction of the speech signal's frequency-domain characteristics. The DCT module reduces the speech signal's redundant information and regularizes the speech signal into feature coefficients of minimal dimension.
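A minimal C++ sketch of this step, assuming the common DCT-II form used for cepstral analysis (the filterbank size of 23 in the test is an assumption, not taken from the paper):

```cpp
#include <cmath>
#include <vector>

// DCT-II of the log mel-filterbank energies, keeping coefficients k = 1..n_ceps:
// c[k] = sum_{m=0}^{M-1} logE[m] * cos(pi * k * (m + 0.5) / M).
std::vector<float> dct_cepstra(const std::vector<float>& log_mel, int n_ceps = 12) {
    const float PI = std::acos(-1.0f);
    const int M = static_cast<int>(log_mel.size());
    std::vector<float> c(n_ceps, 0.0f);
    for (int k = 1; k <= n_ceps; ++k)
        for (int m = 0; m < M; ++m)
            c[k - 1] += log_mel[m] * std::cos(PI * k * (m + 0.5f) / M);
    return c;
}
```

Keeping only the first 12 coefficients is what compacts each frame into the 12-feature vectors used throughout this front-end.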

As shown in

This module gets windowed data from the window module and represents the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model. We use this method to encode good-quality speech and to provide an estimate of the speech parameters.

The goal of this method is to calculate the prediction coefficients a_k for each frame. The order of the LPC, which is the number of coefficients p, determines how closely the prediction coefficients can approximate the original spectrum. As the order increases, the accuracy of the LPC also increases, meaning the distortion decreases. The main advantage of LPC is usually attributed to the all-pole characteristics of vowel spectra. In addition, the ear is more sensitive to spectral poles than to zeros (M. R. Schroeder) [

The spectrum enhancement module is used to generate the ENH-MFCC set of features. We have implemented this module as shown in the

In "Front-End of Wake-Up-Word Speech Recognition System Design on FPGA" [

1) As shown in Figures 6 and 7, the MFCC, LPC, and ENH-MFCC spectrograms generated from hardware and software (C++) are identical.

2) In Figures 8 and 9, the MFCC, LPC, and ENH-MFCC 12-feature histograms generated from hardware and software (C++) are also identical.

3) We regenerated new MFCC, LPC, and ENH-MFCC histograms with 11 features by removing the first feature slice, because the large dynamic range of the first feature would make the remaining output features very small.

As shown in Figures 10 and 11, the MFCC, LPC, and ENH-MFCC 11-feature histograms generated from hardware and software (C++) are identical.

In this study, an efficient hardware architecture and implementation of the front-end of a WUW-SR in FPGA is presented. This front-end is responsible for generating three sets of features: MFCC, LPC, and ENH-MFCC. These features need to be decoded with corresponding HMMs in the back-end stage of the WUW speech recognizer (e.g., a server). The WUW speech recognition presented here is a novel solution. The most important characteristic of a WUW-SR system is that it should guarantee virtually 100% correct rejection of non-WUW (out-of-vocabulary, OOV) words while maintaining a correct acceptance rate of 99% or higher for in-vocabulary (INV) words. This requirement sets WUW-SR apart from other speech recognition systems, because no existing system can guarantee 100% reliability by any measure. The computational complexity and memory requirements of the three feature algorithms were analyzed in detail previously, showing a significant improvement [

The partitioned table look-up method is adopted and modified to suit our case with a very small table memory requirement. The overall performance and area are greatly improved. The proposed front-end is the first hardware system specifically designed for WUW-SR feature extraction based on three different sets of features. To demonstrate its effectiveness, the proposed design has been implemented in Cyclone III FPGA hardware. The custom DSP board developed is a power-efficient, flexible design and can also be used as a general-purpose prototype board.
