Arabic Speech Recognition System Based on MFCC and HMMs

Speech recognition allows the machine to turn the speech signal into text through identification and understanding process. Extract the features, pre-dict the maximum likelihood, and generate the models of the input speech signal are considered the most important steps to configure the Automatic Speech Recognition System (ASR). In this paper, an automatic Arabic speech recognition system was established using MATLAB and 24 Arabic words Consonant-Vowel Consonant-Vowel Consonant-Vowel (CVCVCV) was recorded from 19 Arabic native speakers, each speaker uttering the same word 3 times (total 1368 words). In order to test the system, 39-features were extracted by partitioning the speech signal into frames ~ 0.25 sec shifted by 0.10 sec. in back-end, the statistical models were generated by separated the features into number of states between 4 to 10, each state has 8-gaussian distri-butions. The data has 48 k sample rate and 32-bit depth and saved separately in a wave file format. The system was trained in phonetically rich and ba-lanced Arabic speech words list (10 speakers * 3 times * 24 words, total 720 words) and tested using another word list (24 words * 9 speakers * 3 times *, total 648 words). Using different speakers similar words, the system obtained a very good word recognition accuracy results of 92.92% and a Word Error Rate (WER) of 7.08%.


Introduction
Speech is a way to express ourselves, it's a complex naturally acquired human motor ability [1]. Speech recognition is the capability of a device to receive, identify, and recognize the speech signal [2]. Speech recognition process fundamentally functions as a pipeline that converts the sound into recognized text, as shown in Figure 1. Based on spectral, the input signal is converted into a sequence of training and testing feature vectors saved in unique files. Given all the observations in the training data, Baum-Welch algorithm can learn and generate the HMM models equal to the number of the words to be recognized. In testing process, pattern matching provides likelihoods of a match of all sequences of speech recognition units to the input speech. Decision making generated according to the best path sequence between the models and testing data. Speech recognition system involved in several applications such as: call routing, automatic transcriptions, information searching, data entry, Speech to Text conversion, Text to Speech conversion etc. [3].
Arabic is the native language for over 300 million speakers and considered one of the official languages in many countries around the world. It has a unique set of diacritics that can change the meaning [4]. Arabic ASR received little attention compared to other languages, and research was oblivious to the diacritics in most cases. Omitting diacritics circumscribes the Arabic ASR system's usability for several applications such as voice-enabled translation, text to speech, and speech-to-speech [5].
Feature Extraction is accomplished by changing the waveform speech form to a form of parametric representation with a relatively low data rate for subsequent processing and analysis. Subsequently, the acceptable classification in the training and testing part is derived from the quality features [6]. Therefore, the most popular speech methods, Mel Cepstral frequency coefficients (MFCC) and Hidden Markov Model have been selected and tested in order to provide a high level of reliability and acceptability of the Arabic ASR.

Mel Frequency Cepstral Coefficients (MFCC)
MFCC is a feature widely used in automatic speech and speaker recognition has   .3), and finally taking the logarithm of that spectrum and computing its inverse Fourier transform as shown in Figure 2.

Hidden Markov Model (HMM)
HMM is used to classify the features and generate the correct decision. HMM considered the powerful statistical tool used in speech recognition and speaker identification systems, due to the ability to model non-linearly aligning speech and estimating the model parameters [9]. Gaussian Mixtures also used to model the emission probability distribution function inside each state.
In training process, the observation parameters, transition probability matrix, the prior probabilities, and Gaussian distribution were re-estimated in order to get good parameters at each iteration as shown in Figure 3. As a result, all the previous HMM parameters are used to generate the likelihood scores, which are used to find the best path between the frames in order to recognize the unknown word [10] [11].

Evaluation Process
Given the observation sequence (O) and the model parameters (λ), Forward (α) and Backward (β) algorithms were used to find the probability of the observation sequence given the model ( ) | P O λ [12]. As shown in Figure 4, forward and backward probabilities are added to evaluate the probability that any sequence of states has produced the sequence of observations.

Training Process
Given the observation sequence (O) and the model parameters (λ), Baum-Welch algorithm was used to re-adjust and re-estimate the transition probability matrix and Gaussian mixture parameters (mean and covariance) that best describe the process [13] [14]. Baum welch algorithm also used to learn and encode the characteristics of the observation sequence in order to recognize a similar observation sequence.

Decoding Process
Viterbi algorithm has been used to comparing between the training and the testing data and find the optimal scoring path of state sequence by selecting the high probabilities between the model and the testing data [

Experimental Results
Using the automatic ASR system, several experiments were carried out using 24 (CVCVCV) Arabic isolated words as shown in Table 1. The feature vectors have been extracted for each sound using MFCC algorithm and saved, and the statistical models were generated using Hidden Markov Model classifier to match the data. The performance evaluation of the Arabic ASR system was obtained by finding the maximum word recognition rate.
In this work, (24 words * 3 times) Arabic CVCVCV words, small vocabulary data set are recorded from 19 adult male speakers (total 1368) divided into training and testing files. Table 2 shows the confusion matrix of the average     Table 3 and the chart in Figure 1 summarizes the recognition rate obtained for each state number.

Conclusion
The primary contribution of this work is to design Arabic ASR system and find the performance of the selected Arabic words is successfully verified and examined. For this purpose, 24 CVCVCV Arabic words were recorded from native speakers, all the experiments are conducted, and the recognition results of the ASR system were investigated and evaluated. The system is designed by MATLAB based on MFCC and discrete-observation multivariate HMM. In this work, the best results are achieved when the acoustic signals are extracted using 10 states and modeled by 8 Gaussian mixtures. The best recognition rate reaches 92.92% (51 total error count from 1368 total words count). According to Figure 3, the recognition rate decreased when using more or less than 10 state numbers.