1. Introduction
Speech is a way to express ourselves; it is a complex, naturally acquired human motor ability [1]. Speech recognition is the capability of a device to receive, identify, and recognize a speech signal [2]. The speech recognition process functions fundamentally as a pipeline that converts sound into recognized text, as shown in Figure 1. Based on spectral analysis, the input signal is converted into sequences of training and testing feature vectors, which are saved in separate files. Given all the observations in the training data, the Baum-Welch algorithm learns and generates one HMM for each word to be recognized. In the testing process, pattern matching provides the likelihood of a match between each sequence of speech recognition units and the input speech. The decision is made according to the best path sequence between the models and the testing data. Speech recognition systems are used in several applications, such as call routing, automatic transcription, information searching, data entry, speech-to-text conversion, and text-to-speech conversion [3].
Arabic is the native language of over 300 million speakers and is considered one of the official languages in many countries around the world. It has a unique set of diacritics that can change the meaning of a word [4]. Arabic ASR has received little attention compared to other languages, and in most cases research has ignored the diacritics. Omitting diacritics limits the usability of an Arabic ASR system for several applications, such as voice-enabled translation, text-to-speech, and speech-to-speech [5].
Feature extraction is accomplished by converting the speech waveform to a parametric representation with a relatively low data rate for subsequent processing and analysis. Acceptable classification in the training and testing stages therefore depends on the quality of the extracted features [6]. For this reason, the most popular speech recognition methods, Mel Frequency Cepstral Coefficients (MFCC) and the Hidden Markov Model (HMM), were selected and tested in order to provide a high level of reliability and acceptability for the Arabic ASR system.
2. Mel Frequency Cepstral Coefficients (MFCC)
MFCC is a feature widely used in automatic speech and speaker recognition, and it has been used here to extract spectral features from frame sequences [7] [8]. The Fast Fourier Transform (FFT) is used to transform the signal into the frequency domain using Equation (2.1). After pre-emphasis, blocking, and windowing of the input signal, a 256-point FFT is applied to the speech frames, the power spectrum is converted to a Mel-frequency spectrum using Equations (2.2) and (2.3), and finally the logarithm of that spectrum is taken and its inverse Fourier transform is computed, as shown in Figure 2.
(2.1)
(2.2)
(2.3)
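The chain described above (power spectrum, Mel-frequency warping, logarithm, inverse transform) can be sketched for a single windowed frame as follows. This is a minimal illustration in Python rather than the authors' MATLAB code, and several details are assumptions: the Mel mapping uses the common form 2595·log10(1 + f/700), the final step uses a DCT-II as the usual stand-in for the inverse Fourier transform of the log spectrum, and the 8000 Hz sample rate, 20 filters, and 12 coefficients are illustrative values not taken from the source. All function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Common Mel mapping (assumed form): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame, n_fft=256):
    """Zero-pad the frame to n_fft points and return |X[k]|^2 for k = 0..n_fft/2."""
    x = list(frame) + [0.0] * (n_fft - len(frame))
    spec = []
    for k in range(n_fft // 2 + 1):
        re = sum(x[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft))
        im = -sum(x[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft))
        spec.append(re * re + im * im)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points, expressed as (fractional) FFT bin positions
    edges = [mel_to_hz(i * max_mel / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate=8000, n_fft=256, n_filters=20, n_coeffs=12):
    """Cepstral coefficients of one windowed frame (illustrative defaults)."""
    spec = power_spectrum(frame, n_fft)
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    # Small floor avoids log(0) for silent frames
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]
    # DCT-II of the log filterbank energies, skipping the 0th coefficient
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for c in range(1, n_coeffs + 1)]
```

Calling `mfcc(frame)` on a 200-sample windowed frame returns a list of 12 coefficients under these assumed settings. The direct DFT loop is written for readability; a real system would use an FFT routine.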
3. Hidden Markov Model (HMM)
An HMM is used to classify the features and generate the correct decision. The HMM is considered a powerful statistical tool in speech recognition and speaker identification systems because of its ability to model non-linearly aligned speech and to estimate the model parameters [9]. Gaussian mixtures are also used to model the emission probability distribution function inside each state.
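To make the role of the Gaussian mixtures concrete, the sketch below evaluates the log emission probability of one feature vector in one state. It is written in Python for illustration and assumes diagonal covariance matrices, which the source does not specify; the function names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_gmm_emission(x, weights, means, variances):
    """log b_j(x) = log sum_m c_m * N(x; mu_m, Sigma_m) for one state j."""
    return log_sum_exp([math.log(w) + log_gaussian_diag(x, mu, var)
                        for w, mu, var in zip(weights, means, variances)])
```

With 8 mixture components per state, `weights`, `means`, and `variances` would each hold 8 entries, and the mixture weights would sum to 1.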
In the training process, the observation parameters, the transition probability matrix, the prior probabilities, and the Gaussian distributions were re-estimated at each iteration in order to obtain good parameters, as shown in Figure 3. The resulting HMM parameters are then used to generate the likelihood scores, which are used to find the best path between the frames in order to recognize the unknown word [10] [11].
3.1. Evaluation Process
Given the observation sequence (O) and the model parameters (λ), the Forward (α) and Backward (β) algorithms were used to find the probability of the observation sequence given the model [12]. As shown in Figure 4, the forward and backward probabilities are combined to evaluate the probability that any sequence of states has produced the sequence of observations.
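The evaluation step can be illustrated with a small Python sketch of the forward and backward recursions. It assumes the emission probabilities have already been evaluated into a table `B[t][j]` (the probability of the observation at time t in state j); the function names and the unscaled arithmetic are choices made here for brevity, not details from the source.

```python
def forward(pi, A, B):
    """alpha[t][j] = P(o_1..o_t, state at t is j | model)."""
    n = len(pi)
    alpha = [[pi[j] * B[0][j] for j in range(n)]]
    for t in range(1, len(B)):
        prev = alpha[-1]
        alpha.append([B[t][j] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha

def backward(A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state at t is i, model)."""
    n, T = len(A), len(B)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta

def sequence_probability(alpha, beta, t):
    """P(O | model) = sum_i alpha[t][i] * beta[t][i], identical for every t."""
    return sum(a * b for a, b in zip(alpha[t], beta[t]))
```

The product of the forward and backward terms, summed over states, gives the same sequence probability at any time index. For long utterances the unscaled values underflow, so practical implementations work with scaled or logarithmic quantities.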
Figure 2. Mel Frequency Cepstral Coefficients (MFCC) block diagram.
Figure 3. Three-state hidden Markov model.
Figure 4. Recognition rate using different state numbers based on MFCC.
3.2. Training Process
Given the observation sequence (O) and the model parameters (λ), the Baum-Welch algorithm was used to re-adjust and re-estimate the transition probability matrix and the Gaussian mixture parameters (mean and covariance) that best describe the process [13] [14]. The Baum-Welch algorithm was also used to learn and encode the characteristics of the observation sequence in order to recognize similar observation sequences.
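One re-estimation pass can be sketched as follows. To keep the example short, it uses discrete observation symbols with an emission table `E[j][k]` instead of the Gaussian mixture mean and covariance updates described above, and it trains on a single sequence; both are simplifications made here, and the function name is hypothetical.

```python
def baum_welch_step(pi, A, E, obs):
    """One Baum-Welch iteration for a discrete-emission HMM.

    pi[i]   : initial state probabilities
    A[i][j] : transition probabilities
    E[j][k] : probability of symbol k in state j
    obs     : list of symbol indices
    Returns (new_pi, new_A, new_E, likelihood of obs under the old model).
    """
    n, T = len(pi), len(obs)
    # Forward pass
    alpha = [[pi[i] * E[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([E[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(n))
                      for j in range(n)])
    # Backward pass
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * E[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of moving from state i to state j at time t
    xi = [[[alpha[t][i] * A[i][j] * E[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_E = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(len(E[0]))] for j in range(n)]
    return new_pi, new_A, new_E, likelihood
```

Repeating the step until the likelihood stops improving gives the trained model; each pass cannot lower the likelihood of the training sequence.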
3.3. Decoding Process
The Viterbi algorithm was used to compare the training and testing data and to find the optimal scoring path of the state sequence by selecting the highest probabilities between the model and the testing data [15] [16]. The maximal probability of the state sequences is defined by Equation (3.1), and the optimal scoring path of the state sequence is calculated using the following MATLAB function.
start_recognition('testing_list.mat', dim)
(3.1)
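The decoding step can be illustrated with a short Python sketch of the Viterbi recursion in the log domain, followed by a word decision that picks the model with the highest score. This is not the authors' `start_recognition` function; the names `viterbi` and `recognize`, the data layout, and the assumption that all probabilities are strictly positive (so their logarithms exist) are choices made here for illustration.

```python
import math

def viterbi(log_pi, log_A, log_B):
    """Return (best log score, best state path); log_B[t][j] = log b_j(o_t)."""
    n = len(log_pi)
    delta = [log_pi[j] + log_B[0][j] for j in range(n)]
    backptr = []
    for t in range(1, len(log_B)):
        step, new_delta = [], []
        for j in range(n):
            # Best predecessor state for landing in state j at time t
            best_i = max(range(n), key=lambda i: delta[i] + log_A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] + log_A[best_i][j] + log_B[t][j])
        backptr.append(step)
        delta = new_delta
    last = max(range(n), key=lambda j: delta[j])
    path = [last]
    for step in reversed(backptr):
        path.insert(0, step[path[0]])   # follow back-pointers to the start
    return delta[last], path

def recognize(models, log_B_by_word):
    """Pick the word whose model yields the highest Viterbi score.

    models[word]        : (log_pi, log_A) for that word's HMM
    log_B_by_word[word] : log emission table of the test utterance under that HMM
    """
    scores = {word: viterbi(log_pi, log_A, log_B_by_word[word])[0]
              for word, (log_pi, log_A) in models.items()}
    return max(scores, key=scores.get)
```

Working with log probabilities turns the products along a path into sums, which avoids numerical underflow on long frame sequences.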
4. Experimental Results
Using the ASR system, several experiments were carried out on 24 isolated Arabic words with a CVCVCV structure, as shown in Table 1. The feature vectors were extracted for each sound using the MFCC algorithm and saved, and the statistical models were generated using the Hidden Markov Model classifier to match the data. The performance of the Arabic ASR system was evaluated by finding the maximum word recognition rate.
In this work, a small-vocabulary data set of 24 Arabic CVCVCV words, each repeated 3 times, was recorded from 19 adult male speakers (1368 recordings in total) and divided into training and testing files. Table 2 shows the confusion matrix of the average classification results, which were obtained using suitable features in the training and testing sessions. Each experiment was conducted with 4, 5, 6, 7, 8, 9, or 10 states, and each word was modeled using a Hidden Markov Model with 8 multi-dimensional Gaussians.
Table 2. Recognition rate using different state numbers based on MFCC.
Table 3. Recognition rate summary based on MFCC.
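As a small worked illustration of how a confusion matrix yields a recognition rate, the Python sketch below computes the overall and per-word rates. The 3-word matrix in the usage note is hypothetical and is not taken from Table 2; the function names are likewise illustrative.

```python
def recognition_rate(confusion):
    """Overall rate in percent; confusion[i][j] counts word i recognized as word j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * correct / total

def per_word_rates(confusion):
    """Recognition rate in percent for each spoken word (one value per row)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(confusion)]

def dataset_size(n_words, n_repetitions, n_speakers):
    """Total number of recordings when every speaker repeats every word."""
    return n_words * n_repetitions * n_speakers
```

For a hypothetical matrix `[[9, 1, 0], [0, 10, 0], [2, 0, 8]]`, 27 of 30 utterances lie on the diagonal, so the overall rate is 90%. `dataset_size(24, 3, 19)` reproduces the 1368 recordings described above.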
During the experiments, the speech signal was pre-emphasized with a factor of 0.975 and segmented using a 25-millisecond Hamming window with a 10-millisecond overlap. A 256-point Fast Fourier Transform (FFT) was applied to transform each 200-sample frame of speech from the time domain to the frequency domain. The resulting confidence intervals for the recognition rate obtained in the decoding process are summarized in Table 3, and the chart in Figure 1 summarizes the recognition rate obtained for each state number.
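The front-end settings above can be sketched in Python as follows. Two points are assumptions rather than facts from the source: a 200-sample frame lasting 25 ms implies a sampling rate of 8000 Hz, and the 10-millisecond figure is interpreted here as the frame shift (80 samples), which is one possible reading of the original wording. The function names are hypothetical.

```python
import math

def pre_emphasize(signal, coeff=0.975):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through unchanged."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_frames(signal, frame_len=200, hop=80):
    """Split the signal into overlapping frames and apply a Hamming window.

    frame_len=200 and hop=80 correspond to 25 ms and 10 ms at an assumed 8000 Hz.
    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]
```

Each 200-sample frame would then be zero-padded to 256 points before the FFT is applied.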
5. Conclusion
The primary contribution of this work is the design of an Arabic ASR system and the successful verification and examination of its performance on the selected Arabic words. For this purpose, 24 CVCVCV Arabic words were recorded from native speakers, all the experiments were conducted, and the recognition results of the ASR system were investigated and evaluated. The system was designed in MATLAB based on MFCC and a discrete-observation multivariate HMM. In this work, the best results were achieved when the acoustic signals were modeled using 10 states and 8 Gaussian mixtures. The best recognition rate reached 92.92% (a total error count of 51 from a total word count of 1368). According to Figure 3, the recognition rate decreased when more or fewer than 10 states were used.