Phoneme Sequence Modeling in the Context of Speech Signal Recognition in the “Baoule” Language

This paper presents the recognition of spoken sentences in “Baoule”, a language of Côte d’Ivoire. Several formalisms allow the modelling of an automatic speech recognition system; the one we used to build our system is based on discrete Hidden Markov Models (HMMs). Our goal in this article is to present a system for the recognition of isolated Baoule words. We present the three classical HMM problems and develop algorithms able to solve them, then run these algorithms on concrete examples.


Introduction
Speech recognition by machine has long been a research topic that fascinates the public and remains a challenge for specialists, and it has continued to be at the heart of much research. The progress of new information and communication technologies has helped accelerate this research. In our first article, we presented a method for separating the phonemes contained in a speech signal.
In this article we propose to identify a flow of words, often uttered against more or less background noise. This task is made difficult not only by the deformations induced by the use of a microphone, but also by a series of factors inherent in human language: homonyms; local accents; habits of language; speed differences between speakers; the imperfections of a microphone, etc. For the human ear, these factors do not usually represent difficulties. Our brain juggles these deformations of speech by taking into account, almost unconsciously, nonverbal and contextual elements that allow us to eliminate ambiguities. It is only by taking into account these elements, external to the voice itself, that voice recognition software will be able to achieve a high level of reliability. Today, the speech recognition software that works best is all based on a probabilistic approach. The aim of speech recognition is to reconstruct a sequence of words M from a recorded acoustic signal A. In the statistical approach, we consider all the sequences of words M that could match the signal A.
In this set of possible word sequences, we then choose the one, M*, which is the most likely, i.e. which maximizes the probability P(M/A) that M is the correct interpretation of A; we note:

M* = argmax_M P(M/A) = argmax_M P(A/M) P(M) / P(A) = argmax_M P(A/M) P(M),

since P(A) does not depend on M. This equation is the key to the probabilistic approach to speech recognition. Indeed, the first term P(A/M) is the probability of observing the acoustic signal A if the sequence of words M is pronounced: this is a purely acoustic problem. The second term P(M) is the probability that the sequence of words M is actually uttered: this is a linguistic problem. The above equation thus teaches us that we can split the speech recognition problem into two independent parts: we model the acoustic aspects and the language aspects separately. In the literature, we usually speak of orthogonality between the ACOUSTIC MODEL and the LANGUAGE MODEL. The succession of possible words so obtained must then be refined and validated by the word and language models. The acoustic model can take into account the acoustic and phonetic constraints in a sound or group of sounds. For our part, we have chosen the WORD as the decision unit. By also integrating a Markov modeling of the higher levels of language, it becomes possible to achieve a recognition system for phrases pronounced discretely (i.e. word by word).
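The argmax decomposition above can be sketched in a few lines. The Python below is a toy illustration only: the candidate words and both probability values are invented numbers, not taken from any real acoustic or language model.

```python
import math

# Toy illustration of M* = argmax_M P(A/M) * P(M).
# Candidates and scores are invented, purely for illustration.
candidates = {
    "nnou": {"p_acoustic": 0.40, "p_language": 0.30},
    "nnan": {"p_acoustic": 0.35, "p_language": 0.10},
    "nsan": {"p_acoustic": 0.20, "p_language": 0.25},
}

def decode(candidates):
    # Work in log-space, as real recognizers do, to avoid underflow.
    return max(candidates,
               key=lambda m: math.log(candidates[m]["p_acoustic"])
                           + math.log(candidates[m]["p_language"]))

best = decode(candidates)
```

Note that the product P(A/M) P(M) is maximized, not P(A/M) alone: a word that fits the acoustics slightly worse can still win if the language model finds it far more plausible.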

Characteristics of the Speech Signal
Automatic speech recognition is a difficult problem, mainly because of the specific material to interpret: the voice signal. The speech acoustic signal has characteristics that make its interpretation complex.
Redundancy: the acoustic signal carries much more information than necessary, which explains its resistance to noise. Analysis techniques have been developed to extract the relevant information without degrading it too much.
Variability: the acoustic signal is highly variable from one speaker to another (gender, age, etc.) but also for a given speaker (emotional state, fatigue, etc.), which makes speaker-independent recognition very difficult.
Continuity: the acoustic signal is continuous, and the contextual effects of sounds on the elementary units are considerable.

Processing of the Speech Signal
By speech processing we mean the processing of the information contained in the speech signal. The objective is the transmission or recording of this signal, or its synthesis or recognition. Speech processing is now a fundamental component of the engineering sciences. Located at the intersection of digital signal processing and language processing (that is to say, symbolic data processing), this scientific discipline has experienced rapid expansion since the 1960s, linked to the development of telecommunications means and techniques. The special importance of speech processing in this broader context is explained by the privileged position of speech as an information vector in our human society.

The Acoustic Model
The ACOUSTIC MODEL (Figure 1) produces a representation of the signal that is compared with prototypes stored in computer memory, both in a standard dictionary and in a dictionary specific to the speaker. The latter is constructed initially through dictation sessions of standard texts that the speaker must carry out before using the software effectively, and it is regularly enriched by self-learning while the software is used. It is interesting to note that the voiceprint thus constituted is relatively stable for a given speaker and little influenced by external factors such as stress, colds, etc. (Figure 3).

The Language Model
It is generally divided into two parts linked to language: a syntactic part and a semantic part. Once the ACOUSTIC MODEL has identified the "heard" phonemes as well as possible, we must still look for the most likely message M corresponding to them, that is to say, maximize the probability P(M) defined above. This is the role of the syntactic and semantic models (see Figure 4). The SYNTACTIC MODEL relies on a standard dictionary and grammar (the "Baoule" language has one), as well as on a dictionary and a grammar specific to the speaker; these reflect the "habits" of the speaker and are continuously enriched.
Then the SEMANTIC MODEL seeks to optimize the identification of the message by analyzing the context of the words, basing itself both on the common semantics of the language and on the speaker's own semantics (a style). This modeling is usually built from the analysis of sequences of words in a large textual corpus. The speaker-specific semantics is enriched as the software is used; most software packages also allow the analysis to be enriched with texts that reflect the stylistic habits of the speaker. These two modules work together, and it is easy to conceive that there is a feedback between them.
Initially, the dictionaries associated with these two modules were based on fixed-syntax language models, that is to say, modeled on a grammar defined by a rigid set of rules (this is not the case in most African languages, including the "Baoule" language).
Voice recognition software then evolved towards the use of local probabilistic models: recognition is no longer performed on one word at a time but on a series of words, called an n-gram, where n is the number of words in the sequence. The statistics of these models are obtained from standard texts and may be enriched gradually. See Figure 5 below.
Here too, Hidden Markov Models are the ones currently used to describe the probabilistic aspects. The most advanced software tends to combine the advantages of statistical models and fixed-syntax models in what are called "probabilistic grammars", the idea being to derive from fixed grammars probabilities that can be combined with those of a probabilistic model. In recent approaches, it becomes difficult to distinguish the syntactic model from the semantic model, and we rather speak of a single language model.

Fundamentals
Hidden Markov Models (HMM) were introduced by Baum and his collaborators in the 1960s and 1970s [1]. This model is closely related to Probabilistic Automata (PAs) [2]. A probabilistic automaton is defined by a structure composed of states and transitions, and by a set of probability distributions over the transitions. Each transition is associated with a symbol of a finite alphabet; this symbol is generated every time the transition is taken.
An HMM is also defined by a structure consisting of states and transitions and by a set of probability distributions over the transitions. The essential difference with PAs is that symbol generation is performed on the states, and not on the transitions. In addition, each state is associated not with a single symbol, but with a probability distribution over the symbols of the alphabet.

Figure 5. Semantic model.
HMMs are used to model observation sequences. These observations may be discrete (e.g., characters from a finite alphabet) or continuous (the frequency of a signal, a temperature, etc.). The first area in which HMMs were applied was speech processing, in the early 1970s [3] [4]. In this area, the HMM rapidly became the reference model, and most of the techniques for using and implementing HMMs were developed in the context of these applications. These techniques were then successfully applied and adapted to the recognition of handwritten texts [5] [6] and to the analysis of biological sequences [7] [8]. The theorems, notation and propositions that follow are largely drawn from [9].
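The distinction between PAs and HMMs (a per-state emission distribution rather than a symbol per transition) can be made concrete by sampling from a discrete HMM. This Python sketch uses a two-state model with invented numbers, purely for illustration:

```python
import random

# Minimal discrete HMM: each state carries a distribution over the
# alphabet {"a", "b"} (unlike a probabilistic automaton, where a single
# symbol is attached to each transition). All numbers are illustrative.
states = ["s0", "s1"]
init = {"s0": 0.6, "s1": 0.4}                 # initial law (nu)
trans = {"s0": {"s0": 0.7, "s1": 0.3},        # transition matrix (pi)
         "s1": {"s0": 0.4, "s1": 0.6}}
emit = {"s0": {"a": 0.9, "b": 0.1},           # emission distributions
        "s1": {"a": 0.2, "b": 0.8}}

def draw(dist, rng):
    # Sample a key of `dist` according to its probabilities.
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k

def sample(n, seed=0):
    # Generate n hidden states and n observed symbols.
    rng = random.Random(seed)
    x = draw(init, rng)
    xs, ys = [x], [draw(emit[x], rng)]
    for _ in range(n - 1):
        x = draw(trans[x], rng)
        xs.append(x)
        ys.append(draw(emit[x], rng))
    return xs, ys
```

Only the symbol sequence `ys` would be visible to an observer; the state sequence `xs` is hidden, which is exactly what the recognition algorithms below must work around.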

Characteristics of HMM
A Markov chain {X_k} with values in a finite state space E satisfies

P(X_k = i_k | X_0 = i_0, …, X_{k−1} = i_{k−1}) = P(X_k = i_k | X_{k−1} = i_{k−1}) = π_{i_{k−1}, i_k}

for any time k and any sequence i_0, …, i_k ∈ E. Note that this notion generalizes the notion of a deterministic dynamical system (finite state machine, recurrent sequence, or ordinary differential equation): the probability distribution of the present state X_k depends only on the immediate past state. The chain is entirely characterized by the data of
• the initial law ν, with ν_i = P(X_0 = i), and
• the transition matrix π = (π_{i,j}), assumed independent of the time k (homogeneous Markov chain).
Knowing the transition probabilities between two successive times is enough to globally characterize a Markov chain.

Proposal
Let ν be a probability distribution on E and π a Markov matrix on E. The probability distribution of the Markov chain {X_k} with initial law ν and transition matrix π is given by

P(X_0 = i_0, …, X_k = i_k) = ν_{i_0} π_{i_0,i_1} ⋯ π_{i_{k−1},i_k}

for any time k and any sequence i_0, …, i_k ∈ E.

A hidden Markov model adds observations {Y_k} with values in a finite space O (symbolic case) or in R^d (numerical case), collected through a memoryless channel; that is to say, conditionally on {X_k}, the observations are independent, and the conditional law of Y_k depends only on the current state X_k, with emission density g_i(y) when X_k = i. Thus purely local data (the transition probabilities between two successive times, and the emission densities at each time) comprehensively characterize a hidden Markov model. For example, for k = 3:

P(X_0 = i_0, …, X_3 = i_3, Y_0 ∈ dy_0, …, Y_3 ∈ dy_3) = ν_{i_0} g_{i_0}(y_0) π_{i_0,i_1} g_{i_1}(y_1) π_{i_1,i_2} g_{i_2}(y_2) π_{i_2,i_3} g_{i_3}(y_3) dy_0 ⋯ dy_3

for any sequence i_0, …, i_3 ∈ E.

Equations Forward/Backward Baum
We first present a basic (but inefficient) method to compute the probability distribution of the observations (y_0, …, y_n).

Proposal: The probability distribution of the observations is given by

P(Y_0 = y_0, …, Y_n = y_n) = Σ_{i_0, …, i_n ∈ E} ν_{i_0} g_{i_0}(y_0) π_{i_0,i_1} g_{i_1}(y_1) ⋯ π_{i_{n−1},i_n} g_{i_n}(y_n).

This elementary method also provides a first expression for the conditional probability distribution of the sequence of states given the observations, and for the likelihood of the model. Note, however, that the number of operations required by this basic method is significant: for each possible path (i_0, …, i_n) of the Markov chain one must compute a product of 2n + 1 factors, and there are |E|^{n+1} different possible paths, so the total number of elementary operations (additions and multiplications) is of the order of 2n |E|^{n+1}, a number growing exponentially with the number n of observations.

We therefore define the forward variable p_k (seen as a row vector) by

p_k(i) = P(X_k = i, Y_0 = y_0, …, Y_k = y_k) for all i ∈ E.

The forward variable is used to compute the conditional probability distribution of the present state X_k given the observations (y_0, …, y_k); in this sense, p_k is a non-normalized probability distribution, and the normalization constant Σ_{i∈E} p_n(i) is interpreted as the likelihood of the model given the observations (y_0, …, y_n).

Theorem: The sequence {p_k} satisfies the forward recurrence

p_k(j) = [Σ_{i∈E} p_{k−1}(i) π_{i,j}] g_j(y_k) for all j ∈ E,

with the initial condition p_0(i) = ν_i g_i(y_0) for all i ∈ E.

This component-by-component statement can also be formulated for the forward variable seen as a row vector. The recursive computation of p_n involves only matrix/vector products, so the probability distribution of the observations is computed much more efficiently: one computes p_n(i) for all i ∈ E, then deduces the normalization constant (the likelihood) and the normalized version of the conditional distribution (the filter). From a numerical point of view, it is more efficient to propagate directly the log-likelihood and the filter.
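The normalized forward recursion can be sketched compactly. The Python below is an illustration only (the paper's own implementation is in MATLAB); it reuses an invented two-state model with discrete emissions standing in for the densities g:

```python
import math

# Normalized forward recursion (the "filter"):
# p_k(j) = (1/c_k) * sum_i p_{k-1}(i) pi_{i,j} g_j(y_k).
# log L_n is accumulated as the sum of log normalization constants.
# Model numbers are illustrative.
nu = [0.6, 0.4]                         # initial law
pi = [[0.7, 0.3], [0.4, 0.6]]           # transition matrix
g  = [{"a": 0.9, "b": 0.1},             # discrete emission densities
      {"a": 0.2, "b": 0.8}]

def forward(obs):
    p = [nu[i] * g[i][obs[0]] for i in range(len(nu))]
    c = sum(p)
    p = [v / c for v in p]
    log_lik = math.log(c)
    filters = [p]
    for y in obs[1:]:
        p = [sum(p[i] * pi[i][j] for i in range(len(nu))) * g[j][y]
             for j in range(len(nu))]
        c = sum(p)
        p = [v / c for v in p]
        log_lik += math.log(c)
        filters.append(p)
    return filters, log_lik
```

For a length-n sequence this costs on the order of n |E|^2 operations instead of the exponential cost of the path-enumeration method, and propagating the log-likelihood avoids the numerical underflow of the raw products.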
Proposal: The normalized sequence {p̄_k} (the filter) satisfies the recurrence

p̄_k(j) = (1 / c_k) [Σ_{i∈E} p̄_{k−1}(i) π_{i,j}] g_j(y_k) for all j ∈ E,

where the normalization constants c_k are defined so that Σ_{j∈E} p̄_k(j) = 1. This component-by-component statement may also be formulated for the normalized forward variable seen as a row vector, and the log-likelihood is recovered as log L_n = Σ_{k=0}^{n} log c_k.

For any intermediate time k, less than the final instant n, we now wish to compute the conditional probability distribution of the state X_k knowing all the observations (y_0, …, y_n). Note that fixing the state at time k allows a break between the past, up to time k, and the future, from time k + 1 onwards. We define the backward variable v_k (seen as a column vector) by

v_k(i) = P(Y_{k+1} = y_{k+1}, …, Y_n = y_n | X_k = i) for any i ∈ E, and in particular v_n(i) = 1 for all i ∈ E.

The backward variable v_k(i) can thus be interpreted as the likelihood, given the future observations (y_{k+1}, …, y_n), of the model started from the state X_k = i at time k.

Theorem: The sequence {v_k} satisfies the backward recurrence

v_k(i) = Σ_{j∈E} π_{i,j} g_j(y_{k+1}) v_{k+1}(j) for all i ∈ E,

with the initial condition v_n(i) = 1. This component-by-component statement can also be formulated for the backward variable seen as a column vector.

Proposal: The forward and backward equations are dual to one another. The conditional probability distribution of the transition (X_k, X_{k+1}) given the observations up to the final instant is given by

P(X_k = i, X_{k+1} = j | Y_0 = y_0, …, Y_n = y_n) = p_k(i) π_{i,j} g_j(y_{k+1}) v_{k+1}(j) / L_n,

where the likelihood L_n = Σ_{i∈E} p_n(i) does not depend on the time in question. By summing over all j ∈ E and using the backward equation, or by summing over all i ∈ E and using the forward equation, we obtain the following result in terms of the component-by-component product of the forward and backward variables.

Corollary: The conditional probability distribution of the state X_k knowing all the observations (y_0, …, y_n) is given by

P(X_k = i | Y_0 = y_0, …, Y_n = y_n) = p_k(i) v_k(i) / L_n for all i ∈ E.

From a numerical point of view, it is again more efficient to propagate the log-likelihood and normalized variables. We define the normalized backward variable v̄_k at any time k, for any i ∈ E, using the same constants as before.

Proposal: The sequence {v̄_k} satisfies the following retrograde recurrence:

v̄_k(i) = (1 / c_{k+1}) Σ_{j∈E} π_{i,j} g_j(y_{k+1}) v̄_{k+1}(j) for all i ∈ E,

with the initial condition v̄_n(i) = 1, where the normalization constants are those already defined for the normalization of the forward variable. With this normalization of the backward variable, the conditional probability distribution of the state X_k given the observations, and the conditional distribution of the transition, are expressed directly in terms of the normalized forward and backward variables.
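The forward-backward combination (smoothing) can be sketched as follows. The Python below uses unnormalized variables for clarity on a short toy sequence (a real implementation would use the normalized recursions above); the two-state model is invented, for illustration only:

```python
import math

# Backward recursion and smoothing sketch:
# v_n(i) = 1, v_k(i) = sum_j pi_{i,j} g_j(y_{k+1}) v_{k+1}(j),
# and P(X_k = i | Y_0..n) = p_k(i) v_k(i) / L_n.
nu = [0.6, 0.4]
pi = [[0.7, 0.3], [0.4, 0.6]]
g  = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]

def forward_backward(obs):
    n = len(obs)
    S = range(len(nu))
    # unnormalized forward: p_k(i) = P(X_k = i, Y_0..k = y_0..k)
    p = [[nu[i] * g[i][obs[0]] for i in S]]
    for y in obs[1:]:
        p.append([sum(p[-1][i] * pi[i][j] for i in S) * g[j][y] for j in S])
    # backward: v_k(i) = P(Y_{k+1}..n = y_{k+1}..n | X_k = i)
    v = [[1.0] * len(nu)]
    for y in reversed(obs[1:]):
        v.insert(0, [sum(pi[i][j] * g[j][y] * v[0][j] for j in S) for i in S])
    lik = sum(p[-1][i] for i in S)
    # smoothing: gamma_k(i) = p_k(i) v_k(i) / L_n
    gamma = [[p[k][i] * v[k][i] / lik for i in S] for k in range(n)]
    return gamma, lik
```

Each row of `gamma` is a probability distribution over the states at one time step, now informed by the whole observation sequence rather than only the past.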

Viterbi Algorithm
The forward and backward variables are used to compute the conditional probability distribution of the present state X_n, or of the state X_k at an intermediate moment, given the observations (y_0, …, y_n); the normalization constant does not depend on the time in question, and is interpreted as the likelihood of the model given the observations. However, the sequence of marginal maximum a posteriori estimates

X_k^{MAP} = argmax_{i∈E} P(X_k = i | Y_0 = y_0, …, Y_n = y_n)

can be inconsistent with the model, in the following sense: it can happen that X_k^{MAP} = i and X_{k+1}^{MAP} = j for two successive times, while π_{i,j} = 0 for the same pair (i, j), which means that the transition from state i to state j is simply impossible for the model. For this reason, one rather uses another estimator, called the trajectorial maximum a posteriori estimator, defined by

(i_0*, …, i_n*) = argmax_{i_0, …, i_n ∈ E} P(X_0 = i_0, …, X_n = i_n | Y_0 = y_0, …, Y_n = y_n),

which minimizes the probability of error in estimating the whole sequence of hidden states given the observations. It is of course not possible to perform this maximization in an exhaustive manner, listing all |E|^{n+1} possible trajectories: the efficient computation of this estimator is provided by a dynamic programming algorithm, the Viterbi algorithm.
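The dynamic programming recursion can be sketched in log-space. The Python below is an illustration on the same invented two-state model; it is not the paper's MATLAB code:

```python
import math

# Viterbi sketch: compute the trajectorial MAP estimate
# argmax over (i_0..i_n) of P(X = i | Y = y), in log-space.
# Model numbers are illustrative.
nu = [0.6, 0.4]
pi = [[0.7, 0.3], [0.4, 0.6]]
g  = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]

def viterbi(obs):
    S = range(len(nu))
    # delta[i]: best log-probability of any path ending in state i
    delta = [math.log(nu[i]) + math.log(g[i][obs[0]]) for i in S]
    back = []
    for y in obs[1:]:
        new, arg = [], []
        for j in S:
            scores = [delta[i] + math.log(pi[i][j]) for i in S]
            best = max(S, key=lambda i: scores[i])
            arg.append(best)
            new.append(scores[best] + math.log(g[j][y]))
        delta = new
        back.append(arg)
    # backtrack the optimal trajectory
    path = [max(S, key=lambda i: delta[i])]
    for arg in reversed(back):
        path.insert(0, arg[path[0]])
    return path
```

The cost is of the order of n |E|^2, against |E|^{n+1} trajectories for exhaustive enumeration; by construction the returned path never takes a transition with π_{i,j} = 0 (its log-score would be −∞), unlike the marginal MAP sequence.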

Re-Estimation Formulas Baum-Welch
So far, the focus was on the estimation of a hidden state, or of a sequence of successive hidden states, from a series of observations and for a given model. The goal here is to identify the model, that is to say, to estimate the characteristic parameters of the model from a series of observations; the approach taken is that of maximum likelihood estimation.
In the numerical case, we consider Gaussian emission densities, characterized by the data of a finite family of mean vectors m_i and a finite family of invertible covariance matrices R_i, that is to say:

g_i(y) = (2π)^{−d/2} (det R_i)^{−1/2} exp(−(y − m_i)^T R_i^{−1} (y − m_i) / 2) for all i ∈ E.

The likelihood function L_n of the model is obtained with the basic method above, and we study an iterative algorithm to maximize the likelihood function L_n with respect to the parameters (ν, π, m, R). Given the current model M′, one introduces an intermediate quantity Q_n(M, M′) comparing another model M with the current model M′, which vanishes when the model M coincides with the model M′.

Maximizing Q_n(M, M′) with respect to the parameters (ν, π, m, R) guarantees that the likelihood of the model M that achieves the maximum of Q_n will be greater than the likelihood L_n′ of the current model M′. The Baum-Welch re-estimation formulas explicitly give the parameters of the new model as a function of the parameters of the current model M′. By repeating this procedure, we construct a sequence of models of increasing likelihood, and ideally this sequence converges to a model that reaches the maximum of the likelihood function.

Theorem
In the numerical case with Gaussian emission densities, one step of the iterative algorithm for maximum likelihood estimation of the model parameters from the observations (y_0, …, y_n) is given by the explicit re-estimation formulas

ν_i″ = γ_0(i), π_{i,j}″ = Σ_{k=0}^{n−1} ξ_k(i, j) / Σ_{k=0}^{n−1} γ_k(i),

m_i″ = Σ_{k=0}^{n} γ_k(i) y_k / Σ_{k=0}^{n} γ_k(i), R_i″ = Σ_{k=0}^{n} γ_k(i) (y_k − m_i″)(y_k − m_i″)^T / Σ_{k=0}^{n} γ_k(i),

where γ_k(i) = P(X_k = i | Y_0 = y_0, …, Y_n = y_n) and ξ_k(i, j) = P(X_k = i, X_{k+1} = j | Y_0 = y_0, …, Y_n = y_n) are computed under the current model with the forward and backward variables.
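One Baum-Welch step can be sketched for scalar (dimension d = 1) Gaussian emissions. The Python below is an illustration with an invented two-state model; it uses unnormalized forward/backward variables, which is acceptable only for short toy sequences:

```python
import math

# One Baum-Welch re-estimation pass for a two-state HMM with scalar
# Gaussian emissions. All model numbers are illustrative starting values.

def gauss(y, m, r):
    # 1-D Gaussian density with mean m and variance r.
    return math.exp(-0.5 * (y - m) ** 2 / r) / math.sqrt(2 * math.pi * r)

def baum_welch_step(obs, nu, pi, means, vars_):
    n, S = len(obs), range(len(nu))
    b = [[gauss(y, means[j], vars_[j]) for j in S] for y in obs]
    # forward (unnormalized; fine for short toy sequences)
    p = [[nu[i] * b[0][i] for i in S]]
    for k in range(1, n):
        p.append([sum(p[-1][i] * pi[i][j] for i in S) * b[k][j] for j in S])
    # backward
    v = [[1.0] * len(nu) for _ in range(n)]
    for k in range(n - 2, -1, -1):
        v[k] = [sum(pi[i][j] * b[k + 1][j] * v[k + 1][j] for j in S) for i in S]
    lik = sum(p[-1][i] for i in S)
    gamma = [[p[k][i] * v[k][i] / lik for i in S] for k in range(n)]
    xi = [[[p[k][i] * pi[i][j] * b[k + 1][j] * v[k + 1][j] / lik
            for j in S] for i in S] for k in range(n - 1)]
    # re-estimation formulas
    nu2 = [gamma[0][i] for i in S]
    pi2 = [[sum(xi[k][i][j] for k in range(n - 1)) /
            sum(gamma[k][i] for k in range(n - 1)) for j in S] for i in S]
    w = [sum(gamma[k][i] for k in range(n)) for i in S]
    m2 = [sum(gamma[k][i] * obs[k] for k in range(n)) / w[i] for i in S]
    r2 = [sum(gamma[k][i] * (obs[k] - m2[i]) ** 2 for k in range(n)) / w[i]
          for i in S]
    return nu2, pi2, m2, r2, lik
```

Iterating this step produces models of non-decreasing likelihood, as the theorem above states; the returned `lik` is the likelihood of the input model, so it can be monitored across iterations.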

Implementation
Our model is based on acoustic signal parameters. Isolated-word recognition requires a short pause between each spoken word, whereas continuous speech recognition does not. Speech recognition systems can be classified as speaker-dependent or speaker-independent. A speaker-dependent system recognizes only the voice of a particular speaker, while a speaker-independent system can recognize any voice.
The implementation presented here uses features integrated into MATLAB and related products to develop the recognition algorithm. There are two main steps in the recognition of isolated words: • a learning phase and • a test phase.
The learning phase teaches the system by building its dictionary: an acoustic model for each word that the system has to recognize. In our example, the dictionary includes the numbers "zero" to "nine" in the "Baoule" language. The test phase uses the acoustic models of these numbers to recognize isolated words using a classification algorithm. We start with the speech signal acquisition, and then end with its analysis.

Speech Signal Acquisition
During the learning phase, it is necessary to record repeated statements of each digit in the dictionary. For example, we repeat the word "nnou" (which means five in the "Baoule" language) many times with a pause between each statement. That word is saved in the file 'cinq.wav'. Using MATLAB with a standard PC sound card, we capture ten seconds of speech from a microphone at 8000 samples per second. We obtain y, a matrix of 80,000 rows and one column. This approach works well for training data.
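The MATLAB acquisition code itself is not reproduced here. As a hedged illustration of the resulting data format only (10 seconds of mono audio at 8000 samples per second, saved as 'cinq.wav'), the Python sketch below writes such a file using the standard wave module, with a synthetic tone standing in for the real microphone capture, which needs audio hardware:

```python
import math
import struct
import wave

# Illustrative stand-in for the acquisition step: 10 s of mono audio at
# 8000 samples/s, saved as 'cinq.wav'. A synthetic 440 Hz tone replaces
# the actual microphone input.
FS, SECONDS = 8000, 10
samples = [0.1 * math.sin(2 * math.pi * 440 * k / FS)
           for k in range(FS * SECONDS)]          # 80,000 samples

with wave.open("cinq.wav", "wb") as w:
    w.setnchannels(1)        # mono: one column, in MATLAB terms
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(FS)
    w.writeframes(b"".join(struct.pack("<h", int(s * 32767))
                           for s in samples))
```

The point of the sketch is the bookkeeping: ten seconds at 8000 Hz yields exactly 80,000 samples, which is the length of the training vector y described above.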

Acquired Speech Signal Analysis
We first develop a word-detection algorithm that separates each word from the ambient noise. We then obtain an acoustic model that provides a robust representation of each word at the learning stage. Finally, we select an appropriate classification algorithm for the test phase.

The Development of a Word-Detection Algorithm
The word-detection algorithm continuously reads 160-sample frames from the "speech" data. To detect isolated digits, we use a combination of the signal energy and the zero-crossing count for each speech frame.
The signal energy works well for detecting voiced signals, while the zero-crossing count works well for detecting unvoiced signals. The calculation of these measures is simple using basic MATLAB mathematical and logical operators. To avoid identifying ambient noise as speech, we assume that each spoken word lasts at least 25 milliseconds. In Figure 7 below, we plot the speech signal "five" together with the short-time power and the zero-crossing measurements.
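The two per-frame measures can be sketched as follows. The Python below is an illustration (the paper's implementation is in MATLAB); the synthetic "silence" and "voiced" frames are invented to show the expected contrast:

```python
import math

# Short-time energy and zero-crossing count per 160-sample frame
# (20 ms at 8000 samples/s), the two measures used by word detection.

FRAME = 160

def frame_features(signal):
    feats = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        frame = signal[start:start + FRAME]
        energy = sum(s * s for s in frame)                       # energy
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        feats.append((energy, zcr))
    return feats

# Synthetic check signal: a frame of low-level, rapidly alternating
# "noise" followed by a frame of a strong low-frequency "voiced" tone.
silence = [1e-4 * ((-1) ** k) for k in range(FRAME)]
voiced = [0.5 * math.sin(2 * math.pi * 100 * k / 8000) for k in range(FRAME)]
feats = frame_features(silence + voiced)
```

On this toy input, the voiced frame has far higher energy and far fewer zero crossings than the noise-like frame, which is exactly the contrast the detector thresholds exploit.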

Development of the Acoustic Model
A good acoustic model should be derived from word features that allow the system to distinguish between the different words in the dictionary. We know that different sounds are produced by varying the shape of the human vocal tract, and that these sounds can each have different frequencies. To investigate these frequency characteristics, we examine the power spectral density (PSD) estimates of various spoken digits. Since the human vocal tract can be modeled as an all-pole filter, we use the Yule-Walker parametric spectral estimation technique from the Signal Processing Toolbox to compute the PSD. After importing a statement of a single digit into the variable "word", we use the MATLAB code below to view the PSD estimate of the speech signal we have acquired (Figure 8):

order = 12; nfft = 512; Fs = 8000;
pyulear(word, order, nfft, Fs)

Because the Yule-Walker algorithm fits an autoregressive linear prediction filter model to the signal, we must supply the order of this filter. We select an arbitrary value of 12, which is typical for voice applications.
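To make explicit what pyulear computes, the Python sketch below implements the Yule-Walker method from scratch: estimate the autocorrelation, solve for the AR coefficients with the Levinson-Durbin recursion, then evaluate the all-pole model PSD. It is a pure-Python illustration, not MathWorks code; order 2 is used in the toy check, while order 12 would be typical for voice:

```python
import cmath
import math

# Yule-Walker AR spectral estimation sketch (the role pyulear plays in
# the MATLAB code above).

def autocorr(x, maxlag):
    # Biased autocorrelation estimate, lags 0..maxlag.
    n = len(x)
    return [sum(x[k] * x[k + lag] for k in range(n - lag)) / n
            for lag in range(maxlag + 1)]

def yule_walker(x, order):
    # Levinson-Durbin recursion for the coefficients a_1..a_p of the
    # AR model x[k] + a_1 x[k-1] + ... + a_p x[k-p] = e[k].
    r = autocorr(x, order)
    a = [0.0] * 0
    err = r[0]
    for i in range(order):
        acc = r[i + 1] + sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err
        a = [a[j] + k * a[i - 1 - j] for j in range(i)] + [k]
        err *= (1 - k * k)
    return a, err        # AR coefficients and driving-noise variance

def ar_psd(a, err, freq, fs):
    # All-pole model spectrum err / |1 + sum_j a_j e^{-2pi i j f/fs}|^2.
    z = cmath.exp(-2j * math.pi * freq / fs)
    denom = 1 + sum(a[j] * z ** (j + 1) for j in range(len(a)))
    return err / abs(denom) ** 2
```

The all-pole assumption is what makes this parametric estimate appropriate for the vocal tract, as stated above: a low-order AR model already captures the formant peaks of the spectrum.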
Figure 9 shows the PSD estimates of three different utterances of the words "one" and "two". We can see that the peaks of the PSD remain consistent for a particular digit but differ from one digit to another. This means that we can derive the acoustic models in our system from the spectral characteristics.
A set of spectral features commonly used in voice applications because of its robustness is the Mel Frequency Cepstral Coefficients (MFCC). MFCCs give a measure of the energy within overlapping frequency bins of the spectrum, computed on a warped (Mel) frequency scale.

Selecting a Classification Algorithm
After estimating a GMM for each digit, we have a dictionary for use in the testing phase.
Given some test speech, we again extract MFCC feature vectors from each frame of the detected word. The goal is to find the digit model with the maximum a posteriori probability over all frames of the test utterance, which reduces to maximizing the log-likelihood value.
Given the GMM of a digit (model here) and the test feature vectors (cinq here), the log-likelihood value is easily computed using the posterior function in the Statistics Toolbox: [P, log_like] = posterior(model, cinq); we repeat this calculation using the model of each digit. The test speech is classified as the digit whose GMM produces the maximum log-likelihood.
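The scoring-and-argmax logic can be sketched independently of the toolbox. The Python below uses diagonal-covariance GMMs with invented parameters and invented test vectors, purely to illustrate the classification rule:

```python
import math

# Sketch of the test-phase classification: score feature vectors
# against one GMM per digit and pick the digit with the maximum total
# log-likelihood. Models and test frames are invented for illustration.

def log_gauss_diag(x, mean, var):
    # Log-density of a diagonal-covariance Gaussian.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(frames, gmm):
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in gmm]
        mx = max(comps)          # log-sum-exp, numerically stable
        total += mx + math.log(sum(math.exp(c - mx) for c in comps))
    return total

def classify(frames, models):
    return max(models, key=lambda d: gmm_log_likelihood(frames, models[d]))

# Each model: a list of (weight, mean, variance) components.
models = {
    "nnou": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "nnan": [(1.0, [4.0, 4.0], [1.0, 1.0])],
}
```

Summing per-frame log-likelihoods implements the frame-independence assumption of the GMM classifier: the whole utterance score is the product of the frame likelihoods, taken in log-space.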

Conclusions
In this article we presented an overview of HMMs: their applications and the conventional algorithms used in the literature, the algorithms for computing the probability that an HMM generates a given sequence, the optimal path search algorithm, and the training algorithms.
The speech signal is a complex form drowned in noise. Its learning is part of a complex intelligent activity [10]. By learning from a starting model, we gradually build an effective model for each of the phonemes of the "Baoule" language.
Note finally that HMMs have established themselves as the reference model for solving certain types of problems in many application areas, whether in speech recognition, modeling of biological sequences or for the extraction of information from textual data.
However, other formalisms such as neural networks can be used to improve the modeling. Our future work will focus on modeling the linguistic aspect of the "Baoule" language.
The ACOUSTIC MODEL reflects the acoustic realization of each modeled element (phoneme, silence, noise, etc.). It is based on the concept of phonemes, which can be considered the basic sound units of spoken language. The first stage of speech recognition is to recognize a set of phonemes in the flow of words. The statistical realization of the acoustic parameters of each phoneme is represented by a Hidden Markov Model (HMM); each phoneme is typically represented by 2 or 3 states, and a multi-Gaussian density (GMM: Gaussian Mixture Model) is associated with each state. See Figure 2 below. The speech signal (picked up using a microphone) is first digitized, then analyzed by a Fourier transform that computes the energy levels of the signal in 25-millisecond frames overlapping by 10 milliseconds.

Figure 2. The part of the system using the acoustic model.

Figure 3. Acoustic model (a phoneme is modeled as a sequence of acoustic vectors).

Figure 4. The part of the system using the semantic model.


Figure 6. Graphic representation of the Hidden Markov Model.

Figure 7. Speech signal "five" with the short-time power and zero-crossing measurements.

Figure 10. Distribution of the first dimension of the MFCC feature vectors extracted from the training utterances. These parameters are obtained by computing cepstral coefficients on a Mel scale (MFCC: Mel Frequency Cepstral Coefficients). GMM densities with a large number of components are designed to address the multiple sources of variability affecting the speech signals (sex and age of the speaker, accent, noise).