Real Time Speech Based Integrated Development Environment for C Program

Automatic Speech Recognition (ASR) is the process of converting an acoustic signal captured by a microphone into written text. The motivation of this paper is to create a speech-based Integrated Development Environment (IDE) for C programs. The paper proposes a technique that enables visually impaired people, or people with arm injuries, who have good programming skills to code C programs through voice input. The proposed system accepts a C program as voice input and produces the compiled C program as output. The user utters each line of the C program. First, the voice input is recognized as text. The recognized text is then converted into a C program using the syntactic constructs of the C language. After conversion, the C program is fed as input to the IDE. Furthermore, IDE commands such as open, save, close, compile and run are also given through voice input. If an error occurs during compilation, it is corrected through voice input by specifying the line number. The performance of the speech recognition system is analyzed by varying the vocabulary size as well as the number of mixture components in the HMM.


Introduction
Speech is one form of communication used by humans to exchange information. Each spoken word is formed from a phonetic combination of vowel and consonant speech sound units. Speech processing is the study of speech signals and of the methods used to process them; the signals are usually processed in a digital representation. Speech recognition is the process of converting a speech signal into human-readable text, and it is now used in a variety of applications. People with disabilities can benefit from speech recognition programs. For individuals who are deaf or hard of hearing, speech recognition software is used to automatically generate closed captions of conversations such as discussions in conference rooms and classroom lectures. Speech recognition is also very useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to disabilities that preclude the use of conventional computer input devices. Our proposed system is developed to enable visually impaired people, or people with arm injuries, who have good programming skills to code C programs through voice input.
The paper is organized as follows. Literature related to the proposed system is discussed in Section 2. Section 3 deals with the proposed framework, followed by the implementation in Section 4. The performance analysis is detailed in Section 5. Section 6 concludes with a few points on the scope for future enhancement.

Literature Survey
Speech recognition [1] converts audio signals into human-readable text. Speech recognition systems are classified into two types according to the number of users. In a speaker-dependent system [2], words are recognized only from the trained speaker; the accuracy of these systems is usually high. A speaker-independent system [2] can be used by different individuals without training it to recognize each person's speech characteristics.
Speech recognition systems are also classified into two types based on the input. Isolated speech recognition operates on a single word at a time [3] and requires a pause between words. Continuous speech recognition [4] [5] operates on speech in which words are connected together.
Finally, speech recognition is classified into three types according to the sub-word unit [3] used for recognition: syllable-based, phoneme-based and word-based recognition.
There are two phases in developing a speech recognition system: training and testing. In the training phase, speech samples are collected, features are extracted from them and the acoustic model is built. During the testing phase, the speech utterance is recognized using the acoustic model.
Different types of spectral features [6] can be extracted during the training phase, such as linear predictive coding (LPC) coefficients, linear predictive cepstral coefficients (LPCC), perceptual linear predictive (PLP) coefficients and Mel-frequency cepstral coefficients (MFCC). In our proposed work, MFCC features are extracted from the speech samples.
A few speech-based applications are described below. In [7], a web-based application that determines the user's mood from the emotions present in the speech signal is presented. It is intended to overcome the difficulties of feedback in present web education systems. The application collects feedback about the web page as voice samples and identifies the user's mood from them. A client-server speech recognition system is described in [8]. In this work, recognition is done on the server side: the client transmits the speech signal to the server over the internet. The disadvantage of this approach is that users cannot access such applications over low-bandwidth connections. A speech browser, which is used to browse the World Wide Web via speech input, is developed in [9]. Speech-based e-learning is described in [10], where the speech recognition system uses a client/server architecture. The client uses a Java applet integrated in an HTML page. It takes the user's input and activates the corresponding service at the speech server.

Proposed Framework
Two major modules of the proposed framework are shown in Figure 1.
Module 1: Speech recognition.
Module 2: Building an IDE for C programs.

Speech Recognition
In the training phase of speech recognition, feature vectors are extracted from the given speech signal, and the extracted features are used to build the acoustic model. In the testing phase, features are extracted from the test speech signal and compared with the acoustic model to produce the recognized text. The speech recognition system is implemented using the Sphinx [11]-[13] toolkit.

Building IDE for C Program
The recognized text from Module 1 is preprocessed to convert it into a proper C program using the syntactic constructs of the C language. This C program is fed as input to the IDE, which produces the compiled output of the recognized C program.

Training Phase
The training phase consists of the following modules.
1) Data collection
Speech utterances corresponding to C keywords are collected from different speakers. For each keyword, sixty speech utterances are collected. All speech samples are recorded in WAV format. After the voice samples are collected, the dictionary file, fileids file, transcription file and language model file are generated. The fileids file contains the location of each WAV file, the transcription file contains the text corresponding to each WAV file, the dictionary file lists every word with its phoneme sequence, and the language model file contains the probability of occurrence of each word in the speech corpus.
Example of a dictionary file entry, a word followed by its phoneme sequence: INCLUDE IH N K L UW D
2) Feature extraction
Features are extracted from the voice samples. MFCC features [14] [15] are extracted using the steps shown in Figure 2.
Pre-emphasis and framing: The signal is divided into 20-40 ms frames. In this paper the frame size is 25 ms, so the frame length for a 16 kHz signal is 0.025 × 16,000 = 400 samples. The frame step is usually 10 ms to 15 ms, which makes consecutive frames overlap.
Hamming windowing: A window is applied to each frame to minimize the discontinuities at the start and at the end of the frame.
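The framing and windowing steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the 10 ms step is one value from the 10-15 ms range quoted above, and the window uses the standard Hamming coefficients (0.54 and 0.46), which the paper does not state explicitly.

```python
import math


def frame_signal(samples, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a list of samples into overlapping fixed-length frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000))  # 400 samples at 16 kHz
    step_len = int(round(sample_rate * step_ms / 1000))    # 160 samples at 16 kHz
    frames = []
    start = 0
    # Trailing samples that do not fill a whole frame are dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step_len
    return frames


def hamming_window(length):
    """Standard Hamming window (assumed form): 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame):
    """Taper a frame so that its ends fade towards zero."""
    window = hamming_window(len(frame))
    return [s * w for s, w in zip(frame, window)]
```

With these assumed settings, one second of 16 kHz audio yields frames of 400 samples that start every 160 samples, so neighbouring frames share 240 samples.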
Fast Fourier transform: Each frame is converted from the time domain to the frequency domain by the discrete Fourier transform (DFT):
$$S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-j 2 \pi k n / N}, \qquad 1 \le k \le K$$
where $h(n)$ is an $N$-sample-long analysis window and $K$ is the length of the DFT. The periodogram-based power spectral estimate for the speech frame $s_i(n)$ is given by
$$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2$$
Mel filter bank processing: A bank of filters is applied to the power spectrum, and each filter output is a weighted sum of spectral components.
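The periodogram estimate can be illustrated with a direct DFT. This is only a sketch: the function name is hypothetical, the input is assumed to be a frame that has already been windowed, the DFT length is assumed to equal the frame length, and a practical system would use an FFT routine instead of this quadratic-time loop.

```python
import cmath
import math


def power_spectrum(windowed_frame):
    """Periodogram estimate |S(k)|^2 / N for bins k = 0 .. N/2."""
    n_samples = len(windowed_frame)
    spectrum = []
    for k in range(n_samples // 2 + 1):
        # Direct DFT of the windowed frame at frequency bin k.
        s_k = sum(windowed_frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        spectrum.append(abs(s_k) ** 2 / n_samples)
    return spectrum
```

Only the first N/2 + 1 bins are kept because the spectrum of a real-valued signal is symmetric.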
Mel scale: The Mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are better at discerning small changes in pitch at low frequencies than at high frequencies. Incorporating this scale makes the features match human perception more closely.
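The formula for converting a frequency $f$ (in Hz) into the Mel scale is:
$$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right)$$
The delta coefficients are computed as:
$$d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$$
where $d_t$ is the delta coefficient of frame $t$, computed in terms of the static coefficients $c_{t-N}$ to $c_{t+N}$.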
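The Mel conversion can be sketched in a few lines. The constants 1125 and 700 are the commonly used form of the Mel formula and are assumed here; the helper names, including `mel_filter_centres`, are hypothetical and only illustrate how filter centres spaced evenly in Mel map back to Hz.

```python
import math


def hz_to_mel(f_hz):
    """Commonly used Mel formula (assumed): M(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


def mel_filter_centres(low_hz, high_hz, n_filters):
    """Centre frequencies (Hz) of filters spaced evenly on the Mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * i) for i in range(1, n_filters + 1)]
```

Because the spacing is uniform in Mel, the centres are packed closely at low frequencies and spread out at high frequencies, which mirrors the finer pitch resolution of human hearing at low frequencies.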
3) Building the HMM model
A hidden Markov model (HMM) [16] is a statistical Markov model whose states are not directly observable, i.e. they are hidden. Each state has a probability distribution over the states it can move to. Using HMMs, the acoustic model is built from the extracted MFCC features. It is a word-based model in which each phone of a word is represented as a state. The first and last states are non-emitting states. State transitions take the model from one phoneme to the next, and the sequence of transitions identifies the vocabulary word during recognition.
4) Language model
The language model assigns a probability to each word according to its frequency. It helps to predict the subsequent word during the testing phase of speech recognition: if the HMM does not predict a word correctly, the language model helps to select the next word using these probabilities. Different types of language models can be built, e.g. unigram, bigram and trigram models. A unigram model scores each word on its own, a bigram model predicts a word from the word preceding it, and a trigram model predicts a word from the two words preceding it.
The unigram model can be calculated by:
$$P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i), \qquad P(w_i) = \frac{C(w_i)}{N}$$
where $w_1 w_2 \ldots w_n$ represents the words in the corpus, $P(w_i)$ represents the probability of the $i$-th word occurring in that corpus, $C(w_i)$ represents the count of the $i$-th word and $N$ represents the total number of words in the corpus.
The bigram model can be calculated by:
$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$$
where $w_1 w_2 \ldots w_{i-1}$ represents the previous words in the corpus and $P(w_i \mid w_{i-1})$ represents the probability of the $i$-th word given the word before it.
The trigram model can be calculated by:
$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-2}\, w_{i-1})$$
Example of language model entries:
(unigram) -3.8770 IF
(bigram) -3.3998 <s> OPENBRACKET
(trigram) -0.3010 <s> OPENBRACKET </s>
Here <s> marks the start of an utterance and </s> marks its end. The probability values are given as base-10 logarithms.
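The way such log probabilities arise from counts can be sketched as below. This is an illustrative assumption, not the toolkit's procedure: it uses plain maximum-likelihood estimates (count ratios) without the smoothing and back-off weights that a real language model tool would add, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_ngrams(sentences):
    """Count unigrams and bigrams; each sentence is a list of words."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]  # utterance start/end markers
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def unigram_logprob(word, unigrams):
    """log10 of C(w) / N; returns -inf for an unseen word."""
    total = sum(unigrams.values())
    count = unigrams[word]
    return math.log10(count / total) if count else float("-inf")


def bigram_logprob(prev, word, unigrams, bigrams):
    """log10 of C(prev word) / C(prev); returns -inf for an unseen pair."""
    count = bigrams[(prev, word)]
    return math.log10(count / unigrams[prev]) if count else float("-inf")
```

For a toy corpus in which IF is always followed by OPENBRACKET, the bigram estimate is 1 and its logarithm is 0, while pairs that never occur get negative infinity, which is why practical models apply smoothing.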

Speech Testing
1) Feature Extraction
The MFCC features are extracted from the test speech utterance. The procedure for extracting MFCC features is explained in Section 3.3.1.

IDE Preprocessing
In the IDE preprocessing module, the recognized text from speech testing is converted into a C program using the syntactic constructs of the C language. In the first step, the recognized text is divided into tokens. If a token is recognized as the name of a symbol, it is replaced with the corresponding symbol. If a token is a number word, it is converted into the equivalent number. If the token is neither a number nor a symbol, the text is left as it is. After the recognized text has been preprocessed, it is fed as input to the IDE module.

Text to Symbol and Number Conversion
Two look-up tables are created, one for symbols (operators in the C language) and one for numbers. Each token is first compared with the entries in the symbol table. If a symbol matches the token, the token is replaced with that symbol. Otherwise, the token is compared with the entries in the number table. If a number matches the token, the token is replaced with its value. If the token matches none of the symbols or numbers, it is left unchanged. This process is repeated for all tokens. A few symbols and all numbers are listed in look-up Table 1 and Table 2 respectively.
The algorithms for IDE preprocessing are shown in Algorithms 3, 4 and 5.
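The look-up procedure described above can be sketched as follows. The table contents are hypothetical stand-ins for Table 1 and Table 2 (only OPENBRACKET appears in the paper's own examples), and joining consecutive digit words into one number is an assumption based on the "six six" to 66 example given for the goto command.

```python
# Hypothetical excerpt of the symbol table (Table 1 in the paper).
SYMBOLS = {
    "OPENBRACKET": "(",
    "CLOSEBRACKET": ")",
    "SEMICOLON": ";",
    "EQUAL": "=",
    "PLUS": "+",
}

# Number words mapped to digits (assumed layout of Table 2).
NUMBERS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9",
}


def preprocess(recognized_text):
    """Replace symbol names and number words; leave every other token as is."""
    output = []
    previous_was_digit = False
    for token in recognized_text.split():
        key = token.upper()
        if key in SYMBOLS:
            output.append(SYMBOLS[key])
            previous_was_digit = False
        elif key in NUMBERS:
            if previous_was_digit:
                output[-1] += NUMBERS[key]  # "six six" becomes "66"
            else:
                output.append(NUMBERS[key])
            previous_was_digit = True
        else:
            output.append(token)
            previous_was_digit = False
    return " ".join(output)
```

For instance, the recognized text `int a EQUAL six six SEMICOLON` would come out as `int a = 66 ;` under these assumed tables.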

Existing IDE
In this module, the preprocessed text is taken as input. IDE commands are also given through voice input. The IDE commands used in our proposed work are open file, save file, new file, compile file and run file.

New File
This voice command creates and opens a new file in the IDE.
Table 1. Look up table for symbols.

Goto Line Number
The goto line number command is used to correct errors that occur during the compilation of the C program.
For error correction in a C file, the user gives the command "goto line number <lineno>" through voice input. The line number is extracted from the voice command and the text in that line is cleared. The newly recognized text, taken from the user's subsequent voice input, is then placed in that line.
Example: goto line number six six. From this command, "six six" is extracted and converted to the number 66. This text to number conversion is done by the IDE preprocessing module. The algorithms for the IDE commands are shown in Algorithms 6, 7, 8, 9 and 10.
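A minimal sketch of the goto line number handling is given below. The function names and the digit table are hypothetical, and the editor buffer is modelled simply as a list of strings; the paper's own algorithms are not reproduced here.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def parse_goto(command):
    """Return the line number in 'goto line number <digits>', or None."""
    words = command.lower().split()
    if words[:3] != ["goto", "line", "number"] or len(words) == 3:
        return None
    digits = []
    for word in words[3:]:
        if word not in DIGIT_WORDS:
            return None  # not a spoken digit, so not a valid goto command
        digits.append(DIGIT_WORDS[word])
    return int("".join(digits))


def replace_line(lines, line_no, new_text):
    """Return a copy of the buffer with line `line_no` (1-based) replaced."""
    if not 1 <= line_no <= len(lines):
        raise IndexError("line number out of range")
    updated = list(lines)
    updated[line_no - 1] = new_text
    return updated
```

Here `parse_goto("goto line number six six")` gives 66, and `replace_line` then swaps in the newly recognized text for that line while leaving the rest of the buffer untouched.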

Experimental Setup
We have collected 217 C programming language keywords. Speech utterances corresponding to these keywords are collected from 25 speakers, and each keyword is uttered 20 times. The speech samples are collected using a microphone. The speech data is digitized at a sampling rate of 16 kHz as a single mono channel and stored in WAV format. After data collection, the transcription file, dictionary file and language model files are generated. MFCC features are extracted from the collected speech samples, and the HMM model is built using the extracted features. During testing, the test utterance is recognized using the HMM model. In IDE preprocessing, the recognized text is converted into a C program using the syntactic constructs of the C language. The IDE commands open file, save file, compile file, run file and goto line number are also implemented using voice input.

Performance Analysis
The performance measure used to evaluate the proposed system is discussed below.

Performance Measures
The Word Error Rate (WER) is a metric used to measure the performance of an ASR system. It compares the reference word sequence with the recognized word sequence and is defined as follows:

$$\mathrm{WER} = \frac{S + D + I}{N} \qquad (10)$$

where S is the number of substitutions, D is the number of deletions, I is the number of insertions and N is the number of words in the reference. The Word Error Rate for the entire system is calculated as follows:

$$\mathrm{WER}_{\mathrm{system}} = \frac{1}{m} \sum_{c=1}^{m} \mathrm{WER}_c$$

where c is the voice input for an IDE command, $\mathrm{WER}_c$ is the Word Error Rate for IDE command c and m is the total number of IDE commands uttered by the user.
The performance of speech recognition for a vocabulary of 150 words, analyzed by varying the number of mixture components, is tabulated in Table 3. Table 3 shows that the word error rate decreases as the number of mixture components increases, except at 128 mixture components, where the word error rate increases. The reason for this increase is that the amount of training data is not sufficient to train the model with 128 mixture components. To improve the performance with 128 mixture components, the number of times each word is uttered is increased to 40. The number of mixture components in the HMM model is decided based on the number of phonemes available in the training utterances. From this experiment, the optimum number of mixture components for the current study is fixed at 64.
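The WER computation can be sketched with a word-level edit distance, which gives the minimum total of substitutions, deletions and insertions. The function names are hypothetical, and averaging the per-command rates in `system_wer` is an assumption about how the system-level figure is formed.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference_text, recognized_text):
    """Word error rate of one utterance: (S + D + I) / N."""
    reference = reference_text.split()
    if not reference:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference, recognized_text.split()) / len(reference)


def system_wer(pairs):
    """Mean WER over (reference, recognized) pairs, one pair per IDE command."""
    rates = [wer(ref, hyp) for ref, hyp in pairs]
    return sum(rates) / len(rates)
```

As an example, recognizing "goto line number six" when the user said "goto line number six six" counts one deletion over five reference words, a WER of 0.2.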
The performance of speech recognition as the number of words is varied is shown in Figure 3. Figure 3 shows that the word error rate increases as the vocabulary size increases. For a large-vocabulary speech recognition system, the suitable sub-word unit is the phoneme or the syllable.

Conclusion
Our proposed system captures a C program through voice input and produces the compiled C program as output. During the training phase, speech utterances corresponding to C keywords are collected, MFCC features are extracted from the speech samples and the HMM model is built using the extracted features. During testing, MFCC features are extracted from the test utterance and the text is recognized using the HMM model. The recognized text is converted into a C program using the syntactic constructs of the C language. The IDE commands for saving, opening, compiling and running the file are also given through voice input. The proposed speech-based IDE is implemented for C programs only, but it can be extended to other programming languages. In our proposed work, word-based speech recognition is implemented. When the work is extended to other programming languages, phoneme-based speech recognition can be applied, since it supports large vocabularies.
Discrete cosine transform: It is used to convert the Mel spectrum into the time-like cepstral domain, which yields the MFCC coefficients.
Delta energy and delta spectrum: Features describing how the cepstral characteristics change over time must also be added. Delta energy and delta spectrum are also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but speech also carries information in its dynamics, i.e. the trajectories of the MFCC coefficients over time. Calculating the MFCC trajectories and appending them to the original feature vector improves ASR performance considerably. Delta coefficients are computed from the static coefficients of the neighbouring frames.
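The delta computation can be sketched as below, assuming the commonly used regression formula in which each delta is a weighted difference of the static coefficients N frames ahead and behind, divided by twice the sum of the squared offsets. The function name, the default context of two frames and the choice to repeat the first and last frames at the edges are all assumptions for illustration.

```python
def delta(coefficients, n_context=2):
    """Delta features for a list of frames (each frame is a list of floats)."""
    if not coefficients:
        return []
    denominator = 2 * sum(n * n for n in range(1, n_context + 1))
    last = len(coefficients) - 1
    deltas = []
    for t in range(len(coefficients)):
        row = []
        for j in range(len(coefficients[t])):
            numerator = 0.0
            for n in range(1, n_context + 1):
                ahead = coefficients[min(t + n, last)][j]   # clamp at the last frame
                behind = coefficients[max(t - n, 0)][j]     # clamp at the first frame
                numerator += n * (ahead - behind)
            row.append(numerator / denominator)
        deltas.append(row)
    return deltas
```

Applying the same function to the delta features gives the acceleration coefficients, and appending both to the static vector produces the extended feature vector described above.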
For example, the dictionary entry INCLUDE IH N K L UW D gives the word INCLUDE 6 states. The state transitions from IH to D lead to the word INCLUDE.

Table 3. Performance of speech recognition system using different mixture components.
Figure 3. WER vs. number of words.