Investigation of Automatic Speech Recognition Systems via the Multilingual Deep Neural Network Modeling Methods for a Very Low-Resource Language, Chaha
T. G. Fantaye et al.
Journal of Signal and Information Processing

Automatic speech recognition (ASR) is vital for very low-resource languages to mitigate the risk of extinction. Chaha is one such low-resource language: it suffers from resource insufficiency, and some of its phonological, morphological, and orthographic features challenge development initiatives in the area of ASR. Considering these challenges, this study is the first endeavor to analyze the characteristics of the language, prepare a speech corpus, and develop different ASR systems. A small 3-hour read speech corpus was prepared and transcribed. Different basic and rounded phone unit-based speech recognizers were explored using multilingual deep neural network (DNN) modeling methods. The experimental results demonstrated that all the basic phone and rounded phone unit-based multilingual models outperformed the corresponding unilingual models, with relative performance improvements of 5.47% to 19.87% and 5.74% to 16.77%, respectively. The rounded phone unit-based multilingual models outperformed the equivalent basic phone unit-based models with relative performance improvements of 0.95% to 4.98%. Overall, we found that multilingual DNN modeling methods are highly effective for developing Chaha speech recognizers. Both the basic and rounded phone acoustic units are convenient for building a Chaha ASR system; however, the rounded phone unit-based models are superior in performance and faster in recognition.


Introduction
Human language technologies (HLTs) are important for low-resource languages: they help revitalize and document them, preventing extinction, and raise interest in the languages among their native speakers [1]. ASR is one of the HLTs developed for such languages using small training corpora, which are often prepared by researchers. As a result, the performance of speech recognizers for low-resource languages is worse than that of recognizers for technologically favored languages. Besides, due to the shortage of sufficient training corpora, DNN models suffer from overfitting when developing speech recognizers for low-resource languages. The scarcity of training data and the overfitting problem of DNN models are mitigated either by increasing the size of the training datasets or by developing optimal DNN models using various regularization techniques such as dropout, L2 regularization, activation functions, layer normalization, and batch normalization.
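To illustrate one of these regularizers, the following is a minimal plain-Python sketch of inverted dropout; the function name and values are illustrative and not taken from this study.

```python
import random

def inverted_dropout(activations, drop_prob, rng=random.Random(0)):
    """Zero each activation with probability drop_prob and rescale
    survivors by 1/(1 - drop_prob), so the expected activation is
    unchanged and no rescaling is needed at test time."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

hidden = [0.5, -1.2, 3.0, 0.7]
print(inverted_dropout(hidden, 0.5))
```

Because each surviving unit is scaled up during training, the full network can be used unchanged at decoding time, which is why dropout is a cheap regularizer for small-corpus DNN training.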
The model regularization techniques can reduce the overfitting problem to some extent, but to overcome the above problems substantially and to develop reliable ASR systems for the low-resource languages, it is better to increase the size of the training datasets. The size of the training datasets can be increased by preparing a new training corpus, borrowing from high-resource languages, and generating synthetic datasets via various audio data augmentation techniques.
The first approach is expensive because it takes considerable time, human effort, and financial resources, and it is difficult to obtain electronically available text for very low-resource languages. Thus, it is better to use the second and third methods, namely, borrowing training datasets from high-resource languages and generating synthetic datasets. The main objectives of this study are:
- Investigating different basic phone and rounded phone unit-based speech recognizers using various multilingual DNN acoustic modeling paradigms and comparing the recognizers in terms of performance and recognition speed for the Chaha language.
- Comparing and suggesting the best acoustic modeling units for developing a speech recognition system for the Chaha language.
The remainder of this paper is organized as follows. The review of related works is presented in Section 2. A description of the Chaha language is given in Section 3. Section 4 describes the preparation of corpora. The experiments, results, and discussion of this work are discussed in Section 5. Section 6 explains the conclusions and future directions of this work.

Related Works
Multilingual DNN acoustic modeling paradigms help share and transfer DNN hidden layers among multiple languages to improve the performance of the individual languages. These paradigms are effective in reducing the overfitting problem of DNN-based speech recognition systems for low-resource languages. The widely used multilingual DNN acoustic modeling paradigms for low-resource speech recognition include phone sharing, multitask learning, and weight transfer. In the phone sharing modeling paradigm, the phones of the various languages are either merged with a language identifier prefix or combined into a universal phone set based on data-driven or International Phonetic Alphabet (IPA) approaches to create the multilingual phone set, and the model is then trained using the mixed multilingual datasets from all languages. For instance, Vu et al. [2] trained two phone sharing multilingual DNN models for ten languages from the GlobalPhone database in low-resource scenarios. The first is merged-phone-set-based phone sharing, created by simply concatenating all the involved monolingual phone sets with a language identification prefix to ensure that all phones remain distinct between languages. The second is universal-phone-set-based phone sharing, which merges all the monolingual phones that share the same symbol in the IPA table. Using both paradigms, they obtained superior performance over the corresponding unilingual DNN models.
¹Basic phone units contain only the basic phones, where the rounded phones are mapped to the corresponding basic phones. ²Rounded phone units contain all the basic phones and the rounded vowels, where rounded phones are mapped to the basic phones plus rounded vowels to capture their roundedness.
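The two phone-sharing strategies described above can be sketched as follows; the language codes and phone symbols below are hypothetical examples, not the actual Chaha or Amharic phone sets.

```python
def merge_with_prefix(phone_sets):
    """Phone sharing via language-ID prefixes: every phone stays
    distinct across languages (e.g. 'CH_b' vs 'AM_b')."""
    return {f"{lang}_{p}" for lang, phones in phone_sets.items() for p in phones}

def merge_universal(phone_sets):
    """Universal phone set: phones that share the same (IPA-style)
    symbol collapse into one unit shared across languages."""
    return set().union(*phone_sets.values())

sets = {"CH": {"b", "n", "k'"}, "AM": {"b", "n", "q"}}
print(sorted(merge_with_prefix(sets)))  # six distinct prefixed phones
print(sorted(merge_universal(sets)))    # four shared phones
```

The trade-off is visible even in this toy example: prefixing keeps every language's phones separate (more output units, no cross-language sharing at the phone level), while the universal set lets identical symbols pool their training data.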
Multitask learning helps transfer knowledge between languages if the languages are phonetically related and share some internal representation by being jointly learned together. In this multilingual paradigm, the hidden (initial) layers of the network are shared across all languages and each language has a specific output layer, as shown in Figure 1. In the weight transfer modeling paradigm, on the other hand, the hidden layers of a source DNN model are trained using unilingual or multilingual datasets; the output layer is then removed and replaced with a new target-language output layer whose dimension equals the number of senones. Then, either only the added output layer is trained, or all the hidden layers of the model are retrained, using the small training dataset of the target language, as shown in Figure 2. For example, Gales et al. [6] examined the use of shared-hidden-layer multilingual DNN-HMM models for low-resource languages from the IARPA Babel project. Huang et al. [5] studied the multitask learning DNN architecture and weight transfer schemes and attained better performance than unilingual DNN models. Lin et al. [7] also used these two multilingual DNN models to develop speech recognizers for the low-resource Taiwanese Mandarin language and obtained better performance with both, with the multitask learning model outperforming the corresponding weight transfer model. Similarly, Ghahremani et al. [4] compared the multitask learning and weight transfer models using the lattice-free maximum mutual information (LF-MMI) objective function and obtained superior performance with the multitask learning model. Moreover, Miao and Metze [8] combined the dropout regularizer with a multitask learning DNN model in very low-resource language settings and acquired significant performance improvements.
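A minimal sketch of the weight transfer idea, with layers represented as plain weight matrices; all the dimensions, names, and initialization values here are illustrative, not those of the actual TDNN models.

```python
import random

def make_layer(n_in, n_out, rng):
    """A dense layer as a weight matrix: n_out rows of n_in weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def transfer_weights(source_model, n_target_senones, rng=random.Random(0)):
    """Weight transfer: keep the source model's hidden layers, drop its
    output layer, and attach a freshly initialized output layer sized
    to the target language's senone count."""
    hidden = source_model[:-1]            # shared hidden layers, reused as-is
    n_in = len(source_model[-1][0])       # input dim of the old output layer
    new_output = make_layer(n_in, n_target_senones, rng)
    return hidden + [new_output]

rng = random.Random(1)
# Hypothetical source model: 40-dim input, two 64-unit hidden layers,
# 2000 source-language senones.
source = [make_layer(40, 64, rng), make_layer(64, 64, rng), make_layer(64, 2000, rng)]
target = transfer_weights(source, 500)  # target language has 500 senones
print(len(target[-1]))
```

After this step, one would either fine-tune only the new output layer or retrain the whole stack on the small target-language dataset, exactly as described above.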
The performance of the multilingual modeling paradigms is profoundly affected by the size of the training datasets and the relatedness of the languages.
Hence, training related target and source languages together produces better performance than training unrelated ones. For example, the works presented in [2] [4] [5], and [8] trained related target and source languages and obtained superior performance over the works presented in [2] [5] [7], which trained unrelated target and source languages. The target and source languages are considered related when they are phonetically related to each other. Commonly, languages within the same language family are phonetically related. For example, Chaha and Amharic are both members of the Semitic language family; hence, these languages are phonetically related to each other. Different researchers have investigated ASR systems for the Amharic language [9].
For example, Abate et al. [10] have analyzed the language-specific and resource-related challenges of developing an ASR system for the Amharic language.
Tachbelie et al. [11] have examined syllable- and hybrid-unit-based speech recognizers for Amharic. Tachbelie et al. [12] have also analyzed various acoustic, language, and lexical modeling units for developing an Amharic ASR system. However, HLTs in general, and ASR systems in particular, have not been investigated for the Chaha language. Thus, this study is a first attempt to investigate Chaha speech recognition systems using multilingual DNN modeling paradigms by borrowing training datasets from a phonetically related language, Amharic.

The Chaha Language
Chaha is one of the major dialects of the West Gurage language. It belongs to the Semitic language family, of which the other members are Arabic, Geez, Amharic, Tigrinya, Argobba, Harari, and Gafat [13]. Chaha is spoken in the Gurage Zone. However, Chaha is a developing language: it is in vigorous use, with literature in a standardized form being used by some, though this is not yet widespread [18]; it is not used as a medium of instruction or as a program in education, namely, in primary and secondary schools and higher institutions; and it has little documentation and few development products. For instance, it has very few books: as of the time of writing, only four works of fiction, one Bible translation, one poetry collection, and one proverb collection are available in the language. Moreover, there are no revitalization efforts or language development agencies for the language. Hence, Chaha needs particular attention from linguists and HLT developers to make the language easily accessible and usable by its speakers. This section discusses the phonological, morphological, and orthographic characteristics of the language.
The phonetic transcription of the consonants b, p, f, m, w, g, k, d, t, z, s, h, l, n, r, and y corresponds to that of the Amharic and English consonants. The pronunciation of the consonants t', č, k', š, ž, c, ǧ, x, β, ɲ, qʷ, kʷ, and gʷ corresponds to the equivalent Amharic consonants t', č', q, š, ž, č, ğ, h, v, ň, qʷ, kʷ, and gʷ, respectively. The consonant speech sounds q', k', g', x', xʷ, pʷ, bʷ, fʷ, and mʷ are peculiar to Chaha and have no corresponding sounds in Amharic or English. The sound units q', k', g', and x' are the palatalized counterparts of the consonants q, k, g, and x, while qʷ, kʷ, gʷ, xʷ, pʷ, bʷ, fʷ, and mʷ are the labialized forms of the consonants q, k, g, x, p, b, f, and m, respectively [13] [14]. The only laryngeal sound in Chaha is h, which occurs in a few Amharic loanwords such as haymanot "belief" and har "silk". Chaha is a non-geminating language: whenever an originally voiced consonant is expected to geminate, it becomes voiceless; for example, the sound b becomes p. However, one occasionally encounters a geminated radical, as in ənnəm "all", in loanwords from Amharic. Hence, Chaha has only a few consonants that can geminate, namely, m, n, t, and k [13] [15]. The seven basic vowels, namely, ä, u, i, a, e, ə, o, and the two low-mid front (ɛ) and back (ɔ) vowels form the nine-vowel phonetic inventory of Chaha, as presented in Table 2.

Chaha Morphology
Chaha exhibits root-pattern, inflectional, and derivational morphological phenomena like other Semitic languages such as Arabic and Amharic [17]. Moreover, Chaha has unique properties, namely, labialization, palatalization, devoicing, and sonorant alternations. Hence, Chaha is a morphologically rich language, and its morphological richness challenges the performance of speech recognition systems.

Chaha Writing System
Chaha is written using the Geez script. However, Chaha represents the palatalized consonants, which are not found in the Geez script, by introducing modified characters to the script, such as characters with wedges on top. The Chaha script is syllabic: each symbol represents a consonant combined with a vowel, except the sixth-order consonant, which is sometimes realized as a consonant without a vowel and at other times as a consonant with a vowel [13]. Each symbol in the Chaha writing system represents a consonant-vowel (CV) syllable, and there are 264 distinct letters [13] [14]: 224 letters from 32 core symbols with seven orders, 20 letters from four rounded velars with five orders, and 20 letters from four rounded labials with five orders. For the 32 core letters, consonants are concatenated with each of the seven basic vowels to obtain a total of 224 CV syllables. Similarly, four plain velar letters, namely, q, k, g, and x, and four plain labial letters, namely, p, b, f, and m, are combined with five rounded vowels, namely, ʷä, ʷi, ʷa, ʷe, and ʷə, to obtain the 20 rounded velar and 20 rounded labial letters, respectively. In addition, Chaha has various syllable structures of the form C(C)V(C)(C), and the possible syllable types include CV, CVC, and CVCC [15]. The CV and CVC syllables are basic and are called light and heavy syllables, respectively, while CVCC syllables are super-heavy syllables. CV is the dominant syllable type in the language [14] [16]. The phonetic and syllabic features of the Chaha writing system favor the development of ASR systems; for example, it is easy to develop a lexical dictionary using a grapheme-based approach. However, the Chaha writing system does not show gemination and devoicing of consonants, nor the pronunciation of the epenthetic vowel ə and the open vowels ɛ and ɔ.
These characteristics of the Chaha writing system are analogous to the vowels of Arabic and Hebrew, and to the geminated consonants and epenthetic vowel of Amharic, which are likewise not indicated in writing. Moreover, Chaha has syllables that have the same pronunciation but different orthographic symbols. Overall, the above features of the language challenge the development of ASR systems.

Preparation of Corpora
In this section, we discuss the text corpus, speech corpus, lexical dictionaries, and synthetic speech corpora used in our study, and the process followed to prepare them.
Chaha does not have a readily available text corpus. Besides, it has a limited presence on the web and few hardcopy books. Thus, we collected a small set of texts from the Bible, the web, and hardcopy books such as fiction, poems, and proverbs. The texts were then merged and cleaned by correcting spelling and grammar errors, expanding abbreviations, removing foreign words, textually transcribing numbers, and separating concatenated words. As a result, we obtained 14,595 sentences (200,944 tokens and 38,182 word types) as the text corpus, which is used to generate lexical dictionaries and to train language models. Moreover, phone-level Unicode versions of the text corpus and the transcribed speech text are used. The transliteration of the text corpus and the transcribed speech text from their syllable-level Unicode versions into the corresponding phone-level Unicode versions is conducted as follows. All the syllables except the 20 rounded velar and 20 rounded labial syllables are transliterated in terms of the CV pattern. For instance, the word በና/bäna/, which means "eat", is transliterated as ብኧንኣ/bäna/: syllable በ/bä/ is transliterated as the combination of the sixth-order phone ብ/b/ with the first vowel ኧ/ä/, giving ብኧ/bä/, and syllable ና/na/ is transliterated as the combination of the sixth-order phone ን/n/ with the fourth vowel ኣ/a/, giving ንኣ/na/. The rounded velar and labialized syllables, however, are combinations of two or three CV syllables. Thus, according to [13] [14], these syllables can be transliterated as concatenations of sixth-order phones with rounded vowels. For example, ኳ/kʷa/ is a rounded velar syllable and is transliterated as the combination of the sixth-order phone ክ/k/ with the rounded vowel ውኣ/ʷa/, giving the phone transliteration ክውኣ/kʷa/.
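The transliteration scheme above can be sketched as a simple table lookup. The mapping below contains only the example letters discussed in the text; a full transliterator would need entries for all 264 letters.

```python
# Illustrative fragment of a syllable-to-phone table; a complete
# transliterator would cover all 264 Chaha letters.
SYLLABLE_TO_PHONES = {
    "በ": "ብኧ",    # bä -> b + ä (core letter: sixth-order phone + vowel)
    "ና": "ንኣ",    # na -> n + a
    "ኳ": "ክውኣ",  # kʷa -> k + ʷa (rounded velar: phone + rounded vowel)
}

def transliterate(word):
    """Map each syllabic letter to its CV phone sequence; characters
    not in the table (already phone-level) pass through unchanged."""
    return "".join(SYLLABLE_TO_PHONES.get(ch, ch) for ch in word)

print(transliterate("በና"))  # -> ብኧንኣ ("eat")
```

The rounded velar and labial letters simply expand to longer phone strings, so the same lookup handles both the CV and the rounded cases.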
Like the text corpus, Chaha does not have a publicly available speech corpus for examining speech recognition tasks. Hence, we prepared a speech corpus by selecting 2000 relatively phonetically balanced sentences from the text corpus. A 3-hour speech corpus was recorded in an office environment using a Philips voice recorder (VTR5100) from 15 native speakers (10 male and 5 female), who read a total of 2000 sentences. Of the 3-hour speech corpus, 2.67 hours (1778 sentences) were collected from 10 native speakers (7 male and 3 female) who read 178 sentences each; this corpus is used as the training dataset. To avoid overlap between the training and testing datasets with respect to speakers and sentences, a 0.33-hour (222 sentences) corpus was collected from 5 separate native speakers (3 male and 2 female) who read 45 sentences each; this corpus is ten percent of the total 3-hour corpus and is used as the testing dataset. However, compared to other speech corpora that contain tens of hours or more of speech data for training, this corpus is clearly very small, and hence the models will suffer from a lack of training data. The distribution of phonemes within the 2.67-hour training dataset is shown in Figure 3.
There are no available lexical dictionaries for the Chaha language. Hence, we prepared two basic phone-based and two rounded phone-based lexicons via a grapheme-based approach [19]. The two basic phone-based lexicons contain 36 basic phones (29 basic consonants and 7 basic vowels) and 32 basic phones (25 basic consonants and 7 basic vowels), respectively. In the first lexicon, the four palatalized phones are used directly, while in the second lexicon these phones are mapped to the corresponding basic phones. These lexicons are prepared by a simple transcription of words as separated phones. The two rounded phone-based lexicons consist of 44 phones (29 basic consonants, 7 basic vowels, 5 rounded vowels, and 3 palatal vowels) and 41 phones (29 basic consonants, 7 basic vowels, and 5 rounded vowels), respectively. The first lexicon includes additional palatal vowels, while the second uses the palatal phones directly. Moreover, we used Amharic as a resource-provider language. It has a 26-hour training speech corpus (from [20] and our own), which contains a total of 13,549 sentences collected from 125 native speakers. In addition, to increase the size of the training datasets, synthetic training datasets are generated using the speed-perturbation audio data augmentation approach [21] by modifying the speed of the speech signal to 90% and 110% of the initial rate for both languages. Table 3 summarizes the total training datasets used to train the Chaha ASR systems.
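The speed-perturbation idea can be sketched in plain Python as resampling by linear interpolation; real pipelines use Kaldi's or sox's resamplers, so this is only an illustrative approximation on a toy waveform.

```python
def speed_perturb(samples, factor):
    """Resample a waveform so it plays at `factor` times the original
    speed: factor 0.9 yields a longer (slower) signal, 1.1 a shorter
    (faster) one. Plain-Python linear interpolation."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # source position for output sample i
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

wave = [float(i % 10) for i in range(100)]    # toy 100-sample signal
slow = speed_perturb(wave, 0.9)               # ~11% more samples
fast = speed_perturb(wave, 1.1)               # ~9% fewer samples
print(len(wave), len(slow), len(fast))
```

Applying the 0.9x and 1.1x versions alongside the originals roughly triples the amount of training audio, which is the effect exploited in [21].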

Experimental Setups
All the GMM-HMM and DNN-HMM models are developed using the state-of-the-art speech recognition toolkit Kaldi [22]. For the GMM-HMM models, 40-dimensional features are extracted using the speaker adaptive training (SAT) technique with the feature-space maximum likelihood linear regression (fMLLR) method. Various Bakis HMM topology triphone models are built for all the basic and rounded phone acoustic modeling units. Moreover, word-based back-off and interpolated trigram language models are built using the SRI language modeling (SRILM) toolkit [23]. These language models are smoothed using the modified Kneser-Ney smoothing algorithm and are applied with all the basic and rounded phone acoustic units.
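As a simplified illustration of what an interpolated trigram language model computes, the sketch below mixes trigram, bigram, and unigram maximum-likelihood estimates with fixed weights; the SRILM models actually used additionally tune the weights and apply modified Kneser-Ney discounting.

```python
from collections import Counter

def count_ngrams(sentences):
    """Collect unigram, bigram, and trigram counts with sentence markers."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
        tri.update(zip(toks, toks[1:], toks[2:]))
    return uni, bi, tri

def interp_trigram_prob(w3, w1, w2, counts, lambdas=(0.5, 0.3, 0.2)):
    """P(w3 | w1, w2) as a weighted mix of trigram, bigram, and
    unigram maximum-likelihood estimates (fixed illustrative weights)."""
    uni, bi, tri = counts
    total = sum(uni.values())
    l3, l2, l1 = lambdas
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / total if total else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1

counts = count_ngrams(["the cat sat", "the cat ran"])
print(interp_trigram_prob("sat", "the", "cat", counts))
```

Interpolation gives unseen trigrams a nonzero probability via the lower-order estimates, which is essential when the training text, as here, is very small.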
For the DNN-HMM models, we used a chain model trained with the LF-MMI criterion without the need for frame-level cross-entropy pretraining [24]. This model uses a one-state HMM topology for each context-dependent phone, and the phonetic-context decision tree is obtained using the one-state HMM topology and a reduced frame rate after converting the alignments from the GMM-HMM model.

Baseline GMM-HMM Models
Two baseline GMM-HMM models are trained: GMM-CH, which is trained using the Chaha in-domain dataset, and GMM-MUL, a phone sharing model trained using the mixed multilingual data (Chaha and Amharic) obtained by concatenating the Chaha and Amharic phone sets with a language identification prefix. These models are developed for all the basic and rounded triphone acoustic units. Table 4 indicates that the best HMM topology for both the basic and rounded phone unit-based GMM-CH models is a 3-state Bakis topology with a fourth, last non-emitting state and a skip from the first state to the last non-emitting state. Using this topology, the best-performing GMM-CH model, with the smallest word error rate (WER), consists of 1200 leaves. Table 5 shows that both the basic and rounded phone unit-based GMM-MUL models perform worse than the corresponding GMM-CH models.

DNN-HMM Models
Using the optimal parameters stated in Section 5.1, two unilingual TDNN models, namely, TDNN-CH and TDNN-AM, are developed for Chaha and Amharic, respectively. The TDNN-AM models are used as the bootstrap to train the weight transfer multilingual models for the Chaha language. The TDNN-CH models are trained using the combined Chaha in-domain real and synthetic speech corpus, and the results demonstrate that the rounded phone unit-based TDNN-CH model outperforms the equivalent basic phone unit-based model with an absolute WER reduction of 1.16%, as presented in the first row of Table 6. This finding shows that using rounded phones in the lexical dictionary and phone list improves the performance of the Chaha ASR system.
Moreover, the effect of the synthetic training dataset on the performance of the basic and rounded phone unit-based TDNN models is examined by developing TDNN-CH-Naug (non-augmented TDNN-CH) models using only the Chaha real training dataset (2.67 hrs) and comparing them with the TDNN-CH models developed using the total Chaha datasets (8.01 hrs). The basic and rounded phone unit-based TDNN-CH models achieved superior performance over the corresponding TDNN-CH-Naug models, with absolute performance improvements of 6.19% and 4.84%, respectively, as presented in Table 6. Hence, augmenting the training dataset with synthetic data generated by audio data augmentation improves the performance of ASR systems for very low-resource languages.
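Since both absolute and relative WER reductions are reported throughout, the distinction can be made explicit with a small helper; the WER values below are hypothetical, chosen only to illustrate the two measures.

```python
def wer_reductions(baseline_wer, new_wer):
    """Absolute reduction is a plain difference in WER percentage
    points; relative reduction normalizes that difference by the
    baseline WER."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative

# Hypothetical WERs (percent): baseline 30.0, improved system 24.0.
absolute, relative = wer_reductions(30.0, 24.0)
print(absolute, relative)  # 6.0 points absolute, 20.0% relative
```

The same absolute gain thus looks larger in relative terms when the baseline WER is low, which is worth keeping in mind when comparing the tables.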
We investigated three multilingual DNN models, namely, phone sharing, multitask learning, and weight transfer models. For the phone sharing model, the rounded phone unit-based model performed better than the corresponding basic phone unit-based model, with a relative WER reduction of 0.95%.
We also investigated three weight transfer model variants. Table 7 reveals the following experimental results. All the basic and rounded phone unit-based multilingual TDNN models, namely, the phone sharing, multitask learning, and weight transfer models, consistently outperform the baseline unilingual TDNN models. This is because the unilingual TDNN models are trained using smaller training datasets than the multilingual TDNN models. Hence, the basic phone units can be used as alternative acoustic modeling units to develop a Chaha ASR system.

Comparison of DNN-HMM Models Based on Their Recognition Speeds
The speed of a speech recognition system is measured using the real-time factor (RTF). The RTF is a natural measure of decoding speed, expressing how much more slowly the speech recognition system decodes than the user speaks. It is the ratio of the system's response time to the utterance duration, as formulated in Equation (1):

RTF(a) = t_response(a) / t_duration(a)    (1)

where a is an utterance, t_response(a) is the system's decoding (response) time, and t_duration(a) is the duration of the utterance. Usually, both the average RTF (averaged over all utterances) and the 90th percentile RTF are examined in the efficiency analysis of a speech recognition system. We used the average RTF over all utterances to analyze the speed of all the speech recognition systems developed in this study. The recognition speeds of the basic and rounded phone unit-based unilingual and multilingual TDNN models are presented in Table 8. Both the basic and rounded phone unit-based unilingual TDNN models are faster than the equivalent phone sharing and multitask learning multilingual TDNN models. Overall, the performance of the rounded phone unit-based multilingual TDNN models is better than that of the corresponding basic phone unit-based models, as discussed in Section 5.2.3. In line with this, the recognition speeds of the basic phone unit-based multilingual TDNN models are worse than those of the corresponding rounded phone unit-based models. Hence, the rounded phone units are the best acoustic modeling units for developing an ASR system for the very low-resource language Chaha.
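The average RTF computation can be sketched as follows; the per-utterance timing values are hypothetical, used only to show the calculation.

```python
def average_rtf(utterances):
    """Mean real-time factor: decoding time divided by utterance
    duration, averaged over all test utterances."""
    rtfs = [dec / dur for dec, dur in utterances]
    return sum(rtfs) / len(rtfs)

# (decoding_seconds, duration_seconds) per utterance; values illustrative.
test_set = [(1.2, 4.0), (0.9, 3.0), (2.0, 5.0)]
print(round(average_rtf(test_set), 3))
```

An average RTF below 1.0 means the system decodes faster than real time, which is the regime all the models in Table 8 are compared within.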

Conclusions and Future Works
This study presents a first attempt at investigating ASR systems using various multilingual DNN techniques for the very low-resource language Chaha. Language- and resource-related problems are the major factors challenging the development of Chaha ASR systems. Considering these challenges, this paper investigated different unilingual and multilingual speech recognizers. The experimental results demonstrate that all the basic and rounded phone unit-based multilingual TDNN models achieved superior performance over the corresponding unilingual TDNN models, with overall relative WER reductions of 5.47% to 10.87% and 5.74% to 16.77%, respectively. Hence, multilingual DNN modeling methods are effective for developing Chaha speech recognizers, and both the basic and rounded phone acoustic units can be used to build a Chaha ASR system. However, almost all the rounded phone unit-based unilingual and multilingual models realized superior performance and faster recognition speeds than the corresponding basic phone unit-based models. Hence, the rounded phone units are the best acoustic modeling units for developing a reliable ASR system for the very low-resource language Chaha.
As future work, we are interested in exploring the use of CV syllables as acoustic modeling units for building a Chaha ASR system. Besides, language-specific issues such as gemination and devoicing of consonants, proper insertion of the epenthetic vowel, and pronunciation of the open vowels in the training corpus and pronunciation dictionaries will need to be handled.