Impact of Languages and Accent on Perceived Speech Quality Predicted by Perceptual Evaluation of Speech Quality (PESQ) and Perceptual Objective Listening Quality Assessment (POLQA): Case of Moore, Dioula, French and English

Perceptual Objective Listening Quality Assessment (POLQA) and Perceptual Evaluation of Speech Quality (PESQ) are commonly used objective standards for evaluating speech quality. These methods were developed and trained on native speakers' speech sequences in a number of Western languages. One may then wonder how they perform when applied to other languages, or when the speaker is non-native. This paper evaluates PESQ and POLQA on languages that were not considered when these methods were set up, with emphasis on Moore and Dioula, two local languages of Burkina Faso. It also evaluates the two methods in the case of non-native speakers. For this purpose, on the one hand, the Mean Opinion Score-Listening Quality Objective (MOS-LQO) scores of PESQ and POLQA computed for Moore and Dioula are compared to those of French and English. On the other hand, the MOS-LQO scores of French and English are compared for native and non-native speakers, to evaluate the effect of the speaker's accent.


Introduction
Standards for assessing perceived speech quality can be divided into two main categories: subjective and objective methods. Subjective methods, also called subjective tests, consist of a set of tests in which participants judge the speech quality as they perceive it on a defined quality scale [1] [2]. The scores from these tests are averaged to obtain the Mean Opinion Score (MOS). This approach is the most suitable for assessing speech quality. However, it can be expensive and time-consuming to implement. Objective methods, also called objective models, aim to automatically predict the perceived speech quality as it would be obtained in a formal subjective test. The predicted score, called MOS-LQO (Mean Opinion Score-Listening Quality Objective), is obtained by comparing a degraded signal with its original [5]. The best-known objective standards, and those most widely used by telecommunications operators, are POLQA (ITU-T standard P.863 [6]) and PESQ (ITU-T standard P.862 [5]).
One should note that these two speech quality measurement models were developed and trained on native speakers' speech sequences in a number of Western languages. For example, 11 languages were used for POLQA [7] [8], namely English, British English, Chinese (Mandarin), Czech, Dutch, French, German, Swiss German, Italian, Japanese, and Swedish. However, these standards are used in several countries, including Burkina Faso, by telecommunications regulatory authorities and telephone operators to evaluate the quality of speech transmitted over telephone networks. One may then wonder how these methods perform when they are applied to other languages, or when the speaker is non-native.
Several authors have already worked on similar issues. F. Ben Ali et al. [9] investigated the language dependency of objective quality assessment models. Using a large measurement database, they mapped the scores of SwissQual's speech quality algorithm for Listening Quality (Squad-LQ) and of PESQ for French, English, and Arabic. They concluded that PESQ and Squad-LQ do not score these three languages in the same way. Working on English and Igbo (a West African tonal language), D. U. Ebem et al. [10] showed that the MOS-LQO scores predicted by POLQA for Igbo seem to be overestimated compared to the MOS scores given by Igbo listeners.
No previous scientific work has focused on the evaluation of PESQ and POLQA on the local languages of Burkina Faso.
This work compares the MOS-LQO scores of PESQ and POLQA for Moore and Dioula (two local languages of Burkina Faso) with those of French and English. The speech sequences of these four languages, used to compute the MOS-LQO scores, come from native and non-native speakers.
The aims are, on the one hand, to evaluate the influence of the language on the MOS-LQO scores and, on the other hand, to evaluate the impact of whether the speaker is native or non-native. It should be noted that in Burkina Faso, telephone communications are carried out in narrowband. Therefore, the POLQA and PESQ models are evaluated in this band.
The remainder of the paper is organized in three parts. The first part presents the process used to record the speech sequences. The second part presents and discusses the results, while the third part draws conclusions and outlines perspectives.

Reference Speech Signals Database Construction
To build the database of reference speech signals, the four languages (Moore, Dioula, French, and English) were considered. A total of 48 reference speech signals was constructed. These speech signals are sampled at 48 kHz and quantized on 16 bits. However, to simulate narrowband communication, these signals were down-sampled to 8 kHz and then degraded by adding impairments of different kinds, as described in the following section.

Degraded Speech Signals Database Construction
To simulate the impairments perceived during phone calls, different degradation conditions have been considered, as described in Table 1. Leman et al. [12] [13] and Tiemounou et al. [11] have shown that the noises perceived during narrowband and super-wideband phone calls can be subdivided into three families, among which environmental and breathing noises are the most representative. Babble noise is chosen here to model environmental noise, and random pink noise to model breathing noise.
To cover a wide range of perceived noise levels, five signal-to-noise ratio (SNR) values were chosen.
Figure 1 describes the degraded speech signal generation process. It should be noted that the same approach as described in [7] is adopted. The database was constructed so as to simulate a narrowband communication from the 48 reference speech samples and the 2 background noises. First, the speech signal is down-sampled to 8 kHz and filtered [7] [13] to obtain a narrowband signal (from 300 to 3400 Hz). Then, the resulting signal is equalized to −26 dBov according to ITU-T P.56 [14]. For degradation due to noise, the reference speech and noise signals are mixed at the different SNR values to obtain the degraded signals.
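The level equalization and noise mixing steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the synthetic tone and uniform noise, and the 10 dB SNR are assumptions chosen for illustration, and the level is set from the plain RMS of the whole signal rather than from the active speech level defined in ITU-T P.56.

```python
import math
import random

def mean_power(signal):
    """Average power of a list of samples."""
    return sum(x * x for x in signal) / len(signal)

def normalize_level(signal, target_db=-26.0):
    """Scale the signal so its RMS level is target_db relative to full scale (1.0).

    Simplification: plain RMS over the whole signal, not the P.56 active speech level.
    """
    target_rms = 10 ** (target_db / 20)
    gain = target_rms / math.sqrt(mean_power(signal))
    return [gain * x for x in signal]

def scale_noise_for_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [gain * n for n in noise]

def mix(speech, noise):
    """Sample-wise sum of two signals of equal length."""
    return [s + n for s, n in zip(speech, noise)]

# Synthetic stand-ins for a speech file and a noise file (1 s at 8 kHz).
fs = 8000
rng = random.Random(0)
speech = [0.3 * math.sin(2 * math.pi * 440 * t / fs) for t in range(fs)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(fs)]

speech = normalize_level(speech)                        # speech at -26 dB
scaled_noise = scale_noise_for_snr(speech, noise, 10.0)  # hypothetical 10 dB SNR
degraded = normalize_level(mix(speech, scaled_noise))   # level equalized again

measured_snr = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(f"SNR of the mixture: {measured_snr:.2f} dB")
```

Scaling the noise rather than the speech keeps the speech at its normalized level, so the SNR is controlled by a single gain.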
In addition, the level of the resulting signal is equalized again, and the signal is then coded and decoded using the G.711 codec [15], one of the codecs most widely used by cell phone operators in Burkina Faso. This process yields the degraded signal. For degradation due to the sound level, there is no mixing with noise; instead, the degradation is applied during the sound level equalization step.
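To give an intuition of what the coding and decoding step does to a sample, the following Python sketch applies logarithmic companding followed by coarse quantization. It rests on several assumptions: the text does not say whether the A-law or the mu-law variant of G.711 is used, so A-law is assumed here; the continuous A-law formula with uniform quantization of the companded value stands in for the segmented encoding tables of the actual standard; and all function names are hypothetical.

```python
import math

A = 87.6  # A-law compression parameter (assumption: A-law variant of G.711)

def alaw_compress(x):
    """Map a sample in [-1, 1] to the companded domain [-1, 1]."""
    ax = abs(x)
    if ax < 1.0 / A:
        y = A * ax / (1.0 + math.log(A))
    else:
        y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))
    return math.copysign(y, x)

def alaw_expand(y):
    """Inverse of alaw_compress."""
    ay = abs(y)
    if ay < 1.0 / (1.0 + math.log(A)):
        x = ay * (1.0 + math.log(A)) / A
    else:
        x = math.exp(ay * (1.0 + math.log(A)) - 1.0) / A
    return math.copysign(x, y)

def codec_round_trip(samples, levels=127):
    """Compress, quantize to a signed integer code in [-levels, levels], expand.

    This approximates an 8-bit logarithmic codec; it is not bit-exact G.711.
    """
    decoded = []
    for x in samples:
        x = max(-1.0, min(1.0, x))               # clip to full scale
        code = round(alaw_compress(x) * levels)  # roughly 8 bits per sample
        decoded.append(alaw_expand(code / levels))
    return decoded

samples = [0.0, 0.01, 0.5, -0.5, 1.0]
for original, decoded in zip(samples, codec_round_trip(samples)):
    print(f"{original:+.4f} -> {decoded:+.4f}")
```

The companding keeps the relative quantization error roughly constant across amplitudes, which is why low-level speech survives 8-bit coding reasonably well.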
In total, 720 degraded speech signal samples were generated.

Impact of Language on MOS-LQO Scores of PESQ and POLQA
This section presents the evaluation results of PESQ and POLQA on the degraded signals generated as described above. Figure 2 shows the MOS-LQO scores of PESQ and POLQA for the four languages (Moore, Dioula, native French, and native English) obtained by Monte Carlo simulation [16].
Figure 2 shows that, for Moore and Dioula, the MOS-LQO scores of PESQ are larger than those of native French and native English. For POLQA, on the other hand, no language stands out.
In what follows, a statistical analysis of the MOS-LQO scores of PESQ and POLQA is performed to validate the results obtained above. To measure the impact of the language on the MOS-LQO scores of PESQ and POLQA, the analysis of variance (ANOVA) method [17] is used. It is an inferential statistical method that tests whether the means of several groups are significantly different. The statistical hypotheses are the following:
• H0 (null hypothesis): all means are equal;
• H1 (alternative hypothesis): the means are not all equal.
To test the null hypothesis, a significance threshold (denoted alpha) must be specified. The ANOVA test provides 2 main statistical values:
• F: the ratio of the variation between the sample means to the variation within the samples;
• p-value: the probability associated with the F statistic.
If the p-value is lower than alpha, the null hypothesis is rejected and the means are considered statistically different. Otherwise, the null hypothesis cannot be rejected. Table 2 displays the ANOVA results for the four languages.
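As a minimal illustration of the F statistic defined above, the following Python sketch computes the ratio of between-group to within-group variation for four groups of scores. The scores are made-up values, not the data behind Table 2, and the function name is hypothetical. The p-value is not computed here: it is the upper-tail probability of the F distribution with (k − 1, N − k) degrees of freedom, which a statistics library can provide (for example `scipy.stats.f.sf` in SciPy).

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation between the group means, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up MOS-LQO values, for illustration only.
scores = {
    "Moore":   [3.1, 3.4, 2.9, 3.6],
    "Dioula":  [3.2, 3.5, 3.0, 3.3],
    "French":  [2.8, 3.0, 2.6, 3.1],
    "English": [2.7, 3.1, 2.5, 2.9],
}
f_stat, df_b, df_w = one_way_anova_f(list(scores.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

A large F means that the group means differ by more than the scatter within each group would explain, which corresponds to a small p-value.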
As one can see in Table 2, the p-value for the POLQA model is very large (0.98), so the null hypothesis cannot be rejected. Therefore, for the POLQA model, language does not seem to have an impact on the MOS-LQO scores.

Impact of Accent on MOS-LQO Scores of PESQ and POLQA
Figure 4 compares the MOS-LQO scores of PESQ and POLQA for native and non-native speakers, using French and English. The native speakers' speech signals come from [7], while the non-native speech sequences were recorded from Burkina Faso speakers. This figure shows that, for PESQ, the MOS-LQO scores of native speakers are larger than those of non-native speakers. However, this trend is not obvious for POLQA.
To confirm the previous result, a one-sided Student's t-test (the two-group counterpart of ANOVA) is performed. The distributions of the MOS-LQO scores of the two groups are presented in Figure 5, and Table 3 gives the average MOS-LQO scores for the two groups (native and non-native), as well as the Student's t-test statistics. Here again, for the PESQ standard, the null hypothesis can be rejected at a significance threshold of 0.01, whereas for POLQA it cannot be rejected. However, the relatively low p-value obtained for POLQA calls for further study.
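The test statistic behind such a comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the computation behind Table 3: the scores are made up, the function name is hypothetical, and the classical pooled-variance form of Student's t-test (equal variances in both groups) is assumed. The one-sided p-value is the upper-tail probability of the t distribution with the returned degrees of freedom, available from a statistics library (for example `scipy.stats.t.sf` in SciPy).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Student's t statistic (pooled variance) and its degrees of freedom.

    A positive t supports the one-sided alternative mean(group_a) > mean(group_b).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t_stat, df

# Made-up MOS-LQO values, for illustration only.
native = [3.4, 3.6, 3.1, 3.8, 3.5]
non_native = [3.0, 3.2, 2.9, 3.3, 3.1]

t_stat, df = pooled_t(native, non_native)
print(f"t({df}) = {t_stat:.3f}")
```

With only two groups, the square of this t statistic equals the one-way ANOVA F statistic, which is the sense in which the t-test is the two-group case of ANOVA.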
The difference in performance observed between POLQA and PESQ could be explained by some limitations of PESQ. Indeed, it was shown in [18] that PESQ performs worse for some codecs, such as the Enhanced Variable Rate Codec (EVRC) family [19], for VoIP systems [20], and for wideband signals. The degradation caused by all these systems may affect the spectral content of the signal. One may suppose that, in the case of degradation due to noise, the spectral content for Moore, Dioula, or a non-native speaker is not modified in the same way as that of a native French or English speaker. This can affect PESQ performance. As mentioned in ITU-T P.863 [3], POLQA was developed to overcome the limitations of PESQ, in particular with respect to variations of the signal's spectral content, through indicators such as spectral flatness and strong variations of the disturbance density over time. This could explain why POLQA performance is insensitive to language or accent. Nevertheless, future work will focus on the study of the spectral content of the four languages.

Conclusions and Perspectives
This paper evaluates the impact of language and accent on the speech quality predicted by the PESQ and POLQA standards. First, the effect of language is evaluated by comparing the MOS-LQO scores of PESQ and POLQA obtained for Moore and Dioula (two languages of Burkina Faso), as well as for French and English. Second, the effect of accent is evaluated by comparing the MOS-LQO scores of the two standards on French and English speech signals from native and non-native speakers. The results show that language and accent significantly impact the perceived speech quality predicted by PESQ, but seem to have no significant effect on POLQA.
Future work will include experiments with a cell phone operator of Burkina Faso and a comparison of the MOS-LQO scores predicted by the PESQ and POLQA models with the subjective MOS scores given by listeners.

Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.