Applying Score Reliability Fusion to Bi-Model Emotional Speaker Recognition

Emotion mismatch between training and testing is one of the important factors that degrade the performance of speaker recognition systems. In our previous work, a bi-model emotional speaker recognition (BESR) method based on synthesizing virtual HD (highly different from neutral, with a large pitch offset) speech was proposed to deal with this problem. It improved system performance under mismatched emotional states on MASC, but still suffered from the risk introduced by fusing, with equal weight, the scores from the unreliable virtual HD (VHD) model and the neutral model. In this paper, we propose a new BESR method based on score reliability fusion. Two strategies, one using the identification rate and the other using the scores average relative loss difference, are presented to estimate the weights for the two groups of scores. The results on both MASC and EPST show that with the weights generated by the two strategies, the BESR method performs better than with equal weights, and the better strategy even achieves a result comparable to that obtained with the best weights selected by exhaustive search.


Introduction
In most studies of speaker recognition technology, robustness to changes of environment or channel is considered most. Less research considers the effect of changes in the speakers themselves, such as their mood. Emotion mismatch between training and testing causes system performance to decline sharply; recognizing speakers under such mismatch is the task of emotional speaker recognition.
To address this problem in emotional speaker recognition, in this paper we propose a weight strategy based on the scores average relative loss difference, which uses the scores average relative loss difference of each type of test speech on the corresponding model set to estimate the weight coefficient of each type of score. In addition, we propose a weight strategy based on the recognition rate, which uses the recognition rates of the two models on their respective classes of test speech as the weights of the corresponding model scores. The results on both MASC and EPST [1] show that the bi-model system using these fusion weight estimating strategies achieves better performance than with equal weights.

Applying Score Reliability Fusion to Bi-Model Emotional Speaker Recognition
In this method, emotional speech that differs strongly from neutral speech in MFCC and baseband (pitch, F0), such as speech in the angry, happy, and scared states, is marked as high-difference speech. For each speaker, we build virtual high-difference emotional training speech by adjusting the baseband mean of the neutral speech, thereby reducing the degree of mismatch between the test speech and the training model. Figure 1 shows the framework of the method. During training, we establish two models for each speaker: the synthetic virtual high-difference emotional speech is used to train the high-difference model, and the neutral speech is used to train the neutral model. During testing, we first obtain the gender of the test speech by gender recognition. We then mark the high-mismatch part using gender-dependent mismatch detection. Finally, for each speaker, we calculate the score of the high-mismatch part of the test speech on the high-difference model and the score of the remaining low-mismatch part on the neutral model, and fuse them by linear weighted fusion.

Gender Recognition
Gender recognition can be regarded as a special case of speaker recognition in which speakers are divided into two classes: male and female. The male gender model is trained on the male speakers' corpus $M_m$, and the female gender model is trained on the female speakers' corpus $M_f$. In testing, the match scores between the test speech $X$ and the two gender models are computed, and the gender of the model with the higher score is taken as the speaker's gender.

High Mismatch Detection
Our previous study [2] found that, for the three high-difference emotions, the baseband mean deviates from the baseband mean of neutral speech (same speaker, same text); the greater the deviation, the lower the degree of matching of the speech and the lower the probability that it is recognized correctly. Accordingly, we mark high-difference emotional speech with a higher baseband mean as high mismatch. Because male and female speech differ greatly, mismatch detection is trained separately for male and female speakers. The procedure has two steps. First, we use difference detection to test whether the utterance is high-difference speech. Difference detection uses a KNN-based classifier with eight global features of the speech snippet: the mean, maximum, minimum, and standard deviation of the baseband and of the log-domain energy. Second, we divide the high-difference emotional speech into several snippets, and mark the snippets whose baseband mean is higher than a threshold as the high-mismatch parts. The threshold is the value at the equal error rate (EER) point when the baseband mean is used to distinguish low-difference (neutral and sad) from high-difference (angry, happy, and scared) speech in the parameter development set (male: 156 Hz, female: 250 Hz).
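The second step, thresholding the baseband mean of each snippet, can be illustrated with a minimal sketch. The function name, the data layout (one list of per-frame F0 values per snippet, with unvoiced frames already removed and no empty snippets), and the example values are assumptions made for illustration; only the two thresholds come from the text.

```python
from statistics import mean

# Gender-dependent thresholds (Hz) at the EER point, as reported in the text.
F0_THRESHOLD_HZ = {"male": 156.0, "female": 250.0}

def mark_high_mismatch(snippet_f0, gender):
    """Return one flag per snippet: True if its mean F0 exceeds the threshold.

    snippet_f0: list of snippets, each a non-empty list of per-frame F0
                values in Hz (hypothetical layout, voiced frames only).
    gender:     "male" or "female", e.g. the output of gender recognition.
    """
    threshold = F0_THRESHOLD_HZ[gender]
    return [mean(frames) > threshold for frames in snippet_f0]

# Illustrative values only, not taken from MASC or EPST.
snippets = [[140.0, 150.0, 148.0], [170.0, 182.0, 176.0]]
flags = mark_high_mismatch(snippets, "male")
```

Snippets flagged True would be scored on the high-difference model, and the rest on the neutral model.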

The Virtual High Differences in Emotional Speech Construction Based on Time-Frequency Mapping
Because the sound source and the channel (vocal tract) interact [3,4], it can be speculated that a change in the sound source characteristics (F0) leads, to some extent, to a change in the channel characteristics (such as MFCC). We therefore shift the baseband distribution center of the neutral speech toward the baseband mean of high-difference emotional speech, which yields virtual high-difference emotional speech. When people express a particular emotion, the amplitude of the change in baseband mean relative to neutral speech is generally related to the characteristics of their vocal cords (commonly described by the baseband mean).
Male and female speakers differ greatly in emotional expression. We define the amplitude of the baseband change when a speaker expresses a high-difference emotion as $f_g(L)$, where $L$ is the baseband mean of the neutral speech and $g$ is the gender. If the definition of $f_g$ is known, formula (1) can be used to adjust the baseband mean of the neutral speech and thus obtain the baseband sequence of the high-difference emotional speech.
$L_t$ is the baseband value of the neutral speech and $H_t$ is the baseband value of the virtual high-difference emotional speech at frame $t$.
However, the form of $f_g$ is unknown, and it is difficult to obtain an analytic solution. Here we use a polynomial function to fit $f_g$, and we use the AIC criterion [5] to determine the order of the polynomial as the order that minimizes the AIC. In this paper, we use the simplified form of the AIC criterion, in which $m$ is the number of parameters of the fitting function, $n$ is the number of observed samples, and RSS is the residual sum of squares.
Once $f_g$ is known, we use the autocorrelation algorithm [6] to extract the baseband from the neutral speech, and then obtain the baseband sequence of the high-difference emotional speech according to formula (1). Finally, we obtain the corresponding virtual high-difference emotional speech by modifying the baseband with the PSOLA method [7].
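The order selection and the baseband adjustment can be sketched as follows. Two points are assumptions, since the corresponding formulas are not reproduced in this text: the AIC is taken in the common least-squares form AIC = 2m + n ln(RSS/n), and formula (1) is read as an additive shift H_t = L_t + f_g(L). All function names and the noiseless synthetic data are hypothetical.

```python
import math

def polyfit(xs, ys, order):
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*x + ..."""
    n = order + 1
    # Augmented normal equations [A^T A | A^T y].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def polyval(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def aic(xs, ys, coeffs):
    """Assumed simplified AIC: 2m + n*ln(RSS/n)."""
    n, m = len(xs), len(coeffs)
    rss = sum((y - polyval(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    rss = max(rss, 1e-12)  # guard against log(0) on a perfect fit
    return 2 * m + n * math.log(rss / n)

def select_order(xs, ys, max_order):
    """Return (order, coeffs) of the polynomial with the smallest AIC."""
    best = None
    for order in range(1, max_order + 1):
        coeffs = polyfit(xs, ys, order)
        score = aic(xs, ys, coeffs)
        if best is None or score < best[0]:
            best = (score, order, coeffs)
    return best[1], best[2]

def virtual_hd_f0(neutral_f0, coeffs):
    """Assumed additive reading of formula (1): H_t = L_t + f_g(L)."""
    mean_f0 = sum(neutral_f0) / len(neutral_f0)
    shift = polyval(coeffs, mean_f0)
    return [f + shift for f in neutral_f0]

# Synthetic, noiseless quadratic data on a scaled axis (illustration only).
xs = [i / 10 for i in range(11)]
ys = [1.0 + 2.0 * x - 3.0 * x * x for x in xs]
best_order, best_coeffs = select_order(xs, ys, max_order=4)
```

The normal-equation solver is adequate for low orders on scaled inputs; for raw F0 values in Hz and orders as high as 11, a numerically stable fitting routine would be a safer choice.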

Emotional Speaker Recognition
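The system is built on the GMM-UBM framework. For every registered speaker $i$, two sub-models are adapted from the UBM: a virtual high-difference model and a neutral model. The score of the high-mismatch part of the test speech on the high-difference model and the score of the low-mismatch part on the neutral model are fused by formula (3), in which α and β are the fusion weights of the neutral-model score and the high-difference-model score, respectively. Setting the two weights equal gives the so-called equal-weight bi-model method [2]. In this paper, we use two fusion weight estimating strategies based on score reliability assessment to determine α and β. Finally, the test speech is assigned to the speaker with the highest fused score.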
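The fusion and decision step can be illustrated with a minimal sketch. It assumes that formula (3) is a plain linear combination, with α applied to the neutral-model score of the low-mismatch part and β to the high-difference-model score of the high-mismatch part; the function names, speaker labels, and score values are hypothetical.

```python
def fuse_scores(neutral_scores, hd_scores, alpha=0.5, beta=0.5):
    """Linear weighted fusion of the two score groups, per registered speaker.

    neutral_scores[spk]: score of the low-mismatch part on spk's neutral model
    hd_scores[spk]:      score of the high-mismatch part on spk's HD model
    """
    return {
        spk: alpha * neutral_scores[spk] + beta * hd_scores[spk]
        for spk in neutral_scores
    }

def identify(fused_scores):
    """Return the speaker with the highest fused score."""
    return max(fused_scores, key=fused_scores.get)

# Illustrative log-likelihood-style scores (not from the paper).
neutral = {"spk1": -1.0, "spk2": -2.0}
hd = {"spk1": -3.0, "spk2": -1.5}

equal_weight_decision = identify(fuse_scores(neutral, hd))
reweighted_decision = identify(fuse_scores(neutral, hd, alpha=0.8, beta=0.2))
```

In this toy case the decision changes when the neutral-model score is trusted more, which is the effect the weight estimating strategies aim to exploit.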

Fusion Weight Estimating Strategy Based on the Score Reliability Assessment
Synthetic virtual high-difference emotional speech differs from real emotional speech, so the score of the high-mismatch part $X_H$ on the virtual high-difference model is unreliable, whereas the score of the low-mismatch part $X_L$ on the neutral model is reliable. For two kinds of scores with different reliability, fusing them with equal weights is clearly unreasonable. In this section we propose two fusion weight estimating strategies based on score reliability assessment.

Based on Fusion Weight Strategy of Scores Average Relative Loss Difference
First determine the model collection $Z_\theta$, $\theta \in \{H, N\}$, which contains one model for each of the $M$ registered speakers; $H$ denotes the high-difference type and $N$ the neutral type. The testing speech collection is $O_\varphi$, $\varphi \in \{H, L\}$, where $K_\varphi$ is the number of testing utterances of type $\varphi$ and $L$ denotes the low-difference type.
In speaker recognition, the test utterance is assigned to the speaker whose model gives the maximum matching probability. Suppose test utterance $j$ was spoken by speaker $i$. If the score of utterance $j$ on speaker $i$'s model is the maximum over the model collection, the system identifies utterance $j$ correctly. Conversely, the larger the distance between this score and the maximum score over the collection, the worse the ability of the system to identify the utterance.
When the model collection $Z_\theta$ is used to determine the identity of the speech in the collection $O_\varphi$, the scores average relative loss difference is computed from three quantities for each test utterance: the maximum and the minimum of its scores over the model collection $Z_\theta$, and its score on the model in $Z_\theta$ that belongs to its true speaker. The larger the scores average relative loss difference, the more unreliable the scores of the testing speech collection $O_\varphi$ on the model collection $Z_\theta$. It is used to estimate α and β in formula (3).
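The following sketch shows one way to compute such a loss from development scores. The exact formula is not reproduced in this text, so the normalization is an assumption: for each utterance, the gap between the maximum score and the true speaker's score is divided by the score range (maximum minus minimum), and the result is averaged over utterances. The mapping from loss to weights is also purely hypothetical (weights proportional to one minus the loss). Function names and data are illustrative.

```python
def average_relative_loss(scores, true_speaker):
    """Assumed form: mean over utterances of (s_max - s_true) / (s_max - s_min).

    scores[j][i]:    score of test utterance j on registered speaker i's model
    true_speaker[j]: index of the speaker who actually produced utterance j
    """
    total = 0.0
    for row, spk in zip(scores, true_speaker):
        s_max, s_min = max(row), min(row)
        spread = s_max - s_min
        # A zero spread means all models score alike; count no loss for it.
        total += (s_max - row[spk]) / spread if spread > 0 else 0.0
    return total / len(scores)

def weights_from_loss(loss_neutral, loss_hd):
    """Hypothetical mapping: weights proportional to (1 - loss), summing to 1."""
    rel_neutral, rel_hd = 1.0 - loss_neutral, 1.0 - loss_hd
    total = rel_neutral + rel_hd
    if total == 0:
        return 0.5, 0.5
    return rel_neutral / total, rel_hd / total

# Two utterances scored on three speaker models (illustrative values).
scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.6]]
true_speaker = [0, 2]
loss = average_relative_loss(scores, true_speaker)
alpha, beta = weights_from_loss(0.1, 0.3)
```

A loss of zero means every utterance scored highest on its true speaker's model; larger values indicate less reliable scores and, under the assumed mapping, a smaller fusion weight.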

Based on the Weight Strategy of Recognition Rate
When the model collection $Z_\theta$ is used to determine the identity of the speech in the collection $O_\varphi$, the higher the speaker identification rate (IR), the higher the proportion of the testing speech in $O_\varphi$ that is identified correctly by the model collection $Z_\theta$, and likewise the more reliable the match scores of the speech in $O_\varphi$ on the models in $Z_\theta$. Accordingly, the identification rate is used to determine the weights α and β of formula (3).
In the experiments, the UBM had 1024 mixture components, and the features were 13-dimensional MFCC and their deltas. The window length for MFCC, energy, and pitch was 32 ms throughout, and the step size was 16 ms throughout. The weight coefficients α and β, the baseband mapping function f, and the gender models were all obtained from the development data. The order of f was set as in the equal-weight bi-model approach: 11 for the male f and 5 for the female f.
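Returning to the recognition-rate strategy described in this section, a minimal sketch of how identification rates on development data could be turned into fusion weights is given below. The weight formula is not reproduced in this text, so normalizing the two identification rates to sum to one is an assumption, and all names and values are hypothetical.

```python
def identification_rate(scores, true_speaker):
    """Fraction of utterances whose top-scoring model is the true speaker's.

    scores[j][i]:    score of test utterance j on registered speaker i's model
    true_speaker[j]: index of the speaker who actually produced utterance j
    """
    correct = 0
    for row, spk in zip(scores, true_speaker):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == spk:
            correct += 1
    return correct / len(scores)

def weights_from_identification_rate(ir_neutral, ir_hd):
    """Assumed mapping: weights proportional to the two identification rates."""
    total = ir_neutral + ir_hd
    if total == 0:
        return 0.5, 0.5
    return ir_neutral / total, ir_hd / total

# Illustrative development scores: two utterances, three speaker models.
scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.6]]
true_speaker = [0, 2]
ir = identification_rate(scores, true_speaker)

# Hypothetical rates for the neutral and high-difference model sets.
alpha, beta = weights_from_identification_rate(0.9, 0.6)
```

Under this assumed mapping, the model set that identifies its class of development speech more accurately receives the larger weight.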
To verify the validity of the two fusion weight estimating strategies based on score reliability assessment, this section compares the recognition performance of four methods on the MASC and EPST corpora.
The four methods are: the bi-model method with the fusion weight strategy based on the scores average relative loss difference (score difference), the bi-model method with the weight strategy based on the recognition rate (recognition rate), the bi-model method with equal weights (equal weight), and the traditional GMM-UBM method (baseline). The experimental results of the four methods on the MASC corpus are shown in Table 1. Relative to the baseline, the recognition rates on high-difference emotional test speech improve clearly under all three weight strategies (4.87%-6.93% on angry speech, 4.00%-6.23% on happy speech, and 2.10%-4.27% on scared speech), while performance on low-difference emotional test speech declines slightly (0.76%-0.93% on neutral speech and 0.33%-0.73% on sad speech).
For the bi-model method, the two weight estimating strategies are better than the equal-weight strategy; in particular, recognition of the three high-difference emotions improves clearly. This improvement arises mainly because the two assessment methods assess score reliability effectively, so the two different scores can be fused effectively. The slight decline on low-difference emotional speech is mainly caused by inaccurate mismatch detection: scoring low-mismatch parts on the virtual high-difference emotional model has a negative impact. Despite these shortcomings, the proposed weight estimating strategies raise the recognition rate by 3.23% over the baseline and by 1.35% over the equal-weight bi-model method.
To check the effectiveness of the weight estimating strategies, this part compares the recognition performance of the bi-model method on the MASC corpus under different weight coefficients. With the better of the two strategies, the recognition performance of the bi-model method approximates the best performance.
In this paper, the bi-model score weighting coefficients α and β are obtained from the MASC parameter development data using our two fusion weight estimating strategies. The bi-model approach achieves good results in the speaker recognition experiment on the MASC test set. The EPST corpus has a different cultural background from MASC, so we verify the validity of the coefficients with the bi-model approach on the EPST corpus. The experimental results of the four methods on the EPST corpus are shown in Table 2. The three bi-model approaches based on different score fusion strategies again improve clearly over the baseline on high-difference emotional test speech. In contrast to MASC, the recognition rate for sad speech improves too.

The Evaluation Result on the EPST
The improvement on sad speech arises mainly because people from different cultures express emotion very differently, and the pitch mean of sad speech in the EPST corpus is noticeably higher than that of neutral speech. Because the two speech databases differ greatly, using the α and β obtained from the parameter development data of the MASC corpus cannot improve the recognition rate of the bi-model approach significantly, but it is still effective. In overall recognition rate on the EPST corpus, the bi-model approach based on the two score reliability assessment strategies improves by 0.30% and 0.59% over equal weight. In addition, we can also see from

Conclusions
It is clearly unreasonable to fuse the scores with equal weight in the bi-model method. In this paper, we therefore propose two fusion weight strategies based on score reliability, one using the identification rate and the other using the scores average relative loss difference. Our strategies reduce the systemic risk caused by the unreliability of virtual synthesized speech. On MASC, the bi-model method based on score reliability fusion improves by 3.23% over the baseline and by 1.35% over the equal-weight bi-model method. In the extended test on the EPST corpus, the new strategy also improves by 0.59% over equal weight.

Figure 1. Applying score reliability fusion to bi-model emotional speaker recognition.

Figure 2. The IR on MASC for BESR method with various weights.

Figure 2 shows the recognition rate of the bi-model method under various weights (including the weights from the fusion weight strategy based on the scores average relative loss difference). The figure shows that system performance is better when α is greater than β, which tells us that the score obtained with the neutral model N on the low-mismatch part (which is similar to low-difference emotional speech L) is more reliable. A rational weight α should therefore be greater than the weight β of the score of $Z_H$ on the high-mismatch part. In addition, the recognition performance in the situation bi-model which is

Figure 3, the recognition rate on the EPST corpus with the bi-model approach, the fusion weight strategy based on recognition rate (

Figure 3. The IR on EPST for BESR method with various weights.