Sudden Noise Reduction Based on GMM with Noise Power Estimation

This paper describes a method for reducing sudden noise using noise detection and classification methods, and noise power estimation. Sudden noise detection and classification have been dealt with in our previous study. In this paper, GMM-based noise reduction is performed using the detection and classification results. As a result of classification, we can determine the kind of noise we are dealing with, but the power is unknown. In this paper, this problem is solved by combining an estimation of noise power with the noise reduction method. In our experiments, the proposed method achieved good performance for recognition of utterances overlapped by sudden noises.


Introduction
Sudden and short-term noises often degrade the performance of a speech recognition system. To recognize the speech data correctly, noise reduction or model adaptation to the sudden noise is required. However, it is difficult to remove such noises because we do not know where the noise overlaps the speech or what kind of noise it is.
There have been many studies on non-stationary noise reduction in a single channel [1][2][3][4]. Among these non-stationary noises, the target of our study is mostly sudden noise. There have also been many studies on model-based noise reduction [5][6][7]. These methods are effective for additive noises. However, they are difficult to apply directly to sudden noise reduction because they require information about the noise in order to be carried out.
In our previous study [8], we proposed detecting and classifying these noises before removing them. A problem remains, however: the kind of noise can be estimated from the classification results, but the noise power is still unknown. In this paper, we propose a noise reduction method that uses the results of noise detection and classification. The proposed method integrates noise power estimation with GMM-based noise reduction to solve the aforementioned problem.

System Overview
Figure 1 shows the overview of the noise reduction system. The speech waveform is split into small segments using a window function. Each segment is converted to a feature vector, which is a log Mel-filter bank. Next, the system identifies whether or not the feature vector is noisy speech overlapped by sudden noise, using a non-linear classifier based on AdaBoost. The system then classifies the sudden noise type, only for the frames detected as noisy, using a multi-class classifier. Finally, a noise reduction method based on GMM is applied. Although we apply the proposed technique to the output of AdaBoost, it can also be applied to the output of another binary identification technique such as SVM.
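The first stage of this pipeline, splitting the waveform into windowed segments, can be sketched as follows. This is only an illustration: the 20 ms window and 10 ms shift are taken from the experimental conditions, while the 16 kHz sampling rate, the Hamming window, and the function name `frame_signal` are assumptions made here.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=20, hop_ms=10):
    """Split a waveform into overlapping, Hamming-windowed frames."""
    win = int(sample_rate * win_ms / 1000)   # samples per frame
    hop = int(sample_rate * hop_ms / 1000)   # samples between frame starts
    n_frames = 1 + (len(signal) - win) // hop
    # Row i of idx holds the sample indices of frame i.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)

# One second of synthetic audio stands in for a real utterance.
frames = frame_signal(np.random.default_rng(0).standard_normal(16000))
```

Each row of `frames` would then be converted to a log Mel-filter bank vector before detection and classification.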

Clustering Noise
There are many kinds of noises in a real environment.

Figure 1. System overview of sudden noise reduction
The smaller the difference between the noise used in training and the noise overlapped in the test data, the better the noise reduction method in Section 5 performs. But there are many kinds of noises, and potential noises need to be grouped by noise type in some way. Therefore, we made a tree of noise types based on the k-means method, using the log Mel-filter bank as the noise feature.

K-Means Clustering Limited by Distance to Center
K-means clustering usually requires the number of classes to be set in advance. In our method, the number of classes is decided automatically: classes are added until the distance d between each data point and the center of its class is smaller than an upper limit $\theta$ decided beforehand. First, all data are clustered using the k-means clustering method. Next, we calculate the distance d between each data point and the center of the class to which it belongs. If the distance d is bigger than $\theta$ (d > $\theta$), this class is divided into two classes and k-means clustering is performed again. This step is repeated until all the distances are less than $\theta$.
The noise data for noise reduction is given as the mean value of the data in each class. So, the smaller the upper limit $\theta$ is, the higher the noise reduction performance is expected to be, because the variance of each class becomes smaller.
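A minimal sketch of k-means with an upper limit on the distance to the class center is given below. It is not the exact procedure used in the experiments: the way a class is divided (seeding a new center at the point farthest from its center and re-running k-means on all data) and the names `kmeans` and `distance_limited_kmeans` are assumptions, and the limit must be positive.

```python
import numpy as np

def kmeans(data, centers, n_iter=50):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            if np.any(labels == k):          # keep the old center if a class is empty
                new_centers[k] = data[labels == k].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

def distance_limited_kmeans(data, limit):
    """Add classes until every point lies within `limit` of its class center."""
    centers = data.mean(axis=0, keepdims=True)   # start from a single class
    while True:
        centers, labels = kmeans(data, centers)
        dists = np.linalg.norm(data - centers[labels], axis=1)
        worst = dists.argmax()
        if dists[worst] <= limit:
            return centers, labels
        # Assumed splitting rule: seed a new class at the farthest point.
        centers = np.vstack([centers, data[worst]])

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = distance_limited_kmeans(data, limit=1.0)
```

The returned centers play the role of the per-class mean noise vectors used for noise reduction.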

Tree of Noise Types
One problem with the above k-means algorithm is that too many classes may be created when $\theta$ is set small. This problem is solved by making a tree using the above k-means clustering: $\theta$ is set at a larger value when all the data are clustered, and it is reduced at deeper levels of the tree. The bigger the level is, the smaller the distance limit is. In this paper, $\theta$ is halved with each level increment on the noise tree. Figure 2 shows an example of one such tree. In this paper, the clustering is performed using the mean vectors of each type of noise.

Noise Detection
Noise detection and classification are described in [8]. A non-linear classifier H(x), which separates clean speech features from noisy speech features, is learned using AdaBoost. Boosting is a voting method using weighted weak classifiers, and AdaBoost is one method of boosting [9]. The AdaBoost algorithm is as follows.
Input: n examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_i \in \{-1, 1\}$ is the label of $x_i$, m is the number of positive data, and l is the number of negative data.
Do for t = 1, …, T:
1) Train a base learner with respect to the weighted example distribution and obtain hypothesis $h_t$.

The AdaBoost algorithm uses a set of training data, $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is the i-th feature vector of the observed signal and $y_i \in Y$ is its label, with Y the set of possible labels. For noise detection, we consider just two possible labels, Y = {−1, 1}, where label 1 means noisy speech and label −1 means speech only. In this paper, single-level decision trees (also known as decision stumps) are used as weak classifiers, and the threshold of the classifier output f(x) is 0.
Using this classifier, we determine whether the frame is noisy or not.
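For illustration, the following is a minimal sketch of textbook discrete AdaBoost with decision stumps, not necessarily the exact variant used in [8]. The class-balanced initial weights 1/(2m) and 1/(2l) are an assumption suggested by the mention of m and l above; the error, vote, and weight-update formulas are the standard ones, and all function names are hypothetical.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Decision stump: +1 on one side of the threshold, -1 on the other."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)   # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost_train(X, y, T=10):
    m, l = np.sum(y == 1), np.sum(y == -1)
    # Assumed initialization: each class receives half of the total weight.
    w = np.where(y == 1, 1.0 / (2 * m), 1.0 / (2 * l))
    model = []
    for _ in range(T):
        err, j, thr, pol = train_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)      # avoid log(0) and division by zero
        vote = 0.5 * np.log((1 - err) / err)      # weight of this weak classifier
        w = w * np.exp(-vote * y * stump_predict(X, j, thr, pol))
        w /= w.sum()                              # renormalize to a distribution
        model.append((vote, j, thr, pol))
    return model

def adaboost_score(model, X):
    """Weighted vote f(x); a frame is labeled noisy when f(x) > 0."""
    return sum(v * stump_predict(X, j, thr, pol) for v, j, thr, pol in model)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
model = adaboost_train(X, y, T=3)
```

Thresholding `adaboost_score` at 0 gives the noisy or speech-only decision for each frame.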

Noise Classification
Noise classification is performed for the frames detected as noisy speech. If a frame contained noise only, it could be classified by calculating its distance from noise templates. But the frame is assumed to contain speech, too. In this paper, we therefore use AdaBoost for noise classification. AdaBoost is extended to multi-class classification using the one-vs-rest method, and a multi-class classifier is created. The following shows this algorithm.

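Final classifier: $H(x) = \arg\max_{k \in \{1, \ldots, K\}} f^{k}(x)$, where $f^{k}(x)$ is the output of the AdaBoost classifier trained to separate class k from the rest.
This classifier is made at each node in the tree. K is the total number of noise classes in a node. In this paper, each node has from 2 to 5 classes.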
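The one-vs-rest decision at a node can be sketched as below: one real-valued scorer per class, with the class of the highest score selected. The scorers here are hypothetical stand-ins (negative distance to a class center); in the system described above each scorer would be an AdaBoost output.

```python
import numpy as np

def one_vs_rest_predict(scorers, X):
    """Pick, for each row of X, the class whose binary scorer is most confident."""
    scores = np.column_stack([f(X) for f in scorers])   # shape (n_frames, K)
    return scores.argmax(axis=1)

# Toy scorers for K = 3 classes; `c=c` binds each center at definition time.
scorers = [lambda X, c=c: -np.abs(X[:, 0] - c) for c in (0.0, 5.0, 10.0)]
frames = np.array([[0.2], [4.9], [9.0]])
predicted = one_vs_rest_predict(scorers, frames)
```

Applying this at every node from the root downward would yield a path through the noise tree.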

Noisy Speech
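The observed signal feature $X_b(t)$, which is the energy of filter b of the Mel-filter bank at frame t, can be written as follows using clean speech $S_b(t)$ and additive noise $N_b(t)$:

$$X_b(t) = S_b(t) + N_b(t) \tag{1}$$

In this paper, we suppose that noises are detected and classified but the SNR is unknown. In other words, the kind of the additive noise is estimated but its power is unknown. Therefore, a parameter $\alpha$, which adjusts the noise power, is introduced as follows:

$$X_b(t) = S_b(t) + \alpha N_b(t) \tag{2}$$

In this case, the log Mel-filter bank feature $x_b(t)$ (= $\log X_b(t)$) is

$$x_b(t) = \log\bigl(S_b(t) + \alpha N_b(t)\bigr) = s_b(t) + g\bigl(s_b(t), n_b(t), \alpha\bigr) \tag{3}$$

$$g(s, n, \alpha) = \log\bigl(1 + \alpha \exp(n - s)\bigr) \tag{4}$$

where $s_b(t) = \log S_b(t)$ and $n_b(t) = \log N_b(t)$. The clean speech feature $s_b(t)$ can be obtained by estimating $g$ and subtracting it from $x_b(t)$.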
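The relation between the linear and log domains can be checked numerically. The sketch below assumes the noise term takes the form g(s, n, α) = log(1 + α exp(n − s)), which follows from taking the logarithm of S + αN; the function name `mismatch` and the numbers are illustrative only.

```python
import numpy as np

def mismatch(s, n, alpha):
    """g(s, n, alpha) = log(1 + alpha * exp(n - s)) in the log Mel-filter bank domain."""
    return np.log1p(alpha * np.exp(n - s))

S = np.array([4.0, 2.5, 0.5])     # clean speech filter-bank energies
N = np.array([1.0, 3.0, 2.0])     # noise filter-bank energies
alpha = 0.7                       # noise power adjustment
x = np.log(S + alpha * N)         # observed log feature
s, n = np.log(S), np.log(N)
recovered = x - mismatch(s, n, alpha)   # equals s when g is known exactly
```

In practice neither s nor α is known, which is why g has to be estimated from a clean speech model.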

Speech Feature Estimation Based on GMM
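The GMM-based noise reduction method is performed to estimate s(t) [5,6]. (In [5,6], the noise power parameter $\alpha$ is not considered.) The algorithm estimates the value of the noise using the clean speech GMM in the log Mel-filter bank domain. A statistical model of clean speech is given as an M-Gaussian mixture model:

$$p(s) = \sum_{m=1}^{M} \Pr(m)\, N(s; \mu_{s,m}, \Sigma_{s,m}) \tag{5}$$

Here, N(*) denotes the normal distribution, and $\mu_{s,m}$ and $\Sigma_{s,m}$ are the mean vector and the variance matrix of the clean speech s(t) at the mixture m. The noisy speech model is assumed using this model as follows:

$$p(x) = \sum_{m=1}^{M} \Pr(m)\, N(x; \mu_{x,m}, \Sigma_{x,m}) \tag{6}$$

$$\mu_{x,m} = \mu_{s,m} + g(\mu_{s,m}, \mu_n, \alpha) \tag{7}$$

$$\Sigma_{x,m} = \Sigma_{s,m} \tag{8}$$

where $g(s, n, \alpha) = \log(1 + \alpha \exp(n - s))$ is the noise term of the log Mel-filter bank feature and $\mu_n$ is the mean vector for one of the noise classes, which is decided by the result of the noise classification. At this time, the estimated value of the noise term is given by

$$\hat{g}(s, n, \alpha) = \sum_{m=1}^{M} p(m|x)\, g(\mu_{s,m}, \mu_n, \alpha) \tag{9}$$

where

$$p(m|x) = \frac{\Pr(m)\, N(x; \mu_{x,m}, \Sigma_{x,m})}{\sum_{m'} \Pr(m')\, N(x; \mu_{x,m'}, \Sigma_{x,m'})} \tag{10}$$

The clean speech feature s is estimated by subtracting $\hat{g}$ from feature x of the observed signal:

$$\hat{s} = x - \hat{g}(s, n, \alpha) \tag{11}$$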
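A minimal sketch of this estimation for a single frame is shown below. It assumes diagonal covariance matrices, a noisy-speech covariance equal to the clean-speech covariance, and the noise term g(s, n, α) = log(1 + α exp(n − s)); the function and variable names are hypothetical.

```python
import numpy as np

def mismatch(mu_s, mu_n, alpha):
    """Noise term g evaluated at the clean-speech and noise means."""
    return np.log1p(alpha * np.exp(mu_n - mu_s))

def posteriors(x, weights, mu_x, var):
    """p(m|x) for a diagonal-covariance GMM, computed in the log domain."""
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu_x) ** 2 / var, axis=1)
    log_post = np.log(weights) + log_lik
    post = np.exp(log_post - log_post.max())    # subtract max for numerical stability
    return post / post.sum()

def estimate_clean(x, weights, mu_s, var_s, mu_n, alpha=1.0):
    """Estimate the clean log feature as x minus the posterior-weighted noise term."""
    g = mismatch(mu_s, mu_n, alpha)                   # shape (M, B)
    post = posteriors(x, weights, mu_s + g, var_s)    # shape (M,)
    return x - post @ g                               # shape (B,)

# Toy single-component model: the observation is generated exactly from the model.
weights = np.array([1.0])
mu_s = np.array([[1.0, 2.0]])
var_s = np.ones((1, 2))
mu_n = np.array([0.5, 1.5])
x = mu_s[0] + mismatch(mu_s[0], mu_n, 0.5)
s_hat = estimate_clean(x, weights, mu_s, var_s, mu_n, alpha=0.5)
```

With the correct α the toy example recovers the clean mean exactly; with a wrong α the subtraction is biased, which motivates estimating the noise power.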

Noise Power Estimation Based on EM Algorithm
The parameter $\alpha$, which is used to adjust the noise power, is unknown. Therefore, (9) cannot be used because $\mu_{x,m}$ and $p(m|x)$ depend on $\alpha$. In this paper, this parameter is calculated by the EM algorithm. The EM algorithm is used to estimate the noise power $\alpha$ that maximizes $p(x)$, the likelihood of a noisy speech feature. $p(x)$ is written as (6), in which $\mu_{x,m}$ depends on $\alpha$. So, we replace $p(x)$ with $p(x|\alpha)$, and the noise power parameter $\alpha$ is calculated by maximizing the likelihood $p(x|\alpha)$ using the EM algorithm.

E-step:

$$Q(\alpha | \alpha^{(k)}) = \sum_{m=1}^{M} p(m | x, \alpha^{(k)}) \log p(x, m | \alpha) \tag{12}$$

M-step:

$$\alpha^{(k+1)} = \arg\max_{\alpha} Q(\alpha | \alpha^{(k)}) \tag{13}$$

where k is the iteration index. The above two steps are calculated repeatedly until $\alpha$ converges to the optimum solution. In the M-step, the solution is found by calculating the following equation:

$$\frac{\partial Q(\alpha | \alpha^{(k)})}{\partial \alpha} = 0 \tag{14}$$
This equation can be expanded as follows.
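One way to write out the expansion is sketched below. It assumes diagonal covariances with variances $\sigma^2_{m,b}$, noisy-speech means of the form $\mu_{s,m,b} + g_{m,b}(\alpha)$ with $g_{m,b}(\alpha) = \log(1 + \alpha e^{\mu_{n,b} - \mu_{s,m,b}})$, and posteriors $p(m|x,\alpha^{(k)})$ held fixed from the E-step; the notation is introduced here for illustration and may differ from the original equation.

```latex
\begin{align}
Q(\alpha \mid \alpha^{(k)})
  &= \sum_{m} p(m \mid x, \alpha^{(k)})
     \Bigl[ \log \Pr(m)
     - \tfrac{1}{2} \sum_{b} \Bigl( \log 2\pi\sigma^2_{m,b}
     + \frac{\bigl(x_b - \mu_{s,m,b} - g_{m,b}(\alpha)\bigr)^2}{\sigma^2_{m,b}} \Bigr) \Bigr] \\
\frac{\partial Q}{\partial \alpha}
  &= \sum_{m} p(m \mid x, \alpha^{(k)})
     \sum_{b} \frac{x_b - \mu_{s,m,b} - g_{m,b}(\alpha)}{\sigma^2_{m,b}}
     \, \frac{\partial g_{m,b}(\alpha)}{\partial \alpha} = 0 \\
\frac{\partial g_{m,b}(\alpha)}{\partial \alpha}
  &= \frac{e^{\mu_{n,b} - \mu_{s,m,b}}}{1 + \alpha\, e^{\mu_{n,b} - \mu_{s,m,b}}}
\end{align}
```

Because $\alpha$ appears both inside the logarithm of $g_{m,b}$ and in its derivative, the stationarity condition has no closed-form solution in general.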
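However, it is difficult to find a solution of this equation analytically. So, Newton's method is used for this equation. An approximation of the optimum solution is calculated repeatedly as follows using Newton's method:

$$\alpha_{j+1} = \alpha_j - \frac{f(\alpha_j)}{f'(\alpha_j)}, \qquad f(\alpha) = \frac{\partial Q(\alpha | \alpha^{(k)})}{\partial \alpha} \tag{16}$$

where j is the Newton iteration index. Equation (16) is calculated repeatedly until $\alpha$ converges. The initial value of Newton's method was set at 0.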
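The EM loop with a Newton inner loop can be sketched for a single frame as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: covariances are diagonal, the noise term is g = log(1 + α exp(μ_n − μ_s)), α is estimated per frame and clipped at zero, and the function name and iteration limits are hypothetical.

```python
import numpy as np

def em_noise_power(x, weights, mu_s, var_s, mu_n, n_em=10, n_newton=20, tol=1e-8):
    """Estimate the noise power alpha for one frame x (shape (B,))."""
    e = np.exp(mu_n - mu_s)            # shape (M, B)
    alpha = 0.0                        # initial value, as in the text
    for _ in range(n_em):
        # E-step: mixture posteriors under the current alpha.
        mu_x = mu_s + np.log1p(alpha * e)
        log_p = np.log(weights) - 0.5 * np.sum(
            np.log(2 * np.pi * var_s) + (x - mu_x) ** 2 / var_s, axis=1)
        post = np.exp(log_p - log_p.max())
        post /= post.sum()
        # M-step: solve dQ/dalpha = 0 by Newton's method with posteriors fixed.
        for _ in range(n_newton):
            dg = e / (1.0 + alpha * e)                  # dg/dalpha
            r = x - mu_s - np.log1p(alpha * e)          # residual x - mu_x
            f = post @ np.sum(r * dg / var_s, axis=1)   # dQ/dalpha
            # d2Q/dalpha2, using d2g/dalpha2 = -(dg/dalpha)**2
            df = -(post @ np.sum((1.0 + r) * dg ** 2 / var_s, axis=1))
            if abs(df) < 1e-12:
                break
            new_alpha = max(alpha - f / df, 0.0)        # assumed clip: alpha >= 0
            converged = abs(new_alpha - alpha) < tol
            alpha = new_alpha
            if converged:
                break
    return alpha

# Toy check: a single-component model and a frame generated with alpha = 0.5.
weights = np.array([1.0])
mu_s = np.array([[1.0, 2.0, 0.5]])
var_s = np.ones((1, 3))
mu_n = np.array([1.5, 1.0, 0.5])
x = mu_s[0] + np.log1p(0.5 * np.exp(mu_n - mu_s[0]))
alpha_hat = em_noise_power(x, weights, mu_s, var_s, mu_n)
```

The estimated `alpha_hat` would then be used when computing the noise term that is subtracted from the observed feature.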

Experiments
In order to evaluate the proposed method, we carried out isolated word recognition experiments using the ATR database for speech data and the RWCP corpus for noise data [10].

Experimental Conditions
The experimental conditions are shown in Table 1. All features were extracted using a 20 ms window with a 10 ms frame shift. The ATR database contains word utterances recorded by ten different speakers. There were 105 types of noises in the RWCP corpus [10], for example, telephone sounds, beating wood, and tearing paper. Each kind of noise consists of 100 data samples, which are divided into 50 samples for testing and 50 samples for training. The noise tree was made using the mean vectors of the training samples, and these vectors were divided into 37 classes (which is the total number of leaves). The classifiers for detection and classification were trained using noisy speech features. So, for training data, we made noisy utterances in each class by adding noises to 2,000 × 10 clean utterances of 10 persons (five men, five women). The clean utterances were Japanese word utterances from the ATR database. In this case, the SNR was adjusted between −5 dB and 5 dB. One GMM for noise reduction and the HMMs for recognition were trained using the same 2,000 × 10 clean utterances.
In order to make test data, we used 500 × 10 different word utterances by the same 10 persons. Some noises were overlapped on each test utterance, with the SNR adjusted to −5, 0, or 5 dB and the duration of each noise set between 10 and 200 ms. Figure 3 shows an example of noisy speech.
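Overlapping a noise sample on an utterance at a target SNR can be sketched as below. The text does not say how the SNR is measured, so this sketch assumes it is the ratio of the mean power of the whole utterance to the mean power of the scaled noise; the function name and the synthetic signals are illustrative.

```python
import numpy as np

def overlap_noise(speech, noise, snr_db, start):
    """Add `noise` to `speech` from sample `start`, scaled to the target SNR in dB."""
    noise = noise[:len(speech) - start]            # truncate if it runs past the end
    p_speech = np.mean(speech ** 2)                # assumed: power over the whole utterance
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    noisy = speech.copy()
    noisy[start:start + len(noise)] += gain * noise
    return noisy, gain

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)    # stand-in for a clean utterance
noise = rng.standard_normal(1600)      # stand-in for a 100 ms sudden noise at 16 kHz
noisy, gain = overlap_noise(speech, noise, snr_db=0.0, start=4000)
achieved_db = 10 * np.log10(np.mean(speech ** 2) / np.mean((gain * noise) ** 2))
```

Samples outside the overlapped interval are left unchanged, so only a short part of the utterance is corrupted.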

Experimental Results
Table 2 shows the results of detection and classification. "Recall" is the ratio of correctly detected noisy frames to all the noisy frames, "Precision" is the ratio of correctly detected noisy frames to all the detected frames, and "Classification" is the rate of correctly classified frames among the detected noisy frames. In this table, the recall and precision rates are high, which means that noise is well detected. The classification rate was low, however. Even if the classification result differs from the real noise label, though, the negative effect on noise reduction may be negligible if the noise is classified into a class near the real noise.
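The three frame-level measures can be computed as in the sketch below. The classification rate is evaluated here over the frames that are both truly noisy and detected, which is one reading of the definition; the function name and the toy labels are illustrative.

```python
import numpy as np

def detection_metrics(true_noisy, detected, true_class, predicted_class):
    """Frame-level recall, precision, and classification rate."""
    hit = true_noisy & detected                    # correctly detected noisy frames
    recall = hit.sum() / true_noisy.sum()
    precision = hit.sum() / detected.sum()
    classification = np.mean(true_class[hit] == predicted_class[hit])
    return recall, precision, classification

true_noisy = np.array([True, True, True, False, False])
detected = np.array([True, True, False, True, False])
true_class = np.array([2, 3, 1, -1, -1])           # -1 marks frames without noise
predicted_class = np.array([2, 1, 0, 4, 0])
recall, precision, classification = detection_metrics(
    true_noisy, detected, true_class, predicted_class)
```

In this toy example two of the three noisy frames are detected, and one of those two is assigned the correct class.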
Figure 4 shows the recognition rate for each SNR. In Figure 4, "baseline" means that noise reduction is not applied, and "No estimation of noise power" means that power estimation was not performed in GMM-based noise reduction (calculated in (11) with $\alpha$ = 1). "EM algorithm" means that the noise power is estimated using the method described in Section 5.3. "Oracle label" means that correct detection and classification results were given. In this case (Oracle label), 64 Gaussian components were used. When there were no noises, the recognition rate was 97.4%. As shown in Figure 4, the recognition rate was improved by using the proposed method. Furthermore, the proposed method performs better than noise reduction without noise power estimation.

Experiments for Unknown Noise
We examined the effectiveness of the proposed method for dealing with unknown noises using 10-fold cross validation over noise types. The 105 types of noise were divided into 10 sets, with 9 sets used for training and 1 set for testing. The noise tree and classifiers were created using the training sets, and the test data were made using the test set. Experimental conditions were similar to those in Table 1, but we examined only 64 Gaussian mixture components for noise reduction. Table 3 shows the detection results. The classification rate cannot be evaluated because the classes of the noises that overlapped the utterances are not defined. Figure 5 shows the recognition rate for unknown noises on the test sets. As shown in Figure 5, the proposed method improved the word recognition rate for unknown noises. But, in comparison with the "Oracle label", the speech recognition performance degraded due to differences between the training and test noise data.
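The split over noise types can be sketched as follows. How the 105 types were actually assigned to the 10 sets is not stated, so the random permutation, the seed, and the function name are assumptions.

```python
import numpy as np

def noise_type_folds(n_types=105, n_folds=10, seed=0):
    """Yield (train_types, test_types) index arrays, one pair per fold."""
    order = np.random.default_rng(seed).permutation(n_types)
    folds = np.array_split(order, n_folds)         # fold sizes differ by at most one
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

splits = list(noise_type_folds())
```

Because whole noise types are held out, every noise in a test fold is unseen by the noise tree and the classifiers trained on the other nine folds.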

Conclusions
In this paper, we have described a sudden noise reduction method. Noise detection and classification are performed using AdaBoost, and GMM-based noise reduction is performed using the detection and classification results. Combining an estimation of noise power with the noise reduction method, we solved the problem of word recognition when the noise power was unknown. Our proposed method improved the word recognition rate, although admittedly, the classification accuracy was not high. Furthermore, although this method was effective for unknown noises, it will need to be combined with noise adaptation, tracking techniques, and so on. In future research, we will attempt to verify the effectiveness of this method in dealing with sudden noise when a large vocabulary is used.

Figure 5. Recognition results for word utterances mixed with unknown noises (horizontal axis: number of GMM components used for noise reduction)

Figure 2. An example of a tree of noise types

