
Speech endpoint detection is an important task in speech signal processing. Although various methods exist, endpoints cannot be detected accurately at low SNR (signal-to-noise ratio). This paper proposes an endpoint detection algorithm that combines two methods: improved spectral subtraction based on multitaper spectral estimation, and BARK subband variance in the frequency domain. First, the noisy speech signal is processed by the improved spectral subtraction based on multitaper spectral estimation, which reduces the noise. Then the endpoints of the denoised signal are detected using the BARK subband variance in the frequency domain. Compared with common endpoint detection algorithms, the new method improves endpoint detection accuracy at low SNR.

Speech is the most natural form of human communication and is related to human physiological capability. It is also the most important, effective, and convenient form of information exchange. Speech signal processing is a comprehensive subject and a popular research field that covers a wide range of topics [

Although there are many methods for detecting speech endpoints, detection under low SNR conditions remains challenging. With the help of knowledge of the Bark wavelet transform [

The simulation experiments and result analysis are carried out in MATLAB. The results show that this method improves detection accuracy at low SNR.

Multitaper spectral estimation was proposed by Thomson in 1982. Whereas the traditional periodogram applies a single data window to a data sequence, Thomson's multitaper method applies multiple orthogonal data windows to the same sequence to obtain a direct spectrum for each window. The spectral estimate is then the average of these direct spectra, which makes a smaller estimation variance possible [

The multitaper spectrum is defined as follows:

$$S^{mt}(\omega) = \frac{1}{L} \sum_{k=0}^{L-1} S_k^{mt}(\omega) \tag{1}$$

where $L$ is the number of data windows and $S_k^{mt}(\omega)$ is the spectrum of the $k$-th data window:

$$S_k^{mt}(\omega) = \left| \sum_{n=0}^{N-1} a_k(n)\, x(n)\, e^{-jn\omega} \right|^2 \tag{2}$$

where $x(n)$ is the data sequence, $N$ is the sequence length, and $a_k(n)$ is the $k$-th data window. The data windows are mutually orthogonal:

$$\sum_{n} a_k(n)\, a_j(n) = \begin{cases} 0, & k \neq j \\ 1, & k = j \end{cases} \tag{3}$$

The data windows are a set of mutually orthogonal discrete prolate spheroidal sequences, also known as Slepian windows.
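Equations (1) to (3) can be illustrated with a small standard-library sketch. Computing true Slepian windows requires an eigenvalue solver, so this sketch swaps in orthonormal sine tapers, which also satisfy Equation (3); this substitution, the function names, and the test tone are assumptions made for illustration and are not taken from the paper.

```python
import cmath
import math

def sine_tapers(n_samples, n_tapers):
    """Orthonormal sine tapers, used here as a stand-in for Slepian windows."""
    scale = math.sqrt(2.0 / (n_samples + 1))
    return [[scale * math.sin(math.pi * (k + 1) * (n + 1) / (n_samples + 1))
             for n in range(n_samples)]
            for k in range(n_tapers)]

def multitaper_spectrum(x, tapers, omegas):
    """Average the tapered direct spectra, following Eqs. (1) and (2)."""
    spectrum = []
    for w in omegas:
        total = 0.0
        for a_k in tapers:
            # Eq. (2): squared magnitude of the tapered Fourier sum
            dft = sum(a * s * cmath.exp(-1j * n * w)
                      for n, (a, s) in enumerate(zip(a_k, x)))
            total += abs(dft) ** 2
        spectrum.append(total / len(tapers))  # Eq. (1): mean over L windows
    return spectrum

N, L = 64, 4  # illustrative sizes, not values from the paper
tapers = sine_tapers(N, L)

# Eq. (3): the Gram matrix of the tapers should be the identity matrix.
gram = [[sum(p * q for p, q in zip(tapers[k], tapers[j])) for j in range(L)]
        for k in range(L)]

# Test tone centred on DFT bin 8, evaluated on the N/2 + 1 positive frequencies.
x = [math.cos(2 * math.pi * 8 * n / N) for n in range(N)]
omegas = [2 * math.pi * m / N for m in range(N // 2 + 1)]
S = multitaper_spectrum(x, tapers, omegas)
peak_bin = max(range(len(S)), key=S.__getitem__)
```

Averaging over several tapers spreads the tone over a few neighbouring bins, so the peak is expected near bin 8 rather than exactly on it; that broadening is the price paid for the lower variance of the estimate.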

After the noisy speech has been divided into frames using Slepian windows, the amplitude spectrum and the phase spectrum of each frame are computed by FFT. The average amplitude spectrum is then obtained by smoothing across adjacent frames. At the same time, the multitaper power spectral density of each frame is estimated, and these estimates are likewise smoothed across adjacent frames to give the smoothed power spectral density. Given the known number of leading non-speech (noise-only) frames, the average power spectral density of the noise can be obtained. The gain factor is then derived from the spectral subtraction relationship. Applying the gain to the amplitude spectrum and combining the result with the phase spectrum, the speech signal is restored to the time domain by IFFT, which yields the enhanced speech [
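The gain computation described above can be sketched as follows. The text does not give the exact subtraction relationship, so this sketch assumes a common power spectral subtraction rule with an over-subtraction factor `alpha` and a spectral floor `beta`; both parameters, their values, the function names, and the toy data are hypothetical.

```python
import math

def noise_psd(psd_frames, n_noise_frames):
    """Average the PSD over the leading noise-only frames, bin by bin."""
    n_bins = len(psd_frames[0])
    return [sum(frame[k] for frame in psd_frames[:n_noise_frames]) / n_noise_frames
            for k in range(n_bins)]

def subtraction_gain(psd_frame, noise, alpha=2.0, beta=0.01):
    """Per-bin amplitude gain from power spectral subtraction.

    alpha (over-subtraction) and beta (spectral floor) are assumed values,
    not parameters stated in the paper.
    """
    gains = []
    for p_y, p_n in zip(psd_frame, noise):
        if p_y <= 0.0:
            gains.append(0.0)
            continue
        cleaned = p_y - alpha * p_n   # subtract the scaled noise power
        floor = beta * p_y            # never drop below a small fraction of p_y
        gains.append(math.sqrt(max(cleaned, floor) / p_y))
    return gains

# Toy smoothed PSD values: two noise-only frames, then one frame with speech in bin 0.
psd_frames = [
    [1.0, 1.0, 1.0],
    [1.2, 0.8, 1.0],
    [9.9, 1.1, 1.0],
]
noise = noise_psd(psd_frames, n_noise_frames=2)
gain = subtraction_gain(psd_frames[2], noise)
```

The enhanced amplitude spectrum would then be the gain multiplied by the smoothed amplitude spectrum, recombined with the original phase before the inverse FFT.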

The basilar membrane of the human ear functions like a frequency analyzer, which underlies the frequency-group description of the auditory masking effect. The frequency range from 20 Hz to 22,050 Hz is divided into 25 frequency groups, and the basilar membrane can correspondingly be regarded as divided into many parts. Each part of the basilar membrane corresponds to one frequency group, which is also known as an unequal-bandwidth (BARK) subband [

The principle of the endpoint detection algorithm based on BARK subband variance is described briefly here. First, the speech signal is windowed and divided into frames. Second, each frame is transformed by FFT, which yields $N/2 + 1$ positive-frequency spectral lines. The spectral lines are then extended by interpolation. The average amplitude of each BARK subband is calculated by Equation (4).

$$E_i(j) = \frac{1}{f_{j,h} - f_{j,l} + 1} \sum_{f_{j,l} \le f_k \le f_{j,h}} \left| X_i(k) \right|, \quad j = 1, 2, \cdots, q \tag{4}$$

where $f_{j,l}$ and $f_{j,h}$ are the lower and upper critical frequencies of the $j$-th BARK subband, respectively.
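A minimal sketch of Equation (4) follows. It assumes the commonly tabulated Zwicker critical-band edges, since the paper's own 25-group division is not listed in the text, and it reads the normalizing term as the number of spectral lines inside the subband. Half-open intervals are used so that a line sitting exactly on an edge is counted once, and the interpolation step is omitted; the function name and test spectrum are illustrative.

```python
# Commonly tabulated critical-band edges in Hz (an assumption, see above).
BARK_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500]

def bark_subband_means(magnitudes, fs):
    """Mean |X_i(k)| over the spectral lines inside each Bark subband.

    magnitudes holds the N/2 + 1 positive-frequency lines of one frame.
    Subbands that start at or above the Nyquist frequency are skipped.
    """
    n_fft = 2 * (len(magnitudes) - 1)
    means = []
    for low, high in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:]):
        if low >= fs / 2:
            break
        lines = [m for k, m in enumerate(magnitudes)
                 if low <= k * fs / n_fft < high]
        if lines:
            means.append(sum(lines) / len(lines))
    return means

# Ramp spectrum of a 256-point FFT at 8 kHz: line k has magnitude k.
example = bark_subband_means([float(k) for k in range(129)], fs=8000)
```

At 8 kHz only the subbands below 4 kHz are populated, and the lowest subband contains just three lines, which suggests why the spectral lines are extended by interpolation before averaging.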

The mean and variance of the BARK subband amplitudes can then be obtained for each frame. The average noise level is estimated from the leading non-speech segment. After the thresholds are set, the speech endpoints are detected by the single-parameter double-threshold method.
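The variance feature and the double-threshold decision can be sketched as below. The exact variance formula and threshold rule are not given in the text, so this sketch assumes the sample variance across the subbands of a frame and two thresholds set as multiples of the noise level; the multipliers `k_low` and `k_high`, the function names, and the toy feature track are hypothetical.

```python
def subband_variance(means):
    """Sample variance of the Bark subband amplitudes of one frame."""
    q = len(means)
    avg = sum(means) / q
    return sum((e - avg) ** 2 for e in means) / (q - 1)

def double_threshold(feature, n_noise_frames, k_low=1.5, k_high=3.0):
    """Return (start, end) frame indices of detected speech segments.

    A segment is triggered where the feature exceeds the high threshold and
    is extended in both directions while it stays above the low threshold.
    k_low and k_high are assumed multipliers of the noise level.
    """
    noise_level = sum(feature[:n_noise_frames]) / n_noise_frames
    t_low, t_high = k_low * noise_level, k_high * noise_level
    segments = []
    i, n = 0, len(feature)
    while i < n:
        if feature[i] > t_high:
            start = i
            while start > 0 and feature[start - 1] > t_low:
                start -= 1
            end = i
            while end + 1 < n and feature[end + 1] > t_low:
                end += 1
            segments.append((start, end))
            i = end + 1
        else:
            i += 1
    return segments

# Toy per-frame variance track; the first three frames are noise only.
feature = [1.0, 1.0, 1.0, 1.2, 2.0, 5.0, 6.0, 2.0, 1.0, 1.0]
segments = double_threshold(feature, n_noise_frames=3)
```

Using two thresholds lets the strong frames trigger a detection while the weaker onset and offset frames are still included in the segment.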

The proposed algorithm has three key steps. The first step is spectrum analysis by FFT to obtain the characteristics of the speech signal. Next, the signal-to-noise ratio of the speech signal is improved by the improved spectral subtraction based on multitaper spectral estimation. Finally, the endpoints are detected by calculating the BARK subband variance in the frequency domain. The flow diagram is indicated in

The proposed algorithm is simulated in MATLAB. Four parameters are defined for the experiment: the sampling frequency of the clean speech is 8 kHz, a Hamming window is used, the leading non-speech segment is 0.25 seconds long, and the frame shift is 80 samples. The clean speech is the Chinese phrase “lan tian, bai yun, bi lv de da hai”, indicated in
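The experimental parameters translate into a framing setup such as the sketch below. The sampling frequency, frame shift, preamble length, and Hamming window come from the text; the frame length of 200 samples is an assumption, because the text does not state it, and the function names are illustrative.

```python
import math

fs = 8000          # sampling frequency in Hz (from the text)
frame_shift = 80   # frame shift in samples (from the text)
preamble_s = 0.25  # leading non-speech segment in seconds (from the text)
frame_len = 200    # frame length in samples: ASSUMED, not given in the text

def hamming(n):
    """Hamming window of n points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def split_frames(signal, frame_len, frame_shift):
    """Cut the signal into overlapping frames and apply a Hamming window."""
    window = hamming(frame_len)
    n_frames = (len(signal) - frame_len) // frame_shift + 1
    return [[signal[i * frame_shift + j] * window[j] for j in range(frame_len)]
            for i in range(n_frames)]

# Number of complete frames that fit inside the leading noise-only segment.
n_noise_frames = (int(preamble_s * fs) - frame_len) // frame_shift + 1

# One second of a placeholder signal, just to show the resulting frame count.
frames = split_frames([0.0] * fs, frame_len, frame_shift)
```

With these values the preamble covers 23 complete frames, which is the number of frames available for estimating the noise statistics.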

First, the signal-to-noise ratio is set to 0 dB to verify the anti-noise capability. The result of the proposed algorithm (method I) is compared with that of the BARK subband variance method in the frequency domain (method II) and that of spectral subtraction with short-time uniform subband variance [

Analyzing the result of BARK subband variance in frequency domain, as indicated in

Next, the SNR is reduced to −5 dB, as indicated in

It is obvious that the BARK subband variance method in the frequency domain fails to meet the detection requirements, as shown in

ability of anti-noise is much better under the condition of low SNR.
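Test conditions such as 0 dB and −5 dB can be produced by scaling a noise recording before adding it to the clean speech. The sketch below is one standard way to do this and is not taken from the paper; the function names, the synthetic tone standing in for speech, and the random seed are illustrative assumptions.

```python
import math
import random

def add_noise_at_snr(clean, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB."""
    p_signal = sum(s * s for s in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * v for s, v in zip(clean, noise)]

def measured_snr_db(clean, noisy):
    """SNR of the mixture, treating (noisy - clean) as the noise."""
    p_signal = sum(s * s for s in clean)
    p_noise = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10 * math.log10(p_signal / p_noise)

random.seed(0)
# A 200 Hz tone sampled at 8 kHz stands in for the clean speech.
clean = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]
noisy = add_noise_at_snr(clean, noise, snr_db=-5)
snr = measured_snr_db(clean, noisy)
```

Because the scale factor is computed from the actual powers of the two signals, the measured SNR matches the target up to floating-point error.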

Another utterance, the line “ci hen mian mian wu jue qi” from Chinese poetry, was used in this detection experiment. The speech signal is corrupted with Gaussian noise and with the white and Volvo noises from the NOISEX-92 library, respectively. The proposed algorithm is compared with the BARK subband variance method in the frequency domain and with spectral subtraction short-time uniform subband variance [

Accuracy is defined as follows:

$$\text{Accuracy} = \frac{\text{TotalFrames} - \text{ErrorFrames}}{\text{TotalFrames}} \tag{5}$$

$$\text{ErrorFrames} = \text{SpeechFramesMisjudgedAsNoise} + \text{NoiseFramesMisjudgedAsSpeech} \tag{6}$$
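Equations (5) and (6) amount to a frame-by-frame comparison between reference labels and detected labels. The sketch below assumes both are given as lists of 0 (noise) and 1 (speech); the function name and the example labels are illustrative and are not data from the paper.

```python
def detection_accuracy(reference, detected):
    """Frame-level accuracy following Eqs. (5) and (6)."""
    speech_as_noise = sum(1 for r, d in zip(reference, detected) if r and not d)
    noise_as_speech = sum(1 for r, d in zip(reference, detected) if d and not r)
    error_frames = speech_as_noise + noise_as_speech   # Eq. (6)
    total_frames = len(reference)
    return (total_frames - error_frames) / total_frames  # Eq. (5)

reference = [0, 0, 1, 1, 1, 0, 0]  # hand-labelled frames (illustrative)
detected = [0, 1, 1, 1, 0, 0, 0]   # detector output (illustrative)
accuracy = detection_accuracy(reference, detected)
```

Here one noise frame is judged as speech and one speech frame is judged as noise, so two of the seven frames are errors and the accuracy is 5/7.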

The accuracy of the three methods for different noise types and SNR values is shown in

As shown in

In summary, the proposed algorithm shows better anti-noise performance and higher accuracy at low SNR. However, the detection result is affected by several factors, such as the speed of speech, the environmental noise, and the performance of the algorithm itself, so the method needs to be improved in the future.

Endpoint detection algorithm | Gauss noise, −10 dB | Gauss noise, 0 dB | Gauss noise, 5 dB | White noise, −10 dB | White noise, 0 dB | White noise, 5 dB | Volvo noise, −10 dB | Volvo noise, 0 dB | Volvo noise, 5 dB |
---|---|---|---|---|---|---|---|---|---|
BARK subband variance in frequency domain | Null | 42.9 | 42.9 | Null | 42.9 | 42.9 | 14.3 | 42.9 | 42.9 |
Spectral subtraction short-time uniform subband variance [ | 57.1 | 42.9 | 42.9 | 28.6 | 42.9 | 42.9 | 14.3 | 28.6 | 42.9 |
This paper algorithm | 71.4 | 57.1 | 42.9 | 57.1 | 42.9 | 42.9 | 57.1 | 42.9 | 42.9 |

Because endpoint detection is one of the most important aspects of speech signal processing, this paper proposes a speech endpoint detection algorithm for low SNR conditions. First, the noisy speech is processed by the improved spectral subtraction based on multitaper spectral estimation in order to raise the signal-to-noise ratio. Then the BARK subband variance in the frequency domain is applied to detect the speech endpoints. The simulation results show that the proposed algorithm can detect speech endpoints correctly under low SNR conditions and has good anti-noise performance. The method should be useful in speech endpoint detection applications because of its high efficiency under low SNR conditions. However, some factors, for instance the type of noise, influence the algorithm. Further research is needed to improve detection accuracy, and this will be the focus of future work.

This work was supported in part by the National Science Foundation of China under Grant 51504039.

Wei, J. and Sun, X.E. (2017) Research on Speech Endpoint Detection Algorithm with Low SNR. Open Access Library Journal, 4: e3487. https://doi.org/10.4236/oalib.1103487