Segmentation and classification of audio signals is a challenging task because of the non-stationarities and discontinuities present in the signal. Automatic music classification and annotation remains difficult because optimal audio features are hard to extract and select. Hence, this paper proposes an efficient approach for segmentation, feature extraction and classification of audio signals. Features are extracted from the audio signal using a combination of Enhanced Mel Frequency Cepstral Coefficients (EMFCC) and Enhanced Power Normalized Cepstral Coefficients (EPNCC). Then, multi-level classification is performed with a Probabilistic Neural Network (PNN) to classify the audio signal as a musical or non-musical signal. The proposed approach achieves better performance in terms of precision, Normalized Mutual Information (NMI), F-score and entropy. The PNN classifier achieves low False Rejection Rate (FRR) and False Acceptance Rate (FAR) and high Genuine Acceptance Rate (GAR), sensitivity, specificity and accuracy with respect to the number of classes.
In this paper, an efficient approach for segmentation, EMFCC-EPNCC based feature extraction and PNN-based classification of audio signals is proposed. The background section presents a brief overview of the existing audio segmentation, feature extraction and classification techniques, along with their drawbacks. The proposed work is described in the contribution section.
Segmentation and classification [
Feature extraction techniques are classified as temporal and spectral feature extraction techniques [
However, conventional audio segmentation techniques are usually quite simple and do not consider all possible scenarios. Decoder-based segmentation approaches place boundaries only at silence locations, which have no connection with the acoustic changes in the audio data. Model-based approaches do not generalize to new data conditions, as the models are not compatible with them. Metric-based approaches generally require a threshold to make decisions; these thresholds are set empirically and require additional development data. Hence, there is a need for an efficient segmentation technique. The computational complexity of traditional feature extraction approaches increases with the number of audio signals. Traditional classification techniques applied directly to the feature vectors have yielded poor results. Therefore, classification of the audio signal is done without depending on the feature vectors. However, existing audio classification systems do not represent the perceptual similarity of audio signals, as they mainly depend on a single similarity measure.
To overcome the challenges in the existing techniques, an efficient approach for segmentation, feature extraction and classification of audio signals is introduced in this paper. The proposed work combines a new objective function for peak and pitch extraction with EPNCC- and EMFCC-based feature extraction. Silence and irrelevant frequency details in the audio signal are eliminated. Clear features of the speech signal, filtered from other background signals, are obtained. PNN-based classification provides a better prediction of the classified label using probability estimation.
This paper proposes an efficient approach for segmentation, feature extraction and classification of audio signals. In the proposed work, mean filtering is used to filter the audio signal, which reduces Gaussian noise better than traditional filtering techniques. Segmentation of the audio signal is performed using peak estimation and pitch extraction. Then, the spectral difference in the audio signal pattern is estimated. Feature extraction combines EMFCC-EPNCC, peak and pitch features to collect the testing features of the audio signal. Multi-label and multi-level classification is performed to classify the audio signal as a musical or non-musical signal. The category of the audio signal is extracted from the classification result. Finally, the proposed approach is compared with existing algorithms. The PNN-based classification approach achieves better performance in terms of sensitivity, specificity, accuracy, FAR, FRR and GAR. The proposed approach achieves high precision, NMI, F-score and entropy.
The remaining sections of the paper are organized as follows: Section II describes the conventional works related to the audio segmentation and classification process. Section III explains the proposed approach, including the mean filtering, segmentation, feature extraction and PNN-based classification processes. The performance evaluation results of the proposed approach are presented in Section IV. Section V presents the conclusion and future work of the proposed approach.
This section presents the conventional research works related to the automatic segmentation and classification of audio signals and to feature extraction using various techniques and approaches. Haque and Kim proposed a correlation intensive FCM (CIFCM) algorithm for the segmentation and classification of audio data. The audio-cuts were detected efficiently irrespective of the presence of fading effects in the audio data. The boundaries between different types of sounds were detected and classified into clusters. The proposed CIFCM approach outperformed the conventional FCM approach [
The identification of bird species from audio signals was improved by using short, high-amplitude audio segments called pulses. The Support Vector Machine (SVM) classifiers were trained on a previously labeled database of bird songs. The best results were obtained by using the automatically obtained pulses with the SVM classifier [
Dhanalakshmi et al. proposed effective algorithms for the automatic classification of the audio clips into six classes. The audio content was characterized by extracting the acoustic features such as Linear Prediction Cepstral Coefficients (LPCC) and MFCC. A method for indexing the classified audio was proposed by utilizing the k-means clustering algorithm and LPCC features [
Gergen and Martin introduced various data combination strategies for the efficient classification of audio signal. The audio classification performance was analyzed based on the simulations and audio recordings. High classification accuracy was achieved [
Ludeña-Choez and Gallardo-Antolín [
Anguera [
Alam et al. [
The proposed approach is explained in this section. The audio signal is smoothed by using a mean filter. Segmentation of the audio signal is performed by using the peak estimation and pitch extraction processes. Peak estimation identifies the variation in signal amplitude between the previous and present values of the signal with respect to the sampling time. Pitch extraction is performed based on the frequency difference of the audio signal. Then, based on the pitch frequency deviation, it is determined whether the pitch satisfies the segmentation condition for the signal sample.
Drawbacks of existing segmentation, classification and feature extraction approaches | Merits of our proposed work |
---|---|
・ It can be applied only on discrete audio segments. ・ Generation of extra overhead during the computation of MFCC features. ・ Distinguishing of the speech from the music signals is poor. ・ The accuracy of the existing classification techniques is low. ・ High computational complexity and cost. | ・ The clear features of the speech signal filtered from other background signals are obtained. ・ PNN-based classification provides a better prediction of the classified label using the probability estimation. ・ The presence of silence and irrelevant frequency details in the audio signal is eliminated. ・ PNN classification approach achieves efficient classification of the musical and non-musical signal. ・ The accuracy of the proposed approach is high. |
The index region of the audio sample is extracted and represented as a projection line over the audio signal. Segmentation of the audio signal is done by extracting the signal amplitude according to the window selection of the sampling time. EMFCC-EPNCC is applied, in combination with the peak-estimated signal features, to extract the testing features for the classification stage. The audio signal is classified as a musical or non-musical signal by using the PNN classifier. From this classification result, the category of the audio signal is specified. This is done to extract the index of the audio input for retrieving the audio signal. The overall flow diagram of the proposed approach is shown in the
・ Mean filter
・ Segmentation
・ Peak Estimation
・ Peak extraction
・ Pitch extraction
・ Feature Extraction
・ EMFCC-EPNCC based feature extraction
・ PNN-based classification
Filtering of the audio signal is performed by using the mean filter. The mean filter is applied directly to the input audio signal, without requiring knowledge of the statistical characteristics of the audio signal. The filter operates by sliding a small window over each sample duration of the audio signal. The smoothed signal is obtained by computing the mean of the samples within the window and replacing the central element of the window with that mean. The amplitude of the audio signal is normalized and the Gaussian noise present in the audio signal is reduced. The filtered signal is then passed to the segmentation process. The plot of the input audio signal is depicted in
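The sliding-window averaging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mean_filter`, the default window length and the shrinking of the window at the signal edges are assumptions made for the example.

```python
def mean_filter(signal, window=5):
    """Replace each sample with the mean of the window centred on it.

    Assumption: near the edges the window is truncated to the samples
    that exist, so the output has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        segment = signal[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# A single spike is spread over its neighbours and reduced in height.
noisy = [0.0, 0.0, 9.0, 0.0, 0.0]
print(mean_filter(noisy, window=3))  # [0.0, 3.0, 3.0, 3.0, 0.0]
```

A wider window removes more noise but also blurs genuine amplitude changes, which matters here because the later segmentation stage relies on peaks.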
The main purpose of the segmentation process is to divide the input audio signal into homogeneous segments. This is done by evaluating the similarity between two contiguous windows of fixed length, in the cepstral domain. The audio segmentation is performed by using three processes:
・ Peak Estimation
・ Peak Extraction
・ Pitch Extraction
During peak estimation, peaks are calculated from the amplitude and frequency of the input signal using the parameters α, β and γ. The threshold peak value is calculated based on the average value of the signal. The interpolated peak location is calculated, and the peak magnitude is compared with the threshold peak magnitude estimate. If the peak magnitude is greater than the estimate, it is recorded as a peak within the sampled signal. The interpolated peak location is given as,
The peak magnitude estimate is given as,
where “α” is the starting edge of parabola of the signal, “β” is the peak amplitude edge of signal and “γ” is the finishing edge of parabola of the signal. The above parameters are calculated from the transformation signal obtained as the result of MFCC method.
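The expressions for the interpolated peak location and the peak magnitude estimate are not reproduced in this text. One commonly used form that takes the same three quantities is quadratic (parabolic) interpolation through three neighbouring samples, with location p = ½(α − γ)/(α − 2β + γ) and magnitude β − ¼(α − γ)p. The sketch below assumes that form; it is an illustration consistent with the symbols α, β and γ, not a confirmed reproduction of the authors' equations, and the function name is hypothetical.

```python
def parabolic_peak(alpha, beta, gamma):
    """Fit a parabola through three equally spaced samples.

    alpha, beta, gamma: sample values at offsets -1, 0 and +1, where beta
    is the largest of the three. Returns (p, magnitude), where p is the
    peak offset relative to the centre sample.
    Assumption: standard quadratic interpolation, not taken from the source.
    """
    denom = alpha - 2.0 * beta + gamma
    if denom == 0.0:
        # The three points are collinear, so no parabola vertex exists.
        return 0.0, beta
    p = 0.5 * (alpha - gamma) / denom
    magnitude = beta - 0.25 * (alpha - gamma) * p
    return p, magnitude


# Samples of y = 1 - (x - 0.2)^2 at x = -1, 0, 1: the true peak is at x = 0.2.
p, magnitude = parabolic_peak(-0.44, 0.96, 0.36)
print(round(p, 6), round(magnitude, 6))
```

A peak would then be accepted only if the interpolated magnitude exceeds the threshold derived from the average signal value, as described above.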
In this feature extraction stage, R_Loc represents the features at which the wave has a high positive peak, Q_Loc represents the features at small signal differences at the negative edge of the audio signal, and S_Loc represents the feature values [
Feature Vector is formed as,
a) Max (Q_loc),
b) Max (R_loc),
c) Max (S_loc),
d) Length (Q_loc > 0),
e) Length (R_loc > 0),
f) Length (S_loc > 0),
g) Sum (Q_loc > 0),
h) Sum (R_loc > 0),
i) Sum (S_loc > 0).
where,
where
where “N” is the sample size of input audio,
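The nine-element feature vector listed above can be assembled as in the following sketch. The notation is ambiguous, so the reading used here is an assumption: `Length(x > 0)` is taken as the number of positive entries and `Sum(x > 0)` as the sum of the positive entries. In MATLAB the same expressions would evaluate differently (`sum(x > 0)` counts the positive entries), so the intended definition should be checked against the original implementation. The function name is hypothetical.

```python
def peak_feature_vector(q_loc, r_loc, s_loc):
    """Build the features a) to i) from the Q, R and S location features.

    Assumption: Length(x > 0) = count of positive entries,
                Sum(x > 0)    = sum of the positive entries.
    """
    groups = (q_loc, r_loc, s_loc)
    maxima = [max(values) for values in groups]
    counts = [sum(1 for v in values if v > 0) for values in groups]
    sums = [sum(v for v in values if v > 0) for values in groups]
    return maxima + counts + sums


features = peak_feature_vector([1, -2, 3], [4, 5], [-1, 0.5])
print(features)  # [3, 5, 0.5, 2, 2, 1, 4, 9, 0.5]
```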
In this pitch extraction, initially the objective function is implemented to perform weight calculation from the input audio signal based on the cosine angle difference of the signal amplitude. The pitch angle variation for each pre-allocated time samples calculated from the length of input signal (Xi) is extracted based on the objective function from [
a) Zero Crossing
b) Autocorrelation
c) Maximum Likelihood
d) Adaptive filter using FFT
e) Super Resolution pitch detection
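As an illustration of one entry in this list, the sketch below estimates pitch by autocorrelation, method b). It is not the maximum likelihood estimator that the proposed method uses. The function name, the search range of 50 to 500 Hz and the test tone are assumptions chosen for the example.

```python
import math


def autocorr_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch of one frame from its autocorrelation peak.

    The lag with the largest autocorrelation inside the allowed range is
    taken as the pitch period; the pitch is fs divided by that lag.
    """
    min_lag = int(fs / fmax)
    max_lag = min(int(fs / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    # Return 0.0 when no positive correlation peak was found (unvoiced frame).
    return fs / best_lag if best_lag else 0.0


fs = 8000
frame = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(400)]
f0 = autocorr_pitch(frame, fs)
print(f0)
```

For the 200 Hz test tone the strongest autocorrelation peak falls at a lag of 40 samples, which corresponds to 8000 / 40 = 200 Hz.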
In the proposed method, Maximum Likelihood based Pitch extraction is implemented. This is represented as,
where, “t” is the frame size of audio signal, “t” is the sampling time and “N” is the total size of audio signal. This is updated by using the objective function as,
The pitch frequency in each frame of the audio signal is calculated. The threshold value of the amplitude of the segmented audio signal is calculated by using
Then, the minimum and maximum peak values of the segmented signal are checked based on the threshold value. The peak value is estimated based on the positive and negative peak values lying on the left and right sides of the segmented signal. The positive small and large pitches and negative small and large pitches are obtained based on the peak values.
EMFCC-EPNCC is applied to extract features from the audio signal. In several feature analysis techniques, the signal intensity is estimated based on spectrum depth variation only. The proposed work combines Mel-frequency analysis with Power Normalized Cepstral Coefficients for speech signal analysis. This method filters out other signals present in the speech data using gammatone frequency integration. As a result, the signal features are clearer than those obtained with other feature extraction types. The audio signal is represented by a set of features.
Feature extraction is performed based on EMFCC and EPNCC to return the feature values computed from the audio signal sampled at fs (Hz). In the EMFCC process, a frame size of 20 is chosen from the sample size of the input audio signal. The audio signal is windowed to divide it into frames, and spectrum analysis is performed on every frame. Then, the Discrete Fourier Transform (DFT) is applied to the frames. Mel frequency warping is applied to the DFT output. The logarithm is applied to the filter bank outputs of the Mel frequency warping. The inverse DFT is applied to obtain the Mel cepstrum coefficients.
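The first stages of this pipeline (framing, windowing and the DFT) can be sketched as follows. The window type is not stated in this text, so a Hamming window is assumed. The frame length, hop size and function names are also illustrative choices, and the DFT is written out naively for clarity rather than speed.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window of length n (assumed window type)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """Power spectrum of a windowed frame for DFT bins 0..N/2."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


fs = 8000
signal = [math.sin(2.0 * math.pi * 1000.0 * t / fs) for t in range(256)]
frames = frame_signal(signal, frame_len=64, hop=32)
spectrum = power_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(len(frames), peak_bin)  # 7 frames; the 1000 Hz tone falls in bin 8
```

The tapered window reduces the discontinuities at the frame edges, so the energy of the test tone stays concentrated around its own DFT bin.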
In the EMFCC based audio feature extraction, the Mel Cepstrum is extracted from the transformation output. The input audio is divided into frames by applying the windowing function at fixed intervals. The distribution function for the window is defined as
Windowing involves multiplication of the time record using a finite-length window with a smoothly varying amplitude. This results in the continuous waveforms without sharp transitions. Windowing process minimizes the disruptions at the starting and end point of the frame. The output of the window is given as
where
A cepstral feature vector is generated for each frame and the DFT is applied to each frame. Mel frequency warping represented by the cosine transformation is applied to the DFT output. The cosine transform is described as
where,
The cosine transform is used to convert the log Mel spectrum back into the time domain, which yields the Mel cepstrum coefficients. The FFT is applied to calculate the spectrum from which the log Mel spectrum is formed. The main advantage of Mel frequency warping is the uniform placement of the triangular filters on the Mel scale between the lower and upper frequency limits of the Mel-warped spectrum.
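The uniform placement of triangular filters on the Mel scale, followed by the logarithm and the cosine transform, can be sketched as below. The Mel mapping used here, mel = 2595·log10(1 + f/700), is a widely used form and is an assumption, since the formula itself is not reproduced in this text. The filter count, the number of coefficients and all function names are likewise illustrative.

```python
import math


def hz_to_mel(f):
    """Assumed Mel mapping (common 2595/700 form)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / fs)) for f in edges_hz]
    bank = []
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):      # rising slope
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling slope
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank


def cepstral_coefficients(power_spec, fs, num_filters=10, num_coeffs=5):
    """Log filter-bank energies followed by a DCT-II."""
    n_fft = 2 * (len(power_spec) - 1)
    bank = mel_filterbank(num_filters, n_fft, fs)
    # A small constant avoids log(0) for empty filters.
    log_energy = [math.log(sum(w * p for w, p in zip(row, power_spec)) + 1e-10)
                  for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energy))
            for c in range(num_coeffs)]


coeffs = cepstral_coefficients([1.0] * 33, fs=8000)
print(len(coeffs))
```

Because the filter edges are equally spaced in Mel rather than in Hz, the filters are narrow at low frequencies and progressively wider at high frequencies.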
The Mel frequency warping is calculated using the formula
Here “
The output of the Mel-frequency warping is shown in
The EMFCC-EPNCC is applied for extracting the audio features.
2) EMFCC-EPNCC Algorithm
The EPNCC extraction process involves frequency-to-Mel conversion, Mel-to-frequency conversion and a cosine transform. In the EPNCC-EMFCC method, the frequency-to-Mel conversion is performed to extract the spectral data of the signal based on the peak and pitch variation. The Mel-to-frequency conversion is performed to filter out other frequency signals in the frequency domain. The window size of the filtered signal is initialized by using the equations
where, “
Then, the Mel frames are converted into frequency by using the equation
The cosine transform is applied by using
The output of the DCT process is shown in
where,
The Filter Coefficient “FL” is extracted by using the equation
The audio feature output is obtained from the product of the cosine transformation value, logarithmic value of the magnitude and filter coefficient.
The index difference plot is shown in
The audio feature from the selected feature vectors obtained from the segmentation based on peak and pitch estimation is applied to the classification process. Classification of audio signal is performed using the PNN classifier, based on the testing features. Multi-label feature analysis is presented in the proposed work. Hence, a multi-class classifier model is implemented. Compared with other types of classifier, PNN provides a better prediction of the classified label using the probability estimation based on the neural network function.
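The probability-based decision of a PNN can be sketched as follows. The sketch assumes the usual PNN structure: a pattern layer with one Gaussian kernel per training sample, a summation layer that averages the kernel outputs per class, and a decision layer that picks the class with the largest estimate. The smoothing parameter `sigma`, the toy feature vectors and the function name are assumptions, not values taken from this paper.

```python
import math


def pnn_classify(x, training, sigma=0.5):
    """Classify feature vector x with a Gaussian-kernel PNN.

    training: dict mapping each class label to a list of feature vectors.
    The constant normalisation factor of the Gaussian is omitted because
    it is identical for every class and does not change the decision.
    """
    scores = {}
    for label, patterns in training.items():
        total = 0.0
        for w in patterns:
            dist2 = sum((a - b) ** 2 for a, b in zip(x, w))
            total += math.exp(-dist2 / (2.0 * sigma ** 2))  # pattern layer
        scores[label] = total / len(patterns)               # summation layer
    best = max(scores, key=scores.get)                      # decision layer
    return best, scores


training = {
    "music": [[1.0, 1.0], [1.2, 0.9]],
    "non-music": [[-1.0, -1.0], [-0.8, -1.1]],
}
label, scores = pnn_classify([0.9, 1.0], training)
print(label)  # music
```

There is no iterative training step: the stored training vectors act directly as the pattern-layer weights, so learning is fast while the classification cost grows with the size of the training set.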
Neural networks are frequently used for the classification of signals. The PNN learns more quickly than other neural network models. Hence, it is used for the classification of the audio signal. The Probability Density Function (PDF) for a single sample is calculated as the output of a neuron of the pattern layer. This is given as
where “Y” denotes the unknown input vector. “
where
where
where, “e” is the feature vector of input signal, and “
This section illustrates the performance evaluation and comparative analysis of the proposed approach with the existing techniques. The datasets obtained from Ffuhrmann [
・ Precision
・ NMI
・ F-Score
・ Entropy
The comparison of the Precision, NMI, F-score and Entropy of the proposed approach and existing features is shown in
Methods | Precision | NMI | F-Score | Entropy |
---|---|---|---|---|
Acoustic features (spectral clustering) | 0.373 | 0.129 | 0.502 | 1.405 |
Acoustic features (spectral rotation) | 0.384 | 0.127 | 0.485 | 1.389 |
fMRI-measured features of SVR | 0.406 | 0.213 | 0.539 | 1.304 |
fMRI-measured features using high-level features | 0.423 | 0.21 | 0.543 | 1.262 |
fMRI-measured features of ITGP | 0.485 | 0.294 | 0.585 | 1.155 |
Integrated features of kernel addition | 0.52 | 0.323 | 0.61 | 1.083 |
Integrated features of kernel product | 0.499 | 0.324 | 0.583 | 1.117 |
Integrated features of CCA | 0.51 | 0.317 | 0.599 | 1.1 |
Integrated features of ITGP | 0.541 | 0.337 | 0.623 | 1.079 |
Proposed ASFEC approach | 0.718 | 0.412 | 0.7195 | 1.6875 |
Process (ITGP), integrated features of kernel addition, kernel product, Canonical Correlation analysis (CCA) and ITGP. The precision, NMI, F-score and Entropy of the proposed approach are found to be relatively higher than the acoustic and integrated features [
Acoustic features: Only the acoustic features are used in this experiment for the audio signal clustering. Spectral rotation is used to replace the K-means in the spectral algorithm. The performance of the spectral rotation has proven to be better than the spectral clustering approach.
fMRI-measured features of SVR: First, the SVR model is trained by adopting the fMRI-measured features and acoustic features of audio selections and applied to predict the fMRI-features of the audio samples.
fMRI-measured features of ITGP: The ITGP model is trained with fMRI-measured features and acoustic features of audio selections and applied to predict the fMRI-features of the audio samples.
Integrated features of Kernel addition: The kernel addition method is applied on the fMRI-measured features and acoustic features of testing audio samples. First, the kernels are integrated by adding them and the Eigen vectors of the Laplacian of the integrated kernel are computed. Then, a matrix is generated by using the Eigen vectors as columns. Finally, each row of this matrix is considered as an integrated feature.
Integrated features of kernel product: The corresponding elements of kernels of different views are multiplied with each other to form the integrated kernel.
Integrated features of CCA: The correlated features are extracted from the fMRI-measured features and acoustic features.
Precision is defined as the ratio of the number of correct results to the number of predicted results.
NMI is an increasingly prevalent measure for evaluating the level of agreement between two affinity matrices formed by the predicted labels and the true labels of the audio samples.
where p(x) and p(y) are the marginal probabilities and p(x,y) is the joint probability.
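The quantities p(x), p(y) and p(x,y) can be estimated from label counts, as in the sketch below. The normalisation used here, dividing the mutual information by the square root of the product of the two label entropies, is one common convention and is an assumption, since the formula is not reproduced in this text. The function names are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy of a labelling, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, predicted_labels):
    """NMI = I(X;Y) / sqrt(H(X) H(Y)) (assumed normalisation)."""
    n = len(true_labels)
    px = Counter(true_labels)
    py = Counter(predicted_labels)
    pxy = Counter(zip(true_labels, predicted_labels))
    mi = 0.0
    for (x, y), count in pxy.items():
        joint = count / n
        mi += joint * math.log(joint / ((px[x] / n) * (py[y] / n)))
    hx, hy = label_entropy(true_labels), label_entropy(predicted_labels)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a single-class labelling carries no information
    return mi / math.sqrt(hx * hy)


# Identical partitions with swapped label names agree perfectly.
print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))
# Statistically independent partitions share no information.
print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```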
F-score is the weighted harmonic mean of the precision and recall values. Recall is defined as the ratio of the number of correct results to the number of results that should have been returned. Higher values of precision, NMI and F-score indicate improved efficiency in the segmentation and classification of the audio signal.
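Precision, recall and F-score can be computed from the counts of correct and incorrect results, as in this sketch. It assumes the balanced F-score, in which precision and recall are weighted equally; the weighting used in the paper is not stated here. The function name and the example counts are hypothetical.

```python
def precision_recall_fscore(tp, fp, fn):
    """Compute precision, recall and the balanced F-score.

    tp: correct results that were returned
    fp: incorrect results that were returned
    fn: correct results that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    fscore = 2.0 * precision * recall / (precision + recall)
    return precision, recall, fscore


precision, recall, fscore = precision_recall_fscore(tp=8, fp=2, fn=4)
print(round(precision, 4), round(recall, 4), round(fscore, 4))  # 0.8 0.6667 0.7273
```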
The entropy is the sum of the individual entropies for the classification process weighted according to the classification quality. Higher entropy values indicate better classification results.
where, “H” is the entropy of the discrete random variable “X”. “P” is the probability of X and “I” is the information content of “X”. I(X) is a random variable.
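Following the definition above, the entropy is the expected value of the information content I(X). The sketch below assumes the Shannon form with base-2 logarithms, so that I(x) = −log2 P(x) and the result is in bits. The base and the function names are assumptions.

```python
import math


def information_content(p):
    """Information content I(x) = -log2 P(x), in bits."""
    return -math.log2(p)


def entropy(probabilities):
    """H(X) = sum of P(x) * I(x) over all outcomes with P(x) > 0."""
    return sum(p * information_content(p) for p in probabilities if p > 0.0)


print(entropy([0.5, 0.5]))                # 1.0 bit for two equally likely classes
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits for four equally likely classes
```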
The ROC curve is a graphical plot that shows the performance of the PNN classifier for the classification of audio signal. The true positive rate is plotted with respect to the false positive rate at various threshold settings. The ROC curve is generated by plotting the cumulative distribution function of the true detection probability versus the false-alarm probability. Each point on the ROC plot represents a pair of the sensitivity/specificity values corresponding to the specific decision threshold value. The proximity of the ROC plot to the upper left corner indicates the higher accuracy of the classification process.
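The threshold sweep behind an ROC curve can be sketched as follows. The classifier scores and labels are made-up values for illustration, the function name is hypothetical, and the sketch assumes that both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    scores: classifier scores, higher meaning more likely positive
    labels: 1 for the positive class, 0 for the negative class
    Each distinct score is used once as the decision threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


points = roc_points([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

The point (0.0, 1.0) in this example is the upper left corner: every positive is accepted and no negative is, which corresponds to a perfectly separating threshold.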
The FRR is defined as the ratio of the number of false rejections to the number of the classified signals.
FAR typically is defined as the ratio of the number of false acceptances to the number of classified signals.
The GAR is the fraction of the genuine scores exceeding the threshold value. The higher the GAR value, the higher the classification efficiency.
The sensitivity is a measure of the actual members of the class that are correctly identified. It is defined as the ratio of the positively classified instances that are predicted correctly by the PNN classifier.
Here, True Positive (TP) is the number of music signals that are correctly classified as musical signals and False Negative (FN) is the number of music signals that are incorrectly classified as non-musical signals.
Specificity is also referred to as the true negative rate. It is defined as the ratio of the negatively classified instances that are predicted correctly by the PNN classifier.
Here, True Negative (TN) is the number of non-musical signals that are correctly classified as non-musical signals and False Positive (FP) is the number of non-musical signals that are incorrectly classified as musical signals.
Accuracy is defined as the ratio of number of correctly classified results to the total number of the classified results. The performance of the classifier is determined based on the number of samples that are correctly and incorrectly predicted by the classifier.
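The measures defined in this section can be computed together from the four confusion counts, as in the sketch below. FAR and FRR follow the definitions given above, with the number of classified signals as the denominator. Taking GAR as 1 − FRR is an assumption, chosen because it agrees with the reported averages (99.67% and 0.33%). The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from the confusion counts."""
    total = tp + tn + fp + fn
    frr = fn / total                      # false rejections / classified signals
    return {
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
        "accuracy": (tp + tn) / total,
        "FAR": fp / total,                # false acceptances / classified signals
        "FRR": frr,
        "GAR": 1.0 - frr,                 # assumption: GAR = 1 - FRR
    }


metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```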
The comparative analysis of the sensitivity, specificity and accuracy with respect to the prediction rate is shown in the
GTZAN dataset: It is composed of 1000 30-second clips covering 10 genres such as blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, rock, with 100 clips per genre.
MTG dataset: It consists of approximately 2500 excerpts of Western music labeled into 11 classes of pitched instruments (cello, clarinet, flute, acoustic guitar, electric guitar, Hammond organ, piano, saxophone, trumpet, violin and singing voice) and two classes of drums and no-drums. The class labels are applied to the predominant instrument over a 3-second snippet of polyphonic music.

Parameters | Value |
---|---|
Average GAR | 99.67% |
Average FAR | 0.33% |
Average FRR | 0.33% |
Average Accuracy | 96.50% |
Average Error Rate | 3.50% |

Overall accuracy (%):

Dataset | ODL | K-means | Exemplar | EMFCC-EPNCC with PNN |
---|---|---|---|---|
GTZAN Dataset | 88 | 95.4 | 95.7 | 96.2 |
MTG Dataset | 87.9 | 91.8 | 94.5 | 97.3 |
The conclusion and future work of the proposed approach are discussed in this section. An efficient approach for segmentation, feature extraction and classification of audio signals is presented in this paper. Audio segmentation is performed by extracting the signal amplitude according to the window selection of the sampling time. From this segmented output, EMFCC is applied, in combination with the peak-estimated signal features, to extract the testing features for the classification process. This yields 41 feature vectors for the audio signal. The PNN classifier is used for the classification of the audio signal. From this classification result, the category of the given audio input is specified. The audio signal is classified as a musical or non-musical signal based on the testing features. If it is detected as a musical signal, it is further labeled as Piano, Guitar, etc.
The proposed approach achieves better performance in terms of precision, NMI, F-score and entropy. The PNN classifier achieves low FRR and FAR and high GAR, sensitivity, specificity and accuracy with respect to the number of classes. In future work, the audio signal will be segmented from the given input and the various frequencies present in a single audio input will be separated. The separated frequencies will then be retrieved by classifying the features of the segmented signal frequency.
Muthumari Arumugam, Mala Kaliappan (2016) An Efficient Approach for Segmentation, Feature Extraction and Classification of Audio Signals. Circuits and Systems, 07, 255-279. doi: 10.4236/cs.2016.74024