Low-Rank Sparse Representation with Pre-Learned Dictionaries and Side Information for Singing Voice Separation

Although speech separation has achieved fruitful results, the separation of singing voice and accompaniment is still far from ideal. Based on low-rank and sparse optimization theory, we propose a new singing voice separation algorithm called Low-rank, Sparse Representation with pre-learned dictionaries and side Information (LSRi). The algorithm models the vocal and instrumental spectrograms as a sparse matrix and a low-rank matrix, respectively, and combines pre-learned dictionaries with the voice spectrogram reconstructed from the annotation. Evaluations on the iKala dataset show that the proposed method is effective and efficient for singing voice separation.


Introduction
Separating the singing voice from a music recording is useful in many applications, such as music information retrieval, singer identification, and lyrics recognition and alignment [1]. Although the human auditory system can easily distinguish the vocal and instrumental parts of a music recording, this is extremely difficult for computer systems. In this context, researchers are increasingly concerned with the mining of music information, and many algorithms have been proposed to separate the singing voice from music recordings.
Robust Principal Component Analysis (RPCA) is a matrix factorization algorithm for recovering underlying low-rank and sparse matrices [2]. Suppose we are given a large data matrix $M$ and know that it may be decomposed as $M = A + E$, where $A$ is a low-rank matrix and $E$ is a sparse matrix. Based on RPCA, Huang et al. [3] separated the singing voice from the music accompaniment. They assumed that the repetitive music accompaniment lies in a low-rank subspace, while the singing voice can be regarded as sparse within songs. The main drawback of this approach is that it is completely unsupervised: the decomposition is guided only by the particular properties of each individual component. Later, Yu et al. [4] utilized pre-learned universal voice and music dictionaries obtained from isolated singing voice and background music training data.
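For reference, the RPCA decomposition described above is usually posed as the following convex program. This is the standard textbook form and is given here only as a sketch; the weighting parameter $\lambda$ and the norm notation are assumptions, since the text does not write the objective out.

```latex
\min_{A,\,E} \; \|A\|_{*} + \lambda \|E\|_{1}
\quad \text{subject to} \quad M = A + E
```

Here $\|A\|_{*}$ is the nuclear norm (sum of singular values), which promotes a low-rank $A$, and $\|E\|_{1}$ is the element-wise $\ell_1$ norm, which promotes a sparse $E$.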

The Proposed Method
Before presenting our method, we review the Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) method [4] (model 1). In this model, $X$ is the input spectrogram, $D_1$ is a pre-learned dictionary of the music accompaniment, $D_2$ is a pre-learned dictionary of the singing voice, $D_1 Z_1$ is the separated instrumentals, $D_2 Z_2$ is the separated voice, and $E$ denotes the residual part. In our model (model 2), we consider additional prior information, i.e., the reconstructed voice spectrogram from the annotation. All parameters in model 2 are in accordance with model 1, and $E_0$ denotes the reconstructed voice spectrogram from the annotation.
$\|\cdot\|_F$ denotes the Frobenius norm. In the following, we also use the ADMM algorithm [10] to solve the optimization problem, by introducing two auxiliary variables $J_1$ and $J_2$ as well as three equality constraints. In the resulting unconstrained augmented Lagrangian $\mathcal{L}$, $Y_1, Y_2, Y_3$ are the Lagrange multipliers. We then iteratively update the solutions for $J_1$, $Z_1$, $J_2$ and $Z_2$.
1) Update $J_1$.
3) Update $J_2$: this subproblem can be solved by the soft-threshold operator. Since the spectrogram is non-negative, the result is kept non-negative by comparison with $\mathbf{0}$, an all-zero matrix of the same size as $J_2$.
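The operators named in the update steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact update rules: the thresholds and the arguments passed to each operator depend on the augmented Lagrangian, which is not reproduced here. The use of singular value thresholding for the low-rank variable is an assumption based on standard practice for nuclear-norm terms, and all function names are hypothetical.

```python
import numpy as np

def soft_threshold(A, tau):
    """Element-wise shrinkage: sign(a) * max(|a| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def nonneg_soft_threshold(A, tau):
    """Soft-threshold, then keep the result non-negative by taking the
    element-wise maximum with an all-zero matrix of the same size."""
    return np.maximum(soft_threshold(A, tau), np.zeros_like(A))

def singular_value_threshold(A, tau):
    """Shrink the singular values of A by tau (assumed form of a
    low-rank update; not confirmed by the text)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt
```

In an ADMM loop these operators would be applied once per iteration to the current estimates, with thresholds scaled by the penalty parameter.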

Dataset
Our experiment was conducted on the iKala dataset [9]. The iKala dataset contains 252 30-second clips of Chinese popular songs in CD quality. In the following experiments, we randomly select 44 songs for training (i.e., learning the dictionaries $D_1$ and $D_2$), leaving 208 songs for testing the separation performance.
To reduce the computational cost and the memory footprint of the proposed algorithm, we downsample all the audio recordings from 44,100 Hz to 22,050 Hz. We then compute the STFT of each recording by sliding a Hamming window of 1411 samples with 75% overlap to obtain the spectrogram.
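The spectrogram computation described above can be sketched with NumPy alone. This is an illustrative sketch, assuming the signal has already been downsampled to 22,050 Hz, that frames are not padded, and that the magnitude of the STFT is used; the function name and these choices are assumptions and not part of the original experiment code.

```python
import numpy as np

def magnitude_spectrogram(x, win_length=1411, overlap=0.75):
    """Magnitude STFT with a sliding Hamming window.

    x          : 1-D signal, at least win_length samples long
    win_length : window size in samples (1411 in the text)
    overlap    : fraction of overlap between frames (0.75 in the text)
    """
    x = np.asarray(x, dtype=float)
    # 75% overlap of a 1411-sample window gives a hop of 353 samples.
    hop = win_length - int(round(overlap * win_length))
    window = np.hamming(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop:i * hop + win_length] * window for i in range(n_frames)],
        axis=1,
    )
    # One column per frame, one row per frequency bin (win_length // 2 + 1).
    return np.abs(np.fft.rfft(frames, axis=0))
```

For a one-second signal at 22,050 Hz this yields a matrix with 706 frequency bins and 59 frames, and every entry is non-negative, which is the property the non-negativity constraint in the optimization relies on.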

Dictionary and E0
Our implementation of Online Dictionary Learning for Sparse Coding (ODL) [12] is based on the SPAMS toolbox. Given $N$ signals $x_i \in \mathbb{R}^m$, ODL learns a dictionary $D$ by solving a joint optimization problem [12]. Following [8], we define the dictionary size to be 100 atoms.
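For orientation, the joint optimization problem solved by ODL is commonly written in the form below. This is the usual formulation of online dictionary learning and is offered as a sketch; the sparse codes $\alpha_i$ and the constraint set $\mathcal{C}$ (dictionaries whose columns have bounded norm) are symbols assumed here, and any additional constraints used in the experiments are not stated in the text.

```latex
\min_{D \in \mathcal{C},\; \alpha_1, \dots, \alpha_N}
\; \frac{1}{N} \sum_{i=1}^{N}
\left( \frac{1}{2} \left\| x_i - D \alpha_i \right\|_2^2
     + \lambda \left\| \alpha_i \right\|_1 \right)
```

The first term measures how well each signal $x_i$ is reconstructed from the dictionary, and the second term, weighted by $\lambda$, encourages each code $\alpha_i$ to be sparse.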
To obtain the reconstructed voice spectrogram from the annotation ($E_0$), we first transform the human-labeled vocal pitch contours into a time-frequency binary mask $M$. The authors of [13] proposed a harmonic mask similar to that of [14], which only passes integer multiples of the vocal fundamental frequency [15] [16]. In this mask, $F(t)$ is the vocal fundamental frequency at time $t$, $n$ is the order of the harmonic, and $w$ is the width of the mask. We then simply define the vocal annotations as $E_0 = X \odot M$, where $\odot$ denotes the Hadamard product.
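The construction of $E_0$ can be sketched as follows. This is a minimal sketch under an assumed mask rule: a time-frequency bin passes if its frequency lies within $w/2$ of some harmonic $n F(t)$. That rule, the convention that unvoiced frames have $F(t) = 0$, and the function names are assumptions for illustration; the exact mask definition is given in [13].

```python
import numpy as np

def harmonic_mask(f0, freqs, width):
    """Binary mask M of shape (len(freqs), len(f0)).

    f0    : vocal fundamental frequency per frame in Hz (0 = unvoiced, assumed)
    freqs : centre frequency of each spectrogram bin in Hz, ascending
    width : mask width w in Hz
    """
    f0 = np.asarray(f0, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    mask = np.zeros((len(freqs), len(f0)), dtype=bool)
    for t, f in enumerate(f0):
        if f <= 0:
            continue  # no voice annotated in this frame
        # Harmonic orders n = 1, 2, ... that can still reach the top bin.
        n_max = int((freqs[-1] + width / 2) // f)
        harmonics = f * np.arange(1, n_max + 1)
        dist = np.abs(freqs[:, None] - harmonics[None, :])
        mask[:, t] = (dist < width / 2).any(axis=1)
    return mask

def vocal_annotation(X, mask):
    """E0 = X (Hadamard product) M."""
    return X * mask
```

A wider mask passes more energy around each harmonic, which makes $E_0$ more tolerant of pitch annotation errors at the cost of admitting more accompaniment.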

Evaluation
Separation performance is measured with the BSS EVAL toolbox version 3.0. We use the source-to-interference ratio (SIR), source-to-artifacts ratio (SAR) and source-to-distortion ratio (SDR) provided by this commonly used toolbox. Let $\hat{v}$ denote the separated singing voice and $v$ the original clean singing voice; the SDR [17] is then computed as $\mathrm{SDR}(\hat{v}, v)$. The normalized SDR (NSDR) is the improvement of SDR from the original mixture $x$ to the separated singing voice $\hat{v}$ [18] [19], and is commonly used to measure the separation performance for each mixture,
$$\mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v).$$
For overall performance evaluation, the global NSDR (GNSDR) is calculated as
$$\mathrm{GNSDR} = \frac{\sum_{i=1}^{N} w_i \, \mathrm{NSDR}(\hat{v}_i, v_i, x_i)}{\sum_{i=1}^{N} w_i},$$
where $N$ is the total number of songs and $w_i$ is the length of the $i$-th song.
Higher values of SIR, SAR, SDR, GSIR, GSAR, GSDR and GNSDR indicate better separation quality.
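The NSDR and GNSDR computations can be sketched as below. The single-formula SDR used here, $10 \log_{10} \big( \langle \hat{v}, v \rangle^2 / (\|\hat{v}\|^2 \|v\|^2 - \langle \hat{v}, v \rangle^2) \big)$, is a form commonly used in the singing voice separation literature and is an assumption in this sketch; the reported results come from the BSS EVAL toolbox, whose decomposition-based computation is not reproduced. Function names are hypothetical.

```python
import numpy as np

def sdr(estimate, reference):
    """Source-to-distortion ratio in dB (assumed single-formula form).

    Undefined when the estimate is an exact scalar multiple of the
    reference, because the denominator is then zero.
    """
    est = np.asarray(estimate, dtype=float)
    ref = np.asarray(reference, dtype=float)
    inner_sq = np.dot(est, ref) ** 2
    denom = np.dot(est, est) * np.dot(ref, ref) - inner_sq
    return 10.0 * np.log10(inner_sq / denom)

def nsdr(estimate, reference, mixture):
    """Improvement of SDR from the mixture to the separated voice."""
    return sdr(estimate, reference) - sdr(mixture, reference)

def gnsdr(nsdr_values, lengths):
    """Length-weighted mean of per-song NSDR values."""
    nsdr_values = np.asarray(nsdr_values, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return float(np.sum(lengths * nsdr_values) / np.sum(lengths))
```

Because all iKala clips have the same duration, the weights $w_i$ are equal on that dataset and GNSDR reduces to the plain mean of the per-clip NSDR values.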

Parameter Selection
During parameter selection, we use the global normalized source-to-distortion ratio (GNSDR) as the evaluation index: the higher the value, the better the separation quality. In our algorithm, the remaining parameters are set similarly to [9], and here we only adjust $\gamma$.
Figure 1 presents the GNSDR for the separated singing voice and background music, using LSRi. In the vocal part, the GNSDR first increases monotonically with $\gamma$ and then gradually decreases. When $\gamma = 5$, LSRi achieves the overall highest GNSDR. In the accompaniment part, the GNSDR increases first and remains steady after $\gamma = 5$. Therefore, we set $\gamma = 5$.

Comparison Results
We compare three different low-rank and sparse algorithms on the iKala dataset:
• RPCA: unsupervised method proposed by Huang et al. [3], using the default parameter value $\lambda = 1/\sqrt{\max(m, n)}$.
• LSPD: supervised method proposed by Yu et al. [4], using default parameter values.
• LSRi: the proposed method with low-rank representation and the reconstructed voice spectrogram from the annotation.
As shown in Table 1, for both the singing voice and the accompaniment, our method achieves a higher global normalized source-to-distortion ratio (GNSDR), which suggests that LSRi performs well in overall separation quality and that the introduction of prior knowledge improves the separation performance. In the vocal part, our algorithm achieves a higher GSIR than RPCA and LSPD, which shows that LSRi is better able to remove the instrumental sounds. In the background music part, our algorithm also achieves a higher GSIR, which suggests that LSRi is better able to remove the singing voice and performs better in limiting artifacts during the separation process. However, the GSAR values did not improve significantly, which indicates that there is still room to reduce the interference introduced by the algorithm.

Conclusion
In this paper, we have presented a time-frequency based source separation algorithm for music signals. LSRi models the vocal and instrumental spectrograms as a sparse matrix and a low-rank matrix, respectively, and the components that are not identified by either part are assigned to a residual term. The dictionaries for the singing voice and background music are pre-learned from isolated singing voice and background music training data, respectively. Furthermore, LSRi incorporates vocal annotation information, through which prior knowledge of the voice and background music is introduced into the source separation process. Evaluations on the iKala dataset show that the proposed method achieves better performance for both the separated singing voice and the music accompaniment. In future studies, we will consider applying LSRi to the separation of complete songs.

$\lambda_1, \lambda_2$ are two weighting parameters for balancing the different regularization terms in this model. Compared with the unsupervised RPCA algorithm, the LSPD algorithm adds pre-learned dictionary information and improves the separation quality. To further improve the separation quality of singing voice and music accompaniment, we propose Low-rank, Sparse Representation with pre-learned dictionaries and side Information (LSRi).
C. H. Yang, H. J. Zhang, Advances in Pure Mathematics, DOI: 10.4236/apm.2018.84024

Figure 1. Separation performance measured by GNSDR for the singing voice (left) and background music (right), using our proposed method LSRi.
They proposed Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) for singing voice separation. Chan et al. [5] proposed a modified RPCA algorithm. This work represented one of the first attempts to incorporate vocal activity information into the RPCA algorithm, after which vocal activity detection was widely studied [6] [7]. Chan et al. [8] proposed to separate the singing voice by group-sparse representation, using pitch annotations to guide the separation. In this paper, we present a model named Low-rank, Sparse representation with pre-learned dictionaries and side information (LSRi) under the ADMM framework. First, we pre-learn voice and music dictionaries from isolated singing voice and background music training data, respectively. Then, we use a sparse spectrogram and a low-rank spectrogram to model the singing voice and the background music, respectively.
In the dictionary learning problem, $\|\cdot\|_2$ denotes the Euclidean norm and $\lambda$ is a regularization parameter. The input frames are extracted from the training set after the short-time Fourier transform (STFT).

Table 1. Separation quality for the singing voice and music accompaniment on the iKala dataset for RPCA, LSPD and LSRi.