Metasample-Based Robust Sparse Representation for Tumor Classification

In this paper, building on sparse representation classification and the idea of robustness, we propose a new classifier for DNA microarray data, named MRSRC (Metasample-Based Robust Sparse Representation Classifier). First, we extract metasamples from the training samples. Second, a weight matrix W is introduced into an l1-regularized least squares problem. Finally, the testing sample is classified according to its sparse coefficient vector. Experimental results on DNA microarray data classification show that the proposed algorithm is efficient.


Introduction
A tumor is a neoplasm arising from an abnormal growth of cells. Accurate, effective, and prompt treatment is necessary for the patient. Before treatment, however, the tumor must be classified correctly, because tumors are of many types and a wrong diagnosis can turn the treatment itself into a further harm.
A DNA microarray is a biotechnology that simultaneously monitors the expression of tens of thousands of genes in cells. Many methods have been used for tumor classification through microarray gene expression profiling, such as independent component analysis (ICA) and nonnegative matrix factorization (NMF). Because biological data are large in scale and complicated in structure, they are difficult to store, search, process, and analyze, and they pose a big challenge to data mining technology. A new, efficient data mining technique is therefore necessary for improving the accuracy of tumor classification.
Sparse representation has been successfully used in image processing [1], DNA microarray data classification [2], and text classification. Intuitively, the sparsity of the coding coefficient vector of a sample can be measured by its $l_0$-norm or $l_1$-norm (the $l_1$-norm is the closest convex function to the $l_0$-norm). $l_1$-norm minimization is widely applied in sparse coding. Generally, the sparse coding problem can be formulated as

$$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad \|y - D\alpha\|_2^2 \le \varepsilon, \qquad (1)$$

where $y$ is a given signal, e.g. the gene expression profile of a sample, $D$ is the dictionary of coding atoms, $\alpha$ is the coding vector of $y$ over $D$, and $\varepsilon > 0$ is a constant.
Eq. (1) can be interpreted for gene expression data as follows. By coding a DNA microarray sample $y$ as a sparse linear combination of the training samples $D$ via $l_0$-norm or $l_1$-norm minimization (here we use $l_1$-norm minimization), SRC (Sparse Representation Classification) classifies $y$ by evaluating which class of training samples yields the minimal reconstruction error with the associated coding coefficients. There are two important issues in this model. The first is whether $\|\alpha\|_1$ is good enough to represent the signal sparsely. Much work has addressed it, for example adding a nonnegativity constraint to $\alpha$ [3], introducing a Laplacian term of the coefficients in sparse coding [4], designing the sparsity regularization terms with Bayesian methods [5], and using the weighted $l_2$-norm in the sparsity constraint [6], so the first issue has been addressed well. The second is whether the $l_2$-norm term $\|y - D\alpha\|_2^2$ is effective enough to represent the signal fidelity when $y$ is noisy or has many outliers. So far, only [7,8] have used the $l_1$-norm to define the coding fidelity, $\|y - D\alpha\|_1$. To address this problem, we first extract metasamples from the training samples of each class and then use them to construct the dictionary. The detailed procedure is given in the following sections.
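The $l_1$-regularized coding problem discussed above can be sketched numerically. The following is a minimal illustration, not the solver used in this paper: it assumes the penalized (Lagrangian) form $0.5\|y - D\alpha\|_2^2 + \lambda\|\alpha\|_1$ and solves it with ISTA, a standard proximal gradient method. The names `ista_lasso`, `soft_threshold`, and `lam`, as well as the toy data, are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (element-wise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_lasso(D, y, lam, n_iter=500):
    """Approximately minimize 0.5 * ||y - D a||_2^2 + lam * ||a||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / (largest singular value)^2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)             # gradient of the quadratic term
        a = soft_threshold(a - step * grad, step * lam)
    return a

def objective(D, y, a, lam):
    """Value of the penalized sparse coding objective."""
    return 0.5 * np.sum((y - D @ a) ** 2) + lam * np.sum(np.abs(a))

# Toy data (assumption): 20 unit-norm atoms of dimension 50, 2 active atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 20))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(20)
a_true[[2, 7]] = [1.5, -2.0]
y = D @ a_true

a_hat = ista_lasso(D, y, lam=0.05)
top_two = sorted(int(i) for i in np.argsort(-np.abs(a_hat))[:2])
```

In this toy setting the two largest coefficients of `a_hat` fall on the atoms that generated `y`, which is the behavior SRC relies on when it compares class-wise reconstruction errors.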

Metasamples of Gene Expression Data
A typical characteristic of DNA microarray data is that the number of genes far exceeds the number of samples: there are usually only hundreds of samples, but thousands of genes per sample. As a result, many classic classification methods cannot be applied directly to DNA microarray data. Fortunately, selecting relevant features or extracting new features can address this issue. Many studies have investigated gene selection for classifying tumor samples, e.g. [10][11][12], and the extraction of new features, e.g. metasamples [16]. Alter et al. [13] used SVD to transform gene expression data from the "genes × samples" space to the diagonalized "eigengenes × eigenarrays" space, where the eigengenes are unique orthonormal superpositions of the genes. Brunet et al. [14] used NMF to describe gene expression data through a small number of metasamples (Figure 1).

A metasample of gene expression data is defined as a linear combination of several samples. We factorize the gene expression data matrix $A$ into two matrices,

$$A \approx WH, \qquad (2)$$

where $A$ is of size $m \times n$, with $m$ the number of genes and $n$ the number of samples. Each column of $A$ represents the expression levels of the genes in one sample, and each row represents the expression level of one gene across all samples. Matrix $H$ is of size $k \times n$ and matrix $W$ is of size $m \times k$, where $k$ is the number of metasamples.

From the above analysis, it can be seen that there are many methods to extract metasamples, such as SVD, PCA [13], ICA [15], and NMF [14]. Because SVD is simple and fast, we use it in this paper to extract the metasamples.
We extract the metasamples from the samples of class $i$, denote them as $W_i$, and then concatenate the $W_i$ of all classes to construct the new dictionary $D$.
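The dictionary construction described above can be sketched as follows. This is a minimal illustration under one assumption that the text does not spell out: the metasamples of a class are taken to be the first $k$ left singular vectors of that class's genes × samples matrix. The function names `extract_metasamples` and `build_dictionary` and the toy data are hypothetical.

```python
import numpy as np

def extract_metasamples(A_class, k):
    """Return k metasamples (columns) of a genes x samples matrix via SVD.

    Assumption: the metasamples are the first k left singular vectors.
    """
    U, s, Vt = np.linalg.svd(A_class, full_matrices=False)
    return U[:, :k]

def build_dictionary(A, labels, k):
    """Stack the per-class metasamples side by side to form the dictionary D."""
    classes = np.unique(labels)
    blocks = [extract_metasamples(A[:, labels == c], k) for c in classes]
    D = np.hstack(blocks)
    atom_labels = np.repeat(classes, k)   # class label of every column of D
    return D, atom_labels

# Toy data (assumption): 100 genes, 12 samples, two classes of 6 samples each.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 12))
labels = np.array([0] * 6 + [1] * 6)
D, atom_labels = build_dictionary(A, labels, k=3)
```

With 3 metasamples per class, the dictionary has 6 columns instead of 12, which is how metasamples shrink the coding problem when the number of training samples is large.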

The RSRC (Robust Sparse Representation Coding) model
The sparse coding model in Eq. (1) is equivalent to the LASSO problem [22]:

$$\min_{\alpha} \|y - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_1 \le \sigma, \qquad (4)$$

where $\sigma > 0$ is a constant, $D$ is the dictionary with column vector $d_j$ being the $j$th atom, and $\alpha$ is the coding coefficient vector. Here the residual $e = y - D\alpha$ is assumed to follow a Gaussian distribution, so the sparse coding problem in Eq. (4) is essentially a sparsity-constrained least squares estimation problem. If $e$ follows a Laplacian distribution, the problem becomes

$$\min_{\alpha} \|y - D\alpha\|_1 \quad \text{s.t.} \quad \|\alpha\|_1 \le \sigma. \qquad (5)$$

However, the residual $e$ may be far from Gaussian or Laplacian because of noise or outliers, so the conventional sparse coding models in Eq. (4) and Eq. (5) may not be robust and effective enough for DNA microarray data classification. To build a more robust model, we rewrite $D$ as $D = [r_1; r_2; \ldots; r_n]$, where the row vector $r_i$ is the $i$th row of $D$, and set $e = y - D\alpha = [e_1; e_2; \ldots; e_n]$ with $e_i = y_i - r_i\alpha$. The RSRC can then be formulated as [23]:

$$\min_{\alpha} \sum_{i=1}^{n} \rho_\theta(y_i - r_i\alpha) \quad \text{s.t.} \quad \|\alpha\|_1 \le \sigma, \qquad (6)$$

where $\rho_\theta(e_i) = -\ln f_\theta(e_i)$ and $f_\theta$ is the probability density function of the residuals. As in [23], $\rho_\theta$ is assumed to attain its global minimum at 0, to be symmetric, $\rho_\theta(e_i) = \rho_\theta(-e_i)$, and to be monotonic, $\rho_\theta(e_1) > \rho_\theta(e_2)$ if $|e_1| > |e_2|$; we also let $\rho_\theta(0) = 0$. From Eq. (6), we can see that the proposed RSRC model is a more general sparse coding model: Eq. (4) and Eq. (5) are its special cases when the residual follows a Gaussian or a Laplacian distribution, respectively.

By solving Eq. (6), we can get the coding coefficient vector $\alpha$. One key problem is how to determine the distribution $\rho_\theta$. From the above analysis, taking $\rho_\theta$ as Gaussian or Laplacian directly is not effective or robust enough, so we do not determine $\rho_\theta$ directly to solve Eq. (6). Instead, we transform Eq. (6) into an iteratively reweighted sparse coding problem [23]:

$$\min_{\alpha} \|W^{1/2}(y - D\alpha)\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_1 \le \sigma, \qquad (7)$$

where $W$ is a diagonal matrix:

$$W_{i,i} = \omega_\theta(e_{0,i}) = \rho'_\theta(e_{0,i}) / e_{0,i}. \qquad (8)$$

Eq. (7) is a weighted LASSO problem. Because $W$ needs to be estimated using Eq. (8), Eq. (7) is a local approximation of the RSRC in Eq. (6) at $e_0$, and $W$ should be updated using the residuals of the previous iteration via Eq. (8). With Eq. (7), the determination of the distribution $\rho_\theta$ is transformed into the determination of $W$.

Copyright © 2013 SciRes. ENG
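One way to see why Eq. (7) can be handled by any ordinary LASSO solver: scaling each row of $D$ and each entry of $y$ by the square root of its weight turns the weighted fidelity term into a plain squared $l_2$-norm. The short check below illustrates this identity on random toy data; the helper name `reweight` is hypothetical.

```python
import numpy as np

def reweight(D, y, w):
    """Fold the diagonal weights w into D and y (row-wise scaling by sqrt(w))."""
    s = np.sqrt(w)
    return D * s[:, None], y * s

# Toy data (assumption): 6 features, 4 atoms, weights in [0, 1].
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
w = rng.uniform(0.0, 1.0, size=6)
a = rng.standard_normal(4)

weighted_fidelity = np.sum(w * (y - D @ a) ** 2)   # ||W^(1/2) (y - D a)||_2^2
D_w, y_w = reweight(D, y, w)
plain_fidelity = np.sum((y_w - D_w @ a) ** 2)      # ||y_w - D_w a||_2^2
```

The two fidelity values agree for any coefficient vector, so each reweighting step only changes the data passed to the solver, not the solver itself.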
As the logistic function has properties similar to the hinge loss function in SVM, the weight function can be chosen as

$$\omega_\theta(e_i) = \frac{\exp(\mu\delta - \mu e_i^2)}{1 + \exp(\mu\delta - \mu e_i^2)}, \qquad (9)$$

where $\mu$ and $\delta$ are positive scalars: $\mu$ controls the decreasing rate from 1 to 0, and $\delta$ controls the location of the demarcation point. From Eq. (9), Eq. (8), and $\rho_\theta(0) = 0$, we get

$$\rho_\theta(e_i) = -\frac{1}{2\mu}\left(\ln\left(1 + \exp(\mu\delta - \mu e_i^2)\right) - \ln\left(1 + \exp(\mu\delta)\right)\right). \qquad (10)$$

The sparse coding models in Eqs. (4) and (5) can be interpreted by Eq. (7): when $\omega_\theta(e_i) = 2$ we get the model in Eq. (4), and when $\omega_\theta(e_i) = 1/|e_i|$ we get the model in Eq. (5). Eq. (7) has the following advantage: outliers are assigned low weights, which reduces their effect. The weight function of Eq. (9) is bounded in [0,1].
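To make the behavior of the logistic weight function concrete, the sketch below evaluates it on a few residuals and checks numerically that it is the derivative of the corresponding $\rho_\theta$ divided by the residual, as Eq. (8) requires. The closed form of $\rho_\theta$ used here is the one obtained by integrating $\omega_\theta(e)\,e$ with $\rho_\theta(0) = 0$; the function names and the values of `mu` and `delta` are illustrative assumptions.

```python
import numpy as np

def logistic_weight(e, mu, delta):
    """Logistic weight of Eq. (9), written as 1 / (1 + exp(mu * (e^2 - delta)))."""
    z = np.clip(mu * (np.square(e) - delta), -50.0, 50.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))

def rho(e, mu, delta):
    """rho obtained by integrating weight(e) * e, with rho(0) = 0."""
    return -(np.logaddexp(0.0, mu * delta - mu * np.square(e))
             - np.logaddexp(0.0, mu * delta)) / (2.0 * mu)

mu, delta = 8.0, 0.5                      # illustrative values, not from the paper
e = np.array([0.0, 0.5, np.sqrt(delta), 1.0, 3.0])
w = logistic_weight(e, mu, delta)         # decreases from near 1 to near 0

# Finite-difference check of Eq. (8): rho'(e) should equal weight(e) * e.
e0, h = 0.7, 1e-6
numeric_derivative = (rho(e0 + h, mu, delta) - rho(e0 - h, mu, delta)) / (2 * h)
analytic_derivative = logistic_weight(e0, mu, delta) * e0
```

The weight equals 0.5 exactly where the squared residual reaches `delta`, which is why `delta` acts as the demarcation point between inliers and outliers.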

Algorithm of MRSRC
Given a testing sample $y$, we initialize the weights by setting the residual to $e_{ini} = y - y_{ini}$. In this paper, we compute $y_{ini}$ as

$$y_{ini} = m_D,$$

where $m_D$ is the mean of all training samples. In this algorithm, $W$ changes with $e$ through Eq. (8) at every iteration. We stop the iteration when the following condition is satisfied:

$$\|W^{(t)} - W^{(t-1)}\|_2 / \|W^{(t-1)}\|_2 < \gamma,$$

where $\gamma$ is a small positive scalar and $t$ is the iteration number.
After the iteration, we obtain the coding coefficient vector $\hat{\alpha}$ and then classify $y$ according to its class-wise reconstruction.

Step 2: Extract the metasamples of every class using SVD and obtain $D$.
Step 4: Compute the difference between $y$ and its reconstruction from each class.

From the algorithm it can be seen that MRSRC combines RSC with metasample-based clustering. The complexity of MRSRC depends on the number of iterations $t$, which in turn depends on the percentage of outliers in the DNA microarray data set. Generally, two iterations suffice unless the percentage of outliers is large, in which case about 10 iterations should be used to ensure that the algorithm converges.
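The overall procedure can be sketched end to end as below. This is an illustrative reading of the algorithm, not the authors' implementation, and it rests on several assumptions: the weighted LASSO is solved in penalized form with ISTA (`lam` is a hypothetical regularization weight); $\delta$ is taken as the $\lfloor \tau n \rfloor$-th smallest squared residual and $\mu = c/\delta$, with `c = 8` and `tau = 0.8`; the stopping rule compares successive weight vectors; and the final class is the one whose atoms give the smallest weighted reconstruction residual.

```python
import numpy as np

def ista_lasso(D, y, lam, n_iter=300):
    """ISTA for 0.5 * ||y - D a||_2^2 + lam * ||a||_1 (stand-in LASSO solver)."""
    step = 1.0 / max(np.linalg.norm(D, 2) ** 2, 1e-12)
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - y))
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return a

def logistic_weights(e, c=8.0, tau=0.8):
    """Logistic weights; delta and mu are set from the sorted squared residuals."""
    psi = np.sort(e ** 2)
    k = max(int(np.floor(tau * len(psi))), 1) - 1    # 0-based index into sorted psi
    delta = max(psi[k], 1e-12)                        # guard against a zero residual
    mu = c / delta
    z = np.clip(mu * (e ** 2 - delta), -50.0, 50.0)   # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))

def mrsrc_classify(D, atom_labels, y, train_mean, lam=0.01, max_iter=10, gamma=1e-2):
    """Iteratively reweighted sparse coding followed by a class-wise residual test."""
    w = logistic_weights(y - train_mean)              # initial residual e_ini = y - m_D
    for _ in range(max_iter):
        s = np.sqrt(w)
        alpha = ista_lasso(D * s[:, None], y * s, lam)    # weighted LASSO step
        w_new = logistic_weights(y - D @ alpha)           # reweight from new residuals
        change = np.linalg.norm(w_new - w) / max(np.linalg.norm(w), 1e-12)
        w = w_new
        if change < gamma:                                # stopping criterion
            break
    s = np.sqrt(w)
    classes = np.unique(atom_labels)
    residuals = [np.linalg.norm(s * (y - D @ np.where(atom_labels == c, alpha, 0.0)))
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy example (assumption): 3 orthonormal metasamples per class, 60 genes,
# a test sample built from class 1 with one gross outlier.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((60, 6)))
D, atom_labels = Q, np.array([0, 0, 0, 1, 1, 1])
y = D[:, 3:] @ np.array([1.0, 0.5, -0.7])
y[5] += 10.0
predicted = mrsrc_classify(D, atom_labels, y, train_mean=D.mean(axis=1))
```

In this toy run the corrupted gene receives a weight close to zero after the first reweighting, so it barely influences the coefficients or the final residual comparison.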

Experimental Results
In this section, experiments were performed to show the efficiency of the proposed method.

Parameter Selection
The weight function in Eq. (9) has two parameters, $\mu$ and $\delta$. $\delta$ is the parameter of the demarcation point, and we compute its value as follows.
Denote by $\psi = [e_1^2, e_2^2, \ldots, e_n^2]$ the vector of squared residuals.

Two-Class Classification
In this subsection, we use three microarray data sets to study tumor classification. We used the proposed MRSRC to classify these data sets. For comparison, we also used three other methods, i.e., SVM, LASSO, and SRC, on the same data sets. The classification results are listed in Table 2. In our method, SVD is used to extract the metasamples of the gene expression data; here, we extract 3 metasamples for each of the two classes. Because the number of samples is small, we use nested stratified 10-fold cross validation to obtain a more reliable result.
From Table 2 we can see that MRSRC performs well on the Colon cancer and DLBCL data sets. On the Prostate cancer data set, MRSRC is not better than SRC, but it still has an advantage over SVM and LASSO. SRC is the best on the Prostate data set but is not as good as MRSRC on the other two. Overall, MRSRC achieves good results in this two-class classification experiment.
To better illustrate the results, Figure 2 shows the accuracies of MRSRC when t = 10, again using nested stratified 10-fold cross validation. The x-axis represents the metasample dimension k, and the y-axis represents the classification accuracy. From Figure 2, we can see that the best classification accuracy is reached when the dimension of the metasamples is 3. This result illustrates the advantage of metasamples in reducing computational complexity.
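The evaluation protocol mentioned above, nested stratified 10-fold cross validation, can be sketched as follows. This is a generic illustration of the splitting scheme, not the authors' evaluation script: the outer folds provide test sets, and each outer training set is split again into inner folds that could be used to choose parameters such as the number of metasamples. The function names and the toy labels are hypothetical.

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign every sample to one of n_folds folds, class by class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fold_of[idx] = np.arange(len(idx)) % n_folds   # deal samples round-robin
    return fold_of

def nested_splits(labels, n_folds=10, seed=0):
    """Yield (train indices, test indices, inner fold assignment of train)."""
    labels = np.asarray(labels)
    outer = stratified_folds(labels, n_folds, seed)
    for f in range(n_folds):
        test = np.flatnonzero(outer == f)
        train = np.flatnonzero(outer != f)
        inner = stratified_folds(labels[train], n_folds, seed + 1 + f)
        yield train, test, inner

# Toy labels (assumption): 40 samples of class 0 and 22 of class 1.
labels = np.array([0] * 40 + [1] * 22)
splits = list(nested_splits(labels))
```

Because samples are dealt to folds class by class, every outer test fold contains both classes even when the classes are imbalanced.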

Multiclass Classification
To further investigate the performance of our method, we also experimented on five multiclass data sets, all produced by oligonucleotide microarrays. The Lung cancer data set [17] contains 4 lung cancer types and includes 203 samples with 12600 genes. The Leukemia data set [18] contains 3 kinds of samples and includes 72 samples with 11225 genes. The SRBCT (small round blue cell tumors of childhood) data set [19] contains 4 types of tumors and includes 83 samples with 2308 genes. The 11_Tumors data set [20] contains 11 human tumor types and includes 174 samples with 12533 genes. The 9_Tumors data set [21] contains 9 human tumor types and includes 60 samples with 5726 genes.
The results are listed in Table 3. From the experiments we can see that, for multiclass classification, the proposed MRSRC does not have clear advantages over SVM and SRC. The reason is that the training samples are so few that the extracted metasamples cannot fully represent the information of these classes. For example, the 9_Tumors data set has only 60 samples but 9 classes. One class has only 7 samples, so the training samples cannot fully represent the testing sample.

Conclusions
In this paper, using sparse coding and robust estimation theory, we proposed a Metasample-Based Robust Sparse Representation Classifier (MRSRC). Comparing MRSRC with SRC and SVM on various types of DNA microarray data, the experimental results show that MRSRC is effective and efficient in tumor classification. One important advantage of MRSRC is its robustness to various types of outliers and noise, because the reweighting function reduces the influence of outliers in each iteration. It should be noted that our method is based on the assumption that any testing sample can be well represented as a linear combination of the training samples from the same class, which means that there should be enough training samples; otherwise, the experimental results may be unsatisfactory. In the future, we will use gene selection together with SVM or NMF to further reduce the dimension of the training samples, speed up computation, and improve classification accuracy.

However, directly using the original training samples as the dictionary sometimes cannot represent a new test sample well enough. Moreover, if there are too many training samples, the algorithm slows down.

Figure 1. The metasample model of gene expression data.


With consideration of the sparsity constraint on $\alpha$.


, and let $t = t + 1$.
6) Return to step 1 until convergence or until the maximal number of iterations is reached.
Here $\hat{y}_i$ is the testing sample rebuilt from the metasamples of class $i$. We then classify $y$ according to the difference between $y$ and $\hat{y}_i$: if the difference between $y$ and $\hat{y}_i$ is the smallest, $y$ is assigned to class $i$.

The classification algorithm is summarized as follows:
Input: matrix of training samples $A$ and testing sample $y$.
Step 1: Normalize the columns of $A$ to have unit $l_2$-norm.
Output: identity($y$), i.e., the class to which $y$ belongs.

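We sort the elements of $\psi$ in ascending order to get a new array $\psi_a$. Following [23], we set $\delta$ to the $\lfloor \tau n \rfloor$-th element of $\psi_a$, with $\tau \in (0, 1]$, and $\mu = c/\delta$, where $c$ is a constant. Based on extensive experiments, we usually set $c$ to 8 and $\tau$ to 0.8 in our experiments.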
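The parameter rule can be illustrated with a few lines of code. This sketch assumes the rule used for robust sparse coding in [23], namely that $\delta$ is the $\lfloor \tau n \rfloor$-th smallest squared residual and $\mu = c/\delta$; the function name `estimate_delta_mu` and the toy residuals are hypothetical.

```python
import numpy as np

def estimate_delta_mu(e, c=8.0, tau=0.8):
    """Set delta from the sorted squared residuals and mu = c / delta.

    Assumption: delta is the floor(tau * n)-th smallest squared residual.
    No guard is included for the degenerate case delta == 0.
    """
    psi_sorted = np.sort(np.square(e))
    k = max(int(np.floor(tau * len(psi_sorted))), 1)   # 1-based rank
    delta = psi_sorted[k - 1]
    mu = c / delta
    return delta, mu

# Toy residuals (assumption): four small values and one gross outlier.
e = np.array([0.1, -0.2, 0.3, 0.4, 5.0])
delta, mu = estimate_delta_mu(e)   # delta is about 0.16 and mu about 50
```

With `tau = 0.8`, the largest 20% of the squared residuals lie above the demarcation point, so the single outlier in this toy vector is the only entry that is strongly down-weighted.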

MRSRC
Classification Based on Sparse Representation
Using the $l_2$- or $l_1$-norm of $y - D\alpha$ as the fidelity term assumes that the testing signal $y$ can be represented by the training samples $D$. In practice, this assumption may not hold well because of noise or outliers, so the $l_2$- or $l_1$-norm may not be robust enough for DNA microarray data classification.

The three data sets used for two-class classification are the Colon cancer, Prostate cancer, and DLBCL data sets. Their details are listed in Table 1. All three data sets have two classes of samples. The Colon data set has 2000 genes, the Prostate data set has 12600 genes, and the DLBCL data set has 5469 genes.