Engineering, 2013, 5, 78-83
doi:10.4236/eng.2013.55B016 Published Online May 2013 (http://www.scirp.org/journal/eng)

Metasample-Based Robust Sparse Representation for Tumor Classification

Bin Gan1, Chun-Hou Zheng1,2, Jin-Xing Liu1
1College of Information Technology and Communication, Qufu Normal University, Rizhao, China
2College of Electrical Engineering and Automation, Anhui University, Hefei, China
Email: ganbinganxinhai@126.com

Received 2013

ABSTRACT
In this paper, based on sparse representation classification and robust estimation, we propose a new classifier, named MRSRC (Metasample-Based Robust Sparse Representation Classifier), for DNA microarray data classification. First, we extract metasamples from the training samples. Second, a weighted matrix W is introduced into an l1-regularized least squares problem. Finally, each testing sample is classified according to its sparse coefficient vector. Experimental results on DNA microarray data classification show that the proposed algorithm is efficient.

Keywords: DNA Microarray Data; Sparse Representation Classification; MRSRC; Robust

1. Introduction
A tumor is a neoplasm arising from abnormal cell growth. Accurate, effective and prompt treatment of a tumor is vital for the patient, but before treatment the tumor must first be classified correctly, because tumors come in many types and a wrong diagnosis can make the treatment itself harmful.
DNA microarrays are a biotechnology that simultaneously monitors the expression of tens of thousands of genes in cells. Many methods have been applied to tumor classification via microarray gene expression profiling, such as independent component analysis (ICA) and nonnegative matrix factorization (NMF).
Because biological data are large in scale and complicated in structure, they are difficult to store, search, process and analyze, and they pose a major challenge to data mining technology. New, efficient data mining techniques are therefore needed to improve the accuracy of tumor classification.
Sparse representation has been successfully used in image processing [1], DNA microarray data classification [2], and text classification. Intuitively, the sparsity of the coding coefficient vector of a sample can be measured by its l0-norm, or by the l1-norm (the closest convex surrogate of the l0-norm), which is widely used in sparse coding. In general, the sparse coding problem can be formulated as

    min_α ||α||_1  subject to  ||y − Dα||_2^2 ≤ ε,    (1)

where y is a given signal, e.g., the gene expression profile of a sample, D is the dictionary of coding atoms, α is the coding vector of y over D, and ε > 0 is a constant.
For gene expression data, this formulation can be interpreted as follows: by coding a DNA microarray sample y as a sparse linear combination of the training samples D via l0-norm or l1-norm minimization (here we use the l1-norm), SRC (Sparse Representation Classification) classifies y by evaluating which class of training samples yields the minimal reconstruction error with the associated coding coefficients.
Two important issues arise in this model. The first is whether the l1-norm is strong enough to represent the signal sparsely. Much work has addressed this issue, for example adding a nonnegative constraint on α [3], introducing a Laplacian term on the coefficients in sparse coding [4], designing sparsity regularization terms with Bayesian methods [5], and using a weighted l2-norm in the sparsity constraint [6]. The first issue has thus been solved reasonably well.
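To make Eq. (1) concrete, the following sketch solves the equivalent penalized (Lagrangian) form of the l1-coding problem with scikit-learn's Lasso. The dictionary size, test signal and penalty value are illustrative assumptions, not from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D = rng.standard_normal((100, 20))       # toy dictionary: 100 "genes" x 20 training samples
D /= np.linalg.norm(D, axis=0)           # unit l2-norm atoms
alpha_true = np.zeros(20)
alpha_true[[2, 7]] = [1.0, -0.5]         # the test signal uses only two atoms
y = D @ alpha_true + 0.01 * rng.standard_normal(100)

# Penalized form of Eq. (1): min (1/2n)||y - D a||_2^2 + lam * ||a||_1
lasso = Lasso(alpha=0.001, fit_intercept=False, max_iter=10000)
lasso.fit(D, y)
alpha = lasso.coef_                      # sparse coding coefficient vector
print(np.flatnonzero(np.abs(alpha) > 0.1))
```

The recovered coefficient vector is dominated by the two atoms that actually generated y, which is the property SRC exploits for classification.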
The second issue is whether the l2-norm fidelity term ||y − Dα||_2^2 is effective enough when y is noisy or contains many outliers. So far, only [7,8] have used the l1-norm fidelity ||y − Dα||_1 to preserve the coding fidelity. We assume that a testing signal y can be represented by the training samples D, but in practice this assumption may not hold because of noise or outliers, so neither the l2- nor the l1-norm may be robust enough for DNA microarray data classification.
Copyright © 2013 SciRes. ENG
To improve the robustness and effectiveness of sparse representation based classification, in this paper we propose a Metasample-Based Robust Sparse Representation Classifier (MRSRC) for DNA microarray data classification, combining sparse coding with robust estimation. A weighted matrix W is introduced into the l1-regularized least squares problem to reduce the influence of noise and outliers.
The rest of this paper is organized as follows. Section 2 presents the proposed MRSRC model and algorithm: the metasample model of gene expression data is presented first, followed by MRSRC itself. Section 3 presents the numerical experiments, and Section 4 concludes the paper.

2. MRSRC
2.1. Classification Based on Sparse Representation
Sparse representation (SR) is an effective data processing method based on l1-norm minimization. Using SR, a new sample can be represented as a linear combination of the training set with a sparse coefficient vector, and classified according to that vector. Sparse Representation Classification (SRC) has performed well in face recognition and tumor classification [9]. The core idea of SRC is that a test sample can be well expressed using only the training samples of its own class, in which case the SR coefficient vector has only a few nonzero elements. In SR, the coefficient vector is obtained by l1-sparsity-constrained least squares estimation. SRC differs from traditional supervised learning in that it has no strict training and testing phases, and hence no overfitting problem.
However, using the original training samples directly as the dictionary sometimes cannot express a new test sample well enough, and when there are too many training samples the algorithm slows down. To solve this problem, we first extract metasamples from the training samples of each class and then use them to construct the dictionary. The detailed process is described in the following subsection.

2.2. The Metasample Model of Gene Expression Data
A typical characteristic of DNA microarray data is that the number of genes far exceeds the number of samples: there are usually on the order of hundreds of samples, each with thousands of genes. Consequently, many classical classification methods cannot be applied directly to DNA microarray data. Fortunately, selecting relevant features or extracting new features handles this issue well. Many studies have used gene selection to classify tumor samples, e.g., [10-12], or extracted new features such as metasamples [16]. Alter et al. [13] used SVD to transform gene expression data from the "genes × samples" space to the diagonalized "eigengenes × eigenarrays" space, where the eigengenes are unique orthonormal superpositions of the genes. Brunet et al. [14] used NMF to describe gene expression data through a few metasamples (Figure 1).
A metasample of gene expression data is defined as a linear combination of several samples. We factorize the gene expression data matrix A into two matrices:

    A ≈ WH,    (2)

where A is of size n × m; each column of A gives the expression levels of the genes in one sample, and each row gives the expression levels of one gene across all samples. W is of size n × k and H is of size k × m.
From the above analysis, many methods can extract metasamples, such as SVD, PCA [13], ICA [15] and NMF [14]. In consideration of simplicity and speed, in this paper we use SVD. We extract the metasamples of class i, denote them W_i, and concatenate them to construct the new dictionary:

    D = [W_1, W_2, …, W_k].    (3)

2.3. The RSRC (Robust Sparse Representation Coding) Model
The sparse coding model in Eq.
(1) is equivalent to the LASSO problem [22]:

    min_α ||y − Dα||_2^2  subject to  ||α||_1 ≤ σ,    (4)

where σ > 0 is a constant, y = [y_1; y_2; …; y_n] ∈ R^n is the signal to be coded, D = [d_1, d_2, …, d_m] ∈ R^{n×m} is the dictionary with column vectors d_j as atoms, and α is the coding coefficient vector. Here the residual e = y − Dα is assumed to follow a Gaussian distribution.
The sparse coding problem in Eq. (4) is thus a sparsity-constrained least squares estimation. If e instead follows a Laplacian distribution, the problem becomes

    min_α ||y − Dα||_1  subject to  ||α||_1 ≤ σ.    (5)

Figure 1. The metasample model of gene expression data.
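The metasample construction of Section 2.2 (Eqs. (2)-(3)) can be sketched as follows; taking the leading left singular vectors as the metasamples of each class, with class sizes and k = 3 metasamples per class as illustrative assumptions.

```python
import numpy as np

def extract_metasamples(A_class, k=3):
    """Take the first k left singular vectors of one class's
    genes-x-samples matrix as its metasamples (SVD variant of Eq. (2))."""
    U, _, _ = np.linalg.svd(A_class, full_matrices=False)
    return U[:, :k]                          # n x k matrix W_i

rng = np.random.default_rng(0)
# Two hypothetical classes: 2000 genes, 20 and 25 samples respectively
classes = [rng.standard_normal((2000, 20)), rng.standard_normal((2000, 25))]

# Dictionary D = [W_1, W_2, ..., W_k] as in Eq. (3)
D = np.hstack([extract_metasamples(A_i, k=3) for A_i in classes])
print(D.shape)                               # (2000, 6)
```

Each class block W_i has orthonormal columns, so the dictionary stays small and well-conditioned regardless of how many raw training samples a class contains.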
However, the residual e may deviate far from both the Gaussian and the Laplacian distribution because of noise or outliers, so the conventional sparse coding models in Eqs. (4) and (5) may not be robust and effective enough for DNA microarray data classification. To build a more robust model, we rewrite D as D = [r_1; r_2; …; r_n], where the row vector r_i is the i-th row of D, and set e = y − Dα = [e_1; e_2; …; e_n], so that e_i = y_i − r_i α. We assume the e_i are distributed according to some probability density function f(e_i). With the sparsity constraint on α, the RSRC can be formulated as [23]:

    min_α Σ_{i=1}^{n} ρ(y_i − r_i α)  subject to  ||α||_1 ≤ σ,    (6)

where ρ(e_i) = −ln f(e_i) has the following properties: ρ(0) is the global minimum of ρ(e_i); ρ(e_i) = ρ(−e_i); ρ(e_i) < ρ(e_j) if |e_i| < |e_j|; and we set ρ(0) = 0.
From Eq. (6) we can see that the proposed RSRC is a more general sparse coding model; Eqs. (4) and (5) are special cases of it when e follows a Gaussian or Laplacian distribution, respectively.
Solving Eq. (6) yields the coding coefficient vector α, but one key problem is how to determine the distribution of e. From the above analysis, taking e as Gaussian or Laplacian directly is neither effective nor robust enough, so instead of fixing the distribution and solving Eq. (6) directly, we transform Eq. (6) into an iteratively reweighted sparse coding problem [23]:

    min_α ||W^(1/2)(y − Dα)||_2^2  subject to  ||α||_1 ≤ σ,    (7)

where W is a diagonal matrix with

    W_{i,i} = ω(e_i) = ρ'(e_i)/e_i,   W_{i,j} = 0 for i ≠ j.    (8)

Eq. (7) is a weighted LASSO problem. Because W must be estimated via Eq. (8), Eq. (7) is only a local approximation of the RSRC in Eq. (6), and W has to be updated from the residuals of the previous iteration via Eq. (8). Through Eq. (7), determining the distribution of e is transformed into determining W.
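Since W in Eq. (7) is diagonal, the weighted LASSO reduces to an ordinary LASSO after rescaling each row of y and D by the square root of its weight. A minimal sketch, using the penalized form and a scikit-learn solver as assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_sparse_code(y, D, w, lam=1e-3):
    """Penalized form of Eq. (7): min ||W^(1/2)(y - D a)||_2^2 + lam*||a||_1,
    solved by scaling rows with sqrt(w) and running a plain LASSO."""
    sw = np.sqrt(w)                          # square root of the diagonal of W
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    model.fit(sw[:, None] * D, sw * y)       # rescale rows of D and entries of y
    return model.coef_

# With all weights equal to 1, this reproduces the unweighted coding of Eq. (4).
```

Rows with small weights (likely outliers) then contribute little to the least squares fit, which is exactly the robustness mechanism of RSRC.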
As the logistic function has properties similar to the hinge loss in SVM, the weight function can be taken as

    ω(e_i) = exp(μδ − μe_i^2) / (1 + exp(μδ − μe_i^2)),    (9)

where μ and δ are positive scalars: μ controls the rate at which the weight decreases from 1 to 0, and δ controls the location of the demarcation point. From Eq. (9), Eq. (8) and ρ(0) = 0, we get

    ρ(e_i) = −(1/(2μ)) (ln(1 + exp(μδ − μe_i^2)) − ln(1 + exp(μδ))).    (10)

The sparse coding models in Eqs. (4) and (5) can be interpreted through Eq. (7): when ω(e_i) = 2 we recover the model in Eq. (4), and when ω(e_i) = 1/|e_i| we recover the model in Eq. (5). Eq. (7) has the advantage that outliers are assigned low weights, which reduces their effect; the weight function of Eq. (9) is bounded in [0, 1].
According to Eqs. (7)-(10), the detailed algorithm for computing α is as follows. Start from t = 1:
1) Compute the residual e^(t) = y − y_rec^(t), where y_rec^(1) is initialized via Eq. (11).
2) Estimate the weights as

    ω^(t)(e_i^(t)) = exp(μ^(t)δ^(t) − μ^(t)(e_i^(t))^2) / (1 + exp(μ^(t)δ^(t) − μ^(t)(e_i^(t))^2)),    (14)

where μ^(t) and δ^(t) are the parameters estimated in the t-th iteration.
3) Sparse coding:

    α* = arg min_α ||(W^(t))^(1/2)(y − Dα)||_2^2  subject to  ||α||_1 ≤ σ,

where W^(t) is the estimated diagonal weight matrix with W_{i,i}^(t) = ω^(t)(e_i^(t)).
4) Update the sparse coding coefficients: if t = 1, α^(t) = α*; if t > 1, α^(t) = α^(t−1) + λ^(t)(α* − α^(t−1)), where 0 < λ^(t) < 1 is a step size that can be searched from 1 to 0 by a standard line search.
5) Compute the reconstructed test sample y_rec^(t) = Dα^(t), and let t = t + 1.
6) Return to step 1 until convergence or until the maximal number of iterations is reached, then output the sparse coefficient vector α.

2.4. Algorithm of MRSRC
Given a testing sample y, to initialize the weights we set e = y − y_ini. In this paper we compute y_ini as

    y_ini = m_D,    (11)

where m_D is the mean of all training samples. In this algorithm, W changes with e via Eq. (8) at every iteration.
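The logistic weight of Eq. (9) can be implemented directly; the residual values and the parameters μ = 8, δ = 1 below are illustrative.

```python
import numpy as np

def logistic_weight(e, mu, delta):
    """Eq. (9): w(e) = exp(mu*delta - mu*e^2) / (1 + exp(mu*delta - mu*e^2)).
    Inliers (e^2 << delta) get weight near 1, outliers near 0."""
    z = mu * (delta - np.asarray(e, dtype=float) ** 2)
    # equivalent logistic form 1/(1+exp(-z)); clip avoids overflow for big |e|
    return 1.0 / (1.0 + np.exp(np.clip(-z, None, 700.0)))

e = np.array([0.0, 0.5, 1.0, 5.0])
w = logistic_weight(e, mu=8.0, delta=1.0)    # weights fall from ~1 toward 0 as |e| grows
```

At e^2 = δ the weight is exactly 0.5, which is why the paper calls δ the demarcation point between inliers and outliers.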
We stop the iteration when the following condition is satisfied:

    ||W^(t) − W^(t−1)||_2 / ||W^(t−1)||_2 < γ,    (12)

where γ is a small positive scalar and t is the iteration count.
After the iterations we obtain the coefficient vector α, and then classify y by

    identity(y) = arg min_i r_i(y),  with  r_i(y) = ||y − D δ_i(α)||_2,    (13)

where ŷ_i = D δ_i(α) is the testing sample rebuilt from the i-th class metasamples, and δ_i(α) keeps only the coefficients associated with class i. Then we can classify y according to
the difference between y and ŷ_i: for example, if the difference between y and ŷ_3 is the smallest, y is classified to the 3rd class.
The classification algorithm is summarized as follows.
Input: matrix of training samples A = [A_1, A_2, …, A_k] for k classes; testing sample y, with y_rec^(1) initialized as y_ini.
Step 1: Normalize the columns of A to have unit l2-norm.
Step 2: Extract the metasamples of every class using SVD and form D.
Step 3: Calculate the sparse coefficient vector α using Eq. (7).
Step 4: Compute the residuals r_i(y) = ||y − D δ_i(α)||_2.
Output: identity(y) = arg min_i r_i(y), i.e., the class to which y belongs.
From the algorithm it can be seen that MRSRC combines robust sparse coding with a metasample-based dictionary. The complexity of MRSRC depends on the number of iterations t, which in turn depends on the percentage of outliers in the DNA microarray data set. Generally two iterations suffice; only when the percentage of outliers is very large should t be set to about 10 to ensure convergence.

3. Experimental Results
In this section, experiments were performed to show the efficiency of the proposed method.

3.1. Parameter Selection
The weight function in Eq. (9) has two parameters, μ and δ. δ is the parameter of the demarcation point, and we compute its value as follows. Denote ψ = [e_1^2, e_2^2, …, e_n^2] and sort its elements in ascending order to obtain a new array ψ'. Let k = floor(τn), where the scalar τ ∈ (0, 1) and floor(τn) is the largest integer not exceeding τn. We then set

    δ = ψ'_k.    (15)

μ controls the decreasing rate of the weight from 1 to 0. We set μ = c/δ, where c is a constant. Based on extensive experiments, we usually set c = 8, and we set τ = 0.8 in our experiments.

3.2. Two-Class Classification
In this subsection, we use three microarray data sets to study tumor classification: the Colon cancer data set, the Prostate cancer data set and the DLBCL data set.
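The decision rule of the algorithm above (Step 4 and Eq. (13)) can be sketched as follows; writing out the class-indicator function δ_i explicitly via a per-atom class label array, which is an implementation choice of ours.

```python
import numpy as np

def classify(y, D, alpha, class_of_atom):
    """Eq. (13): identity(y) = argmin_i ||y - D * delta_i(alpha)||_2,
    where delta_i(alpha) keeps only the coefficients of class i's atoms."""
    labels = np.unique(class_of_atom)
    residuals = []
    for c in labels:
        a_c = np.where(class_of_atom == c, alpha, 0.0)   # delta_i(alpha)
        residuals.append(np.linalg.norm(y - D @ a_c))    # class-wise reconstruction error
    return labels[int(np.argmin(residuals))]
```

If the coefficient mass of α sits on the atoms of one class, that class reconstructs y almost exactly and wins the argmin, which is the core SRC/MRSRC principle.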
The data set details are listed in Table 1. All three data sets have two classes of samples. The Colon data set has 2000 genes, the Prostate data set has 12600 genes, and the DLBCL data set has 5469 genes.
We used the proposed method to classify these data sets. For comparison, we also used three other methods, i.e., SVM, LASSO and SRC, on the same experimental data sets. The classification results are listed in Table 2. In our method, SVD is used to extract the metasamples of the gene expression data; here we extract 3-dimensional metasamples for the two classes of samples. As the sample sizes are not large, we use nested stratified 10-fold cross-validation to obtain more accurate results.
From Table 2 we can see that MRSRC performs well on the Colon cancer and DLBCL data sets. On the Prostate cancer data set MRSRC is not better than SRC, but it still has an advantage over SVM and LASSO. SRC is best on the Prostate data set but not as good as MRSRC on the other two. Overall, MRSRC gives good results in this two-class classification experiment.
To better illustrate the results, Figure 2 shows the accuracies of MRSRC when t = 10, again using nested stratified 10-fold cross-validation.

Table 1. Three types of tumors for the DNA classification experiments.

    Data set          Class 1 samples   Class 2 samples   Genes
    Colon cancer      40                22                2000
    Prostate cancer   77                59                12600
    DLBCL             58                19                5469

Table 2. The classification accuracies (%) by different methods.

    Data set          SVM     LASSO   SRC     MRSRC-SVD
    Colon cancer      85.48   85.48   85.48   90.32
    Prostate cancer   91.18   91.91   94.85   93.38
    DLBCL             96.10   96.10   97.40   98.70

Figure 2. The classification accuracy with t = 10.
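The parameter rule of Section 3.1 can be written compactly; the residual vector below is illustrative, and τ = 0.8, c = 8 follow the values reported in the paper.

```python
import numpy as np

def estimate_mu_delta(e, tau=0.8, c=8.0):
    """Section 3.1: delta is the k-th smallest squared residual with
    k = floor(tau*n) (Eq. (15)), and mu = c / delta."""
    psi = np.sort(np.asarray(e, dtype=float) ** 2)   # ascending squared residuals
    k = int(np.floor(tau * len(psi)))
    delta = psi[k - 1]                                # k-th element, 1-based
    return c / delta, delta

e = np.array([0.1, -0.2, 0.3, 3.0, 0.05])            # one large (outlier) residual
mu, delta = estimate_mu_delta(e)
```

Because δ is an order statistic at quantile τ, roughly a fraction 1 − τ of the residuals fall beyond the demarcation point and are down-weighted, so the outlier at 3.0 cannot inflate δ.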
The x-axis represents the dimension k of the metasamples and the y-axis the classification accuracy. From Figure 2 we can see that the best classification accuracy is reached when the dimension of the metasamples is 3. This result clearly shows the advantage of metasamples in reducing computational complexity.

3.3. Multiclass Classification
To further investigate the performance of our method, we also experimented on five multiclass data sets, all produced by oligonucleotide microarrays: the Lung cancer data set [17], which contains 4 lung cancer types in 203 samples with 12600 genes; the Leukemia data set [18], which contains 3 kinds of samples in 72 samples with 11225 genes; the SRBCT (small round blue cell tumors of childhood) data set [19], which contains 4 tumor types in 83 samples with 2308 genes; the 11_Tumors data set [20], which contains 11 human tumor types in 174 samples with 12533 genes; and the 9_Tumors data set [21], which contains 9 human tumor types in 60 samples with 5726 genes. The results are listed in Table 3.

Table 3. The multiclass classification accuracies (%) by different methods.

    Data set      SVM     SRC     MRSRC-SVD
    Lung cancer   96.05   95.07   95.07
    Leukemia      96.60   95.83   97.22
    SRBCT         100     100     100
    11_Tumors     94.68   94.83   95.40
    9_Tumors      65.10   66.67   60.00

From these experiments we can see that, for multiclass classification, the proposed MRSRC has no clear advantage over SVM and SRC. The reason is that the training samples are so few that the extracted metasamples cannot fully represent the information of each class. For example, the 9_Tumors data set has only 60 samples but 9 classes; with as few as 7 samples in a class, the training samples cannot fully represent the testing samples.

4. Conclusions
In this paper, using sparse coding and robust estimation, we proposed a Metasample-Based Robust Sparse Representation Classifier (MRSRC). Comparing MRSRC with SRC and SVM on various types of DNA microarray data, the experimental results validate that MRSRC is effective and efficient for tumor classification. One important advantage of MRSRC is its robustness to various types of outliers and noise, because the reweighting function reduces the influence of outliers in each iteration. It should be noted that our method is based on the assumption that any testing sample can be well represented as a linear combination of the training samples of the same class, which requires sufficiently many training samples; otherwise, the experimental results may be unsatisfactory. In future work, we will use gene selection together with SVM or NMF to further reduce the dimensionality of the training samples, speed up computation and improve classification accuracy.

5. Acknowledgements
This work was supported by the National Science Foundation of China under Grant No. 61272339, the Science Foundation of Anhui Province under Grant No. 1308085MF85, the China Postdoctoral Science Foundation Funded Project under Grant No. 2012M510091, and the Key Project of Anhui Educational Committee under Grant No. KJ2012A005.

REFERENCES
[1] J. Mairal, M. Elad and G. Sapiro, “Sparse Representation for Color Image Restoration,” IEEE Transactions on Image Processing, Vol. 17, No. 1, 2008, pp. 53-69.
[2] C. H. Zheng, L. Zhang, T. Y. Ng, S. C. K. Shiu and D. S. Huang, “Metasample-Based Sparse Representation for Tumor Classification,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 8, No. 5, 2011, pp. 1273-1282. doi:10.1109/TCBB.2011.20
[3] Y. N. Liu, F. Wu, Z. H. Zhang, Y. T. Zhuang and S. C. Yan, “Sparse Face Recognition under Nonnegative Curds and Whey,” IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[4] S. H. Gao, I. W. H. Tsang, L. T. Chia and P. L. Zhao,
“Local Features Are not Lonely - Laplacian Sparse Coding for Image Classification,” IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[5] S. H. Ji, Y. Xue and L. Carin, “Bayesian Compressive Sensing,” IEEE Transactions on Signal Processing, Vol. 56, No. 6, 2008, pp. 2346-2356.
[6] J. J. Wang, J. C. Yang, K. Yu, F. J. Lv, T. Huang and Y. H. Gong, “Locality-Constrained Linear Coding for Image Classification,” IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[7] J. Wright and Y. Ma, “Dense Error Correction via l1-Minimization,” IEEE Transactions on Information Theory, Vol. 56, No. 7, 2010, pp. 3540-3560.
[8] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry and Y. Ma, “Robust Face Recognition via Sparse Representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 2, 2009, pp. 210-227.
[9] X. Huang and F. X. Wu, “Sparse Representation for Classification of Tumors Using Gene Expression Data,” Journal of Biomedicine and Biotechnology, Vol. 2009, No. 1, 2009, pp. 1-6. doi:10.1155/2009/403689
[10] T. R. Golub, D. K. Slonim et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, No. 286, 1999, pp. 531-537. doi:10.1126/science.286.5439.531
[11] J. G. Liao and K. V. Chin, “Logistic Regression for Disease Classification Using Microarray Data: Model Selection in a Large p and Small n Case,” Bioinformatics, Vol. 23, No. 15, 2007, pp. 1945-1951. doi:10.1093/bioinformatics/btm287
[12] H. H. Zhang, J. Ahn, X. Lin and C. Park, “Gene Selection Using Support Vector Machines with Non-Convex Penalty,” Bioinformatics, Vol. 22, 2006, pp. 88-95. doi:10.1093/bioinformatics/bti736
[13] O. Alter, P. O. Brown and D. Botstein, “Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling,” Proceedings of the National Academy of Sciences of the United States of America, Vol. 97, 2000, pp. 10101-10106. doi:10.1073/pnas.97.18.10101
[14] J. P. Brunet, P. Tamayo, T. R. Golub and J. P. Mesirov, “Metagenes and Molecular Pattern Discovery Using Matrix Factorization,” Proceedings of the National Academy of Sciences of the United States of America, Vol. 101, No. 12, 2004, pp. 4164-4169. doi:10.1073/pnas.0308531101
[15] D. S. Huang and C. H. Zheng, “Independent Component Analysis-Based Penalized Discriminant Method for Tumor Classification Using Gene Expression Data,” Bioinformatics, Vol. 22, No. 15, 2006, pp. 1855-1862. doi:10.1093/bioinformatics/btl190
[16] W. Liebermeister, “Linear Modes of Gene Expression Determined by Independent Component Analysis,” Bioinformatics, Vol. 18, 2002, pp. 51-60. doi:10.1093/bioinformatics/18.1.51
[17] A. Bhattacharjee et al., “Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses,” Proceedings of the National Academy of Sciences of the United States of America, Vol. 98, 2001, pp. 13790-13795. doi:10.1073/pnas.191502998
[18] S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L.
den Boer, M. D. Minden, S. E. Sallan, E. S. Lander, T. R. Golub and S. J. Korsmeyer, “MLL Translocations Specify a Distinct Gene Expression Profile that Distinguishes a Unique Leukemia,” Nature Genetics, Vol. 30, 2002, pp. 41-47. doi:10.1038/ng765
[19] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson and P. S. Meltzer, “Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks,” Nature Medicine, Vol. 7, No. 6, 2001, pp. 673-679. doi:10.1038/89044
[20] A. I. Su et al., “Molecular Classification of Human Carcinomas by Use of Gene Expression Signatures,” Cancer Research, Vol. 61, No. 20, 2001, pp. 7388-7393.
[21] J. E. Staunton et al., “Chemosensitivity Prediction by Transcriptional Profiling,” Proceedings of the National Academy of Sciences of the United States of America, Vol. 98, No. 19, 2001, pp. 10787-10792. doi:10.1073/pnas.191368598
[22] R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society B, Vol. 58, No. 1, 1996, pp. 267-288.
[23] M. Yang, L. Zhang, J. Yang et al., “Regularized Robust Coding for Face Recognition,” 2012.