A Localized-Statistic-Based Approach for Biomarker Identification of Omics Data

Omics data provide an essential means for molecular biology and systems biology to capture the systematic properties of the inner activities of cells. One of the most challenging problems biological researchers face is finding methods to discover biomarkers for tracking the progress of diseases such as cancer. Feature selection methods have therefore been widely used for biomarker discovery. However, omics data usually contain a large number of features but a small number of samples, and some omics data are distributed over a large range, which makes omics data difficult for feature selection methods to handle. To overcome these problems, we present a computing method called localized statistic of abundance distribution based on Gaussian window (LSADBGW) to test the significance of a feature. Experiments on three datasets, including gene and protein datasets, showed the accuracy and efficiency of LSADBGW for feature selection.


Introduction
With the advent of high-throughput measurement techniques such as microarray-based transcriptomics and mass-spectrometry-based proteomics, omics data, which result from the comprehensive analysis of a specific layer in a cellular system and are emerging as essential to molecular biology and systems biology, have accumulated rapidly and make it possible to capture a snapshot of cell-wide activity [1,2]. The increase in data acquisition has led to a demand for practical and effective data mining methods for in silico analysis. One of the most challenging problems biological researchers face is finding methods to discover biomarkers for tracking the progress of diseases such as cancer [3,4], as biomarker selection can be viewed as a major bottleneck of supervised learning and data mining on omics data [5,6].
Feature selection approaches, which aim to find a set of features that best discriminates biological samples of different types, have been widely applied to biomarker discovery [3,4,7-9]. The selected features are "biomarkers", and they form a "marker panel" for analysis. Fold-change and p-value are two commonly used criteria for selecting features that are differentially expressed under two experimental conditions. In the fold-change method, a feature is viewed as a "biomarker" if the ratio of its expression levels between the two classes exceeds, in absolute value, a certain threshold, e.g., a 2-fold change. P-value ranking is an alternative approach to feature selection. The p-value is the probability, under the null hypothesis that there is no difference between the two conditions, of obtaining a test result at least as extreme as the one observed for an individual feature. A variety of statistical tests, including the two-sample t test [10-16], the χ² test [10,17], one-way analysis of variance [18,19], the Wilcoxon signed rank test [20-23] and the Mann-Whitney test [23], have been used to obtain p-values. Although these approaches have been very successful in selecting biomarkers, omics data remain difficult to handle. Omics datasets are almost always small-sample datasets, because the number of features greatly exceeds the number of samples. P-value methods based on statistical tests therefore sometimes fail on omics data; for example, if each class contains only one sample, the statistical tests lose their effectiveness. In addition, [24] indicates that some omics data are distributed over a large range, so applying the same criterion to data in different ranges, which is the strategy of the fold-change approach, is inappropriate; for example, the significance of a 2-fold change from 2 to 1 is not equal to the significance of a 2-fold change from 20,000 to 10,000.
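The large-range limitation of a fixed fold-change threshold can be illustrated with a minimal Python sketch. The function names `fold_change` and `is_fold_change_marker` and the default threshold of 2 are illustrative assumptions, not part of any cited method.

```python
def fold_change(x1: float, x2: float) -> float:
    """Symmetric fold change: the larger expression level divided by the smaller."""
    if x1 <= 0 or x2 <= 0:
        raise ValueError("expression levels must be positive")
    return max(x1, x2) / min(x1, x2)


def is_fold_change_marker(x1: float, x2: float, threshold: float = 2.0) -> bool:
    """Flag a feature when its fold change reaches the (assumed) threshold."""
    return fold_change(x1, x2) >= threshold


# Both pairs give the same fold change, so a fixed threshold cannot
# distinguish a change at low abundance from one at high abundance.
low = fold_change(2, 1)
high = fold_change(20000, 10000)
print(low, high)
```

Both pairs are flagged identically, which is the behaviour the localized statistic strategy is meant to avoid.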
To overcome the large-range problem, [24] developed a computing method called Localized Statistics of Protein Abundance Distribution (LSPAD) to evaluate the statistical significance of protein-abundance bias between two classes, in which the differential significance of a particular protein is calculated from its local protein-abundance distribution window rather than from the whole distribution range between the lowest and highest protein abundances. In fact, LSPAD performs well even when each class contains only one sample, as validated in [24]. However, LSPAD is under-utilized in practice and has two shortcomings. The first is that its strategy for selecting the local distribution window is too rough: it postulates a local window width of 33%, i.e., only neighboring proteins within 33% of the A-axis around a particular protein are used for the calculation. The second is that LSPAD employs Fisher's exact test to check statistical significance. Fisher's exact test is a statistical significance test used in the analysis of contingency tables where sample sizes are small. However, if the data are floating-point values, a rounding operation must be performed first, which may cause Fisher's exact test to fail on omics data, and Fisher's exact test can be time-consuming when sample sizes are large.
In this study we present a computing method called localized statistic of abundance distribution based on Gaussian window (LSADBGW), which employs the localized statistic strategy of LSPAD but uses a Gaussian window as the local abundance distribution window and a simpler, more general statistical approach to test the significance of a feature. With the Gaussian window, the selection of the local abundance distribution window is more reasonable and persuasive. Moreover, LSADBGW can handle floating-point data as well as integer data, which widens its range of application compared with LSPAD. Experiments on three datasets, including gene and protein datasets, and a comparison with LSPAD show the accuracy and efficiency of LSADBGW for feature selection.
In summary, our contributions are: 1) we extend the application range of the localized statistic strategy to all omics data, whereas LSPAD is oriented only towards protein tandem mass spectrometry data processed by SEQUEST [25]; 2) we propose a new strategy for selecting the local abundance distribution window, based on a Gaussian window, which makes our method more reasonable and persuasive than LSPAD; 3) we propose a simpler but more effective statistical test in place of the Fisher's exact test used in LSPAD. The rest of the paper is organized as follows. A brief note on LSPAD is given in Section 2. Our method is presented in Section 3, and the datasets and experiments are given in Section 4. We show and discuss the experimental results in Section 5. Finally, Section 6 concludes.

Related Work
The concept of localized statistics for feature selection in omics was first proposed by [24], in which human serum from non-diabetic and diabetic cohorts was analyzed by a proteomic approach. To analyze a total of 1377 high-confidence serum proteins, the authors developed a computing strategy called localized statistics of protein abundance distribution (LSPAD) to calculate the significance of the bias of a particular protein's abundance between the two cohorts.
The LSPAD method can be divided into two steps. First, since the peptide-spectral-count distributions of the identified serum proteins were spread over a range of $10^5$, the authors developed an M-A plot, following microarray analysis, to display the relative abundance distribution of each protein. The M and A values are defined as follows:
$$M = \log_2 \frac{X_1}{X_2}, \qquad A = \frac{1}{2} \log_2 (X_1 X_2) \tag{1}$$
wherein $X_1$ and $X_2$ respectively represent the peptide spectral counts in diabetic serum and in non-diabetic serum, M represents the differential protein abundance between diabetic and non-diabetic serum, and A represents the average protein abundance.
The differential significance of a particular protein is then calculated from the proteins falling into its local protein-abundance distribution window, using Fisher's exact test. [24] postulates a local window width of 33% of the A-axis.
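The two ingredients described above, the M-A transform and the rectangular local window, can be sketched in Python as follows. This is only an illustration under stated assumptions: base-2 logarithms are assumed (as is conventional for microarray M-A plots), the window is assumed to be centred on the protein of interest, and the names `ma_values` and `rectangular_window` are hypothetical. Fisher's exact test itself is not shown.

```python
import math


def ma_values(x1: float, x2: float) -> tuple[float, float]:
    """Return (M, A) for two positive abundances, using base-2 logs (assumed)."""
    if x1 <= 0 or x2 <= 0:
        raise ValueError("abundances must be positive to take logarithms")
    log1, log2 = math.log2(x1), math.log2(x2)
    m = log1 - log2          # differential abundance
    a = 0.5 * (log1 + log2)  # average abundance
    return m, a


def rectangular_window(a_values: list[float], index: int,
                       width_fraction: float = 0.33) -> list[int]:
    """Indices of features whose A value lies inside the local window.

    Assumption: the window is centred on a_values[index] and its total
    width is width_fraction of the full A range.
    """
    half_width = 0.5 * width_fraction * (max(a_values) - min(a_values))
    center = a_values[index]
    return [j for j, a in enumerate(a_values) if abs(a - center) <= half_width]


m, a = ma_values(8.0, 2.0)
print(m, a)
```

Zero spectral counts would need special handling (for example a pseudocount) before taking logarithms; the source does not say how they were treated, so the sketch simply rejects them.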

Method
To address the limited practical use of LSPAD and its unreasonable window selection strategy, we propose a more practical and reasonable method for selecting significant features, called localized statistic of abundance distribution based on Gaussian window (LSADBGW). The M value defined in Equation (1) can itself be employed as a test statistic, whereas LSPAD ignores it. Because of its generality and simplicity, the normal distribution has been widely used in various areas, including omics data such as gene expression data [26]. We also propose a Gaussian window in LSADBGW in place of the local window used in LSPAD.

The Significant Test Method Using M Value
We assume that the M values follow a normal distribution, an assumption that is supported by Figure 1.
Assuming a Gaussian distribution, the significance of a feature is given by
$$S = \frac{M_{data} - M_W}{\sigma_W} \tag{2}$$
wherein S represents the significance value, $M_{data}$ represents the M value of the feature being tested, $M_W$ represents the mean of the M values falling in the statistical window, and $\sigma_W$ represents the standard deviation of the M values falling in the statistical window.
After the S value is obtained from Equation (2), the significance can be assessed directly from S; e.g., |S| ≥ 2.6 can be treated as significant at a level of 99%, assuming a Gaussian distribution.
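A minimal Python sketch of this test is given below. It assumes that S is the standardized deviation of the tested M value from the window mean, and that the sample standard deviation (n - 1 denominator) is used; the source does not specify the latter. The names `significance` and `is_significant` are illustrative.

```python
import statistics


def significance(m_data: float, window_m_values: list[float]) -> float:
    """Standardized deviation of m_data from the M values in the window."""
    mean_w = statistics.mean(window_m_values)
    # Sample standard deviation; this choice is an assumption.
    sigma_w = statistics.stdev(window_m_values)
    if sigma_w == 0:
        raise ValueError("window M values have zero spread")
    return (m_data - mean_w) / sigma_w


def is_significant(s: float, threshold: float = 2.6) -> bool:
    """Two-sided test: |S| >= 2.6 corresponds to roughly the 99% level."""
    return abs(s) >= threshold


window = [-1.0, 0.0, 1.0]
s = significance(3.0, window)
print(s, is_significant(s))
```

Because the test needs only a mean and a standard deviation, it applies to floating-point data without rounding and runs in time linear in the window size.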

The Gaussian Window
Since the significance calculation for a particular differential feature should be localized to a certain range of related abundance levels [24], the selection of an appropriate local abundance distribution window plays an important role in the localized statistics method. However, choosing a local window that suits all kinds of data distributions and ensures that all data falling into it lie in the same range is difficult or impossible, because the concept of "the same range" is ill-defined. We therefore consider the interaction between samples in different ranges instead of an exact partition into ranges; that is, the correlation between samples located near each other is higher than that between samples located far apart. For example, under the data partition of [24], the correlation between a low-abundance and a high-abundance protein sample is lower than that between two high-abundance samples.
However, accurately defining and quantifying the correlation between two samples according to their range distance is also a problem. Fortunately, data range and data distribution are closely related, so the problem of estimating the correlation between two samples in different ranges can be recast as one of density estimation. The correlation between two samples can then be measured by the contribution each sample point makes to the density estimate at the other. For example, if sample point A contributes more to the density estimate at sample C than point B does, we can say that the relationship between A and C is stronger than that between B and C.
From the point of view of density estimation, the local range window can therefore be selected with the same strategy as the local density estimation window. In effect, LSPAD employs a rectangular window whose width is 33% of the whole range. This choice is hard to justify: it is difficult to argue that 33% is better than 25% or any other value. We therefore use a Gaussian window instead of a rectangular window.
With a generalized weight kernel function K(x), the density estimator $\hat{p}(x)$ is given by
$$\hat{p}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) \tag{3}$$
wherein N is the number of samples, h is the smoothing parameter or window width, and the kernel function K(x) is required to be a normalized probability density. If K(x) is the Gaussian kernel, the density estimator is given by
$$\hat{p}(x) = \frac{1}{Nh\sqrt{2\pi}} \sum_{i=1}^{N} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right) \tag{4}$$
The choice of the bandwidth h is crucial to the density estimator: if h is too small, spurious fine structure becomes visible, while if h is too large, all detail, spurious or otherwise, is obscured. Several methods for choosing an appropriate bandwidth are available, but most of them carry a considerable computational burden [27]. As a tradeoff between computational effort and performance, one may choose the bandwidth that minimizes the mean integrated squared error under the assumption that the underlying distribution is Gaussian. This optimal Gaussian bandwidth $h_{opt}$ is given by [28]
$$h_{opt} = \left(\frac{4}{3N}\right)^{1/5} \sigma \approx 1.06\,\sigma N^{-1/5} \tag{5}$$
wherein $\sigma$ is the standard deviation of the samples.
We employ the Gaussian window as the local abundance distribution window. Unlike the original local window, the Gaussian window covers the whole range but assigns a different weight to each sample point. The sample set used for the localized statistics is constructed by the following strategy [equation missing in the source], wherein $randow_i$ is a random number drawn from the uniform distribution between 0 and 1, staDataset represents the sample set used for the localized statistics, and $P(x_i)$ is given by [equation missing in the source], wherein norcdf(x, $h_{opt}$, $|x_i|$) is the normal cumulative distribution function, x is the mean of the normal distribution, $h_{opt}$ is its standard deviation, and $|x_i|$ is the absolute value of the sample $x_i$.
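The Gaussian kernel density estimator, the rule-of-thumb bandwidth and the idea of a stochastic Gaussian window can be sketched in Python as follows. Two points are assumptions rather than the authors' definitions. The bandwidth uses the standard closed form $(4/(3N))^{1/5}\sigma$ for a Gaussian kernel and Gaussian data. The inclusion probability `inclusion_probability` is hypothetical, because the expression for $P(x_i)$ is not recoverable from the text: an unnormalized Gaussian weight that equals 1 at the window centre is used as a stand-in, and a sample is kept when its uniform random number does not exceed that weight.

```python
import math
import random
import statistics


def silverman_bandwidth(samples: list[float]) -> float:
    """Rule-of-thumb bandwidth for a Gaussian kernel and Gaussian data."""
    n = len(samples)
    sigma = statistics.stdev(samples)
    return (4.0 / (3.0 * n)) ** 0.2 * sigma


def gaussian_kde(x: float, samples: list[float], h: float) -> float:
    """Gaussian kernel density estimate at point x with bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-((x - xi) ** 2) / (2.0 * h * h)) for xi in samples)


def inclusion_probability(center: float, xi: float, h: float) -> float:
    """HYPOTHETICAL stand-in for P(x_i): Gaussian weight, 1 at the centre."""
    return math.exp(-((xi - center) ** 2) / (2.0 * h * h))


def sample_gaussian_window(center: float, samples: list[float], h: float,
                           rng: random.Random) -> list[float]:
    """Keep x_i when a uniform draw in [0, 1) does not exceed its weight."""
    return [xi for xi in samples
            if rng.random() <= inclusion_probability(center, xi, h)]


a_values = [0.0, 1.0, 2.0, 3.0, 4.0]
h_opt = silverman_bandwidth(a_values)
window = sample_gaussian_window(2.0, a_values, h_opt, random.Random(0))
print(h_opt, window)
```

Under this stand-in, samples close to the centre are almost always retained and distant samples are retained rarely, so every sample can contribute to the local statistic without a hard cutoff such as the 33% rectangular window.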

Datasets
Three datasets are used here. Dataset1: Ovarian cancer dataset (07 August 2002), which was collected using the WCX2 protein array. The sample set included 91 controls and 162 ovarian cancers. The SELDI MS data for each case is an ASCII file containing 15,155 points of m/z values with corresponding intensities.
Dataset2: Small Round Blue Cell Tumors (SRBCTs), which was obtained from glass-slide cDNA microarrays. The data consisted of expression measurements on 6567 genes (2308 genes after filtering for a minimal level of expression). The tumors are classified as Burkitt lymphoma (BL, 11 samples), Ewing sarcoma (EWS, 29 samples), neuroblastoma (NB, 18 samples) and rhabdomyosarcoma (RMS, 25 samples). As we focus only on the binary classification problem, EWS and RMS are selected to form a new two-class dataset.
Dataset3: Stem Cell Matrix (SCM) [29], which is a database of global gene expression profiles. The database consists of 218 samples belonging to 17 cell lines. As with Dataset2, ES cells_undifferentiated and ES_differentiated neural stem cells are selected to form a new two-class dataset. IPS cells are also selected to further test our method, which will be discussed in a later section.

The Classification Results and Discussion
LSADBGW is currently suited to two-column data, so the mean vectors of the two classes must first be calculated to form a new mean dataset. This operation may discard within-class differences that are useful for feature selection. Leave-one-out cross-validation (LOOCV) and a linear SVM are employed in our classification experiments.
Because all three methods use only the mean vectors, differences between samples of the same class are ignored, which may hinder classification. After feature selection, we cluster the selected features into 10 clusters with the k-means method and then select the top feature of each cluster to form a feature set for classification.
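The leave-one-out evaluation loop can be sketched in Python as follows. To keep the example self-contained, a nearest-class-mean classifier is used as a stand-in for the linear SVM employed in the experiments; this substitution, the function names and the toy data are all illustrative assumptions, and the k-means step is omitted.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class mean is closest to x (SVM stand-in)."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def loocv_accuracy(data, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(data)):
        train_x = data[:i] + data[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, data[i]) == labels[i]:
            correct += 1
    return correct / len(data)


# Toy data: rows are samples restricted to two selected features.
toy_data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
toy_labels = ["EWS", "EWS", "EWS", "RMS", "RMS", "RMS"]
print(loocv_accuracy(toy_data, toy_labels))
```

Any classifier with the same `predict(train_x, train_y, x)` signature, including a linear SVM from an external library, could be passed in place of the stand-in.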
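Figures 2-4 show the results obtained on Dataset1, Dataset2 and Dataset3, respectively, using LSADBGW, LSPAD' and LSPAD. Here LSPAD' denotes the variant of LSPAD that keeps the rectangular window but replaces Fisher's exact test with the simple statistical test using M values. All the p values used in the three methods were equal to 0.95.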
The results in Figure 2 show that LSPAD performed better than LSPAD', which suggests that Fisher's exact test was better than the simple statistical test using M values. However, the results generated by LSPAD are not shown in Figures 3 and 4, because LSPAD did not generate a good set of significant features, as illustrated in Figures 5 and 6. Figure 5 shows that only two features were selected, while Figure 6 shows that almost all features were selected. This indicates that LSPAD with Fisher's exact test is not a stable strategy for omics data, whereas LSPAD' with the simple statistical test is much more stable.
The results also show that the Gaussian window performed better than the rectangular window, especially in Figure 2. In Figure 4, however, LSPAD' seems slightly better than LSADBGW. We therefore used the top 10 and top 20 features, respectively, without clustering to investigate the performance of LSADBGW and LSPAD'; the results are shown in Figure 7 and 10. The new results, especially in Figure 8, indicate that LSADBGW performed better than LSPAD', which means that the strategy employing the Gaussian window performs better than the one employing the rectangular window.
The comparative study of the three feature selection methods indicates that the simple statistical test using M values is much more stable than Fisher's exact test, and that the Gaussian window is much more accurate than the rectangular window.

Conclusion
In this article, we proposed a new localized statistical approach to biomarker selection called localized statistic of abundance distribution based on Gaussian window (LSADBGW). Compared with the localized statistics of protein abundance distribution (LSPAD), LSADBGW employs a more reasonable strategy for selecting the local statistical window and a simpler, more general statistical test. The classification results indicate that our approach performs better than LSPAD. We hope that LSADBGW will offer a useful alternative for the analysis of omics data.

Figure 1. The distribution of M values. (a) The M value distribution of the serum SELDI MS data (Ovarian, 07 August 2002); (b) the M value distribution of Ewing sarcoma and rhabdomyosarcoma in the small round blue cell tumors dataset, which is a DNA microarray dataset.

Figure 2. The classification performance comparison on the ovarian cancer dataset.

Figure 3. The classification performance comparison on the small round blue cell tumors dataset.

Figure 4. The classification performance comparison on the stem cell matrix dataset.

Figure 5. M-A plot of the small round blue cell tumors dataset. Red dots represent statistically significant over-represented genes in EWS and green dots represent statistically significant under-represented genes in EWS.

Figure 6. M-A plot of the stem cell matrix dataset. Red dots represent statistically significant over-represented genes in ES cells_undifferentiated and green dots represent statistically significant under-represented genes in ES cells_undifferentiated.

Figure 7. The classification performance comparison on the stem cell matrix dataset using the top 10 features without clustering.

Figure 8. The classification performance comparison on the stem cell matrix dataset using the top 20 features without clustering.