Evaluating Common Strategies for the Efficiency of Feature Selection in the Context of Microarray Analysis

The recent explosion of high-throughput technology has been accompanied by a corresponding rapid increase in the number of new statistical methods for developing prognostic and predictive signatures. Three commonly used feature selection techniques for time-to-event data: single gene testing (SGT), Elastic net and the Maximizing R Square Algorithm (MARSA) are evaluated on simulated datasets that vary in the sample size, the number of features and the correlation between features. The results of each method are summarized by reporting the sensitivity and the Area Under the Receiver Operating Characteristic Curve (AUC). The performance of each of these algorithms depends heavily on the sample size while the number of features entered in the analysis has a much more modest impact. The coefficients estimated utilizing SGT are biased towards the null when the genes are uncorrelated and away from the null when the genes are correlated. The Elastic Net algorithms perform better than MARSA and almost as well as the SGT when the features are correlated and about the same as MARSA when the features are uncorrelated.


Introduction
Discovering prognostic or predictive signatures is a worthwhile endeavor as it is well known that the effect of a treatment is largely heterogeneous.The medical research has witnessed a recent explosion of high-throughput technology, rendering the measurement of a large number of genetic features possible.Correspondingly, new analytical techniques are constantly being developed to process and draw associations from this daunting amount of information.However, the rapid development of both aspects-the measurement and analysis of featureshas made it difficult to determine the best analytical technique for finding a genetic signature.
To find a genetic signature, an algorithm is applied which ultimately combines several features into a single risk score, associated with the outcome [1] [2] [3] [4] [5].The strength of the association between the risk score and the outcome depends heavily on the features which defines it.If the selected genes have little or no value in explaining the outcome, it is unlikely that a signature created using their values would be useful.Thus, the selection process is of paramount importance in the process of defining a signature.The selected features are typically studied in the laboratory (in vitro and in vivo).Thus, a well-chosen subset of features is contributing to a rapid development of new treatment strategies.
In this paper, we present several algorithms for feature selection for a time-toevent outcome.By using simulated data, we know which features are associated with patient outcome and therefore are able to assess the performance of a technique by calculating the sensitivity and the Area Under the Receiver Operating Characteristic Curve (AUC).Throughout the paper, we use the term "gene" to represent the feature of a high-throughput analysis, which can be a probe set, clone, gene expression or any other molecular feature measured in a continuous manner.The primary aim of this paper is to evaluate the performance of the selection process and not the performance of the signature itself.
Three algorithms are chosen for evaluation (Figure 1): single gene testing (SGT), Least Absolute Shrinkage and Selection Operator (LASSO) [6] and its extension, the Elastic Net [7] and the Maximizing R Square Algorithm (MARSA, [5]).Each algorithm is applied to the same simulated data in which a number of genes are known to be associated with patient survival.These algorithms were chosen because they are commonly used in the literature [8] and they are considered substantially different from each other.SGT is used extensively by itself or in combination with other strategies.LASSO and Elastic Net are well-defined statistical algorithms which have been recently gaining in popularity.MARSA is an in-house strategy developed at the Princess Margaret Cancer Centre.This strategy was used to find a signature which could separate patients with low vs. high risk of dying from non-small lung cancer [9].
The signature found using this strategy was validated in 5 independent datasets [9].
When the selection is based on the p-value unadjusted for multiple comparisons the SGT is a marginal technique which does not depend on the number of genes tested.This technique is usually employed on the total number of the genes and it supplies a subset of reasonable size for other algorithms.The rest of the algorithms (SGT when the selection is based on the false discovery rate, LASSO, Elastic Net or MARSA) are usually applied to a relatively smaller group of genes.Thus, in this paper the number of genes simulated is between 250 and 750 which is a reasonable number of genes to start any of the latter selection algorithms.
To our knowledge, the MARSA technique has not been properly evaluated until now and this paper is the first to compare feature selection algorithms for time-to-event outcomes using completely simulated datasets with varying sample sizes, with both positive and negative association with outcome and different levels of correlations between predictors.
Several papers have attempted to compare feature selection algorithms.In general, when the algorithms are compared on real datasets, there is no way to compare the accuracy of the signatures.Other papers propose a new algorithm and compare it to other techniques under specific conditions.For example, Song and Liang [8] proposed a split-and-merge algorithm and compared it to penalized regression techniques using both simulated data as well as real datasets.
However, the simulated scenario considered only a low between-feature correlation of 0.25.Pavlou et al. [10] compared penalized regression models with algorithms based on maximum likelihood estimation for binary outcomes on semisynthetic datasets.That is, the authors utilized the real data, but varied the prevalence of the event and created training datasets such that the ratio between the number of events and the number of predictors was 3 or 5.While this paper discusses the penalized regression models, the set of predictors is somewhat limited, and they do not discuss extreme cases such as when the numbers of predictors are much larger than the number of events.On the other hand, Bühlmann and Mandozzi [11] discuss penalized regression methods when the set of predictors is high-dimensional and the outcome is continuous.The authors use semisynthetic datasets in which they vary the size of the predictor set, the association between the predictors and the outcome and the strength of correlation between predictors.Their conclusion was that in general, LASSO was preferable, but the differences between the algorithms were small.
In the next section, we present the theoretical formulation for each of these algorithms.The details on simulations can be found in Section 3 and the results in Section 4. In Section 5, we summarize the results and provide conclusions.

Single Gene Testing (SGT)
Single gene testing (SGT) is a simple algorithm in which each gene is tested for its association with patient survival separately using the most common technique for survival analysis: the Cox proportional hazards (PH) model [12].In this approach, the hazard for developing the outcome is assumed to have the form: where h 0 (t) refers to the baseline hazard, x i is the value for the gene expression for a specific patient i and β is the coefficient obtained by maximizing the partial likelihood: ( ) with R i being the risk set at time t i .In this paper, all genes with a Likelihood Ratio Test (LRT) p-value of less than a particular value α (0.05 and 0.001 [13]) are considered significant and henceforth retained as part of the signature.A stricter α level would have a higher rate of false negative genes while a more relaxed alpha will have a higher rate of false positives.Alternatively, the genes are also selected based on the False Discovery Rate (FDR) [14] using a FDR of 0.05 and 0.1.
In this paper, the analysis is performed using the survival package in R but any standard statistical software can be employed.

Least Absolute Shrinkage and Selection Operator (LASSO) and the Elastic Net
LASSO is a penalized likelihood regression model introduced originally by Tibshirani (1997).This method has exhibited increased popularity as a feature selection technique in the biomedical field with more than 30 articles using this method either alone or in combination with another method [15]- [45].This method is applied to the Cox PH model with the following restriction imposed on the coefficients: where p is the number of covariates and s is a parameter specified by the user and controls the amount of penalization used.With this restriction, all the coefficients are shrunk towards zero and some will be exactly zero, functioning in this way as a selection process.A larger s will allow fewer non-zero coefficients as compared to a smaller s.
More recently [7], LASSO was extended to incorporate ridge regression using the following restriction: The parameter α balances how much LASSO restriction is involved in comparison to ridge-type restriction.When α=1 there is a purely LASSO restriction and when α = 0 there is a ridge-type restriction.When 0 < α < 1, this technique is known as Elastic Net.As α decreases and the ridge restriction component increases, more covariates are selected.
In essence, the estimate of the coefficients are found as [46]: ( ) where , m is the number of events, nthe number of observations and p the number of covariates.The bold items represent vectors and the x T represents the transpose of vector x.The parameter λ is chosen such that it maximizes the K-fold cross validation log partial likelihood (CVL) introduced by Verveij and van Houwelingen [47].
where the subscript (-k) indicates that the k-th subset of the data is left out.
LASSO and Elastic Net are recommended when the number of covariates in the model is large, often exceeding the number of observations, and the covariates are correlated.To mimic a real life scenario only the genes with a p-value <= 0.2 were considered for this algorithm.By choosing a relaxed α level of 0.2 we want to ensure that all the genes with some potential are included while keeping the false negative rate to a minimum.
The two methods can be performed using the glmnet package in R. The parameter λ is based on cross-validation.Since each run of the cross-validation will produce a slightly different value for λ, the cross-validation was repeated 5 times and the median of the 5 resulting values was the one utilized in the subsequent steps.

Maximizing R Square Algorithm (MARSA)
The MARSA algorithm was developed at the Princess Margaret Cancer Centre and used successfully [5] [48] to find a signature.In the first step, all genes are tested, one by one in a Cox PH model and the coefficients are preserved.Using these coefficients as weights, a risk score is calculated by multiplying each gene by its coefficient and summing across all the genes.The resulting risk score can then be tested in a CoxPH model.As a measure of predictability, the approximation of the Kent and O'Quigley's for the R-squared [49] was used: where β is the coefficient obtained in the CoxPH model and S is the variance of the covariate.The first step is to select a number of candidate genes.To order the genes, we used the LRT p-value when each single gene is tested and selected the first p = 50 genes when 10 genes were associated with outcome (case A) and p = 60 when 20 genes were associated with outcome (case B) and p = 120 when 60 genes were associated with outcome (case C, please Section 3 for the description of the cases A-C).The run-time for the algorithm increases (approximately n 2 ) with the number of genes included.The selection process starts with a risk score based on all genes.In a backward selection fashion, all risk scores which are based on all genes but one (that is, p − 1 genes) are fitted using Cox proportional hazards model and the set with the best R-squared is kept.Next, all the risk scores based on the sets of p − 2 genes obtained from the winner of the p − 1 sets is calculated, tested and the model with the highest R-squared is kept.This process is repeated until the risk score is based on just a single gene.A forward selection is then applied by starting with this one gene and adding each one of the genes not yet in the risk score.At each step the R-squared is retained.In this way, a series of R-squared values are obtained for each number of genes from p to 1 in the backward phase of selection and another series in the forward phase of the selection.The smallest set of genes for which the R-squared value does not drop by adding another gene is selected as the constituent parts of the signature.Figure 2 presents a graphical display of this criterion.Although the highest R-squared is at B with approximately 18 genes in the risk score, our algorithm would choose point A with approximately 6 genes in the risk score.When the R-squared decreases as the number of genes increases, it is a sign that the R-squared has reached its full potential.The high value at point B is due to overfitting rather than due to a real signal (Figure 3).

Description of the Simulation
In this paper, the term "correlated genes" refers to the genes which are correlated among themselves and "association with survival" refers to the relationship of the genes with patients' survival.The number of generated genes is realistic as all algorithms, except the SGT based on the p-value, are usually applied on a subset of the genes and not on the whole array.

Case A (Table 1)
To investigate the performance of the three algorithms described above in relation to the sample size and the number of genes in the dataset, nine datasets were generated from a standard normal distribution with different number of genes (p = 250, 500 and 750) and different number of patients (n = 50, 100 and 200).The genes were simulated to be independent of each other.For each of these sets, survival data were generated such that the first 10 genes were associated with survival with a coefficient of 0.45.The rest of p-10 genes were not associated with survival.

Case B (Table 1)
For the situation p = 250 and n = 200, we also considered the possibility that some genes may be correlated with varying degree of correlation (0, 0.4, 0.6, 0.8).Thus, it was considered that 20 genes were associated with survival (coefficient 0.45) and 10 of these were correlated among themselves.

Case C (Table 1)
For the same situation of p = 250 and n = 200 we considered the situation where 60 genes were associated with survival; 30 positively associated with death (coefficient 0.45) and 30 negatively associated with death (coefficient −0.45).Ten of the first 30 were correlated among themselves as well as 10 of the second group of 30.The correlation coefficients varied as before (0, 0.4, 0.6, 0.8).

Generating the Survival Times
The survival times were generated as exponentially distributed with the hazard: ( ) with β i the coefficient of the i th covariate.To obtain approximately 50% events in each dataset, the censoring time was generated as uniformly distributed between 2 and 5, representing an accrual time of 3 years and a follow-up time of 2 years.The coefficients (0.45 and −0.45) were chosen such that the power to detect significance for one covariate with 50, 100 and 200 records varies and reflects reallife situations.For α = 0.001 the power for n = 50, 100 and 200 is 15%, 46% and 89% respectively and for alpha = 0.05 the power is 61%, 89% and 99% respectively.
All simulations were performed 2000 times.Each algorithm (SGT, LASSO, Elastic Net (α = 0.3), Elastic Net (α = 0.7), and MARSA) was applied to each of the simulated dataset.Data presented in this paper is based solely on simulation and do not contain any piece of information collected from patients.As such, consent was not necessary.

Evaluation of the Simulation
The goal of the selection process is to choose as many genes as possible from the set of those truly associated with survival and to choose as few genes as possible from the set of those which are independent of outcome.To judge the performance of each strategy and each scenario, two metrics were calculated: sensitivity and the Area Under the Receiver Operating Characteristic (AUC).The sensitivity is the proportion of selected genes out of the truly associated genes.The AUC measures an overall performance with the intent to minimize both the false positive and false negative genes.Arguably, of the two types of false results, the false negative may be more damaging since the false positive genes could be weeded out through a second process of validation using a different platform (like Polymerase Chain Reaction (PCR)).On the other hand, the false negative genes are lost completely.Sensitivity is a good measure to assess which scenarios would minimize the false negative genes.
A gene was considered as selected if it was significant and the direction of the detected association corresponded to the theoretical one.A disregard of the direction of significance would inappropriately inflate the results.For example if one of these methods has the tendency to select a positive gene but to estimate the effect in the opposed direction then it may appear that it is better than another method which selects fewer genes but with the correct direction.

Results of the Simulation
The performance of each of these algorithms depends heavily on the sample size.Regardless of the number of genes entered in the analysis, the AUC is higher for n = 200 than for lower n, while the difference made by the number of genes entered in the analysis has a much more modest impact.The number of genes considered for each of these analyses is small in comparison to any high throughput data.This choice is considered realistic as FDR, MARSA and the penalized likelihood methods are typically applied to a subset of features, chosen through a marginal method as the unadjusted p-value of the SGT method.Figure 4 shows the distributions of the AUC for the 9 situations of Case A for each of the algorithms.
Choosing α = 0.001 seems overly conservative with AUC around 0.7 even for n = 200 while for the rest of the algorithms the AUC is around 0.9 for n = 200 and around 0.6 for n = 50.With the exception of the SGT strategy, the other four algorithms exhibit a modest decrease in performance with the number of genes entered in the analysis.The performance increases slightly with the amount of ridge regression included in the Elastic Net.Choosing the genes based on FDR = 0.1 seems to be an excellent choice when the number of observations is adequate.It is important to note that the specificity is in general high (>0.8)and thus the level of AUC depends greatly on the level of sensitivity (Supplementary Tables 1(a)-(c)).In most cases, the sensitivity is tremendously poor (<0.4) for n = 50.This low sensitivity suggests that the sample size is extremely important and argues against dividing an already small dataset into two subsets for training and validation.
Of utmost importance is the fact these algorithms most often do not produce the same set of significant genes.Figure 5 gives the results for two simulated datasets, one with 50 records and one with 200 records.The two Venn Diagrams show that the set of genes selected by SGT 0.05, FDR 0.1, Elastic Net 70% ridge regression or MARSA are quite different.When the dataset is small (n = 50) only one gene is common to all and 6 of the 10 genes truly associated with the outcome are not selected by any of these algorithms.Of the 13 genes chosen only by  the Elastic Net none are truly significant.Twenty genes are selected by both MARSA and the ElasticNet of which only 3 are truly associated with the outcome.When the number of records is large (n = 200, power > 90% for testing one gene only) then 9 of the 10 genes associated with the outcome are selected by all algorithms.The unselected gene of the 10, has the uniariable p-value > 0.05.However, the number of genes selected by at least one of the algorithms but not associated with the outcome is quite large (43).
It was observed that the estimated coefficients for each strategy are sometimes biased, depending on the number of genes theoretically associated with outcome and on the correlation structure between these genes (Figure 6).When the genes were independent of each other, the estimated coefficients were always smaller in absolute value than the theoretical coefficient.As the number of genes associated with the outcome increased, the estimated coefficients were further from the theoretical value.Figure 6 presents the averages over the 2000 simulations of the estimated coefficients (based on SGT) when 10, 20 and 60 genes were associated with the outcome.In this figure all genes were independent of each other.The horizontal lines are drawn at the theoretical coefficients of ±0.45.The thicker vertical broken lines divide the different datasets.
The coefficients obtained from SGT for the correlated genes were biased away from the null while for those uncorrelated (but in the presence of some correlated genes) the bias was slightly towards the null (Figure 7 for Case B and Supplementary Figure 1 for Case C).
Thus, in the presence of correlated genes, the overall performance is misleading as it will average the performance of the correlated genes more likely to be selected with the performance of the uncorrelated genes less likely to be selected.Table 2 shows the sensitivity for Case B for the two groups of genes: 10 correlated and 10 independent.As expected, the SGT algorithms have a sensitivity of 1 when the genes were correlated, but the sensitivity was very poor when the genes were independent.Note that for the SGT algorithms even a poor correlation like 0.4 can have a tremendous effect on the significance of the correlated Figure 6.The average of the coefficients for the genes associated with outcome over the 2000 simulations when the genes are independent among themselves.gene.On the other hand, the LASSO and Elastic Net algorithms perform better than MARSA and almost as well as the SGT algorithms for the correlated genes and about the same as MARSA for the uncorrelated genes.The pattern is the same for the Case C (Supplementary Table 2) and the direction of the association with outcome has no influence on the sensitivity.

Conclusions
The existence of high-throughput datasets containing genetic information at multiple levels facilitates a broader and deeper understanding of the patients' ability to cope, be resistant or sensitive to treatments for diseases.Benefits of this knowledge are at the patient level as well as the social and economic level.However, extracting this information from a large amount of data can be challenging.Several statistical algorithms exist which attempt to find important genetic features to describe a specific condition or to explain an outcome.This paper presents a comparison of three major strategies for feature selection with survival as outcome.The SGT strategy is present either as the main strategy or as part of a more elaborate algorithm in the majority of papers analyzing highthroughput data.The alpha level of 0.001 is considered more informative as it guards against inflated type I error, ubiquitous in this type of data.This paper also presents the results for an alpha level of 0.05 which is traditionally used in medical statistics as well as 2 levels for FDR (0.05 and 0.1).As the need for more elaborate techniques increases, the LASSO/Elastic Net technique gains popularity.It was created specifically to mitigate the disparity between the large number of covariates included in a model and the relatively small number of observations.MARSA is an algorithm created in Princess Margaret Cancer Centre to obtain a genetic signature which explains the difference in survival for apparently homogeneously non-small cell lung cancer patients.While not widely used, this algorithm proved to be valuable as the genetic signature found with this technique was successfully validated in independent datasets.Using simulated data the AUC and the sensitivity for each method under several scenarios are calculated and presented, suggesting under which conditions each of these strategies is most beneficial.The specificity (for case A, Supplementary Table 1(c)) is high in general due to the large number of genes generated under the null hypothesis (no association with survival).
To replicate realistic datasets, several parameters were varied in the process of simulation: the number of observations, the number of genes entered in the algorithm, the number of associated genes, the strength and the direction of associations of the genes with survival and the level of the correlation between genes.
The combination of the different sample sizes, the different strengths of association with survival and the level of significance, α, covers a wide range of the statistical power with which a gene can be detected (15% to 99%).
Our simulations indicate that the number of observations is extremely important when analyzing this type of data.Thus, regardless of the chosen strategy or number of genes the AUC is higher when the sample size is 200.The ability to select the correct genes is affected by the number of genes when MARSA or one of the Elastic Net methods is used.Therefore, there is no real advantage to divide a small dataset into two very small datasets to obtain training and validation datasets.A far better choice is to obtain another independent sample on which to validate the results.Increasingly, datasets with genetic and outcome information can be found in the public domain, and can be used for validation.In the absence of such a dataset, applying more than one method and utilizing a crossvalidation technique might help in choosing the appropriate algorithm.
Based on these simulations it was observed that when multiple independent genes are associated with patient outcome, their univariate coefficients tend to be lower than the theoretical coefficients.This attenuation implies that the SGT technique is unlikely to select these genes and an algorithm which considers more genes at the same time in the model is more desirable (like MARSA or penalized likelihood).On the other hand, the correlation between genes (even a poor correlation of 0.4), when each one of them contributes to the outcome, could make each gene appear more interesting than it really is, due to an overestimation of the real coefficient.Thus, the correlations between the genes which are entered into MARSA or penalized likelihood need to be calculated.
As in any simulation study, it was possible to judge the efficiency of a method because we had information on the true underlying relationship in the data, information which is not usually available in the process of analyzing a real dataset.However, this study could give information on how these methods behave such that one could interpret the results easier.
It was not considered necessary to present examples as each of these strategies has been applied to real datasets in the past.Moreover, the main objective for this paper was to determine the suitability of these strategies in correctly selecting as many of the associated genes as possible.The underlying assumption is that the appropriate set of features would also validate in an independent study.In addition, we do not wish to recommend a specific strategy for use in all situations as, indeed, this is unrealistic, but present situations when each of these strategies may be more suitable than another.We also recommend that any new strategy needs to be thoroughly investigated in simulated environment and evaluated against other common strategies.
In conclusion, one has to employ not only methodologies which test for association with outcome but also for correlations between the features considered.This paper is intended to guide a statistician or bioinformatician in the daunting task of finding genes associated with outcome.

Figure 1 .
Figure 1.The diagram of the algorithms used.

Figure 2 .
Figure 2.An example of the R-squared values vs. the number of genes in the risk score.A and B are local maxima, but the algorithm chooses A which is the local maximum with the smallest number of genes.

Figure 3 .
Figure 3.The steps for MARSA algorithm.

Figure 4 .
Figure 4. AUC under the different scenarios of Case A.

Figure 5 .
Figure 5. Examples of two datasets and the number of selected genes by each algorithm: (a) the number of records is 50; (b) the number of records is 200.

Figure 7 .
Figure 7.The average of the coefficients for the first the 20 genes associated with outcome (10 correlated among themselves and 10 independent) over the 2000 simulations for Case B.

Figure 1 .
Figure 1.Average of the coefficients over the 2000 simulations for the 60 genes associated with outcome with the first 20 being correlated.

Table 1 .
Summary of the parameters used for the simulations.

Table 2 .
The sensitivity for Case B. Elastic Net with 50% ridge regression.**Elastic Net with 70% ridge regression. *

Table 2 .
The sensitivity for all scenarios of Case C.