
The recent explosion of high-throughput technology has been accompanied by a corresponding rapid increase in the number of new statistical methods for developing prognostic and predictive signatures. Three commonly used feature selection techniques for time-to-event data, single gene testing (SGT), Elastic Net and the Maximizing R Square Algorithm (MARSA), are evaluated on simulated datasets that vary in the sample size, the number of features and the correlation between features. The results of each method are summarized by reporting the sensitivity and the Area Under the Receiver Operating Characteristic Curve (AUC). The performance of each of these algorithms depends heavily on the sample size, while the number of features entered in the analysis has a much more modest impact. The coefficients estimated using SGT are biased towards the null when the genes are uncorrelated and away from the null when the genes are correlated. The Elastic Net algorithms perform better than MARSA and almost as well as SGT when the features are correlated, and about the same as MARSA when the features are uncorrelated.

Discovering prognostic or predictive signatures is a worthwhile endeavor, as it is well known that the effect of a treatment is largely heterogeneous. Medical research has witnessed a recent explosion of high-throughput technology, rendering the measurement of a large number of genetic features possible. Correspondingly, new analytical techniques are constantly being developed to process and draw associations from this daunting amount of information. However, the rapid development of both aspects, the measurement and the analysis of features, has made it difficult to determine the best analytical technique for finding a genetic signature.

To find a genetic signature, an algorithm is applied which ultimately combines several features into a single risk score, associated with the outcome [

In this paper, we present several algorithms for feature selection for a time-to-event outcome. By using simulated data, we know which features are associated with patient outcome and therefore are able to assess the performance of a technique by calculating the sensitivity and the Area Under the Receiver Operating Characteristic Curve (AUC). Throughout the paper, we use the term “gene” to represent the feature of a high-throughput analysis, which can be a probe set, clone, gene expression or any other molecular feature measured in a continuous manner. The primary aim of this paper is to evaluate the performance of the selection process and not the performance of the signature itself.

Three algorithms are chosen for evaluation (

These algorithms were chosen because they are commonly used in the literature [

When the selection is based on the p-value unadjusted for multiple comparisons, the SGT is a marginal technique which does not depend on the number of genes tested. This technique is usually employed on the total number of the genes and it supplies a subset of reasonable size for other algorithms. The rest of the algorithms (SGT when the selection is based on the false discovery rate, LASSO, Elastic Net or MARSA) are usually applied to a relatively smaller group of genes. Thus, in this paper the number of genes simulated is between 250 and 750, which is a reasonable number of genes to start any of the latter selection algorithms.

To our knowledge, the MARSA technique has not been properly evaluated until now and this paper is the first to compare feature selection algorithms for time-to-event outcomes using completely simulated datasets with varying sample sizes, with both positive and negative association with outcome and different levels of correlations between predictors.

Several papers have attempted to compare feature selection algorithms. In general, when the algorithms are compared on real datasets, there is no way to compare the accuracy of the signatures. Other papers propose a new algorithm and compare it to other techniques under specific conditions. For example, Song and Liang [

In the next section, we present the theoretical formulation for each of these algorithms. The details on simulations can be found in Section 3 and the results in Section 4. In Section 5, we summarize the results and provide conclusions.

Single gene testing (SGT) is a simple algorithm in which each gene is tested for its association with patient survival separately using the most common technique for survival analysis: the Cox proportional hazards (PH) model [

where h_{0}(t) refers to the baseline hazard, x_{i} is the value for the gene expression for a specific patient i and β is the coefficient obtained by maximizing the partial likelihood:
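In the standard form of the Cox model, which is assumed here to be the form intended, the hazard for patient i and the partial likelihood can be written as follows. The event indicator δ_{i} (equal to 1 when patient i has the event at time t_{i}, and 0 when censored) is notation introduced for this sketch:

```latex
h(t \mid x_i) = h_0(t)\,\exp(\beta x_i),
\qquad
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp(\beta x_i)}{\sum_{j \in R_i} \exp(\beta x_j)}
```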

with R_{i} being the risk set at time t_{i}. In this paper, all genes with a Likelihood Ratio Test (LRT) p-value of less than a particular value α (0.05 and 0.001 [

LASSO is a penalized likelihood regression model introduced originally by Tibshirani (1997). This method has exhibited increased popularity as a feature selection technique in the biomedical field with more than 30 articles using this method either alone or in combination with another method [

where p is the number of covariates and s is a parameter specified by the user and controls the amount of penalization used. With this restriction, all the coefficients are shrunk towards zero and some will be exactly zero, functioning in this way as a selection process. A larger s will allow fewer non-zero coefficients as compared to a smaller s.

More recently [

The parameter α balances how much LASSO restriction is involved in comparison to ridge-type restriction. When α = 1 there is a purely LASSO restriction and when α = 0 there is a ridge-type restriction. When 0 < α < 1, this technique is known as Elastic Net. As α decreases and the ridge restriction component increases, more covariates are selected.
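The role of α can be illustrated with a small numerical sketch. It assumes the penalty parameterization used by glmnet, λ·(α·Σ|β_j| + (1 − α)/2·Σβ_j²); the function name `elastic_net_penalty` and the example coefficients are hypothetical and serve only to show how the two extremes and a mixture behave.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Penalty lam * (alpha * L1 + (1 - alpha) / 2 * L2^2), glmnet-style (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in beta)        # LASSO part
    l2_squared = sum(b * b for b in beta)  # ridge part
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


# Hypothetical coefficient vector: two associated genes, one null gene.
beta = [0.45, -0.45, 0.0]
lasso_only = elastic_net_penalty(beta, lam=1.0, alpha=1.0)  # pure LASSO
ridge_only = elastic_net_penalty(beta, lam=1.0, alpha=0.0)  # pure ridge
mixed = elastic_net_penalty(beta, lam=1.0, alpha=0.5)       # 50% ridge
```

With α = 1 only the absolute values contribute, with α = 0 only the squared values, and intermediate α gives a weighted combination of the two.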

In essence, the estimates of the coefficients are found as [

where x^{T} represents the transpose of the vector x.

The parameter λ is chosen such that it maximizes the K-fold cross validation log partial likelihood (CVL) introduced by Verweij and van Houwelingen [

where the subscript (-k) indicates that the k-th subset of the data is left out.
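For orientation, the cross-validated log partial likelihood is commonly written as below. This is the usual textbook form, offered as a sketch rather than a copy of the original display: ℓ denotes the log partial likelihood on all the data, ℓ_{(-k)} the log partial likelihood with the k-th subset left out, and β̂_{(-k)}(λ) the penalized estimate obtained without the k-th subset.

```latex
\mathrm{CVL}(\lambda) = \sum_{k=1}^{K}
  \left[ \ell\!\left(\hat{\beta}_{(-k)}(\lambda)\right)
       - \ell_{(-k)}\!\left(\hat{\beta}_{(-k)}(\lambda)\right) \right]
```

Each term measures the contribution of the k-th subset to the log partial likelihood, evaluated at coefficients that were estimated without that subset.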

LASSO and Elastic Net are recommended when the number of covariates in the model is large, often exceeding the number of observations, and the covariates are correlated. To mimic a real life scenario only the genes with a p-value <= 0.2 were considered for this algorithm. By choosing a relaxed α level of 0.2 we want to ensure that all the genes with some potential are included while keeping the false negative rate to a minimum.

The two methods can be performed using the glmnet package in R. The parameter

The MARSA algorithm was developed at the Princess Margaret Cancer Centre and used successfully [

where β is the coefficient obtained in the CoxPH model and S is the variance of the covariate.

The first step is to select a number of candidate genes. To order the genes, we used the LRT p-value when each single gene is tested and selected the first p = 50 genes when 10 genes were associated with outcome (case A), p = 60 when 20 genes were associated with outcome (case B) and p = 120 when 60 genes were associated with outcome (case C; see Section 3 for the description of cases A-C). The run-time for the algorithm increases approximately with the square of the number of genes included. The selection process starts with a risk score based on all genes. In a backward selection fashion, all risk scores which are based on all genes but one (that is, p − 1 genes) are fitted using the Cox proportional hazards model and the set with the best R-squared is kept. Next, all the risk scores based on the sets of p − 2 genes obtained from the winner of the p − 1 sets are calculated and tested, and the model with the highest R-squared is kept. This process is repeated until the risk score is based on just a single gene. A forward selection is then applied by starting with this one gene and adding each one of the genes not yet in the risk score. At each step the R-squared is retained. In this way, a series of R-squared values is obtained for each number of genes from p to 1 in the backward phase of selection and another series in the forward phase of the selection. The smallest set of genes for which the R-squared value does not drop by adding another gene is selected as the constituent parts of the signature.
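The backward and forward passes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `backward_forward_paths` and `toy_score` are hypothetical names, and `score` stands in for the R-squared of a Cox model fitted to the risk score of a gene subset, which is not computed here. The final rule for choosing the signature from the two R-squared series is deliberately left out.

```python
def backward_forward_paths(genes, score):
    """Return the backward and forward paths as lists of (subset, score) pairs."""
    current = list(genes)
    backward = [(tuple(current), score(current))]
    # Backward phase: at each step drop the gene whose removal leaves the best score.
    while len(current) > 1:
        candidates = [[g for g in current if g != drop] for drop in current]
        current = max(candidates, key=score)
        backward.append((tuple(current), score(current)))
    # Forward phase: start from the surviving gene and add the best gene each time.
    forward = [backward[-1]]
    remaining = [g for g in genes if g not in current]
    while remaining:
        best = max(remaining, key=lambda g: score(current + [g]))
        current = current + [best]
        remaining.remove(best)
        forward.append((tuple(current), score(current)))
    return backward, forward


# Hypothetical stand-in for R-squared: two informative genes, two noise genes,
# and a small cost per gene so that noise genes lower the score.
weights = {"g1": 0.5, "g2": 0.3, "g3": 0.0, "g4": 0.0}


def toy_score(subset):
    return sum(weights[g] for g in subset) - 0.01 * len(subset)


backward, forward = backward_forward_paths(list(weights), toy_score)
```

The backward phase alone evaluates p + (p − 1) + ... + 2 candidate subsets, which is consistent with a run-time that grows roughly with the square of the number of genes.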

In this paper, the term “correlated genes” refers to the genes which are correlated among themselves and “association with survival” refers to the relationship of the genes with patients’ survival. The number of generated genes is realistic as all algorithms, except the SGT based on the p-value, are usually applied on a subset of the genes and not on the whole array.

To investigate the performance of the three algorithms described above in relation to the sample size and the number of genes in the dataset, nine datasets were generated from a standard normal distribution with different numbers of genes (p = 250, 500 and 750) and different numbers of patients (n = 50, 100 and 200). The genes were simulated to be independent of each other. For each of these sets, survival data were generated such that the first 10 genes were associated with survival with a coefficient of 0.45. The remaining p − 10 genes were not associated with survival.

For the situation p = 250 and n = 200, we also considered the possibility that some genes may be correlated with varying degree of correlation (0, 0.4, 0.6, 0.8).

| Case | Number of observations | Total number of genes | Number of independent genes associated with survival (theoretical coefficient) | Number of correlated genes associated with survival (theoretical coefficient) |
|---|---|---|---|---|
| Case A | 50, 100, 200 | 250, 500, 750 | 10 (0.45) | 0 |
| Case B | 200 | 250 | 10 (0.45) | 10 (0.45) |
| Case C | 200 | 250 | 20 (0.45), 20 (−0.45) | 10 (0.45), 10 (−0.45) |

Thus, it was considered that 20 genes were associated with survival (coefficient 0.45) and 10 of these were correlated among themselves.

For the same situation of p = 250 and n = 200 we considered the situation where 60 genes were associated with survival; 30 positively associated with death (coefficient 0.45) and 30 negatively associated with death (coefficient −0.45). Ten of the first 30 were correlated among themselves as well as 10 of the second group of 30. The correlation coefficients varied as before (0, 0.4, 0.6, 0.8).

The survival times were generated as exponentially distributed with the hazard:

with β_{i} the coefficient of the i^{th} covariate. To obtain approximately 50% events in each dataset, the censoring time was generated as uniformly distributed between 2 and 5, representing an accrual time of 3 years and a follow-up time of 2 years. The coefficients (0.45 and −0.45) were chosen such that the power to detect significance for one covariate with 50, 100 and 200 records varies and reflects real-life situations. For α = 0.001 the power for n = 50, 100 and 200 is 15%, 46% and 89%, respectively, and for α = 0.05 the power is 61%, 89% and 99%, respectively.
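The data-generating process for the independent-gene scenario (case A) can be sketched as below. This is an illustration under assumptions, not the simulation code used for the paper: the hazard is taken to be a constant baseline multiplied by exp(Σ β_{i} x_{i}), and the baseline value (`baseline_hazard=0.2`) is a hypothetical choice, since the value that yields about 50% events is not stated here. The function name `simulate_dataset` is also hypothetical, and correlated genes (cases B and C) are not covered.

```python
import math
import random


def simulate_dataset(n, p, betas, baseline_hazard, seed=None):
    """Simulate n patients with p independent standard normal genes and censored survival."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Only the first len(betas) genes carry a non-zero coefficient.
        linear_predictor = sum(b * xi for b, xi in zip(betas, x))
        hazard = baseline_hazard * math.exp(linear_predictor)
        event_time = rng.expovariate(hazard)   # exponential survival time
        censor_time = rng.uniform(2.0, 5.0)    # uniform censoring between 2 and 5
        observed = min(event_time, censor_time)
        event = 1 if event_time <= censor_time else 0
        rows.append((x, observed, event))
    return rows


# One dataset with n = 200, p = 250 and the first 10 genes associated (coefficient 0.45).
data = simulate_dataset(n=200, p=250, betas=[0.45] * 10, baseline_hazard=0.2, seed=1)
event_fraction = sum(event for _, _, event in data) / len(data)
```

In practice the baseline hazard would be tuned until `event_fraction` is close to the targeted 50%.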

All simulations were performed 2000 times. Each algorithm (SGT, LASSO, Elastic Net (α = 0.3), Elastic Net (α = 0.7), and MARSA) was applied to each of the simulated datasets. The data presented in this paper are based solely on simulation and do not contain any information collected from patients. As such, consent was not necessary.

The goal of the selection process is to choose as many genes as possible from the set of those truly associated with survival and to choose as few genes as possible from the set of those which are independent of outcome. To judge the performance of each strategy and each scenario, two metrics were calculated: sensitivity and the Area Under the Receiver Operating Characteristic Curve (AUC). The sensitivity is the proportion of the truly associated genes that are selected. The AUC measures overall performance, with the intent to minimize both the false positive and the false negative genes. Arguably, of the two types of false results, the false negative may be more damaging, since the false positive genes could be weeded out through a second process of validation using a different platform (like Polymerase Chain Reaction (PCR)). On the other hand, the false negative genes are lost completely. Sensitivity is a good measure to assess which scenarios would minimize the false negative genes.

A gene was considered as selected if it was significant and the direction of the detected association corresponded to the theoretical one. Disregarding the direction of significance would inappropriately inflate the results. For example, if one of these methods has the tendency to select a positive gene but to estimate the effect in the opposite direction, then it may appear that it is better than another method which selects fewer genes but with the correct direction.
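The two metrics, together with the direction rule, can be sketched as follows. The function name `selection_metrics` and the toy inputs are hypothetical. Treating the AUC of a single yes/no selection as the average of sensitivity and specificity is an assumption about how the AUC was computed; it does agree, up to rounding, with the tabulated values (for example, for SGT with α = 0.05, n = 50 and p = 250, a sensitivity of 0.322 and a specificity of 0.947 correspond to the reported AUC of 0.634).

```python
def selection_metrics(true_betas, selected):
    """Sensitivity, specificity and their average for one selection result.

    true_betas: list of true coefficients (0 means not associated with outcome).
    selected:   dict mapping gene index -> estimated coefficient of a selected gene.
    """
    associated = [i for i, b in enumerate(true_betas) if b != 0]
    null_genes = [i for i, b in enumerate(true_betas) if b == 0]
    # An associated gene counts only if the estimated sign matches the true sign.
    true_pos = sum(1 for i in associated
                   if i in selected and selected[i] * true_betas[i] > 0)
    false_pos = sum(1 for i in null_genes if i in selected)
    sensitivity = true_pos / len(associated)
    specificity = 1 - false_pos / len(null_genes)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy example: gene 2 is selected with the wrong sign, gene 4 is a false positive.
true_betas = [0.45, 0.45, -0.45, 0, 0, 0, 0]
selected = {0: 0.30, 2: 0.20, 4: 0.10}
sensitivity, specificity, auc = selection_metrics(true_betas, selected)
```

In the toy example, gene 2 is selected but does not count towards the sensitivity because its estimated sign is opposite to the true one.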

The performance of each of these algorithms depends heavily on the sample size. Regardless of the number of genes entered in the analysis, the AUC is higher for n = 200 than for lower n, while the number of genes entered in the analysis has a much more modest impact. The number of genes considered for each of these analyses is small in comparison to any high-throughput data. This choice is considered realistic, as FDR, MARSA and the penalized likelihood methods are typically applied to a subset of features chosen through a marginal method such as the unadjusted p-value of the SGT method.

Choosing α = 0.001 seems overly conservative with AUC around 0.7 even for n = 200 while for the rest of the algorithms the AUC is around 0.9 for n = 200 and around 0.6 for n = 50. With the exception of the SGT strategy, the other four algorithms exhibit a modest decrease in performance with the number of genes entered in the analysis. The performance increases slightly with the amount of ridge regression included in the Elastic Net. Choosing the genes based on FDR = 0.1 seems to be an excellent choice when the number of observations is adequate. It is important to note that the specificity is in general high (>0.8) and thus the level of AUC depends greatly on the level of sensitivity (Supplementary Tables 1(a)-(c)). In most cases, the sensitivity is tremendously poor (<0.4) for n = 50. This low sensitivity suggests that the sample size is extremely important and argues against dividing an already small dataset into two subsets for training and validation.

Of utmost importance is the fact that these algorithms most often do not produce the same set of significant genes.

the Elastic Net none are truly significant. Twenty genes are selected by both MARSA and the Elastic Net, of which only 3 are truly associated with the outcome. When the number of records is large (n = 200, power > 90% for testing one gene only), 9 of the 10 genes associated with the outcome are selected by all algorithms. The unselected gene of the 10 has a univariable p-value > 0.05. However, the number of genes selected by at least one of the algorithms but not associated with the outcome is quite large (43).

It was observed that the estimated coefficients for each strategy are sometimes biased, depending on the number of genes theoretically associated with outcome and on the correlation structure between these genes (

The coefficients obtained from SGT for the correlated genes were biased away from the null while for those uncorrelated (but in the presence of some correlated genes) the bias was slightly towards the null (

Thus, in the presence of correlated genes, the overall performance is misleading as it will average the performance of the correlated genes more likely to be selected with the performance of the uncorrelated genes less likely to be selected.

| Correlation | Genes | SGT: α = 0.05 | SGT: α = 0.001 | SGT: FDR = 0.05 | SGT: FDR = 0.1 | MARSA | Penalized likelihood: LASSO | Penalized likelihood: ELASTA5* | Penalized likelihood: ELASTA3** |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 genes | 0.649 | 0.17 | 0.554 | 0.702 | 0.692 | 0.85 | 0.852 | 0.854 |
| 0 | 10 genes | 0.654 | 0.171 | 0.564 | 0.71 | 0.696 | 0.853 | 0.856 | 0.857 |
| 0.4 | 10 correlated genes | 1 | 1 | 1 | 1 | 0.585 | 0.981 | 0.994 | 0.998 |
| 0.4 | 10 independent genes | 0.334 | 0.04 | 0.106 | 0.227 | 0.551 | 0.588 | 0.596 | 0.598 |
| 0.6 | 10 correlated genes | 1 | 1 | 1 | 1 | 0.479 | 0.955 | 0.986 | 0.996 |
| 0.6 | 10 independent genes | 0.274 | 0.026 | 0.045 | 0.128 | 0.493 | 0.515 | 0.524 | 0.528 |
| 0.8 | 10 correlated genes | 1 | 1 | 1 | 1 | 0.331 | 0.874 | 0.97 | 0.994 |
| 0.8 | 10 independent genes | 0.224 | 0.018 | 0.019 | 0.069 | 0.443 | 0.459 | 0.47 | 0.473 |

*Elastic Net with 50% ridge regression. **Elastic Net with 70% ridge regression.

gene. On the other hand, the LASSO and Elastic Net algorithms perform better than MARSA and almost as well as the SGT algorithms for the correlated genes, and about the same as MARSA for the uncorrelated genes. The pattern is the same for Case C (Supplementary

The existence of high-throughput datasets containing genetic information at multiple levels facilitates a broader and deeper understanding of the patients’ ability to cope with, be resistant to or be sensitive to treatments for diseases. Benefits of this knowledge are at the patient level as well as the social and economic level. However, extracting this information from a large amount of data can be challenging. Several statistical algorithms exist which attempt to find important genetic features to describe a specific condition or to explain an outcome. This paper presents a comparison of three major strategies for feature selection with survival as outcome. The SGT strategy is present either as the main strategy or as part of a more elaborate algorithm in the majority of papers analyzing high-throughput data. The alpha level of 0.001 is considered more informative as it guards against inflated type I error, ubiquitous in this type of data. This paper also presents the results for an alpha level of 0.05, which is traditionally used in medical statistics, as well as two levels for FDR (0.05 and 0.1). As the need for more elaborate techniques increases, the LASSO/Elastic Net technique gains popularity. It was created specifically to mitigate the disparity between the large number of covariates included in a model and the relatively small number of observations. MARSA is an algorithm created at the Princess Margaret Cancer Centre to obtain a genetic signature which explains the difference in survival for apparently homogeneous non-small cell lung cancer patients. While not widely used, this algorithm proved to be valuable, as the genetic signature found with this technique was successfully validated in independent datasets.

Using simulated data the AUC and the sensitivity for each method under several scenarios are calculated and presented, suggesting under which conditions each of these strategies is most beneficial. The specificity (for case A, Supplementary

To replicate realistic datasets, several parameters were varied in the process of simulation: the number of observations, the number of genes entered in the algorithm, the number of associated genes, the strength and the direction of associations of the genes with survival and the level of the correlation between genes. The combination of the different sample sizes, the different strengths of association with survival and the level of significance, α, covers a wide range of the statistical power with which a gene can be detected (15% to 99%).
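The quoted range of power can be checked with a short calculation. The sketch below assumes a Schoenfeld-type normal approximation for a single standardized covariate in a Cox model, with the number of events taken as half the sample size (about 50% events); the paper does not state which formula was used, and the function name `cox_power` is hypothetical. Under these assumptions the approximation gives 15%, 46% and 89% for α = 0.001 and 61%, 89% and 99% for α = 0.05 (n = 50, 100 and 200), matching the values quoted for the simulation design.

```python
from math import sqrt
from statistics import NormalDist


def cox_power(beta, n, event_fraction, alpha, sd=1.0):
    """Approximate two-sided power for one continuous covariate (Schoenfeld-type formula)."""
    events = n * event_fraction                      # expected number of events
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided critical value
    return NormalDist().cdf(abs(beta) * sd * sqrt(events) - z_alpha)


# Power for coefficient 0.45, 50% events, at both significance levels and all sample sizes.
powers = {(alpha, n): round(cox_power(0.45, n, 0.5, alpha), 2)
          for alpha in (0.001, 0.05) for n in (50, 100, 200)}
```

The approximation ignores the negligible probability of rejecting in the wrong direction.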

Our simulations indicate that the number of observations is extremely important when analyzing this type of data. Thus, regardless of the chosen strategy or number of genes, the AUC is higher when the sample size is 200. The ability to select the correct genes is affected by the number of genes when MARSA or one of the Elastic Net methods is used. Therefore, there is no real advantage in dividing a small dataset into two very small datasets to obtain training and validation datasets. A far better choice is to obtain another independent sample on which to validate the results. Increasingly, datasets with genetic and outcome information can be found in the public domain, and can be used for validation. In the absence of such a dataset, applying more than one method and utilizing a cross-validation technique might help in choosing the appropriate algorithm.

Based on these simulations it was observed that when multiple independent genes are associated with patient outcome, their univariate coefficients tend to be lower than the theoretical coefficients. This attenuation implies that the SGT technique is unlikely to select these genes and an algorithm which considers more genes at the same time in the model is more desirable (like MARSA or penalized likelihood). On the other hand, the correlation between genes (even a poor correlation of 0.4), when each one of them contributes to the outcome, could make each gene appear more interesting than it really is, due to an overestimation of the real coefficient. Thus, the correlations between the genes which are entered into MARSA or penalized likelihood need to be calculated.

As in any simulation study, it was possible to judge the efficiency of a method because we had information on the true underlying relationship in the data, information which is not usually available in the process of analyzing a real dataset. However, this study gives information on how these methods behave, so that one can interpret their results more easily.

It was not considered necessary to present examples, as each of these strategies has been applied to real datasets in the past. Moreover, the main objective of this paper was to determine the suitability of these strategies in correctly selecting as many of the associated genes as possible. The underlying assumption is that the appropriate set of features would also validate in an independent study. In addition, we do not wish to recommend a specific strategy for use in all situations as, indeed, this is unrealistic, but to present situations in which each of these strategies may be more suitable than another. We also recommend that any new strategy be thoroughly investigated in a simulated environment and evaluated against other common strategies.

In conclusion, one has to employ not only methodologies which test for association with outcome but also for correlations between the features considered. This paper is intended to guide a statistician or bioinformatician in the daunting task of finding genes associated with outcome.

The authors declare that they have no competing interests. None of the authors have any financial competing interests to disclose.

MP: initiated the research, performed statistical analysis, drew conclusions, drafted the manuscript

JS: drew conclusions, critically revised the manuscript.

All authors read and approved the final manuscript.

Pintilie, M. and Sykes, J. (2017) Evaluating Common Strategies for the Efficiency of Feature Selection in the Context of Microarray Analysis. Journal of Data Analysis and Information Processing, 5, 11-32. https://doi.org/10.4236/jdaip.2017.51002

| n | p | SGT: α = 0.05 | SGT: α = 0.001 | SGT: FDR = 0.05 | SGT: FDR = 0.1 | MARSA | Penalized likelihood: LASSO | Penalized likelihood: ELASTA5* | Penalized likelihood: ELASTA3** |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | 0.634 | 0.517 | 0.542 | 0.601 | 0.585 | 0.588 | 0.644 | 0.675 |
| 50 | 500 | 0.631 | 0.517 | 0.54 | 0.602 | 0.56 | 0.548 | 0.585 | 0.638 |
| 50 | 750 | 0.633 | 0.517 | 0.541 | 0.598 | 0.551 | 0.533 | 0.554 | 0.594 |
| 100 | 250 | 0.752 | 0.557 | 0.709 | 0.789 | 0.716 | 0.783 | 0.805 | 0.808 |
| 100 | 500 | 0.751 | 0.556 | 0.705 | 0.788 | 0.675 | 0.698 | 0.756 | 0.794 |
| 100 | 750 | 0.748 | 0.556 | 0.702 | 0.785 | 0.652 | 0.646 | 0.7 | 0.756 |
| 200 | 250 | 0.899 | 0.687 | 0.914 | 0.95 | 0.894 | 0.914 | 0.905 | 0.897 |
| 200 | 500 | 0.897 | 0.683 | 0.912 | 0.948 | 0.873 | 0.927 | 0.916 | 0.907 |
| 200 | 750 | 0.896 | 0.683 | 0.911 | 0.949 | 0.853 | 0.926 | 0.92 | 0.911 |

| n | p | SGT: α = 0.05 | SGT: α = 0.001 | SGT: FDR = 0.05 | SGT: FDR = 0.1 | MARSA | Penalized likelihood: LASSO | Penalized likelihood: ELASTA5* | Penalized likelihood: ELASTA3** |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | 0.322 | 0.034 | 0.083 | 0.203 | 0.218 | 0.208 | 0.377 | 0.486 |
| 50 | 500 | 0.314 | 0.036 | 0.079 | 0.204 | 0.145 | 0.106 | 0.205 | 0.361 |
| 50 | 750 | 0.32 | 0.035 | 0.082 | 0.197 | 0.118 | 0.071 | 0.123 | 0.232 |
| 100 | 250 | 0.555 | 0.114 | 0.418 | 0.578 | 0.496 | 0.65 | 0.734 | 0.767 |
| 100 | 500 | 0.553 | 0.114 | 0.411 | 0.576 | 0.382 | 0.428 | 0.584 | 0.708 |
| 100 | 750 | 0.548 | 0.113 | 0.404 | 0.571 | 0.326 | 0.307 | 0.438 | 0.596 |
| 200 | 250 | 0.848 | 0.374 | 0.828 | 0.901 | 0.856 | 0.954 | 0.955 | 0.955 |
| 200 | 500 | 0.844 | 0.368 | 0.823 | 0.895 | 0.781 | 0.943 | 0.95 | 0.953 |
| 200 | 750 | 0.843 | 0.367 | 0.822 | 0.897 | 0.73 | 0.912 | 0.935 | 0.944 |

| n | p | SGT: α = 0.05 | SGT: α = 0.001 | SGT: FDR = 0.05 | SGT: FDR = 0.1 | MARSA | Penalized likelihood: LASSO | Penalized likelihood: ELASTA5* | Penalized likelihood: ELASTA3** |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | 0.947 | 0.999 | 1 | 1 | 0.951 | 0.967 | 0.912 | 0.864 |
| 50 | 500 | 0.947 | 0.999 | 1 | 1 | 0.976 | 0.99 | 0.966 | 0.915 |
| 50 | 750 | 0.947 | 0.999 | 1 | 1 | 0.984 | 0.995 | 0.985 | 0.956 |
| 100 | 250 | 0.949 | 0.999 | 1 | 1 | 0.937 | 0.917 | 0.875 | 0.848 |
| 100 | 500 | 0.949 | 0.999 | 1 | 1 | 0.968 | 0.969 | 0.928 | 0.881 |
| 100 | 750 | 0.948 | 0.999 | 1 | 1 | 0.978 | 0.985 | 0.962 | 0.917 |
| 200 | 250 | 0.949 | 0.999 | 1 | 1 | 0.932 | 0.875 | 0.855 | 0.84 |
| 200 | 500 | 0.95 | 0.999 | 1 | 1 | 0.965 | 0.911 | 0.882 | 0.86 |
| 200 | 750 | 0.949 | 0.999 | 1 | 1 | 0.976 | 0.939 | 0.906 | 0.877 |

*Elastic net with 50% ridge regression; **Elastic net with 70% ridge regression.


| Correlation | Genes | SGT: α = 0.05 | SGT: α = 0.001 | SGT: FDR = 0.05 | SGT: FDR = 0.1 | MARSA | Penalized likelihood: LASSO | Penalized likelihood: ELASTA5 | Penalized likelihood: ELASTA3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 corr.* | 0.32 | 0.033 | 0.08 | 0.204 | 0.387 | 0.53 | 0.549 | 0.558 |
| 0 | 10 corr.** | 0.317 | 0.034 | 0.081 | 0.201 | 0.394 | 0.538 | 0.557 | 0.567 |
| 0 | 20 indep.* | 0.315 | 0.034 | 0.076 | 0.197 | 0.385 | 0.531 | 0.551 | 0.561 |
| 0 | 20 indep.** | 0.321 | 0.035 | 0.088 | 0.203 | 0.388 | 0.536 | 0.555 | 0.565 |
| 0.4 | 10 corr.* | 0.998 | 0.955 | 0.998 | 0.999 | 0.343 | 0.845 | 0.904 | 0.94 |
| 0.4 | 10 corr.** | 0.999 | 0.952 | 0.999 | 1 | 0.346 | 0.848 | 0.907 | 0.942 |
| 0.4 | 20 indep.* | 0.18 | 0.013 | 0.006 | 0.027 | 0.341 | 0.34 | 0.368 | 0.382 |
| 0.4 | 20 indep.** | 0.179 | 0.013 | 0.004 | 0.025 | 0.34 | 0.338 | 0.366 | 0.38 |
| 0.6 | 10 corr.* | 1 | 0.999 | 1 | 1 | 0.238 | 0.796 | 0.887 | 0.935 |
| 0.6 | 10 corr.** | 1 | 0.999 | 1 | 1 | 0.238 | 0.801 | 0.891 | 0.938 |
| 0.6 | 20 indep.* | 0.149 | 0.009 | 0.002 | 0.01 | 0.298 | 0.294 | 0.324 | 0.34 |
| 0.6 | 20 indep.** | 0.148 | 0.008 | 0.001 | 0.007 | 0.295 | 0.29 | 0.32 | 0.336 |
| 0.8 | 10 corr.* | 1 | 1 | 1 | 1 | 0.125 | 0.688 | 0.853 | 0.933 |
| 0.8 | 10 corr.** | 1 | 1 | 1 | 1 | 0.126 | 0.688 | 0.853 | 0.932 |
| 0.8 | 20 indep.* | 0.129 | 0.008 | 0.001 | 0.005 | 0.212 | 0.26 | 0.292 | 0.307 |
| 0.8 | 20 indep.** | 0.129 | 0.007 | 0.001 | 0.004 | 0.213 | 0.263 | 0.294 | 0.309 |

*Theoretical coefficient is 0.45; **Theoretical coefficient is (−0.45).

AUC = Area Under the Receiver Operating Characteristic Curve

SGT = Single Gene Testing

LASSO = Least Absolute Shrinkage and Selection Operator

MARSA = Maximizing R Square Algorithm

LRT = Likelihood Ratio Test

FDR = False Discovery Rate

PCR = Polymerase Chain Reaction
