Cross Entropy Based Sparse Logistic Regression to Identify Phenotype-Related Mutations in Methicillin-Resistant Staphylococcus aureus

Emergence of drug resistant bacteria is one of the serious problems in today’s public health. However, the relationship between genomic mutation of bacteria and the phenotypic difference of them is still unclear. In this paper, based on the mutation information in whole genome sequences of 96 MRSA strains, two kinds of phenotypes (pathogenicity and drug resistance) were learnt and predicted by machine learning algorithms. As a result of effective feature selection by cross entropy based sparse logistic regression, these phenotypes could be predicted in sufficiently high accuracy (100% and 97.87%, respectively) with less than 10 features. It means that we could develop a novel rapid test method in the future for checking MRSA phenotypes.


INTRODUCTION
As shown in an action plan shown by World Health Organization in 2015, antimicrobial resistance is a really serious problem in today's infectious disorder. Due to the fast evolution of bacteria, they obtain the ability of resisting antimicrobial drugs. Among various single-or multi-drug resistant bacteria, Methicillin-resistant Staphylococcus aureus (MRSA) is one of the most popular and serious infectious microbial.
The ability of surviving under the treatment by methicillin might come from the genomic sequence of microbial, however, still it is unclear in which mutation of genome causes it. Especially, reason of phenotypic difference between MRSA strains has not been well studied.
To analyze the relationship between genotypes and phenotypes, we have red whole genome sequences of 96 MRSA strains by a next generation sequencer. After mapping the short reads from sequencer onto a reference genome sequence of MRSA, thousands of mutations called insertion or deletion (Indel) have Open Access been detected in the whole genome sequences of 96 MRSA strains. On the other hand, we prepared two phenotypes of these MRSA strains. The first phenotype is about pathogeny, and the second is about drug resistance. Then, next problem to be solved is finding the relationship between mutations and phenotypes. Actually, most of the mutations might be irrelevant from these phenotypes and only a small subset of them might cause the difference in the phenotypes of the strains. In this paper, we applied machine learning algorithms, which is classification by support vector machine and feature selection for the improvement of classification accuracy. Through the process of finding a feature subset for higher classification accuracy, features (i.e. mutations) irrelevant to a phenotype are naturally removed. As a result, we could achieve highly accurate classification by less than 10 features in average. For effectively selecting features from high dimensional binary vectors representing the existence of mutations in MRSA strains, we used cross entropy based sparse logistic regression. Since our MRSA mutation data show a certain level of sparsity (around 80% of values are zero), the algorithm is expected to improve the performance of classification.

Sparse Logistic Regression
Sparse model is an approach to reduce complexity by neglecting less influential features in the model. The sparse model leads to achieve more interpretable selected features for a typical sparse dataset. The sparsity concept has been gaining attention in many fields of application such as statistics, data mining, and signal processing [1]. Sparsity also plays an important role in genomic studies especially in term of reducing fraction of genes in order to accurately predict disease out of non-disease samples [2]. The idea of this technique is simply achieved by modifying the model (e.g. by adding regularization on the model) that simultaneously produces good predictions and sparse (i.e. zero coefficient for less influential features) [3].
One of the applications of sparse model is that in logistic regression which is a statistical model to handle a binary (dichotomous) classification problem. Many researchers have been dealing with sparse model along with logistic regression [2,[4][5][6]. This binary response can be viewed as a nonlinear function of features. Let β is the intercept and j β is the j-th coefficient of j-th feature. The log-likelihood function of above equation is defined as: , vector of coefficients. The advantage of logistic regression is its possibility to estimate the probabilities i π and 1 i π − simultaneously for each class and making classification.
To construct a sparse logistic regression is simply by adding a nonnegative penalty or constrain term to the logistic regression model in order to reduce the dimension of features. The most well-known penalty is one proposed by Tibshirani [7], the L 1 -penalty, also known as LASSO (least absolute shrinkage and selection operator). The L 1 -penalty equals to the sum of all absolute coefficients, . This penalty constrain simultaneously performs feature selection and coefficient estimation. The penalized logistic regression then becomes

Biomedical Science and Engineering
The scalar λ is a tuning parameter. Choosing this parameter is crucial in the feature selection to successfully reach high accuracies [8].

Cross Entropy Method
The Cross Entropy Method (CEM) was originally introduced by Rubinstein [9] as an adaptive algorithm to estimate probabilities of rare events in complex stochastic network by involving minimization of variance. By then a simple modification of the original work could be used for solving a complex combinatorial optimization problem [10]. Many applications have been shown to testify the power of the CE method for solving NP-hard problems. The advantages of the cross entropy method in comparison to other optimization methods, such as simulated annealing, tabu search, and genetic algorithm, are fast and optimal updating rules [11,12].
Briefly, the CEM involves two iterative phases: 1) Generating random data sample as candidate parameters using a specified mechanism, 2) Updating the parameters based on the data to produce "better" solution in the next iteration [13]. The details of CEM are as follows. Suppose that we aim to minimize (or maximize) the target function S over the solution space Ω and let * x the corresponding minimizer. For simplicity let assume we have only one single variable to minimize S(x). Denote the minimum fitness by So, we wish to minimize the function S(x) over all x in the solution space Ω. Based on CEM, initially we generate n samples as the candidate solution to minimize S(x). The samples can be generated according to a uniformly random distribution or a certain normal random distribution. Let's assume, for instance, a normal random distribution. Next, we calculate the fitness of each sample and sort the samples based on the fitness. Then calculate the mean and the standard deviation only on the best fitness proportion or the elite samples. Based on this pair of mean and standard deviation we generate samples in the next iteration. These iterative procedures are conducted until the stopping criteria satisfied.
The tutorial on CEM is given in [13] and some examples of CEM application can be found in [14]. The following figure shows pseudo-code of the cross-entropy algorithm for normal distribution.

Datasets
From male and female patients mainly more than 60 years old at Kanazawa University Hospital from 1998 to 2015, 96 strains of MRSA were taken and their DNA sequences have been extracted. DNA sequences from a strain are red by Hiseq 2500, one of the most popular and reliable next generation sequencers. The output of the sequencer is a huge amount of fixed length subsequences called short reads. In this experiment, for each strain we obtained around 6.5 million of short reads with the length of 150 base. Using Bowtie 2 software, they are mapped to HO 5096 0412, the reference genome sequence which we chose. After that, 5587 Indels were detected by using VarScan software. It means that we prepared 96 feature vectors with binary values for 5587 features. At this point, a feature corresponds to a specific insertion or deletion at a position of reference genome sequence. For instance, a feature "581245:T->TTCAGAC" corresponds to an insertion of "TCAGAC" right after the "T" at position 581245 of the reference genome sequence. Since two or more features sharing completely the same pattern of occurrence in 96 strains are harmful for classification accuracy, such redundant features are unified. Finally, 96 feature vectors with 1978 unified features have been prepared. About the phenotypes used as class labels to be predicted, a pathogenic phenotype (1: developed, 0: latent) is identified for all 96 strains. For another phenotype about drug resistance, resistance of each strain to four kinds of antimicrobial drugs (Piperacillin (PIPC), Sulbactam/Ampicillin (S/A), Cefazolin (CEZ), and Clindamycin (CLDM)) has been tested (1: PIPC, S/A, and CEZ resistant, 0: PIPC, S/A, CEZ, and CLDM resistant). Since we could not precisely identify this phenotype for two strains, 94 strains were used for predicting the drug resistance phenotype. In this context, two features became meaningless and were removed since they showed the same value (all zero or all one) for 94 strains. Therefore, 1976 unified features were used for predicting drug resistance phenotype. Details of two datasets are summarized in Table 1.

Experiment
In our experiment we adopted leave one out cross validation (LOOCV) to ensure our methods work well. As a result, we have training-testing pair datasets as the number of samples, i.e. 96 for Pathogenicity dataset and 94 for Drug resistance dataset. Dealing with high dimensional dataset in some circumstance is time consuming, especially when adopting LOOCV. To avoid spending too much time for eliminating so many unimportant features, we applied random forest method to remove zero-importance features from training datasets. After removing these features, the sparse logistic regression took place to select and estimate the features and their coefficients simultaneously. We employed CE algorithm to estimate the coefficients. At the end we used support vector machine (SVM) classifier for sample classification. Table 2 and Table 3, the performance of our proposed method applied to both datasets based on classification accuracy was almost perfect. In Pathogenicity dataset we reached 100% accuracy, while in Drug resistance dataset the accuracy was 97.87%. They outperformed the accuracies 92.1% and 95.5% achieved by the similar experiments we did with correlation-based feature selection with grid search.

As shown in
Since the set of selected features are different in every iteration of cross-validation, we counted each feature's occurrence (i.e. inclusion in feature set) through the cross validation. Relative frequency indicates the percentage of occurrence, and features with high relative frequency are shown in Figure 1 and Figure  2. For the top 4 selected features in Pathogenicity datasets, the occurrences of features are relatively to have the same level when LOOCV to apply. In contrast, we can easily see the disparity among the top 4 selected features in Drug resistance dataset. This implies that feature selection in Pathogenicity dataset might be more stable than in Drug resistance dataset.

CONCLUSION
Feature selection to achieve high accuracy classification is one of the main focuses in the data analysis for high-dimensional datasets. A powerful technique to a typical dataset does not guarantee to be appropriate for another dataset. We have shown that our method can perform good results in tackling feature J. Biomedical Science and Engineering