
The emergence of drug-resistant bacteria is one of the most serious problems in public health today. However, the relationship between genomic mutations in bacteria and their phenotypic differences is still unclear. In this paper, based on mutation information from the whole genome sequences of 96 MRSA strains, two kinds of phenotypes (pathogenicity and drug resistance) were learned and predicted by machine learning algorithms. Thanks to effective feature selection by cross entropy based sparse logistic regression, these phenotypes could be predicted with sufficiently high accuracy (100% and 97.87%, respectively) using fewer than 10 features. This suggests that a novel rapid test for checking MRSA phenotypes could be developed in the future.

As shown in the action plan published by the World Health Organization in 2015, antimicrobial resistance is a very serious problem in infectious disease today. Owing to their fast evolution, bacteria acquire the ability to resist antimicrobial drugs. Among various single- or multi-drug resistant bacteria, methicillin-resistant Staphylococcus aureus (MRSA) is one of the most common and serious infectious microbes. The ability to survive treatment with methicillin presumably derives from the genomic sequence of the microbe; however, it is still unclear which genomic mutations cause it. In particular, the reason for phenotypic differences between MRSA strains has not been well studied.

To analyze the relationship between genotypes and phenotypes, we read the whole genome sequences of 96 MRSA strains with a next-generation sequencer. After mapping the short reads from the sequencer onto a reference genome sequence of MRSA, thousands of mutations called insertions or deletions (Indels) were detected in the whole genome sequences of the 96 MRSA strains. In addition, we prepared two phenotypes for these MRSA strains: the first concerns pathogenicity, and the second concerns drug resistance. The next problem to be solved is then finding the relationship between mutations and phenotypes. In fact, most of the mutations are probably irrelevant to these phenotypes, and only a small subset of them may cause the differences in the phenotypes of the strains. In this paper, we applied machine learning algorithms, namely classification by support vector machine and feature selection to improve classification accuracy. Through the process of finding a feature subset that yields higher classification accuracy, features (i.e. mutations) irrelevant to a phenotype are naturally removed. As a result, we achieved highly accurate classification with fewer than 10 features on average. To effectively select features from the high-dimensional binary vectors representing the presence of mutations in MRSA strains, we used cross entropy based sparse logistic regression. Since our MRSA mutation data show a certain level of sparsity (around 80% of the values are zero), this algorithm is expected to improve classification performance.

A sparse model is an approach that reduces complexity by neglecting less influential features in the model. Sparse models yield more interpretable selected features for a typical sparse dataset. The sparsity concept has been gaining attention in many fields of application such as statistics, data mining, and signal processing [

One application of the sparse model is in logistic regression, a statistical model for handling binary (dichotomous) classification problems. Many researchers have dealt with sparse models along with logistic regression [2,4-6]. The binary response can be viewed as a nonlinear function of the features. Let $y = (y_1, \cdots, y_n)$ with $y_i \in \{0, 1\}$ be the $n \times 1$ response vector, $x_i$ be a $p \times 1$ vector of features, and $\pi_i = p(y_i = 1 \mid x_i)$ be the probability estimate for the $i$-th sample. Logistic regression generates the coefficients of the features to predict a logit transformation of the probability of a sample case:

$$\mathrm{logit}[\pi_i] = \ln\left[\frac{\pi_i}{1 - \pi_i}\right] = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j, \quad i = 1, 2, \cdots, n,$$

where $\beta_0$ is the intercept and $\beta_j$ is the coefficient of the $j$-th feature. The log-likelihood function of the above model is defined as:

$$l(\beta) = \sum_{i=1}^{n} \left\{ y_i \ln(\pi_i) + (1 - y_i)\ln(1 - \pi_i) \right\}$$

where $\beta = (\beta_0, \beta_1, \cdots, \beta_p)$ is the vector of coefficients.

An advantage of logistic regression is that it estimates the probabilities $\pi_i$ and $1 - \pi_i$ for each class simultaneously, which directly enables classification.
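The model and log-likelihood above can be sketched in a few lines of NumPy; the toy feature matrix and coefficient values below are purely illustrative, not taken from the MRSA data:

```python
import numpy as np

def predict_proba(X, beta0, beta):
    """pi_i = p(y_i = 1 | x_i) from the logit model."""
    z = beta0 + X @ beta             # linear predictor: beta_0 + sum_j x_ij * beta_j
    return 1.0 / (1.0 + np.exp(-z))  # inverse of the logit transformation

def log_likelihood(y, pi):
    """l(beta) = sum_i { y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i) }."""
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# Toy binary feature matrix (rows: strains, columns: mutations) -- illustrative only
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]], dtype=float)
y = np.array([1, 0, 1, 0])
beta0, beta = -1.0, np.array([2.0, -0.5, 1.0])

pi = predict_proba(X, beta0, beta)
labels = (pi >= 0.5).astype(int)     # classify by thresholding pi_i at 0.5
```

Thresholding $\pi_i$ at 0.5 is the usual classification rule; the coefficients here are fixed by hand, whereas in practice they are estimated by maximizing the log-likelihood.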

A sparse logistic regression is constructed simply by adding a nonnegative penalty or constraint term to the logistic regression model in order to reduce the dimension of the features. The most well-known penalty is the one proposed by Tibshirani, the $L_1$-penalty, also known as LASSO (least absolute shrinkage and selection operator). The $L_1$-penalty equals the sum of the absolute values of all coefficients, $|\beta| = \sum_j |\beta_j|$. This penalty constraint performs feature selection and coefficient estimation simultaneously. The penalized logistic regression then becomes $\mathrm{PLR} = -l(\beta) + \lambda|\beta|$. The coefficient estimate is then defined as

$$\hat{\beta} = \arg\min_{\beta} \left[ -\sum_{i=1}^{n} \left\{ y_i \ln(\pi_i) + (1 - y_i)\ln(1 - \pi_i) \right\} + \lambda \sum_{j=1}^{p} |\beta_j| \right]$$
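The penalized objective can be written out directly; the function below only evaluates it for given coefficients (the minimization itself is done later by the cross entropy method), and the toy data are illustrative:

```python
import numpy as np

def plr_objective(beta0, beta, X, y, lam):
    """Penalized logistic regression objective: -l(beta) + lambda * sum_j |beta_j|.
    The intercept beta_0 is conventionally left unpenalized."""
    z = beta0 + X @ beta
    pi = 1.0 / (1.0 + np.exp(-z))
    neg_loglik = -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
    return neg_loglik + lam * np.sum(np.abs(beta))

# Toy data -- illustrative only
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 0])
beta = np.array([2.0, -2.0])

unpenalized = plr_objective(0.0, beta, X, y, lam=0.0)   # plain -l(beta)
penalized = plr_objective(0.0, beta, X, y, lam=1.5)     # adds 1.5 * sum|beta_j|
```

Increasing $\lambda$ raises the cost of every nonzero coefficient, which is what pushes small coefficients to exactly zero and thereby performs feature selection.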

The scalar $\lambda$ is a tuning parameter. Choosing this parameter carefully is crucial for the feature selection to reach high accuracy [

The Cross Entropy Method (CEM) was originally introduced by Rubinstein [

Briefly, the CEM involves two iterative phases: 1) generating random samples of candidate parameters using a specified mechanism, and 2) updating the parameters based on those samples to produce a "better" solution in the next iteration [

$$\theta^* = S(x^*) = \min_{x \in \Omega} S(x)$$

That is, we wish to minimize the function S(x) over all x in the solution space Ω. Following the CEM, we initially generate n samples as candidate solutions for minimizing S(x). The samples can be generated according to a uniform random distribution or a normal random distribution; assume, for instance, a normal distribution. Next, we calculate the fitness of each sample and sort the samples by fitness. We then calculate the mean and the standard deviation of only the best-fitness proportion, the elite samples. Based on this pair of mean and standard deviation we generate the samples for the next iteration. This iterative procedure is repeated until the stopping criterion is satisfied.

A tutorial on the CEM is given in [

Input: β_0 = (β_{0,1}, β_{0,2}, ⋯, β_{0,p}) and σ_0 = (σ_{0,1}, σ_{0,2}, ⋯, σ_{0,p}) % initial distribution parameters

n % sample size

ρ % elite sample size

ε % stopping criterion

d % initial criterion

t = 0 % iteration counter

while d ≥ ε

· set t = t + 1

· generate matrix B of size n × p based on the current mean vector β_t and the corresponding standard deviation σ_t

· for each column of B, fit the values to the objective function

· evaluate the fitness

· partially sort each column of B based on the fitness

· take the best ρ samples of each column of B

· calculate the means and standard deviations to get the new parameters β_{t+1} and σ_{t+1}

· calculate d

end while
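The pseudocode above can be sketched in Python. Here a simple quadratic S(x) stands in for the penalized log-likelihood objective, and the sample sizes and stopping threshold are illustrative choices, not the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def S(x):
    """Toy objective to minimize; a stand-in for the PLR objective."""
    return np.sum((x - 3.0) ** 2, axis=-1)

def cross_entropy_minimize(p=2, n=100, rho=10, eps=1e-6, max_iter=200):
    mu = np.zeros(p)            # beta_0: initial means
    sigma = np.ones(p) * 5.0    # sigma_0: initial standard deviations
    d, t = np.inf, 0
    while d >= eps and t < max_iter:
        t += 1
        B = rng.normal(mu, sigma, size=(n, p))   # generate n candidate solutions
        fitness = S(B)                           # evaluate the fitness
        elite = B[np.argsort(fitness)[:rho]]     # keep the rho best samples
        mu, sigma = elite.mean(axis=0), elite.std(axis=0)  # update the parameters
        d = sigma.max()                          # stop once the distribution collapses
    return mu

x_star = cross_entropy_minimize()   # converges near the minimizer x = (3, 3)
```

Using the largest component of σ as the criterion d is one simple choice; once the sampling distribution has collapsed onto a point, further iterations cannot improve the solution.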

From male and female patients, mostly over 60 years old, at Kanazawa University Hospital between 1998 and 2015, 96 strains of MRSA were collected and their DNA was extracted. The DNA sequences of each strain were read by a HiSeq 2500, one of the most popular and reliable next-generation sequencers. The output of the sequencer is a huge number of fixed-length subsequences called short reads. In this experiment, we obtained around 6.5 million short reads of length 150 bases for each strain. Using the Bowtie 2 software, they were mapped to HO 5096 0412, the reference genome sequence we chose. After that, 5587 Indels were detected using the VarScan software. That is, we prepared 96 binary feature vectors over 5587 features, where each feature corresponds to a specific insertion or deletion at a position of the reference genome sequence. For instance, the feature "581245:T->TTCAGAC" corresponds to an insertion of "TCAGAC" right after the "T" at position 581245 of the reference genome sequence. Since two or more features sharing exactly the same pattern of occurrence across the 96 strains are harmful for classification accuracy, such redundant features were unified. Finally, 96 feature vectors with 1978 unified features were prepared. As for the phenotypes used as class labels to be predicted, a pathogenicity phenotype (1: developed, 0: latent) was identified for all 96 strains. For the other phenotype, drug resistance, the resistance of each strain to four antimicrobial drugs (Piperacillin (PIPC), Sulbactam/Ampicillin (S/A), Cefazolin (CEZ), and Clindamycin (CLDM)) was tested (1: PIPC, S/A, and CEZ resistant; 0: PIPC, S/A, CEZ, and CLDM resistant). Since we could not precisely identify this phenotype for two strains, 94 strains were used for predicting the drug resistance phenotype. In this setting, two features became meaningless and were removed because they took the same value (all zero or all one) across the 94 strains. Therefore, 1976 unified features were used for predicting the drug resistance phenotype. Details of the two datasets are summarized in
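The unification of redundant features and the removal of constant ones can be sketched as a simple column-level deduplication; the function name and toy matrix below are illustrative, not the authors' preprocessing code:

```python
import numpy as np

def unify_features(X):
    """Collapse feature columns that share exactly the same 0/1 pattern across
    all strains, and drop constant (all-zero or all-one) columns."""
    keep, seen = [], set()
    for j in range(X.shape[1]):
        col = tuple(X[:, j])
        if len(set(col)) == 1:   # constant column: uninformative, drop it
            continue
        if col in seen:          # identical occurrence pattern: redundant, drop it
            continue
        seen.add(col)
        keep.append(j)
    return X[:, keep], keep

# Toy matrix: columns 0 and 2 are identical, column 3 is constant
X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 0]])
X_unified, kept = unify_features(X)   # keeps columns 0 and 1
```

The same pass over the 94-strain drug resistance matrix would also remove the two all-zero/all-one features mentioned above.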

In our experiment we adopted leave-one-out cross-validation (LOOCV) to confirm that our method works well. As a result, we have as many training-testing dataset pairs as samples, i.e. 96 for the pathogenicity dataset and 94 for the drug resistance dataset. Dealing with a high-dimensional dataset can be time consuming, especially under LOOCV. To avoid spending too much time eliminating the many unimportant features, we applied the random forest method to remove zero-importance features from the training datasets. After removing these features, sparse logistic regression was applied to select the features and estimate their coefficients simultaneously, with the CE algorithm used for coefficient estimation. Finally, we used a support vector machine (SVM) classifier for sample classification.
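The LOOCV skeleton of this pipeline can be sketched with scikit-learn. For brevity the sparse logistic regression step is omitted here, keeping only the random-forest zero-importance filter and the SVM classifier; the function name, toy data, and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loocv_accuracy(X, y, seed=0):
    """For each LOOCV fold: fit a random forest on the training fold, drop
    zero-importance features, then classify the held-out sample with an SVM.
    (The sparse logistic regression selection step is omitted in this sketch.)"""
    hits = 0
    for train, test in LeaveOneOut().split(X):
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        rf.fit(X[train], y[train])
        keep = rf.feature_importances_ > 0       # remove zero-importance features
        keep = keep if keep.any() else np.ones(X.shape[1], bool)
        clf = SVC().fit(X[train][:, keep], y[train])
        hits += int(clf.predict(X[test][:, keep])[0] == y[test][0])
    return hits / len(y)

# Toy data: feature 0 perfectly separates the classes, the rest are uninformative
X = np.zeros((20, 5))
y = np.array([0] * 10 + [1] * 10)
X[10:, 0] = 1.0
acc = loocv_accuracy(X, y)
```

Fitting the feature filter inside each fold, rather than once on the full data, is what keeps the LOOCV estimate unbiased.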

As shown in

Since the set of selected features differs in every iteration of the cross-validation, we counted each feature's occurrences (i.e. inclusions in the selected feature set) through the cross-validation. The relative frequency indicates the percentage of occurrence, and features with high relative frequency are shown in

Feature selection to achieve high classification accuracy is one of the main focuses of data analysis for high-dimensional datasets. A technique powerful for one dataset is not guaranteed to be appropriate for another. We have shown that our method performs well in tackling feature

| Dataset (phenotype) | Samples | Features | Classes |
|---|---|---|---|
| Pathogenicity | 96 (1: 63 samples, 0: 33 samples) | 1978 | 1: developed; 0: latent |
| Drug resistance | 94 (1: 19 samples, 0: 75 samples) | 1976 | 1: PIPC, S/A, and CEZ resistant; 0: PIPC, S/A, CEZ, and CLDM resistant |

| Tuning parameter (λ) | 2.0 | 3.0 | 4.0 | 4.5 | 5.0 |
|---|---|---|---|---|---|
| Classification accuracy | 0.9063 | 0.9583 | 0.9896 | **1.0000** | 0.9896 |
| Average number of selected features | 12.71 | 12.22 | 5.73 | 5.36 | 3.03 |

Best performance shown in boldface.

| Tuning parameter (λ) | 2.0 | 2.5 | 3.0 | 3.5 | 5.0 |
|---|---|---|---|---|---|
| Classification accuracy | 0.9468 | 0.9681 | **0.9787** | 0.9681 | 0.9574 |
| Average number of selected features | 7.32 | 5.62 | 4.62 | 4.68 | 3.44 |

Best performance shown in boldface.

selection for sparse datasets, especially the datasets used in this work. In addition, this result could be utilized to develop a novel rapid test method for checking MRSA phenotypes in the future, if the selected feature set is further validated by real experiments.

We would like to express our gratitude to Professor Takashi Wada and Dr. Yasunori Iwata at Division of Infection Control, Kanazawa University for providing MRSA data. In this research, the super-computing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo. Additional computation time was provided by the super computer system in Research Organization of Information and Systems (ROIS), National Institute of Genetics (NIG). This work was supported by JSPS KAKENHI Grant Number 26330328.

The authors declare no conflicts of interest regarding the publication of this paper.