Identification of Deleterious Single Amino Acid Polymorphism Using Sequence Information Based on Feature Selection and Parameter Optimization

Most of the human genetic variations are single nucleotide polymorphisms (SNPs), and among them, non-synonymous SNPs, also known as SAPs, attract extensive interest. SAPs can be neural or disease associated. Many studies have been done to distinguish deleterious SAPs from neutral ones. Since many previous studies were based on both structural and sequence features of the SAP, these methods are not applicable when protein structures are not available. In the current paper, we developed a method based on UMDA and SVM using protein sequence information to predict SAP’s disease association. We extracted a set of features that are independent of protein structure for each SAP. Then a SVM-based machine-learning classifier that used grid search to tune parameters was applied to predict the possible disease association of SAPs. The SVM method reaches good prediction accuracy. Since the input data of SVM contain irrelevant and noisy features and parameters of SVM also affect the prediction performance, we introduced UMDA-based wrapper approach to search for the ‘best’ solution. The UMDA-based method greatly improved prediction performance. Compared with current method, our method achieved better performance.


Introduction
With the completion of the human genome project, more and more single nucleotide polymorphisms (SNPs) are collected.It's estimated that around 90% of human genetic variations are SNPs [1], among them, the single amino acid polymorphisms (SAPs), also known as nonsynonymous SNPs or nsSNPs, cause amino acid substitutions in the protein product, and they have the potential to affect protein structure and function.Some of the SAPs won't make any change in phenotype, so we consider them neutral, while others account for many human genetic diseases, so we consider them deleterious [2].It's important to discriminate deleterious nsSNPs from neutral ones for the study of human disease.
So far, many methods have been developed to predict possible disease association of SAPs.For example, empirical rules [3,4], probabilistic models [5], and machine learning techniques [6][7][8][9][10][11][12] are used to classify SAPs.These studies use various features to distinguish deleterious nsSNPs from neutral ones.Some of the methods use features derived from protein sequences [4,[6][7][8] while others use both sequential and structural features [9][10][11][12].An up-to-date study predicted deleterious SAPs by adding network features based on sequential features and structural features [13].However, a limitation of the methods using structural features is that they are applicable only when protein structures are known, while the majority of proteins don't have available structural information.Therefore, it's significant to predict deleterious SAPs with high accuracy only using sequential information.Some previous methods can predict the disease association of SAPs using only sequence information.Maybe because some sequence information hasn't been explored or the features haven't been optimized by the previous studies, their prediction accuracy is not high enough to omit further research.
In this study, we extracted a set of 130 features, and these features are independent of protein structure.Each SAP was encoded by 130 features, and then a SVMbased machine-learning classifier is applied to predict the possible disease association of SAPs.It's not clear which features are relevant with the prediction, so feature selection is often used to improve the prediction accuracy.On the other hand, the parameters of the classifier have an important effect on the classification performance.Thus, we applied a UMDA-based method which can select features and optimize parameters simultaneously.Using the selected features and parameters, SVM classifier achieves better prediction performance.

Conservation Attributes
The degree to which a residue is conserved is very important for the classification of SAPs.We measured the residue conservation by using PSI-BLAST [14].PSI-BLAST can output Position Specific Scoring Matrix (PSSM), which is an L*20 matrix and L in this study is the length of input sequence.First, we derived three attributes from PSSM, that is, the observed frequencies of original residue and substitution residue in SAP, and their difference.Then we also obtained conservation score of the SAP's neighbors, from the output of PSI-BLAST directly.Saunders' study showed that the number of homologous sequence in alignment can reflect the reliability of PSSM, so we added this value as an attribute.

Motif Attributes
Motifs are some conservative segments in the protein sequence, and they contain lots of biological information.In this study, we used MEME [15] to extract motifs from database.Finally, we got 27 motifs from deleterious SAP dataset and 20 motifs from neural SAP dataset by setting different parameters.Then for each SAP's neighbor sequence, if one motif appeared in it, we set the attribute to1.Otherwise, set the attribute to 0. In this way, we obtained 47 attributes.

Physicochemical Properties of Amino Acid
If the difference between the physicochemical properties of the original amino acid and the variation amino acid is big, the SAP is likely to cause changes of protein's function.If the difference is small, the SAP may be compatible by the body and won't be deleterious.We considered four kinds of properties of amino acid that were molecular weight, pI value, hydrophobicity value, and conservative value.For each SAP, property values of the original amino acid and the variation amino acid as well as their differences of corresponding property value between the two amino acids were calculated.

Amino Acid Submission Rates
Matthew's report pointed that amino acid submission rates in the deleterious SAP dataset and in the neural SAP dataset are different.For the deleterious SAP dataset, submission rates between physicochemically similar amino acids are lower, and that between highly physicochemically different amino acids are higher.However, amino acid submission rates for the neural SAP dataset come to the contrary.Thus, for each SAP, the difference between the logarithms of submission rates in the deleterious set and the neural set was calculated as an attribute: where n ij p and d ij p are the submission rates of amino acid i and j in the deleterious SAP dataset and the neural SAP dataset respectively.

Position Attributes
We supposed that the position where SAP is in the protein sequence has an effect on the disease association of the SAP.Based on the above guess, we divided the protein sequence into five same areas according to the position.These five areas were named by ( )

Stability of Protein
The physicochemical properties stability of a protein sequence may be associated with the function change [16].Thus, we calculated the average property values of SAP's neighbor sites, including four amino acids in the downstream and four in the upstream.The difference between the property value of original amino acid and the average value, as well as the difference between the property value of variation amino acid and the average value were also calculated. 4 1 where

{ }
prop molecular weight, pI value, hydrophobicity value, conservative value represents the property value of original amino acid and prop V is that of variation amino acid.
i prop is the property of the ith site around the SAP.
prop Avg is the average property value of SAP's neighbor sites.

Sequence Features Used in Previous Study
Hu et al. [8] calculated 686 features which were derived from sequence information for each SAP, and selected 10 features at last.Among the 10 selected features, is HLA indicates whether the protein in which the SAP located belongs to the HLA family, METAL and MOD_RES shows whether the SAP is close to functional sites, nor_diff_freq is the normalized frequency difference between original residue and substitution residue, and the other features are defined based on entries from AAindex that is a public database of amino acid properties.

SVM Classifier and Performance Measure
The classifier we used in our study is Support Vector Machine (SVM), which separates transformed data with a hyper plane in a high-dimensional space.SVM has been widely used in classification in bioinformatics.We adopted the LIBSVM that used the radial basis function (RBF) as the kernel function [17].
We used grid-search method to search the best values of the penalty coefficient C and the kernel parameter γ .
C was set to  , and so does γ .We tried all the combinations of C and γ , then selected the pair that got the best 5-fold cross-validation accuracy.
Let "Disease" be the positive class and "Polymorphism" be the negative class.The overall accuracy is defined as below.

TP TN ACC TP TN FP FN
where TP is the number of true positive; TN is the number of true negative; FP is the number of false positive and FN is the number of false negative.We also calculated the Matthew's correlation coefficient (MCC), which is more realistic than ACC on an unbalanced dataset [18].

MCC TP TN FP FN TP FN TP FP TN FN TN FP
The higher the ACC and MCC are, the better the classifier's performance is.

UMDA Method
Estimation Distribution Algorithms (EDAs) [19] are evolutionary searching strategies without crossing and mutation operators.In EDAs, the new population is sampled from a probability distribution which is estimated from the fittest individuals.Univariate marginal distribution algorithm (UMDA) is a type of EDAs, and UMDA assumes that the variables in the problem were independent.In our study, UMDA was used to search for the "best" solution, including a feature subset and a set of SVM parameters.In the UMDA-based method, the binary coding chromosome representation is shown as Figure 1.
In Figure 1,   ( ) where n is the number of variables, and ( ) S6: Retain N fittest individuals and sample M N − individuals from ( ) t p x to form the next generation.S7: If the termination criterion is satisfied, stop.Otherwise, go to S2.The maximum number of iterations is the termination criterion.

Experiment and Result
In this study, we acquired SAP data from Swiss-Prot knowledgebase for the prediction of SAPs [20].SAPs in the Swiss-Prot are classified into three categories, "Disease", "Polymorphism", and "Unclassified".SAP with disease association is annotated with "Disease" (same as "deleterious"); SAP with no reported disease association is annotated with "Polymorphism" (same as "neutral"); and SAP whose disease association is unclear is annotated with "Unclassified".We deleted "Unclassified" SAPs and focused on "Disease" and "Polymorphism" ones.After that, the dataset consisted of 19510 disease SAPs and 33701 polymorphism SAPs.We extracted 5000 SAPs randomly according to the proportion of disease and polymorphism ones for the prediction of SAPs.In other words, the final dataset contained 1833 disease SAPs and 3167 neutral SAPs.Hu's study [8] and Ye's study [11] used the data selected from Swiss-Prot, the structural information of which was available.By contrast, we use the data selected randomly from the database were more objective and more reliable.
We compared our method with Hu's method that predicted SAP's disease association using sequence-derived information.Hu et al. used a greedy approach to select features useful for the classification of SAPs and 10 features were selected.Using the 10 features, a decision tree method can achieve a high accuracy.
First, we applied Hu's method on our dataset using the 10 features and decision tree method.In the recent study, we added 120 features to the previous 10 features, so the new feature set was obtained.Next, we used a decision tree which was used in Hu's method to identify deleterious SAPs with the new feature set, and this method was called NF_DecisionTree.Then the decision tree was replaced with SVM that used grid search to tune parameters, and this method was called NF_GridSVM.At last, we applied the UMDA based method to optimize the prediction performance, and this method was called NF_UMDA.Table 1 compared the results of above methods.
On the same dataset, Hu's method achieved 68.06% accuracy and 0.2783MCC, while NF_DecisionTree achieve 73.9% accuracy and 0.4495 MCC.So we can see that using the same classifier, the added features in the recent study contributed to the improvement in prediction performance.
As was shown in Table 1, NF_GridSVM achieved much higher accuracy and MCC than NF_DecisionTree.The reason may be that, compared with decision tree, RBF kernel SVM can achieve good generalization and thus is suitable for the data which may contain irrelevant features.
Moreover, it's observed that NF_UMDA can achieve considerably better prediction performance than NF_ GridSVM.This can be explained by the fact that the UMDA-based method is able to select features that are more relevant and classifier parameters that are more appropriate, and then obtain a better classification.
Hu et al. also compared their method with SIFT [4] that was a popular sequence-based method to predict whether a SAP is deleterious or neutral.SIFT achieved

Conclusion
We explored sequence features deeply and introduced UMDA into the current study to search the solution that includes a feature subset and a set of SVM parameters maximizing the prediction performance on our dataset.The experimental result showed that the UMDA-based search strategy considerably improved the prediction performance.Our method used only information derived from protein sequence, so the method can be applied to predict all the SAPs.The performance of the proposed method is higher than Hu's method and SIFT.Further research will introduce more features and try other evolutionary algorithms to improve the prediction accuracy of SAP's disease association.

Figure 1 .
Figure 1.The binary coding chromosome.bit with value "1" indicates the feature is selected and "0" indicates not. 1 c v ~c n c v , 1 v γ ~n v γ γ denote the value of parameter C and γ , respectively.f n , c n and n γ are the number of bits indicating the features, parameters C and γ .c n and n γ are chosen according to the calcu- lation precision required.In our study, 130 f n = , 15 c n = and 15 n γ = .The procedure of UMDA-based method is given as follows: S1: Generate M individuals randomly.S2: Evaluate the fitness value (ACC) of each individual.S3: Sort the individuals according to their fitness values from high to low.S4: Select N ( N M ≤ ) individuals with higher fitness value from population.S5: Estimate the probability distribution of the tth iteration ( ) t p x by the selected individuals.

Table 1 . Prediction performance of various methods.
0.33 MCC using all SAPs from the Swiss-Prot database, while our method achieved an increase of 0.164 in MCC over SIFT. only