Improved Protein Phosphorylation Site Prediction by a New Combination of Feature Set and Feature Selection

Phosphorylation of protein is an important post-translational modification that enables activation of various enzymes and receptors included in signaling pathways. To reduce the cost of identifying phosphorylation site by laborious experiments, computational prediction of it has been actively studied. In this study, by adopting a new set of features and applying feature selection by Random Forest with grid search before training by Support Vector Machine, our method achieved better or comparable performance of phosphorylation site prediction for two different data sets.


INTRODUCTION
Phosphorylation is one of the most important post-translational modifications (PTMs) in eukaryotes.This occurs when a phosphate group is added to a protein by a kinase.The addition of phosphate group usually happens to Serine (S), Threonine, and Tyrosine (Y) [1].It is also the common PTM, which occur in eukaryotic cell [2].Between 30% and 50% proteins of eukaryotic cell undergo phosphorylation [3].
In the past, experimental approach such as mass spectrometry (MS/MS) [4] has been commonly used to identify phosphorylation sites.However, implementing this approach has several disadvantages.Conducting experiment to predict phosphorylation sites is considered expensive and requires intensive labor.
In addition, it also requires adequate technique, skill, and specific equipment.
Instead, computational approach (in silico) is becoming more common because of new computer technology development.These days, computers can process large number of data in a short time.This makes prediction of phosphorylation sites using computational approach becoming more popular.Trost and Kusalik provided summarization of various technique approach, data, and tools that can be applied to predict phosphorylation site using computational approach [5].
In general, phosphorylation site prediction can be classified into two approaches: kinase-specific phosphorylation site prediction and non-kinase-specific (general) phosphorylation site prediction.Kinase-specific prediction approach requires both protein sequence and the type of kinase for phosphorylation to conduct prediction.The other approach is non-kinase-specific, which only requires protein sequence.Xue and Trost provided comparisons of these two approaches [5,6].The main disadvantage of kinase-specific approach is that the publicly available information about the type of kinase is limited, especially for human kinase [7].Therefore, non-kinase-specific approach is more popular to predict phosphorylation site [8].
There are various methods proposed for the prediction.For example, Blom used neural network (NN) approach to predict eukaryotic protein phosphorylation site based on sequence and structure of proteins [9].Kim proposed a prediction method using support vector machine (SVM) [10].
In this paper, we propose a new prediction method for non-kinase-specific phosphorylation site.By adopting a new combination of a classifier, features, and feature selection algorithm, we improved the performance of prediction.We measured the result of feature selection and classification and compared it with existing methods.We also tested our method with an independent data set and analysis of the classification result.

Data Sets
In this work, we followed the preparation step done by Ismail [11].The dataset we use were downloaded from the Phospho SVM website [12].The phosphorylation site data set is P.ELM version 9 [13].The data set contains phosphorylated sequences at the position of Serine (S), Threonine (T), and Tyrosine (Y).These sequences were also checked for redundancy and sequences that had similarity more than 30% were removed.Table 1 shows the number of sequences and number of sites for Serine, Threonine, and Tyrosine, respectively.
We generated fixed-length protein sequences using window size 9, which have phospholyratable residues (Serine, Threonine, or Tyrosine) at the center of them.If the center residue of the sequence is known as phosphorylated, the sequence is "positive", otherwise "negative".For positive and negative sequences, redundant ones were removed using skipredundant [14].The parameters for redundancy removal are as follows: acceptable threshold percentage of similarity was set to 0% -20%, value for gap opening penalty to 10, and gap extension penalty to 0.5.Table 2 shows the number of positive and negative sequences before and after removing redundant sequences for each residue.
The number of negative sequences after redundancy removal for Serine, Threonine, and Tyrosine residues are: 4771, 3343, and 898, respectively.We then randomly selected negative sequences for each residue with the same number of negative sequences in the work of Ismail.
Using the same window size and method, we also generated sequences from PPA data set downloaded from the Phospho SVM website for conducting performance evaluation by independent data set.PPA is a database for providing information of phosphorylation sites in Arabidopsis and a predictor for plant-specific phosphorylation site [15].After removal of redundant sequences, we randomly selected positive and negative sequences based on the work of Ismail.Table 3 shows number of positive and negative phosphorylation sites for each amino acid.In order to make the data set well-balanced, the numbers of positive and negative sequences were set to equal.

Feature Extraction
Using the fixed-length sequences, we conducted feature extraction to represent them as vectors of numerical values.We used three different programs to extract features: PROFEAT (2016), PSI-BLAST, and protr.
PROFEAT (2016) is a web server for extracting features from protein sequences [16].We used it to generate the following feaures: Amino Acid Composition, Dipeptide Composition, Normalized Moreau-Broto Autocorrelation Descriptor, Moran Autocorrelation Descriptor, Moran Autocorrelation Descriptor, Geary Autocorrelation Descriptor, Composition, Transition, Distribution Descriptor, Amphiphilic Pseudo-Amino Acid Composition, and Total Amino Acid Properties.For Position Specific Scoring Matrix (PSSM) features, we used PSI-BLAST [17].In addition, an R package called protrwas used to produce the following features: BLOSUM and PAM Matrices for the 20 Amino Acid, Amino Acid Properties Based Scales Descriptor (Protein Fingerprint), Scales-based Descriptor derived by Principal Components Analysis, Scales-based Descriptor derived by Multidimensional Scaling, Conjoint Triad Descriptors, and Sequence-Order-Coupling Number [18].Details of these features are described below.Except three features (CTD, SOCN, QSO), most of the features are not used in Ismail's work.
 Amino Acid Composition (AAC) Amino Acid Composition is defined as the fraction of each amino acid in a protein sequence [19].For all 20 amino acids, the fraction is calculated using this equation.
total of number of amino acid type fraction of total number of amino acid in protein sequence where i is a spesific type of amino acid  Dipeptide Composition (DPC) Dipeptide Composition generates 400 fixed-length numeric information based on the input protein sequences.It encapsulates information about the fraction of amino acid as well as their local order.It is calculated using Equation ( 2): total of number of fraction of total number of all posible dipeptide where dep(i) is one dipeptide i of 400 dipeptides.
When we usePROFEAT, the value of nlag should be lower than the size of the sequence.Since the window size is 9, we setnlag = 8.

 Geary Autocorrelation Descriptors (GEARY)
Geary Autocorrelation can be defined using Equation ( 6): In the use of PROFEAT, we setnlag = 8.  Composition, Transition, Distribution (CTD) Composition, Transition, Distribution represent the amino acid distribution patterns of a certain structural or physicochemical property from a protein sequences.These features are calculated as follows: the protein sequence is transformed into a sequence of a specific physicochemical or structural properties of residue.Twenty amino acids are divided into three groups [20,21].
Composition (C), Transition (T), and Distribution (D) are then calculated for a given attribute to describe the global percent composition if the three groups of amino acids in a protein, the percent frequencies with which the attribute changes its index along the entire length of the protein, and the distribution pattern of attribute along the sequence, respectively.
 Sequence-Order-Coupling Number (SOCN) Sequence-Order-Coupling Number can be used to represent amino acid distribution pattern of a specific physicochemical property along a protein sequence.The dth rank of sequence-order-coupling num-J.Biomedical Science and Engineering ber can be calculated using Equation ( 7): ( ) 3, ,30 where d i,i+d is the distance between two amino acid at position i and i + d.In the use of protr, we also setnlag = 8.
 Quasi-Sequence-Order Descriptors (QSO) Quasi-Sequence-Order Descriptors can be calculated using Sequence-Order-Coupling Number.For each amino acid type, the type-1 Quasi-Sequence-Order Descriptors is calculated using Equation (8): where f r is the normalized occurrence of amino acid type i and w is the weighting factor, w = 0.1.The type-2 Quasi-Sequence-Order Descriptors is calculated using Equation ( 9): In the use of PROFEAT, we setnlag = 8.  Amphiphilic Pseudo-Amino Acid Composition(APAAC) Before we calculate Amphiphilic Pseudo-Amino Acid Composition, we must define Pseudo-Amino Acid Composition (PAAC) [22].First, three variables are generated from the original hydrophobicity values ( ) H i , and side chain masses Then, a correlation function can generated as: From which, sequence order-correlated factors are defined as: where λ is parameter.Let f i be the normalized frequency of 20 amino acids in the protein sequence, a set of 20 + λ descriptors called the PAAC can be defined using Equation ( 15): , when 1 20 , when 20 1 20 where w = 0.05.From Equation (10) and Equation ( 11), the hydrophobicity and hydrophilicity correlation can be define as: , ; , Then, sequence order factor can be define using Equation ( 17): , where 2 Finally, APAAC can be calculated using Equation ( 18): , when 1 20 , when 20 1 20 In the use of PROFEAT, we set weight factor = 0.05 and lamda = 8.  Total Amino Acid Properties (AAP) Total Amino Acid Properties for a specific physicochemical property i is defined using Equation ( 19): ( ) where i j norm P represents the property i of amino acid R j that is normalized between 0 and 1.N is the length of the protein sequence.

Feature Selection
Random Forest was introduced by Breiman [23].Random Forest method works as a collection of large number of decision trees randomly generated and not correlated to each other.This method is widelyapplied to classification problems.
In Random Forest, Gini impurity index (GII) is used to measure feature importance.GII represents how often randomly chosen element from the data set would be classified incorrectly if it was randomly classified based on the distribution of classes in the subset.We use Gini Index to rank important features that can be used for the classification algorithm.
In [11], Ismail also attempted the same feature selection and top 100 features were selected.In contrast, we conducted grid search to find the best set of selected features.

Classification
Vapnik [24] proposed support vector machine (SVM) as a classification method.It is a popular classifier widely applied to various problems including phosphorylation site prediction.SVM produces an optimal hyperplane separation between the classes.Here, optimal means finding the maximum margin around the separating hyperplane.In this work, we adopted Gaussian (also known as radial basis function) kernel for SVM.

Evaluation
10-fold cross-validation was repeated 10 times to measure the average performance of the P.ELM data.To measure the performance for the PPA data set, which is used for the independent data set, leave-one-out cross-validation (LOOCV) was conducted.
The metricsused to measure the classification performance are: Accuracy, Sensitivity, Specificity, F1 score, and Matthew's Correlation Coefficient (MCC).These metrics are defined in the following equations: TP TN Accuracy TP TN FP FN where TP, TN, FP, and FN are the abbreviation for true positive, true negative, false positive, and false negative.In this work, Area under the ROC curve (AUC) is also measured.

Feature Selection
Gini impurity index (GII) in Random Forest was used to measure the importance of the features.For P.ELM data set, we conducted 10-fold cross validation repeated 10 times.For each fold in each iteration, we generate a list of important features based on the training data.Then we average the value GII of each features from the 100 list features we generated before.To give insight of which features that effects the classification, we listed the top twenty features for each residue in Figure 1.Composition, Transition, and Distribution (CTD) dominates the features for Serine (60% of all 20 features), and followed by Position Specific Scoring Matrix (PSSM) and Quasi-Sequence-Order (QSO), 15% and 10%, respectively.Although not as large as Serine, CTD still dominates the features for Threonine residue (35% of all 20 features).Furthermore, as the same as Serine, PSSM is the second dominate feature by 20%.In addition, the third dominant features is QSO by 10%.In contrast, for Tyrosine, MDSSCALES dominates the list by 30% from all 20 features, followed by CTD and QSO.
From Figure 1, we can assume that CTD group features plays an important role, to predict phosphorylation site for Serine and Threonine.However, for Tyrosine it is not a top important feature.It suggests the specialty of Tyrosine in protein phosphorylation in comparison with Serine and Threonine.

Classification Result
The performance of our proposed method is shown in Table 4.In general, we can see that there is an improvement if we implement feature selection before conducting class prediction.Without feature selection (i.e. using all the 2292 features), the prediction performances were quite low in all six metrics.
By implementing feature selection with grid search for finding the best set of features, performances were improved greatly.For instance, using only averagely top 271 important features, Serine increased its accuracy and having the highest accuracy 96.46%, followed by Threonine 92.22% by using averagely top 224 important features.For Tyrosine, by using averagely top 1635important features, achieved its best performance, which is 80.19%.Based on the comparison before using and after using feature selection, Threonine has the largest percentage of increase accuracy, which is 40.43%,followed by Serine 34.46%, and Tyrosine 24.65%.
Since feature selection decreased the performance in Ismail's work, it is an important finding that under an appropriate combination of classifier and features, feature selection could improve the performance of protein phosphorylation site prediction.
In this work, we also compared the result of our method with existing methods for predicting phosphorylation site.The compared methods are as follows: Netphos [9], Netphos K [25], GPS 2.1 [26], Swaminathan, NetPhos [9], PPRED [8], Musite [27], Phospho SVM [12], and RF-Phos [11].Table 5 shows the performance comparison between our method with other methods.For Serine and Threonine, our method achieved highest AUC, sensitivity, and MCC.However, specificity using Threonine data set is lower than the result of RF-Phos.On the other hand, our method using Tyrosine data set achieved a lower AUC, specificity, and MCC, in comparison with the result of RF-Phos.In addition, only sensitivity achieved the highest score.

PPA Data Set
Using the PPA data set as an independent data set, we also conducted feature selection by Random Forest and classification by SVM.

Feature Selection
We used the same features importance value as the P.ELM data set, that is Gini impurity index from Random Forest.Using PPA data set, we conducted leave-one-out cross validation.For each fold, we generate a list for important featured based on the training data.The number of feature list for each residue equals the number of samples in the data set.We measure the average GII value for each features from all list feature.The top twenty important features are shown in Figure 2. CTD dominates the top twenty features for Serine by 75% of all top twenty features, then followed by QSO only 10%.For Threonine, MDSSCALES dominates the top twenty features by 30%, second place CDT and QSO appear 20%.For Tyrosine, SCALES features dominates by 50%, followed by MDSSCALES by 20% of the top twenty features.

Classification Result
In general, as it is shown in Table 6, we can see that without feature selection, for all three data set, the accuracy is lower than 60%.However, there is an improvement if we implement feature selection before conducting class prediction.Threonine has the highest accuracy 91.18% using 772 features, then followed by Serine, using 1316 feature achieving 87.66% accuracy.Tyrosine, using 160 features, has the lowest accuracy in comparison with the other two data sets, achieving 57.84%.
If we compare the increase of performance between not using feature selection and feature selection, Threonine achieved63.18%increase of accuracy, followed by Serine 49.92% increase.Serine has the lowest increase of accuracy which is 59.65%.
We also compared our classification result with the ones in other researches.The method we comparedare: Netphos K, GPS 2.1, NetPhos, PHOSPHER, Musite, Phospho SVM, and RF-Phos.In Table 7, we can see that our method has a higher performance in sensitivity, specificity, and MCC for Serine and Threonine residue.For Tyrosine, our method could not outperform other results from previous work.
and the maximum values of the original amino acid property i, respectively. Position Specific Scoring Matrix (PSSM) PSSM features were generated using PSI-BLAST against a local database generated from the phosphorylation data set. BLOSUM and PAM Matrices for the 20 Amino Acid (BLOSUM) In the use of protr, we setk = 5, lag = 3, and Matrix type = AABLOSUM45. Amino Acid Properties Based Scales Descriptor (Protein Fingerprint) (ProtFP) In the use of protr, we set pc = 5, lag = 5, index vector for Amino Acid Index = (160:165, 258:296). Scales-based Descriptor derived by Principal Components Analysis (SCALES) In the use of protr, we set pc = 7, lag = 5, properties matrix = AA index (7:26). Scales-based Descriptor derived by Multidimensional Scaling (MDDSCALES) In the use of protr, we set lag = 8.J. Biomedical Science and Engineering Conjoint Triad Descriptors (CTriad)[22]

Table 1 .
P.ELM data set of phosphorylation site from PhosphoSVM website.

Table 2 .
Number of sequences before and after removing redundant sequence for window size = 9.

Table 3 .
PPA data set as the independent data set.

Table 4 .
Performance of Classification using all of the features (2292 features) and best result of features selection.

Table 5 .
Comparsion of performance several methods to predict phosphorylation site for residue: Serine, Threonine, and Tyrosine.

Table 6 .
Performance of Classification using all of the features (2292 features) and best result of features selection using the independent data set.

Table 7 .
Comparsion of performance several methods to predict phosphorylation site using the independent data set for residue: Serine, Threonine, and Tyrosine.