PFP-RFSM: Protein fold prediction by using random forests and sequence motifs

Protein tertiary structure is indispensable in revealing the biological functions of proteins. De novo prediction of protein tertiary structure depends on protein fold recognition. This study proposes a novel method for predicting protein fold types that takes the primary sequence as input. The proposed method, PFP-RFSM, employs a random forest classifier and a comprehensive feature representation that includes both sequence and predicted structure descriptors. In particular, we propose a method for generating features based on sequence motifs; such features are employed here for the first time in protein fold prediction. PFP-RFSM and ten representative protein fold predictors are validated on a benchmark dataset consisting of 27 fold types. Experiments demonstrate that PFP-RFSM outperforms all existing protein fold predictors and improves the success rates by 2% to 14%. The results suggest that sequence motifs are effective in the classification and analysis of protein sequences.


INTRODUCTION
Protein structures are indispensable for revealing the regularities associated with protein functions, interactions and the cell cycle [1][2][3]. Beyond their biological context, known protein structures are frequently used to model protein structures that have not been solved experimentally. Information about protein structure is also crucially important for structure-based drug development, as elaborated in a comprehensive review [4]. Due to the difficulties in protein extraction, purification and crystallization, the number of known protein structures is negligible compared with the number of known protein sequences. As of May 2013, the Protein Data Bank [5] includes 83,695 protein structures while the RefSeq database [6] includes 31,593,499 non-redundant protein sequences. The structures of 31,509,804 protein sequences are therefore not experimentally solved and need to be studied through computational methods. The wide and growing gap between known protein sequences and known protein structures with annotated biological functions motivates the development of in-silico methods for protein sequence analysis, protein tertiary structure prediction and protein function annotation.
In-silico study of protein structures can be categorized into two classes: template-based methods and de novo methods. A template-based method is, in essence, an algorithm that identifies templates, i.e., solved protein structures, for a query protein sequence. Both homology modeling [7] and threading [8] are template-based methods and are successful in protein tertiary structure prediction. The difference is that homology modeling identifies templates that are closely related to the query sequence, while threading is capable of recognizing templates that are only remotely related to it. The de novo methods focus on the classification of protein structures. Currently, protein structure classification is largely performed manually. Two hierarchical protein structure classification systems, the SCOP (structural classification of proteins) database [9] and the CATH Protein Structure Classification database [10], were established during the last two decades. However, SCOP and CATH only classify protein domains with known structures and cannot classify proteins that lack tertiary structures. The first level of the SCOP and CATH hierarchies is the protein structural class, which is further divided into a number of folds. Protein folds are the second level of the hierarchy and are the classification targets in our study. A number of algorithms have been proposed for detecting structural similarity between sequences that have low sequence similarity [11,12]. In general, prediction of the fold type of a protein sequence proceeds in two steps: first, the protein sequences are converted into a common feature space, in other words, each sequence is represented by the same number of features; second, a computational model is built that takes the features as inputs and predicts the protein fold types.
Historically, the first model for prediction of protein folds was proposed by Ding and colleagues [13]. They represented the protein sequence by a number of sequence and structural descriptors, e.g., the composition vector and secondary structure information. The authors implemented two machine learning algorithms, neural networks and support vector machines, for classification. Several other methods were proposed subsequently [14][15][16][17][18][19][20][21]; these methods implemented more sophisticated classification architectures while employing a sequence representation similar to that of Ding's study [13]. In a study by Chen and Kurgan, the predicted secondary structure was used for the first time to generate the feature space, and it provided higher success rates in the recognition of protein folds [22].
In this study, we aim to develop a novel fold classification method that improves on known fold recognition methods. The proposed method utilizes a random forest classifier [23] and employs an extensive set of features, which incorporates sequence-based features, i.e., the composition vectors; predicted structure descriptors, i.e., the secondary structure information; and features based on BLAST. We also designed a method for calculating features based on sequence motifs, which are utilized here for the first time in protein fold classification. According to a recent comprehensive review [24], and as demonstrated by a series of recent publications [25][26][27][28][29], establishing a really useful statistical predictor for a protein system requires the following procedures: 1) construct or select a valid benchmark dataset to train and test the predictor; 2) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; 3) introduce or develop a powerful algorithm (or engine) to operate the prediction; 4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; 5) establish a user-friendly web-server for the predictor that is accessible to the public. Below, we describe how we deal with these steps.

Feature-Based Representation
This study utilizes both sequence and predicted structure descriptors as inputs. The sequence representation includes a comprehensive list of features that were previously used for prediction of protein structural class [11,33,34], protein fold types [17], protein folding rates [35] and protein sub-cellular locations [36]. As suggested by Chou, the feature vector of a protein sequence can be seen as a general form of pseudo amino acid composition [37], which can be formulated as
$$\mathbf{P} = [\psi_1\ \psi_2\ \cdots\ \psi_u\ \cdots\ \psi_\Omega]^{\mathrm{T}} \qquad (1)$$
where T is the transpose operator, the components $\psi_1$, $\psi_2$, … depend on how the desired information is extracted from the statistical samples, and Ω is an integer standing for the dimension of the feature vector P. In our study, we generate 7 sets of features, namely the composition vector of amino acids, secondary structure contents, predicted relative solvent accessibility, predicted dihedral angles, features based on the PSSM matrix, features based on nearest neighbour sequences and features based on sequence motifs, which are denoted by $\psi_1$, $\psi_2$, …, $\psi_7$ respectively. The definitions of the 7 sets of features are given below:
- Composition Vector of Amino Acids is calculated directly from the primary sequence. The composition vector contains 20 values and each value stands for the percentage of a certain amino acid in a given sequence [38][39][40].

- Secondary Structure Contents are generated by PSIPRED [30]. The PSIPRED program generates a 3-state secondary structure for each residue of the sequence. Subsequently, we calculate the contents of the 3 secondary structure states, in the same way as the composition vector is calculated.
- Predicted Relative Solvent Accessibility is generated by Real-SPINE3 [31]. We use the real values, which quantify the fraction of the surface area of a given residue that is accessible to the solvent, for the residues in the window. The average of the relative solvent accessibility over all residues is used to represent the relative solvent accessibility of a sequence.
- Predicted Dihedral Angles are generated by Real-SPINE3 [31]. We utilize two real values, which represent the phi (involving the backbone atoms C'-N-Cα-C') and psi (involving the backbone atoms N-Cα-C'-N) angles. Similarly, the phi and psi angles are averaged over the entire sequence.
- Features Based on the PSSM Matrix are generated by PSI-Blast [32]. PSI-Blast provides two position-specific scoring matrices; one contains the conservation scores of a given AA at a given position in a sequence and the other provides the probability of occurrence of a given AA at a given position in the sequence. The matrix values are aggregated either horizontally or vertically to obtain a fixed-length feature vector. The details of the calculation of this set of features are given in [46].
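The composition-style and sequence-averaged features described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the 3-state alphabet `HEC` and the toy per-residue values are assumptions, and compositions are returned as fractions rather than percentages. In practice the secondary structure string and the per-residue values would come from PSIPRED and Real-SPINE3 output.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SS_STATES = "HEC"  # helix, strand, coil (assumed 3-state alphabet)

def composition(symbols, alphabet):
    """Return the fraction of each alphabet symbol in `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    return [counts[a] / total if total else 0.0 for a in alphabet]

def sequence_average(values):
    """Average a per-residue real value over the whole sequence."""
    return sum(values) / len(values) if values else 0.0

# Toy inputs (hypothetical values, for illustration only).
seq = "MKVLAAGIVK"
ss = "CHHHHHEEEC"
rsa = [0.8, 0.6, 0.3, 0.1, 0.2, 0.1, 0.4, 0.3, 0.5, 0.9]
phi = [-60.0, -62.0, -58.0, -65.0, -61.0, -63.0, -120.0, -118.0, -125.0, -70.0]
psi = [-45.0, -42.0, -47.0, -40.0, -44.0, -43.0, 130.0, 128.0, 135.0, 150.0]

aa_comp = composition(seq, AMINO_ACIDS)   # 20 values
ss_comp = composition(ss, SS_STATES)      # 3 values
features = aa_comp + ss_comp + [
    sequence_average(rsa),
    sequence_average(phi),
    sequence_average(psi),
]
print(len(features))
```

Every sequence, whatever its length, is mapped to the same number of values, which is what allows one classifier to be trained on all sequences.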

Random Forest Classifier
We validate the predictive quality of 6 representative classifiers: random forest [23], support vector machine (SVM) [42], the Kstar algorithm [43], nearest neighbour (IB1) [44], Naïve Bayes [45] and multiple logistic regression. The random forest classifier is employed by PFP-RFSM because it outperforms the remaining classifiers; the detailed results are given in the following section.
Random forest is an ensemble learning method that generates a multitude of decision trees. The method has 2 parameters: the number of selected features, denoted by n, and the number of constructed trees, denoted by k. The method generally includes 4 steps. Firstly, we randomly select n features from the full feature set. Secondly, we perform the bagging algorithm on the training set and generate a training set with re-sampled instances. Thirdly, we apply a decision tree algorithm to the re-sampled training set and the randomly selected feature space, and build a decision tree, which serves as a base classifier in Step 4. Steps 1, 2 and 3 are repeated k times to generate k decision trees. Lastly, we combine the k decision trees and generate the final predictions. The architecture of the random forest algorithm is given in Figure 1.
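The four steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used by PFP-RFSM: the small Gini-based tree learner, the depth limit, majority voting as the way of combining trees, and all function names and toy data are hypothetical choices made for the example. Class labels are assumed not to be tuples, because tuples are used to mark internal tree nodes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, features, depth=0, max_depth=5):
    """Grow a small threshold tree restricted to the given feature indices."""
    if depth == max_depth or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]  # leaf: majority label
    best = None
    for f in features:
        # Every distinct value except the largest is a candidate threshold,
        # so both sides of the split are guaranteed to be non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # all selected features are constant: no split possible
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       features, depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       features, depth + 1, max_depth))

def predict_tree(tree, row):
    """Follow the thresholds down to a leaf label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def train_forest(X, y, n, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        features = rng.sample(range(len(X[0])), n)   # Step 1: n random features
        idx = [rng.randrange(len(X)) for _ in X]     # Step 2: bootstrap re-sample
        forest.append(build_tree([X[i] for i in idx],
                                 [y[i] for i in idx], features))  # Step 3
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)   # Step 4
    return votes.most_common(1)[0][0]

# Toy data with two well-separated classes (hypothetical values).
X = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.1], [0.2, 0.2, 0.0], [0.1, 0.2, 0.1],
     [0.9, 1.0, 0.8], [1.0, 0.9, 1.0], [0.8, 0.8, 0.9], [0.9, 0.8, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y, n=2, k=25)
print(predict_forest(forest, [0.0, 0.0, 0.0]), predict_forest(forest, [1.0, 1.0, 1.0]))
```

The sketch draws one feature subset per tree, as the text describes. Many library implementations of random forest instead draw a fresh feature subset at every split, so results from such libraries need not match this variant.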

Evaluation Criteria
The predicted results were assessed using several measures, including the success rate and the Matthews correlation coefficient (MCC) for each class. The two measures are frequently used in previous studies on protein fold prediction [13][14][15][16][20][21]. In this study, we utilize the same measures for evaluation; they are defined in Equations (2) and (3), where TP, TN, FP and FN stand for true positives, true negatives, false positives and false negatives respectively.
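As an illustration of how the two measures can be computed per fold from TP, TN, FP and FN, consider the sketch below. It rests on assumptions that should be checked against Equations (2) and (3): the per-fold success rate is taken to be TP / (TP + FN), i.e., the fraction of a fold's sequences that are predicted correctly, and MCC is taken in its standard form. The function names and toy labels are hypothetical.

```python
import math

def per_class_counts(y_true, y_pred, label):
    """Return (TP, TN, FP, FN) for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, tn, fp, fn

def success_rate(tp, fn):
    """Assumed definition: correctly predicted fraction of the fold's sequences."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def mcc(tp, tn, fp, fn):
    """Standard Matthews correlation coefficient; 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example with three folds labelled 0, 1 and 2 (hypothetical labels).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
for fold in sorted(set(y_true)):
    tp, tn, fp, fn = per_class_counts(y_true, y_pred, fold)
    print(fold, round(success_rate(tp, fn), 3), round(mcc(tp, tn, fp, fn), 3))
```

Unlike the success rate, MCC also takes true negatives and false positives into account, so a classifier that over-predicts a fold is penalised even when every member of that fold is recovered.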

Comparison between Random Forest and Other Machine Learning Classifiers
We first validate the performance of the random forest classifier by comparing it, on the same feature representation, with a variety of machine learning classifiers: support vector machine (SVM), the Kstar algorithm, Nearest Neighbour (IB1), Naïve Bayes and Multiple Logistic Regression. The success rates and MCC of the 6 representative classifiers are shown in Tables 1 and 2 respectively. Random Forest (with 300 trees and 60 features) gives the highest success rate among the six classifiers, i.e., 73.7%, whereas the runner-up classifier, Naïve Bayes, achieves an average success rate of 71.4% over the 27 folds. We note that the success rates of the remaining classifiers are all below 70%. A similar trend is observed for MCC (see Table 2). Random forest achieves the highest MCC, i.e., 0.746, followed by the Naïve Bayes classifier, which outperforms the remaining 4 classifiers. Among the 27 folds, random forest achieves the highest success rate in 16 folds and the highest MCC in 15 folds. Overall, the random forest classifier is more accurate in the prediction of protein folds than the remaining classification methods.
The overall success rate and the success rates in each fold are given in Table 3. The PFP-RFSM predictor achieves an overall success rate of 73.7% for the 27 folds, which is 2% to 17.7% higher than the existing predictors. Among the 27 folds, PFP-RFSM achieves the highest success rate in 12 folds, while the runner-up methods, PFP-FunDSeqE and MarFold, obtain the highest success rate in 10 and 8 folds respectively. In the literature, the MCC index is reported only for the PFP-FunDSeqE method [19]. Therefore, the PFP-RFSM method can only be compared with PFP-FunDSeqE on the MCC index.

CONCLUSION
This study proposes a novel method, PFP-RFSM, that takes the primary sequence as input and aims at the prediction of protein fold types. The PFP-RFSM method employs a random forest classifier and a comprehensive feature representation. In particular, the features based on sequence motifs are proposed here for the first time in protein sequence classification, and the random forest classifier is utilized here for the first time in protein fold prediction. PFP-RFSM is compared with 10 representative methods (SVM [12], HKNN [13], DIMLP [14], SE [15], PFP [16], PFRES [20], ALH [17], ALHK [18], MarFold [18] and PFP-FunDSeqE [19]) on a benchmark dataset consisting of 27 folds. Extensive experiments demonstrate that PFP-RFSM outperforms all known methods and that predictions by PFP-RFSM are complementary to predictions generated by existing methods. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors [47,48], we shall make efforts in our future work to provide a web-server for the method presented in this paper.
- Features Based on Nearest Neighbor Sequences are generated by Blast [32]. For a test sequence, Blast first identifies a number of neighbor sequences, i.e., the sequences that have the lowest p-values when aligned pairwise to the test sequence. In other words, the identified neighboring sequences have a higher probability of being homologous to the test sequence. For each test sequence, the top 5 neighboring sequences in the training set are identified and a vector of n values is used to represent each neighboring sequence, where n stands for the number of fold types, i.e., n = 27 in this article. If the neighboring sequence belongs to fold type i, then the i-th value of the vector is assigned the p-value and the remaining values are set to 0. In total, this set of features includes 27 * 5 = 135 features.
- Features Based on Sequence Motifs are generated by the GLAM2 program [41]. Generation of sequence motifs includes 2 steps and is performed on the training set. Firstly, the training set is divided into 27 subsets based on the fold types, meaning that sequences in the same subset belong to the same fold type. For each subset, we run the GLAM2 program and identify the three sequence motifs with the lowest p-values. Therefore, we generate 27 * 3 = 81 motifs in total. Secondly, we calculate the similarity between a test sequence and the 81 motifs. We use the 81 similarity scores as input features for classification.
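The encoding of the nearest-neighbor features can be sketched as follows. This is a hedged illustration: the function name, the input format (a list of `(fold_index, p_value)` pairs already sorted by increasing p-value), the 0-based fold indices and the zero-padding used when fewer than 5 neighbors are found are all assumptions of the example, and parsing of the Blast output is left out.

```python
def neighbor_features(neighbors, n_folds=27, top=5):
    """Encode the top neighbors as `top` blocks of `n_folds` values each.

    In each block, the position of the neighbor's fold holds the neighbor's
    p-value and every other position is 0.
    """
    features = []
    for rank in range(top):
        block = [0.0] * n_folds
        if rank < len(neighbors):  # assumed: zero block when no neighbor is available
            fold, p_value = neighbors[rank]
            block[fold] = p_value
        features.extend(block)
    return features

# Toy hits for one test sequence (hypothetical folds and p-values).
hits = [(3, 1e-30), (3, 2e-12), (10, 5e-4)]
vec = neighbor_features(hits)
print(len(vec))
```

With 27 folds and 5 neighbors the vector always has 27 * 5 = 135 entries, of which at most 5 are non-zero.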

Table 1. Success rates of random forest and 5 other machine learning classifiers. The best results for each fold are shown in bold.

Table 4 lists the MCC values for the 27

Table 2. Matthews correlation coefficients (MCC) calculated for random forest and 5 other machine learning classifiers.

Table 3. Comparison of success rates between PFP-RFSM and 10 representative protein fold predictors.