A Novel Prediction Method of Protein Structural Classes Based on Protein Super-Secondary Structure

At present, feature extraction from protein sequences is the most basic issue in predicting protein structural classes and is also the key factor determining prediction quality. To predict protein structural classes accurately, we construct a 14-dimensional feature vector based on protein secondary and super-secondary structure information to reflect the content and spatial ordering of the given protein sequences. Within this vector, seven features describing α-helix bundles, β-hairpin motifs, Rossmann folds, αβ-plaits and other super-secondary structure information are proposed for the first time in this paper. Experiments show that our method improves the overall accuracy on the low-similarity datasets 1189 and 640 by 0.9%-3.8% and 0.5%-4.2%, respectively, compared with other methods, and has a competitive advantage for predicting proteins in the α/β and α+β classes.


Introduction
Modern molecular biological studies indicate that the function of a protein is determined by its spatial structure. Therefore, it is very important to predict the structural class of a newly discovered protein accurately [1]. The types and order of the 20 amino acids are the basic information in a protein sequence, so many early prediction methods used features based on amino acid composition (AAC) and position information, which are the simplest and most intuitive [2] [3] [4] [5]. However, these methods perform well on protein sequences with a high degree of similarity and poorly on protein sequences with similarity below 40%. Because low-similarity protein sequences generally have highly similar secondary structural contents and spatial arrangements, many researchers have extracted features from the protein secondary structure predicted by PSI-PRED [6]. Following this idea, the SCPRED model [2] and the MODAS model [7] were constructed. More recent computational prediction methods [8] [9] build feature vectors from protein secondary structure information and assign protein sequences to four classes (All-α, All-β, α/β, α+β) using a support vector machine (SVM) classifier. The overall prediction accuracy of these methods on several datasets reaches 80%-90%, but the prediction of the α/β and α+β classes is not ideal, especially for the α+β class, whose accuracy is only about 70%. To improve the prediction accuracy of the α/β and α+β classes, we extract seven different features reflecting the general contents and spatial arrangements of the secondary structural elements from the super-secondary structure of a given protein sequence. After the features are extracted, we use an SVM to predict protein structural classes. Finally, we evaluate our model objectively.

Feature Vector
Many methods are now available to predict, from an amino acid sequence, a secondary structural sequence (SSS) composed of three secondary structural elements: α-helix (H), β-strand (E) and random coil (C). First, we obtain the corresponding secondary structural sequences with PSI-PRED (version 2.6) [6]. It is difficult to distinguish the α/β and α+β classes because both contain α-helices and β-strands. α-helices and β-strands are usually interspersed in the α/β class, while they are usually segregated in the α+β class. To better represent the distribution of α-helices and β-strands, every segment of H, E and C in the secondary structure sequence (SSS) is replaced by α, β and ς, respectively; the new sequence is the secondary structure segment sequence (SS). All ς elements are then removed from SS to form a new sequence, denoted SSW [13].
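The conversion from SSS to SS and then to SSW can be sketched as follows. This is a minimal illustration of the procedure described above, not the authors' code: the function names are hypothetical, and the ASCII letters `a`, `b` and `c` are used as stand-ins for α, β and ς.

```python
from itertools import groupby

# ASCII stand-ins for the segment symbols: 'a' = α, 'b' = β, 'c' = ς (coil).
SEGMENT_SYMBOL = {"H": "a", "E": "b", "C": "c"}


def to_segment_sequence(sss):
    """Collapse each run of identical elements in SSS into one segment symbol (SS)."""
    return "".join(SEGMENT_SYMBOL[element] for element, _run in groupby(sss))


def drop_coils(ss):
    """Remove all coil segments from SS to obtain SSW."""
    return ss.replace("c", "")


sss = "CCHHHHCCEEEECEEECC"        # toy PSI-PRED-style output
ss = to_segment_sequence(sss)     # one symbol per segment
ssw = drop_coils(ss)              # helix and strand segments only
print(ss, ssw)
```

Because SS keeps one symbol per segment, two strands separated only by a coil become adjacent in SSW, which is what makes patterns such as αα or βαβαβ visible there.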
Our method maps each protein sequence to a 14-dimensional feature vector. A number of secondary structure elements interact with each other to form regular combinations of secondary structure, which act as structural members of the tertiary structure in many proteins and are known as super-secondary structures (motifs). Some super-secondary structures are related to specific functions, and there are three basic forms of combination: αα, βαβ and ββ. To capture the structural characteristics of each class, we extract several typical features of folds and combinations.
1) The simplest ββ structure is the β-hairpin motif, in which two β-strands are connected by a short loop. Multiple β-hairpin motifs together form stable and widespread β-turns, so extracting the number of β-hairpin motifs (denoted Con_βςβ) is very meaningful. The super-secondary structure αα is an α-helix bundle, often formed by two intertwined parallel or antiparallel α-helices. The structure βαβαβ, known as the Rossmann fold, is one of the most distinctive structures of the α/β class. We therefore also extract the numbers of αα (Con_αα) and βαβαβ (Con_βαβαβ) super-secondary structures as features. The corresponding features are then defined in terms of these counts.
2) The features mentioned above are represented in terms of Maxseg_α and Maxseg_β, the maximal lengths of the α segments and β segments in SSW.
5) Position information in SS is also a deciding factor. Here, the position of a segment is defined as the starting position of the segment. The corresponding features are defined through quantities ϕ, where P_Hj and P_Ej are the j-th order of H and E in SSS, and ConH and ConE are the numbers of H and E in the protein secondary structure sequence (SSS).
7) The content of C can be ignored because the three probabilities of H, E and C sum to 1 [1]. Hence, the two remaining features, the 13th and 14th, are the probabilities of H and E.
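The raw quantities behind several of these features can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, `a`, `b` and `c` stand in for α, β and ς, motif occurrences are counted with overlaps, β-hairpins are counted in SS while αα and βαβαβ are counted in SSW, the loop length of a hairpin is not checked, and no normalization is applied. The exact definitions and normalizations used in the paper are not reproduced here.

```python
import re
from itertools import groupby


def count_overlapping(pattern, text):
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    return len(re.findall("(?=" + re.escape(pattern) + ")", text))


def motif_counts(ss):
    """Raw motif counts from a segment sequence SS ('a' = α, 'b' = β, 'c' = ς)."""
    ssw = ss.replace("c", "")
    return {
        "Con_bcb": count_overlapping("bcb", ss),        # β-hairpin: strand, loop, strand
        "Con_aa": count_overlapping("aa", ssw),         # pair of neighbouring helices
        "Con_babab": count_overlapping("babab", ssw),   # Rossmann-like βαβαβ pattern
    }


def alternations(ssw):
    """Number of adjacent helix/strand switches in SSW (raw count, not a frequency)."""
    return sum(1 for left, right in zip(ssw, ssw[1:]) if left != right)


def max_run(symbol, text):
    """Length of the longest run of symbol in text (0 if the symbol is absent)."""
    return max((len(list(run)) for s, run in groupby(text) if s == symbol), default=0)


def content_fractions(sss):
    """Fractions of H and E in a non-empty SSS; the fraction of C is 1 minus their sum."""
    return sss.count("H") / len(sss), sss.count("E") / len(sss)


ss = "cbcbcacbcacbc"
ssw = ss.replace("c", "")
print(motif_counts(ss), alternations(ssw), max_run("b", ssw))
print(content_fractions("HHHHEECC"))
```

Since `max_run` works on any string, the same helper applies to SSW (longest run of consecutive helix or strand segments) and to SSS (longest run of `H` or `E` residues).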

Classification Algorithm Construction
Protein structural class prediction is a multiclass classification problem. Owing to its high prediction accuracy, the support vector machine (SVM) has been widely used for protein structural class prediction [4] [5]. Here we use the "one-versus-one" multiclass classification method, which constructs a multiclass classifier by combining six binary classifiers, one for each pair of the four classes. We choose the Gaussian radial basis function (RBF) as the kernel function for the SVM [14]. Using a grid search with tenfold cross-validation on the training set (25PDB), we select the penalty parameter C and the kernel parameter γ; the final parameters are C = 80 and γ = 0.8.
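The one-versus-one scheme can be illustrated with a short sketch. It shows only how four classes yield six pairwise problems and how their votes are combined; the binary classifiers are passed in as plain callables (in the paper they would be RBF-kernel SVMs), and the names `PAIRS` and `one_vs_one_predict` as well as the tie-breaking behaviour are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta"]

# Every unordered pair of classes gets its own binary classifier: 4 classes -> 6 pairs.
PAIRS = list(combinations(CLASSES, 2))


def one_vs_one_predict(x, binary_classifiers):
    """Predict a class for feature vector x by majority vote.

    binary_classifiers maps each (class_i, class_j) pair in PAIRS to a callable
    that returns either class_i or class_j for x. Ties go to the class that
    received its first vote earliest (a simplification of real tie-breaking).
    """
    votes = Counter(binary_classifiers[pair](x) for pair in PAIRS)
    return votes.most_common(1)[0][0]


# Stub classifiers for demonstration: prefer "alpha/beta" whenever it is in the pair.
stubs = {
    pair: (lambda x, pair=pair: "alpha/beta" if "alpha/beta" in pair else pair[0])
    for pair in PAIRS
}
print(len(PAIRS), one_vs_one_predict([0.0] * 14, stubs))
```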

Performance Measures
In this paper, we validate the model on independent test datasets. Among the many indicators of a model's performance, sensitivity (Sens), specificity (Spec), Matthews correlation coefficient (MCC) and overall accuracy (OA) are widely used in protein structure prediction [15]. The total number of proteins, the number of classes and the number of proteins in the k-th class are denoted by N, k and N_k, respectively.
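A minimal sketch of how these four measures can be computed from true and predicted class labels is given below. It assumes the standard one-class-versus-rest definitions of Sens, Spec and MCC; the function names and the toy labels are illustrative only.

```python
import math


def class_metrics(y_true, y_pred, k):
    """Return (Sens, Spec, MCC) for class k, treating k versus the rest as binary."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == k and p == k for t, p in pairs)   # correctly predicted as k
    tn = sum(t != k and p != k for t, p in pairs)   # correctly predicted as non-k
    fp = sum(t != k and p == k for t, p in pairs)   # wrongly predicted as k
    fn = sum(t == k and p != k for t, p in pairs)   # wrongly predicted as non-k
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc


def overall_accuracy(y_true, y_pred):
    """Fraction of proteins whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
print(class_metrics(y_true, y_pred, "A"), overall_accuracy(y_true, y_pred))
```

When a denominator is zero, for example when a class never occurs, the sketch returns 0.0 for that measure; this is a convention chosen here, not one taken from the paper.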

Structural Class Prediction Accuracies
In our experiment, we use the 25PDB dataset as the training set and the other three datasets as test sets. We report not only the values of Sens, Spec, MCC and overall accuracy (OA) for every structural class of each test set, but also their averages. The detailed results are shown in Table 1. The overall accuracy exceeds 84% for each test set and reaches 90% for the FC699 dataset. Moreover, the average overall accuracy over the three test sets reaches 86.6%. The Sens and

Feature Vector Analysis
To verify the effect of the seven newly proposed features, we perform the following experiment on the FC699, 1189 and 640 datasets. Table 2 compares the accuracies obtained by our method with all 14 features and with only 7 features. After the new features are added, the average overall accuracy increases by 2.6% to 86.6%. For the FC699 dataset, the overall accuracy and the accuracies of the All-α, All-β and α/β classes improve by 2.7%, 1.5%, 4.4% and 2.7%, respectively. For the 1189 and 640 datasets, the overall accuracy increases by 2.6% and 2.3%, respectively. However, the improvements for the α/β and α+β classes are not obvious because of interference from the other classes [13].
To further validate the effect of the super-secondary structure features, we run an experiment only on proteins in the α/β and α+β classes, using the 14-dimensional feature vector. The 25PDBS, FC699S, 1189S and 640S sets are the subsets formed by removing all proteins in the All-α and All-β classes from the 25PDB, FC699, 1189 and 640 datasets, respectively. We use 25PDBS to train the SVM classifier and the other subsets to test it. The parameters C and γ are selected by a grid search with tenfold cross-validation on 25PDBS. The corresponding experimental results are shown in Table 3. The overall accuracies of our method are higher than 80% on all datasets. The overall accuracies and the accuracies of the α+β class obtained by our method are the highest among the competing methods. On the FC699S subset, the prediction accuracies of all structural classes are higher than 90%. On 1189S, the accuracy of the α+β class and the overall accuracy increase by 9.2%-27.4% and 2.4%-7.3%, respectively. On 640S, the accuracy of the α+β class and the overall accuracy improve by 2.9%-11.7% and 0.6%-5.2%, respectively. These results suggest that our method for distinguishing the α/β and α+β classes is very effective.

Comparison with Other Prediction Method
SCPRED [2] and MODAS [7] are well-known methods for predicting protein structural classes and are often used as baselines for comparison. As Table 4 shows, our method improves the overall accuracy by 0.9%-3.8% and 0.5%-4.2% on the 1189 and 640 datasets, respectively, compared with other competing prediction methods, including SCPRED and MODAS. Only on the FC699 dataset is the overall accuracy not the highest: it is lower than that of Kong et al., but 0.8%-2.9% higher than those of the remaining methods. Compared with SCPRED and the method of Liu and Jia, our method improves the overall accuracy on the FC699 dataset by 2.9% and 0.8%, respectively, and increases the accuracy of the α+β class by 4.8% and 19.5%. On the 1189 dataset, our method obtains the highest accuracies for the All-β and α+β classes, 87.8% and 79.3%, and its overall accuracy is the highest among existing methods. On the 640 dataset, the overall accuracy and the accuracy of the All-β class are the highest.
Therefore, our method, which extracts features based on super-secondary structure, reflects the realistic characteristics of proteins more accurately. In particular, it greatly improves the accuracies of the α/β and α+β classes. According to Table 4, our method is not always the best. The reason is that some methods not only extract features based on protein secondary structure but also combine other information. In contrast, our method aims to predict structural classes by extracting features more effectively on the basis of secondary structure alone.

Conclusion
In this paper, a novel method based on protein super-secondary structure information is proposed. Seven new features related to α-helix bundles, β-hairpin motifs, Rossmann folds, αβ-plaits and other information are very useful for predicting protein structural classes. We adopt an SVM classifier, which requires little computational time and space, is accurate, and is well suited to large-scale protein sequence databases. Finally, experimental results show that this new prediction method improves not only the overall prediction accuracy but also the accuracies of all structural classes; in particular, the accuracies of the α/β and α+β classes are greatly improved. Hence, the newly extracted features reflect the characteristics of different structural classes more accurately, and our method is more effective than previous methods.

3) In proteins of the α/β class, α-helices and β-strands alternate more frequently than in proteins of the α+β class. Based on this characteristic, we design a feature in which Altn is the alternating frequency of α-helices and β-strands in SSS.
4) Because the lengths of the secondary structural segments affect the assignment of the structural class, we define new features in which the Maxseg terms are the maximal lengths of the α-helix and β-strand segments in SSS.
starting order of the α-helix (β-strand) segment in SSS, of α-helix and β-strand segments in SSS.
6) To reflect the position information of the protein secondary structure, two features, the 11th and 12th, are defined.
Usually, four parameters are used to examine a predictor's effectiveness. The numbers of proteins correctly predicted as the k-th class and as non-k-th class are denoted by TP_k and TN_k. The numbers of proteins incorrectly predicted as the k-th class and as non-k-th class are denoted by FP_k and FN_k, so that TP_k + FN_k = N_k and TN_k + FP_k = N − N_k. Using these parameters, we can obtain Equation (2):
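In their standard form, which is assumed here to correspond to Equation (2), the four measures can be written in terms of TP_k, TN_k, FP_k and FN_k, with N the total number of proteins and N_k the number of proteins in the k-th class:

```latex
\begin{align}
\mathrm{Sens}_k &= \frac{TP_k}{TP_k + FN_k} = \frac{TP_k}{N_k}, \\
\mathrm{Spec}_k &= \frac{TN_k}{TN_k + FP_k} = \frac{TN_k}{N - N_k}, \\
\mathrm{MCC}_k &= \frac{TP_k \, TN_k - FP_k \, FN_k}
  {\sqrt{(TP_k + FP_k)(TP_k + FN_k)(TN_k + FP_k)(TN_k + FN_k)}}, \\
\mathrm{OA} &= \frac{1}{N} \sum_{k} TP_k .
\end{align}
```

The second equality in the first two lines uses the fact that every protein of class k is counted either in TP_k or in FN_k, and every protein outside class k either in TN_k or in FP_k.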

Table 1. The prediction quality of our method on test datasets.
not only contains α and β classes but also contains αβ-plaits structure, so the prediction accuracy of the α+β class is lower.

Table 2. Comparison of the accuracies between the method including 14 features and the one including only 7 features.

Table 3. The accuracy of differentiating between the /

Table 4. Performance comparison of different methods on 3 test datasets.