Prediction of protein folding rates from primary sequence by fusing multiple sequential features

We have developed a web-server for predicting the folding rate of a protein based on its amino acid sequence information alone. The web-server is called Pred-PFR (Predicting Protein Folding Rate). Pred-PFR fuses multiple individual predictors, each of which is established based on one special feature derived from the protein sequence. The ensemble predictor thus formed is superior to the individual ones, as demonstrated by a higher correlation coefficient and a lower root mean square deviation between the predicted and observed results when examined by jackknife cross-validation on a recently constructed benchmark dataset. As a user-friendly web-server, Pred-PFR is freely accessible to the public at www.csbio.sjtu.edu.cn/bioinf/FoldingRate/.


INTRODUCTION
Knowledge of protein three-dimensional (3D) structures plays an indispensable role in molecular biology, cell biology, biomedicine, and drug design [1]. However, each protein begins as a polypeptide, translated from a sequence of mRNA as a linear chain of amino acids. A protein can function properly only if it is folded into a correct shape or conformation [2]. Failure to fold into the intended 3D structure usually produces inactive proteins with different properties. Although many efforts have been made to understand the mechanism of protein folding (see, e.g., [3,4,5,6]), it remains one of the most challenging problems in molecular biology. In addition to understanding how a protein chain is folded, it is also important to find the folding rates of proteins from their primary sequences. Protein chains can fold into their functional 3D structures at quite different rates, with folding times varying from several microseconds to as long as an hour [7,8].
Experimentally determining the three-dimensional structure of a protein is often very difficult and expensive, whereas the sequence of that protein is comparatively easy to obtain. Therefore, for quite a long time, scientists have tried to use the "least free energy principle" [2,9] to predict the 3D structures of proteins. Unfortunately, owing to the notorious local energy minimum problem, so far it can only be successfully used to address a very limited set of structural features, such as the handedness tendency and packing arrangement in proteins (see, e.g., [10,11,12]). In the past two decades, various statistical methods have been developed for predicting the structural classes of proteins and their folding patterns according to the sequence information alone (see, e.g., [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28] and a review [29]). Encouraged by the results obtained via these statistical approaches, various methods were developed for predicting the folding rates of proteins, because the information thus acquired would be very useful for understanding the protein folding mechanism and the sequence-structure-function relationship [8,30]. In this regard, the approaches can be generally categorized into two groups: (1) those in which the prediction of protein folding rates is based on the protein structure information; and (2) those in which the prediction is based on the primary sequence information.
For the first group, the features of proteins are extracted from their 3D structural information and hence the predictions are feasible only after the structures have been determined. Most of the methods in this group tried to derive the statistical significance of the correlation between the protein folding rate and the corresponding structural topological parameters, such as contact order (CO) [31], absolute contact order (Abs_CO) [32], total contact distance (TCD) [33], long-range order (LRO) [34], the fraction of local contact (FLC) [34], the chain
For the second group, the features of proteins are mainly extracted from their primary amino acid sequences, such as the amino acid biochemical properties [36] and the effective folding length ($L_{\mathrm{eff}}$) [8] derived from the sequence-predicted secondary structure. The approaches in the second group are particularly useful when the 3D structural information of the protein concerned is not available.
Although the aforementioned methods for predicting folding rates of proteins each have their own merits, they were all established by focusing on one (or a few) specific feature(s). As is well known, a protein folding system is very complicated and involves many physical and chemical factors. For this kind of complicated biological system, it would be particularly effective to assemble many individual predictors, each operating on its own special feature [37,38]. In view of this, the present study was devoted to developing a novel ensemble predictor for predicting the folding rate of a protein chain by incorporating its many different features through an optimal fusion process.

MATERIALS AND METHODS
To develop a powerful statistical predictor, the first important thing is to obtain an effective benchmark dataset [39]. To this end, and to facilitate comparison with the existing prediction methods, we use the benchmark dataset described below.

Benchmark Dataset
The large dataset recently constructed by Ouyang and Liang [30] was used in the current study. It contains 80 proteins whose folding rates have been experimentally determined. Of the 80 proteins, 45 show two-state folding behavior without visible intermediates, while the other 35 show three-state or multi-state folding kinetics with an obvious intermediate state during the folding process under the experimental conditions. If classified according to their structural classes, 18 are all-α proteins, 32 all-β, and the remaining 30 are αβ proteins (where αβ means the mix of α/β and α+β [40]). The folding rates of the 80 proteins span the range $-6.9 \le \ln K_f \le 12.9$. For users' convenience, the benchmark dataset, denoted as $\mathbb{S}_{\mathrm{bench}}$, is given in the Online Supporting Information A, which can also be downloaded from the web-site at www.csbio.sjtu.edu.cn/bioinf/FoldingRate/. It is instructive to point out that $K_f$ in $\mathbb{S}_{\mathrm{bench}}$ is actually an apparent folding rate constant (see Appendix A). Therefore, to develop a statistical method for predicting $K_f$ of a protein according to its sequence information alone, there is no need to discriminate whether the protein is two-state or multi-state folding.

Sequence Feature Extraction
As mentioned above, although the features extracted from the 3D structures of proteins are very useful for predicting their folding rates, they can be used only when the corresponding PDB codes are available. Owing to this limitation, in this study we focus on those features that can be derived from the amino acid sequence information alone, either directly or indirectly.
(a) Amino acid properties. A protein is composed of different amino acids, which show different physical, chemical, and conformational properties and hence may have correlations with the folding rates. In this study, the following four amino acid properties were used: the propensity to be at the C-terminal of an α-helix [41]; the propensity to form a β-strand [41]; the compressibility [42]; and the solvent accessible surface area in an unfolded protein chain [43]. Suppose a protein $\mathbf{P}$ is expressed by
$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \qquad (1)$$
where $R_1$ represents the 1st residue of the protein $\mathbf{P}$, $R_2$ the 2nd residue, and so forth, and $L$ represents the protein length. The protein's scores in the aforementioned four amino acid properties are formulated from the property values of its $L$ residues (Eq.2), with the values of each property normalized according to the Max-Min normalization procedure (Eq.3; see Table 1).
(b) Protein chain length. The protein length $L$ and its various expression forms could be useful features for predicting protein folding rates [8,30]. In the present study, $\ln(L)$ was adopted.
(c) Information derived from secondary structure prediction. Given a protein sequence, its secondary structure can be predicted by means of various secondary structure prediction tools. In the present study, based on the information thus obtained by using PSIPRED [44], we have the secondary structure content ratios for the protein $\mathbf{P}$, which satisfy
$$r_{\alpha} + r_{\beta} + r_{\mathrm{coil}} = 1 \qquad (4)$$
where $r_{\alpha}$, $r_{\beta}$, and $r_{\mathrm{coil}}$ are the ratios of the α-helix, β-sheet, and coil residues for the protein $\mathbf{P}$. Note that although the secondary structure content contains three components ($r_{\alpha}$, $r_{\beta}$, $r_{\mathrm{coil}}$), they were treated as one feature because of the normalization condition imposed by Eq.4. Moreover, based on the secondary structure prediction results, the effective protein folding chain length $L_{\mathrm{eff}}$ can be derived, as given by [8]:
$$L_{\mathrm{eff}} = L - L_H + \eta N_H \qquad (5)$$
where $L$ is the total number of amino acids for the entire protein chain; $L_H$ is the number of residues in the predicted α-helices; $N_H$ is the number of predicted α-helices; and $\eta$ is the effective length assigned to each helix. In the current study, $\eta$ was set at 3, and the feature input was derived from $L_{\mathrm{eff}}$.
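The feature extraction described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the property values in `TOY_PROPERTY` are made up (they are not the values of Table 1), averaging the per-residue values over the chain is an assumed form of the score, the secondary structure string uses the common H/E/C alphabet, and the effective length follows the form $L - L_H + \eta N_H$ attributed to [8] with $\eta = 3$. All function names are hypothetical.

```python
import math

# Made-up property values for illustration only; NOT the values of Table 1.
TOY_PROPERTY = {"A": 0.6, "G": 0.1, "L": 0.9, "K": 0.4, "E": 0.5, "V": 0.8}


def max_min_normalize(table):
    """Rescale property values to [0, 1] (Max-Min normalization)."""
    lo, hi = min(table.values()), max(table.values())
    return {aa: (v - lo) / (hi - lo) for aa, v in table.items()}


def property_score(seq, table):
    """Average a per-residue property over the chain (averaging is an assumption)."""
    return sum(table[aa] for aa in seq) / len(seq)


def ss_content(ss):
    """Fractions of helix (H), strand (E) and coil (C) residues in a predicted string."""
    n = len(ss)
    return ss.count("H") / n, ss.count("E") / n, ss.count("C") / n


def effective_length(ss, eta=3):
    """L_eff = L - L_H + eta * N_H, counting each maximal run of 'H' as one helix."""
    l_h = ss.count("H")
    n_h = sum(1 for i, s in enumerate(ss) if s == "H" and (i == 0 or ss[i - 1] != "H"))
    return len(ss) - l_h + eta * n_h


seq = "AGLKEVALGK"  # toy sequence
ss = "CHHHHCEEEC"   # toy secondary structure string of the same length
table = max_min_normalize(TOY_PROPERTY)
print(property_score(seq, table), math.log(len(seq)), ss_content(ss), effective_length(ss))
```

In practice the secondary structure string would come from a predictor such as PSIPRED rather than being typed in by hand.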

Prediction Algorithm
According to the above section, we have a set of seven different kinds of specific features, as can be summarized by the following equation:
$$\phi_i = \begin{cases} \text{score in the } i\text{-th amino acid property (Eq.2)}, & i = 1, 2, 3, 4 \\ \ln(L), & i = 5 \\ \text{secondary structure content (Eq.4)}, & i = 6 \\ \text{effective folding chain length (Eq.5)}, & i = 7 \end{cases} \qquad (6)$$
To study the folding rate of a protein chain, the key is to determine $K_f$, the so-called folding rate constant. For the reader's convenience, a brief discussion about the role of $K_f$ (or its logarithm $\ln K_f$) in the protein folding rate is provided in Appendix A. According to Eq.6, we can construct the following seven linear regression models for predicting the protein folding rate constants:
$$\ln K_f(i) = a_i + b_i \phi_i \qquad (i = 1, 2, \ldots, 7) \qquad (7)$$
where $K_f(i)$ is the protein folding rate constant predicted based on the $i$-th specific feature $\phi_i$ (cf. Eq.6), while $a_i$ and $b_i$ are the corresponding parameters determined by using regression analysis on a training dataset such as $\mathbb{S}_{\mathrm{bench}}$. For the details of how to use the regression procedures to determine $a_i$ and $b_i$, refer to [45]. Note that $K_f(6)$ of Eq.7.6 involves more parameters because the 6-th feature $\phi_6$ contains three sub-features (cf. Eq.6).
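As an illustration of how the parameters of one single-feature regression model might be obtained, the sketch below fits a straight line by ordinary least squares. It assumes the response is the logarithm of the folding rate constant and the predictor is one scalar feature; the function name `fit_line` and the toy numbers are hypothetical, and the actual procedure of [45] may differ.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Toy data: a feature value and an observed ln(K_f) for four hypothetical proteins.
feature = [1.0, 2.0, 3.0, 4.0]
ln_kf = [1.5, 1.0, 0.5, 0.0]
a, b = fit_line(feature, ln_kf)
print(a, b)
```

For a feature with several sub-features, such as the secondary structure content, a multiple regression would be needed instead of this one-variable fit.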
All the above seven formulae (Eqs.7.1-7.7) can be used to predict the protein folding rates, but each reflects the effect of only one specific feature (or one kind of feature). To incorporate the effects from all seven kinds of features, let us consider the following formulation:
$$\ln K_f = \sum_{i=1}^{7} w_i \ln K_f(i) \qquad (8)$$
where $w_i$ ($i = 1, 2, \ldots, 7$) is the weight that reflects the impact of the $i$-th specific feature on the protein folding rate. If the impacts of the seven features were the same, the seven weights would all be equal. Since they are actually not the same, it would be rational to introduce some statistical criterion to reflect their different impacts, as formulated below.
Given a statistical system consisting of $N$ samples, the Pearson Correlation Coefficient (PCC) is defined by
$$\mathrm{PCC} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}} \qquad (9)$$
where $x_i$ and $y_i$ are, respectively, the observed and predicted results for the $i$-th sample, while $\bar{x}$ and $\bar{y}$ are the corresponding mean values for the $N$ samples. Since PCC reflects the correlation of the predicted results with the actual ones, its value can be used to measure the quality of a prediction method. If all the predicted results are exactly the same as the observed ones, we have the perfect correlation of $\mathrm{PCC} = 1$. For different prediction algorithms, Eq.9 will yield different values of PCC. Therefore, the weight $w_i$ in Eq.8 is formulated in terms of $\mathrm{PCC}_i$ (Eq.10), where $\mathrm{PCC}_i$ is the Pearson Correlation Coefficient (Eq.9) obtained with the $i$-th folding rate predicting formula in Eq.7 on the benchmark dataset $\mathbb{S}_{\mathrm{bench}}$ by the jackknife cross-validation.

SciRes Copyright © 2009 JBiSE
The prediction method formed by fusing the seven individual methods formulated in Eq.7 is called Pred-PFR (Predictor of Protein Folding Rate).
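The fusion step can be sketched as follows. The Pearson correlation is the standard definition; the weighting rule, however, is an assumption made for illustration only: each weight is taken as proportional to the absolute value of the corresponding correlation coefficient and the weights are normalized to sum to 1. The exact form of the weight used by Pred-PFR is not reproduced here, and the names `fusion_weights` and `fuse` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between observed x and predicted y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


def fusion_weights(pccs):
    """ASSUMED rule: weight proportional to |PCC_i|, normalized to sum to 1."""
    total = sum(abs(p) for p in pccs)
    return [abs(p) / total for p in pccs]


def fuse(predictions, weights):
    """Weighted sum of the individual predictions for one protein."""
    return sum(w * p for w, p in zip(weights, predictions))


# Toy example with three individual predictors instead of seven.
weights = fusion_weights([0.8, -0.4, 0.4])
print(weights, fuse([1.0, 2.0, 3.0], weights))
```

Taking the absolute value reflects the remark that the strength of a correlation is judged by its magnitude, so a strongly negative correlation would still earn a large weight under this assumed rule.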

RESULTS AND DISCUSSIONS
In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test [40]. However, as elucidated in [38] and demonstrated by Eq.5 of [39], among the three cross-validation methods, the jackknife test is deemed the most objective because it always yields a unique result for a given benchmark dataset, and hence it has been increasingly and widely used by investigators to examine the accuracy of various predictors (see, e.g., [46,47,48,49,50,51,52,53,54]). To demonstrate the quality of Pred-PFR, here let us also use the jackknife cross-validation on the benchmark dataset $\mathbb{S}_{\mathrm{bench}}$ (see the Online Supporting Information A). Now, let us use $\mathrm{PCC}(K_f)$ to represent the Pearson Correlation Coefficient (Eq.9) obtained with Pred-PFR (Eq.8) on the benchmark dataset $\mathbb{S}_{\mathrm{bench}}$ by the jackknife cross-validation, and $\mathrm{PCC}(K_f(i))$ that obtained with the $i$-th formula of Eq.7. To facilitate comparison of the ensemble predictor with the individual predictors, the values of $\mathrm{PCC}(K_f)$ and those of $\mathrm{PCC}(K_f(i))$ are given in Table 2.
Furthermore, to show the accuracy of the prediction in a more intuitive manner, let us introduce the RMSD (Root Mean Square Deviation) as defined by
$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2} \qquad (12)$$
where $x_i$, $y_i$, and $N$ have the same meanings as in Eq.9. Obviously, the smaller the value of RMSD, the more accurate the prediction. If all the predicted results are identical to the corresponding observed ones, we have $\mathrm{RMSD} = 0$. Let $\mathrm{RMSD}(K_f)$ denote the value of RMSD obtained with the ensemble predictor Pred-PFR (Eq.8) on the benchmark dataset $\mathbb{S}_{\mathrm{bench}}$ by the jackknife cross-validation, and $\mathrm{RMSD}(K_f(i))$ that by the $i$-th formula of Eq.7. All these values are also given in Table 2.
As we can see from the table, the overall PCC value yielded by the ensemble prediction formula (Eq.8) is 0.88, which is the closest to 1 in comparison with those by the individual prediction formulae (Eqs.7.1-7.7). Such an overall PCC value is even higher than that by the prediction method using the 3D structural information [30] on the same benchmark dataset. Moreover, it can be seen from Table 2 that the overall RMSD value generated by the ensemble prediction formula is the lowest in comparison with those by the seven individual prediction formulae. The highest correlation and lowest deviation indicate that the Pred-PFR ensemble predictor formed by the fusing approach is indeed more powerful than the individual predictors.
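The jackknife (leave-one-out) test and the RMSD can be sketched for a single-feature linear model as below. This is a minimal illustration, not the Pred-PFR implementation: it assumes one scalar feature per protein, reuses a plain least-squares line fit, and uses hypothetical function names and toy numbers.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def jackknife_predictions(x, y):
    """Predict each sample from a model trained on all the other samples."""
    preds = []
    for k in range(len(x)):
        x_train = x[:k] + x[k + 1:]  # leave sample k out
        y_train = y[:k] + y[k + 1:]
        a, b = fit_line(x_train, y_train)
        preds.append(a + b * x[k])
    return preds


def rmsd(observed, predicted):
    """Root mean square deviation between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


feature = [1.0, 2.0, 3.0, 4.0, 5.0]
ln_kf = [2.1, 1.4, 1.1, 0.3, 0.0]  # toy observations
preds = jackknife_predictions(feature, ln_kf)
print(preds, rmsd(ln_kf, preds))
```

Because every sample is left out exactly once, the procedure has no random component, which is why it gives a unique result for a given dataset.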

CONCLUSIONS
Pred-PFR was developed for predicting the folding rate of a protein based on its sequence information alone. It is an ensemble predictor formed by fusing multiple individual predictors, each based on one special feature. As expected, the ensemble predictor is superior to the individual predictors. The web-server for Pred-PFR is freely accessible to the public at www.csbio.sjtu.edu.cn/bioinf/FoldingRate/.

ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (Grant no. 60704047), the Science and Technology Commission of Shanghai Municipality (Grant nos. 08ZR1410600 and 08JC1410600), and sponsored by the Shanghai Pujiang Program.

APPENDIX A. THE PROTEIN FOLDING RATE CONSTANT $K_f$
For a given protein, its folding rate is generally reflected by the apparent rate constant $K_f$ as defined by the following differential equation
$$\frac{dP_{\mathrm{unfold}}(t)}{dt} = -K_f P_{\mathrm{unfold}}(t), \qquad \frac{dP_{\mathrm{fold}}(t)}{dt} = K_f P_{\mathrm{unfold}}(t) \qquad (A1)$$
where $P_{\mathrm{unfold}}(t)$ and $P_{\mathrm{fold}}(t)$ represent the concentrations of its unfolded state and folded state, respectively. Suppose the total protein concentration is $C_0$, and initially only the unfolded protein is present; i.e., $P_{\mathrm{unfold}}(t) = C_0$ and $P_{\mathrm{fold}}(t) = 0$ when $t = 0$. Subsequently, the protein system is subjected to a sudden change in temperature, solvent, or any other factor that causes the protein to fold. Obviously, the solution for Eq.A1 is
$$P_{\mathrm{unfold}}(t) = C_0 \exp(-K_f t), \qquad P_{\mathrm{fold}}(t) = C_0 \left[ 1 - \exp(-K_f t) \right] \qquad (A2)$$
It can be seen from the above equation that the larger the $K_f$, the faster the folding rate will be. However, the actual process is much more complicated than the one described by Eq.A1, even if the system concerned consists of only two states. The reason is that the folded state may reverse back to the unfolded state, as schematically expressed in Eq.A3 and formulated in Eq.A4 (see Figure 1).
… $K_f$ be treated as a constant. It can be imagined that for a three-state or multi-state folding system, $K_f$ will be much more complicated. We can also see from the above derivation that using graphic analysis to deal with kinetic systems is quite efficient and intuitive, particularly in dealing with complicated kinetic systems. For more discussions about graphic analysis and its applications to kinetic systems, see [55,58,59,60,61,62].

Table 2. The jackknife test results obtained by using different formulae on the benchmark dataset $\mathbb{S}_{\mathrm{bench}}$ (see the Online Supporting Information A).
Column headings: Prediction formula; PCC^a (cf. Eq.9); RMSD (cf. Eq.12).
^a Note that PCC may also have a negative value (see Eq.9). However, the correlation strength of the predicted results with the observed ones is generally measured by its absolute value.
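As a short worked step for the irreversible first-order scheme of Appendix A, the decay of the unfolded population can be obtained by separation of variables. The symbols $P_{\mathrm{unfold}}(t)$, $P_{\mathrm{fold}}(t)$ and $C_0$ (total concentration, all unfolded at $t = 0$) are notational assumptions here, and the half-time $t_{1/2}$ is added only as an illustration of how $K_f$ sets the time scale.

```latex
\begin{align*}
\frac{dP_{\mathrm{unfold}}}{dt} &= -K_f\, P_{\mathrm{unfold}} \\
\int_{C_0}^{P_{\mathrm{unfold}}(t)} \frac{dP}{P} &= -K_f \int_0^t dt' \\
\ln \frac{P_{\mathrm{unfold}}(t)}{C_0} &= -K_f\, t \\
P_{\mathrm{unfold}}(t) &= C_0\, e^{-K_f t}, \qquad
P_{\mathrm{fold}}(t) = C_0 - P_{\mathrm{unfold}}(t) = C_0 \left( 1 - e^{-K_f t} \right) \\
t_{1/2} &= \frac{\ln 2}{K_f}
\end{align*}
```

For example, under this scheme a protein with $\ln K_f = 0$ (that is, $K_f = 1$ in the chosen time unit) would be half folded after $\ln 2 \approx 0.69$ time units.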



Figure 1. (a) The directed graph or digraph [55,56] for the two-state protein folding mechanism as schematically expressed in Eq.A3 and formulated in Eq.A4. (b) The phase digraph obtained from panel (a) according to graphic rule 4 [55,56], which is also called "Chou's graphic rule for non-steady-state enzyme kinetics" in the literature (see, e.g., [57]). The symbol shown in panel (b) is an interim parameter (see Eq.A5 and the related text for further explanation).

Table 1. The values of the four amino acid properties that have been normalized according to the Max-Min normalization procedure of Eq.3. For more explanation about the four amino acid properties, see the relevant text.