PFP-RFSM: Protein fold prediction by using random forests and sequence motifs

Abstract

Protein tertiary structure is indispensable in revealing the biological functions of proteins. De novo prediction of protein tertiary structure depends on protein fold recognition. This study proposes a novel method for predicting protein fold types that takes the primary sequence as input. The proposed method, PFP-RFSM, employs a random forest classifier and a comprehensive feature representation that includes both sequence and predicted structure descriptors. In particular, we propose a method for generating features based on sequence motifs; these features are employed in protein fold prediction for the first time. PFP-RFSM and ten representative protein fold predictors are evaluated on a benchmark dataset consisting of 27 fold types. Experiments demonstrate that PFP-RFSM outperforms all existing protein fold predictors and improves the success rates by 2%-14%. The results suggest that sequence motifs are effective in the classification and analysis of protein sequences.
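
As a rough illustration of motif-based features, the sketch below turns a primary sequence into a numeric vector with one entry per motif. It is not the PFP-RFSM procedure: the abstract does not list the motifs or say how their occurrences are encoded, so the motif patterns, the function name `motif_features`, and the normalization by sequence length are all assumptions made for this example.

```python
import re

# Hypothetical motifs written as regular expressions. The abstract does not
# give the motifs actually used, so these are placeholders for illustration.
EXAMPLE_MOTIFS = ["G.G..G", "C..C", "[ST]P"]


def motif_features(sequence, motifs=EXAMPLE_MOTIFS):
    """Return one feature per motif: overlapping match count / sequence length.

    Dividing by the length is an assumption of this sketch, chosen so that
    long and short sequences yield comparable values.
    """
    length = max(len(sequence), 1)  # avoid division by zero for empty input
    features = []
    for motif in motifs:
        # A zero-width lookahead lets overlapping occurrences be counted.
        count = len(re.findall(f"(?=(?:{motif}))", sequence))
        features.append(count / length)
    return features
```

A vector produced this way could be concatenated with other sequence and predicted-structure descriptors and passed to any random forest implementation; that training step is omitted here.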

 

Share and Cite:

Li, J. , Wu, J. and Chen, K. (2013) PFP-RFSM: Protein fold prediction by using random forests and sequence motifs. Journal of Biomedical Science and Engineering, 6, 1161-1170. doi: 10.4236/jbise.2013.612145.

Conflicts of Interest

The authors declare no conflicts of interest.


Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.