The Comparison between Random Forest and Support Vector Machine Algorithm for Predicting β-Hairpin Motifs in Proteins

Abstract

To predict β-hairpin motifs in proteins, we apply the Random Forest and Support Vector Machine algorithms to the ArchDB40 dataset. Motifs with loop lengths of 2 to 8 amino acid residues are extracted as the research objects, and a fixed-length pattern of 12 amino acids is selected. Using the same characteristic parameters and the same test method, the Random Forest algorithm is more effective than the Support Vector Machine. In addition, because the Random Forest algorithm does not overfit as the dimension of the characteristic parameters increases, we apply Random Forest with higher-dimensional characteristic parameters to predict β-hairpin motifs. Better prediction results are obtained: the overall accuracy and Matthew's correlation coefficient under 5-fold cross-validation reach 83.3% and 0.59, respectively.
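
The comparison described above (both classifiers trained on the same characteristic parameters and evaluated with 5-fold cross-validation, reporting overall accuracy and Matthew's correlation coefficient) can be sketched as follows. This is a minimal illustration, not the authors' code: the scikit-learn classifiers, the hypothetical loader load_archdb40_features, and the feature matrix X / label vector y (one row per encoded 12-residue pattern, labeled β-hairpin or non-hairpin) are assumptions standing in for the paper's actual feature extraction.

# Sketch of the evaluation protocol: Random Forest vs. SVM, same features,
# 5-fold cross-validation, overall accuracy and MCC as metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, matthews_corrcoef

def evaluate(model, X, y, n_splits=5, seed=0):
    """Return mean overall accuracy and MCC over stratified k-fold CV."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs, mccs = [], []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        mccs.append(matthews_corrcoef(y[test_idx], pred))
    return np.mean(accs), np.mean(mccs)

# X, y = load_archdb40_features()  # hypothetical loader for the encoded patterns
# for name, clf in [("Random Forest", RandomForestClassifier(n_estimators=500)),
#                   ("SVM (RBF)", SVC(kernel="rbf"))]:
#     acc, mcc = evaluate(clf, X, y)
#     print(f"{name}: accuracy={acc:.3f}, MCC={mcc:.3f}")

The stratified splits keep the ratio of hairpin to non-hairpin patterns roughly constant across folds, which matters when the two classes are unbalanced, as is typical for motif datasets.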

Share and Cite:

Jia, S., Hu, X. and Sun, L. (2013) The Comparison between Random Forest and Support Vector Machine Algorithm for Predicting β-Hairpin Motifs in Proteins. Engineering, 5, 391-395. doi: 10.4236/eng.2013.510B079.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.