A Low-Memory-Requiring and Fast Approach to Cluster Large-Scale Decoy Protein Structures


This work demonstrates that PCAC (Protein principal Component Analysis Clustering), a method that uses principal component analysis (PCA) to cluster large-scale decoy protein structures in protein structure prediction, is an ultra-fast, low-memory clustering method. It can be two orders of magnitude faster than the commonly used pairwise RMSD clustering (pRMSD) when an enormous number of decoys is involved. Instead of the N(N – 1)/2 least-squares-fitted RMSD calculations and the N² memory units that pRMSD needs to store the pairwise RMSD values, PCAC requires only N RMSD calculations and N × P memory units, where N is the number of structures to be clustered and P is the number of preserved eigenvectors. Furthermore, PCAC based on the Cartesian covariance matrix generates essentially the same result as the reference RMSD clustering (rRMSD). In a test on 41 protein decoy sets, when the eigenvectors that together account for 90% of the total eigenvalue sum are preserved, PCAC reproduces the near-native selections obtained from rRMSD.
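The operation counts and the 90% eigenvalue criterion stated in the abstract can be illustrated with a minimal Python sketch. The function names (`pairwise_cost`, `pcac_cost`, `num_preserved`) and the example values are hypothetical, chosen for illustration only, and are not taken from the paper. The sketch tallies only superpositions and stored values, so it does not model total running time, which also includes the PCA itself.

```python
def pairwise_cost(n):
    """Fits and stored values for pairwise RMSD clustering (pRMSD)."""
    fits = n * (n - 1) // 2   # one least-squares fit per unordered pair
    memory = n * n            # N^2 pairwise RMSD values, as counted in the abstract
    return fits, memory


def pcac_cost(n, p):
    """Fits and stored values for PCAC with p preserved eigenvectors."""
    fits = n          # N RMSD calculations, one per decoy
    memory = n * p    # P projection coordinates kept per decoy
    return fits, memory


def num_preserved(eigenvalues, fraction=0.90):
    """Smallest number of leading eigenvectors whose eigenvalues
    sum to at least `fraction` of the total eigenvalue sum."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)


if __name__ == "__main__":
    n, p = 10_000, 20  # illustrative sizes, not values from the paper
    print("pRMSD (fits, memory):", pairwise_cost(n))
    print("PCAC  (fits, memory):", pcac_cost(n, p))
    # Hypothetical eigenvalue spectrum; the first two values cover 95% of the sum.
    print("eigenvectors kept:", num_preserved([7.0, 2.5, 0.3, 0.2]))
```

For these illustrative sizes, pRMSD needs 49,995,000 fits and 100,000,000 stored values, while PCAC needs 10,000 fits and 200,000 stored values.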

Share and Cite:

Y. Yuan, Y. Shang and H. Li, "A Low-Memory-Requiring and Fast Approach to Cluster Large-Scale Decoy Protein Structures," Open Journal of Biophysics, Vol. 2 No. 3, 2012, pp. 57-63. doi: 10.4236/ojbiphy.2012.23008.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2023 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.