A Low-Memory-Requiring and Fast Approach to Cluster Large-Scale Decoy Protein Structures

This work demonstrates that PCAC (Protein principal Component Analysis Clustering), a method that clusters large-scale decoy protein structures in protein structure prediction based on principal component analysis (PCA), is an ultra-fast and low-memory-requiring clustering method. It can be two orders of magnitude faster than the commonly used pairwise rmsd-clustering (pRMSD) when an enormous number of decoys is involved. Instead of the $N(N-1)/2$ least-square-fitting rmsd calculations and the $N^2$ memory units needed to store the pairwise rmsd values in pRMSD, PCAC requires only N rmsd calculations and N × P memory units, where N is the number of structures to be clustered and P is the number of preserved eigenvectors. Furthermore, PCAC based on the Cartesian covariance matrix generates essentially the same result as reference rmsd-clustering (rRMSD). In a test of 41 protein decoy sets, PCAC reproduces the near-native selections of rRMSD when the eigenvectors that contribute a total of 90% of the eigenvalues are preserved.


Introduction
In ab initio protein-structure prediction, a large number of protein conformations (decoys) is usually generated. Clustering of similar predicted protein structures is a commonly adopted procedure [1][2][3][4]. The clustering procedure simplifies data analysis by reducing the enormous number of decoys generated from the large-scale conformational search, and it provides information on the distribution of the structures in conformational space. In order to compare clustering results from diverse proteins, an adaptive cluster cutoff method is recommended [5] instead of the K-means algorithm. The main drawback of K-means clustering [6,7] is that it requires a pre-determined number of clusters, which is not suitable for decoy clustering.
Structural clustering (e.g., the leader algorithm [8]) is usually based on the pairwise root-mean-squared distance (pRMSD), which is a more accurate similarity measure than alternatives such as the distance of internal coordinates [9,10]. pRMSD requires $N(N-1)/2$ least-square-fitting rmsd calculations for N structures, which is time-consuming when a large number of decoys is involved. Li et al. developed a fast decoy clustering method (SCUD) based on the reference root-mean-squared distance (rRMSD), which requires only N rmsd calculations to a reference conformation [11]. A randomly selected reference conformation is used to remove the overall translational and rotational motion of all the decoys, and the rmsd between any two conformations is then determined without further reorientation. In a test of 53 protein decoy sets, the near-native selections of rRMSD are similar to those of pRMSD. SCUD is 8 times faster without a significant change in the accuracy of near-native selections. However, both the pRMSD and rRMSD methods require $N^2$ memory units to store the rmsd values of each pair of structures in order to speed up the calculation, which may exceed the computer's available memory when the number of decoys ranges from $10^4$ to $10^6$ [12].
Principal component analysis (PCA) is one of the most valuable results from linear algebra. It can be used to reduce the number of variables in a linear Gaussian data set or to classify them [13]. PCA was first introduced to biosystem analysis by Gower in 1966 [14,15]. It has been successfully applied to describe the energy landscape of molecules [16][17][18], nonlinear motions in proteins [19][20][21], and many other problems in bioinformatics [22][23][24][25][26][27][28]. For a sample data set with p variables and N individuals, there are two ways to build the PCA matrix. One method constructs a p × p matrix to measure the discrepancy of the individuals along the principal axes in the p-dimensional space. The other method, principal coordinate analysis, builds an N × N matrix to analyze the similarity of the individuals [14,29]. Normally a covariance matrix [19] is constructed, since each eigenvalue is the variance of the N individuals along the corresponding eigenvector. The matrix can also be constructed as a distance matrix [29], in which case the eigenvalues are no longer directly related to the variance of the individuals. The elements of the matrix can be calculated from Cartesian coordinates, internal coordinates (e.g., pair distances between two atoms [30], bond angles or dihedral angles [31]), their derivations [32], or any other reasonable measure [23].
In this study, we show that a clustering method based on PCA, called PCAC (Protein principal Component Analysis Clustering), is also a powerful tool for clustering predicted protein structures. PCAC clustering based on Cartesian coordinates is identical to rRMSD clustering when all the eigenvectors are preserved. In a test of 41 proteins [5] with 2000 folding decoys each, PCAC yields near-native selections similar to those of the rRMSD method when the eigenvectors (about 17) that contribute a total of 90% of the eigenvalues are preserved. The method needs only N least-square-fitting rmsd calculations instead of the $N(N-1)/2$ needed in pRMSD. Furthermore, instead of the $N^2$ memory units needed in pRMSD and rRMSD clustering, PCAC requires only N × P memory units to store the coordinates projected on the preserved eigenvectors, where P is the number of preserved eigenvectors, which is usually a fixed number below 100 and independent of the number of decoys N. Consequently, it can be hundreds of times faster than the pRMSD method when a large number of decoys is studied and the computer cannot hold the $N^2$ pairwise rmsd values in memory. PCAC may also be applied to cluster other large-scale databases, e.g., compound libraries for virtual screening.

Constructing Covariance Matrix in Cartesian Coordinates
The covariance matrix [13] of Cα atoms in Cartesian coordinates is used in the PCA calculation. The element $\sigma_{ij}$ (the covariance of two coordinates) of the 3p × 3p matrix (p is the number of Cα atoms in the protein, which have a total of 3p Cartesian coordinates) is defined as

$$\sigma_{ij} = \frac{1}{N}\sum_{l=1}^{N}\left(x_i^{l}-\bar{x}_i\right)\left(x_j^{l}-\bar{x}_j\right) \qquad (1)$$

where N is the number of decoys for a specific protein, l is the decoy index, i and j are the coordinate indices among the 3p Cartesian coordinates, and $\bar{x}_i$ and $\bar{x}_j$ are the averages over the conformations along the ith and jth coordinates, respectively. Before the covariance matrix is constructed, the decoys are translated and rotated to match a reference conformation, so that the rmsd between each decoy and the reference is minimized. A total of N rmsd calculations is required to remove the overall rotation.
The eigenvalues are sorted in descending order when the covariance matrix is diagonalized. Only the eigenvectors that have significant eigenvalues are preserved for further analysis. We either preserve a fixed number of eigenvectors with the highest eigenvalues, or set an eigenvalue-percentage-cutoff value (the fraction of the preserved eigenvalues over the total eigenvalues) to select the number of preserved eigenvectors.

PCAC: Clustering in PCA Space
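The covariance construction and the eigenvalue-percentage-cutoff described in the text can be illustrated with a minimal NumPy sketch. It assumes the decoys have already been fitted to the reference conformation and are stored as an N × 3p array; the function name `pca_project`, the normalization of the covariance by N, and the synthetic input data are assumptions made for illustration and are not taken from the original implementation.

```python
import numpy as np

def pca_project(coords, eig_fraction=0.90):
    """Project pre-aligned decoys onto the leading PCA eigenvectors.

    coords: (N, 3p) array, one row per decoy, already fitted to a reference.
    Returns (projections of shape (N, P), sorted eigenvalues, P).
    """
    centered = coords - coords.mean(axis=0)          # subtract per-coordinate average
    cov = centered.T @ centered / coords.shape[0]    # 3p x 3p covariance (assumed 1/N)
    evals, evecs = np.linalg.eigh(cov)               # solver for symmetric matrices
    order = np.argsort(evals)[::-1]                  # sort eigenvalues in descending order
    evals = np.clip(evals[order], 0.0, None)         # remove tiny negative round-off
    evecs = evecs[:, order]
    cumulative = np.cumsum(evals) / evals.sum()      # eigenvalue percentage
    n_keep = min(int(np.searchsorted(cumulative, eig_fraction)) + 1, evals.size)
    return centered @ evecs[:, :n_keep], evals, n_keep

# Synthetic stand-in for N = 500 aligned decoys with p = 20 atoms (3p = 60).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 6)) * np.array([8.0, 5.0, 3.0, 2.0, 1.0, 0.5])
coords = latent @ rng.normal(size=(6, 60)) + 0.05 * rng.normal(size=(500, 60))
proj, evals, n_keep = pca_project(coords, eig_fraction=0.90)
print(proj.shape, n_keep)
```

Only the N × P array `proj` needs to be kept for clustering; the 3p × 3p covariance matrix can be discarded after diagonalization.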
PCAC is based on the pairwise distance of decoys in PCA space. The PCA-distance, $d_{mn}$, between structures m and n is defined as

$$d_{mn} = \sqrt{\sum_{k=1}^{P}\left(m_k-n_k\right)^2} \qquad (2)$$

where P is the number of preserved important eigenvectors (P ≤ 3p), and $m_k$ and $n_k$ are the coordinates of the two decoys projected on the kth eigenvector. During clustering, the projected coordinates of each decoy are stored in N × P memory units.
The PCA-distance $d_{mn}$ is normalized to a scaled Cartesian PCA-distance in order to compare it with the rRMSD method,

$$\tilde{d}_{mn} = \frac{d_{mn}}{\sqrt{p}} = \sqrt{\frac{1}{p}\sum_{k=1}^{P}\left(m_k-n_k\right)^2} \qquad (3)$$

where p is the number of Cα atoms used to build the Cartesian covariance matrix. All the decoys that are close to each other within a cluster cutoff in PCA space are clustered into one family.
Due to the diversity of protein decoy sets, an adaptive cluster cutoff value is strongly recommended. The cluster cutoff is the value at which the top 3 largest clusters include 5% of the total decoys (the T35 value [11]). The top 5 largest clusters are selected as the best prediction for a specific protein, and the structure selection value listed in the tables is the minimum rmsd from the native among the 5 selected structures.

Copyright © 2012 SciRes. OJBIPHY

Discussion
The decoy set used to test the PCAC method is obtained from the energy minimization of 41 helical proteins [5]. The number of residues in the proteins ranges from 40 to 124, and the number of helices from 2 to 6. For each protein, at least 2000 initial structures are produced. The initial structures are constructed with random dihedral angles for the residues in nonhelical regions and native dihedral angles for the residues in helical regions [33,34]. The DFIRE energy function [35], together with an improper torsion energy and a simple repulsive potential, is employed to minimize the initial structures in dihedral space to fold the protein. Because Equation (5) proves that PCAC generates exactly the same results as rRMSD when all the eigenvalues are preserved, no further decoy sets are needed to test the new methodology.
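The equivalence between PCAC and rRMSD when all eigenvectors are preserved can be checked numerically. The sketch below is a hedged illustration on synthetic data: it assumes the scaled PCA-distance is the Euclidean distance in PCA space divided by the square root of the number of atoms p (the scaling implied by the stated identity with rRMSD), and the helper names `scaled_pca_distance` and `reference_rmsd` are hypothetical.

```python
import numpy as np

def scaled_pca_distance(proj_m, proj_n, n_atoms):
    """Euclidean distance between two projected decoys, divided by sqrt(p)."""
    return np.linalg.norm(proj_m - proj_n) / np.sqrt(n_atoms)

def reference_rmsd(x_m, x_n, n_atoms):
    """rmsd of two decoys already fitted to the same reference (no re-fitting)."""
    return np.sqrt(np.sum((x_m - x_n) ** 2) / n_atoms)

rng = np.random.default_rng(0)
p = 10                                       # number of atoms (assumed value)
coords = rng.normal(size=(50, 3 * p))        # synthetic stand-in for aligned decoys
centered = coords - coords.mean(axis=0)
cov = centered.T @ centered / len(coords)
_, evecs = np.linalg.eigh(cov)               # eigenvalues come back in ascending order
proj = centered @ evecs[:, ::-1]             # all 3p eigenvectors, largest first

full = scaled_pca_distance(proj[0], proj[1], p)               # P = 3p
ref = reference_rmsd(coords[0], coords[1], p)
truncated = scaled_pca_distance(proj[0, :5], proj[1, :5], p)  # P = 5
print(full, ref, truncated)
```

With all 3p eigenvectors the two values agree to round-off, because the projection is an orthogonal change of basis. Dropping eigenvectors can only shrink the distance, which is why the truncated value is a lower bound on the rRMSD value.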

PCA-Distance and rRMSD
The rRMSD value used in SCUD, $r_{mn}^{\mathrm{rMSD}}$, is the directly calculated rmsd of two decoys indexed as m and n when both conformations have minimized their rmsd values to the reference conformation [11],

$$r_{mn}^{\mathrm{rMSD}} = \sqrt{\frac{1}{p}\sum_{k'=1}^{3p}\left(m_{k'}-n_{k'}\right)^2} \qquad (4)$$

where p is the total number of Cα atoms involved in the calculation, and $m_{k'}$ and $n_{k'}$ are the k′th laboratory Cartesian coordinates (after least-square fitting to the reference conformation) of the two decoys.
Since the decoy coordinates measured in PCA space are the projections of the laboratory coordinates on the PCA eigenvectors, the Cartesian distance between any two conformations remains the same in laboratory coordinates and in PCA coordinates. Thus, from Equations (2)-(4),

$$r_{mn}^{\mathrm{rMSD}} = \sqrt{\frac{1}{p}\sum_{k'=1}^{3p}\left(m_{k'}-n_{k'}\right)^2} = \sqrt{\frac{1}{p}\sum_{k=1}^{3p}\left(m_k-n_k\right)^2} = \left.\frac{d_{mn}}{\sqrt{p}}\right|_{P=3p} \qquad (5)$$

From the above equation, we can see that the rRMSD value is identical to the scaled PCA-distance when all the eigenvectors resulting from PCA are considered.

Cluster Cutoff
Normally a pre-determined cutoff value is selected to cluster the structures. However, it is difficult to set one cutoff value for diverse proteins, because this will lead to too few clusters for some proteins and too many clusters for others [5,36].
Figure 1 depicts the effect of the cluster cutoff on the fraction of the number of clusters produced and on the number of decoys included in the top 3 largest clusters. We can see that the fraction of the number of clusters strongly depends on the number of decoys, whereas the fraction of the number of decoys included in the top 3 largest clusters does not. A well-defined cluster cutoff value should remain constant as the number of analyzed decoys varies. Hence, we select the cluster cutoff at which the number of decoys in the top 3 largest clusters accounts for 5% of the total decoys. The T35 value is selected as the cluster cutoff because a statistically significant number of decoys is included in the largest cluster (over 40 for a 2000-decoy set) while most of the diverse decoys (about 50%) are conserved. Li et al. tested cutoffs at which 1%, 3%, 5%, 10%, 15% and 20% of all decoy structures are contained in the top three clusters [11]. Their results show that a cutoff between 3% and 5% produces the best near-native selections.
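One way to search for the adaptive T35 cutoff is a bisection over the cutoff value, sketched below. The clustering scheme (repeatedly taking the decoy with the most neighbours and removing that family), the bisection itself, and the names `top_cluster_sizes` and `t35_cutoff` are assumptions for illustration; the text only states that decoys within the cutoff are grouped into one family and that the cutoff is chosen so that the top 3 clusters hold 5% of the decoys.

```python
import numpy as np

def top_cluster_sizes(proj, cutoff, n_atoms, n_top=3):
    """Sizes of the n_top largest clusters under an assumed greedy scheme.

    Distances are recomputed one row at a time, so only the N x P
    projected coordinates are held in memory, never an N x N table.
    """
    scaled = proj / np.sqrt(n_atoms)             # scaled PCA coordinates
    alive = np.ones(len(scaled), dtype=bool)
    sizes = []
    while alive.any() and len(sizes) < n_top:
        idx = np.flatnonzero(alive)
        pts = scaled[idx]
        best = None
        for row in pts:                          # candidate cluster centre
            members = np.linalg.norm(pts - row, axis=1) <= cutoff
            if best is None or members.sum() > best.sum():
                best = members
        sizes.append(int(best.sum()))
        alive[idx[best]] = False                 # remove the whole family
    return sizes

def t35_cutoff(proj, n_atoms, target=0.05, n_iter=20):
    """Bisect for a cutoff at which the top 3 clusters hold `target` of all decoys."""
    scaled = proj / np.sqrt(n_atoms)
    lo = 0.0
    hi = 2.0 * np.linalg.norm(scaled - scaled.mean(axis=0), axis=1).max() + 1e-12
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(top_cluster_sizes(proj, mid, n_atoms)) >= target * len(proj):
            hi = mid                             # enough decoys captured: tighten
        else:
            lo = mid                             # too few decoys captured: loosen
    return hi

rng = np.random.default_rng(1)
proj = rng.normal(size=(200, 5))                 # synthetic N x P projections
cutoff = t35_cutoff(proj, n_atoms=10)
sizes = top_cluster_sizes(proj, cutoff, n_atoms=10)
print(cutoff, sizes)
```

Each bisection step is one full round of clustering, which is in line with the several rounds of clustering the timing test reports for locating the T35 value.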

Eigenvectors to Preserve
Only the eigenvectors with the highest eigenvalues contribute significantly to determining structural differences in PCA. The eigenvalue distribution of protein 1GAB 10-51 is illustrated in Figure 2. The top 5 and top 10 eigenvectors of 1GAB 10-51 contain 70% and 85% of the eigenvalues, respectively. We can also see that the curves from 2000 and 9000 decoys are almost identical, implying that the eigenvalue distribution is independent of the number of decoys.
As listed in Table 1, the near-native selection result improves as the eigenvalue-percentage-cutoff increases, up to 95%. A 90% eigenvalue-percentage-cutoff, which preserves only 17 eigenvectors on average, is sufficient to generate near-native predictions similar to those of rRMSD. Table 2 compares the best structure selections for the 41 proteins from rRMSD and from PCAC at the 90% eigenvalue-percentage-cutoff. The average near-native selection of the top 5 clusters from PCAC is 6.0 Å, which is very close to the 5.9 Å obtained with the rRMSD method. At the 99% eigenvalue-percentage-cutoff (on average 51 eigenvectors are preserved), the average near-native selections from the two methods are identical. That PCAC clusters structures as well as rRMSD does is further shown in Figure 3. The figure shows the relationship between the scaled PCA-distance and the least-square-fitting rmsd value for protein 1GAB 10-51 at different eigenvalue-percentage-cutoffs. The correlation coefficient increases from 0.88 at the 47% eigenvalue-percentage-cutoff to 0.99 at the 99% eigenvalue-percentage-cutoff. As the scaled PCA-distances at the 99% eigenvalue-percentage-cutoff are almost identical to the rmsd values within the cluster cutoff region, the clustering results from the two methods are expected to be almost identical as well.

Choose the Reference Conformation
Results from principal component analysis can be affected by the selection of the reference structure [37]. As shown in Table 3, the near-native selection result obtained with the native structure as reference is artificially enhanced, so the native must not be selected as the reference conformation in PCAC. However, the average near-native selection result is not sensitive to the choice of a randomly selected structure, even if an unfolded initial structure is picked as the reference [5]. As listed in Table 3, we tested 3 randomly selected energy-minimized structures (on average 10 Å rmsd from the native) and 2 initial unfolded structures (16 Å from the native) as reference states. The resulting near-native selections are similar. Therefore, in terms of near-native structure selection, any reference structure that is not close (within the cluster cutoff) to the native (and/or to any of the top 5 clusters) can produce similar and unbiased results.
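The single least-square fitting of each decoy onto the chosen reference can be sketched as follows. The text does not say which fitting algorithm is used; the SVD-based Kabsch superposition below is a common choice and is an assumption here, as are the function name `superpose` and the synthetic coordinates.

```python
import numpy as np

def superpose(mobile, reference):
    """Least-squares fit of `mobile` (p, 3) onto `reference` (p, 3).

    Returns the fitted coordinates (in the centred frame of the reference)
    and the rmsd after fitting. Uses the Kabsch/SVD solution (assumed).
    """
    mob_c = mobile - mobile.mean(axis=0)         # remove translation
    ref_c = reference - reference.mean(axis=0)
    h = mob_c.T @ ref_c                          # 3 x 3 correlation matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(u @ vt))           # guard against improper rotation
    rot = u @ np.diag([1.0, 1.0, d]) @ vt        # optimal proper rotation
    fitted = mob_c @ rot
    rmsd = np.sqrt(np.mean(np.sum((fitted - ref_c) ** 2, axis=1)))
    return fitted, rmsd

rng = np.random.default_rng(3)
reference = rng.normal(size=(12, 3))             # synthetic reference, p = 12
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
decoy = reference @ rot_z.T + np.array([1.0, -2.0, 3.0])   # rotated, shifted copy
fitted, rmsd = superpose(decoy, reference)
print(rmsd)
```

Flattening each fitted (p, 3) array into a row of length 3p gives the input from which the covariance matrix is built, so exactly N such fittings are needed for N decoys.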

An Ultra-Fast Method
PCAC based on Cartesian coordinates produces essentially the same result as the rRMSD method. Moreover, PCAC can be hundreds of times faster when thousands of decoys or more are involved. Pairwise rmsd clustering requires $N(N-1)/2$ pairwise least-square-fitting rmsd calculations. It also needs $N^2$ memory units to store the resulting rmsd values, which can easily exceed the computer memory when tens of thousands of decoys are included. For example, up to 1,000,000 loop decoys were generated by Jacobson et al. [12]. In contrast, PCAC needs only N least-square-fitting rmsd calculations and N × P memory units to store the coordinates projected on the P preserved eigenvectors.
The overhead of the PCAC method is the PCA calculation, which includes constructing the covariance matrix (this needs one round of N least-square fittings to the reference conformation), diagonalizing the matrix, and projecting the structures on the preserved eigenvectors.
In the test of clustering 9000 1GAB decoys, the pRMSD method needs $N^2$ memory units to store the pairwise rmsd values (each rmsd value occupies 4 bytes of storage as a real number). If the required memory cannot be provided, each pairwise rmsd value must be recalculated whenever it is needed. The computing time for the traditional pRMSD method to cluster the 9000-decoy set is 53,500 seconds (including a total of 10 rounds of clustering to search for the T35 cluster cutoff value). For PCAC, less than 1 Mb of memory is required to cluster the 9000 decoys at the 90% eigenvalue-percentage-cutoff. It takes a total of 274 seconds, which includes 102 seconds for the PCA calculation and 172 seconds for clustering. The PCAC method is therefore almost 200 times faster than the pRMSD method when the computer memory cannot hold $N^2$ real numbers.
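The memory and timing figures quoted for the 9000-decoy test can be reproduced with simple arithmetic. In the sketch below, the 4 bytes per value, the 9000 decoys, and the 53,500 s and 274 s timings come from the text; the value P = 17 is the reported average over the 41 proteins at the 90% cutoff and is only assumed to be representative of 1GAB, and the function names are hypothetical.

```python
def pairwise_storage_bytes(n_decoys, bytes_per_value=4):
    """Memory needed to hold an N x N table of rmsd values."""
    return n_decoys * n_decoys * bytes_per_value

def pcac_storage_bytes(n_decoys, n_eigenvectors, bytes_per_value=4):
    """Memory needed to hold the N x P projected coordinates."""
    return n_decoys * n_eigenvectors * bytes_per_value

n = 9000
pair_bytes = pairwise_storage_bytes(n)       # 324,000,000 bytes
pcac_bytes = pcac_storage_bytes(n, 17)       # 612,000 bytes, assuming P = 17
n_fits_pairwise = n * (n - 1) // 2           # 40,495,500 least-square fittings
n_fits_pcac = n                              # 9,000 least-square fittings
speedup = 53500 / 274                        # ratio of the reported timings
print(pair_bytes, pcac_bytes, n_fits_pairwise, round(speedup, 1))
```

Under these assumptions the N × P array stays well below one megabyte, while the full pairwise table is hundreds of megabytes, and the ratio of the reported timings is about 195.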
The more decoys are analyzed, the greater the speed advantage of the PCAC method becomes, since the fraction of time spent on the overhead PCA calculation drops. Since the eigenvalue distribution is almost independent of the number of decoys (shown in Figure 2), another way to speed up the PCA calculation is to analyze only a small number of decoys (e.g., 2000) to calculate the PCA matrix and then apply the obtained eigenvectors to the full set of decoys. Consequently, the overhead of the PCA calculation can be further reduced.
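The idea of computing the eigenvectors from a small subset and applying them to all decoys can be sketched as below. The data are synthetic (a low-dimensional signal plus noise, standing in for pre-aligned decoys), and the helper names `fit_eigenvectors` and `project` as well as the choice of four eigenvectors are assumptions for illustration.

```python
import numpy as np

def fit_eigenvectors(coords, n_keep):
    """Mean and leading eigenvectors of the covariance of `coords` (N, 3p)."""
    mean = coords.mean(axis=0)
    centered = coords - mean
    cov = centered.T @ centered / len(coords)
    _, evecs = np.linalg.eigh(cov)               # ascending eigenvalue order
    return mean, evecs[:, ::-1][:, :n_keep]      # leading eigenvectors first

def project(coords, mean, evecs):
    """Project any set of aligned decoys onto previously fitted eigenvectors."""
    return (coords - mean) @ evecs

rng = np.random.default_rng(2)
latent = rng.normal(size=(9000, 4)) * np.array([5.0, 3.0, 2.0, 1.0])
all_decoys = latent @ rng.normal(size=(4, 30)) + 0.01 * rng.normal(size=(9000, 30))

subset = all_decoys[rng.choice(9000, size=2000, replace=False)]
mean, evecs = fit_eigenvectors(subset, n_keep=4)  # PCA on 2000 decoys only
proj = project(all_decoys, mean, evecs)           # applied to all 9000 decoys

# Fraction of the total variance of the full set captured by the subset's axes.
captured = proj.var(axis=0).sum() / all_decoys.var(axis=0).sum()
print(proj.shape, captured)
```

In this sketch the covariance matrix is accumulated from 2000 rows rather than 9000, while the projection of the full set remains a single matrix product.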

Conclusion
This work demonstrates an ultra-fast and low-memory-requiring method, PCAC, to cluster large-scale predicted protein structures. It can be over 100 times faster than the pairwise-rmsd clustering method. The computer memory requirement also drops from O($N^2$) to O(N), where N is the size of the dataset. The PCAC algorithm may also be applied to cluster other large-scale bioinformatics datasets when the dataset can be effectively described in PCA space.

Figure 1. The fraction of decoys in the top 3 largest clusters (two lines end at right-top corner) and number of cluster

Figure 2. The eigenvalue distribution of protein 1GAB 10-51. The solid line and dashed line represent the distributio

Figure 3. Backbone rmsd from the structure in largest cluster of 2000 1GAB 10-51 decoys vs the scaled backb e

Table 3. Average structure selections of 41 proteins using different reference conformations.
top five structures ranked by cluster size from PCA clustering (in Å); d The number of principal axes preserved; e This random-selected energy-minimized structure is the one used in other tables.