A Method to Predict Amino Acids at Proximity of Beta-sheet Axes from Protein Sequences

A general and elementary protein folding step was described in a previous article. Energy conservation during this folding step yielded an equation with remarkable solutions over the field of rational numbers. Sets of sequences optimized for folding were derived. In this work, a geometrical analysis of protein beta-sheet backbone structures allows the definition of positions of topological interest. They correspond to amino acids' alpha carbons located on a unique axis crossing all beta-sheet's strands or at proximity of this axis defined here. These positions of topological interest are shown to be highly correlated with the absence of sequences optimized for folding. Applications in protein structure prediction for the quality assessment of structural models are envisioned.


Introduction
Protein structure prediction from sequences remains a major challenge even though the problem is several decades old [1,2]. Protein structure prediction was recently achieved using ab initio methods for small proteins, using templates with sequence or fold similarity or using sets of correlated mutations [3][4][5][6][7][8][9]. One-dimensional protein sequences can generally be predicted from gene sequences on genomic scales [10,11]. Secondary structures can also be efficiently predicted computationally from protein sequences [12][13][14][15][16][17]. However, three-dimensional protein structures have generally been solved experimentally and computationally by time-consuming and costly approaches such as X-ray diffraction on protein crystals or nuclear magnetic resonance on concentrated protein solutions. Independently, studies on protein folding allowed major conceptual advances in the understanding of general protein properties linked to the conversion of one-dimensional sequences into three-dimensional structures [18][19][20][21]. Molten globules and pre-molten globules have been characterized [22,23]. A rugged funnel-like energy landscape was described for protein folding [24]. Small model systems allowed protein folding simulations to be carried out [25,26]. Protein engineering and folding kinetics were combined to define folding pathways at the level of single amino acid residues [27]. Consideration of an elementary folding step allowed edge strands in beta-sheets to be predicted from protein sequences [28]. A further link is established here between protein sequences and three-dimensional structure information: this work focuses on amino acids at proximity of the axis crossing the beta-sheet's strands.

Methods
The program pdb2 [28] was written in Perl v5.8.9. It can be used on the Mobyle platform at Institut Pasteur [29] and makes use of files from the Protein Data Bank (PDB) [30]. Protein lengths were in the range of 50 to 250 amino acids. Sequences optimized for folding (SOF) as shown in Figure 1 were computed as described earlier [28]. Small proteins and designed proteins were not included in this study. Proteins were chosen because of their distinct folds as described in the structural classification of proteins (SCOP) [32].
The gap is characterized by an integer value, which is the integer part of the middle of the gap's ends corresponding to the set of amino acid positions for which no SOF were found (Figures 1 and 2). Independently, positions of topological interest (TIPs) were determined from the protein domain structures' backbone either by visual analysis of the structure using the Pymol software or by automatic annotation using pdb22 (see below). For each protein consisting of L amino acids, the number of TIPs T and the number of gaps G were noted in the Annex Table A1. A coincidence was defined as an amino acid position where a TIP coincides with a gap within a small error range e depending on the protein length L. For proteins of length L between 51 and 100, the gap position was defined plus or minus two amino acids (e = 2), thereby corresponding to 5 amino acid positions. Similarly, for proteins of length 101-150, 151-200 and 201-250, the error e was defined as 3, 4 and 5 respectively (Figure 2). For example, the structure with PDB reference 1c3g, with 170 amino acids numbered from 180/1 to 349/170 in the structure and sequence files respectively, allows the definition of 10 TIPs corresponding to amino acid alpha carbons on the three following axes at positions numbered (266/87; 291/112; 316/137), (188/9; 228/49; 253/74), (206/27; 212/33; 247/68; 238/59). The model applied to the corresponding sequence allows the definition of two gaps, one between amino acids 28 and 31 (noted 29 and coinciding with TIP 27) and one between amino acids 89 and 90 (noted 89 and coinciding with TIP 87). Given that the difference between TIP and gap numbers is 2 in both cases and as an error of 4 is allowed for proteins of 170 amino acids, the number C of coincidences is two for the two gaps (Annex Table A1). The www interface for the identification of gaps is available for any protein sequence at the following address: http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::pdb2; it accepts PDB file names as entries (4 characters).
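The coincidence rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pdb2 code: the function names (`error_range`, `gap_position`, `count_coincidences`) are hypothetical, and it assumes that a gap counts as one coincidence as soon as at least one TIP lies within e residues of the gap position. The example values are those given in the text for 1c3g (sequence-file numbering).

```python
def error_range(length):
    """Allowed error e (in residues) as a function of protein length L."""
    if 51 <= length <= 100:
        return 2
    if 101 <= length <= 150:
        return 3
    if 151 <= length <= 200:
        return 4
    if 201 <= length <= 250:
        return 5
    # Lengths outside 51-250 are not assigned an error in the text.
    raise ValueError("no error range defined for this length")


def gap_position(first, last):
    """Integer part of the middle of the two gap ends."""
    return (first + last) // 2


def count_coincidences(length, gaps, tips):
    """Number of gaps with at least one TIP within the error range (assumption)."""
    e = error_range(length)
    return sum(1 for g in gaps if any(abs(g - t) <= e for t in tips))


# Worked example from the text: 1c3g, 170 amino acids, 10 TIPs, 2 gaps.
tips_1c3g = [87, 112, 137, 9, 49, 74, 27, 33, 68, 59]
gaps_1c3g = [gap_position(28, 31), gap_position(89, 90)]
print(gaps_1c3g, count_coincidences(170, gaps_1c3g, tips_1c3g))  # prints [29, 89] 2
```

With e = 4 for a protein of 170 amino acids, both gaps (29 and 89) fall within range of a TIP (27 and 87), which reproduces C = 2.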
The program pdb22 is available at the address: http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::pdb22; it is also a program written in Perl and uses the same entry files as pdb2. The pdb22 output file (.xls) provides for each protein within the list its PDB name, the amino acid number and name in three-letter code, the start and the end of beta-strands indicated as amino acid numbers, the name of the sheet noted on the lines corresponding to amino acids found at the intersection of a beta-strand with the sheet axis and the distance D, which is calculated in Angströms and averaged per beta-strand for each sheet consisting of n strands using the following equation:

$$d = \frac{1}{n}\sum_{i=1}^{n} \mathrm{mindist}(i) \qquad (1)$$

where mindist(i) is the minimal distance between an alpha carbon of strand i and the sheet axis. The distance d is estimated for each pair of amino acids defining an axis characterized by the atomic coordinates of one amino acid's alpha carbon in the first strand and another one in the sequence's last strand. The sheet axis is defined as the axis for which the distance d is minimal. For a sheet, the minimum of all distances d is noted D.
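The axis search described above can be illustrated with a short Python sketch. It is not the pdb22 implementation (which is written in Perl); the function names and the toy coordinates are hypothetical. The sketch assumes that the axis is an infinite straight line through the two chosen alpha carbons and that d is the mean of mindist(i) over all n strands, including the first and last strands, for which mindist is zero by construction.

```python
from itertools import product
from math import sqrt


def point_line_distance(p, a, b):
    """Distance from point p to the infinite line through a and b (3D tuples)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ap[2] - ab[2] * ap[1],
        ab[2] * ap[0] - ab[0] * ap[2],
        ab[0] * ap[1] - ab[1] * ap[0],
    )
    return sqrt(sum(c * c for c in cross)) / sqrt(sum(c * c for c in ab))


def sheet_axis(strands):
    """Search all axes through one alpha carbon of the first strand and one
    of the last strand; return (D, a, b, tips) for the axis minimizing d.

    strands: list of at least two strands in sequence order, each a list of
    (x, y, z) alpha-carbon coordinates. tips holds, per strand, the index of
    the alpha carbon closest to the selected axis.
    """
    best = None
    for a, b in product(strands[0], strands[-1]):
        dists = [[point_line_distance(p, a, b) for p in s] for s in strands]
        d = sum(min(ds) for ds in dists) / len(strands)
        if best is None or d < best[0]:
            tips = [ds.index(min(ds)) for ds in dists]
            best = (d, a, b, tips)
    return best


# Toy sheet with three two-residue strands (made-up coordinates).
toy_sheet = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(1.0, 0.0, 0.3), (1.0, 1.0, 0.0)],
    [(2.0, 0.0, 0.0), (2.0, 1.0, 0.0)],
]
D, axis_start, axis_end, tips = sheet_axis(toy_sheet)
```

In this toy case the second residue of each strand is exactly collinear, so the selected axis passes through those three alpha carbons and D is zero. The exhaustive search over pairs is quadratic in strand length, which stays small for beta-strands.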
The probability q for having C coincidences occurring at random, that is the probability for G gaps to coincide with T TIPs within the error range e, was calculated according to Equation (2), which derives from the inclusion-exclusion principle (cf. Annex for the equation's proof). The corresponding probabilities q are reported for each protein structure defined by its PDB reference in the Annex Table A1. It appears that for 14 of the proteins, the probability q is higher than 0.5.
In order to compute the p-value of the test, the probability of failing at most 14 times within 46 experiments (one experiment for each protein structure associated with a PDB reference) when the probability of failure is taken as 0.5 was computed using the binomial law as in Equation (3):

$$p = \sum_{k=0}^{14} \binom{46}{k}\, 0.5^{k}\, 0.5^{46-k} \qquad (3)$$

The severity of this statistical test is highlighted for example by the data obtained for the protein of 193 amino acids referenced 3pn3 in the PDB, for which the correct identification of one coincidence for the gap was not considered as successful because the large number of TIPs defined is associated with a probability q > 0.5 (Annex Table A1). The numerical value of p ≈ 0.0057 indicates statistical significance, as it is far below the commonly accepted standard threshold of 0.05.

Independently, a program (pdb7) was written to make use of lists of PDB files as entries and to provide within the output sequence file the gaps and TIPs calculated using pdb2 and pdb22 respectively. For each beta-sheet, the axis was defined as the line minimizing the distance for all strands from one alpha carbon per beta-strand to the line defined by two alpha carbons taken in the first and last strands in the protein sequence as described above. Analysis of the pdb7 output files yielded the results for the 248 correlations evaluated between gaps and TIPs (Table 1).
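The binomial test of Equation (3) can be checked numerically with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_tail` is hypothetical, and the inputs (46 experiments, at most 14 failures, failure probability 0.5) are those stated in the text.

```python
from math import comb


def binomial_tail(n, k_max, prob):
    """Probability of at most k_max failures in n independent trials,
    each failing with probability prob (lower tail of the binomial law)."""
    return sum(
        comb(n, k) * prob**k * (1 - prob) ** (n - k) for k in range(k_max + 1)
    )


p_value = binomial_tail(46, 14, 0.5)
print(f"{p_value:.4f}")  # prints 0.0057
```

The value agrees with the p-value reported in the text and lies well below the 0.05 threshold.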

Results
An elementary step of protein folding was described as a folding unit or chemical group folding onto a folding entity to yield a larger folding entity [28]. Criteria that are sufficient to define protein subsequences with optimal folding properties were derived [28].
A gap was defined as one or several amino acid positions for which no sequence with optimal folding properties (SOF) is found. A quarter of the proteins analyzed yielded graphs of SOF which did not contain any gap. As an example, a single gap was noted between amino acids 114 and 115 for the central domain of C. symbiosum pyruvate phosphate dikinase (Figure 1). The gap's position was defined as the integer part of the middle of the gap (Figures 1 and 2).
Topologically interesting positions (TIPs) can be determined from protein domain structures' atomic coordinates. Beta-sheets are typically curved planes in three dimensions because of the twist found within beta-strands [33]. Still, there generally exists at least one axis crossing most, if not all, beta-strands of the sheet (Figure 3): we define here a sheet axis as a straight line crossing the sheet's beta-strands, and which is generally perpendicular to the beta-strands (cf. Methods). The axis was chosen to cross the first and last strands at amino acids' alpha carbons. For the other strands, one amino acid per strand is further chosen for its proximity to the axis. The axis minimizing the distance to their alpha carbon is represented as a circle including the set of amino acids which are on the axis or closest to the axis in the pyruvate phosphate dikinase domain sheet structure (Figure 3). The intersection of this axis with each beta-strand yields one alpha carbon at an amino acid position defined as a topologically interesting position or TIP.
An error e for the gaps' positions prediction was allowed and chosen to increase slightly as a function of increasing proteins' length as described in the Methods section (Figure 2). The statistical evidence that TIPs and gaps are strongly correlated derives from a binomial test on the analysis of domain structures (Annex Table A1).
The p-value (<0.0057) calculated (cf. Methods) supports this correlation. Given that gaps can be straightforwardly calculated for any protein sequence, the correlation between gaps and topologically interesting amino acid positions (TIPs) provides information on the three-dimensional protein structure.
To obtain an independent proof of this conclusion, another program (pdb7) was then written for automatic annotation of gaps and TIPs on protein sequences: the hypothesis that the observed distribution of distances between gaps and TIPs (Table 1) follows the calculated distribution assuming a random assignment is excluded given the statistical p-value (0.0032).

Discussion
An elementary protein folding step was described [28]. Application of classical mechanics and of the total energy conservation law to an elementary folding step yielded a quadratic equation with remarkable solutions over the field of rational numbers [28].
While numerical applications of equations from classical mechanics are commonly done over the field of real numbers, the following pieces of evidence indicate that discreteness provides a useful basis which is adapted in particular for the understanding of why the genetic code is the way it is. The genetic code is remarkable because of its quasi-universality within living organisms on Earth and because it is about four billion years old [34]. The role of selection pressures in the definition of amino acid assignments to codons was extensively discussed in the context of the coevolution of the genetic code with essential proteins [35,36]. A side-chain volume conservation was further found to be statistically significant for amino acids involved in precursor-product relationships within biosynthetic pathways and put in the context of side-chains' packing in protein beta-sheets [37]. From the experimental side, the genetic code was engineered in multiple studies for applications in protein engineering [38][39][40]. From the theoretical side, discrete symmetries associated with degeneracy in the genetic code were identified by Rumer [41,42]. The discrete nature of the most frequent mutations provided a rationale accounting for those symmetries [43]. Independently, kinetic energy conservation in polypeptide chains during molecular evolution was found to be consistent with the grouping of codons in the genetic code; the formalism consisting of energy conservation laws with solutions over the field of rational numbers was thereby validated for amino acids by the genetic code's codon arrangement [44]. The field of rational numbers was also taken into consideration for another extension of classical mechanics [45].
In this work, the absence of sequences optimized for folding was linked to topological information on protein beta-sheets.It should be of interest to extend this analysis to other secondary structure elements such as protein helices while considering the impact of protein families and classes [71,72].

Conclusion
There is a need for practical methods describing complex chemical processes [73]. Predicting the sequence-specific folding of a polypeptide chain into a three-dimensional structure remains a challenge. An axis characterizing the topology of beta-sheets was defined in this work. The fast computational method described here, combining the identification of amino acids at proximity of beta-sheet axes (using pdb22) and the identification of gaps (using pdb7), derives three-dimensional structure information on beta-sheets from protein sequences at scales of topological interest for structural domains of less than 250 amino acids. Both the formalism based on energy conservation during an elementary protein folding step [28] and the definition of beta-sheet axes should therefore improve protein structure prediction strategies by implementation as quality assessment methods for structural models [74][75][76]: they provide new criteria for the selection of the most accurate protein structural models out of thousands of them. A quantitative evaluation of this method's efficiency may be achieved within the next challenge for the critical assessment of techniques for protein structure prediction such as CASP11 [77,78].
Annex
The last fraction in the sum is the probability for the T TIPs to avoid j chosen gaps. With the additional binomial coefficient, we may moreover choose the j gaps we want to avoid.
But some events appear several times in these terms. Fix an integer k between G - C + 1 and G. A distribution of TIPs avoiding exactly k gaps gets counted $\binom{k}{j}$ times in the term of index j, for every $G - C + 1 \le j \le k$: those are the possibilities to choose j among the k gaps the distribution avoids. To get an exact formula for the probability of avoiding at least G - C + 1 gaps, there is a need to compensate this via an inclusion-exclusion method. To prove that in the sum above one counts exactly one time a configuration of TIPs avoiding exactly k gaps, it remains to prove the formula: For the sake of readability, let us note $r = G - C$. Then one may transform it: Hence the sum we are interested in is rewritten: And it remains to prove: We may use an analytic proof. Define the polynomial in two variables: Then one computes the integral via integration by parts.

Figure 1. Set of sequences with optimal folding properties. SOFs (red) were calculated as in [28] for the central protein domain of Clostridium symbiosum pyruvate phosphate dikinase (PDB reference 2fm4) [31]. A gap defined by the absence of SOF is found between amino acids 114 and 115 and is characterized by the integer part of the gap's middle (114).

Figure 2.

Figure 3. Positions of topological interest and a beta-sheet axis. Topologically interesting positions (TIPs) are shown for a protein domain (PDB reference 2fm4) whose backbone is represented by links between adjacent alpha carbons from amino acids [31]. TIPs are found on a beta-sheet axis within the circle shown in pink; they are numbered 8, 21, 114, 123. Amino acid 114 in the sequence file is numbered 497 in the structure file.

Table A1. List of protein domains and of the probabilities q calculated according to Equation (2).
a In Annex Table A1, TIPs can also include secondary positions of topological interest defined by the intersection of the polypeptide chain with the axis involving the N- and C-termini for protein domain structures with few beta-strands to ensure that T ≥ G and allow thereby for predictions to be tested.