Prediction of Crystallization Propensity of Proteins from Bacillus haloduran Using Various Amino Acid and Protein Features

Correct prediction of the crystallization propensity of proteins saves cost and time in the determination of 3-dimensional structures, because one can focus on crystallizing the proteins predicted to have a high propensity instead of choosing proteins at random. However, this goal has yet to be accomplished despite the huge efforts made over the years, because it is still extremely hard to find an intrinsic feature of a protein that relates directly to its crystallization propensity. Despite this difficulty, known features of amino acids and proteins continue to be tested against the crystallization propensity of proteins from various sources. In this study, the features developed by us were compared with the features from a well-known resource for the prediction of the crystallization propensity of proteins from Bacillus haloduran. In particular, the crystallization of a protein was considered as a yes-no event, so 185 crystallized proteins and 270 uncrystallized proteins from B. haloduran were classified as yes-no events. Each of 540 amino-acid features, including the features developed by us, was coupled with these yes-no events using logistic regression and a neural network. The results once again demonstrated that the predictions using the features developed by us are relatively better than the predictions using any of the 540 amino-acid features.


INTRODUCTION
The prediction of the crystallization propensity of proteins from various bacteria is an important aspect of our studies [1][2][3][4][5][6][7][8], and this research direction remains active after years of investigation [9][10][11][12][13][14][15]. Statistically, the predictions are better than random chance throughout these studies, some even with a very high success rate. However, this is still a phenomenon-based approach because it cannot yet identify the underlying factors that determine the propensity of crystallization.
Among predictors, an important group is the physicochemical features of amino acids. However, no solid and general conclusion has been reached on which physicochemical feature better predicts the crystallization propensity [16]. Yet protein crystallization is increasingly becoming routine work in many laboratories, which require simple and reliable methods to predict the crystallization propensity of proteins of interest.
Clearly, much effort and many studies are still needed to address this problem because the number of proteins is still increasing rapidly, although crystallization is no longer the only technology for determining the 3-dimensional structure of proteins.
Accordingly, it is necessary to test each physicochemical feature against the propensity of crystallization for as many different proteins as possible, although all known physicochemical features have been tested on different occasions under different circumstances.
In this study, three features, which combine features of amino acids and proteins, were tested against the crystallization propensity of proteins from B. haloduran, and compared with the results obtained from each of the 540-plus features possessed by amino acids. The results of this study once again demonstrate the wide-ranging applicability of the three features developed by us because they capture the intrinsic random characteristic of protein sequences.

Data
Four hundred fifty-five proteins of B. haloduran were obtained from Target DB [17,18] under the criterion of purified proteins, including 185 under the criterion of crystallized protein, as used in previous studies [1-8, 19, 20].

Features Possessed by both Amino Acid and Protein
The amino acid distribution probability is the first feature possessed by both amino acid and protein. This feature comes from the occupancy of subpopulations and partitions, which describes the distribution of elementary particles in energy states according to three assumptions of whether or not to distinguish each particle and energy state, i.e., the Maxwell-Boltzmann, Fermi-Dirac, and Bose-Einstein assumptions in statistical mechanics [21]. This feature has been used on many occasions, and its probability can be computed with the following equation:
$$\frac{r!}{q_0! \times q_1! \times \cdots \times q_n!} \times \frac{r!}{r_1! \times r_2! \times \cdots \times r_n!} \times n^{-r}$$
where ! is the factorial, r is the number of a type of amino acid, q is the number of partitions with the same number of amino acids and n is the number of partitions in the protein for a type of amino acid. Each type of amino acid has only one distribution probability in a protein (Columns 8 and 9, Table 1).
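As an illustration only, the equation above can be evaluated for a single occupancy pattern. This sketch applies the printed formula literally and assumes that r_i is the number of residues of the given amino-acid type in partition i and that q_k is the number of partitions holding exactly k such residues. How a sequence is cut into partitions is not specified here, so the hypothetical function `distribution_probability` takes the occupancy counts directly.

```python
from collections import Counter
from math import factorial, prod


def distribution_probability(occupancy):
    """Evaluate the printed formula for one occupancy pattern.

    occupancy[i] is taken to be r_i, the number of residues of one
    amino-acid type found in partition i (an assumed reading of r_1..r_n).
    """
    n = len(occupancy)   # number of partitions
    r = sum(occupancy)   # number of residues of this amino-acid type
    # q_k = number of partitions that hold exactly k residues
    q = Counter(occupancy)
    q_term = prod(factorial(count) for count in q.values())
    r_term = prod(factorial(r_i) for r_i in occupancy)
    # r!/(q_0!...q_n!) * r!/(r_1!...r_n!) * n^(-r), kept in integers until the end
    return factorial(r) ** 2 / (q_term * r_term * n ** r)


# Three residues in three partitions: the three possible patterns
for pattern in ([1, 1, 1], [2, 1, 0], [3, 0, 0]):
    print(pattern, distribution_probability(pattern))
```

For three residues in three partitions, the three patterns evaluate to 6/27, 18/27 and 3/27, which sum to 1.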
The amino acid future composition is the second feature possessed by both amino acid and protein. It comes from the observation that there are 64 RNA codons but only 20 types of amino acids, so each type of amino acid corresponds to a different number of RNA codons, e.g., methionine has one RNA codon (AUG), phenylalanine has two RNA codons (UUC and UUU), but leucine has six RNA codons (CUA, CUC, CUG, CUU, UUA and UUG). This naturally leads to different translation probabilities when a single RNA codon mutates, and consequently the probability that an amino acid mutates to another amino acid is different (Columns 10 and 11 in Table 1). This feature has been used on many occasions.
The amino acid pair predictability is the third feature possessed by both amino acid and protein, and it is based on permutation. This feature has also been used on many occasions.
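The codon argument can be made concrete with a small sketch. It builds the standard genetic code, counts the codons of an amino acid, and tallies what each single-base substitution of those codons translates to. Giving every substitution equal weight is an assumption made for illustration, and the helper names are hypothetical; this is not the authors' exact calculation of the future composition.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_count(amino_acid):
    """Number of RNA codons that encode the one-letter amino acid."""
    return sum(1 for aa in CODON_TABLE.values() if aa == amino_acid)


def single_mutation_spectrum(amino_acid):
    """Share of single-base substitutions leading to each translated outcome.

    Assumption: every substitution at every codon position is equally likely.
    """
    outcomes = Counter()
    for codon, aa in CODON_TABLE.items():
        if aa != amino_acid:
            continue
        for pos in range(3):
            for base in BASES:
                if base != codon[pos]:
                    mutant = codon[:pos] + base + codon[pos + 1:]
                    outcomes[CODON_TABLE[mutant]] += 1
    total = sum(outcomes.values())
    return {aa: Fraction(count, total) for aa, count in outcomes.items()}


print(codon_count("M"), codon_count("F"), codon_count("L"))
print(single_mutation_spectrum("M"))
```

Under this equal-weight assumption, the nine single-base mutants of AUG translate to isoleucine three times, leucine twice, and valine, threonine, lysine and arginine once each.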

Amino Acid Features
By contrast, a physicochemical feature is related to only a single aspect of individual amino acids; therefore, more than 540 amino acid features are documented in the AA Index database [22], for example, spatial features [23], electronic features [24], hydrophobic features [25], and predictors for secondary structures [26].

Models
Logistic regression was a major tool used to model the relationship between the crystallization propensity of proteins and amino-acid/protein features for proteins from B. haloduran, because whether a protein can be crystallized can be defined as a yes-no event that serves as the output of logistic regression, whereas the various amino-acid/protein features serve as its input. Similarly, a 10-1 feedforward backpropagation neural network was also used to model the relationship between the crystallization propensity of proteins and amino-acid/protein features for proteins from B. haloduran. MATLAB was used to perform both logistic regression and neural network modeling [27,28].
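The coupling of one feature with the yes-no event can be sketched as a one-variable logistic regression. The study used MATLAB; the pure-Python version below is only an illustrative stand-in fitted by gradient descent, and the feature values, the centering step, the learning rate and the 0.5 cutoff are all assumptions rather than the authors' settings.

```python
from math import exp


def fit_logistic(x, y, lr=0.5, steps=2000):
    """Fit P(y = 1 | x) = sigmoid(w * (x - mean) + b) by batch gradient descent."""
    mean = sum(x) / len(x)
    centered = [xi - mean for xi in x]   # centering is a convenience, not from the paper
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = [1.0 / (1.0 + exp(-(w * u + b))) for u in centered]
        # Gradients of the mean negative log-likelihood
        grad_w = sum((pi - yi) * u for pi, yi, u in zip(p, y, centered)) / len(x)
        grad_b = sum(pi - yi for pi, yi in zip(p, y)) / len(x)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, mean


def predict(model, xi, cutoff=0.5):
    """Return 1 (crystallized) or 0 (not crystallized) for one feature value."""
    w, b, mean = model
    p = 1.0 / (1.0 + exp(-(w * (xi - mean) + b)))
    return 1 if p >= cutoff else 0


# Hypothetical feature values and yes-no outcomes, not data from this study
feature = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
crystallized = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_logistic(feature, crystallized)
print([predict(model, xi) for xi in feature])
```

Changing `cutoff` moves the trade-off between sensitivity and specificity, which is what an ROC analysis traces out.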

Statistics
The results were grouped into true positive (TP), true negative (TN), false positive (FP) and false negative (FN), so the accuracy, sensitivity and specificity can be calculated as (TP + TN)/(TP + FP + TN + FN) × 100, TP/(TP + FN) × 100, and TN/(TN + FP) × 100, respectively. McNemar's test was used to compare the classified results. The sensitivity and specificity were compared using receiver operating characteristic (ROC) analysis [29][30][31]. The Mann-Whitney U-test was used to compare predicted accuracies at different cutoff values.
Table 1 shows the differences between amino acid features and combined features. As seen, the amino acid feature BEGF750101, which describes the helix-coil equilibrium, has an invariable value for each type of amino acid (Columns 4 and 5) regardless of the amino acid's location, composition (Columns 2 and 3), and neighboring amino acids. A simple remedy is to multiply this amino acid feature by its corresponding composition (Columns 6 and 7, Table 1). In contrast, the two combined features have different values for different amino acids in those two proteins (last four columns, Table 1). These differences among the features can be used to correlate the features with the crystallization propensity of proteins from B. haloduran, as well as to compare their predictability.
Figure 1 shows the comparisons of accuracy, sensitivity and specificity obtained using logistic regression to correlate the propensity of protein crystallization with each of the features. In this figure, every bar indicates how many features resulted in a similar accuracy, sensitivity or specificity. For example, the first bar from the left in the upper panel indicates that three amino acid features (CHAM830107, MITS020101 and FAUJ880112) had similar accuracies (0.588 ± 0.001).
Interestingly, similar features (CHAM830108, FAUJ880111 and MITS020101) also had the worst performance in the prediction of the crystallization propensity of proteins from Mycobacterium tuberculosis [8] and from Lactobacillus [7]. Similarly, the second bar indicates that two amino acid features (FAUJ880111 and KLEP840101) had the same accuracy (0.593), so the features FAUJ880111 and FAUJ880112 should be eliminated from any prediction of this kind in the future. Figure 1 clearly shows that the two combined features had a relatively good relationship with the crystallization propensity of proteins. In particular, the prediction using the amino acid distribution probability was the best with respect to accuracy and sensitivity.
Figure 2 shows the comparisons of accuracy, sensitivity and specificity obtained using the neural network to correlate the crystallization propensity of proteins with each of the features. The presentation in this figure is explained in the same way as in Figure 1. As shown in previous studies [1][2][3][4][5][6][7][8] and in this study, the neural network can perceive differences between features more accurately. Compared with the amino acid features, Figure 1 and Figure 2 suggest that the two combined features not only are actively involved in the crystallization process, but also work better for the prediction of the propensity of protein crystallization. Again, many amino acid features render similar results, which is consistent with the argument of abundance in amino acid features [32]. Indeed, the prediction using the amino acid distribution probability was the best with respect to accuracy and specificity in Figure 2.
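The accuracy, sensitivity and specificity formulas given at the start of this section can be written out as a short sketch. The function names and the toy labels are hypothetical and serve only to show the arithmetic.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for 1 = crystallized, 0 = not crystallized."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def performance(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity, each as a percentage."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    sensitivity = tp / (tp + fn) * 100
    specificity = tn / (tn + fp) * 100
    return accuracy, sensitivity, specificity


# Toy labels for illustration only
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 1]
print(performance(*confusion_counts(actual, predicted)))
```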

RESULTS AND DISCUSSION
The database used in the computation for both Figure 1 and Figure 2 was not regrouped, that is, the model parameters obtained from the 428 B. haloduran proteins were employed for the predictions. This procedure is usually regarded as the initial stage of modeling, after which the database should be regrouped into two groups: one produces the model parameters whereas the other serves for validation [33]. Figure 3 illustrates the accuracy, sensitivity and specificity obtained from delete-1 jackknife validation, which further demonstrates that the predictions using combined features were not worse than those using amino acid features. In fact, Figure 3 shows that the predictions using the amino acid distribution probability and future composition were the best in terms of accuracy and sensitivity. Table 2 lists the predictive performance of each feature in terms of accuracy, sensitivity and specificity. As shown, the delete-1 validation with the neural network identifies different features as sensitive to predictions. The reason for this difference between delete-1 validation and other methods of validation is still unclear, suggesting that more studies are needed.
Figure 4 displays the results of ROC analysis with respect to logistic regression, fitting and delete-1 jackknife validation using a 20-1 feedforward backpropagation neural network. As expected, all the features generate classifications distributed above the diagonal, so the predictions are not random events; McNemar's test showed that the classified results were significantly different from those of random guessing (P < 0.01). Still, the combined features worked quite well in comparison with the others.
Table 3 shows the third combined feature, the unpredictable portion of amino acid pairs, and the predictive accuracy in all, crystallized and non-crystallized proteins from B. haloduran. In Table 3, this feature differed between crystallized and non-crystallized proteins from B. haloduran, and the predictive accuracy was different between crystallized and non-crystallized proteins, too. In particular, the predictable portion is statistically higher in crystallized proteins than in non-crystallized ones (40.07% vs. 38.37%), which indicates a difference between crystallized and uncrystallized proteins in terms of the predictable portion, while other physicochemical features cannot show such a difference. This difference perhaps explains the accuracy of the predictions.
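The delete-1 jackknife validation mentioned above can be sketched generically: each protein is left out in turn, the model is fitted on the remaining proteins, and the left-out protein is then predicted. The midpoint classifier and the toy data below are hypothetical stand-ins chosen for brevity, not the logistic regression or neural network used in this study.

```python
def delete_one_jackknife(features, labels, fit, predict):
    """Return the held-out prediction for every sample."""
    predictions = []
    for i in range(len(features)):
        # Fit on all samples except sample i
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, features[i]))
    return predictions


def fit_midpoint(x, y):
    """Stand-in model: threshold halfway between the two class means."""
    positives = [xi for xi, yi in zip(x, y) if yi == 1]
    negatives = [xi for xi, yi in zip(x, y) if yi == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def predict_midpoint(threshold, xi):
    """Predict 1 when the feature value reaches the threshold."""
    return 1 if xi >= threshold else 0


# Toy feature values and yes-no outcomes for illustration only
feature = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
crystallized = [0, 0, 0, 1, 1, 1]
print(delete_one_jackknife(feature, crystallized, fit_midpoint, predict_midpoint))
```

Because the left-out sample never contributes to the fitted parameters, the resulting accuracy, sensitivity and specificity are less optimistic than those obtained by fitting and predicting on the same proteins.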
In conclusion, the present study once again demonstrated that the predictions using the features developed by us are relatively better than the predictions using any of the 540 amino-acid features, because these features capture the intrinsic random characteristic of protein sequences and therefore have wide-ranging applicability. Although many studies have been carried out on the prediction of the crystallization propensity of various proteins [1-15, 19, 20, 34-50], this issue is definitely unsolved. Therefore, more efforts are needed. In particular, how to find the features that really represent the crystallization propensity of various proteins is still unsolved.