Correlating Combined Features of Amino Acid and Protein with Crystallization Propensity of Proteins from Mycobacterium tuberculosis

For more than a decade, both protein and amino acid features have been correlated with the crystallization propensity of proteins in order to develop methods for predicting whether a protein can be crystallized. In this continuing study, each of three features that combine amino acid and protein features was correlated with the crystallization propensity of proteins from Mycobacterium tuberculosis using logistic and neural network models. The results showed that two combined features, amino acid distribution probability and future composition, predicted well whether a protein would be crystallized in comparison with the predictions obtained from each of 531 amino acid features. The results obtained from the third combined feature, amino acid pair predictability, demonstrated the trend of crystallization propensity in proteins from Mycobacterium tuberculosis.


INTRODUCTION
Many features of amino acids and of proteins influence the process of protein crystallization. With advances in science and technology, more and more such features will undoubtedly be found. Each feature offers an insight from a viewpoint different from that of the other features, and every new feature may have a certain relationship with the crystallization propensity of proteins.
The notable features are the amino acid physicochemical features, which have been repeatedly correlated with the propensity of protein crystallization [1]. Subsequently, features of whole proteins were also correlated with the propensity of protein crystallization [2], for example, protein length, protein isoelectric point, percentage of charged residues, and hydrophobicity. With the compilation of features of amino acids [3], efforts were once again made to correlate the propensity of protein crystallization with amino acid features that had not been used in previous studies [2,4].
It appears that all known features of amino acids and proteins have been tested. However, several features that we developed have not yet been widely tested against the crystallization propensity of proteins. Indeed, each feature needs to be tested against the crystallization propensity of as many different proteins as possible before a solid scientific conclusion can be drawn on whether that feature is suitable for predicting the crystallization propensity of proteins.
In this context, we tested three features, which combine features of both amino acids and a protein, against the crystallization propensity of proteins from Mycobacterium tuberculosis, and compared the results with those obtained from each of 530-plus amino acid features.

Data
A total of 428 proteins from Mycobacterium tuberculosis were found in Target DB [5,6] under the criterion of purified protein, of which 277 were also found under the criterion of crystallized protein. These two criteria were used in previous studies [7-15]. There are many other criteria in this database as well as in other databases, but our primary interest in this study is the step between purified and crystallized proteins.

Features Possessed by Amino Acid and Protein
The first feature is the amino acid distribution probability [16], which is based on the occupancy of subpopulations and partitions describing the distribution of elementary particles in energy states according to three assumptions with respect to whether or not to distinguish each particle and energy state, i.e., the Maxwell-Boltzmann, Fermi-Dirac and Bose-Einstein assumptions in statistical mechanics [17]. For its application to proteins, for example, the Rv1875 protein has 3 tyrosines, and the simplest question is what the probability is that the 3 tyrosines are clustered together or scattered along the protein sequence. This probability can be computed according to the equation in [17], in which ! is the factorial, r is the number of a type of amino acid, q is the number of partitions with the same number of amino acids and n is the number of partitions in the protein for a type of amino acid. A type of amino acid has only one distribution probability in a protein. As amino acid composition differs, each type of amino acid has its own distribution probability. Two worked examples are listed in columns 8 and 9 of Table 2 to show the distribution probability related to each type of amino acid in proteins.
The second feature is the amino acid future composition [16], which comes from the observation that there are 64 RNA codons but only 20 types of amino acids, so each type of amino acid corresponds to a different number of RNA codons. For example, methionine corresponds to one RNA codon (AUG) and phenylalanine corresponds to two RNA codons (UUC and UUU), whereas leucine corresponds to six RNA codons (CUA, CUC, CUG, CUU, UUA and UUG). These differences naturally lead to different translation probabilities when a single base in an RNA codon mutates, and consequently the probability that an amino acid mutates to another amino acid differs (Table 1). For instance, when a mutation occurs in alanine, it has a 12/36 chance of remaining alanine, a 2/36 chance of mutating to each of aspartic acid and glutamic acid, and a 4/36 chance of mutating to each of glycine, proline, serine, threonine and valine. Two worked examples are listed in columns 10 and 11 of Table 2 to show the characteristic of this feature.
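As a hedged illustration of the two features described above, the following sketch computes (a) the probability of one occupancy pattern of r residues over n partitions and (b) the outcomes of every single-base change in the codons of one amino acid. It assumes the standard Maxwell-Boltzmann occupancy formula, n!/(q0! × q1! × ...) × r!/(r1! × r2! × ... × rn!) × n^(-r), with the number of partitions taken equal to the number of residues, and the standard genetic code. The function names are hypothetical, and the code is not taken from the implementation used in this study.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def occupancy_probability(counts):
    """Probability of one occupancy pattern, e.g. (2, 1, 0), under the
    Maxwell-Boltzmann assumption (assumed formula, see the text above)."""
    n = len(counts)                      # number of partitions
    r = sum(counts)                      # number of residues of this type
    ways_partitions = factorial(n)
    for q in Counter(counts).values():   # q: partitions sharing one count
        ways_partitions //= factorial(q)
    ways_residues = factorial(r)
    for r_i in counts:                   # r_i: residues in partition i
        ways_residues //= factorial(r_i)
    return Fraction(ways_partitions * ways_residues, n ** r)

# Standard genetic code, codons ordered by base U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def mutation_outcomes(residue):
    """Count the residues reached by every single-base change in every
    codon of `residue` (one-letter code; '*' marks a stop codon)."""
    outcomes = Counter()
    for codon, aa in CODON_TABLE.items():
        if aa != residue:
            continue
        for pos in range(3):
            for base in BASES:
                if base != codon[pos]:
                    mutant = codon[:pos] + base + codon[pos + 1:]
                    outcomes[CODON_TABLE[mutant]] += 1
    return outcomes
```

Under these assumptions, three residues in three partitions give 2/9 for the pattern (1, 1, 1), 2/3 for (2, 1, 0) and 1/9 for (3, 0, 0), which sum to one, and `mutation_outcomes("A")` reproduces the 12/36, 2/36 and 4/36 chances quoted for alanine.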
The third feature is the amino acid pair predictability [16], which is based on permutation. For instance, there are 15 leucines (L), 17 alanines (A) and 9 isoleucines (I) in the Rv1155 protein. According to the permutation, the amino acid pair LA would appear twice (15/147 × 17/146 × 146 = 1.73), and there are indeed two LAs in reality, so the pair LA is predictable. However, the amino acid pair IA would appear once (9/147 × 17/146 × 146 = 1.04) according to the permutation, and a pair whose actual count differs from its expected count is regarded as unpredictable.
Because all three features are computed with consideration of individual amino acids together with their composition and/or distribution in a protein, they possess characteristics of both individual amino acids and the whole protein.
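The permutation-based expectation above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the function names are hypothetical, only pairs of two different residue types are handled, and the rule that a pair is predictable when its actual count equals the expected count rounded to the nearest integer is an assumption drawn from the LA example.

```python
def expected_pair_count(n_first, n_second, length):
    """Expected frequency of an ordered pair of two different residue
    types in a protein of `length` residues (length - 1 adjacent pairs)."""
    return n_first / length * n_second / (length - 1) * (length - 1)

def count_pair(sequence, pair):
    """Count overlapping occurrences of a two-letter pair in a sequence."""
    return sum(sequence[i:i + 2] == pair for i in range(len(sequence) - 1))

def is_predictable(actual_count, expected_count):
    # Assumption: predictable means the actual count equals the expected
    # count rounded to the nearest integer.
    return actual_count == round(expected_count)

# Composition quoted in the text for Rv1155: 15 L, 17 A, 9 I, 147 residues.
expected_la = expected_pair_count(15, 17, 147)   # about 1.73
expected_ia = expected_pair_count(9, 17, 147)    # about 1.04
```

With the two LA pairs reported for Rv1155, `is_predictable(2, expected_la)` returns `True`.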

Amino Acid Features
Amino acid features are the characteristics possessed by individual amino acids. Currently a database, AAIndex, contains 540-plus amino acid features describing various aspects of amino acids [3], including physicochemical features, spatial features [18], electronic features [19], hydrophobic features [20], predictors for secondary structures [21], etc. Amino acid features are measured through experiments and documented, so they do not need to be computed for each protein, whereas the features described in the previous section must be computed for each protein. An amino acid feature is therefore a constant for an amino acid, i.e., each feature has an unchanged value for a type of amino acid. In fact, only 531 amino acid features have 20 values for the 20 types of amino acids. In this study, each amino acid feature served as a benchmark against which to compare the results obtained from the features described in the previous section.

Models
Logistic regression was a major tool used in previous studies [22] because it handles the relationship between a yes-no event and continuous numeric values, i.e., the relationship between whether a protein is crystallized and the protein as encoded with either amino acid features or protein features. In this study an attempt was made to correlate each of the three combined features with the crystallization propensity of proteins from Mycobacterium tuberculosis through logistic and neural network models, and the results were compared with those obtained from modeling each of the 531 amino acid features against the crystallization propensity of the proteins.
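To make the yes-no versus numeric-value relationship concrete, the sketch below fits a logistic model by batch gradient descent in plain Python. It is only an illustration: the study itself used MatLab, the function names are hypothetical, and the four data points are invented rather than taken from the 428 proteins.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x_i, y_i in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x_i))) - y_i
            grad_b += err
            for j, xj in enumerate(x_i):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x, cutoff=0.5):
    """Return 1 (crystallized) or 0 (not crystallized)."""
    p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
    return 1 if p >= cutoff else 0

# Invented toy data: one feature value per protein, 1 = crystallized.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
```

In the setting described above, each protein would presumably be encoded by one value per type of amino acid rather than by a single number, so each row of `X` would have 20 entries.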

Statistics
The results were classified as true positive (TP), true negative (TN), false positive (FP) and false negative (FN), so the accuracy, sensitivity and specificity can be calculated as follows [9-15]: accuracy = (TP + TN)/(TP + FP + TN + FN) × 100, sensitivity = TP/(TP + FN) × 100, and specificity = TN/(TN + FP) × 100, respectively. MatLab was used to perform both logistic regression and neural network modeling [23,24]. McNemar's test was used to compare the classified results. The sensitivity and specificity were compared using receiver operating characteristic (ROC) analysis [25-28]. The Mann-Whitney U-test was used to compare predicted accuracies at different cutoff values.

RESULTS AND DISCUSSION
Table 2 shows the differences between amino acid features and combined features. As can be seen, the amino acid feature FINA770101, which describes the helix-coil equilibrium, has a constant value for each type of amino acid (columns 4 and 5) regardless of the amino acid's location, composition (columns 2 and 3) and neighboring amino acids. A simple remedy is to multiply this amino acid feature by its corresponding composition (columns 6 and 7, Table 2). By contrast, the two combined features have different values for different amino acids in those two proteins (last four columns, Table 2). This is an important distinction between combined features and amino acid features, and a rationale for correlating the combined features with the crystallization propensity of proteins from Mycobacterium tuberculosis.
Figure 1 showed the comparisons of accuracy, sensitivity and specificity obtained using logistic regression to correlate the propensity of protein crystallization with each feature. In this figure, each bar represented how many features resulted in a similar accuracy, sensitivity or specificity. For example, the first bar from the left in the upper panel indicated that three amino acid features (CHAM830108, FAUJ880111 and MITS020101) had similar accuracies (0.643 ± 0.003). Similarly, the second bar indicated that three other amino acid features (CHAM830105, GOLD730101 and MIYS990101) had similar accuracies (0.657 ± 0.004). Figure 1 clearly showed that the two combined features had a relatively good relationship with the propensity of protein crystallization. In particular, the prediction using amino acid distribution probability was the best in terms of accuracy and sensitivity.
Figure 2 displayed the comparisons of accuracy, sensitivity and specificity obtained using the neural network to correlate the propensity of protein crystallization with each feature. The presentation in this figure follows that of Figure 1. Clearly, the neural network could further distinguish the differences between features. Compared against amino acid features, Figure 1 and Figure 2 suggested that the two combined features not only were involved in the crystallization process, but also served better for the prediction of protein crystallization. Also, many amino acid features gave similar results, which is consistent with the study that demonstrated the abundance in amino acid features [29]. In particular, Figure 2 showed that the prediction using amino acid distribution probability was the best in terms of accuracy and specificity.
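The accuracy, sensitivity and specificity used to compare features can be computed from the four classification counts as in the minimal sketch below. It uses the standard definitions of the three measures expressed as percentages; the function name and the counts in the example are invented for illustration and are not results of this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total * 100,     # all correct calls
        "sensitivity": tp / (tp + fn) * 100,     # crystallized found
        "specificity": tn / (tn + fp) * 100,     # non-crystallized found
    }

# Invented counts, for illustration only.
example = classification_metrics(tp=50, tn=30, fp=10, fn=10)
```

Reporting the three measures together matters here because the data are unbalanced (277 crystallized out of 428 purified proteins), so a high accuracy alone could hide a low specificity.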
In Figure 1 and Figure 2, the database was not divided, i.e., the model parameters obtained from the 428 Mycobacterium tuberculosis proteins were used for predictions. This is generally considered the first stage in modeling, after which the database should be divided into two groups, one for the generation of model parameters and the other for validation [30]. Figure 3 displayed the accuracy, sensitivity and specificity obtained from delete-1 jackknife validation, which further demonstrated that the predictions using combined features were not worse than those using amino acid features. In fact, Figure 3 showed that the predictions using amino acid distribution probability and future composition were the best in terms of accuracy and specificity. Table 3 listed the predictive performance with respect to each feature in terms of accuracy, sensitivity and specificity. As can be seen, the best results were obtained using amino acid distribution probability, physicochemical features and secondary structure features.
Figure 4 displayed the results of ROC analysis with respect to logistic regression, and to fitting and delete-1 jackknife validation using a 20-1 feedforward backpropagation neural network. Two points could be drawn: 1) all the features gave classifications distributed above the diagonal, i.e., the predictions were better than random chance, as McNemar's test showed that the classified results were significantly different from those of a random guess (P < 0.01), and 2) the two combined features worked quite well in comparison with the others.
Furthermore, the third combined feature, the percentage of predictable/unpredictable amino acid pairs, was used to compare the accuracy of predicting protein crystallization. Figure 5 and Figure 6 showed such analysis in both neural network fitting and delete-1 jackknife validation.
First, a cutoff value of accuracy was set at the 0.75, 0.80, 0.85 and 0.90 levels; second, the 428 Mycobacterium tuberculosis proteins were divided into two groups according to these cutoff values; third, the predictable portions of proteins were compared between the two groups. Figure 5 and Figure 6 showed that the proteins with a large predictable portion gave a high accuracy in predicting their crystallization propensity.
Table 4 showed the third combined feature, the unpredictable portion of amino acid pairs, and the predictive accuracy in all, crystallized and non-crystallized proteins from Mycobacterium tuberculosis. As can be seen in Table 4, this feature differed between crystallized and non-crystallized proteins from Mycobacterium tuberculosis, and the predictive accuracy differed between crystallized and non-crystallized proteins, too. In particular, the unpredictable portion was statistically higher in crystallized proteins than in non-crystallized ones (65.25% vs. 61.50%), while the accuracy of predictions was also higher in crystallized proteins than in non-crystallized ones. However, we could not find a direct correlation between unpredictable portion and prediction accuracy.
The issue of whether an amino acid or protein feature can be correlated with the propensity of protein crystallization has been tested through modeling [1,4,6,7,22,31-39]. This is because it is impossible to conduct a control experiment without either an amino acid or a protein feature. In this study, three new features, which combine the features of individual amino acids and of a protein, were correlated with the crystallization propensity of proteins from Mycobacterium tuberculosis. The results demonstrate that these three combined features can be considered as factors that affect the propensity of protein crystallization.
Among the three combined features, the amino acid pair predictability uses a single value, the unpredictable portion, to represent a protein, while the other two features, amino acid distribution probability and future composition, have one value for each type of amino acid. In this view, the amino acid distribution probability and future composition are somewhat similar to the 540-plus amino acid features. However, the two combined features do not have constant values as those amino acid features do, and therefore they more efficiently reflect certain features of amino acids in a whole protein. Clearly, more studies are needed to extend these three combined features to the analysis of the crystallization process in proteins from other organisms.

FUND
This study was supported by the National Natural Science Foundation of China (31560315) and the Key Project of Guangxi Scientific Research and Technology Development Plan (AB17190534).