For more than a decade, protein and amino acid features have been correlated with the crystallization propensity of proteins in order to develop methods that predict whether a protein can be crystallized. In this continuing study, each of three features that combine the features of amino acids and proteins was correlated with the crystallization propensity of proteins from Mycobacterium tuberculosis using logistic regression and neural network models. The results showed that two combined features, amino acid distribution probability and future composition, predicted well whether a protein would be crystallized in comparison with the predictions obtained from each of 531 amino acid features. The results obtained from the third combined feature, amino acid pair predictability, demonstrated the trend of crystallization propensity in proteins from Mycobacterium tuberculosis.
Many features of amino acids and of proteins influence the process of protein crystallization. With advances in science and technology, more and more such features will undoubtedly be found. Each feature provides a new insight from a viewpoint different from that of the other features, and every new feature may have some relationship with the crystallization propensity of proteins.
The notable features are the amino acid physicochemical features, which have been repeatedly correlated with propensity of protein crystallization [
Apparently, all known features of amino acids and proteins have been tested. However, several features that we developed have not yet been widely tested against the crystallization propensity of proteins. Each feature needs to be tested against the crystallization propensity of as many different proteins as possible before a solid scientific conclusion can be drawn on whether that feature is suitable for predicting the crystallization propensity of proteins.
In this context, we tested three features, which combine features of both amino acids and proteins, against the crystallization propensity of proteins from Mycobacterium tuberculosis, and compared the results with those obtained from each of 530-plus amino acid features.
A total of 428 proteins from Mycobacterium tuberculosis were found in Target DB [5, 6] under the criterion of purified proteins, of which 277 were also found under the criterion of crystallized protein. These two criteria were used in previous studies [7-15]. There are many other criteria in this database as well as in other databases, but our primary interest in this study is the process between purified and crystallized proteins.
The first feature is the amino acid distribution probability [
can be computed according to the equation [
factorial, r is the number of amino acids of a given type, q is the number of partitions containing the same number of amino acids, and n is the number of partitions in the protein for that type of amino acid. Each type of amino acid has only one distribution probability in a given protein. Because amino acid composition differs, each type of amino acid has its own distribution probability. Two worked examples were listed in columns 8 and 9 of
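The definitions of r, q and n above can be illustrated with a small sketch. The equation itself is not reproduced in this text, so the form used below, r!/(q0! × q1! × ... × qn!) × r!/(r1! × r2! × ... × rn!) × n^(-r), is an assumption, as is the choice of as many partitions as there are residues of the type (n = r). The function name and the `occupancy` argument are hypothetical. Under these assumptions, the occupancy patterns in the usage note give 0.5625, 0.1406 and 0.6667, values that also appear in the distribution probability columns for amino acids occurring four and three times.

```python
from collections import Counter
from math import factorial, prod


def distribution_probability(occupancy):
    """Probability of one occupancy pattern (assumed form, see lead-in).

    occupancy[i] is the number of residues of one amino acid type that
    fall into partition i of the protein.
    """
    r = sum(occupancy)   # residues of this amino acid type
    n = len(occupancy)   # partitions for this amino acid type
    if r == 0:
        return 0.0       # an absent amino acid is tabulated as 0.0000
    if n != r:
        raise ValueError("this sketch assumes n == r partitions")
    # q_k: number of partitions that hold exactly k residues
    q = Counter(occupancy)
    partition_term = factorial(r) // prod(factorial(v) for v in q.values())
    residue_term = factorial(r) // prod(factorial(k) for k in occupancy)
    return partition_term * residue_term / n ** r
```

For example, `distribution_probability([2, 1, 1, 0])` evaluates to 0.5625, `distribution_probability([2, 2, 0, 0])` to 0.140625, and `distribution_probability([2, 1, 0])` to about 0.6667. How residues are assigned to partitions along the sequence is not specified here and is left to the caller.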
The second feature is the amino acid future composition [
The third feature is the amino acid pair predictability [
Amino acid | Mutated amino acids with their translation probability |
---|---|
A | 12/36A + 2/36D + 2/36E + 4/36G + 4/36P + 4/36S + 4/36T + 4/36V |
R | 18/54R + 2/54C + 2/54Q + 6/54G + 2/54H + 1/54I + 4/54L + 2/54K + 1/54M + 4/54P + 6/54S + 2/54T + 2/54W + 2/54STOP |
N | 2/18N + 2/18D + 2/18H + 2/18I + 4/18K + 2/18S + 2/18T + 2/18Y |
D | 2/18A + 2/18N + 2/18D + 4/18E + 2/18G + 2/18H + 2/18Y + 2/18V |
C | 2/18R + 2/18C + 2/18G + 2/18F + 4/18S + 2/18W + 2/18Y + 2/18STOP |
E | 2/18A + 4/18D + 2/18E + 2/18Q + 2/18G + 2/18K + 2/18V + 2/18STOP |
Q | 2/18R + 2/18E + 2/18Q + 4/18H + 2/18L + 2/18K + 2/18P + 2/18STOP |
G | 4/36A + 6/36R + 2/36D + 2/36C + 2/36E + 12/36G + 2/36S + 1/36W + 4/36V + 1/36STOP |
H | 2/18R + 2/18N + 2/18D + 4/18Q + 2/18H + 2/18L + 2/18P + 2/18Y |
I | 1/27R + 2/27N + 6/27I + 4/27L + 1/27K + 3/27M + 2/27F + 2/27S + 3/27T + 3/27V |
L | 4/54R + 2/54Q + 2/54H + 4/54I + 18/54L + 2/54M + 6/54F + 4/54P + 2/54S + 1/54W + 6/54V + 3/54STOP |
K | 2/18R + 4/18N + 2/18E + 2/18Q + 1/18I + 2/18K + 1/18M + 2/18T + 2/18STOP |
M | 1/9R + 3/9I + 2/9L + 1/9K + 1/9T + 1/9V |
F | 2/18C + 2/18I + 6/18L + 2/18F + 2/18S + 2/18Y + 2/18V |
P | 4/36A + 4/36R + 2/36Q + 2/36H + 4/36L + 12/36P + 4/36S + 4/36T |
S | 4/54A + 6/54R + 2/54N + 4/54C + 2/54G + 2/54I + 2/54L + 2/54F + 4/54P + 14/54S + 6/54T + 1/54W + 2/54Y + 3/54STOP |
T | 4/36A + 2/36R + 2/36N + 3/36I + 2/36K + 1/36M + 4/36P + 6/36S + 12/36T |
W | 2/9R + 2/9C + 1/9G + 1/9L + 1/9S + 2/9STOP |
Y | 2/18N + 2/18D + 2/18C + 2/18H + 2/18F + 2/18S + 2/18Y + 4/18STOP |
V | 4/36A + 2/36D + 2/36E + 4/36G + 3/36I + 6/36L + 1/36M + 2/36F + 12/36V |
STOP | 2/27R + 2/27C + 2/27E + 2/27Q + 1/27G + 3/27L + 2/27K + 3/27S + 2/27W + 4/27Y + 4/27STOP |
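The table above can be connected to the future composition with a short sketch. It assumes that the translation probabilities come from giving equal weight to every single-base substitution in every codon of the standard genetic code, and that the future composition of an amino acid is the composition-weighted sum of the probabilities of mutating into it. Both points are assumptions, and the function names are hypothetical. With the Rv1155 amino acid counts (147 residues), this sketch gives about 8.42% for A and 1.86% for C, in line with the tabulated future composition.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def mutation_table():
    """Count, per residue, the residues reached by one base substitution."""
    table = {}
    for codon, source in CODON_TABLE.items():
        targets = table.setdefault(source, Counter())
        for position in range(3):
            for base in BASES:
                if base != codon[position]:
                    mutant = codon[:position] + base + codon[position + 1:]
                    targets[CODON_TABLE[mutant]] += 1
    return table


def future_composition(counts):
    """Future composition (%) of each amino acid from residue counts."""
    table = mutation_table()
    length = sum(counts.values())
    future = Counter()
    for source, number in counts.items():
        total = sum(table[source].values())
        for target, weight in table[source].items():
            if target != "*":  # mutations into a stop codon are not counted
                future[target] += 100.0 * number / length * weight / total
    return dict(future)


# Amino acid counts of Rv1155 as listed in the composition table.
RV1155 = dict(A=17, R=13, N=4, D=15, C=0, E=4, Q=6, G=8, H=4, I=9,
              L=15, K=4, M=3, F=2, P=10, S=8, T=7, W=2, Y=5, V=11)
```

Calling `future_composition(RV1155)["A"]` sums contributions from A, D, E, G, P, S, T and V, the amino acids whose codons can reach an alanine codon with one substitution.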
(9/147 × 17/146 × 146 = 1.04), but it appears three times in this protein, so the pair IA is unpredictable. In this way, the amino acid pairs in the Rv1155 protein are classified as 72.5% predictable and 27.5% unpredictable.
Because all three features are computed from individual amino acids together with their composition and/or distribution in a protein, they possess characteristics of both individual amino acids and the whole protein.
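The arithmetic for the pair IA can be sketched as follows. The expected count follows the expression quoted above. The rule that a pair is predictable when its observed count equals the expected count rounded to the nearest integer is an assumption, since the full definition is not reproduced in this text. The function names are hypothetical, and pairs of two identical amino acids are not covered.

```python
def expected_pair_count(first_count, second_count, length):
    """Expected occurrences of an ordered pair of two different amino acids."""
    pairs = length - 1  # adjacent positions in a chain of `length` residues
    return first_count / length * second_count / pairs * pairs


def is_predictable(actual_count, expected_count):
    # Assumption: predictable means the observed count matches the
    # expected count after rounding to the nearest integer.
    return actual_count == round(expected_count)


# Pair IA in Rv1155: 9 I, 17 A, 147 residues, observed three times.
expected_ia = expected_pair_count(9, 17, 147)
ia_predictable = is_predictable(3, expected_ia)
```

Here `expected_ia` is about 1.04 and `ia_predictable` is `False`, which agrees with the classification of IA as unpredictable.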
Amino acid features are the characteristics possessed by individual amino acids, and currently a database, AAIndex, contains 540-plus amino acid features describing various aspects of amino acids [
Amino acid | Number (Rv1155) | Number (Rv1875) | FINA770101 (Rv1155) | FINA770101 (Rv1875) | FINA770101 × Number (Rv1155) | FINA770101 × Number (Rv1875) | Distribution probability (Rv1155) | Distribution probability (Rv1875) | Future composition, % (Rv1155) | Future composition, % (Rv1875) |
---|---|---|---|---|---|---|---|---|---|---|
A | 17 | 17 | 1.08 | 1.08 | 18.36 | 18.36 | 0.1098 | 0.0229 | 8.42 | 9.10 |
R | 13 | 13 | 1.05 | 1.05 | 13.65 | 13.65 | 0.0617 | 0.0386 | 8.05 | 8.39 |
N | 4 | 5 | 0.85 | 0.85 | 3.40 | 4.25 | 0.5625 | 0.3840 | 3.64 | 2.34 |
D | 15 | 8 | 0.85 | 0.85 | 12.75 | 6.80 | 0.0125 | 0.0421 | 4.08 | 4.35 |
C | 0 | 0 | 0.95 | 0.95 | 0.00 | 0.00 | 0.0000 | 0.0000 | 1.86 | 2.17 |
E | 4 | 8 | 0.95 | 0.95 | 3.80 | 7.60 | 0.5625 | 0.1682 | 4.69 | 4.20 |
Q | 6 | 6 | 1.15 | 1.15 | 6.90 | 6.90 | 0.1543 | 0.3472 | 2.75 | 2.57 |
G | 8 | 14 | 0.55 | 0.55 | 4.40 | 7.70 | 0.2523 | 0.0262 | 6.70 | 8.29 |
H | 4 | 2 | 1.00 | 1.00 | 4.00 | 2.00 | 0.5625 | 0.5000 | 4.11 | 3.33 |
I | 9 | 2 | 1.05 | 1.05 | 9.45 | 2.10 | 0.1967 | 0.5000 | 4.79 | 4.17 |
L | 15 | 17 | 1.25 | 1.25 | 18.75 | 21.25 | 0.1569 | 0.0366 | 8.98 | 9.15 |
K | 4 | 2 | 1.15 | 1.15 | 4.60 | 2.30 | 0.1406 | 0.5000 | 2.71 | 2.95 |
M | 3 | 2 | 1.15 | 1.15 | 3.45 | 2.30 | 0.6667 | 0.5000 | 1.71 | 1.35 |
F | 2 | 3 | 1.10 | 1.10 | 2.20 | 3.30 | 0.5000 | 0.6667 | 2.73 | 2.57 |
P | 10 | 8 | 0.71 | 0.71 | 7.10 | 5.68 | 0.1905 | 0.0280 | 6.65 | 6.37 |
S | 8 | 5 | 0.75 | 0.75 | 6.00 | 3.75 | 0.0673 | 0.1920 | 7.34 | 7.31 |
T | 7 | 12 | 0.75 | 0.75 | 5.25 | 9.00 | 0.1071 | 0.1241 | 6.07 | 6.15 |
W | 2 | 4 | 1.10 | 1.10 | 2.20 | 4.40 | 0.5000 | 0.1875 | 0.77 | 0.87 |
Y | 5 | 3 | 1.10 | 1.10 | 5.50 | 3.30 | 0.3840 | 0.6667 | 2.47 | 1.71 |
V | 11 | 16 | 0.95 | 0.95 | 10.45 | 15.20 | 0.1616 | 0.0715 | 8.01 | 8.99 |
predictors for secondary structures [
Amino acid features are measured experimentally and documented, so they need not be computed for each protein, whereas the features described in the previous section must be computed for each protein. An amino acid feature is therefore a constant, i.e., each feature has a fixed value for each type of amino acid. In fact, only 531 amino acid features have 20 values for the 20 types of amino acids. In this study, each amino acid feature served as a benchmark against which the results obtained from the features described in the previous section were compared.
Logistic regression was a major tool used in previous studies [
The results were classified as true positive (TP), true negative (TN), false positive (FP) and false negative (FN), so the accuracy, sensitivity and specificity can be calculated as follows [9-15]: accuracy = (TP + TN)/(TP + FP + TN + FN) × 100, sensitivity = TP/(TP + FN) × 100, and specificity = TN/(TN + FP) × 100, respectively. MATLAB was used to perform both logistic regression and neural network modeling [23, 24]. McNemar's test was used to compare the classified results. The sensitivity and specificity were compared using receiver operating characteristic (ROC) analysis [25-28]. The Mann-Whitney U-test was used to compare predicted accuracies at different cutoff values.
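As an illustration of the three measures, the following sketch computes them from the four counts using the standard definitions of accuracy, sensitivity and specificity. The function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity as percentages."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    sensitivity = tp / (tp + fn) * 100  # share of crystallized proteins found
    specificity = tn / (tn + fp) * 100  # share of non-crystallized proteins found
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only.
accuracy, sensitivity, specificity = classification_metrics(tp=8, tn=6, fp=4, fn=2)
```

With these illustrative counts the three values are 70, 80 and 60 percent. The result tables report the same measures as fractions rather than percentages.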
In
Classification | The highest value | Accession number | Description | Characteristic |
---|---|---|---|---|
Fitting with logistic regression | | | | |
Accuracy | 0.6963 | | Distribution probability | Combined feature |
Accuracy | 0.6963 | TANS770107 | Normalized frequency of left-handed helix | Secondary structure feature |
Accuracy | 0.6963 | FAUJ880109 | Number of hydrogen bond donors | Secondary structure feature |
Sensitivity | 0.9819 | FAUJ880111 | Positive charge | Physicochemical feature |
Specificity | 0.2848 | 40 features | | Amino acid composition |
Specificity | 0.2848 | 176 features | | Physicochemical feature |
Specificity | 0.2848 | 225 features | | Secondary structure feature |
Fitting with neural network | | | | |
Accuracy | 0.8631 | | Distribution probability | Combined feature |
Sensitivity | 1 | 24 features | | Amino acid composition |
Sensitivity | 1 | 68 features | | Physicochemical feature |
Sensitivity | 1 | 23 features | | Secondary structure feature |
Specificity | 0.7269 | | Distribution probability | Combined feature |
Delete-1 validation with neural network | | | | |
Accuracy | 0.6481 | NADH010101 | Hydropathy scale based on self-information values in the two-state model (5% accessibility) | Physicochemical feature |
Sensitivity | 1 | RADA880106 | Accessible surface area | Physicochemical feature |
Sensitivity | 1 | FASG760102 | Melting point | Physicochemical feature |
Sensitivity | 1 | LEVM760104 | Side chain torsion angle phi (AAAR) | Secondary structure feature |
Specificity | 0.4334 | HUTJ700102 | Absolute entropy | Physicochemical feature |
Furthermore, the third combined feature, the percentage of predictable/unpredictable amino acid pairs, was used to compare the accuracy of predicting protein crystallization.
Characteristic | Group | Number | Median (25% - 75%) | P value |
---|---|---|---|---|
Unpredictable portion (%) | All proteins | 428 | 63.63 (54.88 - 75.25) | 0.013 |
Unpredictable portion (%) | Crystallized | 277 | 65.25 (55.50 - 78.25) | |
Unpredictable portion (%) | Non-crystallized | 151 | 61.50 (53.31 - 71.50) | |
Accuracy in fitting | All proteins | 428 | 0.959 (0.323 - 0.998) | <0.001 |
Accuracy in fitting | Crystallized | 277 | 0.994 (0.964 - 0.999) | |
Accuracy in fitting | Non-crystallized | 151 | 0.122 (0.042 - 0.361) | |
Accuracy in delete-1 | All proteins | 428 | 0.668 (0.377 - 0.926) | <0.001 |
Accuracy in delete-1 | Crystallized | 277 | 0.855 (0.659 - 0.978) | |
Accuracy in delete-1 | Non-crystallized | 151 | 0.268 (0.103 - 0.467) | |
Whether an amino acid or protein feature can be correlated with the propensity of protein crystallization has been tested through modeling [1, 4, 6, 7, 22, 31-39], because it is impossible to conduct a control experiment on a protein that lacks a given amino acid or protein feature. In this study, three new features, which combine the features of individual amino acids and of the whole protein, were correlated with the crystallization propensity of proteins from Mycobacterium tuberculosis. The results indicate that these three combined features can be considered factors that affect the propensity of protein crystallization. Among the three combined features, the amino acid pair predictability uses a single value, the unpredictable portion, to represent a protein, whereas the other two features, amino acid distribution probability and future composition, have one value for each type of amino acid. In this respect, the amino acid distribution probability and future composition are somewhat similar to the 540-plus amino acid features. However, the two combined features do not have constant values as those amino acid features do, so they may reflect more effectively certain features of amino acids within a whole protein. Clearly, more studies are needed to extend these three protein features to the analysis of the crystallization process in proteins from other organisms.
This study was supported by the National Natural Science Foundation of China (31560315) and the Key Project of Guangxi Scientific Research and Technology Development Plan (AB17190534).
The authors declare no conflicts of interest regarding the publication of this paper.