1. INTRODUCTION
Enzymatic reactions require many finely tuned conditions, which are usually determined experimentally. Such conditions are valuable for new experiments with new enzymes because they save experimenters considerable time and money. However, many of these conditions are not reported in the literature, so this experience is not fully available to fellow researchers.
Moreover, modern protein design produces numerous new enzymes whose optimal working conditions are entirely unknown. Although previous experience can be extrapolated to new enzymes, such extrapolations are generally empirical.
With the rapid development of computational chemistry and bioinformatics, it may be possible to use models to predict the optimal working conditions for enzymatic reactions with newly designed enzymes. This is plausible because a great deal of information on primary, secondary, tertiary, and quaternary structures is now readily available, and many studies have addressed the structure-function relationship of proteins [1,2]. In practice, the optimal working conditions are adjusted to suit enzymatic function, or more precisely enzyme structure; we can therefore assume that a relationship exists between enzyme structure and the working conditions of the enzymatic reaction. This assumption lays the foundation for predicting working conditions from enzyme structure.
Another effort by the scientific community is to build comprehensive databases of enzymes and their functional parameters in enzymatic reactions, for example, Km and pH. However, even such a comprehensive database cannot include all the parameters for all enzymes, simply because many enzymatic parameters are not documented in the literature.
Although measuring a working condition is not difficult during experiments, a measurement differs from a prediction in two respects. They differ in time: a measurement concerns the past, whereas a prediction concerns the future. They also differ in mechanism: a measurement relates to the mechanism of the enzymatic reaction, whereas a prediction relates to the enzyme structure-function relationship.
β-Glucosidase (EC 3.2.1.21) plays an important role in biological processes because it cleaves the β-linkages between glucose units [3], among which those in cellulose have recently attracted much attention because of interest in its role in biofuels [4]. With such great interest, more efforts are being made not only to search for new β-glucosidases but also to mutate existing ones, so there are more and more β-glucosidases with clearly annotated primary structures but without documented working conditions for enzymatic reactions, for example, the pH optimum. This provides a good case for developing models to predict the pH optimum for the activity of newly mutated and newly found β-glucosidases, because predicting the pH of protein stability has been a research focus for years [5 - 8]. The prediction of the pH optimum for an enzyme reaction would therefore advance our current knowledge from the structure-function relationship to the structure-environment relationship.
In this study, we attempted to use the knowledge about amino-acid features from β-glucosidase sequences to predict the pH optimum for the activity of β-glucosidases.
2. MATERIALS AND METHODS
2.1. Data
The β-glucosidases (EC 3.2.1.21) were retrieved from the Comprehensive Enzyme Information System BRENDA [9]. In this databank, only 34 β-glucosidases were found under the category of pH optimum as a functional parameter, of which two are documented with their mutants [10,11]. In addition, two pH values are documented for each of the β-glucosidases A9UIG0, B5TWK3, and P96316. In total, this databank provides 44 matched β-glucosidases with their pH values, while sequence information for the β-glucosidases was obtained from UniProt [12].
2.2. Predictors
For most enzymes, we generally have only their primary structure, because knowledge of secondary, tertiary, and quaternary structures requires a considerable amount of experimental work. Therefore, the prediction at this stage focuses on using knowledge of amino acids from enzyme sequences. We use several amino-acid properties listed in Table 1 as predictors. The values in Table 1 reflect various aspects of amino acids [13], for example, the spatial properties listed from row 3 to row 6 [14,15]; the hydrophobic properties listed from row 7 to row 11 [16 - 18]; the electronic properties listed from row 12 to row 18 [19]; and the amino-acid based secondary structure predictions listed from row 17 to row 25, which depend on assigning a set of predicted values to a residue and are then calculated by applying a simple algorithm [20].
A particular characteristic of Table 1 is that its values are constant regardless of amino-acid position in a protein, neighboring amino acids, protein length, etc. This is understandable because these properties do not change in these respects; for example, an amino acid's physicochemical property is the same no matter where the amino acid is located in a protein. As the amino-acid composition differs from one β-glucosidase to another, we weight the values listed in Table 1 by multiplying them by the amino-acid composition of each β-glucosidase.
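The weighting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the composition is taken as the fraction of each amino acid in the sequence (the text does not say whether fractions or counts are used), the names `composition` and `weighted_inputs` are hypothetical, and the property scale in the usage lines is a placeholder, not a row of Table 1.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every enzyme
# yields its inputs in the same positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each standard amino acid in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def weighted_inputs(sequence, property_values):
    """Multiply a constant per-residue property by the composition.

    property_values maps each one-letter code to a constant value
    (one row of a property table); the result has 20 entries.
    """
    comp = composition(sequence)
    return [property_values[aa] * comp[aa] for aa in AMINO_ACIDS]


# Placeholder scale for illustration only (NOT taken from Table 1).
placeholder_scale = {aa: 1.0 for aa in AMINO_ACIDS}
inputs = weighted_inputs("MKTAYIAKQR", placeholder_scale)
print(len(inputs))  # 20 inputs, one per amino-acid type
```

With a real property row in place of `placeholder_scale`, the 20 resulting numbers would form one input vector for one enzyme.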
Besides the classical knowledge listed in Table 1, there is also the amino-acid distribution probability, which is based on the occupancy of subpopulations and partitions [21] and reflects the random aspect of amino-acid distribution along a protein (for review and textbook, see [22 - 26]). The difference is that the amino-acid distribution probability does not give each amino acid a constant value as in Table 1, but a value that depends on the length of the enzyme and the position of each amino acid. Table 2 shows such a difference.
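To make the position dependence concrete, the occupancy formula quoted in the caption of Table 2, r!/(q0! × q1! × ... × qn!) × r!/(r1! × r2! × ... × rn!) × n^(−r), can be sketched as follows. The partitioning rule is an assumption made here for illustration: for an amino acid that occurs r times, the sequence is cut into n = r partitions of equal length and each residue is assigned to a partition by its position. The function name `distribution_probability` is hypothetical, and the web calculator cited in the caption may handle details differently.

```python
from collections import Counter
from math import factorial


def distribution_probability(sequence, residue):
    """Occupancy probability of one amino-acid type along a sequence.

    Assumption: n = r equal-length partitions, where r is the number
    of occurrences of `residue`. Returns None if the residue is absent.
    """
    positions = [i for i, aa in enumerate(sequence) if aa == residue]
    r = len(positions)
    if r == 0:
        return None
    n = r
    length = len(sequence)

    # r_i: how many residues of this type fall into partition i.
    occupancy = [0] * n
    for pos in positions:
        occupancy[pos * n // length] += 1

    # q_k: how many partitions contain exactly k residues of this type.
    partitions_by_count = Counter(occupancy)

    # First factor: r! / (q0! * q1! * ... * qn!)
    ways_partitions = factorial(r)
    for q in partitions_by_count.values():
        ways_partitions //= factorial(q)

    # Second factor: r! / (r1! * r2! * ... * rn!)
    ways_residues = factorial(r)
    for r_i in occupancy:
        ways_residues //= factorial(r_i)

    # Third factor: n to the power of -r
    return ways_partitions * ways_residues / n ** r


# Two toy sequences with the same composition but different positions.
print(distribution_probability("AAABBBCCC", "A"))
print(distribution_probability("ABCABCABC", "A"))
```

The two toy sequences have identical composition, yet the three A residues fall into one partition in the first and into three different partitions in the second, so the function returns different values. This is the kind of position sensitivity that the constant values of Table 1 cannot express.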
2.3. Predictive Model
As the predictors are directly related to the 20 types of amino acids, it is natural to consider a predictive model that couples 20 inputs of amino-acid knowledge with a single output, the documented pH optimum. As this predictive model advances a step from the structure-function to the structure-environment relationship, we chose a 20-1 feedforward backpropagation neural network [27,28] to account for this hidden and implicit relationship after extensive work on model selection (Figure 1).
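A network of this shape can be sketched in plain Python as shown below. This is not the authors' implementation: the tanh transfer function in the first layer, the linear output neuron, the per-sample gradient-descent update, the learning rate, and the weight range are all assumptions chosen for illustration. The dictionary keys `IW`, `LW`, `b1`, and `b2` simply mirror the labels in the caption of Figure 1.

```python
import math
import random


def init_network(n_inputs=20, n_hidden=20, seed=None):
    """Randomly initialize weights and biases of a 20-1 network."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.5, 0.5)

    return {
        "IW": [[rand() for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [rand() for _ in range(n_hidden)],
        "LW": [rand() for _ in range(n_hidden)],
        "b2": rand(),
    }


def forward(net, x):
    """Return first-layer activations and the single output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(net["IW"], net["b1"])
    ]
    output = sum(w * h for w, h in zip(net["LW"], hidden)) + net["b2"]
    return hidden, output


def train(net, samples, targets, epochs=250, lr=0.01):
    """Backpropagation with per-sample updates; returns MSE per epoch."""
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, y in zip(samples, targets):
            hidden, out = forward(net, x)
            err = out - y  # gradient of 0.5 * err**2 w.r.t. the output
            squared_error += err * err
            for j, h in enumerate(hidden):
                # Error propagated back through the tanh neuron j,
                # computed before LW[j] is updated.
                delta = err * net["LW"][j] * (1.0 - h * h)
                net["LW"][j] -= lr * err * h
                for i, xi in enumerate(x):
                    net["IW"][j][i] -= lr * delta * xi
                net["b1"][j] -= lr * delta
            net["b2"] -= lr * err
        history.append(squared_error / len(samples))
    return history
```

Calling `train(init_network(), inputs, ph_values)` with one 20-element input vector per enzyme returns the mean squared error for each epoch, which can be plotted to inspect convergence for a given random initialization.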
2.4. Validation of Predictions
The second column in Table 3 lists all 44 β-glucosidases obtained from the databank, of which 30 were used as the training group to generate the model parameters (the weights and biases of the neural network), and 14 were used as the validation group to validate the neural network with the generated weights and biases. This is a very traditional approach to validation in neural networks.
The second approach to validation is the delete-1 observation jackknife: each time, we use 43 β-glucosidases as the training group to generate model parameters and then validate the prediction for the omitted β-glucosidase, until all 44 β-glucosidases have undergone the same procedure. It is said to be the most effective approach to validation [29], although it is labor-intensive and time-consuming.
The third approach to validation is cross-validation, in which the 44 β-glucosidases were split into 11 subsets containing 4 cases each or 4 subsets containing 11 cases each. Each time, ten or three subsets were used to generate the model parameters and one subset was used for validation; this procedure was repeated in turn until each subset had served for validation [29].
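The jackknife and cross-validation schemes above differ only in how the 44 cases are divided, which the following sketch makes explicit. The function names are hypothetical, and the folds are taken as contiguous blocks of indices because the text does not say how cases were assigned to subsets.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (training, validation) index lists for contiguous folds."""
    if n_samples % n_folds != 0:
        raise ValueError("n_samples must be divisible by n_folds")
    fold_size = n_samples // n_folds
    indices = list(range(n_samples))
    for k in range(n_folds):
        start, stop = k * fold_size, (k + 1) * fold_size
        yield indices[:start] + indices[stop:], indices[start:stop]


def jackknife_splits(n_samples):
    """Delete-1 jackknife: every sample is left out exactly once."""
    return k_fold_splits(n_samples, n_samples)


# 44 enzymes: 11 subsets of 4 (train on ten), 4 subsets of 11 (train
# on three), and the delete-1 jackknife (train on 43).
for name, splits in [
    ("11 subsets", k_fold_splits(44, 11)),
    ("4 subsets", k_fold_splits(44, 4)),
    ("jackknife", jackknife_splits(44)),
]:
    training, validation = next(splits)
    print(name, len(training), len(validation))
```

In each scheme, a model would be trained on the `training` indices and evaluated on the `validation` indices, and the loop would continue until every subset has served for validation.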
2.5. Statistics
For each predictor, we generated 100 sets of model parameters so that the predictions based on them would yield a normally distributed mean±SD to compare with the documented pH optimum of activity for each β-glucosidase [30]. For normally distributed data, Student's t-test was used, and for data that were not normally distributed, the non-parametric Mann-Whitney U-test was used. P < 0.05 was considered statistically significant. For visual comparison, linear regression was also used to evaluate the predicted pH values against their documented ones.
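The summary and regression steps can be sketched as follows. The helper names and all numbers in the usage lines are hypothetical, chosen only to show the shape of the computation; they are not values from this study, and the significance tests themselves are not reproduced here.

```python
from statistics import mean, stdev


def summarize(predictions):
    """Mean and sample SD of repeated predictions for one enzyme."""
    return mean(predictions), stdev(predictions)


def linear_fit(documented, predicted):
    """Least-squares line: predicted = slope * documented + intercept."""
    mean_x, mean_y = mean(documented), mean(predicted)
    sxx = sum((x - mean_x) ** 2 for x in documented)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(documented, predicted)
    )
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical repeated predictions for three enzymes (not real data).
documented_ph = [4.5, 5.0, 6.5]
repeated_predictions = [
    [4.4, 4.6, 4.5],
    [5.1, 4.9, 5.3],
    [6.2, 6.4, 6.3],
]
mean_predictions = [summarize(p)[0] for p in repeated_predictions]
slope, intercept = linear_fit(documented_ph, mean_predictions)
print(slope, intercept)
```

A slope near 1 and an intercept near 0 would indicate that the mean predictions track the documented values closely.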
3. RESULTS AND DISCUSSION
It is highly likely that the relationship between amino-acid features and the pH optimum of activity is at least one step beyond the structure-function relationship, which might imply that we need at least two layers in the neural network to account for this hidden and implicit relationship (Figure 1). Technically, the development of a predictive method includes (1) selection of predictors and (2) selection of predictive models, and the general and efficient practice is to select predictors first and then to select predictive models.
Working with this network model, the next consideration is the training process, which again ensures a fair selection of predictors. Technically, both the initialization of weights and biases and the number of training epochs govern whether the neural network can converge. We used the random initialization function to initialize weights and biases, and 250 training epochs for convergence. Figure 2 displays the convergence performance in the training group, where each line represents a training process with random initialization of weights and biases running for 250 training epochs. Different predictors have different convergence profiles. As seen, 11 predictors converge within 250 training epochs with any random initialization, which supports our training process. However, the other predictors do not converge after 100 epochs, as with the amino-acid composition shown in the top-left panel of Figure 2.
Table 2. Inductive effect scale, amino-acid number and distribution probability in β-glucosidase A9UIG0 and Q4U4W7. The amino-acid distribution probability is computed according to the equation $\frac{r!}{q_0! \times q_1! \times \cdots \times q_n!} \times \frac{r!}{r_1! \times r_2! \times \cdots \times r_n!} \times n^{-r}$, where ! is the factorial function, r is the number of a type of amino acid, q is the number of partitions with the same number of amino acids and n is the number of partitions in the protein for a type of amino acid. The computation can be found at the web site (http://www.gxas.cn/dp.htm).
Figure 1. 20-1 feedforward backpropagation neural network to account for the relationship between features of the primary structure of β-glucosidase, labeled with amino-acid abbreviations, and pH optimum. Each triangle represents a neuron. IW{1} is the input weights, and LW{2,1} is the layer weights to the second layer from the first layer. b{1} and b{2} are the biases related to each neuron at the first and second layers.
Figure 3 shows that the percentage of correctly predicted pH optima ranges from about 70% to 90% depending on the feature used. Figure 3 thus gives a basic idea of which predictor works better for predicting the pH optimum of activity. Accordingly, the amino-acid distribution probability is better than the others. This is because the amino-acid features other than the distribution probability do not depend on residue position in the sequence or on neighboring amino acids, whereas the amino-acid distribution probability is sensitive to these conditions.
Proteins evolve to function in specific cellular environments; thus the pH of activity is subject to evolutionary pressure [5]. If the pH level can change the conformation of β-glucosidase, then different pH levels could produce slight differences in its conformation. In this context, the explanation for the results in Figure 3 would be that the amino-acid distribution probability, as it reflects the randomness in an enzyme, more accurately reflects the changes in the conformation of β-glucosidase due to different pH levels, while the other predictors, because they have constant values (for example, physicochemical properties), reflect these changes less accurately.
Table 3 shows the comparison between documented and predicted pH optima for each β-glucosidase found in the database [9]. We consider a predictor workable if there is no statistical difference between the documented and predicted pH optima. The last row of Table 3 shows the overall performance, where we can see that the amino-acid distribution probability gives better predictive results than the other predictors, whose results are similar to one another.
Figure 2. Convergence of mean squared error performance function with 100 different initial weights and biases generated by random initialization function.
In Figure 4, we used the regression between documented and predicted pH optimum of activity to visualize the predictive performance using the amino-acid distribution probability as predictor in order to confirm our observation visually.
To further validate the above findings, we applied the delete-1 jackknife validation and 3-fold and 10-fold cross-validation to these predictors, as shown in Figure 5, where we once again find that the best predictor is the amino-acid distribution probability.
Figure 3. Percentage of correct predictions using different predictors.
Figure 4. Linear regression between documented and predicted pH optimum of activity in training and validation groups with the amino-acid distribution probability as predictor. For training group, predicted pH = 0.9947 × recorded pH + 0.0308 (P < 0.001). For validation group, predicted pH = 0.8492 × recorded pH + 0.6748 (P = 0.003).
Figure 5. Percentage of correct predictions with delete-1 jackknife validation, 10-fold and 3-fold cross-validation using different predictors. AA represents amino acid.
The predictors used in this study include some related to the amino-acid based secondary structure of β-glucosidase; however, these features do not yield better predictions than the others. This opens the possibility of using amino-acid features to predict various working conditions for enzymes, and even of using information on primary structure to predict changing environments. This is supported by other studies, which have shown that a physicochemical metric of charge distribution correlates better with subcellular pH [6], and that the amino-acid composition is one of two factors that influence the pH of maximal protein stability [7] and can empirically model the pH optimum of protein-protein binding [8].
Turning to validation, one technical detail is puzzling: comparing Figure 3 with Figure 5, the jackknife validation performs worse than the traditional validation. This is interesting because the jackknife uses almost all the samples to generate model parameters, yet produces worse predictions. This is counter-intuitive, because current knowledge indicates that the larger the training set, the greater the chance that it covers the predicted sample, and the better the prediction. Clearly, this issue requires further study.
Currently it is not clear whether different pH optima imply that β-glucosidases have different structures. If so, our prediction still falls within the so-called structure-function relationship; if not, our prediction would suggest a more sophisticated mechanism linking structure and enzymatic working condition, i.e. a structure-environment relationship.
In conclusion, this study suggests that we can use the amino-acid features of β-glucosidases to predict their pH optimum of activity. Among the 25 amino-acid features screened, 11 can serve as predictors to estimate the pH optimum, including 6/7 of the electronic properties and 4/7 of the amino-acid based secondary structure predictions, and the amino-acid distribution probability gives better predictions than the other predictors. However, the amino-acid composition, electronic properties and secondary structure predictions cannot work in the neural network model by themselves, and they must be weighted with the amino-acid composition. Thus, the amino-acid distribution probability reveals its advantage in the prediction, indicating that random mechanisms may underlie enzyme reactions. The model provides the possibility of using amino-acid features to predict various working conditions for enzymes.
FUNDING
This study was supported by the National Natural Science Foundation of China (31560315) and the Key Project of Guangxi Scientific Research and Technology Development Plan (AB17190534).