Predicting pH Optimum for Activity of Beta-Glucosidases


Enzymatic reactions require finely tuned working conditions, but few optimal conditions are documented in the literature. For newly mutated and newly found enzymes, the optimal working conditions can only be extrapolated from previous experience. The question raised here is therefore whether knowledge of enzyme structure can be used to predict the optimal working conditions. Although working conditions for enzymes can be easily measured in experiments, predicting them is still important because predictions can minimize experimental cost and time. In this study, we develop a 20-1 feedforward backpropagation neural network that uses information on the amino-acid sequence to predict the pH optimum for the activity of beta-glucosidase, an enzyme that has drawn much attention for its role in the biofuel industry. Among the 25 amino-acid features screened, the results show that 11 features can be used as predictors in this model and that the amino-acid distribution probability is the best at predicting the pH optimum for the activity of beta-glucosidases. Our study paves the way for predicting the optimal working conditions of enzymes based on amino-acid features.

Share and Cite:

Yan, S. and Wu, G. (2019) Predicting pH Optimum for Activity of Beta-Glucosidases. Journal of Biomedical Science and Engineering, 12, 354-367. doi: 10.4236/jbise.2019.127027.

1. Introduction

Enzymatic reactions require many finely tuned conditions, which are usually determined through experiments. Such conditions are valuable for any new experiment with a new enzyme because they can save experimenters much time and money. However, many of these experimental conditions are not available in the literature, so this valuable experience cannot be fully used by fellow researchers.

Moreover, modern protein design produces numerous new enzymes whose optimal working conditions are entirely unknown. Although we can extrapolate previous experience to new enzymes, such extrapolations are generally empirical.

With the fast development of computational chemistry and bioinformatics, it may be possible to use models to predict the optimal working conditions for enzymatic reactions with newly designed enzymes. This is plausible because a great deal of information on primary, secondary, tertiary, and quaternary structures is now readily available, and many studies have addressed the structure-function relationship of proteins [1,2]. Because the optimal working conditions are adjusted to suit enzymatic function, and more exactly enzyme structure, we can assume that there is a certain relationship between enzyme structure and the working conditions of an enzymatic reaction. This assumption lays the foundation for predicting the working conditions of an enzymatic reaction from enzyme structure.

Another effort made by the scientific community is to build comprehensive databases of enzymes with their functional parameters in enzymatic reactions, for example, Km and pH. However, even such a comprehensive database cannot include all the parameters for all enzymes, simply because many enzymatic parameters are not documented in the literature.

Although measuring a working condition is not difficult in experiments, a measurement differs from a prediction in two ways. First, they differ in time: a measurement concerns the past, while a prediction concerns the future. Second, they differ in mechanism: a measurement relates to the mechanism of the enzymatic reaction, while a prediction relates to the enzyme structure-function relationship.

The β-glucosidase (EC 3.2.1.21) plays an important role in biological processes because it cleaves the β-linkages of glucose-containing molecules [3], of which cellulose has received much recent attention because of its role in biofuels [4]. With such great interest, more efforts are being made not only to search for new β-glucosidases but also to mutate known ones, so there are more and more β-glucosidases with clearly annotated primary structures but without documented working conditions for their enzymatic reactions, for example, pH optimum. This provides a good case for developing models to predict the pH optimum for the activity of newly mutated and newly found β-glucosidases, because the prediction of pH for protein stability has been a research focus for years [5 - 8]. Therefore, the prediction of the pH optimum for enzyme reactions would advance our current knowledge from the structure-function relationship to the structure-environment relationship.

In this study, we attempted to use the knowledge about amino-acid features from β-glucosidase sequences to predict the pH optimum for the activity of β-glucosidases.


2.1. Data

The β-glucosidases (EC 3.2.1.21) were retrieved from the Comprehensive Enzyme Information System BRENDA [9]. In this databank, only 34 β-glucosidases were found with pH optimum listed as a functional parameter, two of which are documented together with their mutants [10,11]. In addition, two pH values are documented for each of the β-glucosidases A9UIG0, B5TWK3, and P96316. In total, this databank provides 44 matched β-glucosidases with their pH values, while the sequences of the β-glucosidases were obtained from UniProt [12].

2.2. Predictors

For most enzymes, we generally have only their primary structure because knowledge of secondary, tertiary, and quaternary structures requires a considerable amount of experimental work. Therefore, the prediction at this stage focuses on using knowledge of amino acids from enzyme sequences. We use several amino-acid properties listed in Table 1 as predictors. The values in Table 1 reflect various aspects of amino acids [13], for example, the spatial properties listed from row 3 to row 6 [14,15]; the hydrophobic properties listed from row 7 to row 11 [16 - 18]; the electronic properties listed from row 12 to row 18 [19]; and the amino-acid based secondary structure predictions listed from row 19 to row 25, which depend on assigning a set of predicted values to a residue and then applying a simple algorithm [20].

A particular characteristic of Table 1 is that the values are constant regardless of amino-acid position in a protein, neighboring amino acids, protein length, etc. This is understandable because these properties do not change in these respects; for example, an amino acid’s physicochemical property is the same no matter where the amino acid is located in a protein. Because the amino-acid composition differs from one β-glucosidase to another, we weight the values listed in Table 1 by multiplying them by the amino-acid composition of each β-glucosidase.
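The weighting step can be illustrated with a minimal Python sketch. It makes several assumptions that the text does not confirm: the composition is taken as the fraction of each amino acid in the sequence, the Kyte-Doolittle hydropathy index [16] stands in for one row of Table 1 purely as an example, and the names `composition` and `weighted_inputs` are hypothetical.

```python
from collections import Counter

# Fixed ordering of the 20 amino acids, so every enzyme yields a 20-value input.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index [16]; used here only as an example property.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def composition(sequence):
    """Fraction of each of the 20 amino acids (assumed definition of composition)."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def weighted_inputs(sequence, property_values):
    """Multiply each constant property value by the enzyme's own composition."""
    comp = composition(sequence)
    return [property_values[aa] * comp[aa] for aa in AMINO_ACIDS]
```

Calling `weighted_inputs(sequence, HYDROPATHY)` on a UniProt sequence string gives one 20-value input vector per enzyme; amino acids absent from the sequence contribute zero.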

Besides the classical knowledge listed in Table 1, there is also the amino-acid distribution probability, which is based on the occupancy of subpopulations and partitions [21] and reflects the random aspect of amino-acid distribution along a protein (for review and textbook, see [22 - 26]). The difference is that the amino-acid distribution probability does not assign each amino acid a constant value, as Table 1 does, but a value that depends on the length of the enzyme and the position of each amino acid. Table 2 shows this difference.
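As an illustration of how such a position-dependent value might be obtained, the following Python sketch evaluates the occupancy formula quoted in the caption of Table 2. The partitioning scheme is an assumption made here, not a detail stated in the text: for an amino acid occurring r times, the sequence is cut into n = r partitions of near-equal length. The function name `distribution_probability` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, prod


def distribution_probability(sequence, residue):
    """Occupancy probability r!/(q0!...qn!) * r!/(r1!...rn!) * n**(-r).

    Assumption: the number of partitions n equals the number r of the
    residue, and partitions are near-equal slices of the sequence.
    """
    length = len(sequence)
    positions = [i for i, aa in enumerate(sequence) if aa == residue]
    r = len(positions)
    if r == 0:
        return None  # residue absent: the probability is not defined here
    n = r
    # r_counts[k] = how many copies of the residue fall in partition k
    occupancy = Counter(pos * n // length for pos in positions)
    r_counts = [occupancy.get(k, 0) for k in range(n)]
    # q_counts[j] = how many partitions hold exactly j copies
    q_counts = Counter(r_counts)
    q_term = Fraction(factorial(r), prod(factorial(q) for q in q_counts.values()))
    r_term = Fraction(factorial(r), prod(factorial(ri) for ri in r_counts))
    # Exact rational arithmetic avoids overflow for abundant residues.
    return float(q_term * r_term / n**r)
```

Unlike the constant values in Table 1, the result changes when the same residues are moved to other positions or when the sequence length changes, which is the property the text attributes to this predictor.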

2.3. Predictive Model

Because the predictors are directly related to the 20 types of amino acids, it is natural to consider a predictive model that couples 20 inputs of knowledge on amino acids with a single output, the documented pH optimum. As this predictive model advances a step from the structure-function to the structure-environment relationship, we chose a 20-1 feedforward backpropagation neural network [27,28] to account for this hidden and implicit relationship after extensive work on model selection (Figure 1).
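To make the 20-1 architecture concrete, here is a dependency-free Python sketch of a network with 20 neurons in the first layer and one in the second. It is not the MATLAB toolbox implementation used in the study [27,28]: the tanh transfer function in the first layer, the linear output neuron, the uniform initialization range, the learning rate and the plain per-sample gradient descent are all assumptions chosen for illustration. The dictionary keys mirror the labels IW, LW, b1 and b2 of Figure 1.

```python
import math
import random


def init_network(n_inputs=20, n_hidden=20, seed=0):
    """Random initialization of weights and biases (assumed range -0.5..0.5)."""
    rng = random.Random(seed)
    return {
        "IW": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "LW": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": rng.uniform(-0.5, 0.5),
    }


def forward(net, x):
    """Return the first-layer activations and the single predicted output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["IW"], net["b1"])]
    output = sum(w * h for w, h in zip(net["LW"], hidden)) + net["b2"]
    return hidden, output


def train_epoch(net, samples, lr=0.01):
    """One epoch of per-sample backpropagation.

    Returns the mean squared error, measured before each weight update.
    """
    total = 0.0
    for x, target in samples:
        hidden, output = forward(net, x)
        err = output - target
        total += err * err
        for j, h in enumerate(hidden):
            # Error propagated back through the tanh neuron (old LW value).
            grad_hidden = err * net["LW"][j] * (1.0 - h * h)
            net["LW"][j] -= lr * err * h
            net["b1"][j] -= lr * grad_hidden
            row = net["IW"][j]
            for i, xi in enumerate(x):
                row[i] -= lr * grad_hidden * xi
        net["b2"] -= lr * err
    return total / len(samples)
```

Each sample is a pair of a 20-value input vector and a documented pH optimum. Calling `train_epoch` repeatedly and recording the returned error gives one convergence curve per random initialization.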

2.4. Validation of Predictions

The second column in Table 3 lists all 44 β-glucosidases obtained from the databank, of which 30 were used as the training group to generate the model parameters (the weights and biases of the neural network), and 14 were used as the validation group to validate the neural network with the generated weights and biases. This is a very traditional approach to validation in neural networks.

The second approach to validation is the delete-1 observation jackknife: each time, we use 43 β-glucosidases as the training group to generate model parameters and then validate the prediction on the omitted β-glucosidase, until all 44 β-glucosidases have undergone the same procedure. It is said to be the most effective approach to validation [29], although it is labor-intensive and time-consuming.

The third approach to validation is cross-validation, in which the 44 β-glucosidases were split into 11 subsets containing 4 cases each or 4 subsets containing 11 cases each. Each time, ten or three subsets were used to generate the model parameters and one subset was used for validation; this procedure was repeated in turn until each subset had served for validation [29].
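The jackknife and subset-based schemes differ only in how the 44 cases are divided, which the following Python sketch shows by generating index splits. The random shuffling with a fixed seed and the function names are assumptions for illustration; the text does not say how cases were assigned to subsets.

```python
import random


def jackknife_splits(n_cases):
    """Delete-1 jackknife: each case is held out exactly once."""
    for held_out in range(n_cases):
        train = [i for i in range(n_cases) if i != held_out]
        yield train, [held_out]


def subset_splits(n_cases, n_subsets, seed=0):
    """Cross-validation: one subset validates, the remaining subsets train."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # assumed random assignment
    subsets = [indices[k::n_subsets] for k in range(n_subsets)]
    for k, validation in enumerate(subsets):
        train = [i for j, subset in enumerate(subsets) if j != k
                 for i in subset]
        yield train, validation
```

With 44 cases, `jackknife_splits(44)` yields 44 splits of 43 training cases each, `subset_splits(44, 11)` yields 11 splits with 4 validation cases each, and `subset_splits(44, 4)` yields 4 splits with 11 validation cases each.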

2.5. Statistics

For each predictor, we generated 100 sets of model parameters so that the predictions based on them would yield an approximately normally distributed mean ± SD to compare with the documented pH optimum of activity for each β-glucosidase [30]. For normally distributed data, the Student’s t-test was used, and for non-normally distributed data, the non-parametric Mann-Whitney U-test was used. P < 0.05 was considered statistically significant. For visual comparison, linear regression was also used to evaluate the predicted pH values against their documented ones.
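The descriptive part of this procedure can be sketched in Python using only the standard library. The sketch summarizes a set of repeated predictions as mean and SD and fits an ordinary least-squares line of predicted against documented pH; the significance tests themselves are not reproduced here, and the function names are hypothetical.

```python
from statistics import mean, stdev


def summarize(predictions):
    """Mean and sample standard deviation of repeated predictions."""
    return mean(predictions), stdev(predictions)


def linear_fit(documented, predicted):
    """Least-squares line: predicted = slope * documented + intercept."""
    x_bar, y_bar = mean(documented), mean(predicted)
    sxx = sum((x - x_bar) ** 2 for x in documented)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(documented, predicted))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar
```

A slope close to 1 and an intercept close to 0 indicate that the predicted values track the documented ones, which is how the regression equations reported for the training and validation groups can be read.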


It is highly likely that the relationship between amino-acid features and the pH optimum of activity is at least one step beyond the structure-function relationship, which might imply that we need at least two layers in the neural network to account for this hidden and implicit relationship (Figure 1). Technically, the development of a predictive method includes (1) selection of predictors and (2) selection of predictive models, and the general and efficient practice is to select predictors first and then to select predictive models.

Working with this network model, the next consideration is the training process, which once again guarantees a fair selection of predictors. Technically, both the initialization of weights and biases and the number of training epochs govern whether the neural network can converge. We used the random initialization function to initialize weights and biases, and 250 training epochs for convergence. Figure 2 displays the convergence performance in the training group, where each line represents a training process with random initialization of weights and biases running for 250 training epochs. Different predictors have different convergence profiles. As seen, convergence can be reached for 11 predictors within 250 training epochs with any random initialization, which guarantees our training process. However, convergence is not possible for the other predictors after 100 epochs, as shown for the amino-acid composition in the top-left panel of Figure 2.

Table 2. Inductive effect scale, amino-acid number and distribution probability in β-glucosidase A9UIG0 and Q4U4W7. The amino-acid distribution probability is computed according to the equation $\frac{r!}{q_0! \times q_1! \times \cdots \times q_n!} \times \frac{r!}{r_1! \times r_2! \times \cdots \times r_n!} \times n^{-r}$, where ! is the factorial function, r is the number of a type of amino acid, q is the number of partitions with the same number of amino acids and n is the number of partitions in the protein for a type of amino acid. The computation can be found in the web site (

Figure 1. 20-1 feedforward backpropagation neural network to account for the relationship between features of primary structure of β-glucosidase, labeled with amino acid abbreviations, and pH optimum. Each triangle represents a neuron. IW{1} is the input weights, LW{2,1} is the layer weights to the second layer from the first layer. b{1} and b{2} are the biases related to each neuron at the first and second layers.

Figure 3 shows that the percentage of correctly predicted pH optima ranges from about 70% to 90% depending on the feature. Figure 3 thus gives a basic idea of which predictor is more effective in predicting the pH optimum of activity. Accordingly, the amino-acid distribution probability is better than the others. This is because the amino-acid features other than the distribution probability do not depend on position in the amino-acid sequence or on neighboring amino acids, whereas the amino-acid distribution probability is sensitive to these conditions.

Proteins evolve to function in specific cellular environments; thus the pH of activity is subject to evolutionary pressure [5]. If the pH level can change the conformation of β-glucosidase, then different pH levels could produce slight differences in its conformation. In this context, the explanation for the results in Figure 3 would be that the amino-acid distribution probability, because it reflects the randomness in the enzyme, more accurately reflects the changes in the conformation of β-glucosidase due to different pH levels, while the other predictors, because they have constant values (for example, physicochemical properties), reflect these changes less accurately.

Table 3 compares the documented and predicted pH optimum for each β-glucosidase found in the database [9]. We consider a predictor workable if there is no statistical difference between the documented and predicted pH optimum. The last row of Table 3 shows the overall performance, where the amino-acid distribution probability gives better predictive results than the other predictors, whose results are similar to one another.

Figure 2. Convergence of mean squared error performance function with 100 different initial weights and biases generated by random initialization function.

In Figure 4, we used the regression between documented and predicted pH optimum of activity to visualize the predictive performance using the amino-acid distribution probability as predictor in order to confirm our observation visually.

To further validate the above findings, we applied the delete-1 jackknife validation and 3-fold and 10-fold cross-validation to these predictors, as shown in Figure 5, where we once again find that the best predictor is the amino-acid distribution probability.

Figure 3. Percentage of correct predictions using different predictors.

Figure 4. Linear regression between documented and predicted pH optimum of activity in training and validation groups with the amino-acid distribution probability as predictor. For training group, predicted pH = 0.9947 × recorded pH + 0.0308 (P < 0.001). For validation group, predicted pH = 0.8492 × recorded pH + 0.6748 (P = 0.003).

Figure 5. Percentage of correct predictions with delete-1 jackknife validation, 10-fold and 3-fold cross-validation using different predictors. AA represents amino acid.

The predictors used in this study include some related to the amino-acid based secondary structure of β-glucosidase; however, these features do not render better predictions than the others. This may open the possibility of using amino-acid features to predict various working conditions for enzymes, and even of using information about primary structure to predict changing environments. This is so because other studies have confirmed that a physicochemical metric of charge distribution correlates better with subcellular pH [6]. The amino-acid composition is one of two factors that influence the pH of maximal protein stability [7] and can empirically model the pH optimum of protein-protein binding [8].

Turning to validation, one technical detail is puzzling: a comparison of Figure 3 with Figure 5 shows that the jackknife validation performs worse than the traditional validation. This is interesting because the jackknife uses almost all the samples to generate model parameters, yet produces worse predictions. This is counter-intuitive, because current knowledge indicates that the larger the training data, the greater the chance that the predicted sample is covered, and the better the prediction. Clearly, this issue requires further study.

Currently it is not clear whether different pH optima imply that β-glucosidases have different structures. If so, our prediction still falls within the so-called structure-function relationship; if not, our prediction would suggest a more sophisticated mechanism between structure and enzymatic working condition, i.e. a structure-environment relationship.

In conclusion, this study suggests that we can use the amino-acid features of β-glucosidases to predict their pH optimum of activity. Among the 25 amino-acid features screened, 11 can serve as predictors to estimate the pH optimum, including 6/7 of the electronic properties and 4/7 of the amino-acid based secondary structure predictions, and the amino-acid distribution probability achieves better predictions than the other predictors. However, the amino-acid composition, electronic properties and secondary structure predictions cannot work in the neural network model on their own; they must be weighted with the amino-acid composition. Thus, the amino-acid distribution probability reveals its advantage in the prediction, indicating that random mechanisms may underlie enzyme reactions. The model provides the possibility of using amino-acid features to predict various working conditions for enzymes.


This study was supported by the National Natural Science Foundation of China (31560315) and the Key Project of Guangxi Scientific Research and Technology Development Plan (AB17190534).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.


[1] Wang, T., Wu, M.B., Lin, J.P. and Yang, L.R. (2015) Quantitative Structure-Activity Relationship: Promising Advances in Drug Discovery Platforms. Expert Opinion on Drug Discovery, 10, 1283-1300.
[2] Xue, L.C., Dobbs, D., Bonvin, A.M. and Honavar, V. (2015) Computational Prediction of Protein Interfaces: A Review of Data Driven Methods. FEBS Letters, 589, 3516-3526.
[3] Jeng, W.Y., Wang, N.C., Lin, M.H., Lin, C.T., Liaw, Y.C., Chang, W.J., Liu, C.I., Liang, P.H. and Wang, A.H.J. (2011) Structural and Functional Analysis of Three Beta-Glucosidases from Bacterium Clostridium cellulovorans, fungus Trichoderma reesei and Termite Neotermes koshunensis. Journal of Structural Biology, 173, 46-56.
[4] Sticklen, M. (2006) Plant Genetic Engineering to Improve Biomass Characteristics for Biofuels. Current Opinion in Biotechnology, 17, 315-319.
[5] Talley, K. and Alexov, E. (2010) On the pH-Optimum of Activity and Stability of Proteins. Proteins: Structure, Function, and Bioinformatics, 78, 2699-2706.
[6] Garcia-Moreno, B. (2009) Adaptations of Proteins to Cellular and Subcellular pH. Journal of Biology, 8, 98.
[7] Alexov, E. (2004) Numerical Calculations of the pH of Maximal Protein Stability. The Effect of the Sequence Composition and Three-Dimensional Structure. European Journal of Biochemistry, 271, 173-185.
[8] Mitra, R.C., Zhang, Z. and Alexov, E. (2011) In Silico Modeling of pH-Optimum of Protein-Protein Binding. Proteins: Structure, Function, and Bioinformatics, 79, 925-936.
[9] Placzek, S., Schomburg, I., Chang, A., Jeske, L., Ulbrich, M., Tillack, J. and Schomburg, D. (2017) BRENDA in 2017: New Perspectives and New Tools in BRENDA. Nucleic Acids Research, 45, D380-D388.
[10] Berrin, J.G., Czjzek, M., Kroon, P.A., McLauchlan, W.R., Puigserver, A., Williamson, G. and Juge, N. (2003) Substrate (Aglycone) Specificity of Human Cytosolic Beta-Glucosidase. Biochemical Journal, 373, 41-48.
[11] Tsukada, T., Igarashi, K., Fushinobu, S. and Samejima, M. (2008) Role of Subsite +1 Residues in pH Dependence and Catalytic Activity of the Glycoside Hydrolase Family 1 β-Glucosidase BGL1A from the Basidiomycete Phanerochaete chrysosporium. Biotechnology and Bioengineering, 99, 1295-1302.
[12] UniProt Consortium (2019) UniProt: A Worldwide Hub of Protein Knowledge. Nucleic Acids Research, 47, D506-D515.
[13] Burlingame, A.L. and Carr, S.A. (1996) Mass Spectrometry in the Biological Sciences. Humana Press, Totowa, NJ.
[14] Zamyatin, A.A. (1972) Protein Volume in Solution. Progress in Biophysics & Molecular Biology, 24, 107-123.
[15] Darby, N.J. and Creighton, T.E. (1993) Dissecting the Disulphide-Coupled Folding Pathway of Bovine Pancreatic Trypsin Inhibitor. Forming the First Disulphide Bonds in Analogues of the Reduced Protein. Journal of Molecular Biology, 232, 873-896.
[16] Kyte, J. and Doolittle, R.F. (1982) A Simple Method for Displaying the Hydropathic Character of a Protein. Journal of Molecular Biology, 157, 105-132.
[17] Trinquier, G., Sanejouand, Y.H. and Hausman, R.E. (1998) Which Effective Property of Amino Acids Is Best Preserved by the Genetic Code? Protein Engineering, Design and Selection, 11, 153-169.
[18] Cooper, G.M. (2004) The Cell: A Molecular Approach. ASM Press, Washington DC, 51.
[19] Dwyer, D.S. (2005) Electronic Properties of Amino Acid Side Chains: Quantum Mechanics Calculation of Substituent Effects. BMC Chemical Biology, 5, 2.
[20] Chou, P.Y. and Fasman, G.D. (1978) Prediction of Secondary Structure of Proteins from Amino Acid Sequence. Advances in Enzymology and Related Subjects of Biochemistry, 47, 45-148.
[21] Feller, W. (1968) An Introduction to Probability Theory and Its Applications. 3rd Edition, Wiley, New York.
[22] Wu, G. and Yan, S.M. (2002) Randomness in the Primary Structure of Protein: Methods and Implications. Molecular Biology Today, 3, 55-69.
[23] Wu, G. and Yan, S. (2006) Mutation Trend of Hemagglutinin of Influenza A Virus: A Review from Computational Mutation Viewpoint. Acta Pharmacologica Sinica, 27, 513-526.
[24] Wu, G. and Yan, S. (2006) Fate of Influenza A Virus Proteins. Protein & Peptide Letters, 13, 377-384.
[25] Yan, S. and Wu, G. (2010) Creation and Application of Computational Mutation. Journal of Guangxi Academy of Sciences, 17, 145-150.
[26] Wu, G. and Yan, S. (2008) Lecture Notes on Computational Mutation. Nova Science Publishers, New York.
[27] Demuth, H. and Beale, M. (2001) Neural Network Toolbox for Use with MatLab. User’s Guide, Version 4.
[28] MathWorks Inc (1984-2001) MatLab—The Language of Technical Computing.
[29] Chou, K.C. (2011) Some Remarks on Protein Attribute Prediction and Pseudo Amino Acid Composition. Journal of Theoretical Biology, 273, 236-247.
[30] Sokal, R.R. and Rohlf, F.J. (1995) Biometry: The Principles and Practices of Statistics in Biological Research. 3rd Edition, W. H. Freeman, New York, 203-218.

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.