Predictors for Predicting Temperature Optimum in Beta-Glucosidases

This study continues our work on beta-glucosidase, an enzyme that plays an important role in biological processes and has recently attracted strong interest for its potential role in biofuel production. To develop simple methods for predicting the optimal working conditions of beta-glucosidase, we used a 20-1 feedforward backpropagation neural network to screen 25 amino-acid properties related to the primary structure of beta-glucosidases as possible predictors of the temperature optimum. The results show that the normalized polarizability index and the amino-acid distribution probability can predict the temperature optimum of beta-glucosidase, which points to a cost-effective way to predict various enzymatic parameters of beta-glucosidase.


INTRODUCTION
β-Glucosidase (EC 3.2.1.21) plays an important role in biological processes because it hydrolyzes β-glycosidic linkages to release glucose molecules [1]. For example, mutations in the gene encoding the lysosomal enzyme acid beta-glucosidase can lead to the human metabolic disorder Gaucher disease, which is characterized by deficient activity of the enzyme [2,3]. β-Glucosidase can deglycosylate isoflavones to their aglycone forms, which gives it wide applications in the food and pharmaceutical industries [4]. Recently, increasing interest has focused on its potential role in biofuel production, because cellulose is a linear biopolymer of glucose molecules connected by β-1,4-glycosidic bonds, and its enzymatic hydrolysis requires mixtures of hydrolytic enzymes including endoglucanases, exoglucanases (cellobiohydrolases), and β-glucosidases [5]. Therefore, great efforts have been made to develop renewable biofuel by enzymatically hydrolyzing carbohydrate polymers in biomass to sugars and fermenting them to ethanol [6].
Generally speaking, the optimal working conditions for enzymes are determined through experimental approaches, which are costly and time-consuming. The pace of experimental characterization now clearly lags behind the growth of enzyme databases: in 2002 only 789 enzymes were documented in the Comprehensive Enzyme Information System BRENDA [7,8], whereas the database currently covers enzymes from 33,721 organisms. As a result, many enzymes have sequence information but no documented optimal working conditions. It is therefore attractive to develop methods that predict the optimal working conditions of enzymes from their primary structure, and we have recently conducted several studies on predicting functional parameters of enzymes using amino-acid properties, including pH optimum [9][10][11][12], temperature optimum [11][12][13][14][15], Michaelis-Menten constant [16][17][18] and turnover number [19]. However, more studies are needed to reach solid conclusions. The aim of this study is to identify predictors that are useful for predicting the temperature optimum of β-glucosidase.

Data
In the Comprehensive Enzyme Information System BRENDA, 37 β-glucosidases (EC 3.2.1.21) have sequence information under the category of temperature optimum, of which one β-glucosidase is documented together with its mutant [20,21]. In addition, two temperature values are documented for each of the β-glucosidases B5TWK3 (22°C and 37°C) [22] and Q12715 (65°C and 70°C) [23]. In total, this databank provides 40 matched pairs of sequences and temperature values for β-glucosidases. The amino-acid sequences of the β-glucosidases were obtained from the Universal Protein Resource (UniProt) [24].
Table 1 lists the amino-acid properties to be scanned, which cover charge, hydrophilicity or hydrophobicity, size and functional groups, all of which are crucial for protein structure and protein-protein interactions [25]. These properties are related to the primary structure of enzymes and include the spatial properties [26,27] listed in rows 2-5 of Table 1, the hydrophobic properties [28][29][30] listed in rows 6-10, the electronic properties [31] listed in rows 11-17, and the secondary structure predictions [32] listed in rows 18-24. Each property assigns one fixed value to each amino acid, so the raw values cannot by themselves distinguish different β-glucosidases. Because each β-glucosidase has its own amino-acid composition, we multiplied the values listed in Table 1 by the amino-acid composition of each β-glucosidase.
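The composition weighting described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "amino-acid composition" means the fraction of each residue type in the sequence, and the property scale `toy_scale` and the sequence `"AAGV"` are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from collections import Counter

# Residue order follows the Table 1 caption (A, R, N, D, C, E, Q, G, ...).
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"

def composition(sequence):
    """Return the fraction of each of the 20 amino acids in a sequence.

    Assumption: composition is expressed as a fraction, not a raw count.
    """
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def weighted_property(sequence, prop):
    """Multiply each amino acid's fixed property value by its composition."""
    comp = composition(sequence)
    return [prop[aa] * comp[aa] for aa in AMINO_ACIDS]

# Hypothetical property scale: every residue has value 1.0 except alanine.
toy_scale = {aa: 1.0 for aa in AMINO_ACIDS}
toy_scale["A"] = 2.0

# Hypothetical four-residue sequence; real inputs would be UniProt sequences.
features = weighted_property("AAGV", toy_scale)
```

The result is a vector of 20 values per enzyme, one per amino-acid type, which is the shape of input that a 20-input network expects.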

Possible Predictors
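Based on the occupancy of subpopulations and partitions [33], we have developed a measure of the amino-acid distribution probability, calculated according to the following equation:
$$\frac{r!}{q_0! \times q_1! \times \cdots \times q_n!} \times \frac{r!}{r_1! \times r_2! \times \cdots \times r_n!} \times n^{-r}$$
where ! is the factorial function, r is the number of a type of amino acid, q is the number of partitions with the same number of amino acids, n is the number of partitions in the protein for a type of amino acid, and r_1, ..., r_n are the numbers of that amino acid in the individual partitions.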
The calculation is available at http://www.gxas.cn/dp.htm. Each type of amino acid has its own distribution probability, as shown in the example in Table 2. However, the same type of amino acid can have different values in different proteins, depending on its actual distribution pattern along the protein sequence [34][35][36][37][38].
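As a rough illustration of such a measure, the sketch below evaluates the classical occupancy probability for r items placed in n partitions, in which the first factor is n! divided by the factorials of the numbers of partitions sharing the same count (with n = r partitions this coincides with r!). Two things are assumptions rather than facts from the text: that the protein is cut into n = r equal-length partitions for an amino acid occurring r times, and the helper names `count_per_partition` and `occupancy_probability`, which are hypothetical.

```python
from collections import Counter
from math import factorial

def count_per_partition(sequence, amino_acid):
    """Count one amino acid type in each of r equal-length partitions.

    Assumption: the sequence is split into r partitions of equal length,
    where r is the number of occurrences of the amino acid.
    """
    positions = [i for i, aa in enumerate(sequence) if aa == amino_acid]
    r = len(positions)
    counts = [0] * r
    for pos in positions:
        counts[pos * r // len(sequence)] += 1
    return counts

def occupancy_probability(counts):
    """Probability of an occupancy pattern under random placement."""
    n = len(counts)                 # number of partitions
    r = sum(counts)                 # number of residues of this type
    q = Counter(counts)             # q[k] = partitions holding k residues
    ways_pattern = factorial(n)
    for m in q.values():
        ways_pattern //= factorial(m)
    ways_residues = factorial(r)
    for c in counts:
        ways_residues //= factorial(c)
    return ways_pattern * ways_residues / n ** r

# Two alanines spread over both halves vs. both in the first half.
p_spread = occupancy_probability(count_per_partition("AGAG", "A"))
p_clustered = occupancy_probability(count_per_partition("AAGG", "A"))
```

For three residues in three partitions, the patterns (1,1,1), (2,1,0) and (3,0,0) receive probabilities 6/27, 18/27 and 3/27, which sum to one as a probability distribution should.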

Predictive Model
To find possible predictors of the temperature optimum of β-glucosidases, a 20-1 feedforward backpropagation neural network was used as the predictive model [39]; its structure is shown in Figure 1. In this model, the first layer contains 20 neurons corresponding to 20 inputs (or 20 elements of input in neural network terminology), which can be any measure related to the 20 types of amino acids. The second layer contains a single neuron corresponding to the single output, the temperature optimum. The transfer functions are tan-sigmoid for the first layer and linear for the second layer. The training algorithm is resilient backpropagation, which is the fastest algorithm for pattern recognition in MATLAB [40].
Table 1. Features of amino acids used as predictors. A, alanine; R, arginine; N, asparagine; D, aspartic acid; C, cysteine; E, glutamic acid; Q, glutamine; G, glycine; H, histidine; I, isoleucine; L, leucine; K, lysine; M, methionine; F, phenylalanine; P, proline; S, serine; T, threonine; W, tryptophan; Y, tyrosine; V, valine. σI: Inductive effect scale; H M ΔPH: Normalized Mulliken population data for the amino-acid side chains in the context of phenol; σR: Resonance effect scale; σα: Normalized polarizability index; σF: Field effect index; AI: Additional scale; f(i): Frequency of the 1st residue in turn; f(i + 1): Frequency of the 2nd residue in turn; f(i + 2): Frequency of the 3rd residue in turn; f(i + 3): Frequency of the 4th residue in turn.
Figure 1. The 20-1 feedforward backpropagation neural network relating 20 pieces of information on the primary structure of β-glucosidase, which are labeled using the symbols of the 20 types of amino acids, to its temperature. Each diamond represents a neuron. IW{1} denotes the input weights, and LW{2,1} denotes the layer weights to the second layer from the first layer. b{1} and b{2} are the biases of the neurons in the first and second layers.
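The forward pass of a network with this layout can be sketched in a few lines. This is only an illustration of the architecture (20 inputs, 20 tan-sigmoid neurons, one linear output), not the MATLAB implementation used in the study, and it omits training. The names `IW`, `b1`, `LW` and `b2` mirror the labels in Figure 1; the initialization range and the input vector are assumptions.

```python
import math
import random

def forward(x, IW, b1, LW, b2):
    """Forward pass: tan-sigmoid first layer, linear second layer."""
    # tansig(n) = 2 / (1 + exp(-2n)) - 1 is mathematically equal to tanh(n).
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(IW, b1)]
    # The single output neuron is linear: weighted sum plus bias.
    return sum(w * h for w, h in zip(LW, hidden)) + b2

def random_network(n_inputs=20, n_hidden=20, seed=0):
    """Random weights and biases, uniform in [-1, 1] (an assumed range)."""
    rng = random.Random(seed)
    IW = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
          for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    LW = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = rng.uniform(-1, 1)
    return IW, b1, LW, b2

IW, b1, LW, b2 = random_network()
x = [0.05] * 20  # hypothetical input: one value per amino-acid type
prediction = forward(x, IW, b1, LW, b2)
```

Because the output layer is linear, the prediction is not bounded to the range of tanh, which is what allows the network to output temperatures on their natural scale.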

Validation of Predictions
Each predictor was passed through this predictive model with the same procedure so that the outputs could be compared statistically. Table 3 lists the 40 β-glucosidases analyzed, of which 25 were used as the training group to generate the weights and biases of the neural network, and 15 were used as the validation group to validate the neural network with the trained weights and biases. This is the traditional approach for neural networks. Then, the delete-1 observation jackknife was used, in which one observation at a time was left out of the sample set for validation, because it is the most effective method in comparison with the independent dataset test and the subsampling test, and is widely used [41]. Finally, cross-validation was used: the data were split into 10 or 4 subsets of 4 or 10 cases each, and each subset was held out in turn as the validation set [42].
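The delete-1 jackknife and the two cross-validation schemes differ only in how many cases are held out at a time, so one helper can generate all three. The sketch below assumes that held-out subsets are contiguous blocks of indices; how the cases were actually assigned to subsets is not stated here, and `delete_k_splits` is a hypothetical name.

```python
def delete_k_splits(n_samples, k):
    """Yield (training, held_out) index lists, holding out k cases at a time.

    Assumption: subsets are contiguous blocks; a random assignment would
    work equally well for this illustration.
    """
    indices = list(range(n_samples))
    for start in range(0, n_samples, k):
        held_out = indices[start:start + k]
        training = indices[:start] + indices[start + k:]
        yield training, held_out

# With 40 cases: 40 rounds for delete-1, 10 for delete-4, 4 for delete-10.
rounds = {k: list(delete_k_splits(40, k)) for k in (1, 4, 10)}
```

In each round the model would be trained on the `training` indices and evaluated on the `held_out` ones, so that every case is used for validation exactly once.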

Statistics
One hundred training runs were conducted for each predictor in the predictive model, and the resulting weights and biases were used to predict the temperature optimum 100 times. The mean and standard deviation of the predicted values were compared with the recorded temperature optimum for each β-glucosidase [43], and linear regression was also used to evaluate the predicted temperature values against the recorded ones.
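The two summaries mentioned above can be sketched as follows. This only computes the mean, the standard deviation and an ordinary least-squares line; the specific statistical test used to compare predicted with recorded values is not given in this section, so none is implemented. The sample standard deviation (n - 1 denominator) is an assumption, and the numbers in the example are made up for illustration.

```python
from statistics import mean, stdev

def summarize(predictions):
    """Mean and sample standard deviation of repeated predictions."""
    return mean(predictions), stdev(predictions)

def linear_fit(recorded, predicted):
    """Least-squares fit of predicted on recorded values.

    Returns (slope, intercept, correlation coefficient r).
    """
    mx, my = mean(recorded), mean(predicted)
    sxx = sum((x - mx) ** 2 for x in recorded)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(recorded, predicted))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Made-up numbers: three repeated predictions for one enzyme, and
# mean predictions for three enzymes against their recorded values.
avg, sd = summarize([49.0, 50.0, 51.0])
slope, intercept, r = linear_fit([30.0, 50.0, 70.0], [35.0, 50.0, 65.0])
```

A slope near 1 and an intercept near 0 would indicate that predicted values track the recorded ones without systematic bias.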

RESULTS AND DISCUSSION
Theoretically, the neural network displayed in Figure 1 can account for various linear and nonlinear relationships between the amino-acid properties of the primary structure and the temperature optimum of β-glucosidases. This allows various predictors to be screened regardless of whether the relationship between a predictor and temperature is linear or nonlinear [39].
Technically, the initialization of the weights and biases and the number of training epochs govern whether the neural network converges during training. Here the weights and biases were initialized by a random initialization function, and 250 training epochs were conducted. Only 4 of the 25 amino-acid properties converged; they are shown in Figure 2, where each line represents one training run consisting of a random initialization of the weights and biases followed by 250 training epochs. As can be seen, convergence was reached within 250 training epochs from any random initialization, which provides the basis for a reliable training process and indicates that these 4 properties can serve as predictors of the temperature optimum of β-glucosidases. However, different predictors have different convergence profiles, and the profiles for the amino-acid distribution probability (bottom panel) converge within a narrower range than the others. Table 3 compares the recorded temperature optimum with the predicted temperature optimum for the 40 β-glucosidases. A predictor is considered workable if there is no statistical difference between the recorded and predicted temperature optimum; accordingly, predicted values that do not differ statistically from the recorded values are marked with an asterisk. The last row of Table 3 summarizes the overall performance, where it can be seen that the normalized polarizability index (σα) and the amino-acid distribution probability work better than the other two. Figure 3 displays the percentage of β-glucosidases with correctly predicted temperature during the training process.
As can be seen, the amino-acid distribution probability worked best in the training group, where the temperature optimum of all β-glucosidases was correctly predicted, followed by the normalized polarizability index (88%), whereas only the normalized polarizability index reached 60% correct predictions in the validation group. Figure 4 shows the regression between recorded and predicted temperature optimum using these four amino-acid properties as predictors. Figure 5 shows the results of the delete-1, delete-4 and delete-10 jackknife validations, where it can be seen that both the normalized polarizability index and the amino-acid distribution probability performed better, and that there was generally no significant difference between the different deletions.
Table 3 notes. The predicted temperature optimum is presented as mean ± SD of 100 predictions. AA, amino-acid composition; AA DP, amino-acid distribution probability; *, no statistical difference from the recorded temperature optimum.

do have a promising prospect for predicting the optimal working conditions of enzymes from information related to the enzyme primary structure. Certainly, further efforts are needed to explore a cost-effective way to predict various enzymatic parameters of β-glucosidases.

FUND
This study was supported by the National Natural Science Foundation of China (31560315) and the Key Project of Guangxi Scientific Research and Technology Development Plan (AB17190534).