MetalloPred: A tool for hierarchical prediction of metal ion binding proteins using cluster of neural networks and sequence derived features
1. INTRODUCTION
Metalloprotein is a generic term for a protein that contains a metal ion cofactor. Metalloproteins have captivated chemists and biochemists, particularly since the 1950s, when the first X-ray crystal structure of a protein, sperm whale myoglobin, indicated the presence of an iron atom [1]. The metal ion is usually coordinated by nitrogen, oxygen or sulfur atoms belonging to amino acids in the polypeptide chain and/or a macrocyclic ligand incorporated into the protein [2,3]. The presence of the metal ion allows metalloenzymes to perform functions such as redox reactions that cannot be performed by the limited set of functional groups found in amino acids [1]. Metalloproteins play important roles in structural stability and complex formation [4-8], gene expression regulation and alteration [9-12], DNA processing [13], signaling processes and cellular events [14], transport [11,15,16], metabolism control [15,17], antibody recognition [18] and other biological processes such as cellular respiration, photosynthesis, nitrogen fixation and antioxidant defense [19]. Approximately one third of structurally determined proteins are metalloproteins [20]. Much effort has been devoted to understanding the structure and function of these proteins.
Traditionally, metalloproteins have been identified by experimental techniques such as absorbance spectroscopy [21], gel electrophoresis [22], metal-affinity columns and shift assay [23], chromatography [24], mass spectroscopy [22], NMR [9] and combined spectroscopic studies [25]. These techniques require purified or semi-purified proteins of interest and therefore do not facilitate identification of unknown proteins from a complex mixture, or they require multi-step processes and very specialized equipment, which limits their range of application. There is therefore a need to explore alternative methods for identifying metalloproteins that complement these experimental methods. With the exponential growth of sequence data, characterizing all of these sequences experimentally has become a cumbersome, practically insurmountable task. It is thus desirable to explore automated computational methods for the annotation of novel protein sequences. Several sequence-based computational methods have been explored, based on similarity search, sequence motifs of metal-binding sites [26,27] and multiple sequence alignments against known metalloproteins [28]. Because of the sequence, structural and functional diversity of metalloproteins [4-8,14-17], it is desirable to explore additional methods that predict metalloproteins directly from sequence or sequence-derived properties. For a newly found protein sequence, the most interesting thing to know is its biological function, and hence the following questions are often asked: Is the query protein a metalloprotein or a non-metalloprotein? If it is a metalloprotein, which main functional class does it belong to? Going further, which sub-functional class does it belong to?
The present study was initiated in an attempt to develop a top-down approach to answer all these questions and to make it accessible to the vast majority of experimental scientists by providing a user-friendly web server. In this study, we have developed a cascade of artificial neural network (NN) prediction systems for metalloproteins. The generalized classification obtained by the method suggests that MetalloPred could be useful as a starting point in initial screening and ab initio prediction of metalloproteins, and, in combination with comparative studies on completed genomic sequences, it could give further insight into the evolution of protein structure and function.
2. MATERIALS AND METHODS
2.1. Preparation of Dataset
All metalloproteins used in this study were collected from a comprehensive search of the Protein Data Bank (www.rcsb.org). A total of 14625 metalloprotein sequences were obtained and classified into calcium-binding (3466), magnesium-binding (2886), potassium-binding (173), sodium-binding (157), cobalt-binding (200), copper-binding (887), manganese-binding (968), molybdenum-binding (134), nickel-binding (147), vanadium-binding (11), zinc-binding (4861) and iron-binding (328). This data set was further refined by discarding protein sequences shorter than 20 amino acids, as they are very unlikely to form a proper pocket to coordinate a metal ion. Proteins found to bind more than one metal ion were also discarded. To avoid biased learning in the networks, we scaled the data sets so that the inequality in the number of protein sequences per class was reduced. We reduced the proteins in each class with a similarity cutoff of 30% using BLASTClust [29]. A negative dataset consisting of 5738 protein sequences, representing non-class members, was also selected from the PDB. These datasets were divided into training, testing and independent evaluation sets (Table 1).
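The length and single-metal filters described above can be sketched as follows. This is only an illustration: the record layout (`sequence`, `metals`) and the toy entries are assumptions made for the example, and the 30% BLASTClust redundancy reduction is not modeled here.

```python
from typing import Dict, List

MIN_LENGTH = 20  # sequences shorter than 20 residues are discarded, as stated in the text


def filter_entries(entries: List[dict]) -> Dict[str, List[str]]:
    """Group sequences by bound metal, dropping short and multi-metal entries.

    Each entry is a hypothetical record of the form
    {"sequence": str, "metals": list of metal labels}.
    """
    by_metal: Dict[str, List[str]] = {}
    for entry in entries:
        sequence = entry["sequence"]
        metals = set(entry["metals"])
        if len(sequence) < MIN_LENGTH:
            continue  # too short to form a binding pocket
        if len(metals) != 1:
            continue  # binds more than one metal ion (or none)
        metal = next(iter(metals))
        by_metal.setdefault(metal, []).append(sequence)
    return by_metal


# Toy records (invented for illustration, not taken from the PDB).
toy_entries = [
    {"sequence": "M" * 25, "metals": ["ZN"]},
    {"sequence": "M" * 10, "metals": ["ZN"]},        # dropped: too short
    {"sequence": "A" * 40, "metals": ["ZN", "CA"]},  # dropped: two metals
    {"sequence": "G" * 30, "metals": ["CA"]},
]
filtered = filter_entries(toy_entries)
```

In this toy run only the first and last records survive, one in each metal class.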
2.2. Feature Extraction
The following three types of discrete feature vectors were constructed for each protein sequence.
1) Amino acid composition: given the sequence of a protein, its amino acid composition was computed and used to generate a set of 20 features representing the composition of the 20 standard amino acids in the protein sequence: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. These features have been widely used in predicting different structural classes [30-32] and subcellular localization [33-36] of proteins. The formula used to calculate amino acid composition is:

where AA(i) is the frequency of the ith amino acid.

Table 1. Number of proteins used for training and validation of MetalloPred.
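A minimal sketch of the 20-feature composition vector is given here. It assumes that the composition of residue i is its count divided by the total number of standard residues in the sequence; the exact normalization used by the authors (for example fraction versus percentage) is not recoverable from the text, so this choice is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues listed in the text


def amino_acid_composition(sequence: str) -> list:
    """Return 20 fractions, one per standard amino acid, in AMINO_ACIDS order."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored (assumption)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


comp = amino_acid_composition("ACDAAC")
```

For the toy sequence `ACDAAC`, the entry for A is 3/6, for C is 2/6 and for D is 1/6; all other entries are zero.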
2) Physicochemical properties: twelve sequence-derived properties were calculated for each protein sequence using the EMBOSS (EBI) package [37]. The parameters include molecular weight, total charge, isoelectric point, and the mole percentages of tiny (A, C, G, S, T); small (A, C, D, G, N, P, S, T, V); aliphatic (I, L, V); aromatic (F, H, W, Y); non-polar (A, C, F, G, I, L, M, P, V, W, Y); polar (D, E, H, K, N, Q, R, S, T); charged (D, E, H, K, R); acidic (D, E) and basic (H, K, R) amino acids.
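The nine residue-group mole percentages listed above can be illustrated with a short sketch. It does not call EMBOSS; it simply counts residues using the group memberships given in the text. Molecular weight, total charge and isoelectric point are left out because they depend on parameter tables that the text does not reproduce.

```python
# Group memberships exactly as listed in the text.
RESIDUE_GROUPS = {
    "tiny": "ACGST",
    "small": "ACDGNPSTV",
    "aliphatic": "ILV",
    "aromatic": "FHWY",
    "non_polar": "ACFGILMPVWY",
    "polar": "DEHKNQRST",
    "charged": "DEHKR",
    "acidic": "DE",
    "basic": "HKR",
}


def group_mole_percentages(sequence: str) -> dict:
    """Return the mole percentage of each residue group in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return {
        name: 100.0 * sum(seq.count(aa) for aa in members) / len(seq)
        for name, members in RESIDUE_GROUPS.items()
    }


pct = group_mole_percentages("ACDEHKILVW")
```

The groups overlap (for example, H is aromatic, polar, charged and basic), so the percentages do not sum to 100.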
3) Pseudo amino acid composition (PseAA)
This class of descriptor consists of a set of 37 features, 20 of which are weighted amino acid compositions and the remaining 17 are correlation factors calculated among amino acids for each protein sequence [38].
A protein sequence P with L amino acid residues can be represented as:
$$P = R_1 R_2 R_3 \cdots R_L \tag{1}$$
where R1 represents the 1st residue of the protein P, R2 the 2nd residue, and so forth. According to the simplest discrete model, the amino acid composition of the protein P based on equation (1) can be expressed as:
$$P = [f_1, f_2, \ldots, f_{20}]^{T} \tag{2}$$
where $f_u$ ($u = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in P, and T is the transposition operator. The additional 17 features are a series of rank-different correlation factors along a protein chain and were calculated as follows.
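With the correlation factors included, a protein sequence P consisting of L amino acid residues can be represented as:
$$P = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{T} \tag{3}$$
where the 20 + λ components are given by
$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 20 + 1 \le u \le 20 + \lambda \end{cases} \tag{4}$$
where $f_u$ are the occurrence frequencies of equation (2), w is the weight factor and τk is the k-th tier correlation factor that reflects the sequence order correlation between all the k-th most contiguous residues as formulated by
$$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \qquad k < L \tag{5}$$
with
$$J_{i,i+k} = \frac{1}{\Gamma} \sum_{\xi=1}^{\Gamma} \left[ \Phi_{\xi}(R_{i+k}) - \Phi_{\xi}(R_i) \right]^{2} \tag{6}$$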
where Φξ(Ri) is the ξ-th function of the amino acid Ri, and Γ the total number of functions considered. Φ1(Ri), Φ2(Ri) and Φ3(Ri) represent, respectively, the hydrophobicity value [39], hydrophilicity value [40], and side chain mass of amino acid Ri (Table 2), while Φ1(Ri+k), Φ2(Ri+k) and Φ3(Ri+k) are the corresponding values for the amino acid Ri+k. Therefore, the total number of functions considered is Γ = 3.
It can be seen from equation (3) that the first 20 components, i.e., p1, p2, …, p20, are associated with the conventional AA composition of the protein, while the remaining components p20+1, …, p20+λ are the correlation factors that reflect the 1st tier, 2nd tier, …, and the λth tier sequence order correlation patterns. It is through these additional λ factors that the important sequence-order information is incorporated.
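The construction of the 20 + λ pseudo amino acid composition vector can be sketched as below, with λ = 17 to give the 37 features mentioned above. Several points are assumptions made for illustration: the weight factor `w = 0.05` is not stated in the text, the three property scales are invented placeholder numbers (they are not the Table 2 values), and any standardization of the scales before use is left to the caller.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_aa_composition(sequence: str,
                          scales: List[Dict[str, float]],
                          lam: int = 17,
                          w: float = 0.05) -> List[float]:
    """Return the 20 + lam components p_u described by equations (3)-(6)."""
    seq = sequence.upper()
    length = len(seq)
    if length <= lam:
        raise ValueError("sequence must be longer than lam")

    # Normalized occurrence frequencies f_u of the 20 amino acids.
    freqs = [seq.count(aa) / length for aa in AMINO_ACIDS]

    # k-th tier correlation factors tau_k, k = 1..lam.
    taus = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(length - k):
            a, b = seq[i], seq[i + k]
            # J_{i,i+k}: mean squared difference over the property scales.
            total += sum((s[b] - s[a]) ** 2 for s in scales) / len(scales)
        taus.append(total / (length - k))

    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


# Placeholder scales (hypothetical values, NOT the Table 2 data).
toy_scales = [
    {aa: i / 19.0 for i, aa in enumerate(AMINO_ACIDS)},
    {aa: (19 - i) / 19.0 for i, aa in enumerate(AMINO_ACIDS)},
    {aa: (i % 5) / 4.0 for i, aa in enumerate(AMINO_ACIDS)},
]
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 2
features = pseudo_aa_composition(toy_sequence, toy_scales)
```

Because both kinds of component share the same denominator, the 37 values sum to 1, which is a convenient sanity check on an implementation.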
2.3. System Architecture and Component of NN Topology
Table 2. Hydrophobicity, hydrophilicity and mass of side chain scales for 20 amino acids used in calculating pseudo amino acid composition (PseAA).
Figure 1. A schematic drawing to classify metalloproteins and non-metalloproteins as well as the three main functional classes of metalloproteins and their subclasses.
The overall classification system consists of three layers of successive multilayer feed-forward (acyclic) artificial NNs (Figure 1), each one with a single hidden layer at which the computation takes place. Some common features shared by all NNs are the following:
1) There is full connectivity, as every node in each network layer is connected to every node in the adjacent forward layer.
2) There are a small number of nodes in the hidden layer responsible for the actual learning process carried out by each component network.
3) The activation function on each node is a nonlinear, sigmoid logistic function of the weighted sum of the node's inputs (plus a constant bias).
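The three shared properties listed above correspond to a standard fully connected network with one hidden layer and logistic activations. The following sketch shows only the forward pass. The layer sizes are assumptions: 69 inputs corresponds to the combined feature set (20 + 12 + 37), while the hidden layer size of 5 and the single output node are hypothetical, since the text does not give them.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in: int, n_out: int, rng: random.Random, scale: float = 0.1):
    """One weight row per output node; the last entry of each row is the bias."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def layer_forward(weights, inputs):
    """Fully connected layer: sigmoid of the weighted input sum plus bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in weights
    ]


def forward(hidden_weights, output_weights, features):
    """Propagate a feature vector through the hidden and output layers."""
    return layer_forward(output_weights, layer_forward(hidden_weights, features))


rng = random.Random(0)
n_features, n_hidden, n_outputs = 69, 5, 1  # hidden/output sizes are hypothetical
hidden = init_layer(n_features, n_hidden, rng)
output = init_layer(n_hidden, n_outputs, rng)
prediction = forward(hidden, output, [0.5] * n_features)
```

The weights here are small random values, matching the initialization described for the networks before training; the output is therefore not meaningful until the weights have been fitted.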
NN1 is a binary classifier that classifies an input protein sequence as a metalloprotein or a non-metalloprotein. If the input sequence is classified as a metalloprotein, it is passed to NN2, which assigns it to one of the three main classes of metal binding proteins (alkali earth metal, alkali metal or transition metal). Each main class has an independent NN [alkali earth metal (NN3), alkali metal (NN4) and transition metal (NN5)] that classifies the input sequence according to the specific metal ion it binds. We used three categories of sequence derived features (physicochemical properties, amino acid composition and pseudo amino acid composition) for training the NNs. Using these feature sets independently and in combination, we developed seven neural network clusters: NNprop, NNAAcomp, NNpseAA, NNpseAA + prop, NNAAcomp + prop, NNpseAA + AAcomp and NNpseAA + AAcomp + prop. Before the learning process, all network synaptic weights are initialized to small random values, which are then optimized to their final values during learning by the backpropagation algorithm [41].
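The routing logic of the cascade can be sketched independently of how each network is trained. In the sketch below the five networks are represented by plain callables; the function name `predict_hierarchy`, the return format and the stand-in classifiers are all hypothetical and serve only to show the order in which NN1, NN2 and NN3 to NN5 are consulted.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = List[float]


def predict_hierarchy(
    features: Features,
    nn1: Callable[[Features], bool],
    nn2: Callable[[Features], str],
    subclass_nets: Dict[str, Callable[[Features], str]],
) -> Tuple[str, Optional[str], Optional[str]]:
    """Route one feature vector through the three layers of the cascade."""
    # Layer 1: metalloprotein versus non-metalloprotein.
    if not nn1(features):
        return ("non-metalloprotein", None, None)
    # Layer 2: one of the three main classes.
    main_class = nn2(features)
    # Layer 3: the class-specific network names the bound metal ion.
    metal = subclass_nets[main_class](features)
    return ("metalloprotein", main_class, metal)


def always(label: str) -> Callable[[Features], str]:
    """Stand-in classifier that ignores its input (illustration only)."""
    return lambda features: label


stand_in_nets = {
    "alkali earth metal": always("calcium"),   # NN3 stand-in
    "alkali metal": always("sodium"),          # NN4 stand-in
    "transition metal": always("zinc"),        # NN5 stand-in
}
positive = predict_hierarchy([0.1, 0.2], lambda f: True,
                             always("transition metal"), stand_in_nets)
negative = predict_hierarchy([0.1, 0.2], lambda f: False,
                             always("transition metal"), stand_in_nets)
```

One consequence of this design is that an error at an upper layer propagates downward: a metalloprotein rejected by NN1 never reaches NN2 or the subclass networks.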
An important issue in the design of a NN classification system is the network's generalization, that is, its ability to give correct predictions when it is presented with unseen examples. With a small number of training samples and a relatively large number of synaptic weights, there is always the possibility that the network's free parameters will adapt to the special features of the training data (over-fitting). A straightforward way to overcome this problem is to use a sufficient number of training examples (usually more than 30 times the number of adjustable network parameters). However, the protein classes are unbalanced and it is not possible to obtain that many examples for every class. Therefore, to control over-fitting in our application, we employed a non-convergent criterion (the early stopping method): the training process is stopped before the optimization procedure finishes. We follow the common practice of withholding part of the training data (20%) as an internal validation set. Training is stopped at the point at which the classification error on the holdout subset begins to rise.
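The early stopping rule described above can be sketched as a generic training loop. The 20% holdout fraction comes from the text; the function names, the epoch cap and the simulated validation errors are assumptions used only to make the example runnable without a real network.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_holdout(examples: Sequence, fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List, List]:
    """Withhold a fraction of the training data as an internal validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * fraction))
    return shuffled[n_val:], shuffled[:n_val]


def train_with_early_stopping(
    train_one_epoch: Callable[[], None],
    validation_error: Callable[[], float],
    max_epochs: int = 500,
) -> Tuple[int, float]:
    """Train until the holdout error begins to rise; return (best epoch, error)."""
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error > best_error:
            break  # holdout error started to rise: stop training
        best_error, best_epoch = error, epoch
    return best_epoch, best_error


# Simulated holdout errors stand in for a real network (illustration only).
simulated_errors = iter([0.40, 0.30, 0.25, 0.27, 0.20])
train_part, holdout_part = split_holdout(range(10))
best_epoch, best_error = train_with_early_stopping(
    lambda: None, lambda: next(simulated_errors)
)
```

In the simulated run the error rises at the fourth epoch, so training stops there and the third epoch is reported as the best. A practical implementation would typically also keep a copy of the weights from the best epoch; the text does not say whether this was done.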
In the prediction phase, just as in the forward pass during learning, the network weights are fixed (at the values obtained at the end of the training process) and the NN is presented with an unknown example for classification. In the same hierarchical manner, the input signal propagates once in the forward direction, and the output value constitutes the network's decision based on the previously learned training examples. The prediction accuracy of the models has been validated using a self test, a jackknife test and an independent data set. For the jackknife test we randomized the test set 100 times and recorded the average accuracy.
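The averaging over 100 randomized test sets can be sketched as follows. The text does not specify how the test set was randomized, so this sketch assumes that each round draws a random subset of the labeled examples and measures accuracy on it; the subset fraction, the function names and the toy predictor are hypothetical. Note that the classical jackknife is a leave-one-out procedure, which is a different scheme from the one sketched here.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[float, int]  # (feature value, class label) for this toy case


def accuracy(predict: Callable[[float], int], examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)


def averaged_random_test_accuracy(
    predict: Callable[[float], int],
    examples: List[Example],
    rounds: int = 100,
    fraction: float = 0.5,
    seed: int = 0,
) -> float:
    """Average the accuracy over `rounds` randomly drawn test subsets."""
    rng = random.Random(seed)
    size = max(1, int(len(examples) * fraction))
    scores = [accuracy(predict, rng.sample(examples, size))
              for _ in range(rounds)]
    return sum(scores) / rounds


# Toy data: label is 1 when the feature exceeds 0.5 (illustration only).
toy_examples = [(i / 20.0, int(i / 20.0 > 0.5)) for i in range(20)]
perfect = averaged_random_test_accuracy(lambda x: int(x > 0.5), toy_examples)
biased = averaged_random_test_accuracy(lambda x: 1, toy_examples)
```

A predictor that reproduces the labeling rule scores 1.0 in every round, while the constant predictor scores only as well as the share of positive examples in each random subset.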
3. RESULTS AND DISCUSSION
To assess the performance of MetalloPred, we applied several tests. We created a new independent test set of well-characterized protein sequences from all levels of classes and sub-classes (Table 1) to evaluate the performance of the integrated system. In addition, we performed a sub-sampling test (self test) and a jackknife test to examine the accuracy of MetalloPred. All these validation tests are commonly used for the performance evaluation of a predictor. The jackknife test is deemed the most rigorous and objective [30] and hence has been increasingly adopted by investigators to examine the quality of various prediction methods [42-44]. A direct comparison with results from previous metal binding protein prediction studies may not be appropriate because of differences in the protein classes predicted, datasets, protein descriptors, prediction methods and parameters.
3.1. Performance of 1st Layer of Neural Network
The performance and validation results of NN1 are given in Table 3. An overall accuracy of 99.74% and 87.99% was obtained for the training and test set data, respectively, using a combination of the sequence derived features: pseudo amino acid composition, amino acid composition and physicochemical properties. For the validation techniques using an independent data set, self test and jackknife test, the overall accuracy of the 1st layer of MetalloPred is 83.42%, 80.49% and 74.18%, respectively. The details of the performance accuracy and validation results based on different types of sequence derived features are presented in Supplementary Table 1.

Table 3. Performance accuracy and validation results of 1st layer of MetalloPred based on combination of pseudo amino acid composition, amino acid composition and physicochemical properties.
3.2. Performance of 2nd Layer of Neural Network
The overall success rate in assigning metalloproteins to their three main functional classes is 99.25% (training set) and 81.91% (test set) (Table 4). Similarly, the overall accuracy based on the three types of validation tests was found to be 75.16% (independent data set), 73.24% (self test) and 64.23% (jackknife test). The corresponding results of MetalloPred for the three major classes of metalloproteins using different types of sequence derived features are given in Supplementary Table 2.
3.3. Performance of 3rd Layer of Neural Network
The performance accuracy and validation results of the NNs in identifying subclasses of alkali earth metal (NN3), alkali metal (NN4) and transition metal (NN5) binding proteins using the combination of all sequence derived features are given in Table 5. The corresponding results of MetalloPred for the detection of calcium and magnesium binding proteins are 91.72% (training set), 91.07% (test set), 77.46% (independent data set), 81.1% (self test) and 75.51% (jackknife test) on the data set 'S1'. Similarly, for the data set 'S2', the accuracy for the detection of potassium and sodium binding proteins is 97.95% (training set), 96.4% (test set), 79.2% (independent data set), 91.57% (self test) and 83.33% (jackknife test). The overall accuracy for the detection of cobalt-binding, copper-binding, iron-binding, manganese-binding, molybdenum-binding, nickel-binding, vanadium-binding and zinc-binding proteins is 98.88% (training set), 95.39% (test set), 84.06% (independent data set), 71.39% (self test) and 67.98% (jackknife test) using the data set 'S3'. The details of the performance accuracy are presented in Supplementary Table 3.
For the current data sets, in which no protein sequence has ≥30% sequence identity to any other in the same class or subclass, the overall success rates of MetalloPred in identifying the main functional classes of metalloproteins and their subclasses are high. In earlier studies, the contribution of individual feature properties to protein classification was investigated by conducting the classification separately with each feature property [45-47].
The same method was employed here. An analysis of the classification of the group of all metal binding proteins suggests that hydrophobicity and hydrophilicity play a more prominent role than the other feature properties. Hydrophobicity has been shown to be important for metal-protein interactions, in that metal binding sites usually appear in clusters within a hydrophobic environment. High-affinity metal binding sites in some proteins are located in sequence segments with specific amino acid composition, and specific sequence motifs have been used for predicting metal-binding proteins [48-50]. It was also found that the polarity and solvent accessibility of the binding site influence the functional properties of metal-binding proteins. Our prediction results are therefore consistent with these experimental findings. Overall, MetalloPred is a powerful predictor for identifying metalloproteins, their main classes, and their subclasses.
4. CONCLUSIONS
From a practical point of view, the most important aspect of a prediction model is its ability to make correct predictions. To date, most of the available methods use the 3-D structure of the protein to predict and classify metal ion binding proteins. This is a tedious and costly undertaking. The sequence of a protein is an important determinant of its detailed molecular function and would consequently also be useful for the prediction of metal ion binding proteins and their classes. Additionally, encouraging results have been obtained using sequence derived parameters. Therefore, a more accurate and reliable approach would be to predict metal ion binding proteins and their classes based on both strategies. The cascade of neural networks used in this study appears to be a potentially useful tool for the prediction of metal-binding proteins of different classes. The prediction accuracy may be further enhanced by the expansion of our knowledge about metal-binding proteins, particularly for the small metal-binding classes, by a more refined representation of the structural and physicochemical properties of proteins, and by improvements to the prediction algorithms, such as better treatment of imbalanced datasets, in the next version of our prediction tool.
Table 4. Performance accuracy and validation results of 2nd layer of MetalloPred based on combination of pseudo amino acid composition, amino acid composition and physicochemical properties.
Table 5. Performance accuracy and validation results of 3rd layer of MetalloPred based on combination of pseudo amino acid composition, amino acid composition and physicochemical properties.


Supplementary Table 1. The summary of the performance accuracy and validation results of 1st layer (NN1) of MetalloPred based on different sequence derived features.


Supplementary Table 2. The summary of the performance accuracy and validation results of 2nd layer (NN2) of MetalloPred based on different sequence derived features.