mLysPTMpred: Multiple Lysine PTM Site Prediction Using Combination of SVM with Resolving Data Imbalance Issue ()
1. Introduction
The structural and functional diversities of proteins as well as plasticity and dynamics of living cells are significantly dominated by the post-translational modifications (PTMs) [1]. PTMs are also responsible for expanding the genetic code and for regulating cellular physiology [2-4]. It changes the properties of a protein by proteolytic cleavage or adding a modifying group to one or more amino acids [5,6].
In general, the side chain of lysine plays the key role in increasing the complexity of PTM network [7]. The lysine residue in proteins can experience various types of PTM, called K-PTM, such as methylation, acetylation, biotinylation, ubiquitination, ubiquitin-like modifications, propionylation, and butyrylation [7,8]. Moreover, some lysine residues in proteins can undergo multiple K-PTMs which lead to the requirement of multi-label PTM site identification. This kind of multiplex lysine residues in proteins may have some exceptional functions that require special attention [9]. As a result, the identification of multiple K-PTM sites in proteins has become a vital question in cellular physiology and pathology, which in turns, helps in providing some valuable evidence for both biomedical research and drug development [6,9].
However, the purely experimental technique such as mass spectrometry, peptide micro-array, liquid chromatography, etc., to determine the exact modified sites of protein is expensive as well as time-consuming, especially for large-scale datasets. In this context, it is highly demanded to use computational approaches to identify the K-PTM sites effectively and accurately [8]. Meanwhile, many computational methods have been developed to predict the modification sites in proteins for different K-PTM types which called single-label prediction [10-16]. According to our best knowledge, so far one computational tool has been developed to predict several different K-PTM types simultaneously for multiplex lysine residues. It is noted that various types of multi-label systems have been successfully studied in some other fields of computational biology [17,18]. However, in order to meet the current demand to produce efficient high-throughput tools, additional effort is required to further improve the prediction quality [9].
In the development of computational classifier, one of the major challenges is to handle imbalance dataset problem [8,15,19,20], as it is found in most of the dataset for this kind of prediction, the number negative subset is much larger than the corresponding positive subset [8,15]. As the real world picture is that here the non K-type modification sites are always the majority compared with the K-type modification ones, so naturally the predictor should be biased to the non K-type modification sites. Here the problem is that, for this type of predictors may interpret many K-PTM sites as non K-PTM sites [21-23]. But, the information about the K-PTM sites is mostly desired than non K-PTM sites. As a result, it is crucial to find an effective solution to balance this kind of bias consequence.
The current study has been begun with an attempt to address the problems mentioned above and then tried to develop a more powerful predictor using combination of support vector machine which can be used to predict the multiple K-type modification sites of proteins. In this predictor, the Different Error Costs (DEC) method [24-26] has been used to resolve the data imbalance issue. It should be noted here that the features used in this predictor are extracted by using vectorized sequence-coupling model [27]. In the recent works, the performance of iPTM-mLys [9] on a large set of proteins has been studied in [9]. Therefore, in order to compare the performance of mLysPTMpred with the system iPTM-mLys [9], we use the exactly same dataset employing the commonly used stratified 5-fold cross-validation [9]. Since the information about the exact 5-way splits used in previous studies [9] is not available, so we have performed five complete runs of 5-fold-crossvalidation (i.e. 25 runs in total), where each complete run of 5-fold cross-validation uses a different 5-way split. The use of multiple runs with different splits helps to validate the stability and the statistical significance of the results. Finally, the average results of all metrics found from this study have been reported. Our experimental results indicate that mLysPTMpred achieves better results than those found from iPTM-mLys [9].
In order to launch a useful sequence-based statistical predictor for a biological system as demonstrated in a series of recent publications [8,15,28-35], the Chou’s five-step rules [36] should be followed: 1) construct or select a valid benchmark dataset to train and test the predictor, 2) formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted, 3) introduce or develop a powerful algorithm (or engine) to operate the prediction, 4) properly perform cross-validation tests to objectively evaluate its anticipated accuracy, and 5) establish a user-friendly webserver that is accessible to the public.
2. Material and Methods
2.1. Benchmark Dataset
iPTM-mLys’s [9] benchmark dataset set has been used in this study. iPTM-mLys’s dataset was derived from the 1769 protein sequences from human. These 1769 protein sequences were collected from the web site at http://www.uniprot.org/ by giving various constrains such as 1) experimental assertion for evidence, 2) consider only human protein sequences, and 3) use keywords of “acetyllysine”, “crotonyllysine”, “methyllysine” or “succinyllysine” in the advance search option.
In iPTM-mLys [9], according to Chou’s scheme, a peptide sample was generally expressed by
(1)
where the subscript ξ is an integer,
represents the ξ-th up stream amino acid residue from the center, the
represents the ξ-th downstream amino acid residue, and so forth.
The
-tuple peptide sample
was further classified into the following two categories [9]
(2)
where
denotes a true K-PTM segment with K at its center,
a false K-PTM segment with K at its center, and the symbol
means “a member of” in the set theory.
In iPTM-mLys’s work,
-tuple peptide window was used to collect peptide segment that have K at the center. It should be mentioned here that if the upstream or downstream in a protein sequence is less than ξ or greater than L-ξ (L is the length of the protein sequence concerned) then the lacking amino acid has been filled with the same residue as its nearest one [9].
After applying some screening procedure based on some constraints on that collected peptide samples, for example, considering window size, keep only one when two or more samples share same sequence, iPTM-mLys finally constructed a benchmark dataset [9]. It should be mentioned here that iPTM-mLys constructed four bench mark dataset for “acetylation”, “crotonylation”, “methylation” and “succinylation”, respectively using same the above mentioned procedure. The detail procedure about the construction of iPTM-mLys’s benchmark dataset is explained in [9].
The four benchmark dataset
in iPTM-mLys’s study was formulated as
(3)
where the positive subset
contains only the peptide samples with their center residues K (Equation (3)) confirmed by experiments being able to be of acetylation, while the negative subset
only contains those samples unable to be of acetylation, and the symbol
means union in the set theory. Likewise, the remaining three sub-equations in Equation (3) have exactly the same definition but refer to “crotonylation”, “methylation” and “succinylation”, respectively.
Using numeric values, in iPTM-mLys’s study, the Equation (3) was formulated as
(4)
where the numerical argument 1, 2, 3 or 4 denotes ‘acetylation’, ‘crotonylation’, ‘methylation’ or ‘succinylation’, respectively.
Note that, depending on some preliminary test, window size was selected as 27 (2*𝜉 + 1) in iPTM-mLys’s study, where
. Thus, the benchmark dataset obtained by iPTM-mLys for
,
,
,and
are available at online supplementary materials (http://research.ru.ac.bd/mLysPTMpred/) as Supporting Information. It should be mention that our published online supplementary materials are taken from iPTM-mLys’s work [9]. A summary of this benchmark dataset is given in Table 1.
2.2. Feature Extraction
The appropriate features of protein sequences or samples plays very important roles for the prediction of PTM site, as a result it draws the much attention of scientist that how to select the core and essential features of protein samples but this task becomes harder as this types of features are either hidden or burred in the complicated protein sequences. As most existing machine learning algorithm can handle only vector but not sequence sample, one of the critical problem in bioinformatics is how to extract vector from biological sequence with keeping considerable sequence characteristics [7].
To avoid complete losing the sequence pattern information for a protein, the pseudo amino acid composition or PseAAC [37] was proposed. Ever since the concept of Chou’s PseAAC was proposed, it has been widely used in nearly all the areas of computational proteomics [38,39]). Because it has been widely and increasingly used, recently a very powerful web-server called “Pse-in-One” [40] and its updated version “Pse-in-One2.0” [41] have been established that can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the need of users’ studies.
In this paper, the incorporation of sequence-coupling model [27,42] into Chou’s general PseAAC [36] has been adopted to extract feature from peptide segment. Based on the concept of sequence-coupled information [27,42] into the general PseAAC, the peptide sequence of Equation (1) can be formulated as
(5)
where
(6)
and
![]()
Table 1. Summary of the four benchmark dataset.
Ace, acetylation; Cro, crotonylation; Met, methylation; Suc, succinylation.
(7)
In Equation (5)
is the conditional probability of amino acid
occurring at the left 1st position (see Equation (1)) given that its closest right neighbor is
,
is the conditional probability of amino acid
occurring at the left 2nd position given that its closest right neighbor is
,and so forth. It should be mentioned here that in Equation (6), only
and
are of non-conditional probability since the right neighbor of
and the left neighbor of
are always K. All these probability values can be easily derived from the positive benchmark dataset given in Supporting Information as done in [42]. Likewise, the components in Equation (7) are the same as those in Equation (6) except for that they are derived from the negative benchmark dataset given in Supporting Information.
2.3. SVM Classification
The modeling algorithm of SVM searches an optimal hyperplane with the maximum margin for separating two classes by finding a solution of the following constraint optimization problem [43-45]:
(8)
where
and
is the class label of
,
.
Finally, the discriminant function of SVM by involving the kernel function takes the following form
(9)
It noted here that a kernel function and its parameter have to be chosen to build a SVM classifier [43-45]. In this work, radial basis function kernel has been used to build SVM classifier which is defined below:
,
is the width of the function.
2.4. Imbalance Data Management
Any data set that shows an unequal distribution between its classes can be considered imbalanced data set problem. The main challenge in imbalance problem is that the small classes are often more useful, but standard classifiers tend to be weighed down by the huge classes and ignore the tiny ones. Although SVMs work effectively with balanced datasets, they provide sub-optimal models with imbalanced datasets [24,25]. The main reason for the SVM algorithm to be sensitive to class imbalance would be that the soft margin objective function [43-45] assigns the same cost (i.e., C) for both positive and negative misclassifications in the penalty term [26].
In this paper, we have used a Different Error Costs (DEC) method to handle imbalance dataset problem of K-PTM sites prediction. The Different Error Costs (DEC) method is a cost-sensitive learning solution proposed in [24] to overcome this problem in SVMs. In DEC method, the SVM soft margin objective function is modified to assign two misclassification costs, such that C+ is the misclassification cost for positive class examples, while
is the misclassification cost for negative class examples. In our work, the following equations give the cost for the positive and negative classes
(10)
where N is the total number of instances, N1 is the number of instances for positive class, and N2 is the number of negative class.
2.5. Experimental Setting
In statistical prediction, there are three commonly used methods to derive the metric values for a predictor, these are, the independent dataset test, subsampling (e.g., K-fold cross validation) test, and jackknife test [15,46]. These methods are often used for testing the accuracy of a statistical prediction algorithm. However, among those three methods, the jackknife test is deemed the most objective because it can always yield a unique result for a given benchmark data set, as reported in a comprehensive review [36]. Although the jackknife test has been increasingly and widely adopted by investigators to examine the power of various prediction methods, it takes huge computational time for a larger dataset.
In this study, we have used K-fold cross validation (subsampling) method to save the computational time. As the information about the exact 5-way splits of dataset used in previous studies is not published [9], therefore, in order to validate the stability and the statistical significance of our results, we have repeated the 5-fold cross validation for 5 times (i.e. 25 runs in total). It can be mentioned here that in each 5-fold cross validation the given training samples are randomly partitioned into 5 mutually exclusive sets of approximately equal size and approximately equal class distribution. Finally, we have reported the average results of all metrics in this study.
2.6. Measuring Metrics
According to the description of iPTM-mLys dataset in [9], we have a total of 6394 samples, of which 3991 are labeled with “acetylation”, 115 with “crotonylation”, 127 with “methylation”, 1169 with “succinylation” and 1,750 with “non-K-PTM”. However, in the above samples, some have two or more labels. It should be noted that in this study, we have considered {acetylation, crotonylation, methylation, succinylation, ∅} as class label set for a protein. Here ∅ is used to denote non-K-PTM. Since we are dealing with a multi-label system [47], so the metrics for a multi-label system will be used in this work instead of the conventional metrics defined for single-label systems [48-50].
For measuring the predictive capability and reliability for this kind of classification, a set metrics are usually used in the literature which are define below [47]:
(11)
where N is the total number of the samples concerned, M the total number of labels in the system,
and
the symbols are for the “union” and “intersection” in the set theory,
means the operator acting on the set therein to count the number of its elements,
denotes the subset that contains all the labels experiment-observed for the k-th sample,
represents the subset that contains all the labels predicted for the kth sample, and
All of these metrics defined in this section have been successfully applied to study several multi-label systems, such as those in which a protein may stay in two or more different subcellular locations [51], or a membrane protein may have two or more different types [52], or an antimicrobial peptide may have two or more different types [53].
3. Results and Discussion
3.1. Model Selection and Working Procedure of the Proposed System
In order to generate highly performing SVM classifiers capable of dealing with real data an efficient model selection is required [54]. Grid-search technique has been used to find the best model for SVM in this work. In our experiments, grid-search technique selects the values of parameters considering highest performance which is measured using a metric and then time if more than one position in search space has the same performance.
In this study, four SVM classifiers, one for each dataset, have been used for predicting the acetylation, crotonylation, methylation and succinylation sites. The model selection of each SVM classifiers has been done separately as binary classifier using the corresponding benchmark dataset given in Table 1. In this work, we have used RBF kernel for all SVM classifiers.
For radial basis function (RBF) kernel, to find the parameter value C (penalty term for soft margin) and σ (sigma), we have considered the value from 2−8 to 28 for C and from 2−8 to 28 for sigma as our searching space. Herein, the value of C will be used to find the misclassification cost of C+ and C− defined in Equation (10). Since the information about the exact 5-way splits of dataset used in previous studies is not published [9], We have performed 5 times complete run of 5-fold cross-validations and each time we have selected the best parameter of the classifier depending on the value of AUC (area under curve). It should be noted here that AUC (area under the curve) is an important metric for single-label PTM site prediction [8,15] which will be calculated from ROC curve (receiver operating characteristic curve). Finally, five sets of C and sigma (Supplementary Tables S1) have been selected from 5 times complete run of 5-fold cross-validations for each SVM classifier which is dedicated to a specific types of training dataset (acetylation, crotonylation, methylation, or succinylation).
After getting the four trained binary SVM classifier with appropriate values of C and sigma (Supplementary Tables S1), a multi-label predictor, named mLysPTMpred, has been developed by combing output from these four SVM classifiers, as shown in Figure 1. As we have repeated the 5-fold cross validation for 5 times for our mLysPTMpred, we have got five sets of values for all metrics defined in section 2.5. Finally, we have averaged our results in order to ensure unbiased model selection.
However, in order to train the system for the web server, we have used that value of C and sigma which appears most of the times as best model in 5 times complete run of 5-fold cross validation in each dataset. Note that, a random selection of the value of C and sigma has also been performed from 5 set of C and sigma of each dataset where “most of the times” criteria fail to select C and sigma. In this way, the selected C and sigma for each type of dataset is given in Table 2.
![]()
Figure 1. A flowchart to show how the mLysPTMpred predictor works.
![]()
Table 2. Selection of C and σ to train the system for web server.
It can be mentioned here that all the trains and tests have been conducted on a standard machine of DELL Optiplex 390 with 8 GB RAM and Core-i3 processor running at 3.30 GHz. We have used Matlab 2014b version to implement our system where the svmtrain function of Matlab by default uses DEC with the same cost defined in Eq. (8) to handle imbalance situation.
3.2. Comparison with the Existing Methods
The values of the five metrics (cf. Equation (11)) obtained by the current mLysPTMpred predictor for multi-label lysine PTM site are given in the Table 3. These values are the average result of 5 times complete run of 5-fold cross-validation on the benchmark dataset given in Supporting Information. Moreover, standard deviations of each metrics of 5 times complete run of 5-fold cross validation are shown in parentheses.
Table 3 also includes the corresponding rates achieved by iPTM-mLys [9], the one existing predictors for identifying the multiple lysine PTM site in the aforesaid benchmark dataset. It should be mentioned here that the performance of iPTM-mLys [9] as shown in Table 3 are noted from [9].
In Equation (11), the first four metrics are completely opposite to the last one. For the former, the higher the rate is, the better the multi-label predictor’s performance will be; for the latter, the lower the rate is, the better its performance will be [9]. The rate of ‘‘Absolute-False’’ or “Hamming-Loss” [47] for our predictor is 6.66% which is about half than iPTM-mLys. So, the average ratio of the completely wrong hits over the total prediction events for mLysPTMpred is significantly lower than iPTM-mLys.
Among the five metrics in Equation (11), the most strict and harsh one is the “Absolute-True”. According to [9], very few multilabel predictors in biology could reach over 50% for the absolute true rate. However, the absolute-true rate achieved by mLysPTMpred can reach over 80% as shown in Table 3.
Also, among the same five metrics, the most important is the “Accuracy”, the average ratio of the correctly predicted labels over the total labels including correctly and incorrectly predicted ones as well as
![]()
Table 3. A comparison of the proposed predictor with the existing methods on the same dataset.
those real labels but are missed out during the prediction. The mLysPTMpred achieves 83.73% accuracy which is considerable amount of higher than iPTM-mLys. In addition, the rate of ‘‘Aiming’’ or ‘‘Precision’’ [47] and the rate of ‘‘Coverage’’ or ‘‘Recall’’ [47] achieved by mLysPTMpred also better than iPTM-mLys.
Therefore, it is obvious from Table 3, mLysPTMpred has performed remarkably better over iPTM-mLys [9] in all types of metrics measurement. To provide an intuitive comparison, a bar chart to represent Table 3 is shown Figure 2. Therefore, it is projected that mLysPTMpred may become a useful and higher throughput tool in multiple lysine PTM sites predictions.
In iPTM-mLys, an example protein sequence (Q16778) has been used to validate their findings. It should be noted here that the example protein sequence (Q16778) is also available in our site under Example button. For making comparison, we have not changed the sequence as example. The prediction result using this sequence (Q16778) from mLysPTMpred and the actual experimental result of this sequence is reported in Table 4. By putting this predicted result in equation 11, we have got the rate of aiming = 88.33%, coverage = 87.50%, accuracy = 85.83%, absolute-true = 80.00% and absolute-false = 6.00% which is similar to the rates obtained by the cross-validation tests as given in Table 3.
Why can the proposed method enhance the prediction quality so significantly? First, the coupling effects among the amino acids around the target sites have been taken into account via the conditional probability as done by many investigators in successfully enhancing the prediction quality in some applications [27,31,35,42]. Second, the predictor used Different Error Costs (DEC) method to balance the effect of skewed training dataset and hence many false prediction events produced by imbalanced and skewed training datasets can be avoided as established in some recent studies [7,8,15,31,35].
3.3. Web Server
To attract more users especially for the convenience of experimental scientists and enhance the value of practical application, a user-friendly web-server for mLysPTMpred has been established at http://research.ru.ac.bd/mLysPTMpred/. In order to get the predicted result, users are required to submit protein sequence through the input text box in our site. The input sequence should follow the FASTA format. An example of a sequence of FASTA format is available under example button in our published site. Moreover, in order to get batch prediction, users are required to enter desired batch input file in the FASTA format. Noted that, the benchmark dataset used to train and test the mLysPTMpred predictor are available under Supporting Information button.
4. Conclusions
In this article, we have designed a simple and efficient predictor mLysPTMpred for predicting multiple lysine PTM sites. Experimental results show that our method is very promising and can be a useful tool for prediction of multiple lysine PTM site. The mLysPTMpred has achieved remarkably higher success rates in comparison with the existing predictors (iPTM-mLys) in this area. We believe that the approach and formulations proposed in this article for multi-label K-PTM can be used to study other multi-label PTM systems such as C-PTM, R-PTM and S-PTM for the corresponding multi-label PTM sites at Cys, Arg and Ser residues, respectively.
For convenience of the experimental scientists, we have established a user-friendly web server and a step by step guide has been provided about how to use this web server. It provides an easier way to obtain the desired results without knowing the mathematical details. We have projected that the mLysPTMpred
![]()
Figure 2. Performance Comparison Between iPTM-mLys and mLysPTMpred.
![]()
Table 4. Comparison between the predicted and experimental results on protein Q16778.
will become a very useful and higher throughput tool to deal with both single- and multi-label PTM systems.
As the current mLysPTMpred has been developed to study the multi-label system for only four different K-PTM types, we will try to add more types of K-PTM and include more new sequences [55] in future in our system and an announcement about the new version will be provided at the website http://research.ru.ac.bd/mLysPTMpred/.