pLoc-mGpos: Incorporate Key Gene Ontology Information into General PseAAC for Predicting Subcellular Localization of Gram-Positive Bacterial Proteins
1. Introduction
As the most basic unit of life, a cell undergoes the three most important processes of any living thing: growth, reproduction, and death [ 1 ]. Thoroughly understanding these processes is one of the fundamental problems in cellular and molecular biology, and the knowledge thus acquired is closely associated with drug development. To achieve this, however, knowledge of the subcellular localization of proteins, i.e., of which organelles of a cell they reside in, is a prerequisite.
During the last two decades or so, many computational methods were developed to address this problem (see [ 2 , 3 ] as well as a long list of references cited in the two important review articles).
Most of the existing computational methods, however, were designed for single-label systems, in which each constituent protein has one, and only one, subcellular location. As more experimental data have emerged, it has become clear that protein localization in a cell is actually a multi-label system, in which some proteins may simultaneously occur at two or more different location sites. Such multiplex proteins often have exceptional biological functions [ 4 - 6 ] and deserve special attention [ 7 - 12 ], particularly from the viewpoint of selecting multiple targets [ 13 - 15 ] or key targets [ 16 - 19 ] for drug development.
About 10 years ago, some efforts were made to explore this kind of multiplex protein system [ 6 , 7 , 10 , 12 , 20 - 30 ]. Compared with single-label systems, multi-label systems are much more difficult and complicated to deal with. In particular, it is extremely difficult for a multi-label predictor to yield a decent result for the “absolute true” rate. The reason is as follows. Suppose a Gram-positive bacterial protein is labeled with “1” and “2”, meaning that it may simultaneously exist in subcellular locations 1 and 2 in the real world. If its predicted result is “1”, or “2”, or “1 and 3”, or “2 and 3”, no score at all will be added for the absolute true rate. Only when the predicted result is exactly “1 and 2”, i.e., perfectly identical to the actual labels, will one score be added in calculating the absolute true rate. It is therefore the harshest metric for measuring the quality of a multi-label predictor [ 31 ], which is why many authors did not even mention the “absolute true rate” when proposing their multi-label predictors.
In this study, we used the multi-label theory [ 31 ] to develop a new predictor to identify the subcellular localization of Gram-positive bacterial proteins aimed at improving its absolute true and absolute false rates, the two most important and harshest metrics for a multi-label predictor [ 31 ].
2. Materials and Methods
2.1. Benchmark Dataset
According to Chou’s 5-step rule [ 32 ], which has been widely used by many recent investigators (see, e.g., [ 33 - 47 ]) for developing a statistical predictor, the first and foremost step is to construct or select a valid benchmark dataset to train and test the model [ 1 , 42 , 48 ]. In the literature, the benchmark dataset usually consists of a training dataset and a testing dataset: the former is for training a proposed model, the latter for testing it. But as elucidated in [ 3 ], one good-quality benchmark dataset suffices if the model is tested by the jackknife or subsampling (K-fold cross-validation) test, because the outcome thus obtained is actually a combination of many different independent dataset tests. In this study, the benchmark dataset was taken from [ 21 , 27 ], for the following reasons. 1) The dataset contains a statistically significant number of Gram-positive bacterial proteins with both single and multiple locations confirmed by experiments. Moreover, none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset, which is important for reducing homology bias. 2) It is the same benchmark dataset that was used to train and test iLoc-Gpos [ 27 ], the state-of-the-art predictor in this area, so the comparison is made under the same conditions and the same criteria. For readers’ convenience, the benchmark dataset is given in Supporting Information S1. It contains 519 sequence-different Gram-positive bacterial proteins classified into 4 subsets according to their subcellular locations. An overall view of these proteins in the 4 subcellular locations is given in Supporting Information S2, from which we can see that, of the 519 different Gram-positive bacterial proteins, 515 belong to one location, and 4 to two locations.
A breakdown of the 519 Gram-positive bacterial proteins according to their occurrences in the 4 different subcellular locations is given in Table 1, where

$$N(\text{vir}) = \sum_{k=1}^{N(\text{seq})} \text{Lab}(k) = 515 \times 1 + 4 \times 2 = 523 \tag{1}$$

is the total number of “virtual proteins” [ 22 , 49 ] or “locative proteins” [ 28 ] in the benchmark dataset, $N(\text{seq}) = 519$ is the number of sequence-different proteins, and $\text{Lab}(k)$ is the number of different labels (or subcellular locations) marked on the k-th sequence-different Gram-positive bacterial protein. Accordingly, the multiplicity degree MD [ 31 ] of the current benchmark dataset is

$$\text{MD} = \frac{N(\text{vir})}{N(\text{seq})} = \frac{523}{519} \approx 1.008 \tag{2}$$

As we can see from Equation (2), $\text{MD} = 1$ means the system contains no protein with more than one location, while $\text{MD} > 1$ means some proteins have more than one location. The higher the value of MD, the more protein samples have multiple locations or labels.
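The counting behind the number of virtual (locative) proteins and the multiplicity degree can be illustrated with a minimal sketch. The function name `multiplicity_degree` and the synthetic protein identifiers are hypothetical; only the composition (515 single-location and 4 two-location proteins) comes from the text, and MD is assumed here to be the ratio of virtual proteins to sequence-different proteins.

```python
from typing import Dict, Set, Tuple


def multiplicity_degree(labels: Dict[str, Set[int]]) -> Tuple[int, int, float]:
    """Return (N_seq, N_vir, MD) for a mapping protein -> set of location labels.

    N_seq counts sequence-different proteins, N_vir counts one "virtual protein"
    per (protein, location) pair, and MD is assumed to be N_vir / N_seq.
    """
    n_seq = len(labels)
    n_vir = sum(len(locations) for locations in labels.values())
    return n_seq, n_vir, n_vir / n_seq


# Synthetic stand-in with the composition reported in the text:
# 515 proteins with one location and 4 proteins with two locations.
toy = {f"single_{i}": {1} for i in range(515)}
toy.update({f"double_{i}": {1, 4} for i in range(4)})

n_seq, n_vir, md = multiplicity_degree(toy)
print(n_seq, n_vir, round(md, 4))
```

A value of exactly 1 would indicate a purely single-label dataset; the small excess over 1 here reflects how few multi-location proteins the benchmark contains.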
To simplify the description later, the benchmark dataset is denoted by $\mathbb{S}$, which can be further formulated as

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \mathbb{S}_4 \tag{3}$$

where $\mathbb{S}_1$ only contains the Gram-positive bacterial protein samples from the “Cell membrane” location (cf. Table 1), $\mathbb{S}_2$ only contains those from the “Cell wall” location, $\mathbb{S}_3$ only contains those from the “Cytoplasm” location, and $\mathbb{S}_4$ only contains those from the “Extracell” location; $\cup$ denotes the symbol for “union” in the set theory.
![]()
Table 1. Breakdown of the Gram-positive bacterial proteins in the benchmark dataset $\mathbb{S}$ into 4 subsets according to their different subcellular localizations (cf. Supporting Information S1 and Supporting Information S2).
aSee Equation (1) and the relevant text for the definition of the number of virtual proteins; bSee Equation (2) for the definition of multiplicity degree.
2.2. Protein Sample Formulation
Now let us consider the 2nd step of the Chou’s 5-step rule [ 32 ]; i.e., how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned. Given a Gram-positive bacterial protein sequence P, its most straightforward expression is

$$\mathbf{P} = \text{R}_1 \text{R}_2 \text{R}_3 \text{R}_4 \cdots \text{R}_L \tag{4}$$

where L denotes the protein’s length or the number of its constituent amino acid residues, $\text{R}_1$ is the 1st residue, $\text{R}_2$ the 2nd residue, $\text{R}_3$ the 3rd residue, and so forth. Since all the existing machine-learning algorithms, such as SVM (Support Vector Machine) [ 36 ], KNN (K-Nearest Neighbor) [ 50 ], and RF (Random Forest) [ 51 ], can only handle vectors [ 52 ], we have to convert the sequential expression of Equation (4) into a vector. But a vector defined in a discrete model might completely lose all the sequence-order information. To deal with this problem, the PseAAC (Pseudo Amino Acid Composition) was introduced [ 53 - 55 ]. Ever since the concept of pseudo amino acid composition or Chou’s PseAAC [ 55 - 58 ] was proposed, it has been widely used in many biomedicine and drug development areas [ 59 , 60 ] as well as nearly all the areas of computational proteomics (see, e.g., [ 39 , 43 , 45 , 61 - 73 ] and a long list of references cited in two review papers [ 74 , 75 ]). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, its idea and approach have been extended to deal with DNA/RNA sequences [ 76 - 82 ] in computational genomics via PseKNC (Pseudo K-tuple Nucleotide Composition) [ 83 , 84 ]. Recently, a very powerful web-server called “Pse-in-One” [ 85 ] and its updated version “Pse-in-One 2.0” [ 86 ] were developed, by which users can generate any pseudo components for both protein/peptide and DNA/RNA sequences as they wish or define.
According to the concept of Chou’s general PseAAC [ 32 ], any protein sequence can be formulated as a PseAAC vector given by

$$\mathbf{P} = \left[ \Psi_1 \ \ \Psi_2 \ \ \cdots \ \ \Psi_u \ \ \cdots \ \ \Psi_\Omega \right]^{\mathbf{T}} \tag{5}$$

where T is a transpose operator, while the integer $\Omega$ is a parameter and its value as well as the components $\Psi_u \ (u = 1, 2, \ldots, \Omega)$ will depend on how to extract the desired information from the amino acid sequence of P, as elaborated below.
Being one type of general PseAAC [ 32 ], the GO (Gene Ontology) has been widely used to improve the prediction quality of protein subcellular localization (see, e.g., [ 23 , 25 , 26 , 87 - 91 ]). The advantage of using the GO approach is that proteins mapped into the GO space (instead of Euclidean space or any other simple geometric space) would be better clustered according to their subcellular locations, as elaborated in [ 9 , 92 ]. For the rationale of using the GO approach to predict the protein subcellular localization, and an incisive discussion/analysis to justify the GO approach, see Section VI in a comprehensive review paper [ 31 ].
However, the existing GO approaches (see, e.g., [ 10 , 23 , 25 , 26 , 87 ]) have the following shortcomings. 1) Only the digital numbers 0 and 1 (or their simple combination) were used to incorporate the GO information, and hence some important information may be missed. 2) The dimension of the protein vectors, namely $\Omega$ of Equation (5), in the previous GO approaches was very high; e.g., it is 1,930 in [ 88 ] and 9,567 in [ 93 ], and hence may lead to the “curse of dimensionality” or “high-dimension disaster” problem [ 94 ].
Here, we are to introduce a novel GO approach, through which we can extract the key information by winnowing many trivial ones so as to significantly reduce the dimension of PseAAC vector of Equation (5). The detailed procedures are as follows.
Step 1. Use BLAST to search all the Gram-positive bacterial proteins in the Swiss-Prot database for those proteins that have high homology (i.e., more than 60% pairwise sequence identity) with the protein P of Equation (4). The proteins thus obtained are collected into a subset, $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$, called the homology set of P. Subsequently, retrieve the GO codes of the protein in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ that has the highest homology with P. Each of the GO codes is a numerical label consisting of a 7-digit figure (see, e.g., [ 88 ]). If that protein has no GO code at all, do the same for the 2nd highest homologous protein in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$; if it again has no GO code, do the same for the 3rd highest homologous one; and so on until a GO code or a set of GO codes is obtained, as given below

$$\left\{ \text{GO}_1^{\mathbf{P}}, \ \text{GO}_2^{\mathbf{P}}, \ \ldots, \ \text{GO}_k^{\mathbf{P}}, \ \ldots, \ \text{GO}_{N_{\text{GO}}}^{\mathbf{P}} \right\} \tag{6}$$

where $\text{GO}_k^{\mathbf{P}}$ is the k-th GO code for the protein in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ that has first been found with a set of GO codes according to the aforementioned order, and $N_{\text{GO}}$ is the total number of the GO codes it has. Suppose we find from the training dataset that the total number of proteins having exactly the same GO code as $\text{GO}_k^{\mathbf{P}}$ is N(k), of which the number of proteins in the u-th subset is

$$N_u(k) \qquad (u = 1, 2, \ldots, M) \tag{7}$$

where $M$ is the total number of subcellular locations investigated (see Equation (3) or Table 1).
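Step 2. Based on Equation (7), the general PseAAC vector in Equation (5) and its dimension can be uniquely defined as

$$\Psi_u = \underset{k}{\text{Max}} \left\{ \frac{N_u(k)}{N(k)} \right\} \quad (u = 1, 2, \ldots, \Omega), \qquad \Omega = M = 4 \tag{8}$$

where $N_u(k)$ is the number of proteins in the u-th subset as given in Equation (7), N(k) is the total number of Gram-positive bacterial proteins in the training dataset that have the same GO number as $\text{GO}_k^{\mathbf{P}}$, the k-th GO code of Equation (6), and the operator Max means taking the maximum value among those with respect to different k. It is through this optimization operation that the most important GO information for the current study is extracted and many trivial GO codes are screened out, significantly reducing the PseAAC vector’s dimension.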
Listed in Supporting Information S3 are the PseAAC vectors defined by Equation (8) for the 519 sequence-different Gram-positive bacterial proteins in Supporting Information S1, respectively. As we can see there, the dimension of the current PseAAC vectors has been reduced to 4, which is hundreds to thousands of times lower than those in the previous approaches [ 21 , 27 , 88 , 93 ] (e.g., 1,930 in [ 88 ] and 9,567 in [ 93 ]). This is a major advance in using the GO approach to predict protein subcellular localization.
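The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: each component of the 4-D vector is taken to be the largest fraction N_u(k)/N(k) over the query’s GO codes, the BLAST search is replaced by a pre-ranked list of homolog identifiers, and all function names, protein identifiers, and annotations are hypothetical toy values rather than the authors’ data or code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

M = 4  # number of subcellular locations in the benchmark dataset


def go_codes_from_homologs(ranked_homologs: Sequence[str],
                           go_annotations: Dict[str, Set[str]]) -> Set[str]:
    """Return the GO codes of the best-ranked homolog that has any GO code."""
    for protein_id in ranked_homologs:  # highest homology first
        codes = go_annotations.get(protein_id, set())
        if codes:
            return codes
    return set()  # no homolog carries a GO code


def count_go_by_location(train_go: Dict[str, Set[str]],
                         train_labels: Dict[str, Set[int]]
                         ) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """For every GO code k, count N(k) and N_u(k) over the training proteins."""
    total: Dict[str, int] = defaultdict(int)
    per_location: Dict[str, List[int]] = defaultdict(lambda: [0] * M)
    for protein_id, codes in train_go.items():
        for code in codes:
            total[code] += 1
            for u in train_labels[protein_id]:  # locations are numbered 1..M
                per_location[code][u - 1] += 1
    return total, per_location


def go_feature_vector(query_codes: Set[str],
                      total: Dict[str, int],
                      per_location: Dict[str, List[int]]) -> List[float]:
    """Assumed form of the feature: psi_u = max over k of N_u(k) / N(k)."""
    psi = [0.0] * M
    for code in query_codes:
        if total.get(code, 0) == 0:  # GO code never seen in training
            continue
        for u in range(M):
            psi[u] = max(psi[u], per_location[code][u] / total[code])
    return psi


# Toy training set: three proteins, two GO codes, locations 1 and 3.
train_go = {"A": {"GO:0005886"}, "B": {"GO:0005886", "GO:0005737"}, "C": {"GO:0005737"}}
train_labels = {"A": {1}, "B": {1, 3}, "C": {3}}
# The top-ranked homolog H1 has no annotation, so H2 is used instead.
go_annotations = {"H1": set(), "H2": {"GO:0005886", "GO:0005737"}}

query_codes = go_codes_from_homologs(["H1", "H2"], go_annotations)
total, per_location = count_go_by_location(train_go, train_labels)
psi = go_feature_vector(query_codes, total, per_location)
print(psi)
```

Whatever the number of GO codes involved, the resulting vector always has one component per subcellular location, which is what keeps the dimension at 4.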
2.3. Operation Algorithm
The 3rd step in the Chou’s 5-step rule [ 32 ] is about the operation algorithm (or engine) to run the prediction. Here, we adopted the ML-GKR (multi-label Gaussian kernel regression) classifier, as described below.
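According to Equation (8) or Supporting Information S3, the i-th Gram-positive bacterial protein $\mathbf{P}_i$ in the benchmark dataset $\mathbb{S}$ of Equation (3) can be formulated as

$$\mathbf{P}_i = \left[ \Psi_1^{(i)} \ \ \Psi_2^{(i)} \ \ \Psi_3^{(i)} \ \ \Psi_4^{(i)} \right]^{\mathbf{T}} \tag{9}$$

Now let us use the 4-D vector $\mathbb{L}_i$ to describe its subcellular location(s) in the multi-label system; i.e.,

$$\mathbb{L}_i = \left[ \ell_1^{(i)} \ \ \ell_2^{(i)} \ \ \ell_3^{(i)} \ \ \ell_4^{(i)} \right]^{\mathbf{T}} \tag{10}$$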
where
(11)
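Likewise, for a query Gram-positive bacterial protein $\mathbf{P}$ we have

$$\mathbf{P} = \left[ \Psi_1 \ \ \Psi_2 \ \ \Psi_3 \ \ \Psi_4 \right]^{\mathbf{T}} \tag{12}$$

Its subcellular location label(s) in the multi-label system should accordingly be given by

$$\mathbb{L}^* = \left[ \ell_1^* \ \ \ell_2^* \ \ \ell_3^* \ \ \ell_4^* \right]^{\mathbf{T}} \tag{13}$$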
where
(14)
The
in Equation (13) is given by
(15)
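where N(train) is the number of proteins used to train the model, $\theta$ is a parameter whose optimal value will be determined later, and $D(\mathbf{P}, \mathbf{P}_i)$ is the Euclidean distance [ 95 ] between the query protein (Equation (12)) and the i-th protein (Equation (9)) in the benchmark dataset $\mathbb{S}$; i.e.,

$$D(\mathbf{P}, \mathbf{P}_i) = \sqrt{\sum_{u=1}^{4} \left( \Psi_u - \Psi_u^{(i)} \right)^2} \tag{16}$$

with $\Psi_u$ and $\Psi_u^{(i)}$ denoting the u-th components of the query vector and of the i-th benchmark vector, respectively.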
Thus, the location label vector of Equation (13) for the query Gram-positive bacterial protein is well defined, and hence its subcellular location or locations can be explicitly predicted as well. For example, if only the 1st and 2nd components of that vector are flagged as positive, the query Gram-positive bacterial protein is predicted to be located in the 1st and 2nd subcellular locations (cf. Table 1); if only the 3rd component is flagged, the query protein is predicted to be located in the 3rd subcellular location only; and so forth.
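To make the idea of a multi-label Gaussian kernel regression concrete, the following is a minimal sketch, not the authors’ implementation. It assumes a kernel of the form exp(−D²/θ), a normalized weighted vote over 0/1 training labels, a 0.5 decision threshold, and a fallback to the top-scoring location when no label passes the threshold; none of these details is confirmed by the text, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gkr_predict(query: Sequence[float],
                train_x: Sequence[Sequence[float]],
                train_y: Sequence[Sequence[int]],
                theta: float = 0.125,
                threshold: float = 0.5) -> List[int]:
    """Predict a 0/1 label vector by a Gaussian-kernel weighted vote.

    Assumed (not taken from the paper): kernel exp(-D^2 / theta),
    normalized scores, and a fixed decision threshold.
    """
    weights = [math.exp(-euclidean(query, x) ** 2 / theta) for x in train_x]
    norm = sum(weights) or 1.0  # guard against total underflow
    n_labels = len(train_y[0])
    scores = [sum(w * y[k] for w, y in zip(weights, train_y)) / norm
              for k in range(n_labels)]
    labels = [1 if s >= threshold else 0 for s in scores]
    if not any(labels):  # always report at least one location
        labels[scores.index(max(scores))] = 1
    return labels


# Toy 4-D feature vectors and label vectors (locations 1, 3, and 1+4).
train_x = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
train_y = [[1, 0, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1]]

single = gkr_predict([0.9, 0.0, 0.1, 0.0], train_x, train_y)
double = gkr_predict([1.0, 0.0, 0.0, 1.0], train_x, train_y)
print(single, double)
```

Because every location receives its own score, a query close to a multi-location training protein can be assigned several locations at once, which is the property a multi-label predictor needs.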
The predictor developed via the aforementioned procedures is called pLoc-mGpos, where “pLoc” stands for “predict subcellular localization”, and “mGpos” for “multi-label Gram-positive bacterial proteins”. Figure 1 shows a flowchart illustrating how pLoc-mGpos works.
![]()
Figure 1. A flowchart to show the process of how the pLoc-mGpos predictor works.
3. Results and Discussion
As mentioned in the Chou’s 5-step rule [ 32 ], one of the important procedures in developing a new predictor is how to objectively evaluate its anticipated accuracy. To address this, two issues need to be considered. 1) What metrics should be used to quantitatively reflect the predictor’s quality? 2) What test approach should be adopted to count the metrics scores?
3.1. A Set of Five Metrics for Multi-Label Systems
Different from the metrics used to measure the prediction quality of single-label systems, the metrics for the multi-label systems are much more complicated. To make them more intuitive and easier to understand for most experimental scientists, here we adopt the following intuitive Chou’s five metrics [ 31 ] that have recently been widely used for studying various multi-label systems (see, e.g., [ 30 , 39 , 44 , 50 , 96 - 101 ]):
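$$
\begin{cases}
\text{Aiming} \uparrow \ = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{\left\| \mathbb{L}_k^* \right\|} \right) \\[2ex]
\text{Coverage} \uparrow \ = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{\left\| \mathbb{L}_k \right\|} \right) \\[2ex]
\text{Accuracy} \uparrow \ = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{\left\| \mathbb{L}_k \cup \mathbb{L}_k^* \right\|} \right) \\[2ex]
\text{Absolute true} \uparrow \ = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \Delta\left( \mathbb{L}_k, \mathbb{L}_k^* \right) \\[2ex]
\text{Absolute false} \downarrow \ = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cup \mathbb{L}_k^* \right\| - \left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{M} \right)
\end{cases}
\tag{17}
$$

where $N^q$ is the total number of query proteins or tested proteins, M is the total number of different labels for the investigated system (for the current study it is $M = 4$), $\left\| \cdot \right\|$ means the operator acting on the set therein to count the number of its elements, $\cup$ means the symbol for the “union” in the set theory, $\cap$ denotes the symbol for the “intersection”, $\mathbb{L}_k$ denotes the subset that contains all the labels observed by experiments for the k-th tested sample, $\mathbb{L}_k^*$ represents the subset that contains all the labels predicted for the k-th sample, and

$$
\Delta\left( \mathbb{L}_k, \mathbb{L}_k^* \right) =
\begin{cases}
1, & \text{if all the labels in } \mathbb{L}_k^* \text{ are identical to those in } \mathbb{L}_k \\
0, & \text{otherwise}
\end{cases}
\tag{18}
$$

In Equation (17), the first four metrics with an upper arrow $\uparrow$ are called positive metrics, meaning that the larger the rate is the better the prediction quality will be; the 5th metric with a down arrow $\downarrow$ is called a negative metric, implying just the opposite meaning.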
From Equation (17) we can see the following: 1) the “Aiming” defined by the 1st sub-equation is for checking the rate or percentage of the correctly predicted labels over the practically predicted labels; 2) the “Coverage” defined in the 2nd sub-equation is for checking the rate of the correctly predicted labels over the actual labels in the system concerned; 3) the “Accuracy” in the 3rd sub-equation is for checking the average ratio of correctly predicted labels over the total labels including correctly and incorrectly predicted labels as well as those real labels but are missed in the prediction; 4) the “Absolute true” in the 4th sub-equation is for checking the ratio of the perfectly or completely correct prediction events over the total prediction events; 5) the “Absolute false” in the 5th sub-equation is for checking the ratio of the completely wrong prediction over the total prediction events.
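The five rates described above can be computed directly from sets of observed and predicted labels. The sketch below assumes the usual set-based form of these metrics, in which “absolute false” averages the number of labels on which prediction and observation disagree, divided by the number of possible labels M. The function name `five_metrics` is hypothetical, and the toy predictions reuse the “1 and 2” example from the Introduction.

```python
from typing import Dict, Sequence, Set


def five_metrics(observed: Sequence[Set[int]],
                 predicted: Sequence[Set[int]],
                 n_labels: int) -> Dict[str, float]:
    """Aiming, coverage, accuracy, absolute true, and absolute false rates.

    Every observed label set is assumed to be non-empty.
    """
    sums = {"aiming": 0.0, "coverage": 0.0, "accuracy": 0.0,
            "absolute_true": 0.0, "absolute_false": 0.0}
    for obs, pred in zip(observed, predicted):
        inter = len(obs & pred)
        union = len(obs | pred)
        sums["aiming"] += inter / len(pred) if pred else 0.0
        sums["coverage"] += inter / len(obs)
        sums["accuracy"] += inter / union
        sums["absolute_true"] += 1.0 if obs == pred else 0.0
        sums["absolute_false"] += (union - inter) / n_labels
    n = len(observed)
    return {name: value / n for name, value in sums.items()}


# One protein observed in locations 1 and 2, scored against four predictions.
observed = [{1, 2}] * 4
predicted = [{1, 2}, {1}, {1, 3}, {3}]
rates = five_metrics(observed, predicted, n_labels=4)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Only the first prediction earns credit toward the absolute true rate, while the partially correct ones still contribute to aiming, coverage, and accuracy; this is why the absolute true rate is the hardest of the five to raise.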
3.2. Jackknife Test
Three cross-validation methods are often used in statistical prediction: 1) the independent dataset test, 2) the subsampling (or K-fold cross-validation) test, and 3) the jackknife test [ 95 ]. Of these three, the jackknife test is deemed the least arbitrary, because it always yields a unique outcome for a given benchmark dataset, as elucidated in [ 32 ]. For this reason, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [ 35 , 39 , 41 , 63 , 65 , 102 - 105 ]), and it was also adopted in this study.
3.3. Parameter Determination
Since Equation (15) contains a parameter $\theta$, the predicted results obtained by pLoc-mGpos will depend on the parameter’s value. In this study, the optimal value for $\theta$ was determined by maximizing the absolute true rate (see the 4th sub-equation in Equation (17)) by the jackknife validation on the benchmark dataset. As shown in Figure 2, when $\theta = 1/8$, the absolute true rate reached its highest score. This value was therefore used in the subsequent analysis.
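The parameter search described above amounts to running a jackknife (leave-one-out) test for each candidate θ and keeping the value with the highest absolute true rate. The sketch below illustrates that loop on toy data. The classifier inside `predict` is an assumed Gaussian-kernel vote (kernel form and 0.5 threshold are not taken from the paper), and the function names, candidate values, and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def predict(query: Sequence[float], xs: Sequence[Sequence[float]],
            ys: Sequence[Sequence[int]], theta: float) -> List[int]:
    """Assumed Gaussian-kernel weighted vote with a 0.5 threshold."""
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(query, x)) / theta)
               for x in xs]
    norm = sum(weights) or 1.0
    scores = [sum(w * y[k] for w, y in zip(weights, ys)) / norm
              for k in range(len(ys[0]))]
    labels = [1 if s >= 0.5 else 0 for s in scores]
    if not any(labels):
        labels[scores.index(max(scores))] = 1
    return labels


def jackknife_absolute_true(xs: List[List[float]], ys: List[List[int]],
                            theta: float) -> float:
    """Leave each sample out in turn and count exact label-vector matches."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if predict(xs[i], rest_x, rest_y, theta) == ys[i]:
            hits += 1
    return hits / len(xs)


def select_theta(xs: List[List[float]], ys: List[List[int]],
                 candidates: Sequence[float]) -> float:
    """Return the candidate with the highest jackknife absolute true rate."""
    return max(candidates, key=lambda t: jackknife_absolute_true(xs, ys, t))


# Toy benchmark: two samples each for location 1, location 3, and 1+4.
xs = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.0, 0.1, 0.0],
      [0.0, 0.0, 1.0, 0.0], [0.1, 0.0, 0.9, 0.0],
      [1.0, 0.0, 0.0, 1.0], [0.9, 0.0, 0.0, 1.0]]
ys = [[1, 0, 0, 0], [1, 0, 0, 0],
      [0, 0, 1, 0], [0, 0, 1, 0],
      [1, 0, 0, 1], [1, 0, 0, 1]]

best = select_theta(xs, ys, candidates=[8.0, 0.125])
print(best, jackknife_absolute_true(xs, ys, best))
```

Each left-out sample is predicted by a model that never saw it, so for a given benchmark dataset and θ the jackknife outcome is unique, with no dependence on a random split.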
3.4. Comparison with the State-of-the-Art Predictor
Listed in Table 2 are the rates obtained by the current pLoc-mGpos predictor via the jackknife test on the benchmark dataset (Supporting Information S1). To facilitate comparison, the table also lists the corresponding results obtained by iLoc-Gpos [ 27 ] and Gpos-mPLoc [ 21 ], the two most powerful existing predictors for identifying the subcellular localization of Gram-positive bacterial proteins with both single and multiple sites.
As shown in Table 2, among the five metrics in Equation (17) used to quantitatively measure the quality of a multi-label predictor [ 31 ], the rates for “Aiming”, “Accuracy”, and “Absolute false” were not reported for iLoc-Gpos [ 27 ] and Gpos-mPLoc [ 21 ], which limits how rigorously their prediction quality can be checked. In other words, the authors of the two previous predictors only reported the rates for “Coverage” and “Absolute true”. Even so, their reported success rates are remarkably lower than the corresponding rates achieved by the current predictor pLoc-mGpos proposed in this paper.
![]()
Figure 2. A plot to show the process of finding the optimal θ value in Equation (15). See the main text for further explanation.
![]()
Table 2. Comparison with the state-of-the-art methods in predicting the subcellular localization of Gram-positive bacterial proteinsa.
aThe rates listed below were derived by the jackknife test on the benchmark dataset $\mathbb{S}$ (Supporting Information S1); bSee Equation (17) for the definition of the metrics; cThe predictor proposed in this paper with the parameter θ = 1/8; dThe predictor proposed in [ 27 ]; eThe predictor proposed in [ 21 ].
As pointed out in a comprehensive review [ 31 ], among the aforementioned five metrics listed in Table 2, the most important are “absolute true” and “absolute false”. It is extremely difficult for a multi-label predictor to raise its absolute true rate and lower its absolute false rate. Therefore, in developing methods for predicting subcellular localization of proteins with both single and multiple location sites, many investigators did not even mention the “absolute true” and “absolute false” rates. In contrast, Table 2 clearly shows that the absolute true rate achieved by the current pLoc-mGpos predictor exceeds 97%, while its absolute false rate is only 0.14%, meaning that the error rate is extremely low.
Furthermore, in both the iLoc-Gpos paper [ 27 ] and the Gpos-mPLoc paper [ 21 ], no detailed scores whatsoever were given for the four metrics [ 106 ] widely used in studying various classifications. To fill this gap, let us introduce the following set of metrics:

$$
\begin{cases}
\text{Sn}(i) = 1 - \dfrac{N_-^+(i)}{N^+(i)} \\[2ex]
\text{Sp}(i) = 1 - \dfrac{N_+^-(i)}{N^-(i)} \\[2ex]
\text{Acc}(i) = 1 - \dfrac{N_-^+(i) + N_+^-(i)}{N^+(i) + N^-(i)} \\[2ex]
\text{MCC}(i) = \dfrac{1 - \left( \dfrac{N_-^+(i)}{N^+(i)} + \dfrac{N_+^-(i)}{N^-(i)} \right)}{\sqrt{\left( 1 + \dfrac{N_+^-(i) - N_-^+(i)}{N^+(i)} \right) \left( 1 + \dfrac{N_-^+(i) - N_+^-(i)}{N^-(i)} \right)}}
\end{cases}
\tag{19}
$$

where Sn, Sp, Acc, and MCC represent the sensitivity, specificity, accuracy, and Matthews correlation coefficient, respectively [ 106 ], and i denotes the i-th subcellular location in the benchmark dataset. $N^+(i)$ is the total number of the samples investigated in the i-th subset, whereas $N_-^+(i)$ is the number of the samples in $N^+(i)$ that are incorrectly predicted to be of other locations; $N^-(i)$ is the total number of samples in any location but not the i-th location, whereas $N_+^-(i)$ is the number of the samples in $N^-(i)$ that are incorrectly predicted to be of the i-th location. The metrics of Equation (19) have been widely used to examine the quality of predictors in genome/proteome analysis (see, e.g., [ 46 , 47 , 76 - 80 , 107 - 109 ]) and computational biomedicine (see, e.g., [ 82 , 110 - 112 ]).
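The four per-location scores can be computed from the four counts defined above. The sketch below assumes the count-based formulation commonly used with these definitions (sensitivity and specificity as one minus the respective error fractions, and MCC written in terms of the same counts); the function name, argument names, and example counts are hypothetical and are not results from this study.

```python
import math
from typing import Tuple


def location_metrics(n_pos: int, n_pos_missed: int,
                     n_neg: int, n_neg_wrong: int) -> Tuple[float, float, float, float]:
    """Return (Sn, Sp, Acc, MCC) for one subcellular location.

    n_pos         samples that truly belong to the location
    n_pos_missed  of those, samples predicted to be elsewhere
    n_neg         samples that do not belong to the location
    n_neg_wrong   of those, samples predicted to be in the location
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_wrong / n_neg
    acc = 1 - (n_pos_missed + n_neg_wrong) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_wrong / n_neg)
    denominator = math.sqrt((1 + (n_neg_wrong - n_pos_missed) / n_pos)
                            * (1 + (n_pos_missed - n_neg_wrong) / n_neg))
    return sn, sp, acc, numerator / denominator


# Made-up counts: 100 positives (10 missed), 400 negatives (20 wrongly assigned).
sn, sp, acc, mcc = location_metrics(100, 10, 400, 20)
print(f"Sn={sn:.4f} Sp={sp:.4f} Acc={acc:.4f} MCC={mcc:.4f}")
```

With these counts the result agrees with the familiar confusion-matrix form of MCC (TP = 90, FN = 10, TN = 380, FP = 20), which is a convenient sanity check on the count-based expression.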
Given in Table 3 are the corresponding results obtained by pLoc-mGpos for each of the four subcellular locations. As we can see from the table, all the scores are within the range of 0.8374 to 0.9924, fully consistent with its overall performance as reported in Table 2.
These results clearly demonstrate that the new pLoc-mGpos predictor is indeed very powerful for predicting the subcellular localization of multi-label Gram-positive bacterial proteins.
![]()
Table 3. Performance of pLoc-mGpos for each of the four subcellular locations.
aSee Table 1 and the relevant context for further explanation; bSee Equation (19) for the metrics definition.
3.5. Web Server and User Guide
As pointed out in [ 113 ], user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors or any computational tools. Indeed, user-friendly web-servers, as shown in a series of recent publications [ 40 , 46 , 100 , 107 - 112 , 114 - 122 ], significantly enhance the impact of theoretical work because they can attract a broad range of experimental scientists [ 52 ]. In view of this, the web-server for the new predictor pLoc-mGpos has been established at http://www.jci-bioinfo.cn/pLoc-mGpos/. Moreover, for the convenience of most experimental scientists, a step-by-step guide on how to use the web-server to get their desired results is given below.
Step 1. Opening the web-server at http://www.jci-bioinfo.cn/pLoc-mGpos/, you will see the top page of pLoc-mGpos on your computer screen, as shown in Figure 3. Click on the Read Me button to see a brief introduction about the predictor.
![]()
Figure 3. A semi screenshot for the top page of pLoc-mGpos web-server predictor.
Step 2. Either type or copy/paste the sequences of query Gram-positive bacterial proteins into the input box at the center of Figure 3. The input sequence should be in the FASTA format. For the examples of sequences in FASTA format, click the Example button right above the input box.
Step 3. Click on the Submit button to see the predicted result. For instance, if you use the four protein sequences in the Example window as the input, after 10 seconds or so, you will see the following on the screen of your computer (Figure 4). 1) The names of the subcellular locations numbered from 1 to 4 covered by the current predictor are shown on the top. 2) The query protein Q93QY7 of example-1 corresponds to “1”, meaning it belongs to “cell membrane” only; the query protein P60611 of example-2 corresponds to “3”, meaning it belongs to “cytoplasm” only; the query protein P25959 of example-3 corresponds to “1, 4”, meaning it belongs to “cell membrane” and “extracell”; the query protein P34020 of example-4 corresponds to “3, 4”, meaning it belongs to “cytoplasm” and “extracell”. All these results are fully consistent with experimental observations.
Step 4. As shown on the lower panel of Figure 3, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format of course) via the “Browse” button. To see the sample of batch input file, click on the button Batch-example. After clicking the button Batch-submit, you will see “Your batch job is under computation; once the results are available, you will be notified by e-mail.”
Step 5. Click on the Citation button to find the papers that have played the key role in developing the current predictor of pLoc-mGpos.
Step 6. Click the Supporting Information button to download the Supporting Information mentioned in this paper.
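Since the web-server expects its input in FASTA format (Step 2), it can be useful to check a file locally before submitting it, particularly for batch jobs. The following is a minimal sketch of such a check. The function names and the example records are hypothetical, and whether the server accepts residues outside the 20 standard amino acids is not stated in the text, so the residue check is only an assumption about what is worth flagging.

```python
from typing import Dict, Set

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else f"record_{len(records) + 1}"
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line.upper()
    return records


def unexpected_residues(sequence: str) -> Set[str]:
    """Return the characters that are not standard amino acid codes."""
    return set(sequence) - VALID_RESIDUES


# Hypothetical two-record input; the second record contains an ambiguous 'X'.
example = ">query_1 hypothetical protein\nMKKLLA\nGGTV\n>query_2\nMSTNPKX\n"
records = parse_fasta(example)
for record_name, sequence in records.items():
    print(record_name, len(sequence), sorted(unexpected_residues(sequence)))
```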
![]()
Figure 4. A semi screenshot for the webpage obtained by executing Step 3 of Section 3.5.
4. Conclusion
Gram-positive bacterial protein subcellular location prediction is a challenging problem, particularly when the query Gram-positive bacterial proteins have multi-label features, meaning that they may occur at two or more different location sites. Here, we have developed a new predictor called pLoc-mGpos by incorporating the key GO information into Chou’s general PseAAC [ 32 ]. Compared with iLoc-Gpos [ 27 ], the most powerful existing predictor that also has the capacity to deal with the multiple locations of Gram-positive bacterial proteins, the success scores achieved by the new predictor are overwhelmingly better according to the metrics widely used to measure the quality of multi-label predictors.
Why could the new predictor be so powerful? The key is that the PseAAC vectors used in the new predictor have been optimized via Equation (8) to substantially reduce their dimension while significantly better reflecting the correlation with the desired targets. The novel approach represents a revolutionary breakthrough in using the GO approach for predicting the subcellular localization of proteins with both single and multiple location sites.
It is anticipated that pLoc-mGpos will become a very useful high throughput tool for both basic research and drug development.
Acknowledgments
This work was supported by grants from the National Natural Science Foundation of China (No. 31560316, 61261027, 61262038, 61202313 and 31260273), the Province National Natural Science Foundation of JiangXi (No. 20132BAB201053), the Jiangxi Provincial Foreign Scientific and Technological Cooperation Project (No. 20120BDH80023), and the Department of Education of JiangXi Province (GJJ160866). This paper was partially supported by the National Natural Science Foundation of China (No. 61271114 and No. 61203325) and the Innovation Program of Shanghai Municipal Education Commission (No. 14ZZ068).