pLoc-mGpos: Incorporate Key Gene Ontology Information into General PseAAC for Predicting Subcellular Localization of Gram-Positive Bacterial Proteins

The basic unit of life is the cell. It contains many protein molecules located in its different organelles. The growth and reproduction of a cell, as well as most of its other biological functions, are performed via these proteins. But proteins in different organelles or subcellular locations have different functions. Facing the avalanche of protein sequences generated in the postgenomic age, we are challenged to develop high-throughput tools for identifying the subcellular localization of proteins based on their sequence information alone. Although considerable efforts have been made in this regard, the problem is far from being solved. Most existing methods can deal with single-location proteins only. Actually, proteins with multiple locations may have some special biological functions that are particularly important for drug targets. Using the ML-GKR (Multi-Label Gaussian Kernel Regression) method, we developed a new predictor called “pLoc-mGpos” by extracting in depth the key information from GO (Gene Ontology) into Chou’s general PseAAC (Pseudo Amino Acid Composition) for predicting the subcellular localization of Gram-positive bacterial proteins with both single and multiple location sites. Rigorous cross-validation on the same stringent benchmark dataset indicated that the proposed pLoc-mGpos predictor is remarkably superior to “iLoc-Gpos”, the state-of-the-art predictor for the same purpose. To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mGpos/.


INTRODUCTION
As the most basic unit of life, a cell undergoes the three most important processes of any living thing: growth, reproduction, and death [1]. Thoroughly understanding these processes is one of the fundamental problems in cellular and molecular biology. The knowledge thus acquired is also closely associated with drug development. To achieve this, however, knowledge of the proteins in the different organelles of a cell, i.e., of their subcellular localization, is a prerequisite.
During the last two decades or so, many computational methods were developed to address this problem (see [2,3] as well as a long list of references cited in the two important review articles).
But most of the existing computational methods were designed to treat the single-label system, in which each of the constituent proteins has one, and only one, subcellular location. With more experimental data emerging, however, it has become clear that the localization of proteins in a cell is actually a multi-label system, where some proteins may simultaneously occur in two or more different location sites. This kind of multiplex protein often bears some exceptional biological functions [4][5][6] and deserves special attention [7][8][9][10][11][12], particularly from the viewpoint of selecting multiple targets [13][14][15] or key targets [16][17][18][19] for drug development.
About 10 years ago, some efforts were made to explore this kind of multiplex protein system [6,7,10,12,20-30]. Compared with single-label systems, multi-label systems are much more difficult and complicated to deal with. In particular, it is extremely difficult for a multi-label predictor to yield a decent result for the "absolute true" rate. The reason is as follows. Suppose a Gram-positive bacterial protein is labeled with "1" and "2", meaning that it may simultaneously exist in subcellular locations 1 and 2 in the real world. If its predicted result is "1", or "2", or "1 and 3", or "2 and 3", no score at all will be added for the absolute true rate. Only when the predicted result is exactly "1 and 2", i.e., perfectly identical to the actual labels, will one score be added in calculating the absolute true rate. Therefore, it is the harshest metric for measuring the quality of a multi-label predictor [31]. That is why, in proposing their multi-label predictors, many authors did not even mention the "absolute true rate".
In this study, we used the multi-label theory [31] to develop a new predictor to identify the subcellular localization of Gram-positive bacterial proteins aimed at improving its absolute true and absolute false rates, the two most important and harshest metrics for a multi-label predictor [31].

Benchmark Dataset
According to Chou's 5-step rule [32], which has been widely used by many recent investigators (see, e.g., [33][34][35][36][37][38][39][40][41][42][43][44][45][46][47]) for developing a statistical predictor, the first and foremost step is to construct or select a valid benchmark dataset to train and test the model [1,42,48]. In the literature, the benchmark dataset usually consists of a training dataset and a testing dataset: the former is for training a proposed model, while the latter is for testing it. But as elucidated in [3], one good-quality benchmark dataset suffices if the model is tested by the jackknife or subsampling (K-fold cross-validation) test, because the outcome thus obtained is actually a combination of many different independent dataset tests. In this study, the benchmark dataset was taken from [21,27], for the following reasons. 1) The dataset contains a statistically significant number of Gram-positive bacterial proteins with both single and multiple locations confirmed by experiments. Besides, none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset, which is important for reducing homology bias. 2) It is also the same benchmark dataset used to train and test iLoc-Gpos [27], the state-of-the-art predictor in this area, so the comparison is made under the same conditions and the same criteria. For readers' convenience, the benchmark dataset is given in Supporting Information S1. It contains 519 sequence-different Gram-positive bacterial proteins classified into 4 subsets according to their subcellular locations. An overall view of these proteins in the 4 subcellular locations is given in Supporting Information S2, from which we can see that, of the 519 different Gram-positive bacterial proteins, 515 belong to one location, and 4 to two locations.
A breakdown of the 519 Gram-positive bacterial proteins according to their occurrences in the 4 different subcellular locations is given in Table 1, where

$$N(\text{vir}) = \sum_{k=1}^{N(\text{seq})} n(k) = 515 \times 1 + 4 \times 2 = 523 \tag{1}$$

is the total number of "virtual proteins" [22,49] or "locative proteins" [28] in the benchmark dataset, $N(\text{seq}) = 519$ is the number of sequence-different proteins, and $n(k)$ is the number of different labels (or subcellular locations) marked on the k-th sequence-different Gram-positive bacterial protein. Accordingly, the multiplicity degree MD [31] of the current benchmark dataset is

$$\text{MD} = \frac{N(\text{vir})}{N(\text{seq})} = \frac{523}{519} \approx 1.008 \tag{2}$$

As we can see from Equation (2), MD = 1 means that the system contains no protein with more than one location, while MD > 1 means that some proteins have more than one location. The higher the value of MD, the more protein samples have multiple locations or labels.
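As a quick check of the counts quoted above, the short sketch below recomputes the number of virtual proteins and the multiplicity degree from per-protein label sets. It assumes that MD is the ratio of virtual proteins to sequence-different proteins, which agrees with the statement that MD = 1 when no protein has more than one location; the function name and the location numbers in the toy data are illustrative placeholders, not the real annotations.

```python
def multiplicity_degree(label_sets):
    """Return (N_seq, N_vir, MD) for a list of per-protein label sets.

    Assumption: MD = N_vir / N_seq, where N_vir counts each protein once
    per location label it carries.
    """
    n_seq = len(label_sets)
    n_vir = sum(len(labels) for labels in label_sets)
    return n_seq, n_vir, n_vir / n_seq


# Counts from the text: 515 single-location and 4 two-location proteins.
# The location numbers themselves are placeholders.
toy_labels = [{1}] * 515 + [{1, 4}] * 4
n_seq, n_vir, md = multiplicity_degree(toy_labels)
print(n_seq, n_vir, round(md, 4))
```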
To simplify the description later, the benchmark dataset is denoted by $\mathbb{S}$, which can be further formulated as

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \mathbb{S}_4 \tag{3}$$

where $\mathbb{S}_1$ only contains the Gram-positive bacterial protein samples from the "Cell membrane" location (cf. Table 1), $\mathbb{S}_2$ only contains those from the "Cell wall" location, $\mathbb{S}_3$ only contains those from the "Cytoplasm" location, and $\mathbb{S}_4$ only contains those from the "Extracell" location; $\cup$ denotes the symbol for "union" in set theory.

Proteins Sample Formulation
Now let us consider the 2nd step of Chou's 5-step rule [32]; i.e., how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned. Given a Gram-positive bacterial protein sequence P, its most straightforward expression is

$$\mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \cdots \mathrm{R}_L \tag{4}$$

where $L$ denotes the protein's length or the number of its constituent amino acid residues, $\mathrm{R}_1$ is the 1st residue, $\mathrm{R}_2$ the 2nd residue, $\mathrm{R}_3$ the 3rd residue, and so forth. Since the existing machine-learning algorithms, such as SVM (Support Vector Machine) [36], KNN (K-Nearest Neighbor) [50], and RF (Random Forest) [51], can only handle vectors [52], we have to convert the sequential expression of Equation (4) into a vector. But a vector defined in a discrete model might completely lose all the sequence-order information. To deal with this problem, the PseAAC (Pseudo Amino Acid Composition) was introduced [53][54][55]. Ever since the concept of pseudo amino acid composition or Chou's PseAAC [55][56][57][58] was proposed, it has been widely used in many biomedicine and drug development areas [59,60] as well as nearly all the areas of computational proteomics (see, e.g., [39,43,45,61-73] and a long list of references cited in two review papers [74,75]). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, its idea and approach have been extended to deal with DNA/RNA sequences [76][77][78][79][80][81][82] in computational genomics via PseKNC (Pseudo K-tuple Nucleotide Composition) [83,84]. Recently, a very powerful web-server called "Pse-in-One" [85] and its updated version "Pse-in-One 2.0" [86] were developed, by which users can generate any pseudo components for both protein/peptide and DNA/RNA sequences as they wish or define.
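To illustrate why a purely discrete composition vector loses sequence-order information, the following minimal sketch computes the plain 20-component amino acid composition for two different sequences built from the same residues. The two short sequences are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 native amino acids


def amino_acid_composition(sequence):
    """Return the 20-D vector of residue frequencies (order is discarded)."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


seq_a = "MKVLA"  # hypothetical peptide
seq_b = "ALVKM"  # same residues in reverse order

# Different sequences, identical composition vectors.
print(seq_a != seq_b, amino_acid_composition(seq_a) == amino_acid_composition(seq_b))
```

Any descriptor that is to distinguish such sequences must therefore carry information beyond the composition itself, which is the motivation for PseAAC given above.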
According to the concept of Chou's general PseAAC [32], any protein sequence can be formulated as a PseAAC vector given by

$$\mathbf{P} = \left[\Psi_1 \;\; \Psi_2 \;\; \cdots \;\; \Psi_u \;\; \cdots \;\; \Psi_\Omega\right]^{\mathbf{T}} \tag{5}$$

where $\mathbf{T}$ is the transpose operator, while the integer $\Omega$ is a parameter whose value, as well as the components $\Psi_u$ $(u = 1, 2, \ldots, \Omega)$, will depend on how the desired information is extracted from the amino acid sequence of P, as elaborated below.
Being one type of general PseAAC [32], the GO (Gene Ontology) has been widely used to improve the prediction quality of protein subcellular localization (see, e.g., [23,25,26,87-91]). The advantage of using the GO approach is that proteins mapped into the GO space (instead of Euclidean space or any other simple geometric space) would be better clustered according to their subcellular locations, as elaborated in [9,92]. For the rationale of using the GO approach to predict the protein subcellular localization, and an incisive discussion/analysis to justify the GO approach, see Section VI in a comprehensive review paper [31].
However, the existing GO approaches (see, e.g., [10,23,25,26,87]) have the following shortcomings. 1) Only the digital numbers 0 and 1 (or their simple combination) were used to incorporate the GO information, and hence some important information may be missed. 2) The dimension of the protein vectors, namely Ω of Equation (5), in the previous GO approaches was very high; e.g., it is 1,930 in [88] and 9,567 in [93], and hence may lead to the "curse of dimensionality" or "high-dimension disaster" problem [94].
Here, we are to introduce a novel GO approach, through which we can extract the key information by winnowing many trivial ones so as to significantly reduce the dimension of PseAAC vector of Equation (5).The detailed procedures are as follows.
Step 1. Use BLAST to search all the Gram-positive bacterial proteins in the Swiss-Prot database for those proteins that have high homology (i.e., more than 60% pairwise sequence identity) with the protein P of Equation (4). The proteins thus obtained are collected into a subset, $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$, called the homology set of P. Subsequently, retrieve the GO codes of the protein in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ that has the highest homology with P. Each of the GO codes is a numerical label consisting of 7 digits (see, e.g., [88]). If that protein has no GO code at all, do the same for the 2nd most homologous protein in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$; if it has no GO code either, do the same for the 3rd most homologous one; continue like this until obtaining a GO code or a set of GO codes as given below

$$\left\{ \mathrm{GO}_1^{\mathbf{P}}, \; \mathrm{GO}_2^{\mathbf{P}}, \; \ldots, \; \mathrm{GO}_k^{\mathbf{P}}, \; \ldots, \; \mathrm{GO}_{n_g}^{\mathbf{P}} \right\} \tag{6}$$

where $\mathrm{GO}_k^{\mathbf{P}}$ is the k-th GO code of the first protein in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ found to have a set of GO codes according to the aforementioned order, and $n_g$ is the total number of GO codes it has. Suppose we find from the training dataset that the total number of proteins having exactly the same GO code as $\mathrm{GO}_k^{\mathbf{P}}$ is $N(k)$, of which the number of proteins in the u-th subset is

$$N_u(k) \quad (u = 1, 2, \ldots, L_{\text{cell}}) \tag{7}$$

where $L_{\text{cell}} = 4$ is the total number of subcellular locations investigated (see Equation (3) or Table 1).
Step 2. Based on Equation (7), the general PseAAC vector in Equation (5) and its dimension can be uniquely defined as

$$\Omega = L_{\text{cell}} = 4, \qquad \Psi_u = \underset{k}{\operatorname{Max}} \left\{ \frac{N_u(k)}{N(k)} \right\} \quad (k = 1, 2, \ldots, n_g; \; u = 1, 2, \ldots, 4) \tag{8}$$

where $N(k)$ is the total number of Gram-positive bacterial proteins in the training dataset that have the same GO number as $\mathrm{GO}_k^{\mathbf{P}}$, $N_u(k)$ is the number of those proteins that belong to the u-th subset (Equation (7)), and the operator Max means taking the maximum value among those with respect to different k. It is through this optimization operation that the most important GO information for the current study is extracted and many trivial GO codes are screened out, significantly reducing the dimension of the PseAAC vector. Listed in Supporting Information S3 are the PseAAC vectors defined by Equation (8) for the 519 sequence-different Gram-positive bacterial proteins in Supporting Information S1, respectively. As we can see there, the dimension of the current PseAAC vectors has been reduced to 4, which is hundreds to thousands of times lower than in the previous approaches [21,27,88,93] (e.g., 1,930 in [88] and 9,567 in [93]). This is a major advance in using the GO approach to predict protein subcellular localization.
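The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the BLAST search is skipped (the query's GO codes are passed in directly), each component is taken as the largest fraction of training proteins sharing a GO code that fall in a given location (this ratio form is an assumption), and the function name, the toy training set, and the pairing of GO codes with locations are all hypothetical.

```python
def go_pseaac(query_go_codes, training_set, n_locations=4):
    """Build a 4-D GO-derived vector for one query protein.

    training_set: list of (go_codes, locations) pairs, where go_codes is a
    set of GO identifiers and locations is a set of indices in 1..n_locations.
    Assumption: component u is the maximum over the query's GO codes of
    (training proteins with that code in location u) / (training proteins
    with that code).
    """
    psi = [0.0] * n_locations
    for code in query_go_codes:
        hits = [locs for codes, locs in training_set if code in codes]
        if not hits:            # GO code never seen in the training data
            continue
        for u in range(1, n_locations + 1):
            in_u = sum(1 for locs in hits if u in locs)
            psi[u - 1] = max(psi[u - 1], in_u / len(hits))
    return psi


# Toy training data (illustrative only).
toy_training = [
    ({"GO:0005886"}, {1}),
    ({"GO:0005886", "GO:0005576"}, {1, 4}),
    ({"GO:0005737"}, {3}),
    ({"GO:0005576"}, {4}),
]
vector = go_pseaac(["GO:0005886", "GO:0005576"], toy_training)
print(vector)
```

Whatever the number of GO codes attached to a protein, the resulting vector always has one component per subcellular location, which is how the dimension stays at 4.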

Operation Algorithm
The 3rd step in Chou's 5-step rule [32] concerns the operation algorithm (or engine) used to run the prediction. Here, we adopted the ML-GKR (multi-label Gaussian kernel regression) classifier, as described below.
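According to Equation (8) or Supporting Information S3, the i-th Gram-positive bacterial protein $\mathbf{P}_i$ in the benchmark dataset $\mathbb{S}$ of Equation (3) can be formulated as

$$\mathbf{P}_i = \left[\Psi_1^i \;\; \Psi_2^i \;\; \Psi_3^i \;\; \Psi_4^i\right]^{\mathbf{T}} \quad (i = 1, 2, \ldots, N(\text{seq})) \tag{9}$$

Now let us use the 4-D vector $\mathbf{L}_i$ to describe its subcellular location(s) in the multi-label system; i.e.,

$$\mathbf{L}_i = \left[\ell_1^i \;\; \ell_2^i \;\; \ell_3^i \;\; \ell_4^i\right]^{\mathbf{T}} \tag{10}$$

where

$$\ell_u^i = \begin{cases} +1, & \text{if } \mathbf{P}_i \text{ belongs to the } u\text{-th location} \\ -1, & \text{otherwise} \end{cases} \quad (u = 1, 2, 3, 4) \tag{11}$$

Likewise, for a query Gram-positive bacterial protein $\mathbf{P}_q$ we have

$$\mathbf{P}_q = \left[\Psi_1^q \;\; \Psi_2^q \;\; \Psi_3^q \;\; \Psi_4^q\right]^{\mathbf{T}} \tag{12}$$

Its subcellular location label(s) in the multi-label system should accordingly be given by

$$\mathbf{L}_q = \left[\ell_1^q \;\; \ell_2^q \;\; \ell_3^q \;\; \ell_4^q\right]^{\mathbf{T}} \tag{13}$$

where

$$\ell_u^q = \begin{cases} +1, & \text{if } \Delta_u > 0 \\ -1, & \text{otherwise} \end{cases} \quad (u = 1, 2, 3, 4) \tag{14}$$

The $\Delta_u$ in Equation (14) is given by

$$\Delta_u = \sum_{i=1}^{N(\text{train})} \ell_u^i \exp\left( -\frac{\left\| \mathbf{P}_q - \mathbf{P}_i \right\|^2}{\theta} \right) \tag{15}$$

where $N(\text{train})$ is the number of proteins used to train the model, $\theta$ is a parameter whose optimal value will be determined later, and $\left\| \mathbf{P}_q - \mathbf{P}_i \right\|$ is the Euclidean distance [95] between the query protein (Equation (12)) and the i-th protein (Equation (9)) in the benchmark dataset $\mathbb{S}$; i.e.,

$$\left\| \mathbf{P}_q - \mathbf{P}_i \right\| = \sqrt{\sum_{u=1}^{4} \left( \Psi_u^q - \Psi_u^i \right)^2} \tag{16}$$

Thus, the location label vector $\mathbf{L}_q$ of Equation (13) for the query Gram-positive bacterial protein $\mathbf{P}_q$ is well defined, and hence its subcellular location or locations can be explicitly predicted as well. For example: if $\ell_1^q = \ell_2^q = +1$ while all the other components in Equation (13) are equal to $-1$, this means that the query Gram-positive bacterial protein $\mathbf{P}_q$ is located in the 1st and 2nd subcellular locations (cf. Table 1); if $\ell_3^q = +1$ while all the others are equal to $-1$, this means that the query Gram-positive bacterial protein is located in the 3rd subcellular location only; and so forth.
The predictor developed via the aforementioned procedures is called pLoc-mGpos, where "pLoc" stands for "predict subcellular localization", and "mGpos" for "multi-label Gram-positive bacterial proteins". Shown in Figure 1 is a flowchart to illustrate how pLoc-mGpos works.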
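The decision rule described above can be sketched as follows. This is a minimal illustration rather than the published code: it assumes a Gaussian kernel of the form exp(-d²/θ) on the Euclidean distance d and a threshold at zero for each label, and the function name and the toy training vectors are hypothetical.

```python
import math


def ml_gkr_predict(query, train_vectors, train_labels, theta=0.125):
    """Predict a +1/-1 label vector for one query protein.

    Assumptions: each training protein votes for or against every location
    with weight exp(-d^2 / theta), d being its Euclidean distance to the
    query; a location is assigned when the summed votes are positive.
    """
    n_locations = len(train_labels[0])
    delta = [0.0] * n_locations
    for vector, labels in zip(train_vectors, train_labels):
        squared_distance = sum((a - b) ** 2 for a, b in zip(query, vector))
        weight = math.exp(-squared_distance / theta)
        for u in range(n_locations):
            delta[u] += labels[u] * weight
    return [1 if d > 0 else -1 for d in delta]


# Toy 4-D training vectors with their +1/-1 location labels (illustrative).
toy_vectors = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
toy_labels = [[1, -1, -1, -1], [-1, -1, 1, -1], [1, -1, -1, 1]]

print(ml_gkr_predict([0.9, 0.0, 0.0, 0.1], toy_vectors, toy_labels))
print(ml_gkr_predict([1.0, 0.0, 0.0, 1.0], toy_vectors, toy_labels))
```

Because every label is decided independently, the same rule can return one location, several locations, or in principle none, which is what makes it applicable to a multi-label system.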

RESULTS AND DISCUSSION
As mentioned in Chou's 5-step rule [32], one of the important procedures in developing a new predictor is how to objectively evaluate its anticipated accuracy. To address this, two issues need to be considered. 1) What metrics should be used to quantitatively reflect the predictor's quality? 2) What test approach should be adopted to count the metrics scores?

A Set of Five Metrics for Multi-Label Systems
Different from the metrics used to measure the prediction quality of single-label systems, the metrics for multi-label systems are much more complicated. To make them more intuitive and easier to understand for most experimental scientists, here we adopt Chou's five intuitive metrics [31], which have recently been widely used for studying various multi-label systems (see, e.g., [30,39,44,50,96-101]):

$$\begin{cases}
\text{Aiming}(\uparrow) = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{\left\| \mathbb{L}_k^* \right\|} \right) \\[2ex]
\text{Coverage}(\uparrow) = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{\left\| \mathbb{L}_k \right\|} \right) \\[2ex]
\text{Accuracy}(\uparrow) = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{\left\| \mathbb{L}_k \cup \mathbb{L}_k^* \right\|} \right) \\[2ex]
\text{Absolute true}(\uparrow) = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \Delta\left( \mathbb{L}_k, \mathbb{L}_k^* \right) \\[2ex]
\text{Absolute false}(\downarrow) = \dfrac{1}{N^q} \displaystyle\sum_{k=1}^{N^q} \left( \dfrac{\left\| \mathbb{L}_k \cup \mathbb{L}_k^* \right\| - \left\| \mathbb{L}_k \cap \mathbb{L}_k^* \right\|}{M} \right)
\end{cases} \tag{17}$$

where $N^q$ is the total number of query proteins or tested proteins, $M$ is the total number of different labels for the investigated system (for the current study it is $L_{\text{cell}} = 4$), $\left\| \cdot \right\|$ means the operator acting on the set therein to count the number of its elements, $\cup$ means the symbol for the "union" in set theory, $\cap$ denotes the symbol for the "intersection", $\mathbb{L}_k$ denotes the subset that contains all the labels observed by experiments for the k-th tested sample, $\mathbb{L}_k^*$ represents the subset that contains all the labels predicted for the k-th sample, and

$$\Delta\left( \mathbb{L}_k, \mathbb{L}_k^* \right) = \begin{cases} 1, & \text{if all the labels in } \mathbb{L}_k^* \text{ are identical to those in } \mathbb{L}_k \\ 0, & \text{otherwise} \end{cases} \tag{18}$$

In Equation (17), the first four metrics, marked with an up arrow ↑, are called positive metrics, meaning that the larger the rate, the better the prediction quality; the 5th metric, marked with a down arrow ↓, is called a negative metric, implying just the opposite.
From Equation (17) we can see the following: 1) the "Aiming" defined by the 1st sub-equation checks the rate or percentage of the correctly predicted labels over the practically predicted labels; 2) the "Coverage" defined in the 2nd sub-equation checks the rate of the correctly predicted labels over the actual labels in the system concerned; 3) the "Accuracy" in the 3rd sub-equation checks the average ratio of correctly predicted labels over the total labels, including correctly and incorrectly predicted labels as well as those real labels that are missed in the prediction; 4) the "Absolute true" in the 4th sub-equation checks the ratio of the perfectly or completely correct prediction events over the total prediction events; 5) the "Absolute false" in the 5th sub-equation checks the ratio of the completely wrong predictions over the total prediction events.
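The five metrics described above translate directly into set operations. The sketch below is a minimal illustration, with the function name and the toy data chosen here for the example; it reuses the scenario from the Introduction, in which a protein truly located at "1 and 2" is predicted as "1", as "1 and 3", and as "1 and 2". It assumes every tested protein has at least one observed label.

```python
def multilabel_metrics(true_sets, predicted_sets, n_labels):
    """Return the five multi-label metrics as a dict of averaged rates.

    true_sets / predicted_sets: lists of Python sets of location labels.
    Assumption: every true label set is non-empty; an empty prediction
    contributes 0 to "aiming".
    """
    totals = dict.fromkeys(
        ["aiming", "coverage", "accuracy", "absolute_true", "absolute_false"], 0.0
    )
    for true, predicted in zip(true_sets, predicted_sets):
        n_common = len(true & predicted)
        n_union = len(true | predicted)
        totals["aiming"] += n_common / len(predicted) if predicted else 0.0
        totals["coverage"] += n_common / len(true)
        totals["accuracy"] += n_common / n_union
        totals["absolute_true"] += 1.0 if true == predicted else 0.0
        totals["absolute_false"] += (n_union - n_common) / n_labels
    return {name: value / len(true_sets) for name, value in totals.items()}


observed = [{1, 2}, {1, 2}, {1, 2}]
predicted = [{1}, {1, 3}, {1, 2}]
scores = multilabel_metrics(observed, predicted, n_labels=4)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Only the third prediction counts toward the absolute true rate, although the first two are partly right; this is why that metric is so much harsher than aiming, coverage, or accuracy.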

Jackknife Test
Three cross-validation methods are often used in statistical prediction: 1) the independent dataset test, 2) the subsampling (or K-fold cross-validation) test, and 3) the jackknife test [95]. Of these three, the jackknife test is deemed the least arbitrary, because it always yields a unique outcome for a given benchmark dataset, as elucidated in [32]. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [35,39,41,63,65,102-105]), and it was also used in this study.

Parameter Determination
Since Equation (15) contains a parameter θ, the predicted results obtained by pLoc-mGpos will depend on the parameter's value. In this study, the optimal value for θ was determined by maximizing the absolute true rate (see the 4th sub-equation in Equation (17)) by the jackknife validation on the benchmark dataset. As shown in Figure 2, the absolute true rate reached its highest score when θ = 1/8. This value was used in the rest of the study.
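The parameter search described above can be sketched as a jackknife (leave-one-out) loop wrapped in a scan over candidate θ values. This is an illustration under stated assumptions: the kernel form exp(-d²/θ), the candidate grid, the function names, and the toy data are all hypothetical, and the feature vectors are held fixed here rather than recomputed for each left-out protein.

```python
import math


def predict(query, vectors, labels, theta):
    """Gaussian-kernel vote per label (assumed form), thresholded at zero."""
    delta = [0.0] * len(labels[0])
    for vector, label in zip(vectors, labels):
        d2 = sum((a - b) ** 2 for a, b in zip(query, vector))
        weight = math.exp(-d2 / theta)
        for u, sign in enumerate(label):
            delta[u] += sign * weight
    return [1 if d > 0 else -1 for d in delta]


def jackknife_absolute_true(vectors, labels, theta):
    """Leave each protein out once; count exact matches of the label vector."""
    exact = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if predict(vectors[i], rest_vectors, rest_labels, theta) == labels[i]:
            exact += 1
    return exact / len(vectors)


def best_theta(vectors, labels, grid):
    """Return the candidate theta with the highest jackknife absolute true rate."""
    return max(grid, key=lambda t: jackknife_absolute_true(vectors, labels, t))


# Toy data: two proteins near location 1 and two near location 3.
toy_vectors = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0], [0.0, 0.1, 0.9, 0.0]]
toy_labels = [[1, -1, -1, -1], [1, -1, -1, -1],
              [-1, -1, 1, -1], [-1, -1, 1, -1]]
grid = [1 / 32, 1 / 16, 1 / 8, 1 / 4, 1 / 2, 1.0]

for theta in grid:
    print(theta, jackknife_absolute_true(toy_vectors, toy_labels, theta))
print("selected:", best_theta(toy_vectors, toy_labels, grid))
```

On this tiny, well-separated toy set every candidate scores equally, so the scan is only a demonstration of the procedure; on a real benchmark the rate varies with θ, as plotted in Figure 2.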

Comparison with the State-of-the-Art Predictor
Listed in Table 2 are the rates obtained by the current pLoc-mGpos predictor via the jackknife test on the benchmark dataset (Supporting Information S1). To facilitate comparison, the table also lists the corresponding results obtained by iLoc-Gpos [27] and Gpos-mPLoc [21], the two most powerful existing predictors for identifying the subcellular localization of Gram-positive bacterial proteins with both single and multiple sites.
As shown in Table 2, among the five metrics in Equation (17) used to quantitatively measure the quality of a multi-label predictor [31], the rates for "Aiming", "Accuracy", and "Absolute false" were not reported for iLoc-Gpos [27] and Gpos-mPLoc [21], indicating a lack of rigor in checking the prediction quality. In other words, the authors of the two previous predictors only reported the rates for "Coverage" and "Absolute true". Even so, their reported success rates are remarkably lower than the corresponding rates achieved by the current predictor pLoc-mGpos proposed in this paper.
Figure 2. A plot to show the process of finding the optimal θ value in Equation (15). See the main text for further explanation.
a The rates listed below were derived by the jackknife test on the benchmark dataset $\mathbb{S}$ (Supporting Information S1); b See Equation (17) for the definition of the metrics; c The predictor proposed in this paper with the parameter θ = 1/8; d The predictor proposed in [27]; e The predictor proposed in [21].
As pointed out in a comprehensive review [31], among the aforementioned five metrics listed in Table 2, the most important are "absolute true" and "absolute false". It is extremely difficult for a multi-label predictor to raise its absolute true rate and lower its absolute false rate. Therefore, in developing methods for predicting the subcellular localization of proteins with both single and multiple location sites, many investigators did not even mention the "absolute true" and "absolute false" rates. In contrast, Table 2 clearly shows that the absolute true rate achieved by the current pLoc-mGpos predictor exceeds 97%, while its absolute false rate is only 0.14%, meaning that the error rate is extremely low.
Furthermore, in both the iLoc-Gpos paper [27] and the Gpos-mPLoc paper [21], no detailed scores whatsoever were given for the four metrics [106] widely used in studying various classifications. To make up for this, let us introduce the following set of metrics:

$$\begin{cases}
\text{Sn}(i) = 1 - \dfrac{N_-^+(i)}{N^+(i)} \\[2ex]
\text{Sp}(i) = 1 - \dfrac{N_+^-(i)}{N^-(i)} \\[2ex]
\text{Acc}(i) = 1 - \dfrac{N_-^+(i) + N_+^-(i)}{N^+(i) + N^-(i)} \\[2ex]
\text{MCC}(i) = \dfrac{1 - \left( \dfrac{N_-^+(i)}{N^+(i)} + \dfrac{N_+^-(i)}{N^-(i)} \right)}{\sqrt{\left( 1 + \dfrac{N_+^-(i) - N_-^+(i)}{N^+(i)} \right) \left( 1 + \dfrac{N_-^+(i) - N_+^-(i)}{N^-(i)} \right)}}
\end{cases} \quad (i = 1, 2, 3, 4) \tag{19}$$

where Sn, Sp, Acc, and MCC represent the sensitivity, specificity, accuracy, and Matthews correlation coefficient, respectively [106], and $i$ denotes the i-th subcellular location in the benchmark dataset. $N^+(i)$ is the total number of the samples investigated in the i-th subset, whereas $N_-^+(i)$ is the number of the samples in $N^+(i)$ that are incorrectly predicted to be of other locations; $N^-(i)$ is the total number of samples in any location but not the i-th location, whereas $N_+^-(i)$ is the number of the samples in $N^-(i)$ incorrectly predicted to be of the i-th location. The metrics of Equation (19) have been widely used to examine the quality of predictors in genome/proteome analysis (see, e.g., [46,47,76-80,107-109]) and computational biomedicine (see, e.g., [82,110-112]).
Given in Table 3 are the corresponding results obtained by pLoc-mGpos for each of the four subcellular locations. As we can see from the table, all the scores are within the range of 0.8374 to 0.9924, fully consistent with its overall performance as reported in Table 2.
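The four per-location metrics can be computed from the four counts defined above. The sketch below is a minimal illustration in which the formulas are written in terms of those counts (total positives, missed positives, total negatives, false positives); the function and argument names and the example counts are hypothetical and are not taken from Table 3.

```python
import math


def location_metrics(n_pos, n_pos_missed, n_neg, n_neg_wrong):
    """Return (Sn, Sp, Acc, MCC) for one subcellular location.

    n_pos:        samples that truly belong to the location
    n_pos_missed: of those, samples predicted to be elsewhere
    n_neg:        samples that do not belong to the location
    n_neg_wrong:  of those, samples wrongly predicted to belong to it
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_wrong / n_neg
    acc = 1 - (n_pos_missed + n_neg_wrong) / (n_pos + n_neg)
    mcc = (1 - (n_pos_missed / n_pos + n_neg_wrong / n_neg)) / math.sqrt(
        (1 + (n_neg_wrong - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_wrong) / n_neg)
    )
    return sn, sp, acc, mcc


# Hypothetical counts, for illustration only.
sn, sp, acc, mcc = location_metrics(n_pos=100, n_pos_missed=10, n_neg=400, n_neg_wrong=20)
print(f"Sn={sn:.4f} Sp={sp:.4f} Acc={acc:.4f} MCC={mcc:.4f}")
```

With TP = 90, FN = 10, TN = 380, and FP = 20, the MCC obtained this way coincides with the usual confusion-matrix expression, so the count-based form is a relabeling rather than a different measure.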
The above compelling facts have clearly demonstrated that the new pLoc-mGpos predictor is indeed very powerful for predicting the subcellular localization of multi-label Gram-positive bacterial proteins.
a See Table 1 and the relevant context for further explanation; b See Equation (19) for the metrics definition.

Web Server and User Guide
As pointed out in [113], user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors or any computational tools. Actually, user-friendly web-servers, as shown in a series of recent publications [40,46,100,107-112,114-122], will significantly enhance the impact of theoretical work because they can attract a broad range of experimental scientists [52]. In view of this, the web-server for the new predictor pLoc-mGpos has been established at http://www.jci-bioinfo.cn/pLoc-mGpos/. Moreover, to maximize the convenience of most experimental scientists, a step-by-step guide on how to use the web-server to get their desired results is given below.
Step 1. Opening the web-server at http://www.jci-bioinfo.cn/pLoc-mGpos/, you will see the top page of pLoc-mGpos on your computer screen, as shown in Figure 3. Click on the Read Me button to see a brief introduction about the predictor.
Step 3. Click on the Submit button to see the predicted result. For instance, if you use the protein sequences in the Example window as the input, after 10 seconds or so, you will see the following on the screen of your computer (Figure 4). 1) The names of the subcellular locations numbered from 1 to 4 covered by the current predictor are shown on the top. 2) The query protein Q93QY7 of example-1 corresponds to "1", meaning that it belongs to "cell membrane" only; the query protein P60611 of example-2 corresponds to "3", meaning that it belongs to "cytoplasm" only; the query protein P25959 of example-3 corresponds to "1, 4", meaning that it belongs to "cell membrane" and "extracell"; the query protein P34020 of example-4 corresponds to "3, 4", meaning that it belongs to "cytoplasm" and "extracell". All these results are fully consistent with experimental observations.
Step 4. As shown on the lower panel of Figure 3, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format) via the "Browse" button. To see a sample batch input file, click on the Batch-example button. After clicking the Batch-submit button, you will see "Your batch job is under computation; once the results are available, you will be notified by e-mail."
Step 5. Click on the Citation button to find the papers that have played the key role in developing the current predictor of pLoc-mGpos.
Step 6. Click the Supporting Information button to download the Supporting Information mentioned in this paper.
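Since the batch mode expects a FASTA file, it can be useful to check a file locally before uploading it. The sketch below is a minimal, hypothetical pre-submission check and is not part of the web-server: the function names are illustrative, the two records are placeholder sequences rather than real proteins, and the assumption that only the 20 standard one-letter residue codes are accepted has not been confirmed against the server.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def unexpected_residues(sequence):
    """Return the sorted characters that are not standard residue codes."""
    return sorted(set(sequence) - STANDARD_RESIDUES)


# Placeholder records (not real proteins).
batch_text = """>query-1
MKVLAAGIVG
LLLAQ
>query-2
MSTNPKPQRX
"""
for name, seq in read_fasta(batch_text):
    print(name, len(seq), unexpected_residues(seq))
```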

CONCLUSION
Predicting the subcellular location of Gram-positive bacterial proteins is a challenging problem, particularly when the query proteins have multi-label features, meaning that they may occur at two or more different location sites. Here, we have developed a new predictor called pLoc-mGpos by incorporating the key GO information into Chou's general PseAAC [32]. Compared with iLoc-Gpos [27], the most powerful existing predictor that also has the capacity to deal with the multiple locations of Gram-positive bacterial proteins, the success scores achieved by the new predictor are overwhelmingly better according to the metrics widely used to measure the quality of multi-label predictors. Why could the new predictor be so powerful? The key is that the PseAAC vectors used in the new predictor have been optimized via Equation (8) to substantially reduce their dimension while better reflecting the correlation with the desired targets. The novel approach represents a breakthrough in using the GO approach for predicting the subcellular localization of proteins with both single and multiple location sites.
It is anticipated that pLoc-mGpos will become a very useful high throughput tool for both basic research and drug development.

Figure 1. A flowchart to show the process of how the pLoc-mGpos predictor works.

Figure 3. A semi-screenshot of the top page of the pLoc-mGpos web-server.

Figure 4. A semi-screenshot of the webpage obtained by executing Step 3 of Section 3.5.

Table 1. Breakdown of the Gram-positive bacterial proteins in the benchmark dataset $\mathbb{S}$.

Table 2. Comparison with the state-of-the-art methods in predicting the subcellular localization of Gram-positive bacterial proteins a.
Predictor | Aiming (↑) b | Coverage (↑) b | Accuracy (↑) b | Absolute true (↑) b | Absolute false (↓) b

Table 3. Performance of pLoc-mGpos for each of the four subcellular locations.