An Improved Approach for Rapidly Identifying Different Types of Gram-Negative Bacterial Secreted Proteins
1. Introduction
As a universal and important biological process, protein secretion may occur in all organisms. Proteins secreted by Gram-negative bacteria must cross two lipid bilayers, the cytoplasmic membrane (CM) and the outer membrane (OM), whereas those secreted by Gram-positive bacteria need only cross the CM [1]. The secretion process of the former is therefore more complex than that of the latter, and more secretion systems exist in Gram-negative bacterial cells.
To date, at least nine secretion systems have been discovered in Gram-negative bacteria; they are named type I (T1SS) to type IX (T9SS) secretion systems on the basis of their OM secretion mechanisms [2]. Proteins released via the T1SS are called type I secreted proteins (T1SPs), and the other types are named analogously. According to the presence or absence of N-terminal signal peptides, secreted proteins can be classified into two groups: classically secreted proteins (CSPs) (e.g., T2SPs, T5SPs, T7SPs, T8SPs and T9SPs) and non-classically secreted proteins (NCSPs) (e.g., T1SPs, T3SPs, T4SPs and T6SPs) [3]. They are normally secreted into the extracellular environment or injected directly into host cells, but are sometimes anchored to the OM, or even form part of cell-surface appendages such as flagella and pili [4]. They are essential for bacterial virulence and contribute to various diseases [5] [6], so studying them is crucial for understanding pathogenesis and for drug development. Unfortunately, researchers have paid more attention to the structure and function of the secretion systems than to their secretory products [7]. Moreover, although a number of computational approaches have been designed to identify type-specific Gram-negative bacterial secreted proteins, such as T3SPs [9]-[14] or T4SPs [15]-[19], only a few can distinguish different types of secreted proteins simultaneously.
Building on our previous research [20], this work aims to further improve the recognition of different types of Gram-negative bacterial secreted proteins. First, two different substitution models are developed based on AAC, PSSM and N-terminal signal peptides. Then, an SVM-based multi-classifier, called SecretP v.2.2 in this paper, is constructed with the “one to one” algorithm. When assessed on a test set, SecretP v.2.2 achieves an overall sensitivity of 93.60% for distinguishing six different types of Gram-negative bacterial secreted proteins. Furthermore, a public independent dataset is used to evaluate the performance of SecretP v.2.2 in identifying different types of NCSPs, and the prediction results are comparable to those of the previous version, SecretP v.2.1.
2. Materials and Methods
2.1. Data Sets
To allow a direct comparison between methods, all data sets used in this study are exactly the same as those in our previous work [20]. The training and test sets consist of six types of Gram-negative bacterial secreted proteins: T1SPs, T2SPs, T3SPs, T4SPs, T5SPs and T7SPs. Here, “T1SP” denotes a type I secreted protein, and the remaining types are named analogously. A public independent dataset of 89 NCSPs, constructed by Kampenusa and Zikmanis [21], contains 32 T1SPs, 41 T3SPs and 16 T4SPs. The detailed data processing has been described in our previous work [20], and all data sets used in this study are listed in Table 1.
Table 1. All data sets used in this study.
2.2. Feature Extraction
2.2.1. Amino Acid Composition
Amino acid composition (AAC) represents the occurrence frequencies of the twenty common amino acids in a protein sequence, and each protein is described as a 20-dimensional vector by this method.
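The AAC encoding described above can be illustrated with a minimal sketch. The function name, the alphabetical ordering of residues and the handling of non-standard residues are assumptions made for illustration; the original work does not specify them.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty common amino acids


def amino_acid_composition(sequence):
    """Return the 20 occurrence frequencies of the common amino acids."""
    counts = Counter(sequence.upper())
    # Assumption: residues outside the twenty common ones (e.g. X, B, Z)
    # are ignored when computing frequencies.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no common amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, not taken from the data sets of this study.
aac = amino_acid_composition("MKKLLA")
```

Each protein, whatever its length, is thus mapped to a vector of 20 frequencies that sum to 1.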
2.2.2. Position-Specific Scoring Matrix
The position-specific scoring matrix (PSSM) is commonly used to describe the evolutionary information of amino acid residues in protein sequences, and it has repeatedly been shown that adding the PSSM to a protein substitution model significantly improves prediction performance in protein classification [22] [23]. The PSSM was therefore also chosen to represent protein samples in this study. The PSSM for each protein sequence was first generated by using PSI-BLAST [24] against the Swiss-Prot database, with three iterations and an E-value cut-off of 0.001. This yields a matrix of L rows and 20 columns, where L is the length of the query sequence and the 20 columns represent the occurrence or substitution of each of the twenty common amino acids. Because protein lengths differ, an equation was then used to transform every PSSM to a uniform size, as described in our earlier study [23]. Finally, a 20-dimensional vector is obtained for each protein sequence.
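As an illustration of how an L × 20 matrix can be reduced to a fixed 20-dimensional vector, the sketch below averages each column over the L rows. This is only one common choice and is an assumption here; the exact equation used in this study is the one given in [23] and is not reproduced in this section. The toy matrices are made-up numbers, not PSI-BLAST output.

```python
def pssm_column_means(pssm):
    """Collapse an L x 20 PSSM into a 20-dimensional vector.

    Assumption: each component is the mean of one column over all L rows.
    """
    if not pssm:
        raise ValueError("PSSM has no rows")
    if any(len(row) != 20 for row in pssm):
        raise ValueError("every PSSM row must have 20 columns")
    n_rows = len(pssm)
    return [sum(row[j] for row in pssm) / n_rows for j in range(20)]


# Two toy matrices of different lengths (L = 2 and L = 5).
short_pssm = [[1] * 20, [3] * 20]
long_pssm = [list(range(20)) for _ in range(5)]

vec_short = pssm_column_means(short_pssm)
vec_long = pssm_column_means(long_pssm)
```

Both results have 20 components regardless of L, which is what allows proteins of different lengths to share one feature space.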
2.2.3. N-Terminal Signal Peptides
As a critical factor for distinguishing CSPs from NCSPs, N-terminal signal peptides in protein sequences were predicted by the SignalP 4.1 server [25], and represented by the D-scores.
2.3. Model Construction
The support vector machine (SVM) has been shown to be a powerful machine learning algorithm in computational biology [20] [23] [26] [27] [28] [29]. Here, the LIBSVM program (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) was employed to build the SVM models. The radial basis function (RBF), the default kernel function of LIBSVM, was chosen, and a grid search was used to optimize the regularization parameter C and the kernel width parameter γ. Among the validation methods used in statistical prediction, the jackknife test is deemed the most rigorous and objective [30], and it was adopted for this study. Because the “one to one” algorithm has been found to be more effective than the “one to rest” algorithm [20], it was selected to solve the multi-class classification problem. To reduce the effect of data imbalance, the two classes in each pairwise model were assigned different weights, inversely proportional to their relative proportions in the training set.
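The “one to one” scheme and the class weighting can be sketched as follows. This is a minimal illustration and not the LIBSVM implementation: the function names, the scaling of the weights, the tie-breaking rule and the toy pairwise models are all assumptions, since the text specifies none of them.

```python
from collections import Counter
from itertools import combinations

TYPES = ["T1SP", "T2SP", "T3SP", "T4SP", "T5SP", "T7SP"]
PAIRS = list(combinations(TYPES, 2))  # one binary model per pair of types


def pair_weights(n_a, n_b):
    """Weights inversely proportional to class size.

    Assumption: scaled so that the larger class receives weight 1.
    """
    larger = max(n_a, n_b)
    return larger / n_a, larger / n_b


def one_vs_one_predict(sample, binary_models):
    """Each pairwise model votes for one type; the most-voted type wins."""
    votes = Counter(model(sample) for model in binary_models.values())
    best = max(votes.values())
    # Assumption: ties are broken by the order of TYPES.
    return next(t for t in TYPES if votes.get(t, 0) == best)


# Hypothetical stand-ins for trained SVMs: each model votes "T3SP" when
# T3SP is one of its two types, otherwise for the first type of its pair.
toy_models = {
    (a, b): (lambda s, a=a, b=b: "T3SP" if "T3SP" in (a, b) else a)
    for a, b in PAIRS
}
predicted = one_vs_one_predict(None, toy_models)
```

With six types there are 15 pairwise models. In LIBSVM, per-class weights of this kind are passed to `svm-train` with the `-wi` option, which multiplies C for class i by the given weight.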
2.4. Performance Evaluation
In order to evaluate the performance of different types of SVM models, sensitivity and accuracy are used here, and they are defined by the following equations.
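$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$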
where TP, TN, FP and FN represent true positive, true negative, false positive and false negative, respectively.
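For concreteness, the two measures can be computed from the four counts as in the short sketch below, which uses the standard definitions of sensitivity and accuracy. The counts in the example are invented for illustration and are not results from this study.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted as positive."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up counts for one binary model.
tp, fn, tn, fp = 45, 5, 40, 10
sens = sensitivity(tp, fn)
acc = accuracy(tp, tn, fp, fn)
```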
3. Results
3.1. Parameters Optimization
Based on the features described in Section 2.2 and the “one to one” algorithm, three different multi-classifiers are developed in this study, each containing 15 SVM models. The substitution model of the first multi-classifier consists of AAC and PSSM, a combination called PsePSSM by Shen and Chou [31], and each protein sequence is represented by a 40-dimensional vector. The substitution model of the second multi-classifier combines AAC, PSSM and N-terminal signal peptides, and a 41-dimensional vector is used to describe each protein sequence. Adding N-terminal signal peptides improves the ability of the SVM models to discriminate CSPs from NCSPs, but reduces their ability to distinguish between different types of CSPs or between different types of NCSPs. Therefore, 6 SVM models (13, 14, 34, 25, 27, 57) from the first multi-classifier and 9 SVM models (12, 15, 17, 23, 24, 35, 37, 45, 47) from the second multi-classifier are selected to construct the third, hybrid multi-classifier, which is called SecretP v.2.2 in this study. Here, model “12” denotes the model constructed from the training sets of T1SPs and T2SPs, and the remaining models are named analogously.
The prediction results of the three multi-classifiers are presented in Supplementary Tables S1-S3, respectively. As shown in these tables, models trained on one CSP type and one NCSP type (e.g., 45 and 12) tend to achieve higher accuracies, while those trained on two CSP types or two NCSP types (e.g., 27 and 34) tend to achieve lower accuracies. Comparing Supplementary Tables S1 and S2 shows that adding N-terminal signal peptides slightly improves the performance of the mixed CSP/NCSP models (e.g., 23 and 45), while the performance of the models built from only CSPs or only NCSPs (e.g., 27 and 34) declines. In view of these observations, SecretP v.2.2 is assembled as described in the previous paragraph and chosen as the final predictor for distinguishing different types of Gram-negative bacterial secreted proteins.
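The model selection for the hybrid multi-classifier can be summarized by a simple rule: pairs within the same group (both CSPs or both NCSPs) use the AAC + PSSM features, and mixed pairs additionally use the signal peptide feature. The sketch below is an illustrative restatement of the model lists given in the text; the feature-set labels and function name are assumptions introduced here, and the grouping of types follows the CSP/NCSP classification in the Introduction.

```python
from itertools import combinations

NCSP = {1, 3, 4}  # T1SPs, T3SPs, T4SPs
CSP = {2, 5, 7}   # T2SPs, T5SPs, T7SPs


def feature_set_for(a, b):
    """Choose the feature set for the pairwise model of types a and b."""
    same_group = {a, b} <= NCSP or {a, b} <= CSP
    # 40-dimensional model within a group, 41-dimensional model across groups.
    return "AAC+PSSM" if same_group else "AAC+PSSM+SignalP"


pairs = list(combinations(sorted(NCSP | CSP), 2))
from_first = sorted(
    f"{a}{b}" for a, b in pairs if feature_set_for(a, b) == "AAC+PSSM"
)
from_second = sorted(
    f"{a}{b}" for a, b in pairs if feature_set_for(a, b) == "AAC+PSSM+SignalP"
)
```

The rule yields 6 within-group models and 9 mixed models, matching the two lists of model names given for SecretP v.2.2.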
3.2. Performance on the Independent Data Sets
In order to compare the prediction performance of SecretP v.2.2 with that of the other methods, namely the first and second multi-classifiers described in Section 3.1 and SecretP v.2.1, the test set shown in Table 1 is used here. The statistical results of the four methods are listed in Table 2. SecretP v.2.2 achieves the highest total sensitivity of 93.60%, while the other three methods all achieve the same total sensitivity of 90.12%, although their detailed prediction results differ. This supports the choice of SecretP v.2.2 as the final predictor in this study.
Table 2. Prediction results of the four methods obtained by analyzing the test set.
As described in Section 2.1, a public independent dataset is used to further evaluate the predictive power of SecretP v.2.2 for identifying different types of NCSPs. The results of the four methods for this dataset are listed in Table 3. As shown in Table 3, 86 of the 89 NCSPs are correctly identified by both SecretP v.2.2 and SecretP v.2.1, but only 82 are correctly identified by the first and the second multi-classifiers. In detail, SecretP v.2.2 wrongly predicted 2 T1SPs as T5SPs and 1 as a T7SP, while SecretP v.2.1 wrongly predicted 2 T1SPs as T2SPs and 1 as a T5SP [20]. Therefore, the prediction performance of SecretP v.2.2 for identifying NCSPs is comparable to that of SecretP v.2.1.
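The counts reported for the independent dataset can be converted into overall rates with a short calculation. The sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
# Composition of the independent dataset (Section 2.1).
dataset = {"T1SP": 32, "T3SP": 41, "T4SP": 16}
total = sum(dataset.values())

# Number of correctly identified NCSPs per method, as reported in the text.
correct = {
    "SecretP v.2.2": 86,
    "SecretP v.2.1": 86,
    "first multi-classifier": 82,
    "second multi-classifier": 82,
}

# Percentage of the NCSPs identified correctly, rounded to two decimals.
rates = {name: round(100 * n / total, 2) for name, n in correct.items()}
```

This gives 96.63% for both versions of SecretP and 92.13% for the first and second multi-classifiers.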
4. Discussion and Conclusion
A large number of secreted proteins have been discovered in Gram-negative bacteria in recent years, and they are classified into different types according to their secretion systems. These proteins play an important role in the interactions between bacteria and host cells, and have therefore attracted increasing research attention.
Many computational methods have been proposed to identify secreted proteins, but only a few can distinguish different types of secreted proteins simultaneously. To address this, SecretP v.2.1 was developed in our previous work [20]. SecretP v.2.2 is proposed here for the same purpose, as an upgraded version of SecretP v.2.1. The same training and test sets are used to build the two methods, and both are based on SVM and the “one to one” algorithm. The biggest difference between them lies in the feature sets used to represent protein sequences. The substitution model of SecretP v.2.1 contains AAC and auto covariance (AC), and each protein is translated into a 45-dimensional numerical vector. The substitution models of SecretP v.2.2, in contrast, consist of AAC and PSSM, with or without N-terminal signal peptides, so that a 40-dimensional or 41-dimensional vector represents each protein sequence. The vectors used by SecretP v.2.2 are thus slightly shorter than those used by SecretP v.2.1, which results in shorter computation times. Moreover, although AC can reflect the neighboring effects between amino acid residues in a protein sequence, it has been found that the parameter lg in the equation of AC is not sensitive enough for the classification of different types of secreted proteins [20] [28]. In contrast, PSSM can effectively describe the evolutionary information of amino acid residues in protein sequences, and N-terminal signal peptides are very useful for distinguishing CSPs from NCSPs and non-secreted proteins (NSPs) [27]. Compared with SecretP v.2.1, SecretP v.2.2 therefore appears to be a more reasonable predictor for distinguishing different types of Gram-negative bacterial secreted proteins, and the final results support this view.
Table 3. Prediction results of the four methods obtained by analyzing the independent dataset.
From the comparison between SecretP v.2.2 and SecretP v.2.1, several conclusions can be drawn. 1) The evolutionary information of protein sequences can effectively improve the overall power of predictors for protein classification. 2) Although N-terminal signal peptides were originally used to distinguish CSPs from non-secreted proteins (NSPs), they also play an important role in separating CSPs from NCSPs. 3) Effective feature selection can not only improve the prediction performance of classifiers, but also reduce the dimension of the numerical vectors and hence the computation time. 4) The “one to one” algorithm is well suited to the multi-class classification problem.
Overall, SecretP v.2.2 is established in this work as an improved approach for rapidly and accurately identifying different types of Gram-negative bacterial secreted proteins, and it could be a useful supplement for future secretome studies.
ACKNOWLEDGMENTS
This work was supported by grants from The National Natural Science Foundation of China (21305096), The Fund of Science and Technology Department of Guizhou Province (J[2014]2134) and The Development Program for Youth Science and Technology Talents in Education Department of Guizhou Province (KY[2016]219).
Supplementary
Table S1. Parameter statistics of different SVM models for the first multi-classifier.
Note: “12” represents that this model was constructed by using the training sets of T1SPs and T2SPs, and the remaining are known by analogy with it. Different weights were assigned to reduce the data imbalance, which are inversely proportional to the corresponding rates between any two types of secreted proteins in the training set.
Table S2. Parameter statistics of different SVM models for the second multi-classifier.
Note: “12” represents that this model was constructed by using the training sets of T1SPs and T2SPs, and the remaining are known by analogy with it. Different weights were assigned to reduce the data imbalance, which are inversely proportional to the corresponding rates between any two types of secreted proteins in the training set.
Table S3. Parameter statistics of different SVM models for the third multi-classifier.
Note: “12” represents that this model was constructed by using the training sets of T1SPs and T2SPs, and the remaining are known by analogy with it. The third multi-classifier also contains 15 SVM models, 6 (models 13, 14, 34, 25, 27, 57) of which are from the first multi-classifier, and 9 (models 12, 15, 17, 23, 24, 35, 37, 45, 47) from the second multi-classifier. Different weights were assigned to reduce the data imbalance, which are inversely proportional to the corresponding rates between any two types of secreted proteins in the training set.