Using position specific scoring matrix and auto covariance to predict protein subnuclear localization

Knowledge of subnuclear localization in eukaryotic cells is indispensable for understanding the biological function of the nucleus, genome regulation and drug discovery. In this study, a new feature representation is proposed that combines the position specific scoring matrix (PSSM) with auto covariance (AC). The AC variables describe the neighboring effect between two amino acids and thus incorporate sequence-order information, while the PSSM captures the evolutionary information of proteins. Based on this new descriptor, a support vector machine (SVM) classifier was built to predict subnuclear localization. To evaluate the power of our predictor, a benchmark dataset containing 714 proteins localized in nine subnuclear compartments was used. The total jackknife cross-validation accuracy of our method is 76.5%, which is higher than those of Nuc-PLoc (67.4%), OET-KNN (55.6%), AAC based SVM (48.9%) and ProtLoc (36.6%). The prediction software used in this article and the details of the SVM parameters are freely available at http://chemlab.scu.edu.cn/predict_SubNL/index.htm, and the dataset used in our study, taken from Shen and Chou's work, can be downloaded at http://chou.med.harvard.edu/bioinf/Nuc-PLoc/Data.htm.


INTRODUCTION
The cell nucleus is a complex and important subcellular organelle in eukaryotic cells. It organizes the comprehensive assembly of our genes and their corresponding regulatory factors [1]. It also reflects various intricate biological activities and controls many kinds of biological processes [2]. Many proteins from outside the nucleus tend to be localized to specific subnuclear locations [3]. In humans, if proteins cannot be correctly localized to their specific subnuclear locations, this can lead to genetic disease [4], cancer [5] or virally infected cells [6]. Knowledge of protein subnuclear localization is therefore desirable for an in-depth understanding of cell biological processes and genomic regulation. However, it is costly and time-consuming to assay the subnuclear localization of proteins by biological experiments [7]. Moreover, the number of protein sequences is increasing more rapidly than that of identified proteins [7]. It is thus of great practical significance to develop computational approaches for identifying protein subnuclear localization in the cell nucleus. At the same time, many lines of evidence have indicated that computational approaches, such as structural bioinformatics [8], molecular docking [9], pharmacophore modelling [10], QSAR [11,12,13], protein subcellular location prediction [7,14], identification of membrane proteins and their types [15], identification of enzymes and their functional classes [16], identification of proteases and their types [17], protein cleavage site prediction [18,19], and signal peptide prediction [20,21], can provide very useful information for both basic research and drug discovery in a timely manner. The present study is devoted to developing a new method for predicting protein subnuclear localization, in the hope of stimulating the development of the relevant areas.
Compared to the conventional amino acid composition (AAC), the pseudo amino acid (PseAA) composition [46], originally introduced by Chou [47,48], can include the sequence-order information of sequences. Similarly, PsePSSM was proposed by Shen and Chou to incorporate the evolutionary information of proteins [44].

Data Sets
In this paper, our dataset was obtained from the article by Shen and Chou [44] and can be freely downloaded at http://chou.med.harvard.edu/bioinf/Nuc-PLoc/Data.htm. This dataset consists of 714 proteins in nine classes. Details of this benchmark dataset are shown in Table 1. $S_i$ ($i = 1, 2, \ldots, 9$) represents each of the nine subsets and $S$ represents the total dataset.

Auto Covariance (AC)
We selected three common physicochemical properties, hydrophobicity [49], volumes of side chains of amino acids [50], and polarity [51], to represent the structure and function [52], the stereospecific blockade [53] and the electronic property [54] of residues in a protein, respectively. The original values were taken from Guo et al. [55] and were first normalized to zero mean and unit standard deviation (SD) by Equation (1):

$$P'_{i,j} = \frac{P_{i,j} - \bar{P}_j}{S_j} \qquad (i = 1, 2, \ldots, 20;\ j = 1, 2, 3) \tag{1}$$

where $P_{i,j}$ is the value of the $j$-th descriptor for the $i$-th amino acid, $\bar{P}_j$ is the mean of the $j$-th descriptor over the 20 amino acids and $S_j$ is the corresponding SD. Each protein sequence was thus translated into three vectors, with each amino acid represented by its normalized values.
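The normalization step can be sketched in a few lines of Python. This is a minimal illustration, assuming the population SD is used (the text does not say whether the population or sample SD was taken); the function name `normalize_descriptor` and the input values are hypothetical placeholders, not the property values of Guo et al. [55].

```python
from statistics import mean, pstdev

def normalize_descriptor(values):
    """Scale one descriptor's 20 amino-acid values to zero mean and unit SD."""
    mu = mean(values)
    # Assumption: population SD; statistics.stdev (sample SD) is the other common choice.
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical placeholder values for one descriptor (NOT real property data).
raw = [float(k) for k in range(1, 21)]
scaled = normalize_descriptor(raw)
```

Applying the same function to each of the three descriptors gives the normalized lookup table from which the three numerical vectors of a sequence are read off.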
There are many approaches to converting protein sequences into numerical sequences, including autocorrelation and auto covariance (AC). Autocorrelation, which is quite similar to AC, has been used in the prediction of secondary structure content [56,57,58] and structural class [59,60,61,62]; AC, a statistical tool for analyzing sequences of vectors, has also been successfully adopted by our research group for protein classification from primary sequence [55,63]. In our study, AC was therefore selected to transform the numerical vectors into uniform matrices in order to take the neighboring effect of the sequences into account. Here, lag is the distance between one residue and its neighbour, a certain number of residues away. The AC variables are calculated by Equation (2) [55]:

$$AC_{\mathrm{lag},j} = \frac{1}{L-\mathrm{lag}} \sum_{i=1}^{L-\mathrm{lag}} \left( P_{i,j} - \frac{1}{L}\sum_{k=1}^{L} P_{k,j} \right) \left( P_{(i+\mathrm{lag}),j} - \frac{1}{L}\sum_{k=1}^{L} P_{k,j} \right) \tag{2}$$

where $i$ is the position in the sequence $P$, $j$ is one descriptor, $L$ is the length of the sequence $P$ and lag is the value of the lag.
In this way, the number of AC variables, $D$, can be calculated according to Equation (3) [55]:

$$D = lg \times p \tag{3}$$

where $lg$ is the maximum lag (lag = 1, 2, …, $lg$) and $p$ is the number of descriptors.
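As a rough sketch of the AC transformation, the Python below assumes that Equation (2) is the average, over all residue pairs a distance lag apart, of the product of their mean-centered descriptor values, as in Guo et al. [55]. The names `auto_covariance` and `ac_features` and the three-residue toy table are hypothetical and only serve the illustration.

```python
def auto_covariance(column, max_lag):
    """AC values of one normalized descriptor along a sequence, for lag = 1..max_lag."""
    n = len(column)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    avg = sum(column) / n
    centered = [v - avg for v in column]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]

def ac_features(sequence, property_table, max_lag):
    """Concatenate AC values over all descriptors, giving D = max_lag * p variables."""
    n_desc = len(next(iter(property_table.values())))
    features = []
    for j in range(n_desc):
        column = [property_table[aa][j] for aa in sequence]
        features.extend(auto_covariance(column, max_lag))
    return features

# Toy table with 3 made-up descriptor values per residue (NOT real property data).
toy_table = {"A": (0.5, -1.0, 0.2), "C": (-0.3, 0.4, 1.1), "D": (1.2, 0.0, -0.7)}
toy_features = ac_features("ACDACDDACA", toy_table, max_lag=2)
```

With a maximum lag of 2 and 3 descriptors the toy example yields 2 × 3 = 6 variables, regardless of the sequence length, which is what makes the representation uniform across proteins.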

Position Specific Scoring Matrix (PSSM)
A position specific scoring matrix (PSSM) is a commonly used representation of motifs (patterns) in biological sequences [64]. So far, this method has been used for predicting protein subcellular localization [65] and subnuclear localization [40,44].
For a protein sequence $P$ with $L$ amino acid residues, the PSSM is obtained according to Equation (4) [44]:

$$P_{\mathrm{PSSM}} = \begin{bmatrix} P_{1\to 1} & P_{1\to 2} & \cdots & P_{1\to 20} \\ P_{2\to 1} & P_{2\to 2} & \cdots & P_{2\to 20} \\ \vdots & \vdots & & \vdots \\ P_{L\to 1} & P_{L\to 2} & \cdots & P_{L\to 20} \end{bmatrix} \tag{4}$$

where $i \to j$ denotes the $i$-th amino acid residue of the protein sequence $P$ being mutated to amino acid type $j$ during biological evolution, $P_{i\to j}$ is the score of this mutation and $L$ is the length of the sequence $P$. Here the numerical codes 1, 2, …, 20 represent the 20 native amino acid types ordered by their single-character codes. To obtain the $L \times 20$ scores of $P_{\mathrm{PSSM}}$ in Equation (4), we used three iterations of PSI-BLAST [66] with the default threshold (the default E-value is 0.001) to search the Swiss-Prot database (version 54.4, released on 25 Oct. 2007) for multiple sequence alignment against the protein $P$. The value of $P_{i\to j}$ is then standardized by Equation (5), in which the mean value is taken over the 20 native amino acids and the resulting value lies between -1 and 1. However, because proteins have different lengths $L$, the matrices of the PSSM descriptor in Equation (4) have different numbers of rows. To obtain a uniform representation for protein sequences of different lengths, we converted the PSSM of protein $P$ to a uniform vector through Equation (6) [44]:

$$\bar{P}_{\mathrm{PSSM}} = \left[ \bar{P}_1, \bar{P}_2, \ldots, \bar{P}_{20} \right]^{T}, \qquad \bar{P}_j = \frac{1}{L}\sum_{i=1}^{L} P_{i\to j} \quad (j = 1, 2, \ldots, 20) \tag{6}$$

where $T$ is the transpose operator and $\bar{P}_j$ is the average score over the $j$-th column of the matrix in Equation (4).
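The length-independent PSSM vector can be illustrated with the short Python sketch below, which averages each of the 20 columns over all residues. The function name `average_pssm` and the toy matrices are hypothetical; real input would be the standardized PSI-BLAST scores.

```python
def average_pssm(pssm):
    """Collapse an L x 20 PSSM into a 20-dimensional vector of column means."""
    length = len(pssm)
    if length == 0 or any(len(row) != 20 for row in pssm):
        raise ValueError("expected a non-empty matrix with 20 columns")
    return [sum(row[j] for row in pssm) / length for j in range(20)]

# Two toy proteins of different lengths (made-up scores, NOT PSI-BLAST output).
short_protein = [[0.1 * j for j in range(20)] for _ in range(5)]
long_protein = [[0.1 * j for j in range(20)] for _ in range(50)]
short_vec = average_pssm(short_protein)
long_vec = average_pssm(long_protein)
```

Both toy proteins map to vectors of the same dimension, so proteins of any length can be fed to the same classifier.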
Finally, the averaged PSSM vector describes the evolutionary information of a protein sample, and the AC variables contain the interaction information between two amino acid residues of a sequence. Each protein sequence was therefore converted into a numerical vector by concatenating PSSM and AC. Here, each AC variable was assigned a weight factor of 0.05.
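A minimal sketch of the concatenation follows. It assumes that the weight factor multiplies each AC variable, which the text does not state explicitly, and it uses 39 AC variables as in the final model (lg = 13, p = 3). The name `build_feature_vector` and the constant input values are hypothetical.

```python
AC_WEIGHT = 0.05  # weight factor stated in the text

def build_feature_vector(pssm_mean, ac_values, ac_weight=AC_WEIGHT):
    """Concatenate the 20 averaged PSSM scores with the weighted AC variables."""
    # Assumption: the weight factor is applied by multiplication.
    return list(pssm_mean) + [ac_weight * v for v in ac_values]

# Placeholder inputs: 20 PSSM averages and 39 AC variables.
vector = build_feature_vector([0.0] * 20, [1.0] * 39)
```

Under these assumptions each protein is described by 20 + 39 = 59 numbers.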

Accuracy and Matthew's Correlation Coefficient (MCC)
To evaluate the performance of this method, two measures, accuracy and Matthew's correlation coefficient (MCC), were used in this article. They are calculated by Equation (7) and Equation (8), respectively, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives.
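For illustration, the two measures can be computed as in the Python sketch below. It assumes the standard definitions: accuracy as the fraction of correct predictions, and MCC as (TP×TN − FP×FN) divided by the square root of (TP+FP)(TP+FN)(TN+FP)(TN+FN). The accuracy form in particular is an assumption, since a per-class accuracy may be defined differently. The counts in the example are made up.

```python
from math import sqrt

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct (assumed standard form)."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthew's correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up confusion counts for a single class treated as "positive".
example_acc = accuracy(tp=40, tn=50, fp=5, fn=5)
example_mcc = mcc(tp=40, tn=50, fp=5, fn=5)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the classes are of very different sizes, as is the case for the nine subnuclear compartments.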

RESULTS AND DISCUSSION
In statistical prediction, three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test and the jackknife test [67]. However, as elucidated in [14] and demonstrated by Eq. 50 of [7], the jackknife test is deemed the most objective of the three because it always yields a unique result for a given benchmark dataset, and it has therefore been increasingly used by investigators to examine the accuracy of various predictors (see, e.g., [7,33,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82]). The jackknife test was therefore chosen to validate the current algorithm. Because the benchmark dataset has nine subsets, the one-versus-one multiclass classification scheme led to 9 × (9 − 1)/2 = 36 SVM models for each encoding method. For the AC variables, the value of lg was optimized to 13 through a series of control experiments, and the value of p is 3, so the number of AC variables, D, is 39 (D = lg × p = 13 × 3 = 39) according to Equation (3). Amino acid composition (AAC) has been widely used for predicting subcellular localization [7,14,22,23,24,25,26,27,28,30,31,32,34,35,36,37,83,84,85], so it was also used as a substitution model in our study. Thus, three SVM models, based on AAC, AC and PSSM respectively, were constructed.
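The jackknife procedure and the count of pairwise models can be sketched as follows. To keep the example self-contained, a simple nearest-centroid rule stands in for the SVM; it is not the classifier used in this study, and all names and toy data are hypothetical.

```python
from itertools import combinations

def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): pick the class with the closest mean vector."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def jackknife_accuracy(features, labels, predict=nearest_centroid_predict):
    """Leave each sample out once, train on the rest, and score the held-out sample."""
    correct = 0
    for k in range(len(features)):
        train_x = features[:k] + features[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        if predict(train_x, train_y, features[k]) == labels[k]:
            correct += 1
    return correct / len(features)

# One-versus-one scheme: one binary model per unordered pair of the nine classes.
n_pairwise_models = len(list(combinations(range(9), 2)))

# Toy data with two well-separated classes (made-up values).
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
toy_y = ["a", "a", "a", "b", "b", "b"]
toy_acc = jackknife_accuracy(toy_x, toy_y)
```

Because every sample is held out exactly once and no random split is involved, repeating the test on the same benchmark dataset always gives the same accuracy, which is the property that motivates the choice of the jackknife test.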
The results of the jackknife test are listed in Table 2. As can be seen from Table 2, the prediction accuracy of the PSSM based model is nearly equal to that of the AAC based model, whereas the AC based model gives a lower accuracy of 64.13%. We then constructed models by fusing the three substitution models, so that four fused classifiers were built. Table 2 shows that the accuracies of the four fused models are higher than those of the three individual models. Among the four fused models, the model combining PSSM, AAC and AC is less accurate than the PSSM and AC based model, which achieves the best performance with an accuracy of 76.45%. The final SVM model was therefore built on PSSM and AC. The kernel function of the SVM is the radial basis function (RBF), and the parameters C and γ are listed in the table that can be downloaded at http://chemlab.scu.edu.cn/predict_SubNL/index.htm.
To further examine the prediction power of the current classifier, its performance was also compared with those of existing methods on the same training dataset. The results obtained by several algorithms with different substitution models are summarized in Table 3. From Table 3, we can see that the accuracy obtained by Nuc-PLoc [44] is much higher than those of ProtLoc [43], AAC based SVM and OET-KNN [42]. Compared with Nuc-PLoc, our method performs better, with an accuracy of 76.5%. This means that our method is successful in predicting protein subnuclear localization using only the primary sequences of proteins.

Table 1. The benchmark dataset consists of 714 nuclear proteins classified into nine subnuclear localizations

Table 2. Overall accuracies by jackknife tests with different substitution models on the benchmark dataset of Table 1