Semantic Similarity over Gene Ontology for Multi-Label Protein Subcellular Localization

As one of the essential topics in proteomics and molecular biology, protein subcellular localization has been extensively studied in previous decades. However, most existing methods are limited to the prediction of single-location proteins. In many studies, multi-location proteins are either not considered or assumed not to exist. This paper proposes a novel multi-label subcellular-localization predictor based on the semantic similarity between Gene Ontology (GO) terms. Given a protein, the accession numbers of its homologs are obtained via BLAST search. The homologous accession numbers are then used as keys to search against the Gene Ontology annotation database to obtain a set of GO terms. The semantic similarity between GO terms is used to formulate semantic similarity vectors for classification. A support vector machine (SVM) classifier with a new decision scheme is proposed to classify the multi-label GO semantic similarity vectors. Experimental results show that the proposed multi-label predictor significantly outperforms state-of-the-art predictors such as iLoc-Plant and Plant-mPLoc.


Introduction
In recent years, protein subcellular localization has gained tremendous attention due to its important roles in elucidating protein functions, identifying drug targets, and so on [1]. Computational methods are required to replace time-consuming and laborious wet-lab methods for predicting the subcellular locations of proteins.
Conventional methods for subcellular-localization prediction can be roughly divided into sequence-based methods [2][3][4][5][6] and annotation-based methods [7][8][9][10][11][12][13]. It has been demonstrated that methods based on Gene Ontology are superior [10]. However, most of the existing methods are limited to the prediction of single-location proteins. These methods generally exclude multi-label proteins or are based on the assumption that multi-location proteins do not exist. In fact, there exist multi-location proteins that can simultaneously reside at, or move between, two or more different subcellular locations. Recently, several multi-label predictors have been proposed, including Plant-mPLoc [14], Virus-mPLoc [15], iLoc-Plant [16] and iLoc-Virus [17]. These predictors use GO information and have demonstrated superiority over other methods. However, they only make use of the occurrences of the GO terms and do not exploit the semantic relationships between GO terms.
Since the relationship between GO terms reflects the association between different gene products, protein sequences annotated with GO terms can be compared on the basis of semantic similarity measures. In fact, semantic similarity over Gene Ontology has been extensively studied and has been applied to many biological problems, including protein function prediction [18], subnuclear localization prediction [19], protein-protein interaction inference [20] and microarray clustering [21]. The performance of these predictors depends on whether the similarity measure is relevant to the biological problem at hand. Over the years, a number of semantic similarity measures have been proposed, some of which have been used in natural language processing. For example, Resnik [22] proposed the information content of terms in natural language as a similarity measure. Later, Lord et al. [23] introduced this idea into measuring the semantic similarity of GO terms. Lin et al. [24] proposed a method based on information theory and structural information. More recently, Pesquita et al. [25] reviewed the semantic similarity measures applied to biomedical ontologies.
This paper proposes a novel predictor based on GO semantic similarity for multi-label protein subcellular localization prediction. The proposed predictor differs from other predictors in that 1) it formulates the feature vectors from the semantic similarity over Gene Ontology, which contains richer information than the GO terms alone; 2) it adopts a new strategy to incorporate richer and more useful homologous information from more distant homologs rather than using the top homologs only; 3) it adopts a new decision scheme for an SVM classifier so that it can effectively deal with datasets containing both single-label and multi-label proteins. Results on a recent benchmark dataset demonstrate that these three properties enable the proposed predictor to accurately predict multi-location proteins and outperform three state-of-the-art predictors.

Retrieval of GO Terms
The proposed predictor can use either the accession numbers (AC) or amino acid (AA) sequences of query proteins as input. Specifically, for proteins with known ACs, their respective GO terms are retrieved from the Gene Ontology Annotation (GOA) database using the ACs as the searching keys. For proteins without ACs, their AA sequences are presented to BLAST [26] to find their homologs, whose ACs are then used as keys to search against the GOA database.
While the GOA database allows us to associate the AC of a protein with a set of GO terms, for some novel proteins, neither their ACs nor the ACs of their top homologs have any entries in the GOA database; in other words, no GO terms can be retrieved by using their ACs or the ACs of their top homologs. In such cases, the ACs of the homologous proteins, as returned from the BLAST search, are used successively to search against the GOA database until a match is found. With the rapid progress of the GOA database, it is reasonable to assume that the homologs of the query proteins have at least one GO term [12]. Thus, it is not necessary to use back-up methods to handle the situation where no GO terms can be found. The procedures are outlined in Figure 1.
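The successive search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `retrieve_go_terms`, the in-memory dictionary standing in for the GOA database, and the placeholder accession numbers are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set


def retrieve_go_terms(ranked_acs: Iterable[str], goa: Dict[str, Set[str]]) -> Set[str]:
    """Return the GO terms of the first accession number that has a GOA entry.

    `ranked_acs` is assumed to be ordered as returned by BLAST (best hit first).
    `goa` is a hypothetical in-memory stand-in for the GOA database.
    """
    for ac in ranked_acs:
        terms = goa.get(ac)
        if terms:  # skip ACs that are absent or have an empty annotation set
            return set(terms)
    return set()  # no homolog is annotated


# Toy data: the accession numbers are placeholders, not real database entries.
toy_goa = {"AC_2": set(), "AC_3": {"GO:0005634", "GO:0005737"}}
ranked_homologs = ["AC_1", "AC_2", "AC_3", "AC_4"]
go_terms = retrieve_go_terms(ranked_homologs, toy_goa)
```

For a protein with a known AC, the same function could be called with that AC placed first in the list, so that homologs are consulted only when the protein itself has no annotation.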

Semantic Similarity Measure
To obtain the GO semantic similarity between two proteins, we first introduce the semantic similarity between two GO terms. The semantic similarity between two categories is based on their information content. As suggested by Resnik [22], the similarity of two categories relies on their most specific common ancestor in the GO hierarchy. The semantic similarity between two GO terms $x$ and $y$ is defined as [22]:

$$\mathrm{sim}(x, y) = \max_{c \in A(x, y)} \left[ -\log p(c) \right], \tag{1}$$

where $A(x, y)$ is the set of ancestor GO terms of both $x$ and $y$, and $p(c)$ is the number of gene products annotated to the GO term $c$ divided by the number of all the gene products annotated to the GO taxonomy.
To further incorporate structural information from the GO hierarchy, we used Lin's measure [24] to normalize the above measure. Given two GO terms $x$ and $y$, the similarity is then calculated as:

$$\mathrm{sim}(x, y) = \max_{c \in A(x, y)} \left( \frac{2 \log p(c)}{\log p(x) + \log p(y)} \right). \tag{2}$$

Based on the semantic similarity between two GO terms, we adopted a continuous measure proposed in [21] to calculate the similarity of two proteins, each of which is functionally annotated by a set of GO terms. Given two proteins $P_i$ and $P_j$, annotated by two sets of GO terms $G_i$ and $G_j$ retrieved as described in the Retrieval of GO Terms section, we first computed $S(G_i, G_j)$ as follows:

$$S(G_i, G_j) = \sum_{x \in G_i} \max_{y \in G_j} \mathrm{sim}(x, y), \tag{3}$$

where $\mathrm{sim}(x, y)$ is defined in Equation (2).
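The two term-level measures (Resnik's information content and its Lin-normalized form) can be illustrated on a toy ontology. The sketch below rests on several assumptions that are not taken from the paper: the ontology, the term names and the probabilities $p(c)$ are invented for illustration, a term is counted as its own ancestor so that $\mathrm{sim}(x, x) = 1$ under Lin's measure, and the root-versus-root case is given the value 1 by convention.

```python
import math
from typing import Dict, Set

# Toy ontology: child -> set of parents. Term names are placeholders.
PARENTS: Dict[str, Set[str]] = {
    "root": set(), "a": {"root"}, "b": {"root"}, "a1": {"a"}, "a2": {"a"},
}
# Hypothetical annotation probabilities p(c); ancestors are at least as
# probable as their descendants.
P: Dict[str, float] = {"root": 1.0, "a": 0.5, "b": 0.5, "a1": 0.2, "a2": 0.25}


def ancestors(term: str) -> Set[str]:
    """The term itself plus every term reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_ancestors(x: str, y: str) -> Set[str]:
    """A(x, y): terms that are ancestors of both x and y."""
    return ancestors(x) & ancestors(y)


def sim_resnik(x: str, y: str) -> float:
    """Information content of the most informative common ancestor."""
    return max(-math.log(P[c]) for c in common_ancestors(x, y))


def sim_lin(x: str, y: str) -> float:
    """Resnik's measure normalized by the information content of x and y."""
    denom = math.log(P[x]) + math.log(P[y])
    if denom == 0.0:  # both terms have p = 1 (root vs root); convention assumed here
        return 1.0
    return max(2.0 * math.log(P[c]) / denom for c in common_ancestors(x, y))


resnik_a1_a2 = sim_resnik("a1", "a2")  # most specific common ancestor is "a"
lin_a1_a2 = sim_lin("a1", "a2")
```

In this toy example the siblings `a1` and `a2` share the ancestor `a`, so their Resnik similarity is $-\log 0.5$, whereas `a1` and `b` share only the root and receive a similarity of zero under both measures.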
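Then, $S(G_j, G_i)$ is computed in the same way by swapping $G_i$ and $G_j$. Finally, the overall similarity between the two proteins is given by:

$$SS(G_i, G_j) = \frac{S(G_i, G_j) + S(G_j, G_i)}{S(G_i, G_i) + S(G_j, G_j)}. \tag{4}$$

Thus, for a testing protein $Q_t$, a GO semantic similarity vector $\mathbf{q}_t$ can be formulated by performing pairwise comparisons with every training protein $\{P_i\}_{i=1}^{N}$, where $N$ is the number of training proteins. Then, $\mathbf{q}_t$ can be represented as:

$$\mathbf{q}_t = \left[ SS(Q_t, G_1), \ldots, SS(Q_t, G_i), \ldots, SS(Q_t, G_N) \right]^{\mathsf{T}}, \tag{5}$$

where, inside $SS(\cdot, \cdot)$, $Q_t$ denotes the set of GO terms of the test protein $Q_t$.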
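The protein-level similarity and the resulting feature vector can be sketched as below. The sketch assumes that the directed score sums, over the terms of the first protein, the best match in the second protein, and that the overall similarity is the sum of the two directed scores normalized by the two self-scores. The term-similarity lookup table and the term names `t1` to `t3` are hypothetical stand-ins for a real term-level measure, and every GO term set is assumed to be non-empty.

```python
from typing import Callable, List, Sequence, Set

TermSim = Callable[[str, str], float]


def directed_score(g_i: Set[str], g_j: Set[str], sim: TermSim) -> float:
    """S(G_i, G_j): for each term in G_i take its best match in G_j, then sum."""
    return sum(max(sim(x, y) for y in g_j) for x in g_i)


def protein_similarity(g_i: Set[str], g_j: Set[str], sim: TermSim) -> float:
    """SS(G_i, G_j): symmetric score normalized by the two self-scores."""
    num = directed_score(g_i, g_j, sim) + directed_score(g_j, g_i, sim)
    den = directed_score(g_i, g_i, sim) + directed_score(g_j, g_j, sim)
    return num / den


def similarity_vector(query: Set[str], training: Sequence[Set[str]], sim: TermSim) -> List[float]:
    """q_t: similarity of the query protein to every training protein, in order."""
    return [protein_similarity(query, g, sim) for g in training]


# Hypothetical symmetric term similarities, with sim(x, x) = 1.
_TABLE = {
    frozenset(("t1", "t2")): 0.5,
    frozenset(("t1", "t3")): 0.0,
    frozenset(("t2", "t3")): 0.25,
}


def toy_sim(x: str, y: str) -> float:
    return 1.0 if x == y else _TABLE[frozenset((x, y))]


training_sets = [{"t1"}, {"t2", "t3"}, {"t1", "t2"}]
q_t = similarity_vector({"t1"}, training_sets, toy_sim)
```

With this normalization a protein compared with itself scores 1, so the first component of `q_t` is 1 in the toy example, and the vector has one component per training protein regardless of how many GO terms each protein carries.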

Multi-Label Multi-Class SVM Classification
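To predict the subcellular locations of both single-label and multi-label proteins, a multi-label support vector machine (SVM) classifier is proposed in this paper. Specifically, denote the GO semantic similarity vector of the $t$-th query protein as $\mathbf{q}_t$. Then, given the $t$-th query protein $Q_t$, the score of the $m$-th SVM is

$$s_m(Q_t) = \sum_{r \in S_m} \alpha_{m,r} \, y_{m,r} \, K(\mathbf{p}_r, \mathbf{q}_t) + b_m, \tag{6}$$

where $S_m$ is the set of support vector indexes corresponding to the $m$-th SVM, $\alpha_{m,r}$ are the Lagrange multipliers, $y_{m,r} \in \{-1, +1\}$ are the class labels, $\mathbf{p}_r$ is the GO semantic similarity vector of the $r$-th training protein, $b_m$ is the bias, and $K(\cdot, \cdot)$ is a kernel function; here, the linear kernel is used. Unlike the single-label problem, where each protein has one predicted label only, a multi-label protein can have more than one predicted label. Thus, the predicted subcellular location(s) of the $t$-th query protein are given by:

$$\mathcal{M}(Q_t) = \begin{cases} \{ m : s_m(Q_t) > 0 \}, & \text{if } s_m(Q_t) > 0 \text{ for some } m, \\ \{ \arg\max_{m} s_m(Q_t) \}, & \text{otherwise}, \end{cases} \tag{7}$$

where $m$ ranges over the subcellular locations.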
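A minimal sketch of the scoring and decision steps is given below. It assumes the standard linear-kernel SVM score (a weighted sum of inner products with the support vectors plus a bias) and assumes that the decision scheme assigns every location whose SVM score is positive, falling back to the single top-scoring location when no score is positive. The function names and all numeric values are illustrative only.

```python
from typing import Sequence, Set


def linear_svm_score(
    support_vectors: Sequence[Sequence[float]],
    alphas: Sequence[float],
    labels: Sequence[int],
    bias: float,
    q: Sequence[float],
) -> float:
    """Score of one binary SVM with a linear kernel K(p, q) = <p, q>."""
    total = 0.0
    for alpha, y, p in zip(alphas, labels, support_vectors):
        total += alpha * y * sum(p_k * q_k for p_k, q_k in zip(p, q))
    return total + bias


def predict_labels(scores: Sequence[float]) -> Set[int]:
    """Map per-location SVM scores to a set of 1-based location indexes."""
    positive = {m for m, s in enumerate(scores, start=1) if s > 0}
    if positive:
        return positive
    # No SVM fires: fall back to the single most likely location (assumed rule).
    best = max(range(len(scores)), key=lambda m: scores[m])
    return {best + 1}


# Illustrative values only.
score = linear_svm_score([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5], [1, -1], 0.1, [0.8, 0.2])
multi = predict_labels([-0.3, 0.8, 0.1])    # two SVMs fire
single = predict_labels([-0.9, -0.2, -0.5])  # none fire, take the maximum
```

Under this rule every query protein receives at least one label, while proteins that trigger several SVMs are reported as multi-location.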

Dataset and Performance Metrics
In this paper, the plant dataset used in Plant-mPLoc [14], iLoc-Plant [16] and mGOASVM [27] was used to evaluate the performance of the proposed predictor. The plant dataset was created from Swiss-Prot 55.3. It contains 978 plant proteins distributed in 12 locations. Of the 978 plant proteins, 904 belong to one subcellular location, 71 to two locations, 3 to three locations and none to four or more locations. In other words, about 8% of the plant proteins in this dataset are located in multiple locations. The sequence identity of this dataset was cut off at 25%. To facilitate comparison, the locative accuracy [28] and the actual accuracy were used to assess the prediction performance. Specifically, denote $\mathcal{L}(p_i)$ and $\mathcal{M}(p_i)$ as the true label set and the predicted label set for the $i$-th protein $p_i$ ($i = 1, \ldots, N_{\mathrm{act}}$), respectively. Then, the overall locative accuracy is:

$$\Lambda_{\mathrm{loc}} = \frac{1}{N_{\mathrm{loc}}} \sum_{i=1}^{N_{\mathrm{act}}} \left| \mathcal{M}(p_i) \cap \mathcal{L}(p_i) \right|, \tag{8}$$

where $|\cdot|$ counts the number of elements in a set, $\cap$ denotes set intersection, $N_{\mathrm{act}}$ is the total number of actual proteins and $N_{\mathrm{loc}}$ is the total number of locative proteins. The overall actual accuracy is:

$$\Lambda_{\mathrm{act}} = \frac{1}{N_{\mathrm{act}}} \sum_{i=1}^{N_{\mathrm{act}}} \Delta\left[ \mathcal{M}(p_i), \mathcal{L}(p_i) \right], \tag{9}$$

where

$$\Delta\left[ \mathcal{M}(p_i), \mathcal{L}(p_i) \right] = \begin{cases} 1, & \text{if } \mathcal{M}(p_i) = \mathcal{L}(p_i), \\ 0, & \text{otherwise}. \end{cases} \tag{10}$$

Note that the actual accuracy is more objective and stricter than the locative accuracy [27].
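The two metrics can be sketched as follows, assuming that the number of locative proteins is the total number of true (protein, location) pairs and that a prediction counts toward the actual accuracy only when the predicted label set equals the true label set exactly. The toy label sets are invented for illustration; the counts 904, 71 and 3 are the dataset figures quoted above.

```python
from typing import Sequence, Set


def locative_accuracy(true_sets: Sequence[Set[int]], pred_sets: Sequence[Set[int]]) -> float:
    """Correctly predicted locations divided by the number of true locations."""
    n_loc = sum(len(t) for t in true_sets)
    hits = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    return hits / n_loc


def actual_accuracy(true_sets: Sequence[Set[int]], pred_sets: Sequence[Set[int]]) -> float:
    """Fraction of proteins whose predicted label set matches the true set exactly."""
    exact = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return exact / len(true_sets)


# Toy example: the second protein is under-predicted, the third over-predicted.
true_labels = [{1}, {1, 2}, {3}]
pred_labels = [{1}, {1}, {2, 3}]
loc_acc = locative_accuracy(true_labels, pred_labels)
act_acc = actual_accuracy(true_labels, pred_labels)

# Size of the plant dataset under the two counting conventions.
n_act = 904 + 71 + 3              # actual proteins
n_loc = 904 * 1 + 71 * 2 + 3 * 3  # locative proteins
```

The toy example shows why the actual accuracy is the stricter of the two: three of the four true locations are recovered, but only one of the three proteins is predicted exactly, since both under-prediction and over-prediction are penalized.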

Comparing with State-of-the-Art Predictors
Table 1 compares the performance of the proposed predictor against three state-of-the-art multi-label predictors on the plant dataset. Plant-mPLoc [14], iLoc-Plant [16] and mGOASVM [27] use the accession numbers of homologs returned from BLAST [26] as searching keys to retrieve GO terms from the GOA database. For a fair comparison with these predictors, the performance of our proposed predictor shown in Table 1 was obtained by using the accession numbers of homologous proteins as the searching keys. Unlike Plant-mPLoc and iLoc-Plant, our predictor uses the ACs of the homologous proteins, as returned from the BLAST search, successively to search against the GOA database until a match is found (see Figure 1 for details).
As shown in Table 1, our proposed predictor performs significantly better than Plant-mPLoc and iLoc-Plant. Both the overall locative accuracy and the overall actual accuracy of our proposed predictor are more than 20% (absolute) higher than those of iLoc-Plant (97.9% vs 71.7% and 89.6% vs 68.1%, respectively). Our proposed predictor also performs better than mGOASVM in terms of both the overall actual accuracy (89.6% vs 87.4%) and the overall locative accuracy (97.9% vs 96.2%). The individual locative accuracies of our proposed predictor for all of the 12 locations are also higher than those of Plant-mPLoc, iLoc-Plant and mGOASVM.
In terms of GO information extraction, Plant-mPLoc, iLoc-Plant and mGOASVM only exploit the occurrences

Conclusions and Future Work
This paper proposes a new multi-label predictor based on Gene Ontology semantic similarity to predict the subcellular locations of multi-label proteins. By using the accession numbers of the homologs of the query proteins as the searching keys to search against the GO annotation database, the GO terms of each query protein are retrieved. The semantic similarity over GO terms is then exploited to formulate a GO semantic similarity vector for every query protein. The feature vectors are subsequently classified by support vector machine (SVM) classifiers equipped with a decision strategy that can produce multiple class labels for a query protein. Experimental results demonstrate that the proposed predictor can effectively predict the subcellular locations of multi-label proteins. It was also found that exploiting the semantic similarity over Gene Ontology is conducive to multi-label protein subcellular localization prediction. There are many different methods [20,22,23] for measuring GO semantic similarity, and the measure used in this paper may not be the best for protein subcellular localization. Therefore, as future work, it is of interest to develop a similarity measure that is more relevant to subcellular localization.

Figure 1. Procedures for retrieving GO terms.