Semi-Global Inference in Phenotype-Protein Network

Discovering the genetic basis of diseases is an important goal and a challenging problem in bioinformatics research. Inspired by the network-based global inference approach, a Semi-global inference method is proposed to capture the complex associations between phenotypes and genes. The proposed method integrates phenotype similarities and protein-protein interactions, and establishes profile vectors of phenotypes and proteins. The relevance between each candidate gene and the target phenotype is then evaluated. Candidate genes are ranked according to their relevance marks, and genes that are potentially associated with the target disease are identified from this ranking. The model selects nodes in the integrated phenotype-protein network for inference by exploiting a Phenotype Similarity Threshold (PST), which sheds light on the selection of similar phenotypes for the gene prediction problem. Different vector relevance metrics for computing the relevance marks of candidate genes are discussed. The model is evaluated on Online Mendelian Inheritance in Man (OMIM) data sets, and the experimental evaluation shows that the proposed Semi-global method outperforms existing global inference methods.


Introduction
Figuring out the genetic basis of diseases is a challenge for biomedical research. Traditionally, biology researchers adopt linkage analysis and association studies [1] to discover disease genes, which first locate disease genes in a chromosome region. However, the resolution of this approach is low, and further analysis of candidate genes in a large genomic region is expensive, which hinders gene identification even after a region has been detected.
Many studies have tried to discover disease genes with computational methods. Some related work was based on annotations [2][3][4] or on sequences [5]. However, methods that rely on functional annotations are limited because only a small part of the genes in the genome have been annotated so far, and methods based on sequencing are expensive. Moreover, they treat disease genes as separate and independent, whereas biological processes are not realized by a single molecule but by the complex interactions of proteins, and breakdowns in protein interaction networks can result in diseases [6]. In addition, some research indicates that phenotypically similar diseases are caused by functionally related genes [7], and the proteins coded by these functionally related genes usually have direct or indirect interactions [8]. From this perspective, disease genes can be investigated through the interaction networks of disease proteins.
Recently, researchers have used computational methods to build biological networks that help explore the relationships among biological information at multiple granularities. The network approach in biology has been proposed and is under active research [9], which also facilitates disease gene discovery. A wide range of network-based methods have been proposed for disease gene prioritization [10][11][12][13][14][15][16]. Kasper Lage et al. proposed a method that uses a Bayesian predictor and a ranking of protein complexes linked to human diseases to predict genes of human inherited phenotypes [13]. Xuebing Wu et al. proposed a network-based global inference approach [14]. These methods have achieved some success in disease gene prioritization; they rely primarily on analysis of the topological properties of PPI networks and on the expectation that the products of genes associated with similar diseases interact heavily with each other.
Motivated by these existing network-based approaches, we propose a network-based Semi-global inference model for disease gene prioritization, which selects diseases in the integrated phenotype-protein network for building the profile vectors of candidate genes and the target disease by exploiting a Phenotype Similarity Threshold (PST). The model evaluates the relevance between candidate genes and the given target phenotype. Candidate genes are then ranked according to their relevance marks, and genes that are potentially associated with the target disease are prioritized based on this ranking. To evaluate the effectiveness of the model, it is tested on known phenotype-gene pairs from OMIM. Our research has three contributions.
In Section 2, we briefly introduce the background of network-based candidate gene prioritization by describing the problem formally and discussing related work and its limitations. Section 3 presents the Semi-global inference model and explains how PST is used to select nodes in the phenotype-protein network. Section 4 shows experimental results of the proposed Semi-global inference model under variation of relevance metrics and PST, and comprehensively compares the performance of the proposed model against an existing global inference method. In Section 5, we draw some conclusions and point out further work.

Network Based Candidate Gene Prioritization
Here is a brief description of the network-based disease gene prioritization problem, following [17]: given a target disease d, the input to the candidate disease gene prioritization problem consists of two sets of genes, the known set K and the candidate set C. The known set K contains prior knowledge of the disease d, e.g., it is the set of genes known to be associated with d and with diseases similar to d.
Each gene g ∈ K is associated with a similarity score σ(g, d), indicating the known degree of association between g and d. The candidate set C contains candidate genes, one or more of which is potentially associated with the target disease d (e.g., these genes might be in the linkage interval of d that is identified by association studies). The purpose of network-based disease gene prioritization is to use a PPI network G = (V, E) to compute a score φ(g, d) for each gene g ∈ C that represents the likelihood that g is associated with d.
The PPI network G = (V, E) consists of a set of gene products V and a set of undirected interactions E between these gene products, in which uv ∈ E represents an interaction between u ∈ V and v ∈ V. In this network, the set of interacting partners of a gene product v ∈ V is defined as N(v) = {u ∈ V : uv ∈ E}. Global prioritization methods use this network information to compute φ by propagating σ over G. Candidate genes with high relevance to the target disease of interest are ranked at the top and are regarded as the disease genes.
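To make the network notation concrete, the set of interacting partners of each gene product can be derived directly from an undirected edge list. The following is only a minimal sketch: the function name `neighbor_sets` and the toy gene products are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def neighbor_sets(edges):
    """Map each gene product to the set of its interacting partners.

    `edges` is an iterable of (u, v) pairs from an undirected PPI network.
    """
    partners = defaultdict(set)
    for u, v in edges:
        partners[u].add(v)
        partners[v].add(u)  # undirected: record the interaction in both directions
    return dict(partners)


# Toy network with hypothetical gene products A..D (not data from the paper).
ppi_edges = [("A", "B"), ("A", "C"), ("C", "D")]
partners = neighbor_sets(ppi_edges)
```

Storing the partners as sets keeps membership tests cheap, which is convenient when scores are propagated over the network.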

Related Work
Xuebing Wu et al. proposed a network-based global inference approach called the CIPHER algorithm [14], in which the Pearson correlation coefficient is adopted to evaluate the relevance between candidate genes and the target disease. Another global inference method is based on a network propagation algorithm that formulates constraints on the prioritization function [16].
Although these existing global network-based methods shed some light on the disease gene prioritization problem, they have drawbacks and limitations. The work of Xuebing Wu et al. is based on the assumption of a linear correlation between the profiles of phenotypes and disease genes, which shows some bias against genes whose related proteins have few interactions with their peers [14]. Moreover, as reported in the literature, network-based global inference methods favor genes whose products are highly connected in the network and perform poorly in identifying loosely connected disease genes, due to the centrality of target disease genes [17] and the incomplete and noisy nature of the PPI data [18].
First, in the global inference method, all the diseases in the phenotype similarity network are exploited to generate a prediction, including less related diseases, to profile a target disease. This fails to take into consideration that more similar diseases may play more important roles in inference. No work has been done on disease gene prioritization using only part of the diseases in the phenotype network, and node selection strategies have not been explored. Second, phenotype similarities vary: a target disease has different phenotype similarities to the other diseases in the network. No selection criterion has been proposed to treat the roles of diseases in the phenotype network differently, and no method distinguishes between high and low similarities, which could be used to determine which related diseases to refer to in gene prioritization problems.
Our research aims at exploring these uncovered areas and overcoming the limitations of global inference methods. We propose a Semi-global inference method that exploits PST as the criterion to select phenotypes in the network for inference, which is the essential difference between the proposed Semi-global model and existing global inference methods.

Methodology
In this section we present the mathematical model and show the general framework of the gene prioritization algorithm of Semi-global inference. Furthermore, we explain how the Phenotype Similarity Threshold is exploited for node selection in the phenotype network, which is the core of the Semi-global inference model.
It is important to note that the purpose here is to infer functional associations between genes from functional and physical interactions between their products. For this reason, any reference to interactions between genes in this paper refers to the interactions between their products. Meanwhile, because disease gene prioritization is inferred from phenotypically similar diseases, the terms disease and phenotype are used interchangeably in this paper.

Mathematical Model
• An undirected graph GPhenotype (1) is defined as the phenotype similarity network; its node set is a subset of all the phenotypes, and each element of its edge set is the similarity of a pair of phenotypes.
• An undirected graph is defined as the protein interaction network; its node set is a subset of all the proteins, and each element of its edge set denotes the interaction of a pair of proteins.
• For a given phenotype, a set is defined as the association set of that phenotype; each element of this set is an association between the phenotype and a protein. Set (4) is defined as the global association set, which contains all phenotype-protein associations.
• Given the phenotype similarity network GPhenotype, the protein interaction network and the global association set, the set formed from them is the phenotype-protein network.
• Given a phenotype and a protein, the quantity in (6) denotes one dimension of the profile vector of the protein.
• Phenotype Similarity Threshold (PST) is a manually set similarity value. For a given phenotype, set (7) contains the phenotypes whose similarity to it is higher than or equal to PST. Each element of this set is defined as a Closely Related Phenotype of the given phenotype.

Phenotype Similarity Threshold (PST)
According to the biological assumption that phenotypically similar diseases are caused by functionally related genes [7], the proposed Semi-global inference model takes into consideration only phenotypes that are highly similar to the target disease, with similarities no lower than PST. We use only the Closely Related Phenotypes (refer to (8)) of the target phenotype and exploit the corresponding similarities to characterize it. Therefore, in (9) and (10), given a phenotype p_j, the number of dimensions of the profile vector of p_j is determined by the number of its Closely Related Phenotypes, and the dimensions of the profile vectors of candidate genes are reduced correspondingly.
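The node selection step can be illustrated with a short sketch. It assumes the phenotype similarities are available as a nested dictionary and that profile vectors are keyed by phenotype; the names `closely_related` and `reduce_profile`, the exclusion of the target itself, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
def closely_related(similarity, target, pst):
    """Return the phenotypes whose similarity to `target` is >= pst.

    `similarity` maps a phenotype to a dict {other_phenotype: similarity}.
    Excluding the target itself is an assumption of this sketch.
    """
    return sorted(
        p for p, s in similarity[target].items()
        if p != target and s >= pst
    )


def reduce_profile(profile, selected):
    """Keep only the dimensions of `profile` that belong to `selected`.

    `profile` is a dict keyed by phenotype; missing entries count as 0.0.
    """
    return [profile.get(p, 0.0) for p in selected]


# Toy similarities and a toy gene profile (not data from the paper).
similarity = {"p1": {"p2": 0.9, "p3": 0.4, "p4": 0.7}}
selected = closely_related(similarity, "p1", pst=0.7)
gene_profile = {"p2": 0.5, "p3": 0.2, "p4": 0.0}
reduced = reduce_profile(gene_profile, selected)
```

With a threshold of 0.7, only two of the three related phenotypes survive, so both the target profile and every candidate profile shrink from three dimensions to two.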

Semi-Global Inference
Based on the mathematical model above, we give the computation framework of the proposed Semi-global inference method, which consists of two algorithms to prioritize candidate disease genes.
Algorithm 1, Relevance Mark Calculation, calculates the relevance mark for a given pair of target phenotype and candidate protein. Algorithm 2, Disease Gene Prioritization, takes a target phenotype as the input, evaluates the relevance mark of every candidate protein in the linkage interval, and then prioritizes the candidate proteins based on their relevance marks. Proteins with high relevance marks are regarded as highly related to the target phenotype, and the genes associated with these top-ranked proteins are predicted to be the underlying causal genes of the target disease, which is the predictive result of the Semi-global inference model.
In practice, each of the metrics (11), (12) and (13) is tested in turn in Algorithm 1 as the relevance evaluation of candidate proteins. Algorithm 2 is invoked to prioritize candidate genes for all the phenotypes we are interested in.
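The two-step framework can be sketched as follows, using the three relevance metrics named in the Results section (Pearson correlation coefficient, Cosine similarity and Euclidean distance). This is a sketch under stated assumptions rather than the paper's algorithms: the function names, the negation of the Euclidean distance so that larger marks always mean higher relevance, the zero mark returned for constant vectors, and the toy profiles are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def neg_euclidean(x, y):
    """Negated Euclidean distance, so that a larger mark means more relevant."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def prioritize(phenotype_profile, candidate_profiles, metric):
    """Rank candidates by relevance mark, most relevant first."""
    marks = {g: metric(phenotype_profile, v) for g, v in candidate_profiles.items()}
    return sorted(marks, key=lambda g: marks[g], reverse=True)


# Toy profiles (not data from the paper).
target = [0.9, 0.7, 0.1]
candidates = {"g1": [0.8, 0.6, 0.2], "g2": [0.1, 0.2, 0.9], "g3": [0.5, 0.5, 0.5]}
ranking = prioritize(target, candidates, cosine)
```

Passing the metric as a function argument makes it straightforward to rerun the same prioritization with each relevance metric in turn.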

Results
In this section, we comprehensively evaluate the performance of the proposed Semi-global inference model with different settings of relevance metrics and PSTs. We then compare the proposed model to the global inference method.

Datasets
To evaluate the proposed model, the following data sets are needed:
• Phenotype set and quantified similarities between each pair of phenotypes.
• Protein set and quantified protein interaction between each pair of proteins.
• Set of known pairs (associations) of phenotypes and associated proteins, which serves as the validation set.
The phenotype set and the linkage intervals are obtained from the Online Mendelian Inheritance in Man (OMIM) Morbid Map [19], which provides a publicly accessible and comprehensive database of genotype-phenotype relationships in humans. Phenotype similarities come from the research of van Driel et al. [20]. Quantified protein interaction marks are extracted from the STRING database [21] to build the PPI network. Chromosome mappings of proteins are extracted from the Ensembl database [22]. The validation set is built from the phenotype-protein network by extracting the phenotype-gene mapping from the OMIM Morbid Map and the gene-protein mapping from the bioDBnet database [23], and then mapping the phenotype network to the PPI network.
Phenotypes that cannot be mapped to proteins are removed, due to the lack of known associated genes or incomplete information on the proteins coded by genes in the linkage interval. We finally obtain 1897 phenotypes and 84652 proteins in total, while only 156584 protein-protein interactions are available. Missing PPI records are regarded as zero. In total, 2549 known phenotype-gene pairs are retained for evaluation.

Experimental Setting
We apply leave-one-out cross-validation to evaluate the performance of different methods in terms of accuracy of disease gene prioritization. For each disease of interest, we conduct the following experiment:
• We remove all associations of this target disease from the global association set (refer to (4)).
• All the genes in the linkage interval are regarded as candidate genes to be prioritized. On average, there are 750 candidate genes in the linkage interval of a disease.

• In practice, we exploit a Position Parameter λ to get PST: phenotype similarities are sorted in an array in ascending order, and PST is assigned the value retrieved from the array at index (array size × λ), so PST is determined by assigning λ a value from zero to one. It is important to note that when λ = 0, all the nodes in the phenotype network are considered in inference. In this case, the Semi-global model degenerates into global inference; thus, the global inference method is a special case of the proposed Semi-global model with λ = 0. We conduct experiments with two methods to get PST:
Static method (S-PST). All the phenotype similarities are sorted in one array. PST is a global static value for all target diseases during the experiment.
Fold Enrichment. Ability to enrich known disease genes over random selection [13].
Distribution of Cases. Percentage of the test cases ranked within top 1%, top 5% and top 10%.
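The lookup of PST from the Position Parameter and the Distribution of Cases criterion can be sketched as below. The function names are hypothetical, clamping the index so that a Position Parameter of 1 stays inside the array is an assumption of this sketch, and the rank ratio (rank divided by the number of candidates) is assumed as the quantity compared against the top 1%, 5% and 10% cutoffs.

```python
def pst_from_position(similarities, lam):
    """Pick PST as the value at index (array size * lam) of the ascending array."""
    ordered = sorted(similarities)
    # Clamping the index is an assumption so that lam = 1 stays in range.
    index = min(int(len(ordered) * lam), len(ordered) - 1)
    return ordered[index]


def top_percent_distribution(rank_ratios, cutoffs=(0.01, 0.05, 0.10)):
    """Fraction of test cases whose rank ratio falls within each cutoff.

    A rank ratio is rank / number of candidates for one known disease gene.
    """
    total = len(rank_ratios)
    return {c: sum(r <= c for r in rank_ratios) / total for c in cutoffs}


# Toy values (not data from the paper).
all_similarities = [0.1, 0.4, 0.2, 0.9, 0.6]
pst_global = pst_from_position(all_similarities, 0.0)  # lowest value: keeps every phenotype
pst_half = pst_from_position(all_similarities, 0.5)
distribution = top_percent_distribution([0.005, 0.03, 0.08, 0.5])
```

With a Position Parameter of zero the threshold equals the smallest similarity, so no phenotype is filtered out, which matches the degenerate global case described in the text.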

Experiment with Variation of PST and Relevance Metrics
The proposed model with Euclidean distance shows a rapid increase of average rank with the increase of λ, though its performance is always poorer than that of the model with the other two relevance metrics. The model exhibits a high average rank with high PST (high Position Parameter λ) using S-PST, regardless of the relevance metric adopted.
For the model with Euclidean distance and Cosine similarity, fold enrichment gets higher along with the increase of λ. On the other hand, Figure 1 to Figure 4 show that the proposed model with Cosine similarity achieves higher performance than with the other two relevance metrics. Moreover, the trend of the performance with increasing λ shows that the model gains better performance when highly similar diseases are referred to in order to profile the target disease and candidate genes, in which case the profile vectors consist of only a few dimensions and only a small part of the nodes (e.g. diseases holding the top 5% highest similarities in the whole phenotype network in S-PST and diseases holding
The model with the Pearson correlation coefficient reaches its best performance when λ = 0 (global inference method) and shows a decline with the increase of λ. Therefore, the proposed Semi-global inference does not increase the performance if Pearson correlation is adopted as the relevance metric.

Comparison to Global Inference Method
Here we discuss the cases in which D-PST is exploited with a λ chosen to give relatively high performance for each relevance metric, and compare them to the global inference method using the CIPHER algorithm [14] with the same relevance metric. Table 3 shows that, with the same relevance metrics, more known disease genes are ranked within top 1%, top 5%

Conclusions
In this paper, a Semi-global inference model with PST is proposed for disease gene prioritization, which applies profile vectors in the phenotype-protein network to characterize the target disease and candidate genes. The model is evaluated comprehensively on the OMIM dataset, and the experimental results show that the proposed Semi-global model outperforms the existing global inference method.
The Phenotype Similarity Threshold (PST) is proposed and Closely Related Phenotypes are defined. PST is adopted as a criterion to select diseases in the phenotype network to profile the target disease. By considering only highly similar diseases, the proposed PST is significant for node selection in the phenotype-protein network for the gene prioritization problem, and, as a trial, it demonstrates a novel understanding of the well-accepted belief that phenotypically similar diseases are caused by functionally related genes.
The effects of different relevance metrics of profile vectors, different PST methods and variation of PST on the proposed model are discussed. The proposed model with Cosine similarity as the relevance metric shows higher performance than the model using the other two metrics. Moreover, the proposed model achieves performance improvement along with the increase of PST when Cosine similarity and Euclidean distance are adopted as relevance metrics. We have also shown that the proposed Semi-global model using D-PST exhibits a higher average rank, higher fold enrichment and a more favorable distribution than the global method.
Further research includes the configuration of the Semi-global model (proper PST, Position Parameter and relevance metric) to achieve better performance, the sensitivity of the proposed model to noise in PPI data, and the issue of bias that occurs in global inference.
Dynamic method (D-PST). PST is retrieved from a smaller phenotype similarity set containing only the similarities related to the current target disease. Different PSTs are obtained for the prediction of different target diseases, according to the similarity range of each target disease.
• We conduct the experiment with each combination of relevance metrics and PST methods.
• In order to systematically compare the performance of the proposed model, we use the following evaluation criteria:
Average Rank. Average rank of known disease genes in the proposed model.
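A per-disease threshold and the Average Rank criterion can be sketched as follows. The names `dynamic_pst` and `average_rank`, the index clamping, and the toy similarities and rankings are assumptions for illustration only; ranks are counted from 1.

```python
def dynamic_pst(similarity_to_target, lam):
    """PST taken only from the similarities of one target disease.

    `similarity_to_target` maps each other phenotype to its similarity
    with the current target disease.
    """
    ordered = sorted(similarity_to_target.values())
    # Clamping the index is an assumption so that lam = 1 stays in range.
    index = min(int(len(ordered) * lam), len(ordered) - 1)
    return ordered[index]


def average_rank(test_cases):
    """Mean 1-based rank of the known disease gene over all test cases.

    Each test case is a pair (ranked_candidates, known_gene).
    """
    ranks = [ranking.index(gene) + 1 for ranking, gene in test_cases]
    return sum(ranks) / len(ranks)


# Toy inputs (not data from the paper).
pst_a = dynamic_pst({"p2": 0.2, "p3": 0.8, "p4": 0.5}, 0.5)
pst_b = dynamic_pst({"p1": 0.1, "p3": 0.3}, 0.5)
mean_rank = average_rank([(["g1", "g2", "g3"], "g1"), (["g4", "g5", "g6"], "g6")])
```

The two toy diseases receive different thresholds from the same Position Parameter because each threshold is drawn from that disease's own similarity range.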

Figure 1. Average rank to compare the performance of proposed model using S-PST with different relevance metrics.

Figure 2. Average rank to compare the performance of proposed model using D-PST with different relevance metrics.

Figure 3. Fold enrichment to compare the performance of S-PST model with different metrics.

Figure 4. Fold enrichment to compare the performance of S-PST model with different metrics.

top 5% highest similarities to the target disease in D-PST) are exploited. This indicates that the strategy of referring to only part of the diseases in the proposed Semi-global model works well with these two relevance metrics (especially with Euclidean distance), and that node selection with PST and dimension reduction of profile vectors achieve performance improvement.

Figure 5. Comparison of distribution of test cases between the proposed Semi-global model using D-PST and a global inference method.

Table 1 and Table 2 demonstrate that the Semi-global model with D-PST and high λ outperforms the global inference method using the same relevance metrics. Especially for Euclidean distance, when λ is assigned a high value, the Semi-global model shows much higher performance than global inference.

Table 3. Percentage of the known disease genes ranked within top 1%, top 5% and top 10% in the proposed Semi-global model and a global inference model.
Top 1%    Top 5%    Top 10%
It ranks more than 75% of the test cases within the top 1%, and its accumulated ratio of test cases is higher than that of the global inference method.