A Clique-based Approach to the Identification of Common Gene Association Sub-networks

We developed a computational framework to identify common gene association sub-network. This framework combines graphical lasso model, graph product and a replicator equation based clique solver. We applied this method to find common stress responsive sub-networks from two related Deinococcus-Thermus bacterial species.


Introduction
Gene and gene products interact with each other due to biochemical interactions and regulatory activities [1].Many methods have been developed to analyze these networks.Popular methods include weighted correlation network analysis (WGCNA) [2], Bayesian networks [3], autoregressive models [4], state-space models [5] and graphical Gaussian models [6].Few studies however, have been devoted to analyze networks from multiple species simultaneously.In this study, we focus on the identification of common gene association sub-networks from multiple species.First we need to derive gene network from individual species.We chose Graphical Lasso model [6] for this task because it can handle large covariance/correlation matrices of mathematically deficient rank which is often the case for genomic data.
Identification of common gene association sub-networks is related to the subgraph isomorphism problem.The subgraph isomorphism problem can be reduced to finding maximal clique from merged graph [7] which can be constructed following graph product rules.There are a number of heuristics for finding maximal cliques.Local search may be the simplest greedy strategy that starts with some initial solution and moves from neighbor to neighbor as long as possible while increasing the clique number.The main problem with this strategy is its inability to escape local maxima where the search cannot find any further neighborhood solution.Battiti and Protasi [8] proposed reactive local search that allows the search to explore solutions that do not decrease the clique number by dynamically changing some of the parameters [8].Another widely used heuristic is replicator equation [7].This method is based on a continuous formulation of the maximal clique problem as quadratic programming [9].
The paper is organized as follows.In Section 2, we will describe graphical Lasso method to construct gene association networks.We will also describe graph merging and how to find maximal cliques using replicator equations.The experiments and results will be discussed in Sections 3 & 4. We offer a conclusion in Section 5.

Methods
To find common gene association sub-networks from two species, we need to perform ortholog mapping from two species.Orthologs are genes in different species that originated by vertical descent from a single gene of the last common ancestor.We will then construct gene association network of the orthologous genes for the two species respectively.This is followed by graph merging and maximal clique searching of the merged graph.Finally the common sub-networks are recovered for each species.Figure 1 shows the overview of the approach.

Construction of Gene Association Network
The inverse covariance matrix is used to construct individual gene association network.Gene i and j are considered conditionally independent given other genes if [6].Meinshausen and Muhlmann [10] estimated a sparse graph by fitting a lasso model to each variable using others as predictors.Friedman et al. made this method faster by solving a 1000-node problem (≈500,000 parameters) in less than 1 minute [6].Consider a multidimensional normal distribution of dimension p, with mean μ and covariance matrix .Let S be the empirical covariance matrix, the estimation of  −1 is the solution to the following optimization problem: where tr denotes trace and the L 1 norm, and  is the user-defined penalty.Banerjee et al. [11] show the problem is convex and considered estimating  rather than its inverse.Let W be the estimate of  , the problem is solved by optimizing over each row and corresponding column of W in a block coordinate descent fashion.Partition W and S as: where is a matrix, and is a and c W is a scalar.The dimensions are the same for the corresponding partitions in S. The solution satisfies This is a box-constrained quadratic program and can be solved using an interior-point procedure [12].By permuting the rows and columns so the target column is always the last, they solve a similar problem like (3) for each column, updating their estimate of W after each stage [11].This is repeated until convergence.If this procedure is initialized with a positive definite matrix, they showed that the iterates from this procedure remains positive definite and invertible even if which is normally the case for gene expression data.That is also one of the reasons that we choose this method to construct gene association network.Using convex duality, Banerjee et al. showed that solving (3) is equivalent to solving the following minimization problem which resembles a regularized least squares problem.
The solution for problem (3) is b . Algorithm 1 describes the procedure to compute , the estimate of  .In algorithm 1,  is solved using coordinate descent described by Friedman et al. [13] and Wu and Lange [14].The threshold T is typically defined as diag t ave S   , where are the off-diagonal elements of the empirical covariance matrix , and t is typically set to a small number such as 0.001.The computing efficiency can be further improved via active set convergence [13].

Identification of Conserved Gene Association Sub-Networks
Detecting common sub-network is a challenging task.However, we are approaching this problem from a graph product point of view.We will merge the two graphs by mapping corresponding orthologous genes and create the edges for merged graph based on the following graph product rule.
In other words, an edge in m indicates both 1 and 2 contain the edge or neither of them contain the edge.Finding common sub-networks in 1 and 2 can be reduced to finding maximal cliques in m .A subset of vertices C is called a clique if all its vertices are mutually adjacent.A clique is said to be maximal if it is not contained in any larger clique.Pelillo has established equivalence between the graph isomorphism problem and the maximal clique problem [7].The Motzkin-Straus theorem [9] has established a connection between the maximal cliques and the local maximizers of the following quadratic function: where , and A is the adjacency matrix for m G with .Specifically, it states that a   , 0, Otherwise of a graph is a maximum clique if and only if its characteristic vector c X is a global maximizer of f on  .A similar relationship holds between local maximizers and maximal cliques [15].The Motzkin-Straus theorem has served as the basis of many clique-finding procedures [16][17][18].Pardalos and Phillips [17] observed that there existed spurious solutions to the original Motzkin-Straus formula, and Pelillo and Jagota [15] confirmed this finding later in 1996.Bomze provided a straight-forward solution to this problem [19].Consider the following regularized version of function , which is obtained from (6) by substituting the adjacency matrix A with 1 2 , where p I is the identity matrix.p p  We can avoid spurious solutions by substituting A with .W The optimization problem   T , f x x Wx x    may have many local maxima.Each large local maximum correspond to a true common gene association sub-net-work, while small local maxima usually result from noises and outliers.Given an initialization   1 x , the corresponding local solution x  can be efficiently obtained by replicator equation, which arises in evolutionary game theory.The discrete time version of first-order replication equation has the following form: The simplex  is invariant under these dynamics, which means that every trajectory starting in  will remain in  .It has been proven that when is symmetric and non-negative, the objective function is strictly increasing along any non-constant trajectory and its asymptotically stable points are in one-to-one correspondence to strict local solution of (6).
To find all large local maximizers   x  , we will start with different initializations that will lead to different local maxima.The procedure for finding all large local maxima is described in Algorithm 2.
Algorithm 2: Finding all large local maxima 0.5

End For Output all local maximizers
For a local maximizer X , we need to recover the corresponding common gene association sub-network between the two species.The non-zero elements in local maximizer X correspond to the genes that share the same association network between the two species.It is possible to have some genes in the shared network that are independent.We need to remove these isolated genes from the conserved association network.Algorithm 3 describes how to recover the common association network.

Algorithm 3: Recover Conserved Gene Association Network
Input: Local maximizer X and adjacency matrix M for one species End For Output conserved association network (subset of M consisting of genes in L)

Graph Benchmark Data
We used the DIMACS benchmark data set [20] to validate the effectiveness of the replicator equations to find maximal cliques.

Gene Expression Data
We applied the method to find common stress responsive gene association networks for two related bacteria Deinococcus radiodurans and Thermus thermophilus.We downloaded the gene expression data sets GSE 29516 for D. radiodurans from gene expression omnibus [21].GSE29516 consists of microarray data from transcription profiling of D. radiodurans treated with 0.3 M NaCl or 2 M salt.We downloaded the gene expression series GSE21289 for T. thermophilus.GSE21289 contains the gene expression data of T. thermophilus HB8 wild-type strain in response to high salt stress.

Ortholog Mapping
Ortholog mapping is done via multi-genome homology comparison tools available from the Comprehensive Microbial Resource web site [22].In the case of multiple genes in a cluster, we used the one with the highest score, resulting in 744 one-to-one ortholog pairs.Gene expression data and ortholog mapping table is stored in an inhouse relational database for easy retrieval and crossreferencing across the two species Gene expression data for orthologous genes are retrieved through database queries.

Benchmark Results on DIMACS Challenge Sets
We recorded running time and clique found using our replicator equations.Table 1 shows the results on some DIMACS challenge instances.In general, our implementation is able to find maximal cliques in reasonable time.And we were able to find cliques that are close to their corresponding maximum clique numbers for each benchmark data set.

Identification Common Gene Association Sub-Networks
Two common gene association sub-networked were identified using the described procedure (Figure 2).Annotations of the genes in Figure 2 are given in Table 2.Among these genes, ABC transporter ATP-binding protein has been previously reported to be osmo-regulated [23].Two of the genes are not fully annotated (hypothetical protein in Table 2), we think they are worthy  [24].We found three ribosomal proteins that are related to the stress response in both D. radiodurans and T. thermophilus.This is consistent with the finding from Schmalisch et al. [24].

Conclusions
In this study, we developed an efficient computational framework that combines graphical lasso model, graph product and replicator equation based clique solver to identify common gene association sub-network from multiple species.Our method provides an approach to identifying conserved pathway components.We applied our method and identified common gene association sub-networks for two related bacterial species D. radiodurans and T. thermophilus subjected to similar environmental stress.We confirmed some stress responsive genes with previous studies.Our method also demonstrated how these genes interact with other genes and these interactions potentially are conserved because they are discovered via simultaneous study of two related species.
Our method is not limited to finding common gene association sub-network across multiple species.It can also be adapted to identify core interaction network for the same species subjected to different environmental stresses.It can also be employed to identify common gene/protein sub-networks for related diseases such as

1 Figure 1 .
Figure 1.Clique-based approach to identification of common gene association sub-network from two species.

Figure 2 .
Figure 2. Common gene association sub-networks identified for D. radiodurans and T. thermophilus.
diabetes and hypertension.This work was supported by a grant from North Carolina Space Consortium (2010-1435-CE-02).Financial support from the National Science Foundation (NSF) through a grant to the Quality Education for Minorities Networks

Table 2 . Annotation of genes in the identified common sub- net-networks.
of further investigation because they seem to be related to osmosis stress response based on our study on two different species.A ribosomal protein has been found by Schmalisch et al. to be general stress protein in Bacillus subtilis