Searching maximum quasi-bicliques from protein-protein interaction network

Searching for maximum bicliques or bipartite subgraphs in a graph is a hard problem. We propose a new and efficient method, the Searching Quasi-Bicliques (SQB) algorithm, to detect maximum quasi-bicliques in a protein-protein interaction network. As a Divide-and-Conquer method, SQB consists of three steps: first, it divides the protein-protein interaction network into a number of Distance-2-Subgraphs; second, by combining top-down and branch-and-bound methods, it seeks quasi-bicliques in every Distance-2-Subgraph; third, it removes all redundant results. We successfully applied our method to the Saccharomyces cerevisiae dataset and obtained 2754 distinct quasi-bicliques.


INTRODUCTION
As high-throughput technologies such as Yeast Two-Hybrid [1] and Affinity Purification/Mass Spectrometry [2] have made significant progress, researchers have collected a great number of protein-protein interaction datasets. It is meaningful to extract substructures from large-scale protein interaction data. The biclique, one kind of substructure, is common in protein-protein interaction networks and often contains biologically meaningful units. For example, the biclique shown in Figure 1 indicates an "all-versus-all" predicted interaction subnetwork [3], where most of the edges, each representing a protein-protein interaction, were confirmed by biological experiments. Furthermore, the six proteins on the left side all contain the SH3 domain and the four proteins on the right side all carry SH3-binding motifs. Therefore, mining bicliques can help biologists unveil cellular function at the molecular level.
However, mining bicliques from a graph (or a protein-protein interaction network in this study) is computationally intensive, and has been proven to be NP-complete [4,5,6]. Although many researchers [7,8,9,10,11] have developed algorithms to solve the maximum biclique problem, they often focused on special characteristics of the graph, so the general problem is still intractable. Therefore, in the field of computational biology, some researchers have mined quasi-bicliques instead of exact bicliques. Li [12] used the "frequent pattern" developed by Agrawal [13] to find "all-versus-all subnetworks" (or quasi-bicliques). The "existing closed itemset mining algorithms" (proposed by Agrawal [13]) only use the size constraint on transaction sets to reduce the search space, which produces a great number of small maximum bicliques and greatly slows down processing.
Here, we propose the Searching Quasi-Bicliques (SQB) algorithm to detect maximum quasi-bicliques in a protein interaction network. By means of the Divide-and-Conquer method, SQB partitions the protein-protein interaction network into a number of Distance-2-Subgraphs, one for each vertex, each containing only two kinds of nodes: those connected directly to the vertex (we call them the direct neighbors), and those reachable from the vertex by passing through just one other node (we call them distance-2-neighbors). Next, through top-down and branch-and-bound methods, SQB tries to find the quasi-bicliques in all the Distance-2-Subgraphs. Finally, SQB merges the redundant ones in the quasi-biclique clusters. We applied our algorithm to the Saccharomyces cerevisiae dataset and obtained 2754 distinct quasi-bicliques.
Figure 1. An all-versus-all predicted interaction subnetwork.

SciRes Copyright © 2008 JBiSE
The organization of this paper is as follows. Section 2 states the maximum quasi-biclique problem. Section 3 describes the SQB algorithm for finding the maximum quasi-bicliques. Section 4 reports the results of applying the SQB algorithm to real proteomic data. The paper ends with conclusions and future work.

MAXIMUM QUASI-BICLIQUE PROBLEM
We use a simple graph, as in [14], to describe a protein-protein interaction network. A vertex represents a kind of protein and an edge means there is an interaction between two kinds of proteins. A quasi-biclique is a graph G = (V, E) in which V can be divided into two non-empty sets {V1, U1} such that every vertex in V1 is directly linked to nearly every vertex in U1. The problem of finding a maximum quasi-biclique in a graph G = (V, E) can be formalized as the following function.
where |V| denotes the cardinality of the vertex set of the input graph, and n and m should be greater than 1 and lower than |V|-2. A quasi-biclique is measured by the value of nm, which is the number of interacting edges between the two sets. In the following, we denote a quasi-biclique as QB(V1, U1).

SQB ALGORITHM
The main method of SQB is Divide-and-Conquer, which includes three parts. The first is to extract every vertex's Distance-2-Subgraph from the graph. The second is to find every vertex's quasi-bicliques in its Distance-2-Subgraph. The third is to merge the solutions: after finding every vertex's quasi-bicliques, SQB puts all the quasi-bicliques together, removes the similar ones, prunes the smaller ones, and obtains the quasi-bicliques of the whole graph. The three parts of SQB are detailed in the following.

Finding Distance-2-Subgraph
Because some graphs, especially biological protein-protein interaction networks, are very large, processing the whole graph requires a very large amount of memory and is not feasible in common applications. However, the distance between any two vertexes in a quasi-biclique is not greater than 2. So if we want to find a quasi-biclique that includes a specific vertex, we only need to consider that vertex and its related neighbors. The related neighbors are the vertexes at a distance of less than 3 from the specific vertex. The induced subgraph, which consists of the vertex and its related neighbors, is denoted as its Distance-2-Subgraph. The edge status between any two vertexes in an induced subgraph is the same as in the original graph. SQB needs to find every vertex's Distance-2-Subgraph in order to obtain its maximum quasi-biclique.
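The construction described above can be illustrated with a minimal Python sketch. It assumes the network is stored as a dictionary mapping each vertex to the set of its neighbors; the function name `distance_2_subgraph` and the toy path graph are hypothetical and are not taken from the original program.

```python
def distance_2_subgraph(adj, center):
    """Return the subgraph induced by `center` and every vertex within distance 2.

    `adj` maps each vertex to the set of its neighbors (an assumed representation).
    """
    direct = set(adj[center])            # direct neighbors (distance 1)
    two_hop = set()
    for u in direct:                     # neighbors of neighbors (distance <= 2)
        two_hop.update(adj[u])
    nodes = {center} | direct | two_hop
    # Induced subgraph: keep exactly the original edges whose endpoints both survive.
    return {v: adj[v] & nodes for v in nodes}


# Toy path graph a - b - c - d - e
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
sub = distance_2_subgraph(adj, "a")
print(sorted(sub))  # vertexes kept for center "a"
```

In this toy graph only b (distance 1) and c (distance 2) are kept alongside a, and the edge c-d disappears because d lies outside the subgraph.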

Detecting Maximum Quasi-bicliques
After finding every vertex's Distance-2-Subgraph, SQB begins to search for the quasi-bicliques. This process, detecting quasi-bicliques, is the essential part of SQB. SQB uses the size (nm) to measure a quasi-biclique, so it is crucial to know the specific values of n and m of a maximum quasi-biclique. As n and m lie in a limited range, SQB tests the values of n and m from the upper limit down to the lower one. If a graph has a quasi-biclique QB(|V|=n, |U|=m), the vertexes in the graph with degree lower than n and m should not be in the QB, so SQB removes these low-degree vertexes during the process. Furthermore, the greater the test values of n and m, the more vertexes SQB can remove, which speeds up the process.
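The degree-based pruning can be sketched as follows. This is only an illustration under stated assumptions: a single threshold k stands in for the test values of n and m, the graph is a dictionary of neighbor sets, and the removal is repeated until no low-degree vertex remains (the source does not say whether the original program iterates, nor how much the threshold is relaxed for quasi-bicliques). The name `prune_low_degree` is hypothetical.

```python
def prune_low_degree(adj, k):
    """Repeatedly drop vertexes whose degree is below k; returns a new graph."""
    adj = {v: set(ns) for v, ns in adj.items()}      # work on a copy
    stack = [v for v, ns in adj.items() if len(ns) < k]
    while stack:
        v = stack.pop()
        if v not in adj:                             # already removed
            continue
        for u in adj.pop(v):                         # detach v from its neighbors
            adj[u].discard(v)
            if len(adj[u]) < k:                      # u may now fall below k
                stack.append(u)
    return adj


# A 2x2 biclique {a, b} x {x, y} with a tail a - p - q attached.
adj = {
    "a": {"x", "y", "p"},
    "b": {"x", "y"},
    "x": {"a", "b"},
    "y": {"a", "b"},
    "p": {"a", "q"},
    "q": {"p"},
}
pruned = prune_low_degree(adj, 2)
print(sorted(pruned))
```

Here q is removed first (degree 1), which lowers the degree of p to 1 so that p is removed as well; only the biclique survives. A larger k removes more vertexes, which is why testing large values first shrinks the search space quickly.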
Before explaining our program, we introduce how to split a graph. We use a compound data structure CD to store the V1 set, the U1 set and the subgraph G induced by V1 and U1. The program splits the graph at the V1 set and the U1 set in turn. It chooses the vertex in V1 or U1 with the largest degree and labels it, so that the program does not split at the same vertex again. For example, if the program chooses v15 as the candidate vertex, it produces four sets V150, V151, U150, and U151. The first set V150 includes v15 and the vertexes in V1 that are at a distance of 2 from v15. The second set V151 consists of the elements of V1 except v15. The third set U150 contains the vertexes in U1 that are direct neighbors of v15. The fourth set U151 is the same as U1. Next, the program produces the induced graph G150, which contains the vertexes of V150 and U150, and puts V150, U150 and G150 into the data structure CD150(V150, U150, G150). In the same way, it obtains another data structure CD151(V151, U151, G151).
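The split step can be made concrete with a small Python sketch. It assumes a `CD` record holding the two vertex sets and an adjacency dictionary, and it interprets "distance of 2" as "not adjacent to the chosen vertex but sharing at least one neighbor with it"; the names `CD`, `induced` and `split_at` are hypothetical, and the choice and labeling of the highest-degree vertex are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CD:
    V: frozenset   # one side of the candidate quasi-biclique
    U: frozenset   # the other side
    G: dict        # adjacency (vertex -> set of neighbors) induced by V | U


def induced(adj, nodes):
    """Subgraph of `adj` induced by `nodes`."""
    return {v: adj[v] & nodes for v in nodes}


def split_at(cd, v):
    """Split `cd` at a vertex v of cd.V into the two branches described in the text."""
    nbrs = cd.G[v]
    # Vertexes of V at distance 2 from v: not adjacent to v, but sharing a neighbor.
    at_distance_2 = {w for w in cd.V
                     if w != v and w not in nbrs and cd.G[w] & nbrs}
    V0 = frozenset({v} | at_distance_2)   # branch that keeps v
    U0 = cd.U & nbrs                      # only the direct neighbors of v
    V1 = cd.V - {v}                       # branch that drops v
    U1 = cd.U                             # unchanged
    return (CD(V0, U0, induced(cd.G, V0 | U0)),
            CD(V1, U1, induced(cd.G, V1 | U1)))


edges = [("v1", "u1"), ("v1", "u2"), ("v2", "u1"), ("v2", "u2"), ("v3", "u3")]
G = {}
for a, b in edges:
    G.setdefault(a, set()).add(b)
    G.setdefault(b, set()).add(a)

cd = CD(frozenset({"v1", "v2", "v3"}), frozenset({"u1", "u2", "u3"}), G)
with_v1, without_v1 = split_at(cd, "v1")
print(sorted(with_v1.V), sorted(with_v1.U))
print(sorted(without_v1.V), sorted(without_v1.U))
```

The first branch corresponds to the sets written with the suffix 0 in the text (the chosen vertex is kept and the other side shrinks to its neighbors), and the second to the sets with the suffix 1 (the chosen vertex is dropped).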
The algorithm for detecting quasi-bicliques is listed at the end of this subsection. The algorithm consists of a FOR loop that runs from 20 down to 2 (20 is an experimental value which should be increased as the number of nodes in the input graph grows). At first, SQB uses the sub-function Search_k_Quasi_Bicliques(G, k) to test whether the graph contains quasi-bicliques in which |V|>20 and |U|>20. If the sub-function succeeds, the FOR loop terminates; otherwise the FOR loop decreases the test value by one and continues to test, until the sub-function finds quasi-bicliques or the test value falls below 2.
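The outer top-down loop can be sketched in a few lines of Python. The search routine itself is passed in as a callable, because its details are not reproduced here; `top_down_search` and `stub_search` are hypothetical names, and the stub only imitates a search that succeeds once the test value is small enough.

```python
def top_down_search(graph, search_k, k_max=20, k_min=2):
    """Try test values from k_max down to k_min; stop at the first success.

    `search_k(graph, k)` is assumed to return a (possibly empty) list of
    quasi-bicliques found for the test value k.
    """
    for k in range(k_max, k_min - 1, -1):
        found = search_k(graph, k)
        if found:                  # first (largest) k that succeeds
            return k, found
    return None, []


calls = []

def stub_search(graph, k):
    """Stand-in for the real search: pretends quasi-bicliques exist only for k <= 3."""
    calls.append(k)
    return ["QB"] if k <= 3 else []


best_k, result = top_down_search({}, stub_search)
print(best_k, result, len(calls))
```

Starting from the largest test value means the first success is also the largest one, so the loop can stop immediately instead of enumerating smaller candidates.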
The sub-function Search_k_Quasi_Bicliques(G, k) is the key component of SQB. At first, the vertex set of the input graph G is divided into two parts, V1 and U1. Next, V1, U1 and G are put into the compound data structure CD(V1, U1, G), and CD is put into a buffer BUFFER. Next, the program enters a WHILE loop, which terminates when the buffer is empty. During the loop, first, the program removes one element from BUFFER and puts it into CD0(V01, U01, G0), then deletes vertex in V01 with is true, then delete QB2.
The first rule means that if V1 is a subset of V2 and U1 is a subset of U2, or if V1 is a subset of U2 and U1 is a subset of V2, then QB1 is a part of QB2 and can be deleted. The second rule is the reverse of the first. If two quasi-bicliques match neither of these two rules, SQB keeps both of them.
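The two subset rules can be illustrated with a short Python sketch. It assumes each quasi-biclique is represented as a pair of vertex sets; `is_part_of` and `remove_redundant` are hypothetical names, and the final selection by size nm follows the measure defined earlier in the paper.

```python
def is_part_of(qb1, qb2):
    """True if qb1 is contained in qb2, in either orientation of the two sides."""
    v1, u1 = qb1
    v2, u2 = qb2
    return (v1 <= v2 and u1 <= u2) or (v1 <= u2 and u1 <= v2)


def remove_redundant(qbs):
    """Keep only quasi-bicliques that are not part of another one."""
    kept = []
    for qb in qbs:
        if any(is_part_of(qb, other) for other in kept):
            continue                                   # qb is redundant: drop it
        # qb is new: drop anything already kept that qb contains
        kept = [other for other in kept if not is_part_of(other, qb)]
        kept.append(qb)
    return kept


qbs = [
    ({"a", "b"}, {"x", "y"}),
    ({"a", "b", "c"}, {"x", "y"}),   # contains the first one
    ({"x"}, {"a", "c"}),             # part of the second one with sides swapped
    ({"d"}, {"z"}),                  # unrelated, so it is kept
]
distinct = remove_redundant(qbs)
largest = max(distinct, key=lambda qb: len(qb[0]) * len(qb[1]))
print(distinct)
print(largest)
```

Checking both orientations matters because the same quasi-biclique can be reported with its two sides exchanged when it is discovered from vertexes on opposite sides.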
After the pruning operation, SQB obtains the distinct quasi-bicliques of the whole graph and the biggest one is the optimum one of the whole graph.

APPLICATION OF SQB
The experiment was done on our web server, which consisted of two Pentium 2 PCs with 4.8 GHz CPU and 2 GB RAM. The Saccharomyces cerevisiae dataset Y78, derived from [15], consists of 78,390 protein-protein interactions among 5321 proteins. During our experiments, we removed vertexes with degree 1 because they could not produce a biclique. The resulting input graph of our program has 4546 nodes. At first, we produced 4546 distinct Distance-2-Subgraphs according to every vertex's neighbors and their neighbors. The largest subgraph has 3164 nodes and the average size is 746.1 nodes, so the subproblems are hard.
About eighty percent of the vertexes have a processing time of less than 20 seconds, and half of them are processed within 1 second. The average time is about 13.4 seconds. Given the large input graphs, this performance is remarkable.
During our experiments we predicted 5616 quasi-bicliques, which include empty or redundant ones. A small number of vertexes have more than one maximum quasi-biclique, so the number of quasi-bicliques is greater than 4546, the number of Distance-2-Subgraphs.