Complex tree: the basic framework of protein-protein interaction networks

In living cells, proteins are dynamically connected through biochemical reactions, so their functional features are encoded in protein-protein interaction networks (PINs). Many efforts have been devoted to exploring the basic features of PINs, but identifying a universal property of PINs remains a challenging problem. Here we employed complex network theory to analyze the protein-protein interactions from the Database of Interacting Proteins (DIP). A complex tree, the unique framework of PINs, was revealed by three topological properties of the giant component of PINs (GCOP): right-skewed degree distributions, relatively small clustering coefficients and short characteristic path lengths. Furthermore, we proposed a nonlinear growth model, the complex tree model, to reflect the tree framework. Simulation results showed that GCOPs were well represented by this model, which could be helpful for understanding the tree structure that forms the basic framework of PINs. Source code and binaries are freely available for download at http://cic.scu.edu.cn/bioinformatics/STM/STM_code.rar.


INTRODUCTION
Protein-protein interactions (PPIs) are crucial to most biochemical processes in living organisms. Identifying protein-protein interactions has been a focus of post-proteomic studies. Various experimental techniques have been developed for large-scale PPI analysis, including yeast two-hybrid systems [1], mass spectrometry [2,3] and protein chips [4]. Consequently, a large quantity of protein interaction data has been produced for several organisms, such as yeast and fruit fly [2,3].
In the past decade, with the explosion of available high-throughput biological data, the analysis of biological networks has attracted significant interest in the academic community. Since Jeong et al. [5] reported the "right-skewed" degree distribution of metabolic networks, much research has sought to understand biological systems in terms of networks. The system's elements (such as proteins, DNA, RNA and small molecules) are represented as vertices, and their interactions (biochemical interactions between these components) are reduced to links connecting pairs of vertices. The cell's behavior is then attributed to a "network of networks" [6].
In a living cell, groups of proteins participate in diverse biochemical interactions that alter the effect of a protein or form protein complexes. Together, these processes constitute protein-protein interaction networks (PINs) [7]. Studying the topological features of PINs helps to understand the cell's higher-level organization. Notable structural properties of PINs have recently been reported [8][9][10][11][12], including right-skewed degree distributions [13], short path lengths [7] and hierarchical structure [14]. Meanwhile, many network models have been proposed to characterize PINs and to explore their basic universal properties [15,16]. However, finding a universal framework of PINs that holds for each species remains a challenging problem.
In this work, we employed complex network theory to analyze datasets from the PPI database DIP covering eight species: D. melanogaster (Dmela), S. cerevisiae (Scere), E. coli (Ecoli), C. elegans (Celeg), H. sapiens (Hsapi), H. pylori (Hpylo), M. musculus (Mmusc) and R. norvegicus (Rnorv). Based on the results of this analysis, we constructed a complex tree model (CTM) to mimic the giant component of PINs (GCOP) of each species. Simulations of the CTM suggest that PINs can be well represented by this model, and that the CTM is helpful for understanding the basic framework of PINs.

Data Source: Database of Interacting Proteins (DIP)
In this paper, the DIP database, which collects experimentally determined protein-protein interaction data, was used as the input data. We collected the binary protein-protein interaction data from DIP, version DIP_20081014 (http://dip.doe-mbi.ucla.edu). To obtain the integrated topology, we used the "full set" subset, which contains PPI data identified by experiment. Although false positives in the "full set" might lead to false edges in PINs and to inaccurate topological features, we consider our analysis results robust because a few local inaccuracies have little influence on the global properties of PINs. Table 1 summarizes the numbers of proteins and interactions for the aforementioned species.

Topological Features of PINs
In this study, we constructed the adjacency matrix of each PIN from the corresponding table of PPI pairs; each element of the adjacency matrix is 1 if the two proteins interact directly and 0 otherwise.
After removing self-loops and multiple edges, each network was reduced to a simple undirected graph. All of the following topological properties were calculated using the "igraph" R package, which is widely used in network analysis [30].
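The preprocessing step described above can be illustrated with a minimal plain-Python sketch. It is not the igraph workflow used in this study; it stores the graph as adjacency sets rather than a 0/1 matrix, and the function names and the protein labels in the usage note are hypothetical.

```python
def build_simple_graph(pairs):
    """Build an undirected simple graph from (protein_a, protein_b) pairs.

    Returns a dict mapping each protein to the set of its neighbours.
    Self-loops are dropped, and repeated or reversed pairs collapse
    into a single edge because neighbours are stored in sets.
    """
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a == b:
            continue  # self-loop: the protein stays in the graph, the loop is dropped
        adj[a].add(b)
        adj[b].add(a)
    return adj


def edge_count(adj):
    """Number of undirected edges (each edge is stored twice in adj)."""
    return sum(len(neighbours) for neighbours in adj.values()) // 2
```

For example, the pair list `[("A", "B"), ("B", "A"), ("A", "A"), ("B", "C")]` reduces to a graph with three proteins and two edges.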

Component Size and Giant Component
Because the component sizes of a PIN may reflect its fundamental properties [17], we calculated the component sizes for each species and extracted the giant component of the PIN (GCOP) from the entire network. The component sizes and giant components are listed in Table 1. For Rnorv and Mmusc, the GCOP is too small for large-scale analysis, so the rest of our work focuses on the other six species. Three main topological parameters of the GCOP were measured for these species; the measurements are described in the following sections.
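As an illustration of the component analysis, the following sketch extracts connected components by breadth-first search and returns the largest one. It assumes the graph is given as a non-empty dict of adjacency sets; the function names are hypothetical and do not correspond to the igraph calls used in this study.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components as a list of node sets, largest first."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def giant_component(adj):
    """Return the subgraph (dict of adjacency sets) induced by the largest component."""
    giant = connected_components(adj)[0]
    return {u: adj[u] & giant for u in giant}
```

The fraction of proteins in the giant component is then `len(giant_component(adj)) / len(adj)`.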

Node Degree Distribution
One of the most basic parameters of a network is its node degree distribution. In the past century, the random graph was the most important model for real systems. The degree distribution of a random graph is bell-shaped, with node degrees clustered around the mean value. At the beginning of this century, however, Jeong et al. [13] found that the degree distribution of the yeast PIN is far from the bell-shaped distribution of a random graph. They deduced that the degree distribution approximately follows a power law and called it "scale-free". This scale-free phenomenon has prompted many scientists to explore real networks. Much subsequent work in systems biology has shown that many biological networks have a "right-skewed" degree distribution [12].
In this paper, we define the degree k of a protein as the number of its interactions with other proteins, and P(k) as the frequency distribution of the degree. We collected the degree sequence of each GCOP and plotted the logarithm of the cumulative degree distribution (CDD), P_cum(k), against the logarithm of k for each species. The CDD curves are plotted in Figure 1. Each GCOP clearly has a "right-skewed" degree distribution: most proteins have very few interactions, while a few proteins (hubs) have many, and the degree distribution has no well-defined peak but approximates a sloped line on a double-logarithmic plot. Because a power-law distribution is a reasonable hypothesis for right-skewed distributions, we estimated the power-law parameters and tested the power-law hypothesis for these distributions using the techniques proposed by Clauset et al. [18], which are based on maximum likelihood methods and the Kolmogorov-Smirnov statistic. Applying this method to the degree sequence of the GCOP of each species yields the best-fit power-law model for the degree distribution and also tests whether a power law is a reasonable hypothesis for that distribution. The estimation results are shown in Table 2.
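A minimal sketch of the two quantities involved is given below. It assumes that P_cum(k) is the fraction of proteins with degree at least k, and it uses the approximate discrete maximum-likelihood estimator of the scaling parameter from [18] for a given lower bound. The selection of the lower bound by minimizing the Kolmogorov-Smirnov statistic and the bootstrap p-value of the full procedure are omitted, and the function names are hypothetical.

```python
import math
from collections import Counter


def cumulative_degree_distribution(degrees):
    """Return sorted (k, P_cum(k)) pairs, where P_cum(k) is the
    fraction of nodes whose degree is greater than or equal to k."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n  # number of nodes with degree >= current k
    points = []
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points


def estimate_scaling_parameter(degrees, k_min):
    """Approximate discrete MLE of the power-law exponent for degrees >= k_min:
    alpha ~= 1 + n_tail / sum(ln(k / (k_min - 0.5)))."""
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum
```

Plotting the logarithm of P_cum(k) against the logarithm of k from the first function gives curves of the kind shown in Figure 1.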

Clustering Coefficient and Characteristic Path Length
Two other basic topological properties of a network are the average clustering coefficient and the characteristic path length. Since PINs are undirected networks, we define the characteristic path length L (also known as the average path length) between protein pairs in a PIN as

$$L = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij},$$

where N is the number of vertices and d_ij is the number of links along the shortest path from vertex i to vertex j. We computed L using the Floyd algorithm. Clustering, sometimes also called transitivity, refers to the presence of a heightened number of triangles (groups of three vertices, each connected to the other two) in the network [19]. In many systems, if constituent A is connected to constituent B and constituent B to constituent C, then there is a heightened probability that constituent A is also connected to constituent C. We computed the clustering coefficient of each PIN following the definition of Watts and Strogatz [20], in which the local clustering coefficient of vertex i is

$$C_i = \frac{2E_i}{k_i(k_i - 1)},$$

where k_i is the degree of vertex i and E_i is the number of links among its k_i neighbours. The global clustering coefficient of the whole network is then the average

$$C = \frac{1}{N} \sum_{i=1}^{N} C_i.$$

The results are also shown in Table 2.
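The two measures can be sketched in plain Python as follows. The sketch assumes a connected graph with at least two vertices, stored as a dict of adjacency sets, and it assigns a local clustering coefficient of 0 to vertices of degree below 2, which is an assumption rather than a convention stated in this paper. The Floyd algorithm runs in cubic time, so the sketch is only practical for small graphs; the function names are hypothetical.

```python
def characteristic_path_length(adj):
    """Average shortest-path length over all ordered vertex pairs,
    computed with the Floyd (Floyd-Warshall) algorithm."""
    nodes = list(adj)
    index = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u in nodes:
        for v in adj[u]:
            dist[index[u]][index[v]] = 1  # every link has length 1
    for k in range(n):
        row_k = dist[k]
        for i in range(n):
            d_ik = dist[i][k]
            if d_ik == inf:
                continue
            row_i = dist[i]
            for j in range(n):
                if d_ik + row_k[j] < row_i[j]:
                    row_i[j] = d_ik + row_k[j]
    total = sum(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients C_i = 2 E_i / (k_i (k_i - 1))."""
    total = 0.0
    for u, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            continue  # assumption: C_i = 0 when the degree is below 2
        # each link among the neighbours is seen from both of its ends
        links = sum(1 for v in neighbours for w in adj[v] if w in neighbours) // 2
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)
```

For a triangle A, B, C with a fourth vertex D attached to C, the sketch gives L = 4/3 and C = 7/12.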

Complex Tree Model
In the past few years, notable discoveries about complex networks have redefined our understanding of complex systems. Meanwhile, a number of models have been constructed to mimic real networks [20][21][22][23][24]. The BA model is the best-known stochastic model; it generates a scale-free topology by combining linear growth of the network size with a preferential attachment rule [25]. However, the randomness of this model makes it hard to gain a visual understanding of scale-free topology. A deterministic model [22], whose distinguishing property is its hierarchical construction, was proposed by Barabasi to solve this problem. Later, Ravasz and Barabasi uncovered hierarchical structure in metabolic networks [26] and proposed a deterministic hierarchical network model to explain the modularity and hierarchical organization of real networks [27]. Many network models have since been proposed to explain the organization of biochemical systems, but constructing a universal model for PINs remains a challenging problem.
Inspired by these studies, we set out to build a model that reflects the unique framework of PINs, the complex tree. We constructed the model in a tree-like fashion to mimic the GCOP. Because GCOPs contain many varieties of tree structure, it is difficult to build a deterministic model for them. We therefore made a simplification and built a heuristic model by adding shortcuts to a tree substrate, which we call the "complex tree model". The details of the model construction are as follows.

Substrate
In building the small-world model, Watts tested random rewiring on several substrates, including trees, ring lattices and many other structures. The one-dimensional ring lattice was finally chosen because of its equivalence [20]. More recently, Dong-Hee Kim et al. found that many real systems have their own communication kernels, and that the communication kernel of a scale-free network is a scale-free tree, called the skeleton of the complex network [28]. They deduced that scale-free networks can be decomposed into (scale-free) trees and shortcuts. In this paper we therefore built a simple model by adding shortcuts to a tree structure, with a tree graph as the substrate of the CTM. Here we considered the case of a perfect binary tree, whose inherent hierarchical structure and connectivity make it a suitable substrate. By altering the number of levels n, the size of the substrate N(n) grows nonlinearly as

$$N(n) = 2^{n} - 1,$$

so that, for example, n = 10 gives N = 1023.

Shortcut Adding
In the construction of the NW small-world model, Newman and Watts added a few shortcuts between different parts of the ring lattice. As a result, the local structure does not change, but the distance between two remote parts is reduced dramatically by the long-range connections [29]. In our model, we added shortcuts between nodes in different hierarchical levels according to a simple rule: an arbitrary node in level n is linked to nodes in the levels below it (excluding its two child nodes, to which it is already connected) with a probability that is set by P0 and decreases with n. Here P0 is the shortcut-adding probability from the main root (level 0) to all nodes below it except the two child nodes to which it is already linked. Under this rule, nodes in higher levels receive more shortcuts than those in lower levels, and the adding probability drops exponentially with n. We tuned the number of shortcuts by changing P0 from 0 to 1. When P0 = 0, there are no shortcuts or loops, and the model is simply a perfect binary tree with a large characteristic path length. When P0 = 1, the root is connected to every other node and the maximum number of shortcuts is added to the substrate. In the region 0 < P0 < 1, the model shows intermediate behavior between these two extremes. The details of the model construction are shown in Figure 2, which depicts both the substrate and the shortcut-adding process. The model simulation is described in the next section.
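The construction can be sketched as follows, under two loudly stated assumptions. First, the text specifies only that the shortcut probability drops exponentially with the level n, so the sketch assumes P(n) = P0 · 2^(-n) as a placeholder; the decay function is a parameter and can be replaced. Second, the sketch assumes that a node may receive shortcuts to any node in a deeper level, not only to its own descendants. The function name and signature are hypothetical and are not taken from the released source code.

```python
import random


def complex_tree(n_levels, p0, decay=lambda level: 2.0 ** (-level), seed=None):
    """Sketch of a complex tree: a perfect binary tree plus random shortcuts.

    Nodes are numbered 0 .. 2**n_levels - 2 in heap order, so node i has
    children 2i+1 and 2i+2 and the root (node 0) is in level 0.
    ASSUMPTION: a node in level `level` is linked to each node in a deeper
    level (other than its own children) with probability p0 * decay(level).
    """
    rng = random.Random(seed)
    n_nodes = 2 ** n_levels - 1
    adj = {i: set() for i in range(n_nodes)}

    # Substrate: perfect binary tree.
    for i in range(n_nodes):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n_nodes:
                adj[i].add(child)
                adj[child].add(i)

    # Shortcuts from each node to nodes in deeper levels.
    for i in range(n_nodes):
        level = (i + 1).bit_length() - 1  # floor(log2(i + 1))
        p = p0 * decay(level)
        if p <= 0.0:
            continue
        first_deeper = 2 ** (level + 1) - 1  # index of the first node in the next level
        for j in range(first_deeper, n_nodes):
            if j in adj[i]:
                continue  # already linked (a child of i)
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj
```

With `p0 = 0` the sketch returns the bare binary tree with N - 1 links, and with `p0 = 1` the root is linked to every other node, matching the two extremes described above.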

Simulations
Here we concentrated on the substrate with n = 10 (N = 1023) and added shortcuts to it. As P0 is varied from 0 to 1, the topological features change dramatically. We calculated three main topological parameters (CDD, C and L) of the CTM to describe this stochastic process. Furthermore, to represent the topological features of the GCOPs, we considered CTMs with n = 9 (N = 511), n = 10 (N = 1023), n = 11 (N = 2047), n = 12 (N = 4095) and n = 13 (N = 8191), which give the CTM a size similar to that of the GCOP for five species. By tuning P0, we determined the value of P0 at which each size of CTM best fits the PINs, and compared the topological features of the GCOP with those of the fitted CTM. The comparison is given in Table 3.

RESULTS
Several basic statistics of the PINs are summarized in Table 1. The GCOPs of Dmela, Scere, Ecoli, Celeg, Hsapi and Hpylo are remarkably large, containing most of the proteins and interactions of the PINs. In particular, in the PINs of Celeg, Dmela, Hpylo and Scere, the giant component contains more than 90% of the proteins of the whole PIN. For Rnorv and Mmusc, the GCOP is smaller and not suitable for large-scale analysis.
Table 2 shows the average node degree <k>, the maximum node degree k_max, the characteristic path length L, the clustering coefficient C and the power-law estimation results for the CDD (scaling parameter, lower bound and p-value) of every GCOP. The average degree <k> ranges from 1.60 to 4.06, which means that many proteins have only a few interactions. Some proteins, however, have many interactions: the number can be as high as 283 (protein JSN1 in S. cerevisiae, dip: 1281N). The k_max values are much larger than <k>, which strongly indicates that PINs cannot be well described by a random graph. We then fitted a power law to the CDD of each GCOP. The scaling parameter lies between 1.65 and 3.5 for the six species, and the lower bound d_min ranges from 1 to 28. Furthermore, the power-law hypothesis was evaluated with a goodness-of-fit test [18] between the cumulative degree distribution and the best-fit power-law model. The resulting p-values are listed in the p-value column. We made the relatively conservative choice of ruling out the power law if the p-value is ≤ 0.1; when the p-value is greater than 0.1, the power law is a plausible model for the CDD. The results show that the power law is not a suitable model for Ecoli, Hpylo and Scere but is a reasonable one for Celeg, Hsapi and Dmela. The characteristic path length L of the PINs lies between 3.81 and 6.53.
The characteristic path length L of the GCOP for the six species is around 4, which means that any two proteins can interact indirectly through a relatively short chain of successive biochemical reactions. The clustering coefficient C of the GCOP is relatively small: the highest value, 0.108, occurs in the GCOP of Ecoli, and for the other species C is around 0.03. This is important evidence that the GCOP is a multi-scale network with a tree structure.
The simulation results of the CTM comprise the CDD, C and L for the family of CTMs. By altering the number of levels n, the size of the substrate N(n) increases nonlinearly. Furthermore, as P0 increases, the adding rule gives nodes in higher levels more shortcuts, while nodes in lower levels have less chance of connecting to other nodes. In particular, nodes in the lowest level can only be linked from upper nodes. The shift of the CDD curve for the family of CTMs is depicted in Figure 3. Just a few shortcuts between different hierarchical levels lead to a rapid decrease of L. Meanwhile, C rises as more loops are formed in the shortcut-adding process. The characteristic path length L(p) and clustering coefficient C(p) of the CTM family are shown in Figure 4. Table 3 shows that the two main topological parameters of the CTM closely approximate those of the GCOP, so the GCOP can be well represented by the CTM.

DISCUSSIONS AND CONCLUSIONS
In this paper, we employed complex network theory to analyze the PPI database DIP (Oct. 2008). Eight species (Ecoli, Hpylo, Celeg, Dmela, Hsapi, Mmusc, Rnorv and Scere) were considered, but because the GCOPs of Mmusc and Rnorv are small, we focused the large-scale analysis on the GCOPs of the other six species. Three global parameters (node degree distribution, characteristic path length L and clustering coefficient C) were used to characterize the GCOP. The plots of log P_cum(k) against log k indicate that the CDDs of PINs are a group of right-skewed curves. We also tested the power-law behavior of the CDD curves. In some settings a power law may be no more interesting than any other heavy-tailed distribution. In our work, however, the goal is to infer plausible mechanisms that might underlie the formation and evolution of PINs, so it may matter greatly whether the CDD of PINs follows a power law. We employed a recent testing technique to evaluate the power-law hypothesis for the CDD. The estimation results show that a power law is a plausible model for the CDD of the GCOP of Celeg, Dmela and Hsapi, but is ruled out for the other three species, Ecoli, Hpylo and Scere. Furthermore, only a few proteins have degrees larger than the lower bound d_min in the GCOPs of Celeg, Dmela and Hsapi (Celeg 227/2386, Dmela 287/7351, Hsapi 132/805), so the best-fitted part of the CDD is very short. In conclusion, the power law might not be a suitable model for the CDD of the GCOP, and previous scale-free models are not suitable for PINs.
The calculated characteristic path length L and clustering coefficient C indicate that the GCOPs are multi-scale networks without many loops. Long-range interactions between different local parts bring L close to the small-world limit given by random networks; on the other hand, these long-range interactions have no significant effect on local structure, so the clustering coefficient remains relatively small. Together, these observations suggest that a tree structure is the basic framework of the GCOP. To represent this framework, we proposed a nonlinear growth model, the CTM. In our model, the network size changes nonlinearly with n (the number of tree levels), and the number of shortcuts is altered by changing P0 (the shortcut-adding probability from the main root to all nodes below level 0 except the two child nodes to which it is already linked). After extensive computer simulations of the CTM, we found the values of P0 and n that approximate each GCOP. From the comparison between the GCOPs and the corresponding CTMs in Table 3, it is clear that the giant component can be well represented by the CTM.
Our study offers only a starting point for understanding the simple nature of PINs, the complex tree. The proposed CTM offers a new perspective for exploring PINs. In this model, each upper node of the substrate has m = 2 children. In further modeling, this could be generalized by changing the parameter m, for example to 3 or 4, or by connecting sibling nodes in the same level. In summary, our research might be helpful for understanding the basic framework of PINs, the complex tree, and the CTM could be a powerful tool for correlating the topological and functional properties of PINs.

Figure 4. Characteristic path length L and clustering coefficient C for the family of CTMs with n = 9. L0 and C0 are the characteristic path length and clustering coefficient of the CTM with P0 = 1.

Table 1. Basic statistics of PINs and corresponding GCOP.

Table 2. Average node degree <k>, maximum connectivity k_max, clustering coefficient, characteristic path length and power-law testing results of the CDD for the six species.

Table 3. Comparisons between PINs and CTM.