1. Introduction
Breast cancer is the most common cancer in women worldwide, and it can be inherited [1]. This inheritance is driven by various common variants that confer lifetime risk [2]. The possible curative approach for this disease is tumor surgery, while chemotherapy still poses a high risk of initiating metastasis [3] [4]. Further, depending on the hormonal dependency of the breast carcinoma, there are a few chemoprevention strategies, such as the use of selective estrogen receptor modulators (SERMs), anti-estrogen drugs and micronutrients, which have been tested for anticancer activity [2]. There have been a few developments in curative approaches to breast cancer, owing to systems approaches such as network theory, developments in omics, the availability of gene expression data and integrative techniques for mapping genes of specific functions [5]. In this type of cancer, mutation plays an important role in compelling key gene(s) to cause defective translation of protein(s) that regulate normal cell functioning [6]. The process involves complicated interactions among a few thousand genes in various complex biological processes spanning a large number of molecular functions [7], and the molecular functional organization of this network is very difficult to understand [8]. This complex network involves the organization of functional molecules and modules at various system levels, associated with the principle of disease progression [9]. The organization of diverse modules in this type of network could be the potential source of various domains of activity [10]. Structural and functional properties of complex biological systems have been studied within the formalism of network theory [11].
It has been reported that most existing networks in nature fall into one of the following types, namely scale-free, small-world, random and hierarchical, or their combinations [12] [13]. The hierarchical network is of special interest because of its important topological properties (distribution of diverse modules/communities and sparsely distributed hubs) [12] [13] and its systems-level working mechanisms [12]. The emerging modules in this network type deserve particular attention because they may correspond to independent functions obeying their own laws, and their complicated organization [12] exhibits nonlinear activities and emergent behavior [11]. The sparsely distributed hubs generally regulate the system along with the modules to maintain network stability, or help it adapt to a new fit change [14]. The present study focuses on the possibility of finding important inferred genes in a breast cancer network constructed from the available standard cancer databases using a network theoretical approach. The newly predicted genes could be candidates for rigorous experimental investigation as important target genes of this disease.
2. Methods
Integration of breast cancer data: We incorporated six standard cancer databases, namely KEGG (Kyoto Encyclopedia of Genes and Genomes), CGC (Cancer Gene Census), BCGD (Breast Cancer Gene Database), CGAP (Cancer Genome Anatomy Project), GAD (Genetic Association Database) and NCG (Network of Cancer Genes), to obtain a comprehensive list of breast cancer genes. We extracted 2050 genes from these databases, of which 1332 were found to be unique (Figure 1). The work flow started with mining the lists of genes associated with breast cancer from all six repositories. These lists were processed with locally developed CGI-Perl codes to remove duplication, covering both redundant names and synonymous gene names (multiple names for the same gene). The removal method involves pattern matching and global searching in the GeneCards (http://www.genecards.com/) database. Following this method, we arrived at 1332 unique genes. The data were further curated using Agilent Literature Search, a plugin of Cytoscape. From the whole process we finally obtained a list of 70 genes out of the 1332 unique genes. The details of these genes were then extracted by mapping them to UniProt (January 2016).
Figure 1. Schematic diagram of work flow of the methodology implemented in this work.
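The de-duplication step can be illustrated with a minimal Python sketch. It is not the authors' CGI-Perl code: the function name `unique_genes`, the alias table and the toy gene lists are assumptions made for illustration, and in practice the alias table would come from a synonym lookup such as the GeneCards search described above.

```python
def unique_genes(gene_lists, synonyms):
    """Merge gene lists, map aliases to one canonical symbol, drop duplicates.

    gene_lists: dict mapping database name -> list of gene names as mined.
    synonyms:   dict mapping alias (upper case) -> canonical symbol
                (hypothetical table, assumed to be built from a synonym lookup).
    Returns a dict mapping canonical symbol -> set of source databases.
    """
    merged = {}
    for source, names in gene_lists.items():
        for name in names:
            key = name.strip().upper()          # normalise spelling and case
            canonical = synonyms.get(key, key)  # resolve synonymic names
            merged.setdefault(canonical, set()).add(source)
    return merged

# Toy input for illustration only; these are not the lists used in the study.
lists = {
    "KEGG": ["BRCA1", "TP53", "erbb2"],
    "CGC": ["HER2", "BRCA1", "P53"],
}
aliases = {"HER2": "ERBB2", "P53": "TP53"}

merged = unique_genes(lists, aliases)
total_mined = sum(len(names) for names in lists.values())
print(total_mined, "mined names,", len(merged), "unique genes")
```

Keeping the set of source databases per gene makes it easy to see which entries were merged and from where.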
Construction of primary network: The breast cancer network was constructed following the simple rule of the one gene, one protein concept. The network was built using the APID2NET plug-in implemented in Cytoscape version 2.8.3, which was used to retrieve all available information from seven main resources, namely DIP (Database of Interacting Proteins), BIND (Biomolecular Interaction Network Database), IntAct, MINT (Molecular Interactions Database), UniProt, BioGRID (The General Repository of Interaction Datasets) and HPRD (Human Protein Reference Database) [15]. The integrative and analytical effort behind APID provides an efficient open-access repository in which all curated and experimentally verified protein-protein interactions (PPIs) are amalgamated into a single web application. Combining all this information, we obtained a network of 1732 nodes harboring 55,444 interactions, from which we selected only the first neighbors of the 70 selected genes (discarding self-loops and isolated nodes), ending up with a network of 1476 nodes and 22,314 connections between them.
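The first-neighbor selection can be sketched as follows. This is an illustrative Python sketch, not the APID2NET procedure: it assumes the network is the subgraph induced on the seed genes and their first neighbors, and the function name `first_neighbor_network` and the toy edge list are hypothetical.

```python
def first_neighbor_network(edges, seeds):
    """Keep seed nodes and their first neighbors; drop self-loops and isolated nodes.

    edges: iterable of (u, v) undirected interactions.
    seeds: set of seed identifiers.
    Returns an adjacency dict {node: set(neighbors)}.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue                      # discard self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    keep = set()
    for seed in seeds:
        if seed in adj:                   # seeds without interactions are skipped
            keep.add(seed)
            keep |= adj[seed]             # first neighbors of the seed

    # Induced subgraph on the kept nodes (assumption), without isolated nodes.
    sub = {node: adj[node] & keep for node in keep}
    return {node: nbrs for node, nbrs in sub.items() if nbrs}

# Toy interaction list for illustration only.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "A"), ("B", "D")]
net = first_neighbor_network(toy_edges, {"B"})
n_edges = sum(len(nbrs) for nbrs in net.values()) // 2
print(sorted(net), n_edges)
```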
Characterization of network compactness: LCP-DP approach: The LCP-decomposition-plot (LCP-DP) is a two-dimensional representation of the common neighbors index ($CN$) of interacting nodes and the local community links ($LCL$), used to characterize the topological properties of a network. It provides information on the number, size and compactness of communities in a network [16]. The $CN$ index between two nodes $x$ and $y$ can be calculated from the overlap between their sets of first-node-neighbors $N(x)$ and $N(y)$, given by $CN = |N(x) \cap N(y)|$. These two nodes are likely to interact if there is a significant amount of overlap between the sets $N(x)$ and $N(y)$ (large value of $CN$). An increase in $CN$ reflects an increase in compactness in the network, indicating faster information processing in the network. The $LCL$ between two nodes $x$ and $y$, whose upper bound is defined by $\max(LCL) = \frac{1}{2} CN (CN - 1)$, is the number of internal links in the local-community ($LC$). These two nodes most probably link together if the $CN$ of these two nodes are members of $LC$ [16].
The LCP correlation (LCP-corr) is the Pearson correlation co-efficient of $CN$ and $LCL$ defined by
$$\text{LCP-corr} = \frac{\mathrm{cov}(CN, LCL)}{\sigma_{CN}\,\sigma_{LCL}}$$
with $CN > 1$, where $\mathrm{cov}(CN, LCL)$ is the covariance between $CN$ and $LCL$, and $\sigma_{CN}$ and $\sigma_{LCL}$ are the standard deviations of $CN$ and $LCL$ respectively.
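A minimal Python sketch of the LCP correlation described in this subsection is given here. It assumes that the correlation is taken over linked node pairs with more than one common neighbor, and that the network is stored as an adjacency dict with comparable (for example string) node labels; the helper names and the toy graph are illustrative assumptions, not the software used in the study.

```python
from itertools import combinations
from math import sqrt

def cn_lcl(adj, x, y):
    """Return (CN, LCL) for nodes x and y in an adjacency dict {node: set(neighbors)}."""
    common = adj[x] & adj[y]              # common first neighbors of x and y
    lcl = sum(1 for u, v in combinations(common, 2) if v in adj[u])
    return len(common), lcl

def lcp_corr(adj):
    """Pearson correlation of CN and LCL over linked pairs with CN > 1 (assumption)."""
    pairs = []
    for x in adj:
        for y in adj[x]:
            if x < y:                     # visit each undirected link once
                cn, lcl = cn_lcl(adj, x, y)
                if cn > 1:
                    pairs.append((cn, lcl))
    if len(pairs) < 2:
        return None
    n = len(pairs)
    mean_cn = sum(c for c, _ in pairs) / n
    mean_lcl = sum(l for _, l in pairs) / n
    cov = sum((c - mean_cn) * (l - mean_lcl) for c, l in pairs)
    var_cn = sum((c - mean_cn) ** 2 for c, _ in pairs)
    var_lcl = sum((l - mean_lcl) ** 2 for _, l in pairs)
    if var_cn == 0 or var_lcl == 0:
        return None                       # correlation undefined without variation
    return cov / sqrt(var_cn * var_lcl)

# Toy graph: a 4-clique {a, b, c, d} plus a node e linked to a, b and c.
toy = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "c"},
}
cn_ab, lcl_ab = cn_lcl(toy, "a", "b")
r = lcp_corr(toy)
print(cn_ab, lcl_ab, r)
```

In this toy graph every link has either two or three common neighbors, and LCL never exceeds the bound CN(CN - 1)/2.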
Constant Potts model: energy distribution in a network. The state of a persisting system can be estimated by calculating the difference in the Hamiltonian energy (HE) between two ensemble states of the system. The HE-based calculation was done for a network or module by considering the modules influenced by hubs: we identified the modules in which a particular hub is present at each level, and calculated the HE of the system comprising these modules according to the formalism of the Constant Potts Model [17] [18], given by Equation (1), which considers the contributions of nodes (N) and edges (E) in a competitive manner. HE acts as a window onto the variation in the network components.
$$H = -\sum_{c} \left( e_c - \gamma n_c^2 \right) \qquad (1)$$
where $e_c$ and $n_c$ are the number of edges and nodes in a community $c$ (“C”), and $\gamma$ is the resolution parameter acting as an edge density threshold. In general, $\gamma$
should be
.
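As a hedged illustration of the Constant Potts Model energy, the sketch below evaluates H = -Σ_c (e_c - γ n_c²) for a given partition. The function name, the toy graph and the value γ = 0.25 are assumptions chosen for illustration; they are not the settings used in this study.

```python
def cpm_hamiltonian(adj, communities, gamma):
    """Constant Potts Model energy: H = -sum_c (e_c - gamma * n_c**2).

    adj:         adjacency dict {node: set(neighbors)} of an undirected graph.
    communities: iterable of node collections (one per community).
    gamma:       resolution parameter (edge density threshold).
    """
    energy = 0.0
    for members in communities:
        members = set(members)
        n_c = len(members)
        # Each internal edge is seen from both endpoints, hence the division by 2.
        e_c = sum(len(adj[u] & members) for u in members) // 2
        energy -= e_c - gamma * n_c ** 2
    return energy

# Toy graph: a triangle {a, b, c} with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
h = cpm_hamiltonian(toy, [{"a", "b", "c"}, {"d"}], gamma=0.25)  # gamma is hypothetical
print(h)
```

A denser community lowers the energy through its edge term, while the γ n_c² term penalises size, which is the competition between edges and nodes mentioned in the text.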
Centrality based link prediction: Centrality measurements characterize the most influential candidates in a network, those capable of fast information propagation and reception and sensitive to local and global perturbations, so they can be used to identify important fundamental regulators. For each of the centralities Degree, Betweenness, Closeness and Eigenvector, we computed the centrality score (using CytoNCA) of each node in the breast cancer network [19]. For each individual metric, we ranked the nodes in descending order of score and computed the percentage of known breast cancer-associated genes. Among the top 20 ranked genes, the percentages of known breast cancer-associated genes were 85% (Closeness), 75% (Betweenness), 55% (Degree) and 40% (Eigenvector). Of these four centralities, betweenness and closeness performed best in the present study, as they captured the highest percentages of genes associated with breast cancer.
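The ranking step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the CytoNCA workflow: only degree and closeness are implemented (betweenness and eigenvector are omitted for brevity), closeness is taken as the number of reachable nodes divided by the sum of shortest-path distances, and the toy graph and the set of known genes are hypothetical.

```python
from collections import deque

def degree_centrality(adj):
    """Degree of each node in an adjacency dict {node: set(neighbors)}."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def closeness_centrality(adj):
    """Closeness as (reachable nodes) / (sum of shortest-path distances)."""
    scores = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[src] = (len(dist) - 1) / total if total else 0.0
    return scores

def percent_known(scores, known, top_n):
    """Rank nodes by descending score; return the top list and % of known genes in it."""
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    top = ranked[:top_n]
    if not top:
        return top, 0.0
    hits = sum(1 for node in top if node in known)
    return top, 100.0 * hits / len(top)

# Toy network: hub h linked to a, b, c; c also linked to d. "known" is hypothetical.
toy = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h", "d"}, "d": {"c"}}
known = {"h", "a"}
top_deg, pct_deg = percent_known(degree_centrality(toy), known, top_n=2)
top_clo, pct_clo = percent_known(closeness_centrality(toy), known, top_n=2)
print(top_deg, pct_deg, top_clo, pct_clo)
```

Ties are broken alphabetically here so that the ranking is reproducible; any other deterministic tie-break would serve equally well.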
3. Results and Discussion
The complex breast cancer network constructed from seventy experimentally verified genes shows hierarchical characteristics [12] in its topological parameters (Figure 2), and scale-free behavior because of the power-law nature of these parameters [20] [21]. The calculated distributions of the degree probability $P(k)$, clustering co-efficient $C(k)$ and connectivity $C_N(k)$ exhibit power-law behavior with respect to degree k (fitted lines on the data distributions in Figure 2). The fitted lines were confirmed and verified by following the standard statistical fitting procedure of Clauset et al. [22], in which we considered 2500 random samplings of each data set and found the p-value in each case to be larger than 0.1, the threshold value of that procedure. Hence, we found that,
,
and
, and the power exponents are found to be,
, where T is the transpose of the vector. If
, where a is a constant scale factor with
as the
fractal dimension of the
component of F. Hence, the network properties indicate that the breast cancer network is a hierarchical scale-free fractal network [12] [21] [23] [24] [25] [26]. The negative value in β of the connectivity parameter shows the non-assortative nature of the network, so rich-club formation among the leading hubs is unlikely [24].
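As a rough illustration of how a power-law exponent can be read off such distributions, the sketch below fits a straight line to the data in log-log space. This is a deliberate simplification and not the Clauset et al. procedure used in this work, which relies on maximum-likelihood fitting and a resampling-based goodness-of-fit test; the function name and the synthetic data (exponent 2) are assumptions for illustration only.

```python
from math import log

def loglog_slope(ks, values):
    """Least-squares slope of log(value) against log(k).

    For data following value ~ k**(-exponent), the slope estimates -exponent.
    """
    pts = [(log(k), log(v)) for k, v in zip(ks, values) if k > 0 and v > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points with positive k and value")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all k values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic data with a hypothetical exponent of 2; not the study's data.
ks = list(range(1, 11))
values = [k ** -2.0 for k in ks]
slope = loglog_slope(ks, values)
print(slope)
```

On real degree data such a regression is known to be biased, which is one reason the more careful statistical procedure cited above is preferred.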
Similarly, the centrality parameters of the network, namely betweenness ($C_b$), closeness ($C_c$) and eigenvector ($C_e$) centralities, also exhibit fractal behavior (Figure 2) given by,
, such that
,
,
and
. The positive values of the q components of these centrality parameters indicate the strong regulatory role of the leading hubs in the breast cancer network [27] [28].
Figure 2. Topological properties of the breast cancer network. The lines are power-law fits to the data sets.
The topological properties of the breast cancer network can then be represented by,
,
where,
maintaining fractal properties.
Following the centrality-based methodology (see Methods), we examined the top twenty genes identified by each of the degree and centrality measurements (Figure 3, left four panels), giving eighty genes across all measurements. The repeated occurrence of some genes (EF1A1, HS90B, CTNB1, KU70, 1433Z) in the four lists drew our attention, and we visualized their neighbours as shown in Figure 4(a)-(e). Among these 80 central genes, 49 are known breast cancer-associated genes and 31 are inferred genes whose relationships with cancer need further investigation. We then manually searched for evidence of their relationships with cancer in various resources such as databases and published papers. After removing repetitions, the 31 inferred genes reduce to 19, of which 14 are cancer-associated (but not breast cancer-associated), which suggests that these four centralities are effective in identifying cancer-associated genes (Table 1). Further, five out of the 19 identified genes were found to be non-cancer-associated genes, which need further experimental investigation of their importance in the study of the breast cancer genome (Table 1). We now review the detailed information about the 19 identified inferred genes as follows:
Cancer associated genes: This category has two sub-categories depending on the source of evidence. Of the 14 genes, 10 have an association with cancer reported in the literature, while the other 4 are listed in the NCG database (Table 1) [29].
Figure 3. Plots of the degree- and centrality-based identification of the top twenty genes in each measurement, and the percentage of common overlap of the genes identified by the four measurements.
These 4 genes, namely CTNNB1, HS90B, NMP and PAPB1, were obtained after verification against NCG (Table 1) [29]. The other ten genes, namely 1433Z, TF65, CSK1, KU70, SF3B3, RL11, RS3A, HNRPU, RL6 and RL26, were verified for association with cancer through an extensive literature survey (using PubMed, Google Scholar, etc.). Of these 10 genes, the expression of KU70 and SF3B3 has recently been correlated with the prognosis of ER+ breast cancer [30].
Non-cancer associated genes: This category holds only 5 genes, EF1A1, 1433G, RL23, RL24 and RS26, which were found to be associated with other diseases but not with cancer (neither in NCG nor in the literature). Of these five, RS26 is correlated with conjunctival cancer.
Further, after removing breast cancer-related genes from the list, the highly repeated genes (EF1A1, HS90B, CTNB1, KU70, 1433Z) in the four measurements are most probably important inferred genes that help regulate the breast cancer regulatory network, and their regulating roles should be more significant than those of the other inferred genes. Hence, we further studied the topological properties of the sub-networks associated with these genes to understand their activities (Figures 4(a)-(e)). These sub-networks still follow hierarchical scale-free characteristics, possibly inherited from the main network in keeping with its fractal property. These sub-networks are compact (all
) where nodes are tightly bound (see Methods); their sizes are in the range 26 to 170, and the points in the LCP-DP plots indicate strong linkage of the nodes in each sub-network [16]. These properties suggest that these five important inferred genes might have strong regulating activity in the breast cancer network. The compactness, or how strongly the nodes are interconnected in each sub-network, can be characterized by defining a relative LCP-correlation given by,
where
is the value of LCP-correlation of
sub-network and
is the LCP-correlation of the complete breast cancer network. Since the calculated PLCP values of EF1A1 and 1433Z are the largest, the sub-networks corresponding to these inferred genes correlate strongly with the breast cancer network and actively regulate it, whereas CTNB1 has the lowest PLCP value, indicating weak correlation of its sub-network with, and weak regulation of, the breast cancer network (Figure 4(g)). Again, the relative energy distributions in these sub-networks of the inferred genes can be estimated using the Hamiltonian function within the formalism of the constant Potts model (see Methods), by defining the energy distribution per node, which is the ratio of the Hamiltonian energy $H_j$ of a sub-network j to the size of that sub-network. The calculated energy per node of the sub-networks corresponding to EF1A1 and HS90B shows the largest values, and those of CTNB1 and 1433Z show the smallest values, indicating strong and weak distributions of energy in their respective sub-networks.
Table 1. List of non-breast cancer inferred genes identified by centrality based method of identification of inferred genes.
Literature (cancerous genes): green; NCG (cancerous genes): yellow; not cancer: white; (*): Commonly appeared in all (i.e. four) centralities (Cc); (Cc): Gene appeared in both Betweenness (Cb) and Closeness centralities (Cc); (uCb): Gene unique to Betweenness (Cb); (d): Gene appeared in both Degree (d) and Closeness centralities (Cc); (Ce): Gene appeared in both Degree (d) and Eigenvector centralities (Ce); (uCe): Gene unique to Eigenvector centralities (Ce).
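The energy per node discussed in this section can be illustrated with a short Python sketch. It assumes that each sub-network is treated as a single community in the constant Potts model and that its size is its number of nodes; the function name, the toy graph and the value γ = 0.25 are hypothetical and are not taken from the study.

```python
def energy_per_node(adj, members, gamma):
    """Hamiltonian energy of one sub-network divided by its number of nodes.

    The sub-network is treated as a single community (assumption), so
    H_j = -(e_j - gamma * n_j**2), and the returned value is H_j / n_j.
    """
    members = set(members)
    n_j = len(members)
    if n_j == 0:
        raise ValueError("sub-network has no nodes")
    # Count each internal edge once (every edge is seen from both endpoints).
    e_j = sum(len(adj[u] & members) for u in members) // 2
    h_j = -(e_j - gamma * n_j ** 2)
    return h_j / n_j

# Toy graph: a triangle {a, b, c} with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
hub = "c"
ego = {hub} | toy[hub]                     # hub plus its first neighbors
per_node_triangle = energy_per_node(toy, {"a", "b", "c"}, gamma=0.25)
per_node_ego = energy_per_node(toy, ego, gamma=0.25)
print(per_node_triangle, per_node_ego)
```

Dividing by the number of nodes lets sub-networks of different sizes be compared on the same footing.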
4. Conclusion
The complex breast cancer network constructed from seventy experimentally verified genes is a hierarchical scale-free network, which involves the interaction of emergent diverse modules and sparsely distributed hubs in regulating the network. Regulation of this network is carried out by various breast cancer and non-breast cancer genes. These genes can be identified by centrality-based measurements, an important method for identifying inferred genes [19]. Betweenness and closeness centrality predicted more genes whose relation to the disease is currently unknown, and these are candidates for experimental study. The method was able to recognize 49 breast cancer genes verified by standard databases and literature reports, and nineteen non-breast cancer genes. Of the nineteen inferred genes, fourteen are involved in cancers other than breast cancer, and the other five are involved in non-cancer diseases. The identified inferred non-breast cancer genes deserve experimental attention in order to understand their direct and indirect roles in regulating the breast cancer network. The highly repeated genes in the centrality-based identification could have significantly important regulating activities in the breast cancer network, because the sub-network associated with each such gene is compact and strongly interlinked, and follows hierarchical features. The energy distributions in these sub-networks are also strong for some genes, indicating their significant roles in regulating the breast cancer network. We strongly propose rigorous experimental investigation of these inferred genes, as has been done for p53 [31] [32] [33], for a proper understanding of how this particular disease network works. Proper attention to these genes may open up new understanding of, and preventive mechanisms for, this disease.
Acknowledgements
KC and RKBS are financially supported by UPE-II sanction No.
. SA and MZM are financially supported by the Indian Council of Medical Research under an SRF (Senior Research Fellowship).