Combining Gene-phenotype Association Matrix with Kegg Pathways to Mine Gene Modules Using Data Set in Gaw17

Currently, genome-wide association studies have been proved to be a powerful approach to identify risk loci. However, the molecular regulatory mechanisms of complex diseases are still not clearly understood. It is therefore important to consider the interplay between genetic factors and biological networks in elucidating the mechanisms of complex disease pathogenesis. In this paper, we first conducted a genome-wide association analysis by using the SNP genotype data and phenotype data provided by Genetic Analysis Workshop 17, in order to filter significant SNPs associated with the diseases. Second, we conducted a bioinformatics analysis of gene-phenotype association matrix to identify gene modules (biclusters). Third, we performed a KEGG enrichment test of genes involved in biclusters to find evidence to support their functional consensus. This method can be used for better understanding complex diseases.


Introduction
It is well known that genome-wide association studies (GWAS) have become an increasingly effective tool to identify genetic variation associated with the risk of complex disease.However, in this case, univariate single-locus association analyses may not be the most appropriate strategy.Instead, many multivariate methods that were used for identifying causal loci have been developed rapidly.Gauderman et al [1] induced a principal-components method to assess whether multiple SNPs within a candidate are associated with disease.In fact, it is suggested that gene-gene interaction or gene modules (gene-gene networks) may play important roles and provide more information.Xiaoqi Cui et al [2] proposed a new combinatorial association test incorporating multiple traits at the same time to detect gene-gene interaction unlike other general methods for studying gene-gene interaction.In addition, there are some attempts to integrate the biological knowledge (e.g., Gene Ontology and KEGG pathways) into the genomics field.For example, Iossiov et al. [3] predicted pathways or networks of interacting genes that contribute to common heritable disorders by combining the standard genetic linkage formalism with whole-genome molecular-interaction data.Furthermore, many gene set analysis methods, such as GeneTrail (http://genetrail.bioinf.uni-sb.de/)and GSEA [4], are used to detect disease-related risk pathways or gene modules.These methods integrate heterogeneous data to elucidate biological mechanisms, which is an essential and challenging problem in systems biology.
In this paper, we conducted a genome-wide association analysis by combining the SNP genotype data with the phenotype data to filter significant SNPs associated with the diseases.Then, further bioinformatics analysis to gene-phenotype association matrix was used to identify gene biclusters (gene modules).A KEGG enrichment test of genes involved in biclusters was applied to find whether they were functional aggregated.This method can be used for better understanding complex diseases.

Materials
The GAW17 data consisted of two data sets.One consists of a collection of 697 unrelated individuals and their genotypes and phenotypes.These are subjects from the 1000 Genomes Project.The second data set is comprised of 697 individuals in eight extended families and their genotypes and phenotypes.The 202 founders in the family data set were chosen at random from the set of unrelated individuals.The extended families are loosely based on the families from previous GAWs.SNP geno-types were obtained from the sequence alignment files provided by the 1000 Genomes Project for their pilot3 study (http://www.1000genomes.org).There is a total of 24487 SNPs, all of which are autosomal.A total of 200 replicates (phenotypes) of the trait simulation were carried out in both data sets.In this paper, we only selected the disease status (coded 0 = no, 1 = yes) of the first ten replicates as dichotomous disease phenotype to perform our analysis.

Data Preprocessing
Plink software can be used to calculate the association of genome-wide SNPs with the phenotypes.Here, all of p-values were used to construct gene-phenotype association matrix.This matrix is 24487 × 10 (SNPs × number of phenotypes) matrix.We performed bioinformatics analysis by using the bicluster method to this matrix in the following step.

Applying Bicluster Method to Gene-Phenotype Association Matrix
A bicluster is a subset of the SNPs exhibiting consistent association strength over a subset of the phenotypes.We applied Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) algorithm [5] to detect significant biclusters from our constructed gene-phenotype association matrix.This algorithm includes three phases.In the first phase, the bipartite graph was formed and vertex pair weights were calculated using weighting methods.In the second phase, the hashing technique was applied to find the heaviest bicliques in the graph.In the last phase, a local procedure on the biclusters in each heap was performed.We used Expander software (http://acgt.cs.tau.ac.il/expander/) to implement this algorithm.

Functional Gene Modules Mining in Combination with KEGG Pathway
To see if the genes significantly aggregating in extracted, biclusters are also aggregating in functional categories, we performed a KEGG enrichment test.For a given KEGG pathway, a gene is either in this pathway or not in this pathway.We suppose that a total of N (=3205) genes for the analyzed data are presented in KEGG pathways in which a set of genes in biclusters are significantly aggregated.We applied GeneTrail software (http://genetrail.bioinf.uni-sb.de/) to identify functional gene modules (gene biclusters) enriched on KEGG pathways significantly.In this study, to avoid the possible loss of the true positives, the multiple test correction was not performed and the p-value quoted should be considered as a heuristic measure, useful as an indicator that roughly rates the relative enrichment of significantly aggregated genes for each KEGG pathway.

Constructing Gene-Phenotype Association Matrix
Of 22487 genome-wide SNPs, we found 1775, 2444, 1599, 1244, 1492, 1799, 1480, 1689, 1415 and 1625 SNPs are significantly associated with disease1-dis-ease10, respectively according to p < 0.05.Table 1 shows the distributions of the number of significant SNPs (genes) associated with diseases in terms of p < 0.05 and p < 0.01.We found that there are more significant overlapping SNPs between disease1 and disease2 than that of between any of other two diseases (See Figure 1).Further look at the pool of significant SNPs, only C10S2297 (FRMPD2) was shared by all of ten diseases (phenotypes), thus likely to be associated with some of complex diseases.In fact, Nina Stenzel et al [6] approved that the down regulation of FRMPD2 protein in Caco-2 cells is associated with an impairment of tight junction formation.
Additionally, it is noted that at the level of 0.05 the number of significant association was dramatically reduced as the increasing of the number of diseases (See Figure 2).

Biclustering to Gene-Phenotype Association Matrix
We applied Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) algorithm to detect biclusters from our constructed gene-phenotype association matrix.All parameters were selected as default.As a result, a total of 11 biclusters were acquired (See Table 2).Each bicluster includes at least 18 genes.The scores given in the second   column of Table 2 were acquired by the SAMBA algorithm and they are size-dependent, thus, it is not recommended to use them to compare the quality of two biclusters of different sizes.Further bioinformatics analysis will identify functional biclusters enriched on KEGG pathways significantly.

Performing KEGG Enrichment Test
We put all genes onto KEGG pathways.Then we selected the KEGG pathways that contained at least two genes.GeneTrail software was applied to obtain a p-value of each studied pathway for its enrichment with significantly function aggregated genes.We found that more than half of the biclusters (54.5%) were significantly enriched on KEGG pathways.Table 2 lists the significantly (nominal p ≤ 0.05) enriched KEGG pathways.It is interesting to note that the majority of the enriched KEGG pathways relate to insulin signaling, autoimmune thyroid disease and Fc gamma R-medidated phagocytosis.It is noteworthy to look at a highly aggregated genes included in bicluster 9 (See Figure 3), three of which (HLA-A, HLA-B and TG) have direct relevance to immunology.Fred Sanfilippo et al [7] indicated that good HLA-A and B matching is highly dependent on a system for sharing organs among institutions, and results in decreased graft rejection, better long-term graft function, and less need for post-transplantation immunosuppression.Another example, genes HSD3B2 and UGT1A10 included in bicluster 9 were insulin-related.Goldy Crabunaru et al [8] tested the hypothesis that HSD3B deficiency in hyperandrogenic females (HF) is related to insulin-resistant polycystic ovary syndrome (PCOS), and Katriina Itaaho et al [9] shed new light on the structure and function of UGT1A10 which can catalyze dopamine glucuronidation at substantial rates, yielding both dopamine-4-O-glucuronide and dopamine-3-O-glucuronide.

Discussions
This study is the attempt to relate the functional aggrega-tion patterns of genes with their disease-loci association.Statistically significant biclusters acquired by SAMBA algorithms are generated in an unsupervised fashion directly from the gene-phenotype association matrix.Further bioinformatics analysis suggests that some of biclusters (gene modules) characterize biological phenomenon and can be evaluated using existing biological knowledge.In addition, some developed protein structural genomics and computational methods may be utilized to find the small portion of SNPs of likely functional importance.For each of 428 SNPs involved in biclusters, we used SIFT software (http://www.blocks.fhcrc.org/) to acquire its tolerance index (from 0 to 1) for SNP functionality to determine the conservation level of a particular amino acid position in a protein.A higher tolerance index indicates that the position is less conserved across species [10].The results showed that only 93 SNPs (21.7%) had their tolerance indexes.Among these SNPs, 62 of them (66.7%) had higher tolerance indexes which were greater than 0.5.In other words, the positions of SNPs involved in functional biclusters might be less conserved, which need to be highlighted in future studies.However, we feel that these exploratory results need further investigation because of the limited number of phenotype used in our analysis.Moreover, we only selected unrelated data sets; further study is needed regarding our method for the pedigree data, which will be included in future studies.

Conclusion
In this paper, we conducted a genome-wide association analysis by combining the SNP genotype data with the phenotype data to filter significant SNPs associated with diseases.Further bioinformatics analysis revealed that aggregated genes in biclusters tended to aggregate on some KEGG pathways and to be functional association.

Figure 1 .
Figure 1.The number of significant overlapping SNPs between any of two diseases.(The number of significant overlapping SNPs between any of two phenotypes (diseases) is depicted by a color cell, where the red colour indicates that the greater number of overlapping SNPs is shared by two different diseases).

Figure 2 .
Figure 2. The trendy of the number of significant overlapping SNPs as the increasing of the number of diseases (phenotypes).(The transverse axis stands for the first k (k = 2, 3 …10) phenotypes which were used to be compared and the longitudinal one denotes the number of overlapping significant SNPs shared by them).

Figure 3 .
Figure 3.The heatmap of bicluster 9(a) and bicluster 10(b).(The p-values of gene-phenotype association are depicted by cells with different colours, where the green colour indicates the significant association between gene and disease (phenotype)).