Combining Gene-Phenotype Association Matrix with KEGG Pathways to Mine Gene Modules Using Data Set in GAW17

doi:10.4236/eng.2013.510B067

Paper Menu >>

Journal Menu >>

Engineering, 2013, 5, 332-337

http://dx.doi.org/10.4236/eng.2013.510B067 Published Online Octob er 2013 (http://www.scirp.org/journal/eng)

Combining Gene-Phenotype Association Matrix with

KEGG Pathways to Mine Gene Modules

Using Data Set in GAW17

Hua Lin*, Yang Zheng, Ping Zhou

Biomedical Engineering Institute of Capital Medical University, Beijing, China

Email: *hualin7750@139.com

Received August 2013

ABSTRACT

Currently, genome-wide association studies have been proved to be a powerful appro ach to identify ris k loci. H o wev e r ,

the molecular regulatory mechanisms of complex diseases are still not clearly understood. It is therefore important to

consider the interplay between genetic factors and biological networks in elucidating the mechanisms of complex dis-

ease pathogenesis. In this pap e r, we first conducte d a genome-wide assoc iation analysis by usi ng the SNP genotype data

and phenotype data provided by Genetic Analysis Worksho p 17, in orde r to filter significant SNPs associated with the

diseases. Second, we conducted a bioinformatics analysis of gene-phenotype association matrix to identify gene mod-

ules (biclusters). Third, we performed a KEGG enrichment test of gene s involve d in biclusters to find evidence to sup-

port their functio nal consensu s. This method can be used for better understanding complex diseases.

Keywords: Gene Modules; KEGG Pathways; Bic lusters

1. Introduction

It is well known that genome-wide association studies

(GWAS) have become an increasingly effective tool to

identify genetic variation associated with the risk of

complex disease. However, in this case, univariate sin-

gle-locus association analyses may not be the most ap-

propriate strategy. Instead, many multivariate methods

that were used for identifying causal loci have been de-

veloped rapidly. Gauderman et al [1] induced a princip-

al-components method to assess whether multiple SNPs

within a candidate are associated with disease. In fact, it

is suggested that gene-gene interaction or gene modules

(gene–gene networks) may play important roles and pro-

vide more information. Xiaoqi Cui et al [2] proposed a

new combinato rial as soc iation test i ncorpor ating multiple

traits at the same time to detect gene-gene interaction

unlike other general methods for studying gene-gene

interaction. In addition, there are some attempts to int e-

grate the biological knowledge (e.g., Gene Ontology and

KEGG pathways) into the genomics field. For example,

Iossiov et al. [3] predicted pathways or networks of inte-

racting genes that contribute to common heritable dis-

orders by combining the standard genetic linkage for-

malism with whole-genome molecular-interaction data.

Furthermore, many gene set analysis methods, such as

GeneTrail (http://genetrail.bioinf.uni-sb.de/) and GSEA [4],

are used to detect disease-related risk pathways or gene

modules. These methods integrate heterogeneous data to

elucidate biological mechanisms, which is an essential

and challenging problem in systems biology.

In this paper, we conducted a genome-wide associa-

tion analysis by combining the SNP genotype data with

the phenotype data to filter significant SNPs associated

with the diseases. Then, further bioinformatics analysis

to gene -phenotype association matrix was used to identi-

fy gene biclusters (gene modules). A KEGG enrichment

test of genes involved in biclusters was applied to find

whether they were functional aggregated. This method

can be used for better understanding complex diseases.

2. Method s

2.1. Materials

The GAW17 data consisted of two data sets. One con-

sists of a collection of 697 u nrelated individ uals a nd t hei r

genotypes and phenotypes. These are subjects from the

1000 Genomes Project. The second data set is comprised

of 697 individuals in eight extended families and their

genot ype s and phenotypes. The 202 founders in the fa m-

ily data set were chosen at random from the set of unre-

lated individ uals. The extended families are loosely

based on the families from previous GAWs. SNP geno-

*Corresponding author.

L. HUA ET AL.

333

types were obtained from the sequence alignment files

provided by the 1000 Genomes Project for their pilot3

study (http://www.1000genomes.org). There is a total of

24487 SNPs, all of which are autosomal. A total of 200

replicates (phenotypes) of the trait simulation were car-

ried out in both data sets. In this paper, we onl y selected

the disease status (coded 0 = no, 1 = yes) of the first ten

replicates as dichotomous disease phenotype to perform

our analysi s .

2.2. Data Preprocessing

Pli nk soft ware can be used to calculate the associatio n of

genome-wide SNPs with the phenotypes. Here, all of

p-values were used to construct gene-phenotype associa-

tion matrix. This matrix is 24487 × 10 (SNPs × number

of phenotypes) matrix. We performed bioinformatics

analysis by using the bicluster method to this matrix in

the followin g step.

2.3. Applying Bicluster Method to

Gene-Phenotype Association Matrix

A bicluster is a subset of the SNPs exhibiting consistent

association strength over a subset of the phenotypes. We

applied Statistical-Algorithmic Method for Bicluster

Analysis (SAMBA) algorithm [5] to detect significant

biclusters from our constructed gene -phenotype associa-

tion matrix. This algorithm includes three phases. In the

first phase, the bipartite graph was formed and vertex

pair weights were calculated using weighting methods. In

the second phase, the hashing technique was applied to

find the hea viest bicliques in the graph. In the last phase,

a local procedure on the biclusters in each heap was per-

formed. We used Expander software (http://acgt.cs.tau.

ac.il/expander/) to implement this algorith m.

2.4. Functional Gene Mod ules Min ing in

Combination with KEGG Pathway

To se e if the gene s si gni fica ntl y aggr egat ing i n extracted,

biclusters are also aggregating in functional categories,

we performed a KEGG enrichment test. For a given

KEGG pathway, a gene is either in this pathway or not in

this pathway. We suppose that a total of N (=3205) genes

for the analyzed data are presented in KE GG pathways in

which a set of genes in biclusters are significantly ag-

gregated. We app lied GeneTrail software (http://genetr ail.

bioinf.uni-sb.de/) to identify functional gene modules

(gene biclusters) enriched on KEGG pathways signifi-

cantly. In t his stud y, to avoid the p ossible los s of the tr ue

positives, the multiple test correction was not performed

and the p-va lue quo ted sho uld be considered a s a heuris-

tic measure, useful as an indicator that roughly rates the

relative enrichment of significantly aggregated genes for

each KEGG pathway.

3. Resul ts

3.1. Constructing Gene-Phenotype Association

Matrix

Of 22487 genome-wide SNPs, we found 1775, 2444,

1599, 1244, 1492, 1799, 1480, 1689, 1415 and 1625

SNPs are significantly associated with disease1-dis-

ease10, respectively according to p < 0.05. Table 1

shows the distributions of the number of significant

SNPs (genes) associated with diseases in terms of p <

0.05 and p < 0.01. We found that there are more signifi-

cant overlapping SNPs between disease1 and disease2

than that of between any o f other two diseases (See Fig-

ure 1). Furthe r loo k at the pool o f significa nt SNP s, only

C10S2297 (FRMPD2) was shared by all of ten diseases

(phenotypes), thus likely to be associated with some of

complex diseases. In fact, Nina Stenzel et al [6] approved

that the down regulation of FRMPD2 protein in Caco-2

cells is associated with an impairment of tight junction

formation.

Additionally, it is noted that at the level of 0.05 the

number of significant association was dramatically re-

duced as the increasing of the number of diseases (See

Figure 2).

3.2. Biclustering to Gene-Phenotype Association

Matrix

We applied Statistical-Algorithmic Method for Bicluster

Analysis (SAMBA) algorithm to detect biclusters from

our constructed gene-phenotype association matrix. All

parameters were selected as default. As a result, a total of

11 biclusters were acquired (See Tabl e 2). Each bicluster

includes at least 18 genes. Th e scores gi ve n i n t he se c ond

Table 1. The number of significant SNPs associated w ith ten

diseases according to p < 0.05 and p < 0.01.

phenotype p < 0.05 p < 0.01

SNP Gene SNP Gene

Dis ease1 1775 995 506 379

Dis ease2 2444 1251 839 591

Dis ease3 1599 959 398 330

Dis ease4 1244 789 262 220

Dis ease5 1492 869 367 288

Dis ease6 1799 988 527 369

Dis ease7 1480 934 392 312

Dis ease8 1689 956 465 355

Dis ease9 1415 903 338 275

Dis ease10 1625 958 459 363

L. HUA ET AL.

334

Figure 1. The number of significant overlapping SNPs between any of two diseases. (The number of significant overlapping

SNPs between any of two phenotypes (diseases) is depicted by a color cell, where the red colour indicates that the greater

numb e r of overlapping SNPs is shared by two different di seases).

Figure 2 . T he t r en dy of the n u mbe r of sig ni f ica nt ove rl a ppi ng SN P s as the i nc r easing o f t he nu mber of dis e ase s (p he n ot ype s) .

(The transverse axis stands for the first k (k = 2, 3 …10) phenotypes which were used to be compared and the longitudinal

one denotes t he number of overlapping significant SNPs shared by them).

column of Table 2 were acquired by the SAMBA algo-

rithm and they are size-dependent, thus, it is not recom-

mended to use them to compare the quality of two bic-

lusters of different sizes. Further bioinformatics analysis

will identify functional biclusters enriched on KEGG

pathways significantly.

3.3. Performing KEGG Enrichment Test

We put all genes onto KEGG pathways. Then we se-

lected the KEGG pathways that contained at least two

genes. GeneTrail software was applied to obtain a

p-value of each studied pathway for its enrichment with

L. HUA ET AL.

335

Table 2. Extracted biclusters and thei r enrichment effect on KEGG pathways .

Bicluster Score The number of SNPs The number of genes KEGG pathways p-values

1 50.607 55 52 - -

2 38.672 54 49 - -

3 37.280 37 37 - -

4 48.735 52 51 Caff eine metabolism 0.0353

Drug metabolism –other enzym es 0.0353

5 39.940 46 42 -

6 79.569 88 70 Fc gamma R-mediated phagocytosis 0.0424

7 64.039 66 57 I ns ulin signa l ing pa th way 0.0141

8 48.447 57 52 Fc gamma R-mediated phagocytosis 0.0424

9 40.403 52 47

Ins ulin sign al ing pa t hway 0.0056

Ste r o id ho r m o ne bios ynthes is 0.0281

Autoimmune thyroid disease 0.0419

10 13.462 19 18

Ins ulin sign al ing pa t hway 0.0056

Ste r o id ho r m o ne bios ynthes is 0.0281

Autoimmune thyroid disease 0.0419

11 22.782 28 25 - -

significantly function aggregated genes. We found that

more than half of the bicluster s (54.5%) were signifi-

cantly enriched on KEGG pathways. Table 2 lists the

significantly (nominal p ≤ 0.05) enriched KEGG path-

ways. It is interesting to note that the majority of the

enriched KEGG pathways relate to insulin signaling,

autoimmune thyroid disease and Fc gamma R-medidated

phagocytosis. It is noteworthy to look at a highly aggre-

gated ge nes included in bicluster 9 (See F ig ure 3), three

of which (HLA-A, HLA-B and TG) have direct relev-

ance to immunology. Fred Sanfilippo et al [7] indicated

that good HLA-A and B matching is highly dependent on

a system for sharing organs among institutions, and re-

sults in decreased graft rejection, better long-term graft

function, and less need for post-transplantation immuno-

suppr ession. Another example, genes HSD3B2 and

UGT1A10 included in bicluster 9 were insulin-related.

Goldy Crabunaru et al [8] tested the hypothesis that

HSD3B deficiency in hyperandrogenic females (HF) is

related to insulin-resist ant polycystic ovary syndrome

(PCO S ), and Katriina Itaaho et al [9] shed new light on

the structure and function of UGT1A10 which can cata-

lyze dopamine glucuronidation at substantial rates, yiel-

ding both dopamine-4-O-glucuronide and dopamine-3-

O-glucuronide.

4. Discussions

This study is the atte mpt to rela te the functional aggrega-

tion patterns o f genes with t heir disease-loci association.

Statistically significant bicluster s acquired by SAMBA

algorithms are generated in an unsupervised fashion di-

rectly from the gene-phenotype association matrix. Fur-

ther bioinformatics analysis suggests that some of bic-

lusters (gene modules) characterize biological phenome-

non and can be evaluated using existing biological

knowledge. In addition, some developed protein struc-

tural genomics and computational methods may be uti-

lized to find the s mall por tion of SNP s of like ly func tional

importance. For each of 428 SNPs involved in bicluster s,

we used SIFT software (http://www.blocks.fhcrc.org/) to

acquire its tolerance index (from 0 to 1) for SNP func-

tionality to d etermine the c onservation le vel of a partic u-

lar amino acid position in a protein. A higher tolerance

index indicates that the position is less conserved across

species [10]. The results showed that only 93 SNPs

(21.7%) had their tolerance indexes. Among these SNPs,

62 of them (66.7%) had higher tolerance indexes which

were greater than 0.5. In other words, the positions of

SNPs involved in functional biclusters might be less

conserved, which need to be highlighted in future studies.

However, we feel that these exploratory results need fur-

ther investigation because of the limited number of phe-

notype used in our analysis. Moreover, we only selected

unrelated data sets; further study is needed regarding our

met hod for the p edigree data, which will be included in

future studies.

L. HUA ET AL.

336

(a)

(b)

Figure 3. The heatmap of bicluster 9(a) and bicluster 10(b).

(The p-values of g ene -phenotype association are depi cte d by

cells with different colo urs, where the green colour indicates

the significant association between gene and disease (phe-

notype)).

5. Conclusion

In this paper, we conducted a genome-wide association

analysis by combining the SNP genotype data with the

phenotype data to filter significant SNPs associated with

diseases. Further bioinformatics analysis revealed that

aggregated genes in biclusters tended to aggregate on

some KEGG pat hways and to be functional association.

6. Acknowledgemen ts

We thank reviewers for their valuable comments and

suggestions.

This work is supported by the National Natural

Science Foundation of China (Grant Nos. 31100905) and

the Science Technology Development Project of Beijing

Municipal Commission of Education (SQKM

201210025008). This study is also funded by the excel-

lent talent cultivation project of Beijing (2012D

005018000002) and the young backbone teacher’s culti-

vation project of Bei jing Municip al Commission of Edu-

cation, and supported by the foundation-clinical coopera-

tion project of capital medical university (11JL30,

11JL33 and 12JL75).

REFERENCES

[1] J. Gauderman, C. Murcray, F. Gilliland and D. Conti,

“Testing Association between Disease and Multiple SNPs

in a Candidate Gene,” Genetic Epidemiology, Vol. 31, No.

5, 2007, pp. 383-395.

http://dx.doi.org/10.1002/gepi.20219

[2] X. Cui, Q. Sha, S. Zhang and H.-S. Chen, “A Combina-

torial Approach for Detecting Gene-Gene Interaction Us-

ing Multiple Traits of Genetic Analysis Workshop 16

Rheumatoid Arthritis Data,” BMC Proceedings, Vol. 3,

Suppl. 7, 2009, p. S43.

[3] I. Iossifov, T. Zheng, M. Baron, T. C. Gilliam and A.

Rzhetsky, “Genetic-Linkage Mapping of Complex Here-

ditary Disorders to a Whole-Genome Molecular-Interac-

tion Network, ” Genome Research, Vol. 18, 2008, pp.

1150-1162. http://dx.doi.org/10.1101/gr.075622.107

[4] M. Holden, S. Deng, L. Wojnowski and B. Kulle, “GSEA-

SNP: Applying Gene Set Enrichment Analysis to SNP

Data from Genome-Wide Association Studies,” Bio in-

formatics, Vol. 24, No. 23, 2008, pp. 2784-2785.

http://dx.doi.org/10.1093/bioinformatics/btn516

[5] A. Tanay, R. S haran and R. Shamir, “Discovering Statis-

tically Significant Biclusters in Gene Expression Data,”

Bioinformatics, Vol. 18, Suppl. 1, 2002, pp. 136-144.

http://dx.doi.org/10.1093/bioinformatics/18.suppl_1.S136

[6] N. Stenzel, C. P. Fetzer, R. Heumann, S. Kai, “PDZ-Do-

main-Directed Basolateral Targeting of the Peripheral

Membrane Protein FRMPD2 in Epithelial Cells,” Journal

of Cell Science, Vol. 122, 2009, pp. 3374-3384.

http://dx.doi.org/10.1242/jcs.046854

[7] F. Sanfilippo, W. K. Vaughn, E. K. Spees, J. A. Light and

W. M. LeFor , “Be ne fi ts o f H LA-A and HLA-B Matching

on Graft and Patient Outcome after Cadaveric-Donor

Renal Transplantation,” The New England Journal of

Medicine, Vol. 311, No. 6, 19 84, pp. 358-364.

[8] G. Carbunaru, P. Prasad, B. Scoccia, P. Shea, N. Hop-

L. HUA ET AL.

337

wood, F. Ziai, Y. T. Chang, S. Myers, J. Mason and S.

Pang, “The Hormonal Phenotype of Nonclassic 3β-Hy-

droxysteroid Dehydrogenase (HSD3B) Deficiency in

Hyperandrogenic Females Is Associated with Insulin-

Resistant Polycystic Ovary Syndrome and Is Not a Va-

riant of Inherited HSD3B2 Deficiency,” The Journal of

Clinical Endocrinology & Metabolism, Vol. 89, No. 2,

2003, pp . 783-794.

http://dx.doi.org/10.1210/jc.2003-030934

[9] K. Itäaho, M. H. Court, P. Uutela, R. Kostiainen, A. Ra-

dominska-Pandya, M. F. D opamine, “Is a Low-Affinity

and High-Specificity Substrate for the Human UDP-Glu-

curonosyltransferase 1A10,” Drug Metabolism & Dispo-

sition, Vol. 37, No. 4, 2009, pp.768-775.

http://dx.doi.org/10.1124/dmd.108.025692

[10] Y. Zhu, A. Hoffman, X. Wu, H. Zhang, Y. Zhang, D.

Leaderer and T. Zheng, “Correlating Observed Odds Ra-

tios from Lung Cancer Case-Control Studies to SNP

Functional Scores Predicted by Bioinformactic Tool,”

Mutation Research, Vol. 639, No. 1-2, 200 8, pp. 80-88.

http://dx.doi.org/10.1016/j.mrfmmm.2007.11.005