Genome-Wide Likelihood Ratio Tests under Heterogeneity

The commonly used statistical methods in medical research generally assume patients arise from one homogeneous population. However, the existence and importance of significant heterogeneity have been widely documented. It is well known that common and complex human diseases usually have heterogeneous disease etiology, which often involves interplay of multiple genetic and environmental factors, leading to latent population substructure. Genome-wide association studies (GWAS) is a useful tool to uncover genetic association with disease of interest, while linkage analysis is a commonly used method to identify statistical association between the inheritance of a human disease and inheritance of marker loci that are in linkage with disease causing loci. We propose a likelihood ratio test for genome-wide linkage analysis under genetic heterogeneity using family data. We derive a closed-form formula for the LRT test statistic and provide explicit asymptotic null distribution. The closed form asymptotic distribution allows easy determination of the asymptotic p-values. Our extensive simulation studies indicate that the proposed test has proper type I error and good power under genetic heterogeneity. In order to simplify application of the proposed method for non-statisticians, we develop an R package gLRTH to implement the proposed LRT for genome-wide linkage analysis as well as Qian and Shao’s LRT for GWAS under heterogeneity. The newly developed open source R package gLRTH is available at CRAN.


Introduction
The commonly used statistical methods in medical research generally assume X. X. Han, Y. Z. Shao that patients in the study arise from one homogeneous population.However, the existence and importance of significant heterogeneity are well known and have been documented in literature for many diseases, including Alzheimer's disease [1] [2], asthma [3] [4], diabetes [5] [6], and multiple cancer types [7] [8] [9] [10].
These common and complex human diseases usually have non-unique disease etiology, which also frequently involve interplay of multiple genetic and environmental factors, leading to latent population substructure [11] [12] [13].Therefore, it is common that the patient population of a complex disease consists of various latent subpopulations, each with disease caused by mutations at different loci.Yet each of the unobservable subgroups is relatively homogeneous in etiology or diagnosis.
The genome-wide association study (GWAS) and linkage analysis are two classical approaches for studying human genetic disorders.GWAS is an experimental design (typically case-control) used to detect associations between genetic variants and diseases/traits from a study population [14].The ultimate goal of the population-based GWAS is to assist researchers to have a better understanding of the biology of the disease and develop better prevention or treatment strategies for common and complex diseases.However, the standard GWAS analysis methods ignore the widely existing genetic heterogeneity.To account for latent genetic heterogeneity in GWAS, Qian and Shao [15] recently developed a novel likelihood ratio test under genetic heterogeneity (LRT-H).
This methods has been shown to have superior power advantage over the commonly used Cochran-Armitage trend test (CATT) in GWAS for complex diseases where genetic heterogeneity commonly exists [15] [16].
Linkage analysis is a commonly used method to identify statistical association between the inheritance of a human disease and inheritance of marker loci before the era of GWAS.In the last two decades, linkage based gene mapping has been marginalized by the population-based genome-wide association study.
Association analysis uses common variants and allows for finer mapping than linkage analysis in general.However, one major problem for association study is population stratification, which can lead to increased number of false negative as well as false positive findings if latent heterogeneity is not properly controlled for [17].Yet this is not a concern for family-based linkage analysis, as children's genotypes only depend on their parents but not on the population genotype frequencies [18] [19].Recent advancement in next generation sequencing (NGS) has made it technologically feasible and financially affordable to determine mutation profiles for families.Linkage analysis again becomes important to identify causal variants using family-based deep sequencing data.Ott et al. [20] and Shao [21] presented reviews of genetic linkage analysis in the age of NGS.
For marker alleles that are associated with inheritance of complex disease, it is not uncommon that the transmission probabilities of a marker allele of interest vary across heterozygous parents, due to locus heterogeneity, etiologic heterogeneity, and many other complexities and/or combinations of them [11].

X. X. Han, Y. Z. Shao
For example, breast cancer as a complex disease, is well known to be heterogeneous.Some cases of breast cancer are due to the inherited mutations in BRCA1/2 in some families [7], while in other families due to mutations in other genes (e.g.PTEN) [22].These genetic heterogeneities are often not directly observable from linkage data or GWAS data.The current available genetic linkage methods that account for latent genetic heterogeneity are based on mixture models and generally are computational expensive for genome-wide or NGS data [13] [23] [24] [25], yet ignoring heterogeneity can cause loss of efficiency in statistical test with increased numbers of false negative findings or missed opportunities.
In the era of whole genome sequencing, it is important to have statistical tests that are 1) computationally efficient even for genome-wide data, 2) robust under genetic heterogeneity and 3) statistically powerful.Motivated by the Qian and Shao's [15] LRT-H for GWAS, in this paper we propose a powerful and computational efficient likelihood ratio test under genetic heterogeneity for linkage analysis based on a binomial mixture model, using family data with parental marker genotypes and genotypes of two affected siblings.We have developed an R package gLRTH to implement the newly proposed LRT for genome-wide linkage analysis under genetic heterogeneity as well as Qian and Shao's [15] LRT-H for GWAS.The package is freely available on CRAN.The purpose of this R package is to simplify the application of these two methods for non-specialists.The rest of paper is organized as follows.In Section 2, we introduce the LRT for linkage analysis under genetic heterogeneity.We derive the closed-form test statistic and provide explicit asymptotic null distribution that simplify the computations for p-values.In Section 3, we present numerical simulation studies for type I error and power analysis.In Section 4, we describe the R functions and their arguments.The paper is concluded in Section 5.

Methods
Genetic markers can have multiple alleles.In next generation sequencing (NGS), GWAS and other genome-wide studies, markers with two alleles are most common.Thus, without much loss of generality, we focus on markers with two alleles.Here we consider a binary trait and focus on detecting linkage under genetic heterogeneity at a single marker locus with two alleles A and a.We consider independent families each with one marke-homozygous (AA) parent, one marker-heterozygous parent (Aa) and two diseased children.Let X denote the total number of allele a inherited by the two affected children from their heterozygous parent (Aa).Then X has a binomial distribution where and b θ is the transmission probability for the marker-heterogeneous patient to pass allele a to a child.
Under the null hypothesis H 0 of no linkage between the marker and any disease-causing loci, 0.5 b θ = for all families, i.e.
However, transmission heterogeneity, i.e., variations among b θ generally exists in complex diseases.For example, any combination of the complexities listed in Lander and Schork [12] can result in transmission heterogeneity.Thus, under transmission heterogeneity, we assume X, the number of allele a, follows a binomial mixture distribution in the population, that is where , In particular, for many of the complex diseases with transmission heterogeneity, it is likely that J is quite large.Since it is hard to know the exact number of the sub-populations J under transmission heterogeneity, it is desirable to have a new test that is applicable without knowing the exact value of J while allowing 2 J ≥ .Suppose n independent families each with one marker homozygous (AA) parent, one marker heterozygous parent (Aa) and two diseased children are sampled from the population.For each locus, the observed genotype frequencies inherited from the heterozygous Aa parent in the two diseased children are summarized in the first row in Table 1.Under H 0 , the expected genotype frequencies are summarized in the second row in Table 1.

Mixture Binomial and Maximum Likelihood
Assuming the setup in the previous subsection and using notations in Table 1, the maximum likelihood estimator (MLE) of θ under the binomial likelihood in Equation ( 1) is ( ) Thus, the binomial likelihood in Equation ( 1) evaluated at θ is ( ) .
Under H 0 , 1 2 θ = , the binomial likelihood value is The maximum of the mixture likelihood for X in Equation ( 3) has an explicit formula [15], that is where θ is defined in Equation (4).

The Likelihood Ratio Test
Using the maximum of the likelihood 0 L , M L and D L , respectively, we can write down the explicit formula of the log-LRT statistic 2 N λ as follows, ( ) Equation ( 8) can be written as following First, we may consider a classic problem for testing ( ) . The LRT statistic is well known to have a ( ) where ( ) Therefore, when 4n n n > , we can consider testing of goodness-of-fit of ( ) The LRT statistic has a 2 2 χ asymptotic distribution and can be written as χ distribution.Note that the two terms in the right hand side of equation (12)   are well known to be asymptotically independent.Therefore, when we have It is easy to show that ( ) Shao [15].Thus, we obtain the explicit form of asymptotic distribution under the null hypothesis.That is, under H 0 , Importantly, to implement the LRT, there is no need to identify the exact number of mixture components J in equation (3).

Type I Errors
As the LRT 2 N λ has an explicit asymptotic distribution under H 0 , it is convenient to evaluate p-value and type I error.We conducted simulations to compare the empirical type I error of the LRT to the nominal significant level ranging from 10 −2 to 10 −8 .The genotype data were generated from binomial distribution ( ) . The simulation was replicated 10 11 times.As shown in Table 2, the empirical type I error is slightly smaller than the nominal level, but they are very close to each other.Therefore, using the asymptotic null distribution for the LRT is valid.The closed form asymptotic distribution allows easy determination of the asymptotic p-values.

Power Comparison
In the simulation studies for power comparison, the sample was generated from a two-component mixture binomial distribution as described in equation ( 3)

The R Package Description and Examples
The gLRTH R package is available on CRAN and the installation is standard.
The purpose of this package is to implement the previously discussed two methods, i.e., LRT for genome-wide linkage analysis under genetic heterogeneity and Qian and Shao's LRT-H for GWAS [15].The gLRTH R package is composed of two main functions: gLRTH_L for linkage analysis under heterogeneity and gLRTH_A for association studies.
The gLRTH_L function calculates the test statistic and asymptotic p-value for the likelihood ratio test for testing linkage.The gLRTH_L function in the package can be called with the following syntax: The required arguments are: 1) n0: Number of affected sibling pairs that both inherited A from their heterozygous parent Aa 2) n1: Number of affected sibling pairs that one inherited A and the other inherited a from their heterozygous parent Aa 3) n2: Number of affected sibling pairs that both inherited a from their heterozygous parent Aa To illustrate the gLRTH_L function, suppose we have hypothetical genetic marker M1/M2 information from a sample of 1000 n = independent families, with M2 be the marker of interest.Each family has one marker homozygous (M1/M1) parent, one marker heterozygous parent (M1/M2) and two diseased children.Suppose we have 0 100 n = families with both sibling inherited M1 from their heterozygous parent (M1/M2), 1 650 n = families have one sibling inherited M2 and one sibling inherited M1 from their heterozygous parent (M1/M2), and 2 250 n = families have both siblings inherited M2 from their heterozygous parent (M1/M2).

Conclusions
The commonly used statistical methods in medical research often assume patients arise from one homogeneous population.However, the impact of heterogeneity is well known and has been document in much of the existing literature for common and complex diseases.Inadequate attention to the DOI: 10.4236/ojs.2018.83030X. X. Han, Y. Z. Shao heterogeneity inherent in the complexity of complex human disease could lead to increased number of false negatives and missed opportunities in research.To solve this problem, using finite mixture models to account for latent genetic heterogeneity is an intuitive strategy.However, there are well known difficulties associated with likelihood-based inference in the context of finite mixture due to issues regarding parameter identifiability and degenerate Fisher information [27].The mixture likelihood often has many local maximum values making the numerical maximization complicated.Moreover, the likelihood irregularities lead to great challenges in deriving the limit distribution of the LRT statistic under loss of identifiability.The strength of the proposed method is that we are able to derive closed form formula for the LRT statistic and its simple closed form asymptotic distribution despite the loss of identifiability in parameters in the context of mixture likelihood.This leads to efficient computation of the test statistic and its asymptotic p-values; and thus, it is suitable for high throughput data and genome-wide studies.The proposed method also works for a single marker or a few markers.There are a few existing methods for linkage analysis that account for latent heterogeneity [13] [23] [24] [25], but the existing methods are computationally expensive for NGS and genome-wide studies.
The rapid development of next generation whole-genome sequencing (WGS) has revived family-based linkage analysis for identification and characterization of functional variants.Our proposed LRT for linkage analysis under genetic heterogeneity will likely to be a powerful tool for genetic mapping of complex traits [20].In the era of precision medicine, using individual variations in genes and environment to develop diagnostics, prognostics, and therapies is the primary approach for disease prevention and treatment.For example, instead of using "one-size-fits-all-approach", "precision medicine" based on genetic markers can be used to optimize effectiveness of disease prevention and treatment as well as minimize side effects for persons less likely to respond to a particular therapeutic.Reliable disease associated SNPs could serve as predictive markers that inform our decisions about numerous aspects of medical care, including specific diseases, effectiveness of various drugs and adverse reactions to specific drugs.We believe that with the reduction in cost of whole-genome sequencing (WGS), genome-wide linkage analysis of family based WGS data as well as GWAS will facilitate the identification of causal variants and may contribute tremendously to the advancement of precision medicine.Our open source R package gLRTH is meant to be a valuable package to help researchers perform GWAS and genome-wide linkage analysis accounting for the ubiquitous genetic heterogeneity in common and complex human diseases without a lot of programming and computational burden.

Pearson's classic 2 χ
.4236/ojs.2018.83030X. X. Han, Y. Z. ShaoThe first term at the right-hand side of the last equality is equivalent to the statistic (via comparing observed to expected cell frequencies) for testing Hardy-Weinberg equilibrium which is know to have the 2 1

2 χ
. Han, Y. Z. Shao and the empirical power for LRT and 2 are shown in

Table 1 .
Genotype frequencies inherited from the heterozygous Aa parents for n affected thousand replicate dataset of n disease cases

Table 3 .
The simulation results indicate that the LRT has power advantage over the 2

Table 3 .
Empirical power (significant level is set at