1. Introduction
The commonly used statistical methods in medical research generally assume that patients in the study arise from one homogeneous population. However, the existence and importance of significant heterogeneity are well known and have been documented in literature for many diseases, including Alzheimer’s disease [1] [2] , asthma [3] [4] , diabetes [5] [6] , and multiple cancer types [7] [8] [9] [10] . These common and complex human diseases usually have non-unique disease etiology, which also frequently involve interplay of multiple genetic and environmental factors, leading to latent population substructure [11] [12] [13] . Therefore, it is common that the patient population of a complex disease consists of various latent subpopulations, each with disease caused by mutations at different loci. Yet each of the unobservable subgroups is relatively homogeneous in etiology or diagnosis.
The genome-wide association study (GWAS) and linkage analysis are two classical approaches for studying human genetic disorders. GWAS is an experimental design (typically case-control) used to detect associations between genetic variants and diseases/traits from a study population [14] . The ultimate goal of the population-based GWAS is to assist researchers to have a better understanding of the biology of the disease and develop better prevention or treatment strategies for common and complex diseases. However, the standard GWAS analysis methods ignore the widely existing genetic heterogeneity. To account for latent genetic heterogeneity in GWAS, Qian and Shao [15] recently developed a novel likelihood ratio test under genetic heterogeneity (LRT-H). This methods has been shown to have superior power advantage over the commonly used Cochran-Armitage trend test (CATT) in GWAS for complex diseases where genetic heterogeneity commonly exists [15] [16] .
Linkage analysis is a commonly used method to identify statistical association between the inheritance of a human disease and inheritance of marker loci before the era of GWAS. In the last two decades, linkage based gene mapping has been marginalized by the population-based genome-wide association study. Association analysis uses common variants and allows for finer mapping than linkage analysis in general. However, one major problem for association study is population stratification, which can lead to increased number of false negative as well as false positive findings if latent heterogeneity is not properly controlled for [17] . Yet this is not a concern for family-based linkage analysis, as children’s genotypes only depend on their parents but not on the population genotype frequencies [18] [19] . Recent advancement in next generation sequencing (NGS) has made it technologically feasible and financially affordable to determine mutation profiles for families. Linkage analysis again becomes important to identify causal variants using family-based deep sequencing data. Ott et al. [20] and Shao [21] presented reviews of genetic linkage analysis in the age of NGS.
For marker alleles that are associated with inheritance of complex disease, it is not uncommon that the transmission probabilities of a marker allele of interest vary across heterozygous parents, due to locus heterogeneity, etiologic heterogeneity, and many other complexities and/or combinations of them [11] . For example, breast cancer as a complex disease, is well known to be heterogeneous. Some cases of breast cancer are due to the inherited mutations in BRCA1/2 in some families [7] , while in other families due to mutations in other genes (e.g. PTEN) [22] . These genetic heterogeneities are often not directly observable from linkage data or GWAS data. The current available genetic linkage methods that account for latent genetic heterogeneity are based on mixture models and generally are computational expensive for genome-wide or NGS data [13] [23] [24] [25] , yet ignoring heterogeneity can cause loss of efficiency in statistical test with increased numbers of false negative findings or missed opportunities.
In the era of whole genome sequencing, it is important to have statistical tests that are 1) computationally efficient even for genome-wide data, 2) robust under genetic heterogeneity and 3) statistically powerful. Motivated by the Qian and Shao’s [15] LRT-H for GWAS, in this paper we propose a powerful and computational efficient likelihood ratio test under genetic heterogeneity for linkage analysis based on a binomial mixture model, using family data with parental marker genotypes and genotypes of two affected siblings. We have developed an R package gLRTH to implement the newly proposed LRT for genome-wide linkage analysis under genetic heterogeneity as well as Qian and Shao’s [15] LRT-H for GWAS. The package is freely available on CRAN. The purpose of this R package is to simplify the application of these two methods for non-specialists. The rest of paper is organized as follows. In Section 2, we introduce the LRT for linkage analysis under genetic heterogeneity. We derive the closed-form test statistic and provide explicit asymptotic null distribution that simplify the computations for p-values. In Section 3, we present numerical simulation studies for type I error and power analysis. In Section 4, we describe the R functions and their arguments. The paper is concluded in Section 5.
2. Methods
Genetic markers can have multiple alleles. In next generation sequencing (NGS), GWAS and other genome-wide studies, markers with two alleles are most common. Thus, without much loss of generality, we focus on markers with two alleles. Here we consider a binary trait and focus on detecting linkage under genetic heterogeneity at a single marker locus with two alleles A and a. We consider independent families each with one marke-homozygous (AA) parent, one marker-heterozygous parent (Aa) and two diseased children. Let X denote the total number of allele a inherited by the two affected children from their heterozygous parent (Aa). Then X has a binomial distribution
,
(1)
where
and
is the transmission probability for the marker-heterogeneous patient to pass allele a to a child.
Under the null hypothesis H0 of no linkage between the marker and any disease-causing loci,
for all families, i.e.
. One can test linkage by detecting distribution departure from the null
:
(2)
However, transmission heterogeneity, i.e., variations among
generally exists in complex diseases. For example, any combination of the complexities listed in Lander and Schork [12] can result in transmission heterogeneity. Thus, under transmission heterogeneity, we assume X, the number of allele a, follows a binomial mixture distribution in the population, that is
(3)
where
, and
if and only if
. In particular, for many of the complex diseases with transmission heterogeneity, it is likely that J is quite large. Since it is hard to know the exact number of the sub-populations J under transmission heterogeneity, it is desirable to have a new test that is applicable without knowing the exact value of J while allowing
.
Suppose n independent families each with one marker homozygous (AA) parent, one marker heterozygous parent (Aa) and two diseased children are sampled from the population. For each locus, the observed genotype frequencies inherited from the heterozygous Aa parent in the two diseased children are summarized in the first row in Table 1. Under H0, the expected genotype frequencies are summarized in the second row in Table 1.
2.1. Mixture Binomial and Maximum Likelihood
Assuming the setup in the previous subsection and using notations in Table 1, the maximum likelihood estimator (MLE) of
under the binomial likelihood in Equation (1) is
(4)
Thus, the binomial likelihood in Equation (1) evaluated at
is
(5)
Table 1. Genotype frequencies inherited from the heterozygous Aa parents for n affected sibling pairs.
where
.
Under H0,
, the binomial likelihood value is
(6)
The maximum of the mixture likelihood for X in Equation (3) has an explicit formula [15] , that is
(7)
where
is defined in Equation (4).
2.2. The Likelihood Ratio Test
Using the maximum of the likelihood
,
and
, respectively, we can write down the explicit formula of the log-LRT statistic
as follows,
(8)
Equation (8) can be written as following
(9)
First, we may consider a classic problem for testing
against
. The LRT statistic is well known to have a
distribution.
(10)
where
is the MLE of
defined in Equation (4).
When
, we have
. Thus,
(11)
Therefore, when
, we have
.
When
, we can consider testing of goodness-of-fit of
. The LRT statistic has a
asymptotic distribution and can be written as
(12)
The first term at the right-hand side of the last equality is equivalent to the Pearson’s classic
statistic (via comparing observed to expected cell frequencies) for testing Hardy-Weinberg equilibrium which is know to have the
distribution. Note that the two terms in the right hand side of equation (12) are well known to be asymptotically independent. Therefore, when
, we have
(13)
It is easy to show that
as
as in Qian and Shao [15] . Thus, we obtain the explicit form of asymptotic distribution under the null hypothesis. That is, under H0,
(14)
Importantly, to implement the LRT, there is no need to identify the exact number of mixture components J in equation (3).
3. Simulations
3.1. Type I Errors
As the LRT
has an explicit asymptotic distribution under H0, it is convenient to evaluate p-value and type I error. We conducted simulations to compare the empirical type I error of the LRT to the nominal significant level ranging from 10−2 to 10−8. The genotype data were generated from binomial distribution
. The simulation was replicated 1011 times. As shown in Table 2, the empirical type I error is slightly smaller than the nominal level, but they are very close to each other. Therefore, using the asymptotic null distribution for the LRT is valid. The closed form asymptotic distribution allows easy determination of the asymptotic p-values.
3.2. Power Comparison
In the simulation studies for power comparison, the sample was generated from a two-component mixture binomial distribution as described in equation (3) with
, i.e.,
One hundred-thousand replicate dataset of n disease cases
were simulated for each of the seven simulation setup
Table 2. Empirical type I error and nominal significant level at
and
with 1011 replications.
and the empirical power for LRT and
are shown in Table 3. The simulation results indicate that the LRT has power advantage over the
test under genetic heterogeneity.
4. The R Package Description and Examples
The gLRTH R package is available on CRAN and the installation is standard. The purpose of this package is to implement the previously discussed two methods, i.e., LRT for genome-wide linkage analysis under genetic heterogeneity and Qian and Shao’s LRT-H for GWAS [15] . The gLRTH R package is composed of two main functions: gLRTH_L for linkage analysis under heterogeneity and gLRTH_A for association studies.
The gLRTH_L function calculates the test statistic and asymptotic p-value for the likelihood ratio test for testing linkage. The gLRTH_L function in the package can be called with the following syntax:
gLRTH_L(n0, n1, n2)
The required arguments are:
1) n0: Number of affected sibling pairs that both inherited A from their heterozygous parent Aa
2) n1: Number of affected sibling pairs that one inherited A and the other inherited a from their heterozygous parent Aa
3) n2: Number of affected sibling pairs that both inherited a from their heterozygous parent Aa
To illustrate the gLRTH_L function, suppose we have hypothetical genetic marker M1/M2 information from a sample of
independent families, with M2 be the marker of interest. Each family has one marker homozygous (M1/M1) parent, one marker heterozygous parent (M1/M2) and two diseased children. Suppose we have
families with both sibling inherited M1 from their heterozygous parent (M1/M2),
families have one sibling inherited M2 and one sibling inherited M1 from their heterozygous parent (M1/M2), and
families have both siblings inherited M2 from their heterozygous parent (M1/M2).
Table 3. Empirical power (significant level is set at
) when X has a mixture distribution with
.
The LRT for linkage under transmission heterogeneity for this genetic marker can be done as following:
gLRTH_L(n0 = 100, n1 = 650, n2 = 250)
The output is:
$test.stat
[1] 45.17029
$pval
[1] 8.672153e-11
In this example, for linkage analysis under transmission heterogeneity of marker M1/M2, the test statistic is 24.04 and the p-value is
.
The gLRTH_A function calculates the test statistic and asymptotic p-value for the likelihood ratio test for GWAS. The gLRTH_A function in the package can be called with the following syntax:
gLRTH_A(n0, n1, n2, m0, m1, m2)
The required arguments are:
1) n0: AA genotype frequency in case
2) n1: Aa genotype frequency in case
3) n2: aa genotype frequency in case
4) m0: AA genotype frequency in control
5) m1: Aa genotype frequency in control
6) m2: aa genotype frequency in control
To illustrate the gLRTH_A function, we consider a SNP called SNP rs429358 in gene Apolipoprotein E (ApoE), which is a well-known common variants that is associated with late-onset Alzheimer’s diseases (AD). We use APOE ò4 variants frequency in Han et al. [26] to determine SNP rs429358 in AD converters and AD non-converters. The LRT-H for SNP rs429358 can be done as following:
gLRTH_A(n0 = 89, n1 = 139, n2 = 47, m0 = 266, m1 = 153, m2 = 39)
The output is:
$test.stat
[1] 46.02864
$pval
[1] 5.640675e-11
In this example, for SNP rs429358, the test statistic is 46.02 with a p-value
. We conclude that SNP rs429358 reaches genome-wide significance
.
5. Conclusions
The commonly used statistical methods in medical research often assume patients arise from one homogeneous population. However, the impact of heterogeneity is well known and has been document in much of the existing literature for common and complex diseases. Inadequate attention to the heterogeneity inherent in the complexity of complex human disease could lead to increased number of false negatives and missed opportunities in research. To solve this problem, using finite mixture models to account for latent genetic heterogeneity is an intuitive strategy. However, there are well known difficulties associated with likelihood-based inference in the context of finite mixture due to issues regarding parameter identifiability and degenerate Fisher information [27] . The mixture likelihood often has many local maximum values making the numerical maximization complicated. Moreover, the likelihood irregularities lead to great challenges in deriving the limit distribution of the LRT statistic under loss of identifiability. The strength of the proposed method is that we are able to derive closed form formula for the LRT statistic and its simple closed form asymptotic distribution despite the loss of identifiability in parameters in the context of mixture likelihood. This leads to efficient computation of the test statistic and its asymptotic p-values; and thus, it is suitable for high throughput data and genome-wide studies. The proposed method also works for a single marker or a few markers. There are a few existing methods for linkage analysis that account for latent heterogeneity [13] [23] [24] [25] , but the existing methods are computationally expensive for NGS and genome-wide studies.
The rapid development of next generation whole-genome sequencing (WGS) has revived family-based linkage analysis for identification and characterization of functional variants. Our proposed LRT for linkage analysis under genetic heterogeneity will likely to be a powerful tool for genetic mapping of complex traits [20] . In the era of precision medicine, using individual variations in genes and environment to develop diagnostics, prognostics, and therapies is the primary approach for disease prevention and treatment. For example, instead of using “one-size-fits-all-approach”, “precision medicine” based on genetic markers can be used to optimize effectiveness of disease prevention and treatment as well as minimize side effects for persons less likely to respond to a particular therapeutic. Reliable disease associated SNPs could serve as predictive markers that inform our decisions about numerous aspects of medical care, including specific diseases, effectiveness of various drugs and adverse reactions to specific drugs. We believe that with the reduction in cost of whole-genome sequencing (WGS), genome-wide linkage analysis of family based WGS data as well as GWAS will facilitate the identification of causal variants and may contribute tremendously to the advancement of precision medicine. Our open source R package gLRTH is meant to be a valuable package to help researchers perform GWAS and genome-wide linkage analysis accounting for the ubiquitous genetic heterogeneity in common and complex human diseases without a lot of programming and computational burden.
Acknowledgements
This research was partially supported by a pilot project from NYU Alzheimers Disease Center under the NIH/NIA grant P30 AG08051.