For case-parents data, information from the offspring can be used to reduce the uncertainty in the parents' haplotypes. In this article we develop a likelihood ratio test to compare haplotype frequencies in the transmitted and non-transmitted groups. The maximum likelihood estimates of the haplotype frequencies for the family data are obtained via an expectation-maximization (EM) algorithm. The proposed method can handle haplotype uncertainty and missing data. Simulations show that the method is at least as powerful as, and often slightly more powerful than, TRANSMIT for testing association between haplotypes and traits. We also apply the method to detect the association between the Megsin gene and immunoglobulin A nephropathy.
In association studies, family-based tests of association are often used for fine mapping of a disease susceptibility locus, because they avoid false positive results caused by population stratification. The transmission disequilibrium test (TDT) proposed by Spielman is an association test for case-parents triad data [
Multiple linked markers can provide more polymorphism information than a single marker (especially for SNPs). However, haplotype phase is often uncertain for multi-locus genotypes: several haplotype pairs may be compatible with the observed genotype. Many haplotype reconstruction algorithms have been developed, e.g. the expectation-maximization (EM) algorithm [
TDT has been extended for tightly linked marker loci. Zhao et al. (2000) [
In this paper, based on the full likelihood, we propose a likelihood ratio test that integrates haplotype reconstruction for case-parent or case-parents data.
The key idea comes from the classical case-control association study. When a trait is associated with a marker, some alleles or haplotypes are found more often in the case group than in the control group. In case-parents data, the alleles or haplotypes of the case are transmitted from the parents, so we can test whether a particular allele or haplotype occurs more often in the transmitted (case) group than in the non-transmitted (control) group.
For tightly linked marker loci, we treat haplotypes as extended alleles and use the transmission information to reduce the phase uncertainty.
Let $H$ denote the number of haplotypes for $l$ tightly linked loci. The set of all possible $H(H+1)/2$ genotypes (haplotype pairs) is $G = \{1/1, 1/2, \cdots, 1/H, 2/2, 2/3, \cdots, 2/H, \cdots, (H-1)/H, H/H\}$.
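As a small illustration of the genotype set $G$, the following Python sketch enumerates the unordered haplotype pairs in the order listed above. The function name `genotype_set` is hypothetical and used only for this example.

```python
from itertools import combinations_with_replacement


def genotype_set(H):
    """List the H(H+1)/2 unordered haplotype pairs i/j with 1 <= i <= j <= H."""
    return [f"{i}/{j}" for i, j in combinations_with_replacement(range(1, H + 1), 2)]


# With H = 4 haplotypes there are 4 * 5 / 2 = 10 possible genotypes.
genotypes = genotype_set(4)
print(len(genotypes), genotypes)
```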
For a case-parents trio, the genotypes of the father, mother and affected child are denoted by $g_f$, $g_m$ and $g_c$, respectively. Let $F = (F_1, F_2, \cdots, F_H)$ and $F^* = (F_1^*, F_2^*, \cdots, F_H^*)$ denote the haplotype frequencies in the non-transmitted and transmitted groups, respectively, and let $R_{ij} = P(\text{Affected} \mid i/j) / P(\text{Affected} \mid H/H)$ denote the relative risk of genotype $i/j$ with reference to $H/H$. Under a multiplicative haplotype risk model, $R_{ij} = R_i R_j$, where $R_i = R_{iH}$ is the relative risk of haplotype $i$ with reference to haplotype $H$ (so that $R_H = 1$). Under the assumption of Hardy-Weinberg equilibrium, the probability that the father transmits haplotype $i$ and does not transmit $j$, while the mother transmits haplotype $k$ and does not transmit $l$, conditional on the child being affected, is
$$P(g_f = i/j,\ g_m = k/l,\ g_c = i/k \mid A) = F_i^* F_j F_k^* F_l, \tag{1}$$
where $A$ means that the child is affected, and $F_i^* = R_i F_i / \sum_{j=1}^{H} R_j F_j$ is regarded as the frequency of haplotype $i$ in the disease population.
However, we observe only the genotype at each locus. The haplotype phase of the $l$ tightly linked loci is often uncertain, especially when parental genotype data are missing, so it can be ambiguous which haplotype is transmitted and which is not transmitted from each parent. Then
$$P(g_f, g_m, g_c \mid A) = \sum_{(i,j,k,l) \in \tilde{G}} F_i^* F_j F_k^* F_l, \tag{2}$$
where $\tilde{G}$ is the set of haplotype configurations $(i, j, k, l)$ for which the haplotype pairs $(i, j)$, $(k, l)$ and $(i, k)$ are compatible with $g_f$, $g_m$ and $g_c$, respectively. Missing parental genotypes are allowed.
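The set of compatible configurations can be built by brute-force enumeration when the number of haplotypes is small. The Python sketch below is only an illustration under stated assumptions: haplotypes are written as strings of alleles, an unphased genotype is a tuple with one two-letter string per locus, `None` marks a missing genotype, and indices are 0-based. The names `compatible` and `compatible_configs` are hypothetical.

```python
from itertools import product


def compatible(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) can produce the unphased genotype.

    A genotype of None (or a locus of None) is treated as missing and
    therefore compatible with anything.
    """
    if genotype is None:
        return True
    for a, b, locus in zip(h1, h2, genotype):
        if locus is not None and sorted((a, b)) != sorted(locus):
            return False
    return True


def compatible_configs(haplotypes, g_f, g_m, g_c):
    """Enumerate (i, j, k, l): father i/j transmits i, mother k/l transmits k."""
    configs = []
    for i, j, k, l in product(range(len(haplotypes)), repeat=4):
        hi, hj, hk, hl = (haplotypes[x] for x in (i, j, k, l))
        if compatible(hi, hj, g_f) and compatible(hk, hl, g_m) and compatible(hi, hk, g_c):
            configs.append((i, j, k, l))
    return configs


# Two SNP loci with alleles C/T give four haplotypes (0-based indices 0..3).
HAPS = ["CC", "CT", "TC", "TT"]
father = ("CT", "CT")   # heterozygous at both loci, so phase is ambiguous
mother = ("CC", "TT")
child = ("CC", "TT")

full_trio = compatible_configs(HAPS, father, mother, child)
father_missing = compatible_configs(HAPS, None, mother, child)
print(full_trio, len(father_missing))
```

In this toy trio the child's genotype resolves the father's phase, so only one configuration remains; when the father's genotype is missing, his non-transmitted haplotype is unconstrained and the set grows accordingly.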
Suppose there are $N$ case-parents trios, and hence $2N$ parents in all. The genotypes for the $r$-th trio are $g_{rf}, g_{rm}, g_{rc}$. The log-likelihood is
$$\ln L(F, F^*) = \sum_{r=1}^{N} \ln P(g_{rf}, g_{rm}, g_{rc} \mid A) = \sum_{r=1}^{N} \ln \Big[ \sum_{(i_r, j_r, k_r, l_r) \in \tilde{G}_r} F_{i_r}^* F_{j_r} F_{k_r}^* F_{l_r} \Big]. \tag{3}$$
It is difficult to find the maximum likelihood estimate (MLE) $(\hat{F}, \hat{F}^*)$ directly. We employ the expectation-maximization (EM) algorithm to estimate the haplotype frequencies $(\hat{F}, \hat{F}^*)$, treating the underlying haplotype pairs as “missing data”. The complete-data log-likelihood after adding the missing data $\{(i_1, j_1, k_1, l_1), \cdots, (i_N, j_N, k_N, l_N)\}$ is
$$\ln L_c(F, F^*) = \sum_{r=1}^{N} \ln \big( F_{i_r}^* F_{j_r} F_{k_r}^* F_{l_r} \big). \tag{4}$$
We can show that the expected complete-data log-likelihood $Q(F, F^*)$ in the E (expectation) step is (see Appendix)
$$Q(F, F^*) = E[\ln L_c(F, F^*)] = \sum_{r=1}^{N} \sum_{i,j,k,l=1}^{H} w_r(i, j, k, l) \ln \big( F_i^* F_j F_k^* F_l \big), \tag{5}$$
where
$$w_r(i, j, k, l) = \begin{cases} \dfrac{F_i^* F_j F_k^* F_l}{\sum_{(i', j', k', l') \in \tilde{G}_r} F_{i'}^* F_{j'} F_{k'}^* F_{l'}}, & (i, j, k, l) \in \tilde{G}_r, \\[2ex] 0, & (i, j, k, l) \notin \tilde{G}_r. \end{cases}$$
An iterative procedure can be used to find the MLE via the EM algorithm. Given the current estimates $(F^{(t)}, F^{*(t)})$, the estimates in the next step are
$$\begin{cases} F_h^{(t+1)} = \dfrac{1}{2N} \sum_{r=1}^{N} \sum_{i,j,k,l=1}^{H} \big[ w_r^{(t)}(i, h, k, l) + w_r^{(t)}(i, j, k, h) \big], \\[2ex] F_h^{*(t+1)} = \dfrac{1}{2N} \sum_{r=1}^{N} \sum_{i,j,k,l=1}^{H} \big[ w_r^{(t)}(h, j, k, l) + w_r^{(t)}(i, j, h, l) \big], \end{cases} \tag{6}$$
for $h = 1, 2, \cdots, H$, where in each term the sum is taken over the three indices that are not fixed at $h$. Under the null hypothesis $F = F^*$, we have
$$F_h^{(t+1)} = F_h^{*(t+1)} = \frac{1}{4N} \sum_{r=1}^{N} \sum_{i,j,k,l=1}^{H} \big[ w_r^{(t)}(h, j, k, l) + w_r^{(t)}(i, h, k, l) + w_r^{(t)}(i, j, h, l) + w_r^{(t)}(i, j, k, h) \big]. \tag{7}$$
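The updates (6) and (7) amount to counting, for each trio, the posterior-weighted transmitted and non-transmitted haplotypes. The following Python sketch is a minimal illustration, not the authors' implementation. It assumes that each trio is supplied as its list of compatible configurations $(i, j, k, l)$ with 0-based haplotype indices, that the iteration starts from uniform frequencies, and that it stops when the largest change in any frequency falls below a tolerance (the stopping rule is an assumption). The names `em_step`, `fit` and `log_likelihood` are hypothetical.

```python
import math


def em_step(config_sets, F, F_star, null=False):
    """One EM update: posterior weights as in (5), then counts as in (6) or (7)."""
    H, N = len(F), len(config_sets)
    non_trans = [0.0] * H   # expected counts of non-transmitted haplotypes
    trans = [0.0] * H       # expected counts of transmitted haplotypes
    for configs in config_sets:
        probs = [F_star[i] * F[j] * F_star[k] * F[l] for i, j, k, l in configs]
        total = sum(probs)
        for (i, j, k, l), p in zip(configs, probs):
            w = p / total   # weight w_r(i, j, k, l) of this configuration
            trans[i] += w
            trans[k] += w
            non_trans[j] += w
            non_trans[l] += w
    if null:
        # Null hypothesis F = F*: pool all 4N parental haplotypes, as in (7).
        pooled = [(a + b) / (4 * N) for a, b in zip(non_trans, trans)]
        return pooled, list(pooled)
    return [c / (2 * N) for c in non_trans], [c / (2 * N) for c in trans]


def fit(config_sets, H, null=False, tol=1e-6, max_iter=1000):
    """Iterate em_step from uniform starting values until the change is below tol."""
    F, F_star = [1.0 / H] * H, [1.0 / H] * H
    for _ in range(max_iter):
        new_F, new_F_star = em_step(config_sets, F, F_star, null)
        change = max(abs(a - b) for a, b in zip(new_F + new_F_star, F + F_star))
        F, F_star = new_F, new_F_star
        if change < tol:
            break
    return F, F_star


def log_likelihood(config_sets, F, F_star):
    """Observed-data log-likelihood (3)."""
    return sum(
        math.log(sum(F_star[i] * F[j] * F_star[k] * F[l] for i, j, k, l in configs))
        for configs in config_sets
    )


# Toy data (hypothetical): five trios, two haplotypes, one trio with ambiguous phase.
trios = [[(0, 1, 0, 1)]] * 3 + [[(1, 1, 0, 0)], [(0, 1, 1, 0), (1, 0, 0, 1)]]
F_hat, F_star_hat = fit(trios, H=2)
F_null, _ = fit(trios, H=2, null=True)
lrt = 2 * (log_likelihood(trios, F_hat, F_star_hat) - log_likelihood(trios, F_null, F_null))
print(F_hat, F_star_hat, F_null, lrt)
```

Keeping the null and alternative fits in one routine makes it straightforward to evaluate both log-likelihoods on the same configuration sets when forming a likelihood ratio.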
Let $L_0(\hat{F})$ denote the likelihood under the null hypothesis $F = F^*$, or equivalently $R = (1, 1, \cdots, 1)$. The likelihood ratio statistic
$$\Lambda = 2 \ln L(\hat{F}, \hat{F}^*) - 2 \ln L_0(\hat{F}) \tag{8}$$
follows an asymptotic $\chi^2$ distribution with $H - 1$ degrees of freedom (df) when the null hypothesis is true.
From the relationship between $R = (R_1, R_2, \cdots, R_H)$ and $(F, F^*)$,
$$\frac{F_H^*}{F_H} = \frac{R_H}{\sum_{j=1}^{H} R_j F_j} = \frac{1}{\sum_{j=1}^{H} R_j F_j}, \qquad \frac{F_i^*}{F_i} = \frac{R_i}{\sum_{j=1}^{H} R_j F_j}, \quad i = 1, 2, \cdots, H - 1,$$
we get
$$R_i = \frac{F_i^* / F_i}{F_H^* / F_H}, \quad i = 1, 2, \cdots, H. \tag{9}$$
The maximum likelihood estimator (MLE) of the relative risk $R_i$ for haplotype $i$ relative to haplotype $H$ is therefore
$$\hat{R}_i = \frac{\hat{F}_i^* / \hat{F}_i}{\hat{F}_H^* / \hat{F}_H}. \tag{10}$$
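The relationship between the relative risks and the two sets of frequencies can be checked numerically with a round trip: compute the transmitted frequencies from chosen values of $F$ and $R$, then recover $R$ with (9). The Python sketch below does this; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
def transmitted_freqs(F, R):
    """F*_i = R_i F_i / sum_j R_j F_j (haplotype frequencies among cases)."""
    denom = sum(r * f for r, f in zip(R, F))
    return [r * f / denom for r, f in zip(R, F)]


def relative_risks(F, F_star):
    """R_i = (F*_i / F_i) / (F*_H / F_H), taking the last haplotype as reference."""
    ref = F_star[-1] / F[-1]
    return [(fs / f) / ref for f, fs in zip(F, F_star)]


# Hypothetical frequencies and risks for H = 4 haplotypes (reference risk R_H = 1).
F = [0.1, 0.2, 0.3, 0.4]
R = [2.0, 1.5, 0.8, 1.0]

F_star = transmitted_freqs(F, R)
R_back = relative_risks(F, F_star)
print(F_star, R_back)
```

Because the normalizing sum cancels in the ratio, the recovered risks do not depend on the prevalence-related denominator, which is why only the two frequency vectors are needed in (10).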
We apply our method to the published data which was used for family-based association analysis for immunoglobulin A nephropathy (IgAN) [
There are 4 haplotypes for Megsin C2093T-C2180T: CC, CT, TC and TT, coded as 1, 2, 3 and 4. With initial values $F^{(0)} = F^{*(0)} = (0.25, 0.25, 0.25, 0.25)$ and precision $\varepsilon = 10^{-6}$, the haplotype frequency estimates when the iterative procedure stops are
$$\hat{F} = (0.078118, 0.508773, 0.245039, 0.168070), \qquad \hat{F}^* = (0.055631, 0.638335, 0.181438, 0.124596).$$
The log-likelihood is $\ln L(\hat{F}, \hat{F}^*) = -904.475$. For the null model, the haplotype frequency estimates are
$$\hat{F} = \hat{F}^* = (0.065499, 0.579534, 0.211046, 0.143921),$$
and the log-likelihood is $\ln L_0 = -911.554$. So the likelihood ratio statistic is
$\Lambda = 2 \times (911.554 - 904.475) = 14.158$, df = 3, P = 0.0027. The results show a significant difference between $F$ and $F^*$, the non-transmitted and transmitted haplotype frequencies. $\hat{F}_2^*$ is much higher than $\hat{F}_2$, which means that haplotype 2 (2093C-2180T) is over-transmitted from parents to cases. In addition, with reference to haplotype 4 (2093T-2180T), the estimated relative haplotype risks are $\hat{R} = (0.961, 1.692, 0.999, 1)$.

C2093T trios + C2180T trios | C2093T trios + C2180T SPF | C2093T SPF + C2180T trios | C2093T SPF + C2180T SPF | Total |
---|---|---|---|---|
125 | 25 | 26 | 56 | 232 |

SPF: single parent family.
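The likelihood ratio statistic and its P value can be recomputed from the two reported log-likelihoods. The Python sketch below uses only the standard library; for 3 degrees of freedom the upper-tail probability of the chi-square distribution has the closed form $\operatorname{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$, which is used here in place of a statistics package. The function name `chi2_sf_3df` is hypothetical.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


log_lik_alt = -904.475    # ln L at the unrestricted estimates, as reported
log_lik_null = -911.554   # ln L0 under the null model, as reported

lrt = 2.0 * (log_lik_alt - log_lik_null)
p_value = chi2_sf_3df(lrt)
print(round(lrt, 3), round(p_value, 4))
```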
In our simulations, genotype data at two tightly linked single nucleotide polymorphism (SNP) markers are generated for 100 case-parents trios. In a similar way in Morris et al. (1997) [
$$\begin{aligned} P(g_c = si/uk,\ g_f = si/tj,\ g_m = uk/vl \mid A) &= \frac{P(g_c = si/uk,\ g_f = si/tj,\ g_m = uk/vl,\ A)}{P(A)} \\ &= \frac{P(A \mid g_c = si/uk,\ g_f = si/tj,\ g_m = uk/vl)\, P(g_c = si/uk,\ g_f = si/tj,\ g_m = uk/vl)}{K_p} \\ &= \frac{P(A \mid g_c = s/u)}{K_p}\, h_{si} h_{tj} h_{uk} h_{vl} \\ &= \Big( \frac{f_{su}}{K_p}\, h_{si} h_{uk} \Big) \cdot \big( h_{tj} h_{vl} \big), \end{aligned}$$
where $s, t, u, v \in \{D, d\}$, $i, j, k, l \in \{1, 2, \cdots, H\}$, $f_{su} = P(A \mid s/u)$ is the penetrance for genotype $s/u$, and $K_p = P(A)$ is the prevalence.
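The prevalence $K_p$ that normalizes this probability can be computed from the disease allele frequency and the penetrances. The Python sketch below assumes Hardy-Weinberg equilibrium at the disease locus with disease allele frequency $q$, so that $K_p = q^2 f_{DD} + 2q(1-q) f_{Dd} + (1-q)^2 f_{dd}$; this assumption and the function names are introduced for illustration. The parameter values are those of the three heredity modes used in the simulations.

```python
def prevalence(q, f_DD, f_Dd, f_dd):
    """K_p = P(A) under Hardy-Weinberg equilibrium at the disease locus (assumed)."""
    p = 1.0 - q
    return q * q * f_DD + 2.0 * q * p * f_Dd + p * p * f_dd


def case_genotype_probs(q, f_DD, f_Dd, f_dd):
    """Disease-locus genotype distribution among affected children, f_su q_s q_u / K_p."""
    p = 1.0 - q
    k_p = prevalence(q, f_DD, f_Dd, f_dd)
    return {
        "DD": q * q * f_DD / k_p,
        "Dd": 2.0 * q * p * f_Dd / k_p,
        "dd": p * p * f_dd / k_p,
    }


# (q, f_DD, f_Dd, f_dd) for the simulated heredity modes.
MODELS = {
    "Recessive": (0.0001, 1.0, 0.0, 0.0),
    "Dominant": (0.0001, 1.0, 1.0, 0.0),
    "Common": (0.2, 0.02, 0.005, 0.001),
}

for name, params in MODELS.items():
    print(name, prevalence(*params), case_genotype_probs(*params))
```

Dividing by $K_p$ is what makes the case genotype probabilities sum to one, which is the property a trio sampler built on the formula above would rely on.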
The frequencies of mutant disease allele D and normal allele d in a disease locus are denoted by q 1 = q and
where
equilibrium.
i.e. marker haplotype 1 is positively associated with the disease, and the other haplotypes are equally negatively associated, if there is association. In addition, the marker haplotypes are assumed to be equally frequent, and
and common models, specified heredity modes are considered. These models are shown in
For each genetic model, 5000 replicated samples were generated to evaluate the distribution of test statistics in the case of no association, i.e. e = 1. The quantile-quantile (QQ) plots for test statistic
Heredity mode | q | Penetrance fDD | Penetrance fDd | Penetrance fdd |
---|---|---|---|---|
Recessive | 0.0001 | 1 | 0 | 0 |
Dominant | 0.0001 | 1 | 1 | 0 |
Common | 0.2 | 0.02 | 0.005 | 0.001 |
plots show that the scatters of the observed quantiles and the expected quantiles of distribution
In addition, 1000 replicated samples were generated for statistical power analysis, at significance levels α = 0.05 and 0.01. The empirical power of our proposed method, in comparison with TRANSMIT, is summarized in
e | Recessive: Proposed | Recessive: TRANSMIT | Dominant: Proposed | Dominant: TRANSMIT | Common: Proposed | Common: TRANSMIT |
---|---|---|---|---|---|---|
α = 0.05: | | | | | | |
1.00 | 0.0440 | 0.0434 | 0.0474 | 0.0462 | 0.0514 | 0.0486 |
1.25 | 0.184 | 0.173 | 0.073 | 0.071 | 0.058 | 0.052 |
1.50 | 0.557 | 0.539 | 0.187 | 0.179 | 0.134 | 0.132 |
1.75 | 0.914 | 0.912 | 0.348 | 0.347 | 0.250 | 0.240 |
2.00 | 0.994 | 0.990 | 0.570 | 0.557 | 0.430 | 0.410 |
2.25 | 1 | 1 | 0.777 | 0.765 | 0.607 | 0.596 |
2.50 | 1 | 1 | 0.893 | 0.885 | 0.762 | 0.746 |
2.75 | 1 | 1 | 0.978 | 0.978 | 0.881 | 0.873 |
3.00 | 1 | 1 | 0.998 | 0.997 | 0.952 | 0.949 |
3.25 | 1 | 1 | 1 | 1 | 0.989 | 0.987 |
3.50 | 1 | 1 | 1 | 1 | 0.998 | 0.996 |
3.75 | 1 | 1 | 1 | 1 | 1 | 1 |
4.00 | 1 | 1 | 1 | 1 | 1 | 1 |
α = 0.01: | | | | | | |
1.00 | 0.0086 | 0.0082 | 0.0092 | 0.0082 | 0.0100 | 0.0094 |
1.25 | 0.055 | 0.049 | 0.019 | 0.015 | 0.015 | 0.015 |
1.50 | 0.315 | 0.300 | 0.055 | 0.050 | 0.043 | 0.037 |
1.75 | 0.790 | 0.775 | 0.160 | 0.147 | 0.086 | 0.082 |
2.00 | 0.966 | 0.963 | 0.321 | 0.304 | 0.221 | 0.204 |
2.25 | 1 | 0.999 | 0.573 | 0.548 | 0.394 | 0.363 |
2.50 | 1 | 1 | 0.749 | 0.741 | 0.545 | 0.525 |
2.75 | 1 | 1 | 0.916 | 0.906 | 0.720 | 0.705 |
3.00 | 1 | 1 | 0.977 | 0.976 | 0.879 | 0.868 |
3.25 | 1 | 1 | 0.994 | 0.991 | 0.946 | 0.936 |
3.50 | 1 | 1 | 1 | 1 | 0.985 | 0.979 |
3.75 | 1 | 1 | 1 | 1 | 0.993 | 0.992 |
4.00 | 1 | 1 | 1 | 1 | 0.997 | 0.996 |
Haplotype frequencies are usually estimated when haplotypes are reconstructed or linkage disequilibrium is tested. For tightly linked loci, we gave the likelihood for case-parents data as a function of the haplotype frequencies in the transmitted and non-transmitted groups, and estimated the MLEs of these frequencies via an EM algorithm. The results show that the haplotype frequencies can be estimated with a simple iterative procedure. A likelihood ratio test comparing haplotype frequencies in the transmitted and non-transmitted groups was used to detect association. When parental information is not available in the nuclear family, the classical TDT is no longer suitable. However, as the application shows, missing parental genotypes are allowed in our method. In addition, when siblings are available, their information can be used to reduce phase uncertainty, and the likelihood can be written similarly.
Under different simulated conditions, in which the heredity mode and the linkage disequilibrium coefficient were specified, 5000 and 1000 replicated samples were generated to evaluate the distribution of the test statistic and the statistical power, respectively. In these simulations the empirical power of our method was equal to or slightly higher than that of TRANSMIT in every scenario considered.
The authors declare no conflicts of interest regarding the publication of this paper.
Li, C.X. and Li, P.X. (2018) Haplotype Frequency Comparison for Case-Parents Data. Open Journal of Statistics, 8, 721-730. https://doi.org/10.4236/ojs.2018.84047
Given the current estimate
where
And then the expected complete-data log-likelihood