Haplotype Frequency Comparison for Case-Parents Data

For case-parents data, information from the offspring can be used to reduce the uncertainty of the parents' haplotypes. In this article we develop a likelihood ratio test to compare haplotype frequencies in the transmitted and non-transmitted groups. The maximum likelihood estimates of the haplotype frequencies for the family data are obtained via the expectation-maximization (EM) algorithm. The proposed method can handle haplotype uncertainty and missing data. Simulations show that the method is more powerful than TRANSMIT for testing association between haplotypes and traits. We also apply the method to detect association between the Megsin gene and immunoglobulin A nephropathy.


Introduction
In association studies, family-based tests of association are often used for fine mapping of a disease susceptibility locus, because they avoid false positive results caused by population stratification. The transmission disequilibrium test (TDT) proposed by Spielman is an association test for case-parents triad data [1] [2]. The classic TDT is McNemar's test on a two-way transmission/non-transmission table.
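As a minimal sketch of the McNemar-type TDT for a single biallelic marker, the statistic can be computed from the two off-diagonal counts of the transmission/non-transmission table. The function name `tdt_statistic` and the counts used in the example are hypothetical and are not taken from any data set in this article.

```python
import math


def tdt_statistic(b: int, c: int) -> tuple[float, float]:
    """McNemar-type TDT for a biallelic marker.

    b: heterozygous parents who transmit allele 1 and not allele 2
    c: heterozygous parents who transmit allele 2 and not allele 1
    Returns the chi-square statistic (1 df) and its p-value.
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts, for illustration only.
stat, p_value = tdt_statistic(b=60, c=40)  # stat = 4.0
```

Only heterozygous parents contribute to the statistic, which is why homozygous parents do not appear in the counts.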
Multiple linked markers can provide more polymorphism information than a single marker (especially for SNPs). However, haplotype phase is often uncertain for multi-locus genotypes: several haplotype pairs may be compatible with an observed genotype. Many haplotype reconstruction algorithms have been developed for population data of unrelated individuals, e.g. the expectation-maximization (EM) algorithm [3], pseudo MCMC [4] and Bayesian haplotype inference [5].
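The phase ambiguity can be illustrated by enumerating the haplotype pairs compatible with an unphased genotype. The sketch below assumes biallelic SNPs with alleles coded 0/1 and a genotype coded as the count of allele 1 at each locus (`None` for a missing locus). This coding and the name `compatible_pairs` are choices made for the example.

```python
from itertools import product


def compatible_pairs(genotype):
    """List unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple with one entry per SNP giving the count (0, 1 or 2)
    of allele 1, or None when that locus is missing.
    Haplotypes are tuples of 0/1 alleles.
    """
    haplotypes = list(product((0, 1), repeat=len(genotype)))
    pairs = []
    for a_idx, hap_a in enumerate(haplotypes):
        for hap_b in haplotypes[a_idx:]:  # unordered pairs only
            if all(g is None or x + y == g
                   for g, x, y in zip(genotype, hap_a, hap_b)):
                pairs.append((hap_a, hap_b))
    return pairs


# A double heterozygote at two SNPs has two possible phases.
double_het = compatible_pairs((1, 1))
# double_het == [((0, 0), (1, 1)), ((0, 1), (1, 0))]
```

A genotype that is heterozygous at no more than one locus has a single compatible pair, so its phase is known without further information.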
The TDT has been extended to tightly linked marker loci. Zhao et al. (2000) [6] performed the TDT in two steps: first constructing the underlying haplotypes, and then constructing the two-way transmission/non-transmission table by assigning a weight to each possible phase. Based on a conditional likelihood, Clayton (1999) proposed a score test, implemented in the program TRANSMIT [7]. TDT-type methods have been widely used in medical studies and GWAS [8] [9].
In this paper we propose a likelihood ratio test, based on the full likelihood, that integrates haplotype construction for case-parent or case-parents data.

Method
The key idea comes from the classical case-control association study. Association between a trait and a marker means that some alleles or haplotypes are found more often in the case group than in the control group. In case-parents data, the alleles or haplotypes of the case are transmitted from the parents, so we can test whether a particular allele or haplotype occurs more often among the transmitted (case) haplotypes than among the non-transmitted (control) haplotypes.
For tightly linked marker loci, we treat haplotypes as extended alleles and use the transmission information to reduce the phase uncertainty.
Let H denote the number of haplotypes for l tightly linked loci. The set of all possible haplotypes is indexed by $1, \dots, H$, with haplotype frequencies $F = (f_1, \dots, f_H)$ in the population. For a case-parents trio, the genotypes of the father, mother and affected child are denoted by $g_f$, $g_m$ and $g_c$, and $R_i$ is the relative risk for haplotype i with reference to haplotype H.
Under the assumption of Hardy-Weinberg equilibrium, the probability that the father transmits haplotype i and does not transmit haplotype j, whereas the mother transmits haplotype k and does not transmit haplotype l, conditional on the child being affected, is
$$P\big((i, j), (k, l) \mid A\big) = f_i^* f_j f_k^* f_l,$$
where A means that the child is affected, and $f_i^*$ is regarded as the frequency of haplotype i in the disease population; we write $F^* = (f_1^*, \dots, f_H^*)$.
However, we only observe the genotype at each locus. The haplotype phase of the l tightly linked loci is often uncertain, especially when parental genotype data are missing, so it is ambiguous which haplotype is transmitted or not transmitted from each parent. Then
$$P(g_f, g_m, g_c \mid A) = \sum_{(i, j, k, l) \in G} f_i^* f_j f_k^* f_l,$$
where $G$ is the set of haplotype groups $(i, j, k, l)$ whose haplotype pairs $(i, j)$, $(k, l)$ and $(i, k)$ are compatible with $g_f$, $g_m$ and $g_c$, respectively. Here, missing parental genotypes are allowed.
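The sum over compatible haplotype groups can be sketched in Python for two SNPs. This is a minimal illustration, not the authors' implementation. The genotype coding (count of allele 1 per locus, `None` for missing), the names `HAPLOTYPES`, `fits`, `trio_groups` and `trio_probability`, and the product form $f_i^* f_j f_k^* f_l$ used for each group are assumptions made for the example.

```python
from itertools import product

# Two SNPs with alleles 0/1 give four haplotypes, indexed 0..3:
# (0, 0), (0, 1), (1, 0), (1, 1).
HAPLOTYPES = list(product((0, 1), repeat=2))


def fits(genotype, hap_a, hap_b):
    """True if the haplotype pair reproduces the allele counts.

    A genotype of None, or a None entry at a locus, is treated as missing.
    """
    if genotype is None:
        return True
    return all(g is None or x + y == g
               for g, x, y in zip(genotype, hap_a, hap_b))


def trio_groups(g_f, g_m, g_c):
    """All (i, j, k, l): the father transmits i and keeps j,
    the mother transmits k and keeps l, and the child carries (i, k)."""
    idx = range(len(HAPLOTYPES))
    return [(i, j, k, l) for i, j, k, l in product(idx, repeat=4)
            if fits(g_f, HAPLOTYPES[i], HAPLOTYPES[j])
            and fits(g_m, HAPLOTYPES[k], HAPLOTYPES[l])
            and fits(g_c, HAPLOTYPES[i], HAPLOTYPES[k])]


def trio_probability(groups, f, f_star):
    """Sum of f*_i f_j f*_k f_l over the compatible groups."""
    return sum(f_star[i] * f[j] * f_star[k] * f[l] for i, j, k, l in groups)


# The father is a double heterozygote, but the homozygous child
# resolves his phase: only one group remains.
groups = trio_groups(g_f=(1, 1), g_m=(0, 0), g_c=(0, 0))
# groups == [(0, 3, 0, 0)]
```

Passing `None` for a parent leaves that parent's non-transmitted haplotype unconstrained, which is how a missing parental genotype enlarges the set of compatible groups rather than excluding the family.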
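Suppose there are N case-parents trios, and hence 2N parents in all. The genotypes for the r-th trio are $g_{rf}$, $g_{rm}$, $g_{rc}$, with compatible haplotype groups $G_r$. The log-likelihood is
$$\ell(F, F^*) = \sum_{r=1}^{N} \log \sum_{(i, j, k, l) \in G_r} f_i^* f_j f_k^* f_l.$$
It is difficult to find the maximum likelihood estimate (MLE) $(\hat F, \hat F^*)$ directly. We employ the expectation-maximization (EM) algorithm to estimate the haplotype frequencies $(\hat F, \hat F^*)$, treating the underlying haplotype pairs as "missing data". Once the haplotype group $(i, j, k, l)$ of each trio is added as missing data, the complete-data log-likelihood is a sum of terms $\log(f_i^* f_j f_k^* f_l)$, and its expectation weights each compatible group by its posterior probability
$$w_r(i, j, k, l) = \frac{f_i^* f_j f_k^* f_l}{\sum_{(i', j', k', l') \in G_r} f_{i'}^* f_{j'} f_{k'}^* f_{l'}}, \quad (i, j, k, l) \in G_r,$$
with $w_r(i, j, k, l) = 0$ otherwise. An iterative procedure can be used to find the MLE via the EM algorithm. Given the current estimates $(F, F^*)$, which determine the weights $w_r$, the estimates in the next step are
$$f_h^{*\,\mathrm{new}} = \frac{1}{2N} \sum_{r=1}^{N} \Big[ \sum_{j, k, l} w_r(h, j, k, l) + \sum_{i, j, l} w_r(i, j, h, l) \Big], \qquad f_h^{\mathrm{new}} = \frac{1}{2N} \sum_{r=1}^{N} \Big[ \sum_{i, k, l} w_r(i, h, k, l) + \sum_{i, j, k} w_r(i, j, k, h) \Big].$$
To test the null hypothesis $F = F^*$ of no association, the likelihood ratio statistic
$$\Lambda = 2\big[\ell(\hat F, \hat F^*) - \ell(\hat F_0, \hat F_0)\big],$$
where $\hat F_0$ is the MLE under the null hypothesis, follows an asymptotic $\chi^2$ distribution with H-1 degrees of freedom (df) when the null hypothesis is true.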
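The iterative procedure and the likelihood ratio statistic can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: each trio is represented by its list of compatible haplotype groups $(i, j, k, l)$ (father transmits i and keeps j, mother transmits k and keeps l), each group has probability $f_i^* f_j f_k^* f_l$, and the null model ties the two frequency vectors together. The names `log_likelihood`, `em_fit` and `likelihood_ratio` and the toy data are hypothetical.

```python
import math


def log_likelihood(trios, f, f_star):
    """Observed-data log-likelihood: sum over trios of log(sum over groups)."""
    return sum(
        math.log(sum(f_star[i] * f[j] * f_star[k] * f[l]
                     for i, j, k, l in groups))
        for groups in trios
    )


def em_fit(trios, n_hap, tied=False, max_iter=1000, tol=1e-10):
    """EM estimates of (f, f_star); tied=True forces f == f_star (null model)."""
    f = [1.0 / n_hap] * n_hap
    f_star = [1.0 / n_hap] * n_hap
    n_parents = 2 * len(trios)
    previous = log_likelihood(trios, f, f_star)
    for _ in range(max_iter):
        kept = [0.0] * n_hap  # expected counts of non-transmitted haplotypes
        sent = [0.0] * n_hap  # expected counts of transmitted haplotypes
        for groups in trios:
            # E-step: posterior weight of each compatible group.
            probs = [f_star[i] * f[j] * f_star[k] * f[l]
                     for i, j, k, l in groups]
            total = sum(probs)
            for (i, j, k, l), p in zip(groups, probs):
                w = p / total
                sent[i] += w
                sent[k] += w
                kept[j] += w
                kept[l] += w
        # M-step: expected counts divided by the number of haplotypes counted.
        if tied:
            f = [(a + b) / (2 * n_parents) for a, b in zip(kept, sent)]
            f_star = list(f)
        else:
            f = [c / n_parents for c in kept]
            f_star = [c / n_parents for c in sent]
        current = log_likelihood(trios, f, f_star)
        if current - previous < tol:
            break
        previous = current
    return f, f_star, current


def likelihood_ratio(trios, n_hap):
    """Lambda = 2 * (max log-likelihood, free model - null model)."""
    _, _, ll_alt = em_fit(trios, n_hap)
    _, _, ll_null = em_fit(trios, n_hap, tied=True)
    return 2.0 * (ll_alt - ll_null)


# Toy data with H = 2 and fully resolved phase (one group per trio).
toy = [[(0, 1, 0, 1)]] * 3 + [[(1, 0, 1, 1)]]
f_hat, f_star_hat, _ = em_fit(toy, n_hap=2)
lam = likelihood_ratio(toy, n_hap=2)
```

The sketch assumes every trio has at least one compatible group; a family with a Mendelian inconsistency would need to be removed beforehand. The statistic `lam` would then be compared with a $\chi^2$ distribution with H-1 degrees of freedom.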
According to the relationship between $(F, F^*)$ and the relative risks, the maximum likelihood estimator (MLE) of the relative risk $R_i$ for haplotype i relative to haplotype H is obtained by substituting $(\hat F, \hat F^*)$ into that relationship.
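As a worked illustration of how such an estimator can arise, suppose the haplotype risks act multiplicatively, so that the frequency of haplotype i in the disease population is proportional to $R_i f_i$. This multiplicative form is an assumption made here for illustration. Under it, the invariance property of maximum likelihood gives the estimator directly from the estimated frequencies:

```latex
% Assumed relationship (multiplicative haplotype risks, with R_H = 1):
f_i^* = \frac{R_i f_i}{\sum_{j=1}^{H} R_j f_j}, \qquad R_H = 1.
% Taking the ratio of the expressions for haplotypes i and H:
\frac{f_i^*}{f_H^*} = \frac{R_i f_i}{f_H}
\quad\Longrightarrow\quad
R_i = \frac{f_i^* / f_i}{f_H^* / f_H},
\qquad
\hat R_i = \frac{\hat f_i^* / \hat f_i}{\hat f_H^* / \hat f_H}.
```

Under this assumed form, $\hat R_i > 1$ corresponds to a haplotype that is over-represented among transmitted haplotypes relative to the reference haplotype H.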

Application
We apply our method to published data that were used for a family-based association analysis of immunoglobulin A nephropathy (IgAN) [10]. Two tightly linked loci, C2093T and C2180T, located in the 3' untranslated region (UTR) of the Megsin gene, were studied. The dataset contains 232 families with an affected child (Table 1). Some data are missing because the genotypes of some parents are not available.

Simulations
In our simulations, genotype data at two tightly linked single nucleotide polymorphism (SNP) markers are generated for 100 case-parents trios. In a way similar to Morris et al. (1997) [11], samples are generated using the transmission model provided by Bickeboller et al. (1995) [12].

$$P\big(g_c = (si, uk),\ g_f = (si, tj),\ g_m = (uk, vl) \mid A\big) = \frac{P\big(A \mid g_c = (si, uk)\big)\, P\big(g_c = (si, uk),\ g_f = (si, tj),\ g_m = (uk, vl)\big)}{K} = \frac{f_{su}\, h_{si} h_{tj} h_{uk} h_{vl}}{K},$$
where $f_{su} = P(A \mid g_{su})$ is the penetrance for genotype su at the disease locus, and $K = P(A)$ is the population prevalence. The frequencies of the mutant disease allele D and the normal allele d at the disease locus are denoted by $q_1 = q$ and $q_2 = 1 - q$. Under Hardy-Weinberg equilibrium, the population prevalence is therefore
$$K = \sum_{s} \sum_{u} f_{su}\, q_s q_u.$$
We assume that there is no recombination between the disease and marker loci.
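The prevalence under Hardy-Weinberg equilibrium can be computed directly from the penetrances and the disease allele frequency. In the sketch below, the function name `prevalence`, the allele frequency and the penetrance values are hypothetical; they are illustrative stand-ins and are not the settings of Table 2.

```python
def prevalence(q, penetrance):
    """K = sum over s, u of f_su * q_s * q_u under Hardy-Weinberg equilibrium.

    q: frequency of the disease allele D (the normal allele d has 1 - q).
    penetrance: 2x2 nested list indexed by allele (0 = D, 1 = d), so that
    penetrance[0][0] = f_DD, penetrance[0][1] = penetrance[1][0] = f_Dd,
    and penetrance[1][1] = f_dd.
    """
    freqs = (q, 1.0 - q)
    return sum(penetrance[s][u] * freqs[s] * freqs[u]
               for s in range(2) for u in range(2))


# Hypothetical penetrances, for illustration only.
recessive = [[0.5, 0.05], [0.05, 0.05]]  # only DD carries elevated risk
dominant = [[0.5, 0.5], [0.5, 0.05]]     # DD and Dd carry elevated risk

k_recessive = prevalence(0.2, recessive)  # 0.5*0.04 + 0.05*0.96 = 0.068
k_dominant = prevalence(0.2, dominant)    # 0.5*0.36 + 0.05*0.64 = 0.212
```

Dividing each genotype's contribution by K turns the joint probabilities of genotype and disease into probabilities conditional on the child being affected, which is the role K plays in the transmission model.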
The linkage disequilibrium (LD) parameter $e_{si}$, used to measure association between the disease locus and the marker locus, is defined as in Sham (1995) [13] for $s = D, d$ and $i = 1, 2, \dots, H$, where $h_{si}$ is the frequency of disease-marker haplotype si. $e_{Di} > 1$ means that marker haplotype i is positively associated with the disease, and $e_{Di} < 1$ means that marker haplotype i is negatively associated. In our simulations, the LD pattern is such that marker haplotype 1 is positively associated with the disease, and the other haplotypes are equally negatively associated, if there is association. In addition, the marker haplotypes are assumed to be equally frequent, and classical genetic models (recessive, dominant) and common models with specified heredity modes are considered. These models are shown in Table 2.
For each genetic model, 5000 replicated samples were generated to evaluate the distribution of the test statistic in the case of no association, i.e. e = 1. The quantile-quantile (QQ) plots for the test statistic Λ from 5000 samples under the null hypothesis for the three genetic models are shown in Figures 1(a)-(c). The points comparing the empirical quantiles of Λ with those of the $\chi^2$ distribution are very close to the line y = x.
Table 2. Genetic models for simulation study.
In addition, 1000 replicated samples were generated for the statistical power analysis, at significance level α = 0.05 or 0.01. The empirical power of our proposed method, in comparison with TRANSMIT, is summarized in Table 3 under different association levels e for the three disease models.
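Empirical power is the fraction of replicated samples in which the test statistic exceeds the critical value. The sketch below assumes that all four haplotypes of the two SNPs are present, so that the reference distribution is $\chi^2$ with 3 degrees of freedom (critical values 7.815 at α = 0.05 and 11.345 at α = 0.01). The function name `empirical_power` and the list of statistics are hypothetical.

```python
def empirical_power(statistics, critical_value):
    """Proportion of replicates whose statistic exceeds the critical value."""
    if not statistics:
        raise ValueError("at least one replicate is required")
    rejected = sum(1 for s in statistics if s > critical_value)
    return rejected / len(statistics)


# Upper critical values of chi-square with 3 df (assumes H = 4 haplotypes).
CRITICAL_3DF = {0.05: 7.815, 0.01: 11.345}

# Hypothetical test statistics from five replicates, for illustration only.
replicates = [1.2, 8.4, 12.9, 3.3, 9.1]
power_05 = empirical_power(replicates, CRITICAL_3DF[0.05])  # 3 of 5 = 0.6
power_01 = empirical_power(replicates, CRITICAL_3DF[0.01])  # 1 of 5 = 0.2
```

Applied to replicates generated under e = 1, the same function estimates the empirical type I error rate instead of the power.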

Discussion
Haplotype frequencies are usually estimated when haplotypes are reconstructed or linkage disequilibrium is tested. For tightly linked loci, we gave the likelihood for case-parents data as a function of the haplotype frequencies in the transmitted and non-transmitted groups, and estimated the MLEs of the haplotype frequencies via an EM algorithm. The results showed that haplotype frequencies can be estimated using a simple iterative procedure. The likelihood ratio test comparing haplotype frequencies in the transmitted and non-transmitted groups was used to detect association. When parental information is not available in the nuclear family, the classical TDT is no longer suitable. However, as shown in the application, missing parental genotypes are allowed in our method. In addition, when siblings are available, their information can be used to reduce the uncertainty of phase, and the likelihood can be given similarly. Under different simulated conditions, with the heredity mode and linkage disequilibrium coefficient specified, 5000 and 1000 replicated samples were generated to evaluate the distribution of the test statistic and the statistical power, respectively.
In these simulations, our method was more powerful than TRANSMIT (Table 3).


Figure 1. QQ plots for test statistic Λ from 5000 samples. Red: expected quantiles of distribution

Table 3. Comparisons of empirical power.