An Introduction to Basic Statistical Models in Genetics

Genome-wide association studies (GWAS) commonly use three genetic models (additive, dominant and recessive) to study the association between genetic variants and a trait (disease). The choice among these models depends on the pattern of inheritance and the scope of the study. GWAS typically focus on single-nucleotide polymorphisms (SNPs) and common human diseases in a case-control setup. When studying the association between a risk genotype and the phenotype under a given inheritance pattern, these genetic models help to identify the disease risk appropriately. This study provides an overview of the existing genetic models (additive, dominant and recessive) and a practical demonstration of the corresponding tests on contingency tables of SNP genotypes and disease phenotypes in a case-control setting.


Introduction
The main goal of human genetics is to identify genetic risk factors for common and complex diseases [1] [2] [3] [4] [5]. Genetic association studies determine the risks related to allelic variants of candidate genes for which there is evidence of linkage to disease susceptibility [4] [6]. These studies collect valid and precise information on the causes, prevention, and treatment of disease [6].
Genetic association studies such as genome-wide association studies (GWAS) offer a powerful and complete analysis of the genetic association between certain observable traits and specific genetic variations in the form of single-nucleotide polymorphisms (SNPs). GWAS provide a relatively superficial approach to detecting potential genetic contributors to phenotypes (common and complex diseases) from a simple case-control setup [1] [3] [7]. These studies attempt to discover novel genes by testing a huge number of SNPs for association [3].
The statistical analysis of genetic data can be performed for a study population once a well-defined phenotype has been selected and the genotypes have been collected using a sound technique [4]. GWAS perform a series of single-locus statistical tests, examining each SNP independently for association with the phenotype. Genotypic association tests examine the association between genotypes and the phenotype, where the genotypes for a SNP can also be grouped according to different genotype models, such as the additive, dominant or recessive model [4] [5].
The main objective of this paper is to provide a practical demonstration of the three basic genetic models (additive, dominant, recessive) in case-control GWAS for DNA sequencing data.

Genetic Models
The three existing genetic models can be described as follows.
For a single SNP, the three genotypes together with a categorical phenotype with two categories can be presented in a 2 × 3 contingency table. The six cell counts $n_{ij}$ (row $i$ for cases or controls, column $j$ for the genotype) are the numbers of samples in a case-control study with a particular genotype and phenotype combination, where the SNP has two alleles (D = disease-causing allele and N = allele not causing the disease).
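A minimal Python sketch of how such a 2 × 3 table could be tabulated from per-individual data. The function name `contingency_table`, the genotype labels, the column order and the 1/0 coding of cases and controls are assumptions made for illustration; the analysis in this paper was carried out in R.

```python
from collections import Counter

# Column order assumed for this sketch: zero, one, two copies of D.
GENOTYPES = ("NN", "ND", "DD")


def contingency_table(genotypes, phenotypes):
    """Return a 2 x 3 table as [case counts, control counts].

    genotypes  : iterable of "NN", "ND" or "DD"
    phenotypes : iterable of 1 (case) or 0 (control), same length
    """
    genotypes, phenotypes = list(genotypes), list(phenotypes)
    if len(genotypes) != len(phenotypes):
        raise ValueError("genotypes and phenotypes must have equal length")
    # Count each (phenotype, genotype) combination once.
    counts = Counter(zip(phenotypes, genotypes))
    cases = [counts[(1, g)] for g in GENOTYPES]
    controls = [counts[(0, g)] for g in GENOTYPES]
    return [cases, controls]


# Hypothetical toy data: six individuals, three cases and three controls.
toy_table = contingency_table(
    ["NN", "ND", "DD", "ND", "NN", "DD"],
    [0, 1, 1, 0, 0, 1],
)
```

Each row of the result holds the counts for one phenotype category, and each column holds the counts for one genotype.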
Each model makes different assumptions about the genetic effect in the data.
For a single SNP with two alleles, N and D, the dominant model (for the D allele) assumes that having one or more copies of the D allele increases risk compared with having none. Hence, the genotypes DD and ND carry the higher risk and are compared with NN. The recessive model (for the D allele) assumes that two copies of the D allele are required to alter the risk. Hence, individuals with the genotype DD are compared with individuals having genotypes ND and NN. The additive model (for the D allele) assumes a linear, uniform increase in risk with each copy of the disease-causing allele D: if the risk for ND is k, then the risk for DD is 2k [4] [8] [9].
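The groupings described above can be sketched in Python by assigning each genotype a score and pooling genotypes that share a score. The score vectors (0, 1, 2), (0, 1, 1) and (0, 0, 1) are a common coding for the additive, dominant and recessive models and are an assumption here, as are the names `MODEL_SCORES` and `group_by_score` and the example counts.

```python
# Genotype scores for the D allele, in column order (NN, ND, DD).
# This coding is an assumption of the sketch, not taken from the paper.
MODEL_SCORES = {
    "additive": (0, 1, 2),   # risk rises with each copy of D
    "dominant": (0, 1, 1),   # ND and DD pooled against NN
    "recessive": (0, 0, 1),  # DD compared against NN and ND pooled
}


def group_by_score(table, model):
    """Pool the genotype columns of a 2 x 3 table that share a score.

    Returns one row per phenotype category, with one column per
    distinct score (three for additive, two for dominant/recessive).
    """
    scores = MODEL_SCORES[model]
    levels = sorted(set(scores))
    return [
        [sum(n for n, s in zip(row, scores) if s == level) for level in levels]
        for row in table
    ]


# Hypothetical counts: rows are cases and controls, columns NN, ND, DD.
example_table = [[20, 50, 30], [40, 45, 15]]
dominant_table = group_by_score(example_table, "dominant")
recessive_table = group_by_score(example_table, "recessive")
```

Under the dominant and recessive codings the 2 × 3 table collapses to a 2 × 2 table, while the additive coding keeps all three genotype columns.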

Models with the Penetrance Function
Penetrance functions represent one approach to modeling the relationship between SNPs and risk of disease [10] [11] [12]. The penetrance of a genetic disorder is measured by evaluating how often a particular phenotype occurs given a specific genotype; that is, the penetrance is the probability of being affected with disease x given a specific genotype g. For a locus with one disease-causing allele D and one allele not causing the disease N, the probabilities of being affected given each genotype can be expressed as [13] [14],
$$f_0 = P(x \mid NN), \qquad f_1 = P(x \mid ND), \qquad f_2 = P(x \mid DD). \tag{1}$$
Here, $f_0$ is the frequency of individuals who are affected without carrying a disease-causing allele (the frequency of phenocopies). According to Bush (2012), different inheritance patterns (recessive, dominant, additive) can be expressed in terms of mathematical models (Table 2). Here, the phenotypes show full penetrance and no phenocopies. That is, no individual without the disease-causing genotype will become affected.
For example, if a disease is transmitted in an additive fashion, the risk for a heterozygous person to be affected is half that of a person who is homozygous D, as compared to an individual who is homozygous N. Hence, according to the penetrance probabilities shown in Table 2, $f_1 = \tfrac{1}{2} f_2$. On the other hand, these models can be represented in terms of the genotypic relative risks (GRR) when phenocopies are allowed, that is, $f_0 > 0$ (Table 3).
For $f_0 > 0$, the GRR can be expressed in terms of the functions $f_0$, $f_1$ and $f_2$ defined in Equation (1),
$$\mathrm{GRR}_1 = \frac{f_1}{f_0}, \qquad \mathrm{GRR}_2 = \frac{f_2}{f_0}.$$
So, the GRR represents the increased risk of an individual having a disease-causing genotype over a person without a disease-causing allele. By introducing the GRR, the three parameters ($f_0, f_1, f_2$) defined in Equation (1) are reduced to two relative risks.
Table 2. Penetrances for simple Mendelian inheritance patterns.
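As a small numerical illustration, the following Python sketch computes the two genotypic relative risks from a set of penetrances, assuming the usual definition of the GRR as the ratios of the ND and DD penetrances to the NN penetrance. The function name `genotype_relative_risks` and the penetrance values are hypothetical and chosen only to show the arithmetic.

```python
def genotype_relative_risks(f0, f1, f2):
    """Return (f1 / f0, f2 / f0) for penetrances f0, f1, f2.

    f0, f1, f2 are the probabilities of being affected given the
    genotypes NN, ND and DD. The ratios are only defined when
    phenocopies exist, that is, when f0 > 0.
    """
    for f in (f0, f1, f2):
        if not 0.0 <= f <= 1.0:
            raise ValueError("penetrances must lie in [0, 1]")
    if f0 == 0:
        raise ValueError("GRR is undefined without phenocopies (f0 = 0)")
    return f1 / f0, f2 / f0


# Hypothetical penetrances in which each copy of D adds the same
# multiple of the baseline risk f0.
grr_heterozygous, grr_homozygous = genotype_relative_risks(0.01, 0.02, 0.04)
```

With these values a heterozygous carrier has twice, and a homozygous carrier four times, the baseline risk of a person without a disease-causing allele.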

Genotype Data Preparation
Individual genotype data for a single SNP were generated for 1000 individuals by computer simulation in the R programming language. These 1000 individuals were then randomly allocated to cases and controls with equal probability (0.5 each). This random allocation was repeated 1000 times. In each repetition, the independence of the SNP and the phenotype was tested using the proportion trend test [15] under the three genetic models (additive, dominant and recessive) and using the Pearson chi-squared test [16]. The three p-values from the tests under the three genetic models were recorded in each repetition, along with the p-value from the Pearson chi-squared test.
The curve shapes and features of the three tests appear similar to those of the Pearson chi-squared test. However, differences in the results become visible in Figure 3, which presents pairwise plots comparing the p-values and $\chi^2$-values of each of the three tests with those of the Pearson chi-squared test. A positive relation is observed in each plot, with many values grouped together near the origin. This is because the tables corresponding to these points deviate relatively little from the Pearson chi-squared test in terms of the p-values and $\chi^2$-values.
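The simulation procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the R code used in the paper: the D allele frequency of 0.3, the Hardy-Weinberg sampling of genotypes, the seed and all function names are hypothetical. The Cochran-Armitage form of the test for trend in proportions stands in for the proportion trend test, and the p-values use the closed-form chi-squared tail probabilities for 1 and 2 degrees of freedom.

```python
import math
import random

# Genotype scores in column order (NN, ND, DD); an assumed coding.
MODEL_SCORES = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}


def trend_test(cases, controls, scores):
    """Cochran-Armitage trend test; returns (chi2, p) on 1 df."""
    totals = [r + c for r, c in zip(cases, controls)]
    n, r = sum(totals), sum(cases)
    s_tr = sum(t * x for t, x in zip(scores, cases))
    s_tn = sum(t * m for t, m in zip(scores, totals))
    s_tt = sum(t * t * m for t, m in zip(scores, totals))
    denom = r * (n - r) * (n * s_tt - s_tn ** 2)
    if denom == 0:  # degenerate table: no cases, no controls or one score
        return 0.0, 1.0
    chi2 = n * (n * s_tr - r * s_tn) ** 2 / denom
    # Upper tail of chi-squared with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def pearson_test(cases, controls):
    """Pearson chi-squared test on a 2 x 3 table; returns (chi2, p).

    The p-value assumes 2 df, i.e. no empty genotype column.
    """
    totals = [r + c for r, c in zip(cases, controls)]
    n = sum(totals)
    chi2 = 0.0
    for row in (cases, controls):
        row_total = sum(row)
        for observed, col_total in zip(row, totals):
            expected = row_total * col_total / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    # Upper tail of chi-squared with 2 df.
    return chi2, math.exp(-chi2 / 2)


def simulate(n_individuals=1000, n_repeats=1000, allele_freq=0.3, seed=1):
    """Fix the genotypes once, then reallocate phenotypes n_repeats times."""
    rng = random.Random(seed)
    # Genotype = number of D alleles (0, 1 or 2), two independent draws.
    genotypes = [
        (rng.random() < allele_freq) + (rng.random() < allele_freq)
        for _ in range(n_individuals)
    ]
    results = []
    for _ in range(n_repeats):
        cases, controls = [0, 0, 0], [0, 0, 0]
        for g in genotypes:
            # Each individual becomes a case with probability 0.5.
            (cases if rng.random() < 0.5 else controls)[g] += 1
        p_values = {
            model: trend_test(cases, controls, scores)[1]
            for model, scores in MODEL_SCORES.items()
        }
        p_values["pearson"] = pearson_test(cases, controls)[1]
        results.append(p_values)
    return results


if __name__ == "__main__":
    print(simulate(n_repeats=3))
```

Because the genotypes are fixed and only the phenotype labels are reallocated, every repetition yields a table with the same genotype totals, and each entry of the result holds the four p-values for one such table.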

Results and Discussion
The 3-dimensional scatter plot of the p-values from the three genetic tests in Figure 4 indicates that the three tests produce different p-values, with a positive relation among them, for the different tables obtained by shuffling the phenotypes. The result shows that a table with fixed genotype counts produces different results when different genetic tests are applied. Also, for a fixed sample size, the same test produces different p-values across the tables obtained by shuffling the phenotypes.

Conclusion
This paper is a practical demonstration of the three genetic model tests for SNP genotype data. The analysis used simulated SNP genotype data, but the application could be extended to real datasets. The basic structure of simulated and real data would be the same, so the direction of the results would be the same in both cases. The choice of a proper model is important in such association studies and generally depends on the inheritance pattern of the disease. Investigating the suitability of these models for different inheritance patterns of disease is therefore a future direction of this research. The appropriate selection of a genetic model in association studies will improve the detection of risks related to allelic variants of candidate genes. The results of this paper indicate that different genetic model tests produce different p-values for a table of fixed sample size and genotype counts. Also, for the same test, different p-values are obtained across the tables constructed by shuffling the phenotypes of the given table.
Hence, the models should be correctly chosen according to the mode of inheritance (dominant, additive and recessive).