A Gene Score Test for Disease Association with Multiple Genes

The traditional method for creating a gene score to predict a given outcome is to use the most statistically significant single nucleotide polymorphisms (SNPs) from all SNPs which were tested. There are several disadvantages of this approach such as excluding SNPs that do not have strong single effects when tested on their own but do have strong joint effects when tested together with other SNPs. The interpretation of results from the traditional gene score may lack biological insight since the functional unit of interest is often the gene, not the single SNP. In this paper we present a new gene scoring method, which overcomes these problems as it generates a gene score for each gene, and the total gene score for all the genes available. First, we calculate a gene score for each gene and second, we test the association between this gene score and the outcome of interest (i.e. trait). Only the gene scores which are significantly associated with the outcome after multiple testing correction for the number of gene tests (not SNPs) are considered in the total gene score calculation. This method controls false positive results caused by multiple tests within genes and between genes separately, and has the advantage of identifying multi-locus genetic effects, compared with the Bonferroni correction, false discovery rate (FDR), and permutation tests for all SNPs. Another main feature of this method is that we select the SNPs, which have different effects within a gene by using adjustment in multiple regressions and then combine the information from the selected SNPs within a gene to create a gene score. A simulation study has been conducted to evaluate finite sample performance of the proposed method.


Introduction
Due to rapid developments in high-throughput genetic technologies, genome-wide association studies (GWAS) have become common. The success of GWAS depends on genotyping a large number of SNPs (i.e. 500,000 to 1 million) and determining which of these SNPs are significantly associated with the outcome of interest. It is expected that genotyping more SNPs should lead to more accurate gene localization. However, the benefit of the increased number of SNPs is reduced either by multiple testing correction if the SNPs are tested one at a time, or by the increased number of degrees of freedom in the statistical test if multiple regression or haplotype analysis is used. Wang and Elston [1] describe the two possibly conflicting goals of "catching information" and "cutting the cost" of multiple testing or the large number of degrees of freedom. They suggested using a weighted score test (WST), which has only one degree of freedom, to achieve the two goals simultaneously. However, Chapman and Whittaker [2] show that the WST may have low power if some of the coded SNPs are positively correlated with the outcome while others are negatively correlated with it. Pan [3] suggested an alternative approach called the sum test, which also has only 1 degree of freedom. The sum test shares the sign problem of the WST (i.e. some coded SNPs are positively correlated with the outcome and the others are negatively correlated with the outcome). To overcome this limitation of the sum test, Pan [3] proposed five tests, which are closely related to each other. There is also a heuristic solution to the sign problem, discussed by Wang and Elston [1]: before using the WST (or the sum test), one adjusts the coding of SNPs so that all SNPs are positively correlated with the outcome. The sign problem is not the only limitation of the sum test. It uses information from all of the SNPs, although the majority of the SNPs might not be associated with the trait, and their use may reduce the power.
To increase power and reduce false positive results caused by multiple tests and dependence among test statistics, Gu et al. [4] proposed a modified forward multiple regression approach. They chose the SNP with the maximum order statistic in the regression model if its P-value was less than a pre-specified α level, retained the selected SNP in the regression model, and then looked for the second SNP among the SNPs with the largest 5% of order statistics in the previous step of regression. This procedure was repeated until no more SNPs could be selected. Their simulation studies show that the modified forward multiple regression approach has higher power than the Bonferroni and false discovery rate (FDR) procedures for detecting moderate and weak genetic effects.
For multiple testing correction, current methods such as Bonferroni correction, FDR and permutation tests do not consider the situation described in Figure 1, where an association pattern of 4 SNPs is provided. All of those methods will choose SNP2 before SNP3 since SNP2 has a smaller P-value than SNP3, although SNP2 might be significant just because it is highly linked with SNP1 and has nothing to do with the trait, as shown in Figure 1. This predicament motivates the development of the new gene score method.
We propose a method that controls false positive results caused by multiple tests within genes and between genes separately. First, the SNPs within a gene compete with each other by using Gu et al.'s [4] modified forward multiple regression method, which has more power to detect multiple weak genetic factors than FDR and Bonferroni. We then combine the information from the selected SNPs within a gene to create a gene score, which has only 1 degree of freedom. To avoid the sign problem, we follow an approach similar to that of Wang and Elston [1]: we adjust the coding of SNPs so that all SNPs are positively associated with the trait. Finally, the genes compete with each other by using gene scores and Bonferroni correction. As shown in Figure 1, SNP2 will compete with SNP1 first, instead of competing with SNP3 directly. Since we compare the gene score of gene 1 with that of gene 2, SNP3 in gene 2 might be chosen before SNP1 in gene 1 if the joint effect of gene 2 is stronger than the joint effect of gene 1. This is another advantage of our method compared with current methods, including Gu et al.'s modified forward multiple regression method applied to all SNPs.
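The recoding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `recode_to_risk_allele` is hypothetical, and using the sign of the sample covariance between genotype and trait to decide the direction of association is an assumption (a regression coefficient from the fitted model could be used instead).

```python
def recode_to_risk_allele(genotypes, trait):
    """Return 0/1/2 allele counts recoded so the SNP is positively associated with the trait.

    Assumption: the direction of association is taken from the sign of the
    sample covariance between genotype and trait.
    """
    n = len(trait)
    mean_x = sum(genotypes) / n
    mean_y = sum(trait) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, trait))
    if cov < 0:
        # Count the other allele instead: 0 <-> 2, 1 stays 1.
        return [2 - x for x in genotypes]
    return list(genotypes)
```

After recoding, every SNP counts the allele associated with a higher trait value, so summing allele counts across SNPs does not cancel opposite-signed effects.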

Methods
Let $X_{ij}$ denote the locus score [5], defined as the number of risk alleles (0, 1, or 2) for SNP $j$ ($j = 1, 2, \ldots, L_i$) in gene $i$ ($i = 1, 2, \ldots, K$) carried by an individual. Let $L = L_1 + L_2 + \cdots + L_K$ denote the total number of SNPs. Suppose the trait value is $Y$ and a test is conducted for SNP $j$ in gene $i$ by using a generalized linear model (GLM) [6]:

$$h(E(Y)) = \beta_0 + \beta_{ij} X_{ij}, \tag{1}$$

where $E(Y)$ is the mean of $Y$ and $h(\cdot)$ is the link function. Typical link functions include the identity link for a continuous normally distributed outcome and the logit link for binary traits. The model can be a conditional GLM for a matched data set, or can be adjusted for covariates such as age and sex. This test yields the P-value $p_{ij}$ ($j = 1, 2, \ldots, L_i$; $i = 1, 2, \ldots, K$). Let $p_{i(1)} = \min(p_{ij},\ j = 1, 2, \ldots, L_i)$; the corresponding SNP is denoted by SNP$_{i(1)}$ and the corresponding locus score by $X_{i(1)}$. If $p_{i(1)} >$ a pre-specified $\alpha$, the gene score for gene $i$ is 0. If $p_{i(1)} \le \alpha$, we ask whether there is another SNP in gene $i$ that is associated with the trait after the effect of SNP$_{i(1)}$ is accounted for, by using the model

$$h(E(Y)) = \beta_0 + \beta_{i(1)} X_{i(1)} + \beta_{ij} X_{ij}, \tag{2}$$

where SNP $j$ ranges over the SNPs in gene $i$ other than the selected SNP$_{i(1)}$ and the SNPs with P-value > 0.05 in model (1). The P-value of this test is again denoted by $p_{ij}$.

Let $p_{i(2)} = \min_j p_{ij}$ over these remaining SNPs; the corresponding SNP is denoted by SNP$_{i(2)}$ and the corresponding locus score by $X_{i(2)}$. If $p_{i(2)} \le \alpha$, we search for a third SNP in gene $i$ that is associated with the trait after the effects of SNP$_{i(1)}$ and SNP$_{i(2)}$ are accounted for, by using the model

$$h(E(Y)) = \beta_0 + \beta_{i(1)} X_{i(1)} + \beta_{i(2)} X_{i(2)} + \beta_{ij} X_{ij}, \tag{3}$$

where SNP $j$ does not include the selected SNP$_{i(1)}$ and SNP$_{i(2)}$ or the SNPs with P-value > 0.05 in model (2). These steps continue until no further SNP can be found in gene $i$. Suppose we select SNP$_{i(1)}$, SNP$_{i(2)}$, $\ldots$, SNP$_{i(s)}$ from gene $i$. The gene score for gene $i$ can then be defined as

$$G_i = \sum_{k=1}^{s} X_{i(k)}. \tag{4}$$

Now we focus on the gene instead of the SNPs within the gene. To obtain the association between the gene and the trait, we use the model

$$h(E(Y)) = \beta_0 + \beta_i G_i. \tag{5}$$

This model is similar to the SUM test [3] in having only one degree of freedom. However, it does not have the sign problem of the SUM test, because we use the locus score $X_{ij}$, which is the number of risk alleles. Unlike the SUM test, this method uses only the selected SNPs, which removes noise. To adjust for the multiple testing of multiple genes, we use Bonferroni correction (P-value $\le \alpha / K$), assuming there is no linkage disequilibrium (LD) between different genes. Here $K$ counts all genotyped genes, including those whose gene score is 0 because no SNP was selected. Suppose there are $t$ gene scores (for example, $G_1, G_2, \ldots, G_t$) that are significant after Bonferroni correction. The total gene score is defined as

$$G = \sum_{i=1}^{t} G_i. \tag{6}$$
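The procedure above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: it uses the identity link only (ordinary least squares), approximates the coefficient test by a two-sided Wald z-test with a normal reference distribution, and takes the gene score as the unweighted sum of the selected locus scores. All function and parameter names (`invert`, `wald_p`, `select_snps`, `gene_score`, `total_gene_score`, `alpha_snp`, `alpha_gene`, `screen`) are hypothetical.

```python
import math

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        if abs(piv) < 1e-10:
            raise ZeroDivisionError("singular design matrix")
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def wald_p(y, covariates, x):
    """Two-sided P-value for x in a least-squares fit of y on intercept + covariates + x.

    Assumption: identity link, normal approximation to the Wald statistic.
    Returns 1.0 when x carries no information (e.g. an all-zero gene score).
    """
    n = len(y)
    cols = [[1.0] * n] + list(covariates) + [x]
    k = len(cols)
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    try:
        inv = invert(xtx)
    except ZeroDivisionError:
        return 1.0
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y[t] - sum(beta[j] * cols[j][t] for j in range(k)) for t in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    z = beta[-1] / math.sqrt(sigma2 * inv[-1][-1])
    return math.erfc(abs(z) / math.sqrt(2))

def select_snps(y, snps, alpha, screen=0.05):
    """Forward selection within one gene; snps is a list of per-SNP locus-score lists."""
    selected, candidates = [], list(range(len(snps)))
    while candidates:
        adjust = [snps[j] for j in selected]
        pvals = {j: wald_p(y, adjust, snps[j]) for j in candidates}
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:
            break
        selected.append(best)
        # Drop the selected SNP and any SNP with P-value > screen at this step.
        candidates = [j for j in candidates if j != best and pvals[j] <= screen]
    return selected

def gene_score(snps, selected):
    """Sum of locus scores over the selected SNPs (0 for everyone if none selected)."""
    return [sum(snps[j][t] for j in selected) for t in range(len(snps[0]))]

def total_gene_score(y, genes, alpha_snp, alpha_gene=0.05):
    """Return (indices of genes significant at alpha_gene / K, total gene score)."""
    K, total, significant = len(genes), [0] * len(y), []
    for i, snps in enumerate(genes):
        g = gene_score(snps, select_snps(y, snps, alpha_snp))
        if wald_p(y, [], g) <= alpha_gene / K:
            significant.append(i)
            total = [a + b for a, b in zip(total, g)]
    return significant, total
```

A call such as `total_gene_score(y, genes, alpha_snp=0.05 / 15)` would take `genes` as a list of genes, each a list of per-SNP locus-score lists. For a binary trait, the least-squares fit would need to be replaced by a logistic regression fit.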

Simulation
To evaluate the performance of the proposed method, we conducted a simulation study to compare the power and the average number of false positives in detecting associated genes for our method, the Bonferroni procedure, the FDR method and the modified forward multiple regression method. The simulated data set is generated to have a structure similar to that of the genotype data from the INTERHEART genetics study [7]. On average, there are 15 SNPs per gene, and there are 100 genes in the simulated data set. We picked one gene with 8 SNPs and another gene with 21 SNPs, and then picked 2 SNPs from each gene. Let the probability of complex disease for the $t$th subject follow a logistic function of the locus scores $X_{ijt}$, where $X_{ijt}$ denotes the number of risk alleles (0, 1, or 2) for SNP $j$ ($j = 1, 2$) in gene $i$ ($i = 1, 2$) carried by subject $t$. Using a Bernoulli distribution, the disease status of each subject was generated based on the probability of disease for that subject. From the simulated data set, we randomly selected 1000 cases (with disease) and 1000 controls (without disease) to form a data set. To obtain 1000 replicates, the simulated data set was generated 1000 times, and 1000 cases and 1000 controls were randomly selected from each. We choose $\alpha$ as: 1) $\alpha$ = 0.05/15 (note that 15 is the average number of SNPs per gene); 2) $\alpha$ = 0.05/(number of SNPs within the gene). In the first case, $\alpha$ is the same for all genes, while in the second case, $\alpha$ differs between genes with different numbers of SNPs. The power for detecting each causal gene is the number of replicates in which the gene is detected, divided by 1000. The average number of false positives (ANFP) per replicate is calculated by dividing the total number of false-positive genes found in the 1000 replicates by 1000. For Bonferroni, FDR and modified forward multiple regression, there are only tests for SNPs, not for genes. We therefore define a gene as "detected" if one or more SNPs in the gene are significant. The results of this simulation are listed in Table 1 and Table 2.
Our proposed method (GST) has much higher power than Bonferroni, FDR and modified forward multiple regression.
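The data-generating step and the two summary measures can be sketched as follows. This is a minimal sketch under assumptions: the intercept `beta0` and effect sizes `betas` are hypothetical placeholders, since the values used in the study are not given here, and all function names are illustrative.

```python
import math

def disease_probability(x, beta0, betas):
    """Logistic probability of disease given risk-allele counts x for the causal SNPs."""
    eta = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_status(genotypes, beta0, betas, rng):
    """Bernoulli draw of disease status (1 = case, 0 = control) for each subject."""
    return [1 if rng.random() < disease_probability(x, beta0, betas) else 0
            for x in genotypes]

def sample_case_control(status, n_cases, n_controls, rng):
    """Randomly pick subject indices for the case and control groups."""
    cases = [t for t, d in enumerate(status) if d == 1]
    controls = [t for t, d in enumerate(status) if d == 0]
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)

def power_and_anfp(detected_per_replicate, causal_genes):
    """Power per causal gene and average number of false-positive genes per replicate.

    detected_per_replicate: one set of detected gene indices per replicate.
    """
    n_rep = len(detected_per_replicate)
    causal = set(causal_genes)
    power = {g: sum(g in d for d in detected_per_replicate) / n_rep for g in causal_genes}
    anfp = sum(len(set(d) - causal) for d in detected_per_replicate) / n_rep
    return power, anfp
```

Here `rng` is expected to be a `random.Random` instance, so a fixed seed makes each replicate reproducible. For SNP-level methods, a gene would be added to a replicate's detected set whenever at least one of its SNPs is significant.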

Discussion
In this paper, we have proposed a new method to create a gene score for each gene and then a total gene score for all the genes tested. Compared with the traditional gene score method, this method has the advantage of using SNPs that may have weak single effects yet strong joint effects. It controls false positive results caused by multiple tests within genes and between genes separately, and so has the advantage of identifying multi-locus genetic effects, compared with the Bonferroni correction, FDR, and permutation tests for all SNPs, as shown in Figure 1. Unlike the sum test, which counts all the SNPs (even when the majority of SNPs are not associated with the trait), our method removes these SNPs, with a resultant increase in power. Our method can be easily generalized to consider interaction between SNPs within each gene and interaction between genes. It can also be modified to develop a weighted gene score by using genetic effect sizes as the weights, if the estimates of the true genetic effect sizes are reliable and accurate. Our simulation shows that our proposed method (GST) has much higher power than FDR to detect associated genes. Further research is required to assess gene scores that include genes which are not independent because of LD between them. One solution might be to combine the genes with strong LD into one gene cluster and then use the gene cluster in place of those genes. Another solution might be to use permutation tests instead of Bonferroni correction for gene scores.
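As an illustration of the weighted extension mentioned above, one possible form (an assumption, since the text does not specify it) replaces the unweighted sum of the selected locus scores with a sum weighted by estimated effect sizes. Here $\hat{\beta}_{i(k)}$ is a hypothetical symbol for the estimated effect size of the $k$th selected SNP in gene $i$, $X_{i(k)}$ is its locus score, and $s$ is the number of SNPs selected in that gene:

$$G_i^{w} = \sum_{k=1}^{s} \hat{\beta}_{i(k)} X_{i(k)}$$

Setting every weight to 1 recovers the unweighted gene score.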

Figure 1. An association pattern of 4 SNPs within 2 genes. SNP2 has a smaller P-value than SNP3; however, SNP2 might be significant just because it is in high linkage disequilibrium with SNP1 and has nothing to do with the trait. SNP3 might be more important than SNP2.

Table 1. Power and average number of false positives to detect the associated genes (for GST, α = 0.05/the average

Table 2. Power and average number of false positives to detect the associated genes (for GST, α = 0.05/the number of SNPs within the gene).
FDR, false discovery rate; MFMR, modified forward multiple regression; GST, gene score test; ANFP, average number of false positives; q is the controlled q-value level.