A major challenge in interpreting the results of a GWAS is setting an appropriate statistical threshold. In classical statistics, as the number of tests against a single null hypothesis increases, the statistical threshold (p-value) has to account for the probability of a false positive occurring by chance. This is typically done using a Family-Wise Error Rate (FWER) approach such as a Bonferroni correction; if a GWAS tests 1 million SNPs (setting aside the complexities of testing 1 million hypotheses that we have previously discussed), the per-SNP significance level is 0.05/10⁶ = 5 × 10⁻⁸ for a global significance level of 5%. However, the Bonferroni correction is not appropriate for observational studies such as GWAS (Perneger, 1998) because it does not account for the dependence among SNPs that lie close to each other on a chromosome, thus leading to an overcorrection. Arking et al. proposed less stringent significance levels that account for LD across SNPs (Arking et al., 2006). Zondervan and Cardon (2007) provide a method to adjust for the actual number of independent tests. The HapMap consortium proposed a local significance threshold of 5.5 × 10⁻⁸ based on re-sampling from empirical data under the null hypothesis (Altshuler et al., 2005), reaching a similar result to the WTCCC consortium (5 × 10⁻⁷) (Dudbridge & Gusnanto, 2008; WTCCC, 2007).
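The Bonferroni arithmetic discussed in this paragraph can be illustrated with a minimal sketch. The function name and the figure of 500,000 effectively independent tests are assumptions made purely for illustration; they are not values taken from the cited studies.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Nominal correction: every one of 1 million SNPs counted as a separate test.
per_snp = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical correction when LD reduces the tests to an assumed
# 500,000 effectively independent ones (illustrative number only).
per_independent_test = bonferroni_threshold(0.05, 500_000)

print(f"nominal: {per_snp:.1e}, LD-adjusted (assumed): {per_independent_test:.1e}")
```

Dividing by a smaller number of effectively independent tests gives a less stringent per-SNP threshold, which is the intuition behind adjusting for LD.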
Various combinations of a classical method with other methods have been suggested, including multi-stage experimental designs, ranking the results and identifying a top subset of likely genes, and permutation testing to determine an empirical p-value. Also, expanding the work by Benjamini and Hochberg (Benjamini & Hochberg, 1995), Efron and Tibshirani (Abi-Dargham et al., 2002) proposed controlling the proportion of false positives among the positive results rather than the risk of any false positive across all tests (the False Discovery Rate approach, FDR), thus partially compensating for the overcorrection of the more traditional FWER methods.
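To make the contrast between FWER and FDR control concrete, the following is a minimal sketch of the Benjamini and Hochberg (1995) step-up procedure. The function name and the toy p-values are illustrative assumptions, and the sketch presumes independent or positively dependent tests, as the original procedure does.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q (step-up rule)."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k whose p-value satisfies p_(k) <= k * q / m.
        if p_values[idx] <= rank * q / m:
            largest_rank = rank
    # Reject every hypothesis ranked at or below that largest rank.
    return sorted(order[:largest_rank])


# Toy p-values (invented for illustration only).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fdr_hits = benjamini_hochberg(toy_p, q=0.05)
bonferroni_hits = [i for i, p in enumerate(toy_p) if p <= 0.05 / len(toy_p)]
print("FDR rejections:", fdr_hits, "Bonferroni rejections:", bonferroni_hits)
```

On these toy values the FDR rule retains more findings than a Bonferroni cut-off at the same nominal level, which is the sense in which it is less conservative.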
It is important to re-emphasize that all these proposed methods address the evaluation of a single SNP at a time. If two or more SNPs are in (strong) LD with each other and show a similar pattern of significant association with the trait of interest, it is appropriate to use a less stringent threshold (Ziegler et al., 2008). Alternatively, one can avoid putting forth a prespecified formal significance threshold and present all results ordered from lowest to highest p-value (Almasy et al., 2008b; Helgadottir et al., 2007).
In a GWAS, a false positive results in the additional cost of following up the initial finding, which is always necessary given the exploratory nature of GWAS. The consequences of a false positive in a GWAS are less severe than those of a false positive in a pivotal clinical trial, which could lead to a potentially dangerous treatment (Rothman, 1998). Importantly, any correction for multiple testing that limits false positives also raises the false negative rate, reducing power to detect a true association (Samani et al., 2007). The latter may be the more serious error, in that an important finding is prematurely dismissed.
Several other approaches are being developed to determine which GWAS significance threshold is most appropriate to a specific research question. It is important to keep in mind that in a GWAS each SNP is its own hypothesis; a GWAS involves testing hundreds of thousands of hypotheses. This is a fundamentally different question from testing the same hypothesis a million times. Bonferroni corrections are more suited to the latter case and are not well suited to the testing of many different hypotheses (WTCCC, 2007). Perneger (1998) made the case that Bonferroni adjustments are concerned with the wrong hypothesis (i.e., that all null hypotheses are true simultaneously, which is not of interest) and increase the likelihood of type II errors, concluding that the Bonferroni method may create more problems than it solves. Appropriate correction for multiple statistical tests is an area of intense statistical research (Dudoit S, 2008). Despite the lack of consensus, various practical approaches are in use to address the problem of multiple testing and to determine the appropriate GWAS threshold, several of which are briefly summarized here.
Permutation methods (e.g., Manly, 1997) offer the possibility of using the collected dataset to determine the statistical threshold empirically, for both case-control and QT phenotypes. Although permutation testing is computationally expensive, current computing resources are sufficient to perform it.
While permutation methods are a broad class of techniques, in GWAS applications they are used to determine how often an F or chi-square statistic at least as large as the observed one would arise under the null hypothesis of no genetic effect. With a permutation approach, the “labels” identifying cases and controls (or a given measure for a quantitative phenotype) are randomly reassigned, the original analysis is redone, and the resulting statistic is recorded; when this is done many thousands or hundreds of thousands of times, the distribution of the test statistic under the null hypothesis for that specific dataset is approximated. This data-derived distribution can deviate significantly from the a priori distribution. The probability of the original statistic arising by chance in the current sample can then be estimated empirically.
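The label-shuffling procedure described here can be sketched for a single SNP in a case-control design. This is a minimal illustration under stated assumptions: genotypes are reduced to carrier status (0/1), the statistic is a Pearson chi-square on the resulting 2×2 table, and the function names and toy data are invented for the example rather than taken from any GWAS package.

```python
import random


def chi_square_2x2(carrier, case_status):
    """Pearson chi-square statistic for carrier status (0/1) by case status (0/1)."""
    counts = [[0, 0], [0, 0]]
    for g, y in zip(carrier, case_status):
        counts[g][y] += 1
    n = len(case_status)
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (counts[i][j] - expected) ** 2 / expected
    return stat


def permutation_p_value(carrier, case_status, n_perm=10_000, seed=0):
    """Empirical p-value obtained by shuffling the case/control labels."""
    rng = random.Random(seed)
    observed = chi_square_2x2(carrier, case_status)
    shuffled = list(case_status)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign labels at random; genotypes stay fixed
        # Small tolerance guards against floating-point ties.
        if chi_square_2x2(carrier, shuffled) >= observed - 1e-12:
            exceed += 1
    # Adding 1 to numerator and denominator avoids reporting a p-value of zero.
    return (exceed + 1) / (n_perm + 1)


# Toy data (invented): 20 carriers, 18 of them cases; 20 non-carriers, 2 of them cases.
toy_carrier = [1] * 20 + [0] * 20
toy_status = [1] * 18 + [0] * 2 + [1] * 2 + [0] * 18
toy_p = permutation_p_value(toy_carrier, toy_status, n_perm=2_000)
print("empirical p-value:", toy_p)
```

The smallest p-value this estimator can return is 1/(n_perm + 1), so the number of permutations bounds the resolution of the empirical p-value; this is one reason why very many permutations are needed at genome-wide thresholds.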
In a dichotomous case-control study, when testing the main genetic effect on case-control status, the case-control designation is usually permuted across subjects; these methods are now standard in GWAS software (e.g., PLINK release v1.03 (Purcell et al., 2007)). However, for quantitative trait interactions it is not clear whether the QT measure (Y) should be permuted, or what level of randomization (full, reduced, or constrained) is required. If the significance of an interaction term between SNP and diagnosis is being assessed, for example, the method may permute residuals for that term rather than the original raw data. Depending on the precise factorial design, full permutation may not be appropriate; instead, observations are permuted within exchangeable units of the design (for more detail on these issues, see for example Manly, 1997; Anderson & Ter Braak, 2002; Jung, Jhun, & Song, 2006).
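Permutation within exchangeable units can be sketched as a restricted shuffle in which values move only among observations that share a block label. The helper below is a hypothetical illustration; treating diagnosis group as the block is an assumption made for the example, and the appropriate exchangeable units depend on the actual design.

```python
import random
from collections import defaultdict


def permute_within_blocks(values, blocks, rng):
    """Shuffle values only among observations that share the same block label."""
    by_block = defaultdict(list)
    for idx, label in enumerate(blocks):
        by_block[label].append(idx)
    permuted = list(values)
    for indices in by_block.values():
        block_values = [values[i] for i in indices]
        rng.shuffle(block_values)
        # Write the shuffled values back into the positions of this block only.
        for i, v in zip(indices, block_values):
            permuted[i] = v
    return permuted


# Toy quantitative trait values with diagnosis as the (assumed) exchangeable unit.
trait_values = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
diagnosis = ["control", "control", "control", "patient", "patient", "patient"]
restricted = permute_within_blocks(trait_values, diagnosis, random.Random(0))
print(restricted)
```

Because values never cross block boundaries, any difference between blocks is preserved in every permuted dataset, so the resulting null distribution reflects only the effect being tested within those units.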
With the increasing interest in identifying improved multiple testing procedures for GWAS, we suggest a possible strategy for modifying existing non-parametric inferential procedures to fit the statistical assumptions of the interaction analysis. For illustrative purposes, we propose randomly selecting a sample of SNPs (e.g., 1%, or 5,000 of the 500,000 SNPs in the GWAS) and performing the permutation test 1,000 times for each of the 5,000 SNPs, creating a set of 5 million interaction p-values that represents a null distribution for a random sample of SNPs in the GWAS. From this null distribution, the p-value marking the most extreme 5% or 1% can be set as an empirically determined, genome-wide threshold. The number of SNPs and permutations required to achieve the desired power will depend on sample size. This combines the strength of the permutation testing method with the GWAS data to determine an underlying null distribution for the full GWAS (see Weinberger presentation to Psych Gen NYC 2008).
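The subsampling strategy outlined in this paragraph might be sketched as follows. Several elements are assumptions made for illustration only: the function names, the simulated toy data, the use of one shared permutation of the trait across all sampled SNPs in each round, and in particular the correlation-based p-value, which is a simple stand-in for whatever interaction test the study actually uses.

```python
import math
import random


def correlation_p_value(x, y):
    """Two-sided p-value for a Pearson correlation (Fisher z approximation).

    Stand-in for the study's own association or interaction test.
    """
    n = len(x)
    if n <= 3:
        raise ValueError("need more than 3 observations")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # no variation, hence no evidence of association
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(sxx * syy)
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def empirical_threshold(genotype_matrix, phenotype, n_snps, n_perm,
                        level=0.05, p_value_fn=correlation_p_value, seed=0):
    """Pool null p-values from a random subset of SNPs and return their lower quantile."""
    rng = random.Random(seed)
    sampled = rng.sample(range(len(genotype_matrix)), n_snps)
    permuted = list(phenotype)
    null_p = []
    for _ in range(n_perm):
        rng.shuffle(permuted)  # break any genotype-phenotype link
        for snp in sampled:
            null_p.append(p_value_fn(genotype_matrix[snp], permuted))
    null_p.sort()
    k = max(int(level * len(null_p)) - 1, 0)
    return null_p[k]


# Simulated toy data: 200 SNPs coded 0/1/2 for 100 subjects, with a random trait.
sim = random.Random(1)
toy_genotypes = [[sim.choice((0, 1, 2)) for _ in range(100)] for _ in range(200)]
toy_trait = [sim.gauss(0.0, 1.0) for _ in range(100)]
threshold_5pct = empirical_threshold(toy_genotypes, toy_trait, n_snps=20, n_perm=100)
print("empirical 5% threshold:", threshold_5pct)
```

With 5,000 SNPs and 1,000 permutations, as in the illustration in the text, the pooled list would hold 5 million null p-values; the toy sizes here are kept small only so the sketch runs quickly.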
Additional SNPs in the same region
Combinations of SNPs can also be considered, through various “haplotype-based” approaches (Seasholtz et al., 2006), though these have not been applied to a GWAS with a QT. The greater the number of significant tagging SNPs in proximity to one another, the smaller the likelihood that a finding is a false positive. As chromosomal regions rather than individual SNPs represent the inherited units into which an inherited genome can be partitioned, finding several significant tagging SNPs in physical proximity provides additional support for the locus being a true risk factor. The focus on tagging SNPs is critical: adjacent SNPs are more likely to be in strong linkage disequilibrium, so finding many close SNPs significant may not be particularly informative, since they carry the same information. Haplotype approaches may also be able to capture the effects of rare SNPs, in the 1–5% frequency range (de Bakker et al., 2005; Kamatani et al., 2004; Lin, Chakravarti, & Cutler, 2004). The haplotype approach known as the “sliding window” (Fallin & Schork, 2000) can be used to detect the effects of rare alleles, using the extended block method of Purcell (Purcell et al., 2007). A complete catalogue of all the low-frequency SNPs and CNVs discovered will improve the pinpointing of genes, structural variants in chromosomes, and other individual genomic variations that are associated with disease.