In this approach, investigators identify a group of functionally related genes and then apply multivariable analysis techniques to the markers in these genes.[10,19–23] The hope may be that, although none of the individual markers was credibly associated with the studied trait, the aggregate evidence for association may be strong. Alternatively, a more flexible analysis that incorporates nonlinear relations between the coded marker genotypes and the trait may uncover additional compelling evidence for association. The paper by Lesnick et al,[24] discussed by Breitling et al, took this approach, analyzing single nucleotide polymorphisms (SNPs) in genes involved in the axon guidance process using a stepwise regression approach that incorporated SNP main effects and product interaction terms.
As Breitling et al point out, extreme care must be taken when data-mining techniques that search over a large set of possible models are applied to data sets with a large number of markers. (The “axon guidance pathway” contained between 1,195 and 1,460 markers, depending on the data set analyzed.) Failing to appropriately correct for the model selection procedure can lead to drastically downwardly biased P-values and to overestimates of how precisely the fitted model will predict trait values in a new data set. Breitling et al use a permutation procedure to correct for this overfitting, and report a P-value for association that is 35 orders of magnitude larger than that presented by Lesnick et al in their abstract.[24] We note that in the presence of population stratification or differential genotyping errors, a simple permutation procedure that randomizes phenotypes against genotypes may still yield downwardly biased P-values. Permutations should be done within strata assumed to be homogeneous with respect to ancestry or genotyping errors, if possible.
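The within-strata permutation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by Breitling et al or Lesnick et al: the function names, the toy search statistic `max_abs_covariance`, and the data layout (one list of genotype codes per marker, one stratum label per subject) are all hypothetical. The point it illustrates is that the entire model search is rerun on each permuted data set, and that phenotypes are shuffled only among subjects who share a stratum.

```python
import random


def max_abs_covariance(genotypes, phenotype):
    """Toy stand-in for a model search: the largest absolute
    marker-trait covariance over all markers (hypothetical statistic)."""
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    best = 0.0
    for marker in genotypes:  # one list of genotype codes per marker
        mean_g = sum(marker) / n
        cov = sum((g - mean_g) * (y - mean_y)
                  for g, y in zip(marker, phenotype)) / n
        best = max(best, abs(cov))
    return best


def permute_within_strata(phenotype, strata, rng):
    """Shuffle phenotypes only among subjects in the same stratum."""
    members = {}
    for idx, stratum in enumerate(strata):
        members.setdefault(stratum, []).append(idx)
    permuted = list(phenotype)
    for idxs in members.values():
        values = [phenotype[i] for i in idxs]
        rng.shuffle(values)
        for i, v in zip(idxs, values):
            permuted[i] = v
    return permuted


def stratified_permutation_pvalue(genotypes, phenotype, strata,
                                  statistic, n_perm=1000, seed=0):
    """Permutation P-value; the full search statistic is recomputed
    on every permuted data set so that selection is accounted for."""
    rng = random.Random(seed)
    observed = statistic(genotypes, phenotype)
    exceed = 0
    for _ in range(n_perm):
        permuted = permute_within_strata(phenotype, strata, rng)
        if statistic(genotypes, permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_perm + 1)
```

In practice the `statistic` argument would wrap the whole stepwise selection procedure, including any handling of missing data, so that each permutation repeats every data-dependent choice made in the original analysis.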
An alternative approach to assess statistical significance is to compare the observed test statistic for the pathway to the distribution of the test statistic across multiple synthetic “pathways”: sets of genes randomly drawn from the genome (and assumed not to be associated with the studied trait). Although this procedure might account for population stratification bias or genotyping errors (assuming these affect the tested pathway and the rest of the genome similarly), it comes with its own set of potential biases. In particular, this approach could be confounded by differences in characteristics between the tested and synthetic pathways, such as differences in gene size, linkage disequilibrium, etc. Ideally, the synthetic pathways should be matched to the tested pathway on these characteristics. Lesnick et al implemented this approach by randomly selecting individual SNPs from the genome and found that the observed test statistic for the axon guidance pathway was far larger than for any of the 4000 random marker sets simulated. However, by drawing markers randomly from their genome-wide data set, the authors ensured that the linkage disequilibrium patterns in the axon guidance pathway and the synthetic pathways differed greatly.
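As a rough sketch of the matched comparison suggested above, the following draws synthetic pathways gene by gene, pairing each pathway gene with a randomly chosen gene outside the pathway from the same matching bin. The names `matched_synthetic_pathway` and `empirical_pvalue`, and the idea of a precomputed `gene_bin` mapping (for example, bins of gene size or local linkage disequilibrium), are assumptions made for illustration; this is not the sampling scheme Lesnick et al used, which drew individual SNPs.

```python
import random


def matched_synthetic_pathway(pathway_genes, gene_bin, rng):
    """Draw one synthetic pathway: for each pathway gene, pick an unused
    gene outside the pathway that falls in the same matching bin."""
    excluded = set(pathway_genes)
    pool = {}
    for gene, bin_label in gene_bin.items():
        if gene not in excluded:
            pool.setdefault(bin_label, []).append(gene)
    chosen, used = [], set()
    for gene in pathway_genes:
        candidates = [g for g in pool.get(gene_bin[gene], [])
                      if g not in used]
        if not candidates:
            raise ValueError(f"no unused matched gene available for {gene}")
        pick = rng.choice(candidates)
        used.add(pick)
        chosen.append(pick)
    return chosen


def empirical_pvalue(observed, null_stats):
    """Share of synthetic-pathway statistics at least as large as the
    observed one, with an add-one correction."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


# Hypothetical usage with made-up gene labels and bins.
rng = random.Random(0)
gene_bin = {"A": 0, "B": 1, "C": 0, "D": 1, "E": 0}
synthetic = matched_synthetic_pathway(["A", "B"], gene_bin, rng)
```

With 4000 synthetic sets and none exceeding the observed statistic, `empirical_pvalue` returns 1/4001, which is the smallest value this comparison can support regardless of how extreme the observed statistic is.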
Data-mining techniques typically require no missing data on predictors or outcome variables. Lesnick et al implemented a form of complete-case analysis, restricting the models tested at each stage of their stepwise procedure to those with less than a certain amount of missing data. Breitling et al had difficulty replicating Lesnick et al’s procedure, and noted that small differences in tuning parameters could lead to quite different results, in both the significance of the overall test for association and the markers selected in the final model. This is a disconcerting property for any analysis, and suggests that its conclusions may not be reliable. As data sharing becomes more widespread in the genetic epidemiology community,[25,26] other investigators will be able to replicate published analyses and assess their sensitivity to different modeling assumptions, as Breitling et al have done. This should lead to more robust scientific conclusions in the long run. We are happy to note that the GWAS data used in Lesnick et al are now available to qualified researchers through dbGaP (Study Accession: phs000048.v1.p1).
Replication of analytic results in an independent data set can provide some assurance that the observed association is not a chance false positive, and provide an unbiased estimate of the prediction accuracy of the fitted model. However, applying a procedure with markedly downwardly biased P-values to 2 data sets and getting a “significant” P-value in each does not constitute replication—especially when there is little overlap between the sets of genes represented in the final models. A more compelling demonstration would be to show that the model fit in the first data set effectively predicts the trait in the second.
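One way to make the last point concrete is to freeze the model fitted in the first data set and score subjects in the second without any refitting or reselection. The sketch below assumes a linear risk score with fixed per-marker weights and a binary trait coded 1 for cases and 0 for controls, and summarizes prediction by the area under the ROC curve; the function names and the choice of AUC are illustrative assumptions rather than anything specified in the papers discussed.

```python
def risk_score(weights, genotype_row):
    """Linear score using weights estimated in the first data set."""
    return sum(w * g for w, g in zip(weights, genotype_row))


def auc(scores, labels):
    """Probability that a randomly chosen case scores higher than a
    randomly chosen control, counting ties as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def validate_frozen_model(weights, genotypes, labels):
    """Apply the frozen model to the second data set; no parameter is
    re-estimated and no marker is reselected."""
    scores = [risk_score(weights, row) for row in genotypes]
    return auc(scores, labels)
```

An AUC near 0.5 in the second data set would suggest that the model carries little predictive information, however small the P-value obtained when the search procedure is rerun there.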