Gene set analysis (GSA; “enrichment”) is a popular approach for the interpretation of genome-wide analyses. GSA is most commonly applied to the analysis of transcriptomes, but from the outset it has been considered useful for any study that provides rankings or “hit lists” of genes. The recent review by Mooney et al. is a valuable resource for geneticists wishing to apply gene set analysis to the output of GWAS. Here we describe some additional points of practical importance if the methods are to be applied and interpreted soundly.
As described by Mooney et al., associating a gene with a SNP requires making some assumptions about relative location, unless the functional variant is known. But all the assignment methods described by Mooney et al. can lead a single variant to implicate more than one gene. This is not problematic from a biological standpoint, but if those genes share any annotation used as input for GSA, the statistical significance of the shared annotation will be inflated. As described by Mooney et al., one aim of GSA is to capture the distributed nature of the heritability of the trait across multiple loci. Counting the same locus multiple times defeats this purpose. Put another way, the assumption (inherent in many GSA methods) that the genes are statistically independent can be violated in a particularly insidious way. Few of the methods and tools reviewed by Mooney et al. appear to address this problem.
This “multiple counting” problem has practical impact: it led to the recent retraction of a GWAS of memory, in which the primary finding was the significance of the Gene Ontology (GO) term “synapse organization and biogenesis”. In this study a single SNP in the PCDHB cluster on chromosome 5 was assigned to at least eight PCDHB cluster members. Because those genes are very similar in their annotations, a GO term they shared reached statistical significance; without the duplication, it does not. Based on our own experience and discussions with other genomics and genetics research groups, this is a common occurrence (protocadherins seem especially problematic). The same issue crops up in genome-wide methylation studies (“EWAS”), in which CpGs are analyzed rather than SNPs. A remedy is to collapse the GO annotations for all genes assigned to a SNP or CpG into a single “meta-gene” analysis unit, rather than using the default gene-to-annotation mappings. Computationally intensive sample permutation methods should be considered, but the simple meta-gene approach will avoid much of the trouble.
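To make the effect of multiple counting concrete, the following is a minimal sketch of the meta-gene idea, not the procedure used in any of the studies discussed. It assumes a simple one-sided hypergeometric (over-representation) test, which is only one of many GSA statistics. The function names (`hypergeom_sf`, `enrichment_p`, `collapse_to_meta_genes`), the variant and gene identifiers, and all of the data are hypothetical and synthetic.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N units are drawn from M, of which n carry the term."""
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / total


def enrichment_p(unit_to_terms, hit_units, term):
    """One-sided over-representation p-value for `term` among `hit_units`."""
    M = len(unit_to_terms)
    n = sum(term in terms for terms in unit_to_terms.values())
    N = len(hit_units)
    k = sum(term in unit_to_terms[u] for u in hit_units)
    return hypergeom_sf(k, M, n, N)


def collapse_to_meta_genes(variant_to_genes, gene_to_terms):
    """Pool the annotations of all genes assigned to a variant into one unit."""
    return {
        variant: set().union(*(gene_to_terms.get(g, set()) for g in genes))
        for variant, genes in variant_to_genes.items()
    }


TERM = "synapse organization"  # stand-in label for a shared GO term

# Synthetic data (not from any study): 100 tested variants. Variant s0 maps to
# eight similar cluster genes; every other variant maps to exactly one gene.
variant_to_genes = {"s0": [f"cluster_gene{i}" for i in range(8)]}
variant_to_genes.update({f"s{i}": [f"g{i}"] for i in range(1, 100)})

# All eight cluster genes share the term; so do four unrelated genes (g1..g4).
gene_to_terms = {g: {TERM} for g in variant_to_genes["s0"]}
gene_to_terms.update(
    {f"g{i}": ({TERM} if i <= 4 else set()) for i in range(1, 100)}
)

hit_variants = ["s0", "s10", "s11", "s12", "s13"]

# Naive analysis: each gene is a unit, so s0 contributes eight annotated "hits".
hit_genes = [g for v in hit_variants for g in variant_to_genes[v]]
naive_p = enrichment_p(gene_to_terms, hit_genes, TERM)

# Meta-gene analysis: each variant is one unit, so s0 contributes a single hit.
meta = collapse_to_meta_genes(variant_to_genes, gene_to_terms)
meta_p = enrichment_p(meta, hit_variants, TERM)

print(f"naive p = {naive_p:.2e}, meta-gene p = {meta_p:.2f}")
```

In this toy setting the gene-level p-value is orders of magnitude smaller than the meta-gene p-value, although only one locus carries any signal. Note that the background is collapsed in the same way as the hit list, so that both are counted in the same units.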
The second issue concerns the conceptual coherence of GSA and the interpretation of its results. For the most part, GSA results are treated as exploratory add-ons to primary findings. In such situations mistakes or problems in using GSA are not of major consequence. But there is a temptation for researchers to salvage negative or underpowered studies (in genetics, epigenetics or transcriptomics) by appealing to groups of genes. This was apparently the approach of Dixson et al., who had a sample size of a few hundred individuals, too small to yield SNPs reaching genome-wide significance. They are not alone, and enrichment results have been reported as a primary result in other studies [4,5]. But we must strongly stress the dangers. As Mooney et al. point out, there is no agreement on what gene sets to use, and sources differ dramatically even when they are attempting to describe the same concepts. Equally problematic, sources such as GO can change rapidly, which can lead to unstable results [7–10]. Dixson et al. used GO annotations dating from 2008, and the GO group they discuss now has at least 59 genes, not 23 as reported. While the impact is unknown in this case, the incomplete, changeable, conflicting and partly arbitrary nature of gene annotations should be taken into account before treating them as units of analysis with biological meaning. Furthermore, one cannot easily defend assigning biological significance to specific gene set members without considering the strength of association at the gene level. Dixson et al., for example, expressed strong interest in genes in the “synaptic organization” set that had nominal (uncorrected) p-values of 0.1 or higher. It seems risky to consider such genes of interest merely because they share an annotation with a locus that does have a signal.
Finally, GSA is highly questionable if there is no evidence for any association signal at all (i.e., the SNP p-value distribution is uniform, as appears to be the case for at least one of the disorders considered in ). For all these reasons, GSA should be used as a replacement for variant-level analysis only with trepidation.
We thank Paul Thomas (USC) and Jesse Gillis (CSHL) for discussion and comments on drafts of the manuscript. Supported by NIH grant GM076990 to PP.