In order to highlight biologically relevant associations from a GWAS, which may otherwise be obscured by the large number of tests, we have developed a method for incorporating a broad array of genomic data into the prioritization of SNPs for further study. In this instance we targeted nicotine dependence, but applications to other phenotypes are straightforward: one need only specify a different set of biologically relevant genes. Because investigators are typically drawn to signals in biologically relevant genomic regions, our method of establishing a priori hypotheses will protect against post hoc reasoning and legitimize the prioritization process (Chanock et al., 2007).
The SNP rs16969968 ranked number one in our prioritization of the NicSNP results and was the seventh most biologically relevant out of all common HapMap SNPs. Since this SNP was first reported by the NicSNP study (Saccone et al., 2007), extensive evidence has accumulated from independent datasets that it is associated with nicotine dependence and closely related smoking phenotypes (Berrettini et al., 2008; Bierut et al., 2008; Thorgeirsson et al., 2008) and with lung cancer (Amos et al., 2008; Hung et al., 2008). With the exception of Hung and colleagues, who genotyped and reported an association at rs16969968, these studies did not genotype this SNP; the reported associations were with LD proxies for rs16969968. It is notable to see convincing, replicated evidence of association for a SNP with such strong biological relevance. However, it is still not clear to what extent known biology will predict variants that influence disease. Our prioritization method is not designed to act as a predictor, but to preferentially select biologically relevant signals when resources are limited, either for genotyping or for functional studies in the laboratory.
Six of the top 10 SNPs, as ranked by the prioritization method, are also among the top 10 SNPs as ranked by the original non-weighted P-value. Furthermore, all 10 of these SNPs would have been selected by the straight P-value method as long as the top 300 SNPs ranked by straight P-value were followed up. Therefore, an important question is how many of these 10 SNPs are true associations. At the time of writing, there are multiple published replications of the NicSNP result for rs16969968 in independent samples, either by directly genotyping this SNP itself or through strong LD proxies (Berrettini et al., 2008; Bierut et al., 2008; Thorgeirsson et al., 2008). The distinct NicSNP result at rs578776 also shows evidence of published replication through an LD proxy (Berrettini et al., 2008; Thorgeirsson et al., 2008). Also at the time of writing, two GWAS of nicotine dependence or smoking quantity other than NicSNP have been published (Berrettini et al., 2008; Thorgeirsson et al., 2008), but the complete results (P-values) for these studies are not yet available. An important future task will be to determine how true associations fare in the GIN prioritization selection method compared with the straight P-value method. For example, do the true associations tend to occur in (or above) the knife-shaped region, or in (or above) the triangular region? More generally, it will be interesting to study the distribution of prioritization scores among all known true associations for any disease. This will require the configuration of new GINs for other diseases for which a GWAS has been conducted.
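The comparison between the two rankings described above can be made concrete with a small sketch. It is illustrative only: the function names and the toy SNP identifiers are assumptions, not part of the published method. It computes which SNPs two ranked lists share in their top k, and how deep one must go in one ranking to capture a chosen set of SNPs from the other.

```python
def top_k_overlap(rank_a, rank_b, k):
    """Return the SNPs that appear in the top k of both ranked lists."""
    return set(rank_a[:k]) & set(rank_b[:k])


def depth_to_capture(targets, ranking):
    """Return the smallest prefix length of `ranking` containing every target SNP."""
    position = {snp: i + 1 for i, snp in enumerate(ranking)}  # 1-based ranks
    return max(position[snp] for snp in targets)


# Toy rankings with hypothetical SNP identifiers (not real data).
by_priority = ["snpA", "snpB", "snpC", "snpD"]
by_pvalue = ["snpB", "snpE", "snpA", "snpF", "snpD", "snpC"]

shared = top_k_overlap(by_priority, by_pvalue, 3)
depth = depth_to_capture(by_priority[:3], by_pvalue)
```

With real data, `shared` would correspond to the count of SNPs common to both top-10 lists, and `depth` to the number of SNPs that would have to be followed up by straight P-value in order to include the prioritized set.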
Our GIN prioritization method offers one particular strategy for prioritizing various forms of a priori evidence. Different studies will have different preferences for incorporating this kind of data. Our scoring system is flexible, and can be configured to accommodate a variety of objectives.
Other methods have been proposed for prioritizing SNPs once various parameters, analogous to our prioritization scores, have been established (Chen et al., 2007; Curtis et al., 2007; Lewinger et al., 2007). However, it is unclear how to go from one parameter system to the other, and it is therefore difficult to compare methodologies. This will be studied in future iterations of the method.
There are many other forms of genomic annotation and biological data that could be incorporated into a GIN. For example, the top-ranked SNP rs16969968 changes residue 398 of the protein from aspartic acid (encoded by the G allele) to asparagine (encoded by A, the risk allele), which changes the charge of this amino acid in the α5 subunit (Cserzo et al., 1997). It would be straightforward to adjust the link index of the gene node in order to incorporate this additional data into the GIN. There are many other publicly available resources on SNP functional properties (Jegga et al., 2007; Jiang et al., 2007; Lee and Shatkay, 2007; Wang et al., 2006; Yuan et al., 2006), and tools for nominating and prioritizing genes biologically relevant to a disease (Adie et al., 2006; Gaulton et al., 2007; Masotti et al., 2007).
No matter how many databases we incorporate, our method will always be limited to using known biology. There may be unknown biological mechanisms driving these associations, and these may fail to be discovered if too much emphasis is placed on current biological knowledge. For example, while the gene node in the GIN prioritizes SNPs using the dbSNP criteria of being within 2 kb of the 5′ end and 500 bp of the 3′ end of a gene, it has been demonstrated that some genes have regulatory regions as far as 8 kb upstream (Blackwood and Kadonaga, 1998). However, the basic premise of this method is to lead with the strongest biological information available while allowing the more significant signals to be included for further study, even if they reside in regions of apparently low biological relevance. We believe this is a practical procedure for resource-limited situations.
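The gene-window criterion mentioned above (within 2 kb of the 5′ end and 500 bp of the 3′ end) can be sketched as a strand-aware interval test. This is a minimal illustration, not the GIN implementation: the `Gene` class, the function name, and the coordinates are hypothetical, and the sketch assumes `start < end` on the forward-strand coordinate system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    start: int   # lowest genomic coordinate of the gene
    end: int     # highest genomic coordinate of the gene
    strand: str  # "+" or "-"


def in_gene_window(pos, gene, upstream=2000, downstream=500):
    """Return True if `pos` lies in the gene or its flanking window.

    `upstream` extends from the 5' end and `downstream` from the 3' end,
    so the two flanks swap sides for genes on the minus strand.
    """
    if gene.strand == "+":
        lo, hi = gene.start - upstream, gene.end + downstream
    else:
        lo, hi = gene.start - downstream, gene.end + upstream
    return lo <= pos <= hi


# Hypothetical genes at the same coordinates on opposite strands.
plus_gene = Gene(start=10_000, end=20_000, strand="+")
minus_gene = Gene(start=10_000, end=20_000, strand="-")

near_5prime_plus = in_gene_window(8_000, plus_gene)      # exactly 2 kb upstream
near_5prime_minus = in_gene_window(22_000, minus_gene)   # 2 kb upstream on "-"
far_upstream_default = in_gene_window(3_000, plus_gene)  # 7 kb upstream
far_upstream_wide = in_gene_window(3_000, plus_gene, upstream=8_000)
```

Making `upstream` a parameter shows how the window could be widened, for instance to 8 kb, to admit more distant regulatory regions, at the cost of assigning more SNPs to each gene.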
Our method does not incorporate information regarding the number of potential associations detected in or near a gene. For example, it is known that even for the single-gene disorder of cystic fibrosis, there are over 500 different mutant alleles (Zielenski and Tsui, 1995). It would be useful to integrate an additional mechanism into the prioritization process that gives additional weight to genes with multiple SNP associations (the number of associations would have to be corrected for LD). This will be studied in future iterations of the GIN prioritization method.
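One possible way to correct the number of associations in a gene for LD is greedy pruning: take SNPs in order of increasing P-value and keep a SNP only if its r² with every SNP already kept is below a threshold. The sketch below is an assumption for illustration and is not a method proposed in this work; the function name, the r² threshold of 0.8, and the toy values are all hypothetical.

```python
def independent_signals(pvalues, r2, threshold=0.8):
    """Greedily select SNPs that are not in strong LD with a stronger signal.

    pvalues: dict mapping SNP id -> association P-value (one gene's SNPs).
    r2: dict mapping frozenset({snp1, snp2}) -> pairwise r^2; missing
        pairs are treated as r^2 = 0 (an assumption of this sketch).
    """
    kept = []
    for snp in sorted(pvalues, key=pvalues.get):  # most significant first
        if all(r2.get(frozenset((snp, other)), 0.0) < threshold for other in kept):
            kept.append(snp)
    return kept


# Toy values for three hypothetical SNPs in one gene.
pvalues = {"s1": 1e-6, "s2": 1e-5, "s3": 1e-4}
r2 = {
    frozenset(("s1", "s2")): 0.95,  # s2 is a proxy for s1
    frozenset(("s1", "s3")): 0.05,
    frozenset(("s2", "s3")): 0.10,
}

signals = independent_signals(pvalues, r2)
n_signals = len(signals)
```

The resulting count could then feed into a per-gene weight, but how such a weight should enter the prioritization score is left open here, as in the text.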
A GWAS cannot viably detect complex interactions between genes because statistical power is low after adjustment for the very large number of tests. There are now many public databases that provide data on biochemical pathways and metabolic networks (Altman, 2007; Arakawa et al., 2005; Harris et al., 2004; Karp et al., 2005; Mi et al., 2007; Vastrik et al., 2007; von Mering et al., 2007). In future iterations of the method the GIN model will be generalized to prioritize tests of gene–gene interaction, and will incorporate these databases to elucidate the intricate genetic structure of complex disease (Thomas, 2005, 2006a, 2006b).