Genome-wide association studies (GWAS) with sufficiently large replication components are rapidly identifying regions of the genome that are convincingly associated with risk of cancer and other diseases (Manolio et al. 2008; Rahim et al. 2008). Because most GWAS take an approach that is unbiased with respect to function, the genotyped SNPs associated with a disease or trait are not necessarily the functional variants; rather, they are viewed as markers correlated with the true causative variant(s). The identification of a marker association represents the beginning of a process to define the causal variants through functional analyses and molecular phenotyping.
Characterizing all genetic polymorphisms within a region, as we have done here, is a critical next step after GWAS. SNP markers discovered or further characterized as part of such efforts require subsequent genotyping in sufficiently large sample sets to refine association signals prior to dedicated laboratory analysis. Characterizing all common genetic variants before undertaking large-scale fine-mapping studies has two advantages: (1) all common genetic variants can be represented using a tag SNP approach, and (2) the correlations among all genetic variants will be known, which allows rapid nomination for functional studies of the variants most highly correlated with the markers most strongly associated with disease.
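The second advantage can be illustrated with a minimal sketch that ranks variants by their pairwise r² with an associated marker. It assumes phased, biallelic haplotypes coded 0/1; the SNP names and the eight toy chromosomes are hypothetical and are not data from this study.

```python
from typing import Dict, List, Tuple


def r_squared(a: List[int], b: List[int]) -> float:
    """Pairwise r^2 between two biallelic sites, given phased 0/1 alleles
    listed in the same chromosome order."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("allele vectors must be non-empty and of equal length")
    p_a = sum(a) / n                               # frequency of allele 1 at site A
    p_b = sum(b) / n                               # frequency of allele 1 at site B
    p_ab = sum(x & y for x, y in zip(a, b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic site has undefined r^2; treated as 0 here
    d = p_ab - p_a * p_b                           # LD coefficient D
    return d * d / denom


def rank_by_ld(index_snp: str,
               haplotypes: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    """Sort all other SNPs by decreasing r^2 with the index (associated) marker."""
    scores = [(snp, r_squared(haplotypes[index_snp], alleles))
              for snp, alleles in haplotypes.items() if snp != index_snp]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 8 phased chromosomes, 4 SNPs.
toy = {
    "marker": [1, 1, 1, 1, 0, 0, 0, 0],
    "snp_a":  [1, 1, 1, 1, 0, 0, 0, 0],   # identical to the marker
    "snp_b":  [1, 1, 1, 0, 0, 0, 0, 0],   # partially correlated
    "snp_c":  [1, 0, 1, 0, 1, 0, 1, 0],   # uncorrelated with the marker
}
ranking = rank_by_ld("marker", toy)
```

In this toy example the ranking is snp_a (r² = 1.0), snp_b (r² = 0.6), then snp_c (r² = 0); variants at the top of such a list would be the first candidates considered for functional follow-up.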
At least four regions of human chromosome 8q24 have recently been implicated in the risk of prostate, breast, and colorectal cancer (Amundadottir et al. 2006; Freedman et al. 2006; Gruber et al. 2007; Gudmundsson et al. 2007; Haiman et al. 2007a, b; Schumacher et al. 2007; Yeager et al. 2007; Zanke et al. 2007) and colorectal adenoma (Berndt et al. 2008). Replication of markers in 8q24 has been robust, but candidate variants for functional analyses remain elusive, especially in a region with a dearth of candidate genes. The recent emergence of next-generation sequencing technologies provides an unprecedented avenue to characterize genetic variation quickly and relatively inexpensively in fairly large genomic regions, using medium-sized sample sets designed to detect common (>1%) variants with high probability. In this report, we have used the 454/Roche next-generation sequencing technology to characterize, with high confidence and high quality, all common variation within two of these regions (chr8: 128,473,000–128,609,802), most likely including the variant(s) associated with prostate and colorectal cancers. We have determined that this region of 8q24 contains 780 common SNPs, 454 of which have a MAF ≥ 0.05.
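As a rough illustration of why a medium-sized sample set can detect common variants with high probability, the sketch below computes the chance that a variant of a given MAF appears at least once among the 158 chromosomes sequenced in this study. It is a simplified calculation that assumes independent chromosomes and perfect variant calling (no loss from sequencing error or uneven coverage); the function name is hypothetical.

```python
def detection_probability(maf: float, n_chromosomes: int) -> float:
    """Probability that the minor allele is present on at least one of
    n_chromosomes independently sampled chromosomes."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie between 0 and 1")
    if n_chromosomes < 0:
        raise ValueError("n_chromosomes must be non-negative")
    # P(at least one copy) = 1 - P(no copies on any sampled chromosome)
    return 1.0 - (1.0 - maf) ** n_chromosomes


N_CHROMOSOMES = 158  # sample size reported in this study

p_maf_1pct = detection_probability(0.01, N_CHROMOSOMES)
p_maf_5pct = detection_probability(0.05, N_CHROMOSOMES)
print(f"MAF 1%: {p_maf_1pct:.3f}  MAF 5%: {p_maf_5pct:.4f}")
```

Under these assumptions the probability is roughly 0.80 for a variant with a MAF of 1% and exceeds 0.999 for a MAF of 5%.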
Based on our sequence analysis of 158 chromosomes, we have constructed a map of LD across the region. One hundred and fourteen SNPs are necessary to comprehensively tag this region at r² > 0.8 for further association studies in individuals of European ancestry. Genotyping 53 of 174 HapMap tag SNPs alone would cover approximately 78% of SNPs in this region in populations of European ancestry; the addition of 125 non-HapMap SNPs previously reported in dbSNP raises coverage to approximately 90%, though it is worth noting that all 299 SNPs would have to be genotyped to ensure this coverage. The present study not only validates these 299 SNPs, but also provides an additional 10% of information that would otherwise not have been monitored.
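The tagging and coverage figures above can be illustrated with a minimal sketch of pairwise tag SNP selection. It uses a simple greedy heuristic, which is an assumption made for illustration and not necessarily the procedure used in this study; the SNP names and r² values are hypothetical toy data.

```python
from typing import Dict, List, Tuple

PairR2 = Dict[Tuple[str, str], float]


def pair_r2(r2: PairR2, a: str, b: str) -> float:
    """Look up r^2 for an unordered SNP pair; unlisted pairs count as 0."""
    if a == b:
        return 1.0
    return r2.get((a, b), r2.get((b, a), 0.0))


def greedy_tags(snps: List[str], r2: PairR2, threshold: float = 0.8) -> List[str]:
    """Greedy pairwise tagging: repeatedly pick the SNP that captures the
    most still-untagged SNPs at r^2 > threshold."""
    # Every SNP captures itself, so the loop below always makes progress.
    captures = {s: {t for t in snps if t == s or pair_r2(r2, s, t) > threshold}
                for s in snps}
    untagged = set(snps)
    tags: List[str] = []
    while untagged:
        # Ties are broken by order in `snps` (max returns the first maximum).
        best = max(snps, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


def coverage(genotyped: List[str], snps: List[str], r2: PairR2,
             threshold: float = 0.8) -> float:
    """Fraction of all SNPs that are genotyped directly or captured at
    r^2 > threshold by at least one genotyped SNP."""
    covered = [t for t in snps
               if any(g == t or pair_r2(r2, g, t) > threshold for g in genotyped)]
    return len(covered) / len(snps)


# Hypothetical toy region with five SNPs and a few pairwise r^2 values.
toy_snps = ["s1", "s2", "s3", "s4", "s5"]
toy_r2 = {("s1", "s2"): 0.95, ("s2", "s3"): 0.85,
          ("s1", "s3"): 0.70, ("s4", "s5"): 0.90}

tags = greedy_tags(toy_snps, toy_r2)              # ['s2', 's4']
partial = coverage(["s1"], toy_snps, toy_r2)      # 0.4: s1 captures only s1 and s2
full = coverage(tags, toy_snps, toy_r2)           # 1.0
```

The same coverage function applies to any fixed panel, for example a set of previously catalogued SNPs, which is how a partial panel can be compared against a tag set derived from complete resequencing data.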
A PHASE analysis of common genetic variation (MAF > 5%) indicates a complex haplotype structure in which recent recombination generates a large number of rarer haplotypes. Therefore, to interrogate this region in association studies, tag SNPs appear to be the more efficient approach with respect to the number of SNPs required. Moreover, choosing tag SNPs at a high r² threshold improves the opportunity to monitor more SNPs and rare haplotypes, but at the cost of an increase in the number of SNPs needed for follow-up genotype analysis.
Preliminary bioinformatic analyses have identified rs6983267 as an excellent candidate SNP for functional assessment. Indeed, it lies within a region that is highly conserved across vertebrates and is predicted to have regulatory potential and to contain an enhancer element (see Fig. ). It has been proposed that variation within evolutionarily conserved regions is likely associated with phenotypic differences that may contribute to human diseases (Dermitzakis et al. 2005). For the telomeric rs1447295 region, three SNPs lie within potentially interesting regions, though strong evidence for nominating any one of them as the strongest candidate is still lacking.
In summary, we have extensively characterized the majority of common SNPs across two high-interest regions, totaling ~136 kb of human chromosome 8q24, that have been reported to be associated with colon and/or prostate cancer in GWAS, replication, and other case–control studies. We have verified that 299 SNPs deposited in dbSNP are polymorphic in our samples (158 chromosomes) and have identified 442 novel polymorphisms, 101 of which have an estimated MAF ≥ 0.05 and are not monitored by HapMap SNPs at an r² of 0.8 in our sample population. Our data set provides an important resource that may be used to design fine-mapping projects for this region. Such efforts are critical for providing sufficient information to follow up association findings rapidly and to design fine-mapping projects for regions of the genome found to be significantly associated with a disease or phenotype. Our results underscore the value of resequence analysis in determining the full catalog of variants from which to choose for further genotyping and functional analyses. Finally, the determination of the correlations among all genetic variation within this region should expedite the nomination of variants for functional studies after fine mapping.