We have carried out a comprehensive evaluation of common SNPs in three genomic regions that harbour CRC risk loci discovered by GWA studies. We genotyped between 22 and 156 SNPs in each region in up to 9328 cases and 10 480 controls. In all samples, we also imputed in each region between 324 and 1025 SNPs that had recently been identified or validated by the 1000 Genomes Project or HapMap3.
Genotype imputation led to a large increase in informative markers. For example, we obtained 20-, 12- and 7-fold increases in the number of informative markers when our own data from the Illumina Hap300, Hap550 and Hap1M arrays, respectively, were used. Even for the two data sets with the highest genotyping density (NSCCG and Scotland2), imputation increased the number of markers nearly 5-fold and led to a marker density of ~1 SNP/kb. The recently released 1000 Genomes Project data were particularly informative. For example, ~6% of the top 25 markers were not present in dbSNP and ~60% were not characterized by the HapMap project (data not shown). Therefore, imputation to the 1000 Genomes Project was, in principle, capable of providing more accurate and comprehensive information about local patterns of LD.
It is not necessarily the case that SNPs with the strongest evidence of association with disease are those that are truly functional and, as a result, some caution must be exercised in fine-mapping signals around tagSNPs identified in GWA studies. However, our study has led to a considerable reduction in the size of the genomic region most likely to contain the functional variants (for example see Figs –). In all cases, some of the SNPs in the most strongly associated region were genotyped, at least in some studies, and some were imputed. In general, the imputed SNPs provided very similar results to the genotyped SNPs and provided good supporting evidence.
Imputation was less successful in identifying the most likely functional SNPs within the region of strongest association based on strengths of association. In two of the three regions (16q22.1 and 19q13.11), we discovered SNPs showing stronger associations than those found for the original tagSNP. In total, there were 19 markers with P-values lower than that of the reported tagSNPs at 16q22.1 and 19q13.11; many of these SNPs had low minor allele frequencies (MAFs) (for example at 19q13.11) and most of them were either fully imputed (16q22.1: 2 SNPs, 19q13.11: 12 SNPs) or genotyped only in the NSCCG/Scotland2 studies and imputed in the other case–control sets (16q22.1: 5 SNPs, 19q13.11: 1 SNP). Only one marker with a P-value lower than that of the original tagSNP was fully genotyped in all six case–control sets (rs7199991, the top SNP at 16q22.1). Although imputation suggested SNPs, such as 8-117694643 and rs28626308, with stronger associations with CRC than the original tagSNPs, assessment of the accuracy of imputation using direct sequencing showed small, but important, errors. At least for uncommon genotypes imputed to the 1000 Genomes Project data, it seems unlikely that imputation can truly provide definitive evidence of SNP functionality.
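The comparison of imputed genotypes against direct sequencing can be illustrated with a minimal sketch. It is not the procedure used in this study: the function name `genotype_concordance`, the 0/1/2 minor-allele coding and the toy genotype vectors are all assumptions made for illustration. It shows how overall concordance can look high while concordance for an uncommon genotype class is much lower.

```python
from collections import Counter


def genotype_concordance(imputed, sequenced):
    """Compare imputed calls with sequenced calls for one SNP.

    Genotypes are coded as minor-allele counts (0, 1 or 2); None marks a
    missing call. Returns the overall concordance and the concordance
    within each sequenced genotype class. (Hypothetical helper.)
    """
    # Keep only samples that have a call from both sources.
    pairs = [(i, s) for i, s in zip(imputed, sequenced)
             if i is not None and s is not None]
    if not pairs:
        raise ValueError("no overlapping genotype calls")

    overall = sum(i == s for i, s in pairs) / len(pairs)

    # Stratify by the sequenced (reference) genotype.
    totals = Counter(s for _, s in pairs)
    matches = Counter(s for i, s in pairs if i == s)
    per_genotype = {g: matches[g] / totals[g] for g in sorted(totals)}
    return overall, per_genotype


# Toy data (illustrative only, not from the study): 20 samples,
# 4 of which are heterozygous for a low-frequency allele.
sequenced = [0] * 16 + [1] * 4
imputed = [0] * 16 + [1, 1, 0, 0]  # two heterozygotes imputed as homozygous

overall, per_genotype = genotype_concordance(imputed, sequenced)
print(f"overall concordance: {overall:.2f}")
for genotype, value in per_genotype.items():
    print(f"genotype {genotype}: {value:.2f}")
```

In this toy example, all discordant calls fall in the uncommon heterozygous class, so stratifying by genotype exposes errors that the overall figure hides.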
In contrast with imputation, annotation of eQTLs and other genomic features within the regions of the top CRC-associated SNPs revealed variants that may be functional. Interestingly, and against previous expectations, our study did not support a role for a number of candidate genes in the haplotype blocks containing these risk alleles. For example, in the past, we proposed that EIF3H, CDH1 and RHPN2 were clearly the best candidates in these blocks because they were the nearest genes to the tagSNPs and because there was some evidence of their involvement in tumorigenesis (4). The latter evidence was particularly compelling for CDH1, a gene that is mutated or epigenetically silenced in colorectal tumours (15), and for EIF3H. RHPN2 remains the most likely candidate gene, but through the effect of coding rather than regulatory variation. However, with the important caveat that colorectal eQTL data sets are not available and that these associations were detected in blood cells, CDH1 does not seem to be the most likely target of the genetic variation associated with CRC risk at 16q22.1. Instead, ZFP90 emerged as the best candidates in this region. Interestingly, rs1728785, a SNP recently associated with ulcerative colitis risk (21), lies within an intron of ZFP90 and is associated with transcript levels of this gene (22), suggesting that the gene may be involved in predisposition to both CRC and ulcerative colitis.
At 8q23.3, our analysis suggested that UTP23 was the most likely target of the functional variation. Although our fine-mapping results were very similar to those of Pittman et al., that study found good evidence that EIF3H was the target of the functional variation in the region. It is, however, entirely conceivable that both genes are coordinately regulated, given that they have related roles in mRNA translation.
In conclusion, we have carried out a large fine-mapping and annotation study of three CRC risk loci. This study is the fourth to fine-map regions and to propose functional variants at CRC loci discovered by GWA studies (9); the remaining regions are under investigation, but show more complex patterns of LD or have evidence for multiple independent variants that require further analysis (23). We have refined the size of the disease-associated regions on 8q23.3, 16q22.1 and 19q13.11, and identified a number of candidate SNPs that might include functional alleles in these regions. We identified an RHPN2 non-synonymous polymorphism as a strong functional candidate, and discovered evidence suggesting that the risk alleles near CDH1 affect the expression of unexpected candidate genes. Our investigation establishes a foundation for future work, including expression studies in colorectal mucosa, to assign disease functionality to these SNPs using experimental and laboratory-based assays.