|Home | About | Journals | Submit | Contact Us | Français|
In genome-wide association studies (GWASs) of colorectal cancer, we have identified two genomic regions in which pairs of tagging-single nucleotide polymorphisms (tagSNPs) are associated with disease; these comprise chromosomes 1q41 (rs6691170, rs6687758) and 12q13.13 (rs7163702, rs11169552). We investigated these regions further, aiming to determine whether they contain more than one independent association signal and/or to identify the SNPs most strongly associated with disease. Genotyping of additional sample sets at the original tagSNPs showed that, for both regions, the two tagSNPs were unlikely to identify a single haplotype on which the functional variation lay. Conversely, one of the pair of SNPs did not fully capture the association signal in each region. We therefore undertook more detailed analyses, using imputation, logistic regression, genealogical analysis using the GENECLUSTER program and haplotype analysis. In the 1q41 region, the SNP rs11118883 emerged as a strong candidate based on all these analyses, sufficient to account for the signals at both rs6691170 and rs6687758. rs11118883 lies within a region with strong evidence of transcriptional regulatory activity and has been associated with expression of PDGFRB mRNA. For 12q13.13, a complex situation was found: SNP rs7972465 showed stronger association than either rs11169552 or rs7136702, and GENECLUSTER found no good evidence for a two-SNP model. However, logistic regression and haplotype analyses supported a two-SNP model, in which a signal at the SNP rs706793 was added to that at rs11169552. Post-GWAS fine-mapping studies are challenging, but the use of multiple tools can assist in identifying candidate functional variants in at least some cases.
Using genome-wide association studies (GWASs), we have identified 14 regions that contain tagging single nucleotide polymorphisms (tagSNPs) associated with the risk of colorectal cancer (CRC) (1). Within three of these regions—chromosomes 14q22.2, 15q13.3 and 20p12.3—we have shown that there exist two SNPs that are independently associated with disease (2). In two further regions—chromosomes 1q41 and 12q13.13—there are two SNPs associated with CRC risk, but from the original GWA analysis, it was unclear as to whether these represented independent signals of association (1). At 1q41, these SNPs are rs6691170 (chr1: 220,112,069 bases) and rs6687758 (chr1: 220,231,571); they are in modest pairwise linkage disequilibrium (LD) (r2= 0.22; D′ = 0.71). At 12q13.13, the two SNPs are rs7136702 (chr12: 49,166,483) and rs11169552 (chr12: 49,441,930); these SNPs too are moderately correlated (r2= 0.11, D′ = 0.76). Our previous analyses had not resolved the issue of whether there could be more than one independent CRC SNP in either of these regions (1).
One of the aims of GWASs is the discovery of functional/causal variants, the effects of which are manifest in the tagSNP associations. It is, however, very challenging to proceed from a tagSNP association to identifying functional variants, and relatively few such studies have been reported to date. One reason for this is that the correlation matrix between tagSNP(s) and functional variant(s) at any locus may be complex. If two association signals occur at tagSNPs at the same locus, the possible causes include the following:
It can be extremely hard to distinguish among these possibilities and our inability to de-convolute association signals may help explain why so much of the heritability of complex diseases is unexplained by GWASs to date (3). Despite these problems, functional variant discovery may be aided by a deeper examination of genetic variation in the LD blocks in which the tagSNPs reside. Such discovery is likely to benefit from efforts such as the 1000 Genomes Project, where a comprehensive discovery of novel variants has been carried out in several populations.
In this study, we had three aims. First, we wished to investigate as fully as possible whether there was likely to be one or more than one functional variant underlying the association signals at 1q41 and 12q13.13 in CRCs. Secondly, we wanted to investigate other tagSNPs in these regions for evidence of further, independent association signals. Thirdly, we wished to use imputation and functional annotation to refine the most likely location of the ‘disease-causing’ variant in both the 1q41 and 12q13.13 regions.
We genotyped a total of 48 174 samples (22 832 cases and 25 892 controls) from 17 sample sets at rs6691170 and rs6687758. This analysis included five replication case/control cohorts that were not previously reported (1) for these SNPs: Kentucky; Prague; EPICOLON; Leiden; and Australia. After meta-analysis in STATA, both rs6691170 and rs6687758 were, as expected, significantly associated with CRC risk (Table 1), with no evidence of heterogeneity among studies. Incorporating both SNPs into an unconditional logistic regression model showed that neither of the pair of SNPs fully captured the association signal in the region [odds ratio (OR) = 1.06, P = 1.06 × 10−4 for rs6691170 and OR = 1.07, P = 2.48 × 10−4 for rs6687758]. We used PLINK to examine the possibility that the two tagSNPs indicated a single high-risk haplotype on which an unknown functional SNP was present (that is, all the functional risk alleles resided on a haplotype composed solely of one of the four possible pairs of tagSNP alleles). However, the association signal was not simply present on the high-risk haplotype TG (for rs6991170|rs6687758). Instead, the risks for the ‘compound’ (high-low or low-high) haplotypes—GG and TA—were greater than those for the low-low haplotype (GA), inconsistent with a functional SNP being in complete LD with a haplotype indicated by the pair of tagSNPs (Supplementary Material, Table S1). We also tested for evidence of epistasis between rs6691170 and rs6687758 using case–control logistic regression analysis, incorporating interaction between SNPs as a variable, but no evidence of deviation from log-additive SNP effects was found (P = 0.292).
Having failed to find evidence for the simplest situations—namely that one of each tagSNP pair captured the great majority of the association signal or that the tagSNPs essentially acted as simple two-locus tags for the functional variants in each region—we attempted to deconvolute the 1q41 signal by association testing of imputed SNPs in the region. The three GWAS sample sets, UK1, Scotland 1 and VQ58, were imputed to the combined 1000 genomes and HapMap3 reference set. A total of 630 SNPs in the 220–221 Mb region on chromosome 1q41 was successfully imputed from 76 genotyped SNPs. The strongest association signal (Fig. 1, Supplementary Material, Table S2), as measured by association test P-value, was at rs11118883 (chr1: 220,127,645), an imputed SNP in moderate LD with rs6691170 (r2= 0.40, D′ = 0.74) and rs6687758 (r2= 0.31, D′ = 0.77).
We then used reverse stepwise logistic regression analysis to determine whether rs6691170 and rs6687758, or other combinations of SNPs, best accounted for the association between CRC and 1q41 variation. Using a final significance threshold of P = 0.01, we found that two imputed SNPs, rs11118883 and rs12726661, were most strongly associated with the CRC risk (Table 2, Supplementary Material, Table S2). By comparison, a joint analysis of rs6687758 and rs6691170 in the same three GWAS data sets gave much weaker evidence of association, as assessed using the Akaike Information Criterion (AIC). Indeed, a model incorporating rs11118883 alone—although not one with rs12726661 alone—provided a better fit than a model incorporating both rs6687758 and rs6691170; haplotype-based association analysis supported these findings (data not shown).
We were surprised to note that in a single-SNP analysis the direction of effect for rs12726661 was reversed—the minor allele was associated with disease risk—compared with that in the two-SNP analysis. We determined that rs11118883 and rs12726661 were in strong LD (r2= 0.98, D′ = 1.00) in our samples, consistent with data from the 1000 genomes project and HapMap3 that had been used for imputation. Examination of the genotype distribution in our data set showed that deviation from perfect LD between the SNPs resulted from two sets of individuals: (i) 50 homozygous for the major allele at rs12726661 and heterozygous at rs11118883; and (ii) 15 heterozygous at rs12726661 and homozygous for the minor allele at rs11118883. Specifically, 28/50 in category (i) were cases and 4/11 in category (ii) were cases. For these 65 individuals, the risk of CRC was significantly greater than that of individuals with the other genotypes at rs12726661 and rs11118883 (OR = 2.10, P = 0.003, χ21 test). A potential explanation for our apparently paradoxical findings is that there exists another allele, almost certainly relatively rare, that is associated with the minor allele of rs12726661 (but not with rs11118883), and that is protective against the CRC risk.
We then analysed our UK1, Scotland 1 and VQ58 individuals using GENECLUSTER with the original GWAS SNP genotypes in the rs6691170/6687758 region as inputs. There was no evidence to favour an underlying two-locus model over a one-SNP model (Fig. 2). The predicted most strongly associated single SNP was rs11577023, a SNP that is in very strong LD with rs11118883 (r2= 0.93, D′ = 1.0) in our data.
We genotyped rs11118883 directly in a set of 84 UK control samples and found complete concordance with the imputed genotypes. rs11118883 (chr1:220,127,645) lies in a gene desert, within a region of LD that extends approximately from 220.0 to 220.3 Mb. The nearest gene, ~150 kb towards the centromere, is the MAP kinase regulator dual-specificity phosphatase 10 (DUSP10). DUSP10 inactivates p38 and also the Jun N-terminal kinase that phosphorylates c-Jun which is believed to play a role in CRC pathogenesis. rs11118883 itself lies upstream of DUSP10 within a region with strong evidence of transcriptional regulatory activity (http://genome.ucsc.edu). Using 1000 genomes data, we found that rs11118883 is in strong LD (r2> 0.7) with at least six SNPs (rs12738322, rs12726661, rs4129271, rs11577023, rs10746414 and rs12137702). Of these, rs10746414 and rs12137702 are also close to regions with potential effects on transcription.
Analysis of the 12q13.13 region proceeded in parallel with that of the 1q41 region using essentially the same strategy. We initially confirmed the individual associations of SNPs rs7136702 and rs11169552 with the CRC risk in the extended data sets (Table 1). Unconditional logistic regression analysis did not exclude the possibility that the two SNPs had independent effects; for rs7136702 and rs11169552, the association statistics were P = 1.63 × 10−5 (OR = 1.07) and P = 1.70 × 10−7 (OR = 0.92), respectively, showing that one SNP did not simply capture all of the association signals. Further analysis showed that the association signal was not derived from a single high-risk haplotype tagged by rs7136702 and rs11169552 (Supplementary Material, Table S3) and there was no evidence of epistasis between the SNPs (P = 0.903).
We imputed SNPs within the 48.5–50 Mb region of chromosome 12 using the combined 1000 genomes and HapMap3 reference panel in the 3 GWAS sample sets (UK1, Scotland 1 and VQ58) (Fig. 3, Supplementary Material, Table S4). A total of 2736 SNPs was successfully imputed from 158 genotyped SNPs. The most significant single-SNP association was at the imputed SNP rs7972465 [OR = 1.18, 95% confidence interval (CI) 1.11–1.27, P = 8.22 × 10−7), a signal slightly stronger than that of rs11169552 (OR = 0.85, 95% CI 0.79–0.91, P = 1.08 × 10−5) and notably stronger than that of rs7136702 (OR = 1.13, 95% CI 2.06–1.21, P = 3.85 × 10−4). Direct genotyping in 91 UK control individuals showed that imputation of rs7972465 was very good, although not perfect (r2= 0.93).
Reverse stepwise logistic regression analysis was then used to assess whether rs11169552 and rs7136702, or other combinations of SNPs in the region, best accounted for the association between CRC and 12q13.13 variation (Table 3). Many highly correlated SNPs exist within the region, making this analysis difficult. Nonetheless, while rs11169552 remained in the regression model after stepwise elimination of less strongly associated SNPs, a number of SNPs provided improved or similar associations compared with rs7136702 in a two-SNP model with rs11169552. One of these SNPs was rs7972465 (Table 3, Fig. 4), but another SNP, rs706793, a SNP in very low LD with rs11169552 (Table 3), provided a larger improvement in the AIC (see also Supplementary Material, Table S4).
We then undertook GENECLUSTER analysis of the UK1, Scotland 1 and VQ58 sample sets in the 12q13.13 region. There was no good evidence to distinguish between underlying two-locus and one-locus models (Fig. 5), although the association signal showed two peaks at ~48.85 Mb (close to rs706793) and at ~49.45 Mb (very close to rs11169552) that could not readily be explained by long-range LD between these two regions (Supplementary Material, Fig. S1). The predicted most strongly associated SNP under the one-SNP model was rs3184122 (Supplementary Material, Table S4), a variant that is in moderate or strong LD (Fig. 5) with rs11169552 (r2= 0.19, D′ = 0.92), rs7136702 (r2= 0.49, D′ = 0.73) and rs706793 (r2= 0.47, D′ = 0.94), and strong LD with rs7972465 (r2= 0.87, D′ = 1.00).
Since the various analyses had not resolved the question of whether there exist one or two independent CRC-associated SNPs in the 12q13.13 region, we used PLINK to examine the associations with disease of the haplotypes (Fig. 4) for rs706793, rs7972465 and rs11169552. As expected, the haplotype CGC was most strongly associated with risk (Table 4, Supplementary Material, Table S5). The G (risk) allele at rs7972465 was essentially present only on this haplotype, but it appeared that haplotypes containing the T allele at rs7972465 were not all low risk and therefore that rs7972465 did not explain all the association signal. We therefore considered the association signals when we fixed the alleles at rs706793 and rs11169552 and allowed those at rs7972465 to vary, and vice versa. Initially, we undertook simple comparisons between haplotype frequencies in cases and controls, and found that the rs706793 and rs11169552 risk alleles, but not the rs7972465 risk allele, were found at significantly higher frequencies in cases than controls (Supplementary Material, Table S6). Since this analysis suggested that there might be independent effects of rs706793 and rs11169552—and that the signal at rs7972465 resulted from LD with these two SNPs—we proceeded to a further evaluation of this possibility using conditional haplotype analysis in PLINK. We again compared two scenarios, (i) in which the CGC and CTC haplotypes were equivalent (that is, varying rs7972465) and (ii) in which the CTC and TTT haplotypes were equivalent (that is, varying rs706793 and rs11169552). No effect was seen in the first case (likelihood ratio test, P = 0.35), whereas there was a significant difference in the second case (P = 0.023), again supporting effects of rs706793 and rs11169552 rather than rs7972465.
Further genotyping in additional sample sets strengthened the rs706793 association with CRC, although it did not reach formal significance and there was some evidence of inter-study heterogeneity, the origins of which remain unclear (Supplementary Material, Table S7). Logistic regression analysis in the extended sample set continued to support a model incorporating rs706793 and rs11169552 (P = 8.38 × 10−4 and P = 7.82 × 10−6, respectively, AIC = 27932) over one with rs7163702 and rs11169552 (P = 3.05 × 10−5 and P = 9.05 × 10−3, AIC = 27999).
rs706793 (chr12:48,754,036) and rs11169552 (chr12:49,441,930) are separated by a predicted recombination hotspot at ~48.8 Mb in the HapMap data (Fig. 3) but not in our own data (Fig. 5), although LD in the region is complex (Supplementary Material, Fig. S1). The 12q13.13 region contains coding genes ACCN2, SMARCD1, GPD1, LASS5, LIMA1 and ATF1. ACCN2 probably encodes an ion channel protein, SMARCD1 is part of chromatin remodelling complex SNF/SWI, GPD1 is glycerol-3-phosphate dehydrogenase and LASS5 is probably a ceramide synthase. LIMA1 codes for EPLIN, a protein downregulated in some cancers. ATF1 is a transcription factor centrally involved in the stress response and in the pathogenesis of angiomatoid fibrous histiocytoma and clear cell sarcoma through translocation. Supplementary Material, Table S8 lists SNPs in strong LD (r2> 0.70) with rs706793, rs7972465 or rs11169552, and provides annotation for those with evidence of potential roles in gene or protein regulation or function.
We have undertaken additional genotyping and more detailed analysis in order to understand better the dual tagSNP association signals that we observed on chromosomes 1q41 (rs6991170, rs6687758) and 12q13.13 (rs11169552, rs7136702) in a GWAS of CRC (1). In both cases, genotyping of additional sample series confirmed the originally reported associations, without demonstrating good evidence for the three simplest scenarios: independent functional variants; capture of the association signal by one of the pair of SNPs; or two-SNP tagging of a single haplotype on which functional variation lay. We therefore proceeded to more detailed analyses in each region, after imputation of genotypes where appropriate in the data sets with best coverage of each region (UK1, Scotland1 and VQ58). It is conceivable that the analysis of these three data sets, which had already been used in SNP discovery, would introduce a small amount of bias into the fine mapping. However, we reasoned that the marginal differences in association that might occur would be more than outweighed by the power provided by the use of these data sets.
For 1q41, the single-SNP association test, logistic regression analysis and GENECLUSTER all found that SNP rs11118883, or a SNP in strong LD, was most likely to be responsible for the signal of association. This SNP itself is a very good functional candidate, lying within or immediately adjacent to regions bearing histone methylation and acetylation marks, DNAse I hypersensitive sites and sites of transcription factor binding (http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=195445293&c=chr1&g=wgEncodeReg). The SCAN expression Quantitative Trait Locus (eQTL) database (4) reports rs11118883 being associated (P = 8 × 10−5) in Europeans with expression of platelet-derived growth factor β (PDGFRB, chr5q31–q32), although this association requires confirmation in appropriate cell types for the CRC risk and is not present in the Genevar eQTL database (http://www.sanger.ac.uk/resources/software/genevar) (5). The possibility that the minor allele of rs12726661 is associated with a second, presumably rare, variant that is protective against the CRC risk is intriguing. While speculative, such a scenario has precedents, such as the MDM2 promoter SNP rs117039649 (6).
For 12q13.13, a complex situation was found. Single-SNP analysis found variants with much stronger association signals than either rs11169552 or rs7136702, notably at rs7972465 although small imputation inaccuracies may have inflated this signal. GENECLUSTER analysis found no greater evidence for a two-SNP than one-SNP model and detected the best signal for the former at a SNP, rs3184122, that is in strong LD with rs7972465. Logistic regression analysis, however, supported a two-SNP model, in which a signal at rs706793 was added to that at rs11169552. rs706793 and rs11169552 are in very weak LD, but rs706793 is in moderate LD with rs7163702 (r2= 0.20, D′ = 0.60). Haplotype analysis supported the logistic regression analysis, in that the genotype at rs7972465 did not affect the risk associated with the rs706793–rs11169552 haplotypes, whereas the reverse scenario (high- versus low-risk rs706793–rs11169552 haplotypes) did affect risk. As regards eQTLs for the 12q13.13 SNPs, Genevar shows rs706793 to be associated with LASS5 expression (at P< 10−4), although this association is not reported in SCAN.
Clearly, all post-GWAS fine-mapping studies face intrinsic difficulties, such as the use of imputed genotypes, despite the use of stringent criteria for SNP inclusion, and a limited ability to differentiate among association signals of similar magnitudes. The analysis of the 12q13.13 region illustrates some of these problems well. Although a much more strongly CRC-associated SNP than the original tagSNPs was identified through imputation, the balance of evidence slightly favours this signal resulting from two independent association signals, as we have previously found for the GREM1 locus (2). In the 1q41 region, in contrast, rs11118883—a SNP in moderate LD with both the original tagSNPs—emerged as an excellent candidate for the functional variant.
The Kentucky samples comprised 1020 incident colon cancer cases and 1598 population controls of white European origin recruited between July 2003 and December 2009. Eligible cases were identified through the population-based Surveillance, Epidemiology and End Results (SEER) Kentucky Cancer Registry covering all residents living in the State of Kentucky at the time of diagnosis. We used random digital dialling to recruit population controls who were 40 years of age or older and had no personal history of cancer other than skin cancer. We excluded those with known inflammatory bowel diseases, family history of familial adenomatous polyposis and hereditary non-polyposis CRC.
The Prague cases (7) were patients with histologically confirmed CRC recruited between September 2004 and February 2009 from nine oncology departments in the Czech Republic: Prague (two), Benesov, Brno, Liberec, Ples, Pribram, Usti nad Labem and Zlin. During this period, a total of 1554 cases provided blood samples. This study includes 1001 subjects who could be interviewed, provided biological samples and were genotyped. Controls were 683 hospital-based volunteers with negative colonoscopy results for malignancy or idiopathic bowel diseases (CFCC, cancer-free colonoscopy inspected controls). CFCCs were selected from among individuals admitted to the same hospitals during the same period of the recruitment of the cases. The reasons for undergoing the colonoscopy were: (i) positive faecal occult blood test, (ii) haemorrhoids, (iii) abdominal pain of unknown origin, or (iv) macroscopic bleeding.
Details of other sample sets have been reported previously (2) and are provided briefly below.
UK1 (CORGI) comprised 922 cases with colorectal neoplasia (47% male) ascertained through the Colorectal Tumour Gene Identification (CORGI) consortium. All had at least one first-degree relative affected by CRC and one or more of the following phenotypes: CRC at age 75 or less; any colorectal adenoma (CRAd) at age 45 or less; ≥3 CRAds at age 75 or less; or a large (>1 cm diameter) or aggressive (villous and/or severely dysplastic) adenoma at age 75 or less. The 929 controls (45% males, 55% females) were spouses or partners unaffected by cancer and without a personal family history (to second degree relative level) of colorectal neoplasia. Known dominant polyposis syndromes, HNPCC/Lynch syndrome or bi-allelic MUTYH mutation carriers were excluded.
Scotland1 (COGS) included 980 CRC cases (51% male; mean age at diagnosis 49.6 years, SD ± 6.1) and 1002 cancer-free population controls (51% male; mean age 51.0 years; SD ± 5.9). Cases were for early age at onset (age ≤ 55 years). Known dominant polyposis syndromes, HNPCC/Lynch syndrome or bi-allelic MUTYH mutation carriers were excluded. Control subjects were sampled from the Scottish population NHS registers, matched by age (±5 years), gender and area of residence within Scotland.
VQ58 comprised 1832 CRC cases (1099 males, mean age of diagnosis 62.5 years; SD ± 10.9) from the VICTOR and QUASAR2 (www.octo-oxford.org.uk/alltrials/trials/q2.html) clinical trials of adjuvant therapy in stage II/III CRC. There were 2720 population control genotypes (1391 males) from the Wellcome Trust Case-Control Consortium 2 (WTCCC2) 1958 birth cohort (also known as the National Child Development Study), which included all births in England, Wales and Scotland during a single week in 1958.
The Australian study comprised 591 patients treated for CRC at the Royal Melbourne, Western and St Francis Xavier Cabrini Hospitals in Melbourne from 1999 to 2009. The 2353 controls were derived from Queensland or Melbourne: for the former, the controls came from the Brisbane Twin Nevus Study; for the latter, individuals were participants in the Genes in Myopia study. There was no overlap between the CFR and Australian data sets. Owing to potential residual ethnic heterogeneity within the Melbourne population, for the Australian cohort only we performed an additional screen to minimize heterogeneity after performing principal components analysis (PCA) to remove individuals who clustered with non-CEU individuals (see below). We achieved this by performing PCA on the Australian cases and controls without reference samples of known ancestry. We then paired each case with a control in a 1:1 ratio based on a maximum separation of 0.050 using the first and second eigenvectors. All unpaired samples were excluded, leaving 441 cases and 441 controls in the study. Calculation of the genomic inflation factor, λGC, showed this to be 1.02 after this filtering.
UK2 (NSCCG) consisted of 2854 CRC cases (58% male, mean age at diagnosis 59.3 years; SD± 8.7) ascertained through two ongoing initiatives at the Institute of Cancer Research/Royal Marsden Hospital NHS Trust (RMHNHST) from 1999 onwards—The National Study of Colorectal Cancer Genetics (NSCCG) and the Royal Marsden Hospital Trust/Institute of Cancer Research Family History and DNA Registry. The 2822 controls (41% males; mean age 59.8 years; SD ± 10.8) were the spouses or unrelated friends of patients with malignancies. None had a personal history of malignancy at the time of ascertainment. All cases and controls had self-reported European ancestry, and there were no obvious differences in the demography of cases and controls in terms of place of residence within the UK.
Scotland2 (SOCCS) comprised 2024 CRC cases (61% male; mean age at diagnosis 65.8 years, SD ± 8.4) and 2092 population controls (60% males; mean age 67.9 years, SD ± 9.0) ascertained in Scotland. Cases were taken from an independent, prospective, incident CRC case series and aged <80 years at diagnosis. Control subjects were population controls matched by age (±5 years), gender and area of residence within Scotland.
UK3 (NSCCG) comprised 7912 CRC cases (65% male; mean age at diagnosis 59 years, SD± 8.2) and 4398 controls (40% male; mean age 62 years, SD± 11.5) ascertained through NSCCG post-2005.
Scotland3 (SOCCS) comprised 1145 CRC cases (50% male; mean age at diagnosis 53.2 years, SD ± 15.4) and 2203 cancer-free population controls (47% male; mean age 51.8 years, SD ± 11.5). Controls were recruited as part of the Generation Scotland study.
UK4 (CORGI2BCD) consisted of 621 CRC or CRAd cases (46% male; mean age at diagnosis 58.3 years; SD ± 14.1) and 1121 cancer-free population or spouse controls (45% male; mean age 45.1 years, SD ± 15.9), sampled using the same criteria as UK1.
Cambridge/SEARCH consisted of 2248 CRC cases (56% male; mean age at diagnosis 59.2 years, SD± 8.1) and 2209 controls (42% males; mean age 57.6 years, SD± 15.1). Samples were ascertained through the SEARCH (Studies of Epidemiology and Risk Factors in Cancer Heredity, http://www.cancerhelp.org.uk/trials/a-study-looking-at-genetic-causes-of-cancer) study based in Cambridge, UK. Recruitment started in 2000; initial patient contact was though the general practitioner. Control samples were collected post-2003. Eligible individuals were sex and frequency matched in 5-year age bands to cases.
The COIN samples were 2151 cases derived from the COIN and COIN-B clinical trials of metastatic CRC. Median age was 63 years. COIN cases were compared against genotypes from 2501 population controls (1237 males), from the WTCCC2 National Blood Service (NBS) cohort (50% male; mean age at diagnosis 53.2 years, SD ± 15.4).
The Helsinki (FCCPS) study (http://research.med.helsinki.fi/gsb/aaltonen/) comprised 988 cases from a population-based collection centred on south-eastern Finland and 864 population controls from the same collection.
EPICOLON included 1410 CRC cases matched with the same number of controls collected in a prospective fashion from centres in Spain. Exclusion criteria were Mendelian CRC syndromes and a personal history of inflammatory bowel disease.
The Leiden sample set included 858 unselected cases with CRC and 690 controls ascertained through genetic testing programmes for non-cancer-related conditions from the Leiden area.
In all cases, CRC was defined according to the ninth revision of the International Classification of Diseases (ICD) by codes 153–154 and all cases had pathologically proven disease. Only individuals of white European origin were included in the study.
Collection of blood samples and clinico-pathological information from patients and controls was undertaken with informed consent and ethical review board approval in accordance with the tenets of the Declaration of Helsinki. DNA was extracted from samples using conventional methods and quantified using PicoGreen (Invitrogen). The VQ, UK1, Scotland1 and Australia GWA cohorts were genotyped using Illumina Hap300, Hap370 or Hap550 arrays. 1958BC and NBS genotyping was performed as part of the WTCCC2 study on Hap1.2M arrays. In UK2 and Scotland2, genotyping was conducted using custom Illumina Infinium arrays according to the manufacturer's protocols. Some COIN SNPs were typed on custom Illumina Goldengate arrays. To ensure quality of genotyping, a series of duplicate samples was genotyped, resulting in 99.9% concordant calls in all cases.
Other genotyping was conducted using competitive allele-specific PCR KASPar chemistry (KBiosciences Ltd, Hertfordshire, UK), Taqman (Life Sciences, Carlsbad, CA, USA) or MassARRAY (Sequenom Inc., San Diego, CA, USA). All primers, probes and conditions used are available on request. Genotyping quality control was tested using duplicate DNA samples within studies and SNP assays, together with direct sequencing of subsets of samples to confirm genotyping accuracy. For all SNPs, >99% concordant results were obtained.
We excluded SNPs from analysis if they failed one or more of the following thresholds: GenCall scores <0.25; overall call rates <95%; minor allele frequency (MAF)<0.01; departure from the Hardy–Weinberg equilibrium (HWE) in controls at P<10−4 or in cases at P< 10−6; outlying in terms of signal intensity or X:Y ratio; discordance between duplicate samples; and, for SNPs with evidence of association, poor clustering on inspection of X:Y plots. We excluded individuals from the GWA analyses if they had evidence of non-white European ancestry by PCA-based analysis in comparison with HapMap samples (http://hapmap.ncbi.nlm.nih.gov/) or by self-report. Deviation of the genotype frequencies in the controls from those expected under HWE was assessed by the χ2 test (1 df), or Fisher's exact test where an expected cell count was <5.
Associations between SNP genotype and disease status were primarily assessed in STATA v10 (http://www.stata.com/) and PLINK v1.07 (http://pngu.mgh.harvard.edu/~purcell/plink/) using allelic and Cochran–Armitage tests (both with 1df) respectively, or by Fisher's exact test where an expected cell count was <5. Genotypic (2 df), dominant (1 df) and recessive (1 df) tests were also performed. The risks associated with each SNP were estimated by allelic, heterozygous and homozygous ORs using unconditional logistic regression, and associated 95% CIs were calculated.
Joint analysis of data generated from multiple phases was conducted using standard methods for combining raw data based on the Mantel–Haenszel method in STATA and in PLINK. Joint ORs and 95% CIs were calculated assuming fixed- and random-effects models. Tests of the significance of the pooled effect sizes were calculated using a standard normal distribution. Cochran's Q statistic to test for heterogeneity and the I2 statistic to quantify the proportion of the total variation due to heterogeneity were calculated. Large heterogeneity is typically defined as I2≥75%. Where significant heterogeneity was identified, results from the random-effects model were reported. Alongside, we also performed meta-analysis based on allele dosage (0, 1, 2) and incorporated age and sex as co-variates. Although age and sex are associated with the CRC risk, they were not associated with SNP genotype and did not materially affect the significance of any of the reported associations (data not shown).
The combined effects of pairs or other multiples of loci identified as possibly associated with the CRC risk were investigated by unconditional or conditional logistic regression analysis in PLINK and STATA to test for independent effects of each SNP, stratifying by sample series. Logistic regression was undertaken both pairwise with the original tagSNP and then in a backwards analysis that initially included all SNPs with good evidence of association in each region. We used Haploview software v4.2 (http://www.broadinstitute.org/haploview) to infer the LD structure of the genome in the 1q41 and 12q13.13 regions, and used the expectation maximum algorithms in Haploview or PLINK to infer haplotypes.
To predict genotypes at untyped SNPs in both regions, imputation of the UK1, Scotland 1 and VQ58 data sets was performed using the IMPUTE2 software and the combined CEU 1000 Genomes low-coverage pilot and complete HapMap3 haplotypes reference set, which was filtered to remove duplicate haplotypes (both from https://mathgen.stats.ox.ac.uk/impute/impute_v2.html) (8). Association statistics for imputed SNPs were calculated in SNPTEST v1.1.5 (www.stats.ox.ac.uk/~marchini/software/gwas/snptest.html) using the ‘–proper’ option, which is an additive model score test based on missing data likelihood, to allow for the uncertainty of imputed genotypes (9). Imputed markers with proper_info scores <0.5, imputed call rates per SNP <0.9 (using a maximum genotype probability threshold of 0.9 to call a genotype) and MAFs <0.01 were excluded from the analyses. Meta-analyses of the sample sets were carried out with Meta (10) (http://www.stats.ox.ac.uk/~jsliu/meta.html) and in STATA, using the genotype probabilities from IMPUTE2 where a SNP was not directly typed.
The GENECLUSTER program (11) was used to analyse our UK1, Scotland1 and VQ58 samples specifically in order to test whether one- or two-SNP models better fitted the association signals in each SNP region. GENECLUSTER is a Bayesian method that uses HapMap haplotypes to estimate genealogy of samples and by jointly testing all SNPs on each branch of the genealogy in cases and controls, the program indicates the identities of the SNP(s) most likely to have the strongest association signal, thus potentially helping to identify functional variation. The default model parameters were used, specifically mutation model prior: (0.50, 0.50, 0.00), max number of trees to consider per location: 1 and beta risk prior parameters: (5.00, 5.00).
Essentially for comparative purposes, we also ran the Margarita program (12), based on ancestral recombination graphs (ARGs), in UK1, Scotland1 and VQ58 for the genotyped SNPs in the 1q and 12q regions. This program aims to maximize available information as to the location and identity of a functional SNP by reconstructing the genealogical history of the sample population. For each ARG, a putative risk mutation is placed on the marginal tree and the frequency of each branch in cases and controls is assessed. For each region, 30ARGs were constructed and the significance of a SNP at each branchpoint assessed by 10 000 permutations. Unlike GENECLUSTER, Margarita does not specifically address the issue of whether there are two independent underlying SNP in each region, and comparison was therefore restricted to the single-SNP scenario.
Genome co-ordinates were taken from the NCBI build 36/hg18 (dbSNP b126).
Conflict of Interest statement. None declared.
Funding was primarily provided by Cancer Research UK. The EU FP7 CHIBCHA grant supported LGC-C through funding to IPMT, SC-B and ACar. Core infrastructure support to the Wellcome Trust Centre for Human Genetics, Oxford was provided by grant 090532/Z/09/Z. I.P.M.T. received support from the Oxford NIHR Comprehensive Biomedical Research Centre. The UK National Cancer Research Network supported the NSCCG. Additional funding to M.D. was provided by the Medical Research Council (G0000657-53203), CORE and Scottish Executive Chief Scientist's Office (K/OPR/2/2/D333, CZB/4/449). The EPICOLON work was supported by grants from the Fondo de Investigación Sanitaria/FEDER (08/0024, 08/1276, PS09/02368), Ministerio de Ciencia e Innovación (SAF2010-19273), Asociación Española contra el Cáncer (Fundación Científica y Junta de Barcelona) and Fundació Olga Torres (CRP). S.C.-B. and C.F.-R. are supported by contracts from the Fondo de Investigación Sanitaria (CP03-0070 and PS09/02368). CIBERehd and CIBERER are funded by the Instituto de Salud Carlos III. For the Melbourne cases, work was supported by the Hilton Ludwig Cancer Metastasis Initiative. The specimens and data from Australian colon cancer patients were provided by the Victorian Cancer Biobank and BioGrid Australia with appropriate ethics approval. The Victorian Cancer Biobank is supported by the Victorian Government. CERA receives operational infrastructure support from the Victorian Government. Funding to pay the Open Access publication charges for this article was provided by the Wellcome Trust.
This study made use of genotyping data on the 1958 Birth Cohort and NBS samples, kindly made available by the Investigators of those studies and the Wellcome Trust Case-Control Consortium 2; a full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk/. We are also grateful to the Spanish National Genotyping Center (CEGEN-ISCIII)-USC node. The work was carried out (in part) at the Esther Koplowitz Centre, Barcelona. We are grateful to colleagues in the EPICOLON, CORGI and COGENT consortia. Finally, we would like to thank all individuals who participated in the study.