Genome wide association studies (GWAS) have identified four susceptibility loci for epithelial ovarian cancer (EOC) with another two loci being close to genome-wide significance. We pooled data from a GWAS conducted in North America with another GWAS from the United Kingdom. We selected the top 24,551 SNPs for inclusion on the iCOGS custom genotyping array. Follow-up genotyping was carried out in 18,174 cases and 26,134 controls from 43 studies from the Ovarian Cancer Association Consortium. We validated the two loci at 3q25 and 17q21 previously near genome-wide significance and identified three novel loci associated with risk; two loci associated with all EOC subtypes, at 8q21 (rs11782652, P=5.5×10-9) and 10p12 (rs1243180; P=1.8×10-8), and another locus specific to the serous subtype at 17q12 (rs757210; P=8.1×10-10). An integrated molecular analysis of genes and regulatory regions at these loci provided evidence for functional mechanisms underlying susceptibility that implicates CHMP4C in the pathogenesis of ovarian cancer.
Evidence from twin and family studies suggests an inherited genetic component to EOC risk 1,2. Rare, high-penetrance alleles of genes such as BRCA1 and BRCA2 account for about 40 percent of the excess familial risk 3, and GWAS have recently identified common risk alleles at 9p22, 8q24, 2q31, and 19p13 4-6, with two additional loci at 3q25 and 17q21 that approached genome-wide significance 6. However, these explain only 4 percent of the excess familial risk, and more loci probably exist.
We therefore pooled the data from two GWAS to inform the selection of SNPs for a large-scale replication. The North American study comprised four independent case-control studies that included 1,952 EOC cases and 2,052 controls. The second study was a two-phase multi-center GWAS that included 1,817 EOC cases and 2,354 controls in the first phase and 4,162 EOC cases and 4,810 controls in the second phase. We carried out a fixed effects meta-analysis from the two GWAS for ~2.5 million genotyped or imputed SNPs. We selected 24,551 SNPs associated with the risk of either all-histology (11,647 SNPs) or serous ovarian cancer (12,904 SNPs) based on ranked P-values. Assays were designed for 23,239 SNPs and included on a custom Illumina Infinium iSelect array (“iCOGS”) comprising 211,155 SNPs designed by the Collaborative Oncological Gene-environment Study (COGS) to evaluate genetic variants for association with the risk of breast, ovarian and prostate cancer. These SNPs were then genotyped in cases and controls from 43 individual studies from the Ovarian Cancer Association Consortium (OCAC) that were grouped into 34 case-control strata (Table 1 and Supplementary tables 1 and 2). These included most of the samples genotyped in the initial GWAS.
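The fixed effects meta-analysis described above can be sketched as follows. This is a minimal illustration that assumes the usual inverse-variance weighting of per-study log odds ratios; the text does not state the weighting scheme, and the function name and the input values are hypothetical rather than taken from the study.

```python
import math

def fixed_effects_meta(estimates):
    """Inverse-variance weighted fixed-effects meta-analysis.

    estimates: list of (log_odds_ratio, standard_error) pairs, one per study.
    Returns (pooled_log_or, pooled_se, two_sided_p).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    return pooled, pooled_se, p

# Hypothetical per-study estimates for a single SNP (not values from the paper).
north_america = (math.log(1.15), 0.06)
uk = (math.log(1.10), 0.04)
beta, se, p = fixed_effects_meta([north_america, uk])
print(f"pooled OR = {math.exp(beta):.3f}, P = {p:.2g}")
```

Studies with smaller standard errors receive more weight, so the pooled estimate lies closer to the more precise study.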
After applying quality control filters (see Online methods), we tested 22,252 SNPs for association with risk of all invasive EOC and for serous invasive EOC in 18,174 EOC cases (10,316 serous cases) and 26,134 controls. The primary analyses were based on the subjects of European ancestry (16,283 cases and 23,491 controls).
The associations of the four SNPs at 2q31, 8q24, 9p22 and 19p13 previously reported at genome-wide significance were all confirmed (Supplementary table 3). SNPs at the two other loci previously reported near genome-wide significance (at 3q25 and 17q21) 6 were also confirmed. The associated SNP at 3q25 reported by Goode and colleagues (rs2665390) failed design, but a correlated SNP, rs7651446 (r2=0.61), was highly significantly associated with invasive EOC (effect allele frequency (EAF)=0.050, per-allele OR = 1.44, 95% CI: 1.35-1.53, P=1.5×10-28), as was rs9303542 at 17q21 (EAF=0.27, OR = 1.12, 95% CI: 1.08-1.16, P=6.0×10-11). The Manhattan plots for all invasive EOC and serous invasive EOC, after excluding 176 SNPs from the six known loci, are shown in Figure 1. We identified three new loci at genome-wide significance (P<5×10-8), two of which were significant for all invasive EOC (8q21 and 10p12) and another for invasive serous EOC only (17q12) (Table 2). The genotype clustering for the top hits at 8q21 (rs11782652) and 17q12 (rs757210) was good, but clustering of the top hit at 10p12 (rs7084454) was poor (Supplementary Fig. 1). Clustering for a second, highly correlated SNP (r2= 0.86) at this locus (rs1243180) was good, and so the results for this SNP are presented instead.
The most significant association for all invasive EOC was rs11782652 at 8q21 (odds ratio (OR) = 1.19, 95% CI: 1.12-1.26, P=5.5×10-9). This SNP had been selected for replication because it was associated with all invasive EOC in the combined GWAS data (OR =1.20, 95% CI: 1.07-1.36, P=0.0025). This SNP is not correlated with any other SNP in HapMap (Supplementary fig. 2) and was the only SNP in the region selected for genotyping in COGS. Effects varied by histological subtype (P=0.0002), with the strongest effect in the serous subtype (Table 2). There was little evidence for heterogeneity of the association by ancestry (P=0.55) or between the 31 European studies included (P=0.13, Supplementary fig. 3). At 10p12, rs1243180 was selected for replication because it was associated with all invasive EOC in the combined GWAS data (OR =1.11, 95% CI: 1.04-1.19, P=0.0027) and was also associated with risk of all invasive EOC in the replication data (OR = 1.10, 95% CI: 1.06-1.13, P=1.8×10-8). There is strong LD in this region and 32 other SNPs in the region were selected for replication (Supplementary fig. 4). There was some heterogeneity of effects by tumor subtype (P=0.0007) but not by study (P=0.65, Supplementary fig. 5) or by population (P=0.12). At 17q12, rs757210 was selected for replication because it was associated with serous EOC in the combined GWAS data (OR =1.13, 95% CI: 1.04-1.23, P=0.0026) and was most strongly associated with the serous subtype in the replication data (OR = 1.12, 95% CI 1.08-1.17, P=9.6×10-10). Eight other SNPs in the region were selected for replication in COGS (Supplementary fig. 6). The association with all invasive EOC was much weaker (OR = 1.05, 95% CI 1.02-1.09, P=9×10-4); there was substantial heterogeneity by tumor subtype (P<0.0001), with the risk allele for serous EOC being associated with a reduced risk of both clear cell and mucinous EOC (Table 2).
Data from a set of fine mapping SNPs genotyped in this region suggest that this apparent paradox is caused by the presence of two independent loci for serous and clear cell cancer and that the top hit at each of these loci is correlated with rs757210 (CL Pearce, personal communication). There was also heterogeneity by ancestry for the serous subtype (P=0.034) with the risk allele associated with a reduced disease risk in the subjects of mixed ancestry origin, and some between-study heterogeneity (P=0.038, Supplementary fig. 7).
The most significant risk-associated SNPs for the three novel EOC susceptibility loci are located in non-coding DNA sequences, but these may only be markers for the true causal variant(s), which could be functional coding variants, or variants in non-coding DNA elements or non-coding RNAs, influencing the expression of nearby target genes (cis-regulatory effects). They may also act on genes through more distal regulation (trans-regulatory effects) 7-11. In order to identify the most likely functional SNP and target gene we evaluated the putative functional role in EOC for all genes in a one-megabase region centred on the most significant risk-SNP at each locus. We used a combination of locus specific and genome-wide assays to characterize the transcribed genes (see Online methods and Supplementary fig. 8) and regulatory elements (Supplementary fig. 9) within susceptibility regions to evaluate putative functional mechanisms and identify candidate EOC susceptibility gene(s) at each locus.
At the 8q21 locus, the strongest associated SNP, rs11782652, is located in the first intron of CHMP4C. We imputed genotypes in the region to the 1000GP and tested all variants with MAF>0.02 for association. The log-likelihoods of the regression models were compared, and eight SNPs with a log-likelihood within 6.91 of the most strongly associated SNP - equivalent to an odds of 1000:1 - were considered the most likely candidates for being the causal variant. Six of them lie in introns of CHMP4C, but in silico analysis provided functional evidence for only one (rs74544416), which contains a putative SOX9 binding site. One is an indel (4 nucleotides) at the exon-intron border (rs137960856; alleles -/GTGA), but it is unlikely to have a functional impact because the next four nucleotides are also GTGA. Thus, even in the deleted allele the correct exonic sequence is retained and the indel is not expected to affect splicing. The eighth SNP, rs35094336, is predicted to result in a coding change from Ala to Thr that may be functionally relevant (Polyphen2 score: 0.997). This residue is located in a C-terminal amphipathic alpha helix that is conserved in all CHMP4 proteins and is important for binding to ALIX, a protein involved in the ESCRT pathway 12. Further studies will be necessary to determine whether this change is of functional significance and has an impact on ovarian cancer biology. ENCODE data from non-ovarian cancer associated tissues, FAIRE-seq data and mapping of enhancer elements generated in normal serous ovarian cancer precursor cells suggest that there are two regulatory regions that may be influenced by risk associated SNPs: one at the CHMP4C promoter and the other in intron one of CHMP4C (Figure 2).
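The log-likelihood threshold used above follows directly from the stated odds: a likelihood ratio of 1000:1 corresponds to a log-likelihood difference of ln(1000), which rounds to 6.91. The sketch below shows that arithmetic and one way such a filter could be applied; the function name, the SNP labels and the log-likelihood values are hypothetical.

```python
import math

def candidate_causal_snps(loglik_by_snp, odds=1000.0):
    """Return SNPs whose log-likelihood is within ln(odds) of the best model."""
    threshold = math.log(odds)
    best = max(loglik_by_snp.values())
    return sorted(snp for snp, ll in loglik_by_snp.items() if best - ll <= threshold)

# ln(1000) is the threshold quoted in the text.
print(round(math.log(1000.0), 2))

# Hypothetical log-likelihoods for three imputed variants.
example = {"snp_a": -100.0, "snp_b": -105.0, "snp_c": -108.0}
print(candidate_causal_snps(example))
```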
We found no evidence of a correlation between rs11782652 genotype and gene expression in normal ovarian/fallopian tube epithelial cells for any of the nine genes in the region (FABP5, PMP2, FABP4, FABP12, IMPA1, SLC10A5, ZFAND1, CHMP4C and SNX16), but there was a highly statistically significant association between rs11782652 and CHMP4C expression in primary EOCs (P=3.9×10-14) and in transformed lymphocytes (P=0.012). We also found evidence of association for rs11782652 with methylation status (mQTL) for three genes in tumour tissue: ZFAND1 (P=0.003), CHMP4C (P=0.001) and SNX16 (P=0.001). However, only CHMP4C methylation was negatively correlated with expression (P=0.036).
Three genes in the region, FABP5, CHMP4C and SNX16, were significantly overexpressed both in EOC cell lines compared to normal tissues (P=0.002, 4.8×10-9 and 5.9×10-4 respectively) and, where data were available, in primary EOC tissues. Finally, the Catalogue Of Somatic Mutations In Cancer (COSMIC) database showed that four genes in the region, IMPA1, ZFAND1, CHMP4C and SNX16, have functionally significant mutations in cancer, with the last three genes mutated in ovarian carcinoma (Supplementary fig. 10). These four genes form a highly connected co-expression network across different experimental conditions (Supplementary fig. 11).
Taken together, these data suggest that several genes at the 8q21 locus may play a role in the somatic development of EOC, but the cumulative evidence indicates that CHMP4C (chromatin-modifying protein 4C) is the most likely candidate susceptibility gene. This is supported by previously published data on the function of CHMP4C. CHMP4C is involved in the final steps of cell division, co-ordinating midbody resolution with the abscission checkpoint 13, and its transcription is regulated by TP53 to enhance exosome production 14. A more recent study has shown that CHMP4C is frequently overexpressed in ovarian tumor tissues, with the suggestion that it may represent a diagnostic tumor marker and therapeutic target for patients with the disease 15.
At the 10p12 locus, six known genes (NEBL, C10orf113, C10orf114, C10orf140, MLLT10 and DNAJC1) span the one megabase region around rs1243180, which lies in an intron of MLLT10 (Figure 3). Based on data imputed from the 1000 Genomes Project, 57 SNPs are candidates for being functionally significant variants. These include variants in the 3′UTR of C10orf114 and the 5′UTR of C10orf140, and a synonymous variant in MLLT10. Forty-six SNPs lie in introns of the MLLT10 gene, and the remaining eight are intergenic. In silico analyses found little or no functional evidence for any of these SNPs, including the most strongly risk-associated SNP, rs1243180. However, after FAIRE-seq analysis of normal serous ovarian cancer precursor cells, one of these SNPs, rs10828252 (r2=0.87 with rs1243180), was found to coincide with a region of open chromatin, which probably corresponds to the promoter of MLLT10 (Figure 3). Although rs10828252 is not positioned directly at the apex of the signal, but rather within the upstream portion, it is well established that open chromatin at transcription start sites of genes results from the coordinated influence of numerous transcription factors (TF) binding within the vicinity, and therefore it is highly plausible that rs10828247 is modulating one of these TF binding sites. The shape and position of the FAIRE-seq signal may therefore represent the effects of putative TF binding at rs10828247 working in concert with others in close proximity.
This finding suggests a possible mechanism for susceptibility to EOC at this locus through subtle variations in promoter regulation of MLLT10. However, eQTL analysis found no significant associations between genotypes of either rs1243180 or rs10828252 and MLLT10 expression in normal tissues. We did observe eQTL associations for two other coding genes in the region, NEBL and C10orf114 (P=0.04 and 0.03 respectively). C10orf114 expression was also associated with rs1243180 genotypes in primary EOCs (P=0.02), as was C10orf140 (P=0.02). Methylation at both C10orf140 and MLLT10 was associated with rs1243180 genotype in primary EOC tissues (P=0.03 and 0.05 respectively), and both genes show significant negative correlation between methylation and expression (P=0.0016 and 0.002 respectively). C10orf140 also shows a significant difference in methylation in tumors compared to normal tissue (P=1.9×10-5).
Four genes (NEBL, C10orf114, C10orf140 and MLLT10) were significantly overexpressed in EOC cell lines compared to normal tissues (P≤0.01), of which two, C10orf114 and MLLT10, were also overexpressed in primary EOCs (Figure 3). Correlations between gene expression and DNA copy number variation at this locus in primary EOCs suggest that overexpression of the C10orf114 and MLLT10 genes is driven by copy number variation. NEBL is the only gene with reported mutations in ovarian cancer (Supplementary fig. 10). Together, these data suggest that the NEBL, C10orf114, C10orf140 and MLLT10 genes may all play a role in ovarian cancer development, and any of these four genes could be the target susceptibility gene at this locus. However, there is no other evidence to implicate C10orf114 or C10orf140 in EOC or to suggest a rationale that may underlie disease susceptibility. More is known about the function of NEBL and MLLT10, although neither gene has previously been implicated in ovarian cancer. MLLT10 (mixed-lineage leukemia (trithorax homolog, Drosophila) translocated to 10) encodes a transcription factor and has been identified as a partner gene involved in several chromosomal rearrangements resulting in leukaemia 16. More than 60 MLL fusion partner genes have been described at the molecular level, including a recently reported fusion, NEBL-MLL 17. NEBL (Nebulette) encodes a nebulin-like protein that is abundantly expressed in cardiac muscle, and has been implicated in the genetics of sudden cardiac death syndrome and cardiac remodeling 18. This evidence does not directly support a role for these genes in ovarian cancer, but the first common gene fusion in serous ovarian cancers (ESRRA-C11orf20) has been reported recently and provides an underlying hypothesis for the involvement of genes at this locus in EOC development 19.
At chromosome 17q12, the most significant SNP, rs757210, lies in an intron of HNF1B and is associated with serous subtype ovarian cancer. Based on data imputed from the 1000 Genomes Project, nine SNPs are candidates for being the causal variant. SNPs in this region have been associated with diabetes 20, endometrial cancer 21 and prostate cancer 22. There are thirteen genes in the 1 Mb region on either side of this SNP (ACACA, C17orf78, TADA2L, DUSP14, AP1GBP1, DDX52, HNF1B, TBC1D3F, TBC1D3, MRPL45, GPR179, SOCS7, and ARHGAP23). Seven of these (DUSP14, HNF1B, TBC1D3, TBC1D3F, MRPL45, SOCS7 and ARHGAP23) were overexpressed in EOC cell lines and primary tumors compared to normal tissues (Figure 4 and Supplementary fig. 8.iii), indicating they may play a role in EOC. HNF1B is a strong candidate gene target at this locus; it has been extensively studied in EOC and is used as a biomarker for subtype stratification of EOC tumors 23, in particular to distinguish clear cell from other EOC subtypes. Consistent with this, overexpression of HNF1B in EOC cell lines was largely driven by higher expression in clear cell EOC cell lines (Figure 4) 24. However, HNF1B shows lower expression in primary serous EOCs compared to normal tissues, which may suggest a different role for this gene in clear cell compared to serous tumours 25. The phenotypic consequences of HNF1B knockdown in clear cell EOC cell lines also suggest it may behave as an oncogene in the development of this subtype 26. We found no correlation between HNF1B expression and DNA copy number variation at this locus in primary EOCs, but we did find a highly statistically significant inverse correlation between HNF1B expression and methylation (P=2.1×10-6), which implies that the mechanism for overexpression of this gene is epigenetic.
RNAseq analysis of normal ovarian cancer precursor tissues indicates that HNF1B is expressed at extremely low levels (Figure 4), which restricts the extent to which the function of this gene in normal ovarian cancer precursor tissues can be studied. We found no evidence for eQTL associations between rs757210 and the expression of any gene in normal tissues throughout the region, but we observed a strong mQTL association between rs757210 and HNF1B methylation (P=0.009) (Figure 4; Supplementary table 4) in primary serous EOCs. The minor (risk) allele of rs757210 was associated with lower methylation and therefore is predicted to be associated with increased HNF1B expression. In the absence of additional functional data it is difficult to interpret these findings, but given the possible role of HNF1B as an oncogene in the development of clear cell ovarian cancer, it may be that increased HNF1B expression at an early stage in the development of ovarian cancer precursor tissues, driven by the risk variant(s) underlying susceptibility, has increased oncogenic activity in the proportion of serous ovarian cancers that carry this allele.
Overall, the functional data we have generated do not point strongly to any one gene at 17q12 as the functionally relevant gene mediating the genetic association with disease. However, when combined with a large body of previous work implicating HNF1B in ovarian cancer development, they suggest that this is the strongest candidate, and the mQTL and methylation-expression associations suggest a role for genetic variants influencing HNF1B expression and disease susceptibility through epigenetic regulation.
This study has demonstrated the strength of large-scale collaboration in genetic association studies. We have identified three novel common alleles that confer susceptibility to EOC and confirmed two loci that had been previously reported at near genome-wide significance. Molecular analyses of genes at these loci, combining publicly available datasets and systematic, large-scale experiments point to a small number of candidate gene targets that may play a role in EOC initiation and development. However, the effects of the novel loci were modest, and together they explain less than 1 per cent of the excess familial risk of EOC, with about 4 per cent explained by all known loci with common susceptibility alleles. The lack of heterogeneity between studies of varying designs, carried out in different populations and the high levels of statistical significance indicate that these are robust associations. Fewer common susceptibility loci have now been found for EOC than for several other common cancers including breast, colorectal and prostate cancer27. It seems unlikely that the underlying genetic architecture for EOC susceptibility is substantially different from other cancers. This suggests that a key factor limiting our ability to detect susceptibility loci is sample size – the power of this study to detect risk alleles across a range of likely effect sizes is modest (Supplementary fig. 12). However, EOC is less common than these cancers and has a higher mortality and recruiting extremely large numbers of cases will be difficult. Disease heterogeneity will also reduce power if a substantial proportion of EOC susceptibility alleles are subtype specific. All EOC susceptibility loci so far identified are strongly associated with serous EOC, which is also the most common subtype. Both the discovery and replication phases of this study were weighted towards identifying serous associated risk alleles. 
It seems likely that additional common risk loci for clear cell, endometrioid and mucinous EOC subtypes also exist and await identification.
Several EOC susceptibility alleles have now been identified that increase the risk of multiple cancers. For example, an increased risk of estrogen-receptor-negative breast cancer is associated with the EOC susceptibility allele at 19p13 5,28, and the EOC susceptibility allele at the 17q12 locus reported in this manuscript is also associated with risk of endometrial cancer 21 and prostate cancer 22. Several of the loci containing EOC susceptibility alleles have been found to harbour different susceptibility alleles for other cancers. For example, Michailidou and colleagues 29 have found an association between rs7072776 at 10p12 and breast cancer. This SNP is ~120kb centromeric to and partially correlated with rs1243180 (r2=0.51). Michailidou and colleagues also report an association of rs11780156 at 8q24 with breast cancer. The new locus lies ~300kb telomeric of the known locus for ovarian cancer (rs10088218) 6, but is uncorrelated with it (r2=0.02). Both loci lie ~400kb 3′ of MYC. Previous GWAS have identified multiple loci 5′ of MYC that are associated with different cancer types, including a locus for breast cancer. These associations may reflect tissue-specific regulation of key genes, and understanding the functional mechanisms underlying genetic associations at the same locus for different phenotypes may provide insights into more general mechanisms of disease etiology and cancer development.
Assuming a log-additive model of interaction between loci, the currently known loci (Table 2 and Supplementary Table 3) define a genetic risk profile with a combined variance for the log relative risk distribution of 0.057. Such a distribution has limited discriminatory ability; the estimated relative risks at the 5th and 95th centiles are 0.63 and 1.48 respectively. However, based on what is known about the architecture of genetic susceptibility for other cancers, it is likely that many more common susceptibility alleles exist. The discovery of genetic associations with ovarian cancer may be enhanced by taking advantage of functional annotation data and the analysis of gene-gene and gene-environment interactions using a genome-wide approach. Continued international efforts are needed to establish new case-control studies, to expand existing case-control studies and to improve the pathological characterisation of the cases in these studies in order to unravel the inherited genetic basis of the disease. In combination with rarer risk alleles and other risk factors, genetic profiling may provide sufficient discrimination to justify targeted ovarian cancer prevention.
1000 Genomes Project: http://www.1000genomes.org/page.php
NCBI Unigene: http://www.ncbi.nlm.nih.gov/unigene
The Cancer Genome Atlas Project: http://cancergenome.nih.gov/
The Cancer Genome Atlas Project data bioportal: http://www.cbioportal.org/
The Wellcome Trust Case Control Consortium: http://www.wtccc.org.uk/
We combined the results from two ovarian cancer GWAS from North America and the UK. Details of these studies have been published previously 4,31 and are described in the Supplementary note. In order to account for different marker sets and to improve genome coverage, imputation to HapMap2 was performed using 60 CEU founders as reference. Data on 2,508,744 genotyped SNPs or SNPs imputed with r2>0.3 were available for analysis.
The North American and UK studies were analysed separately and the results combined using fixed effects meta-analysis. The 2.5 million SNPs were ranked according to the P-values for each of four analyses performed: North America study only (all invasive and serous histology) and combined GWAS meta-analysis (all invasive and serous histology). The minimal ranking for each SNP was obtained across the four sets of results. SNPs with minor allele frequency less than 3 percent, SNPs that were already genotyped and SNPs in perfect LD with UK GWAS phase 2 SNPs were excluded. We acquired the design score for each SNP using the Illumina Assay Design Tool and removed SNPs that were redundant or predicted to perform poorly. In total, 24,551 SNPs were included on the iCOGS custom genotyping array (see Supplementary materials).
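The minimal-ranking step can be illustrated with a short sketch: each SNP is ranked by P-value within each analysis, and its best (lowest) rank across analyses is used to order the final selection. The function names, analysis labels and P-values below are hypothetical, and the exclusion filters described above are omitted for brevity.

```python
def minimal_ranks(pvalues_by_analysis):
    """Map each SNP to its best (lowest) rank across analyses.

    pvalues_by_analysis: dict of analysis name -> dict of SNP -> P-value.
    """
    best = {}
    for pvals in pvalues_by_analysis.values():
        ordered = sorted(pvals, key=pvals.get)  # smallest P-value first
        for rank, snp in enumerate(ordered, start=1):
            best[snp] = min(rank, best.get(snp, rank))
    return best

def select_top(pvalues_by_analysis, n):
    """Return the n SNPs with the lowest minimal rank (ties broken by name)."""
    ranks = minimal_ranks(pvalues_by_analysis)
    return sorted(ranks, key=lambda snp: (ranks[snp], snp))[:n]

# Hypothetical P-values for three SNPs in two of the four analyses.
analyses = {
    "north_america_all_invasive": {"s1": 0.5, "s2": 0.01, "s3": 0.2},
    "meta_serous": {"s1": 0.001, "s2": 0.3, "s3": 0.2},
}
print(select_top(analyses, 2))
```

Taking the minimum rank lets a SNP qualify if it is strongly associated in any one of the analyses, even if it ranks poorly in the others.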
A total of 47,630 samples from 43 studies in OCAC were genotyped of which 44,308 passed QC, including 18,174 (10,315 serous) cases and 26,134 controls (Supplementary table 1). The HapMap samples for European (CEU, n=60), African (YRI, n=53) and Asian (JPT+CHB, n=88) populations were also genotyped using the iCOGS array.
Genotyping was conducted using an Illumina Infinium iSelect BeadChip in six centres, of which two were used for OCAC - McGill University and Génome Québec Innovation Centre (n=19,806) and the Mayo Clinic Medical Genome Facility (n=27,824). Each 96-well plate contained 250 ng genomic DNA (or 500 ng whole-genome amplified DNA). Raw intensity data files for all consortia were sent to the COGS data co-ordination centre at the University of Cambridge for centralized genotype calling and QC. Genotypes were called using GenCall 32. Initial calling used a cluster file generated using 270 samples from HapMap2. These calls were used for ongoing QC checks during the genotyping. To generate the final calls used for the data analysis, we first selected a subset of 3,018 individuals, including samples from each of the genotyping centres, each of the participating consortia, and each major ethnicity. Only plates with a consistently high call rate in the initial calling were used. The HapMap samples and ~160 samples that were known positive controls for rare variants on the array were used to generate a cluster file that was then applied to call the genotypes for the remaining samples. We also investigated two other calling algorithms, Illuminus 33 and GenoSNP 34, but manual inspection of a sample of SNPs with discrepant calls indicated that GenCall was invariably superior.
One thousand two hundred and seventy-three OCAC samples were genotyped in duplicate. Genotypes were discordant for greater than 40 percent of SNPs for 22 pairs. For the remaining 1,251 pairs, concordance was greater than 99.6 percent. In addition, we identified 245 pairs of samples that were unexpected genotypic duplicates. Of these, 137 were phenotypic duplicates and judged to be from the same individual. We used identity-by-state to identify 618 pairs of first-degree relatives. Samples were excluded according to the following criteria: 1) 1,133 samples with a conversion rate of less than 95 percent; 2) 169 samples with heterozygosity >5 standard deviations from the intercontinental ancestry specific mean heterozygosity; 3) 65 samples with ambiguous sex; 4) 269 samples with the lowest call rate from a first-degree relative pair; 5) 1,686 samples that were either duplicate samples that were non-concordant for genotype or genotypic duplicates that were not concordant for phenotype. Thus, a total of 44,308 subjects including 18,174 cases and 26,134 controls were available for analysis. Of these, 2,482 had been in the North American GWAS, 1,641 were in Phase 1 of the UK GWAS and 8,463 in Phase 2 of the UK GWAS.
Of 211,155 SNP assays successfully designed and included on the array, SNPs were excluded according to the following criteria: (1) 1,311 SNPs without a genotype call; (2) 2,857 monomorphic SNPs; (3) 5,201 SNPs with a call rate less than 95 percent and MAF > 0.05 or call rate less than 99 percent with MAF < 0.05; (4) 2,194 SNPs showing evidence of deviation of genotype frequencies from Hardy-Weinberg equilibrium (p < 10-7); (5) 22 SNPs with greater than 2 percent discordance in duplicate pairs. Overall, 94.5 percent of SNPs passed QC. Genotype intensity cluster plots were visually inspected for the most strongly associated SNPs at each newly identified locus.
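The sample and SNP counts reported in the two QC paragraphs above can be checked with simple arithmetic. The sketch below uses only the numbers given in the text and assumes that the listed exclusion categories do not overlap; the variable names are illustrative.

```python
# Sample QC: counts as reported in the text.
genotyped_samples = 47_630
sample_exclusions = {
    "conversion rate < 95 percent": 1_133,
    "outlying heterozygosity": 169,
    "ambiguous sex": 65,
    "first-degree relative with lower call rate": 269,
    "discordant duplicates": 1_686,
}
samples_passed = genotyped_samples - sum(sample_exclusions.values())

# SNP QC: counts as reported in the text.
designed_snps = 211_155
snp_exclusions = {
    "no genotype call": 1_311,
    "monomorphic": 2_857,
    "low call rate": 5_201,
    "Hardy-Weinberg deviation": 2_194,
    "discordant in duplicates": 22,
}
snps_passed = designed_snps - sum(snp_exclusions.values())
snp_pass_percent = 100.0 * snps_passed / designed_snps

print(samples_passed)              # should equal cases + controls
print(18_174 + 26_134)
print(round(snp_pass_percent, 1))
```

Under the non-overlap assumption, the remaining samples match the 18,174 cases plus 26,134 controls, and the SNP pass rate rounds to the reported 94.5 percent.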
We used the program LAMP35 to assign intercontinental ancestry based on the genotype frequencies for the European, Asian and African populations. Subjects with greater than 90 percent European ancestry were defined as European (n=39,944) and those with greater than 80 percent Asian or African ancestry were defined as being Asian (n=2,388) and African respectively (n=387). All other subjects were defined as being of mixed ancestry (n=1,770). We then used a set of 37,000 unlinked markers to perform principal components analysis within each major population subgroup 36. To enable this analysis on very large samples we used an in-house programme written in C++ using the Intel MKL libraries for eigenvectors (available at http://ccge.medschl.cam.ac.uk/software/).
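The ancestry thresholds described above amount to a simple classification rule once the ancestry proportions have been estimated. The following sketch assumes the proportions are already available as fractions between 0 and 1 (for example from LAMP); the function name and the example inputs are hypothetical.

```python
def assign_ancestry(european, asian, african):
    """Classify a subject from estimated intercontinental ancestry fractions."""
    if european > 0.90:
        return "European"
    if asian > 0.80:
        return "Asian"
    if african > 0.80:
        return "African"
    return "Mixed"

# Hypothetical subjects.
print(assign_ancestry(0.95, 0.03, 0.02))
print(assign_ancestry(0.10, 0.85, 0.05))
print(assign_ancestry(0.60, 0.30, 0.10))
```

The stricter threshold for European ancestry reflects the cut-offs stated in the text; the three rules cannot conflict because the fractions sum to one.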
Unconditional logistic regression treating the number of alternate alleles carried as an ordinal variable (log-additive, co-dominant model) was used to evaluate the association between each SNP and ovarian cancer risk. A likelihood ratio test was used to test for association, and per-allele log odds ratios and 95 percent confidence limits were estimated. The likelihood ratio test has been shown to have greater power than alternatives such as the Wald test and score test for rare variants 37. Separate analyses were carried out for each ancestry group. The model for European subjects was adjusted for study and population substructure by including study-specific indicators and the first five principal components in the model. For analysis of the Asian and other ancestry groups, the first five ancestry-specific principal components were included in the model, and one principal component was included in the model for analysis of subjects of African ancestry. The number of principal components was chosen based on the position of the inflexion of the principal components scree plot (Supplementary fig. 13). We tested for subtype-specific heterogeneity by comparing genotype frequencies in the four case subtypes using the Kruskal-Wallis test. We tested for heterogeneity of ORs by study and ancestry using the method of Breslow and Day 38.
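A per-allele (log-additive) logistic regression with a likelihood ratio test can be sketched for a single SNP using only genotype counts. This is a simplified illustration: it omits the study indicators and principal components used in the actual analysis, fits the model by Newton-Raphson, and uses hypothetical counts and function names.

```python
import math

def _loglik(a, b, groups):
    """Binomial log-likelihood for logit(p) = a + b * genotype."""
    ll = 0.0
    for g, cases, controls in groups:
        eta = a + b * g
        log1pexp = math.log1p(math.exp(eta))
        ll += cases * (eta - log1pexp) - controls * log1pexp
    return ll

def per_allele_lrt(controls_by_genotype, cases_by_genotype, iterations=25):
    """Fit the log-additive model and test b = 0 with a 1-df likelihood ratio test."""
    groups = [(g, cases_by_genotype[g], controls_by_genotype[g]) for g in (0, 1, 2)]
    # Null model (intercept only) has a closed-form maximum likelihood estimate.
    a0 = math.log(sum(cases_by_genotype) / sum(controls_by_genotype))
    ll_null = _loglik(a0, 0.0, groups)
    a, b = a0, 0.0
    for _ in range(iterations):
        ga = gb = haa = hab = hbb = 0.0
        for g, cases, controls in groups:
            n = cases + controls
            p = 1.0 / (1.0 + math.exp(-(a + b * g)))
            ga += cases - n * p            # score for the intercept
            gb += g * (cases - n * p)      # score for the per-allele effect
            w = n * p * (1.0 - p)
            haa += w
            hab += w * g
            hbb += w * g * g
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det   # Newton-Raphson update
        b += (haa * gb - hab * ga) / det
    chi2 = max(2.0 * (_loglik(a, b, groups) - ll_null), 0.0)
    se = math.sqrt(haa / det)  # from the information matrix at the last iteration
    return {
        "odds_ratio": math.exp(b),
        "ci95": (math.exp(b - 1.96 * se), math.exp(b + 1.96 * se)),
        "chi2": chi2,
        "p_value": math.erfc(math.sqrt(chi2 / 2.0)),  # chi-square, 1 df
    }

# Hypothetical genotype counts (0, 1 or 2 copies of the effect allele).
print(per_allele_lrt(controls_by_genotype=[400, 400, 400],
                     cases_by_genotype=[100, 200, 400]))
```

In practice a statistics package would be used so that covariates can be included; the hand-written fit here is only meant to make the test statistic explicit.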
In order to assess the magnitude of confounding due to cryptic population substructure, we tested the 147,722 SNPs that had not been selected as candidates for ovarian cancer susceptibility. Inflation in the test statistics (λ) was estimated by dividing the median of the test statistic by 0.455 (the median for the χ2 distribution on 1df). The inflation was converted to an equivalent inflation for a study with 1000 cases and 1000 controls (λ1000) by adjusting by effective study size, namely
$$\lambda_{1000} = 1 + \frac{500(\lambda - 1)}{\sum_k \left(1/n_k + 1/m_k\right)^{-1}}$$
where n is the number of cases and m the number of controls in each study stratum, k. In analyses restricted to European subjects and adjusted only for study there was a small inflation of the test statistics (λ = 1.13, λ1000 = 1.007). This was reduced to 1.078 (λ1000 = 1.004) after adjusting for five principal components. Heterogeneity of odds ratios between studies was tested with Cochran's Q statistic.
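The inflation calculation can be sketched as below, assuming the rescaling takes the form λ1000 = 1 + 500(λ − 1)/Σk(1/nk + 1/mk)^−1, i.e. the excess inflation is scaled by the ratio of the effective size of a 1000-case, 1000-control study (500) to the summed effective size of the strata. The function names are illustrative, and the example treats all European subjects as a single stratum, which is a simplification of the per-study strata used in the paper.

```python
import statistics

CHI2_1DF_MEDIAN = 0.455  # median of the chi-square distribution on 1 df

def genomic_inflation(chi2_stats):
    """Estimate lambda as the median test statistic divided by 0.455."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def lambda_1000(lam, strata):
    """Rescale lambda to a study of 1000 cases and 1000 controls.

    strata: list of (n_cases, m_controls) pairs, one per stratum k.
    """
    effective_size = sum(1.0 / (1.0 / n + 1.0 / m) for n, m in strata)
    return 1.0 + 500.0 * (lam - 1.0) / effective_size

# European cases and controls from the text, treated as one stratum (assumption).
european = [(16_283, 23_491)]
print(round(lambda_1000(1.13, european), 3))
print(round(lambda_1000(1.078, european), 3))
```

With these inputs the rescaled values round to the λ1000 figures quoted in the text, which suggests the assumed form is at least consistent with the reported results.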
We performed the following assays for each gene in the one megabase region centred on the most significant SNP at each locus (see Supplementary note for details): (1) Gene expression in EOC cell lines (n=50) and normal precursor cells and tissues for ovarian cancers - ovarian surface epithelial cells (OSE) and fallopian tube secretory epithelial (FTSE) cells (n=73); (2) CpG island methylation analysis in high grade serous EOC (n=106) and normal tissues (n=7).
We also evaluated these genes in silico, using bioinformatics tools to mine publicly available somatic genetic data generated for primary EOCs and other cancer types. These were The Cancer Genome Atlas (TCGA) data for ~500 high grade serous EOCs (gene expression, somatic mutation, DNA copy number variation, eQTL, and methylation data) 25 and the Catalogue Of Somatic Mutations In Cancer (COSMIC) 39 analysis of mutations in genes curated from the published literature and data from whole genome resequencing of cancer samples, undertaken by the Cancer Genome Project. We generated co-expression networks for genes in each locus using GeneMANIA, a large dataset of gene expression studies (n=154) 40.
Together, these data enabled us to: (1) compare gene expression in tumor and normal epithelium (EOC cell lines/normal cell lines and TCGA tumors/normal tissue); (2) test for association between copy number alteration and gene expression at each locus; (3) compare gene methylation status in tumor and normal tissue; (4) carry out a gene expression quantitative trait locus (eQTL) analysis to evaluate associations between germline genotype and gene expression in lymphoblastoid cell lines, normal serous EOC precursor tissues and tumors; and (5) carry out a methylation quantitative trait locus (mQTL) analysis to evaluate associations between germline genotype and gene methylation in tumors (Supplementary figs. 8i-iii).
We used data from the ENCyclopedia Of DNA Elements (ENCODE) 7 to evaluate the overlap between regulatory elements in non-coding regions and risk-associated SNPs at the three loci. ENCODE describes regulatory DNA elements (e.g. enhancers, insulators and promoters) and non-coding RNAs (e.g. micro-RNAs, long non-coding and piwi-interacting RNAs) that may be targets for susceptibility alleles (Supplementary figs. 9i-iii). However, ENCODE does not include data for EOC-associated tissues, and activity of such regulatory elements often varies in a tissue-specific manner 7,41. Therefore, we profiled the spectrum of non-coding regulatory elements in OSE and FTSE cells using a combination of formaldehyde assisted isolation of regulatory elements sequencing (FAIRE-seq) and RNA sequencing (RNA-seq) (Supplementary figs. 9i-iii). We also analysed regulatory regions in early-stage transformed OSE cells. For all regulatory biofeatures spanning the one megabase region at each locus, we evaluated their overlap with the most strongly associated SNP and all SNPs correlated with it at r2≥0.8.
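The overlap evaluation described above can be sketched as a simple interval lookup. This is a hypothetical illustration, not the pipeline used in the study: the SNP identifiers, positions, r2 values and feature names are invented, all coordinates are assumed to lie on one chromosome, and feature intervals are treated as inclusive.

```python
def correlated_snps(ld_with_lead, lead, threshold=0.8):
    """Return the lead SNP plus every SNP whose r2 with the lead is at least the threshold."""
    proxies = [snp for snp, r2 in ld_with_lead.items() if snp != lead and r2 >= threshold]
    return [lead] + proxies


def overlapping_features(position, features):
    """Names of features (name, start, end) whose inclusive interval contains the position."""
    return [name for name, start, end in features if start <= position <= end]


def annotate_locus(lead, positions, ld_with_lead, features, half_window=500_000):
    """Map the lead SNP and its correlated SNPs to the regulatory features they fall in.

    Only features intersecting the one megabase region centred on the lead SNP are considered.
    """
    centre = positions[lead]
    nearby = [f for f in features if f[2] >= centre - half_window and f[1] <= centre + half_window]
    return {
        snp: overlapping_features(positions[snp], nearby)
        for snp in correlated_snps(ld_with_lead, lead)
    }


# Hypothetical inputs for illustration only.
positions = {"snp_lead": 1_000_000, "snp_a": 1_000_500, "snp_b": 1_200_000}
ld_with_lead = {"snp_a": 0.9, "snp_b": 0.3}
features = [
    ("enhancer_1", 1_000_400, 1_000_600),
    ("promoter_1", 999_900, 1_000_100),
    ("distant_element", 5_000_000, 5_000_100),
]
result = annotate_locus("snp_lead", positions, ld_with_lead, features)
```

In this toy input, `snp_b` is dropped because its r2 with the lead SNP is below 0.8, and `distant_element` is ignored because it lies outside the one megabase window.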
We thank all the individuals who took part in this study and all the researchers, clinicians and technical and administrative staff who have made possible the many studies contributing to this work (a full list is provided in the Supplementary note). The COGS project is funded through a European Commission Seventh Framework Programme grant (agreement number 223175 - HEALTH-F2-2009-223175). The Ovarian Cancer Association Consortium is supported by a grant from the Ovarian Cancer Research Fund thanks to donations by the family and friends of Kathryn Sladek Smith (PPD/RPCI.07). The scientific development and funding for this project were in part supported by the Genetic Associations and Mechanisms in Oncology (GAME-ON): an NCI Cancer Post-GWAS Initiative (U19-CA148112). Details of the funding of individual investigators and studies are provided in the Supplementary note. This study made use of data generated by the Wellcome Trust Case Control Consortium: funding for the project was provided by the Wellcome Trust under award 076113; a full list of the investigators who contributed to the generation of the data is available from the website (see URLs). The results published here are in part based upon data generated by The Cancer Genome Atlas Pilot Project established by the National Cancer Institute and National Human Genome Research Institute: information about TCGA and the investigators and institutions who constitute the TCGA research network can be found on the website (see URLs).
Writing group: PDPP, Y-YT, CMP, SJR, JS, TS, BLF, ELG, ANAM and SAG. All authors read and approved the final manuscript.
Provision of samples and data from contributing studies: KL, MP, JPT, HS, RW, RK, MCL, HS, DCT, FB, DV, JMC, JD, ED, KKA, HA-C, NA, SMA, LB, EVB, MWB, MJB, GB, NB, JDB, LAB, AB-W, R Brown, R Butzow, IC, MEC, RSS, JC-C, YAC, ZC, AD-M, ED, JAD, TD, AdB, MD, DE, RE, ABE, PAF, DF, JF, Y-TG, MG-C, AG-M, GG, AG, MG, JG, QG, MKH, PH, AH, FH, PH, MH, EH, CKH, SH, AJa, AJe, KRK, BYK, LEK, LAK, SK-K, GEK, CK, JK, DL, SL, NDL, NL, JLe, AL, BKL, JLi, JLu, LL, GL, LFAGM, KM, VM, JRM, UM, FM, KBM, TN, SAN, RBN, HN, SN, HN, KO, SO, IO, JP, TP, LMP, JP-W, MCP, EMP, XQ, HAR, LR-R, MAR, AR, IR, IKR, HBS, IS, GS, HS, VS, X-oS, WS, MCS, PS, KY, S-HT, KLT, PJT, AT, SST, AMvA, DvdB, IV, RAV, AFV, SW-G, NW, ASW, EW, BW, YLW, AHW, HPY, WZ, AZ, FZ, MTG, PH, DFE, CLP, AB, GC-T, EI, and JMS.
Bioinformatics and data management: JD, ED, ZC and RW.
Data analysis: JT, QG, Y-YT and BLF.
Preparation of samples for genotyping: SJR and CMP.
Genotyping: JMC, DCT, FB and DV.
Functional analyses: SAG, MB, ANAM, BLF, KL, HS, ELG, SJR, AC, ML.