DNA damage is an established mediator of carcinogenesis, though GWAS have identified few significant loci. This cross-cancer site, pooled analysis was performed to increase the power to detect common variants of DNA repair genes associated with cancer susceptibility.
We conducted a cross-cancer analysis of 60,297 SNPs, at 229 DNA repair gene regions, using data from the NCI Genetic Associations and Mechanisms in Oncology (GAME-ON) Network. Our analysis included data from 32 GWAS and 48,734 controls and 51,537 cases across five cancer sites (breast, colon, lung, ovary, and prostate). Because of the unavailability of individual data, data were analyzed at the aggregate level. Meta-analysis was performed using the Association analysis for SubSETs (ASSET) software. To test for genetic associations that might escape individual variant testing due to small effect sizes, pathway analysis of eight DNA repair pathways was performed using hierarchical modeling.
We identified three susceptibility DNA repair genes, RAD51B (p < 5.09 × 10−6), MSH5 (p < 5.09 × 10−6) and BRCA2 (p = 5.70 × 10−6). Hierarchical modeling identified several pleiotropic associations with cancer risk in the base excision repair, nucleotide excision repair, mismatch repair, and homologous recombination pathways.
Only three susceptibility loci were identified which had all been previously reported. In contrast, hierarchical modeling identified several pleiotropic cancer risk associations in key DNA repair pathways.
Results suggest that many common variants in DNA repair genes are likely associated with cancer susceptibility through small effect sizes that do not meet stringent significance testing criteria.
DNA damage is an established mediator of carcinogenesis (1). Several carcinogens (e.g. chemical mutagens, viruses, and irradiation) are known to cause cancer through their ability to damage DNA (2–6). Consistent with this established model of carcinogenesis, mutations in many genes known to confer cancer risk (e.g. TP53 (7), ATM (8), BRCA1 (9), BRCA2 (10)), are known to play major roles in DNA damage repair and signaling response (11–15). However, while mutations in these genes are associated with high degrees of individual cancer risk (7, 9, 10), these rare events explain only a small fraction of all cancers (5). Given the importance of DNA damage to carcinogenesis, it is plausible that cancer risk would be conferred by common variants of these and other DNA repair genes, and that this risk could be measured in large, genome-wide association studies (GWAS).
GWAS have identified hundreds of single nucleotide polymorphisms (SNPs) and susceptibility loci associated with risk for various cancers (16–26). However, few GWAS have identified cancer susceptibility loci near DNA repair genes at stringent levels of significance that have also been shown to function through altered DNA repair (21, 24, 26, 27). These data suggest that common variants in DNA repair genes may not make important contributions to cancer susceptibility, and that cancer susceptibility may be mostly conferred by high-risk, rare variants within this class of genes. However, it is possible that underpowered association studies could miss common variants with weak effect sizes. In order to investigate this hypothesis, a comprehensive candidate gene association study of DNA repair genes was performed.
The present study analyzes genetic data from 229 DNA repair genes. In order to increase the power to detect common variant effects, a meta-analysis was performed, using the NCI Genetic Associations and Mechanisms in Oncology (GAME-ON) Network database, which includes data from breast, colon, lung, ovary, and prostate cancer. The Association analysis for SubSETs (ASSET) software package (Bioconductor) was used to conduct the meta-analysis of the large dataset (48,734 controls, 51,537 cases), which also allows for the evaluation of subset effects in a potentially heterogeneous dataset. Since the effect for each SNP may only reach significance in certain cancers (a subset of studies) this represents a powerful and practical approach to meta-analysis. The use of a candidate gene study restricted to DNA repair genes, the size and comprehensiveness of the GAME-ON database, and the use of ASSET to interrogate this large dataset for subset effects with minimal loss of power, represents a significantly more powerful approach to detect individual genetic variants in loci near DNA repair genes than has been previously attempted.
In order to test for cancer risk associations among DNA repair genes, which might escape individual variant testing due to weak effect sizes, dimensional reduction of the dataset was also performed by pathway analysis, using hierarchical modeling (28, 29). DNA repair genes segregate into fairly exclusive, well-defined pathway categories, which provides a strong, rational basis to use this information as a means to achieve dimensional reduction of the dataset, as findings in the pathway categories are therefore more likely to have underlying biological meaning and less likely to be an artifact of pathway analysis procedures. The hierarchical modeling procedure was selected for use in this study because of its compatibility with the summary-level data available in the GAME-ON database and because this approach to pathway analysis uses information from across the entire dataset, instead of being driven by only a handful of the most significant individual variants. Using pathway membership as binary covariates, the multivariate regression framework of hierarchical modeling allowed for estimation of pathway effect size and significance (p-value) for each pathway. Significant effects in the pathway covariates were interpreted as supportive evidence for the associations between variants in the DNA repair pathways and cancer susceptibility.
The GAME-ON Network (http://epi.grants.cancer.gov/gameon/) includes GWAS data from 32 studies across North America and Europe as well as Australia, representing five common cancer sites: breast, colon, lung, ovary, and prostate (16, 17, 19–23, 30–33). In total, this included 51,537 cancer cases and 48,734 controls. Data analyzed included summary statistics for each study, after adjusting for age, gender, and population stratification using principal components as applicable (Supplementary Table 1). Genomic variant data was imputed to the 1000 Genomes reference panel using either MACH or IMPUTE (34–36). Imputation was separately carried out for each cancer site. Following imputation, there were 6,300,179 SNPs available for analysis, which were shared among all the GAME-ON databases. To avoid population stratification, all study participants included in the analysis were of European descent. Table 1 summarizes the sample sizes of each participating study, and more detailed characteristics are provided in Supplementary Table 1.
We initially identified 247 DNA damage repair and signaling response genes using Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway Database (http://www.genome.jp/kegg/pathway.html). Since the GAME-ON data did not include sex chromosomes, the gene list was reduced to 229 genes for the final analysis (Supplementary Table 2).
Single nucleotide polymorphisms (SNPs) were queried from the region of every gene included in the study, using dbSNP (http://www.ncbi.nlm.nih.gov/snp/) and the GRCh38 reference build of the human genome. There is no data to suggest how far a SNP may be from a gene and still have functional effects on that gene. It is known that variants that affect gene activity can be located as far as 100 kb away from the start and stop sites of the genes; however, the inclusion of a larger search window reduces study power and increases the chance that associations are found that apply to genes other than the one of interest. In an attempt to address these competing concerns, gene regions were a priori defined as 50 kb upstream and downstream of the official start and stop sites for each gene (Supplementary Table 2). Selection criteria for a SNP’s inclusion into the study included MAF > 0.01 and being part of the 1000 Genomes database. This resulted in the initial selection of 156,804 SNPs. SNPs were then omitted from the analysis if they were not present within every dataset in the GAME-ON network, resulting in a final count of 60,297 SNPs to be included in this study (Supplementary Table 2).
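As an illustration of the selection rule described above, the following Python sketch keeps a SNP when it lies within 50 kb of a gene's start or stop site and has MAF > 0.01. The `Snp` record, the function names, and any coordinates passed to them are hypothetical. The actual queries were run against dbSNP and the 1000 Genomes database, and the further requirement that a SNP be present in every GAME-ON dataset is not modeled here.

```python
from dataclasses import dataclass

WINDOW_BP = 50_000   # 50 kb upstream and downstream, as defined a priori in the text
MIN_MAF = 0.01       # minor allele frequency cutoff stated in the text


@dataclass(frozen=True)
class Snp:
    """Hypothetical minimal SNP record used only for this illustration."""
    rsid: str
    chrom: str
    pos: int      # base-pair position on the reference build
    maf: float    # minor allele frequency


def in_gene_region(snp, chrom, gene_start, gene_end, window=WINDOW_BP):
    """True if the SNP falls inside the gene body extended by `window` bp on each side."""
    return snp.chrom == chrom and gene_start - window <= snp.pos <= gene_end + window


def select_snps(snps, chrom, gene_start, gene_end):
    """Keep SNPs inside the extended gene region whose MAF exceeds the cutoff."""
    return [
        s for s in snps
        if s.maf > MIN_MAF and in_gene_region(s, chrom, gene_start, gene_end)
    ]
```

A wider window would capture more distal regulatory variants at the cost of more tests and more associations attributable to neighboring genes, which is the trade-off the 50 kb choice is meant to balance.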
Because of the unavailability of individual data for all five cancer sites, the data were analyzed at the aggregate level only. All datasets were standardized so that the reference allele for each SNP was the same across datasets. Summary statistics for each genetic variant were obtained for each cancer site and for select histologic subtypes of breast (estrogen receptor (ER)-negative), prostate (aggressive), ovarian (serous and mucinous) and lung (adenocarcinoma) cancers (Table 1). Lung adenocarcinoma, serous ovarian cancer, and mucinous ovarian cancer were subtypes chosen for special inclusion into this study, given their prior associations with DNA repair genes (24, 37, 38). ER-negative breast cancer and aggressive prostate cancer subsets were also included to reflect genetic associations that may be linked with more aggressive forms of the disease. Aggressive prostate cancer was defined as disease cases having a Gleason score of 8 or greater (except for the BPC3 and CGEMS studies which included cases with tumor stage C or greater or cases with a Gleason score of 7 or greater, respectively) (30).
All dataset summary statistics included odds ratios (ORs) and standard errors (SEs) derived from logistic regression analyses. Study-specific results were combined within each cancer site using a fixed effects model. Pooled estimates by cancer site were adjusted for age, principal components for population structure, and gender, where applicable.
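For readers unfamiliar with fixed-effects pooling, a minimal inverse-variance-weighted sketch in Python follows. It assumes that each study's SE refers to the log odds ratio, which the text does not state explicitly; the function name and inputs are illustrative and are not taken from the GAME-ON pipeline.

```python
import math


def fixed_effects(odds_ratios, log_or_ses):
    """Inverse-variance-weighted fixed-effects pooling of per-study odds ratios.

    odds_ratios : per-study ORs
    log_or_ses  : per-study standard errors of log(OR)  (assumption: SEs are on the log scale)
    Returns (pooled OR, SE of the pooled log OR, z statistic).
    """
    weights = [1.0 / se ** 2 for se in log_or_ses]
    total_weight = sum(weights)
    pooled_log_or = sum(
        w * math.log(odds_ratio) for w, odds_ratio in zip(weights, odds_ratios)
    ) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return math.exp(pooled_log_or), pooled_se, pooled_log_or / pooled_se
```

Because each study is weighted by the inverse of its variance, larger and more precise studies dominate the pooled estimate, and the pooled SE is never larger than the smallest study-specific SE.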
Using the subset-based approach provided for by the ASSET software package (http://www.bioconductor.org) (39), each genetic variant was evaluated for pleiotropic association with cancer risk across multiple cancer sites and histologic subtype. For every genetic variant, effect sizes between studies were combined, by finding the best subset to maximize the test statistic. The final test statistic for each SNP is obtained by maximizing the subset-specific test statistic over all possible subsets, correcting for multiple-testing. ASSET calculates the effect size and significance of each SNP across all studies and also returns a list of studies that constitute the “best subset” of studies associated with the SNP under the assumption of a common direction of association (“1-sided ASSET analysis”) or it can allow for the assumption that significant effects may occur in opposite directions for the same genetic variant, between studies (“2-sided ASSET analysis”). In practice, however, genetic loci that are detected as significant across cancer studies have an overwhelming tendency to have the same direction of association. In this report it was assumed that variants of DNA repair genes were not likely to have effects that were opposite in direction across cancer sites.
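The core idea of a subset-based search can be sketched in Python as an exhaustive scan over all non-empty subsets of studies, keeping the subset with the largest absolute combined z statistic. This is a simplified illustration only: the sample-size weighting, the function name, and the inputs are assumptions, and the sketch omits the correction for maximizing over subsets and the adjustment for overlapping subjects that the ASSET package applies. It should not be read as the package's implementation.

```python
import math
from itertools import combinations


def best_subset(z_scores, sample_sizes):
    """Exhaustively search study subsets for the largest |combined z|.

    z_scores     : dict mapping study name -> z statistic for one SNP
    sample_sizes : dict mapping study name -> sample size (used as weights; an assumption)
    Returns (tuple of study names in the best subset, combined z for that subset).
    The returned z is NOT corrected for the search over subsets.
    """
    names = list(z_scores)
    best_names, best_z = None, 0.0
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            total_n = sum(sample_sizes[s] for s in subset)
            # Weighted sum of z scores; weights are square roots of sample-size fractions.
            combined = sum(
                math.sqrt(sample_sizes[s] / total_n) * z_scores[s] for s in subset
            )
            if best_names is None or abs(combined) > abs(best_z):
                best_names, best_z = subset, combined
    return best_names, best_z
```

With K studies the scan visits 2^K − 1 subsets, which is manageable for the ten cancer sites and subtypes considered here. Studies with null effects are left out of the best subset because including them dilutes the combined statistic.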
The correlation between studies was corrected for by tabulating the number of shared cases and controls between studies and generating a covariance matrix when estimating standard errors. This included overlapping controls from the UK ovarian cancer and UK breast cancer GWAS, both of which included controls from the Wellcome Trust Case Control Consortium (WTCCC). Since significant effects for each SNP may only exist in certain cancers (a subset of studies), this represents a powerful and practical approach to meta-analysis. Of the 60,297 SNPs included in the study, 9,806 SNPs were found not to be in high linkage disequilibrium (LD) (R2 > 0.70). This SNP count was used to set the Bonferroni-corrected threshold for a genetic variant to reach statistical significance, p = 5.09 × 10−6. The statistical significance for each genetic variant was calculated using the Bonferroni method in ASSET.
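The threshold arithmetic can be checked with a few lines of Python. The family-wise error rate of 0.05 is an assumption, since the text does not state it, but it is consistent with the reported value.

```python
ALPHA = 0.05               # assumed family-wise error rate (not stated in the text)
INDEPENDENT_SNPS = 9_806   # SNPs not in high LD, from the text


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, INDEPENDENT_SNPS)
print(f"{threshold:.3e}")  # about 5.099e-06, reported in the text as 5.09 × 10−6
```

Dividing by the number of approximately independent SNPs, rather than by all 60,297 tested SNPs, avoids over-correcting for variants that are strongly correlated with one another.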
To reduce the correlation structure in the SNP dataset, SNPs were pruned from the analysis if they were found to be in high LD (R2 > 0.70), as determined using the online SNP Annotation and Proxy Search (SNAP) tool (Broad Institute, http://www.broadinstitute.org/mpg/snap/). If LD information for a SNP was not available from the SNAP tool, it was pruned from the analysis. This resulted in 9,806 SNPs available for pathway analyses (Table 3).
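A greedy pruning pass of the kind described above might look like the following Python sketch. The pairwise R2 lookup table and the function name are hypothetical; the study obtained LD values from the SNAP tool, and the exact pruning order it used is not specified. Unlike the study, which dropped SNPs lacking LD information, this sketch treats a missing pair as uncorrelated.

```python
def ld_prune(snps, r2, threshold=0.70):
    """Greedy LD pruning: keep a SNP only if it is not in high LD with any SNP already kept.

    snps      : SNP identifiers in the order they should be considered
    r2        : dict mapping frozenset({snp_a, snp_b}) -> pairwise R2 (hypothetical lookup)
    threshold : R2 above which two SNPs are considered to be in high LD
    """
    kept = []
    for snp in snps:
        # Pairs missing from the lookup default to R2 = 0 (an assumption of this sketch).
        if all(r2.get(frozenset((snp, other)), 0.0) <= threshold for other in kept):
            kept.append(snp)
    return kept
```

The result of a greedy pass depends on the input order, so sorting SNPs beforehand (for example, by genomic position) makes the pruned set reproducible.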
SNP pathway membership was determined based on the DNA repair gene it was linked to and that gene’s membership in DNA damage repair and signaling pathways, as indicated by the KEGG Pathway Database (Supplementary Table 3). As a result, it was possible for a SNP to be a member of more than one pathway. The hierarchical modeling method (28, 29) used was performed in R (R Foundation for Statistical Computing, http://www.R-project.org/, Version 3.1.1, 2014). Briefly, hierarchical modeling was performed using the summary level data from the GAME-ON consortium. First-stage estimates of SNP association with each cancer site (OR, SE, and p-values < 0.05), were generated by adjusting for principal components, as applicable. This information was then entered into a multivariate regression framework, incorporating higher level information about the SNP (i.e. pathway membership as binary covariates) in order to improve the ranking of results. The effect size and association for each DNA repair pathway covariate was calculated for each cancer site. The SEs were estimated based on the folded-normal distribution (40).
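To convey the intuition behind using pathway membership as a binary covariate, the Python sketch below fits a much simpler second-stage model than the hierarchical model cited above (28, 29): a weighted least-squares regression of first-stage log odds ratios on a single pathway indicator plus an intercept. With one binary covariate, the fitted slope equals the difference between the inverse-variance-weighted mean effect of SNPs inside and outside the pathway. The function name is hypothetical, and the sketch ignores multiple pathway membership, residual second-stage variance, and the folded-normal SE estimation used in the study.

```python
import math


def pathway_effect(betas, ses, in_pathway):
    """Simplified second-stage estimate for one pathway indicator.

    betas      : first-stage log odds ratios, one per SNP
    ses        : first-stage standard errors of the log odds ratios
    in_pathway : booleans, True if the SNP maps to a gene in the pathway
    Returns (effect, SE, z), where effect is the weighted mean beta inside the
    pathway minus the weighted mean beta outside it.
    """
    w_in = w_out = 0.0
    sum_in = sum_out = 0.0
    for beta, se, member in zip(betas, ses, in_pathway):
        weight = 1.0 / se ** 2   # inverse-variance weight
        if member:
            w_in += weight
            sum_in += weight * beta
        else:
            w_out += weight
            sum_out += weight * beta
    if w_in == 0.0 or w_out == 0.0:
        raise ValueError("need at least one SNP inside and one outside the pathway")
    effect = sum_in / w_in - sum_out / w_out
    se = math.sqrt(1.0 / w_in + 1.0 / w_out)
    return effect, se, effect / se
```

Because the estimate pools information across every SNP in the pathway, many individually weak effects can produce a clear pathway-level signal even when no single SNP passes the Bonferroni threshold.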
Figure 1 illustrates the genomic distribution for all SNPs included in the analysis and the corresponding p-values for association with cancer risk across one or more cancer sites. Manhattan plots for each of the studies included in the meta-analysis were also generated (Supplemental Figure 1). After correction for multiple comparisons, 29 genomic markers reached statistical significance. Twenty-six of the 29 SNPs were within the RAD51B gene locus (14q24.1). Three of the 29 statistically significant SNPs were within the MSH5 gene locus (6p21.33). A single SNP, near the BRCA2 gene locus (13q13.1), reached borderline significance (p = 5.70 × 10−6). This SNP was at the edge of the defined gene locus window and was actually located within the FRY gene. While FRY has been previously associated with prostate cancer risk, it is not directly involved in DNA damage repair (41). The other 168 SNPs within the BRCA2 gene did not reach significance testing criteria.
The SNPs with the lowest p-value at each locus (RAD51B, MSH5, BRCA2) were then analyzed for pleiotropic association with cancer risk (Table 2). The RAD51B-associated marker, rs11844632, had an overall (pleiotropic) OR of 0.90 (95% CI: 0.88–0.93; p = 5.46 × 10−12) across multiple cancer sites. The highly significant inverse association was limited to breast cancer (p = 8.14 × 10−9), ER-negative breast cancer (p = 0.01), overall prostate cancer (p = 1.81 × 10−4), aggressive prostate cancer (p = 2.46 × 10−3), and colon cancer (p = 0.01). Associations with lung cancer and ovarian cancer were in the opposite direction of effect and not statistically significant. The MSH5-associated marker, rs3115672, had an overall (pleiotropic) OR of 1.18 (95% CI: 1.12–1.24; p = 2.53 × 10−8). The marker had a highly significant association with lung cancer (p = 3.99 × 10−11), and weaker associations with colon cancer (p = 0.051), serous ovarian cancer (p = 0.050), and lung adenocarcinoma (p = 0.03). The BRCA2-associated marker, rs56404467, was borderline significant, having an overall (pleiotropic) OR of 1.39 (95% CI: 1.21–1.61; p = 5.70 × 10−6), driven by an association with overall lung cancer (p = 2.14 × 10−7), colon cancer (p = 7.33 × 10−3), and a weaker association with lung adenocarcinoma (p = 0.01).
To examine whether genomic variations in DNA repair genes might have small, but consistent, effects across cancer sites, left undetected due to being sub-genome wide significant, Q-Q plots were generated using the SNP data from the DNA repair gene regions, for each cancer dataset (Figure 2). Breast, prostate, and lung (overall and the adenocarcinoma subtype) cancer each showed deviations in p-value distribution greater than would be expected by chance, suggesting small but consistent effects in DNA repair genes may exist. Analysis of the genomic inflation factor (λ) was also performed on each cancer site database (42). A standard allelic test for association was performed, based on the median of the χ2 distribution with d.f. = 1. The λ values produced a modest deviation from the expected value of 1, consistent with the Q-Q plots and also suggestive of an excess number of significant associations in some of the cancer sites. The λ values for each dataset are as follows: breast = 1.10, breast (ER-negative) = 0.96, colon = 0.98, lung = 1.02, lung (adenocarcinoma) = 1.04, ovarian = 1.02, ovarian (serous) = 1.09, ovarian (mucinous) = 1.02, prostate = 1.17, and prostate (aggressive) = 1.08.
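The genomic inflation factor can be computed from a list of p-values as the median observed χ2 statistic (1 d.f.) divided by the median expected under the null, approximately 0.455. The Python sketch below uses only the standard library; the function names are illustrative, and the conversion assumes two-sided p-values from a 1-d.f. test.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 d.f.: square of the 75th percentile of N(0, 1).
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2   # about 0.455


def chi2_from_p(p_value):
    """Convert a two-sided p-value (0 < p < 1) to a 1-d.f. chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(p_value / 2.0)   # lower-tail quantile; its sign is irrelevant once squared
    return z * z


def genomic_inflation(p_values):
    """Genomic inflation factor: median observed chi-square over the null median."""
    observed = [chi2_from_p(p) for p in p_values]
    return median(observed) / EXPECTED_MEDIAN_CHI2
```

A value near 1 indicates that the bulk of the test statistics follow the null distribution, whereas values above 1, such as those reported for breast and prostate cancer, indicate an excess of small p-values across the SNP set.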
In order to statistically model the sub-genome-wide-significant trends between DNA repair pathways and association with cancer risk, dimensional reduction of the GAME-ON dataset was performed via pathway analysis. Site-specific cancer associations with DNA repair pathways were evaluated using hierarchical modeling (Table 3). The analysis included 9,806 SNPs. Analysis of the homologous recombination (HR) DNA repair pathway revealed pleiotropic associations with colon cancer (p = 4.18 × 10−4) and ovarian cancer: overall (p = 1.39 × 10−6), the serous subtype (p = 1.65 × 10−6), and the mucinous subtype (p = 5.00 × 10−5). Mismatch repair (MMR) showed pleiotropic associations with prostate cancer: overall (p = 3.54 × 10−5) and the aggressive subtype (p = 2.76 × 10−3), and with lung cancer: overall (p = 4.86 × 10−4) and the adenocarcinoma subtype (p = 8.76 × 10−5). The nucleotide excision repair pathway also showed a strong association with breast cancer: overall (p = 7.54 × 10−5) and the ER-negative subtype (p = 1.42 × 10−3), and weaker associations with ovarian cancer (p = 8.69 × 10−3), overall lung cancer (p = 0.024) and colon cancer (p = 0.027). All other DNA repair pathways tested showed at least some weaker associations with one or more cancer subtypes (p < 0.05).
Hierarchical modeling’s identification of pleiotropic pathway effects in HR and MMR pathways is consistent with the results obtained from individual SNP testing. In particular, RAD51B and BRCA2 are members of the HR pathway and MSH5 is a member of the MMR pathway. In order to determine whether these three loci, or a small number of other highly significant individual loci, significantly influence the overall hierarchical modeling analysis, a sensitivity analysis was performed. In the first sensitivity analysis, the RAD51B, BRCA2, and MSH5 gene data were removed from the dataset and hierarchical modeling was repeated (Supplementary Table 4). In the second sensitivity analysis, any genes containing SNPs that had associations with p < 1 × 10−4, were removed from the dataset. This resulted in the removal of 6 genes (RAD51B, MSH5, BRCA2, DCLRE1B, SMEK1, RAD52) from the dataset prior to the hierarchical modeling procedure (Supplementary Table 5). Neither analysis appeared to reveal a significant change to the overall results, suggesting that a small number of highly significant loci were not driving the hierarchical modeling results. This suggests that the hierarchical modeling results were most likely a result of a large number of small effect sizes throughout the dataset.
DNA damage and repair are known to be critically important to carcinogenesis and rare mutations in critical DNA repair genes are known to be associated with unusually high cancer risk. However, previous GWAS of common genetic variants (MAF > 0.01) have only identified a handful of statistically significant loci known to function through their effects on DNA repair genes. It was hypothesized that this could be due to the inability of even large studies to detect weak effect sizes. This study tested this hypothesis through use of a large heterogeneous database and a flexible meta-analysis strategy, which represents an unprecedented increase in statistical power to detect associations among common variants of DNA repair genes. This analytical strategy was supplemented with a strategy of dimensional reduction of the dataset, through pathway analysis, to also detect evidence of trends of association between cancer risk and common variants that may escape common variant testing by not meeting the genome-wide significance testing criteria.
Our results indicated that the RAD51B locus was strongly associated with breast cancer and contained a weaker association with prostate cancer, although this did not achieve statistical significance. This locus has been previously associated with breast (43–45), prostate (18, 46), and mucinous ovarian cancer risk (24). Of the associated SNPs at RAD51B, two were previously reported in the literature, rs10483813 and rs17828907 (18, 43–45, 47, 48). No associations were detected for mucinous ovarian cancer at this locus, but this may be due to the relatively small number of mucinous ovarian cancer cases included in this analysis (n = 306).
At the MSH5 locus, although rs3131379 was previously found to be associated with lung cancer (27, 37, 49, 50), this SNP was not included in our analysis (because it was not present in all GAME-ON databases), and rs3115672 was identified as the most significant SNP at this locus instead. It should be noted that the pairwise LD between rs3131379 and rs3115672 is very high (R2 > 0.99). Our study strongly associated this locus with lung cancer, with only weaker, non-significant associations detected for colon cancer, lung adenocarcinoma and mucinous ovarian cancer. This gene has been previously associated with lung cancer (27, 37, 39, 50) and non-Hodgkin’s lymphoma risk (OR = 1.16, p = 0.03) (51). Interestingly, this locus has also been associated with lupus erythematosus (52–54), and individuals with this disease are known to be predisposed to non-Hodgkin’s lymphoma and lung cancer, while having reduced rates of other solid cancers (55).
Our results identified a SNP at genetic locus 13q13.1, near the BRCA2 gene. While mutations in BRCA2 are known to be associated with multiple cancer types (10, 56, 57), this SNP has not been previously identified as a common variant related to cancer susceptibility. The SNP showed a strong association with lung cancer. The SNP was located within the analytical window of the BRCA2 gene (± 50 kb) but was within the FRY gene region, which is not a canonical DNA repair gene. Thus, this finding should be interpreted with more caution as supportive evidence of the association of common variants of DNA repair genes with cancer. However, the possibility that this SNP could affect BRCA2 gene function cannot be ruled out. Furthermore, it represents a potentially novel finding that suggests the need for further investigation. This SNP, rs56404467, is in a non-coding exon and likely does not affect the activity or function of the BRCA2 protein but may alter the rate of BRCA2 translation. This contrasts with the smaller and non-functional BRCA2 protein resulting from a mutation and could explain the different pattern in cancer associations.
A previous analysis of the BRCA2 gene discovered a locus associated with squamous lung cancer, but this locus was not associated with lung adenocarcinoma, in contrast to our own findings (58). However, secondary analysis identified an additional genetic feature which may explain this discrepancy. A different, less significant locus was detected within the BRCA2 gene, but it did not meet the significance criterion of p < 5.09 × 10−6 (rs4942486, p = 0.003). We found that this less significant locus was not strongly associated with adenocarcinoma but was associated with overall lung cancer, as previously reported (58). Despite being within the same analytical window, the FRY and BRCA2 loci were over 100,000 bases apart, located within different genes, and did not appear to be in high LD. Therefore, our results support the existence of two separate genetic association loci around the BRCA2 gene.
Overall, individual variant testing failed to find robust evidence for an association between common variants in DNA repair genes and cancer susceptibility. Few loci were identified and all genes had been previously associated with cancer susceptibility. Furthermore, evidence for pleiotropy among common variants in these genetic regions did not receive strong statistical support. However, analysis of Q-Q plots from specific cancer sites, using SNP data from DNA repair gene regions, suggested that consistent associations for common variants in DNA repair genes may exist but are likely difficult to detect due to their small effect sizes. In order to examine this possibility, pathway analysis was used as a tool to reduce the dimensionality of the dataset.
Hierarchical modeling provided statistical evidence that common variants of DNA repair genes are likely associated with cancer susceptibility. Homologous recombination, mismatch repair, and nucleotide excision repair showed strong statistical associations with cancer susceptibility, and for homologous recombination and mismatch repair, this association was present across multiple cancer sites. Sensitivity analysis suggested that these results were not due to the contribution of a few, highly significant loci, but through the combination of small, individual SNP effects throughout the entire dataset.
A limitation of our analyses is due to the availability of only aggregate summary-level data. Thus, we were unable to evaluate associations with non-aggressive prostate or ER-positive breast cancers. Lack of individual level data also made it difficult to enforce a consistent definition of aggressive prostate cancer. Despite this, our findings support further exploration of associations with DNA repair genes in these subgroups.
The results from pathway analysis and individual loci testing clarify the scientific model of the association of common variants in DNA repair genes with cancer risk. Although rare variants in these genes are known to be strongly linked to cancer incidence, very few individual loci were detected in our analysis, even when using a large database and a powerful analytical approach. Robust statistical significance was only detected under pathway analysis, and was likely due to the contribution of small effect sizes from multiple genes in DNA repair pathways. These data suggest that common variants of DNA repair genes are associated with cancer risk, but that the associations tend to be weak. These results and their interpretation seem particularly plausible, given the epidemiological observation that mutations at some DNA repair genes have profound deleterious effects (Fanconi anemia, xeroderma pigmentosum, ataxia telangiectasia, etc.). Thus, there is a strong theoretical justification for why common variant effects on cancer predisposition in these genes may be difficult to detect, as they likely face strong, negative selection pressure. This observation provides further rationale for conducting future targeted sequencing to explore the role that rare variants play in determining cancer risk.
The scientific development and funding for this project were supported by the following: the Genetic Associations and Mechanisms in Oncology (GAME-ON): a NCI Cancer Post-GWAS Initiative, U19CA148112 (TA Sellers, JM Schildkraut, P Pharoah), U19CA148127 (CI Amos), U19CA148107 (SB Gruber), U19CA148065 (DJ Hunter, P Kraft, DF Easton), U19CA148537 (BE Henderson), National Cancer Institute grants R01CA176016 (JM Schildkraut), R01CA088164 (JS Witte), R25CA126938 (JM Schildkraut), P30 CA023108 (CI Amos), U01CA127298 (JS Witte), National Institute of General Medical Science grant P20GM103534 (CI Amos), Cancer Research UK grants C490/A16561 (P Pharoah), C490/A10124 (P Pharoah), C490/A10119 (P Pharoah), C1287/A16563 (DF Easton).
We would like to thank Dr. Nilanjan Chatterjee (NIH) for his helpful advice and comments on our implementation of the ASSET software.
Conflict of Interest Statement
Dr. Ros Eeles has research support from Janssen and also received an honorarium from Speakers Bureau. Dr. Judy Garber is a consultant for Pfizer and Sequenom and has a commercial research grant from Myriad Genetic Labs. Dr. Garber also has immediate family members who have a commercial research grant from Novartis and who are consultants for Pfizer and SV Life Sciences. All other authors have no conflicts of interest to report.