Nucleotide excision repair (NER) is a vital response to DNA damage, including damage from tobacco exposure. Single nucleotide polymorphisms (SNPs) in the NER pathway may encode alterations that affect DNA repair function and therefore influence risk for pancreatic cancer development.
A clinic-based case-control study in non-Hispanic white persons compared 1,143 patients with pancreatic adenocarcinoma with 1,097 healthy controls. Twenty-seven genes directly and indirectly involved in the NER pathway were identified, and 236 tag-SNPs were selected from 26 of these (one had no SNPs identified). Association studies were performed at the gene level by principal components analysis, while recursive partitioning analysis was used to identify potential gene-gene and gene-environment interactions within the pathway. At the individual SNP level, adjusted additive, dominant, and recessive models were investigated, and gene-environment interactions were also assessed.
Gene level analyses showed an association of MMS19L genotype (chromosome 10q24.1) with altered pancreatic cancer risk (p=0.023). Haplotype analysis of MMS19L also showed a significant association (p=0.0132). Analyses of 7 individual SNPs in this gene showed both protective and risk associations for minor alleles, broadly distributed across patient subgroups defined by smoking status, sex, and age.
In a candidate-pathway SNP association study, common variation in an NER gene, MMS19L, was associated with risk for pancreatic cancer.
DNA repair is a key cellular response to DNA-damaging stimuli, which can otherwise lead to progression to cancer. It has also become an area of intense research in the study of genetic predisposition to pancreatic cancer, because mutations in genes involved with DNA repair, such as BRCA1 and BRCA2, are known to increase risk for pancreatic adenocarcinoma.(1, 2) However, mutations in high-penetrance tumor suppressor genes explain only a small proportion (<5%) of cases of pancreatic cancer.(3) In an effort to further characterize genetic risk for pancreatic cancer, the role of more common genetic variation (i.e., polymorphisms) has been increasingly studied.
Nucleotide-excision repair (NER) represents a pathway involved in detection and repair of DNA base damage such as pyrimidine dimers and bulky adducts, most notably those caused by environmental exposures such as ultraviolet (UV) light and chemical exposures (e.g., carcinogens)(4). High penetrance defects in this pathway in the XPA, ERCC3/XPB, XPC, ERCC2/XPD, XPE, and ERCC5 genes have been implicated in the recessive clinical disorder xeroderma pigmentosum(5, 6), resulting in up to 1,000 to 2,000-fold increased risk for cutaneous malignancy as a result of UV damage in skin cells. Affected persons are also at increased risk for cancers of the brain and oral cavity at a young age.(7) Cockayne syndrome (ERCC8/CKN1/CSA, ERCC6/CSB), an autosomal recessive severe developmental disorder with photosensitivity, is not known to confer increased cancer risk, though affected individuals often die in childhood of infectious causes, so lifelong cancer risk is unknown.(8)
The NER pathway consists of several primary steps that locate the damage, unwind the DNA duplex around the site, place incisions in the DNA upstream and downstream of the damage, and repair the gap.(9, 10) Specifically, the protein XPC, bound to RAD23B, recognizes and binds to the damage. Next, several other proteins (RPA, XPA, GTF2H, MMS19L, and XPG) bind in a complex that unwinds the DNA helix. The complex is then bound by ERCC1 and ERCC4/XPF, which excise a 27–30 nucleotide fragment around the area of damage. DNA polymerases then repair the defect.(4)
The importance of this pathway in carcinogenesis is suggested by prior associations of polymorphic variants with risk for certain cancers, especially tobacco-related cancers such as head/neck and lung cancer.(11) Interactions between NER polymorphisms and smoking have also been reported.(12, 13) One potential mechanism for this is a reported direct inhibition of NER by tobacco smoke.(14)
NER gene polymorphisms and haplotypes have been shown to correlate with altered DNA repair capacity in some genes such as ERCC1 and ERCC2/XPD(15), but whether variation in the NER pathway confers risk for pancreatic cancer has not been definitively established; the studies reported to date are largely candidate SNP studies with relatively small sample sizes.(16–19) Because candidate SNP studies inherently miss substantial variation in genes, we chose a systematic tag-SNP approach to the NER pathway. The intent of such an approach is to use existing knowledge of linkage disequilibrium from HapMap(20) to comprehensively assess common variation in all identified genes in the pathway of interest. Using this approach, we performed a case-control analysis utilizing the Mayo Clinic Biospecimen Resource for Pancreas Research.
This study was approved by the Mayo Clinic Institutional Review Board. Written, informed consent was obtained from each subject for participation in this study and provision of a blood sample. From October 2000 through March 2007, patients with pancreatic adenocarcinoma (ICD-O site codes C25.0-C25.3, C25.7, C25.9 and morphology codes 8140/3, 8140/6) were consecutively recruited to a registry (ultra-rapid recruitment) during their visit to Mayo Clinic (Rochester, Minnesota or Jacksonville, Florida). Ultra-rapid recruitment is defined as recruitment at the time of the clinic visit for the initial workup for pancreatic cancer. Patients were identified by review of appointment calendars and pathology records, then approached by a study coordinator during a clinic visit or, if missed, contacted by mail. Of these, 71% consented to participate in the study. All records were reviewed and 1,949 were confirmed as pancreatic adenocarcinoma by a physician specialist (R.M.) in gastrointestinal medical oncology. Invasive intraductal papillary mucinous neoplasms, when identified by surgical pathology or clinical diagnosis, were excluded (n=42). Eighty-seven percent of consenting participants provided blood samples for DNA analysis and 64% self-completed risk factor questionnaires specifically for pancreatic cancer. For those not completing questionnaires, data on clinical variables (smoking, body mass index, family history, race, ethnicity) were extracted from electronic and paper clinical records and death certificates by a single physician (R.M.). This data extraction step was assessed for intermethod reliability with 25 cases and 25 controls who completed questionnaires. For this study, 1,203 patients with pancreatic adenocarcinoma of all stages were initially included, representing 62% of all pancreatic adenocarcinoma patients identified at Mayo Clinic during this time period.
Of these, 1,143 (95%) were non-Hispanic whites, so in order to prevent population stratification, analyses were limited to this demographic group. Ninety-six percent of cases had histological confirmation of their diagnosis, with the remainder meeting the following criteria: having a pancreatic mass visualized on imaging and at least two of the following: elevated CA19-9, jaundice, weight loss, or abdominal pain. Upon enrollment, a risk-factor and family history questionnaire was completed by the patient. Peripheral blood was collected for DNA analysis.
From May 2004 to February 2007, 1,511 control patients were recruited from the General Internal Medicine clinic at Mayo Clinic (Rochester) at the time of a general physical exam, out of a total of 2,707 approached (56%). We attempted to frequency match controls to cases on sex, residence (Olmsted County, Minnesota; three-state area [MN, WI, IA]; five-state area [MN, WI, IA, SD, ND]; or outside of area), age at time of recruitment (in 5-year increments), and race/ethnicity. Controls with prior diagnoses of cancer except non-melanoma skin cancer were excluded. Upon enrollment, controls completed a risk-factor and family history questionnaire equivalent to that administered to cases. Peripheral blood was collected for DNA analysis. For this study, 1,097 non-Hispanic white controls were randomly selected from those controls providing blood samples and completing questionnaires, using strata delineating age (in 5-year increments), sex, and location of residence to best approximate cases on a frequency matching basis.
Study participants provided information about age at initiation and cessation of smoking and the number of packs smoked per day. If no smoking data were available from the self-completed questionnaire, smoking information was extracted from the participant’s medical record (data were extracted for 24% of controls and 23% of cases). Smoking data were available for 99.7% of study participants. Total number of pack-years was calculated by multiplying the typical number of packs smoked daily by the number of years smoked. Pack-years were used as a measure of smoking exposure. Subjects were categorized as “never smokers” or “ever smokers” (> 100 cigarettes in their lifetime). Ever smokers were further stratified by number of pack-years (≤20 pack-years, >20–40 pack-years, and >40 pack-years).
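The pack-year calculation and stratification described above can be sketched as follows. This is a minimal illustration only; the function names and category labels are hypothetical and are not taken from the study's analysis code.

```python
def pack_years(packs_per_day, years_smoked):
    """Pack-years = typical packs smoked daily x number of years smoked."""
    return packs_per_day * years_smoked


def smoking_category(ever_smoker, packs_per_day=0.0, years_smoked=0.0):
    """Assign the exposure strata described in the text.

    Labels are illustrative; the cut points (<=20, >20-40, >40 pack-years)
    are the ones stated in the text.
    """
    if not ever_smoker:
        return "never"
    py = pack_years(packs_per_day, years_smoked)
    if py <= 20:
        return "<=20"
    if py <= 40:
        return ">20-40"
    return ">40"


# Example: one pack per day for 30 years is 30 pack-years
example = smoking_category(True, packs_per_day=1.0, years_smoked=30)
```
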
Genes encoding proteins involved with the NER pathway were selected from review of the literature.(21) In order to comprehensively assess common genetic variation in the genes selected, a linkage disequilibrium (LD) based tag-SNP strategy was employed. To select LD tag SNPs for the genes, genotype data from white populations were compiled from 3 different sources: one genome-wide genotyping project, HapMap (http://www.hapmap.org), and two resequencing projects, SeattleSNPs (http://pga.mbt.washington.edu/) and NIEHS SNPs (http://egp.gs.washington.edu/). Gene coordinates were calculated based on NCBI Build 36. For all but 3 genes, coordinates were calculated from the UCSC Genome Browser knownGene and knownToLocusLink tables. The coordinates for the other 3 genes were calculated from the gene2refseq file from the NCBI FTP site. We ran ldSelect software (Version 1.0, Seattle, Washington)(22) for SNP selection on each gene, including 5 kb upstream and downstream, using an r² threshold of 0.9 and minor allele frequency (MAF) > 0.05. We selected 3 tag SNPs for bins of size 30 or more, 2 tag SNPs for bins of size 10 or more, and 1 tag SNP otherwise. For genes with multiple sources, the optimal source of SNPs for each gene was chosen based on the greatest number of LD bins and the greatest number of SNPs in each LD bin. All known genes directly and indirectly involved in the NER pathway were identified (N=27), and 236 SNPs were selected. (No tag-SNPs were identified in GTF2H2.)
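The rule for how many tag SNPs to take from each LD bin can be written compactly. The sketch below assumes bin sizes are already available (for example, from ldSelect output); the function names and the example bin sizes are hypothetical.

```python
def tags_per_bin(bin_size):
    """Number of tag SNPs selected from an LD bin of the given size:
    3 for bins of 30 or more SNPs, 2 for bins of 10 or more, 1 otherwise."""
    if bin_size < 1:
        raise ValueError("an LD bin must contain at least one SNP")
    if bin_size >= 30:
        return 3
    if bin_size >= 10:
        return 2
    return 1


def total_tags(bin_sizes):
    """Total tag SNPs for a gene, given the sizes of its LD bins."""
    return sum(tags_per_bin(size) for size in bin_sizes)


# Hypothetical gene with four LD bins of sizes 35, 12, 4, and 1
n_tags = total_tags([35, 12, 4, 1])  # 3 + 2 + 1 + 1
```
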
DNA samples were analyzed in the Mayo Clinic Genotyping Shared Resource using an Illumina GoldenGate® Custom 768-plex OPA panel and the standard protocol. We selected SNPs with an Illumina design score of >0.4. BeadStudio II software was used to analyze the data and prepare reports. Cases and controls were intermixed on plates. Genotyping was successful for 1,189 cases and 1,126 controls, with a 99.7% average loci call rate. Locus success rate was 95.1% and sample success rate was 99.6%. Preset rules for dropping SNPs were poorly defined clusters, replicate or Mendelian errors, call rate < 90%, and all samples heterozygous.
Positive and negative controls were run in parallel to ensure there was no contamination of the DNA. Other quality control measures included the addition of 56 CEPH family trios to the genotyping plates to test for non-Mendelian inheritance, with 100% reproducibility and no Mendelian errors. Ten samples had low GenCall scores (<0.4)(23) and were excluded from the analysis. All genotype clusters were manually inspected by a specialist scientist (JC); SNPs with atypical clustering were flagged and excluded (N=3 SNPs: 1 in ERCC5 and 2 in RPA3). Call rates were high overall, at 99.6% for samples and 95.1% for loci. Forty-seven pairs were used for duplicate concordance, with a 99.9% concordance rate. Twelve SNPs failed to amplify or were discarded due to poor quality, and 91 samples had a call rate of 0.
Risk factor questionnaires (RFQs) were completed by 100% of controls and 71% of cases. For cases missing RFQs, clinical data were extracted from available medical records as described above. To assess intermethod reliability between these two methods, we used the Kappa coefficient to measure the inter-rater agreement.(24)
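As an illustration of the agreement measure, the following sketch computes an unweighted Cohen's kappa for two sets of categorical ratings. It assumes the unweighted form of the statistic, which the text does not specify; the function name and the example data are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length lists of categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of agreement
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from the marginal frequencies
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_exp = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_exp == 1.0:
        return 1.0  # both sources use one identical category throughout
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical example: questionnaire vs. medical record smoking status
questionnaire = ["ever", "ever", "never", "never"]
medical_record = ["ever", "never", "never", "ever"]
kappa = cohen_kappa(questionnaire, medical_record)
```

In this toy example the two sources agree on half of the subjects, which is exactly what chance would predict from the marginals, so kappa is 0.
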
Before analysis of disease-marker associations was performed, we used χ² tests to determine whether the genotype distributions for each SNP were consistent with Hardy-Weinberg equilibrium under Mendelian biallelic expectations.
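A Hardy-Weinberg check of this kind can be sketched as a one-degree-of-freedom χ² goodness-of-fit test on genotype counts. The code below is an illustrative stand-alone version, not the software used in the study; the function name is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium for a biallelic SNP.

    Arguments are observed genotype counts (AA, AB, BB).
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts in exact HWE proportions give a statistic of 0
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
```
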
For each polymorphism, we defined the major allele as the most common allele in controls, and the minor allele as the less common allele in controls. In order to examine the association between each SNP and disease, we considered multiple unadjusted models (allelic, Cochran-Armitage trend, genotypic (2 df), additive, codominant, dominant, and recessive) among cases and controls using a combination of PLINK v0.99r (http://pngu.mgh.harvard.edu/purcell/plink/)(25) and SAS (SAS software, version 9.1.2, Cary, North Carolina). Multivariable logistic analyses adjusted for age, sex, smoking status (ever/never), family history of pancreas cancer in a first degree relative (yes/no), body mass index (BMI), and personal history of diabetes (yes/no) were then performed under the additive, dominant, and recessive genetic models as well.
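The three genetic models used in the adjusted analyses differ only in how the minor-allele count is coded before it enters the regression. A minimal sketch of that coding, with a hypothetical function name, is:

```python
def encode_genotype(minor_allele_count, model):
    """Code a genotype (0, 1, or 2 copies of the minor allele) for regression.

    additive:  0, 1, 2  (per-allele effect)
    dominant:  1 if at least one minor allele is carried, else 0
    recessive: 1 only if two minor alleles are carried, else 0
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1, or 2")
    if model == "additive":
        return minor_allele_count
    if model == "dominant":
        return int(minor_allele_count >= 1)
    if model == "recessive":
        return int(minor_allele_count == 2)
    raise ValueError("unknown model: " + model)


# A heterozygote contributes 1 under additive and dominant coding, 0 under recessive
het_codes = [encode_genotype(1, m) for m in ("additive", "dominant", "recessive")]
```
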
A principal components analysis(26) approach was used to test for an overall association between disease and the multiple SNPs genotyped within each gene. The number of principal components needed for each gene was determined using a 90% explained variance criterion. Once the necessary principal components were determined, univariate and multivariable logistic regression models were considered to assess the significance of each gene.
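The 90% explained variance criterion amounts to keeping the smallest number of leading components whose eigenvalues sum to at least 90% of the total. The sketch below assumes the eigenvalues of the SNP correlation (or covariance) matrix are already computed; the function name and the eigenvalues in the example are hypothetical and are not taken from the study.

```python
def n_components_for_variance(eigenvalues, threshold=0.90):
    """Smallest number of leading principal components whose cumulative
    share of total variance reaches the threshold."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues for a gene with 7 correlated SNPs
k = n_components_for_variance([4.0, 2.0, 0.5, 0.2, 0.1, 0.1, 0.1])
```

The retained component scores would then be entered as predictors in the logistic regression models in place of the individual SNPs.
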
Haplotype-disease association was evaluated for each gene using Haplo.score(27), which accounted for ambiguous linkage phase. This method uses an expectation–maximization (EM) algorithm to infer haplotypes, accounts for ambiguity in haplotype assignment when comparing cases to controls, and allows adjustment for non-genetic covariates, which are often critical when analyzing genetically complex phenotypes. The method also provides global tests for association, as well as haplotype-specific tests, which give a meaningful advantage in attempting to understand the roles of different haplotypes. Haplotype ORs and 95% CIs were calculated using Haplo.glm(28). Haplotype analyses were performed using the Haplo.score and Haplo.glm functions included in HaploStats package version 1.2.1 in S-plus (Version 8.0.1).
Recursive partitioning (RPART) models(29), which use binary trees to recursively partition the dataset into 2 subsets that are the most homogeneous with respect to the endpoint of interest (case/control status), were implemented to help identify potential interactions between SNPs (gene-gene) and environmental variables (gene-environment).(30) These classification trees were built using all SNPs as well as the clinical variables used as adjusters in the multivariate analysis. After the first factor (and splitting point) has been chosen to maximize the homogeneity, each succeeding factor enters the tree conditional upon what has already entered and therefore represents an interaction (e.g., the second factor into the model would represent an interaction between the first factor and the second factor). Trees were grown using the default settings of the rpart library in S-plus (Version 8.0.1). The final trees were determined by pruning the tree to obtain a parsimonious model, using the cross-validation relative error rate and the 1-SE rule(29) as a guide to determine the best number of splits. The terminal nodes remaining after this pruning would define “subgroups” of interest, while the splits resulting in those nodes would define potential interactions.
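The 1-SE pruning rule can be illustrated with a small sketch: among candidate tree sizes, choose the smallest tree whose cross-validated error is within one standard error of the minimum. The input layout below (number of splits, cross-validated relative error, and its standard error) loosely mirrors the kind of complexity table rpart reports, but the function name and the numbers are hypothetical.

```python
def one_se_rule(cv_table):
    """Select the number of splits by the 1-SE rule.

    cv_table: list of (n_splits, cv_error, cv_std_error) tuples,
              one per candidate subtree.
    Returns the smallest n_splits whose cv_error does not exceed
    (minimum cv_error + standard error at that minimum).
    """
    if not cv_table:
        raise ValueError("cv_table is empty")
    best = min(cv_table, key=lambda row: row[1])
    cutoff = best[1] + best[2]
    eligible = [row for row in cv_table if row[1] <= cutoff]
    return min(row[0] for row in eligible)


# Hypothetical cross-validation results for subtrees with 0, 1, 2, and 4 splits
table = [(0, 1.00, 0.02), (1, 0.85, 0.03), (2, 0.84, 0.03), (4, 0.86, 0.03)]
chosen_splits = one_se_rule(table)
```

Here the minimum error (0.84) occurs at 2 splits, but the 1-split tree is within one standard error of it, so the simpler tree is preferred.
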
Cases and controls (Table 1) were similar in age, but differed in BMI, sex (despite attempted frequency matching), percent of ever-smokers, percent reporting a first degree relative with pancreatic cancer, and diabetes (defined as diagnosed > 2 yrs prior to cancer diagnosis for cases or participation for controls). When we validated medical record data against self-reported questionnaires, kappa values for each variable for cases and controls, respectively, were: ever/never smoker (0.92, 0.75), pack-years (0.35, 0.64), family history of pancreatic cancer (1.0, 1.0), race (1.0, 1.0). These results showed strong agreement between the two data sources for most variables, with weaker agreement for pack-years.
The principal components analysis approach served as an omnibus test for association between each candidate gene and disease. Adjusted and unadjusted principal components analyses were performed for each gene in the NER pathway to determine an overall gene level contribution to risk for pancreatic cancer. MMS19L was the only gene that appeared to be significantly associated, both in unadjusted analyses (p = 0.0058) and after adjusting (p = 0.0230) for age, sex, smoking status, BMI, diabetes, and family history of pancreatic cancer in a first degree relative. Unadjusted and adjusted results for each of the genes are shown in Table 2. Based on our population, we determined that three independent principal components were sufficient to explain over 90% of the variability measured by the 7 correlated SNPs of MMS19L. Unfortunately, this approach does not identify specific disease-causing variants, and therefore additional analyses and/or follow-up studies would be necessary. Individual SNP level contributions to the eigenvectors and eigenvalue information for the first 3 principal components can be found in Supplemental Tables 1 and 2, respectively.
Logistic regression analyses at the single SNP level for each gene were also performed using additive, dominant, and recessive models adjusted for age, sex, ever/never smoking, 1st degree family history of pancreatic cancer, body-mass index, and diabetes. Overall odds ratios and subgroup analyses for MMS19L SNPs (total of 7) are shown in Table 3. Protective associations were observed in additive, dominant, and recessive models for minor alleles at rs872106 and rs2211243, while an increased risk was observed for rs2236575. The direction of risk effect for each SNP is largely consistent across demographic groups such as sex, location of residence, and smoking status, suggesting an effect independent of these factors (Table 4), though associations were more pronounced for females. Associations among smokers did not show a dose-dependent effect by pack-year categories, with risk changes more pronounced among ever than never smokers, though smokers with the least and most pack-year exposure showed the highest effect and moderate pack-year smokers showed the least. No effect was detected in current smokers, but numbers of current smokers in both cases and controls were small. The strongest associations for all three SNPs were seen among those former smokers who had quit at least 15 years prior to diagnosis/enrollment. The associations in the heaviest smokers were roughly comparable to those seen in nonsmokers.
Table 5 displays the results of the haplotype analysis for MMS19L. Of all possible combinations, seven haplotypes constituted 99% of haplotypes identified in controls, and were designated as A through F. We observed an overall effect on risk for pancreatic cancer (global simulation p value = 0.012). Two haplotypes (labeled in decreasing order of frequency in controls) were associated with statistically significant decreases in risk compared to the most common haplotype A (B, adjusted OR 0.85, 95% CI 0.73–0.99 and E, adjusted OR 0.65, 95% CI 0.47–0.90). Although these two haplotypes differed at rs872106, they carry the same alleles at rs2211243 and rs2236575.
Recursive partitioning analysis was also performed as an exploratory method to assess SNP-SNP associations within the pathway and SNP-environment interactions, both overall and within the following subgroups: <age 60, ever/never smokers, BMI >30, self-reported diabetes Y/N, and self-reported diabetes (Y/N) at least 2 years prior to either cancer diagnosis or enrollment as a control. After pruning the final trees using cross-validation error rate and the 1-SE rule, diabetes provided the only split in the overall analysis (342/1143 cases vs 89/1097 controls). Ensuing splits that did not remain after pruning were ever/never smoking among nondiabetics, and then age < or > 63.5 among smoking nondiabetics. No significant SNP-SNP or SNP-environment interactions were observed based on this analysis. In the subgroups, smoking (ever/never) provided a split among subjects < age 60 at cancer diagnosis or enrollment as a control (215/329 cases vs 120/297 controls) after pruning; among nondiabetics only, splits were provided by smoking (455/1008 cases vs 475/801 controls) and by age < or > 63.5 years among ever-smokers (166/455 cases were < 63.5 vs 238/475 control ever smokers).
We previously reported an inverse association of the ERCC4/XPF SNP R415Q (rs1800067) with pancreatic cancer, though this was attributed to chance given the low frequency of minor alleles.(16) This prior study used a different control group and was of smaller sample size. The reported effect was not seen in this current study (adjusted OR 0.92, 95% CI 0.72–1.17).
Nucleotide excision repair (NER) is a complex pathway integral to repair of exogenous damage to DNA from a variety of sources. Small variations in this pathway may have an impact on DNA repair capacity and, over time, could heighten risk for malignancy. The effect of interactions of these variations among the many genes involved in NER is largely unknown.
This study represents an analysis of common polymorphisms in the complete NER pathway and associated genes with risk for pancreatic cancer. Due to the explosion of high-throughput technology in genetic analysis, large scale analyses are now possible for genetic epidemiology studies. Increasingly common among these are genome-wide association studies (GWAS), which are agnostic, without the need for choosing candidate genes or pathways. These can be costly, and often are only done on a small subset of the sample in a staged approach, so only one question can be addressed (usually overall adjusted risk using an additive model) in the second stage. An alternative is the candidate pathway approach, which is based on prior suspicion of association and thus follows a classic hypothesis-testing approach. In these studies, tag-SNPs are chosen in every known gene in the pathway in an attempt to include most common sequence variation in the identified genes, through the assumption of linkage disequilibrium. Variations may have a direct effect on gene function, but more likely are linked to potential causal variants. This approach is limited by our knowledge of the genes involved in pathways and their interactions, and will miss less common variation (such as deleterious mutations).
In order to screen for overall gene effects, we performed gene-level associations using a principal components analysis with each SNP of a gene included in the analysis, adjusted for important covariates.
Our study has implicated MMS19L (on 10q24.1), a human homolog of MMS19, a gene first noted in Saccharomyces cerevisiae to be involved in NER and RNA transcription, with separate domains required for each process.(31,32) MMS19L has not been well characterized in humans, but is believed to play a similar role in human NER, with several regions highly conserved and alternate splicing preserved across species.(33) The protein binds to the GTF2H complex via ERCC2 and ERCC3, though its exact function is unclear.(34) Analysis of MMS19L variants with cancer risk has only been reported in one study of lung cancer, with no alteration of outcome for one non-synonymous SNP.(35)
In addition to the gene level analyses by principal components analysis, we also performed individual SNP, subgroup, haplotype, and interaction analyses within the pathway. As noted above, three SNPs in MMS19L appeared to associate with altered risk for pancreatic cancer. The association appeared to be strongest among women, ever smokers, former smokers quitting > 15 years prior, and those with lower BMI. However, confidence intervals for these subgroups overlap with others, so these distinctions are considered exploratory.
In order to avoid missing possible associations of SNPs in genes not detected by the principal component approach, individual analyses were performed for all SNPs in the pathway. Because many of these will be associated simply by chance, replication will be required to confirm our findings.
In the pathway interaction analyses undertaken using recursive partitioning (RPART), no significant associations were found, though we cannot rule out interactions. Pathway analysis is limited by many factors, including unknown biological function of variants, lack of ability to separate chance findings from true differences, and lack of consensus in the research community on how best to assess interactions. A potential limitation of RPART is that, due to binary splitting, subgroups are created with rapidly diminishing numbers of cases and controls. Thus, it may not detect more complex associations due to a lack of power in the smaller groups. However, an advantage of RPART is that it is agnostic, and does not simply constitute a compilation of positive findings, many of which could be false positives.
Perhaps more important than our findings with MMS19L, there does not appear to be a large effect of NER variation on pancreatic cancer risk overall. The low number of positive associations, when many are likely due to chance, suggests that perhaps this pathway is less important in pancreatic adenocarcinoma carcinogenesis. Replication of our findings, both positive and negative, in other study populations will be key to defining the role of polymorphisms in this pathway in pancreatic cancer risk.
Limitations of this study include genotyping failure of 5% of our samples, which could affect power and results, but is unlikely to introduce a systematic bias. As this is a clinic-based case-control study, the choice of controls is always problematic, since no control group perfectly matches the patient population. Indeed, patients seen at a referral center are likely younger, healthier, and at an earlier stage than those in the general population, and they must survive long enough to be seen. We attempted to minimize this with recruitment at the time of initial clinic appointment. In addition, using healthy patients seen in the General Internal Medicine Clinic as controls draws from a similar referral population at our institution, and the odds ratios seen for subjects from local and nonlocal locations of primary residence are consistent, at least for the MMS19L SNPs (Table 4). We also did not correct for multiple comparisons in our analyses, as we view these findings as exploratory and not conclusive. Methods such as the Bonferroni method can be overly conservative in genetic analyses due to linkage disequilibrium.(36) The field has not yet reached a consensus on the correct adjustments needed, if any, aside from future replication, which we believe would represent the most important method of confirming that our findings did not occur by chance.
Further studies to confirm the associations and identify the functional genetic variants in MMS19L responsible for the association are needed before these findings can be included in risk modeling for pancreatic cancer.
In a tag-SNP analysis of the NER pathway and its associated genes, common variation in MMS19L is associated with altered risk for pancreatic cancer.
We appreciate the efforts of Martha Matsumoto (data analysis), Traci Hammer, Cindy Chan, and Jodie Cogswell (study coordinators, patient recruitment).
Sources of Support:
National Cancer Institute CA K07 116303 (R.M.), P50 CA 102701 (G.P.)
Conflicts of Interest: None