While lung cancer is largely caused by tobacco smoking, inherited genetic factors play a role in its etiology. Genome-wide association studies (GWAS) in Europeans have robustly demonstrated only three polymorphic variations influencing lung cancer risk. Tumor heterogeneity may have hampered the detection of association signal when all lung cancer subtypes were analyzed together. In a GWAS of 5,355 European smoking lung cancer cases and 4,344 smoking controls, we conducted a pathway-based analysis in lung cancer histologic subtypes with 19,082 SNPs mapping to 917 genes in the HuGE-defined “inflammation” pathway. We identified a susceptibility locus for squamous cell lung carcinoma (SQ) at 12p13.33 (RAD52, rs6489769), and replicated the association in three independent samples totaling 3,359 SQ cases and 9,100 controls (odds ratio=1.20, Pcombined=2.3×10−8).
The combination of pathway-based approaches and information on disease specific subtypes can improve the identification of cancer susceptibility loci in heterogeneous diseases.
Lung cancer is a major cause of cancer death worldwide causing over 1 million deaths each year (1). The major histological classification separates small cell lung cancers (SC) from non-small cell lung cancers, the latter mostly comprised of adenocarcinoma (AD) and squamous cell (SQ) tumors. These lung cancer histologies have diverse molecular characteristics that reflect differences in carcinogenesis, etiology and treatment (2, 3).
While lung cancer is largely caused by tobacco smoking, studies have also implicated inherited genetic factors in disease etiology. Genome-wide association studies (GWAS) in Europeans have consistently identified three polymorphic variations at 15q25.1 (CHRNA5-CHRNA3-CHRNB4), 5p15.33 (TERT-CLPTM1L) and 6p21.33 (BAT3-MSH5) influencing lung cancer risk (4–8). Interestingly, one single nucleotide polymorphism (SNP), rs2736100, which localizes to the TERT gene, was distinctly associated with adenocarcinoma risk (9), suggesting that additional searches for histology-specific associations are likely to prove informative.
It is estimated that perhaps as much as 25% of all cancers are associated with chronic infection and inflammation (10). Because tobacco smoking can initiate or sustain chronic inflammation (11), often in concert with altered DNA repair and inflammatory response (12), we explored the impact of common genetic variation in inflammatory genes on lung cancer risk in smokers using a GWAS from the National Cancer Institute (NCI) (9). The study included 5,355 smoking cases and 4,344 smoking controls of European ancestry. We performed a pathway analysis of the genes listed under the category of “inflammation” in the HuGE Navigator (version 1.4) (13) to measure the collective effect of these genes and identify the SNPs with the strongest association for replication in independent samples. We analyzed the association with lung cancer risk overall and within the major lung cancer histology groups to explore the genetic basis of disease subtypes.
We first conducted quality control analysis for all GWAS SNPs included in this study. Quantile-quantile plots of the negative logarithm of the genome-wide P values and genomic control (GC) λ values computed based on all GWAS SNPs indicated no global variance inflation (genomic controls λ between 1 and 1.02 for all analyses, Supplementary Fig. 1), excluding the possibility of inflated type-I error rates in the pathway analyses.
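The genomic control check described above can be illustrated with a minimal sketch. It assumes the common definition of λ as the median observed 1-df chi-square statistic divided by the expected median under the null; the function names are hypothetical and this is not the pipeline used in the study.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

def chi2_1df_from_p(p):
    """Convert a two-sided P value in (0, 1] to the matching 1-df chi-square statistic."""
    # By symmetry of the normal distribution, the squared quantile at p/2
    # equals the squared quantile at 1 - p/2.
    z = _STD_NORMAL.inv_cdf(p / 2.0)
    return z * z

# Median of the 1-df chi-square distribution (about 0.455).
CHI2_1DF_MEDIAN = chi2_1df_from_p(0.5)

def genomic_control_lambda(p_values):
    """Ratio of the observed median test statistic to its null expectation."""
    stats = [chi2_1df_from_p(p) for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN
```

Under this definition, uniformly distributed P values give λ close to 1, and values well above 1 point to inflation from stratification or cryptic relatedness.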
Among the 591,928 SNPs from the original GWAS, we analyzed 19,082 SNPs mapping to 917 genes from the HuGE Navigator list (13). Given the large number of genes and the variable degree of involvement of these genes in inflammation-related function, we used HuGE Literature Finder to assign each gene a score reflecting the strength of evidence for association with inflammation (Supplementary Table 1 and Supplementary Methods), and tested the pathway association integrating this weighting for the genes. The list of the 917 genes and the corresponding SNP P-values are reported in Supplementary Table 1. After applying a Bonferroni correction, we found a strong association of the pathway for SQ cases (Fig. 1 and Supplementary Table 2; P=0.0004, ever smokers), suggesting the existence of SNPs truly associated with risk in this histologic subtype. An analysis without the HuGE-based weighting provided similar results, although with weaker associations (SQ P = 0.001, Supplementary Table 2). No statistically significant association was found for AD or SC risk following either the weighted or unweighted approach (Fig. 1 and Supplementary Table 2). Therefore, we restricted further analyses to the SQ subtype.
We chose the 55 SNPs in the 55 genes with the strongest evidence for association in SQ (gene-wise P-value <0.05) and performed replication in two independent studies of European ancestry, including UK1 (8), with 592 SQ cases and 2,699 controls from the 1958 Birth Cohort (WTCCC) (14), genotyped using Illumina HumanHap550 arrays; and Texas (5), with 306 SQ cases and 1,137 controls, genotyped using Illumina HumanHap300 arrays.
Among the 55 SNPs, only rs6489769 was consistently replicated in both studies, with a combined P = 6.48×10−7 for the association with SQ risk (Table 1). The SNP marker, rs6489769, maps to chromosome 12p13.33 (position 943,226 bp). The pathway analysis in NCI SQ data after the exclusion of the SNPs at 12p13.33 showed a pathway-level P = 0.0008. We then replicated the rs6489769 SNP in a third independent sample, UK2 (15), with 1,038 SQ cases and 933 controls genotyped using Illumina Infinium custom arrays. The association was also confirmed in this third sample (Table 2). Although these case-control series were smaller than the discovery dataset, each had the statistical power to replicate the signal of the NCI discovery set (OR=1.23) at one-sided P<0.05 (statistical power for UK1=0.92, UK2=0.93, and Texas=0.70). Combining data from all four studies, the association was statistically significant on a genome-wide basis with P = 2.30×10−8, two orders of magnitude below the Bonferroni-corrected P-value threshold for 19,082 SNPs (0.05/19,082 SNPs, P = 2.62×10−6) and odds ratio = 1.20 (95% confidence interval = 1.12–1.28; Phet = 0.89, I2=0%; Table 2).
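The Bonferroni arithmetic quoted above can be checked with a short sketch. The function name is hypothetical; the inputs (α = 0.05, 19,082 SNPs, combined P = 2.30×10−8) are taken from the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

# Values reported in the text.
threshold = bonferroni_threshold(0.05, 19082)   # about 2.62e-6
combined_p = 2.30e-8

# Ratio of the threshold to the combined P value (roughly two orders of magnitude).
fold_below_threshold = threshold / combined_p
```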
We verified whether the association with SQ risk for this SNP was modified by pack-years of tobacco smoking in the NCI GWAS, but found very similar results across smoking strata (Supplementary Table 3). We also investigated in EAGLE (441 SQ cases and 1319 controls) (16) whether the association between rs6489769 and SQ was confounded by chronic obstructive pulmonary disease (COPD) status, but found no major changes in the adjusted data (data not shown). rs6489769 was not significantly associated with COPD in lung cancer cases (P=0.67, OR=0.97). Since only 131 of the controls had documented COPD, a larger study of cancer-free COPD patients is required to robustly examine the impact of this SNP on COPD risk.
To explore the 12p13.33 region further, we imputed unobserved genotypes in SQ cases and controls in NCI SQ data using HapMap Phase III and 1000 Genomes Project data but did not identify any stronger association at 12p13.33 than that provided by rs6489769. This locus harbors the RAD52 gene, which is involved in homologous recombination (HR). We examined whether other genes in the HR pathway or the overall DNA repair pathway influence SQ risk. None of the other 17 HR genes (involving 142 SNPs) showed an association with gene-wise P<0.05 (Supplementary Table 4). In the analysis of the overall DNA repair pathway including 1,410 SNPs mapping to 136 genes (Supplementary Table 5), we observed a modest pathway-level association in the NCI SQ data including (P = 0.006) or excluding (P = 0.04) the RAD52 SNPs, and the only SNP with P-value < 0.001 was rs6489769.
Using a pathway analysis in a genome-wide association study of lung cancer in subjects of European ancestry, we identified a susceptibility locus for squamous cell carcinoma risk on chromosome 12p13.33. The finding was replicated in three independent samples, did not appear to be modified by smoking quantity or personal history of COPD, and exceeded a genome-wide threshold for association.
The 12p13.33 locus has at least 31 alternatively spliced variants (AceView (17)). Depending on the transcript, rs6489769, the SNP most strongly associated with SQ risk, lies either ~13 kb centromeric to or within a plausible candidate gene, RAD52 (RAD52 homolog, yeast; MIM 600392) (Fig. 2). At a 53 kb LD interval from rs6489769 is also the gene encoding WNK1 (protein kinase, lysine deficient 1; MIM 605232), which plays a role in pseudohypoaldosteronism and hereditary sensory neuropathy. ERC1 (ELKS/RAB6-interacting/CAST family member 1; MIM 607127), a member of a family of RIM-binding proteins, also lies 27 kb from the same SNP, but rs6489769 does not appear to correlate with SNPs in this gene (Fig. 2).
High-fidelity replication of DNA, and its accurate segregation to daughter cells, is critical for maintaining genome stability and suppressing cancer. DNA replication forks are stalled by many DNA lesions and stalled forks may eventually collapse, producing a broken DNA end (18). In concert with BRCA2, RAD52 plays a pivotal role in repairing these DNA double-strand breaks (DSBs) through homologous recombination (HR) (19). RAD52 and BRCA2 seem to act in parallel pathways, and RAD52 provides an important alternate way for repairing replication-associated damage by HR in the absence of BRCA2 (20). RAD52 interacts with DNA recombination protein RAD51 and participates in the regulation of its polymerization (21). Thus, variation in RAD52 may disrupt the DSB repair function of RAD51. RAD52 also cooperates with OGG1 to repair oxidative DNA damage thereby enhancing cellular resistance to inflammatory-related oxidative stress (22). Since most therapeutic strategies for lung cancer create DNA replication stress, inherited variation in replication stress response may affect treatment efficacy (18, 23).
While DSBs can arise during DNA replication, they can also be induced by exposure to tobacco smoking (24), chronic inflammation (25) and other agents (20). Thus, genetic variation in RAD52 could contribute to altered repair of tobacco-induced and microinflammatory-sustained DNA damage a priori, providing additional support for the role of HR dysfunction in cancer development. In our data, the rs6489769-SQ association was not modified by levels of tobacco smoking. While the locus may have an effect on SQ risk independent of smoking, our sample size had limited statistical power to detect small changes across smoking strata.
Our analysis also provides evidence for the collective role of inflammatory genes in SQ, which may contribute to the development or maintenance of a carcinogenic inflammatory microenvironment. Since RAD52 is the plausible candidate gene, we verified whether other genes in the HR pathway or the overall DNA repair pathway were associated with SQ risk. While other HR genes may influence SQ risk, our analysis suggests that by far the primary common determinant is related to the RAD52 association. Fine mapping studies and functional analyses are required to determine the biological basis of the association.
The 12p13.33 (RAD52) locus was distinctly associated with squamous cell carcinoma risk. Together with the prior identification of variants at 5p15.33 (TERT) associated with AD risk (9), our findings underscore the importance of searching for histology-specific lung cancer risk variants. Studies in other tumors have also identified different genetic variation by tumor subtypes or phenotypic diversity (26, 27), confirming that studying specific disease subtypes can enhance power for detecting susceptibility loci in GWAS (28). Moreover, evaluating the associations between susceptibility loci and tumor subtypes may improve risk assessment, and predicting the risk for specific tumor subtypes may lead to targeted early detection or prevention strategies. In addition, identifying histology-specific SNPs may refine mechanistic understanding of the currently unknown origins of morphologic differences, and may contribute to the ongoing search for personalized treatment for subtype-specific lung cancer cases (29, 30). However, a challenge of this approach is the difficulty of accruing the necessary sample sizes given the relative rarity of many such disease subtypes. Our pathway-based approach took advantage of prior knowledge of the disease etiology and substantially helped prioritize the most relevant SNPs for replication even in a relatively small sample. This suggests that the combination of a pathway-based approach and information on disease-specific subtypes can greatly improve the identification of cancer susceptibility loci.
The 9,699 smoker subjects were drawn from one population-based case-control study and three cohort studies: Environment and Genetics in Lung cancer Etiology study (EAGLE, case-control, Italy), Alpha-Tocopherol Beta-Carotene Cancer Prevention study (ATBC, cohort, Finland), Prostate, Lung, Colon, Ovary screening trial (PLCO, cohort, US) and Cancer Prevention Study II (CPS-II, cohort, US). All subjects were of European ancestry. The study included three main histological subtypes (AD, SQ and SC) and a small number of other lung cancer subtypes. Subjects were genotyped using Illumina 1M, 610QUAD, 550K, and 317K+240S HumanHap arrays. The distribution of subjects by histology, smoking status and genotyping platform is given in Supplementary Table 6. The details of quality control were reported previously (9). Briefly, SNPs with missing rate >5%, Hardy-Weinberg Equilibrium (HWE) P<10−7 in controls or minor allele frequency (MAF) <5% were excluded. Subjects were removed if they were outliers in the ancestry plots, had high missing genotype rates or were duplicates or relatives of other subjects. The GC λ values were 1.03, 1.03, 1.01, and 1.01 in EAGLE, PLCO, CPS-II, and ATBC, respectively, suggesting no significant hidden population substructure or unadjusted confounding factors.
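The SNP-level exclusion criteria above can be expressed as a simple filter. This is a minimal sketch: the record type, field names and SNP identifiers are hypothetical, and only the three thresholds (missing rate >5%, HWE P<10−7 in controls, MAF <5%) come from the text.

```python
from dataclasses import dataclass

@dataclass
class SnpQc:
    snp_id: str            # hypothetical identifier
    missing_rate: float    # fraction of missing genotype calls
    hwe_p_controls: float  # Hardy-Weinberg P value computed in controls
    maf: float             # minor allele frequency

def passes_qc(snp, max_missing=0.05, min_hwe_p=1e-7, min_maf=0.05):
    """Return True when the SNP survives all three exclusion criteria."""
    return (
        snp.missing_rate <= max_missing
        and snp.hwe_p_controls >= min_hwe_p
        and snp.maf >= min_maf
    )

# Made-up records for illustration only.
example_snps = [
    SnpQc("snp_ok", 0.01, 0.40, 0.20),
    SnpQc("snp_high_missing", 0.08, 0.40, 0.20),
    SnpQc("snp_hwe_fail", 0.01, 1e-9, 0.20),
    SnpQc("snp_rare", 0.01, 0.40, 0.02),
]
retained = [s.snp_id for s in example_snps if passes_qc(s)]
```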
For UK1, cases with pathologically confirmed SQ were ascertained through the Genetic Lung Cancer Predisposition Study (GELCAPS) and genotyped using HumanHap550 arrays. All subjects were British residents and self-reported to be of European ancestry. Controls were from the 1958 Birth Cohort. The GC λ value was 1.03. Details on the quality control procedures have been previously reported (8).
For UK2, lung cancer cases were ascertained through GELCAPS. The 933 healthy smoking individuals included in the analysis are part of the National Cancer Research Network genetic epidemiological studies (1,497 males, 1,539 females; mean age 61 years, SD 11), the National Study of Colorectal Cancer (NSCCG; 1999–2006; n = 541), GELCAPS (1999–2004; n = 1,520); and the Royal Marsden Hospital Trust/Institute of Cancer Research Family History and DNA Registry (1999–2004; n = 975). These controls were the spouses or unrelated friends of patients with malignancies. None had a personal history of malignancy at time of ascertainment. All were British residents and self-reported to be of European ancestry. Subjects were genotyped using an Illumina Infinium custom array.
The Texas study is a case-control study of subjects of European ancestry, including lung cancer cases newly diagnosed at the University of Texas M.D. Anderson Cancer Center since 1991 and controls from the Kelsey-Seybold clinics (the GWAS included only smokers and cases with non-small cell lung cancer). Controls were frequency matched to cases according to their smoking behavior, age, ethnicity, and sex. Former smoking controls were further frequency matched to former smoking cases according to the number of years since smoking cessation. Subjects were genotyped using the Illumina HumanHap300 chip. The GC λ value was 1.03. Details on the quality control procedures have been previously reported (5).
All subjects included in the discovery sample and replication samples signed an informed consent form. The studies were performed after approval by each local institutional review board.
We used the SNP list compiled by the HuGE Literature Finder (HuGE Navigator, version 1.4) (13), corresponding to 970 genes potentially involved in inflammation, and assigned each gene a HuGE score. The HuGE score was calculated based on the frequency of reported associations with inflammation in the HuGE literature (the larger the score, the stronger the association), including studies of genetic association in humans and animal studies. In particular, the number of all publications reporting on the association of a given gene with inflammation, and whether the identified associations were based on genome-wide analyses, meta-analyses, or genetic testing, were taken into account. Publications based on animal models were also added to the final weighting system (Supplementary Methods). Among the 970 genes, 917 (Supplementary Table 1) were covered by the NCI GWAS, corresponding to 19,082 SNPs (within ~20 kb upstream of the start of transcription and ~10 kb downstream of the stop of transcription, NCBI build 36).
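The gene-region definition above (about 20 kb upstream of the transcription start and about 10 kb downstream of the transcription stop) can be sketched as follows. The function names and coordinates are hypothetical, and the strand-aware handling is an assumption; the text does not state how strand was treated.

```python
def gene_window(tx_start, tx_end, strand, upstream=20_000, downstream=10_000):
    """Return (low, high) genomic bounds of the region assigned to a gene.

    tx_start < tx_end are genomic coordinates of the transcript. For a gene on
    the minus strand, transcription starts at tx_end, so the upstream flank
    extends past tx_end (assumed strand-aware convention).
    """
    if strand == "+":
        return max(0, tx_start - upstream), tx_end + downstream
    if strand == "-":
        return max(0, tx_start - downstream), tx_end + upstream
    raise ValueError("strand must be '+' or '-'")

def snp_maps_to_gene(position, tx_start, tx_end, strand):
    """True when a SNP position falls inside the gene's flanked window."""
    low, high = gene_window(tx_start, tx_end, strand)
    return low <= position <= high
```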
For the NCI data, we performed an unconditional logistic regression analysis to test the additive effect of each SNP genotype on lung cancer risk using PLINK software (31), adjusting for age (≤50, 51–55, 56–60, 61–65, 66–70, 71–75, 76+), sex, study (EAGLE, ATBC, PLCO, CPS-II), four principal components derived with EIGENSTRAT (32) to control for population stratification, cigarettes smoked per day (≤10, 11–20, 21–30, 31–40, 41+), smoking duration in 10-year intervals, smoking status (former vs. current smoker) and, for former smokers, the number of years since quitting (1–5, 6–10, 11–20, 21–30, 30+). To assess the possibility of systematic bias due to unadjusted confounding factors or cryptic relatedness, we computed the GC λ values based on all SNPs passing the quality control filters in the NCI GWAS by histology and smoking status.
We tested the pathway-level association by integrating the HuGE scores that reflected different strength of association with lung cancer risks for each gene. Briefly, we first derived the gene-wise P-values P1, …, PK for K = 917 genes, adjusting for the number of SNPs mapping to each gene region and the linkage disequilibrium. We converted each gene-wise P-value Pi into a quantile Qi of a reference distribution. We then tested whether P1, …, PK as a set deviated from the uniform distribution (no pathway-level association) using a weighted statistic T, with wi being the weights converted from the HuGE scores. A large value of T indicated a pathway-level association. The statistical significance was evaluated based on 5000 random permutations. The details of the pathway testing procedure are in the Supplementary Methods. We replicated the same analysis without applying the HuGE score by setting identical wi across the genes.
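A weighted pathway statistic of this general kind can be sketched as below. Everything specific here is an assumption made for illustration, since the exact expressions are given only in the Supplementary Methods: Qi is taken as the 1-df chi-square quantile at 1 − Pi, T as the weighted sum of the Qi, and the null distribution is approximated by drawing independent uniform P-values instead of permuting phenotypes, which would require the genotype data.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def quantile_from_p(p):
    """Assumed Q_i: 1-df chi-square quantile at 1 - p, for p in (0, 1]."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # squared value equals the quantile at 1 - p/2
    return z * z

def pathway_statistic(p_values, weights):
    """Assumed T: weighted sum of the gene-wise quantiles."""
    return sum(w * quantile_from_p(p) for p, w in zip(p_values, weights))

def pathway_p_value(p_values, weights, n_draws=5000, seed=0):
    """Monte Carlo P value for T under independent uniform gene-wise P values."""
    rng = random.Random(seed)
    observed = pathway_statistic(p_values, weights)
    exceed = 0
    for _ in range(n_draws):
        null_p = [1.0 - rng.random() for _ in weights]  # uniform on (0, 1]
        if pathway_statistic(null_p, weights) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_draws + 1)
```

Setting all weights equal reproduces the unweighted variant of the test; gene-level correlation, which phenotype permutation preserves, is ignored in this simplified null.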
We selected the 66 genes with gene-wise P-value <0.05 in the NCI SQ data, and chose the min-P SNP for each gene. Since physically close genes may be in LD or have identical min-P SNPs, for each pair of min-P SNPs with high LD (r2 > 0.5), we removed the SNP with the weaker association signal in the NCI SQ data. We compiled a list of 59 SNPs for replication, among which four SNPs were not genotyped in any replication samples. In summary, we chose 55 SNPs for replication. Details of SNP selection within the major histocompatibility complex (MHC) region with long-range LD are in the Supplementary Methods.
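The LD-based pruning step can be sketched with a greedy pass that keeps the strongest SNP first and drops any SNP in high LD with one already kept. The greedy ordering is one possible reading of the pairwise rule described above, and the function name and input layout are hypothetical.

```python
def prune_by_ld(snp_p, r2, threshold=0.5):
    """Keep min-P SNPs, removing the weaker SNP of each high-LD pair.

    snp_p: dict mapping SNP id to its association P value.
    r2:    dict mapping frozenset({snp_a, snp_b}) to pairwise r^2;
           missing pairs are treated as unlinked (r^2 = 0).
    """
    kept = []
    # Visit SNPs from strongest to weakest association.
    for snp in sorted(snp_p, key=snp_p.get):
        linked = any(r2.get(frozenset((snp, k)), 0.0) > threshold for k in kept)
        if not linked:
            kept.append(snp)
    return kept
```

For example, with hypothetical SNPs A (P = 1e-5), B (P = 1e-3) and C (P = 0.01), where only A and B have r2 = 0.8, the function keeps A and C.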
Standard fixed effect meta-analysis was performed for the 55 SNPs in NCI, UK1 and Texas to derive the ORs, 95% confidence intervals and P-values. For the most significant SNP rs6489769, we performed meta-analysis of the results from NCI, UK1, UK2 and Texas datasets. Cochran’s Q statistic was calculated to test for heterogeneity with P-value Phet.
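A standard inverse-variance fixed-effect meta-analysis with Cochran's Q can be sketched as follows. The function name and return layout are hypothetical, the inputs are per-study log odds ratios and standard errors, and no study data from this paper are reproduced. Converting Q to Phet requires a chi-square tail probability with k − 1 degrees of freedom, which is left out here.

```python
import math
from statistics import NormalDist

def fixed_effect_meta(log_ors, std_errs):
    """Inverse-variance fixed-effect pooling of per-study log odds ratios."""
    weights = [1.0 / se ** 2 for se in std_errs]
    w_sum = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / w_sum
    pooled_se = math.sqrt(1.0 / w_sum)

    z = pooled / pooled_se
    p_value = 2.0 * NormalDist().cdf(-abs(z))  # two-sided
    z95 = NormalDist().inv_cdf(0.975)

    # Cochran's Q and the I^2 heterogeneity index.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "odds_ratio": math.exp(pooled),
        "ci_low": math.exp(pooled - z95 * pooled_se),
        "ci_high": math.exp(pooled + z95 * pooled_se),
        "p": p_value,
        "q": q,
        "df": df,
        "i2": i2,
    }
```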
Prediction of the untyped SNPs in the 12p13.33 locus was carried out using IMPUTE2 (33), based on CEU HapMap Phase III haplotypes release 2 and 1000 Genomes Project. Unconditional logistic regression was performed to test the association between the imputed genotypic dosage and the trait using R adjusting for age, sex, study, principal components, cigarettes smoked per day, smoking duration, smoking status and, for former smokers, the number of years since quitting as in the single SNP analysis. LD metrics between HapMap SNPs and association P-values were plotted using SNAP (34).
Financial support: This study was supported by the Intramural Research Program of National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics. The studies contributing to the NCI GWAS, including The Environment and Genetics in Lung cancer Etiology (EAGLE), Prostate, Lung, Colon, Ovary Screening Trial (PLCO), and Alpha-Tocopherol, Beta-Carotene Cancer Prevention (ATBC) studies were supported by the Intramural Research Program of the National Institutes of Health, National Cancer Institute (NCI), Division of Cancer Epidemiology and Genetics.
ATBC was also supported by U.S. Public Health Service contracts (N01-CN-45165, N01-RC-45035, and N01-RC-37004) from the NCI.
PLCO was also supported by individual contracts from the NCI to the University of Colorado Denver (NO1-CN-25514), Georgetown University (NO1-CN-25522), the Pacific Health Research Institute (NO1-CN-25515), the Henry Ford Health System (NO1-CN-25512), the University of Minnesota (NO1-CN-25513), Washington University (NO1-CN-25516), the University of Pittsburgh (NO1-CN-25511), the University of Utah (NO1-CN-25524), the Marshfield Clinic Research Foundation (NO1-CN-25518), the University of Alabama at Birmingham (NO1-CN-75022), Westat, Inc. (NO1-CN-25476), and the University of California, Los Angeles (NO1-CN-25404).
The Cancer Prevention Study-II (CPS-II) Nutrition Cohort was supported by the American Cancer Society.
The NIH Genes, Environment and Health Initiative (GEI) partly funded DNA extraction and statistical analyses (HG-06-033-NCI-01 and RO1HL091172-01), genotyping at the Johns Hopkins University Center for Inherited Disease Research (U01HG004438 and NIH HHSN268200782096C), and study coordination at the GENEVA Coordination Center (U01HG004446) for the EAGLE study and part of the PLCO. Genotyping for the remaining part of PLCO and all ATBC and CPS-II samples was supported by the Intramural Research Program of the National Institutes of Health, NCI, Division of Cancer Epidemiology and Genetics. The “Texas” study was supported by NIH grants R01CA55769, R01 CA127219, R01CA133996, and U19CA121197.
The UK1 and UK2 work was supported by Cancer Research UK (C1298/A8780 and C1298/A8362- Bobby Moore Fund for Cancer Research UK). Yufei Wang was supported by an extramural NIH grant U19 CA14812701. We are also grateful to National Cancer Research Network, Helen Rollason Heal Cancer Charity and Sanofi-Aventis and the NHS funding for the Royal Marsden Biomedical Research Centre.
We thank the investigators involved in the EAGLE study, listed in http://eagle.cancer.gov. We thank the clinicians who took part in the GELCAPS consortium. The UK studies made use of genotyping data on the 1958 Birth Cohort and these data were generated and generously supplied to us by Panagiotis Deloukas of the Wellcome Trust Sanger Institute. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov).
Conflicts of interest: The authors declare no conflicts of interest.