Lung cancer, of which 85% is non–small-cell (NSCLC), is the leading cause of cancer-related death in the United States. We used genome-wide analysis of tumor tissue to investigate whether single nucleotide polymorphisms (SNPs) in tumors are prognostic factors in early-stage NSCLC.
One hundred early-stage NSCLC patients from Massachusetts General Hospital (MGH) were used as a discovery set and 89 NSCLC patients collected by the National Institute of Occupational Health, Norway, were used as a validation set. DNA was extracted from flash-frozen lung tissue with at least 70% tumor cellularity. Genome-wide genotyping was done using a high-density SNP chip. Copy numbers were inferred using median smoothing after intensity normalization. Cox models were used to screen and validate significant SNPs associated with overall survival.
Copy number gains in chromosomes 3q, 5p, and 8q were observed in both MGH and Norwegian cohorts. The top 50 SNPs associated with overall survival in the MGH cohort (P ≤ 2.5 × 10⁻⁴) were selected and examined using the Norwegian cohort. Five of the top 50 SNPs were validated in the Norwegian cohort with false discovery rate lower than 0.05 (P < .016) and all five were located in known genes: STK39, PCDH7, A2BP1, and EYA2. The numbers of risk alleles of the five SNPs showed a cumulative effect on overall survival (P for trend = 3.80 × 10⁻¹² and 2.48 × 10⁻⁷ for MGH and Norwegian cohorts, respectively).
Five SNPs were identified that may be prognostic of overall survival in early-stage NSCLC.
Lung cancer is the leading cause of cancer-related death in the United States.1 Non–small-cell lung cancer (NSCLC) comprises more than 80% of lung cancer.2 The TNM staging system has been the standard for determining prognosis of NSCLC. It is reported that the 5-year survival rate is 40% to 67% in stage I and 25% to 55% in stage II NSCLC.3 The wide range of survival rates seen in patients with early-stage disease indicates the heterogeneity of prognoses within this population and the inadequacy of the TNM staging system to account fully for this heterogeneity.
Due to advances in high-throughput genotyping, screening for disease loci on a genome-wide scale is now possible. Genome-wide association studies have been published on lung cancer, suggesting that lung cancer susceptibility may be associated with several single nucleotide polymorphisms (SNPs).4,5 All of these studies, however, focused on cancer susceptibility rather than cancer survival.
Genetic variations in the tumor genome may serve as better prognostic markers than germline genetic variations. Tumor genomics are more representative of the clinical problem and therefore are more likely to determine the transformative and invasive behaviors specific to cancer cells. Moreover, the SNP genotypes in the tumor genome may differ from the corresponding genotypes in the germline DNA because of frequent mutations resulting from genetic instability.6
Genome-wide copy number aberrations in tumor genome have been reported.7,8 Some studies have suggested an association of copy number variation of tumor genome with cancer survival.7,9 There were also studies using metagene or gene signatures from the mRNA expression data of NSCLC to predict patients' prognosis.10,11 However, to our knowledge, there is no study of the association of tumor SNPs with NSCLC survival using genome-wide technology.
The aim of this study was to identify genetic variations in tumors that are associated with the survival in early-stage NSCLC. In this study, genome-wide analysis of survival was conducted with one discovery set and the results were validated using a separate validation set.
The discovery phase of the study included 240 patients with early-stage NSCLC who underwent surgical resection at Massachusetts General Hospital (MGH; Boston, MA) between 1992 and 2001, and whose resected specimen was available for pathological review and DNA extraction. One hundred forty of the 240 patients were excluded because of insufficient tissue quality, inadequate DNA content, or low DNA quality. The remaining 100 specimens from the MGH cohort were used as the discovery set. An independent validation set consisted of 89 specimens assembled using similar criteria from a series of 199 patients with NSCLC collected by the National Institute of Occupational Health, Oslo, Norway, who underwent surgical resection between 1988 and 1998. We also included 19 additional specimens of matched non-neoplastic lung parenchyma from patients with NSCLC in the Norwegian cohort. Written informed consent was obtained from all patients. The study was approved by the institutional review boards of MGH and the Harvard School of Public Health, and by the Norwegian Data Inspectorate and the Local Regional Committee for Medical Research. The collection of demographic and clinical data is described in the Appendix (online only).
Frozen, archived resection specimens were analyzed. DNA was prepared from tumor and non-neoplastic lung parenchyma after manual microdissection of 5-μm histopathologic sections. For the discovery set, a pathologist (L.R.C.) who had no knowledge of the outcome reviewed all sections for each patient. Each specimen was evaluated for amount and quality of tumor cells and histologically classified using the WHO criteria. The Norwegian specimens were all resections collected and prepared in the same way. Specimens with lower than 70% cancer cellularity, inadequate DNA concentration (< 50 ng/μL), or a smearing pattern in gel electrophoresis were not included for genotyping.
A total of 262,264 SNPs were genotyped for 189 tumor DNAs, 20 paired blood DNAs in the MGH cohort, and 19 DNAs from paired noninvolved lung tissues in the Norwegian cohort using the Affymetrix 250K Nsp GeneChip (Affymetrix, Santa Clara, CA). The comparison of the 39 tumor DNAs and the corresponding 39 DNAs from blood or noninvolved tissue is shown in Appendix Table A1 (online only). Overall, 2.42% of the SNPs had different genotypes in the paired DNAs. The median call rates of the MGH and Norwegian samples were 92.8% and 90.9%, respectively, which were similar to those of previous studies in solid tumors using Affymetrix SNP chips.12–14 Copy numbers were obtained with dChip software.15 The probe intensities were calculated by model-based expression after invariant set normalization. For each SNP in each sample, the raw copy number was computed as signal times 2 divided by the mean signal of reference samples at this SNP, using blood or non-neoplastic tissue samples as the referent. Inferred copy numbers were computed from the raw copy numbers by median smoothing for each locus of 262,264 SNPs.
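The copy number calculation described above can be illustrated with a minimal sketch. It is not the dChip implementation: the function names, the centered running-median window, and the window width of 3 are assumptions made for illustration, and the intensities are invented values rather than study data.

```python
from statistics import mean, median


def raw_copy_numbers(sample_signal, reference_signals):
    """Raw copy number per SNP: signal * 2 / mean reference signal at that SNP."""
    raw = []
    for i, signal in enumerate(sample_signal):
        ref_mean = mean(ref[i] for ref in reference_signals)
        raw.append(2.0 * signal / ref_mean)
    return raw


def median_smooth(values, window=3):
    """Running median over a centered window, truncated at both ends.

    The window width is an assumed parameter; the text does not state it.
    """
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Hypothetical normalized intensities for six consecutive SNPs.
tumor = [100.0, 150.0, 160.0, 400.0, 155.0, 150.0]
references = [[100.0] * 6, [100.0] * 6]  # blood or non-neoplastic samples

raw = raw_copy_numbers(tumor, references)
smoothed = median_smooth(raw, window=3)
```

In this toy example the isolated spike at the fourth SNP (raw copy number 8.0) is replaced by the median of its neighborhood, which is the purpose of smoothing before gains and losses are called.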
Inferred copy numbers ≥ 2.7 were considered gains and ≤ 1.3 were considered losses. These cut-offs were set in order to detect ≥ 3 and ≤ 1 copies by tolerating 30% normal cell contamination. The prevalence of the subjects with copy number variations (CNVs) was plotted across the genome. Significance in genome-wide copy number variations was determined based on the binomial distribution with the probabilities of CNVs (≥ 2.7 or ≤ 1.3) estimated empirically from the data, and q values were calculated to control for multiple comparisons across the genome using the false discovery rate.16
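The cut-offs follow from a simple mixture argument: if 70% of the cells are tumor cells and the remaining 30% are normal diploid cells, a tumor with 3 copies is expected to show 0.7 × 3 + 0.3 × 2 = 2.7 copies, and a tumor with 1 copy is expected to show 0.7 × 1 + 0.3 × 2 = 1.3 copies. The sketch below restates that arithmetic; the function names are hypothetical and the rounding step is only there to avoid floating-point noise at the cut-offs.

```python
def observed_copy_number(tumor_copies, tumor_fraction, normal_copies=2):
    """Expected copy number of a tumor/normal cell mixture."""
    mixed = tumor_fraction * tumor_copies + (1 - tumor_fraction) * normal_copies
    return round(mixed, 6)  # guard against floating-point noise


def classify(inferred, gain_cutoff=2.7, loss_cutoff=1.3):
    """Label an inferred copy number using the cut-offs given in the text."""
    if inferred >= gain_cutoff:
        return "gain"
    if inferred <= loss_cutoff:
        return "loss"
    return "neutral"


gain_signal = observed_copy_number(3, 0.7)  # 3 copies, 30% normal cells
loss_signal = observed_copy_number(1, 0.7)  # 1 copy, 30% normal cells
labels = [classify(x) for x in (gain_signal, 2.0, loss_signal)]
```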
The 74,666 SNPs with ≥ 95% call rate, ≥ 10% of subjects with heterozygous or variant homozygous genotypes, and ≥ 3% of subjects with variant homozygous genotypes in the MGH cohort were selected for subsequent genome-wide analysis of overall survival. This selection was necessary to eliminate the potential for biased results driven by the few subjects carrying homozygous variant alleles. For each of the selected 74,666 SNPs, genome-wide analysis of overall survival under an additive model was performed using the MGH cohort by univariate Cox models. Two-sided P values were obtained using score tests. The top 50 SNPs with the smallest P values were used for validation using the Norwegian cohort. For each of these 50 SNPs, Cox models were fit adjusting for covariates including age (in a continuous scale), sex, clinical stage (IA, IB, IIA, and IIB as ordinal categories), cell type (squamous cell carcinoma v adenocarcinoma), and smoking pack-years (in a continuous scale). The false discovery rate (FDR) was controlled by computing q values for the 50 comparisons. Survival analysis using joint copy number and SNP was also performed, where each SNP, coded on a continuous scale, was multiplied by its inferred copy number. For SNPs with consistent effects in two cohorts, pooled analyses were performed using stratified Cox models, assuming different baseline hazards for the two cohorts to control for differences between the two cohorts that are not accounted for by the covariates, and adjusting for the above covariates.
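As an illustration of the SNP selection step only, the three inclusion criteria can be written as a small filter. This is a sketch under stated assumptions: genotypes are coded as the number of variant alleles (0, 1, or 2) with None for a no-call, and the carrier and homozygous-variant proportions are taken among called genotypes, which the text does not specify. The function name and the example data are hypothetical.

```python
def passes_filter(genotypes, min_call_rate=0.95,
                  min_carrier=0.10, min_hom_variant=0.03):
    """Apply the three SNP inclusion criteria to one SNP.

    genotypes: list of 0/1/2 (variant allele count) or None (no call).
    Proportions are computed among called genotypes (an assumption).
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) < min_call_rate:
        return False
    carriers = sum(1 for g in called if g >= 1) / len(called)
    hom_variant = sum(1 for g in called if g == 2) / len(called)
    return carriers >= min_carrier and hom_variant >= min_hom_variant


# Hypothetical SNPs in 100 subjects.
snp_a = [0] * 60 + [1] * 30 + [2] * 8 + [None] * 2    # meets all criteria
snp_b = [0] * 85 + [1] * 14 + [2] * 1                 # too few variant homozygotes
snp_c = [0] * 50 + [1] * 30 + [2] * 10 + [None] * 10  # call rate below 95%

kept = [name for name, snp in
        (("snp_a", snp_a), ("snp_b", snp_b), ("snp_c", snp_c))
        if passes_filter(snp)]
```

Requiring a minimum share of variant homozygotes keeps an additive-model estimate from resting on one or two subjects, which is the concern the paragraph above raises.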
Risk alleles were defined as the alleles associated with shorter survival. Joint effects were investigated by adding up the number of risk alleles of the five SNPs validated in the Norwegian cohort. Cox models were fit using the total number of risk alleles as an ordinal variable while adjusting for covariates. Poisson regressions were used to compute crude mortality rates, crude recurrence rates, and 95% CIs for different numbers of risk alleles. The numbers of risk alleles of the five SNPs were further categorized into 0, 1 or 2, 3 or 4, and more than 4. Hazard ratios for each category were estimated using Cox models with those carrying 0 risk alleles as the referent. Kaplan-Meier survival estimates were also plotted for the four groups and P values were obtained with log-rank tests. Similar P values were obtained using Cox models controlling for covariates. Within stage IA, IB, and II, Kaplan-Meier survival estimates were also plotted for those carrying ≤ 2 risk alleles and more than two risk alleles. The linkage disequilibrium plots were derived using Haploview (version 3.32; http://www.broad.mit.edu/mpg/haploview).
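The risk allele count, its four categories, and a Kaplan-Meier estimate can be sketched as follows. This is an illustrative outline, not the analysis code used in the study: the genotype coding (0, 1, or 2 copies of the variant allele), the per-SNP flag saying whether the variant allele is the risk allele, the function names, and the follow-up data are all hypothetical.

```python
def risk_allele_count(genotypes, risk_is_variant):
    """Total risk alleles across SNPs (0 to 10 for five SNPs).

    genotypes: variant allele counts (0/1/2), one per SNP.
    risk_is_variant: True where the variant allele is the risk allele.
    """
    return sum(g if flag else 2 - g
               for g, flag in zip(genotypes, risk_is_variant))


def risk_group(count):
    """Map a risk allele count to the four categories used in the text."""
    if count == 0:
        return "0"
    if count <= 2:
        return "1 or 2"
    if count <= 4:
        return "3 or 4"
    return "more than 4"


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as (time, survival) at each event time.

    events: 1 if death was observed at that time, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for time, _ in data if time == t)
        deaths = sum(1 for time, e in data if time == t and e)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t
        i += n_at_t
    return curve


# Hypothetical patient: five SNPs, the variant allele is the risk allele
# at the first four SNPs and the reference allele at the fifth.
count = risk_allele_count([1, 0, 2, 0, 2], [True, True, True, True, False])
group = risk_group(count)

# Hypothetical follow-up in years for one risk group.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```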
Patient characteristics of the MGH and Norwegian cohorts are described in Table 1. The median survival times were 6.3 and 3.7 years, and 43 and 55 deaths occurred in the MGH and Norwegian cohorts, respectively. The proportion of male patients and squamous cell carcinomas was higher in the Norwegian cohort, but the median smoking pack-years was higher in the MGH cohort (all with P < .05). In the MGH cohort, there were eight patients who received adjuvant radiation and one patient with adjuvant chemotherapy. None of the Norwegian patients received adjuvant chemotherapy or radiotherapy. Among the characteristics listed in Table 1, only age showed marginal significance in association with overall survival (P = .056 and .11 for the MGH and Norwegian cohorts, respectively).
The pattern of copy number variations (≥ 2.7 or ≤ 1.3) in both cohorts was similar, although Norwegian cases seemed to have more substantial changes than MGH patients (Fig 1). In both cohorts, at least 10% to 15% of subjects had large-scale copy number gains in chromosomes 3q, 5p, and 8q, which was statistically significant even after adjusting for multiple comparisons by FDR. Furthermore, focal amplifications in chromosomes 7p, 14q, 17q, and 19q were also significantly frequent in both cohorts (FDR < 0.05).
A total of 74,666 SNPs were used for genome-wide survival analysis in the MGH cohort. The top 50 SNPs (P ≤ 2.5 × 10⁻⁴ in univariate analyses) that were most highly associated with overall NSCLC survival were chosen for validation in the Norwegian cohort using Cox models assuming an additive model. Among the 50 SNPs, 10 were found to be associated with overall survival in the same direction with P values lower than .1, and three SNPs showed opposite associations in the Norwegian cohort, with adjustment for age, sex, cell type, clinical stage, and smoking pack-years (Table 2). Analyses adjusted for copy numbers revealed similar findings. Notably, nine of the 10 SNPs validated at a significance level of 0.1 were located in known genes, which was a significant overrepresentation (P = 1.1 × 10⁻⁴) given that only 40.3% of the probes on the chip were in known genes. The two most significant SNPs (rs10176669 and rs4438452) were located in the same gene (STK39), and two SNPs in each of PCDH7 and HTR3E were validated at a significance level of 0.1 (FDR < 0.15).
Five SNPs (rs10176669, rs4438452, rs12446308, rs13041757, and rs10517215) with consistent effects and FDR lower than 0.05 (P < .016) were all located in the introns of known genes: serine threonine kinase 39 (STK39), protocadherin 7 (PCDH7), ataxin 2-binding protein 1 (A2BP1), and eyes absent homolog 2 (EYA2). For the five SNPs, pooled analyses showed even higher significance than that in either single cohort. Excluding the patients in the MGH cohort who received adjuvant chemotherapy or radiation, or adjusting for adjuvant therapy as a covariate, changed the results in Table 2 very little. Analyses excluding the nonwhite patients (n = 4) also showed similar results. The linkage disequilibrium plots of the validated SNPs are shown in Appendix Figure A2 (online only).
The joint effect of the five SNPs was summarized by the number of total risk alleles in Table 3. The crude mortality rates in both cohorts increased with the number of risk alleles, which was consistent with the results from tests for trend in Cox models (P = 3.80 × 10⁻¹² in the MGH cohort and P = 2.48 × 10⁻⁷ in the Norwegian cohort). A similar trend was also observed in relapse-free survival of the MGH cohort (P = 1.67 × 10⁻⁵). Adjusted hazard ratios from Cox models for patients carrying at least one risk allele, using those not carrying any risk allele as the referent, ranged from 4.1 to 53.9 for death and 2.1 to 20.4 for recurrence in the MGH cohort, and from 1.6 to 16.6 for death in the Norwegian cohort (Table 3). Similar allele dose-response relationships were also shown in Kaplan-Meier curves (Fig 2). The 5-year survival rates for the patients with 0, 1 or 2, 3 or 4, and more than 4 risk alleles were 87.5%, 82.0%, 28.0%, and 11.1%, respectively, in the MGH cohort, and 60.0%, 51.7%, 0%, and 0% in the Norwegian cohort. Even within different cell types, the patients with more than two risk alleles consistently showed poor prognosis (Appendix Fig A1, online only).
Copy number gains in 3q, 5p, and 8q have been reported in previous studies of lung cancer.7,8 These CNV loci provide candidate regions for identification of novel oncogenes and the magnitude of these CNVs makes evident the need for combining both SNPs and CNVs in genome-wide analysis of tumor genome.
Ten SNPs, of which nine were located in six known genes, were validated in the Norwegian cohort at a significance level of 0.1. Of the six genes, STK39, PCDH7, and HTR3E had more than one SNP on the list of 10 validated SNPs. After adjusting for multiple comparisons with FDR of 0.05 as the cutoff, we still found five SNPs (rs10176669, rs4438452, rs12446308, rs13041757, and rs10517215) in the tumor genome to be significantly associated with the survival of early-stage NSCLC patients (P < .016) even after adjusting for the clinical covariates or copy number variations. The cumulative dosage effect of the five SNPs makes the risk allele count a promising prognostic marker because it can be used to predict overall survival and relapse-free survival of early-stage NSCLC patients on a finer scale. At the same time, the cumulative effect may imply certain biologic interactions that require further experimental investigations to define.
The five SNPs identified as significantly associated with survival were located within known genes: STK39, PCDH7, A2BP1, and EYA2. STK39 encodes a serine threonine kinase that specifically activates the p38 mitogen-activated protein kinase signaling pathway and is thought to play a role in cellular stress response.17 Its inactivation has also been shown to sensitize cells to apoptosis.18 PCDH7 encodes an integral membrane protein that is believed to function in cell-cell recognition and adhesion,19 and its locus, 4p15, is a region of loss of heterozygosity in some head and neck squamous cell carcinomas.20 A2BP1 encodes a ribonucleoprotein motif that is highly conserved among RNA-binding proteins, which suggests an important basic function in development and differentiation.21,22 EYA2 encodes a transcription factor associated with apoptosis during development,23 and its upregulation has been shown to promote tumor growth and decrease overall survival in epithelial ovarian cancer.24
There are three SNPs with opposite effects on survival in the two cohorts. Similar conflicting findings have also been reported in disease risk studies of germline polymorphisms of COMT in schizophrenia25 and of NPSR1 in asthma.26 Such opposite associations, called the flip-flop phenomenon, have been explained by the difference in structures of linkage disequilibrium across different populations when the investigated variant is correlated with the causal variant.25 Moreover, because we studied somatic gene variations, it is possible that there may be considerably more of such phenomena. The dual and opposite role in life span by mechanisms of cellular senescence/aging and tumor suppression has also been shown in tumor suppressor genes p53 and p16INK4a.27,28 However, further investigation is needed to disentangle true but opposite associations from the chance associations.
The use of two independent cohorts as discovery and validation sets is one of the major strengths of this study and should substantially reduce false-positive findings from the whole-genome scan. In addition, restriction to early-stage disease and two cell types makes the study population more homogeneous, and the long follow-up time increases the statistical power to detect a genetic effect on survival. Furthermore, complete clinical information enables us to adjust for potential confounding factors, and stringent histopathological criteria minimize the misclassification of tumor and normal DNA, preserving statistical power to detect copy number variations and to identify the genetic variation specific to tumors.
We acknowledge several limitations in our study. First, the modest sample size of both cohorts does not provide optimal statistical power for discovering and validating associations, so some false-negative findings should be expected. It was noted that the most significant SNP, rs16931907, discovered in the MGH cohort (P = 2.01 × 10⁻¹¹), only showed marginal significance (P = .057 in univariate analysis and .16 in multivariate analysis) in the Norwegian cohort, which may reflect such a limitation. However, the sample size of this study is comparable with other genome-wide studies investigating the association of mRNA expression in tumor with NSCLC survival.10,11 Second, the analyses in “one-marker-at-a-time” fashion cannot capture the effect from a “repertoire” of numerous genetic variations, each of which contributes only a mild effect singly. Third, the discovery and validation cohorts were not similar, as presented in Table 1, which would not provide the optimal efficiency in validating the findings. Among the 50 most significant SNPs discovered in the MGH cohort, 10 were validated at significance level of .1 using univariate Cox models; however, 13 SNPs were validated using multivariate analyses adjusting for age, sex, clinical stage, cell type, and smoking. Conversely, one could argue that the associations validated in the Norwegian cohort are of more value given the diversity between the two cohorts, which was also the case shown in previous studies.10,11 The five SNPs (rs10176669, rs4438452, rs12446308, rs13041757, and rs10517215) reported here were validated in both crude and adjusted analyses at significance level of .05. Finally, the two cohorts are mainly of white origin, which may limit the generalizability of our findings to other ethnicities.
We conclude that copy number increases in chromosomes 3q, 5p, and 8q were validated. Five SNPs in tumor were found to be significantly associated with the survival of early-stage NSCLC patients, and survival decreased with the number of risk alleles. Larger studies are required to confirm the effects of the five SNPs in other patient populations, and resequencing of neighboring regions along with functional assessment will be needed to clarify the role of these tumor SNPs in NSCLC behavior.
We are indebted to the participants of Molecular and Genetic Analysis of Lung Cancer Study; to the Lung Cancer Study Group: Eugene Mark, MD, Matthew Kulke, MD, Wei Zhou, MD, PhD, Geoffrey Liu, MD, Marcia Chertok, Andrea Shafer, Lauren Cassidy, Maureen Convery, Salvatore Mucci; to Panos Fidias, MD, and Bruce A. Chabner, MD, and the physicians and surgeons of the Massachusetts General Hospital Cancer Center; to Lodve Stangeland, MD, Haukeland University Hospital, Norway for recruiting patients; to Els Goetghebeur, PhD, Deanne Taylor, PhD, Huanyu Zhou, PhD, and Shun-Chiao Chang of Harvard School of Public Health; to Chien-Ling Lin of University of Massachusetts Medical School for her advice on copy number analysis; and to Edward Cox, MD, and the Microarray Core Facility of Dana-Farber Cancer Institute.
Demographic and smoking information was collected by a trained research assistant using a modified standardized American Thoracic Society respiratory questionnaire (Zhou W, Heist RS, Liu G, et al: Clin Cancer Res 12:7187-7193, 2006) and the clinical information was collected via extensive chart review. A similar approach was used for the Norwegian validation set. Relapse-free and overall survival were calculated from time of surgery to time of documented recurrence and death, respectively.
|Tissue|Single Nucleotide Polymorphisms||||||P*|
||No.|%|No.|%|No.|%||
|Lung tumor|3,785,272|39.71|2,167,107|22.73|3,580,231|37.56|< 10⁻¹⁶|
|Blood or noninvolved|3,729,068|38.04|2,517,673|25.68|3,556,211|36.28||
Supported by Grants No. CA092824 (D.C.C.), CA074386 (D.C.C.), and CA090578 (D.C.C.) and funding (X.L.) from the National Institutes of Health; and the Norwegian Cancer Society (V.S., A.H.).
Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.
The author(s) indicated no potential conflicts of interest.
Conception and design: Yen-Tsung Huang, Rebecca S. Heist, Xihong Lin, Zhaoxi Wang, David C. Christiani
Financial support: David C. Christiani
Administrative support: David C. Christiani
Provision of study materials or patients: Yen-Tsung Huang, Rebecca S. Heist, Vidar Skaug, Aage Haugen, Li Su, David C. Christiani
Collection and assembly of data: Yen-Tsung Huang, Rebecca S. Heist, Lucian R. Chirieac, Vidar Skaug, Aage Haugen, Kofi Asomaning, David C. Christiani
Data analysis and interpretation: Yen-Tsung Huang, Rebecca S. Heist, Lucian R. Chirieac, Xihong Lin, Vidar Skaug, Michael C. Wu, Zhaoxi Wang, David C. Christiani
Manuscript writing: Yen-Tsung Huang, Rebecca S. Heist, Lucian R. Chirieac, Xihong Lin, Vidar Skaug, Michael C. Wu, Zhaoxi Wang, Li Su, Kofi Asomaning, David C. Christiani
Final approval of manuscript: Yen-Tsung Huang, Rebecca S. Heist, Lucian R. Chirieac, Xihong Lin, Vidar Skaug, Aage Haugen, Michael C. Wu, Zhaoxi Wang, Li Su, Kofi Asomaning, David C. Christiani