Lung cancer is the leading cancer killer for both men and women worldwide. Over 80% of lung cancers are attributed to smoking. In this analysis, the authors propose to use a two-stage clonal expansion (TSCE) model to predict an individual’s lung cancer risk based on gender and smoking history. The TSCE model is traditionally fitted to prospective cohort data. Here, the authors describe a new method that allows for the reconstruction of cohort data from the combination of risk factor data obtained from a case-control study, and tabled incidence/mortality rate data, and discuss alternative approaches. The method is applied to fit a TSCE model based on smoking. The fitted model is validated against independent data from the control arm of a lung cancer chemoprevention trial, CARET, where it accurately predicted the number of lung cancer deaths observed.
Lung cancer is the second leading cancer in terms of incidence for both men and women, and is the leading cancer killer worldwide (1–2). Lung cancer is expected to cause 28% of all cancer deaths in 2010 (3). Smoking is the major risk factor for lung cancer implicated in as much as 90% of cases in the United States (4).
Here, the authors use carcinogenesis modeling, specifically the two-stage clonal expansion (TSCE) model, incorporating the effect of cigarette smoking, for use as a lung cancer risk prediction model. This model assumes two unique mutations are required for normal cells to become malignant. It also allows for clonal expansion of pre-malignant cells. More details on the model are provided in the Methods section.
Since the TSCE model is a time to event model, it is normally fitted to prospectively collected cohort data. For this study, case-control but not cohort data on risk factor exposures are available, along with tabled age-specific mortality rates. This prompted the development of a new method of reconstructing cohort data using resampling. The resampling method outlined in this paper is applied to fit the TSCE model to MD Anderson case-control data on individual smoking histories. The fitted model with resulting parameter estimates is validated by predicting lung cancer deaths in the non-asbestos exposed control arm of the Carotene and Retinol Efficacy Trial (CARET).
The goal of this project was to use a combination of lung cancer case-control data and mortality rate data from an independent source to fit a carcinogenesis model that will allow individual level prediction of lung cancer risk. A complication of using case-control data is that the matching between cases and controls causes intrinsic biases. A further complication in this particular analysis is that the enrolled cases and controls are matched partially on the risk factor of interest (i.e. on smoking status, although not on smoking intensity or duration). We overcome this complication by combining the case-control data with independent data on age, gender, ethnicity and smoking status specific lung cancer mortality rates from two cohorts, as outlined below.
A lung cancer case-control study was launched at the MD Anderson Cancer Center Department of Epidemiology in 1991 and is still ongoing. Detailed smoking histories, as well as data on other risk factors such as exposure to dust and asbestos, have been obtained by personal interviews (5–8). Lung cancer cases are matched with cancer-free controls on age (within 5 years), gender, ethnicity, and smoking status. Our analysis included 992 males and 919 females (Table 1) for whom the required demographic and risk factor information is fully available. The analysis was limited to Caucasians because of the small sample sizes for other races.
The case-control study design included matching on gender, race, age (within 5 years) and smoking status (current, former, and never smoker). Therefore, data on age-specific mortality rates stratified by gender, race, and smoking status, as well as data on the proportions of each smoking status observed in the general population, were needed. They were obtained from the following sources.
The Cancer Prevention Study I (CPS-I) was a prospective cohort study conducted by the American Cancer Society, with the recruitment conducted between October 1, 1959 and February 15, 1960. CPS-I enrolled individuals that were over age 30, with at least one family member over the age of 45. Study participants completed a baseline survey at enrollment and follow-up questionnaires in years 1961, 1963, 1965, and 1972 allowing for 12 years of follow-up. The questionnaires asked about risk factors including tobacco use. Follow-up surveys addressed changes in smoking status and vital status. The CPS-I study has documented known death rates for 117,199 males. Mortality tables for 5-year age groups stratified by race, gender, and smoking status for this study are available in Appendix C of Chapter 3 of the Smoking and Tobacco Control Monograph 8 (9). These rates are used in this analysis for fitting the TSCE model for men.
The Nurses’ Health Study (NHS) is a prospective cohort study, started in 1976, of married registered female nurses with ages between 30 and 55 years. The cohort consists of 121,700 female nurses. Every 2 years the participants respond to questionnaires on disease risk factors including smoking. We employed the tabled age-specific lung cancer incidence rates stratified by smoking status for Caucasians as published in Meza et al. 2008 (10). These rates are used in this analysis for fitting the TSCE model for women.
The National Health Interview Survey (NHIS) is a survey study conducted by the US Department of Health and Human Services (11). Each year, the NHIS surveys 35,000 to 40,000 households comprising 75,000 to 100,000 individuals. The survey asks participants about a range of topics including smoking. For this analysis, the NHIS provides data on the proportion of individuals in each smoking category (current, former, and never) in the population in the year 2000, stratified by gender and race.
With the advance of molecular biology, clonal expansion was recognized as an essential stage in carcinogenesis (12–13). Motivated by the idea of clonal expansion, Moolgavkar et al. (14) established a two-stage clonal expansion (TSCE) model depicted in a supplement to this manuscript (Supplemental Figure 1).
The TSCE model assumes that a normal cell (NC) mutates into an initiated cell (IC) in the first transition, according to a Poisson process with intensity ν(t), where t denotes the age. There are X normal cells in the tissue at birth or maturity, depending on the tissue. The initiated cells (IC) divide or die according to a birth-death process with parameters α(t) and β(t) and form a clone of initiated cells. At rate µ(t), a progeny cell may become a malignant cell (MC). This event constitutes the second transition and its time is counted as the time of tumor onset. After a lag time, the MC develops into an observable cancerous tumor.
For parameters that are piecewise constant over time, Heidenreich et al. (15) derived exact formulas for the hazard and survival functions of the TSCE model. Only 4k − 1 of the 4k biological parameters ν, µ, α, and β over k distinct time intervals are identifiable when fitting the model in the piecewise-constant setting. One commonly used way of dealing with this non-identifiability is to set the background mutation rates equal to each other, ν0 = µ0, and to assume a plausible number of normal cells, such as X = 10^7 (10,16–17).
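To make the hazard and survival functions concrete, the following is a minimal sketch of the single-interval special case (all parameters constant over the whole lifetime, zero lag time), using the closed form commonly quoted in the TSCE literature. The function names and the numerical parameter values in the usage lines are illustrative assumptions, not the estimates fitted in this paper.

```python
import math

def tsce_pq(alpha, beta, mu):
    """Auxiliary roots p < 0 < q of the constant-parameter TSCE solution."""
    gamma = alpha - beta - mu  # net proliferation rate of initiated cells
    root = math.sqrt(gamma * gamma + 4.0 * alpha * mu)
    return 0.5 * (-gamma - root), 0.5 * (-gamma + root)

def tsce_survival(t, nu, X, alpha, beta, mu):
    """Probability that no malignant cell has appeared by age t."""
    p, q = tsce_pq(alpha, beta, mu)
    base = (q - p) / (q * math.exp(-p * t) - p * math.exp(-q * t))
    return base ** (nu * X / alpha)

def tsce_hazard(t, nu, X, alpha, beta, mu):
    """Hazard h(t) = -d/dt log S(t) for constant parameters."""
    p, q = tsce_pq(alpha, beta, mu)
    num = p * q * (math.exp(-q * t) - math.exp(-p * t))
    den = q * math.exp(-p * t) - p * math.exp(-q * t)
    return (nu * X / alpha) * num / den

# Illustrative (assumed) values: nu = mu, X = 10^7, rates per year.
params = dict(nu=1e-7, X=1e7, alpha=10.0, beta=9.9, mu=1e-7)
for age in (40, 60, 80):
    print(age, tsce_hazard(age, **params), tsce_survival(age, **params))
```

For small t the hazard behaves like νXµt, which is one quick sanity check on any implementation. The piecewise-constant case used for smoking histories chains such solutions across intervals and is not reproduced here.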
The parameterization relating smoking to the parameters of the TSCE model was chosen so as to include as few fitted parameters as possible (to reduce any potential identifiability issues) while still being able to produce accurate predictions. It is based on smoking intensity measured in packs per day (ppd) and is similar to a parameterization used in another study (10).
Smoking variable expressed as the square root of ppd was chosen over a simple linear relationship because the linear model produced unrealistically high estimates of lung cancer risk for very heavy smokers (more than 6 ppd).
The TSCE model is usually fitted to prospective cohort data using the maximum likelihood paradigm. The cohort likelihood is defined as the product of the individual likelihoods, $L = \prod_j L_j$. For a fixed lag-time of tlag, each Lj depends on the time of entry into the study, sj, the censoring or failure (lung cancer diagnosis or death) time, tj, and the individual’s exposure history (15).
The resampling method proposed here recreates time to event data by merging risk factor data from a case-control study with incidence/mortality rate data. The method seeks to resample the cases and controls from the case-control study in the proportions reflected in the mortality rate data in order to recreate cohort data. The rate data are essential to adjust for the effects of matching in the case-control study as well as provide information about age-related rates of disease. Technically, given the matching stratum of the case-control study and disease status, cases and controls are randomly sampled from the underlying population. Each resampled cohort is referred to as a pseudo-cohort and is fitted by maximizing the cohort likelihood (15) described earlier. More details about how a pseudo-cohort is generated follow later.
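The general idea of resampling cases and controls in proportions dictated by external rate data can be sketched as below. This is a hypothetical illustration only, not the authors' routine: the record layout, the function name `make_pseudo_cohort`, the use of a single per-stratum event probability, and the toy numbers are all assumptions made for the example.

```python
import random

def make_pseudo_cohort(records, stratum_weight, stratum_rate, size, rng):
    """Resample case-control records into a cohort-like sample.

    records        : list of dicts with keys "stratum" and "case" (bool)
    stratum_weight : assumed population share of each matching stratum
    stratum_rate   : assumed probability of the event within each stratum
    """
    # Group the case-control records by (stratum, disease status).
    by_key = {}
    for rec in records:
        by_key.setdefault((rec["stratum"], rec["case"]), []).append(rec)

    strata = list(stratum_weight)
    weights = [stratum_weight[s] for s in strata]
    cohort = []
    for _ in range(size):
        # 1. Draw a stratum in proportion to its share of the population.
        s = rng.choices(strata, weights=weights, k=1)[0]
        # 2. Decide disease status from the external rate for that stratum.
        is_case = rng.random() < stratum_rate[s]
        # 3. Draw, with replacement, a real record of that stratum and status.
        cohort.append(rng.choice(by_key[(s, is_case)]))
    return cohort

# Toy usage with made-up strata and rates.
rng = random.Random(0)
toy = [{"id": i, "stratum": s, "case": c}
       for i, (s, c) in enumerate([(s, c) for s in ("never", "current")
                                   for c in (True, False)] * 5)]
cohort = make_pseudo_cohort(toy, {"never": 0.5, "current": 0.5},
                            {"never": 0.01, "current": 0.2}, 1000, rng)
print(sum(r["case"] for r in cohort), "cases among", len(cohort))
```

Although the toy case-control data are balanced 1:1, the share of cases in the resampled cohort follows the external rates, which is how resampling of this kind removes the effect of matching.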
A simulation study was performed to assess the accuracy of the resampling method in fitting the baseline parameters of the TSCE model. For this simulation study, case-control data and tabled incidence rate data were simulated according to an assumed TSCE model with no risk factor dependencies. The assumed background rates were taken from a TSCE model previously fitted to data from the CPS-II study (16). For simplicity, the lag-time between birth of the first malignant cell and lung cancer onset was assumed zero, which does not impact the model predictions, although it results in adjusted parameter estimates (17-18).
We first generated a population by simulating lung cancer in 1,000,000 individuals using the simulation routine suggested by Kaiser and Heidenreich (19). Tabulated age-specific mortality rates were calculated based on a cohort of 100,000 randomly sampled individuals from the simulated population. A case-control study was generated by sampling 1500 cancer cases from the simulated population and then 1500 matching controls were generated with ages of enrollment within 5 years of the sampled cases. The resampling method was applied to the simulated case-control study along with the tabled age-specific mortality rate data to fit the TSCE model.
From each simulated case-control study, 200 pseudo-cohorts of 20,000 simulated individuals were generated using the resampling routine described later, which is the same method as the one applied to the actual case-control data from MD Anderson Cancer Center. The choice of generating 20,000 individuals was made to ensure inclusion of an adequate number of cases in the resampled pseudo-cohorts for use in maximizing the likelihood. Each pseudo-cohort was fitted by minimizing the negative log likelihood described earlier. Parameter estimates were obtained from each pseudo-cohort fit, and the final parameter estimates were the averages taken over the 200 fitted pseudo-cohorts. The 95% confidence intervals were estimated using the percentiles of the 200 fitted estimates. The simulation study was repeated with varying numbers of generated pseudo-cohorts (50, 100, 150, and 200); 200 pseudo-cohorts were judged sufficient to provide stable parameter estimates and confidence limits in this analysis, since the estimates changed very little from 150 to 200 pseudo-cohorts (details shown in Supplemental Table 1).
The following routine describes how a pseudo-cohort was generated using the MDA case-control data, combined with incidence/mortality rate data from CPS-I and NHS for men and women, respectively. Each individual in the pseudo-cohort is resampled as follows.
As in the simulation study, 20,000 individuals were re-sampled from the case-control dataset for each pseudo-cohort created. Then each pseudo-cohort was fitted to the TSCE model by maximizing the likelihood in the usual way. Two hundred pseudo-cohorts were generated and fitted providing 200 joint estimates of the parameters. The overall fit was obtained as the average estimate over the 200 generated pseudo-cohorts and the 95% CI was estimated using the 2.5% and 97.5% percentiles.
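The pooling step described above (average over pseudo-cohort fits, with 2.5% and 97.5% percentiles as the interval) can be sketched as follows. The helper names, the linear-interpolation percentile rule, and the synthetic "fits" in the usage lines are assumptions for illustration; the paper does not state which percentile convention was used.

```python
import random
import statistics

def percentile(sorted_vals, pct):
    """Linear-interpolation percentile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def pool_estimates(fits):
    """Pool per-pseudo-cohort fits (list of {parameter: estimate} dicts)."""
    pooled = {}
    for name in fits[0]:
        vals = sorted(f[name] for f in fits)
        pooled[name] = {
            "estimate": statistics.fmean(vals),  # average over pseudo-cohorts
            "ci95": (percentile(vals, 2.5), percentile(vals, 97.5)),
        }
    return pooled

# Synthetic stand-in for 200 fitted pseudo-cohorts (values are made up).
rng = random.Random(42)
fits = [{"gamma0": rng.gauss(0.10, 0.01)} for _ in range(200)]
print(pool_estimates(fits))
```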
As a validation study, the fitted model was used to predict the number of lung cancer deaths in the control arm of the non-asbestos exposed (heavy smokers) cohort of the CARET (Carotene and Retinol Efficacy Trial) for comparison against the observed number of lung cancer deaths. This analysis attempts to determine whether the resulting fitted TSCE model can satisfactorily predict lung cancer death outcomes in an independent cohort, i.e. whether the model is generalizable to other cohorts given the unique smoking histories of their participants. CARET was a double-blind, placebo-controlled trial on the effects of beta-carotene and retinol in lung cancer prevention. The heavy-smokers cohort included 7,965 men and 6,289 women aged 50–69 with at least a 20 pack-year smoking history, who were current smokers or had quit within the previous 6 years. The study was stopped early when it was determined that use of the supplement resulted in excess cancer mortality (20). For model validation purposes, data were obtained on the control arm of the heavy-smokers cohort, comprising 6,877 individuals (3,797 males and 3,080 females) (Table 2). Average observed follow-up was 11.5 years and the longest follow-up was 19.5 years (Figure 1). As shown in Figure 1, the number of individuals being observed in the study drops off dramatically starting in follow-up year 11.
Lung cancer mortality was simulated for each individual enrolled in the trial based on their gender, smoking history, d, age at enrollment, t0, and age at the end of follow-up, t1 using the observed data from the CARET cohort as mentioned above. A modified version of the simulation routine described by Kaiser and Heidenreich (19) was implemented to simulate lung cancer mortality in CARET; defining S(t;d), the survival function, as the probability that the event (lung cancer death) has not occurred by time t, i.e. S(t;d)=Pr(T>t | d) according to the fitted TSCE model:
For each individual, a random variable u, uniformly distributed over the interval (0, S(t0;d)), was drawn.
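Drawing u on (0, S(t0;d)) conditions on the individual being alive at enrollment, and a standard way to turn such a draw into an event time is inverse-transform sampling. The sketch below assumes that completion: a death is recorded during follow-up when u is at least S(t1;d), and the age at death solves S(t;d) = u. The Weibull survival function is a stand-in chosen only to keep the example self-contained; it is not the fitted TSCE model, and the function names are hypothetical.

```python
import math
import random

def survival(t, scale=120.0, shape=6.0):
    """Stand-in survival function (Weibull); NOT the fitted TSCE model."""
    return math.exp(-((t / scale) ** shape))

def invert_survival(u, lo, hi, tol=1e-6):
    """Find t in [lo, hi] with survival(t) = u by bisection (S is decreasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if survival(mid) > u:
            lo = mid  # survival still too high, so the event is later
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate_death_age(t0, t1, rng):
    """Return the simulated age at death if it falls in [t0, t1], else None."""
    u = rng.uniform(0.0, survival(t0))  # conditions on being alive at t0
    if u < survival(t1):
        return None  # event would occur after the end of follow-up
    return invert_survival(u, t0, t1)

rng = random.Random(1)
ages = [simulate_death_age(60.0, 75.0, rng) for _ in range(10)]
print(ages)
```

Under this scheme the probability of a death during follow-up equals 1 − S(t1;d)/S(t0;d), which gives a direct check on the simulated frequencies.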
Each individual of the CARET cohort was simulated to generate a simulated trial, from which the cumulative and yearly number of lung cancer deaths per follow-up year were calculated. Five thousand simulated CARET trials were generated to produce expected lung cancer deaths and 95% confidence intervals. Since healthy volunteer bias (a deficit of mortality in a cohort compared to the general population, due to enrollment criteria and self-selection) is likely to be present in CARET, the first 3 years were removed before comparing the observed and predicted lung cancer deaths. Figure 2 shows that the observed number of lung cancer deaths was increasing in years 1 through 3 before leveling off in year 4, while the number of person-years observed (Figure 1) was slowly declining indicating that the healthy volunteer effect lasted through the first 3 years.
Results of the simulation study on the resampling method, including a table of fitted parameters (Suppl. Table 1) and a figure of resulting incidence rate predictions (Suppl. Figure 2), are presented in the supplement. As shown, the method results in reasonable incidence rate predictions. The number of pseudo-cohorts (50, 100, 150, and 200) was increased until determined adequate to obtain stable parameter estimates and confidence intervals at 200, as seen in Suppl. Table 1.
As discussed in the Methods section, the resampling method was applied to MD Anderson case-control data on smoking histories combined with tabled incidence/mortality rate data from CPS-I and NHS for males and females respectively. Two hundred pseudo-cohorts were generated and fitted using cohort maximum likelihood. Since the predictions of the TSCE model are insensitive to the choice of lag-time distribution, because the parameters shift in response to changes in lag-time assumptions (17–18), a fixed lag-time of 6 years between the appearance of the first malignant cell and death from lung cancer was assumed which is consistent with disease progression models (21). Table 3 contains the parameter estimates and 95% confidence limits for the final fitted model. Although males and females were fitted separately, none of the differences in parameter estimates between the genders were statistically significant. The resulting parameters indicate that smoking is involved in both first and second transitions and, to a lesser extent, proliferation of pre-malignant clones.
As described in the Methods section, the final fitted model with estimated parameters (Table 3) was used to simulate lung cancer mortality for each individual enrolled in the CARET cohort in order to predict the number of lung cancer deaths observed during the trial. The model was used to simulate lung cancer mortality over the course of the trial 5000 times, and the first 3 years of follow-up were ignored to remove any possible healthy volunteer effect. A comparison of observed and predicted lung cancer deaths in CARET suggests such an effect: observed lung cancer deaths rose while person-years declined, indicating that the healthy volunteer effect lasted through the first 3 years of follow-up in males and the first 2 years in females. When the first 3 years of follow-up were removed to adjust for this effect, the model accurately predicted lung cancer deaths over the remainder of the study. There were 329 observed lung cancer deaths in follow-up years 4–20 of CARET, while the model predicted 323.9 (95% CI: 291, 359). The gender-specific predicted numbers were 206.2 (95% CI: 180, 233) for men and 117.7 (95% CI: 98, 139) for women, while the observed numbers were 209 and 120, respectively (Table 4). The mean predicted yearly and cumulative lung cancer deaths, along with confidence limits for males and females, are depicted in Figure 2.
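As a quick consistency check on the reported figures, the gender-specific counts can be summed and compared with the totals and intervals quoted above. The numbers are taken directly from the text; the variable names are only for illustration.

```python
# Predicted and observed lung cancer deaths, follow-up years 4-20 (from the text).
predicted = {"men": 206.2, "women": 117.7}
observed = {"men": 209, "women": 120}
ci95 = {"men": (180, 233), "women": (98, 139), "total": (291, 359)}

total_predicted = sum(predicted.values())
total_observed = sum(observed.values())
ci_low, ci_high = ci95["total"]

print("total predicted:", round(total_predicted, 1))
print("total observed: ", total_observed)
print("observed/predicted ratio:", total_observed / total_predicted)
for group in predicted:
    low, high = ci95[group]
    print(group, "observed within 95% CI:", low <= observed[group] <= high)
print("total observed within 95% CI:", ci_low <= total_observed <= ci_high)
```

The gender-specific predictions add up to the reported total of 323.9, and each observed count lies inside its corresponding interval.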
In this paper, we introduce a resampling-based method of merging case-control data on risk factors and tabled incidence rate data to reconstruct time to event data. This method was used to fit a model that allows for prediction of lung cancer risk over an individual’s lifetime based on gender and smoking history. It is generally preferable to use prospective cohort data to fit time to event models. However, in this study they were unavailable. Also, a further complication of this study is that the case-control study design included partial matching on the risk factor of interest, namely smoking status. By using external incidence/mortality rate data, the method was able to adjust for the matching present in the case-control study.
Applying the resampling method to MD Anderson case-control data on smoking histories and tabled lung cancer incidence/mortality rate data, we were able to fit a TSCE model based on smoking history. The fitted model was validated against the control arm of CARET, where it accurately predicted the number of lung cancer deaths observed during the trial after adjusting for the healthy volunteer effect. The adjustment consists of removing the first few years of follow-up, as was done in Bach et al. (22).
The simulation study showed that the proposed resampling method resulted in satisfactory fits. Fitting 200 pseudo-cohorts was found to provide stable estimates in this study; however, this number could differ in other applications of the method. Likewise, 20,000 individuals were resampled per pseudo-cohort in order to obtain reasonable numbers of cases in each pseudo-cohort, as lung cancer incidence is low for certain smoking categories such as never smokers or very long-term quitters. The sensitivity of predictive accuracy to these choices requires further study.
Previously, Heidenreich et al. (23) and Deng et al. (18) developed approaches to fit the TSCE model using case-control data, both using additional mortality data to allow estimation of the age dependence of the hazard function. In particular, Heidenreich et al. (23) introduced a direct case-control likelihood approach for fitting the TSCE model. The proposed likelihood was designed to fit the TSCE model to case-control studies with larger sample sizes than the case-control study used in this analysis. As a result, when applied to our data, the Heidenreich et al. approach resulted in a flat likelihood function. In summary, our approach is the resampling equivalent of the Heidenreich et al. methodology: resampling essentially creates the weights that Heidenreich et al. used in the likelihood function, and it bypasses the problem of a flat likelihood function. Deng et al. (18) incorporated a least squares estimation approach utilizing a complicated objective function. The resampling approach provides a straightforward alternative. Further, the resampling method can be used to reconstruct time to event data for use in applications other than model fitting.
The parameterization used in this paper differs from the earlier model developed by Deng et al. (18) in a few ways. First, the Deng et al. study included the effect of DNA repair capacity and resulted in 9 fitted parameters. In this study, the model was fitted with the intention of using as few parameters as possible while maintaining satisfactory prediction quality; as a result, only 5 parameters were used. This leaves room for additional risk factors to be included in a later version. The models also differ in how they handle non-identifiability and parameterization. For this study the two background mutation rates were assumed equal (ν0 = μ0). Also, the model presented here includes the net proliferation rate γ instead of the death rate of the ICs, β, which was used in Deng et al.
In conclusion, the proposed resampling method provides an opportunity to fit time to event models to case-control data and to evaluate the effects of risk factors, including factors other than smoking, on different stages of carcinogenesis. The method presented here can accurately predict the risk of lung cancer death based on individual level data on age, gender, and smoking history.
MF was supported, in part, by a cancer prevention fellowship supported by the National Cancer Institute training grant R25T CA57730, Robert M. Chamberlain, Ph.D., Principal Investigator, to the University of Texas MD Anderson Cancer Center. MF, MK, and OYG were supported, in part, by the NCI CISNET grant U01CA097431.
This paper seeks to introduce and validate a model of lung cancer risk prediction for individuals based on gender and smoking history. The model was developed using a novel approach that reconstructs time to event (cohort) data for use in model fitting, using the combination of risk factor data collected in a case-control study and tabled disease-specific incidence/mortality rate data from a cohort study. The validation study indicates that the fitted model is accurate in predicting lung cancer risk.