|Home | About | Journals | Submit | Contact Us | Français|
Calibrating a disease simulation model’s outputs to existing clinical data is vital to generate confidence in the model’s predictive ability. Calibration involves two challenges: 1) defining a total goodness-of-fit score for multiple targets if simultaneous fitting is required; and 2) searching for the optimal parameter set that minimizes the total goodness-of-fit score (i.e., yields the best fit). To address these two prominent challenges, we have applied an engineering approach to calibrate a microsimulation model, the Lung Cancer Policy Model (LCPM).
First, eleven targets derived from clinical and epidemiological data were combined into a total goodness-of-fit score by a weighted-sum approach, accounting for the user-defined relative importance of the calibration targets. Second, two automated parameter search algorithms, Simulated Annealing (SA) and Genetic Algorithm (GA), were independently applied to a simultaneous search of 28 natural history parameters to minimize the total goodness-of-fit score. Algorithm performance metrics were defined for speed and model fit.
Both search algorithms obtained total goodness-of-fit scores below 95 within 1,000 search iterations. Our results show that SA outperformed GA in locating a lower goodness-of-fit. After calibrating our LCPM, the predicted natural history of lung cancer was consistent with other mathematical models of lung cancer development.
An engineering-based calibration method was able to simultaneously fit LCPM output to multiple calibration targets, with the benefits of fast computational speed and reduced need for human input and its potential bias.
Simulation modeling and clinical trials offer different and complementary methods of exploring the relationship between a health care intervention and health outcomes. Clinical trials compare two or more clinical interventions and are able to describe the comparative effectiveness of the different options. However, most clinical trials have a relatively short time horizon (2 - 7 years); by extrapolating from the short term trial data, modeling can estimate the longer term consequences (both positive and negative) of the intervention. In addition, clinical trials are very expensive and so rarely evaluate all the clinically relevant options; modeling can combine the results of several clinical trials, adjust for trial population differences, and estimate the incremental differences between options not compared in any one trial.
In some instances, simulation models incorporate unobservable natural history parameters to model disease development in patients. Often, estimates of logical relationships between observable parameters can be established using meta-analysis or evidence synthesis, but the values of unobservable natural history parameters must be obtained through the process of model calibration. Calibration is a process of varying the unobservable parameters until model outputs closely match existing clinical and epidemiological data.1 After the relevant epidemiological data have been selected as calibration targets, calibrating the model to those targets consists of two parts: (1) defining how to simultaneously measure the level of discrepancy between model output and multiple calibration targets; and (2) searching for the parameter set which results in an overall minimization of that discrepancy.
Despite the importance of model calibration, there is no standard practice in the disease modeling literature for measuring the level of discrepancy between the model output and the calibration targets. Nor are there standards for how to search the parameter space such that the best parameter set is identified.
Methods of defining the level of discrepancy vary widely. Some researchers visually match the model output to the clinical data. This methodology is particularly problematic because it cannot be independently replicated: a different researcher using the same model may have selected a different parameter set.
Methods of searching the parameter space also vary widely. Some researchers manually vary input parameters in their models, while others use simple parameter search algorithms, such as grid search2, 3 and random search.4 Grid search divides each parameter value between the maximum and minimum values into regular grids. The results of all possible parameter set combinations are then compared to the calibration targets. The random search algorithm randomly picks input parameter sets within the allowable parameter space. Based on initial results, the allowable parameter region can be modified or narrowed, and searched again more thoroughly. Both methods attempt to locate the optimal parameter values by exhaustively exploring the whole parameter space. These methods will find the optimal parameter set, but are only practical for simulation models with only a few parameters.
For a disease microsimulation model with many natural history parameters, grid and random search methods cannot sample the parameter space efficiently. Using a model with 20 unobservable parameters as an example, testing only 10 values of each unknown parameter with grid or random searches will require 1020 parameter sets. In practice, researchers would not test all 1020 sets; typically, results from searching one area of parameter space are visually inspected and interpreted before selecting another area of parameter space to search. However, even searching a portion of those possibilities would result in a time consuming effort.
We sought to adapt calibration methodologies from the engineering literature that would enable an automated, computationally feasible, and time-efficient parameter search for comprehensive microsimulation models. Two fast parameter search algorithms commonly used in engineering simulation were identified as potentially meeting all of these criteria: Simulated Annealing (SA)5, 6 and Genetic Algorithm (GA).7, 8 Both parameter search methods were applied to the Lung Cancer Policy Model (LCPM),9, 10 a microsimulation model of lung cancer development, progression, detection, treatment, and survival. In this paper, we have provided examples of using these two fundamentally different optimization algorithms to perform the parameter search and compared their performance using relevant constraints and scenarios comparable to the process of initially calibrating a model, repeating the calibration of an individual component or refining the calibration.
The LCPM is as a comprehensive model of lung cancer, designed to evaluate screening10, 11 and other lung cancer control interventions. A detailed description of all components of the model is available online at the National Cancer Institute’s Cancer Intervention and Surveillance Modeling Network (CISNET) Web site (http://cisnet.cancer.gov/profiles/), but details relevant to the current study are provided here. The LCPM simulates cohorts of individuals with gender, race, and birth cohort specific smoking histories as observed for the U.S. population.12, 13 In each monthly cycle, an individual may develop a new cancer, or have an existing cancer grow, or develop symptoms. Lung cancers and benign pulmonary nodules can be detected through incidental imaging or scheduled screening. Cancers can also be diagnosed by an evaluation of a patient’s symptoms. Patients with suspected lung cancer receive diagnostic and staging tests, and may receive surgical or non-surgical treatment. Existing cancers may or may not be detected before an individual dies of their lung cancer or from another cause.1 Smoking exposure is updated monthly. Model outputs include estimates of age-specific lung cancer incidence rates and distributions of lung cancer cell types and stage at diagnosis, as well as survival by stage at diagnosis.
To reflect known heterogeneity of lung cancer and allow the evaluation of a variety of interventions, the LCPM has a ‘deep’ underlying natural history model, with cell-type specific parameters for growth, invasiveness, and relationship with smoking (among others). Simulating the development and progression of 4 cell types of lung cancer with such detail required calibrating the model to estimate values of 87 unobservable natural history parameters. In this study, we applied the optimization algorithms to a subset of 28 parameters that describe the development of new lung cancers (ie, coefficients in the logistic equations for each cell type). The remaining 59 parameters related to the growth, progression, and symptom detection of existing cancers were not varied and their values were estimated in the previous calibration exercises.
Each simulated person may develop up to 3 cancers of any of 4 lung cancer cell types (adenocarcinoma with or without bronchioloalveolar carcinoma, large cell, squamous cell, and small cell), which comprise over 90% of lung cancer. The monthly probability of developing the first malignant cell is calculated using an independent logistic equation for each cancer type. Each logistic function has a type-specific intercept, type-specific coefficients for age, age2, years of cigarette exposure (smoke-years, SY), an interaction term between SY and age2, the average number of cigarettes smoked per day (cigarettes per day, CPD), and the years since quitting (YSQ) smoking.
Prior to beginning the calibration procedure, we eliminated implausible parameter ranges and assumed correlations between some parameters on the basis of known relationships between the incidence of lung cancer by cell type and smoking history. For example, the baseline risks dictated by the type-specific intercepts were ordered to reflect the distribution of cell types among non-smokers.14-17 Lung cancer risk is also known to increase with age and smoking experience (SY) and lung cancer risk is known to decrease as the years since quitting (YSQ) increases.18-21 The risk of small cell cancer is the most dramatically affected by smoking experience, and the effect of smoking cessation has the weakest effects for developing adenocarcinoma.16, 22 Thus, the type-specific coefficient for SY is required to have the largest magnitude for small cell cancer and the type-specific coefficient for YSQ is necessary to have the smallest magnitude for adenocarcinoma.
To account for the changes in unmeasured risk factors (in addition to the change in smoking pattern) experienced by different birth cohorts, we incorporated one gender specific birth-cohort coefficient, βBY, into the monthly probability of lung cancer development.
With the publication of new biological and medical findings, new calibration targets are constantly incorporated into our LCPM. The version of the LCPM used in this study had 11 calibration targets (9 primary and 2 secondary targets), derived from various data sources. The primary calibration targets were cancer incidence by cell type, stage-specific survival, and stage distribution at diagnosis. Primary targets were extracted from data from the NCI’s Surveillance, Epidemiology, and End Results (SEER) Program.23 The secondary calibration targets were age-specific mortality rates of non-smokers and lung cancer-specific mortality ratios for current (vs. never) smokers, derived from past cohort studies 24 and other literature sources describing clinical experience.15, 17, 25, 26 All calibration targets in this study were derived from only publicly-available de-identified human subject data.
The discrepancy between the simulation model output and each calibration target was measured by the goodness-of-fit (GOF) statistic. We calculated the measure of discrepancy between each LCPM output to the corresponding calibration target (i), GOFi, using a sum-of-squared error GOF statistic (analogous to a χ2 statistic).
Calibrating the LCPM to multiple targets simultaneously required a definition of a global GOF statistic. We used the weighted-sum approach27, 28 in which weighting factors were assigned to all targets in advance of the calibration procedure to reflect the relative importance of the targets. Thus, the summary GOFsum statistic is a linear combination of the individual statistics,
where Wi is the weighting factor for the i th target. For the LCPM, primary calibration targets were given a weight of 1.0 and secondary calibration targets were given a weight of 0.5 to avoid overfitting to targets with possible measurement issues or dissimilar populations. Two of the secondary targets are functions of rare events (lung cancer in never smokers), so the GOF values for these two targets are large. During the initial exploratory runs prior to model calibration, the values of the weighting factors were tested to prevent them from dominating the GOFsum during the calibration. Their units are in the inverse of the corresponding GOFi, resulting in a unitless GOFsum.
To model the US population, individual level characteristics such as age at starting smoking, cigarettes smoked per day, and age of smoking cessation, were generated by ‘smoking history generator’ provided by NCI’s Cancer Intervention and Surveillance Modeling Network (CISNET) (C.M. Anderson, personal communication). Calibration of the model for multiple populations was performed in a sequential manner, in which we initially calibrated all natural history parameters except βBY (set at 1.0) to targets corresponding to the white male cohort born in 1930 to establish the reference values of the natural history parameters. We then estimated βBY for the remaining cohorts by calibrating the simulation model to total lung cancer incidence corresponding to white males and females born between 1920 and 1970.
A flow chart of the SA methodology is shown in Figure 1a. SA is analogous to the thermodynamics of freezing water or the crystallization of metal.5, 6 The number of defects in metal can be reduced by controlling the heating and cooling schedule during manufacturing. Heating the metal causes atoms to move from their initial positions and wander randomly (a high “free energy” state). Slowly cooling the metal to a lower free energy allows the atoms to find the most favorable configuration. The concept of SA as applied to parameter searching involves the introduction of an artificial temperature. At the initial high temperature, the search algorithm is allowed to widely explore the parameter space by accepting the parameter values with higher probabilities. By conceptualizing the model’s goodness-of-fit as a surface with peaks (poor fitting parameter sets) and valleys (better fitting parameter sets), it is apparent that bigger “jumps” avoid the algorithm falling into a local minimal goodness-of-fit. Slowly lowering the temperature allows the search to locate the parameter set with the lowest goodness-of-fit statistic. To test that the algorithm identified the unique lowest GOF statistic (the global minimum), we initiated the SA algorithm in 10 random locations in the parameter space, with each run limited to 1,000 iterations or the returned goodness-of-fit value below a predefined stopping value, GOFstop. See section on Comparison of Search Algorithms for the definition of GOFstop.
A flow chart of the GA methodology is shown in Figure 1b. The genetic algorithm is an example of evolutionary algorithms where the parameter search method is based on the principle of “survival of the fittest”.7, 8 The initial generation consists of a population of candidate parameter sets (analogous to chromosomes). After running the simulation model with each parameter set in the original population, the goodness-of-fit score of each individual parameter set is stored. The probability of a parameter set being selected for “reproduction” is proportional to the difference between its goodness-of-fit value and the largest goodness-of-fit value among all tested parameter sets. The parameter set with the largest goodness-of-fit value has a zero probability of reproduction and is eliminated in the reproduction process. Using a one-point crossover method,7 the encoded natural history parameters of two parameter sets are combined to produce a new parameter set, filling the next generation. Each newly generated parameter set is subject to random changes of individual parameters, with new values chosen within 10% of the values in either of the parent parameter sets (analogous to genetic mutation). In this study, we permitted the algorithm to repeat until either a total of 1,000 parameter sets had been tested (25 generations of 40 chromosomes) or a parameter set was identified yielding a goodness-of-fit below the same predefined stopping value (GOFstop) in SA runs.
We measured and compared the accuracy and speed of SA and GA parameter search algorithms for calibration of the LCPM output to the clinically observed data using a cohort of white males born in 1930. All simulation runs were performed on a LINUX High-Performance Computing (HPC) Beowulf cluster. Since the search algorithms incorporate random noise to assist the parameter exploration process, the returned results from the search algorithms are stochastic and statistical tests are required to distinguish their performances.
To compare the ability of each algorithm to find the optimal parameter set efficiently, we determined the ability of each algorithm to find the lowest GOFsum within a specified number of iterations and the amount of time required for each algorithm to reach a specific GOFsum target. In the accuracy comparison, each algorithm was allowed to search through the allowable parameter space for a total of 1,000 iterations; each iteration simulated a total of 500,000 hypothetical males. We performed 10 repeats using each search algorithm: 10 randomly generated starting parameter sets for SA and 10 randomly generated populations for GA. The mean minimal GOFsum value, , from the 10 repeats was used for comparisons.
In the speed comparison, both search algorithms were allowed to search through the parameter space until they returned GOFsum below the predefined stopping value, GOFstop. We recorded the GOFstop as the largest in the accuracy comparison. Each search algorithm was repeated 10 times in the speed comparison test. We distributed the simulation runs to the HPC Beowulf cluster using one processor per run. To simulate 500,000 individual life-histories, the LCPM would require 24 minutes processing time on a dual processor Intel® Pentium® 3.4GHz computer. Because the computational time of the simulation runs is dominated by the calculations of the life histories, SA and GA require virtually the same amount of real computational time per iteration on the same processor. The difference in CPU time between the two algorithms to complete 1000 iterations (400 hours) is less than 10 minutes. Within the cluster, there are seven types of computing nodes with different processor speeds. Hence, the real computational time per iteration depends on the hardware. Instead of reporting the real computational time, we reported the number of iterations required to reach GOFstop.
In addition to comparing the average values, we also calculated the 95% confidence intervals using the following equation:29
where n, and σ are the number of parameter search runs, the mean and the standard deviation of the iteration to find the stopping GOFsum, respectively. The subscripts 1 and 2 refer to SA and GA, respectively. From the z-table for the standard normal distribution, zα/2 =1.96. When lower and upper limits (terms on the left and right hand sides of μ1 - μ2) of the confidence interval are both negative, SA is faster than GA with 95% confidence. A similar equation was used for the accuracy comparison of the minimal .
We also examined the abilities of the two parameter search algorithms to obtain optima parameter sets in a smaller parameter space, a situation often encountered when researcher are recalibrating a modified component of a simulation model. After having established th location of the optimal parameter set, we repeated the accuracy and speed comparisons whe only two natural history parameters were to be adjusted, specifically the type-specific intercept (β0) of the logistic equations for developing adenocarcinoma and large cell lung cancer. The results are compared with the calibration results from the search with 28 parameters.
The main objective of model calibration is to obtain values for the unobserved natura history parameters and the unobserved biological process which can not be experimentall determined. After model calibration, we examined two model predictions of the natural histor of lung cancer development: the lung cancer risk of current smokers, and the temporal trends of lung cancer risk for different birth cohorts.
A typical trace of GOFsum during a SA minimization is shown in Fig. 2. At the beginning of the SA run (high temperature), the search algorithm is allowed to explore all possible parameter values, resulting in large fluctuations in GOFsum. At the intermediate temperature, SA avoids being trapped in the local minimum by hopping over barriers, as indicated by the arrows in Fig. 2. At low temperature, the search algorithm eventually settled on one minimum.
Genetic algorithm evolves the parameter search through selection and mutation. Figure 3 shows the evolution of GOFsum as a function of generation number using GA. In the first generation, the randomly generated parameter sets yield a wide distribution of the GOFsum values. During each reproduction cycle, the selection process forces the distribution of the GOFsum values to converge. At the same time, the random mutations that occur in each new generation of parameter sets maintain the heterogeneity in distribution of GOFsum as shown in Fig. 3.
We first examined the ability of both algorithms to locate the minimal GOFsum when 28 natural history parameters are allowed to vary simultaneously. This comparison mimics the situation where researchers are performing the initial model calibration. Within 1,000 iterations, SA was able to achieve a mean minimal GOFsum, , of 86.5 (range from 80.7 to 90.6) averaging over 10 independent parameter searches as shown in Table. 1. Within 1,000 simulation runs, GA was able to achieve of 91.3 (85.2 to 94.1). The 95% confidence interval is: -7.56 < μ1 - μ2 < -2.03. The upper and lower limits of the confidence interval are both negative, indicating that SA is more efficient than GA in locating the lowest GOFsum. The stopping GOFsum value for the speed comparison was selected as 95 based on the rounded largest value obtained from GA, reflecting a typical scenario in which a parameter set must be chosen within a budgeted computational time. For the speed comparison, choosing the high stopping value also prevents the complication of some runs not achieving the stopping GOFsum. We measured the number of iterations required to reach a GOFsum < 95 where all 28 natural history parameters were available to be adjusted. This comparison mimics the case was where the researchers are recalibrating the model and an estimated value of previously established. In this scenario, SA required an average of 202 iterations to reach a GOFsum of 95. Under the same scenario, GA required an average of 294 iterations to reach a GOFsum of 95. The values for the means and the standard deviations are shown in Table 2. The 95% confidence interval is -216.7 < μ1 - μ2 < 32.9. Since the two limits have opposite signs, no significant difference in speed was found between the two algorithms.
In addition to the 28 parameter searches, we have also examined the abilities of the two values are algorithms to search through only two-parameter space. In Fig. 4, the values are plotted for both 2 and 28 parameter searches. Except the result for GA in the 28 parameter search, all of the values are below 90. The average search times are plotted in Fig. 5. Going from 2 to 28 parameters increased the search time by 230% and 280% for SA and GA, respectively.
The model outputs corresponding to the parameter sets with the lowest for both algorithms (80.7 for SA and 85.2 for GA) are shown in Figs. 6 for selected calibration targets: overall incidence, distribution of cell type, and stage distribution of incident cancers. Both SA and GA obtained similar good fits for each of the calibration targets. Model outputs versus the remaining targets are available online (www.cisnet.cancer.gov/profiles).
After model calibration, we investigated two model predictions for the natural history of lung cancer: the monthly probability of lung cancer development as a function of age and cigarettes smoked per day, and the temporal trends of lung cancer risk in the US population.
Since both SA and GA obtained similar good fits, we only show results using the best parameter set from SA to examine our model prediction for lung cancer development. Figure 7 shows the monthly probabilities of cancer development as a function of both age and cigarettes per day. The best natural history parameter set shows a peak in the monthly probability of developing the first malignant lung cancer cell near age 76, followed by a decrease in the monthly probability of developing lung cancer at older ages. This decreasing trend has been predicted by other published theories: increased tumor suppressor protection with age,30 and the loss of proliferative ability in senescent cells.31
Figure 8 shows the comparison between model outputs and SEER incidence rates for white males born between 1975 and 2000. Figure 9 shows the effect of birth year on lung cancer risk, stratified by gender. For both genders, the trends level off after the 1950’s. The temporal trend for females shows a peak around 1930. Published mathematical models obtained similar results.32-36
Motivated by the lack of a standard procedure to calibrate disease simulation models in the literature, we have adapted an engineering approach to calibrate the LCPM. This approach can simultaneously shorten the time spent on model calibration and minimize human-induced bias. With the ongoing release of new medical information, the LCPM is constantly being modified. This automated calibration approach is able to reduce the burden of model calibration and allow researchers to focus on model development and data analysis.
Our approach is a general procedure which can be applied to other microsimulation models. We suggest a weighted sum approach for combining multiple targets. When we varied 28 natural history parameters, our results indicate that SA is superior in locating the lowest GOFsum. When a stopping rule is imposed by defining a specific acceptable level of GOFsum, the average search time of SA is faster than the GA, but not significantly. For two natural history parameters, both algorithms perform equally well in model calibration. SA and GA are both faster than grid search because in grid search the computational time increases as a power function of the number of parameters as described in the introduction. This makes both SA and GA extremely attractive for calibrating disease models with a large number of parameters. In view of the fact that SA is better than or as good as GA regardless the calibration scenario. SA would be a better choice for researchers considering advanced search algorithm for automated calibration.
Our model prediction of the monthly probability of developing adenocarcinoma showed a downturn in lung cancer risk at old age. This prediction was not assumed in advance; it was purely a result of calibrating our LCPM but is in agreement with recent biological theories based on observed epidemiological data.30, 31 These theories still need to be verified by future clinical studies. Our approach of using logistic equations for lung cancer development is flexible enough to capture the observed epidemiological trend of lung cancer incidence at older age.
Our results indicate that the lung cancer risks (all cell types combined) of white female birth cohorts decreased from 1920 to 1945 and then leveled off. For white females, the temporal risk shows an initial increase, peaks around birth year 1930, decreases between 1930 to1950, and eventually levels off. Since population trends in smoking patterns were already incorporated by using the CISNET smoking history generator, additional effects such as increasing use of cigarette filters and improvement of the overall population health may have contributed to the decrease in the lung cancer risk for the younger birth-cohorts. Published age-cohort and age-period-cohort analyses that fitted lung cancer incidence with mathematical equations directly obtained similar results.32-36 In Fig. 8, the LCPM over-predicted the lung cancer incidence for the youngest cohort, indicating that the lung cancer natural history of the youngest cohort is substantially different from our reference cohort of the white male born in 1930. Our method of using a single birth-cohort coefficient, βBY, may not adequately capture the dramatically changes in population trends such as increasing use of cigarette filters and its effects on different lung cancer subtypes. In the future, we will expand βBY to 4 cell types. The observed lung cancer incidences of different cell types will be used as calibration targets to improve our population calibration.
Manual parameter tuning, grid and random searches are inadequate to calibrate simulation models with large numbers of natural history parameters. The combination of weighted-sum method and automated parameter search algorithms offers an attractive solution for disease modelers. Our results demonstrated that this engineering approach is capable of calibrating a disease model with a large number of natural history parameters. The advantages of this novel approach are 1) estimating the natural history parameters within an acceptable time and 2) reducing human bias in the parameter search. Adapting this engineering approach also facilitates transparent communication of model calibration procedures by specifying the search algorithm.
This comparison has several limitations. First, we have not formally examined the effects of varying the ranges of allowable parameter values on the search time and accuracy of search algorithms. The best GOFsum values are obtained within the estimated allowable parameter ranges. There are potentially lower GOFsum values outside of the allowable parameter ranges. Second, epidemiologic studies have shown changing patterns of lung cancer incidence that are both gender and histological type-specific.32, 37-40 However, such effects can not be captured by the current gender specific birth-cohort coefficient, βBY. Future expansion of the model will incorporate gender-specific birth-cohort coefficients for each histological lung cancer type.
Due to the limited time and computing resources, our approach to model calibration only explores one method of handling multiple calibration targets and two automated parameter search algorithms. In the engineering literature, there exist many excellent review articles on other techniques of optimization.7, 27, 41 Other disease modelers are encouraged to consider using this literature to further develop calibration procedures in disease natural history modeling.
C.M. Anderson and CISNET Lung investigators. Financial support from the National Cancer Institute (R01 CA97337 Gazelle) and CISNET.
None of the authors has any conflicts of interest.