This paper describes the initial iterative process we used to assess model performance and to gain insight into the generalizability of analyses that rely on data derived from particular study cohorts.
When natural history input parameters were derived from the MACS all-male cohort, the model underestimated survival in the WIHS all-female cohort for individuals with initial CD4 cell counts <350/µl, particularly as follow-up time increased. Using data from the WIHS, coupled with moderate changes in mortality for those with a history of OI for the two highest CD4 strata, the re-parameterized model closely approximated the empiric data, demonstrating good internal consistency. While the differences between model survival estimates using MACS- versus WIHS-derived parameter values could theoretically reflect gender differences in natural history, prior data suggest that cohort differences distinct from gender, such as underlying differences in general health status and co-morbidities, are more likely to explain differences in estimates [2]–[4], [31]–[35], [38]–[40].
Comparison of model-estimated survival of women on HAART with empiric WIHS survival data showed that the model overestimated short-term survival. Adjustment of influential treatment assumptions (e.g., ‘clinical effectiveness’, the ART effect and the CD4 gain on treatment) individually across all lines of HAART did not produce a good fit to either 12- or 24-month survival. In contrast, scenarios that reduced the ‘clinical effectiveness’ of earlier treatment regimens and increased that of later regimens (e.g., 3rd and 4th line HAART) more closely approximated the empiric published data. Further, multi-way sensitivity analyses that simultaneously varied these assumptions allowed less extreme (and more plausible) changes in individual variables while providing better visual fits to the published data.
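The multi-way sensitivity analysis described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy cohort model, the function names `simulate_survival` and `multiway_search`, the mortality and failure values, the multiplier ranges, and the survival targets are hypothetical and are not taken from the model or the WIHS data. The sketch also draws multipliers by uniform random sampling, which may differ from the approach actually used in the analysis.

```python
import random


def simulate_survival(fail_rates, mort_on=0.002, mort_off=0.02, months=24):
    """Toy deterministic cohort model (hypothetical, not the paper's model).

    fail_rates[k] is the monthly probability of failing treatment line k.
    Returns the fraction of the cohort alive at the end of each month.
    """
    n_lines = len(fail_rates)
    on_line = [1.0] + [0.0] * (n_lines - 1)  # everyone starts on 1st line
    off_therapy = 0.0
    alive = []
    for _ in range(months):
        # Monthly mortality, lower while on any line of therapy.
        on_line = [x * (1.0 - mort_on) for x in on_line]
        off_therapy *= 1.0 - mort_off
        # Treatment failure moves patients to the next line (or off therapy).
        moved = [x * f for x, f in zip(on_line, fail_rates)]
        on_line = [x - m for x, m in zip(on_line, moved)]
        for k in range(n_lines - 1):
            on_line[k + 1] += moved[k]
        off_therapy += moved[-1]
        alive.append(sum(on_line) + off_therapy)
    return alive


def multiway_search(base_fail, targets, n_draws=2000, keep=50, seed=0):
    """Vary early- and late-line failure multipliers simultaneously.

    targets maps a month (e.g. 12, 24) to an observed survival fraction.
    Returns the `keep` best (squared_error, early_mult, late_mult) tuples.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_draws):
        early = rng.uniform(1.0, 4.0)   # hypothetical range, lines 1-2
        late = rng.uniform(0.25, 1.0)   # hypothetical range, lines 3-4
        fail = [min(1.0, f * (early if k < 2 else late))
                for k, f in enumerate(base_fail)]
        alive = simulate_survival(fail)
        error = sum((alive[m - 1] - s) ** 2 for m, s in targets.items())
        results.append((error, early, late))
    results.sort()
    return results[:keep]


if __name__ == "__main__":
    # Hypothetical baseline failure rates and survival targets.
    best = multiway_search([0.02, 0.02, 0.03, 0.03], {12: 0.93, 24: 0.85})
    for error, early, late in best[:5]:
        print(f"error={error:.6f} early x{early:.2f} late x{late:.2f}")
```

Ranking joint draws by their error against both the 12- and 24-month targets is one simple way to see whether moderate simultaneous changes fit better than an extreme change in a single assumption.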
Examination of the parameter sets that fit the empiric data well revealed several observations. First, for both CD4 count strata, good fits to the data required that the ‘clinical effectiveness’ of 1st and 2nd line HAART be reduced such that the “implied failure rates” were 2.0- to 3.5-fold higher. Importantly, as described above, we considered ‘clinical effectiveness’ as a proxy for the net impact of regimen efficacy, tolerance without major toxicity, adherence, and personal choice to remain on treatment. Accordingly, the “implied failure rate” associated with the model calibrated to the WIHS cohort serves as a proxy for virologic failure, toxicity or side effects leading to a change in regimen, and discontinuation of HAART for undocumented reasons. In contrast, for both CD4 count strata, best fits to the data were obtained with a 40% to 60% increase in the effectiveness of 3rd and 4th line HAART, with analogously lower failure/discontinuation rates.
The more than 50% reduction in ‘clinical effectiveness’ that characterized the best-fitting parameter sets is inconsistent with the higher treatment efficacy documented in more recent studies [46]–[48], [52]–[55], [59]–[63]; however, the data used in this exercise were based on a specific cohort from 1998 and 2002 and would not be expected to reflect more recent care patterns and improved outcomes. Further, while we used intention-to-treat efficacy data from clinical trials for our initial parameterization, the proportion of participants who choose to change regimens or stop therapy in clinical trials may be lower than in cohort studies such as this one [42], [46]–[49].
Because newer data show better-tolerated regimens and higher treatment efficacy, the need for such high failure rates in initial regimens to calibrate the model prompted us to consider the particularities of this specific cohort, their clinical histories and past ART experience, as well as their behaviors, including adherence, discontinuation of HAART, and choices about continued treatment following HAART toxicity. We concluded that the substantial reduction in ‘clinical effectiveness’ with 1st and 2nd line HAART regimens in this historical simulation could very well be plausible, given that only 16% to 20% of women were completely ART naïve prior to HAART initiation; approximately 80% had some previous exposure to ART through mono- or combination therapy [33], [34]. Furthermore, 44%–48% of women who initiated treatment had a diagnosis of AIDS, suggesting very advanced disease. In contrast to the reduction in ‘clinical effectiveness’ for 1st and 2nd line HAART required to calibrate the model to the WIHS, the efficacy of 3rd and 4th line HAART required an increase that ranged from 30% to 75%; this considerable increase in efficacy is likely attributable to both the availability of new and more effective treatment regimens and an increasingly homogeneous group of women more likely to pursue, adhere to, and continue treatment.
It is notable, although not unusual for the time period, that a sizable proportion of women in the cohort elected to discontinue HAART. For example, between April 1997 and September 1997, when many women had initiated HAART, 45.6% of these women switched regimens and 18% reported discontinuing HAART (13% switched to a less intensive regimen and 5% discontinued therapy completely) [34]. Three years later, in September 2000, the percentage discontinuing therapy completely had increased from 5% to 11.4% [34]. Similar rates of discontinuation have been seen in both clinical trials and cohort studies. For example, Staszewski et al. reported 27%–43% discontinuation of HAART unrelated to efficacy in a clinical trial of indinavir plus two nucleoside reverse transcriptase inhibitors versus efavirenz plus two nucleoside reverse transcriptase inhibitors [42]. Hammer reported that the overall rate of premature discontinuation was 20% in a clinical trial comparing zidovudine (or stavudine) and lamivudine (28%) versus indinavir, zidovudine (or stavudine), and lamivudine (12%) [49]. Several cohort studies described a high rate of discontinuation and a short median duration of time on a specific regimen. Saag et al. described the increasing number of unique antiretroviral regimens between 1988 and 1998 and a median duration of a specific regimen of 4 months [64]. Van Roon et al. reported that 25% of their clinic patients discontinued HAART within 1 year of initiating therapy [65]. An Italian cohort study found that 36% of men who began HAART modified or discontinued their initial regimen over a median follow-up time of 11 months [66]. Mocroft et al. estimated that 26% of their patients initiating HAART modified or discontinued their regimen within 6 months of initiation and that 45% had modified or discontinued their regimen after a median follow-up time of 14 months [67].
Among patients with CD4 counts of 50–199/µl, the life expectancy projected by the model calibrated to the 24-month cohort-specific data was 140.9 months, the mean across simulations using the 50 best-fitting parameter sets (individual estimates ranged from 130.5 to 148.4 months). Further, the incremental gains projected by 5 lines of HAART versus 4 lines of HAART using the empirically calibrated model (, Part B) were twice those predicted by the model prior to calibration. We also found that uncertain assumptions, such as late failure, while not influential on short-term outcomes, exerted a major impact on the predicted life expectancy. While estimates of life expectancy varied considerably with plausible changes in uncertain assumptions, the incremental gains associated with comparing different treatment strategies within a single cohort varied far less. The implication is that results of incremental cost-effectiveness analyses, for example those conducted to inform choices among competing treatment options, may be less affected by this variation; in contrast, analyses that seek to project long-term estimates of life expectancy or cost for a population of HIV-infected persons may be more variable.
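The summary across best-fitting parameter sets, and the comparison of absolute versus incremental estimates, can be sketched as follows. This is an illustration under stated assumptions only: the function names `summarize` and `compare_strategies` are hypothetical, and the example values are made up to show the calculation; they are not results from the model.

```python
from statistics import mean


def summarize(values):
    """Mean and range of an outcome across parameter sets."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def compare_strategies(le_four_lines, le_five_lines):
    """Summarize life expectancy (months) per strategy and the paired gain.

    Each list holds one projected life expectancy per good-fitting
    parameter set, in the same order, so gains are computed within a set.
    """
    if len(le_four_lines) != len(le_five_lines):
        raise ValueError("both strategies need one estimate per parameter set")
    gains = [five - four for four, five in zip(le_four_lines, le_five_lines)]
    return {
        "four_lines": summarize(le_four_lines),
        "five_lines": summarize(le_five_lines),
        "incremental_gain": summarize(gains),
    }


if __name__ == "__main__":
    # Made-up life expectancies (months) for three hypothetical parameter sets.
    four = [130.0, 140.0, 150.0]
    five = [142.0, 151.0, 163.0]
    for name, stats in compare_strategies(four, five).items():
        print(name, stats)
```

Computing the gain within each parameter set before summarizing is what allows the incremental estimate to vary less than either absolute estimate when an uncertain assumption shifts both strategies in the same direction.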
Our analysis has several important limitations. First, this analysis is not intended to depict a formal empirical calibration process. Rather, this paper was intended to provide a description of the “real world” iterative process of assessing model performance while building a simulation model of a complex disease. In addition, we sought to demonstrate the kind of insights that can be obtained by this type of exercise while providing a description that is intended to increase the transparency of a model development phase. Although we intended to explore the comparative implications of using WIHS versus MACS cohort data, our primary goal was not to fit the model to empiric data. In fact, we would not want to use a model empirically calibrated to older data, reflecting much lower treatment efficacy, to inform current policy questions that could contribute to decisions in the future. Furthermore, we recognize that there are alternative methods for sampling the parameter space including utilization of Bayesian methods, random sampling or complex optimization algorithms. Our guided approach was chosen after careful consideration of the practical and theoretical strengths and limitations of these alternatives, given our goal was to conduct an exploratory exercise; that being said, it is possible we did not sufficiently explore the entirety of the parameter space. These exercises can play an important role in characterizing the effects of key uncertain assumptions, identifying logical inconsistencies, and helping the analyst to understand and describe the performance of the model.
Second, cohort heterogeneities pose challenges to assessing model performance in that it is impossible to reflect all patient- and population-level differences in any analysis; data that adequately characterize heterogeneities within this study cohort remain limited. Some differences between the WIHS cohort and the clinical trial cohorts used to generate initial HAART efficacy estimates [42], [49] are clear; for example, the WIHS is all women (versus trials often with more than 80% male), more than 30% report a history of injection drug use (versus only 10–18% in trials), and nearly two-thirds are black or Hispanic (versus more than 50% white in many trials) [33], [34]. Furthermore, heterogeneities in prior treatment exposure, underlying health status, patient adherence, and patient preferences about treatment could have substantial effects on outcomes, which must be taken into consideration; these and other unknowable factors could have directly or indirectly contributed to the high rates of switching and discontinuation of early lines of HAART in women in the WIHS. For example, toxicities have been reported as an important reason for discontinuation of therapy [66], and a study by Ahdieh and colleagues reported that women were twice as likely as men to discontinue HAART because of toxicities [68].
Third, treatment regimens could not be simulated with complete accuracy. Between April 1996 and September 1996 there were roughly 13 unique HAART regimens used in the WIHS, with 25% of women taking the most common regimen, which consisted of zidovudine, lamivudine and indinavir [34]. However, by the year 2000, there were 171 unique HAART regimens reported in the cohort, with fewer than 15% of women taking the most common regimen of stavudine, lamivudine and nelfinavir [34]. We attempted to account for HAART era effects on treatments used by using values representative of commonly used regimens for the time period during which the WIHS treatment data were collected [43]. However, we recognize these assumptions were at best approximations of the actual range of regimens used.
We emphasize that this analysis is not intended to be a representation of the current treatment environment, where there have been substantial improvements over time in response to treatment, both in terms of drug efficacy and reductions in treatment failure, in addition to decreases in drug toxicity [59], [60], [62], [63], [69]. Rather, the purpose of these exercises was to assess whether the model could produce results consistent with the data used to parameterize the model (i.e., internal consistency and validity), and could simulate a specific cohort such that outcomes were consistent with independent data from that cohort. Using this same model to simulate access to contemporary treatment strategies in HIV-infected women in the United States today, we found that the projected life expectancy in women with a mean CD4 cell count of 350/µl exceeded 250 months (>21 years) given 5 lines of therapy and assuming initiation of HAART at a CD4 cell count of 350/µl. Simulations using a higher CD4 cell count threshold for treatment and/or a greater number of contemporary treatment regimens are likely to project even longer life expectancies.
Exercises that involve iterative assessment of model performance can provide information about the relative influence of different uncertain assumptions, illuminate unexpected synergies between parameters, and provide insight into particular heterogeneities within and between cohorts. When data are available to allow for exercises like those described here, they can be used to assess model performance; descriptive accounts of that process can contribute to a dialogue about the different approaches analysts take to assess model process and model structure uncertainty.