Estimates of treatment effectiveness in epidemiologic studies using large observational health care databases may be biased due to inaccurate or incomplete information on important confounders. Study methods that collect and incorporate more comprehensive confounder data on a validation cohort may reduce confounding bias.
We applied two such methods, imputation and reweighting, to Group Health administrative data (full sample) supplemented by more detailed confounder data from the Adult Changes in Thought study (validation sample). We used influenza vaccination effectiveness (with an unexposed comparator group) as an example and evaluated each method’s ability to reduce bias using the control time period prior to influenza circulation.
Both methods reduced, but did not completely eliminate, the bias compared with traditional effectiveness estimates that do not utilize the validation sample confounders.
Although these results support the use of validation sampling methods to improve the accuracy of comparative effectiveness findings from healthcare database studies, they also illustrate that the success of such methods depends on many factors, including the ability to measure important confounders in a representative and large enough validation sample, the comparability of the full sample and validation sample, and the accuracy with which data can be imputed or reweighted using the additional validation sample information.
Large health care databases are increasingly being used to study treatment effectiveness in medical research. However, using data collected primarily for administrative and clinical purposes to conduct comparative effectiveness research poses many challenges. One major problem is that large databases can have limited ability to characterize important confounding differences in outcome risk between exposed and unexposed persons [2–4]. For instance, database confounder adjustment for health status is often accomplished by broadly defining medical conditions using binary International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes, or risk score summary measures based on these codes, assigned by the medical provider during patient visits [5–7]. This relatively crude adjustment can lead to residual confounding in effectiveness estimates, because ICD-9 codes do not adequately measure disease severity or functional status [4, 8–12].
A prominent example of this problem is the estimation of influenza vaccine effectiveness (VE) among the elderly in large database studies, which have consistently found implausibly high risk reductions against all-cause mortality (~50%) when adjusting only for database information such as binary ICD-9 coded indicators of health status [13–15]. More recent research has indicated that residual confounding may account for some, if not all, of this observed effect [10–11]. Specifically, when examining the association between influenza vaccine and mortality in the control period prior to the circulation of influenza, even larger reductions in risk (~70%) have been found. Any effect observed during the pre-influenza period represents bias, since no association between influenza vaccine and morbidity or mortality is biologically plausible when influenza virus is not circulating. This bias has been shown to be reduced by adjusting for functional limitations obtained from medical chart review, which suggests that unmeasured frailty is the most plausible unmeasured confounder in this setting. Such confounding would occur if seniors who are very close to dying are no longer given preventive therapies, such as influenza vaccine.
Although adjusting more comprehensively for additional confounders obtained by medical record review or in-person physical examination has the potential to reduce bias in traditional effectiveness estimates that adjust only for information available in database sources, it may be too expensive to collect these more costly confounders in large database studies, where sample sizes can reach tens or hundreds of thousands. One solution is to collect the more expensive data on a smaller validation sample or a subset of the full database cohort and use validation or two-phase sampling methods to incorporate this information into analyses. Here we implement two such approaches, a missing data imputation method and a survey sample reweighting method, to estimate influenza VE in the elderly. We use Group Health Cooperative (GHC) administrative data from a prior influenza VE study (full sample) supplemented by richer confounder data on a subset (validation subsample) that included in-person examinations as part of the Adult Changes in Thought (ACT) study. We use the control time period prior to influenza season to evaluate each method's ability to successfully reduce confounding bias compared to traditional adjustment approaches that rely solely on confounders from database sources.
We used existing cohorts from two prior studies conducted among persons aged 65 years and older who were members of Group Health Cooperative (GHC), a managed care organization in Washington State with ~350,000 enrollees. The composition of the GHC population is representative of the surrounding region, which is primarily white, middle class, and well educated. The first came from a large, retrospective database cohort study of influenza VE among 72,527 community-dwelling seniors from 1995–2002 that captured data from GHC's administrative systems on all-cause mortality (outcome of interest), influenza immunization (exposure of interest), and database confounders used in prior database studies of influenza VE [14–15], including health care utilization (e.g., number of outpatient visits) and ICD-9 diagnosis codes assigned to patient encounters and used to define binary health status indicators (e.g., heart disease). In the current study, we utilized data from two study years (September 1, 2000 – August 31, 2001 and September 1, 2001 – August 31, 2002), required that persons remain continuously enrolled during each study year, and defined this cohort as the full sample. Subjects were followed each study year from the September 1 start date until their death or August 31, whichever occurred first. Database confounders were captured in the one-year baseline period prior to each study year (September 1, 1999 – August 31, 2000 and September 1, 2000 – August 31, 2001). To make fuller use of available database information in the current study compared with prior studies, we also defined additional database covariates using a broader range of data, including medications, laboratory test results, other health care utilization (e.g., home health services), and disease severity measures, based on methods described previously.
The second sample was taken from the longitudinal ACT study, a prospective cohort study of aging and dementia among GHC seniors. The original ACT cohort of 2,581 community-dwelling, dementia-free persons aged 65 years or older was enrolled from 1994 to 1996 and supplemented with 811 more members from 2000 to 2003. Extensive data from in-person interviews and physical examinations were collected at an initial visit and follow-up visits every two years thereafter, including self-reported demographics, activities, and instrumental activities of daily living (ADL and IADL), health behaviors, and disease conditions, as well as clinical assessments of physical function, dementia, and depression. Some interviews were conducted by proxy if study subjects were unavailable. Further study design details have been published previously [16–17]. In the current study, we used ACT data to more comprehensively characterize potential confounders on subjects who were also in the full sample. Confounder data were accessed on ACT enrollees with a follow-up visit during the baseline period in which database confounders were available from the full sample (September 1, 1999 – August 31, 2000 and September 1, 2000 – August 31, 2001). These ACT participants are a subset of the full cohort and are defined in the current study as the validation sample. In analyses, we linked the ACT data on validation sample members to their full sample database information, which contained their mortality status, influenza vaccine exposure status, and database confounders.
Our primary aim was to assess whether confounding bias in traditional database estimates of influenza VE against all-cause mortality, which naively adjust only for database confounders, can be reduced by incorporating adjustment for additional confounders measured in a validation sample. For each approach, we used data from both study years to fit a Cox proportional hazards model that estimated the relative risk (RR) of death for vaccinated versus unvaccinated individuals, treating vaccination status as a binary time-varying covariate defined for each subject as 'unvaccinated' from September 1 up to the date of vaccination and as 'vaccinated' for the rest of that study year (i.e., until August 31). To estimate the effect of vaccine during influenza season as well as control periods before and afterward, we included an interaction term between vaccine status and a three-category time period effect defined using local influenza viral surveillance data for each study year: before (September 1 to December 16 in study year 1, and September 1 to December 15 in study year 2), during (December 17 to March 18 in study year 1, and December 16 to March 10 in study year 2), and after (March 19 to August 31 in study year 1, and March 11 to August 31 in study year 2) influenza season.
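The time-varying exposure coding above amounts to expanding each subject's follow-up into counting-process intervals, split at the vaccination date and at the influenza-period boundaries. A minimal sketch, assuming day numbering from September 1; the function and all variable names are illustrative, not the study's actual code:

```python
def split_follow_up(exit_day, vax_day, period_cuts):
    """Expand one subject's follow-up (day 0 = September 1) into
    counting-process rows (start, stop, vaccinated, period) suitable
    for a Cox model with a time-varying vaccination covariate.

    exit_day    -- day of death or end of study year
    vax_day     -- day of vaccination, or None if never vaccinated
    period_cuts -- (last day of 'before' period, last day of 'during' period)
    """
    # Candidate split points: vaccination date plus the period boundaries,
    # keeping only those strictly inside the subject's follow-up.
    cuts = sorted({c for c in (vax_day, *period_cuts)
                   if c is not None and 0 < c < exit_day})
    rows, start = [], 0
    for stop in [*cuts, exit_day]:
        vaccinated = int(vax_day is not None and start >= vax_day)
        if stop <= period_cuts[0]:
            period = "before"
        elif stop <= period_cuts[1]:
            period = "during"
        else:
            period = "after"
        rows.append((start, stop, vaccinated, period))
        start = stop
    return rows

# A subject vaccinated on day 45 who survives the full year (day 364),
# with influenza season running from day 108 through day 199:
rows = split_follow_up(exit_day=364, vax_day=45, period_cuts=(107, 199))
```

Each row then enters the Cox fit as one interval of risk time, so the vaccination indicator and season period change value at exactly the right follow-up days.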
In each Cox model, propensity scores were used to consolidate confounders into a single summary measure for adjustment. The propensity score was defined as the probability of receiving influenza vaccine in each study year conditional on confounders measured in the year prior and was estimated using multivariable logistic regression. Two specific scores were created using confounders defined a priori based on expert clinical opinion: 1) An error-prone score (PSep) computed among the full cohort and based only on database confounders, and 2) A gold-standard score (PSgs) computed in the validation cohort and based both on database and validation sample confounders. To prevent bias, variables associated only with exposure and not with the outcome were excluded from the propensity score models, identified by inspecting the age- and gender-adjusted odds ratios (ORs) between each variable and the outcome [18–19]. Using these propensity scores, we implemented four approaches: 1) an unadjusted model, 2) a naïvely adjusted model, 3) imputation, and 4) reweighting. The unadjusted and naïvely adjusted methods involved fitting a Cox model among the full sample that either did not adjust for any confounders or adjusted only for database confounders as measured by PSep, thus replicating traditional unadjusted and adjusted database study methods. The latter two approaches, described further in the next sections, fit Cox models that incorporated confounders from the ACT validation sample.
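As a toy illustration of the propensity score step, the sketch below estimates P(vaccinated | confounders) with a hand-rolled Newton–Raphson logistic fit. The single confounder and all names are hypothetical stand-ins; in practice standard statistical software would fit the multivariable model:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson fit of a logistic regression; X must include
    an intercept column. Returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: beta += (X'WX)^{-1} X'(y - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

def propensity_scores(confounders, vaccinated):
    """Estimate a PSep-style score: P(vaccinated | confounders)."""
    X = np.column_stack([np.ones(len(vaccinated)), confounders])
    beta = fit_logistic(X, vaccinated)
    return 1.0 / (1.0 + np.exp(-X @ beta))

# Toy data: one confounder (say, prior outpatient visits) that raises
# the probability of vaccination.
rng = np.random.default_rng(0)
visits = rng.poisson(5, size=500)
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 0.3 * visits)))
vax = rng.binomial(1, p_true)
ps = propensity_scores(visits[:, None], vax)
```

A convenient check on any fitted logistic propensity model with an intercept is that the mean fitted score equals the observed vaccination rate, which follows from the score equations at the maximum likelihood solution.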
We first viewed the lack of more detailed confounder data (i.e., the lack of PSgs) for some full cohort members from a missing data perspective and applied the following steps: 1) In the validation sample, use linear regression to estimate the association between the predictor PSep and outcome PSgs, adjusted for influenza vaccination status; 2) Use this regression equation to predict PSgs among full sample members not in the ACT validation sample; and 3) In the full sample, fit a Cox regression model estimating the RR of death for those vaccinated versus not vaccinated, adjusted for PSgs (for those in the ACT validation sample) or the predicted value of PSgs (for those not in the ACT validation sample), and use bootstrapping to estimate standard errors. Notably, when considering this problem in a measurement error context, where the propensity score based only on database confounders (PSep) is the quantity measured with error compared with a gold standard propensity score based on the more detailed confounder data (PSgs), this imputation approach is equivalent to the regression calibration algorithm described by Carroll et al. Sturmer et al. referred to this specific application of regression calibration as propensity score calibration.
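Steps 1 and 2 can be sketched as an ordinary least-squares fit in the validation sample followed by prediction in the remainder of the full sample; step 3 (the Cox fit with bootstrap standard errors) would use standard survival software. The toy data and names here are illustrative, not the study's:

```python
import numpy as np

def impute_psgs(psep_val, vax_val, psgs_val, psep_full, vax_full):
    """Step 1: in the validation sample, regress the gold-standard
    score PSgs on the error-prone score PSep, adjusted for vaccination
    status. Step 2: predict PSgs for full-sample members outside the
    validation sample from their PSep and vaccination status."""
    X_val = np.column_stack([np.ones(len(psep_val)), psep_val, vax_val])
    coef, *_ = np.linalg.lstsq(X_val, psgs_val, rcond=None)
    X_full = np.column_stack([np.ones(len(psep_full)), psep_full, vax_full])
    return X_full @ coef

# Toy example: PSgs is correlated with PSep but not identical.
rng = np.random.default_rng(1)
psep = rng.uniform(0.2, 0.9, size=200)
vax = rng.binomial(1, psep)
psgs = 0.1 + 0.7 * psep + 0.05 * vax + rng.normal(0, 0.05, size=200)
# Treat the first 50 subjects as the "validation sample" and impute
# PSgs for the remaining 150.
imputed = impute_psgs(psep[:50], vax[:50], psgs[:50], psep[50:], vax[50:])
```

The quality of this imputation hinges entirely on how well PSep (plus exposure status) predicts PSgs in the validation sample, which is exactly the relationship examined in Figure 3.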
The second validation sampling approach we employed is a survey reweighting method called generalized raking [23–24]. Reweighting is often used when analyzing a subcohort sampled from a larger cohort using a two-phase stratified design. Subcohort analyses are then inverse probability–weighted based on the sampling probabilities (i.e., using Horvitz-Thompson estimation) so that subcohort inference reflects the larger cohort and is thus generalizable to the original population. However, weights based only on the stratifying factors do not generally use all the available information on the larger cohort, information known as auxiliary data (V). To increase precision, standard weights can be adjusted using V so that the observed total of V in the larger cohort equals the weighted total of V in the subcohort, while keeping the adjustment as small as possible. This induced dependence of the weights on V, measured on the full cohort, drives the efficiency gain and is known as calibration [27–28]. To avoid confusion with the previously described imputation approach, which has also been called calibration, we refer to this reweighting method using an alternative survey terminology: generalized raking.
To implement raking in the influenza VE example, we fit a weighted Cox model in the validation sample that estimated the RR of death for those vaccinated versus not vaccinated, adjusted for PSgs, where the weights were estimated as follows: 1) Define initial weights as the inverse probability of inclusion in the ACT validation cohort, and estimate them using logistic regression with age, gender, and their interaction as predictors, as if the ACT cohort were drawn from the full sample using an age and gender stratified design; and 2) Adjust the initial weights by using the additional auxiliary information, PSep, available on all full cohort members. Instead of directly using PSep as the raking variable (i.e., instead of using V=PSep), we used a variable based on PSep called a delta-beta, a quantity that reflects the estimated influence of each subject in a Cox regression and has been shown to estimate the optimally efficient choice of V [23,25–26]. We note that although the initial weights in Step 1 were based only on age and gender (in order to reflect stratifying factors that are commonly used in practice in two-phase designs), the final weights used in Step 2 for reweighting depend on all the auxiliary database information contained in PSep, thus fully leveraging the available database information.
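The weight-adjustment step can be illustrated with linear calibration, a close relative of generalized raking that solves the same constrained-adjustment problem in closed form: initial weights are perturbed as little as possible (in a chi-square distance) so that weighted sample totals of the auxiliary variable match the known full-cohort totals. All data and names below are simulated stand-ins, with a generic auxiliary variable in place of the delta-beta described above:

```python
import numpy as np

def calibrate_weights(d, V_sample, totals):
    """Linearly calibrate initial weights d so the weighted sample
    totals of the auxiliary columns equal the known full-cohort
    totals, with the smallest chi-square-distance adjustment:
        w_i = d_i * (1 + lambda' v_i),
    where lambda solves (sum_i d_i v_i v_i') lambda = totals - t_hat."""
    V = np.column_stack([np.ones(len(d)), V_sample])  # also calibrate on N
    t_hat = V.T @ d                                   # current weighted totals
    A = V.T @ (d[:, None] * V)
    lam = np.linalg.solve(A, totals - t_hat)
    return d * (1.0 + V @ lam)

# Toy full cohort of 1,000 with one auxiliary variable known on
# everyone; a simple random subsample of 100 gets initial weights 10.
rng = np.random.default_rng(2)
v_full = rng.normal(0.0, 1.0, size=1000)
idx = rng.choice(1000, size=100, replace=False)
d = np.full(100, 10.0)
totals = np.array([1000.0, v_full.sum()])  # known full-cohort totals
w = calibrate_weights(d, v_full[idx][:, None], totals)
```

One design note: linear calibration can in principle produce negative weights, whereas generalized raking proper uses a multiplicative distance that keeps all weights positive while satisfying the same calibration constraints.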
The full sample and validation sample cohorts comprised about 44,000 and 1,000 seniors each year who contributed 86,400 and 1,936 person-years during the two-year study period, respectively (Table 1). Annual influenza vaccine coverage was about 72% and 77% in the full and validation samples, respectively, and about 3–4% died each year in each cohort. The percent who died in the periods before, during, and after influenza season were 0.9%, 0.8%, and 1.6%, respectively, with the highest percent observed after influenza season, which was roughly twice as long as the other periods. Most vaccinated seniors received vaccine in November of each study year (Figure 1). Tables 2 and 3 show the characteristics of the full and validation sample cohorts based on database confounders included in the PSep and the supplemental confounders included in the PSgs, respectively. About 60% of members in both cohorts were female, and the full sample was slightly younger than the validation sample. Table 4 shows the ORs and 95% confidence intervals (CIs) quantifying the magnitude of the age and gender adjusted association between each confounder and death to provide further insight into the confounding mechanisms based on both database and validation sample information.
Estimates and 95% CIs of the RR of death associated with influenza vaccination obtained using each of the four approaches (unadjusted, naïve, imputation, and reweighting) in each time period (before, during, and after influenza season) are shown in Figure 2. RRs were lowest (<0.50) in the period before influenza season and then increased steadily (to 0.50–0.70 during and >0.80 after influenza season). Unadjusted and naïvely adjusted estimates were similar across all time periods. Estimates based on imputation or reweighting were also comparable in all time periods, but consistently closer to the null (i.e., RR=1.0) compared with the unadjusted and naïvely adjusted estimates. No approach correctly estimated a null RR=1.0 in the control period before influenza season, though the pre-influenza estimates based on imputation and reweighting were closer to 1.0 than the naïvely adjusted RR, indicating that bias was somewhat reduced using methods that incorporated the validation confounder data. The quality of the imputation and reweighting is characterized in Figure 3. This scatterplot with fitted regression lines estimating the association between PSep and PSgs within the validation sample shows modest correlation (ρ=0.60) but wide variability in PSgs for each value of PSep.
The association between influenza vaccination and risk of all-cause mortality is a useful example for studying problems of confounding in treatment effectiveness studies that rely on administrative databases, as strong confounding is present, and there is a natural control period prior to influenza season that can be used to assess bias [10,29]. In this study, we leveraged existing data from two prior cohort studies to explore the utility of using two methods (imputation and reweighting) that integrate additional confounder data from a validation sample to reduce confounding bias in influenza VE estimates that adjust only for information available in database sources. Using the control period prior to influenza season as a gauge, we found that both methods modestly reduced but did not completely eliminate the bias compared with naïvely adjusted estimates that did not use the validation sample confounder data. The magnitude of the bias reduction was comparable in both approaches.
Use of validation sample methods can enhance healthcare database studies, but our results suggest that their success in practice depends on many factors and assumptions. The key bias-reducing factor for either imputation or reweighting is the ability to measure the important confounders in the validation sample. Both methods also rely on the comparability of the validation and full samples, which is guaranteed if the validation sample is designed as a probability-sampled subcohort. Unbiased estimation for imputation further depends on the correctness of the model used to impute the gold-standard confounder data from the error-prone information, while reweighted estimates are robust to this assumption (i.e., they will be no worse than estimates based only on the validation sample, even if this model is incorrect). In both methods, precision will improve as the amount of information in the validation sample increases, which can occur either with larger validation sample sizes or with increases in the strength of the association between the gold standard and error-prone confounders. Lastly, the imputation approach has several additional assumptions, including the conditional independence of the error-prone confounders from outcomes, given the gold-standard confounders (i.e., the surrogacy assumption) [22,30]. Our specific application of imputation, which was designed to be consistent with the propensity score calibration method, involved a propensity score summary measure rather than a single measured covariate, and this raises additional technical issues, many of which have been discussed by Lunt et al. Implementation of the propensity score calibration approach could be further enhanced by performing multiple rather than single imputation.
The influenza VE example we used in this study was advantageous for several reasons. First, there is a well-defined control period during which bias can be assessed. Also, the potential for other sources of bias is relatively small. Outcome misclassification was minimized, because in addition to capturing mortality data directly from GHC databases, we linked to state mortality records and thus obtained information even if a subject disenrolled from GHC. Exposure misclassification is also likely to be small, since vaccination coverage rates in the GHC population have been found to closely reflect average coverage rates among those 65 years and older in Washington State. Reasons for high accuracy of exposure data include the following: 1) The electronic vaccination registry at GHC is well-established, dating back to 1991 when it was created for the Vaccine Safety Datalink Project, and is routinely monitored for quality assurance, 2) GHC reciprocally shares data with the Washington State Immunization Information System and so captures vaccine data on seniors vaccinated at outside institutions, and 3) GHC databases will contain vaccinations received by seniors during hospital stays if the hospital filed a claim for payment for the vaccination.
However, the influenza VE application was also limited in several ways. One major challenge is the presence of a selection mechanism for influenza vaccination that is extremely difficult to measure. Although the ACT validation confounders included a variety of disease severity and functional status measures that were geared to address unmeasured frailty, the reasons for selective receipt of preventive therapies such as influenza vaccine in seniors are clearly complex and difficult to measure, and this has been observed in prior studies [10–12]. In many settings, confounding by frailty could instead be addressed by using an active versus an unexposed comparator, but an active comparator is not readily available for influenza vaccine. Trimming a small proportion of those treated contrary to prediction has been proposed as another method to address unmeasured confounding due to frailty, but we did not explore that option in this analysis. Second, comparability between the full and validation cohorts was imperfect, reducing the generalizability to the full sample of the validation sample model that related the gold standard and error-prone propensity scores. More importantly, the relatively rigorous nature of the ACT interviews and examinations may have resulted in frail and demented seniors (the group most plausibly responsible for much of the unmeasured confounding) being under-represented in the validation sample, which would limit the ability to remove confounding by frailty. A third limitation is that the quality of the model relating the gold standard to the error-prone information in the validation sample was somewhat weak, with wide variability in the gold-standard propensity score for each value of the error-prone propensity score, suggesting relatively limited predictive ability of the database information. Fourth, the validation sample size was relatively small and the mortality outcome was rare, which reduced statistical power.
Our results support further exploration of validation sampling methods, such as imputation and reweighting, to improve the accuracy of findings from health care database studies. Although similar recommendations have been made previously [26,35–37], and software is readily available (widely for imputation and comprehensively in R for survey procedures), such methods remain relatively underutilized. One challenge when studying treatment effectiveness beyond influenza vaccine is that there are limited methods to evaluate the performance of more sophisticated confounder adjustment techniques, like those that incorporate validation data. Unlike the case with influenza vaccine, there may not be a readily available control period during which the association between treatment exposure and outcome is known. If this is the case, one cannot determine with certainty when a method gets the 'right' answer or when one method out-performs another. Efficacy estimates from randomized controlled trials (RCTs) may give some indication of the 'truth,' but they may also substantially differ from observational effectiveness results due to major differences among study populations and between highly controlled RCTs and 'real-world' observational settings. Without clear gold-standard estimates of effectiveness in practice for most exposures, a balance of simulation studies (where truth can be generated) and example applications (where the complexities of real data are present) is needed to more fully understand the optimal implementation and settings for use of validation methods in practice.
This work was supported by a subcontract with America’s Health Insurance Plans (AHIP) under contract 200-2002-00732 from the Centers for Disease Control and Prevention. The findings and conclusions in this report are those of the authors, and do not necessarily represent the official position of the Centers for Disease Control and Prevention. Additional funding support was received from grant UO1 AG06781 from the National Institute on Aging, National Institutes of Health. Preliminary results were presented orally at the Comparative Effectiveness Research symposium “From Efficacy to Effectiveness” at AHRQ’s DEcIDE Methods Center Learning Network in Rockville, Maryland, on June 13, 2012.