An instrumental variable (IV) is an unconfounded proxy for a study exposure that can be used to estimate a causal effect in the presence of unmeasured confounding. To provide reliably consistent estimates of effect, IVs should be both valid and reasonably strong. Physician prescribing preference (PPP) is an IV that uses variation in doctors' prescribing to predict drug treatment. As reduction in covariate imbalance may suggest increased IV validity, we sought to examine the covariate balance and instrument strength in 25 formulations of the PPP IV in two cohort studies.
We applied the PPP IV to assess antipsychotic medication (APM) use and subsequent death among two cohorts of elderly patients. We varied the measurement of PPP and also applied cohort restriction and stratification. We modeled risk differences with two-stage least squares regression. First-stage partial r2 values characterized the strength of the instrument. The Mahalanobis distance summarized balance across multiple covariates.
Partial r2 ranged from 0.028 to 0.099. PPP generally alleviated imbalances in nonpsychiatry-related patient characteristics, and the overall imbalance was reduced by an average of 36% (±40%) over the two cohorts.
In our study setting, most of the 25 formulations of the PPP IV were strong IVs and resulted in a strong reduction of imbalance in many variations. The association between strength and imbalance was mixed.
Instrumental variable (IV) analysis joins other techniques [1–4] that attempt to mitigate the bias introduced by measured and unmeasured confounding present in nonexperimental data [5–10]. IVs are of particular interest in pharmacoepidemiology studies, as such studies struggle with potential for bias from confounding by indication and other unmeasured risk factors, particularly in administrative databases.
For unbiased IV estimation, the instrument must be valid. A valid instrument is a variable in the observed data that predicts choice of treatment but is not related to the study outcome, except through the effect of treatment. It must also meet several other criteria [13,14]. Although IV validity is not explicitly testable, stratifying the patient population by a valid dichotomous IV should result in more observed balance among the measured covariates than if those same patients had instead been stratified by their actual treatment. If changing study design or IV definition yields even further covariate balance, the increase may correspond to an increase in the validity of the IV.
A strong instrument is one that is a good predictor of actual treatment, with its predictive effect independent of other measured variables. It is important for an IV to be relatively strong: IV estimation involves scaling up an estimate derived by substituting the IV for actual treatment in an outcome model by a factor inversely proportional to IV strength; hence, any residual confounding in that estimate will be amplified if the instrument is weak. Unlike validity, IV strength is a measurable quantity that can be assessed, reported, and compared [15–18]. In nonrandomized research, it is also possible for an instrument to be too strong. A variable that is strongly correlated with a confounded exposure cannot plausibly fulfill the requirements for a valid IV: it will likely be associated with the study outcome via the same unmeasured confounding paths that led to the need for IV analysis in the first place.
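The scaling described above can be illustrated with the Wald (ratio) estimator for a single dichotomous IV with no covariates. This is a minimal sketch for intuition only, not the model used in the study; the function name `wald_estimate` and the toy inputs are hypothetical.

```python
def wald_estimate(z, x, y):
    """Risk difference from the Wald (ratio) estimator.

    z: instrument (0/1), x: treatment received (0/1), y: outcome (0/1).
    All three are equal-length sequences, one entry per patient.
    """
    def mean(values):
        return sum(values) / len(values)

    # Outcome risk and treatment prevalence within each IV group.
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    x1 = mean([xi for zi, xi in zip(z, x) if zi == 1])
    x0 = mean([xi for zi, xi in zip(z, x) if zi == 0])

    strength = x1 - x0  # how well the IV predicts treatment
    if strength == 0:
        raise ValueError("instrument does not predict treatment")

    # The IV-outcome difference is scaled up by 1 / strength, so any
    # bias in the numerator is amplified when the instrument is weak.
    return (y1 - y0) / strength
```

With a first-stage difference of 0.5, for instance, any bias in the numerator is doubled; with a difference of 0.05 it would be multiplied by 20.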
This article and its companion, “Instrumental variables I: instrumental variables exploit natural variation in nonexperimental data to estimate causal relationships,” together introduce the concept of instrumental variable (IV) analysis and examine some of the key assumptions underlying the technique. Taken together, the articles show how IVs arise in observational data and how IV analysis parallels randomized trial designs, and also examine the key notions of instrument strength and validity. Each of them describes instruments that have been used in clinical epidemiology and gives examples of IV analysis.
In the study presented here, we explore alternative definitions of the physician prescribing preference (PPP) instrument proposed by Brookhart et al. and in related work by other authors [20,21], as well as a series of variations in study design and cohort selection. For each variation, we assess the IV's strength and the reduction in imbalance resulting from the application of the IV. We compare reductions in imbalance across the variations and assess the overall relationship between strength and imbalance. To accomplish this, we studied two cohorts of elderly patients initiating treatment with antipsychotic medications (APMs), and considered an outcome of mortality within 180 days.
Brookhart et al. have proposed that an individual physician's preference for prescribing one drug over another is an IV that predicts which drug a patient will be treated with. They examined physician prescribing patterns and deduced that the variation they observed may be an instrument, under the assumption that PPP is unrelated to outcome. They proposed a simple technique for measuring a physician's preference, which we term the “base case”.
As in the earlier work, the base case considered the entire cohort; preference at the time of seeing the patient was determined by the treatment a doctor chose for the previous patient who was treated in his or her practice and who also required a new prescription for one of the study drugs [7,24].
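The base case assignment can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name, the tuple layout, and the use of `None` for a physician's first observed patient are all assumptions.

```python
def assign_base_case_iv(prescriptions):
    """Attach the base case preference IV to each new prescription.

    prescriptions: iterable of (physician_id, date, conventional) tuples,
    where conventional is 1 for a conventional APM and 0 for an atypical
    one, and date is any sortable value (for example an ISO date string).

    Returns a list of (physician_id, date, conventional, iv) tuples, where
    iv is the treatment the same physician chose for the previous new
    prescription, or None when no earlier prescription is available.
    """
    last_seen = {}  # physician_id -> treatment of most recent prior patient
    result = []
    # sorted() is stable, so same-day events keep their input order here;
    # the study ordered same-day events randomly.
    for physician, date, conventional in sorted(
        prescriptions, key=lambda row: (row[0], row[1])
    ):
        iv = last_seen.get(physician)
        result.append((physician, date, conventional, iv))
        last_seen[physician] = conventional
    return result
```

Patients whose IV is `None` would have to be dropped before analysis, since no preference can be inferred for them in this sketch.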
The use of the previous patient's treatment to estimate preference has the advantage of quickly registering any changes in preference, but two issues arise: first, the previous patient's treatment may not reflect the doctor's true preference, and second, the simple IV as specified may not possess the required strength and validity. To examine these issues, we designed variations on the base case that were meant to exercise the definition of the PPP measure and to create contrasts in strength and validity. We modified (1) preference assignment algorithm, (2) source population, and (3) stratification criteria (Table 1). In all instances, we chose single, dichotomous IVs for interpretability and comparability.
To consider alternative formulas for measuring the doctor's preference, we first altered the preference assignment algorithm. We expanded the time window to calculate preference from more than just the last new prescription filled. We used the previous two, three, and four new prescriptions, and set different targets for prescribing consistency; for example, with four prescriptions, we considered definitions requiring that “any of the four,” “half of the four,” or “all of the four” were for conventional rather than atypical APMs. We hypothesized that expanding the window would increase balance in treatment groups by creating a better, more stable estimate of true underlying preference and therefore better quasi-randomization of patients to the two predicted treatment groups. On the other hand, we thought that this also would likely decrease the IV strength by weakening the correlation between the IV and the treatment, especially at the higher targets of prescribing consistency. Because the need for more data per physician increases as the window expands, we performed all preference assignment variations in Table 1's group R1, a cohort of patients seen by doctors with very high-volume prescribing.
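A windowed version of the preference algorithm might look like the sketch below. The parameter names `window` and `min_conventional` are hypothetical, and reading “half of the four” as “at least two of the four” is an assumption; the exact thresholds used in the study are those given in Table 1.

```python
from collections import defaultdict, deque


def assign_window_iv(prescriptions, window=4, min_conventional=2):
    """Dichotomous preference IV from the previous `window` prescriptions.

    prescriptions: iterable of (physician_id, date, conventional) tuples,
    with conventional coded 1 (conventional APM) or 0 (atypical APM).

    iv is 1 when at least `min_conventional` of the physician's previous
    `window` new prescriptions were conventional, 0 otherwise, and None
    when fewer than `window` prior prescriptions exist.
    """
    # Each physician keeps only the most recent `window` treatments.
    history = defaultdict(lambda: deque(maxlen=window))
    result = []
    for physician, date, conventional in sorted(
        prescriptions, key=lambda row: (row[0], row[1])
    ):
        past = history[physician]
        if len(past) < window:
            iv = None  # not enough prior data for this physician yet
        else:
            iv = 1 if sum(past) >= min_conventional else 0
        result.append((physician, date, conventional, iv))
        past.append(conventional)
    return result
```

Under these assumptions, `min_conventional=1` corresponds to “any,” `min_conventional=window` to “all,” and `window=1` recovers the base case.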
A concern about instruments based on physician preference is that varying physician quality and patients' “shopping” for doctors based on the treatment they expect to receive may introduce confounding of the IV and negatively affect the validity of the instrument [13,19]. To address these concerns, the second category of variations considered cohort restriction schemes in which we limited the patient population by combinations of measured doctor-level confounders (primary care, specialty, year of graduation) and patient-level confounders (age, age relative to the average in the doctor's practice). By restricting, we hoped to isolate subpopulations in which the IV assumptions may have been more consistently held, thereby increasing validity and balance. We hypothesized that the restricted cohorts would show higher IV strength, because the combination of greater patient homogeneity and greater number of marginal patients would increase the predictive power of the IV within the subgroups.
Finally, because the preference algorithms estimate preference at any given time from physicians' behavior with prior patients, we created stratification schemes that rearranged the data such that the previous patients shared major characteristics (such as age or gender) with the current patient. By this rearrangement, we hoped that the treatments given to prior patients would reflect not just overall preference, but preference within a particular subgroup of patient. Unlike the restriction schemes, stratification always considered the entire cohort. We hypothesized that the stratification method would contribute to higher instrument strength by means of greater prescribing consistency among like patients, and that stratification would not affect residual imbalance.
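One way to read the stratification scheme is that the “previous patient” is looked up within the same physician and the same patient stratum, rather than within the physician alone. The sketch below follows that reading; the function name and the tuple layout are hypothetical.

```python
def assign_stratified_iv(prescriptions):
    """Base case preference IV computed within physician-by-stratum cells.

    prescriptions: iterable of (physician_id, stratum, date, conventional)
    tuples, where stratum is a patient characteristic such as an age band
    or gender, and conventional is 1 or 0.

    iv is the treatment the same physician gave to the previous patient
    in the same stratum, or None when there is no such patient.
    """
    last_seen = {}  # (physician_id, stratum) -> most recent prior treatment
    result = []
    for physician, stratum, date, conventional in sorted(
        prescriptions, key=lambda row: (row[0], row[1], row[2])
    ):
        key = (physician, stratum)
        result.append((physician, stratum, date, conventional, last_seen.get(key)))
        last_seen[key] = conventional
    return result
```

No patients are excluded by this rearrangement, although more of them will lack a usable prior prescription because each cell starts its own history.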
We expected the estimates of effect on the outcome to be incomparable across these different variations because of the different patient populations and doctor characteristics. We did believe our empirical measures of strength and imbalance, as well as the standard errors of the effect estimates, would be comparable across the variations.
We performed an example study of initiation of APM therapy and the associated risk of short-term mortality. APMs are categorized into two groups: conventional (older) and atypical (newer) agents. They are widely used off-label to control behavioral disturbances in demented elderly patients. Previous studies have found increased rates of death among users of atypical antipsychotic agents as compared with placebo. Nonrandomized studies have indicated that both types of APMs increase risk of death in the elderly, with the atypical drugs showing lesser risk than the conventional ones [8,27].
Our study population, fully described in earlier work [8,27], consisted of two cohorts of patients aged 65 years and older who initiated APM treatment. The first cohort was drawn from Pennsylvania (PA)'s Pharmaceutical Assistance Contract for the Elderly (PACE), a drug assistance program for the state's low-income seniors, between 1994 and 2003. The second cohort was drawn from all British Columbia (BC) residents aged 65 years or more between 1996 and 2004. Patients with existing cancer diagnoses were excluded.
We defined our exposed group to be initiators of conventional APM treatment and compared them with a referent group of initiators of atypical APM therapy [8,27]. Outcome was defined as death within 180 days from drug initiation. We defined the baseline characteristics of the patients based on the 6 months before each subject's index date and included coexisting illnesses and use of health care services [28–30]. All dates were measured to the level of day; events occurring on the same day were ordered randomly. Because of limitations of the claims data, we were not able to measure several potentially important covariates—frailty, cognitive impairment, and ability to perform activities of daily living—factors which we hoped to adjust for using IV methods.
Two-stage least squares (2SLS) models were used to estimate risk differences [7,9]. All IV models were run in Stata Version 9 using the ivreg2 module. Reported standard errors are robust and account for clustering within physician practices using the sandwich estimator [33,34].
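For readers unfamiliar with 2SLS, the point estimate can be sketched with NumPy as below. This is not the ivreg2 implementation: the function name is hypothetical, only the point estimate is returned, and the cluster-robust sandwich standard errors described above are not reproduced (naive standard errors from the second-stage regression would be incorrect).

```python
import numpy as np


def two_stage_least_squares(y, x, z, covariates=None):
    """Point estimate of the treatment effect (risk difference) by 2SLS.

    y: outcome (0/1), x: treatment (0/1), z: instrument (0/1),
    covariates: optional (n, k) array of measured covariates that enter
    both stages. Returns the second-stage coefficient on treatment.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(y)

    # Exogenous regressors shared by both stages: intercept plus covariates.
    exog = np.ones((n, 1))
    if covariates is not None:
        exog = np.column_stack([exog, np.asarray(covariates, dtype=float)])

    # First stage: predict treatment from the instrument and covariates.
    first_design = np.column_stack([exog, z])
    gamma, *_ = np.linalg.lstsq(first_design, x, rcond=None)
    x_hat = first_design @ gamma

    # Second stage: regress the outcome on predicted treatment and covariates.
    second_design = np.column_stack([exog, x_hat])
    beta, *_ = np.linalg.lstsq(second_design, y, rcond=None)
    return float(beta[-1])
```

With a single dichotomous instrument and no covariates, this estimate coincides with the ratio of the IV-outcome difference to the IV-treatment difference.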
To test for strength, we examined the partial F test from the first-stage regression, which predicts treatment as a function of instrument and covariates. The partial F test has the null hypothesis that the coefficient for effect of instrument in the first-stage regression model is zero. In the economics literature, an F statistic greater than 10 is commonly taken to indicate that the instrument is not weak [18,35].
We also computed the partial r2, the square of the partial correlation between the instrument and the treatment, conditional on other covariates in the model. The partial r2 can be interpreted as the proportion of the variance in treatment, not already explained by the covariates, that is explained by the addition of the IV to the model. Large partial r2 values indicate that the instrument contributes substantially to the prediction of treatment.
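Both strength measures can be obtained by comparing first-stage models with and without the instrument. The sketch below uses ordinary (non-robust) sums of squares and a hypothetical function name; the F statistic reported by software that accounts for clustering will generally differ.

```python
import numpy as np


def first_stage_strength(x, z, covariates=None):
    """Partial r2 and partial F for a single instrument.

    x: treatment (0/1), z: instrument (0/1), covariates: optional (n, k)
    array. Returns (partial_r2, partial_f) from comparing the first-stage
    regression with and without the instrument.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(x)

    reduced = np.ones((n, 1))  # intercept plus covariates only
    if covariates is not None:
        reduced = np.column_stack([reduced, np.asarray(covariates, dtype=float)])
    full = np.column_stack([reduced, z])  # the same model plus the instrument

    def residual_ss(design):
        coef, *_ = np.linalg.lstsq(design, x, rcond=None)
        resid = x - design @ coef
        return float(resid @ resid)

    ss_reduced = residual_ss(reduced)
    ss_full = residual_ss(full)

    # Share of the remaining treatment variance explained by the instrument.
    partial_r2 = (ss_reduced - ss_full) / ss_reduced
    # One restriction (the instrument coefficient) in the numerator.
    df_resid = n - full.shape[1]
    partial_f = (ss_reduced - ss_full) / (ss_full / df_resid)
    return partial_r2, partial_f
```

In large cohorts, a modest partial r2 can still coincide with a very large F statistic, because F grows with sample size whereas partial r2 does not.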
We measured change in imbalance for measured covariates, comparing the population as stratified by the treatment versus stratified by the IV. We assessed the change for each covariate; negative numbers indicated a reduction in imbalance. We also computed a summary measure: percentage change in the Mahalanobis distance [36,37]. In the case of a single dichotomous confounding variable, the Mahalanobis distance reflects the standardized difference in mean prevalence between treatment groups. When additional variables are considered simultaneously, the Mahalanobis distance extends logically and also corrects for observed covariance among the measured characteristics so as to avoid “double-counting” correlated variables.
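The summary measure can be sketched as follows. The choice of covariance matrix (here, the sample covariance of the covariates over the whole cohort) is an assumption, as are the function names; the source does not specify these details.

```python
import numpy as np


def mahalanobis_imbalance(covariates, group):
    """Mahalanobis distance between covariate means of two groups.

    covariates: (n, k) array (or length-n sequence for one covariate).
    group: length-n sequence of 0/1 labels (treatment or instrument).
    """
    c = np.asarray(covariates, dtype=float)
    if c.ndim == 1:
        c = c[:, None]
    g = np.asarray(group).astype(bool)

    diff = c[g].mean(axis=0) - c[~g].mean(axis=0)
    # Assumption: covariance estimated over the full cohort. The inverse
    # down-weights correlated covariates so they are not double-counted.
    cov = np.atleast_2d(np.cov(c, rowvar=False))
    return float(np.sqrt(diff @ np.linalg.pinv(cov) @ diff))


def percent_change_in_imbalance(covariates, treatment, instrument):
    """Percentage change in imbalance when stratifying by the instrument
    instead of the treatment. Negative values indicate reduced imbalance."""
    by_treatment = mahalanobis_imbalance(covariates, treatment)
    by_instrument = mahalanobis_imbalance(covariates, instrument)
    return 100.0 * (by_instrument - by_treatment) / by_treatment
```

For a single covariate, the distance reduces to the absolute difference in group means divided by the covariate's standard deviation, which matches the standardized difference described above.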
Characteristics of the 36,541 BC initiators of APMs and 20,087 PA initiators are presented in Table 2. There were 4,113 deaths from any cause (11% of cohort) in BC and 2,935 deaths (15%) in PA.
When stratified by treatment, many variables showed relative balance, but some showed differences of over 5% (in PA, those included gender, dementia, and mood disorders), especially among measured psychiatric conditions. Table 3 alters the stratification to be by the various IVs rather than by treatment. It shows, for a series of potential confounding variables in each of the cohort restriction variations, the difference in prevalence in the predicted treatment groups after stratifying by the IV. On variables unrelated to psychiatric conditions, balance was broadly achieved (difference tended toward 0%), with the exception of hypertension in the PA cohort.
As a summary measure of the imbalance figures for each covariate, the rightmost columns of the PA and BC sections of Table 3 show the percentage change in Mahalanobis distance between treatment and IV stratification. The Mahalanobis distance was reduced in most cases, indicating improved covariate balance, though several variations, especially among the preference algorithm schemes, showed a greater imbalance. The stratification schemes generally showed good improvement in balance.
As an example of change in balance of a single covariate, the PA cohort was 15.1% male in the atypical APM group and 20.1% male in the conventional APM group (Table 2), for a difference between the groups of 5%. When stratified by the IV used in the base case, the difference was reduced to 1.8% (Table 3), that is, stratifying by the base case IV resulted in 1.8% more males in the atypical group than in the conventional group. Overall, the application of the instrument reduced imbalance by 62.7% as compared with stratification by the treatment.
For reference, Table 4 shows results of unadjusted, age and sex adjusted, and fully adjusted ordinary least squares (OLS) models, as well as a two-stage least squares (2SLS) IV analysis. All analyses showed an increased risk of death among those treated with conventional APMs, though some confidence intervals included the null value of zero. We presented figures for both the base case and for the subcohort restricted to patients attended by primary care physicians; we assumed that “doctor shopping” would be minimized when patients were seeing their usual primary care doctor.
Table 5 presents measures of instrument strength for all variations of PPP. Partial r2 values were generally high as compared with selected values from the economics literature, with values ranging from 0.028 to 0.099. The partial r2 values observed in the base case were among the highest. The partial r2 values were similar across the various cohort definitions and were not stronger for the restriction schemes. P-values for the F statistic were universally less than 0.05.
We compared the partial r2 with several other measures in our data. Partial r2 did not vary strongly with study size (BC Spearman r = 0.068; PA r = 0.171). With regard to change in imbalance (Fig. 1a), the correlation was modest in BC and weak in PA (BC r = 0.482; PA r = −0.049). With regard to standard errors (Fig. 1b), we observed a consistent decrease in standard error as the IV strengthened (BC r = −0.496; PA r = −0.677).
In Table 5, one can further observe a decrease in study size as a result of cohort restriction schemes. These decreases correlated with a simultaneous increase in standard error of the IV point estimate (r2 = 0.71).
The relatively infrequent use of IVs in epidemiology may be the result of a perceived lack of strong instruments or concerns about IV validity. In our two example studies evaluating the effectiveness of medicines in routine care, we found that PPP in almost any of its definitions or study formulations would be considered a strong instrument as compared with typical examples in the economics literature. The results also show a broad reduction in imbalance of measured covariates across our restriction and stratification variants. The reduced imbalance in measured covariates and the IV's strength lend credence to the notion that PPP may be an effective instrument for the selected drug comparison. We also noted that the association between instrument strength and imbalance in measured covariates was a mixed one; the Spearman correlation in BC was fairly high, whereas that of PA was close to zero.
Validity of an IV is an untestable property because it involves quantifying the strength of the association between the instrument and the outcome, potentially mediated through unmeasured paths. As in other approaches to controlling confounding, IV validity can be explored through subject matter expertise or empirical assessment of relationships likely to be correlated with unmeasured factors. Inspection of the reduction in imbalance of measured factors achieved by applying the instrument may also be informative. In our data, application of the IV generally reduced imbalance in measured covariates, but significant imbalance remained among the measured psychiatric conditions. These conditions were correlated with one another, perhaps because of misclassification of specific psychiatric conditions. Because of these strong correlations, we used the Mahalanobis distance to assess overall balance.
The reduction in Mahalanobis distance in many of the variations, along with previous work [7,8,19,24], suggests that PPP was at least a reasonably valid instrument in this setting. The fact that some imbalance remained, especially in psychiatric conditions, suggests some “nonrandom” assignment of patient to practice, such as a clustering of a particular patient type within practice. (For a violation of the IV assumptions to occur, the selection of patients to practice would also have to be associated with the outcome of death.) Overall, an observed decrease in Mahalanobis distance may be suggestive of increased validity but is not necessarily indicative; it is possible to imagine a circumstance where the Mahalanobis distance is dramatically reduced but IV validity is not affected. It is also possible that using an IV, even one that yields strong treatment group balance, can lead to greater bias than would occur in a non-IV setting. To avoid this, any numeric evaluation of validity also requires due consideration of potential violations of the IV assumptions based on subject matter expertise and other knowledge [14,39].
The fairly consistent decrease in partial r2 when additional past prescriptions were added to the preference estimation algorithm suggests that considering the additional prescriptions decreases the proportion of the variance in treatment explained by the instrument and weakens the predictive power of the dichotomous IV. Using a continuous rather than dichotomous IV may have mitigated this effect. Even though the IV was weaker, the additional previous prescriptions may have also yielded a better estimate of the physician's true preference because they estimated preference over a longer period of time and over more patients. This suggests that the somewhat lower partial r2 values when adding previous prescriptions may be a better estimate of the PPP IV's true strength than the higher value observed in the base case.
At the same time, almost all of the cases in which we saw increases in overall imbalance came from requiring that a doctor be totally consistent in his or her prescribing over the window considered (Table 3, rows P4 through P6). However, the physician may be consistent not because of his or her preference but because he or she is seeing similar patients who may have self-selected to his or her practice (“doctor shopped”), or as a result of other forms of atypical case mix. In these cases, the element of randomness in the “assignment” of patients to doctor may have been reduced or lost.
We had hypothesized that a stronger instrument would be associated with somewhat greater imbalance: as instrument strength increases, the IV starts to resemble more closely the treatment variable. If this resemblance becomes too strong, then the IV may be confounded by the same factors that confound treatment, and stratification by the strong IV should reduce imbalance less than stratification by a weaker IV that is less correlated with the treatment's confounders. In our data, by Spearman's rank-based measure of correlation between strength and balance, this played out in BC (r = 0.482) but not in PA (r = −0.049). Using Pearson's measure based on an assumed linear relationship, there was moderate correlation in both populations (BC r = 0.270; PA r = −0.249). The divergent findings suggest no clear answer to whether there was a trade-off between imbalance and strength.
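The two correlation measures can disagree because Spearman's coefficient is Pearson's coefficient applied to ranks, so it responds only to the ordering of the values and not to their spacing. The sketch below shows the relationship on made-up numbers; none of the values come from the study, and the function names are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))
```

A single variation with an extreme value can therefore move the Pearson coefficient substantially while leaving the Spearman coefficient unchanged.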
The IV methods measure the effect in the marginal patient rather than the effect in the entire cohort [12,40,41]. By varying the cohort definitions, we may have also affected who the marginal patient would be, and therefore, any measures of effect drawn from these variations may not be comparable. We did not present second-stage-effect estimates for all variations, as the choice of the “right” estimate would be very much a decision of study design and subject matter expertise, and should not be driven by the results that appear most reasonable based on previous knowledge.
This study examined a range of implementations of the PPP instrument in two pharmacoepidemiologic studies on APM treatment. In these limited examples, the application of the PPP instrument did generally reduce imbalances, but created imbalances in some cases of the very stringent IV definitions. Imbalances in measured covariates can be controlled for in the analysis, but the remaining imbalances suggest that the unmeasured covariates may be imbalanced as well, and may therefore lead to bias in a traditional outcome model.
In summary, we have demonstrated a number of variants of the PPP instrument and shown how empirically assessing the strength of an IV and its reduction in imbalance of covariates may inform the use of PPP in practical settings relevant to pharmacoepidemiology using claims data.
Funding: Dr. Schneeweiss received support from the National Institute on Aging (R01-AG021950), National Institute of Mental Health (U01-MH078708), and the Agency for Healthcare Research and Quality (2-RO1-HS10881), Department of Health and Human Services, Rockville, MD. He is Principal Investigator of the Brigham & Women's Hospital DEcIDE Research Center on Comparative Effectiveness Research funded by the Agency for Healthcare Research and Quality.