Provider profiling of outcome performance has become increasingly common in pay-for-performance programs. For chronic conditions, a substantial proportion of patients eligible for outcome measures may be lost to follow-up, potentially compromising outcome profiling. In the context of primary care depression treatment, we assess the implications of missing data for the accuracy of alternative approaches to provider outcome profiling.
We used data from the Improving Mood-Promoting Access to Collaborative Treatment trial and the Depression Improvement across Minnesota, Offering a New Direction initiative to generate parameters for a Monte Carlo simulation experiment.
The patient outcome of interest is the rate of remission of depressive symptoms at 6 months among a panel of patients with major depression at baseline. We considered two alternative approaches to profiling this outcome: (1) a relative, or tournament style threshold, set at the 80th percentile of remission rate among all providers, and (2) an absolute threshold, evaluating whether providers exceed a specified remission rate (30 percent). We performed a Monte Carlo simulation experiment to evaluate the total error rate (proportion of providers who were incorrectly classified) under each profiling approach. The total error rate was partitioned into error from random sampling variability and error resulting from missing data. We then evaluated the accuracy of alternative profiling approaches under different assumptions about the relationship between missing data and depression remission.
Over a range of scenarios, relative profiling approaches had total error rates that were approximately 20 percent lower than absolute profiling approaches, and error due to missing data was approximately 50 percent lower for relative profiling. Most of the profiling error in the simulations was a result of random sampling variability, not missing data: between 11 and 21 percent of total error was attributable to missing data for relative profiling, while between 16 and 33 percent of total error was attributable to missing data for absolute profiling. Finally, compared with relative profiling, absolute profiling was much more sensitive to missing data that was correlated with the remission outcome.
Relative profiling approaches for pay-for-performance were more accurate and more robust to missing data than absolute profiling approaches.
The use of outcome measures to profile providers in pay-for-performance (P4P) programs is becoming increasingly common. Major P4P initiatives, including the Premier Hospital Quality Incentive Demonstration (Lindenauer et al. 2007), the California Pay for Performance Program (Integrated Healthcare Association 2009), and the United Kingdom's Quality and Outcomes Framework (Doran et al. 2008), have used outcome measures, in addition to process measures, to incentivize quality. Beginning in 2013, Medicare's Hospital Value-Based Purchasing program will also begin to incentivize performance on outcome measures for all acute care hospitals in the United States (Federal Register 2011).
A potential problem associated with the use of outcome measures to profile providers in P4P programs, particularly in the outpatient setting, is that a substantial proportion of patients may be lost to follow-up or fail to have their outcomes measured and recorded at specified time points, resulting in missing data. There are two mechanisms through which missing data can undermine profiling. First, the loss of data decreases sample size, which in turn increases random sampling variation, ultimately decreasing measure reliability (Normand et al. 2007). Second, when measures are missing in a way that is correlated with patient outcomes, missing data can lead to systematically biased estimates of provider quality, biasing profiling. This problem, henceforth referred to as “systematically missing” data, undermines the validity of outcome measures.
The extent to which missing data affects provider profiling in P4P programs may also depend on the specific approach used to profile providers—that is, to determine which providers receive financial incentives. One prominent approach is to use a relative, or tournament style threshold, in which providers who are ranked above a pre-determined percentile among the population of providers receive a P4P bonus payment (Institute of Medicine 2006; Lindenauer et al. 2007). Another common approach uses an absolute threshold, rewarding providers if they exceed a specified level of measured quality (e.g., providers with 50 percent or more patients achieving treatment goals) (Institute of Medicine 2006; Doran et al. 2008).
There is reason to believe that absolute profiling methods may be more sensitive to missing data than relative profiling methods. While both profiling methods will be affected by the decrease in reliability associated with the reduction in sample size from missing data, systematically missing data may be more detrimental to absolute profiling methods. This is because relative profiling methods depend only on the relative ranking of providers, not the precise level of measured quality, while absolute profiling methods depend directly on the level of measured quality. Systematically missing data will always affect levels of quality, but they may not affect relative quality rankings if the effects of systematically missing data are similar across providers, thereby preserving provider rankings. While there is an extensive literature on the problems of missing data in longitudinal clinical studies (Laird 1988; Siddique et al. 2008; Gibbons, Hedeker, and DuToit 2010; Normand 2010), with an emphasis on nonrandom attrition from patient drop-out (Little 1995), we are aware of no research that has assessed the impact of missing data on the accuracy of provider profiling in pay-for-performance programs.
We conducted a Monte Carlo simulation study to assess the impact of missing data on the accuracy of both absolute and relative provider profiling methods in the context of primary care depression treatment. Depression is strongly linked to poor adherence to treatment and impaired self-care. In the context of depression care, the direction of the effect of systematically missing data remains unknown: patients who remain depressed after the initial stage of treatment may think that treatment is ineffective or may face barriers to receiving treatment and therefore may be likely to drop out of treatment. At the same time, patients who respond to treatment may not see a need for the maintenance of treatment and may be more likely to drop out. Given this uncertainty, the potential impact of systematically missing data is of particular interest and relevance in depression care.
Our Monte Carlo simulation study provides several advantages in addressing our research question. First, it allows us to know “true” provider quality rankings (because these are assigned in the simulation) and to evaluate the accuracy of profiling methods against this known benchmark. Because we specify the data-generating process used in the simulation model, we can also partition profiling error into error resulting from random sampling variability and error resulting from missing data, which is impossible when the data-generating process is unknown. Second, our simulation study allows us to conduct sensitivity analyses to understand the conditions under which the simulation results hold and the key parameters to which the results are sensitive. Third, the simulation allows us to conduct randomized experiments within the context of our model, which lets us identify the causal effect of key model parameters, such as the effect of systematically missing data, on profiling error.
The patient outcome of interest is remission of depression symptoms at 6 months after the start of treatment (“remission”). We evaluated two separate profiling approaches. The first, relative profiling, is based on a threshold set at the 80th percentile of remission rate among all providers. The second, absolute profiling, evaluates whether providers exceed a remission rate of 30 percent. In the population of providers (see below), the 80th percentile of the remission rate is equal to a remission rate of 34 percent. We assessed the accuracy of provider profiling using the total error rate, defined as the proportion of providers who were incorrectly classified as either above or below the relevant threshold, given a profiling approach. To assess the accuracy of profiling approaches, we compare the average of the total error rate, which is calculated across all the simulation iterations for a given simulation scenario (see below).
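The two profiling rules and the total error rate can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study: the function names, the nearest-rank percentile convention, the choice to compute the relative cutoff separately for true and observed scores, and the five example scores are all assumptions introduced here.

```python
import math


def percentile_cutoff(scores, pct=80):
    """Nearest-rank percentile of the scores (an assumed percentile convention)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_relative(scores, pct=80):
    """Flag providers whose score is strictly above the pct-th percentile."""
    cutoff = percentile_cutoff(scores, pct)
    return [s > cutoff for s in scores]


def classify_absolute(scores, threshold=0.30):
    """Flag providers whose remission rate exceeds the fixed threshold."""
    return [s > threshold for s in scores]


def error_rate(true_flags, observed_flags):
    """Proportion of providers whose observed class differs from the true class."""
    wrong = sum(t != o for t, o in zip(true_flags, observed_flags))
    return wrong / len(true_flags)


# Hypothetical true and observed remission rates for five providers.
true_scores = [0.20, 0.25, 0.29, 0.31, 0.40]
observed_scores = [0.22, 0.24, 0.31, 0.29, 0.41]

relative_error = error_rate(classify_relative(true_scores),
                            classify_relative(observed_scores))
absolute_error = error_rate(classify_absolute(true_scores),
                            classify_absolute(observed_scores))
print(relative_error, absolute_error)
```

In this toy panel the two providers near 30 percent land on the wrong side of the absolute threshold, while the top-ranked provider is unchanged. The example is constructed to show the mechanics only and does not reproduce the study's results.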
We used data from a randomized controlled trial called IMPACT (Improving Mood-Promoting Access to Collaborative Treatment) to generate parameters for the simulation (Unutzer et al. 2002). The IMPACT data are uniquely suited for our purposes. They include both a clinical registry used by care managers in the trial to document exposure to the intervention and to track patient outcomes (“registry data”) and longitudinal research interviews, which independently assessed patient outcomes at regular intervals (“research data”). We consider the research data to represent the gold standard for measuring clinical quality, which is not typically available for provider quality profiling. The registry data were not as complete as the research data and are similar to data that would typically be used to profile providers for public reporting or P4P programs: at 6 months after the start of treatment, more patients had missing depression outcomes in the registry data than in the research data. We used data from 715 patients who had known remission status based on the 6-month follow-up from the research data (considered the “true” outcome) and who also initiated care management according to the registry data. Of these 715 patients, 165 (23.1 percent) did not have their depression symptoms measured and recorded in the registry data during the 6th month and were therefore considered to have missing data for the remission outcome.
We matched the research and registry data from the IMPACT study to generate several parameters for the simulation. The first parameter we generated from these data was the association between data being missing in the registry data (missingness) and remission in the research data. To generate this parameter, we calculated the patient-level rate of remission for those with missing registry data (0.232) and those without missing registry data (0.262). The difference between these rates (−0.030) provides an estimate of the association between data missingness and remission (i.e., the effect of systematically missing data on remission). In this case, it implies that patients with missing registry data had slightly worse remission outcomes. Next, we calculated the provider-level mean and variance of remission rates (from research data) and the provider-level mean and variance of rates of missingness (from registry data).
To situate our simulation study in the context of real-world provider profiling, as opposed to the IMPACT research study, we based the number of medical groups (25) and patient caseload per medical group (200) on a large depression quality improvement initiative, Depression Improvement Across Minnesota, Offering a New Direction (DIAMOND) (Solberg et al. 2010). After obtaining simulation parameters from the IMPACT research study and the DIAMOND initiative, our study uses data generated from our Monte Carlo simulation in all further analysis. All simulation parameters, their values, the source from which they are derived, and the distributions from which they are drawn, are shown in Table 1.
Our Monte Carlo simulation model is implemented at the provider level and the patient level: random draws are taken first at the provider, and then at the patient level. To generate precise point estimates of the accuracy of alternative profiling approaches, we replicate the entire simulation model 10,000 times for each of the five alternative scenarios (see below). Each replication of the simulation model is referred to as an iteration. The process of using Monte Carlo methods to generate simulated data by sampling from specified statistical distributions to assess the properties of estimators, in our case the accuracy of alternative estimators of quality profiling, is described in Greene (2011), and implementation in Stata is described by Cameron and Trivedi (2009).
To initiate the Monte Carlo simulation, we assigned each provider a true quality score and a rate of missingness, each drawn randomly from distributions shown in Table 1. On the basis of data from the IMPACT trial, we assumed a correlation between missingness and true quality that was common to all providers. Then, 200 patient-level random draws, first determining whether an observation was missing or nonmissing and then determining patient remission (conditional on missing status), were taken for each provider. We then aggregated information from these draws, calculating provider-level “observed” quality scores under one of the two conditions: (1) using remission outcomes only from patients with nonmissing data, so that profiling error was a combination of error from missing data and from random sampling; and (2) using remission outcomes from patients with both missing and nonmissing data, so that all profiling error was due to random sampling. To calculate #1, provider-level observed scores were based only on nonmissing patients: thus, if from a panel of 200 patients a provider had 40 patients with missing data, the observed score would be based only on the data from the 160 nonmissing patients. To calculate #2, random draws determining remission were taken both from patients with nonmissing data and from patients with missing data. Working from the previous example, we would have taken random draws determining remission for the 40 patients with missing data, based on their probability of remission, then combined this with data from the 160 patients with nonmissing data to calculate the “observed” provider-level rate.
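The data-generating steps in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation. The Table 1 values are not reproduced in the text, so the default means and standard deviations below are placeholders. Drawing provider-level rates from clipped normal distributions is an assumption, as is the way the two conditional remission probabilities are derived (they differ by `effect` and average to the provider's true quality). A single `effect` is shared by all providers.

```python
import random


def clip(x, lo=0.0, hi=1.0):
    """Keep a drawn rate inside [lo, hi]."""
    return max(lo, min(hi, x))


def simulate_provider(rng, n_patients, quality, miss_rate, effect):
    """Return (score from nonmissing patients only, score from all patients)."""
    # Assumed parameterization: remission probabilities by missing status
    # differ by `effect` and average (weighted by miss_rate) to `quality`.
    p_nonmissing = clip(quality - miss_rate * effect)
    p_missing = clip(quality + (1.0 - miss_rate) * effect)
    n_missing = remit_missing = remit_nonmissing = 0
    for _ in range(n_patients):
        # First draw: is this patient's outcome missing from the registry?
        if rng.random() < miss_rate:
            n_missing += 1
            # Second draw: remission, conditional on missing status.
            remit_missing += int(rng.random() < p_missing)
        else:
            remit_nonmissing += int(rng.random() < p_nonmissing)
    n_nonmissing = n_patients - n_missing
    # Condition 1: only patients with nonmissing data contribute.
    score_nonmissing = (remit_nonmissing / n_nonmissing
                        if n_nonmissing else float("nan"))
    # Condition 2: every patient contributes, so only sampling error remains.
    score_all = (remit_nonmissing + remit_missing) / n_patients
    return score_nonmissing, score_all


def simulate_iteration(rng, n_providers=25, n_patients=200,
                       quality_mean=0.26, quality_sd=0.09,   # placeholders
                       miss_mean=0.23, miss_sd=0.10,         # placeholders
                       effect=-0.03):
    """One iteration: (true quality, condition 1 score, condition 2 score)."""
    rows = []
    for _ in range(n_providers):
        quality = clip(rng.gauss(quality_mean, quality_sd))
        miss_rate = clip(rng.gauss(miss_mean, miss_sd), 0.0, 0.9)
        score_nonmissing, score_all = simulate_provider(
            rng, n_patients, quality, miss_rate, effect)
        rows.append((quality, score_nonmissing, score_all))
    return rows


rows = simulate_iteration(random.Random(0))
print(len(rows), rows[0])
```

Passing an explicit `random.Random` instance keeps each iteration reproducible, which makes it straightforward to rerun a scenario with the same seed while changing one parameter.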
Next, each provider's true quality and observed quality were profiled using the two separate profiling approaches, described above. We then calculated the error rates associated with each condition: the error rate associated with the first condition is referred to as the “total error rate” and the error rate associated with the second condition is referred to as the “random sampling error rate.” We calculated the contribution of missing data to profiling errors by taking the difference between the total error rate and the random sampling error rate.
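In symbols, the partition can be written as below. The notation is introduced here for illustration and is not taken from the original analysis: $J$ is the number of providers, $q_j$ is provider $j$'s true quality, $\hat{q}_j^{\,\text{nonmissing}}$ and $\hat{q}_j^{\,\text{all}}$ are the observed scores under the first and second conditions, and $c(\cdot)$ is the classification rule (for relative profiling it depends on the scores of all providers, not only provider $j$).

```latex
E_{\text{total}} = \frac{1}{J}\sum_{j=1}^{J}
  \mathbf{1}\!\left[\, c\big(\hat{q}_j^{\,\text{nonmissing}}\big) \neq c\big(q_j\big) \right],
\qquad
E_{\text{sampling}} = \frac{1}{J}\sum_{j=1}^{J}
  \mathbf{1}\!\left[\, c\big(\hat{q}_j^{\,\text{all}}\big) \neq c\big(q_j\big) \right],
\qquad
E_{\text{missing}} = E_{\text{total}} - E_{\text{sampling}}
```

Each quantity is then averaged over the simulation iterations for a given scenario.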
We conducted sensitivity analyses by varying key parameters. This sensitivity analysis included assuming a doubling of the average missing rate, varying the number of medical groups in the simulation (from 25 to 100), varying the patient caseload (from 200 to 500), and allowing the effect of systematically missing data on remission to vary across providers. In all, we created a total of five simulation scenarios. Each separate specification of the simulation model was run for 10,000 iterations. For each iteration, we obtained information about the error rate for the two profiling approaches. For each simulation scenario, we conducted Wald tests of the difference in error rates attributable to missing data between relative and absolute profiling approaches.
We then performed a separate analysis to evaluate how the accuracy of profiling was affected by the association between missingness and remission. To do this, we randomly varied the effect of systematically missing data on remission between −0.2 and 0.2, holding other simulation parameters at their values for Scenario 5. This is a wide range of values (approximately 10 times the standard error of the effect of systematically missing data), spanning an assumption that missing data are associated with much worse remission outcomes to an assumption that missing data are associated with much better remission outcomes. We then plotted the total error rate for both profiling approaches against the effect of systematically missing data.
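A small deterministic sketch can show why the two approaches might respond differently as this effect moves away from 0. If a provider's overall remission rate q is the weighted average of the rates among patients with missing and nonmissing data, and the effect is the difference between those two rates (missing minus nonmissing), then the expected rate among nonmissing patients is q minus the missing rate times the effect. The sketch applies that identity to four hypothetical providers. It ignores sampling noise and assumes the same missing rate and effect for every provider, which is a simplification of the scenarios described in the text.

```python
def expected_observed_rate(quality, miss_rate, effect):
    """Expected remission rate among patients with nonmissing data."""
    return quality - miss_rate * effect


def ranks(scores):
    """Provider indices ordered from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


true_quality = [0.24, 0.28, 0.32, 0.36]   # hypothetical providers
miss_rate = 0.23                          # roughly the IMPACT registry rate
threshold = 0.30                          # absolute threshold from the text

results = {}
for effect in [-0.2, -0.1, 0.0, 0.1, 0.2]:
    observed = [expected_observed_rate(q, miss_rate, effect)
                for q in true_quality]
    # Relative profiling depends only on the ordering of providers.
    rank_preserved = ranks(observed) == ranks(true_quality)
    # Absolute profiling depends on which side of 0.30 each score falls.
    flips = sum((o > threshold) != (q > threshold)
                for o, q in zip(observed, true_quality))
    results[effect] = (rank_preserved, flips)
    print(effect, rank_preserved, flips)
```

Under these simplifying assumptions every provider's score shifts by the same amount, so the ordering is unchanged while a provider near the 30 percent threshold changes class whenever the effect is nonzero. When missing rates or effects differ across providers, as in Scenario 5, rankings can shift as well.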
Finally, we assessed the relationship between the proximity of providers' true quality to the incentive threshold and the probability of profiling misclassification. To do this, we evaluated the probability of misclassification for both profiling approaches as a function of the distance between providers' true quality and incentive threshold values. We estimated a logit model in which error in classification for both absolute and relative profiling approaches was modeled as a function of the difference between providers' true quality score and the incentive threshold, modeling this difference using a flexible 5th order polynomial. The analysis was performed using the base set of assumptions in Scenario 1, with the exception of increasing the number of providers from 25 to 10,000 to estimate the relationship with greater precision. This analysis was performed using a single simulation iteration.
Table 2 shows the main results from the simulation analysis. The columns to the left show the values of key parameters that change across the five scenarios that are simulated. The columns to the right show the total error rate, the error rate from random sampling alone (i.e., when there is no missing data), and the error rate attributable to missing data (the difference between the total error rate and the error rate from random sampling) for both the relative profiling and absolute profiling approaches.
Under the base set of assumptions in Scenario 1, the overall error rate was 11.9 percent using the relative threshold, and was 14.2 percent using the absolute threshold. The error attributable to missing data was 1.3 percentage points for relative profiling and 2.3 percentage points for absolute profiling. In Scenario 1 (and all other scenarios), the difference in the error attributable to missing data between the relative and absolute profiling approaches was significant at p < .01. Assuming a greater number of providers (Scenario 2) led to no change in the error attributable to missing data, and assuming a larger caseload (Scenario 3) led to limited changes in the error attributable to missing data. Assuming a higher rate of missing data (Scenario 4) and allowing the association between missingness and remission to vary across providers (Scenario 5) resulted in an increase in both the total error rate and the error attributable to missing data. However, while the error attributable to missing data varied across the simulation scenarios, the ratio of error attributable to missing data between the absolute and relative profiling approaches remained approximately constant for Scenarios 1 through 4, with relative profiling having approximately half as much error attributable to missing data. This ratio increased slightly when the relationship between missingness and remission was allowed to vary across providers in Scenario 5.
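As a worked check on the Scenario 1 figures, the share of total error attributable to missing data, and the ratio of missing-data error between the two approaches, follow directly from the percentages reported above:

```latex
\frac{1.3}{11.9} \approx 0.109 \ \ (\text{relative profiling}),
\qquad
\frac{2.3}{14.2} \approx 0.162 \ \ (\text{absolute profiling}),
\qquad
\frac{1.3}{2.3} \approx 0.57
```

That is, roughly 11 percent of total error under relative profiling and 16 percent under absolute profiling is attributable to missing data in Scenario 1, which appears to correspond to the lower ends of the ranges reported across scenarios.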
Figure 1 shows the relationship between the effect of systematically missing data and the total error rate for each profiling approach. For relative profiling, as the effect of systematically missing data deviates from 0, the total error rate changes only slightly, indicating the robustness of relative profiling to systematically missing data. However, for absolute profiling, the total error rate increases substantially as the effect of systematically missing data deviates from 0, indicating a sensitivity of absolute profiling to systematically missing data.
Figure 2 shows the relationship between the difference between providers' true quality and the incentive threshold and the probability of profiling error. It shows that, for both profiling approaches, the probability of profiling error is close to 50 percent for providers whose true quality is within 1 percentage point of the incentive threshold, but that the probability of profiling error decreases to approximately 10 percent for providers whose true quality is 5 percentage points from the incentive threshold, decreasing to approximately 0 percent for providers whose true quality is 10 percentage points from the threshold. Thus, providers whose true quality is very close to incentive thresholds are at great risk for profiling error, but this risk decreases sharply within a moderate distance from the thresholds.
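A rough back-of-the-envelope check on this pattern is possible with a normal approximation to the binomial, which is a different technique from the logit model used in the analysis. The sketch covers a fixed (absolute) threshold only, because a relative cutoff is itself estimated from the data. It ignores any systematic shift from missing data, and it assumes about 154 observed patients per provider (200 patients less roughly 23 percent missing). The function name and defaults are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def misclassification_probability(true_quality, threshold=0.30, n_observed=154):
    """Normal-approximation probability that the observed remission rate
    falls on the wrong side of a fixed threshold."""
    # Standard error of an observed proportion with n_observed patients.
    se = sqrt(true_quality * (1.0 - true_quality) / n_observed)
    # Distance from the threshold in standard-error units.
    z = abs(true_quality - threshold) / se
    return NormalDist().cdf(-z)


for distance in (0.0, 0.01, 0.05, 0.10):
    p = misclassification_probability(0.30 + distance)
    print(f"{distance:.2f} above threshold: {p:.3f}")
```

Under these assumptions the approximation gives a probability of one half at the threshold, on the order of 10 percent at 5 percentage points, and close to zero at 10 percentage points, broadly in line with the pattern described for Figure 2.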
Our Monte Carlo simulation study assessed the accuracy of absolute and relative provider profiling approaches in the context of primary care depression pay-for-performance programs. We found that, over a range of scenarios, relative profiling approaches had profiling error rates that were approximately 20 percent lower than absolute profiling approaches. Also, most of the profiling error in the simulations was a result of random sampling variation, not missing data: between 11 and 21 percent of total error was attributable to missing data for relative profiling, while between 16 and 33 percent of total error was attributable to missing data for absolute profiling. This finding, however, is based largely on the fact that the missing data were not strongly related to the remission outcome in the IMPACT data, and a stronger relationship would amplify the relationship between missing data and profiling error. We also found that absolute profiling approaches were much more sensitive to error from systematically missing data than relative profiling approaches. Finally, the risk of profiling error was extremely high, approximately 50 percent, for providers whose true quality was in the immediate proximity of incentive thresholds, but decreased sharply to approximately 10 percent for providers whose true quality was 5 percentage points from incentive thresholds, indicating that the risk of profiling error is disproportionately borne by providers whose true quality is close to incentive thresholds.
As the practice of provider profiling grows, so does the literature that focuses on statistical properties of quality measures and profiling methods (Nyweide et al. 2009; Adams et al. 2010; Friedberg and Damberg 2011). Our study contributes to the current knowledge and debate in two ways. First, we specifically evaluated the implications of missing data for the accuracy of profiling. As missing data are prevalent and can be addressed in different ways (including exclusion of missing observations and various imputation methods), it is important to assess the magnitude of the problem. It is also important to understand the mechanisms by which missing data can affect profiling accuracy to guide the adoption of different remedies. Second, by evaluating how missing data affects the error rate of profiling, our study addressed an issue that is central to pay-for-performance programs and is directly relevant to stakeholders and designers of these programs.
Our study has a number of important implications. First, given that relative profiling methods were shown to be more accurate and less sensitive to systematically missing data than absolute profiling methods, relative profiling is recommended in situations with substantial missing data, particularly when it is suspected that missing data may be related to the outcome of interest. Second, our study reinforces the concern that random sampling variability is a key driver of profiling error, emphasizing the importance of larger samples to reduce profiling error (Nyweide et al. 2009; Adams et al. 2010). While our study was situated in the context of pay-for-performance in depression care management, the findings are directly generalizable to other contexts, such as public quality reporting, and also likely applicable to profiling for outpatient care other than depression.
Our study also has limitations. First, the parameters we generated for the simulation (e.g., mean and variation of true quality and of missing data, correlation between missing and true quality, etc.) came from the unique data from the IMPACT study and may be different in other settings. For instance, it is possible that the extent of missing data may be much greater than was observed in the IMPACT study. While this may reduce the generality of our findings, our results were robust across a number of sensitivity checks, suggesting that the greater accuracy of relative profiling would likely hold in other circumstances. Furthermore, it is unclear how the findings from this study, in the context of depression care management, would extend to profiling for other chronic care.
Second, our simulation approach required us to make a number of assumptions about the underlying data-generating processes for study parameters, and violations of these assumptions could affect our results. For instance, we assumed that provider-level remission rates were distributed normally: if remission rates followed a different distribution, then overall error rates could differ depending on the proximity between providers' quality scores and bonus thresholds (Figure 2). However, other research has shown that the distribution of outcome quality measures (such as mortality) is often approximately normal, and that deviations from normality can be explained by a combination of small sample sizes or distributions that are truncated as a result of ceiling or floor effects (Ryan et al. 2012). Given that our remission rate outcome is not subject to either of these concerns, we think that the normality assumption is reasonable.
Third, we only considered a narrow range of simple profiling approaches and did not evaluate profiling error for more complex approaches, such as the proposed approach for Hospital Value-Based Purchasing (Federal Register 2011). Nonetheless, the absolute and relative profiling approaches form the backbone of more complex schemes, and it is important to understand profiling error for these approaches. In addition, our study does not consider the behavioral response of providers to correct and incorrect profiling, which is an important topic for future research.
Pay-for-performance programs require a mechanism to translate quality performance to actual payments. While missing data have the potential to confound the translation of quality performance to payments, more robust profiling approaches can mitigate the bias from missing data. As a result of the greater accuracy and robustness to missing data, relative profiling approaches show greater potential for limiting profiling error in pay-for-performance programs. Moving forward with pay-for-performance, relative profiling appears to be a more promising approach to rewarding providers.
Joint Acknowledgment/Disclosure Statement: Andrew Ryan was supported by a grant from the Agency for Healthcare Research and Quality (K01 HS018546-01). Yuhua Bao was supported by a grant from the National Institute of Mental Health (K01 MH090087). The authors would like to thank Jürgen Unützer, M.D., M.P.H., M.A., at the University of Washington for kindly sharing the IMPACT data, which we used to generate parameters for simulation. The IMPACT study was funded by grants from the John A. Hartford Foundation and the California Healthcare Foundation.
Additional supporting information may be found in the online version of this article:
Appendix SA1: Author Matrix.