Multiple factors limit identification of patients with depression from administrative data. However, administrative data drives many quality measurement systems, including the Health Plan Employer Data and Information Set (HEDIS®).
We investigated two algorithms for identification of physician-recognized depression. The study sample was drawn from primary care physician member panels of a large managed care organization. All members were continuously enrolled between January 1 and December 31, 1997. Algorithm 1 required at least two criteria in any combination: (1) an outpatient diagnosis of depression or (2) a pharmacy claim for an antidepressant. Algorithm 2 included the same criteria as algorithm 1, but required a diagnosis of depression for all patients. With algorithm 1, we identified the medical records of a stratified, random subset of patients with and without depression (n=465). We also identified patients of primary care physicians with a minimum of 10 depressed members by algorithm 1 (n=32,819) and algorithm 2 (n=6,837).
The sensitivity, specificity, and positive predictive values were: Algorithm 1: 95 percent, 65 percent, 49 percent; Algorithm 2: 52 percent, 88 percent, 60 percent. Compared to algorithm 1, profiles from algorithm 2 revealed higher rates of follow-up visits (43 percent, 55 percent) and appropriate antidepressant dosage acutely (82 percent, 90 percent) and chronically (83 percent, 91 percent) (p<0.05 for all).
Both algorithms had high false positive rates. Denominator construction (algorithm 1 versus 2) contributed significantly to variability in measured quality. Our findings raise concern about interpreting depression quality reports based upon administrative data.
Administrative medical, pharmacy, and membership files of managed care organizations offer relatively low-cost, convenient data sources for examining patterns of care at the population level. Cohorts defined from administrative data often drive quality measurement and reporting systems, such as the Health Plan Employer Data and Information Set (HEDIS®). HEDIS is a set of standardized performance measures for comparisons among managed care organizations (Hanchak et al. 1996). Administratively defined cohorts are also used for quality improvement (Weiner et al. 1990; Weiner et al. 1995; Romano, Roos, and Jollis 1993; Garnick, Hendricks, and Comstock 1994; Leatherman et al. 1991) and disease management programs. For example, at-risk individuals may be identified from administrative files to receive reminders for annual mammography or influenza vaccination. The Centers for Medicare and Medicaid Services often use administrative data to identify patients for national Medicare quality improvement projects (Jencks and Wilensky 1992).
Regardless of the condition, several factors limit the accuracy of administrative algorithms for disease identification. These factors include incompleteness of data submitted by providers for capitated visits and procedures (encounters) and fee-for-service procedures (claims) to payors; limited clinical detail in the International Classification of Diseases (ICD), Current Procedural Terminology (CPT), and Diagnosis Related Group (DRG) systems; and inaccuracy of demographic information in administrative files. For example, administrative pharmacy databases will not contain evidence of treatment if the physician only gives the patient samples from the office and does not write a formal prescription. The patient could also have the prescription filled at a nonparticipating pharmacy without using a pharmacy ID card. Therefore, to use administrative databases effectively for quality improvement and profiling, careful attention must be given to disease identification algorithms (Benesch et al. 1997; Weintraub et al. 1999).
Depression, in particular, presents many additional challenges in the use of health plan administrative data. These problems include failure of the physician to recognize depression (Wells et al. 1989; Kessler, Cleary, and Burke 1985; Borus et al. 1988), failure of the physician and patient to report depression because of the stigma of mental illness (Hoyt et al. 1997; Hirschfeld et al. 1997; Rost et al. 1994), and confounding of diagnosis by medical comorbidity (Tylee, Freeling, and Kerry 1993; Epstein et al. 1996; Koenig et al. 1993; Cohen-Cole and Stoudemire 1987; Coulehan et al. 1990). Because of these difficulties, one cannot assume that approaches that successfully identify patients with other diagnoses from claims data can be extrapolated to depression (Hanchak et al. 1996; Kiefe et al. 2001; Ellerbeck et al. 1995; Marciniak et al. 1998). However, the high prevalence of depression (Simon, Von Korff, and Barlow 1995; Simon and Von Korff 1995; Henderson and Pollard 1992; Hall and Wise 1995) and the well-documented deficiencies in the diagnosis and treatment of depression (Rost et al. 1994; Wells 1994; Rogers et al. 1993; Norquist et al. 1995; Katon et al. 1992; Lemelin et al. 1994; Bouhassira et al. 1998) make this an important area of quality assessment and improvement.
Existing research has not yet successfully addressed the impact of these difficulties. The purpose of this paper is to describe the process we used to evaluate and refine an algorithm for identifying physician-recognized depression using multistate data from a large managed care organization. We also explored the effect of changes in the algorithm on apparent changes in quality of care.
We examined two algorithms that used administrative data to identify patients diagnosed with depression by their primary care physicians. The study sample was drawn from primary care physician member panels of a large managed care organization (MCO). For all comparisons, we used a physician diagnosis of depression in the medical record as the standard. Next, we used administrative data to explore the variability in physician performance on identical quality measures for two contemporaneous patient samples, each constructed from different algorithms. Algorithm 1 was designed to maximize sensitivity by allowing patients to be identified from either administrative diagnostic codes or pharmacy data. Algorithm 2 was designed to decrease the false positive identifications that result from the use of pharmacy data alone.
Administrative data from primary care encounters, specialist encounters, claims, and pharmacy databases were linked with the MCO's membership and provider files. Although each patient was assigned to a primary care physician, we included treatment and follow-up events from any physician in the plan. We obtained our study samples from the pool of all members aged 12 years and older with medical and pharmacy benefits who were continuously enrolled as members in the MCO's health plans in the Mid-Atlantic or Northeast regions of the United States between January 1, 1997, and December 31, 1997 (n=892,786).
Algorithm 1 relied on a combination of diagnostic and pharmacy codes from administrative databases. The ICD-9 codes identified depressive disorders (296.2–296.36; 300.4; 311). Bipolar affective disorder (i.e., manic disorders) and depression with psychosis were specifically excluded. Pharmacy codes included: (1) monoamine oxidase inhibitors (MAOIs), (2) tricyclic antidepressants, (3) tetracyclic antidepressants, (4) selective serotonin reuptake inhibitors (SSRIs), (5) serotonin 2-receptor antagonists, (6) alpha-2-receptor antagonists, and (7) other miscellaneous antidepressants (e.g., modified cyclics). Patients less than 19 years of age with prescriptions for imipramine were excluded because imipramine may be used to treat enuresis or attention deficit hyperactivity disorder. Patients with prescriptions for lithium were also excluded.
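As a rough illustration of the code list and exclusions described above, the following Python sketch checks whether an ICD-9 code falls in the depression set and whether a member would be excluded on pharmacy grounds. It rests on several assumptions that are not in the text: the range "296.2–296.36" is read as 296.2x and 296.3x with an optional fifth digit of 0 to 6, drugs are represented by generic names rather than pharmacy codes, and the function names are hypothetical. How bipolar disorder and depression with psychosis were excluded is not specified in enough detail to reproduce, so the sketch does not attempt it.

```python
import re

# Assumed reading of the stated code list "296.2-296.36; 300.4; 311":
# 296.2x / 296.3x with an optional fifth digit 0-6, plus 300.4 and 311.
DEPRESSION_ICD9 = re.compile(r"296\.[23][0-6]?|300\.4|311")


def is_depression_code(icd9):
    """Return True if the ICD-9 code is in the (assumed) depression code set."""
    return DEPRESSION_ICD9.fullmatch(icd9.strip()) is not None


def is_excluded(age, drug_names):
    """Apply the two pharmacy-based exclusions described in the text.

    drug_names is a hypothetical list of generic drug names for one member.
    """
    drugs = {name.lower() for name in drug_names}
    if "lithium" in drugs:
        return True  # any lithium prescription excludes the member
    # imipramine excludes only members younger than 19
    return age < 19 and "imipramine" in drugs
```

In real claims data the drug match would be made on pharmacy codes rather than names; the name-based lookup here is only a stand-in.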
Algorithm 1 required that members have at least two events in the administrative data with each event satisfying one of the following criteria: (1) an outpatient encounter with a primary diagnosis of depression or (2) a pharmacy claim for an antidepressant medication. Algorithm 1 could be satisfied by two events from the same category. Because a diagnostic code was not required, some members may have been identified because they filled two or more prescriptions for antidepressants. One event (either encounter diagnosis or pharmacy claim) for depression was not considered sufficient for identification because of the potential for miscoding or the use of another family member's pharmacy card when filling prescriptions.
Algorithm 2 required that one of the two events from Algorithm 1 actually be an encounter with a diagnosis of depression. The other event could either be another encounter with a diagnosis of depression or a pharmacy claim for an antidepressant medication.
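The two identification rules can be sketched in a few lines of Python. This is a minimal illustration, assuming each member's qualifying events have already been reduced to a list of labels, "dx" for an outpatient encounter with a primary diagnosis of depression and "rx" for an antidepressant pharmacy claim. The labels and function names are hypothetical and are not the authors' implementation.

```python
def meets_algorithm_1(events):
    """At least two qualifying events of any kind ("dx" or "rx")."""
    qualifying = [e for e in events if e in ("dx", "rx")]
    return len(qualifying) >= 2


def meets_algorithm_2(events):
    """Algorithm 1, with the added requirement of at least one diagnosis."""
    return meets_algorithm_1(events) and "dx" in events


# A member with two antidepressant claims and no coded diagnosis
# satisfies Algorithm 1 but not Algorithm 2.
pharmacy_only = ["rx", "rx"]
print(meets_algorithm_1(pharmacy_only), meets_algorithm_2(pharmacy_only))
```

Every member who satisfies Algorithm 2 also satisfies Algorithm 1, so the Algorithm 2 cohort is a subset of the Algorithm 1 cohort.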
Our protocol for medical record abstraction, which has been published elsewhere, included a computerized abstraction module with extensive instructions and synonym documentation, detailed abstractor training with an instruction manual, careful attention to data security and tracking of the records, and quality assurance (Allison, Wall et al. 2000). The study protocol was approved separately by the Institutional Review Boards of the University of Alabama at Birmingham and the MCO. Abstractors were trained to protect patient confidentiality. Medical record review was performed by the MCO, and no member-identifying information was released to the collaborating academic institution.
The original algorithm (Algorithm 1) was used to identify patients with and without depression for inclusion in the medical record sample. In defining the sample for medical record review, we focused on patients with a new diagnosis of depression. Therefore, we only included members that met the criteria of Algorithm 1 during our 12-month study window and for whom the MCO had no record of depression-related treatment in the 12 months preceding January 1, 1997. We used a stratified random sampling methodology that matched depressed and nondepressed patients within brackets of age, gender, and number of comorbid medical conditions.
From administrative data, 892,786 members were eligible for the study in the Mid-Atlantic and Northeast regions. Of these, 53,170 patients met criteria for depression by Algorithm 1 using administrative data, and the remaining 839,616 patients did not meet criteria. Based upon the stratified randomized methodology described above, we abstracted the charts of 234 patients with depression and 231 patients without depression. The time frame of abstraction was from July 1, 1996, to December 31, 1997. Approximately 10 percent of all medical records were dually abstracted with an overall interrater agreement of at least 95 percent for all main variables. In particular, interrater agreement for physician-recorded diagnosis of depression was 98 percent.
The clinical standard for the diagnosis of depression is based on the Diagnostic and Statistical Manual of Mental Disorders, Version IV (American Psychiatric Association 1994). The DSM-IV criteria are traditionally ascertained through a structured medical interview (Spitzer et al. 1992; Robins et al. 1981). A diagnosis of major depression requires the presence of one of the primary symptoms of depression (depressed mood or anhedonia) plus four additional symptoms for more than two weeks (American Psychiatric Association 1994). Additional symptoms include impaired concentration or cognitive dulling, thoughts of suicide, loss of energy/fatigue, altered appetite or weight change, feelings of worthlessness and guilt, disturbances of sleep, and psychomotor retardation or agitation.
Because quality measurements are not currently based on DSM-diagnosed depression, we used the standard of physician-recognized depression determined by documentation in the medical record.
One purpose of quality measures is to accurately capture the essence of evidence-based clinical guidelines in a quantitative fashion, allowing large amounts of data to be processed for improving delivery of medical care (Weissman et al. 1999; Hofer et al. 1997; Turpin et al. 1996; Harr, Balas, and Mitchell 1996). Important foundations for quality measures include: (1) strength of supportive evidence (evidence obtained from multiple randomized controlled trials given greatest emphasis); (2) consensus of professional societies about targeted intervention; (3) existence of a performance gap with documented need to improve care; (4) ability to improve care based upon quality measure, after consideration of practical resource constraints; (5) availability of adequate and economically feasible data sources; and (6) the severity and consequences of the underlying condition.
We applied the above principles to the 1993 guidelines on treatment of depression issued by the Agency for Health Care Policy and Research (AHCPR), currently the Agency for Healthcare Research and Quality (AHRQ) (1993). The guidelines divide treatment of depression into the acute, continuation, and maintenance phases. The goal of the acute treatment phase is to achieve remission, that is, to remove all signs and symptoms of the current episode of depression and to restore psychosocial and occupational functioning. Continuation treatment is intended to prevent relapse. Recovery is achieved when the patient has been asymptomatic for at least four to nine months, at which time the clinician may consider tapering or stopping antidepressant medication under certain circumstances. Maintenance treatment prevents subsequent episodes in those at risk for recurrence. We developed three measures specifically for the acute phase (adequate follow-up, medication adherence, minimum medication dosage), two measures specifically for the continuation phase (medication adherence and minimum medication dosage), and one measure for both the acute and continuation phase (adequate trial before switching medications). Table 1 provides a more detailed definition and rationale for each measure.
Bivariate comparisons were made with the chi-square statistic for dichotomous variables and the t-test for continuous variables (Rosner 1995).
We first compared demographic characteristics and comorbidities for abstracted cases by the presence of algorithm-defined depression. Next, we examined the DSM depression symptoms for patients with and without physician-recognized depression. From the sample of 465 abstracted medical records, we compared the operating characteristics (true/false positives, true/false negatives, sensitivity, specificity, and predictive values) of each algorithm, taking physician-diagnosed depression as the standard. We first derived predictive values based on the prevalence of depression in our sample. We then examined change in predictive values over a broad prevalence range of physician-recognized depression.
Finally, we identified patients of all physicians who had a minimum of 10 members with depression by Algorithm 1 (n=32,819 patients) and Algorithm 2 (n=6,837 patients). Using administrative data, we then compared aggregate physician performance on each of the six quality measures for both patient samples.
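The minimum-volume rule for profiling can be illustrated with a short Python sketch. It assumes a hypothetical input with one entry per algorithm-identified member, giving the identifier of that member's assigned physician or office. The identifiers and the function name are illustrative only.

```python
from collections import Counter


def panels_meeting_minimum(member_panels, minimum=10):
    """Return the panel ids with at least `minimum` identified members.

    member_panels: iterable of panel ids, one entry per identified member.
    """
    counts = Counter(member_panels)
    return {panel for panel, n in counts.items() if n >= minimum}


# Hypothetical example: panel "B" has only 9 identified members and drops out.
example = ["A"] * 12 + ["B"] * 9 + ["C"] * 10
print(sorted(panels_meeting_minimum(example)))
```

Because a stricter identification algorithm shrinks each panel's count, fewer panels clear the threshold, even though the threshold itself is unchanged.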
Although the ages of the patients with and without depression were similar, a higher proportion of the patients with depression were female. In addition, patients with depression tended to have more comorbidities (Table 2).
The diagnosis of depression was recorded in 121 of the 465 abstracted medical records. Among those 121 medical records, only 9 percent contained documentation that met the American Psychiatric Association's rigorous DSM-IV criteria for diagnosis of depression (Table 3).
Table 4 gives the operating characteristics of both algorithms. Algorithm 1, which required any two depression-related events (diagnosis or medication claim) in the administrative data, had a sensitivity of 95 percent and a specificity of 65 percent. The sensitivity was quite high because there were few cases where the algorithm identified a member as not depressed but the medical record indicated a diagnosis of depression (false negative cases). Of the 234 cases identified with depression by Algorithm 1, 54 percent had no diagnosis of depression in administrative data and were identified from antidepressant use only. As expected, there was a high rate of antidepressant use among the false positive cases (84 percent). Compared to Algorithm 1, Algorithm 2, which required a diagnosis of depression, yielded a higher specificity (88 percent) and lower sensitivity (52 percent). The rate of antidepressant use among false positive cases was less at 53 percent.
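For readers less familiar with these terms, the Python sketch below computes the operating characteristics from a two-by-two table. The cell counts for Algorithm 1 are not taken from Table 4. They are back-calculated here, as an assumption, from the totals reported in the text (234 algorithm-positive and 231 algorithm-negative records, 121 with a recorded diagnosis) and the reported 95 percent sensitivity, so they should be read as approximate.

```python
def operating_characteristics(tp, fp, fn, tn):
    """Compute standard operating characteristics from 2x2 cell counts."""
    return {
        "sensitivity": tp / (tp + fn),  # flagged among chart-positive
        "specificity": tn / (tn + fp),  # not flagged among chart-negative
        "ppv": tp / (tp + fp),          # chart-positive among flagged
        "npv": tn / (tn + fn),          # chart-negative among not flagged
    }


# Back-calculated (assumed) cells for Algorithm 1:
# tp + fn = 121 chart-positive, tp + fp = 234 flagged, fn + tn = 231 not flagged.
alg1 = operating_characteristics(tp=115, fp=119, fn=6, tn=225)
for name, value in alg1.items():
    print(f"{name}: {value:.1%}")
```

With these assumed counts, the rounded results agree with the values reported for Algorithm 1 (95, 65, 49, and 97 percent).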
Unlike sensitivity and specificity, predictive values depend upon the prevalence of the index condition in the population. The method we used to construct our sample yielded a prevalence of 26 percent (121/465) for physician-recognized depression. Based upon this prevalence, the positive predictive values for Algorithms 1 and 2 were 49 percent and 61 percent, respectively, and the negative predictive values were 97 percent and 84 percent, respectively (Table 4). Figure 1A and 1B depict how the predictive values of both algorithms vary as the prevalence of physician-recognized depression varies. In the prevalence range of 5–10 percent, the positive predictive values of both algorithms remain below 33 percent, and negative predictive values remain above 94 percent. With a prevalence of 20 percent, the positive predictive values remain below 53 percent and the negative predictive values above 88 percent.
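The dependence of predictive values on prevalence follows from Bayes' rule and can be reproduced approximately from the rounded sensitivity and specificity given in the text. The Python sketch below is illustrative only. Because it starts from rounded inputs, its output can differ by about a percentage point from values computed on the underlying counts; for example, it gives a positive predictive value close to 60 percent for Algorithm 2 at a prevalence of 26 percent.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence              # true positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false positive fraction
    fn = (1 - sensitivity) * prevalence        # false negative fraction
    tn = specificity * (1 - prevalence)        # true negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Rounded operating characteristics as reported in the text.
ALGORITHMS = {"Algorithm 1": (0.95, 0.65), "Algorithm 2": (0.52, 0.88)}

for name, (sens, spec) in ALGORITHMS.items():
    for prevalence in (0.05, 0.10, 0.20, 0.26):
        ppv, npv = predictive_values(sens, spec, prevalence)
        print(f"{name} at {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```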
Using administrative medical and pharmacy data, we examined variation in office-level performance on six quality measures for Algorithms 1 and 2. Algorithm 1 identified 32,819 members as depressed, compared to 6,837 members identified by Algorithm 2. The number of primary care provider offices with a minimum of 10 depressed patients decreased from 1,756 with Algorithm 1 to 414 with Algorithm 2.
There were significant differences in quality performance reflected by the two algorithms (Figure 2). More specifically, for Algorithm 2, there were significantly higher rates of follow-up visits after initiation of antidepressant medication, appropriate dosage and medication adherence, and appropriate medication trial before switching to another antidepressant.
Our work demonstrates the difficulty in identifying patients with depression from administrative data. Recognizing that depression is underreported in administrative data, we specifically designed Algorithm 1 to explore the effects of using pharmacy codes as primary identifiers of depression. The pharmacy inclusion criteria for Algorithm 1 were broad, thus capturing more members (increased sensitivity) at the expense of generating more false positive diagnoses (decreased specificity and positive predictive value). The more stringent Algorithm 2, which required a diagnosis of depression, had a better specificity, but much lower sensitivity.
Both algorithms suffered from low positive predictive value and thus frequently misclassified patients as having depression. Predictive value depends upon the prevalence of the underlying condition. Only in highly selected primary care patient populations does the prevalence of depression approach 20 percent (Pearson et al. 1999), corresponding to a positive predictive value of less than 53 percent for both algorithms. One study found the prevalence to range from 5 to 10 percent for unselected elderly patients (Barry et al. 1998), corresponding to a positive predictive value of less than 33 percent for both algorithms. This means that if administrative data were used to derive quality measures for depression in such a population, fewer than 33 percent of the patients in the denominator would actually have a physician diagnosis of depression by chart review.
These findings are especially important given the close relationship of our quality measures to the HEDIS measure for Antidepressant Medication Management. Linking Algorithm 2 with quality measures 1, 2, and 3 approximates the current HEDIS technical specifications (Allison, Wall et al. 2000). Algorithm 2 uses the same pharmacy and ICD-9 codes as HEDIS for denominator construction. Both approaches require each member to have at least 12 months of continuous enrollment in a managed care plan with pharmacy benefits.
However, there are some differences between our quality measures and the HEDIS measure. Our quality measures were intended to reflect the 1993 AHCPR guidelines and not to duplicate the HEDIS Antidepressant Medication Management measure, which was in draft format at the time of data collection for this study. Similar to the HEDIS measure, we required a new diagnosis of depression because many of our quality measures apply to the acute phase of depression. In contrast to the HEDIS measure, which allows either a primary or secondary diagnosis of depression, we required a primary diagnosis.
Quality measure 1 differs from the corresponding HEDIS measure by requiring one visit to a primary care provider within 6 weeks of diagnosis, whereas the HEDIS measure requires three visits within 12 weeks. We made our criteria for follow-up more lenient because, in certain cases, telephone contact without an office visit is appropriate and would not be reflected in the administrative data. We included two measures of antidepressant medication adherence. Our measure 2, Adherence during the Acute Phase of Treatment, is similar to the HEDIS measure, as both reflect adherence during the first three months of treatment, allowing for gaps in medication supply. Our measure 3, Adherence during the Maintenance Phase of Treatment, examined adherence within the first four months of treatment, while the HEDIS measure examines adherence during the first six months. We constructed measure 3 to reflect adherence to the minimum recommended by the 1993 AHCPR guideline (i.e., four months).
Our work also reveals important differences in quality measurement according to which algorithm was used to define the denominator. Algorithm 1, associated with a lower positive predictive value and correspondingly higher false positive rate, led to lower rates for each quality measure. In fact, three of the measures differed by 7–8 percent. Such underreporting of quality performance is important and may lead to loss of credibility in provider feedback with crippling of improvement efforts (Allison, Calhoun et al. 2000). Therefore, when planning a quality improvement project, positive predictive value is probably the most important operating characteristic of a disease-identification algorithm.
The number of identified members, which increases with the sensitivity of the algorithm, has implications when generating performance profiles. Creating valid physician profiles requires sufficient patient numbers. In this study, requiring a diagnostic code to identify members with depression reduced the eligible member population by more than two-thirds (from 32,819 to 6,837 members). Consequently, fewer practices meet minimum volume criteria for individualized performance profiles.
Even beyond the difficulties imposed by administrative data, several factors contribute to the diagnostic challenges of depression. Although depression affects up to 10 percent of the U.S. population at an estimated annual cost of $44 billion (Hall and Wise 1995) and produces impairment in quality of life similar to that of other serious chronic diseases (Wells, Stewart et al. 1989), the diagnosis is often missed by physicians. Primary care physicians recognize only about one-half of all depressed patients in the outpatient setting (Wells, Hays et al. 1989; Kessler, Cleary, and Burke 1985; Borus et al. 1988). The detection rate by primary care physicians falls to 30 percent for patients with significant medical comorbidity (Tylee, Freeling, and Kerry 1993). This may result from physicians attributing signs and symptoms of depression to other medical illness. Somatic symptoms used in making the diagnosis of depression (e.g., fatigue, sleep disturbance, weight loss) are also presenting features of many other medical illnesses. Subjective symptoms such as depressed mood and anhedonia may also be inappropriately regarded as an understandable reaction to illness.
Concern about patient confidentiality and the potential for jeopardizing reimbursement and other benefits may also lead physicians to deliberately substitute alternative diagnoses on claims and encounters. In a survey of 440 primary care physicians randomly selected from the membership lists of professional organizations, 50 percent of respondents reported substituting another diagnostic code in the prior two weeks for one or more patients who met the criteria for major depression (Rost et al. 1994). Physicians may underreport depression to protect patients from social stigma and possible occupational and legal consequences (Hoyt et al. 1997; Hirschfeld et al. 1997). For example, medical records are often subpoenaed during custody hearings. In addition, many physicians may be uncertain about making such diagnoses because of limited training with behavioral health disorders. As a result, valid cases of depression are not identified by physicians, let alone by algorithms based on administrative data. Furthermore, patients identified from administrative data may represent more severe cases (Valenstein et al. 2000).
Some patients are reluctant to express psychological symptoms and may deny mood changes. These patients may present instead with a variety of nonspecific somatic complaints such as headache, abdominal pain, insomnia, weight loss, or low energy. This symptom overlap leads to a complex interaction between depression and medical comorbidity. Therefore, medical illness frequently presents as depression and depression as other medical illness. This problem is especially troubling in the elderly, who suffer from a higher burden of comorbidity (Coulehan et al. 1990).
The overlap in treatment of depression and other medical conditions makes identification of depressed patients from pharmacy data difficult. Antidepressants are now used for a wide variety of diseases other than depression such as chronic pain, neuropathic pain, fibromyalgia, chronic fatigue syndrome, migraine and tension headaches, irritable bowel disease, premenstrual dysphoric disorder, insomnia, eating disorders, premature ejaculation, panic disorder, post-traumatic stress disorder, social phobia, and anger attacks (Barkin et al. 1996a, 1996b; Compas et al. 1998; Davies et al. 1996; Fishbain et al. 1998; Keck and McElroy 1997; McQuay et al. 1996; Merskey 1997; Metz and Pryor 2000; Moreland and St. Clair 1999; Pappagallo 1999; Simon and Von Korff 1997). These syndromes have variable overlap with clinical depression, and it is often difficult to discern which problem is primary (Keck and McElroy 1997; Clarke 1998). Often a corresponding ICD-9 diagnosis of depression cannot be found when an antidepressant prescription is written at an office visit. Given the tolerability and safety of the newer antidepressants, clinicians may tend to prescribe these agents for nonspecific psychiatric symptoms or behavioral problems (e.g., stress) without making a clear diagnosis. As a result, there is concern about inappropriate use of antidepressants (Bouhassira et al. 1998).
The use of antidepressants for disorders other than depression and the treatment of depression without a diagnosis are both reflected in the rate of antidepressant use among members with a false positive diagnosis. Algorithm 2 produced fewer cases of false positive identification and, even among the false positive cases, the rate of antidepressant use was lower for Algorithm 2 (53 percent versus 84 percent).
The limited observation period available through administrative databases of health plans is both a strength and limitation. Administrative data permits longitudinal observation at the member level, unlike certain other data types. However, annual member disenrollment averaged 29 percent in 1999 for HMOs reporting to NCQA's Quality Compass 2000 (2000). This limits identification of new cases. Although recently enrolled members may appear to be newly identified with a disease, it is possible that the disease is long-standing. Variation in quality performance for new cases of major depression may partially result from undetected variation in disease chronicity. Katon did not find important differences in antidepressant treatment patterns in a staff-model HMO after adjusting for multiple factors, including prior history of depressive episodes (Katon et al. 2000).
Our paper raises the need for caution in interpreting quality measures based on administrative data. For example, we found that rates of appropriate antidepressant treatment (e.g., follow-up visits, appropriate dosage) were substantially higher when the specifications for the member population required a diagnosis of depression. Likewise, Kerr found that variations in the specifications of quality-of-care measures for depression treatment influenced conclusions about the adequacy of antidepressant prescribing patterns in two managed care practices (Kerr et al. 2000). Kerr varied the definition of a new episode of depression (four-month versus nine-month clean period) and the minimum number of visits with a diagnosis of depression. Patients with two or more visits coded for depression were more likely to receive antidepressants at the appropriate dosage than those with only one visit coded for depression. One possible explanation is that the increased specificity gained by requiring a diagnosis code leads to the identification of patients with more severe depression, as opposed to a temporary crisis or generalized stress.
Our study also has specific implications for the training of primary care physicians. We found that primary care physicians who documented a diagnosis of depression in the medical record rarely documented the symptoms required to make that diagnosis using the DSM-IV criteria for major depression. This finding may reflect poor documentation rather than poor interviewing skills. Other studies suggest that interviewing style of the primary care physician is related to the recognition of depression (Badger et al. 1994; Robbins et al. 1994). Training and feedback based upon medical record review has been shown to increase both recognition of depression and documentation of symptoms (Linn and Yager 1980). In addition, automated office screening tools have been shown to increase recognition of depression without placing excessive demands on physicians (Zung et al. 1983; Moore, Silimperi, and Bobula 1978; Hoeper et al. 1984; Magruder-Habib, Zung, and Feussner 1990).
We found that accurate identification of patients with physician-recognized depression from administrative data poses significant difficulty. In addition, observed quality varied significantly with algorithm operating characteristics, with lower observed quality being associated with lower algorithm specificity and a greater number of members being falsely identified as having depression. This suggests that low-quality performance may, in part, be attributed to the specific algorithms used to identify the study population. Low specificity, and the associated false classification of patients as having depression, may inappropriately lower quality performance scores and decrease confidence in performance feedback.
Supported in part by the Academic Medicine and Managed Care Forum, grant nos. HS09446 and HS1112403