To examine the effects of varying diagnostic and pharmaceutical criteria on the performance of claims-based algorithms for identifying beneficiaries with hypertension, heart failure, chronic lung disease, arthritis, glaucoma, and diabetes.
Secondary 1999–2000 data from two Medicare+Choice health plans.
Retrospective analysis of algorithm specificity and sensitivity.
Physician, facility, and pharmacy claims data were extracted from electronic records for a sample of 3,633 continuously enrolled beneficiaries who responded to an independent survey that included questions about chronic diseases.
Compared with an algorithm that required a single medical claim listing the diagnosis in a one-year period, requiring either that the diagnosis be listed on two separate claims or that it be listed on one claim for a face-to-face encounter with a health care provider significantly increased specificity for the conditions studied by 0.03 to 0.11. Specificity of algorithms was significantly improved by 0.03 to 0.17 when both a medical claim with a diagnosis and a pharmacy claim for a medication commonly used to treat the condition were required. Sensitivity improved significantly by 0.01 to 0.20 when the algorithm relied on a medical claim with a diagnosis or a pharmacy claim, and by 0.05 to 0.17 when two years rather than one year of claims data were analyzed. Algorithms that had specificity greater than 0.95 were found for all six conditions. Sensitivity above 0.90 was not achieved for all conditions.
Varying claims criteria improved the performance of case-finding algorithms for six chronic conditions. Highly specific, and sometimes sensitive, algorithms for identifying members of health plans with several chronic conditions can be developed using claims data.
Health services researchers, health plan administrators, and health policymakers often need to identify people with specific chronic medical conditions, such as diabetes or heart failure. For example, researchers may be interested in assessing the outcomes of alternative treatments. Health plan administrators may wish to identify members for quality improvement or disease management programs. Policymakers may be interested in tracking access to or quality of care. These efforts typically rely on diagnoses listed on physician and hospital claims submitted by providers to health insurers, and pharmacy claims when available, to identify patients who have the conditions of interest.
Information about chronic conditions in administrative databases may be incomplete or inaccurate for a variety of reasons (Virnig and McBean 2001). For example, health care that is not covered or billed is not reflected in the data. Information that is unnecessary for processing payments may not be collected or recorded accurately. Even when care is sought for a chronic condition, the diagnosis might not appear on provider claims (Horner et al. 1991; Fowles et al. 1995; Fowles, Fowler, and Craft 1998). Conversely, diagnoses listed on claims may be related to testing for disease rather than confirmed disease. Despite these caveats, administrative data are used extensively by researchers and by the National Committee for Quality Assurance, the Centers for Medicare and Medicaid Services, and managed care organizations to identify people who have specific chronic conditions (National Committee for Quality Assurance 1999; Centers for Medicare and Medicaid Services 2002a; Centers for Medicare and Medicaid Services 2002b).
Few studies have described efforts to maximize the performance of claims-based algorithms for identifying health plan enrollees with chronic medical conditions. We previously found that the positive predictive value of claims collected over a two-year period could be improved by requiring two or more claims rather than one claim listing a diagnosis of hypertension, or by combining the diagnostic criterion with the requirement for a pharmacy claim for a medication commonly used to treat hypertension (Quam et al. 1993). In a comparison of self-reports of chronic conditions and administrative data, Robinson et al. (1997) found that more cases of self-reported diabetes or hypertension were confirmed by administrative data as the number of years of claims data increased. Others studied factors that affect the sensitivity and specificity of case-finding algorithms using Medicare claims to identify people who reported they had diabetes on the Medicare Current Beneficiary Survey (Hebert et al. 1999). The sensitivity of their algorithms increased by approximately 10 percent, with a small drop in specificity, when two years rather than one year of claims data were used. For different types of claims (hospital, physician, etc.), requiring the diabetes diagnosis to be the first one listed on the claim, to be listed on two claims rather than one, or to be listed on claims associated with direct physician contact slightly improved the already near-perfect specificity and lowered the sensitivity.
Further investigation of the sensitivity and specificity of case-finding algorithms for chronic conditions using different administrative data is warranted given the extensive use of claims data for this purpose. This study extends previous research by systematically evaluating the performance of a wide range of alternative claims-based algorithms for identifying members of Medicare+Choice plans with six chronic medical conditions that are prevalent in the elderly population and are frequently the focus of research studies and managed care initiatives. We used survey data collected from members of two Medicare+Choice health plans to assess the sensitivity and specificity of algorithms that required different claims criteria for the diagnosis, more than one claim with the same diagnosis, one or two years of claims history, and incorporation of pharmacy claims to identify patients with hypertension, heart failure, diabetes, arthritis, glaucoma, or chronic lung disease.
Both study plans had Medicare risk contracts under the Medicare+Choice program and contracted with health care providers in their communities to create provider networks. One plan was located in a Midwestern metropolitan area, the other in the Northeast. Members of both plans were required to select a primary care physician. Only members of the Northeastern plan were required to obtain a referral before they saw a specialist. During the period of the study (1999 and 2000), the copayment for an office visit was $10 in the Midwestern plan and $15 in the Northeastern plan. The annual limit on drug benefits in the Midwestern plan was $900 in 1999 and $500 in 2000; the limit in the Northeastern plan was $300 for both years.
A study of the influence of socioeconomic status on the utilization of medical care by elderly members of the study plans was the source of information on chronic conditions. Briefly, members who were enrolled in the plans in January 2000 were randomly sampled within socioeconomic strata defined by dual (Medicare and Medicaid) eligibility and household income at the zip code level. Members who were disabled and less than 65 years old, members in institutions or hospice programs, and members with end-stage renal disease were excluded from the study population. All 942 dually eligible members of the health plans were included in the sample, as were 700 members living in zip codes where approximately 50 percent of the households had incomes less than 200 percent of the poverty level, and 5,354 other members. The total sample consisted of 6,996 members equally distributed between the two plans.
Telephone interviews were completed for 4,613 members in the sample between April and October 2000 (response rate, 72 percent, after adjusting for ineligible subjects). Six of the interview questions asked, “Has a doctor ever told you that you had (1) high blood pressure or hypertension, (2) congestive heart failure, (3) chronic lung disease such as chronic bronchitis, emphysema, or asthma, (4) arthritis, (5) glaucoma, or (6) diabetes or high blood sugar?” Respondents answered yes, no, or don't know or refused to answer these questions. Our analysis excluded the 0.1 percent to 0.6 percent of respondents who did not give a definite yes or no answer to each question about the presence of a chronic condition. An additional 548 surveys completed with proxy respondents were excluded as well.
We used physician, facility, and pharmacy claims to develop case-finding algorithms for each chronic condition. Physician claims contained information on each service provided to health plan members by physicians, including the Physicians' Current Procedural Terminology (CPT) code for the service, the date of the service, and up to four diagnoses identified using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnosis codes. We used the CPT codes to identify claims for patient visits to a physician (so-called face-to-face encounters), which may have more valid diagnostic codes than other types of claims, such as those for laboratory tests or diagnostic procedures, which are often used to rule out a condition even though they might have been done directly by a physician. The CPT codes representing a visit to the emergency room were not included as face-to-face encounters because of the potential for more “rule-out” diagnoses during emergency care. (See appendix for codes used to define face-to-face encounters. Online-only appendix is available at http://www.blackwell-synergy.com.)
Facilities (for example, hospitals) submit claims for inpatient, outpatient, and ancillary services. Each facility claim listed a date of service, up to nine diagnosis codes, and up to six procedures using either CPT or ICD-9-CM procedure codes. Facility claims are itemized by “revenue codes” that represent units of service such as use of an inpatient bed day, an emergency room visit, or a radiologic test. Similar to physician claims, facility claims were classified as face-to-face encounters using procedure codes for physician visits or revenue codes. The revenue codes typically represent room and board charges. (See appendix for codes used.)
Pharmacy claims list the date the medication was dispensed and the National Drug Code (NDC) that was linked to proprietary codes for generic ingredients. Pharmacy claims were further grouped and coded by therapeutic class such as medications used to treat diabetes or hypertension or pharmacological class such as loop diuretics, angiotensin-converting enzyme inhibitors, or nonsteroidal anti-inflammatory agents to simplify incorporation into algorithms.
We evaluated 38 different algorithms for each of the six chronic conditions using physician, facility, and pharmacy claims for 3,633 survey respondents who were continuously enrolled during the two-year period from 1999 to 2000 and met the other inclusion criteria. The ICD-9-CM diagnosis codes and types of medications used to identify members with the six conditions are listed in Table 1.
Case-finding algorithms for each condition were developed and tested by systematically varying several factors. Three algorithms searched claims from 1999 for (1) an identifying diagnosis on any physician or facility claim, (2) an identifying diagnosis on a face-to-face claim as previously defined, and (3) an identifying diagnosis listed in the primary position on a face-to-face claim. Three additional algorithms searched diagnostic information in claims as above except they required an identifying diagnosis to be listed on two claims with different dates of service.
One algorithm used only pharmacy claims from 1999 to identify individuals. Six algorithms identified cases using an identifying pharmacy claim or one of the six diagnostic criteria, while an additional six algorithms required the presence of both an identifying pharmacy claim and one of six diagnostic criteria. Finally, we applied each of the above 19 algorithms to claims from both 1999 and 2000.
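The enumeration described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the labels (`any_claim`, `face_to_face`, `face_to_face_primary`) and the dictionary layout are hypothetical shorthand for the criteria in the text. The sketch simply confirms that the design yields 19 algorithms per claims period and 38 in total.

```python
from itertools import product

# Hypothetical shorthand for the three diagnostic scopes described in the text.
DIAGNOSIS_SCOPES = ("any_claim", "face_to_face", "face_to_face_primary")
# One claim, or two claims with different dates of service.
MIN_CLAIMS = (1, 2)
# One year of claims, or both years.
PERIODS = ("1999", "1999-2000")


def one_period_algorithms():
    """Build the 19 algorithm specifications applied to a single claims period."""
    diagnostic = list(product(DIAGNOSIS_SCOPES, MIN_CLAIMS))  # 6 diagnostic criteria
    # Diagnosis-only algorithms.
    specs = [{"diagnosis": d, "pharmacy": False, "combine": None} for d in diagnostic]
    # Pharmacy-only algorithm.
    specs.append({"diagnosis": None, "pharmacy": True, "combine": None})
    # Diagnosis OR pharmacy, then diagnosis AND pharmacy.
    for combine in ("or", "and"):
        specs.extend(
            {"diagnosis": d, "pharmacy": True, "combine": combine} for d in diagnostic
        )
    return specs


def all_algorithms():
    """Apply every single-period specification to each claims period."""
    return [
        dict(spec, period=period)
        for period in PERIODS
        for spec in one_period_algorithms()
    ]


print(len(one_period_algorithms()), len(all_algorithms()))  # prints "19 38"
```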
In preliminary analyses, we evaluated claims-based algorithms that incorporated therapeutic procedures such as administration of home oxygen as an indication of chronic lung disease, administration of intravenous inotropes for heart failure, and intravenous infusion of infliximab or joint replacement surgery for arthritis. In general, it was rare to have a claim for a therapeutic procedure without a claim listing the diagnosis of interest. Consequently, information about therapeutic procedures did not enhance the claim-based algorithms, and was not used in the algorithms presented in this paper.
We adopted members' self-reports on the survey as the gold standard for determining who had a chronic condition, and calculated the sensitivity and specificity of the claims-based algorithms relative to this gold standard. We defined sensitivity as the proportion of members reporting that a doctor told them they had a condition who were identified by an algorithm as having the condition. We defined specificity as the proportion of members reporting that they did not have a condition who were not identified by the algorithm. The sensitivity and specificity calculated for the 19 algorithms based on one year of claims were summarized for each condition by plotting the sensitivity on the vertical axis and one minus the specificity on the horizontal axis, similar to a receiver operating characteristic (ROC) curve (McNeil, Keeler, and Adelstein 1975). One minus the specificity represents the proportion of people not reporting a condition who were apparently falsely identified by an algorithm. McNemar's test was used to determine when sensitivity or specificity of algorithms differed significantly when applied to the same sample (Bennett 1972).
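As a minimal sketch of these definitions, the Python below computes sensitivity and specificity from paired self-report and algorithm flags, and an exact two-sided McNemar p-value from discordant counts. The function names and the toy data are hypothetical, and the exact binomial form of McNemar's test is an assumption: the text cites Bennett (1972) but does not say which variant was applied.

```python
from math import comb


def sensitivity_specificity(self_report, flagged):
    """Return (sensitivity, specificity) of an algorithm against self-report.

    self_report[i] is True if member i reported the condition;
    flagged[i] is True if the algorithm identified member i.
    Assumes at least one member in each self-report group.
    """
    pairs = list(zip(self_report, flagged))
    tp = sum(1 for s, f in pairs if s and f)
    fn = sum(1 for s, f in pairs if s and not f)
    tn = sum(1 for s, f in pairs if not s and not f)
    fp = sum(1 for s, f in pairs if not s and f)
    return tp / (tp + fn), tn / (tn + fp)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    For a sensitivity comparison, b and c count members who reported the
    condition and were flagged by only the first or only the second algorithm.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data, invented for illustration: 4 members report the condition, 6 do not.
self_report = [True] * 4 + [False] * 6
flagged = [True, True, True, False] + [True] + [False] * 5
sens, spec = sensitivity_specificity(self_report, flagged)
print(sens, round(spec, 3))
```

Because both algorithms are applied to the same members, only the discordant pairs carry information about the difference, which is why a paired test is used rather than a comparison of two independent proportions.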
We also calculated the ratio of the sensitivity to one minus the specificity, known as the likelihood ratio positive (LR+). The LR+ is a measure of the ability of an algorithm to “rule in” a condition, and represents how much the odds that a plan member has a condition increase when identified by the algorithm as having it. In addition, we calculated the ratio of one minus the sensitivity to the specificity, known as the likelihood ratio negative (LR−). The LR− is a measure of the ability of an algorithm to “rule out” a condition, and represents how much the odds that a member has a condition decrease when identified by the algorithm as not having it.
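The two ratios can be written out directly. The sketch below uses hypothetical function names and, as inputs, two hypertension operating points reported in the Results (sensitivity 0.32 with specificity 0.96, and sensitivity 0.90 with specificity 0.60). The conversion from prior probability to post-test probability is the standard odds form of Bayes' rule, included to show what "how much the odds increase" means in practice.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for an algorithm with the given sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(prevalence, likelihood_ratio):
    """Update a prior probability (prevalence) with a likelihood ratio via odds."""
    prior_odds = prevalence / (1 - prevalence)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Most specific hypertension algorithm in the Results: sensitivity 0.32, specificity 0.96.
lr_pos, _ = likelihood_ratios(0.32, 0.96)
# Most sensitive hypertension algorithm in the Results: sensitivity 0.90, specificity 0.60.
_, lr_neg = likelihood_ratios(0.90, 0.60)
print(round(lr_pos, 1), round(lr_neg, 2))
```

With the rounded inputs above, the results agree with the LR+ of 8 and the LR− of 0.17 reported for hypertension.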
Self-reported prevalence varied substantially across the six chronic conditions in the study. Hypertension (58 percent) and arthritis (56 percent) were the most common conditions, while diabetes (18 percent), chronic lung disease (14 percent), glaucoma (10 percent), and heart failure (6 percent) were reported less frequently. The prevalence of these conditions among the 3,633 individuals used to study the sensitivity and specificity of the claims-based algorithms was similar to the prevalence in the entire group of 4,613 survey respondents (data not shown).
The ROC curves in Figures 1 and 2 summarize the results from the 19 algorithms applied to the 1999 claims data for each condition. A supplemental appendix showing the sensitivity and specificity of all 38 algorithms for all six conditions is available in the electronic version of this report (see http://www.blackwell-synergy.com). The sensitivity and specificity of the algorithms varied considerably for all the conditions. For hypertension, the specificity ranged from 0.60 to 0.96, while the sensitivity ranged from 0.32 to 0.90. For heart failure, the specificity ranged from 0.75 to 0.99 and the sensitivity from 0.23 to 0.78. For chronic lung disease, the specificity ranged from 0.87 to 0.99 and the sensitivity from 0.22 to 0.62. For arthritis, the specificity ranged from 0.77 to 0.99 and the sensitivity from 0.07 to 0.55. For glaucoma, the specificity ranged from 0.95 to 0.99 and the sensitivity from 0.32 to 0.73. Lastly, for diabetes, the specificity ranged from 0.93 to 0.99 and the sensitivity from 0.44 to 0.91.
Interestingly, the ROC curves for heart failure and for arthritis exhibited a “discontinuity” in which seven of the algorithms do not appear to follow the contour of the curve established by the other 12 algorithms. The seven algorithms are displaced toward lower specificities, although they generally have higher sensitivities. These seven algorithms used only pharmacy claims, or an identifying pharmacy claim or one of the six diagnostic criteria. Use of pharmacy claims alone to identify cases of arthritis or heart failure increased sensitivity at the price of a substantial decline in specificity, especially for heart failure.
The algorithm that required at least two face-to-face claims with a first-listed diagnosis and at least one prescription for a medication commonly used to treat the condition produced the highest specificity for all six conditions. The specificity of this algorithm was 0.99 for all the conditions except hypertension, where it was 0.96. However, the sensitivity of this algorithm was uniformly low: 0.32, 0.23, 0.22, 0.07, 0.32, and 0.44, respectively, for hypertension, heart failure, chronic lung disease, arthritis, glaucoma, and diabetes. This algorithm had the highest LR+ for all six conditions, with values of 8, 24, 24, 21, 50, and 23. Therefore, it is useful for identifying cases that almost surely have the condition of interest—that is, for “ruling in” the condition—although it yields many false negatives. The algorithm that required at least two face-to-face claims with a diagnosis in any position and at least one prescription for a medication commonly used to treat the condition had lower specificity by only 0.01 or less for all the conditions compared to the algorithm that required the diagnosis to be in the first-listed position, and it had higher sensitivity by 0.03 to 0.08. Thus requiring that the diagnosis be the first one listed did not improve specificity very much while reducing sensitivity appreciably.
The algorithm that required only one diagnosis in any position on any type of claim or a pharmacy claim produced the highest sensitivity for all six conditions. The sensitivity of this algorithm for hypertension, heart failure, chronic lung disease, arthritis, glaucoma, and diabetes was 0.90, 0.78, 0.62, 0.55, 0.73, and 0.91, respectively. Conversely, the specificity of this algorithm was lower than for most other algorithms: 0.60, 0.75, 0.87, 0.77, 0.95, and 0.93. This algorithm had the lowest LR− for all six conditions, with values of 0.17, 0.30, 0.43, 0.59, 0.28, and 0.10. Thus it is useful for “ruling out” the condition of interest.
In practice, researchers and health plan managers may often wish to use algorithms that have high specificity, to avoid falsely identifying as cases people who do not actually have the conditions of interest, but that simultaneously possess good sensitivity, to identify as many true cases as possible. As shown in Figures 1 and 2, several other algorithms had nearly as high specificity and better sensitivity compared with the algorithms that required at least two face-to-face claims with the diagnosis and at least one prescription for a medication used to treat the condition. Table 2 lists algorithms for each condition that had the highest sensitivity while maintaining the specificity above 0.90. Whenever the most sensitive algorithm required pharmacy claims, the most sensitive without pharmacy claims is tabulated as well. Also, when two algorithms that did not incorporate pharmacy claims had similar sensitivity, both are listed. The most sensitive algorithms that still had specificity greater than 0.90 relied on pharmacy claims to help detect hypertension, chronic lung disease, glaucoma, and diabetes, although for each of these conditions there was also an algorithm that was nearly as sensitive that did not require pharmacy claims. Requiring only one claim with diagnosis appeared to be a useful approach for identifying members with diabetes, glaucoma, heart failure, chronic lung disease, and arthritis even though the diagnosis had to be on a face-to-face claim for the latter two conditions. It is important to note that the sensitivity of the algorithms listed in Table 2 is modest for all the conditions except glaucoma and diabetes.
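The selection rule behind Table 2 (highest sensitivity subject to specificity above 0.90, with a pharmacy-free alternative where one exists) can be sketched as follows. The function name, the record layout, and the algorithm labels are hypothetical. The input holds only the four hypertension operating points quoted in the text, not the full set of 19, so the output illustrates the rule and is not necessarily the Table 2 entry.

```python
def most_sensitive_above(results, min_specificity=0.90, allow_pharmacy=True):
    """Return the most sensitive result whose specificity exceeds the threshold.

    Returns None when no algorithm qualifies.
    """
    eligible = [
        r for r in results
        if r["specificity"] > min_specificity
        and (allow_pharmacy or not r["uses_pharmacy"])
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["sensitivity"])


# Hypertension operating points quoted in the text (one year of claims).
# Labels are illustrative shorthand, not the authors' names.
hypertension = [
    {"label": "1 dx any claim OR rx", "sensitivity": 0.90, "specificity": 0.60, "uses_pharmacy": True},
    {"label": "2 face-to-face primary dx AND rx", "sensitivity": 0.32, "specificity": 0.96, "uses_pharmacy": True},
    {"label": "1 face-to-face dx AND rx", "sensitivity": 0.52, "specificity": 0.91, "uses_pharmacy": True},
    {"label": "1 dx any claim", "sensitivity": 0.83, "specificity": 0.69, "uses_pharmacy": False},
]

best = most_sensitive_above(hypertension)
best_without_pharmacy = most_sensitive_above(hypertension, allow_pharmacy=False)
print(best["label"] if best else None, best_without_pharmacy)
```

Among these four points, no pharmacy-free algorithm clears the 0.90 threshold, so the second call returns `None`; with the full set of algorithms the answer could differ.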
Tests of the effects of using different diagnostic criteria on the specificity and sensitivity of case-finding algorithms are shown in Table 3. Compared with requiring any one claim with a diagnosis, requiring that the diagnosis be listed on a claim for a face-to-face encounter or requiring at least two claims that list the diagnosis increased the specificity for all conditions, but also reduced sensitivity. The effects on specificity were smallest when the specificity was already greater than 0.90. Requiring that the diagnosis on the claim for a face-to-face encounter be the first one listed increased the specificity only slightly, but decreased the sensitivity appreciably for most of the conditions.
Sensitivities and specificities of algorithms that incorporated pharmacy claims are shown in Table 4. Except for heart failure, algorithms based only on pharmacy claims were as specific as or more specific than, but less sensitive than, the algorithms based on a diagnosis in any position on a single claim. Algorithms that used diagnosis or pharmacy criteria to identify members had better sensitivity; however, they had lower specificity, with the exception of algorithms requiring one diagnosis for diabetes and glaucoma. Use of diagnosis and pharmacy criteria significantly increased the specificity of case-finding algorithms for all conditions except pharmacy-based algorithms for diabetes and glaucoma that already had a specificity of 0.99. Requiring both a diagnosis and pharmacy claim reduced sensitivity.
When the algorithms that required a single diagnosis on any claim were applied to two years of claims data (1999 and 2000) rather than one year (1999), the sensitivity of the algorithms increased significantly for all conditions: diabetes (0.90 to 0.95), hypertension (0.83 to 0.92), heart failure (0.58 to 0.74), arthritis (0.43 to 0.60), chronic lung disease (0.59 to 0.70), and glaucoma (0.68 to 0.75). With the exception of hypertension, reductions in specificity were similar to or less than the gains in sensitivity: diabetes (0.93 to 0.88), hypertension (0.69 to 0.56), heart failure (0.93 to 0.88), arthritis (0.87 to 0.77), chronic lung disease (0.89 to 0.82), and glaucoma (0.95 to 0.92). Table 5 lists algorithms using two years of data that had the highest sensitivity while maintaining specificity above 0.90, except for hypertension, where algorithms with the highest specificity are presented. Four conditions (hypertension, heart failure, glaucoma, and diabetes) had algorithms with a sensitivity ≥ 0.70 when two years of claims data were analyzed.
Case identification algorithms based on diagnoses listed on physician and facility claims and pharmacy claims exhibited varying performance in identifying members of Medicare+Choice plans who reported they had one of six chronic conditions. Algorithms with a specificity of at least 95 percent were found for all conditions. However, highly specific algorithms may not identify substantial proportions of members who have the conditions of interest because these algorithms typically have low sensitivity. Diabetes was the only condition where an algorithm had a specificity and sensitivity greater than 0.90.
The positive and negative predictive values of these algorithms will depend on the prevalence of the condition in the population being screened as well as the sensitivity and specificity (McNeil, Keeler, and Adelstein 1975). For example, assuming 58 percent of the study population truly did have hypertension, the positive predictive value for finding members with hypertension using the algorithm requiring one face-to-face claim and a pharmacy claim that had an estimated specificity of 0.91 and sensitivity of 0.52 would be 89 percent. In other words, 89 percent of the members identified by the algorithm would be expected to self-report they had hypertension. However, the positive predictive value would be only 34 percent for a less prevalent condition such as heart failure (estimated prevalence = 6 percent) even though the estimated specificity and sensitivity of the algorithm requiring one claim with the diagnosis were 0.93 and 0.58, respectively.
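The dependence on prevalence follows from Bayes' rule and can be checked with a short calculation. The function name below is hypothetical. The inputs are the rounded figures given in the text: prevalence 0.58, sensitivity 0.52, and specificity 0.91 for hypertension, and prevalence 0.06 with the single-diagnosis heart failure values reported in the Results (sensitivity 0.58, specificity 0.93).

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) as expected population fractions via Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1 - sensitivity)
    true_neg = (1 - prevalence) * specificity
    false_pos = (1 - prevalence) * (1 - specificity)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


ppv_hypertension, _ = predictive_values(0.58, 0.52, 0.91)
ppv_heart_failure, _ = predictive_values(0.06, 0.58, 0.93)
print(round(ppv_hypertension, 2), round(ppv_heart_failure, 3))
```

With these rounded inputs the hypertension value is about 0.89, matching the text. The heart failure value comes out near 0.346, close to the reported 34 percent; the small gap presumably reflects rounding of the published sensitivity and specificity.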
Selection of an appropriate algorithm depends on whether it is more important to find as many cases of a condition as possible using an algorithm that has high sensitivity but that also identifies many false positives, or to find fewer but nearly certain cases using an algorithm that has high specificity but misses many members who have the condition. When high sensitivity is of primary interest, our findings suggest that the sensitivity of claims-based algorithms might be improved by (1) using either diagnostic or pharmacy criteria to identify cases, or (2) using a longer period of time to capture claims that list a diagnosis for a chronic condition or a medication. Although these techniques improved the sensitivity of algorithms for all six conditions, the magnitude of the effect varied considerably. We can only speculate about what caused this variation. Sensitivity of the diagnosis-based algorithm for diabetes was already 0.90, leaving little room for improvement by examining pharmacy claims or a longer period of time. Pharmacy claims would not identify people with conditions that are not being treated with medication, such as hypertension or diabetes being controlled by diet alone and some types of chronic lung disease that are generally not responsive to medications. Adherence to medication prescriptions might vary across conditions as well. People with arthritis might be using nonprescription medications that are generally not covered by drug benefits. Unlike medications for diabetes and glaucoma, medications for heart failure are not used solely for this condition. Therefore, use of pharmacy claims may increase sensitivity of algorithms for heart failure while substantially decreasing specificity.
We did not determine when use of pharmacy claims, or longer periods of time, or both, would be the best way to improve the sensitivity of case-finding algorithms. Information in the supplemental electronic appendix could be used to explore these questions (see http://www.blackwell-synergy.com). Additional investigations are also needed to determine whether the sensitivity of algorithms might be improved by including claims for diagnostic procedures that are typically done by a physician. We excluded these types of “face-to-face” claims because of concerns that the diagnosis code might be the indication for the test rather than the conclusion from test results; thus, any gain in sensitivity from using claims for diagnostic procedures might be accompanied by a decrease in specificity. Furthermore, if physicians usually submit claims for both the diagnostic procedure and a patient visit, addition of the claims for diagnostic tests might not add any information. We observed this phenomenon when we tried to improve the sensitivity of algorithms by using claims for therapeutic procedures.
When specificity is most important, algorithms might be improved by requiring that the diagnosis be listed on a claim for a face-to-face physician encounter, requiring both a provider claim that lists the diagnosis and a pharmacy claim for a medication commonly used to treat the condition of interest, or requiring at least two claims with different dates of service that list the diagnosis of interest. Algorithms that required two face-to-face claims and at least one corresponding prescription had good specificity for all conditions studied.
In many practical applications, researchers and health plan administrators may wish to use algorithms that have high specificity and as high a sensitivity as possible, both to identify as many cases for study as is feasible and to minimize the selection bias that can occur when cases are identified using algorithms with very low sensitivity. Tables 2 and 5 can be used by researchers and plan administrators as guides for choosing appropriate algorithms depending on whether they have one or two years of claims data and on whether they have pharmacy claims.
Our study has several limitations. Use of self-reports as a gold standard for conditions may have introduced error into our estimates of sensitivity and specificity. Self-reports are subject to lack of knowledge or misperceptions about the presence of a condition, and the imprecise lay terminology used in surveys may be inconsistent with medical definitions of disease. Validation of the self-reports by review of medical records was not feasible for this secondary analysis of data. Previous studies have indicated that self-reports of the absence of chronic conditions including diabetes, hypertension, heart failure, chronic lung disease, and joint problems agree with medical records in a high percentage of cases (Fowles, Fowler, and Craft 1998; Martin et al. 2000). These and other studies have found that people who do have chronic conditions tend to underreport their presence, and the extent of underreporting varies by condition (Turner et al. 1997). Whenever a claims-based algorithm identified members who falsely reported they did not have the condition, the estimated specificity would be less than the true specificity. Nevertheless, our observations on how one can improve the specificity of claims-based algorithms by requiring more than one claim with the diagnosis, or that the diagnosis be listed on a claim for a face-to-face encounter, were consistent with previous research (Quam et al. 1993; Hebert et al. 1999). However, we did not program and compare exactly the same algorithms presented in previous reports. Further evaluation is needed to determine which algorithms are optimal or equivalent.
Our study was limited to two Medicare+Choice health plans that had limited outpatient drug benefits. Thus, our results may not be representative of algorithm performance when applied to the overall Medicare population or to younger insured populations. Furthermore, the performance of pharmacy claims may differ in health plans where outpatient drug benefits are less limited.
Although several nuances of claims data make them less than perfect for the identification of health plan members who have a particular condition, our results suggest that claims-based algorithms can help identify members who have the six chronic conditions studied. Researchers and health plan administrators must choose algorithms that yield the performance they desire depending on the objective of identifying members with specific conditions. A note of caution, however, is that the limits of performance vary considerably across conditions and may be different in other populations.
This work was supported, in part, by Grant No. RO1-HS/AG09630 from the Agency for Healthcare Research and Quality.