Med Care. Author manuscript; available in PMC 2011 May 19.
Published in final edited form as:
PMCID: PMC3097385
NIHMSID: NIHMS291346

Inventory of Data Sources for Estimating Health Care Costs in the United States

Abstract

Objective

To develop an inventory of data sources for estimating health care costs in the United States and provide information to aid researchers in identifying appropriate data sources for their specific research questions.

Methods

We identified data sources for estimating health care costs using 3 approaches: (1) a review of the 18 articles included in this supplement, (2) an evaluation of websites of federal government agencies, nonprofit foundations, and related societies that support health care research or provide health care services, and (3) a systematic review of the recently published literature. Descriptive information was abstracted from each data source, including sponsor, website, lowest level of data aggregation, type of data source, population included, cross-sectional or longitudinal data capture, source of diagnosis information, and cost of obtaining the data source. Details about the cost elements available in each data source were also abstracted.

Results

We identified 88 data sources that can be used to estimate health care costs in the United States. Most data sources were sponsored by government agencies, national or nationally representative, and cross-sectional. About 40% were surveys, followed by administrative or linked administrative data, fee or cost schedules, discharges, and other types of data. Diagnosis information was available in most data sources through procedure or diagnosis codes, self-report, registry, or chart review. Cost elements included inpatient hospitalizations (42.0%), physician and other outpatient services (45.5%), outpatient pharmacy or laboratory (28.4%), out-of-pocket (22.7%), patient time and other direct nonmedical costs (35.2%), and wages (13.6%). About half were freely available for downloading or available for a nominal fee, and the cost of obtaining the remaining data sources varied by the scope of the project.

Conclusions

Available data sources vary in population included, type of data source, scope, and accessibility, and have different strengths and weaknesses for specific research questions.

Keywords: health care costs, data sources, administrative data, linked data, survey, health economics

Health care cost estimates are used to inform policy decisions on the setting of public and private budgets, structuring of insurance benefits and establishing reimbursement rates, and in cost-of-illness and cost-effectiveness analyses. In the United States, many sources of data are available for estimating health care use and costs. Prior reviews have identified some of these data sources1,2 or summarized data sources used in cost-of-illness studies.3,4 To build on these efforts, and aid analysts in the process of identifying and choosing data sources for estimating health care costs, we developed an inventory of data sources in the United States.

The majority of data sources commonly used to estimate health care costs in the United States were not originally developed for research purposes. Longitudinal information across the trajectory of illness is generally available only for the covered populations within discrete health insurance programs. National information is available from a variety of patient surveys and hospital discharge databases, but these data are generally cross-sectional, and may have small numbers of individuals with specific conditions. Several panel surveys collect information at multiple time points, but are of short duration or limited in clinical detail.5,6 Because comprehensive longitudinal data for nationally representative populations across health insurance programs and without health insurance are largely unavailable, analysts must choose between different attributes of data sources for their specific study questions.

The choice of data source is important, because as illustrated in several articles in this issue, different data sources can produce different estimates of the cost of health care.7,8 In addition, some existing data sources have rarely been used for estimating health care costs. For example, some costs, such as patient and caregiver time costs, are routinely excluded from cost-effectiveness analyses, even though they have long been recommended for inclusion,9 often in the mistaken belief that these data are not available. In the following sections, we describe our approach to identifying data sources for estimating health care costs, provide a summary of data source attributes, and include a series of tables with detailed information about the attributes of each data source. This information can serve as a resource for researchers choosing data sources for estimating health care costs.

IDENTIFICATION OF DATA SOURCES FOR ESTIMATING HEALTH CARE COSTS

We identified data sources for estimating health care costs in the United States using 3 approaches: (1) a review of the 18 articles included in this supplement,5,7,8,10–24 (2) an evaluation of websites of federal government agencies, nonprofit foundations, and related societies that support health care research or provide health care services, and (3) a systematic review of the recently published literature. Although health care utilization patterns from many data sources can be applied to standard fee schedules, we only considered data sources where direct medical or direct nonmedical health care costs are available or can be derived. We use the term “cost” to refer to payments, expenditures, reimbursements, charges, or prices.

Our review of federal government agency websites included the Agency for Healthcare Research and Quality, Bureau of Labor Statistics, the Centers for Disease Control (and the National Center for Health Statistics), the Centers for Medicare and Medicaid Services, the Department of Defense, the Federal Interagency Forum on Aging-Related Statistics, the Health Resources and Services Administration, the 27 individual Institutes and Centers of the National Institutes of Health, and the Veterans Health Administration. We also reviewed the websites of Academy Health, the Commonwealth Fund, the International Society for Pharmacoeconomics and Outcomes Research, the Kaiser Family Foundation, the Robert Wood Johnson Foundation, and the Research Data Assistance Center, a contractor to the Centers for Medicare and Medicaid Services that provides assistance to researchers interested in using Medicare or Medicaid data (http://www.resdac.umn.edu/).

To identify data sources from the published literature, we used Scopus, the largest abstract and citation database, including 15,000 peer-reviewed journals and 100% MEDLINE coverage (http://www.info.scopus.com/). We identified articles published in English between January 1990 and December 2007 that included the terms “cost,” “economic,” “expenditure,” “charge,” or “payment” in the title (N = 141,876), and used the search terms “data source” or “database” (N = 450,862 articles) and “healthcare” and (“cost” or “payment” or “charge”) or “health care” and (“cost” or “payment” or “charge”) for the full article (N = 37,946 articles). The combination of these searches yielded 539 articles.
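The Boolean structure of this search can be made concrete with a small sketch. The Python below is an illustration only, not the Scopus query that was actually run: the record fields (`title`, `full_text`), the function names, and the use of case-insensitive substring matching are all assumptions. The text also does not state which field the “data source” or “database” terms were applied to; the sketch assumes the full article.

```python
# Term lists taken from the search description; everything else is hypothetical.
TITLE_TERMS = ("cost", "economic", "expenditure", "charge", "payment")
SOURCE_TERMS = ("data source", "database")
CARE_TERMS = ("healthcare", "health care")
MONEY_TERMS = ("cost", "payment", "charge")


def contains_any(text, terms):
    """Case-insensitive check for whether any term occurs in the text."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def matches_search(record):
    """Combine the three searches with AND, as the 539-article result implies."""
    title_hit = contains_any(record["title"], TITLE_TERMS)
    # Assumption: the data source terms are matched against the full article.
    source_hit = contains_any(record["full_text"], SOURCE_TERMS)
    care_hit = (contains_any(record["full_text"], CARE_TERMS)
                and contains_any(record["full_text"], MONEY_TERMS))
    return title_hit and source_hit and care_hit


# Made-up records for illustration; they are not articles from the review.
records = [
    {"title": "Cost of care for colorectal cancer",
     "full_text": "We used a claims database to estimate health care payment totals."},
    {"title": "Quality of life after surgery",
     "full_text": "Health care cost data came from a database."},
    {"title": "Economic burden of asthma",
     "full_text": "Survey responses were analyzed."},
]

selected = [r["title"] for r in records if matches_search(r)]
print(selected)
```

Only the first made-up record satisfies all three conditions. The second fails the title filter and the third never mentions a data source, which mirrors how intersecting the three searches narrows a very large pool to a few hundred articles.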

The abstract for each article was reviewed to identify the data source(s) used to estimate health care costs, if possible. If the source could not be identified from the abstract, the entire article was reviewed. We excluded studies that met any of the following criteria: data source was from outside the United States, monetary estimates were not presented, data were not available after 1990, data were from a single institution or a clinical trial and unlikely to be widely available to other investigators, or the study was only available in the form of a published abstract or dissertation (N = 266). Because electronic searches may not identify all relevant studies,25 we also evaluated all reviews (N = 107) to identify data sources used in the underlying research studies. The underlying research studies were evaluated further and the same eligibility criteria were applied. From the 3 search strategies, we identified a total of 88 data sources with sufficient information to abstract key data elements.

Because our goal was to provide information for analysts interested in using these data sources, we made extensive efforts to identify as many data sources and data elements as possible. Some data sources mentioned in the literature review could not be located. Others have been discontinued or merged with other data sources. In situations where we could not abstract the information about the population and cost elements from the data source website, we followed up with the listed contact for additional information. Despite these efforts, we did not have sufficient information about several data sources to include them in this inventory.

ABSTRACTION OF DATA SOURCE ATTRIBUTES

Information about each data source was abstracted using a standardized format. We abstracted descriptive information about the data source, including the sponsoring agency or organization (government, private, university), website for the data source, lowest level of data aggregation (hospital or provider, service, and individual or patient), and type of data source (survey, administrative or linked administrative data, discharge, fee or cost schedule).

We also abstracted information about the eligible or covered population, whether the data were nationally representative, whether they were cross-sectional or longitudinal (including repeated cross-sections or panels), and the source of diagnosis information (procedure or diagnosis codes, self-report, chart or medical record, registry, not available). Cost elements abstracted were the types of services or resources for which cost data were collected, including institution or facility (eg, hospital or freestanding clinic), inpatient hospitalization, physician and other outpatient, outpatient pharmacy or laboratory, out-of-pocket, wages, patient time, and other direct nonmedical (eg, disability days). The cost of obtaining the data source was categorized as either (1) freely available for download/less than $100 or (2) cost depends on scope of project. We did not gather information about data completeness or quality.
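The abstraction fields described above map naturally onto a simple record type. The following Python sketch shows one possible representation; the class name `DataSource`, the field names, the helper `find_sources`, and the example entries are all hypothetical and are not part of the published inventory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataSource:
    """One inventory entry, using the attributes named in the abstraction text."""
    name: str
    sponsor_type: str               # "government", "private", or "university"
    website: Optional[str]
    aggregation_level: str          # "hospital or provider", "service", "individual or patient"
    source_type: str                # "survey", "administrative", "discharge", "fee or cost schedule"
    nationally_representative: bool
    longitudinal: bool              # False means cross-sectional
    diagnosis_source: str           # e.g. "codes", "self-report", "registry", "not available"
    cost_elements: List[str] = field(default_factory=list)
    freely_available: bool = False  # True: free download or less than $100


def find_sources(inventory, cost_element, longitudinal=None):
    """Return names of sources that record `cost_element`, optionally filtered by design."""
    matches = []
    for source in inventory:
        if cost_element not in source.cost_elements:
            continue
        if longitudinal is not None and source.longitudinal != longitudinal:
            continue
        matches.append(source.name)
    return matches


# Hypothetical entries for illustration; they do not describe real data sources.
inventory = [
    DataSource(
        name="Example Survey A", sponsor_type="government", website=None,
        aggregation_level="individual or patient", source_type="survey",
        nationally_representative=True, longitudinal=False,
        diagnosis_source="self-report",
        cost_elements=["out-of-pocket", "patient time"], freely_available=True,
    ),
    DataSource(
        name="Example Claims File B", sponsor_type="private", website=None,
        aggregation_level="individual or patient", source_type="administrative",
        nationally_representative=False, longitudinal=True,
        diagnosis_source="codes",
        cost_elements=["inpatient hospitalization", "out-of-pocket"],
    ),
]

print(find_sources(inventory, "out-of-pocket"))
print(find_sources(inventory, "out-of-pocket", longitudinal=True))
```

Storing cost elements as a list reflects that a single data source can cover several types of services or resources, which is why the percentages reported for individual cost elements need not sum to 100%.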

ATTRIBUTES OF DATA SOURCES FOR ESTIMATING HEALTH CARE COSTS

Descriptive characteristics of the data sources we identified are summarized in Table 1 and listed for individual data sources in the remaining Tables. The lowest level of data aggregation for most sources was the individual or patient level, followed by the service level, and the hospital, provider, or institution level. Two data sources were aggregated at the national level.23,26 Data sources at the service level of aggregation included hospital and other discharges, fee or cost schedules for physician services, ambulatory services, and equipment or prescriptions. Government agencies sponsored the majority of the data sources. Most data sources were national or nationally representative. About 40% were surveys, followed by administrative data or survey or registry linked to administrative data, fee or cost schedules, hospital discharges, and other types of data. Over 60% of the data sources were cross-sectional. The remaining data sources were longitudinal, including panel data with repeated cross-sections. Many of the data sources provided information about diagnosis with specific diseases, including procedure or diagnosis codes that can be used with algorithms to identify patients with disease, self-reported diagnoses, and registry or chart review identified diagnoses.

TABLE 1
Characteristics of Data Sources for Estimating Direct Medical and Nonmedical Health Care Costs

The cost elements available in the different data sources included inpatient hospitalization (42.0%), institution or facility (5.7%), physician and other outpatient (45.5%), outpatient pharmacy or laboratory (28.4%), out-of-pocket (22.7%), patient time and other direct nonmedical costs (35.2%), and provider wages (13.6%). About half of the data sources were freely downloadable or available for a nominal fee (ie, <$100), and the cost of acquiring data varied with the scope of the project for the other half.
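The percentages above can be converted into approximate counts of data sources. The Python sketch below does this under the assumption that each percentage uses all 88 data sources as its denominator; the counts are inferred by rounding and are not reported in this text, so they should be checked against Table 1 before being cited.

```python
TOTAL_SOURCES = 88  # total data sources identified in the inventory

# Percentages as reported in the text.
REPORTED_PERCENT = {
    "inpatient hospitalization": 42.0,
    "institution or facility": 5.7,
    "physician and other outpatient": 45.5,
    "outpatient pharmacy or laboratory": 28.4,
    "out-of-pocket": 22.7,
    "patient time and other direct nonmedical": 35.2,
    "provider wages": 13.6,
}


def implied_count(percent, total=TOTAL_SOURCES):
    """Nearest whole number of sources consistent with a reported percentage."""
    return round(percent / 100 * total)


# Inferred counts (assumption: denominator of 88 for every cost element).
implied = {name: implied_count(p) for name, p in REPORTED_PERCENT.items()}

for name, count in implied.items():
    # Recompute the percentage from the inferred count as a consistency check.
    back = round(100 * count / TOTAL_SOURCES, 1)
    print(f"{name}: {count} of {TOTAL_SOURCES} ({back}%)")

print("sum of implied counts:", sum(implied.values()))
```

Each inferred count reproduces the reported percentage when rounded to one decimal place. The counts sum to well over 88, which is consistent with individual data sources containing more than one cost element.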

We classified data sources by level of data aggregation and type of data and listed the available cost elements and source of diagnosis information in Tables 2 to 6. Data sources whose lowest level of data aggregation was at the hospital and provider level are listed in Table 2. Available cost elements include institution or facility, inpatient hospitalization, and wages or payroll. Data sources with the lowest level of data aggregation at the service level include discharges and cost, price, or fee schedules (Table 3). Available cost elements include inpatient hospitalization, outpatient services, and pharmacy and equipment.

TABLE 2
Hospital and Provider Level Data Sources Used to Estimate Direct Medical Health Care Costs
TABLE 3
Service Level Data Sources Used to Estimate Direct Medical Health Care Costs
TABLE 6
Individual Level Data Sources Used to Estimate Patient and Caregiver Time

Data sources at the individual or patient level are listed in Table 4 (surveys) and Table 5 (administrative data, and registries or surveys linked to administrative data). Detailed cost elements for these patient-level data included inpatient hospitalization, physician or other outpatient services, outpatient pharmacy, out-of-pocket, and other direct nonmedical. Individual level data sources that can be used to estimate patient or caregiver time are listed in Table 6. Time information abstracted included nursing home stays, outpatient services, restricted activity days, home care/home therapy, and hospice care. Information on length of inpatient hospital stay was available from hospital discharge data listed in Table 3 and administrative data with inpatient information listed in Table 4, and is not listed separately in Table 6.

TABLE 4
Individual or Patient Level Data Sources Used to Estimate Direct Medical Health Care Costs: Surveys
TABLE 5
Individual Level Data Sources Used to Estimate Direct Medical Health Care Costs: Administrative Data, and Registries and Surveys Linked to Administrative Data

Finally, an alphabetical listing of all data sources with the web address for additional information, and the Table(s) where the data source is described in greater detail is contained in Table 7. Table 7 also includes indicators of whether data were nationally representative and whether longitudinal data were available, and the cost of obtaining the data. Use of several data sources requires collaboration with internal investigators; these data sources are indicated with a footnote. Data sources aggregated at the national level were not abstracted or reported separately, although they are listed in Table 7.

TABLE 7
Alphabetical Listing of Data Sources for Estimating Health Care Costs

SUMMARY

In this inventory, we identified 88 data sources in the United States that can be used to estimate health care costs, and abstracted key characteristics, including sponsor, lowest level of data aggregation, population included, length of observation, source of diagnosis information, and available cost elements. The data sources we identified vary in these dimensions as well as in their accessibility. Some are publicly available and freely downloadable directly from sponsors’ websites, others must be purchased, and still others are restricted to the use of researchers or collaborators within sponsors’ institutions.

The inventory is as comprehensive as we could make it, but some sources were unavoidably excluded. Additionally, work is ongoing to develop linkages between registries and surveys with administrative data, including linkage among multiple data sources or multiple payors. Recently, investigators were able to link the Michigan cancer registry data with both Medicare and Medicaid.27 Data linkage efforts with Medicare, Medicaid, and additional data sources, including private payors, are ongoing in other states.

Ultimately, investigators must weigh the strengths and weaknesses of different data sources for their specific research questions. Considerations include the representativeness of the data source to the population of interest, the appropriate level of aggregation, the need for information from single or multiple payors (eg, Medicare, Medicaid, private), types of services or resources measured, period of observation (longitudinal versus cross-sectional), and need for accurate identification of patients with specific conditions (eg, cancer). As illustrated in 2 articles in this supplement, accurate identification of patients for either incidence or prevalence cost estimates is critical in cancer7,8; the method of patient identification may be less critical for other diseases. In the case of simulation models, which may integrate cost estimates from multiple sources, similarity of patient populations and types of resources measured across sources may be a key consideration. Other issues, such as periodicity and most recent year of cost data may also be important for studies of trends in health care costs or in studies tracking the dissemination of new therapies. Finally, feasibility, ease, and cost of accessing the data may also be important considerations for selecting the most appropriate data source for estimating health care costs for the specific research question.

ACKNOWLEDGMENTS

The authors thank Sally Stearns of the University of North Carolina, Gerald Riley of the Centers for Medicare and Medicaid Services, and L. Clark Paramore of the United Biosource Corporation for thoughtful comments on an earlier version of the Inventory narrative and tables.

Footnotes

The views expressed in this article are those of the authors, and no official endorsement by the US Department of Health and Human Services, the Agency for Healthcare Research and Quality, or the National Cancer Institute is intended or should be inferred.

REFERENCES

1. Resources for research and decision making: an annotated database inventory. Med Care. 1996;34 suppl 3:111–141.
2. Ferrell LA. Sources for collecting cost-of-illness data. In: Haddix AC, Teutsch SM, Corso PS, editors. Prevention Effectiveness: A Guide to Decision Analysis and Economic Evaluation. New York, NY: Oxford University Press; 2003. pp. 232–238.
3. Akobundu E, Ju J, Blatt L, et al. Cost-of-illness studies: a review of current methods. Pharmacoeconomics. 2006;24:869–890.
4. Clabaugh G, Ward MM. Cost-of-illness studies in the United States: a systematic review of methodologies used for direct cost. Value Health. 2008;11:13–21.
5. Cohen JW, Cohen SB, Banthin JS. The Medical Expenditure Panel Survey: a national information resource to support healthcare cost research and inform policy and practice. Med Care. 2009;47 suppl 7:S44–S50.
6. Haffer SC, Bowen SE. Measuring and improving health outcomes in Medicare: the Medicare HOS program. Health Care Financ Rev. 2004;25:1–3.
7. Yabroff KR, Warren JL, Banthin J, et al. Comparison of approaches for estimating prevalence costs of care for cancer patients: what is the impact of data source? Med Care. 2009;47 suppl 7:S64–S69.
8. Yabroff KR, Warren JL, Schrag D, et al. Comparison of approaches for estimating incidence costs of care for colorectal cancer patients. Med Care. 2009;47 suppl 7:S56–S63.
9. Gold MR, Siegel JE, Russell LB, et al. Cost-Effectiveness in Health and Medicine. New York: Oxford University Press; 1996.
10. Riley GF. Administrative and claims records as sources of health care cost data. Med Care. 2009;47 suppl 7:S51–S55.
11. Barlow W. Overview of methods to estimate the medical costs of cancer. Med Care. 2009;47 suppl 7:S33–S36.
12. Neumann PJ. Costing and perspective in published cost-effectiveness analysis. Med Care. 2009;47 suppl 7:S28–S32.
13. Mullahy J. Econometric modeling of health care costs and expenditures. Med Care. 2009;47 suppl 7:S104–S108.
14. Basu A, Manning WG. Issues for the next generation of health care cost analyses. Med Care. 2009;47 suppl 7:S109–S114.
15. Huang Y. Cost analysis with censored data. Med Care. 2009;47 suppl 7:S115–S119.
16. Marshall DA, Hux M. Design and analysis issues for economic analysis alongside clinical trials. Med Care. 2009;47 suppl 7:S14–S20.
17. Russell L. Completing costs: patients’ time. Med Care. 2009;47 suppl 7:S89–S93.
18. Frick KD. Micro-costing quantity data collection methods. Med Care. 2009;47 suppl 7:S76–S81.
19. Barnett PG. An improved set of standards for finding cost for cost-effectiveness analysis. Med Care. 2009;47 suppl 7:S82–S88.
20. Rosen A, Cutler DM. Challenges in building disease-based national health accounts. Med Care. 2009;47 suppl 7:S7–S13.
21. Fishman PA, Hornbrook MC. Assigning resources to health care use for health services research: options and consequences. Med Care. 2009;47 suppl 7:S70–S75.
22. Hoerger T. Using costs in cost-effectiveness models for chronic diseases: lessons from diabetes. Med Care. 2009;47 suppl 7:S21–S27.
23. Heffler S, Nuccio O, Freeland M. An overview of the NHEA with implications for cost analysis researchers. Med Care. 2009;47 suppl 7:S37–S43.
24. Grosse SD, Krueger KV, Mvundura M. Economic productivity by age and sex: 2007 estimates for the United States. Med Care. 2009;47 suppl 7:S94–S103.
25. Dickersin K, Scherer R, Lefebvre C. Identifying relevant studies for systematic reviews. BMJ. 1994;309:1286–1291.
26. Electronic Citation. Organization for Economic Cooperation and Development (OECD) health data; 2009.
27. Bradley CJ, Given CW, Luo Z, et al. Medicaid, Medicare, and the Michigan Tumor Registry: a linkage strategy. Med Decis Making. 2007;27:352–363.