With important technological advances in healthcare delivery and the internet, clinicians and scientists now have access to an overwhelming number of available databases capturing patients with critical illness. Yet investigators seeking to answer important clinical or research questions with existing data have few resources that adequately describe the available sources and the strengths and limitations of each. This article reviews an approach to selecting a database to address health services and outcomes research questions in critical care, examines several databases that are commonly used for this purpose, and briefly describes some strengths and limitations of each.
Narrative review of the medical literature.
The available databases that collect information on critically ill patients are numerous and vary in the types of questions they can optimally answer. Selection of a data source must consider not only accessibility, but also the quality of the data contained within the database and the extent to which it captures the necessary variables for the research question. Questions seeking causal associations (e.g. effect of treatment on mortality) usually require either secondary data that contain detailed information about demographics, laboratories, and physiology to best address non-random selection, or a sophisticated study design. Purely descriptive questions (e.g. incidence of respiratory failure) can often be addressed using secondary data with less detail, such as administrative claims. Though each database has its own inherent limitations, all secondary analyses will be subject to the same challenges of appropriate study design and good observational research.
The literature demonstrates that secondary analyses can have significant impact on critical care practice. While selection of the optimal database for a particular question is a necessary part of high-quality analyses, it is not sufficient to guarantee an unbiased study. Thoughtful and well-constructed study design and analysis approaches remain equally important pillars of robust science. Only through responsible use of existing data will investigators ensure that their study has the greatest impact on critical care practice and outcomes.
Clinicians and scientists have witnessed an unprecedented expansion in the publication of critical care studies employing observational designs. This expansion is perhaps most evident in studies using secondary data. Secondary data can be defined as data gathered for one reason (e.g. clinical trial) now being reemployed to answer a novel question (e.g. costs of care), whereas primary data are data collected specifically for the purposes of answering a novel question (1). Several factors are responsible for this expansion, including wide recognition of the importance of safety and quality improvement research (2-5), the ability to perform complex analytic tasks on personal computers, and growth in the cadre of investigators capable of performing secondary analyses through methods training(6, 7).
When done well, this work has contributed novel observations and changed practice in fundamental ways, improving outcomes for patients. At the bedside, the re-evaluation of pulmonary artery catheters—once a ubiquitous feature of nearly every medical ICU patient—began with a clever re-analysis of a clinical trial done to investigate other questions(8). Modern ICU organization—with its focus on high-volume centers of excellence—was shaped by scientific observations about the volume/outcome relationship(9, 10) and by the rigorous evaluation of the surgical and trauma center experience(11-13).
Secondary data analyses also provide essential raw material for key operations in healthcare. National priority setting about causes of death and clinical decision-making about prior probabilities of disease both depend on secondary data. For example, virtually every basic-science grant application on severe sepsis contextualizes the proposed work with national-scale epidemiology derived from administrative records(14, 15). Current policy concerns about healthcare overuse in the ICU, such as excessive end-of-life spending and unexplained geographical variation in ICU use, depend on secondary data analyses(16-20). Much of our understanding of racial/ethnic and insurance-based disparities(21-28), as well as the value of critical care(29-32), derives from secondary data analyses. This work has helped move the conversation about the causes for inequities in critical care away from personal opinion toward scientific evaluation and efforts to solve such problems.
This new scientific and clinical importance of secondary data analysis has regrettably been accompanied by numerous examples of poorly designed studies utilizing datasets ill equipped to answer the research questions posed of them(33). A major contributor to the evolution of secondary analyses is the dramatic growth in existing critical care databases and the ease with which one can access them. Despite the attractiveness of such data for many purposes, there have been few references to turn to that discuss available data sources relevant to critical care to facilitate informed choices by prospective investigators(1, 7, 34-36).
In this review, we examine several existing data sources commonly used for secondary data analysis in critical care, and present a practical approach to the selection of a database based upon the strengths of the source. We limit our discussion to databases used for the conduct of clinical epidemiology, health service, policy and outcomes research, rather than discuss data derived from genetic analyses, “-omics”, or other bench science. Although there are important critical care data resources outside of the United States(37-39), we focus on resources within. Our target audience includes both investigators seeking to answer research questions in critical care and readers of the medical literature interested in ways to better appraise the data sources selected in published studies. Finally, we focus on the secondary data available for answering well-formed questions, but do not seek to review fundamentals of good scientific study design.
Investigators who employ secondary data to answer clinical questions can capitalize on several of its advantages compared to primary data. First, secondary analysis of data promotes efficient use of research investments, particularly when performed on biologic data or data otherwise overly expensive to collect. Second, some questions (e.g. where randomization is unethical or where the aim is to measure “real-world” practice) can often be answered only, or most efficiently, by secondary data analysis. Third, some secondary data, such as large registries and administrative data, may provide greater generalizability due to a much greater scope than studies collecting primary data, potentially generalizing to regions, states, or even the nation. Fourth, it may be feasible to carry out a secondary data analysis for questions where primary data collection is too onerous, such as those that consider 5- or 10-year follow-ups. Similarly, scientists with appropriate statistical training but with limited grant funding may find secondary data analysis more feasible as a first approach to a problem. Fifth, because some secondary analyses employ administrative data that are very large, they may provide a much more precise estimate of effect than smaller primary studies, particularly for rare diagnoses. Sixth, when secondary data cannot be used to answer the appropriate scientifically rigorous and clinically relevant question, they may play an important supplementary role to assess the plausibility and likelihood of success of a large-scale primary data collection effort; secondary data analysis may be a particularly cost-effective way to obtain such preliminary data. For example, the secondary analysis of the SUPPORT study, which suggested pulmonary artery catheters were harmful, formed the basis for several randomized trials.
Finally, secondary analyses of administrative data may be more relevant to policymakers and therefore support the translation of scientific discoveries into improved care. For example, Medicare stakeholders base policy decisions on research that is conducted using data from Medicare patients.
The issues that a researcher confronts when evaluating a database for secondary analysis are the same that readers of the literature must consider when assessing the quality of a data source used in a published study. However, because investigators ultimately need to obtain the data in addition to critiquing it, we approach the database evaluation and acquisition process from the perspective of a researcher.
When confronted with the overwhelming number of existing data sources, investigators must first consider the ability of a given data source to sufficiently address the research question of interest. This involves characterizing the quality and overall susceptibility to bias of the data source. While this process largely remains a subjective task(40, 41), in 2003 a group of investigators from the United Kingdom developed a framework to assess the quality of secondary databases(42). The framework included two aspects characterizing database quality: coverage and accuracy(Table 1). While useful in principle, this framework is not sufficient to identify an optimal database. By placing equal value on each aspect of database quality, it ignores how the potential for inadequacies in a single domain may be fatal to a study. Most obviously, a database may be perfect in all domains yet lack the outcome variable of interest.
A more practical approach to the selection of a data source considers the needs imposed by the research question. Through articulating a well-defined research question, an investigator will know which variables are needed to define the population, the treatment or exposure, the outcomes, and those needed for adjustment (i.e. confounding variables), such as demographics and severity of illness(43). One can then consider whether clinically detailed versus clinically sparse data are needed to address the question. Many quantitative research questions can be lumped into five overlapping groups that describe the goals of the study (Table 2). The goals of the study often dictate the need for clinically rich versus sparse information. When investigators seek to determine the causal relationship between a risk factor and outcome or compare outcomes across specific treatments, greater clinical information is usually necessary to address confounding – that is, account for any variable that may distort the relationship between the observed exposure and outcome(44). Often, the most important confounding variable is severity of illness(7, 44).
In contrast, more descriptive questions that characterize the epidemiology of disease, clinical practice, health service use, and health care spending, are not usually limited by confounding because comparisons between different groups are less often performed. For example, Wiener and colleagues used national hospital discharge claims to examine temporal changes in use of pulmonary artery catheters(45). The absence of clinically detailed information in this study did not impact the importance of this result given its descriptive nature.
While the taxonomy of questions in Table 2 may be helpful in articulating the needs for a given question, we do not intend it to be prescriptive or without exception. For example, there are many excellent examples of investigators employing clinically sparse data to examine causal relationships(10, 46) and clinically rich information to examine disease incidence(47). Clinically rich information often has the benefit of being able to address many types of questions, but the same is not necessarily true for clinically sparse information. When sparse data are used to examine causal relationships, more sophisticated methods to address patient selection and confounding are needed(44, 48-51).
Only after an investigator has decided on the variables needed to address the question and has considered the importance of clinically rich versus sparse data should he or she investigate the available databases in which to study the question. A prudent approach toward finding a source requires searching the web and the literature and consulting investigators who have used secondary data. Once potential sources are identified, an important next step is to determine if access to the data is feasible. Some secondary data sources can be downloaded from the web free of charge, but others have fees that can range from $20 to over $100,000. Independent of access fees, some sources require navigating administrative hurdles, including vetting of the research proposal by an oversight committee, or require collaboration with a scientist who has access to the data. While in no way comprehensive, Table 3 describes several critical care databases organized by the degree of clinical detail available within each and qualitatively describes the accessibility of each data source.
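The screening step described above can be sketched in a few lines of code. The following is a minimal illustration, not a validated tool: it assumes the investigator has already listed the variables the research question requires, grouped by role (population, exposure, outcome, adjustment), and it checks each candidate source for gaps. The database names (`claims_db`, `registry_db`) and all variable names are hypothetical and do not describe the contents of any real data source.

```python
# Hypothetical illustration: database names and variable lists are invented
# and do not describe the contents of any real data source.

# Variables the research question requires, grouped by their role.
REQUIRED = {
    "population": {"icu_admission"},
    "exposure": {"pa_catheter"},
    "outcome": {"hospital_mortality"},
    "adjustment": {"age", "severity_of_illness"},
}

# Variables each candidate database is assumed to capture.
CANDIDATES = {
    "claims_db": {"icu_admission", "pa_catheter", "hospital_mortality", "age"},
    "registry_db": {
        "icu_admission", "pa_catheter", "hospital_mortality",
        "age", "severity_of_illness",
    },
}


def missing_by_role(required, available):
    """Map each role to the sorted required variables the database lacks.

    Roles with no missing variables are omitted, so an empty dict means
    the database covers everything the question needs.
    """
    gaps = {}
    for role, needed in required.items():
        missing = needed - available
        if missing:
            gaps[role] = sorted(missing)
    return gaps


def screen(required, candidates):
    """Return the coverage gaps for every candidate database."""
    return {
        name: missing_by_role(required, available)
        for name, available in candidates.items()
    }


if __name__ == "__main__":
    for name, gaps in screen(REQUIRED, CANDIDATES).items():
        status = "covers all required variables" if not gaps else f"missing {gaps}"
        print(f"{name}: {status}")
```

Reporting gaps by role rather than as a single score reflects the point made earlier: one missing outcome or key confounder can be fatal to a study even when coverage is otherwise excellent. A check like this addresses only whether variables exist, not their accuracy or the accessibility of the source.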
Nationally funded randomized controlled trials (RCT) and large-scale prospective cohort studies usually collect data with considerable clinical detail, including clinical physiology, severity of illness, and patient outcomes. One of the most important existing repositories for critical care RCTs and cohort studies is the National Heart, Lung, and Blood Institute's Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC)(52). BioLINCC provides data from over 80 clinical and epidemiological studies. These include many prominent critical care studies from the last few decades, particularly the ARDSNet RCTs conducted from 1996 through 2006(53-58)—use of which has resulted in dozens of secondary analyses.
The electronic medical record (EMR) has great promise to become the future source of many secondary data analyses(59, 60). Unfortunately, several important barriers hamper the current realization of the research potential of the EMR, including difficulty in extracting information from free text and incompatibility of systems across hospitals(60). Nevertheless, there have been successes.
The Department of Veterans Affairs created the Inpatient Evaluation Center (IPEC), an infrastructure for improving the quality of care in VA medical centers that includes data on all inpatients in over 100 hospitals extracted from the VA's EMR. This data source includes an excellent risk-adjustment measure and has been used to study the organization and quality of care within the VA(61, 62), develop risk-adjustment models(63, 64), and determine the impact of infection control measures on outcomes(65, 66). Kaiser Permanente of Northern California has similarly rich data on its large network of community hospitals(67).
An additional EMR-based resource is the Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database(68). This publicly available, deidentified repository includes minute-to-minute data for over 30,000 patients admitted to an ICU at Beth Israel Deaconess Medical Center. Published studies using the MIMIC-II database have examined several aspects of ICU care, such as developing and validating high-fidelity risk-adjustment models(69, 70) and characterizing providers' response to hypotension(71, 72). Users can gain access to MIMIC-II via the internet.
Several existing data sources that were created for benchmarking or quality improvement purposes provide clinically rich data on ICU patients. Perhaps the most famous of these is the APACHE database(73). By maintaining the gold standard for risk-adjustment, the APACHE database provides rich clinical information for patients in hospitals that voluntarily contribute data. Investigators have used this source to answer questions about the impact of organizational features on patient outcomes(74, 75), variation in ICU admission practice(76), volume-outcome relationships(9), among others. Cerner, the owner of APACHE, also previously maintained the now unavailable Project IMPACT(77, 78).
A relatively new data source of critically ill patients is the eICU Research Institute(79). Although designed to allow off-site intensivist involvement in remote ICUs, telemedicine systems also standardize disparate data from participating ICUs(80). Philips eICU (formerly VISICU), currently the largest vendor of ICU telemedicine, created the eICU Research Institute in collaboration with health-care providers and academia(80, 81). The University of Maryland School of Pharmacy Pharmaceutical Research Computing Center (PRC) is the first academic partner with access to this database.
Although the line distinguishing a registry from a quality improvement or benchmarking database is somewhat arbitrary, registries usually focus on a single disease or syndrome and are often used by their participants to benchmark their own data against that of others. For example, the National Trauma Data Bank maintains the largest nationally representative sample of patients experiencing trauma. Data fields include demographics, vital signs from the ED and EMS, abbreviated injury scale, procedure and diagnosis codes, and ICU and ventilator days, among other characteristics. Prominent past studies employing this data source have looked at the impact of helicopter transport(82), the pulmonary artery catheter(83), and prehospital fluids(84) on outcomes of trauma.
The American Heart Association Get with the Guidelines (GWTG) maintains several registries capturing patients who often require critical care services. The GWTG-Resuscitation registry collects information on consecutive patients with in-hospital cardiac arrests, defined by the absence of a central palpable pulse, apnea, and unresponsiveness(85). Extensive data surrounding the arrest and the post hospital course are collected, including outcomes of return of spontaneous circulation, neurologic status, and survival to discharge. Recent studies employing this dataset include analyses examining cardiac arrest among patients with pneumonia(86), variation in hospital cardiac arrest rates(87, 88) and in the time to defibrillation(89), and racial differences in outcomes after arrest(90).
Administrative data are data collected on patients during encounters with the healthcare system and are most often collected for billing insurers. For hospitalized patients, this usually includes data from the Uniform Billing 04 sheet (UB-04), which collects facility charges during an inpatient stay(91). Although elements vary by payer, this form typically collects demographics including payer, admission source, ICD-9-CM diagnosis and procedure codes, DRG codes, some CPT and/or HCPCS codes, length of stay, disposition, hospital identifier, and detailed charges for aspects of the hospital stay (e.g. ICU room and board, pharmacy). Although encounter-specific, claims can often be linked, allowing one to trace an individual's course through inpatient, post-discharge, and outpatient facilities.
The two main sources of administrative data are insurers and government agencies interested in tracking healthcare use. For example, Medicare provides research claims for all aspects of care among its close to 50 million beneficiaries across the entire United States, a segment of the population that accounts for a majority of critical care use(92) and is of intrinsic public policy interest. Long-term mortality and longitudinal utilization can be tracked in Medicare files. Access to Medicare data is relatively expensive if one's research question requires individual-level linkage across hospitals or outpatient claims; in contrast, one year's standard inpatient file, the so-called “MedPAR” file, can cost less than $1,000. MedPAR includes information about the inpatient stay typically present on the UB-04 form, including diagnosis, procedure, and DRG codes, ICU or CCU stay, hospital charges, and hospital discharge disposition. Data about skilled nursing stays are also included. However, information about outpatient visits, physician charges, durable medical equipment, and hospice care is in separate files. Investigators have used Medicare data to examine long-term survival of respiratory failure(93), epidemiology of sepsis(94), cognitive outcomes among critically ill patients(95), and epidemiology of long-term acute care use(96).
In contrast to Medicare, some data sources include data from all payers. These include various state health departments or agencies such as the CDC that maintain national surveys of inpatient care(97). The Healthcare Cost and Utilization Project (HCUP), the largest collection of all-payer inpatient care data in the US, maintains one of the most accessible sources of administrative data(98). Investigators can access over 95% of ER visits and hospital discharges from individual states using HCUP's State Emergency Department Databases or State Inpatient Databases, or use HCUP's Nationwide Inpatient Sample to examine questions in a nationally representative sample of hospitals and patients. Data are also available for children. Readmissions are tracked in several states; however, follow-up to out-of-hospital patient-centered outcomes is often impossible. HCUP has been used to examine variation in ICU use(19, 99), stroke risk among patients with atrial fibrillation and sepsis(100), longitudinal trends in PA catheter use(45), and impact of marital status on sepsis outcomes(101). Finally, private groups or insurers also maintain research files that can be purchased at significant cost. These include MarketScan, a data source representing diverse claims from over 100 private payers(102), and Premier Perspective, the nation's largest inpatient drug utilization database. Premier Perspective is unique in its collection of time-stamped data about medications delivered during an inpatient stay. Investigators have capitalized on this unique attribute to examine the impact of activated protein C on mortality in sepsis(103), and the quality of care among patients admitted with COPD exacerbations(104, 105).
Often a single dataset may provide only part of the information that is necessary to conduct a successful analysis. In such situations investigators can either supplement the data source by collecting additional data or link two or more existing data sources. For example, Treggiari and colleagues successfully linked an existing ARDS database to a prospective survey of ICU directors to determine the relationship between physician staffing and outcomes of ARDS(106).
The often-easier option involves the linking of two independent but preexisting data sources that together have the necessary information for the question. Occasionally, this linkage has already been done prior to obtaining the data. For example, the Health and Retirement Study collects data on the sources beneficiaries use to pay for services, health status, and other economic and family variables from nationally representative samples of older Americans. An existing link to Medicare files allows one to identify survey respondents who were hospitalized with critical illness. Iwashyna et al.(95) used these data, and Barnato et al.(107) capitalized on the similar Medicare Current Beneficiary Study to examine disability of long-term survivors of critical illness. Such linkage offers an unusual opportunity to examine outcomes for rarer diseases with prospectively collected pre-morbid data(108). An additional pre-linked data source is the Surveillance, Epidemiology, and End Results program linked to Medicare files (SEER-Medicare)(109). SEER collects information on cancer incidence, prevalence and survival from specific geographic areas containing 28 percent of the US population. Through an existing link to Medicare inpatient files, one can examine the intersection between cancer and critical care, such as the relationship between critical illness and long-term survival among lung cancer patients(110). When links are not already in place, investigators can often establish them provided identifying information is present within the data. For example, Seymour and colleagues linked paramedic run sheets with Washington State hospital discharge claims to study prehospital risk factors for ICU admission and hospital mortality(111).
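To make the idea of establishing a link concrete, the sketch below shows a simple deterministic (exact-match) linkage between two record sets that share identifying fields. It is an illustration only and is not the method used in any study cited above; the field names (`patient_id`, `service_date`, `prehospital_sbp`, and so on) and the records themselves are hypothetical. In practice, identifiers are often imperfect and investigators may need probabilistic matching and formal validation of the link.

```python
def link_records(source_a, source_b, keys=("patient_id", "service_date")):
    """Deterministically link two lists of records on shared identifying fields.

    Returns (linked, unmatched): `linked` holds merged records for rows in
    source_a with exactly one match in source_b; `unmatched` holds rows in
    source_a with no match or with an ambiguous (multiple) match.
    """
    # Index the second source by the tuple of key values.
    index = {}
    for record in source_b:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)

    linked, unmatched = [], []
    for record in source_a:
        matches = index.get(tuple(record[k] for k in keys), [])
        if len(matches) == 1:
            linked.append({**record, **matches[0]})
        else:
            # No match or several candidates: set aside for manual review.
            unmatched.append(record)
    return linked, unmatched


# Hypothetical example data (invented for illustration).
run_sheets = [
    {"patient_id": "A1", "service_date": "2010-01-05", "prehospital_sbp": 85},
    {"patient_id": "B2", "service_date": "2010-01-06", "prehospital_sbp": 120},
]
discharges = [
    {"patient_id": "A1", "service_date": "2010-01-05",
     "icu_admission": True, "died_in_hospital": False},
]

linked, unmatched = link_records(run_sheets, discharges)
print(f"linked: {len(linked)}, unmatched: {len(unmatched)}")
```

Keeping the unmatched records, rather than silently dropping them, lets the investigator report the match rate and compare linked with unlinked patients, since incomplete linkage can itself introduce selection bias.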
While many databases described above are limited by their lack of clinical detail, all have additional unique limitations. Most RCTs in critical care enroll only a small fraction of eligible patients, which may threaten generalizability of secondary analyses using RCTs as a data source. Many registries or databases collected for quality improvement and benchmarking efforts include only volunteer hospitals. These non-random hospitals may be highly motivated to improve care for their patients, are often geographically clustered, and are more likely to be teaching hospitals, factors that threaten generalizability of studies employing these data sources(7). Administrative data are limited by the often-unknown validity of ICD-9-CM or other billing codes for identifying critical illness, variable number of ICD-9-CM codes collected across hospitals(112), temporal instability in coding practice(113), biases due to provider efforts to maximize payment(114), among others. Investigators using these databases should consider how these limitations might bias their results and include strategies in their analyses that address these weaknesses. These limitations suggest that the optimal approach to an avenue of research uses secondary data for the questions that they are uniquely suited to address, but turns to primary data for other aspects of the key clinical questions.
Finally, and perhaps most crucially, databases only provide the raw materials to address a research question but do nothing to ensure a study is appropriately designed and conducted. Observational research—indeed, all research—regardless of the study design or data source, is subject to a variety of biases in addition to the issues of confounding.
A major barrier to optimal care for all critically ill patients is the absence of a centralized repository of data on critically ill patients in the US, even though such a barrier is surmountable. Policymakers and scientists have used available registries of patients with cardiac disease, including those described above, not only to increase guideline-concordant care, but also to gain important insights about the care for patients with congestive heart failure, myocardial infarction, and cardiac arrest(87, 115-117). Leaders within the American College of Surgeons have driven continuing improvements in trauma and surgical outcomes through trauma registries and the National Surgical Quality Improvement Program registry(118, 119). Registries have even been successfully implemented in combat zones to improve care of wounded soldiers(120). Armed with high-quality clinical practice guidelines, policymakers in cardiology and surgery have developed clinical registries by capitalizing on strong leadership and financing from professional societies. Despite the existence of guidelines for management of some critically ill populations, such as patients with sepsis, leaders in critical care have been less successful in their efforts to create comparably accessible and comprehensive registries(121). As leaders within professional critical care societies strive to guide practice through publication and implementation of guidelines, they should continue to pursue parallel efforts to track the populations targeted by their guidelines to ensure that optimal care is being delivered in the real world. As we have witnessed in cardiology and surgery, secondary analyses of such critical care registries could realize further gains in the care for our patients.
Through secondary data analyses, investigators have contributed substantially to the understanding of disease and health care delivery in critical care. This past work is an important reminder that rigorous observational science is not only possible but essential to further improve the care delivered to critically ill patients. Scientists using existing data for research also promote a more efficient research agenda because they maximize the knowledge that can be gained from past, often expensive, efforts to gather data(122). Investigators using secondary data must carefully consider the advantages and disadvantages of each potential data source prior to selecting one or more for their analyses. By applying a rigorous approach to database selection and data quality assessment, investigators will be well on their way to ensuring that their study has the greatest impact.
Funding: Support for this work was provided in part by a grant from the Agency For Healthcare Research and Quality (K08 HS020672, Dr. Cooke), the National Heart, Lung, and Blood Institute (K08 HL091249, Dr. Iwashyna), and U.S. Department of Veterans Affairs Health Services Research & Development Services (IIR 11-109, Dr. Iwashyna). The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the US government.
The authors have not disclosed any potential conflicts of interest.