Comparative Effectiveness Research (CER) has the potential to transform the current healthcare delivery system by identifying the most effective medical and surgical treatments, diagnostic tests, disease prevention methods and ways to deliver care for specific clinical conditions. To be successful, such research requires the identification, capture, aggregation, integration, and analysis of disparate data sources held by different institutions with diverse representations of the relevant clinical events. In an effort to address these diverse demands, there have been multiple new designs and implementations of informatics platforms that provide access to electronic clinical data and the governance infrastructure required for inter-institutional CER. The goal of this manuscript is to help investigators understand why these informatics platforms are required and to compare and contrast six, large-scale, recently funded, CER-focused informatics platform development efforts. We utilized an 8-dimension, socio-technical model of health information technology use to help guide our work. We identified six generic steps that are necessary in any distributed, multi-institutional CER project: data identification, extraction, modeling, aggregation, analysis, and dissemination. We expect that over the next several years these projects will provide answers to many important, and heretofore unanswerable, clinical research questions.
The American Recovery and Reinvestment Act of 2009 provided $1.1 billion for Comparative Effectiveness Research (CER)1. The goal of CER is to generate new evidence on the potential effectiveness, benefits, and harms of different treatments, diagnostics, preventions, and care models under “real world” conditions. Widespread adoption of CER has potential to radically change healthcare. CER also places enormous demands on existing informatics research infrastructure2, as it requires aggregation and analysis of disparate data held by different institutions, each with its own representation of relevant events and accountabilities for protecting data as a matter of patient confidentiality and business operations.
Currently, most data manipulations are performed using non-coordinated applications (e.g., data collection forms, electronic health records [EHRs], research databases, condition-specific registries, and statistical analyses) with disjointed institutional control. In an effort to address these demands, there have been new designs and implementations of informatics platforms that provide access to electronic clinical data and the governance required for inter-institutional comparative effectiveness research3,4,5,6. Briefly, a “platform” is a suite of interconnected, coordinated applications, together with the operational environment that hosts those applications.
The goal of this manuscript is to compare and contrast six large-scale projects that are either developing or extending existing informatics platforms for CER. Rather than compare the informatics platforms at an abstract level, we focus on specific CER projects that provide implementations of informatics platforms and highlight design requirements and solutions.
The following sections provide an overview of the projects surveyed.
WICER is creating infrastructure to facilitate patient-centered outcomes research in Washington Heights, NY. The project facilitates comprehensive understanding of populations by leveraging data from existing EHRs, and combining data from institutions representing various healthcare processes. For example, it includes data from hospitals, clinics, specialists, homecare agencies and long-term care facilities. It also includes survey data from community residents with assessments on socioeconomic status, vital statistics, support networks, health and illness perceptions, quality of life, and health literacy. Data from multiple sources are merged in a data warehouse, where deeper analysis is performed by clinical and public health researchers. WICER investigators are using the infrastructure and methods on three clinical trials in hypertension care around diagnosis, adherence to therapy, and care management.
The HMO Research Network (HMORN) is a consortium of 19 health plans with formal research capabilities7. SPAN, a project within the HMORN, uses its Virtual Data Warehouse (VDW) to provide a standardized, federated data system across 11 partners to address CER in ADHD and obesity8. The VDW consists of commonly defined, linked tables within each health plan that capture medical care utilization, clinical data, health plan enrollment information, demographics, detailed inpatient and outpatient encounter information, outpatient pharmacy dispensing data, laboratory test results and vital signs9. The VDW is augmented with state and local cancer registry information on date and cause of death for health plan members. Each plan maintains control of individual VDW data files and does not have access to files held by other HMORN sites. All HMORN participants must be capable of running, without modification, SAS programs distributed by other sites to execute against their local VDW. SPAN is pioneering use of a new platform, PopMedNet™, that facilitates creation, operation, and governance of multi-site, distributed health data networks10.
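The distributed model described above, in which one program runs unmodified at every site and each site keeps control of its own files, can be sketched as follows. This is a minimal illustration only: Python stands in for the SAS programs that HMORN sites actually exchange, and the names `Site`, `run_distributed_query`, `count_adhd_encounters`, and the `dx_group` field are hypothetical, not part of the VDW or PopMedNet.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical row type: one encounter record in a site's local table.
Row = Dict[str, object]


@dataclass
class Site:
    """A participating health plan holding its own copy of the common tables."""
    name: str
    encounters: List[Row]  # held locally; callers only see program results

    def run(self, program: Callable[[List[Row]], int]) -> int:
        # The same program runs unmodified at every site, which is only
        # possible because all sites share one table definition.
        return program(self.encounters)


def count_adhd_encounters(rows: List[Row]) -> int:
    """Distributed 'program': count encounters in the ADHD diagnosis group."""
    return sum(1 for r in rows if r["dx_group"] == "ADHD")


def run_distributed_query(sites: List[Site],
                          program: Callable[[List[Row]], int]) -> Dict[str, int]:
    """Send one program to every site and collect the per-site aggregates."""
    return {site.name: site.run(program) for site in sites}


sites = [
    Site("plan_a", [{"dx_group": "ADHD"}, {"dx_group": "obesity"}]),
    Site("plan_b", [{"dx_group": "ADHD"}, {"dx_group": "ADHD"}]),
]
results = run_distributed_query(sites, count_adhd_encounters)
print(results)
```

Only the per-site counts cross institutional boundaries in this sketch; the encounter rows themselves are never returned to the requester.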
The CER-HUB is an Internet-based platform for conducting CER. A central function of CER-HUB is facilitating (through online, interactive tools) development of a shared, data processor library that can be downloaded by registered researchers to provide uniform, standardized coding of both free-text and structured clinical data. This shared library permits researchers to assess data on clinical effectiveness in multiple healthcare areas and gain access to information locked in freetext notes. Using CER-HUB, researchers collaboratively build software applications (MediClass applications11) that will process EHR data within their respective healthcare organizations, creating standardized datasets that can be pooled to address specific CER protocols. Participating researchers contribute IRB-approved, limited data sets to a centralized coordinating center to be pooled with data similarly processed from other healthcare organizations to answer CER questions. The CER-HUB is being used to conduct 2 CER studies addressing effectiveness of medication for controlling asthma and of smoking cessation counseling services, across 6 geographically-distributed and demographically-diverse health systems. Researchers and data providers for these initial studies come from 3 Kaiser health plans (Northwest, Hawaii, and Georgia regions), one consortium of Federally Qualified Health Centers located primarily along the west coast (OCHIN, Inc), one Veterans Administration service region (Puget Sound VA in Washington), and an integrated network of hospitals and physicians in the greater Dallas/Fort Worth area (Baylor Health Care System).
The RPDR is an enterprise data warehouse combined with a multi-faceted user interface that enables clinical research and CER across Partners Healthcare in Boston, MA. The RPDR is used to recruit patients for clinical trials, and to perform active surveillance. It amasses data from billing, decision support, and EHRs in the Partners' system. Data are available to researchers through a drag-and-drop web Query Tool12 that allows users to construct exploratory, ad hoc queries for hypothesis generation from structured data, and to get aggregate totals and graphs of age, race, gender and vitals. A utility exists for finding matched controls for patients. With proper IRB authorization, requests for detailed data on patients identified through the query tool can be made through an automated wizard. The RPDR has proven useful for gathering clinical trial cohorts, and for CER. This strategy was later adopted as the core of “Informatics for Integrating Biology and the Bedside” (i2b2)13. The RPDR was first released in December, 1999 and has been in production at multiple sites since March, 2002.
INPC was begun in 1994 as an experiment in community-wide health information exchange serving five major hospitals in Indianapolis, IN. Today, it includes data from hospitals and payers statewide14,15,16. Entities participating in INPC submit patient registration records, laboratory test results, diagnoses, procedure codes, and other data for various types of healthcare encounters. Data are also obtained from health departments and a pharmacy benefit manager consortium. Data are standardized (e.g., laboratory test results are mapped to LOINC17 with common units of measure) to the extent possible, prior to storage in a central repository. Data for a patient with visits to multiple INPC institutions can be linked using a patient matching algorithm. The COMET-AD project is using data from INPC to monitor healthcare processes and outcomes and to build systems to monitor patients for adverse drug events. The project also involves building infrastructure and workflows to support integration of biospecimen results with clinical data from the INPC.
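The standardization step described above (mapping each institution's local laboratory codes to a shared code and a common unit before central storage) can be sketched as follows. This is a minimal sketch under stated assumptions: the site names and local codes (`hosp_a`, `GLU`, `GLUC-S`) are hypothetical, the tables are illustrative rather than INPC's actual mappings, and `2345-7` is used as the LOINC code for serum or plasma glucose. Results that cannot be mapped are returned as `None`, reflecting the "to the extent possible" caveat.

```python
from typing import Optional

# Illustrative mapping only: (site, local code) -> shared standard code.
LOCAL_TO_STANDARD = {
    ("hosp_a", "GLU"): "2345-7",
    ("hosp_b", "GLUC-S"): "2345-7",
}

# Multipliers that convert a reported unit to the common unit (mg/dL here).
# For glucose, 1 mmol/L is about 18.016 mg/dL.
TO_COMMON_UNIT = {
    ("2345-7", "mg/dL"): 1.0,
    ("2345-7", "mmol/L"): 18.016,
}


def standardize(site: str, local_code: str, value: float, unit: str) -> Optional[dict]:
    """Map one local lab result to the shared code and common unit.

    Returns None when either the code or the unit has no known mapping,
    so unmapped results can be set aside for manual review.
    """
    code = LOCAL_TO_STANDARD.get((site, local_code))
    if code is None:
        return None
    factor = TO_COMMON_UNIT.get((code, unit))
    if factor is None:
        return None
    return {"code": code, "value": round(value * factor, 1), "unit": "mg/dL"}


print(standardize("hosp_a", "GLU", 90.0, "mg/dL"))
print(standardize("hosp_b", "GLUC-S", 5.0, "mmol/L"))
print(standardize("hosp_c", "XYZ", 1.0, "mg/dL"))
```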
The goal of SCOAP-CERTN is to assess how well an existing statewide quality assurance and quality improvement registry (i.e., the Surgical Care Outcomes Assessment Program) can be leveraged to perform CER. The SCOAP-CERTN leverages relationships built collaboratively in SCOAP to improve surgical care and outcomes and aims to build infrastructure for streamlined, electronic data abstraction from EHRs, patient reported outcomes, and healthcare payments across hospitals. Through a partnership with Microsoft Health Solutions Group (Redmond, WA), SCOAP-CERTN is identifying ways to maximize automatic capture of data from EHRs, to:
In addition, SCOAP developers plan to add functions to capture patient reported outcomes for research and quality improvement evaluation. The primary informatics goal is to assess how, and to what degree, the collection of SCOAP-CERTN measures can be automated across sites.
Designing, developing, implementing, and using health information technology (HIT) within healthcare delivery systems is a complex, socio-technical challenge. To provide a theoretical basis for our comparison of six CER informatics platforms we adapted an 8-dimension, socio-technical model of safe and effective HIT use18. This model prescribes attention to: (1) appropriate hardware/software, (2) a spectrum of clinical content ranging from case narrative, to standard vocabularies, to algorithms representing best practices, (3) human-computer interfaces enabling productive interactions with technology, (4) personnel who develop systems and how systems meet the needs of users in their social contexts, (5) workflow and communications (both between people and technology components) required to accomplish tasks using the technology, (6) organizational policies, procedures, culture, and environment that prescribe and govern how and where things happen and who is responsible, (7) external rules, regulations, and pressures which shape these organizational constraints, and (8) system measurement and monitoring which ensures adequate performance for primary intended use cases, i.e., the conduct of CER.
These eight constructs18 are used to investigate and evaluate aspects of CER platform design and implementation by ensuring that both the social as well as the technical aspects are considered. Failure to consider who will use the applications, how they will use them, and why they are necessary often leads to sub-optimal technology design and utilization.
We developed a written survey and sent it to informatics experts representing six large CER projects focusing on the design, development, and use of multi-institutional informatics platforms. Projects were selected by convenience, yet they are representative of vastly different approaches researchers have taken to address numerous CER challenges.
We (DFS, BLH) developed a 2-page, open-ended survey that highlighted project-specific similarities and differences. We created 2 to 8 questions within each of the 8 dimensions to ensure that all important aspects were captured18. For example, within Workflow/communication we asked, “How do data get into your warehouse?” and “What stages do the data go through?” Similarly, within the Hardware/Software dimension we asked, “What computing infrastructure is required to run your system?”
Completed surveys were returned by e-mail and checked for completeness. DFS and BLH read through the 6–10 page responses from each of the co-authors looking for key concepts highlighting project similarities and differences. After review and discussion, it became clear that the following 4 dimensions of the 8-dimension model were the key differentiators: content or data (Table 2); workflow/communication regarding how data moved from sources to analysis (Table 3); people (investigators, data programmers, research analysts, managers) involved in the projects (Table 4); and organizational policies, procedures, and culture (Table 5). We extracted data items from the surveys to fill in the tables. In addition to survey items, two authors (DFS, BLH) gathered information regarding project descriptions and funding from websites and journal articles (Table 1). Drafts of completed tables were sent to co-authors for review.
All projects implement six generic data processing steps necessary for distributed, multi-institutional CER projects: data identification, extraction, modeling, aggregation, analysis, and dissemination.
All projects performed these activities, although there were variations in how (real-time aggregation of HL-7 transactions vs. nightly or as-needed extraction, transformation, and loading), where (local site vs. coordinating center), and with what tools (web-based query interfaces for researchers vs. tools to develop Natural Language Processing (NLP) modules).
Table 2 compares data sources, types, models, and handling of duplicate patients. All projects collected data from multiple sources (e.g., hospitals, clinics, billing, long-term care) and included different data types (e.g., numeric test results, ICD-9-encoded problems, and free-text progress notes). Only three projects used a “master patient index” that enabled them to combine data from patients who received treatment at different organizations. All projects used different, and sometimes multiple, data storage and manipulation formats ranging from SAS tables to XML-based documents to relational databases.
Table 3 provides a comparison of data flow and transformation, from local EHRs to aggregated analyses. The most important differences highlighted in Table 3 pertain to when patient-identifiable data leave local sites. In two projects, this occurs immediately following extraction from the local transaction-based clinical or administrative systems. In SPAN and CER-HUB, transfer of “raw” patient-identifiable data never occurs (i.e., all data are processed at the local site by data analysis programs that are distributed from the central site, and only data conforming to protocol-specific Limited Data Sets are shared). Only three sites had any form of natural language processing capability; the other sites relied solely on numeric or coded data elements.
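The local-processing pattern described above, in which raw records stay at the site and only protocol-specific fields are shared, can be sketched as a simple field whitelist. This is an illustration under stated assumptions: the field names in `PROTOCOL_FIELDS` and the sample record are hypothetical, and a real Limited Data Set is defined by the study protocol and applicable regulations rather than by a list like this one.

```python
# Hypothetical protocol definition: the only fields allowed to leave a site.
PROTOCOL_FIELDS = ("study_id", "age_group", "dx_code", "encounter_year")


def to_limited_record(raw: dict, allowed=PROTOCOL_FIELDS) -> dict:
    """Keep only protocol-approved fields; everything else stays local.

    A whitelist is used instead of a blacklist so that an identifier field
    nobody anticipated is dropped by default rather than shared by default.
    """
    missing = [field for field in allowed if field not in raw]
    if missing:
        raise ValueError(f"record lacks protocol fields: {missing}")
    return {field: raw[field] for field in allowed}


# Illustrative raw record as it might exist inside a local system.
raw = {
    "study_id": "S-001",
    "name": "Jane Doe",
    "mrn": "12345",
    "age_group": "40-49",
    "dx_code": "493.90",
    "encounter_year": 2011,
}
shared = to_limited_record(raw)
print(shared)
```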
Also of interest in Table 3 is the state of data analysis tools offered by projects. All projects are working on “user-friendly” tools to facilitate researchers' direct access to data via ad hoc queries, while concurrently meeting multi-institutional requirements for protecting patient data and corporate business interests. To date, only the RPDR has a working version.
Table 4 describes key personnel. The most important difference is that some projects either have or are working on Internet-based interfaces that allow non-technical investigators to perform a limited set of data queries and analyses on the combined data set. For example, the SPAN project currently requires all queries be coded as SAS programs and sent to the local site where they are executed and the results returned after manual review; SPAN is beta-testing an internet-based approach using the PopMedNet architecture to allow non-technical users to issue queries.
Table 5 provides a comparison of project governance and internal organizational policies and procedures. All projects have an oversight committee, most consisting of representatives from all sites involved in the project. Often this committee is responsible for governing all aspects of data ownership and sharing, project membership, and publication rights and responsibilities.
We compared six large CER projects and described how they employ informatics platforms to provide data aggregation, analysis, and research management capabilities. Many of these platforms were originally designed and developed to address widely different healthcare, organizational and research objectives; only after significant amounts of work had been completed were they transitioned to focus on CER. For example, the RPDR was originally designed to answer the question, “How many patients with a specific set of characteristics have we treated within our integrated delivery network?” On the other hand, INPC and WICER started as means of improving the quality and efficiency of care in large metropolitan areas by creating centrally-managed health information exchanges (HIEs). Similarly, SCOAP-CERTN started as a registry to improve surgical outcomes and efficiency. SPAN (and to a lesser extent CER-HUB) builds upon existing research networks composed of similarly organized and managed large, integrated health plans.
Different data types are required to create complete, patient-centered views of a patient's medical history. The surveyed projects demonstrate that creating a useful CER platform requires enormous amounts, and a large variety, of data. To access these data, CER investigators need to collect them from as many different sources within their participating organizations as possible. Therefore, we see researchers collecting data from inpatient and outpatient EHRs (including the text narrative of clinical encounters) and from billing and ancillary systems such as laboratory, pharmacy, and radiology. In addition, it is important to collect data that document that patients actually received the care that was ordered, so we see organizations collecting pharmacy dispensing and patient-reported data when available. This array of data, while vast, is nearly always incomplete (i.e., the data form sparse representations in a large-dimensional space of real-world patient care facts), and methods that use these data must be appropriate to the task of measuring health status and care events with the data available.
Researchers need to aggregate data from multiple organizations to have enough information to identify small differences, address bias, perform subgroup analyses, improve generalizability, allow evaluation of demographic and geographic variation, and identify rare events. Therefore, CER informatics platforms must be able to extract and collect data from many different organizations to compile as complete a view of conditions, treatments, and individuals as possible. Towards that end we see investigators working to include data from multiple organizations, pursuing non-traditional research data sources, such as long-term care facilities, home and public health agencies, and attempting to reliably ascertain patients' socioeconomic status on a widespread basis.
A key requirement for data collection across healthcare provider organizations located in the same geographic region is the need to merge data from the same patient who has received healthcare services and had clinical data captured at multiple institutions. Such efforts require a community-wide master patient index that identifies patients based on multiple demographic data (e.g., first name, last name, date of birth, gender, social security or telephone numbers) and keeps track of all patient identifiers used by various participating organizations to create a single, master patient identifier19. To date, only the CER projects that were built on top of existing health information exchange platforms designed for patient care have tackled this extraordinarily difficult problem20, 21, but in the future patient matching capabilities will be a critical success factor.
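The role of a master patient index described above can be sketched with a deliberately simplified, exact-match version. This is not the matching algorithm used by any surveyed project: the class `MasterPatientIndex`, the function `match_key`, and the record fields are hypothetical, and production indexes generally must tolerate typos, name changes, and missing fields, which exact matching on normalized demographics does not.

```python
import itertools
from typing import Dict, Set, Tuple


def match_key(rec: dict) -> Tuple[str, str, str, str]:
    """Normalize the demographics used for matching (case and whitespace)."""
    return (
        rec["first"].strip().lower(),
        rec["last"].strip().lower(),
        rec["dob"],
        rec["sex"].strip().upper(),
    )


class MasterPatientIndex:
    """Assigns one master identifier per matched patient and tracks every
    (organization, local identifier) pair linked to it."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str, str, str], int] = {}
        self._local_ids: Dict[int, Set[Tuple[str, str]]] = {}
        self._next_id = itertools.count(1)

    def register(self, org: str, local_id: str, rec: dict) -> int:
        key = match_key(rec)
        if key not in self._by_key:
            self._by_key[key] = next(self._next_id)
        master_id = self._by_key[key]
        self._local_ids.setdefault(master_id, set()).add((org, local_id))
        return master_id

    def local_ids(self, master_id: int) -> Set[Tuple[str, str]]:
        return self._local_ids.get(master_id, set())


mpi = MasterPatientIndex()
a = mpi.register("hospital_a", "A-17",
                 {"first": "Maria ", "last": "Lopez", "dob": "1970-02-03", "sex": "f"})
b = mpi.register("clinic_b", "B-902",
                 {"first": "maria", "last": "LOPEZ", "dob": "1970-02-03", "sex": "F"})
c = mpi.register("clinic_b", "B-903",
                 {"first": "John", "last": "Smith", "dob": "1981-11-30", "sex": "M"})
print(a, b, c)
```

In this sketch the first two registrations resolve to the same master identifier, so data recorded under `A-17` and `B-902` could be merged into one patient view.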
Researchers must be able to extract required data from various electronic data systems, map data types to standardized clinical representations, and analyze them. Design and development of these “mapping” applications is one of the biggest challenges in any multi-institutional research project, because different organizations often refer to the same activity, condition, or even procedure by different names, and the same names can refer to different things across institutions. Further, even with accurate mapping it is difficult for researchers to fully appreciate local idiosyncratic data issues (e.g., non-random incomplete data capture) without active engagement of local data experts.
Furthermore, conducting CER is a complex undertaking requiring people with widely different skills, often in different locations and subject to different organizational policies and practices. In an attempt to reduce potential for misunderstanding in collaboration processes, platform developers are working to create powerful, user-friendly tools for data extraction, manipulation, and analysis. These tools are being designed so CER project staff, who often have little informatics training, can perform their tasks more efficiently. In addition, several projects are developing tools to help researchers make sense of highly variable and clinically-rich free-text notes documenting patient care.
The social, legal, ethical, and political challenges involved in setting up and conducting large, multi-institutional CER projects must not be underestimated. Friedman et al. stated that “organizations are understandably reluctant to move data beyond their own boundaries absent a clear and specific need to do so, and patients will be less likely to consent to allow this to happen.”22 Therefore, in addition to providing the technical infrastructure required to collect, standardize, normalize, and analyze disparate data, informatics platforms must conform to local organizations' internal governance and IRBs' rules and regulations as well as existing state and federal guidelines. One design to address use of protected health information is to retain physical control of raw data while providing for their aggregation as limited data sets to answer specific questions. Other ways in which projects have accommodated inter-institutional governance issues include standardizing data models across the project; limiting access to authorized personnel while facilitating remote access; restricting the types of queries that can be executed and masking patient-specific, identifiable data; and logging all data transactions and access activities. As rules, regulations, and guidelines evolve (e.g., proposed Common Rule revision23), CER platforms and governance processes must evolve accordingly.
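Two of the governance controls listed above, restricting queries to aggregates that mask patient-specific data and logging every access, can be sketched together. The specific technique shown here, suppressing counts below a minimum cell size, is our illustrative choice and is not reported by the surveyed projects; the threshold `MIN_CELL_SIZE`, the function `aggregate_query`, and the in-memory `AUDIT_LOG` are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, List, Optional

MIN_CELL_SIZE = 5           # hypothetical threshold; set by local policy
AUDIT_LOG: List[dict] = []  # stand-in for a persistent, tamper-evident log


def aggregate_query(user: str, rows: List[dict],
                    group_by: str) -> Dict[str, Optional[int]]:
    """Return counts per group, masking any group smaller than MIN_CELL_SIZE.

    The request is logged before any data are touched, so refused or
    failed queries still leave a trace.
    """
    AUDIT_LOG.append({
        "user": user,
        "group_by": group_by,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    counts = Counter(row[group_by] for row in rows)
    # Small cells are replaced with None instead of being reported, since a
    # count of one or two patients can identify individuals.
    return {group: (n if n >= MIN_CELL_SIZE else None)
            for group, n in counts.items()}


rows = [{"sex": "F"}] * 6 + [{"sex": "M"}] * 2
result = aggregate_query("investigator_1", rows, "sex")
print(result)
```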
CER stands to transform the current healthcare delivery system by identifying which therapies, procedures, preventive tests, and healthcare processes are most effective from the standpoints of cost, quality, and safety. State-of-the-art informatics platforms are necessary to carry out this type of research across organizations with disparate patient populations, health information systems, data types, and local governance structures.
We used an 8-dimension, socio-technical model to develop a survey enabling us to compare and contrast informatics platforms that are under development or in use in six large CER efforts. Based on the data we collected, we identified six generic steps necessary in any distributed, multi-institutional CER project: data identification, extraction, modeling, aggregation, analysis, and dissemination.
We conclude that all of the informatics platforms for CER studied are on their way to creating the socio-technical infrastructure required to enable researchers from multiple institutions to conduct high-quality, cost-effective CER. We expect that over the next several years, these projects will provide answers to many important CER questions that in the past were virtually inaccessible. In addition, we expect many more CER-focused informatics research platforms to be designed, developed, and tested as the fields of informatics and CER continue to evolve.
Dr. Sittig is supported in part by a grant from the National Library of Medicine (R01-LM006942) and by a SHARP contract from the Office of the National Coordinator for Health Information Technology (ONC #10510592).
Dr. Hazlehurst's work is supported in part by grants from the National Library of Medicine (R21LM009728), and the Agency for Healthcare Research and Quality (AHRQ) (R01HS019828, R18HS18157).
Dr. Brown is supported in part by the AHRQ grant 1R01HS019912.
Dr. Murphy is supported by grants from the National Institutes of Health (U54LM008748, UL1RR025758, U24RR025736) and by a grant from the Office of the National Coordinator (90TR0001/01).
In this work, Dr. Rosenman was supported by a grant from AHRQ (R01HS019818).
Dr. Tarczy-Hornoch is supported in part by AHRQ 1 R01 HS 20025-01 “Surgical Care and Outcomes Assessment Program (SCOAP) Comparative Effectiveness Research Translation Network (CERTN)” and by NIH NCRR 1 UL1 RR 025014 “Institute of Translational Health Sciences”.
Dr. Wilcox is supported in part by AHRQ grant R01 HS019853-01, “Washington Heights/Inwood Informatics Infrastructure for Community-Centered Comparative Effectiveness Research (WICER)”.
We also thank Andrea Bradford, PhD for editorial assistance.