J Biomed Inform. Author manuscript; available in PMC 2012 April 1.
Published in final edited form as:
PMCID: PMC3063322

The DEDUCE Guided Query Tool: Providing Simplified Access to Clinical Data for Research and Quality Improvement


In many healthcare organizations, comparative effectiveness research and quality improvement (QI) investigations are hampered by a lack of access to data created as a byproduct of patient care. Data collection often hinges upon either manual chart review or ad hoc requests to technical experts who support legacy clinical systems. To provide this needed capacity for data exploration at our institution (Duke University Health System), we have designed and deployed a robust Web application for cohort identification and data extraction: the Duke Enterprise Data Unified Content Explorer (DEDUCE). DEDUCE is envisioned as a simple, Web-based environment that allows investigators access to administrative, financial, and clinical information generated during patient care. By using business intelligence tools to create a view into Duke Medicine's enterprise data warehouse, DEDUCE provides guided query functionality through a wizard-like interface that lets users filter through millions of clinical records, explore aggregate reports, and export extracts. Researchers and QI specialists can obtain detailed patient- and observation-level extracts without needing to understand structured query language or the underlying database model. Developers designing such tools must devote sufficient effort to training and develop application safeguards to ensure that patient-centered clinical researchers understand when observation-level extracts should be used. This may mitigate the risk of data being misunderstood and consequently used in an improper fashion.

Keywords: translational research, medical informatics, clinical informatics, medical records systems, computerized health care evaluation mechanisms, hospital information systems

1. Introduction

Although the adoption of health information technologies (HIT) such as electronic health records (EHRs) and computerized physician order entry (CPOE) has been identified as critical for improving the nation's health care [1–3], the invaluable clinical data gathered by HIT, which could spur research to an unprecedented degree, have often gone untapped [4]. The Recovery Act of 2009 included $19.2 billion in funding intended to encourage physicians and healthcare organizations to implement EHRs and make “meaningful use” of the collected data by exchanging information and reporting clinical quality measures [5]. An additional $1.1 billion in funding will be administered through the U.S. Department of Health and Human Services. These grants will support comparative effectiveness research evaluating methods used to diagnose, treat, and monitor clinical conditions [6].

But despite clear imperatives for making information in health system databases accessible for secondary data analysis [7,8], a mature approach to combining, analyzing, and leveraging these resources has yet to emerge. Within many organizations, data reside in separate silos or in proprietary databases and are often captured in incompatible formats. Even when laborious manual queries are used, the resulting extracts are often incomplete because the source systems were not designed with domain-spanning research in mind. Such functionality requires the adoption of structure and standards [9,10].

In this manuscript, we report on a “research portal” developed by the Duke University Health System (DUHS)—the Duke Enterprise Data Unified Content Explorer (DEDUCE). This user-friendly data extraction system is envisioned as a multiple-tool environment that will expedite access to clinical data stored in the organizational data warehouse, supporting grant applications, research projects, and quality improvement (QI) activities. Here we report on the conceptual design and development of the first tool in DEDUCE, Guided Query (GQ), which uses business intelligence (BI) tools to allow users to obtain both an aggregate report and a raw data extract based on query parameters.

2. Background

Research portals were developed in response to the need to access and combine diverse sources of data from clinical and research domains, both within and across institutions. Two fundamental portal types are described in the literature: translationally-based and clinical practice-based. Although both include clinical data and provide user interfaces, the goals of the inaugural database designs differ. Translationally-based platforms such as REDCap [11], Slim-Prim [12], MMIM [13], and TraM [8] start with data collected for research purposes (clinical or basic science) and integrate these domains in a user-accessible repository. These tools may use federated queries to derive data for specific disease states across a national set of hospitals, enable data sharing for multicenter translational projects, and create a framework for the input of new research data and subsequent curation [8,13–15]. Clinical practice-based portals, on the other hand, use patient care data from clinic and hospital databases without predefining the research project or domain. The focus in this context is on ensuring that all reasonable data elements regarding a patient's healthcare encounter are standardized and accessible.

While translational portals have been relatively well documented in the literature, there is a notable lack of publications describing the conceptual design, deployment, and operationalized use of clinical practice-based portals. Most of the available information on such portals has been presented in forums such as conferences, proceedings, news articles, or electronic white papers, making a comprehensive discussion of differing feature sets and deployment methodologies challenging. Vanderbilt's Synthetic Derivative research application is one such tool advertised as containing both structured clinical data and care narratives (e.g., nursing notes; surgical reports) on 1.7 million patients, as derived from their health system's EHR [16,17]. Its website suggests that only deidentified data are available, and a recent conference presentation indicates that ICD9 codes, labs, vital signs, medications, CPT codes, and demographics are available as query criteria. However, the exporting capabilities are unclear, and we infer that the tool is designed for the needs of the patient-centric researcher seeking to define a cohort and not optimized for QI personnel looking for “cohorts” of observation-level data (e.g., all lab results of a particular type). Similarly, the Stanford Translational Research Integrated Database Environment (STRIDE) provides self-service research access to a clinical data warehouse that supports two hospitals and numerous clinics [18]. Users can search for patients using criteria including demographics, ICD-9/CPT codes, lab results, pharmacy orders, and information held within narrative clinical reports. STRIDE also provides research access to a tumor tissue databank, thus integrating translational data with its clinical foundation. Yet according to its Web site, STRIDE does not yet release protected health information (PHI) and researchers must collaborate with informatics staff to discuss the extraction of clinical data for research purposes. Based on the only formal report available to date, it is unclear whether STRIDE permits the extraction of observation-level data needed for QI investigation [18].

Partners Healthcare system has published sporadic short reports on its research portal, the Research Patient Data Repository (RPDR), which is designed to aid cohort identification for research studies, support grant applications, and enable outcomes research for two medical centers and four community hospitals [19,20]. This tool has two distinct functions: 1) a query tool that returns aggregate numbers of patients based on complex queries generated from a user-friendly, “drag-and-drop” interface; and 2) a data acquisition tool allowing researchers to obtain detailed extracts including PHI, when authorized by an IRB protocol. Various inpatient and outpatient data elements are available, including demographics, encounter data, diagnoses, medications, procedures, labs, radiology/pathology reports, and discharge notes. However, as with STRIDE, the extent to which observation-level data can be extracted independently of patient cohort definition is unclear. Recently, some RPDR features were incorporated into SHRINE [21], which uses a federated model to access the clinical databases of three large health centers. The SHRINE prototype, however, functions in a test environment using an enterprise dataset that is not refreshed. SHRINE is one of a growing number of tools that use the open-source Informatics for Integrating Biology and the Bedside framework (i2b2), sponsored by the NIH Roadmap National Centers for Biomedical Computing. This platform bridges clinical and scientific domains by providing open-source software tools for concomitant data collection and management. Aimed at clinical investigators, bioinformaticists, and software developers, i2b2 application modules can be integrated using a variety of Web services and XML messages [14,15,22]. The i2b2 framework has been a fixture at many healthcare informatics and data warehousing conferences where organizations discuss research query tools.

Although these clinically-based research portals offer aggregate counts and raw data, the emphasized goal is to define a highly specific patient cohort that suits the needs of a physician-researcher. However, there are myriad QI questions that require investigation of observation-level data, such as lists of medication or laboratory orders [23], and the query procedure should be designed around these needs. Such investigation will become increasingly important to comply with new “meaningful use” mandates from the Recovery Act [5]. We view the lack of focus on obtaining a specific, defined “cohort” of encounter-, process-, or observation-level data as a major gap in currently reported applications. Our objective in developing DEDUCE was to build an access model that simultaneously served both patient- and encounter-centered needs by creating a user-friendly gateway to various axes of patient care. The DUHS comprises two community hospitals and an academic facility, the Duke University Medical Center (DUMC); the DUMC itself includes a teaching hospital and more than 150 affiliated outpatient clinics. We recognize that in order to serve all user types from these settings, DEDUCE may ultimately require multiple access environments. Since there are relatively few formally published descriptions of how organizations have developed and deployed clinically-based portals, we share here the experiences of the DUHS in developing the underlying DEDUCE framework and releasing our first DEDUCE tool—Guided Query (GQ).

3. Business need and design requirements

The need for this application grew from an increasing number of requests to health system data warehouse personnel for research and QI data extracts. In response, the Director of the DUHS Data Warehouse Group and the Associate Chief Information Officer for Enterprise Analytics and Patient Safety collaborated to create a suite of self-service tools for clinical data extraction. An advisory steering committee was convened to shape the functional design (see section 4.2). At the outset of the design process, the committee designated principal functional requirements for the long-term development of a comprehensive, multiple-environment research portal:

  1. DEDUCE should be a flexible, self-service application for data extraction by clinical researchers and QI professionals across the DUHS. DEDUCE will include a number of tools or “user environments” catering to different user levels.
  2. DEDUCE should incorporate multiple domains of patient care and be flexible enough to eventually include evolving knowledge domains from the translational arena, including genomics and proteomics. We did not want to predefine research areas, as has been done with translationally-based research portals.
  3. Both granular and aggregate data should be available with PHI to the extent allowed by IRB policies.
  4. Data should be filterable and extractable not only at the patient level (e.g., counts of patients and their attributes, such as demographics) but at the observation level as well (e.g., counts of medication orders and their attributes, such as order start date or medication type).
  5. The DEDUCE portal should be usable without knowledge of the underlying clinical database structure.

The steering committee acknowledged that design decisions made in developing the first DEDUCE tool would shape all future DEDUCE applications. The committee therefore decided to educate the Duke Medicine user base over time by developing a series of tools of increasing complexity. Given pressures to develop the first tool quickly, we decided to use a BI application to create a user-friendly view into our existing integrated enterprise data warehouse, which receives data from legacy clinical systems using ETL processes. Federated models, as reported elsewhere [13,21,24], allow each source system owner better control of their data, but typically exhibit poorer performance relative to their integrated counterparts. Federated models are appropriate when multiple independent entities share data [24], and we feel that they will be useful as we share data with external partners.

Development began with an environment that permitted simple data exploration, report aggregation, and data extraction using a Web-based wizard, a process we termed Guided Query (GQ). The goal of the GQ environment was to permit queries to be created easily and executed efficiently behind the scenes without the user feeling “lost.” As GQ would query all data sources available in future DEDUCE environments, this approach would help prospective users become familiar with the clinical warehouse data for more powerful future tools (such as an interactive clinical cohort builder). Cognos BI (IBM, Armonk, NY, USA) was the application of choice given that our data warehouse analysts had extensive experience in building reports with this software suite. Additionally, Cognos is a familiar reporting framework for many DUHS clinicians who access organizational metrics and scorecards.

An alternative to building DEDUCE internally would have been to use the i2b2 framework. Early in the project, a DEDUCE developer evaluated a local instance of i2b2 in order to assess its functionality and performance. After considering our organization's unique needs at that time, we decided against implementing the i2b2 framework based on several issues. First and most importantly, i2b2 has database structure requirements (the entity-attribute-value model) that differ from those of our existing warehouse, requirements that would have necessitated the creation of customized extract-transform-load (ETL) code to place data into subject areas not directly supported by i2b2. Although no ETL strategy is without challenges, this would represent duplicated effort given that, like many large health systems, we already write ETL code to bring clinical data streams into our warehouse. This is an issue that has been noted by others evaluating i2b2 [25]. Our alternate approach, using a BI layer to create a user-friendly view into the database, does not create additional ETL tasks and therefore allowed us to devote more effort to GQ interface design. Another related consideration is that any upgrades to i2b2 are expected to require additional ETL work to retain tool functionality, and i2b2 would represent yet another system to maintain. Although adaptation efforts are almost always necessary when upgrades occur, our BI tool is already used for many other applications that access the warehouse to fulfill business needs. DEDUCE GQ can then “piggyback” on the maintenance efforts invested here, requiring only minimal new tasks on its own. Finally, because we view the future of DEDUCE as a suite of tools, we felt that we must choose a framework consistent with this vision. Although future functional requirements for other tools were not formally drafted, we knew that the inability to upload a cohort of MRNs or e-mail results (i.e., execute the query in the background) within i2b2 made it an inappropriate platform for our needs.

4. Methods

4.1. Application overview

Data from hospital and clinic operations are archived in the Decision Support Repository (DSR), a custom-built data warehouse containing integrated clinical and financial data. The DEDUCE GQ application interacts directly with the DSR via structured query language (SQL) created by the BI tools and does not require a separate research-specific data warehouse (Figure 1). The GQ Web interface was built using Cognos BI 8.2 and custom JavaScript applications. The user is guided through a series of prompt pages that define inclusion and exclusion criteria for a SQL query that runs in the background. Results include aggregate patient counts, encounters, and observation-level data such as ICD9 diagnostic codes or medication orders. DEDUCE has its own active study protocol, approved by the DUHS IRB, which permits use by QI personnel; researchers with appropriate IRB authorization are also allowed access.
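
As an illustration of the guided approach described above, the following Python sketch shows one way that inclusion and exclusion selections could be translated into a parameterized SQL statement. It is not the SQL that the BI tools generate; the table names, column names, and the GuidedQuery and build_sql helpers are hypothetical, since the DSR schema is not described in this report.

```python
from dataclasses import dataclass, field

# Hypothetical subject-area tables; the real DSR schema is not given in the text.
SUBJECT_TABLES = {
    "labs": "lab_result",
    "medications": "medication_order",
}


@dataclass
class GuidedQuery:
    subject_area: str
    include: dict = field(default_factory=dict)  # column -> allowed values
    exclude: dict = field(default_factory=dict)  # column -> disallowed values


def build_sql(query):
    """Return (sql, params) for one subject area, using bound parameters."""
    table = SUBJECT_TABLES[query.subject_area]  # lookup in a fixed whitelist
    clauses, params = [], []
    for operator, criteria in (("IN", query.include), ("NOT IN", query.exclude)):
        for column, values in criteria.items():
            if not column.isidentifier():
                raise ValueError(f"bad column name: {column!r}")
            if not values:
                continue  # an empty selection adds no filter
            marks = ", ".join("?" for _ in values)
            clauses.append(f"{column} {operator} ({marks})")
            params.extend(values)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    return f"SELECT * FROM {table} WHERE {where}", params


# Made-up selections: two lab codes included, one sex category excluded.
example = GuidedQuery(
    subject_area="labs",
    include={"lab_code": ["K", "NA"]},
    exclude={"patient_sex": ["U"]},
)
sql, params = build_sql(example)
```

Keeping user-selected values in bound parameters, rather than pasting them into the SQL text, is a common safeguard for any wizard that assembles queries on the user's behalf.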

Figure 1
Overview of the DEDUCE Guided Query Architecture

4.2. Project duration and team members

Technical work on the GQ tool began in January of 2008 and followed an agile software development model [26] over 7 months, during which functional requirements and technical solutions were developed iteratively with the advisory steering committee. This group included 26 representatives from hospital and clinic leadership, patient safety, the institutional review board (IRB), outcomes/health services research, and clinical specialty areas. Committee members were appointed for one of three reasons: they or their group historically had many data warehouse extract requests; they had a unique role at the DUHS (such as a patient safety officer); or they had a specific, personal interest in seeing such an application made available. These persons thus represented a broad cross-section of the potential DEDUCE user base. The steering committee convened on a biweekly basis to review progress by the DEDUCE development team and make further suggestions. Between meetings, all committee members had access to prototype versions of the tool and were asked to evaluate its usability and functionality.

The DEDUCE technical development team consisted of an analyst (1.0 full-time equivalent [FTE]), a business intelligence (BI) developer (0.5 FTE), a contracted BI architect (1.0 FTE), and a systems architect (1.0 FTE). All developers were trained in HIPAA regulations regarding patient privacy.

4.3. Technical description

4.3.1. Data sources and content

The DSR resides in a 1.5-terabyte database (Oracle Server 10g Enterprise Edition, Redwood Shores, CA) hosted on an AIX machine with 4 processors and 32 gigabytes of memory. ETL processes bring daily data from the DUHS clinical information systems into the DSR using a Linux 4 × 2 CPU with 19 gigabytes of memory. Beyond the ETL process, no further data cleansing was performed, although some data curation was done to improve the usability of the GQ tool. All data housed within the DSR are governed by a comprehensive Information Security Operations Plan.

The clinical domains within the DSR available for DEDUCE query are described in Table 1. Comprehensive data are currently available for Duke University Medical Center (DUMC) and its associated outpatient clinics. Clinical data from the two DUHS community hospitals are expected to become available over the course of 2010.

Table 1
Data available in DEDUCE Guided Query for Duke University Medical Center (DUMC)

4.3.2. Business intelligence user interface

Cognos BI interacts with clinical data kept in the DSR and presents a framework within which to query these data and design interactive reports. The GQ user interface is a complex, multi-paged, prompted report created in Cognos Report Studio, a module of Cognos Business Intelligence 8.2, which runs on a Windows 2003 Server machine with two dual-core processors and 4 gigabytes of memory. To end users, the GQ appears as highly interactive Web pages accessible using Internet Explorer versions 6.0 and higher. Users interact with prompts, their selections triggering custom JavaScript that dynamically determines what prompts should be presented throughout the steps of the GQ. Each script renders another Web page from the user's perspective. For example, at the first GQ screen, the user picks a patient care axis such as medications or labs. This selection is stored as a variable, which then causes conditional rendering of the remaining GQ report pages. In this way, the same Cognos report pages are continuously executed behind the scenes, yet the objects displayed are conditional upon user selections. Aggregate data from GQs are displayed in HTML, PDF, or Excel as tables or graphs depending on user selections. A flat file of the raw data used to create the aggregate reports may also be selected, although this access is conditional on user role.
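
The conditional rendering described above can be sketched in a few lines. The actual GQ uses custom JavaScript inside Cognos report pages; the Python below only illustrates the idea that a single stored selection decides which prompts appear. The prompt names and the prompts_to_render helper are hypothetical and are not taken from the tool.

```python
# Hypothetical prompt catalogue; the real promptable items are listed in Table 3.
COMMON_PROMPTS = ["encounter_date_range", "patient_age", "patient_sex"]
AXIS_PROMPTS = {
    "medications": ["medication_name", "order_start_date"],
    "labs": ["lab_test_code", "result_value_range"],
}


def prompts_to_render(selected_axis):
    """Return the prompts shown after the first screen for the chosen axis."""
    if selected_axis not in AXIS_PROMPTS:
        raise ValueError(f"unknown patient care axis: {selected_axis!r}")
    # The same page template is reused; only axis-specific objects change.
    return COMMON_PROMPTS + AXIS_PROMPTS[selected_axis]


lab_prompts = prompts_to_render("labs")
```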

4.4. Access, authentication, and patient confidentiality

Any Duke Medicine staff member may request access to DEDUCE in writing. DEDUCE authenticates using Microsoft Windows Server 2003 (Redmond, WA, USA) Active Directory accounts, as these are employees' primary means of accessing workstations and clinical applications. When logging into the system, users must choose one of 4 user role types (Table 2) and provide the appropriate IRB protocol numbers that authenticate against the DUHS IRB database. Upon first entry into the system and every 6 months afterwards, the user must agree to the terms of use: Protect data; do not share data outside of Duke without approval; and store data only on Duke computer drives. Additionally, at every login, they must agree to the DEDUCE “data use agreement,” which reminds the user of the eight most important aspects of the Terms of Use that safeguard security and privacy. All data created in the course of a query are stored electronically on servers belonging to the DUHS, and all login activities as well as the SQL executed are logged in perpetual audit trails. All data transfers are subject to encryption via SSL certificates.
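
The login-time checks described above can be summarized as a simple gate. The sketch below is an assumption-laden illustration, not the DEDUCE implementation: the login_checks function, its parameters, and the approximation of "6 months" as 183 days are all hypothetical, and the lookup against the IRB database is reduced to membership in a set.

```python
from datetime import date, timedelta

# Assumption: "every 6 months" is approximated here as 183 days.
TERMS_VALID_FOR = timedelta(days=183)


def login_checks(today, terms_accepted_on, protocol_number,
                 approved_protocols, accepted_data_use_agreement):
    """Return the steps still outstanding before a session may begin."""
    outstanding = []
    # Terms of use: required at first entry and again once the interval lapses.
    if terms_accepted_on is None or today - terms_accepted_on > TERMS_VALID_FOR:
        outstanding.append("accept terms of use")
    # Data use agreement: required at every login.
    if not accepted_data_use_agreement:
        outstanding.append("accept data use agreement")
    # Protocol number: stands in for the check against the IRB database.
    if protocol_number not in approved_protocols:
        outstanding.append("provide a valid IRB protocol number")
    return outstanding


# Made-up example: terms accepted 212 days before login, everything else in order.
result = login_checks(date(2009, 8, 1), date(2009, 1, 1),
                      "Pro0001", {"Pro0001"}, True)
```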

Table 2
User Roles as Defined in the DEDUCE System

5. Results

5.1. Navigation of the Guided Query interface and data export

The DEDUCE GQ was released on August 1, 2008. The user begins with a prompt to select one of 6 subject areas: ICD9 diagnoses, ICD9 procedures, CPT procedures, lab results, CPOE orders, or medication orders. Patient encounter and demographic data can be queried within each of these areas, but users may only search and extract a single subject area at a time. This represents a critical decision that was made to ensure the simplicity of the tool, as there are multiple ways in which a user may combine subject areas, depending on the research or QI question. The query has five distinct steps, each with its own prompt page: 1) subject area selection, 2) encounter criteria, 3) demographics, 4) subject area criteria, and 5) report output. The promptable data items are displayed in Table 3. As these steps are taken, a status box appears on the left detailing the steps completed and the criteria selected in order to minimize any confusion during the process (this reflects our awareness that many users will multitask and may require reorientation) (Figure 2).

Figure 2
DEDUCE Guided Query (GQ) Web Interface

Table 3
User-selected prompts in DEDUCE Guided Query

At the export screen, users may choose whether to display aggregate report results as tables, graphs, or both. Although DEDUCE is more a data extraction tool than a visualization tool, the steering committee decided that a basic aggregate report of data item counts was necessary to engage novice users as well as to provide immediate sample size feedback without requiring data extract download. After the user makes selections, DEDUCE runs the query and shows counts of the number of encounters, patients, and subject area observations returned. Known as the “kiss or cry” step during training, this stage is where users can quickly evaluate the volume of the results and can return to the filter selection steps if they wish to adjust their parameters. With appropriate IRB or QI authorization, the user may from here download or receive by e-mail an extract of the raw data in addition to the aggregate report. The data records will reflect the subject area chosen. For example, an extract created from a query of the labs subject area will list a unique returned lab result for each row. All extracts provide a basic set of encounter and demographics information in additional columns. This facilitates combining extracts from multiple queries outside of the DEDUCE environment to answer questions such as: “how many patients received medication X and were assigned diagnosis Y during the same encounter?”
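
To make the last point concrete, the sketch below combines a medication extract and a diagnosis extract outside of DEDUCE to answer the medication X and diagnosis Y question. The column names (patient_id, encounter_id, medication, icd9_code), the patients_with_both helper, and all rows are made up for illustration; actual extract layouts may differ.

```python
def patients_with_both(med_rows, dx_rows, medication, diagnosis):
    """Count distinct patients with the medication and diagnosis in one encounter."""
    # Encounters in which the medication of interest was ordered at least once.
    med_encounters = {
        row["encounter_id"] for row in med_rows if row["medication"] == medication
    }
    # Patients whose diagnosis row falls within one of those encounters.
    matched_patients = {
        row["patient_id"]
        for row in dx_rows
        if row["icd9_code"] == diagnosis and row["encounter_id"] in med_encounters
    }
    return len(matched_patients)


# Made-up medication extract: one row per order.
med_extract = [
    {"patient_id": "P1", "encounter_id": "E1", "medication": "heparin"},
    {"patient_id": "P1", "encounter_id": "E1", "medication": "heparin"},
    {"patient_id": "P2", "encounter_id": "E2", "medication": "heparin"},
    {"patient_id": "P3", "encounter_id": "E4", "medication": "aspirin"},
]
# Made-up diagnosis extract: one row per assigned code.
dx_extract = [
    {"patient_id": "P1", "encounter_id": "E1", "icd9_code": "453.40"},
    {"patient_id": "P2", "encounter_id": "E3", "icd9_code": "453.40"},
    {"patient_id": "P3", "encounter_id": "E4", "icd9_code": "453.40"},
]

count = patients_with_both(med_extract, dx_extract, "heparin", "453.40")
```

In this example only patient P1 qualifies: P2 has the diagnosis in a different encounter, and P3 received a different medication. Working with sets of encounter and patient IDs, rather than joining rows directly, keeps repeated orders from inflating the count.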

The aggregate report is a single document available as PDF, Excel, or HTML (according to user preference) with multiple pages that represent different sections. If the user does not wish to wait for the report to render, a Web link allows results to be delivered via e-mail. The first two report pages list the filter criteria. Subsequent pages show profiles of the population created, as follows: a) subject area counts by subject area item code, admit (or discharge) service (or location); b) subject area count by subject area item, gender, and race; c) patient count by demographics; d) encounter count by race and discharge date; and e) encounter count by sex and discharge date. Although all users can render these sections, check boxes are available to display only those sections relevant to their particular needs. Figure 3 shows selected graphical output spanning these report pages. The section of patient counts by demographics is most often used by clinical researchers seeking customary population statistics that are often requested as part of grant applications in order to prove adequate sample size. Subject area count reports are used mostly by QI personnel who are trying to characterize and understand the frequency of healthcare processes, such as the ordering of a drug by nursing unit.

Figure 3
Aggregate report output from DEDUCE Guided Query (GQ)

5.2. Performance

Either a clinical research or QI user can build a DEDUCE query in 1–2 minutes if they know the exact characteristics they desire. Since users may choose what results they wish to receive, execution time has been enumerated according to three stages: 1) HTML results of the number of patient, encounter, and subject area records; 2) execution of the entire aggregate report in PDF form; and 3) export of the raw data set. Execution length varies with the number of selected parameters, the number of concurrent users, the type of result requested, and the volume of data within each subject area. Query length, as measured during regular mid-day business hours for a number of scenarios, is shown in Table 4. Typically, creation of the aggregate PDF report takes the longest, as this involves the largest number of SQL select statements on behalf of the BI tool for data aggregation.

Table 4
Representative query execution times for the DEDUCE Guided Query tool

5.3. Utilization and training

The DEDUCE GQ user base includes clinicians, researchers, clinical research coordinators, data analysts, and QI personnel. From August 2008 to August 2009, 241 users were trained by a DEDUCE analyst, either in ad hoc or formal group instruction sessions. During this same period, 347, 87, and 123 logins were recorded for the QI, Review Preparatory to Research, and IRB roles, respectively. Given the growing user base, future training will be held in scheduled classroom sessions. Online training with suggested exercises is also offered for those wishing to learn DEDUCE GQ at their leisure. A comprehensive user guide and video tutorials are provided at the DEDUCE Web site. The DEDUCE team also maintains a subscribable electronic message board containing tips, downtime notifications, user-to-user help, and technical support.

5.4. Cost

DEDUCE was built using resources supporting the existing DUHS data warehouse, whose development and maintenance entails considerable health system expenditures. But because a well-maintained warehouse is in place, the cost of building the GQ tool (which required no new hardware or software) was relatively small. Once developed, the projected fully loaded cost for maintenance (i.e., ongoing user training, user support, account management, bug fixes, and minor system upgrades) is $50,500/year. Given that we already possessed a sophisticated data warehouse and that only relatively minor investment of capital is needed to support this tool, DEDUCE is planned to be a sustained data analytics service of the data warehouse team for the foreseeable future.

5.5. Use cases

One of the many uses supported by the DEDUCE GQ is to help researchers prepare grant applications by gathering data that demonstrate study feasibility. For example, DEDUCE has been used to support a DUHS application to the Agency for Healthcare Research and Quality (AHRQ) titled “Cellulitis Abscess Management in the Era of Resistant Antibiotics (CAMERA),” which was funded in 2009. The GQ tool was used to find all DUMC outpatients who had been assigned one of 5 ICD9 diagnosis or CPT procedure codes indicative of ongoing soft tissue infections. The GQ aggregate report, which shows total demographics as a function of specific diagnosis or procedure code, was used to provide preliminary data and ultimately to demonstrate the ability to contact these patients upon IRB approval.

A second common use for the DEDUCE GQ is to obtain data for QI efforts, such as evaluating the effects of implementing new HIT at Duke University Hospital. In February of 2009, several QI specialists investigated whether the release of a venous thromboembolism prophylaxis advisor within the CPOE interface led to an increase in ordering of items (e.g., medications, compression devices) that can help prevent thrombotic events such as blood clots. The DEDUCE GQ was used to obtain three separate extracts containing ICD9 categories, medication orders, and CPOE orders for a specific inpatient cohort across three technology deployment phases. The resulting data were analyzed offline in SAS 9.0 (SAS Institute, Cary, NC) and used to demonstrate that prophylaxis ordering had indeed increased since advisor release [27]. Efforts are underway to determine whether outcomes improved in patients who received such preventative care.

6. Discussion

6.1. Advantages of a research portal

One of the most significant benefits realized with the DEDUCE tool is the fact that data requests are now “self-service” for many researchers and QI personnel. Data extraction at the healthcare observation level will become increasingly important to comply with Recovery Act mandates. One challenge presented by traditional methods of individual data warehouse requests is that extract formats may differ and be cleaned in ways that complicate options for future data gathering. Data extracts from DEDUCE are provided in a harmonized manner that allows users to return and collect additional data on the same patients or encounters at a later point. DEDUCE also adds a data-driven research paradigm to current hypothesis-based research methods. Exploratory analyses can be conducted without completely defining the research question as the first step. This may lead to new avenues of investigation or, alternatively, quickly halt studies that are underpowered or otherwise impracticable. Another powerful feature of the DEDUCE conceptual design is its modular data organization, which affords a “window” into the data warehouse and ensures scalability as future subject areas are added.

6.2. Lessons learned

The biggest risk in allowing researchers to obtain their own data extracts is that they may misunderstand basic properties of archived clinical data and use them inappropriately in their studies. Clinicians are often patient-centered in their thinking and expect data extracts wherein each row represents a different patient. Quality improvement personnel, on the other hand, are transaction-centered and might expect each row to represent a different observation, such as a lab result. Our experience suggests that this issue must be strongly emphasized during the training process by underscoring that multiple healthcare actions may occur at different timepoints within a single patient encounter.
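
The difference between the two row granularities can be illustrated with a minimal sketch. The column names (`patient_id`, `encounter_id`, `lab_name`, `value`) and all rows below are invented for illustration and do not reflect the actual DEDUCE extract layout; the summary columns are likewise only one possible choice.

```python
from collections import defaultdict

# Hypothetical observation-level extract: one row per lab result.
# Column names and values are illustrative, not the DEDUCE schema.
observation_rows = [
    {"patient_id": "P1", "encounter_id": "E1", "lab_name": "glucose", "value": 110},
    {"patient_id": "P1", "encounter_id": "E1", "lab_name": "glucose", "value": 145},
    {"patient_id": "P1", "encounter_id": "E2", "lab_name": "glucose", "value": 98},
    {"patient_id": "P2", "encounter_id": "E3", "lab_name": "glucose", "value": 132},
]


def to_patient_level(rows):
    """Collapse observation-level rows into one summary row per patient."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patient_id"]].append(row)
    summary = []
    for patient_id, obs in sorted(grouped.items()):
        summary.append({
            "patient_id": patient_id,
            "n_encounters": len({o["encounter_id"] for o in obs}),
            "n_observations": len(obs),
            "max_value": max(o["value"] for o in obs),
        })
    return summary


patient_rows = to_patient_level(observation_rows)
```

Here the observation-level extract has four rows while the patient-level view has two, so counting rows in the former would overstate the number of patients.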

Beyond this, we feel that an effective strategy for introducing researchers to the complexity of clinical data query activities includes providing a suite of tools tailored to different user levels; within this framework, GQ represents our first such tool. In GQ, we do not allow users to export results covering multiple subject domains in the same file. Instead, they must perform the queries separately and join the files outside of DEDUCE in a manner appropriate to their research question; the appropriate patient and encounter IDs are provided for this purpose. Automatic combination of two data extracts from separate subject areas creates a Cartesian product in which the query returns all possible row combinations between the data observations from each subject area. It is thus far too easy for the novice user to arrive at inaccurate, non-unique counts of encounters, patients, or observations. Future environments (see the Future work section) could allow advanced users to specify distinct filtering criteria during the query and extraction steps, thereby avoiding the creation of Cartesian products. Instituting such a “compound query” at this stage, however, was at odds with the mandate for simplicity in this tool. We expect that our long-term, multi-environment approach to DEDUCE, with GQ at the center, will constitute an effective strategy for reducing the likelihood of foundational misanalysis by the user base.
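
The row inflation described above can be reproduced on a toy scale. The following sketch uses an in-memory SQLite database purely as a stand-in; the table names, column names, and rows are assumptions made for illustration and are not the DEDUCE data model. It joins two subject areas on a shared encounter key, so the combinations multiply within each encounter.

```python
import sqlite3

# Toy stand-in for two subject areas; names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE med_orders (encounter_id TEXT, medication TEXT);
    CREATE TABLE lab_results (encounter_id TEXT, lab_name TEXT, value REAL);
    INSERT INTO med_orders VALUES
        ('E1', 'heparin'), ('E1', 'enoxaparin'), ('E2', 'heparin');
    INSERT INTO lab_results VALUES
        ('E1', 'platelets', 210), ('E1', 'platelets', 190),
        ('E1', 'platelets', 175), ('E2', 'platelets', 230);
""")

# Naive combination: each medication row pairs with every lab row
# of the same encounter, so E1 contributes 2 x 3 = 6 rows.
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM med_orders m "
    "JOIN lab_results l ON m.encounter_id = l.encounter_id"
).fetchone()[0]

# The number of lab observations that actually exist.
true_lab_rows = conn.execute("SELECT COUNT(*) FROM lab_results").fetchone()[0]

# Counting distinct keys recovers the real number of encounters.
distinct_encounters = conn.execute(
    "SELECT COUNT(DISTINCT m.encounter_id) FROM med_orders m "
    "JOIN lab_results l ON m.encounter_id = l.encounter_id"
).fetchone()[0]
conn.close()
```

In this toy case the combined result has seven rows, although only four lab observations and two encounters exist; a user who counted rows in the combined file would overstate both.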

Unavoidable analysis-confounding data eccentricities should be made clear in multiple, redundant ways. For example, researchers are constantly reminded of whether the data in a report are patient- or observation-level by use of a leading summary table and reiteration of patient counts on each report page. Even if the principal design philosophy of DEDUCE is understood by users, consensus taxonomies, classifications, and standards remain essential to encourage user acceptance as well as meaningful interoperability among subject areas.

Several cultural issues emerged during development of the DEDUCE GQ tool. First, we had to overcome translational barriers that often exist between clinicians and technical staff. Database queries require an attention to technical detail unfamiliar to many clinicians, who must often be “coached” to translate their research question into a query that fully addresses the scope of their research problem. Without such specificity, accurate and complete results might not be obtained. Additionally, customary data formats used by technical analysts are often not satisfactory for researchers, which can lead to mutual frustration. When developing tools such as DEDUCE, it is critical to remain mindful of differences in culture and customary terminology between the clinicians requesting data and the technicians providing it. As a result, we recognized the need for additional translational data architects within our organization who can span both the clinical and technical domains [28].

Development of DEDUCE could not have occurred without including IRB advisors at all stages. Other developers of similar tools have raised the issue that persons performing data searches restricted to de-identified data from a rare patient cohort could technically be capable of identifying individuals, even if the researcher lacks access to PHI [19]. However, our IRB liaison assured us that the combination of a blanket DEDUCE IRB protocol and approval of the individual user's study protocol accommodated this “small numbers” scenario.

6.3. Limitations

There are several limitations to the DEDUCE GQ tool. First, use of Cognos BI (the engine of DEDUCE) requires precise browser settings and wired Ethernet access to ensure satisfactory performance. Second, because we used an integrated (not federated) model for querying data, all DEDUCE data elements must first have a place in the data warehouse. Adding translational data (e.g., from a different subject domain, such as genomics) would require integration methods tailored to our data warehouse and not extensible to other settings. In addition, ETL loads into the data warehouse are not performed in real time, and certain patient data (such as CPOE orders) often may not be fully loaded until after patient discharge. Although this approach improves query performance, recent encounter data may be incomplete, and users should not use DEDUCE to retrieve data from date ranges that extend into the most recent month.
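
A user-side guard against querying the incompletely loaded window might look like the following sketch. The function name `clamp_end_date` is hypothetical, and treating "1 month" as 31 days is our assumption for illustration; no such check is described as part of DEDUCE itself.

```python
from datetime import date, timedelta

# Assumption: "1 month" is approximated as 31 days for this sketch.
LOAD_LAG = timedelta(days=31)


def clamp_end_date(requested_end, today):
    """Return the latest end date that stays outside the recent window
    in which warehouse loads may still be incomplete."""
    cutoff = today - LOAD_LAG
    return min(requested_end, cutoff)


# A request reaching into the recent window is pulled back to the cutoff;
# an older request is left unchanged.
recent = clamp_end_date(date(2010, 3, 20), today=date(2010, 4, 1))
older = clamp_end_date(date(2010, 1, 15), today=date(2010, 4, 1))
```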

At present, the DEDUCE developers lack a mechanism to ensure that researchers or QI specialists query only the subject areas authorized for that particular access. For this reason, users must accede to the terms of the Data Usage Agreement at each login, which the DUHS IRB considers an acceptable solution. Under this scenario, the risk of misuse of access is no greater than that inherent in trusting an investigator to examine only the permitted data points in manual paper chart review.

DEDUCE has some usability features that distinguish it from competing portal designs such as RPDR or STRIDE. For instance, DEDUCE lacks a drag-and-drop “feel.” Inappropriate use of the browser “back” button may confuse the application, ultimately yielding inappropriate results. In addition, there is no “undo” function, meaning that users must navigate back to the original GQ prompt page to change a query. Further, once data are exported, filter conditions cannot be saved for future GQ sessions.

A known limitation of the GQ is its inability to create cohorts that combine subject areas, such as: “How many patients with diagnosis A have had procedure B?” However, more sophisticated cohorts can be created by combining individual exports from multiple subject areas: the internal data warehouse patient and encounter keys provided in most extracts allow users to join data offline in other programs. This does not diminish the utility of the wizard interface, as a user may be highly skilled in analysis but not in the techniques needed for data retrieval.
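
As a minimal sketch of such an offline join, the question “How many patients with diagnosis A have had procedure B?” reduces to intersecting the patient keys found in two separate exports. The file contents, the column names (`patient_key`, `encounter_key`, `diagnosis_code`, `procedure_code`), and the placeholder codes `DX_A` and `PROC_B` are all hypothetical; real exports would be read from disk rather than from in-memory strings.

```python
import csv
import io

# Stand-ins for two separately exported files (hypothetical layout).
diagnosis_export = io.StringIO(
    "patient_key,encounter_key,diagnosis_code\n"
    "101,9001,DX_A\n"
    "102,9002,DX_A\n"
    "103,9003,DX_OTHER\n"
)
procedure_export = io.StringIO(
    "patient_key,encounter_key,procedure_code\n"
    "101,9001,PROC_B\n"
    "103,9004,PROC_B\n"
    "104,9005,PROC_B\n"
)


def patient_keys(export_file, column, wanted_value):
    """Collect the distinct patient keys whose rows match wanted_value."""
    return {
        row["patient_key"]
        for row in csv.DictReader(export_file)
        if row[column] == wanted_value
    }


with_diagnosis_a = patient_keys(diagnosis_export, "diagnosis_code", "DX_A")
with_procedure_b = patient_keys(procedure_export, "procedure_code", "PROC_B")

# Patients present in both exports form the combined cohort.
cohort = with_diagnosis_a & with_procedure_b
```

Working with sets of keys, rather than merging the rows directly, keeps the patient count unique regardless of how many observations each patient has in either export.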

Currently, only DUMC has the full subject areas available, but data from community hospitals will become available during 2010. DEDUCE does not yet contain clinical information present in text-based notes, pathology reports, radiology reports, or ECG images.

6.4. Future work

The GQ tool currently offers a relatively rigid query structure in order to minimize user mistakes. Future plans for DEDUCE include additional, complementary query environments for more experienced users. We will allow advanced users to build queries spanning multiple subject areas by allowing the search filter and extraction parameters to remain distinct. For example, a user could filter for all encounters that received insulin medication orders and, from there, extract only blood glucose lab results for those insulin-medicated encounters. We are currently conducting beta testing on a compound-query DEDUCE tool in which users will be able to create and save highly defined cohorts through the use of set operators (including logical operators for items such as lab values). We are also working on a third tool aimed at clinical trial recruitment, which will permit users to enter a set of query criteria and receive real-time notification when a potential trial recruit fits those parameters. Within the current DEDUCE GQ feature set, we will look for opportunities to enhance existing search methods, such as adding LOINC codes from Cerner Millennium (Kansas City, MO) for DUHS lab tests, as well as adding a commercial data dictionary housing technical and business definitions.
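
The idea of keeping the search filter and the extraction parameters distinct can be sketched with the insulin example. This is only an illustration of the concept under invented data; the function `compound_query`, the column names, and the rows are hypothetical and do not describe the compound-query tool under beta testing.

```python
# Hypothetical rows from two subject areas, keyed by encounter.
medication_orders = [
    {"encounter_id": "E1", "medication": "insulin"},
    {"encounter_id": "E2", "medication": "metformin"},
    {"encounter_id": "E3", "medication": "insulin"},
]
lab_results = [
    {"encounter_id": "E1", "lab_name": "blood glucose", "value": 180},
    {"encounter_id": "E1", "lab_name": "potassium", "value": 4.1},
    {"encounter_id": "E2", "lab_name": "blood glucose", "value": 105},
    {"encounter_id": "E3", "lab_name": "blood glucose", "value": 220},
    {"encounter_id": "E3", "lab_name": "blood glucose", "value": 160},
]


def compound_query(filter_rows, filter_pred, extract_rows, extract_pred):
    """Select encounters with one subject area, extract rows from another."""
    # Filter step: decide which encounters belong to the cohort.
    cohort = {r["encounter_id"] for r in filter_rows if filter_pred(r)}
    # Extraction step: keep only rows for cohort encounters. Each lab row
    # appears at most once, so no row combinations are generated.
    return [
        r for r in extract_rows
        if r["encounter_id"] in cohort and extract_pred(r)
    ]


glucose_for_insulin = compound_query(
    medication_orders, lambda r: r["medication"] == "insulin",
    lab_results, lambda r: r["lab_name"] == "blood glucose",
)
```

Because the medication rows are used only to decide cohort membership, the extract contains exactly the three blood glucose results for encounters E1 and E3.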

7. Conclusion

We believe that careful implementation of HIT and BI software can help U.S. healthcare stakeholders transform unwieldy data stockpiles into knowledge that can improve patient outcomes, increase safety, and enhance quality. Although numerous reports have described platforms for integrating, querying, and extracting biomedical information domains in support of translational research, systems that are accessible by any health system researcher while simultaneously serving QI needs have not been formally reported. We developed a scalable research portal to leverage the overwhelming stores of clinical data archived across our organization. The DEDUCE GQ creates a “window” into the data warehouse that allows researchers to explore whether cohorts that could support their research activities exist, evaluate whether they have sufficient power to test a hypothesis, and examine raw data in granular extracts. These advances are essential in light of ongoing policy development spurred by the Recovery Act and are vital to ensuring that medical practice is informed by evidence-based research and not misguided impressions.

8. Acknowledgements

The authors thank the DUHS Data Warehouse Group, Robert Califf, Ronald Goldberg, David Tanaka, Lloyd Michener, John Harrelson, Mike Cuffe, Jimmy Tcheng, and Lawrence Muhlbaier for guidance in the design of the DEDUCE Guided Query tool. The authors also thank Jonathan McCall of the Duke Clinical Research Institute for editorial assistance with this manuscript.

This publication was made possible by Grant Number UL1RR024128 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH.


9. References

[1] Corrigan JM, editor. Institute of Medicine. Crossing the quality chasm: A new health care system for the 21st Century. National Academy Press; Washington, DC: 2001.
[2] Leape LL, Berwick DM. Five years after To Err Is Human: what have we learned? JAMA. 2005;293:2384–90.
[3] Stelfox HT, Palmisani S, Scurlock C, Orav EJ, Bates DW. The “To Err is Human” report and the patient safety literature. Qual Saf Health Care. 2006;15:174–8.
[4] Prokosch HU, Ganslandt T. Perspectives for medical informatics. Reusing the electronic medical record for clinical research. Methods Inf Med. 2009;48:38–44.
[5] Steinbrook R. Health care and the American Recovery and Reinvestment Act. N Engl J Med. 2009;360:1057–60.
[6] Institute of Medicine. Initial national priorities for comparative effectiveness research. National Academy Press; Washington, DC: 2009.
[7] Embi PJ, Payne PR. Clinical research informatics: challenges, opportunities and definition for an emerging domain. J Am Med Inform Assoc. 2009;16:316–27.
[8] Wang X, Liu L, Fackenthal J, Cummings S, Olopade OI, Hope K, Silverstein JC. Translational integrity and continuity: personalized biomedical data integration. J Biomed Inform. 2009;42:100–12.
[9] Scheuner MT, de Vries H, Kim B, Meili RC, Olmstead SH, Teleki S. Are electronic health records ready for genomic medicine? Genet Med. 2009;11:510–7.
[10] Mathew JP, Taylor BS, Bader GD, Pyarajan S, Antoniotti M, Chinnaiyan AM, Sander C, Burakoff SJ, Mishra B. From bytes to bedside: data integration and computational biology for translational cancer research. PLoS Comput Biol. 2007;3:e12.
[11] Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377–81.
[12] Viangteeravat T, Brooks IM, Smith EJ, Furlotte N, Vuthipadadon S, Reynolds R, McDonald CS. Slim-prim: a biomedical informatics database to promote translational research. Perspect Health Inf Manag. 2009;6:6.
[13] Hibbert M, Gibbs P, O'Brien T, Colman P, Merriel R, Rafael N, Georgeff M. The Molecular Medicine Informatics Model (MMIM). Stud Health Technol Inform. 2007;126:77–86.
[14] Murphy SN, Mendis M, Hackett K, Kuttan R, Pan W, Phillips LC, Gainer V, Berkowicz D, Glaser JP, Kohane I, Chueh HC. Architecture of the open-source clinical research chart from Informatics for Integrating Biology and the Bedside. AMIA Annu Symp Proc. 2007:548–52.
[15] Murphy SN, Mendis ME, Berkowitz DA, Kohane I, Chueh HC. Integration of clinical and genetic data in the i2b2 architecture. AMIA Annu Symp Proc. 2006:1040.
[16] Harris PA. Vanderbilt University's StarBRITE Researcher Portal. Integrating Informatics and Medicine Graduate School of Information Science in Health First Invited Symposium. Munich Technical University; Spitzingsee, Germany: Jul 25–28, 2009.
[17] MacMillan L. Clinical records rich source for research. The Reporter. vol. 4/10/2009. Vanderbilt Medical Center Office of News and Public Affairs; Nashville: 2009.
[18] Lowe HJ, Ferris TA, Hernandez PM, Weber SC. STRIDE--An integrated standards-based translational research informatics platform. AMIA Annu Symp Proc. 2009;2009:391–5.
[19] Murphy SN, Chueh HC. A security architecture for query tools used to access large biomedical databases. Proc AMIA Symp. 2002:552–6.
[20] Murphy SN, Morgan MM, Barnett GO, Chueh HC. Optimizing healthcare research data warehouse design through past COSTAR query analysis. Proc AMIA Symp. 1999:892–6.
[21] Weber GM, Murphy SN, McMurry AJ, Macfadden D, Nigrin DJ, Churchill S, Kohane IS. The Shared Health Research Information Network (SHRINE): A prototype federated query tool for clinical data repositories. J Am Med Inform Assoc. 2009;16:624–30.
[22] Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, Kohane I. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–30.
[23] Nelson EC, Splaine ME, Plume SK, Batalden P. Good measurement for good improvement work. Qual Manag Health Care. 2004;13:1–16.
[24] Bellika JG, Sue H, Bird L, Goodchild A, Hasvold T, Hartvigsen G. Properties of a federated epidemiology query system. Int J Med Inform. 2007;76:664–76.
[25] Deshmukh VG, Meystre SM, Mitchell JA. Evaluating the informatics for integrating biology and the bedside system for clinical research. BMC Med Res Methodol. 2009;9:70.
[26] Larman C. Agile and iterative development: a manager's guide. Pearson Education, Inc; Boston, MA: 2004.
[27] Stashenko G, Krichman A, Ferranti J, Russell M, Tcheng J, Tapson V. Adding decision support to computerized provider order entry improves venous thromboembolism prophylaxis rates. Chest. 2009;136:146S–b-147.
[28] Horvath MM, Cozart H, Ahmad A, Langman MK, Ferranti J. Sharing adverse event data using business intelligence technology. J Pat Safety. 2009;5:35–41.