The i2b2 project allows the data to move back and forth between these two use cases. Data about a set of patients (metadata), is copied from the i2b2 enterprise database and placed into an i2b2 project database with the same data format and with the same data descriptors. Data that was collected independently of the electronic medical record can also be imported into the project database with its own data descriptors. In an i2b2 project database, new data can be derived from the raw EMR data in a multitude of different ways, such as through natural language processing, resulting in new, derived observations being created within the database.15
Often the new observations are more accurate or precise than the original raw observations, and can become the terms used in new queries and further data manipulations. The nature of the data schema and controlled terminologies allows the derived data and metadata tables to potentially (data ownership issues aside) be appended from a project database back into the enterprise tables.
An i2b2 “hive” is a set of server-side software modules (“cells”) that either store data or contain analysis methods that facilitate the repurposing of medical record data for research.13
The i2b2 hive manages both the enterprise-wide data and the distribution and access of data associated with the various projects that may have originated from cuts of the enterprise data. The data is generally loaded into the enterprise system using methods that take raw data from a hospital electronic medical record transaction system and place it into the CRC of the i2b2 hive. The feeding transaction systems may be the hospitals' registration and laboratory systems, provider-based medical record systems, or specially built electronic data capture systems. The implementation of the CRC currently supports Oracle and Microsoft SQL Server databases. These were chosen not only because of their industrial strength scalability, but also because many extract, transform, and load (ETL) tools exist to support these platforms. These tools are used to extract data from the clinical systems and place copies of it into the CRC. The core team has not developed connections to load data from specific EMR implementations, but teams outside i2b2 have been active on this front (M Kamerick, personal communication, 2009).
The infrastructure created by the i2b2 software allows investigators to perform queries on an enterprise data repository. This infrastructure consists of several cells of an i2b2 hive. The project management cell contains a set of data structures that associate users with passwords, IRB approvals, preferences, and (as created) new projects. When a user “logs on” to the i2b2 hive, they are using web services of the project management cell to authenticate themselves. Every time another part of the hive tries to perform an action on behalf of the user, it goes to the project management cell to gather the proper authorizations. Once authenticated, the user performs queries against a second cell of the hive, known as the data repository cell which contains the CRC of the enterprise. Each row of the main fact table in the CRC star schema database is associated with a term in another cell, called the “ontology” cell. This allows every term that is used to describe a patient in the enterprise CRC database to be used in a query.
Using the i2b2 web client shown in allows ad-hoc queries to be created by research clinicians throughout the enterprise and return aggregate numbers of patients that satisfy the queries. The terms used to create the queries are those in the data repository that are associated with patient observations. As shown in , metadata items are dragged from the “Navigate Terms” and “Find Terms” panels on the left (shown in the double lined box) and a query is created in the Venn-diagram-like panels (in the broken line box) on the right. If items are dragged into the same panel, they are logically OR'd together, and if they are dragged into different panels they are logically AND'd together. A similar interface has been described previously in detail.16
Other types of query tools are being developed by various groups to plug in to the i2b2 data repository cell. Of particular interest is a temporal alignment query tool that directs temporal queries of the clinical data.17
Figure 1 The i2b2 web client is used as the interface for the enterprise users to construct queries. Patient attributes are dragged from the “Terms” panels into the “Query Tool” panels, and the patient sets that result after running (more ...)
Using the i2b2 web client, patient sets may be created and aggregate counts on the demographics of these patients can be obtained. However, it only shows anonymous patient data, obfuscating the true results by adding or subtracting a small random number to the aggregate totals using a previously published method.18
This allows aggregate counts of the patient data to be displayed without the risk of disclosing the identities of the patients. The lists of patients that make up the patient sets are saved in the background, and line-item patient detail at the limited data set (LDS) level is available once data is extracted from the enterprise repository and placed into a specific project's data mart.
The process of creating a project's data mart is shown in . The investigator begins by selecting an approval (such as a recorded IRB approval) that allows them to create a project and include various kinds of patient detail. Restrictions may also be identified, such as restricting data to that of a single hospital. People who should be granted access to the data mart are then selected. Queries that were created in the broken line box of the i2b2 web client were saved as “previous queries”, and are now used to designate the patients for the new project (the “cases”). Previous queries can also designate patients to exclude from the project. Generally, a set of “cases” will be desired to have a matched set of “controls” in the data mart. A pool of patients that could be selected as matched controls can be selected from previous queries as well. Patient are usually matched by age, gender, race/ethnicity, and the number of observations in the medical record, or as closely as possible. Matching by the number of observations (facts) gives a rough approximation to hospital activity. Attempting to obtain an assessment of the match, we compare ranking of concepts in the two groups. The top ten diagnoses (not related to original selection), consistently give Spearman rank-order correlations of the two sets greater than 0.9.
Figure 2 Sequential process for creating a project's data mart. The patient sets that were created in the query tool can be used in many places where the request is formulated, seen in the broken line boxes. The patient attribute terms from the i2b2 web client (more ...)
Once the patients have been identified, the data to be included (and perhaps some to be specifically excluded) in the project data mart is selected from the available terms in the double lined box of the i2b2 web client. Proper authorizations to create the project are checked, and data can now be copied from the Enterprise data repository into the project data mart. This creates a persistent set of data for the project.
The new project data mart represents the next step in the data curation process. i2b2 software supports the data curation workflow by allowing the creation of new “observation-facts” in the fact table of the data mart, such as concepts identified through natural language processing (NLP). This allows a project to build up a catalog of new, more refined observations which represent progressive improvements in the medical record data. Illustrated in is the i2b2 Java “Workbench”, which is based on the open-source Eclipse platform.19
This platform was chosen due to its popularity among developers and its well defined extensibility through plug-ins that are based on an OSGi-standard compliant implementation (http://www.osgi.org/
). New terms generated from derived data (such as that resulting from chart reviews) are made available in the “Navigate Terms” window. This client extends the functionality available through the i2b2 web client, especially because it is known that the investigator has explicit IRB approval to view line-item patient detail. Even so, public health information (PHI) that is part of the 16 identifiers prohibited by limited data sets is either maintained in an unencrypted form in a separate repository, or encrypted in the data mart. To view this data, a decryption key is issued to those members of the project that are authorized to view PHI. Data export and import, multiple data visualizations, and mechanisms to validate newly derived NLP data are featured in the i2b2 Workbench. Various analysis types are available, and shown is a timeline view which plots the observations as bars in categories and according to times when they were observed. Clicking on the elements of the timeline can result in further detail, such as the Text Analyzer panel that shows how the “Erosion” term was derived form the Radiology Report through NLP.
Figure 3 The i2b2 Eclipse Workbench is generally used as the interface for working with specific project databases. It offers views for exporting and importing data to the project database, as well as views focused on specialized analysis. The i2b2 software architecture (more ...)
All of the operations that have been described rely on the highly flexible i2b2 patient data model, which at its core has a very simple design as seen in . The central fact table contains rows of observations about patients, with key attributes of patient ids, event ids, concept codes (possibly with associated modifier codes), observer codes, and dates associated with the observations. The fact table also contains value objects associated with the observations. The value objects are defined so they may be queried rapidly using relational database technology. This means they should obey at least two constraints. First, they should be expressed in the basic data types available in the database, such as numbers and simple strings. A complex value, such as blood pressure, for example “120/80-standing”, is reduced to three rows with concept “blood pressure” and modifiers “systolic”, “diastolic”, and “position” (preferably expressed as LOINC or other known standard, but this is not required). Second, the value type is specified in the ValType column, such as N=number, T=text, D=date, and so forth. This is a way of specifying the format for a value object in relational database technology. Numbers and dates are placed in the NVal_Num column which is a numeric data type. However, the TVal_Char column, a character data type, is also used in the specification of numeric values, and contains a representation of the operator, such as “equals”, “greater than”, or “less than”. ValueFlag is used to hold the source systems assessment of high/low/normal or normal/abnormal, if available.
Figure 4 All patient data is held in the Star Schema of the data repository. The tables are represented by each box, and the columns are represented by the rows of the box. Columns that are underlined are part of the primary key of the table. Columns that are (more ...)
Besides the fact table, called the Observation_Fact table in the i2b2 data repository, there are four other tables that help express the patient data. The Patient_Dimension table has one row for every patient in the database. This table contains the patient's Birth_Date, Death_Date, and a column Vital_Status_CD that describes if these dates are known, and if so, how accurately these dates are known. Date values are accompanied by an accuracy code to increase the efficiency of performing computations in relational database languages. Other columns in this table are specific to various i2b2 implementations and are not required. They may represent concepts that exist in the fact table, but for operational reasons may need be consolidated into one value per patient, such as current zip code. Business rules are applied to perform this consolidation.
The Visit_Dimension table allows periods to be represented that correspond roughly to patient encounters where observations were recorded. An “encounter” can involve a patient directly, such as a visit to a doctor's office, or it can involve the patient indirectly, such as running several tests tied together by the same tube of the patient's blood.
The Concept_Dimension table has vocabulary terms that map to the original codes used to specify observations made on the patient. The observation_fact table contains the original codes from the source systems with a three character prefix to avoid collisions of codes from various systems. Various standard concepts preloaded into i2b2 example terminologies include International Classification of Diseases (ICD), National Drug Code (NDC), and Logical Observation Identifiers Names and Codes (LOINC). These example terminologies were chosen due to their frequency of use in many source systems and practical accessibility. However, the i2b2 terminology systems do not recognize a difference between standard and local terminologies. Terms may be grouped into hierarchies. The hierarchical representation used in the concept table is similar to that of a hierarchical file system. The parent term is positioned in the “folder” position of the path, and the child term in the “file” position. For example, in the concept_dimension table (), the parent “anti-infectives” can have the three children “penicillin”, “ampicillin”, and “Bactrim”. The children map to the NDC codes used in the fact table, but the path shows they are types of anti-infectives, and most importantly, the path allows group queries to be performed using Structured Query Language (SQL) in the data repository.
Layout of concept data representations in the concept dimension
The organization of concept strings described above allows the choice of hierarchies to dictate groups of concepts used in a query. This allows a simple SQL statement to be created that is easily optimized (and therefore very fast) in relational database systems. The paths allow us to use the general concept of anti-infectives in a query. The query below shows the use of the concept_path column to give all the codes for penicillin, ampicillin, and Bactrim. Therefore, if we wanted to get all patients seen on anti-infectives, we would run the following query:
- Select distinct(patient_num)
- From observation_fact
- Where concept_cd in
- (select concept_cd
- from concept_dimension
- where concept_path like
The path of the concept is used to find and use all concept_cds that fall into the anti-infectives group. If we only wanted to find patients specifically on Bactrim, we would use the same query with the following concept_path:‘Med-V2\anti-infectives\Bactrim\%’
These patterns in SQL queries make it easy to implement automated query builders. Similar functionality using linked lists would require a much more complex implementation. These same patterns are used to create queries from the observer_dimension table. Rows in this table usually represent a clinician, but occasionally may point to a mechanical device, such as a continuous blood pressure monitor.
The database design described both allows new blocks of data to be added without disturbing the integrity of the old data, as well as allowing blocks of data to be copied out of a larger original database to a smaller one. The data about a set of patients can be copied from the i2b2 enterprise database and placed into an i2b2 project database with the same data format and with the same data descriptors while preserving powerful methods for querying the data.
The representations of the coding terminologies from two different systems can be combined, or kept separate depending on the degree of integration (and thus usability) that is required by the site. i2b2 does not provide an automated method to perform this function. If terminology systems are kept apart, the user will need to go to each terminology to perhaps choose very similar concepts to use within a query.