AMIA Annu Symp Proc. 2011; 2011: 994–1003.
Published online 2011 October 22.
PMCID: PMC3243196

Federating Clinical Data from Six Pediatric Hospitals: Process and Initial Results from the PHIS+ Consortium


Integrating clinical data with administrative data across disparate electronic medical record systems will help improve the internal and external validity of comparative effectiveness research. The Pediatric Health Information System (PHIS) currently collects administrative information from 43 pediatric hospital members of the Child Health Corporation of America (CHCA). Members of the Pediatric Research in Inpatient Settings (PRIS) network have partnered with CHCA and the University of Utah Biomedical Informatics Core to create an enhanced version of PHIS that includes clinical data. A specialized version of a data federation architecture from the University of Utah (“FURTHeR”) is being developed to integrate the clinical data from the member hospitals into a common repository (“PHIS+”) that is joined with the existing administrative data. We report here on our process for the first phase of federating lab data, and present initial results.


For the last decade, academic pediatric hospitalists have been conducting comparative effectiveness research on a variety of acute and chronic conditions for which children are hospitalized. These studies have provided important evidence to guide clinical decision-making and care delivery in the nascent field of pediatric hospital medicine. However, the research has been hampered by several factors including (a) difficulty assembling sufficiently large cohorts to study rare conditions in pediatrics, (b) inability to adjust for severity of illness in observational studies, (c) lack of availability of important, reliable, and measurable outcomes, and (d) lack of availability of clinical data to validate and strengthen the results derived from administrative data. These limitations, especially the last one, have made it difficult for otherwise sound comparative effectiveness research (CER) studies in pediatric hospital medicine to achieve the quality of evidence needed to change practice nationwide.

To address these limitations, the Pediatric Research in Inpatient Settings (PRIS) network and the Child Health Corporation of America (CHCA) partnered to develop the infrastructure and improve the methodology for prospectively collecting data from electronic clinical databases at CHCA member hospitals. With funding from the Agency for Healthcare Research and Quality (PROSPECT studies1), CHCA and PRIS initiated a project to enhance CHCA’s existing electronic database of detailed administrative data, the Pediatric Health Information System (PHIS), with laboratory and radiology results from six children’s hospitals in order to create a more complete database, called “PHIS+”. The objectives of the three-year grant are to build a robust infrastructure and to conduct four comparative effectiveness research studies of healthcare interventions for hospitalized children. This effort will establish the infrastructure for the eventual electronic delivery of clinical data from the 43 children’s hospitals nationwide that currently submit comprehensive and longitudinal administrative data to CHCA’s PHIS database. We begin this effort by federating standard lab data from the six hospitals.

Overview of PHIS

PHIS is a comprehensive pediatric database that was created by CHCA for member hospitals. It contains clinical and financial details of more than six million patient cases. Since 1999, PHIS has collected data on 20.5 million patient encounters. Children included in the database represent the full spectrum of ages, races, ethnicities, and geographic regions in the United States. The PHIS database contains diagnosis and procedure codes and billed transaction/utilization data of inpatient and select outpatient (emergency department, observational, and ambulatory surgery) encounters from the 43 PHIS hospitals. Member hospitals represent 17 of the 20 major metropolitan areas across the United States. Updated quarterly, data are readily available for PHIS users on encounters as early as January 1, 1992. Each member hospital has direct access to PHIS data via an online report query tool.

Overview of PHIS+ Project

Our objective is to build on the existing strong and sustainable infrastructure at CHCA to augment PHIS with laboratory and radiology data for children seen in the ambulatory and inpatient departments of 6 large children’s hospitals. We chose to augment the administrative database in PHIS with laboratory and radiology data based on our collective experience with the limitations of performing comparative effectiveness research with purely administrative data, as well as evidence from the literature that the addition of these clinical data significantly enhances the internal and external validity of CER2,3. The PRIS-member hospitals collaborating on this project (see Table 1) are national leaders in the use of electronic medical records and will be providing clinical data collected as part of routine clinical care. These hospitals have partnered with biomedical informatics researchers from the University of Utah, who have developed a platform, called “FURTHeR”, for federating heterogeneous data from multiple data sources.

Table 1:
Children’s hospitals participating in the PHIS+ project.

The Federated Utah Research and Translational Health electronic Repository (FURTHeR)

To support the PHIS+ data translation and federation requirements, we utilized computational resources developed as part of the University of Utah’s Clinical and Translational Science Award4 (CTSA) Biomedical Informatics Core (BMIC) infrastructure. This infrastructure, the Federated Utah Research and Translational Health electronic Repository57 (FURTHeR), was developed to integrate health information from heterogeneous data sources in order to support syntactic and semantic data interoperability for clinical and translational research purposes. Utilizing both real-time terminology and data model translation services, FURTHeR is able to map from a local instance of a clinical data record to a central, standardized terminology and data model. The system already supported standard lab data translation, utilizing LOINC8 as its standard lab terminology. Two important alterations to the FURTHeR architecture were necessary for PHIS+. (1) FURTHeR’s ability to query local data sources on-the-fly was removed because the six contributing hospitals provide their input data to PHIS+ in predetermined batch files. The query capability was replaced with a data file adapter (a modified version of an existing relational database access adapter) that reads the formatted text batch files supplied by the hospitals. (2) The ability of FURTHeR to store results in a physical database was added. FURTHeR typically aggregates and stores translated query results in a temporary, in-memory database for presentation and analysis by the investigator for the duration of the user’s session. We added software that allows the in-memory database to instantiate a Hibernate object that can be persisted to a physical, JDBC-compliant database. The database type, connection information, and input file format options are read from a simple text configuration file. The alterations to FURTHeR and their actual uses during data translation are described in more detail below.
This new FURTHeR instance will be used in the future by CHCA for on-going translation of subsequent clinical data from the contributing hospitals. BMIC will also produce the initial terminology and model mapping content for the lab data. This process, as well as the ongoing data update and investigator access processes for the system, is shown in Figure 1.
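
As an illustration of the configuration-driven setup described above, the following is a minimal Python sketch of reading a simple text configuration file. FURTHeR itself is a Java application and its actual configuration keys are not described here, so the key names (`db.type`, `db.url`, `input.delimiter`) and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdapterConfig:
    db_type: str     # target database type (hypothetical key: db.type)
    db_url: str      # connection string (hypothetical key: db.url)
    delimiter: str   # input file field delimiter (hypothetical key: input.delimiter)


def parse_config(text: str) -> AdapterConfig:
    """Parse 'key = value' lines; blank lines and '#' comments are ignored."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed config line: {raw!r}")
        values[key.strip()] = value.strip()
    return AdapterConfig(
        db_type=values["db.type"],
        db_url=values["db.url"],
        delimiter=values.get("input.delimiter", "|"),
    )


# Hypothetical settings, for illustration only.
EXAMPLE = """
# target database
db.type = mysql
db.url = jdbc:mysql://localhost/phisplus
input.delimiter = |
"""
config = parse_config(EXAMPLE)
```

Keeping these settings outside the code is what would allow the same translation instance to be pointed at a different database or input format without recompilation.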

Figure 1:
Process flow diagram for the PHIS+ system. The figure shows the creation of the terminology and data model content from hospital sample data. This content is then used in the FURTHeR-CHCA stand-alone instance of the FURTHeR translation services to map ...

PHIS+ Project Organization/Administration

An Oversight Committee (the OC) and an Information Technology Committee (the ITC) share the governance for the PHIS+ project. The OC, chaired by the grant PI (RK), includes clinical investigators from each of the sites as well as participants from University of Utah BMIC and CHCA. The ITC, chaired by the FURTHeR project director (SPN), is made up of informatics experts from each of the hospitals, CHCA and BMIC, as well as two clinical investigators from the OC. Each committee meets bi-weekly. The presence of clinical investigators and informatics experts on both committees ensures ongoing communication and clarity of mission between those who will be using the PHIS+ database and those who are building and maintaining it. A project manager coordinates the overall planning and implementation of the work across all the sites, and each of the sites has a project coordinator to organize site-specific efforts. A Microsoft SharePoint content management system is used to share project documentation. In February 2011, twenty-seven project participants (OC/ITC members and site coordinators) attended the first annual PHIS+ project meeting in Park City, Utah.

Business Agreements, Data Use Agreements, and IRB

As the primary recipient of the PHIS+ grant funding, Children’s Hospital of Philadelphia (CHOP) offered to serve as the IRB of record for the other participating hospitals in order to reduce the administrative burden and duplicative effort of multiple hospital IRBs reviewing a low risk protocol. Hospitals that chose not to use CHOP as the IRB of record submitted the CHOP protocol to their own IRBs for review. Supplementary to the IRB application, several contractual documents were drafted to authorize the exchange of different types of data for the development and use of the PHIS+ database. These documents and the data and entities they govern are outlined in Figure 2 and described in the text below.

Figure 2:
PHIS+ Partners, Data Sets, and Contractual Documents.

Business Associates Agreement (BAA) Between Hospitals and CHCA (1)

In order to facilitate matching of PHIS+ clinical data with corresponding administrative data shared with CHCA through PHIS, hospital clinical data sent to CHCA contain patient identifiers such as medical record number, hospital billing number, and date of service. To authorize the sharing of data with identifiers, a business associates agreement (BAA) was employed between each hospital and CHCA. This BAA was already in place as a result of the PHIS participation of the 6 hospitals.

Data Use Agreement Between CHCA and University of Utah BMIC (2)

CHCA drafted a data use agreement governing the sharing of de-identified hospital clinical data with the University of Utah BMIC. Under the agreement, CHCA sends de-identified clinical data (as limited data sets) to BMIC, which uses the data to test and refine its mapping software. BMIC then sends the mapped results back to CHCA. The only personal identifiers contained in the limited data sets are dates of service. This data use agreement is needed until CHCA assumes responsibility for the mapping of clinical data sent from the hospitals.

Data Use Agreement Between CHCA and Participating Hospitals (3)

CHCA drafted a data use agreement to govern the delivery of PHIS+ data to hospital investigators. After PHIS+ is established, hospitals that want to receive limited data sets for research will sign this separate DUA.

Lab Data Federation Process Description

Site capability analysis

As part of our process, we needed to analyze the current electronic data capabilities of each hospital. In particular, we needed to identify the lab, medical record, and data warehouse systems in use at our six contributing institutions. The site capabilities, which include both vendor and in-house developed systems, are shown in Table 2.

Table 2:
Electronic data sources for the PHIS+ hospitals. The “PHIS+ Lab Datasource” column refers to the source of information for lab data extracted for PHIS+.

Test Selection

One of the first steps in our process was to select the lab tests to include in the PHIS+ repository. Hypothesizing that there would be wide variation between the sites in lab tests performed, we decided to use lab order information already stored in PHIS to determine the overall prevalence of specific tests across the sites and then use this to prioritize labs for PHIS+. CHCA created an ordered frequency list of PHIS inpatient (including ED and ambulatory surgery) lab orders for 2009 (the most recent complete year of PHIS data). Microbiology, pathology and cytogenetics tests were excluded from this phase. The resulting list contained 497 lab orders. (Items were grouped by test order type and did not differentiate by specimen type or other attributes that might normally differentiate lab orders.) Note that PHIS stores lab information by order, while the goal of PHIS+ is to obtain the results of these orders. Because a single order may produce a set of results (for example, a CBC panel test produces multiple individual results such as WBC, Hemoglobin, and Hematocrit), there is not a 1:1 relationship between orders and results. The OC debated how to handle panel tests in the list, knowing that the panel content could be quite different between the sites. It was agreed that we would investigate the top ten individual (vs. panel) test orders on the list as an initial “proving ground” for the PHIS+ data federation process. The list of the top ten tests is shown in Table 3. Each test corresponds to a single analyte that can be measured in a variety of ways. Because of this, the OC further suggested that our initial data collection should be for all results on these particular analytes, regardless of ordering method, lab method or specimen, in order to ensure that we have near-complete coverage for a given analyte.
For example, tests for Sodium could include serum/plasma, blood, urine or other body fluid specimens, and might also include point-in-time and 24-hour collections. The hospitals were directed to find all possible tests performed at their sites for each analyte and report these as separate test result items.
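
The prioritization step described above (rank orders by frequency, set panels aside, keep the top individual tests) can be sketched in a few lines of Python. This is an illustration only: the order names and counts below are made up, and the function name `top_individual_orders` is hypothetical rather than part of any CHCA tooling.

```python
from collections import Counter


def top_individual_orders(orders, panel_names, n=10):
    """Rank non-panel lab orders by frequency and return the n most common."""
    counts = Counter(order for order in orders if order not in panel_names)
    return counts.most_common(n)


# Made-up order stream; real input would be one year of PHIS lab orders.
orders = [
    "Sodium", "CBC with differential", "Urea Nitrogen", "Sodium",
    "Complete Urinalysis", "Sodium", "Urea Nitrogen", "Test X",
]
panels = {"CBC with differential", "Complete Urinalysis"}

ranked = top_individual_orders(orders, panels, n=2)
```

Here `ranked` holds the two most frequent individual orders with their counts; the panels are excluded from the ranking and would be handled separately.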

Table 3:
Top ten PHIS individual lab test orders across the 6 hospitals.

The OC also wanted to include two panel tests, CBC with differential and Complete Urinalysis, in order to examine variability among the sites in panel content and to further test the data federation capabilities of our system. Many of the analyte tests that are commonly included in these panels also appeared high in our lab order frequency list. The hospitals were asked to find all local examples of how they ordered these panels and then provide the complete list of individual analyte tests that made up these panels, again regardless of test methodology or specimen. The combination of the top ten tests in Table 3 and the additional tests from the two panels was denoted as “Lab Sample 1”. Selection of a larger set of tests for the first operational phase of PHIS+ was deferred until after analysis of the Lab Sample 1 results.

Metadata Selection

The process of mapping local terminology and data models to FURTHeR’s terminology and data model requires knowledge about how local systems store their data. This local metadata must be discovered and shared in order to create the mapping content necessary to semantically and syntactically integrate/federate data. There are two types of metadata we required for the lab data: metadata to map local test result codes, and metadata to map local lab result instance formats.

As part of FURTHeR’s regular process for lab test result code mapping, BMIC uses the Regenstrief LOINC Mapping Assistant (RELMA)8,9. RELMA is particularly useful for automated batch processing of large test code files. It operates best when a specific set of information (metadata) is included with each test code, and when this information is presented in a standard format. Following published directions for RELMA, BMIC developed a metadata specification for the lab result test codes in Lab Sample 1. A partial list of the metadata fields and their descriptions is shown in Table 4. The list also included metadata that could be used for interpreting certain result instance field formats, such as date field formats, but most of these were later removed for reasons explained below. The metadata fields were reviewed in phone meetings with the OC and ITC. Each of the sites confirmed the local availability of required fields and most of the optional fields.

Table 4:
A subset of the Lab Sample 1 metadata fields and their descriptions.

For the lab result instance formats from the sites, we originally proposed that the sites would be able to provide their lab sample data in a format that was most suitable to their data export capabilities. FURTHeR is able to process input information in a variety of formats, and our goal was to lessen the burden on the sites of configuring data in an unfamiliar manner. But, due to design changes that required that data flow through CHCA first and then to BMIC, one common data format was agreed upon. The common format is discussed in more detail in the Lab Data Collection section below.

Metadata Collection

A Microsoft Excel spreadsheet template was created with columns for each of the metadata fields. This template was given to each of the hospitals as a metadata collection tool, along with field description information, a metadata example file using the template, and an instruction sheet for the collection process.

Over a period of several weeks, the hospitals collected their metadata and began to send their spreadsheets via email to BMIC. (Email was considered safe for this task since no patient identifying information or hospital sensitive information was included.) Some sites had difficulty discovering the metadata, either because technical personnel were not available or because a more complicated IT infrastructure spread the information over more than one location (e.g., specimen information stored separately from test results), which made it harder to join the information into the metadata spreadsheet. This led more than one site to submit information in a non-standard format.

We initially estimated two weeks for the sites to complete their metadata collection, but some sites needed more time. Initial review of metadata spreadsheets also revealed some problems with the data (besides the template format changes mentioned previously), including misinterpretation of fields, incorrect formatting of data, and missing data. An iterative process with the hospitals corrected the errors and the initial metadata collection was completed in approximately one month.

Metadata Processing

BMIC began processing each of the metadata files as they were received. Any non-standard reporting formats (as reported above) in the metadata spreadsheets were corrected first. One of the authors (RG), a Master’s trained informaticist with a medical background, then used the RELMA tool to map the local test codes for each site to a corresponding LOINC code. RELMA provided an output file with the local and LOINC code pairs, which were then loaded into the FURTHeR terminology server. Each site received its own namespace in the terminology server in order to separate the local codes and provide more efficient code maintenance. In addition, the LOINC mappings, including LOINC code and descriptive LOINC name, were combined with the original metadata spreadsheets from the sites, along with comments/questions about the metadata and LOINC interpretations where appropriate, creating a “return file” for each site. From these return files, a master file was created that listed, side-by-side, all of the local test codes and corresponding LOINC codes, grouped and sorted by LOINC code. Our terminology mapping expert (RG) was able to process approximately 200 test codes per day using this process.

The site codes for the Test Value (when coded), Units of Measure, and Interpretation Code fields (see Table 4) were also added to the terminology server in the site namespaces and mapped to appropriate standard terminologies.
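
The per-site namespace arrangement and the LOINC-grouped master file can be pictured with a small in-memory model. This Python sketch is not the FURTHeR terminology server; the class name, the site labels, and the local codes are hypothetical. The two LOINC codes used (718-7, Hemoglobin in Blood; 2951-2, Sodium in Serum or Plasma) are real codes chosen only as examples.

```python
from collections import defaultdict


class TerminologyMap:
    """Toy stand-in for a terminology server with one namespace per site."""

    def __init__(self):
        # namespace -> {local_code: standard_code}
        self._maps = defaultdict(dict)

    def add_mapping(self, namespace, local_code, standard_code):
        self._maps[namespace][local_code] = standard_code

    def translate(self, namespace, local_code):
        """Return the standard code, or None when the local code is unmapped."""
        return self._maps.get(namespace, {}).get(local_code)

    def master_list(self):
        """Group (namespace, local_code) pairs by standard code, sorted by code."""
        grouped = defaultdict(list)
        for namespace, codes in self._maps.items():
            for local_code, standard_code in codes.items():
                grouped[standard_code].append((namespace, local_code))
        return {code: sorted(pairs) for code, pairs in sorted(grouped.items())}


server = TerminologyMap()
server.add_mapping("SITE_A", "HGB", "718-7")     # hypothetical local codes
server.add_mapping("SITE_B", "1234", "718-7")
server.add_mapping("SITE_A", "NA", "2951-2")

master = server.master_list()
```

Because each site has its own namespace, the same local code string can be reused at two sites without collision, and one site's mappings can be maintained without touching another's.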

Some of the local tests could not be mapped to LOINC due to ambiguities and missing data in the metadata. We used the return file spreadsheets in an iterative process with the sites in order to address these inconsistencies. For example, for some test metadata the specimen or method was not clear, and multiple LOINC code mappings were possible. We also found potential errors with reported units and reference ranges. In these cases, we asked the sites to provide additional metadata information about the tests. This iterative process continued for approximately 2–3 months, including extending into the lab data collection phase when actual lab results indicated missing codes and erroneous mappings. In the end, we were able to map each local test to an unambiguous LOINC code. (Some local tests were discarded because they were not part of the Lab Sample 1 list.)

An area that required special attention was where tests were mapped to a non-specific body fluid type, which might have consequences on research where more specific specimens are expected. The original LOINC mappings were re-examined at the request of the OC to ensure accurate mappings. Several tests were changed to more specific specimen tests by this process, but others remained with a generic specimen because the sites allowed this.

A beneficial consequence of the PHIS+ mapping was that it confirmed local LOINC mappings in some cases and corrected erroneous mappings in others. One of the sites provided its own LOINC mappings for all its local test codes, and BMIC’s mapping of the site’s metadata largely agreed with these LOINC mappings. Discrepancies were primarily due to ambiguous metadata that were resolved during the iterative cleanup process. Three other sites provided some LOINC mappings for their local codes. Discrepancies found with these mappings were again often due to ambiguous metadata. Others were actual erroneous mappings by the sites. We also discovered some instances where a site’s LOINC mapping was accurate when it was originally made, but the test had changed over time and the LOINC mapping was no longer correct. Finally, for those sites that had not yet implemented LOINC coding, this process gave them valuable information to begin the transition to LOINC usage. In the future, we anticipate that the sites will assume the LOINC mapping process, and a simpler QA process will be performed for PHIS+.

Metadata Results

Table 5 shows the results of the metadata processing, with the sites anonymized. The number of mapped, unique tests for each site is shown. A single LOINC code may map to more than one test at a site. Across all the sites, 435 unique LOINC codes were identified, which mapped to 959 total site tests. A majority of the LOINC codes (58%) are used at only one site; 19% are used at two sites, and 8% at three sites. But 15% (68) of the LOINC codes are used at four or more of the sites. Nine of the top ten tests in Table 3 were covered by at least five of the sites for serum/plasma specimens; lack of total agreement was due to a site reporting a different specimen for its test. The anomaly in the top ten list was Urea Nitrogen: more variability exists in specimen and method among the sites for this test. In addition to the top ten tests, all six sites perform common Hematocrit, Urine Leukocyte Esterase, Urine pH, Urine Reducing Substance, and Hemoglobin tests.
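
Summary figures of this kind can be derived directly from the list of site-to-LOINC mappings. The Python sketch below is illustrative: the function name and the toy mappings are hypothetical, and only the final line uses numbers from the text (959 site tests, 435 LOINC codes).

```python
from collections import Counter, defaultdict


def summarize(mappings):
    """mappings: iterable of (site, local_code, loinc_code) tuples.

    Returns (distribution, ratio): distribution counts LOINC codes by the
    number of distinct sites using them; ratio is local tests per LOINC code.
    """
    mappings = list(mappings)
    sites_per_code = defaultdict(set)
    for site, _local_code, loinc_code in mappings:
        sites_per_code[loinc_code].add(site)
    distribution = Counter(len(sites) for sites in sites_per_code.values())
    ratio = len(mappings) / len(sites_per_code)
    return distribution, ratio


# Toy data: two local tests at site A share one LOINC code with site B.
toy = [
    ("A", "HGB", "718-7"),
    ("A", "HGB_POC", "718-7"),
    ("B", "1234", "718-7"),
    ("A", "NA", "2951-2"),
]
distribution, ratio = summarize(toy)

# Ratio of local tests to LOINC codes implied by the reported totals.
reported_ratio = round(959 / 435, 1)
```

In the toy data one code is used at two sites and one at a single site, giving two local tests per LOINC code on average.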

Table 5:
Metadata results from the six hospitals showing the number of unique local tests from each site and corresponding number of linked LOINC codes, as well as other mapped codes. (Sites have been anonymized.)

The differences in the number of local tests between the sites performing the most (Sites A & B) vs. least (Site F) tests are partly due to Sites A & B more often reporting multiple tests that map to the same LOINC code (i.e. multiple local methods to obtain the same result). But Sites A & B still listed far more tests than the other sites, mostly due to unique tests reported as part of CBC and Complete Urinalysis panels.

In addition to the LOINC Codes, 34 SNOMED codes for Units of Measure were linked to 101 local Units; 9 HL7 Interpretation Codes were linked to 35 local codes; and 11 SNOMED Specimen codes were linked to 25 local codes.

Lab Data Collection

Once the metadata issues were largely resolved, we began the task of collecting lab data from the sites. As mentioned earlier, the project participants decided to use a common format for reporting the lab result data for Lab Sample 1. For simplicity, we developed a field-delimited text file format that had characteristics similar to HL7 v2.x message syntax10, including the use of a pipe delimiter between data fields. The required fields were derived from the metadata in Table 4 as well as additional fields necessary for the lab result instance data. For example, we added Patient ID and Billing Number fields in order to join the lab data with the existing administrative data in PHIS. We also specified a sequence number on each row of data so that we could QA our results with the sites and resolve issues. The format and fields for the lab sample file are shown in Figure 3, along with an example lab result. (The Hospital_Number and Campus_ID fields correspond to specific site identifiers issued previously by CHCA for PHIS.) Each row in the file ends with a carriage return. Descriptions of the file format, field contents, and example lab results were provided to the sites. The sites were instructed to compile results from the year 2009 for all the tests mapped in their corresponding metadata return files.
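
Reading a row in such a pipe-delimited format is straightforward, as the following Python sketch shows. The complete field list appears in Figure 3 and is not reproduced here, so the field order below is an assumption and some names (`Test_Code`, `Test_Value`, `Units`) are hypothetical placeholders; the example row is invented.

```python
# Assumed field order for illustration; the authoritative list is in Figure 3.
FIELDS = [
    "Sequence_Number", "Hospital_Number", "Campus_ID", "Patient_ID",
    "Billing_Number", "Test_Code", "Test_Value", "Units",
    "Ref_Range_Low", "Ref_Range_Hi", "Comments",
]


def parse_row(line, fields=FIELDS, delimiter="|"):
    """Split one pipe-delimited row into a dict keyed by field name."""
    values = line.rstrip("\r\n").split(delimiter)
    if len(values) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, found {len(values)}"
        )
    return dict(zip(fields, values))


# Invented row: empty trailing fields are kept as empty strings.
row = parse_row("1|9999|01|MRN1|BILL1|HGB|9.1|g/dL|||\r")
```

The sequence number travels with every parsed row, which is what makes row-level QA with the sites possible.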

Figure 3:
Lab data format for Lab Sample 1. The example shows an abnormal Hemoglobin result from a CBC panel. In the example, the End_Date_Time, Ref_Range_Low, Ref_Range_Hi, and Comments fields are empty.

Because of the issues discussed previously with data use agreements and IRB protocols, the sites were instructed to create a de-identified copy of their lab sample files. Patient ID and Billing Number fields were left blank, and all date/time fields were set to the same value (“January 1, 2009, 9:00 am”). The de-identified files could then be sent to BMIC. (The scrubbed fields were not necessary for the data translation process.) Once the data use agreements and IRB protocols were agreed to, the translated data from the de-identified files could be joined with the information in the original (identified) files via the Sequence_Number field.
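
The scrub-and-rejoin idea can be sketched as follows in Python. This is a simplified illustration under stated assumptions: the field names `Patient_ID`, `Billing_Number`, `Start_Date_Time`, and `LOINC_Code` are placeholders, the string form of the fixed date is assumed, and the actual site extracts were produced with local tools.

```python
FIXED_DATE = "2009-01-01 09:00"                      # assumed string form
ID_FIELDS = ("Patient_ID", "Billing_Number")         # placeholder names
DATE_FIELDS = ("Start_Date_Time", "End_Date_Time")   # Start_Date_Time is assumed


def deidentify(row):
    """Return a copy with identifiers blanked and all dates set to one value."""
    scrubbed = dict(row)
    for field in ID_FIELDS:
        scrubbed[field] = ""
    for field in DATE_FIELDS:
        scrubbed[field] = FIXED_DATE
    return scrubbed


def rejoin(identified_rows, translated_rows):
    """Attach translated codes to the identified rows via Sequence_Number."""
    by_sequence = {r["Sequence_Number"]: r for r in translated_rows}
    joined = []
    for row in identified_rows:
        merged = dict(row)
        merged["LOINC_Code"] = by_sequence[row["Sequence_Number"]]["LOINC_Code"]
        joined.append(merged)
    return joined


identified = [{
    "Sequence_Number": "1", "Patient_ID": "MRN1", "Billing_Number": "B1",
    "Start_Date_Time": "2009-03-04 10:15", "End_Date_Time": "",
    "Test_Code": "HGB",
}]
scrubbed = [deidentify(r) for r in identified]
# Stand-in for the translation step performed on the scrubbed copy.
translated = [dict(r, LOINC_Code="718-7") for r in scrubbed]
joined = rejoin(identified, translated)
```

Because the sequence number carries no patient information, it can serve as the only link between the scrubbed file that is translated and the identified file that stays behind.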

We had previously established a secure FTP transmission process for sending the lab sample files from the sites to CHCA. We retained this process for the de-identified files so that the sites would have to implement only one communication method for PHIS+. CHCA received the de-identified files, performed a simple QA to verify that identifying fields were scrubbed, and then forwarded the files to a secure FTP server at BMIC. BMIC then moved the files to a secure local server for data translation.

Lab Data Processing

As discussed previously, BMIC used a modified version of FURTHeR to process the lab sample files. One of the authors (OEL), a senior software engineer on the FURTHeR project, created a data file adapter that could read the pipe-delimited results from the sample files and feed the lab records to the FURTHeR translation engine. A simple command line interface initiated the process by pointing the data file adapter to the correct sample file and configuration file, and invoking the FURTHeR application. The translation engine marshaled the raw lab data into the FURTHeR lab object and translated all local codes to the standard terminologies (using the code associations in the terminology server, described earlier). Unrecognized codes and malformed input data were flagged to a log file for manual review. An output adapter, also developed by OEL for PHIS+, took each translated lab result and inserted it into a MySQL database via a Java Hibernate object11. (The Hibernate persistence layer can easily be reconfigured to support any other JDBC compliant database.) The fields in the database mimicked those in the lab sample data file (Figure 3). We included columns for the original field values as well as for the translated values for QA purposes. We also added fields for auditing purposes (insertion date/time, status).
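
The translate, flag, and persist flow can be sketched compactly in Python. This is not the FURTHeR implementation (which uses Java and Hibernate with MySQL): SQLite stands in for the target database, and the table layout, column names, and status value are assumptions made for illustration.

```python
import logging
import sqlite3

log = logging.getLogger("lab_translation")

# Assumed table layout: original and translated values side by side for QA,
# plus audit columns (status, insertion time).
SCHEMA = """
CREATE TABLE IF NOT EXISTS lab_result (
    sequence_number TEXT,
    local_test_code TEXT,
    loinc_code      TEXT,
    status          TEXT,
    inserted_at     TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def translate_and_store(rows, code_map, conn):
    """Translate local test codes and insert results; flag unknown codes.

    Returns (stored, flagged) counts.
    """
    conn.execute(SCHEMA)
    stored = flagged = 0
    for row in rows:
        loinc = code_map.get(row["Test_Code"])
        if loinc is None:
            # Unrecognized codes are logged for manual review, not stored.
            log.warning("unrecognized code %r in row %s",
                        row["Test_Code"], row["Sequence_Number"])
            flagged += 1
            continue
        conn.execute(
            "INSERT INTO lab_result "
            "(sequence_number, local_test_code, loinc_code, status) "
            "VALUES (?, ?, ?, ?)",
            (row["Sequence_Number"], row["Test_Code"], loinc, "TRANSLATED"),
        )
        stored += 1
    conn.commit()
    return stored, flagged


rows = [
    {"Sequence_Number": "1", "Test_Code": "HGB"},
    {"Sequence_Number": "2", "Test_Code": "UNKNOWN"},
]
connection = sqlite3.connect(":memory:")
result = translate_and_store(rows, {"HGB": "718-7"}, connection)
```

Storing the original code next to its translation lets a reviewer audit any single row without going back to the source file.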

Initial translation performance of the lab files was poor, taking up to several hours to process an entire file on a desktop workstation. FURTHeR does not typically handle input data sets as large as the lab sample files. Tuning enhancements were made by OEL and PM, including local caching of common terminology mappings, additional RAM, and database tuning. These enhancements significantly improved performance of the system: approximately 4,000 results per second could be translated on a desktop workstation. The enhancements were also added to the standard FURTHeR code base, providing a beneficial speed-up in FURTHeR queries.
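
Local caching of terminology lookups is a general technique that can be shown with Python's standard `functools.lru_cache`. The sketch below is illustrative and does not describe how FURTHeR implements its cache; the service class and its call counter are hypothetical.

```python
from functools import lru_cache


class TerminologyService:
    """Hypothetical stand-in for a slow (e.g., remote) terminology lookup."""

    def __init__(self, mappings):
        self._mappings = mappings
        self.lookups = 0  # counts how often the slow path is taken

    def lookup(self, namespace, code):
        self.lookups += 1
        return self._mappings.get((namespace, code))


def make_cached_lookup(service, maxsize=10_000):
    """Wrap service.lookup so repeated (namespace, code) pairs hit a cache."""

    @lru_cache(maxsize=maxsize)
    def cached(namespace, code):
        return service.lookup(namespace, code)

    return cached


service = TerminologyService({("SITE_A", "HGB"): "718-7"})
cached_lookup = make_cached_lookup(service)

# The same local code recurs in many lab rows; only the first call is slow.
codes = [cached_lookup("SITE_A", "HGB") for _ in range(1000)]
```

Lab files repeat a small set of codes across a very large number of rows, which is why a cache of this kind can pay off quickly.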

Some issues with the sample file formats were identified during processing. Some sites had inserted quotation marks around field values or around entire row results. While it would have been relatively easy to correct for this in the data file adapter, we asked the sites to correct these errors and retransmit their files in order to maintain a consistent file format across the sites. One of the sites inserted additional, unanticipated fields into its sample file. The site was asked to remove these fields and retransmit its file.
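
Format problems of this kind can be detected before translation with a simple per-row check. The Python sketch below is a hypothetical pre-flight validator, not part of the data file adapter; the messages and the three-field example are invented.

```python
def check_row(line, expected_fields, delimiter="|"):
    """Return a list of format problems found in one row (empty if clean)."""
    problems = []
    text = line.rstrip("\r\n")
    values = text.split(delimiter)

    if len(values) != expected_fields:
        problems.append(
            f"expected {expected_fields} fields, found {len(values)}"
        )

    def quoted(s):
        return len(s) >= 2 and s.startswith('"') and s.endswith('"')

    # Catch quotes around single values as well as around the whole row.
    if quoted(text) or any(quoted(v) for v in values):
        problems.append("quoted value")

    return problems
```

For example, `check_row('1|"HGB"|9.1', 3)` reports a quoted value, while `check_row('1|HGB|9.1|extra', 3)` reports an unexpected field count. Reporting problems back to the site, rather than repairing them silently, is consistent with keeping one file format across all sites.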

Several of the sites sent unknown test codes, units of measure, and interpretation codes that were caught during the terminology translation process. Consultation with the sites resolved these anomalies, usually resulting in modifications/additions to a site’s metadata and/or namespace terminology. In certain cases, the errors were traced to errors in the source systems or in the data collection at the site.

Lab Data Results

Table 6 shows the results from the lab data processing, with the sites anonymized as in Table 5. The number of results from each site for all the tests identified from a site’s metadata for the year 2009 is shown. The most common lab tests across the sites are measurements normally performed as part of CBC, electrolyte and urinalysis panels.

Table 6:
Lab data result counts from the six hospitals. (Sites have been anonymized.)

Of the 435 total LOINC test codes originally returned by the sites in their metadata files, results were returned for only 392 LOINC codes. Twenty-four LOINC tests covered 50% of all the results reported; 141 tests covered 95% of all the results reported. Comparing the LOINC results from Table 5 and Table 6, three sites had fewer actual LOINC tests in their data files than were specified in their metadata. This is because the metadata they originally supplied represented possible tests the site could perform during the sample period even though these tests were not performed in 2009. Site E reported far fewer tests than the other sites because it transitioned to a new EMR in 2009 and did not have the entire year’s labs converted to the new EMR format at the time of the data extract.
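
Coverage figures such as "N codes account for 50% of results" come from a cumulative count over codes sorted by frequency. The Python sketch below illustrates the calculation on invented counts; the function name is hypothetical.

```python
def codes_needed_for_coverage(result_counts, percent):
    """Smallest number of most-frequent codes covering `percent` of results.

    result_counts: dict mapping code -> number of results for that code.
    """
    total = sum(result_counts.values())
    running = 0
    ordered = sorted(result_counts.values(), reverse=True)
    for n, count in enumerate(ordered, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return n
    return len(ordered)


# Invented counts for four codes (100 results in total).
toy_counts = {"code_a": 50, "code_b": 30, "code_c": 15, "code_d": 5}
half = codes_needed_for_coverage(toy_counts, 50)
most = codes_needed_for_coverage(toy_counts, 95)
```

In the toy data one code covers half of the results and three codes cover 95%, the same long-tailed shape reported for the real data.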


Federating lab data from the six pediatric hospitals revealed many important process issues that might not have been discovered had we examined each site separately. For example, the problem with the non-specific body substance (“body fluid”) was largely discovered by running comparisons across the sites for common tests. The OC and ITC were also able to quickly come to common ground on test sample content and file formats that greatly simplified our initial data federation work. Working across the six sites may have actually sped up this process since the hospitals had a strong desire to implement the informatics portion of PHIS+ in order to move on to the later CER projects. The sites were able to compromise and quickly come to agreement on solutions that best supported common goals.

The decision to work on a small set of lab tests for Lab Sample 1 proved to be valuable. Even with an open, well-discussed and well-documented process, we still had anomalies with information provided by the sites, both in content and format. Fortunately, we had established good communication channels between BMIC, CHCA and the hospitals, and our iterative process of addressing and resolving issues worked well. Working on the six sites in parallel also provided some efficiency because the hospitals varied in the time it took them to provide data, allowing BMIC to work on one site’s data while awaiting data from another. Even the “small” number of tests requested in Lab Sample 1 surprised us with the number of tests we eventually needed to map. In retrospect, we should have anticipated that asking for all possible tests that measured the analytes in our top ten list, plus the addition of CBC and Complete Urinalysis panels, would result in a large number of local tests.

We were also surprised by the high number of LOINC test codes resulting from the local tests, and the corresponding low ratio of local tests to LOINC codes (2.2:1). We had hoped that a higher ratio of local tests would map to common LOINC codes. It will be interesting to investigate whether LOINC codes may overlap or can be subsumed under a parent code in a LOINC hierarchy in some cases in order to facilitate more efficient querying of PHIS+ lab data. We may discover that, even though we mapped local test codes to standard LOINC codes, federation of the lab data is not fully possible because of variability in local test performance. We plan to investigate this as part of future work on PHIS+.

There are also implications as we proceed to the next project phase and select a larger lab test sample for federation. Preliminary investigations reveal that there are approximately 9,000 unique lab codes across the six hospitals. At our current rate of 200 code translations per day, it would take 45 business days to complete an initial translation of all the codes. An equivalent number of days would be needed to resolve any issues. Therefore, our current task is to identify an appropriate subset of tests in order to create the initial PHIS+ load. We are encouraged that some sites may more actively pursue in-house LOINC coding, which would greatly improve our mapping efficiency and potential quality. We also need to consider the number of test results that will result from this larger set of test codes. As PHIS+ goes into operational use, the hospitals will provide regular update files that cover smaller time periods. But to initially populate the repository, we will need to load batch files that cover a wider timeframe (going back several years before 2009). Even at 4,000 record translations per second, we anticipate a heavy processing load and will therefore need to investigate methods to efficiently handle this load.
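The planning arithmetic above can be reproduced with a short sketch. The mapping rate (200 codes per day), the code count (approximately 9,000) and the translation throughput (4,000 records per second) come from the text; the function names are hypothetical, and the record count used for the throughput example is an illustrative placeholder rather than a project figure.

```python
import math


def business_days_to_map(n_codes, codes_per_day):
    """Business days needed for an initial pass over all local codes."""
    return math.ceil(n_codes / codes_per_day)


def hours_to_translate(n_records, records_per_second):
    """Wall-clock hours to translate a batch at a sustained throughput."""
    return n_records / records_per_second / 3600


# Figures stated in the text: ~9,000 unique codes at 200 translations per day.
mapping_days = business_days_to_map(9000, 200)

# An equivalent number of days is expected for issue resolution.
total_days = 2 * mapping_days

# Placeholder batch size (an assumption), translated at 4,000 records/second.
placeholder_records = 14_400_000
batch_hours = hours_to_translate(placeholder_records, 4000)
```

Varying `n_records` for a multi-year, multi-site batch gives a rough sense of how processing time scales with the size of the initial load.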

From lessons learned during processing Lab Sample 1, we are planning several changes for the next, larger lab sample. The issue with non-specific body substances (“body fluid”) indicated that we need to include specimen detail whenever possible in the metadata. We also must be much more stringent about enforcing metadata format standards, as a significant amount of time was spent reformatting metadata files in order to import them into RELMA. On the other hand, metadata file format differences submitted by one of the sites were actually useful to the mapping process and will likely be adopted for future metadata gathering. As we go forward, particularly when we enter the PHIS+ operational phase, we see value in requiring sites to include additional descriptive data in their data sets, too. Requiring panel information (instead of being “optional”) and asking for specimen information and test methodology, if available, will help to ensure that lab data being sent continue to match existing metadata. As noted previously, we have already seen instances where lab tests have changed over time, and even local LOINC mappings have become obsolete. Additional metadata in the results can help us discover these problems.
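One way such a consistency check could work is sketched below: each incoming result row is compared with the metadata previously supplied for its local test code, and rows with unknown codes or changed units are reported. This is an assumption-laden illustration, not the PHIS+ implementation; the field names (`local_code`, `units`) and the sample values are hypothetical.

```python
def find_metadata_mismatches(results, metadata):
    """Compare result rows with previously supplied test metadata.

    `metadata` maps a local test code to a dict describing the test;
    `results` is an iterable of dicts, one per lab result.
    Returns a list of (local_code, problem) tuples.
    """
    problems = []
    for row in results:
        code = row["local_code"]
        known = metadata.get(code)
        if known is None:
            # The site sent a test that was never described in its metadata.
            problems.append((code, "code not in metadata"))
        elif row.get("units") != known.get("units"):
            # The test may have changed since the metadata was collected.
            problems.append((code, "units differ from metadata"))
    return problems


# Hypothetical metadata and results for illustration only.
site_metadata = {"NA1": {"units": "mmol/L"}, "GLU7": {"units": "mg/dL"}}
incoming = [
    {"local_code": "NA1", "units": "mmol/L"},
    {"local_code": "GLU7", "units": "mmol/L"},
    {"local_code": "K22", "units": "mmol/L"},
]
issues = find_metadata_mismatches(incoming, site_metadata)
```

The same pattern extends to specimen type, panel membership or methodology if those fields are included in both the metadata and the result files.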

A notable area of concern in our lab federation is that, even though we have mapped local test codes to a standard (LOINC), there may still be a problem with joining results across hospitals because of potential differences in reference ranges. Unless reference ranges are also equal, we cannot be sure that tests with the same LOINC codes from different institutions can be equated. However, we have collected reference range information with the results, and have included abnormal flags. In cases where reference ranges differ, investigators can use the flags for qualitative analysis (e.g., find all Sodium results that are “High”). We can also look to efforts by other groups to normalize lab tests based on reference ranges in order to provide guidance for addressing this issue within PHIS+.
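The role of abnormal flags can be illustrated with a minimal sketch in which the same numeric value is interpreted against each site's own reference range. The ranges, site labels and function names below are hypothetical and chosen only to show how a shared flag supports qualitative queries when ranges differ.

```python
def abnormal_flag(value, low, high):
    """Classify a numeric result against a site-specific reference range."""
    if value < low:
        return "Low"
    if value > high:
        return "High"
    return "Normal"


# Hypothetical sodium results (mmol/L) with each site's own reference range.
sodium_results = [
    {"site": "X", "value": 145, "ref_low": 135, "ref_high": 145},
    {"site": "Y", "value": 145, "ref_low": 136, "ref_high": 144},
    {"site": "Y", "value": 140, "ref_low": 136, "ref_high": 144},
]

flags = [
    abnormal_flag(r["value"], r["ref_low"], r["ref_high"])
    for r in sodium_results
]

# A qualitative query: all sodium results flagged "High", regardless of site.
high_sodium = [r for r, f in zip(sodium_results, flags) if f == "High"]
```

Here the identical value of 145 is "Normal" at one site and "High" at the other, which is why the flag, rather than the raw value, is the safer basis for cross-site qualitative comparison.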


The goals of the AHRQ PROSPECT grants are to advance the capacity of electronic data collection infrastructure as a basis for comparative effectiveness research, and to conduct specific comparative effectiveness studies using these robust new data. We have made significant progress on fulfilling the grant vision by partnering a dynamic pediatric research network (PRIS) with the existing CHCA PHIS database and incorporating data federation methods developed by an experienced informatics group (BMIC). The work presented in this paper, and the lessons learned from it, reflect the first 6 months of the grant and the varied expertise required to accomplish it. Many issues were resolved through the project governance, bi-weekly workgroups, significant efforts of individual hospitals, and centralized efforts from BMIC, CHCA and the PRIS Network.

We were able to successfully map over 900 unique lab tests across six pediatric hospitals to their corresponding LOINC codes. We were then able to use our translation infrastructure to federate over 13,000,000 lab results into a common repository. Our process for collecting metadata worked well to support the PHIS+ mapping efforts. Lessons learned during this initial phase will be used to inform our next phase when we will create the first instance of the PHIS+ repository with lab data and join it with existing administrative data. We expect to have this phase completed by the end of Q2 2011. Subsequent phases will incorporate microbiology and radiology data into PHIS+. We have already formed workgroups with specific expertise in these areas who are analyzing sample data to determine approach, terminology, metadata and data format, following similar project management and technical methods used for the clinical lab data.


The authors would like to thank the hospital and CHCA OC and ITC members and their staff. We would like to acknowledge the tremendous contributions of our project coordinators, Lauren Lubarsky (CHOP), Matthew Whittaker (BMIC) and Jebi Miller (CHCA), the PRIS network coordinator, Jaime Blank, the PRIS Research Network, and Mark Schreiner and Barbara Engel from the CHOP IRB for consultation on the regulatory framework for exchanging data. This effort was supported by grant R01 HS019862-01 from AHRQ.


1. ARRA-AHRQ Recovery Act 2009 Limited Competition: PROSPECT Studies: Building New Clinical Infrastructure for Comparative Effectiveness Research (R01) [Internet] 2010. (accessed Mar 2011).
2. Fry DE, Pine M, Jordan HS, et al. Combining administrative and clinical data to stratify surgical risk. Ann Surg. 2007 Nov;246(5):875–885.
3. Pine M, Jordan HS, Elixhauser A, et al. Enhancement of claims data to improve risk adjustment of hospital mortality. JAMA. 2007;297(1):71–76.
4. Clinical and Translational Science Awards [Internet] 2011. (accessed Mar 2011).
5. Bradshaw RL, Matney S, Livne OE, Bray BE, Mitchell JA, Narus SP. Architecture of a federated query engine for heterogeneous resources. AMIA Annu Symp Proc; 2009. pp. 70–4.
6. Livne OE, Schultz ND, Narus SP. Federated querying architecture for clinical & translational health IT. In: Veinot T, editor. IHI ’10: 1st ACM International Health Informatics Symp. Washington, DC: ACM; pp. 250–6.
7. Matney SA, Bradshaw RL, Livne OE, Bray BE, Mitchell JA, Narus SP. Developing a semantic framework for clinical and translational research. (In Press). 2011 AMIA Summit on Translational Bioinformatics.
8. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, Forrey A, Mercer K, DeMoor G, Hook J, Williams W, Case J, Maloney P. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003 Apr;49(4):624–33.
9. Regenstrief LOINC mapping assistant [Internet] 2011. (accessed Mar 2011).
10. Health Level Seven International. V2 messages [Internet] 2011. (accessed Mar 2011).
11. Bauer C, King G. Hibernate in action. Manning Publications; 2004.
