Data integration is a constant challenge in translational science [1]. In the past decade, several data integration regimes, including federated database strategies [3], workflow approaches [4], the semantic web [5], and warehousing methods [8], have been tested in the biomedical informatics community. The strengths and limitations of these approaches have been carefully reviewed [12], and a data warehousing approach is considered most suitable because of its strong data integrity and its standalone architecture, which is less affected by inadequate infrastructure environments. To date, the approach has been widely adopted in the translational informatics community: at a recent Clinical and Translational Science Award (CTSA) annual meeting, 23 of 67 abstracts were related to warehousing strategies [15]. However, there is no consensus on whether, and how, heterogeneous source data need to be processed for integration; how these data should be accessed; and how the data can be shared beyond the local setting [8]. In the database layer, the entity-attribute-value (EAV) scheme is commonly used to manage evolving domain concepts, in conjunction with various modeling concepts [8]; this may further complicate the problem of semantic inconsistency of data within, and between, warehouses. At the application level, many locally developed data warehouses have no end-user application interface, so users must rely on programmers or informaticians to retrieve data on a case-by-case basis. Another kind of warehousing system, e.g., Informatics for Integrating Biology and the Bedside (i2b2) [9], provides a two-step data retrieval strategy: users obtain a cohort number through a set of query criteria, and then must shape and clean the selected dataset to create their own “mini-marts” for their specific needs. Whether through programmers or assisted by computational tools, obtaining data of interest on a case-by-case basis is not a cost-effective solution. In addition, if end-users cannot directly access data values in a database, the quality of this data source can be compromised by the lack of user feedback [18]. Therefore, issues concerning data integration and the application layer of a warehouse system warrant further investigation. In this report, we introduce an alternative approach to address these issues.
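The EAV pattern, and the burden it places on whoever must pivot the data back into analyzable form, can be illustrated with a small sketch. All table names, attribute spellings, and values below are invented for illustration; the example uses Python's built-in sqlite3 module:

```python
import sqlite3

# Hypothetical EAV table: one row per (patient, attribute, value) triple.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observation_eav (
        patient_id INTEGER,
        attribute  TEXT,   -- the concept is named by a string, not a column
        value      TEXT    -- everything stored as text; typing is lost
    )
""")
rows = [
    (1, "diagnosis", "C50.9"),        # one source codes diagnoses in ICD-10
    (1, "tumor_size_cm", "2.3"),
    (2, "dx", "breast cancer"),       # another source uses a different
    (2, "tumor_size", "23 mm"),       # attribute name, code system, and unit
]
conn.executemany("INSERT INTO observation_eav VALUES (?, ?, ?)", rows)

# Pivoting EAV rows into a conventional one-row-per-patient shape needs one
# conditional aggregate per attribute -- and the query must already know
# every attribute spelling in use across the warehouse.
pivot = conn.execute("""
    SELECT patient_id,
           MAX(CASE WHEN attribute IN ('diagnosis', 'dx') THEN value END) AS diagnosis,
           MAX(CASE WHEN attribute LIKE 'tumor_size%' THEN value END)     AS tumor_size
    FROM observation_eav
    GROUP BY patient_id
    ORDER BY patient_id
""").fetchall()
print(pivot)   # [(1, 'C50.9', '2.3'), (2, 'breast cancer', '23 mm')]
```

Note that the pivot succeeds syntactically while leaving the semantic inconsistency intact: the two sources' diagnosis codes and tumor-size units are merged into the same columns unreconciled.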
Strictly speaking, a data warehouse (DW) is not a simple repository filled with aggregated source data. Rather, it is a database that integrates data from disparate sources while delivering those data with uniformity, semantic consistency, and minimal redundancy [12]. Without meeting these criteria, the aggregated data will be of little use. Here data usability is defined as “data + meaning,” which can be achieved when data are unified, standardized, connected, and validated [20]. To fully exploit these data, the data mart (DM) concept was introduced to help users consume data stored in a DW [23], often by providing a user interface for a user group with a shared specific interest [25]. The star schema and the entity-relationship (ER) schema are the major schemes used to organize DW data [27]. Conceptual data modeling is considered a necessary step in building a flexible warehouse schema that can satisfy various requirements [26]. The ER approach is frequently used in conceptual design because of its mathematical foundation [30], its ability to clarify and annotate data semantics [27], and its strong support from established SQL functions and industrial-grade data management tools [29]. Although pioneering researchers successfully used this modeling method to manage a centralized clinical data source as early as 1991 [35], conceptual modeling is often overlooked in translational and clinical informatics practice.
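For contrast with the EAV layout, a star schema keeps each concept in a typed column: a central fact table of measures carries foreign keys into dimension tables of descriptive attributes. The sketch below uses invented table and column names, again with Python's built-in sqlite3:

```python
import sqlite3

# Hypothetical star schema: two dimension tables and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_patient (
        patient_key INTEGER PRIMARY KEY,
        sex         TEXT,
        birth_year  INTEGER
    );
    CREATE TABLE dim_diagnosis (
        diagnosis_key INTEGER PRIMARY KEY,
        icd10_code    TEXT,
        description   TEXT
    );
    -- The fact table holds measures plus foreign keys into the dimensions.
    CREATE TABLE fact_encounter (
        patient_key         INTEGER REFERENCES dim_patient,
        diagnosis_key       INTEGER REFERENCES dim_diagnosis,
        length_of_stay_days INTEGER
    );
    INSERT INTO dim_patient   VALUES (1, 'F', 1960), (2, 'M', 1955);
    INSERT INTO dim_diagnosis VALUES (10, 'C50.9', 'Breast cancer');
    INSERT INTO fact_encounter VALUES (1, 10, 4), (2, 10, 2);
""")

# A typical mart-style query: aggregate facts, sliced by dimension attributes.
result = conn.execute("""
    SELECT d.description,
           COUNT(*)                  AS encounters,
           AVG(f.length_of_stay_days) AS avg_los
    FROM fact_encounter f
    JOIN dim_diagnosis d ON d.diagnosis_key = f.diagnosis_key
    GROUP BY d.description
""").fetchall()
print(result)   # [('Breast cancer', 2, 3.0)]
```

Because attribute names and types live in the schema rather than in row values, such queries need no per-attribute case logic, which is one reason the star and ER schemes dominate DW organization.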
Motivated by various translational research projects, spanning both cancer and non-cancer medical research fields, we initiated a data warehousing project called Translational data Mart (TRAM). During the four years that TRAM has been in active use, our local cancer translational research community has further specified its informatics demands, which can be categorized as follows:
- Researchers want to be able to search and retrieve semantically and descriptively consistent data across domains and longitudinally, and use these data for quantifiable analysis with little or no additional effort for data manipulation and cleansing.
- Bio-specimen data need to be annotated with available clinical and translational research data.
- Molecular research records (e.g., genotyping) and phenotypic (clinical) records should be interlinked at the individual level when they derive from the same person.
- Researchers need to protect the privacy of data from their ongoing research, yet also want to be able to share these data with collaborators.
- Researchers hope to curate and annotate the integrated data, and eventually to develop an evidence-based knowledgebase for all cancers.
Analyzing these requests, one realizes that these specifications are, in fact, not unique to cancer researchers. However, no publicly available warehousing system satisfies these application demands. Our TRAM system, on the other hand, has an architectural framework that can be built upon to meet them. Our objective was therefore to develop an oncology data mart (ONCOD) as a module within TRAM to satisfy the needs of cancer researchers. Through this effort, we expect to establish a DW/DM system that can be easily customized to support additional marts for other major medical fields. In this report, we first introduce our system design and the methods used to develop ONCOD. We then assess the end results by measuring the data quality and performance of ONCOD against the specifications proposed by cancer researchers. We also outline the system architecture that supports ONCOD and its potential. Finally, we discuss lessons learned in this study, highlight unsolved problems and possible solutions in our current approach, and describe the potential application of the ONCOD/TRAM mechanism beyond our local environment.