AMIA Jt Summits Transl Sci Proc. 2013; 2013: 86–88.
Published online 2013 March 18.
PMCID: PMC3845781

A comprehensive framework for data quality assessment in CER


The panel addresses the urgent need to ensure that comparative effectiveness research (CER) findings derived from diverse and distributed data sources are based on credible, high-quality data, and that the methods used to assess and report data quality are consistent, comprehensive, and available to data consumers. The panel consists of representatives from four teams leveraging electronic clinical data for CER, patient-centered outcomes research (PCOR), and quality improvement (QI). It seeks to change the current paradigm, in which data quality assessment (DQA) is performed “behind the scenes” using one-off, project-specific methods. The panelists will present their process of harmonizing existing models for describing and measuring clinical data quality and will describe a comprehensive integrated framework for assessing and reporting DQA findings. The collaborative project is supported by the Electronic Data Methods (EDM) Forum, a three-year grant from the Agency for Healthcare Research and Quality (AHRQ) to facilitate learning and foster collaboration across a set of CER, PCOR, and QI projects designed to build infrastructure and methods for collecting and analyzing prospective data from electronic clinical data.


Detailed clinical data from disparate data sources, including electronic health records (EHRs), are the backbone of large-scale comparative effectiveness research (CER). Yet there exist no formal methods for assessing and reporting on the quality of data obtained from these sources. This proposal will develop a comprehensive data quality assessment framework and guidelines for the CER community.

This panel will present the collaboration between diverse research teams leveraging electronic clinical data for research and quality improvement (QI). The goal of this collaboration is to create draft recommendations and guidelines that can guide the development of new analytic and reporting methods specifically directed to data quality assessment and reporting for CER studies. The long-term vision is that all EHR-based clinical studies and all publicly available data sets would be linked to data quality assessment results that allow for an independent assessment of the quality of the data used to generate the reported results. In addition, as NIH data sharing requirements become more stringent, uniform, standardized data quality assessment measures will enable a potential data consumer to determine whether a given data set is sufficient for their intended use.

The collaborative project is supported by the Electronic Data Methods (EDM) Forum, a three-year grant from the Agency for Healthcare Research and Quality (AHRQ) to facilitate learning and foster collaboration across a set of CER, patient-centered outcomes research (PCOR), and QI projects designed to build infrastructure and methods for collecting and analyzing prospective data from electronic clinical data. The EDM Forum has commissioned collaborative projects that examine current challenges and opportunities for conducting CER, PCOR, and QI with electronic clinical data. Specific areas of focus include the data governance, clinical informatics, and analytic issues that are crucial to the design and use of electronic clinical data for CER, as well as lessons learned from quality improvement and other efforts to use electronic clinical data for health research and clinical care. The EDM Forum and the research projects connected to the Forum are funded by the American Recovery and Reinvestment Act (ARRA).

Panel Overview

The panel consists of representatives from four teams leveraging electronic clinical data for CER, PCOR, and QI. One representative from each of the projects will describe their pre-existing model for describing and measuring clinical data quality and will describe their role in constructing a harmonized model of data quality that captures the key elements of their individual models and any additional features described in the clinical data quality assessment literature.

Moderator: Erin Holve PhD (Electronic Data Methods Forum)

Dr. Erin Holve, principal investigator of the EDM Forum, will serve as the panel’s moderator. In this role, Dr. Holve will highlight cross-cutting themes and challenges for data quality among the broader set of EDM Forum research projects.

Panelist: Michael G. Kahn MD, PhD (University of Colorado)

Comparative effectiveness research studies require access to detailed clinical data collected across diverse clinical practice settings. Unlike “traditional” prospective clinical trials that utilize detailed data collection tools and procedures and rely on trained data collection personnel, EHR databases contain data collected during routine clinical care by practitioners focused on patient care rather than research. Differences in clinical workflows, practice standards, patient populations, available technologies, and referral resources impact what data are collected and how they are documented. Numerous studies have highlighted significant concerns about the quality of data in EHRs [1–6]. CER studies seek to exploit real-world diversity in order to detect and understand determinants impacting outcome variation. Data quality and completeness problems, however, may affect the validity of CER findings.

The importance of good-quality data in clinical research is well accepted [6,7]. However, methods for categorizing, analyzing, and reporting on data quality are poorly developed. Most approaches to data quality assessment (DQA) are ad hoc, developed based on an intuitive understanding of data quality challenges, and focused on specific research questions. Few systematic approaches to DQA for the secondary use of clinically obtained data have been proposed. Current methods do not emphasize the need to improve the reporting of DQA results. This presentation will describe the development of a comprehensive community-driven DQA framework and guidelines. Using this harmonized framework, we will describe how data quality can be continuously assessed and improved in large CER databases, and how investigators and consumers may utilize data quality assessment results to plan future studies and interpret study results.

Panelist: Meredith Nahm PhD (Duke University)

Similar to other research designs, clinical trials are making increased use of existing data. These data originate from multiple clinical sources, including health records, registries, and clinical data warehouses. At its core, secondary use of data is use by individuals other than those who originally collected the data, for purposes other than those for which the data were originally collected. Further, data from clinical trials, whether collected de novo for the trial or gathered from clinical documentation, may be reused for secondary (tertiary, or further) analysis. Thus, in addition to the quality needs imposed to support practice or regulatory decision-making [7,8], clinical trials must themselves contend with data quality issues related to secondary data use; that is, they must consider additional dimensions of data quality to support further secondary use of the trial data itself, such as federal requirements for data sharing [9].

Potential solutions to the trifecta are likely multifold. Frameworks are needed to support secondary data use, specifically frameworks that facilitate analysis of data sources for a given data use. Because clinical trials are themselves a secondary data user, and because fundamental informatics principles apply to representation and management of all data, such frameworks are likely consistent with other research designs across the spectrum of the NIH definition of clinical research [10] and can be unified. Further, because of increased secondary data use demands placed on the data from clinical trials, in some cases data quality considerations for clinical trials will include additional data quality dimensions to support secondary data use. Examples of additional information that may be needed to support secondary data use include procedural definition of the original observation or measurement, complete definition of collected data elements, documentation of cleaning algorithms to which the data values were subjected, quantitative assessments of data accuracy, and metadata including attribution, contemporaneity, and provenance of the data values. Methods for managing the association of this data quality metadata with the data values are needed [11].
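One way to picture the association of data quality metadata with a data value is a record that carries both together. The following is a minimal sketch only; the class names, field names, and sample values are hypothetical and are not drawn from the panelists' models or from any standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class QualityMetadata:
    """Illustrative (hypothetical) quality metadata carried with one data value."""
    attributed_to: str        # attribution: who documented the value
    observed_at: datetime     # when the observation or measurement was made
    recorded_at: datetime     # when it was documented
    source_system: str        # provenance: where the value originated
    cleaning_steps: list = field(default_factory=list)  # cleaning applied so far

    def recording_lag(self) -> timedelta:
        """Contemporaneity expressed as the delay between observing and recording."""
        return self.recorded_at - self.observed_at


@dataclass
class AnnotatedValue:
    """A data element value that travels together with its quality metadata."""
    element: str
    value: float
    meta: QualityMetadata


# Made-up example record, for illustration only.
sbp = AnnotatedValue(
    element="systolic_bp_mmHg",
    value=128.0,
    meta=QualityMetadata(
        attributed_to="nurse_017",
        observed_at=datetime(2013, 1, 5, 9, 0),
        recorded_at=datetime(2013, 1, 5, 9, 45),
        source_system="clinic_ehr",
        cleaning_steps=["unit check: mmHg"],
    ),
)

print(sbp.element, sbp.value, "lag:", sbp.meta.recording_lag())
```

Keeping the metadata on the same record as the value means a later secondary user can filter or weight values by, for example, recording lag, without going back to the source system.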

Panelist: Patrick Ryan PhD (OMOP)

In order to interpret the results of any analysis on a data source, the characteristics of the data source must be clearly understood. The Observational Source Characteristics Analysis Report (OSCAR) provides a systematic approach for summarizing all observational healthcare data within the OMOP common data model. The procedure creates structured output of descriptive statistics for all relevant tables within the model to facilitate rapid summary and interpretation of the potential merits of a particular data source for addressing active surveillance needs.
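The general idea of producing structured descriptive statistics for each column of a table can be sketched as follows. This is not OSCAR itself and does not reproduce its output format; the function name, the choice of statistics, and the sample rows are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def summarize_column(rows, column):
    """Return simple descriptive statistics for one column of a list of row dicts."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    summary = {
        "n_rows": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric column: report range and mean.
        summary.update(min=min(present), max=max(present), mean=mean(present))
    else:
        # Categorical column: report the most frequent values.
        summary["top_values"] = Counter(present).most_common(3)
    return summary


# Made-up rows standing in for a person-level table.
rows = [
    {"person_id": 1, "year_of_birth": 1950, "gender": "F"},
    {"person_id": 2, "year_of_birth": 1985, "gender": "M"},
    {"person_id": 3, "year_of_birth": None, "gender": "F"},
]

birth_summary = summarize_column(rows, "year_of_birth")
gender_summary = summarize_column(rows, "gender")
print(birth_summary)
print(gender_summary)
```

A structured summary of this kind is what a downstream checker could scan for implausible or suspicious values.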

Generalized Review of OSCAR Unified Checking (GROUCH) is a program that produces, for each data source, a summary report of warnings about implausible and suspicious data observed in the OSCAR summary. It identifies potential issues across all OMOP common data model tables, including potential concerns with all drug exposures and all conditions. GROUCH allows for data quality review of specific drugs (such as the ingredients that comprise the OMOP drugs of interest) or specific conditions (including population-level prevalence of the health outcomes of interest, and unexpected gender-specific rates, such as males with pregnancy and females with prostate cancer).
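A check for unexpected gender-specific conditions, such as the examples just mentioned, might look like the sketch below. It is not GROUCH's implementation; the rule table, record layout, sex coding, and function name are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical rule table: condition label -> recorded sex for which it is implausible.
IMPLAUSIBLE_FOR_SEX = {
    "pregnancy": "M",
    "prostate cancer": "F",
}


@dataclass(frozen=True)
class ConditionRecord:
    person_id: int
    sex: str         # "M" or "F" (illustrative coding)
    condition: str


def flag_sex_condition_conflicts(records):
    """Return one warning string per record whose condition is implausible for the recorded sex."""
    found = []
    for rec in records:
        if IMPLAUSIBLE_FOR_SEX.get(rec.condition) == rec.sex:
            found.append(
                f"person {rec.person_id}: '{rec.condition}' recorded for sex '{rec.sex}'"
            )
    return found


# Made-up records for illustration.
records = [
    ConditionRecord(1, "F", "pregnancy"),
    ConditionRecord(2, "M", "pregnancy"),
    ConditionRecord(3, "F", "prostate cancer"),
    ConditionRecord(4, "M", "prostate cancer"),
]

conflicts = flag_sex_condition_conflicts(records)
for message in conflicts:
    print(message)
```

Such a warning marks a record as suspicious rather than proving an error, since either the sex or the condition may be the miscoded field.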

Panelist: Nicole Weiskopf (Columbia University)

In order to determine whether a dataset is of sufficient quality, it is first necessary to identify and understand the data needs of the intended use case. Systematic approaches to data quality assessment and reporting are desirable, but must be flexible enough to account for the fact that data quality may be a task-dependent concept. Further complicating matters is the increase in the secondary use of data pulled from sources like electronic health records (EHRs), which present a number of unique challenges.

An ideal data quality model must accommodate the task-dependent nature of data quality assessment by making explicit the connections between study designs, data needs, data quality requirements, and potential data quality assessment methods, thereby allowing data consumers to determine the suitability of complicated datasets for specific research tasks. This presentation will use data completeness as an example to illustrate how different definitions of the same data quality dimension may lead to significantly different data quality assessment methods and findings. Data consumers must explicitly define data completeness as it relates to the intended research in order to enable appropriate assessment and allow transparent reporting of data methods.
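To make the point about completeness concrete, the sketch below applies three different definitions to the same small data set. The definitions, function names, and data are illustrative assumptions, not the definitions used by the panelists.

```python
# Hypothetical data: patient id -> blood pressure value per visit (None = not recorded).
visits = {
    "p1": [120, 118, 122],
    "p2": [None, 130, None],
    "p3": [None, None, None],
    "p4": [125, None, 127],
}


def completeness_any(visits):
    """Fraction of patients with at least one recorded value."""
    complete = sum(1 for vals in visits.values() if any(v is not None for v in vals))
    return complete / len(visits)


def completeness_all(visits):
    """Fraction of patients with a recorded value at every visit."""
    complete = sum(1 for vals in visits.values() if all(v is not None for v in vals))
    return complete / len(visits)


def completeness_cells(visits):
    """Fraction of all patient-visit cells that hold a recorded value."""
    cells = [v for vals in visits.values() for v in vals]
    return sum(v is not None for v in cells) / len(cells)


print("at least one value per patient:", completeness_any(visits))    # 0.75
print("a value at every visit:        ", completeness_all(visits))    # 0.25
print("share of filled cells:         ", completeness_cells(visits))  # 0.5
```

The same records are 75%, 25%, or 50% complete depending on the definition. A cross-sectional study that needs any one measurement might accept this data set, while a longitudinal study that needs a value at every visit might not.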


A key task of the EDM Forum is to assemble relevant stakeholders with an interest in addressing and resolving (when feasible) the challenges likely to arise in developing infrastructure and methods for CER based on electronic clinical data. This collaborative project represents an important opportunity to bring together four existing data quality assessment (DQA) models and create a harmonized model, data quality assessment methods, DQA best practices, and data quality reporting recommendations for the CER community.


Drs. Erin Holve, Michael Kahn, Meredith Nahm, and Patrick Ryan, and Ms. Nicole Weiskopf have all agreed to participate on this panel.


1. Aronsky D, Haug PJ. Assessing the quality of clinical data in a computer-based record for calculating the pneumonia severity index. J Am Med Inform Assoc. 2000;7(1):55–65.
2. Chan KS, Fowles JB, Weiner JP. Review: Electronic health records and the reliability and validity of quality measures: A review of the literature. Medical Care Research and Review. 2010;67(5):503–27.
3. Hogan WR, Wagner MM. Accuracy of data in computer-based patient records. J Am Med Inform Assoc. 1997;4(5):342–55.
4. Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: Data quality issues and informatics opportunities. AMIA Summits on Translational Science Proceedings. 2010;2010:1–5.
5. de Lusignan S, Valentin T, Chan T, Hague N, Wood O, van Vlymen J, et al. Problems with primary care data quality: osteoporosis as an exemplar. Inform Prim Care. 2004;12(3):147–56.
6. Weiner MG, Embi PJ. Toward reuse of clinical data for research and quality improvement: the end of the beginning? Ann Intern Med. 2009;151(5):359–60.
7. Davis JR, Nolan VP, Woodcock J, Estabrook RW, editors. Assuring data quality and validity in clinical trials for regulatory decision making: Workshop report. The National Academies Press; 1999.
8. Society for Clinical Data Management (SCDM). Good clinical data management practices document. 2011. Available from .
9. National Library of Medicine. Data sharing plan requirements. National Institutes of Health; 2011.
10. National Institutes of Health. Glossary and acronym list. 2011.
11. Food and Drug Administration. Guidance for industry: Computer systems used in clinical trials. Department of Health and Human Services; 1999.
