The National Cancer Institute (NCI) has developed the Common Data Elements (CDE) to serve as a controlled vocabulary of data descriptors for cancer research, to facilitate data interchange and inter-operability between cancer research centers. We evaluated CDE’s structure to see whether it could represent the elements necessary to support its intended purpose, and whether it could prevent errors and inconsistencies from being accidentally introduced. We also performed automated checks for certain types of content errors that provided a rough measure of curation quality.
We evaluated CDE content downloaded via the NCI’s CDE Browser and transformed into relational database form. The evaluation covered three categories: 1) compatibility with the ISO/IEC 11179 metadata model, on which CDE structure is based; 2) features necessary for controlled vocabulary support; and 3) support for a stated NCI goal, the setup of data collection forms for cancer research.
Various limitations were identified with respect to both content (inconsistency, insufficient definition of elements, redundancy) and structure, particularly the need for term and relationship support and the need for metadata supporting the explicit representation of electronic forms that utilize sets of common data elements.
While there are numerous positive aspects to the CDE effort, there is considerable opportunity for improvement. Our recommendations include review of existing content by diverse experts in the cancer community; integration with the NCI thesaurus to take advantage of the latter’s links to nationally used controlled vocabularies; and various schema enhancements required for electronic form support.
The National Cancer Institute’s Cancer Bioinformatics Grid (caBIG, http://cabig.nci.nih.gov) (1) comprises a network of individuals from NCI-supported cancer centers, NCI personnel and NCI-affiliated contractors, who are working towards the creation of standards for cancer-related informatics, and the eventual creation of interoperable software modules supporting those standards. The modules will serve various purposes, including exchange of research data, conduct of clinical trials, financial, billing and other administrative tasks, and adverse event reporting.
The interoperation will be based on common metadata standards. (The term “metadata”, meaning “data that describe and define other data” (2), is used in both the singular and plural.) Among the various forms of metadata are controlled vocabularies, whose role in biomedical standardization efforts is well known. Examples of biomedical controlled vocabularies are the Medical Subject Headings (MeSH) (3), the Logical Observation Identifiers Names and Codes (LOINC) (4), the Systematized Nomenclature of Medicine (SNOMED) (5) and the Gene Ontology (GO) (6). The National Library of Medicine (NLM)’s Unified Medical Language System (UMLS) (7) is a compendium of numerous existing biomedical vocabularies, including all those just mentioned.
The NCI has developed two controlled vocabularies. One of these, the NCI Thesaurus (8, 9), is incorporated into UMLS. The other, the Common Data Elements (CDE) is not, and its contents are consequently less well known. The NCI Center for Bioinformatics (NCICB) describes the purpose of CDE as follows(10):
“One of the problems confronting the biomedical data management community is the panoply of ways that similar or identical concepts are described. Such inconsistency in data descriptors (metadata) makes it nearly impossible to aggregate and manage even modest-sized data sets in order to be able to ask basic questions. The NCI, together with partners in the research community, develops common data elements (CDEs) that are used as metadata descriptors for NCI-sponsored research…CDEs are descriptors of data – metadata – that are used to set up data collection forms for cancer research studies.”
Various NCI-sponsored groups conduct clinical research that generates significant amounts of data. The parameters that are recorded relate to clinical and laboratory findings as well as items in standardized questionnaires. Just as individual items in a laboratory data stream using the Health Level 7 (HL7) communications protocol (www.hl7.org) are tagged with LOINC identifiers for the corresponding lab parameters, the idea is that eventually CDEs could act similarly as a foundation for cancer research data interchange. Most centers use a variety of clinical study data management systems (CSDMSs) for electronic data collection during conduct of clinical research. If such software is initialized with CDE content, then, when electronic forms are set up, mapping of the form elements/questions to CDEs would greatly simplify the interchange of data collected at different cancer centers. An additional hope is that existing definitions of standard cancer research forms may be reused in their entirety instead of having to be redefined by each group.
The CDE has some attributes of a controlled terminology, in the sense that its contents, like LOINC identifiers or SNOMED-CT concepts, will be utilized far beyond their site of origination to support semantic mapping between electronic systems whose use is related to cancer research or treatment. An internal evaluation of CDE from the terminology aspect has been previously performed by the Chute group at Mayo Clinic; its conclusions are reported very briefly on the group’s Web site (11). We analyzed the CDE’s functions and structure to determine its fitness for its intended purpose.
The CDE, which has been in continuous development for at least four years, was originally intended to be a standard nomenclature for the reporting of Phase 3 cancer clinical trials data (12). NCICB stores the CDE in a relational database called caDSR (Cancer Data Standards Repository), whose design is influenced by the ISO/IEC 11179 standard for descriptive metadata (13) (ISO=International Organization for Standardization; IEC=International Electrotechnical Commission). The documents describing ISO/IEC 11179 prescribe a conceptual model rather than an actual physical implementation, even though they include several Unified Modeling Language (UML) (14) class diagrams. These diagrams, an extended form of the Entity-Relationship diagrams used to model database schemas, also incorporate referential integrity constraints between the various components. (An example of a referential integrity constraint for an outpatient clinic database is: if a new visit is recorded for a patient, the “physician visited” must first exist in the database.) Referential integrity is so important that modern database engines allow it to be specified “declaratively”, through a concise phrase in the schema definition language, as opposed to having to write code.
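The outpatient-clinic example can be sketched as a declarative constraint. The following is a minimal illustration only, assuming SQLite (through Python's standard `sqlite3` module) rather than the engine that hosts caDSR; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE physician (
    physician_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE visit (
    visit_id     INTEGER PRIMARY KEY,
    patient_name TEXT NOT NULL,
    -- The REFERENCES clause is the "concise phrase" that declares the constraint.
    physician_id INTEGER NOT NULL REFERENCES physician(physician_id)
);
""")

conn.execute("INSERT INTO physician VALUES (1, 'Dr. Example')")
conn.execute("INSERT INTO visit VALUES (1, 'Patient A', 1)")  # physician exists


def try_orphan_visit(connection):
    """Attempt to record a visit to a physician who is not in the database."""
    try:
        connection.execute("INSERT INTO visit VALUES (2, 'Patient B', 99)")
    except sqlite3.IntegrityError:
        return False  # the engine rejected the row; no application code checked it
    return True


rejected = not try_orphan_visit(conn)
visit_count = conn.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
```

The rejection is performed by the database engine itself; the application supplies no validation logic beyond the schema declaration.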
In ISO/IEC 11179 terminology, a Data Element is the fundamental unit of data that an organization disseminates. A data element is based on a Data Element Concept (the abstract unit of knowledge that it represents) and a representation. Aspects of representation include unit of measure, and Value Domain. A value domain (a set of permissible/valid values for the element) is defined by information such as data type (e.g., number, character, date), maximum and minimum permissible values, maximum and minimum permissible length (e.g., number of characters), number of decimal places, and if applicable, an enumerated list of values (typically codes accompanied by descriptive phrases).
One composes a data element by combining a concept with a value domain. In some cases, only one value domain is meaningful for a given concept. In other cases, however, an abstract concept may be described in more than one way, e.g., quantitatively (as numbers in a specified unit), qualitatively (absent, mild, moderate, severe) or comparatively with a reference (e.g., above normal, within normal limits, or below normal), and each type of description calls for a different value domain. Obviously, a single value domain may often apply to multiple data elements. Some value domains occur so commonly that they may be treated as special data types: a well-known example is the Boolean data type, which consists of the enumeration (True/Yes, False/No).
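The composition of a data element from a concept and a value domain can be illustrated with a small sketch. The class names, fields, and the "Pain intensity" example are hypothetical and greatly simplified; they are not the ISO/IEC 11179 classes or the caDSR tables.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ValueDomain:
    """A set of permissible values: either a numeric range or an enumeration."""
    name: str
    data_type: str                      # "number" or "character" in this sketch
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    choices: Tuple[str, ...] = ()       # enumerated list, if applicable

    def accepts(self, value) -> bool:
        if self.choices:
            return value in self.choices
        if self.data_type == "number":
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            if self.min_value is not None and value < self.min_value:
                return False
            if self.max_value is not None and value > self.max_value:
                return False
            return True
        return isinstance(value, str)


@dataclass(frozen=True)
class DataElement:
    """A concept combined with one value domain."""
    concept: str
    value_domain: ValueDomain


# Two value domains, each reusable by many data elements.
scale_0_to_10 = ValueDomain("Scale 0-10", "number", min_value=0, max_value=10)
severity = ValueDomain("Severity", "character",
                       choices=("absent", "mild", "moderate", "severe"))

# One abstract concept described quantitatively and qualitatively.
pain_numeric = DataElement("Pain intensity", scale_0_to_10)
pain_graded = DataElement("Pain intensity", severity)
```

Here the same concept yields two distinct data elements because each description calls for a different value domain, while `severity` could equally be attached to other concepts.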
Clear guidelines are provided for composing the names and definitions of data elements from the names of the concepts and the value domains. Related concepts may be grouped into Classes, but the standard leaves the details of this issue unspecified.
NCICB does not provide direct access to the caDSR schema or an ftp-able version of the contents plus schema definition. We therefore accessed CDE content via the CDE browser (15), which performs a live query of the database. The downloaded contents are in the form of XML or a delimited file containing 57 columns. This file is the result of a join of around nine relational tables, and is consequently highly redundant in content. With some programming effort and the help of the CDE technical documentation diagrams, we reconstructed a semantically equivalent copy of the original schema. A UML class diagram of the schema, including the relationships between tables, is shown in fig. 1. (A Microsoft Access database containing the schema and complete CDE contents as of Sept. 1, 2004 can be downloaded from our ftp site as ftp://custard.med.yale.edu/others/cde.zip.)
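The general idea of recovering separate tables from a redundant, flattened join can be sketched as follows. This is not the procedure or the column layout actually used: the seven pipe-delimited columns and the sample rows are hypothetical stand-ins for the 57-column download.

```python
import csv
import io

# Hypothetical flattened export: each data element repeats once per choice,
# so concept and value-domain details are duplicated across rows.
FLAT = """de_id|de_name|concept_id|concept_name|vd_id|vd_name|choice_value
1|Nausea grade|10|Nausea|100|Severity|mild
1|Nausea grade|10|Nausea|100|Severity|severe
2|Vomiting grade|11|Vomiting|100|Severity|mild
2|Vomiting grade|11|Vomiting|100|Severity|severe
"""


def normalize(flat_text):
    """Split the flattened rows into four de-duplicated 'tables'."""
    concepts, value_domains, data_elements = {}, {}, {}
    choices = set()
    for row in csv.DictReader(io.StringIO(flat_text), delimiter="|"):
        concepts[row["concept_id"]] = row["concept_name"]
        value_domains[row["vd_id"]] = row["vd_name"]
        # The data element acts as a bridge between concept and value domain.
        data_elements[row["de_id"]] = (
            row["de_name"], row["concept_id"], row["vd_id"])
        if row["choice_value"]:
            choices.add((row["vd_id"], row["choice_value"]))
    return concepts, value_domains, data_elements, choices


concepts, value_domains, data_elements, choices = normalize(FLAT)
```

Keying each dictionary on the identifier column collapses the repeated rows, so the four input rows reduce to two concepts, one value domain, two data elements and two choices.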
To understand this structure, it helps to emphasize that the ISO/IEC 11179 model is not intended to represent metadata with sufficient richness to address every potential use for it. This model is concerned specifically with the structure of “metadata registries” – official repositories of metadata gathered from different sources. Many aspects of the CDE schema are concerned with issues of provenance (origin, attribution)– which source created or is responsible for a particular element, what the current version of a particular element is, and what it is designated as in the source, etc. These aspects have been elided in the figure by shortening or omitting the details of certain tables.
The four most important tables in fig. 1 are Concepts, Value Domains, Data Elements and Choices. As stated earlier, a data element is logically a combination of a concept with a value domain, and therefore the Data Elements table acts as a “bridge” between the other two. For value domains that comprise a list of enumerated or ordinal items, the individual items are stored in Choices. Both a concept and a value domain may be derived from a particular source (in UMLS terminology), which in ISO/IEC 11179 is called a Context. Each context has an administrator with the authority to manage and edit the CDEs that the context “owns”. Examples of Contexts are individual NCI divisions such as CTEP (Cancer Therapy Evaluation Program) or the SPORE research consortia (Specialized Programs of Research Excellence). A data element may be indexed by one or more keywords: the table Classifications records these keywords. Finally, there may be documentation records associated with an element, as well as entries (“Designations”) that denote how the data element is recorded in the original source: this last is roughly equivalent in function to the Source Abbreviation field in UMLS.
Fig. 2 is a screen-shot of a form within the Microsoft Access application, showing the details of an individual data element (the summary result of an abdominal CT scan used to assess bladder disease). Associated data in the Concepts, Value Domains, Choices and Classification tables is also displayed in the same form.
We evaluated the caDSR and its content (the CDEs and related tables described in fig.1) from several aspects:
Each of these aspects was evaluated from several perspectives:
Completeness of Evaluation: Because CDE has few tables and a simple structure, a structural evaluation can be complete. Similarly, evaluations based on the running of queries that test specific constraints or duplications can also be complete, because they return all rows that fail the desired criterion. However, certain kinds of content errors, such as errors in semantics, can only be identified by visual inspection of individual rows of data and application of domain knowledge. Because of the size of CDE, one cannot quantify these errors precisely without intensive curation. Our evaluation from the latter aspects cannot be considered “complete”: however, it is still important to report their presence where they are encountered in the course of another aspect of the evaluation.
The results are now described under these headings.
While the caDSR design follows the logical model for ISO/IEC 11179, it departs from this model in several ways, as reflected in both structure and content. This divergence can be problematic. We now explain with examples.
Requirement: Controlled vocabularies should provide means of arranging related concepts of varying granularity in a hierarchy or network. Recording hierarchical and non-hierarchical inter-concept relationships, as in SNOMED and UMLS, supports navigational browsing, facilitating understanding of the vocabulary’s coverage, and helps to identify potential redundancies.
The concepts within caDSR range from finely granular concepts like “line 1 of the street address”, to concepts like “Hematology Lab”, which includes numerous diverse data elements such as erythrocyte sedimentation rate, prothrombin time and the various components of the differential white blood cell count. However, caDSR lacks such a hierarchy; all concepts exist at a single level, with no means of inter-relating them. The Mayo document cited earlier points out that the absence of semantic or syntactic linkage of shared concepts makes it difficult to algorithmically recognize related concepts (11).
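One way such inter-concept relationships could be recorded is with a separate relationship table that is traversed recursively. The sketch below assumes SQLite via Python's `sqlite3` module; the table names, the parent/child arrangement and the sample concepts are hypothetical and are not part of caDSR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
-- One row per parent/child link; a concept may have several parents.
CREATE TABLE concept_relationship (
    parent_id INTEGER NOT NULL REFERENCES concept(concept_id),
    child_id  INTEGER NOT NULL REFERENCES concept(concept_id),
    PRIMARY KEY (parent_id, child_id)
);
INSERT INTO concept VALUES
    (1, 'Hematology Lab'),
    (2, 'Differential white blood cell count'),
    (3, 'Neutrophil percentage'),
    (4, 'Prothrombin time');
INSERT INTO concept_relationship VALUES (1, 2), (2, 3), (1, 4);
""")


def descendants(connection, root_name):
    """Return the names of all concepts below root_name, at any depth."""
    rows = connection.execute("""
        WITH RECURSIVE sub(concept_id) AS (
            SELECT concept_id FROM concept WHERE name = ?
            UNION
            SELECT r.child_id
            FROM concept_relationship r
            JOIN sub s ON r.parent_id = s.concept_id
        )
        SELECT c.name
        FROM concept c
        JOIN sub s ON c.concept_id = s.concept_id
        WHERE c.name <> ?
        ORDER BY c.name
    """, (root_name, root_name)).fetchall()
    return [r[0] for r in rows]


hematology_children = descendants(conn, "Hematology Lab")
```

With links of this kind in place, a browser could show that a finely granular concept sits beneath a broader one, which is the navigational support the requirement describes.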
Another consequence of the lack of this desideratum is content redundancy. For example, the “Hematology Lab” concept has a data element with the preferred name HMT_NEUT_LAB_PTG_VAL and the definition “peripheral blood neutrophils percentage”. On the other hand, neutrophil cell percentage is also a concept in its own right. The data element that is the single child of this concept, however, is different from the one just mentioned. It has the preferred name “LAB_HEME_NEUTROPHILS_CELL_*” and the same preferred definition, except that it begins with a capital P. Similar redundancies occur for other parts of the differential, such as monocytes, promyelocytes, etc., as well as the absolute counts of these cell types. This situation represents one of unrecognized synonyms, since the underlying semantics of the two data elements are identical.
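Redundancy of this particular kind, where definitions differ only in letter case or surrounding whitespace, lends itself to an automated check. The following is an illustrative sketch rather than the query actually run in the evaluation; it assumes SQLite via Python's `sqlite3` module and a hypothetical two-column table, and the third row is invented as a non-duplicate control.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_element (
        preferred_name TEXT PRIMARY KEY,
        definition     TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO data_element VALUES (?, ?)", [
    ("HMT_NEUT_LAB_PTG_VAL", "peripheral blood neutrophils percentage"),
    ("LAB_HEME_NEUTROPHILS_CELL_*", "Peripheral blood neutrophils percentage"),
    ("EXAMPLE_OTHER_ELEMENT", "line 1 of the street address"),  # hypothetical
])

# Group on a normalized form of the definition so that differences in case
# and leading/trailing whitespace do not hide duplicates.
duplicates = conn.execute("""
    SELECT LOWER(TRIM(definition)) AS norm_def, COUNT(*) AS n
    FROM data_element
    GROUP BY norm_def
    HAVING COUNT(*) > 1
""").fetchall()
```

A check like this finds only textual near-duplicates; elements whose definitions are worded differently but mean the same thing still require inspection by a domain expert.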
Requirements: A controlled vocabulary must support representation of alternative synonymous forms (terms) for the same underlying concept. The clinical domain, for example, has both Anglo-Saxon and Greco-Latin equivalents for the same concepts, e.g., vomiting vs. emesis. Terms, or the key phrases that they contain, provide a means of query expansion, in that the same concept can be located through different search terms.
The caDSR lacks a “synonyms/terms” table. Concepts are only classified by keywords and grouped by the source they came from. Consequently, searching is less robust.
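A terms table of the kind that is missing could look roughly like the sketch below, which assumes SQLite via Python's `sqlite3` module. The table and column names are hypothetical, and the vomiting/emesis pair from the text serves as sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id     INTEGER PRIMARY KEY,
    preferred_name TEXT NOT NULL
);
-- Many terms (synonyms) may point to the same concept.
CREATE TABLE term (
    term_id    INTEGER PRIMARY KEY,
    concept_id INTEGER NOT NULL REFERENCES concept(concept_id),
    term_text  TEXT NOT NULL
);
INSERT INTO concept VALUES (1, 'Vomiting');
INSERT INTO term (concept_id, term_text) VALUES (1, 'vomiting'), (1, 'emesis');
""")


def find_concepts(connection, search_text):
    """Locate concepts through any of their synonymous terms."""
    rows = connection.execute("""
        SELECT DISTINCT c.preferred_name
        FROM concept c
        JOIN term t ON t.concept_id = c.concept_id
        WHERE LOWER(t.term_text) = LOWER(?)
        ORDER BY c.preferred_name
    """, (search_text,)).fetchall()
    return [r[0] for r in rows]


by_greek = find_concepts(conn, "emesis")
by_english = find_concepts(conn, "Vomiting")
```

Both searches reach the same concept, which is the query expansion that a keyword-only classification cannot provide.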
Requirements: In order to support generation of robust electronic data forms from individual data elements, it is necessary to record information that inter-relates these elements, such as:
The papers of van Ginneken (23), which addresses requirements for structured data entry, and Nadkarni et al (24) which deals with E-form generation for a clinical trials database, address this theme in greater detail: the software described in both papers is available as open-source. More than a dozen relational tables are required to capture the requisite information, starting with the representation of a form itself. (The documentation and code for our open-source clinical study data management system, TrialDB (25, 26), are available at ftp://custard.med.yale.edu).
The CDE schema currently lacks the structures necessary to record any of the information necessary for form definition as delineated above. Meeting this goal is realistic, but the details of how it may be achieved are beyond the scope of this paper.
To recapitulate the results of the evaluation:
Despite the problems identified in the evaluation, the CDE initiative has certain positive aspects. Notably, the decision to utilize the existing ISO/IEC 11179 standard model has raised general awareness of this standard (previously confined mostly to the information-technology world) in the biomedical informatics community. Several of this model’s aspects, such as the definition of value domains, benefit all controlled biomedical vocabulary efforts by addressing several issues that existing vocabularies currently tend to deal with in an ad hoc fashion.
Mere adoption of a standard model, however, does not suffice. The current version of ISO 11179 was devised for support of registries of descriptive metadata, not for the considerably more complex issues that CDE tries to address. Several HL7-affiliated investigators have recognized this limitation and are working towards augmenting ISO 11179 for ontology support (notably Harold Solbrig of Mayo Clinic, whose previous work on ISO 11179 is described in (27)) but it may take some time before these efforts yield a revised standard data model.
To their credit, caBIG-affiliated individuals have begun to establish well-defined processes for new data elements suggested for incorporation into CDE as standard elements (28), including human expert review of the existing CDE content to check for semantic identity with an existing element. Much existing (legacy) content, however, has attained a “standard” status as a consequence of less rigorous review processes. The recommendations below deal with how to fix CDE structure and content to support its various intended roles.
We state our recommendations under the three broad categories of evaluation stated earlier.
NCICB needs to ensure that CDE content adheres to the standard’s intentions of clear, correct and unambiguous definitions and descriptions. This task requires curatorial input from cancer/clinical content experts as well as those with experience in development of biomedical thesauri. Relatively few individuals outside NCI have had an opportunity to inspect CDE content in its entirety. It is desirable for NCICB to emulate the example of UMLS, whose contents are made available as a set of delimited text files whose contents can readily be massaged and imported into relational tables. The present hurdle of requiring those who wish to inspect CDE content in bulk to parse a complex-structured and highly content-redundant XML stream, results in unnecessarily duplicated effort at individual cancer centers.
It is desirable to follow the example of UMLS and explicitly support preferred definitions for both concepts and domains. The most useful source vocabularies that feed into UMLS (notably the NLM’s Medical Subject Headings) record concept definitions.
Standard structures to support the minimum requirements of controlled vocabularies (synonyms and relationships) should be incorporated. This will also facilitate mapping of CDE content to standard sources such as the UMLS, and leveraging UMLS content in turn to link to its constituent vocabularies such as LOINC and SNOMED. One of the efforts that is part of the caBIG initiative is the creation of an information architecture that faithfully follows the HL7 version 3 draft standard (29): such mapping will facilitate interpretation of the semantics of data streams that contain CDEs as attributes.
The NCI thesaurus already has mapping to UMLS in place, but the CDE and NCI thesaurus efforts currently appear to be operating and managed more or less independently of each other. Integration of CDE content into the NCI thesaurus should be a high priority. The use of “terminology services” tools such as developed by the Mayo group (30–32) should help greatly in such an integration effort.
CDE will possibly need to borrow eventually from SNOMED-CT to incorporate a mechanism for support of description logics. The research of Hahn and Schulz (33, 34) and the OpenGALEN project (35, 36) has shown that controlled terminologies often need to be augmented by mechanisms for such knowledge representation. SNOMED-CT currently uses an XML representation to support composition of complex concepts from more atomic ones, and CDE will possibly need to use a similar approach.
Given the modest size of CDE, the tasks of enforcing ISO 11179 compliance and controlled vocabulary features are tractable.
To meet the goal of supporting computable electronic data collection forms, the caDSR schema requires major extensions. The complexity of this task, however, should not be underestimated. In our own experience in maintaining a clinical trials data management system over more than seven years, this schema component has evolved continually to meet user demands. For example, we have now added metadata schema (and generic code) to support dynamic E-form generation in an arbitrary number of languages (e.g., English, Spanish, German). Such a feature is useful in clinical studies that are conducted internationally, because it allows a single Web site to serve pages in multiple languages without creating multiple bodies of code or multiple database schemas. At the metadata schema level, such support involves allowing multiple display captions for the same data element (and for each value in an enumerated value domain), one for each target language. Ideas from the UMLS, which records synonymous terms for the same concepts in different languages, can be profitably borrowed.
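The multi-language caption idea can be sketched at the schema level as follows. This is an illustration only and does not reproduce the TrialDB schema; it assumes SQLite via Python's `sqlite3` module, and the table names, the `BIRTH_DATE` element and the English fallback rule are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_element (
    element_id    INTEGER PRIMARY KEY,
    internal_name TEXT NOT NULL
);
-- One caption per (element, language) pair; the element itself is stored once.
CREATE TABLE element_caption (
    element_id    INTEGER NOT NULL REFERENCES data_element(element_id),
    language_code TEXT NOT NULL,
    caption       TEXT NOT NULL,
    PRIMARY KEY (element_id, language_code)
);
INSERT INTO data_element VALUES (1, 'BIRTH_DATE');
INSERT INTO element_caption VALUES
    (1, 'en', 'Date of birth'),
    (1, 'es', 'Fecha de nacimiento'),
    (1, 'de', 'Geburtsdatum');
""")


def caption_for(connection, element_id, language_code, fallback="en"):
    """Return the caption in the requested language, else in the fallback."""
    for code in (language_code, fallback):
        row = connection.execute(
            "SELECT caption FROM element_caption "
            "WHERE element_id = ? AND language_code = ?",
            (element_id, code),
        ).fetchone()
        if row:
            return row[0]
    return None


spanish = caption_for(conn, 1, "es")
french_fallback = caption_for(conn, 1, "fr")  # no French caption is stored
```

Because captions live in their own table, adding a language means adding rows rather than altering the schema or duplicating form-generation code. Captions for enumerated values could be handled with an analogous table keyed on the choice.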
We believe that the goals of full ISO/IEC 11179 compatibility and schema infrastructure for controlled terminologies are achievable with relatively modest resources, while E-form support should be postponed until these goals are met.
The CDE is a critical linchpin in the highly desirable goal of inter-operability and data sharing for cancer research. This evaluation is intended to assist the cancer informatics community in identifying ways of improving the CDE.
This work was supported by NIH grants U01 CA78266, U01 ES10867, R01 LM06843 and K23 RR16042, institutional funds from Yale University School of Medicine, and a contract from the National Cancer Institute for support of participation in the Cancer Bioinformatics Grid (caBIG).