Assessing eMERGE Data Elements against PhenX and caDSR
As the mapping results show, the lexical matching technique found few equivalences (8%) between the eMERGE and PhenX data elements; most elements had a broader (41%), narrower (4%), or no (36%) relationship. These outcomes are consistent with the fact that the eMERGE studies focus primarily on EMR-derived phenotyping. The phenotype-specific data elements are therefore representative of data stored in EMR systems, which can range from very abstract (e.g., Cancer Indicator) to extremely granular (e.g., Ankle-Brachial Index after a Treadmill Test). PhenX measures, on the other hand, were developed primarily for investigators who are either planning a future study or expanding an existing one, with the expectation that the measures, when readily available, can be used as part of standard protocols for collecting subject-related data. Furthermore, PhenX also focused on environmental exposures (e.g., History of Daycare Attendance) that were out of scope for eMERGE. As a consequence, many eMERGE data elements either had a broader relationship to PhenX measures or had no match. Interestingly, for the data elements that were equivalent, the corresponding mapped caDSR CDEs were not the same. (We discuss this issue later in this section.)
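The four relationship categories above can be illustrated with a minimal sketch. It assumes a simple token-subset heuristic, which is not necessarily the lexical matching technique used in this study; the function names `tokens` and `relate` are hypothetical. A label whose tokens are a proper subset of another label's tokens carries fewer qualifiers and is treated as broader.

```python
import re

def tokens(label):
    """Lowercase a label and return its set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", label.lower()))

def relate(source, target):
    """Classify how `source` relates to `target` (illustrative heuristic only)."""
    s, t = tokens(source), tokens(target)
    if s == t:
        return "equivalent"
    if s < t:
        # source has fewer qualifying words, so it is the more general label
        return "broader"
    if s > t:
        # source has extra qualifying words, so it is the more specific label
        return "narrower"
    return "none"

print(relate("Ankle-Brachial Index",
             "Ankle-Brachial Index after a Treadmill Test"))
print(relate("Cancer Indicator", "History of Daycare Attendance"))
```

A heuristic this simple misses synonyms (for example, Sex versus Gender), which is one reason purely lexical matching would be expected to find few equivalences.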
For the caDSR CDE-based mapping approach, the goal was to determine which CDEs were common to, and mapped from, both eMERGE and PhenX data elements. We found that the vast majority (97%) of caDSR CDEs did not match, that is, were not reused across both projects. One reason for such a large non-overlap of data elements is the non-overlap between the phenotypes and domains of study of the two projects. For example, several PhenX measures were modeled for cancer, reproductive health, and speech and hearing, areas that eMERGE did not address. The second major reason for the lack of overlap is more technical, and is associated with coverage and curation aspects of caDSR. We discuss this issue next.
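The CDE-based comparison amounts to a set intersection over the CDE identifiers that each project mapped to. The sketch below is illustrative only: the identifiers are placeholders rather than real caDSR public IDs, the function name `cde_overlap` is hypothetical, and taking the union of both projects' CDEs as the denominator is an assumption, since the exact denominator behind the reported percentage is not restated here.

```python
def cde_overlap(emerge_cdes, phenx_cdes):
    """Return (shared CDE IDs, shared fraction of all distinct CDE IDs)."""
    emerge_ids = set(emerge_cdes)
    phenx_ids = set(phenx_cdes)
    shared = emerge_ids & phenx_ids   # CDEs reused by both projects
    union = emerge_ids | phenx_ids    # every CDE used by either project
    fraction = len(shared) / len(union) if union else 0.0
    return shared, fraction

# Placeholder identifiers, not real caDSR public IDs.
emerge_cdes = ["cde-001", "cde-002", "cde-003"]
phenx_cdes = ["cde-003", "cde-004", "cde-005"]

shared, fraction = cde_overlap(emerge_cdes, phenx_cdes)
print(sorted(shared), f"{fraction:.0%} shared")
```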
caDSR Coverage and Curation
In total, PhenX measures from 21 research domains have been registered as 352 CDEs in caDSR. Of these, 31 were existing CDEs that were re-used, and 321 were newly created. The existing CDEs that PhenX measures map to are the most commonly used data elements from Demographics, Anthropometrics, Alcohol and Tobacco Use, and Assays. The only exception is the “Perceived Stress Scale Questionnaire” (public ID: 2199495) in the Psychosocial domain. The large number of newly created CDEs fall in non-cancer research domains, which include other disease areas (e.g., Speech and Hearing; Skin, Bone, Muscle and Joint), environmental factors (Nutrition, Environmental Exposure, Physical Activity and Physical Fitness), and social domains (Social Environment, Psychosocial). This set of 321 newly created CDEs is a significant addition to the caDSR.
In our study, several caDSR CDEs did not match between the eMERGE and PhenX data elements. We see two main reasons for this: (1) mapping to granular, context-specific CDEs in the caDSR, and (2) the presence of duplicate (or semantically similar) CDEs in the caDSR. Regarding the first issue, several eMERGE data elements were mapped to phenotype-specific caDSR CDEs (e.g., Dementia Cognitive Abilities Screening Instrument Count) that were not relevant for PhenX. Similarly, several PhenX data elements were mapped to caDSR CDEs (e.g., Paternal Grandfather’s Birthplace) that were out of eMERGE’s scope. While this leads to a lower degree of overlap between the eMERGE and PhenX data elements, it also illustrates that the domains of the two projects are largely non-overlapping. As more phenotypes are studied in eMERGE, we expect the degree of data element overlap with PhenX to increase significantly. The second issue is more involved and technical. In its current incarnation, the caDSR provides a database and a set of APIs for creating, editing, sharing and using CDEs to facilitate interoperability. However, due to the limitations of the ISO/IEC 11179 model Version 2 used in the existing caDSR implementation, as well as limitations of the API and the caDSR CDE browser, it is difficult for end-users not only to query for the relevant CDEs, but also to identify CDEs that are semantically similar and hence can be re-used. Consequently, CDEs with overlapping semantics often get curated, and users are presented with several similar CDEs for a given search query. For instance, at the time of writing this manuscript, a string search for Gender using the caDSR CDE browser returns 67 different CDEs, and the user is left with the exercise of selecting the most appropriate one, which leads to inconsistent CDE reuse and mapping.
Continuing the above example, the data element Sex in eMERGE was mapped to the caDSR CDE Person Gender (caDSR Public ID=2200604), whereas PhenX mapped it to the caDSR CDE Gender Code (caDSR Public ID=2179640). It is clear, even from this simple example, that significant improvements in CDE curation, software implementation and modeling, as well as education and training, are required to ensure appropriate re-use of CDEs for data interoperability.
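The Sex example suggests a simple consistency check: find data elements that carry the same label in both projects but were mapped to different CDEs. The following is a minimal sketch under stated assumptions. The function name `inconsistent_mappings` and the dictionary layout are hypothetical, and it assumes the PhenX element is also labeled Sex. Only the two public IDs are taken from the example above.

```python
def inconsistent_mappings(project_a, project_b):
    """Return {label: (cde_a, cde_b)} for labels shared by both projects
    (compared case-insensitively) that point to different CDE public IDs."""
    b_by_label = {label.strip().lower(): cde for label, cde in project_b.items()}
    conflicts = {}
    for label, cde_a in project_a.items():
        cde_b = b_by_label.get(label.strip().lower())
        if cde_b is not None and cde_b != cde_a:
            conflicts[label] = (cde_a, cde_b)
    return conflicts

# Public IDs from the example in the text; the dict layout is an assumption.
emerge = {"Sex": 2200604}   # caDSR CDE "Person Gender"
phenx = {"Sex": 2179640}    # caDSR CDE "Gender Code"

for label, (a, b) in inconsistent_mappings(emerge, phenx).items():
    print(f"{label}: eMERGE -> {a}, PhenX -> {b}")
```

A check of this kind only surfaces candidates for review. Deciding whether two CDEs are truly interchangeable still requires curation.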
Limitations and Future Work
While caDSR is a very useful resource for sharing the data elements of individual studies with the research community, it has some limitations, as described above. Adopting a diverse set of metadata standards and terminologies will expose studies to a broader user community, enhance interoperability with a wider range of potential studies, and promote cross-study pooling of data to detect both more subtle and more complex genotype-phenotype associations. Consequently, both eMERGE and PhenX are investigating the use of CHI standards, including LOINC and SNOMED-CT, for future cross-study analysis.
In addition to the collaboration with eMERGE on the phenotypic data extracted from EMRs, PhenX is collaborating with other projects, including dbGaP (http://www.ncbi.nlm.nih.gov/gap), to develop a consistent rule set for mapping PhenX measures to dbGaP study variables. This will enable PhenX measures to be included in dbGaP, thereby facilitating sharing of and access to variables from different studies for cross-study analysis. The outcomes of this study of mapping eMERGE data elements to PhenX measures can serve as a gateway to link mapped eMERGE EMR variables to other widely visible and diverse resources.
Widespread adoption and use of standard measures within clinical research will greatly facilitate cross-study analysis. Increased statistical power from cross-study analysis makes it possible to detect more subtle and more complex gene associations, including gene-gene and gene-environment interactions. This study demonstrates the value of using a standardized metadata resource for exposing studies to a broader community, and it outlines several limitations of existing metadata resources.