The PhenX Toolkit provides researchers with recommended, well-established, low-burden measures suitable for human-subjects research. The database of Genotypes and Phenotypes (dbGaP) is the data repository for a variety of studies funded by the National Institutes of Health (NIH), including genome-wide association studies (GWAS). The dbGaP requires that investigators provide a data dictionary of study variables as part of the data submission process. Thus, dbGaP is a unique resource that can help investigators identify studies that share the same or similar variables. As a proof of concept, variables from 16 studies deposited in dbGaP were mapped to PhenX measures. Soon, investigators will be able to search dbGaP using PhenX variable identifiers and find comparable and related variables in these 16 studies. To enhance effective data exchange, PhenX measures, protocols, and variables were modeled in Logical Observation Identifiers Names and Codes (LOINC). PhenX domains and measures are also represented in the Cancer Data Standards Registry and Repository (caDSR). Associating PhenX measures with existing standards (LOINC and caDSR) and mapping to dbGaP study variables extends the utility of these measures by revealing new opportunities for cross-study analysis.
The influx of genome-wide association studies (GWAS) has led to the identification of many genetic variants associated with disease outcomes. More than 1,000 publications are currently included in the Catalog of Published GWAS [Hindorff et al., 2009]. Despite the vast potential for cross-study comparisons, the lack of standard phenotypic and environmental measurements has limited the ability to combine data from GWAS and other large-scale genomic and epidemiologic studies [Hindorff et al., 2009; Manolio, 2009; Thorisson et al., 2009]. Standard measures are critical for combining data from seemingly disparate studies with similar underlying risk factors, increasing statistical power so that relatively modest or more complex associations can be identified and initial findings from GWAS can be replicated [Burton et al., 2009; Fortier et al., 2010; García-Closas and Lubin, 1999; Khoury et al., 2009]. However, in most longitudinal clinical studies, each investigator develops a set of clinical variables that are not the same across other studies.
In response to a clear need for standard measures of phenotypes and exposures, PhenX (consensus measures for Phenotypes and eXposures) engaged 21 Working Groups (WGs) of experts to identify high-quality, relatively low-burden, well-established measures of phenotypes and exposures. These measures were vetted by the scientific community prior to inclusion in the PhenX Toolkit (https://www.phenxtoolkit.org). The PhenX Toolkit provides researchers with a source of standard measures suitable for a variety of study designs in population-based research. Because the PhenX Toolkit provides a variety of high-quality measures, investigators can use the Toolkit to select measures that expand their study, especially measures that are beyond the primary research focus of the study.
The nomenclature for the PhenX Toolkit was defined by the PhenX Steering Committee and is shown in Table 1. Currently, the PhenX Toolkit includes 295 measures spanning 21 research domains [Hamilton et al., 2011; Hendershot et al., 2011]. A measure usually comprises multiple variables or questions, so most measures correspond to many items in the other data sets described in this article.
Challenges in phenotype harmonization have been widely recognized, and efforts have been made in this emerging research field [Bennett et al., 2011; Fortier et al., 2010]. To help address these problems, all 295 PhenX measures have been mapped to multiple resources, including the database of Genotypes and Phenotypes (dbGaP; http://www.ncbi.nlm.nih.gov/gap/), Logical Observation Identifiers Names and Codes (LOINC; http://loinc.org/), and the Cancer Data Standards Registry and Repository (caDSR) of the cancer Biomedical Informatics Grid (caBIG; https://cabig.nci.nih.gov/). This article describes how PhenX measures were integrated into these standards and demonstrates the utility of this approach.
The dbGaP database, which was created by the National Center for Biotechnology Information (NCBI), is a public repository for individual-level genotype, sequence, and phenotype data and the associations between them [Mailman et al., 2007]. dbGaP currently contains more than 125,000 variables. Many of these variables may be similar enough to PhenX variables that they could be considered comparable or related for cross-study analysis. To help researchers interested in PhenX variables find similar variables in dbGaP, we developed a process for mapping dbGaP study variables to PhenX variables. As a proof of concept, variables from 16 completed studies deposited in dbGaP were mapped to PhenX measures. These results will be fully incorporated in dbGaP and will bring to light additional opportunities for cross-study analysis.
Investigators who submit data to dbGaP will be asked to identify PhenX variables as part of the data submission process. PhenX variables will then be highlighted as such in dbGaP. Because dbGaP was established before PhenX measures were developed, none of the studies currently in dbGaP used PhenX protocols. However, we know that there are many variables in dbGaP that are similar, or nearly identical, to PhenX variables and potentially could be combined with data collected using PhenX protocols as well as with each other. Although it is possible to run full-text searches within the dbGaP database to find data that are similar, experience tells us that full-text searches for variables are likely to return large numbers of false positives. For example, a search on “education” will return more than 10,000 variables.
To make it easier for researchers to find non-PhenX variables that might be compared to or combined with PhenX variables, scientists from PhenX and dbGaP investigated the feasibility of mapping dbGaP variables to PhenX variables. The first attempt began by examining four dbGaP studies so that we could begin to develop the process and refine our ideas about what it means to map one variable to another. Each scientist was given all the variables for the four studies, including the variable description and a link to the variable report page on the dbGaP website. Using this information and all of the information available on the PhenX Toolkit, each scientist generated his or her own set of mappings for the dbGaP variables. Many factors—such as measurement concept, protocol, code category (answer list), and measurement unit—were discussed. Based on these discussions, the team decided on the following two levels of mapping:
It is possible that a dbGaP variable neither corresponds nor is related to a PhenX variable or measure. Such a lack of correspondence or relation could be considered “not found,” but this mapping level is not explicitly shown when looking at the dbGaP variable. Rather, variables that do not have a mapping level simply are not displayed.
A dbGaP variable can be mapped to multiple PhenX variables and/or measures. For example, the dbGaP variable phv00111936 (smok_evr: Smoked more than 100 cigarettes or 5 packs in lifetime) is mapped to four PhenX variables (as comparable with one variable and as related to the other three) that are associated with three different PhenX Measures (see Figure 1).
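The one-to-many mapping described above can be sketched as a small data structure. This is only an illustration, not the dbGaP schema: the class names and the PhenX variable and measure names are placeholders, and only the accession phv00111936 and the split of one comparable and three related mappings across three measures follow the smok_evr example.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class MappingLevel(Enum):
    COMPARABLE = "comparable"
    RELATED = "related"


@dataclass(frozen=True)
class Mapping:
    dbgap_variable: str   # dbGaP variable accession
    phenx_variable: str   # placeholder PhenX variable name (assumption)
    phenx_measure: str    # placeholder PhenX measure name (assumption)
    level: MappingLevel


# Placeholder PhenX names; the counts mirror the smok_evr example in the text.
mappings = [
    Mapping("phv00111936", "PX_VAR_A", "Measure 1", MappingLevel.COMPARABLE),
    Mapping("phv00111936", "PX_VAR_B", "Measure 1", MappingLevel.RELATED),
    Mapping("phv00111936", "PX_VAR_C", "Measure 2", MappingLevel.RELATED),
    Mapping("phv00111936", "PX_VAR_D", "Measure 3", MappingLevel.RELATED),
]


def summarize(all_mappings, dbgap_variable):
    """Group the PhenX variables mapped to one dbGaP variable by mapping level."""
    by_level = defaultdict(list)
    measures = set()
    for m in all_mappings:
        if m.dbgap_variable == dbgap_variable:
            by_level[m.level].append(m.phenx_variable)
            measures.add(m.phenx_measure)
    return dict(by_level), measures


levels, measures = summarize(mappings, "phv00111936")
```

Keeping the level on each mapping, rather than on the variable, is what allows the same dbGaP variable to be comparable to one PhenX variable and only related to others.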
Once mapping criteria had been agreed on, the remaining studies were mapped. Mapping was performed by at least two independent curators. Results were compared, and as before, discrepancies were resolved by consensus after discussion. Any new mapping criteria that were developed during this process were added to the guidelines for future use.
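The comparison of independent curations can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling actually used: each curator's work is assumed to be a dictionary keyed by a (dbGaP variable, PhenX variable) pair, and all identifiers shown are made up.

```python
def compare_curations(curator_a, curator_b):
    """Split two curators' mappings into agreed pairs and discrepancies.

    Each argument maps (dbgap_variable, phenx_variable) -> mapping level.
    A pair missing from one curator's set shows up with level None.
    """
    agreed = {}
    discrepancies = {}
    for key in curator_a.keys() | curator_b.keys():
        level_a = curator_a.get(key)
        level_b = curator_b.get(key)
        if level_a == level_b:
            agreed[key] = level_a
        else:
            discrepancies[key] = (level_a, level_b)
    return agreed, discrepancies


# Hypothetical curations; the identifiers are placeholders, not real accessions.
curator_a = {
    ("phv_demo_1", "PX_A"): "comparable",
    ("phv_demo_2", "PX_B"): "related",
    ("phv_demo_3", "PX_C"): "related",
}
curator_b = {
    ("phv_demo_1", "PX_A"): "comparable",
    ("phv_demo_2", "PX_B"): "comparable",
}

agreed, discrepancies = compare_curations(curator_a, curator_b)
```

Only the entries in `discrepancies` would then need discussion before a consensus mapping is recorded.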
In this report, we show the results of mapping 13 Gene Environment Association Studies (GENEVA) consortium studies and 3 electronic Medical Records and Genomics (eMERGE) network studies to PhenX. The GENEVA consortium (https://www.genevastudy.org) consists of 16 GWAS that aim to accelerate understanding of genetic and environmental contributions to health and disease on a collection of mostly traditional epidemiologic cohorts [Cornelis, 2010]. The eMERGE network (https://www.mc.vanderbilt.edu/victr/dcc/projects/acc/index.php/Main_Page) is a national consortium formed to develop, disseminate, and apply approaches to research that combine DNA biorepositories with electronic medical record (EMR) systems for large-scale, high-throughput genetic research [McCarty, 2011; Kho, 2011]. We used variable descriptions from GENEVA and eMERGE studies released in dbGaP at the time of this mapping. Table 2 lists the dbGaP studies, their dbGaP accession numbers, the total number of variables from each study, and the number of variables that mapped to a PhenX variable or measure. The percentage of variables mapped for a particular study ranges from 23% to 80%, but the effective mapping rate for all studies is somewhat higher than this because all of the studies contain variables that are not phenotype data and therefore cannot be expected to have an analog in PhenX. These types of variables include administrative data such as IDs (e.g., for subjects, subjects' parents, locations of data collection), consent status, or information about instrumentation (e.g., sequencing platforms). Aside from administrative variables, there are some dbGaP variables that do not map to PhenX (Table 2). In general, these variables reflect concepts that are study specific (e.g., “Are your ear lobes creased?” or “What is your U.S. shoe size?”).
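The difference between the reported percentage and the effective mapping rate can be illustrated with a short calculation. The sketch assumes that the effective rate simply drops administrative variables from the denominator, which is one reading of the paragraph above; the counts are invented for illustration and are not taken from Table 2.

```python
def mapping_rates(total, mapped, administrative):
    """Return (raw rate, effective rate) for one study.

    The raw rate divides by all variables; the effective rate (an assumed
    definition) divides only by variables that could have a PhenX analog.
    """
    raw = mapped / total
    effective = mapped / (total - administrative)
    return raw, effective


# Invented counts for a hypothetical study.
raw, effective = mapping_rates(total=200, mapped=100, administrative=40)
```

With these invented counts the raw rate is 50% and the effective rate is 62.5%, which shows how non-phenotype variables pull the reported percentage down.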
Results of mapping the dbGaP studies to PhenX are summarized in the PhenX-dbGaP Variables Cross-Reference Table in Supp. Table S1. For these 16 dbGaP studies, the cross-reference table lists a total of 2,041 mappings, with 604 dbGaP variables mapped to 504 PhenX variables and 52 PhenX measures. The cross-reference table is available at the PhenX Toolkit website (https://www.phenxtoolkit.org). Examples of these mappings are illustrated in Table 3, in which individual PhenX variables are mapped to many variables from multiple studies, highlighting opportunities for cross-study analysis at the investigator's discretion. Note that “Lipid_Total_Cholesterol” (a PhenX variable) maps to “Dyslipidemia” (a condition). Although this mapping may at first be disconcerting, it is actually a good example of how mapping can identify data that are comparable or related, even though the reasons for collecting the data were different. “Lipid_Total_Cholesterol” is a variable associated with the PhenX Lipid Profile measure, and the data collected can be used to derive the condition “Dyslipidemia.” On the dbGaP website, mapping information for a variable is shown on that variable's report page. Figure 1 is a screenshot of the report page for the dbGaP variable phv00111936 (smok_evr). The “Terms Linked to this Variable” section lists the PhenX variables mapped to smok_evr. The left column shows the level of mapping; a full green circle indicates comparable, and a half-filled yellow circle indicates related. The second column lists the name of the PhenX variable or measure that has been mapped to; these names are linked to a search page that displays all of the dbGaP variables that map to that PhenX variable (see Supp. Figure S1). The third column gives a short definition of the mapped PhenX variable, whereas the measure column lists the PhenX measure associated with the variable. The names of the PhenX measures are links to the measure on the PhenX website.
Figure 2 shows the number of dbGaP mappings to PhenX as a function of the PhenX measure. Although dbGaP variables map to more than 100 different PhenX measures, only the 25 PhenX measures with the most mappings are shown here. When looking at this plot, you should keep in mind the following points:
Point 3 explains why the measures that have the most dbGaP mappings to PhenX are those like “Sleep Apnea” and “Migraine” rather than “Gender” or “Current Age.” A measure like “Sleep Apnea” or “Migraine” contains an extensive protocol that collects a large number of discrete variables including age, gender, height, and weight in addition to the more specific data suggested by its name. Therefore these measures will have many dbGaP variables mapped to them from a single study, whereas the measure “Gender” may only have one variable mapped to it from each study.
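The effect described above can be reproduced with a simple tally. The rows below are invented for illustration (one hypothetical study, four variables); the only assumption carried over from the text is that a multi-variable measure such as “Sleep Apnea” also collects age, gender, and height.

```python
from collections import Counter

# (study, dbGaP variable, PhenX measure) rows; all values are hypothetical.
rows = [
    ("study_1", "sex", "Gender"),
    ("study_1", "sex", "Sleep Apnea"),
    ("study_1", "age", "Current Age"),
    ("study_1", "age", "Sleep Apnea"),
    ("study_1", "height", "Sleep Apnea"),
    ("study_1", "snoring", "Sleep Apnea"),
]

# Count mappings per PhenX measure, as plotted in a chart like Figure 2.
mappings_per_measure = Counter(measure for _, _, measure in rows)
top_measure, top_count = mappings_per_measure.most_common(1)[0]
```

In this toy tally a single study contributes four mappings to “Sleep Apnea” but only one each to “Gender” and “Current Age”, even though every participant has a recorded gender and age.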
For the pilot study described, only a handful of dbGaP studies comprising a relatively small number of variables were selected, and it was relatively easy to identify all variables related to a given concept (e.g., diabetes, race, smoking). The manual approach, although somewhat laborious, resulted in thoughtful, consistent mappings. That said, scaling up will present challenges, and using Natural Language Processing (NLP) algorithms to identify similarities and differences among the variables may be helpful in this regard. For example, NLP has been used successfully to identify cataract cases from electronic health records [Peissig et al., 2012]. Perhaps in the future, NLP can be used to augment and extend the described approach.
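As a rough indication of how automated text comparison might pre-screen candidates for curators, the sketch below scores variable descriptions by token overlap (Jaccard similarity). This is a deliberately simple stand-in, not the NLP approach of the cited work and not a method used in this study; the PhenX descriptions, the variable names, and the threshold are all assumptions.

```python
import re


def tokens(text):
    """Lowercase a description and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two descriptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def suggest(dbgap_description, phenx_descriptions, threshold=0.2):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = [
        (jaccard(dbgap_description, description), name)
        for name, description in phenx_descriptions.items()
    ]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Hypothetical PhenX descriptions; only the dbGaP description comes from the text.
phenx_descriptions = {
    "smoking_100_hypothetical": "Smoked at least 100 cigarettes in lifetime",
    "education_hypothetical": "Highest grade of school completed",
}
suggestions = suggest(
    "Smoked more than 100 cigarettes or 5 packs in lifetime",
    phenx_descriptions,
)
```

A score of this kind could only rank candidates; deciding whether a pair is comparable or related would still depend on protocol, answer list, and units, as in the manual process.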
LOINC® (http://loinc.org/) is a vocabulary standard for identifying laboratory tests, clinical measurements and reports, survey instruments, and other kinds of clinical observations. By providing universal identifiers for a wide range of measurements and observations, LOINC enables exchange and aggregation of electronic health data from independent systems for many purposes [Vreeman et al., 2010b; McDonald et al., 2003]. LOINC has been widely adopted in the private and public sectors, both within the United States, and by users in more than 140 countries worldwide. Notably, the Health Information Technology (IT) Standards Committee of the Federal Office of the National Coordinator for Health IT recently adopted LOINC as the coding system for transmitting results of laboratory and other tests, assessment instruments, and many other clinical variables [Health IT Standards Committee, 2011]. LOINC has now incorporated all of the PhenX content, enabling results of PhenX measures from independent systems to be shared using the same exchange, storage, and processing infrastructure that health information systems use for sending a serum glucose test result or a chest X-ray report. Here we describe the process of representing PhenX content in LOINC, advantages to this linkage, and some of the lessons learned.
Each term in LOINC provides a “fully specified” name using an established model that contains six main axes (Supp. Table S2) [McDonald, 2011]. The model produces names that are detailed enough to distinguish among similar clinical observations. As a collection, PhenX contains many kinds of measurements, from laboratory tests to anthropometric measures and validated questionnaires. LOINC has developed a robust model for representing standardized assessment instruments, recognizing that they have psychometric properties that are essential for interpreting meaning [Vreeman et al., 2010a]. Thus, in addition to the structured name, LOINC stores many other attributes about the individual variable, including the exact question text and source, example units of measure (for quantitative variables) and full answer lists (for categorical variables), references, descriptions, and external copyright information when applicable. LOINC also creates terms for named collections of variables (called “panels” in LOINC), and enumerates the child elements contained in that set into an explicit hierarchy.
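The six-axis name can be sketched as a small record type. The axis names (component, property, time, system, scale, method) follow the published LOINC model; the class itself, the choice to omit an empty method from the joined string, and the serum glucose values are illustrative assumptions rather than an official representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoincName:
    component: str    # what is measured or observed
    prop: str         # kind of property, e.g. mass concentration
    time: str         # timing, e.g. point in time
    system: str       # specimen or subject of the observation
    scale: str        # quantitative, ordinal, nominal, ...
    method: str = ""  # optional; many terms leave this axis empty

    def fully_specified(self):
        """Join the axes with colons, skipping an empty method (assumed style)."""
        parts = [self.component, self.prop, self.time, self.system, self.scale]
        if self.method:
            parts.append(self.method)
        return ":".join(parts)


# Illustrative axis values for a serum glucose measurement.
glucose = LoincName("Glucose", "MCnc", "Pt", "Ser/Plas", "Qn")
```

Two observations that differ on any single axis, for example the specimen or the scale, receive different names and therefore different identifiers.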
Through iterative development, the LOINC team incorporated the entire set of PhenX measures into LOINC, either by creating new LOINC terms or by linking the PhenX variables to existing LOINC terms. We extracted content from the PhenX Toolkit for every variable in each measure and domain, starting with a small set of PhenX content that was first represented in LOINC version 2.29 as a proof of concept. Some variables, such as head circumference and gestational age, were already present in LOINC, but the majority of them were not. We modeled variables new to LOINC according to the established naming conventions. From the protocol text, we extracted and stored the key accessory attributes (e.g., units of measure or the allowable answer choices). Many PhenX variables are defined or illustrated by graphics (e.g., line drawings or photographs) to show exactly how a measurement should be taken or answer that particular question. The LOINC team created a mechanism for storing and displaying these graphics in the free desktop mapping program called the Regenstrief LOINC Mapping Assistant (RELMA; http://loinc.org/relma) and the online LOINC search application (http://search.loinc.org). Figure 3 illustrates how the accessory content for a PhenX variable is represented in LOINC, including the structured answer list, exact question text, and a reference image. To capture the hierarchical arrangement of variables into collections, we created LOINC panel terms at the level of each PhenX domain, measure, and protocol. These named panels include all of the corresponding PhenX child elements in a formal hierarchy linked to that panel. Over time, the LOINC team added the remainder of the PhenX content (Supp. Table S3). LOINC has now completed modeling of all PhenX variables from 295 measures in 21 research domains; 138 existing LOINC terms were mapped to PhenX variables, and approximately 4,500 new LOINC terms were added based on the PhenX content.
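The panel hierarchy described above (domain, then measure, then protocol, then variables) can be sketched as a simple tree. The `Panel` class and every name in the example are hypothetical; the sketch shows only the nesting idea, not how LOINC stores panels internally.

```python
from dataclasses import dataclass, field


@dataclass
class Panel:
    name: str
    level: str  # "domain", "measure", or "protocol"
    children: list = field(default_factory=list)  # nested Panels or variable names


def leaf_variables(node):
    """Collect every variable name beneath a panel, in order."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node.children:
        found.extend(leaf_variables(child))
    return found


# Hypothetical content: one domain, one measure, one protocol, two variables.
protocol = Panel("Example protocol", "protocol", ["variable_1", "variable_2"])
measure = Panel("Example measure", "measure", [protocol])
domain = Panel("Example domain", "domain", [measure])
```

Walking the tree from any panel yields exactly the variables that belong to it, which is what makes a panel code usable as a handle for a whole measure or protocol.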
Incorporating the PhenX content into LOINC has many advantages. Adding the PhenX measures to LOINC enables the results to be shared using the same health information technology infrastructure and standards that are now becoming widely adopted. In addition, the LOINC model provides the same uniform computable representation of the PhenX content as the other standard assessments and data sets contained in LOINC, including many mandated by the Centers for Medicare & Medicaid Services (CMS) and provided by other NIH institutes, such as the Patient Reported Outcomes Measurement Information System [Gershon et al., 2010; Riley et al., 2011] (PROMIS; http://www.nihpromis.org/) and Quality of Life Outcomes in Neurological Disorders (Neuro-QOL; http://www.neuroqol.org/). Having such a common representation that promotes sharing will accelerate genomic and other clinical research. Moreover, because of LOINC's broad adoption worldwide, representing the PhenX measures in LOINC will widen the audience for PhenX measures.
The process of integrating the PhenX content into LOINC elucidated several important lessons. Many of the PhenX measures selected instruments and protocols that were initially conceptualized as paper data collection forms. As the LOINC team defined its terms and parsed this content into its data model, the process revealed many of the same challenges that were encountered with coding other widely used survey instruments [Vreeman et al., 2010b]. For example, some protocols did not specify all of the variables needed to collect the data or lacked sufficient detail to precisely define the observation. In other cases, the information model of the protocols differed substantially from the typical information model used to exchange data between clinical care systems with LOINC and messaging standards like Health Level Seven International (HL7). The LOINC team always found solutions to these problems through discussions with the PhenX team. One strategy was to turn a long list of “Check All That Apply (Yes or No)” questions into a single variable with an answer choice list that could be repeated as many times as necessary. For example, a protocol requiring answers of yes or no to a long list of potential diseases could be transformed into an active diseases variable whose answer values could be the diseases present. This approach dramatically reduced the number of LOINC observation codes necessary to cover all of the PhenX variables and was consistent with the prevailing health data exchange and storage conventions. We anticipate that these challenges will diminish as survey instrument developers become acquainted with the formality required for computer representation of instruments in LOINC.
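The “Check All That Apply” strategy can be illustrated with a short transformation. The disease names and the function are hypothetical; the point is only that many yes/no items collapse into one repeatable variable whose values are the items answered “Yes”.

```python
def collapse_checklist(responses):
    """Turn {item: "Yes" or "No"} into the list of items answered "Yes"."""
    return [item for item, answer in responses.items() if answer == "Yes"]


# Hypothetical paper-form responses: one yes/no question per disease.
form_responses = {
    "Asthma": "No",
    "Diabetes": "Yes",
    "Hypertension": "Yes",
    "Glaucoma": "No",
}

active_diseases = collapse_checklist(form_responses)

codes_needed_before = len(form_responses)  # one observation code per question
codes_needed_after = 1                     # one repeatable "active diseases" variable
```

In this toy form, four observation codes are replaced by one, and the saving grows with the length of the checklist.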
LOINC was chosen as the vocabulary standard for several reasons. The goal was to represent PhenX content in a widely adopted vocabulary standard that would enable data aggregation using prevailing conventions (e.g., HL7 messaging). The value of LOINC in this context is that it provides a set of universal identifiers and a uniform model of that instrument across any context. LOINC is well suited for clinical observations, formal surveys and questionnaires, and it is the standard adopted by the Health Information Technology (HIT) Standards Committee for laboratory and nonlaboratory measurements and observations. When this pilot study was initiated, LOINC already contained many similar complete packages of standardized assessments and data sets, including the CMS-required Minimum Data Set (MDS; https://www.cms.gov/MinimumDataSets20/), Outcome and Assessment Information Set (OASIS; https://www.cms.gov/OASIS/), the new Continuity Assessment Record and Evaluation (CARE) instrument, Patient Health Questionnaire (PHQ), PROMIS [Gershon et al., 2010], and Neuro-QOL. Making PhenX content available in the same model and format will facilitate data interoperability and data exchange.
The caDSR is a data-standards repository in caBIG [caBIG Strategic Planning Workspace, 2007; Kakazu et al., 2004]. It is an open-source, open-access information network designed to enable secure data exchange throughout the cancer research community. The caDSR includes a catalog of Common Data Elements (CDEs). Each CDE is a unique pairing of a Data Element Concept which represents the question metadata and a Value Domain which represents the answer metadata. One or more CDEs are either assigned, or created, for every PhenX protocol (it is possible for a PhenX measure to have multiple protocols). There are 353 PhenX protocols mapped with 379 CDEs; 343 of these CDEs were newly created for PhenX.
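The CDE structure described above, a unique pairing of question metadata and answer metadata, can be sketched as follows. The class names, the registry, and the sequential identifiers are assumptions for illustration only and do not reflect the caDSR implementation or its public identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElementConcept:
    name: str  # question metadata


@dataclass(frozen=True)
class ValueDomain:
    name: str
    permissible_values: tuple  # answer metadata


class CdeRegistry:
    """Toy registry that reuses a CDE when the same pairing already exists."""

    def __init__(self):
        self._ids = {}

    def get_or_create(self, concept, value_domain):
        """Return (cde_id, created) for a concept and value domain pairing."""
        key = (concept, value_domain)
        if key in self._ids:
            return self._ids[key], False
        new_id = len(self._ids) + 1  # illustrative sequential identifier
        self._ids[key] = new_id
        return new_id, True


registry = CdeRegistry()
question = DataElementConcept("Assay result")
yes_no = ValueDomain("Finding", ("Positive", "Negative"))
numeric = ValueDomain("Titer", ("number",))

first = registry.get_or_create(question, yes_no)
again = registry.get_or_create(question, yes_no)
other = registry.get_or_create(question, numeric)
```

Because the identifier is attached to the pairing, the same question with a different answer list is a different CDE, while an identical pairing is reused rather than duplicated.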
PhenX has reused existing CDEs when available. The need to create so many new CDEs is not surprising; the CDEs previously available were focused either on general demographic concepts, such as gender, race, and age, or on specific concepts related to cancer, whereas the focus of PhenX is much broader. PhenX represents 21 research domains, most of which are outside of the traditional cancer research domain; such domains include the Psychiatric, Psychosocial, and Social Environments domains. For example, two new CDEs were created for the protocols of the measure Assay for Chlamydia/Gonorrhea: Immunology Chlamydia trachomatis Assay Laboratory Finding Result (3151324) and Immunology Gonorrhea Assay Laboratory Finding Result (3153202). At the request of the caDSR administrator, the PhenX CDEs' workflow status was changed from “draft new” to “released” so that they would be available for reuse; they have already been used by other studies. In the caDSR, PhenX protocols are organized by research domain and can be located using the CDE Browser (https://cdebrowser.nci.nih.gov/CDEBrowser/), as Figure 4 shows.
Table 4 shows an excerpt from the cross-reference table that includes the LOINC codes and caDSR CDEs that are associated with each PhenX protocol. The comprehensive cross-reference table, provided in Supp. Table S4, is available on the Toolkit website and will serve as a valuable resource to investigators as well as bioinformaticists.
Recognition and use of the PhenX Toolkit continue to increase as investigators begin to realize the importance of collecting data with standard instruments or tools. As of the end of January 2012, there were 259,077 visitors to the Toolkit website. Most Toolkit visitors are from the United States, the United Kingdom, and Australia, but there have also been visitors from 143 other countries. There are currently 637 registered users. Registered users have access to additional features, such as the “My Toolkit” for collecting and saving selected measures. Joining the Toolkit Network makes it possible for users of the network to contact each other. The idea is that the network can be used to facilitate collaboration at the study design phase as well as retrospective cross-study analyses. Early adopters of PhenX measures include the PhenX RISING (Real world, Implementation, SharING) project (https://www.phenx.org/Default.aspx?tabid=748), the National Eye Institute (NEI) Glaucoma Human Genetics Collaboration (NEIGHBOR) Consortium, and the Gulf Long-Term Follow-up Study (GuLF STUDY; http://nihgulfstudy.org/). Additional information about early adopters is available on the Toolkit website.
dbGaP and PhenX will continue to collaborate and extend the relationship between the two resources. As noted previously, when new studies submit their data to dbGaP and identify their variables as PhenX, this information will be stored in the database. Other areas of development in dbGaP include adding the ability to filter search results to return variables submitted as, or mapped to, PhenX; mapping additional retrospective studies; and adding other languages/ontologies beneath the “Terms Linked to this Variable” heading on the variable report page (e.g., ICD-9 codes or MeSH terms). These developments will expand the ability of investigators to identify variables of interest across dbGaP. This information can be used prospectively, at the study design stage, or retrospectively, to identify opportunities for cross-study analysis with or without the need for harmonization.
PhenX intends to keep its mappings to LOINC and caBIG CDEs up to date when the Toolkit is expanded or updated, either by linking to concepts already extant in those resources, or by creating new concepts within them (as described earlier). By maintaining collaborations and close connections to these resources (and potentially adding resources), PhenX will be able to expand and update the cross-reference table accordingly. The results presented here extend the utility of PhenX measures and add value to resources like dbGaP, LOINC, and caDSR.
The goal of associating PhenX measures with existing standards is to make it easier for investigators to share data and to compare and combine study results. Integrating PhenX measures into existing standards (LOINC, CDE) and mapping PhenX variables to dbGaP study variables extend the utility of PhenX measures and reveal new opportunities for cross-study analysis. The primary limitation of data sharing is that study-specific measures are needed to support scientific inquiry. That is, deciding what measures are needed to effectively address a specific research question is inherent to study design. Striking a balance between the inclusion of study-specific measures and the inclusion of standard measures is necessary; both types of measures will affect the overall scientific impact of the study results. As individual investigators begin to recognize and realize the potential of data sharing and cross-study analysis, the work presented here will facilitate that process. Linking dbGaP, LOINC, caDSR, and PhenX resources will help promote data sharing and thus will have a significant positive impact on biomedical research.
PhenX is supported by: National Human Genome Research Institute (NHGRI) Award No. U01 HG004597-01 and National Center for Research Resources (NCRR) Award No. 3UL1RR025761-02S6.
dbGaP research was supported in part by the Intramural Research Program of the NIH, National Library of Medicine.
LOINC research was supported in part by contract HHSN2762008000006C from the National Library of Medicine and NCRR Award No. 3UL1RR025761-02S6.
Grant Sponsor: NHGRI U01 HG004597-01, NLM HHSN2762008000006C, and NCRR 3UL1RR025761-02S6.
Supporting Information for this preprint is available from the Human Mutation editorial office upon request (humu/at/wiley.com)
Conflicts of interest - none.