The database of genotypes and phenotypes (dbGaP) developed by the National Center for Biotechnology Information (NCBI) is a resource that contains information on various genome-wide association studies (GWAS) and is currently available via NCBI's dbGaP Entrez interface. The database is an important resource, providing GWAS data that can be used for new exploratory research or cross-study validation by authorized users. However, finding studies relevant to a particular phenotype of interest is challenging, as phenotype information is presented in a non-standardized way. To address this issue, we developed PhenDisco (phenotype discoverer), a new information retrieval system for dbGaP. PhenDisco consists of two main components: (1) text processing tools that standardize phenotype variables and study metadata, and (2) information retrieval tools that support queries from users and return ranked results. In a preliminary comparison involving 18 search scenarios, PhenDisco showed promising performance for both unranked and ranked search comparisons with dbGaP's search engine Entrez. The system can be accessed at http://pfindr.net.
The database of genotypes and phenotypes (dbGaP) is an important repository for data generated through various genome-wide association studies (GWAS), which can be used for new explorations or cross-study validation.1–3 In addition to genomic data, dbGaP requires investigators to submit phenotype data. As of 7 July 2013, dbGaP contained 422 studies, including more than 130 000 phenotype variables. However, finding relevant studies accurately and completely is challenging, because phenotypic information related to studies is often stored in a non-standardized way. For particular queries, the dbGaP Entrez system returns several studies that are not always relevant, and it does not make clear how particular records are selected or why they appear in a particular order. Consequently, users have to review each study description carefully to determine relevance, which becomes laborious and time-consuming when many studies are retrieved.
To address this issue, we developed a new web-based information retrieval system called PhenDisco (phenotype discoverer) based on the user requirements obtained by interviewing dbGaP users. The project is funded through the program entitled phenotype finder in data resources (pFINDR) from the National Heart, Lung, and Blood Institute. The goal of this program is to facilitate the search of phenotypes in dbGaP’s GWAS. Our approach uses natural language processing (NLP) as well as information retrieval techniques in order to improve phenotype search in dbGaP.
There are several related works that aim to address issues associated with the lack of standardization in phenotype variables.3–9 PhenX defined 287 frequently used phenotypes (called measures) in 21 research domains, and manually cross-mapped these measures to phenotype variables in 16 dbGaP studies.3 4 The goal is to use these measures prospectively, so new studies are described in a standardized way. Another project, eMERGE, used a semi-automated process: users manually search for phenotype variables for specific domains (eg, Alzheimer’s disease), and these variables are automatically mapped to standardized vocabularies through a tool called eleMAP. eleMAP outputs are then further curated by users before results can be interpreted.8 9 Our group was involved in similar work that annotated phenotypes in the gene expression omnibus (GEO),10 a public gene expression data repository. Human annotators reviewed the papers published using the data available in GEO, then manually identified the phenotype variables and mapped them to the National Cancer Institute thesaurus.5–7 Although the results of such manual or semi-automated mapping processes tend to be reliable and accurate for small data, the technique is not scalable. Therefore, we developed an algorithmic approach to process the large amount of phenotype variables in dbGaP for standardization.
PhenDisco consists of two main components: (1) text processing tools that standardize both phenotype variables and study metadata, and (2) information retrieval tools that support queries from users and return ranked results. Below we describe each component.
We collected information about the GWAS and their phenotype variables from two publicly available dbGaP sources: (1) dbGaP web pages (http://www.ncbi.nlm.nih.gov/gap), and (2) the dbGaP FTP site (ftp://ftp.ncbi.nlm.nih.gov/dbgap). The dbGaP web pages contain study-level information such as study ID, title, description, and platforms, whereas the dbGaP FTP site contains phenotypic information such as phenotype ID, phenotype description, and associated statistics. We developed a crawler to download both types of data. We analyzed 422 studies, which contained more than 130 000 variables.
Given that the number of new studies being added every month is small, we focused on automating the standardization of variables, while the abstraction of study data itself was only partially automated. Portions of the study-level metadata are well structured and amenable to full automatic parsing. Study ID, title, number of participants, and study design are automatically extractable study data. We extracted, through manual review, study data such as topic diseases, consent type, institutional review board status, and study locations.11 12 To standardize the study information, the topic diseases were mapped to the unified medical language system (UMLS)'s concept unique identifiers.13 We adopted UMLS as a controlled vocabulary in this project based on its comprehensive domain coverage and widespread use in biomedical NLP systems.14 15 In addition, we mapped study locations to ISO 3166-2 country subdivision code,16 for example, US-AZ (USA—Arizona).
The task of phenotype variable standardization has been the most interesting, yet most challenging, part of developing PhenDisco. The lack of a uniform naming convention meant that, for a study containing thousands of phenotype variables, idiosyncratic choices introduced unnecessary variation and redundancy across studies. For example, the same variable ‘body weight’ can be represented as ‘weight’ (variable id: phv00173256.v1.p1), ‘WGHT’ (variable id: phv00169068.v2.p1), and ‘FB9’ (variable id: phsv00001189.v1.p7). Therefore, variable descriptions, which provide more information than variable names, are more useful for the task of standardization. The lack of standardization is a well-known problem in clinical informatics; standards and information models, such as the clinical elements model (CEM), were designed to address this issue. The CEM worked reasonably well for clinical variables in electronic medical records, but did not address clinical research variables in dbGaP.17 While standards such as the observational medical outcome partnership (OMOP) model18 19 cover many of these variables, given our experience mapping variables into OMOP for a very limited set of conditions,20 we realized that the variables in dbGaP studies were described in less detail and determined that it would be more cost-effective and scalable to map them into a simpler model.21 22 We briefly describe our approach as follows.
We developed an information model including four major information classes: ‘theme’ (ie, age, gender, race, ethnicity), ‘subject’, ‘event’, and ‘linkage’ of information.21 23 24 For example, the phenotype variable ‘age Mom diagnosed—asthma’ has theme age, subject ‘mother’, event ‘asthma’, and linkage of information ‘diagnosed’. We wrote a simple NLP tool in Python called DIVER to identify and map phenotype variables into this model. The evaluation on 3565 variables from pulmonary studies in dbGaP showed that DIVER achieved 98% recall and 94% precision in identifying variables related to demographic concepts and 79% correct mapping into the information model.23
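The four information classes can be pictured as a small record type. The following is a minimal sketch, not the DIVER tool itself: the lexicons (`THEMES`, `SUBJECTS`, `LINKAGES`) and the function `map_variable` are hypothetical names introduced here for illustration, and the matching rule (first lexicon hit wins, leftover words become the event) is an assumption rather than the published method.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemographicVariable:
    """One phenotype variable mapped into the four information classes."""
    theme: Optional[str]    # age, gender, race, or ethnicity
    subject: Optional[str]  # who the information is about
    event: Optional[str]    # what happened, e.g. a disease
    linkage: Optional[str]  # how theme and event are connected


# Hypothetical toy lexicons; a real tool would need far larger ones.
THEMES = {"age": "age", "gender": "gender", "sex": "gender",
          "race": "race", "ethnicity": "ethnicity"}
SUBJECTS = {"mom": "mother", "mother": "mother",
            "dad": "father", "father": "father"}
LINKAGES = {"diagnosed", "onset", "started"}


def map_variable(description: str) -> DemographicVariable:
    """Assign each word to the first class whose lexicon contains it."""
    tokens = re.findall(r"[a-z]+", description.lower())
    theme = subject = linkage = None
    leftover = []
    for tok in tokens:
        if theme is None and tok in THEMES:
            theme = THEMES[tok]
        elif subject is None and tok in SUBJECTS:
            subject = SUBJECTS[tok]
        elif linkage is None and tok in LINKAGES:
            linkage = tok
        else:
            leftover.append(tok)  # unmatched words form the event
    event = " ".join(leftover) or None
    return DemographicVariable(theme, subject, event, linkage)


# The example from the text, with the dash written as a hyphen.
print(map_variable("age Mom diagnosed - asthma"))
```

Under these toy lexicons, the example variable from the text is split into theme 'age', subject 'mother', event 'asthma', and linkage 'diagnosed'.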
For variables that were not related to demographic concepts, we identified two categories of variables: ‘topic’ and ‘subject of information’. The ‘topic’ is the main theme of phenotype variables while the ‘subject of information’ is the individual experiencing the variable. For example, the phenotype variable ‘father diagnosed with lung cancer’ has subject of information ‘father’ and topic ‘lung cancer’. We first tagged ‘topic’ and ‘subject of information’ terms from each variable description, and then mapped those terms to the UMLS metathesaurus.13 This process was automatically implemented by our customized NLP tool. Further standardization of these variables based on information modeling and NLP is in progress.21
The information retrieval tool consists of two parts: a query parser and a ranking algorithm.
We utilized pyparsing25—a toolkit written in Python—for parsing queries in PhenDisco. The role of a query parser is to take an input query and break it into its respective terms and operators. Search terms can be a single word or whole phrases, connected by operators (ie, AND, OR, NOT). To improve search performance, we expanded each input query to include synonyms by integrating MetaMap26 into the query parser. This concept-based search is the default search mode of PhenDisco (see figure 1).
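The two steps described above (splitting a query into terms and operators, then expanding each term with synonyms) can be sketched with the Python standard library alone. This is an illustrative stand-in, not the PhenDisco parser: it does not use pyparsing or MetaMap, the `SYNONYMS` table is a hypothetical substitute for concept lookup, and grouping consecutive bare words into one phrase is an assumption about how phrases are recognized.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

# Hypothetical synonym table standing in for a MetaMap concept lookup.
SYNONYMS = {
    "myocardial infarction": ["heart attack", "MI"],
}

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


def tokenize(query):
    """Split a query into ("TERM", text) and ("OP", name) tokens."""
    tokens = []
    words = []  # consecutive bare words are grouped into one term

    def flush():
        if words:
            tokens.append(("TERM", " ".join(words)))
            words.clear()

    for phrase, word in TOKEN_RE.findall(query):
        if phrase:
            flush()
            tokens.append(("TERM", phrase))
        elif word in OPERATORS:
            flush()
            tokens.append(("OP", word))
        else:
            words.append(word)
    flush()
    return tokens


def expand(tokens):
    """Replace each term with a list holding the term and its synonyms."""
    expanded = []
    for kind, value in tokens:
        if kind == "TERM":
            variants = [value] + SYNONYMS.get(value.lower(), [])
            expanded.append(("TERM", variants))
        else:
            expanded.append((kind, value))
    return expanded


query = '"myocardial infarction" AND African American NOT asthma'
for token in expand(tokenize(query)):
    print(token)
```

Each expanded term would then be searched as a disjunction of its variants, which is what lets a query for 'myocardial infarction' also match 'heart attack' and 'MI'.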
We used the BM25F ranking algorithm,27 28 as it is one of the most popular ranking algorithms for structured documents. BM25F is a modified tf-idf (term frequency-inverse document frequency) algorithm29 that has been shown to enhance performance when dealing with documents composed of several fields, such as title, headline, and main text.30 31 We treated each study as a document whose fields are those identified in the study abstraction process, such as title, study description, and topic disease, along with the standardized phenotypes. In this first version of PhenDisco, we considered terms from different fields to be equally important; we plan to analyze user searches and rankings to assign appropriate weights for these terms in the next version of the software. We utilized Whoosh,32 a search library, to implement the BM25F algorithm. The system components are depicted in figure 2. The system runs on 64-bit Ubuntu Linux with 32 GB RAM, using MySQL V.14.14 and an Apache V.2.2.20 web server, and is available at http://pfindr.net.
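To make the field-based scoring concrete, the following is a minimal pure-Python sketch of BM25F with equal field weights. It is not the Whoosh implementation used in PhenDisco: the function name `bm25f_rank`, the parameter defaults `k1=1.2` and `b=0.75`, the non-negative idf variant, and the toy study records are all assumptions made for illustration, and only single-word query terms are handled.

```python
import math


def bm25f_rank(docs, query_terms, weights=None, k1=1.2, b=0.75):
    """Rank documents (dicts of field -> text) against single-word terms."""
    fields = sorted({f for d in docs.values() for f in d})
    weights = weights or {f: 1.0 for f in fields}  # equal field weights
    tokens = {
        doc_id: {f: d.get(f, "").lower().split() for f in fields}
        for doc_id, d in docs.items()
    }
    n_docs = len(docs)
    # Average field length, used for per-field length normalization.
    avg_len = {
        f: (sum(len(tokens[d][f]) for d in tokens) / n_docs) or 1.0
        for f in fields
    }

    scores = {}
    for term in query_terms:
        term = term.lower()
        df = sum(1 for d in tokens
                 if any(term in tokens[d][f] for f in fields))
        if df == 0:
            continue
        # Non-negative idf variant (an assumption; other variants exist).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for d in tokens:
            weighted_tf = 0.0
            for f in fields:
                tf = tokens[d][f].count(term)
                if tf:
                    norm = 1 - b + b * len(tokens[d][f]) / avg_len[f]
                    weighted_tf += weights[f] * tf / norm
            if weighted_tf:
                # Saturate the combined pseudo-frequency once, across fields.
                score = idf * weighted_tf / (k1 + weighted_tf)
                scores[d] = scores.get(d, 0.0) + score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy studies, not real dbGaP records.
studies = {
    "study_a": {"title": "Asthma genetics study",
                "description": "Cohort of children with asthma"},
    "study_b": {"title": "Heart study",
                "description": "Exclusion criteria include asthma"},
    "study_c": {"title": "Diabetes cohort",
                "description": "Type 2 diabetes in adults"},
}
for study_id, score in bm25f_rank(studies, ["asthma"]):
    print(study_id, round(score, 3))
```

Because term frequencies are combined across fields before saturation, a study that mentions the term in both title and description ranks above one that mentions it once; passing a `weights` dictionary is where unequal field weights would enter in a later version.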
Currently, PhenDisco supports basic keyword searches and offers the following features that are not supported in dbGaP Entrez:
A domain expert developed 18 search scenarios related to particular cardiopulmonary conditions. Search scenarios could include disease names such as ‘asthma’ or ‘myocardial infarction’ in combination with demographics such as ‘African American’ and/or a clinical attribute such as ‘FVC’ (forced vital capacity). The queries used for evaluation are listed in table 1. Use cases were determined based on presumed clinical relevance, clinical interest, and potential future research impact. For example, in regard to use cases 1–9, ‘asthma’ was chosen because of its widespread prevalence.34
The domain expert then manually reviewed all dbGaP studies and created the gold standard for each search scenario according to the following steps:
We conducted a preliminary evaluation of the system using standard information retrieval measurements: precision, recall and F-measure for unranked studies.35–37 For relevancy ranking, we used two measures: mean rank precision (MRP) and mean average precision (MAP). They are widely used in information retrieval evaluation for both general and biomedical texts.38–41 MRP is the mean value of the precisions computed over all queries at a certain cut-off rank. MAP is the mean value of the average precisions for each rank computed for all queries. Average precision is calculated as follows:
Here n is the number of returned documents; precision(i) is the precision at rank i, and rel(i) is an indicator function at rank i: it equals 1 if the corresponding study is relevant, and 0 otherwise. In our evaluation we chose the cut-off rank to be 5, which is a frequently selected cut-off point.30 38–40
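The measures above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the evaluation script used in the study: the function names are hypothetical, average precision is normalized by the number of relevant studies in the gold standard (a common convention, assumed here), and precision at the cut-off rank divides by the cut-off even when fewer results are returned.

```python
def precision_recall_f(retrieved, relevant):
    """Unranked precision, recall, and F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def precision_at(ranked, relevant, k=5):
    """Precision over the top k ranked results (cut-off rank k)."""
    relevant = set(relevant)
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Sum of precision(i) * rel(i) over ranks, divided by the number
    of relevant studies (normalization assumed, see lead-in)."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:      # rel(i) = 1
            hits += 1
            total += hits / i    # precision(i)
    return total / len(relevant) if relevant else 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical toy run: two queries with made-up study labels.
runs = [
    (["a", "b", "c", "d"], {"a", "c", "e"}),
    (["x", "y"], {"y"}),
]
print("MRP@5:", mean_over_queries(precision_at, runs))
print("MAP:", mean_over_queries(average_precision, runs))
```

MRP is then the mean of `precision_at` over all queries at the chosen cut-off, and MAP is the mean of `average_precision` over all queries.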
Our evaluation of PhenDisco and dbGaP Entrez was conducted on 10 January 2013. The results are shown in table 2 (see more details in supplementary appendix 2, available online only). For the limited number of queries that were evaluated, PhenDisco had substantially better performance than dbGaP Entrez, with an F-measure of 0.4552 versus 0.1321 for the unranked evaluation. When ranking was considered for the top five returns, PhenDisco also showed better performance than dbGaP Entrez with the MRP of 0.40 versus 0.06, and MAP of 0.2971 versus 0.0756.
A preliminary evaluation of usability from three real dbGaP users indicated that PhenDisco fully satisfied the usability requirements they put forward (see more details in supplementary appendix 3, available online only).
PhenDisco achieved higher recall and precision than dbGaP Entrez in both unranked and ranked results in this pilot evaluation. Through error analysis, we found that dbGaP Entrez's low precision was mainly due to its acceptance of search terms that appear anywhere in the study text, including less relevant contexts such as exclusion criteria or titles of papers referenced in the study description. On the other hand, the main reason for the low recall of dbGaP Entrez is the lack of standardization of phenotype information. In other words, dbGaP Entrez only supported string-based search, so search terms such as ‘myocardial infarction’ were not expanded into synonyms or acronyms such as ‘heart attack’ and ‘MI’. The fact that dbGaP Entrez returns unranked results accounts for that system’s low performance in the relevance ranking evaluation.
Precision in PhenDisco was higher than in dbGaP Entrez, but was still lower than expected. This may have resulted from the utilization of too stringent a criterion to consider a particular study as being ‘relevant’ for the search. The domain expert was focused on the primary goals of the studies for this formative evaluation, and not on the availability of the phenotype in general (eg, if ‘asthma’ was not a main subject for a study, then the domain expert considered the study not to be relevant, although the study might have contained individuals with that phenotype and hence it would not be necessarily a false positive). In the comparison between Entrez and PhenDisco, however, using a stringent criterion affected both systems equally. In future work we will investigate the appropriateness of using a less stringent criterion to categorize studies into relevant or not relevant for a particular search. We believe that the best way to categorize may be to obtain direct feedback from users. For example, by unselecting studies that appear in the output, users are indicating that they are irrelevant for their searches. Once we collect data from a large number of users, we will be able to enhance our system and provide more accurate precision and recall estimates.
PhenDisco may be a good alternative to dbGaP Entrez for scientists who need to identify studies that contain the phenotypes they are interested in. Some advantages of PhenDisco over dbGaP Entrez are: (1) PhenDisco integrates NLP tools to enhance query processing and phenotype variable mapping; (2) PhenDisco augments background knowledge from domain experts by adding meta-data for the studies; and (3) PhenDisco's results are ranked in descending order of relevance. The main disadvantage of PhenDisco is that, unlike dbGaP Entrez, which relies on keyword search in any portion of a study document, PhenDisco's search is performed on study and variable descriptions only, based on meta-data that are produced by a process that is not fully automated. We use a curator to verify a large portion of the results of an automated mapping process and to fix annotations as needed. Given our simple information model, it takes less than 30 min for a curator to validate the majority of the meta-data and this is why we were able to annotate all studies in dbGaP with the help of part-time curators. As the number of new studies is relatively small when compared to over 400 that underwent this process, the semi-automated process is scalable and is not a bottleneck. We plan to improve further the information model and mapping algorithm and use the same process to annotate phenotypes in GEO and other public data resources.
In the future, we plan to add more features to the current system and keep our users updated by prominently displaying the changes in the home page of PhenDisco's web site. These features include: (1) improving the search performance, especially by integrating search queries with ontology expansions for concepts’ children; (2) improving PhenDisco's advanced search, by incorporating other types of study level meta-data; (3) providing efficient ways of identifying and browsing similar phenotype variables collected across different studies using clustering techniques. We also plan to apply more sophisticated NLP techniques to improve precision of the system to account for detection of negated concepts and temporal relationships, and promote broader dissemination of the tool and meta-data through the iDASH National Center for Biomedical Computing.42
Correction notice: This article has been corrected since it was published Online First. The last author's name was previously incorrect and has now been corrected.
Acknowledgements: The authors would like to thank Wendy Chapman, Melissa Tharp, Jihoon Kim, and the dbGaP helpdesk for their valuable help and feedback in the early phases of this project. They thank Karen Truong, Myoung Lah, Vinay Venkatesh, Rafael Talavera, and other internship students for their contributions to the system. The authors also thank NIH officers and their scientific advisory board for helpful feedback.
Contributors: SD was the main software developer, creating the framework and backend pipeline of the system. He wrote the manuscript with the help of others. H-EK was the main investigator for this work and led phenotype standardization and user requirement analysis. She also contributed to system evaluation. K-WL contributed to phenotype standardization, system evaluation, and user requirement analysis. MC contributed to the study abstraction work and also participated in phenotype standardization development. AG contributed to user interface design and development. SFF contributed to study abstraction, system evaluation, and user requirement analysis. AH mainly contributed to phenotype standardization and user interface development. MKR contributed to study abstraction, ranking algorithm development and system evaluation. XJ contributed to the development of the ranking algorithm. NA contributed to system evaluation and phenotype standardization. HX contributed to phenotype standardization. RW contributed to phenotype standardization, system evaluation and user requirement analysis. SF contributed to phenotype standardization. JZ contributed to system evaluation. LO-M provided oversight for this work and substantial input to this manuscript.
Funding: This work was supported in part by grants UH2HL108785, U54HL108460, and T15LM011271 from the NIH.
Competing interests: None.
Provenance and peer review: Not commissioned; externally peer reviewed.