The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) provides information about interactions between environmental chemicals and gene products and their relationships to diseases. Chemical–gene, chemical–disease and gene–disease interactions manually curated from the literature are integrated to generate expanded networks and predict many novel associations between different data types. CTD now contains over 15 million toxicogenomic relationships. To navigate this sea of data, we added several new features, including DiseaseComps (which finds comparable diseases that share toxicogenomic profiles), statistical scoring for inferred gene–disease and pathway–chemical relationships, filtering options for several tools to refine user analysis and our new Gene Set Enricher (which provides biological annotations that are enriched for gene sets). To improve data visualization, we added a Cytoscape Web view to our ChemComps feature, included color-coded interactions and created a ‘slim list’ for our MEDIC disease vocabulary (allowing diseases to be grouped for meta-analysis, visualization and better data management). CTD continues to promote interoperability with external databases by providing content and cross-links to their sites. Together, this wealth of expanded chemical–gene–disease data, combined with novel ways to analyze and view content, continues to help users generate testable hypotheses about the molecular mechanisms of environmental diseases.
Exposure to environmental chemicals may influence human health (1,2). The molecular mechanisms of action between chemicals and gene products, however, are not well understood. Toward that end, the Comparative Toxicogenomics Database (CTD; http://ctdbase.org) is a public resource that provides information about the interaction of environmental chemicals with gene products and their effect on human disease (3–6). This information is first garnered from the scientific literature by professional biocurators who manually curate a triad of core interactions including chemical–gene, chemical–disease and gene–disease relationships (7). These core data are then internally integrated to generate inferred chemical–gene–disease networks. Additionally, the core data are integrated with external data sets such as Gene Ontology (GO) and pathway annotations (from KEGG and Reactome) to establish novel inferences. A unique and powerful feature of CTD is the inferred relationships generated by data integration, following the Swanson ABC model of knowledge transfer (8): if chemical A interacts with gene B and independently gene B is directly associated with disease C, then chemical A has an inferred relationship to disease C (inferred via gene B). This knowledge transfer can be expanded to include any type of information directly annotated to chemicals, genes or diseases; thus, if GO term A is annotated to gene B, and independently gene B directly interacts with chemical C, then GO term A has an inferred relationship to chemical C (inferred via gene B). Such inferred connections can be statistically scored to help indicate the significance of the association (B. L. King et al., submitted for publication), provide novel insights that expand CTD content (4) and allow users to analyze toxicogenomic information from different perspectives. These inferences make CTD more informative than the sum of its individual curated parts (7).
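The ABC-style inference described above can be illustrated with a minimal sketch. It assumes curated relationships are held as simple pairs; the identifiers (`chemical_A`, `gene_B`, and so on) and the function name `infer_chemical_disease` are hypothetical placeholders, not CTD records or CTD code, and no statistical scoring is attempted.

```python
from collections import defaultdict

# Toy curated pairs (hypothetical placeholders, not real CTD data).
chemical_gene = {
    ("chemical_A", "gene_B"),
    ("chemical_A", "gene_D"),
    ("chemical_X", "gene_D"),
}
gene_disease = {
    ("gene_B", "disease_C"),
    ("gene_D", "disease_C"),
    ("gene_D", "disease_E"),
}


def infer_chemical_disease(chemical_gene, gene_disease):
    """Return {(chemical, disease): set of genes that bridge the pair}."""
    # Index diseases by gene so each chemical-gene pair is a direct lookup.
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    # A chemical is linked to every disease reachable through a shared gene.
    inferred = defaultdict(set)
    for chemical, gene in chemical_gene:
        for disease in diseases_by_gene.get(gene, ()):
            inferred[(chemical, disease)].add(gene)
    return dict(inferred)


inferred = infer_chemical_disease(chemical_gene, gene_disease)
for (chemical, disease), genes in sorted(inferred.items()):
    print(f"{chemical} -> {disease} (inferred via {', '.join(sorted(genes))})")
```

Keeping the set of bridging genes for each inferred pair, rather than only the pair itself, preserves the evidence behind the inference. The same join applies to any annotation attached to genes, for example GO term to chemical via a shared gene.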
To increase the efficiency and productivity of manual curation, we developed and implemented several procedures, including the use of a streamlined curation paradigm (7); development of a sophisticated, yet easy-to-use, web-based annotation tool for remote biocurators (7) and the creation and adoption of practical controlled vocabularies (9). Selecting articles for manual curation at CTD is typically performed via a chemical-centric approach, wherein PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) is queried for publications that describe a chemical-of-interest. To complement this process, we recently introduced a new journal-centric approach for triaging literature to help ensure data currency in CTD (A. P. Davis et al., submitted for publication). We also continue to refine text-mining processes to rank and prioritize articles for curation (10,11). Finally, a collaborative project with Pfizer, Inc. (see below) provided an additional corpus of over 80 000 toxicology papers.
Here, we provide an update to CTD, describing its increased data content and several new analytical and visualization tools and enhancements since our 2011 report (4). These updates further expand the utility of CTD for environmental health research.
In September 2010, CTD initiated a collaborative project with Pfizer, Inc. to curate a corpus of >80 000 Pfizer-selected toxicology papers triaged for therapeutic drug interactions with four diseases-of-interest (cardiovascular, renal, neurological and hepatic disorders). In 10 months, five CTD biocurators manually reviewed the entire corpus and found that 53 951 of the papers contained curatable data for CTD. Curated data from this project are now fully integrated with core CTD and freely available to all users. The curation of the Pfizer corpus, along with CTD’s regular literature selection process, has dramatically increased the database content. In July 2012, CTD contained data from 94 513 articles, from which 799 204 interactions were manually curated (599 182 chemical–gene, 176 627 chemical–disease and 23 395 gene–disease interactions) for 11 755 unique chemicals, 27 950 unique genes and 5987 unique diseases (Table 1). Internal integration of these data generates >10.1 million inferred gene–disease relationships and 913 622 inferred chemical–disease relationships. Integration with external data sets from GO (12) and pathway annotations from KEGG (13) and Reactome (14) provides the basis for additional inferred relationships. In total, 15.6 million toxicogenomic relationships are provided for analysis, representing a 3.6-fold increase in content since our last report in 2011 (4) and a 10.6-fold increase since our original report in 2009 (5). To make the most of this updated content, new users of CTD should consult our ‘Help’ menu (http://ctdbase.org/help) and ‘FAQ’ section (http://ctdbase.org/help/faq/) for more information and step-by-step instructions about performing simple and advanced queries in CTD.
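As a quick consistency check on the figures quoted above, the following sketch sums the three curated interaction counts and back-calculates the content sizes implied by the stated fold increases. The back-calculated values are rough estimates derived from rounded numbers (15.6 million, 3.6-fold, 10.6-fold); they are not totals reported in the text.

```python
# Curated interaction counts as quoted for July 2012.
chemical_gene = 599_182
chemical_disease = 176_627
gene_disease = 23_395

# The three categories should add up to the stated curated total.
curated_total = chemical_gene + chemical_disease + gene_disease
print(f"Curated interactions: {curated_total:,}")

# Approximate earlier content implied by the quoted fold increases.
# These are back-calculations from rounded figures, not reported values.
total_2012 = 15.6e6
implied_2011 = total_2012 / 3.6
implied_2009 = total_2012 / 10.6
print(f"Implied 2011 content: ~{implied_2011 / 1e6:.1f} million")
print(f"Implied 2009 content: ~{implied_2009 / 1e6:.1f} million")
```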
CTD continues to expand its connectivity with external databases. We now include links on CTD Chemical pages to ChEBI (15), a dictionary of molecular entities focused on small chemical compounds; to PubChem (16), a repository of chemical compounds and their associated biological activities; and to TOXLINE (17), a bibliographic database of toxicology articles. CTD Gene pages now link to WikiGenes, an author-driven wiki system of biological information (18), and NCBI Gene (19) provides links back to CTD Gene pages. In total, CTD links out to 25 external databases from our Chemical, Gene, Disease, Organism, GO, Pathway and Reference pages (Table 2). As a federally funded public database, CTD content is often linked to, repackaged or integrated with other database products. Currently, we are aware of 37 external databases that either use CTD content at their site or link back to CTD (Table 3). This connectivity augments data access for users of both CTD and other linked resources. This interoperability and adoption of CTD data allow for cross-integration of additional information with CTD content in the future. In compliance with the bioDBcore initiative (20), the core attributes describing CTD are provided in Supplementary Table S1. CTD data files are freely available from either individual pages or our ‘Downloads’ tab (http://ctdbase.org/downloads/) in multiple formats (CSV, TSV, XML, Excel and OBO).
We enhanced CTD by adding four new computational and network scoring features:
To help navigate the 15.6 million toxicogenomic relationships in CTD, we created a package of analytical and visualization tools, accessible under the ‘Analyze’ menu. We have previously described the Batch Query, VennViewer and MyGeneVenn tools (4,5). To this suite, we added the following:
A growing challenge for databases is developing ways to visualize large data sets to enhance knowledge management for the user (23–25). Toward that end, CTD has begun implementing processes to visualize our content using three different approaches.
In addition to the above features, we also increased the utility of GO and pathway annotations at CTD. These annotations are directly assigned to gene symbols by external sources, and through integration with CTD data we can create novel connections to diseases with which the same genes are involved. We expanded CTD’s GO and Pathway pages to include a ‘Diseases’ data tab that lists these associations. For example, as of July 2012 the ‘TGF-beta signaling pathway’ shown on CTD’s Pathway page (http://ctdbase.org/detail.go?type=pathway&acc=KEGG%3a04350) is directly associated with 375 genes via KEGG, which in turn can be integrated via CTD to 316 diseases, including lung neoplasms, craniofacial abnormalities and sepsis. Similar integrated relationships are available for GO terms on CTD’s GO pages, allowing users to explore diseases from GO and pathway perspectives.
On CTD Gene pages, the listed synonyms are now hyperlinked to keyword query searches to help find related genes. For example, CTD’s Gene page for TP53 (http://ctdbase.org/detail.go?type=gene&acc=7157) contains the synonym ‘p53 tumor suppressor’, which, when clicked, finds other genes that use that phrase, including the mouse-specific version of the gene (called TRP53), as well as several p53-binding proteins (e.g. TP53BP1 and TP53BP2). This simple feature can alert users to other genes that may be relevant to their gene-of-interest and is particularly helpful because of CTD’s cross-species gene aggregation.
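The idea of turning a synonym into a keyword query can be sketched as a case-insensitive phrase match over gene names and synonyms. This is only an illustration: the records, field names and the function `find_genes_by_keyword` are hypothetical, and CTD's actual search implementation is not described here.

```python
# Toy gene records (hypothetical placeholders, not CTD vocabulary entries).
gene_records = [
    {
        "symbol": "GENE_A",
        "name": "example tumor suppressor",
        "synonyms": ["ETS-1"],
    },
    {
        "symbol": "GENE_B",
        "name": "example tumor suppressor binding protein 1",
        "synonyms": [],
    },
    {
        "symbol": "GENE_C",
        "name": "unrelated kinase",
        "synonyms": ["UK1"],
    },
]


def find_genes_by_keyword(records, keyword):
    """Return symbols of genes whose name or any synonym contains the phrase."""
    needle = keyword.casefold()
    matches = []
    for record in records:
        searchable = [record["name"], *record["synonyms"]]
        if any(needle in text.casefold() for text in searchable):
            matches.append(record["symbol"])
    return matches


# Clicking a synonym is equivalent to running the phrase as a keyword query.
hits = find_genes_by_keyword(gene_records, "Example Tumor Suppressor")
print(hits)
```

Because the match is on the phrase rather than on a single gene identifier, a query seeded from one gene's synonym also surfaces other genes whose names contain it, which is how related binding proteins can appear in the results.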
Finally, the Batch Query tool (http://ctdbase.org/tools/batchQuery.go) has been expanded to accommodate literature retrieval by now accepting PubMed identification numbers or digital object identifiers as an input type. This feature allows users to retrieve all curated data content for batches of articles.
CTD provides detailed information about manually curated chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Integrating these core data with other data sets, CTD helps turn knowledge into discoveries by identifying novel connections between chemicals, genes, diseases, pathways and GO annotations that might not otherwise be apparent using other biological resources.
Here, we have highlighted recent major improvements to CTD, including expanded data content, greater connectivity with other databases, new analytical tools and novel visualization strategies that help users view and organize information. These features make CTD a unique scientific resource for promoting understanding of the effects of environmental chemicals on human health and for generating testable hypotheses about the mechanisms underlying the etiology of environmental diseases.
In the future, we hope to expand the depth and breadth of the manually curated core data, especially by curating recent toxicology journals triaged via a new journal-centric approach to help improve data currency at CTD (A. P. Davis et al., submitted for publication) and expanding into new knowledge spaces, including exposure science (29) and phenotypes. We also plan to increase the visualization and analysis capacity of CTD. For example, heat maps are practical visual devices that help users rapidly interpret large data sets (30). We are currently experimenting with different visualization prototypes to present MEDIC-Slim summaries for disease relationships.
Supplementary Data are available at NAR Online: Supplementary Table S1.
National Institute of Environmental Health Sciences (NIEHS) grants ‘Comparative Toxicogenomics Database’ [R01-ES014065]; ‘Generation of a centralized and integrated resource for exposure data’ [R01-ES019604]. Funding for open access charge: NIEHS [R01-ES014065 and R01-ES019604].
Conflict of interest statement. None declared.
We thank Dr Heather Keating for contributions to the curation of the Pfizer-selected toxicology corpus and Roy McMorran for CTD system/database administration. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.