The Comparative Toxicogenomics Database (CTD) is a curated database that promotes understanding about the effects of environmental chemicals on human health. Biocurators at CTD manually curate chemical–gene interactions, chemical–disease relationships and gene–disease relationships from the literature. This strategy allows data to be integrated to construct chemical–gene–disease networks. CTD is unique in numerous respects: curation focuses on environmental chemicals; interactions are manually curated; interactions are constructed using controlled vocabularies and hierarchies; additional gene attributes (such as Gene Ontology, taxonomy and KEGG pathways) are integrated; data can be viewed from the perspective of a chemical, gene or disease; results and batch queries can be downloaded and saved; and most importantly, CTD acts as both a knowledgebase (by reporting data) and a discovery tool (by generating novel inferences). Over 116 000 interactions between 3900 chemicals and 13 300 genes have been curated from 270 species, and 5900 gene–disease and 2500 chemical–disease direct relationships have been captured. By integrating these data, 350 000 gene–disease relationships and 77 000 chemical–disease relationships can be inferred. This wealth of chemical–gene–disease information yields testable hypotheses for understanding the effects of environmental chemicals on human health. CTD is freely available at http://ctd.mdibl.org.
Environmental agents are postulated to play a critical role in the etiology of many human diseases (1–4), and chemicals are an important component of the environment. To understand the impact of environmental chemicals on human health, we have developed the Comparative Toxicogenomics Database (CTD; http://ctd.mdibl.org) as a unique tool to provide connections between chemicals, genes/proteins and diseases that may not otherwise be apparent, and to provide the basis for testable hypotheses about the mechanisms underlying the etiology of environmental diseases (5–7).
Several valuable chemical, gene and disease databases currently exist. Each one has its advantages. Many public chemical databases, such as PharmGKB (8), DrugBank (9), ChemBank (10) and STITCH (11) focus on drugs and other small molecules, providing an invaluable resource for therapeutic research. There are several microarray resources that provide varying degrees of data for chemicals, genes and diseases. Chemical Effects in Biological Systems (CEBS) (12) is a public repository and tool for chemically relevant microarray, proteomics, clinical chemistry, hematology and histopathology data. ArrayExpress (13) and Gene Expression Omnibus (GEO) (14) are public repositories for microarray data. Although the latter contain chemically relevant data, these data are not their expressed priority. ArrayTrack (15) is an installable application and database for managing and analyzing microarray data. Currently, only users at the US Food and Drug Administration (FDA) may submit their data; however, non-FDA users have access to ArrayTrack functionality. ChEBI (16) is an excellent dictionary for chemical entities, but outsources its information on the biology of those chemicals to other databases via external links. PubChem (14) is a repository of chemical substance information, compound structures and biological activities of small molecules, but does not integrate that data with official gene symbols or disease information. OMIM (17) and HGMD (18), two of the most commonly cited disease databases, annotate genetic diseases, but do not provide any associated chemical information. Some gene databases, such as GeneCards (19) and PubGene (20), have recently included gene–chemical associations, but those relationships are established via text-mining algorithms and are not reviewed or validated by professional biocurators. 
KEGG (21) and Reactome (22) map chemicals, genes and (in the case of KEGG) disease information to pathways, but the pathways and interactions are generically applied to orthologous proteins and all species, and it is not always clear which reference supports which pathway relationship. CTD is distinct from these databases in three ways: (i) it focuses on environmental chemicals; (ii) it integrates curated and imported data, allowing users to explore connections between chemicals, genes, and diseases; and (iii) it functions not only as a repository for information, but also as a resource for generating novel hypotheses about environmental diseases and chemical actions.
It is becoming well established that environmental agents influence chronic disease susceptibility (23). There are numerous types of environmental agents, including infectious agents (bacteria, viruses and parasites), diet, radiation and chemicals. One way that chemicals might influence diseases is by interacting with genes and proteins. Environmental chemicals can affect genes in multiple, nonexclusive ways, including mutagenesis (24), altered methylation (3), physical interaction (25) and influencing gene expression or protein function. Conversely, naturally occurring genetic polymorphisms may affect chemical susceptibility and result in increased disease predisposition (26). To help understand the complex effects of the environment on human health, CTD focuses its manual curation effort on environmental chemicals (e.g. arsenic, heavy metals and dioxins), how those chemicals interact with genes or proteins in different species and how they relate to human diseases.
CTD biocurators capture three types of core data from the literature: chemical–gene (and protein) interactions, chemical–disease relationships and gene–disease relationships. These data are curated in a structured format using controlled vocabularies and are integrated to establish a triad of chemicals, genes and diseases (Figure 1a, Table 1).
A major strength of CTD is that these core data are manually curated from the literature by professional biocurators (27), ensuring accuracy. CTD does use text mining to triage the literature, but each reference (abstract or full-text) is read by a biocurator to identify interactions and relationships, and all curated data are supported by their source citations. Some databases rely solely on text mining and report interactions based on co-occurrences of a chemical and gene in a document. However, this method has several limitations: co-occurrence of terms does not always imply a valid chemical–gene interaction; chemical names and gene symbols are challenging to text mine accurately because of their many synonyms and correspondence with common words (e.g. ‘lead’, ‘find’, ‘up’, ‘for’, ‘a’); and to date, text-mining tools have not accommodated types of molecular interactions. The manual curation approach at CTD allows biocurators to validate every interaction and relationship, ensure that the correct chemical name and gene symbol are used, and generate detailed descriptions of the types of interaction. Data are uploaded to the database monthly.
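The co-occurrence limitation described above can be illustrated with a small sketch. It is not a description of any particular text-mining tool: the two lexicons, the gene symbol `FOR` and the example sentence are all hypothetical, chosen only to show how a naive matcher confuses common English words with chemical names and gene symbols.

```python
import re

# Hypothetical mini-lexicons for illustration only; real vocabularies are far larger.
CHEMICALS = {"lead", "arsenic"}
GENES = {"TP53", "FOR"}  # "FOR" is an assumed stand-in for a word-like gene symbol


def cooccurrences(sentence):
    """Return every (chemical, gene) pair whose terms both appear in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    # Case-insensitive matching, as a naive matcher might do.
    chemicals = {t.lower() for t in tokens if t.lower() in CHEMICALS}
    genes = {t.upper() for t in tokens if t.upper() in GENES}
    return {(c, g) for c in chemicals for g in genes}


# "lead" is a verb here and "for" is a preposition, yet both are matched.
sentence = "These results lead to a new model for TP53 regulation."
pairs = cooccurrences(sentence)
print(sorted(pairs))
```

Neither reported pair reflects a chemical–gene interaction stated in the sentence, which is the kind of error that reading each reference is intended to catch.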
The use of controlled vocabularies provides numerous advantages: the curation process is streamlined, different biocurators capture data in a consistent manner, users retrieve data reproducibly and quality control is feasible. The hierarchical structure of chemical and disease vocabularies in CTD also enables users to query data using general (e.g. heterocyclic compounds) or specific (e.g. 2-hydroxytacrine) terms. The following vocabularies are used by biocurators for curation and are integrated in the database to facilitate querying:
Community-accepted controlled vocabularies and identification numbers allow integration with other databases that use the same terms. CTD enhances its core data pages (Chemical, Gene and Disease) with links to the following external resources (data are updated monthly):
A powerful feature of CTD is the integration of curated chemical, gene and disease core data from the literature (knowledge) to generate new, putative discoveries (Figure 1b, Table 1). For example, if chemical A interacts with gene B (via a curated chemical–gene interaction) and independently gene B is associated with disease C (via a curated gene–disease relationship), then it may be inferred or hypothesized that chemical A has a relationship with disease C (inferred via gene B). This integration provides possible chemical–gene–disease connections that may not otherwise be apparent.
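The inference rule in this example can be sketched as a join of the two curated relations on the shared gene. The following is a minimal illustration under stated assumptions, not CTD's implementation: the record names (`chemical A`, `gene B` and so on), the function `infer_chemical_disease` and the small data set are placeholders invented for the sketch.

```python
from collections import defaultdict

# Hypothetical curated records; the names are placeholders, not CTD data.
chemical_gene = [
    ("chemical A", "gene B"),
    ("chemical A", "gene D"),
    ("chemical E", "gene D"),
]
gene_disease = [
    ("gene B", "disease C"),
    ("gene D", "disease C"),
    ("gene D", "disease F"),
]
curated_chemical_disease = {("chemical E", "disease F")}


def infer_chemical_disease(chemical_gene, gene_disease):
    """Map each (chemical, disease) pair to the set of genes that link them."""
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    inferred = defaultdict(set)
    for chemical, gene in chemical_gene:
        for disease in diseases_by_gene.get(gene, ()):
            inferred[(chemical, disease)].add(gene)
    return dict(inferred)


inferred = infer_chemical_disease(chemical_gene, gene_disease)
for (chemical, disease), genes in sorted(inferred.items()):
    # Distinguish pairs that are also directly curated from purely inferred ones.
    status = "also curated" if (chemical, disease) in curated_chemical_disease else "inferred only"
    print(f"{chemical} -> {disease} via {sorted(genes)} ({status})")
```

Keeping the set of linking genes for each pair preserves the evidence behind every inferred relationship, so a pair supported by several genes can be told apart from one supported by a single gene.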
The molecular basis of most environmental diseases is still not clear. CTD can act as a discovery tool to generate testable hypotheses about the mechanisms underlying the etiology of environmental diseases. This approach was recently supported by analyzing the CTD arsenic data set, wherein CTD correctly predicted types of diseases that may be associated with arsenic exposure and sets of genes that may be involved in modulating arsenic-related diseases, such as lung cancer and diabetes. A similar analysis can be applied to any environmental chemical or disease. For example, chemical–gene networks connecting environmental exposure to autism can be discovered by going to the CTD disease page for autism (Autistic Disorder) and clicking on the ‘Chemicals’ tab (Figure 2). Here, users will find a list of chemicals that interact with genes known to be associated with autism, generating a hypothetical chemical–gene–disease network for the disease. These predictions of environmental exposure can now be addressed and tested in the laboratory.
Because the chemical–gene–disease triad connects all nodes with curated edges (Figure 1a), a user can explore CTD starting from a chemical, gene or disease of interest and discover a novel connection to either of the other two nodes. This makes CTD a valuable discovery tool for any laboratory studying chemistry, genetics or human health.
Users have several options for querying data in CTD. A keyword search box appears on every CTD page (Figure 3). This box contains a pick-list to allow queries of chemicals, genes, diseases, GO terms, organisms or references. Keywords may include terms, symbols or accession IDs, and Boolean operators are supported.
Users can also perform detailed searches using Gene, Interaction and Reference Query pages. Many terms associated with curated and imported data can be used as search parameters. For example, queries may include GO annotations, KEGG pathways, chemical classes, types of chemical interactions, associated diseases and organisms to ask questions, such as: polychlorinated biphenyls affect the activity of which transcription factors? What proteins involved in limb development interact with a heavy metal? What chemicals downregulate members of the glycine metabolism pathway? This detailed querying option allows users to find data beyond the limits of a specific chemical, gene or disease term, and start to analyze data from the perspective of broader biological concepts and systems.
A new batch query tool allows users to download data associated with lists of chemicals, genes or diseases. Users choose the type of results to retrieve, which include curated chemical–gene interactions, curated chemical or gene associations, disease relationships, GO associations or pathway associations. This feature provides important biological insights into groups of chemicals, genes or diseases such as: what is the predominant molecular function associated with the 863 genes that interact with paraquat, or what disease is most commonly associated with this list of heavy metals? Batch query results can be downloaded in CSV (comma-separated values), TSV (tab-separated values) or XML (extensible markup language) format.
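A downloaded result file can then be summarized with a few lines of code to answer a question such as the one about heavy metals. The sketch below assumes a TSV download with columns named `chemical` and `disease`; those column names, the sample rows and the function `disease_counts` are hypothetical and would need to be adapted to the headers of an actual file.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded TSV file; the columns and rows are hypothetical.
SAMPLE_TSV = (
    "chemical\tdisease\n"
    "metal 1\tdisease X\n"
    "metal 1\tdisease Y\n"
    "metal 2\tdisease X\n"
    "metal 3\tdisease X\n"
    "metal 3\tdisease Z\n"
)


def disease_counts(handle, delimiter="\t"):
    """Count how many distinct chemicals are associated with each disease."""
    chemicals_by_disease = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        chemicals_by_disease.setdefault(row["disease"], set()).add(row["chemical"])
    return Counter({d: len(c) for d, c in chemicals_by_disease.items()})


counts = disease_counts(io.StringIO(SAMPLE_TSV))
top_disease, n_chemicals = counts.most_common(1)[0]
print(f"{top_disease} is associated with {n_chemicals} of the listed chemicals")
```

For a CSV download the same function can be called with `delimiter=","`, and `io.StringIO(SAMPLE_TSV)` can be replaced by a file opened with `open(path, newline="")`.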
The curation and integration paradigm in CTD allows users to explore data from the perspective of a chemical, gene or disease. All chemical, gene and disease terms are hyperlinked to their respective detail pages, which organize associated data on tabbed pages (Figure 3). The data for each tabbed page are presented in a table, which can be sorted by column or downloaded in CSV, TSV or XML format. The location and content of tabbed data pages in CTD are described:
CTD is a unique scientific resource that promotes understanding about the effects of environmental chemicals on human health. It provides chemical–gene interactions, chemical–disease relationships and gene–disease relationships that are manually curated by biocurators using controlled vocabularies. By integrating these core data, CTD functions as a discovery tool for identifying connections between chemicals, genes and diseases not otherwise apparent in other biological resources, and for generating testable hypotheses about the mechanisms underlying the etiology of environmental diseases.
Future development of CTD will aim to further expand the depth of its curated data and enhance the data query and visualization capabilities. Specifically, text-mining tools will be incorporated to increase the efficiency of manual data curation; the number of databases to which CTD is reciprocally linked will increase to improve integration of relevant databases in the public domain; and additional query and data visualization strategies will be explored to introduce graphical representations of complex data relationships (e.g. chemical–gene–protein interactions and chemical–gene–disease relationships). CTD will continue to be publicly available, and the community is encouraged to contact us with comments and suggestions so that we may continue to enhance its value.
This work is supported by the National Institute of Environmental Health Sciences (ES014065) and the INBRE program of the National Center for Research Resources (RR016463) of the National Institutes of Health.
Conflict of interest statement. None declared.