A prototype version of CTD is available via the World Wide Web (http://ctd.mdibl.org). The major data types in CTD are: 1) nucleotide and protein sequences, 2) reference publications, 3) curated genes, 4) Gene Sets (sets of curated genes), 5) a hierarchical vocabulary of chemicals, 6) Gene Ontology terms (hierarchical vocabulary of biological processes, cellular components, and molecular functions), and 7) organism taxonomy. To date, nucleotide and protein sequences are included for all vertebrates and invertebrates. Nucleotide sequences and annotations are acquired from the National Center for Biotechnology Information (NCBI). We include only Reference Sequences (RefSeqs) for H. sapiens, M. musculus, R. norvegicus, D. melanogaster, and C. elegans (Pruitt and others, ‘05). Amino acid sequences and annotations are acquired from the European Bioinformatics Institute’s UniProtKB/Swiss-Prot and TrEMBL databases (Bairoch and others, ‘05).
References are acquired from PubMed and must contain information about chemical-gene interactions. This data set will continue to expand in scope and size as our curation capacity expands.
CTD integrates controlled, hierarchical vocabularies for organism taxonomy (NCBI; Wheeler and others, ‘02), chemicals (National Library of Medicine’s Medical Subject Headings and Supplementary Concepts; http://www.nlm.nih.gov/mesh/MBrowser.html), and Gene Ontology (GO; Harris and others, ‘04) to ensure consistency in data integration, annotation, access, and interpretation. These vocabularies enhance visitor query capabilities and enable us to link to related data in other biological resources. The chemical vocabulary has also been critical for establishing initial automated chemical-gene associations.
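One way a hierarchical vocabulary can enhance queries is by expanding a search term to include every term beneath it, so that a query on a broad chemical class also retrieves genes associated with its more specific members. The following is a minimal sketch of that idea, not a description of CTD's implementation. The `Vocabulary` class, the `genes_for_term` helper, and all term and gene names are hypothetical placeholders.

```python
from collections import defaultdict


class Vocabulary:
    """Toy hierarchical vocabulary: each term may have several child terms."""

    def __init__(self):
        self.children = defaultdict(set)

    def add_edge(self, parent, child):
        self.children[parent].add(child)

    def descendants(self, term):
        """Return the term itself plus every term below it in the hierarchy."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # a term reachable by two paths is visited once
            seen.add(current)
            stack.extend(self.children.get(current, ()))
        return seen


def genes_for_term(vocab, associations, term):
    """Collect genes associated with a term or with any of its descendants."""
    genes = set()
    for t in vocab.descendants(term):
        genes |= associations.get(t, set())
    return genes


# Placeholder hierarchy and associations (invented for illustration only).
vocab = Vocabulary()
vocab.add_edge("chemical_class", "chemical_a")
vocab.add_edge("chemical_class", "chemical_b")
vocab.add_edge("chemical_a", "chemical_a_derivative")

associations = {
    "chemical_a": {"GENE1"},
    "chemical_a_derivative": {"GENE2"},
    "chemical_b": {"GENE3"},
}

print(sorted(genes_for_term(vocab, associations, "chemical_a")))
print(sorted(genes_for_term(vocab, associations, "chemical_class")))
```

Querying the broad placeholder term `chemical_class` returns all three genes, whereas querying `chemical_a` returns only the genes attached to it and to its derivative.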
Our primary curation focus is to construct cross-species, toxicologically important genes and Gene Sets. Genes are defined in CTD by their constituent nucleotide and protein sequences from vertebrates and invertebrates and are constructed using sequence analysis methods in combination with literature review. Gene Sets group closely related curated genes, such as those that have undergone duplication events in specific species (e.g., CYP1A4, CYP1A5) or are members of large families (e.g., ABC transporters, G protein-coupled receptors), and give visitors a broader perspective on their gene(s) of interest. To date, we have curated seven Gene Sets that comprise 21 curated genes and 572 sequences from 84 unique species.
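The relationship among sequences, curated genes, and Gene Sets described above can be sketched as a small data model. This is an illustrative assumption about how such records might be organized, not CTD's actual schema. The class names, the `summary` method, and every accession and species label are hypothetical. Only the gene symbols CYP1A4 and CYP1A5 come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sequence:
    accession: str  # placeholder identifier, not a real accession
    species: str
    kind: str       # "nucleotide" or "protein"


@dataclass
class CuratedGene:
    symbol: str
    sequences: list = field(default_factory=list)


@dataclass
class GeneSet:
    name: str
    genes: list = field(default_factory=list)

    def summary(self):
        """Count curated genes, constituent sequences, and unique species."""
        sequences = [s for g in self.genes for s in g.sequences]
        return {
            "genes": len(self.genes),
            "sequences": len(sequences),
            "species": len({s.species for s in sequences}),
        }


# Invented sequence records; only the gene symbols appear in the source text.
cyp1a4 = CuratedGene("CYP1A4", [
    Sequence("ACC0001", "species_1", "nucleotide"),
    Sequence("ACC0002", "species_1", "protein"),
])
cyp1a5 = CuratedGene("CYP1A5", [
    Sequence("ACC0003", "species_2", "nucleotide"),
])

example_set = GeneSet("example Gene Set", [cyp1a4, cyp1a5])
print(example_set.summary())
```

Summing such per-set counts across all Gene Sets, with species de-duplicated across sets, would give totals of the kind reported above.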