Transcription factors control which information in a genome becomes transcribed to produce RNAs that function in the biological systems of cells and organisms. Reliable and comprehensive information about transcription factors is invaluable for large-scale network-based studies. However, existing transcription factor knowledge bases are still lacking in well-documented functional information.
Here, we provide guidelines for a curation strategy, which constitutes a robust framework for using the controlled vocabularies defined by the Gene Ontology Consortium to annotate specific DNA binding transcription factors (DbTFs) based on experimental evidence reported in literature. Our standardized protocol and workflow for annotating specific DNA binding RNA polymerase II transcription factors is designed to document high-quality and decisive evidence from valid experimental methods. Within a collaborative biocuration effort involving the user community, we are now in the process of exhaustively annotating the full repertoire of human, mouse and rat proteins that qualify as DbTFs, insofar as they are experimentally documented in the biomedical literature today. The completion of this task will significantly enrich Gene Ontology-based information resources for the research community.
Pax3 has numerous integral functions in embryonic tissue morphogenesis and knowledge of its complex function in cells of adult tissue continues to unfold. Across a variety of adult tissue lineages, the role of Pax3 is principally linked to maintenance of the tissue’s resident stem/progenitor cell population. In adult peripheral nerves, Pax3 is reported to be expressed in nonmyelinating Schwann cells; however, little is known about the purpose of this expression. Based on the evidence of the role of Pax3 in other adult tissue stem and progenitor cells, it was hypothesised that the cells in adult peripheral nerve that express Pax3 may be peripheral glioblasts. Here, methods have been developed for identification and visualisation of Pax3-expressing cells in normal 60-day-old mouse peripheral nerve that allowed morphological and phenotypic distinctions to be made between Pax3-expressing cells and other nonmyelinating Schwann cells. The distinctions described provide compelling support for a resident glioblast population in adult mouse peripheral nerve.
Autism spectrum disorders (ASD) represent a group of developmental disabilities with a strong genetic basis. The laboratory mouse is increasingly used as a model organism for ASD, and MGI, the Mouse Genome Informatics resource, is the primary model organism database for the laboratory mouse. MGI uses the Mammalian Phenotype (MP) ontology to describe mouse models of human diseases. Using bioinformatics tools including Phenologs, MouseNET, and the Ontological Discovery Environment, we tested data associated with MP terms to characterize new gene-phenotype associations related to ASD. Our integrative analysis using these tools identified numerous mouse genotypes that are likely to have previously uncharacterized autistic-like phenotypes. The genes implicated in these mouse models had considerable overlap with a set of over 300 genes recently associated with ASD due to small, rare copy number variation (Pinto et al., 2010). Prediction and characterization of autistic mutant mouse alleles assist researchers in studying the complex nature of ASD and provide a generalizable approach to candidate gene prioritization.
Autism spectrum disorders; phenotype ontology; mouse disease models
The laboratory mouse is the premier animal model for studying human biology because all life stages can be accessed experimentally, a completely sequenced reference genome is publicly available and there exists a myriad of genomic tools for comparative and experimental research. In the current era of genome scale, data-driven biomedical research, the integration of genetic, genomic and biological data is essential for realizing the full potential of the mouse as an experimental model. The Mouse Genome Database (MGD; http://www.informatics.jax.org), the community model organism database for the laboratory mouse, is designed to facilitate the use of the laboratory mouse as a model system for understanding human biology and disease. To achieve this goal, MGD integrates genetic and genomic data related to the functional and phenotypic characterization of mouse genes and alleles and serves as a comprehensive catalog for mouse models of human disease. Recent enhancements to MGD include the addition of human ortholog details to mouse Gene Detail pages, the inclusion of microRNA knockouts in MGD’s catalog of alleles and phenotypes, the addition of video clips to phenotype images, access to genotype and phenotype data associated with quantitative trait loci (QTL) and improvements to the layout and display of Gene Ontology annotations.
The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented, particularly with regard to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once articles are indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO.
All data associations are supported with statements of evidence as well as access to source publications.
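The curator report named above ('papers selected for GO that refer to genes with NO GO annotation') can be pictured as a simple filter over the indexed bibliography. The following is a minimal sketch under stated assumptions, not MGI's implementation: the `Paper` record, its field names, the reference identifiers and the gene symbols are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    ref_id: str        # hypothetical bibliography identifier
    areas: frozenset   # areas of interest the paper was indexed to, e.g. {"GO"}
    genes: frozenset   # gene symbols the paper was indexed to


def papers_for_unannotated_genes(papers, go_annotated_genes):
    """List GO-selected papers that mention at least one gene with no GO annotation."""
    report = []
    for paper in papers:
        if "GO" not in paper.areas:
            continue  # paper was not selected for GO curation
        missing = sorted(paper.genes - go_annotated_genes)
        if missing:
            report.append((paper.ref_id, missing))
    return report


# Made-up bibliography entries and annotation state, for illustration only.
papers = [
    Paper("REF:1", frozenset({"GO"}), frozenset({"GeneA", "GeneB"})),
    Paper("REF:2", frozenset({"phenotype"}), frozenset({"GeneC"})),
    Paper("REF:3", frozenset({"GO", "homology"}), frozenset({"GeneA"})),
]
go_annotated_genes = {"GeneA"}

report = papers_for_unannotated_genes(papers, go_annotated_genes)
print(report)
```

In this toy data only the first paper is reported, because it is indexed to 'GO' and refers to a gene that has no GO annotation yet.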
The 5th International Biocuration Conference brought together over 300 scientists to exchange ideas about their work, as well as discuss issues relevant to the International Society for Biocuration’s (ISB) mission. Recurring themes this year included the creation and promotion of gold standards, the need for more ontologies, and more formal interactions with journals. The conference is an essential part of the ISB's goal to support exchanges among members of the biocuration community. Next year's conference will be held in Cambridge, UK, from 7 to 10 April 2013. In the meantime, the ISB website provides information about the society's activities (http://biocurator.org), as well as related events of interest.
Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the “functional similarity” between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the “ortholog conjecture” (or, more properly, the “ortholog functional conservation hypothesis”). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. 
Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an “open world assumption” (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis.
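The open world assumption in point 1 can be made concrete with a toy comparison. The sketch below is illustrative only: it uses a plain Jaccard overlap, which is not the metric of Nehrt et al., and the annotation labels are invented. It shows that two genes with complementary annotations score zero overlap while containing no actual contradiction, since a conflict requires an explicit negative (NOT) annotation.

```python
def jaccard(a, b):
    """Closed-world overlap score: treats every missing term as a disagreement."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def open_world_compare(pos_a, pos_b, not_a=frozenset(), not_b=frozenset()):
    """Separate genuine conflicts from terms that are simply not yet annotated."""
    return {
        "shared": pos_a & pos_b,
        # Annotated for one gene, unknown (not false) for the other.
        "unknown_for_b": pos_a - pos_b - not_b,
        "unknown_for_a": pos_b - pos_a - not_a,
        # A conflict needs an explicit negative annotation on one side.
        "conflicts": (pos_a & not_b) | (pos_b & not_a),
    }


# Invented annotations: one gene studied at the organism level,
# its ortholog studied biochemically.
mouse_terms = {"process: organ development", "process: behavior"}
human_terms = {"function: kinase activity", "function: nucleotide binding"}

score = jaccard(mouse_terms, human_terms)
result = open_world_compare(mouse_terms, human_terms)
print(score, result["conflicts"])
```

The overlap score is zero, yet the set of conflicts is empty: the two annotation sets describe different aspects of function rather than different functions.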
Understanding gene function—how individual genes contribute to the biology of an organism at the molecular, cellular and organism levels—is one of the primary aims of biomedical research. It has been a longstanding tenet of model organism research that experimental knowledge obtained in one organism is often applicable to other organisms, particularly if the organisms share the relevant genes because they inherited them from their common ancestor. Nevertheless this tenet is, like any hypothesis, not beyond question. A recent paper has termed this hypothesis a “conjecture,” and performed a statistical analysis, the results of which were interpreted as evidence against the hypothesis. This statistical analysis relied on a computational representation of gene function, the Gene Ontology (GO). As representatives of the international consortium that produces the GO, we show how the apparent evidence against the “ortholog conjecture” can be better explained as an artifact of how molecular biology knowledge is accumulated. In short, a complementarity between knowledge obtained in mouse and human experimental systems was incorrectly interpreted as a disagreement. We discuss the proper interpretation of GO annotations and potential sources of bias, with an eye toward enhancing the informed use of the GO by the scientific community.
Optimal curation of human diseases requires an ontology or structured vocabulary that contains terms familiar to end users, is robust enough to support multiple levels of annotation granularity, is limited to disease terms and is stable enough to avoid extensive reannotation following updates. At Mouse Genome Informatics (MGI), we currently use disease terms from Online Mendelian Inheritance in Man (OMIM) to curate mouse models of human disease. While OMIM provides highly detailed disease records that are familiar to many in the medical community, it lacks structure to support multilevel annotation. To improve disease annotation at MGI, we evaluated the merged Medical Subject Headings (MeSH) and OMIM disease vocabulary created by the Comparative Toxicogenomics Database (CTD) project. Overlaying MeSH onto OMIM provides hierarchical access to broad disease terms, a feature missing from the OMIM. We created an extended version of the vocabulary to meet the genetic disease-specific curation needs at MGI. Here we describe our evaluation of the CTD application, the extensions made by MGI and discuss the strengths and weaknesses of this approach.
The identification of orthologs (gene pairs descended from a common ancestor through speciation, rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet, the development and application of orthology inference methods is hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.
The Cell Ontology (CL) aims for the representation of in vivo and in vitro cell types from all of biology. The CL is a candidate reference ontology of the OBO Foundry and requires extensive revision to bring it up to current standards for biomedical ontologies, both in its structure and its coverage of various subfields of biology. We have now addressed the specific content of one area of the CL, the section of the ontology dealing with hematopoietic cells. This section has been extensively revised to improve its content and eliminate multiple inheritance in the asserted hierarchy, and the groundwork was laid for structuring the hematopoietic cell type terms as cross-products incorporating logical definitions built from relationships to external ontologies, such as the Protein Ontology and the Gene Ontology. The methods and improvements to the CL in this area represent a paradigm for improvement of the entire ontology over time.
ontology; hematopoietic cells; immunology
The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and make quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented—alongside existing validated annotations—in a publicly accessible and searchable web interface.
function prediction; gene function human; Gene Ontology; machine learning
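The guilt-by-association idea mentioned above can be sketched as weighted neighbor voting in a functional-linkage network. This is a minimal illustration under assumptions, not the scoring method of the study: the gene names, edge weights and the `guilt_by_association` function are hypothetical.

```python
def guilt_by_association(gene, network, annotated):
    """Score a gene for one function as the weighted fraction of its
    network neighbors that already carry that function."""
    neighbors = network.get(gene, {})
    total = sum(neighbors.values())
    if total == 0:
        return 0.0  # no linkage evidence for this gene
    support = sum(w for n, w in neighbors.items() if n in annotated)
    return support / total


# Made-up functional-linkage network: gene -> {neighbor: edge weight}.
network = {"g1": {"g2": 0.9, "g3": 0.1, "g4": 0.5}}
# Genes already annotated to the function of interest (also made up).
annotated = {"g2", "g4"}

score = guilt_by_association("g1", network, annotated)
print(round(score, 3))
```

A guilt-by-profiling approach would instead score each gene from its own features; combining the two, as the abstract describes, would require a further model that is not sketched here.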
The Mouse Genome Database (MGD, http://www.informatics.jax.org) is the international community resource for integrated genetic, genomic and biological data about the laboratory mouse. Data in MGD are obtained through loads from major data providers and experimental consortia, electronic submissions from laboratories and from the biomedical literature. MGD maintains a comprehensive, unified, non-redundant catalog of mouse genome features generated by distilling gene predictions from NCBI, Ensembl and VEGA. MGD serves as the authoritative source for the nomenclature of mouse genes, mutations, alleles and strains. MGD is the primary source for evidence-supported functional annotations for mouse genes and gene products using the Gene Ontology (GO). MGD provides full annotation of phenotypes and human disease associations for mouse models (genotypes) using terms from the Mammalian Phenotype Ontology and disease names from the Online Mendelian Inheritance in Man (OMIM) resource. MGD is freely accessible online through our website, where users can browse and search interactively, access data in bulk using Batch Query or BioMart, download data files or use our web services Application Programming Interface (API). Improvements to MGD include expanded genome feature classifications, inclusion of new mutant allele sets and phenotype associations and extensions of GO to include new relationships and a new stream of annotations via phylogenetic-based approaches.
Representing species-specific proteins and protein complexes in ontologies that are both human- and machine-readable facilitates the retrieval, analysis, and interpretation of genome-scale data sets. Although existing protein-centric informatics resources provide the biomedical research community with well-curated compendia of protein sequence and structure, these resources lack formal ontological representations of the relationships among the proteins themselves. The Protein Ontology (PRO) Consortium is filling this informatics resource gap by developing ontological representations and relationships among proteins and their variants and modified forms. Because proteins are often functional only as members of stable protein complexes, the PRO Consortium, in collaboration with existing protein and pathway databases, has launched a new initiative to implement logical and consistent representation of protein complexes.
We describe here how the PRO Consortium is meeting the challenge of representing species-specific protein complexes, how protein complex representation in PRO supports annotation of protein complexes and comparative biology, and how PRO is being integrated into existing community bioinformatics resources. The PRO resource is accessible at http://pir.georgetown.edu/pro/.
PRO is a unique database resource for species-specific protein complexes. PRO facilitates robust annotation of variations in composition and function contexts for protein complexes within and between species.
Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.
The Cell Ontology (CL) is an ontology for the representation of in vivo cell types. As biological ontologies such as the CL grow in complexity, they become increasingly difficult to use and maintain. By making the information in the ontology computable, we can use automated reasoners to detect errors and assist with classification. Here we report on the generation of computable definitions for the hematopoietic cell types in the CL.
Computable definitions for over 340 CL classes have been created using a genus-differentia approach. These define cell types according to multiple axes of classification such as the protein complexes found on the surface of a cell type, the biological processes participated in by a cell type, or the phenotypic characteristics associated with a cell type. We employed automated reasoners to verify the ontology and to reveal mistakes in manual curation. The implementation of this process exposed areas in the ontology where new cell type classes were needed to accommodate species-specific expression of cellular markers. Our use of reasoners also inferred new relationships within the CL, and between the CL and the contributing ontologies. This restructured ontology can be used to identify immune cells by flow cytometry, supports sophisticated biological queries involving cells, and helps generate new hypotheses about cell function based on similarities to other cell types.
Use of computable definitions enhances the development of the CL and supports the interoperability of OBO ontologies.
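A genus-differentia definition and the kind of inference a reasoner draws from it can be sketched in a few lines. The example below is a deliberately simplified structural check, not an OWL reasoner and not the CL's actual content: the class names, the `has_plasma_membrane_part` relation label as used here, and the placeholder proteins are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    genus: str               # the more general class being refined
    differentiae: frozenset  # (relation, filler) pairs that distinguish the class


# Asserted single-parent hierarchy (child -> parent), illustrative only.
IS_A = {"leukocyte": "cell"}


def ancestors_or_self(name):
    """Walk the asserted hierarchy upward from a class name."""
    chain = [name]
    while name in IS_A:
        name = IS_A[name]
        chain.append(name)
    return chain


def infers_subclass(child, parent):
    """A class is inferred to be a subclass when its genus falls under the
    parent's genus and it satisfies every differentia of the parent."""
    genus_ok = parent.genus in ancestors_or_self(child.genus)
    return genus_ok and parent.differentiae <= child.differentiae


x_positive_cell = Definition(
    "cell", frozenset({("has_plasma_membrane_part", "protein X")})
)
xy_positive_leukocyte = Definition(
    "leukocyte",
    frozenset({
        ("has_plasma_membrane_part", "protein X"),
        ("has_plasma_membrane_part", "protein Y"),
    }),
)

print(infers_subclass(xy_positive_leukocyte, x_positive_cell))
print(infers_subclass(x_positive_cell, xy_positive_leukocyte))
```

The first relationship is inferred without ever being asserted, which is how computable definitions let a reasoner propose classifications and flag manually asserted ones that do not follow from the definitions.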
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
The Mouse Genome Database (MGD) is the community model organism database for the laboratory mouse and the authoritative source for phenotype and functional annotations of mouse genes. MGD includes a complete catalog of mouse genes and genome features with integrated access to genetic, genomic and phenotypic information, all serving to further the use of the mouse as a model system for studying human biology and disease. MGD is a major component of the Mouse Genome Informatics (MGI, http://www.informatics.jax.org/) resource. MGD contains standardized descriptions of mouse phenotypes, associations between mouse models and human genetic diseases, extensive integration of DNA and protein sequence data and normalized representation of genome and genome variant information. Data are obtained and integrated via manual curation of the biomedical literature, direct contributions from individual investigators and downloads from major informatics resource centers. MGD collaborates with the bioinformatics community on the development and use of biomedical ontologies such as the Gene Ontology (GO) and the Mammalian Phenotype (MP) Ontology. Major improvements to the Mouse Genome Database include comprehensive update of genetic maps, implementation of new classification terms for genome features, development of a recombinase (cre) portal and inclusion of all alleles generated by the International Knockout Mouse Consortium (IKMC).
The Protein Ontology (PRO) provides a formal, logically-based classification of specific protein classes including structured representations of protein isoforms, variants and modified forms. Initially focused on proteins found in human, mouse and Escherichia coli, PRO now includes representations of protein complexes. The PRO Consortium works in concert with the developers of other biomedical ontologies and protein knowledge bases to provide the ability to formally organize and integrate representations of precise protein forms so as to enhance accessibility to results of protein research. PRO (http://pir.georgetown.edu/pro) is part of the Open Biomedical Ontology Foundry.
The Mouse Genome Database (MGD) is a major component of the Mouse Genome Informatics (MGI, http://www.informatics.jax.org/) database resource and serves as the primary community model organism database for the laboratory mouse. MGD is the authoritative source for mouse gene, allele and strain nomenclature and for phenotype and functional annotations of mouse genes. MGD contains comprehensive data and information related to mouse genes and their functions, standardized descriptions of mouse phenotypes, extensive integration of DNA and protein sequence data and normalized representation of genome and genome variant information, including comparative data on mammalian genes. Data for MGD are obtained from diverse sources including manual curation of the biomedical literature and direct contributions from individual investigators’ laboratories and major informatics resource centers, such as Ensembl, UniProt and NCBI. MGD collaborates with the bioinformatics community on the development and use of biomedical ontologies such as the Gene Ontology and the Mammalian Phenotype Ontology. Recent improvements in MGD described here include integration of mouse gene trap allele and sequence data, integration of gene targeting information from the International Knockout Mouse Consortium, deployment of an MGI Biomart, and enhancements to our batch query capability for customized data access and retrieval.
The Protein Ontology (PRO) is designed as a formal and principled Open Biomedical Ontologies (OBO) Foundry ontology for proteins. The components of PRO extend from a classification of proteins on the basis of evolutionary relationships at the homeomorphic level to the representation of the multiple protein forms of a gene, including those resulting from alternative splicing, cleavage and/or post-translational modifications. Focusing specifically on the TGF-beta signaling proteins, we describe the building, curation, usage and dissemination of PRO.
PRO is manually curated on the basis of PrePRO, an automatically generated file with content derived from standard protein data sources. Manual curation ensures that the treatment of the protein classes and the internal and external relationships conform to the PRO framework. The current release of PRO is based upon experimental data from mouse and human proteins wherein equivalent protein forms are represented by single terms. In addition to the PRO ontology, the annotation of PRO terms is released as a separate PRO association file, which contains, for each given PRO term, an annotation from the experimentally characterized sub-types as well as the corresponding database identifiers and sequence coordinates. The annotations are added in the form of relationship to other ontologies. Whenever possible, equivalent forms in other species are listed to facilitate cross-species comparison. Splice and allelic variants, gene fusion products and modified protein forms are all represented as entities in the ontology. Therefore, PRO provides for the representation of protein entities and a resource for describing the associated data. This makes PRO useful both for proteomics studies where isoforms and modified forms must be differentiated, and for studies of biological pathways, where representations need to take account of the different ways in which the cascade of events may depend on specific protein modifications.
PRO provides a framework for the formal representation of protein classes and protein forms in the OBO Foundry. It is designed to enable data retrieval and integration and machine reasoning at the molecular level of proteins, thereby facilitating cross-species comparisons, pathway analysis, disease modeling and the generation of new hypotheses.
The evolving complexity of genome-scale experiments has increasingly centralized the role of a highly computable, accurate, and comprehensive resource spanning multiple biological scales and viewpoints. To provide a resource to meet this need, we have significantly extended the PhenoGO database with gene-disease specific annotations and included an additional ten species. This computationally derived resource is primarily intended to provide phenotypic context (cell type, tissue, organ, and disease) for mining existing associations between gene products and GO terms specified in the Gene Ontology Databases. Automated natural language processing (BioMedLEE) and computational ontology (PhenOS) methods were used to derive these relationships from the literature, expanding the database with information from ten additional species to include over 600,000 phenotypic contexts spanning eleven species from five GO annotation databases. A comprehensive evaluation of the mappings (n = 300) found precision (positive predictive value) at 85%, and recall (sensitivity) at 76%. Phenotypes are encoded in general purpose ontologies such as Cell Ontology, the Unified Medical Language System, and in specialized ontologies such as the Mouse Anatomy and the Mammalian Phenotype Ontology. A web portal has also been developed, allowing for advanced filtering and querying of the database as well as download of the entire dataset.
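For readers less familiar with the two evaluation measures reported above, the following sketch shows how precision (positive predictive value) and recall (sensitivity) are computed from counts of true positives, false positives and false negatives. The counts used are invented for illustration and are not the counts from the PhenoGO evaluation.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical evaluation: 8 correct mappings produced, 2 incorrect mappings
# produced, and 4 correct mappings that the system missed.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f} recall={recall:.2f}")
```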
The Mouse Genome Database (MGD, http://www.informatics.jax.org/), integrates genetic, genomic and phenotypic information about the laboratory mouse, a primary animal model for studying human biology and disease. Information in MGD is obtained from diverse sources, including the scientific literature and external databases, such as EntrezGene, UniProt and GenBank. In addition to its extensive collection of phenotypic allele information for mouse genes that is curated from the published biomedical literature and researcher submission, MGI includes a comprehensive representation of mouse genes including sequence, functional (GO) and comparative information. MGD provides a data mining platform that enables the development of translational research hypotheses based on comparative genotype, phenotype and functional analyses. MGI can be accessed by a variety of methods including web-based search forms, a genome sequence browser and downloadable database reports. Programmatic access is available using web services. Recent improvements in MGD described here include the unified mouse gene catalog for NCBI Build 37 of the reference genome assembly, and improved representation of mouse mutants and phenotypes.
This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress.