The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new ‘phylogenetic annotation’ process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself have been increased through the use of automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources.
The skeleton is of fundamental importance in research in comparative vertebrate morphology, paleontology, biomechanics, developmental biology, and systematics. Motivated by research questions that require computational access to and comparative reasoning across the diverse skeletal phenotypes of vertebrates, we developed a module of anatomical concepts for the skeletal system, the Vertebrate Skeletal Anatomy Ontology (VSAO), to accommodate and unify the existing skeletal terminologies for the species-specific (mouse, the frog Xenopus, zebrafish) and multispecies (teleost, amphibian) vertebrate anatomy ontologies. Previous differences between these terminologies prevented even simple queries across databases pertaining to vertebrate morphology. This module of upper-level and specific skeletal terms currently includes 223 defined terms and 179 synonyms that integrate skeletal cells, tissues, biological processes, organs (skeletal elements such as bones and cartilages), and subdivisions of the skeletal system. The VSAO is designed to integrate with other ontologies, including the Common Anatomy Reference Ontology (CARO), Gene Ontology (GO), Uberon, and Cell Ontology (CL), and it is freely available to the community to be updated with additional terms required for research. Its structure accommodates anatomical variation among vertebrate species in development, structure, and composition. Annotation of diverse vertebrate phenotypes with this ontology will enable novel inquiries across the full spectrum of phenotypic diversity.
The Gene Ontology (GO; http://www.geneontology.org) is a community-based bioinformatics resource that supplies information about gene product function using ontologies to represent biological knowledge. Here we describe improvements and expansions to several branches of the ontology, as well as updates that have allowed us to more efficiently disseminate the GO and capture feedback from the research community. The Gene Ontology Consortium (GOC) has expanded areas of the ontology such as cilia-related terms, cell-cycle terms and multicellular organism processes. We have also implemented new tools for generating ontology terms based on a set of logical rules making use of templates, and we have made efforts to increase our use of logical definitions. The GOC has a new and improved web site summarizing new developments and documentation, serving as a portal to GO data. Users can perform GO enrichment analysis, and search the GO for terms, annotations to gene products, and associated metadata across multiple species using the all-new AmiGO 2 browser. We encourage and welcome the input of the research community in all biological areas in our continued effort to improve the Gene Ontology.
A scientific ontology is a formal representation of knowledge within a domain, typically including central concepts, their properties, and relations. With the rise of computers and high-throughput data collection, ontologies have become essential to data mining and sharing across communities in the biomedical sciences. Powerful approaches exist for testing the internal consistency of an ontology, but not for assessing the fidelity of its domain representation. We introduce a family of metrics that describe the breadth and depth with which an ontology represents its knowledge domain. We then test these metrics using (1) four of the most common medical ontologies with respect to a corpus of medical documents and (2) seven of the most popular English thesauri with respect to three corpora that sample language from medicine, news, and novels. Here we show that our approach captures the quality of ontological representation and guides efforts to narrow the breach between ontology and collective discourse within a domain. Our results also demonstrate key features of medical ontologies, English thesauri, and discourse from different domains. Medical ontologies have a small intersection, as do English thesauri. Moreover, dialects characteristic of distinct domains vary strikingly as many of the same words are used quite differently in medicine, news, and novels. As ontologies are intended to mirror the state of knowledge, our methods to tighten the fit between ontology and domain will increase their relevance for new areas of biomedical science and improve the accuracy and power of inferences computed across them.
An ontology represents the concepts and their interrelation within a knowledge domain. Several ontologies have been developed in biomedicine, which provide standardized vocabularies to describe diseases, genes and gene products, physiological phenotypes, anatomical structures, and many other phenomena. Scientists use them to encode the results of complex experiments and observations and to perform integrative analysis to discover new knowledge. A remaining challenge in ontology development is how to evaluate an ontology's representation of knowledge within its scientific domain. Building on classic measures from information retrieval, we introduce a family of metrics including breadth and depth that capture the conceptual coverage and parsimony of an ontology. We test these measures using (1) four commonly used medical ontologies in relation to a corpus of medical documents and (2) seven popular English thesauri (ontologies of synonyms) with respect to text from medicine, news, and novels. Results demonstrate that both medical ontologies and English thesauri have a small overlap in concepts and relations. Our methods suggest efforts to tighten the fit between ontologies and biomedical knowledge.
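The abstract does not state the formulas for breadth and depth, so the sketch below does not reproduce them. As an illustration only, it computes the two classic information-retrieval measures the metrics are said to build on, recall and precision, over sets of concepts. The function names and the tiny term sets are hypothetical.

```python
def concept_recall(ontology_terms, corpus_terms):
    """Share of corpus concepts that the ontology also contains (coverage)."""
    ontology, corpus = set(ontology_terms), set(corpus_terms)
    if not corpus:
        return 0.0
    return len(ontology & corpus) / len(corpus)


def concept_precision(ontology_terms, corpus_terms):
    """Share of ontology concepts that actually occur in the corpus (parsimony)."""
    ontology, corpus = set(ontology_terms), set(corpus_terms)
    if not ontology:
        return 0.0
    return len(ontology & corpus) / len(ontology)


# Hypothetical toy data: concepts in an ontology and concepts seen in a corpus.
ONTOLOGY_TERMS = {"heart", "kidney", "liver", "aorta"}
CORPUS_TERMS = {"heart", "kidney", "liver", "fever", "cough"}

print(concept_recall(ONTOLOGY_TERMS, CORPUS_TERMS))     # 3 of 5 corpus concepts covered
print(concept_precision(ONTOLOGY_TERMS, CORPUS_TERMS))  # 3 of 4 ontology concepts used
```

A low recall would suggest the ontology misses concepts that the domain discourse uses, while a low precision would suggest it carries concepts the discourse rarely needs.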
An understanding of heart development is critical in any systems biology approach to cardiovascular disease. The interpretation of data generated from high-throughput technologies (such as microarray and proteomics) is also essential to this approach. However, characterizing the role of genes in the processes underlying heart development and cardiovascular disease involves the non-trivial task of data analysis and integration of previous knowledge. The Gene Ontology (GO) Consortium provides structured controlled biological vocabularies that are used to summarize previous functional knowledge for gene products across all species. One aspect of GO describes biological processes, such as development and signaling.
In order to support high-throughput cardiovascular research, we have initiated an effort to fully describe heart development in GO; expanding the number of GO terms describing heart development from 12 to over 280. This new ontology describes heart morphogenesis, the differentiation of specific cardiac cell types, and the involvement of signaling pathways in heart development and aligns GO with the current views of the heart development research community and its representation in the literature. This extension of GO allows gene product annotators to comprehensively capture the genetic program leading to the developmental progression of the heart. This will enable users to integrate heart development data across species, resulting in the comprehensive retrieval of information about this subject.
The revised GO structure, combined with gene product annotations, should improve the interpretation of data from high-throughput methods in a variety of cardiovascular research areas, including heart development, congenital cardiac disease, and cardiac stem cell research. Additionally, we invite the heart development community to contribute to the expansion of this important dataset for the benefit of future research in this area.
annotation; cardiovascular; development; Gene Ontology; heart
The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aims to create, maintain and facilitate the use of a controlled vocabulary (ontology) for plants. The ontology allows users to ascribe attributes of plant structure (anatomy and morphology) and developmental stages to data types, such as genes and phenotypes, to provide a semantic framework to make meaningful cross-species and database comparisons. The POC builds upon groundbreaking work by the Gene Ontology Consortium (GOC) by adopting and extending the GOC's principles, existing software and database structure. Over the past year, POC has added hundreds of ontology terms to associate with thousands of genes and gene products from Arabidopsis, rice and maize, which are available through a newly updated web-based browser (http://www.plantontology.org/amigo/go.cgi) for viewing, searching and querying. The Consortium has also implemented new functionalities to facilitate the application of PO in genomic research and updated the website to keep the contents current. In this report, we present a brief description of resources available from the website, changes to the interfaces, data updates, community activities and future enhancement.
The Gene Ontology Consortium (GOC) is a major bioinformatics project that provides structured controlled vocabularies to classify gene product function and location. GOC members create annotations to gene products using the Gene Ontology (GO) vocabularies, thus providing an extensive, publicly available resource. The GO and its annotations to gene products are now an integral part of functional analysis, and statistical tests using GO data are becoming routine for researchers to include when publishing functional information. While many helpful articles about the GOC are available, there are certain updates to the ontology and annotation sets that sometimes go unobserved. Here we describe some of the ways in which GO can change that should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets. GO annotations for gene products change for many reasons, and while these changes generally improve the accuracy of the representation of the underlying biology, they do not necessarily imply that previous annotations were incorrect. We additionally describe the quality assurance mechanisms we employ to improve the accuracy of annotations, which necessarily changes the composition of the annotation sets we provide. We use the Universal Protein Resource (UniProt) for illustrative purposes of how the GO Consortium, as a whole, manages these changes.
Gene Ontology; Annotation; Function prediction; Misinterpretation
The ‘Crop Ontology’ database we describe provides a controlled vocabulary for several economically important crops. It facilitates data integration and discovery from global databases and digital literature. This allows researchers to exploit comparative phenotypic and genotypic information of crops to elucidate functional aspects of traits.
Background and aims
Agricultural crop databases maintained in gene banks of the Consultative Group on International Agricultural Research (CGIAR) are valuable sources of information for breeders. These databases provide comparative phenotypic and genotypic information that can help elucidate functional aspects of plant and agricultural biology. To facilitate data sharing within and between these databases and the retrieval of information, the crop ontology (CO) database was designed to provide controlled vocabulary sets for several economically important plant species.
Existing public ontologies and equivalent catalogues of concepts covering the range of crop science information and descriptors for crops and crop-related traits were collected from breeders, physiologists, agronomists, and researchers in the CGIAR consortium. For each crop, relationships between terms were identified and crop-specific trait ontologies were constructed following the Open Biomedical Ontologies (OBO) format standard using the OBO-Edit tool. All terms within an ontology were assigned a globally unique CO term identifier.
The CO currently comprises crop-specific traits for chickpea (Cicer arietinum), maize (Zea mays), potato (Solanum tuberosum), rice (Oryza sativa), sorghum (Sorghum spp.) and wheat (Triticum spp.). Several plant-structure and anatomy-related terms for banana (Musa spp.), wheat and maize are also included. In addition, multi-crop passport terms are included as controlled vocabularies for sharing information on germplasm. Two web-based online resources were built to make these COs available to the scientific community: the ‘CO Lookup Service’ for browsing the CO; and the ‘Crops Terminizer’, an ontology text mark-up tool.
The controlled vocabularies of the CO are being used to curate several CGIAR centres' agronomic databases. The use of ontology terms to describe agronomic phenotypes and the accurate mapping of these descriptions into databases will be important steps in comparative phenotypic and genotypic studies across species and gene-discovery experiments.
The Gene Ontology Normal Usage Tracking System (GONUTS) is a community-based browser and usage guide for Gene Ontology (GO) terms and a community system for general GO annotation of proteins. GONUTS uses wiki technology to allow registered users to share and edit notes on the use of each term in GO, and to contribute annotations for specific genes of interest. By providing a site for generation of third-party documentation at the granularity of individual terms, GONUTS complements the official documentation of the Gene Ontology Consortium. To provide examples for community users, GONUTS displays the complete GO annotations from seven model organisms: Saccharomyces cerevisiae, Dictyostelium discoideum, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus and Arabidopsis thaliana. To support community annotation, GONUTS allows automated creation of gene pages for gene products in UniProt. GONUTS will improve the consistency of annotation efforts across genome projects, and should be useful in training new annotators and consumers in the production of GO annotations and the use of GO terms. GONUTS can be accessed at http://gowiki.tamu.edu. The source code for generating the content of GONUTS is available upon request.
The Gene Ontology (GO) project develops and uses a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences. The GO Consortium continues to improve the vocabulary content, reflecting the impact of several novel mechanisms of incorporating community input. A growing number of model organism databases and genome annotation groups contribute annotation sets using GO terms to GO's public repository. Updates to the AmiGO browser have improved access to contributed genome annotations. As the GO project continues to grow, the use of the GO vocabularies is becoming more varied as well as more widespread. The GO project provides an ontological annotation system that enables biologists to infer knowledge from large amounts of data.
The Gene Ontology (GO) Consortium (http://www.geneontology.org) (GOC) continues to develop, maintain and use a set of structured, controlled vocabularies for the annotation of genes, gene products and sequences. The GO ontologies are expanding both in content and in structure. Several new relationship types have been introduced and used, along with existing relationships, to create links between and within the GO domains. These improve the representation of biology, facilitate querying, and allow GO developers to systematically check for and correct inconsistencies within the GO. Gene product annotation using GO continues to increase both in the number of total annotations and in species coverage. GO tools, such as OBO-Edit, an ontology-editing tool, and AmiGO, the GOC ontology browser, have seen major improvements in functionality, speed and ease of use.
The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification.
In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community.
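The general idea of top-down hierarchical classification can be sketched as follows. This is a minimal illustration, not the algorithms evaluated in the study: the three-level graph fragment, the keyword sets and the `score` function are hypothetical stand-ins for the Gene Ontology graph and for trained per-term classifiers.

```python
# Hypothetical fragment of a term graph: parent -> children.
CHILDREN = {
    "biological_process": ["transport", "development"],
    "transport": ["ion transport", "protein transport"],
    "development": ["heart development"],
}

# Stand-in for trained per-term classifiers: keywords that support each term.
KEYWORDS = {
    "transport": {"transport", "channel"},
    "ion transport": {"ion", "channel"},
    "protein transport": {"secretion"},
    "development": {"embryo", "morphogenesis"},
    "heart development": {"cardiac", "heart"},
}


def score(term, tokens):
    """Toy local classifier: fraction of the term's keywords found in the text."""
    keywords = KEYWORDS[term]
    return len(keywords & tokens) / len(keywords)


def top_down_predict(text, root="biological_process", threshold=0.5):
    """Descend from the root, visiting children only of accepted terms."""
    tokens = set(text.lower().split())
    predicted = set()
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in CHILDREN.get(node, []):
            if child not in predicted and score(child, tokens) >= threshold:
                predicted.add(child)
                frontier.append(child)  # only accepted terms are expanded
    return sorted(predicted)


print(top_down_predict("the ion channel mediates transport"))
print(top_down_predict("heart"))  # parent 'development' is rejected, so nothing is predicted
```

Because a rejected parent prunes its whole subtree, each local classifier only has to separate a term from its siblings, which is one way the graph structure can ease the imbalance of a flat multi-label problem.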
Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
The Gene Ontology (GO) (http://www.geneontology.org) is a community bioinformatics resource that represents gene product function through the use of structured, controlled vocabularies. The number of GO annotations of gene products has increased due to curation efforts among GO Consortium (GOC) groups, including focused literature-based annotation and ortholog-based functional inference. The GO ontologies continue to expand and improve as a result of targeted ontology development, including the introduction of computable logical definitions and development of new tools for the streamlined addition of terms to the ontology. The GOC continues to support its user community through the use of e-mail lists, social media and web-based resources.
The study of biodiversity spans many disciplines and includes data pertaining to species distributions and abundances, genetic sequences, trait measurements, and ecological niches, complemented by information on collection and measurement protocols. A review of the current landscape of metadata standards and ontologies in biodiversity science suggests that existing standards such as the Darwin Core terminology are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. Existing ontologies, such as the Gene Ontology and others in the Open Biological and Biomedical Ontologies (OBO) Foundry library, provide a semantic structure but lack many of the necessary terms to describe biodiversity data in all its dimensions. In this paper, we describe the motivation for and ongoing development of a new Biological Collections Ontology, the Environment Ontology, and the Population and Community Ontology. These ontologies share the aim of improving data aggregation and integration across the biodiversity domain and can be used to describe physical samples and sampling processes (for example, collection, extraction, and preservation techniques), as well as biodiversity observations that involve no physical sampling. Together they encompass studies of: 1) individual organisms, including voucher specimens from ecological studies and museum specimens, 2) bulk or environmental samples (e.g., gut contents, soil, water) that include DNA, other molecules, and potentially many organisms, especially microbes, and 3) survey-based ecological observations. We discuss how these ontologies can be applied to biodiversity use cases that span genetic, organismal, and ecosystem levels of organization. We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers.
There is currently a gap between the rich and expressive collection of published biomedical ontologies, and the natural language expression of biomedical papers consumed on a daily basis by scientific researchers. The purpose of this paper is to provide an open, shareable structure for dynamic integration of biomedical domain ontologies with the scientific document, in the form of an Annotation Ontology (AO), thus closing this gap and enabling application of formal biomedical ontologies directly to the literature as it emerges.
Initial requirements for AO were elicited by analysis of integration needs between biomedical web communities, and of needs for representing and integrating results of biomedical text mining. Analysis of strengths and weaknesses of previous efforts in this area was also performed. A series of increasingly refined annotation tools were then developed along with a metadata model in OWL, and deployed to users at a major pharmaceutical company and a major academic center for feedback and additional requirements. Further requirements and critiques of the model were also elicited through discussions with many colleagues and incorporated into the work.
This paper presents Annotation Ontology (AO), an open ontology in OWL-DL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables “stand-off” or independent metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under update control of the annotator. AO contains a provenance model to support versioning, and a set model for specifying groups and containers of annotation. AO is freely available under open source license at http://purl.org/ao/, and extensive documentation including screencasts is available on AO’s Google Code page: http://code.google.com/p/annotation-ontology/ .
The Annotation Ontology meets critical requirements for an open, freely shareable model in OWL, of annotation metadata created against scientific documents on the Web. We believe AO can become a very useful common model for annotation metadata on Web documents, and will enable biomedical domain ontologies to be used quite widely to annotate the scientific literature. Potential collaborators and those with new relevant use cases are invited to contact the authors.
The Gene Ontology project is a collaborative effort to provide descriptions of gene products in a consistent and computable language, and in a species-independent manner. The Gene Ontology is designed to be applicable to all organisms but up to now has been largely under-utilized for prokaryotes and viruses, in part because of a lack of appropriate ontology terms.
To address this issue, we have developed a set of Gene Ontology classes that are applicable to microbes and their hosts, improving both coverage and quality in this area of the Gene Ontology. Describing microbial and viral gene products brings with it the additional challenge of capturing both the host and the microbe. Recognising this, we have worked closely with annotation groups to test and optimize the GO classes, and we describe here a set of annotation guidelines that allow the controlled description of two interacting organisms.
Building on the microbial resources already in existence such as ViralZone, UniProtKB keywords and MeGO, this project provides an integrated ontology to describe interactions between microbial species and their hosts, with mappings to the external resources above. Housing this information within the freely-accessible Gene Ontology project allows the classes and annotation structure to be utilized by a large community of biologists and users.
Annotation; Gene Ontology; Host; Ontology; Virus
Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively.
The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve an 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as the quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves an 80% success rate on average.
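The shortest-path idea behind 'Closest Sense' can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the toy graph, its term names and the choice to pick the sense with the single shortest path to any co-occurring term are all hypothetical.

```python
from collections import deque

# Hypothetical toy ontology; term names are illustrative, not real GO or MeSH entries.
EDGES = [
    ("root", "cellular_component"),
    ("root", "anatomy"),
    ("cellular_component", "cell_nucleus"),
    ("cell_nucleus", "chromatin"),
    ("cell_nucleus", "nuclear_envelope"),
    ("anatomy", "brain"),
    ("brain", "brain_nucleus"),
    ("brain", "hippocampus"),
]


def build_graph(edges):
    """Store the ontology as an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, senses, context_terms):
    """Pick the sense with the shortest path to any co-occurring term."""
    best_sense, best_dist = None, None
    for sense in senses:
        for term in context_terms:
            dist = shortest_path_length(graph, term, sense)
            if dist is not None and (best_dist is None or dist < best_dist):
                best_sense, best_dist = sense, dist
    return best_sense


GRAPH = build_graph(EDGES)
SENSES = ["cell_nucleus", "brain_nucleus"]  # two senses of the ambiguous label "nucleus"

print(closest_sense(GRAPH, SENSES, ["chromatin", "nuclear_envelope"]))  # cell_nucleus
print(closest_sense(GRAPH, SENSES, ["hippocampus"]))                    # brain_nucleus
```

The sketch also shows why the method depends on the ontology: if either sense or the co-occurring terms are missing from the graph, no path exists and no decision can be made.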
Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training but does require an ontology that is both large and consistently modelled, two conditions that tend to oppose each other. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation.
The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
Biomedical ontologies are being widely used to annotate biological data in a computer-accessible, consistent and well-defined manner. However, due to their size and complexity, annotating data with appropriate terms from an ontology is often challenging for experts and non-experts alike, because there exist few tools that allow one to quickly find relevant ontology terms to easily populate a web form.
Adult hippocampal neurogenesis is not a single phenotype, but consists of a number of sub-processes, each of which is under complex genetic control. Interpretation of gene expression studies using existing resources often does not lead to results that address the interrelatedness of these processes. Formal structure, such as provided by ontologies, is essential in any field for comprehensive interpretation of existing knowledge but, until now, such a structure has been lacking for adult neurogenesis.
We have created a resource with three components: (1) a structured ontology describing the key stages in the development of adult hippocampal neural stem cells into functional granule cell neurons; (2) a comprehensive survey of the literature to annotate the results of all published reports on gene function in adult hippocampal neurogenesis (257 manuscripts covering 228 genes) to the appropriate terms in our ontology; and (3) an easy-to-use searchable interface to the resulting database, made freely available online. The manuscript presents an overview of the database highlighting global trends such as the current bias towards research on early proliferative stages, and an example gene set enrichment analysis. A limitation of the resource is the current scope of the literature which, however, is growing by around 100 publications per year. With the ontology and database in place, new findings can be rapidly annotated and regular updates of the database will be made publicly available.
The resource we present allows relevant interpretation of gene expression screens in terms of defined stages of postnatal neuronal development. Annotation of genes by hand from the adult neurogenesis literature ensures the data are directly applicable to the system under study. We believe this approach could also serve as an example to other fields in a ‘bottom-up’ community effort complementing the already successful ‘top-down’ approach of the Gene Ontology.
After 10 years of use of AFLP (Amplified Fragment Length Polymorphism) technology for DNA fingerprinting and mRNA profiling, large repertories of genome- and transcriptome-derived sequences are available in public databases for model, crop and tree species. AFLP marker systems have been and are being extensively exploited for genome scanning and gene mapping, as has cDNA-AFLP for transcriptome profiling and the cloning of differentially expressed genes. The evaluation, annotation and classification of genomic markers and expressed transcripts would be of great utility for both functional genomics and systems biology research in plants. This may be achieved by means of the Gene Ontology (GO), consisting of three structured vocabularies (i.e. ontologies) describing genes, transcripts and proteins of any organism in terms of their associated cellular component, biological process and molecular function in a species-independent manner. In this paper, the functional annotation of about 8,000 AFLP-derived ESTs retrieved from the NCBI databases was carried out by using GO terminology.
Descriptive statistics on the type, size and nature of gene sequences obtained by means of AFLP technology were calculated. The gene products associated with mRNA transcripts were then classified according to the three main GO vocabularies. A comparison of the functional content of cDNA-AFLP records was also performed by splitting the sequence dataset into monocots and dicots and by comparing them to all annotated ESTs of Arabidopsis and rice, respectively. On the whole, the statistical parameters adopted for the in silico AFLP-derived transcriptome-anchored sequence analysis proved to be critical for obtaining reliable GO results. Such an exhaustive annotation may offer a suitable platform for functional genomics, particularly useful in non-model species.
Reliable GO annotations of AFLP-derived sequences can be gathered by optimizing the experimental steps and the statistical parameters adopted. The Blast2GO software was shown to be a comprehensive bioinformatics solution for annotation-based functional analysis. According to the whole set of GO annotations, the AFLP technology generates thorough information for angiosperm gene products and shares common features across angiosperm species and families. The utility of this technology for structural and functional genomics in plants can be enhanced by serial annotation analyses of genome-anchored fragments and organ/tissue-specific repertories of transcriptome-derived fragments.
Bio-ontologies provide terminologies for the scientific community to describe biomedical entities in a standardized manner. There are multiple initiatives that are developing biomedical terminologies for the purpose of providing better annotation, data integration and mining capabilities. Terminology resources devised for multiple purposes inherently diverge in content and structure. A major issue of biomedical data integration is the development of overlapping terms, ambiguous classifications and inconsistencies represented across databases and publications. The disease ontology (DO) was developed over the past decade to address data integration, standardization and annotation issues for human disease data. We have established a DO cancer project to be a focused view of cancer terms within the DO. The DO cancer project mapped 386 cancer terms from the Catalogue of Somatic Mutations in Cancer (COSMIC), The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium, Therapeutically Applicable Research to Generate Effective Treatments, Integrative Oncogenomics and the Early Detection Research Network into a cohesive set of 187 DO terms represented by 63 top-level DO cancer terms. For example, the COSMIC term ‘kidney, NS, carcinoma, clear_cell_renal_cell_carcinoma’ and TCGA term ‘Kidney renal clear cell carcinoma’ were both grouped to the term ‘Disease Ontology Identification (DOID):4467 / renal clear cell carcinoma’ which was mapped to the TopNodes_DOcancerslim term ‘DOID:263 / kidney cancer’. Mapping of diverse cancer terms to DO and the use of top level terms (DO slims) will enable pan-cancer analysis across datasets generated from any of the cancer term sources where pan-cancer means including or relating to all or multiple types of cancer. The terms can be browsed from the DO web site (http://www.disease-ontology.org) and downloaded from the DO’s Apache Subversion or GitHub repositories.
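The two-step grouping described above (source term to DO term, then DO term to top-level slim term) can be sketched as a pair of lookup tables. This is a minimal illustration, not the DO cancer project's actual tooling: the table and function names are hypothetical, and only the single kidney example quoted in the text is encoded.

```python
from collections import defaultdict

# (source, source term) -> DO term. Only the example quoted in the text is
# included; the real mapping covers 386 source terms.
SOURCE_TO_DO = {
    ("COSMIC", "kidney, NS, carcinoma, clear_cell_renal_cell_carcinoma"): "DOID:4467",
    ("TCGA", "Kidney renal clear cell carcinoma"): "DOID:4467",
}

# DO term -> top-level slim term (TopNodes_DOcancerslim in the text).
DO_TO_SLIM = {"DOID:4467": "DOID:263"}

DO_LABELS = {
    "DOID:4467": "renal clear cell carcinoma",
    "DOID:263": "kidney cancer",
}


def to_slim(source, term):
    """Return the slim DOID for a source term, or None if the term is unmapped."""
    do_id = SOURCE_TO_DO.get((source, term))
    if do_id is None:
        return None
    # Fall back to the DO term itself when no slim parent is recorded.
    return DO_TO_SLIM.get(do_id, do_id)


def group_by_slim(records):
    """Group (source, term) records from different resources under slim terms."""
    groups = defaultdict(list)
    for source, term in records:
        slim = to_slim(source, term)
        if slim is not None:
            groups[slim].append((source, term))
    return dict(groups)


records = [
    ("COSMIC", "kidney, NS, carcinoma, clear_cell_renal_cell_carcinoma"),
    ("TCGA", "Kidney renal clear cell carcinoma"),
]
for slim, members in group_by_slim(records).items():
    print(slim, DO_LABELS[slim], len(members))
```

Grouping at the slim level is what allows records from COSMIC and TCGA to be counted together in a pan-cancer analysis even though their original labels differ.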
There is an increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to enable more formal biocuration workflow results and ultimately to improve analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO) for annotating aspects of gene products and the Molecular Interaction ontology (PSI-MI) used by databases that archive protein–protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein–protein interaction data and PSI-MI terms referring to interaction detection methods.
The Gene Ontology (GO) project (http://www.geneontology.org/) provides a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (also see http://www.sequenceontology.org/). The ontologies have been extended and refined for several biological areas, and improvements to the structure of the ontologies have been implemented. To improve the quantity and quality of gene product annotations available from its public repository, the GO Consortium has launched a focused effort to provide comprehensive and detailed annotation of orthologous genes across a number of ‘reference’ genomes, including human and several key model organisms. Software developments include two releases of the ontology-editing tool OBO-Edit, and improvements to the AmiGO browser interface.
Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data.
In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products from article abstracts in PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns" and a semantic matching framework to locate phrases matching a pattern and to produce Gene Ontology annotations for genes and gene products.
In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general.
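For readers who want a single figure combining the two reported values, the harmonic mean (F-measure) can be computed directly. The abstract itself does not report an F-measure; the sketch below is only arithmetic on the stated precision and recall, and the function name is an illustrative choice.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values stated in the text: precision 78%, recall 61%.
f1 = f_measure(0.78, 0.61)
print(f"F1 = {f1:.3f}")  # F1 = 0.685
```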
GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed from the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics. The relatively low recall of our pattern-based approach may be improved either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data or, alternatively, by lowering the statistical enrichment threshold for applications that put more value on achieving higher recall.
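The difference between exact and semantic pattern matching can be illustrated with a toy example. This is not GEANN's implementation: the synonym groups below are a hypothetical stand-in for a lexical resource such as WordNet, and the pattern, sentence and function names are invented for illustration.

```python
# Hypothetical synonym groups standing in for a lexical resource such as WordNet.
SYNONYMS = [
    {"regulates", "controls", "modulates"},
    {"required", "essential", "necessary"},
]


def same_meaning(a, b):
    """True if two words are identical or share a synonym group."""
    a, b = a.lower(), b.lower()
    return a == b or any(a in group and b in group for group in SYNONYMS)


def find_pattern(tokens, pattern, semantic=False):
    """Return the start index of the first window matching the pattern, or -1."""
    if semantic:
        match = same_meaning
    else:
        match = lambda a, b: a.lower() == b.lower()
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        if all(match(t, p) for t, p in zip(tokens[i:i + n], pattern)):
            return i
    return -1


tokens = "The kinase controls cell growth".split()
pattern = ["regulates", "cell", "growth"]

print(find_pattern(tokens, pattern))                 # exact matching misses the phrase
print(find_pattern(tokens, pattern, semantic=True))  # semantic matching finds it
```

Relaxing the word comparison increases the number of pattern occurrences found, which is the trade-off the text describes: more approximate matches, at the risk of matching phrases whose meaning has drifted from the pattern.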
Personalised medicine provides patients with treatments that are specific to their genetic profiles. It requires efficient data sharing of disparate data types across a variety of scientific disciplines, such as molecular biology, pathology, radiology and clinical practice. Personalised medicine aims to offer the safest and most effective therapeutic strategy based on the gene variations of each subject. This is particularly true in oncology, where knowledge about genetic mutations has already led to new therapies. Current molecular biology techniques (microarrays, proteomics, epigenetic technology and improved DNA sequencing technology) enable better characterisation of cancer tumours. The vast amounts of data, however, coupled with the use of different terms (semantic heterogeneity) in each discipline, make the retrieval and integration of information difficult.
Existing software infrastructures for data-sharing in the cancer domain, such as caGrid, support access to distributed information. caGrid follows a service-oriented model-driven architecture. Each data source in caGrid is associated with metadata at increasing levels of abstraction, including syntactic, structural, reference and domain metadata. The domain metadata consists of ontology-based annotations associated with the structural information of each data source. However, caGrid's current querying functionality is given at the structural metadata level, without capitalising on the ontology-based annotations. This paper presents the design of and theoretical foundations for distributed ontology-based queries over cancer research data. Concept-based queries are reformulated to the target query language, where join conditions between multiple data sources are found by exploiting the semantic annotations. The system has been implemented, as a proof of concept, over the caGrid infrastructure. The approach is applicable to other model-driven architectures. A graphical user interface has been developed, supporting ontology-based queries over caGrid data sources. An extensive evaluation of the query reformulation technique is included.
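The idea of finding join conditions from semantic annotations can be sketched as follows: if attributes in two data sources are annotated with the same ontology concept, they are candidates for a join. This is a simplified illustration under assumed data, not the paper's reformulation algorithm or the caGrid API; the service names, attribute names and concept labels are all hypothetical.

```python
from itertools import combinations

# Hypothetical annotations: data source -> {attribute: ontology concept}.
ANNOTATIONS = {
    "PathologyService": {
        "patientId": "Patient_Identifier",
        "grade": "Tumor_Grade",
    },
    "ImagingService": {
        "subject": "Patient_Identifier",
        "modality": "Imaging_Modality",
    },
}


def find_join_conditions(annotations):
    """List attribute pairs from different sources that share an ontology concept."""
    joins = []
    # Sort by source name so the output order is deterministic.
    for (src_a, attrs_a), (src_b, attrs_b) in combinations(sorted(annotations.items()), 2):
        for attr_a, concept_a in sorted(attrs_a.items()):
            for attr_b, concept_b in sorted(attrs_b.items()):
                if concept_a == concept_b:
                    joins.append((f"{src_a}.{attr_a}", f"{src_b}.{attr_b}", concept_a))
    return joins


for left, right, concept in find_join_conditions(ANNOTATIONS):
    print(f"{left} = {right}  (shared concept: {concept})")
```

A concept-based query that mentions both tumour grade and imaging modality could then be rewritten into source-level queries linked by the discovered condition, without the user needing to know how each source names its patient identifier.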
To support personalised medicine in oncology, it is crucial to retrieve and integrate molecular, pathology, radiology and clinical data in an efficient manner. The semantic heterogeneity of the data makes this a challenging task. Ontologies provide a formal framework to support querying and integration. This paper provides an ontology-based solution for querying distributed databases over service-oriented, model-driven infrastructures.