Phenotype ontologies are used in species-specific databases for the annotation of mutagenesis experiments and to characterize human diseases. The Entity-Quality (EQ) formalism is a means to describe complex phenotypes based on one or more affected entities and a quality. EQ-based definitions have been developed for many phenotype ontologies, including the Human and Mammalian Phenotype ontologies.
We analyze formalizations of complex phenotype descriptions in the Web Ontology Language (OWL) that are based on the EQ model, identify several representational challenges and analyze potential solutions to address these challenges.
In particular, we suggest a novel, role-based approach to represent relational qualities such as concentration of iron in spleen, discuss its ontological foundation in the General Formal Ontology (GFO) and evaluate its representation in OWL and the benefits it can bring to the representation of phenotype annotations.
Our analysis of OWL-based representations of phenotypes can contribute to improving consistency and expressiveness of formal phenotype descriptions.
Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology.
Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses.
The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.
The mouse has long been an important model for the study of human genetic disease. Through the application of genetic engineering and mutagenesis techniques, the number of unique mutant mouse models and the amount of phenotypic data describing them are growing exponentially. Describing phenotypes of mutant mice in a computationally useful manner that will facilitate data mining is a major challenge for bioinformatics. Here we describe a tool, the Mammalian Phenotype Ontology (MP), for classifying and organizing phenotypic information related to the mouse and other mammalian species. The MP Ontology has been applied to mouse phenotype descriptions in the Mouse Genome Informatics Database (MGI, http://www.informatics.jax.org/), the Rat Genome Database (RGD, http://rgd.mcw.edu), the Online Mendelian Inheritance in Animals (OMIA, http://omia.angis.org.au/) and elsewhere. Use of this ontology allows comparisons of data from diverse sources, can facilitate comparisons across mammalian species, assists in identifying appropriate experimental disease models, and aids in the discovery of candidate disease genes and molecular signaling pathways.
Ontology; Phenotype; Mammal; Annotation; Model System
This paper describes an approach to providing computer-interpretable logical definitions for the terms of the Human Phenotype Ontology (HPO) using PATO, the ontology of phenotypic qualities, to link terms of the HPO to the anatomic and other entities that are affected by abnormal phenotypic qualities. This approach will allow improved computerized reasoning as well as a facility to compare phenotypes between different species. The PATO mapping will also provide direct links from phenotypic abnormalities and underlying anatomic structures encoded using the Foundational Model of Anatomy, which will be a valuable resource for computational investigations of the links between anatomical components and concepts representing diseases with abnormal phenotypes and associated genes.
Neurodegenerative diseases present a wide and complex range of biological and clinical features. Animal models are key to translational research, yet typically only exhibit a subset of disease features rather than being precise replicas of the disease. Consequently, connecting animal to human conditions using direct data-mining strategies has proven challenging, particularly for diseases of the nervous system, with its complicated anatomy and physiology. To address this challenge we have explored the use of ontologies to create formal descriptions of structural phenotypes across scales that are machine processable and amenable to logical inference. As proof of concept, we built a Neurodegenerative Disease Phenotype Ontology (NDPO) and an associated Phenotype Knowledge Base (PKB) using an entity-quality model that incorporates descriptions for both human disease phenotypes and those of animal models. Entities are drawn from community ontologies made available through the Neuroscience Information Framework (NIF) and qualities are drawn from the Phenotype and Trait Ontology (PATO). We generated ~1200 structured phenotype statements describing structural alterations at the subcellular, cellular and gross anatomical levels observed in 11 human neurodegenerative conditions and associated animal models. PhenoSim, an open source tool for comparing phenotypes, was used to issue a series of competency questions to compare individual phenotypes among organisms and to determine which animal models recapitulate phenotypic aspects of the human disease in aggregate. Overall, the system was able to use relationships within the ontology to bridge phenotypes across scales, returning non-trivial matches based on common subsumers that were meaningful to a neuroscientist with an advanced knowledge of neuroanatomy. The system can be used both to compare individual phenotypes and also phenotypes in aggregate. This proof of concept suggests that expressing complex phenotypes using formal ontologies provides considerable benefit for comparing phenotypes across scales and species.
phenotype; ontology; Neuroscience Information Framework; neurodegenerative disease; semantics
The evolving complexity of genome-scale experiments has increasingly centralized the role of a highly computable, accurate, and comprehensive resource spanning multiple biological scales and viewpoints. To provide a resource to meet this need, we have significantly extended the PhenoGO database with gene-disease specific annotations and included an additional ten species. This a computationally-derived resource is primarily intended to provide phenotypic context (cell type, tissue, organ, and disease) for mining existing associations between gene products and GO terms specified in the Gene Ontology Databases Automated natural language processing (BioMedLEE) and computational ontology (PhenOS) methods were used to derive these relationships from the literature, expanding the database with information from ten additional species to include over 600,000 phenotypic contexts spanning eleven species from five GO annotation databases. A comprehensive evaluation evaluating the mappings (n = 300) found precision (positive predictive value) at 85%, and recall (sensitivity) at 76%. Phenotypes are encoded in general purpose ontologies such as Cell Ontology, the Unified Medical Language System, and in specialized ontologies such as the Mouse Anatomy and the Mammalian Phenotype Ontology. A web portal has also been developed, allowing for advanced filtering and querying of the database as well as download of the entire dataset .
Phenotypic differences among species have long been systematically itemized and described by biologists in the process of investigating phylogenetic relationships and trait evolution. Traditionally, these descriptions have been expressed in natural language within the context of individual journal publications or monographs. As such, this rich store of phenotype data has been largely unavailable for statistical and computational comparisons across studies or integration with other biological knowledge.
Here we describe Phenex, a platform-independent desktop application designed to facilitate efficient and consistent annotation of phenotypic similarities and differences using Entity-Quality syntax, drawing on terms from community ontologies for anatomical entities, phenotypic qualities, and taxonomic names. Phenex can be configured to load only those ontologies pertinent to a taxonomic group of interest. The graphical user interface was optimized for evolutionary biologists accustomed to working with lists of taxa, characters, character states, and character-by-taxon matrices.
Annotation of phenotypic data using ontologies and globally unique taxonomic identifiers will allow biologists to integrate phenotypic data from different organisms and studies, leveraging decades of work in systematics and comparative morphology.
Biomedical ontologies are emerging as critical tools in genomic and proteomic research, where complex data in disparate resources need to be integrated. A number of ontologies describe properties that can be attributed to proteins. For example, protein functions are described by the Gene Ontology (GO) and human diseases by SNOMED CT or ICD10. There is, however, a gap in the current set of ontologies – one that describes the protein entities themselves and their relationships. We have designed the PRotein Ontology (PRO) to facilitate protein annotation and to guide new experiments. The components of PRO extend from the classification of proteins on the basis of evolutionary relationships to the representation of the multiple protein forms of a gene (products generated by genetic variation, alternative splicing, proteolytic cleavage, and other post-translational modifications). PRO will allow the specification of relationships between PRO, GO and other ontologies in the OBO Foundry. Here we describe the initial development of PRO, illustrated using human and mouse proteins involved in the transforming growth factor-beta and bone morphogenetic protein signaling pathways.
Natural language processing (NLP) is a high throughput technology because it can process vast quantities of text within a reasonable time period. It has the potential to substantially facilitate biomedical research by extracting, linking, and organizing massive amounts of information that occur in biomedical journal articles as well as in textual fields of biological databases. Until recently, much of the work in biological NLP and text mining has revolved around recognizing the occurrence of biomolecular entities in articles, and in extracting particular relationships among the entities. Now, researchers have recognized a need to link the extracted information to ontologies or knowledge bases, which is a more difficult task. One such knowledge base is Gene Ontology annotations (GOA), which significantly increases semantic computations over the function, cellular components and processes of genes. For multicellular organisms, these annotations can be refined with phenotypic context, such as the cell type, tissue, and organ because establishing phenotypic contexts in which a gene is expressed is a crucial step for understanding the development and the molecular underpinning of the pathophysiology of diseases. In this paper, we propose a system, PhenoGO, which automatically augments annotations in GOA with additional context. PhenoGO utilizes an existing NLP system, called BioMedLEE, an existing knowledge-based phenotype organizer system (PhenOS) in conjunction with MeSH indexing and established biomedical ontologies. More specifically, PhenoGO adds phenotypic contextual information to existing associations between gene products and GO terms as specified in GOA. The system also maps the context to identifiers that are associated with different biomedical ontologies, including the UMLS, Cell Ontology, Mouse Anatomy, NCBI taxonomy, GO, and Mammalian Phenotype Ontology. In addition, PhenoGO was evaluated for coding of anatomical and cellular information and assigning the coded phenotypes to the correct GOA; results obtained show that PhenoGO has a precision of 91% and recall of 92%, demonstrating that the PhenoGO NLP system can accurately encode a large number of anatomical and cellular ontologies to GO annotations. The PhenoGO Database may be accessed at the following URL: http://www.phenoGO.org
The laboratory mouse is the pre-eminent model organism for the dissection of human disease pathways. With the advent of a comprehensive panel of gene knockouts, projects to characterise the phenotypes of all knockout lines are being initiated. The range of genotype-phenotype associations can be represented using the Mammalian Phenotype ontology. Using publicly available data annotated with this ontology we have constructed gene and phenotype networks representing these associations. These networks show a scale-free, hierarchical and modular character and community structure. They also exhibit enrichment for gene coexpression, protein-protein interactions and Gene Ontology annotation similarity. Close association between gene communities and some high-level ontology terms suggests that systematic phenotyping can provide a direct insight into underlying pathways. However some phenotypes are distributed more diffusely across gene networks, likely reflecting the pleiotropic roles of many genes. Phenotype communities show a many-to-many relationship to human disease communities, but stronger overlap at more granular levels of description. This may suggest that systematic phenotyping projects should aim for high granularity annotations to maximise their relevance to human disease.
We introduce a principled computational framework and methodology for automated discovery of context-specific functional links between ontologies. Our model leverages over disparate free-text literature resources to score the model of dependency linking two terms under a context against their model of independence. We identify linked terms as those having a significant bayes factor (p < 0.01). To scale our algorithm over massive ontologies, we propose a heuristic pruning technique as an efficient algorithm for inferring such links.
We have applied this method to translationalize Gene Ontology to all other ontologies available at National Center of Biomedical Ontology (NCBO) BioPortal under the context of Human Disease ontology. Our results show that in addition to broadening the scope of hypothesis for researchers, our work can potentially be used to explore continuum of relationships among ontologies to guide various biological experiments.
The Plant Ontology (PO; http://www.plantontology.org/) is a publicly available, collaborative effort to develop and maintain a controlled, structured vocabulary (‘ontology’) of terms to describe plant anatomy, morphology and the stages of plant development. The goals of the PO are to link (annotate) gene expression and phenotype data to plant structures and stages of plant development, using the data model adopted by the Gene Ontology. From its original design covering only rice, maize and Arabidopsis, the scope of the PO has been expanded to include all green plants. The PO was the first multispecies anatomy ontology developed for the annotation of genes and phenotypes. Also, to our knowledge, it was one of the first biological ontologies that provides translations (via synonyms) in non-English languages such as Japanese and Spanish. As of Release #18 (July 2012), there are about 2.2 million annotations linking PO terms to >110,000 unique data objects representing genes or gene models, proteins, RNAs, germplasm and quantitative trait loci (QTLs) from 22 plant species. In this paper, we focus on the plant anatomical entity branch of the PO, describing the organizing principles, resources available to users and examples of how the PO is integrated into other plant genomics databases and web portals. We also provide two examples of comparative analyses, demonstrating how the ontology structure and PO-annotated data can be used to discover the patterns of expression of the LEAFY (LFY) and terpene synthase (TPS) gene homologs.
Bioinformatics; Comparative genomics; Genome annotation; Ontology; Plant anatomy; Terpene synthase
The Mouse Genome Informatics (MGI; http://www.informatics.jax.org/) database integrates genetic and genomic data with the primary mission of facilitating the use of the mouse as a model system for understanding human biology and disease processes. MGI is the authoritative source of official mouse genetic nomenclature, gene ontology annotations, mammalian phenotype annotations, and mouse anatomy terms. MGI staff enforce the use of standardized genetic nomenclature, ontologies, and controlled vocabularies to describe mouse sequence data, genes, strains, expression data, alleles, and phenotypes. Extensive links between gene-centric information in MGI and other informatics resources (e.g., OMIM, Ensembl, UCSC, NCBI, UniProt) are maintained and updated on a regular basis.
Using the Web-based query interfaces for MGI, users can query for a mouse gene or genes according to diverse biological attributes of those genes, including phenotype associations, gene expression, functional annotation, and genome location. The MGI MouseBLAST server allows users to interrogate the MGI database using nucleotide and/or protein sequences. Functional and phenotypic data from MGI can be viewed in a broader genomic context using an interactive genome browser called Mouse GBrowse. The power of the MGI database as a research tool for biomedicine stems from the degree to which data from diverse sources are integrated. Integration, in turn, allows the data to be evaluated in new contexts. For example, integration makes possible such complex queries as “Find all genes from Chromosome 1 where the function is annotated as transcription factor and there is a knockout allele that results in eye dysmorphology.”
Numerous disease syndromes are associated with regions of copy number variation (CNV) in the human genome and, in most cases, the pathogenicity of the CNV is thought to be related to altered dosage of the genes contained within the affected segment. However, establishing the contribution of individual genes to the overall pathogenicity of CNV syndromes is difficult and often relies on the identification of potential candidates through manual searches of the literature and online resources. We describe here the development of a computational framework to comprehensively search phenotypic information from model organisms and single-gene human hereditary disorders, and thus speed the interpretation of the complex phenotypes of CNV disorders. There are currently more than 5000 human genes about which nothing is known phenotypically but for which detailed phenotypic information for the mouse and/or zebrafish orthologs is available. Here, we present an ontology-based approach to identify similarities between human disease manifestations and the mutational phenotypes in characterized model organism genes; this approach can therefore be used even in cases where there is little or no information about the function of the human genes. We applied this algorithm to detect candidate genes for 27 recurrent CNV disorders and identified 802 gene-phenotype associations, approximately half of which involved genes that were previously reported to be associated with individual phenotypic features and half of which were novel candidates. A total of 431 associations were made solely on the basis of model organism phenotype data. Additionally, we observed a striking, statistically significant tendency for individual disease phenotypes to be associated with multiple genes located within a single CNV region, a phenomenon that we denote as pheno-clustering. Many of the clusters also display statistically significant similarities in protein function or vicinity within the protein-protein interaction network. Our results provide a basis for understanding previously un-interpretable genotype-phenotype correlations in pathogenic CNVs and for mobilizing the large amount of model organism phenotype data to provide insights into human genetic disorders.
The Mouse Genome Database, (MGD, http://www.informatics.jax.org/), integrates genetic, genomic and phenotypic information about the laboratory mouse, a primary animal model for studying human biology and disease. MGD data content includes comprehensive characterization of genes and their functions, standardized descriptions of mouse phenotypes, extensive integration of DNA and protein sequence data, normalized representation of genome and genome variant information including comparative data on mammalian genes. Data within MGD are obtained from diverse sources including manual curation of the biomedical literature, direct contributions from individual investigator's laboratories and major informatics resource centers such as Ensembl, UniProt and NCBI. MGD collaborates with the bioinformatics community on the development of data and semantic standards such as the Gene Ontology (GO) and the Mammalian Phenotype (MP) Ontology. MGD provides a data-mining platform that enables the development of translational research hypotheses based on comparative genotype, phenotype and functional analyses. Both web-based querying and computational access to data are provided. Recent improvements in MGD described here include the association of gene trap data with mouse genes and a new batch query capability for customized data access and retrieval.
Cataloguing gene expression during development of the genitourinary tract will increase our understanding not only of this process but also of congenital defects and disease affecting this organ system. We have developed a high-resolution ontology with which to describe the subcompartments of the developing murine genitourinary tract. This ontology incorporates what can be defined histologically and begins to encompass other structures and cell types already identified at the molecular level. The ontology is being used to annotate in situ hybridisation data generated as part of the Genitourinary Development Molecular Anatomy Project (GUDMAP), a publicly available data resource on gene and protein expression during genitourinary development. The GUDMAP ontology encompasses Theiler stage (TS) 17 to 27 of development as well as the sexually mature adult. It has been written as a partonomic, text-based, hierarchical ontology that, for the embryological stages, has been developed as a high-resolution expansion of the existing Edinburgh Mouse Atlas Project (EMAP) ontology. It also includes group terms for well-characterised structural and/or functional units comprising several sub-structures, such as the nephron and juxtaglomerular complex. Each term has been assigned a unique identification number. Synonyms have been used to improve the success of query searching and maintain wherever possible existing EMAP terms relating to this organ system. We describe here the principles and structure of the ontology and provide representative diagrammatic, histological, and whole mount and section RNA in situ hybridisation images to clarify the terms used within the ontology. Visual examples of how terms appear in different specimen types are also provided.
genitourinary development; renal development; kidney development; urinary development; reproductive development; kidney; gonad; bladder; ureter; urethra; genital tubercle; ovary; testis; congenital defects; gene expression; ontology; annotation; database; anatomy; atlas of development; partonomic ontology; RNA in situ hybridisation
Autism spectrum disorders (ASD) represent a group of developmental disabilities with a strong genetic basis. The laboratory mouse is increasingly used as a model organism for ASD, and MGI, the Mouse Genome Informatics resource, is the primary model organism database for the laboratory mouse. MGI uses the Mammalian Phenotype (MP) ontology to describe mouse models of human diseases. Using bioinformatics tools including Phenologs, MouseNET, and the Ontological Discovery Environment, we tested data associated with MP terms to characterize new gene-phenotype associations related to ASD. Our integrative analysis using these tools identified numerous mouse genotypes that are likely to have previously uncharacterized autistic-like phenotypes. The genes implicated in these mouse models had considerable overlap with a set of over 300 genes recently associated with ASD due to small, rare copy number variation (Pinto D. et al, 2010). Prediction and characterization of autistic mutant mouse alleles assists researchers in studying the complex nature of ASD and provides a generalizable approach to candidate gene prioritization.
Autism spectrum disorders; phenotype ontology; mouse disease models
The Zebrafish Information Network (ZFIN, http://zfin.org), the model organism database for zebrafish, provides the central location for curated zebrafish genetic, genomic and developmental data. Extensive data integration of mutant phenotypes, genes, expression patterns, sequences, genetic markers, morpholinos, map positions, publications and community resources facilitates the use of the zebrafish as a model for studying gene function, development, behavior and disease. Access to ZFIN data is provided via web-based query forms and through bulk data files. ZFIN is the definitive source for zebrafish gene and allele nomenclature, the zebrafish anatomical ontology (AO) and for zebrafish gene ontology (GO) annotations. ZFIN plays an active role in the development of cross-species ontologies such as the phenotypic quality ontology (PATO) and the gene ontology (GO). Recent enhancements to ZFIN include (i) a new home page and navigation bar, (ii) expanded support for genotypes and phenotypes, (iii) comprehensive phenotype annotations based on anatomical, phenotypic quality and gene ontologies, (iv) a BLAST server tightly integrated with the ZFIN database via ZFIN-specific datasets, (v) a global site search and (vi) help with hands-on resources.
Linkages between animal models of diseases and human data enable the development of translational research hypotheses. The objective of this study is to investigate two approaches to integrating phenotype and clinical information. On the one hand, we develop a terminology mapping between phenotypes from the Mammalian Phenotype Ontology (MPO) and Online Mendelian Inheritance in Man (OMIM) through the Unified Medical Language System (UMLS). On the other, we associate MPO phenotypes with OMIM manifestations through annotations made to orthologous genes. 1,469 MPO concepts (22%) were mapped successfully to some disease concept in the UMLS, of which 869 were present in OMIM. Among the 16,764 distinct MGI genes associated with human orthologs, 1,968 distinct genes were associated with both MPO and OMIM annotations. The UMLS is a valuable resource for linking phenotype terms to clinical terminologies, and these mappings between terminologies can help enrich gene annotation databases and unify phenotype representation.
There is an increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to enable more formal biocuration workflow results and ultimately to improve analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO) for annotating aspects of gene products and the Molecular Interaction ontology (PSI-MI) used by databases that archive protein–protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein–protein interaction data and PSI-MI terms referring to interaction detection methods.
Now that the laboratory mouse genome is sequenced and the annotation of its gene content is improving, the next major challenge is the annotation of the phenotypic associations of mouse genes. This requires the development of systematic phenotyping pipelines that use standardized phenotyping procedures which allow comparison across laboratories. It also requires the development of a sophisticated informatics infrastructure for the description and interchange of phenotype data. Here we focus on the current state of the art in the description of data produced by systematic phenotyping approaches using ontologies, in particular, the EQ (Entity-Quality) approach, and what developments are required to facilitate the linking of phenotypic descriptions of mutant mice to human diseases.
Ontologies are becoming increasingly important for the efficient storage, retrieval
and mining of biological data. The description of phenotypes using ontologies is a
particularly complex problem. We outline a schema that can be used to describe
phenotypes by combining orthologous axiomatic ontologies. We also describe tools for
storing, browsing and searching such complex ontologies. Central to this approach is
that assays (protocols for measuring phenotypic characters) describe what has been
measured as well as how this was done, allowing assays to link individual organisms to
ontologies describing phenotypes. We have evaluated this approach by automatically
annotating data on 600 000 mutant mice phenotypes using the SHIRPA protocol. We
believe this approach will enable the flexible, extensible and detailed description of
phenotypes from any organism.
Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships causes the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering.
We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network.
The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at http://purl.bioontology.org/ontology/PHARE.
Taxonomic descriptions are unparalleled sources of knowledge of life's phenotypic diversity. As natural language prose, these data sets are largely refractory to computation and integration with other sources of phenotypic data. By formalizing taxonomic descriptions using ontology-based semantic representation, we aim to increase the reusability and computability of taxonomists' primary data. Here, we present a revision of the ensign wasp (Hymenoptera: Evaniidae) fauna of New Caledonia using this new model for species description. Descriptive matrices, specimen data, and taxonomic nomenclature are gathered in a unified Web-based application, mx, then exported as both traditional taxonomic treatments and semantic statements using the OWL Web Ontology Language. Character:character-state combinations are then annotated following the entity–quality phenotype model, originally developed to represent mutant model organism phenotype data; concepts of anatomy are drawn from the Hymenoptera Anatomy Ontology and linked to phenotype descriptors from the Phenotypic Quality Ontology. The resulting set of semantic statements is provided in Resource Description Framework format. Applying the model to real data, that is, specimens, taxonomic names, diagnoses, descriptions, and redescriptions, provides us with a foundation to discuss limitations and potential benefits such as automated data integration and reasoner-driven queries. Four species of ensign wasp are now known to occur in New Caledonia: Szepligetella levipetiolata, Szepligetella deercreeki Deans and Mikó sp. nov., Szepligetella irwini Deans and Mikó sp. nov., and the nearly cosmopolitan Evania appendigaster. A fifth species, Szepligetella sericea, including Szepligetella impressa, syn. nov., has not yet been collected in New Caledonia but can be found on islands throughout the Pacific and so is included in the diagnostic key. [Biodiversity informatics; Evaniidae; New Caledonia; new species; ontology; semantic phenotypes; semantic species description; taxonomy.]