Search tips
Search criteria

Results 1-13 (13)

Clipboard (0)
Year of Publication
Document Types
1.  modMine: flexible access to modENCODE data 
Nucleic Acids Research  2011;40(Database issue):D1082-D1088.
In an effort to comprehensively characterize the functional elements within the genomes of the important model organisms Drosophila melanogaster and Caenorhabditis elegans, the NHGRI model organism Encyclopaedia of DNA Elements (modENCODE) consortium has generated an enormous library of genomic data along with detailed, structured information on all aspects of the experiments. The modMine database ( described here has been built by the modENCODE Data Coordination Center to allow the broader research community to (i) search for and download data sets of interest among the thousands generated by modENCODE; (ii) access the data in an integrated form together with non-modENCODE data sets; and (iii) facilitate fine-grained analysis of the above data. The sophisticated search features are possible because of the collection of extensive experimental metadata by the consortium. Interfaces are provided to allow both biologists and bioinformaticians to exploit these rich modENCODE data sets now available via modMine.
PMCID: PMC3245176  PMID: 22080565
2.  Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium 
Briefings in Bioinformatics  2011;12(5):449-462.
The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. Protein annotations are either based on experiments or predicted from protein sequences. Since most sequences have not been experimentally characterized, most available annotations need to be based on predictions. To make as accurate inferences as possible, the GO Consortium's Reference Genome Project is using an explicit evolutionary framework to infer annotations of proteins from a broad set of genomes from experimental annotations in a semi-automated manner. Most components in the pipeline, such as selection of sequences, building multiple sequence alignments and phylogenetic trees, retrieving experimental annotations and depositing inferred annotations, are fully automated. However, the most crucial step in our pipeline relies on software-assisted curation by an expert biologist. This curation tool, Phylogenetic Annotation and INference Tool (PAINT) helps curators to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions. In this article, we describe how we use PAINT to infer protein function in a phylogenetic context with emphasis on its strengths, limitations and guidelines. We also discuss specific examples showing how PAINT annotations compare with those generated by other highly used homology-based methods.
PMCID: PMC3178059  PMID: 21873635
gene ontology; genome annotation; reference genome; gene function prediction; phylogenetics
3.  Finding Our Way through Phenotypes 
Deans, Andrew R. | Lewis, Suzanna E. | Huala, Eva | Anzaldo, Salvatore S. | Ashburner, Michael | Balhoff, James P. | Blackburn, David C. | Blake, Judith A. | Burleigh, J. Gordon | Chanet, Bruno | Cooper, Laurel D. | Courtot, Mélanie | Csösz, Sándor | Cui, Hong | Dahdul, Wasila | Das, Sandip | Dececchi, T. Alexander | Dettai, Agnes | Diogo, Rui | Druzinsky, Robert E. | Dumontier, Michel | Franz, Nico M. | Friedrich, Frank | Gkoutos, George V. | Haendel, Melissa | Harmon, Luke J. | Hayamizu, Terry F. | He, Yongqun | Hines, Heather M. | Ibrahim, Nizar | Jackson, Laura M. | Jaiswal, Pankaj | James-Zorn, Christina | Köhler, Sebastian | Lecointre, Guillaume | Lapp, Hilmar | Lawrence, Carolyn J. | Le Novère, Nicolas | Lundberg, John G. | Macklin, James | Mast, Austin R. | Midford, Peter E. | Mikó, István | Mungall, Christopher J. | Oellrich, Anika | Osumi-Sutherland, David | Parkinson, Helen | Ramírez, Martín J. | Richter, Stefan | Robinson, Peter N. | Ruttenberg, Alan | Schulz, Katja S. | Segerdell, Erik | Seltmann, Katja C. | Sharkey, Michael J. | Smith, Aaron D. | Smith, Barry | Specht, Chelsea D. | Squires, R. Burke | Thacker, Robert W. | Thessen, Anne | Fernandez-Triana, Jose | Vihinen, Mauno | Vize, Peter D. | Vogt, Lars | Wall, Christine E. | Walls, Ramona L. | Westerfeld, Monte | Wharton, Robert A. | Wirkner, Christian S. | Woolley, James B. | Yoder, Matthew J. | Zorn, Aaron M. | Mabee, Paula
PLoS Biology  2015;13(1):e1002033.
Imagine if we could compute across phenotype data as easily as genomic data; this article calls for efforts to realize this vision and discusses the potential benefits.
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility.
PMCID: PMC4285398  PMID: 25562316
4.  Standardized description of scientific evidence using the Evidence Ontology (ECO) 
The Evidence Ontology (ECO) is a structured, controlled vocabulary for capturing evidence in biological research. ECO includes diverse terms for categorizing evidence that supports annotation assertions including experimental types, computational methods, author statements and curator inferences. Using ECO, annotation assertions can be distinguished according to the evidence they are based on such as those made by curators versus those automatically computed or those made via high-throughput data review versus single test experiments. Originally created for capturing evidence associated with Gene Ontology annotations, ECO is now used in other capacities by many additional annotation resources including UniProt, Mouse Genome Informatics, Saccharomyces Genome Database, PomBase, the Protein Information Resource and others. Information on the development and use of ECO can be found at The ontology is freely available under Creative Commons license (CC BY-SA 3.0), and can be downloaded in both Open Biological Ontologies and Web Ontology Language formats at Also at this site is a tracker for user submission of term requests and questions. ECO remains under active development in response to user-requested terms and in collaborations with other ontologies and database resources.
Database URL: Evidence Ontology Web site:
PMCID: PMC4105709  PMID: 25052702
5.  The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data 
Nucleic Acids Research  2013;42(Database issue):D966-D974.
The Human Phenotype Ontology (HPO) project, available at, provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets. We have therefore generated equivalence mappings to other phenotype vocabularies such as LDDB, Orphanet, MedDRA, UMLS and phenoDB, allowing integration of existing datasets and interoperability with multiple biomedical resources. We have created various ways to access the HPO database content using flat files, a MySQL database, and Web-based tools. All data and documentation on the HPO project can be found online.
PMCID: PMC3965098  PMID: 24217912
6.  Phenotypic overlap in the contribution of individual genes to CNV pathogenicity revealed by cross-species computational analysis of single-gene mutations in humans, mice and zebrafish 
Disease Models & Mechanisms  2012;6(2):358-372.
Numerous disease syndromes are associated with regions of copy number variation (CNV) in the human genome and, in most cases, the pathogenicity of the CNV is thought to be related to altered dosage of the genes contained within the affected segment. However, establishing the contribution of individual genes to the overall pathogenicity of CNV syndromes is difficult and often relies on the identification of potential candidates through manual searches of the literature and online resources. We describe here the development of a computational framework to comprehensively search phenotypic information from model organisms and single-gene human hereditary disorders, and thus speed the interpretation of the complex phenotypes of CNV disorders. There are currently more than 5000 human genes about which nothing is known phenotypically but for which detailed phenotypic information for the mouse and/or zebrafish orthologs is available. Here, we present an ontology-based approach to identify similarities between human disease manifestations and the mutational phenotypes in characterized model organism genes; this approach can therefore be used even in cases where there is little or no information about the function of the human genes. We applied this algorithm to detect candidate genes for 27 recurrent CNV disorders and identified 802 gene-phenotype associations, approximately half of which involved genes that were previously reported to be associated with individual phenotypic features and half of which were novel candidates. A total of 431 associations were made solely on the basis of model organism phenotype data. Additionally, we observed a striking, statistically significant tendency for individual disease phenotypes to be associated with multiple genes located within a single CNV region, a phenomenon that we denote as pheno-clustering. Many of the clusters also display statistically significant similarities in protein function or vicinity within the protein-protein interaction network. Our results provide a basis for understanding previously un-interpretable genotype-phenotype correlations in pathogenic CNVs and for mobilizing the large amount of model organism phenotype data to provide insights into human genetic disorders.
PMCID: PMC3597018  PMID: 23104991
7.  A knowledge based approach to matching human neurodegenerative disease and animal models 
Neurodegenerative diseases present a wide and complex range of biological and clinical features. Animal models are key to translational research, yet typically only exhibit a subset of disease features rather than being precise replicas of the disease. Consequently, connecting animal to human conditions using direct data-mining strategies has proven challenging, particularly for diseases of the nervous system, with its complicated anatomy and physiology. To address this challenge we have explored the use of ontologies to create formal descriptions of structural phenotypes across scales that are machine processable and amenable to logical inference. As proof of concept, we built a Neurodegenerative Disease Phenotype Ontology (NDPO) and an associated Phenotype Knowledge Base (PKB) using an entity-quality model that incorporates descriptions for both human disease phenotypes and those of animal models. Entities are drawn from community ontologies made available through the Neuroscience Information Framework (NIF) and qualities are drawn from the Phenotype and Trait Ontology (PATO). We generated ~1200 structured phenotype statements describing structural alterations at the subcellular, cellular and gross anatomical levels observed in 11 human neurodegenerative conditions and associated animal models. PhenoSim, an open source tool for comparing phenotypes, was used to issue a series of competency questions to compare individual phenotypes among organisms and to determine which animal models recapitulate phenotypic aspects of the human disease in aggregate. Overall, the system was able to use relationships within the ontology to bridge phenotypes across scales, returning non-trivial matches based on common subsumers that were meaningful to a neuroscientist with an advanced knowledge of neuroanatomy. The system can be used both to compare individual phenotypes and also phenotypes in aggregate. This proof of concept suggests that expressing complex phenotypes using formal ontologies provides considerable benefit for comparing phenotypes across scales and species.
PMCID: PMC3653101  PMID: 23717278
phenotype; ontology; Neuroscience Information Framework; neurodegenerative disease; semantics
8.  On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report 
PLoS Computational Biology  2012;8(2):e1002386.
A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the “functional similarity” between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the “ortholog conjecture” (or, more properly, the “ortholog functional conservation hypothesis”). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an “open world assumption” (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis.
Author Summary
Understanding gene function—how individual genes contribute to the biology of an organism at the molecular, cellular and organism levels—is one of the primary aims of biomedical research. It has been a longstanding tenet of model organism research that experimental knowledge obtained in one organism is often applicable to other organisms, particularly if the organisms share the relevant genes because they inherited them from their common ancestor. Nevertheless this tenet is, like any hypothesis, not beyond question. A recent paper has termed this hypothesis a “conjecture,” and performed a statistical analysis, the results of which were interpreted as evidence against the hypothesis. This statistical analysis relied on a computational representation of gene function, the Gene Ontology (GO). As representatives of the international consortium that produces the GO, we show how the apparent evidence against the “ortholog conjecture” can be better explained as an artifact of how molecular biology knowledge is accumulated. In short, a complementarity between knowledge obtained in mouse and human experimental systems was incorrectly interpreted as a disagreement. We discuss the proper interpretation of GO annotations and potential sources of bias, with an eye toward enhancing the informed use of the GO by the scientific community.
PMCID: PMC3280971  PMID: 22359495
9.  The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details 
The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at
Database URL:
PMCID: PMC3170170  PMID: 21856757
10.  Towards BioDBcore: a community-defined information specification for biological databases 
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources; and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
PMCID: PMC3017395  PMID: 21205783
11.  Towards BioDBcore: a community-defined information specification for biological databases 
Nucleic Acids Research  2010;39(Database issue):D7-D10.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
PMCID: PMC3013734  PMID: 21097465
12.  Linking Human Diseases to Animal Models Using Ontology-Based Phenotype Annotation 
PLoS Biology  2009;7(11):e1000247.
A novel method for quantifying the similarity between phenotypes by the use of ontologies can be used to search for candidate genes, pathway members, and human disease models on the basis of phenotypes alone.
Scientists and clinicians who study genetic alterations and disease have traditionally described phenotypes in natural language. The considerable variation in these free-text descriptions has posed a hindrance to the important task of identifying candidate genes and models for human diseases and indicates the need for a computationally tractable method to mine data resources for mutant phenotypes. In this study, we tested the hypothesis that ontological annotation of disease phenotypes will facilitate the discovery of new genotype-phenotype relationships within and across species. To describe phenotypes using ontologies, we used an Entity-Quality (EQ) methodology, wherein the affected entity (E) and how it is affected (Q) are recorded using terms from a variety of ontologies. Using this EQ method, we annotated the phenotypes of 11 gene-linked human diseases described in Online Mendelian Inheritance in Man (OMIM). These human annotations were loaded into our Ontology-Based Database (OBD) along with other ontology-based phenotype descriptions of mutants from various model organism databases. Phenotypes recorded with this EQ method can be computationally compared based on the hierarchy of terms in the ontologies and the frequency of annotation. We utilized four similarity metrics to compare phenotypes and developed an ontology of homologous and analogous anatomical structures to compare phenotypes between species. Using these tools, we demonstrate that we can identify, through the similarity of the recorded phenotypes, other alleles of the same gene, other members of a signaling pathway, and orthologous genes and pathway members across species. We conclude that EQ-based annotation of phenotypes, in conjunction with a cross-species ontology, and a variety of similarity metrics can identify biologically meaningful similarities between genes by comparing phenotypes alone. This annotation and search method provides a novel and efficient means to identify gene candidates and animal models of human disease, which may shorten the lengthy path to identification and understanding of the genetic basis of human disease.
Author Summary
Model organisms such as fruit flies, mice, and zebrafish are useful for investigating gene function because they are easy to grow, dissect, and genetically manipulate in the laboratory. By examining mutations in these organisms, one can identify candidate genes that cause disease in humans, and develop models to better understand human disease and gene function. A fundamental roadblock for analysis is, however, the lack of a computational method for describing and comparing phenotypes of mutant animals and of human diseases when the genetic basis is unknown. We describe here a novel method using ontologies to record and quantify the similarity between phenotypes. We tested our method by using the annotated mutant phenotype of one member of the Hedgehog signaling pathway in zebrafish to identify other pathway members with similar recorded phenotypes. We also compared human disease phenotypes to those produced by mutation in model organisms, and show that orthologous and biologically relevant genes can be identified by this method. Given that the genetic basis of human disease is often unknown, this method provides a means for identifying candidate genes, pathway members, and disease models by computationally identifying similar phenotypes within and across species.
PMCID: PMC2774506  PMID: 19956802
13.  Sequence Ontology Annotation Guide 
This Sequence Ontology (SO) [13] aims to unify the way in which we describe sequence annotations, by providing a controlled vocabulary of terms and the relationships between them. Using SO terms to label the parts of sequence annotations greatly facilitates downstream analyses of their contents, as it ensures that annotations produced by different groups conform to a single standard. This greatly facilitates analyses of annotation contents and characteristics, e.g. comparisons of UTRs, alternative splicing, etc. Because SO also specifies the relationships between features, e.g. part_of, kind_of, annotations described with SO terms are also better substrates for validation and visualization software.
This document provides a step-by-step guide to producing a SO compliant file describing a sequence annotation. We illustrate this by using an annotated gene as an example. First we show where the terms needed to describe the gene's features are located in SO and their relationships to one another. We then show line by line how to format the file to construct a SO compliant annotation of this gene.
PMCID: PMC2447471  PMID: 18629179

Results 1-13 (13)