Related Articles

1.  The mouse pathology ontology, MPATH; structure and applications 
The capture and use of disease-related anatomic pathology data for both model organism phenotyping and human clinical practice requires a relatively simple nomenclature and coding system that can be integrated into data collection platforms (such as computerized medical record-keeping systems) to enable the pathologist to rapidly screen and accurately record observations. The MPATH ontology was originally constructed in 2000 by a committee of pathologists for the annotation of rodent histopathology images, but is now widely used for coding and analysis of disease and phenotype data for rodents, humans and zebrafish.
Construction and content
MPATH is divided into two main branches describing pathological processes and structures based on traditional histopathological principles. It does not aim to include definitive diagnoses, which would generally be regarded as disease concepts. It contains 888 core pathology terms in an almost exclusively is_a hierarchy nine layers deep. Currently, 86% of the terms have textual definitions and contain relationships as well as logical axioms to other ontologies such as the Gene Ontology.
Application and utility
MPATH was originally devised for the annotation of histopathological images from mice but is now being used much more widely in the recording of diagnostic and phenotypic data from both mice and humans, and in the construction of logical definitions for phenotype and disease ontologies. We discuss the use of MPATH to generate cross-products with qualifiers derived from a subset of the Phenotype and Trait Ontology (PATO) and its application to large-scale high-throughput phenotyping studies. MPATH provides a largely species-agnostic ontology for the descriptions of anatomic pathology, which can be applied to most amniotes and is now finding extensive use in species other than mice. It enables investigators to interrogate large datasets at a variety of depths, use semantic analysis to identify the relations between diseases in different species and integrate pathology data with other data types, such as pharmacogenomics.
PMCID: PMC3851164  PMID: 24033988
Pathology; Ontology; Disease; Mouse; Phenotype
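The "is_a hierarchy nine layers deep" mentioned above can be made concrete with a small sketch. The term names and edges below are hypothetical placeholders rather than real MPATH content, and the convention that the root counts as layer 1 is an assumption.

```python
# Hypothetical is_a edges (child -> list of parents); NOT real MPATH terms.
IS_A = {
    "pathological entity": [],
    "process": ["pathological entity"],
    "inflammation": ["process"],
    "acute inflammation": ["inflammation"],
    "structure": ["pathological entity"],
}


def depth(term, is_a=IS_A):
    """Layer of `term` in the hierarchy; a root term is layer 1 (assumed convention)."""
    parents = is_a[term]
    if not parents:
        return 1
    # With multiple parents, take the longest path back to a root.
    return 1 + max(depth(parent, is_a) for parent in parents)


def hierarchy_depth(is_a=IS_A):
    """Number of layers in the deepest branch of the hierarchy."""
    return max(depth(term, is_a) for term in is_a)
```

Run against a real ontology file, the same traversal would report how many layers separate the most specific terms from the root.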
2.  PhenomeNET: a whole-phenome approach to disease gene discovery 
Nucleic Acids Research  2011;39(18):e119.
Phenotypes are investigated in model organisms to understand and reveal the molecular mechanisms underlying disease. Phenotype ontologies were developed to capture and compare phenotypes within the context of a single species. Recently, these ontologies were augmented with formal class definitions that may be utilized to integrate phenotypic data and enable the direct comparison of phenotypes between different species. We have developed a method to transform phenotype ontologies into a formal representation, combine phenotype ontologies with anatomy ontologies, and apply a measure of semantic similarity to construct the PhenomeNET cross-species phenotype network. We demonstrate that PhenomeNET can identify orthologous genes, genes involved in the same pathway and gene–disease associations through the comparison of mutant phenotypes. We provide evidence that the Adam19 and Fgf15 genes in mice are involved in the tetralogy of Fallot, and, using zebrafish phenotypes, propose the hypothesis that the mammalian homologs of Cx36.7 and Nkx2.5 lie in a pathway controlling cardiac morphogenesis and electrical conductivity which, when defective, cause the tetralogy of Fallot phenotype. Our method implements a whole-phenome approach toward disease gene discovery and can be applied to prioritize genes for rare and orphan diseases for which the molecular basis is unknown.
PMCID: PMC3185433  PMID: 21737429
3.  Improving Disease Gene Prioritization by Comparing the Semantic Similarity of Phenotypes in Mice with Those of Human Diseases 
PLoS ONE  2012;7(6):e38937.
Despite considerable progress in understanding the molecular origins of hereditary human diseases, the molecular basis of several thousand genetic diseases still remains unknown. High-throughput phenotype studies are underway to systematically assess the phenotype outcome of targeted mutations in model organisms. Thus, comparing the similarity between experimentally identified phenotypes and the phenotypes associated with human diseases can be used to suggest causal genes underlying a disease. In this manuscript, we present a method for disease gene prioritization based on comparing phenotypes of mouse models with those of human diseases. For this purpose, either human disease phenotypes are “translated” into a mouse-based representation (using the Mammalian Phenotype Ontology), or mouse phenotypes are “translated” into a human-based representation (using the Human Phenotype Ontology). We apply a measure of semantic similarity and rank experimentally identified phenotypes in mice with respect to their phenotypic similarity to human diseases. Our method is evaluated on manually curated and experimentally verified gene–disease associations for human and for mouse. We evaluate our approach using a Receiver Operating Characteristic (ROC) analysis and obtain an area under the ROC curve of up to . Furthermore, we are able to confirm previous results that the Vax1 gene is involved in Septo-Optic Dysplasia and suggest Gdf6 and Marcks as further potential candidates. Our method significantly outperforms previous phenotype-based approaches of prioritizing gene–disease associations. To enable the adaption of our method to the analysis of other phenotype data, our software and prioritization results are freely available under a BSD licence. Furthermore, our method has been integrated in PhenomeNET and the results can be explored using the PhenomeBrowser.
PMCID: PMC3375301  PMID: 22719993
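The ranking step described in the abstract above can be sketched as follows. The abstract does not name its similarity measure, so this sketch swaps in a simple Jaccard index over ancestor-closed term sets as an assumption. The toy ontology, the term names and the model names are all hypothetical.

```python
# Hypothetical toy phenotype ontology (child -> parents); placeholder terms only.
PARENTS = {
    "phenotype": [],
    "eye phenotype": ["phenotype"],
    "small eye": ["eye phenotype"],
    "optic nerve hypoplasia": ["eye phenotype"],
    "heart phenotype": ["phenotype"],
    "septal defect": ["heart phenotype"],
}


def closure(terms):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def jaccard(terms_a, terms_b):
    """Jaccard index of two ancestor-closed term sets (assumed measure)."""
    a, b = closure(terms_a), closure(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_models(disease_terms, models):
    """Order model names by decreasing similarity to the disease (ties by name)."""
    scored = [(jaccard(disease_terms, terms), name) for name, terms in models.items()]
    return [name for _, name in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


DISEASE = {"optic nerve hypoplasia"}
MODELS = {"model_a": {"small eye"}, "model_b": {"septal defect"}}
```

Closing each term set under ancestors lets two annotations that share only a parent term still count as partially similar, which is the point of comparing phenotypes through an ontology rather than by exact term match.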
4.  Clustering phenotype populations by genome-wide RNAi and multiparametric imaging 
How to predict gene function from phenotypic cues is a longstanding question in biology. Using quantitative multiparametric imaging, RNAi-mediated cell phenotypes were measured on a genome-wide scale. On the basis of phenotypic ‘neighbourhoods’, we identified previously uncharacterized human genes as mediators of the DNA damage response pathway and the maintenance of genomic integrity. The phenotypic map is provided as an online resource for discovering further functional relationships for a broad spectrum of biological modules.
Genetic screens for phenotypic similarity have made key contributions for associating genes with biological processes. Aggregating genes by similarity of their loss-of-function phenotype has provided insights into signalling pathways that have a conserved function from Drosophila to human (Nusslein-Volhard and Wieschaus, 1980; Bier, 2005). Complex visual phenotypes, such as defects in pattern formation during development, greatly facilitated the classification of genes into pathways, and phenotypic similarities in many cases predicted molecular relationships. With RNA interference (RNAi), highly parallel phenotyping of loss-of-function effects in cultured cells has become feasible in many organisms whose genomes have been sequenced (Boutros and Ahringer, 2008). One of the current challenges is the computational categorization of visual phenotypes and the prediction of gene function and associated biological processes. With large parts of the genome still being in uncharted territory, deriving functional information from large-scale phenotype analysis promises to uncover novel gene–gene relationships and to generate functional maps to explore cellular processes.
In this study, we developed an automated approach using RNAi-mediated cell phenotypes, multiparametric imaging and computational modelling to obtain functional information on previously uncharacterized genes. To generate broad, computer-readable phenotypic signatures, we measured the effect of RNAi-mediated knockdowns on changes of cell morphology in human cells on a genome-wide scale. First, the several million cells were stained for nuclear and cytoskeletal markers and then imaged using automated microscopy. On the basis of fluorescent markers, we established an automated image analysis to classify individual cells (Figure 1A). After cell segmentation for determining nuclei and cell boundaries (Figure 1C), we computed 51 cell descriptors that quantified intensities, shape characteristics and texture (Figure 1F). Individual cells were categorized into 1 of 10 classes, which included cells showing protrusion/elongation, cells in metaphase, large cells, condensed cells, cells with lamellipodia and cellular debris (Figure 1D and E). Each siRNA knockdown was summarized by a phenotypic profile and differences between RNAi knockdowns were quantified by the similarity between phenotypic profiles. We termed the vector of scores a phenoprint (Figure 3C) and defined the phenotypic distance between a pair of perturbations as the distance between their corresponding phenoprints.
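The phenoprint comparison described above can be sketched in a few lines. The text does not say which distance is used between phenoprints, so Euclidean distance is an assumption here. The siRNA names and score vectors are made up for illustration.

```python
import math

# Hypothetical phenoprints: one score vector per siRNA knockdown (made-up values).
PHENOPRINTS = {
    "siA": [0.0, 0.0],
    "siB": [3.0, 4.0],
    "siC": [1.0, 0.0],
}


def phenotypic_distance(p, q):
    """Distance between two phenoprints; Euclidean is an assumed choice of metric."""
    if len(p) != len(q):
        raise ValueError("phenoprints must have the same number of scores")
    return math.dist(p, q)


def nearest_neighbours(name, phenoprints=PHENOPRINTS, k=1):
    """Names of the k perturbations whose phenoprints lie closest to `name`."""
    others = [n for n in phenoprints if n != name]
    others.sort(key=lambda n: (phenotypic_distance(phenoprints[name], phenoprints[n]), n))
    return others[:k]
```

Listing the nearest neighbours of an uncharacterized gene is the simplest way to read off a phenotypic neighbourhood from such vectors.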
To visualize the distribution of all phenoprints, we plotted them in a genome-wide map as a two-dimensional representation of the phenotypic similarity relationships (Figure 3A). The complete data set and an interactive version of the phenotypic map are available online. The map identified phenotypic ‘neighbourhoods’, which are characterized by cells with lamellipodia (WNK3, ANXA4), cells with prominent actin fibres (ODF2, SOD3), abundance of large cells (CA14), many elongated cells (SH2B2, ELMO2), decrease in cell number (TPX2, COPB1, COPA), increase in number of cells in metaphase (BLR1, CIB2) and combinations of phenotypes such as presence of large cells with protrusions and bright nuclei (PTPRZ1, RRM1; Figure 3B).
To test whether phenotypic similarity might serve as a predictor of gene function, we focused our further analysis on two clusters that contained genes associated with the DNA damage response (DDR) and genomic integrity (Figure 3A and C). The first phenotypic cluster included proteins with kinetochore-associated functions such as NUF2 (Figure 3B) and SGOL1. It also contained the centrosomal protein CEP164 that has been described as an important mediator of the DNA damage-activated signalling cascade (Sivasubramaniam et al, 2008) and the largely uncharacterized genes DONSON and SON. A second phenotypically distinct cluster included previously described components of the DDR pathway such as RRM1 (Figure 3A–C), CLSPN, PRIM2 and SETD8. Furthermore, this cluster contained the poorly characterized genes CADM1 and CD3EAP.
Cells activate a signalling cascade in response to DNA damage induced by exogenous and endogenous factors. Central are the kinases ATM and ATR as they serve as sensors of DNA damage and activators of further downstream kinases (Harper and Elledge, 2007; Cimprich and Cortez, 2008). To investigate whether DONSON, SON, CADM1 and CD3EAP, which were found in phenotypic ‘neighbourhoods’ to known DDR components, have a role in the DNA damage signalling pathway, we tested the effect of their depletion on the DDR on γ irradiation. As indicated by reduced CHEK1 phosphorylation, siRNA knockdown of DONSON, SON, CD3EAP or CADM1 resulted in impaired DDR signalling on γ irradiation. Furthermore, knockdown of DONSON or SON reduced phosphorylation of downstream effectors such as NBS1, CHEK1 and the histone variant H2AX on UVC irradiation. DONSON depletion also impaired recruitment of RPA2 onto chromatin and SON knockdown reduced RPA2 phosphorylation, indicating that DONSON and SON presumably act downstream of the activation of ATM. In agreement with their phenotypic profile, these results suggest that DONSON, SON, CADM1 and CD3EAP are important mediators of the DDR. Further experiments demonstrated that they are also required for the maintenance of genomic integrity.
In summary, we show that genes with similar phenotypic profiles tend to share similar functions. The power of our computational and experimental approach is demonstrated by the identification of novel signalling regulators whose phenotypic profiles were found in proximity to known biological modules. Therefore, we believe that such phenotypic maps can serve as a resource for functional discovery and characterization of unknown genes. Furthermore, such approaches are also applicable for other perturbation reagents, such as small molecules in drug discovery and development. One could also envision combined maps that contain both siRNAs and small molecules to predict target–small molecule relationships and potential side effects.
Genetic screens for phenotypic similarity have made key contributions to associating genes with biological processes. With RNA interference (RNAi), highly parallel phenotyping of loss-of-function effects in cells has become feasible. One of the current challenges however is the computational categorization of visual phenotypes and the prediction of biological function and processes. In this study, we describe a combined computational and experimental approach to discover novel gene functions and explore functional relationships. We performed a genome-wide RNAi screen in human cells and used quantitative descriptors derived from high-throughput imaging to generate multiparametric phenotypic profiles. We show that profiles predicted functions of genes by phenotypic similarity. Specifically, we examined several candidates including the largely uncharacterized gene DONSON, which shared phenotype similarity with known factors of DNA damage response (DDR) and genomic integrity. Experimental evidence supports that DONSON is a novel centrosomal protein required for DDR signalling and genomic integrity. Multiparametric phenotyping by automated imaging and computational annotation is a powerful method for functional discovery and mapping the landscape of phenotypic responses to cellular perturbations.
PMCID: PMC2913390  PMID: 20531400
DNA damage response signalling; massively parallel phenotyping; phenotype networks; RNAi screening
5.  Phenotypic overlap in the contribution of individual genes to CNV pathogenicity revealed by cross-species computational analysis of single-gene mutations in humans, mice and zebrafish 
Disease Models & Mechanisms  2012;6(2):358-372.
Numerous disease syndromes are associated with regions of copy number variation (CNV) in the human genome and, in most cases, the pathogenicity of the CNV is thought to be related to altered dosage of the genes contained within the affected segment. However, establishing the contribution of individual genes to the overall pathogenicity of CNV syndromes is difficult and often relies on the identification of potential candidates through manual searches of the literature and online resources. We describe here the development of a computational framework to comprehensively search phenotypic information from model organisms and single-gene human hereditary disorders, and thus speed the interpretation of the complex phenotypes of CNV disorders. There are currently more than 5000 human genes about which nothing is known phenotypically but for which detailed phenotypic information for the mouse and/or zebrafish orthologs is available. Here, we present an ontology-based approach to identify similarities between human disease manifestations and the mutational phenotypes in characterized model organism genes; this approach can therefore be used even in cases where there is little or no information about the function of the human genes. We applied this algorithm to detect candidate genes for 27 recurrent CNV disorders and identified 802 gene-phenotype associations, approximately half of which involved genes that were previously reported to be associated with individual phenotypic features and half of which were novel candidates. A total of 431 associations were made solely on the basis of model organism phenotype data. Additionally, we observed a striking, statistically significant tendency for individual disease phenotypes to be associated with multiple genes located within a single CNV region, a phenomenon that we denote as pheno-clustering. 
Many of the clusters also display statistically significant similarities in protein function or vicinity within the protein-protein interaction network. Our results provide a basis for understanding previously un-interpretable genotype-phenotype correlations in pathogenic CNVs and for mobilizing the large amount of model organism phenotype data to provide insights into human genetic disorders.
PMCID: PMC3597018  PMID: 23104991
6.  Logical Gene Ontology Annotations (GOAL): exploring gene ontology annotations with OWL 
Journal of Biomedical Semantics  2012;3(Suppl 1):S3.
Ontologies such as the Gene Ontology (GO) and their use in annotations make cross species comparisons of genes possible, along with a wide range of other analytical activities. The bio-ontologies community, in particular the Open Biomedical Ontologies (OBO) community, have provided many other ontologies and an increasingly large volume of annotations of gene products that can be exploited in query and analysis. As many annotations with different ontologies centre upon gene products, there is a possibility to explore gene products through multiple ontological perspectives at the same time. Questions could be asked that link a gene product’s function, process, cellular location, phenotype and disease. Current tools, such as AmiGO, allow exploration of genes based on their GO annotations, but not through multiple ontological perspectives. In addition, the semantics of these ontologies’ representations should be able to, through automated reasoning, afford richer query opportunities of the gene product annotations than is currently possible.
To do this multi-perspective, richer querying of gene product annotations, we have created the Logical Gene Ontology, or GOAL ontology, in OWL that combines the Gene Ontology, Human Disease Ontology and the Mammalian Phenotype Ontology, together with classes that represent the annotations with these ontologies for mouse gene products. Each mouse gene product is represented as a class, with the appropriate relationships to the GO aspects, phenotype and disease with which it has been annotated. We then use defined classes to query these protein classes through automated reasoning, and to build a complex hierarchy of gene products. We have presented this through a Web interface that allows arbitrary queries to be constructed and the results displayed.
This standard use of OWL affords a rich interaction with Gene Ontology, Human Disease Ontology and Mammalian Phenotype Ontology annotations for the mouse, to give a fine partitioning of the gene products in the GOAL ontology. OWL in combination with automated reasoning can be effectively used to query across ontologies to ask biologically rich questions. We have demonstrated that automated reasoning can be used to deliver practical on-line querying support for the ontology annotations available for the mouse.
The GOAL Web page is to be found at
PMCID: PMC3337258  PMID: 22541594
7.  The Mammalian Phenotype Ontology: enabling robust annotation and comparative analysis 
The mouse has long been an important model for the study of human genetic disease. Through the application of genetic engineering and mutagenesis techniques, the number of unique mutant mouse models and the amount of phenotypic data describing them are growing exponentially. Describing phenotypes of mutant mice in a computationally useful manner that will facilitate data mining is a major challenge for bioinformatics. Here we describe a tool, the Mammalian Phenotype Ontology (MP), for classifying and organizing phenotypic information related to the mouse and other mammalian species. The MP Ontology has been applied to mouse phenotype descriptions in the Mouse Genome Informatics Database (MGI), the Rat Genome Database (RGD), the Online Mendelian Inheritance in Animals (OMIA) and elsewhere. Use of this ontology allows comparisons of data from diverse sources, can facilitate comparisons across mammalian species, assists in identifying appropriate experimental disease models, and aids in the discovery of candidate disease genes and molecular signaling pathways.
PMCID: PMC2801442  PMID: 20052305
Ontology; Phenotype; Mammal; Annotation; Model System
8.  Automated identification of pathways from quantitative genetic interaction data 
We present a novel Bayesian learning method that reconstructs large detailed gene networks from quantitative genetic interaction (GI) data. The method uses global reasoning to handle missing and ambiguous measurements, and provide confidence estimates for each prediction. Applied to a recent data set over genes relevant to protein folding, the learned networks reflect known biological pathways, including details such as pathway ordering and directionality of relationships. The reconstructed networks also suggest novel relationships, including the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated.
Recent developments have enabled large-scale quantitative measurement of genetic interactions (GIs) that report on the extent to which the activity of one gene is dependent on a second. It has long been recognized (Avery and Wasserman, 1992; Hartman et al, 2001; Segre et al, 2004; Tong et al, 2004; Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Costanzo et al, 2010) that functional dependencies revealed by GI data can provide rich information regarding underlying biological pathways. Further, the precise phenotypic measurements provided by quantitative GI data can provide evidence for even more detailed aspects of pathway structure, such as differentiating between full and partial dependence between two genes (Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Jonikas et al, 2009) (Figure 1A). As GI data sets become available for a range of quantitative phenotypes and organisms, such patterns will allow researchers to elucidate pathways important to a diverse set of biological processes.
We present a new method that exploits the high-quality, quantitative nature of recent GI assays to automatically reconstruct detailed multi-gene pathway structures, including the organization of a large set of genes into coherent pathways, the connectivity and ordering within each pathway, and the directionality of each relationship. We introduce activity pathway networks (APNs), which represent functional dependencies among a set of genes in the form of a network. We present an automatic method to efficiently reconstruct APNs over large sets of genes based on quantitative GI measurements. This method handles uncertainty in the data arising from noise, missing measurements, and data points with ambiguous interpretations, by performing global reasoning that combines evidence from multiple data points. In addition, because some structure choices remain uncertain even when jointly considering all measurements, our method maintains multiple likely networks, and allows computation of confidence estimates over each structure choice.
We applied our APN reconstruction method to the recent high-quality GI data set of Jonikas et al (2009), which examined the functional interaction between genes that contribute to protein folding in the ER. Specifically, Jonikas et al used the cell's endogenous sensor (the unfolded protein response), to first identify several hundred yeast genes with functions in endoplasmic reticulum folding and then systematically characterized their functional interdependencies by measuring unfolded protein response levels in double mutants. Our analysis produced an ensemble of 500 likelihood-weighted APNs over 178 genes (Figure 2).
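One way to picture how confidence estimates could be read off a likelihood-weighted ensemble of networks is a weighted vote over edges. This is only an illustrative assumption: the text above does not give the authors' actual computation, and the gene names, weights and edges below are made up.

```python
# Hypothetical ensemble: (likelihood weight, set of directed edges) per network.
ENSEMBLE = [
    (2.0, {("A", "B")}),
    (1.0, {("A", "B"), ("B", "C")}),
    (1.0, {("B", "A")}),
]


def edge_confidence(ensemble=ENSEMBLE):
    """Confidence of each directed edge as its share of the total ensemble weight.

    Assumed scheme: an edge's confidence is the sum of the normalized weights
    of the networks that contain it.
    """
    total = sum(weight for weight, _ in ensemble)
    if total <= 0:
        raise ValueError("ensemble weights must sum to a positive value")
    confidence = {}
    for weight, edges in ensemble:
        for edge in edges:
            confidence[edge] = confidence.get(edge, 0.0) + weight / total
    return confidence
```

Because direction is part of the edge key, the same pair of genes can carry different confidence for each ordering, which mirrors the directionality of relationships mentioned in the text.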
We performed an aggregate evaluation of our results by comparing to known biological relationships between gene pairs, including participation in pathways according to the Kyoto Encyclopedia of Genes and Genomes (KEGG), correlation of chemical genomic profiles in a recent high-throughput assay (Hillenmeyer et al, 2008) and similarity of Gene Ontology (GO) annotations. In each evaluation performed, our reconstructed APNs were significantly more consistent with the known relationships than either the raw GI values or the Pearson correlation between profiles of GI values.
Importantly, our approach provides not only an improved means for defining pairs or groups of related genes, but also enables the identification of detailed multi-gene network structures. In many cases, our method successfully reconstructed known cellular pathways, including the ER-associated degradation (ERAD) pathway, and the biosynthesis of N-linked glycans, ranking them among the highest confidence structures. In-depth examination of the learned network structures indicates agreement with many known details of these pathways. In addition, quantitative analysis indicates that our learned APNs are indicative of ordering within KEGG-annotated biological pathways.
Our results also suggest several novel relationships, including placement of uncharacterized genes into pathways, and novel relationships between characterized genes. These include the dependence of the J domain chaperone JEM1 on the PDI homolog MPD1, dependence of the Ubiquitin-recycling enzyme DOA4 on N-linked glycosylation, and the dependence of the E3 Ubiquitin ligase DOA10 on the signal peptidase complex subunit SPC2. Our APNs also place the poorly characterized TPR-containing protein SGT2 upstream of the tail-anchored protein biogenesis machinery components GET3, GET4, and MDY2 (also known as GET5), suggesting that SGT2 has a function in the insertion of tail-anchored proteins into membranes. Consistent with this prediction, our experimental analysis shows that sgt2Δ cells show a defect in localization of the tail-anchored protein GFP-Sed5 from punctate Golgi structures to a more diffuse pattern, as seen in other genes involved in this pathway.
Our results show that multi-gene, detailed pathway networks can be reconstructed from quantitative GI data, providing a concrete computational manifestation to intuitions that have traditionally accompanied the manual interpretation of such data. Ongoing technological developments in both genetics and imaging are enabling the measurement of GI data at a genome-wide scale, using high-accuracy quantitative phenotypes that relate to a range of particular biological functions. Methods based on RNAi will soon allow collection of similar data for human cell lines and other mammalian systems (Moffat et al, 2006). Thus, computational methods for analyzing GI data could have an important function in mapping pathways involved in complex biological systems including human cells.
High-throughput quantitative genetic interaction (GI) measurements provide detailed information regarding the structure of the underlying biological pathways by reporting on functional dependencies between genes. However, the analytical tools for fully exploiting such information lag behind the ability to collect these data. We present a novel Bayesian learning method that uses quantitative phenotypes of double knockout organisms to automatically reconstruct detailed pathway structures. We applied our method to a recent data set that measures GIs for endoplasmic reticulum (ER) genes, using the unfolded protein response as a quantitative phenotype. The results provided reconstructions of known functional pathways including N-linked glycosylation and ER-associated protein degradation. It also contained novel relationships, such as the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated. Our approach should be readily applicable to the next generation of quantitative GI data sets, as assays become available for additional phenotypes and eventually higher-level organisms.
PMCID: PMC2913392  PMID: 20531408
computational biology; genetic interaction; pathway reconstruction; probabilistic methods
9.  Evolutionary Characters, Phenotypes and Ontologies: Curating Data from the Systematic Biology Literature 
PLoS ONE  2010;5(5):e10708.
The wealth of phenotypic descriptions documented in the published articles, monographs, and dissertations of phylogenetic systematics is traditionally reported in a free-text format, and it is therefore largely inaccessible for linkage to biological databases for genetics, development, and phenotypes, and difficult to manage for large-scale integrative work. The Phenoscape project aims to represent these complex and detailed descriptions with rich and formal semantics that are amenable to computation and integration with phenotype data from other fields of biology. This entails reconceptualizing the traditional free-text characters into the computable Entity-Quality (EQ) formalism using ontologies.
Methodology/Principal Findings
We used ontologies and the EQ formalism to curate a collection of 47 phylogenetic studies on ostariophysan fishes (including catfishes, characins, minnows, knifefishes) and their relatives with the goal of integrating these complex phenotype descriptions with information from an existing model organism database (zebrafish). We developed a curation workflow for the collection of character, taxonomic and specimen data from these publications. A total of 4,617 phenotypic characters (10,512 states) for 3,449 taxa, primarily species, were curated into EQ formalism (for a total of 12,861 EQ statements) using anatomical and taxonomic terms from teleost-specific ontologies (Teleost Anatomy Ontology and Teleost Taxonomy Ontology) in combination with terms from a quality ontology (Phenotype and Trait Ontology). Standards and guidelines for consistently and accurately representing phenotypes were developed in response to the challenges that were evident from two annotation experiments and from feedback from curators.
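An Entity-Quality statement of the kind described above can be pictured as a small record that pairs an anatomical term with a quality term for a given taxon. The field names, taxon labels and term labels below are hypothetical. Real EQ annotations would carry ontology identifiers rather than free-text labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQStatement:
    """One phenotype annotation: a taxon exhibits an entity with a quality."""
    taxon: str    # would come from a taxonomy ontology
    entity: str   # would come from an anatomy ontology
    quality: str  # would come from a quality ontology


# Hypothetical statements; labels are placeholders, not curated data.
STATEMENTS = [
    EQStatement("taxon_1", "dorsal fin", "absent"),
    EQStatement("taxon_2", "dorsal fin", "present"),
    EQStatement("taxon_3", "dorsal fin", "absent"),
]


def taxa_with(statements, entity, quality):
    """Sorted taxa annotated with the given entity and quality combination."""
    return sorted({s.taxon for s in statements
                   if s.entity == entity and s.quality == quality})
```

Once free-text characters are recast in this shape, a query such as "which taxa lack a dorsal fin" becomes a simple filter instead of a literature search.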
The challenges we encountered and many of the curation standards and methods for improving consistency that we developed are generally applicable to any effort to represent phenotypes using ontologies. This is because an ontological representation of the detailed variations in phenotype, whether between mutant or wildtype, among individual humans, or across the diversity of species, requires a process by which a precise combination of terms from domain ontologies are selected and organized according to logical relations. The efficiencies that we have developed in this process will be useful for any attempt to annotate complex phenotypic descriptions using ontologies. We also discuss some ramifications of EQ representation for the domain of systematics.
PMCID: PMC2873956  PMID: 20505755
10.  Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research 
F1000Research  2014;2:30.
Phenotype analyses, e.g. investigating metabolic processes, tissue formation, or organism behavior, are an important element of most biological and medical research activities. Biomedical researchers are making increased use of ontological standards and methods to capture the results of such analyses, with one focus being the comparison and analysis of phenotype information between species.
We have generated a cross-species phenotype ontology for human, mouse and zebrafish that contains classes from the Human Phenotype Ontology, Mammalian Phenotype Ontology, and generated classes for zebrafish phenotypes. We also provide up-to-date annotation data connecting human genes to phenotype classes from the generated ontology. We have included the data generation pipeline into our continuous integration system ensuring stable and up-to-date releases.
This article describes the data generation process and is intended to help interested researchers access both the phenotype annotation data and the associated cross-species phenotype ontology. The resource described here can be used in sophisticated semantic similarity and gene set enrichment analyses for phenotype data across species. The stable releases of this resource can be obtained from
PMCID: PMC3799545  PMID: 24358873
11.  Mapping Gene Associations in Human Mitochondria using Clinical Disease Phenotypes 
PLoS Computational Biology  2009;5(4):e1000374.
Nuclear genes encode most mitochondrial proteins, and their mutations cause diverse and debilitating clinical disorders. To date, 1,200 of these mitochondrial genes have been recorded, while no standardized catalog exists of the associated clinical phenotypes. Such a catalog would be useful to develop methods to analyze human phenotypic data, to determine genotype-phenotype relations among many genes and diseases, and to support the clinical diagnosis of mitochondrial disorders. Here we establish a clinical phenotype catalog of 174 mitochondrial disease genes and study associations of diseases and genes. Phenotypic features such as clinical signs and symptoms were manually annotated from full-text medical articles and classified based on the hierarchical MeSH ontology. This classification of phenotypic features of each gene allowed for the comparison of diseases between different genes. In turn, we were then able to measure the phenotypic associations of disease genes for which we calculated a quantitative value that is based on their shared phenotypic features. The results showed that genes sharing more similar phenotypes have a stronger tendency for functional interactions, proving the usefulness of phenotype similarity values in disease gene network analysis. We then constructed a functional network of mitochondrial genes and discovered a higher connectivity for non-disease than for disease genes, and a tendency of disease genes to interact with each other. Utilizing these differences, we propose 168 candidate genes that resemble the characteristic interaction patterns of mitochondrial disease genes. Through their network associations, the candidates are further prioritized for the study of specific disorders such as optic neuropathies and Parkinson disease. 
Most mitochondrial disease phenotypes involve several clinical categories including neurologic, metabolic, and gastrointestinal disorders, which might indicate the effects of gene defects within the mitochondrial system. The accompanying knowledgebase supports the study of clinical diseases and associated genes.
Author Summary
An important prerequisite for successful disease gene identification is the assessment, with minimal ambiguity, of a particular clinical trait or phenotype. Even with years of experience, recognizing and diagnosing mitochondrial diseases is still a major hurdle in clinical medicine. Computational tools supporting clinicians not only help identify affected individuals, but also guide studies of the genetic and biological causes of these disorders. In this study we dissect and categorize individual clinical features, signs, and symptoms of 174 disease genes and then identify gene similarities based on their shared phenotypic features. We demonstrate that genes sharing more similar phenotypes have a stronger tendency for functional interactions, proving the usefulness of phenotype similarity values in disease gene network analysis. Our study of a large functional network of mitochondrial genes revealed distinct properties that differentiate disease and non-disease genes. Disease genes showed a lower average total connectivity but a tendency to interact with each other; a finding that we used to predict 168 high-probability disease candidates. The accompanying knowledgebase allows for easy navigation between disease and gene information. We believe the open source format will support and encourage further research that will benefit this and other human phenome projects.
PMCID: PMC2668170  PMID: 19390613
12.  The Vertebrate Trait Ontology: a controlled vocabulary for the annotation of trait data across species 
The use of ontologies to standardize biological data and facilitate comparisons among datasets has steadily grown as the complexity and amount of available data have increased. Despite the numerous ontologies available, one area currently lacking a robust ontology is the description of vertebrate traits. A trait is defined as any measurable or observable characteristic pertaining to an organism or any of its substructures. While there are several ontologies to describe entities and processes in phenotypes, diseases, and clinical measurements, one has not been developed for vertebrate traits; the Vertebrate Trait Ontology (VT) was created to fill this void.
Significant inconsistencies in trait nomenclature exist in the literature, and additional difficulties arise when trait data are compared across species. The VT is a unified trait vocabulary created to aid in the transfer of data within and between species and to facilitate investigation of the genetic basis of traits. Trait information provides a valuable link between the measurements that are used to assess the trait, the phenotypes related to the traits, and the diseases associated with one or more phenotypes. Because multiple clinical and morphological measurements are often used to assess a single trait, and a single measurement can be used to assess multiple physiological processes, providing investigators with standardized annotations for trait data will allow them to investigate connections among these data types.
The annotation of genomic data with ontology terms provides unique opportunities for data mining and analysis. Links between data in disparate databases can be identified and explored, a strategy that is particularly useful for cross-species comparisons or in situations involving inconsistent terminology. The VT provides a common basis for the description of traits in multiple vertebrate species. It is being used in the Rat Genome Database and Animal QTL Database for annotation of QTL data for rat, cattle, chicken, swine, sheep, and rainbow trout, and in the Mouse Phenome Database to annotate strain characterization data. In these databases, data are also cross-referenced to applicable terms from other ontologies, providing additional avenues for data mining and analysis. The ontology is available at
PMCID: PMC3851175  PMID: 23937709
Quantitative trait loci; Gene association; Trait ontology
13.  The Teleost Anatomy Ontology: Anatomical Representation for the Genomics Age 
Systematic Biology  2010;59(4):369-383.
The rich knowledge of morphological variation among organisms reported in the systematic literature has remained in free-text format, impractical for use in large-scale synthetic phylogenetic work. This noncomputable format has also precluded linkage to the large knowledgebase of genomic, genetic, developmental, and phenotype data in model organism databases. We have undertaken an effort to prototype a curated, ontology-based evolutionary morphology database that maps to these genetic databases to facilitate investigation into the mechanistic basis and evolution of phenotypic diversity. Among the first requirements in establishing this database was the development of a multispecies anatomy ontology with the goal of capturing anatomical data in a systematic and computable manner. An ontology is a formal representation of a set of concepts with defined relationships between those concepts. Multispecies anatomy ontologies in particular are an efficient way to represent the diversity of morphological structures in a clade of organisms, but they present challenges in their development relative to single-species anatomy ontologies. Here, we describe the Teleost Anatomy Ontology (TAO), a multispecies anatomy ontology for teleost fishes derived from the Zebrafish Anatomical Ontology (ZFA) for the purpose of annotating varying morphological features across species. To facilitate interoperability with other anatomy ontologies, TAO uses the Common Anatomy Reference Ontology as a template for its upper level nodes, and TAO and ZFA are synchronized, with zebrafish terms specified as subtypes of teleost terms. We found that the details of ontology architecture have ramifications for querying, and we present general challenges in developing a multispecies anatomy ontology, including refinement of definitions, taxon-specific relationships among terms, and representation of taxonomically variable developmental pathways.
PMCID: PMC2885267  PMID: 20547776
Bioinformatics; devo-evo; fish; morphology; ontology; Teleostei
14.  The Zebrafish Information Network: the zebrafish model organism database provides expanded support for genotypes and phenotypes 
Nucleic Acids Research  2007;36(Database issue):D768-D772.
The Zebrafish Information Network (ZFIN), the model organism database for zebrafish, provides the central location for curated zebrafish genetic, genomic and developmental data. Extensive data integration of mutant phenotypes, genes, expression patterns, sequences, genetic markers, morpholinos, map positions, publications and community resources facilitates the use of the zebrafish as a model for studying gene function, development, behavior and disease. Access to ZFIN data is provided via web-based query forms and through bulk data files. ZFIN is the definitive source for zebrafish gene and allele nomenclature, the zebrafish anatomical ontology (AO) and for zebrafish gene ontology (GO) annotations. ZFIN plays an active role in the development of cross-species ontologies such as the phenotypic quality ontology (PATO) and the gene ontology (GO). Recent enhancements to ZFIN include (i) a new home page and navigation bar, (ii) expanded support for genotypes and phenotypes, (iii) comprehensive phenotype annotations based on anatomical, phenotypic quality and gene ontologies, (iv) a BLAST server tightly integrated with the ZFIN database via ZFIN-specific datasets, (v) a global site search and (vi) help with hands-on resources.
PMCID: PMC2238839  PMID: 17991680
15.  Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data 
BMC Bioinformatics  2009;10(Suppl 5):S2.
Large-scale international projects are underway to generate collections of knockout mouse mutants and subsequently to perform high throughput phenotype assessments, raising new challenges for computational researchers due to the complexity and scale of the phenotype data. Phenotypes can be described using ontologies in two differing methodologies. Traditionally an individual phenotypic character has either been defined using a single compound term, originating from a species-specific dedicated phenotype ontology, or alternatively by a combinatorial annotation, using concepts from a range of disparate ontologies, to define a phenotypic character as an entity with an associated quality (EQ). Both methods have their merits, which include the dedicated approach allowing use of community standard terminology, and the combinatorial approach facilitating cross-species phenotypic statement comparisons. Previously databases have favoured one approach over another. The EUMODIC project will generate large amounts of mouse phenotype data, generated as a result of the execution of a set of Standard Operating Procedures (SOPs) and will implement both ontological approaches to capture the phenotype data generated.
For all SOPs a four-tier annotation is made: a high-level description of the SOP, to broadly define the type of data generated by the SOP; individual parameter annotation using the EQ model; annotation of the qualitative data generated for each mouse; and the annotation of mutant lines after statistical analysis. The qualitative assessments of phenodeviance are made at the point of data entry, using child PATO qualities to the parameter quality. To facilitate data querying by scientists more familiar with single compound terms to describe phenotypes, the mappings between the Mammalian Phenotype (MP) ontology and the EQ PATO model are exploited to allow querying via MP terms.
Well-annotated and comparable phenotype databases can be achieved through the use of ontologically derived comparable phenotypic statements and have been implemented here by means of OBO compatible EQ annotations. The implementation we describe also sees scientists working seamlessly with ontologies through the assessment of qualitative phenotypes in terms of PATO qualities and the ability to query the database using community-accepted compound MP terms. This work represents the first time the combinatorial and single-dedicated approaches have both been implemented to annotate a phenotypic dataset.
PMCID: PMC2679402  PMID: 19426459
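The two annotation styles contrasted in entry 15 (a single compound term versus an entity-quality pair) can be illustrated with a minimal sketch. It is not the EUMODIC implementation: the class name, the mapping table, the identifiers such as `PRE:0001` and `ENT:heart`, and the mouse line names are all hypothetical placeholders, not real ontology terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQStatement:
    """A phenotype written as an entity plus a quality (hypothetical model)."""
    entity: str   # term from an anatomy or process ontology
    quality: str  # term from a quality ontology such as PATO


# Hypothetical mapping from pre-composed (single compound) terms to EQ form.
# The identifiers are placeholders, not real MP or PATO identifiers.
PRECOMPOSED_TO_EQ = {
    "PRE:0001": EQStatement(entity="ENT:heart", quality="QUAL:increased_size"),
    "PRE:0002": EQStatement(entity="ENT:tail", quality="QUAL:decreased_length"),
}


def annotations_matching(term_id, annotations):
    """Return the sorted names of lines whose EQ annotation equals the EQ form of term_id."""
    target = PRECOMPOSED_TO_EQ[term_id]
    return sorted(line for line, eq in annotations.items() if eq == target)


# Made-up EQ annotations keyed by mutant line name.
annotations = {
    "line_A": EQStatement("ENT:heart", "QUAL:increased_size"),
    "line_B": EQStatement("ENT:tail", "QUAL:decreased_length"),
    "line_C": EQStatement("ENT:heart", "QUAL:increased_size"),
}

print(annotations_matching("PRE:0001", annotations))
```

Storing the data in EQ form and translating compound terms at query time is one way to let users search with familiar single terms while keeping the cross-species comparability of the combinatorial form.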
16.  Ontological Discovery Environment: A system for integrating gene-phenotype associations 
Genomics  2009;94(6):377-387.
The wealth of genomic technologies has enabled biologists to rapidly ascribe phenotypic characters to biological substrates. Central to effective biological investigation is the operational definition of the process under investigation. We propose an elucidation of categories of biological characters, including disease relevant traits, based on natural endogenous processes and experimentally observed biological networks, pathways and systems rather than on externally manifested constructs and current semantics such as disease names and processes. The Ontological Discovery Environment (ODE) is an Internet accessible resource for the storage, sharing, retrieval and analysis of phenotype-centered genomic data sets across species and experimental model systems. Any type of data set representing gene-phenotype relationships, such as quantitative trait loci (QTL) positional candidates, literature reviews, microarray experiments, ontological or even meta-data, may serve as inputs. To demonstrate a use case leveraging the homology capabilities of ODE and its ability to synthesize diverse data sets, we conducted an analysis of genomic studies related to alcoholism. The core of ODE’s gene-set similarity, distance and hierarchical analysis is the creation of a bipartite network of gene-phenotype relations, a unique discrete graph approach to analysis that enables set-set matching of non-referential data. Gene sets are annotated with several levels of metadata, including community ontologies, while gene set translations compare models across species. Computationally derived gene sets are integrated into hierarchical trees based on gene-derived phenotype interdependencies. Automated set identifications are augmented by statistical tools which enable users to interpret the confidence of modeled results.
This approach allows data integration and hypothesis discovery across multiple experimental contexts, regardless of the face similarity and semantic annotation of the experimental systems or species domain.
PMCID: PMC2783409  PMID: 19733230
homology; combinatorial algorithms; microarray; ontology
17.  Towards improving phenotype representation in OWL 
Journal of Biomedical Semantics  2012;3(Suppl 2):S5.
Phenotype ontologies are used in species-specific databases for the annotation of mutagenesis experiments and to characterize human diseases. The Entity-Quality (EQ) formalism is a means to describe complex phenotypes based on one or more affected entities and a quality. EQ-based definitions have been developed for many phenotype ontologies, including the Human and Mammalian Phenotype ontologies.
We analyze formalizations of complex phenotype descriptions in the Web Ontology Language (OWL) that are based on the EQ model, identify several representational challenges and analyze potential solutions to address these challenges.
In particular, we suggest a novel, role-based approach to represent relational qualities such as concentration of iron in spleen, discuss its ontological foundation in the General Formal Ontology (GFO) and evaluate its representation in OWL and the benefits it can bring to the representation of phenotype annotations.
Our analysis of OWL-based representations of phenotypes can contribute to improving consistency and expressiveness of formal phenotype descriptions.
PMCID: PMC3448528  PMID: 23046625
18.  Mining the Gene Wiki for functional genomic knowledge 
BMC Genomics  2011;12:603.
Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology.
Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses.
The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.
PMCID: PMC3271090  PMID: 22165947
19.  Experimental-confirmation and functional-annotation of predicted proteins in the chicken genome 
BMC Genomics  2007;8:425.
The chicken genome was sequenced because of its phylogenetic position as a non-mammalian vertebrate, its use as a biomedical model especially to study embryology and development, its role as a source of human disease organisms and its importance as the major source of animal derived food protein. However, genomic sequence data is, in itself, of limited value; generally it is not equivalent to understanding biological function. The benefit of having a genome sequence is that it provides a basis for functional genomics. However, the sequence data currently available is poorly structurally and functionally annotated and many genes do not have standard nomenclature assigned.
We analysed eight chicken tissues and improved the chicken genome structural annotation by providing experimental support for the in vivo expression of 7,809 computationally predicted proteins, including 30 chicken proteins that were only electronically predicted or hypothetical translations in human. To improve functional annotation (based on Gene Ontology), we mapped these identified proteins to their human and mouse orthologs and used this orthology to transfer Gene Ontology (GO) functional annotations to the chicken proteins. The 8,213 orthology-based GO annotations that we produced represent an 8% increase in currently available chicken GO annotations. Orthologous chicken products were also assigned standardized nomenclature based on current chicken nomenclature guidelines.
We demonstrate the utility of high-throughput expression proteomics for rapid experimental structural annotation of a newly sequenced eukaryote genome. These experimentally-supported predicted proteins were further annotated by assigning the proteins with standardized nomenclature and functional annotation. This method is widely applicable to a diverse range of species. Moreover, information from one genome can be used to improve the annotation of other genomes and inform gene prediction algorithms.
PMCID: PMC2204016  PMID: 18021451
20.  Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research 
F1000Research  2013;2:30.
Phenotype analyses, e.g. investigating metabolic processes, tissue formation, or organism behavior, are an important element of most biological and medical research activities. Biomedical researchers are making increased use of ontological standards and methods to capture the results of such analyses, with one focus being the comparison and analysis of phenotype information between species.
We have generated a cross-species phenotype ontology for human, mouse and zebrafish that contains zebrafish phenotypes. We also provide up-to-date annotation data connecting human genes to phenotype classes from the generated ontology. We have included the data generation pipeline into our continuous integration system ensuring stable and up-to-date releases.
This article describes the data generation process and is intended to help interested researchers access both the phenotype annotation data and the associated cross-species phenotype ontology. The resource described here can be used in sophisticated semantic similarity and gene set enrichment analyses for phenotype data across species. The stable releases of this resource can be obtained from
PMCID: PMC3799545  PMID: 24358873
21.  Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community 
BMC Bioinformatics  2011;12:32.
Caenorhabditis elegans gene-based phenotype information dates back to the 1970's, beginning with Sydney Brenner and the characterization of behavioral and morphological mutant alleles via classical genetics in order to understand nervous system function. Since then C. elegans has become an important genetic model system for the study of basic biological and biomedical principles, largely through the use of phenotype analysis. Because of the growth of C. elegans as a genetically tractable model organism and the development of large-scale analyses, there has been a significant increase of phenotype data that needs to be managed and made accessible to the research community. To do so, a standardized vocabulary is necessary to integrate phenotype data from diverse sources, permit integration with other data types and render the data in a computable form.
We describe a hierarchically structured, controlled vocabulary of terms that can be used to standardize phenotype descriptions in C. elegans, namely the Worm Phenotype Ontology (WPO). The WPO is currently comprised of 1,880 phenotype terms, 74% of which have been used in the annotation of phenotypes associated with greater than 18,000 C. elegans genes. The scope of the WPO is not exclusively limited to C. elegans biology, rather it is devised to also incorporate phenotypes observed in related nematode species. We have enriched the value of the WPO by integrating it with other ontologies, thereby increasing the accessibility of worm phenotypes to non-nematode biologists. We are actively developing the WPO to continue to fulfill the evolving needs of the scientific community and hope to engage researchers in this crucial endeavor.
We provide a phenotype ontology (WPO) that will help to facilitate data retrieval, and cross-species comparisons within the nematode community. In the larger scientific community, the WPO will permit data integration, and interoperability across the different Model Organism Databases (MODs) and other biological databases. This standardized phenotype ontology will therefore allow for more complex data queries and enhance bioinformatic analyses.
PMCID: PMC3039574  PMID: 21261995
22.  The Drosophila phenotype ontology 
Phenotype ontologies are queryable classifications of phenotypes. They provide a widely-used means for annotating phenotypes in a form that is human-readable, programmatically accessible and that can be used to group annotations in biologically meaningful ways. Accurate manual annotation requires clear textual definitions for terms. Accurate grouping and fruitful programmatic usage require high-quality formal definitions that can be used to automate classification. The Drosophila phenotype ontology (DPO) has been used to annotate over 159,000 phenotypes in FlyBase to date, but until recently lacked textual or formal definitions.
We have composed textual definitions for all DPO terms and formal definitions for 77% of them. Formal definitions reference terms from a range of widely-used ontologies including the Phenotype and Trait Ontology (PATO), the Gene Ontology (GO) and the Cell Ontology (CL). We also describe a generally applicable system, devised for the DPO, for recording and reasoning about the timing of death in populations. As a result of the new formalisations, 85% of classifications in the DPO are now inferred rather than asserted, with much of this classification leveraging the structure of the GO. This work has significantly improved the accuracy and completeness of classification and made further development of the DPO more sustainable.
The DPO provides a set of well-defined terms for annotating Drosophila phenotypes and for grouping and querying the resulting annotation sets in biologically meaningful ways. Such queries have already resulted in successful function predictions from phenotype annotation. Moreover, such formalisations make extended queries possible, including cross-species queries via the external ontologies used in formal definitions. The DPO is openly available under an open source license in both OBO and OWL formats. There is good potential for it to be used more broadly by the Drosophila community, which may ultimately result in its extension to cover a broader range of phenotypes.
PMCID: PMC3816596  PMID: 24138933
Drosophila; Phenotype; Ontology; OWL; OBO; Gene ontology; FlyBase
23.  Evaluation of high-throughput functional categorization of human disease genes 
BMC Bioinformatics  2007;8(Suppl 3):S7.
Biological data that are well-organized by an ontology, such as Gene Ontology, enable high-throughput availability of the semantic web. They can also be used to facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating classifications of human disease genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compare them to a gold standard. This gold standard was independently conceived by Valle's research group, and contains 923 human disease genes organized in 14 categories of protein function.
Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, our automated methods can achieve an overall 56% and 47% precision with 62% and 71% recall respectively. However, approximately 15% of the studied human disease genes remain without GO annotations.
Automated methods can recapitulate a significant portion of classification of the human disease genes. The method using information-theoretic distance performs slightly better on the precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%. In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and particularly in the utilization of these annotations.
PMCID: PMC1892104  PMID: 17493290
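Entry 23 reports its results as precision and recall against a gold standard. As a reminder of how those two figures are computed, here is a minimal sketch for a single category; the function name and the gene labels are made up for illustration, and it does not reproduce the study's data or its evaluation procedure.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of a predicted set against a gold-standard set."""
    true_positives = len(predicted & gold)
    # Precision: the share of predictions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: the share of gold-standard members that were recovered.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical genes assigned to one protein-function category.
predicted = {"gene1", "gene2", "gene3", "gene4"}
gold = {"gene2", "gene3", "gene5"}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

A method that assigns more genes to a category tends to raise recall at the expense of precision, which is the kind of trade-off the abstract describes between its two automated methods.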
24.  Co-clustering phenome–genome for phenotype classification and disease gene discovery 
Nucleic Acids Research  2012;40(19):e146.
Understanding the categorization of human diseases is critical for reliably identifying disease causal genes. Recently, genome-wide studies of abnormal chromosomal locations related to diseases have mapped >2000 phenotype–gene relations, which provide valuable information for classifying diseases and identifying candidate genes as drug targets. In this article, a regularized non-negative matrix tri-factorization (R-NMTF) algorithm is introduced to co-cluster phenotypes and genes, and simultaneously detect associations between the detected phenotype clusters and gene clusters. The R-NMTF algorithm factorizes the phenotype–gene association matrix under the prior knowledge from phenotype similarity network and protein–protein interaction network, supervised by the label information from known disease classes and biological pathways. In the experiments on disease phenotype–gene associations in OMIM and KEGG disease pathways, R-NMTF significantly improved the classification of disease phenotypes and disease pathway genes compared with support vector machines and Label Propagation in cross-validation on the annotated phenotypes and genes. The newly predicted phenotypes in each disease class are highly consistent with human phenotype ontology annotations. The roles of the new member genes in the disease pathways are examined and validated in the protein–protein interaction subnetworks. Extensive literature review also confirmed many new members of the disease classes and pathways as well as the predicted associations between disease phenotype classes and pathways.
PMCID: PMC3479160  PMID: 22735708
25.  Automatically transforming pre- to post-composed phenotypes: EQ-lising HPO and MP 
Large-scale mutagenesis projects are ongoing to improve our understanding about the pathology and subsequently the treatment of diseases. Such projects do not only record the genotype but also report phenotype descriptions of the genetically modified organisms under investigation. Thus far, phenotype data is stored in species-specific databases that lack coherence and interoperability in their phenotype representations. One suggestion to overcome the lack of integration is the use of Entity-Quality (EQ) statements. However, a reliable automated transformation of the phenotype annotations from the databases into EQ statements is still missing.
Here, we report on our ongoing efforts to develop a method (called EQ-liser) for the automated generation of EQ representations from phenotype ontology concept labels. We implemented the suggested method in a prototype and applied it to a subset of Mammalian and Human Phenotype Ontology concepts. In the case of MP, we were able to identify the correct EQ representation in over 52% of structure and process phenotypes. However, applying the EQ-liser prototype to the Human Phenotype Ontology yields a correct EQ representation in only 13.3% of the investigated cases.
With the application of the prototype to two phenotype ontologies, we were able to identify common patterns of mistakes when generating the EQ representation. Correcting these mistakes will pave the way to a species-independent solution to automatically derive EQ representations from phenotype ontology concept labels. Furthermore, we were able to identify inconsistencies in the existing manually defined EQ representations of current phenotype ontologies. Correcting these inconsistencies will improve the quality of the manually defined EQ statements.
PMCID: PMC4016257  PMID: 24131519
