The wealth of genomic technologies has enabled biologists to rapidly ascribe phenotypic characters to biological substrates. Central to effective biological investigation is the operational definition of the process under investigation. We propose an elucidation of categories of biological characters, including disease-relevant traits, based on natural endogenous processes and experimentally observed biological networks, pathways and systems rather than on externally manifested constructs and current semantics such as disease names and processes. The Ontological Discovery Environment (ODE) is an Internet-accessible resource for the storage, sharing, retrieval and analysis of phenotype-centered genomic data sets across species and experimental model systems. Any type of data set representing gene-phenotype relationships, such as quantitative trait loci (QTL) positional candidates, literature reviews, microarray experiments, ontological or even meta-data, may serve as inputs. To demonstrate a use case leveraging the homology capabilities of ODE and its ability to synthesize diverse data sets, we conducted an analysis of genomic studies related to alcoholism. The core of ODE’s gene-set similarity, distance and hierarchical analysis is the creation of a bipartite network of gene-phenotype relations, a unique discrete graph approach to analysis that enables set-set matching of non-referential data. Gene sets are annotated with several levels of metadata, including community ontologies, while gene set translations compare models across species. Computationally derived gene sets are integrated into hierarchical trees based on gene-derived phenotype interdependencies. Automated set identifications are augmented by statistical tools which enable users to interpret the confidence of modeled results.
This approach allows data integration and hypothesis discovery across multiple experimental contexts, regardless of the face similarity and semantic annotation of the experimental systems or species domain.
High-throughput molecular biology provides a means to rapidly associate underlying molecular pathways and other substrates with biological structures and functions. These associations are used to characterize phenotypes and, in a limited way, to define the relations among them. There are numerous methodologies for empirical creation and analysis of gene sets from this type of data. In contrast, defining biologically meaningful categories of phenotypes, particularly those which share a common mechanism, is problematic due to the often subjective and phenomenological description of such categories.
Working from the top down, ontology development efforts develop and impose a knowledge structure on biology. Phenotype ontologies such as the Mammalian Phenome Ontology (MPO) and the Phenotype And Trait Ontology (PATO) are projects designed to organize higher order phenotypes based on construct knowledge. Both make use of formalized processes for describing relations pioneered by the Gene Ontology Consortium. These and other existing ontology development strategies often do not allow for the description of explicit structure and relationship among defined phenotypes. In the case of behavior, for example, there is limited shorthand to describe the essential categories of complex characteristics mediated by shared biological pathways. This is in contrast to biochemical pathways, which are often better worked out, though even the humble biochemical pathway becomes exquisitely complex as pathway members expand beyond reaction enzymes to the tremendous array of associated gene products involved in transport, anchoring, aggregation, synthesis and other processing of enzymes and substrates. Furthermore, it is challenging to compactly define and unify sets of processes that are different external manifestations of common internal processes. It then becomes vital to implement an approach that discovers the natural organizations of related behavioral processes as a reflection of underlying empirically-derived gene sets using dynamic points of intersection. Lastly, existing paradigms rely on prior knowledge or relevant gene groupings to describe new relationships successfully. For many new or largely uncharacterized genomic features, this is a significant problem. By constructing hierarchical ontologies from known gene-phenotype relationships, ODE breaks from existing constructs by separating the naturally occurring gene-network from the a priori concept structure of the ontology.
The automated and semi-automated creation and analysis of gene sets is a well-developed area enabling rapid development and interpretation of empirical data. This data is often synthesized and grouped through category matching approaches, wherein new empirical data is intersected with known, curated functional annotations for groups of genes. The most widely supported effort of this sort is the Gene Ontology  annotation effort which uses carefully curated experimental data from functional studies of each gene-phenotype association. Other pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) , GenMAPP , and the Biocarta collection contain gene set annotations largely based on known systems and pathways. Highly curated data banks and tools for pathway reconstruction, such as Ingenuity’s Pathway Analysis package (Ingenuity Systems, Mountain View, CA), can be used to construct and annotate gene networks. Indeed, numerous tools have been described for the analysis of various category representations [6–9]. While these tools are often an invaluable aid for distilling and interpreting gene lists and pathways resulting from differential expression analysis, they suffer from a few limitations. Most notably, these include the need for cross-species data integration, and the need to understand, identify and analyze a highly granular and uncharacterized set of related biological processes underlying the broad disease constructs that are assessed through various experimental methods. Analysis of cross-species convergence of gene-phenotype associations, termed ‘convergent functional genomics,’ has been profitably employed in an analysis of bipolar disorder across species in several experimental contexts .
From a genome perspective, there have been many attempts to produce convergent analysis of phenome expression on genome scales, covering a variety of species including mouse, rat, human, and yeast [11–16]. Although each such example provides forward-thinking approaches to cross-experimental data integration, the methodology of these existing efforts focuses on the creation of comprehensive ontologies of narrow domains, or on the mapping of high-throughput data to existing ontologies. These approaches often preclude the set-set comparison of non-referential data across diverse experimental domains or between species. Current mapping efforts to facilitate large scale phenotype interoperability are encouraging [17–19], but suffer from the challenges inherent to the lofty goal of compactly structuring and describing all knowledge of biological function.
We present The Ontological Discovery Environment (ODE) as a Web-based software environment that extracts existing phenomenologically-driven complex trait genomic analysis, and integrates it with a simultaneous analysis of instances (gene-trait associations) and ontologies (classes of genes and traits). In this way, ODE provides and analyzes articulations between gene space and phenome space . ODE addresses the challenge of phenome mapping by accumulating gene-phenotype knowledge through data integration and hypothesis driven discovery across multiple labs and multiple experimental contexts. Emergent discovery in this software environment relies on user-submitted and publicly available gene sets associated with various species and phenotypes, and integrates them using categorical metadata, such as homology. In this way, ODE seeks to define the ontology of complex biological processes, such as behavior, based on intrinsic biological entities, rather than external phenotypic manifestations, which are often subject to historical and cultural biases. The collection of unique ODE tools builds a shared biological architecture of apparently distinct processes, enabling recognition of biological function in health and disease.
ODE’s novel approach to gene set analysis also incorporates computation-critical aspects of genome-scale discovery. This is a particularly pressing issue because classification and assessment of the phenome space is theoretically unbounded. Recent Bayesian network approaches have made significant contributions to our understanding of cross-domain synthesis but do not offer robust information about local relationships  needed for granular analysis. Since set relationships are discrete structures that can naturally be described as finite simple graphs, graph algorithms can be harnessed to interpret and analyze the enormous correlation matrices that arise in the study of transcriptomic and other sorts of -omic data. Bipartite graph representations of gene-phenotype associations are a discrete combinatorial approach that shows promise in preserving information while escaping constrained semantics as demonstrated by clustering of disease phenotype and genes in a fixed data set . In particular, by representing each gene list as a phenotype vertex connected to vertices representing each gene on the list in a bi-partite graph, ODE provides data integration while maintaining substructure relationships of nested gene-set clusters. The creation of emergent phenome ontologies as presented here addresses these computational demands in large part by exploiting novel mathematical tools, such as fixed-parameter tractability , and by employing innovative implementations of combinatorial algorithms we have synthesized for supercomputers at our disposal . Consequently, by leveraging high performance computing, ODE is uniquely positioned to provide phenome models in genome-scale space.
Gene sets, the primary input to the analyses, may be empirically defined or dynamically created within ODE’s repository of gene relationships. Multiple tools are available to perform integrative, gene-centered analysis, and provide confidence metrics for model structure and data aggregation. ODE’s tools include gene set clustering, pairwise Jaccard Similarity and Distance Analysis, Hypergeometric tests, and a highly efficient biclique method for constructing a map of the gene-centered, empirical phenome. Visualization of the resultant phenotypes can then be seen in real time and used for iterative testing and gene set creation. By integrating this approach into a web-based software system, we facilitate the analysis and interpretation of sets of genomic results, enabling comparison, intersection and integration of convergent data from several species and many experiment types, including mutant analyses, genome wide association studies, microarray experiments and virtually any other genomic data type.
The ODE environment uses bipartite graphs to dynamically create phenotype relationship diagrams to enable users to produce new knowledge about phenotype similarity and the underlying gene interconnectivity. Indeed, any type of data set representing gene-phenotype relationships, such as quantitative trait loci (QTL), literature reviews, microarray experiments and ontological annotations, may be used as the foundation to create self-describing phenotype hierarchical graphs. To demonstrate a use case leveraging the homology underpinnings of ODE and its ability to synthesize information from various data sets, we conducted an analysis of alcoholism-related behaviors in several model systems.
The initial data set includes genes from mouse strains selected for their functional abilities after acute ethanol exposure, called high and low acute functional tolerance or HAFT2 and LAFT2, respectively. A second set of genes that are differentially expressed in response to acute ethanol in two mouse strains, C57BL/6 and DBA/2, is added. Cross-homology functionality is demonstrated by the inclusion of a differential gene expression analysis in rats after traumatically induced brain injury. Finally, to bring in genes associated with differing states of complex behavior, a set of bipolar disorder candidate genes derived from a mouse differential expression study is included. Each of these data sets is publicly available and pre-loaded into ODE as part of a large library of experimental data, which currently includes data from Mus musculus, Homo sapiens, Drosophila melanogaster, Rattus norvegicus and Danio rerio. This library also includes data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), and the phenotypic alleles table of the Mouse Genome Informatics database, which consists of all of the Mammalian Phenotype Ontology terms and the mutant alleles with which these terms are associated, as well as results from many published genetic and genomic studies entered by users of our Web-based software system.
The ODE function Jaccard Similarity (Figure 1) is one of several ODE tools for pairwise comparison of diverse gene sets. This analysis uses Jaccard’s positive-match coefficient to identify similar gene sets. A complete pairwise Venn diagram display reveals 11 genes at the intersection of bipolar disorder and traumatic brain injury, 13 genes at the intersection of bipolar disorder and acute ethanol response, and 14 genes at the intersection of bipolar disorder and acute functional tolerance to alcohol. All other pairwise intersections are populated.
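The pairwise comparison described above can be illustrated with a minimal sketch. It assumes the standard Jaccard coefficient, |A ∩ B| / |A ∪ B|, with distance taken as one minus similarity; the function names and the toy gene sets are hypothetical and are not ODE's implementation or the study data.

```python
from itertools import combinations

def jaccard_similarity(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(gene_sets):
    """Return {(name1, name2): (similarity, distance, shared genes)} for every pair of sets."""
    results = {}
    for n1, n2 in combinations(sorted(gene_sets), 2):
        s1, s2 = set(gene_sets[n1]), set(gene_sets[n2])
        sim = jaccard_similarity(s1, s2)
        results[(n1, n2)] = (sim, 1.0 - sim, s1 & s2)
    return results

# Toy, made-up gene sets (placeholders, not the data sets analyzed in the text).
toy_sets = {
    "phenotype_A": {"g1", "g2", "g3", "g4"},
    "phenotype_B": {"g3", "g4", "g5"},
    "phenotype_C": {"g6"},
}

table = pairwise_jaccard(toy_sets)
for pair, (sim, dist, shared) in table.items():
    print(pair, round(sim, 3), round(dist, 3), sorted(shared))
```

Because the coefficient ignores genes absent from both sets, two sparse gene lists are not scored as similar merely for sharing many true negatives.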
To integrate these data sets, an analysis of higher order intersections was performed using the PhISH tool, which enumerates and illustrates all intersections. Results of the PhISH analysis of these data sets (Figure 2) highlight gene-phenotype relationships based on empirically derived heterogeneous data sets. The hierarchical distribution of intersections demonstrates a separation of genes into distinct categories that reflect underlying phenotypic states; genes involved in neural function, oxidative stress, depression, or mania emerge as a part of the empirically created ontology. In the root node a genetic singularity converges on mobp, a gene with demonstrated increased levels in schizophrenia patients with a history of substance abuse.
Significance of the tree is ascertained by examining phenotype parsimony and node overlap parameters. After permutation testing, the parsimony value, which is reflected in the shape of the tree, is found to be normal and non-significant due to the presence of all combinations of phenotypes (p = 1.0, n = 50,000). The second measurement determines whether there is more gene overlap in node intersections than expected by random chance. This is significant since, given multiple permutation tests, there are more observed overlaps than expected (p = 5.99988 × 10^-5, n = 50,000).
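As a worked illustration of how a permutation p-value of this magnitude can arise, the sketch below uses one common convention, p = (b + 1) / (n + 1), where b is the number of permuted data sets scoring at least as extreme as the observed one. The text does not state which estimator was used, so this convention, the function name and the value of b are assumptions. Under that assumption, n = 50,000 and b = 2 give 3/50,001, which is approximately 5.99988 × 10^-5, and b = n gives exactly 1.

```python
from fractions import Fraction

def permutation_p(b, n):
    """Permutation p-value under the (b + 1) / (n + 1) convention (an assumed convention)."""
    return Fraction(b + 1, n + 1)

n = 50_000
floor_p = permutation_p(0, n)      # smallest p-value reportable with n permutations
candidate = permutation_p(2, n)    # 3/50001, the hypothetical case b = 2
ceiling_p = permutation_p(n, n)    # every permutation at least as extreme as observed

print(float(floor_p), float(candidate), float(ceiling_p))
```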
Interactive visualization of the gene-phenotype association bi-partite graph (Figure 3) reveals highly connected (high-degree) gene nodes, and the pattern of gene-phenotype aggregation. A degree threshold can be set to filter out low-degree nodes, i.e. those genes which are connected to only a small number of phenotypes. Selection of a gene node can be used to perform a search for additional connected phenotypes.
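A minimal sketch of the bipartite representation and the degree filter follows. It stores the graph as a mapping from each gene vertex to the phenotype vertices it touches; the function names and toy sets are hypothetical and do not reflect ODE's internal data structures.

```python
from collections import defaultdict

def build_bipartite(gene_sets):
    """Map each gene vertex to the set of phenotype vertices it is connected to."""
    gene_to_phenotypes = defaultdict(set)
    for phenotype, genes in gene_sets.items():
        for gene in genes:
            gene_to_phenotypes[gene].add(phenotype)
    return dict(gene_to_phenotypes)

def filter_by_degree(gene_to_phenotypes, min_degree):
    """Keep only gene vertices connected to at least min_degree phenotypes."""
    return {g: phs for g, phs in gene_to_phenotypes.items() if len(phs) >= min_degree}

# Toy, made-up gene sets; each key is a phenotype vertex.
toy_sets = {
    "phenotype_A": {"g1", "g2", "g3"},
    "phenotype_B": {"g1", "g2", "g4"},
    "phenotype_C": {"g1", "g5"},
}

graph = build_bipartite(toy_sets)
hubs = filter_by_degree(graph, min_degree=2)

# Selecting a gene vertex lists the phenotypes connected to it.
for gene in sorted(hubs):
    print(gene, sorted(hubs[gene]))
```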
ODE creates an environment in which data from existing, phenomenologically-driven genomic analysis can be integrated for a simultaneous and seamless analysis of instances (gene-trait associations) and ontologies (classes of genes and traits). Using ODE, a natural organization of complex traits such as basal and alcohol-related behavioral processes may be elucidated, thereby reflecting common biological substrates for the relevant behaviors. By integrating genome-wide empirical associations, new information may be added to known pathways and novel relations may be revealed. The goal is not biochemical reaction or interaction analysis, but rather, to ask fundamental questions about the relations among behavioral processes such as stress response and alcohol consumption, or learning and addiction. Thus, the arbitrary and incomplete nature of experimental pathway data is not an impediment. By making use of a “gene and gene product parts list” that is empirically associated with a phenotype, common components can be identified and used to identify relations among any processes. The relations of common components form a rational ontology, and can be identified through strictly empirical approaches. This enables well-studied biological and behavioral constructs to be mapped to actual biological processes, pathways and systems.
The ODE has numerous applications. The tool can be used for convergent validation of experimental results, validation of biological assays as metrics of related phenotypes, translational analysis for validation of animal models and treatments designed to mimic human disease and identification of candidate genes from among a list of positional candidates found in quantitative trait locus analysis and linkage analysis. Links to other resources from inside the tool facilitate annotation and aggregation of additional information around discovered networks. This interactive environment with features for storage and sharing of interim results can support integration of diverse data across interdisciplinary collaborative efforts. Indeed, ODE-associated tools may be extended to include alternative methods to test associations between disparate sources using a variety of statistical tests, such as edge permutation and node label permutation tests .
A property of phenome ontology we find exciting is its ability to create ontologies that can be mapped, linked and aligned. Previous attempts at ontological alignment have focused on semantic equalities . These approaches, however, are subject to lexical and data prejudice. Using inter-species homology translations, along with a consequent mapping of a variety of annotations, will enable empirically based ontology alignments and, perhaps, a convergence of the vast numbers of community ontologies being created. Through the process of ontological discovery from empirical observation, we believe that a fundamental reclassification of disease based on biological substrate, rather than external manifestation will one day be possible. This will enable biologists and clinicians to define the effects of genetic diversity, environmental perturbation and points of therapeutic intervention in terms of the functional processes underlying diverse mechanisms of disease rather than in terms of the often convergent outputs of these diverse perturbations.
ODE’s organizing metaphor is the gene and the subsequent superset of gene-sets and sets of gene-sets. Consequently, ODE accepts gene sets generated through any methodology dedicated to gene-network creation. For example, gene sets may be defined from public microarray data including the Genome Institute of Novartis tissue-specific gene expression data, MGI tables of phenotypic alleles, Gene Network’s genetic correlation to gene expression, literature associations obtained via text mining using bibliographic similarity based approaches or Latent-Semantic Indexing, and even NCBI’s hand-curated Gene Reference Into Function. A higher order and somewhat less empirical class of gene lists comes from numerous literature reviews and hypothesis-based studies in which researchers have compiled gene lists involved in various behavioral constructs including pain, aggression, alcohol-specific behaviors, and drug abuse, among others. In addition, GAGGLE integration through FireGoose enables a bi-directional ODE interface with MeV, R, Cytoscape or other sites such as DAVID, STRING, or KEGG. Novel gene sets are also dynamically generated as a function of the analysis tools, iteratively optimized by users, and edited to create new sets of genes.
The software environment attempts to alleviate data incompatibility through the collection of metadata and pooling community gene annotation information. Metadata is collected during gene upload, using a web-based form designed to maximize free-form, ontological, and publication-centric information. For example, a PubMed ID (PMID) is sufficient to extract published information associated with the data set of interest and asynchronous tree menus allow users to assign multiple observations from community ontologies [3, 40] that may be used to describe their data. The use of existing OBOs means that metadata is extensible to any number of emerging ontologies and allows gene sets to be searched via a variety of biologically-relevant relationships. Plasticity in ontology metadata also allows the ontological alignment between different organisms, community ontology efforts, and experimental data sets.
Gene identifiers used in upload can come from a variety of databases, which are filtered based on the species and identifier type provided by the user during upload. The ODE upload process maps uploaded genes to the species’ reference database identifier (i.e. HGNC, MGI, RGD, etc.). If there is no reference identifier, the next most unique identifier is used (typically Entrez or Ensembl identifiers). This process ensures that ambiguous gene symbols from different species are kept distinct in the database. During analysis, gene name collisions across species are avoided by feeding unique ODE GENE IDs to the analysis tools. Homology relations are established using Homologene tables, though other mappings can be easily incorporated into the software. Once complete, the results are post-processed for on-screen display to add gene names.
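The identifier handling described above can be sketched as follows. This is an illustration only: the species-qualified keys stand in for ODE's unique gene IDs, the cluster labels stand in for homology table entries, and every identifier and function name is a made-up placeholder rather than a real database record or ODE routine.

```python
def translate_to_homology(gene_ids, homology_table):
    """Split gene IDs into homology cluster labels and IDs with no known homolog."""
    clusters, unmapped = set(), set()
    for gene_id in gene_ids:
        if gene_id in homology_table:
            clusters.add(homology_table[gene_id])
        else:
            unmapped.add(gene_id)
    return clusters, unmapped

# Hypothetical homology table keyed by (species, symbol). Keying on the pair keeps
# identical symbols from different species distinct, as the text requires.
homology_table = {
    ("mouse", "Abc1"): "H1",
    ("rat", "Abc1"): "H1",
    ("mouse", "Xyz2"): "H2",
    ("rat", "Xyz2"): "H3",   # same symbol, different cluster: symbols alone are ambiguous
}

mouse_set = {("mouse", "Abc1"), ("mouse", "Xyz2")}
rat_set = {("rat", "Abc1"), ("rat", "Xyz2"), ("rat", "Qrs3")}

mouse_clusters, mouse_unmapped = translate_to_homology(mouse_set, homology_table)
rat_clusters, rat_unmapped = translate_to_homology(rat_set, homology_table)

# Cross-species intersection is taken on cluster labels, not on gene symbols.
shared = mouse_clusters & rat_clusters
print(sorted(shared), sorted(rat_unmapped))
```

Comparing raw symbols here would report two shared genes, whereas comparing homology clusters reports one, which is the distinction the upload mapping is meant to protect.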
The ultimate goal of ODE is to construct empirically-derived phenome ontologies based on user-submitted and dynamically-generated sets of genes, displayed by the ODE as a Phenome Interdependency and Similarity Hierarchy (PhISH). Creating a PhISH graph is computationally challenging but solvable due to recent advances in algorithms for bipartite graph analysis . Briefly, phenotype supersets are defined by common connections to a gene or genes (Figure 4). These sets reside in the root node of an is-a hierarchy for the classification of phenotypes. Subsets are defined by connections to additional genes. These child nodes are associated with the same biological networks as the parent node, but are also connected to additional genes. Node splitting rules based on similarity, and stopping rules based on node size, are applied to limit the growth and density of the tree. To enhance the multi-domain integration of divergent data types, this approach using bipartite graphs employs discrete associations, of which types and thresholds may be defined by the user.
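To make the hierarchy construction concrete, the sketch below enumerates phenotype sets that share at least one gene, pairs each with its full set of shared genes (a maximal biclique), and links each node to its immediate supersets. It is a brute-force illustration for small inputs, not the efficient biclique algorithm referred to in the text, and it omits the node splitting and stopping rules; all names and the toy data are hypothetical.

```python
from itertools import combinations

def maximal_bicliques(gene_sets):
    """Return {frozenset(phenotypes): shared genes} for every maximal biclique.

    Brute force: start from each gene's phenotype neighborhood and close the
    collection under pairwise intersection. Worst-case cost is exponential.
    """
    gene_to_phenotypes = {}
    for phenotype, genes in gene_sets.items():
        for gene in genes:
            gene_to_phenotypes.setdefault(gene, set()).add(phenotype)

    closed = {frozenset(phs) for phs in gene_to_phenotypes.values()}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(closed), 2):
            common = a & b
            if common and common not in closed:
                closed.add(common)
                changed = True

    # Genes shared by every phenotype in the node.
    return {
        phs: set.intersection(*(set(gene_sets[p]) for p in phs))
        for phs in closed
    }

def hierarchy_edges(bicliques):
    """Return (parent, child) pairs where parent is an immediate superset of child."""
    nodes = list(bicliques)
    edges = []
    for child in nodes:
        supersets = [n for n in nodes if child < n]
        for parent in supersets:
            # Immediate parent: no other superset of the child lies strictly below it.
            if not any(mid < parent for mid in supersets):
                edges.append((parent, child))
    return edges

# Toy, made-up gene sets.
toy_sets = {
    "A": {"g1", "g2", "g3"},
    "B": {"g1", "g2", "g4"},
    "C": {"g1", "g5"},
}

bicliques = maximal_bicliques(toy_sets)
edges = hierarchy_edges(bicliques)
for parent, child in edges:
    print(sorted(parent), "->", sorted(child), sorted(bicliques[child]))
```

In this toy case the root holds all three phenotypes joined by the single gene g1, and each child covers fewer phenotypes but adds genes, mirroring the parent and child relationship described above.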
The automated and semi-automated creation of models requires algorithms that ensure users the ability to rapidly gauge the context and confidence of results. We recognize that the literature describing statistical significance of network relationships within fixed data sets remains unresolved, and attempt to provide qualifying, if not deterministic, measurements of dynamic result sets. This is achieved by measuring characteristics representative of information aggregation occurring at the level of genes and phenotypes and applying permutation tests or other metrics to determine the chance occurrence of similar results. For example, the goal of phenome information aggregation in a bipartite graph or biclique is to minimize the number of intersections present, meaning that a large number of phenotypes were reduced to a limited set of categories based on shared biological substrates. In practical terms this is viewed as the parsimony of the phenome map, represented by (Eq. 1.1) and (Eq. 1.2), where Phenotypes is the number of genes in an input set.
Here, larger values reflect the greater aggregation or condensation of phenotypes. From this perspective, a single root containing all phenotypes is an optimal result with maximal aggregation. According to (Eq. 1.2) it is apparent that even the addition of a single disjoint phenotype substantially reduces parsimony. Figure 5 demonstrates how parsimony is a generalization of the PhISH diagram shape, where irregular graph distributions have lower phenotype aggregation values and may be assigned probability values based on permutation tests.
Permutation tests were performed to place gene aggregation and phenotype aggregation into statistical context and to determine how the topology of the PhISH diagram deviates from random. Here, genes and phenotypes are shuffled within the information set, keeping the same overall density of gene-phenotype connections. Simulations against randomized data sets have a two-fold benefit. First, they enable assessment of the impact of false positive and false negative information on the resulting graph. The addition of false positive gene-phenotype associations adds links and nodes, connecting non-overlapping pairs of phenotypes, condensing two 2-phenotype nodes into a single 3-phenotype node, for example. In general this produces a taller tree that approaches the maximal phenotype aggregation value of a regular tree where all combinations of phenotypes are represented. Adding false negatives breaks links and removes nodes, deconstructing a tree into the minimal aggregation of all input phenotypes represented by a completely disjoint tree. These effects of permutation testing are described for a synthetic data set in Figure 6. Second, permuting a known data set n times produces a distribution of phenotype aggregation values, allowing a probability to be assigned to the observed values.
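The shuffling procedure can be sketched as a density-preserving permutation test. The sketch makes several assumptions that are not taken from the text: random graphs are drawn by sampling the same number of gene-phenotype edges uniformly from all possible pairs, the test statistic is a simple stand-in (the count of genes linked to two or more phenotypes) rather than the parsimony or overlap scores, and the p-value uses the (hits + 1) / (n + 1) convention.

```python
import random
from itertools import product

def shared_gene_count(edges):
    """Stand-in statistic: number of genes linked to two or more phenotypes."""
    degree = {}
    for gene, _phenotype in edges:
        degree[gene] = degree.get(gene, 0) + 1
    return sum(1 for d in degree.values() if d >= 2)

def permutation_p_value(edges, n_perm=1000, seed=0):
    """Compare the observed statistic with random graphs of identical edge density."""
    edges = set(edges)
    genes = sorted({g for g, _ in edges})
    phenotypes = sorted({p for _, p in edges})
    all_pairs = list(product(genes, phenotypes))
    rng = random.Random(seed)

    observed = shared_gene_count(edges)
    hits = 0
    for _ in range(n_perm):
        # Same vertices and same number of edges, so overall density is unchanged.
        shuffled = rng.sample(all_pairs, len(edges))
        if shared_gene_count(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy, made-up gene-phenotype associations as (gene, phenotype) edges.
toy_edges = [
    ("g1", "A"), ("g2", "A"), ("g3", "A"),
    ("g1", "B"), ("g2", "B"), ("g4", "B"),
    ("g1", "C"), ("g5", "C"),
]

observed, p_value = permutation_p_value(toy_edges, n_perm=1000, seed=0)
print(observed, p_value)
```

Other null models, such as shuffling while preserving each phenotype's set size, would change the reference distribution; the choice here is only the simplest one consistent with keeping overall density fixed.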
Another property of interest is overlap, or the density of gene-phenotype associations. This is calculated per node and aggregated across the entire tree. Based on the density of intersections of any sets of genes, we compute the exact probability of obtaining a result of higher or lower overlap. The scores of individual bicliques (Eq. 2.1) are combined across all sets in the entire tree (Eq. 2.2), where Genes_children is the number of genes in the union of all children of a biclique node. Either result is desirable depending on the user’s goal of identifying common or unique substrates.
ODE’s analysis tools build on maturing approaches to set analysis, specifically, on a variant of the binomial or hypergeometric test to determine whether members of each category are over-represented among a list of genes. ODE adds the Jaccard positive-match coefficient as a metric of set similarity, because this measure is not upwardly biased by a high rate of true negative results found in comparison of sparse sets. GoTree Machine was among the first to use a reference set to estimate whether category members were over-represented among a list of genes relative to possible representation from the set of genes considered. Newer tools, such as ErmineJ, take advantage of the entire vector of gene expression values rather than forcing the gene set to have a categorical representation. Both standalone and web-based tools exist, but most of them simply allow an identification of relations to a single user-entered gene set, or a limited group of gene sets, with a very limited set of functions facilitating union and intersection analysis. For example, existing tools allow one to ask questions such as, “Does this set of genes differentially expressed in response to stressors correspond to any known pathways or categories?” In contrast, ODE tool variants expand upon this approach to include matching sets of sets to other sets of sets, for example, by asking “Do stress-related gene sets have any common relationships with alcohol consumption related gene sets?” Using hypergeometric, Jaccard similarity and distance, and Fisher tests produces a high-level view of the landscape of gene relationships represented in the test set. While not required to construct PhISH graphs, these gene set similarity matrices provide inputs to clustering methods and act as filters for empirical ontology classifications.
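The over-representation test mentioned above can be illustrated with the textbook one-sided hypergeometric tail. The sketch assumes a reference set of N genes, of which K belong to a category, and a gene list of n genes containing k category members; the parameter names are illustrative, and the exact variant used by ODE is not specified in the text.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(population N, category size K, list size n)."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper, so impossible
    # configurations contribute nothing to the sum.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), upper + 1))
    return favorable / comb(N, n)

# Toy numbers: 10 genes in the reference set, 4 in the category,
# a list of 3 genes containing 2 category members.
p_over = hypergeom_upper_tail(k=2, K=4, n=3, N=10)
print(p_over)
```

For these toy numbers the tail is (C(4,2)·C(6,1) + C(4,3)·C(6,0)) / C(10,3) = 40/120, or one third. The choice of reference set N matters: testing against all genes in the genome rather than only the genes assayed changes the result.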
This work is a project of the Integrative Neuroscience Initiative on Alcoholism and is supported by NIH U01AA13499, U24AA13513.