|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: SML LH. Performed the experiments: SML WF. Analyzed the data: SML HT TW RAS. Contributed reagents/materials/analysis tools: SML WAB PK RPS. Wrote the paper: SML HT TW RAS LH. Designed and implemented the initial reading and reasoning components of the system, created the networks and contributed to the manuscript: SML. Contributed to the design of the system, used the system to interpret the craniofacial data, identified and developed the biological hypothesis, and contributed to the manuscript: HT. Did the mouse dissections and in situ staining to validate the system's results: WF. Designed and supervised a software reimplementation to facilitate broader distribution of the system: WAB. Implemented the reading and reasoning components to facilitate dynamic updates and further development: PK. Implemented aspects of the reporting component that facilitated direct access to the underlying knowledge and its provenance: RPS. Advised on and evaluated the significance of the developmental biology and contributed to the manuscript: TW. Advised on and evaluated the significance of the genetics and contributed to the manuscript: RAS. Conceived of the overall system, supervised all aspects of its construction, and wrote the initial and final drafts of the manuscript: LH.
The profusion of high-throughput instruments and the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data. Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledge-based analysis of that data. An implementation of this approach, called the Hanalyzer, is demonstrated on a large-scale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.
Recent technology has made it possible to do experiments that show hundreds or even thousands of genes that play a role in a disease or other biological phenomena. Interpreting these experimental results in the light of everything that has ever been published about any of those genes is often overwhelming, and the failure to take advantage of all prior knowledge may impede biomedical research. The computer program described in this paper “reads” the biomedical literature and molecular biology databases, “reasons” about what all that information means to this experiment, and “reports” on its findings in a way that makes digesting all of this information far more efficient than ever before possible. Analysis of a large, complex dataset with this tool led rapidly to the creation of a novel hypothesis about the role of several genes in the development of the tongue, which was then confirmed experimentally.
Human knowledge relevant to biomedical research is expanding at an exponential pace. Over the last twenty years, more than 10 million publications have been indexed by the National Library of Medicine (NLM) and made available through PubMed, reflecting a compounded annual growth rate of more than 4.8% ,. Structured knowledge, in the form of molecular biology relevant databases, has also been growing at an impressive rate. The journal Nucleic Acids Research publishes an annual compendium of peer-reviewed databases relevant to molecular biology; the 2008 issue reported on 1,078 such databases .
While intense specialization has in many cases made it possible for biomedical researchers to know everything practically relevant in a very narrow domain, a breakdown of disciplinary boundaries and the fundamental interconnectedness of biological systems have rendered specialization an increasingly impractical strategy for keeping up with biomedical knowledge. Information about fundamental molecular structures and functions, such as mutations or protein-protein interactions, are spread across the entire literature. For example, [4; figure 1] demonstrated that nearly 40% of the more than 5,000 journals indexed in PubMed in a typical year contained at least one assertion regarding protein transport, interaction or expression that could be found by a text mining system.
One approach to dealing with this overwhelming amount of information is to organize human experts to curate key aspects of it, resulting in databases of formally represented assertions with pointers to the evidence in the literature. Over the last 6 years, the U.S. National Institutes of Health has invested more than $52 million to support ontology development and use (Personal communication from Peter Good), including the Gene Ontology Consortium and the National Center for Biomedical Ontology. However, even this large investment has been simply unable to keep up with the volume of relevant publications;  showed that even under extremely optimistic assumptions it will be decades before annotation will be complete and up to date.
Furthermore, not all human knowledge of biomolecular function is explicitly stated in any database or publication. Molecular biologists often make inferences regarding the likely function of a molecule based on factors such as homology, interaction partners, or other methods; this approach has been called the “post-genomic approach to protein function” . Protein interactions are reasonably well characterized experimentally in yeast, but much less so in other organisms. As of this writing, the Database of Interacting Proteins (DIP) contains records regarding 18,331 interactions among 4,923 yeast proteins derived from 23,344 experiments, likely a close to complete inventory. However, there are only 415 curated interactions among 307 mouse proteins derived from 595 experiments in the database—likely fewer than 1% of the true protein-protein interactions. Recently, computational approaches to protein function inference such as ,, and others have extended interaction predictions to generate functional categorization of dramatically larger numbers of proteins. As these inferences of function are less reliable than experimental observations, most computational approaches associate a likelihood or reliability with each prediction.
Advances in instrumentation are also generating molecular data at ever increasing rates. High-throughput (also known as genome-scale) assays for detection and analysis of gene expression, genetic polymorphisms, macromolecular interactions and other fundamental processes are generating datasets that contain information about the structures or activities of on the order of 106 different genes or gene products at a time. More than 200,000 such assays from more than 8,300 different experiments are publicly available from the National Library of Medicine's Gene Expression Omnibus catalog (http://www.ncbi.nlm.nih.gov/geo/ viewed on April 7, 2008), and far more results of high-throughput experiments are available in more restricted settings.
Experiments that exploit these genome scale assays often generate results that implicate dozens to hundreds of genes or gene products related to a phenomenon under study. The amount of information regarding even just these significant results (and relevant homologs) in gene-centric databases and in the research literature is often overwhelming, yet the proper interpretation of the results requires taking stock of all of that knowledge. Furthermore, as the revolution in systems biology has made clear, it is critical to analyze the specific interactions among the genes, not just the genes in isolation. In a set of hundreds of relevant genes, there are tens of thousands of potential interactions to consider. Analyzing all of the relevant genes and interactions in genome-scale data, while important to advancing human understanding of biomedical phenomena, is a truly daunting task.
Here we introduce a novel computational approach to analyzing genome-scale data in the light of existing knowledge, built on three broad classes of algorithms: reading, reasoning and reporting. For that reason, we refer to the overall approach as a 3R system. 3R systems are a restricted class of knowledge-based systems. The goal of a 3R system is to assist biologists in forming explanations of the phenomena in genome-scale data, and to generate significant hypotheses that can influence the design of future experiments. The approach is based on the comparison of two weighted graphs. One graph, called the “knowledge network,” represents a large portion of the existing knowledge of gene products and their relationships. The other, a “data network,” describes a particular data set produced by a high-throughput experiment. There are many possible ways to implement a 3R system; we call the particular implementation reported on here the Hanalyzer (for high-throughput analyzer).
This paper describes the use of the Hanalyzer in the analysis of a comprehensive expression dataset for mouse craniofacial development. (This dataset is described in detail in ; the analysis of a portion of the data not reported by  is described below.) The Hanalyzer does not automate the production of explanations (nor hypotheses), but supports human users who are performing these tasks. Through use of the Hanalyzer, several novel hypotheses regarding the gene networks involved in craniofacial biology were generated; we also report on their experimental validation.
A wide variety of previously reported systems and algorithms have influenced this work. The descriptions of the reading, reasoning and reporting components below cite related work and compare specific approaches. With respect to the overall system architecture, there is substantially less related work. Many reported uses of background knowledge in the analysis of high throughput data use it as the basis for clustering differentially expressed genes or to attempt to model pathways or networks; such work is reviewed in . Other approaches use background knowledge to identify a priori sets of related genes for differential expression testing, e.g. or .  describes InteractionFetcher and CytoTalk, two Cytoscape plugins that facilitate lookups of information about genes in an interaction graph and can assert new edges based on interaction information from remote databases; they describe a use-case analyzing Hepatitis C with their tools. Perhaps the closest previous approach is the case study described in , where a protein-protein interaction network was built using the MedScan text mining approach  and then applied to analysis of expression array data with the active subnetwork algorithm .
The 3R approach differs from this prior work in several ways. First, the use case of developing explanations for the data, rather than identifying or clustering differentially expressed genes, influences both the methods employed and, most importantly, the criteria used to evaluate such a system. Second, while the aforementioned systems are all designed for the specific analysis of gene expression array data, 3R systems can be applied to many other forms of high throughput data, as described in the discussion. Third, our representational commitment to nodes as fiducials both expands and constrains the sorts of knowledge graphs that can be produced and applied. Finally, our division of the approach into reading, reasoning and reporting tasks expands the sorts of algorithms that can be productively applied to improving performance of 3R systems; reasoning (in the Hanalyzer, the network inference algorithms) in particular had not previously been applied in this sort of analysis.
As shown in the system diagram in Figure 1, 3R systems involve reading, reasoning and reporting. The reading component extracts information from the literature and from relevant databases. The reasoning component makes inferences regarding several types of semantic relationships among genes and gene products, estimating likelihoods and leaving a trail of provenance. The reporting component relates knowledge to data and presents the combinations by augmenting a popular visual interface. Underlying each of these tasks is a shared knowledge representation capable of supporting the required inference and record keeping.
Knowledge in our system is constrained to be compatible with the World Wide Web Consortium's Web Ontology Language (OWL). While program internals represent the knowledge graphs more directly, it is always possible for the Hanalyzer to export an OWL version of its knowledge, and to import knowledge in OWL format. The OWL syntax for representing properties is a binary relation, linking two individuals or an individual and a value; however, for a great deal of knowledge in molecular biology it is natural and convenient to link an individual to more than one other individual or value – for example, the process of transporting a protein from one subcellular location to another would naturally involve a relation between the protein and two locations. For this reason, we adopt the practice recommended in the W3C working group note Defining N-ary Relations on the Semantic Web of 12 April 2006 (http://www.w3.org/TR/2006/NOTE-swbp-n-aryRelations-20060412/) pattern 1, primarily following use case 3. The quantification links (described in more detail below) follow use case 1. Provenance information is stored as an annotation property.
Entities in our knowledge network that are clearly and unambiguously interpretable by the biomedical community are termed “fiducials.” A fiducial is either a specific element of a community-curated ontology (such as available through the National Center for Biomedical Ontology's BioPortal, http://www.bioontology.org/bioportal.html) or derived from a specific entry in a publicly available database, such as a particular identifier from the Entrez Gene database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene). All representations of genes, gene products, macromolecular sequence features, molecular functions, biological processes, metabolic pathways, subcellular locations, cell types, organisms, diseases and drugs in our system are fiducials.
Not all relationships between entities in our knowledge network can be mapped to elements of a community curated ontology. Arcs between fiducials are used to represent non-fiducial elements. For example, in addition to relations defined in the OBO Relation Ontology (http://www.obofoundry.org/ro/) such as ‘part-of,’ we use at least two additional relationships: One non-fiducial link represents the very abstract relationship that specifies a connection of any kind between a pair of proteins. Ultimately, the user interface displays this very abstract relationship, which, borrowing terminology from , we will refer to as a semantic relationship. The other non-fiducial link quantifies the overall inferred reliability of the semantic relation (see below for how this is calculated). Also, as described in detail below, some knowledge sources may assert links among fiducials that do not correspond to relations from the OBO Relation Ontology. The totality of all entities and relationships in this knowledge representation in the system at any given time is called the knowledge network.
The fiducials and semantic relationships in the knowledge network are initially populated by a series of processes that extract information from databases and from the literature. Again following the terminology of , we call these processes “experts.” The size of the graph produced can be limited by seeding the knowledge-base with a target set of fiducials (usually a set of genes of interest from a particular experiment), and requiring any addition to the knowledge-base to have a relationship involving one of these target fiducials. Unless otherwise noted, the knowledge networks discussed below begin from a target list of 8923 Mus musculus genes that were differentially expressed among at least one pair of conditions in the craniofacial dataset described below. The genes were specified by identifiers from the Mouse Genome Informatics (MGI) database  (or Entrez Gene or Uniprot IDs, which can be readily translated), and no distinction is made between genes and gene products.
Relationships describing protein-protein interactions are extracted from the Biomolecular Interaction Network Database (BIND) , Database of Interacting Proteins (DIP) , Molecular Interaction database (MINT) , the IntAct database  and the RIKEN protein interaction table . Additional relationships are taken from the list compiled by  which relates a protein annotated to the Gene Ontology  Molecular Function term “protein binding” (GO:0005515) with evidence code IPI (inferred from physical interaction) to the protein identified in the “with” field of the term annotation. Interactions from all databases are combined and divided into experimental assay groups by canonicalizing spelling variants among text strings describing the assay (e.g., TAP and tandem affinity precipitation) and grouping like assays (e.g., CLASSICAL-TWO-HYBRID, MATRIX-TWO-HYBRID, TWO-HYBRID, TWO-HYBRID-ARRAY, TWO-HYBRID-TEST all represented by the single label TWO-HYBRID). Any relationship without an assay description is labeled UNKNOWN. This process results in 4,544 relationships among 1,693 targeted MGI identifiers with 25 assay type labels. Each assay type becomes an “expert,” and can therefore be assigned a reliability score independent of the other assays.
Relationships describing protein-DNA interactions are extracted from the TRANSFAC 10.2 database  by relating a protein to the transcription factor recognizing a given sequence motif found in the regulatory region of the protein. The expert derived from this information (Transfac) contains 580 relationships among 434 MGI identifiers. Additional putative protein-DNA interactions are extracted from the PReMod database of genome-wide mammalian cis-regulatory module predictions  which catalogs phylogenetically conserved regulatory modules between human and mouse. The resource lists the TRANSFAC motif identifiers of elements in a conserved module, together with the upstream and downstream genes. Two different experts are derived from this information, one which relates a transcription factor recognizing any motif in the module to both the upstream and downstream genes (PReMod) and one which relates two transcription factors if they recognize motifs in the same identified conserved module (PReModM). The PReMod expert asserts 345,814 relationships among 13,852 targeted MGI identifiers while the PReModM expert asserts 17,317 relationships among 189 targeted MGI identifiers. The large number of relationships from these experts suggests the potential of a high level of noise, which is expected for computational predictions.
The OpenDMAP system  was used to extract information from all abstracts in Medline regarding protein transport events, protein-protein interaction assertions, and what proteins are expressed in which cell types. OpenDMAP is particularly well suited to this task, since its information extraction patterns are explicitly associated with a knowledge-base, and all of its outputs are in terms of the representation scheme of the knowledge-base. Although discussed in detail in  a brief example describing the extraction of protein transport assertions from the literature here is illustrative. Protein transport is a 4-place relationship between two proteins (a transporter and a transportee, represented by MGI IDs) and two subcellular locations (fiducials from the Gene Ontology cellular component subtree). Most assertions do not mention all aspects of that relationship, although to be extracted at least one protein and one compartment had to be recognized. To map this extracted information into the network, up to five pairwise relationships are created. An expert (Transloc) derived from this information asserts a relationship between the transporter and the transportee, using the Entrez gene ID to MGI identifier mapping available at the MGI website, for a total of 157,764 interactions among 1108 targeted MGI identifiers.
Protein-protein interactions extracted from the literature can be translated into network arcs straightforwardly. Extracted assertions regarding the type of cell that a protein was expressed in were mapped to a relation between a gene and an element of the cell type ontology. A total of 265,795 interaction instances and 176,153 expression instances were extracted from all Medline abstracts. Of these, 8292 interaction instances and 7035 expression in cell type instances could be mapped to a targeted MGI mouse gene, resulting in the assertion of 4525 relations among 3157 genes based on literature assertions of protein-protein interactions, and 127,283 relations among 1677 genes being expressed in the same cell type (fiducials from the Cell Type Ontology).
Even when the previously described information extraction system is unable to extract a direct relationship from biological literature, systematic overlap between publications that merely mention two genes can be taken as indirect evidence of a semantic relationship between them. Several systems have used the existence of an article that mentions a pair of genes as evidence of an interaction between them (e.g. ). Others use a probabilistic measure based on mutual information  or the hypergeometric distribution , and extract relationships exceeding a probability threshold. However,  demonstrates that a related measure (thresholded asymmetric co-occurrence fraction or ACF) provides more robust performance in network-based protein function prediction. Since the reliability of finding and normalizing gene mentions in free text is substantially higher than that of more general information extraction , we apply this technique in addition to the OpenDMAP approach described above. The ACF measure  calculates the proportion of the number of shared mentions relative to the number of mentions the less frequently-mentioned gene in a given pair, incorporating a bias toward relationships involving less well-studied genes. The result is a set of inferred relationships between a pair of genes whenever their ACF>0.5; this expert is called co-Lit.
Each expert also records the support for each assertion it makes, including at least a pointer to the source of the data and, when possible, a publication (as a PubMed identifier) and the date when the assertion was created. The reporting component can show this provenance information and link to the original database entry or document passage during analysis.
Note that finding multiple relationships between a single pair of entities is entirely possible. For example, a pair of proteins may be related via an expert that extracts knowledge from a protein-protein interaction database, and by another that does text-mining searches for protein transport statements in Medline abstracts. When multiple relationships are found between a single pair of entities, the reliability of the semantic relationship increases.
Once the initial knowledge-base is created, it is enhanced by reasoning processes that add additional relationships. These processes are also called experts. When necessary for clarity, experts that obtain knowledge by reading an external source are called reading experts, and those that infer additional knowledge are called reasoning experts.
An important method for adding semantic relations between genes is to infer that such a relationship exists when two genes have certain properties in common. A series of experts asserts semantic relationships between pairs of genes based on: shared membership in a signaling or metabolic pathway (co-KEGG) , shared annotation to a particular biological process (co-BP), molecular function (co-MF) or cellular component (co-CC) , shared gene knockout phenotype (co-Pheno)  or shared protein domain assignment (co-Interpro) . For resources involving a nested hierarchy of ontology terms, such as the Gene Ontology (GO) or the Mammalian Phenotype (MP) ontology, relationships exist at a number of levels. For MP, a relationship is added among proteins annotated to their most specific term while for GO, certain terms are first merged when the information content score by the Jiang measure between the terms exceeds 19.0 (see , for details). These experts assert between 7,873 (co-KEGG) and 267,317 (co-BP) relationships covering a combined total of 22,922 MGI identifiers.
Another set of inferred relationships links sets of ontology terms using the ontology enrichment process described in  to link molecular functions and biological processes from the Gene Ontology to small molecule participants from the Chemical Entities of Biomedical Interest (ChEBI) ontology. For example, this process creates relationships between the GO molecular function terms “Calcium Signaling” and “Calcium Transport” and the ChEBI term “Calcium(2+).” Additional semantic relationships between genes are inferred if such enrichment results in two genes sharing a small molecule participant in a molecular function or biological process (co-ChEBI). For example, this inference adds a semantic relationship between pairs of genes that have functions each of which in turn has calcium as a participant. Similar inference is made over the GO cross-products (see  and http://wiki.geneontology.org/index.php/Cross_Product_Guide).
A critical aspect of the reasoning component is the ability to assimilate the information from all experts and estimate the confidence that a relationship exists between any given pair of proteins. The collection of assertions from both reading and reasoning experts contains a large number of false positives due to uncertainty in a computational prediction, experimental noise in an assay, or even the intentionally noisy nature of inferred relationships. For example, it is not likely that all cytoplasmic proteins interact as the co-CC expert suggests, yet co-localization information can usefully contribute to estimating the likelihood of a semantic relationship when integrated with the other evidence types.
Biological data integration techniques have been widely studied in the literature, ranging from simple measures which assign higher confidence to assertions shared by multiple experts , or based on certain relationship network topology characteristics ,,,, to more sophisticated integration strategies which use machine learning techniques to estimate interaction likelihoods, such as probabilistic graphical models ,,,,,,,,,, or kernel methods ,,. Many of the techniques attempt to estimate error rates of the individual expert types before integration using either a gold standard ,,,,,,,,,, or the set of data sources themselves to determine relative reliabilities ,.
In mouse, there are already too few available sources for determining relationships to justify withholding one as the gold standard. Moreover, since our system attempts to capture a variety of semantics for what type of relationship might exist between two entities, determining the appropriate gold standard is difficult. The consensus reliability estimate  used in our system avoids the use of an explicit gold standard by computing the consensus number of assertions for a given relationship among all experts and assigning a higher reliability to a given expert if many other experts agree with its assertions on average (see  for details). Since many of the reasoning experts assert a large percentage of all possible relationship pairs, the consensus numbers used in the averaging are computed only over experts which are derived from sources explicitly naming both proteins (protein-protein interactions, 25 experts), protein-DNA interactions (Transfac, PReMod, PReModM), translocation events (Transloc), and literature co-occurrence (co-Lit)). All assertions from a given expert are assigned the reliability of that expert.
One of the most popular methods to combine individual reliabilities is to assume independence of experts (naive Bayes assumption) and compute the integrated likelihood P for each relationship using the Noisy-OR function P=1−Πi (1−ri) where ri is the reliability of an expert i (scaled if necessary into the range 0 to 1 to allow interpretation as probabilities) ,,,, see also the useful exposition . The Noisy-OR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support. This property is especially relevant in biology, where it is often difficult to identify false negatives; a given assertion is strengthened by additional information but unlike the case for estimating the reliability of an expert on the whole, an individual assertion is not penalized for lack of additional evidence. Moreover, since the experts are assumed to be independent, experts can be removed from or added to the analysis without excessive re-computation.
The purpose of building this large, integrated network is to facilitate the exploration of high-throughput data in light of what is already known, with the goal of generating explanations of the observations and also generating novel biological hypotheses. In order to report the aspects of the knowledge network that are relevant to understanding a dataset, it is necessary to both select appropriate sub-networks for presentation and to present them in a comprehensible and useful way.
A simple approach involves visualizing the knowledge network that includes particular fiducials. A gene list can be used to generate the knowledge sub-network that includes them, similar to the approaches presented in ,,,. Such a sub-network can also be extended to include other genes linked to the query set by sharing highly interconnected subcomponents that might represent protein complexes (see for example ,.
Sub-networks, along with attributes of the fiducials and linkages among them are visualized using Cytoscape , an open source network visualization platform. A Cytoscape plugin, CommonAttributes, was written that allows the user to trace the provenance of links in the knowledge network and directly access the underlying data sources and publications (Figure 2). This approach was used to identify functional explanations of a gene list by exploiting inferences not available in any of the individual data sources .
A more effective method of exploiting the knowledge network is to create another quantitative network based on the experimental data (called a data network) and combine the two networks in various ways. In the application reported here, a data network is constructed from the results of a gene expression array experiment. Nodes in this data network are genes that exhibited differential expression in the experiment, arcs connect genes whose expression levels are correlated at an above threshold level, and arc weights are the absolute value of the correlation coefficient (see below for details). Data networks can be generated from any sort of data that can be represented as a weighted graph among fiducials, not just expression arrays. Methods for combining the knowledge network with the data network can highlight linkages in the data that are well supported by existing knowledge, thereby facilitating explanation, or can highlight linkages in the data that are not well supported by existing knowledge, facilitating the generation of novel hypotheses. Both approaches can be exploited together, as demonstrated below.
Other studies have used protein interaction networks together with p-values from tests of differential expression gene expression to identify ‘active’ sub-networks in an input graph ,,, and to improve expression profile clustering using a combined distance metric computed from profile correlation and network distance ,. Our system integrates these two prior approaches, using a combined distance metric to identify active (and explanatorily interesting) sub-networks.
The distinction between the knowledge network and the data network allows comparison at the level of networks, in the spirit of multiple graph approaches such as ,. Our approach builds on those in two ways: by exploiting the inferences in a dynamically generated and extremely broad knowledge network, and by offering multiple combination functions that support both explanation and hypothesis generation applications.
Creation of combined networks that integrate the knowledge and data networks in different ways is a key step. Due to the use of fiducials in both the data and knowledge networks, aligning the nodes of these networks is trivial. In contrast, there are many alternatives for combining the arc weights from the knowledge and data networks.
The semantic integration combination functions (e.g. noisy-OR) could also be used to combine corresponding arcs in the data and knowledge networks, but many other alternatives are also available, and some are superior. Approaches based on likelihood ratios for individual sources ,,, typically assume independence (naive Bayes) and simply multiply likelihoods. When the assertion probabilities can be interpreted as p-values,  review three techniques from statistical mechanics for integration: Fisher's F, Mudholkar-George's T, and Liptak-Stouffer's Z (see ). Averaging the probabilities or averaging logistic functions of the probabilities, as used in  are also possibilities.
The effect of combining probabilities from two sources using various techniques is illustrated in Figure 3. All functions except those denoted Average and Hanisch Logit exhibit the behavior described earlier about Noisy-OR, where the value of the combined probability is 1.0 if at least one of the sources assigns a probability of 1.0 (observed as the red area touching the z=1.0 plane). Mudholkar-George's T and Liptak-Stouffer's Z have the additional property that the combined probability is 0.0 if at least one of the sources assigns a probability of 0.0 (observed as the dark blue area touching the z=0.0 plane). In this context, these two functions are less applicable since negative relationships (probability of 0.0) are difficult to observe. The remaining functions differ on how they treat intermediate probability values. Fisher's F shows a rapid decline in combined probability compared to Noisy-OR which maintains a higher combined value when at least one is high. In contrast, Averaging and Hanisch Logit methods require agreement among sources to achieve a high combined value, allowing a value of 1.0 only when both source probabilities are 1.0. The sinusoidal curve of Hanisch Logit implements a thresholding effect where the combination is given more weight than in Averaging when both source probabilities are at least greater than 0.5. Since the purpose of the combination network is to emphasize concurrence among the knowledge and the data networks, the Averaging and Logit methods are more appropriate than the others.
In the application described below, the distribution of weights in the knowledge and data graphs are such that the Averaging combination method gives high scores to arcs that are supported in both the data and knowledge networks (behaving somewhat like a Boolean AND), while the Logit method privileges the high scoring arcs in the data network over those in the knowledge network (since the distribution of correlations is weighted more towards 1 than the distribution of knowledge confidence scores). Figure 4 shows in detail how an example link is created in each of these graphs.
Use of the Hanalyzer is demonstrated in the analysis of an experiment that created a comprehensive expression dataset for mouse craniofacial development. The transcriptome of C57BL/6J strain (Jackson Labs) mice was sampled at 12 hour intervals from E10.5-E12.5, a time period that spans from formation of the facial prominences to when they fuse together to form the mature facial platform. Microdissected samples from three distinct facial regions were isolated at each time-point: the frontonasal, the maxillary, and the mandibular prominence. Seven independent biological replicates were prepared and analyzed for each sample. This dataset and an initial analysis of it are described in detail in . To create the data network, the expression level of all the replicates at a particular time point and tissue for all probes associated with a particular MGI identifier are averaged. These averages are normalized by computing the log2 ratio of each gene's average expression level at each time point and tissue to the median expression level across all time points and tissues. The Pearson correlation coefficients over time and tissue are then computed for all pairs of genes.
Two combined networks are created; one using edge Averaging and one using the Logit method. Arcs were included in a combined network only if at least three of the reading experts support it. Genes not linked to any other genes were removed, creating combined networks containing 8923 MGI identifiers. The arcs in these two combined networks were further pruned so that only the highest scoring 1000 edges by each method were visualized. Figure 5a–c illustrates the distributions of the individual components while Figure 5d illustrates the top 1000 edges for both Average and Logit combination networks.
As shown in Figure 5, the distributions of probabilities in the knowledge and data networks interact with the combination functions to achieve different sorts of reporting goals. The arcs that appear in the Average combination network are strongly connected in the background knowledge and in the data network. Identifying these already well-understood aspects of the data provides rapid orientation to an analyst. Using Cytoscape with our visualization plugin, an analyst can identify important functional themes rapidly, surveying details such as associated GO annotations, gene descriptions and known knock out phenotype information. In contrast, the edges that appear in the Logit combination network but not the Average network indicate links that are strong in the experimental data, but have only modest support in the background knowledge. These edges are used to generate new hypotheses about the roles of genes not previously known to be involved in the phenomena under study.
The use of two different combination functions to investigate the network enables the development of an investigative methodology that supports hypothesis generation through systematic network exploration. The top 1000 edges as scored by either function generate a network comprised of 945 genes and 1,743 total edges. This collection of high scoring edges is organized as 92 pairs, 15 triplets, seven small clusters (<10 nodes), one large ‘yarnball’ (551 nodes), and three medium-sized clusters (comprising 27 to 51 nodes) (Figure 6). One of the medium-sized sub-networks (total 45 nodes, 107 edges, is analyzed in detail here (circled in Figure 6), illustrating a typical use of the Hanalyzer.
That sub-network contains 50 edges from the Average combination graph, involving 20 nodes (Figure 7); 15 edges asserted solely by the Average metric and 35 asserted by both the Average and Logit measures. By browsing the annotations associated with these 20 genes and their protein products it quickly became apparent that the theme common to this sub-network is muscle (Table S1). Nineteen of the 20 nodes have at least one reference to ‘muscle’ within their annotations or description, with the most informative descriptive terms being the GO Biological Process terms “muscle contraction” GO:0006936 (and children, including “regulation of muscle contraction” GO:0006937) and “muscle development” GO:0007517, together annotating 15 of the 20 nodes. It is also of interest to note that the majority of the nodes (13 of 20) in this network belong to one of three well characterized muscle protein families (Actin, Myosin and Troponin), suggesting that this network is involved in force generation and structural integrity of muscle.
The single apparent exception to this muscle theme was Thbs4 (Thrombospondin 4, MGI:1101779). Direct searching of PubMed identified a role for Thbs4 (also known as TSP-4) in muscle formation. Thbs4 is secreted by developing tendon mesenchyme cells, and is part of a local signaling process involving the protein ankyrin repeat domain 1 (Ankrd1; MGI:1097717) which couples tendon morphogenesis to muscle formation  (note that Ankrd1 was called “muscle ankyrin repeat protein” or marp in that paper). Thbs4 is expressed at high levels (and in complementary patterns) to Ankrd1 during myogenesis through late embryogenesis and is still observed postnatally .
This network is intriguing because of its strong muscle theme and because the expression profile of the nodes within this network is striking in its mandibular specificity (Figure 8). The expression of this group of 20 genes is consistently and exclusively up-regulated in the mandibular sample as development progresses from E10.5–12.5. The literature indicates that this expression profile is consistent with tongue muscle development; the tongue being the largest single muscle mass in the head and located within the mandible. At approximately E11, the migration of myogenic cells from the occipital somites into the tongue primordia is considered complete, with myoblasts continuing to proliferate and differentiate until around E15 when they fuse and withdraw from the cell cycle . Desmin (Des, MGI:94885) mRNA is detected as early as E10, consistent with its marking early steps in skeletal myogenesis, such as myoblast determination . Also, Thbs4 has been shown to promote myogenic differentiation specifically in the tongue, which due to its lack of cartilage, links muscle groups through a tendinous scaffold .
This same group of genes is also up-regulated at the later E12–12.5 time point in the maxilla sample, consistent with a later onset of all other muscle cell differentiation in relation to the tongue. Skeletal muscle development is staggered, with the tongue maturing approximately 1.5 days (in mice) earlier than all other skeletal muscles. The more advanced stage of tongue muscle development at birth is thought to correlate with its requirement for mammalian suckling immediately after birth . The lack of significant muscle in the frontonasal prominence accounts for the low level of expression of these genes in that tissue. The systematically reported and easily explored collection of relevant background knowledge made the interpretation of this complex set of evidence regarding the broad developmental function of a complex group of interacting genes much more straightforward than it would have been using any other approach with which we are familiar.
Once the well understood aspects of the sub-network had been explored and a biological explanation for the observations created, the analyst adds the edges asserted only by the Logit metric to the visualization of the sub-network. The inclusion of Logit-asserted edges introduced an additional 25 nodes to the network (total 45 nodes), and expanded the network to 107 edges (Figure 9). These 107 edges consist of 48 Logit-only edges, 18 Average edges (note the additional 3 Average edges linked into the network via connection to nodes introduced by the Logit edges) and 41 edges asserted by both Logit and Average metrics. The nodes comprising this larger network display the same striking mandible-specific expression pattern of the Average combination network, suggesting these additional nodes may also be implicated in tongue development (Figure 10).
Although nine of these additional nodes expand the core cluster described above, the majority of nodes form two new clusters tethered to the initial group by one to four edges. Browsing the collated annotations associated with these additional nodes allowed rapid insight into common functional themes. These annotations indicated that the two additional clusters represent myogenic differentiation (six nodes) and synapse interactions (eight nodes) (Figure 11 and Table S2). Within the synapse cluster the most informative annotations are the KEGG annotation “Neuroactive ligand-receptor interaction” KEGG:mmu04080 and the GO Cellular Component term “postsynaptic membrane” GO:0045211, which together annotate all six members of this cluster. All eight nodes within the transcription cluster are, unsurprisingly, annotated with the GO Biological Process “transcription” GO:0006350, and five of these nodes also have a documented muscle-related knock out phenotype. The specific genes and interactions in each of these three clusters are explored in turn, and several are selected for experimental validation.
The first cluster investigated we called the core cluster. Of the nine additional nodes contributing to the structural cluster, four (Cdh15, Nrk, Fndc5, and E430002G05Rik; MGI:106672, MGI:1351326, MGI:1917614 and MGI:2445082, respectively) lack annotations from our experts suggesting a role in either muscle, or more generally, craniofacial development. Supplementary investigation of the literature and publicly available expression data was required to extrapolate the muscle association of these four genes.
In contrast to the other ‘unannotated’ nodes, Cdh15 (also known as M-Cadherin, M denoting muscle ) is a very well studied gene with a number of associated publications (23 references tied to its MGI record alone [accessed 4/23/2008]). It has long been known that Cdh15 is expressed in myogenic cells and has a role in skeletal muscle differentiation, as indicated by low level expression in skeletal myoblasts followed by an increased expression in myotube forming cells . Its precise role during muscle development and regeneration is yet to be determined however, and a recent Cdh15 null mouse model with apparently normal muscle phenotype suggesting functional compensation by other cadherin proteins .
The lack of information linking Cdh15 with muscle development highlights the persisting problem of organism-specific gene name normalization. While Cdh15 is the only official gene symbol, there are two approved names for the resultant protein product; Cadherin 15 and M-Cadherin (myotubule) [Data from HUGO, www.genenames.org Accessed 5/1/2008], and to confuse things further, both names are only used in the human records for this gene (Both GeneBank [NM_004933] and Entrez Gene [ID: 1013] use “Homo sapiens cadherin 15, M-cadherin (myotubule) (CDH15), mRNA” as their definition).
The literature indicates that the Ste20-type kinase, NIK-related kinase (Nrk) is predominantly expressed in developing skeletal musculature from E10.5 through E17 during mouse embryogenesis; however, Nrk expression is not detected in any adult tissues, including skeletal muscles . Limited RNA expression data obtained from GenePaint.org , also appears to show Nrk expression in E14.5 tongue (GenePaint set ID: MH1818, section Embryo_C1818_1_4B).
In the developing embryo, the recently characterized fibronectin type III domain containing 5 gene (Fndc5, also known as PeP and Pxp; data from iHop ) is almost exclusively expressed in developing skeletal muscle . Absent at E7, Fndc5 expression is first detected in whole embryos at E11, and at E13.5 is specifically observed in the tongue and other skeletal muscles . A role during myoblast differentiation is indicated by a two-fold increase in expression during the transition from myoblasts into myotubes, after which expression stabilizes and continues into and throughout adulthood .
Finally, investigation of the Riken clone E430002G05Rik presented little informative annotation. A single GeneRif identified from the associated EntrezGene entry (GeneID: 210622) yielded all information ascertained about this gene via the associated publication. This single publication  identified mRNAs affected in a mouse model (mdx) for Duchenne muscular dystrophy (DMD). E430002G05Rik was identified as a down-regulated transcript in the mdx mouse and subsequently named RAMP (Regeneration-associated muscle protease homolog) . It was observed that RAMP is predominantly expressed in normal adult skeletal muscle and brain, and that it is specifically up-regulated in regenerating skeletal muscle fibers after injury . The absence of any annotation regarding development prompted the selection of this gene for further experimental validation.
We called the second cluster explored the Transcription Factor Cluster. Although well annotated as transcription factors, information provided by reading experts on Pitx3, Rxrg and Zim1 (MGI:1100498, MGI:98216, and MGI:1341879, respectively) did not suggest roles in muscle development (Table S2), prompting further investigations. Pitx3 is well characterized and annotated with respect to its role in lens formation during eye development ,. However, literature searching revealed that tongue-specific expression of Pitx3 (also known as Ptx3) during development (expression first detected at E11.5) was documented over a decade ago , while its specific role in myogenesis and myoblast differentiation has only more recently been reported .
Known and annotated principally for its role in mediating the effects of retinoic acid, there also exists extensive literature associating Rxrg (retinoid X receptor gamma) with myoblast differentiation. This association was not asserted by any of the reading experts, although 117 papers were returned by PubMed search with query “rxr muscle” (accessed 4/25/2008), also suggesting difficulties in species-specific gene name normalization. As early as 1993, RXRs were identified as positive regulators of skeletal muscle development via their direct interactions with Myogenin and MyoD promotor elements ,, and the role of Rxrg in muscle continues to be explored, with the most recent associated publication identifying a role in lipogenesis and SREBP1c regulation in skeletal muscle . A high-throughput study identifying transcription units involved in brain development  indirectly documented the tongue-specific expression profile of Rxrg in E13.5 mice (image MGI:3507450), with the same expression pattern weakly persevering in E14.5 mice (GenePaint.org set ID: C1279, section Embryo_C1279_6_3D).
Significantly less is known about the zinc-finger gene, Zim1. In mouse, this gene is part of an imprinted cluster that includes Zim2 (MGI:1923887) and Peg3 (MGI:104748) , but a Zim1 ortholog has not been identified to date in human. Therefore, it has been proposed that Zim1 is a recent addition to the mouse genome that was derived via a local duplication of Zim2. In mice, Zim1 is maternally imprinted and is only expressed during embryogenesis, notably in the limb bud and therefore it has been suggested as having a role in limb development . Limited and unannotated RNA expression information was available from additional studies in the mouse ; however, these did not address Zim1 expression in the developing face. We therefore selected Zim1 for experimental validation, as there was only limited knowledge of this gene and its function in mouse facial and muscle development.
Although well studied in craniofacial development, we also selected Hoxa2 (MGI:96174) for further analysis as its expression is not normally associated with branchial arch 1, which gives rise to the mandible. Indeed, Hoxa2 has a strong anterior limit of expression in the neural crest cells originating in rhombomere 4 that generate the mesenchyme of the second branchial arch. Moreover, the absence of Hox gene expression in more rostral tissues, including the first branchial arch, has been postulated to have enabled the evolution of the vertebrate head ,,,,,,. We therefore decided to explore this potential novel domain of Hoxa2 expression in more detail.
The third cluster explored was called the synapse cluster. All the nodes contributing to the synapse cluster are unambiguously implicated in neuromuscular signaling. However, two additional nodes (Ablim3 and Apobec2; MGI:2442582 and MGI:1343178 respectively) fail to fit neatly into any cluster, and instead appear to straddle the synapse interaction and muscle structure clusters. Ablim3 annotation includes both the GO Molecular Function term “actin binding” GO:0003779 as well as the KEGG annotation “Axon guidance” KEGG:mmu04360. However, the annotation associated with Apobec2 strongly indicates a role in RNA editing and processing, but gives no indication of a role in muscle (Table S2).
The Apobec2-associated literature revealed little consensus regarding its function. Apobec2 has been documented as an ancestral, cardiac and skeletal muscle-specific member of the Apobec family implicated in muscle regeneration . It has also been described as a ubiquitously expressed protein with cytidine deaminase RNA editing activity . Apobec2 knockout mice appear viable and fertile  but no examination of the tongue was reported. Apobec2 was selected for further biological investigation due to the sparse nature of current associated knowledge and its possible function in the tongue muscle development.
The above analysis generated hypotheses regarding the role of four genes (Apobec2, E430002G05Rik, Hoxa2, Zim1) in the development of the murine tongue. These hypotheses were tested by whole-mount in situ hybridizations to E11.5 and E12.5 mouse embryos, collected, prepared and hybridized as described in , stained with Hoxa2 , Apobec2, E430002G05Rik and Zim3 RNA probes, as described in . The mouse Apobec2 probe was derived by PCR from E10.5 FVB mouse head cDNA using the primers Apobec2F (5′-CCA GCC AGG CTT AGC TGC TGA CAG-3′) and Apobec2R (5′-GCT CAC CAG AAT GAG CAG ACG AAG-3′); the mouse E430002G05Rik probe was derived using the primers E43F (5′-GGT TTA TCA TCC AGT TGA GGT TTG G-3′) and E43R (5′-GCA GAC AGG TTG CTT TCC TGA-3′); the mouse Zim3 probe was derived using the primers Zim3F (5′-CGT ACA AGT GTG ACA AGT GC)-3′ and Zim3R (5′-GCA CAA ATG CTC CAA GTA GG-3′).
As shown in Figure 12 all four genes are expressed in the developing tongue at E12.5. Examination of the first arch tissue at E11.5 indicates that neither Apobec2 nor E430002G05Rik are expressed at this time-point, although the former gene is clearly expressed in the developing cardiac region (A, D). However, by E12.5 both genes are expressed in discrete regions of the developing tongue (B,C, E, F). At E11.5 Hoxa2 expression is prominent in the second arch tissue and there is clearly a sharp boundary of expression with the first arch (G). Nevertheless, weaker expression is apparent in the core of the first arch, and expression is again visible in the tongue at E12.5, presenting as bilateral stripes aligned with the anterioposterior axis (H, I). Zim1 expression is visible in the core mesenchyme of the first branchial arch at E11.5 and by E12.5 almost the entire tongue, with the exception of the ectoderm, is strongly stained.
This data confirms all four hypotheses; each of these genes is expressed in the developing tongue. The expression patterns for the four genes are different though, indicating that their function may not be directly related. The almost uniform expression of Zim1 at E12.5 suggests that it is marking the neural crest derived mesenchyme of the tongue that will give rise to the smooth muscle and connective tissue. Alternatively, it may identify the intrinsic skeletal musculature of the tongue. In contrast, the expression of the other three genes is consistent with their expression in different extrinsic tongue muscles that project out of the tongue and attach to surrounding skeletal components to allow tongue movement during swallowing and chewing.
The data we have obtained for the four genes we analyzed in detail, Apobec2, E430002G05Rik, Hoxa2, and Zim1, indicate that all four were indeed expressed in the developing mandible, specifically in the tongue. Further analysis will be required to determine if these genes have specific roles in tongue development and function, and if they act as specific markers for individual components of the intrinsic and extrinsic tongue musculature. Nevertheless, two observations are worth noting with respect to the expression patterns of Zim1 and Hoxa2. First, the expression of Zim1 in the tongue has considerable overlap with that of the linked Peg3 gene  and the expression profiles of these two genes are also very similar in the microarray dataset we have obtained. Unfortunately, data were not available on the linked Zim2 gene in our analysis  because the single probe set in the array did not generate a reliable signal. We hypothesize that the presence of Zim1 in our network is due to the importance of the linked Peg3 gene, and that the expression of Zim1 reflects its recent insertion next to the cis-regulatory sequences responsible for Peg3 expression.
It is worth noting that Peg3 was in the data network, but the scores of the arcs linking it to Zim1 were below the top 1000 threshold used in both the Average and Logit combined networks. Expanding the reporting component to provide an option to visualize all linkages passing a threshold correlation in the data network alone might have proven useful here. This example also provides a caution for analysts: genes can appear in the combination networks for many reasons, not all of which indicate a causal role in the phenomena under study.
With respect to Hoxa2, we do detect expression of this gene in the first branchial arch, although the level is considerably less than documented for the second arch. Whether this expression pattern marks cells intrinsic to the first arch migrating in from the second arch remains to be determined. We also note that Hoxa2 mutant mice have a number of craniofacial defects associated with the developing tongue. Specifically, in Hoxa2-null mice the tongue is not able to move appropriately during embryonic development and its abnormal location prevents closure of the secondary palate. Available data indicates that this is due to the absence of neural crest derived skeletal elements originating from the second branchial arch which function as the attachment sites for extrinsic tongue muscles ,. Our findings that Hoxa2 is also expressed in these extrinsic tongue muscles raises the possibility that the loss of Hoxa2 may directly cause tongue muscle defects leading to cleft palate.
Explaining the biological phenomena underlying complex, high-throughput datasets in light of existing knowledge is a critical step in the exploitation of powerful post-genomic instrumentation, as is generation of new, biologically significant hypotheses. This application of the Hanalyzer demonstrates that 3R systems have the potential to facilitate these analyses, making apparently overwhelming amounts of background knowledge particularly useful for analysts, accelerating the pace of biomedical discovery.
Inference to the best explanation (sometimes called abduction) is a complex task that can involve many other forms of reasoning. Although related to determination of causes, explanations can involve non-causal factors as well, and not all causal factors may be important in a particular explanation . One particularly important sort of explanation in biomedicine is the contrastive explanation (why this rather than that), which is well suited to the carefully controlled experimental methodology that underlies biomedical research. The system described here does not automate the production of explanations (nor hypotheses), but provides a novel class of software support for human users who are doing so.
The version of the Hanalyzer described here built a background network for only mouse genes, and the data network was constructed from a particularly well-powered time and tissue gene expression array series. One important question is how well this methodology will generalize to other organisms and data types.  demonstrated many of the experts that compose this system can be used to build knowledge networks for other model organisms, including yeast, worm, and fly. Ongoing work involves building knowledge networks for human and rat as well. As many of the same types of experts are available for each of these organisms, expansion of the knowledge networks to other organisms is a straightforward software engineering task.
One important issue in generalizing to other organisms is the question of how to build homology-based experts. As observed in ,,, there are many factors that are involved in the successful application of homology-based protein-protein interaction networks. Such predictions are even more difficult and uncertain in humans and other eukaryotes, although promising methods that could form the basis for such experts have been published recently, e.g.  and .
The natural language processing aspects of the system increase its performance over database integration systems alone . The OpenDMAP approach used is state of the art , but there is much room for improvement. One particularly important area for future work is in multi-organism gene normalization.
While there is a great deal of data in the form of expression arrays over which gene correlations can be made, there are many other sources of high-throughput information that could be profitably analyzed using 3R methodology. Construction of quantitative data networks with genes or gene products as fiducials could be based on data produced by many high throughput experimental techniques, including proteomics, miRNA assays, genotyping, and others. What are the best methods for generating such data networks, and are there differences in the types of knowledge networks (experts) that are best suited to analyzing them?
Another issue regards the inherently changing nature of biomedical knowledge. Experts can be re-run periodically to keep the knowledge networks up to date, but a variety of open research questions about handling time remain: Is it valuable to highlight more recent results for annotators? How should temporal considerations factor into the reliability calculations? Should reasoning experts take temporal considerations into account? How?
While the experts used in the Hanalyzer proved to be useful for analysts, there are a large number of potential experts, both reading (external) and reasoning that could be included in a 3R system. What is the optimal set of experts to use for building knowledge networks? Does that differ for different applications? The Noisy-OR combination method (and most others) assumes that the experts are independent of each other, yet many potentially useful sources of knowledge exhibit complex dependencies; should selection of experts be made in light of this constraint? Many sorts of inference, ranging from logical entailment to information theoretic, statistical or heuristic might be productively included in a 3R system; what is the optimal set of reasoning experts to use?
Finally, while the example of biological validation of several hypotheses generated through the use of the system provides some evidence that the system is of genuine value to biological data analysts, the question of how best to evaluate 3R systems remains open. Perhaps the “insight-based” evaluation methodology previously described for scientific visualization systems , could be modified to evaluate 3R systems as well.
The Hanalyzer, including the experts and the Cytoscape plugin for visualization is available as open source software via SourceForge at hanalyzer.sourceforge.net. The extracted assertions from the OpenDMAP text mining experts are available as supplementary materials associated with ; the links from the ACF expert are available as supplementary materials associated with .
Annotation terms associated with nodes within the Average network.
(0.08 MB DOC)
Annotation terms associated with those nodes added via the Logit asserted edges.
(0.09 MB DOC)
Thanks to Filippo Rijli for stimulating discussion regarding Hoxa2 null mice, Michael Bada for assistance with the OWL import and export features, and Mark Geraci for the role of the array core he directs in generating the original expression data.
The authors have declared that no competing interests exist.
This work was funded by the US National Institutes of Health, specifically grant R01DE15191 to RAS, and grants R01LM008111, R01LM009254 and R01GM083649 to LEH. The institutional training grant T15LM009451 supported SML and RPS, and HT was supported by the AstraZeneca-Fulbright Scholarship. No sponsor played any role in the preparation, review, or approval of the manuscript.