As high-throughput experiments become increasingly common, biologists face substantial challenges in effectively leveraging genome-scale data from diverse organisms to inform new hypotheses. Experimental data coverage for an organism can be sparse, and prior functional knowledge (i.e. low-throughput experiments validating a gene’s function) can be notably limited. These impediments affect the breadth and accuracy of bioinformatic methods (e.g. machine-learning algorithms) that apply prior knowledge in learning novel biology. As a consequence, the applicability of these methods is often limited to biological processes and pathways that are already well characterized for an organism.
For example, a common challenge for biological researchers is interpreting the results of a genome-wide experiment (e.g. a list of candidate genes from a microarray experiment) and generating hypotheses for experimental follow-up. There are several effective resources, some network based, for researchers to analyze their gene sets (1–6). These resources cover a wide range of organisms and address different needs of biologists by applying a variety of methods: from pathway enrichment analysis of a gene list to machine-learning algorithms that predict a gene’s function. All these resources’ methods require known examples (i.e. pathways with at least a few annotated genes) in an organism. Consequently, the effectiveness of these applications is constrained by the extent of prior knowledge and available experimental data in the queried organism.
Other resources address the problem of disparate data coverage among organisms by focusing on methods to transfer high-throughput data (e.g. microarray and physical interaction experiments) between organisms (7, 8). However, these efforts are limited to learning gene association networks, and none of them solves the problem of making accurate functional predictions and associations for biological processes that have not been well studied in a given organism. For example, most of the genes known to be involved in neuromuscular process have been discovered in mouse [65 known genes according to the Gene Ontology (9)]. Relatively few genes are definitively known in mammalian systems outside of this model organism. Consequently, many existing methods cannot assign genes to that biological process in rat (where only one such gene is experimentally annotated), and a biologist using a rat model system with existing resources cannot leverage the known biology in mouse. Biologists need a technology that allows for the systematic application of prior functional knowledge from other organisms to their organism of study, at multiple points in an analytic workflow: from interpreting experimental results to generating hypotheses for functional assays. Integrative multi-species prediction (IMP) is an interactive web server designed to meet this burgeoning need.
IMP is an exploratory tool that, in addition to providing a high-quality interface for functional interrogation, addresses several specific challenges that biologists encounter and that benefit from integrating cross-organism biological knowledge. First, although biologists can interpret their experimental results in the context of functional networks, other servers do not adequately support this task because of their limited workflow support and the incomplete prior knowledge in an organism. With IMP, biologists can save their custom gene sets and overlay their genes on functional networks, expanding or focusing their gene list by mining functional relationships within the networks. IMP can integrate cross-organism knowledge with a method that goes beyond the standard Basic Local Alignment Search Tool (10) search by identifying enriched biological processes among the genes, using gene-pathway annotations from the queried organism and annotations from other organisms mapped by functional analogy (11). In this way, pathways that are better characterized in a different organism will be included in an enrichment analysis, facilitating biological connections that would otherwise be hindered by limited functional knowledge. Moreover, no existing server provides a way to interactively examine putative functions and gene–gene interactions in functional networks across organisms. With IMP, biologists can compare functional contexts and interpret the behavior of their gene sets across organisms using flexible and interactive visualizations.
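The idea of running an enrichment analysis over both native and cross-organism annotations can be illustrated with a minimal sketch. It is not IMP's implementation: the hypergeometric test, the function names (`merged_annotations`, `enrich`), the gene identifiers and the one-to-one analog map are all assumptions made for illustration only.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def merged_annotations(native, foreign, analog_map):
    """Union native annotations with foreign ones mapped through functional analogs.

    analog_map is a hypothetical dict: foreign gene -> gene in the queried organism.
    """
    merged = {term: set(genes) for term, genes in native.items()}
    for term, genes in foreign.items():
        mapped = {analog_map[g] for g in genes if g in analog_map}
        merged.setdefault(term, set()).update(mapped)
    return merged


def enrich(query, annotations, background):
    """Return term -> p-value for every term that overlaps the query gene set."""
    query = set(query) & background
    results = {}
    for term, genes in annotations.items():
        genes = genes & background
        overlap = len(query & genes)
        if overlap:
            results[term] = hypergeom_pvalue(
                overlap, len(query), len(genes), len(background)
            )
    return results


# Toy data (entirely made up): 20 rat genes, one native annotation,
# four mouse annotations mapped to rat by a hypothetical analog map.
background = {f"r{i}" for i in range(20)}
native = {"neuromuscular process": {"r0"}}
foreign = {"neuromuscular process": {"m0", "m1", "m2", "m3"}}
analog_map = {f"m{i}": f"r{i}" for i in range(4)}
query = {"r1", "r2", "r5"}

native_only = enrich(query, native, background)
combined = enrich(query, merged_annotations(native, foreign, analog_map), background)
```

In this toy case the query shares no gene with the single native annotation, so the process is absent from `native_only`; once the mapped annotations are merged in, two of the three query genes overlap the term and it appears in `combined` with a p-value.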
Finally, the results from a genome-wide experiment can elucidate a biological question but are often inconclusive and require experimental follow-up. Computational predictions of gene function can guide subsequent experiments. However, accurate assignments have previously been limited to pathways and processes that are already well characterized in an organism, as such information is necessary for training examples. Transferring functional annotations between homologous genes is a common method to improve coverage for a studied pathway and to generate hypotheses for functional assays, but high-quality transfers have historically been limited to smaller-scale manual curation efforts (12). IMP systematically identifies functionally similar homologs, using state-of-the-art homolog identification methods that use genomic data compendia (11) to transfer pathway annotations between organisms for learning. This allows for accurate gene-process predictions, even for processes that have few experimental annotations in an organism.
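As a rough illustration of how transferred annotations can enlarge a training set, the sketch below copies a process annotation from a source-organism gene to its homolog in the target organism whenever a functional-similarity score passes a threshold. The score values, the `min_score` cutoff, the gene identifiers and both function names are hypothetical; the text does not specify how IMP scores or filters homologs.

```python
def transfer_annotations(source_annotations, homolog_scores, min_score=0.8):
    """Copy annotations across organisms for sufficiently similar homolog pairs.

    source_annotations: term -> set of source-organism genes
    homolog_scores: (source_gene, target_gene) -> assumed functional-similarity score
    """
    transferred = {}
    for (src, tgt), score in homolog_scores.items():
        if score < min_score:
            continue  # skip pairs that are not functionally similar enough
        for term, genes in source_annotations.items():
            if src in genes:
                transferred.setdefault(term, set()).add(tgt)
    return transferred


def training_positives(native, transferred, term):
    """Positive examples for one process: native plus transferred annotations."""
    return native.get(term, set()) | transferred.get(term, set())


# Toy data (made up): three mouse genes annotated, one rat gene annotated.
mouse = {"neuromuscular process": {"m1", "m2", "m3"}}
rat = {"neuromuscular process": {"r9"}}
scores = {("m1", "r1"): 0.95, ("m2", "r2"): 0.85, ("m3", "r3"): 0.40}

transferred = transfer_annotations(mouse, scores)
positives = training_positives(rat, transferred, "neuromuscular process")
```

Here the single native rat annotation is joined by two transferred ones, while the low-scoring pair is left out; a classifier for the process would then have three positive examples rather than one.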