Understanding how a metazoan organism functions requires knowledge of the biochemical, cellular, and overall phenotypic effects of all genes. Despite considerable effort, direct experimental evidence supporting the participation of genes in biological process(es) exists for only a modest proportion of the full complement of metazoan genes (as reflected by Gene Ontology (GO) annotations [1]; see Materials and methods section for details). For instance, of the nearly 29K (K = 1,000) genes in mouse, there is experimental evidence supporting the functional annotation of less than half, or approximately 12K genes. Similarly, for Caenorhabditis elegans, experimental evidence exists for about a third (approximately 7.5K) of its approximately 20K genes. Even the most experimentally amenable and well-characterized eukaryotic organism, Saccharomyces cerevisiae, though not a metazoan, still has over 1K of its 6K genes lacking functional annotation [2].
New and improving genome-scale technologies, both synthetic and analytic, can help determine the biological process(es) of unannotated genes and provide new insight into annotated genes. These approaches include yeast two-hybrid (Y2H) screens to detect physically interacting proteins, expression profiling to detect transcript coexpression, modifier screens to identify genetic interactions, RNA interference screens to measure the effects of gene knockdowns, genome tiling path arrays and next-generation sequencing to discover transcribed genomic elements, and ChIP-chip and ChIP-seq to identify protein-DNA interactions. Although these assays have the advantage of being high-throughput, distinguishing biologically relevant relationships from noise within a single experiment is not straightforward. This difficulty, together with the sheer volume of data, makes interpretation challenging.
Methods to derive functional annotation from the available corpora of data have been developed [3,4], and those that focus on data integration are among the more successful [5-9]. Integrating different types of genomics data has been shown to reveal relationships between genes that are not distinguishable within single datasets [10,11]. In the context of genomics data, the overarching aim of an integrative model is to distill the available data down to a value indicative of a gene pair being functionally related. These methods, pioneered by Troyanskaya et al. [5], Jansen et al. [8], and Lee et al. [12], were heavily based on Bayesian networks to bring together weighted gene-gene relationships across heterogeneous datasets. Here, inspired by this previous work, a functional relationship between genes represents the likelihood that two genes are involved in the same biological process. Integrative models have been successfully used to construct molecular networks (that is, transcriptional regulatory and metabolic networks) [13,14], predict genetic interactions in yeast [15], predict phenotypic effects in worm [16], provide new gene candidates in human disease [17-20], and make novel predictions of gene function [6,12,21-27]. The number of organisms with well-annotated genomes and sufficient experimental data to build integrated networks is limited. Thus, networks constructed from genome-wide data have been restricted to bacteria [14,25], S. cerevisiae [5,12,26,28,29], C. elegans [16,30], mouse [31,32], and human [18-20,27].
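To make concrete the idea of distilling heterogeneous evidence into a single value for a gene pair, the following minimal sketch combines per-dataset likelihood ratios under a naive Bayes assumption of conditional independence. The dataset names, likelihood ratios, and prior are hypothetical placeholders rather than values from any of the cited studies, and the cited methods differ in how they weight datasets and handle dependence between them.

```python
import math

# Hypothetical likelihood ratios for each evidence type:
# P(evidence | same process) / P(evidence | different process).
LIKELIHOOD_RATIOS = {
    "y2h_interaction": 8.0,
    "coexpression": 3.0,
    "genetic_interaction": 5.0,
}


def posterior_probability(observed, prior=0.01, ratios=LIKELIHOOD_RATIOS):
    """Combine evidence for one gene pair, assuming the datasets are
    conditionally independent (a naive Bayes simplification)."""
    # Start from the prior odds that two random genes share a process.
    log_odds = math.log(prior / (1.0 - prior))
    # Each observed evidence type multiplies the odds by its likelihood ratio.
    for name in observed:
        log_odds += math.log(ratios[name])
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


single = posterior_probability(["coexpression"])
combined = posterior_probability(["coexpression", "y2h_interaction"])
print(f"coexpression only:  {single:.3f}")
print(f"coexpression + Y2H: {combined:.3f}")
```

In this toy setting, a pair supported by two evidence types scores higher than a pair supported by either alone, which is the behavior that lets integration surface relationships that a single dataset cannot distinguish from noise.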
Drosophila is among the best-annotated organisms, and the amount of experimental and computational data for it is on par with that for worm, yeast, and mouse [33,34]. Although there exist repositories for flies that provide sophisticated query capability, namely FlyBase [35] and FlyMine [36], as well as ongoing attempts at mining disparate sources of fly data [21,37,38], an integrated system that can be interrogated ad hoc to easily handle large sets of Drosophila genes has not been available until now.
As one of the preeminent model organisms, Drosophila has been the object of study for more than a century [39]. This research has not only increased our understanding of the organism itself [40,41] but, more importantly, has increased our knowledge of molecular mechanisms in biology in its broadest sense, particularly in the fields of genetics, development, evolution, and molecular biology. Drosophila has the richest set of sequenced genomes for a metazoan genus [42,43] and, along with C. elegans and human, will have the most comprehensive inventory of metazoan genomic elements stemming from the modENCODE [44] and ENCODE [45] projects. Despite these resources, there exist many genes for which biological process(es) are unknown. At the time of this study (v5.3 of the D. melanogaster genome [46]), there is direct experimental evidence supporting the biological process GO annotations (hereafter referred to as GO:BP) for less than half (approximately 42%) of the more than 15K protein-coding genes (counted from curator-reviewed GO evidence codes). These annotations are mostly based on genetic evidence (that is, mutant phenotypes, genetic interactions, and RNA interference knockdown phenotypes). In addition to experimental evidence, roughly 26% of the genes have GO:BP terms inferred from electronic annotation (the IEA GO evidence code). Considering all the available methods to determine in which biological process(es) a gene participates, nearly one-third of Drosophila protein-coding genes (> 4.6K) remain unannotated.
In this study, we bring together experimental data to build the first integrated functional gene networks in Drosophila. We focus specifically on building functional relationships between pairs of genes that are likely to participate in the same biological process and are supported by experimental evidence. We adapt the approach developed by Marcotte and colleagues [12,16,28] to integrate three classes of experimental data: genetic interactions, protein-protein interactions, and microarray gene expression. We demonstrate that the integrated networks perform well at recapitulating known functional relationships and outperform networks built exclusively from individual types of data (for example, microarray data alone). We then use the functional relationships in the network to predict GO:BP annotations for unannotated genes using the Markov random field (MRF) method [47], and demonstrate through tenfold cross-validation that this approach performs well at predicting annotations. We use this method to infer high-confidence GO:BP terms for 483 uncharacterized genes, and evaluate these predictions with respect to the available independent evidence. Finally, we use the constructed network to reanalyze gene expression data related to nutritional deprivation. We show that the network can be used to discover clusters of functionally related genes among genes that were identified as differentially expressed.
All data are made available through supplemental material [48].
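The annotation-prediction step in this study uses an MRF method [47]; the sketch below is not that method. It substitutes a much simpler neighbor-weighted vote, purely to illustrate how edge weights in a functional network can transfer a GO:BP term from annotated genes to an unannotated one. The gene names, edge weights, and annotations are invented for illustration.

```python
from collections import defaultdict

# Hypothetical weighted network: (gene_a, gene_b) -> functional-relationship score.
EDGES = {
    ("geneA", "geneB"): 2.5,
    ("geneA", "geneC"): 1.0,
    ("geneA", "geneD"): 0.5,
}

# Hypothetical known annotations; geneA is treated as unannotated.
ANNOTATIONS = {
    "geneB": {"response to starvation"},
    "geneC": {"response to starvation", "lipid metabolic process"},
    "geneD": {"cell cycle"},
}


def score_terms(gene, edges=EDGES, annotations=ANNOTATIONS):
    """Score each candidate term by its share of the total edge weight
    connecting `gene` to its annotated neighbors (simple weighted vote)."""
    votes = defaultdict(float)
    total = 0.0
    for (a, b), weight in edges.items():
        if gene not in (a, b):
            continue
        neighbor = b if a == gene else a
        terms = annotations.get(neighbor)
        if not terms:
            continue  # unannotated neighbors cannot vote
        total += weight
        for term in terms:
            votes[term] += weight
    if total == 0.0:
        return {}
    return {term: w / total for term, w in votes.items()}


scores = score_terms("geneA")
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.2f}")
```

A vote of this kind considers only direct neighbors. An MRF formulation instead models the labels of all genes jointly, so evidence can propagate through unannotated intermediates, which is one reason to prefer it over this simplification.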