We demonstrate here that genes can be rationally associated with plant traits through guilt-by-association in a gene network. For this purpose, we created AraNet, a genome-wide gene network for A. thaliana, a reference organism for flowering plants, including many crops. AraNet is the most extensive gene network for any plant thus far; gene annotations derived by network guilt-by-association extend substantially beyond current gene annotations. We validated the network’s predictive power by cross-validation tests, independent pathway and phenotype datasets, cell-specific expression datasets, and by experiments on computationally selected candidate genes.
AraNet generates at least two main types of testable hypotheses. The first type uses a set of genes known to be involved in a specific process as bait to find new genes involved in that process. This test is useful if the bait genes are well connected to one another in the network (i.e., high AUC). We used the set of genes conferring seed pigmentation defects (AUC = 0.68) as bait and found a 10-fold enrichment in identifying mutants with comparable phenotypes. Of the 318 GO biological processes with ≥5 genes, ~43% have AUCs of at least 0.68 (Supplementary Table 14), suggesting that AraNet will be useful in identifying new genes in nearly half of these biological processes. In practice, this translates into identifying a small set of new genes from a relatively limited-scale screen of the top network-predicted candidates (e.g., computer simulations suggest finding an average of 4-7 novel genes from tests of the top 200 candidates for biological processes with AUC >0.6; Supplementary Fig. 12). The second type of hypothesis involves predicting functions for uncharacterized genes. We assayed predicted phenotypes for three uncharacterized genes, two of which showed phenotypes in the predicted processes, response to drought and meristem development. There are 4,479 uncharacterized genes in AraNet (30% of protein-coding genes) with links to characterized genes, suggesting broad utility for AraNet in identifying candidate functions. Both of these modes of operation can be easily performed on the AraNet website.
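The bait-based mode described above can be illustrated with a minimal sketch. It assumes a weighted, undirected network and scores each gene by the summed weight of its links to bait genes; the gene names, edge weights, and function names are hypothetical, and the scoring rule is an illustrative assumption rather than a statement of AraNet's actual implementation. Because a gene never links to itself, each bait gene is scored only by the other bait genes, so the resulting AUC behaves like a leave-one-out estimate of how well the bait set is connected.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected weighted adjacency map from (gene_a, gene_b, weight) triples."""
    net = defaultdict(dict)
    for a, b, w in edges:
        net[a][b] = w
        net[b][a] = w
    return net


def gba_scores(net, bait):
    """Score each gene by the summed weight of its links to bait genes.

    Assumption: summed edge weight is the guilt-by-association score.
    """
    bait = set(bait)
    return {
        gene: sum(w for nbr, w in nbrs.items() if nbr in bait)
        for gene, nbrs in net.items()
    }


def auc(scores, bait):
    """Probability that a random bait gene outscores a random non-bait gene (ties count 0.5)."""
    bait = set(bait)
    pos = [s for g, s in scores.items() if g in bait]
    neg = [s for g, s in scores.items() if g not in bait]
    if not pos or not neg:
        raise ValueError("need at least one bait and one non-bait gene in the network")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_candidates(scores, bait, k):
    """Return up to k non-bait genes with positive scores, best first."""
    bait = set(bait)
    ranked = sorted(
        ((g, s) for g, s in scores.items() if g not in bait and s > 0),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:k]


# Hypothetical toy network; identifiers and weights are made up for illustration.
edges = [
    ("G1", "G2", 2.0), ("G2", "G3", 1.5), ("G1", "G3", 1.0),
    ("G3", "G4", 1.2), ("G4", "G5", 0.5), ("G5", "G6", 0.8),
]
bait = {"G1", "G2", "G3"}
net = build_network(edges)
scores = gba_scores(net, bait)
bait_auc = auc(scores, bait)
candidates = top_candidates(scores, bait, k=2)
```

In this toy case the three bait genes are tightly interlinked, so they all outrank the non-bait genes, and `G4` is the only candidate with a direct link to the bait set. A bait set with a low AUC would, by the same logic, yield a candidate list little better than random.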
While AraNet currently shows high accuracy for many processes (Figures -), there are nonetheless specific processes that are poorly represented, with this trend stronger among plant-specific processes. This trend manifested in our experimental validation of only 2 of 3 tested candidate genes, although these intentionally represented challenging cases lacking any current functional annotation and for which sequence homology approaches had failed. While we observed that non-plant-derived datasets helped identify genes for plant-specific processes, it is clear that more plant datasets will strongly enhance the utility of gene networks for finding trait-relevant genes.
Three major causes underlie such cases of poor predictive performance. First, our current knowledge of genetic factors for a process may be so sparse that AraNet cannot link them efficiently. Second, AraNet may lack linkages or data relevant to the poorly predicted processes. These two causes likely explain the lower performance among plant-specific processes relative to more broadly studied, evolutionarily conserved processes. Additional plant-specific datasets, e.g., protein interactions, should help here, as should considering both indirect and direct network linkages for ranking candidates. Third, strongly implicated candidate genes that nonetheless test negative for a trait, and so appear to be false positives, might be masked by epistatic effects; such cases would actually represent true predictions paired with false-negative assay results. This situation may be reasonably common and has been previously observed in yeast [46].
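The suggestion above of ranking candidates by both direct and indirect linkages can be sketched as follows. This is one possible formulation under stated assumptions, not a method described in this work: a two-hop path from a gene to a bait gene through a non-bait intermediate contributes the weaker of its two edge weights, discounted by a hypothetical factor `alpha`. The network, gene names, and function name are illustrative only.

```python
def combined_scores(net, bait, alpha=0.5):
    """Direct links to bait genes plus alpha-discounted two-hop links.

    Assumptions: `net` maps gene -> {neighbor: weight}; a two-hop path through
    a non-bait intermediate is scored by the smaller of its two edge weights.
    """
    bait = set(bait)
    combined = {}
    for gene, nbrs in net.items():
        # Direct evidence: edges from this gene straight to a bait gene.
        direct = sum(w for nbr, w in nbrs.items() if nbr in bait)
        # Indirect evidence: gene -> non-bait intermediate -> bait gene.
        indirect = 0.0
        for mid, w1 in nbrs.items():
            if mid in bait:
                continue
            for target, w2 in net[mid].items():
                if target in bait and target != gene:
                    indirect += min(w1, w2)
        combined[gene] = direct + alpha * indirect
    return combined


# Hypothetical chain G1 - G2 - G3, with G1 as the only bait gene.
toy_net = {
    "G1": {"G2": 2.0},
    "G2": {"G1": 2.0, "G3": 1.0},
    "G3": {"G2": 1.0},
}
toy_scores = combined_scores(toy_net, {"G1"}, alpha=0.5)
```

Here `G3` has no direct link to the bait gene and would receive a score of zero from direct linkages alone; the two-hop term gives it a small positive score through `G2`, which is the kind of case where sparse direct evidence might otherwise hide a candidate.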
AraNet represents a major step towards the goal of computationally identifying gene-trait associations in plants. This work suggests that gene networks for food and energy crops will be important enablers for enhanced manipulation of traits of economic importance and crop genetic engineering.