Previous yeast network reconstructions focused on a more limited number of genes in order to make the reconstruction tractable24,28,35. Our integrative analysis combined large-scale genotype, gene expression, PPI, and TFBS data to construct networks comprising more than 50% of the genes in the yeast genome using a novel Bayesian network reconstruction method. The relative utility of the resulting networks was highlighted by their ability to predict responses to independent experimental perturbations and the known biology of the system. Specifically, we demonstrated that networks constructed by incorporating genetic, TFBS, and PPI data were more predictive than a network constructed from expression data alone. Further, our method of integrating diverse data was also shown to predict novel interactions, which in turn led to the identification of genes that are not well annotated but that nevertheless serve as causal regulators for eQTL hot spots.
The modules emerging from the coexpression network were shown to elucidate the functional relevance of the different components of the network. Of particular interest is our finding that the coexpression network overlapped poorly with the PPI network, suggesting that the PPI and coexpression data reflect complementary views of the system, that the PPI data generated via high-throughput experiments are not very specific19, or a combination of the two. We were able to identify structures in the PPI network that overlapped well with the coexpression network only after we performed a clique community analysis on the PPI network to define the core, highly interconnected substructures of this network. Through this analysis, we found that the overlapping structures were enriched for stable protein complexes, likely explaining the good correlation between the PPI clique communities and corresponding coexpression network modules.
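To make the clique community idea concrete, the following is a minimal sketch on a toy graph. It assumes k-clique percolation (communities formed by k-cliques that share k-1 nodes) as the community definition and uses a Jaccard index to score overlap with coexpression modules; neither choice is stated in the text above. The function names, protein labels, and module contents are all hypothetical and for illustration only.

```python
from itertools import combinations


def clique_percolation(edges, k=3):
    """Return k-clique communities of an undirected graph (brute force, toy scale)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Enumerate every k-clique by checking all k-node subsets.
    cliques = [
        frozenset(nodes)
        for nodes in combinations(sorted(adjacency), k)
        if all(b in adjacency[a] for a, b in combinations(nodes, 2))
    ]

    # Union-find: merge cliques that share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)


def best_overlap(community, modules):
    """Return the module with the highest Jaccard index against a community."""
    scores = {
        name: len(community & genes) / len(community | genes)
        for name, genes in modules.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical PPI graph: two dense "complexes" plus sparse, possibly noisy edges.
complex_a = ["A1", "A2", "A3", "A4"]
complex_b = ["B1", "B2", "B3", "B4"]
ppi_edges = list(combinations(complex_a, 2)) + list(combinations(complex_b, 2))
ppi_edges += [("A4", "B1"), ("A1", "X1"), ("B3", "X2")]

# Hypothetical coexpression modules.
modules = {
    "module_1": {"A1", "A2", "A3", "A4", "X9"},
    "module_2": {"B1", "B2", "B3", "B4"},
}

communities = clique_percolation(ppi_edges, k=3)
for community in communities:
    print(sorted(community), best_overlap(community, modules))
```

In this toy case the sparse edges (A4 to B1, and the pendant nodes X1 and X2) belong to no triangle, so they drop out, and only the two densely connected cores remain to be compared with the modules.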
The gold standard for assessing the predictive power of any network model is prospective validation of predictions made from that model. We queried the different Bayesian networks constructed using progressively more data (BNraw, and BNfull) to predict the causal regulators of the subnetworks enriched for genes linked to the different eQTL hot spots in the BXR cross16. BNfull was demonstrated to be the most predictive network, and five of the novel predictions made using BNfull were prospectively tested experimentally. All five predictions were validated, confirming the predictive power of the integrated network to elucidate the regulatory control of some of the subnetworks. These results are also consistent with a large-scale simulation study we conducted to assess the extent to which genetic information could improve the accuracy of Bayesian networks based on gene expression data in a segregating population27.
The integrative reconstructions carried out in our study represent only the first steps toward constructing large-scale, accurate whole-genome networks. A number of important limitations will need to be addressed to further enhance the accuracy of this type of network. First, the Bayesian network algorithm employed in this study does not permit loops, making it difficult to represent some types of feedback, which is an important control mechanism in any biological system. Second, Bayesian networks do not effectively represent time-series data36. These issues might be addressed by using dynamic Bayesian networks, which explicitly include a temporal representation of the interaction between nodes. Third, we reconstructed networks from a limited amount of data generated from a single population and under only a single biological condition. Because the connectivity structure of a network can vary as a function of genetic background and environment, populations representing different genetic backgrounds in different environmental contexts will have to be studied to assess the impact on network structure. However, these and other issues notwithstanding, our results support the view that the construction of large-scale whole-genome networks based on genetic, gene expression, TFBS, PPI and related types of large-scale data can lead to networks that are capable of predicting complex system behavior.
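The first two limitations can be illustrated with a small sketch. A static Bayesian network must be a directed acyclic graph, so a feedback loop between two genes cannot be encoded directly. Unrolling the same edges across time slices, as a dynamic Bayesian network does, yields an acyclic graph. The gene names, the helper functions, and the choice of three time slices are assumptions made for illustration and are not taken from the study.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """Return True if the directed edges (parent, child) form a DAG."""
    sorter = TopologicalSorter()
    for parent, child in edges:
        sorter.add(child, parent)  # the child node depends on its parent
    try:
        sorter.prepare()  # raises CycleError if any cycle exists
        return True
    except CycleError:
        return False


def unroll(edges, n_slices):
    """Copy each edge so that it points from time slice t to slice t + 1."""
    return [
        ((parent, t), (child, t + 1))
        for t in range(n_slices - 1)
        for parent, child in edges
    ]


# Hypothetical feedback loop: a regulator activates a target that feeds back on it.
feedback = [("regulator", "target"), ("target", "regulator")]
unrolled = unroll(feedback, n_slices=3)

print(is_acyclic(feedback))  # the static loop is not a valid DAG
print(is_acyclic(unrolled))  # the time-unrolled version is
```

The trade-off is that the unrolled representation needs expression measurements at multiple time points, which is why it also bears on the time-series limitation.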