|Home | About | Journals | Submit | Contact Us | Français|
A major goal of biology is the construction of networks that predict complex system behavior. We combine multiple types of molecular data, including genotypic, expression, transcription factor binding site (TFBS), and protein-protein interaction (PPI) data previously generated from a number of yeast experiments in order to reconstruct causal gene networks. Networks based on different types of data are compared using metrics devised to assess the predictive power of a network. A network reconstructed by integrating genotypic, TFBS and PPI data is shown to be the most predictive. This network is used to predict causal regulators responsible for hot spots of gene expression activity in a segregating yeast population. The network is also shown to elucidate the mechanisms by which causal regulators give rise to larger-scale changes in gene expression activity. Predictions are prospectively validated to provide direct experimental evidence that predictive networks can be constructed by integrating multiple, appropriate data types.
Large-scale genetic, transcriptomic, proteomic, and metabolomic data sets have enabled researchers to decipher the biological function of individual genes, pathways, and, more generally, biological networks that drive complex phenotypes. However, the progress towards uncovering the mechanisms by which these genes lead to complex phenotypes has progressed at a slower rate. More recently, significant progress has been made integrating multiple sources of data sampled from human and experimental populations to reconstruct networks that are predictive of complex phenotypes. A number of studies in a variety of species have demonstrated that predictive networks can be built by leveraging the naturally occurring DNA variations to determine how such variation influences gene expression and other molecular phenotypes. By examining the effects of naturally occurring DNA variation on gene expression in segregating populations, other phenotypes can be examined with respect to these same DNA variations and ordered relative to the genes to infer causality1-4. Network reconstructions based on protein-protein interactions 5, metabolomic6, and literature data 7 are also now becoming more routine. The common theme among these reconstruction efforts is the organization of vast amounts of molecular data into networks that capture fundamental properties of complex systems in states that give rise to complex phenotypes.
While advances in the application of network reconstruction algorithms to high-dimensional biological data are being applied to a number of distinct data types, such as protein-protein interactions 5, metabolomic6, and published gene-gene relationships7, no systematic studies have exploited the advantages that can be gained by combining genotypic, gene expression, protein-protein interaction, and DNA-protein binding data to reconstruct whole gene networks. While comprehensive forms of these types of data are very rare at present, yeast is one such system in which all of these data types currently exist. In this present study, we apply recent advances in coexpression and Bayesian network reconstruction methods to large-scale yeast data sets in order to create yeast gene networks capable of predicting complex system behavior. Specifically, we combine multiple types of large-scale molecular data, including genotypic, gene expression, TFBS, and PPI data that were previously generated from a number of yeast experiments to reconstruct causal, probabilistic gene networks. We demonstrate that the integration of these diverse data types in the context of a genetic cross8 enhances the predictive power of the resulting gene networks beyond what could be achieved by reconstructing gene networks based on expression data alone. We show that the functional subnetworks represented in the more integrated networks are significantly enriched for genes under the control of common genetic loci (i.e., eQTL hot spots9). We also demonstrate that such networks can lead directly to the identification of the causal regulators for these different functional subnetworks, with many of the predictions providing a putative mechanistic understanding of how the causal regulators give rise to larger-scale changes in gene expression activity. Finally, we prospectively test and experimentally validate a number of these predictions, providing direct experimental evidence that predictive networks can be constructed via the integration of multiple, appropriate data types.
We assembled genotypic and expression data from 112 segregants obtained from a yeast cross between the BY and RM strains of S. cerevisiae (referred to here as the BXR cross)8. Of the 5,740 genes represented on the microarrays used in this study, 5,180 were supported as having been sampled from a normal distribution, thus satisfying important assumptions around the use of the Pearson correlation statistic to reconstruct coexpression networks. From this set, we selected 3,662 informative genes for constructing Bayesian networks for further network analysis (see Supplementary Methods for details). We also gathered transcription factor binding site (TFBS) data derived from multiple sources10,11 and protein-protein interaction (PPI) data by combining the SGD and DIP yeast databases to improve the structured priors for the Bayesian network reconstructions, as detailed below. From these data sources, we constructed four networks: one co-expression network and three different Bayesian networks – each of which took advantage of increasing amounts of data. Coexpression networks represent correlations among expression traits, while probabilistic, causal Bayesian networks based on integrative methods we have developed previously4,12 represent causal relationships among genes. Importantly, we test the predictive power of different Bayesian networks constructed from increasing amounts of data by comparing predictions made from the networks to experiments that were independent of the experiments used to construct the networks as well as to experiments we performed prospectively to specifically test the predictions.
We reconstructed the coexpression network for the set of 3,662 informative genes identified in the BXR cross using a previously described weighted coexpression network algorithm13. A number of studies have demonstrated previously that coexpression networks are both scale free and modular2,14, thus highlighting functional components of the network that are often associated with specific biological processes. Therefore, to identify modules comprised of highly interconnected expression traits within the coexpression network, we examined the topological overlap matrix15 associated with this network. Supplementary Figure 1 depicts a hierarchically clustered topological overlap map in which the most highly interconnected modules in the network are readily identified. To identify gene modules (subnetworks) formally from the topological overlap map, we employed a previously described algorithm that ensures genes in any given module are maximally interconnected relative to genes in other modules (see Supplementary Methods for details)2.
From the BXR topological overlap map depicted in Supplementary Figure 1, 16 modules were identified. We tested each of these modules for gene enrichment using the yeast gene ontology (GO) categories for biological processes, molecular functions, and cellular components. Table 1 lists the most significantly enriched GO category for each module. Thirteen (13) modules were significantly enriched for at least one GO category, supporting that the BXR coexpression network is organized into functional units. For example, 20 of the 149 genes annotated as belonging to the lipid metabolism GO biological process category fell into the black module comprised of only 44 genes (Fisher Exact Test P value = 3.31×10−17).
To assess whether modules in the yeast coexpression network are enriched for genes controlled by common genetic loci, we performed a genome-wide linkage scan on each of the expression traits to map expression quantitative trait loci (eQTL) for each of the expression traits. After partitioning the yeast genome into 603 bins, each 20kb in length, we identified 23 bins in chromosomes 1, 2, 3, 5, 8, 12, 13, 14, and 15 that contained at least 15 eQTLs each (defined as eQTL hot spots), 22 of which were identical to the previously reported results from this data set16. As shown in Table 1, all but one of the 15 modules in the coexpression network are significantly enriched for linking to at least one bin, thus suggesting extensive pleiotropic effects of QTLs on expression traits in the different bins, given each bin represents at least one QTL affecting multiple gene expression traits. These results also suggest that much of the correlation structure in the different modules is driven by common genetic loci representing perturbations on a particular part of the coexpression network, as we have shown in mouse17. For example, there are 52 genes linked to the bin located on chromosome 12 between base pair positions 640,000 and 660,000, and 34 of these genes are located in the black module (Fisher Exact Test P value = 2.37×10−60). Therefore, the joint analysis of the genotypic and coexpression data highlights that common genetic perturbations in this cross affect the expression activity of subnetworks of genes, which in turn affect the activity of important biological processes. For all subsequent analyses, we merged the 23 eQTL-enriched bins into the 13 previously reported eQTL hot spot regions for the BXR cross16.
The co-localization of entire network modules to common genetic loci suggests that the coherence achieved within these modules is at least partially driven by genes in the modules that are more directly under the control of common factors. While the eQTL hot spots were originally reported as not enriched for linking to loci harboring transcription factors (TF)16, the eQTL hot spots could be driven by genes that are not TFs, but which affect TF activity. We found that 15 of the 16 modules in the coexpression network were significantly enriched (all Fisher Exact Test p-value < 0.01) for at least one TF target gene set (Table 2). For example, 26 of 44 genes in the black module are annotated as having a Hap1 binding site (Fisher Exact Test P value = 7.92×10−30). Interestingly, the gene encoding the Hap1 TF physically resides in the chromosome 12 eQTL hot spot region and has a strong cis eQTL linked to this same region.
Naturally occurring DNA variations that impact expression traits provide one way to test for coherence in the coexpression network modules. However, directed single gene perturbation experiments can also be leveraged for this purpose. We generated knockout signatures from a previously published yeast compendium data set18 and found that all of the network modules were significantly enriched for at least one knockout signature gene set (Table 2). For example, the midnight blue module depicted in Supplementary Figure 1 is comprised of 23 genes, most of which function in yeast mating; 20 of these genes overlap with the signature from the double knockout of DIG1/DIG2 (Fisher Exact Test P value = 1.07×10−18), which are repressors of pheromone responsive transcription. These data not only demonstrate how targeted gene perturbations can affect entire subnetworks, they suggest that when genes in a highly connected module are perturbed, these genes tend to move a significant proportion of the module, reflecting a higher degree of causal connectivity among highly interconnected genes than has been previously appreciated17.
Protein-protein interaction data provide a complementary view of molecular interactions at play in living systems. Unlike coexpression networks, edges in the PPI networks represent putative physical interactions between two proteins. By combining the SGD and DIP yeast PPI data, the yeast PPI network is comprised of 4,833 nodes (distinct proteins) and 15,345 edges between these nodes. To compare the PPI and coexpression networks, the correlation distribution of gene pairs in the coexpression network that were also linked in the PPI network was compared with the correlation distribution of an equal number of randomly selected gene pairs from the coexpression network. While the correlation distribution derived from the PPI linked genes is shifted slightly higher compared to randomly selected gene expression trait pairs, the majority of pairs linked by PPI are not highly correlated (Supplementary Figure 2).
The lack of correlation between gene expression trait pairs that correspond to genes connected in the PPI network may be due to these different data types reflecting different domains of information. The lack of correlation may also be due to a high false positive rate in the PPI data, given recent studies have explicitly demonstrated that PPI data generated from high-throughput experiments are not very specific19. To identify higher-confidence interactions in the PPI data that may better inform the coexpression network, we employed a method to explore the global structural organization of the PPI network in terms of overlapping local network neighborhoods, referred to here as clique communities (see Supplementary Methods for details)20.
The application of this method to the yeast PPI data led to the identification of 492 cliques comprised of 5 or more proteins. A total of 112 clique communities representing 2,477 unique links in the PPI network were identified from these 492 cliques (Figure 1). Of the 4,833 proteins in the PPI network, only 840 (~17%) are represented in the 112 highly interconnected clique communities. Comparing with the full PPI network, a larger portion of the gene pairs connected in the PPI clique communities are also connected in the coexpression network (Supplementary Figure 2). We compared the clique communities with manually curated stable protein complexes in MIPS21. Of the 74 protein complexes consisting of 5 or more proteins in this set, 72% (53/74) of them significantly overlapped with the clique communities, and 74% of the clique communities significantly overlapped the set of stable protein complexes. The clique communities were generally significantly enriched for genes involved in protein complexes, with 330 genes involved in protein complexes (out of 674 represented in the PPI network) overlapping the set of genes comprising the clique communities (a 3.5-fold enrichment; p = 1.54×10−96). The increased overlap between the PPI and coexpression networks likely reflects the higher confidence links represented in the clique communities of the PPI network, as well as the biological need for proteins that are in complexes to be regulated to similar levels.
Our findings are consistent with previous reports22,23. However, while these previous reports focused on known protein complexes23 and establishing associations between gene expression clusters and protein interaction clusters22, here we have systematically uncovered known and unknown complexes from the PPI data. Because PPI data can be noisy, the clique-community analysis serves as a filter to uncover not only the underlying building blocks (cliques) of the PPI network, but also their high-level organization (communities)20.
Coexpression networks highlight how biological networks are organized into functional modules that are under the control of common genetic loci and transcription factors, as well as how DNA variations at specific loci perturb these network modules and in turn induce changes in higher order phenotypes. Despite these and other advantages, coexpression networks do not provide explicit details on the connectivity structure among genes in the network and do not represent causal associations. A number of efforts have sought to integrate different data types like gene expression, genotype, PPI, TFBS and literature data using a Bayesian approach24-26. However, the networks resulting from such efforts are still comprised of modules from which causal information or detailed mechanisms at the gene level are not easy derived. However, both simulation and experimental results have demonstrated that Bayesian networks reconstructed by incorporating genetic data lead to predictive networks12,27. Here we extend this approach by integrating diverse data types, including gene expression, genotype, TFBS, and PPI.
We used a Bayesian network approach to construct three networks: 1) a Bayesian network based on expression data alone (BNraw), 2) a Bayesian network based on expression and eQTL data (BNqtl), and 3) a Bayesian network based on expression, eQTL, TFBS, and PPI data (BNfull). BNqtl was constructed by incorporating eQTL information from the BXR cross as prior evidence that two genes are causally related (see Supplementary Methods for details). To complement the use of the eQTL data, we constructed BNfull by incorporating TFBS data derived from high-quality ChIP-Chip experiments, phylogenetic conservation10, protein complex data, and eQTL data. Manually curated protein complexes21 and complexes identified by the clique community analysis described above for the PPI network were leveraged to enhance the TFBS target gene sets28. If at least half of the genes in a protein complex carried a given TFBS, then all genes in the complex were added to that TFBS gene set (Supplementary Table 1). In the Bayesian network reconstruction process, this extended TFBS data set was considered as prior evidence that two genes are causally related (see Supplementary Methods for details).
There were 3,779, 3,712, and 3,645 links in BNraw, BNqtl and BNfull, respectively. All three Bayesian networks were highly similar with respect to edge overlap when the edge direction was not considered, demonstrating that the different networks captured the covariance structure for the expression traits to a similar degree. In fact, BNraw and BNqtl had 3,335 edges in common (~90% overlap) and BNqtl and BNfull had 3,335 edges in common (~86% overlap). To test the relative power of these networks to predict system behavior, we examined whether the networks could recapitulate known biological processes: 1) test whether the networks could predict the GO categories, 2) test whether the networks could predict genes regulated by different transcription factors, and 3) test whether the networks could predict the expression responses to gene knockout signatures represented in the yeast compendium data set18.
A total of 75 GO terms (46 GO biological processes, 16 GO cellular components, and 13 GO molecular functions categories) were searched for enrichment in each of the Bayesian networks. Supplementary Table 2 shows that 26, 27, and 22 signature sets were significantly enriched (permutation p value < 0.01) in BNraw, BNqtl and BNfull, respectively. These results demonstrate that each of the Bayesian networks captured a significant proportion of the associations among genes known to operate in common pathways. The GO categories represent association-based relationships between genes rather than the cause-effect relationships reflected in gene-specific perturbation data like the yeast compendium data18. Therefore, this result suggests that the eQTL and TFBS data contributes mainly to making causal inferences.
We identified 15, 19, and 30 TF target sets that were significantly enriched in BNraw, BNqtl and BNfull, respectively (Supplementary Table 2). Because BNfull was constructed using the TFBS data as priors, it is unfair to compare this network to the other two networks. However, the extent of enrichment between BNqtl and BNraw was significantly different (Wilcoxon Signed Rank Test p = 0.0002), indicating that eQTLs enhance the power of the Bayesian networks to infer causal associations.
One of the true tests of a causal network is the ability to predict what genes will change in response to a specific perturbation. Single and double gene perturbation experiments enable testing the predictive power of a network in this way. A previously published yeast knockout compendium18 consisting of 300 expression profiles from 287 knock-out strains and 13 chemical perturbation experiments was used to carry out this test for each of the Bayesian networks. Supplementary Table 2 shows that 92, 111 and 116 signature sets were significantly enriched (permutation p value < 0.01) in BNraw, BNqtl and BNfull, respectively. In addition, the significances of the enrichments for BNqtl and BNfull were much greater than for BNraw (Wilcoxon Signed Rank Test p = 1.09×10−5), as shown in Supplementary Figure 3.
Like transgenics, gene knockouts, and other artificial perturbations, eQTLs represent perturbations that affect gene expression traits. In some cases, a given QTL may have pleiotropic effects on a number of expression traits, leading to eQTL clusters that co-localize to a common genetic locus (known as an eQTL hot spot). From the BXR data used to reconstruct the Bayesian networks, we have previously identified 13 eQTL hot spots, with 9 putative regulators proposed for 8 of the hot spots that were based on genes with known biological functions and cis eQTLs that were coincident with these hot spots (Table 3)16. We identified all of the gene expression traits linked to each of the 13 hot spot regions and then searched each of these gene sets for enrichment in subnetwork structures in BNraw, BNqtl, and BNfull, similar to the tests for enrichments carried out above. All but 2 small eQTL hot spots were enriched in subnetworks of BNraw, BNqtl and BNfull (Supplementary Table 3).
One objective method for assessing the predictive power of each Bayesian network is to use the different networks to infer the causal regulators for each of the eQTL hot spots. To identify causal regulators for a given hot spot, genes that gave rise to a putative cis eQTL in the corresponding eQTL hot spot region were selected. For this set of putative regulators we defined the signature for each regulator as the set of genes in the subnetwork that could be reached by the putative regulator following directed links throughout the entire network. The signature for each putative regulator was then intersected with the set of genes linked to the corresponding hot spot region. If the significance of the overlap was significant, we declared the putative regulator as a regulator of the hot spot and associated subnetwork.
We predicted the causal regulators for the eQTL hot spots represented in BNraw, BNqtl, and BNfull Causal regulators were inferred from BNraw for 7 of the eQTL hot spots, with two of these regulators matching those previously identified in the BXR cross16. However, ten causal regulators were inferred from BNqtl and six of these matched regulators that were previously identified in this cross16. Causal regulators were also inferred from BNfull for 9 eQTL hot spots. Again, six of the regulators matched those previously identified for this cross (Table 3). For eQTL hot spots 5 and 6, BNraw failed to predict the correct regulators when the regulators were known, whereas BNqtl and BNfull correctly identified the regulators in these instances (Table 3). These results support that while there is strong similarity between the different Bayesian networks, when we compare the extent of local network enrichments for genes operating in known pathways, the networks constructed by incorporating the eQTL, TFBS, and PPI data as prior information on the relationships among pairs of genes, have greater power to infer causal regulators for validated signature gene sets. We examined 9 of the 13 previously identified eQTL hot spots16 in the BXR cross for which a causal regulator could be identified in BNfull. The percentage of variance for each eQTL hot spot that can be explained by a causal regulator in BNfull is summarized in Supplementary Table 4. The network was used to elucidate the mechanism driving the eQTL hot spot activity, and we used experimental methods to validate prospectively all of the novel predictions.
In a number of cases, causal regulators for eQTL hot spots previously proposed and/or tested16 were predicted by BNqtl and BNfull. For example, our analyses indicated the mitotic exit regulator AMN1 as causal for the variation of a subset of transcripts linking to eQTL hot spot 2, which was confirmed in previous work16. We also made novel, detailed predictions for the mechanism of variation of additional transcripts linking to this hot spot (Supplementary Data). Our analyses identified GPA1 as the causal regulator for hot spot 7 and the transcription factor HAP1 as the causal regulator for hypoxia-related transcripts in hot spot 8, both consistent with previous work16. We used previously reported microarray results to experimentally confirm the causal regulator roles predicted for these genes (Supplementary Data). It is important to note that for each of these regulators, hypotheses in previous studies were generated based on manual curation, whereas our methods are data driven and objective, predicted by BNfull, which was constructed by integrating eQTL, expression, TFBS, and PPI data.
We also examined hot spot 4, which encompasses the leucine biosynthetic enzyme LEU2. The RM parent of the BXR cross contains an engineered LEU2 deletion, while BY is wild-type; previous analyses identified expression effects in the cross that are likely the result of this perturbation16. In keeping with this, in BNfull we observed a subnetwork enriched for amino acid biosynthesis pathways (enrichment p = 1.70×10−47) containing genes whose expression linked to the hot spot 4 region. LEU2 was predicted as a causal regulator for this subnetwork, confirming as above that our inference methods correctly identify known biological effects in the cross (Figure 2a).
To investigate the mechanism by which LEU2 deletion causes expression changes, we first noted enrichment in the subnetwork of genes with binding sites for LEU3 (enrichment p = 1.42×10−8) and GCN4 (p = 3.81×10−12), consistent with the known roles of these transcription factors in regulation of amino acid biosynthesis genes. We also observed strong overlap between the subnetwork and the GCN4 knockout gene expression signature (Figure 2b; p = 1.04×10−75). We profiled the expression of a LEU2 knockout strain and found strong overlap between this signature and the hot spot, as expected (p = 4.91×10−18); again, known GCN4 and LEU3 targets were enriched in the signature, suggesting that LEU3 and GCN4 are likely to be key mediators of the LEU2 deletion effects. Interestingly, the amino acid biosynthetic gene ILV6 also lies in the hot spot 4 region and its expression shows strong linkage in cis; in fact, our methods inferred ILV6 as an additional major causal regulator for the hot spot (Figure 2a). In particular, ILV6 was predicted to be causal for GCN4. We profiled the expression of an ILV6 knockout strain and found significant overlap with the hot spot 4 subnetwork (p = 4.03×10−52), including up regulation of GCN4 and many of its targets (Figure 2b). Our inference of ILV6 as a major regulator can be explained by either of two models. Naturally occurring polymorphisms in ILV6, and the engineered deletion of LEU2, may together act as the genetic causes of expression variation linked to hot spot 4. Alternatively, the LEU2 deletion may be the sole genetic difference responsible for the hot spot, causing expression changes in ILV6 that in turn trigger downstream effects. In either case, our data confirm that variation in ILV6 levels can influence expression of genes in the hot spot 4 subnetwork as predicted.
We also analyzed eQTL hot spots with no previously identified causal regulators. Two BNfull subnetworks were found to be enriched for genes in hot spot 12 (see Supplementary Figure 4 for the large subnetwork). The larger of the two subnetworks was enriched for carbohydrate metabolism (p = 4.87×10−14) while the smaller network was enriched for amino acid biosynthesis (p = 1.56×10−10). The large subnetwork is comprised of 452 genes, which were enriched for targets of the MSN2 (represented by PIL1 in the network) and SKN7 transcription factors, and also showed more modest enrichment of the targets of the transcription factors MSN4, which responds to stress; CIN5, which responds to salt stress; XBP1, which responds to starvation or stress; and REB1 (represented by ENP2). The main transcription factor for the small subnetwork is GCN4. The small subnetwork overlaps the hot spot 4 subnetwork just described. In this case the amino acid biosynthesis processes are likely responding to stress rather than causing stress.
Our method predicted a single causal regulator for the large subnetwork: PHM7, a gene of unknown function that is regulated by phosphate levels29. A cis-acting polymorphism is known to affect PHM7 expression between RM and BY30, suggesting a model in which differential expression of PHM7 in the BXR progeny drives expression variation of hot spot 12 genes. The MSN2 and MSN4 transcription factors are known to mediate the expression response to phosphate and to more general stressors31, suggesting that variation in PHM7 might impact expression of stress-related genes in the BXR cross through these factors. To test the role of PHM7 in regulation of hot spot 12 genes, we deleted PHM7 in the BY background and measured genome-wide expression via microarray, of this strain and wild-type BY, under phosphate-limited conditions. 1329 transcripts showed a significant difference between the two strains; this set was enriched for targets of MSN2 (p = 1.66×10−4) and of the upstream regulator SOK2 (p = 7.11×10−4). Of the genes changing in response to the PHM7 deletion, 817 are in BNfull and 155 of those are in the hot spot 12 subnetwork (Supplementary Figure 4b; enrichment p = 2.70×10−10), confirming that variation in PHM7 can influence expression of genes in the subnetwork as predicted. We note that 65% of the genes in the hot spot 12 subnetwork are not in the PHM7 signature, suggesting that PHM7 may be one of multiple causal, polymorphic regulators. As we showed for eQTL hot spot 4 where multiple causal regulators may be at play, validation of PHM7 as a causal regulator does not exclude the existence of other causal regulators.
A large BNfull subnetwork was found to be enriched for genes in hot spot 11 (Supplementary Figure 5). Our methods inferred the putative mitochondrial transporter SAL1 as this subnetwork's main causal regulator. The BY strain bears a frameshift allele in SAL1, which truncates the protein product by 49 residues relative to RM and results in a loss of its function as a suppressor of a deletion of adenine nucleotide translocase32. To test the role of the SAL1 polymorphism in expression variation in the BXR cross, we introduced the RM version of the SAL1 coding region into the BY genome as a replacement at its own locus, and in parallel we introduced the BY region into the RM background by the same method. We measured expression in these strains via microarray, and computed expression changes for each gene between the engineered strains and their wild-type counterparts. The expression changes for genes not in the subnetwork were close to zero (mean= -0.0114, stdv= 0.2326), whereas the expression changes for genes in the subnetwork were slightly larger (mean= 0.1209, stdv= 0.2358), although not significantly so at the 0.05 level. However, the two-sample Kolmogorov-Smirnov test shows the distributions of the two groups are significantly different (p=1.57×10−22), suggesting that SAL1 may play a minor, causal regulatory role.
To expand our search for causal regulators in this case, we employed a complementary approach to identify regulators that may not be detected by gene expression methods3. We examined genes located in the hot spot region that gave rise to cis eQTL and that harbored coding polymorphisms at highly conserved amino acids that induce known phenotypic changes in yeast (Supplementary Data). MKT1 was the only gene in this hot spot giving rise to a cis eQTL and harboring a coding polymorphism for a highly conserved amino acid (D30G) previously shown to induce phenotypic changes in yeast33,34. Therefore, we utilized a previously generated strain carrying the RM version of the functional polymorphism (G30) in the BY genome (YAD35033) and profiled expression changes in synthetic media to test whether MKT1 was a causal regulator for this hot spot. The eQTL hot spot 11 subnetwork was significantly enriched for genes in the MKT1 allele swap signature, with 698 of the genes in this signature overlapping the set of 3,662 genes used to construct the network, and 124 of these overlapping the subnetwork comprised of 317 genes (Fisher Exact Test P value = 1.86×10−18). This result validates MKT1 as a major causal regulator for hot spot 11.
Previous yeast network reconstructions focused on a more limited number of genes in order to make the reconstruction tractable24,28,35. Our integrative analysis combined large-scale genotype, gene expression, PPI, and TFBS data to construct networks comprised of more than 50% of the genes in the yeast genome using a novel Bayesian network reconstruction method. The relative utility of the resulting networks were highlighted by predicting responses to independent experimental perturbations and the known known biology of the system. Specifically, we demonstrated that networks constructed by incorporating genetic, TFBS, and PPI data were more predictive than a network constructed from expression data alone. Further, our method of integrating diverse data was also demonstrated to predict novel interactions, which in turn led to the identification of genes that are not well annotated, but that nevertheless serve as causal regulators for eQTL hot spots.
The modules emerging from the coexpression network were shown to elucidate the functional relevance of the different components of the network. Of particular interest is our finding that the coexpression network overlapped poorly with the PPI network, suggesting that the PPI and coexpression data reflect complementary views of the system, that the PPI data generated via high-throughput experiments is not very specific19, or a combination of the two. We were able to identify structures in the PPI network that overlapped well with the coexpression network only after we performed a clique community analysis on the PPI network to define the core, highly interconnected substructures of this network. Through this analysis, we found that the overlapping structures were enriched for stable protein complexes, likely explaining the good correlation between the PPI clique communities and corresponding coexpression network modules.
The gold standard for assessing the predictive power of any network model is prospectively validating predictions made from such a model. We queried the different Bayesian networks constructed using progressively more data (BNraw, BNqtl, and BNfull) to predict the causal regulators of the subnetworks enriched for genes linked to the different eQTL hot spots in the BXR cross16. BNfull was demonstrated as the most predictive network, and five of the novel predictions made using BNfull were prospectively tested experimentally, and all of these predictions were validated, thus confirming the predictive power of the integrated network to elucidate the regulatory control of some of the subnetworks. These results are also consistent with a large-scale simulation study we conducted to assess the extent to which genetic information could improve the accuracy of Bayesian networks based on gene expression data in a segregating populations27.
The integrative reconstructions carried out in our study represent only the beginning steps needed to construct large-scale, accurate whole genome networks. A number of important limitations will need to be addressed to further enhance the accuracy of this type of network. First, the Bayesian network algorithm employed in this study does not permit loops, making it difficult to represent some types of feedback, which are obviously an important control mechanism in any biological system. Second, Bayesian networks do not effectively represent time-series data36. These issues might be addressed by using dynamic Bayesian networks, which explicitly include a temporal representation of the interaction between nodes. Third, we reconstructed networks from a limited amount of data generated from a single population and under only a single biological condition. Given the impact genetic background and environment can have on network structure, with the connectivity structure of a network varying as a function of genetic background and environment, populations representing different genetic backgrounds in different environmental contexts will have to be studied to assess the impact on network structure. However, these and other issues notwithstanding, our results support that the construction of large-scale whole gene networks based on genetic, gene expression, TFBS, PPI and related types of large-scale data can lead to networks that are capable of predicting complex system behavior.
Work at Princeton was supported by National Institute of Mental Health grant R37 MH059520 and a James S. McDonnell Foundation Centennial Fellowship to L.K., and Center grant P50GM071508 from the National Institute of General Medical Science to the Lewis-Sigler Institute. We thank J. Whittle for generating GPA1 allele swap data, Shyamala Iyer for performing some of the deletion strain validation experiments, and Adam Deutschbauer for providing us with the MKT1 allele swap strain YAD350. We would also like to thank Renee Ireton of the Department of Microbiology at the University of Washington for her careful reading and editing of this manuscript.
Author Contributions E.N.S., B.D., R.B.B., L.K., and R.B. constructed and characterized the genetically modified yeast strains. J.Z., B.Z., and E.E.S. carried out the coexpression and Bayesian network analyses, and performed bioinformatic analyses. E.N.S., B.D., R.B.B., L.K., and R.B. aided in the data analysis. All authors were involved in the study design, interpretation of the experimental results, and discussed the results and commented on the manuscript. J.Z., B.Z. and E.E.S. designed the study, developed methods, analyzed the data and wrote the paper.
Author Information All gene expression data generated for this study have been deposited into the GEO database under accession number GSE11111. The gene expression data for the BXR cross has been previously deposited into the GEO database under accession number GSE1990.