Genetic interaction analysis, in which two mutations have a combined effect not exhibited by either mutation alone, is a powerful and widespread tool for establishing functional linkages between genes. In the yeast Saccharomyces cerevisiae, ongoing screens have generated >4,800 such genetic interaction data. We demonstrate that by combining these data with information on protein-protein, protein-DNA or metabolic networks, it is possible to uncover physical mechanisms behind many of the observed genetic effects. Using a probabilistic model, we found that 1,922 genetic interactions are significantly associated with either between- or within-pathway explanations encoded in the physical networks, covering ~40% of known genetic interactions. These models predict new functions for 343 proteins and suggest that between-pathway explanations are better than within-pathway explanations at interpreting genetic interactions identified in systematic screens. This study provides a road map for how genetic and physical interactions can be integrated to reveal pathway organization and function.
A major biological challenge is to interpret observed genetic interactions in a physical cellular context1–3. There are several major types of genetic interactions: synthetic-lethal interactions, in which mutations in two nonessential genes are lethal when combined; suppressor interactions, in which one mutation is lethal but when combined with a second, cell viability is restored; and an array of other effects such as enhancement and epistasis. Genetic interactions have been used extensively to shed light on pathway organization in model organisms1–4. In humans, genetic interactions are critical in linkage analysis of complex diseases5 and in discovery of new pharmaceuticals6. Although genetic interactions are classically identified by mutant screens7, recent studies have applied systematic ‘reverse’ methods such as synthetic genetic arrays (SGA)8 or synthetic lethal analysis by microarrays (SLAM)9 to catalog ~4,000 synthetic-lethal and synthetic-sick interactions in Saccharomyces cerevisiae.
Because of the high-throughput nature of SGA, discovery of new genetic interactions is largely automated. However, interpreting the functional significance of each result remains a relatively slow process. The problem is compounded by the large number of genetic interactions measured when screening one gene versus all others (~34 on average10) as well as possible false positives if the interactions are not confirmed by tetrad or random spore analysis. Thus, without further methods to aid in characterizing synthetic lethals, large-scale interpretation is a daunting prospect.
A promising solution may be to integrate synthetic lethals with other types of high-throughput interactions. For instance, direct physical interactions among proteins are being mapped by systematic two-hybrid11–15 or immunoprecipitation studies16,17, whereas physical interactions between transcription factors and promoter sites are determined using chromatin-immunoprecipitation in conjunction with DNA microarrays18,19. These interactions comprise a physical network, which correlates with the network of genetic interactions and provides potential clues as to the mechanisms behind particular synthetic-lethal effects. Previous studies have demonstrated this correlated structure in yeast, by showing that two proteins in the same region of the genetic network are likely to also physically interact8,10, that genes with similar patterns of genetic interactions often occur within the same protein complex10 and that a protein with many interactions in the physical network typically has many interactions in the genetic network also20.
These studies suggest that it may be possible to interpret observed synthetic-lethal relationships explicitly using physical interactions. In this regard, previous authors1,21 have noted that synthetic-lethal interactions are typically associated with one of three types of physical interpretations: between-pathway models, within-pathway models and indirect effects (Box 1).
Between-pathway interpretations. The genetic interaction bridges genes operating in two pathways with redundant or complementary functions. Deletion of either gene is expected to abrogate the function of one but not both pathways.
Within-pathway interpretations. The genetic interaction occurs between protein subunits within a single pathway. A single gene is dispensable for the function of the overall pathway, but the additive effects of several gene deletions are lethal.
Indirect effects. The synthetic lethal phenotype is not mediated by a localized mechanism in the physical network. Indirect effects can occur because a deletion phenotype represents not just the absence of one particular gene, but also the response of the cell to its absence, involving many diverse pathways21.
Here, we demonstrate a computational framework for assembling genetic and physical interactions into models corresponding to between- versus within-pathway interpretations. Regions of the physical network that correspond to each type of model are identified using a probabilistic scoring scheme. These models predict new protein functions and suggest that genetic interactions are more likely to bridge redundant or complementary processes than to combine additively within the same process.
We assembled a genetic interaction network from two primary data sources (Fig. 1). The first was generated by SGA, a large-scale screen10 crossing 132 yeast gene deletion strains versus each of the ~4,700 available deletion strains22 and resulting in 2,012 observed synthetic-lethal interactions and 2,113 synthetic-sick interactions. The second data source consisted of an additional 687 synthetic-lethal interactions culled from the literature and catalogued at the Munich Information Center for Protein Sequences (MIPS)23. The combined genetic network synthesizing these data consisted of 1,434 proteins (genes) linked by 4,812 synthetic-lethal interactions.
We also assembled a physical network of 5,993 yeast proteins connected by physical interactions of three types: 15,429 protein-protein interactions (the two proteins a and b display physical binding); 5,869 protein-DNA regulatory interactions (a binds upstream of the gene encoding b); and 6,306 shared-reaction metabolic relationships (a and b are enzymes that operate on at least one metabolite in common). The protein-protein interactions were downloaded from the DIP database24 as of July 2004 and predominantly included data from large-scale experiments13,15–17. The protein-DNA interactions were obtained from a large-scale chromatin-immunoprecipitation study of 106 transcription factors18 (interactions with P ≤ 0.001). Enzymatic reactions linked by common metabolites were obtained from KEGG25, excluding metabolite cofactors such as ATP or H2O (listed in Supplementary Table 1 online). The combined physical network covered 94.4% of all proteins in the genetic network. Both networks are provided at http://www.cellcircuits.org/Kelley2005/ in Cytoscape26 (SIF) format.
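Because both networks are distributed in Cytoscape SIF format, loading them amounts to parsing lines of the form `source type target [target ...]`. The following Python sketch groups edges by interaction type. The function name `parse_sif`, the edge-type label `pp` and the example lines are illustrative assumptions; the labels actually used in the distributed files should be checked against the files themselves.

```python
from collections import defaultdict


def parse_sif(lines):
    """Parse SIF lines ('source type target [target ...]') into edges by type.

    Returns a dict mapping each edge-type label to a set of (source, target)
    tuples. Lines with fewer than three fields (blank lines or isolated
    nodes) contribute no edge.
    """
    edges = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        source, edge_type, targets = fields[0], fields[1], fields[2:]
        for target in targets:
            edges[edge_type].add((source, target))
    return edges


# Hypothetical example lines; 'pp' is an assumed label for protein binding.
example = ["YLL049W pp JNM1", "BRE1 pp LGE1 YEL043W"]
network = parse_sif(example)
print(sorted(network["pp"]))
```

Keeping one edge set per interaction type mirrors the way the physical network distinguishes protein-protein, protein-DNA and shared-reaction links.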
Preliminary statistical analyses confirm a limited relationship between genetic and physical interactions (see Supplementary Fig. 1 online and Tong et al.8,10), but demonstrate a need for structured models to efficiently separate signal from noise. Towards this goal, we implemented a probabilistic modeling procedure to capture the between-pathway interpretation of genetic interactions. This procedure involved a search for pairs of physical pathways that were densely connected by genetic interactions, in which a ‘pathway’ was loosely defined as any densely connected set of proteins in the physical network (this definition generically covers many network structures, including protein complexes). Pairs of pathways (constituting a single network model; see Fig. 1) were assigned a score proportional to the density of physical interactions falling within each pathway and the density of genetic interactions bridging between pathways (Box 2). This search generated 360 significant models covering 401 pathways and incorporating a total of 1,573 genetic interactions (196 MIPS, 687 SGA synthetic lethal, 690 SGA synthetic sick) and 1,931 physical interactions (1,248 protein binding, 77 regulatory, 606 shared reaction). Significance of these models was assessed by comparison to random genetic and physical networks. Detailed information for all models is provided in Supplementary Tables 2 and 3 online and at http://www.cellcircuits.org/Kelley2005/.
Scoring within-pathway explanations. The within-pathway model implies dense interactions within a single group of proteins in both the physical and genetic networks. We adopt a previously described log-odds score37 to assess the likelihood that a group of proteins is more densely connected than would be expected at random:
$$S(V,E) = \log \frac{P(V,E \mid \mathrm{Model}_{\mathrm{dense}})}{P(V,E \mid \mathrm{Model}_{\mathrm{random}})} = \sum_{(a,b) \in V \times V} \log \frac{\beta\, I_E(a,b) + (1-\beta)\,(1 - I_E(a,b))}{r_{a,b}\, I_E(a,b) + (1-r_{a,b})\,(1 - I_E(a,b))}$$
where $V$ is a set of proteins and $E$ a set of interactions among those proteins (genetic or physical). $I_E(a,b)$ is an indicator function which equals 1 if and only if the interaction $(a,b)$ occurs in $E$ and otherwise 0. For $\mathrm{Model}_{\mathrm{dense}}$, interactions are expected to occur with high probability ($\beta$) for every pair of proteins in $V$. In this work, $\beta$ is set to 0.9 (Supplementary Fig. 2 shows how the results depend on choice of $\beta$). For $\mathrm{Model}_{\mathrm{random}}$, the probability of observing each interaction ($r_{a,b}$) is determined by estimating the fraction of all networks with identical degree distribution which also contain that interaction. Comparable random networks are generated by ‘crossing’ pairs of edges in a process similar to that described by Milo et al.30 In this randomization, only edges of the same type are allowed to be crossed. In addition, for undirected types, either interacting node is allowed to serve as the ‘source’ in crossing the edges. Such randomization generates a family of random networks which resemble the original network and corrects for the presence of highly connected proteins, which score highly under both models. The interaction density is evaluated independently for the physical and genetic networks, yielding an overall score for the within-pathway model:
$$S_{\mathrm{within}}(V, E_{\mathrm{physical}}, E_{\mathrm{genetic}}) = S(V, E_{\mathrm{physical}}) + S(V, E_{\mathrm{genetic}})$$
Scoring between-pathway explanations. The between-pathway model implies dense genetic interactions connecting two separate, nonoverlapping groups of proteins, where each group is densely connected by physical interactions. The density of physical interactions is scored independently within two sets of proteins $V_1$ and $V_2$ using the above function $S$. A related log-odds score is used to evaluate the probability that the genetic interactions $E_{\mathrm{genetic}}$ bridging between these sets are denser than random:
$$S(V_1, V_2, E_{\mathrm{genetic}}) = \sum_{(a,b) \in V_1 \times V_2} \log \frac{\beta\, I_{E_{\mathrm{genetic}}}(a,b) + (1-\beta)\,(1 - I_{E_{\mathrm{genetic}}}(a,b))}{r_{a,b}\, I_{E_{\mathrm{genetic}}}(a,b) + (1-r_{a,b})\,(1 - I_{E_{\mathrm{genetic}}}(a,b))}$$
The final scoring function for the between-pathway model is then:
$$S_{\mathrm{between}}(V_1, V_2, E_{\mathrm{physical}}, E_{\mathrm{genetic}}) = S(V_1, E_{\mathrm{physical}}) + S(V_2, E_{\mathrm{physical}}) + S(V_1, V_2, E_{\mathrm{genetic}})$$
Search and Significance. Sets of proteins that are well explained by either the within-pathway or between-pathway models are identified using a greedy network search procedure. The search is as previously described by Sharan et al.37 except that it is seeded from each pair of genetically interacting proteins. Pathways that share more than 50% of genetic interactions with a higher-scoring result are discarded. To determine the significance threshold, identical searches are performed over 100 random trials in which both the genetic and physical networks are randomized as described above. Models that score higher than the maximal-scoring models in 95% of random trials are reported as significant.
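The scoring and randomization steps described in this box can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: interactions are treated as undirected and stored as `frozenset` pairs, each pair is counted once rather than over the full product V × V, and the background probabilities are supplied as hypothetical callables `r_phys(a, b)` and `r_gen(a, b)` that return values strictly between 0 and 1 (for example, frequencies estimated from many randomized networks). All function names are illustrative.

```python
import math
import random
from itertools import combinations, product

BETA = 0.9  # dense-model interaction probability, as set in the text


def pair_log_odds(present, r, beta=BETA):
    """Log-odds of one protein pair under the dense versus the random model."""
    dense = beta if present else 1.0 - beta
    rand = r if present else 1.0 - r
    return math.log(dense / rand)


def density_score(pairs, edges, r, beta=BETA):
    """Sum the pair log-odds over the given protein pairs."""
    return sum(pair_log_odds(frozenset(p) in edges, r(*p), beta) for p in pairs)


def within_score(V, e_phys, e_gen, r_phys, r_gen):
    """Physical plus genetic density within a single protein set."""
    pairs = list(combinations(sorted(V), 2))
    return density_score(pairs, e_phys, r_phys) + density_score(pairs, e_gen, r_gen)


def between_score(V1, V2, e_phys, e_gen, r_phys, r_gen):
    """Physical density inside each set plus genetic density bridging them."""
    s1 = density_score(combinations(sorted(V1), 2), e_phys, r_phys)
    s2 = density_score(combinations(sorted(V2), 2), e_phys, r_phys)
    bridge = density_score(product(sorted(V1), sorted(V2)), e_gen, r_gen)
    return s1 + s2 + bridge


def cross_edges(edges, n_swaps, rng):
    """Degree-preserving shuffle of one undirected edge type by edge crossing."""
    edge_list = [tuple(sorted(e)) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:  # either endpoint may serve as the 'source'
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        # Skip swaps that would create a self-loop or a duplicate edge.
        if len(new1) < 2 or len(new2) < 2 or new1 in present or new2 in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return present


def significance_threshold(max_scores, percent=95):
    """Score that must be exceeded to beat the best model in `percent`% of trials."""
    ranked = sorted(max_scores)
    k = (percent * len(ranked) + 99) // 100  # ceiling of percent * n / 100
    return ranked[k - 1]
```

In use, `cross_edges` would be applied separately to each interaction type, the search repeated on each randomized pair of networks, and the top score of each trial passed to `significance_threshold`; models from the real networks scoring above the returned value would then be reported.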
Pooling diverse genetic and physical interaction data sets widens the search but also has the potential to decrease the coverage of network models, because not all data sets may be equally predictive and high-scoring network models are more likely to arise at random in large networks. To investigate the effect of data pooling, we repeated the search on a smaller network comprising large-scale synthetic-lethal (SGA) and protein-binding (DIP) interactions only. This reduced search identified 20 models containing a total of 137 synthetic-lethal and 120 protein-binding interactions (Fig. 2). In comparison to the complete search, fewer protein-binding and SGA synthetic-lethal interactions were incorporated into models, demonstrating the synergy obtained by data pooling (although models generated by the restricted search performed somewhat better in validation). Supplementary Table 4 online analyzes the impact of removing each physical and genetic data set from the modeling procedure.
We next searched the physical and genetic networks for within-pathway explanations. This procedure assigned a high score to single sets of proteins that were densely connected by both physical and genetic interactions (see Fig. 1, Box 2 and Supplementary Fig. 2 online). This search yielded 91 significant models. In all, these contained 272 MIPS, 225 SGA synthetic lethal and 169 SGA synthetic-sick interactions associated with 318 protein-binding, 37 regulatory and 36 shared-reaction interactions. Four representative within-pathway models are shown in Figure 3.
As initial validation of the between- and within-pathway models, we found that both types were significantly enriched for particular functional annotations recorded in the Gene Ontology database27. Two hundred and fifty-one of the 401 pathways in between-pathway models were enriched for proteins with a common Molecular Function, Biological Process or Cellular Component annotation using the hypergeometric test (P ≤ 0.05; Bonferroni-corrected for multiple testing)28. Similarly, 52 of the 91 within-pathway models were enriched for Gene Ontology annotations. Moreover, these functional enrichments were higher than expected based on the physical interaction network alone (see Supplementary Table 5 online).
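The enrichment test used in this validation can be illustrated with a short Python sketch. It computes the upper tail of the hypergeometric distribution and applies a Bonferroni correction. The choice of correction factor (the number of terms tested), the `universe_size` argument and the function names are assumptions made for illustration; the text does not specify these details.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enriched_terms(pathway, annotations, universe_size, alpha=0.05):
    """Return {term: corrected P} for terms enriched in the pathway.

    `annotations` maps each term to the set of proteins carrying it.
    P values are Bonferroni-corrected by the number of terms tested.
    """
    pathway = set(pathway)
    n_tests = len(annotations)
    hits = {}
    for term, members in annotations.items():
        k = len(pathway & members)
        if k == 0:
            continue
        p = hypergeom_tail(k, len(pathway), len(members), universe_size)
        corrected = min(1.0, p * n_tests)
        if corrected <= alpha:
            hits[term] = corrected
    return hits
```

Exact integer binomial coefficients are used so that the tail probability stays accurate even for very small P values.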
Having established that proteins in many of the between- and within-pathway models were enriched for specific annotations, we used this concept to predict new protein functions. Specifically, for physical pathways in which a majority of proteins were already assigned a common significant annotation, we predicted this term for the remaining proteins in the pathway. To eliminate overly general predictions, significance was assigned only to those terms that were enriched at a level of P ≤ 0.05 and were associated with fewer than 100 yeast proteins overall.
For between-pathway models, this approach predicted 745 molecular function, biological process or cellular component annotations among 282 proteins. In comparison, the within-pathway models predicted 285 annotations involving 127 proteins, bringing the total to 973 annotations for 343 proteins after accounting for repeated predictions. A list of novel functional predictions is provided in Supplementary Table 6 online. Less than a quarter of these predictions were attainable using a similar approach based on the physical network only (Supplementary Table 7 online).
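The majority rule behind these predictions can be written as a brief Python sketch. It assumes that the significantly enriched terms for a pathway have already been determined, and it interprets 'majority' as a strict majority of pathway members. The function name, the placeholder proteins `P1` and `P2`, and the term label in the example are hypothetical.

```python
def predict_functions(pathway, significant_terms, annotations, max_term_size=100):
    """Transfer majority-held significant terms to unannotated pathway members.

    `annotations` maps each term to the set of all proteins carrying it.
    Terms carried by `max_term_size` or more proteins are treated as too
    general and skipped.
    """
    pathway = set(pathway)
    predictions = {}
    for term in significant_terms:
        members = annotations[term]
        if len(members) >= max_term_size:
            continue
        annotated = pathway & members
        if 2 * len(annotated) > len(pathway):  # strict majority of the pathway
            for protein in pathway - members:
                predictions.setdefault(protein, set()).add(term)
    return predictions


# Hypothetical pathway: three annotated members and one uncharacterized protein.
pathway = {"JNM1", "P1", "P2", "YLL049W"}
annotations = {"dynactin complex": {"JNM1", "P1", "P2"}}
print(predict_functions(pathway, ["dynactin complex"], annotations))
```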
Accuracy of these predictions was estimated using cross validation29. Using a standard five-way procedure, the set of yeast proteins was partitioned such that annotations were hidden for one-fifth of the proteins and annotations for the remaining four-fifths of proteins were used to predict the hidden information. Each prediction for a protein in the ‘hidden set’ was scored as a success or failure depending on whether it recovered a hidden annotation. Using this approach, the success rate was estimated to be 63% for between-pathway models and 69% for within-pathway models.
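A five-way cross-validation of this kind might be organized as in the Python sketch below. The callback `predict` is a hypothetical stand-in for the full prediction procedure: it receives the visible annotations and the hidden proteins and returns (protein, term) predictions. The random partitioning and the names used here are assumptions, not details taken from the study.

```python
import random


def cross_validated_success_rate(proteins, annotations_of, predict, k=5, seed=0):
    """Estimate prediction accuracy by hiding annotations one fold at a time.

    `annotations_of` maps each protein to its set of terms. For each fold,
    `predict(visible, hidden)` is called with the annotations of the other
    folds and the set of hidden proteins; each returned (protein, term)
    pair counts as a success if the term is among the hidden annotations.
    """
    ordered = sorted(proteins)
    random.Random(seed).shuffle(ordered)
    folds = [ordered[i::k] for i in range(k)]
    successes = trials = 0
    for fold in folds:
        hidden = set(fold)
        visible = {p: t for p, t in annotations_of.items() if p not in hidden}
        for protein, term in predict(visible, hidden):
            trials += 1
            successes += term in annotations_of.get(protein, set())
    return successes / trials if trials else float("nan")
```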
Finally, we investigated whether the network models could predict the existence of new genetic interactions (Fig. 4). According to the between-pathway model, proteins in one pathway genetically interact with many of the same partners in a second pathway. This leads to the occurrence of ‘complete bipartite motifs’ in the genetic interaction network, defined as four-protein subnetworks in which the first two proteins are connected to the second two proteins by all four possible genetic interactions (Fig. 4a; see Milo et al.30 for an introduction to network motifs). When an incomplete motif (IM) is observed, for which only three of the four genetic interactions are present, the motif implies that the remaining interaction is true. Physical network information is incorporated by requiring that valid incomplete motifs fall within (i.e., are subgraphs of) a between-pathway model.
We applied the technique of five-way cross-validation to estimate the accuracy of genetic interaction prediction versus the minimum number of required incomplete motifs (Fig. 5). In each of five cross-validation trials, approximately one-fifth of the genetic interaction data were withheld, including both positive and negative interactions measured for each genetic ‘bait’ in SGA. These positive and negative interactions were subsequently used to test prediction accuracy. For instance, at a prediction threshold of eight or more incomplete motifs, the between-pathway models predicted 43 new genetic interactions with 87% estimated accuracy (Fig. 5). To assess the contribution of the physical models in the prediction process, we also predicted ‘naive’ genetic interactions by relaxing the requirement that incomplete motifs fall in a between-pathway model. The estimated accuracy fell to 5% for these naive predictions, evaluated at the same threshold of eight incomplete motifs.
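The incomplete-motif rule can be made concrete with a short Python sketch. For every pair of proteins (one from each pathway of a between-pathway model) that lacks an observed genetic interaction, it counts the four-protein motifs in which the other three interactions are present. Genetic interactions are assumed to be stored as `frozenset` pairs, and the function names are illustrative; the default threshold of eight follows the example given in the text.

```python
from collections import Counter
from itertools import product


def incomplete_motif_support(V1, V2, genetic):
    """Count incomplete bipartite motifs implying each absent V1-V2 interaction.

    V1 and V2 are the two protein sets of one between-pathway model, and
    `genetic` is a set of frozenset pairs. A motif supports the absent pair
    (x, y) when x-v, u-v and u-y are all present for some other u in V1 and
    v in V2.
    """
    def linked(a, b):
        return frozenset((a, b)) in genetic

    support = Counter()
    for x, y in product(V1, V2):
        if linked(x, y):
            continue
        for u, v in product(V1 - {x}, V2 - {y}):
            if linked(x, v) and linked(u, v) and linked(u, y):
                support[(x, y)] += 1
    return support


def predict_interactions(support, min_motifs=8):
    """Keep the absent pairs supported by at least `min_motifs` motifs."""
    return {pair for pair, count in support.items() if count >= min_motifs}
```

Restricting `V1` and `V2` to the pathways of a single model is what incorporates the physical network; the naive variant would count motifs over all proteins in the genetic network.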
For the within-pathway models, genetic interactions were implied between proteins that had genetic interactions with one or more common neighbors (Fig. 4b). The physical network was incorporated by restricting the proteins and neighbors to fall into a single within-pathway model. The number of common neighbors was used as a measure of confidence in the implied genetic interaction, and cross validation was used to estimate the prediction accuracy as a function of this number. The maximal prediction accuracy was 38%, achieved at a prediction threshold of three or more common neighbors (Supplementary Fig. 3 online). The corresponding success rate for naive predictions, made without constraining the proteins to occur in within-pathway models, was 15%. Thus, both types of models enhance the accuracy of prediction of genetic interactions, but between-pathway models appear to be better predictors than within-pathway models.
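The common-neighbor rule for within-pathway models admits a similarly small Python sketch. It is an illustration only, assuming undirected genetic interactions stored as `frozenset` pairs and considering only interactions whose two proteins both lie in the model; the function names are hypothetical, and the default threshold of three follows the value reported in the text.

```python
from itertools import combinations


def common_neighbor_support(model, genetic):
    """Count shared genetic neighbors for each non-interacting pair in a model.

    `model` is the protein set of one within-pathway model and `genetic`
    a set of frozenset pairs. Only interactions with both endpoints inside
    the model contribute neighbors.
    """
    model = set(model)
    neighbors = {p: set() for p in model}
    for edge in genetic:
        if len(edge) == 2 and edge <= model:
            a, b = edge
            neighbors[a].add(b)
            neighbors[b].add(a)
    support = {}
    for a, b in combinations(sorted(model), 2):
        if frozenset((a, b)) in genetic:
            continue
        shared = len(neighbors[a] & neighbors[b])
        if shared:
            support[(a, b)] = shared
    return support


def predict_within(support, min_neighbors=3):
    """Keep pairs with at least `min_neighbors` common genetic neighbors."""
    return {pair for pair, count in support.items() if count >= min_neighbors}
```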
Given a systematic approach for associating genetic interactions with physical interpretations, it is of interest to ask which type of interpretation is most common. Focusing on large-scale SGA measurements, roughly three-and-a-half times as many genetic interactions are associated with between- as opposed to within-pathway models (1,377 versus 394 SGA interactions). These figures can be viewed as an a priori expectation that a newly determined SGA interaction will fall between versus within pathways, suggesting that SGA interactions typically span between multiple physical network regions instead of occurring within a single complex or pathway. One reason for the preference towards between-pathway models may be that SGA interactions are mainly targeted to nonessential genes (due to their use of complete gene deletions as opposed to, e.g., point mutations made by classical techniques).
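The ratio quoted above follows from the SGA synthetic-lethal and synthetic-sick counts reported for the two searches (687 and 690 for between-pathway models; 225 and 169 for within-pathway models). The symbols below are introduced only for this worked calculation:

```latex
\begin{align*}
N_{\mathrm{between}} &= 687 + 690 = 1{,}377 \\
N_{\mathrm{within}}  &= 225 + 169 = 394 \\
\frac{N_{\mathrm{between}}}{N_{\mathrm{within}}} &= \frac{1{,}377}{394} \approx 3.5
\end{align*}
```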
Using physical models, it is possible to characterize approximately 40% of the genetic interactions as occurring between or within pathways. Whether the remaining interactions belong to between-pathway models, within-pathway models or are best characterized as ‘indirect’ (Box 1) cannot be reliably determined at this stage. For example, consider the case of two related pathways, each with only one protein required for pathway function. In this case, only the required proteins would be connected by a (single) genetic interaction across the pathways, making it difficult for the between-pathway model to achieve statistical significance.
Further examination of the between-pathway models reveals that many of the genetically linked pathways have clear interdependent functional relationships. For example, pathway M contains members of the prefoldin complex, which have synthetic-lethal interactions with members of pathways N and T forming parts of the dynactin complex and kinetochore, respectively (Fig. 2a). The prefoldin complex promotes folding of α- and β-tubulin into functional microtubules31. These are important for the function of dynactin, an adaptor complex involved in translocating the spindle and other molecular cargos along microtubules32, as well as the kinetochore, which anchors chromosomes to spindle microtubules during metaphase33. Apparently, deletion of proteins in the prefoldin complex reduces microtubule stability, leading to synthetic-lethal interactions with pathways that are directly dependent on microtubule function.
These pathways also predict a new function for the uncharacterized protein Yll049w (pathway N). This protein binds Jnm1, a dynactin protein which is required for spindle partitioning in anaphase32. In addition, it has synthetic-lethal interactions with members of the prefoldin complex in a manner similar to dynactin genes. Together, these relationships suggest that Yll049w is associated with dynactin during spindle partitioning. However, because Jnm1 has 12 physical interactions overall, and Yll049w has a total of 14 interactions in the genetic network, this prediction would have been difficult to make without an integrated approach.
Pathways O, U and Y provide another example of synergistic pathways linked by genetic interactions (Fig. 2a). Pathways U and Y mediate retrograde transport of proteins to the Golgi apparatus34,35. Pathway O (Bre1, Lge1) is involved in histone ubiquitination and cell size control, where cell size is influenced by the histone ubiquitination activity through an unknown process36. The abundant genetic interactions between pathways O and U indicate a possible role for retrograde transport in histone ubiquitination, or reciprocally, for histone ubiquitination in retrograde transport. Moreover, the uncharacterized protein Yel043w is physically associated with Bre1 and Lge1 and also has the same pattern of genetic interactions, suggesting that the three proteins may function together.
In summary, we have presented a methodology for integrating large-scale genetic and physical networks to capture the physical context behind observed genetic interactions. Approximately 40% of yeast synthetic-lethal genetic interactions can be incorporated into high-level physical pathway models and are approximately three and a half times as likely to span pairs of pathways as to occur within pathways. Further studies will be needed to address other types of genetic effects and to extend this approach from yeast to the growing number of other organisms for which protein networks are now available. As systematic approaches generate ever larger databases of interactions across a variety of species, integrative modeling approaches such as the one proposed here will be indispensable for selecting and organizing the information into predictive models.
We thank Jonathan Wang, Owen Ozier and Gopal Ramachandran for preliminary investigations and Vineet Bafna, Ben Raphael and Vikas Bansal for insightful commentary. Craig Mak, Silpa Suthram and Taylor Sittler provided helpful reviews of the text. Funding was provided by the National Institute of General Medical Sciences (GM070743-01) and the National Science Foundation (NSF 0425926).
Note: Supplementary information is available on the Nature Biotechnology website.
COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.