The first step in building a classifier to predict SL pairs is the identification of a set of features to treat as variables in the prediction procedure. The ideal features in this case are those that capture information about the relationships between genes. Towards that end, we extracted 152 SL-independent features (no known SL interaction is involved) from a number of sources, including multiple gene expression studies
[12],
[13],
[14],
[15], protein-protein interaction databases (
www.yeastgenome.org as of Sept 2007), transcription factor binding databases
[16], functional annotations as defined in the Gene Ontology (
www.yeastgenome.org as of May 2008), and gene network modules and clique communities
[16]. Among the 152 features identified, 62 were intended to capture the likelihood of two genes being directly related to each other (e.g., co-regulated in gene expression studies, protein/DNA sequence similarity, and direct physical interactions in the PPI network). The other 90 features were derived by overlaying pairs of networks (individual features) using a methodology similar to that used for deriving binary 2-hop features in the previously described MNDT approach
[8]. In MNDT, the overlay (2-hop) features essentially capture binary transitive relationships between gene pairs. We extended this approach by computing weights for the edges of the overlaid networks via an exhaustive search for the strongest transitive link (maximum of the product of weights for any two input edges) over a set of weighted networks, as shown in . Therefore, our “overlay” feature is a generalization of the 2-hop one, since it involves finding the strongest link connecting two genes, as opposed to finding “any” link in the 2-hop feature. Such a generalization makes use of more information in the weighted networks being overlaid, and thus overlay features are expected to be more effective for predicting genetic interactions. For more details about how to compute overlay features, see the
Materials and Methods section. A complete list of the features used in our study can be found in Table S1 in
Text S1.
Datasets and classification method
To characterize the extent to which these features could differentiate the SL and non-SL classes, we collected 9,994 SL interactions and 125,509 non-SL interactions from the SGD database (
www.yeastgenome.org, as of May 2008). Note that some of the interactions in SGD labeled as SL may be actually synthetic sick (SS) interactions, stemming from some of the SS interactions having been referred to as ‘synthetic lethal’ in the original publications
[2],
[17]. However, these (SS) interactions are still expected to have strong effects, and thus should exhibit similar characteristics as synthetic lethal interactions for learning and classification. For simplicity, we refer to all of these interactions as SL in this paper. The SL network thus prepared comprised of 9994 SL interactions covering 2502 genes and on the average, each gene had 8 connections, while the overall data set, referred to as SGD-SL, consisted of 135,503 interactions.
We first employed the Kolmogorov-Smirnov (KS) test to capture the difference of the distributions of a feature in the sets of positive (SL) and negative (non-SL) examples
[11]. The D-statistic from the KS test is then used as the measure of the discriminative power of each feature. shows the distribution of the 6 most discriminative features for the two classes, along with their D-statistic values, and Figure S1 in
Text S1 shows the ratio of the frequencies of these features. Table S1 in
Text S1 presents a complete ranking of all the 152 SL-independent features and the 15 SL-dependent ones (the known SL interactions are involved) considered in this study. Not surprisingly, features derived from physical protein interactions and functional annotations are among the most discriminative, which is consistent with previous findings
[8],
[18].
Using the above set of features, we developed an integrative classification system for predicting whether a given gene pair is synthetic lethal or not. As the first step, the negative (majority) class (non-SL, 92.9% of the set) is randomly under-sampled to produce a set of negative examples of the same size as the positive (minority) class (SL, 7.1% of the set) for handling the rare class problem
[19] with our data set. This balanced combination of these two sets is used to train a non-parametric multi-classifier system that enabled the simultaneous use of multiple classification procedures, such as SVM, neural networks and decision trees. Such a multi-classifier combination (henceforth referred to as ensemble or MNMC) is desirable for complex problems like SL prediction involving noisy inputs, since precise solutions with a high coverage are often difficult to achieve by a single classification procedure
[20] (see
Materials and Methods for details). Note that the under-sampling is applied only to the training set, while the true ratio of the number of positive to negative examples is maintained in the test set. Thus, the results presented below are unbiased and comparable with other methods.
Validation
We tested each of the individual classifiers and the ensemble (MNMC) on our SGD-SL dataset. shows the receiver operating characteristic (ROC) curves of the seven classifiers based on 10-fold cross-validation. As one can observe from this figure, the ensemble consistently outperformed the individual classifiers, with the SVM classifier performing the best among the individual classifiers and k-NN performing the worst. At a false positive rate of 20%, the true positive rate of the ensemble was roughly 55%, 2% higher than the best individual classifier (SVM). In addition, the prediction precision (fraction of the number of true SL predictions to the size of the complete set of SL predictions) of the ensemble was as high as 49% vis-a-vis a recall (fraction of the number of true SL predictions to the size of complete set of known SL examples) of 16.6% at a classification score threshold of 0.2. This type of high precision is important for predicting new SL interactions with high confidence. We also tested the performance of the ensemble based on an expanded feature set including the 152 SL-independent features and an additional set of 15 SL-dependent features (for details, see
Materials and Methods). As shown in , we observed a 15% increase in the true positive rate at a false positive rate of 20% using the expanded feature set (MNMC.all), as against the SL-independent set (MNMC.slif), thus demonstrating that the ensemble is indeed able to make effective use of the information provided by the features. A similar advantage of the SL-dependent features is also reflected in the precision-recall results of this experiment ().
To further evaluate the importance of the SL-independent feature set for predicting SL pairs, we constructed a positive test set from the 337 least-connected SL pairs in the network of the 9,994 known SL interactions, as well as a positive training set of 9,129 positive SL examples that did not share any gene with this test set. The connectivity of an SL pair in the SL network was defined as the minimum of the degrees of the two genes comprising the pair. Therefore, for these 337 least-connected pairs, the SL-dependent features based on network overlay are either missing or less effective than those for the well-connected pairs. The 337 SL gene pairs covered 283 unique genes, giving rise to 199 pairs that were included as non-SL interactions in our original data set and used here as a negative test set. The negative training set was comprised of 125,310 non-SL interactions obtained by removing this negative test set from the original 125,509 non-SL interactions in our data set. Predictions were then carried out for these test sets using the ensemble classifier trained using the training sets. The overall performance for the SL-independent and SL-dependent features was evaluated in terms of the respective ROC and precision-recall curves. As shown in , the performance of the predictor based on the SL-independent feature set (MNMC.slif) is consistently better than that of the SL-dependent feature set (MNMC.all). For example, at a false positive rate of 20%, the SL-independent features lead to a true positive rate of 75%, 5% higher than that obtained from the SL-dependent features. Note that the difference in performance of the two sets of features is not as large as expected, since we allow Weka (our implementation platform) to impute missing values, and thus all the SL-dependent features were not absent for the training and test set examples. However, in the strictest case where this imputation is not allowed, the gap is expected to be much larger. This difference in performance is likely due to the overfitting that results with some classifiers using the latter feature set. In conclusion, this experiment showed that SL-independent features were more effective in predicting new SL interactions that were weakly connected in the known SL interaction network. Given that only a small fraction of all the gene pairs have been experimentally tested for SL interactions between them, and given that the majority of the untested pairs are expected to be weakly or not connected to the known SL network, the SL-independent features in conjunction with the multi-classifier approach is expected to lead to more robust and accurate predictions, and can thus largely reduce the burden of experiments.
Another test of the performance of our approach was based on an independent test set constructed from 730 new SL interactions added to the SGD interaction database between May and November 2008. These interactions formed the positive test set for this experiment, while a negative test set of 5163 non-SL interactions between the genes constituting this positive test set was extracted from the non-SL interactions in our original data set. The ensemble classifier was then trained using the 9994 positive and 120346 (

=

125509-5163) negative examples in our original data set. This trained classifier was then used to make predictions for this new test set, and the results were evaluated using ROC curves. Differences in the performance of the different classifiers (shown in ) are similar to those discussed above. Here, we see that at a false positive rate of 10%, SL-independent features produce a 5% higher true positive rate than the SL-dependent features.
The advantage of SL-independent features becomes much clearer in the corresponding precision-recall curves, as shown in and 4(F). Take the result of the unseen data as an example (), where the precision of MNMC.slif is 10% higher than that of MNMC.all at a recall of 20% and the difference in precision becomes even larger (30%) at a recall of 10%. This provides additional evidence that for the currently unscreened gene pairs, SL-independent features can provide more accurate predictions due to their lower dependence on the currently known SL pairs.
Although MNMC.slif outperforms MNMC.all at recall less than 45% for the unseen samples, the precision of MNMC.slif is still not high, as shown in . The low performance results from the fact that many of the most discriminative features based on our data are not available for most of the 730 SL pairs. For example, the Pathway Comembership and Common Functions features are available only for 3% and 30% of the 730 pairs respectively, while the numbers are 5% and 40% for our 9994 SL pairs. Moreover, the most discriminative features based on our data are not most discriminative for the 730 unseen samples. On the contrary, the much less discriminative features in the whole dataset become highly discriminative for the unseen samples. For example, the top three most discriminative features, O(SemanSim CC, Brem Abs TOM), O(SemanSim CC, SemanSim BP) and O(SemanSim MF, SemanSim BP) have much larger D-statistics (0.92, 0.88 and 0.85 respectively) for the unseen data set than those (0.03, 0.09 and 0.07 respectively) for the whole SGD-SL data set. These interesting observations actually point out a future research direction for predicting genetic interactions: we can first partition the samples into distinct groups based on the discriminative utility of the features available and then train individual classifiers for each group.
Comparison with the MNDT and other approaches
To evaluate the effectiveness of our overall prediction approach, i.e., the set of features and the multi-classifier predictive model, we performed a direct comparison of our approach to the current state of the art algorithm MNDT
[8], using the SSL dataset used in the latter study. This dataset was comprised of 3,866 SSL examples and 688,045 non-SSL examples
[8]. This number is slightly different from that of Wong
et al.'s data set due to our use of ORF names instead of SGD IDs and the deletion of duplicates in our version of the data set. shows the four ROC curves that result from a four-fold cross-validation procedure, corresponding to MNMC based on all the features (MNMC.all), MNMC based on the SL-independent features (MNMC.slif), MNDT based on all the features (MNMC.all), and MNDT based on the SL-independent features (MNDT.slif), in addition to a curve corresponding to a random classifier. Note that the feature sets used by MNDT
[8] are different from those used by MNMC. Not surprisingly, both methods show better performance when all the features, including SL-dependent and –independent features, are used. It can be seen that the precision-recall and the ROC curves for MNMC.all are higher than those of MNDT.all for most of the range of the score threshold, and this is also reflected in the higher AUC score for the former (0.897 vs 0.862). Although this difference is agreeably not very high, it indicates the advantage of our under-sampling and our multi-classifier prediction technique. On the other hand, our MNMC.slif (AUC

=

0.805) outperforms MNDT.slif (AUC

=

0.598) substantially, which shows that our approach is able to make much better use of SL-independent features for SL prediction, and the performance of MNDT largely comes from SL-dependent features. For example, at an FPR of 20%, MNMC.slif leads to a TPR of 65%, 28% higher than that produced by MNDT.slif using their SL-independent features, and, at an FPR 30%, the gap between the two TPRs becomes even larger (31%). Similar observations can be made form the precision-recall curves for this experiment (). These results demonstrates the advantage of our approach over existing approaches, which arises from the facts that 1) we employ an extended set of features to characterize gene pairs, and 2) we employ under-sampling and a multi-classifier ensemble to carry out the training and predictions.
We also compared these results with those produced by the PPI-SVM method proposed by Paradugu et al
[11]. When SL-independent features are used, MNMC.slif outperforms PPI-SVM on this dataset. For example, at an FPR of 18%, the highest TPR of PPI-SVM w/o 2Hop is 52.4% while MNMC.slif leads to a TPR of 62.4%.
Discovery of novel SL interactions between transcription factors
Given the accuracy of SL predictions provided by our approach, we applied it to study functional redundancy within the yeast transcriptional regulatory network. Knock-out studies in yeast have revealed a surprising robustness to single deletions, particularly among transcription factors, where the rate of gene essentiality is 8% compared to the genome background rate of 17%
[21]. Expression profiling experiments have revealed that putative targets change relatively little in their expression even upon deletion of the corresponding regulators, which provides further evidence for the robustness of the transcriptional network or at least suggests limitations of our current understanding of the transcriptional network
[22]. We hypothesized that the predicted SL interactions between transcription factors (TFs) might provide insights into the genetic relationships underlying this redundancy. Towards this end, we generated a test set of 6,903 TF pairs from the currently available 118 TFs in yeast (except
MATA1)
[16], of which 8 are known to have SL interactions and 5 are known not to have SL interactions. We used the ensemble classifier trained on our SGD-SL data set to make predictions for the remaining 6,890 TF gene pairs using our SL-independent feature set. Among these 6,890 pairs, we predicted 467 SL interactions based on a classification score threshold of 0.2, which achieved a precision of 49% at a recall of 14% as determined by 10-fold cross-validation on our collected SGD-SL dataset. Since at the threshold of 0.2, the precision of MNMC.slif is 41% at a recall of 14% for the 730 new SL interactions (unseen data set), we estimate that the precision of the predicted TF SL network will lie between 41% and 49% at a recall of about 14%. Fourteen TF pairs were predicted to be synthetic lethal with classification scores above 0.4. The top five TF SL interactions in terms of their classification scores were
GZF3:DAL80,
NRG1:AZF1,
HAP3:HAP5,
SWI5:ACE and
YHP1:YOX1. shows the network of the 467 predicted and 8 known SL interactions between 106 transcription factors. Table S2 in
Text S1 lists the 475 TF SL interactions and Table S3 in
Text S1 shows the degree of each transcription factor in this network. On the average, each of the 106 TFs participates in 9 SL interactions, slightly more than observed in the known SL interaction network (8 interactions per gene). Six TFs (
FKH1,
AZF1,
GZF3,
STP1,
REB1 and
CHA4) are involved in over 25 synthetic lethal interactions each.
Some of the predicted SL interactions among transcription factors are actually well supported by literature. Recent studies have revealed the functions of the YAP family of transcription factors (
YAP1,
YAP2 (CAD1),
YAP3,
YAP4 (CIN5),
YAP6-8) as a response to stress induced by drug treatments, oxidative stress, metal detoxification, and DNA damage, among others
[23],
[24],
[25],
[26]. These studies have also suggested that the YAP TFs have overlapping but distinct functions, although the relationships among them are still not well understood. In particular, there has been no systematic study of genetic interactions among the YAP transcription factors to date. shows a sub-network of the predicted SL interactions involving the YAP transcription factors. This network is comprised of 56 links among 36 TFs, including 7 YAP TFs (
YAP1-7).
YAP5 has the highest number of interactions (14), followed by
YAP6 (12),
YAP2 (11),
YAP1 (9),
YAP3 (7) and
YAP4 (6), while
YAP7 only has a single SL interaction with
REB1. As shown in , there are four synthetic interactions among the YAP TFs, namely
YAP1:YAP2,
YAP1:YAP3,
YAP1:YAP5 and
YAP4:YAP6. Previous clustering analyses of YAP protein sequences
[23] and YAP DNA binding sequences
[26] revealed that the YAP TFs could be grouped into three related subfamilies: 1)
YAP1 and
YAP2, 2)
YAP4 and
YAP6, and 3)
YAP5 and
YAP7. Here, we predict that the genes in two of the three subfamilies have synthetic lethal interactions between them. In particular,
YAP1 was predicted to have SL interactions with
YAP2,
YAP3 and
YAP5.
YAP1 plays a central role in the response to oxidative stress and regulates the response to H
2O
2-induced stress, cadmium, and drug stress, while
YAP2 responds only to cadmium stress
[27]. Thus, the SL interaction between
YAP1 and
YAP2 implies a loss of the ability to respond to cadmium stress when both the TFs are deleted, consistent with the previous finding that the double mutant yap1yap2 is more sensitive to cadmium
[27]. As another example, consider the predicted SL interaction between
YAP4 and
YAP6. Although
YAP4 and
YAP6 both regulate osmotic stress, only the yap4 null mutant shows impaired growth when exposed to hyperosmolarity
[28]. However, the double mutant yap4yap6 strain displays further reduction of glycerol metabolism and accumulation, which is crucial to osmo-tolerance
[27]. All together, these analyses imply condition-specific SL interactions between
YAP1 and
YAP2, and between
YAP4 and
YAP6. Finally, it is known that the YAP proteins, as part of the class of basic leucine zipper (bZIP) proteins, have DNA-binding domains similar to the true yeast AP-1 factor
GCN4 [23],
[27].
GCN4 and
MET28 are also part of a group of 14 known bZIP proteins
[23]. The predicted TF SL interaction network includes SL interactions between
YAP1 and
GCN4, and
YAP5 and
MET28.
We also surveyed predicted SL interactions among HAP TFs, shown as a network in . Interestingly, the four HAPs (2, 3, 4 and 5) form a fully connected clique except for a missing link between
HAP2 and
HAP4, while
HAP1 does not interact with any of the other HAP TFs. Not surprisingly,
HAP2,
HAP3,
HAP4, and
HAP5 share the CCAAT-binding factor (CBF) and form a protein-protein and protein-DNA interaction complex
[29]. The fact that the assembly of Hap2p, Hap3p, and Hap5p requires all three subunits simultaneously suggests condition-specific SL interactions among all the three TFs. Furthermore, the previously identified interaction between Hap4p and the Hap2p/Hap3p/Hap5p-DNA complex
[29] was also supported by our predictions.
In summary, the exploration of a small part of the predicted TF SL interaction network already leads to some interesting findings. Thus, the predicted TF SL interactions are expected to be useful not only for studying the specific functions of TFs but also for understanding general mechanisms underlying robustness in regulatory networks.
Prediction of genetic interactions between all non-essential yeast genes
In order to expand the utility of genetic interactions, we used our approach to make predictions for synthetic lethal interactions between all pairs of non-essential genes in
S. cerevisiae. For this endeavor, we adopted the list of 3885 non-essential genes used in the construction of the SGA arrays
[30] and derived all possible gene pairs from this set. After excluding the 135,503 gene pairs in our SGD-SL data set, we obtained 7,471,681 gene pairs that could potentially encode genetic interactions. Next, MNMC.slif, trained using our SGD-SL data set, was employed to compute the classification scores denoting the likelihood of these 7.5 million unseen gene pairs to encode genetic interactions. Applying a threshold of 0.2 to the scores, as done for the results above, wepredicted 50,210 SL interactions between 3477 genes, thus demonstrating the wide gene coverage of our predictions. The prediction scores and the predicted classes (SL or non-SL) at a threshold of 0.2 for the 7.5 million pairs are available at
http://sage.fhcrc.org/downloads/downloads/predicted_yeast_genetic_interactions.zip. We expect that this valuable resource will be useful for computational and experimental biologists aiming to understand and utilize synthetic lethal interactions in yeast.
Effect of under-sampling
An important characteristic of our approach is the under-sampling of the non-SL class to construct the training set. Although the results presented in this paper were generated using a perfectly balanced training set (1
![[ratio]](/corehtml/pmc/pmcents/x2236.gif)
1 ratio between the number of examples in the positive and negative classes), we did observe a dependence of the results on the sampling ratio. lists the AUC scores of the ensemble classifiers constructed using SL-dependent and SL-independent features of a previously described SSL data set
[8] after varying the ratio between the number of positive and negative examples in the training set. As expected, the performance of both the classifiers deteriorates as the imbalance between the two classes increases, although the performance is quite stable up to a ratio of 10. Thus, even though our results are not very sensitive to reasonable skewing between the sizes of the classes, the determination of the optimal sampling ratio for a given data set may be difficult.
| Table 1Dependence of the AUC scores of the ensemble classifiers trained using SSL-dependent and SSL-independent features on the sampling ratio used to generated the training set for Wong et al's SSL data set [8]. |
Effect of the size of positive training samples
We also tested how the amount of available SL interactions (positive examples) for training affected our combined classifier's performance. Different portions (10%, 20%, 30%,…, 100%) of the 9994 SL examples in our SGD-SL data set were used in a 10-fold cross-validation procedure to test the efficacy of the resultant predictor. shows the AUCs obtained from each of these prediction experiments. It can be seen from this table that the performance of our classifier is quite robust to the amount of training SL examples available, with the AUC varying in a narrow range of ~0.7–0.74.
| Table 2Dependence of the prediction performance on the size of positive training sample set. |
Comparing PPI-only and non-PPI features for predicting genetic interactions
We further investigated the performance of PPI-only and non-PPI features for predicting genetic interactions. In this experiment, we split our 152 SL-independent features into two sets: (1) the set of PPI-only features which are derived only from the PPI network and (2) the set of non-PPI features which don't involve the PPI network. Note that the PPI-related overlay features, which involve the PPI network and other networks, were excluded from both the feature sets. The two feature sets were then used to train and test our multi-classifier on our SGD-SL data set. shows their performance in terms of ROC and precision-recall curves. As expected, the non-PPI feature set substantially outperforms the PPI-only one. At an FPR of 20%, the TPR of the non-PPI features is 16.6% higher than that of the PPI-only features (53.4% versus 36.8%). At a recall of 20%, the precision of the non-PPI features is 18.9% higher than that of the PPI-only features is (36.7% versus 17.8%) and at a recall of 30%, the precision of the non-PPI features is 15.7% higher than that of the PPI-only features is (29.9% versus 14.2%).
This difference in performance results from two factors: (1) the non-PPI features, such as GO annotations, microrarray data etc., form a much richer source of information than just a physical interaction between the proteins corresponding to two genes for measuring the strength of association between them, and (2) the PPI-only features generally have a much smaller coverage. We believe that the performance of the PPI-only features can be potentially improved by including other PPI-based features, such as those from Paladugu
et al.'s study
[11] (some of them are already included in our PPI-only features), but the non-PPI features will still be valuable for predicting genetic interactions, as also shown by others
[8].
In summary, the results presented in this section demonstrated the utility of our MNMC approach for predicting novel SL interactions, particularly using only SL-independent features.