Drug response can be predicted from transcript levels in untreated cells
We sought to predict the response of each segregant to each small-molecule perturbagen (SMP), or drug, from patterns of gene expression measured in a neutral (i.e., SMP-free) medium. We classified each segregant as sensitive, resistant or partially resistant to a given SMP according to its final yield in that SMP; 225 SMP responses were tested (representing 89 SMPs, with responses to some SMPs measured at multiple time points and concentrations). The gene expression levels of the segregants classified as sensitive or resistant were used to train a support vector machine (SVM). SVMs are well suited to classifying high-dimensional data and should therefore allow us to predict both Mendelian and genetically complex SMP responses. For each SMP, we used a feature selection algorithm within each fold of the cross-validation (see Methods) to rank the genes according to their individual contribution to predicting segregant SMP responses. We then trained support vector classifiers using the 1, 10, 50, 100, 200, 500 and 1000 highest-ranked genes. The classifiers were used to predict the sensitive/resistant status of segregants not included in the training set using a cross-validation approach (Methods and Data S1).
We found that the support vector classifier trained on 1000 genes had the greatest average predictive power, correctly predicting the SMP response of 69.7% of the segregants on average for the SMPs considered, although the differences among the classifiers trained on the 50, 100, 200, 500 and 1000 highest-ranked genes are negligible (). The prediction accuracy for individual SMPs varied from near 100% to near chance. We compared this classifier to a naïve mode classifier, evaluated with the same cross-validation as the SVM, which calls all segregants in the test set sensitive or resistant according to whichever category occurs more frequently in the training set (this provides a better comparison of the predictive value of expression information than does 50:50 random classification). Taking the prediction accuracy from the best-performing set of features for each compound, the SVM outperformed the mode classifier on average (74% vs. 64%) and equaled or outperformed it for all but one SMP considered (in that single instance the difference was well within the standard deviation of the SVM performance). Instances where the mode classifier performed well reflect unequal numbers of sensitive and resistant segregants. Performance was robust for classifiers trained on the 50 through 1000 most highly ranked genes. It decreased slightly for the classifier trained on 10 genes and dropped more appreciably for the classifier trained on the single most highly ranked gene, although it remained above chance. This decline is likely due to insufficient or noisy information when too few genes are used.
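The evaluation scheme described above (feature ranking inside each cross-validation fold, plus a mode-classifier baseline) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ranking statistic (absolute difference of class means), the nearest-centroid classifier standing in for the SVM, the fold count, and the toy data are all hypothetical choices made to keep the example dependency-free.

```python
import random
from collections import Counter
from statistics import mean

def k_folds(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled, near-equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rank_features(X, y, train):
    """Rank features by |difference of class means|, using training rows only."""
    scores = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in train if y[i] == 1]
        neg = [X[i][j] for i in train if y[i] == 0]
        gap = abs(mean(pos) - mean(neg)) if pos and neg else 0.0
        scores.append((gap, j))
    return [j for _, j in sorted(scores, reverse=True)]

def centroid_predict(X, y, train, feats, x):
    """Stand-in classifier (NOT an SVM): assign x to the nearest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [X[i] for i in train if y[i] == label]
        if not rows:
            continue
        dist = sum((x[j] - mean([r[j] for r in rows])) ** 2 for j in feats)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cross_validate(X, y, n_features, k=5):
    """Return (model accuracy, mode-classifier accuracy) over k folds."""
    correct_model = correct_mode = 0
    for test in k_folds(len(y), k):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        # Feature selection sees the training fold only, so no test labels leak in.
        feats = rank_features(X, y, train)[:n_features]
        # Mode baseline: call every test segregant by the majority training label.
        mode = Counter(y[i] for i in train).most_common(1)[0][0]
        for i in test:
            correct_model += centroid_predict(X, y, train, feats, X[i]) == y[i]
            correct_mode += mode == y[i]
    return correct_model / len(y), correct_mode / len(y)

# Toy data (hypothetical): 40 segregants, 20 "genes"; only gene 0 is informative.
rng = random.Random(1)
labels = [i % 2 for i in range(40)]  # 0 = sensitive, 1 = resistant
expression = [[2.0 * lab + rng.gauss(0, 0.2)] + [rng.gauss(0, 1) for _ in range(19)]
              for lab in labels]
model_acc, mode_acc = cross_validate(expression, labels, n_features=1)
```

With balanced classes the mode baseline cannot exceed 50%, which is why it is a more informative reference than a coin flip only when the sensitive and resistant groups are unequal in size.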
Summary of marker-based and transcript-based prediction algorithms.
Comparison of transcript- and marker-based prediction of drug response
After demonstrating that SMP response can be predicted from steady-state expression alone, we sought to compare these results to prediction based on genotypes. Linkage analysis depends on an association between a genotyped marker and a phenotype, in our case sensitivity or resistance to an SMP. Any response to an SMP that links significantly to a marker should therefore be well predicted by that same marker. We first used a much simpler algorithm than the one described above, in which the genotype at the single most correlated marker was used to predict sensitivity or resistance. We repeated this process in a leave-one-out fashion for all classified segregants. Because we use the most correlated marker, responses to SMPs exhibiting strong linkage should be easier to predict than responses to SMPs exhibiting weak or no linkage. On average we correctly predicted SMP response with 69% accuracy but, as expected, prediction accuracy was good (75%) when a strong linkage signal was present (LOD ≥ 4) and poor (55%) otherwise. When no strong linkage signal was present, the prediction accuracy was worse than that of the mode classifier, indicating that in these instances the single most correlated marker offered almost no information for classification.
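The single-marker, leave-one-out procedure described in this paragraph might look like the sketch below. It is an illustration under assumptions, not the authors' code: genotypes are encoded as 0/1 for the two parental alleles, correlation is Pearson's r, the prediction rule is the majority phenotype among training segregants that share the held-out segregant's allele, and the toy data are invented.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def loo_single_marker(genotypes, phenotypes):
    """Leave-one-out accuracy using the single most correlated marker."""
    n, n_markers = len(phenotypes), len(genotypes[0])
    correct = 0
    for held in range(n):
        train = [i for i in range(n) if i != held]
        y = [phenotypes[i] for i in train]
        # Choose the marker on the training segregants only.
        best = max(range(n_markers),
                   key=lambda m: abs(pearson([genotypes[i][m] for i in train], y)))
        # Predict the majority phenotype among training segregants with the same allele.
        allele = genotypes[held][best]
        same = [phenotypes[i] for i in train if genotypes[i][best] == allele]
        pred = Counter(same or y).most_common(1)[0][0]
        correct += pred == phenotypes[held]
    return correct / n

# Toy data (hypothetical): marker 1 tracks the phenotype perfectly (a Mendelian trait).
phenotypes = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # 0 = sensitive, 1 = resistant
genotypes = [[m0, m1, m2] for m0, m1, m2 in zip(
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    phenotypes,
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])]
mendelian_accuracy = loo_single_marker(genotypes, phenotypes)
```

When no marker is strongly linked, the marker chosen in each fold is essentially arbitrary, which is consistent with the near-chance accuracy reported for weakly linked responses.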
We further sought to examine our ability to predict more complex SMP responses (those without strong linkage results). We trained support vector classifiers using the 1, 10, 50, 100, 200, 500 and 1000 highest-ranked markers. We found the support vector classifier trained on the 500 highest-ranked markers to have the greatest predictive power overall, correctly predicting the SMP response of 71.7% of the segregants on average for the SMPs considered (). Performance was robust in the range of 50–500 highest-ranked markers but worsened at 1000. This deterioration when more than 500 markers are considered likely reflects classifier overfitting. After removing two outliers (alpha factor and niguldipine, which each show linkage with LOD > 40 and are nearly perfectly predicted using both transcripts and markers), we correlated prediction accuracy with linkage strength (i.e., LOD score) when the prediction algorithm uses either the 200 highest-ranked features () or the single highest-ranked feature (Figure S1). The correlation between marker-based prediction accuracy and linkage is modestly positive (r = 0.13) when using 200 features and, as expected, significantly greater when predicting on the single best feature alone (r
The relationship between linkage and prediction accuracy.
A direct comparison of transcript-based prediction and marker-based prediction is presented in . For classifiers trained on feature sets ranging from 1 to 1000 features, we plotted the maximum prediction accuracies of transcript- and marker-based prediction for each SMP; the number of features required for maximum accuracy depends on the SMP. The plot reveals that both prediction algorithms perform equally well for a large number of SMPs, but that their accuracies are only weakly correlated with each other (r = 0.37). This weak correlation suggests that transcript levels and genotyped markers embody partly non-overlapping biological information. However, some SMP responses (the on-diagonal points in ) are equally well predicted by transcripts and by markers, suggesting that in these cases both provide equivalent amounts of information. It is possible that the information provided by the two sets of predictors is redundant, because expression differences among the segregants arise from the underlying genetic variation. For example, consider the response to the SMP alpha factor. Alpha factor is a 13-amino-acid pheromone secreted by yeast cells of the alpha mating type in the presence of yeast cells of the opposite a mating type; a cells arrest in the presence of alpha factor because they express the sensitizing alpha-factor receptor STE2. Genotype at the mating type locus and expression of STE2 are completely redundant in this case, where sensitivity is determined by the presence or absence of the drug target. A clinical analogy is the efficacy of EGF-receptor antagonists (e.g., gefitinib) in the 10% of patients with lung cancers that express sensitizing alleles (somatic deletions and point mutations) of the EGF receptor. Genotyping the EGF receptor stratifies patients into drug-responsive and drug-unresponsive cohorts, and EGF receptor expression levels correlate with drug sensitivity.
Head-to-head comparison of marker-based prediction and transcript-based prediction.
Twenty-two SMPs are better predicted (>15% improvement) by markers than by transcripts, while no SMPs are better predicted by transcripts than by markers by the same margin. In fact, only 6 SMPs are better predicted by transcripts than by markers by at least 10%, and of these none by more than 12.2% (; Figure S1). One SMP for which genotype is much more predictive is tetrachloroisophthalonitrile, an uncoupler of oxidative phosphorylation. The maximum predictive power of expression for tetrachloroisophthalonitrile is 65% considering the 500 most predictive transcripts, while the maximum predictive power of genotype is 90% considering only the single most strongly linked genetic marker. In previous work we showed that a non-synonymous mutation in the gene PHO84, which encodes a high-affinity inorganic phosphate transporter, alters sensitivity to tetrachloroisophthalonitrile. We also showed that the quantitative trait locus (QTL) on chromosome 13 that contains PHO84 is a linkage hot spot that affects response to 25 SMPs. The above-mentioned SNP in PHO84 also alters, to a lesser extent, sensitivity to the chemically similar SMP pentachlorophenol, which suggests that greater genetic complexity underlies the physiological response of cells to pentachlorophenol. The maximum predictive power of expression for pentachlorophenol is 69% considering the 500 most predictive transcripts, while the maximum predictive power of genotype is 90% considering only the single most strongly linked genetic marker. Another example of more accurate genotype-based prediction is the response to copper sulfate (CuSO4). The maximum predictive power of expression for copper sulfate is 58% considering the 10 most predictive transcripts, while the maximum predictive power of genotype is 93% considering only the single most strongly linked genetic marker. Linkage analysis has shown that a marker near CUP1, which encodes a copper-binding protein that mediates resistance to copper stress, segregates with copper sulfate resistance; CUP1 is also subject to copy-number variation between strains. Transcripts may fail to predict SMP response when genetic variation does not perturb expression levels under neutral (drug-free) conditions, especially in the case of stress-responsive genes, and therefore does not produce a steady-state expression signature that would enable transcript-based prediction.
Next we considered cases where transcript-based prediction outperforms marker-based prediction. Expression outperforms genotype for 80 SMP response predictions, which lie above the diagonal in . This improvement may be due to chance or to expression signatures caused by genetic factors. However, as mentioned above, there are no SMP responses for which expression-based prediction is 15% more accurate than genotype-based prediction (). This result is consistent with expectation, given that expression differences are ultimately explained by genetic differences in these yeast strains. There are nevertheless 6 SMP responses for which expression-based prediction outperforms marker-based prediction by 10% or more. In these cases, expression may be a better predictor of SMP response than genotype because several unlinked polymorphisms, possibly in transcriptional regulatory genes, could affect steady-state expression of multiple genes in the pathway modulated by the SMP. Additionally, in cases where transcript-based prediction outperforms marker-based prediction by at least 10%, the average LOD score is 5, while in the reverse case it is 8.5. This is consistent with the idea that transcripts may be valuable when sensitivity or resistance has a complex genetic basis with many minor-effect variants rather than one major-effect variant.
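The head-to-head tallies quoted above (how many SMP responses favor one predictor, and by what margin) reduce to a simple count over per-SMP best accuracies. The sketch below shows one way to compute them. The function name and dictionary layout are assumptions; the first three accuracy pairs are the values quoted in the text, and the two entries named `hypothetical_smp_*` are invented for illustration.

```python
def compare_predictors(best_transcript, best_marker, margin):
    """Tally SMP responses by which predictor wins, in percentage points."""
    counts = {"transcript_better": 0, "transcript_by_margin": 0, "marker_by_margin": 0}
    for smp, t_acc in best_transcript.items():
        diff = t_acc - best_marker[smp]  # > 0 means the point lies above the diagonal
        counts["transcript_better"] += diff > 0
        counts["transcript_by_margin"] += diff >= margin
        counts["marker_by_margin"] += -diff >= margin
    return counts

# Best accuracy (%) over all feature-set sizes, per SMP response.
best_transcript = {
    "tetrachloroisophthalonitrile": 65,  # from the text
    "pentachlorophenol": 69,             # from the text
    "copper sulfate": 58,                # from the text
    "hypothetical_smp_a": 80,            # invented value
    "hypothetical_smp_b": 70,            # invented value
}
best_marker = {
    "tetrachloroisophthalonitrile": 90,  # from the text
    "pentachlorophenol": 90,             # from the text
    "copper sulfate": 93,                # from the text
    "hypothetical_smp_a": 72,            # invented value
    "hypothetical_smp_b": 70,            # invented value
}
tally = compare_predictors(best_transcript, best_marker, margin=15)
```

Accuracies are kept as integer percentages here so that the margin comparison is exact; with fractional accuracies a small tolerance may be preferable.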
Using both transcript and marker data improves prediction ability for SMPs
We next asked whether combining transcripts and markers into a single prediction algorithm would improve our ability to predict SMP response. First, we looked at the best prediction accuracy across all feature sets of both marker- and transcript-based prediction. In 80 out of 226 SMP responses tested, the best transcript-based prediction outperformed the best marker-based prediction, with an average improvement in accuracy of 4.8%. Interestingly, there are no distinguishing mechanistic characteristics of this group of 80 SMP responses (which, in some cases, includes the same compound tested at multiple concentrations or at multiple time points); in other words, the compounds are structurally diverse and target a wide array of cellular processes. This suggests that transcript information can provide predictive information beyond that of genotype data alone.
As a second test, we created a combined set of features that included all transcripts and markers, totaling over 9,000 features. We repeated the above-described process of selecting the best 1, 10, 50, 100, 200, 500 and 1000 features, and then used them to train a support vector classifier. The set of 500 features performed best on average with an accuracy of 72%, essentially the same as the marker-based prediction using the same number of features (71.6%). Interestingly, genotyped markers comprised over 95% of all selected features, with many (60) SMP response predictions based solely on marker features. These results are consistent with the observation that genotyped markers provide most of the information used in SMP response prediction. However, when transcripts are selected, they often encode gene products involved in biological processes affected by the SMP. For example, carbonylcyanide p-trifluoromethoxyphenylhydrazone (FCCP) is a proton ionophore that dissipates the mitochondrial membrane potential. At least three QTL determine drug response to FCCP in this cross. Expression of two genes, one encoding a component of the vacuolar ATPase (YDL185W) and the other a component of the F1-F0 ATP synthase (YMR064W), improves FCCP response prediction. The differing information provided by the best combined set of features versus the best set of markers suggests that valuable insight may be gained from using both steady-state transcript levels and genotyped markers. However, combining transcripts and markers into a single set of features rarely performs better than taking the better of the marker-only and transcript-only prediction accuracies, possibly because of the added noise of too large a feature set.
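The composition of a combined feature set, as discussed above, can be inspected by pooling ranked transcripts and markers, keeping the top k, and measuring what fraction of the selection are markers. The following is a small sketch under assumptions: the ranking scores and marker names are invented, only the two gene names come from the text, and the helper names are hypothetical.

```python
def top_features(scores, k):
    """Return the k feature names with the highest ranking score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def marker_fraction(selected, marker_names):
    """Fraction of the selected features that are genotyped markers."""
    return sum(name in marker_names for name in selected) / len(selected)

# Pooled ranking scores for markers and transcripts (all scores are invented).
scores = {
    "marker_A": 0.9,
    "marker_B": 0.8,
    "YDL185W": 0.7,   # transcript named in the text; score is hypothetical
    "marker_C": 0.6,
    "YMR064W": 0.5,   # transcript named in the text; score is hypothetical
    "marker_D": 0.1,
}
marker_names = {"marker_A", "marker_B", "marker_C", "marker_D"}

selected = top_features(scores, k=4)
fraction = marker_fraction(selected, marker_names)
```

Pooling only makes sense if the ranking statistic is comparable across feature types; a statistic that depends on the scale of expression values would need to be standardized before markers and transcripts are ranked together.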