|Home | About | Journals | Submit | Contact Us | Français|
Transcription factor (TF) perturbation experiments give valuable insights into gene regulation. Genome-scale evidence from microarray measurements may be used to identify regulatory interactions between TFs and targets. Recently, Hu and colleagues published a comprehensive study covering 269 TF knockout mutants for the yeast Saccharomyces cerevisiae. However, the information that can be extracted from this valuable dataset is limited by the method employed to process the microarray data. Here, we present a reanalysis of the original data using improved statistical techniques freely available from the BioConductor project. We identify over 100 000 differentially expressed genes—nine times the total reported by Hu et al. We validate the biological significance of these genes by assessing their functions, the occurrence of upstream TF-binding sites, and the prevalence of protein–protein interactions. The reanalysed dataset outperforms the original across all measures, indicating that we have uncovered a vastly expanded list of relevant targets. In summary, this work presents a high-quality reanalysis that maximizes the information contained in the Hu et al. compendium. The dataset is available from ArrayExpress (accession: E-MTAB-109) and it will be invaluable to any scientist interested in the yeast transcriptional regulatory system.
High-throughput assays such as microarrays allow users to ask detailed questions about biological relationships between genes. A major use of microarrays has been to measure expression changes in response to perturbations such as deletion and over-expression of genes of interest (1–3). As changes are likely to be triggered by the perturbed gene, such experiments may help reveal its cellular function (4); for example knockouts and partial deletions have been used to identify genes that are essential to survival (5).
Over the past few years, considerable effort has been invested into deciphering the transcriptional regulatory network of the yeast Saccharomyces cerevisiae, and several large-scale perturbation datasets of transcriptional regulators are now available. An early review by Svetlov et al. (6) compiled over 900 individual biochemical and genetic interactions between 83 transcription factors (TFs) and 494 genes. Using microarrays, Hughes et al. (7) published a compendium of 300 experiments including 35 TF knockouts. A newer dataset by Chua et al. (8) covered 55 TF mutants.
Most recently Hu et al. (9) presented a compendium of 269 TF knockout microarrays. Covering almost all yeast regulators, this is currently the most comprehensive perturbation dataset of TFs for any organism, and it is therefore of great interest in the genome-scale investigation of eukaryotic gene regulation. Having performed these experiments, there is a major challenge to process large and frequently noisy datasets in order to identify differentially expressed genes with confidence. Unfortunately, the authors of the Hu et al. study used relatively dated and insensitive approaches for microarray data-processing: as a result the published P-values and target-gene ranking are likely to be unreliable. Specific examples include the lack of background and print-tip correction during normalization (10–14), and use of an error model that does not account for systematic experimental biases (7,15–18). Moreover, P-values were not corrected for multiple-testing, which is crucial for minimizing false positives (15,16,19–21). In short, the lack of robust data-processing procedures have limited the amount of information that could otherwise be extracted from this substantial body of valuable experimental work, and it has greatly restricted the use of these data in follow-up investigations.
A strong consensus for the best methods for microarray data processing has emerged over the past 5 years. In addition, sensitive analysis techniques have been developed that deliver improved data correction and statistical tests for differential expression (12,18,22). Here, we present a reanalysis of the original raw data published by Hu et al. using updated statistical methods that are freely available through the BioConductor software suite (23). We identified 110 487 differentially expressed genes—nearly nine times the total reported by Hu et al. The reanalysis recovers 90% of the original dataset, suggesting that we have identified a vastly expanded list of target genes. To validate the biological significance of the dataset, we assessed the enrichment of Gene Ontology (GO) (24), KEGG functional annotations (25) and Reactome pathways (26) for target genes, the occurrence of upstream TF-binding sites, and the prevalence of protein–protein interactions among TFs and target genes. In summary, this work presents a high-quality reanalysis that maximizes the information contained in the Hu et al. compendium, and the dataset will be invaluable to any scientist interested in the yeast transcriptional regulatory system. Further, the reanalysis constitutes a prime example of the effect of using up-to-date analysis techniques in maximizing the information obtained from high-throughput generated data.
Raw microarray data were downloaded from the Longhorn Microarray Database (27). Microarrays were normalized using the VSN package, including print-tip and background correction (12). Array probes that were not annotated as Open Reading Frames (ORFs) in the original dataset were discarded, and duplicate and triplicate array probes were averaged. Differential expression was calculated using a moderated eBayes t-test as implemented in the Limma Bioconductor package (18). The resulting P-values were FDR-adjusted across the whole microarray dataset to correct for multiple testing (28). An adjusted P-value cut-off of 0.05 was used to detect significant differential gene expression. See Supplementary Data for further details.
Functional enrichment was performed using g:Profiler (29) based on Ensembl annotations [(30); release 49]. Gene annotations from low-confidence electronic evidence were removed. For each deletion assay, we computed a log-score by aggregating the log of the P-values for each category of GO, Reactome and KEGG with a significant enrichment (Supplementary Data). A global score for our reanalysed dataset and the original analysis was computed as the sum of log-scores for each individual TF.
High-confidence DNA–protein interactions derived from ChIP-chip experiments were obtained from (31). Data were filtered further to include only direct interactions as defined in ref. (32). Only matches classified as bound, only in YPD and with a P < 0.001 were considered.
Predicted TF binding sites were obtained from refs. (33,34). Erb and van Nimwegen derived a set of ‘trusted’ position weight matrices (PWMs) for 72 regulatory factors by running the PROCSE and PhyloGibbs algorithms on a set of experimentally derived TF binding sites from SCPD (35) and (31). These PWMs were then used to scan multiple alignments of each intergenic region in S. cerevisiae with the orthologous regions of another four Saccharomyces species. Predicted binding sites with a posterior probability > 0.5 were used in our analysis. MacIsaac et al. (34) applied a combination of the conservation-based PhyloCon and Converge algorithms to ChIP-chip data (31), to predict binding sites for 172 TFs (34). Only predictions conserved in more than three species with a P < 0.001 were considered for our analysis.
Binding sites from these data sources were then mapped into gene promoters. If the centre of a binding site was located between −1000 and +100 bp from the transcription start site of a given gene, it was said to be located in the gene’s promoter region. Similar results were obtained for shorter upstream promoter regions (−600 to +100 bp; data not shown). Our dataset of mapped binding sites covered 142 knockout TFs. Enrichments of binding sites in gene promoters for both direct and indirect interactions were calculated using a cumulative hypergeometric test (Supplementary Data).
Protein–protein interactions were obtained from refs (36,37). The enrichment of differentially expressed genes for interacting TFs was assessed using a cumulative hypergeometric test. To test enrichments of TFs targeting protein complexes, we constructed protein–protein interaction modules for each TF target. We then compared the number of protein–protein interactions among TF targets, and between TF targets and non-targets using a cumulative hypergeometric test. In both cases, P-values were corrected for multiple testing using FDR (Supplementary Data).
The compendium consisted of 588 two-colour cDNA microarray hybridizations for 269 mutants (9). The original analysis used a modification of the error model developed by Hughes et al. (7). Briefly, systematic errors in gene expression measurements were estimated from 10 control experiments in which the co-hybridized samples were taken from identical RNA preparations. The resulting log-ratio values provided a model for the error distribution. For each mutant experiment, genes displaying changes in expression values beyond the error distribution were identified. The level of expression change was quantified using a statistic X, calculated from the mean log ratios of replicate samples using the minimum-variance weighted average method (i.e., genes with larger variance than observed in the control data were assigned proportionately larger standard errors). The significance of the ratios was computed from the X-scores, and a threshold of P < 0.001 was applied to define the set of differentially expressed genes. This resulted in a dataset of 12 284 differentially expressed genes across 266 mutant strains (the original analysis detected no differentially expressed target genes for ARG82, YDR026C and YJL206C).
To reanalyse the dataset, we downloaded the unprocessed numerical text files from the Longhorn Database (27). For each array, we applied background correction and print-tip normalisation using the VSN package available from the BioConductor software project (12,23). We then extracted expression values for 6253 protein-coding genes presented on the arrays. Most genes were represented by single probes; for the 360 genes represented by multiple probes we averaged expression measurements across all replicates. Probes corresponding to non-protein-coding regions of the genome were excluded from further analysis.
We identified differentially expressed genes using the Limma eBayes package (18) distributed through the BioConductor project. The experimental design generally consisted of duplicate hybridizations for mutant and wild-type strains against a reference RNA sample and growth level control arrays (Supplementary Data). Therefore to obtain differential expression measurements between the mutant and wild-type, we integrated data from the two sets of comparisons (i.e. mutant versus reference RNA; and reference RNA versus wild-type). To correct for biases such as batch effects, we incorporated the control arrays into the error model. Variations to this experimental design were handled appropriately on a case-by-case basis (Supplementary Data). Finally, we applied a compendium-wide FDR correction to adjust for multiple testing (28), and used a P-value threshold of 0.05 to define differentially expressed genes.
Our reanalysis returned a list of 110 487 differentially expressed genes across 269 mutants—almost nine times as many targets—and we recovered 90% of genes presented by the original analysis (Figure 1A; Supplementary Table S1). The difference in number of target genes between our analysis and the original one is due to the increased sensitivity of our approach and not to the usage of a different threshold, as most additional targets in our reanalysis are not obtained by simply adjusting the P-value cut-off of the original analysis (Supplementary Figure S1). Moreover, there is a strong correlation (r = 0.83; Pearson correlation) between the two datasets for numbers of genes affected by each TF knockout (Figure 1A). As previously observed for regulatory interactions (38), there is a non-uniform distribution of target gene numbers: there is a large number of TFs with relatively few targets, and some very influential TFs that affect more than a third of the yeast genome (Figure 1B; Supplementary Figure S2). Among the top 20 mutants with the greatest impact are chromatin-based regulators including the SWI/SNF remodelling complex (SNF2, 5, 6 and SWI3), individual histone modifiers (SPT10, SIN3, SIR2) as well as global DNA-binding TFs (HAP2, RAP1 and MCM1) (Supplementary Figure S2). Interestingly, there is little correlation between wild-type expression level of a TF (calculated as wild-type A-value) and the number of genes it affects (r = 0.13; P = 0.027; Pearson correlation), indicating that factors with low expression can play a significant gene regulatory role. This is additionally supported by the observation that TFs with no significant depletion in the corresponding knockout mutant often affect a large number of target genes (Figure 1B; Supplementary Figure S2). The 10% of target genes reported in the initial study but not detected by our approach are likely to include both genes with marginal P-values in our study, as well as artefacts from the initial study resulting from the different normalization procedures.
Together, these observations suggest that our reanalysis identifies a vastly expanded repertoire of potential regulatory targets compared with the original dataset. In order to assess the biological significance of our results, we examined the gene lists by integrating several different sources of evidence.
First we checked the expression levels of the TFs themselves. Intuitively we expect the TF under consideration to have lower expression in the mutant strain compared with the wild type strain. Our analysis confirms this for 155 TFs compared with just 88 for the Hu et al. analysis (Figure 2).
Of the remaining 114 TFs, 78 display a negative fold change in the mutant strain albeit at statistically non-significant levels. They tend to be expressed at much lower levels than other TFs (P < 10−18; Wilcoxon test), indicating that low-level expression changes are harder to detect with current microarray technology. Among these regulators are several that affect many genes, emphasising that even small adjustments to TF expression levels can have a dramatic effect on target genes.
For the 36 TFs for which we do not observe a negative fold change, we suggest that changes in their expression levels are too subtle to detect given the experimental noise. Surprisingly, both analyses show that the cell cycle regulator MCM1 has significantly elevated expression in the mutant strain. As deletion of this gene is lethal, it was placed under the control of an inducible promoter; therefore we suspect that there may have been a fault in the construct rather than in the microarray experiment.
Next we examined the functional annotations of the differentially expressed genes. As most TFs are considered to regulate distinct cellular processes, their target genes should be associated with a coherent set of molecular and biological functions.
For each TF knockout, we used the g:Profiler web-tool (29) to identify the GO, KEGG and Reactome pathway annotations that are over-represented in the target gene list (Supplementary Table S2). Figure 3 illustrates the top 50 enriched GO functional categories among target genes. We calculated a score measuring the enrichment of functional annotations by summing the absolute logarithms of all P-values below 0.05 (Supplementary Data). Across all TF knockouts, our reanalysis has a higher score than the original analysis (log-score = 36 230 compared with log-score = 18 519). Comparing individual TFs, the gene lists from our reanalysis score equal or higher in 214 out of 269 cases. Note that the greater functional enrichment in our dataset is not due to increased numbers of target genes, as we apply a hypergeometric test to compare the functional annotation in the test set against a randomly picked sample of similar size. Thus the results indicate that our additional targets are biologically meaningful and not noise.
Our dataset recovers 95% of enriched functional categories for the Hu et al. analysis, and in fact improves the significance of the enrichment (in terms of number of genes with a particular annotation) in 85% of these cases. Moreover, functional categories that we observe are generally in good agreement with previous knowledge. For example, ISW1 is a component of several chromatin remodelling complexes; its deletion causes up-regulation of 188 genes involved in eukaryotic translation initiation (P < 10−6) and components of the ribosome (P < 10−11) (39,40). Additionally, we identify 247 down-regulated genes related to several metabolic processes including glycolysis (P < 10−4), alcohol biosynthesis (P < 10−3) and fungal-type cell wall (P < 10−6), functions that were previously not obviously associated with this TF. In contrast, the original analysis reported just 13 differentially expressed genes for ISW1, which do not show any functional enrichment.
In another example, FKH2 is best known for its role as a cell cycle TF, which regulates genes during the G2/M phase; it is also involved in transcriptional elongation and chromatin silencing (41–43). The knockout causes down-regulation of 343 genes with strong representation in functions relating to ribosomal biogenesis and assembly (P < 10−32), and up-regulation of 64 related genes that encode membrane-associated proteins (P < 10−5). The previous analysis describes just two differentially expressed genes. Here, it is possible that the mutant does not display as great an effect as expected owing to the back-up provided by the homologue FKH1; thus these results highlight the importance of detecting small expression changes using sensitive methods.
In the original analysis, Hu et al. reported a surprisingly low overlap between their list of differentially expressed genes and publicly available ChIP-chip datasets. They suggested several explanations, ranging from data quality issues to the fact that perturbation experiments can reveal secondary regulatory interactions that are absent from ChIP-chip experiments.
To re-examine the overlap between the knockout and TF-binding datasets, we considered information from large-scale ChIP-chip (31) and motif-finding studies (33,34) (see ‘Methods’ section). By incorporating the results published in a recent study by Zhu et al., we filtered the ChIP-chip data to focus on high-confidence direct DNA–protein interactions (31,32). We mapped all observed and potential TF-binding sites reported in the motif-finding studies to a recent version of the yeast genome (Saccharomyces Genome Database, 1 December 2008) and considered only those that are located within promoter regions (defined as: from −1000 to +100 bp of the transcription start site). In total, we collected binding sites for 142 TFs included in our analysis, comprising 5188 ChIP-chip interactions and 17 091 motif predictions.
We calculated the intersection between our list of differentially expressed genes from the TF deletion mutants and targets identified by ChIP-chip or binding-site predictions. Our analysis identified 645 of the ChIP-chip targets (compared with 168 targets identified by the original analysis). The overlap between the ChIP-chip and the deletion datasets is statistically significant in both cases, while our analysis shows a notable improvement (P < 10−71 versus P < 10−56). Combined analysis with motif predictions shows even stronger agreement; we were able to recapitulate gene expression changes for 2230 binding events or motifs (compared with 585 in the original study) and, for 37 TFs, the target genes are enriched in binding sites for the deleted TF itself (P < 10−193) (Figure 4A and C; Supplementary Table S3). A similar level of enrichment was detected for the original analysis for 34 TFs (P < 10−184).
We extended this analysis by considering regulatory cascades (44), in which we allowed differentially expressed genes to be secondary targets of the TF under consideration. We detected significant enrichments of binding sites for 129 TFs, compared with 42 TFs for the original dataset (Figure 4B and D, Supplementary Table S3). Altogether, we were able to find gene expression changes associated with 34 232 binding events (compared with 1353 for the original analysis—a 25-fold enrichment). Therefore, our analysis demonstrates that a large proportion (~98%) of the differential expression is likely to be due to secondary-regulatory interactions.
The inclusion of protein–protein interaction information provides an additional perspective to the assessment of our dataset. Eukaryotic TFs generally function in a combinatorial manner, and we can identify potential regulatory units by searching for groups of TFs that interact physically with each other (45). For this, we used two datasets: a compilation of protein–protein interactions among stable protein complexes by Collins et al. (36) determined using affinity purification/mass spectrometry, and a yeast two-hybrid screen by Yu et al. (37), which captures both stable and transient interactions.
Intuitively, we expect TFs that function together to show significant overlap in their target genes. Of the 115 pairs of physically interacting TFs in the dataset, 92 display such an overlap (compared with 49 pairs in the original analysis) (Supplementary Table S4). These include well-known regulatory combinations; for instance HIR2 and HIR3 are subunits of the histone regulatory nucleosome assembly complex that acts as a transcriptional repressor (46,47). The individual knockouts cause up-regulation of genes involved in nuclear assembly and nucleosome functions: of the 550 differentially expressed genes in the HIR2 and HIR3 knockout mutants (240 HIR2; 371 HIR3), 61 are targeted by both HIR2 and HIR3 (P < 10−22). The Hu et al. dataset, however, shows no overlap among 15 target genes.
Similarly, the RTG regulators RTG1 and RTG3 form a complex to activate the retrograde pathway in response to mitochondrial dysfunctions and nutrient starvation (48,49). Again, there is a significant, although small, overlap in the target genes of the two TFs (26 out of 610 genes—144 RTG1, 466 RTG3, P < 10−4), whereas there are no overlapping genes among 45 targets in the Hu et al. dataset. This demonstrates the potential of our reanalysis in obtaining meaningful biological information. Other examples can be found in the Supplementary Data.
Previous studies have reported that TFs tend to co-regulate genes that interact with each other (50). Therefore we used the interaction information from above to test whether we detect similar behaviour in our reanalysed data. Out of 110 487 differentially expressed genes, there are 3846 pair-wise interactions between co-regulated genes, covering 2262 genes in total (36,37). Most TFs (225) target at least one pair of interacting genes, compared with just 39 TFs in the previous analysis.
To check the statistical significance of our observations, we used a simple module construction approach available through the GraphWeb tool (51). For each TF mutant, we defined a set of ‘core’ connections comprising interactions among the differentially expressed genes and a set of ‘neighbourhood’ connections that also include interactions between the core and non-differentially expressed genes. For each mutant, we then compared the number of interactions among core and neighbourhood genes, measuring the proportion of interacting genes that are targeted by the same TF.
We find that targets of 154 TFs are enriched for membership to an interaction module (compared with 38 TFs for the original analysis) (P-value threshold <0.05; Supplementary Table S5). An interesting example consists of 16 TFs—including histone modifiers (SIN3, SPT10, HFI1, CDC73, SDS3, SAS4, SAS5), general TFs (TAF14) and growth- and metabolism-specific TFs (GLN3, UME6, BAS1, SUM1)—that affect modules related to vitamin B6 metabolism, which is essential to successful glycolysis. Deletion of these regulators cause defective growth phenotypes such as reduced fitness in rich medium (all except SAS4, SAS5, SDS3) and altered glycogen accumulation (SIN3, HFI1, TAF14, GLN3, SUM1) (1,52,53).
The compendium of TF-knockout expression data by Hu et al. (9) is an invaluable resource that makes a substantial contribution to our understanding of the transcriptional regulatory system in the yeast S. cerevisiae. We performed a comprehensive reanalysis of the raw data to maximize the information that could be gained from these experiments.
We employed standard pre-processing methods that are freely available through the BioConductor software project. These methods improved on the original publication by greatly reducing the noise inherent in microarray experiments, and by applying strict filters such as multiple-testing correction to minimize false positives.
The resulting dataset contained nearly nine times more differentially expressed genes, including 90% of the original gene list. The numerous functional assessments provide strong indications that the additional targets are biologically meaningful. In fact, the reanalysed dataset achieved better results in most of the tests compared with the original data, suggesting that the current method provides greater sensitivity without compromising on the quality of the output. We demonstrated the importance of detecting small but significant changes in gene expression. For some weakly expressed TFs whose deletion nevertheless has a high regulatory impact, their expression changes are too low to detect even when sensitive methods of analysis are applied. Moreover, there is little correlation between wild-type TF expression and the number of genes affected in its knockout.
Adr1 illustrates a perfect example of the power of the reanalysed dataset. During growth on non-fermentable carbon sources, the Adr1 TF activates glucose-repressed genes involved in non-fermentative carbon metabolism, peroxisome biogenesis and beta-oxidation (54–57). In fermentative growth conditions in rich media, Adr1 is inhibited by the phosphatase complex Glc7-Reg1 via an unknown mechanism (58–60) and its targets are not induced. Given the knockout experiments were conducted in rich media and a role for Adr1 in these conditions has not been described previously, it is a surprise to find that the deletion of ADR1 causes up-regulation of genes enriched in respiratory and mitochondrial functions [respiratory electron transport chain (P < 10−8), oxidative phosphorylation (P < 10−4), mitochondrial respiratory chain (P < 10−8)]. These observations hint at a role for Adr1 in glucose conditions either as a direct repressor, or as an activator of a repressor of these genes. A possible role as both an activator and repressor, depending on carbon source availability, is reminiscent of the dual regulators Sko1 and Hap1 which ‘switch’ regulatory roles in response to osmotic stress and haem concentration, respectively (61,62).
Hal9 provides a second example for which an experimentally testable hypothesis may be gleaned from the reanalysed dataset. The physiological role of this TF is unknown. One hundred and senventy-two genes are differentially expressed upon deletion of HAL9, and down-regulated genes are enriched in several biosynthetic and transport functions, the most significant of which is the GO term branched chain family amino acid biosynthetic process (P < 2 × 10−3). A possible functional link between HAL9 and amino acid sensing and metabolism was made in a previous genetic study that identified HAL9 as a negative regulator of PTR2 expression (63). PTR2 encodes a transporter of di/tripeptides which is transcriptionally induced by extracellular leucine (64) through the SPS plasma membrane amino acid sensor system [reviewed in ref. (65)].
The current study clearly demonstrates the importance of utilizing advanced and sensitive statistical methods in order to benefit fully from microarray experiments, which are often expensive and time-consuming. Moreover, the additional data afforded by our reanalysis represents a useful resource for future studies of gene regulation. For instance, a comparison of the target gene list with additional microarray data may facilitate the identification of new regulators for poorly characterized cellular processes.
The data presented here are publicly available from the ArrayExpress database (E-MTAB-109).
Supplementary Data are available at NAR Online.
EMBL and ERDF through EXCS and COBRED LSHB-CT-2007-037730.
Conflict of interest statement. None declared.
J.R. acknowledges funding from the Marie Curie Biostar programme, the Tiger University programme of the Estonian Information Technology Foundation, the Artur Lind and Ustus Agur Foundations.