|Home | About | Journals | Submit | Contact Us | Français|
We developed m:Explorer for identifying process-specific transcription factors (TFs) from multiple genome-wide sources, including transcriptome, DNA-binding and chromatin data. m:Explorer robustly outperforms similar techniques in finding cell cycle TFs in Saccharomyces cerevisiae. We predicted and experimentally tested regulators of quiescence (G0), a model of ageing, over a six-week time-course. We validated nine of top-12 predictions as novel G0 TFs, including Δmga2, Δcst6, Δbas1 with higher viability and G0-essential TFs Tup1, Swi3. Pathway analysis associates longevity to reduced growth, reprogrammed metabolism and cell wall remodeling. m:Explorer (http://biit.cs.ut.ee/mexplorer/) is instrumental in interrogating eukaryotic regulatory systems using heterogeneous data.
Eukaryotic transcriptional regulation is a core cellular process that governs the expression of genes. Understanding gene expression is crucial in explaining complex biological processes including development, disease and cancer. Transcription factors (TF) are key proteins that activate or repress transcription by binding sequence-specifically to DNA in promoter regions of target genes. Mapping such regulatory networks and TF functions is therefore an important goal of current biomedical research. In complex vertebrate organisms like human, this task is hindered by enormous genomic space, numerous cell types, and distinct experimental procedures with data that is often unsuitable for direct comparison. The relatively simple unicellular model organism budding yeast (S. cerevisiae) serves as a platform for regulatory genomics. Multiple types of global-scale data of yeast gene regulation are available to date, including microarrays with TF deletion (ΔTF) strains [1,2], predictions of TF binding sites (TFBS) [3-5], and measurements of chromatin state such as nucleosome positioning . These data appear to be complete, however the agreement between transcript expression and TF binding events remains modest [2,7]. While part of this controversy can be attributed to experimental and statistical noise, we may still lack significant details regarding the biological relationships among such heterogeneous information. Consequently high-throughput data constitute less reliable evidence and much functional knowledge is extracted from careful and expensive focused studies. Most TFs and their exact roles in cellular processes remain poorly understood. Therefore biologically meaningful computational analysis is an important challenge in deciphering cellular regulatory networks.
Computational prediction of TF function from gene expression and DNA binding data is an active area of research. Numerous algorithms have been published elsewhere, albeit few have been validated experimentally. Earliest approaches focused on a specific class of data and used alternative types of evidence for computational validation. For instance, microarray clustering followed by DNA motif discovery in gene promoters helped establish the genome-scale link between mRNA expression profiles and TF binding [8,9]. Similarly, analysis of cell cycle expression patterns of TF-bound genes led to recovery of cell cycle TFs . More recent methods use statistical modeling to integrate multiple types of evidence. For example, ARACNE extracts transcriptional networks from numeric microarray data using mutual information , and MARINA is a down-stream method that identifies master regulators of these networks through association tests with TF binding target genes . The SAMBA biclustering algorithm studies matrices of regulators and target genes, and highlights regulatory relationships between genes and TFs that co-occur in clusters . The linear regression method REDUCE integrates numeric microarray data, DNA sequence and TF affinity matrices by modeling the linear relationship between gene expression levels and TF-DNA interactions . The GeneClass algorithm additionally integrates information about gene function, as it constructs decision trees of discrete microarray profiles and TF binding sites to select predictors of process-specific genes . While this method provides direct modeling of gene function, TFs and gene expression data are studied as independent predictors. Notably, none of the above methods take advantage of recent ΔTF microarrays that reveal regulator target genes [1,2]. Nested effects models are designed to extract regulatory networks from perturbation data , although integration of TFBS and gene annotations is not supported. Nucleosome positioning measurements also remain unexplored in all above approaches. In summary, additional computational efforts are required for meaningful integration of versatile biological data.
Here we propose a method m:Explorer that uses multinomial logistic regression models to predict process-specific transcription factors. We aim to provide the following improvements in comparison to earlier methods. First, our method allows simultaneous analysis of four classes of data: (i) gene expression data, including perturbation screens, (ii) TF binding sites, (iii) chromatin state in gene promoters, and (iv) functional gene classification. The model is based on the assumption that TF target genes from perturbation screens and TF binding assays are equally informative about TF process specificity. Second, we reduce noise by including only high-confidence regulatory relationships, and do not assume linear relationships between regulators and target genes. Third, we integrate detailed information to better reflect underlying biology: multiple subprocesses may be studied in a single model, and chromatin state data are incorporated into TF binding site analysis. TF target genes with simultaneous evidence from gene expression and TFBS data are highlighted separately. Fourth, our analysis is robust to highly redundant biological networks, as statistical independence is not required. We use univariate models to study all TFs independently and avoid over-fitting that is characteristic to many model-based approaches. This is statistically valid under the assumption that a complex model may be understood by examining its components.
To test our method, we compiled a comprehensive dataset covering most TFs of the budding yeast. We benchmarked m:Explorer in a well-studied biological system and establish its improved performance in comparison to several similar methods. Then we used the tool to discover regulators of quiescence (G0, stationary phase), a cellular resting state that serves as a model of chronological ageing. Experimental validations of our predictions revealed nine TFs with significant impact on G0 viability. Besides demonstrating the applicability of our computational method, these findings are of great potential interest to yeast biologists and researchers of G0-related processes like ageing, development and cancer.
Here we tackle the problem of identifying transcription factors that regulate process-specific genes (Figure (Figure1).1). Our model m:Explorer uses three types of independent regulatory information to characterize target genes of TFs: gene expression measurements from TF perturbation screens, TF binding sites in gene promoters and DNA nucleosome occupancy in binding sites. The fourth input is a list of process-specific genes for which potential transcriptional regulators are sought.
The first stage of our analysis involves data preprocessing and discretization in which high-confidence TF target genes are identified from multiple sources (Figure (Figure1A).1A). We assumed that genes responding to TF perturbation are likely targets of the regulator. We previously analyzed a large collection of ΔTF microarrays, extracted genes with significant up or down-regulation (moderated t-test, FDR p ≤ 0.05), and assigned these to perturbed regulators (Step 1, methods described in ). We also followed the assumption that TF binding in promoters is likely to indicate regulation of downstream genes, and binding sites in low nucleosome occupancy regions are more likely targets of TFs. We collected TF-DNA interactions from multiple datasets and classified genes as TF-bound if at least one dataset showed significant binding in 600 bp promoters (Step 2). We further categorized our TFBS collection into nucleosome-depleted TFBS (onesided t-test, FDR p ≤ 0.05) and sites with no nucleosome depletion. Next we integrated TF target genes into a genome-wide matrix, by assigning non-related genes to a baseline class and creating extra classes for genes with multiple evidence (Steps 3, 5).
Besides regulatory targets of transcription factors, our method requires a list of process-specific genes for which potential regulators are predicted. These may originate from literature, additional microarray datasets, pathway databases or biomedical ontologies. Several non-overlapping lists of genes may be provided to integrate further information about sub-process specificity, sample treatment or differential expression. These genes are organized similarly to TF targets (Steps 4, 5).
The second stage of our analysis involves multinomial regression analysis of process-specific genes and TF targets (Figure (Figure1B).1B). It is a generalization of linear regression that associates a multi-class categorical response (process-specific genes) with one or more predictors (TF target genes). Through the logistic transformation, each gene is assigned a log-odds probability of being process-specific given its relation to a particular TF, as
where yi is the process annotation of the i-th gene, and pi,c is the probability that gene i is part of sub-process c, given a linear combination of K types of evidence x X regarding TF target genes. All probabilities are computed relative to the baseline genes denoted by class C. The TF relation to process genes is quantified through regression coefficients β such that positive coefficients reflect a higher probability of TF target genes involving in the given process. Coefficients β are sought iteratively in maximum likelihood estimation. Likelihood reflects the estimated probabilities of all N genes belonging to their actual class, and thus provides a measure for model evaluation:
where yi,c = 1 if yi is of class c and 0 otherwise, and the probability of gene-class relationship is computed as
Maximising the log likelihood l leads to optimal regression coefficients and the corresponding likelihood value :
Here we implemented a statistical test to assess the process specificity of a given TF by comparing two multinomial regression models. The null model H0 : g(Y) = β0 is an intercept-only model where process-specific genes are predicted solely based on their frequency in the full dataset (Step 7). The alternative model H1 : g(Y) = β0 + βkXk is a univariate model in which TF targets are also considered as predictors of process genes (Step 6). We use the likelihood ratio (LR) test with the chi-square distribution to compare the likelihoods of the two models, and decide if adding TF information substantially improves fit to data given its additional complexity (Step 8), as
where ν corresponds to degrees of freedom and reflects number of model parameters. To predict all regulators to a process of interest, we test all TFs independently, correct for multiple testing and find TFs with significant chi-square p-values (Benjamini-Yekutieli FDR, p ≤ 0.05).
In summary, m:Explorer uses the multinomial regression framework to associate process genes with TF regulatory targets from TFBS maps, gene expression patterns and nucleosome positioning data. Our method finds candidate TFs whose targets are especially informative of process genes, and thus may regulate their expression.
We used m:Explorer to study transcriptional regulation and TF function in yeast, as it has the widest collection of relevant genome-wide evidence. First we compiled a dataset of 285 regulators that contains carefully selected target genes for nearly all yeast TFs from microarrays, DNA-binding assays and nucleosome positioning measurements. Statistically significant target genes from regulator deletion experiments originate from our recent reanalysis  of an earlier study . High-confidence TFBS targets were assembled from earlier chromatin immunoprecipitation (ChIP) assays by Harbison et al. , in silico TFBS predictions [4,17], and recent refinements with protein-binding microarrays by Zhu et al. . The data were further processed with in vivo nucleosome positioning measurements  to distinguish binding sites where lower nucleosome occupancy reflects open chromatin structure.
Our dataset of 285 regulators contains 128,656 significant associations between regulators and target genes. Statistically reasoned cutoffs render our dataset sparse: it comprises high-confidence signals to 7.2% of approximately 1.8 million potential TF-gene pairs. The dataset includes 107 TF target sets with knockout data, 16 TFs with TFBS predictions and 162 TFs with both types of evidence. The majority of all gene-regulator associations (84%) originate from TF perturbation arrays (Figure (Figure2A).2A). As observed previously, the agreement between binding sites and ΔTF targets is low: only 1.5% of all high-confidence targets constitute both types of evidence. Along with 170 confirmed or putative DNA-binding TFs, our dataset covers cofactors, chromatin modifiers and other regulatory proteins (Figure (Figure2B2B).
In conclusion, the yeast TF dataset is a useful resource for studying gene regulation.
First we tested m:Explorer in a well-defined biological context. Cell cycle is a thoroughly described regulatory system with four consecutive phases: gap-1 (G1), synthesis (S), gap-2 (G2) and mitosis (M). Some of the earliest microarray experiments identified cell cycle-regulated yeast genes [18,19], and a computational analysis organized these into phase-specific groups . Several focused studies have investigated the roles of individual cell cycle TFs [21-25], and a genome-wide experiment outlined the underlying regulatory network in its interconnected, circular nature . Altogether, the core cell cycle network comprises nine transcriptional regulators (Swi4, Swi6, Mbp1, Ndd1, Fkh1, Fkh2, Swi5, Ace2, Mcm1, Additional file 1, Table s1).
Here we applied m:Explorer and the TF dataset to select regulators to cell cycle genes. We focused on a recent tiling array study that measured genome-wide transcription during cell cycle at five minute resolution . We used the list of 600 periodically expressed genes that contains specific groups for the four cell cycle phases and two checkpoints (G1, S, G2, G2/M, M, M/G1; 41-257 genes). This structured list of genes was then analyzed in a single m:Explorer run. We identified 46 statistically significant TFs (Benjamini-Yekutieli FDR p ≤ 0.05, LR test from m:Explorer) including all nine core TFs (Figure (Figure3A).3A). Our results are ordered meaningfully, as eight of nine core TFs are ranked first (all p ≤ 10−9). Besides core TFs, our results include at least four regulators that interact directly with the core TFs or act as secondary regulators. Notably, Stb1 forms a complex with G1/S TFs to affect gene expression in G1 , whereas Yox1 cooperates with Mcm1 to repress the expression of M/G1 specific genes . The negative cell cycle regulator Ste12 is known to interact with Mcm1 in a specific pheromone-induced response . In addition to cell cycle regulators, we found components of the transcriptional machinery, including the general transcription factor Taf14 and multiple subunits of the Mediator complex (Ssn2, Cse2, Srb2, Srb8, Gal11). Several chromatin modifiers are also present, e.g. the silent information regulators (Sir2, Sir3) carry out genome silencing and are related to replicative cell ageing . We expected to see such regulators among our predictions, since their disruption is likely to affect any process that involves transcription.
Our method reveals additional details about cell cycle regulation. First, as we model all cell cycle phases in one run, relative TF phase activities can be quantified through regression coefficients (Figure (Figure3B).3B). For instance Swi4, Swi6 and Mbp1 make up the G1-S specific TF complexes MBF and SBF , and m:Explorer correctly highlights the phases with the strongest signal of regulatory activity. Second, we can assess the relative contribution of different kinds of regulatory evidence, and show that combined TFBS and ΔTF evidence are most informative of cell cycle regulation (Figure (Figure3C).3C). Third, simultaneous analysis of multiple sub-processes in a single multinomial model is advantageous to separate logistic models for each related subprocess, since the latter approach is more prone to false positive predictions (Additional file 1, Figure s1). We performed m:Explorer analysis for four cell cycle phases and two checkpoints separately and recovered all cell cycle TFs found by the multinomial model, however also retrieved a large number (28) of additional false positive TFs not associated to cell cycle. Despite the above, analysis of sub-processes showed that m:Explorer is applicable to relatively small gene lists, for instance Mcm1 and Yox1 are correctly recovered as regulators of M-phase through only 55 informative genes.
Next we compared m:Explorer with eight similar methods for predicting TF function in regulatory networks (Additional file 1). As no other method allows exact replication of m:Explorer models, we used combinations of discretized and numeric gene expression, TF binding and cell cycle data as required (Table (Table1).1). Method performance evaluation was carried out with the Area Under Curve (AUC) statistic that accounted for 18 cell cycle TFs (Additional File 1, Tables s1-s2). To measure performance robustness, we also conducted a benchmark in which random subsets of input data were presented to each method (30, 50, 70 and 90% of yeast genes, 100 subsets each). The simulation shows that m:Explorer substantially outperforms all tested methods in recovering cell cycle regulators (Figure (Figure3D,3D, AUC = 0.835 for 18 TFs, AUC = 0.996 for nine core TFs). Our method is reasonably accurate even when 50% of genes are discarded from the analysis (mean AUC = 0.747). The only method with comparable performance is the Fisher's exact test, a standard statistic for detecting significant biases in frequency tables. Comparison of m:Explorer and Fisher's test shows that our method is less prone to false positive discovery from randomly shuffled data (Additional File 1, Figure s2), and less dependent on microarray discretization parameters (Additional File 1, Figure s3). Fisher's test also prohibits the combined use of multiple features like gene expression, TF binding, nucleosome occupancy, and cell cycle phases. Simultaneous modeling of all data types in m:Explorer is likely to contribute to the demonstrated advantage over other approaches.
In conclusion, the cell cycle analysis showed that our approach successfully recovers a well-characterized regulatory system from multiple lines of high-throughput data. m:Explorer greatly outperformed several similar methods and showed robustness to incomplete data.
Next, we applied m:Explorer in a less familiar biological context to create experimentally verifiable hypotheses about TF function. We focused on the transcriptional mechanisms that govern cell quiescence (G0, reviewed by Gray  and Kaeberlein ). G0 is a cellular resting state with no proliferation, silenced genomes, reduced metabolism and translation, and greater stress resistance. Studying G0 has proven difficult and related regulatory programs remain elusive.
Quiescence of yeast cells can be experimentally induced as a response to prolonged starvation (Figure (Figure4A).4A). When glucose is depleted in exponentially growing cultures, growth rate is reduced as cells pass diauxic shift in which metabolic reprogramming initiates respiration of non-optimal carbon sources. Nutrients are depleted in post-diauxic phase, resulting in halted growth and differentiation to quiescent and non-quiescent cell populations . The quiescent fraction of homogeneous cells may survive for extended periods of time, while the ageing heterogeneous non-quiescent fraction dies on further starvation. Consequently, culture viability starts decreasing rapidly in later stages of G0. Induction and inhibition of quiescence has been associated to several highly conserved signalling pathways, including protein kinases A and C (PKA, PKC), TOR and Snf1 .
Here we studied two public microarray datasets and executed m:Explorer in two independent rounds. First, we retrieved 207 diauxic shift genes in three distinct subgroups of early, transient and late expression from the dataset by Radonjic et al. . Second, we used 594 genes and 676 genes characteristic of quiescent and non-quiescent cells from the study by Aragon et al.  (Figure (Figure4B).4B). We identified 29 and 82 statistically significant candidate TFs in the two runs, log-transformed the scores and produced a final list of 97 G0 regulators (Figure (Figure4C).4C). A large number of regulators is expected, as G0 entry is thought to comprise large-scale cellular reprogramming . Several top-ranking TFs have high scores in both m:Explorer predictions. This ranking is not an artifact of the overlap between diauxic shift and quiescence genes. Although the two lists comprise a considerable number of common genes (n = 62, p = 0.0005, Fisher's exact test), these were not sufficient for predicting a similar collection of G0 TFs, as m:Explorer analysis with the 62 genes only provided in a single significant TF (Mga2, p = 0.0005, LR test from m:Explorer).
In summary, the result of this analysis is an inclusive, prioritized list of candidate G0 TFs that serves as a resource for hypothesis generation and experimental testing.
Next we selected top 12 high-scoring TFs from our predictions for experimental testing. In total, 17 different strains were grown to G0 and assessed for viability in six consecutive weekly measurements (Figure (Figure5A).5A). We included deletion strains of candidate TFs (Δmga2, Δcst6, Δswi3, Δsds3, Δspt10, Δsin3, Δbas1, Δsnf2, Δspt20, Δhaa1, Δtup1, Δsnf11), positive controls (Δard1, Δmip1), negative controls (Δpdr3, Δgal3) and wildtype strains (Additional file 2). The viability of some strains was additionally monitored in five measurements over the first 72 hours of growth (Figure (Figure5B,5B, Additional file 3). To confirm the timeframe of exponential growth and diauxic shift, we measured culture density and glucose levels of wildtype strains during 48 hours of growth (Additional file 1, Figure s4). To distinguish TFs with significant viability deviations, we used a linear error model that accounted for viability in wildtype and negative control strains as well as experimental batch effects.
All tested strains showed significant deviances from background viability at different stages of the quiescence time-course (Figure (Figure5A).5A). The deletion strains of Bas1, Sds3, cst6, Mga2, and Spt10 show consistently greater viability in G0, indicating that their normal presence in wildtype cells suppresses viability and hastens cell ageing (Figure (Figure5C).5C). We refer to these knockout phenotypes as super-wildtypes (WT+). In particular, Δbas1 strains are on average 1.7-4.5 times more viable than wildtype in weeks 3-6 of quiescence (all FDR p ≤ 10−4, LR test from error model). The transcription factor Bas1 is involved in the regulation of amino acid and nucleic acid metabolic pathways , and cst6 is related to chromosome stability and non-optimal carbon source regulation [38,39]. Spt10 and Sds3 are chromatin modifiers involved in genome silencing [40,41], and Mga2 regulates fatty acid metabolism, transcriptional silencing and response to low oxygen [42-44]. Deletion of Sds3 of the Sin3-Rpd3 histone deacetylase complex has been associated to increased chronological cell ageing .
The deletion strains Δtup1, Δswi3, Δhaa1 are significantly less viable than wildtype in quiescence (Figure (Figure5A).5A). In particular, Δtup1 and Δswi3 strains become inviable in later stages of G0 (viability ≤ 0.005) and can be considered essential for survival in this cell state (Figure (Figure5D).5D). Two further strains Δspt20 and Δsnf2 are less viable in early quiescence, while Δsin3 shows later deviations. With the exceptions of Sin3 and Haa1, corresponding null mutants are previously known for decreased or absent respiratory growth. Tup1 is a general inhibitor of transcription that establishes repressive chromatin structure . Other factors are also involved in regulation of chromatin, transcription and genome stability, such as Swi3 and Snf2 of the SWI-SNF complex , Sin3 of Sin3-Rpd3 complex  and Spt20 of the SAGA complex . While the factors have not been specifically described in the context of quiescence, disruption of their global functions is likely to affect this cellular state. Besides the above, the reduced G0 viability of Δ haa1 potentially relates to its role in regulating cell wall proteins .
Before entering quiescence, most tested TF strains have similar viability to wildtype strains, suggesting that their function in regulating viability is specific to G0 (Figure (Figure5B).5B). During exponential growth at seven hours after inoculation, only three strains including the positive control Δ ard1 are significantly less viable. Ard1 encodes an N-terminal acetyltransferase subunit that guides genome silencing, and Δard1 fails to enter G0 as observed previously . In contrast, the other positive control Δmip1 is as viable as wildtype in exponential phase, and more viable in post-diauxic phase. Mip1 encodes a mitochondrial DNA polymerase subunit required for cell respiration , and Δmip1 loses viability in a similar manner to Δ tup1 (Figure (Figure5E).5E). Curiously, Δspt10 is less viable in exponential growth phase and early quiescence, while its viability exceeds wildtype after week three of our time-course. The negative control strains Δgal3 and Δpdr3 expectedly show no major deviations from wildtype viability (Figure (Figure5F).5F). The TFs are related to alternative carbon metabolism and drug resistance, respectively [53,54], and show non-significant scores in m:Explorer predictions of G0 TFs. Finally, our glycerol growth assays confirm the respiratory properties of tested strains (Additional file 1, Table s3) and mostly agree with previous studies [55,56]. However, in contrast to those reports, our data indicate that Δcst6 is viable on glycerol and indeed displays increased G0 viability.
According to our knowledge, most of our predicted TFs are not recognized as quiescence regulators. However previous functional evidence refers to processes important in quiescence, and hence lends confidence to our experimental observations. Besides uncovering novel regulators of viability in G0, our experiments show that m:Explorer provides biologically meaningful prediction of regulator function.
To gain insight into G0 gene regulation of validated TFs, we performed a functional enrichment analysis of their G0 target genes. We focused on quiescence genes defined by Aragon et al.  and identified the subset of genes that were bound by at least one WT+ TF or showed differential gene expression in at least one WT+ ΔTF microarray . Target genes were then scored by product of differential expression p-values across all WT+ ΔTF microarrays and ranked such that genes with most dramatic transcriptional changes were prioritized. The target gene list for viability-deficient TF strains was complied in a similar fashion. We expect that ΔTF differential expression is informative of regulatory relationships in quiescence. The strains underlying microarray profiling are genetically identical to the strains in our G0 experiments, although the former assays were performed with exponentially growing cells. Intersection of known quiescence genes with target genes of validated G0 TFs, and subsequent prioritization according to differential expression, is therefore likely to highlight high-confidence TF targets and functional relationships. To investigate this in detail, we then used the ordered gene list analysis of g:Profiler  to study the functional importance of significance-ranked target genes of WT+ and viability-deficient TFs.
Our analysis revealed 62 non-redundant Gene Ontology categories and KEGG and Reactome pathways with statistically significant enrichment in quiescence-related targets of G0 TFs (FDR p ≤ 0.05, hypergeometric test, Figure Figure6).6). A number of functions were found to be enriched in TF targets corresponding to both viability phenotypes, suggesting that improved and reduced viability in quiescence may involve common regulatory pathways. The most significant results include the KEGG pathway of ribosome (p = 10−15), proteolysis (p = 10−11), reproduction (p = 10−9) and oxidation-reduction process (p = 10−10). Other functions are informative of TFs responsible for reduced G0 viability. For instance, metabolic and catabolic genes (p = 0.0070 and p = 0.0035) are mostly up-regulated, while genes related to cell wall organization are inhibited (p = 0.030). In contrast, WT+ TFs with increased G0 viability associate to down-regulation of protein metabolic genes (p = 10−7) and modulation of alternative energy pathways such as fatty acid catabolism (p = 0.034) and glutamine metabolism (p = 0.047).
Taken together, the above results associate to known mechanisms of quiescence and provide clues of the regulatory programs of predicted G0 TFs. Inhibition of growth through down-regulation of ribosome genes has been linked to increased replicative lifespan . Efficient cell wall remodeling and response to increased oxidative stress are essential prerequisites of quiescence entry and survival . Expectedly, increased viability appears to correlate with reduced metabolism, as related genes show opposite expression patterns in corresponding strains. Further discussion on G0 TFs and related pathways can be found below.
It is tempting to speculate about the role of identified quiescence TFs in modulating quiescence signalling, as links between the factors and global G0-related pathways are apparent in our dataset. Our findings of WT+ regulators are especially intriguing, since their normal presence in wildtype cells reduces viability in quiescence and causes increased chronological ageing. From the perspective of evolutionary maintenance, WT+ regulators should engage in significant cellular functions that compensate for such negative properties.
As an example of G0 regulation, protein kinase A (PKA) mediates nutritional signals to the cell and is known as an inhibitor of quiescence . Its primary regulatory subunit Bcy1 acts as an inhibitor of the pathway, and mutations in Bcy1 cause viability loss and death in G0 [59,60]. This double negative regulatory mechanism provides a potential explanation to observed viability phenotypes. In our TF dataset, Δmga2 has significantly higher levels of Bcy1, potentially allowing more starving cells to pass into quiescence. The G0-essential Tup1 and Swi3 knockout strains have depleted levels of Bcy1 and as a possible consequence, we observe reduction and loss of viability. As another example, protein kinase C (PKC) guides cell wall remodeling in response to starvation and its activity is required for G0 entry . The cell wall biosynthesis enzyme Gsc2 is a downstream target of PKC  and part of the gene expression signature of quiescent cells . In ΔTF microarrays, Δmga2 and Δcst6 strains have elevated levels of Gsc2, while Δswi3 and Δtup1 show inhibition of PKC upstream of Gsc2 (Additional file 1, Figure s5). Other genes with known function in G0 appear to be regulated by WT+ and viability-deficient TFs. Notably, the conserved superoxide dismutase (SOD) genes are responsible for neutralizing oxidative damage of mitochondrial respiration. In yeast, SOD genes are required for G0 survival and extend chronological lifespan when over-expressed [63,64]. Induced levels of Sod2 expression in Δcst6 may explain our observations of increased G0 viability.
Several confirmed G0 TFs are also associated to mammalian gene regulation.Cst6 carries the DNA-binding domain of CREB, an extensively studied TF that regulates a variety of processes, including cell survival and proliferation, cellular metabolism, and synaptic plasticity of long-term memory . Bas1 is homologous to the MYB TF that regulates stem and progenitor cells and appears as an oncogene in multiple tumour types . Chromatin modifier complexes Swi/Snf, Sin3/Rpd3 and SAGA are also broadly conserved, for instance Swi3 homolog SMARCC1 is involved in versatile functions, including neural stem cell renewal and differentiation . As the yeast quiescence model associates to hallmark cancer properties of cell cycle control, proliferation and differentiation, further analysis of our Findings may reveal intriguing links to cancer biology.
Here we present the robust computational method m:Explorer for predicting functions of gene regulators from high-throughput data. We applied a model that probabilistically accounts for multiple types of regulatory signals and functional gene annotations. To take advantage of abundant genome-wide data and powerful experimental approaches, we present a case study for predicting transcription factors (TF) in the unicellular budding yeast. However, our method is not restricted to yeast and even not to these classes of data and regulators, being easily scalable to more complex regulatory systems of vertebrate organisms. Our method is also applicable to data such as protein-protein and genetic interactions that are categorical in nature. As shown here, m:Explorer is particularly useful in investigating sparse, high-confidence sets of data that may be controversial and not entirely comparable. For instance, we envisage large-scale characterization of human pathways in the context of heterogeneous tumours, utilizing sequence mutations, gene expression and chromatin modification data that are collected in cancer genomics projects.
In our model benchmarks, we demonstrate the advantage of univariate multinomial models in m:Explorer over similar multivariate models (AUC = 0.83 vs AUC = 0.67 for recovering cell cycle TFs). Briefly, the former models treat each TF independently in process gene classification, while the latter models include a non redundant collection of TFs as predictors. However, TF redundancy is an inherent property of robust biological networks that have evolved through gene and genome duplication . In our case, the core cell cycle system involves three pairs of homologous TFs (Swi4 and Swi6; Fkh1 and Fkh2; Ace2 and Swi5) that have strikingly similar TFBS and expression patterns. Due to redundancy, such TFs are not treated as significant predictors in the multivariate setting. This is evident in our simulations: none of the tested multivariate models included both TFs of homologous pairs as significant predictors.
This analysis provides multiple lines of evidence to establish m:Explorer among other methods with similar goals. First, we carried out a highly detailed reconstruction of the known cell cycle regulatory system and proved the validity of our approach through existing knowledge. Second, we repeated the same analysis using eight alternative computational methods and random samples of input data, and provided quantitative proof to the robustness and better performance of our method. Third, we predicted regulators to the enigmatic cellular state of quiescence and validated our top-ranking candidate TFs in follow-up experiments. Nine of twelve tested TFs were confirmed to have consistent and significant G0 viability deviations in gene knockout screens, while the remaining three factors showed differences in subsections of our time-course. Thus we proved a high success rate given our relatively simple experimental assays. Besides demonstrating the biological validity of our method, our findings reveal novel, previously unrecognized regulators of quiescence.
m:Explorer is available as an R package on our web site  and elsewhere. The yeast TF dataset may prove to be a useful resource for the community and is also provided. We have established a web server at , allowing online prediction of regulator function using the yeast TF dataset.
m:Explorer is a generally applicable method for inferring transcription factor function from heterogeneous high-throughput datasets. Our approach outperforms similar state-of-the art tools in recovering regulatory relationships in a well-studied eukaryotic system. Furthermore, the algorithm helps explore uncharacterized regulatory networks and propose valuable hypotheses for detailed assays. Our case study of quiescence G0 and subsequent experimental validations revealed nine novel regulators that enhance or reduce cellular longevity, providing insights to investigators of this cryptic cellular state. In conclusion, our computational and experimental analyses provide strong support to the validity and usefulness of m:Explorer.
The yeast transcription factor dataset of 6253 genes and 285 transcription factors was compiled from gene expression, TF binding and nucleosome positioning data. Perturbation microarrays for 269 regulators were originally produced by Hu et al. , while our recently reanalyzed dataset  was used here for discretized, high-confidence values of up- and down-regulation (moderated t-test, FDR p ≤ 0.05). Further details on microarray preprocessing are available in the related publication . TF binding site data for 178 TFs were compiled from multiple datasets of ChIP-chip , protein-binding microarrays  and computational predictions [4,17], using custom filtering and significance cutoffs proposed by the authors. Each promoter of 600 bp was considered to be bound by a TF if at least one binding site occurred in the dataset, and the TFBS was considered nucleosome-depleted (NDTFBS) if nucleosome occupancy  at the site was considerably below normalized genome-wide average (t-test, FDR p ≤ 0.05). Finally, gene expression and TF binding targets for each regulator were integrated and split into eight classes (up, down, TFBS, NDTFBS, up+TFBS, down+TFBS, up+NDTFBS, down+NDTFBS). All other genes except TF targets were assigned to the baseline class (not regulated).
Process-specific gene lists originate from previous high-throughput gene expression experiments. 600 cell cycle specific genes were retrieved from the tiling array experiment by Granovskaia et al.  and split into six sublists (G1, S, G2, G2/M, M, M/G1) according to authors' instructions. Three classes of diauxic shift genes (early, transient and late expression) originate from the G0 time series , and genes specific to quiescent and non-quiescent cell cultures were first mapped in the analysis by Aragon et al. .
m:Explorer is based on univariate multinomial regression and implements the functionality of the R NNET package  for model fitting. We use a list of process-specific genes as categorical model response, and TF target genes as predictors. Briefly, m:Explorer compares two models: the null intercept-only model classifies process gene through their frequency, and the alternative univariate model additionally incorporates TF regulatory targets as predictors. We apply the log likelihood ratio test with null and alternative models to decide if TF target genes are significantly informative of process-related genes. Detailed description of the model is available in Additional file 1.
Yeast cell cycle TFs were predicted from a single structured gene list and directly ranked according to log p-values from m:Explorer. G0 TFs were predicted in two independent m:Explorer runs using genes from two datasets. TF p-values from LR tests were log-transformed, scaled to unit range and summed across the two runs to create unbiased composite scores for final ranking. Unit-scaled positive regression coefficients were used to assess the relative phase specificity of cell cycle TFs, since these indicate over-represented regulatory targets in contrast to baseline genes. Relative contribution of regulatory evidence was computed in a similar way.
Linear regression was used to assess the significance of mutant strain viability deviations from control and wildtype strains. With viability as model response v, three types of variance were included as model predictors for assessing each mutant/time-point combination across all related replicas, as the alternative model H1 : v ~ i + c + b + m. The above reflect global variance i, variance of negative controls c, variance between two batches of independent time-courses b, and additional variance of the tested strain m. Significance of viability deviation was assessed with a LR test, similarly to the m:Explorer algorithm. Specifically, the null model comprised only global variance, negative control variance and batch variance as H0 : v ~ i + c + b, and null and alternative models were compared using the chi-square distribution. Resulting p-values were corrected for multiple testing with FDR.
Fisher's exact tests were used in multiple cases to evaluate the correlation of two binary variables. In the case of TF target genes and cell cycle genes, we applied the Fisher's test to assess whether the proportion of TF-regulated genes was statistically unexpected in the set of cell cycle genes. The Fisher's probability of observing a particular configuration in a two-way contingency table is computed as
where g denotes the number of genes in a particular set, C indicates cell cycle genes, T indicates TF targets, c shows genes unrelated to cell cycle, t shows genes not regulated by the particular TF, and n = gCT +gCt +gcT +gct reflects the number of all yeast genes. As Fisher's test does not support large contingency tables of multi-level variables, different types of TF regulatory targets were treated as the first category and non-regulated genes were assigned to second category, and cell cycle phase-specific genes were similarly merged into a bivariate discrete variable. A similar analysis was carried out to compare the overlap between diauxic shift genes and quiescence genes, using the set of all yeast genes as statistical background.
Gene Ontology (GO) and pathway enrichment analysis for G0 TFs was carried out with with g:Profiler software . We defined two ranked gene lists: G0 genes  that were differentially expressed in WT+ TF knockout strains (Mga2, Cst6, Sds3, Spt10, Bas1), and G0 genes that were differentially expressed in viability-deficient TF strains (Swi3, Sin3, Snf2, Spt20, Tup1, Haa1), according to TF knockout microarrays . The gene lists were ordered according to statistical significance in TF knockout data , computed as products of p-values across WT+ and RD strains for every gene. We used the ordered enrichment analysis of g:Profiler to find GO functions and pathways in ranked gene lists and applied statistical filtering to find significant enrichments (FDR p ≤ 0.05).
The one-tailed hypergeometric tests calculated by g:Profiler assess the significance of observing k or more genes of a certain functional category in a list of n genes, as
given that there are N genes in total and K of which are part of the functional category. As ordered enrichment analysis assumes that genes with stronger signals are ranked first, it consequently tests different subsets of the top list and returns the portion of top genes with the strongest p-value for a particular functional category . Resulting G0 functional categories were grouped into three classes: enriched G0 categories associating to WT+ TF targets, categories of viability-deficient TF targets, and categories with statistical enrichment in both groups of targets. Enrichment p-values were corrected for multiple testing with the FDR procedure. To rank the third class of common functional categories, we multiplied corresponding p-values of WT+ target genes and viability-deficient TF target genes. After functional enrichment analysis, redundant categories whose genes formed a subset of some other category were removed. To quantify each GO category and function, we also counted up-regulated and down-regulated G0 genes across all related TF strains.
Regulator knockout strains were selected as 12 top-ranking candidates from m:Explorer results. S. cerevisiae deletion strains originate from the EUROSCARF deletion collection in the BY4741 strain (MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0). Liquid cultures were grown in triplicate at 30°C with aeration in YPD (1% yeast extract, 2% peptone, 2% glucose) for 28 days and subsequently shifted to room temperature without aeration. Viability measurements of the six-week time-course were taken in eight time-points: 7h after colony initiation, 48h after colony initiation, followed by six weekly measurements on days 7, 14, 21, 28, 35 and 42. Two independent batches involved distinct sets of tested strains, while wildtypes and controls were covered in both batches. A shorter, independent time-course covered the first three days of growth and involved viability measurements at 7h, 11h, 24h, 48h, and 72h. Cell density was measured at 600 nm. Colony forming units (CFU/ml) were determined by plating cells on YPD agar and counting colonies after three days of growth at 30°C. Culture viability was determined by dividing CFU/ml with total cell number per milliliter in corresponding culture (OD600 units ×107). Growth on glycerol was determined by streaking strains onto YPG plates (1% yeast extract, 2% peptone, 3% glycerol, 2% agar). Glucose concentration was determined by measuring NADPH production in hexokinase and glucose-6-phosphate dehydrogenase coupled reactions provided by Roche.
The authors declare that they have no competing interests.
Designed and implemented the method: JR. Compiled and analyzed data: JR. Conceived and designed experiments: JR, AA, JMV, JS, NML. Conducted experiments: AA. Wrote the manuscript: JR. Contributed to writing: AA, JV, JMV, JS, NML. All authors have read and approved the manuscript for publication.
Supplementary Online Material. Additional file 1 contains Supplementary Methods, Figures s1-s5 and Tables s1-s3.
Additional file 2contains normalized colony forming unit measurements for tested ΔTF strains, wildtypes and controls of the six-week quiescence time-course.
Additional file 3contains normalized colony forming unit measurements for tested ΔTF strains, wildtypes and controls of the 72-hour quiescence time-course. Abbreviations. ΔTF - transcription factor knockout strain; AUC - area under curve; CFU - colony forming units; ChIP - chromatin immunoprecipitation; DNA - deoxyribonucleic acid; FDR - false discovery rate; G0 - stationary phase, quiescence; G1 - gap-1; G2 - gap-2; GO - gene ontology; KEGG - Kyoto encyclopedia of genes and genomes; LR - likelihood ratio; M - mitosis; MBF - mlu1-box binding factor; NADPH - nicotinamide adenine dinucleotide phosphate; NDTFBS - nucleosome-depleted transcription factor binding site; OD - optical density; PKA - protein kinase A; PKC - protein kinase C; S - synthesis; SAGA - Spt-Ada-Gcn5-acetyltransferase; SBF - SCB binding factor; TF - transcription factor; TFBS - transcription factor binding site; TOR - target of rapamycin; WT+ - super-wildtype; YPD - yeast extract peptone dextrose; YPG - yeast extract peptone glycerol.
We are grateful to Gary Bader, Richard Bourgon, John Marioni, Gabriella Rustici, and Annabel Todd for their valuable advice, and to Tambet Arak for his work on the m:Explorer website. JR would like to acknowledge financial support from IKTDK, EXCS, ENFIN, EITSA, Marie Curie Biostar, the Ustus Agur Foundation, and the Artur Lind Foundation. Computational resources were provided by the University of Tartu HPC Centre.