Eukaryotic transcriptional regulation is a core cellular process that governs the expression of genes. Understanding gene expression is crucial in explaining complex biological processes including development, disease and cancer. Transcription factors (TFs) are key proteins that activate or repress transcription by binding sequence-specifically to DNA in promoter regions of target genes. Mapping such regulatory networks and TF functions is therefore an important goal of current biomedical research. In complex vertebrate organisms such as humans, this task is hindered by enormous genomic space, numerous cell types, and distinct experimental procedures with data that are often unsuitable for direct comparison. The relatively simple unicellular model organism budding yeast (S. cerevisiae) serves as a platform for regulatory genomics. Multiple types of global-scale data on yeast gene regulation are available to date, including microarrays with TF deletion (ΔTF) strains [1, 2], predictions of TF binding sites (TFBS) [3-5], and measurements of chromatin state such as nucleosome positioning [6]. These data appear to be comprehensive; however, the agreement between transcript expression and TF binding events remains modest [2, 7]. While part of this discrepancy can be attributed to experimental and statistical noise, we may still lack significant details regarding the biological relationships among such heterogeneous information. Consequently, high-throughput data constitute less reliable evidence, and much functional knowledge is still extracted from careful and expensive focused studies. Most TFs and their exact roles in cellular processes remain poorly understood. Therefore, biologically meaningful computational analysis is an important challenge in deciphering cellular regulatory networks.
Computational prediction of TF function from gene expression and DNA binding data is an active area of research. Numerous algorithms have been published, although few have been validated experimentally. The earliest approaches focused on a specific class of data and used alternative types of evidence for computational validation. For instance, microarray clustering followed by DNA motif discovery in gene promoters helped establish the genome-scale link between mRNA expression profiles and TF binding [8, 9]. Similarly, analysis of cell cycle expression patterns of TF-bound genes led to recovery of cell cycle TFs [10]. More recent methods use statistical modeling to integrate multiple types of evidence. For example, ARACNE extracts transcriptional networks from numeric microarray data using mutual information [11], and MARINA is a downstream method that identifies master regulators of these networks through association tests with TF binding target genes [12]. The SAMBA biclustering algorithm studies matrices of regulators and target genes, and highlights regulatory relationships between genes and TFs that co-occur in clusters [13]. The linear regression method REDUCE integrates numeric microarray data, DNA sequence and TF affinity matrices by modeling the linear relationship between gene expression levels and TF-DNA interactions [14]. The GeneClass algorithm additionally integrates information about gene function, as it constructs decision trees of discrete microarray profiles and TF binding sites to select predictors of process-specific genes [15]. While this method provides direct modeling of gene function, TFs and gene expression data are studied as independent predictors. Notably, none of the above methods takes advantage of recent ΔTF microarrays that reveal regulator target genes [1, 2]. Nested effects models are designed to extract regulatory networks from perturbation data [16], although integration of TFBS and gene annotations is not supported. Nucleosome positioning measurements also remain unexplored in all of the above approaches. In summary, additional computational efforts are required for meaningful integration of diverse biological data.
Here we propose m:Explorer, a method that uses multinomial logistic regression models to predict process-specific transcription factors. We aim to provide the following improvements in comparison to earlier methods. First, our method allows simultaneous analysis of four classes of data: (i) gene expression data, including perturbation screens, (ii) TF binding sites, (iii) chromatin state in gene promoters, and (iv) functional gene classification. The model is based on the assumption that TF target genes from perturbation screens and TF binding assays are equally informative about TF process specificity. Second, we reduce noise by including only high-confidence regulatory relationships, and do not assume linear relationships between regulators and target genes. Third, we integrate detailed information to better reflect underlying biology: multiple subprocesses may be studied in a single model, and chromatin state data are incorporated into TF binding site analysis. TF target genes with simultaneous evidence from gene expression and TFBS data are highlighted separately. Fourth, our analysis is robust to highly redundant biological networks, as statistical independence is not required. We use univariate models to study all TFs independently and thereby avoid the over-fitting that is characteristic of many model-based approaches. This is statistically valid under the assumption that a complex model may be understood by examining its components.
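The univariate design described above can be illustrated with a minimal sketch. It is not the m:Explorer implementation. It assumes that each TF is encoded as one categorical predictor over genes (hypothetical levels `none`, `expr`, `bind`, `both`) and that each gene carries one process label (hypothetical classes `G1`, `S`, `other`). Under that assumption, a multinomial logistic regression with a single categorical predictor is saturated, so its maximised likelihood can be computed directly from contingency counts, and the likelihood-ratio statistic against an intercept-only model coincides with the G-test statistic. All names and data below are illustrative.

```python
import math
from collections import Counter

def max_loglik(pairs, use_predictor):
    """Maximised multinomial log-likelihood for (evidence_level, process) pairs.

    With a single categorical predictor the fitted class probabilities equal
    the observed proportions within each predictor level (saturated model).
    Without the predictor they equal the overall class proportions.
    """
    if use_predictor:
        cells = Counter(pairs)                       # counts per (level, class)
        margins = Counter(level for level, _ in pairs)
        return sum(n * math.log(n / margins[level])
                   for (level, _), n in cells.items())
    classes = Counter(cls for _, cls in pairs)
    total = len(pairs)
    return sum(n * math.log(n / total) for n in classes.values())

def lr_statistic(tf_evidence, gene_process):
    """Likelihood-ratio statistic and degrees of freedom for one TF."""
    pairs = list(zip(tf_evidence, gene_process))
    stat = 2.0 * (max_loglik(pairs, True) - max_loglik(pairs, False))
    df = (len(set(tf_evidence)) - 1) * (len(set(gene_process)) - 1)
    return stat, df

# Hypothetical toy data: 16 genes with process labels (assumption).
gene_process = ["G1"] * 4 + ["S"] * 4 + ["other"] * 8

# Hypothetical TF whose targets are exactly the G1 genes.
tf_informative = ["both", "both", "expr", "bind"] + ["none"] * 12
# Hypothetical TF whose targets are spread evenly over all processes.
tf_uninformative = ["bind", "none"] * 8

stat_inf, df_inf = lr_statistic(tf_informative, gene_process)
stat_unf, df_unf = lr_statistic(tf_uninformative, gene_process)

# Each TF is scored independently; a larger statistic suggests a stronger
# association between TF targets and the process labels.
print("informative TF:", round(stat_inf, 3), "df =", df_inf)
print("uninformative TF:", round(stat_unf, 3), "df =", df_unf)
```

Because each TF is scored in its own model, correlated or redundant TFs do not compete for explanatory weight, which is one way to read the robustness argument above. A p-value could be obtained by comparing the statistic with a chi-square distribution with `df` degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.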
To test our method, we compiled a comprehensive dataset covering most TFs of budding yeast. We benchmarked m:Explorer in a well-studied biological system and established its improved performance in comparison to several similar methods. We then used the tool to discover regulators of quiescence (G0, stationary phase), a cellular resting state that serves as a model of chronological ageing. Experimental validation of our predictions revealed nine TFs with significant impact on G0 viability. Besides demonstrating the applicability of our computational method, these findings are of potential interest to yeast biologists and researchers of G0-related processes such as ageing, development and cancer.