In this study, we applied a novel machine learning algorithm to learn regulatory programs underlying oxygen regulation and heme regulation. This algorithm uses experimental data, including microarray gene expression data, promoter sequence, and ChIP-chip data, without introducing prior assumptions such as presuming a cluster structure in gene expression data. The results from our analysis show that the MEDUSA algorithm can provide important, unbiased information about global regulatory programs. MEDUSA identifies many DNA sequence motifs important for oxygen and heme regulation (). Further, experimental data from measuring OLE1 promoter activity confirms the specific predictions made by MEDUSA ().
Finally, a comprehensive comparison with a traditional “cluster-first” motif discovery approach demonstrated that MEDUSA is more successful at identifying binding site motifs relevant to oxygen regulation ().
MEDUSA identifies many regulators previously known to be involved in this system. For example, MEDUSA identifies Upc2, Mga2, and Hap1 
as important regulators in oxygen regulation (). Further, MEDUSA predicts many new regulators of oxygen and heme regulation, such as Pph3, Bem2, and Pcl1. In support of the regulatory network identified by MEDUSA, several identified regulators are known to interact with each other. For example, Mbp1 and Ure2 are known to coexist in one complex, and the MAP kinase kinase kinase Ssk22 acts upstream of Mbp1. Pph3 and Bem2 are known to coexist in one complex, and both likely mediate the regulation of both hypoxically induced genes and oxygen-induced genes in Δhap1
cells (, , and ). Wtm1 and Afr1, which are known to coexist in one complex, act in concert to promote oxygen regulation in wild type HAP1
cells ( and ). Likewise, Ire1, which acts upstream of Rgs2, may act with Rgs2 to mediate the regulation of heme-suppressed genes ( and ).
The purpose of achieving high prediction accuracy on the test data is to confirm that the identified regulators are statistically important predictors for the regulation of target genes. The number of significant regulators identified by MEDUSA is much smaller than the number of regulators whose expression is changed in a specific experiment. For example, in cells bearing the Hap1 expression plasmid (HAP1), we identified 18 significant regulators that may mediate the regulation of the oxygen-induced genes (), out of 98 regulators whose mRNA levels were significantly altered in the experiment. This dramatic filtering is achieved by three aspects of our computational approach. First, we require that regulators control their putative targets through shared motifs in the promoter sequences. Second, we train on examples from multiple experimental conditions. If a regulator cannot be associated with a binding site motif through which it contributes to target gene regulation in a consistent way across multiple conditions, it will not be selected by the algorithm. Finally, we use a novel margin-based score to identify the most significant regulators for specific conditional and gene sets. This filtering represents an important improvement over simply correlating expression levels of regulators with those of target genes.
It is important to note that the MEDUSA analysis did not identify the regulators that mediate stress responses, such as Msn2, Msn4, Tpk1, Usv1, Yap1, and Hsf1 
, although motifs for some of these regulators are identified in the promoters of target genes. In some aspects, anaerobic and heme-deficient conditions exhibit certain characteristics of stress responses. As such, certain genes induced by stress, such as genes involved in ribosome synthesis, were induced by anaerobic and heme-deficient conditions (see Figure S8
). However, the regulatory network mediating oxygen and heme regulation is clearly different from the general stress response network. The most significant regulators in the oxygen and heme regulatory network are not those involved in general stress responses. Interestingly, however, this oxygen and heme regulatory network shares many regulators with other signaling pathways, such as pheromone signaling and osmotic responses.
Our analyses suggest a remarkable flexibility of the oxygen and heme regulatory network. For example, in the absence of Hap1, certain new regulators, such as Glc8 and Mbp1 and Ure2, along with some of the regulators acting in wild type HAP1
cells, appear to be recruited to mediate oxygen regulation to substitute Hap1 (). Another feature of the oxygen and heme regulation network is its complexity. Although several previously known oxygen and heme regulators, including Hap1 and Mga2 
, are confirmed to be important in oxygen and heme regulation by our analysis, many other regulators appear to play important roles in global oxygen and heme regulation. Through biochemical validation of the predicted regulators for the OLE1
promoter, we have taken the first step in confirming the novel components of the oxygen regulatory network as predicted by MEDUSA. While much experimental work remains to be done, we are encouraged by MEDUSA's success in generating condition- and target-specific hypotheses that we were able to validate experimentally.
MEDUSA's “cluster-free” approach has the advantage that it can still be effective for small expression datasets, where clustering may only generate large and functionally heterogeneous gene sets. Moreover, clustering and most “module” learning approaches rely on the static assignment of genes to clusters across all experiments in a dataset, which may oversimplify coregulation relationships between genes. MEDUSA models condition-specific regulation in a more flexible way that avoids the cluster assumption. However, methods that produce sets of clusters or modules are more familiar and easier to visualize than MEDUSA's gene regulatory programs, and MEDUSA analysis requires an interpretation step to reveal detailed information for specific conditions or sets of genes. In the present work, we used margin scoring to extract significant regulators for the set of induced/suppressed target genes in each condition and significant motifs associated with genes belonging to expression signatures. This analysis gave a convenient summarization of our results, but more general kinds of statistical post-processing are possible and could be more informative. MEDUSA is well suited to a perturbation dataset, where the set of regulators exhibit diverse expression signatures across conditions. In a dataset where many of the regulators are highly correlated, such as in a short time series, there may not be enough information in the discretized regulator expression profiles for MEDUSA to resolve condition-specific regulators. More generally, MEDUSA incurs some loss of information by discretizing gene expression data prior to training. Extending MEDUSA to handle real-valued regulator and target gene expression levels, for example through a regression formulation, might address this limitation, but it could also introduce too much noise and lead to over-fitting. Further investigation is needed to determine whether such extensions to the MEDUSA algorithm will lead to greater biological insight.