Hundreds of yeast RNAs are expressed in a cell cycle–dependent, oscillating manner. In both budding yeast and fission yeast, these RNAs cluster into four or five groups, each corresponding roughly to a phase of the cycle [1
]. Large sets of phase-specific RNAs are also seen in animal and plant cells [10
], arguing that an extensive cycling transcription network is a fundamental property of Eukaryotes. The complete composition and connectivity of the cell cycle transcription network is not yet known for any eukaryote, and many components may vary over long evolutionary distances [3
], but some specific regulators (e.g., MBF of yeast and the related E2Fs of plants and animals) are paneukaryotic, as are some of their direct target genes (DNA polymerase, ribonucleotide reductase). Coupled with experimental accessibility, this conservation of core components and connections make the yeast mitotic cycle an especially good test case for studies of network structure, function, and evolution.
To expose the underlying logic of this transcription network, a starting point is to decompose the cell cycle into its component phases (i.e., G1, S, G2, M) and link the pertinent regulatory factors with their immediate regulatory output patterns, here in the form of phasic RNA expression. One way to do this is to integrate multiple genome-wide data types that impinge on connection inference, including factor:DNA interaction data from chromatin IP (ChIP) studies, RNA expression patterns, and comparative genomic analysis. This is appealing partly because these assays are genome-comprehensive and hypothesis-independent, so they can, in principle, reveal regulatory relationships not detected by classical genetics. However, the scale and complexity of these datasets require new methods to discover and rank candidate connections, while also accommodating considerable experimental and biological noise (e.g., [14
]). Microarray RNA expression studies in budding yeast have identified 230 to 1,100 cycling genes, the upper number encompassing nearly a fifth of all yeast genes [1
]. Specifics of experimental design and methods of analysis contribute to the wide range in the number of genes designated as cycling, but there is agreement on a core set of nearly 200. Yeast molecular genetic studies have established that transcriptional regulation is critical for controlling phase-specific RNA expression for some of these genes, though this does not exclude modulation and additional contributions from post-transcriptional mechanisms. About a dozen Saccharomyces
transcription factors have been causally associated with direct control of cell cycle expression patterns, including repressors, activators, co-regulators, and regulators that assume both repressing and activating roles, depending on context: Ace2, Fkh1, Fkh2, Mbp1, Mcm1, Ndd1, Stb1, Swi4, Swi5, Swi6, Yhp1, and Yox1. These can serve as internal control true-positive connections. Conversely, a majority of yeast genes have no cell cycle oscillatory expression, and true negatives can be drawn from this group. A practical consideration is how well the behavior of a network is represented in critical datasets. In this case, cells in all cell cycle phases are present in the mixed phase, exponentially growing yeast cultures used for the largest and most complete set of global protein:DNA interaction (ChIP/array) data so far assembled in functional genomics [21
]. These data are further supported by three smaller studies of the same basic design [22
]. This sets the cell cycle apart from many other transcription networks whose multiple states are either partly or entirely absent from the global ChIP data. Equally important are RNA expression data that finely parse the kinetic trajectory for every gene across the cycle of budding yeast [1
] and also in the distantly related fission yeast, S. pombe
]. This combination of highly time-resolved RNA expression data and phase-mixed (but nevertheless inclusive) ChIP/array data can be used to assign protein:DNA interactions to explicit cell cycle phases, while evolutionary comparison with S. pombe
highlight exceptionally conserved and presumably fundamental network properties.
Many prior efforts to infer yeast transcription network connections from genome-wide data ([15
] were designed to address the global problem of finding connection patterns across the entire yeast transcriptome by using very large and diverse collections of yeast RNA, DNA, and/or chromatin immunoprecipitation data. The present work focuses instead on a single cellular process and its underlying gene network, which represents a natural level of organization positioned between the single gene at one extreme and the entire interlocking community of networks that govern the entire cell. To model regulatory factor:target gene behavior, we adapted neural networks to integrate global expression and protein:DNA interaction data.
Artificial neural networks (ANNs) are structural computational models with a long history in pattern recognition [27
]. A general reason for thinking ANNs could be effective for this task is that they have some natural similarities with transcription networks, including the ability to create nonlinear sparse interactions between transcriptional regulators and target genes. They have previously been applied to model relatively small gene circuits [28
], though they have not, to our knowledge, been used for the problem of inferring network structure by integrating large-scale data. We reasoned that a simple single-layer ANN would be well-suited to capture and leverage two additional known characteristics of eukaryotic gene networks. First, factor binding in vivo varies over a continuum of values, as reflected in ChIP data, in vivo footprinting, binding site numbers and affinity ranges, and site mutation analyses. These quantitative differences can have biological significance to transcription output by affecting cooperativity, background “leaky expression” or the lack of it, and the temporal sequencing of gene induction as factors become available or disappear. This is quite different from a world in which binding is reduced to a simple two-state, present/absent call. Neural networks are able to use the full range of binding probabilities in the dataset. Second, ANNs can give weight and attention to structural features such as the persistent absence of specific factors from particular target groups of genes. This “negative image” information is potentially important and not used by other methods applied to date [15
]. The inherent ability of ANNs to use these properties is a potential strength compared with algorithms that rest solely on positive evidence of factor:target binding or require discretization of binding measurements into a simplified bound/unbound call.
ANNs have been most famously used in machine learning as “black boxes” to perform classification tasks, in which the goal is to build a network based on a training dataset that will subsequently be used to perform similar classifications on new data of similar structure. In these classical ANN applications, the weights within the network are of no particular interest, as long as the trained network performs the desired classification task successfully when extrapolating to new data. ANNs are used here in a substantially different way, serving as structural models [33
]. Specifically, we use simple feed-forward networks in which the results of interest are mainly in the weights and what they suggest about the importance of individual transcription factors or groups of factors for specifying particular expression outputs.
Here ANNs were trained to predict the RNA expression behavior of genes during a cdc28 synchronized cell cycle, based solely on transcription factor binding pattern, as measured by ChIP/array for 204 yeast factors determined in an exponentially growing culture [21
]. The resulting ANN model is then interrogated to identify the most important regulator-to-target gene associations, as reflected by ANN weights. Ten of the twelve major known transcriptional regulators of cell cycle phase-specific expression ranked at the very top of the 204-regulator list in the model. The cell cycle ANNs were remarkably robust to a series of in silico “mutations,” in which binding data for a specific factor was eliminated and a new family of ANN models were generated. Additional doubly and triply “mutated” networks correctly identified epistasis relationships and redundancies in the biological network. This approach was also applied to two additional, independent cell cycle expression studies to illustrate generality across data platforms, and to probe how the networks might change under distinct modes of cell synchronization.
Analysis of the weights matrices from the resulting models shows that the neural nets take advantage of information about specifically disfavored or disallowed connections between factors and expression patterns, together with the expected positive connections (and weights) for other factors, to assign genes to their correct expression outputs. This led us to ask if there is a corresponding bias in the biological network against binding sites for specific factors in some expression families as suggested by the ANN. We found that this is the case, in multiple sensu stricto yeast genomes relatively closely related to Saccharomyces cerevisiae, and also in the distantly related fission yeast S. pombe. This appears to be a deeply conserved network architecture property, even though very few specific orthologous genes are involved.