Summary: A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Bacteria react to various environmental conditions by employing different modes of regulation, e.g., metabolic, translational, and transcriptional regulation. Their genes are organized into a hierarchical network of interconnected regulons, which is flexibly organized according to the environmental conditions that a cell faces (255). The expression of regulons is controlled by regulatory proteins (transcription factors [TFs]) with their concomitant DNA binding targets, which are known as TF binding sites (TFBSs). In some cases, the presence of cofactors is necessary for TF activity. In the end, the composition of regulons induced by a condition that the cell faces depends on the concentrations of active TFs. At gene promoters, one or more regulatory signals are integrated into one regulatory output. We term the function according to which regulatory output is determined under different conditions as the control logic of a promoter. The control logic is very important not only for the regulatory output of a promoter but also for motif stringency: how well does the TFBS fit the TFBS sequence that is optimal for binding by a given TF? A recent review by Balleza et al. focused mainly on regulatory network inference, regulatory network plasticity, chromosome structure, and how to make dynamical models of regulatory networks (11). Our review focuses on the mechanisms that determine the control logic of promoters, the relationship of motif stringency to regulatory output, and how these mechanisms are grounded in their evolutionary history. We will first briefly discuss the wide variety of basic mechanisms of regulation at bacterial promoters. We will then focus on TF target analyses, in particular on the experimental determination and in silico prediction of TFBSs and their distributions throughout the genome. 
Finally, the evolutionary dynamics of cis-regulatory regions are discussed, with a keen eye on the evolution of regulatory networks and its relationship to TFBS motif fuzziness and stringency.
Specifically, we will address the problem that the functionality of many in silico-predicted TFBSs can often be neither confirmed nor rejected on the basis of the experimental observations described in the literature. It can be very difficult to distinguish between DNA sequences that function as a binding site for TFs (true positives) and those that do not (false positives) on the basis of a DNA motif. Such a motif is produced from an alignment of several annotated or predicted binding sites. For instance, in statistical identifications of TFBSs, the unavoidable use of a cutoff will lead to a tradeoff between false-positive and false-negative results among the sequences close to this cutoff (215). There is a genuine need to be able to distinguish true and false TFBSs within this twilight zone. Part of the problem is that often, only a limited set of true positives outside of this twilight zone is available as input data, while no ideal negative data set exists (293). However, there is also the question of whether in the end one can truly categorize every potential TFBS as being “positive” or “negative” or if one should think about TFBS functionality in a more continuous manner. In order to tackle these matters, a deeper insight into the broad mechanistic and evolutionary frameworks of the regulatory complexity present in promoter sequences is required.
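The cutoff tradeoff described above can be made concrete with a small sketch. The scores and labels below are invented purely for illustration and do not come from any of the cited studies; the function name `confusion_at_cutoff` is likewise hypothetical. The sketch only shows how moving a score cutoff through the twilight zone trades false positives for false negatives.

```python
def confusion_at_cutoff(scored_sites, cutoff):
    """Count false positives and false negatives for a given score cutoff.

    scored_sites: list of (motif_score, is_functional) pairs (hypothetical data).
    A site is predicted to be functional when its score is >= cutoff.
    """
    false_pos = sum(1 for score, real in scored_sites if score >= cutoff and not real)
    false_neg = sum(1 for score, real in scored_sites if score < cutoff and real)
    return false_pos, false_neg


# Invented motif scores: functional and nonfunctional sites overlap
# between roughly 5 and 6.5, which plays the role of the twilight zone.
toy_sites = [
    (9.1, True), (8.4, True), (6.2, True), (5.9, False),
    (5.5, True), (5.1, False), (3.0, False), (2.2, False),
]

for cutoff in (3.0, 5.0, 6.0, 8.0):
    fp, fn = confusion_at_cutoff(toy_sites, cutoff)
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")
```

In this toy data set no single cutoff removes both kinds of error, which is the situation the text describes for sequences close to the cutoff.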
The central role of operons (multiple genes that are transcribed in a single mRNA) in prokaryotic gene regulation, and the question of which prediction methods should be used for a given organism, have been reviewed recently (35) and will not be discussed further. Also, the dependence of gene expression on whether a gene lies on the leading or lagging strand during DNA replication has been reviewed extensively (224, 236, 246), as has the role of protein phosphorylation in, e.g., the regulation of carbohydrate metabolism (70). Other mechanisms of transcriptional regulation, such as attenuation and (anti-)antitermination, have been discussed in depth as well (105, 252).
While many related reviews have focused on DNA motif discovery and the computational data integration needed to reconstruct transcriptional regulatory networks (TRNs) (112, 125, 191, 257, 259), the focus here is on the biological regulatory mechanisms that combine in promoters to yield specific gene expression outputs. Central to this review are the terms control logic and motif stringency. In other words, how are signals integrated at the prokaryote promoter, and how do these signals result in a graded regulatory response? We outline that the difference between spurious and functional TFBSs largely depends on a number of factors: (i) their location, (ii) their degeneracy, and (iii) whether the corresponding TF is local or more pleiotropic. Although in a few cases we cite eukaryote research that is relevant to the topic as well, the focus is clearly on prokaryotes. Prokaryotic transcription regulation is highly complex, and creating models that approximate its intricate reality will keep computational biologists busy for decades.
Transcription is the process of transcribing DNA into RNA (e.g., mRNA, tRNA, rRNA, and small RNAs) and is performed primarily by RNA polymerase (RNAP). Transcription consists of five phases: (i) preinitiation, (ii) initiation, (iii) promoter clearance, (iv) elongation, and (v) termination. During preinitiation, RNAP binds to the core promoter elements (−10 and −35; positions indicate the location of each sequence with respect to the transcription start site) in the upstream region (cis-regulatory region) of a gene on the genome. After RNAP binding, a transcription bubble is created between positions −10 and +2 through a process termed isomerization (36). At the start of initiation, sigma (σ) factors associate with the RNAP and allow it to recognize the −35 and −10 sequences. After the first DNA base is transcribed into mRNA, the process of promoter clearance takes place. During this process, RNAP often slips from the DNA, producing incomplete transcripts (abortive initiation). RNAP no longer slips from the DNA when approximately 23-bp transcripts are formed. The elongation step involves the elongation of the mRNA transcript until transcription termination occurs. The termination of transcription is mediated either by hairpin structures in the DNA (transcriptional terminators; Rho-independent termination) or by binding of the Rho cofactor, which dissociates the mRNA from DNA (53, 123, 220).
In the next paragraphs, we discuss the cofactors that are involved in RNAP binding, TFBSs, and transcriptional activation and repression.
Some genes are transcribed highly, while other genes are barely transcribed or even not at all. This is due in large part to the fact that transcriptional regulation takes place mainly at the initial binding of RNAP to the DNA, the isomerization process, and the earliest stages of RNAP progression along the DNA duplex (36). Because the supply of both σ-factors and free RNAP in a cell is limited, there is intense competition between promoters for the binding of the RNA holoenzyme (36, 192a).
The binding of a specific σ-subunit of RNAP plays an important role in transcriptional regulation. The three main functions of σ-factors are (i) to ensure the recognition of core promoter elements, (ii) to position the RNAP at the target promoter, and (iii) to unwind the DNA near the transcription start site (321) (Fig. 1).
One genome may encode many different σ-factors, which, in addition to specific TFs, are used to determine the transcriptional response of a bacterial cell by each one guiding the RNAP to a specific set of target genes (111). In general, bacterial housekeeping σ-factors are similar to the Escherichia coli σ70 70-kDa σ-factor (111, 226) and regulate genes that are involved in cellular growth. Several members of the σ70 factor family have been described. E. coli K-12 has five other σ70 family σ-factors besides σ70 (231), whereas Bacillus subtilis has 17 known variants of σ70 (274). Typically, housekeeping σ70 σ-factors bind to the −35 and −10 DNA sequence elements in a promoter, which are relatively conserved hexanucleotide sequences with the consensus sequences TTGACA at position −35 and TATAAT at position −10 (36). The intrinsic strength of a core promoter (the level of transcription taking place from it apart from the effects of the binding of additional TFs) is determined largely by the extent to which the core promoter elements match these consensuses (154, 157, 289). Alternative σ-factors (among which are also those of the σ54 family) often regulate a set of genes having a clearly defined function, but their regulons may also cover a broader set of target genes involved in diverse biological processes and overlap significantly with those of housekeeping σ-factors (306). A specific subfamily of σ-factors that directly incorporates signals from the extracellular environment in regulating transcription (ECF σ-factors) also exists (121). Excellent reviews of alternative σ-factors that discuss their diverse functionalities in detail are available (111, 121, 151). Diverse σ-factors are often regulated by anti-σ-factors, which inhibit their function under specific conditions (139).
Two other important sites are the extended −10 element and the UP element (Fig. 1). The extended −10 element is located directly upstream of the −10 element and comprises four nucleotides with the consensus sequence TRTG (304, 305), and the approximately 20-bp UP element is located upstream of the −35 element up to −80 nucleotides (84, 205). Such UP elements are easily spotted, as they are AT rich and seem to be particularly associated with strong promoters. The relative contributions of these elements to RNAP binding differ strongly between promoters. A particular combination of these elements could result in RNAP binding a promoter sequence too tightly, which would in turn prevent the RNAP from escaping the promoter. Currently, predictions of bacterial core promoter sequences can be performed using the following methods: position-weight matrix (PWM) scoring (137); comparative genomics approaches (294); classification by, e.g., support-vector machines (107, 295); and a recently developed triad algorithm that incorporates UP element detection (66) (see also Table 2 for an overview of methods that deal with promoter prediction).
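As a rough illustration of how the match between a core promoter and the consensus sequences quoted in the text (TTGACA, TATAAT, and TRTG) could be quantified, the following minimal sketch simply counts agreeing positions. This is a deliberately crude stand-in for PWM scoring, which weights positions unequally; it is not the method of any cited study. The function name and the example hexamers are hypothetical.

```python
# IUPAC codes needed for the consensus sequences quoted in the text (R = A or G)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}

MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"
EXTENDED_MINUS_10 = "TRTG"


def consensus_matches(site, consensus):
    """Return the number of positions at which `site` agrees with `consensus`."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have equal length")
    return sum(base in IUPAC[code] for base, code in zip(site.upper(), consensus))


# Hypothetical elements, not taken from a real promoter
print(consensus_matches("TTGACA", MINUS_35))         # 6 (perfect match)
print(consensus_matches("TAGGAT", MINUS_10))         # 4
print(consensus_matches("TGTG", EXTENDED_MINUS_10))  # 4
```

A count like this ignores spacing between the elements and the unequal importance of individual positions, both of which a real promoter predictor would need to take into account.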
In addition to these general methods that a cell uses to regulate gene expression, the cell utilizes specialized TFs that bind to specific DNA recognition sequences (TFBSs). TFBSs for a specific TF can differ in nucleotide sequence and composition, but they can be represented by a consensus DNA sequence motif, i.e., the representation of the target variability of the TF. Below, the different representations of sequence motifs are discussed.
The location and nucleotide composition of TFBSs determine in large part whether a TF represses or activates the expression of a certain gene. The length of bacterial TFBSs is usually between 12 and 30 bp, and they often appear in the form of direct repeats or palindromes, which may facilitate the dimeric binding of TFs (247). As most bacterial TFs have a helix-turn-helix domain and act as homodimers, the motifs of their TFBSs are usually structured as a “dyad” (spaced motif) with a spacing of a given number of uninformative base pairs (301). In some cases where TFBSs exist as direct repeats or palindromes, half-sites (with only one of the repeated segments or half of the palindrome) also have some functionality (168). TFBSs can be located at various positions relative to the canonical −35 and −10 promoter sequences ranging from far upstream to within and downstream of the promoter. Regulatory motifs are usually not strictly specific (as are the DNA motifs cut by restriction enzymes) but are only partially conserved and thus appear rather “fuzzy” (100, 266).
The thermodynamic state of TF proteins can be described using a three-state model (169, 283): (i) freely diffusing in three dimensions as monomers, (ii) unspecifically bound as mono- or oligomers to DNA by general electrostatic interactions and thus diffusing along the DNA backbone in one dimension, and (iii) specifically bound to a binding site at a local energy minimum through hydrogen bonds as well as hydrophobic and electrostatic interactions. Switching between the latter two states involves a conformational change of the TF protein, which is triggered by the molecular recognition of an energy minimum, most often through the binding of a protein α-helix to the major groove of the DNA (52). The combination of these three states enables the TF to find its target sites and bind to them in relatively little time (169). For a few model systems that were studied, the binding energy itself seems to be well approximated by the sum of the independent contributions of a small number of TF binding nucleotides (88, 221). The binding probability depends on the binding energy in a sigmoid way, thus generating a threshold between weak binding and strong binding that is exemplified by an insensitivity of the binding probability if the binding energy is between weak and strong binding (169).
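The additive binding energy and the sigmoid dependence of binding probability on that energy can be sketched as follows. The logistic form used here, with an assumed chemical potential `mu` expressed in units of kT, is one common way to write such a sigmoid and is an assumption of this sketch, as are the energy matrix values and function names; none of them are taken from the cited work.

```python
import math


def binding_energy(site, energy_matrix):
    """Sum the independent per-position energy contributions (units of kT)."""
    return sum(energy_matrix[i][base] for i, base in enumerate(site))


def binding_probability(energy, mu=0.0):
    """Logistic (sigmoid) occupancy; `mu` is an assumed chemical potential in kT."""
    return 1.0 / (1.0 + math.exp(energy - mu))


# Hypothetical 4-position energy matrix: 0 kT for the preferred base,
# a 2 kT penalty for any other base at that position.
preferred = "TGCA"
energy_matrix = [{b: (0.0 if b == p else 2.0) for b in "ACGT"} for p in preferred]

for site in ("TGCA", "TGCC", "AGCC", "ACGT"):
    energy = binding_energy(site, energy_matrix)
    print(site, energy, round(binding_probability(energy, mu=3.0), 3))
```

With these toy numbers, sites with zero or one mismatch stay mostly occupied while sites with several mismatches are mostly empty, and the probability changes little at either end of the energy range, where the curve is flat.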
Some TFs function to repress transcription, while others activate transcription. Still others function as either activators or repressors, often according to the positioning of the TFBS relative to the σ-factor binding site in the target promoter (231) (see Fig. 1 for a summary of the main mechanisms). The binding and release of repressors and activators themselves are often controlled by cofactor binding. Cofactors are molecules that can range widely in size and nature, from small ions, nucleotides, covalently attached phosphate moieties, and sugars to peptides or whole proteins (2, 86, 118, 285). Although most activators function by first binding to the promoter DNA before interacting with RNAP, some activators (such as E. coli MarA and SoxS) also bind to free RNAP in the cytosol prior to binding their TFBSs (110, 200).
There are four main modes in which TFs have been described to mediate repression (36, 181, 247) (Fig. 1A to C): (i) repression by steric hindrance, often by binding of the repressor between or on the core promoter elements; (ii) repression by blocking of transcription elongation, often by binding at the start of the coding region (roadblock mechanism); (iii) repression by DNA looping, with binding sites often both upstream and downstream of the core promoter (in this case, an interaction between two monomers of the same TF is possible only if both TFBSs are spaced correctly); and (iv) repression by the modulation of an activator. In the latter case, a repressor binds to a TFBS that (partly) overlaps a different TFBS of an activator. The binding of the repressor to its site will then prevent the binding of the activator to its respective TFBS. An example of such an interaction is that between the CytR and CRP (for a review, see reference 36).
Similarly, four modes of activation by TFs have been described (12, 36, 181, 247, 279) (Fig. 1D to F): (i) class I activation, in which the TF binds upstream of the core promoter and interacts with the flexible α-subunit of RNAP; (ii) class II activation, in which the TF binds the DNA directly adjacent (mostly upstream) to the core promoter and promotes σ-factor binding; (iii) activation by DNA conformational change, in which the TF binds to the core promoter to enable it to be bound by a σ-factor, often by twisting the DNA helix; and (iv) activation by the modulation of a repressor, alleviating the repression effect. An example of the latter mode (also termed antirepression) was recently discovered for the B. subtilis competence activator ComK, a minor groove binding protein that binds adjacent to the repressors Rok and CodY at its own comK promoter (279). Although ComK binding to the DNA does not result in the physical displacement of Rok and CodY, it removes the repression effect and thus activates the expression of the gene (Fig. 2).
Although it seems obvious that spatial constraints on TFBS placement within promoters should exist, relatively few detailed experimental studies have been performed to specify these (187). Most repressor sites are located between positions −60 and +60 relative to the transcriptional start site (55, 83, 192, 212), although repressors often bind to sites much further upstream, as in the case of, e.g., DeoR repression of the E. coli ula operon (167). The degree of repression depends significantly on the TFBS position relative to that of the promoter (58). Activator sites are usually present upstream of or next to the −35 core promoter element (247) (Fig. 3 and Table 1). Class I activators are generally bound between positions −60 and −95, while class II activator sites are adjacent to, or overlapping with, the −35 element (12). In a recent study by Cox and coworkers (58), regulatory effects of the activators LuxR, which regulates luminescence genes in Vibrio fischeri, and AraC, regulating arabinose metabolism in E. coli, were tested in vivo using 288 artificially constructed promoters that were inserted into a plasmid with a luciferase reporter gene. The regulatory effects of activator TFBSs located downstream of the −35 core promoter element appeared to be negligible compared to the effects of upstream sites. This work clearly indicates that control logic can be inferred for a number of regulators involved in metabolism.
Other spatial constraints arise because activation or repression often functions only if TFs bind to specific positions on the promoter DNA helix: in general, a TF bound to its TFBS has to be on the same side of the DNA duplex as the bound RNAP to fulfill its function. In two independent studies, Ushida and Aiba (298) and Gaston and coworkers (95) showed that the extent to which the well-studied E. coli catabolite repression protein (CRP) was able to activate gene expression on melR and lacZ promoters was dependent largely on the helical face to which it bound, which had to be identical to the face to which RNAP bound (Fig. 4). Therefore, within the region between positions −60 and −95, class I activators were mostly functional only around positions −61, −71, −81, and −91 (12), at intervals that match a single helical turn (10.5 bp) of B-form DNA (Table 1). For the Lactococcus lactis MG1363 pleiotropic regulators CodY and CcpA, the helicity of the TFBS compared to the transcription start site was shown to be important for the regulation of target genes as well (68, 334).
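The helical-face constraint can be illustrated with a short calculation. The sketch below uses the 10.5-bp helical repeat and the reference position −61 quoted in the text; the 60-degree tolerance for calling two positions "the same face", as well as the function names, are assumptions made for illustration only.

```python
HELICAL_REPEAT_BP = 10.5  # bp per turn of B-form DNA, as quoted in the text


def helical_phase_deg(position, reference=-61.0):
    """Rotation (0 to 360 degrees) of `position` around the helix relative to `reference`."""
    turns = (position - reference) / HELICAL_REPEAT_BP
    return (turns % 1.0) * 360.0


def same_face(position, reference=-61.0, tolerance_deg=60.0):
    """True if `position` lies within `tolerance_deg` of the reference face.

    The tolerance is an assumed value, not an experimentally derived one.
    """
    phase = helical_phase_deg(position, reference)
    return min(phase, 360.0 - phase) <= tolerance_deg


for pos in (-61, -66, -71, -76, -81, -91):
    print(pos, round(helical_phase_deg(pos), 1), same_face(pos))
```

Under these assumptions, positions −61, −71, −81, and −91 fall on the same face, whereas sites shifted by about half a turn (for example −66 or −76) end up on the opposite side of the duplex.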
In many cases when a TFBS is positioned at a relatively long distance from core promoter elements, this has a specific regulatory function. For example, the fact that in B. subtilis, the ComK binding site (K-box) at the promoter of the comK gene itself is positioned one or two helical turns further upstream than K-boxes in other promoters provides a threshold for autoactivation. This can be relieved by the adjacent binding of DegU (116, 117). Because DegU binding stimulates comK transcription, the concentration of active ComK rises, and ComK can then activate the transcription of the comK gene without the additional help of DegU. DegU thus functions as a priming protein that can turn on an autostimulatory feedback loop (Fig. 2).
Based on RegulonDB, version 6.3 (94), a large percentage (about 65%) of the transcriptional units (operons or single genes) of E. coli K-12 that are annotated to be regulated by at least one TF are regulated by more than one TFBS for a given TF. Also, genes are often regulated by more than one different TF (31% of the total genes for E. coli K-12). For example, activators and repressors can antagonize each other at a particular promoter sequence (competitive regulation) (38, 101, 124). However, multiple different activators also frequently work together to induce transcription (cooperative regulation) (36, 38), each regulated by a different cellular or environmental signal (6). This is the case for the B. subtilis ackA promoter, the expression of which is governed by cellular levels of both glucose and branched-chain amino acids through activation by CcpA and CodY (210, 271). Sometimes, TFs contribute to activation independently in a combination of class I and class II interactions. In other instances, multiple activators interact with the DNA in a cooperative manner. In yet other cases, one activator functions to counter the function of a repressor while the other one performs the direct activation (27, 28, 36). Generally, two modes of cooperative binding exist (124): (i) homocooperative binding, in which multiple copies of the same TF bind cooperatively to multiple instances of the same TFBS in one promoter region, and (ii) heterocooperative binding, in which different TFBSs in the same promoter are cooperatively bound by different TFs. The cooperative or competitive action of multiple TFs can result in complex regulatory events at cis-regulatory regions, as in the above-mentioned case of the comK promoter (Fig. 2).
Boolean logic gates such as AND, OR, and NAND (Fig. 5) can be accomplished with prokaryotic promoters by relatively simple combinations of interactions between two TFs and RNAP at a promoter (27, 28, 38, 275). For example, the AND gate, in which transcription occurs only if both of two active TFs are present at high concentrations, can be produced by two different activator TFBSs acting cooperatively. The OR gate, in which transcription occurs when either of two active TFs is present at a high concentration, can be produced by two activator TFBSs functioning independently on the same target. The NAND (not and) gate, in which transcription is repressed only when both of two active TFs are present at high concentrations, can be produced by a strong promoter regulated by two weak repressor TFBSs acting cooperatively (and requiring this cooperation to attain a significant repressive effect) (38). Thermodynamic models reported by Buchler et al. suggested that more complex Boolean logic gates (EQU and XOR) can also be attained (Fig. 5). An XOR (exclusive or) gate, in which transcription occurs only when exactly one of the two active TFs acting on a promoter is present at high concentrations, for example, can be accomplished by two different TFs acting independently as activators on two strong-affinity TFBSs while at the same time acting cooperatively as repressors on two weak-affinity TFBSs.
Finally, an EQU (equals) gate, in which transcription occurs only when the active concentrations of two TFs are approximately equal, can be produced by two different TFs acting as repressors on two strong-affinity TFBSs interfering with a strong promoter and at the same time acting as derepressors on each other's sites. Buchler et al. (38) and, later, Bintu et al. (27) also suggested options involving either multiple alternative core promoters acting on a gene or repression by DNA looping similar to the mechanisms that have been described for the E. coli lac operon (160).
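To illustrate how site affinity and cooperativity alone can shift a promoter between AND-like and OR-like behavior, the following sketch enumerates the four occupancy states of two activator sites. It is a simplified toy in the spirit of the thermodynamic models discussed here, not a reproduction of them: it assumes that any bound activator suffices for activation, and the concentrations, dissociation constants, cooperativity values, and function names are all hypothetical.

```python
def activation_probability(conc_a, conc_b, k_a, k_b, omega):
    """Probability that at least one of two activator sites is occupied.

    Statistical weights: empty = 1, A bound = qa, B bound = qb,
    both bound = omega * qa * qb, where q = concentration / dissociation
    constant and omega is the cooperativity between the two TFs.
    Simplifying assumption: any bound activator is enough to recruit RNAP.
    """
    qa, qb = conc_a / k_a, conc_b / k_b
    bound = qa + qb + omega * qa * qb
    return bound / (1.0 + bound)


LOW, HIGH = 0.01, 1.0  # arbitrary concentration units (hypothetical)


def truth_table(k, omega):
    """Activation probability for each combination of low/high TF inputs."""
    return {
        (a, b): round(activation_probability(a, b, k, k, omega), 2)
        for a in (LOW, HIGH)
        for b in (LOW, HIGH)
    }


and_like = truth_table(k=10.0, omega=1000.0)  # weak sites, strong cooperativity
or_like = truth_table(k=0.1, omega=1.0)       # strong sites, independent binding

print("AND-like:", and_like)
print("OR-like: ", or_like)
```

With weak, strongly cooperative sites, the output is high only when both inputs are high; with strong, independent sites, one high input is already sufficient. Gates such as NAND, XOR, or EQU would require adding repressor states to the same kind of state enumeration.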
Alternatively, Hermsen et al., using similar thermodynamic models, predicted that complex Boolean logic gates (including NOR, ANDN, and ORN) (Fig. 5) can also be accomplished by the cooperative binding of TFs in complex promoters with multiple TFBSs when, in the cis-regulatory region, two modules that each contain an array of binding sites overlap and thus compete for cooperative binding (124). The affinity of binding of σ-factors to the core promoter and of TFs to the different TFBSs determines the precise logic function governing the conditions for transcriptional activation or repression. For example, an EQU gate requires a strong core promoter to facilitate transcription when the concentrations of both TFs are low, while two homocooperative repression modules mediate repression only when one of the two TFs is present at a sufficient concentration (124). When both TFs are present in high concentrations, this repression is countered by a heterocooperative activation module containing both TFs; this heterocooperative array of sites must then have a higher cumulative binding affinity than do the overlapping homocooperative repression modules. A problem with the predictions described by Hermsen et al. appears to be that the modules they proposed lead to an overcrowding of TFBSs within promoters that seems quite unrealistic.
The biological relevance of these theoretical studies has still to be investigated, as few experimental efforts have yet focused on identifying complex logic gates regulated by cooperative TF binding. The most extensive experimental work in this respect was done by Kaplan et al., who mapped the control logic of 19 E. coli sugar metabolism-related genes, which are regulated by both CRP and a specific sugar regulator, in considerable detail (148). They did this by creating a map of gene expression levels under various concentrations of cyclic AMP (the metabolite determining CRP activity) and the sugar involved in activating the specific sugar regulator. Because the conditions were chosen in such a way that the expression depended almost exclusively on the concentrations of these two input signals, they could interpret the shape of the resulting map to infer the control logic of the promoter (147, 148). Interestingly, those authors found the sugar gene promoters to contain diverse control logics, including quite complex ones such as that of fucR, which approximates the XOR gate, by displaying reduced expression levels when both input signals are high and when both are low (148). Another promoter region that is interesting for future study in this respect would be the E. coli gltBDF operon, which is involved in one of the two main pathways of ammonia assimilation in this organism. This operon was recently shown to be regulated by multiple global regulatory proteins of E. coli (Lrp, IHF, CRP, and ArgR) (228).
Although the complex control logic that underlies cooperative regulation has not yet been described for modeling efforts, elementary control logic represented in stoichiometry matrices was described by Klamt and coworkers, who created a modeling tool, CellNetAnalyzer (155). In the end, an understanding of the different ways in which the different types of control logic can be produced by prokaryotic promoters can both help predict the input-output relationships between factors involved in promoter regulation for purposes of transcriptional network reconstruction (78, 259) and help synthetic biology efforts in the engineering of artificial biological circuits (275).
Although it is not our goal to give in-depth descriptions of all other transcriptional regulatory mechanisms, it is worthwhile to give a short overview of additional regulation mechanisms that add to the complexity of transcriptional regulation. Therefore, we will briefly touch on the regulatory mechanisms of promoter escape regulation, transcriptional interference, DNA methylation, chromosome supercoiling, and histones, as well as the posttranscriptional regulation mechanisms of mRNA degradation, riboswitches, and short noncoding RNAs. For details, we refer to some excellent reviews that have recently been written on these topics.
A large part of cellular transcriptional regulation takes place at the stage of transcription initiation, in which the bound RNAP has to escape the promoter to advance to downstream regions of the DNA template (132). Besides the possibility of regulation by TFs binding upstream of the core promoter elements, RNAP promoter escape can also be regulated by specific factors which bind to the RNAP itself. Recently, it was shown for one such promoter escape-regulating factor, GreA (129, 133), that it can also be sequence specific. In a microarray study comparing a strain expressing wild-type GreA with a strain carrying an inactivated version of the same factor, Stepanova et al. identified 126 genes that were specifically transcribed in the presence of wild-type GreA (282). The mechanism by which this specificity is mediated is not yet clear.
Another way in which transcription elongation can be regulated for both σ-factors and TFs is when different transcriptional activities interfere with one another in cis by the collision of RNAPs bound to, or initiated from, different promoters (268). This process is called transcriptional interference and can occur in convergent promoters, tandem promoters, and overlapping promoters. Convergent promoters are promoters producing converging transcripts, the 5′ regions of which overlap at least partially; tandem promoters are promoters in which one promoter is placed upstream of the other but transcribing in the same direction, and in overlapping promoters, the RNAP binding sites are at least partially overlapping. Transcriptional interference could very well be a widespread mechanism of gene regulation. An analysis of the 4,462 E. coli promoters in the RegulonDB database revealed 166 tandem promoters, 54 convergent promoters, and 435 promoters that are probably overlapping (268).
Modifications to the structure of the DNA itself can also function to regulate transcription. DNA methylation is such an epigenetic regulation mechanism (48, 184, 243). The best-studied bacterial DNA methyltransferases are the Caulobacter crescentus CcrM methyltransferase, which methylates the N6-adenine of GANTC (243), and the E. coli Dam methyltransferase, which methylates the N6-adenine of GATC sequences (182). Methylated GATC sequences within cis-regulatory regions can increase, decrease, or have no effect on transcription initiation efficiency (182). Also, Dam methyltransferases regulate gene expression through the formation of DNA methylation patterns (184), which appear because regulatory proteins compete with Dam for binding to the DNA at Dam sites and prevent their methylation. DNA methylation patterns can both repress and activate gene expression by either enhancing or blocking the binding of either repressors or activators at promoters (184).
Regulation at the level of DNA structure can also take place at the level of overall chromosome organization. Such regulation provides a more global control of transcription than the control of regulators that are specifically dedicated to a relatively small set of gene promoters (199). Crucial in regulating bacterial chromosomal organization are the histone-like nucleoid proteins HU, Fis, H-NS, StpA, IHF, and Dps (208, 288). Besides their global role in regulating supercoiling and chromatin dynamics, at least some of them may also act on a local level in a gene-specific fashion. Note that Fis and IHF were also shown to bind to specific DNA recognition sequences (208) and can have different regulatory effects (activation or repression) when bound to different sites within the same promoter (37). Also, higher-level macrodomains that are related to the transcriptional response to supercoiling and correspond to the distribution of binding sites for DNA gyrase, a topoisomerase involved in creating negative supercoils, exist on bacterial chromosomes (296). However, the domains of higher levels of transcriptional activity are not caused by superhelicity only; replication polarization of the chromosome also plays a major role (1, 183).
Although outside the realm of transcriptional regulation, the regulation at the posttranscriptional level should not be neglected. Riboswitches are regulatory domains that reside in the noncoding regions of mRNAs, where they bind metabolites and control gene expression (14, 195, 308, 319). That mRNA stability can be of high importance in regulating transcript abundance is perfectly illustrated by a study by Selinger et al., who measured mRNA half-lives for 1,036 open reading frames (ORFs) in E. coli, which appeared to range between 1 and 2,084 min, with the majority of half-lives between 2 and 20 min, while degradation speeds differed according to the lengths of the polycistronic transcripts (264). Finally, it was also discovered that short noncoding RNAs, first thought to be important for gene expression regulation in eukaryotes only, are also prevalent in prokaryotes and function, for example, by binding specifically to certain mRNAs to repress their translation (108, 109, 203, 244).
Determining target genes of transcriptional regulators is a field that has evolved quite rapidly in the past years. The reasons for this are emerging high-throughput methodologies for transcriptome analysis such as DNA microarrays (75) and mRNA sequencing (317), which allow the monitoring of thousands of transcripts simultaneously, and chromatin immunoprecipitation (ChIP) approaches, with which dozens of TF-DNA interactions can be discovered (233). Current work is in most cases focused on the association of targets (together forming a “regulon”) with their transcriptional regulator. This is done, for instance, by determining a regulon from DNA microarray targets querying a knockout of a transcriptional regulator. TRNs are reconstructed from DNA microarray data and literature data as primary data sources. Other approaches have integrated ChIP-on-chip protein-DNA interaction data, protein-protein interaction data, proteomics, metabolomics, and pathway information (162, 328, 332). Here, we discuss the techniques that are focused primarily on regulon reconstruction and how control logic is used in current approaches.
Regulons are usually identified using transcriptome comparisons between wild-type and TF knockout strains grown under one or a few conditions (329). More recently, time-series transcriptome analysis has also been performed for this purpose, e.g., the time-resolved determination of the CcpA regulons of B. subtilis and L. lactis (188, 334). From such experiments, groups of genes or operons that respond to specific environmental perturbations can be identified, which are referred to as stimulons (247). To define such stimulons, the level of gene expression of an unperturbed control is compared to that under a condition that stimulates a certain cellular response using DNA microarrays. If the mRNA is isolated under a specific condition, such experiments provide snapshot information of the regulatory role of TFs under those specific conditions (329).
In order to detect associations based on microarray data, coexpression or reciprocal expression between a TF and its target is required. The prerequisite of (anti-)correlated expression patterns is that there is an autoregulatory loop for the TF; i.e., the TF regulates its own expression. These autoregulatory loops are an important basic regulatory mechanism, especially for the negative regulatory loop, where the cell ensures that the expression of a given TF is downregulated after the TF has been produced. For E. coli K-12, about 50% of the TFs have negative autoregulatory loops (248). Therefore, one can conclude that for at least 50% of the TFs, a clear (anti-)correlation in expression patterns cannot be expected if other factors such as detection limits of the experimental technique are also taken into account.
In any case, additional experiments are required to distinguish between direct and indirect regulatory effects when the results of such experiments are analyzed. Some clustering algorithms that can quite effectively extract gene expression modules from perturbation data have been developed, such as the ENIGMA tool developed by Maere et al. (193). An advantage of the ENIGMA tool over most earlier biclustering tools is that it can deal with partial coexpression between genes, i.e., cases in which genes show correlated expression only under a subset of conditions.
In order to globally identify the genomic regions that are occupied by a DNA binding TF, ChIP experiments are also used. In ChIP experiments, the chromosomal DNA is cross-linked to a tagged regulator protein, sonicated to produce small fragments, and then immunoprecipitated with an antibody against a given TF or its tag (234). In ChIP-on-chip, this enrichment of DNA binding to a certain TF is then compared to that of a control containing nonenriched chromosomal DNA with microarray analysis to reveal the binding sites of that TF on the genome (233, 329). Recently, a novel method called ChIP-Seq was also developed, in which ChIP is coupled to next-generation massively parallel sequencing technology (145, 198). Typically, only short 25- to 50-nucleotide reads (“tags”) are sequenced, and genomic regions with probable binding sites are identified by the high densities of such tags in the output (146, 299). The regulatory motif that characterizes a set of TFBSs can be detected using both gene expression (DNA microarray and mRNA sequencing) and ChIP-Seq or ChIP-on-chip data (often from genome tiling microarrays). For these methods, either (i) the cis-regulatory regions of genes with large differences in transcription rates between the respective TF knockout and its wild type in a microarray experiment are pooled or (ii) the cis-regulatory regions are precipitated with a certain TF, employing computational methods to identify overrepresented oligonucleotides in these sequences (112, 191). Tiling microarrays can also be used to obtain more precise information concerning the start of transcription, also referred to as promoter mapping (56, 242).
Other methods focus directly on identifying the DNA binding specificity of a TF and can be used to reconstruct regulons by using the resulting regulatory motifs to predict the binding sites of a given TF computationally. One such method is termed systematic evolution of ligands by exponential enrichment (SELEX). In SELEX, one starts with a random pool of oligonucleotides, after which strongly bound oligonucleotides are enriched by multiple cycles of target binding, selection, and DNA amplification (73). Although the standard SELEX method can easily be used to find the optimal consensus sequence of a TFBS motif, it fails in practice to provide a good data set for reconstructing a high-resolution motif of its DNA binding specificity because the oligonucleotide pool is too enriched for the most strongly bound sites (177). Fortunately, modifications to the protocol make it possible to obtain the needed amount of low- and medium-affinity sequences (177, 250). An even more promising, recently developed approach uses protein binding microarrays, in which TF fusion proteins are bound on double-stranded DNA microarrays containing many different DNA sequence variants of a given length (23, 24, 39). Binding of the TF to spots can then be detected with fluorescently labeled antibodies against the protein to which it is fused. The method can be used in an impressively high-throughput manner to determine the DNA binding specificities of many TFs (331).
More traditional methods also reveal information on the presence of TFBSs in promoter DNA sequences. An example is DNase I footprinting. In this technique, a DNA fragment is allowed to interact with a DNA binding protein, after which the complex is partially digested with DNase I (93). The bound protein protects the region of the DNA to which it binds from DNase digestion. Subsequent electrophoresis identifies the region of protection as a gap in the background of digestion products (173). Another traditional method is the electrophoretic mobility shift assay, in which a protein-DNA mixture is separated on a gel and compared to a DNA-only control. One can then see if the protein binds the DNA: in this case, the DNA band from the protein-DNA mixture will be less mobile than that of the DNA-only control (69).
A major issue in TFBS motif discovery is the way in which motifs are represented (Fig. 6). Many different representations exist, and the choice is often determined by the level of accuracy, simplicity, interpretability, representational power, or computational convenience (191) (see Table 2 for an overview of methods involved in TFBS discovery and visualization). Probably the simplest way of motif representation is the use of a consensus sequence of preferred nucleotides (A, C, G, and T). Such a consensus sequence can be represented in a strict manner, in which case it represents only the optimal sequence, or degeneracy can be built in (e.g., R is purine, Y is pyrimidine, S is strong, W is weak, K is keto, M is amino, and N is any nucleotide, according to IUPAC nomenclature [http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html]) (63), in which case a limited amount of information can be represented on the proportions of nucleotides at the given positions. PWMs are currently the most common model for identifying TFBS motifs and are more precise than consensus-based representations (284, 293). In PWMs, the nucleotide observed at a position is assumed to be independent of the nucleotides at other positions. Motifs are visualized conveniently by sequence logos consisting of an ordered stack of letters in which the height of each letter indicates the amount of information that the motif contains at that position (59, 262).
As a final critical note, a consensus sequence, PWM, or sequence logo does not necessarily convey all biologically relevant features of a DNA sequence that enables it to be bound by a TF. It is merely the result of determining the overrepresentation of nucleotides in a number of cis-regulatory regions of coregulated genes.
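The relation between a strict consensus, a PWM, and the per-position information content shown in a sequence logo can be made concrete with a minimal Python sketch. The aligned sites, the pseudocount of 0.5, and the uniform background are assumptions chosen for illustration only; they are not taken from any of the cited tools.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# Hypothetical aligned binding sites, invented for illustration only.
SITES = ["TTGACA", "TTGACA", "TTGACT", "TTTACA", "CTGACA", "TTGATA"]


def count_matrix(sites):
    """Count each nucleotide at each aligned position."""
    width = len(sites[0])
    return [Counter(site[i] for site in sites) for i in range(width)]


def consensus(sites):
    """Strict consensus: the most frequent nucleotide at each position."""
    return "".join(col.most_common(1)[0][0] for col in count_matrix(sites))


def frequency_matrix(sites, pseudocount=0.5):
    """Per-position nucleotide frequencies, smoothed with a pseudocount."""
    total = len(sites) + pseudocount * len(ALPHABET)
    return [
        {b: (col[b] + pseudocount) / total for b in ALPHABET}
        for col in count_matrix(sites)
    ]


def log_odds_matrix(freqs, background=0.25):
    """PWM as log2 odds of motif frequency over a uniform background."""
    return [{b: math.log2(col[b] / background) for b in ALPHABET} for col in freqs]


def information_content(freqs):
    """Bits per position (2 minus Shannon entropy), as drawn in a sequence logo."""
    return [2.0 + sum(p * math.log2(p) for p in col.values()) for col in freqs]
```

Each column of the frequency matrix is treated independently, which is exactly the independence assumption of PWMs noted above. Fully conserved positions receive the highest information content, and the strict consensus discards all information about the minor nucleotides.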
The identification of the sequence motifs that constitute the range of sequences functioning as TFBSs for a certain TF in a particular genome remains a challenge in computational biology, and a large array of options has been exploited to predict such motifs in silico (61, 71, 112, 191, 218, 257, 303) (see Table 2 for an overview of methods involved in TFBS discovery and visualization). In general, one starts out with a set of DNA sequences that are a priori believed to be coregulated and therefore likely to be bound by one or more regulatory proteins (329). This list of genes can be determined, e.g., based on candidates that are differentially expressed in a DNA microarray experiment querying a perturbation or by determining coexpression over a compendium of microarray data (see above). Computational algorithms are then used to identify the motifs that could be responsible for this binding of TFs (191). Finally, the motifs that are found to be overrepresented in the DNA sequences of the coregulated genes can be used to search the genome for additional putative TFBSs that match the motif. Recently, in a study reported by Westholm and coworkers (314a), where a meta-analysis of predicted TFBS distributions across the Saccharomyces cerevisiae genome was performed, it was demonstrated that there are significant numbers of TFBS motifs for which (a combination of) location and orientation are important for functionality. They also provided a Web tool (ContextFinder) that will allow researchers to perform such an analysis on a regular basis.
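The final step of this workflow, searching a sequence for additional sites that match a motif, can be sketched as a sliding-window PWM scan over both strands. This is a minimal illustration: the training sites, the test sequence, and the score threshold of 8.0 are hypothetical, and real tools typically derive thresholds from a background model rather than fixing them by hand.

```python
import math

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical training sites and sequence, invented for illustration only.
SITES = ["TTGACA", "TTGACA", "TTGACT", "TTTACA", "CTGACA", "TTGATA"]
SEQUENCE = "GGCGTTGACAGGCCTGTCAACC"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Log2-odds PWM from aligned sites, assuming a uniform background."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in ALPHABET
        })
    return pwm


def score(pwm, word):
    """Sum of per-position log-odds (positions are treated as independent)."""
    return sum(col[base] for col, base in zip(pwm, word))


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan(pwm, sequence, threshold):
    """Return (start, strand, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, word in (("+", window), ("-", reverse_complement(window))):
            s = score(pwm, word)
            if s >= threshold:
                hits.append((start, strand, s))
    return hits
```

Calling `scan(build_pwm(SITES), SEQUENCE, 8.0)` reports one hit on each strand. Recording strand and position for every hit is what makes a subsequent analysis of location and orientation possible.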
The basic algorithmic approaches that have been used thus far to identify DNA motifs can be grouped into two main categories: enumerative or word-based methods and probabilistic methods (61, 71, 112, 191, 218). Generally speaking, probabilistic methods are more appropriate for finding motifs in prokaryotes, as they are better suited to identifying longer sequence motifs in terms of computational cost (61, 323). However, in contrast to enumerative methods, they do not always find the global optimum in their search space.
Enumerative methods exhaustively catalogue DNA oligonucleotide words, which are then scored by statistical significance on a set of reference sequences to identify the most significantly overrepresented motif strings of a certain length (191). Multiples of these string-based motifs can then be merged into one approximate motif, if necessary. van Helden et al. developed the oligonucleotide analysis motif-finding algorithm based on this approach (300). Later, they adapted their method for the analysis of bipartite dyad motifs with a low-information-content linker region (301), which are characteristic of dimeric TFs and require algorithm alterations for detection (see also references 25, 50, and 315). While their method is exhaustive, its detection range is relatively limited, identifying patterns mainly with one or more highly conserved cores. An advantage of the oligonucleotide analysis and dyad analysis tools is that they are integrated into a wide collection of modular tools (RSAT) by which, for example, the string-based motifs that they produce can also be converted to PWMs (291, 297). A general limitation of enumerative methods is that searching for long sequence motifs is computationally expensive, and exhaustive searches become impractical for motif lengths longer than 10 nucleotides (232). One method to solve the problem is the use of suffix trees, as introduced by Sagot (254), which effectively reduce the size of the search space, making search time exponential with respect not to motif length but to the number of mismatches allowed in the motif. The well-known motif-finding algorithms Weeder and MITRA are also equipped with such suffix trees (81, 229). Recently, hybrid algorithms have also been proposed, in which probabilistic models are incorporated in dictionary-based methods related to enumerative algorithms (41, 253, 309).
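The core idea of the enumerative approach, counting every word of a fixed length and ranking words by how strongly they exceed a background expectation, can be sketched as follows. This is a simplified illustration and not the statistic used by the oligonucleotide analysis tool: the z-like score, the pseudocount, and all sequences are assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical upstream regions of coregulated genes (a word is planted in each)
# and hypothetical background sequences, invented for illustration only.
FOREGROUND = ["ACGTTGTGAACC", "GGTGTGATTCA", "CATGTGAGGT"]
BACKGROUND = ["ACGTACGTACGTAGCTAGCTAGGATCCA", "TTGACCATGGCATGCAAGCTTGGC"]


def count_words(sequences, k):
    """Count all overlapping words of length k on the given strand."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented_words(foreground, background, k, pseudocount=1.0):
    """Rank foreground words by (observed - expected) / sqrt(variance)."""
    fg = count_words(foreground, k)
    bg = count_words(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_words = 4 ** k
    ranked = []
    for word, observed in fg.items():
        # Background word probability, smoothed so unseen words are not zero.
        p = (bg[word] + pseudocount) / (bg_total + pseudocount * n_words)
        expected = p * fg_total
        z = (observed - expected) / math.sqrt(expected * (1.0 - p))
        ranked.append((word, observed, expected, z))
    return sorted(ranked, key=lambda r: r[3], reverse=True)
```

Because the number of possible words grows as 4^k, the table of counts quickly becomes sparse for long words, which is one way to see why exhaustive searches become impractical beyond roughly 10 nucleotides.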
Probabilistic methods mostly first develop a probabilistic model (mostly a PWM) of the sequence data and then optimize it to find motifs common to multiple input sequences. Two algorithms frequently used for optimization are the expectation-maximization (EM) algorithm (43, 67, 172) and Gibbs sampling (98, 171). EM algorithms start off with a guess PWM as an initial motif model, consisting of a single oligonucleotide subsequence (n-mer) and background oligonucleotide frequencies. For each n-mer in the target sequence, the probability that it was generated by the motif instead of chance effects in the background sequence is calculated. Subsequently, the algorithm iterates between calculating a new motif model based on the old model plus the added motif sequences and calculating the probabilities of n-mers in the target sequence given this model (43, 67, 172) until a convergence criterion is reached. A disadvantage of the EM algorithm is that it is a local optimization method that is sensitive to the initialization point. The well-known motif discovery tool MEME, which is based on the EM algorithm, largely avoids local maxima by performing a single iteration for each n-mer in the target sequences and iterating the best motif from this set to convergence (10). Gibbs sampling can be considered a stochastic variant of EM (98, 171). In Gibbs sampling, the algorithm starts off with a number of n-mers randomly sampled from the input sequences. It then probabilistically decides for each iteration whether to remove an old site from and/or add a new site to the motif model. The probability is weighted by the binding probability for those sites based on the old model (180). Well-known motif discovery tools based on Gibbs sampling are AlignACE (249), MotifSampler (290), and BioProspector (179). Like the EM algorithm, Gibbs sampling can suffer from the problem of the presence of local optima. 
GibbsST is a promising algorithm that circumvents this problem in a new way, by a thermodynamic method called simulated tempering (269).
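The sampling step of Gibbs-based motif discovery can be illustrated with a minimal sketch of the classic one-site-per-sequence variant, in which one sequence at a time is held out and its site is resampled with a probability weighted by the model built from the remaining sites. This is an assumption-laden simplification (fixed motif width, exactly one site per sequence, uniform background, hypothetical input sequences) and does not reproduce AlignACE, MotifSampler, BioProspector, or GibbsST.

```python
import math
import random

ALPHABET = "ACGT"

# Hypothetical input sequences, invented for illustration only.
SEQS = ["ACGTTTGACAGGCT", "GGCTTGACATCAGC", "CATTGACAGGTTCG", "TGCAGCTTGACAAC"]


def profile(sites, pseudocount=0.5):
    """Per-position nucleotide frequencies of the current sites."""
    total = len(sites) + 4 * pseudocount
    return [
        {b: (sum(s[i] == b for s in sites) + pseudocount) / total for b in ALPHABET}
        for i in range(len(sites[0]))
    ]


def likelihood_ratio(word, prof, background=0.25):
    """Probability of the word under the motif model relative to background."""
    ratio = 1.0
    for col, base in zip(prof, word):
        ratio *= col[base] / background
    return ratio


def alignment_score(seqs, positions, width):
    """Sum of log2 likelihood ratios of the chosen sites under their own profile."""
    sites = [s[p:p + width] for s, p in zip(seqs, positions)]
    prof = profile(sites)
    return sum(math.log2(likelihood_ratio(site, prof)) for site in sites)


def gibbs_site_sampler(seqs, width, iterations=200, seed=0):
    """Return (best score, best start positions) seen during sampling."""
    rng = random.Random(seed)
    # Start from randomly chosen n-mers, one per input sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    best = (alignment_score(seqs, positions, width), list(positions))
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the model from all sites except the one in sequence i.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, positions)) if j != i]
            prof = profile(others)
            starts = range(len(seq) - width + 1)
            weights = [likelihood_ratio(seq[p:p + width], prof) for p in starts]
            # Sample, rather than maximize, the new site for sequence i.
            positions[i] = rng.choices(starts, weights=weights)[0]
        current = alignment_score(seqs, positions, width)
        if current > best[0]:
            best = (current, list(positions))
    return best
```

Because new sites are sampled rather than chosen greedily, the sampler can leave a poor alignment, but it can still settle in a local optimum. Running it with several seeds and comparing the best scores is a simple form of the advice to run probabilistic algorithms multiple times.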
Because so many different tools are available for DNA motif discovery, balanced comparisons are of major importance. Although some efforts in this direction have been made (134, 276, 293), it remains a major challenge for the field to find objective standards for algorithm evaluation. The main reason for this is that the various tools score differently depending on the data sets, and absolute benchmarks are lacking (256, 293). Tompa et al., who created eukaryotic benchmark data sets with which they tested 13 commonly used algorithms, found no single program to be superior across all performance measures and data sets (although Weeder outperformed the other tools in most cases) (293). Hu et al. performed a similar analysis with prokaryotic benchmark data sets for five motif discovery tools, although their analysis differed in that they allowed minimal parameter tuning during performance evaluation (134). Both studies found that the absolute measures of correctness of all programs were quite low, although Hu et al. found that the algorithms which they tested were capable of predicting at least one binding site accurately more than 90% of the time (134). Because of the limitations inherent in any single motif discovery tool, users are advised to use multiple algorithms, to run probabilistic algorithms multiple times, to pursue the top few motifs instead of the single most significant one, to combine similar motifs, and to evaluate the resulting motifs in terms of group specificity, set specificity, and positional bias.
A consensus is now emerging that because no single program is superior for all data sets, several programs (preferably based on different methodologies) should be combined to achieve optimal results (119, 134, 293) (Fig. 7). Hu et al. found that an ensemble method that combined outcomes of the tools that they tested increased both sensitivity and specificity considerably (134). They later extended the method in their EMD algorithm (135). Recently, two additional applications (SCOPE and MotifVoter) that combine the results of different motif search algorithms for prokaryote data have become available (44, 316). The application MOTIFATOR is focused on prokaryote data analysis and uses the SCOPE algorithm to search for overrepresented DNA motifs in upstream regions of DNA microarray targets (31). The resulting motifs are presented in combination with functional enrichment and a visualization of the putative TFBSs in relation to the ORF to allow the user to prioritize results. While SCOPE merges the scores of three complementary algorithms (BEAM for nondegenerate motifs, PRISM for degenerate motifs, and SPACER for bipartite motifs), MotifVoter extracts its motifs by clustering the results of up to 10 well-known motif discovery tools such as Weeder, MEME, and AlignACE. Notably, MotifVoter significantly outperformed earlier ensemble algorithms on the benchmark data set reported by Tompa et al. as well as on a bacterial (E. coli) benchmark data set (316).
Comparative genomic approaches can also be used to detect TFBSs or to filter results from enumerative and alignment methods by using the assumption that nucleotides in a binding site motif are generally better conserved than the nucleotides in the vicinity of the binding site. With these so-called phylogenetic footprinting approaches, conserved regions that point to the presence of important functionality, i.e., TFBSs (and also RNAP/ribosome binding sites), are identified (30, 131). The most basic methodology is to construct a global multiple sequence alignment of the orthologous promoter sequences using an alignment tool such as ClustalW (292) and then to manually identify conserved regions within this alignment (Fig. 8). Genomes of three species having the optimal phylogenetic distance toward each other could be sufficient for the detection of such conservation (206). However, such an approach to phylogenetic footprinting does not always work because it may be difficult to obtain an accurate alignment, or an obtained alignment may be uninformative. Therefore, several motif-finding algorithms have been adapted to detect phylogenetic footprints in promoters of orthologous genes in tools such as OrthoMEME (235), Footprinter/MicroFootprinter (29, 217), PhyloCon (310, 311), PhyME (277), and PhyloGibbs (273). Some methodologies that avoid the use of alignments altogether have even been proposed (79, 106). Recently, an approach in which predicted motifs throughout different taxonomic levels can be compared has also been developed, which enables one to detect not only motif conservation but also motif divergence (144). Finally, the conservation of the genomic context of TFs can be used to detect genes regulated by a TF, after which motifs of such a TF can be obtained through the footprinting of all orthologues that share this identical genomic context (89, 313).
Although the degeneration or turnover of a TFBS in one or more specific phylogenetic lineages is a potential hazard to the phylogenetic footprinting approach (136), a computational approach (CSMET) that takes into account such a lineage-specific evolution of TFBSs has recently been developed (241).
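The most basic form of phylogenetic footprinting, identifying conserved stretches in an existing multiple alignment of orthologous promoters, can be automated with a short sketch. The aligned sequences below are hypothetical, the alignment is assumed to have been produced beforehand by an external tool, and the thresholds (full identity over at least six ungapped columns) are arbitrary choices for illustration.

```python
# Hypothetical pre-aligned orthologous promoter fragments ("-" marks a gap),
# invented for illustration only.
ALIGNMENT = [
    "ACGATTGTGAGCGGTACCA",
    "TCGCTTGTGAGCGATA-CA",
    "AGGTTTGTGAGCGCTAGCT",
]


def column_identity(alignment):
    """Fraction of sequences sharing the most common residue in each column.

    Columns that contain a gap are scored 0 so they never count as conserved.
    """
    scores = []
    for column in zip(*alignment):
        if "-" in column:
            scores.append(0.0)
        else:
            scores.append(max(column.count(b) for b in set(column)) / len(column))
    return scores


def conserved_blocks(alignment, min_identity=1.0, min_length=6):
    """Return (start, end, sequence) for runs of conserved columns; end is exclusive."""
    scores = column_identity(alignment)
    blocks = []
    start = None
    # The trailing sentinel of -1.0 closes a block that reaches the last column.
    for i, s in enumerate(scores + [-1.0]):
        if s >= min_identity and start is None:
            start = i
        elif s < min_identity and start is not None:
            if i - start >= min_length:
                blocks.append((start, i, alignment[0][start:i]))
            start = None
    return blocks
```

For the example alignment, `conserved_blocks(ALIGNMENT)` returns the single block `(4, 13, "TTGTGAGCG")`. A site lost in one lineage would break such a block, which is the turnover hazard noted above for this approach.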
Finally, prediction approaches that make use of structural information about the TF (4a, 149, 159, 178, 213), either from crystallographic structures or from homology models, have recently been applied. Although the use of such models for ab initio predictions of TFBSs is still limited, Morozov and Siggia have shown that it can be used successfully to compute a PWM for a certain TFBS motif using the combination of structural information and a single strict consensus sequence (213). The method proceeds from the assumption that the conservation of a base pair in the binding site is correlated with the number of atomic contacts between that base pair and the TF, which functions as a reliable proxy of TF-TFBS binding affinity.
Gene regulatory networks or TRNs have become an important tool in studying global transcriptional regulation in prokaryotes (5, 11, 17, 125, 150, 259, 260). Figure 9 shows an example of the visualization of the E. coli K-12 TRN. In this figure, the nodes (boxes) correspond to genes, and the edges (lines) are the interactions between the genes. An interaction between the TF and its target is denoted as an edge between the TF node and its target node. The network is built by interconnecting the TF nodes to form larger network structures. Within a TRN, smaller network modules can be distinguished (for a review, see reference 3). These network modules are (i) positive and negative autoregulation (a TF regulates its own expression); (ii) feed-forward loops, where regulator A regulates the expression of regulator B and target C (regulator B additionally regulates the expression of C; there are eight different regulatory combinations possible depending on the Boolean logic) (Fig. 5); and (iii) dense overlapping regulons, where gene expression is driven by a combination of TFBSs for different TFs.
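The first two kinds of network module can be found mechanically once a TRN is stored as a set of signed edges. The sketch below is a minimal illustration on a hypothetical toy network. The node names and signs are invented, and the labels "coherent" and "incoherent" follow the common convention of comparing the sign of the direct path A to C with the product of the signs along the indirect path through B.

```python
from itertools import product

# Hypothetical signed edges: (regulator, target) -> +1 activation, -1 repression.
EDGES = {
    ("A", "B"): +1, ("A", "C"): +1, ("B", "C"): +1,
    ("X", "Y"): +1, ("X", "Z"): +1, ("Y", "Z"): -1,
    ("A", "A"): -1,
}

# Three edges with two possible signs each give the eight sign combinations.
SIGN_COMBINATIONS = list(product((+1, -1), repeat=3))


def autoregulated(edges):
    """List TFs that regulate their own expression, with the sign of the loop."""
    return sorted(
        (src, "negative" if sign < 0 else "positive")
        for (src, dst), sign in edges.items() if src == dst
    )


def feed_forward_loops(edges):
    """Find triples (A, B, C) with edges A->B, B->C, and A->C."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)
    loops = []
    for a in sorted(targets):
        for b in sorted(targets[a]):
            if b == a:
                continue  # skip autoregulatory edges
            for c in sorted(targets.get(b, ())):
                if c in (a, b) or c not in targets[a]:
                    continue
                direct = edges[(a, c)]
                indirect = edges[(a, b)] * edges[(b, c)]
                kind = "coherent" if direct == indirect else "incoherent"
                loops.append((a, b, c, kind))
    return loops
```

On the toy network, `feed_forward_loops(EDGES)` finds one coherent loop (A, B, C) and one incoherent loop (X, Y, Z), and `autoregulated(EDGES)` reports the negative autoregulation of A.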
These networks allow the study of the signal integration occurring at the promoters of genes (which are represented as nodes in the network) in a wider context. Additionally, predictions of the functioning of larger regulatory structures in the cell can follow from studying TRNs (265). Another example of analysis of TRNs is given by Carrera and coworkers, who described a method that allows predictions of the response of a TRN following perturbations (e.g., knockout of a TF) (47). A combination of analysis and reconstruction was given by Barrett and Palsson, who described an algorithm that allows the reconstruction of a TRN of a given organism by the iteration of a prediction of the most informative perturbation, performing that perturbation in the laboratory, and reconstructing the TRN including the new information (13) (see Table 2 for an overview of methods involved with [gene regulatory] network analysis and visualization). Below, we nonexhaustively describe some approaches to gene network reconstruction, i.e., computationally determining interactions between genes.
The most common approaches are the modeling of Boolean logic networks (33, 163, 201, 272) or the use of Bayesian models or coexpression measures to create probabilistic networks (90, 91, 161, 258). More complex network models have also been introduced, such as continuous (rather than logical) models (57, 77, 216, 227) and single-molecule-level models (103, 263, 333). Reference or template-based network reconstruction is a methodology that uses reference networks to predict edges between genes for a given organism (18). CoryneRegNet is a database that contains data for regulatory interactions for a number of organisms, including E. coli K-12, that can be used for this purpose (15). Each of these methodologies has its own advantages: logical models allow relatively easy and flexible fitting to large-scale biological phenomena, continuous models allow an understanding of more confined processes that rely on finer timing and exact molecular concentrations, and single-molecule-level models allow study of the stochastic aspects of gene regulation. Template-based methods allow one to use knowledge on TRNs generated for different organisms. Although TRNs can be quite well compared between some related organisms (16), it remains to be established whether this assumption generally holds for other species and more specialized gene regulatory modules. The major data source for the above-mentioned approaches is gene expression data obtained from microarray experiments. Based on benchmarks of reconstruction using different algorithms and synthetic data, the reconstruction of TRNs, and conceivably determining regulon structure (see above), has been shown to be most effective when small time series of genetic perturbations are used, as opposed to larger-time-series microarray data (97).
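The simplest of these model classes, the Boolean logic network, can be illustrated with a short sketch in which every node is either on or off and all nodes are updated synchronously from the previous state. The network below is a hypothetical toy with invented node names, in which a target operon is expressed only when an activator is bound and a repressor is not. It is not taken from any of the cited models.

```python
from itertools import product

# Hypothetical update rules: each maps the previous state to a node's next value.
RULES = {
    "inducer": lambda s: s["inducer"],            # external input, held constant
    "cofactor": lambda s: s["cofactor"],          # external input, held constant
    "repressor_bound": lambda s: not s["inducer"],
    "activator_bound": lambda s: s["cofactor"],
    "operon": lambda s: s["activator_bound"] and not s["repressor_bound"],
}


def step(state, rules):
    """Synchronous update: every node is computed from the previous state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}


def run_to_fixed_point(state, rules, max_steps=20):
    """Iterate until the state stops changing; return the visited states."""
    trajectory = [dict(state)]
    for _ in range(max_steps):
        nxt = step(trajectory[-1], rules)
        if nxt == trajectory[-1]:
            break
        trajectory.append(nxt)
    return trajectory


def truth_table(rules):
    """Steady-state operon output for every combination of the two inputs."""
    table = {}
    for inducer, cofactor in product((False, True), repeat=2):
        start = {"inducer": inducer, "cofactor": cofactor,
                 "repressor_bound": True, "activator_bound": False,
                 "operon": False}
        table[(inducer, cofactor)] = run_to_fixed_point(start, rules)[-1]["operon"]
    return table
```

The truth table returned by `truth_table(RULES)` is the control logic of the toy promoter: the operon is on only when both inputs are present. Continuous and single-molecule-level models replace these on/off states with concentrations or molecule counts.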
For all these approaches, the process of reconciling laboratory data (gene expression data and ChIP-on-chip) with bioinformatic regulon predictions is of major importance (126, 302). This integration step is necessary to be able to reliably analyze genome-scale models of TRNs to predict the effects of the application of different stimuli to an organism (19). Schlitt and Brazma proposed subdividing regulatory network models into four categories: (i) part lists (systematized lists of network elements in a particular organism or system), (ii) topology models (the parts including their interconnections), (iii) control logic models (the description of the combinatorial effects of regulatory signals), and (iv) dynamic models (the simulation of the network in time) (259, 260). Currently, there are large gaps between part lists that, for example, constitute a regulon and topology models, in which the part lists are integrated to yield a network topology (259). A further level of complexity is added with the control logic of networks, which has been described in a number of studies (62, 96, 162, 163, 255).
There are still a number of categories of inconsistency between the models and experimental observations. For example, not all physical interactions reported by, e.g., ChIP-on-chip between TFs and cis-regulatory regions result in significant functional regulatory effects that are detectable in gene expression data (259). Moreover, many transcripts remain below detection limits of the techniques used in high-throughput gene expression studies (32). Also, many inconsistencies exist between TF-DNA interactions predicted by computational approaches (e.g., PWM-based methods) and ChIP-on-chip data (170, 211). Even in large collections of gene expression data collected under many different conditions, sometimes no transcriptional effects are discovered for certain TFBSs (127). Last but not least, only a small complement of an organism's genes is active under the laboratory conditions (single-species growth in liquid culture) in which they are commonly grown, so available microarray data query only a limited part of the regulatory space (281).
Currently, reconstruction of regulons or networks of regulons is done primarily by using DNA microarray data in conjunction with literature knowledge and in some cases is supplemented with data for protein-DNA interactions. This involves searching for overrepresented DNA motifs in the upstream regions of target genes (see above). Current algorithms that were developed for searching overrepresented DNA motifs create a background model of the genome. These background models are based mostly on (oligo)nucleotide distributions across genomic regions. In the following section, the genomic distribution of TFBSs is discussed. This information can be used to further improve detection of TFBSs and to reduce the number of false-positive and false-negative results.
A main obstacle for TFs in locating their functional binding sites across the chromosome is the presence of spurious TFBSs, sites with relatively high binding affinity (and relatively close to the TFBS consensus sequence) that have arisen nonadaptively throughout the genome without having been selected for a particular biological function (169). The fact that TFs do not have strict sequence specificity means that through simple mutations, spurious binding sites can quite easily appear by chance at positions where they do not significantly affect the transcription of nearby genes (174). Such spurious binding sites will lower the effective TF concentration within a cell.
Initial investigations into the distribution and dynamics of spurious TFBSs have been made by Huerta and coworkers (137, 138), who focused on the distribution of RNAP σ-factor binding core promoter elements throughout eubacterial and archaeal genomes. Their statistical investigation, in which they counted the number of RNAP binding motifs throughout different regions of 44 genomes, has shown that σ70 binding to −35 and −10 core promoter elements is overrepresented in regulatory regions (generally upstream regions) compared to nonregulatory regions of genomes (138). In two other studies, dinucleotide and/or trinucleotide frequencies of different genome regions were incorporated into the analysis to show that RNAP binding sites are also present below expectations (the number of motifs expected to arise by chance given certain oligonucleotide frequencies) in both coding and noncoding regions of bacterial genomes, which implies that natural selection acts to counter the appearance of spurious sites (92, 114). RNAP binding to −10 sites appeared to be overrepresented within regulatory regions relative to nonregulatory regions, even when the sites at the −10 position itself are not taken into account (92). Multiple −10 sites throughout promoter regions could perhaps function to maintain local RNAP abundance in these regions, or a cis-regulatory region may contain two promoters in tandem. In a study by Radonjic and coworkers, RNAP was reported to be present in the upstream regions of genes to ensure a fast response when the eukaryote Saccharomyces cerevisiae exits the stationary growth phase (238). RNAP binding sites downstream of the core promoter can also have important functions, such as the −10 site-resembling element at the transcriptional start site of the E. coli lac promoter, which mediates a transcription pause (34, 219). 
This mechanism possibly functions as a negative regulator of transcription in which the rescue of the stalled RNAP complex is dependent on one or more other TFs.
A similar, although not as extensive, study has been done by Hamoen et al. on a specific TF, the competence factor ComK, which was previously mentioned (115). Those researchers found that while both of the ComK binding sites (K-boxes) without any mismatches and 18 out of 25 of the K-boxes with one mismatch from the strict consensus were positioned in intergenic regions, only 56 of the 171 K-boxes with two mismatches were positioned in intergenic regions. Also, of the K-boxes with three mismatches, only 280 of the 864 were present in intergenic regions. Yet still, K-boxes with three mismatches were overrepresented in these intergenic regions, as they cover only 12% of the genome, and the difference in the percent GC content between genic and intergenic regions also could not account for the 32% of the triple-mismatched K-boxes found there. The only drawback of this study is that it did not take into account oligonucleotide frequencies, which in genic regions, for example, may be influenced by codon biases (4).
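The overrepresentation argument can be checked with simple arithmetic on the counts quoted above. The sketch below compares the intergenic fraction of each K-box class with the 12% of the genome that is intergenic. The z-score is only a rough normal approximation that assumes sites fall independently and uniformly along the genome, which is an assumption of this example and not part of the original analysis.

```python
import math

INTERGENIC_FRACTION = 0.12  # share of the genome that is intergenic (from the text)

# (intergenic K-boxes, total K-boxes) per number of mismatches, from the text.
K_BOXES = {1: (18, 25), 2: (56, 171), 3: (280, 864)}


def intergenic_share(intergenic, total):
    """Fraction of sites of a given class that lie in intergenic regions."""
    return intergenic / total


def fold_enrichment(intergenic, total, expected_fraction=INTERGENIC_FRACTION):
    """Observed intergenic share relative to the share expected by chance."""
    return intergenic_share(intergenic, total) / expected_fraction


def z_score(intergenic, total, expected_fraction=INTERGENIC_FRACTION):
    """Normal approximation to a binomial with uniform, independent placement."""
    expected = total * expected_fraction
    sd = math.sqrt(total * expected_fraction * (1.0 - expected_fraction))
    return (intergenic - expected) / sd


SUMMARY = {
    mismatches: (
        round(intergenic_share(*counts), 3),
        round(fold_enrichment(*counts), 2),
    )
    for mismatches, counts in K_BOXES.items()
}
# SUMMARY[3] is (0.324, 2.7): about 32% intergenic, roughly 2.7-fold over 12%.
```

Even the triple-mismatch class is therefore found in intergenic regions about 2.7 times more often than uniform placement would predict, while the share drops from 72% for single-mismatch K-boxes to about a third for two and three mismatches.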
More generally, it is important to note that it is difficult to pinpoint which TFBSs are spurious and which are not. In a recent study, Shimada et al. found that 14 out of the 20 targets of the E. coli RutR TF found by ChIP-on-chip analysis were located in coding regions of the DNA and had little or no effect on transcription levels when tested (270). However, the computational prediction that other bacteria containing RutR homologues also have RutR TFBSs that are overrepresented in coding regions makes it tempting to suggest that RutR has some unknown function within these regions (270). Nonetheless, these results can just as well be explained as an evolutionary relic.
One source of biological information that may help to distinguish between spurious and functional binding sites is that most local (nonpleiotropic) TFs tend to be encoded in close chromosomal proximity with one of their target genes, as was shown for E. coli by Janga et al. (142). Multiple biophysical models have shown that this makes sense because it allows the TFs to quickly reach their targets after translation, even at low concentrations, by sliding along the adjacent DNA (20, 158, 322). This implies that quite probably, an important part of the information determining the biological relevance of a TFBS is not present in its sequence but rather is present in its position on the chromosome (143).
Studying the effect of natural selection on the abundance of TFBS motif-like sequences in different genomic regions may reveal much about their functionality (113). Given the fact that in large bacterial populations, natural selection is by far the major determinant of genomic sequence (189), selection can be quantified quite easily by comparing the abundance of TFBS motif-like sequences with the abundance expected from chance alone. The genomic abundance of every short DNA motif sequence expected by chance can be calculated from oligonucleotide frequencies. These comparisons between observed and expected abundances can be performed with different regions of prokaryotic genomes (e.g., coding and noncoding). For many TFs, such analyses can reveal in which regions there is selection either for or against the presence of their TFBSs.
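As a minimal sketch of the observed/expected comparison described above, the Python code below estimates the expected number of occurrences of a short DNA word from the mono- and dinucleotide frequencies of a region. It uses a first-order Markov chain as a simplified stand-in for the background models cited in this review; all function names are hypothetical, the sequence is assumed to be uppercase A/C/G/T, and only one strand is considered.

```python
from collections import Counter


def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    width = len(word)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == word)


def expected_count_markov1(seq, word):
    """Expected occurrences of `word` under a first-order Markov chain
    fitted to the mono- and dinucleotide counts of `seq`."""
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    # P(word) = p(w1) * product over i of p(w_{i+1} | w_i)
    prob = mono[word[0]] / n
    for a, b in zip(word, word[1:]):
        transitions_from_a = sum(di[a + x] for x in "ACGT")
        if transitions_from_a == 0:
            return 0.0
        prob *= di[a + b] / transitions_from_a
    return prob * (n - len(word) + 1)


def obs_exp_ratio(seq, word):
    """Return (observed, expected, observed/expected) for `word` in `seq`."""
    observed = count_overlapping(seq, word)
    expected = expected_count_markov1(seq, word)
    ratio = observed / expected if expected > 0 else float("nan")
    return observed, expected, ratio
```

Running `obs_exp_ratio` separately on, for example, concatenated coding and noncoding regions would give one ratio per region; ratios well below 1 would be consistent with selection against the motif, and ratios well above 1 with selection for it.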
Besides distinguishing between general sequence categories such as coding or regulatory regions, more specific genomic regions can be specified for this analysis. For example, the abundance of certain TFBSs in different parts of cis-regulatory regions (e.g., −30, −50, and −100 nucleotides relative to the transcriptional start site) could be assessed separately to specifically identify the regions within promoters to which particular TFs generally bind. Similarly, the observed/expected abundance ratios of certain TFBS motif-like sequences in the first 50 to 100 nucleotides of coding regions could be compared with the observed/expected abundance ratios of these sites in coding regions as a whole. This might give insight into the natural selection leading to a large abundance of TFBS motif-like sequences in the 5′ part of coding regions. This, in turn, would then point to a possible roadblock function of the corresponding TF.
Both the hidden Markov model analysis used to calculate expectations of TFBS abundance from genomic oligonucleotide frequencies (60, 130) and the sliding-window approach to count the number of DNA motifs with a certain number of mismatches from the strict consensus (8) are straightforward. A more elaborate algorithm for detecting the positional overrepresentation of TFBSs that uses PWMs of spatially conserved motifs based on comparative genomics techniques was developed by Defrance and Touzet (65). Therefore, these analyses have the potential to become standard tools for the study of transcription as an addition to the most commonly used PWM-based tools. Information from such methods would enhance standard positional statistics of TF distribution throughout genomic regions, as was used by Huerta and coworkers, for example, who created simple motif density maps showing the location distribution of core promoter-like elements throughout regulatory regions (138). It should be kept in mind, however, that the regions selected as input for the analyses should be sufficiently large so that the motifs under study do not significantly affect the di- or trinucleotide frequencies themselves.
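The sliding-window count mentioned above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the procedure of any cited study: the consensus is treated as a strict (nondegenerate) string, only the given strand is scanned, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_by_mismatch(seq, consensus, max_mismatches=3):
    """Slide a window of len(consensus) along `seq` and return a dict
    {k: number of windows with exactly k mismatches} for k <= max_mismatches."""
    width = len(consensus)
    counts = {k: 0 for k in range(max_mismatches + 1)}
    for i in range(len(seq) - width + 1):
        d = hamming(seq[i:i + width], consensus)
        if d <= max_mismatches:
            counts[d] += 1
    return counts


# Toy example with a made-up 4-nt "consensus".
print(count_by_mismatch("AAAATTTT", "AAAA", max_mismatches=1))
```

To cover both strands, the same function could be applied to the reverse complement of the sequence, taking care not to count palindromic sites twice.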
Finally, when TFBS sequences of a TF are sorted based on the distance from the degenerate consensus sequence, the strictness of the motif-TF interaction could perhaps be monitored by observing the effect of natural selection (assuming this to be the major mechanism shaping genomic sequences of bacteria) on the abundance of these sequences in relation to their distance to the degenerate consensus. Because consensus methods probably do not offer sufficient accuracy for such an analysis, PWMs could be used, by calculating PWM scores for all TFBS-like DNA words and observing the effect of natural selection on the abundance of groups of DNA words with different PWM scores. An alternative method is to approach this problem from the perspective of the actual biological effects of the TFBS-like sequences by performing a two-dimensional clustering with the PWM scores as a function of gene expression. This would also provide a means of visualization, and currently, several groups are following this approach (54, 64).
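To make the PWM-based grouping concrete, the Python sketch below builds a log-odds PWM from a set of aligned sites and scores every window of a sequence. The pseudocount, the uniform background, and all names are assumptions chosen for illustration; published tools differ in these details.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds PWM (list of per-position dicts) from aligned sites.

    Assumptions: all sites have equal length and contain only A/C/G/T;
    the background defaults to equal base frequencies.
    """
    background = background or {b: 0.25 for b in BASES}
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = Counter(site[pos] for site in sites)
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm


def score(pwm, word):
    """Additive PWM score of a word of the same length as the PWM."""
    return sum(column[base] for column, base in zip(pwm, word))


def scan(pwm, seq):
    """Return (position, score) for every window of `seq`."""
    width = len(pwm)
    return [(i, score(pwm, seq[i:i + width])) for i in range(len(seq) - width + 1)]
```

The scores returned by `scan` could then be binned, and the observed abundance of each bin compared with the abundance expected by chance, to follow how selection changes with distance from the best-scoring sequence.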
In order to realistically predict which DNA motifs in a genome are functional TFBSs and which are not, it is important to understand how TFBSs evolve. After all, the sequence of any TFBS is shaped by its evolutionary heritage. In the following section, we review the evolution of the information content of the nucleotides making up TFBSs. This information is a highly important yet complex piece of the transcriptional regulation puzzle.
An important aspect of TRNs is that they can evolve rapidly (6, 7, 99, 141, 186). Therefore, transspecies extrapolation of information from TRNs is possible in only a very limited taxonomic range (16). The regulatory effects of a TF often already vary significantly between different strains of the same bacterial species (122). The structure and sequence of cis-regulatory elements may change even when gene expression patterns are conserved because there is a significant turnover of binding site sequences (74, 136, 187). In such a turnover event, a new TFBS bound by the same TF evolves next to the original TFBS, after which the original TFBS degenerates (100). Furthermore, TFs for which strong phylogenetic evidence exists that they are evolutionary orthologues rarely regulate orthologous genes (237). Amazingly, a recent study even shows that a TF (Lrp) from Proteus mirabilis that was heterologously expressed in the closely related bacterium E. coli regulated only 51% of the genes that were regulated by its highly similar (98% sequence identity) E. coli Lrp orthologue under the same conditions (176). In another study, B. subtilis ComK, which is normally a transcriptional activator, appeared to function mainly as a repressor when it was heterologously expressed in L. lactis (286). Also, by studying PhoP orthologues from Salmonella enterica and Yersinia pestis (79% identical), Perez and Groisman found that they acted differently on promoters (one able and one unable to induce transcription) in the two species even in a case where both orthologues bound the PhoP binding site in the promoter effectively (230). Apparently, the evolution of TF-TFBS interactions involves a complex interplay of both minor modifications to the sequences of TFs and functional changes in the architecture of promoters.
In the long run, TFs seem to evolve quite independently of their target genes through the rapid genome-wide tinkering of transcriptional interactions (7). Genes coding for repressors coevolve more tightly with their targets than do genes encoding activators. An activator can be lost when its targets remain in the genome. In contrast, a repressor usually can be lost by a genome only after either its target genes have also been lost or the TRNs have rewired significantly to diminish the regulatory role of the repressor (128). Therefore, the information content of repressor TFBSs is expected to be more conserved across related organisms.
Detailed computer simulations have shown that cis-regulatory regions can evolve in relatively little time through local point mutations, although the details of these models were based on eukaryotic genomes (22, 76). Also, local duplications caused by DNA strand slippage during replication, promoter rearrangements, and transposition of cis-regulatory regions between promoters can quickly generate novel TFBSs, although most of these processes have been studied in detail only for eukaryotes (156, 207). Furthermore, gene duplications that include cis-regulatory regions complicate the picture, because cis-regulatory regions of duplicate genes are known to be able to diverge rapidly (175, 223).
Although, as noted above, the energy of the binding of a TF to one of its TFBSs is often quite well approximated by the sum of the independent contributions of several important nucleotides, which has been referred to as the “additivity hypothesis” (21), the correlation between binding affinity and distance to the strict consensus sequence or the PWM score of a TFBS may not be quite perfect. Two studies have shown that the additivity hypothesis cannot fully account for the binding energies in the sequence space of TFBS motifs. In the first study, binding affinities of the Mnt repressor of Salmonella phage P22 were determined for its binding sites, in which positions 16 and 17 of the 21-bp operator had been varied to account for all 16 possible dinucleotide combinations (194). The two nucleotides appeared to be clearly interdependent: if position 17 was not a C, the preference of position 16 changed from A to C. In the second study, Bulyk and coworkers used protein binding microarrays to assess the binding affinities of a TFBS of the mouse zinc finger protein Zif268 for all 64 combinations of three nucleotides (40). Their analysis showed that a dinucleotide model (in which the effect of every nucleotide is dependent on the adjacent nucleotides) fitted their data better than a mononucleotide model (in which every nucleotide is scored independently) (40).
A reanalysis of these studies showed that the interdependency of nucleotides in a TFBS differed between different TFs, with the information increasing 2 to 15% when shifting from a mononucleotide to a dinucleotide model (21). Although those authors concluded that additive models are still accurate enough to be of use, it is still probable that—especially for TFs with a lower affinity for their binding sites (21)—search models based upon the assumption of additivity (which constitute the large majority of models used) will produce more false-positive and false-negative results than models in which nucleotide interdependencies are incorporated (330). When not taking into account the interdependencies of nucleotides for TF binding, interpretation problems might arise, especially when assessing large motifs, for example, when an asymmetric high level of conservation of a large part of a motif boosts the PWM score, while another part of a motif governing an essential structural DNA-protein interaction has degenerated. In a recent study, this has been shown to be the case for ComK binding sites in B. subtilis, in which transcription activation was almost completely abolished when the second thymine of the K-box was mutated into a guanine, even though the rest of the motif stayed intact (287). Similar results were also reported by Michal et al. for the Ndt80 motif in the eukaryote S. cerevisiae (209) and by Francke et al. for the LacI family, where the central CG nucleotides in the motif are essential for TF binding (89).
Furthermore, the surrounding sequence could have a significant effect on the effective binding affinity of a site (204) because the binding affinity of the surrounding sequence affects the time required for a TF to find its target through one-dimensional diffusion along the DNA. It may also affect the half-life of TF-TFBS binding, because if the surrounding sequence has a relatively high affinity for the TF, it will diffuse away more easily. An actual example of the influence of the surrounding sequence composition on TFBS functionality is the TFBSs of B. subtilis CcpA (cre boxes), which are more active when positioned in an AT-rich nucleotide context than when positioned in a GC-rich context (325).
The functionality or nonfunctionality of TFBSs is governed by evolutionary forces (215), which act mainly on the information content of DNA sequences. TFBSs are hard to identify because of the evolutionary tolerance (due to insufficient selection) of nonfunctional-site-resembling oligonucleotides and because of the array of evolutionary processes (the balance between selection, drift, and mutation) allowing fuzziness in functional binding sites.
If the size of the genome and the number of functional TFBSs are known, the amount of information needed for a TF to identify the site (Rfrequency) can be computed from the size of this genome and the number of sites (152). The information content of a TFBS (Rsequence) depends on motif length, motif stringency, and the genomic frequency of the nucleotides present in the motif (152). Evolutionary simulations have shown that the information content Rsequence of TFBSs will evolve to a value close to Rfrequency (152, 261), and a clear inverse correlation between TF binding specificity and pleiotropy (defined by the number of functional target sites to which a TF binds) has been found for the genomes of E. coli and B. subtilis (185, 266). The possibility that regulon size could therefore be estimated from the information content of TFBSs is intriguing. Francke and coworkers described CcpA and LacI operator motifs (TFBSs) for Lactobacillus plantarum (89). Those authors indeed reported that the CcpA operator site (cre) is quite degenerate, which reflects the global role that CcpA has in the control of cellular metabolism (89). It should be noted that part of the observed degeneracy could also be due to the higher number of sequences on which the motif representation is based. However, in a broader study of this in E. coli and B. subtilis, Lozada-Chavez et al. found a clear general negative correlation between the DNA binding specificity and pleiotropy of TFs as well (185). Notably, it could also be predicted that certain classes of TFs are structurally fit to function as nonpleiotropic regulators provided that their three-dimensional structure permits binding to TFBSs with larger motif lengths that can contain more information. In the end, the information content of a motif is a tradeoff between motif length and motif stringency (89).
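The two quantities discussed above can be illustrated with a short Python sketch based on a commonly used information-theoretic formulation: Rfrequency as the log2 ratio of possible positions to sites, and Rsequence as the summed reduction in per-position entropy relative to the genomic background. The function names are hypothetical, no small-sample correction is applied, and the example numbers are invented for illustration only.

```python
import math
from collections import Counter


def r_frequency(genome_positions, n_sites):
    """Bits needed to single out `n_sites` among `genome_positions` positions."""
    return math.log2(genome_positions / n_sites)


def r_sequence(sites, background=None):
    """Information content (bits) of a set of aligned sites.

    Sums, over positions, the genomic entropy minus the observed entropy.
    Assumptions: sites are equal-length A/C/G/T strings; the background
    defaults to equal base frequencies; no small-sample correction.
    """
    background = background or {b: 0.25 for b in "ACGT"}
    h_genome = -sum(p * math.log2(p) for p in background.values() if p > 0)
    total = 0.0
    for pos in range(len(sites[0])):
        column = Counter(site[pos] for site in sites)
        n = sum(column.values())
        h_pos = -sum((c / n) * math.log2(c / n) for c in column.values())
        total += h_genome - h_pos
    return total


# Hypothetical example: 50 sites in a genome with 4.6 million positions.
print(f"Rfrequency = {r_frequency(4_600_000, 50):.2f} bits")
```

In this picture, a TF with more target sites needs fewer bits per site, which is consistent with the inverse relationship between binding specificity and pleiotropy described above.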
The sequences of pleiotropic regulator TFBSs tend to be more conserved during evolution because of higher functional constraints (240, 266). This points to an interesting paradox, where the motif stringency does not have to correlate with motif sequence conservation, as is the case for sequence motifs for nonpleiotropic regulators, which are more stringent but not more conserved at the sequence level. Therefore, TFBSs of regulators that bind only at a single promoter may be very hard to trace because no overrepresentation of them can be found within the genome itself and because the rapid coevolution of the TFBSs with the gene of its TF may make phylogenetic footprinting impossible. However, the TFBSs of regulators that bind to very few targets could be determined by combining conserved gene context with phylogenetic footprinting within a limited phylogenetic range (89).
The evolution of TFBSs has generally produced nonrandom fuzziness of TFBS sequence motifs relative to their strict consensus sequence. The variation that occurs for particular nucleotides is different for every position in a certain TFBS motif, which reflects the importance of each nucleotide in establishing the specific binding of a TFBS by its TF. The position-specific variation that can be found within one genome is generally conserved throughout other relatively closely related genomes (214). A common problem in identifying TFBSs is that the number of regulated genes should be sufficient to determine a degenerate consensus sequence. Phylogenetic footprinting is a powerful tool to increase the number of TFBSs that can be used for this. Furthermore, comparing the position-specific stringency of TFBS nucleotides within a genome together with their conservation degrees across genomes can give an indication of selective pressures that have acted on certain TFBS nucleotides.
Two main evolutionary scenarios have been proposed to explain TFBS motif fuzziness from an evolutionary perspective (100). One scenario is that the binding affinity of each site is optimized evolutionarily to maximize the functionality of the site. Because the functionality of the site may demand a low binding affinity, fuzziness is a logical evolutionary consequence. This has been observed, for instance, for LacI, where the perfect palindrome has a higher affinity than the actual motif (89). In a second scenario, the fuzziness of TFBSs is attained automatically as a consequence of the balance between mutation and selection, because the function of a TFBS would be insensitive to its precise TF binding affinity as long as it is above some threshold. It should be noted that the two scenarios are not in contradiction and may both account for a part of the observed fuzziness.
In cis-regulatory regions that contain a single TFBS, the first scenario would play out if the expression of the gene has a graded response to the TF concentration, while the second scenario would play out if it responds in a binary or sigmoid fashion (104). A gene regulation model which incorporates the rate of transcription in combination with motif stringency and TF concentration would be more accurate compared to on/off models. In such a model, the threshold of TF abundance should also be modeled. Protein binding microarrays (9) could be used to determine the in vitro threshold concentration that results in the binding of a TF to its TFBS as a function of the TFBS sequence. Interestingly, Bilu and Barkai conducted a genome-wide survey of TFBSs in the yeast Saccharomyces cerevisiae in which they found that binding sites tend to be shorter and fuzzier if they are situated in more complex promoters containing more than one TFBS (26). Because promoters of essential genes tend to be bound by fewer TFs (26), one possible explanation for this fuzziness is that promoters can evolve to a larger complexity when they are under low selective pressure.
Stabilizing selection on a promoter sequence is weak when variation in the transcription rate of a gene is not likely to result in a deletion of the gene (190, 239). In such a situation, the emergence of a novel TFBS is also less likely to have deleterious effects, and there is more opportunity for evolutionary processes to incorporate such novel sites in a manner that is advantageous to the organism while still allowing for fuzzy TFBSs (26, 190). Indeed, data from comparative genomic analyses suggest that new TFBSs tend to appear in promoters that already contain multiple sites (26). However, it could also be argued that this is merely because in a promoter that already contains multiple TFBSs, a new TFBS confers a smaller change in the transcription rate, while the selective pressure on this transcription rate may be just as high as that for other genes.
The above-mentioned explanation of fuzziness may not be the whole story, as has been shown with examples of promoters with multiple TFBSs involved in cooperative or competitive DNA binding. Using a biophysical model of transcriptional regulation in which cis-regulatory regions with either homocooperative or heterocooperative sites were studied, Hermsen et al. found that TFBSs for which their TFs have weak binding affinity (i.e., fuzzy sites) probably have specific functions in cooperative transcription activation and repression (124).
In homocooperative activation, auxiliary TFBSs, which do not interact directly with the RNAP, need to be bound by their TF with higher affinity than does the primary site that interacts directly with the RNAP, in order to maximize the steepness of the response to the TF concentration (which is the primary function of homocooperativity) (38). On the other hand, in homocooperative repression, the auxiliary sites should be bound by their TFs with much weaker affinity than the primary site to establish a steep response (124). Thus, in each of these cases, the binding affinity of a TF for the auxiliary site is adapted in order to reach an optimal TF concentration dependence of the response. Experimental support for these results comes from E. coli cis-regulatory regions containing homocooperative LysR family activator binding sites and others containing homocooperative Fur repressor binding sites, which have both been studied in some detail and confirm the role of strong and weak binding sites proposed by the model (80, 165, 318). Therefore, in the case that multiple TFBSs for a single TF are present in a promoter, the secondary sites are expected to be more conserved than the primary TFBS in the case of activators and the other way around in the case of repressors.
Whereas most simple combinations of TFBSs lead to Boolean NOR or ANDN gates (153) (Fig. 5), the model constructed by Hermsen and coworkers also predicts that heterocooperativity or heterocompetitivity may facilitate more complex transcriptional responses, such as the Boolean AND or OR gates (Fig. 5). This finding is in accordance with earlier results reported by Buchler and coworkers (38). In promoters functioning as an AND gate, the core promoter (−35 and −10 RNAP binding sites) is weak, so there is no transcription without specific activation, and two TFBSs (binding TF1 and TF2) are present, which are both too weak to function by themselves and induce activation only cooperatively. Additional sites binding either TF1 or TF2 may be present in the promoter to steepen the response to the TF concentration (124). In such promoters, the fuzziness of the TFBSs is selective, since because of the lower binding affinity of the sites, both TFs are required to be present at sufficient concentrations to activate transcription.
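The link between weak sites and AND-like logic can be illustrated with a toy equilibrium occupancy calculation in Python. This is a simplified sketch and not the model of Hermsen et al. or Buchler et al.: it assumes two sites with dissociation constants `k1` and `k2`, a cooperativity factor `omega` when both TFs are bound, concentrations in the same arbitrary units as the dissociation constants, and activation whenever at least one TF is bound. All names and parameter values are hypothetical.

```python
def p_any_bound(tf1, tf2, k1=10.0, k2=10.0, omega=1000.0):
    """Equilibrium probability that at least one of two sites is occupied.

    States and statistical weights: empty (1), TF1 only (q1), TF2 only (q2),
    both bound (omega * q1 * q2), where q = concentration / dissociation constant.
    """
    q1 = tf1 / k1
    q2 = tf2 / k2
    both = omega * q1 * q2
    z = 1.0 + q1 + q2 + both
    return (q1 + q2 + both) / z


# Weak (fuzzy) sites with strong cooperativity: AND-like behaviour.
for tf1, tf2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print("weak sites  ", tf1, tf2, round(p_any_bound(tf1, tf2), 3))

# Strong sites: a single TF already occupies the promoter, OR-like behaviour.
for tf1, tf2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print("strong sites", tf1, tf2, round(p_any_bound(tf1, tf2, k1=0.1, k2=0.1), 3))
```

With the weak sites, occupancy stays low when only one TF is present and becomes high only when both are present; with the strong sites, either TF alone is sufficient. In this toy setting, lowering the affinity of the individual sites is what turns the promoter into an AND gate.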
The biological importance of cooperative regulation also has consequences for the prediction of TFBSs. Currently, TFBSs are determined case by case. The determination of TFBSs would benefit from the integration of searches on different TFBSs on the same cis-regulatory regions. In cases where multiple TFBSs are identified in a given cis-regulatory region, the motif detection stringency should be decreased in order to account for the more complex promoters in which cooperative or competitive regulation takes place. One reason is that TFBSs probably need less motif stringency to be functional if they are positioned next to another TFBS with a high binding affinity for the same TF because this will cause the local TF concentration in this promoter to be higher than normal. A second reason is that the biological usefulness of cooperative and competitive regulation mechanisms can be expected to have increased the frequency of TFBSs in promoters during evolution beyond the level that would be expected on the basis of selection acting on single TFBSs.
We conclude that in order to understand the structure and response of a TRN, the fuzziness of a TFBS should be considered in context with (the nature of) other TFBSs in the same promoter region. A fuzzy TFBS could still have an equally important role as a “perfect” TFBS in the case of homo- or heterocooperativity.
The scenario in which TFBS fuzziness is a result of mutational entropy has been developed in two theoretical studies using the assumption that the fitness of a TFBS depends solely on its binding affinity for its TF. Gerland and Hwa reported that the fuzziness of motifs arises naturally from the balance between selection and mutation: mutations that slightly lower the affinity of TF binding to the TFBS are not rapidly removed by selection compared to the event of a new mutation (100). A study reported by Sengupta and coworkers emphasized this point, while they also found that TFBSs of TFs governing large regulons were more fuzzy than those of TFs targeting only a few specific sites (266). This may be both because the amount of mutational variation in the TFBSs of a certain TF increases with the number of TFBSs (higher mutational forces) and because the information required to identify a TFBS is less specific if more TFBSs are present in the genome (lower selective force [see above]).
It seems that the two scenarios (the selective scenario and the mutation-selection balance scenario) explaining TFBS motif fuzziness are in reality probably intertwined and that both processes play important roles in TFBS evolution. The contribution of each process probably differs according to both the complexity of promoters and the selective pressures acting upon them. Finally, from a broader perspective, another reason for TFBS fuzziness could be that less strict binding of TFs to DNA motifs (unlike restriction endonucleases) both creates robustness to deleterious mutations and enhances the evolvability of new TFBSs (307). Interestingly, in two studies of the E. coli lac operon, it was predicted that this operon can easily evolve from its intermediate form to a pure AND gate or a pure OR gate, because the fact that fuzziness is allowed during evolution facilitates the discovery of nearby sequence space (204, 267). In the context of the ever-varying evolutionary challenges that bacteria must face, it is the inherent evolutionary versatility and adaptability of the relationship between TFs and TFBSs that make these systems so successful.
Although homocooperative interactions may explain the appearance of TFBS multiplets (multiple adjacent occurrences of the same type of TFBS) in quite a number of promoters, they probably do not account for all multiplets present in cis-regulatory regions. For example, probably not all TFs oligomerize on the DNA because they may bind to different faces of the DNA helix or may not have three-dimensional domains that strongly interact (320). In some promoters, multiple TF proteins act simultaneously with the RNAP to either repress or activate transcription in a synergetic manner without oligomerization, as was observed for the CRP TF in E. coli (166). An evolutionary model has confirmed that under high selective pressures, more than one TFBS can indeed be maintained in promoters when the binding sites contribute independently to transcriptional activation (100, 112). This process could also be a driving force in the apparently frequent process of binding site turnover because new sites can evolve under selection, while after relief of selection, the old site may degenerate instead of the new site (74, 187). For phylogenetic footprinting approaches to TFBS discovery, such binding site turnover forms a serious hazard.
In order to fully understand bacterial TRNs and to integrate experimental and computational information, an appreciation of the biological mechanistic intricacies of gene expression regulation is needed. As can be seen in Table 1 and Fig. 1, there is a large variety of biological mechanisms by which transcription is regulated in prokaryotes. So far, only a few of them have been taken into account in regulon reconstruction and TRN reconstruction efforts. Spatial positioning of TFBSs, motif stringency, and combinatorial regulation mechanisms should especially be taken into account.
Barrett and Palsson as well as Covert and colleagues predicted that through an iterative model-building strategy in which iterations of high-throughput experiments and in silico modeling are performed successively, regulatory network elucidation for the model organism E. coli could be completed within years (13, 57). Such iterative approaches are indeed promising, because in this way, future experimental research will be streamlined effectively to yield the most information-dense results. However, if complex regulatory mechanisms such as those discussed in this review play a major role in prokaryotes, the outlook given by Barrett and Palsson as well as Covert and coworkers is probably too optimistic. More complex models may be needed to arrive at a TRN with a minimum number of inconsistencies. Moreover, there are more general issues in network reconstruction. In many cases, DNA microarray data are still used as a primary data source. Large compendia of microarray data obtained under different conditions are required to distinguish between direct and indirect regulatory effects (87, 326). Even when large data sets are available, a limitation of this approach is that only those networks which are (differentially) expressed under the conditions in which the transcriptome analysis was performed can be reconstructed. Furthermore, current efforts are focused on the association of targets with their transcriptional regulator. This involves the assumption that the transcriptional regulator should be coexpressed with its targets. The problem with this assumption is that this is the case only for TFs with autoregulation; i.e., the TF regulates its own expression. For E. coli K-12, negative autoregulation is the most common form, and the proportion of TFs that negatively regulate themselves has been estimated to be about 50% (248).
Another scenario could be that the transcriptional regulator is expressed earlier than its targets, which would require aligning and phasing gene expression patterns of the regulator and its targets (324). Still another approach could be to (i) determine stimulons (120), i.e., genes for which the expression is changed when applying a stimulus; (ii) determine the different regulons that are part of a stimulon; and (iii) determine the causal (TF-target) relationships within the regulons.
In order to be able to reliably reconstruct TRNs with the correct interactions between their nodes, it seems absolutely vital that more functionality of a promoter can be predicted and used as input than just the presence or absence of a certain TFBS in it. For one, the positions of TFBSs for activators and repressors should be taken into account. Recently, such information has been successfully implemented to increase motif search accuracies by searching for sequence motifs that are nonhomogeneously distributed within promoters (49). Another, more specific, possibility to use positional information is to weight the predicted transcriptional effect of class II activators by the helical face at which they are positioned relative to the core promoter elements (115). Furthermore, some classes of activators or repressors function only in a specific positional range relative to the transcription start site, so putative TFBSs for these TFs outside these regions could be discarded (although not if experimental information points to functionality). However, in the end, one would like to predict the steepness and control logic of the response of a promoter to the concentrations or numbers of active TFs in the cell. In order to accomplish this, a “grammar” should be constructed that can predict the promoter function from the positioning and combinations of multiple TFBSs in a promoter.
Synthetic biology is the field of research where biological building blocks conferring a certain functionality are identified by a combination of molecular biology, bioinformatics, and engineering. These building blocks can subsequently be transferred to a different organism to add biological functionality. Synthetic biology approaches seem both the solution and an additional application for this: a solution because synthetic approaches will allow the construction of large synthetic promoter libraries with which such a grammar can be constructed (58, 102, 153) and an application because such grammar definitions will allow the de novo design of synthetic promoters with any control logic of choice to function in a synthetic regulatory module (197). Once such a detailed grammar has been constructed, it can be employed to take into account combinatorial regulation in reconstructing TRNs by using and improving on software systems such as the newly developed RENCO (251). The possible role of homocooperative binding could be integrated by boosting TFBS motif scores if they are positioned next to TFBSs for the same TF, especially because it has been shown that weak-affinity TFBSs can function as strong-affinity TFBSs if they are positioned next to additional strong-affinity TFBSs (102). In cooperative binding at multiple TFBSs, but also when homocooperative binding takes place at a single TFBS (binding of a homodimer), it should be taken into account that at such TFBSs, regulation will probably be more sigmoid because of a larger concentration effect. This can be integrated into Boolean logic functions as previously demonstrated (57) (Fig. 5).
To be able to predict gene expression regulation more accurately, it is also vital that many less-well-studied regulation mechanisms are understood. One example of regulatory sequences that have been neglected is formed by the core promoter elements themselves as well as the associated UP element and the extended −10 element. How the affinity of RNAP binding to core promoters is determined by their sequence and how this affects the transcription rate of a downstream gene should be studied in more detail. This would then result in the possibility of giving genes specific coefficients that signify the strength of the core promoter without other regulatory interactions. Also, promoters that could be regulated by transcriptional interference (42, 268) should probably be assessed separately, as the mechanisms are quite complex and have not yet been studied in detail. A kind of “dominance factor” could be used to indicate the activity of one promoter at the expense of the activity of another. More global information such as the chromosomal positioning (1, 183) of genes is probably easy to integrate into in silico regulatory networks, as all genes can be given a constant that lowers the predicted level of expression of genes if they are closer to the terminus of replication. Similarly, the distance of putative TFBSs from the gene encoding their TF (corresponding to the search speed of the TF toward it) can be taken into account. Finally, non-TF sequence-specific DNA binding proteins such as Dam methyltransferases and some nucleoid proteins should be added to promoter and network analyses. Perhaps transcript cleavage factors such as GreA (133, 282) also have some sequence specificity, and this could be investigated further experimentally.
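A position-dependent constant of the kind suggested above could, for example, be computed as in the following Python sketch. It assumes a circular chromosome with the terminus located directly opposite the origin and a linear decrease of the factor with distance from the origin; both the functional form and the `max_reduction` parameter are simplifying assumptions made for illustration, not an established model.

```python
def position_factor(gene_pos, origin_pos, genome_length, max_reduction=0.5):
    """Scaling factor for predicted expression based on chromosomal position.

    Returns 1.0 at the origin of replication and (1 - max_reduction) at the
    point opposite the origin, which is assumed here to be the terminus.
    """
    linear_distance = abs(gene_pos - origin_pos) % genome_length
    # On a circular chromosome the shorter of the two arcs is the relevant distance.
    circular_distance = min(linear_distance, genome_length - linear_distance)
    relative_distance = circular_distance / (genome_length / 2)
    return 1.0 - max_reduction * relative_distance


at_origin = position_factor(0, 0, 1000)
at_terminus = position_factor(500, 0, 1000)
near_origin_other_arm = position_factor(900, 0, 1000)
```

The predicted expression level of a gene would then be multiplied by this factor before comparison with experimental data.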
When attempting to validate in silico-predicted TRNs, it should be kept in mind that expression data do not represent gene transcription rates but merely mRNA abundances. More global assessments of mRNA stability such as that performed by Selinger and coworkers (264) and the more recent mRNA sequencing techniques by, e.g., Illumina (http://www.illumina.com) are probably adequate and necessary to quantify the role of selective mRNA degradation, which may be the cause of many inconsistencies between high-throughput expression data and computational predictions. Also, at the mRNA level, the role of riboswitches should not be underestimated (14, 195, 308, 319).
Recent bioinformatic applications increasingly take the biophysical reality of protein-DNA binding into account. For example, Manke et al. recently demonstrated that it is possible to accurately predict regulatory interactions using a continuous model of TFBS binding affinities instead of discrete descriptions of the absence or presence of TFBSs (196). Importantly, this model also takes into account the binding affinity of a TF for the background sequence. Also, nucleotide interdependencies have started to be modeled into motif discovery algorithms (222, 327, 330). In complex promoters where homocooperative regulation takes place, there is a good chance that fuzzy motifs that function specifically to bind TFs with weak affinity are not incorporated in in silico predictions because of a lack of statistical significance. The role of information content in the fuzziness of motifs can also be integrated into models, as the pleiotropy of a TF could be used to determine how distant from the degenerate consensus motif a TFBS sequence is allowed to be in order to attain an optimal balance between false-positive and false-negative results. A difficulty with using the degree of TF pleiotropy in this way is that one attempts to use the output of a model (the regulon size) as input before actually obtaining this output. However, integration with experimental data and iterative modeling should be able to solve this.
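The difference between a discrete presence/absence call and a continuous score can be sketched in Python with a simple log-odds position weight matrix that is scored against a background nucleotide distribution. This is a generic textbook-style scoring scheme assumed for illustration; it is not the affinity model of Manke et al. (196), and the example sites, pseudocount, and uniform background are placeholders. Sequences are assumed to contain only uppercase A, C, G, and T.

```python
import math

BASES = "ACGT"


def build_log_odds(sites, background=None, pseudocount=0.5):
    """Build a log2-odds matrix from aligned, equal-length binding sites."""
    background = background or {base: 0.25 for base in BASES}
    matrix = []
    for column in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2((counts[base] / total) / background[base]) for base in BASES}
        )
    return matrix


def score_window(matrix, window):
    """Continuous score of one window: the sum of per-position log-odds."""
    return sum(column[base] for column, base in zip(matrix, window))


def scan(matrix, sequence):
    """Score every window of the sequence instead of returning only hits."""
    width = len(matrix)
    return [
        (start, score_window(matrix, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


known_sites = ["TTGACA", "TTGACA", "TTGACT", "TTTACA"]  # hypothetical training sites
pwm = build_log_odds(known_sites)
scores = scan(pwm, "GGTTGACAGG")
best_start, best_score = max(scores, key=lambda item: item[1])
```

Because every window receives a score relative to the background, weak sites are retained as low but nonzero values rather than being removed by a fixed cutoff, so downstream models can still take them into account.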
On the experimental side, new technical possibilities are opening up as well. Protein binding microarrays are further complementing the ChIP-on-chip approach in identifying DNA sequences that bind to specific TFs. They also have potential in discovering the functional sequence space of TFBSs if, for example, many different degenerate versions of the consensus sequence are put on an array. Maybe even more promising are ChIP-Seq approaches (198, 245). Initial methodological tests reveal that especially a combination of traditional ChIP-on-chip and ChIP sequencing yields a more comprehensive list of functional TFBSs throughout genomes (85). Also, combining high-resolution ChIP-on-chip or ChIP-Seq data with gene expression data appears to be promising for the network reconstruction of specific regulons (51). The integration of transcriptomics and metabolomics will finally also reveal more insights into the role of small-molecule concentrations (for example, as small TF binding ligands or in riboswitch regulation) in regulating gene transcription on a global scale (202), which is especially important when integrating experimental data for organisms grown in different media.
Conceivably, it will not be possible to integrate all biological mechanisms mentioned in this review into computational methods for regulatory network reconstruction. Due to the complexity of the matter, early attempts to integrate these mechanisms in general models will probably yield results that are not more accurate than the results of highly optimized simplistic models. Of course, in order to design methodologies that we can use within a reasonable time, we need to get closer to the complex biological reality without overfitting the data on too-complex and noisy models that involve too many parameters that fail to be informative (72, 150). Only a first level of biological complexity is handled by current computational approaches. New models based on the features that are described here may significantly enhance our grip on the TRNs as they actually are. Such modeling will point out (i) which features do and which do not contribute to the successful separation of genes as being part of certain regulons or not and (ii) which genes cannot be correctly classified based on the current features and thus contain features or “biology” missing in the model. As Sandve and colleagues recently mentioned, “another mathematical reformulation of existing approaches will certainly not change the status of the field” (257). However, if the integration of biological mechanisms into computational models goes hand-in-hand with advances in algorithm development and the increasing use of high-throughput experimental data to validate network reconstructions, significant advances in grasping the regulatory complexity residing inside bacterial cells can surely be expected.
Work of S.A.F.T.V.H. was in part supported by the BMBF (grant number 0313978A) within the framework of the transnational SysMO initiative in the project BaCell-SysMO.
We thank Christof Francke for critical reading of the manuscript and Siger Holsappel for graphically enhancing Fig. 2. We thank the anonymous reviewers for their constructive comments.
Sacha van Hijum was born in Bedum, The Netherlands, in 1972. He studied bacterial molecular biology (University of Groningen, The Netherlands) and obtained his Ph.D. at the Microbial Physiology Department (University of Groningen). He did three postdoctorals at the Molecular Genetics Department (University of Groningen) and one at the Interfacultary Centre of Functional Genomics (University of Greifswald, Germany). Presently, he is working at NIZO Food Research (Ede, The Netherlands) and the CMBI Bacterial Genomics Group (Radboud University, Nijmegen, The Netherlands). For the past years, research focus was on studying gene regulatory interactions in prokaryotes using computational biology techniques. Currently, the focus has broadened to data analysis and mining of high-throughput technologies such as DNA microarrays, proteomics, metabolomics, and next-generation sequencing. Bioinformatics is used to integrate these complex and multivariate data sources in order to understand the complex interactions occurring at various regulatory levels (e.g., transcriptional and metabolic networks) underlying an organism's response to its changing environment.
Marnix Medema was born in Vaassen, The Netherlands, in 1986. He obtained his B.Sc. in biology at the Radboud University of Nijmegen and then finished the top master program Biomolecular Sciences at the University of Groningen, from which he graduated in 2008. He is now starting his Ph.D. research at the department of Microbial Physiology in Groningen, on genomics and systems biology of the actinomycete bacteria.
Oscar Kuipers was born in Rotterdam, The Netherlands. He studied Biology at Utrecht University and received his master's degree in Molecular Biology, Biochemistry, and Informatics in 1986. He obtained his Ph.D. in protein engineering of porcine pancreatic phospholipase A2 in 1990, after which he was appointed as postdoctoral and later project leader and group leader of genetics at NIZO Food Research in Ede, The Netherlands. In 1999, he was appointed Full Professor in Molecular Genetics at the University of Groningen, The Netherlands. His current research interests include functional genomics and physiology studies of low-GC gram-positive bacteria. Currently, he is studying gene regulatory networks in these organisms as well as the phenomenon of phenotypic bistability occurring in, e.g., competence for genetic transformation and sporulation. Moreover, he has a keen interest in biosynthesis, regulation, immunity, and mode of action of a number of different antimicrobial peptides, in the role of metal ions in virulence and pathogenesis, and in general stress responses in bacteria.