The study of transcription has witnessed an explosion of quantitative effort both experimentally and theoretically. In this article, we highlight some of the exciting recent experimental efforts in the study of transcription with an eye to the demands that such experiments put on theoretical models of transcription. From a modeling perspective, we focus on two broad classes of models: the so-called thermodynamic models that use statistical mechanics to reckon the level of gene expression as probabilities of promoter occupancy, and rate equation treatments that focus on the time-evolution of a given promoter and that make it possible to compute the distributions of messenger RNA and proteins. Our aim is to consider several appealing case studies to illustrate how quantitative models have been used to dissect transcriptional regulation.
The very existence of this special themed issue on CellBio-X hints at a growing belief in what one might call a Bio X effect, the idea that somehow by attacking biological problems from a physical or quantitative perspective, we will either refine our understanding of established biological processes or discover completely new effects or mechanisms. One way to view the possible significance of this emphasis on biological numeracy is by analogy to the different kinds of catch fishermen can expect when using nets or hooks of different types. Certain nets are sure to catch some fish and not others. By introducing new ways of fishing or by casting these nets or hooks in new places, a different ocean is revealed. We argue here that the kinds of approaches reflected in this special issue are a complementary kind of biological net that can reveal things that are impossible to see using traditional verbal and pictorial descriptions.
The types of quantitative approaches in biology argued for above have been ballyhooed far and wide, whether in the pages of learned reports [1, 2], through the existence of new online resources, in a variety of books and articles [4–9] or through the establishment of new programs or courses at universities around the world [10–12]. But what is the basis for this growing enthusiasm for biological numeracy and what rewards has it delivered (or might it deliver in the future) in our understanding of cellular decision making in particular?
Even as relative newcomers to the study of transcription, it is clear to all of us that with the passing of each year, the rapid pace of technological advances is resulting in a new generation of impressive and beautiful experiments that are painting a much more nuanced picture of the regulatory steps exploited by cells as they make decisions. One thing is clear: many of these experiments challenge the conventional verbal and pictorial representations of gene expression. With this increasing reliance on systematic precision measurements of gene expression [13–15] comes the possibility of asking entirely new classes of questions about how regulation works. Further, these approaches are beginning to suggest how regulatory networks can be engineered to create entirely new biological functions, one of the signature achievements of the synthetic biology approach. With the new-found emphasis on reporting the results of these experiments quantitatively there has come a growing trend to use models which are described in the same quantitative language as the data itself.
To make this claim concrete, consider the example of transcriptional repressors that bind at two sites on the DNA simultaneously, thereby looping the intervening fragment of the genome. Elegant experiments measured how the level of gene expression depends upon the length of the DNA loops in the lac operon, leading the authors to note serious differences between the in vitro and in vivo signatures of the underlying DNA mechanics. Indeed, those experiments and others like them have served as the basis of more than a decade of effort aimed at getting a deeper understanding of biological action at a distance and, specifically, at trying to reconcile the in vitro and in vivo views of DNA mechanics [17–24]. One of the ambitions of the present paper is to reveal a series of examples of precisely this character where biological numeracy serves as the basis for asking new kinds of biological questions. The history of modern biology is replete with examples of this kind: Mendel counting peas with different traits, Morgan and Sturtevant tracking the frequencies of mutations in flies, Delbrück and Luria measuring the fluctuations in the number of bacteria resistant to viral infection, Hodgkin and Huxley measuring the electrical currents across cell membranes, to name a few. In all of these cases, analysis of quantitative data from a quantitative perspective led to new biological insights.
In an earlier set of papers [19, 20], we explored biological numeracy in the context of transcription using thermodynamic models [25, 26]. Here we extend the arguments made there from the vantage point of the impressive experimental advances which have characterized the field since those articles were written. Some of these experimental advances include the direct observation of transcription at the single-molecule level [27, 28], single-cell measurements on transcription which yield protein and mRNA distributions in a population of cells [29–31], high-throughput methods which permit the analysis of many architectural motifs, and an explosion of synthetic biology transcriptional architectures [32–35], etc.
As a result of these powerful experimental advances, there has also been a new round of model building aimed at responding to this next generation of measurements. It is now becoming routine to see extremely complicated diagrams of “genetic networks” with vague and hopeful analogies to electronic circuits. What marks our understanding of such circuits and the electronic components that make them up, however, is a reliable understanding of their input-output properties (or transfer function). Part of our mission is to explore the interplay between experimental and theoretical strategies for dissecting transcriptional regulation in a way that comments on the fruitfulness of such analogies.
One of the many ways in which new experimental methods are sharpening the questions we can ask about transcription centers on the fact that it is now possible to measure the distribution of gene products in a population of cells by watching cellular decision making at the single-cell level [14, 37, 38]. We argue that distributions provide yet another way to probe the mechanistic underpinnings of observed patterns of gene expression. Though the details are themselves fascinating, our primary emphasis here will rather be on the style of quantitative thinking used in attacking these problems. Further, with apologies to the many scientists whose work has propelled the field forward, we focus on a few instructive case studies which we find best suited to illustrating our main arguments, with no attempt at being comprehensive in our coverage of the literature.
In the next section, we provide an overview of the use of thermodynamic models to study cellular decision making. The main point of this section is to show how the thermodynamic models have sharpened the questions we can ask about regulatory networks and have clarified our understanding, while at the same time bringing into relief certain surprises and paradoxes. The second main section focuses on both measurements and models in which time figures explicitly. Experiments have now reached the point where it is possible to watch the synthesis of individual mRNAs, for example, on a cell-by-cell basis. Both the individual trajectories and the distributions obtained by tallying up the behavior of many cells together pose challenges which fall outside the scope of the thermodynamic models but can be explored using rate equations that reckon how the transcriptional state of the system will change during a small instant of time Δt.
The ability to perform systematic experimental manipulation of the various parameters (such as transcription factor binding site positions, strengths and concentrations) highlighted in Figure 1 has resulted in a variety of different measurements of the level of gene expression for a spectrum of promoters [33, 39–44], though our discussion will often focus on the classic lac operon which has become a central quantitative testbed [18, 45–50]. Within the framework of the thermodynamic models which compute the probability that RNA polymerase will occupy the promoter of interest, the simplest way to make a direct comparison between the measurements and models is through the vehicle of the fold-change which gives the ratio of the level of gene expression in the presence and absence of regulatory elements whose abundance serves as an experimental knob. For the special case of simple repression considered in Figure 2A, the fold-change can be written simply as
$$\text{fold-change} = \left(1 + \frac{[R]}{K}\right)^{-1} \qquad (1)$$
where [R] is the concentration of repressors and K is an effective dissociation constant which is a measure of the affinity of repressors for their target binding sites. The origins of this formula are illustrated schematically in Figure 2B which shows how to take the cartoon representation of the various states of the promoter and to find their associated statistical weights as prescribed by the Boltzmann factor from equilibrium statistical mechanics [8, 19, 20]. Note that the concentration of polymerases does not enter eqn. (1) since we are considering the “weak promoter” approximation in which the affinity of RNA polymerase for the promoter is very weak [19, 20]. With the Boltzmann factors in hand, we can then compute the level of gene expression on the assumption that promoter occupancy and gene expression are linearly related [8, 19, 20].
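The state-counting logic described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in any of the cited studies: it assumes three promoter states (empty, polymerase bound, repressor bound), writes each statistical weight as a concentration divided by a dissociation constant, and compares the resulting fold-change with the weak-promoter form 1/(1 + [R]/K). The names `P`, `K_P` and `K_R` and all numerical values are hypothetical.

```python
def state_weights(P, R, K_P, K_R):
    """Statistical weights of the three promoter states in simple repression.

    Weights are written relative to the empty promoter, so each bound state
    contributes (concentration / dissociation constant).
    """
    return {
        "empty": 1.0,           # nothing bound
        "polymerase": P / K_P,  # RNA polymerase bound (the active state)
        "repressor": R / K_R,   # repressor bound, polymerase excluded
    }


def p_bound(P, R, K_P, K_R):
    """Probability that RNA polymerase occupies the promoter."""
    w = state_weights(P, R, K_P, K_R)
    return w["polymerase"] / sum(w.values())


def fold_change(P, R, K_P, K_R):
    """Ratio of promoter occupancy with and without repressor."""
    return p_bound(P, R, K_P, K_R) / p_bound(P, 0.0, K_P, K_R)


def fold_change_weak_promoter(R, K_R):
    """Weak-promoter limit (P / K_P << 1): fold-change = 1 / (1 + [R]/K)."""
    return 1.0 / (1.0 + R / K_R)


if __name__ == "__main__":
    # Illustrative numbers only (arbitrary concentration units).
    for R in (0.0, 1.0, 10.0, 100.0):
        exact = fold_change(P=1e-3, R=R, K_P=1.0, K_R=1.0)
        approx = fold_change_weak_promoter(R, K_R=1.0)
        print(f"[R]/K = {R:6.1f}  exact = {exact:.5f}  weak-promoter = {approx:.5f}")
```

When P/K_P is small, the two columns agree closely, which is why the polymerase concentration drops out of the fold-change in the weak-promoter approximation.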
How can we explore the potency of a formula like that given in eqn. (1) and the many others like it that are highlighted in Figure 3? Several important case studies have been carried out using well-characterized bacterial promoters which permit a direct and meaningful comparison between the measurements and this result (and similar calculations and measurements have been done for more complex regulatory architectures as shown with a few examples in Figure 2). Note that we are vehemently opposed to the idea that the goal of a model is to “fit the data”. Rather, besides the central aim of having a coherent “story” about entire suites of data and the mechanisms that underlie them, a much more useful outcome of model building of the kind we describe here is that it leads to some surprise or paradox, which in turn might imply that the original cartoon representation of the regulatory process of interest is incomplete or flawed.
Some of the most complete quantitative examples of this overall strategy have taken place in the lac operon where, by eliminating the auxiliary operators, it is possible to construct a genetic circuit with the kind of simple repression highlighted in Figure 1A. Indeed, all of the “knobs” highlighted in that figure have been systematically altered experimentally and the resulting level of gene expression has been characterized as shown in Figures 4A and 4B.
In one of the most thorough studies to date, the lac operon was probed in quantitative detail by using the thermodynamic framework to dissect the way in which the molecular factors responsible for activation and repression interact. There is an unparalleled depth of knowledge and quantitative data available for all of the molecular players and interactions responsible for the output of the lac system. This provides a unique opportunity to challenge the quantitative modeling perspective with real experimental data and demonstrate that this classic, well-characterized biological system can have new life as a proving ground for the techniques of physical biology. This case study is highlighted in Figure 4B. Here, through the judicious construction of a variety of mutants, the response of the lac system to each of its molecular components was carefully isolated and measured [42, 44]. By comparing the results of these experiments to a thermodynamic model formulated based on the known properties and interactions of the system, it was shown how the complete output of the operon can be explained in quantitative detail as the result of the accumulation of multiple known interactions between the individual components. In the language of electronic circuits introduced above, this can be likened to predicting the properties of the circuit based upon the known quantitative characteristics of its constituent capacitances, resistances and so on.
Conversely, one can imagine the characterization of a system in which much less is known about the constituents and their interactions. By comparing the results of experimental characterization to the predictions of a simple model capturing the known properties of the system, inconsistencies that arise can be a signal that our understanding of the system is incomplete. For example, the wild type response of the lac system to changing concentrations of its repressor is extremely sensitive: the output serves essentially as a switch – it is completely off at high levels of repressor and abruptly switches on as concentrations are lowered. There is nothing inherently surprising about this observation, and such behavior might be expected from a cartoon model of the action of a repressor; however, when the sensitivity of the response is compared quantitatively to the prediction of simple modeling, it is seen that such a high level of sensitivity cannot result from the action of the repressor alone. It is only through the combined action and interaction of the repressor, positive feedback, and DNA looping that the high sensitivity can be explained.
Once a gene is transcribed, it can be subject to further regulation before it is finally present in the cell as an active protein. One way in which genes can be post-transcriptionally regulated is through interaction with small untranslated RNAs, or sRNAs [51–53]. sRNAs can bind to the transcribed mRNA of genes, blocking their availability to the translational machinery of the cell or marking the mRNA for degradation. To understand these mechanisms, the same thermodynamic ideas introduced above have recently played out in the context of RNA regulation. Quantitative dissection of this kind of regulation [54, 55] shows that the stoichiometric co-degradation of sRNAs with their targets results in different quantitative regulatory characteristics than regulation by protein transcription factors, which are not consumed during regulation and act catalytically. Thermodynamic modeling conveys a deeper understanding of this mode of regulation and its advantages and disadvantages relative to regulation by proteins.
One of the frustrating features of the experimental strategy used in the case studies described above where the idea is to measure the gene regulation function (or the fold-change) is that it requires a new strain every time we want to change the number of repressors, for example. That is, each of the black data points in Figure 4A corresponds to a different strain. Is there a more systematic way to tune the repressor concentration knob without resorting to the construction of new strains? A recent set of clever experiments (just one of many illustrations of the amazing experimental advances in recent years) found a way to circumvent this limitation by allowing the dilution of the repressor molecules as the cells divided. When a mother cell containing N repressors divides, each of the daughters should get roughly N/2 repressors and in subsequent generations this results in roughly N/2^n repressors in the daughter cells when the original mother cell has undergone n rounds of division. The significance of this fact is that the level of repression is thereby titrated systematically generation by generation. In turn, the regulated gene increases its level of protein production with each subsequent generation. One beauty of this method is that it permits a direct determination of the number of repressors that are mediating the fold-change, a fundamental prerequisite to any direct comparison between the thermodynamic models and their experimental realization as shown in Figure 4A. Interestingly, this example feeds directly into the next section of the article since it illustrates some of the nuance that comes on the heels of knowing something about the fluctuations in a system as opposed to only mean values.
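To make the dilution argument concrete, the following Python sketch tabulates the mean repressor count N/2^n over successive generations and the fold-change that a simple-repression model would assign to it. It is an illustration under stated assumptions, not a reconstruction of the cited experiment: the weak-promoter fold-change 1/(1 + N/K) is assumed, molecules are assumed to partition independently between daughters with probability 1/2, and the starting copy number and the effective constant `k_eff` (in molecules per cell) are hypothetical.

```python
import random


def mean_repressors(n_initial, n_divisions):
    """Mean repressor count per cell after n rounds of division: N / 2**n."""
    return n_initial / 2 ** n_divisions


def partition(n_molecules, rng):
    """Split n molecules between two daughters, each molecule choosing a
    daughter with probability 1/2 (binomial partitioning, an assumption)."""
    first = sum(1 for _ in range(n_molecules) if rng.random() < 0.5)
    return first, n_molecules - first


def fold_change(n_repressors, k_eff):
    """Simple-repression fold-change, with k_eff in molecules per cell."""
    return 1.0 / (1.0 + n_repressors / k_eff)


if __name__ == "__main__":
    rng = random.Random(1)
    n_cell = 64   # hypothetical starting copy number
    k_eff = 4.0   # hypothetical effective dissociation constant
    print("generation  mean N  lineage N  fold-change (lineage)")
    for generation in range(7):
        print(generation, mean_repressors(64, generation), n_cell,
              round(fold_change(n_cell, k_eff), 3))
        n_cell, _ = partition(n_cell, rng)  # follow one daughter lineage
```

The single-lineage column scatters around the mean N/2^n, which is one simple way to see why fluctuations in partitioning carry information beyond the mean values.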
Experimentally, by far the most common way of exerting control of the binding of transcription factors to DNA is by using inducer molecules [42, 44]. Though this approach allows for tuning the strength of DNA binding, in this case there is an extra layer of knowledge and modeling required to explicitly link theory and experiment. Unless the intracellular concentration of inducer, which can be taken up by the cell in either an active or passive manner, as well as the parameters of inducer-transcription factor interaction are known, it is very hard to relate the extracellular inducer concentration to an effective concentration of transcription factors that are able to bind DNA.
Another way in which the transcription factor copy number is tuned in multicellular organisms is to exploit the naturally occurring spatial variation in their concentration that arises in different parts of a developing embryo. At different stages of the developmental process different spatial patterns of transcription factor concentrations are established. Recent quantitative experimental efforts in the developing fruit fly embryo are in the process of paving the way to the same sorts of systematic theory/experiment interplay already enjoyed in the study of transcription in bacteria [57–63]. For example, by measuring the spatially-dependent expression of a reporter gene that is under the control of transcription factors that have a concentration gradient along the anterior-posterior axis of the embryo, a first cut has been made at the input-output relation between the hunchback and bicoid genes as shown in Figure 4C. Building on earlier work in flies that explored the so-called minimal stripe element, recent experiments have adopted the synthetic biology approach by placing different repressor binding sites at different locations on the genome and then measuring the resulting fold change in a way that makes it possible to compare to first-generation thermodynamic models for these complex systems.
A critical assumption of the thermodynamic model approach is the use of an equilibrium framework for describing the competition between RNA polymerase and the factors that regulate it for the same piece of genomic real estate. One of the ways to judge the merits of this approach is by appealing to the relative time scales of the processes that mediate regulation in comparison with the rate of transcription initiation itself. For promoters where there is a clear separation of time scales for these two classes of processes, regulation on one side and initiation of RNA production on the other, the mean number of messenger RNAs produced by the cell is proportional to the equilibrium probability of the promoter being in a transcriptionally active state. In one limit, when the processes accompanying regulation are fast compared to those associated with initiation of RNA production, transcription factors and RNA polymerase will have enough time to reach binding equilibrium with promoter DNA, and RNA production initiates from this equilibrium state. In the opposite limit of fast transcription initiation, the slow switching between different promoter states is not affected by RNA production and the mean RNA number reflects the mean time the promoter spends in the active state. As an example, in vitro and in vivo studies of the lac promoter have found that the typical time for the Lac repressor to come on and off the promoter DNA is on the order of minutes [65, 66], while the events that lead to transcription when the repressor is not present occur on second or sub-second time scales [67, 68], thus justifying the equilibrium assumption.
The same concrete interplay between systematic measurements and thermodynamic models described in this section has been played out again and again for a range of different prokaryotic and eukaryotic promoters. Though there are reasons to be skeptical as to whether insights as dramatic as those garnered in the early days of gene regulation will come out of these kinds of quantitative approaches, the fact that so many researchers are now using these ideas signals a growing consensus that we can only claim we really understand what is going on when we can construct a quantitative framework that mirrors what is observed experimentally. Perhaps even more significantly, this kind of detailed quantitative understanding might serve as the most useful jumping off point for those trying to engineer new architectures using more than enlightened empiricism.
Despite their broad reach, the thermodynamic models are relatively silent when it comes to the growing mass of temporal measurements which examine the regulatory responses of individual cells over time or for those measurements in which cell-to-cell variability or mRNA and protein distributions are reported. For these phenomena, we must turn to a different class of models.
No matter how appealing the simplicity of the descriptions introduced in the previous section, there are now an increasing number of single-cell experiments that are delivering not only the entire distributions (as opposed to the means that are the central focus of the thermodynamic models), but also that yield the stochastic trajectories of mRNA concentrations (and protein) as a function of time as shown in Figure 5. These kinds of data call for theoretical models that go beyond the thermodynamic framework.
One general class of models that are used to respond to such data are built using rate equations or master equations (these approaches have important differences, but we focus on their common features). These models tell us how in a small time increment the population of the chemical species of interest (e.g. mRNA or protein) or the probability distributions themselves will vary [69–72]. The key assumption of these models is that one can define distinct states of the promoter, as in the thermodynamic models, and then describe the time evolution of the promoter as a biased random walk between the different states as shown in Figures 2C and 2D. The transitions from one state to the next are characterized by rate constants, namely, the probabilities per unit time that the specific transition of interest will occur [29, 70, 72–82].
If we interest ourselves in the time evolution of mRNA, the idea in these time-dependent approaches is that the amount of mRNA found at time t+Δt can be obtained by considering the amount at time t and then summing up all the ways that mRNAs can be gained and lost during that small increment of time Δt. For example, there will be a loss of mRNA due to both degradation and cell division, while there will also be terms tending to increase the amount of mRNA as a result of transcription itself (and the average rate of transcription will depend in turn upon the concentrations of regulatory proteins such as activators and repressors). The simplest model for the transcription process posits a mean production rate per unit time r and a mean degradation rate per mRNA γ, resulting in a steady-state average mRNA number of <mRNA>=r/γ.
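The bookkeeping over a small time increment Δt can be written out directly. The Python sketch below is a minimal illustration of the simplest model only, assuming a constant production rate `r` and a first-order loss rate `gamma` per mRNA, so that the mean obeys m(t+Δt) = m(t) + (r − γ m(t)) Δt; the function name and the parameter values are hypothetical.

```python
def integrate_mean_mrna(r, gamma, m0=0.0, dt=0.01, t_end=50.0):
    """Forward-Euler update of the mean mRNA number.

    Each step adds the gain (r * dt) and subtracts the loss (gamma * m * dt)
    accumulated during the small time increment dt.
    """
    m = m0
    trajectory = [m]
    for _ in range(int(round(t_end / dt))):
        m += (r - gamma * m) * dt
        trajectory.append(m)
    return trajectory


if __name__ == "__main__":
    r, gamma = 10.0, 0.5  # illustrative rates (per unit time)
    traj = integrate_mean_mrna(r, gamma)
    print(f"final mean mRNA = {traj[-1]:.4f}, predicted r/gamma = {r / gamma:.4f}")
```

Whatever the starting value, the trajectory relaxes toward r/γ on a time scale of 1/γ, which is the steady-state average quoted above.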
However, even for this simple model, if we consider the number of mRNA as a function of time the instantaneous number will not always be equal to this predicted mean value. Because the arrival and binding of individual RNA polymerase molecules at the promoter are inherently random events, at any given time there may be fluctuations resulting in slightly more or less mRNA than the predicted mean. The size of these fluctuations can be quantified by the ratio between the variance of the distribution (Var(mRNA)) and the square of its mean (<mRNA>²). For the simple model outlined above, the fluctuations are characterized by
$$\frac{\mathrm{Var}(\mathrm{mRNA})}{\langle \mathrm{mRNA} \rangle^{2}} = \frac{1}{\langle \mathrm{mRNA} \rangle}$$
This simple model of stochastic mRNA production and decay implies that mRNA is made stochastically in uncorrelated transcription events that are independent. The prediction of the model is that the mRNA number is described by a Poisson distribution, for which the variance is equal to the mean.
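The Poisson prediction can be checked numerically with a stochastic simulation. The sketch below uses the Gillespie algorithm, which is our choice for illustration and is not prescribed by the text, applied to the constant-rate production and first-order decay model; it accumulates time-weighted moments so that the variance-to-mean ratio can be compared with the Poisson value of one. The rates, the seed and the burn-in time are hypothetical.

```python
import random


def simulate_birth_death(r, gamma, t_end, seed=0, burn_in=0.0):
    """Gillespie simulation of mRNA production (rate r) and decay (gamma * m).

    Returns the time-averaged mean and variance of the mRNA number,
    ignoring the initial transient before burn_in.
    """
    rng = random.Random(seed)
    t, m = 0.0, 0
    weight = sum_m = sum_m2 = 0.0
    while True:
        total = r + gamma * m            # total rate of leaving this state
        dt = rng.expovariate(total)      # waiting time to the next event
        t_next = min(t + dt, t_end)
        if t_next > burn_in:             # time spent with m molecules
            span = t_next - max(t, burn_in)
            weight += span
            sum_m += span * m
            sum_m2 += span * m * m
        if t + dt >= t_end:
            break
        t += dt
        if rng.random() * total < r:     # choose which event fired
            m += 1                       # one transcription event
        else:
            m -= 1                       # one degradation event
    mean = sum_m / weight
    return mean, sum_m2 / weight - mean ** 2


if __name__ == "__main__":
    mean, var = simulate_birth_death(r=10.0, gamma=1.0, t_end=5000.0, burn_in=20.0)
    print(f"mean = {mean:.2f}, variance/mean = {var / mean:.2f}, "
          f"variance/mean^2 = {var / mean ** 2:.3f}")
```

For these rates the mean should sit near r/γ and the variance-to-mean ratio near one, up to sampling error, so that the variance divided by the square of the mean is close to 1/<mRNA>.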
One of the powerful insights that emerges from experimental data like that shown in Figure 5A is that they reveal that the most naïve model of mRNA dynamics described above is not borne out experimentally. Whereas the simplest model is predicated on the idea of a uniform rate of mRNA production, we see that even for a simple regulatory architecture the mRNA production is “bursty”, with brief periods of time in which the promoter is active and multiple mRNAs are produced, followed by long periods of time in which transcription is turned off. In a case like this, the governing equations are more involved since one has to track how the probability of being in either the active or inactive state changes in a time Δt [69–72, 74, 79]. However, even with this more complex two-state model, it is possible to compute the expected mean and the variance and the resulting expressions are shown in Figure 5A and more generally in Figure 3. Consistent with the observations, the variance and the mean are not equal, in contrast to the prediction of the initial naïve model.
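A two-state promoter of this kind can be simulated in the same spirit. The Python sketch below is a hedged illustration, not the model fitted in the cited work: it assumes a promoter that switches on at rate `k_on` and off at rate `k_off`, produces mRNA at rate `r` only while active, and loses mRNA at rate `gamma` per molecule. The closed-form moments in `two_state_moments` are the standard result for this two-state (telegraph) model and are quoted as an assumption rather than read off Figure 5A; all parameter values are hypothetical.

```python
import random


def two_state_moments(k_on, k_off, r, gamma):
    """Steady-state mean and variance/mean ratio for the two-state model."""
    mean = (r / gamma) * k_on / (k_on + k_off)
    fano = 1.0 + r * k_off / ((k_on + k_off) * (k_on + k_off + gamma))
    return mean, fano


def simulate_two_state(k_on, k_off, r, gamma, t_end, seed=0, burn_in=0.0):
    """Gillespie simulation; returns time-averaged mean and variance of mRNA."""
    rng = random.Random(seed)
    t, m, active = 0.0, 0, False
    weight = sum_m = sum_m2 = 0.0
    while True:
        switch = k_off if active else k_on   # promoter state change
        prod = r if active else 0.0          # transcription only when active
        deg = gamma * m                      # first-order mRNA decay
        total = switch + prod + deg
        dt = rng.expovariate(total)
        t_next = min(t + dt, t_end)
        if t_next > burn_in:
            span = t_next - max(t, burn_in)
            weight += span
            sum_m += span * m
            sum_m2 += span * m * m
        if t + dt >= t_end:
            break
        t += dt
        u = rng.random() * total
        if m > 0 and u < deg:
            m -= 1
        elif u < deg + prod:
            m += 1
        else:
            active = not active
    mean = sum_m / weight
    return mean, sum_m2 / weight - mean ** 2


if __name__ == "__main__":
    params = dict(k_on=0.1, k_off=1.0, r=20.0, gamma=1.0)  # short, intense bursts
    mean, var = simulate_two_state(t_end=50000.0, burn_in=100.0, **params)
    print("simulated mean, variance/mean:", round(mean, 2), round(var / mean, 2))
    print("predicted mean, variance/mean:", two_state_moments(**params))
```

With rare, short active periods the variance-to-mean ratio comes out well above one, the signature of bursty production that distinguishes this model from the Poisson case.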
Rather than focusing solely on the lowest-order moments of the mRNA distribution, recent measurements and models have even permitted a determination of the entire distribution [70, 77]. One particularly interesting case study in yeast is highlighted in Figure 5B. The number of mRNA molecules being actively transcribed in individual cells was determined using state-of-the-art single-molecule techniques. By measuring the entire mRNA distribution, quantitative information about the processes that must be responsible for generating the observed distribution, and even the rates at which they occur, can be determined.
As shown in this section, recent experiments are now routinely generating data that call for theoretical analysis beyond the thermodynamic models. As a result, ideas based on rate equations have stepped into the breach and are themselves producing a range of falsifiable predictions that not only guide experiments, but have altered our picture of the transcription process itself.
The amazing progress in biology in the last half century seems in many ways analogous to progress in astronomy after the invention of the telescope. The expansion of our factual understanding of living matter is staggering. Further, it seems that the analogy to astronomy goes deeper. Just as quantitative observations of the motions of celestial bodies called for theoretical underpinnings, allied with the development of this new generation of biological facts has come a concomitant need for theoretical frameworks that allow us to tell stories about these facts in a way that brings them under the same theoretical roof and in a way that suggests fruitful directions for further experimentation.
The attempt to cast our understanding of biological processes such as transcription in purely quantitative terms as reviewed in this article is only in its infancy. Indeed, there are many challenges that stand in the way of making this approach more generally applicable, including ignorance of the complete set of molecular players and linkages in many networks of interest and an unruly proliferation of parameters even in those cases where the relevant molecular actors and linkages are known. It is no accident that much of our discussion focused on the seemingly overworked example of the lac operon. This reflects the fact that in order to make quantitative progress like that advocated here, it is necessary to have a well-characterized system and few if any systems have been subjected to the same level of experimental scrutiny as the lac operon. Figure 3 is our attempt to make more generic predictions about other common regulatory architectures and to break away from a lac operon-dominated mindset. In a similar spirit, several other key case studies in yeast and flies have been used as these kinds of approaches are brought to bear on the much more challenging problems found in eukaryotes, where other factors such as nucleosomes add another level of complexity. Despite these challenges, our sense is that an important way to make continued progress is the selection of certain key case studies which will be characterized by depth rather than breadth. In these cases, the acid test should remain the ability to make testable predictions about how certain key “knobs” alter the level of expression, and the fundamental mantra of the quantitative approach is that failure of the predictions of such models is an opportunity to learn something new.
Though the discussion in this paper centered on transcription, we could have written a similar story using the same two frameworks (i.e. thermodynamic models and rate equations) for discussing signal transduction in bacterial chemotaxis, for example, and much work in this vein is already underway [83–86]. The same could be said for a variety of other interesting problems in biology. In that sense, this paper should be seen more broadly as reflecting several useful strategies with much broader biological reach than merely the fascinating topic of transcription. In each of these cases, the underlying argument is the same. As noted by Abraham Pais in his discussion of Einstein’s role in the emergence of the modern quantum theory of solids, “In order to recognize an anomaly, one needs a theory or a rule or at least a prejudice”. In that sense, the approach advocated here is to use quantitative models to build prejudices which can then serve as a scalpel to dissect experiments in a way that the traditional verbal and pictorial descriptions cannot and which reveal anomalies that can help us better understand and ultimately control living matter.
We are grateful to Rob Brewster, Robert Sidney Cox III, Ido Golding, Thomas Gregor, Daniel Jones, Justin Kinney, Dan Larson, Ron Milo, Nigel Orme, Linda Song and several anonymous reviewers for stimulating discussions, providing data, help with figures and/or critical evaluation of the manuscript. HG and RP are also extremely grateful to the NIH for support through the NIH Director’s Pioneer Award (DP1 OD000217), RO1 GM085286 and RO1 GM085286-01S. TK acknowledges the National Institutes of Health (GM078591, GM071508); and the Howard Hughes Medical Institute (52005884), AS and JK acknowledge the support of the National Science Foundation through grant DMR-0706458. AS was also supported by grants GM81648 and GM43369 from the National Institutes of Health.
Note that the reference list provided here is not comprehensive. Our strategy has been to select representative examples that can serve as an entry point into the literature for interested readers and which illustrate, through a specific case study, a much more general approach. Other authors would have come up with other citations to make their points about this vast subject with a correspondingly vast literature.