Analyzing the expression of single genes in single cells appears minimalistic in comparison to gene expression studies based on more global approaches. However, stimulated by advances in imaging technologies, single cell studies have become an essential tool in understanding the rules that govern gene expression. This quantitative view of single cell gene expression is based on counting mRNAs in single cells, monitoring transcription in real time, and visualizing single proteins. Parallel advances in mathematical models based on stochastic, discrete descriptions of biochemical processes have provided critical insight into the underlying cellular mechanisms that control expression. The view that has emerged is rooted in a probabilistic understanding of cellular processes and quantitatively explains both the mean and the variation observed in gene expression patterns among single cells. Thus, the close coupling between imaging and mathematical theory has established single cell analysis as an essential branch of systems biology.
Gene expression refers to the sum of processes that results in a particular level of a specified mRNA and protein in the cell. For many cell biological studies, gene expression is the starting point for elucidation of mechanism at the microscopic, molecular level, while the gene expression profile is the parts list compiled at a macroscopic level. Describing the coordination of gene expression is therefore a central step towards understanding cellular systems. Classical gene expression studies use isolation of mRNAs or proteins from cell populations to determine expression profiles. Those methods, however, lack spatial resolution, are not able to detect cell to cell differences within a population, and can only represent a static picture. To fully understand biological processes, more direct methods have to be applied, ideally giving the researcher the ability to monitor individual molecules within single cells in real time.
In recent years it has become possible to analyze gene expression at the single cell, single molecule level 1,2. Such studies reveal that expression of individual genes, even within a clonal population of cells, is highly variable; understanding the mechanisms that cause these differences has thus become an area of active research. The quantitative accounting of mRNA and protein expression has been made possible by a rich interaction between biologists, physicists, and mathematicians, as ever more precise measurements capable of counting single molecules have advanced in concert with mathematical descriptions of gene expression. The view of gene expression that has emerged from these studies is one in which small numbers of molecules of both mRNA and protein lead to randomness and variation within populations, which have direct phenotypic consequences 3,4.
In this review, we focus on the recent experimental developments and microscopy techniques that are being used to understand the rules that govern gene expression. In particular, we focus on methods that aim to count the total number of molecules, either mRNA or protein, in a single cell. We also summarize the theoretical approaches used to describe these experimental data and show how stochastic models are able to quantitatively describe gene expression at the single cell level. These mathematical models provide a framework for understanding how the relative balance of the kinetic steps in expression (rates of transcription, RNA decay, translation, protein decay) contributes to the differential regulation of RNA and protein in the cell. The remarkable developments in this field, both experimental and theoretical, have led to a quantitative description of gene expression in a context that can be readily utilized by cell biologists.
After the development of GFP as a tool in cell biology, observation of fluorescent proteins quickly became the most quantitative experimental method for measuring gene expression in single cells. Even though protein accumulation is the final step in gene expression, quantification of abundance and variation in protein levels was used to infer the mechanisms of gene expression which acted at the level of individual genes and mRNA 1,2.
The first effort to obtain absolute protein numbers on a genomic scale, however, did not use single cell methods but was rather an ensemble measurement. Ghaemmaghami and colleagues determined global protein abundance in S. cerevisiae using a library of Protein A-tagged strains and quantitative western blotting, providing the first glimpse into the scale of protein abundance for a complete organism5. Using a similar library where proteins were tagged with GFP, two studies then measured relative protein concentrations by FACS and determined the variability of protein expression for proteins in S. cerevisiae (Box 1) 6,7. Measuring the abundance and variation of individual proteins provided the quantitative basis for assessing different models of expression regulation. Combining absolute numbers and expression variability, complemented by earlier studies measuring expression variation of small sets of proteins, these studies concluded that variation in protein expression is dominated by the stochastic production/destruction of messenger RNAs and scales with natural protein abundance 8–12. This variation in expression arising from stochastic fluctuations has been referred to as “noise” in gene expression. These studies suggested further that protein-specific differences in noise correlate with a protein’s “mode” of transcription, meaning the kinetic details of how a gene is transcribed into mRNA8–12.
Fluorescent techniques are the most direct tools for measuring protein and mRNA concentrations in single cells. Detection of single mRNAs has been established in multiple labs as a method for quantitative gene expression analysis 3,25,26. Single protein detection in cells, however, is technically still very challenging, and most measurements of protein distributions are achieved by quantifying relative protein abundance5. For both protein detection and RNA detection, one can obtain probability distributions by looking at a single cell over time or by looking at many cells at a particular instant in time46. In live cell measurements, one can measure kinetics directly; in fixed cell measurements, dynamics are inferred from the probability distributions.
The two commonly used methods to measure protein concentration in single cells are fluorescence-activated cell sorting (FACS) and live cell microscopy. Both measure signals emitted by fluorescent proteins. FACS has been used extensively to determine the expression variation of a collection of >2500 yeast genes6,7. However, its limited sensitivity does not allow detection of low abundance proteins. In addition, each cell is analyzed only once. Live cell microscopy, on the other hand, acquires time series of individual cells, resulting in a direct measure of expression kinetics, and fluorescence microscopy is more sensitive than FACS. However, fewer cells are analyzed compared to studies using FACS13.
Fluorescence in situ hybridization (FISH) allows the detection of single mRNAs in intact cells (Figure 1A)3,25,56. FISH is the most direct way to acquire quantitative mRNA expression data as no genetic manipulations are required. However, cells have to be fixed, and FISH only provides a snapshot of mRNA abundance and gene activity. Similar to FACS measurements of protein distribution, FISH provides information on the kinetics of expression by considering many cells at a snapshot in time. Using probes labeled with different dyes, FISH can also be used to measure the expression levels of multiple genes within the same cell55. In contrast to FISH, the MS2 system allows mRNA detection in living cells (Figure 1B). Single cytoplasmic mRNAs as well as nascent transcripts at the site of transcription are detected in real time using this fluorescent protein based approach27,28. Yet single mRNA counting in living cells is challenging: simultaneous quantification of single mRNA abundance and nascent mRNA has not yet been described in living eukaryotes. Nevertheless, the MS2 system is the only method that directly visualizes mRNA expression in real time (see also Table 1).
These models still contained unknown parameters. One is how many proteins are translated from a single mRNA. Knowing the number of proteins per cell and the number of proteins per mRNA is critical for understanding the stochastic variation that has been observed in gene expression.
The visualization of single proteins in single cells provides the ultimate sensitivity in quantifying gene expression. However, in addition to simply observing single molecules, one must be able to record each and every protein molecule in the cell at a given time, or produced from a given mRNA. This mandate exceeds the already stringent experimental conditions required for single molecule microscopy and demands a new set of experimental approaches. One approach was realized in a landmark paper in 2006 by Yu and coworkers in E. coli 13. The authors attached the fluorescent protein Venus to a membrane protein, Tsr, constituting a reporter for monitoring lac operon activity. The membrane localization of Venus slowed the diffusion of the reporter protein so that it could be visualized in the microscope. After the protein was imaged, it was immediately bleached in preparation for observing the next membrane-localized Venus. Protein production depended on dissociation of the repressor from the operator region of the DNA. Using this system, the authors were the first to observe that protein molecules are generated in bursts from a single mRNA. These bursts of protein production are due to the relative rates of protein translation and mRNA degradation, and have fundamental consequences for gene expression (see below). Protein bursting amplifies variation that occurs from stochastic production of RNA, because each mRNA can produce multiple proteins before it is degraded (4.2 in the case measured by Yu et al.). The same laboratory also showed, using a different method of cell lysis followed by enzymatic amplification, that the yeast protein β-galactosidase is synthesized in geometric bursts of 1.7 proteins/mRNA (14).
Thus far, direct microscopic visualization and counting of proteins for single molecule gene expression measurements has only been demonstrated in prokaryotes. Such an approach may be quite challenging to implement in eukaryotes where translational burst sizes are likely to be larger, due to the longer lifetimes of mRNA. One approach for overcoming this problem has been proposed by Rosenfeld and coworkers15. Their method for obtaining absolute protein numbers is based on long-time observation of dividing cells. If partitioning of proteins at cell division obeys a binomial distribution – every protein partitioning event is independent – it is possible to empirically determine the number of proteins which were present before cell division. However, any approach which seeks to count fluorescent proteins will be confounded by protein folding and maturation of the fluorescent protein chromophore16.
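The binomial partitioning argument can be illustrated with a small simulation. This is a minimal sketch of the idea, not the published analysis: it assumes each protein goes to either daughter with probability 1/2, so that the squared difference in daughter fluorescence is on average equal to the fluorescence per protein times the total fluorescence. The function names, the calibration factor of 3.0 units per protein, and the range of protein counts are hypothetical choices made for illustration.

```python
import random

def simulate_divisions(protein_counts, units_per_protein, rng):
    """Partition each mother cell's proteins binomially (p = 0.5).

    Returns a list of (total fluorescence, daughter fluorescence difference).
    """
    pairs = []
    for n in protein_counts:
        n1 = sum(rng.random() < 0.5 for _ in range(n))  # proteins in daughter 1
        n2 = n - n1                                     # proteins in daughter 2
        pairs.append((units_per_protein * n, units_per_protein * (n1 - n2)))
    return pairs

def estimate_units_per_protein(pairs):
    """Estimate fluorescence per protein from division events.

    For binomial partitioning, E[(F1 - F2)^2] = units_per_protein * F_total,
    so the ratio of the two sums estimates the calibration factor.
    """
    squared_diffs = sum(diff * diff for _, diff in pairs)
    totals = sum(total for total, _ in pairs)
    return squared_diffs / totals

# Hypothetical data: 2000 divisions, 50-500 proteins per mother cell,
# 3.0 fluorescence units per protein (unknown to the estimator).
rng = random.Random(0)
counts = [rng.randint(50, 500) for _ in range(2000)]
pairs = simulate_divisions(counts, 3.0, rng)
estimate = estimate_units_per_protein(pairs)

first_total = pairs[0][0]
print(f"estimated units per protein: {estimate:.2f}")
print(f"first mother cell: {first_total / estimate:.0f} proteins (true {counts[0]})")
```

Once the calibration factor is known, dividing any cell's total fluorescence by it gives an absolute protein number. The sketch ignores chromophore maturation and measurement noise, both of which would bias a real measurement.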
Measuring protein distributions and determining the amplitude of translation bursts revealed different parameters necessary to model gene expression1,2. However, accumulation of proteins is the last step of gene expression and is influenced by multiple upstream processes. Changes in protein levels can be caused by altering transcription, RNA or protein half-life, translation efficiency or any combination thereof. To understand the entire expression pathway, the individual steps have to be analyzed independently, necessitating direct measurements of transcriptional output by determining mRNA levels.
As with protein levels, in vitro ensemble measurements were first used to quantify RNA and were crucial for understanding gene expression at the single cell level, especially in yeast. Quantitative microarrays showed that expression levels for more than 80% of genes are very low, fewer than two copies per cell17. Combining mRNA copy number and half-life allows the calculation of average transcription frequencies for each gene17. These numbers, often in combination with measurements of protein concentration and/or protein noise, were used in many studies to model gene expression kinetics. However, knowing only these average numbers, rather than the whole distribution as at the protein level, limited the descriptive power of models, and determining the variability in mRNA expression became essential6.
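The calculation of an average transcription frequency from copy number and half-life can be sketched as follows. The sketch assumes steady state and first-order mRNA decay, and it ignores dilution by cell growth; the function name and the example values (2 copies per cell, 20 min half-life) are hypothetical and not taken from the cited data.

```python
import math

def transcription_frequency(mean_mrna_per_cell, half_life_min):
    """Average initiation frequency (transcripts per minute) at steady state.

    At steady state, production balances first-order decay:
    frequency = mean copy number * decay rate, with decay rate = ln(2) / half-life.
    """
    decay_rate = math.log(2) / half_life_min  # per minute
    return mean_mrna_per_cell * decay_rate

# Hypothetical gene: 2 mRNA copies per cell, 20 min half-life.
rate = transcription_frequency(2.0, 20.0)
print(f"{rate:.3f} transcripts per minute, one every {1 / rate:.1f} min")
```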
Different approaches were introduced to determine mRNA concentration in individual cells (Table 1): single cell quantitative PCR, single cell microarrays, in situ fluorescent PCR, the MS2 system (described below) and single molecule resolution fluorescence in situ hybridization (FISH)3,18–26. FISH proved to be a particularly fruitful approach (Figure 1/Box 1). Pioneered in a study by Femino et al., single mRNA sensitivity FISH allowed detection of individual mRNAs in fixed cells and was able to determine the exact number of mRNAs per cell for any gene of interest3,25,26.
Determining mRNA distribution in single cells showed that the variability in expression levels for different genes was much larger in higher eukaryotes than in yeast 3,26. Integrating expression variability into kinetic models revealed the existence of a range of kinetic modes by which mRNAs are expressed. At one extreme, genes are transcribed in bursts, where periods of activity are interspersed with long periods of inactivity. In another mode, transcription events are uncorrelated and uniformly distributed in time. Raj and colleagues showed bursting expression for two genes in higher eukaryotes26. Using a tetracycline induced reporter gene, the authors demonstrated that mRNA levels varied considerably when the gene was activated and showed that those distributions can only be explained by bursting transcription. The second gene, the endogenous gene coding for RNA polymerase II, showed a similar bursting expression pattern. These results suggested that transcription bursting might be the prominent expression mode in higher eukaryotes. On the other hand, experiments in yeast showed a very narrow distribution of the expression levels for three housekeeping genes, suggesting that these genes do not burst but are constitutively transcribed3. Their variability was low enough to be explained by pure Poisson noise. Interestingly, the same study also found a gene in yeast showing much higher variability, suggesting that constitutive as well as bursting transcription exists in yeast.
Bursting transcription was also described in E. coli27. Here, RNA was not detected by FISH but rather by using the MS2 system. This approach uses a unique genetically encoded tag that, when inserted into RNA and bound by specific fluorescent proteins, allows mRNA detection in living cells (Figure 1B)28. The advantage of this system is that expression levels are monitored in real time, yielding expression data with high temporal resolution. The MS2 system has only been applied to mRNA counting in bacteria but will likely be a powerful tool in other organisms.
With the inclusion of mRNA distributions, one can achieve a more complete description of gene expression than is possible by considering only protein distributions. However, mRNA levels are not a direct measure of transcription per se. The inference of transcriptional dynamics that comes from counting mRNA in fixed cells is limited by the half-life of mRNA. For an mRNA with a 30 min half-life, the steady state cytoplasmic mRNA level reflects almost an hour of mRNA expression. However, transcriptional responses are often fast and, depending on the length of a gene, require only a few minutes to produce mRNA 29.
Higher temporal resolution observation of transcription kinetics can be obtained only by measuring transcription directly. Single cell methods for studying transcription rely on the ability to detect nascent mRNAs. Using the MS2 system, Chubb et al. studied the expression of the developmentally regulated dscA gene in Dictyostelium (Figure 1B). The study found that transcription occurred in irregularly spaced bursts, with the length and amplitude of these bursts staying fairly constant30. Transcription of the yeast CUP1 gene, on the other hand, was shown to be achieved in a different manner. Upon induction, mRNA production was constant over the time of activation31. The constant transcription was rather surprising when compared to the binding behavior of the transcriptional activator Ace1p that regulates CUP1 transcription. Using fluorescence recovery after photobleaching (FRAP, described in this issue by Lidke & Wilson), the authors showed that Ace1p bound only transiently to the CUP1 promoter, with a residence time of less than two minutes, suggesting that constant rebinding of Ace1p was required to ensure efficient transcription.
The low stability of promoter complexes in living cells (determined by FRAP, and reviewed in 32) appears to be a common phenomenon, and might be one important factor that defines transcription kinetics. Many activators have very short dwell times at the transcription site, some for only a few seconds, suggesting that activators do not have to be stably bound to their promoter to allow transcription33–36. Their affinity, however, might regulate transcription frequency. Binding of the HSP activator which regulates Drosophila heat shock genes becomes very tight upon heat shock37. Heat shock genes are very efficiently transcribed, with new transcripts initiated about every four seconds at full activation29. It is possible that tight binding of activators allows efficient transcription but simultaneously reduces the flexibility to fine tune the transcriptional response. In addition, the position of activator binding sites with respect to histones was shown to affect both transcription initiation and transcription rate38,39. To further underscore the dynamic, probing nature of molecular interactions at the gene, Darzacq et al. showed that only about 1% of polymerase-gene interactions lead to completion of an mRNA40. Thus, there seem to be many different dynamic ways to modulate the transcriptional outcome, and a combination of methods will likely be required to dissect this process, probably gene by gene.
A relatively simple expression control seems to occur at constitutively expressed housekeeping genes in yeast. Zenklusen and colleagues used single molecule resolution FISH to determine the exact number of nascent mRNAs on constitutively expressed genes3. This analysis showed that on short genes expressed at a low level, only a single nascent mRNA is detected at the gene. At a transcription elongation velocity of less than 1 kb per minute, this suggested that initiation events of individual mRNAs were separated by minutes. Taken together with the stability of promoter complexes described above, it seems likely that assembled transcription factor complexes often fall apart after initiation of a single mRNA. Combining polymerase occupancy data (determined from nascent mRNA at a transcription site) with the counting of mRNAs within the same cell further allowed modeling of the expression kinetics of these genes and showed that individual initiation events were uncorrelated with each other for most genes3. This simple regulation might suggest the existence of a stochastic limiting step that controls the expression behavior. Such a step may constitute the binding of an activator, opening of chromatin, assembly or stability of a pre-initiation complex or the efficiency of a polymerase to enter elongation. Measuring transcriptional responses in real time with single mRNA resolution will be necessary to dissect these different possibilities.
The advantage of counting single molecules is that one obtains the probability distribution of molecules corresponding to each stage of the central dogma for a single gene. The probability of observing a certain number of proteins or RNA molecules in a single cell carries more information than the mean alone: one is able to infer general rules and mechanisms for expression based on comparisons between mathematical models and the observed probability distributions. These mathematical models differ from those that cell biologists are accustomed to encountering. Instead of continuous, deterministic models of kinetic behavior, the mathematics of gene expression is described by discrete, stochastic models. This latter class of models takes into account the small numbers of molecules involved, at both the mRNA and protein level, even though the basic kinetic mechanisms (for example, first-order kinetic decay of mRNA and protein) are physically the same in both cases41. Indeed, there has been a tremendous amount of parallel development both in the theoretical models which predict single molecule distributions and in the experimental techniques which can measure these single molecule distributions. In many cases, there is excellent agreement between the model and the experiment, enabling a distillation of a large body of work on gene expression, for example in S. cerevisiae, into a few numbers.
The gene expression description which has gained wide popularity, both for its simplicity and generality, is one in which a gene can be considered off (incapable of producing transcripts), or on (capable of producing transcripts) (Figure 2A). When the gene is on, transcripts can be produced with a certain initiation rate (ν0, following the notation of ref. 42). These transcripts are degraded with a rate d0 and translated into protein with a rate ν1; the protein is likewise degraded at a rate d1. This model of gene induction, sometimes called a Random Telegraph Model41, was first proposed by Ko 43 and later expanded by Peccoud and Ycart44. This model results in a set of coupled differential equations for the probabilities of each molecular state, known as the master equation, which explicitly takes into account the random nature of events associated with single molecules44. The solution to this master equation describes gene expression, from gene to mRNA to protein, at the single molecule level and takes the form of a probability distribution. Obtaining this solution under various limiting cases is the basis for a quantitative understanding of gene expression.
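The on/off model described above can also be explored numerically. What follows is a minimal sketch using the Gillespie stochastic simulation algorithm, restricted to the mRNA level for brevity. It is one standard way to sample trajectories consistent with the master equation, not the method used in the cited work. The switching rates `k_on` and `k_off`, the function name, and all numerical values are assumptions introduced for illustration; ν0 and d0 follow the notation in the text.

```python
import random

def simulate_telegraph(k_on, k_off, nu0, d0, t_end, rng, gene_on=False):
    """Gillespie simulation of the on/off gene model (mRNA level only).

    Returns the time-averaged mean and variance of the mRNA copy number.
    """
    t, m = 0.0, 0
    sum_m = sum_m2 = 0.0
    while True:
        rates = (
            0.0 if gene_on else k_on,   # gene switches off -> on
            k_off if gene_on else 0.0,  # gene switches on -> off
            nu0 if gene_on else 0.0,    # transcription initiation
            d0 * m,                     # first-order mRNA decay
        )
        total = sum(rates)
        # Waiting time to the next event is exponential with rate `total`.
        wait = rng.expovariate(total) if total > 0.0 else float("inf")
        if t + wait >= t_end:
            sum_m += m * (t_end - t)
            sum_m2 += m * m * (t_end - t)
            break
        sum_m += m * wait
        sum_m2 += m * m * wait
        t += wait
        # Pick which event fired, in proportion to its rate.
        r = rng.random() * total
        if r < rates[0]:
            gene_on = True
        elif r < rates[0] + rates[1]:
            gene_on = False
        elif r < rates[0] + rates[1] + rates[2]:
            m += 1
        else:
            m = max(m - 1, 0)  # guard against floating-point edge cases
    mean = sum_m / t_end
    return mean, sum_m2 / t_end - mean ** 2

rng = random.Random(1)
# Constitutive limit: gene always on (k_off = 0); expected mean nu0/d0 = 10.
const_mean, const_var = simulate_telegraph(1.0, 0.0, 1.0, 0.1, 20000.0, rng, gene_on=True)
# Bursty case: rare, short on-periods with a high initiation rate (same expected mean).
burst_mean, burst_var = simulate_telegraph(0.01, 0.1, 11.0, 0.1, 20000.0, rng)

print(f"constitutive: mean {const_mean:.1f}, variance/mean {const_var / const_mean:.1f}")
print(f"bursty:       mean {burst_mean:.1f}, variance/mean {burst_var / burst_mean:.1f}")
```

Both parameter sets are chosen to have the same expected mean, ν0/d0 multiplied by the fraction of time the gene is on, `k_on / (k_on + k_off)`. Only the ratio of variance to mean separates the constitutive case from the bursty one, which is why the full distribution is more informative than the mean alone.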
The steady state solution was first obtained by Raj and coworkers 26 who used it to explain the distribution of PolII mRNA in fixed cells. This elegant work, both experimental and theoretical, demonstrated how variation in expression begins with the process of transcription. Recently, a time-dependent solution to the master equation was reported by Shahrezaei and Swain 42 and by Iyer-Biswas and coworkers45.
The primary implication of the telegraph model is that variation in gene expression is greatly increased through the process of transcriptional or translational bursting. Mathematically, transcriptional bursting means simply that many transcripts are produced during a single on state of the gene26,27,43,46,47; translational bursting means that many proteins are produced from a single mRNA11,13,14,48–50. Biologically, a transcriptional burst may be due, for example, to the stability of a transcription preinitiation complex, leading to many transcripts produced from a stable complex (26,27,43). Transcriptional bursting does not occur for all genes but is rather one limiting kinetic case that can be observed3. A translational burst is due to the fact that translation frequency (ν1) is greater than mRNA decay frequency (d0) for most genes 9,42,47,48. The translational burst from a single mRNA follows a geometric distribution 48,50 (see also Figure 2B), and has been observed directly13,14. Intuitively, this geometric distribution can be understood as the relative frequency of encounter of a single mRNA with either the translation machinery or the RNA decay machinery. When translation frequency is greater than RNA decay frequency, the mRNA is more likely to be translated than degraded. The probability of a burst of n proteins is therefore the probability of encountering the translation machinery n times in a row before encountering the decay machinery once48. The result is a long-tailed decaying distribution for the number of proteins/mRNA, which is very different from the peaked distribution of protein/cell (Figure 2). In the former case, the mean number of proteins to come from a single mRNA is the ratio of translation/mRNA decay (ν1/d0), but the most likely number of proteins to come from a single mRNA is zero.
Thus, the balance of production and decay not only determines the mean, but also the relative variation, providing the cell with a means of limiting or enhancing variability according to selective pressure10.
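The geometric burst distribution described above can be written out explicitly. The following is a sketch under the assumption that translation and mRNA decay are independent, memoryless events with rates ν1 and d0; the shorthand b for the mean burst size is introduced here for convenience and is not notation from the cited work.

```latex
% b = \nu_1 / d_0 : mean burst size (translation rate over mRNA decay rate)
P(n) = \left(\frac{\nu_1}{\nu_1 + d_0}\right)^{n} \frac{d_0}{\nu_1 + d_0}
     = \frac{b^{n}}{(1+b)^{n+1}}, \qquad n = 0, 1, 2, \dots

\langle n \rangle = \sum_{n \ge 0} n\,P(n) = \frac{\nu_1}{d_0} = b,
\qquad
\frac{P(n+1)}{P(n)} = \frac{b}{1+b} < 1 .
```

Because each successive term is smaller than the one before it, the most probable burst size is n = 0 for any value of b, even though the mean burst size b can be large.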
The consequences of stochastic gene expression, and the success of the stochastic model in explaining measured probability distributions, can be illustrated by considering the S. cerevisiae gene MDN1, which codes for a protein involved in ribosome biogenesis. The gene is a housekeeping gene which is necessary for survival and present at low copy number in every cell (Figure 2B). For this gene, the model can be simplified even further because the gene is always on, producing transcripts in single uncorrelated events3. The steady state solution to the master equation for mRNA distribution then becomes a Poisson distribution. The measured distributions of nascent chains, mRNA/cell, and protein/cell are shown in Figure 2B as gray bars. The theoretical probability distributions are shown as red lines, with the corresponding equation shown underneath. There are no free fitting parameters in these curves: the kinetic rate constants are the initiation frequency (ν0, obtained from Zenklusen et al.3), the RNA decay rate (d0, obtained from Holstege et al.17), the translation rate (ν1, obtained from Arava et al.51), and the protein decay rate (d1, obtained from Belle et al.52). The final output is the protein abundance and variation, from Ghaemmaghami et al.5 and Newman et al.6, respectively. Although MDN1 is a simple example of gene expression, the complete agreement between theoretical, biochemical, and microscopic data from multiple laboratories is a milestone in our description of gene expression.
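For an always-on gene, the Poisson prediction can be checked against single cell counts in a few lines. This is a minimal sketch: the rate values and the list of per-cell mRNA counts below are hypothetical placeholders, not the measured MDN1 data, and the function names are introduced here for illustration.

```python
import math

def poisson_pmf(m, lam):
    """Probability of observing m mRNAs when the mean is lam = nu0 / d0."""
    return math.exp(-lam) * lam ** m / math.factorial(m)

def fano_factor(counts):
    """Variance divided by mean; equals 1 for a Poisson distribution."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / n
    return variance / mean

# Hypothetical rates (per minute) for a constitutive gene.
nu0, d0 = 0.12, 0.02
lam = nu0 / d0  # predicted mean mRNA per cell, no fitting parameters

# Hypothetical per-cell mRNA counts, e.g. from single molecule FISH.
observed = [4, 6, 7, 5, 9, 6, 3, 8, 6, 5, 7, 6, 4, 10, 6, 5]

for m in range(0, 13, 3):
    print(f"P({m:2d} mRNAs) = {poisson_pmf(m, lam):.3f}")
print(f"observed mean = {sum(observed) / len(observed):.2f}, predicted = {lam:.2f}")
print(f"observed Fano factor = {fano_factor(observed):.2f}")
```

A Fano factor close to 1 is consistent with uncorrelated initiation events, while values well above 1 point toward bursting.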
The immediate question that arises from the telegraph model of gene expression is: what is the biological interpretation of the on and off state or the active and inactive state? In some cases, it has been possible to connect an on/off state with a direct biological correlate, for example nucleosome remodeling around the promoter12. However, other scenarios may apply for different genes, and may be as simple as the kinetic dwell time of a specific factor or as global as a stage of the cell cycle3. The strength of this mathematical description lies in the ability to classify a wide range of behaviors with a few generic rate constants. Although a complete thermodynamic description of a particular regulatory unit based on kinetic rate constants is always desirable53, for a great many genes, especially in eukaryotes, this description requires a level of detailed understanding of the constituent elements that is not yet available. Therefore, models such as the telegraph model (and multi-state extensions thereof54) provide an abstract intermediate for classification which seems particularly suited to the complexity of cell biological studies.
The ability to count molecules within cells is an important step towards a more quantitative analysis of gene expression. Just as high throughput sequencing markedly advanced our knowledge of gene expression by counting sequence tags, single molecule counting in cells has introduced a new era in quantitative gene expression analysis (Table 1). Integrating these numbers into mathematical models will reveal important insights into the mechanisms of gene expression. One limitation is that single mRNA and protein counting is still restricted to a few genes per cell, compared with the genome-wide capability of techniques such as RNA sequencing. The ability to analyze more genes simultaneously within single cells will provide a systems level understanding at the single molecule level 55,56.
One of the experimental challenges in a complete quantitative description of gene expression is to obtain measurements of the distribution of proteins translated from a single mRNA. Implicit in the theoretical model above is the assumption that translation events and mRNA decay events are independent. This assumption results in the geometric distribution of protein/mRNA. However, in many cases, this assumption may not hold, and there is a competing model where translation leads to modifications of the mRNA which make the RNA increasingly likely to be degraded57. At present, estimated protein burst sizes in S. cerevisiae disagree by orders of magnitude. Bar-Even and coworkers report an average calculated protein burst size of ~1200 for >40 genes (7), Cai et al. measure a burst size of 1.7 for a single gene (14), and the MDN1 gene has a calculated burst size of 30 (3,17,51). To better understand how protein production is controlled from single mRNAs, it will be necessary to achieve both single RNA imaging and single protein imaging in the same cell.
This combination of systems biology, computational biology, and single molecule microscopy lays the groundwork for a quantitative understanding of gene expression that will expand rapidly in the coming years.