|Home | About | Journals | Submit | Contact Us | Français|
Single-cell variations in gene and protein expression are important during development and disease. Cell-to-cell heterogeneities can be directly inspected one cell at a time, but global methods are usually not sensitive enough to work with such a small amount of starting material. Here, we provide a detailed protocol for stochastic profiling, a method that infers single-cell regulatory heterogeneities by repeatedly sampling small collections of cells at random. Repeated stochastic sampling is performed by laser capture microdissection or limiting dilution, followed by careful exponential cDNA amplification, hybridization to microarrays, and statistical analysis. Stochastic profiling surveys the transcriptome for programs that are heterogeneously regulated among cellular subpopulations in their native tissue context. The protocol is readily optimized for specific biological applications and takes about one week to complete.
Even within a clonal population, no two cells are truly equal1-4. Nonuniformities in the cellular microenvironment5-7 combine with random fluctuations caused by transcription5, 8, 9, translation10, 11, and cell division12 to yield cell-to-cell heterogeneities that can be profound. Biological mechanisms exist to suppress variation13, but they are energetically costly14. Thus, isolating “pure” subpopulations by lineage or surface markers is largely artificial, because these cells will eventually drift back to a steady-state heterogeneity15-18. Instead, a better strategy for understanding cell-to-cell differences may be to exploit population variability and consider each cell as its own self-contained experiment19-21.
This approach has now become possible with the development of global techniques for analyzing single cells22. Genomic23-26 and proteomic27-30 methods are actively advancing, but the first to appear was single-cell transcriptomics31-33. The details of mRNA expression profiling in single cells can vary widely depending on the method31-43. Algorithmically, however, the different protocols all involve roughly the same steps: 1) extract cellular RNA by chemical, thermal, or enzymatic methods; 2) perform an oligo(dT)-based capture or an abbreviated oligo(dT)-primed reverse transcription (RT) to prepare a cDNA library of roughly uniform length; 3) tail the library with homopolymer; 4) exponentially pre-amplify the tailed cDNA with a universal homopolymer-containing primer; and 5) detect the amplification products by quantitative PCR (qPCR), oligonucleotide microarrays, or RNA-seq. Steps 1–5 are iterated across dozens or hundreds of single cells in an effort to reconstruct the population-level distribution and identify recurrent expression states.
Studies using the above workflow have uncovered many qualitative heterogeneities in the areas of neuroscience32, 44, 45 and tissue development33, 41, 46, 47. Interestingly, when similar approaches were applied to cells from a common lineage—where regulatory heterogeneities are possibly more quantitative rather than qualitative—the findings were much more generic48-50. This suggested that existing transcriptomic methods did not clearly separate biological variability from measurement variability when using one cell worth of starting material42, 43. Indeed, certain steps essential to the procedure, such as the RT step, are known to add substantial measurement variation when starting with minute amounts of input RNA51, 52. A second confounding factor was that virtually all protocols required tissue dissociation to isolate single cells by micropipette aspiration or FACS34-40. This is a major drawback for studying adherent populations such as epithelia, where cell detachment alters signaling and gene expression within minutes53-55. Cell-to-cell variation caused by the dissociation procedure could distort the true heterogeneities in the resident tissue. To study single-cell biology through transcriptomics, it would be crucial to reduce measurement and handling artifacts substantially.
The collective challenges with existing methods prompted us to develop stochastic profiling56, an alternative approach for gaining single-cell information quantitatively, efficiently, and in situ. Stochastic profiling does not focus on intrinsic biological noise5 but rather on cell-to-cell heterogeneities in the regulation of gene expression. The method is based on the premise that heterogeneities in single-cell regulation can be inferred without measuring single cells explicitly (Fig. 1a). Instead, small collections of ~10 cells are repeatedly sampled at random by laser-capture microdissection57, and mRNA expression for each of these “stochastic samplings” is quantitatively profiled with a customized small-sample cDNA amplification procedure56. Highly accurate and precise expression profiles for the stochastic samplings are achievable because of the 10-fold increase in starting material compared to single cells. After building gene-by-gene histograms from 15–20 stochastic samplings (Fig. 1b), statistical hypothesis testing is then used to identify transcripts whose distribution is significantly different from the lognormal distribution, a common null model for ordinary biological variability58, 59. Transcripts subject to dichotomous single-cell regulation are identified at this step because of binomial fluctuations in the proportion of high- and low-expressing cells collected during each stochastic sampling. Finally, the filtered transcripts are clustered to identify groups of genes with correlated sampling fluctuations that suggest putative single-cell expression programs56.
Our procedure holds two main advantages over existing methods, both of which are related to the 10-fold increase in starting material. First, stochastic-profiling measurements are much more sensitive to low-abundance transcripts and less prone to experimental noise compared to single-cell amplifications56. For this reason, stochastic profiling is readily compatible with laser-capture microdissection, which is superior to dissociation-based methods for retaining adherent cells in their native context57, 60. Tissues are snap-frozen within seconds and can be ready for microdissection with minimal processing in aqueous solutions where RNA can be damaged or degraded. However, because the RNA must be proteolytically released from the microdissected specimen, the overall yield is lower than would be obtained by chemical lysis of suspension cells. 10-cell sampling offsets the decrease in RNA yield per cell and enables quantitatively accurate expression profiling56.
Second, by measuring 10 cells at a time, stochastic profiling surveys the overall population heterogeneity more efficiently. For strict single-cell methods, large numbers of samples must be processed to ensure that the major subpopulations have been captured34, 40, 41, 46. These sample numbers eventually become prohibitive because of the most-expensive step: gene detection (Step 5 above). By contrast, stochastic profiling gathers information about 10 cells per sample, which hones in on recurrent heterogeneities with fewer samples. Through computer simulations, we determined that stochastic profiling should be able to detect high-low transcriptional heterogeneities that occur with a frequency of 5–50% (Fig. 1c). It would be difficult to identify infrequent heterogeneities (~5–20%) confidently with 15–20 single-cell analyses, but such patterns are readily uncovered by stochastic profiling56, 61.
In general, stochastic-profiling clusters are extremely informative56, because spurious correlations among 15–20 random samplings are unlikely, even when surveying the transcriptome. For example, a Pearson correlation of R = 0.7 among 18 samplings has a 0.06% probability of being observed by chance, implying only six false correlations when 10,000 genes are surveyed. The simplest explanation for a correlated transcriptional cluster is that the constituent genes are jointly controlled by a common upstream regulatory factor, which is heterogeneously activated.
We have recently used this reasoning to study the single-cell regulation of FOXO transcription factors during 3D organotypic culture of breast epithelial cells61-63. Stochastic profiling identified a clear split in the sampling fluctuations of FOXO-regulated genes, which we independently validated in single cells by multicolor RNA FISH61. Over 90% of gene pairs within a single FOXO cluster were strongly correlated among single cells (R > 0.6), whereas over 60% of gene pairs across clusters were weakly correlated or uncorrelated (R < 0.4). Bioinformatic analysis64, 65 of promoters together with chromatin immunoprecipitation revealed that one cluster of FOXO target genes was coregulated by another transcription factor, RUNX1 (ref. 61, 62). FOXO–RUNX1 crosstalk was unanticipated and became apparent only when examining the heterogeneous expression state of single cells via stochastic profiling. Very recently, RUNX1 was found to be recurrently mutated in breast cancer66, 67, independently validating our earlier predictions of its tumor-suppressive role61, 62.
Looking forward, we anticipate that stochastic profiling will be useful as a tool for studying heterogeneous cell-to-cell regulation. For example, it was shown that co-fluctuating proteins in yeast act as “noise regulons” that coordinate important biological processes68. Remarkably, the functions of several regulons (e.g., stress response and protein biosynthesis) were identical to the expression programs identified previously by stochastic profiling of 3D breast-epithelial cultures56. This suggests that there may be some inherent circuits linked to heterogeneous regulation that are widely conserved20. Another future direction for stochastic profiling is to examine the mechanisms of partially penetrant phenotypes6, 69, 70. Conceivably, such phenotypes are driven by upstream molecular heterogeneities before the phenotype is obvious. Stochastic profiling could be used to search for these heterogeneities in an unbiased way.
Last, we emphasize that the principle of stochastic profiling is completely general. Although implemented for transcriptomics, the concept of random sampling could be applied to other high-sensitivity methods that analyze small numbers of cells29, 71-73. For protein analysis, the 10-cell threshold of stochastic profiling may be much easier to reach than a one-cell threshold because of the inability to amplify the starting material. Particularly exciting would be small-sample stochastic profiling of chromatin modifications at a genome-wide level74, 75. The analysis pipeline described at the end of the protocol here could be immediately adapted to such alternative implementations of our method.
In the Supplementary Software, we provide a script (StochProfParameters.m) that simulates stochastic profiling with six user-defined parameters: 1) the number of cells per sampling; 2) the number of samplings used to build the distribution; 3) the coefficient of variation (CV) of the log-normal reference distribution (CVref) that specifies the null model for hypothesis testing; 4) the underlying CV of the log-normal test distribution (CVtest), which is used to diagnose false positives; 5) the fold difference in expression (D) between high and low subpopulations; and 6) the expression fraction (F) of cells in the high subpopulation. By running this script, users can survey up to two parameter ranges at a time to assess the performance of the method for different biological applications (Fig. 1).
When working with tissues and tissue-like material, proper cryosectioning is an important first step for stochastic profiling. Ideally, fresh samples are embedded and frozen simultaneously in a dry ice-isopentane bath. However, the procedure also works with tissues snap-frozen in liquid nitrogen and then embedded afterwards. Fixed-frozen specimens are incompatible because RNA is crosslinked within the tissue and cannot be released enzymatically.
We prefer to use cryostats with disposable microtome blades that can be replaced after each set of sections is collected. During sectioning, the goal is to keep specimens at the lowest temperature possible. The downstream histology procedure is meant to preserve RNA integrity, not morphology. Thus, we routinely cut sections at temperatures that cause some chattering of the blade and flaking of the tissue. After wicking each section, the slide is placed immediately into a slide box within the refrigerated cryostat to refreeze the section as quickly as possible. The sample should never thaw thereafter. Because of the atypical sectioning requirements, we prefer to cut sections ourselves rather than submitting samples to a core histology facility.
Various stains have been reported to be compatible with laser-capture microdissection76. However, the stochastic profiling protocol uses nuclear fast red because of its superior ability to maintain RNA integrity77. Since RNA is most susceptible to hydrolysis during aqueous processing steps78, a broad-spectrum RNAse inhibitor is spiked into the staining solution immediately before use. The staining protocol described here is versatile and can be applied to various tissue types as well as cultured adherent cells plated on coverslips (Fig. 2).
After washing briefly, samples are dehydrated with an ethanol series and cleared with xylenes. We recommend purchasing ethanol in small quantities, because ethanol is hygroscopic and opened containers will draw moisture from the air. Likewise, the xylene step should be precisely controlled for effective microdissection. Excessive clearing can lead to overdrying and collateral pickup of cells adjacent to individual laser shots. Conversely, insufficient clearing will dry the section too slowly, causing ambient moisture to enter the section and making microdissection impossible (see TROUBLESHOOTING).
For stochastic profiling, it is critical to maintain the RNA integrity of each cell that is microdissected. Ultraviolet-based microdissection platforms cut tissues very cleanly, but RNA near the ultraviolet laser is severely degraded. Thus, our protocol uses an infrared-based microdissection instrument for gentle, mechanical dissociation of a single cell from its neighbors. Collateral pickup of surrounding tissue is easily removed by gently pressing the microdissection cap against a weak adhesive note. As a positive control, a larger sampling of 100 cells is included to assess the overall amplification efficiency. To control for amplification variability, a large pool of microdissected cells is split into multiple identical 10-cell aliquots after elution from the microdissection cap. Usually, the 15–20 repeated samplings are split across two separate days so that sampling-to-sampling variation and day-to-day variation can be compared. If the observed expression heterogeneities are strongly associated with specific amplification groups, it indicates problems with day-to-day reproducibility of the procedure.
The biological strategy for random sampling must be clearly defined up front, as it critically influences the types of regulatory heterogeneities that will be uncovered by the method. Stochastic profiling starts with a hypothesis about where such heterogeneities might lie hidden. Experimental conditions (time point, treatment condition, etc.) should be optimized beforehand to focus on the sought-after heterogeneities as cleanly as possible. To avoid complications from obvious cell-to-cell variation, such as differences in the cell cycle or cellular microenvironment, samplings should be collected as uniformly as possible. Key parameters to control for include cell size, distance from blood vessels, and contact with ECM or neighboring cell types. Hidden variations arising from clonal cell subpopulations can be averaged out within each sampling by microdissecting cells across different regions of the tissue. Alternatively, by collecting the sampled cells locally, clone-to-clone variations can be enriched if desired.
Small-sample cDNA amplification56 for stochastic profiling involves: 1) cellular proteolysis to release RNA from the specimen, 2) an abbreviated oligo(dT)-primed RT to yield a cDNA pool of uniform length, 3) poly(A) tailing with terminal transferase, and 4) poly(A) PCR with a universal oligo(dT)-containing primer (AL1, ref. 79; Fig. 3). Care must be taken to avoid contaminating samples with poly(A) PCR amplicons from previous experiments, and lack of contamination should be confirmed with a blank control that has been run through the entire procedure. Genomic DNA contamination is not ordinarily a problem because of the small amount of starting material, but this should be evaluated with a no-RT control. Because there is no DNAse step and the abbreviated RT often does not span an intron, we sometimes observe slight amplification in the no-RT sample for transcripts with many pseudogenes. We consider this acceptable as long as the no-RT levels are negligible compared to the transcript levels observed in the samples.
When working with samplings obtained by microdissection, the initial cellular proteolysis is critically important for high-sensitivity amplification. Tissue sections are solvent fixed and bound to the polymer on the microdissection cap (see above). Thus, RNA must be freed from precipitated ribonucleotide binding proteins in a way that is compatible with the downstream amplification steps. We use proteinase K as a broad specificity protease because of its high activity at elevated temperatures. To avoid digestion of the later enzymes in the procedure, proteinase K is irreversibly inhibited with saturating concentrations of phenylmethanesulfonyl fluoride (PMSF). Excess PMSF is then rapidly hydrolyzed in the alkaline pH of the first-strand synthesis buffer without noticeable inhibition of the RT step itself. More-stable protease inhibitors are not as effective, presumably because of interference later in the procedure.
Although our protocol was originally designed for microdissected cells56, the amplification is also compatible with suspension cells obtained by FACS or limiting dilution. The digestion buffer components are separated into two parts for washing-storage and lysis-digestion, with saponin added to the lysis component as a gentle permeabilizing agent (see PROCEDURE). Small quantities of suspension cells can be stored frozen in buffer before starting the lysis-digestion without loss of amplification efficiency (Supplementary Fig. 1). This provides a convenient means for archiving primary or flow-sorted samples before starting the amplification.
We have recently discovered that the performance of the PCR amplification depends critically on the cell type and sample format. This is likely because of differences between the overall mRNA content of different cells and the efficiency of RNA extraction during cellular proteolysis. Thus, we recommend a sample-specific optimization protocol that should be followed when adapting stochastic profiling to new biological contexts. Among all parameters, we have found that the amount of AL1 primer and the number of poly(A) PCR cycles are the most crucial for sample-specific optimization (Fig. 3). The original primer concentration (5 μg AL1 per 100 μl PCR reaction56) is the minimum required for high-sensitivity detection. Amplification of some samples continues to improve with up to tenfold higher amounts of AL1, and thus our optimization protocol recommends testing 5–50 μg in pilot experiments.
When performing the optimization, fractions of the PCR amplification should be collected from 25–40 cycles for monitoring by qPCR. The goal is to identify the maximum number of cycles where high- and low-abundance transcripts (defined by qPCR cycle threshold) are still amplifying efficiently with a 100-cell sample. We use housekeeping genes80 as abundant mRNA species and then screen various surface receptors and transcription or splicing factors that can act as low-copy readouts of the amplification. By surveying 6–8 genes within this range, the optimal AL1 concentration and PCR cycle number is readily identified for a cell and sample type (Fig. 4). Next, the amplification is repeated under the optimized AL1 and cycling conditions with serial dilutions of starting material from 100 cells to one cell. We consider the amplification valid when all transcripts tested show a reproducible log-linear increase in qPCR cycle threshold with decreasing starting material down to three cells (see ANTICIPATED RESULTS). At the 10-cell input level used for stochastic profiling, there should be no need to exclude “unsuccessful” amplifications38.
The cDNA prepared by small-sample amplification is immediately suitable as a template for qPCR, but samples must be labeled before global profiling by microarrays. Amplified cDNA is diluted and reamplified in the presence of aminoallyl-dUTP, which provides a strong nucleophile for conjugation to fluorescent succinimidyl esters. Reamplification involves design criteria that are different from small-sample amplification. During the 10-cell amplification, processivity and sensitivity of the PCR reaction are paramount. For reamplification, sensitivity is less of an issue and a higher priority is placed on achieving a high degree of labeling. We screened several polymerase blends for their ability to efficiently incorporate high levels of aminoallyl-dUTP and had the greatest success with Roche High Fidelity polymerase. Our protocol replaces 80% of thymidine bases with aminoallyl-uracil to maximize dye coupling. The aminoallyl moiety is located at the 5-position of the pyrimidine ring of uracil, which is not adjacent to the 2- and 3-positions involved in base pairing. Consequently, unreacted aminoallyl groups are not expected to interfere with microarray hybridization.
As with small-sample amplification, the number of PCR cycles during reamplification must be optimized empirically. To obtain accurate cycle-by-cycle estimates for the extent of amplification, a pilot reaction is performed in the presence of SYBR Green and monitored by real-time qPCR81. Great care must be taken to avoid saturating the reaction and ruining quantitative accuracy. Thus, the maximum number of reamplification cycles for all samples must fall near the mid-exponential phase of the first sample that amplifies detectably (Fig. 5a). Varying the number of PCR cycles on a sample-by-sample basis is not recommended, because the SYBR Green estimates by qPCR are a mixture of amplified material and primer dimer. Instead, samples with lower amounts of starting material can be reamplified in several parallel reactions that are pooled and concentrated during the purification step. This conservative strategy avoids overamplifying some of the samples inadvertently, and it retains the quantitative accuracy and reproducibility of the starting material56.
Before dye coupling, it is important to remove primer dimers from the reamplification. Primer dimers will cause an overestimation of cDNA yield, and aminoallyl-labeled primer dimers will compete for the dye label. We sought to avoid the need for a gel-purification step82, because the DNA yields after gel extraction and isolation are typically poor. Instead, we use PureLink spin columns with a modified protocol that achieves near-stoichiometric isolation and recovery of cDNA from the anion-exchange resin (see PROCEDURE). Aminoallyl-labeled cDNA is coupled to Alexa Fluor 555, which is interchangeable with Cy3 but shows superior performance for microarray applications83. Additionally, Alexa Fluor 555 decapacks are available as single-use aliquots, which save cost. After dye conjugation and secondary purification, the degree of labeling is determined by spectrophotometry with the Invitrogen Dye:Base Ratio Calculator: http://probes.invitrogen.com/resources/calc/basedyeratio.html. Our protocol enables the conjugation of 7–10 dye molecules per ~500 bp amplicon (see ANTICIPATED RESULTS) (Fig. 5b).
Alexa Fluor 555-labeled cDNA should be compatible with any commercial microarray platform. However, we have performed stochastic profiling exclusively with Expression BeadChips from Illumina because of their reduced cost, higher throughput, and equivalent performance compared to more-expensive alternatives84. The hybridization protocol is followed as recommended by the manufacturer with the following modifications: 1) 1 μg of cDNA is added to each well rather than 750 ng cRNA to account for the fact that only the complementary strand of the cDNA sample will hybridize, and 2) the samples are denatured briefly at 95 °C and then added to the slide prewarmed at the 58 °C hybridization temperature. The second modification attempts to minimize re-annealing of the labeled cDNA with its complementary strand before hybridization. From this point, slides are incubated, washed, and scanned exactly as recommended.
A stochastic profiling experiment typically involves 16–20 random 10-cell samplings and the same number of amplification controls (a larger pool of 160–200 cells split into 10-cell aliquots before small-sample amplification). Each microarray is normalized to its mean fluorescence intensity, and then genes are filtered according to two criteria for the amplification controls. First, the gene must be reproducibly amplified. Irreproducible transcripts are readily flagged, because an unsuccessful amplification causes dramatic fluctuation artifacts in the controls. We apply a loose filter to the amplification controls that removes genes with control fluctuations >5-fold. Second, each gene must be reproducibly detected. We retain genes with a median detection P < 0.1 across the amplification controls as determined by the microarray manufacturer. After filtering, the data are re-normalized by median fluorescence intensity to adjust for residual post-filtering differences in overall signal. The re-normalized 10-cell samplings comprise the final preprocessed dataset for analysis.
The first step in the analysis pipeline is to extract the genes with sampling fluctuations significantly greater than measurement fluctuations. Because eukaryotic gene-expression variability is often log-normally distributed58, 85, we logarithmically transform the data for analysis. To standardize the log-transformed data, each transcript is then scaled by its geometric mean taken across all samplings, and each sampling is scaled by its geometric mean taken across all transcripts. Next, we must isolate the transcripts with sampling variations that are significantly greater than the measurement variation intrinsic to the amplification. This allows us to estimate a reference distribution with which to compare the fluctuations of the 10-cell sampling measurements. In our original work56, we compared the CV of the sampling fluctuations with the CV of the amplification controls by using McKay’s approximation86. However, we now prefer to avoid approximations and instead directly examine the ratio of variances with respect to the F distribution87. Genes with significantly higher sampling variances relative to controls (at a user-defined false-discovery rate, FDRvar) are extracted and then sorted based on their CV for subsequent distribution testing.
There are a variety of methods for comparing empirical data with a (log)-normal distribution. Our earlier work used the χ2 goodness-of-fit test56, 61, but we now favor the K-S test because it is conservative and can be accurately applied on a gene-by-gene basis. Other alternatives could be considered when seeking greater statistical power88. To define a reference distribution for the K-S test, we inspect the cumulative distribution function of CVs from genes with measurable sampling variations (Fig. 6a). Using this empirical plot, we week to identify a CVref that encompasses the baseline biological variation but is distinct from variance caused by cell-to-cell heterogeneities in transcriptional regulation. Simulations indicate that there is a wide tolerance for detecting heterogeneities as long as CVref is below 35% and the underlying variability of the test distribution is within three-fold of CVref (Fig. 1d). Usually, a reasonable CVref can be identified around the first inflection point of the cumulative distribution function (Fig. 6a, red). The method assumes that this inflection point indicates the median baseline biological variation (low CV), which can be used as the base condition to test for heterogeneous regulation (high CV). If the earlier variance filtering is too stringent, then fewer low-CV transcripts will enter into the cumulative distribution function, making it harder to identify CVref (Fig. 6a, darker gray). Ideally, the function would appear as the superposition to two staggered sigmoid curves, indicating a clear separation of the baseline variation (reference distribution) and the heterogeneous cell-to-cell regulation comprising the variation of the test distributions. Last, we threshold the P value of each gene according to a second user-defined false-discovery rate (FDRhet), which is generally less stringent than FDRvar because of the conservative nature of the K-S test. The genes falling below this threshold are predicted by stochastic profiling to be heterogeneously expressed (Fig. 6b, green).
To facilitate the filtering and analysis of stochastic-profiling data, we provide here a pair of MATLAB functions that execute the necessary calculations (Supplementary Software). StochProfMicroarrayFilt.m takes tab-delimited ASCII files of gene names, relative microarray fluorescence intensities, and detection P values and outputs the filtered, median-scaled array data. This output can be saved as a MATLAB workspace so that the time-consuming filtering step only needs to be done once. StochProfAnalysis.m takes the filtered output as input, performing the variance and distribution tests to arrive at the final gene set, which can be standardized by Z score and clustered hierarchically.
As representative microarray data, we include two ASCII files containing 16 stochastic 10-cell samplings and 16 amplification controls (Supplementary Data 1 and 2). When executing the analysis pipeline, there are three user-defined inputs to consider: 1) FDRvar, the false-discovery rate for testing significant biological variation above measurement variation; 2) CVref, the reference CV estimating background biological variation (Fig. 6a); and 3) FDRhet, the false-discovery rate for testing significant cell-to-cell heterogeneity above background biological variation (Fig. 6b). All three parameters will influence the total number of genes predicted to be heterogeneously expressed. However, our analysis of an earlier dataset56 suggests that the fundamental clusters of single-cell gene expression are less sensitive to the exact parameter values (Fig. 6c, white boxes). We recommend that the user iterate through StochProfAnalysis.m several times with different combinations of FDRvar, CVref, and FDRhet to identify the salient clusters of interest.
Stochastic profiling provides a global means for identifying candidate genes that may be subject to heterogeneous transcriptional regulation. However, this is just a starting point for more-specific observations and perturbations of the phenomenon. To validate predicted heterogeneities, we use RNA FISH because gene-specific reagents are readily synthesized, which can be multiplexed in different fluorescence channels. When verifying a heterogeneous transcriptional cluster, multiple gene pairs should be tested in different combinations to examine the extent of coregulation. Overall, we have observed extremely good concordance between stochastic-profiling predictions and RNA FISH experiments with single genes or gene pairs56, 61.
Validated transcripts can be pursued further to test for functions of the cell-to-cell heterogeneity. We usually start by following up RNA FISH observations with immunofluorescence to confirm that regulatory heterogeneities propagate to the protein level. (Here, it is not uncommon to see some dampening in the cell-to-cell variation due to the extra steps of translation and protein turnover.) Direct functional testing can be challenging, because one needs to separate the role of the heterogeneity from the general role of the gene-protein itself. We initially seek to homogenize the cell-to-cell expression pattern by eliminating the minority expression state observed by RNA FISH. For example, if a high-expression state is observed in 15% of the overall population, we will target the endogenous gene by RNAi with the goal of eliminating the high population. Conversely, if a high-expression state is observed in 85% of the overall population, we will constitutively express the gene to eliminate the low population. The difficulty is that either of these perturbations will also change the overall levels of expression. Ultimately, assigning function to a heterogeneity requires addback approaches, where the endogenous gene is knocked down by RNAi and then an RNAi-resistant version is expressed constitutively at near-endogenous levels. Unlike specificity tests for RNAi targeting sequences89, here the expectation is that addback will not revert the phenotype caused by knockdown but instead may yield another phenotype caused by disruption of the cell-to-cell heterogeneity.
The biggest drawback of stochastic profiling is that the method does not provide a direct single-cell readout, which can be problematic for some applications. If gene-expression clusters are partially correlated, for example, stochastic profiling cannot distinguish whether single cells have a partial coexpression or whether the sampling pattern is caused by an admixture of cells with uncorrelated expression. We are actively working to develop analytical approaches for extracting accurate single-cell information from stochastic-profiling data (manuscript in preparation).
A related limitation is that stochastic profiling cannot diagnose all forms of heterogeneity. Analytically, the method assumes that baseline biological variation is log-normally distributed, which is not true for transcripts with low transcriptional burst frequencies relative to their mRNA degradation rate90. This could create problems with false positives, where regulatory heterogeneities would be predicted for genes that simply have an intrinsically noisy expression pattern. Such transcripts would need to be distinguished at the validation and follow-up phase (see above). Problems will also arise with extremely low-abundance transcripts, where some cells will have exactly zero copies, because the log-normal distribution is only defined for values greater than zero59. It is unclear whether such transcripts would be amplified with enough technical reproducibility to reach the distribution-testing phase (see above).
As with nearly all single-cell transcriptomic methods, stochastic profiling focuses on polyadenylated mRNAs and therefore cannot monitor other RNA species (miRNAs, ncRNAs, etc.). Consequently, the current method focuses on oligonucleotide microarrays for detection and not RNA-seq36, 47. A final limitation is that stochastic sampling thus far has only been performed based on cell morphology or tissue geography together with simple histological stains. In principle, fluorescent reporters or rapid immunofluorescence91 could be used in the future to achieve stochastic profiling within molecularly defined cellular subtypes.
Prepare a 20 mg ml−1 solution in nuclease-free H2O and store in 20 μl aliquots at −20 °C for up to six months. After thawing, keep at 4 °C for up to one month.
Prepare a 17.42 mg ml−1 solution in 100% (vol/vol) ethanol shortly before use.
Prepare a solution containing 15 μl nuclease-free H2O, 5 μl 100 mM dATP, 5 μl 100 mM dCTP, 5 μl 100 mM dGTP, 5 μl 100 mM dTTP, and 5 μl 80 OD ml−1 oligo(dT)24. Store in 5 μl aliquots at −20 °C for up to six months.
Prepare a solution containing 5 μl RNAse H (5 U μl−1) and 5 μl 25 mM MgCl2. Mix on ice and use immediately.
Prepare a solution containing 363 μl nuclease-free H2O, 400 μl 5× Invitrogen terminal transferase buffer, and 15 μl 100 mM dATP. Store in 100 μl aliquots at −20 °C for up to one year. Add 0.2 μl of terminal transferase (400 U μl−1) per 3.5 μl 2.6× tailing buffer immediately before use.
▲CRITICAL Do not use the Roche 5× TdT reaction buffer that come with the terminal transferase. This buffer lacks the Co2+ cofactor that is important for transferase activity.
! CAUTION Co2+ is toxic if ingested or inhaled. Use appropriate precautions.
Prepare a 25 mg ml−1 proteinase K and 1% (wt/vol) saponin solution in nuclease-free H2O immediately before use.
Prepare a 15 μg μl−1 solution in nuclease-free H2O and store in 10 μl aliquots at −20 °C for up to one year.
Dissolve 12.6 g of NaHCO3 in 100 mL of H2O. Adjust the volume to 150 mL to yield a final concentration of 1 M. Filter-sterilize the solution and store at room temperature (22 °C) for up to one month.
Dissolve 408.3 g of sodium acetate-3 H2O in 800 mL of H2O. Adjust the pH to 5.2 with glacial acetic acid. Adjust the volume to 1 L with H2O. Dispense into aliquots and sterilize by autoclaving. Store at room temperature for one year or more.
|5× first-strand buffer||20 μl|
|1× stock primer mix||2 μl|
|20 mg ml−1 proteinase K||1 μl|
|Nuclease-free H2O||57 μl|
|Total volume||80 μl|
|5× first-strand buffer||22 μl|
|1× stock primer mix||2.2 μl|
|Nuclease-free H2O||55.8 μl|
|Total volume||80 μl|
|20 U μl−1 Anti-RNase||1 μl|
|20 U μl−1 SUPERase-In||1 μl|
|100 mM PMSF||1 μl|
|Nuclease-free H2O||17 μl|
|Total volume||20 μl|
|10× ThermoPol buffer||10 μl|
|100 mM MgSO4||2.5 μl|
|20 mg ml−1 BSA||0.5 μl|
|100 mM dNTP||1 μl|
|15 U μl−1 RocheTaq polymerase||2 μl|
|15 μg μl-1AL1||0.3–3 μl|
|Nuclease-free H2O||68–70.7 μl|
|Total volume||90 μl|
|1–4||94 °C for 1 min||32 °C for 2 min||72 °C for 6 min with 10 sec|
increase at each cycle
|5–25||94 °C for 1 min||42 °C for 2 min||72 °C for 6 min 40 s with 10|
sec increase at each cycle
|26–30||94 °C for 1 min||42 °C for 2 min||72 °C for 6 min|
|31–35||94 °C for 1 min||42 °C for 2 min||72 °C for 6 min|
|35–40||94 °C for 1 min||42 °C for 2 min||72 °C for 6 min|
|10× High-Fidelity PCR buffer without Mg2+||20 μl|
|25 mM MgCl2||28 μl|
|100 mM dATP||0.4 μl|
|100 mM dCTP||0.4 μl|
|100 mM dGTP||0.4 μl|
|50 mM aminoallyl-dUTP||0.64 μl|
|10 mM dTTP||0.8 μl|
|20 mg ml−1 BSA||1 μl|
|15 μg μl−1 AL1 primer||0.6 μl|
|3.5 U μl−1 High-Fidelity polymerase||2 μl|
|100× SYBR green||0.5 μl|
|PCR-grade water||135.3 μl|
|Total volume||190 μl|
|1–40||94 °C for 1 min||42 °C for 2 min||72 °C for 3 min, then measure fluorescence|
|10× High-Fidelity PCR buffer without Mg2+||10 μl|
|25 mM MgCl2||14 μl|
|100 mM dATP||0.2 μl|
|100 mM dCTP||0.2 μl|
|100 mM dGTP||0.2 μl|
|50 mM aminoallyl dUTP||0.32 μl|
|10 mM dTTP||0.4 μl|
|20 mg ml−1 BSA||0.5 μl|
|15 μg μ1−1 AL1 primer||0.3 μl|
|Amplified cDNA||1 μl|
|3.5 U μ1−1 High-Fidelity polymerase||1 μl|
|PCR-grade water||71.9 μl|
|Total volume||100 μ1|
|1–OPT||94 °C for 1 min||42 °C for 2 min||72 °C for 3 min|
[Genes, Samples, StochSamplings, ControlSamplings] = StochProfMicroarrayFilt;
save(‘FilteredMicroarrays’)The filtered microarray data can now be recovered to the workspace at any time by typing:
[HetGenes, HetData, HetGenesPval]=StochProfAnalysis(Genes, Samples, StochSamplings, ControlSamplings);
load FilteredMicroarraysand return to Step 89.
The rapid histology protocol should give faint pink nuclear staining in cells and tissue sections, which is easily identified during microdissection (Fig. 2). For small-sample cDNA amplification, a reasonably clear optimum should exist for AL1 primer amount and cycle number (Fig. 4). By following the two-step optimization procedure, we have identified conditions for microdissected primary melanoma cells (Fig. 7a and Supplementary Fig. 2), HT-29 colon adenocarcinoma cells microdissected off of coverslips (Fig. 7b and Supplementary Fig. 3), and SKW 6.4 lymphoblastoid suspension cells isolated by limiting dilution (Fig. 7c and Supplementary Fig. 4). With microdissected samples, quantitative accuracy and reproducibility are usually lost with one-cell equivalents of starting material56 (Fig. 7a,b). This reemphasizes the importance of the random 10-cell sampling approach for microdissected tissue. Interestingly, one-cell measurements are possible with suspension cells (Fig. 7c and Supplementary Fig. 1), which is consistent with earlier results from single cells obtained by micropipette aspiration or FACS34-40. Based on simulations, the overall reproducibility of 10-cell amplification replicates must be within 35%, because background biological variation will only amplify this error and ultimately can give rise to false negatives (Fig. 1d).
During reamplification and labeling, it is not uncommon to see some spread in the total cDNA levels on a sampling-to-sampling basis (Fig. 5a, yellow). This reflects global differences in the extent of mRNA extraction from the microdissection cap. For qPCR, the differences can be accounted for with a panel of loading-control genes92. For microarrays, it is better to perform replicate reamplifications of low-abundance samples and pool before labeling (see TROUBLESHOOTING). When labeling aa-cDNA, we typically observe a degree of labeling of ~1.5 Alexa Fluor 555 dye molecules per 100 bases (Fig. 5b). The yield of 555-cDNA should be very close to 100% relative to the input aa-cDNA provided that the modified elution protocol is used with the PureLink columns.
Each microarray sample should detect a comparable number of genes to that obtained by conventional methods (typically 7000–10,000 genes depending on the platform). After running the StochProfMicroarrayFilt.m algorithm, at least half of the detected transcripts on the array should be measured with sufficient reproducibility for analysis56. The extent of cell-to-cell heterogeneities identified by StochProfAnalysis.m can vary widely depending upon the biological context and the exact analysis parameters (Fig. 6c). When a clonal cell line was globally profiled in 3D culture56, we found that 10–20% of transcripts were predicted to be heterogeneously expressed. Conceivably, this percentage could be substantially higher when considering a population of cells that is actively proliferating (Fig. 7b,c) or genomically unstable (Fig. 7a,b). Regardless of the exact numbers, stochastic profiling provides a general method for uncovering cell-to-cell heterogeneities in a variety of biological settings.
We thank Sameer Bajikar for critically reading this manuscript, Craig Slingluff for providing the primary melanoma sample, Byong Kang for help with screening low-abundance genes, and Cheryl Borgman for help with frozen sectioning. This work was supported by the National Institutes of Health Director’s New Innovator Award Program (1-DP2-OD006464), the American Cancer Society (120668-RSG-11-047-01-DMC), the Pew Scholars Program in the Biomedical Sciences, and the David and Lucile Packard Foundation.
L.W. designed the current implementation of stochastic profiling and the optimization protocol for different cellular contexts. K.A.J. conceived of the method, supervised the development of the current implementation, coded all computer simulations, and wrote the manuscript with contributions from L.W.
COMPETING FINANCIAL INTERESTS
The authors declare that they have no competing financial interests.