We initially based our analysis on murine Th2 cells (
Zhu et al, 2010), as these cells can be obtained in large quantities
ex vivo and can be prepared as a pure and homogeneous cell population. Furthermore, there is a well-characterized set of genes whose proteins are known to be expressed and functional in Th2 cells, as well as a set of genes known to be not expressed in these cells (
Supplementary Table S1 lists the genes we used in our study,
Supplementary Figure S1 shows expression of two marker proteins in the cells).
We generated Th2 poly(A)+ RNA-seq data for two biological replicates and calculated gene expression levels using the standard measure of Reads Per Kilobase per Million (RPKM;
Mortazavi et al, 2008,
Supplementary Table S2 gives the number of reads and mappings we obtained). The expression levels of the biological replicates are highly correlated (
r2=0.94,
Supplementary Figure S2). We then calculated the mean RPKMs of the two samples for all genes and log
2 transformed these values.
Displaying the distribution of all gene expression levels as a kernel density estimate (KDE) reveals an interesting structure: the majority of genes follow a normal distribution, which is centered at a value of ~4 log
2 RPKM (~16 RPKM), whereas the remaining genes form a shoulder to the left of this main distribution (, solid line). This was conserved under different KDE bandwidths (
Supplementary Figure S3, left panel) or different histogram representations (
Supplementary Figure S3, right panel). As genes with zero reads cannot be included on the log scale, we prepared an alternative version of where we assigned low RPKM values to these. This helps to illustrate the fraction of zero-read genes (
Supplementary Figure S4). As a comparison, we studied microarray data for the same cell type from a recent publication (
Wei et al, 2009). The correlation between the microarray and the RNA-seq data was very good and highly statistically significant (Pearson
r2=0.83, Spearman ρ=0.84,
Supplementary Figure S5). Surprisingly, displaying the distribution of microarray expression levels results in a clearly bimodal distribution (). Again, the appearance of the distribution was insensitive to the KDE bandwidth choice or histogram bin size (
Supplementary Figure S6). The bimodality was conserved when alternative normalization and processing schemes were used, independent of KDE bandwidths (
Supplementary Figure S7).
Visual inspection of both microarray and RNA-seq data thus reveals two overlapping main components of the distribution of gene expression levels. Quantifying this by curve fitting confirms a good fit to two distributions: the goodness-of-fit (measured by Akaike Information criterion, AIC (
Akaike, 1974), Bayesian Information Criterion, BIC (
Schwarz, 1978) or Likelihood ratio tests (
Casella and Berger, 2001)) shows strong increases for both microarray and RNA-seq data when two-component models are fit by expectation-maximization (compared to single- or more-component models;
Supplementary Figure S8). We designate the two groups of genes as the lowly expressed (LE) and highly expressed (HE) genes (), because we will present evidence below that the LE genes are expressed rather than simply being experimental background. Our findings are not limited to Th2 cells and hold for virtually all recently published metazoan RNA-seq data sets (e.g.,
Marioni et al, 2008;
Mortazavi et al, 2008;
Mudge et al, 2008;
Wang et al, 2008;
Supplementary Figure S9;
Cloonan et al, 2008; and
Supplementary Figure S10A and B) and all microarray data sets (e.g.,
Cui et al, 2009; GNF Atlas 3 (
Chintapalli et al, 2007); FlyAtlas (
Lattin et al, 2008); and
Supplementary Figure S11) we have studied. The existence of further minor groups of genes cannot be excluded, but is not clear at this point due to the diverse curve-fitting results for the different data sets if higher-order (more than two components) models are considered.
The difference between the microarray and RNA-seq distributions is explained by the fact that the microarrays yield a signal for all genes, part of which is due to cross-hybridization of oligonucleotide probes if the gene is not strongly expressed. RNA-seq on the other hand yields a signal for a gene only if at least one sequencing read is found. The accuracy of RNA-seq is biased toward longer and more highly expressed genes, e.g., 5% of all genes account for 50% of all reads in our data as well as in other data sets (
Mortazavi et al, 2008;
Oshlack and Wakefield, 2009;
Bullard et al, 2010).
To explore how this accuracy bias affects the shape of the LE distribution, we studied the RNA-seq detection limit. We first plotted the number of genes with zero reads as a function of the total number of reads (taking subsets of the total reads). The number of genes without reads decreases slowly, with no change in slope and hence no indication of reaching a plateau. Even at a total of 25 million reads, ~30% of all genes are undetected (). We further estimated the numbers of genes remaining undetected at each expression level by assuming Poisson-distributed read numbers (
Jiang and Wong, 2009) and by determining the expected frequency of zeroes. This confirms the sensitivity drop at the lower end of the LE peak (). Extrapolating the numbers of expressed genes including the undetected ones reveals an emerging LE peak (). Thus, the smaller portion of LE genes in the RNA-seq data compared with the microarray data is at least partially due to the RNA-seq detection limit, although this only becomes a problem for genes at less than approximately −3 to −4 log
2 RPKM. It should be noted that these low expression levels correspond to an absence of transcripts in the majority of cells, as we demonstrate further below.
To further confirm that the LE genes correspond to low expression and not experimental noise, we performed real-time PCRs. We tested amplification by exon-spanning primers of a set of genes that are known to be expressed or not expressed in Th2 cells, plus five random genes that we detected between −3.7 and −5 log
2 RPKM in the RNA-seq experiment (
Supplementary Table S1). We were able to successfully PCR-amplify all genes with high specificity. The expressed genes map to the HE peak, while almost all unexpressed genes map to the LE peak, if we align the PCR results with the microarray/RNA-seq data ().
We also tested the extent to which genomic DNA can be detected in our polyA-purified mRNA sample, as proposed by
Ramskold et al (2009) as a means of quantifying experimental background. We randomly selected intergenic fragments with the same length distribution as genes, 10 kb away from genes. The resulting RPKM distribution contains a high number of zero-RPKM fragments (79%), while the majority of non-zero fragments peaks slightly left of the LE shoulder (). The 90% quantile of this intergenic background distribution is at −4.97 log
2 RPKM, which means that we can be quite confident (with probability >90%) that genes with an RPKM value above this level are truly expressed rather than representing experimental background noise (). Further, the overlap between the intergenic and the normalized LE fit is small (
Supplementary Figure S12). We cannot rule out that detection of intergenic DNA corresponds to transcription as well, which would make the case for transcription of LE genes even stronger.
Analysis of the strand-specific mRNA-sequencing data of ES cells by
Cloonan et al (2008) yields similar conclusions. The poly(A)-purification protocol selects for reads that are antisense to genes (the antisense reads correspond to mRNA). In the distribution of ‘sense' reads (corresponding to antisense transcripts in genic regions), more than 50% of genic regions have zero reads. This distribution is unimodal and shifted by ~2 log
2 RPKM with respect to the LE distribution, and overlaps almost perfectly with the distribution of reads in intergenic regions (
Supplementary Figure S10A).
We next determined the distribution of RPKM within introns, again using fragments with the same length distribution as transcripts (please note that our intronic read densities are not enriched at 5′ or 3′ ends of the intronic regions (
Supplementary Figure S13)). The resulting intronic distribution is significantly higher than the intergenic background (two-sided Wilcoxon rank-sum test,
P<2.2 × 10
−16) and peaks at roughly −1 log
2 RPKM (). Introns thus have one- to two orders of magnitude lower read density than exons. This suggests that we are detecting incompletely processed transcripts at a low but significant and uniform level across the whole range of transcript abundances.
As introns are one- to two orders of magnitude longer than exons, introns should be detectable with roughly the same accuracy as exons, if the full-length set of introns of a gene is used. If we plot the RPKM in exonic regions versus the RPKM in intronic regions for each gene, there is significant correlation (r2=0.86, P<2.2 × 10−16) across the whole spectrum of expression levels. Calculating the correlation for LE and HE genes separately yields only slightly lower correlations among LE genes compared with HE genes, and both correlations are highly significant (). This provides evidence that confirms that LE genes are transcribed rather than experimental background: there would not be such a high correlation between introns and exons, particularly in the low-abundance region, if their detection were due to noise.
We next studied gene expression using a single-cell approach by performing single-molecule RNA-fluorescence
in situ hybridization (FISH;
Raj et al, 2008) for five genes that are expressed at different levels according to the literature and our RNA-seq data. The distributions of mRNA numbers per cell were very broad for expressed genes (e.g., Gata3), while low mRNA numbers from ‘not-expressed' genes (e.g., Il2) were still detected (). All genes had Fano factors (σ
2/μ) larger than 4, indicating that they had extra-Poisson variation (a Poisson random variable would have σ
2/μ=1) and therefore burst-like transcription (
Raj and van Oudenaarden, 2009) (
Supplementary Table S3). Importantly, cells expressing Tbx21 were not anticorrelated with cells expressing Gata3 (), meaning that we do not have a sub-population of Th1 cells in our Th2 cell populations. This demonstrates that LE expression is not due to a contaminating cell type, as the same cells express groups of genes at HE and others at LE levels.
It should be further noted that the LE/HE groups cannot theoretically result from a mixture of different cell types. Mixing of different cell types leads to gene expression levels for each gene that are an average across cell types. Hence, such distributions will become more unimodal, not less so (following the central limit theorem).
As the RPKM as measured by RNA-seq should be proportional to the mean mRNA numbers per cell, we can use the RNA-FISH results to estimate how our RPKM values translate into mRNA numbers. We find that one RPKM corresponds to an average of roughly one transcript per cell in our Th2 data set (). Please note that the value of one RPKM/one transcript on average per cell serves as an estimate only as it is based on a limited number of data points. See
Supplementary Figure S14 for log-transformed versions of .
To study the nature of the LE and HE groups in more detail, we prepared Th2 ChIP-seq data for the activating H3K9/14 acetylation histone modification (
Roh et al, 2005;
Wang et al, 2009b) (H3K9/14ac) and one IgG control. We calculated the histone-modification level at each gene by identifying a globally enriched window around the transcription start sites of genes, and using reads in this window as a measure of each gene's modification level, normalized by the total reads (giving the normalized locus-specific chromatin state, NLCS, as used in
Hebenstreit et al (2011)). Thus, we were able to plot histone-modification levels of each gene against expression levels from the RNA-seq or microarray data using a heatmap representation (, RNA-seq and , microarrays).
Supplementary Figure S15 is an alternative version of this figure, where we randomly assigned low RPKM values to the zero-read genes.
This strikingly confirms the two groups of gene expression levels, as there is a very good agreement between LE genes and absence of histone marks on one hand, and HE genes and presence of H3K9/14ac marks on the other hand (). This is seen for both the microarrays as well as the RNA-seq data. This extends previous findings of the relationship between H3K9/14ac and transcriptional activation by revealing an on/off-type of correlation between this histone mark and the LE/HE groups of genes. It should be noted that there is a very weak correlation within the LE and HE groups. The strongest correlation is within the RNA-seq HE group with a correlation coefficient r2=0.29 in log space and r2=0.097 on linear space.
As the LE group of genes is still expressed at low levels and contains at least five genes that are characterized as not expressed and non-functional in Th2 cells, it seems likely that the HE group of genes represents the active and functional transcriptome of cells. This is supported by SILAC proteomics data (
Graumann et al, 2008), which is available for the embryonic stem cell data we presented earlier (
Supplementary Figure S10) and which indicates protein expression of HE genes only (
Supplementary Figure S10C). The tight correlation recently observed between RNA and protein levels in three mammalian cell lines also supports this (
Lundberg et al, 2010).
Gene ontology (GO) analysis of LE and HE genes in the Th2 cells supports the notion that HE comprises the functional transcriptome, as many T-cell-specific processes (e.g., GO:0050863, GO:0045582, GO:0042110) and housekeeping processes are enriched (
Supplementary Table S4). On the other hand, many GO terms referring to differentiation of other cell types (e.g., ear development GO:0043583, neuron fate commitment GO:0048663) are enriched among the LE set of genes (
Supplementary Table S5).
In conclusion, our data show that two large groups of genes can be discriminated based on the distribution of expression levels. RNA-FISH indicates that the boundary between the groups is found at an expression level of roughly one transcript per cell. In addition, H3K9/14ac marks are associated with the promoters of HE genes only (). It thus seems likely that the LE/HE groups reflect different transcription kinetics depending on the chromatin state or vice versa. The LE group is likely to correspond to ‘leaky' expression, producing non-functional transcripts. The majority of LE genes are expressed at less than one copy per cell on average, and it would be interesting to know whether such stochastic expression has any function, e.g., in cell differentiation, or any deleterious effects. There may be a trade-off between the cost of repressing expression entirely and unwanted consequences of stochastic expression.
Regulation of gene expression is mostly mediated by transcription factor binding events at promoters and enhancers (
Heintzman et al, 2009). Often, differential regulation induces only small changes in expression levels, probably serving to fine-tune expression and shifting genes within the HE group. Our data suggest that in addition to this, there is a key decision about whether a gene becomes ‘switched on' and expressed, which coincides with a boost in both transcription and H3K9/14ac histone modification.