To systematically investigate small RNAs possibly associated with TSSs in
M. pneumoniae and
E. coli, we specifically isolated non-fragmented RNAs ranging in size from 15 to 65 bases and subjected them to direct strand-specific sequencing (DSSS) (
Vivancos et al, 2010).
The sequencing reads of the small (15–65 bases) RNAs from both
M. pneumoniae (
Supplementary Figure S2) and
E. coli (
Supplementary Figure S1) showed a non-uniform distribution along the chromosome. The reads often appeared as well-defined, single peaks with an abrupt rise and a flat plateau (
Supplementary Figure S2A). We also observed more complex patterns, such as two or more overlapping distributions, indicative of multiple promoters (
Supplementary Figure S2B).
To quantify these small RNAs automatically and consistently, we developed a computational method that identifies narrow peaks with a flat plateau significantly above the background (
Supplementary information). Automatic analysis of
M. pneumoniae small transcripts obtained from stationary phase cells (
Yus et al, 2009) allowed 1339±116 small RNAs to be identified (using data from three independent biological replicates;
Supplementary Tables S3 and S4), of which 457±28 (~34%) were located <10 bases away from a manually annotated TSS (annotated in this study based on the published data in
Guell et al, 2009) (
Supplementary Table S1). In total, ~73% of the
M. pneumoniae TSSs have an automatically assigned small RNA. Visual inspection of the remaining TSSs revealed that all of them have an associated RNA that was not identified by the algorithm because of a complicated shape or low height (
Supplementary Figure S4B). Using the same method, we identified 1371 small RNAs on the plus strand from the
E. coli data set. Using the Regulon database (
Gama-Castro et al, 2008) to extract a high-confidence group of sigma 70-dependent TSSs on the plus strand (
Supplementary Table S2), we reproducibly found that a somewhat smaller (but still large) proportion of active TSSs have associated small RNAs in the stationary phase (44±5%). In both species, the number of small RNAs decreased in the exponential phase (from 1239 to 818 for
M. pneumoniae, and from 1371 to 684 for
E. coli). The small RNA start positions overlapped significantly with the experimentally determined TSSs of the cognate full-length transcripts (
P=0.00015;
Supplementary Figure S4B), with differences between the small RNA starting positions and the annotated TSSs of −0.5±10 bases in
M. pneumoniae, and of −3.5±12 bases in
E. coli. At each TSS, we observed a dominant species of small RNAs, with some minor ones that started at the same point but had slightly different lengths (
Supplementary Figure S5). These results suggest that these newly identified small RNAs are transcribed from the promoters of the corresponding cognate full-length transcripts. Hence, we named these ‘transcription start site-associated’ RNAs (tssRNAs). We propose that tssRNAs could be used as markers for promoters in uncharacterized genomes. They could also help to identify ambiguous TSSs, for example in regions where transcripts overlap or when no clear boundaries (e.g., start and end) can be observed at the RNA level.
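The comparison between small RNA start positions and annotated TSSs described above can be illustrated with a minimal sketch. It assumes that both sets are available as integer coordinates on a single strand. The function name `match_to_tss`, the `max_dist` argument, and the toy coordinates are hypothetical; this is not the pipeline used in this study, which is described in the Supplementary information.

```python
from bisect import bisect_left
from statistics import mean


def match_to_tss(rna_starts, tss_positions, max_dist=10):
    """Pair each small RNA start with its nearest annotated TSS.

    Returns (rna_start, tss, offset) tuples for pairs whose absolute
    offset (rna_start - tss) is below max_dist. All coordinates are
    assumed to lie on the same strand.
    """
    tss_sorted = sorted(tss_positions)
    matches = []
    for start in rna_starts:
        i = bisect_left(tss_sorted, start)
        # Candidates: the TSS just before and the TSS at/after the start.
        candidates = tss_sorted[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda t: abs(start - t))
        offset = start - nearest
        if abs(offset) < max_dist:
            matches.append((start, nearest, offset))
    return matches


# Toy coordinates (hypothetical, not data from the study).
rna_starts = [100, 250, 400, 995]
tss_positions = [102, 260, 1000, 5000]

matches = match_to_tss(rna_starts, tss_positions)
fraction = len(matches) / len(rna_starts)          # share of RNAs near a TSS
mean_offset = mean(offset for _, _, offset in matches)
```

With these toy values, two of the four starts fall within 10 bases of a TSS. The same two quantities (the matched fraction and the mean offset) correspond to the kind of summary reported above for the real data.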
We next applied a number of independent approaches to confirm the existence of tssRNAs and to rule out possible technical artifacts. We found that (i) tssRNAs were also observed when using RNA-seq protocols that do not require fractionation of RNAs by size (using TruSeq, from Illumina); (ii) tssRNAs were unequivocally detected on tiling arrays hybridized with total cDNA (see
Supplementary information) and, similarly, deep sequencing of non-size-fractionated RNAs showed a clear peak at the TSSs of genes with low expression (allowing co-detection of the cognate tssRNAs) (
Supplementary Figure S6B); (iii) direct hybridization of fluorescently labeled small RNAs (<65 bases) onto tiling arrays further substantiated the presence of tssRNAs (
Supplementary Figure S6A); and (iv) the existence of tssRNAs was directly confirmed by Northern blot analyses (
Supplementary Figure S12, see below).
Small RNAs associated with TSSs in eukaryotes (tiRNAs) are likely to be the result of endonucleolytic cleavage of the nascent RNA (
Taft et al, 2009). To see if this is also the case in bacteria, we exposed the purified small RNAs to terminator 5′-phosphate-dependent exonuclease prior to cDNA synthesis (
Supplementary Figure S3). Since the 5′ ends of bacterial transcripts bear a triphosphate, while RNA degradation products have a single phosphate, this treatment should remove all RNA degradation products (
Sharma et al, 2010). tssRNAs remained largely unaffected by this treatment, while the level of other (background) small RNAs was reduced (
Supplementary Figure S3), consistent with the view that tssRNAs are primary transcripts and not the result of endonucleolytic cleavage. However, this experiment cannot exclude other possible mechanisms that would produce such a 5′ end, such as specific endonucleolytic cleavage near the 5′ end, or 3′-to-5′ RNase activity with some degree of protection of the first 40–50 bases. The fact that tssRNAs are observed as isolated entities, without a corresponding long transcript (
Supplementary Figure S11), would exclude their generation by degradation, since the RNAs we observed could not have arisen from a longer RNA. Even if secondary structure of the 5′ untranslated region (UTR) could explain their resistance to degradation in some cases, it is very unlikely that this could apply to all the detected tssRNAs. In fact, we did not observe any particular enrichment in three-dimensional structures at the 5′ end of
M. pneumoniae genes (
Supplementary Figure S14). In sum, this supports the idea that they are primary transcripts resulting from transcription and not the result of degradation.
To further determine the length of tssRNAs, we analyzed several experimental results from
M. pneumoniae: (i) standard electrophoresis of total RNA gave an apparent size of 35–40 bases (
Supplementary Figure S3); (ii) Northern blots of tssRNA promoters driving YFP (see below) gave sizes within a similar range (
Supplementary Figure S12); (iii) tiling arrays showed a size of around 44 bases; (iv) deep sequencing of RNA fractionated in various size ranges (<65, 65–100, 100–150, 150–200, and >200 bases) showed a consistent enrichment of tssRNAs in the size range 15–65 bases (
Supplementary Figure S6E); and (v) using the TruSeq kit from Illumina, an improved DSSS method that does not involve any size selection step (see
Supplementary information), we verified that the tssRNAs were in the size range of 35–55 bases. All methods give a congruent view of tssRNAs ranging in length from 35 to 55 bases. In
E. coli, we observed a similar pattern, but with slightly smaller sizes for the tssRNAs (33–40 bases;
Supplementary Figure S8). Thus, the bacterial tssRNAs are clearly larger in size than the described abortive transcripts found
in vitro and
in vivo (which range from 6 to 17 bases; see
Goldman et al, 2009). Moreover,
in-vitro transcription (IVT) performed with
M. pneumoniae genomic DNA resulted in a pattern similar to that observed
in vivo (
Supplementary Figure S7B) but did not reveal the presence of any tssRNA (
Supplementary Figure S7A). In sum, these results indicate that tssRNAs are distinct from abortive transcripts, and that tssRNA synthesis requires the endogenous RNA polymerase machinery.
The majority of TSSs in
M. pneumoniae have an associated tssRNA. However, we also found a large number of tssRNAs at other genomic positions that are not associated with the start of a full-length transcript. One explanation could be that these are synthesized from ‘cryptic’ promoters, that is, promoter-like sequences that appear randomly in the genome. We thus scored the quality of putative RNA polymerase recognition sites, the −10 regions (Pribnow boxes), which are the most conserved regions in
M. pneumoniae promoters (
Guell et al, 2009) along the chromosome (‘Pribnow_score’;
Supplementary Methods). The score was based on an analysis of the sequences upstream of the manually annotated TSSs (as determined by transcriptome analysis;
Guell et al, 2009) (
Supplementary Table S1), or by 5′ sequencing (
Weiner et al, 2000) (see
Supplementary Methods). Our analysis showed that all tssRNA upstream regions have a better Pribnow score than the average value of a random sequence (taking the whole
M. pneumoniae genome into account), meaning that they have promoter-like features. Analyzing the 25 bases upstream of the tssRNA start sites for the best-scored Pribnow boxes showed that they are located at the right distance of around 10 bases upstream (according to the previously determined distance of 9±3 bases; see
Shultzaberger et al, 2007). Moreover, these regions have classic Pribnow sequences (of TANAAT, where N can be any base;
Supplementary Figure S9), indicating they are true cryptic promoters. Consistent with this, we found RNA polymerase to be bound to them (see below).
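The idea of scoring −10 regions and locating the best-scoring box in the 25 bases upstream of a start site can be sketched with a simple log-odds position weight matrix. This is only an illustration under stated assumptions: the actual ‘Pribnow_score’ is defined in the Supplementary Methods and may differ. The functions `build_pwm` and `best_box`, the pseudocount, the background frequencies, the training hexamers, and the definition of distance (bases between the end of the box and the end of the upstream window) are all hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(boxes, background, pseudocount=1.0):
    """Log-odds position weight matrix from aligned, equal-length boxes."""
    length = len(boxes[0])
    total = len(boxes) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(box[pos] for box in boxes)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm


def best_box(upstream, pwm):
    """Scan an upstream sequence and return (score, distance, window).

    distance is the number of bases between the end of the best-scoring
    window and the end of `upstream` (taken here as the start site).
    Only A, C, G and T are handled in this sketch.
    """
    width = len(pwm)
    best = None
    for i in range(len(upstream) - width + 1):
        window = upstream[i:i + width]
        score = sum(pwm[j][base] for j, base in enumerate(window))
        distance = len(upstream) - (i + width)
        if best is None or score > best[0]:
            best = (score, distance, window)
    return best


# Hypothetical training hexamers matching the TANAAT consensus.
training_boxes = ["TATAAT", "TAAAAT", "TAGAAT", "TACAAT", "TATAAT"]
# Illustrative AT-rich background frequencies (an assumption).
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

pwm = build_pwm(training_boxes, background)
upstream_25 = "GCGCGCGCGTATAATGCGCGCGCGC"  # toy 25-base upstream region
score, distance, hexamer = best_box(upstream_25, pwm)
```

In this toy case the best-scoring window is the hexamer TATAAT, ending 10 bases before the end of the upstream region. Applying the same scan to randomly chosen genomic windows would give a background distribution against which the scores of tssRNA upstream regions could be compared.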
Table 2. Pribnow box analysis of tssRNA upstream regions
Of the tssRNAs that did not map to the TSS of a long transcript, 34% are close to a translation start codon, about 21% are intragenic, and 48% are intergenic (in stationary phase). We analyzed the upstream promoter-like sequences to determine whether intragenic tssRNAs can be considered to be background and can thus be used to distinguish the true positives. However, all three sets had a good Pribnow score (although worse than that of the TSS-associated tssRNAs) at the right distance to the TSS (
Supplementary Figure S9), and thus all could represent true TSSs. We additionally observed a positive correlation between the Pribnow score and the expression level of the tssRNA (
Supplementary Figure S10). These results suggest that tssRNAs found at intergenic and intragenic regions reflect true TSSs from promoters that are likely to originate from random sequences, or from promoters that are activated under specific conditions. Considering the base composition of
M. pneumoniae, we estimated that 1562 Pribnow boxes (TANAAT) would be expected in the genome by chance, whereas the actual number (1131 TANAAT sequences in the genome) is around 30% lower.
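One simple way to compare expected and observed numbers of TANAAT boxes is sketched below. It assumes independent bases drawn at the observed single-base frequencies and counts a single strand only; the exact model behind the estimate above is not specified here, so this sketch is not expected to reproduce the reported value. The function names and the toy sequence are hypothetical. The last lines only restate the arithmetic for the two counts given in the text.

```python
import re
from collections import Counter


def expected_motif_count(seq, motif="TANAAT"):
    """Expected occurrences of `motif` on one strand of `seq`, assuming
    independent bases at the observed frequencies (N matches any base)."""
    counts = Counter(seq)
    n = len(seq)
    freq = {b: counts[b] / n for b in "ACGT"}
    p = 1.0
    for ch in motif:
        p *= 1.0 if ch == "N" else freq[ch]
    # Number of possible start positions times per-position probability.
    return (n - len(motif) + 1) * p


def observed_motif_count(seq, motif="TANAAT"):
    """Count (possibly overlapping) matches of `motif` on one strand."""
    pattern = "(?=" + motif.replace("N", "[ACGT]") + ")"
    return sum(1 for _ in re.finditer(pattern, seq))


# Toy sequence (hypothetical, not genomic data).
toy_seq = "TATAATGCGCTAGAATCC"
expected = expected_motif_count(toy_seq)
observed = observed_motif_count(toy_seq)

# Arithmetic for the counts quoted in the text: 1131 observed, 1562 expected.
observed_over_expected = 1131 / 1562
deficit = 1 - observed_over_expected  # fraction by which observed is lower
```

For the counts quoted in the text, the observed number is roughly 28% below the expected one, which is the same gap expressed relative to the expectation. A fuller treatment would also count the reverse strand and could use a higher-order background model.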
To demonstrate that tssRNAs not associated with long transcripts are the result of spurious transcription, we made three
M. pneumoniae tssRNA promoter constructs that had a good Pribnow score and supported a high level of expression (
Supplementary Figure S12A–C) but that did not produce a corresponding full-length transcript (
Supplementary Table S5). These promoters were fused to the yellow fluorescent protein (YFP-Venus) gene. We did not detect any YFP expression from any of the constructs (as shown for two cases), even when they were trimmed to a minimum length (i.e., they were ‘leaderless’, which improves the signal for a synthetic promoter) or when a ribosomal recognition sequence (Shine-Dalgarno) was added. Adding the first 20 bases of the tssRNA did not influence the expression levels from either of the two tssRNA promoters. We confirmed by Northern blot that these promoters did not yield full-length transcripts but rather only tssRNAs (
Supplementary Figure S12D). On the other hand, promoters that produce mRNAs and associated tssRNAs, or even rRNAs, produced detectable Venus expression and long transcripts, as well as tssRNAs (
Supplementary Figure S12). These results suggest that although a good promoter will support RNA polymerase recruitment and tssRNA production, DNA features other than the Pribnow box are needed to produce full-length RNAs from productive transcription.
So far, we have confirmed the existence of native tssRNAs that are associated with full-length transcripts in both
E. coli and
M. pneumoniae. However, it is still unclear whether these are co-regulated by the same promoter sequences and thus expressed to the same extent, or whether the tssRNAs could be independently expressed. In
M. pneumoniae, tssRNA expression levels correlate weakly with those of the corresponding mRNAs (
R=0.54; see
Supplementary Figure S13). However, when comparing the expression levels of tssRNAs with those of the cognate full-length transcripts in
M. pneumoniae in both exponential and stationary phases, we observed a marked relative increase in tssRNA expression only in the stationary phase (
P=3.52 × 10⁻³², two-sample
t-test), when transcription is known to be repressed (
Supplementary Figure S15 and
Table S6). This could indicate that tssRNA production is driven independently from its associated full-length RNA, and/or that it depends on other protein factors that determine transcription. To test for these possibilities, we first performed chromatin immunoprecipitation analyses of the two RNA polymerase subunits, α and β, in
M. pneumoniae (MPN191|RpoA and MPN515|RpoB, respectively), followed by DNA ultrasequencing (ChIP-seq) and DNase protection assays. These experiments revealed that the RNA polymerase indeed binds at both the productive (i.e., associated with long transcripts) and unproductive (isolated) tssRNAs (
Supplementary Figure S16). The RNA polymerase was found to be located both at the promoter region (−10), a position at which it is known to stall and produce abortive transcripts prior to initiation of transcription elongation (
Goldman et al, 2009), and at some nucleotides downstream of the TSS, where it could produce tssRNAs (around the +30 position).
Positioning at downstream regions is more prominent in unproductive, isolated tssRNAs (
Supplementary Figure S17), despite the fact that the overall affinity of the RNA polymerase is generally lower at these promoters, which on average have slightly worse Pribnow scores. This would indicate that RNA polymerase pausing is more likely to occur in non-productive promoters. Altogether, these results suggest that, once elongation has started, RNA polymerase pausing could be a mechanism to avoid spurious transcription at any place where a Pribnow box sequence is present. tssRNAs could thus represent a transcriptional checkpoint to ensure that the RNA polymerase machinery is correctly assembled (e.g., that the sigma factor is lost and the correct elongation factors are recruited) (
Roberts et al, 2008;
Yang and Lewis, 2010;
Burmann and Rosch, 2011). This would further guarantee that there is no unnecessary transcription, avoiding the energy expense and preventing unwanted products, such as truncated proteins or transcripts that are antisense to essential genes (
Supplementary Figure S18).