Tiling array analysis is widely used in eukaryotes to study transcriptional complexity and to identify non-coding transcripts [49]. Recent studies in Mycobacterium leprae described a whole-genome tiling array approach for sRNA identification [20]. Parallel sequencing technology was used for sRNA identification in Salmonella and Vibrio cholerae. Individual experimental studies [10] together identified 14 sRNAs in two strains of S. pneumoniae (D39 and R6). To our knowledge, this is the first study to report the use of whole-genome tiling arrays for the experimental identification of sRNAs at a global scale in S. pneumoniae. The tiling array analysis method described here combines methods described by others [49], tailored for prokaryotic genomes. The Hfq protein plays a central role in sRNA function in E. coli, facilitating the pairing of an sRNA with its mRNA target [53]. One experimental approach for sRNA identification in bacteria is co-immunoprecipitation of sRNAs using Hfq antibodies [16]. However, the S. pneumoniae TIGR4 genome does not encode Hfq, which precludes applying this method to TIGR4. The tiling array approach described in this study is therefore a pragmatic experimental approach for identifying sRNAs. Identifying the sRNA repertoire of TIGR4 is the first step towards understanding the sRNA regulatory network of this human pathogen.
The transcriptome map generated in this study identified expression in about two thirds of the TIGR4 genome. Tiling array analyses of E. coli and yeast reported expression of 87% and 90% of the genome, respectively [50]. Compared with these studies, the proportion of the TIGR4 genome expressed in this study was lower (68%). Possible reasons are the growth conditions and/or the stringent intensity cutoff used to identify expressed regions. We chose a stringent intensity cutoff (11.0) to maintain a low false positive rate (1.63%) when identifying sRNAs, which are usually short (50-200 bp).
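How a stricter cutoff reduces the expressed fraction can be illustrated with a minimal sketch. It assumes probes are given as (start, end, intensity) tuples and that a region counts as expressed only when at least `min_probes` consecutive probes reach the cutoff. The function names, the `min_probes` parameter, and the toy coordinates are hypothetical and are not the analysis pipeline used in this study.

```python
from typing import List, Tuple

Probe = Tuple[int, int, float]  # (start, end, intensity); end is exclusive


def expressed_regions(probes: List[Probe], cutoff: float = 11.0,
                      min_probes: int = 2) -> List[Tuple[int, int]]:
    """Merge runs of consecutive probes at or above the cutoff into regions."""
    regions: List[Tuple[int, int]] = []
    run: List[Probe] = []
    for probe in sorted(probes):
        if probe[2] >= cutoff:
            run.append(probe)
            continue
        # A probe below the cutoff closes the current run.
        if len(run) >= min_probes:
            regions.append((run[0][0], max(p[1] for p in run)))
        run = []
    if len(run) >= min_probes:
        regions.append((run[0][0], max(p[1] for p in run)))
    return regions


def fraction_expressed(regions: List[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of the genome covered, assuming the regions do not overlap."""
    return sum(end - start for start, end in regions) / genome_length


# Toy data: 25-nt probes tiled every 20 nt on a hypothetical 200-nt sequence.
toy_probes = [(0, 25, 12.0), (20, 45, 11.5), (40, 65, 9.0), (60, 85, 11.2),
              (80, 105, 13.0), (100, 125, 11.0), (120, 145, 8.0)]
toy_regions = expressed_regions(toy_probes, cutoff=11.0)
toy_fraction = fraction_expressed(toy_regions, genome_length=200)
```

Raising `cutoff` or `min_probes` in this sketch shrinks the set of regions, which is the trade-off between false positive rate and expressed fraction discussed above.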
As a result, we report for the first time the genome-wide identification of 50 novel sRNAs in pneumococcus using tiling arrays. Additional features, such as the presence of a promoter and a rho-independent terminator, were computationally predicted for the identified sRNAs. Almost half of the identified sRNAs had a rho-independent terminator. As speculated by others [29], our analysis indicates that a rho-independent terminator sequence is the strongest determinant for identifying an sRNA. Furthermore, identifying a rho-independent terminator downstream of an sRNA sequence helped us to differentiate sRNAs from the 5' untranslated extensions of genes. However, some sRNAs may be associated with a rho-dependent terminator and would therefore not be identified in our search.
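The sequence signature of a rho-independent terminator, a stem-loop followed by a T-rich stretch in the DNA, can be illustrated with a deliberately naive scan. This is a sketch under strong assumptions (a perfect inverted repeat of fixed stem length, a short loop, and a simple count of T residues in the tail); it is not the terminator prediction method used in this study, and the function name and all parameter defaults are hypothetical.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminator_like(seq: str, stem: int = 5, min_loop: int = 3,
                         max_loop: int = 8, tail: int = 6,
                         min_t: int = 4) -> List[Tuple[int, int]]:
    """Return (stem_start, tail_end) for each hairpin-plus-T-tract candidate.

    A candidate is a perfect inverted repeat of length `stem`, separated by
    a loop of min_loop..max_loop bases, and followed immediately by `tail`
    bases of which at least `min_t` are T.
    """
    seq = seq.upper()
    hits: List[Tuple[int, int]] = []
    for i in range(len(seq) - 2 * stem - min_loop - tail + 1):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            r_start = i + stem + loop
            r_end = r_start + stem
            t_end = r_end + tail
            if t_end > len(seq):
                break
            if seq[r_start:r_end] != revcomp(left):
                continue
            if seq[r_end:t_end].count("T") >= min_t:
                hits.append((i, t_end))
                break  # report at most one hit per stem start
    return hits


# Toy sequence: stem GCGGC, loop GAAA, stem GCCGC, then a T-tract.
toy = "ACAAC" + "GCGGC" + "GAAA" + "GCCGC" + "TTTTTT" + "AC"
toy_hits = find_terminator_like(toy)
```

Dedicated terminator predictors also score stem stability and tolerate mismatches and bulges, which this sketch does not attempt.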
Comparative genomics of sRNA sequences revealed that only six sRNA sequences, involved in various regulatory activities, were conserved beyond Streptococcaceae (for example, in Lactobacillus, Clostridium, and Bacillus) (Table ). The evolutionary tree of the Streptococcus family [30] indicates that S. mitis, S. gordonii, and S. sanguinis SK36 are phylogenetically closer to S. pneumoniae than other species (such as S. pyogenes, S. mutans, or S. bovis), which explains why 25 sRNAs are conserved in S. mitis, S. gordonii, and S. sanguinis SK36 but are not present in the more distant species. It also indicates that sRNA prediction algorithms that rely on comparative genomics need to first account for the observed low sequence conservation of sRNAs among different species [13]. Our results suggest that such computational methods need to focus on carefully selected, closely related species. The 50 sRNAs identified in this study, along with their comparative genomics, could serve as a training dataset for further computational sRNA predictions in pneumococcus, particularly for identifying sRNAs that are not expressed under our experimental conditions. Finally, we speculate that computational prediction of Streptococcus sRNAs using comparative genomics with closely related species such as S. mitis and S. sanguinis SK36 will identify new, as yet undescribed sRNAs.
Exploring the sequence characteristics of the sRNAs described in this study showed that sRNAs predicted to have similar biological functions share common sequence motifs. We identified eight sequence motifs, five of which were identified in TIGR4 for the first time. Members of a motif group without a predicted function could have similar structural or functional properties. For example, SN20 contains motif M3 and might function as an FMN switch similar to SN4 and SN10, which also contain this motif. Likewise, the sRNAs in motif group M4 could be predicted to have a similar, yet undefined, function. Structural analysis of the motifs (Additional file 4) suggests that they mainly form two kinds of structure in sRNAs: in the first, the whole motif forms a stem-loop structure (as in motif M5); in the second, the motif spans two stem-loop structures, including the unpaired region between them (as in motif M6). Furthermore, motifs present in sRNAs with similar function also formed a conserved secondary structure (for example, motifs M1, M2, and M5). We speculate that the sRNAs within each of the groups (SN32 and SN38), (SN16 and SN29), (SN21, SN24 and SN33), and (SN14 and SN37) contain a similar motif structure and might share a similar, yet unknown, structure or function. This structural conservation also suggests that the motif regions of sRNAs could be structurally or functionally important and could serve as targets for mutational studies to decipher function.
The accuracy of computational operon prediction in bacteria is 85-91% in terms of specificity and sensitivity for predicting operonic gene pairs (pairs of consecutive genes that are part of the same operon) in E. coli and B. subtilis, respectively [56]. However, the sensitivity of prediction drops to as low as 50% when predicting transcription units with more than one gene [56]. Two examples were discussed in the Results in which computational prediction failed to identify a gene pair as part of an operon because of a large intergenic region between the two genes. The accuracy of computational operon prediction algorithms also decreases for newly sequenced genomes for which no training dataset is available. Based on tiling array analysis, we identified 520 co-expressed gene pairs and 202 transcription units in S. pneumoniae TIGR4. Our results demonstrate the effective use of tiling arrays for operon identification at a whole-genome scale. An obvious limitation of the tiling array approach is its inability to identify operons whose genes are not expressed under the experimental growth condition. Nevertheless, our results demonstrate that combining operons identified by tiling arrays with computational prediction improves operon identification in genomes, as speculated by other researchers [57]. The operons identified in this study, though not comprehensive, represent a validated dataset of 202 operons.
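The step from co-expressed gene pairs to transcription units can be sketched as chaining adjacent pairs along the genome. The sketch below assumes a list of genes in chromosomal order and a set of adjacent gene pairs judged to be co-expressed; the function name and the gene identifiers are hypothetical, and strand and intergenic-signal checks are deliberately omitted.

```python
from typing import Iterable, List, Tuple


def transcription_units(gene_order: List[str],
                        coexpressed_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Chain adjacent co-expressed gene pairs into multi-gene units."""
    if not gene_order:
        return []
    pairs = {frozenset(pair) for pair in coexpressed_pairs}
    units: List[List[str]] = []
    current = [gene_order[0]]
    for prev, gene in zip(gene_order, gene_order[1:]):
        if frozenset((prev, gene)) in pairs:
            current.append(gene)  # extend the unit across this gene pair
        else:
            if len(current) > 1:
                units.append(current)
            current = [gene]      # start a new candidate unit
    if len(current) > 1:
        units.append(current)
    return units


# Hypothetical gene identifiers in chromosomal order.
toy_order = ["g1", "g2", "g3", "g4", "g5", "g6"]
toy_pairs = [("g1", "g2"), ("g2", "g3"), ("g5", "g6")]
toy_units = transcription_units(toy_order, toy_pairs)
```

Under the same assumptions, experimentally supported and computationally predicted pairs can be pooled before chaining by passing the union of the two pair lists as `coexpressed_pairs`.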
Around 8% of the S. pneumoniae TIGR4 genome is repetitive. This fraction includes sequences (> 50 bp) that are present at multiple locations in the genome, such as mobile genetic elements, small dispersed repeats like RUP and BOX elements, and other repetitive regions. Although these regions were excluded from sRNA identification, we detected a high level of transcription in them from both the sense and antisense strands. Because tiling arrays cannot identify the actual origin of transcription for sequences present in multiple copies, future experiments designed to analyze transcriptional activity in these repeat regions are warranted. In view of recent findings that sRNAs are involved in repressing the expression of toxic proteins [58] and are present in multiple copies, we speculate that these repetitive regions may be involved in various regulatory activities within the cell.
In conclusion, our combined approach of experimentally identifying sRNAs on a genome scale using tiling arrays, in conjunction with computational analyses of these sRNAs, has resulted in the description of 50 sRNAs in S. pneumoniae TIGR4, a clinically relevant strain. Our results form an initial framework for understanding sRNA-based regulation of S. pneumoniae gene expression.