Small RNA (sRNA) molecules have gained much interest lately, as recent genome-wide studies have shown that they are widespread in a variety of organisms. The relatively small family of 10 known sRNA-encoding genes in Escherichia coli has been significantly expanded during the past two years with the discovery of 45 novel genes. Most of these genes are still uncharacterized and their cellular roles are unknown. In this survey we examined the sequence and genomic features of the 55 currently known sRNA-encoding genes in E.coli, attempting to identify their common characteristics. Such characterization is important for both expanding our understanding of this unique gene family and for improving the methods to predict and identify sRNA-encoding genes based on genomic information.
Traditionally, most RNA molecules were thought to function as the mediators that carry the information from the gene to the translational machinery. Exceptions to these were the transfer RNAs and ribosomal RNAs that had long been known to carry out functions of their own, associated also with translation. However, it is now widely acknowledged that other types of untranslated RNA molecules exist that are involved in a diverse range of functions, from structural through regulatory to catalytic. Untranslated RNA molecules are present in many different organisms, ranging from bacteria to mammals, affecting a large variety of processes including plasmid replication, phage development, bacterial virulence, developmental control and more (1–7). These unique RNA molecules have recently drawn much attention, as many novel RNA-encoding genes were discovered during the past two years (for review see 8–11).
Small RNA (sRNA)-encoding genes have been hard to detect both experimentally and computationally. Their small size complicates their detection by biochemical and genetic methods. Furthermore, sRNA genes are not translated into proteins and are thus immune to frameshift and nonsense mutations. For the same reason these genes are impossible to predict by the conventional computer algorithms that rely on the presence of protein coding sequences. Indeed, until the year 2000 and for the last three decades only 10 sRNA-encoding genes were known in Escherichia coli (6), most of which were discovered fortuitously.
During the last two years six studies were published that described systematic searches for novel sRNA genes in E.coli (12–17). One study provided 562 predictions of novel non-translated RNA genes using a machine learning approach (15). A second study used a whole genome array approach to examine the transcription of E.coli coding regions as well as intergenic regions (17), and identified 317 novel transcripts with unknown functions. Out of these, nine transcripts were suggested by the authors to be good candidates for sRNA molecules, based on their size, expression pattern or conservation. The remaining four studies used a combination of computational and experimental approaches (12–14,16). The study by Argaman et al. (12) relied on transcription signals and sequence conservation in empty intergenic regions of the genome, and resulted in the prediction of 24 candidate sRNA-encoding genes. Experimental testing of these candidates showed that 14 of them were expressed, and their 5′ and 3′ ends were determined (12). Two of these sRNA genes, gcvB and rprA, were independently discovered in two separate studies (18,19). The study by Wassarman et al. (13) relied on sequence conservation and DNA array data, and led to the prediction of 60 candidate sRNA genes. These candidates were examined by northern blots and 17 of them were found to be sRNA genes (13). Rivas et al. (14) relied on the premise that functional RNAs should have a conserved structure in the different organisms in which they exist. Based on an algorithm that detects such structural conservation, 275 candidates were predicted. Experimental testing of 49 candidates detected the expression of 11 genes (14). The algorithm by Chen et al. relied on transcription signals alone, and resulted in the prediction of 227 candidate sRNA-encoding genes (16). Eight of the predictions were tested experimentally, out of which seven were found to be expressed. 
Since several genes were identified in more than one study, all in all 45 novel sRNA genes were discovered. Thus, there are currently 55 known sRNA-encoding genes in the E.coli genome (Table 1). These probably comprise only a portion of the population of sRNA-encoding genes in E.coli, as many predicted genes still await detailed experimental testing. In the current survey we focus on the genomic characterization of the sRNAs encoded in the genome of E.coli.
All known E.coli sRNA-encoding genes are located in ‘empty’ intergenic regions, where no other gene resides on either of the two strands (based on the Colibri and Ecogene database annotations of E.coli K12 MG1655 genes [http://genolist.pasteur.fr/Colibri, (20)]). In order to find out whether these genes are located in certain genomic regions, their distribution along the genome with regard to the origin of replication was examined (Fig. 1). The DNA replication of E.coli starts from a single origin and is bidirectional, creating two replicores: the left replicore and the right one (21). While 51% of the open reading frames (ORFs) are located on the left replicore and 49% on the right replicore, the sRNA genes show different proportions, as 64% of them (35/55) are located on the left replicore (p < 0.05 by a binomial test). There are no significant differences in the number of ‘empty’ intergenic regions and in their length between the two replicores. Therefore, there is no obvious reason for the difference in the number of sRNA-encoding genes between the two replicores. We also examined the distribution of sRNA genes between the lagging and leading strands and found that they are distributed about equally.
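The replicore comparison above can be reproduced in a few lines. The following is a minimal sketch, not the authors' procedure: it assumes that the null proportion is the 51% share of ORFs on the left replicore and that the test is one-sided (upper tail), neither of which is stated explicitly in the text. The function name is illustrative.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_srna = 55    # known sRNA-encoding genes (from the text)
n_left = 35    # sRNA genes on the left replicore (from the text)
p_null = 0.51  # assumption: ORF share on the left replicore used as the null

p_value = binomial_upper_tail(n_left, n_srna, p_null)
print(f"{n_left}/{n_srna} = {n_left / n_srna:.0%} of sRNA genes on the left replicore")
print(f"one-sided binomial p-value: {p_value:.4f}")
```

Under these assumptions the upper-tail probability falls below 0.05, in line with the reported significance level; a two-sided test would give a larger value.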
In general, the genome of E.coli is quite compact. There are only 33 ‘empty’ intergenic regions that are >900 nt, many of which contain repetitive sequences (21). Most intergenic regions in the E.coli genome are relatively short: 1155 are <50 nt, 1889 range between 50 and 300 nt, and 478 are 300–900 nt long. No sRNA gene resides in an intergenic region that is <50 nt. Of the sRNAs, 71% (39 genes) reside in larger intergenic regions of lengths between 300 and 900 nt, which make up only 20% of the total intergenic regions that are >50 nt and 39% of the nucleotides of these regions. Twenty-seven percent of the sRNA genes reside in relatively short intergenic regions (50–300 nt), which make up 79% of intergenic regions >50 nt and 52% of the total nucleotides of these regions. Only one sRNA gene (rnpB) resides in an empty intergenic region >900 nt. Analysis of the size distribution of the sRNAs themselves shows that most of them (47 genes) are between 50 and 250 nt long (Fig. 2). A weak correlation was found between the size of an sRNA and the size of the intergenic region in which it resides (r = 0.26, p ≤ 0.05). In addition, we examined possible clustering of the sRNA genes in the intergenic regions. It seems that the sRNA-encoding genes do not tend to be clustered, as in only three cases are two genes located in a single intergenic region (ryeC and ryeD, sraE and rygB, sraC and ryeB), and in only one case are three genes located in the same intergenic region (ryfA, IS128 and C0614). In each of the pairs of genes, ryeC and ryeD, sraE and rygB, and ryfA and IS128, there is some sequence similarity and they are transcribed in the same direction. C0614 and sraC are encoded on the strand opposite to ryfA/IS128 and ryeB, respectively.
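The proportions quoted for the intergenic size classes follow directly from the counts given above. The sketch below recomputes them and adds an enrichment ratio (sRNA share divided by region share), which is an illustrative quantity not reported in the text. The count of 15 sRNA genes in the 50–300 nt class is derived here as 55 − 39 − 1 and is consistent with the reported 27%.

```python
# Empty intergenic regions longer than 50 nt, by length class (counts from the text).
region_counts = {"50-300": 1889, "300-900": 478, ">900": 33}

# sRNA genes per class: 39 and 1 are stated in the text; 15 is derived as 55 - 39 - 1.
srna_counts = {"50-300": 15, "300-900": 39, ">900": 1}


def shares(counts: dict) -> dict:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


region_share = shares(region_counts)
srna_share = shares(srna_counts)

for size_class in region_counts:
    # Enrichment > 1 means sRNA genes are over-represented in that size class.
    enrichment = srna_share[size_class] / region_share[size_class]
    print(
        f"{size_class:>8} nt: {region_share[size_class]:.0%} of regions, "
        f"{srna_share[size_class]:.0%} of sRNA genes, enrichment {enrichment:.2f}"
    )
```

The comparison is by number of regions only; normalizing by total nucleotides per class (39% and 52% in the text) would require the individual region lengths, which are not given here.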
In hyperthermophiles it was shown that sRNAs differ in their base composition from the rest of the genome (22,23). In fact, novel sRNA-encoding genes were identified in the AT-rich genomes of hyperthermophiles such as Methanococcus jannaschii and Pyrococcus furiosus by searching for GC-rich regions (22,23). We examined whether the sRNAs in E.coli also have a distinct base composition. For comparison, we examined the base composition of the ORFs, tRNA genes, rRNA genes and the intergenic sequences. Base composition was analyzed for sRNAs for which either the 5′ or the 3′ end was experimentally determined or the 3′ end was predicted (a total of 44 sRNAs). As a group, the sRNA molecules seem to be only slightly richer in guanines and cytosines (48.2%) in comparison with intergenic regions where sRNA-encoding genes have not been identified (42.4%) (Table 2). This analysis also revealed that the sRNAs’ GC content is lower than that of the tRNAs (59%) and rRNAs (54%).
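For readers who wish to repeat the base composition comparison on their own sequence sets, GC content can be computed as sketched below. This is a minimal illustration; the handling of ambiguous bases (ignored here) is an assumption, and the example sequences are invented toy strings rather than real sRNA sequences.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T, U) of seq."""
    bases = [base for base in seq.upper() if base in "ACGTU"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(base in "GC" for base in bases)
    return 100.0 * gc / len(bases)


# Toy sequences for illustration only (not taken from the E.coli genome).
examples = {"toy_gc_rich": "GGCGCCGGAT", "toy_at_rich": "ATTATAAGCT"}
for name, seq in examples.items():
    print(f"{name}: {gc_content(seq):.1f}% GC")
```

To compare groups as in Table 2, the same function would be applied to the pooled sequences of each class (sRNAs, tRNAs, rRNAs, ORFs and intergenic regions).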
The sRNA-encoding genes can be divided into two subgroups based on their functions: the regulatory sRNAs that act as regulators of gene expression, and the housekeeping sRNAs that affect different aspects of cellular metabolism. Analysis of the GC content of 12 sRNAs with known function showed that the base composition of the housekeeping sRNAs (ssrA 53%, rnpB 62%, ffs 62% and ssrS 55%) is closer to the composition of the tRNAs and rRNAs, while the regulatory sRNAs show a lower GC content (Fig. 3 and Table 2). dicF, which is not a housekeeping sRNA, is also GC rich (55%); however, dicF was not included since it is known to originate from a phage, and is therefore expected to have a composition different from that of the E.coli genome. It is conceivable that, in order to meet functional requirements that are based on specific structures, tRNAs and perhaps also rRNAs have a high GC content, which is associated with a more rigid structure. It is intriguing, however, that the housekeeping genes, as opposed to the regulatory genes, also have a high GC content, which could indicate the need for a rigid rather than a more flexible structure. Unlike the housekeeping sRNA genes, the regulatory sRNAs such as oxyS or dsrA often regulate the expression of a number of different genes. Thus, it is possible that they might need a more flexible structure. It would be interesting to find out whether novel sRNAs with a relatively high GC content such as sraH or rydB would be classified as housekeeping.
The sequences of nine of the previously known sRNAs were shown to be conserved in closely related bacteria. Based on this and on the premise that functional sequences are expected to be conserved in related organisms, three of the genome-wide studies used sequence conservation as a criterion for prediction (12–14). However, the three studies used different stringency for conservation analysis: Rivas et al. (14) allowed a lower level of sequence conservation, and relied on structural conservation, while Argaman et al. (12) and Wassarman et al. (13) requested a higher level of sequence conservation. We therefore attempted to conduct a consistent conservation analysis for all 55 sRNA-encoding genes.
For the conservation analysis, the sequences of the 55 sRNAs in E.coli were used as queries and compared by BLAST (24,25) to the sequences of 102 complete bacterial genomes, downloaded from the NCBI ftp server (ftp.ncbi.nlm.nih.gov) and listed in the Supplementary Material. An sRNA-encoding gene was considered as conserved in another organism if the alignment had an E-value lower than 0.001 (Table 3); however, in the Supplementary Material we list alignments up to an E-value of 1.0. Since the majority of the sRNAs did not show any significant sequence similarities beyond Yersinia pestis, only these comparisons are summarized in Table 3. Out of the 55 sRNAs examined, two genes were found uniquely in E.coli K12. Fifty sRNA sequences were conserved in the two E.coli O157:H7 genomes and in E.coli CFT073. All but three of the sRNAs were conserved in Shigella flexneri. In Salmonella typhimurium and in Salmonella typhi 42 of the sRNAs were conserved. Sixteen sRNAs were conserved in Y.pestis, with an E-value below 0.001. We also compared the sRNA sequences with all available archaeal sequences and did not find any significant similarities.
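The E-value criterion described above can be applied programmatically to BLAST results. The sketch below assumes hits are available in the 12-column tabular format produced by current BLAST+ releases (`-outfmt 6`, with the E-value in the eleventh column); this format is an assumption for illustration and is not necessarily what was used in the original analysis. The hit rows and genome labels are invented.

```python
import csv
import io

E_VALUE_CUTOFF = 0.001  # conservation threshold stated in the text


def conserved_queries(tabular_text: str, cutoff: float = E_VALUE_CUTOFF) -> dict:
    """Map each query to the set of subjects hit with an E-value below cutoff."""
    conserved = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue < cutoff:
            conserved.setdefault(query, set()).add(subject)
    return conserved


# Invented example hits: query, subject, %identity, length, mismatches, gap opens,
# query start/end, subject start/end, E-value, bit score.
example_hits = "\n".join([
    "spf\tgenomeA\t95.0\t100\t5\t0\t1\t100\t500\t599\t2e-30\t150",
    "spf\tgenomeB\t80.0\t30\t6\t0\t1\t30\t10\t39\t0.5\t25.0",
    "oxyS\tgenomeA\t90.0\t60\t6\t0\t1\t60\t900\t959\t1e-10\t80.0",
])

result = conserved_queries(example_hits)
print(result)
```

Raising `cutoff` to 1.0 reproduces the more permissive listing used for the Supplementary Material.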
Out of the 13 sRNAs with known functions, five regulatory and all four housekeeping sRNAs were conserved in Y.pestis. Remarkably, three out of the four housekeeping sRNAs (ssrA, ffs and rnpB) are conserved beyond Y.pestis, each in several additional organisms (marked in Table 3 as such and detailed in the Supplementary Material). The only regulatory sRNA sequence that is conserved beyond Y.pestis is spf, showing a significant similarity to a sequence in the Shewanella oneidensis genome. None of the new sRNAs with unknown functions were conserved beyond Y.pestis.
It is important to note that in this study we examined conservation only through sequence similarity. In many cases the function of an sRNA may depend on the structure of the molecule rather than on a specific sequence. In addition, there are instances in which the function of the sRNA depends on sequence complementarity to a target gene. In such cases, complementary mutations in the sRNA and its target gene may cause a change in the sequence of the sRNA. Indeed, there are E.coli sRNA-encoding genes (ssrA, rnpB and micF) that are known to have homologs in other organisms. Yet, these homologs do not share enough sequence similarity with the E.coli sequences to allow a BLAST search to find them when the E.coli sequence is used as a query. For example, ssrA is known to exist in organisms as distant as archaea. Indeed, it has been shown that additional ssrA homologs could be identified when a representative sequence from an organism distant from E.coli was used as a query (26). It is therefore quite likely that some of the sRNAs surveyed have homologs in additional genomes, yet these were not detected by our analysis due to sequence drift. As the algorithms for the optimal alignment of a sequence to an RNA secondary structure (27) improve, it may become possible to find these more distant orthologs.
It has previously been observed that conservation of gene adjacency in different organisms is associated with functional relationships between the conserved genes. Moreover, it was shown that in many cases proteins encoded by pairs of adjacent genes whose adjacency is conserved work in concert, by physical interaction (28–30). This property, in turn, was used to infer relationships between conserved adjacent genes (30). The same rationale may hold for sRNA-encoding genes and genes to which they are functionally related. We therefore examined the conservation in other genomes of the genes flanking the sRNA-encoding genes. When comparing the E.coli K12 sRNA sequences to the sequences of the 102 complete bacterial genomes we noted the positions in which the sRNAs were found in the other genomes. Based on the GenBank annotation of the relevant genomes, the genes flanking the conserved sRNAs were extracted and a database was created. For this analysis an sRNA was considered conserved if it had an E-value lower than 1. The adjacent genes of the E.coli K12 sRNA genes were translated and used as queries for a TBLASTN search against the database of the flanking genes described above. We found that conservation of adjacency was very widespread in the bacteria closer to E.coli K12. For 44 out of the 50 sRNAs conserved in E.coli O157:H7 both flanking genes were conserved (Table 3). For the remaining six sRNAs only one flanking gene was conserved. This high rate of conservation may be due to the close relatedness between the two E.coli strains. For 23 of the 42 conserved sRNAs in S.typhimurium LT2, we observed a conservation of both adjacent genes. For 11 additional sRNAs one of the adjacent genes was conserved. In Y.pestis CO92, out of the sRNAs for which BLAST found similar sequences with an E-value below 1, only in five cases were both adjacent genes conserved. In 11 additional cases one of the adjacent genes in Y.pestis was conserved.
In some cases conservation of adjacency was used also to support the prediction of an sRNA as conserved in Y.pestis. BLAST hits with an E-value above 0.001 were normally ignored; however, if an sRNA was found in Y.pestis with an E-value between 0.001 and 1, and one or more of its adjacent genes was conserved, this strengthened our certainty in the conservation of this sRNA in Y.pestis. In this manner three additional sRNA-encoding genes (sraA, sraK and rygB) were found to be conserved in Y.pestis, thus raising the number of sRNAs conserved in Y.pestis to 19 (Table 3). For two sRNA genes (gcvB and ssrA) the conserved adjacent genes (gcvA and smpB, respectively) are known to be functionally associated with the sRNAs: GcvA is known to activate the transcription of the adjacent sRNA, gcvB (18); SmpB is an RNA-binding protein, which is necessary for the activity of ssrA (31). These observations could suggest that in the other cases of conservation observed in Y.pestis (Table 3), the conserved adjacent genes may be functionally associated with the sRNAs, and provide some hint as to the function of these sRNAs.
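The two-tier decision rule described above (a strict E-value threshold, relaxed when adjacency is conserved) can be written compactly. The sketch below is one possible reading of that rule; the function and parameter names are illustrative, and the example E-values are invented.

```python
def is_conserved(evalue: float, flanking_conserved: int,
                 strict: float = 0.001, relaxed: float = 1.0) -> bool:
    """Apply the two-tier conservation rule.

    evalue             -- BLAST E-value of the best hit for the sRNA
    flanking_conserved -- number of adjacent genes (0, 1 or 2) also conserved
    """
    if evalue < strict:
        return True  # significant sequence similarity on its own
    # Weaker hit: accept only with supporting evidence from gene adjacency.
    return evalue < relaxed and flanking_conserved >= 1


# Invented examples covering the four possible outcomes.
for evalue, flanks in [(1e-5, 0), (0.05, 1), (0.05, 0), (2.0, 2)]:
    print(f"E = {evalue:g}, conserved flanking genes = {flanks}: "
          f"{is_conserved(evalue, flanks)}")
```

Keeping the two thresholds as parameters makes it easy to examine how sensitive the count of conserved sRNAs is to the chosen cutoffs.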
In order to reveal whether there is sequence conservation among some of the sRNA molecules, an all against all comparison of the sRNA sequences was conducted using the GCG program bestfit that implements the Smith–Waterman algorithm (32). For sRNAs with defined 5′ and 3′ ends, the precise sequences were used; otherwise, the entire sequence of the intergenic region that contains the sRNA was used. For most sRNA-encoding genes the sequence comparison did not reveal any similarity with any of the other sRNAs. sraE and rygB, which reside in the same intergenic region between aas and galR, show significant sequence similarity with 77% identity over 84 nt. These two sRNAs overlap a previously identified intergenic repeat called PAIR2 (33). ryeC and ryeD, which reside in the same intergenic region between yegL and yegM, and two additional sRNA genes named tp8 and rygC, which reside in different intergenic regions, also seem to share sequence similarity (with percent identities ranging between 68 and 83%, covering substantial regions of the sRNAs). These sRNA genes overlap repeats that have been previously identified and named QUAD repeats (33). Also ryfA and IS128, which reside in the same intergenic region, show some sequence similarity (94% identity over 18 nt). A pair of intergenic repeats called PAIR3 was identified in this region; however, IS128, as well as ryfA, correspond only partially to the locations of the previously identified repeats (33). This could be partially due to the fact that the exact 5′ and 3′ ends of IS128 were not experimentally identified. C0614 and IS128 are opposite and therefore complementary. C0362 and C0664 are complementary along 31 bases due to the inclusion of REP elements (21) within their sequences.
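As an illustration of the local alignment underlying this comparison, the following is a minimal Smith–Waterman sketch. It is not the GCG bestfit implementation: it assumes a simple linear gap penalty and arbitrary match/mismatch scores chosen for the example, and the test sequences are invented.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned a, aligned b, percent identity) for the
    best local alignment of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1

    aligned_a, aligned_b = "".join(reversed(top)), "".join(reversed(bottom))
    identical = sum(x == y for x, y in zip(aligned_a, aligned_b))
    identity = 100.0 * identical / len(aligned_a) if aligned_a else 0.0
    return best, aligned_a, aligned_b, identity


# Invented toy sequences sharing the local segment ACGT.
print(smith_waterman("TTACGTTT", "GGACGTCC"))
```

An all-against-all comparison would call this function for every pair of sRNA sequences (and, to detect oppositely encoded genes, for the reverse complement of one member of each pair).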
In addition to the 45 novel sRNA-encoding genes that were discovered, a large number of additional candidate genes were predicted. In the studies of Argaman et al. (12) and Wassarman et al. (13), a relatively small number of sRNAs were predicted and all of the candidates were tested experimentally. In the studies of Chen et al. (16), Rivas et al. (14) and Carter et al. (15) a large number of sRNAs were predicted but only a small portion of these were experimentally examined. Tjaden et al. (17) used a whole genome array to detect transcription from the intergenic regions of the E.coli genome. These transcripts were further characterized and classified as either 5′ or 3′ untranslated regions, operon elements, sRNAs or as transcripts with unknown function.
We compiled a list of the candidates that were not yet verified experimentally (Table 4 and Supplementary Material). In this compilation the number of non-redundant candidates is 1001. Out of these, 906 candidates were predicted by one study alone, 85 candidates were predicted by two studies, and 10 candidates were predicted by three independent studies. As seen in Table 4, only very few of the candidate sRNAs are located within 5′ and 3′ UTRs or operons, as determined by Tjaden et al. (17).
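Compiling a non-redundant list of this kind amounts to uniting overlapping predictions and recording which studies support each united candidate. The sketch below shows one way this could be done; it assumes predictions are given as genomic intervals and ignores strand, which may differ from the criteria actually used. Coordinates and study labels are invented.

```python
from collections import Counter


def merge_predictions(predictions):
    """Unite overlapping (start, end, study) intervals.

    Returns a list of (start, end, set_of_studies), one entry per
    non-redundant candidate.
    """
    merged = []
    for start, end, study in sorted(predictions):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous candidate: extend it and record the study.
            last_start, last_end, studies = merged[-1]
            merged[-1] = (last_start, max(last_end, end), studies | {study})
        else:
            merged.append((start, end, {study}))
    return merged


# Invented coordinates and study labels, for illustration only.
predictions = [
    (100, 200, "A"), (150, 260, "B"),
    (500, 580, "A"), (560, 640, "C"), (600, 700, "B"),
    (900, 950, "C"),
]

candidates = merge_predictions(predictions)
support = Counter(len(studies) for _, _, studies in candidates)
print(candidates)
print("candidates by number of supporting studies:", dict(support))

# The totals reported in the text are internally consistent:
assert 906 + 85 + 10 == 1001
```

Candidates supported by several independent studies are natural priorities for experimental testing.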
The 55 sRNA-encoding genes in E.coli seem to be a much more varied class of genes than tRNA genes and rRNA genes. It is hard to determine a genomic or sequence feature that is shared by all of them. Still, common characteristics could be identified.
By examining the distribution of the sRNA-encoding genes on the genome we found a preference for the left replicore and no preference for the leading or lagging strand. We also found that the sRNA-encoding genes are not clustered in certain intergenic regions, as usually no more than one sRNA gene exists per intergenic region. These genes usually reside in intergenic regions ranging in size from 300 to 900 nt. They very rarely reside in intergenic regions >900 nt, which in E.coli usually contain repetitive sequences.
Our analysis of base composition revealed that all genes are richer in GC in comparison with intergenic sequences. sRNAs are, however, less GC-rich than the other types of genes. There does exist a sub-group of sRNAs, ‘housekeeping sRNAs’, that are richer in GC compared with the other sRNAs. These sRNAs seem to have a base composition more similar to that of structural RNA genes, such as tRNAs and rRNAs. The difference in the GC content could point to the different structural requirements associated with the function of the sRNAs; regulatory sRNAs versus housekeeping sRNAs.
Since three out of the five studies that led to the discovery of the 45 novel sRNAs relied on sequence conservation, it is not surprising that most of the known sRNAs are conserved in closely related bacteria. The conservation is strongest in the other E.coli strains and in S.flexneri, while only 19 of the sRNAs are conserved also in Y.pestis. Only four of the sRNAs were conserved beyond Y.pestis. Three of these sRNAs carry out housekeeping functions. No sRNA homologs were found in archaea. It is important to note that since we examined conservation through sequence similarity alone, it is possible that some of the sRNA homologs that maintain only structural conservation may have been missed.
Conservation analysis of the sRNA-encoding genes along with their flanking genes revealed stronger conservation in the other strains of E.coli, S.flexneri and the two Salmonella strains, than in Y.pestis. In two of the cases in which gene order was conserved in Y.pestis, the adjacent gene and the sRNA were known to be associated. This may point to a functional association for the other sRNA-encoding genes and their adjacent genes in cases where such a conservation was observed.
Most sRNAs do not show sequence similarity with other sRNAs. We found that in most cases where sRNAs were similar in sequence they were also located in neighboring genomic locations, suggesting that they may have resulted from duplication events. Indeed, most of these were previously identified as intergenic repeats (33). It could therefore be interesting to check other intergenic repeats, to see whether they too encode sRNA molecules.
We compiled the candidate sRNA genes predicted in the different studies and compared them with the annotation of 5′ UTRs, 3′ UTRs and operons reported by the study of Tjaden et al. (17). After uniting overlapping predictions into single candidates there remain 906 candidates that are unique to a single study, 85 candidates that were predicted by two studies and 10 candidates that were predicted by three studies. We find that most candidates are not located within annotated operons, 5′ UTRs or 3′ UTRs.
No function is known at present for 42 out of the 55 discovered sRNAs. Our survey provides pointers that may aid in associating function to some of these molecules. Also, the various characteristics we have identified may be used for the development of a refined algorithm for predicting additional sRNA-encoding genes in E.coli, as well as in closely related organisms.
Supplementary Material is available at NAR Online.
We thank Elena Rivas for sending us unpublished data. This study was supported by a grant from the Human Frontier Science Program granted to H.M. and S.A.