The level of sequence heterogeneity among rrn operons within genomes determines the accuracy of diversity estimation by 16S rRNA-based methods. Furthermore, the occurrence of widespread horizontal gene transfer (HGT) between distantly related rrn operons casts doubt on reconstructions of phylogenetic relationships. For this study, patterns of distribution of rrn copy numbers, interoperonic divergence, and redundancy of 16S rRNA sequences were evaluated. Bacterial genomes display up to 15 operons and operon numbers up to 7 are commonly found, but ~40% of the organisms analyzed have either one or two operons. Among the Archaea, a single operon appears to dominate and the highest number of operons is five. About 40% of sequences among 380 operons in 76 bacterial genomes with multiple operons were identical to at least one other 16S rRNA sequence in the same genome, and in 38% of the genomes all 16S rRNAs were invariant. For Archaea, the number of identical operons was only 25%, but only five genomes with 21 operons are currently available. These considerations suggest an upper bound of roughly threefold overestimation of bacterial diversity resulting from cloning and sequencing of 16S rRNA genes from the environment; however, the inclusion of genomes with a single rrn operon may lower this correction factor to ~2.5. Divergence among operons appears to be small overall for both Bacteria and Archaea, with the vast majority of 16S rRNA sequences showing <1% nucleotide differences. Only five genomes with operons with a higher level of nucleotide divergence were detected, and Thermoanaerobacter tengcongensis exhibited the highest level of divergence (11.6%) noted to date. Overall, four of the five extreme cases of operon differences occurred among thermophilic bacteria, suggesting a much higher incidence of HGT in these bacteria than in other groups.
rRNA sequences play a central role in the study of microbial evolution and ecology. Particularly, the 16S rRNA genes have become the standard for the determination of phylogenetic relationships, the assessment of diversity in the environment, and the detection and quantification of specific populations (14, 16). Indeed, the rRNAs combine several properties which make them uniquely suited for such diverse applications. First, they are universally distributed, allowing the comparison of phylogenetic relationships among all extant organisms and thus the construction of a “tree of life.” Second, the rRNAs are generally thought to be part of a core of informational genes which are only weakly affected by horizontal gene transfer (HGT) (1, 8), so their relationships provide a solid framework for the assessment of evolutionary changes in lineages. Third, the rRNAs are functionally highly constrained mosaics of sequence stretches ranging from conserved to more variable. This enables the design of PCR primers and hybridization probes with various levels of taxonomic specificity and is exploited in microbial ecology when the number and distribution of different rRNA genes are taken as a measure of diversity (14). A testimony to the significance of these approaches is the vast and growing database of 16S rRNA genes, of which an increasing number are derived from the large majority of uncultured Bacteria and Archaea. Although many of these organisms appear to dominate in the environment, their distribution and relationships are only known from clone libraries derived from nucleic acids recovered from the environment (16).
The interpretation of microbial ecology and evolution via 16S rRNA sequences has been complicated in recent years by the realization that many bacteria harbor multiple, heterogeneous rRNA operons. The three rRNAs, namely 16S, 23S, and 5S rRNAs, are typically linked together into an operon, which frequently contains an internal transcribed spacer and at least one tRNA. It has been shown that bacterial genomes can contain between 1 and 15 such operons and that 16S rRNA sequences can differ up to several percent between operons (28, 40, 41, 43). Such sequence heterogeneity within single genomes creates a significant problem for culture-independent analysis of microbial communities since it can lead to a severe overestimation of microbial diversity based on 16S rRNA approaches (6). This has led many authors to omit small-scale sequence differences encountered in environmental 16S rRNA clone libraries from estimates of diversity (7, 17, 39). However, such “microdiversity” has been reported to make up a significant fraction of the sequence composition of clone libraries (4, 10, 13, 24). That it potentially signifies functional differences is suggested by comparisons between closely related 16S rRNA sequences and physiological properties of isolates (12, 32, 36) or overall genome architectures (2, 18, 31, 35). In addition, reports of relatively large divergence among rRNA genes between operons have suggested that HGT may affect rRNA genes to a larger extent than was previously assumed (28, 40, 41, 43). If this effect is indeed widespread, then phylogenetic relationships among bacteria may become significantly blurred.
Here we present an in-depth comparison of 16S rRNA genes of bacterial and archaeal operons largely based on published complete genome sequences, focusing on the following questions. (i) What is the range and average of divergence among 16S rRNA genes coexisting in genomes? (ii) What is the redundancy of identical 16S rRNA sequences within multiple operons? (iii) How widespread does HGT appear to be, as evidenced by the number of genomes with highly divergent 16S rRNA genes? This work is intended as a complement to the recently compiled ribosomal copy number database (22) and pursues the overall goal of providing bounds of accuracy and reliability for 16S rRNA sequence-based estimations of diversity and phylogeny.
Information on 16S rRNA variation within 81 genomes with multiple operons was retrieved from several sources between 10 June and 5 August 2003 (Table 1). The majority of the genomes came from the National Center for Biotechnology Information (NCBI) Microbial Genome Database (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html), from which a total of 57 bacterial and 2 archaeal genomes with multiple rrn operons were recovered. In addition, rrn operons from two bacterial genomes were obtained from The Institute for Genomic Research (TIGR) Microbial Genomic Database (www.tigr.org/tdb/mdb/mdbcomplete.html). The rRNA Operon Copy Number Database (rrndb) (http://rrndb.cme.msu.edu) and multiple literature sources served as sources of information for 12 and 10 microorganisms, respectively (Table 1).
16S rRNA sequences from all retrieved genomes were aligned and analyzed with the Sequencher 4.1 software package (Gene Codes, Ann Arbor, Mich.) or with CLUSTAL X (19). For each set of operons, the numbers of divergent and identical 16S rRNA genes were determined. For divergent genes, the percentage of nucleotide difference and the number of polymorphisms were calculated. In cases in which insertion-deletion (I/D) events of >1 bp were observed, two values expressing divergence between the genes were calculated, with the first excluding (−I/D) and the second including (+I/D) the inserted segments (see Table 3).
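The two divergence values described above can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to the same length with `-` marking gaps, and it scores every gap column as one difference when indels are included; the source does not state how gaps were scored, so this scoring rule and the function name `percent_divergence` are assumptions made for illustration only.

```python
def percent_divergence(seq_a: str, seq_b: str, include_indels: bool) -> float:
    """Percent nucleotide difference between two aligned sequences.

    Assumption (not from the source): each gap column counts as one
    difference when include_indels is True and is skipped otherwise.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue  # gap in both sequences: column carries no information
        if a == "-" or b == "-":
            if include_indels:
                compared += 1
                differences += 1
            continue  # -I/D value: gap columns are ignored entirely
        compared += 1
        if a != b:
            differences += 1
    return 100.0 * differences / compared if compared else 0.0


# Toy alignment (hypothetical sequences, not real 16S rRNA data).
operon_1 = "ACGTACGTAC"
operon_2 = "ACGAAC--AC"
minus_id = percent_divergence(operon_1, operon_2, include_indels=False)
plus_id = percent_divergence(operon_1, operon_2, include_indels=True)
print(f"-I/D: {minus_id:.1f}%  +I/D: {plus_id:.1f}%")
```

In this toy case one substitution over eight ungapped columns gives the −I/D value, and the two additional gap columns raise the +I/D value.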
The information on rRNA operon copy numbers from the rrndb was amended with data obtained from the genomes retrieved so that a total of 355 bacterial strains were evaluated. This showed a range in rrn operon numbers from 1 to 15, with two rrn operon copies representing the most common class (25% of the total; Fig. 1). About 40% of the strains had either one or two operons. This was noted previously and demonstrates that despite a highly increased rate of genome sequencing and determination of operon copy numbers, this ratio has remained roughly constant over the last few years (6, 11). The next most abundant classes were four, seven, six, and three operons per genome, with frequencies of 14, 13.5, 11.5, and 6.7%, respectively (Fig. 1). Genomes with ≥10 operons were observed in only 4.3% of the cases. Although overall no clear correlation between rrn copy number and distantly related phylogenetic divisions was apparent from the data (21), a pattern of low operon copy number was observed for three groups. None of the 31 genomes belonging to the α-Proteobacteria had more than four operons. Furthermore, mycoplasma genomes displayed a maximum of three operons, and among the spirochetes, 20 of 24 genomes had a maximum of two operons.
The pattern of operon numbers among archaeal strains differed from that for Bacteria. Although information for only 23 strains could be retrieved from databases and the literature, a dominance of low rrn copy numbers was apparent (Fig. 1). The majority of genomes (65.2%) have a single operon (Fig. 1). Only one archaeon, Methanococcus vannielii, was found to have four operons.
Although variations in operon numbers between different bacterial species have been well documented, variations between strains of the same species are considered less often. Sixteen examples of bacterial species with variable operon numbers in different strains were retrieved from databases and the literature (Table 2). Overall, the variation in operon numbers does not appear to be large, but the phenomenon is not restricted to a specific phylogenetic group since operon variation between strains of the same species was detected for diverse species (Table 2). For two species, Vibrio cholerae and Bacillus cereus, three different operon numbers have been reported, with B. cereus containing the highest variance in numbers. For all other bacteria, only two operon numbers which were different by a single operon were found (Table 2). Furthermore, for at least 42 species with multiple strain entries in the rrndb, only a single operon number was reported, indicating that operon copy variation in closely related bacteria may occur only in a minority of cases.
For determination of the numbers of operons with identical and divergent 16S rRNA sequences, genomes with multiple operons were compiled primarily from the NCBI and TIGR databases, with some additional information from the rrndb and the literature (Table 1). The 76 bacterial genomes analyzed stem from 18 divergent phylogenetic groups; however, the Proteobacteria and Firmicutes divisions dominated the data set, with 33 and 21 genomes, respectively (Table 1). Within the Proteobacteria, the gamma and alpha subdivisions were overrepresented, with 20 and 7 genomes, respectively (Table 1). All other groups were represented by only one or two genomes, with the exception of the Actinomycetales, for which five genomes have been sequenced (Table 1). Among the Archaea, only five genomes were obtained from the databases, all of which belong to the Euryarchaeota and the four phylogenetic groups Methanococcales (1), Methanobacteriales (1), halophilic Archaea (1), and Methanosarcinales (2).
A total of 392 operons (380 from Bacteria and 12 from Archaea) with 230 (221 from Bacteria and 9 from Archaea) associated sequences were retrieved from the 81 (76 Bacteria and 5 Archaea) genomes. Among the Bacteria, 29 genomes (38% of the total) had completely invariant 16S rRNAs. In sum, over 40% of the sequences in bacterial genomes with multiple operons were identical, while for Archaea this number was 25%, or roughly half that for Bacteria. A higher proportion of identical 16S rRNA sequences were found among bacterial genomes with fewer operons (Table 1; Fig. 2). Genomes with two and three operons showed identity in 70.6 and 85.7% of the cases, respectively (Fig. 2). Not surprisingly, this picture changed for genomes with higher numbers of operons; however, only relatively few complete genome sequences were available for each class (Fig. 2). Several genomes displayed high numbers of identical 16S rRNA sequences. Streptococcus agalactiae 2603 V/R has no differences in the 16S rRNAs of all seven of its rrn operons. Similarly, Clostridium acetobutylicum and B. cereus ATCC 14579 exhibited 7 identical operons of a total of 11 and 13 operon copies, respectively. However, examples of the other extreme were also found. For example, in Bacillus subtilis all 10 operons harbor different 16S rRNA sequences. Among the five archaeal genomes, only Methanosarcina mazei showed identity in all three rrn operons, despite the generally low number of operons per genome (Table 1).
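One way to read these redundancy percentages is as the share of operon copies that add no new 16S rRNA sequence, that is, (operons − distinct sequences) / operons. The sketch below applies that reading to the totals quoted above. The function names are hypothetical, and the formula is an interpretation that reproduces the reported 25% for Archaea rather than a calculation stated in the source.

```python
def count_distinct(sequences):
    """Number of distinct 16S rRNA sequences among the operons of one genome."""
    return len({s.upper() for s in sequences})


def redundant_fraction(n_operons: int, n_distinct: int) -> float:
    """Fraction of operon copies that duplicate a sequence already counted.

    Assumed reading of the redundancy figures: (operons - distinct) / operons.
    """
    if n_operons <= 0 or not 0 < n_distinct <= n_operons:
        raise ValueError("need 0 < n_distinct <= n_operons")
    return (n_operons - n_distinct) / n_operons


# Totals quoted in the text for genomes with multiple operons.
bacteria = redundant_fraction(380, 221)
archaea = redundant_fraction(12, 9)
print(f"Bacteria: {100 * bacteria:.1f}% redundant copies")
print(f"Archaea: {100 * archaea:.1f}% redundant copies")
```

For a single genome, `count_distinct` can supply the second argument once the operon sequences have been extracted and trimmed to the same region.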
The averages and ranges of percent nucleotide divergence of multiple 16S rRNA genes within the genomes are shown in Table 3. Values were calculated with and without consideration of insertions and deletions, but the results differed only slightly. This indicates that divergence is largely caused by mutations and not by insertions or deletions (Table 3). For most classes of operon numbers, both the averages and ranges of nucleotide divergence remained under 1%, with lower operon number genomes displaying fewer differences in their 16S rRNA sequences. Cases for which the range of nucleotide divergence exceeded 1% were only found for genomes with 2, 5, 7, and 10 operons, but a comparison with the average values indicated that such a high level of divergence is rare. However, clear exceptions to the generally low level of divergence are four genomes which show extreme nucleotide differences among their 16S rRNA genes. These are the genomes of Desulfotomaculum kuznetsovii (8.3% difference; two rrn operons), Thermobispora bispora R51 (6.4%; four rrn operons), Thermoanaerobacter tengcongensis (11.6%; four rrn operons), and Thermomonospora chromogena (6%; six rrn operons). A 5% divergence between both rRNA operons was also reported for the archaeon Halobacterium marismortui (28). Since these four genomes fell clearly outside the range of values observed for all other genomes, they were excluded from Table 3.
Our analysis revealed that, to date, the highest interoperonic divergence in 16S rRNA genes occurs in the genome of Thermoanaerobacter tengcongensis, with an 11.6% nucleotide difference. This extremely thermophilic bacterium was isolated from a Chinese hot spring (42), and its genome was completely sequenced by Bao et al. (3). It contains four rrn operon copies, and the 16S rRNA genes display a total of 188 polymorphisms (Table 1). The 16S rRNA genes clearly fall into two types, with the first representing three of the four operons (rrnA, rrnB, and rrnD). These contain only 1.1% divergent positions (17 polymorphisms), and operons rrnA and rrnD differ at only two nucleotides (0.13% divergence). In contrast, the second rrn type is represented by a single operon (rrnC) and contains 171 of the 188 polymorphic nucleotide sites, representing 90% of the total divergence. A large fraction of the variation is due to two significant length differences in variable stems, which give the molecule a total length of 1,620 nucleotides (Fig. 3).
We explored whether the highly divergent operon might have arisen via HGT or may represent a pseudogene, which accumulates mutations in the absence of functional constraints. A secondary structure analysis was carried out based on the rationale that mutations in a pseudogene should accumulate throughout the molecule and disrupt the secondary structure at multiple places. On the other hand, a functional or recently functional rRNA gene arisen via HGT would display nucleotide changes that are (i) concentrated in variable regions and (ii) compensated for if they are located in stem regions. This analysis detected only one nucleotide change of a total of 171 in an evolutionarily conserved position (base 94, C→A). In addition, 12 bp were identified as being compensated for in moderately variable stems. The remaining nucleotide substitutions were observed in variable regions and loops and did not disrupt the secondary structure of the 16S rRNA. About half of the extreme divergence of operon rrnC is associated with three inserted regions, of 24, 31, and 28 bases, distributed in different stem-loop regions. Even in the absence of these insertions, rrnC would still differ by 82 bases, or 6.5%, from the other operons. Overall, the secondary structure analysis provided no evidence that the molecule has lost its potential functionality, suggesting that rrnC does not represent a pseudogene.
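The rationale of checking whether stem substitutions are compensated can be sketched as follows. This is not the authors' procedure, which relied on a full 16S rRNA secondary structure model; it is a simplified illustration that assumes a known list of paired positions (0-based indices, hypothetical here) and treats Watson-Crick and G·U wobble pairs as maintaining a stem.

```python
# Base pairs assumed to maintain a helix: Watson-Crick plus G.U wobble.
PAIRING = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def classify_stem_changes(reference: str, variant: str, stem_pairs):
    """Split changed stem pairs into those that still pair and those that do not.

    reference, variant: aligned RNA sequences of equal length.
    stem_pairs: (i, j) index pairs that base-pair in the reference structure
                (hypothetical positions in this sketch).
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    preserved, disrupted = [], []
    for i, j in stem_pairs:
        ref_pair = (reference[i], reference[j])
        var_pair = (variant[i], variant[j])
        if var_pair == ref_pair:
            continue  # no substitution at this pair
        if var_pair in PAIRING:
            preserved.append((i, j))  # change keeps the stem intact
        else:
            disrupted.append((i, j))  # change breaks the base pair
    return preserved, disrupted


# Toy hairpin: positions 0-2 pair with positions 8-6.
pairs = [(0, 8), (1, 7), (2, 6)]
reference = "GGCAAAGCC"
compensated_variant = "GACAAAGUC"  # G-C replaced by A-U at pair (1, 7)
broken_variant = "GACAAAGCC"       # only one side changed: A opposite C
print(classify_stem_changes(reference, compensated_variant, pairs))
print(classify_stem_changes(reference, broken_variant, pairs))
```

Under the reasoning above, a pseudogene would be expected to accumulate many pairs in the `disrupted` list, whereas a functional gene should show mostly `preserved` changes.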
Growing databases of completely sequenced genomes allow the exploration of patterns of interoperonic divergence among 16S rRNA sequences and provide critical information for the assessment of microbial diversity and evolution. Among Bacteria, classes of up to seven operons appear to be common, with no clear predominance of a single class of operon numbers (Fig. 1A). Nonetheless, as previously noted (11, 22), about 40% of bacteria have one or two operons. The picture is different for Archaea, among which the majority of strains have been shown to have a single operon and no genomes with more than four operons have been reported to date (Fig. 1). A detailed analysis of divergence among the 16S rRNA genes in completely sequenced bacterial genomes revealed that ~40% of operons contain sequences identical to those of other operons (Table 1). This number appears to be much smaller for the Archaea; however, few completely sequenced genomes are available (Table 1). Overall, the large majority of 16S rRNA sequences from the same genome display very high similarities, with the ranges and averages remaining within a 1% nucleotide difference (Table 3). Only five genomes with extreme divergence among operons were detected, so overall, few incidences of HGT between divergent genomes are suggested.
Based on the level of divergence and redundancy among 16S rRNA sequences between operons of the same genome, more accurate bounds for diversity estimates of bacterial communities can be suggested. The analysis showed that 76 genomes with multiple operons contained 221 sequences (Table 1). Thus, if 16S rRNA gene diversity among these genomes were to be analyzed analogously to microbial communities by cloning and sequencing, a roughly threefold overestimation of diversity would result. However, this clearly represents an upper bound since a considerable fraction of genomes contain single operons. The magnitude of this fraction is difficult to estimate since genome sequences are currently derived from cultured strains. Among these, organisms with multiple operons are likely overrepresented since they appear to be more adaptable to changing environmental conditions and grow more readily on culture media (5, 21). This also makes it likely that environments which display more stable conditions overall harbor bacteria with fewer operons, leading to a less severe overestimation of diversity. However, there are currently 21 genomes with a single operon available. Adding these to the above estimate provides a lower bound of diversity overestimation of 2.5-fold. Thus, overall we suggest this value as a conservative bound for the correction of bacterial diversity estimates by cloning and sequencing.
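The arithmetic behind the two bounds can be made explicit. The sketch below simply divides the number of distinct 16S rRNA sequences by the number of genomes, using the counts given above; the function name `overestimation_factor` is hypothetical, and the assumption that each single-operon genome contributes exactly one sequence follows from its having one operon.

```python
def overestimation_factor(n_sequences: int, n_genomes: int) -> float:
    """Distinct 16S rRNA sequences recovered per genome actually present."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return n_sequences / n_genomes


# Counts quoted in the text.
multi_operon_genomes = 76
multi_operon_sequences = 221
single_operon_genomes = 21  # each contributes exactly one sequence

upper_bound = overestimation_factor(multi_operon_sequences, multi_operon_genomes)
lower_bound = overestimation_factor(
    multi_operon_sequences + single_operon_genomes,
    multi_operon_genomes + single_operon_genomes,
)
print(f"multi-operon genomes only: {upper_bound:.2f}-fold")
print(f"including single-operon genomes: {lower_bound:.2f}-fold")
```

The first value (221/76) corresponds to the roughly threefold figure and the second (242/97) to the 2.5-fold figure.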
Operon numbers appear conserved overall among closely related organisms, but even among strains of the same species small-scale variation is evident (Table 2). Among more distantly related organisms, no pattern of high or low operon numbers emerged from the analysis, so correction factors can only be applied to overall estimates of microbial diversity, not to individual phylogenetic groups. However, three notable exceptions were evident. Despite the considerable numbers of strains analyzed, the α-Proteobacteria, Spirochaetales, and mycoplasma strains appear to contain only low numbers of operons. For example, no α-Proteobacteria with more than four operons have been described to date, and the seven genomes available for α-Proteobacteria show high homogeneity, with only seven 16S rRNA sequences. This suggests that diversity estimates by clone libraries may be more accurate for this phylogenetic group than for others and that the α-Proteobacteria may overall be adapted to relatively stable environmental niches. In this context, it may be predicted that newly isolated representatives of the SAR11 clade, which dominates the open ocean environment (33), also contain few rRNA operons.
The operon comparison revealed the highest divergence to date among 16S rRNA genes within a single genome and showed that four of the five examples of highly divergent 16S rRNA sequences stem from thermophilic organisms. Thermoanaerobacter tengcongensis displayed 11.6% nucleotide divergence due to 188 polymorphic sites among its four 16S rRNA genes. A secondary structure analysis suggested that the rrnC operon arose via HGT since none of the divergent nucleotides appeared to disrupt the functional configuration of the molecule. Indeed, the three insertions in rrnC result in two longer but perfectly matched stems compared to the other operons (Fig. 3). Similar length differences have been detected in the thermophilic bacterium D. kuznetsovii (40). Additional evidence for HGT of the rrnC operon in Thermoanaerobacter tengcongensis is provided by its higher similarity (95%) to other Thermoanaerobacter species (T. subterraneus SL9 and T. keratinophilus 2KXI).
Whether there is an ecological significance to the occurrence of extreme divergence in thermophiles remains unknown, but for at least some strains it has been confirmed that the divergent rRNAs are transcribed and are thus likely functional (41). However, the pattern may suggest that genomes of thermophiles are prone to HGT. This is supported by the suggestion of HGT in other thermophiles (23, 25, 29, 38). For example, extensive studies of strains of a Thermotoga sp. showed that ~25% of the genes are likely of archaeal origin (29), and a comparison of the genomes of “Pyrococcus abyssi,” Pyrococcus furiosus, and Pyrococcus horikoshii suggested the occurrence of extensive HGT (25).
Despite some extreme cases of 16S rRNA divergence among five genomes, overall a clear dominance of close relationships exists, with the vast majority of interoperonic sequence differences showing <1% divergence (Table 3). Thus, 16S rRNAs may primarily diverge due to mutation or HGT between closely related organisms only. This conforms to the complexity hypothesis, which states that successful HGT over large phylogenetic distances should be a rare occurrence for rRNA genes (1). Because the rRNAs are structural molecules, successful interactions with a large number of other gene products are dependent on the primary sequence of the rRNAs and should theoretically limit functionality in a highly heterologous genomic background. On the other hand, the rRNA genes, as members of a multigene family, are subjected to homogenization processes such as gene conversion (15, 26, 27). Were such processes to occur at high rates, they would relatively quickly erase traces of HGT even if it occurred between distantly related organisms. Nonetheless, genome sequences, taken as a snapshot of the incidence of HGT of 16S rRNA genes between phylogenetically distant organisms, currently confirm that the rRNAs provide a relatively solid framework for the estimation of phylogenetic relationships.
This work was partially supported by a grant from NSF-OCE to M.F.P. and a postdoctoral fellowship from the Spanish Ministry of Education (Ministerio de Educacion, Cultura y Deporte [MECD]) to S.G.A.
We are indebted to Francisco Rodríguez-Valera and Alex Mira for useful comments on the manuscript.