A 16S rRNA gene database (http://greengenes.lbl.gov) addresses limitations of public repositories by providing chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies. It was found that there is incongruent taxonomic nomenclature among curators even at the phylum level. Putative chimeras were identified in 3% of environmental sequences and in 0.2% of records derived from isolates. Environmental sequences were classified into 100 phylum-level lineages in the Archaea and Bacteria.
Comparative analysis of 16S small-subunit rRNA genes is commonly used to survey the constituents of microbial communities (4, 13, 23, 24), to infer bacterial and archaeal evolution (14, 19), and to design monitoring and analysis tools, such as microarrays (5, 10, 17, 20, 29, 30). Because the rate of production of 16S small-subunit rRNA gene sequence records for uncultured organisms now exceeds the rate of production for their cultured counterparts, taxonomic placement of sequences lags behind. In fact, 43% of full-length 16S small-subunit rRNA gene records in the GenBank database are amalgamated into the pseudodivisions “environmental samples” and “unclassified.” Annotation styles are inconsistent, creating barriers for computational categorization of biological sources. Furthermore, since rRNA genes from environmental DNA are usually PCR amplified, it is suspected that many clandestine chimeric sequences are intercalated into the public databases. For a small sample of 1,399 sequence records from known phyla, it was estimated that 3% of the public data might contain chimeras (2). The effect of these poor-quality data, exacerbated by barriers in exchanging nomenclature, has led to several conflicting taxonomies. The probability of mistakenly adopting a chimeric sequence in a phylogenetic inference or as a reference for probe/primer design is increasing noticeably. Finally, ARB (21) database administration needs to be streamlined for workers who maintain 16S small-subunit rRNA gene collections on their local computers.
Greengenes addresses these concerns by providing four features: a standardized set of descriptive fields, taxonomic assignment, chimera screening, and ARB compatibility. Heuristics are used to consider the author's annotations and categorize each source as a named or unnamed isolate, an unnamed symbiont, or an uncultured organism. Other standard descriptors include sequence quality measurements, authors, and a “study_id” that links all the records associated with a project. Greengenes maintains a consistent multiple-sequence alignment (MSA) of both archaeal and bacterial 16S small-subunit rRNA genes to facilitate taxonomic placement. Taxonomy proposed by independent curators, including the NCBI, the Ribosomal Database Project (RDP) (Bergey's) (7), Wolfgang Ludwig (21), Phil Hugenholtz (16), and Norman Pace (23), is tracked to promote user awareness of several estimations of phylogenetic descent, allowing a balanced approach to node nomenclature when dendrograms are generated. Comprehensive chimera assessment is a distinguishing characteristic of the Greengenes data assembly process. Each sequence is scored for chimeric potential, a breakpoint is estimated, and parent sequences are identified. Furthermore, since biologists often collect and visualize 16S small-subunit rRNA gene relationships using the freely available ARB software, Greengenes simplifies the chore of keeping a research group's private ARB database current by providing standardized alignments and an import filter (greengenes.ift) that imports the alignment and other standardized fields from 16S small-subunit rRNA gene records vetted weekly from GenBank.
To illustrate the utility of the Greengenes data assembly process and to examine the validity of prokaryotic candidate phyla, we aligned and chimera checked more than 90,000 public 16S small-subunit rRNA gene sequences. Taxonomic classifications from the major curators were used when such classifications were available. Sequence data were imported from NCBI for complete or nearly complete gene sequences (length, >1,250 nucleotides) deposited as of 2 April 2006. Alignment of both archaeal and bacterial sequences was performed with the NAST aligner (8) against a “Core Set” of templates selected from a phylogenetically broad collection (16). The resulting MSA was formatted so that each sequence occupied a consistent 7,682 characters or 4,182 characters; the latter allowed compatibility with RDP v8.1 (22) alignments. Both these formats were concise enough for browsing in common MSA graphical interfaces, such as ClustalX (28), MEGA (18), and the platform-independent interface Jalview (6), as well as ARB. Other standard expansions, such as the >20,000-character Ludwig alignment, are alternate formats that will be available in future releases to give maximum flexibility to researchers.
For high-throughput chimera screening of the aligned sequences, the program Bellerophon (15) was used with two modifications. First, the algorithm was modified to reduce the number of potential parents considered in the partial trees, which allowed run time to scale linearly rather than exponentially with the count of candidate sequences in a collection. Second, a new metric was implemented, which weighted the likelihood of a sequence being chimeric according to the similarity of the parent sequences. The more distantly related the parent sequences were to each other relative to their divergence from the candidate chimeric sequence, the greater the likelihood that the inferred chimera was real. This metric, called the divergence ratio, used the average sequence identity between the two fragments of the candidate and the corresponding parent sequences as the numerator and the sequence identity between the parent sequences as the denominator. All calculations were restricted to 1,287 conserved columns of aligned characters using a 300-bp window on either side of the most likely breakpoint. A divergence ratio of >1.1 and fragment-to-parent levels of similarity of >90% were required for classifying sequences as putatively chimeric.
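The divergence ratio described above can be sketched in a few lines of Python. This is an illustrative sketch, not Bellerophon's code: the function names are hypothetical, the inputs are assumed to be aligned sequences already reduced to the conserved columns, the window is counted in alignment columns, gap-containing columns are skipped when computing identity, and the >90% threshold is assumed to apply to each fragment separately.

```python
from typing import Tuple

GAP_CHARS = "-."  # assumption: both characters denote alignment gaps


def identity(a: str, b: str) -> float:
    """Fraction of identical bases over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in GAP_CHARS and y not in GAP_CHARS]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def divergence_ratio(candidate: str, parent_left: str, parent_right: str,
                     breakpoint: int, window: int = 300) -> Tuple[float, float, float]:
    """Return (ratio, left identity, right identity) around a breakpoint.

    Numerator: mean identity of the two candidate fragments to their parents.
    Denominator: identity of the two parents to each other in the same window.
    """
    lo = max(0, breakpoint - window)
    hi = min(len(candidate), breakpoint + window)
    left = identity(candidate[lo:breakpoint], parent_left[lo:breakpoint])
    right = identity(candidate[breakpoint:hi], parent_right[breakpoint:hi])
    parents = identity(parent_left[lo:hi], parent_right[lo:hi])
    fragment_mean = (left + right) / 2
    ratio = fragment_mean / parents if parents > 0 else float("inf")
    return ratio, left, right


def is_putative_chimera(candidate: str, parent_left: str, parent_right: str,
                        breakpoint: int, window: int = 300) -> bool:
    """Apply the thresholds quoted in the text: ratio > 1.1 and similarity > 90%."""
    ratio, left, right = divergence_ratio(candidate, parent_left, parent_right,
                                          breakpoint, window)
    return ratio > 1.1 and min(left, right) > 0.90
```

A candidate built from the left half of one parent and the right half of a dissimilar parent yields a high ratio, whereas a candidate whose two "parents" are nearly identical to each other yields a ratio near 1 and is not flagged.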
Taxonomy was linked to each record by various methods. NCBI taxonomic nomenclature and RDP taxonomic nomenclature were extracted directly from the corresponding GenBank-formatted records. The Pace and Ludwig annotations were exported from curated ARB databases. The Hugenholtz taxonomy was also derived from a curated ARB database in which tree topologies had been verified using RAxML-VI (27) for maximum likelihood inference. The general time-reversible model of evolution was applied together with optimization of substitution rates and site-specific rates according to a gamma distribution. Different search algorithms were considered depending on the run time of the standard hill climb (SHC) search method. If the running time was less than 8 h, simulated annealing (SA) was processed with the default starting temperature and a termination time set at approximately 24 h. If simulated annealing was not used and SHC terminated within 24 h, SHC was used. Furthermore, rapid hill climb was used in all other cases when the running time was less than 24 h. If rapid hill climb did not terminate within the set limit, the number of taxa was reduced. After 100 bootstrap replications, a consensus tree was calculated using Consense (12) and imported into ARB. This database (greengenes.arb) is available for download through Greengenes and is updated periodically.
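The choice among search methods described above amounts to a small decision rule. The sketch below encodes one reading of that rule; the function name, the return labels, and the use of `None` for a run that did not terminate within the limit are assumptions made for illustration, and nothing here calls RAxML itself.

```python
from typing import Optional


def choose_search(shc_hours: Optional[float], rhc_hours: Optional[float]) -> str:
    """Pick a tree search strategy from observed run times, in hours.

    shc_hours: run time of the standard hill climb, or None if it did not finish.
    rhc_hours: run time of the rapid hill climb, or None if it did not finish.
    """
    if shc_hours is not None and shc_hours < 8:
        # Fast SHC: simulated annealing is affordable (terminated at ~24 h).
        return "SA"
    if shc_hours is not None and shc_hours < 24:
        # SA not used, but SHC itself finished within the limit.
        return "SHC"
    if rhc_hours is not None and rhc_hours < 24:
        # All other cases in which the rapid hill climb finished in time.
        return "RHC"
    # Nothing terminated within the limit: shrink the data set and retry.
    return "reduce_taxa"
```

Under this reading, a data set whose SHC run takes 5 h is analyzed with simulated annealing, one that takes 12 h keeps the SHC result, and one for which neither hill climb finishes within 24 h is reduced in size.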
Of the 90,000 NCBI records analyzed, 54% were derived from uncultured organisms, the majority of which were deposited in the last 5 years (Fig. 1). Only three studies have submitted more than 1,000 full-length clones; however, we expect the number of large 16S small-subunit rRNA gene surveys to increase due to the availability and falling cost of high-throughput sequencing. Bellerophon detection of putative chimeras in 3% of the sequences from uncultured organisms was not unexpected considering the initial estimates (2). Surprisingly, 0.2% of sequences derived from pure cultures were also determined to be putative chimeras. Multiple distinct 16S rRNA genes have been encountered when clone libraries have been created from colonies assumed to be pure cultures prepared from numerous third-party sources (Colleen Cavanaugh, personal communication). It is possible that isolated colonies contain symbiotic bacteria which increase PCR template complexity, enabling chimera formation. In addition, thousands of full-length 16S small-subunit rRNA gene-annotated GenBank records were only partially aligned using NAST. Future versions of NAST could be altered to allow alignment extensions across regions having low template similarity or to allow candidates to be aligned in sections using divergent templates. Both of these options may allow a greater abundance of chimeric data to be imported into Greengenes but perhaps would capture novel phyla from the public repositories. Alternately, manually aligned sequences from novel phyla can be offered from the user community for recruitment to the Core Set advocating periodic reevaluation of the partially aligned set.
Discovery of chimeras in 16S small-subunit rRNA gene data collections is crucial if the data set is going to be a foundation for applied bioinformatics. Chimeras are a fundamental problem when they are used as templates with probe selection software, a growing concern with the recent increase in 16S small-subunit rRNA gene microarray probe development (3, 8, 11). The 15 to 30 bases surrounding the chimeric breakpoint can appear to be sufficiently different from all other records in a database to cause a probe selection algorithm to justifiably identify the region as a target's signature and suggest complementary probes that can be synthesized. These probes could appear to be very valuable considering their minimal mishybridization potential, but in fact, they would rarely be useful since they target nonexistent organisms. Chimera test results from Greengenes allow greater control over input to probe selection software, should aid in avoiding artificial terminal restriction fragment length polymorphism pattern predictions from ARB-compatible TRF-CUT (25), and can increase the accuracy of sampling rarefaction curves (26).
The fraction of putative chimeras in the deposited sequences from an individual study varies from none to more than 20% (Fig. 1), suggesting that chimera screening is still not being uniformly applied by sequence generators. The problem is exacerbated with sparsely populated candidate phyla. For instance, the bacterial phyla “SAM” and “5” and the class GN4 (Proteobacteria) may require reevaluation. Likewise, the genera Tistrella, Caldotoga, Dehalobacterium, and Desulfovermiculus are currently anchored by sequences with evidence of chimeric composition. Additional sequences could lead to empirical rejection of certain classifications or may aid in defining the true breadth of sequence variation for these taxa.
Comparison of five different taxonomies uncovered surprisingly great disparity between expert curators. Loosely interpreting a “phylum” to be any labeled group or division immediately subordinate to the domain Archaea or Bacteria, we compared the five curations in a Venn diagram (Fig. 2). The main source of the disparity is the discordant naming of novel candidate phyla or the absence of names for candidate phyla. For example, Pace and Hugenholtz have independently named more than 12 phylum-level lineages, many of which are the same lineages, and RDP has not named any of these lineages. This is a consequence of the huge number of environmental sequences in the public databases and the frequent redundant naming of environmental lineages in the literature. We hope that making multiple taxonomic classifications available through Greengenes will aid in standardizing classification, particularly classification of environmental lineages.
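The loose definition of a "phylum" used for this comparison, and the grouping of labels into Venn regions, can be sketched as follows. This is only an illustration: the semicolon-delimited lineage format, the function names, and the idea of keying each region by the set of curators that use a label are assumptions, not a description of how the figure was produced.

```python
from collections import defaultdict


def phylum_label(lineage: str):
    """Return the label immediately below the domain, or None if absent."""
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    if len(ranks) > 1 and ranks[0] in ("Archaea", "Bacteria"):
        return ranks[1]
    return None


def venn_regions(curations):
    """Group phylum-level labels by the exact set of curators that use them.

    curations: dict mapping a curator name to a list of lineage strings.
    Returns a dict mapping frozenset(curators) -> set(labels).
    """
    users = defaultdict(set)
    for curator, lineages in curations.items():
        for lineage in lineages:
            label = phylum_label(lineage)
            if label is not None:
                users[label].add(curator)
    regions = defaultdict(set)
    for label, curators in users.items():
        regions[frozenset(curators)].add(label)
    return dict(regions)
```

Labels that land in a single-curator region are the ones most likely to reflect discordant or redundant naming. Because the match is on the label string alone, two curators who give the same lineage different names are counted as disagreeing, which is the effect the paragraph above describes.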
Greengenes is also a functional workbench to assist in analysis of user-generated 16S rRNA gene sequences. Batches of sequencing reads can be uploaded for quality-based trimming and creation of multiple-sequence alignments (9). Three types of non-MSA similarity searches are also available: seed extension by BLAST (1), similarity based on shared 7-mers by a tool called “Simrank,” and a direct degenerate pattern match for probe/primer evaluation. Results are displayed using user-preferred taxonomic nomenclature and can be saved between sessions.
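Two of these search types are simple enough to illustrate directly. The sketch below is not Simrank's implementation or scoring formula: it assumes similarity is the fraction of a query's unique 7-mers that also occur in the subject, and it assumes degenerate matching means expanding standard IUPAC nucleotide codes into a regular expression. All function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, expressed as regex fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def kmers(seq: str, k: int = 7) -> set:
    """Set of unique k-mers in a sequence (empty if the sequence is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(query: str, subject: str, k: int = 7) -> float:
    """Fraction of the query's unique k-mers that are also present in the subject."""
    q = kmers(query, k)
    if not q:
        return 0.0
    return len(q & kmers(subject, k)) / len(q)


def find_degenerate(pattern: str, seq: str) -> list:
    """Start positions (0-based) where a degenerate probe/primer matches seq.

    A lookahead is used so that overlapping matches are all reported.
    """
    body = "".join(IUPAC[c] for c in pattern.upper())
    regex = re.compile("(?=(" + body + "))")
    target = seq.upper().replace("U", "T")
    return [m.start() for m in regex.finditer(target)]
```

For example, `find_degenerate("AYG", "AACGATGA")` reports matches at positions 1 (`ACG`) and 4 (`ATG`), because `Y` stands for either `C` or `T`. A k-mer comparison of this kind needs no alignment, which is why it suits a quick first-pass search.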
In summary, Greengenes offers annotated, chimera-checked, full-length 16S rRNA gene sequences in standard alignment formats. The relational database links taxonomies from multiple curators and multiple sequences from a single study. We found that there is incongruent taxonomic nomenclature among curators even at the phylum level. Bellerophon found putative chimeras in sequences derived from both uncultured and isolated organisms. The data set can be compared to user-provided sequences via a web interface or can be imported directly into ARB for advanced analyses. We anticipate that Greengenes will be valuable to researchers conducting environmental surveys and for 16S rRNA microarray design.
In the immediate future, we plan to develop and implement a number of community curation tools. This should allow the user community to actively participate in improving the quality of the Greengenes database and should ensure that time-consuming manual improvements of sequence and sequence-associated data, including taxonomic corrections, are propagated for the benefit of the whole community. Specifically, five curation tools that should capture manual improvements are in development: (i) improvements in individual sequence alignments, (ii) manual verification of putative chimeras, (iii) recruitment of novel lineages to the Core Set, (iv) corrections in the Greengenes description (the abbreviated description of the record usually has the form [habitat] clone [clone name] for environmental sequences), and (v) updating taxonomic group names. One of the main challenges in the implementation of these tools is to ensure that only high-quality manual edits are incorporated into Greengenes. For example, for a suggested alignment alteration, the submitted sequence must (i) match the existing sequence, (ii) preserve the location of highly conserved positions in the 16S rRNA gene, and (iii) record the curator information as part of the update transaction. We recognize the desire of many users to contribute to a distributed curation effort, and we hope that Greengenes will become a resource to facilitate this desire.
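The three acceptance criteria for a suggested alignment alteration can be expressed as a small validation routine. The sketch below is a hypothetical illustration, not the planned Greengenes tool: the function name, the representation of conserved positions as a list of column indices, and the interpretation of "preserve the location" as "the character in each conserved column is unchanged" are all assumptions.

```python
GAPS = "-."  # assumption: both characters denote alignment gaps


def ungap(aligned: str) -> str:
    """Remove gap characters and normalize case."""
    return "".join(c for c in aligned.upper() if c not in GAPS)


def validate_alignment_edit(existing: str, submitted: str,
                            conserved_columns, curator: str) -> list:
    """Return a list of problems; an empty list means the edit is acceptable."""
    problems = []
    # (i) The edit may move gaps, but the underlying sequence must be unchanged.
    if ungap(existing) != ungap(submitted):
        problems.append("submitted sequence does not match existing sequence")
    if len(existing) != len(submitted):
        problems.append("alignment length changed")
    # (ii) Highly conserved positions must stay in the same columns.
    limit = min(len(existing), len(submitted))
    for col in conserved_columns:
        if col >= limit or existing[col].upper() != submitted[col].upper():
            problems.append("conserved column %d was altered" % col)
    # (iii) The curator must be recorded as part of the update transaction.
    if not curator.strip():
        problems.append("curator information is missing")
    return problems
```

Returning a list of problems rather than a single boolean lets a submission form report every failed check at once, so a curator can correct an edit in one pass.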
We thank Kirk Harris and Norman Pace for sharing their ARB database and Richard Phan and Yvette Piceno for assistance with the web interface.
The computational infrastructure was provided in part by the Virtual Institute for Microbial Stress and Survival (http://VIMSS.lbl.gov) supported by the U.S. Department of Energy Office of Science Office of Biological and Environmental Research Genomics:GTL Program and the Natural and Accelerated Bioremediation Research Program through contract DE-AC02-05CH11231 between Lawrence Berkeley National Laboratory and the U.S. Department of Energy. Web application development was funded in part by the Department of Homeland Security under grant HSSCHQ04X00037.