In our previous paper, we described Pintail, a novel program for screening individual sequence records for chimeras and other anomalies (2). While it is useful to have a tool that can consider a single sequence in detail, there is also a need for software that specializes in screening whole gene libraries quickly and accurately. Mallard was therefore written to meet this separate need. This paper demonstrates the ability of the Mallard program to identify putative anomalies, which can then be checked further with the Pintail tool (2), and shows that Mallard successfully detects anomalies within bacterial taxa, archaeal taxa, libraries of near-complete sequences, and libraries of partial sequences. Currently, the most widely used program for screening whole gene libraries is Bellerophon (13). When we compared Mallard with Bellerophon, Mallard consistently performed better in correctly identifying chimeras while minimizing false-positive results. We therefore conclude that Mallard is a significant improvement in chimera detection. However, it should also be noted that Bellerophon sometimes identified chimeras that Mallard missed. In light of this, and acknowledging the conclusions of previous studies investigating chimeras (14, 31), we believe that more than one method should be employed where feasible to detect as many chimeras as possible.
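The recommendation to use more than one method can be illustrated with a minimal sketch. It assumes that the output of each screening tool has already been reduced to a set of flagged sequence identifiers; the identifiers, the example data, and the function name `combine_flags` are hypothetical and are not part of Mallard, Pintail, or Bellerophon.

```python
def combine_flags(flags_by_method):
    """Merge flagged sequence IDs from several screening methods.

    flags_by_method maps a method name to the set of sequence IDs it flagged.
    Returns a dict mapping each flagged ID to the sorted list of methods
    that flagged it, so the union is kept and agreement stays visible.
    """
    combined = {}
    for method, ids in flags_by_method.items():
        for seq_id in ids:
            combined.setdefault(seq_id, []).append(method)
    return {seq_id: sorted(methods) for seq_id, methods in combined.items()}


# Hypothetical flagged clones; not data from this study.
flags = {
    "mallard": {"clone_03", "clone_11", "clone_27"},
    "bellerophon": {"clone_11", "clone_40"},
}

merged = combine_flags(flags)
for seq_id in sorted(merged):
    print(seq_id, merged[seq_id])
```

Keeping the union, rather than the intersection, reflects the aim of detecting as many chimeras as possible; every putative anomaly in the merged list would still need to be checked individually.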
Like Bellerophon and most sequence comparison methods generally, Mallard uses aligned sequence data and is dependent on the quality of these alignments to arrive at the correct answer. In this study, we used a mixture of ClustalW alignments and alignments downloaded from the RDP website. Unlike ClustalW, the RDP's alignment procedure takes into account 16S rRNA secondary structure when constructing an alignment. Theoretically, this should make RDP alignments more accurate than ClustalW alignments; however, in practice we found that RDP alignments were sometimes inferior. An example is the gene library of Spear et al. (25): analysis based on the RDP alignment successfully identified four chimeras but also generated 28 false-positive results, and further investigation revealed that these false positives were caused by poor alignment. Realigning the sequences with ClustalW resolved the problem, and the same four chimeras were identified without the extra false positives. We recommend that the user pay particular attention to the quality of the alignment when using Mallard, Bellerophon, or indeed any other alignment-based method.
In our previous study (2), we estimated that, overall, around 5% of Bacteria 16S rRNA gene sequence records within the public repositories have substantial errors. In our current study, we found anomaly levels of 6.8% among Verrucomicrobia records (Bacteria) and 7.8% among Crenarchaeota records (Archaea). More significantly, however, in our survey of 16S rRNA clone gene libraries submitted during 2005, we showed that the average proportion of anomalies per submitted library had risen to 9.0% over the course of that year. This is very likely an underestimate. Using a 100% cutoff line alone to identify putative anomalies resulted in a conservative estimate of true anomalies, and as a result, some more subtle (and not so subtle) chimeras that we know exist were excluded from our final counts.
The submitted 2005 clone libraries varied greatly in chimera content, ranging from 0 to 45.8% of the total sequence records considered. Of the 25 libraries, only 17 are currently associated with papers, and among these, the amount of information on how libraries were constructed and checked varies greatly (for example, only nine papers actually stated that chimera detection methods were used, preventing any conclusion as to the efficacy of existing methods based on these libraries). Consequently, it is difficult to draw any conclusions as to why such a variation in chimera content has occurred. It has been speculated that increasing the number of cycles during PCR amplification of DNA can increase the chances of chimera formation (30), although no correlation between chimera generation and cycle number could be detected in the current study. The harshness of the DNA extraction method used has also been implicated in chimera formation, but even recourse to “gentle” DNA extraction methods involving detergents or enzymes does not appear to reduce the problem (17), and certainly there is insufficient information available to draw any conclusions in this regard from the 2005 clone libraries considered.
It would appear, therefore, that chimeras within 16S rRNA gene clone libraries are inevitable, at least with current PCR methodologies. Previously, it had been estimated that up to 30% of individual PCR-generated clone libraries were likely to be chimeric (17, 30, 31). We cannot comment on how many chimeras were originally generated by the researchers considered in this study, but we note that libraries with up to 45.8% chimeras are being submitted without comment to the public repositories. Serious anomalies are polluting the public repositories to such an extent that their usefulness is being surreptitiously and progressively compromised. The effects are already being felt; for example, some putative chimeras were especially difficult to check during the current study because so many anomalies had been submitted for the taxa they supposedly represent.
This study indicates that most libraries submitted during 2005 contain misleading anomalies, and the average anomaly content per library (9.0%) is estimated to be 4 percentage points higher than the 5% estimated previously for the public repository overall. Moreover, our results show that the vast majority of these errors are now chimeras, the most insidious and misleading of anomalies. At least 90.8% of the anomalies considered in this study had chimeric patterns, which contrasts dramatically with the 64.3% reported previously (2). Our previous study showed that between 1993 and 2004 a steadily increasing number of chimeras were submitted to the NCBI database (2), at least among the phyla investigated by that study. Overall, we conclude that the specific problem of chimeric 16S rRNA sequences in the public databases is at best not improving and at worst is becoming more acute. We offer our software free to the wider research community in the hope that it will complement existing methods to ensure that as few chimeras and other anomalies as possible are submitted in future.
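Because the comparisons above involve differences between percentages, it may help to separate the absolute difference (in percentage points) from the relative increase. The short sketch below uses only the figures quoted in this section; the function names are illustrative.

```python
def percentage_point_change(old_pct, new_pct):
    """Absolute difference between two percentages, in percentage points."""
    return round(new_pct - old_pct, 1)


def relative_change(old_pct, new_pct):
    """Increase of new_pct over old_pct, expressed as a percentage of old_pct."""
    return round(100.0 * (new_pct - old_pct) / old_pct, 1)


# Anomaly content: about 5% for the repository overall (previous estimate)
# versus 9.0% on average per library submitted during 2005.
anomaly_points = percentage_point_change(5.0, 9.0)
anomaly_relative = relative_change(5.0, 9.0)

# Share of anomalies with chimeric patterns: 64.3% previously versus 90.8% here.
chimera_points = percentage_point_change(64.3, 90.8)

print(anomaly_points)    # 4.0 percentage points
print(anomaly_relative)  # 80.0 (% relative increase)
print(chimera_points)    # 26.5 percentage points
```

On these figures, the average anomaly content per library is 4 percentage points above the earlier repository-wide estimate, which corresponds to a relative increase of 80%.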