The accuracy was similar for most of the tested aligners, so the choice was driven primarily by practical considerations, such as which tool has the smallest memory footprint and which benchmarks the fastest. Two additional major determining factors were: i) which aligner could handle the size of our reference database, and ii) which aligner could map both paired-end reads and fragment reads in a single execution. The number of reference genomes is increasing at a rapid rate. For example, the HMP project alone is committed to sequencing 3,000 bacterial genomes over its course, resulting in an ever-increasing size of the RGD (presently 7.3 Gb). Many available next-generation aligners impose a 4 Gb database size limitation, a technical hurdle for mapping algorithms that use the Burrows-Wheeler transform in their implementation (e.g. ). Additionally, steps within the read mapping process (see Text S2) prior to alignment leave a fraction of the paired-end reads orphaned during low-complexity screening, so each sample ends up with both a set of paired-end reads and a fragment read file, and both need to be aligned. Together these issues would have posed a computational hurdle, because they would have required us to run four alignments per sample to scan the full breadth of the RGD.
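The pairing bookkeeping described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code; the function name and data shapes are our own assumptions:

```python
def partition_reads(pairs, passed):
    """Split screened read pairs into intact pairs and orphaned fragments.

    pairs  -- list of (read1_id, read2_id) tuples from the raw sample
    passed -- set of read IDs that survived low-complexity screening

    Illustrative sketch: pairs whose mates both survive keep their pairing
    information; a read whose mate was removed becomes a fragment read.
    """
    intact, orphans = [], []
    for r1, r2 in pairs:
        if r1 in passed and r2 in passed:
            intact.append((r1, r2))   # both mates survive: align as a pair
        elif r1 in passed:
            orphans.append(r1)        # mate removed: align as a fragment
        elif r2 in passed:
            orphans.append(r2)
    return intact, orphans
```

An aligner that accepts only one input mode per run would need separate executions for `intact` and `orphans`, which is why a mixed sample multiplies the number of alignment runs required.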
The SOAP aligner was a statistically significant outlier, detecting fewer hits to all strains in the MMD than the other aligners. BWA's primary weakness was its inability to handle a database larger than 4 Gb. The SMALT aligner, while claiming to support larger databases if the user increases the search window size, was unable to handle a database larger than 6 Gb in our hands. In addition, the loss of sensitivity prompted by an increased window size (data not shown) was of concern. Novoalign displayed the smallest memory footprint of all aligners tested during our benchmarking. Its limitation proved to be speed, clocking in as the slowest aligner tested (over 10-fold slower than the frontrunner). MAP performed similarly to CLC and was able to support the large database size we required, but the version tested was limited in that the only available mapping strategy revolved around its topN setting, which reports a hit only when the query has that number or fewer identical top hits (i.e. topN 5 tells MAP not to report a query that aligns equally well to >5 spots in the reference). Drastically increasing the topN value to ensure we were not missing hits caused a significant increase in the amount of memory needed to complete the alignment. Note that parameter modifications have since been made in MAP to address this issue (Brian Hilbush, RealTime Genomics, personal communication), but only after this evaluation had been completed. Finally, only the CLC aligner was able to map both fragment and paired-end reads in a single execution while still considering read pairing information. While several aligners achieved similar levels of sensitivity and accuracy, the overall feature set that CLC offered tipped the balance, and so it was selected for the optimization-related analyses in our study.
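The topN reporting rule can be illustrated with a short sketch. The data structures and function name here are our own illustration of the behavior described above, not MAP's implementation:

```python
def report_hits(alignments, top_n=5):
    """Emulate a topN-style reporting rule: a query's best hits are
    reported only if the number of equally best-scoring placements
    is at most top_n; otherwise the query is suppressed entirely.

    alignments -- dict mapping query ID -> list of (ref_position, score)
    """
    reported = {}
    for query, hits in alignments.items():
        if not hits:
            continue
        best = max(score for _, score in hits)
        top_hits = [pos for pos, score in hits if score == best]
        if len(top_hits) <= top_n:
            reported[query] = top_hits  # ambiguous beyond top_n: dropped
    return reported
```

Under this rule, raising `top_n` to avoid dropping highly repetitive queries forces the aligner to track more candidate placements per query, which is consistent with the memory growth we observed.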
None of the aligners compared were able to map 100% of the 22,735,802 mock community reads back to the MMD. Depending on the aligner, only 63% to 92% of the mock community queries could be aligned (). This is attributed in part to the fact that the mock query data had not been screened for low complexity. The DUST application was used to mask low-complexity sequence and subsequently remove it from the query set. This filtering accounted for 3-4% of the unmappable queries (data not shown). The inability to map the remainder of the reads is likely due to the fact that: i) of the 21 genomes included in the MMD, 2 are based on draft assemblies (Actinomyces odontolyticus ATCC 17982 and Enterococcus faecalis OG1RF) and therefore may not be complete representations of their respective genomes, and ii) not all plasmid sequences associated with each strain were available in GenBank when the MMD was created.
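The intent of the low-complexity screen can be conveyed with a simple entropy-based sketch. To be clear, this is not the DUST algorithm itself (DUST uses its own triplet-based scoring); it is a stand-in illustrating why repetitive reads are removed before alignment, and the threshold value is an arbitrary assumption:

```python
import math
from collections import Counter

def is_low_complexity(seq, k=3, threshold=0.55):
    """Flag low-complexity sequence via Shannon entropy of its k-mer
    counts. A stand-in for DUST-style masking, not DUST itself.

    Returns True when the normalized k-mer entropy falls below threshold.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return True
    counts = Counter(kmers)
    total = len(kmers)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in counts.values())
    # Normalize by the maximum achievable entropy for this many k-mers
    max_entropy = math.log2(min(total, 4 ** k))
    return (entropy / max_entropy if max_entropy > 0 else 0.0) < threshold
```

A tandem repeat such as `"ACGT" * 10` cycles through only four distinct 3-mers and is flagged, while a sequence whose 3-mers are nearly all distinct passes.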
The CLC parameters were tested to achieve maximum sensitivity while minimizing false positives at a gross level. Due to limitations in the availability of bacterial organisms for inclusion in the reference database, no amount of parameter tweaking can completely overcome false positive detection, but by considering the problem at a higher taxonomic level (the genus level), where we do have good representation across the phylogeny, we were able to arrive at a parameter combination that provides a relatively good profile of a microbial community.
Based on the results, in the ideal case where all organisms in the query pool are represented in the database (as when aligning the mock query data against the MMD), the length constraint had a much stronger impact on sensitivity than the various similarity settings tested, and only the most stringent length requirements hampered sensitivity. But when we attempted to model real data by replacing several strains with other organisms from within the same genus, we began to see differences in community structure reflecting changes in the required percent identity. This is expected when sequences are mapped to more divergent strains. Furthermore, there is a significant overall decrease in detection caused by the substitution of D. geothermalis for D. radiodurans R1. D. radiodurans R1 is by far the most abundant strain present in the mock community, and its replacement (D. geothermalis) is only ~46% similar at a genome-wide level, so this was an expected result. Somewhat surprising is the fact that when the genus Streptococcus was modified by removing two of the three mock species present, the depth found for the remaining genome alone was approximately equivalent to the average depth found across all three Streptococcus species in the original MMD. One could have expected that under a top random mapping strategy the remaining species would have captured the reads that had originally mapped to the now-missing species, increasing its reported depth of coverage. Instead, we observed the same depth of coverage for the remaining species as was originally seen. A possible explanation is that the two removed species were divergent enough from S. mutans UA159 to prevent any cross-mapping. Consistent with this, genome-wide pairwise alignments between S. mutans UA159 and the other two genomes show ~51% and ~35% similarity to Streptococcus agalactiae 2603V/R and Streptococcus pneumoniae TIGR4, respectively.
The experiment investigating the effects of mapping strategy on taxonomic resolution (i.e. the ability to correctly identify an organism at a given taxonomic level) showed a clear trade-off between the fraction of the reads representing a sample that can be characterized and the accuracy of that classification. As shown in , 88% of all reads can find a hit under the top random mapping strategy, but 21% of those alignments are incorrect at the strain level. Thus, using this strategy, we can only confidently classify reads to higher-order taxonomies (the genus level in this figure). Under the unique placement only strategy we are able to annotate only 58% of reads, but in this case the characterization is accurate at the strain level. This ability to classify data to the strain level represents a key advantage that shotgun metagenomic sequences have over 16S rRNA characterizations made with the commonly used RDP algorithm, which typically makes identifications (at the 0.8 level of confidence) at the genus level.
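The two placement strategies compared above can be sketched for a single read. This is an illustrative abstraction of the trade-off, not any aligner's actual code; the labels and function shape are our own assumptions:

```python
import random

def place_read(best_hits, strategy, rng=random):
    """Classify one read given its equally best-scoring hits.

    best_hits -- list of (strain, genus) labels for the top placements
    strategy  -- "top_random" or "unique_only"
    """
    if not best_hits:
        return None                       # read is unmapped
    if strategy == "unique_only":
        # Only unambiguous placements are reported: fewer reads are
        # classified, but strain-level calls are trustworthy.
        return best_hits[0] if len(best_hits) == 1 else None
    # top_random: pick one best hit at random. More reads get classified,
    # but strain-level assignment may be wrong in conserved regions, so
    # only the genus-level part of the answer is reliable.
    return rng.choice(best_hits)
```

A read landing in a region conserved across two Streptococcus strains returns `None` under unique only, while top random reports one of the two strains (correct at the genus level, a coin flip at the strain level).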
Looking into the effect of strain representation within the reference database on mapping resolution, we found a relationship between the number of strains available in the RGD under a given genus and our ability to resolve down to the strain level. Mappings performed against the MMD, where only the mock strains known to be present were available, showed that mapping strategy played little role in the ability to detect coverage. In such a perfect scenario almost any hit will be to the correct strain, because there are no phylogenetically close neighbors competing for alignment within conserved regions that could preclude detection under unique placement only rules. But when a query metagenomic shotgun sequence is mapped to the RGD, strains with many similar neighbors available in the database preclude accurate mappings at finer-grained taxonomic levels. In summary, genera with many strains available in the reference database, such as Bacteroides, will present a challenge when trying to resolve to the strain level using the unique only mapping strategy. Genera with few strains available will offer strong resolving power using either mapping strategy. In the two examples where this did not hold true, the mock community strains were considerably divergent from other organisms within their genus ().
Furthermore, based on alignments of the mock queries against the RGD under the top random alignment strategy, we found that the number of false positive identifications at the species level is higher than when taxonomic assignments are made at the genus level. Approximately 4% of these classifications are incorrect at the species level, but all of those reads can be mapped to the correct genus. When using the unique only mapping protocol, we still found a few false positive classifications at the species level, but the overall misclassification rate was not significantly inflated (0.3% false positive rate). This is expected, because the only time a misclassification can happen under unique only rules is when the sequence's strain of origin is missing from the RGD but the read happens to fall into a region that is divergent from other close neighbors within the same species yet conserved in some other organism represented in the database. Based on our results, this is a very rare event.
In future studies, a more advanced approach would be to generate a pan-genome (e.g. ) reference database in which only the unique portions of genomes are represented. By maintaining a single instance of each highly conserved region, and by tracking which genomes share it, one could confidently classify shotgun metagenomic sequences at a finer taxonomic resolution. Using a pan-genome reference would allow either mapping strategy, top random or unique only, to see conserved sequence and would offer solid annotation for all genomes sharing that region. This would allow the user to fine-tune the taxonomic resolution based on the available information, sometimes allowing annotation down to the species level where previously one could only assign a genus-level classification.
In conclusion, we compared six short-read aligners for the purpose of identifying an aligner and parameters that enable accurate profiling of metagenomic communities for any project that uses large NGS datasets and aims to complete the analysis within a reasonable timeframe. We used a mock community sample of known composition and aligned it against the MMD, which comprises genome sequences of all organisms in the community. Five of the six aligners performed similarly well, with the notable exception of the SOAP aligner, which detected less coverage in general. The selection of the CLC aligner was prompted by several practical factors: i) its ability to handle large databases, ii) its ability to map both paired-end and fragment sequence data in a single operation, and iii) its speed and small memory footprint. The MAP aligner held a respectable second place, but its lack of support for traditional top random and unique only mapping strategies (at the time of this evaluation) and its inability to map both paired-end and fragment reads simultaneously kept it from taking the lead.
Once the best-performing aligner was chosen, we focused on identifying appropriate parameters for mapping shotgun metagenomic data. When the database provided the exact strain targets for all reads in the query, we found that the length-of-alignment constraint had the strongest effect on mapping sensitivity, with the percent identity (considering only two fairly stringent settings) having only a minimal effect. But when swapping out several MMD strains for other organisms from the same genus, the percent similarity setting becomes more important. When the genome of an exact strain present in the metagenomic community has not been sequenced (and is therefore absent from the reference database) but the genome of a close relative has, a slightly more lenient similarity cutoff can improve sensitivity at the species or genus level. The suggested parameter settings for profiling microbial community structure using metagenomic shotgun sequences are a required 80% similarity over at least 75% of the query length.
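The suggested cutoffs amount to a simple acceptance test per alignment. The sketch below is our own paraphrase of the criterion, not CLC's internal logic; the counting of matches and aligned bases is an assumption about how the inputs would be derived:

```python
def passes_cutoffs(aln_len, matches, query_len,
                   min_identity=0.80, min_coverage=0.75):
    """Accept an alignment only if it meets the suggested cutoffs:
    >= 80% identity over >= 75% of the query length.

    aln_len   -- number of aligned query bases
    matches   -- number of identical bases within the alignment
    query_len -- full read length
    """
    if aln_len == 0 or query_len == 0:
        return False
    identity = matches / aln_len      # fraction of aligned bases matching
    coverage = aln_len / query_len    # fraction of the query aligned
    return identity >= min_identity and coverage >= min_coverage
```

For a 100 bp read, an 80 bp alignment with 70 matching bases passes (87.5% identity, 80% coverage), while a 70 bp alignment fails on coverage regardless of identity.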
We further explored the issue of mapping resolution and the effects of taxonomic density (i.e. the number of closely related strains available under a species or genus) within the RGD. We weighed the cost of the top random mapping strategy, a loss of resolution at the strain level, against the benefit of being able to map a larger fraction of a sample at the genus level. While identifying a larger percentage of a sample at lower resolution may have more immediate value for some applications than correctly identifying a smaller portion at a finer-grained taxonomic level, it is important to note that the unique placement only strategy retains the capability to map with greater taxonomic clarity. We also showed that, under the top random mapping strategy, the number of conserved strains or species present within a genus increases the likelihood of correctly identifying the genus of a read while lessening the likelihood of correctly identifying the exact strain (or species). The final Read Mapping Standard Operating Procedure is described in Text S2.