When a specimen belongs to a species not yet represented in DNA barcode reference libraries, there is disagreement over the effectiveness of using sequence comparisons to assign the query accurately to a higher taxon. Library completeness and the assignment criteria used have been proposed as critical factors affecting the accuracy of such assignments but have not been thoroughly investigated. We explored the accuracy of assignments to genus, tribe and subfamily in the Sphingidae, using the almost complete global DNA barcode reference library (1095 species) available for this family. Costa Rican sphingids (118 species), a well-documented, diverse subset of the family with each of the tribes and subfamilies represented, were used as queries. We simulated libraries with different levels of completeness (10-100% of the available species), and recorded assignments (positive or ambiguous) and their accuracy (true or false) under six criteria.
A liberal tree-based criterion assigned 83% of queries accurately to genus, 74% to tribe and 90% to subfamily, compared to a strict tree-based criterion, which assigned 75% of queries accurately to genus, 66% to tribe and 84% to subfamily, with a library containing 100% of available species (but excluding the species of the query). The greater number of true positives delivered by more relaxed criteria was negatively balanced by the occurrence of more false positives. This effect was most sharply observed with libraries of the lowest completeness where, for example at the genus level, 32% of assignments were false positives with the liberal criterion versus < 1% when using the strict. We observed little difference (< 8% using the liberal criterion) however, in the overall accuracy of the assignments between the lowest and highest levels of library completeness at the tribe and subfamily level.
Our results suggest that when using a strict tree-based criterion for higher taxon assignment with DNA barcodes, the likelihood of assigning a query a genus name incorrectly is very low; if a genus name is provided, it has a high likelihood of being accurate; and if no genus match is available, the query can nevertheless be assigned to a subfamily with high accuracy regardless of library completeness. DNA barcoding often correctly assigned sphingid moths to higher taxa when species matches were unavailable, suggesting that barcode reference libraries can be useful for higher taxon assignments long before they achieve complete species coverage.
Taxonomic assignments are crucial for effective communication of biological research, enabling comparability between studies. Yet, the ability to categorize biodiversity effectively and accurately is hampered by a lack of taxonomic experts . DNA barcoding has been proposed as a method capable of partially alleviating this "taxonomic impediment" by enabling accurate species identifications by non-specialists using nucleotide comparisons across a standard gene region .
In a typical scenario, a specimen of unknown species affinity is encountered, the DNA barcode of the query is sequenced and then compared with a reference library of DNA barcodes  to establish a species match for the query. However, just as morphological identification keys cannot provide accurate binomial names for queries from species not included in the key, DNA barcoding cannot assign a species identification when there are no barcode records for conspecifics in the reference library. Consequently, barcoding appraisal studies usually require a priori knowledge that the species of the query is present in the reference library (e.g. [4-6]). In real life, a consequence of widespread routine use of DNA barcoding is that failed species matches (e.g. < 98% similarity with the closest library sequence ) are frequently encountered (e.g. ). In such situations it may be tempting to attempt assignment to a higher taxonomic level (i.e. genus, tribe, subfamily). For example, Armstrong and Ball  suggested their query barcode sharing 94.6% similarity with the closest library match (Clostera albostigma) was a likely congener but not conspecific of the reference library barcode. There is considerable disagreement over the likely accuracy and appropriateness of such assignment attempts (e.g. [1,5,9-11]), which is not surprising given the different purposes and criteria employed.
Hebert et al.  expressed optimism for barcode-based assignments to higher taxa in animals. Such assignments are useful as shorthand for phylogenetic hypotheses from which biological characteristics of organisms can be predicted. For example, by assigning a specimen to the genus Aellopos one can predict that as a caterpillar it most likely fed on plants of the family Rubiaceae . The capacity to make predictions based on taxon membership is especially pertinent where fundamental impediments, e.g. an egg or an incomplete specimen, preclude morphology-based detection of characteristics. While assignment to pre-determined taxa is an operation distinct from the description of taxa, assignment accuracy is related to the ability of the character system used as the basis of assignment to track organismal phylogeny (i.e. display a phylogenetic signal ). This operation is confounded by the fact that many currently recognized supraspecific taxa are not natural . In such cases, the failure of a character system to provide accurate assignments can reflect "imperfect" taxonomy rather than the lack of phylogenetic signal.
In this study, we test the ability of DNA barcodes to enable accurate higher taxon assignments. Specifically we ask: If species coverage in the DNA barcode library is incomplete, can the barcode from a sphingid species not represented in the library be assigned to the genus it belongs to, or recognised as being from a sphingid genus missing from the library? Likewise, can the barcode from a sphingid genus not represented in the library be accurately assigned at the tribe and subfamily level? We address these questions using the moth family Sphingidae because a comprehensive global reference barcode library is available (86% of known species) containing relatively stable and well-studied taxa (Figure 1A). This enables us to assemble sub-libraries with a wide range of different species completeness and also provides a robust taxonomic framework against which to judge assignment accuracy. We evaluated assignment accuracy using concordance with the current classification of Sphingidae while recognising that morphologically derived taxonomy represents falsifiable hypotheses. Consequently, we also examined the assignments a posteriori in light of a more recent phylogenetic study of the family. Sphingidae is the target of a global barcoding campaign and shows high success for species-level barcode identifications (Figure 1B).
Since Hebert et al.  proposed that DNA barcoding could be used to assign queries to higher taxa, researchers have performed higher taxa assignments using ad hoc criteria based on the frequency of best hits, degree of sequence similarity, bootstrapping or BLAST scores (e.g. [18-22]). However, these studies usually involved fragmentary tissues of unknown taxonomic origin and consequently assignments could not be independently confirmed (i.e. using morphology). Therefore, both the accuracy and optimal approach for such assignments remain unclear. In this study, we test the extent to which assignment accuracy depends on assignment criteria applied by comparing the performance of several approaches employed in prior studies.
While some consider the use of tree-based assignment approaches controversial, we consider it justified for supraspecific taxa sharing phylogenetic as opposed to tokogenetic affinities. Using tree-based criteria, queries are successfully assigned when they cluster with barcodes from their correct taxon. Meier et al. use the following example, in which they imagine a reference library containing a chimp barcode but no human barcode, to illustrate the difficulty with such an approach: "Imagine a query clustering with a chimp barcode. Based on the query's position, one cannot decide whether it comes from Homo sapiens or another chimp, i.e., forming a cluster on a tree is logically insufficient for assigning a sequence." We address this concern by establishing objective rule sets for our tree-based assignment criteria based on topology (Table 1). We include assignment criteria that require a taxon to be "monophyletic" or "exclusive" for a query to be assigned to that taxon (Table 1). This requires that we overlook the fact that trees based on COI do not perfectly track organismal phylogeny at deeper levels and that many "traditional" taxa are not monophyletic. Ekrem et al. suggest the inability of COI analysis to reconstruct monophyletic taxa prohibits the use of barcodes for higher taxon assignments.
Previous barcoding studies employed neighbor-joining (NJ) algorithms  to produce "Taxon ID trees" since the goal of DNA barcoding is species assignment and species discovery and not phylogenetic reconstruction . In this study we used NJ as an approximation to phylogenetic analysis due to computational constraints and the large number of replications undertaken. NJ provides additional comparability as both BOLD  and GenBank  use NJ in their tree-based identification options. Our tree-based assignment criteria are equally applicable regardless of tree construction method although use of trees selected with a different optimality criterion may produce different results.
In addition to tree-based assignment we used criteria based on direct sequence comparison. We chose not to consider "character-based" approaches because nucleotide synapomorphies are unlikely to be pure (i.e. consistency index = 1) and compound diagnostics have proven unwieldy [29-31]. Of the two assignment criteria we use, both based on K2P genetic distance (Table 1), the less stringent is "best match". A query is assigned the taxon of the reference barcode that it most closely matches irrespective of how similar the query and library barcodes are. Under this criterion some false assignments are inevitable. A "false-positive" result, where a query barcode is matched to a reference barcode despite significant divergence, is a frequent consequence of using the BLAST algorithm by itself. For example, the query dataset used here contained five monobasic genera. For these barcodes the only possible result for a genus assignment using "best match" is a "false-positive". These errors can be avoided by using the modified assignment criterion, "best close match". With "best close match" the best-matching reference barcode is identified, but the query is only assigned the taxon name of that barcode if the barcode is sufficiently similar (i.e. its distance falls below a threshold). Otherwise, the query remains unassigned (i.e. "ambiguous"). In our case, the threshold value can be selected by plotting the number of "true-positives" and "false-positives" against the K2P distance from the query to the "best match". We then determine a threshold that maximizes the number of "true-positives" while minimizing the number of "false-positives". It remains unclear why one would expect that there should be a common threshold across taxonomic groups of the same rank or how this could be implemented in a real-life scenario. Many studies have shown a universal threshold of genetic distance to distinguish taxa cannot be determined. However, in the absence of better strategies, this method at least provides a rigorously derived threshold value.
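For readers unfamiliar with the K2P (Kimura 2-parameter) distance on which both sequence-comparison criteria rest, the following is a minimal Python sketch of the standard formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. It is an illustration only, not the software used in this study; the function name `k2p_distance` and the choice to skip gapped or ambiguous sites are assumptions made here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences.

    Assumption of this sketch: sites with gaps or ambiguity codes in
    either sequence are skipped rather than penalised.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return math.inf  # sequences too divergent for the correction
    return max(0.0, -0.5 * math.log(term1 * math.sqrt(term2)))
```

A single transition across ten compared sites, for example, gives P = 0.1 and Q = 0, so the distance is -0.5 ln(0.8), slightly larger than the uncorrected proportion of 0.1.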
Based on their study of species in one family of Diptera, Ekrem et al. concluded that assigning a barcode record to the correct genus or species-group was unlikely unless a "near perfect" match is present in the reference library, with the further prediction that a "comprehensive" library is also essential for accurate assignment to family or even order. Furthermore, Ball and Armstrong suggested that the failure of a lymantriine barcode to group with other members of its subfamily was attributable to low taxon sampling in their reference library (also see [5,34]). Considering that growth of the DNA barcode library will take time, a key issue concerns the effect of completeness of the reference library on the accuracy of higher taxon assignments. By using a global and comprehensive barcode reference library of considerable phylogenetic breadth for the Sphingidae (86% of known species in the family), we addressed this uncertainty by simulating different levels of species completeness of the reference library and examining the effect on assignment accuracy.
Using barcode records assembled as part of the global barcoding campaign on Sphingidae, we selected one barcode from each species to act as a reference barcode for that taxon. Reference barcodes were available for 1088 of the 1270 described species listed in Kitching and Cadiou and for an additional seven Costa Rican species described or revalidated since 2000 (= 1095 sphingid species). Barcode sequences were selected to maximize length and quality and ranged from 267-658 bp, with 77% being 658 bp and 93% > 600 bp. The sample comprised 200 genera with all the currently recognised tribes and subfamilies (Figure 1A) represented. Three saturniid barcodes (Arsenura drucei, Lonomia electra, Periga cluacina) were also included as this family represents the putative sister family to the Sphingidae, taking the full reference library to 1098 barcodes (see additional file 1: Full reference library).
Barcodes from 118 sphingid species collected in Area de Conservacion Guanacaste, northwestern Costa Rica, were used as query barcodes (see additional file 2: Query dataset). DNA was extracted following automated protocols  and the DNA barcode amplified and sequenced . These Costa Rican sphingids comprised a well-documented [38,39], diverse subset of the family, with each of the tribes and subfamilies represented among 29 genera. All the queries were correctly assigned to species when using the full reference library and a "best match" assignment criterion.
For the purposes of this study the following were considered libraries of 100% completeness: for genus assignment attempts, the representative from the same species as that of the query was the only barcode removed from the reference library; for tribe and subfamily assignment attempts, the barcodes from all the representatives of species in the genus of the query were removed from the reference library. The other genera in the tribe of the query were not removed in the case of subfamily tests, due to the increased level of uncertainty regarding naturalness of these taxa.
We subsequently created sub-libraries from the full reference library with different levels of species completeness. In an approach termed here "random sampling" barcodes were chosen at random to construct sub-libraries comprising 10, 20, 30, 40, 50, 60, 70, 80 and 90% of the full reference library. Sub-sampling at each species richness level was repeated 30 times. A different approach termed here "constrained sampling" limited the random selection of species to ensure a minimum of one species per genus in the sub-library. This approach was reiterated to construct sub-libraries comprising 20, 30, 40, 50, 60, 70, 80 and 90% of the full reference library and was repeated 30 times at each species completeness level. For the sub-libraries as with the 100% library, for genus assignment attempts, we removed the reference barcode for the species of the query from the sub-libraries. For tribe and subfamily assignment attempts we removed the reference barcodes for the genus of the query.
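The two sampling strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the C++ tool used in the study: the function names, the use of `round` to turn a percentage into a count, and the representation of the library as a species-to-genus mapping are all hypothetical, and the caller is assumed to have already removed the query's species (or genus).

```python
import random


def random_sample(species, percent, rng):
    """Pick `percent` % of the species uniformly at random, without replacement."""
    pool = sorted(species)
    k = round(len(pool) * percent / 100)
    return rng.sample(pool, k)


def constrained_sample(species_to_genus, percent, rng):
    """Pick `percent` % of the species, keeping at least one species per genus."""
    by_genus = {}
    for sp, genus in species_to_genus.items():
        by_genus.setdefault(genus, []).append(sp)
    k = round(len(species_to_genus) * percent / 100)
    if k < len(by_genus):
        raise ValueError("sample too small to cover every genus")
    # First draw one random representative for every genus ...
    chosen = {rng.choice(sorted(members)) for members in by_genus.values()}
    # ... then fill the remaining slots at random from the other species.
    remaining = sorted(set(species_to_genus) - chosen)
    chosen.update(rng.sample(remaining, k - len(chosen)))
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(42)
    toy_library = {f"sp{i}": f"genus{i % 5}" for i in range(50)}  # toy data
    print(random_sample(toy_library, 20, rng))
    print(constrained_sample(toy_library, 20, rng))
```

The guard in `constrained_sample` reflects a practical limit: with 200 genera among roughly 1100 reference barcodes, a 10% sample cannot contain one species from every genus, which is consistent with the constrained series starting at 20%.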
In each assignment attempt we allowed two possible outcomes: (i) A "positive" assignment (i.e. the query was assigned to a taxon) or (ii) An "ambiguous" assignment (i.e. the query was not assigned to a taxon). A "positive" assignment was either true (TP) - it matched with the morphology-based identification, or false (FP) - it disagreed with the morphology-based identification. An "ambiguous" assignment was either true (TA) - the true taxon based on morphology was not represented in the reference library/sub-library (by at least two barcodes for "strict" criteria (Table 1)), or false (FA) - the true taxon based on morphology was represented in the reference library/sub-library (by at least two barcodes for "strict" criteria (Table 1)).
The requirements for a "positive" assignment depend on the different criteria employed as detailed in Table 1. Note, the number of "potential TP" will not always be equal to 118 (i.e. the number of queries) because the taxon of the query may not be present in the sub-library. For example, the number of "potential TP" at the genus level with the 100% library and the "liberal" criterion is 113, due to 5 queries being members of monobasic genera.
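The four outcome categories can be expressed as a small decision function. The Python sketch below is illustrative only; the name `classify_outcome`, its arguments, and the dictionary of per-taxon barcode counts are assumptions introduced here. The `min_records` parameter encodes the rule stated above that "strict" criteria need at least two barcodes of the true taxon in the library for that taxon to count as represented.

```python
from typing import Dict, Optional


def classify_outcome(assigned: Optional[str], true_taxon: str,
                     library_counts: Dict[str, int],
                     min_records: int = 1) -> str:
    """Return 'TP', 'FP', 'TA' or 'FA' for a single assignment attempt.

    assigned       -- taxon name given to the query, or None if ambiguous
    true_taxon     -- morphology-based taxon of the query
    library_counts -- number of reference barcodes per taxon in the (sub-)library
    min_records    -- barcodes needed for a taxon to count as represented
                      (2 for the "strict" criteria, otherwise 1)
    """
    if assigned is not None:
        return "TP" if assigned == true_taxon else "FP"
    represented = library_counts.get(true_taxon, 0) >= min_records
    return "FA" if represented else "TA"


if __name__ == "__main__":
    counts = {"Manduca": 3, "Xylophanes": 1}  # toy library
    print(classify_outcome("Manduca", "Manduca", counts))           # positive, correct
    print(classify_outcome(None, "Xylophanes", counts, min_records=2))  # ambiguous
```

Under this scheme a query from a monobasic genus can never yield a TP at the genus level, because removing its species from the library removes the genus as well.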
We developed software in C++ to automatically construct sub-libraries, perform assignments according to four tree-based criteria and evaluate assignment success. The main tool took as input the queries, the outgroups, the complete reference library (all in fasta format), the sampling strategy, and an integer (X) indicating the percentage of the reference library to be sampled. The software automated the analytical process as follows:
For each query:
    For each replication:
        Remove query species (or genus) from reference library.
        Randomly select X percent of reference library without replacement according to input sampling strategy.
        Combine query, outgroups, sampled reference library into a single file.
        Construct NJ tree from file using Clustal W v.2.
        For each of four criteria:
            Read tree, assign query a taxon or not according to criterion.
            Evaluate accuracy of assignment (true or false).
The four tree-based methods were "liberal" (Figure 2A), "strict" (Figure 2B) [25,40], "liberal & exclusive" and "strict & exclusive". We also performed "best match" for all taxon assignments and "best match" and "best close match" for assignment to genus (with the randomly sampled library) where assignment was based only on the most similar reference library barcode (Table 1). For "best match" only a "positive" assignment is possible (i.e. the assignment is TP or FP) (Table 1). For "best close match" the query was assigned to the taxon of the most similar library barcode based on K2P distance, provided it was within a certain threshold. If there were no barcodes in the library within the threshold, the assignment was "ambiguous". In order to select a threshold we looked at the results of the "best match" criterion and plotted the number of "true-positives" and "false-positives" against the K2P distance from the query to the "best match". The distance that maximized the number of TP (which in our case also corresponded to the distance with the lowest proportion of FP) was selected as the threshold.
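The two distance-based criteria, and one possible way of choosing a threshold from "best match" results, can be sketched as follows. This Python example is a hedged illustration, not the procedure implemented for the study. All function names are hypothetical, distances are assumed to be precomputed K2P values, and the scoring rule in `select_threshold` (TP minus FP among best matches within the candidate threshold) is an assumption made here, since the text describes the choice only in terms of a plot.

```python
def best_match(distances):
    """Return (reference name, distance) for the closest reference barcode.

    distances -- dict mapping reference name -> K2P distance to the query
    """
    name = min(distances, key=distances.get)
    return name, distances[name]


def best_close_match(distances, ref_taxon, threshold):
    """Taxon of the closest reference, or None ("ambiguous") if it is too distant."""
    name, dist = best_match(distances)
    return ref_taxon[name] if dist <= threshold else None


def select_threshold(outcomes, candidates):
    """Pick a threshold from "best match" results.

    outcomes   -- list of (best-match distance, True if that match was a TP)
    candidates -- candidate thresholds in ascending order
    Assumed objective: maximise TP - FP among matches within the threshold;
    ties go to the smallest candidate.
    """
    def score(t):
        tp = sum(1 for d, ok in outcomes if d <= t and ok)
        fp = sum(1 for d, ok in outcomes if d <= t and not ok)
        return tp - fp
    return max(candidates, key=score)


if __name__ == "__main__":
    ref_taxon = {"ref1": "Manduca", "ref2": "Xylophanes"}  # toy data
    print(best_close_match({"ref1": 0.03, "ref2": 0.10}, ref_taxon, 0.05))
    print(best_close_match({"ref1": 0.07, "ref2": 0.10}, ref_taxon, 0.05))
```

With "best match" the function always returns a reference, so only TP or FP are possible; "best close match" adds the ambiguous outcome by returning `None` when the closest reference lies beyond the threshold.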
Measures of accuracy were calculated as follows: 1. Precision, the fraction of barcodes placed in a taxon that belong there, TP/(TP+FP); and 2. Overall Accuracy, the proportion of barcodes placed without any error, (TP+TA)/(TP+FP+TA+FA). Note, for "best match", due to the absence of the "ambiguous" category, overall accuracy equals precision. The results are discussed below in terms of these measures.
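As a worked illustration of the two measures, the short Python sketch below computes them from outcome counts. The function names are chosen here for illustration, and the counts in the example are invented to show the arithmetic; they are not results from this study.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of positive assignments that were correct: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no positive assignments")
    return tp / (tp + fp)


def overall_accuracy(tp: int, fp: int, ta: int, fa: int) -> float:
    """Fraction of all attempts handled without error: (TP + TA) / (TP + FP + TA + FA)."""
    total = tp + fp + ta + fa
    if total == 0:
        raise ValueError("no assignment attempts")
    return (tp + ta) / total


if __name__ == "__main__":
    # Hypothetical counts for 118 queries (not taken from the paper).
    tp, fp, ta, fa = 90, 10, 8, 10
    print(f"precision        = {precision(tp, fp):.3f}")
    print(f"overall accuracy = {overall_accuracy(tp, fp, ta, fa):.3f}")
```

When a criterion never returns an ambiguous result (TA = FA = 0), the two expressions reduce to the same ratio, which is why overall accuracy equals precision for "best match".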
The results of all the experiments are provided in additional file 3: Results of all experiments.
The overall accuracy of assignment to genus was 0.83 using the "liberal" and 0.75 using the "strict" criterion. The precision of assignment to genus was 0.86 using the "liberal" and 0.98 using the "strict" criterion. A number of query species were consistently assigned to the wrong genus across all analyses, resulting in FP. Even though these FP were technically incorrect assignments, Table 2 details how in many cases the assignments made some sense considering the taxonomic structure and phylogeny of the family. These included four species in monobasic genera: Pachylioides resumens, Phryxus caicus, Pseudosphinx tetrio, and Neococytius cluentius, for which the only possible outcomes were FP or TA, since a query belonging to a monobasic genus cannot be a TP. A second group of FP were query barcodes (Madoryx plutonius, Manduca albiplaga, Pachylia darceta and Pachylia ficus) assigned to monobasic genera in the reference library (see Table 2). Because the queries belonged to species not present in the reference library, and a monobasic genus has only a single species, which was present in the library, this group could more correctly be interpreted as TA or FA assignments. Two FP, Xylophanes godmani and Xylophanes turbata, were queries from an exceptionally species-rich genus (104 species globally). The overall accuracy of assignment to genus in this study was similar to that reported by Elias et al., who found 69-81% of their Ithomiinae queries were assigned to the correct genus using tree-based criteria.
Overall accuracy of assignment to tribe was 0.75 using the "liberal" and 0.66 using the "strict" criterion (Figure 3). Precision of assignment to tribe was 0.81 using the "liberal" and 0.95 using the "strict" criterion. Many of the query barcodes placed in the wrong tribe belonged to genera that are positioned as paraphyletic or polyphyletic with respect to their current tribal designations, according to a recent phylogenetic study (e.g. Agrius, Aleuron, Cautethia, Cocytius, Enyo, Eumorpha, Pachygonidia), or were on long branches in a basal position (Pachylia) within their tribe. An instructive example is Eumorpha, a genus currently placed in the tribe Philampelini. Query barcodes belonging to Eumorpha were assigned to tribe Macroglossini. This is consistent with the placement of Eumorpha (+Enyo) as sister to a clade comprised of Macroglossini on the phylogeny of Kawahara et al.
Overall accuracy of assignment to subfamily was 0.90 using the "liberal" and 0.84 using the "strict" criterion with "best match" having the highest overall accuracy for this taxonomic level (0.92) (Figure 3). Precision of assignment to subfamily was 0.83 using the "liberal" and 0.96 using the "strict" criterion.
Considering Figures 4 and 5, it is clear that different criteria produced contrasting results. For example, "liberal" was frequently the highest scoring criterion in terms of overall accuracy (Figure 3), but performed less well in terms of precision with an average of 18% of assignments to genus being FP (Figure 4). "Strict" had lower overall accuracy across all sub-libraries, but higher precision with an average of only 2% of assignments to genus being FP (Figure 4).
The criteria requiring exclusivity resulted in an overwhelming number of FA assignments (Figure 5) and produced very low overall accuracy and precision despite their lower incidence of FP (Figure 5). Note that the success rates for criteria without the exclusivity requirement are higher, because they did not require "monophyly"; i.e. queries can be assigned on trees with congeneric (or contribal and subfamilial) barcodes found in two different "clades" as long as the rules of the criterion are met.
Success under "best match" was similar to "strict" at the tribe level but very similar to "liberal" at the subfamily level (Figure 4), where it actually had the highest overall accuracy but was still behind the "strict" criterion in terms of precision (Figure 3).
In order to be able to use "best close match", we first determined the optimal threshold to be 0.05 K2P distance (Figure 6) and this value was used to decide whether a query had a close enough barcode match to be given a "positive" assignment. "Best close match" successfully reduced the high number of FP seen with "best match" (Figure 6), but, like the "strict" criterion, resulted in a large number of FA. Success under "best close match" was very similar to "strict" but it produced a much lower number of TP with the larger sub-libraries (Figure 6).
The "liberal" and "strict" criteria were generally the highest-scoring criteria in terms of overall accuracy and precision across all taxonomic levels and all sub-libraries (Figures 3 and 4). An exception was the high precision observed for the "strict & exclusive" criterion that was matched by extremely low overall accuracy (Figure 5). Precision was consistently high for the "strict" criterion (> 0.90) for all sub-libraries and for assignments to all taxonomic levels. Precision was lower using "liberal" but conversely overall accuracy was higher (Figure 7).
The effect of library completeness was visible in assignment to genus using "liberal" with overall accuracy increasing from 0.59 with the 10% sub-library to 0.83 with the 100% library (Figure 7). Using "strict" however, overall accuracy although generally lower was relatively stable regardless of library completeness, increasing only 0.06 between the 10% and 100% libraries. The opposite pattern was seen in overall accuracy of assignments to tribe and subfamily with "liberal" being more stable across sub-libraries, and "strict" being more variable (Figure 7).
Results for assignment to genus using random versus constrained sampling of sub-libraries were very close in terms of overall accuracy, with constrained having slightly lower overall accuracy across all completeness levels (Figure 4a). Conversely, constrained sub-libraries resulted in assignments with slightly higher precision across all completeness levels.
We present the results from an in-depth study of higher taxon assignment using DNA barcoding. The reader of DNA barcode literature may be surprised by the assignment accuracy reported here, values that may contrast with the expectation of authors like Ekrem et al. . This may be explained largely by differences in study design. Our experimental design measures the relative precision and overall accuracy of different assignment criteria across reference libraries of different levels of completeness and structure. No single assignment criterion was superior across the range of taxonomic scenarios examined and there was often a conflict between overall accuracy and precision. Our results discussed below, together with implications for criterion selection, indicate a clear requirement for species to be in taxa that are well-differentiated clades to maximize the number of correct assignments. Whether these success rates are high enough to be useful remains a judgment call for the end-user.
In this study we have presented simplified examples where the species of the query barcode is missing from reference libraries (and the entire genus for assignments to tribe and subfamily) to ensure we were solely addressing the question of assigning the query to the next least inclusive taxon. By excluding the possibility of a species (or genus) match, which would effectively provide the higher taxonomy of the query, this study was a rigorous test of the effect of assignment criteria and species completeness of the reference library on higher taxon assignments. The arbiter of success was necessarily a classification that is already considered "out of date" [17,42]. As such, a pertinent issue to DNA barcoding success is taxonomy/species tree incongruence as well as species tree/gene tree incongruence. This is especially the case for the large species-rich genera e.g. Xylophanes, Manduca, where generic boundaries may need to be revisited (Table 2). The effect of using an "old" classification was perhaps particularly apparent when considering the results of the tribe experiments, although adoption of a new classification did not appear to improve assignment accuracy (data not presented). FP at the tribe and subfamily level often reflected new knowledge of Sphingidae phylogeny, and therefore reflect real phylogenetic signal among barcode sequences.
The Sphingidae has received a relatively extensive treatment by taxonomists. This raises a concern for other less-well-studied groups; how accurately can barcoding be expected to assign queries to taxa that are most likely not natural [24,44,45]? While this study provided few examples where the barcode assignment was clearly at odds with current taxonomic understanding, it would be much more difficult to assess in other moth families. Despite most systematists adhering to cladistics since Hennig , many "good" Lepidoptera taxa, including those within Sphingidae , lack reliable (private) morphological synapomorphies which would enable rapid assignment of species to higher taxa. It is difficult to assess how our results would compare with morphological assignment accuracy by a non-specialist. However, it is clear that even a specialist taxonomist would have difficulty in assigning an egg to a genus, while DNA barcoding can be used with any tissue sample from any life stage. There are groups of species e.g. from Microlepidoptera, Pyralidae, that are far more difficult to assign morphologically to taxa and lack of morphological synapomorphies may reflect the instability of the current classification. The results presented here echo the relative stability of subfamilial and generic taxa compared to the tribes [16,17] suggesting the real challenge in taxonomy is to build new, robust phylogenies, and ensure that these are reflected in the classification.
Another challenge highlighted by our study is the lack of equivalency of taxonomic ranks  in terms of genetic distance . This is clear through our inability to increase success of sequence comparison criteria through the use of "best close match". An optimal threshold will always be taxon specific and even within a relatively small group a universal threshold is unlikely to be effective. Avise and John  proposed a temporal scheme to standardize taxonomic ranks. However, an obvious objection is that the scheme would require significant revisions of all groups. Instead of speeding up taxonomic work, to be effective, large-scale employment of a distance-based assignment criterion would have to start with the redefinition of most taxa. Like Kelly et al. , we furthermore found that tree-based criteria outperform the direct sequence comparison methods, thereby rendering threshold values for "taxon-level" divergences unnecessary.
When the query's taxon was not in the reference library, only "strict" was relatively immune to FP, and consequently was the best scoring criterion in terms of precision. Given this ability to limit FP, "strict" was also the criterion for which overall accuracy was least affected by reference library completeness.
It seems intuitive that "best match" would perform better in libraries where taxon matches are always available. However, in real-life, it is impossible to know whether a query is from a "new" taxon or from a taxon that is already represented in the reference library. Considering the problems with direct sequence comparison methods, relying solely on distances, we do not believe they are promising tools although arguably these are the most practical. Interestingly, "best match" had highest overall accuracy, beating tree-based criteria at the highest taxonomic level investigated in this study - subfamily.
We tested criteria that allow for "ambiguous" assignments and found that library completeness had a weak effect, with high overall accuracy and precision seen even at low completeness. Our comparison of constrained and randomly selected reference sub-libraries showed that accuracy is not compromised by the absence of taxa in the reference library. We found that, whether the library was incomplete or all species were present in the library, the criterion selected to provide an assignment was still a factor determining success.
Techniques for assigning sequences to a higher taxon are still in their infancy, but new methods are appearing more frequently (e.g. CAOS). Based on our results, we suggest a conservative approach that initially uses a "strict" tree-based criterion in large-scale assignment systems. Although a large number of queries would remain ambiguous due to the more conservative nature of the criterion, we nevertheless consider this result with its higher precision to be preferable to an assignment criterion like "best match", which yields marginally more TP but also a large number of FP. Criteria requiring exclusivity were the most conservative, but given their very low overall accuracy and precision they would probably only be justifiable for forensic purposes.
Tree-based criteria could be easily incorporated into the current library set-up (BOLD), by providing higher taxonomy alongside the species name attached to barcodes on a Taxon-ID tree. The current approach offered by BOLD uses a similarity search to collect the top 100 hits in the reference library and then constructs a NJ tree to allow the attachment of a query barcode to this "100 best" backbone tree. From this tree an attempt can be made to assign the query to genus using "strict". If no "positive" genus assignment can be made, an attempt could then be made at assignment to tribe, and so on up the hierarchy. Alternatively, an assignment can be attempted at a taxonomic level determined by the taxonomic sufficiency requirements of the investigators. For example, water monitoring using invertebrate diversity indices may only require that samples be identified to family to be useful. Some could argue that heuristic taxonomic groupings (OTUs) based on barcodes are better than no taxonomic hypothesis at all, and certainly superior to heuristic morphology-based equivalents, being rooted in a standardised, objective, consistently coded character set.
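The stepwise fall-back from genus to tribe to subfamily can be sketched as below. The sketch assumes that the reference barcodes forming the clade in which the query nests have already been read off the tree; how that clade is delimited is what the "strict" criterion defines and is not reproduced here. The function name, the dictionary layout and the taxon labels are hypothetical.

```python
RANKS = ("genus", "tribe", "subfamily")  # lowest rank first


def assign_lowest_rank(clade_lineages, ranks=RANKS):
    """Return (rank, name) for the lowest rank shared by every reference barcode
    in the query's clade, or (None, None) if the assignment stays ambiguous.

    clade_lineages is a list of dicts mapping rank -> taxon name.
    """
    if not clade_lineages:
        return None, None
    for rank in ranks:
        names = {lineage[rank] for lineage in clade_lineages}
        if len(names) == 1:
            return rank, names.pop()
    return None, None


# Hypothetical clade: two genera from one tribe, so genus fails and tribe succeeds.
clade = [
    {"genus": "GenusA", "tribe": "TribeX", "subfamily": "SubfamilyP"},
    {"genus": "GenusB", "tribe": "TribeX", "subfamily": "SubfamilyP"},
]
print(assign_lowest_rank(clade))
```

Passing a shorter `ranks` tuple, for example `("subfamily",)`, restricts the attempt to the level required by a given investigation.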
Our empirical test of higher taxonomic assignments reveals that a tree-based assignment system would successfully assign most queries to a higher taxon at some level. A conservative approach using the "strict" tree-based method should be used initially in large-scale identification systems. The failures we observed do not make us question the usefulness of barcode libraries for generic and suprageneric assignments. They indicate imperfect taxonomy and suggest that the barcodes themselves could aid our ability to revise non-natural taxa. An advantage of any DNA-based system is that the data are readily available for further analysis with alternative models or approaches. Discounting DNA barcoding as a tool for providing taxonomic assignments because the library is not yet complete is pusillanimous.
JJW, RR, JS, PDNH conceived of the project and designed experiments. JJW, DHJ, WH, MH assembled the query dataset. RR, IJK, JH assembled the sphingid reference library and made it available for this study. JJW, RR, JS performed the experiments and analyzed the data. JJW, RR, JS drafted the manuscript. DHJ, WH, MH, IJK, PDNH assisted in writing the manuscript. All authors read and approved the final manuscript.
Full reference library. Excel worksheet containing a list of COI sequences used in the full reference library in this study, including the species name attached to the barcode and BOLD (http://www.barcodinglife.org) Sample IDs and Process IDs.
Query dataset. Excel worksheet containing a list of COI sequences used as the query dataset in this study, including the species name attached to the barcode and BOLD (http://www.barcodinglife.org) Sample IDs and Process IDs.
Results of all experiments. Excel worksheet containing the summarized results of all the experiments performed.
This study was funded by grants from NSERC and Genome Canada through the Ontario Genomics Institute to PDNH. Acquisition of the 118 species of query sphingids was supported by USA NSF grants DEB 0072730 and 0515699 to DHJ and the staff of ACG. The sphingid reference library includes contributions by Sam Adams, Philippe Annoyer, Patrick Basquin, Robert Beck, Alex Borisenko, Ron Brechlin, Philippe Darge, Ulf Eitschberger, Yves Estradel, Axel Hausmann, John Janovec, Jean-François Landry, Tomas Melichar, Scott J. Miller, Joël Minet, Kim Mitter, Jacques Pierre, Chris Schmidt, James Tuttle, Thierry Vaglia, Evgeny Zakharov; many taxa were only accessible within the collections of the following institutions: the Australian National Insect Collection (Canberra), The Smithsonian Institution (Washington), the Muséum National d'Histoire Naturelle (Paris), the Canadian National Collections of Insects and Arachnids (Ottawa), the Bavarian State Collection of Zoology (Munich). JJW would like to thank the staff and visitors to BIO and the Hanner lab for helpful discussions and encouragement over the course of this study. Massimiliano Virgilio and an anonymous reviewer significantly improved the manuscript through their comments.