Query dataset, 100% reference library and sub-libraries
Using barcode records assembled as part of the global barcoding campaign on Sphingidae [15], we selected one barcode from each species to act as a reference barcode for that taxon. Reference barcodes were available for 1088 of the 1270 described species listed in Kitching and Cadiou [16] and for an additional seven Costa Rican species described or revalidated since 2000 (1095 sphingid species in total). Barcode sequences were selected to maximize length and quality and ranged from 267 to 658 bp, with 77% being 658 bp and 93% > 600 bp. The sample comprised 200 genera, with all the currently recognised tribes and subfamilies (Figure ) represented. Three saturniid barcodes (Arsenura drucei, Lonomia electra, Periga cluacina) were also included, as this family represents the putative sister family to the Sphingidae [35], taking the full reference library to 1098 barcodes (see additional file 1: Full reference library).
Barcodes from 118 sphingid species collected in Area de Conservacion Guanacaste, northwestern Costa Rica, were used as query barcodes (see additional file 2: Query dataset). DNA was extracted following automated protocols [36] and the DNA barcode amplified and sequenced [37]. These Costa Rican sphingids comprised a well-documented [38], diverse subset of the family, with each of the tribes and subfamilies represented among 29 genera. All the queries were correctly assigned to species when using the full reference library and a "best match" assignment criterion.
For the purposes of this study the following were considered libraries of 100% completeness: for genus assignment attempts, the representative of the query's own species was the only barcode removed from the reference library; for tribe and subfamily assignment attempts, the barcodes of all species in the query's genus were removed from the reference library. Contribal genera were not removed for subfamily tests, owing to the greater uncertainty regarding the naturalness of these taxa.
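The removal rule above can be sketched in a few lines of Python. This is only an illustration, not the software used in the study: the record layout (dictionaries with "species" and "genus" keys) and the function name `reference_for_query` are assumptions introduced here.

```python
def reference_for_query(library, query, level):
    """Return the reference records left after removing the query's own taxon.

    library: list of dicts with (hypothetical) keys "species" and "genus".
    query:   dict with the same keys.
    level:   rank being assigned: "genus", "tribe" or "subfamily".
    """
    if level == "genus":
        # Genus tests: drop only the reference barcode of the query's species.
        return [r for r in library if r["species"] != query["species"]]
    if level in ("tribe", "subfamily"):
        # Tribe and subfamily tests: drop every species of the query's genus.
        return [r for r in library if r["genus"] != query["genus"]]
    raise ValueError(f"unknown level: {level!r}")
```

With this rule a query is never compared against a conspecific (genus tests) or a congener (tribe and subfamily tests), so any correct assignment has to come from more distant relatives.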
We subsequently created sub-libraries from the full reference library with different levels of species completeness. In an approach termed here "random sampling", barcodes were chosen at random to construct sub-libraries comprising 10, 20, 30, 40, 50, 60, 70, 80 and 90% of the full reference library. Sub-sampling at each species richness level was repeated 30 times. A different approach, termed here "constrained sampling", limited the random selection of species to ensure a minimum of one species per genus in the sub-library. This approach was used to construct sub-libraries comprising 20, 30, 40, 50, 60, 70, 80 and 90% of the full reference library and was repeated 30 times at each species completeness level. For the sub-libraries, as with the 100% library, we removed the reference barcode for the species of the query for genus assignment attempts, and the reference barcodes for the genus of the query for tribe and subfamily assignment attempts.
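The two sampling strategies can be illustrated with the following minimal Python sketch. It is not the C++ tool used in the study. Records are assumed to be (species, genus) tuples, the function names are hypothetical, and the target size is assumed to be the rounded percentage of the library passed in.

```python
import random
from collections import defaultdict


def random_sample(library, percent, rng):
    """'Random sampling': draw percent% of the records without replacement."""
    k = round(len(library) * percent / 100)
    return rng.sample(library, k)


def constrained_sample(library, percent, rng):
    """'Constrained sampling': as above, but keep at least one species per genus."""
    k = round(len(library) * percent / 100)
    by_genus = defaultdict(list)
    for record in library:
        by_genus[record[1]].append(record)
    if k < len(by_genus):
        raise ValueError("target size is smaller than the number of genera")
    # First pick one random species from every genus ...
    chosen = [rng.choice(records) for records in by_genus.values()]
    # ... then fill the remaining slots at random from the unused records.
    used = set(chosen)
    remaining = [record for record in library if record not in used]
    chosen += rng.sample(remaining, k - len(chosen))
    return chosen
```

Calling either function 30 times with a shared `random.Random` instance would give the 30 replicate sub-libraries for one completeness level.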
Query assignment criteria
In each assignment attempt we allowed two possible outcomes: (i) a "positive" assignment (i.e. the query was assigned to a taxon) or (ii) an "ambiguous" assignment (i.e. the query was not assigned to a taxon). A "positive" assignment was either true (TP), when it matched the morphology-based identification, or false (FP), when it disagreed with the morphology-based identification [40]. An "ambiguous" assignment was either true (TA), when the true taxon based on morphology was not represented in the reference library/sub-library (by at least two barcodes for "strict" criteria (Table )), or false (FA), when the true taxon based on morphology was represented in the reference library/sub-library (by at least two barcodes for "strict" criteria (Table )) [40].
The requirements for a "positive" assignment depend on the different criteria employed as detailed in Table . Note, the number of "potential TP" will not always be equal to 118 (i.e. the number of queries) because the taxon of the query may not be present in the sub-library. For example, the number of "potential TP" at the genus level with the 100% library and the "liberal" criterion is 113, due to 5 queries being members of monobasic genera.
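The four outcome classes and the count of "potential TP" can be expressed compactly. The sketch below is illustrative only. The function names are hypothetical, and the `min_refs` parameter encodes the stated requirement of at least two reference barcodes for "strict" criteria, with one assumed to suffice otherwise.

```python
def classify(assigned, true_taxon, library_taxa, min_refs=1):
    """Label one assignment attempt as "TP", "FP", "TA" or "FA".

    assigned:     taxon given to the query, or None for an "ambiguous" outcome.
    true_taxon:   morphology-based identification of the query.
    library_taxa: list with the taxon of every barcode in the (sub-)library.
    min_refs:     barcodes needed for the taxon to count as represented
                  (assumed 1 for "liberal", 2 for "strict" criteria).
    """
    represented = library_taxa.count(true_taxon) >= min_refs
    if assigned is None:
        return "FA" if represented else "TA"
    return "TP" if assigned == true_taxon else "FP"


def potential_tp(query_taxa, library_taxa, min_refs=1):
    """Number of queries whose true taxon is represented in the library."""
    return sum(library_taxa.count(t) >= min_refs for t in query_taxa)
```

A query from a genus with no other species in the library can therefore only end as TA or FP, which is why the number of "potential TP" can be lower than the number of queries.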
We developed software in C++ to automatically construct sub-libraries, perform assignments according to four tree-based criteria and evaluate assignment success. The main tool took as input the queries, the outgroups, the complete reference library (all in fasta format), the sampling strategy, and an integer (X) indicating the percentage of the reference library to be sampled. The software automated the analytical process as follows:
For each query:
    For each replication:
        Remove query species (or genus) from reference library.
        Randomly select X percent of reference library without replacement according to input sampling strategy.
        Combine query, outgroups, sampled reference library into a single file.
        Construct NJ tree from file using Clustal W v.2 [41].
        For each of four criteria:
            Read tree, assign query a taxon or not according to criterion.
            Evaluate accuracy of assignment (true or false).
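The loop structure above can be sketched as a small Python driver. This is a hedged illustration rather than a port of the C++ tool. Tree construction and the assignment criteria are passed in as callables (`build_tree`, `criteria`) because their details are not reproduced here. Only simple random sampling is shown, and the record keys and all function names are assumptions.

```python
import random


def remove_query_taxon(library, query, level):
    # Genus tests drop the query's species; higher ranks drop its genus.
    key = "species" if level == "genus" else "genus"
    return [r for r in library if r[key] != query[key]]


def run_experiment(queries, outgroups, library, percent, replicates,
                   level, build_tree, criteria, seed=0):
    """Return one row per (query, replicate, criterion).

    criteria maps a name to (assign, min_refs), where assign(tree, query)
    returns a taxon or None, and min_refs is the number of reference
    barcodes needed for the true taxon to count as represented.
    """
    rng = random.Random(seed)
    rows = []
    for query in queries:
        for rep in range(replicates):
            reduced = remove_query_taxon(library, query, level)
            k = round(len(reduced) * percent / 100)
            sub = rng.sample(reduced, k)  # sampling without replacement
            # In the study this step was an NJ tree built with Clustal W.
            tree = build_tree([query] + outgroups + sub)
            n_true = sum(r[level] == query[level] for r in sub)
            for name, (assign, min_refs) in criteria.items():
                assigned = assign(tree, query)
                if assigned is None:
                    is_true = n_true < min_refs      # TA if true, FA if not
                else:
                    is_true = assigned == query[level]  # TP if true, FP if not
                rows.append((query["id"], rep, name, assigned, is_true))
    return rows
```

Keeping the tree builder and the criteria as parameters means the same driver can be exercised with stubs, or wired to an external aligner and tree reader.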
The four tree-based methods were "liberal" (Figure ) [40], "strict" (Figure ) [25], "liberal & exclusive" and "strict & exclusive". We also performed "best match" for all taxon assignments and "best match" and "best close match" for assignment to genus (with the randomly sampled library), where assignment was based only on the most similar reference library barcode (Table ). For "best match" only a "positive" assignment is possible (i.e. the assignment is TP or FP) (Table ). For "best close match" the query was assigned to the taxon of the most similar library barcode based on K2P distance, provided it was within a certain threshold. If there were no barcodes in the library within the threshold, the assignment was "ambiguous". In order to select a threshold we looked at the results of the "best match" criterion and plotted the number of true positives and false positives against the K2P distance from the query to the "best match". The distance that maximized the number of TP (which in our case also corresponded to the distance with the lowest proportion of FP) was selected as the threshold.
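The distance-based criteria can be illustrated with a short Python sketch. It assumes pre-aligned sequences of equal length, skips sites where either sequence has a gap or ambiguity code, resolves ties by taking the first closest barcode, and treats "within the threshold" as less than or equal. These are all assumptions made for the example, as are the function names. The K2P distance is the standard Kimura two-parameter formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], with P and Q the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def k2p(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    n = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in BASES or y not in BASES:
            continue  # ignore gaps and ambiguity codes
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if n == 0:
        return math.inf
    p, q = transitions / n, transversions / n
    w1, w2 = 1 - 2 * p - q, 1 - 2 * q
    if w1 <= 0 or w2 <= 0:
        return math.inf  # distance undefined for saturated comparisons
    return -0.5 * math.log(w1 * math.sqrt(w2))


def best_close_match(query_seq, library, threshold=None):
    """Return (taxon or None, distance to the most similar barcode).

    library:   non-empty list of (taxon, aligned sequence) pairs.
    threshold: None gives "best match"; a number gives "best close match".
    """
    taxon, dist = min(
        ((t, k2p(query_seq, s)) for t, s in library), key=lambda pair: pair[1]
    )
    if threshold is not None and dist > threshold:
        return None, dist  # "ambiguous": nothing within the threshold
    return taxon, dist
```

With `threshold=None` the function always returns a taxon, matching the statement that "best match" can only produce TP or FP.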
Figure 2 Visualisation of two tree-based assignment criteria. The distinction between a "positive" and an "ambiguous" assignment and how the assignment is achieved - based on the location of the query on the tree. A). "Liberal" - When the query barcode (Q) was ...
Measures of accuracy were calculated as follows: 1. Precision, the fraction of barcodes placed in a taxon that belong there, TP/(TP+FP); and 2. Overall Accuracy, the proportion of barcodes placed without any error, (TP+TA)/(TP+FP+TA+FA) [33]. Note that for "best match", owing to the absence of the "ambiguous" category, overall accuracy equals precision. The results are discussed below in terms of these measures.
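Both measures follow directly from the outcome counts. The Python sketch below, with a hypothetical function name, takes a list of outcome labels and returns precision and overall accuracy as defined above. It returns NaN when a denominator is zero, which is a choice made for this example.

```python
from collections import Counter


def accuracy_measures(outcomes):
    """Compute (precision, overall accuracy) from labels "TP", "FP", "TA", "FA"."""
    counts = Counter(outcomes)
    tp, fp = counts["TP"], counts["FP"]
    ta, fa = counts["TA"], counts["FA"]
    positives = tp + fp
    total = tp + fp + ta + fa
    precision = tp / positives if positives else float("nan")
    overall = (tp + ta) / total if total else float("nan")
    return precision, overall
```

When the outcome list contains only TP and FP, as for "best match", the two denominators coincide and the function returns the same value twice.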