Proportion of compromised taxonomic annotations
The results portray a variegated picture of the taxonomic status of publicly indexed fungal sequences. Given the conservative criteria defined for a thorough BLAST match and the discriminative variability of the ITS region, one would expect any such thoroughly matching pair to be conspecific. Yet 11% of all 15491 applicable sequences find thorough matches in congeneric but heterospecific sequences, and another 7% among species of a different genus. When synonyms are accounted for, these correspond to 3231 distinct accession numbers, such that a minimum of 10% and a maximum of 21% of the applicable sequences have compromised taxonomic annotations (Supporting Information). These entries form, in turn, the best matches of 5% of all insufficiently identified sequences, such that in a worst-case scenario one in every twenty insufficiently identified sequences finds its most similar counterpart among entries whose taxonomic annotation can be questioned.
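The three-way outcome described above (conspecific, congeneric but heterospecific, different genus) can be illustrated with a minimal sketch. The function name `classify_pair` and the example name pairs are hypothetical and chosen for illustration only; the sketch assumes both names are fully identified binomials and ignores synonymy, which the analysis above did account for.

```python
from collections import Counter


def classify_pair(query_name: str, hit_name: str) -> str:
    """Classify a thoroughly matching pair by comparing 'Genus species' names.

    Assumes both names are binomials; synonyms are NOT resolved here.
    """
    q_genus, q_species = query_name.split()[:2]
    h_genus, h_species = hit_name.split()[:2]
    if q_genus != h_genus:
        return "different genus"
    if q_species != h_species:
        return "congeneric, heterospecific"
    return "conspecific"


# Hypothetical query/hit name pairs, for illustration only.
pairs = [
    ("Amanita muscaria", "Amanita muscaria"),
    ("Amanita muscaria", "Amanita pantherina"),
    ("Amanita muscaria", "Russula emetica"),
]
counts = Counter(classify_pair(q, h) for q, h in pairs)

# Share of applicable sequences implicated, from the counts quoted above:
# 3231 distinct accession numbers out of 15491 applicable sequences.
upper_bound_pct = round(100 * 3231 / 15491)
```

The last line reproduces the arithmetic behind the upper figure: 3231 of 15491 rounds to 21%.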
That 10–21% of the INSD sequences have incorrect or unsatisfactory taxonomic annotations is a matter of concern for the researcher seeking to establish the taxonomic affiliation of newly generated sequences. To obtain a clearer picture of the extent to which this process is hampered by the compromised entries, the sequence identification procedure was reproduced using UNITE, a highly filtered, closed-submission taxonomic database for reliable ITS-based identification of mycorrhizal fungi (http://unite.ut.ee). We employed the 240 species present in both the INSD and UNITE databases, using the UNITE sequences as queries against INSD (Supporting Information). As the taxonomic affiliations of the UNITE sequences are well known and well documented, the proportion of times a different taxonomic affiliation is suggested by INSD, even though a conspecific ITS sequence is present therein, represents a rational estimate of the impact of taxonomically compromised annotations in INSD. We found that one has on average a 20% (49/240) chance of obtaining a different species name at the top of the INSD BLAST hit list, each such case hinting at a compromised annotation of either the topmost match or the purportedly conspecific INSD sequence (or even both). In a further 8% (20/240) of the cases, the correct species name was present in the topmost region of the hit list but was obscured by the presence of insufficiently identified sequences, such that one would be reluctant to annotate one's sequence after the best fully identified match. Jointly, these estimates imply that the taxonomic and nomenclatural problems in public sequence databases are more far-reaching than previously assumed and that this has considerable repercussions for sequence-based species identification.
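The outcome categories of the UNITE comparison can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the function `classify_hit_list`, the `top_n` cut-off for the "topmost region", and the use of `None` to mark an insufficiently identified hit are all hypothetical choices.

```python
from typing import List, Optional


def classify_hit_list(true_species: str,
                      hits: List[Optional[str]],
                      top_n: int = 5) -> str:
    """Classify a best-first hit list for a query of known species.

    `hits` holds species names ordered best first; None marks an
    insufficiently identified hit. `top_n` is an assumed cut-off.
    """
    top = hits[:top_n]
    if not top:
        return "no hits"
    if top[0] == true_species:
        return "agrees"
    if top[0] is None:
        # Best hit carries no species name; is the correct name nearby?
        return "obscured" if true_species in top else "unresolved"
    return "different name on top"


def percent(count: int, total: int) -> int:
    """Whole-number percentage, as reported in the text."""
    return round(100 * count / total)


different_pct = percent(49, 240)  # different species name on top
obscured_pct = percent(20, 240)   # correct name present but obscured
```

With the counts given above, `percent(49, 240)` and `percent(20, 240)` reproduce the reported 20% and 8%.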
Insufficiently identified sequences, orphans, and other compounding factors
More than 27% of all fungal ITS sequences in INSD are insufficiently identified, and the majority (74%) of these find their best match in other insufficiently identified, rather than fully identified, sequences. Similarly, over 90% of the fully identified sequences find their best matches in other fully identified sequences. In other words, the two sequence classes constitute two largely separate entities, both of which convey information not present in the other.
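The within-class shares reported here (74% and over 90%) amount to a simple cross-tabulation of query class against best-match class. A minimal sketch, assuming each query has already been flagged as fully or insufficiently identified; the function name `within_class_share` and the toy input are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def within_class_share(pairs: Iterable[Tuple[bool, bool]]) -> Dict[bool, float]:
    """For each query class, the share of best matches in the same class.

    Each pair is (query_is_fully_identified, best_hit_is_fully_identified).
    """
    totals: Counter = Counter()
    same: Counter = Counter()
    for query_identified, hit_identified in pairs:
        totals[query_identified] += 1
        if query_identified == hit_identified:
            same[query_identified] += 1
    return {cls: same[cls] / totals[cls] for cls in totals}


# Toy input, for illustration only: three identified and two
# insufficiently identified queries with their best-hit classes.
toy_pairs = [(True, True), (True, True), (True, False),
             (False, False), (False, True)]
shares = within_class_share(toy_pairs)
```

In this toy input, two of three identified queries and one of two insufficiently identified queries are best matched within their own class.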
Six percent of all sequences over 350 bp lack good BLAST matches altogether (i.e., have an E-value of >0.0 as reported by BLAST). These outliers probably represent a mix of species whose closest relatives have not been sequenced and species that lack close, extant relatives. Two thirds of these sequences are fully identified; the oldest sequence with an unsatisfactory BLAST match has resided in INSD for a full 14 years. Interestingly, 85% of the fully identified sequences that fail to find a thorough match do so in the presence of other purportedly congeneric sequences, and 35% even in the presence of other purportedly conspecific sequences.
The observation that a comparatively small set of sequences explains a disproportionately large part of the results is probably best viewed as an indication of a highly patchy and non-random taxonomic distribution of the species sampled. Roughly half of both the identified and the insufficiently identified sequences do not constitute the best BLAST match of any other sequence. Similarly, 76% of all mycological studies account for 100% of all best BLAST matches, such that there are over 1000 studies in INSD whose sequences do not constitute the best match of any other sequence (a study is defined as a distinct combination of the INSD AUTHORS and TITLE fields, so as to correspond to a published or unpublished scientific manuscript). A full 55% of all sequences are best matched by another sequence from the same study.
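The study-level figures rest on the definition of a study as a distinct combination of the AUTHORS and TITLE fields. A minimal sketch of how such figures could be derived, assuming records are held as plain dictionaries keyed by accession number; the function names, dictionary keys, and toy records are hypothetical.

```python
from typing import Dict, Set, Tuple

Study = Tuple[str, str]  # (AUTHORS, TITLE), as defined in the text


def study_of(record: Dict[str, str]) -> Study:
    """Key a record by its AUTHORS and TITLE fields."""
    return (record["authors"], record["title"])


def same_study_share(records: Dict[str, Dict[str, str]],
                     best_match: Dict[str, str]) -> float:
    """Share of queries whose best match comes from the same study."""
    pairs = [(q, h) for q, h in best_match.items()
             if q in records and h in records]
    if not pairs:
        return 0.0
    same = sum(study_of(records[q]) == study_of(records[h]) for q, h in pairs)
    return same / len(pairs)


def studies_never_best_match(records: Dict[str, Dict[str, str]],
                             best_match: Dict[str, str]) -> Set[Study]:
    """Studies none of whose sequences is the best match of any query."""
    all_studies = {study_of(r) for r in records.values()}
    matched = {study_of(records[h]) for h in best_match.values()
               if h in records}
    return all_studies - matched


# Toy records and best matches (query -> best hit), for illustration only.
records = {
    "A1": {"authors": "Author A", "title": "Study one"},
    "A2": {"authors": "Author A", "title": "Study one"},
    "B1": {"authors": "Author B", "title": "Study two"},
    "C1": {"authors": "Author C", "title": "Study three"},
}
best_match = {"A1": "A2", "A2": "A1", "B1": "A1", "C1": "B1"}
share = same_study_share(records, best_match)
unmatched = studies_never_best_match(records, best_match)
```

In the toy data, half of the queries are best matched within their own study, and one study contributes no best match at all.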
Sequence annotations play an important role for the researcher trying to verify alleged names and taxonomic integrity. However, many entries in INSD prove to be both devoid of vital information and outdated. For example, 82% of the sequences lack explicit reference to a voucher specimen, 63% are not tagged with the specimen's country of origin, and 42% of all sequences are marked as unpublished even though about 40% of these have in fact been published (Supporting Information). Although 14% of all sequences contain DNA ambiguities, less than 1% of all sequences have ever been updated. That these issues pose a further obstacle to sequence identification needs little elaboration.
Primary data - a challenge for biological barcoding
The present study suggests that the taxonomic reliability of public databases is not satisfactory, and that the problem shows little tendency for self-amelioration over time. This is worrisome, particularly since DNA sequences have been put forward as the primary information source in barcoding-type approaches to species identification (where reference DNA sequences serve as arbiters, or barcodes, of conspecificity). It is apparent from the results above that the major sequence databases are not optimally suited to serve as barcoding engines as they presently stand; new techniques and strategies for data indexation and verification will have to be explored to address these shortcomings. It is, however, not in technology that the greatest challenge to barcoding lies; rather, it is in the integrity of the primary data itself. As the results presented herein suggest, the relation of taxonomy (species and species names) to barcoding could be only one: that of the primus motor. No technical feats could ever make up for compromised primary data or the lack of such data altogether.
The large body of insufficiently identified fungi in INSD constitutes a silent plea for a wide and generalized sequencing effort of well-identified and well-annotated [type] specimens residing in herbaria worldwide to form the basis for such barcoding initiatives. This will without doubt be a painstaking undertaking involving taxonomic experts in all groups of fungi. The approach taken by the UNITE database has been to cover as many genera of fungi as possible at the temporary expense of intrageneric completeness. That approach finds support in the present study: to avoid the current situation, in which insufficiently identified sequences amass and obscure similarity searches in the public sequence databases, select reference sequences covering the whole range of fungal diversity need to be made available as early as possible.
The species is in many ways the basic unit in biology, and the ever-increasing rate at which DNA sequences are released and used for scientific research prompts us to make every effort to verify that these are tagged with correct names. Sadly, more than 10% of all publicly available fungal ITS sequences have compromised taxonomic annotations, and the information needed to evaluate whether any given name is reasonable is in many cases simply not there. The inherent difficulty of species identification in the fungi, however, suggests that these estimates need not reflect the status of the total body of DNA sequences. Even so, caution and patience should be attributes of anyone seeking to identify species through DNA sequence data alone.
Barcoding-type approaches will undoubtedly be a central and most valuable element in future species identification, though contemporary major sequence repositories are not optimally suited for such operation. While we can expect technological advancements to eliminate many of the problems faced at present, the taxonomic aspect of the DNA sequences remains a substantial concern. Taxonomy lies at the heart of sequence-mediated species identification, and unlike the latter it forms a poor candidate for automation. Sadly, the declining number of taxonomists is a problem for which no shortcuts exist and, moreover, one whose immediate resolution does not seem to be looming on the horizon.