The amount of biomedical literature available to researchers is growing exponentially, with over 18 million article entries now available in MEDLINE [1] and over a million full-text articles freely available in PubMed Central (PMC) [2]. This vast information resource presents opportunities for automatically extracting structured information from these biomedical articles through the use of text mining. A wide variety of biomedical text-mining tasks are currently being pursued (reviewed in [3,4]), such as entity recognition (e.g. finding mentions of genes, proteins, diseases) and extraction of molecular relationships (e.g. protein-protein interactions). Many of these systems are constructed in a modular fashion and rely on the results of other text-mining applications. For example, in order to extract the potential interactions between two proteins, the proteins themselves first need to be correctly detected and identified.
One application that could facilitate the construction of more complex text-mining systems is accurate species name recognition and normalization software (i.e. software that can tag species names in text and map them to unique database identifiers). For example, knowing which species are discussed in a document, and where they are mentioned, could provide important information to guide the recognition, normalization and disambiguation of other entities like genes [5-7], since genes are often mentioned together with their host species. In recent text-mining challenges such as the identification of protein-protein interactions at BioCreative II [8] or bio-molecular event extraction at the BioNLP shared task [9], some groups considered species identification and normalization an essential sub-task [10]. Likewise, improved methods for identifying species names can assist pipelines that integrate biological data using species names as identifiers [11,12].
In addition to being useful for more complex text-mining and bioinformatics applications, species name recognition software would also be useful for "taxonomically intelligent information retrieval" [13]. Document search queries could be filtered on the basis of which species are mentioned in the documents [14], giving researchers more fine-grained control over literature search results. This use case provides a powerful extension to simple keyword-based PubMed searches, since all synonyms of a species would be normalized to a standard database identifier, and documents could therefore be retrieved by any synonym used as input. This can currently be done to some degree by specifying Medical Subject Heading (MeSH) terms when performing a PubMed query. However, MeSH-based queries have limitations since the set of MeSH tags comprises only a small subset of all species. Additionally, semantic enhancement (marking up entities in text and hyperlinking them to external databases [15,16]) of research articles with species names could provide readers with easier access to a wealth of information about the study organism. Accurate recognition and normalization of species mentions in biological literature would also facilitate the emerging field of biodiversity informatics, which aims to develop databases of information on the description, abundance and geographic distribution of species and higher-order taxonomic units [13,17,18].
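The synonym-based retrieval use case described above can be illustrated with a minimal Python sketch. It assumes that each document has already been tagged with NCBI taxonomy identifiers; the synonym table, the document identifiers and the function names are hypothetical and do not correspond to any existing search service.

```python
from collections import defaultdict

# Toy synonym table mapping lower-cased names to NCBI taxonomy identifiers.
# A real table would be derived from the full NCBI taxonomy.
SYNONYMS = {
    "human": 9606,
    "homo sapiens": 9606,
    "mouse": 10090,
    "mus musculus": 10090,
}


def build_index(doc_species):
    """Invert {document id: set of taxonomy ids} into {taxonomy id: set of document ids}."""
    index = defaultdict(set)
    for doc_id, taxids in doc_species.items():
        for taxid in taxids:
            index[taxid].add(doc_id)
    return index


def search_by_species(query, index, synonyms=SYNONYMS):
    """Return the sorted ids of documents mentioning the species named by any synonym."""
    taxid = synonyms.get(query.strip().lower())
    if taxid is None:
        return []
    return sorted(index.get(taxid, set()))


# Hypothetical, already-tagged documents.
docs = {"doc1": {9606}, "doc2": {9606, 10090}, "doc3": {10090}}
index = build_index(docs)
print(search_by_species("Homo sapiens", index))
print(search_by_species("human", index))
```

Because both queries resolve to the same identifier, they retrieve the same documents, which a plain keyword search would not guarantee.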
The task of identifying species names in biomedical text presents several challenges [10,13,19], including: (i) Species name ambiguity: many abbreviated species names are highly ambiguous (e.g. "C. elegans" is a valid abbreviation for 41 different species in the NCBI taxonomy). Ambiguity is also introduced because names can refer to different NCBI taxonomy species entries (e.g. "rats" can refer to either Rattus norvegicus or Rattus sp.). (ii) Homonymy with common words: some species common names are widely used in general English text (e.g. "Spot" for Leiostomus xanthurus and "Permit" for Trachinotus falcatus). These names introduce a large number of false positives if not properly filtered. (iii) Acronym ambiguity: species dictionaries contain acronyms for species names (e.g. HIV for Human immunodeficiency virus), which can refer to multiple species or other non-species entities. In fact, it has previously been shown that 81.2% of acronyms in MEDLINE have more than one expansion [20]. This presents challenges relating to identifying when an acronym refers to a species, and, if so, which species when it refers to several. (iv) Variability: while species dictionaries cover a large number of scientific names, synonyms and even some common misspellings, they cannot match human authors in variability of term usage. In some cases, authors use non-standard names when referring to species, spell names incorrectly or use incorrect case.
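To make the abbreviation ambiguity in (i) concrete, the following Python sketch groups scientific names by their abbreviated form and reports abbreviations shared by more than one species. The short name list is an illustrative sample only, and the helper names are hypothetical; the figure of 41 species quoted above refers to the full NCBI taxonomy, not to this sample.

```python
from collections import defaultdict


def abbreviate(scientific_name):
    """Abbreviate a binomial name, e.g. 'Caenorhabditis elegans' -> 'C. elegans'."""
    genus, epithet = scientific_name.split()[:2]
    return f"{genus[0]}. {epithet}"


def build_abbreviation_index(names):
    """Map each abbreviated form to the set of full names that produce it."""
    index = defaultdict(set)
    for name in names:
        index[abbreviate(name)].add(name)
    return index


# Illustrative sample of scientific names (not a complete dictionary).
NAMES = [
    "Caenorhabditis elegans",
    "Cunninghamella elegans",
    "Centruroides elegans",
    "Drosophila melanogaster",
]

index = build_abbreviation_index(NAMES)
# Keep only abbreviations that correspond to more than one species.
ambiguous = {abbr: sorted(full) for abbr, full in index.items() if len(full) > 1}
print(ambiguous)
```

Any abbreviation that appears in `ambiguous` cannot be normalized from the abbreviated form alone and needs additional context, such as an earlier unabbreviated mention in the same document.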
Despite these challenges, several attempts have been made to automate the process of species name recognition and normalization using a range of different text-mining approaches. Previous efforts in species name recognition can broadly be categorized into two groups: software aiming to identify species names in legacy documents in the field of biodiversity (e.g. the Biodiversity Heritage Library [21]), and software aiming to identify species names in current biomedical literature (e.g. MEDLINE or PubMed Central). The main aim of tools oriented towards the field of biodiversity is to recognize as many species names as possible, many of which have not been recorded in existing species dictionaries. Biodiversity-oriented methods typically use rule-based approaches that rely on the structure of binomial nomenclature for species names adopted by Carl Linnaeus [22]. By taking advantage of regularity in naming conventions, these approaches do not have to be updated or re-trained as new dictionary versions are released or species names change, and can cope with the very large number of possible species names in the biodiversity literature. However, rule-based methods are often unable to identify common names (e.g. Drosophila melanogaster follows the typical species name structure, while "fruit fly" does not).
TaxonGrab [23] is such a rule-based tool, which consists of a number of rules based on regular expressions. It finds all words that are absent from an English-language dictionary, and applies rules based on character case and term order to determine whether a term is a species name. It is implemented in PHP and available under an open-source license [24]. TaxonGrab performance is high (94% recall, 96% precision) against a single 5000-page volume on bird taxonomy, but it has not been evaluated on biomedical articles. "Find all taxon names" (FAT) [25] is a more complex mention-level method related to TaxonGrab, with several additional rules aimed at increasing recall and precision. FAT reports better accuracy than TaxonGrab (>99% recall and precision) on the same evaluation set and can be accessed through the GoldenGate document mark-up system [26,27]. It is important to note, however, that the performance of these methods has not been evaluated against normalization to database identifiers.
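The general idea behind such rule-based recognition can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not a reimplementation of TaxonGrab or FAT: the regular expression, the tiny common-word list and the function name are all hypothetical.

```python
import re

# Toy stand-in for an English-language dictionary; a real system would use a
# full word list.
COMMON_WORDS = {"the", "and", "this", "study", "were", "with", "fruit", "fly", "we"}

# Case and order rule: a capitalized word (candidate genus) followed by a
# lower-case word of at least three letters (candidate specific epithet).
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidate_binomials(text, common_words=COMMON_WORDS):
    """Return (start, end, name) for word pairs that look like binomial names."""
    hits = []
    for match in BINOMIAL.finditer(text):
        genus, epithet = match.group(1), match.group(2)
        # Discard pairs in which either word is an ordinary English word.
        if genus.lower() in common_words or epithet in common_words:
            continue
        hits.append((match.start(), match.end(), match.group(0)))
    return hits


text = "The nematode Caenorhabditis elegans and the fruit fly were compared."
hits = find_candidate_binomials(text)
print([name for _, _, name in hits])
```

The sketch also shows the limitation noted above: the scientific name is found, whereas the common name "fruit fly" is not, because it does not follow the binomial pattern.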
The uBio project provides a set of modular web services for species identification [28] and automatic categorization of articles based on the species mentioned in them [11]. FindIT, part of the uBio suite, is a rule-based system aiming to perform species name recognition, aided by a range of dictionaries. After recognition, a confidence score is given for each match and, where possible, any recognized species names are mapped to uBio Namebank records. However, like TaxonGrab, FindIT is unable to recognize common names such as "human." TaxonFinder is a related method influenced by both TaxonGrab and FindIT, which brings together elements from both systems [29,30]. MapIT performs species name normalization by mapping species names to a taxonomic tree rather than directly to a database identifier. The implementation is not described in detail and no evaluation of the system is reported. Our testing of the system reveals that MapIT will map common names such as "human" to any species with a name or synonym that contains "human", e.g. "Homo sapiens," "Human immunodeficiency virus" and "Human respiratory syncytial virus."
Using dictionary-based methods instead of rule-based methods, it is also possible to recognize common names, making the software more suitable for processing biomedical research articles, where authors often only refer to species by using their common (vernacular) names, such as "human" or "mouse." Recognized species names are typically normalized against the NCBI taxonomy [31]. For example, PathBinderH [14] is a dictionary-based web service where users can submit PubMed queries and filter the documents retrieved by species mentioned in the documents. Unfortunately, the service is currently limited to 20,000 species and is restricted to a fixed set of 65,000 documents in MEDLINE. AliBaba implements a dictionary-based web service for species name recognition in PubMed abstracts and normalization to NCBI taxonomy identifiers, which includes methods to filter homonyms for common species names [32]. WhatizitOrganisms [33] is another dictionary-based system based on the NCBI species taxonomy, also available as a web service, that recognizes and normalizes species as well as other taxonomic ranks. It is one of the modules of the more general Whatizit system [33], which provides a number of different entity recognition and normalization pipelines based on dictionaries for different entity types. Neither the implementation details nor any evaluation of the AliBaba or WhatizitOrganisms systems has been reported; however, an analysis of WhatizitOrganisms output is presented here.
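A minimal Python sketch of dictionary-based recognition and normalization is given below. It is not the implementation of any of the systems named above, whose details have not been reported; the small dictionary and the function names are hypothetical, although the NCBI taxonomy identifiers used (9606, 10090 and 7227) are the real identifiers for Homo sapiens, Mus musculus and Drosophila melanogaster.

```python
import re

# Toy dictionary mapping lower-cased names and synonyms to NCBI taxonomy
# identifiers. A real dictionary would be generated from the NCBI taxonomy.
SPECIES_DICT = {
    "homo sapiens": 9606,
    "human": 9606,
    "humans": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "mice": 10090,
    "drosophila melanogaster": 7227,
    "fruit fly": 7227,
}


def compile_matcher(dictionary):
    """Build one case-insensitive pattern, trying longer terms first."""
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def tag_species(text, dictionary=SPECIES_DICT):
    """Return (start, end, matched text, taxonomy id) for each dictionary match."""
    matcher = compile_matcher(dictionary)
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


mentions = tag_species("Human and mouse genes were compared with fruit fly genes.")
print(mentions)
```

Unlike a purely rule-based approach, this recognizes vernacular names and maps every synonym of a species to the same identifier, at the cost of being limited to names that are present in the dictionary.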
Recently, Kappeler et al. [10] have reported work on species name recognition and normalization in an attempt to determine the "focus organisms" discussed in a document. This system includes a dictionary-based term search combined with filters to remove common English words, and then ranks species based on their mention frequency in the abstract or main text. Evaluation is performed against a set of 621 full-text documents where species mentions have been automatically generated from corresponding protein-protein interaction entries in the IntAct database [34], with a reported recall of 73.8% and precision of 74.2%. Since it is aimed at recognizing species in order to guide protein name normalization, the system is limited to the 11,444 species with entries in UniProt [35], and does not implement any disambiguation methods since levels of species name ambiguity are low in this dictionary. The software is not available either for download or as a web service.
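The filter-then-rank strategy described above can be sketched in a few lines of Python. This is an illustration of the general idea only and not the published system: the stop list, the input format of (surface text, species) pairs and the function name are assumptions, and species names are used in place of database identifiers for readability.

```python
from collections import Counter

# Hypothetical stop list of species common names that are also ordinary
# English words.
COMMON_ENGLISH = {"spot", "permit"}


def rank_focus_organisms(mentions, stopwords=COMMON_ENGLISH, top_n=3):
    """Rank species by mention frequency after removing common-word matches.

    `mentions` is a list of (surface text, normalized species) pairs, such as
    the output of a dictionary-based tagger.
    """
    kept = [
        species
        for surface, species in mentions
        if surface.lower() not in stopwords
    ]
    return Counter(kept).most_common(top_n)


# Hypothetical mentions from a single document.
mentions = [
    ("human", "Homo sapiens"),
    ("mouse", "Mus musculus"),
    ("humans", "Homo sapiens"),
    ("spot", "Leiostomus xanthurus"),
    ("Homo sapiens", "Homo sapiens"),
]
ranking = rank_focus_organisms(mentions)
print(ranking)
```

A stop list of this kind removes every match of the listed words, including any genuine species mentions, which is one reason why such filtering trades recall for precision.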
Wang and colleagues [7,36,37] have developed a species name recognition system to aid the disambiguation and identification of other entities such as gene/protein names and protein-protein interactions. This system uses diagnostic species name prefixes along with names from the NCBI taxonomy, UniProt and custom hand-compiled dictionaries to tag species with either rule-based or machine learning techniques. This system requires other entities of interest (e.g. genes) to be pre-tagged as input, and only attempts to tag species mentions associated with these other entities of interest. Training and evaluation are based on two related corpora of 217 and 230 full-text documents manually annotated for proteins, genes and species. Against these evaluation sets, their rule-based approaches can achieve either very high precision (91%) with very low recall (1.6%) or intermediate values (~45%) of both performance measures [7,37]. Alternatively, their machine learning based approaches that use contextual features around entities of interest to tag species yield higher performance (~70%), but are highly biased toward species represented in the training dataset [7]. Very recently, Wang et al. [38] have described extensions to this system and have made their Species Word Detector method available as a UIMA component [39] together with a corpus where protein/gene mentions (but not species mentions) have been manually annotated and linked to NCBI taxonomy identifiers [40].
Finally, Aerts et al. [41] use a sequence-based approach to detect species referred to in biomedical text by extracting DNA sequences from articles and mapping them to genome sequences. Based on a set of 9,940 full-text articles in the field of gene regulation, these authors report that the correct species can be identified (relative to the species annotated in the ORegAnno database [42]) for 92.9% of articles that contain a DNA sequence that can be mapped to a genome. No software for this approach is available as a web service or standalone application. Additionally, this approach requires that articles report a DNA sequence of sufficient length to be mapped unambiguously to a genome, which is unlikely for most abstracts and may only be available for a limited proportion of full-text articles.
Here we aim to produce a robust command-line software system that can rapidly and accurately recognize species names in biomedical documents, map them to identifiers in the NCBI taxonomy, and make this software freely available for use in other text-mining and bioinformatics applications. We have named this software system LINNAEUS, in honour of the scientist who established the modern species naming conventions [22]. The goal of this work is not to discover all possible species names across publications in all domains of the life sciences, but to provide efficient methods to link species names in the biomedical literature to standard database identifiers. We perform recognition and normalization for all species names at the mention level, rather than at a document level, as document-level properties (such as focal organisms [10]) can naturally be inferred from the mention level. This also enables software built upon LINNAEUS to use the precise location of species mentions, such as in the disambiguation and normalization of other positional entities (such as genes or proteins) or in direct link-outs from mentions in semantically enhanced documents. Additionally, we aim to address which dataset is best suited for evaluating the accuracy of species name recognition software. To do so, we evaluate several automatically generated biomedical document sets with species names attached to them, and conclude that a manually annotated gold standard is necessary to reveal the true performance of species name identification systems such as LINNAEUS. We therefore also provide a new gold-standard corpus of full-text articles with manually annotated mentions of species names.