Biological research is name-centered: proteins are referred to in free text by their names or symbols rather than by the unambiguous identifiers provided by annotation databases (such as SwissProt accession numbers [16]). Identifying mentions of proteins and genes unambiguously within free text is a fundamental step for the later extraction of functional attributes of these entities. Unfortunately, this is a difficult process, partly because of the complex nature and usage of gene and protein names. Genes and proteins may be referred to in free text in a range of different ways: as full names (for example, porin), as symbols (the Saccharomyces cerevisiae gene POR1), and also through typographical variants (POR-1). Many genes also have several synonyms (such as OMP2), or the gene name may be ambiguous [17] and refer to words that have a different meaning depending on the context (for example, big brain, the full name for the Drosophila melanogaster gene bib, could also be an anatomical description). Furthermore, it has been suggested that errors in gene names might be introduced automatically by certain applications in bioinformatics [18].
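To make the problem of synonyms and typographical variants concrete, the following is a minimal sketch of dictionary-based matching in which hyphens, whitespace and case are collapsed before lookup. The lexicon, the function names and the normalization rule are illustrative assumptions rather than the method of any tool discussed here.

```python
import re

def normalize_symbol(name: str) -> str:
    """Collapse case, hyphens, underscores and whitespace so variants compare equal."""
    return re.sub(r"[\s\-_]+", "", name).lower()

# Hypothetical toy lexicon: canonical symbol -> known surface forms.
LEXICON = {
    "POR1": ["POR1", "OMP2", "porin"],
}

# Index every normalized surface form back to its canonical symbol.
INDEX = {
    normalize_symbol(form): symbol
    for symbol, forms in LEXICON.items()
    for form in forms
}

def find_mentions(text: str) -> list[tuple[str, str]]:
    """Return (surface form, canonical symbol) pairs for tokens found in the lexicon."""
    mentions = []
    # Tokens are runs of letters and digits, optionally containing hyphens.
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text):
        symbol = INDEX.get(normalize_symbol(token))
        if symbol is not None:
            mentions.append((token, symbol))
    return mentions
```

A lookup of this kind recovers simple variants such as POR-1, but it cannot tell whether an ambiguous name such as big brain is being used as a gene name or as an anatomical description; that requires context.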
In the NLP field, the identification of entities in free text is known as named-entity recognition (NER). To identify biological entities such as genes, proteins and drugs automatically and unambiguously within free text, over 50 information-extraction and text-mining tools have recently been implemented, and two community-wide evaluations have been carried out [19]. The top left of Figure 1 shows nine existing NER applications for biology that are provided via an online server or are directly downloadable. Note that the average recovery of biological entities from free text by 15 NER tools was 80%, and the results had an accuracy of 80% [21]; these figures are significantly lower than those for entities found in documents from fields such as economics, which demonstrates the complex nature of protein names.
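The two figures quoted above can be illustrated with a short calculation. The sketch below assumes that 'recovery' corresponds to recall (the fraction of true entities that were found) and 'accuracy' to precision (the fraction of reported entities that were correct); the source does not define the terms explicitly, and the entity spans are invented for illustration.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Compute precision and recall for exact-match entity spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented (start, end) character spans of entities in one document.
gold = {(0, 4), (10, 15), (20, 24), (30, 33), (40, 46)}
predicted = {(0, 4), (10, 15), (20, 24), (30, 33), (50, 55)}

# Four of five predictions are correct, and four of five gold entities are found.
precision, recall = precision_recall(predicted, gold)
```

Here both values are 0.8, which mirrors the reported averages: one entity in five is missed, and one reported entity in five is wrong.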
Figure 1. An overview of biological natural language processing (BioNLP) and text-mining applications for biology. The major topics are represented by the inner circle of seven approaches, and the corresponding applications are given in the outer layers of boxes.
Proteins and genes are characterized within biological databases through unique identifiers; each identifier is associated with its corresponding protein or nucleotide sequence and functional descriptions. The automatic recognition of entities such as genes and proteins in free text is insufficient if it is not linked to the corresponding database identifiers. Distinguishing between the use of protein names and protein-family names constitutes a serious obstacle in the task of highlighting protein entities in free text, as text passages sometimes refer to the general properties of protein families and at other times to the properties of individual proteins.
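The distinction between recognizing a name and linking it to a database entry can be sketched as follows. The names, the identifiers and the "protein" versus "family" labels are all placeholders invented for this example; they are not real database records.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Entry:
    identifier: str  # placeholder, not a real accession number
    kind: str        # "protein" or "family"

# Hypothetical name table; a real system would load this from a database.
NAME_TO_ENTRY = {
    "examplin-1": Entry("ACC_0001", "protein"),
    "examplin": Entry("FAM_0001", "family"),
}

def link_mention(mention: str) -> Optional[Entry]:
    """Look up a recognized mention; None means it could not be linked."""
    return NAME_TO_ENTRY.get(mention.strip().lower())

def protein_identifier(mention: str) -> Optional[str]:
    """Return an identifier only when the mention names an individual protein."""
    entry = link_mention(mention)
    if entry is None or entry.kind != "protein":
        return None
    return entry.identifier
```

In this sketch a family-level mention is recognized but deliberately not mapped to a single protein identifier, which is the case the paragraph above describes as a serious obstacle.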
Different research communities have addressed the issue of named-entity recognition in biology in different ways. The NLP community has typically tried to identify names by analyzing the syntactic structure of sentences, making use of information about parts of speech in a sentence and the syntactic roles of words, whereas bioinformaticians have instead explored the identification of variants of the names contained in databases, even adapting standard bioinformatics algorithms such as BLAST to the problem of protein-name identification [22]. Neither of these two strategies seems to be efficient by itself, and many intermediate combinations are therefore appearing, including the following examples. GAPSCORE [23] is an easy-to-use online tool for detecting protein and gene names within free text (a 'protein tagger'). The text to be analyzed can be pasted into an online form and submitted to the server, which returns a list of the words observed in the document and a statistical quality score that indicates how probable it is that each word represents a gene or protein name. Another online protein tagger is NLProt, developed at Columbia University [25]. NLProt is based on a machine learning technique called support vector machines (SVMs) and allows protein identification either in a submitted text or in the text corresponding to a list of submitted PubMed article identifiers. Additional protein taggers include Yapex [27], also available online, and three downloadable tools, AbGene [29], ABNER [31] and KEX [33]. Abbreviations or acronyms are often used as a shorter form to refer to gene names in articles; the Abbreviation Server [35] developed at Stanford University allows a similar search strategy to that used by GAPSCORE to be applied to biomedical abbreviations such as gene symbols. Finally, the AliasServer [37] helps in linking the various aliases of a given gene through different biological databases for various species.
One of the main challenges when linking protein names to database entries is distinguishing between proteins that have the same names but belong to different genomes, a process called inter-species gene disambiguation. This is especially cumbersome in the case of mouse and human genes; the same gene symbol is often used in both species and both names are often mentioned in the same textual passage. The complex nature of protein- and gene-name identification is reinforced further by the dynamic nature of gene-name usage and name creation, with official gene names being changed and new synonyms being created [39]; it is clear that static approaches and dictionaries will not be sufficient for solving the problem.
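One simple heuristic for inter-species disambiguation is to look for species cue words in the surrounding passage. The sketch below is an illustrative assumption and not a method described by any of the tools above; the cue lists, the gene symbol and the identifiers are invented placeholders.

```python
import re
from collections import Counter
from typing import Optional

# Assumed cue words per species; a real system would need far richer evidence.
SPECIES_CUES = {
    "human": {"human", "humans", "patient", "patients"},
    "mouse": {"mouse", "mice", "murine"},
}

# Hypothetical symbol shared by both species, with placeholder identifiers.
ENTRIES = {"abc1": {"human": "HUMAN_ID_0001", "mouse": "MOUSE_ID_0001"}}

def guess_species(passage: str) -> Optional[str]:
    """Pick the species with the most cue words; None if absent or tied."""
    words = re.findall(r"[a-z]+", passage.lower())
    counts = Counter()
    for species, cues in SPECIES_CUES.items():
        counts[species] = sum(1 for word in words if word in cues)
    (best, best_n), (_, second_n) = counts.most_common(2)
    if best_n == 0 or best_n == second_n:
        return None
    return best

def disambiguate(symbol: str, passage: str) -> Optional[str]:
    """Map a gene symbol to a species-specific identifier, if the passage allows."""
    species = guess_species(passage)
    if species is None:
        return None
    return ENTRIES.get(symbol.lower(), {}).get(species)
```

When a passage mentions both species equally, as the text notes is common for mouse and human genes, this heuristic returns no answer at all, which illustrates why static cues and dictionaries are not sufficient on their own.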