The amount of information in the life sciences is staggering and growing exponentially. One of the largest biomedical resources of textual scientific information, the Medline database, currently contains over 14 million abstracts, with an estimated increase in size of more than one article per minute. Scientists are faced with an overload of information, which is particularly pressing in the biological field where high-throughput experiments in genomics and proteomics generate new data at an unprecedented rate. More often than not, interpretation of these data requires the digestion and integration of information contained in many thousands of articles and other information sources, a daunting task clearly beyond the capacity of human reading and comprehension.
Recently, a number of information retrieval systems have been proposed to extract and relate pertinent biological information from large corpora of text [1]. These systems even hold promise for the discovery of new, "tacit" knowledge that is hidden in the literature. The term "conceptual biology" has already been coined to distinguish this emerging field as a branch of biological research in its own right [10]. There are, however, several issues that limit the practical utility of current text-mining tools [11]. One problem is the highly variable use of gene nomenclature in the literature [12], which produces multiple symbols and names for one and the same gene. This complicates relating information in different documents that deal with the same gene but use different symbols. One approach to this synonym problem is to make use of the information about genes and their aliases that is available in existing genetic databases.
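As a minimal sketch of the alias-lookup idea, the following Python fragment maps every known name of a gene to its official symbol, so that documents using different symbols can be related to the same gene. The table `GENE_ALIASES`, the helper names, and the case-insensitive matching are assumptions made for illustration only; a real system would load the aliases from a genetic database and would likely treat letter case more carefully.

```python
from collections import defaultdict

# Hypothetical alias table (official symbol -> aliases), for illustration only.
# A real table would be derived from a genetic database.
GENE_ALIASES = {
    "KLK3": ["PSA", "APS"],
    "TP53": ["p53", "LFS1"],
}


def build_alias_index(gene_aliases):
    """Map each lower-cased name (official symbol or alias) to official symbols."""
    index = defaultdict(set)
    for symbol, aliases in gene_aliases.items():
        for name in [symbol, *aliases]:
            index[name.lower()].add(symbol)
    return index


def normalize(term, index):
    """Return the sorted official symbols a term may refer to (empty if unknown)."""
    return sorted(index.get(term.lower(), set()))


ALIAS_INDEX = build_alias_index(GENE_ALIASES)
print(normalize("PSA", ALIAS_INDEX))
print(normalize("p53", ALIAS_INDEX))
```

Because each name maps to a set of symbols, the same index also exposes names that are shared by more than one gene: any entry with more than one symbol is ambiguous.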
A second, probably more intricate, problem is that a single gene symbol may refer to multiple genes, or may also be the abbreviation of terms with completely different, non-gene meanings. When building gene networks from the literature [1], for example, one would not want to contaminate the network on prostate specific antigen (PSA) with puromycin-sensitive aminopeptidase, psoriatic arthritis, pig serum albumin, or one of the more than 100 other meanings of PSA that can be found in the literature [14].
The extent of this ambiguity or homonym problem has been the subject of two recent studies. Tuason et al. compared gene symbols of four organisms (not including human) and showed that up to 20% of the gene symbols of an individual organism were ambiguous with the other three organisms. In another study by the same group, Chen et al. found that 85% of correctly retrieved mouse genes in a set of 45,000 abstracts were ambiguous with gene names from 20 other organisms, while ignoring gene names that were also English words. When the latter were included, 233% additional "gene" instances were retrieved, most of which were false positives. Several other studies [17] also suggested that solving this ambiguity problem is an important requirement for large-scale application of text-mining tools in the biomedical field.
General word-sense disambiguation has been studied extensively in the field of natural language processing. A wide variety of approaches has been proposed (see [20] for excellent reviews), including dictionary-based approaches and the use of supervised learning techniques to build classifiers that assign the proper sense to an ambiguous term. Typically, these methods use the words in a window around the ambiguous term, or information derived from this context window, such as part-of-speech or collocation.
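To make the notion of a context window concrete, the sketch below collects the words within a fixed distance of each occurrence of an ambiguous term. The function name `context_window`, the simple regular-expression tokenizer, and the default window size are illustrative assumptions, not taken from any of the cited methods.

```python
import re


def context_window(text, target, size=3):
    """Return, for each occurrence of target, up to `size` words on either side."""
    # Naive tokenization: lower-case the text and split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    target = target.lower()
    windows = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - size):i]
            right = tokens[i + 1:i + 1 + size]
            windows.append(left + right)
    return windows


sentence = "Serum PSA levels were elevated in prostate cancer patients"
print(context_window(sentence, "PSA", size=2))
```

The resulting word lists could serve as input features for a classifier; features such as part-of-speech tags or collocations would require additional tooling that is not shown here.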
Recently, several studies have explored the use of disambiguation techniques in the biological field. Hatzivassiloglou et al. applied machine learning methods to classify symbols into one of three categories: genes, proteins, and mRNA. No attempt was made to resolve homonyms with two or more senses within one group, or with a sense outside of these three groups, and performance results were rather moderate, although still better than human interpretation. The same problem was recently tackled by Ginter et al., who proposed a new classifier design and were able to slightly improve on the best method used by Hatzivassiloglou et al. [22]. In a series of articles [17], Liu and co-workers investigated the effect of different supervised learning techniques, feature representations, and context window sizes on disambiguation performance. They obtained excellent results on a small number of ambiguous biomedical abbreviations [17], but for training they typically needed dozens of examples for each of the possible senses. In practice, these numbers may often be difficult to obtain. Widdows et al. compared several methods for disambiguating ambiguous concepts from the Medical Subject Headings (MeSH) thesaurus [27] on a set of 70 ambiguous terms. Their most successful method achieved 74% precision and utilized existing MeSH-term co-occurrence data, which were derived from the MeSH annotations by human annotators. However, their method would not work well for gene symbols, which are poorly covered by MeSH.
Recently, Podowski et al. used Bayesian classifier models to disambiguate gene symbols found in LocusLink [28]. Interestingly, their system can distinguish between gene and non-gene meanings of a symbol, acknowledging the fact that many gene symbols are abbreviations of terms with non-gene meanings. They validated their system on two manually curated test sets of 66 gene symbols, and found that the accuracy of the system was mostly over 90% when more than 20 abstracts per gene sense were available for training.
Although several of these approaches produced very good disambiguation results, they require substantial amounts of training data, typically tens of instances per sense. For gene symbol disambiguation, these numbers may be difficult to acquire. Given the extent of the homonym problem for gene symbols, manual curation of training data would be extremely laborious. Any practical disambiguation system should be trained with data that are gathered automatically, but even then the required numbers are unlikely to be available for many of the ambiguous symbols.
Here we present a disambiguation method for gene symbols that maintains excellent performance when trained with sparse data. At the basis of our approach lies a thesaurus that is used to find biomedical concepts, including gene symbols, in text. Focusing on human genes, we first quantify the ambiguity problem for gene symbols, paying particular attention to ambiguity arising from non-gene meanings of gene symbols. We then describe our disambiguation approach and assess the performance of the disambiguation algorithm on a large test set of documents.