Abbreviations substitute shortened term-forms (e.g. CT) for fully expanded terms (e.g. computed tomography). In the bio-medical literature, abbreviations are used for many important terms, including genes, proteins, diseases and chemical names (Federiuk, 1999). Results of our experiment (Section 3.2) show that 32.0% of UniProt entries include abbreviations in their description and gene name fields. Wren et al. reported that abbreviations are used more frequently than expanded forms.
Abbreviations present two major challenges to bio-medical text mining: term variation and ambiguity. Consider an information retrieval system that collects documents referring to polymerase chain reaction. Because polymerase chain reaction might be abbreviated as PCR, the system is expected to also retrieve documents in which PCR appears. At the same time, abbreviations are ambiguous: the same abbreviation might refer to different concepts (Ananiadou et al.; Erhardt et al.). Because PCR can also mean something other than polymerase chain reaction, the system should be able to perform abbreviation disambiguation, that is, to judge whether an occurrence of PCR actually means polymerase chain reaction (McCray and Tse, 2003; Sehgal and Srinivasan, 2006). In general, abbreviations are much more ambiguous than ordinary terms. Liu et al. reported that 81.2% of abbreviations in the Unified Medical Language System (UMLS) were ambiguous, with an average of 16.6 senses.
The accompanying figure illustrates the problems of term variation and ambiguity of abbreviations. In all, 129 distinct expanded forms were extracted for the abbreviation PCR from all MEDLINE abstracts, including polymerase chain reaction, polymerization chain reaction and amplification reactions polymerized.
Abbreviation recognition is the task of collecting expanded forms for abbreviations. It has been explored extensively using various approaches: heuristics and/or scoring rules (Adar, 2004; Park and Byrd, 2001; Pustejovsky et al.; Schwartz and Hearst, 2003), machine learning (Chang and Schütze, 2006; Nadeau and Turney, 2005; Okazaki et al.) and co-occurrence statistics (Liu and Friedman, 2003; Okazaki and Ananiadou, 2006; Zhou et al.). The 129 expanded forms in the figure were obtained using the abbreviation recognition method of Okazaki and Ananiadou (2006), which is based on co-occurrence statistics.
As depicted in the figure, expanded forms extracted by abbreviation recognition are mixtures of concepts/senses and their term variations. The abbreviation PCR has 129 expanded forms that can be consolidated into 30 senses (e.g. polymerase chain reaction, pathologic complete response). In general, a single sense has more than one surface form (i.e. variant). The sense pathologic complete response, for example, was expressed in MEDLINE abstracts by any of 14 variant forms (e.g. pathologic complete response and pathologically complete responses). Clustering expanded forms into a set of distinct senses, thereby creating a sense inventory for a given abbreviation, is a crucial step towards abbreviation disambiguation. Abbreviation disambiguation has been studied less intensively than abbreviation recognition, partly because it requires clustering numerous pairs of abbreviations and their surface expanded forms to create sense inventories.
Term variation and ambiguity of the abbreviation PCR.
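To make the idea of consolidating expanded forms into senses concrete, the following is a minimal sketch under stated assumptions. It is not the supervised method proposed in this article. It is an unsupervised stand-in that greedily groups forms by character-level string similarity, using Python's standard `difflib`. The function names `similarity` and `cluster_expanded_forms` and the threshold of 0.8 are hypothetical choices made only for illustration. The input strings are the PCR examples quoted above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_expanded_forms(forms, threshold=0.8):
    """Greedily group expanded forms into clusters (candidate senses).

    A form joins the first existing cluster that contains a member whose
    similarity to it is at least `threshold`; otherwise it starts a new
    cluster. The threshold is an illustrative assumption, not a tuned value.
    """
    clusters = []
    for form in forms:
        for cluster in clusters:
            if any(similarity(form, member) >= threshold for member in cluster):
                cluster.append(form)
                break
        else:
            # No sufficiently similar cluster was found.
            clusters.append([form])
    return clusters


# Expanded forms of PCR quoted in the text.
forms = [
    "polymerase chain reaction",
    "polymerization chain reaction",
    "amplification reactions polymerized",
    "pathologic complete response",
    "pathologically complete responses",
]

senses = cluster_expanded_forms(forms)
for sense in senses:
    print(sense)
```

With this threshold, the two polymerase/polymerization variants fall into one cluster and the two pathologic complete response variants into another, whereas amplification reactions polymerized is left as a singleton because its word order differs. Cases of this kind suggest why surface string similarity alone may be insufficient for building a sense inventory.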
In this article, we first formalize the creation of sense inventories as an independent clustering task in which similar expanded forms for an abbreviation are gathered into a cluster (sense). Because the quality of sense inventories has a significant effect on the performance of abbreviation disambiguation, we developed a new supervised method for clustering expanded forms. We constructed a dataset for the method and measured its performance. The effect of clustering on abbreviation disambiguation was also evaluated quantitatively. The main contributions of this article are threefold.
- A sense inventory is key to robust management of abbreviations because it provides target senses for disambiguation that correspond to biomedical entities and concepts. Therefore, we present a supervised approach for clustering expanded forms and evaluate the quality of the resulting sense inventory. In our experiments, the approach achieves an F1 score of 0.915 in clustering expanded forms.
- We investigate how often protein and gene names conflict with abbreviations, to estimate the importance of abbreviation disambiguation. The results show that 32.0% of UniProt records include abbreviation terms and that 16.7% of records have ambiguous abbreviations with multiple definitions.
- We conduct an abbreviation disambiguation experiment on the sense inventory whose quality was demonstrated in the first contribution. The proposed system achieves an accuracy of 0.984 on a dataset obtained from all of MEDLINE.