Although large parts of the medical record exist as structured data, a significant proportion exists as unstructured free text. This is not just the case for legacy records. Much of pathology and imaging reporting is recorded as free text, and a major component of any UK medical record consists of letters written from the secondary care physician to the primary care physician (GP). These documents contain information of value for day-to-day patient care and of potential use in research. For example, narratives record why drugs were given, why they were stopped, the results of physical examination, and problems that were considered important when discussing patient care but not when coding the record for audit.
The CLEF project uses information extraction (IE) technology [2] to make information available for integration with the structured record, and thus to make it available for clinical care and research [3]. IE aims to extract automatically from documents the main events and entities, and the relationships between them, and to represent this information in structured form. IE has immense potential in the medical domain. One of the earliest IE applications was the analysis of discharge summaries in the Linguistic String Project [4], and it has since seen application in various clinical settings.
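As an illustration of what "structured form" can mean here, the following minimal Python sketch represents entities as typed character spans and a relationship as a typed link between two entities. The example sentence, the type names (`Drug`, `Condition`, `stopped_because_of`) and the helper functions are all invented for illustration; they are assumptions, not the CLEF annotation schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A typed span of text, given as character offsets into the document."""
    start: int
    end: int
    label: str  # hypothetical entity type, e.g. "Drug"


@dataclass(frozen=True)
class Relation:
    """A typed link between two entities."""
    label: str  # hypothetical relationship type
    source: Entity
    target: Entity


def make_entity(text: str, phrase: str, label: str) -> Entity:
    """Build an entity from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), label)


def covered_text(text: str, entity: Entity) -> str:
    """Return the stretch of text that an entity annotates."""
    return text[entity.start:entity.end]


# An invented sentence of the kind found in clinical narratives.
text = "Tamoxifen was stopped because of nausea."

drug = make_entity(text, "Tamoxifen", "Drug")
symptom = make_entity(text, "nausea", "Condition")
stopped = Relation("stopped_because_of", drug, symptom)
```

Storing offsets rather than copied strings keeps each annotation tied to a specific position in the source document, which matters when the same word occurs more than once.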
Although much IE research has focused on fully automated methods of developing systems (pioneering work is reported in [5]), most practical IE still needs data that has been manually annotated with events, entities and relationships. This data serves three purposes. First, an analysis of human annotated data focuses and clarifies requirements. Second, it provides a gold standard against which to assess results. Third, it provides data for system development: extraction rules may be created either automatically or by hand, and statistical models of the text may be built by machine learning algorithms.
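The second purpose, assessment against a gold standard, can be sketched as follows. This is a minimal example assuming exact-match scoring, in which a system annotation counts as correct only if its span and type are identical to a gold annotation. The function name `score`, the tuple layout and the sample annotations are assumptions made for illustration; they are not the evaluation procedure or data of this paper.

```python
def score(gold, predicted):
    """Exact-match precision, recall and F1 of predicted against gold.

    Each annotation is a hashable tuple such as (start, end, label).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented annotations: the system finds the drug but misplaces the condition.
gold = {(0, 9, "Drug"), (33, 39, "Condition")}
predicted = {(0, 9, "Drug"), (14, 21, "Condition")}

precision, recall, f1 = score(gold, predicted)
```

Exact matching is the strictest choice; a more lenient scorer could also give credit for overlapping spans.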
Biomedical corpora are increasingly common. For example, the GENIA corpus of abstracts has been semantically annotated with multiple entities. It does not, however, include relationships between them [6]. Other authors have reported semantic annotation exercises specific to clinical documents, but these are generally restricted to a single type of entity [7]. This paper reports on the construction of a gold standard corpus for the CLEF project, in which clinical documents are annotated with both multiple entities and their relationships. To the best of our knowledge, no one has explored the problem of producing a corpus annotated for clinical IE to the depth and to the extent reported here. Our annotation exercise uses a large corpus, covers multiple text genres, and involves over 20 annotators. We examine two issues pertinent to the annotation of clinical documents: the use of domain knowledge, and the applicability of annotation to different sub-genres of text. Results are encouraging, and suggest that a rich corpus to support IE in the medical domain can be created.