AMIA Jt Summits Transl Sci Proc. 2012; 2012: 38.
Published online 2012 March 19.
PMCID: PMC3392069

Feasibility of pooling annotated corpora for clinical concept extraction


The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, preparing annotated corpora at individual institutions is expensive, and pooling annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling corpora from two different sources can improve the performance and portability of the resulting machine learning taggers for medical problem detection. Specifically, we pool corpora from the 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester to evaluate taggers for recognition of medical problems. Contrary to our expectations, we find that pooling the corpora decreases the F1-score. We examine the annotation guidelines to identify factors that make the corpora incompatible, and we suggest that the clinical NLP community develop a standard annotation guideline so that annotated corpora are compatible.

Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association