Our specific information extraction task is heterogeneous for the following three reasons. Firstly, a wide array of information elements needs to be extracted from full-text RCT journal articles, including eligibility criteria, the names of all experimental and control treatments, intervention parameters (dosage, frequency, duration, etc.), sample size, start and end dates of enrolment, primary and secondary outcomes, funding information, and publication details (date, authors). Some of these elements are always present while others may be absent. Some elements can have only one value while others may have several values within the same document (e.g., various funding agencies). Some elements are short and well defined (e.g., drug route) while others may span a longer piece of text with widely varying wording (e.g., eligibility criteria).
Secondly, trials may come from a wide and unrestricted range of medical subfields, from testing pharmacological and procedural treatments to organizational and educational interventions.
Thirdly, in practice we will see a range of document standards and formatting schemas. The richness of in-document annotation varies between publishers, ranging from detailed XML to various forms of HTML, PDF, and even OCR-ed documents in ASCII.
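The element characteristics listed under the first reason can be made concrete as a small schema. The following is a minimal sketch only: the names `ElementSpec`, `Span`, and `ELEMENTS` are hypothetical, and a field is left as `None` wherever the text above does not state the property for that element.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Span(Enum):
    SHORT = "short, well defined"
    LONG = "longer text, varying wording"


@dataclass(frozen=True)
class ElementSpec:
    """Hypothetical description of one information element."""
    name: str
    always_present: Optional[bool] = None  # None: not stated in the text
    multi_valued: Optional[bool] = None    # None: not stated in the text
    span: Optional[Span] = None            # None: not stated in the text


# Only the properties given as examples in the text are filled in.
ELEMENTS = [
    ElementSpec("funding_agency", multi_valued=True),
    ElementSpec("drug_route", span=Span.SHORT),
    ElementSpec("eligibility_criteria", span=Span.LONG),
]

multi_valued_names = [e.name for e in ELEMENTS if e.multi_valued]
```

A schema of this kind would let one extraction routine treat optional, multi-valued, and long elements differently without separate code for each element.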
We chose a single information extraction approach that is able to handle this diversity across the various information elements, medical subfields, and document formats. To extract the value of a given information element from a text, a text classifier first selects the sentence or sentences most likely to contain the target piece of information. A regular expression matcher then pulls out the snippet containing the information from the high-scoring sentence(s). This combination of a statistical method with minimal (‘weak’) rules fits the diversity in the task well, since it is less likely to require extensive individual modeling for each information element, medical subfield, and document format than methods with a strong semantic and/or linguistic reliance.4
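The two-stage idea described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the tiny Naive Bayes classifier, the four training sentences, the example text, and the `SAMPLE_SIZE` pattern are all assumptions made up for illustration, with sample size chosen as the target element.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Two-class multinomial Naive Bayes with add-one smoothing (toy stand-in)."""

    def fit(self, sentences, labels):
        self.docs = Counter(labels)
        self.counts = {True: Counter(), False: Counter()}
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(tokenize(sentence))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_odds(self, sentence):
        """Log-odds that the sentence contains the target element."""
        score = math.log(self.docs[True] / self.docs[False])
        tokens = [t for t in tokenize(sentence) if t in self.vocab]
        for label, sign in ((True, 1), (False, -1)):
            total = sum(self.counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += sign * math.log((self.counts[label][tok] + 1) / total)
        return score


def extract(text, classifier, pattern, top_k=1):
    """Stage 1: rank sentences by classifier score. Stage 2: regex on the top ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    ranked = sorted(sentences, key=classifier.log_odds, reverse=True)
    for sentence in ranked[:top_k]:
        match = pattern.search(sentence)
        if match:
            return match.group(1)
    return None


# Hypothetical training data: True marks sentences that state the sample size.
train = [
    ("A total of 120 patients were randomized.", True),
    ("We enrolled 85 participants in the trial.", True),
    ("Patients received 10 mg of the drug daily.", False),
    ("The primary outcome was mortality at 30 days.", False),
]
clf = TinyNaiveBayes().fit([s for s, _ in train], [y for _, y in train])

# Hypothetical 'weak' rule for the sample-size element.
SAMPLE_SIZE = re.compile(r"\b(\d+)\s+(?:patients|participants|subjects)\b")

doc = ("Previous trials included 50 patients. Treatment lasted 12 weeks. "
       "In total, 240 patients were enrolled and randomized to two arms. "
       "The primary outcome was pain at 6 weeks.")

sample_size = extract(doc, clf, SAMPLE_SIZE)
```

In this toy example the pattern applied to the whole text would first match the distractor "50 patients", whereas restricting it to the top-ranked sentence yields "240". This shows why a simple rule can suffice once the classifier has narrowed the search.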
There have been a few recent efforts to semantically annotate medical articles, including RCT reports, and clinical documents.5–9
Most of this research has focused on extracting the main characteristics of a study: main condition, interventions, outcomes, and, in the context of RCTs, a description of the study population. A typical approach addresses some or all of three key problems: relevant sentence identification, named entity recognition, and information/relation extraction. Various studies have concentrated on the first step: mapping sentences onto a structured publication template (with sections such as ‘background’ and ‘conclusions’).
Semantically structured texts represent a richer information source for the next steps of the process. Overall, the applied techniques include state-of-the-art machine learning algorithms (Naïve Bayes, Hidden Markov Models, SVM, Conditional Random Fields),5,6,8,10–12 manually designed or cue-word-based classification/extraction rules,5,6,8 and use of medical lexicons,5–7,9 such as UMLS, MeSH, or Semantic Groups. In addition, Paek et al. address a general task of semantic parsing of sentences and identifying the semantic roles of the words in a predicate.13
This extra step can potentially boost the information extraction part of the typical approach. On the whole, the previous research demonstrates that machine learning and NLP techniques can successfully tackle the task of automatic information extraction in the medical domain.
Our work extends the previous research in two main directions. First, we present one general approach to automatically extract over 20 information elements with differing characteristics, while other work has focused on only one to four elements at a time. Second, we work with full-text articles, whereas past projects used only abstracts or other short text summaries. Full-text articles present more challenges yet allow us to extract information typically not found in abstracts or summaries (e.g., funding agencies, secondary outcomes, and whether the trial was stopped early).