Design of ARC
Current use of clinical IR technologies is heavily dependent on the system developer. With ARC, we are attempting to either automate or shift to the end user as many of the processes of clinical IR as possible. shows the current processes of clinical IR versus the proposed shift in responsibilities we are attempting to achieve with ARC.
Current processes of clinical information retrieval (IR) versus those proposed in the design of the automated retrieval console (ARC).
The ARC design is based on the hypothesis that supervised machine learning with robust enough feature sets is capable of delivering acceptable performance across a number of clinical IR applications. This approach allows us to reduce end-user input to a reference set that can be used as both the training and test sets for any one application. Proceeding with this hypothesis, the challenge becomes how best to enable the end user to perform the remaining processes of clinical IR use, including annotation, training versus test set partitioning, performance calculation, storage of models and results, and deployment on the larger corpus.
Toward this end, ARC features several interfaces to enable greater end-user control over the processes of clinical IR. The ARC menu from which each of the interfaces is launched is shown in .
Automated retrieval console (ARC) menu, showing the various ARC interfaces.
The ‘Create New Project’ interface is used to establish a workspace and import samples. This workspace is used to save the state of any project, including models and performance results, across the various interfaces. Annotation can be a bottleneck in applying IR technologies. The ‘Judge’ interface shown in was therefore designed to be simple and fast, featuring one-click and shortcut-key labeling (‘Y’, ‘N’) and document advancement (left arrow, right arrow). The reference set created in the Judge interface is saved to the workspace and used for model creation and performance calculations. The ‘Kappa’ interface supports the calculation of inter-rater reliability by presenting totals of agreement among judges that can be exported to statistical packages. The ‘Feature Blast’ interface iteratively calculates the performance (ie, recall, precision, F-measure) of different combinations of feature types and classifiers to determine appropriate combinations for a given application. The ‘Laboratory’ interface enables developers to explore and evaluate different approaches to IR. Developers can use the Laboratory interface to select which feature types and models to experiment with, tracking the performance of each combination. The ‘Retrieve’ interface shows the performance of all models created as part of a project and facilitates deployment of saved models on larger collections.
A screen shot of the Judge interface. The annotation instructions shown in the ‘Help Information’ window are populated as part of the creation of a new project.
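As noted above, the Kappa interface exports agreement totals for analysis in statistical packages rather than computing reliability itself. A minimal sketch of the downstream Cohen's kappa computation on two judges' ‘Y’/‘N’ labels might look like this (illustrative only, not ARC code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges' relevance labels.

    Illustrative sketch: ARC itself exports agreement totals for
    external statistical packages rather than computing kappa.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of documents with matching labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each judge's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l]
                   for l in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)
```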
ARC was used to manage all of the processes involved in this study, from sample creation to algorithm evaluation. It was developed in Java and is available as open-source software at http://research.maveric.org/mig/arc.html. Users can download ARC or, thanks to the generous cooperation of the National Library of Medicine and Dr Guergana Savova, a ‘full’ version of ARC with cTAKES and its UMLS-based knowledge base installed. The site also features HTML and video tutorials designed around a small collection of simulated radiology reports.
The focus of this study was the evaluation of the algorithms used within the Feature Blast interface to retrieve relevant documents across a number of different applications with no custom software development. Building on the collection of currently available open-source clinical IR software, ARC combines open-source NLP pipelines with machine learning.
ARC uses UIMA-based pipelines for NLP. The UIMA pipelines can be launched to process text from within ARC, or complete UIMA project files can be loaded into ARC. Each pipeline created in UIMA has an XML-formatted configuration file that describes the structured output the pipeline produces. ARC reads this XML configuration file and exposes the NLP-structured output as feature types for machine learning classification. As a result, any UIMA-based pipeline can be used by ARC. However, the goal of this study is to design and evaluate the ability of our approach to perform well across different applications with no custom code or rule development. We therefore chose cTAKES, a general concept-mapping clinical pipeline.20
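As a sketch of that mechanism, the snippet below reads the declared type names from a minimal, hypothetical type-system descriptor. Real UIMA/cTAKES descriptors are larger and namespaced, so this is illustrative only:

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical UIMA type system descriptor; actual cTAKES
# descriptors declare many more types and use XML namespaces.
DESCRIPTOR = """\
<typeSystemDescription>
  <types>
    <typeDescription><name>org.example.NounPhrase</name></typeDescription>
    <typeDescription><name>org.example.SnomedConcept</name></typeDescription>
  </types>
</typeSystemDescription>
"""

def feature_types(descriptor_xml):
    """List the type names a pipeline declares, analogous to how ARC
    exposes each declared NLP output type as a candidate feature type."""
    root = ET.fromstring(descriptor_xml)
    return [t.findtext("name") for t in root.iter("typeDescription")]
```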
The transforms performed on clinical data using cTAKES result in more than 90 different types of structured output (eg, noun phrases, tokens, sentences, SNOMED codes).
The version of cTAKES available for this study uses a section boundary detector based on the HL7 Clinical Document Architecture (CDA), which is not widely implemented by the VA Healthcare System. One minor modification made to cTAKES was therefore to remove the CDA-based section boundary detector and add a regular expression-based one. The ability to make such modifications easily is a function of the modular design of open-source NLP frameworks such as UIMA and GATE. An abbreviated list of some of the structured results produced by cTAKES is provided in .
Abbreviated list of cTAKES structured output
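A regular expression-based section boundary detector of the kind described above might be sketched as follows; the heading names are hypothetical and not the actual pattern set used in the modified cTAKES module:

```python
import re

# Hypothetical section headings; VA documents vary, and the real
# detector's pattern set is not reproduced here.
SECTION_HEADING = re.compile(r"^(HISTORY|FINDINGS|IMPRESSION|DIAGNOSIS):",
                             re.MULTILINE)

def split_sections(report):
    """Split a report into (heading, body) pairs at heading matches."""
    matches = list(SECTION_HEADING.finditer(report))
    sections = []
    for i, m in enumerate(matches):
        # Each section's body runs to the start of the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections.append((m.group(1), report[m.end():end].strip()))
    return sections
```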
For supervised machine learning, ARC integrates the open-source Application Programming Interface (API) exposed by the MAchine Learning for Language Toolkit (MALLET).41
In this study, two particular classifiers from MALLET are used: a MaxEnt classifier and a classifier based on CRFs.
The ability of ARC to reduce developer involvement in the clinical IR process is predicated on the capacity of the system to ‘learn’ effective approaches to solving a given IR problem. After a user provides ARC with a reference set, ARC's Feature Blast algorithm uses the following steps to identify which types of NLP output and machine-learning classifiers to combine for a given application. First, it processes text documents with the cTAKES NLP pipeline, exposing more than 90 NLP-derived feature types (eg, noun phrases, tokens, SNOMED concepts) for supervised classification. Using 10-fold cross-validation, the system partitions the reference set into training and test sets and calculates the performance of each individual NLP-produced feature type using all available machine-learning classifiers. The performance of each individual feature type and classifier combination is stored to the workspace.
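The partitioning and scoring loop described above can be sketched as follows. The `evaluate` callable is a hypothetical stand-in for ARC's actual train-and-test step, not part of ARC's API:

```python
def k_folds(items, k=10):
    """Partition a reference set into k (train, test) splits, as in
    the 10-fold cross-validation described above."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def score_individual_types(feature_types, classifiers, evaluate):
    """Score each individual NLP feature type with each classifier.

    `evaluate` is a hypothetical stand-in for training on one feature
    type and returning (recall, precision, f_measure) averaged over
    the cross-validation folds.
    """
    return {(ft, clf): evaluate(ft, clf)
            for ft in feature_types
            for clf in classifiers}
```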
The optimal combination of feature types and classification algorithms could be determined by calculating all possible variations. However, with more than 90 different feature types and two classifiers, the cost in time would be prohibitive. Instead, we explored the performance of two different algorithms designed to identify favorable combinations more efficiently; both are described below.
- 1. Algorithm 1: top scoring combinations
The first algorithm used by Feature Blast to determine optimal combinations evaluates all combinations of the five top-scoring feature types or classes (eg, noun phrases, concepts) using either selected or all available classification algorithms. Algorithm 1 reduces the process to a manageable 52 iterations (26 combinations of feature types multiplied by two classifiers). The five top-scoring feature types are defined as:
| Configuration | Feature type combinations |
| --- | --- |
| 1 | Highest F-measure |
| 2 | 2nd highest F-measure |
| 3 | 3rd highest F-measure |
| 4 | Highest recall not already included |
| 5 | Highest precision not already included |
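Assuming the 26 combinations are all subsets of two or more of the five selected feature types (C(5,2) + C(5,3) + C(5,4) + C(5,5) = 10 + 10 + 5 + 1 = 26, with singletons already scored in the initial pass), the enumeration can be sketched as:

```python
from itertools import combinations

def algorithm1_runs(top_five, classifiers):
    """Cross every combination of two or more top-scoring feature
    types with every classifier.

    The subset sizes (2 through 5) are an assumption consistent with
    the 26 combinations and 52 iterations stated in the text.
    """
    combos = [c for r in range(2, len(top_five) + 1)
              for c in combinations(top_five, r)]
    return [(combo, clf) for combo in combos for clf in classifiers]
```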
- 2. Algorithm 2: top score + negation
A limitation of the first algorithm is its exclusion of feature types that score poorly as the only feature type in consideration but may add value as part of a combination of feature types. The feature type that most obviously falls into this category is negated concepts or phrases. For example, in classifying imaging reports consistent with cancer, evidence of negated concepts (eg, ‘no evidence of cancer’) may add value. cTAKES assigns negation to both named entities and UMLS concept unique identifiers (CUIs). A named entity is an atomic element or ‘thing’ found in the text, usually mapped from a noun phrase (eg, ‘heart attack’). Several different named entities can mean the same thing (eg, heart attack, myocardial infarction, MI), and therefore named entities are often mapped to unique concepts such as UMLS CUIs (eg, heart attack = CUI C0027051). ARC supports the conversion of negated entities and concepts to features by allowing the user to specify a prefix or suffix for any feature type through the user interface. For example, by adding the prefix ‘neg’ to all negated named entities (eg, ‘cancer’), ARC will pass ‘neg-cancer’ as a feature to the classifier. In each case, we chose the highest-scoring configuration of negation, selecting either the negated named entity or the negated CUI based on the highest F-measure.
Our second algorithm, combining top-scoring feature types and negation, is defined as:
| Configuration | Feature type combinations |
| --- | --- |
| 1–5 | Algorithm 1 combinations |
| 6 | Highest recall + highest precision |
| 7 | Highest recall + negated text |
| 8 | Highest precision + negated text |
| 9 | Highest recall + highest precision + negated text |
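The prefixing step described above, converting negated entities into distinct features, can be sketched as follows; the `(text, negated)` pair structure is an assumption for illustration:

```python
def negation_features(named_entities, prefix="neg"):
    """Convert negated named entities into prefixed features, as ARC
    does when the user supplies a prefix through its interface.

    `named_entities` is a list of (text, negated) pairs; this pair
    structure is assumed for illustration, not cTAKES's actual output.
    """
    features = []
    for text, negated in named_entities:
        # eg, a negated 'cancer' becomes the distinct feature 'neg-cancer'.
        features.append(f"{prefix}-{text}" if negated else text)
    return features
```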
Data collection and sampling
In this study, we evaluate the ability of ARC to retrieve relevant documents from the collection of relevant and irrelevant documents returned from ICD-9 code-based queries. To test the ability of our approach to generalize across different applications, three samples and targets for retrieval were used: (1) imaging reports consistent with lung cancer; (2) pathology reports consistent with colorectal cancer (CRC); (3) pathology reports consistent with prostate cancer. For each sample, 500 documents were chosen at random from documents created between 1997 and 2007 at hospitals within the New England Veterans Integrated Service Network (VISN 1). Our original case finding queries for identifying the collections from which samples were selected were as follows.
For CRC:
- Select all pathology reports within 60 days before and 60 days after the first appearance of ICD-9 codes 153.x, 154.x.
For prostate cancer:
- Select all pathology reports within 60 days before and 60 days after the first appearance of ICD-9 codes 185.x.
For lung cancer:
- Select all imaging reports within 60 days before and 60 days after the first appearance of ICD-9 codes 162.x.
We considered only the first appearance of a targeted ICD-9 code, regardless of assignment position (primary code, secondary code, etc). These samples were used to create ‘gold standard’ reference sets for both training and testing the algorithms.
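The case-finding window logic above can be sketched in Python. The record structures, `(date, code)` events and `(report_id, date)` pairs, are assumptions for illustration, not the actual VA query implementation:

```python
from datetime import date, timedelta

def first_code_date(code_events, targets):
    """Date of the first appearance of any targeted ICD-9 code,
    regardless of assignment position (primary, secondary, etc)."""
    hits = [d for d, code in code_events
            if any(code.startswith(t) for t in targets)]
    return min(hits) if hits else None

def candidate_reports(reports, code_events, targets, window_days=60):
    """Reports dated within 60 days before or after the first
    appearance of a targeted code."""
    anchor = first_code_date(code_events, targets)
    if anchor is None:
        return []
    window = timedelta(days=window_days)
    return [r for r, d in reports if anchor - window <= d <= anchor + window]
```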
Creation of reference sets
For each of the three samples, two physician judges assigned values of ‘relevant’ or ‘irrelevant’ to each of the 500 documents. A third physician judge served as final adjudicator for any disagreements. A total of five physicians participated in the creation of the three reference sets. Reviewers were instructed to base their assessment of relevance on whether each document was ‘consistent with a diagnosis of cancer.’ They were instructed to ignore any clinical history and instead focus on the immediate report of the pathologist or radiologist. In-situ cancers in the colon or rectum were counted as CRC, and prostate intraepithelial neoplasia was counted as prostate cancer. For CRC and prostate cancer, reviewers were instructed to classify a document as consistent with the cancer of interest whenever the pathologist recorded that cancer, even if the subject of the report was tissue outside the organ of interest.
Whereas the pathology report is the primary document for recording a diagnosis of prostate cancer and CRC, imaging reports are less likely to contain conclusive evidence of a lung cancer diagnosis. Instead, lung cancer diagnoses may be determined by a combination of imaging studies, biopsies, and/or laboratory results. Despite the potential inconclusiveness of imaging reports for lung cancer, they are considered important documents for finding lung cancer cases and monitoring cancer progression. They also provide the opportunity to test the performance of our approach on a sample of documents with less structure and with less agreement between judges. The imaging reports in this study were generated from a number of study modalities including x-rays, CT scans, and MRI.
In order to evaluate the effectiveness of the proposed approach, we captured the performance of individual feature types and both classifiers for all three samples as well as the performance of algorithms 1 and 2 using both classifiers. In all experiments, performance was measured in terms of recall, precision, and F-measure using 10-fold cross-validation. The performance of the NLP system has a direct effect on the quality of the features produced for classification. However, the focus of this study does not include a specific evaluation of cTAKES' performance on the samples used. illustrates the design of the study.
A graphical representation of the study design. CRC, colorectal cancer; CRF, conditional random field; MaxEnt, maximum entropy; NLP, natural language processing.
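The performance measures used throughout, recall, precision, and F-measure, are computed from binary relevance judgments in the standard way; a minimal implementation is shown for clarity:

```python
def retrieval_metrics(gold, predicted):
    """Recall, precision, and F-measure for binary relevance
    judgments (1 = relevant, 0 = irrelevant)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return recall, precision, f
```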