We extended the cTAKES pipeline to improve NLP capabilities, simplify feature extraction, and facilitate document classifier development. We constructed a gold standard document corpus of radiology reports suggestive of hepatic decompensation. We then applied the system as follows: (1) developed rule-based classifiers, (2) performed system tuning in which we iteratively improved document annotation by modifying the system configuration, and (3) evaluated machine-learning algorithms for document classification.
YTEX pipeline
The cTAKES is a modular pipeline of annotators that combines rule-based and machine-learning techniques to annotate syntactic constructs, named entities, and their negation context in clinical text. cTAKES uses the OpenNLP Maximum Entropy package for sentence detection, tokenizing, part-of-speech tagging, and chunking; uses the SPECIALIST lexical variant generator for stemming; and uses an algorithm based on NegEx for negation detection.
23–25 The cTAKES DictionaryLookup module performs named entity recognition by matching spans of text to entries from a dictionary. We used the cTAKES distribution included with ARC, which is distributed with Unified Medical Language System (UMLS) database tables for use with the DictionaryLookup module.
26 The UMLS Metathesaurus unifies over 100 source vocabularies and assigns each term a concept unique identifier (CUI).
We modified cTAKES as follows: we developed regular-expression-based named entity recognition and section detection annotators (NamedEntityRegex and SegmentRegex); we adapted the latest version of the NegEx algorithm to cTAKES for negation detection (Negex); and we developed a module to store annotations in a relational database (DBConsumer; see ). The annotators we developed are highly configurable; refer to the online appendix for a detailed description of all modifications to the cTAKES pipeline and configurations used in this study.
cTAKES can annotate demarcated sections from documents that conform to the Clinical Document Architecture format, which is not used in the VHA. To identify document sections, we developed an annotator that identifies section headings and boundaries based on regular expressions.
The DictionaryLookup algorithm performs named entity recognition by matching spans of document text to word sequences from a dictionary. Some clinical concepts are too complex, have too many lexical variants, or consist of non-contiguous tokens, making them difficult to represent in a simple dictionary. To address this issue, we developed an annotator that uses regular expressions to identify such concepts.
The cTAKES negation-detection algorithm is based on an older version of the NegEx algorithm and has limited support for long-range detection and post-negation triggers. To address these issues, we replaced the cTAKES negation-detection algorithm with an annotator based on the latest version of the Java General NegEx package, which supports long-range detection and post-negation triggers.
27In order to efficiently extract different feature sets from documents annotated with cTAKES, we developed a module that stores cTAKES annotations in a relational database. UIMA annotations are limited in complexity and obey a strict class hierarchy. These restrictions on the structure of UIMA annotations facilitate a high-fidelity relational representation. We used an object-relational mapping tool (Hibernate) to map UIMA annotations to relational database tables using a table-per-subclass strategy; refer to the online appendix for a detailed description of the data model.
28 YTEX supports SQL Server, Oracle, and MySQL databases. The effort involved in mapping new or modified annotations to the database is minimal, making this approach applicable to any UIMA annotation.
Storing annotations in a relational database greatly simplifies the development of rule-based classifiers: document feature vectors can be retrieved using SQL queries, and rules can be implemented using SQL ‘case’ statements.
Machine-learning document-classification techniques often employ the ‘bag-of-words’ or ‘term-document matrix’ representation of documents.
21 In this representation, documents occupy a feature space with one dimension for each word or term; words may be a word from a natural language or may be a technical identifier. The value of each dimension is typically either an indicator, asserting the presence of the word in the document, or a numeric value, indicating the term frequency. This feature space is typically high-dimensional and sparse, that is, the feature vectors mostly contain zeros. Most statistical packages support specialized file formats for efficient handling and exchange of sparse data sets. To use the bags-of-words document representation with WEKA, we developed a tool for exporting annotations obtained via SQL queries in the WEKA sparse file format. The tool takes as a parameter an SQL query that retrieves instance id, attribute name, and attribute value triples; it executes the query and rotates rows into columns to produce a sparse matrix representation of the data (). This transformation is similar to the SQL ‘pivot’ operator but differs in that it can create a matrix with an arbitrary number of columns.
The generic nature of the tool allows classification on any unit of text: the instance id can refer to a document, sentence, or phrase. The attribute name represents a dimension—for example, a stemmed word or concept identifier; and the attribute value may be numeric or categorical. The tool enables the integration of other relational data sources with document annotation data—for example, administrative, pharmacy, or laboratory data. Refer to the online appendix for sample SQL statements used to export document annotations and administrative data for use with WEKA.
Reference standard document corpus construction
To develop a gold-standard classification of radiographic findings indicative of hepatic decompensation, we used the results of a chart review designed to screen for and confirm cases of hepatic decompensation in the Veterans Aging Cohort Study (VACS) (). For the chart review, subjects enrolled in VACS were screened for radiographic findings of hepatic decompensation at enrollment by evaluating for suggestive International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnostic codes, and laboratory abnormalities up to 1 year before through 6 months after entry into the cohort to identify possible prevalent cases. Additionally, a random sample of 100 patients who did not screen positive by the above criteria was selected to ensure the absence of hepatic decompensation events. Two trained data abstractors reviewed reports of abdominal ultrasounds, abdominal CT scans, and MRI studies, and recorded onto structured data-collection forms the following information: presence and quantity of ascites (fluid within the peritoneal cavity); presence and location of varices (dilated veins within the esophagus and stomach), and presence, number, and dimensions of liver masses. Two endpoint adjudicators with expertise in chronic liver diseases reviewed data forms and determined whether these outcomes of interest (ie, ascites, varices, liver masses) were present or absent. Disagreement on classification of the finding resulted in a review by a third reviewer to adjudicate the outcome. All findings were recorded in an electronic ‘adjudication database.’
As part of this study, we randomly selected the data-abstraction forms of 236 patients with ICD-9-CM diagnostic codes and/or laboratory abnormalities suggestive of hepatic decompensation and transcribed them to a database. We then linked the abstraction data to the original radiology reports, and defined a gold-standard classification of radiology reports. We labeled radiology reports included in the chart review ‘abdominal radiology reports.’ We assigned additional class labels to these reports indicating the presence of ascites, varices, and/or liver masses based on the data-abstraction forms ().
Rule-based classifier development
We initially classified documents using manually developed rules. These interpretable classifiers allowed us to explore the feature space, optimize feature representations, and understand and rectify NLP errors that caused misclassification. We implemented the rules as SQL case statements, operating on feature vectors retrieved via SQL queries. For example, to identify radiology reports that assert the presence of varices, we focused on named entity annotations that contain CUIs related to varices, and represented documents as vectors with a column for each concept. Refer to the online appendix for sample SQL statements and a list of features we used in classification rules.
The ability to filter, aggregate, and transform document annotations using SQL queries allowed us to easily experiment with different representations of document concepts and their semantic and syntactic context. We found the following feature selection and representation approaches effective: filtering out concepts located within certain document sections; representing the negation status of concepts using a ‘relative negation count’; combining different concepts in a single feature; and using within-sentence concept co-occurrence.
The document section to which a term belongs is an important feature for document classification: for the discrimination of abdominal radiology reports from other radiology reports, terms in the title had far more importance than terms in the document body. For the identification of documents that assert the presence of a clinical condition (ie, ascites, varices, or liver masses), we found that filtering out terms from the clinical history section of documents improved classifier performance.
We combined distinct UMLS concepts under a single feature, thereby reducing the number of features needed and simplifying rule development. For example, the distinct UMLS concepts ‘Ascites’ (C0003962), ‘Peritoneal Fluid’ (C0003964), and intra-abdominal collection (C0401020) could for the purposes of this classification task be grouped under a single feature ‘Ascites.’
For the identification of liver masses, within-sentence co-occurrence was an important feature. For example, the sentence ‘A rounded, echogenic focus is seen in the left lobe of the liver’ contains the terms ‘echogenic focus’ and ‘liver’. We used co-occurrence of these terms within a sentence as a simple heuristic to infer the presence of a liver mass. Knowing that both these terms are in the same document is insufficient to infer the presence of a liver mass.
Concepts can be negated and affirmed within the same document as a result of errors in the negation detection algorithm, or due to deeper semantic content; exclusively considering affirmed or negated terms obscures this information. To address this issue, we represented the negation context of concepts using a ‘relative affirmation count’: the number of times a concept was affirmed minus the number of times it was negated within a document.
For example, the rule for the classification of varices compares the number of affirmed varices terms to the number of negated varices terms outside of the ‘Clinical History’ section of the document (). If any particular varices term is affirmed more than negated, the document is classified as ‘varices positive.’ Refer to the online appendix for a description of other rule-based classifiers.
System tuning
To improve classifier performance, we performed multiple iterations of system tuning: (1) we generated document annotations with YTEX; (2) we classified documents using rule-based classifiers; (3) we manually examined all misclassified documents, and modified rules to resolve misclassification errors where necessary; (4) we reconfigured YTEX to rectify NLP errors; (5) we forwarded incorrectly labeled radiology reports to endpoints arbitrators (VLR and ZD-S) who reviewed these documents and updated document labels and patient adjudication databases.
We found that many classification errors were due to problems in named entity recognition (NER) and negation detection. To address these issues, we reconfigured the NER and negation-detection modules: we added entries to the dictionary used by the DictionaryLookup module; we configured regular expressions for use with the NamedEntityRegex module; and we modified the list of negation triggers used by the NegEx module. Refer to the online appendix for a detailed description of the regular expressions, dictionary entries, and negation triggers used for this study.
Upon evaluation of misclassified documents, we noticed that lexical variants of clinical concepts needed for classification were not included in the UMLS. For example, ultrasounds were often denoted with the term ‘echogram,’ which is not contained in the UMLS. We added additional entries to the YTEX UMLS dictionary to identify these concepts.
Some clinical concepts consisted of non-contiguous tokens, making them difficult to capture in a dictionary. For example, the following phrases were used to note the presence of ascites in radiology reports: ‘fluid is noted in the subhepatic area’, ‘free fluid around the liver is noted’, or ‘free fluid in the perihepatic region’. In these examples, the term ‘fluid’ is separated from the term ‘liver’ or ‘hepatic’ by several variable words. We configured regular expressions to identify these concepts.
Evaluation of machine-learning algorithms
Although the accuracy of the rule-based classifiers was satisfactory, we explored whether machine-learning algorithms could improve classification accuracy by using additional features that we overlooked, or features that could not easily be used in simple rule-based classifiers. For example, radiology reports that asserted the presence of hepatocellular carcinoma often asserted the presence of liver masses; machine-learning algorithms may leverage such associations to improve classification accuracy. We trained and evaluated the following machine-learning algorithms: decision trees (C4.5 algorithm), machine-learning analogs of rule-based classifiers
29
30; random forests, ensembles of decision trees
31; and SVMs, which have been successfully applied in document classification.
3
32–34To test whether system tuning and feature representation improved classifier performance, we evaluated classifiers against different representations of the document corpus:
- baseline: this dataset represents the annotations generated by the un-tuned pipeline;
- simple: this dataset employs a bag of affirmed terms document representation, which ignores document section and negated terms;
- rich: this dataset uses the rich document feature representation that leverages the syntactic and negation context of named entities as described above.
We exported the document corpus in the WEKA sparse file format, split the corpus into a training set and a held-out test set, performed cross-validation on the training set, selected the optimal algorithm, and performed a final evaluation of classifier accuracy against the held-out test set. We used the cross-validation results to estimate classifier accuracy for varices, as we did not have enough reports for a held-out test set. Refer to the online appendix for a detailed description of the different corpus representations and machine learning process.
The datasets we exported had over 4000 features. For feature selection, we ranked features from the training set by mutual information and evaluated classifier performance using a 4‐fold cross-validation on the top n features, with n varying between 1 and 500. Accuracy peaked with fewer than 500 features for all classification tasks. We then performed a 4‐fold cross-validation 25 times with the optimal algorithm and number of features on the training set to generate empirical distributions for the information retrieval metrics specificity, precision, recall, and F
1-Score with which we assess classifier performance. These are defined as follows
35:
- specificity: TN/(TN+FP);
- precision (positive predictive value): P=TP/(TP+FP);
- recall (sensitivity): R=TP/(TP+FN);
- F1-score: (2*P*R)/(P + R);
- TP: true positives (classified as positive when in fact positive);
- FP: false positives (classified as positive when in fact negative);
- TN: true negatives (classified as negative when in fact negative);
- FN: false negatives (classified as negative when in fact positive).