To address the problem of extracting structured information from pathology reports for research purposes in the STRIDE Clinical Data Warehouse, we adapted the ChartIndex Medical Language Processing system to automatically identify and map anatomic and diagnostic noun phrases found in full-text pathology reports to SNOMED CT concept descriptors. An evaluation of the system’s performance showed a positive predictive value for anatomic concepts of 92.3% and positive predictive value for diagnostic concepts of 84.4%. The experiment also suggested strategies for improving ChartIndex’s performance coding pathology reports.
Pathology reports are an important source of diagnostic information in the electronic health record. This is particularly the case with cancer diagnoses. Unfortunately, most institutions use either unstructured or semi-structured free-text documents to represent this information, making it difficult to automatically extract structured data for clinical or research purposes. Beginning with the work of Pratt et al1 in the 1970s, a number of informatics groups have worked on this problem2,3,4,5,6,7,8.
We have previously reported on the ChartIndex system9,10, a Medical Language Processing (MLP) system which uses a statistical natural language parser (the Stanford Parser11), augmented with the Unified Medical Language System’s (UMLS) Specialist Lexicon, to automatically identify noun phrases in clinical documents and map them to UMLS concept descriptors using the National Library of Medicine’s MetaMap Transfer12 (MMTx) software. We have also reported on ChartIndex’s ability to reliably identify negations in clinical radiology reports using a grammar-based classification model13. As part of our work on the STRIDE (Stanford Translational Research Integrated Database Environment) system14, we are interested in automatically extracting meta-data, for research purposes, from the over one million full-text pathology reports stored in the STRIDE Clinical Data Warehouse (CDW). The entire text of each pathology report in the CDW is indexed using Oracle Text. We are interested in using SNOMED CT to code the anatomic site and/or tissue from which a surgical pathology specimen was derived, as well as the pathologic findings and diagnoses reported for that specimen.
To evaluate the ability of ChartIndex to identify and code this information in full-text pathology reports and to automatically assign SNOMED CT codes for anatomic sites, tissues, pathologic findings and diagnoses, we used the system to index a corpus of pathology reports and tasked medical experts to assess the accuracy of the coding.
The ChartIndex system was originally developed to concept-index radiology reports. To handle pathology reports the system was modified as follows: (1) Added pathology report section headings and their canonical mappings to the document parsing module to improve report section segmentation; (2) Identified a set of text patterns/fragments commonly found in Stanford pathology reports that the parser should ignore; (3) Added a list of common abbreviations found in Stanford pathology reports to improve disambiguation; (4) Added a new mode of negation detection to handle text fragments where no parse tree is generated; (5) Created indexing rules to handle the report sections typically found in pathology reports. These ChartIndex rules use information such as UMLS semantic types and the matching score assigned by the UMLS concept mapping module to optimize mapping precision and recall. One such indexing rule identifies anatomical site descriptors from the specimen section for concepts with a matching score of 870 or higher, and if no anatomic site descriptor is found in that section, the rule then tries to identify site descriptors in the diagnosis section of the report. This rule helps improve precision because the short phrases in the specimen sections usually contain anatomical sites together with the procedures used to acquire the specimen, while in the diagnosis sections false positives may be introduced by comments not pertaining to the actual specimen.
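The site-selection rule described above can be sketched roughly as follows. This is a minimal illustration, not ChartIndex's actual rule engine: the `Candidate` record, the `"anatomy"` label and the function name are hypothetical, and applying the same 870 threshold to the diagnosis-section fallback is an assumption, since the text states the threshold only for the specimen section.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 870  # matching-score cutoff quoted in the text


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one concept mapped from a report section."""
    section: str         # "specimen" or "diagnosis"
    concept: str         # concept descriptor text
    semantic_class: str  # hypothetical coarse label, e.g. "anatomy"
    score: int           # matching score from the concept-mapping module


def select_site_descriptors(candidates, threshold=SCORE_THRESHOLD):
    """Prefer anatomic sites from the specimen section; fall back to diagnosis."""

    def sites_in(section):
        return [
            c for c in candidates
            if c.section == section
            and c.semantic_class == "anatomy"
            and c.score >= threshold  # assumption: same cutoff in both sections
        ]

    specimen_sites = sites_in("specimen")
    # Only consult the diagnosis section when the specimen section yields nothing.
    return specimen_sites if specimen_sites else sites_in("diagnosis")
```

Consulting the diagnosis section only as a fallback mirrors the stated rationale: specimen phrases are short and site-focused, so they are less likely to introduce false positives.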
The document set for this study consisted of 500 de-identified single-specimen surgical pathology reports, selected at random from more than ten thousand consecutive non-cytology reports from Stanford University Medical Center. Cytology reports were excluded because of the high rate at our institution (>90%) of reports with negative findings. Reports on slides and blocks from other institutions (“outside consults”) were also excluded. Each electronic pathology report consisted of a demographics section, an optional section transmitting the “clinical history” from the surgeon to the pathologist, a required section identifying the “specimen submitted”, a required section listing the “diagnosis”, optional sections commenting on the diagnosis and optional sections describing the gross and microscopic features of the specimen.
To de-identify the reports, patient demographics were removed and the report accession number replaced by an MD5 hash file identifier. For parsing, we chose to examine only the “specimen submitted”, “diagnosis”, and “comment” sections from each report, as these sections contained the anatomic and diagnostic data of interest to this study. These report sections were then further de-identified, using regular expression matching, by replacing specific dates with an arbitrary date, replacing names with an arbitrary name and mapping ages into one of the following categories: newborn, infant, toddler, child, teenager, young adult, adult, mature adult, or elderly. The reports were then manually inspected to verify de-identification.
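A regular-expression pass of the kind described above might look like the sketch below. The patterns, the placeholder date and, in particular, the age cutoffs are illustrative assumptions; the study does not report the expressions or age boundaries it used. Name replacement is omitted, and the "newborn" category is not produced here because it would need ages expressed in days or weeks.

```python
import hashlib
import re

# Hypothetical patterns: numeric dates such as 3/14/2007 and ages such as "67-year-old".
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
AGE_RE = re.compile(r"\b(\d{1,3})[- ]year[- ]old\b", re.IGNORECASE)
PLACEHOLDER_DATE = "01/01/2000"  # arbitrary replacement date (assumption)

# Upper bounds in years (exclusive). These cutoffs are assumptions, not from the study.
AGE_BANDS = [
    (1, "infant"),
    (3, "toddler"),
    (13, "child"),
    (20, "teenager"),
    (40, "young adult"),
    (60, "adult"),
    (75, "mature adult"),
]


def age_category(years):
    """Map an age in years to one of the coarse categories."""
    for upper, label in AGE_BANDS:
        if years < upper:
            return label
    return "elderly"


def deidentify(text):
    """Replace specific dates with a fixed date and exact ages with a category."""
    text = DATE_RE.sub(PLACEHOLDER_DATE, text)
    return AGE_RE.sub(lambda m: age_category(int(m.group(1))), text)


def hashed_identifier(accession_number):
    """Derive an MD5-based identifier to stand in for the accession number."""
    return hashlib.md5(accession_number.encode("utf-8")).hexdigest()
```

Because pattern matching of this kind can miss unusual date or age formats, the manual inspection step described in the text remains necessary.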
The de-identified document set was divided into a training set of 100 reports and a test set of 400 reports. The training set was used to train the ChartIndex parser on the typical document segmentation, sentence structure, sentence fragments and abbreviations/acronyms used in pathology reports at Stanford (e.g., “NSA” for “no significant abnormality”). The 400 reports in the test set were parsed into the XML document format used by ChartIndex, a format based on the HL7 Clinical Document Architecture (CDA)15. ChartIndex parsed the reports and each noun phrase identified was mapped to one or more UMLS concept descriptors using the National Library of Medicine’s MMTx software. The set of UMLS concept descriptors generated for each report was then filtered to remove all non-SNOMED descriptors. The resulting list of SNOMED descriptors associated with each report section was further filtered to retain only SNOMED descriptors with appropriate UMLS semantic types indicating anatomical sites or diagnoses. For example, SNOMED descriptors associated with the Specimen Submitted section of the reports (where the anatomic source of the specimen is described) were filtered to ensure that only SNOMED concepts with UMLS semantic types in the “Anatomical Structure” semantic class hierarchy (A1.2) were included.
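The two filtering steps can be illustrated with a short sketch. The `Mapping` record, the `"SNOMEDCT"` source label and the function names are hypothetical. The A1.2 prefix for the specimen section comes from the text; the prefixes listed for the diagnosis section are illustrative assumptions that should be checked against the UMLS Semantic Network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical record for one descriptor mapped from a report section."""
    section: str        # "specimen" or "diagnosis"
    descriptor: str     # concept descriptor text
    source: str         # source vocabulary label (hypothetical, e.g. "SNOMEDCT")
    semantic_tree: str  # semantic type tree number, e.g. "A1.2.3.1"


# Tree-number prefixes allowed per section.
ALLOWED_TREES = {
    "specimen": ("A1.2",),               # "Anatomical Structure" hierarchy, per the text
    "diagnosis": ("B2.2.1.2", "A2.2"),   # assumption: illustrative prefixes only
}


def in_hierarchy(tree, prefix):
    """True when tree is the prefix itself or one of its descendants."""
    return tree == prefix or tree.startswith(prefix + ".")


def filter_mappings(mappings, source="SNOMEDCT"):
    """Keep SNOMED descriptors whose semantic type suits their report section."""
    kept = []
    for m in mappings:
        if m.source != source:
            continue  # step 1: drop non-SNOMED descriptors
        prefixes = ALLOWED_TREES.get(m.section, ())
        if any(in_hierarchy(m.semantic_tree, p) for p in prefixes):
            kept.append(m)  # step 2: semantic type matches the section
    return kept
```

Matching on the tree-number prefix plus a dot, rather than a bare string prefix, avoids accepting an unrelated branch such as A1.20 as a descendant of A1.2.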
The de-identified SNOMED-indexed reports were then divided into four sets of 100, with each document reviewed independently by two experts from a panel of three pathologists and one internal medicine physician. An example of one of the simpler reports used in the study is shown in Figure 1. The top portion of this sample contains three sections extracted from the original report. The bottom section contains the ChartIndex-generated SNOMED terms for Tissue/Site and Findings/Diagnosis. Most reports in the study were more complex than this sample, containing several SNOMED terms.
The expert reviewers were instructed to enter a ‘+’ before each SNOMED term that correctly represented a concept present in the report or enter a ‘−’ if the SNOMED term did not represent a concept in the report. Only anatomic sites and diagnosis concepts were listed.
Two different experts then independently scored each indexed report and any differences in assessments were reconciled by two of the authors using a variant of the Delphi method16. The Delphi method uses a consensus-building approach in which group communications are structured to allow results from each individual expert to be shared and revised. This enables experts to reconsider their decisions after seeing conflicting results from others. All remaining discrepancies were then arbitrated by a final expert judge, i.e. one of the physician authors (DPR or HJL).
We calculated the positive predictive value (precision) of the ChartIndex parser and the inter-observer agreement ratio for each set of 100 documents and also for the entire test set. The inter-observer agreement ratio was calculated as the total number of agreed-upon concepts in the two experts’ initial responses (before the reconciliation process) divided by the total number of SNOMED concepts in the set. The positive predictive value was calculated as the number of true positives after inter-observer reconciliation divided by the total number of SNOMED CT concepts generated by ChartIndex.
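Both measures reduce to simple ratios over the reviewed concepts. The sketch below assumes each reviewer's marks are held as a mapping from a concept identifier to `True` (‘+’) or `False` (‘−’); that representation and the function names are hypothetical, not part of ChartIndex.

```python
def agreement_ratio(first, second):
    """Share of concepts on which two reviewers' initial marks coincide.

    first, second: dicts mapping a concept id to True ('+') or False ('-').
    """
    if not first or first.keys() != second.keys():
        raise ValueError("both reviewers must mark the same non-empty concept set")
    agreed = sum(first[key] == second[key] for key in first)
    return agreed / len(first)


def positive_predictive_value(reconciled):
    """True positives after reconciliation divided by all generated concepts.

    reconciled: dict mapping a concept id to the final True/False judgment.
    """
    if not reconciled:
        raise ValueError("no concepts to score")
    true_positives = sum(reconciled.values())
    return true_positives / len(reconciled)
```

For instance, if two reviewers agree on three of four generated concepts, the agreement ratio is 0.75; if three of those four are judged correct after reconciliation, the PPV is also 0.75.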
The positive predictive value (PPV) of SNOMED CT coding by ChartIndex on the test set of 400 surgical pathology reports is shown in Table 1.
The number of agreed versus total concepts and the agreement ratios between initial evaluations are shown in Table 2, with the last two rows giving the lower bound (LB) and upper bound (UB) of the 95% confidence interval (CI).
During the data analysis, it emerged that Set 3 had a significantly lower agreement ratio of 78.4% (CI-LB=75.4%, CI-UB=81.5%) between the two reviewers before reconciliation. To examine the impact of this variance, we calculated the PPV separately for Set 3, for the other three sets as a whole, and for the total document set. These results are shown in Table 3. Following reconciliation, the results for Set 3 are generally consistent with the results obtained from the other three sets. This result also supports the reliability improvement provided by the Delphi-based reconciliation method used in this study.
A post-hoc analysis of the data from this experiment suggested a number of enhancements to ChartIndex that could further improve the system’s indexing performance with pathology reports. For example, ChartIndex’s parser interpreted text beginning with consecutive hyphens (a common formatting convention on Stanford pathology reports) as compound phrases, not sentences, and therefore did not attempt full grammar-based parsing of the text. Instead, the system defaulted to using normalized string matching on the text and, if this approach failed, then relied on MMTx to parse and map the text to UMLS concepts. This approach sometimes failed to detect concepts properly, or returned concepts with scores too low to pass the pre-set threshold (e.g., ‘--MILD FOCAL ADENOMATOUS CHANGES’). This problem can be addressed by either improving full parsing of fragmented text or by chunking longer text segments into smaller noun phrases before passing them on to the concept-mapping module.
Another adaptation of ChartIndex for this experiment used regular expression matching to detect negations in fragmented text without a parse tree, an approach similar to that of NegEx developed by Bridewell and Chapman17. Using this approach we found that some straightforward negations were missed because only a limited number of regular expressions were derived from the small training set. For example, ‘--NEGATIVE FOR HYPERPLASIA AND CARCINOMA’ resulted in a positive finding of ‘HYPERPLASIA’. This can be addressed either by expanding the negation-matching regular expressions or by using the grammatical approach when a parse tree is available.
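One way to extend the regular expressions so that a negation trigger covers every coordinated finding in a fragment is sketched below. The trigger list and function name are hypothetical and far smaller than a real NegEx-style lexicon; this is not the expression set used by ChartIndex.

```python
import re

# Hypothetical trigger phrases; longer triggers are listed before shorter ones.
NEGATION_RE = re.compile(
    r"^[-\s]*(?:NEGATIVE FOR|NO EVIDENCE OF|NO)\s+(?P<scope>.+?)\.?\s*$",
    re.IGNORECASE,
)
# Coordinated findings inside the negation scope are separated by commas, AND, or OR.
SPLIT_RE = re.compile(r"\s*(?:,|\bAND\b|\bOR\b)\s*", re.IGNORECASE)


def negated_findings(fragment):
    """Return every finding negated by a leading trigger, or [] if none applies."""
    match = NEGATION_RE.match(fragment)
    if match is None:
        return []
    parts = SPLIT_RE.split(match.group("scope"))
    return [part.upper() for part in parts if part]


print(negated_findings("--NEGATIVE FOR HYPERPLASIA AND CARCINOMA"))
# ['HYPERPLASIA', 'CARCINOMA']
```

Splitting the scope on conjunctions marks both ‘HYPERPLASIA’ and ‘CARCINOMA’ as negated. A fragment without a leading trigger, such as ‘--MILD FOCAL ADENOMATOUS CHANGES’, yields an empty list.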
Other indexing failures were traced to ambiguity in the UMLS synonyms, which are used to map noun phrases to UMLS descriptors. For example, the noun ‘NODULE’ is a UMLS synonym for both ‘NODULE’ and ‘NODULUS CEREBELLI’. This is a classic word sense disambiguation (WSD) problem, which has received considerable attention in the biomedical domain over the last ten years18. One common approach to this problem is to build a corpus in which each instance of an ambiguous token is annotated with its correct meaning (concept) in context, and then to train machine learning algorithms on the corpus to classify sense tokens. Weeber et al. developed the NLM WSD test collection19 using MEDLINE abstracts, which is freely available online at http://umlsks.nlm.nih.gov/. Liu et al. used conceptual relationships in the UMLS to create an automated method of constructing a WSD corpus from MEDLINE® abstracts20. Their method achieved high precision (96.8%) but limited recall (50.6%). Furthermore, Leroy and Rindflesch found that concepts (senses) in the UMLS Metathesaurus used by many researchers may have contributed to inaccuracies in the gold standard, therefore limiting the performance of word sense disambiguation techniques21. Creating a large WSD corpus from clinical documents is more difficult. More recently, Xu et al. demonstrated that for abbreviations in clinical documents, a clustering-based, semiautomated approach can significantly reduce the manual annotation effort compared to traditional manual approaches while producing more complete sense inventories compared to random sampling22. In our study’s data set, there was no instance of ‘NODULUS CEREBELLI’. Indeed, a search of all one million pathology reports in STRIDE did not find this phrase, indicating that we may need a much larger clinical document set to build a robust WSD corpus.
Another type of indexing failure was traced to the normalized string search used to map noun phrases to UMLS descriptors. For example, the phrase ‘ABNORMALITY’ returned ‘CONGENITAL ABNORMALITY’. These mapping errors can be addressed using a mapping variance dictionary that allows ChartIndex to map a given phrase directly to a specific UMLS concept descriptor, rather than attempting a normalized string search. We also encountered another limitation of the lexical mapping approach. For example, ‘RECTAL BIOPSY’ in the Specimen Submitted section of a report only returned ‘BIOPSY OF RECTUM’ as a perfect match. ‘RECTUM’ was a concept with a very low matching score for this phrase and was therefore not returned by ChartIndex as an anatomical site. Instead of relying on lexical matching only, we should identify RECTUM as the DIRECT_PROCEDURE_SITE_OF the ‘BIOPSY OF RECTUM’ procedure. This can be implemented using a database table holding relevant relationship data between concepts.
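The two proposed remedies, a mapping variance dictionary and a table of procedure-to-site relationships, can be sketched together. In-memory dictionaries stand in for the database table mentioned above. All names are hypothetical, the override target chosen for ‘ABNORMALITY’ is an assumption (the text does not say which descriptor is intended), and `lexical_lookup` is a placeholder for the existing normalized string search.

```python
# Mapping variance dictionary: phrase -> descriptor to use instead of string search.
# The target descriptor below is an assumption for illustration only.
MAPPING_OVERRIDES = {
    "ABNORMALITY": "ABNORMALITY",
}

# Stand-in for a relationship table: procedure concept -> its direct procedure site.
DIRECT_PROCEDURE_SITE = {
    "BIOPSY OF RECTUM": "RECTUM",
}


def map_phrase(phrase, lexical_lookup):
    """Consult the override dictionary first, then fall back to lexical search."""
    key = phrase.strip().upper()
    if key in MAPPING_OVERRIDES:
        return MAPPING_OVERRIDES[key]
    return lexical_lookup(key)


def anatomic_site_for(descriptor):
    """Return the site related to a procedure concept, or None if none is recorded."""
    return DIRECT_PROCEDURE_SITE.get(descriptor)


def map_specimen_phrase(phrase, lexical_lookup):
    """Map a specimen phrase and derive its anatomic site via the relationship table."""
    descriptor = map_phrase(phrase, lexical_lookup)
    return descriptor, anatomic_site_for(descriptor)
```

With a lexical search that maps ‘RECTAL BIOPSY’ to ‘BIOPSY OF RECTUM’, `map_specimen_phrase` would return the procedure together with ‘RECTUM’ as its site, without depending on a low lexical matching score for the site itself.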
The results of this experiment demonstrated that ChartIndex could be successfully adapted to automatically identify and index anatomic and diagnostic noun phrases found in the test set of pathology reports, using SNOMED descriptors. With an overall PPV of 88.4%, the performance of the system was considered adequate for this task. As ChartIndex had been initially developed and evaluated using only radiology reports, this suggests that the system’s MLP model has the potential to successfully index other classes of clinical documents. Performance on pathology reports is approximately equivalent to our experience using ChartIndex with radiology reports. This experiment focused on measuring precision (PPV) and did not measure recall, a potential shortcoming of the methodology used.