To test the feasibility of using text mining to depict meaningfully the experience of pain in patients with metastatic prostate cancer, to identify novel pain phenotypes, and to propose methods for longitudinal visualization of pain status.
Text from 4409 clinical encounters for 33 men enrolled in a 15-year longitudinal clinical/molecular autopsy study of metastatic prostate cancer (Project to ELIminate lethal CANcer) was subjected to natural language processing (NLP) using Unified Medical Language System-based terms. A four-tiered pain scale was developed, and logistic regression analysis identified factors that correlated with experience of severe pain during each month.
NLP identified 6387 pain and 13 827 drug mentions in the text. Graphical displays revealed the pain ‘landscape’ described in the textual records and confirmed dramatically increasing levels of pain in the last years of life in all but two patients, all of whom died from metastatic cancer. Severe pain was associated with receipt of opioids (OR=6.6, p<0.0001) and palliative radiation (OR=3.4, p=0.0002). Surprisingly, no severe or controlled pain was detected in two of 33 subjects’ clinical records. Additionally, the NLP algorithm proved generalizable in an evaluation using a separate data source (889 Informatics for Integrating Biology and the Bedside (i2b2) discharge summaries).
Patterns in the pain experience, undetectable without the use of NLP to mine the longitudinal clinical record, were consistent with clinical expectations, suggesting that meaningful NLP-based pain status monitoring is feasible. Findings in this initial cohort suggest that ‘outlier’ pain phenotypes useful for probing the molecular basis of cancer pain may exist.
The results are limited by a small cohort size and use of proprietary NLP software.
We have established the feasibility of tracking longitudinal patterns of pain by text mining of free text clinical records. These methods may be useful for monitoring pain management and identifying novel cancer phenotypes.
Pain is a debilitating part of the experience of metastatic cancer. An automated system to categorize and track pain in electronic medical records could provide a powerful means to improve clinical care, and could allow novel ‘high pain’ or ‘low pain’ phenotypes to be defined and studied on a molecular basis. We tested the feasibility of using natural language processing (NLP) of text from clinical encounters to depict meaningfully the experience of pain in patients with metastatic prostate cancer over time.
Worldwide, prostate cancer is the second most commonly diagnosed cancer and the sixth leading cause of cancer death in men.1 In the past decade, significant effort has been made to better understand and reduce the burden of pain on the cancer patient,2–5 the patient's family, caregivers, and society.6 Pain status can predict survival in metastatic prostate cancer,7 and changes in pain status have been examined as a surrogate marker of effectiveness of new therapies.8 9 Several validated pain survey tools have been proposed for routine clinical care.10 11
NLP has been used to quantify associations between diseases, conditions, and symptoms,12–16 for vocabulary discovery,17 and for cohort construction.18–25 NLP applications focusing on pain in clinical records have successfully detected the experience of pain in free text within an electronic medical record.26–28 Some studies suggest that, in certain scenarios, NLP of medical record text may perform better than patient-completed surveys in detection of clinically relevant pain.26 28
Although pain has previously been normalized and classified manually for purposes of statistical correlations,29 we used NLP to automatically characterize the experience of pain over thousands of records. To our knowledge, this is the first study to combine NLP, date resolution, and statistical analysis to create a longitudinal study of pain in the clinical record. Our system normalized each mention of pain in longitudinal clinical records by severity classification and number of days before death. We used regression modeling techniques to analyze both the newly structured data and the existing structured data to search for phenotypic correlations with pain in the context of metastatic prostate cancer.
Pain management is fundamental to effective clinical care, and significant pain is a consequence of the disordered biology of many cancers. This study tests the feasibility of automatically tracking patient pain over time using NLP of clinical record text. If NLP-based pain tracking is feasible, further study will be indicated to test the hypothesis that adoption of NLP-based pain tracking within electronic health record systems could provide significant added value in clinical care and in advancing research in disease phenotyping.
Thirty-three men from the PELICAN (Project to ELIminate lethal CANcer) integrated clinical/molecular autopsy study of metastatic prostate cancer were the subjects of this study. Subjects joined the institutional review board-approved study between 1995 and 2005. The mean age of the study subjects at the time of diagnosis of prostate cancer was 62 years (range 42–75). The mean interval between diagnosis and death was 6.3 years (range 0.8–15.4). Of the 33 subjects in the study, 27 were Caucasian, five were African-American, and one was of Hispanic background. Six subjects were seen only in community hospital inpatient, clinic and private office settings; the remaining 27 subjects were followed in a combination of oncology center and community hospital clinic settings.
The study obtained and analyzed all available paper, electronic, radiologic, radiation therapy, and pathology medical records for each subject. Subjects provided a list of institutions and physician offices where medical care was received, and copies of medical records from the various institutions and offices were obtained.
A total of 23 887 pages of paper records were converted into electronic text using methods described in the online appendix. The electronic record included laboratory values, radiology reports, pathology reports, and records of inpatient and outpatient encounters with providers. The text recorded in 4409 inpatient and outpatient encounters is called the ‘PELICAN corpus’ and is the focus of this study.
The full curated electronic text of each paper record was placed in the ILSR database, a system created to support the PELICAN Study. The average number of inpatient or outpatient records per year between diagnosis and death was 32 (range 4–212). Subject date of birth, date of death, race/ethnicity, all available serum prostate-specific antigen (PSA) concentrations, body weight measurements, body height measurements, and radiation therapy records were separately tabulated in ILSR by project data curators.
A multidisciplinary team consisting of NLP software developers, medical subject matter experts (SMEs), and statisticians developed a pain categorization model based on a conservative four-tiered pain scale: no pain (category 0); some pain (category 1); controlled pain (category 2); severe pain (category 3).
We used ClinREAD, a proprietary healthcare-domain-oriented, rule-based NLP system (Lockheed Martin, Bethesda, Maryland, USA) built on AeroText (Rocket Software, Newton, Minnesota, USA) and previously successfully used by members of the study team in the Informatics for Integrating Biology and the Bedside (i2b2) obesity challenge.22 30 ClinREAD was chosen because of its availability to the project team and team familiarity with its use. Other valid approaches, including machine learning, were not used because of lack of available resources for the current project. The first stage of the current project involved iterative development and evaluation of NLP-based pain extraction and qualification (severity, anatomy, and date) in the 4409-record PELICAN corpus, for the purposes of discovery over a closed dataset. During this stage, we made iterative modifications to our entire system, data model, normalization rules, and vocabulary (details in online appendix). We tested the generalizability of the NLP methods on 889 unannotated, deidentified discharge summaries provided courtesy of i2b2.31
The system rated each mention of experienced or explicitly denied pain on the basis of the context in which it was found (table 1). We developed 42 pain severity contextual rules, such as (complete list in online appendix table 2):
Vocabulary from the Unified Medical Language System (UMLS; version 2010AB)29 was imported via the Metathesaurus from 35 level 0 source vocabularies (see online appendix table 3). We selected 16 semantic types based on the domain of the data as shown in online appendix table 4. Lookup tables were created from each set of synonymous terms in order to associate each phrase with a preferred term and a UMLS concept unique identifier (CUI). A filtering process similar to that of Roberts et al32 was used to remove irrelevant terms. After filtering, a total of 675 000 terms and phrases were contained in the study vocabulary.
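As an illustration of the lookup-table step described above, the following minimal Python sketch maps every phrase in a set of synonymous terms to a preferred term and a concept identifier. It is not the ClinREAD implementation. The function name `build_lookup`, the example phrases, and the identifier strings are all hypothetical placeholders, not actual UMLS content.

```python
def build_lookup(synonym_sets):
    """Map each lower-cased phrase to (preferred_term, concept_id).

    synonym_sets: iterable of (concept_id, preferred_term, synonyms).
    All identifiers used with this sketch are illustrative placeholders.
    """
    lookup = {}
    for concept_id, preferred, synonyms in synonym_sets:
        # The preferred term is itself a valid surface form.
        for phrase in {preferred, *synonyms}:
            lookup[phrase.lower()] = (preferred, concept_id)
    return lookup


# Hypothetical synonym sets; the concept IDs are placeholders, not real CUIs.
EXAMPLE_SETS = [
    ("CUI_PLACEHOLDER_1", "Low back pain", ["lumbago", "lower back pain"]),
    ("CUI_PLACEHOLDER_2", "Bone pain", ["ostealgia", "pain in bone"]),
]

LOOKUP = build_lookup(EXAMPLE_SETS)
```

With a table of this shape, a phrase found in the text (for example `LOOKUP["lumbago"]`) resolves to the same preferred term and identifier as any of its synonyms, which is what allows mentions to be counted per concept rather than per surface string.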
We combined the vocabulary terms with context patterns in order to recognize internal dates, negatives, conditionals, and pain severity. These context patterns were developed manually. ClinREAD, like MedLEE,33 is rule-based. Each clinical concept (‘sign or symptom’, ‘finding’, ‘injury or poisoning’, ‘disease or syndrome’, or ‘neoplastic process’) is associated with a date and a body location; see online appendix for further detail. The system resolved incomplete dates (eg, ‘in July’) based on the date of the encounter, and resolved relative dates (eg, ‘four days prior to admission’) based on the previous date mention. Each resolved date is represented as a range (startdate, enddate). This date resolution component was based on the development team's previous work34–37 and is described in the online appendix. When dates were missing, the date of the clinical encounter was used as the default. Date associations were used to normalize the clinical concept to the number of days before death, for each individual study subject. This calculation is enabled through the conversion of the midpoint of absolute date ranges to the modified Julian format.38 Each mention of pain was associated with a severity level from the four-tiered pain scale. A subset of 637 strings from semantic type ‘sign or symptom’ were identified as indicating pain, listed in online appendix table 5. The NLP algorithm used for the study is summarized in figure 1 and as follows.
Although our system is proprietary, it could be replicated using other tools. One might start with any system that extracts concepts and identifies assertions as defined for the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text.39 One could then integrate a temporal reference extraction and normalization tool such as HeidelTime,40 GATE with the Tagger_DateNormalizer plugin,41 or DANTE,42 filter out the pain-related concepts, as listed in appendix table 5, and identify the level of pain using rules defined in appendix table 2 and the lookup table defined in appendix table 11.
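For readers attempting such a replication, the date normalization described in the methods (resolving an incomplete date to a range, taking the midpoint of the range, converting it to a modified Julian date, and subtracting it from the date of death) could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the study's code. The function names are hypothetical, and the rule that a bare month refers to the most recent such month on or before the encounter date is our assumption; the text states only that incomplete dates were resolved from the encounter date.

```python
import calendar
from datetime import date

# Modified Julian date 0 corresponds to 17 November 1858.
MJD_EPOCH = date(1858, 11, 17)


def to_mjd(d):
    """Convert a calendar date to its (integer) modified Julian date."""
    return (d - MJD_EPOCH).days


def resolve_month(month, encounter):
    """Resolve a bare month mention (eg, 'in July') to a (start, end) range.

    Assumption: the mention refers to the most recent such month that is
    not later than the month of the encounter.
    """
    year = encounter.year if month <= encounter.month else encounter.year - 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def days_before_death(start, end, death):
    """Number of days from the midpoint of a date range to the date of death."""
    midpoint = start + (end - start) // 2
    return to_mjd(death) - to_mjd(midpoint)
```

For example, a mention of 'in July' in an encounter dated 10 September 2003 would resolve to the range 1–31 July 2003, whose midpoint is 16 July 2003.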
NLP processing of the records database produced structured data on each pain mention in each clinical record for all 33 study subjects. These data were combined with demographic and other separately curated data about each subject into a single study database suitable for statistical analysis.
We undertook a univariate logistic regression analysis to identify correlates of severe pain for use in a multivariate model; factors investigated included receipt of various drugs (eg, opioids, chemotherapy, steroids), body mass index (BMI), receipt of palliative radiation, and frequency of utilization of health services—that is, we correlated severe pain, as derived by NLP processing, with clinical and demographic factors from the structured (ie, non-NLP-based, pre-existing) portion of the study database. For this analysis, ‘severe pain’ was any reading of controlled or severe pain—that is, any reading of 2 or 3 during a month of observation versus any other reading (−1=no data, 0=explicit report of no pain, 1=reported pain not described as controlled or severe); see online appendix for further details. We then constructed a multiple regression model to assess the strength of associations between the occurrence of severe pain and all defined variables for which p was less than 0.1 in the univariate analysis. Inclusion of a dichotomous variable indicating ‘last year before death’ controlled for time effects.43 All statistical analyses were conducted with SAS V.9.2 using the Proc Logistic procedure.
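The dichotomization described above can be illustrated with a short Python sketch. It is not the SAS code used in the study. The function names and input layout are hypothetical. The second function computes an unadjusted odds ratio from a 2×2 table, which for a single binary predictor coincides with the odds ratio from a univariate logistic regression; it does not reproduce the multivariate model.

```python
NO_DATA = -1  # month with no pain reading, as coded in the text


def monthly_severe_flags(mentions, n_months):
    """Return one 0/1 flag per month: 1 if any reading of 2 or 3 occurred.

    mentions: iterable of (month_index, severity) with severity in 0..3.
    Months with no reading keep the value NO_DATA and are flagged 0.
    """
    monthly_max = [NO_DATA] * n_months
    for month, severity in mentions:
        monthly_max[month] = max(monthly_max[month], severity)
    return [1 if value >= 2 else 0 for value in monthly_max]


def odds_ratio(outcome, exposed):
    """Unadjusted odds ratio for two equal-length 0/1 sequences."""
    a = sum(1 for o, x in zip(outcome, exposed) if x and o)          # exposed, severe
    b = sum(1 for o, x in zip(outcome, exposed) if x and not o)      # exposed, not severe
    c = sum(1 for o, x in zip(outcome, exposed) if not x and o)      # unexposed, severe
    d = sum(1 for o, x in zip(outcome, exposed) if not x and not o)  # unexposed, not severe
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: empty off-diagonal cell")
    return (a * d) / (b * c)
```

Here `exposed` would be a monthly indicator for a factor such as receipt of opioids; the per-month unit of analysis follows the description in the text.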
We determined a pain index value for each subject during four intervals before death, with pain index defined as the mean monthly maximum pain value (max_pain) for months in which a pain report was available; the monthly max_pain values used were no pain=0, some pain=1, and controlled pain or severe pain=2 (see figure 3). We then obtained longitudinal views of pain status in each subject by plotting color-coded monthly max_pain values from diagnosis until death (figure 2). When no pain status report was available for a given month, we used the most recent pain status as the imputed value for a given subject.
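The pain index and the imputation rule described above can be sketched in Python as follows. This is an illustrative sketch only. The function names are hypothetical, and months without a pain report are represented here as `None`, which is our choice of encoding.

```python
def collapse(severity):
    """Map the four-tier scale (0-3) to max_pain values (0, 1, 2).

    Controlled pain (2) and severe pain (3) both map to 2.
    """
    return min(severity, 2)


def pain_index(monthly_max):
    """Mean monthly max_pain over months in which a report was available."""
    observed = [value for value in monthly_max if value is not None]
    if not observed:
        return None  # no pain reports in this interval
    return sum(observed) / len(observed)


def carry_forward(monthly_max):
    """Impute missing months with the most recent available pain status.

    Months before the first report remain None.
    """
    filled, last = [], None
    for value in monthly_max:
        if value is not None:
            last = value
        filled.append(last)
    return filled
```

Note that, as described in the text, the pain index uses only months with a report, whereas the carried-forward series is intended for the longitudinal display.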
To test the possibility of visualizing a summary of pain records from a group of subjects, we displayed the fraction of study subjects in each pain severity up to the time of death (see online appendix figure 2A), as well as the worst pain severity detected for each subject for each month up to death (see online appendix figure 2B).
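A group summary of this kind could be computed along the following lines. This Python sketch is illustrative only. The function name and data layout are hypothetical; it assumes each subject's monthly series ends in the month of death, so that series of different lengths are aligned by counting back from the end.

```python
def severity_fractions(series_by_subject, months_before_death, levels=(0, 1, 2)):
    """Fraction of subjects at each pain level in a given month before death.

    series_by_subject: dict of subject id -> list of monthly max_pain values,
    with the last element being the month of death. Subjects without a value
    for the requested month (series too short, or None) are left out.
    """
    values = []
    for series in series_by_subject.values():
        if len(series) > months_before_death:
            value = series[-1 - months_before_death]
            if value is not None:
                values.append(value)
    if not values:
        return {level: 0.0 for level in levels}
    return {level: values.count(level) / len(values) for level in levels}
```

Evaluating this function for each month before death yields the data for a stacked display of pain severity fractions across the cohort.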
The purpose of the project was discovery over a closed dataset and a study of feasibility. To evaluate and improve the performance of the NLP algorithm on the dataset, we completed multiple rounds of SME evaluation (GSB and RJT). Across all patient encounters, the NLP algorithm identified 6387 pain mentions (mean 1.45 pain mentions per record) and 13 827 drug mentions.
After development, we evaluated performance on the closed PELICAN corpus using the AeroText Answer Key Editor. The SMEs separately corrected 32 automatically annotated full text clinical encounter records randomly selected from the entire study set to create ‘answer keys’. These 32 records contained 207 mentions of pain. The NLP developers had no influence on the correction of the annotations. Inter-annotator agreement on pain mention (exact token match in the text and normalized concept name), pain start and end date (exact match), body location of pain (exact match), and pain severity integer is shown in table 2A, B. We assessed inter-annotator agreement by scoring one annotation set against the other. The entire team then met to discuss and adjudicate the two sets of corrections. Pain mentions on which there were disagreements were resolved to form the gold standard answer key; see online appendix table 6 for examples. We then assessed system performance compared against the gold standard answer key, requiring correct answers (region of text, normalized concept name, body location, and pain severity integer) to be exact matches. Recall is the percentage of pain mentions in the record that were correctly identified by the NLP system. Precision is the percentage of pain mentions identified by the NLP system that are correct. F-measure is the harmonic mean of precision and recall, and provides a measure of overall accuracy. F-measure for pain mention detection was 0.95, and for overall average pain severity assignment was 0.81 (see also table 3).
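The exact-match scoring defined above can be expressed compactly. The following Python sketch is not the AeroText scoring tool. It assumes, for illustration, that each annotation is represented as a hashable tuple such as (start offset, end offset, concept name, body location, severity); that representation and the function name are hypothetical.

```python
def score_exact_match(gold, predicted):
    """Return (precision, recall, f_measure) under exact-match scoring.

    gold, predicted: iterables of hashable annotation tuples. A predicted
    annotation counts as correct only if an identical tuple is in the gold set.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Because the match is on the whole tuple, an annotation with the correct text span but the wrong severity counts as both a missed gold annotation and an incorrect prediction, which is consistent with the exact-match requirement described in the text.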
We further evaluated the generalizability of our NLP methods using a blind test set from 889 unannotated, deidentified discharge summaries from i2b2.31 Detailed methods are provided in the online appendix. A test set of 30 discharge summaries (containing 111 pain mentions) was chosen and kept unknown to the NLP developers at all times during the evaluation process. The remaining i2b2 records were designated the ‘development set.’ Ground truth was created using the same process as the PELICAN evaluation, with the added control that the annotation process was supervised by a developer not involved in the project. Inter-annotator agreement on the i2b2 corpus is shown in table 2C, D. The SMEs adjudicated each disagreement to obtain an approved gold standard. Several differences were noted by the SMEs between the i2b2 and the PELICAN clinical record corpora, including an increased frequency of ambiguously dated pain mentions in the i2b2 corpus, as shown by the low inter-annotator agreement for start and end dates in table 2C. Further discussion can be found in the online appendix, as can the adjudicated annotations for our 30-report i2b2 test set.
The NLP system was run, as built, on the blind i2b2 test set and scored against the approved gold standard using the AeroText scoring tool. The initial extraction F-measure for pain mentions in the new test set was 0.87; see appendix for complete scores. A 10-hour development process was then conducted to adjust for stylistic differences in the new corpus. The system was scored again, and the system F-measures on pain mentions and pain severity increased to 0.90 and 0.81, respectively.
Date association accuracy was significantly lower than for the PELICAN corpus, falling for start date from 0.90 for PELICAN to only 0.64 for i2b2. We believe that this was the result of a larger number of ambiguous date references in the i2b2 corpus and differences in the annotation guides used by the SMEs to annotate the two corpora; see online appendix for further discussion.
Post-development measures of the NLP extraction over the i2b2 corpus are given in table 3; the final NLP extraction of pain in the i2b2 test set is given in the online appendix. Developers remained blind to the test set throughout the development process. The blind evaluation on an independent dataset showed that, with 10 h of development time to adjust for corpus stylistic differences, the NLP system developed for this project is generalizable beyond the PELICAN corpus.
Overall, pain increased markedly during the last 2 years of life (figure 2). Metastatic prostate cancer was the listed cause of death in all study cases, and none of the subjects was found to have significant additional contributing causes of death. In the final year of life, subject pain index varied widely, from 0.3 to 1.6, with a roughly equal distribution of subjects across this spectrum. The five African-American study subjects clustered at the high end of the pain index spectrum (range 1.3–1.6) (table 4).
The system detected no severe or controlled pain in two subjects (8 and 30). The number of clinical encounter records available per year between diagnosis and death for these two subjects was 32 and 19, indicating that the lack of severe pain reports in these two subjects was not due to a lack of clinical encounters. We found no evidence that these subjects died earlier in the course of their disease from non-cancer causes. Since bone pain is the major source of pain in men with metastatic prostate cancer, we reviewed bone scan findings in these two subjects, and both demonstrated widespread bone changes consistent with metastatic prostate cancer, similar to scan results from all other study patients.
In the initial univariate analysis, all considered variables except for receipt of definitive radiation and maximum recorded BMI correlated significantly with severe pain. African-American ethnicity was borderline associated with severe pain (OR 1.5, p=0.09). Receipt of opiates (OR 25.6, p<0.001), palliative radiation (OR 13.8, p<0.0001), and being in the last year of life (OR 9.9, p<0.001) were strongly associated with severe pain. See online appendix for detailed univariate analysis results.
In the multivariate analysis, only five of the 12 remaining factors were significantly associated with severe pain (p<0.1): receipt of palliative radiation, opioids, or chemotherapy; being in the last year of life; and the number of outpatient visits (table 4). Receipt of non-steroidal anti-inflammatory drugs (NSAIDs), corticosteroids, or sex-steroid-manipulating drugs was not significantly associated with severe pain. These findings are consistent with current clinical practice, where palliative radiation44 45 and opioids46 are treatments typically reserved for severe pain, and NSAIDs, corticosteroids, and sex-steroid drugs are used more generally across the pain spectrum.46 Similarly, the last year of life is clinically known to be when severe pain is most common46 and when clinical encounters are most frequent. The multivariate model found no significant association with increasing serum PSA concentration, age at diagnosis, or decline in BMI to <90% maximum after controlling for the effects of time. There was a non-significant trend associating African-American ethnicity with more severe pain.
The model and findings were robust, explaining 83% of the variability in the data. When we excluded six patients who were seen only in a community setting and who had fewer recorded clinical encounters, the patterns of association remained unchanged. Moreover, when we removed from the model all variables that were not significant in the univariate analysis, the strengths of the associations (adjusted ORs) of the remaining variables and p values changed only marginally.
In multivariate regression analysis, pain status detected by NLP correlated statistically with parameters clinically known to be associated with increased pain. Conversely, pain status detected by NLP was not associated with parameters not expected to be clinically associated with pain status, such as administration of definitive radiation with curative intent. These results suggest that meaningful NLP-based pain status monitoring is feasible. While this project used a rule-based NLP system, machine-learning-based NLP tools should be tested in future work.
Text in longitudinal data is valuable for the study of symptoms such as pain, where the clinical unstructured description may be more complete than it is in structured data.28 NLP techniques convert such unstructured data into structured data, which is typically more amenable to rigorous analysis and display.
Relief of pain is essential in the management of many acute and chronic diseases, and convenient automated monitoring of patient pain status could provide a valuable new tool for improving quality of life and care. Real-time, easy-to-interpret views of the pain status history of an individual patient or a group of patients, as shown in figure 2 and online appendix figure 2, could allow busy clinicians to identify patients most in need of increased pain management intensity, and allow researchers to perform visual and quantitative comparison of groups of subjects participating in clinical trials of novel therapies or novel clinical interventions.
NLP-based determination of pain status may help to identify clinically significant molecular differences between prostate cancers. For example, a study of the molecular differences in the cancers of the two men who apparently experienced no severe pain could provide important clues to the biological determinants of severe pain in metastatic prostate cancer. Similarly, the trend toward increased pain experienced by the five African-Americans compared with the other men in the study is consistent with an oncology clinical trial which found that African-American men were more likely than white men to have extensive disease and bone pain.47
Our study has several limitations. First, the dataset was relatively small, covering just 33 patients. Second, it was difficult to distinguish pain mentions that were not related to the subjects’ metastatic prostate cancer. Although SME review of the records revealed only rare examples of pain not related to prostate cancer in the current study, future studies should implement formal methods to identify and link pain to a relevant disease source. Third, it was difficult to distinguish pain control status from the patient's current experience of pain. This study defaulted to ‘controlled’ pain as one of the pain categories because there were multiple records where the patient was noted to be taking opioids for pain, but no current pain level was provided. Fourth, we may have slightly biased our annotation of the PELICAN corpus by using system outputs to initialize annotations. This technique has been shown to improve consistency, reduce annotation time,48 and improve inter-annotator agreement.49 50 We minimized possible bias by having the annotators work independently and by submitting the results to team scrutiny and collaborative discussion. In essence, the answer key was generated by compiling the answers of four (overlapping) SMEs: two humans, the system itself, and the team as a whole. The similar evaluation results obtained on the separate i2b2 corpus, which used isolated test and development sets, suggest that any bias was minimal. Finally, the use of proprietary ClinREAD and AeroText NLP software may limit reproducibility. However, this limitation is at least partially mitigated by our provision of detailed rules, as well as results from our analysis of the i2b2 corpus. Investigators interested in further analysis of the study dataset using other methods and under appropriate confidentiality protection are invited to contact the senior author.
Electronic health records have greatly facilitated detection and understanding of disease phenotypes and their relationship with genetic and non-genetic factors.51–55 The study reported here, which we believe to be the first to use NLP to obtain longitudinal pain status information in a cohort of patients, shows that NLP-based monitoring of patient pain status is feasible and generalizable to new datasets, and provides a number of phenotype-oriented observations useful for guiding future research. Future studies should focus on comparison of pain-status tracking by NLP versus other validated pain survey tools, and on practical integration of the two methods in settings where electronic health records are in routine use.
We thank the men and their families who participated in the PELICAN integrated clinical–molecular autopsy study of prostate cancer. We also thank Mario A Eisenberger, Michael A Carducci, V Sinibaldi, T B Smyth, and G J Mamo for oncologic and urologic clinical support. M Rohrer and M Padmanaban provided database support. W B Isaacs facilitated initial development of the PELICAN study by GSB. Deidentified i2b2 clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr Ozlem Uzuner, i2b2 and SUNY. We also thank the JAMIA reviewers for helpful suggestions about article content.
Contributors: NHH, RJT, LS, JAH, LCC, and GSB collaboratively directed and designed the study (LCC retired in 2010). NHH conducted the NLP analyses, with assistance from DA. LS and RL conducted the statistical analyses. RL managed data integration and developed the graphic representations. NHH, RJT, LS, JAH and GSB wrote the manuscript. GSB proposed the current study and graphic representations, and founded, designed and directs the PELICAN (Project to ELIminate lethal prostate CANcer) project.
Funding: Autopsy study of prostate cancer support 1994–1998 from CaPCURE. Support for natural language processing project from Lockheed Martin Information Systems and Global Solutions.
Competing interests: None.
Ethics approval: This study was conducted with the approval of Johns Hopkins Medicine Institutional Review Board.
Provenance and peer review: Not commissioned, externally peer reviewed.
Open Access: This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/