An overview of our approach is outlined in . Starting with the complete EMR (narrative + codified data), we: (1) created an RA database (RA Mart) of all possible RA patients; (2) randomly selected 500 subjects from the RA Mart for medical record review to develop a training set of RA and non-RA cases; (3) developed and trained 3 classification algorithms on the training set; (4) applied the 3 classification algorithms to the RA Mart to obtain the predicted RA cases; and (5) validated the classification algorithms by performing medical record reviews on 400 of the predicted RA cases (the validation set) to confirm RA status and determine the positive predictive value (PPV). Steps 3-5 were conducted for each algorithm: narrative + codified EMR data (complete), codified EMR data only, and narrative EMR data only.
Data source
We studied the Partners HealthCare EMR, which is used by two large hospitals, Brigham and Women's Hospital (BWH) and Massachusetts General Hospital (MGH), which together care for approximately 4 million patients in the Boston metropolitan area (Massachusetts, USA). The EMR began on October 1, 1996 for BWH and October 3, 1994 for MGH. To build an initial database of potential RA subjects (‘RA Mart’), we selected all subjects with ≥1 ICD9 code for RA and related diseases (714.xx) or subjects who had been tested for antibodies to cyclic citrullinated peptide (anti-CCP) (). Subjects who were deceased or aged <18 years at the time of the RA Mart creation (June 5, 2008) were excluded. In total 29,432 subjects had at least one ICD9 code for RA (714.xx) (n=25,830) or had been tested for anti-CCP (n=3,602) (4,283 subjects had at least one ICD9 code for RA and had anti-CCP checked). The Partners Institutional Review Board approved all aspects of this study.
Codified EMR data
We used the following codified data in our analysis: ICD9 codes, electronic prescriptions, and anti-CCP and rheumatoid factor (RF) laboratory values. The ICD9 codes included RA and related diseases 714.xx (excluding juvenile idiopathic arthritis/juvenile rheumatoid arthritis (JRA) codes), systemic lupus erythematosus (SLE) 710.0, psoriatic arthritis (PsA) 696, and JRA 714.3x (abbreviated as RA ICD9, SLE ICD9, PsA ICD9, and JRA ICD9, respectively). Because a single visit could result in multiple tests and notes, leading to multiple codes for the same day, we eliminated codes that occurred less than one week after a prior code. In our analysis, RA ICD9 was analyzed in two forms: (1) the number of RA ICD9 codes for each subject at least one week apart (RA ICD9) and (2) the normalized number of RA ICD9 codes, which is the natural log of the number of RA ICD9 codes for each subject at least one week apart. We determined which subjects were RF and anti-CCP positive according to the cutoffs at each hospital laboratory. The presence of a coded medication signifies that a patient was prescribed the medication by a physician using a computerized prescription program embedded within our EMR or had the medication entered onto a medication list maintained by a physician. It does not signify that the prescription was actually filled, as patients can take prescriptions to any pharmacy. The coded medications assessed in this study included the disease modifying anti-rheumatic medications (DMARDs): methotrexate, azathioprine, leflunomide, sulfasalazine, hydroxychloroquine, penicillamine, cyclosporine, and gold. Biologic agents included the anti-tumor necrosis factors (anti-TNF) infliximab and etanercept, and other agents including abatacept, rituximab and anakinra. Adalimumab was not available as coded data in our system.
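The one-week rule and the log normalization described above can be illustrated with a minimal sketch. It assumes that each code is compared with the most recently retained code (the text does not say whether the comparison is with any prior code or only a retained one), and that a subject with zero RA ICD9 codes has no normalized value; both choices, and all function names, are illustrative rather than the study's implementation.

```python
from datetime import date, timedelta
from math import log


def count_codes_one_week_apart(code_dates, min_gap_days=7):
    """Count codes, dropping any code dated less than min_gap_days
    after the most recently retained code (assumed reading of the rule)."""
    kept = []
    for d in sorted(code_dates):
        if not kept or (d - kept[-1]) >= timedelta(days=min_gap_days):
            kept.append(d)
    return len(kept)


def normalized_count(n_codes):
    """Natural log of the code count. Zero codes has no log, so None is
    returned here; how the study handled zero counts is not stated."""
    return log(n_codes) if n_codes > 0 else None


# Hypothetical subject: three codes within the first week, then two later codes.
dates = [
    date(2005, 1, 3),
    date(2005, 1, 3),   # same day as the first code: dropped
    date(2005, 1, 5),   # 2 days after the retained code: dropped
    date(2005, 1, 10),  # exactly 7 days after: retained
    date(2005, 3, 1),   # retained
]
n_codes = count_codes_one_week_apart(dates)
n_codes_normalized = normalized_count(n_codes)
```

In this example the subject contributes 3 RA ICD9 codes and a normalized value of ln(3).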
To provide an index of medical care utilization, we assessed the number of ‘facts’, which is related to the number of medical entries a subject has in the EMR. Examples of a fact include: a physician visit, a visit to the laboratory for a blood draw, a visit to radiology for an X-ray.
Narrative EMR data and natural language processing (NLP)
We used five types of notes to extract information from narrative data: health care provider notes, radiology reports, pathology reports, discharge summaries, and operative reports. We utilized natural language processing (NLP) to extract clinical variables from the narrative data entered in a typed format (no scanned hand-written notes were used). We used the Health Information Text Extraction (HITEx) system (19) to extract the clinical information from narrative text. HITEx is an open source NLP tool written in Java and built on the General Architecture for Text Engineering (GATE) framework (20). The NLP application determines the structure of unstructured text records and outputs an annotated document tagging variables of interest (further details provided in Zeng et al., 2006 (19)).
The variables included broad concept terms such as disease diagnoses (RA, SLE, PsA, JRA), medications (listed above, with the addition of adalimumab), laboratory data (RF, anti-CCP, the term seropositive) and radiology findings of erosions on x-rays. We extracted these variables from the narrative data and created coded NLP variables for the number of mentions per subject as well as dichotomous variables for each disease diagnosis, medication, laboratory test result, and erosions on x-rays. To account for variability in language usage, a variety of specific phrases can be defined and then collapsed into a single concept term for analysis. The clinicians on the team developed lists of terms to be used for each NLP query. Further analysis was performed to determine whether variables were positive or negative. For example, a patient was flagged as CCP positive by NLP if terms such as ‘anti-CCP+’ or ‘CCP positive RA’ were found in their records. For RF, anti-CCP, seropositive and erosions, a negation finding algorithm was used to distinguish subjects who were positive from those who were negative for the variable; for example, the algorithm could distinguish a subject who was anti-CCP positive from one who was anti-CCP negative.
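To make the idea of collapsing phrases into a concept term and separating positive from negative mentions concrete, a toy sketch follows. It is not HITEx and does not reproduce the clinician-developed term lists or the negation algorithm used in the study; the phrase lists, cue words, and sentence-level scope are all hypothetical simplifications.

```python
import re

# Hypothetical phrase lists; each phrase collapses to one concept term.
CONCEPT_PHRASES = {
    "anti-CCP": ["anti-ccp", "anti ccp", "ccp", "cyclic citrullinated peptide"],
    "RF": ["rheumatoid factor", "rf"],
}
# Hypothetical cue words; negative cues are checked first so that
# "not positive" is not mistaken for a positive mention.
NEGATIVE_CUES = ["negative", "not positive", "no evidence of"]
POSITIVE_CUES = ["positive", "+", "elevated"]


def find_concepts(sentence):
    """Return the concept terms whose phrases appear as whole tokens."""
    text = sentence.lower()
    found = []
    for concept, phrases in CONCEPT_PHRASES.items():
        for phrase in phrases:
            pattern = r"(?<![a-z])" + re.escape(phrase) + r"(?![a-z])"
            if re.search(pattern, text):
                found.append(concept)
                break
    return found


def classify_mention(sentence):
    """Crude sentence-level status: 'negative', 'positive' or 'unknown'."""
    text = sentence.lower()
    if any(cue in text for cue in NEGATIVE_CUES):
        return "negative"
    if any(cue in text for cue in POSITIVE_CUES):
        return "positive"
    return "unknown"


examples = [
    "Patient has anti-CCP+ RA.",
    "Rheumatoid factor was negative.",
    "RF checked today.",
]
results = [(find_concepts(s), classify_mention(s)) for s in examples]
```

A sentence-level cue search of this kind would mislabel sentences that mention two tests with different results, which is one reason a dedicated negation finding algorithm is needed in practice.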
Two reviewers (KPL and RMP) assessed the precision of select NLP concepts: anti-CCP positive, RF positive, seropositive, erosions, methotrexate and etanercept. For each concept, one sentence containing the concept was selected from each of 150 randomly selected subjects with records containing the concept. The reviewers assessed whether the extracted concept was correct in the context of the sentence. We assessed two categories of NLP concepts. The first assessment of precision asks whether a concept was identified appropriately from the physician note within a specific sentence. Concepts in this group include disease diagnoses and medications. A patient was scored as ‘correct’ for methotrexate by NLP if the term methotrexate was present in the sentence extracted from the medical record. This includes instances where the medication was prescribed, held, or contemplated, or where the subject had taken the medication in the past. The second assessment of precision requires that the patient have a positive result. This pertains to the concepts RF, anti-CCP, seropositive, and erosions. We scored the NLP as correct for ‘RF positive’ only if the patient was also found to be RF positive (RFpos) on review of the sentence extracted from the medical record. We scored the NLP as incorrect if RF was mentioned with no evidence in the sentence that the patient was RFpos. An example of how precision (equivalent to positive predictive value) was calculated is as follows:
Precision = (# sentences RFpos by NLP and confirmed as RFpos on review)/(# sentences RFpos by NLP)
The precision of NLP concepts was high: erosions, 88% (95% CI: 84, 91%); seropositive, 96% (95% CI: 95, 97%); CCP+, 98.7% (95% CI: 98, 99%); RF+, 99.3% (95% CI: 99.1, 99.4%); methotrexate, 100%; and etanercept, 100%.
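The precision formula above can be written as a short function. The text does not state how the 95% CIs were obtained, so the Wilson score interval used here is an assumption, and the counts in the example are hypothetical and are not study data.

```python
from math import sqrt


def precision_with_ci(n_confirmed, n_flagged, z=1.96):
    """Precision = confirmed / flagged, with a Wilson score interval.

    The interval method is an assumption; the source does not name one.
    """
    p = n_confirmed / n_flagged
    denom = 1 + z ** 2 / n_flagged
    centre = (p + z ** 2 / (2 * n_flagged)) / denom
    half_width = (
        z * sqrt(p * (1 - p) / n_flagged + z ** 2 / (4 * n_flagged ** 2)) / denom
    )
    return p, centre - half_width, centre + half_width


# Hypothetical review: 100 sentences flagged RFpos by NLP, 90 confirmed on review.
precision, ci_low, ci_high = precision_with_ci(n_confirmed=90, n_flagged=100)
```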
Training set of 500 subjects
We established a training set of 500 subjects randomly selected from the RA Mart for medical record review. To establish the gold standard diagnosis, two rheumatologists (KPL and RMP) reviewed the medical records for the presence of the 1987 American College of Rheumatology Classification Criteria for RA (21) and classified subjects as definite RA, possible/probable RA, or not RA. Definite RA was defined as subjects who had a rheumatologist’s diagnosis of RA and supporting clinical data such as records describing synovitis, erosions, or greater than one hour of morning stiffness. Possible RA was defined as subjects with persistent inflammatory arthritis with RA in the differential diagnosis by a physician. Subjects with a diagnosis of RA by a physician, but insufficient supporting information of clinical signs and symptoms of the disease, were also classified as possible RA. Finally, subjects with an alternate rheumatologic diagnosis or whose diagnosis was unclear were considered to not have RA.
For our training set, subjects classified as definite RA were considered ‘RA cases’, while subjects classified as possible RA or as not having RA were classified as ‘controls’. Eighty-one percent of RA cases had sufficient information from the EMR to fulfill the 1987 ACR Classification Criteria for RA (21). This is consistent with the published sensitivity of the 1987 ACR criteria, which ranges from 80-90% when compared to the gold standard of a rheumatologist’s diagnosis of RA (21, 22). Both RMP and KPL reviewed the same 20 subjects to assess percent agreement, and were in 100% agreement on the final diagnosis.
Classification algorithm: selecting informative variables and assigning parameters
We used penalized logistic regression to develop a classification algorithm to predict the probability of having RA (23, 24). To avoid over-fitting the model, we used the adaptive LASSO procedure, which simultaneously identifies influential variables and provides stable estimates of the model parameters (25). The optimal penalty parameter was determined based on the Bayesian Information Criterion (BIC). We developed three different algorithms using (i) codified EMR variables only; (ii) narrative EMR variables only; and (iii) complete variables (narrative + codified). All three models were adjusted for age and gender and all predictors were standardized to have unit variance. The predicted probabilities based on these models were used to classify subjects as having RA.
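For reference, the adaptive LASSO for logistic regression and the BIC are commonly written as below. This is the standard textbook form, not a formula taken from the study: the initial estimate \tilde{\beta}, the exponent \gamma, the definition of the degrees of freedom df_\lambda (often the number of non-zero coefficients), and whether the age and gender terms were left unpenalized are not specified in the text and are assumptions here.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta}
    \left\{
      -\sum_{i=1}^{n}
        \left[ y_i\, x_i^{\top}\beta - \log\!\left(1 + e^{x_i^{\top}\beta}\right) \right]
      + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde{\beta}_j|^{\gamma}}
    \right\},
\qquad
\mathrm{BIC}(\lambda)
  = -2\,\ell\!\left(\hat{\beta}(\lambda)\right) + \mathrm{df}_{\lambda}\,\log n
```

Here y_i is the RA case indicator for subject i, x_i the vector of standardized predictors, \ell the logistic log-likelihood, and the penalty parameter \lambda is chosen to minimize BIC(\lambda).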
We selected the threshold probability value for classifying RA by setting the specificity level at 97% for all 3 algorithms. Subjects whose predicted probability exceeded the threshold value were classified as having RA, denoted by Alg. To assess the overall accuracy of these algorithms in classifying RA with the training data and to estimate the threshold value for Alg, we used three-fold cross-validation repeated 50 times to correct for potential over-fitting bias. Furthermore, we used the bootstrap method to estimate the standard error and obtain confidence intervals for the accuracy measures. The predictive accuracy of the algorithm to classify RA vs. non-RA was subsequently validated using a separate validation set.
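Fixing the specificity and reading off the corresponding threshold can be sketched as follows. This is a simplified illustration on a single set of hypothetical control probabilities; the repeated cross-validation used in the study to estimate the threshold is not reproduced, and the function names are illustrative.

```python
import math


def threshold_for_specificity(control_probs, target_specificity=0.97):
    """Smallest observed control probability such that at least
    target_specificity of controls fall at or below it."""
    ordered = sorted(control_probs)
    k = math.ceil(target_specificity * len(ordered)) - 1
    return ordered[k]


def classify_as_ra(prob, threshold):
    """A subject is classified as RA when the probability exceeds the threshold."""
    return prob > threshold


def specificity(control_probs, threshold):
    """Fraction of controls that are not classified as RA."""
    negatives = sum(1 for p in control_probs if not classify_as_ra(p, threshold))
    return negatives / len(control_probs)


# Hypothetical predicted probabilities for 100 controls (not study data).
control_probs = [i / 100 for i in range(100)]
cutoff = threshold_for_specificity(control_probs)
achieved_specificity = specificity(control_probs, cutoff)
```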
Validation of classification algorithm and assessment of sensitivity, specificity and PPV
Once the classification algorithm was established, we applied it to the remaining RA Mart and assigned a probability of RA to each subject. To validate the performance of the classification algorithm, we randomly sampled an independent set of 400 subjects (validation set) from the subset of subjects who were classified as RA (Alg by any of the three algorithms). These cases were then validated through a blinded medical record review for RA by two rheumatologists (KPL and RMP). The sensitivity, specificity and PPV were calculated using the following formulas:
PPV = (Number of Alg subjects confirmed as RA on medical record review)/(Number of Alg subjects)
Sensitivity = (PPV × PAlg)/PRA
Specificity = 1 - [((1 - PPV) × PAlg)/(1 - PRA)]
PAlg is the proportion of subjects identified by the algorithm as having RA in the RA Mart, and PRA is the RA prevalence estimated from the training set. Sampling the validation set from the subset of subjects who were classified as RA can improve the precision in estimating PPV, which is the primary accuracy parameter and outcome of interest.
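The sensitivity formula follows from Bayes' rule, P(Alg | RA) = P(RA | Alg) × P(Alg)/P(RA), and the specificity formula from the same rule applied to non-RA subjects. A minimal sketch of the calculation is given below; the input values are hypothetical and are not study results.

```python
def accuracy_from_ppv(ppv, p_alg, p_ra):
    """Derive sensitivity and specificity from PPV, the proportion of
    subjects classified as RA (PAlg) and the RA prevalence (PRA)."""
    sensitivity = (ppv * p_alg) / p_ra
    specificity = 1 - ((1 - ppv) * p_alg) / (1 - p_ra)
    return sensitivity, specificity


# Hypothetical inputs: PPV 0.90, 20% of the mart classified as RA, prevalence 25%.
sens, spec = accuracy_from_ppv(ppv=0.90, p_alg=0.20, p_ra=0.25)
# sens is 0.72 and spec is approximately 0.973 for these inputs.
```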
To assess and compare the difference in accuracy between the 3 algorithms, we compared their PPV values and obtained confidence intervals (CIs) using the validation data:
Difference in PPV = PPV (complete algorithm) - PPV (codified variables only algorithm);
Difference in PPV = PPV (complete algorithm) - PPV (narrative variables only algorithm)
The differences in PPV were considered significant if the 95% CI did not include zero. Although the PPV values of the 3 algorithms can be compared, the 95% CIs associated with the individual PPV values (in contrast to the CI for the difference in PPV) cannot, since these estimates were derived from the same validation set of 400 subjects for all 3 algorithms.
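One way to obtain a CI for a difference in PPV when both algorithms are evaluated on the same subjects is to resample subjects and recompute both PPVs on each resample. The text does not state how the CI for the difference was computed, so the percentile bootstrap below is an assumption, and the subject records are hypothetical.

```python
import random


def ppv(subjects, flag_key):
    """PPV among subjects flagged by one algorithm; None if none are flagged."""
    flagged = [s for s in subjects if s[flag_key]]
    if not flagged:
        return None
    return sum(s["ra"] for s in flagged) / len(flagged)


def bootstrap_ppv_difference(subjects, key_a, key_b, n_boot=2000, seed=0):
    """Point estimate and percentile 95% CI for PPV(key_a) - PPV(key_b).

    Subjects are resampled as whole records so that the dependence between
    the two algorithms on the same validation set is preserved.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample = rng.choices(subjects, k=len(subjects))
        a, b = ppv(sample, key_a), ppv(sample, key_b)
        if a is not None and b is not None:
            diffs.append(a - b)
    diffs.sort()
    low = diffs[int(0.025 * len(diffs))]
    high = diffs[int(0.975 * len(diffs)) - 1]
    return ppv(subjects, key_a) - ppv(subjects, key_b), low, high


def make(n, ra, complete, codified):
    """Build n identical hypothetical subject records."""
    return [{"ra": ra, "complete": complete, "codified": codified} for _ in range(n)]


# Hypothetical validation records (not study data).
subjects = (
    make(80, True, True, True)
    + make(10, True, True, False)
    + make(5, False, True, True)
    + make(15, False, False, True)
)
diff, ci_low, ci_high = bootstrap_ppv_difference(subjects, "complete", "codified")
significant = ci_low > 0 or ci_high < 0  # 95% CI excludes zero
```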
For comparison, we also assessed the accuracy of criteria used in administrative database studies: ≥3 ICD9 codes for RA (8) and ≥1 RA ICD9 code plus at least one DMARD (5). We used the training set to generate these data as it allows for unbiased estimates of these simple criteria. To compare differences in accuracy between our algorithms and the simple criteria above, we also used the difference in PPV and 95% CI.
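The two simple criteria can be expressed as rules on a subject's codified data. The sketch below assumes the DMARD list given earlier for coded medications and lower-case medication names; whether the code count uses codes at least one week apart, and whether biologic agents count as DMARDs for this criterion, are not stated and are left as assumptions.

```python
# DMARDs listed in the codified data section; biologic agents are
# deliberately excluded here, which is an assumption.
DMARDS = {
    "methotrexate", "azathioprine", "leflunomide", "sulfasalazine",
    "hydroxychloroquine", "penicillamine", "cyclosporine", "gold",
}


def meets_three_codes(n_ra_icd9):
    """Criterion 1: at least 3 RA ICD9 codes."""
    return n_ra_icd9 >= 3


def meets_code_plus_dmard(n_ra_icd9, medications):
    """Criterion 2: at least 1 RA ICD9 code and at least one coded DMARD."""
    has_dmard = any(m.lower() in DMARDS for m in medications)
    return n_ra_icd9 >= 1 and has_dmard


# Hypothetical subject with one RA code and a methotrexate prescription.
subject = {"n_ra_icd9": 1, "medications": ["Methotrexate", "prednisone"]}
by_codes = meets_three_codes(subject["n_ra_icd9"])
by_code_and_dmard = meets_code_plus_dmard(subject["n_ra_icd9"], subject["medications"])
```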
Descriptive statistics
We assessed differences in characteristics between RA cases and controls in the training set using the t-test and the Wilcoxon Rank Sum test to compare differences between means and medians, respectively. The chi-square test was used for between-group comparisons of proportions and analysis of variance (ANOVA) for comparisons of multiple groups. P-values are two-sided.
Case only analysis
To assess whether our EMR RA cohort can replicate known associations among clinical variables, we performed a case only analysis to compare the risk of erosions in anti-CCP+ vs. anti-CCP- subjects and in RF+ vs. RF- subjects. We assessed the association between anti-CCP and radiographic erosions by including only those subjects in our database who had anti-CCP tested in the clinical laboratory (i.e., autoantibody status was derived from codified data). Similarly, we assessed the relationship between RF and erosions only among those who had RF tested. Odds ratios and 95% CIs were calculated using 2×2 contingency tables. All analyses were conducted with SAS software, version 9.2 (SAS Institute) and R (The R Project for Statistical Computing, http://www.r-project.org/).
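An odds ratio and 95% CI from a 2×2 contingency table can be computed as in the sketch below. The log-based (Woolf) interval is an assumption, since the text does not state which interval method was used, and the counts are hypothetical rather than study data.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-based (Woolf) 95% CI.

    a: antibody positive with erosions     b: antibody positive without erosions
    c: antibody negative with erosions     d: antibody negative without erosions
    All four cells must be non-zero for this method.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = exp(log(odds_ratio) - z * se_log_or)
    high = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Hypothetical table: erosions in 40 of 100 anti-CCP+ and 20 of 100 anti-CCP- subjects.
or_est, or_low, or_high = odds_ratio_ci(a=40, b=60, c=20, d=80)
```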