Current clinical research is limited by a labor-intensive subject selection process, which has become a formidable obstacle to conducting broad and deep studies and drawing powerful conclusions. An HTCP algorithm leverages machine-processable EMR data to reduce this inefficiency. Patients are often seen at multiple medical centers, so a single medical center does not have a patient's complete medical data when executing an algorithm. To our knowledge, how this data fragmentation across healthcare providers affects the accuracy of an HTCP algorithm has not been previously investigated. Such an investigation is difficult to conduct because it requires access to multiple EMRs from heterogeneous sources at multiple medical centers. By taking advantage of the REP, we were able to conduct such an investigation.
When using the combined Mayo Clinic and OMC EMR data for the 12 740 eligible subjects, 6.0% (765) met the eMERGE T2DM algorithm inclusion criteria for T2DM subjects. This percentage is slightly lower than the prevalence of DM for all age groups in the USA (8.3%)24 because not all Olmsted County residents were tested for DM in the 2 years of the study.
Our results, combined with findings from other studies,8 show the advantage of access to more complete data for clinical research. In the present study, data fragmentation across healthcare centers resulted in incomplete data for any one EMR when the eMERGE T2DM algorithm was executed in Olmsted County, and that incompleteness substantially decreased the algorithm's accuracy.
For T2DM subject identification, we found categorization differences when using data from both centers relative to using data from either one alone. The differences were mainly the result of a large proportion of FN T2DM subjects (n=252; FNR, 32.9%). The 252 FN T2DM subjects differed from the 513 TP T2DM subjects with respect to age and sex distribution. This difference suggests that, for age/sex-matched designs, matching could be skewed when HTCP algorithms are applied to EMR data from a single medical center. Even though the eMERGE T2DM algorithm is reported to achieve a 98% PPV for identification of T2DM subjects compared with clinician review,16 we still identified 27 (5.0%) FP T2DM subjects because of data fragmentation across healthcare centers.
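As a worked check, the two percentages for T2DM identification can be recomputed from the counts reported above. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the FNR is FN/(TP+FN) and that the 5.0% figure is FP/(TP+FP) (that is, 1 − PPV), which is the definition that reproduces the reported value.

```python
# Counts reported in the text for T2DM subject identification.
tp = 513  # TP T2DM subjects
fn = 252  # FN T2DM subjects
fp = 27   # FP T2DM subjects


def false_negative_rate(tp: int, fn: int) -> float:
    """FN / (TP + FN): share of reference-standard subjects that were missed."""
    return fn / (tp + fn)


def false_positive_share(tp: int, fp: int) -> float:
    """FP / (TP + FP), i.e. 1 - PPV (assumed definition of the 5.0% figure)."""
    return fp / (tp + fp)


fnr = false_negative_rate(tp, fn)
fp_share = false_positive_share(tp, fp)

print(f"T2DM subjects in reference standard: {tp + fn}")  # 765
print(f"FNR: {fnr:.1%}")                                  # 32.9%
print(f"FP share: {fp_share:.1%}")                        # 5.0%
```

Under these assumptions, TP + FN also recovers the 765 T2DM subjects identified with the combined data.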
For non-DM subject identification, we also found categorization differences when using data from both centers relative to using data from either one alone. The differences were mainly the result of a large proportion of FN non-DM subjects (n=1573; FNR, 37.0%). Even though the eMERGE T2DM algorithm is reported to achieve a 100% PPV for identification of non-DM subjects compared with clinician review,16 we still identified 215 (7.4%) FP non-DM subjects because of data fragmentation across healthcare centers.
Incomplete diagnosis data were the main reason for FP errors and accounted for all FP T2DM subjects. Absent laboratory results and incomplete diagnosis data led to the majority of FP non-DM subjects. FNs were caused by incomplete diagnosis data, laboratory values, or prior medications. We also found that 53 FN T2DM subjects (21%) and 499 FN non-DM subjects (32%) were missed because they had made fewer than two clinical visits during the study period. As the time frame we used was 2 years, which is much broader than the recommended frequency of T2DM visits (3–6 months),24 these insufficient clinical visits must have resulted from data fragmentation across centers as well.
The misclassification errors caused by data fragmentation could lead to sampling bias and risk serious distortions in the findings of resulting studies.26 These outcomes should be carefully considered by clinical researchers when developing or executing an algorithm. The ultimate solution to the data fragmentation problem is to integrate EMR systems across healthcare centers. However, achieving such an ambitious goal requires overcoming serious technological challenges and addressing complex ethical issues. Some Beacon projects funded by the Office of the National Coordinator (ONC) are prototyping solutions to this problem.27
Clinical narratives (unstructured clinical data) document detailed descriptions of a patient's diseases and may contain data from other healthcare centers. This additional information can be extracted using natural language processing techniques and turned into normalized data for further analysis with other advanced techniques, such as data mining.28 The patterns discovered could then be reviewed and adopted in subject selection criteria. This approach may work, with the caveat that the additional data must be relevant to the condition under study. Our previous work, along with other studies, has shown its potential for subject selection tasks.6
Several issues regarding the study design should be considered when interpreting the findings. Because of unavoidable random or systematic errors (eg, physician experience, communication quality between the patient and the clinician, and coding quality), it is extremely difficult to ascertain a patient's actual condition or establish a true gold standard.33 Validating the algorithm's distinction between T1DM and T2DM against medical review would require manual effort and information from the time of DM onset34 and was beyond the scope of the present study. In this study, our gold standard was based on classifications using 2 years of EMR data from the two major healthcare centers in Olmsted County. Because most Olmsted County residents receive their healthcare at these two centers and the observation window we chose is much broader than the recommended frequency of T2DM visits, this is a pragmatic gold standard for this study.
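To make the role of this pragmatic gold standard concrete, the sketch below shows one way single-center classifications could be scored against classifications derived from the combined data. It is a minimal illustration under stated assumptions, not the procedure used in the study: the function name, the subject identifiers, and the toy labels are all hypothetical.

```python
from collections import Counter


def score_against_reference(reference, single_center, target):
    """Tally TP/FN/FP/TN for one category (e.g. "T2DM").

    reference:     subject id -> label from combined data (pragmatic gold standard)
    single_center: subject id -> label from one center's data alone
    target:        the category being scored
    """
    counts = Counter()
    for subject_id, ref_label in reference.items():
        site_label = single_center.get(subject_id)  # None if subject is unseen
        in_ref = ref_label == target
        in_site = site_label == target
        if in_ref and in_site:
            counts["TP"] += 1
        elif in_ref:
            counts["FN"] += 1  # missed when only one center's data are used
        elif in_site:
            counts["FP"] += 1  # selected only because data are incomplete
        else:
            counts["TN"] += 1
    return counts


# Hypothetical toy labels for five subjects (not study data).
reference = {"s1": "T2DM", "s2": "T2DM", "s3": "non-DM", "s4": "other", "s5": "T2DM"}
single_center = {"s1": "T2DM", "s2": "other", "s3": "non-DM", "s4": "T2DM", "s5": "T2DM"}

counts = score_against_reference(reference, single_center, "T2DM")
print(dict(counts))
```

In this toy example, subject s2 is an FN and subject s4 is an FP for the T2DM category, mirroring the two kinds of misclassification discussed above.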
Our results may not generalize to large metropolitan areas. Our study setting is a sparsely populated, relatively isolated county in southeastern Minnesota. The residents of Olmsted County have fewer options for healthcare centers than people living in a large metropolitan area. Thus the misclassification errors that we found by comparing the selected categorizations are most likely smaller than would be seen in a more typical setting. Also, this study focuses only on how an HTCP algorithm is affected by incomplete data due to data fragmentation across healthcare centers. It does not investigate the impact of incomplete data due to other factors, for example, insufficient longitudinal data, which is a topic for another study (unpublished material, Wei W, 2011). In addition, the scope of our study is limited to the eMERGE T2DM algorithm. For a more complete evaluation of the impact of data fragmentation on HTCP algorithms, this study needs to be repeated at different geographic locations, over various periods of observation, and on a wide spectrum of HTCP algorithms.