Body mass index (BMI) is an important outcome and adjustment covariate in many clinical association studies. Accurate assessment of BMI, therefore, is a critical part of many study designs. Electronic health records (EHRs) are a growing source of clinical data for research purposes, and have proven useful for identifying and replicating genetic associations. EHR-based data collected for clinical and billing purposes have several unique properties, including a high degree of heterogeneity or “clinical noise.” In this work, we propose a new method for reducing transcription and recording errors in height and weight measurements and apply it to a subset of the Vanderbilt University Medical Center biorepository known as EAGLE BioVU (n=15,863). After processing, we show that the distribution of BMI from EAGLE BioVU closely matches population-based estimates from the National Health and Nutrition Examination Surveys (NHANES), and that our approach retains far more data points than traditional outlier detection methods.
Genetic association studies increasingly require large numbers of DNA samples linked to a multitude of phenotypes, traits and exposures to fully discover and describe the complex genetic architecture of human disease. In recognition of this need, concerted efforts are being made to amass the needed data through a variety of mechanisms including traditional epidemiologic designs or more contemporary biobanking approaches. In the United States, while large cardiovascular or cancer epidemiologic collections exist, there is no plan for the ascertainment of a larger US population-based cohort for genetic association studies given the enormous financial investment required1. Instead, resources and effort have been directed towards combining existing smaller studies2 or partnering with health-care providers3, 4. The latter effort is receiving much support given the potential for practice-based biobanks to collect large numbers of study samples linked to data collected in a clinical setting as part of patient care5. The advantage of this approach is that large numbers of clinically relevant DNA samples are readily available to investigators for genetic association studies.
The gold standard of study design is the prospective longitudinal cohort study, where individuals are ascertained from a population at a baseline start date for multiple measures (generally collected using a questionnaire) and are followed over time with updates at regular intervals. These cohort studies are very difficult to execute; recruitment is challenging, individuals drop out as the study progresses, the scope and data collection methods must be chosen and fixed at baseline, and large numbers of individuals are needed for statistical power. By comparison, practice-based biobanks at major metropolitan medical centers provide an attractive alternative; large portions of the population can be ascertained, most medically relevant data are collected in a loosely standardized way, individuals are tracked over time, and cost is reduced due to burden-sharing with health care providers. The major drawbacks however are non-regular intervals of information update and non-uniform data collection due to differences in clinical practice. These two issues could be jointly considered as a problem of “clinical noise.”
While electronic health records (EHRs) are a rich source of phenotypic information, structured and free-text information from the EHR may require various degrees of processing to extract precise disease states. In the coarse sense, the presence of the same billing code from multiple distinct dates may be sufficient for phenotyping of some traits6, but others may require refinement to eliminate confounding factors7. Continuous measures, such as laboratory values, may require extensive processing to remove confounding factors including medication use, comorbid conditions, and biases in sampling due to lab requisition protocols. Even critical measures such as vital signs can have high rates of missingness8, and are subject to observational bias9.
One research variable that best illustrates the “clinical noise” problem inherent in biobanks linked to EHRs is body mass index (BMI). BMI is a well-established risk factor for type 2 diabetes, hypertension, asthma10, and various forms of cancer11, 12. BMI is a critical comorbidity for many clinical outcomes, and while this fact has been established by numerous epidemiological studies, the height and weight measurements that form the basis for this measure are prone to transcription and conversion errors within EHR systems. The quality of BMI data has been previously examined from clinical records, and despite having an accurate protocol for measuring weight and height, only 35% of patient visits had data collected properly, typically because the patient’s shoes were not removed prior to measurement13. However, measures were collected and recorded frequently (94.7% and 77.9% of the time for weight and height respectively). Wheelchair users are typically unable to stand for height measures with a stadiometer, forcing reliance on self-report14 or other less accurate measures15. Furthermore, reliance on self-reporting for weight and height has well-established biases13. This bias has a racial/ethnic-dependent component16 and varies with age, though studies conflict on the age effect16, 17. Even when height and weight are measured according to protocol, the results may not be recorded in consistent units across the clinic, and other studies using EHR data have required harmonization of units18.
While it is known that clinical noise is especially problematic for assessing body mass index from EHRs, few strategies have been proposed to address it. The most popular way to address this problem for BMI and other variables is manual curation. However, it is infeasible to extract and clean all height and weight data points manually, because nearly every clinic visit has a recorded value, resulting in a very large dataset (hundreds of thousands to millions of data points). Therefore, to enable the semi-automatic extraction of high quality height and weight data from EHRs to calculate BMI, we developed the Adjacency-based Longitudinal Outlier Extraction (ALOE) method and applied it to clinical records from a subset of the Vanderbilt University Medical Center’s biorepository known as EAGLE BioVU (n=15,863)19. ALOE takes advantage of the longitudinal nature of the EHRs and the expectations of changes in weight and height over time for a given age range. Overall, we demonstrate that our data extraction method extracts high quality height and weight data with less data loss than standard outlier approaches.
BioVU is the Vanderbilt University Medical Center (VUMC) biorepository linked to de-identified EHRs. BioVU, including the ethical and legal considerations, has been previously described for the adult clinics3 and pediatrics20. In brief, DNA is extracted from discarded blood samples drawn at VUMC outpatient clinics, and the DNA sample is linked to a de-identified version of the patient’s EHR known as the Synthetic Derivative (SD). The VUMC SD contains approximately 20 years of clinical data representing ~2.1 million patients. To date, BioVU contains more than 200,000 DNA samples linked to de-identified EHRs. As part of the larger Population Architecture using Genomics and Epidemiology I (PAGE I) study21, all DNA samples from minority (non-European descent) patients in BioVU as of 2011 were selected for study19. This subset of BioVU, referred to here as the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) BioVU, contains 15,863 DNA samples including DNA samples from African Americans (n=11,519), Hispanics (n=1,702), and Asians (n=1,118). Race/ethnicity in BioVU is administratively assigned and has been shown to be highly concordant with genetic ancestry among European Americans and African Americans22 but less so for other groups such as Hispanics23.
The National Health and Nutrition Examination Surveys (NHANES) are population-based cross-sectional surveys conducted by the National Center for Health Statistics at the Centers for Disease Control and Prevention. For each study participant, data on demographics, health, and lifestyle are collected. A physical exam is conducted by a CDC physician or health professional, and laboratory measures are assayed from blood and urine. Biospecimens for DNA extraction were collected beginning with phase 2 of NHANES III (between 1991 and 1994; n=7,159). DNA was also collected on consenting participants for NHANES 1999-2000 and 2001-2002 (n=7,839). NHANES is diverse and DNA samples were collected from self-described non-Hispanic whites (n=6,634), non-Hispanic blacks (n=3,458), Mexican Americans (n=3,950), and others (n=956). For this study, CDC-measured height and weight were accessed for participants with DNA samples from NHANES III, NHANES 1999-2000, and NHANES 2001-2002 for a total of 14,734 samples.
All procedures were approved by the CDC Ethics Review Board and written informed consent was obtained from all participants. Because no identifying information was accessed by the investigators, Vanderbilt University’s Institutional Review Board determined that this study met the criteria of “non-human subjects.”
We first examined the distributions of raw height and weight values to flag extreme unrealistic observations originating from transcription errors. Next, we divided the observations into those from obese and non-obese individuals. To identify obese individuals, clinical records were examined for the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) morbid obesity code (278.01) and/or mention of “obesity” in EHR clinical free text. Observations not having an obesity code or the obesity keyword were considered non-obese. This step was performed to disambiguate true distributional outliers from errors due to unit conversion or measure recording. Extreme outliers identified in non-obese individuals were manually investigated for validity and removed accordingly.
Measurements of height and weight are typically recorded at regular intervals in the course of clinical care, often multiple times per year. If we assume that errors in transcription and unit conversion are distributed uniformly over all recorded measures, the distribution of an individual’s height and weight measurements over a fixed time interval can be used to identify and correct errant measures. To evaluate observations across the longitudinal dataset, we generated a change-ratio distribution. A single index measurement was selected over a given year, and all subsequent measurements were divided by this index value to produce a ratio in the range [0, ∞). This ratio is examined relative to established unit conversions [pounds to kilograms, inches to centimeters, feet to centimeters, meters to centimeters] to identify unit inconsistencies. If the ratio is approximately 1, both observations are recorded in the same unit, and deviations from 1 provide an approximation of the unit mismatch. For our dataset, we assumed that most measurements are recorded in centimeters (for height) and kilograms (for weight). The index observational value was defined by testing the following conditions per year:
Once the index value is assigned, all available measurements are divided by the index value to generate the change ratio distribution over this time interval and to identify spikes in the distribution indicative of unit mismatches.
Each of these conversions is then resolved to the base units of centimeters and kilograms (Figure 1). All conversion value ranges were given standard deviations to account for a 30 lb. change in weight and a 6-inch deviation in height measurements over the course of a year. If this algorithm were applied over a shorter (months) or longer (5 year) time interval, these standard deviations would be adjusted to reflect natural changes in weight and height expected over that period. If values were outside the standard deviation, the corresponding measurements were set to missing due to lack of validity. Manual editing and specific conditioning were used to preserve data in the case of clear transcription errors, such as the addition of a zero (e.g. a person appears to gain 100 lbs. at one visit and lose 100 lbs. at the next visit).
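The ratio test and conversion back to base units can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `resolve`, the factor tables, the rule of picking the closest candidate, and the example values are all hypothetical, and the 30 lb. and 6-inch allowances are applied as simple absolute tolerances around the index value.

```python
LB_PER_KG = 2.20462262
CM_PER_IN = 2.54

# Expected change ratio (recorded value / index value) for each hypothesised
# unit problem, assuming the index is in kilograms or centimeters.
# The labels and the choice of candidates are illustrative assumptions.
WEIGHT_FACTORS = {
    "kg": 1.0,                             # same unit as the index
    "lb": LB_PER_KG,                       # recorded in pounds
    "kg_converted_as_lb": 1 / LB_PER_KG,   # kilograms pushed through lb -> kg
}
HEIGHT_FACTORS = {
    "cm": 1.0,
    "in": 1 / CM_PER_IN,
    "ft": 1 / (12 * CM_PER_IN),
    "m": 0.01,
}
WEIGHT_TOL_KG = 30 / LB_PER_KG   # 30 lb allowance over one year
HEIGHT_TOL_CM = 6 * CM_PER_IN    # 6 inch allowance over one year


def resolve(value, index, factors, tolerance):
    """Return (label, value_in_base_units), or (None, None) if nothing fits."""
    best = None
    for label, factor in factors.items():
        corrected = value / factor          # undo the hypothesised mismatch
        deviation = abs(corrected - index)
        if deviation <= tolerance and (best is None or deviation < best[0]):
            best = (deviation, label, corrected)
    if best is None:
        return None, None                   # set to missing
    return best[1], best[2]


# Hypothetical patient with an index weight of 80.0 kg and index height of 172.7 cm.
for recorded in (81.5, 176.4, 36.3, 250.0):
    print(recorded, resolve(recorded, 80.0, WEIGHT_FACTORS, WEIGHT_TOL_KG))
for recorded in (173.0, 68.0, 5.67, 1.73):
    print(recorded, resolve(recorded, 172.7, HEIGHT_FACTORS, HEIGHT_TOL_CM))
```

In this sketch a value that matches no candidate within tolerance, such as the 250.0 above, is returned as missing rather than guessed, mirroring the rule that values outside the allowed deviation are set to missing.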
To demonstrate the ALOE method, we present here a single patient within EAGLE BioVU with 70 independent clinic visit dates spanning seven years in the clinical record. We plotted all measured weights available in the EHR for this patient and observed at least one dramatic weight difference between two clinic visit dates only one month apart (Figure 2). This observed difference suggests a dramatic weight loss of 163 lbs. (73.94 kg) followed by a gain of almost the same amount a month later. Similar observations were made for the last four clinic visit dates compared with the other weights immediately preceding them.
To determine the nature of the transcription error, we divided the two smallest of the first three weights (121.28 and 126.1 kg, respectively) by the fourth weight (123.23 kg) to establish the weight index value. The smaller of the two weights (121.28 kg) gave the ratio closest to 1 when divided by the fourth weight and was declared the weight index value. We then divided all five suspect weights by the weight index value, and all five ratios were within the range of 0.50 – 0.57, consistent with an error in which the original measurement was in kilograms but was assumed to be in pounds and converted to kilograms again (kgx2).
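The arithmetic behind this index selection can be reproduced directly from the values quoted above. The snippet below is only a worked check; the variable names are illustrative, and the band of recorded values is derived from the reported ratio range rather than taken from the patient record.

```python
LB_TO_KG = 0.45359237  # definition of the pound in kilograms

# Values quoted in the text for the example patient.
candidates_kg = [121.28, 126.10]   # two smallest of the first three weights
reference_kg = 123.23              # fourth weight

# The candidate whose ratio to the reference is closest to 1 becomes the index.
ratios = {w: w / reference_kg for w in candidates_kg}
index_kg = min(candidates_kg, key=lambda w: abs(ratios[w] - 1))

# Recorded values implied by the reported change-ratio band of 0.50 to 0.57.
suspect_band_kg = (0.50 * index_kg, 0.57 * index_kg)

# The apparent one-month loss of 163 lb expressed in kilograms.
apparent_loss_kg = 163 * LB_TO_KG

print(index_kg, [round(r, 3) for r in ratios.values()])
print([round(v, 1) for v in suspect_band_kg], round(apparent_loss_kg, 2))
```

The two ratios are about 0.984 and 1.023, so 121.28 kg is selected. A value in kilograms that is converted once more from pounds to kilograms is scaled by roughly 0.45, which is why ratios a little above 0.5 point to this kind of error rather than to true weight loss.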
With the residual modeling approach, we exploit the relationship between height, weight, and age. We regressed height and weight, respectively, on age, creating a linear fit to predict values for each individual measure. Incorrect values exhibit a larger deviation from the predicted value and can be identified using a variety of statistical measures of influence, including Cook’s distance, leverage, DFFITS, studentized residuals, and covariance ratio. If at least three of these measures tested positive for a data point, that data point was set to missing. This method was executed two different ways: generating a single model over all observations for an individual, and generating multiple models over all available observations iteratively. The iterative approach used the influence measures to identify significant outliers, excluded the identified outliers, and repeated this procedure up to three times. All analyses were conducted using SAS v9.3.
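As a rough illustration of the "three positive tests" rule, the sketch below fits a simple linear regression of one measure on age and computes the five influence statistics from their textbook formulas. It is not the SAS procedure used in the study: the function name `influence_flags`, the cutoffs (commonly cited rules of thumb) and the example data are assumptions made for illustration.

```python
import math


def influence_flags(ages, values, min_positive=3):
    """Flag points that trip at least `min_positive` influence tests.

    Fits values = a + b * age by least squares (p = 2 parameters).
    Cutoffs are conventional rules of thumb, assumed here for illustration.
    """
    n, p = len(ages), 2
    mean_x = sum(ages) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, values)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [y - (intercept + slope * x) for x, y in zip(ages, values)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)

    flags = []
    for x, e in zip(ages, resid):
        h = 1 / n + (x - mean_x) ** 2 / sxx                 # leverage
        s2_del = (sse - e * e / (1 - h)) / (n - p - 1)      # variance without this point
        rstudent = e / math.sqrt(s2_del * (1 - h))          # studentized residual
        cooks_d = (e * e / (p * mse)) * h / (1 - h) ** 2    # Cook's distance
        dffits = rstudent * math.sqrt(h / (1 - h))
        covratio = (s2_del / mse) ** p / (1 - h)
        tests = [
            h > 2 * p / n,
            abs(rstudent) > 2,
            cooks_d > 4 / n,
            abs(dffits) > 2 * math.sqrt(p / n),
            abs(covratio - 1) > 3 * p / n,
        ]
        flags.append(sum(tests) >= min_positive)
    return flags


# Hypothetical weights (kg) for one person; the fifth value mimics a unit error.
ages = [40.0, 40.1, 40.25, 40.4, 40.5, 40.6, 40.75, 40.9, 41.0, 41.1]
weights = [80.2, 80.9, 79.8, 81.0, 36.5, 80.4, 81.3, 80.7, 81.5, 80.9]
print(influence_flags(ages, weights))
```

With these made-up data only the fifth point is flagged. Because a gross error also inflates the residual variance of the fit itself, milder mismatches can be harder to detect this way.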
The majority of EAGLE BioVU individuals are African American (73%), followed by Hispanics (11%), and Asians (7%). The median age of individuals in EAGLE BioVU is 37 years (standard deviation 20.46), with ~16% of individuals at least 55 years of age. As expected given that EAGLE BioVU is drawn from a clinical population, the majority of individuals are female (63.35%). On a per patient basis, the number of clinic visits captured in EAGLE BioVU ranges from 1 to 1,456 with an average of 81.8 clinic visits per patient19.
We extracted the height and weight values recorded in EAGLE BioVU for all clinic visits per individual and calculated BMI. The distribution of 225,903 per-visit BMI values for children (age < 18) and adults, calculated from raw height and weight measurements, is shown in Figure 3. The effects of unit mismatches are clear, with impossible (-36) and extreme values (954) derived from improperly converted height and weight measures in the calculation. These errors also cause a wide standard deviation (14.88).
We then applied our ALOE method to the raw height and weight measures extracted from the clinic visits. Figure 1 illustrates the fundamental principle of the ALOE method. Once an index measurement is selected in step 2, observations effectively cluster (by slope) based on recorded units. Weight (Figure 1A) naturally fluctuates over the course of a year, shown by a cloud of points off the main diagonal. Height measurements (Figure 1B) have much less natural variability, and points off the diagonal for height likely represent measurement error, either due to recall bias or the impact of shoes on stadiometer measurements.
The distribution of median BMIs after processing by the ALOE method is shown in Figure 4a. Median BMI was selected per-individual and is plotted for comparison to baseline BMI measurements collected in epidemiological studies. The distribution of BMIs from the NHANES is shown in Figure 4b. After processing, our data show a very similar distribution to the population level estimate. There is a slight skew toward higher BMIs in EAGLE BioVU possibly reflecting both known geographical and racial/ethnic differences in BMI distributions in the United States19, 24, 25.
We also examined the differences in dropped data points based on residual modeling and ALOE strategies. Table 1 illustrates that ALOE retains more data points than both versions of residual modeling. When performing residual modeling for outlier detection across the entire dataset, just over half of all observations are considered usable after processing. Using an iterative approach (described in the methods section), we progressively eliminated outliers across the entire dataset which may have eliminated many true observations. Performing this modeling within each individual proved more successful, but still may not have detected outliers due to subtler unit conversions (inches to centimeters).
In this work, we have shown that height and weight measures extracted from BioVU, an EHR-based biorepository, follow distinct patterns representing problems in unit conversion. By exploiting the temporal nature of the EHR, and the fact that individuals often have multiple height and weight measurements over time, many errant entries of height and weight can be resolved into the correct units. The ALOE approach leverages expected changes in weight and height measurements over a fixed time period (1 year) to identify outlier observations, which can be converted (in the case of unit error) or dropped (in the case of transcription error). More than 98% of all observations are retained by ALOE, and the resulting distribution of derived BMI measures closely matches that reported by the nationally representative NHANES.
The issue of clinical noise is due largely to the extreme heterogeneity that is typical of large clinical databases. Temporal heterogeneity is common, as some patient records have frequent visits and laboratory measures, whereas others have few or none. Clinics use different laboratory panels, collect clinical measures unevenly, and may even record measures using inconsistent units. For example, while weight is typically recorded consistently as part of patient intake, height is not recorded as regularly. When it is recorded, it may be from self-report or direct measure via a stadiometer, and even then some clinics may record in metric versus US customary units. This is common when comparing pediatric or natal measures to adult measures. Self-report may result in transcription errors, such as the entry of 5 feet, 9 inches as 59 inches. All these issues are further compounded by the presence of true outliers in the clinical system: abnormal or out-of-range test values indicative of a clinical disorder.
The ALOE approach has limitations. The method relies on dense temporal data, with multiple measures over a fixed time period. In this study, we used a 1-year window, and while this could be expanded, larger time intervals allow for larger natural changes in weight that may reduce accuracy in clustering unit distributions. Also, as with any quality control process, a degree of manual editing and interaction with the data is still recommended to preserve some data points. That is, even when the ALOE approach is applied, correction and removal of outliers are at the discretion of the individual investigator. The ALOE approach only offers solutions for research settings and does not address the cause of transcription errors in the actual clinical record. Nevertheless, the nearly ubiquitously measured height and weight values stored in clinical systems have systematic flaws that can be reasonably corrected in research settings with appropriate data processing techniques.
This work was supported in part by NIH grant U01 HG004798 and its ARRA supplements. The dataset(s) used for the analyses described were obtained from Vanderbilt University Medical Center’s BioVU which is supported by institutional funding and by the Vanderbilt CTSA grant funded by the National Center for Research Resources, Grant UL1 RR024975-01, which is now at the National Center for Advancing Translational Sciences, Grant 2 UL1 TR000445-06.