We validated a Fitbit sleep tracking device against typical research-use actigraphy across four nights on 38 young adults. Fitbit devices overestimated sleep and were less sensitive to differences compared to the Actiwatch, but nevertheless captured 88 percent (poor sleepers) to 98 percent (good sleepers) of Actiwatch-estimated sleep time changes. Bland–Altman analysis shows that the average difference between device measurements can be sizable. We therefore do not recommend the Fitbit device when accurate point estimates are important. However, when qualitative impacts are of interest (e.g. the effect of an intervention), the Fitbit device should at least correctly identify the effect’s sign.
The usefulness and validity of research-grade actigraphy devices are well known (Sadeh, 2011). The rise in interest regarding consumer sleep tracking devices for research implies the need for testing such devices against accepted sleep monitoring technologies. This article reports results from a validation study of the Fitbit sleep tracking device against standard actigraphy. Fitbit is a leading maker of devices that claim to track sleep, although recent validation attempts have produced mixed results (Evenson et al., 2015; Meltzer et al., 2014; Montgomery-Downs et al., 2012). Other consumer sleep trackers have also been the subject of validation tests. Validation studies of the Jawbone UP device, for example, have produced similar mixed results (de Zambotti et al., 2015; Evenson et al., 2015; Toon et al., 2015). A summary of the claims and validity of numerous consumer sleep monitors is found in Russo et al. (2015), with a focus on the question of their possible usefulness even absent clinical-level data validity. Our study intends to contribute to this debate. As we will show, our data are somewhat in line with previous conclusions. We provide evidence suggesting serious reservations about using the Fitbit device if accurate measurements are desired, but it may prove useful for qualitative purposes in certain settings.
This study adhered to the guidelines outlined in the Declaration of Helsinki as revised in 2008. We recruited 38 adult participants (23 females, 15 males; 26.05 ± 7.99 years) who each simultaneously wore a commonly utilized research-grade actigraph (Actiwatch Spectrum Plus: Philips Respironics) and a popular commercial sleep tracker (Fitbit Charge HR) for 4 weekdays/nights. Both the Actiwatch and Fitbit were set to sample data at 30-second epochs, and the Fitbit was set to “normal” mode. We used the Pittsburgh Sleep Quality Index (PSQI; Buysse et al., 1989) to identify both good (PSQI ≤ 5; n = 20) and poor sleepers (PSQI > 5). Participants kept sleep diaries, and we report both raw and diary-adjusted Fitbit data on total sleep time (TST) and efficiency. The procedures used for diary-aided scoring of the Fitbit data were similar to validated actigraphy procedures (Goldman et al., 2007). Because participants wore both devices simultaneously, the diary-aided scoring of both Fitbit and Actiwatch data used the exact same sleep diary record. Participants were compensated US$50 for participation, and procedures were approved by the Institutional Review Board in the Office of Research Protection at Appalachian State University (IRB approval #15-0325).
On the first day, participants visited our lab and provided written informed consent, completed the PSQI, received device instructions, and were assigned both an Actiwatch and a Fitbit device. Before departing, participants were instructed to return to the lab each day for approximately 20 minutes. During this time, they completed sleep diaries online and lab technicians synced Fitbit devices with lab computers and downloaded participants’ Fitbit and Actiwatch data from the previous day.
We compare each participant’s Fitbit nightly sleep measure to the analogous actigraphy-produced measure: time-in-bed (TIB), TST, sleep efficiency (as automatically device-scored), and TST/TIB (which we call quasi-efficiency). As noted above, Actiwatch data are scored using validated procedures, and we examine Fitbit measures of TST using both raw and diary-adjusted data. To our knowledge, existing validation studies of consumer monitoring devices do not always adjust device data with input from sleep diaries, even though this is common in many research studies. Some devices require user activation of “sleep mode,” which may serve as a diary-type measure. The Fitbit Charge HR does not require such user activation. Also, some validation studies involve concurrent polysomnographic (PSG) data acquisition, but it is not always clear whether consumer device data are adjusted as part of the scoring procedure.
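As a small illustration of the quasi-efficiency measure defined above, the following sketch computes TST/TIB as a percentage. The function name, the choice of minutes as the unit, and the example night are assumptions made for illustration; they are not taken from the study's scoring software.

```python
def quasi_efficiency(tst_minutes: float, tib_minutes: float) -> float:
    """Return TST/TIB as a percentage; both inputs are in minutes."""
    if tib_minutes <= 0:
        raise ValueError("time in bed must be positive")
    if not 0 <= tst_minutes <= tib_minutes:
        raise ValueError("total sleep time must lie between 0 and time in bed")
    return 100.0 * tst_minutes / tib_minutes


# Hypothetical night (not study data): 420 min asleep out of 480 min in bed.
print(quasi_efficiency(420, 480))  # 87.5
```

Because both TST and TIB are available from either device, this ratio can be computed identically for Fitbit and Actiwatch records, which is what makes it comparable across devices.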
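For each outcome measure, M, we estimate the following linear model for participant i on night t:

$$M^{\text{Fitbit}}_{it} = \alpha + \beta\, M^{\text{Actiwatch}}_{it} + \epsilon_{it} \qquad (1)$$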
where ϵ is a random effects error term accounting for the multiple observations (n=4) per participant (i.e. error terms are clustered by participant). The null hypotheses that both α=0 and β=1 imply Fitbit outcomes are statistically no different than Actiwatch outcomes on average. Rejection of α=0 reflects a general over/underestimation by Fitbit of the actigraphy-based measure. Rejection of β=1 indicates hypo- or hyper-sensitivity of the Fitbit to changes in the outcome measure, compared to actigraphy. All estimations of model (equation (1)) were performed using the panel data random effects option in Stata 13 software.
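The logic of testing α=0 and β=1 can be sketched in a few lines of Python. This is not the Stata random effects estimator used in the study: as a simpler stand-in, it fits pooled ordinary least squares of the Fitbit measure on the Actiwatch measure and computes cluster-robust (sandwich) standard errors by participant, without a small-sample correction. The function name and all data values are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def clustered_ols(x, y, cluster):
    """Fit y = alpha + beta*x by OLS; return (alpha, beta, se_alpha, se_beta).

    Standard errors are cluster-robust (no small-sample correction).
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    beta = (n * sxy - sx * sy) / det
    alpha = (sy - beta * sx) / n
    # Sum the score vector (u, x*u) within each participant.
    scores = defaultdict(lambda: [0.0, 0.0])
    for xi, yi, g in zip(x, y, cluster):
        u = yi - alpha - beta * xi
        scores[g][0] += u
        scores[g][1] += xi * u
    m00 = sum(s[0] * s[0] for s in scores.values())
    m01 = sum(s[0] * s[1] for s in scores.values())
    m11 = sum(s[1] * s[1] for s in scores.values())
    # Sandwich: inv(X'X) * meat * inv(X'X), inv(X'X) = [[sxx, -sx], [-sx, n]] / det.
    i00, i01, i11 = sxx / det, -sx / det, n / det
    v00 = i00 * (m00 * i00 + m01 * i01) + i01 * (m01 * i00 + m11 * i01)
    v11 = i01 * (m00 * i01 + m01 * i11) + i11 * (m01 * i01 + m11 * i11)
    # Guard against tiny negative values caused by floating-point rounding.
    return alpha, beta, sqrt(max(v00, 0.0)), sqrt(max(v11, 0.0))


# Made-up TST values in minutes: 3 hypothetical participants x 4 nights.
ids = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
actiwatch = [400, 420, 380, 410, 350, 360, 340, 370, 450, 440, 460, 430]
fitbit = [415, 430, 400, 428, 372, 375, 360, 390, 455, 452, 470, 441]

alpha, beta, se_alpha, se_beta = clustered_ols(actiwatch, fitbit, ids)
print(f"alpha = {alpha:.2f} (z against 0: {alpha / se_alpha:.2f})")
print(f"beta  = {beta:.3f} (z against 1: {(beta - 1) / se_beta:.2f})")
```

A positive α with β below 1 would correspond to the pattern described in the text: general overestimation by the Fitbit combined with reduced sensitivity to changes in the Actiwatch measure. With only a handful of clusters, as in this toy example, the robust standard errors are unreliable and serve only to show the mechanics.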
We also performed Bland–Altman analysis on the differences in device measurements (Bland and Altman, 1986). Enhanced Bland–Altman plots were constructed using SAS software, and these plots include the linear prediction and 95 percent confidence interval on the difference between the outcome measures of the two devices (sleep time or sleep efficiency).
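The core quantities behind a standard Bland–Altman plot can be sketched as follows: the mean difference between devices (the bias) and limits of agreement around it. The sketch assumes the conventional limits of bias ± 1.96 standard deviations of the differences and treats every night as an independent observation, which ignores the repeated measures per participant. The function name and data are hypothetical, and this does not reproduce the enhanced SAS plots.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements a and b.

    bias is the mean of a - b; the limits are bias -/+ 1.96 sample SDs.
    """
    diffs = [ai - bi for ai, bi in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up TST values in minutes for six hypothetical nights.
fitbit = [415, 430, 400, 428, 372, 455]
actiwatch = [400, 420, 380, 410, 350, 450]

bias, lower, upper = bland_altman(fitbit, actiwatch)
print(f"bias = {bias:.1f} min, limits of agreement = [{lower:.1f}, {upper:.1f}]")
```

Two devices can correlate highly while still showing a large bias or wide limits of agreement, which is why this analysis complements the regression approach.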
Finally, our unique longitudinal approach (most studies validate a device based on one night with PSG measures, for example) allows us to examine whether any systematic measurement differences between devices are a function of multiple measurements on the same participant.
All reported results are based on diary-adjusted (i.e. “scored”) Fitbit and Actiwatch measures, as is typically done with actigraphy data. Diary-adjusted scoring of the Fitbit data significantly reduces the variance in sleep outcome measures from the Fitbit (see section “Results”). In fact, the correlation between the Actiwatch raw versus scored data is .9582, compared to .6327 between Fitbit raw versus scored data. Diary adjustments are used not to calibrate all the device data to match the diary, but rather the diary is used as a complement to the device data when sleep start/stop times are ambiguous in the device data record.
Figure 1 and Table 1 summarize the key correlational results, while Figure 2 highlights the importance of the diary-aided scoring of the Fitbit data (i.e. manual adjustments of raw Fitbit data similar to typical scoring procedures used with actigraphy in sleep research studies). In Figure 1, the Fitbit measures (TST and efficiency) are plotted against the analogous Actiwatch measures, with the linear regression estimate of equation (1) superimposed. Table 1 shows the full estimation results for TST and sleep efficiency (shown in Figure 1) and for TIB and quasi-efficiency (not shown in Figure 1), as well as estimates for the separate subsamples of good and poor sleepers. In most instances, Table 1 indicates that the Fitbit overestimates TIB, TST, and efficiency relative to the Actiwatch measure (i.e. rejection of α=0 in favor of α>0). The results most closely approximate α=0 and β=1 for the subsample of good sleepers, for whom we estimate that the Fitbit measure of TST is statistically indistinguishable from the Actiwatch TST measure. This correlational analysis does not, however, draw our attention to the differences between device measurements, which may be sizable even when the correlation between devices is high.
Standard and enhanced Bland–Altman plots showing measurement differences between devices were constructed for TST, sleep efficiency, and quasi-efficiency measures. In Figures 3 to 5, we show results from analysis on the pooled sample as well as the subsamples of good and poor sleeper data for each of these measures. The enhanced plots (right-hand side panel in each figure) include a linear prediction of the measurement difference and confidence intervals on that difference.
From Figures 3 to 5, we see a key concern regarding the point-estimate reliability of the Fitbit. In many instances, the difference in sleep parameter measurement is not only outside the random variation one might expect, but is also substantial in magnitude. Also, the Bland–Altman plots reveal that larger recorded values of Fitbit TST or sleep efficiency are associated with an even larger difference between the sleep parameters that the Fitbit and Actiwatch are measuring. Given that the Actiwatch Spectrum is a well-validated and commonly used research device for obtaining such sleep measures (i.e. it is our benchmark device between the two), this finding indicates that the Fitbit’s measurements are not sufficiently accurate compared to well-accepted device standards.
Finally, we also conducted longitudinal analysis on whether the day of testing (day 1, 2, 3, or 4) revealed any systematic tendencies regarding the difference between Fitbit and Actiwatch device measurements. The longitudinal data on each participant are shown in Figures 6 and 7, which in each case are separated by good and poor sleepers. The data show that the Fitbit tends to overestimate sleep efficiency and marginally overestimate TST across all days, but regression results in Tables 2 and 3 confirm no systematic trend across days. Thus, the Fitbit may yet provide useful information regarding the qualitative change in a participant’s sleep trends, even though the specific values are likely biased.
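A simple descriptive check for a day-of-testing pattern is to average the Fitbit minus Actiwatch difference separately for each day. The sketch below does only that; it is not the regression reported in Tables 2 and 3, and the function name, record layout, and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_difference_by_day(records):
    """Average Fitbit-minus-Actiwatch difference for each day of testing.

    records: iterable of (day, fitbit_value, actiwatch_value) tuples.
    """
    by_day = defaultdict(list)
    for day, fitbit, actiwatch in records:
        by_day[day].append(fitbit - actiwatch)
    return {day: mean(diffs) for day, diffs in sorted(by_day.items())}


# Made-up TST values in minutes: two hypothetical participants over four days.
records = [
    (1, 415, 400), (2, 430, 420), (3, 400, 380), (4, 428, 410),
    (1, 372, 350), (2, 375, 360), (3, 360, 340), (4, 390, 370),
]
for day, diff in mean_difference_by_day(records).items():
    print(f"day {day}: mean difference = {diff:.1f} min")
```

Roughly constant per-day averages would be consistent with a stable bias, whereas a steady rise or fall across days would point to a systematic trend.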
Given the prevalent use of actigraphy for monitoring participant sleep levels outside of a sleep laboratory environment, we aimed to assess the practical usefulness of the Fitbit device as an alternative to actigraphy in certain contexts. The National Sleep Foundation places significant emphasis on sleep level targets and guidelines, and they routinely identify sleep deficits by comparing nightly sleep guidelines to self-report measures. One use of low-cost sleep monitoring devices may be to help assess within-participant sleep trends in settings where clinical accuracy is not necessary. In other words, consumer sleep tracking devices may still be qualitatively useful for personal goal tracking or even some applied research purposes (e.g. Did intervention X significantly increase John Doe’s nightly sleep?).
Our statistical analysis finds that diary-adjusted Fitbit data show fairly reasonable correlation on the key TST variable for good sleepers and somewhat lower but still high correlation on TST for poor sleepers. The regression fit between Actiwatch and Fitbit sleep efficiency (and it is unclear how that is defined with Fitbit) is inferior, which suggests that perhaps the use of the quasi-efficiency measure, TST/TIB, may be more reasonable. Nevertheless, the correlation between Fitbit and Actiwatch quasi-efficiency is substantially lower than the correlation between their TST measures.
Additional analysis with Bland–Altman plots shows that the magnitude of the differences between device measurements can be substantial. In some instances, the nightly sleep measured by the Fitbit differs by more than a full hour from the analogous Actiwatch measure. Also, confidence intervals on the predicted difference between device measurements as a function of the Fitbit measure typically do not include the “zero difference” line, and the predicted difference in device measurements is not constant across the range of values in our data set. Finally, we exploit the unique longitudinal nature of our data set by examining whether the difference between Fitbit and Actiwatch measures of TST and sleep efficiency differs systematically over the course of the four evenings of data collection. Such analysis is not possible with typical validation studies examining only a single night of device testing. We do not find evidence that the differences between device measurements change across consecutive evenings of testing. Overall, while the Fitbit may be useful for promoting a heightened awareness and concern over one’s sleep, we do not recommend it as an alternative to traditional actigraphy when accurate point estimates of TST or sleep efficiency are desired. However, the significant positive average relationship between device measurements suggests a limited but useful role for the Fitbit for those instances where the average sign of the effect is all the researcher needs (e.g. assessing the directional impact of an intervention, assuming sufficient sample size). The qualitative value of the Fitbit data appears to be present for both good and poor sleepers.
Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Actigraphy devices were purchased for an earlier project funded through National Science Foundation Grant BCS-1229067 to Dickinson. Fitbit devices were purchased directly through Amazon.com using GEAR UP (Department of Education Grant P334A140205) funding. Neither the National Science Foundation nor the Department of Education (or GEAR UP) had any role in the study design, collection, analysis or interpretation of the data, writing of the manuscript, or the decision to submit the paper for publication.