Motion sensor devices such as actigraphs are increasingly used in studies that seek to obtain an objective assessment of activity level. They have many advantages, and are useful additions to research in fields such as sleep assessment, drug efficacy, behavior genetics, and obesity. However, questions still remain over the reliability of data collected using actigraphic assessment. We aimed to apply generalizability theory to actigraph data collected on a large, general-population sample in middle childhood, during 8 cognitive tasks across two body loci, and to examine reliability coefficients on actigraph data aggregated across different numbers of tasks and different numbers of attachment loci. Our analyses show that aggregation greatly increases actigraph data reliability: the reliability coefficient for data collected at one body locus during 1 task (.29) was much lower than that for data aggregated across two body loci and 8 tasks (.66). Prospective analyses estimated only modest further increases in reliability from aggregating across four loci and 12 tasks, indicating an optimum trade-off between data collection and reliability estimates. We also examined possible instrumental effects on actigraph data and found these to be nonsignificant, further supporting the reliability and validity of actigraph data as a method of activity level assessment.
The difficulties with obtaining an objective assessment of children's activity levels have long been recognized (Teicher, Ito, Glod, & Barber, 1996). Parent and teacher reports of children's activity levels are commonly used as activity level measures in both clinics and research, but there is awareness of possible biases and errors (Saudino, Ronald, & Plomin, 2005; Thapar, Harrington, Ross, & McGuffin, 2000; Verhulst, Achenbach, Althaus, & Akkerhuis, 1988) such as contrast effects (Eaves et al., 2000; Saudino, Cherny, & Plomin, 2000; Simonoff et al., 1998) or halo effects (Abikoff, Courtney, Pelham, & Koplewicz, 1993; Schachar, Sandberg, & Rutter, 1986; Stevens, Quittner, & Abikoff, 1998). Although observer-rated data may be considered more objective than parent or teacher reports, the expense and practicalities of obtaining such data have made them difficult to use in large-scale studies and unsuitable for long-term field studies of activity level. Motion sensor devices, such as actigraphs, provide a potential way around these pitfalls, by providing objective, technologically simple activity level data (Eaton, McKeen, & Saudino, 1996), in a method that can be used over long periods of time and has been shown to be readily accepted by the majority of young people (Van Coevering et al., 2005). Despite useful additions to research in varied fields such as sleep assessment, drug efficacy, behavior genetics, and obesity, the validity and reliability of actigraphic assessment have been queried (Jean-Louis et al., 1997). Empirical research has largely validated the use of actigraphy as a method of activity level assessment, showing significant correlations with activity level assessments by methods such as room respiration calorimetry (Puyau, Adolph, Vohra, & Butte, 2002), oxygen consumption over activities of varying intensity (Treuth et al., 2004), spinners and precision pendulums (Tryon, 2005), and questionnaire data (Saudino, Wertz, Gagne, & Chawla, 2004).
A comparison study found high correlations between three actigraphy monitors and treadmill pace, but found that the CSA Monitor (Computer Science and Applications, Inc.) was the most accurate at predicting energy expenditure as measured by indirect calorimetry, compared with the Tritrac and Biotrainer systems (Welk, Blair, Wood, Jones, & Thompson, 2000).
However, there are still methodological issues to be addressed regarding the use of motion sensor data. Although the validity of actigraphs as a method of activity level assessment has been addressed in the above studies, the reliability of the actigraph method is still open to question (Sadeh, Sharkey, & Carskadon, 1994). Reliability studies have largely focused on sleep–wake identification (see Cole, Kripke, Gruen, Mullaney, & Gillin, 1992, for a review), with little data on the reliability of actigraphy in assessing overall activity level, although one study used generalizability theory to show that CSA accelerometers have the lowest error variance when compared with Biotrainer Pro, Tritrac, and Actical machines (Welk, Schaben, & Morrow, 2004). However, few studies have focused on how to quantify and maximize the reliability of actigraph data.
In his seminal paper “Aggregation and Beyond,” Epstein (1983) discussed issues surrounding the reliability of all behavioral data, highlighting the need to minimize situation-specific influences on behavior measurement, and the role of aggregation across measurement occasions in achieving this. He defined aggregation as “a basic procedure for reducing error of measurement and for enhancing and establishing the range of generality of a phenomenon by averaging over many measurements” (p. 367), in particular, by canceling out—or at least reducing—instances of unrepresentative behavior within the measurement of a phenotype. For data aggregation to be appropriate, the measurements must measure the same concept and have a common variance. Aggregation over measurement occasions is an important concept for increasing the reliability of data, and aggregated summed components are generally expected to have better reliability than single variables are (Rousson, Gasser, & Seifert, 2002); combining scores across theoretically related cognitive variables has been shown to increase data reliability and decrease error variance (Kuntsi et al., 2006).
One of the advantages of motion sensor data is that they facilitate aggregation over a large sample (Eaton et al., 1996). However, given that children's activity levels can be highly variable, further aggregation (such as across measurements and across several data collection periods) can be useful to minimize momentary or unrepresentative influences. This may be particularly important for fields such as behavior genetics, where it is not the mean score over the sample that is important but the correlation between individuals. Earlier work has highlighted the importance of aggregating actometer data across instruments (Eaton, 1983) and limbs (Eaton et al., 1996), to reduce behavioral variability and produce the most reliable score of overall activity level, and these studies have also speculated that aggregation over longer data collection periods, such as 24–48 h, would increase functional reliability over shorter periods of 15 min (Eaton et al., 1996). This idea was supported in one actigraph study, which reported higher reliability for actigraph data collected over three 5-min bouts of treadmill walking than for data collected during one bout (Welk et al., 2004). However, there is a lack of data on this latter issue for actigraph data not collected during bouts of regulated physical activity, when the demands of different tasks may elicit differing activity levels (Dane, Schachar, & Tannock, 2000). Nor are data available on the suitability of aggregation across limbs for actigraphs themselves, which record different information than that of the actometers used by Eaton and colleagues.
Classical test theory, item response theory, and generalizability theory are three major ways to assess reliability. However, the former two approaches consider only one source of measurement error at a time; nor do they provide an overall estimate of reliability, or explicit information about how many extra measurements would be needed to obtain a specific reliability coefficient in future studies (Mushquash & O'Connor, 2006). Generalizability theory, sometimes called G theory, allows the estimation of variance due to person, facet, and residual variance components (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). Using available data, a G study estimates reliability coefficients for different numbers of measurements. Further, prospective analyses can estimate the number of measurements needed in future studies to achieve a chosen reliability coefficient.
We aimed to examine whether there is justification for aggregation across body loci and tasks for actigraph data, and, if so, whether aggregation results in an increase in reliability, by using G theory to calculate reliability coefficients for different levels of aggregation. Furthermore, we aimed to examine whether remaining battery life, intermachine differences, and research assistant differences have a significant effect on the data collected.
Participants were members of the Study of Activity and Impulsivity Levels in Children (SAIL), a study of a general population sample of twins between the ages of 7 and 9. The sample was recruited from a birth cohort study, the Twins Early Development Study (TEDS; Trouton, Spinath, & Plomin, 2002), which had invited parents of all twins born in England and Wales between 1994 and 1996 to enroll. Despite attrition, the TEDS families continue to be representative of the U.K. population with respect to parental occupation, education, and ethnicity (Spinath & O'Connor, 2003).
Families on the TEDS register were invited to take part if they fulfilled the following SAIL project inclusion criteria: their twins' birthdates were between September 1, 1995, and December 31, 1996; they lived within feasible traveling distance of the research center (return day trip); their ethnic origin was white European (to reduce population heterogeneity for molecular genetic studies); they had recently participated in TEDS, as indicated by return of questionnaires at either the 4- or the 7-year data collection point; they had no extreme pregnancy or perinatal difficulties (15 pairs excluded), specific medical syndromes, chromosomal anomalies (2 pairs excluded), or epilepsy (1 pair excluded); they were not participating in other current TEDS substudies (45 pairs excluded); and they were not on stimulants or other neuropsychiatric medications (2 pairs excluded).
The present analyses focus on data obtained following contact with the first 693 suitable families on the register. Of these, 400 families agreed to participate, reflecting a participation rate of 58%. Of the 800 participants, data from 9 individual children were excluded from analyses on the SAIL dataset (5 children with IQs below 70, and 1 child because of each of the following: neurofibromatosis, epilepsy, hypothyroidism, and ADHD with scioptic tendencies). A further 3 children were excluded because of difficulties during test sessions that inappropriately affected the data (e.g., playing with the actigraph). Actigraph data were therefore available for 797 children; but of these, 109 had incomplete data and were therefore excluded from the present analyses. This was a necessary step for the generalizability analyses and arose either as a mechanical failure or as a testing-related issue (such as a specific computer failure leading to nonadministration of a task). This left for the present analyses a final sample of 598 children, mean age 8.36 years (SD = 0.26). All participants gave informed consent, and the study was approved by the Institute of Psychiatry ethical committee (approval no. 286/01).
The families visited the research center for the actigraph assessments (for further details, see Wood, Saudino, Rogers, Asherson, & Kuntsi, 2007). The actigraph readings used in the present analyses are taken from a laboratory-based test session, when the twins were apart completing a short-form IQ test and several cognitive–experimental tasks, under the supervision of separate experimenters who administered standardized instructions. The total length of the testing session was approximately 2 h, excluding a 25-min unstructured break approximately halfway through the session. The children completed 4 separate tasks, with differing task conditions. The children completed all tasks while seated; no task required any movement other than arm movement, usually computer mouse control. However, for the block design subtest in the Wechsler Intelligence Scale for Children (Wechsler, 1991), the children manipulated objects on the desk. Each task condition was separated by a 3- to 4-min break, during which instructions for the next task were given and prizes awarded. For simplicity, therefore, each condition is treated and referred to in these analyses as a separate “task,” giving eight “tasks” in total. The children completed the following four tasks:
The children wore two actigraphs, each slightly larger than a watch (MTI Health Services, version 323, Health One Technology), one on the dominant leg (established by asking with which leg they would kick a ball and with which leg they would start to climb stairs), and one on the waist. These attachment loci were chosen to minimize the relationship between actigraph data and task performance. MTI actigraphs have been shown to have the least variability across overall G coefficients and the highest reliability compared with other personal motion sensors (Welk et al., 2004). These devices contain accelerometer technology, which records the number of movements as well as the cumulative magnitude. The actigraph data output was set to readings per minute, measured in gravitational acceleration (G) units, a standard measure of acceleration. This acceleration is then sampled and digitized by a 12-bit analog-to-digital converter and passed through a digital filter, which band-limits the accelerometer to 0.25–2.5 Hz. This band was selected to detect normal human activity, while rejecting motion from other sources.
In all analyses, we obtained an average reading by dividing the cumulative magnitude by the number of minute readings; this removed the effect of time, since some children spent longer in some conditions than did others. Actigraph data collected during the eight “tasks” were used in the generalizability analyses. There was an average task length of 10.60 min, with a range of 8–14 min.
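The per-minute averaging described above can be sketched as follows. This is a minimal illustration in Python rather than the processing script used in the study; the function name `mean_activity_per_minute` and the example readings are hypothetical.

```python
def mean_activity_per_minute(minute_readings):
    """Average activity for one task: cumulative magnitude / number of minute readings."""
    if not minute_readings:
        raise ValueError("at least one minute reading is required")
    cumulative_magnitude = sum(minute_readings)
    return cumulative_magnitude / len(minute_readings)


# Two hypothetical children with the same cumulative magnitude (400)
# who spent different amounts of time on the same task.
child_a = [120, 80, 100, 100]                # 4 minute readings
child_b = [50, 50, 50, 50, 50, 50, 50, 50]   # 8 minute readings

print(mean_activity_per_minute(child_a))  # 100.0
print(mean_activity_per_minute(child_b))  # 50.0
```

Dividing by the number of readings is what keeps a child who simply took longer over a task from appearing more active than one who finished quickly.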
All analyses were conducted using Stata statistical software release 9.2 (Stata Corporation, College Station, TX), with the exception of the generalizability analyses, which were conducted with SPSS version 14.0 (SPSS, Inc.) using an adaptation of a syntax script provided by Mushquash and O'Connor (2006). Log transformations were applied to data (optimized minimal skew through the lnskew0 command in Stata version 9.1) to normalize skewed distributions. Since the data were collected on a twin sample, analyses were conducted using the “cluster” command in Stata to control for the genetic relationship between members of the sample (Armitage, Berry, & Matthews, 2001). Where this was not possible (for example, in the generalizability analyses and the principal components analyses), analyses were conducted separately on Twin 1 and Twin 2, where the assignation to Twin 1 or Twin 2 status was random, to control for the nonindependence of the data. In these cases, the results for Twin 1 and Twin 2 showed a similar pattern, so only data from Twin 1 are presented here (Twin 2's data are available from the first author on request).
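To make the zero-skew log transformation concrete, the sketch below searches for a constant k such that ln(x − k) has zero skewness, which is the idea behind Stata's lnskew0. It is an illustrative Python reimplementation under stated assumptions, not the Stata routine itself: the bisection search, the search bounds, the function names, and the example scores are all hypothetical choices, and the input is assumed to be right-skewed.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def zero_skew_log(values, iterations=200):
    """Return (k, transformed) where transformed = ln(x - k) has skewness near 0.

    Assumes right-skewed input with at least two distinct values.
    """
    lowest = min(values)
    spread = max(values) - lowest

    def skew_at(k):
        return skewness([math.log(v - k) for v in values])

    lo = lowest - 1e6 * spread   # far below the data: transform is almost linear
    hi = lowest - 1e-9 * spread  # just below the minimum: strongly negative skew
    if skew_at(lo) <= 0 or skew_at(hi) >= 0:
        raise ValueError("no sign change found; input may not be right-skewed")
    for _ in range(iterations):  # plain bisection on k
        mid = (lo + hi) / 2
        if skew_at(mid) > 0:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2
    return k, [math.log(v - k) for v in values]


scores = [1, 2, 3, 4, 10, 30]            # hypothetical right-skewed activity scores
k, transformed = zero_skew_log(scores)
print(skewness(scores) > 0)              # True: raw scores are right-skewed
print(abs(skewness(transformed)) < 1e-6) # True: transformed scores have ~zero skew
```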
To investigate whether there is shared variance underlying individual task data, a principal components analysis was run on task level data to see how many latent components were extracted from the measurements, and to see whether all measurements loaded onto one or more shared latent components. This should indicate whether, as required for aggregation, all measurements do in fact share a common variance, and are, therefore, likely to measure the same concept (Epstein, 1983).
Generalizability analyses reveal and compare the sources of variance in a common metric (Mushquash & O'Connor, 2006), decomposing the variance into person-specific variance, facet-specific variance, and error variance. This makes G theory preferable to true-score theory, but following the advice of O'Brien (1995), G coefficients are here called reliability coefficients, a more familiar term, given that reliability and G coefficients are analogous. In these analyses, the common metric is actigraph data and the number of tasks and number of body loci are the two facets. A fully crossed, two-facet design was used that estimated person, task, body loci, and residual (error) variance. Using these component variances, the reliabilities across different levels of each facet can be estimated; that is, the reliability of actigraph data collected from, and averaged across, both different numbers of tasks and different numbers of body loci. We use Cicchetti's (1994) guidelines for the clinical significance of interjudge reliability coefficients, where <.40 indicates “poor” reliability; .40 to <.60, “fair” reliability; .60 to <.75, “good” reliability; and ≥.75, “excellent” reliability.
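The variance decomposition for a fully crossed person × task × locus design can be sketched with the standard expected-mean-squares approach for random-effects ANOVA. The Python below is a minimal illustration under that assumption, with one score per cell; it is not the SPSS syntax of Mushquash and O'Connor (2006) used in the study, and the function name `g_study` and the toy data are hypothetical.

```python
from itertools import product
from statistics import mean


def g_study(x):
    """Estimate variance components from x[p][t][l] (person, task, locus),
    one score per cell, by equating mean squares to their expectations."""
    n_p, n_t, n_l = len(x), len(x[0]), len(x[0][0])
    P, T, L = range(n_p), range(n_t), range(n_l)
    grand = mean(x[p][t][l] for p, t, l in product(P, T, L))
    m_p = [mean(x[p][t][l] for t, l in product(T, L)) for p in P]
    m_t = [mean(x[p][t][l] for p, l in product(P, L)) for t in T]
    m_l = [mean(x[p][t][l] for p, t in product(P, T)) for l in L]
    m_pt = [[mean(x[p][t]) for t in T] for p in P]
    m_pl = [[mean(x[p][t][l] for t in T) for l in L] for p in P]
    m_tl = [[mean(x[p][t][l] for p in P) for l in L] for t in T]

    # Mean squares: sum of squares divided by degrees of freedom.
    ms_p = n_t * n_l * sum((m - grand) ** 2 for m in m_p) / (n_p - 1)
    ms_t = n_p * n_l * sum((m - grand) ** 2 for m in m_t) / (n_t - 1)
    ms_l = n_p * n_t * sum((m - grand) ** 2 for m in m_l) / (n_l - 1)
    ms_pt = n_l * sum((m_pt[p][t] - m_p[p] - m_t[t] + grand) ** 2
                      for p, t in product(P, T)) / ((n_p - 1) * (n_t - 1))
    ms_pl = n_t * sum((m_pl[p][l] - m_p[p] - m_l[l] + grand) ** 2
                      for p, l in product(P, L)) / ((n_p - 1) * (n_l - 1))
    ms_tl = n_p * sum((m_tl[t][l] - m_t[t] - m_l[l] + grand) ** 2
                      for t, l in product(T, L)) / ((n_t - 1) * (n_l - 1))
    ms_res = sum((x[p][t][l] - m_pt[p][t] - m_pl[p][l] - m_tl[t][l]
                  + m_p[p] + m_t[t] + m_l[l] - grand) ** 2
                 for p, t, l in product(P, T, L)) / ((n_p - 1) * (n_t - 1) * (n_l - 1))

    # Solve the expected-mean-square equations (estimates can be negative in small samples).
    return {
        "person": (ms_p - ms_pt - ms_pl + ms_res) / (n_t * n_l),
        "task": (ms_t - ms_pt - ms_tl + ms_res) / (n_p * n_l),
        "locus": (ms_l - ms_pl - ms_tl + ms_res) / (n_p * n_t),
        "person_x_task": (ms_pt - ms_res) / n_l,
        "person_x_locus": (ms_pl - ms_res) / n_t,
        "task_x_locus": (ms_tl - ms_res) / n_p,
        "residual": ms_res,
    }


# Toy data: 4 persons x 3 tasks x 2 loci, purely additive (no interaction, no error).
person, task, locus = [1.0, 2.0, 4.0, 7.0], [0.0, 1.0, 3.0], [0.0, 2.0]
scores = [[[a + b + c for c in locus] for b in task] for a in person]
components = g_study(scores)
print(round(components["person"], 3))    # 7.0, the sample variance of the person effects
print(round(components["residual"], 3))  # 0.0, since the toy data contain no error
```

With purely additive toy data, each main-effect component equals the sample variance of the corresponding effects and every interaction and residual term is zero, which gives a quick sanity check on the arithmetic.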
The analyses are presented from two applications of G theory. The first set is a “G study,” which presents reliability coefficients for the data collected. The second set of analyses relates to a “D study” where, using G theory, prospective analyses can indicate what reliability coefficients would be in future studies for numbers and combinations of facets not collected as part of the G study. In this case, we present reliability coefficients for a D study in which the number of body loci is extended to four and the number of tasks is extended to 12.
The effects on the data of individual differences between the two experimenters administering the tasks, and of intermachine differences, were assessed through a regression model in which the assumption of independence of the data was relaxed.
For the leg data, a principal components analysis showed that one major factor accounted for 58% of the variance between the actigraph data collected during the 8 “tasks” (eigenvalue, 4.62). No other clear factors emerged. Between 40% and 69% of the variance in each task measurement was explained by this one factor. As such, there was shared common variance between the actigraph data for all 8 tasks, and none of the tasks contributed significantly to a unique aspect of actigraph data.
For the waist data, once again, only one clear factor emerged (eigenvalue, 4.42), explaining 55% of the variance between task level actigraph data. Between 41% and 68% of the variance in each task level actigraph data set was explained by the extracted factor. Similarly, for data averaged across all eight tasks, a principal components analysis extracted one component that explained 74% of the variance shared between leg and waist data for Twin 1 (eigenvalue, 1.48). This suggests that both body loci share a common variance, and that no single locus contributes significantly to a unique aspect of actigraph data.
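The reported eigenvalues and percentages of variance can be cross-checked with a short calculation. The sketch below assumes the principal components analyses were run on standardized variables (a correlation matrix), so that total variance equals the number of variables entered; that assumption and the helper name `variance_explained` are ours, not stated in the text.

```python
def variance_explained(eigenvalue, n_variables):
    """Proportion of total variance for one component when each
    standardized variable contributes one unit of variance."""
    return eigenvalue / n_variables


reported = {
    "leg, 8 tasks": (4.62, 8),
    "waist, 8 tasks": (4.42, 8),
    "leg and waist, task-averaged": (1.48, 2),
}
for label, (eigenvalue, n_variables) in reported.items():
    print(f"{label}: {variance_explained(eigenvalue, n_variables):.0%}")
# leg, 8 tasks: 58%
# waist, 8 tasks: 55%
# leg and waist, task-averaged: 74%
```

Under that assumption, the three eigenvalues reproduce the 58%, 55%, and 74% figures given in the text.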
Reliability coefficients for aggregated data are presented in Table 1. Using data collected from one locus, the G study showed that reliability coefficients increased from .29 to .52 as data were aggregated across 1 and 8 tasks, respectively (Table 1; Figure 1). Prospective analyses for the D study indicate that the reliability coefficient would increase to .54 if data were aggregated across 12 tasks (Table 1; Figure 1), which indicates that the increase in reliability becomes more modest as one aggregates across further tasks.
When data were aggregated across two body loci, the reliability coefficients increased in comparison with using data collected from one locus. The G study showed that for data collected during one task, but aggregated across two loci, the reliability coefficient increased by .10 to .39 (Table 1; Figure 1). When data were aggregated across eight tasks, the reliability coefficient increased from .52 for data collected on one locus to .66 when aggregated across two loci (Figure 1). Prospective analyses for the D study indicated that using data aggregated across four body loci would increase this latter reliability coefficient to .77 (Table 1; Figure 1).
There was more variance across person (.32) than across the facets (.05 across task and .18 across body loci), confirming the shared variance in actigraph data across individual tasks and body loci. However, the interaction between persons and tasks (.23) and between persons and body loci (.23) indicates that the rank ordering of persons changes across tasks and loci, so, as might be expected, to some extent different tasks elicit different activity levels.
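As an illustration of how such components translate into the reported coefficients, the sketch below applies the usual relative-error form of the generalizability coefficient for a person × task × locus design. Two points are assumptions on our part: the choice of this formula, and the residual component of 0.323, which is not reported in the text and was back-calculated here from the single-task, single-locus coefficient of .29. The function names are hypothetical.

```python
def reliability(n_tasks, n_loci, var_person=0.32, var_person_task=0.23,
                var_person_locus=0.23, var_residual=0.323):
    """Relative G coefficient for scores averaged over n_tasks and n_loci.

    var_residual is an assumed, back-calculated value (not reported in the text).
    """
    error = (var_person_task / n_tasks
             + var_person_locus / n_loci
             + var_residual / (n_tasks * n_loci))
    return var_person / (var_person + error)


def cicchetti_band(coefficient):
    """Label a coefficient using Cicchetti's (1994) cut-offs of .40, .60, and .75."""
    if coefficient < 0.40:
        return "poor"
    if coefficient < 0.60:
        return "fair"
    if coefficient < 0.75:
        return "good"
    return "excellent"


for n_tasks, n_loci in [(1, 1), (8, 1), (12, 1), (1, 2), (8, 2), (8, 4), (12, 4)]:
    r = reliability(n_tasks, n_loci)
    print(f"{n_tasks:>2} tasks, {n_loci} loci: {r:.2f} ({cicchetti_band(r)})")
#  1 tasks, 1 loci: 0.29 (poor)
#  8 tasks, 1 loci: 0.52 (fair)
# 12 tasks, 1 loci: 0.54 (fair)
#  1 tasks, 2 loci: 0.39 (poor)
#  8 tasks, 2 loci: 0.66 (good)
#  8 tasks, 4 loci: 0.77 (excellent)
# 12 tasks, 4 loci: 0.79 (excellent)
```

With these inputs, the sketch reproduces the G study and D study coefficients quoted in the text, and it shows why the gains flatten: each additional task or locus divides an error term that is already small relative to the person variance.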
Situational effects on actigraph data were assessed using actigraph data aggregated across all eight tasks and both body loci, since the analyses above indicate increased reliability for data aggregated in this way. To minimize data disruption, and because the data were on the same scale, the raw data were first summed and then log transformed to normality (optimized minimal skew through the lnskew0 command in Stata version 9.1) to create an aggregate actigraph score per person, across tasks and limbs.
Intermachine differences did not have a significant effect on actigraph data collected across the sample [t(322) = −0.32, p = .75]. Remaining battery life of the actigraph did not have a significant effect on actigraph data collected across the sample [t(226) = −1.26, p = .21]. There were no significant differences in the actigraph data collected across the sample between the two research assistants [t(327) = 0.77, p = .44].
With evidence rapidly accumulating that actigraphs are a valid measure of activity level (Puyau et al., 2002; Saudino et al., 2004; Treuth et al., 2004; Tryon, 2005), questions remained about the reliability of activity level as measured by these devices over short measurement occasions, given the inherent variability of human activity. We showed that actigraph data, collected during a single visit to the research center, can have good reliability (Cicchetti, 1994), with many instrumentation variables not having a significant effect on the data collected. However, we also showed that this reliability is dependent on aggregation across actigraph measurements.
The task level data and the attachment loci level (leg and waist) data measure the same construct of activity level and have a common variance, fulfilling the main criteria to justify aggregation (Epstein, 1983). The G study indicated that, once aggregated across tasks and loci, the data showed higher reliability coefficients, with an average coefficient of .52, when one body locus was averaged across 8 tasks, or .39 when 1 task was averaged across the waist and leg data. When averaged across two body loci and 8 tasks, the reliability coefficient rose to .66, compared with a reliability coefficient of .29 when just 1 task and one body locus were used. On the basis of G theory, D study analyses indicated that, had data been further aggregated (e.g., across 12 tasks and four body loci), the reliability coefficient would have been an estimated .79. This suggests that using longer chunks of actigraph data to measure children's activity levels, here by aggregating across tasks, would produce a more reliable measure of overall activity level than would shorter chunks, by canceling out momentary or unrepresentative influences. It is likely that this finding will extrapolate to data collected over a longer period, since the tasks were administered consecutively, with only a small break between each. This finding is supported by personal communication with the researchers who administered the cognitive tasks, and by informal viewing of video tapes of the session. As increasing numbers of researchers exploit the advantages of actigraphs to collect activity level data (Morgenthaler et al., 2007), this is an important point for study design: Testing task level data, when tasks are fairly short, may not be a suitable method for examining activity level differences, and conclusions drawn from small chunks of data should be treated with caution (e.g., Inoue et al., 1998).
Reducing error variance in actigraph data through aggregation has many potential applications; for example, previous analyses on these data used aggregation across limb scores as an important step in maximizing sensitivity to the underlying genotype (Wood et al., 2007).
Encouragingly, of three possible instrumental and rater effects on actigraph data (intermachine differences, interrater individual differences, and the effect of battery life), none had a significant effect on the data. This finding reinforces both the reliability and validity of actigraph data. Nonetheless, we would still encourage researchers to counterbalance devices across individuals and across body loci, and to be aware of potential maintenance issues with the actigraphs.
The analyses presented relate to actigraph data collected in a laboratory setup; the applicability of the findings to a more naturalistic setting requires further study. A benefit of actigraph measurement in a laboratory setup is that it more closely resembles a classroom setting, in which increased activity level may be most problematic. Finally, it may be task level or loci differences that researchers are interested in. These analyses highlight the potential importance of such differences through highlighting task-specific variance, and the present aggregation methods relate more to those seeking to assess overall activity level. When investigating task-specific differences, our data suggest that researchers should compare data aggregated across theoretically related tasks of interest, and/or aggregate across several body loci, especially if tasks are fairly short, which is likely to be the case in child research.
Our results show that actigraph data are reliable. We would recommend that actigraph data be aggregated across theoretically related tasks and body loci. Data aggregated over an average of 80 min and over two body loci show higher reliability than data collected from one locus for an average of 10 min.
The Study of Activity and Impulsivity Levels in Children (SAIL) is funded by a project grant from the Wellcome Trust (GR070345MF). K.J.S. is supported by Grant MH062375 from the National Institute of Mental Health. A.C.W. is supported by the Economic and Social Research Council. We thank the TEDS-SAIL families, Eda Salih, Hannah Rogers, Rebecca Gibbs, Greer Swinard, Kate Lievesley, Kayley O'Flynn, Suzi Marquis and Rebecca Whittemore, Vlad Mereuta, Desmond Campbell, and everyone on the TEDS team.