Wearable monitors are increasingly being used to objectively monitor physical activity in research studies within the field of exercise science. Calibration and validation of these devices are vital to obtaining accurate data. This article is aimed primarily at the physical activity measurement specialist, although the end-user who is conducting studies with these devices also may benefit from knowing about this topic.
Initially, wearable physical activity monitors should undergo unit calibration to ensure inter-instrument reliability. The next step is to simultaneously collect both raw signal data (e.g., acceleration) from the wearable monitors and rates of energy expenditure, so that algorithms can be developed to convert the direct signals into energy expenditure. This process should use multiple wearable monitors, a large and diverse subject group, and should include a wide range of physical activities commonly performed in daily life (from sedentary to vigorous).
New methods of calibration now use “pattern recognition” approaches to train the algorithms on various activities, and they provide much better estimates of energy expenditure than were previously available with the single-regression approach. Once a method of predicting energy expenditure has been established, the next step is to examine its predictive accuracy by cross-validating it in other populations. In this paper, we attempt to summarize best practices for calibration and validation of wearable physical activity monitors. Finally, we conclude with some ideas for future research that will move the field of physical activity measurement forward.
Wearable physical activity monitors are now widely used in research studies. These devices work by sensing either physiological or mechanical responses to bodily movement, and they use these signals to estimate variables that reflect physical activity. Calibration and validation of wearable activity monitors are the first steps to obtaining accurate and objective information. A detailed understanding of the best methods for doing this will ensure that the best possible data are collected. This paper is aimed primarily at the physical activity measurement specialist, rather than the end-user seeking to use “off-the-shelf” devices for assessing physical activity.
This paper will discuss the process of calibrating and validating wearable physical activity monitors. First, we define validity and identify the types of validity analyses that are most appropriate for wearable activity monitors. Second, we review methods of calibration, paying attention to both “unit calibration” (to ensure inter-instrument reliability), and “value calibration” (i.e., conversion of raw signals into metabolic units) of wearable monitors. Third, we describe how to investigate the validity of wearable activity monitors, to determine whether they accurately estimate energy expenditure, time spent in various intensity categories, and the type of activities performed. We also discuss the strengths and weaknesses of various calibration methods. We conclude by summarizing what we consider to be best practices for calibrating and validating wearable activity monitors and priorities for future research.
In 2004, at a conference on “Objective Monitoring of Physical Activity” held at the University of North Carolina, Welk (26) distinguished between two types of calibration in reference to wearable activity monitors. “Unit calibration” is performed to reduce inter-instrument variability and to ensure that individual activity monitors are correctly measuring the direct signals (acceleration, heart rate, body posture, heat flux). “Value calibration” of wearable monitors refers to the process used to convert the direct signals into other established measurement units. Value calibration is a validity issue, and is performed to ensure that a wearable monitor gives the intended values for outcome variables (i.e., “derived” variables).
Validity is defined as the extent to which an instrument measures what it is intended to measure (7). There are several different types of validity, including: criterion-referenced validity, content validity, and construct validity. Physical activity researchers are most interested in criterion-referenced validity with wearable physical activity monitors because the variables that we are attempting to measure are highly objective (as opposed to subjective). Content and construct validity have lesser value when it comes to establishing evidence of validity for wearable physical activity monitors; however, they are often used in validating physical activity questionnaires.
Criterion-referenced validity comprises two types: concurrent validity and predictive validity. Concurrent validity is determined by comparing or correlating data collected at the same time from a measure (wearable monitor) and a criterion measure. The criterion measure should be a gold standard or a measure with the highest accuracy and precision. Predictive validity is the extent to which a physical activity monitor is able to predict scores obtained using a criterion instrument.
Wearable monitors typically measure acceleration, physiological signals, body posture, or some combination of these factors. The validity of these direct signals can be checked by comparison to a “gold standard.” For accelerometers, a simple check on unit calibration would be to spin the device in a circle with a known radius and frequency (RPM), so that it is exposed to a known acceleration. This technique can be used to verify that the accelerometer displays values within the manufacturer's stated tolerance limits. Unit calibration checks are important with some older activity monitors (e.g., ActiGraph [LLC, Fort Walton Beach, FL] 7164) that use cantilever beam sensors with analog filtering. They are not necessary with newer monitors (e.g., ActiGraph model GT1M), which tend to use a direct compression sensor integrated into a solid-state microelectromechanical system (MEMS) accelerometer with digital filtering. Due to tighter manufacturing tolerances with MEMS accelerometers, the digital filter, and the initial unit calibration performed at the factory, the sensitivity varies little between the newer devices. They should remain calibrated for the life of the device, according to the manufacturers.
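As a rough illustration of the spinning check described above, the expected acceleration follows from a = ω²r, where ω is the angular velocity and r is the radius. The sketch below is not part of the original procedure: the function names, the 0.10 m radius, the 60 RPM speed, the 0.41 g reading, and the 5% tolerance are all illustrative assumptions, and an actual check would use the manufacturer's stated tolerance limits.

```python
import math

G = 9.80665  # standard gravity in m/s^2


def centripetal_acceleration_g(radius_m, rpm):
    """Acceleration (in g) at distance radius_m from the axis when spinning at rpm."""
    omega = 2.0 * math.pi * rpm / 60.0  # angular velocity in rad/s
    return omega ** 2 * radius_m / G


def within_tolerance(measured_g, expected_g, tolerance_fraction):
    """True if the reading deviates from the expected value by no more than the tolerance."""
    return abs(measured_g - expected_g) <= tolerance_fraction * expected_g


# Hypothetical set-up: monitor mounted 0.10 m from the axis, spun at 60 RPM.
expected = centripetal_acceleration_g(radius_m=0.10, rpm=60)
print(round(expected, 3))  # about 0.403 g
print(within_tolerance(measured_g=0.41, expected_g=expected, tolerance_fraction=0.05))
```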
Although the newer microelectromechanical system-based accelerometer technologies are more robust than their predecessors, they are not infallible. For example, little is known about the threshold detection levels of the existing accelerometer models, which could have a marked impact on sedentary time outcomes or wear-time algorithms that look for consecutive epochs of zeros (21). In addition, the robustness of the different manufacturers’ unit calibrations is unclear. For example, it has not been established that a single-point unit calibration (one g-force and one frequency) is adequate to test the entire dynamic range. With these issues in mind, we still recommend that all accelerometer-based activity monitors undergo a unit calibration check with a mechanical shaker across a range of accelerations and frequencies before deployment.
Typically, unit calibration is not performed with wearable heart rate monitors and body posture devices. This is because the heart rate values obtained from heart monitors have previously been shown to be valid compared to electrocardiograms. Body posture monitors have been shown to be valid by placing them in different orientations (horizontal, vertical, tilted), and comparing this measurement to the output from the monitors. Thus, there does not appear to be a need for unit calibration with these devices.
Value calibration of wearable monitors (e.g., metabolic calibration) refers to the process by which measurement researchers obtain data that allow them to convert direct signals from monitors into estimates of energy expenditure, time spent in various intensity categories, and activity type. This process involves collecting data on multiple individuals as they perform different activities and simultaneously collecting criterion data.
The data collected by accelerometer-based activity monitors are referred to as “activity counts.” These activity counts are derived from the raw acceleration versus time curve. For instance, when an accelerometer is placed on a belt worn tightly around the waist, a sinusoidal acceleration versus time curve is observed. These acceleration data are filtered, full-wave rectified (meaning that the absolute value of the accelerations is used) and then integrated (i.e., the area under the curve is determined) over a pre-determined time period. The resultant activity counts are displayed for discrete time periods, or epochs (e.g., 1-min). The activity counts are then used to predict energy expenditure (9).
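The rectify-and-integrate steps described above can be sketched as follows. This is a simplified illustration, not any manufacturer's algorithm: the filtering step is omitted, the function name and the sample signal are made up, and the resulting values are in arbitrary units, since commercial devices apply their own proprietary filters and scaling.

```python
def activity_counts(samples, sample_rate_hz, epoch_s):
    """Rectify and integrate acceleration samples over consecutive epochs.

    Returns one value per complete epoch: the sum of |a| * dt, i.e. the area
    under the full-wave rectified acceleration versus time curve.
    """
    dt = 1.0 / sample_rate_hz
    per_epoch = int(round(sample_rate_hz * epoch_s))
    counts = []
    # Step through the signal one complete epoch at a time; a trailing
    # partial epoch is discarded.
    for start in range(0, len(samples) - per_epoch + 1, per_epoch):
        window = samples[start:start + per_epoch]
        counts.append(sum(abs(a) for a in window) * dt)
    return counts


# Made-up signal: 2 Hz sampling, 2-second epochs (4 samples per epoch).
signal = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5, 0.0, 0.0]
print(activity_counts(signal, sample_rate_hz=2, epoch_s=2))  # [3.0, 0.5]
```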
In 2004, Welk (26) summarized the principles for designing accelerometry-based value calibration studies. He noted that during calibration:
Historically, most value calibration studies on accelerometers have used a single linear regression approach. After collecting energy expenditure and activity count data on multiple individuals performing a range of physical activities, the relationship between these variables is plotted graphically and linear regression is used to determine the line of best fit. Once a single regression equation has been developed, the activity counts obtained by an individual performing an unknown activity can be used to estimate his or her energy expenditure. Activity count cutpoints denoting the dividing line between light- and moderate-intensity physical activity (3 metabolic equivalents [METs]) and moderate- and vigorous-intensity physical activity (6 METs) are typically identified. These cutpoints are then used to tally up the amount of time spent in light, moderate, and vigorous physical activity.
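A minimal sketch of the single-regression approach is shown below, assuming an ordinary least-squares line relating counts per minute to METs. The calibration data are invented for illustration and lie exactly on a line; the function names are hypothetical, and real calibration data would show considerable scatter.

```python
def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def counts_at_mets(slope, intercept, mets):
    """Invert the regression line to find the count value matching a MET level."""
    return (mets - intercept) / slope


def classify_intensity(count, moderate_cut, vigorous_cut):
    """Assign an epoch to an intensity category using the two cutpoints."""
    if count < moderate_cut:
        return "light"
    if count < vigorous_cut:
        return "moderate"
    return "vigorous"


# Invented calibration data (counts per minute, measured METs).
counts_per_min = [0, 1000, 2000, 4000, 6000]
measured_mets = [1.5, 2.5, 3.5, 5.5, 7.5]

slope, intercept = fit_line(counts_per_min, measured_mets)
moderate_cut = counts_at_mets(slope, intercept, 3.0)  # 3-MET boundary
vigorous_cut = counts_at_mets(slope, intercept, 6.0)  # 6-MET boundary
print(round(moderate_cut), round(vigorous_cut))  # 1500 4500
print(classify_intensity(2000, moderate_cut, vigorous_cut))  # moderate
```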
Montoye et al. (19) performed one of the first value calibration studies of a uniaxial accelerometer (a prototype to the Caltrac [Muscle Dynamics, Torrence, CA]). They had individuals perform level treadmill walking, treadmill running, bench stepping, knee bends, and floor touches. They used linear regression to determine the line of best fit relating acceleration to energy expenditure. Freedson et al. (13) used a similar value calibration approach with only treadmill walking and jogging. Hendelman et al. (15) and Swartz et al. (25) calibrated the ActiGraph by having subjects perform a variety of moderate-intensity lifestyle activities. In all, over a dozen regression equations have been developed for the ActiGraph alone (Table 1).
Because single regression equations cannot accurately determine energy expenditure across a wide range of activities, Crouter et al. (12) developed a two-regression equation that discriminates between walking/running and intermittent lifestyle activities based on the variability in counts across successive epochs. They calibrated the ActiGraph model 7164 on 20 different physical activities that ranged from seated rest to vigorous exercise. The method then uses one of two regression equations to predict energy expenditure, thus achieving a closer estimate than previous single regression models. Newer approaches to conducting value calibration studies that make use of “pattern recognition” have been developed, and some of them are even more accurate than Crouter's two-regression model.
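The branching logic of a two-regression model can be sketched as follows. The sketch assumes that variability is summarized as the coefficient of variation of counts across sub-epochs within each minute; the threshold of 10 and both equations are hypothetical placeholders, not the published coefficients.

```python
import statistics


def coefficient_of_variation(counts):
    """Population standard deviation as a percentage of the mean (0 if the mean is 0)."""
    mean = statistics.mean(counts)
    if mean == 0:
        return 0.0
    return 100.0 * statistics.pstdev(counts) / mean


def two_regression_mets(sub_epoch_counts, cv_threshold, steady_eq, intermittent_eq):
    """Choose a regression equation from the variability of counts, then apply it.

    Low variability is treated as steady walking/running; high variability
    is treated as intermittent lifestyle activity.
    """
    total = sum(sub_epoch_counts)
    cv = coefficient_of_variation(sub_epoch_counts)
    equation = steady_eq if cv <= cv_threshold else intermittent_eq
    return equation(total)


# Hypothetical equations mapping counts per minute to METs.
def steady(counts):
    return 2.0 + 0.001 * counts


def intermittent(counts):
    return 1.0 + 0.002 * counts


walking_minute = [100, 100, 100, 100, 100, 100]  # consistent counts
chores_minute = [0, 300, 0, 300, 0, 0]           # sporadic counts
print(two_regression_mets(walking_minute, 10.0, steady, intermittent))
print(two_regression_mets(chores_minute, 10.0, steady, intermittent))
```

Both example minutes total 600 counts, yet they receive different MET estimates because the variability of the counts selects a different equation.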
Pattern recognition is a branch of artificial intelligence concerned with classifying or describing observations. The goal of this method is to classify data (or patterns) based on previous knowledge or statistical information extracted from the data. Pattern recognition requires the following: (a) a sensor that gathers the observations to be classified or described; (b) a feature extraction mechanism that computes numeric information from the observations; and (c) a scheme that performs the task of classifying or describing observations. The classification scheme is usually based on a set of patterns (input variables and desired outputs) that have previously been classified or described. This set of patterns is termed the training set. The machine learning strategy in this case is termed “supervised learning.” Machine learning can either use regression, in which case the resulting output function will be a continuous variable (e.g., energy expenditure), or it can use classification, in which case the output will be a category label.
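To make the three components concrete, the sketch below pairs a simple feature extraction step (mean and standard deviation of a window of counts) with a nearest-centroid rule trained on labeled windows. The nearest-centroid rule is a deliberately simple stand-in chosen for illustration, not one of the methods used in the studies cited here, and the training data and labels are invented. The features are left unscaled, which would rarely be appropriate with real data.

```python
import math
import statistics


def extract_features(window):
    """Feature extraction: summarize a window of counts as (mean, standard deviation)."""
    return (statistics.mean(window), statistics.pstdev(window))


def train_centroids(training_set):
    """Supervised learning step: average the feature vectors of each labeled class."""
    grouped = {}
    for window, label in training_set:
        grouped.setdefault(label, []).append(extract_features(window))
    return {
        label: tuple(statistics.mean(dim) for dim in zip(*features))
        for label, features in grouped.items()
    }


def classify(window, centroids):
    """Classification step: return the label whose centroid is closest in feature space."""
    features = extract_features(window)
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


# Invented training set: windows of counts with known activity labels.
training_set = [
    ([0, 1, 0, 1], "sitting"),
    ([1, 0, 1, 0], "sitting"),
    ([50, 52, 49, 51], "walking"),
    ([48, 50, 52, 50], "walking"),
]
centroids = train_centroids(training_set)
print(classify([49, 51, 50, 50], centroids))  # walking
print(classify([0, 0, 1, 1], centroids))      # sitting
```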
Pattern recognition uses one of several approaches: statistical, syntactic, or neural. Statistical pattern recognition is based on statistical characteristics of the data. Syntactic pattern recognition is based on the structural interrelationships of features. Neural pattern recognition uses a computational method developed from artificial neural networks. No matter what approach is used, pattern recognition still requires “value calibration” in that models must be developed based on the relationship between activity counts and a direct measure of energy expenditure or physical activity intensity. The utility of pattern recognition is highly dependent on the physical activities included in the calibration or training study. Rothney et al. (22) and Staudenmayer et al. (23) provide detailed descriptions of artificial neural network calibration studies. These studies demonstrate that pattern recognition has much greater accuracy than other methods of estimating energy expenditure using accelerometer-based activity monitors.
A number of “second generation” commercially available wearable monitors are already using pattern recognition. The Intelligent Device for Energy Expenditure and Activity (IDEEA) monitor takes data from an array of five accelerometers placed at different locations on the body, and uses these data to predict energy expenditure and activity type based on an artificial neural network that was trained on a variety of activities (28,29). The SenseWear Armband (Bodymedia Inc., Pittsburgh, PA) measures acceleration, body temperature, and galvanic skin response and uses these signals to predict energy expenditure. Its prediction algorithm is repeatedly updated as new data collected with more subjects and more activities are used to train the artificial neural network.
Value calibration of wearable heart rate monitors often involves constructing individual heart rate versus energy expenditure calibration curves. It is well known that heart rate is linearly related to oxygen uptake over a wide range of intensities, but heart rate is not a very good predictor of light-intensity physical activity. A typical method of individual calibration is the “flex method,” whereby the flex heart rate is identified as the average heart rate obtained during sitting, standing, and light exercise, and the linear relationship between heart rate and VO2 is measured during a graded exercise test. Under free-living conditions, if an individual's heart rate falls below the flex heart rate, they are credited with 1.0 MET. Above the flex heart rate, the energy expenditure is estimated from the linear heart rate to energy expenditure (HR-EE) relationship (17).
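The decision rule of the flex method can be sketched as below. The flex heart rate, slope, and intercept are hypothetical values for one imagined individual; in practice they would come from that person's own calibration data.

```python
def flex_hr_mets(heart_rate, flex_hr, slope, intercept):
    """Estimate METs from heart rate using the flex heart rate rule.

    Below the flex heart rate the person is credited with 1.0 MET; at or
    above it, the individual's linear HR-EE relationship is applied.
    """
    if heart_rate < flex_hr:
        return 1.0
    return slope * heart_rate + intercept


# Hypothetical individual calibration: flex HR of 90 beats/min and a linear
# relationship of METs = 0.08 * HR - 5.0 above that point.
for hr in (70, 90, 120):
    print(hr, round(flex_hr_mets(hr, flex_hr=90, slope=0.08, intercept=-5.0), 1))
```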
Calibration studies of the combined heart rate-accelerometry method also have been conducted (4–6,14,24). Similar to the heart rate method, key parameters such as resting heart rate or even the individual HR-EE calibration curve must be assessed in order for this method to achieve optimal accuracy. That is, while generalized algorithms, which work for all people without the need for individual calibration, have been established, they do not have the same high level of accuracy as algorithms that use individual calibration data (5,6).
Each calibration method has strengths and weaknesses. The linear regression method is simple and easy to understand, which has led to its widespread adoption. Unfortunately, the large number of regression equations and devices is a major weakness that limits our ability to draw comparisons between studies. Adopting a single regression equation could solve this problem. However, a second and even bigger problem is that no single regression equation accurately measures energy expenditure for all physical activities. For example, equations developed on walking and jogging work reasonably well for those activities, but they severely underestimate the cost of most other activities (11).
New mathematical models, including hidden Markov models (20), artificial neural networks (22), and classification trees (3), use the rich information contained in the acceleration versus time curve to arrive at even more accurate estimates of energy expenditure. However, a potential weakness with the current applications of pattern recognition is the reliance on data collected over 1 minute. Although models are being developed from acceleration data collected over 1 second, the parameters entered into the neural network require analysis of 1 minute’s worth of 1-second data. Additionally, each minute of 1-second data is collected from highly orchestrated and controlled calibration studies in which activity is being performed in a consistent fashion to ensure steady-state energy expenditure. Approaches that use raw acceleration data collected at 20 to 30 Hz may not be subject to this limitation.
Currently, it is not known how these newer mathematical models will handle transitions in activities or the use of data collected over shorter periods that will allow one to effectively monitor more sporadic physical activity patterns (such as those exhibited by children). The issue of misclassification due to transitions from one activity type to another has surfaced with the Crouter two-regression approach (18) and this could be problematic for pattern recognition, given that artificial neural networks typically use inputs like measures of variability and auto-correlation of counts within a predetermined time period. Thus, future calibration studies may need to focus as much on transitions between activities as on the activities themselves. Modifications to the Crouter two-regression approach can address the transition limitation, but it is not certain whether similar solutions can be applied to pattern recognition.
An important calibration issue related to the use of an artificial neural network is the selection of inputs from the activity monitor used to predict activity type and/or intensity. Some researchers are using the raw acceleration data, while others are using activity counts recorded over 1-second epochs. There is no clear consensus on what parameters or signals to “extract” and feed into the neural network. The inputs may need to vary according to the prediction goal (activity type, METs, or the combination of both) and the population under study. This will be an important area for future research. The field may need to reach a consensus on what inputs are required to predict energy expenditure or physical activity intensity when using artificial neural networks or a related statistical classification technique for data reduction purposes.
Receiver Operating Characteristic (ROC) curves have been used to determine cut-off points for differing intensities of activity (sedentary, light, moderate, vigorous) in order to minimize false negatives and false positives, as discussed by Welk (26). This has the potential advantage of allowing the researcher to select cutpoints that maximize sensitivity at the cost of specificity or vice versa. This does not appear to have happened in practice to date, but these options may be appropriate for certain research questions. In addition to the ROC methodology for creating cutpoints, other methodologies include Decision Boundaries (16) and Reference Activity Calibration (10). Although these methods all have their advantages, they could further “muddy the water” regarding multiple cutpoints and limit comparability across studies.
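A minimal sketch of ROC-based cutpoint selection follows. It assumes the cutpoint is chosen by maximizing Youden's index (sensitivity + specificity - 1), which is only one possible criterion; a researcher who wanted to favor sensitivity or specificity would weight the two terms differently. The counts and criterion labels are invented.

```python
def roc_points(counts, is_moderate):
    """Return (threshold, sensitivity, specificity) for each candidate cutpoint.

    An epoch is predicted to be at least moderate intensity when its count
    is greater than or equal to the threshold.
    """
    positives = sum(is_moderate)
    negatives = len(is_moderate) - positives
    points = []
    for threshold in sorted(set(counts)):
        true_pos = sum(1 for c, m in zip(counts, is_moderate) if m and c >= threshold)
        false_pos = sum(1 for c, m in zip(counts, is_moderate) if not m and c >= threshold)
        sensitivity = true_pos / positives
        specificity = 1.0 - false_pos / negatives
        points.append((threshold, sensitivity, specificity))
    return points


def best_cutpoint(points):
    """Pick the threshold that maximizes Youden's index."""
    return max(points, key=lambda p: p[1] + p[2] - 1.0)[0]


# Invented data: counts per minute and whether the criterion measure
# classified the minute as at least moderate intensity (>= 3 METs).
counts = [100, 500, 900, 1500, 1800, 2100, 2600, 3200, 4000]
is_moderate = [False, False, False, True, False, True, True, True, True]
print(best_cutpoint(roc_points(counts, is_moderate)))  # 2100
```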
One final point about new data reduction approaches is that they must be shown to be superior to previous approaches. For continuous variables, this may be accomplished by comparing standard error of the estimate (SEE) values, root mean square error, or 95% prediction intervals. However, new methods that involve broadly classifying the intensity of physical activity also should be compared to existing methods.
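For continuous outcomes, two of the error statistics mentioned above can be computed as in the sketch below. The function names and the data are illustrative assumptions; the SEE here uses n - p - 1 degrees of freedom, where p is the number of predictors in the model that produced the estimates.

```python
import math


def rmse(predicted, measured):
    """Root mean square error between predicted and measured values."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def standard_error_of_estimate(predicted, measured, n_predictors=1):
    """SEE: square root of the residual sum of squares over (n - p - 1)."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / (len(squared) - n_predictors - 1))


# Invented example: measured METs and the estimates from two competing methods.
measured = [3.0, 4.0, 5.0, 6.0]
method_a = [2.5, 4.5, 4.5, 6.5]
method_b = [2.0, 5.0, 4.0, 7.0]
print(rmse(method_a, measured), rmse(method_b, measured))  # 0.5 1.0
```

On these made-up data, method A would be preferred because its error is smaller against the same criterion values.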
The strength of the individualized HR-EE calibration method is that it accounts for the fact that fit (and younger) individuals have lower heart rate values than unfit (and older) individuals when they exercise at the same intensity. However, a major weakness of this method is the need to construct an individual HR-EE calibration curve for each person, which limits its application in large-scale studies. Although this can be overcome by using generalized HR-EE regressions, the accuracy of the method decreases until it is in the same range as the best methods using accelerometer-based activity monitors. The HR-EE calibration method itself is also subject to errors that result from psychological stress, excitement, changes in ambient temperature, dehydration, and other variables.
Once calibration has taken place, the next step is to validate the wearable monitor by comparing it against a gold standard. It is important to note that researchers do not validate the measurement instrument (or physical activity monitor) per se. Instead, they validate the instrument in relation to the purpose for which it is being used. The practical implication is that a given wearable monitor (and associated data processing rules) may be valid for measuring one outcome variable but not another. For example, a waist-mounted uniaxial accelerometer may provide valid estimates of time spent in moderate-to-vigorous physical activity but not valid estimates of energy expenditure. Furthermore, the monitor may provide valid estimates of a given outcome in one group but not another. Estimates of time spent in different intensity categories may be valid for adults aged 20 to 50 years, but not children or adults over the age of 70 years. Estimates of total daily energy expenditure (TEE) may not be valid in any of the groups.
A typical scenario in which the intended use of a measurement instrument should be considered is the validation of accelerometer energy expenditure prediction equations. Researchers often attempt to evaluate the validity of accelerometer regression equations by correlating predicted and measured energy expenditure. However, in many cases, the equations were developed so as to derive cutpoints to broadly classify the intensity of the physical activity. In this case, validity should be judged by evaluating how well the cutpoints classify physical intensity categories, not whether they accurately predict energy expenditure for the time period. Similarly, when validating a device that measures the posture of a subject (sitting or standing), validity should be judged on the basis of how well it classifies sitting or standing. It is important to distinguish between “measurement error” (mis-measurement of something continuous) and “misclassification error” (when the target is categorical). The data analysis and statistical procedures required to validate continuous and categorical data differ markedly. Further details on this are presented in the paper by Staudenmayer et al. (8) in this supplement.
Virtually all wearable monitors use direct signals to estimate derived variables, the primary one being energy expenditure. Several methods can be used to validate devices that predict energy expenditure:
Indirect calorimetry is an appropriate criterion for minute-by-minute energy expenditure. The Cosmed K4b2 and the Jaeger Oxycon Mobile are examples of portable metabolic measurement systems that measure gas exchange (VO2 and VCO2) and calorie expenditure. Because these breath-by-breath systems are more prone to error than are mixing chamber systems, a good practice is to periodically check the validity of the instrument by measuring the oxygen cost in two to three individuals at rest and at work rates of 50, 100, 150, and 200 Watts on a cycle ergometer. The VO2 measurements should be within 100 ml/min of expected values.
The doubly labeled water (DLW) validation method is an excellent method for assessing TEE. It is important to note, however, that the DLW validation method suffers from a lack of temporally linked intensity information. In other words, it cannot provide any information on the frequency, intensity, or duration of bouts of physical activity.
A relatively new derived variable is the type of physical activity, which can be validated against direct observation. Researchers are now using pattern recognition systems, such as artificial neural networks and classification trees, to predict types of activity (e.g., sitting, standing, walking, running, and bicycling). This is important because researchers want to know how much time people spend doing different types of activities. Knowing the type of activity could also lead to more accurate estimates of energy expenditure, because activity-specific predictions could then be applied.
A wide range of physical activities should be used during calibration procedures. Given that most people spend the majority of the 24-hour day lying, sitting, and standing, it makes sense to include these activities. In addition, activities should span the entire range from sedentary activities to vigorous. Light, moderate, and vigorous activities that are typical of the types of activities performed by the population of interest should be included (26). It is helpful to think of physical activities as falling into several domains: transportation, housework, occupation, and leisure-time sports/recreation.
Core activities that should be examined include lying, sitting, standing, car driving, slow walking, brisk walking, bicycling, stair climbing, stair descending, slow running, and fast running. Other activities may include: television watching, vacuuming, sweeping/mopping, washing windows, washing dishes, doing laundry, lawn mowing, raking, one-on-one basketball, singles racquetball, and singles tennis (2,12,15,27). Selection of the activities used for metabolic calibrations has not been highly scientific up to this point. To take a more scientific approach, researchers may be able to use data from time use surveys or physical activity logs to get a more accurate picture of the most common activities performed by the population of interest.
The predictive validity of wearable monitors must be shown by cross-validation. One method of cross-validation involves dividing the data into two subsets. A researcher performs the metabolic calibration on one subset (the training set), and then validates the analysis on the other (the validation set). An alternative cross-validation procedure is the “leave-one-out” approach. This involves leaving out a single observation to use as the validation data, and using the remaining observations for metabolic calibration. This is repeated until every observation in the sample has served as the validation data once.
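The leave-one-out procedure can be sketched as follows, assuming a simple linear calibration of METs on activity counts. The data and function names are invented for illustration; any calibration model could take the place of the line fit.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_rmse(x, y):
    """Hold out each observation in turn, calibrate on the rest, and pool the errors."""
    squared_errors = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * x[i] + intercept
        squared_errors.append((prediction - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented calibration data: counts per minute and measured METs.
counts = [0.0, 1000.0, 2000.0, 3000.0, 4000.0]
mets = [1.4, 2.6, 3.4, 4.7, 5.4]
print(round(leave_one_out_rmse(counts, mets), 3))
```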
Ideally, further cross-validations should be conducted in a simulated free-living environment across activities that are similar to, but different from, those included in the validity study. This maintains a degree of experimental control, but is more similar to what the monitor would cope with in a real-world setting. For example, a bank of tasks could be divided into categories; each category would contain tasks of a similar nature/intensity. A given number from each category could be selected for calibration and different ones (but from the same general categories) could be put together into a simulated daily routine (punctuated by sitting time, TV viewing etc.) for the cross-validation. VO2 could be assessed continually during the semi-structured routine using a portable system. The performance of the monitor over the entire period, as well as for individual tasks, could then be evaluated.
In calibrating and validating wearable activity monitors, it is helpful to keep these points in mind:
Future research studies will need to answer the following questions: Do triaxial accelerometers improve predictive accuracy for determining energy expenditure compared to uniaxial accelerometers? Do pattern recognition approaches using raw data improve upon those using 1-second data? Which locations on the body provide the best predictions of energy expenditure? Do multiple sites provide greater predictive accuracy than single sites? Do pattern recognition approaches accurately measure energy expenditure and accurately classify activity type, when there are transitions between activities?
Future research should determine which demographic variables (e.g., age, height, weight, sex) are confounders and need to be used in conducting value calibrations of wearable monitors. In addition, researchers should consider whether it is appropriate to use VO2 data as the criterion when attempting to measure physical activity over short time periods (e.g., 10 seconds). Finally, research laboratories should join together to design these studies and foster collaboration between groups in order to ensure a logical approach with comparability between the studies. Research groups should work together on these types of studies to attempt to increase the number of devices tested, to expand the serial number range tested, and to allow for different methods to be tested simultaneously.
Supported by National Institutes of Health grants 5 R21 CA122430 and 5 R01 HD55400. Results of the present study do not constitute endorsement by ACSM.
CONFLICTS OF INTEREST