This study evaluated the ability of end-of-day (EOD) ratings to accurately reflect momentary (EMA) ratings on 10 widely used pain and fatigue items. Rheumatology patients (N=105) completed ≥5 randomly scheduled EMA assessments of each item per day as well as EOD ratings. Correlations were high between EOD and EMA ratings of the five pain items (r= .90-.92) and somewhat lower for the five fatigue/energy items (r= .71-.86). To examine the ability of EOD ratings to represent a week of EMA ratings, 7 EOD ratings were averaged and correlated with EMA (r ≥ .95 for pain items, r = .88-.95 for fatigue/energy items). Further, averaging only 3-5 EOD ratings achieved very high correlations with a week of EMA ratings. Within-subject correlations of EOD with mean daily EMA across 7 days confirmed patients’ ability to provide daily ratings that accurately reflect their day-to-day variation in symptom levels. These EOD results were compared to traditional recall ratings collected in the same protocol. It was concluded (1) that EOD ratings were a better representation of EMA than were recall ratings, and (2) that EOD ratings across a reporting period can replace EMA for studies targeting average levels of pain or fatigue.
This study in chronic pain patients demonstrated that end-of-day ratings of pain are highly accurate representations of average levels of pain experience across a day; ratings of fatigue were somewhat less accurate, though still accurate enough to be considered valid.
There are now multiple methodologies available for assessing patient reported outcomes (PROs). The traditional method is paper questionnaires that often have reporting periods of 7 days or longer.8 Some years ago, Ecological Momentary Assessment (EMA) was developed to collect PROs multiple times across a reporting period.21 Based on both theoretical and empirical considerations, EMA has been hailed as a superior approach, because patients may have difficulty with accurately recalling the continuous stream of symptom experiences and with creating a rating that is representative of the average of those experiences.22 The PROs reported in this study are pain and fatigue/energy, two of the most common symptoms in chronic illness for which patients’ self-reports are indispensable.
Despite the proposed advantages of EMA, it carries a cost for researcher and patient. Momentary protocols using electronic diaries involve complex software, expensive hardware, sophisticated analytic approaches, as well as significant time on the part of the patient. Some research questions require the detail afforded by EMA, such as when the dynamic relationship between mood and pain is investigated. However, often “average” or “usual” symptom level is targeted (e.g., pain level over a week). Previous studies have shown that 7-day recall ratings for pain share approximately 50% of the variance with average EMA.2,166 We wondered whether end-of-day recall ratings (EOD) would be more accurate in reproducing the average of EMA ratings for the same time period. If EOD ratings are accurate summaries of EMA, they could provide a very attractive measurement alternative to the more expensive EMA method and to the less accurate recall method.
Aggregated EMA ratings captured in a random time sampling protocol are the most accurate representation of an individual’s symptom experience for that reporting period and will be treated as the “gold standard” in this study. We recognize that not everyone agrees with this. Some argue that recall ratings capture a meaningful constellation of experience and thoughts that go beyond the aggregate of momentary experience.15 However, while this is plausible, to date there is little empirical evidence to support this.
Three hypotheses are tested to determine the appropriateness of using EOD as a substitute for EMA for two reporting periods: a day and a week. First, we hypothesized that single EOD ratings would correlate highly with average EMA ratings for that day. The second hypothesis was that the average of seven EOD ratings would correlate highly with average EMA ratings for that week, thus providing a good measure of the average symptom level for the week. The third hypothesis was that within-subject correlations across days of average daily EMA and EOD ratings would be at least of moderate size, indicating the ability of EOD ratings to accurately distinguish changes in pain and fatigue/energy levels across days. If these hypotheses are supported, then it would suggest that for some applications EOD could replace EMA. A comparison of these results with previous research on the correlations of EMA with traditional 7-day recall ratings will indicate if averaged EOD ratings can provide a more accurate summary of average symptom levels than longer-term recall.
Study participants (N=117) were recruited from two offices of a community rheumatology practice. Patients were required to be available for 30 consecutive days and to satisfy the following eligibility criteria: ≥ 18 years of age; absence of significant sight, hearing, or writing impairment; fluent in English; normal sleep-wake schedule; diagnosed with a chronic rheumatological disease; experienced pain or fatigue in the last week; able to come to the research office two times within a month; and, had not participated in another study using an electronic diary within the last 5 years. The study protocol was approved by the Stony Brook University Institutional Review Board, and patients provided informed consent and were compensated $100 for participation in the study. The data were collected between September 2005 and June 2006.
Telephone screening of 279 patients determined that 86 (31%) were ineligible due to visual or hearing difficulties, inability to hold a pen, atypical sleep-wake schedule, no chronic illness diagnosis, or previous participation in a momentary assessment study. Of 193 eligible patients, 76 (39%) declined participation, and 117 (61%) participated. Eleven patients dropped out of the study, and 1 patient was eliminated from the analysis due to missing data for all 7 days of the analysis, thus the study sample was N=105. The most prevalent diagnoses were osteoarthritis (49%), rheumatoid arthritis (29%), lupus (16%) and fibromyalgia (11%). Participants tended to be female (87%), White (91%), married (65%), and had a mean age of 56 years (range 28-88). Most were high school graduates (96%), with 72% having completed some college (see Table 1).
Momentary and EOD ratings of pain and fatigue/energy were collected for 29-31 days on a hand-held computer (Palm Zire 31). The electronic diary (ED) utilized a software program provided by invivodata, inc. (Pittsburgh, PA) that featured auditory tones to signal the participant to complete a set of ratings. The ED was programmed to generate an average of 7 randomly-scheduled (within intervals) prompts spread across the participant’s waking hours (an average of one every 2 hours and 20 minutes, constrained to ensure a minimum of 30 minutes between prompts) determined by when the participant informed the ED that she was going to bed at night and set the wake up alarm the next morning. In addition to the random signals, the ED prompted the participant to complete an EOD assessment at the time the ED was put to sleep at night.
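The stratified random scheduling described above can be illustrated with a minimal sketch. The diary software is proprietary, so the function name `schedule_prompts`, its parameters, and the equal-width interval scheme are assumptions made for illustration only, not the vendor's algorithm. Times are expressed in minutes since midnight.

```python
import random


def schedule_prompts(wake_min, bed_min, n_prompts=7, min_gap=30, rng=None):
    """Draw one random prompt time per equal-width interval of the waking day.

    Hypothetical illustration: the waking period is split into n_prompts
    intervals, and each prompt is drawn uniformly within its interval,
    no earlier than min_gap minutes after the previous prompt.
    """
    rng = rng or random.Random()
    width = (bed_min - wake_min) / n_prompts
    if width <= min_gap:
        raise ValueError("intervals are too short for the requested minimum gap")
    prompts = []
    for i in range(n_prompts):
        start = wake_min + i * width
        end = start + width
        if prompts:
            # Enforce the minimum spacing from the previous prompt.
            start = max(start, prompts[-1] + min_gap)
        prompts.append(rng.uniform(start, end))
    return prompts


if __name__ == "__main__":
    # Example: awake from 07:00 to 23:00 (16 waking hours).
    for t in schedule_prompts(7 * 60, 23 * 60, rng=random.Random(1)):
        print(f"{int(t // 60):02d}:{int(t % 60):02d}")
```

With 16 waking hours and 7 prompts, each interval is roughly 2 hours and 17 minutes wide, close to the average spacing reported in the text. Clamping the lower bound to enforce the minimum gap makes the draw slightly non-uniform within an interval; a production scheduler might handle that differently.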
Each ED assessment began with the participant responding to questions about his location, activity, whether alone or with others, and one positive and one negative affect rating (“happy” and “frustrated”) on a visual analogue scale (VAS). Our pain and fatigue/energy items were drawn from well-established questionnaires used to assess pain and fatigue: SF36 version 2,26 Brief Pain Inventory (BPI),5 Brief Fatigue Inventory (BFI),12 and McGill Pain Inventory10,11 (see Table 2). The momentary items were written in the past tense, because patients were instructed to make their rating based upon their symptom experience just before they were beeped. The EOD items were prefaced with the instruction, “please think about the entire day when answering the following questions.” All momentary and EOD item ratings were made on a 100-point VAS. With the exception of the SF36v2 Bodily Pain item (“none” to “very severe”), the VAS anchors for all other items were “not at all” to “extremely.”
Following a telephone eligibility interview, patients came to the research office and were trained in the use of the electronic diary (ED) to collect the momentary and EOD ratings of pain and fatigue/energy for approximately 30 days. A research assistant telephoned the patient 24 hours after the research office visit to answer any questions and troubleshoot any problems with using the ED. Additionally, a follow-up call was made once a week for the next three weeks to ensure the ED was working properly and to answer any questions. Across the month, patients also completed six recall assessments that varied by the length of the recall; these data have been reported in a previous paper.2 At the end of the month, patients returned to the research office to deliver the ED and to engage in a recall assessment of the last 7 days of pain.
The hypotheses were tested using the first 7 days of data in order to avoid any potential effects of extended repeated measurement on both the momentary and EOD ratings. Also, these data would best generalize to the many studies involving a reporting period of 7 days.
Analyses are based on a comparison of the average of EMA ratings with the EOD ratings for the corresponding reporting period. In order to have the most accurate representation of the reporting period, we conceptualized the mean of the EMA ratings (for a day or a week) as a latent variable, thereby adjusting for the inherent unreliability of the observed mean due to random sampling variability and missed ED prompts. Specifically, we estimated multilevel mixed models (using the MIXED procedure in SAS, version 9.1) in which the mean score of momentary ratings for each reporting period was treated as a random (i.e., latent) variable. Had we simply used observed averages of momentary ratings in the analyses, the standard errors of the means would tend to be larger for the mean of a day than for the mean of the week due to the smaller number of momentary reports used to calculate the former. Hence, the correlation of the average momentary rating with the EOD rating would be more attenuated (i.e. weaker) for daily relative to one-week averages. The estimates generated by the multilevel modeling approach are adjusted for attenuation, and reflect our best estimate of the association between EOD measures and the “true mean” of momentary pain or fatigue/energy (while awake).
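The attenuation argument above can be made concrete with a small classical-test-theory sketch. This is not the multilevel model estimated in SAS; it only illustrates why an observed mean based on fewer momentary reports yields a weaker observed correlation. The variance components and the "true" correlation below are hypothetical values chosen for illustration.

```python
import math


def mean_reliability(var_between, var_within, n_reports):
    """Reliability of an observed mean of n_reports momentary ratings.

    var_between: variance of the true means (hypothetical value).
    var_within:  variance of single momentary ratings around the true mean.
    """
    return var_between / (var_between + var_within / n_reports)


def attenuated_correlation(true_r, var_between, var_within, n_reports):
    """Expected observed correlation when only the EMA mean is unreliable."""
    return true_r * math.sqrt(mean_reliability(var_between, var_within, n_reports))


if __name__ == "__main__":
    # Hypothetical inputs: about 5 reports for a day versus about 35 for a week.
    for label, n in (("one day", 5), ("one week", 35)):
        r_obs = attenuated_correlation(0.90, 400.0, 200.0, n)
        print(f"{label}: observed r = {r_obs:.3f}")
```

Under these assumed values, the daily mean is less reliable than the weekly mean, so its observed correlation falls further below the true value. Treating the mean as a latent variable is intended to remove this difference.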
We restricted our use of latent variables to momentary assessments because they were serving as the “gold standard” for assessing the predictive utility of EOD ratings. However, when EOD assessments were aggregated to predict a 7-day average of momentary assessments, we used a simple average that did not adjust for unreliability due to sampling variability, because we had complete coverage (i.e. 7 EOD reports) for the week.
For any given day, sufficient momentary assessments are necessary to accurately characterize symptom experience for that day. Of the 5-7 random prompts that were scheduled each day, a minimum of 3 reports was required based upon our earlier work indicating that 3 reports is comparable to a higher density of 6 or 12 reports across the day.23 A patient’s data for any of the seven days was classified as missing if fewer than 3 momentary ratings were captured or if the EOD assessment for that day was missing. Ninety-six patients (91%) provided acceptable data for all 7 days, 7 patients had data for 6 days, and 2 had data for 5 days. All analyses were performed for the full sample of N = 105. For the 1-day analyses, means and correlations for each of the 7 days were simultaneously estimated in a single mixed-effects model, using the full-information maximum likelihood method. This method allows inclusion of all cases in the analysis even if some observations are missing, and has been referred to as “state of the art” for handling missing data in the methodological literature.17 For the 7-day analyses, if there were fewer than 7 EOD reports, ratings from all non-missing days were averaged to create a mean score for the week.
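The day-level inclusion rule and the weekly EOD average can be sketched as follows. The function names and the use of `None` to mark a missing EOD rating are assumptions for illustration; the study's actual data handling was done in SAS and is not reproduced here.

```python
from typing import Optional, Sequence

MIN_MOMENTARY_REPORTS = 3  # minimum number of momentary reports per day, as stated in the text


def day_is_usable(momentary_ratings: Sequence[float], eod_rating: Optional[float]) -> bool:
    """A day counts as non-missing only if it has >= 3 momentary ratings and an EOD rating."""
    return len(momentary_ratings) >= MIN_MOMENTARY_REPORTS and eod_rating is not None


def weekly_eod_mean(eod_ratings: Sequence[Optional[float]]) -> float:
    """Average the EOD ratings from all non-missing days of the week."""
    present = [r for r in eod_ratings if r is not None]
    if not present:
        raise ValueError("no EOD ratings available for this week")
    return sum(present) / len(present)


if __name__ == "__main__":
    print(day_is_usable([42.0, 55.0], 50.0))           # only 2 momentary reports
    print(weekly_eod_mean([40.0, None, 50.0, 45.0]))   # one EOD rating missing
```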
In order to include a patient’s data from any given day in the analyses, data were required for both the EOD and a minimum of 3 EMA ratings for that day. EMA prompts were randomly scheduled and, therefore, are viewed as a random sample of all possible reports that could have been obtained during the sampling period. Compliance was high, with patients completing an average of 91% of the momentary prompts (range 68% - 100%). This compliance makes it reasonable to assume that the momentary data represent a random sampling of symptom experience over the patients’ waking hours. On average, patients completed 5.6 (SD = 1.27) momentary assessments each day, with 82% of days having 5 or more momentary reports. Across all patients and days (735 reporting days), only 9 days had fewer than 3 momentary reports, and only 4 EOD assessments were missing, resulting in only 11 days (1.5%) classified as missing data (for 2 days, the EOD assessment was missing and fewer than 3 momentary reports were also completed).
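The missing-day count follows from the figures above by simple inclusion-exclusion, since the 2 days that failed both criteria would otherwise be counted twice:

```latex
\begin{align*}
\text{reporting days} &= 105 \times 7 = 735,\\
\text{missing days} &= 9 + 4 - 2 = 11,\\
\text{proportion missing} &= 11 / 735 \approx 0.015 \;(1.5\%).
\end{align*}
```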
Descriptive statistics showing the means and standard errors for each of the pain and fatigue/energy items on each of the days of the week and aggregated over the entire week are shown in Table 3. Pain intensity (BPI) as measured by EMA averaged 40.7 for the week and was significantly lower than the EOD average of 47.0 (p <.001). Mean momentary fatigue (BFI) was also significantly lower than the mean EOD ratings for the week (47.0 vs. 52.5, p <.001). Eight of the 10 items showed this pattern for both daily and weekly average comparisons. For the two positive symptoms (energetic and full of life), the pattern was reversed, though only significantly so for the latter.
The first hypothesis was that there would be a high correlation between individuals’ averages of EMA ratings for a single day (treated as a latent variable) and their corresponding EOD assessments for the same day. The “average” daily correlations, pooled across the 7 days, for each of the 10 pain and fatigue/energy items are presented in Figure 1 (open diamonds). The first five items on the x-axis measure pain, and the magnitudes of their pooled daily correlations are high: pain intensity (.90), bodily pain (.90), stabbing (.91), nagging (.92), and aching (.92). The last five items measure fatigue and energy-related symptoms. These correlations are somewhat lower: fatigue (.86), tired (.83), worn-out (.82), full of life (.83), and energetic (.71). The bars around each pooled daily correlation show the range of individual correlations across the 7 days for that item. Relative to the consistently strong correlations for the pain items, a much wider range of correlations was observed for the fatigue and energy items.
The second hypothesis addresses the degree of correspondence between the average of 7 days of EOD ratings with the average of the week’s EMA ratings. These correlations are displayed in Figure 1 (solid diamonds). For the pain items, none was below .95 in magnitude. For fatigue/energy items, the correlations ranged from .88 to .95. These high correlations indicate that for all items the average of seven EOD assessments corresponds excellently with the average momentary symptom level for the week.
While this finding confirms the hypothesized association between EOD and EMA reports for the week, we were also interested in extending the logic of these analyses to determine if averaging fewer than 7 EOD reports can also result in a high level of correspondence with the week’s average momentary symptom level. To accomplish this, we estimated the correlations that would be obtained when a single EOD report or the average of fewer than 7 (i.e., 2 through 6 randomly selected) EOD reports is used. For each EOD report that is dropped from the average, the reliability of the average decreases, which attenuates the correlation coefficient below the level found for the full set of 7 EOD reports. The reliability of the average of any specified number of EOD reports was estimated from the internal consistency of the full set of 7 EOD reports, using the Spearman-Brown formula.13 This reliability estimate was used, in turn, to calculate the correlation between the week’s EMA average and the average of EOD reports when the latter is based on fewer than seven reports. The results are displayed in Figure 2.
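The projection described above can be sketched with the Spearman-Brown formula plus the standard attenuation relation. This is one common way to carry out such a calculation and may differ in detail from the authors' computation. The 7-report reliability of .97 used in the example is a hypothetical value, since the internal consistency estimates are not given in the text.

```python
import math


def single_report_reliability(rel_full, n_full=7):
    """Step down: reliability of one report implied by the reliability of the n_full-report average."""
    return rel_full / (n_full - (n_full - 1) * rel_full)


def spearman_brown(rel_single, k):
    """Step up: reliability of the average of k parallel reports."""
    return k * rel_single / (1 + (k - 1) * rel_single)


def projected_correlation(r_full, rel_full, k, n_full=7):
    """Expected correlation with the weekly EMA mean when only k EOD reports are averaged.

    Assumes the correlation scales with the square root of the EOD average's reliability.
    """
    rel_k = spearman_brown(single_report_reliability(rel_full, n_full), k)
    return r_full * math.sqrt(rel_k / rel_full)


if __name__ == "__main__":
    # r_full = .95 is the 7-day value reported for pain items; rel_full = .97 is hypothetical.
    for k in range(1, 8):
        print(k, round(projected_correlation(0.95, 0.97, k), 3))
```

By construction, the projection returns the observed 7-report correlation when `k` equals 7 and decreases as reports are dropped.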
For the pain items (Figure 2a), when at least 3 EOD reports are averaged, the correlation with the week’s EMA average is greater than .90; averaging 5 EODs yields a correlation of about .95. For the fatigue and energy items, the strength of correspondence is lower, especially for some items (Figure 2b). The SF36v2 energetic item stands out as having poorer correspondence than the other items. The 4 other items reach a correlation of .90 by using 4 or 5 EOD reports, but the energetic item can only approach a correspondence of .90 by using all 7 EOD reports.
The third hypothesis addresses the ability of EOD reports to accurately capture daily variation in symptom levels across the week. For instance, if the aggregated EMA ratings were higher on 1 day compared with the remaining 6 days, was the EOD correspondingly higher on that day? These average within-subject correlations are displayed in Figure 1 (see x symbol). For the pain items, the correlations ranged from .71 to .80, whereas for the fatigue/energy items, the correlations ranged from .46 (energetic) to .72.
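To illustrate what an average within-subject correlation measures, the sketch below computes a Pearson correlation across days separately for each patient and then averages the coefficients. This is a simplified stand-in: the text does not specify the estimation method, and it may differ from a plain average of per-patient coefficients. The function names and data layout are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined without day-to-day variation
    return sxy / math.sqrt(sxx * syy)


def mean_within_subject_r(patients):
    """patients: list of (daily_ema_means, daily_eod_ratings) pairs, one pair per patient."""
    rs = [pearson(ema, eod) for ema, eod in patients]
    rs = [r for r in rs if r is not None]
    return sum(rs) / len(rs)


if __name__ == "__main__":
    # Made-up data for two patients over 7 days.
    demo = [
        ([30, 35, 50, 40, 38, 45, 33], [36, 40, 58, 47, 41, 52, 37]),
        ([60, 62, 55, 70, 65, 58, 61], [66, 64, 60, 78, 70, 65, 63]),
    ]
    print(round(mean_within_subject_r(demo), 2))
```

Because each patient's own mean is removed before correlating, stable between-patient differences in symptom level do not contribute to this statistic; only day-to-day covariation does.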
The objective of this study was to determine if EOD ratings of pain and fatigue are a reasonable proxy for EMA in studies interested in daily or weekly symptom levels. Our results confirmed the first hypothesis by showing that daily correlations between EOD and EMA were at or above .90 for the pain items and somewhat less, but still greater than .80, for most of the fatigue items. Second, the 7-day average of EOD ratings correlated very highly with the week’s average EMA for all of the pain items (≥ .95) and almost as well for the fatigue/energy items (≥ .88). Furthermore, analyses suggested that fewer than seven EOD reports could be averaged and still provide a high level of correspondence with the week’s EMA. Jensen and McFarland also reported the feasibility of collecting fewer ratings to accurately represent the week.7
The third hypothesis, a more stringent test of the validity of EOD reports, considered their ability to discriminate day-to-day fluctuations in patients’ levels of pain and fatigue across the 7 days. Within-subject analyses demonstrated good correspondence for the five pain items and two of the fatigue items (r’s ≥ .70); three of the SF36v2 items did not perform as well (r’s of .46 to .66). In comparison, we reference other within-subject correlations collected in this protocol that have been reported previously.2 On the final day of the study, patients retrospectively reported for each of the last 7 days the BPI average pain (for the day). A within-subject analysis was conducted of those retrospective pain ratings with the daily averages of EMA ratings for those days. The average within-subject correlation was much lower (r = .29) for the retrospective reports compared with the average correlation of .76 for the EOD ratings of the same item. Taken together, these data indicate that patients can provide EOD ratings that reflect differing pain levels across the days, whereas retrospective reports fail to do so.
The finding that the fatigue and energy items corresponded less well with EMA than the pain items was unexpected. It is possible that fatigue and energy ratings made over some time period (hours or days) may evoke more complex considerations that go beyond the experiences captured by momentary assessment.15 For example, a patient may rate energy on an EOD report as lower if he found that he was not able to accomplish all of his tasks for the day, yet this might not be reflected in that day’s momentary ratings of energy. Second, fatigue has been shown to exhibit a pronounced diurnal cycle, with greater fatigue at the end of the day, in some individuals.19 If the EOD rating is influenced by current levels of fatigue, then the inter-individual variability in diurnal cycles would be expected to attenuate its correspondence with the average fatigue for the day measured by aggregated EMA ratings.
One might still argue that less costly, single recall ratings (e.g., 7-day) are preferable to collecting EOD ratings as outcome measures. This argument can be evaluated in the context of other data collected in this same protocol and reported elsewhere2 that examined the correlation of EMA with traditional recall ratings (RR) using various reporting periods (1, 3, 7, and 28-days). Those data allow us to compare the relative accuracy of EOD ratings versus RRs for the same 10 pain and fatigue/energy items. We found that the correlations of the 7-day RRs and EMA ratings for most of these items were less than .80 – and in a number of cases much less. In contrast, this study demonstrated that the average of 7 EOD ratings yielded a correlation of ≥.95 for the pain items and ≥.88 for the fatigue/energy items. Therefore, the level of shared variance between EMA and the EOD ratings ranged between 77-90%, while it ranged from 27-55% for the 7-day recall ratings.2 This is a substantial difference. When a clinician or researcher wants an outcome measure that represents the patient’s aggregated experience of the symptom over a given reporting period, these data provide strong support for using time- and date-stamped EOD ratings rather than recall assessment.
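The shared-variance figures are the squares of the corresponding correlations. For the EOD bounds quoted above, and working backward from the recall percentages to the correlations they imply:

```latex
\begin{align*}
\text{EOD (7-day average):} \quad & r = .88 \Rightarrow r^2 = .7744 \approx 77\%, \qquad
                                    r = .95 \Rightarrow r^2 = .9025 \approx 90\%,\\
\text{7-day recall:} \quad & r^2 = .27 \Rightarrow r \approx .52, \qquad
                             r^2 = .55 \Rightarrow r \approx .74.
\end{align*}
```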
Evidence of the effects of item scaling on correlations with EMA was obtained by another comparison of the EOD data in this paper with the data reported in our recall paper.2 As noted above, one of the recall periods in the previous paper was 1-day. This 1-day RR was essentially the same as the EOD task in this study: both used the same item wording and both were made in the evening. The correlations of 1-day RR with EMA from the prior paper were generally in the range of .70 - .80, which are lower than those observed for the EOD ratings reported in this paper. In theory, they should be the same, since both involve recall over one day. What differs are the response scales used and the method for capturing the ratings (EOD = hand-held computer; RR = paper and IVR). The EOD ratings from this study were done on a 100-point VAS, as were the EMA ratings. Thus, the EOD and EMA ratings shared an identical scale format. In the case of the 1-day RRs from the prior study, we preserved the scaling of the item of the standard instrument. Depending upon the instrument from which the item was drawn, the 1-day RRs often had only 4-6 response options. One interpretation of these results is that the ability of RRs to finely differentiate pain and fatigue levels may be attenuated for items associated with a small number of response options. In fact, the two RR items with 0-10 point scales (BPI: average pain, and BFI: usual fatigue) showed the highest correlations with EMA (r = .82 and .86, respectively). However, a number of studies have examined VAS versus Likert scales for the measurement of pain, and there is not a consensus on whether scales with a greater number of response options perform better.1,9,25 Furthermore, most evidence suggests that the method (paper vs. computer) for capturing the ratings has little impact on ratings.3,4
Although not specified as a hypothesis at the outset, we examined level differences in EOD and EMA reports. Consistent with previous work,20,24 we observed mean level differences between EOD and EMA ratings in 9 of the 10 items. While level differences between aggregated EMA and traditional RRs are frequently found, we did not know if level differences would be evident in ratings with such a proximal reporting period. The EOD ratings of the negative symptoms were generally about 5-6 points higher than the EMA ratings. This is less than the level differences observed in RRs, which can be upwards of 10-15 points depending upon the item and the length of the recall period.2,20,27 However, it suggests that even when recall is only hours, as in EOD reports, cognitive heuristics, such as peak or duration neglect (ignoring times when the symptom is close to zero), may inflate the ratings.14,18 The implication of this observation is that as long as the same measurement method is used for repeated measures in a clinical trial, the level difference will remain constant and will not impact the interpretation of treatment outcomes. However, if different methods are used, there is a serious risk of false conclusions. For example, if a researcher decided to use a 1-month recall of usual pain for baseline and the average of 7 EOD ratings for post-treatment assessment, it is likely that he would observe a reduction in pain of 10 points or more simply due to measurement differences. Conversely, if the EOD ratings were used at baseline and the 1-month recall at post-treatment, then pain would appear to be 10 points worse at post-treatment, combined with whatever actual treatment effect was operating.
Finally, the limitations of this study are considered. First, these data may not generalize to PROs other than pain and fatigue/energy or to other populations. Second, it is possible that by virtue of providing five or more EMA symptom reports per day patients were more informed of their symptom levels and could make a more accurate EOD report. The accuracy of the EOD reports in this study may therefore be somewhat better than those that would be obtained in the absence of simultaneous EMA reports. However, it should be noted that the traditional recall reports that were compared to the EOD reports from the same protocol would have benefited in the same way. Thus, the comparative improvement in accuracy with EOD is not in doubt. Third, the patients making up this sample were rheumatology patients who on average had moderate levels of pain and fatigue. We do not believe that the patterns observed would be different in samples of pain center patients with higher symptom levels; but this would need to be confirmed empirically. Fourth, this study was not conducted in the context of a clinical trial where symptom levels would be expected to change across the assessment period. Our patients were generally in a steady clinical state. It is uncertain as to whether EOD accuracy would be different in a clinical trial.
This research was supported by grants from the National Institutes of Health (1 U01-AR052170-01; Arthur A. Stone, principal investigator) and by GCRC Grant M01-RR10710 from the National Center for Research Resources. We would like to thank Pamela Calvanese, Doerte Junghaenel, and Leighann Litcher-Kelly for their assistance in collecting data. Software and data management services for the electronic diary assessments were provided by invivodata, inc (Pittsburgh, PA). JEB and AAS have a financial interest in invivodata, inc. and AAS is a senior scientist for the Gallup Organization.