A substantial amount of research has been conducted in an effort to understand the impact of short-term (<48 hr) total sleep deprivation (SD) on outcomes in various cognitive domains. Despite this wealth of information, there has been disagreement on how these data should be interpreted, arising in part because the relative magnitude of effect sizes in these domains is not known. To address this question, we conducted a meta-analysis to discover the effects of short-term SD on both speed and accuracy measures in 6 cognitive categories: simple attention, complex attention, working memory, processing speed, short-term memory, and reasoning. Seventy articles containing 147 cognitive tests were found that met inclusion criteria for this study. Effect sizes ranged from small and nonsignificant (reasoning accuracy: g = −0.125, 95% CI [−0.27, 0.02]) to large (lapses in simple attention: g = −0.776, 95% CI [−0.96, −0.60], p < .001). Across cognitive domains, significant differences were observed for both speed and accuracy; however, there were no differences between speed and accuracy measures within each cognitive domain. Of several moderators tested, only time awake was a significant predictor of between-studies variability, and only for accuracy measures, suggesting that heterogeneity in test characteristics may account for a significant amount of the remaining between-studies variance. The theoretical implications of these findings for the study of SD and cognition are discussed.
Since the seminal experiments on human sleep deprivation (SD) in the late 19th century (Patrick & Gilbert, 1896), there have been frequent attempts to better understand the effects of SD on human physiology and performance. Within this body of research, substantial resources have been invested in interrogating the effects of total SD (complete sleep loss over an extended period) on cognition and cognitive performance. Though less common than long-term chronic sleep loss, in which sleep time is partially reduced over days or weeks, cognitive failures associated with total SD are nevertheless of great interest and importance, as their real-world consequences are often catastrophic (Dinges, 1995; Mitler et al., 1988).
Within the literature, much attention has centered on the effects of SD on basic attentional processes and complex, real-world tasks. There has been relatively less interest in other cognitive domains on which SD has a known deleterious effect, such as short-term memory, working memory, processing speed, and tasks of selective attention. Although there is general agreement that SD does exert an effect across most of these cognitive domains (for a review, see Durmer & Dinges, 2005), much less is known about the degree of these negative consequences. Moreover, various studies have found little to no effect of short-term SD on tests within some of these cognitive domains (e.g., Binks, Waters, & Hurry, 1999; Quigley, Green, Morgan, Idzikowski, & Kin, 2000), which has led to some disagreement about how to fit these data into a coherent theory.
As a means of performing this synthesis, several comprehensive qualitative reviews on the effects of SD on cognitive function have appeared over the past several years. By and large, these reviews have concluded that SD has effects on a broad suite of cognitive domains, but they hesitate to make claims about the relative magnitude of these effects. Moreover, there exist three general schools of thought as to the theoretical framework through which the available data should be interpreted. We summarize these viewpoints below, noting that these theories are not mutually exclusive and certainly amenable to integration.
Many of the early studies on SD and cognition cite novelty and motivation as critical variables in determining performance under adverse conditions (Wilkinson, 1961; Williams, Lubin, & Goodnow, 1959). These suggestions were made subsequent to the initially paradoxical observations that many highly demanding cognitive tests are unaffected by short bouts of total SD. For example, performance on Baddeley’s Logical Reasoning Test is consistently found to be stable, even as sleepiness and impairment in other cognitive domains appear (Magill et al., 2003; A. P. Smith & Maben, 1993). These negative findings have prompted the creation of theories such as the “controlled attention” model of Pilcher, Band, Odle-Dusseau, and Muth (2007). In this model, the authors highlighted the importance of “bottom-up” task characteristics, arguing that tasks that are monotonous or intrinsically less engaging are more severely affected by SD, due to the fact that greater top-down control is needed to sustain optimal performance on these tests. The authors suggested that tasks be classified on the basis of whether they encourage attentive behavior and hypothesized that tasks that are high on this dimension are affected the least by SD.
Several reviewers have suggested that SD has domain-specific effects on cognition, with particular focus on tasks mediated by prefrontal cortex (PFC) function. Jones and Harrison (2001) and Harrison and Horne (2000) both reviewed the literature on the impact of SD on PFC-oriented tasks and concluded that these tests provide incremental validity in assessing impairment beyond the consideration of vigilance or sustained attention alone. For example, Harrison, Horne, and Rothwell (2000) gave young adults a neuropsychological battery following 36 hr of total SD and found specific impairments on PFC-oriented tests (of temporal memory, verbal fluency, and response inhibition) but not on a test of recognition memory. The authors noted that the impairments seen were similar to those displayed by healthy, middle-aged (55–64 years) participants, with diminution of PFC function being a known consequence of normal aging. More recently, neuroimaging data have lent further support to this claim; for instance, studies using functional magnetic resonance imaging (Chee & Choo, 2004; Drummond et al., 2000) have demonstrated hypoactivation in regions of the lateral and medial PFC to a variety of tasks following SD, thus localizing the putative neural basis for the observed behavioral changes.
Proponents of this view interpret these findings as evidence that the impairment seen in many complex cognitive tasks is not merely driven by the failure of more basic cognitive skills; that is, PFC-oriented tasks are vulnerable to specific failures that are above and beyond those expected to be caused by low arousal and sleepiness (Harrison et al., 2000). Conceptually, this can be thought of as a neuropsychological model; that is, SD produces a reversible functional lesion in the PFC that is detectable by tests sensitive to these deficits in brain-injured patients. This model provides some explanatory power in resolving the mixed results in the literature that researchers had tried to account for with moderators such as task type, task length, novelty, and motivation.
Finally, other reviewers have singled out arousal and vigilance as general factors that explain much of the variance in cognitive deficits following sleep loss. Durmer and Dinges (2005) stated that “cognitive tasks vary considerably in their sensitivity to sleep loss” (p. 120) but remarked that reaction time measures of tasks of attention and vigilance are the predominant instruments used to assay vulnerability to SD. Lim and Dinges (2008) also spotlighted vigilant attention as a cognitive process that is consistently and robustly affected by total SD. Finally, Balkin, Rupp, Picchioni, and Wesensten (2008) made the stronger assertion that “the array [of activities affected by sleep loss] is so extensive that it is reasonable to posit that sleep loss exerts a nonspecific effect on cognitive performance” (p. 654).
There is strong experimental evidence for these assertions. Tests of sustained attention (e.g., the Psychomotor Vigilance Test) are not only reliable but also highly valid in predicting real-world performance and assessing the level of impairment faced by an individual under conditions of fatigue (Dinges et al., 1997; Lim & Dinges, 2008). The Psychomotor Vigilance Test is also sensitive in tracking both circadian and homeostatic modulations in sustained attention and arousal over the course of several days without sleep (Doran, Van Dongen, & Dinges, 2001). Finally, models of attention often stress that vigilance and sustained attention are fundamentally important to many higher aspects of cognition and that these higher processes will necessarily decline if a subject is not able to sustain a sufficient level of vigilance while performing a task (Sturm et al., 1999; Sturm & Willmes, 2001).
The three models discussed above are not mutually incompatible. One could argue that the controlled attention hypothesis and the vigilance hypothesis merely take different perspectives in explaining the same set of phenomena and that the neuropsychological hypothesis, though consistent with both of these models, accounts for effects above and beyond what may be expected from either. As a result, certain theorists have proposed a more integrative approach in interpreting the available data. For instance, Boonstra, Stins, Daffertshofer, and Beek (2007) suggested that impairment in the PFC following a period of SD may underlie changes in both executive functioning and attention, stressing the role of the PFC in the interaction between top-down and bottom-up processes.
If we believe that there is some predictive power in all of these models, a new and more pragmatic question arises: To what degree are different cognitive functions impaired? Without a standardized basis of comparison, there is no way to assess the relative importance of each of these theoretical approaches. Knowledge of the effect sizes associated with each of these impairments may be of use in determining the targets for intervention in real-life situations so as to minimize the deleterious effects of SD on workers in the field.
The meta-analysis (M. L. Smith & Glass, 1977) is an increasingly popular method of synthesizing data from the primary literature and is a useful tool in addressing the question posed above. This method entails a systematic search for all articles related to a topic that meet a preordained set of inclusion criteria, calculating the effect sizes in all studies that meet these criteria and accumulating these effect sizes by weighting them on the basis of their sample sizes. This final step uses an estimate of sampling variance to give greater weight to studies with larger samples, thus providing a more unbiased estimate of the true effect size of a given manipulation.
To our knowledge, three meta-analyses have been conducted to date on the effects of SD on performance. Koslowsky and Babkoff (1992) summarized 27 studies and reported that total SD showed greater correlations with performance as the duration of deprivation increased, and that speed or latency variables were generally affected more than accuracy measures, arguing that these data support the lapse hypothesis of Williams et al. (1959). Pilcher and Huffcutt (1996) analyzed 19 studies for the effects of SD on cognitive task performance, motor task performance, and mood and concluded that “simple” tasks were affected more than “complex” tasks after short periods of SD but that the reverse was true for long periods (>45 hr) of SD. The same pattern was seen with task duration, with short tasks being more adversely affected after short periods of SD, and the reverse true of long tasks. These authors found that partial SD (in which a restricted amount of sleep is allowed every night) had a more pronounced effect on cognition overall than total SD. Most recently, Philibert (2005) conducted a meta-analysis to assess the effects of SD on cognitive and clinical performance in physicians and nonphysicians. Overall, the effect sizes for changes in cognitive function were −0.564 (95% CI [−0.406, −0.722]), with the most pronounced effects on tests of vigilance and sustained attention (d = −1.33, 95% CI [−1.124, −1.536]). This final analysis was the most comprehensive and methodologically sophisticated of the three, with 60 studies and 5,295 individual effect indices included.
The studies described above have a number of weaknesses that remain to be addressed. First, two of these meta-analyses (excepting Koslowsky and Babkoff, 1992) aggregated performance variables measuring accuracy and speed into a single category when summarizing effect sizes. There is little empirical evidence that speed and accuracy are uniformly affected by SD, and aggregating the two outcome types may result in the loss of interesting information. Second, the cognitive categories in these previous analyses were relatively coarse; for example, no distinction was made in the Philibert (2005) analysis between working memory and short-term memory paradigms. Finally, none of the previous analyses attempted to control for differences in study quality or took into account the interindividual differences present in cognitive responses to total SD.
Our purpose in conducting the current meta-analysis was thus twofold: (a) to investigate the relative magnitude of the effects of SD on different cognitive domains and (b) to explore whether the effects on accuracy and reaction time measures were different in any of these domains. The overarching motivation for this analysis was to uncover evidence that may inform our understanding of the effects of short-term acute SD on cognition and thus aid in assessing the relative importance of current theoretical models.
Our primary collection of literature was gathered by searching online electronic databases for articles relevant to our topic of interest through December 2008. The four databases used were PsycINFO, Medline, Web of Science, and EBSCO MegaFile. In each of these databases, we conducted a search using a combination of the following terms: sleep deprivation or sleep restriction and cognition, attention, memory, performance, vigilance, and executive function (12 combinations in all). This search yielded 4,276 hits in total. We next scanned the abstracts of these articles to determine their suitability for inclusion in the analysis. In total, 176 of the articles were empirical studies that employed SD as a primary independent variable and used at least one cognitive measure as a dependent outcome. We subsequently obtained the full text of these articles to determine whether they met full inclusion criteria. These criteria were as follows:
Because of the restrictions imposed by Criterion 3, a number of subareas within the realm of SD research necessarily had to be omitted from this analysis. A survey of the literature on SD and decision making revealed that outcome variables on these tests did not form a cluster that was homogeneous enough to warrant a quantitative synthesis. This was because many of these experiments employed complex, real-world scenarios, opting for ecologically valid designs over more controlled neuropsychological tests (for a review, see Harrison & Horne, 2000). Moreover, it is unclear how outcome variables from standardized decision-making tests (e.g., the Iowa Gambling Test) should be compared with the accuracy measures obtained from other cognitive domains. Finally, experiments on implicit and procedural memory were also excluded, as these form a separate body of literature pertaining to sleep and memory consolidation (Stickgold & Walker, 2005), the analysis of which is beyond the scope of this article.
In addition to this online literature search, we obtained data from several other sources. We conducted hand searches of the journal Sleep and the Journal of Sleep Research from 1988 to 2008. We also reviewed the reference lists of the major review articles on SD and cognitive performance that have been published over the last several years. Finally, to mitigate the “file drawer” problem (Strube & Hartmann, 1983), we contacted eight major sleep laboratories conducting research in this field to request unpublished data from experiments, as well as master’s and doctoral theses. We received additional data from one of these laboratories, as well as replies from all but one of the remaining investigators informing us that they did not have suitable data for inclusion. In total, 70 articles and 147 data sets met inclusion criteria and were included in the meta-analysis (see Table 1). Among these, numerous data sets contained more than one cognitive outcome; these were coded separately, according to the recommendations of Hunter and Schmidt (2004). Altogether, 209 aggregated effect sizes and 5,104 individual effect sizes were calculated from these data sets.
Each cognitive test was assigned a domain according to the classification scheme below.
Tests in simple attention involved the visual or auditory detection of a single class of stimuli, with no element of perceptual discrimination, orienting, inhibition, or any other form of executive attention, such as the Psychomotor Vigilance Test (Dinges & Powell, 1985) and other simple reaction time tests. This was the only category in which effect sizes were calculated for lapses and omissions instead of accuracy.
Tests in complex attention assessed all attentional processes more demanding than those in the first category (e.g., selective or executive attention) but did not have any major working memory component or require any short-term or long-term memory encoding, such as the Stroop test (Stroop, 1935), the Continuous Performance Test (Conners, 1995), and the go/no-go paradigm.
Tests in processing speed primarily assessed cognitive throughput or processing speed, requiring multiple repetitions of a rehearsed process within a fixed period. Examples include the Digit Symbol Substitution Test from the Wechsler Adult Intelligence Scale (Wechsler, 1997a) and tests of serial addition and subtraction.
Tests in working memory involved the maintenance and/or manipulation of relevant information over a brief period, culminating in a decision and response, such as the Sternberg working memory task and the delayed-match-to-sample test.
Tests in short-term memory involved the encoding, maintenance, and retrieval of information. The amount of information to be stored had to exceed working memory capacity, and maintenance typically occurred over a longer period. Examples include word list learning and the Logical Memory subtest of the Wechsler Memory Scales (Wechsler, 1997b). This domain was further subdivided into short-term memory recall and short-term memory recognition. Only effect sizes for accuracy measures were computed for this cognitive category. We note that individuals with a background in cognitive psychology may consider many of these as long-term memory tests due to differences in nomenclature across fields.
Tests in reasoning and crystallized intelligence assessed mental processes such as problem solving, vocabulary exercises, and other forms of crystallized cognitive ability. Examples include Raven’s Advanced Progressive Matrices test (Raven, Raven, & Court, 1998) and Baddeley’s Grammatical Reasoning Test (Baddeley, 1968). Only effect sizes for accuracy measures were computed for this cognitive category.
We had originally intended to analyze tests of verbal fluency (e.g., the Controlled Oral Word Association Test; Spreen & Strauss, 1991) as a seventh category in this study. However, as only three articles to date contained such tests and met all inclusion criteria (Binks et al., 1999; Fluck et al., 1998; Horne, 1988), we omitted this category from our final analysis.
It has been recommended that studies entered into a meta-analysis be coded for study quality (Chalmers et al., 1981; Detsky, Naylor, O’Rourke, McGeer, & L’Abbé, 1992). This is especially important when the pool of studies entered into the analysis is highly heterogeneous and the designs have varying threats to internal and external validity (Detsky et al., 1992). In our survey of the literature, we observed significant discrepancies in how experiments of SD are conducted and controlled, and thus deemed that this step was appropriate for our analysis. We identified seven features that were important determinants of a study’s reliability and validity, and coded each experiment so that they received a 0 or 1 score on each of these criteria:
For between-subjects studies, were subjects randomized to the sleep-deprived and control groups? For repeated-measures studies, was the study order counterbalanced to avoid the potential confound of order effects?
Were the treatment and control groups equal in number? Were they treated similarly (e.g., in compensation and study conditions)?
Were subjects randomly recruited from the population? Was the study sample representative of the population, or did the experiment include only a particular subgroup of people (e.g., fighter pilots, only women)?
Were appropriate statistical tests used to analyze the data?
Did the cognitive tests used appropriately capture the construct of interest? Were they well validated? Were appropriate dependent measures used in the analysis?
Was the study conducted in a sleep laboratory? Were subjects monitored during their time in the study? Were their diet and activity controlled?
Were participants screened for good sleep history or absence of sleep disorders? Was sleep history monitored in the period prior to the experiment (e.g., sleep diaries or actigraphy)?
Studies were assessed and coded by two independent raters (Julian Lim and one other rater who was trained on the criteria above). They assessed interrater reliability using intraclass correlation coefficients from a two-way mixed model with raters as fixed effects and studies as random effects. The intraclass correlation coefficient for the entire sample was .798, indicating that there was a high level of agreement between the two raters on study quality over the entire sample of articles.
In addition to coding for study quality, we recorded the following variables for use as potential moderators in the secondary analysis: length of SD and the times of test administration (which were used to calculate circadian offset).
The primary metric of the meta-analysis is the effect size, which is a standardized estimate of the magnitude of the treatment effect. We calculated all effect sizes (in this analysis, d values) using a baseline test and the test at the most extreme point of SD in the experiment (with a few exceptions, noted in Table 1). In the case where means (μ) and standard deviations or errors (σ) were reported, we calculated effect sizes using Equation 1 for between-subjects studies:
where N1 and N2 are the number of subjects in the control and experimental groups, respectively. In contrast, effect sizes in within-subjects or repeated-measures studies should be calculated with the standard deviation of change scores as an error term (Hunter & Schmidt, 2004); however, these are seldom reported in the literature. Instead, we estimated this term using the pre- and posttest standard deviations and correlations, as suggested by Morris and DeShon (2002). In cases where this information was not available, we reverted to Formula 1 as our best estimate of the effect size. Where only t or F values were reported, we converted these to effect sizes following the recommendations of Rosenthal (1991), as shown in Equations 2 and 3. Where only p values were reported (for t tests), we back-converted these to t values using statistical tables and applied Formula 3. Once a d value was obtained, its mathematical sign was adjusted so that worse performance following SD was always reflected by a negative effect size.
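The conversions described above can be illustrated with a minimal sketch. It assumes the standard pooled-standard-deviation form of d for between-subjects designs and the exact algebraic conversion from an independent-groups t value; the function names are hypothetical, and the forms may differ in detail from Equations 1 to 3.

```python
import math

def d_between(mean_ctrl, sd_ctrl, n_ctrl, mean_dep, sd_dep, n_dep,
              higher_is_worse=False):
    """Standardized mean difference with a pooled SD (assumed standard form).

    The sign is set so that worse performance after SD is negative.
    For reaction times or lapses, pass higher_is_worse=True.
    """
    pooled_var = ((n_ctrl - 1) * sd_ctrl ** 2 + (n_dep - 1) * sd_dep ** 2) \
        / (n_ctrl + n_dep - 2)
    d = (mean_dep - mean_ctrl) / math.sqrt(pooled_var)
    return -d if higher_is_worse else d

def d_from_t_between(t, n_ctrl, n_dep):
    """Recover d from an independent-groups t: d = t * sqrt(1/n1 + 1/n2)."""
    return t * math.sqrt(1.0 / n_ctrl + 1.0 / n_dep)

def d_from_f_between(f, n_ctrl, n_dep, sign=-1):
    """An F with 1 numerator df equals t squared; F carries no sign,
    so the direction of the effect must be supplied by the coder."""
    return sign * d_from_t_between(math.sqrt(f), n_ctrl, n_dep)

# Accuracy drops from 10 to 8 (SD = 2 in both groups of 20).
accuracy_d = d_between(10, 2, 20, 8, 2, 20)
# Reaction time rises from 250 to 300 ms (SD = 50): also a deficit.
rt_d = d_between(250, 50, 20, 300, 50, 20, higher_is_worse=True)
```

Both illustrative cases yield a negative d, matching the sign convention adopted in the text.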
As recommended by Hedges and Olkin (1985), we next adjusted for effect size inflation in studies with small samples by applying the correction in Equation 4. This adjustment yields the Hedges’s g, which treats the variance of each study as an estimate rather than a constant. The difference in this correction for between- and within-subjects studies is accounted for by the differing degrees of freedom in the denominator of the equation.
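As an illustration of this correction, the sketch below uses the common approximation J = 1 − 3/(4·df − 1), with df = N1 + N2 − 2 for between-subjects studies and df = n − 1 for within-subjects studies. This is an assumed form of Equation 4, and the function name is hypothetical.

```python
def hedges_g(d, df):
    """Apply the small-sample correction to d (assumed approximation).

    df is N1 + N2 - 2 for between-subjects designs and n - 1 for
    within-subjects designs.
    """
    if df < 1:
        raise ValueError("df must be at least 1")
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return d * correction

# A within-subjects study with 20 participants (df = 19).
g_within = hedges_g(-1.0, 19)
# A between-subjects study with 20 participants per group (df = 38).
g_between = hedges_g(-1.0, 38)
```

The correction shrinks d toward zero, and the shrinkage is larger when the degrees of freedom are fewer.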
In order to combine the results of between- and within-subjects designs, all effect sizes need to be converted into a single metric. As the effect of interest in this study was that of SD on an individual over time, the within-subjects design is the appropriate standard of measurement for this analysis. Effect sizes from between-subjects experiments were converted into the appropriate metric by the formula suggested by Morris and DeShon (2002; see Equation 5), where ρ is the correlation between the pre- and posttest scores.
As ρ is seldom reported in the literature, we estimated this value from data collected on various cognitive tests in our own laboratory. Pre- and posttest correlations from this investigation generally fell in the .4–.6 range. To simplify this conversion, therefore, we assumed that for SD experiments, ρ = .5 (the unity case where gws = gbs).
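The unity case can be checked numerically. The sketch below assumes that the conversion takes the form gws = gbs / sqrt(2(1 − ρ)), which is the form given by Morris and DeShon (2002) for moving between the independent-groups and repeated-measures metrics; the function name is hypothetical.

```python
import math

def between_to_within(g_bs, rho=0.5):
    """Convert a between-subjects effect size to the within-subjects metric.

    Assumed form: g_ws = g_bs / sqrt(2 * (1 - rho)), where rho is the
    pre/post correlation.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return g_bs / math.sqrt(2.0 * (1.0 - rho))

unity = between_to_within(-0.8, rho=0.5)    # unchanged when rho = .5
low_rho = between_to_within(-0.8, rho=0.4)  # smaller in magnitude
high_rho = between_to_within(-0.8, rho=0.6) # larger in magnitude
```

Across the .4 to .6 range, the converted value moves by roughly 10% in either direction, which indicates how sensitive the result is to the assumed ρ.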
We computed the within-studies variance due to sampling error (Var(e)) for each of the data sets using Equation 6 for within-subjects studies and Equation 7 for between-subjects studies. For within-subjects studies, we used a sampling variance term that takes into account a Treatment × Subjects interaction (Hunter & Schmidt, 2004), as recent research has emphasized the large and stable intersubject variability in cognitive performance after SD (Leproult et al., 2003; Van Dongen, Baynard, Maislin, & Dinges, 2004).
Separate analyses were conducted for accuracy (or lapses) and speed for the cognitive domains of simple attention, complex attention, working memory, and processing speed. Only accuracy measures were compiled for the domains of short-term memory and reasoning.
We calculated the overall average effect size for each outcome measure type and domain using Equation 8. Each effect size was weighted by the inverse of its sampling variance (wi), thus giving relatively less weight to studies with small sample sizes. Effect sizes were also weighted by their individual mean-adjusted study quality (sqi; i.e., quality for study i divided by the mean quality rating in its respective domain). Mean-adjusted scores are a viable method for accounting for differences in study quality (Bérard & Bravo, 1998; Detsky et al., 1992) and have the advantage of not widening the confidence intervals of pooled effect sizes.
The weights (wi) in Equation 8 were derived via a random-effects model, which assumes that the “true” effect size of each study is not identical and that samples were drawn from populations that differ in meaningful ways. This was clearly the case in our analysis; as a most basic example, the length of SD varied from 24 to 48 hr between studies, and it is known that the magnitude of performance deficits grows with escalating sleep pressure (Doran et al., 2001).
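The weighting scheme described above can be sketched as follows. The sketch assumes that the pooled estimate is the sum of sqi·wi·gi divided by the sum of sqi·wi, with wi = 1/(vi + τ²), and it estimates the between-studies variance τ² with the DerSimonian-Laird method. The choice of estimator is an assumption made for illustration, since the text does not name one, and the function names are hypothetical.

```python
def dl_tau2(effects, variances):
    """DerSimonian-Laird estimate of between-studies variance (assumed estimator)."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed_mean = sum(wi * g for wi, g in zip(w, effects)) / sw
    q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    df = len(effects) - 1
    return max(0.0, (q - df) / c) if c > 0 else 0.0

def quality_weighted_mean(effects, variances, quality):
    """Random-effects mean with mean-adjusted study quality as an extra weight."""
    tau2 = dl_tau2(effects, variances)
    mean_quality = sum(quality) / len(quality)
    sq = [q / mean_quality for q in quality]        # mean-adjusted quality
    w = [1.0 / (v + tau2) for v in variances]       # random-effects weights
    numerator = sum(s * wi * g for s, wi, g in zip(sq, w, effects))
    denominator = sum(s * wi for s, wi in zip(sq, w))
    return numerator / denominator

# Two equally precise studies; the higher-quality study (score 6) pulls
# the pooled estimate toward its own effect size.
pooled = quality_weighted_mean([-0.2, -0.6], [0.05, 0.05], [3, 6])
```

Because the quality scores are divided by their mean, they redistribute weight among studies without changing the average weight.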
Finally, by calculating the variance components associated with between-studies and within-studies variation, we were able to obtain two measures of dispersion for each set of studies: the Q statistic, which reflects the total amount of variance in the meta-analysis, and the I2 statistic, which ranges from 0 to 100 and is an index of the proportion of variance in the sample attributable to between-studies differences (Higgins, Thompson, Deeks, & Altman, 2003).
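A minimal sketch of these two dispersion measures is given below. It assumes the usual definitions, namely Cochran's Q computed with inverse-variance weights and I2 = 100 × (Q − df)/Q truncated at zero, as described by Higgins et al. (2003). The function name and example values are illustrative only.

```python
def heterogeneity(effects, variances):
    """Return (Q, I2) for a set of effect sizes and their sampling variances."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * g for wi, g in zip(w, effects)) / sum(w)
    q = sum(wi * (g - mean) ** 2 for wi, g in zip(w, effects))
    df = len(effects) - 1
    # I2 is the share of Q in excess of its expected value under homogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

# Two studies with equal precision and clearly different effects.
q_stat, i2_stat = heterogeneity([-0.2, -0.8], [0.04, 0.04])
```

In this toy case, Q = 4.5 on 1 degree of freedom, giving an I2 of about 78, which would count as high heterogeneity on the benchmarks of Higgins et al. (2003).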
A complete list of studies and individual effect sizes is presented in Table 1. The total sample size for the analysis was 1,533, with an average of 21.3 (SD = 11.1) subjects in each study. The average study quality for the complete sample ranged from 3 to 7 (M = 5.21, SD = 1.18).
Average effect sizes for each cognitive domain and outcome are presented in Table 2. A total SD period of 24–48 hr had a significant effect in reducing performance for outcomes in all cognitive domains, except for accuracy measures in tasks of processing speed (p = .06), as well as accuracy measures on tests of reasoning and crystallized intelligence (p = .08). As there were relatively few studies in each of these categories, however, it is possible that the analysis had insufficient power to detect a significant effect for these outcomes.
As anticipated, the largest effects of 24–48 hr of SD were on tests of vigilance, or simple attention. Effect sizes within this domain were −0.762 (lapses) and −0.732 (reaction times), which represent moderate to large effects. These effects are comparable to those reported by Philibert (2005), who found effect sizes of −1.142 and −0.553 for vigilance tests conducted after 24–30 and 30–54 hr of SD, respectively. Effect sizes for complex attention and working memory fell in the moderate range, and tests of processing speed showed on average small but significant effects.
We performed analyses of variance on the aggregate effect sizes to test two sets of null hypotheses: first, that average effect sizes are identical across cognitive domains (with separate analyses conducted for speed and accuracy), and second, that average effect sizes for speed and accuracy are identical within each cognitive domain. As two of the cognitive domains (short-term memory and reasoning) contained only one outcome measure, we did not enter all information into a two-way analysis of variance. Tables 3 and 4 summarize the results of this analysis. We found a significant difference in effect sizes across cognitive domains for measures of both speed, Q(3) = 24.5, p < .001, and accuracy, Q(5) = 36.8, p < .001; however, there were no differences between speed and accuracy measures within each cognitive domain, even prior to correction for multiple comparisons.
The I2 value is an index of the proportion of variance within each analysis that is due to between-studies differences; Higgins et al. (2003) suggested that values of 25, 50, and 75 may be used as benchmarks of low, moderate, and high heterogeneity, respectively. I2 values in each analysis ranged from small (reasoning: 5.38) to moderate to high (working memory reaction time: 66.28; see Table 2), indicating that moderator analysis was appropriate in most of these cognitive domains. Indeed, given that the number of hours of SD in these studies ranged from 24 to 48, and that several types of cognitive tests made up these collective indices, it would have been surprising to discover low I2 values in this first-pass analysis. As a way to visualize this dispersion, Figures 1 and 2 display forest plots of the accuracy and reaction time measures for a sample domain (simple attention).
We coded three study variables to test their impact as moderators of the effect of SD. Circadian time was estimated by plotting the time of test administration as a sinusoidal function with a 24-hr period and a performance nadir at 0600 hr, with peak amplitude arbitrarily defined as 1. Circadian offset was computed by subtracting the time of test administration for sleep-deprived subjects from time of test administration of the control group. Homeostatic sleep pressure was estimated as the elapsed time between sleep offset and time of test administration. In cases where any of this information was not explicitly reported, or the testing time occurred over a range greater than 2 hr, we did not code these variables, and the effect size was excluded from the moderator analysis.
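The moderator coding described above can be illustrated with a short sketch. It assumes a negative cosine curve with a 24-hr period, a nadir of −1 at 0600 hr, and a peak of +1 at 1800 hr. It also treats circadian offset as a difference of curve values rather than of clock times, which is only one possible reading of the description. All function names are hypothetical.

```python
import math

def circadian_value(clock_hour, nadir_hour=6.0):
    """Sinusoid with a 24-hr period, amplitude 1, and a nadir at nadir_hour."""
    return -math.cos(2.0 * math.pi * (clock_hour - nadir_hour) / 24.0)

def circadian_offset(control_hour, deprived_hour):
    """Control-session value minus SD-session value (assumed interpretation)."""
    return circadian_value(control_hour) - circadian_value(deprived_hour)

def hours_awake(wake_hour, test_hour, days_elapsed=0):
    """Homeostatic sleep pressure as elapsed time since sleep offset."""
    return days_elapsed * 24.0 + test_hour - wake_hour

nadir = circadian_value(6.0)    # performance nadir at 0600 hr
peak = circadian_value(18.0)    # peak twelve hours later
# Woke at 0700 hr, tested at 0900 hr the following day.
pressure = hours_awake(7.0, 9.0, days_elapsed=1)
```

Under this coding, a control test at 1800 hr compared with an SD test at 0600 hr would carry the maximum possible circadian offset of 2.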
As there were insufficient cases to conduct separate metaregressions within each cognitive category, we combined all results for accuracy and reaction time effects, and conducted stepwise multiple regression within these two larger data sets, using the average effect size found for each cognitive domain as a covariate. For accuracy measures, the omnibus effect for the model was significant, R2 = .176, F(2, 97) = 10.39, p < .001, but only homeostatic sleep pressure was a significant predictor of study effect size, b = −0.22, t(98) = −2.43, p = .02. In contrast, the overall model for reaction time measures was not significant, indicating that none of the coded variables were a significant predictor of heterogeneity in this sample.
The results from our meta-analysis support the conclusions of previous reviews that short-term total SD has a significant deleterious effect across most cognitive domains. Our current study represents an advance over previous meta-analyses in several important respects. First, we were able to take into account the known Treatment × Subject interaction in experiments of SD (Van Dongen et al., 2004), thus improving the estimation of the sampling variance for each study. Second, we weighted each effect size on the basis of study quality, thus giving less influence to studies that may have been less well conducted. Third, we had more stringent inclusion criteria than Philibert (2005), which increased the homogeneity of our sample. Finally, and most important, we classified behavioral tests into finer grained cognitive domains than previous meta-analyses, further increasing the similarity of studies within each subsample.
Overall, average effect sizes appear to fall along a continuum, with tasks of greater complexity affected relatively less after total SD. The relative magnitude of effect sizes across cognitive domains was similar to those seen in the meta-analysis of Philibert (2005), although the absolute size of these effects was smaller across all categories. This is likely due to two reasons: We excluded all studies with a period of total SD greater than 48 hr, and we did not disattenuate effect sizes based on test–retest reliability of dependent measures.
The difference in the average effect size among the six cognitive domains was statistically significant and ranged from −0.125 to −0.762. As anticipated, the combined effect size for simple attention and vigilance tasks was the largest among all the categories studied. This finding is consistent with the notion that vigilance is the fundamental process affected by SD (Lim & Dinges, 2008) and the deficit for which compensation is least available. In contrast, average effect sizes for complex attention and working memory tests fell into the moderate range. Although this pattern of results has been observed in the literature, this is, to our knowledge, the first time that this difference has been systematically investigated in a large body of studies.
Several points of interest arise on inspection of the group effect sizes of the complex cognitive tasks (all categories other than simple attention). First, we note that task performance in the complex attention category is relatively spared when compared with simple attention. These data are compelling, as many of the complex attention tests differ from the simple attention tests in only a single cognitive process (e.g., two-choice reaction time vs. simple reaction time). This finding suggests that for tests of orienting or executive attention, performance is relatively preserved after SD either because of the greater salience of the bottom-up feed (and thus the reduced need for internally motivated top-down control) or because of the recruitment of additional mental operations. However, we also observe that complexity alone is an inadequate construct with which to identify tasks that may not be as affected by SD, as there were still substantial effect size differences among complex tasks in different domains. The nuances of these behavioral effects, as well as their neural correlates, should continue to be an interesting and fruitful area of study.
We failed to find significant effects in two of the categories tested. First, there was no effect of SD on accuracy measures in tests of reasoning and crystallized intelligence. Crystallized abilities (e.g., the retrieval of domain-specific knowledge) are thought to be highly stable over a range of cognitive states, and are even of use in assessing premorbid functioning following neurological insult or the onset of dementia (O’Carroll & Gilleard, 1986; Watt & O’Carroll, 1999). It is unsurprising, therefore, that outcomes on these tests are relatively unaffected by short-term SD.
Second, the average effect size of the change in accuracy measures for tests of processing speed failed (but only barely) to reach statistical significance. There are at least two potential explanations for this finding. Nearly all the tasks in the processing speed category were self-paced, as opposed to work paced, and several authors have commented on the differences between these two classes of tests. Williams et al. (1959) noted that a bias toward accurate responding is commonly found in complex, self-paced assignments, a conclusion reiterated by more recent researchers who have found speed but not accuracy effects on these tasks (e.g., De Gennaro, Ferrara, Curcio, & Bertini, 2001). Koslowsky and Babkoff (1992) also found a similar effect of work- versus self-paced tasks in their meta-analysis, although this increased effect size was seen only in studies with more than 48 hr of SD. A less common explanation of the relatively preserved accuracy on processing speed tasks relates to the nature of the operations being performed in them. These operations usually involve high levels of automaticity (e.g., decoding symbols in the Digit Symbol Substitution Test), and the fidelity of such overlearned skills is probably protected even during periods of fatigue, leading to the relatively small increase in the number of errors made.
An important feature of the current meta-analysis was the separate aggregation of accuracy and reaction time measures. Although there is some evidence that lapsing and lapse duration after SD are correlated in a test of simple reaction time (Lim & Dinges, 2008), there is no a priori reason to assume that this relationship should hold across all cognitive domains. This point is not intuitive and warrants further discussion. Figure 3 illustrates the curve representing the speed–accuracy trade-off in a typical cognitive test, as well as the downward shift of this curve following a period of SD. The unexplored factor in this relationship is whether SD also biases subjects toward faster or slower responding, as represented by a shift along the lower curve. For instance, increases in the number of commission errors or false alarms on simple reaction time tests after SD have been attributed to increased disinhibition (Dorrian et al., 2005), which can be thought of as a bias toward faster (and less accurate) responding.
As it turns out, the results of our analysis show remarkable agreement between accuracy and reaction time measures in each cognitive category: Overall, there was no significant effect when comparing accuracy and reaction time across the sample. This finding suggests that, on average, SD does not bias subjects toward either faster or more accurate responding, although this claim cannot be made of any individual cognitive test.
Of the three moderator variables studied, only hours awake (homeostatic sleep drive or sleep pressure) was a significant moderator of the effect of SD, and only for accuracy, not reaction time variables. Because of the nature of the coding in this study, we expected homeostatic sleep pressure to be a stronger predictor than circadian time or circadian offset, as there is considerable variability in endogenous circadian phase across individuals (Horne & Ostberg, 1976). Nevertheless, the results obtained in this analysis were surprising, as both circadian factors and homeostatic sleep drive are known to modulate cognitive performance (Mallis, Mejdal, Nguyen, & Dinges, 2004; Van Dongen & Dinges, 2005).
A likely explanation for this negative result is that much of the observed heterogeneity is due to the variety of cognitive tests in each sample. If this assertion is correct, it implies that the amount of impairment on tests that putatively assess the same cognitive domain may still differ considerably following SD. In other words, the validity of these tests in assessing the cognitive process may not be as high after SD. For example, total SD is known to exacerbate the time-on-task effect (Doran et al., 2001), suggesting that test length may be a confounding variable across tests of many cognitive processes. To obtain an objective standard of impairment, therefore, it may be necessary to establish norms on several of the most commonly used tests in each domain.
Although it would have been interesting to test the moderating effect of self-paced and work-paced paradigms in this analysis, these variables were highly confounded with cognitive domain (i.e., within each category, most or all tests tended to be either self-paced or work paced). From the data obtained in the main effects, however, we can infer that the differential effects of self-paced versus work-paced tests on accuracy and reaction time measures are unlikely to be significant as suggested in previous meta-analyses. Instead, it is possible that these effects are present only under certain conditions (e.g., extremely long periods of SD or for particular subsets of tests).
As stated in the introduction, the chief objective of this meta-analysis was not to rule out any particular theoretical model but to direct attention to which of these models may have the greatest importance in explaining the real-world consequences of total SD. Although total SD does produce statistically significant differences in most cognitive domains, the largest effects are seen in tests of simple, sustained attention. This form of attention is critical in many industries involving sustained operations, during which a worker’s primary task may involve long, monotonous periods of low-level monitoring and situational awareness. Moreover, relatively brief failures of vigilance may potentially lead to disastrous consequences. For example, lapses in sustained attention are the direct cause of SD-related motor vehicle accidents (Dinges, Mallis, Maislin, & Powell, 1998), in which an eyelid closure of 4 s is a sufficient amount of time for a driver to completely veer off a highway. We argue, therefore, that this cognitive module is of the greatest practical concern in combating SD-related problems in real-world situations.
A striking feature of this deficit in sustained attention is how rapidly large changes emerge. Although our analysis was restricted to subjects who had gone a single night without sleep, effect sizes were still large for both speed and accuracy measures on simple attention tasks. These findings support the data showing that deficits in sustained attention often presage the other observable cognitive effects of SD and may have considerable utility as an early warning system for imminent cognitive failure. This cognitive component should therefore be one of the primary targets of assessment for work fitness and a basis for decisions on whether subsequent countermeasures should be applied.
On the next rung of the hierarchy, we note that tests of working memory and other tests of executive attention are also robustly affected by one night of SD. Considerable research has been conducted over the past several decades to assess the effects of SD on decision making and its component subprocesses (e.g., response inhibition, updating strategies, assessing risk; Harrison & Horne, 2000), and our data suggest that further investigation into these problems is a worthwhile endeavor. Indeed, neuroimaging data on these tasks are affording us new insights into the neural processes underlying the observable behavioral changes (for a review, see Chee & Chuah, 2008) and suggesting possible neuropharmacological mechanisms through which we may intervene to ameliorate these problems in individuals who are most vulnerable to sleep loss (Chuah & Chee, 2008).
Finally, although tests of processing speed and cognitive throughput such as the Digit Symbol Substitution Test are commonly used in SD paradigms, the results of this analysis demonstrate that their effects are relatively small compared with those of other tests. Indeed, studies of partial SD have demonstrated little or no effect on cognitive throughput tasks (Casement, Broussard, Mullington, & Press, 2006; Dinges et al., 1997). The implication of this finding is that changes in processing speed may be theoretically interesting but not of great practical significance in explaining and predicting real-world cognitive failures (Monk, 2007).
This analysis contains a small number of limitations that may have affected the validity of the conclusions drawn. As we were able to obtain only a small amount of unpublished data, it is possible that there was a bias in the analysis toward effect sizes that reached statistical significance. Nevertheless, we received a 100% response rate from laboratories surveyed, and all but one of these investigators denied possessing any unpublished data that met our inclusion criteria. We are, therefore, relatively confident that the study was not greatly affected by publication bias.
Although every effort was made in this analysis to classify studies into appropriate and meaningful categories, it is clear that with the possible exception of simple attention, pure assays of most of the cognitive domains we have identified do not exist. Moreover, there remained numerous dissimilarities among the forms and characteristics of the tests within each category (e.g., task length, task demands), particularly within the category of complex attention. As discussed, this is the most likely reason why heterogeneity was in the moderate range for almost all categories studied. Despite these drawbacks, we propose that our taxonomy is a useful heuristic for several reasons. First, significant between-categories differences were found in the meta-analysis, suggesting that we have captured meaningful constructs with the classification we employed. Second, we have stayed faithful to categories that are well defined in the neuropsychological literature. In many cases, focal deficits on these tests have been observed in patients with specific pathologies or injuries (e.g., working memory in attention-deficit/hyperactivity disorder patients; Barkley, 1997). Finally, several of the domains studied here have relatively high external validity. For instance, the challenge in simple attention tasks is similar to the real-world demands on air traffic controllers, and tasks such as the Psychomotor Vigilance Test have been shown to correlate highly with other indicators of dangerous, drowsy driving (Dinges et al., 1998; Price et al., 2003).
We were not able to study a number of moderator effects that may be important predictors of the residual intradomain heterogeneity. Task duration is likely to be one of these factors, with longer tasks associated with greater effect sizes due to the presence of the time-on-task effect. We were unable to code this moderator chiefly because many articles did not report task length and because of the variability in time to completion for all tasks that were self-paced. As we have already mentioned, the difference between self-paced and work-paced tests was highly confounded with cognitive domain, making it unfeasible to test this as a moderator. Additionally, variables such as novelty and motivation (Jones & Harrison, 2001), though potentially important in affecting test outcomes, are not easily quantified.
Finally, a substantial number of studies entered into this meta-analysis reported only accuracy or reaction time as a dependent variable in their final published work. As a result, we could not conduct paired comparisons of these measures to assess their reliability. We encourage authors publishing in this field in the future to consider reporting both accuracy and reaction time measures where appropriate so that their relationship after SD can be better explored. We also suggest that, wherever possible, data from individual test bouts and not just omnibus F values for a series of bouts be reported, so as to enable the inclusion of more studies in future quantitative syntheses.
The results of this analysis have revealed the pattern of effects across cognitive domains and outcomes after a period of short-term total SD. Overall, there was a significant difference among cognitive domains, but not between speed and accuracy, suggesting that SD has differential effects on different cognitive processes but does not bias subjects toward either faster or more accurate responding in any of these domains. As some of the known key moderators of this effect did not explain the remaining between-studies variance, we infer that the remaining heterogeneity is due to intertest differences and that test characteristics can influence the level of performance in the sleep-deprived state even when they are ostensibly assessing the same cognitive domain.
Finally, our results indicate that simple attention is the cognitive domain most strongly affected by short-term SD. Although decrements in other cognitive modules such as decision-making and memory processes no doubt contribute to real-world errors and accidents, the results of this analysis argue that deficits in sustained attention may represent the most parsimonious explanation for these occurrences. Thus, in light of these and other data, we believe that countermeasures targeting this cognitive module may be the most efficient means of accident prevention in industries where SD poses a significant safety risk.
Julian Lim was supported by Air Force Office of Scientific Research Grant FA9550-05-1-0293 while conducting this analysis. David F. Dinges was supported through National Institutes of Health Grant NR004281 and by National Space Biomedical Research Institute (through NASA) Grant NCC 9-58-159. We wish to thank Oo Htaik for his assistance in coding study quality and moderator variables. Helpful advice was provided by Stephen Schueller, Christian Webb, and Alyson Zalta.
1. This is accounted for by the term a in Equation 6, where a = 2(1 − ρ)/r_YY and r_YY is the square root of the test–retest reliability. In cases where reliability information for a particular test was not available, we first searched the literature for tests that were highly similar to the one used, then as a last resort used the average reliability from tests within the respective cognitive domain. In all cases, separate reliability coefficients were located and used for accuracy and reaction time measures.
2. However, note that there is as yet no gold standard for incorporating study-quality information into pooled effect sizes. For the purposes of comparison, the supplemental materials table reports pooled effect sizes for each cognitive domain with and without these study-quality weights.
References marked with an asterisk indicate studies included in the meta-analysis that are discussed in the text. For a complete list, go to http://dx.doi.org/10.1037/a0018883.supp.