Steven P. Broglio, PhD, ATC, contributed to conception and design; acquisition and analysis and interpretation of the data; and drafting, critical revision, and final approval of the article. Michael S. Ferrara, PhD, ATC; Stephen N. Macciocchi, PhD, ABPP; and Ted A. Baumgartner, PhD, contributed to conception and design; analysis and interpretation of the data; and drafting, critical revision, and final approval of the article. Ronald Elliott, MD, contributed to conception and design and drafting, critical revision, and final approval of the article.
Context: Computer-based neurocognitive assessment programs commonly are used to assist in concussion diagnosis and management. These tests have been adopted readily by many clinicians based on existing test-retest reliability data provided by test developers.
Objective: To examine the test-retest reliability of 3 commercially available computer-based neurocognitive assessments using clinically relevant time frames.
Design: Repeated-measures design.
Setting: Research laboratory.
Patients or Other Participants: 118 healthy student volunteers.
Main Outcome Measure(s): The participants completed the ImPACT, Concussion Sentinel, and Headminder Concussion Resolution Index tests on 3 days: baseline, day 45, and day 50. Each participant also completed the Green Memory and Concentration Test to evaluate effort. Intraclass correlation coefficients were calculated for all output scores generated by each computer program as an estimate of test-retest reliability.
Results: The intraclass correlation coefficient estimates from baseline to day 45 assessments ranged from .15 to .39 on the ImPACT, .23 to .65 on the Concussion Sentinel, and .15 to .66 on the Concussion Resolution Index. The intraclass correlation coefficient estimates from the day 45 to day 50 assessments ranged from .39 to .61 on the ImPACT, .39 to .66 on the Concussion Sentinel, and .03 to .66 on the Concussion Resolution Index. All participants demonstrated high levels of effort on all days of testing, according to Memory and Concentration Test interpretive guidelines.
Conclusions: Three contemporary computer-based concussion assessment programs evidenced low to moderate test-retest reliability coefficients. Our findings do not appear to be due to suboptimal effort or other factors related to poor test performance, because persons identified by individual programs as having poor baseline data were excluded from the analyses. The neurocognitive evaluation should continue to be part of a multifaceted concussion assessment program, with priority given to those scores showing the highest reliability.
Over the past 15 years, sport concussion has been identified as a substantial clinical concern for physicians and allied health professionals. Many health care professionals working in athletic settings are required to diagnose and manage the 1.6 to 3.8 million sport-related concussions occurring in the United States annually.1 Team or primary care physicians have the ultimate responsibility for injury diagnosis and return-to-play decisions, although athletic trainers and neuropsychologists regularly function as part of a multidisciplinary team caring for athletes with concussions. Diagnostic and management decisions are based on many factors, including symptom presentation, physical examinations, and specialized tests designed to detect deficits resulting from concussive injuries. Consequently, physicians and other health care professionals should be familiar with contemporary instruments used to supplement traditional clinical examinations.
For some time, neurocognitive tests have been used in clinical decision making after concussion. In the past, clinicians have administered these tests,2–5 but more recently, tests delivered through a computer platform have been developed and have been made available for use.6–11 The advantages of the computer assessments are believed to include ease of administration, rapid scoring, and increased test-retest reliability secondary to standardized administration and scoring.12 In addition, computerized test batteries are accessible to a wide range of clinicians, including athletic trainers. Although computer-based assessments potentially offer a number of advantages over traditional testing methods, several psychometric issues must be evaluated in order to assess their clinical utility.
Neurocognitive tests used in postconcussive clinical decision making must be valid and sensitive to the effects of concussive injuries. However, before sensitivity can be examined, each instrument's test-retest reliability must be established. Thus far, only a limited number of reliability studies have focused on the currently available computer-based sport concussion assessment platforms. In addition, those investigators have used arbitrary test-retest time intervals rather than test-retest intervals commonly observed when managing return to play after concussion. For instance, intraclass correlation coefficients (ICCs) for response speed, working memory, and learning on the CogSport (CogState Ltd, Victoria, Australia) have been reported to range from .69 to .82 when volunteers were tested twice over a 1-week period.11 A group6 examining the Headminder Concussion Resolution Index (CRI) observed reliabilities of .90 for the processing speed index, .73 for simple reaction time, and .72 for complex reaction time over a 2-week time frame. Finally, ImPACT reliability was examined in 49 high school and collegiate athletes using a baseline administration and a 14-day follow-up. Test-retest reliabilities (Pearson correlation coefficients) ranged from .54 on memory to .63 on reaction time and .76 on processing speed.13
These test-retest reliabilities appear to fall within the generally acceptable range needed for clinical interpretations, but the interval between assessments is shorter than the typical test-retest interval seen in sport concussion assessment and return-to-play management. For instance, in the first year of implementing a standardized concussion assessment protocol, our injury data suggested that the mean duration from baseline to the initial postconcussion evaluation was 45 days, with approximately 5 additional days before the athlete began a return-to-play protocol (unpublished data, University of Georgia, Department of Sports Medicine, 2005). One group14 reported a similar test-retest time in high school athletes, which is considerably longer than most test-retest time frames employed by investigators. The effect of short retest intervals on estimates of reliability is presently unknown.
When assessing test-retest reliability, a number of potentially confounding factors must be considered. One prominent factor is the level of effort exerted by the examinee. Suboptimal or variable effort would be expected to reduce test-retest reliability independent of the test's characteristics. In clinical populations, effort has been shown to be a determinant of test performance independent of brain-injury severity.15 Consequently, effort tests may identify participants and athletes who do not exert an earnest attempt during reliability studies, baseline testing, and postconcussion evaluations. Less than maximum effort would not only affect test-retest reliability but would seriously impair return-to-play decision making based on neurocognitive test performance. Although assessing effort seems clinically prudent, we were unable to find any researchers who examined effort in conjunction with test-retest reliability of computer-based concussion assessments or studies looking at return to play. In our study, we attempted to identify participants who may not have exerted optimal effort in order to provide the most empirically reasonable assessment of test-retest reliability. As such, test-retest reliability was examined under optimal conditions by excluding persons who displayed suboptimal effort during any of the examinations.
Consequently, our investigation was designed to examine the test-retest reliability of 3 commercially available, computer-based concussion assessment programs using clinically relevant time intervals, while simultaneously controlling for effort. Based on existing research, we hypothesized that all 3 computer-based assessment applications would yield acceptable test-retest reliability using pragmatic assessment intervals. We also hypothesized that impaired effort at any point in time would reduce reliability.
Student volunteers (n = 118) were recruited from the general university population. Sample size was estimated based on guidelines provided by Baumgartner and Chung16 for reliability studies using a 2-way analysis of variance model to estimate the ICC. Participants were excluded from the study if they reported that English was not their primary language or if they had been diagnosed with a learning disability or attention deficit disorder or had a diagnosed concussive injury within 6 months before or during the study.
Upon arrival at the testing facility, participants listened to a description of the testing procedures and then read and signed an informed consent approved by the institutional review board, which also approved the study. Each participant completed a brief questionnaire documenting demographics including height, mass, previously diagnosed concussions, and exclusion criteria. Participants then completed 3 commercial, computer-based concussion assessment programs, including ImPACT Concussion Management Software (version 4.5.729; ImPACT Applications, Pittsburgh, PA), Headminder Concussion Resolution Index (CRI) (Headminder Inc, New York, NY), and Concussion Sentinel (version 3.0; CogState Ltd, Victoria, Australia). The Memory and Concentration Test for Windows (MACT) (Green's Publishing Inc, Edmonton, Alberta) was administered to evaluate effort. All tests were administered according to the manufacturer's recommendations in a quiet laboratory setting to groups of fewer than 5 participants. Computer stations were positioned to minimize distractions from the environment and other test takers.
The CRI uses 6 tests to produce 5 index scores, including processing speed, simple reaction time, complex reaction time, and simple and complex reaction time errors. The ImPACT has 6 modules, which yield 5 index scores, including verbal memory, visual memory, visual motor speed, reaction time, and impulse control. The Concussion Sentinel uses 7 tests to develop 5 output scores, including reaction time, decision making, matching, attention, and working memory. The MACT is a computer-based effort assessment, which uses 4 modules to generate 5 output scores. Cutoff scores are recommended to establish acceptable levels of effort based on normative data.17 Total time to complete neurocognitive and effort testing for each participant was approximately 60 minutes.
Each participant was retested on each program approximately 45 days after the baseline evaluation (mean = 45.08 ± 1.56 days). The final assessment was administered approximately 5 days after that (mean = 5.56 ± 0.90 days). Testing on days 45 and 50 followed the same procedure as described above. Random assignment from all possible test administration orders was performed on each day of testing (ie, baseline, day 45, and day 50).
We calculated a 2-way random effects analysis of variance ICC (2,1) to estimate the test-retest reliability of each computer-based test variable for baseline to day 45 assessments and for day 45 to day 50 assessments.18–20 The calculation produces a value between zero and 1; values closer to 1.00 indicate less error variance and stronger reliability. Recommendations for ICC interpretation are diverse. Anastasi21 recommended .60 as the minimum acceptable ICC value. Portney and Watkins18 suggested that ICCs greater than .75 represent good reliability and ICCs less than .75 reflect moderate to poor reliability, depending on the magnitude. Randolph et al22 argued that the test-retest reliability must be greater than .90 to make decisions regarding an athlete's cognitive status after concussion. Level of effort was determined using the MACT manufacturer's guidelines.17 All data analyses were conducted using SPSS (version 13.0; SPSS Inc, Chicago, IL), and statistical significance was set at α = .05.
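The ICC (2,1) described above can be computed directly from the mean squares of the 2-way ANOVA decomposition. The following minimal sketch (in Python with hypothetical scores for 5 participants over 2 sessions; the study itself used SPSS) illustrates the calculation:

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): 2-way random-effects model, absolute agreement, single measure.

    scores is an (n participants x k sessions) matrix of test scores.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-participant means
    col_means = scores.mean(axis=0)   # per-session means

    # Mean squares from the 2-way ANOVA decomposition
    ms_r = k * np.sum((row_means - grand) ** 2) / (n - 1)   # participants (rows)
    ms_c = n * np.sum((col_means - grand) ** 2) / (k - 1)   # sessions (columns)
    ms_e = np.sum((scores - row_means[:, None] - col_means[None, :] + grand) ** 2) \
        / ((n - 1) * (k - 1))                               # residual

    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical baseline and day-45 scores for 5 participants
scores = np.array([[10.0, 11.0],
                   [12.0, 14.0],
                   [9.0, 9.5],
                   [15.0, 15.0],
                   [11.0, 13.0]])
print(round(icc_2_1(scores), 3))   # → 0.837
```

Because this form of the ICC reflects absolute agreement, both random fluctuation and systematic shifts between sessions lower the coefficient.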
Data from 73 participants were included in all analyses. Five participants dropped out of the study after the baseline assessment. All participants identified as having invalid baseline assessment data (n = 40) were excluded from all analyses: 29 invalid baseline assessments occurred on the ImPACT, 5 on the CRI, and 6 on the Concussion Sentinel. Baseline validity was determined using the published user's guidelines for the ImPACT23 and the automated validity features of the Concussion Sentinel and CRI. If a participant's baseline evaluation was identified as invalid on any single computer-based test, then all of his or her data were removed from the analyses. No participant was identified as having an invalid baseline evaluation on more than 1 test.
Participant demographics were as follows: age = 21.39 ± 2.78 years, height = 170.95 ± 9.00 cm, mass = 69.09 ± 15.07 kg, and total self-reported SAT scores = 1168.17 ± 99.76. Twelve participants (16.4%) reported having a history of diagnosed concussion, ranging from 1 to 5 injuries. No participants reported sustaining a concussion during the testing process or in the 6 months before participation.
The mean scores and SDs for each output variable listed by computer program are presented in Table 1. Our cohort performed slightly better on the neurocognitive measures than has been observed previously.24,25 The ICC values for each output score on the 3 computer-based concussion tests from baseline to day 45 and day 45 to day 50 are presented in Table 2. The Concussion Sentinel and CRI had similar test-retest reliabilities from baseline to day 45, whereas the Concussion Sentinel had the highest reliabilities for the day 45 to day 50 evaluations. Overall, the ICC values were somewhat higher from day 45 to day 50 compared with baseline to day 45. Based on the ICC interpretive guidelines previously described, test-retest reliabilities for all indexes on all 3 computer programs fell below the levels commonly recommended for making clinical decisions.
No participant sustained a concussion or any other health-related event during the study. Therefore, any decline in test performance large enough to be flagged as indicative of neurocognitive impairment could be considered a false-positive: a normal, healthy participant identified as impaired based on retest data. Automated features within each test identify when a significant change from baseline occurs on successive follow-up evaluations. False-positives for each output variable are presented in Table 3. Based on the significant-change analyses employed by each concussion assessment program, the percentages of participants with 1 or more false-positives on any variable on the day 45 assessment were ImPACT (38.40%, n = 28), CRI (19.20%, n = 14), and Concussion Sentinel (21.90%, n = 16). On day 50, the percentages of participants with false-positive results on 1 or more variables were ImPACT (34.20%, n = 25), CRI (23.30%, n = 17), and Concussion Sentinel (32.90%, n = 24).
Mean scores and SDs for the MACT are presented in Table 4. Using cutoff scores recommended by the developer, mean scores on the immediate recall, delayed recall, and consistency variables were greater than 85%, paired associates scores were greater than 70%, and free recall scores were greater than 50% for each day of testing.17 These findings imply that the cohort exhibited good effort on all days of testing. A review of individual subject data revealed no instances of poor effort on any day of testing.
Our study focused on test-retest reliability of the ImPACT, Concussion Sentinel, and CRI concussion assessment programs using a clinically relevant paradigm, while simultaneously controlling for suboptimal effort. Our reliability estimates contrast with those previously reported on all 3 instruments.6,11,13 The ICCs obtained in our study over a 45-day period were lower than those currently reported in the literature for computer-based tests. A previous reliability investigation of pencil-and-paper neurocognitive tests was conducted on a group of 48 athletes, with test administrations separated by 8 weeks. Using the Pearson r statistic, the author26 reported the test-retest correlations of the Digit Span, Digit Symbol, Symbol Search, Trail-Making Tests, Controlled Oral Word Association Test, and Hopkins Verbal Learning Test to range from .39 to .78. These values met or exceeded many of the reliability estimates calculated here. Clinicians must be selective, therefore, in choosing the neurocognitive evaluation that best suits their needs. Regardless of the instrumentation chosen, greater emphasis should be placed on those assessments showing the highest reliability.
Our ICC estimates varied between indexes and tests. The ICCs on several test indexes met or just exceeded the .60 level recommended as the minimal acceptable test-retest reliability for clinical decision making over a 45-day period.21 A greater number of indexes exceeded the lowest acceptable ICC level over a 5-day period. Regardless, no test index scores reached levels (>.75) typically considered acceptable for test-retest reliability.18,22 These results suggest that, in the absence of concussion, performance can fluctuate because of other factors, so postinjury testing may not accurately reflect neurocognitive deficits. Investigators have identified factors such as sleep deprivation and stimulant use,27 intense physical activity,28 and daily stressors29 as possible modifiers of cognitive test performance. Additional research investigating these phenomena is warranted.
The lower reliability coefficients documented in our study, when compared with those previously reported, may be related to the testing interval and/or the statistical methods implemented. We used empirically determined and clinically relevant intervals between assessments, similar to or shorter than those used by other investigators.14,30,31 Previous investigators6,11,13 of test reliability have implemented follow-up times that are unrealistic for clinical applications. Nonetheless, test-retest time frames are an important clinical issue, and stability of test scores over time merits more investigation because some athletes may sustain a concussion months to years after baseline testing.
Regarding other differences in methods, we used an ICC estimate of test-retest reliability based on a body of literature indicating that the Pearson correlation coefficient used by other investigators13 is not the appropriate statistic for evaluating test-retest reliability.19,20,32 The Pearson r statistic is a bivariate measure of the relationship between 2 variables: the r value indicates how one measure responds when a second measure changes. The ICC, in contrast, is a univariate estimate of the agreement between scores on the same test at 2 points in time. The Pearson r calculation is limited by its insensitivity to systematic changes in score means due to learning or practice effects, and it is also known to overestimate the correlation when sample sizes are limited.32
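The insensitivity of the Pearson r to systematic shifts is easy to demonstrate numerically: a uniform practice effect leaves r at a perfect 1.0, whereas an absolute-agreement ICC counts the shift as disagreement and falls. The sketch below uses hypothetical scores and a compact ICC(2,1) for 2 sessions:

```python
import numpy as np

def icc_2_1(x: np.ndarray, y: np.ndarray) -> float:
    """Minimal ICC(2,1) (absolute agreement, single measure) for 2 sessions."""
    s = np.stack([x, y], axis=1)
    n, k = s.shape
    g = s.mean()
    rm, cm = s.mean(axis=1), s.mean(axis=0)
    ms_r = k * np.sum((rm - g) ** 2) / (n - 1)
    ms_c = n * np.sum((cm - g) ** 2) / (k - 1)
    ms_e = np.sum((s - rm[:, None] - cm[None, :] + g) ** 2) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

baseline = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
practiced = baseline + 3.0   # every examinee improves by exactly 3 points

# Pearson r is blind to the uniform shift: it remains 1.0
print(np.corrcoef(baseline, practiced)[0, 1])

# The ICC treats the same shift as disagreement and drops well below 1
print(icc_2_1(baseline, practiced))   # ≈ 0.54
```

This is why an instrument can post a high Pearson r in a reliability study yet still misclassify examinees whose absolute scores shift systematically between sessions.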
According to MACT data, our participants put forth good effort on each day of testing.17 A decline from baseline to day 45 was observed on free recall, but the score remained above the recommended cutoff for suboptimal effort. Additionally, free recall and paired-associate learning performances were higher than those of the normal, healthy controls described in the test manual.17 Effort testing has been reported to explain approximately 50% of the variance in neurocognitive test performance in those with traumatic brain injury.15 This finding would indicate that subtle changes in performance are better explained by effort than by decrements in neurocognitive capabilities. Not only were MACT scores high, but our participants' performance on the SAT was better than the national average, and those identified by the computer programs as having invalid baseline data were excluded. Higher-than-average SAT scores may be a limitation of this study, because our participants may not accurately reflect the clinical population to whom these tests are often administered. However, the high SAT scores and the removal of data from those exhibiting poor baseline performance make suboptimal effort an unlikely cause of the low ICC values.
Despite the participants' high level of effort, 20% to 40% of the cohort was identified as being impaired on at least 1 variable during follow-up evaluations (Table 3). This contrasts with findings from an earlier group,33 who reported that only 7% to 9% of athletes were likely to be misclassified as concussed based on a decline in performance on pencil-and-paper tests. Statistical techniques embedded within the ImPACT and CRI (the Reliable Change Index) and Concussion Sentinel (within-subjects SD) are used to identify participants as showing large declines in performance compared with the baseline evaluation. In our study, no participant reported sustaining a concussion or any other serious injury during the testing period, making it difficult to understand the “reliable” negative changes indicated by the computer tests. False-positive findings on testing have obvious implications for clinical decision making and concussion management.
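The connection between reliability and the Reliable Change Index mentioned above can be made concrete. In the common Jacobson and Truax formulation (a general sketch; the exact variants and parameters embedded in each commercial program may differ), the observed change is divided by the standard error of the difference, which grows as test-retest reliability falls. All values below are hypothetical:

```python
import math

def reliable_change_index(baseline, followup, sd, r_xx):
    """Jacobson-Truax reliable change: observed change over the SE of the difference.

    sd: normative SD of the measure; r_xx: its test-retest reliability.
    """
    sem = sd * math.sqrt(1.0 - r_xx)    # standard error of measurement
    se_diff = math.sqrt(2.0) * sem      # standard error of a difference score
    return (followup - baseline) / se_diff

# Hypothetical 6-point decline on a measure with SD = 10: with lower reliability,
# the SE of the difference is larger, so the same decline yields a smaller |RCI|
print(reliable_change_index(50.0, 44.0, sd=10.0, r_xx=0.60))   # ≈ -0.67
print(reliable_change_index(50.0, 44.0, sd=10.0, r_xx=0.90))   # ≈ -1.34
```

Against the conventional |RCI| > 1.96 criterion, the same 6-point decline would be flagged in neither case here, but the example shows how the reliability estimate used by a program directly shapes which score changes it labels "reliable."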
To our knowledge, this study is the first to examine the test-retest reliability of multiple computer-based concussion assessment programs using clinically and empirically relevant assessment points while simultaneously controlling for effort. We observed lower reliabilities than were reported previously, despite optimal inclusion criteria. Reliabilities on some output scores fell within a minimally acceptable range, but no single test had uniformly acceptable reliabilities. Higher reliabilities have been reported by previous investigators using shorter test administration intervals.6,11 The effect of longer, clinically pragmatic testing intervals on test-retest reliability has not been fully elucidated.
In the final analysis, neurocognitive assessments have been shown to be sensitive to the consequences of concussion, and computerized testing has many practical advantages in athletic settings. Nonetheless, the psychometrics of computerized instruments continue to warrant investigation and modification when necessary. Athletic trainers and allied health professionals should be aware of the psychometric properties associated with the various instruments and how the clinical process of concussion diagnosis and management may be affected. Specifically, the lower-than-expected reliabilities reported here may influence test validity and may confound interpretation during the evaluative process. The substantial number of participants demonstrating a decline in test performance over time also raises the question about test specificity and the ability of these tests to accurately identify cognitive changes in individuals with concussions. Thus, clinicians using the instruments evaluated here should adopt a cautious and conservative approach to concussion management, with greater focus placed on those indexes producing higher reliability scores. Future researchers should address test-retest reliability, as well as test sensitivity and specificity, using clinically relevant intervals in specific populations.
Until the psychometric properties of these tests can be clarified, clinicians should use a battery of evaluative measures when assessing concussion.34,35 Findings from multiple assessment techniques, such as self-reported symptoms, postural control, and neurocognitive performance, should be incorporated into a concussion assessment protocol. No single assessment technique should be used to the exclusion of the others or the physical examination. Once the athlete returns to baseline on all measures, a return-to-play progression can begin with careful attention paid to symptom reoccurrence, both during and after exertional activities. Only when the athlete is free from symptoms at rest and exertion should a full return to participation be considered.36
This project was funded by the Louise E. Kindig Research Award and the Dissertation Completion Award from the University of Georgia. The funding organizations did not influence the design or conduct of the study; collection, management, analysis, interpretation of the data; or preparation, review, or approval of the manuscript.