Several statistical methods can assist the clinician in determining whether a reliable change has occurred across time. The formulas for these different methods are presented in Table . In the examples below, T1 = score at Time 1, T2 = score at Time 2, M1 = mean score of control group at Time 1, S1 = standard deviation of control group at Time 1, M2 = mean score of control group at Time 2, S2 = standard deviation of control group at Time 2, r12 = correlation between Time 1 and Time 2 scores in the control group (i.e., test–retest reliability). Additionally, for most of the examples below, we will use the following hypothetical scores (standard scores with M = 100 and SD = 15) and psychometric properties: T1 = 90, T2 = 80, M1 = 100, S1 = 15, M2 = 105, S2 = 20, and r12 = .85.
Standard Deviation Index
Whereas the simple discrepancy method might be the easiest change method to use, the Standard Deviation Index might be one of the most widely used among clinicians. In this method, the simple discrepancy score is divided by the standard deviation of the test score at Time 1. This yields a z-score, which can be compared with a normal distribution table to find out the statistical significance of that difference. Within the existing literature, a z-score beyond ±1.645 would typically be considered a “reliable change.” This ±1.645 demarcation point indicates that 90% of change scores will fall within this range in a normal distribution; by chance alone, only 5% of cases will fall below −1.645 and only 5% will fall above +1.645. One advantage of the Standard Deviation Index is that it is easy to calculate. It also provides a more precise estimate of relative change than the simple discrepancy score because it is tied to a specific z-score. Disadvantages associated with this method include: no control for test reliability, practice effects, or regression to the mean, and it is a one-size-fits-all approach. Additionally, as it puts change on a scale of standard deviation units from a single administration, it is quantifying change on an incorrect metric (as will be described with the following methods).
In our patient example, the Standard Deviation Index would be −0.67 (i.e., [80 − 90]/15). When compared with a normal distribution table, a z-score of −0.67 falls at approximately the 25th percentile. Since this falls well within the typical cutoff of ±1.645, a clinician would conclude “no change.” When one compares the simple discrepancy score (roughly 10th − 20th percentile) and the Standard Deviation Index (25th percentile), it is apparent that they are close, but not identical. Since the simple discrepancy score is tied to actual changes in some normative group, it is likely to be a more accurate reflection of change in the individual patient than the Standard Deviation Index, which is tied to psychometric properties of the test from a single administration (e.g., standard deviation at Time 1). However, in the absence of access to any better methods, the Standard Deviation Index is preferable to a clinician's best guess about change.
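The arithmetic in this example can be checked with a short script. The following is only a minimal sketch under stated assumptions: the function names `standard_deviation_index` and `percentile_rank` are illustrative (they do not come from any published calculator), and the percentile is read from a standard normal distribution, as the text does with a normal distribution table.

```python
from statistics import NormalDist

CUTOFF = 1.645  # two-tailed 90% band used in the text


def standard_deviation_index(t1, t2, s1):
    """Simple discrepancy (T2 - T1) divided by the Time 1 standard deviation."""
    return (t2 - t1) / s1


def percentile_rank(z):
    """Percentile rank of a z-score under the standard normal distribution."""
    return 100 * NormalDist().cdf(z)


# Hypothetical patient from the running example: T1 = 90, T2 = 80, S1 = 15.
z = standard_deviation_index(t1=90, t2=80, s1=15)
reliable_change = abs(z) > CUTOFF  # True only if z lies beyond +/-1.645

print(f"z = {z:.2f}, percentile = {percentile_rank(z):.0f}, "
      f"reliable change: {reliable_change}")
```

With the example values, the z-score rounds to −0.67 and sits near the 25th percentile, so the sketch reaches the same “no change” decision as the text.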
Reliable Change Index
First developed to determine if clinically meaningful change occurred as a result of psychotherapy (Jacobson & Truax, 1991), the RCI is a more sophisticated method for examining change. Similar to the Standard Deviation Index, it uses the simple discrepancy between the Time 1 and Time 2 scores as the numerator. But unlike the Standard Deviation Index, it uses the standard error of the difference (SED) in the denominator. In essence, the SED estimates the standard deviation of the difference scores (which is likely to be very different from the SD of Time 1 scores used in the Standard Deviation Index). Although the SED continues to include the standard deviation at Time 1, it also incorporates the reliability of the test (Table ). This makes the RCI a notable advancement over the prior two methods. Calculation of the RCI results in a z-score similar to the Standard Deviation Index, which needs to be compared with a normal distribution table. Advantages of the RCI include: a more precise estimate of relative change and control for the test's reliability. Disadvantages include: it does not correct for practice effects or variability in Time 2 scores and it remains a one-size-fits-all approach.
In the patient example, the RCI's numerator would also be −10 (i.e., 80 − 90). The RCI's denominator would be 8.22 (i.e., $\mathrm{SED} = \sqrt{2 \times 15^2 \times (1 - 0.85)} = \sqrt{67.5} \approx 8.22$). This would result in an RCI of −1.22 (i.e., −10/8.22). Compared with a normal distribution table, a z-score of −1.22 falls at approximately the 12th percentile. Since this falls within our typical cutoff of ±1.645, you would conclude “no change.” Despite finding “no change,” the RCI is a more accurate estimate of change than the other two methods, which is attributable to the additional error variance that is controlled for in the denominator of this method.
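The RCI calculation above can be sketched in a few lines. This is an illustration rather than a reference implementation: the function names are hypothetical, and the SED is assumed to be the square root of 2 × S1² × (1 − r12), which is the form that reproduces the 8.22 in the example.

```python
import math


def standard_error_of_difference(s1, r12):
    """SED based on the Time 1 standard deviation and test-retest reliability."""
    return math.sqrt(2 * s1 ** 2 * (1 - r12))


def reliable_change_index(t1, t2, s1, r12):
    """Simple discrepancy (T2 - T1) divided by the SED."""
    return (t2 - t1) / standard_error_of_difference(s1, r12)


# Running example: T1 = 90, T2 = 80, S1 = 15, r12 = .85.
sed = standard_error_of_difference(s1=15, r12=0.85)
rci = reliable_change_index(t1=90, t2=80, s1=15, r12=0.85)

print(f"SED = {sed:.2f}, RCI = {rci:.2f}")
```

Because the SED shrinks as reliability rises, the same 10-point drop yields a larger (more extreme) RCI on a more reliable test, which is the property that distinguishes this method from the Standard Deviation Index.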
RCI + practice effects
Although the RCI was a notable improvement in assessing change, it was designed for measures of psychological constructs (e.g., depression, anxiety). Cognitive measures, however, change differently than psychological measures. In particular, many cognitive measures show practice effects on repeat testing, which are not accounted for in the RCI method. Therefore, Chelune, Naugle, Luders, Sedlak, and Awad (1993) adjusted the RCI to control for practice effects (RCIPE). The numerator of RCIPE starts with the simple discrepancy score (i.e., Time 2 − Time 1). From this discrepancy score, the mean practice effect from some relevant group (which could be healthy controls or a clinical sample) is subtracted. This practice-adjusted discrepancy score is the numerator in RCIPE. In their original paper, Chelune and colleagues used the SED as the denominator. The resulting RCIPE is compared with a normal distribution table, and ±1.645 is also used as a cutoff point for considering a statistically significant change. In addition to being a more precise estimate of relative change and controlling for the test's reliability, the main advantage of RCIPE is that it controls for practice effects. One disadvantage of the RCIPE method is that the practice effects correction is uniform (i.e., it does not allow for differential practice effects). Additionally, it remains a one-size-fits-all approach and does not control for variability in Time 2 scores.
In our patient example, the numerator of our RCIPE would be −15 (i.e., (80 − 90) − (105 − 100)). The denominator would still be 8.22 (i.e., $\mathrm{SED} = \sqrt{2 \times 15^2 \times (1 - 0.85)} \approx 8.22$). The resulting RCIPE would be −1.83 (i.e., −15/8.22). Compared with a normal distribution table, a z-score of −1.83 falls at approximately the 4th percentile. Since this value falls beyond our typical cutoff of ±1.645, you could conclude that there had been a reliable and meaningful “change.”
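The practice-effect adjustment changes only the numerator, which a brief sketch makes explicit. The function name `practice_adjusted_rci` is hypothetical, and the mean practice effect is assumed to be M2 − M1 from the comparison group, as in the worked example.

```python
import math


def practice_adjusted_rci(t1, t2, m1, m2, s1, r12):
    """RCI with the group's mean practice effect removed from the numerator."""
    practice_effect = m2 - m1                  # mean gain in the comparison group
    numerator = (t2 - t1) - practice_effect    # practice-adjusted discrepancy
    sed = math.sqrt(2 * s1 ** 2 * (1 - r12))   # same SED as the unadjusted RCI
    return numerator / sed


# Running example: T1 = 90, T2 = 80, M1 = 100, M2 = 105, S1 = 15, r12 = .85.
rci_pe = practice_adjusted_rci(t1=90, t2=80, m1=100, m2=105, s1=15, r12=0.85)

print(f"RCI_PE = {rci_pe:.2f}, beyond cutoff: {abs(rci_pe) > 1.645}")
```

Setting M2 equal to M1 (no practice effect) reduces the function to the unadjusted RCI, which is a quick way to confirm that the two methods differ only by the uniform correction.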
Although the SED had been used for some time, Iverson (2001) observed that the variability in the Time 2 scores was not accounted for in existing formulas. He introduced an adapted SED that does incorporate Time 2's variability (SEDIverson), and this alternate calculation is now typically used as the denominator in RCIPE. In our patient example, the numerator remains −15. The denominator changes to 9.68 (i.e., $\mathrm{SED}_{Iverson} = \sqrt{(15\sqrt{1 - 0.85})^2 + (20\sqrt{1 - 0.85})^2} = \sqrt{93.75} \approx 9.68$), and the RCIPE is now −1.55 (approximately 6th percentile but “no change” according to ±1.645).
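To show how the Time 2 standard deviation enters the denominator, the sketch below recomputes the example with the adapted SED. The function names are illustrative assumptions; the formula follows the worked example, in which each standard deviation is multiplied by the square root of (1 − r12), squared, and summed.

```python
import math


def sed_iverson(s1, s2, r12):
    """SED that pools the standard errors of measurement at Time 1 and Time 2."""
    sem1 = s1 * math.sqrt(1 - r12)  # standard error of measurement, Time 1
    sem2 = s2 * math.sqrt(1 - r12)  # standard error of measurement, Time 2
    return math.sqrt(sem1 ** 2 + sem2 ** 2)


def practice_adjusted_rci_iverson(t1, t2, m1, m2, s1, s2, r12):
    """Practice-adjusted RCI using the adapted SED as the denominator."""
    numerator = (t2 - t1) - (m2 - m1)
    return numerator / sed_iverson(s1, s2, r12)


# Running example: S1 = 15, S2 = 20, r12 = .85.
sed = sed_iverson(s1=15, s2=20, r12=0.85)
rci_pe = practice_adjusted_rci_iverson(90, 80, 100, 105, 15, 20, 0.85)

print(f"SED_Iverson = {sed:.2f}, RCI_PE = {rci_pe:.2f}")
```

When S2 equals S1, this denominator equals the original SED, so the two versions diverge only to the extent that score variability differs between the two administrations.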
A few observations are probably necessary at this point. First, even though the previous methods might differ in the exact point at which this change score is located (e.g., 10th − 20th for simple discrepancy, 25th for standard deviation index, 12th for RCI, 4th for RCIPE, 6th for RCIPE with SEDIverson), they all consistently indicate some trend toward a decline in scores (i.e., all fall on the lower end of the distribution). Second, as more information is added to the equation, including test reliability, practice effects, and variability at Time 1 and Time 2, the estimate of change improves in accuracy. Third, the point at which we decide “change/no change” (i.e., ±1.645) is somewhat arbitrary, as many other factors must be considered when interpreting neuropsychological test scores. Lastly, all of the previous methods are constrained because they are unidimensional and rigid. This one-size-fits-all approach to assessing change does not account for differences in the individual patient (e.g., age, education, baseline level of performance, differential practice effects).
Regression-based change formulas
Developed around the same time (and by some of the same authors) as the RCIPE was a regression-based method for determining if meaningful cognitive change had occurred (McSweeny, Naugle, Chelune, & Luders, 1993). This method utilized multiple regression to predict a Time 2 score using the Time 1 score and other possibly relevant clinical information (e.g., age, education, retest interval). In the original McSweeny and colleagues paper, only the Time 1 score was a significant predictor of the Time 2 score (i.e., no other variables entered the equation), and we refer to these as “simple” standardized regression-based formulas (simple SRB). With this method, a predicted Time 2 score could be generated in $T_2' = bT_1 + c$, where $T_2'$ is the predicted Time 2 score, $b$ the β weight for Time 1 score (or regression slope), $T_1$ the Time 1 score, and $c$ the constant (or regression intercept). The predicted score could then be tested in $\mathrm{RCI}_{SRB} = (T_2 - T_2')/\mathrm{SEE}$, where SEE is the standard error of the estimate of the regression equation. The resulting RCISRB also needs to be compared with a normal distribution table, and ±1.645 is again used as a typical cutoff point for considering change. Unlike its predecessors, the SRB model does allow for other variables in the prediction of a Time 2 score. In the case of the simple SRB, Time 1 cognition is accounted for in the model. This may be important if the Time 1 score falls at one extreme or another (e.g., high Time 1 scores may show less improvement on retesting due to ceiling effects, low Time 1 scores may show less decline on retesting due to floor effects). Additionally, regression to the mean affects scores differently depending on their starting point (e.g., high Time 1 scores are more likely to regress downward, low Time 1 scores are more likely to regress upward). Other advantages of the simple SRB are that it provides a more precise estimate of relative change, it corrects for practice effects and retest reliability, and it corrects for variability in Time 2 scores. Furthermore, the SRB method can potentially incorporate additional clinically relevant variables (e.g., age, education, retest interval) into the prediction model, and we refer to this as the “complex” SRB approach. Although McSweeny and colleagues did not find that other variables significantly contributed to the prediction of Time 2, more recent studies have found that demographic variables and retest interval contribute small, but statistically significant, amounts of variance for certain cognitive measures. Disadvantages of the SRB approach have primarily centered on the fact that these formulas are complicated to calculate. Additionally, unless these formulas are already published, one would need access to an appropriate sample with test–retest data to generate the necessary regression analyses.
To continue with our patient example, we utilized the published simple SRB for the Repeatable Battery for the Assessment of Neuropsychological Status in older adults retested after 1 year (Duff et al., 2004). Using Table , the Time 2 Delayed Memory Index is best predicted by the Time 1 score on that same measure (i.e., 90) multiplied by the β coefficient (i.e., 0.71) plus the constant (i.e., 30.60), yielding a $T_2'$ of 94.5 (i.e., $T_2' = 0.71 \times 90 + 30.60 = 94.5$). This $T_2'$ is subtracted from the observed $T_2$ and divided by the SEE of the regression equation, to yield an RCISRB of −1.26 (i.e., $(80 - 94.5)/\mathrm{SEE}$). Compared with a normal distribution table, a z-score of −1.26 falls at approximately the 10th percentile. Since this falls within our typical cutoff of ±1.645, you would conclude “no change.” If other variables were included in the regression models, such as the Immediate Memory Index in Table , then this is a complex SRB (e.g., age and education add to the prediction of the Time 2 score).
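The simple SRB steps in this example can be sketched as follows. The function names are hypothetical. The slope (0.71) and constant (30.60) are the values quoted in the text, but the SEE is not stated there: the value 11.5 used below is an assumed placeholder, back-calculated so that the result lands near the reported −1.26, and the published table should be consulted for the actual figure.

```python
def predicted_time2(t1, slope, intercept):
    """Predicted Time 2 score from a simple (one-predictor) SRB equation."""
    return slope * t1 + intercept


def rci_srb(t2_observed, t2_predicted, see):
    """Observed minus predicted Time 2 score, scaled by the SEE."""
    return (t2_observed - t2_predicted) / see


# Slope and intercept as quoted in the text for the Delayed Memory Index.
t2_pred = predicted_time2(t1=90, slope=0.71, intercept=30.60)

# ASSUMPTION: the SEE is not given in the text; 11.5 is a placeholder only.
ASSUMED_SEE = 11.5
z = rci_srb(t2_observed=80, t2_predicted=t2_pred, see=ASSUMED_SEE)

print(f"predicted T2 = {t2_pred:.1f}, RCI_SRB = {z:.2f}")
```

A complex SRB would extend `predicted_time2` with further weighted predictors (e.g., age, education), while the final standardization step would stay the same.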
One criticism of the SRB approach is that you typically need access to the actual data of relevant samples to generate the regression analyses. However, two groups have demonstrated that the key elements of the RCISRB can be estimated from psychometric properties that are typically available in test manuals and published reports (Crawford & Garthwaite, 2007; Maassen, Bossema, & Brand, 2009). For example, with means and standard deviations at Time 1 and Time 2 from a relevant sample and the test–retest reliability coefficient, one can calculate a simple SRB and related RCISRB (Table ). Whereas the constant and β coefficient used to calculate $T_2'$ would normally be taken from the regression results, they can be estimated from the means and standard deviations at Time 1 and Time 2 for a relevant sample. Similarly, the SEE, which would normally be taken from the regression analyses, can be estimated from the standard deviations at Time 1 and Time 2 and the test's reliability. The final calculation of this estimated RCISRB, which we label RCISRBest, is similar to that coming directly from the regression analyses (i.e., $\mathrm{RCI}_{SRBest} = (T_2 - T_{2est}')/\mathrm{SEE}_{est}$).
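In our patient example, $T_{2est}'$ would be 91.67 (i.e., $b_{est} = S_2/S_1 = 20/15 \approx 1.33$; $c_{est} = M_2 - b_{est}M_1 = 105 - (20/15) \times 100 \approx -28.33$; $T_{2est}' = b_{est}T_1 + c_{est} = (20/15) \times 90 - 28.33 \approx 91.67$). The SEEest would be 9.68 (i.e., $\mathrm{SEE}_{est} = \sqrt{(S_1^2 + S_2^2)(1 - r_{12})} = \sqrt{(15^2 + 20^2)(1 - 0.85)} \approx 9.68$). The RCISRBest would be −1.21 (i.e., $(80 - 91.67)/9.68$). Compared with a normal distribution table, a z-score of −1.21 falls at approximately the 12th percentile. Since this falls within our typical cutoff of ±1.645, you would conclude “no change.”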
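Because the estimated SRB needs only summary statistics, it lends itself to a small helper. The sketch below follows the estimation steps in the worked example (slope from S2/S1, constant from the two means, SEE from both standard deviations and r12); the function name and the returned tuple layout are illustrative assumptions.

```python
import math


def estimated_srb(t1, t2, m1, m2, s1, s2, r12):
    """Estimated simple SRB from group means, SDs, and retest reliability.

    Returns (predicted Time 2 score, estimated SEE, estimated RCI).
    """
    b_est = s2 / s1                      # estimated slope
    c_est = m2 - b_est * m1              # estimated constant
    t2_pred = b_est * t1 + c_est         # estimated predicted Time 2 score
    see_est = math.sqrt((s1 ** 2 + s2 ** 2) * (1 - r12))
    return t2_pred, see_est, (t2 - t2_pred) / see_est


# Running example: T1 = 90, T2 = 80, M1 = 100, M2 = 105, S1 = 15, S2 = 20, r12 = .85.
t2_pred, see_est, z = estimated_srb(90, 80, 100, 105, 15, 20, 0.85)

print(f"predicted T2 = {t2_pred:.2f}, SEE_est = {see_est:.2f}, RCI = {z:.2f}")
```

Carrying unrounded intermediate values gives an index of about −1.2, which agrees with the −1.21 in the worked example to within rounding of the intermediate steps.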
There are additional variations on these different statistical methods for examining change. For example, Crawford and Garthwaite (2006) noted that an adjustment is needed to the denominator in SRBs to control for a new case. Additionally, RCIs have been calculated for entire batteries, not just individual measures (Woods, Childers, et al., 2006). Various debates have tried to refine these methods and identify instances when one is preferred to another (Hinton-Bayre, 2005; Maassen, Bossema, & Brand, 2006). This final debate is one worth briefly addressing: which change formula is best?
A number of authors have compared various RCI methods to determine their effectiveness in identifying change. Temkin, Heaton, Grant, and Dikmen (1999) compared four of these methods (RCI, RCIPE, simple SRB, and complex SRB) in a large sample of neurologically stable adults on five measures and two summary scores from the Halstead–Reitan Neuropsychological Test Battery. Results indicated that the original RCI was the poorest at identifying change, but that the other three methods were largely comparable. Two years later, Heaton and colleagues (2001) examined the RCIPE, simple SRB, and complex SRB in non-clinical and clinical samples on the same cognitive variables examined by Temkin and colleagues. Again, all three methods were found to be comparable, and it was noted that change models in normals might not apply to clinical cases. Frerichs and Tuokko (2005) compared the Standard Deviation Index, RCI, RCIPE, simple SRB, and complex SRB in a large cohort of cognitively normal seniors on four memory measures. Results found greatest agreement between the RCIPE, simple SRB, and complex SRB. Most recently, Maassen and colleagues (2009) evaluated the outcomes of the RCIPE, simple SRB, and their SRBest in simulated and real data on a variety of neuropsychological measures. These authors concluded that the simple SRB was the most liberal at identifying change, the SRBest was the most conservative, and the RCIPE fell between the other two. Overall, there seems to be some consensus that the RCIPE, simple SRB, and complex SRB are largely comparable in their ability to detect reliable and clinically meaningful change (Hinton-Bayre, 2010).
No matter which method is chosen by a clinician, there is a growing body of literature to test their applicability in clinical samples. Many of these methods were developed on patients with epilepsy, but they have been since applied to cases of Parkinson's disease, Multiple Sclerosis, dementia, MCI, traumatic brain injury, cancer, and human immunodeficiency virus. Table provides references for many of these relevant studies.
Selected citations for studies using RCIs and SRBs in clinical samples
The assessment of cognitive change in the individual patient will remain an important component of a neuropsychologist's job responsibilities in the future. Although this part of clinical neuropsychology has grown rapidly over the past 20 years, there is still much room for additional growth. Some important future directions include the following.
- Examining these methods in geriatric and pediatric samples. Although there is a wealth of existing data on reliable change in adult samples (both controls and clinical cases), there is a dearth of relevant information on those under 18 and over 65 years of age. These two opposite ends of the age spectrum have unique developmental and degenerative processes that may make adulthood change norms less applicable.
- Better coverage of methods in clinical samples. Although some clinical conditions have been better studied with RCIs and SRBs (e.g., epilepsy, Parkinson's disease), others are woefully under-represented (e.g., Multiple Sclerosis, dementia, traumatic brain injury, brain tumors). Presumably, these under-represented conditions are being seen for repeated neuropsychological evaluations, but clinicians are not compiling this data, calculating these change indexes, and/or publishing their findings. We implore them to do so.
- Who is the ideal comparison group? When evaluating a patient with a traumatic brain injury for a repeat evaluation, is it best to compare his/her change to cognitively healthy controls? Or should his/her performance be compared with others with similar traumatic brain injuries? As noted earlier, both types of comparisons likely yield valuable information. However, Heaton and colleagues (2001) opined that “normal” change might not be applicable in clinical cases. To our knowledge, no one has empirically evaluated this assumption. If Heaton is correct, then it is even more critical that we increase our research efforts on determining what amount of change is expected in various disease states.
- Should raw scores be used to determine reliable change? Or corrected scores? In their original paper on SRBs, McSweeny and colleagues (1993) actually used a mix of raw and corrected scores in their analyses of change on the Wechsler Memory Scale-Revised and the WAIS-Revised in patients with epilepsy. Their argument for using raw scores with the Wechsler Memory Scale-Revised was that it led to a better fit of the data, and their argument for using corrected scores with the WAIS-Revised was that the age-corrected IQ scores would be more understandable to their audience. Regardless of one's arguments/choices, a consumer of RCIs and SRBs should always use the same metric that was used in the relevant publication. For example, if I want to use McSweeny's SRBs for the Wechsler Memory Scale-Revised, then I need to be using raw scores too. However, there is no literature to guide us on which is actually best when developing these change models.
- Expanding the methodology beyond specific cognitive tests. The vast majority of RCIs and SRBs are developed for individual neuropsychological test scores. However, future RCI and SRB studies might employ a battery-wise approach, as done by Woods, Childers, et al. (2006). Additionally, and perhaps more widely applicable, would be a shift to domain-specific RCIs and SRBs. Duff, Beglinger, Moser, and Paulsen (2010) examined whether SRBs could be generated that predicted Time 2 scores on one test from Time 1 scores on a different test from the same cognitive domain (e.g., predicting Time 2 scores on Delayed Recall of Hopkins Verbal Learning Test-Revised from the Time 1 score on List Recall of the Repeatable Battery for the Assessment of Neuropsychological Status). Although the results were promising (e.g., domain-specific SRBs were comparable with test-specific SRBs), these results need to be validated and expanded. Furthermore, RCIs and SRBs could be generated for psychiatric and functional scales, MRI volumes, or other relevant outcome measures when evaluating changes in neuropsychological status.
- Handling more than two testing sessions. Nearly all studies of cognitive change have examined two time points, but we are increasingly seeing patients who are being evaluated a third or fourth time. Can you use the same RCIs and SRBs to compare changes between Times 2 and 3 that you used to compare Times 1 and 2? Probably not, but there are only a few studies that have provided initial evidence of how cognitive changes vary with multiple assessments (Attix et al., 2009; Duff, Schoenberg, et al., 2008). Other statistical methods (e.g., latent growth curve modeling) may be more appropriate for these complex trajectories.
- Refining methods. Although neuropsychologists have multiple methods at their disposal to assess change, the variables that go into these equations have not been successful in capturing all of the variance associated with true change. For example, Martin and colleagues (2002) developed SRBs for the WAIS-III and the Wechsler Memory Scale-III in a sample of non-operated epilepsy patients, and the resulting equations captured 31%–92% of the variance, even though baseline test score, age, gender, and seizure information were included as predictor variables. And these results reflect better-than-average SRBs. Therefore, we need to identify additional variables that might increase the captured variance in change models, perhaps including quality of education, premorbid intellect, medical and psychiatric information, occupational status, and performance in other cognitive domains.
- Overcoming obstacles for implementation in clinical practice. One potential reason for underutilization of change formulas by clinicians (and researchers) is that these formulas are cumbersome to calculate. Following the lead of Dr. Crawford (see http://www.abdn.ac.uk/~psy086/dept/psychom.htm), we have become advocates for providing interested readers with change score calculators (e.g., Microsoft Excel spreadsheets) of our relevant work in this area. Interested readers can contact the first author for an example of one such calculator. We also strongly encourage other authors to follow this model.
- How should reliable change be addressed in forensic cases? Besides clinical cases, another venue where repeated assessment is common is in forensic evaluations. In one extreme example, an individual in a personal injury case was tested by two different neuropsychologists on two successive days (Putnam, Adams, & Schneider, 1992). Although both evaluations produced comparable opinions, notable practice effects were observed across several measures, which could affect data interpretation. In another example, O'Mahar and colleagues (in press) recently reported that the 1-year test–retest stability of the Effort Index of the Repeatable Battery for the Assessment of Neuropsychological Status was relatively low (e.g., r = .32–.36) in two samples of geriatric patients. The reliability and reliable change observed on other effort measures have been notably understudied. In general, neuropsychologists should attempt to inform the courts about the potential complications of repeated evaluations and interpret their data accordingly (Heilbronner et al., 2010). However, more guidance and empirical data are clearly needed to assist neuropsychologists in forensic cases with repeated assessments.
- Is ±1.645 the best cutoff for determining change? Although this demarcation point was originally chosen because of its parallel with traditional parametric statistical testing, there is little (if any) data to support it as the best cut-point for assessing change. Improvements of +1.53 or declines of −1.18 still tell us something about change, even though they fall within the “no change” range.
- What is true change? Despite RCI scores, there are probably real-life events that also indicate change. When a patient with a traumatic brain injury can return to work, then change has probably occurred. When a slowly dementing patient can no longer live alone, change has occurred. When seizures become so disruptive that surgery is sought, change has occurred. When a child with Attention Deficit Hyperactivity Disorder shows improving grades in school while taking a stimulant medication, change has occurred. Although we currently track change with test scores, we probably need to be examining how our test scores track with real-life indicators of change.
In conclusion, repeated assessment is a relatively common occurrence in clinical neuropsychology that carries distinct benefits and unique challenges. Neuropsychologists have a variety of choices to make, both methodologically and statistically, when trying to determine if significant, reliable, and meaningful change has occurred. Despite the growing popularity of serial assessments and the expanding literature in this area, there is a need for more empirical studies to address several important but unanswered questions. We encourage those with relevant data to publish their findings to further inform the field.