Objective methods for determining clinically relevant neurocognitive change are useful for clinicians and researchers, but the utility of such methods requires validation studies in order to assess their accuracy among target populations. We examined the generalizability of regression equations and reliable change indexes (RCI) derived from a healthy sample to two HIV-infected samples, one similar in demographic makeup to the normative group and the other dissimilar. Measures administered at baseline and follow-up included the Trail Making Test, Controlled Oral Word Association Test (COWAT), Grooved Pegboard, and Digit Span. Frequencies of decline, improvement, or stability were determined for each measure. Among the demographically similar clinical cohort, elevated rates of decline among more immunologically impaired participants were indicated by the simple regression method on measures of psychomotor speed and attention, while the RCI addressing practice effects (RCI-PE) indicated improvement on most measures regardless of immunostatus. Conversely, among the demographically dissimilar cohort, simple regression indicated high rates of decline across all measures, while the RCI-PE indicated elevated rates of decline on psychomotor and attention measures. Thus, the accuracy of the two methods examined for determining clinically significant change among HIV+ cohorts differs depending upon the cohort's similarity to the normative sample.
Detecting change in neurocognitive functioning is one of the primary roles of the neuropsychologist. This is often done without the benefit of baseline data; however, even when such information is available it is not always possible to determine the clinical significance of change in test scores. For example, changes in scores can be due to poor reliability of the measure, regression towards the mean, or intraindividual variations in behavior and motivation. However, even when these factors are considered, improved performance still may not represent clinically meaningful change. Perhaps the greatest obstacle in interpreting retest data is the effect that prior exposure to the test has on performance, commonly referred to as “practice effect.” This is most evident in settings in which changes in ability are expected and need to be quantified to some extent, such as neurorehabilitation, neurosurgery, or the assessment of neurodegenerative diseases. However, despite the relevance of detecting change to the practice of neuropsychology, relatively few studies exist that provide useful data for interpreting retest performance or that compare the various statistical methods for determining change.
There are a number of statistical approaches available for controlling for practice effects and thereby assisting in determining relevant change (for review, see Collie, Darby, Falleti, Silbert, & Maruff, 2002). These include the reliable change index (RCI) and its variants that address practice effects (RCI-PE; Chelune, Naugle, Luders, Sedlak, & Awad, 1993), simple regression, and multiple regression. The last two have the additional advantage of being able to correct for regression towards the mean. Additionally, multiple regression is capable of considering additional factors that might affect one's capacity to benefit from prior exposure to a measure, such as age, education, and test–retest interval. However, while multiple regression accounts for the variance contributed by these factors, that variance is generally small and is seen only on a few measures (Levine, Miller, Becker, Selnes, & Cohen, 2004; Salinsky, Storzbach, Dodrill, & Binder, 2001; Temkin, Heaton, Grant, & Dikmen, 1999). Therefore, the relative simplicity of the simple regression and RCI methods outweighs the small benefit of multiple regression in most instances.
Whatever technique one decides to use, the data must be derived from a control sample. Herein lies another area of debate: whether to use nonclinical or similar clinical samples in developing the RCI and regression formulas to be used with the sample of interest.
Perhaps the greatest contribution in neuropsychological test–retest literature comes from epilepsy researchers interested in the outcomes of temporal resection surgery. The trend among these researchers has been to use clinical samples from which to derive their data: that is, individuals with epilepsy who do not undergo surgical intervention. Earlier papers provided test–retest norms derived for a wide range of commonly used neuropsychological measures (Hermann et al., 1996; Sawrie, Chelune, Naugle, & Luders, 1996). More recently, others have reported such data for batteries such as the Wechsler Intelligence Scale for Children–Third Edition (WISC-III; Sherman et al., 2003), Wechsler Adult Intelligence Scale–Third Edition (WAIS-III; Basso, Carona, Lowery, & Axelrod, 2002), and Wechsler Memory Scale–Third Edition (WMS-III; Martin et al., 2002). Across these studies, the success of the data in detecting change as a result of surgery has been good, at least as an adjunctive method.
Others have used healthy cohorts to derive data that might be applied to clinical populations for the purpose of filtering out the effects of practice and detecting clinically relevant change (Dikmen, Heaton, Grant, & Temkin, 1999; Ivnik et al., 1999; Levine et al., 2004; Salinsky et al., 2001; Temkin et al., 1999). However, little information is available on whether data derived from healthy, nonclinical cohorts are applicable to clinical patients. Findings by Heaton et al. (2001) demonstrated that norms derived from a nonclinical group are not particularly useful for predicting retest performance among clinical populations. Those authors compared the efficacy of RCI-PE and simple and multiple regression in determining change in two groups of neurologically stable individuals (a nonclinical cross-validation group and a group diagnosed with schizophrenia) and two groups of neurologically unstable individuals (a sample recovering from traumatic brain injury and a group that had incurred brain insult between baseline and follow-up testing, called the “recovering TBI” and “new insult” group, respectively). Among the two stable groups, all three methods performed equally well with regard to specificity in classifying subjects as unchanged at retest. However, large standard errors obtained on some measures among the schizophrenic participants, with the result of an excessive number being classified as “changed” in both the positive and negative directions, led the authors to conclude that confidence intervals developed from the nonclinical sample may not be appropriate for such psychiatrically impaired individuals. Among the unstable groups, the three methods were generally equivalent with regard to sensitivity, or the ability to detect change when it did occur. The authors concluded that it would be wise to use norms derived from individuals who are similar in demographics and baseline performance to the population to whom the data will be applied.
More recently, this same group followed their suggestion when they examined the accuracy of modified RCI equations derived from a healthy sample in determining change on a neuropsychological test battery among a demographically similar HIV+ cohort (Woods et al., 2006). The equations demonstrated good specificity for the HIV+ participants after a one-year test–retest interval.
Recently, Levine et al. (2004) reported test–retest data and regression equations for eight neuropsychological measures that were derived from 1,047 healthy (HIV-negative) individuals. The sample consisted primarily of well-educated, Caucasian men. However, as the results of Heaton et al. (2001) suggest, the generalizability of data derived from that sample may not extend to clinical groups. In the current study, we examined the utility of the test–retest data derived from that sample in detecting change among two HIV-positive groups: one a demographically similar sample and the other demographically different. Individuals with more advanced disease (CD4+ T-cells < 200) were expected to show higher rates of clinically relevant decline on measures most sensitive to HIV-related neuropsychological deficits. We also expected that the data would be more accurate in predicting changed performance among individuals in the demographically similar group who were more immunocompromised, while the demographically dissimilar cohort would show unusually high rates of change.
Three groups comprised the total study sample. Two were drawn from the Multicenter AIDS Cohort Study (MACS) database. Briefly, the MACS is a multicenter epidemiological study of the natural history of HIV infection, conducted in four U.S. cities (Baltimore, Chicago, Pittsburgh, Los Angeles). Recruitment procedures have been described elsewhere (Kaslow et al., 1987; Miller et al., 1990). Participants were generally evaluated at semiannual intervals, although test–retest intervals varied in length (see Tables 1 and 3). The original healthy sample from which retest data were derived had a test–retest interval of between 4 and 24 months. Evaluations included physical examinations, HIV testing, structured clinical interviews, and neuropsychological testing. MACS participants were excluded from the analyses if they had a self-reported history of significant head trauma (loss of consciousness greater than 1 hour), self-reported history of learning disability, and/or substance abuse in the 6-month period prior to initial testing (cocaine, heroin, PCP, or methamphetamines).
The first group, from which the regression equations and RCI-PE data were derived, was a non-clinical sample drawn from a pool of 1,047 healthy, HIV-negative males from the MACS database. This sample was described in detail in Levine et al. (2004). Mean age and education for this group varied across neuropsychological measure and are listed in Table 1. The ethnic makeup was 87.7% Caucasian, 7.1% Black, 3.7% Hispanic, and the remaining 1.5% Pacific Islander, Native American, or Asian.
The second group was a demographically similar clinical (HIV-seropositive) cohort drawn from a pool of 955 MACS participants. Overall demographic characteristics of this group are listed in Table 2. There were no females in the MACS study and no one with current substance dependence as defined by the Diagnostic and Statistical Manual Of Mental Disorders (DSM-IV) criteria (American Psychiatric Association, 1994). A total of 57% met diagnostic criteria for acquired immunodeficiency syndrome (AIDS). Due to variation in the battery administered among study sites, the characteristics of the participants varied across neuropsychological measure (see Table 3). The overall ethnic makeup of this group was 81.6% Caucasian, 10.4% Black, 6.6% Hispanic, and the remaining 1.4% Pacific Islander, Native American, or Asian.
The third group consisted of 173 HIV-positive individuals who were participating in a 6-month, National Institute of Drug Abuse (NIDA)-funded study of antiretroviral medication adherence among drug users. Participants were recruited from the Los Angeles area through advertisements posted at university-affiliated infectious disease clinics as well as through community-based HIV/AIDS organizations. A total of 61 (65%) of the participants in this group met diagnostic criteria for AIDS. A total of 28 (16%) were female. A total of 55 (32%) of the participants were classified as currently drug dependent at the second visit according to DSM-IV criteria (American Psychiatric Association, 1994), as determined through the Psychiatric Research Interview for Substance and Mental Disorders (PRISM; Hasin et al., 1996). As with the MACS cohort, NIDA participants were excluded from the analyses if they had a history of learning disability or head trauma with loss of consciousness greater than 1 hour. Demographic characteristics of this group are listed in Table 4.
Both clinical cohorts were further divided into groups according to Centers for Disease Control (CDC) staging, based on CD4+ T-cell count. Thus, individuals in each cohort were classified as follows: Group 1 (CD4+ T-cells < 200), Group 2 (CD4+ T-cells = 200–499), Group 3 (CD4+ T-cells = 500+).
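The CD4-based grouping described above amounts to a simple threshold rule; a minimal sketch follows, with the cutoffs taken from the grouping stated in the text.

```python
def cdc_group(cd4_count: int) -> int:
    """Assign the study's CDC-based immunologic group from CD4+ T-cell count."""
    if cd4_count < 200:
        return 1   # Group 1: most immunologically compromised
    elif cd4_count < 500:
        return 2   # Group 2: CD4+ T-cells 200-499
    return 3       # Group 3: CD4+ T-cells 500+
```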
The following measures were used: Trail Making Test–Parts A and B (Army Individual Test Battery, 1944); Stroop Color/Word Interference Test (Kaplan adaptation of the Comalli administration procedure; Stroop, 1935; Comalli, Wapner, & Werner, 1962); Grooved Pegboard (Klove, 1963; Lafayette Instrument Company); Digit Span (from the Wechsler Adult Intelligence Scale–Revised; Wechsler, 1981); Symbol Digit Modalities Test (Smith, 1991); and Controlled Oral Word Association Test (COWAT; Benton & Hamsher, 1978).
Raw scores were used for all analyses. Derivation of the original regression equations and data for calculating RCIs are detailed in a previously published paper (Levine et al., 2004). Briefly, regression equations were created by entering scores from the healthy normative sample into a simple regression analysis, with baseline scores as the predictor and retest scores as the criterion. Thus, β-coefficients and residual standard deviations were derived from these former analyses for each measure. These can be found in Levine et al., 2004. In the current analyses, baseline scores for the clinical groups were used as the predictor variable, with the retest score as the dependent variable. The residual standard deviations from the healthy group were used to create confidence intervals (CIs) around the predicted scores. The equation is as follows:
z = (X2 − predicted X2) / residual SD

where: X2 − predicted X2 is the difference between the actual retest score and the predicted retest score, and residual SD is the residual standard deviation of the normative sample.
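The regression-based change computation described above can be sketched in code. The intercept, slope, and residual SD values below are hypothetical placeholders for illustration, not the published normative coefficients from Levine et al. (2004).

```python
def regression_change_z(baseline, retest, intercept, slope, residual_sd):
    """Standardized change score: observed retest score relative to the
    retest score predicted from baseline by the normative simple
    regression equation, scaled by the normative residual SD."""
    predicted_retest = intercept + slope * baseline
    return (retest - predicted_retest) / residual_sd

# Example with hypothetical normative coefficients:
z = regression_change_z(baseline=30, retest=28,
                        intercept=5.0, slope=0.8, residual_sd=4.0)
# predicted retest = 5.0 + 0.8 * 30 = 29.0, so z = (28 - 29) / 4 = -0.25
```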
Using the RCI-PE method, difference scores were derived from the normative sample in order to determine significance of change in performance. The RCI-PE (Chelune et al., 1993) is calculated by subtracting the mean change score (retest – baseline) of a healthy normative sample from the individual’s or group of interest’s mean change score. This is then divided by the standard deviation of the mean difference scores, or standard error of measurement of the difference (SDdiff), of the normative sample (data provided in Levine et al., 2004). The equation for calculating the RCI-PE is as follows:
RCI-PE = [(X2 − X1) − (M2 − M1)] / SDdiff

where: X2 − X1 is the difference between an individual's or research sample's baseline and retest performance; M2 − M1 is the difference between the normative sample's mean baseline and retest performances; and SDdiff is the standard deviation of the difference between baseline and retest scores of the normative sample.
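The RCI-PE computation can likewise be sketched directly from the definition above; the normative values used in the example are hypothetical, not those reported in Levine et al. (2004).

```python
def rci_pe(baseline, retest, norm_mean_change, sd_diff):
    """Practice-adjusted reliable change index (Chelune et al., 1993):
    the observed change score minus the normative mean change (the
    practice effect), scaled by the normative SD of difference scores."""
    return ((retest - baseline) - norm_mean_change) / sd_diff

# Example with hypothetical normative values:
score = rci_pe(baseline=40, retest=45, norm_mean_change=3.0, sd_diff=2.0)
# observed change of +5 vs. a normative practice gain of +3 gives RCI-PE = 1.0
```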
In the current study, a 90% CI was constructed around the RCI-PE and regression estimates. Therefore, among the nonclinical sample, 90% of the participants should have scores that fall between 1.645 and −1.645, with 5% being in the positive and 5% in the negative directions.
Using the confidence intervals described above, frequencies of declined, improved, and stable performances were determined for each measure among the three CDC-defined groups using both the regression equations and RCI-PE.
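The classification step can be sketched as follows, using the ±1.645 cutoffs of the 90% confidence interval. This sketch assumes the scores are scaled so that higher values indicate better performance; timed measures such as the Trail Making Test or Grooved Pegboard would need the sign interpretation reversed.

```python
from collections import Counter

def classify(z, cutoff=1.645):
    """Classify a standardized change score (regression z or RCI-PE)
    against a 90% confidence interval."""
    if z <= -cutoff:
        return "declined"
    if z >= cutoff:
        return "improved"
    return "stable"

# Tally frequencies of declined/improved/stable for a set of change scores:
z_scores = [-2.1, -0.3, 0.0, 1.7, 0.9]
freqs = Counter(classify(z) for z in z_scores)
```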
Baseline and retest scores are displayed in Table 5. As shown, there was ubiquitous improvement on all tests, reaching statistical significance in all instances. Frequencies of change on neuropsychological measures, as determined via the regression and RCI-PE, are presented in Table 6. Results of the RCI-PE equations indicate greater rates of improvement on Trail Making Test (Form A) and Grooved Pegboard (both hands). Further, this improvement appears to be similar across CDC groups, although slightly less for the most immunologically compromised (i.e., CD4+ T-cells < 200). Only on the COWAT did the RCI-PE equations indicate a decline rate of greater than 10%, and this was similar across CDC groups. Conversely, rates of decline and improvement consistently fell below 10% using the regression method, with the exception of Trail Making Test (Form B). On that measure, an expected higher rate of decline was seen among the more immunologically compromised. Note that there was a general incongruence in direction of change between the two methods, with regression showing somewhat higher rates of decline (5.7–11.3%) and RCI-PE indicating higher rates of improvement (3.9–16.1%).
Baseline and retest scores are shown in Table 5. As with the MACS cohort, all tests showed statistically significant change from baseline. However, in the NIDA sample, scores on both trials of the Grooved Pegboard and on Digit Span were poorer at retest. Rates of change for this sample are shown in Table 7. Elevated (>10%) rates of decline based on RCI-PE appeared on Trail Making Test (Form B) and on Grooved Pegboard (both hands). However, rates of decline did not appear to be associated with CDC group for any of these measures. Improvement rates were also elevated for Trail Making Test (Form B); however, once again there appeared to be no relation with CDC group. Rates of change based on regression were markedly higher than those based on RCI-PE, and exclusively in the direction of decline. This included a decline rate of 22.1% on Trail Making Test Form A, 35% on Form B, 40.2% on Grooved Pegboard dominant hand, 42.3% on nondominant hand, 19.1% on the COWAT, and 12.3% on Digit Span. As with the RCI-PE method, little association between decline rates and CDC group was apparent.
In this study, we assessed the generalizability of data derived from a healthy sample in detecting clinically relevant change among two groups: HIV-infected individuals who were demographically similar to the healthy sample and a group of HIV-infected individuals who were predominantly drug users and demographically dissimilar to the healthy sample. Development and validation of test–retest data is extremely valuable to neuropsychologists and other healthcare professionals as a means of assessing the meaningfulness of change over time. This study is among the few that addresses the generalizability of test–retest data.
When applied to a clinical cohort with similar demographic characteristics, the pattern of changes indicated by the regression equations was consistent with neurocognitive changes that one would expect in HIV-infected individuals—that is, increasing rates of decline on tests of psychomotor speed, working memory, and complex attention were associated with greater virologic compromise (Heaton et al., 1995). Conversely, the RCI-PE method indicated somewhat greater rates of improvement on these tasks, which was similar regardless of immunological status. Which method is more accurate is impossible to determine from our dataset, as no objective measure of change was established to use as a criterion. However, because the bulk of the data were collected prior to widespread use of highly active antiretroviral therapy (HAART), it would be expected that the sample was more prone to decline than improvement between baseline and retest. Thus, it appears that regression equations derived from a healthy normative sample do generalize well to a demographically similar clinical cohort, while RCI-PE is less accurate.
Conversely, among a clinical cohort that was quite different from the normative sample with regard to ethnicity, education, and other factors (e.g., substance abuse rates), it was the RCI-PE data that appeared to provide a more accurate assessment of change. Among this cohort, the regression equations predicted an unusually high rate of declined performance across all measures. This finding is somewhat understandable in the cases of the Grooved Pegboard and Digit Span, on which this cohort actually did perform worse as a group at retest. However, it is unclear why such high rates of decline were seen on the other measures. One possible reason for the apparent poor generalizability of the regression equations to the NIDA sample is that there was in fact a high rate of decline among that cohort. This was a cohort that consisted largely of cocaine and methamphetamine abusers, drugs that have been shown to have an additive impact upon cognitive functioning in those with HIV (Levine et al., 2006; Rippeth et al., 2004). Therefore, the higher rates of decline indicated by the regression equations may be an accurate reflection of actual advancing cognitive deficit over the course of the 6-month study. However, examination of the group mean test scores and standard deviations (see Table 5) suggests that significant decline may have occurred in a subset of individuals, but that the group as a whole generally improved or maintained initial levels of performance across tests. As pointed out by Dikmen et al. (1999), individuals with low initial scores demonstrate the greatest change at retest due to a number of factors, including practice effects and regression to their true ability level. However, the opposite direction of change was observed in this cohort, whose members performed worse as a group than did the other clinical cohort. In contrast to regression, when the RCI-PE data were applied to the NIDA cohort, rates of change were more modest.
Only on the Grooved Pegboard (both hands) and Trail Making Test–Form B were elevated rates of decline seen, albeit considerably more modest than those of the regression method. Finally, there appears to be little relationship between rates of decline on these measures and immunostatus.
The reason for the vastly different findings based on regression versus RCI-PE may lie in the inherent characteristics of the two methods. In our NIDA sample, regression-based change formulas indicated high rates of clinically significant decline on all measures, including those on which there was an overall group improvement. Authors of previous studies comparing various methods for determining clinically significant change reported that simple models, such as the regression method used here, may be appropriate only when used with individuals who have typical baseline performance and who are homogeneous demographically (Temkin et al., 1999). Clearly, our NIDA sample did not meet these criteria. Those authors also suggested that wider confidence intervals be used for those with poor baseline performance in order to increase specificity, and narrower confidence intervals for those with normal baseline performance in order to improve sensitivity. In many of our participants, lack of improvement at retest may have resulted in classification as “declined,” as their scores fell below the cutoff predicted by the residual term. Thus, wider confidence intervals may have been more appropriate for individuals who have atypical baseline scores. This inherent limitation of standard regression equations has been discussed in detail by Crawford and Howell (1998), who showed that regression equations derived from small-to-moderate sized samples (N < 100) tend to have confidence intervals that are too narrow to accurately classify individuals from a population of interest. For large samples, such as that used in the current study, this is not an issue. However, those authors showed that extreme scores in the population of interest can also result in erroneous change classification (i.e., declined or improved), as appears to be the case with our NIDA cohort.
Thus, inflated Type I error rates can be expected when regression equations obtained from a healthy sample are applied to a sample with widely varying scores, as demonstrated in the NIDA sample of our study. This was not seen when regression was applied to our MACS clinical sample, as scores tended to be more centrally distributed. Crawford and Howell (1998), and more recently Crawford and Garthwaite (2006), have proposed a more accurate, albeit complicated, method for determining regression formulas and confidence intervals from a healthy population that could lead to greater accuracy in determining change. A simple rule of thumb, based on previous studies and the current findings, is that simple regression may be the more appropriate method when the sample of interest is demographically similar to the normative sample and is relatively large (N > 100) or has somewhat homogeneous baseline scores. When the sample of interest is demographically dissimilar to the normative sample, RCI-PE appears to be the better method for determining clinically relevant change.
We acknowledge that the absolute amount of improvement or decline, as reflected by change in scores, was small across measures. For example, among the MACS group, performance on the COWAT improved by only a single word. However, this was statistically significant, presumably due to little variation in direction and degree of change across individuals in this group. This small degree of change was comparable to that of the normative sample (see Levine et al., 2004) across all measures, suggesting that it does indeed reflect more than random variation or regression to the mean. Such findings lend support for the validity of these neuropsychological tests in detecting true change, at least on a group level. The true utility of the RCI and regression methods in determining significance of change will continue to be elucidated in studies that employ a criterion measure, such as collateral ratings, neuroimaging, and clinician diagnoses that do not consider neuropsychological test performance.
Finally, it is interesting that RCI-PE, which does not correct for variability in baseline performance to the extent that regression does, resulted in seemingly more expected change rates among the NIDA sample. Even within the MACS sample, rates of decline based on regression were higher than those based on RCI-PE for Grooved Pegboard and Trail Making Tests. Again, whether this indicates lack of improvement due to true pathology among a subset of individuals or poor specificity of the regression-based formula is unclear. Further research, in which a concurrent measure of neurocognitive change is available (e.g., clinical rating or neurologic diagnosis), is needed to probe this issue.
The MACS is funded by the National Institute of Allergy and Infectious Diseases, with additional supplemental funding from the National Cancer Institute and the National Heart, Lung and Blood Institute. UO1-AI-35042, 5-MO1-RR-00722 (GCRC), UO1-AI-35043, UO1-AI-37984, UO1-AI-35039, UO1-AI-35040, UO1-AI-37613, UO1-AI-35041. This study was also supported in part by a grant from NIDA (RO1 DA13799) and data provided by Charles Hinkin, PhD.
Data in this manuscript were collected by the Multicenter AIDS Cohort Study (MACS) with centers (Principal Investigators) at The Johns Hopkins University Bloomberg School of Public Health (Joseph B. Margolick, Lisa Jacobson), Howard Brown Health Center and Northwestern University Medical School (John Phair), University of California, Los Angeles (Roger Detels), and University of Pittsburgh (Charles Rinaldo) (Website located at http://www.statepi.jhsph.edu/macs/macs.html).