In this study, we assessed the generalizability of data derived from a healthy sample in detecting clinically relevant change in two groups: HIV-infected individuals who were demographically similar to the healthy sample, and a group of HIV-infected individuals who were predominantly drug users and demographically dissimilar to the healthy sample. The development and validation of test–retest data are extremely valuable to neuropsychologists and other healthcare professionals as a means of assessing the meaningfulness of change over time. This study is among the few that address the generalizability of test–retest data.

When applied to a clinical cohort with similar demographic characteristics, the pattern of changes indicated by the regression equations was consistent with the neurocognitive changes that one would expect in HIV-infected individuals—that is, increasing rates of decline on tests of psychomotor speed, working memory, and complex attention were associated with greater virologic compromise (Heaton et al., 1995). Conversely, the RCI-PE method indicated somewhat greater rates of improvement on these tasks, a pattern that held regardless of immunological status. Which method is more accurate is impossible to determine from our dataset, as no objective measure of change was established to use as a criterion. However, because the bulk of the data were collected prior to widespread use of highly active antiretroviral therapy (HAART), the sample would be expected to have been more prone to decline than to improvement between baseline and retest. Thus, it appears that regression equations derived from a healthy normative sample do generalize well to a demographically similar clinical cohort, while RCI-PE is less accurate.
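The mechanical difference between the two methods may help make this contrast concrete. A minimal sketch is given below; the formulas follow the standard RCI-with-practice-effect and simple regression-based-change approaches, but all parameter values are invented for illustration and are not drawn from the study's normative data:

```python
def rci_pe(x1, x2, mean_practice, sd_diff):
    """Reliable Change Index adjusted for practice effects:
    the raw change score minus the normative mean practice effect,
    divided by the normative SD of difference scores."""
    return (x2 - x1 - mean_practice) / sd_diff

def regression_change(x1, x2, intercept, slope, see):
    """Simple regression-based change: predicted retest score is
    intercept + slope * baseline; the standardized residual is the
    observed minus predicted retest score over the standard error
    of estimate (SEE)."""
    predicted = intercept + slope * x1
    return (x2 - predicted) / see

# Illustrative (made-up) normative parameters, not values from the study:
z_rci = rci_pe(x1=45, x2=44, mean_practice=2.0, sd_diff=3.0)
z_reg = regression_change(x1=45, x2=44, intercept=5.0, slope=0.95, see=3.0)
# Under either method, z below a cutoff such as -1.645 (90% interval)
# would typically be classified as clinically significant decline.
```

Because RCI-PE subtracts a single group-level practice effect while the regression method conditions the expected retest score on each individual's baseline, the two can classify the same raw change score differently, as seen in the present findings.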

Conversely, among a clinical cohort that was quite different from the normative sample with regard to ethnicity, education, and other factors (e.g., substance abuse rates), it was the RCI-PE data that appeared to provide a more accurate assessment of change. Among this cohort, the regression equations predicted an unusually high rate of declined performance across all measures. This finding is somewhat understandable in the cases of the Grooved Pegboard and Digit Span, on which this cohort did in fact perform worse as a group at retest. However, it is unclear why such high rates of decline were seen on the other measures. One possible explanation for the apparent poor generalizability of the regression equations to the NIDA sample is that there was in fact a high rate of decline in that cohort. This cohort consisted largely of cocaine and methamphetamine abusers, and these drugs have been shown to have an additive impact upon cognitive functioning in those with HIV (Levine et al., 2006; Rippeth et al., 2004). Therefore, the higher rates of decline indicated by the regression equations may be an accurate reflection of advancing cognitive deficit over the course of the 6-month study. However, examination of the group mean test scores and standard deviations (see ) suggests that significant decline may have occurred in a subset of individuals, but that the group as a whole generally improved or maintained initial levels of performance across tests. As pointed out by Dikmen et al. (1999), individuals with low initial scores demonstrate the greatest change at retest due to a number of factors, including practice effects and regression toward their true ability level. However, the opposite direction of change was observed in this cohort, whose members performed worse as a group than did the other clinical cohort. In contrast to the regression results, when the RCI-PE data were applied to the NIDA cohort, rates of change were more modest. Elevated rates of decline were seen only on the Grooved Pegboard (both hands) and the Trail Making Test–Form B, and these were considerably smaller than those indicated by the regression method. Finally, there appeared to be little relationship between rates of decline on these measures and immunostatus.

The reason for the vastly different findings based on regression versus RCI-PE may lie in the inherent characteristics of the two methods. In our NIDA sample, regression-based change formulas indicated high rates of clinically significant decline on all measures, including those on which there was an overall group improvement. Authors of previous studies comparing various methods for determining clinically significant change reported that simple models, such as the regression method used here, may be appropriate only when used with individuals who have typical baseline performance and who are demographically homogeneous (Temkin et al., 1999). Clearly, our NIDA sample did not meet these criteria. Those authors also suggested that wider confidence intervals be used for those with poor baseline performance in order to increase specificity, and narrower confidence intervals for those with normal baseline performance in order to improve sensitivity. In many of our participants, a lack of improvement at retest may have resulted in classification as “declined,” as their scores fell below the cutoff predicted by the residual term. Thus, wider confidence intervals may have been more appropriate for individuals with atypical baseline scores. This inherent limitation of standard regression equations has been discussed in detail by Crawford and Howell (1998), who showed that regression equations derived from small-to-moderate sized samples (*N* < 100) tend to have confidence intervals that are too narrow to accurately classify individuals from a population of interest. For large samples, such as that used in the current study, this is not an issue. However, those authors also showed that extreme scores in the population of interest can result in erroneous change classification (i.e., declined or improved), as appears to have been the case with our NIDA cohort. Thus, inflated Type I error rates can be expected when regression equations obtained from a healthy sample are applied to a sample with widely varying scores, as in the NIDA sample of our study. This was not seen when regression was applied to our MACS clinical sample, as scores tended to be more centrally distributed.
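The narrow-interval problem can be illustrated with the standard error of prediction for a single new case, in the style of Crawford and Howell (1998): the prediction interval widens as the baseline score moves away from the normative mean and as the normative sample shrinks. All numeric values below are invented for illustration, and the critical value of 1.96 is a normal-distribution approximation to the appropriate *t* value:

```python
import math

def prediction_interval_halfwidth(see, n, x0, x_mean, x_sd, crit=1.96):
    """Half-width of the prediction interval for one new individual's
    retest score, given a normative regression with standard error of
    estimate `see`, normative sample size `n`, mean `x_mean`, and SD
    `x_sd` of baseline scores. The (x0 - x_mean)^2 term widens the
    interval for baseline scores far from the normative mean."""
    se_pred = see * math.sqrt(
        1.0 + 1.0 / n + (x0 - x_mean) ** 2 / ((n - 1) * x_sd ** 2)
    )
    return crit * se_pred

# Illustrative values: SEE = 3, normative n = 60, mean = 50, SD = 10.
central = prediction_interval_halfwidth(3.0, 60, x0=50, x_mean=50, x_sd=10)
extreme = prediction_interval_halfwidth(3.0, 60, x0=20, x_mean=50, x_sd=10)
# extreme > central: an individual 3 SDs below the normative mean needs
# a wider interval, so applying a fixed-width cutoff derived from typical
# scorers over-classifies change at the extremes.
```

A fixed-width cutoff ignores the baseline-dependent term, which is one way the inflated Type I error rates described above can arise in a sample with widely varying scores.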

Crawford and Howell (1998), and more recently Crawford and Garthwaite (2006), have proposed a more accurate, albeit complicated, method for determining regression formulas and confidence intervals from a healthy population that could lead to greater accuracy in determining change. A simple rule of thumb, based on previous studies and the current findings, is that simple regression may be the more appropriate method when the sample of interest is demographically similar to the normative sample and is relatively large (*N* > 100) or has somewhat homogeneous baseline scores. When the sample of interest is demographically dissimilar to the normative sample, RCI-PE appears to be the better method for determining clinically relevant change.
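This rule of thumb can be stated as a simple decision procedure. The function below is only a restatement of the heuristic above, not a validated algorithm, and its parameter names are our own:

```python
def pick_change_method(demographically_similar, n_sample, homogeneous_baseline):
    """Heuristic from the rule of thumb above: prefer simple regression
    when the sample of interest resembles the normative sample and is
    large (N > 100) or has homogeneous baseline scores; otherwise RCI-PE."""
    if demographically_similar and (n_sample > 100 or homogeneous_baseline):
        return "regression"
    return "RCI-PE"
```

For example, a demographically similar cohort of 150 would fall under the regression method, while a demographically dissimilar cohort of any size would fall under RCI-PE.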

We acknowledge that the absolute amount of improvement or decline, as reflected by change in scores, was small across measures. For example, among the MACS group, performance on the COWAT improved by only a single word. However, this change was statistically significant, presumably because there was little variation in the direction and degree of change across individuals in this group. This small degree of change was comparable to that of the normative sample (see Levine et al., 2004) across all measures, suggesting that it does indeed reflect more than random variation or regression to the mean. Such findings lend support to the validity of these neuropsychological tests in detecting true change, at least at the group level. The true utility of the RCI and regression methods in determining the significance of change will continue to be elucidated by studies that employ a criterion measure, such as collateral ratings, neuroimaging, or clinician diagnoses made without reference to neuropsychological test performance.

Finally, it is interesting that RCI-PE, which does not correct for variability in baseline performance to the extent that regression does, produced change rates closer to expectation among the NIDA sample. Even within the MACS sample, rates of decline based on regression were higher than those based on RCI-PE for the Grooved Pegboard and Trail Making Tests. Again, whether this reflects a lack of improvement due to true pathology in a subset of individuals or poor specificity of the regression-based formula is unclear. Further research, in which a concurrent measure of neurocognitive change is available (e.g., clinical rating or neurologic diagnosis), is needed to probe this issue.