The present study investigated different methods of equating AVLT word list versions in longitudinal aging research. We adapted accepted test equating methods in a novel way for the study of longitudinal cognitive aging. These methods are broadly applicable to within- and between-group comparisons of test performance data in both research and clinical settings. Equipercentile equating uses observed percentiles of a distribution and is a more generalizable nonparametric transformation than linear equating, which assumes normally distributed variables whose distributions are fully characterized by a mean and standard deviation. Graphical displays clearly show that equipercentile equating accommodates tests that are more difficult than the reference test at different percentiles of performance, and models of within-person change show that it also satisfactorily adjusts for practice, or retest, effects. Importantly, an implicit assumption of mean, linear, and equipercentile equating is that the populations producing two sets of scores, whether they are the same people followed over time or two different groups, have the same underlying ability. Because this may not be a valid assumption for older adults followed for years, the present study described equating procedures that used age standardization to preserve aging effects, propensity weighting to adjust for attrition, and restriction to preserve group differences due to diagnostic and intervention group membership.
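The percentile-matching idea behind equipercentile equating can be made concrete in code. Below is a minimal, unsmoothed sketch on simulated scores; the function and variable names are ours, and published implementations typically presmooth the score distributions before mapping percentiles:

```python
import numpy as np

def equipercentile_equate(new_scores, new_form, ref_form):
    """Map scores on a new form onto the reference form's scale by
    matching percentile ranks (no presmoothing; illustration only)."""
    new_form = np.sort(np.asarray(new_form, dtype=float))
    ref_form = np.asarray(ref_form, dtype=float)
    # Percentile rank of each score within the new form's distribution
    ranks = np.searchsorted(new_form, new_scores, side="right") / len(new_form)
    # Score at the same percentile of the reference form's distribution
    return np.quantile(ref_form, np.clip(ranks, 0.0, 1.0))

# Simulated example: the alternate form is harder, so raw scores run
# about 5 points below the reference form at the same ability level.
rng = np.random.default_rng(0)
ref = rng.normal(50, 10, 5000)   # reference form scores
new = rng.normal(45, 10, 5000)   # harder alternate form scores
print(equipercentile_equate([45.0], new, ref))   # close to 50
```

Because the transformation works percentile by percentile, it can correct a form that is harder at the low end but comparable at the high end, which a single mean-and-slope adjustment cannot.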
ACTIVE is the largest study of cognitive training among older adults to date, and ADNI is a $60 million public-private partnership designed to stimulate innovative methods for evaluating progression of AD in clinical trials. The roller-coaster trajectory in ACTIVE and the waves in ADNI are attributable to nonequivalent AVLT forms used at different study visits. The ACTIVE study cycled through four versions of the AVLT until repeating the baseline list at the third annual follow-up. ADNI alternated between two AVLT lists, which explains the wave-like pattern. These method artifacts may be present in other settings. Indeed, important form differences are seen for the Hopkins Verbal Learning Test (Brandt & Benedict, 2001) in the ACTIVE study (data not shown) and for the ADAS-Cog word list-learning task in ADNI (Crane et al., under review). Plots and statistics similar to those presented in this study can be replicated using these measures. These studies used alternate word lists to reduce practice effects, but in doing so they introduced complications for making inferences about cognitive performance. All ACTIVE publications involving comparisons of within-person memory performance over time use equipercentile-equated scores (e.g., Gross et al., 2010a, 2010b, 2011; Parisi et al., 2011). Aside from work in ACTIVE, we are not aware of equipercentile equating being used in longitudinal settings with cognitive performance data. We believe the field can benefit from being aware of and adopting these equating methods. To date, most published studies that have used longitudinal neuropsychological data from ADNI have not examined the AVLT from visits at which different AVLT forms were administered (e.g., Hinrichs et al., 2011; Murphy et al., 2010; Petersen et al., 2010). In other studies using ADNI data, word lists are treated as components of composite measures (e.g., Beckett et al., 2010), but the results of some studies are potentially susceptible to nonequivalent form differences (e.g., Carmichael et al., 2010; Okonkwo et al., 2011). Future work in ADNI should pay close attention to form differences on the AVLT and ADAS-Cog.
Equating methods are powerful tools, but their use comes with several caveats. First, measures that have different meanings should not be equated. For example, it is statistically possible to equate short-delay and long-delay recall trials, but the trials measure qualitatively different constructs. Relatedly, equating methods can equate test scores but do not address qualitative differences in behaviors, such as different strategies used on more difficult tests at different measurement occasions (Crawford et al., 1989; Light, 1991). A second limitation of equating is that the populations that produce two sets of test scores must have the same underlying ability to be validly equated. This assumption is easy to make when the same cognitively normal persons are being retested over time, but it may not be achievable (or measurable) in all situations. The application of equating methods in the present study would have been fairly straightforward had we made this assumption. However, in studies with several years of longitudinal follow-up, such as those in the present study, one can divide the equating task into two stages as we have done: first, identify a subset of observations as an equating sample in which underlying abilities can be assumed to be the same over time; then, apply the equating algorithm derived in that sample to the full sample. A third limitation is that, in longitudinal settings, equating procedures assume the magnitude of retest effects is exchangeable across groups. This assumption may be unreasonable when comparing patients with different clinical syndromes or diseases, such as delirium or amnesia. Fourth, a limitation specific to equipercentile equating is that the outcome should be continuously distributed and have enough range to reliably distinguish different quantiles. Applying equipercentile equating to individual AVLT trial recall scores, for example, would be more challenging. This is not a concern in linear equating, which presumes a normally distributed outcome. Another limitation of this study is the assumption that the underlying trajectory of change in AVLT performance is in fact linear over time; we relied on this assumption in the growth models used to assess the different equating methods. Previous work in ACTIVE has demonstrated that memory follows a linear pace of change after the immediate post-training visit (Gross & Rebok, 2011; Parisi et al., 2011), and linear change in cognitive function is commonly assumed in many other studies of older adults (e.g., Proust et al., 2006). Nevertheless, because true change is a latent and unobserved phenomenon, whether the AVLT in ACTIVE and ADNI in fact shows linear decline over time is uncertain. A final potential limitation, specific to the ACTIVE study, is that modifications of the AVLT from standard clinical administration limit the generalizability of findings from these data to clinical settings. However, our purpose in the present study was to illustrate equating methods, not to make inferences about training effects on memory function in ACTIVE, which have been reported elsewhere (e.g., Gross & Rebok, 2011; Parisi et al., 2011; Willis et al., 2006).
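The two-stage procedure just described (derive the equating function in a restricted equating sample, then apply it to the full sample) can be outlined in code. This is an illustrative sketch with simulated data; the function names and the quantile-table approach are ours, not the study's actual implementation:

```python
import numpy as np

def fit_equating_table(new_sub, ref_sub, grid=np.linspace(0.01, 0.99, 99)):
    """Stage 1: in the equating sample, where underlying ability is
    assumed stable, tabulate matching quantiles of the two forms."""
    return np.quantile(new_sub, grid), np.quantile(ref_sub, grid)

def apply_equating_table(scores, table):
    """Stage 2: equate the full sample by interpolating in the table."""
    new_q, ref_q = table
    return np.interp(scores, new_q, ref_q)

# Simulated sketch: the equating subsample defines the mapping,
# which is then applied to every new-form score in the full sample.
rng = np.random.default_rng(1)
ref_sub = rng.normal(50, 10, 800)   # reference form, equating sample
new_sub = rng.normal(46, 10, 800)   # harder new form, equating sample
table = fit_equating_table(new_sub, ref_sub)
full_sample_scores = rng.normal(46, 10, 5000)  # new-form scores, full sample
equated = apply_equating_table(full_sample_scores, table)
print(round(float(np.mean(equated)), 1))   # near 50 after equating
```

Restricting stage 1 to observations where stable ability is defensible is what preserves genuine aging, attrition, and group effects in the full-sample scores.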
Mean, linear, and equipercentile equating, all based in classical test theory, are not the only equating methods. Item response theory (IRT) methods can be used when the populations producing two sets of scores differ in the underlying ability being measured, but they require some items in common between the tests to anchor the two groups with respect to each other (Livingston, 2004). Counterbalancing adjusts for form differences in the study design before analysis, but it is useful only for making inferences about group differences, not within-person change (Cozby, 2009).
Although equipercentile equating proved to be ideal for the applications of the present study, the same procedure may not apply in all cases. Mean equating is intuitive and produces the same grand mean on two tests, but it does not change an individual's absolute difference from the mean. Thus, mean equating can lead to impossible or improbable scores for some individuals; for example, if two tests' means are 60 and 50, and the maximum possible value is 100, then an individual scoring 95 on the second test will have a mean-equated score of 105. Similarly, a limitation of linear equating is that extreme scores on a new test may yield equated scores outside the possible range of values on the original test. Neither problem arises with equipercentile equating. The principal advantage of equipercentile equating over linear equating is that it does not assume the reference test is normally distributed, but there are cases in which that assumption is viable. The AVLT in ACTIVE and ADNI was approximately normally distributed, which explains the similarity in findings between linear and equipercentile equating.
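The out-of-range behavior described in this paragraph is easy to verify numerically. The sketch below uses the worked example from the text (means of 60 and 50, ceiling of 100); the helper functions and the standard deviations chosen for the linear case are ours:

```python
import numpy as np

MAX_SCORE = 100  # illustrative test ceiling from the text

def mean_equate(x, mean_new, mean_ref):
    # Shift scores so both forms share the same grand mean.
    return x + (mean_ref - mean_new)

def linear_equate(x, mean_new, sd_new, mean_ref, sd_ref):
    # Match z-scores: same standardized position on both forms.
    return mean_ref + sd_ref * (x - mean_new) / sd_new

# An individual scoring 95 on the harder form (mean 50) lands above the
# ceiling after equating to the reference form (mean 60):
print(mean_equate(95, mean_new=50, mean_ref=60))   # 105, impossible
print(linear_equate(95, 50, 10, 60, 12))           # 114.0, also impossible

# Equipercentile equating cannot produce such scores, because the
# equated value is always an observed quantile of the reference form.
rng = np.random.default_rng(2)
ref = np.clip(rng.normal(60, 12, 5000), 0, MAX_SCORE)
new = np.clip(rng.normal(50, 10, 5000), 0, MAX_SCORE)
rank = np.searchsorted(np.sort(new), 95, side="right") / len(new)
print(float(np.quantile(ref, rank)) <= MAX_SCORE)  # True
```

The bounded behavior is structural: whatever percentile a score occupies on the new form, the returned value is drawn from the reference form's own observed range.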
Clinically, a patient's test scores can be interpreted only against appropriate reference norms, but normative values are unhelpful if they come from a different population than the patient's. Tests shown to be equivalent in certain groups defined by education, sex, or age may not be equivalent in other subpopulations (Ivnik et al., 1990). For this reason, Schmidt (2004) reports which AVLT word lists produce similar scores for older adults, in addition to which lists produce scores similar to one another. Equating techniques require data from cohorts of individuals, so it would not be possible to perform similar analyses for any particular person being evaluated clinically. Nevertheless, important differences in form difficulty should be kept in mind, and if different forms are used across time, this should be documented. Data from studies similar to the one presented here may help the practitioner understand whether change has occurred and, if so, its likely direction and magnitude. Ignoring differences in difficulty across forms in clinical settings could lead to unnecessary confusion at best and to incorrect conclusions or diagnoses at worst. Finally, it is important to acknowledge that equated data contribute only a small part of the clinical picture. A clinician's judgment of change will depend on multiple test results and findings, the clinical history, non-quantitative observations about the patient's abilities (Lezak et al., 2004), and his or her expert judgment and prior experience (Mitrushina et al., 2005).
In conclusion, equating challenges are pervasive but often unrecognized in research studies and clinical practice. When prior knowledge about form equivalence is unavailable or unclear at the planning stage of a study, we recommend that researchers use the same form and apply established methods to control for practice effects (e.g., Ferrer et al., 2004; Rabbitt et al., 2004; Salthouse et al., 2004). Thorough data exploration is necessary both to recognize the need for equating and to understand the relative merits of different equating procedures. The replication of findings across two cohorts, using special weighting adaptations, highlights the versatility and generalizability of the equating methods used in the present study.
The method of equipercentile equating may have broad applications in both clinical and research settings: to enable the use of nonequivalent test forms, to evaluate change over time, to quantify retest effects, and to align scores on different tests of the same construct (such as identifying cutpoints for dementia on cognitive screening tests). Equipercentile equating is a well-accepted tool for comparing psychiatric diagnostic instruments (Furukawa et al., 2009; Leucht et al., 2005; Montoya et al., 2011; Noonan et al., 2011; Schennach-Wolff et al., 2010) and for identifying clinically relevant benchmarks and crosswalks on neuropsychological tests (Fong et al., 2009). The procedure represents a robust approach to better understanding longitudinal change, and the present study demonstrated an innovative application of equating methods for settings in which participants or patients are followed over long periods of time.