Analyses of individual differences in change may be unintentionally biased when versions of a neuropsychological test used at different follow-ups are not of equivalent difficulty. This study’s objective was to compare mean, linear, and equipercentile equating methods and demonstrate their utility in longitudinal research.
The Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE, N=1,401) study is a longitudinal randomized trial of cognitive training. The Alzheimer’s Disease Neuroimaging Initiative (ADNI, n=819) is an observational cohort study. Nonequivalent alternate versions of the Auditory Verbal Learning Test (AVLT) were administered in both studies.
In visual displays, raw and mean-equated AVLT scores in both studies showed obvious nonlinear trajectories in reference groups that should show minimal change and poor equivalence of means over time (ps≤0.001); raw scores also fit poorly in models of within-person change (RMSEAs>0.12). Linear and equipercentile equating produced more similar means in reference groups (ps≥0.09) and performed better in growth models (RMSEAs<0.05).
Equipercentile equating is the preferred equating method because it accommodates tests more difficult than a reference test at different percentiles of performance and performs well in models of within-person trajectory. The method has broad applications in both clinical and research settings to enhance the ability to use nonequivalent test forms.
The identification of cognitive decline requires repeated assessment of the same individual over time. Alternate forms of a test are often used to minimize practice effects under the assumption that the forms are equivalent. Thus, the ability to equate alternate forms of neuropsychological tests is highly important in both clinical and research settings. Importantly, inferences about differences between groups, or change within persons over time, may be erroneous or biased when versions of a test are not equivalent.
Methods for test equating are particularly important for neuropsychological assessments in population-based research and clinical practice. Alternate versions of a test are considered equivalent if they produce the same mean and variance in a sample (Borsboom, 2005). Tests that are parallel but nonequivalent could be similar in content and length, but yield different scores for the same individual because they differ in item difficulty or administration characteristics. A relevant example is provided by word list-learning tests, for which use of alternate forms is common. Word lists in particular have an extensive history in Western psychology (Ebbinghaus, 1895/1964; Underwood, 1963). Different versions are intended to be equivalent, but often are not. One reason for non-equivalence is test length. For example, tests are often shortened due to time limitations or to reduce participant burden, as was done for the CVLT (Mitrushina et al., 2005), and equipercentile equating methods were used to equate a 9 item version with the original 16 item version. Other reasons for non-equivalence of forms are more complicated. In word list-learning tests, the word’s frequency of use in the English language (Anderson & Bower, 1972; Battig & Montague, 1969; Fuller et al., 1997), number of syllables, serial position in the list, and imagery value (Paivio, 1968) all contribute to its difficulty. Fuller and colleagues (1997), for example, reported that two forms of the Auditory Verbal Learning Test (AVLT, described in Methods), Lists B and C (see Lezak et al., 2004), are not of equivalent difficulty. In a systematic review, Hawkins and colleagues (2004) showed recall differences of about three words (8% difference) across alternate forms that were studied.
Test equating is an analytical approach to adjust nonequivalent tests. The goal of test equating is to define a transformation of a variable that returns the same cumulative probability plot as the other variable being compared. There are many ways to equate tests. Given relevant characteristics of two distributions, equating methods can be applied in almost any setting in which multiple scores for each person are on different metrics. Two tests that measure the same outcome are equivalent if they place individuals in the same relative position in a group (Livingston, 2004). Widely used methods include mean, linear, and equipercentile equating. These methods differ by how relative position is defined. In mean equating, relative position is defined by the absolute difference from the sample mean of a test, and each individual’s score is changed by the same amount to equate the sample mean to that of a reference test (Kolen & Brennan, 1995). In linear equating, relative position is defined in terms of standard deviations from the group mean. Linear equating is accomplished by adjusting scores from the new form to be within the same number of standard deviations of the mean of the original form. A formula is provided in the Methods section. Equipercentile equating defines relative position by a score’s percentile rank in the group. It is accomplished by identifying scores on two measures with the same percentile rank and transforming the score on a new test to the corresponding score on the reference form with the same percentile rank.
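As a minimal sketch of the first two definitions, the code below implements mean and linear equating for a single pair of forms. The function names and the recall totals are hypothetical and chosen only for illustration; the sketch ignores the weighting and sample restrictions used in the present study.

```python
from statistics import mean, stdev

def mean_equate(new_scores, ref_scores):
    """Shift every new-form score by the difference in sample means."""
    shift = mean(ref_scores) - mean(new_scores)
    return [x + shift for x in new_scores]

def linear_equate(new_scores, ref_scores):
    """Rescale new-form scores to the mean and SD of the reference form."""
    m_new, s_new = mean(new_scores), stdev(new_scores)
    m_ref, s_ref = mean(ref_scores), stdev(ref_scores)
    return [m_ref + (s_ref / s_new) * (x - m_new) for x in new_scores]

# Hypothetical recall totals on a reference form and a harder new form.
reference = [38, 42, 45, 47, 50, 53, 58]
new_form = [33, 38, 42, 45, 47, 51, 52]

mean_eq = mean_equate(new_form, reference)
linear_eq = linear_equate(new_form, reference)
```

Mean equating matches only the sample means, whereas linear equating matches both the mean and the standard deviation of the reference form. Neither transformation changes the rank order of individuals.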
The reason for administering alternate versions of a word list is to reduce retest or practice effects, but alternate forms do not completely correct for retest. Practice effects are attributable to general testing factors that arise from repeated exposure to the same task (e.g., learning to take tests) in addition to retention of particular test content (e.g., recall for specific words) (Crawford et al., 1989). Verbal learning tests are a mainstay in memory research and assessment, but are particularly susceptible to practice effects under repeated administration, such as in longitudinal research and in clinical settings when a patient must be re-evaluated over time. Alternate forms cannot account for general testing factors that contribute to practice effects, and may actually introduce greater problems in examining within-person trajectories in longitudinal studies if forms are not equivalent.
The objectives of the present study were to compare three equating methods – mean, linear, and equipercentile equating – and demonstrate their utility in longitudinal research. To demonstrate the generalizability of the approaches, examples are presented from two large studies of older adults, a cognitive intervention study and an observational study of predictors of conversion to Alzheimer’s disease. Equating methods were contrasted using visual displays, tests of mean equivalence over time in reference groups, and models of person-level growth.
Participants were drawn from two large-scale, multi-site cohorts of older adults, the Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) study and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). These studies were selected because both provide longitudinal data using word list-learning measures collected using similar methods. Although the data are similar and nonequivalent forms pose similar challenges, the objectives of these studies differ considerably, which highlights the generalizability of equating approaches.
ACTIVE is a longitudinal randomized trial of cognitive training in cognitively intact, community-dwelling adults age 65 and older (Ball et al., 2002; Jobe et al., 2001; Willis et al., 2006). Participants (n=2,802) were randomized to one of four intervention groups (memory, reasoning, speed of cognitive processing, and no-contact control) after a baseline assessment, and followed up immediately after training and after one, two, three, five, and ten years. For the sake of parsimony, the present study used data from two of these groups, the memory-trained (n=703) and no-contact control (n=698) groups, which were collected at baseline before the intervention, immediately following training, and at the one, two, three, and five year follow-up assessments.
ADNI began in 2003 as a five-year observational cohort study of Alzheimer's disease (AD), with the primary goal of assessing the extent to which serial magnetic resonance imaging, positron emission tomography, other biological markers, and cognitive tests can be used to predict progression to mild cognitive impairment (MCI) and AD. Further information is available at http://www.loni.ucla.edu/ADNI. The present study used baseline and 6, 12, 18, 24, and 36 month follow-up data for normal subjects (n=229) and MCI (n=397) and AD (n=193) patients. MCI patients were assessed at all waves. Normal healthy controls were not followed at 18 months, and AD patients were not followed at the 18 or 36 month waves. Data from a 48 month wave were not included in the present study because data collection was still underway. Data, which are continuously updated, were downloaded for the present study on March 15, 2011.
The AVLT (Rey, 1964; Schmidt, 2004) was administered in both the ACTIVE and ADNI studies. During administration of the AVLT, participants are read a list of 15 unrelated words and asked to recall as many words as they can remember. The same list is repeated over five trials, followed by an interference trial with a new 15-word list, a short-delay free recall trial, and a long-delay free recall trial thirty minutes later. The present study used the sum of recall from the five immediate AVLT recall trials. The administration of the AVLT in ACTIVE differed from that in ADNI in two ways. First, the long-delay free recall trial was dropped due to time constraints. Second, the ACTIVE protocol modified the test for group administration by having participants write down responses instead of speaking them.
In ACTIVE, the original AVLT List A and interference List B (Taylor, 1959) were used at the baseline and third annual visits, lists by Geffen and colleagues (1994) were used at the immediate post-training and fifth annual visits, lists by Crawford and colleagues (1989) were used at the first annual visit, and lists described by Jones-Gotman and colleagues were used at the second annual visit (Lezak et al., 2004, p. 423). In ADNI, lists from Taylor (1959) were used at the baseline, 12, and 24 month visits and lists by Crawford and colleagues (1989) were used at the 6, 18, and 36 month visits. These visits are delineated in Tables and Figures with letters.
To account for differences in test difficulty, we conducted mean, linear, and equipercentile equating in ACTIVE and ADNI separately following similar procedures (Kolen & Brennan, 1995). We adapted weighted versions of each equating procedure to preserve aging, cohort, and group differences using a two-stage approach. In the first stage, we selected an equating sample from which to collect necessary characteristics of test distributions and derive the equating algorithm. The goal of this stage was to define a sample of participants at each follow-up visit whose memory ability was equivalent, such that any differences between visits could be attributed to form differences and not aging or group differences. In the second stage, we applied the equating algorithm to the full study sample in a way that preserved attrition, aging, cohort, and group differences but eliminated form differences. Equated scores were then compared visually using plots of mean recall over time and cumulative probability plots and statistically using tests of equivalence of means in reference groups as well as estimates of within-person change using latent growth models.
An important assumption underlying any application of equating is that the populations producing responses on differently scaled tests at each time point must have the same underlying ability. To preserve differences in memory performance attributable to attrition, normal aging, and training status or diagnostic group, the equating sample was restricted in ACTIVE to control participants and in ADNI to MCI patients, respectively. Although the healthy control group in ADNI was the preferred reference group, the MCI group was used as the equating sample because by design ADNI assessed them at all waves and they provided a better coverage of AVLT scores observed across both healthy control and AD groups. Although the ADNI MCI group served as the equating sample, we subsequently restandardized the entire ADNI sample to make the healthy control group the reference sample by subtracting a model-implied mean difference in equated performance at each study visit between healthy controls and other participants. Equated performance at each time point was estimated from a weighted mixed effects model of test scores on indicators for time. Annotated code is available from the authors upon request.
To adjust for attrition over time, we used inverse propensity score weights in the equating sample to model the probability of dropout (Rosenbaum & Rubin, 1983). To preserve normal aging effects when the equating algorithm was applied to the full sample, we further restricted the equating sample to participant visits with ages that were common across all visits (ages 72 to 90 in ACTIVE and 63 to 90 in ADNI). Because ACTIVE and ADNI included long follow-up periods, we estimated analytic weights using a direct adjustment procedure for age to ensure the same age distribution at each study visit. This removes aging effects in the equating sample, but preserves aging in the full sample for individual differences analyses using equated test scores.
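The direct adjustment for age can be illustrated with a short sketch. The age bands, the target distribution, and the function name below are assumptions made for illustration; the original analysis may have defined strata differently and also incorporated the attrition weights, which are not shown here.

```python
from collections import Counter

def direct_standardization_weights(age_bands, standard_props):
    """Weight each observation so that the weighted age-band distribution
    matches `standard_props` (a dict mapping band -> target proportion)."""
    counts = Counter(age_bands)
    n = len(age_bands)
    return [standard_props[band] / (counts[band] / n) for band in age_bands]

# Hypothetical age bands observed at one follow-up visit.
visit_bands = ["72-77", "72-77", "78-83", "78-83", "78-83", "84-90"]
# Hypothetical target distribution shared by every visit.
standard = {"72-77": 0.5, "78-83": 0.3, "84-90": 0.2}

weights = direct_standardization_weights(visit_bands, standard)

# Weighted share of each band after adjustment.
total = sum(weights)
weighted_share = {
    band: sum(w for w, b in zip(weights, visit_bands) if b == band) / total
    for band in standard
}
```

Applying the same target distribution at every visit gives each visit the same weighted age composition, so that differences in weighted means between visits are not driven by aging of the equating sample.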
After procedures in this first stage, the only differences in memory performance between groups should be those attributable to form differences. These form differences are addressed to varying degrees with equating algorithms, key elements of which were carried over to be used on the full sample in the second stage.
Once equating samples were selected in ACTIVE and ADNI, equating algorithms were derived and applied to the full samples. These algorithms involve means at each study wave for mean equating (described earlier), means and standard deviations for linear equating, and test score percentiles for equipercentile equating. Linear equating was accomplished using the following formula:
$$\text{Test2}_{\text{adjusted},i} = \overline{\text{Test1}} + \frac{SD_{\text{Test1}}}{SD_{\text{Test2}}}\left(\text{Test2}_{i} - \overline{\text{Test2}}\right)$$
Here, $\text{Test2}_{\text{adjusted},i}$ is the linear-equated AVLT score for a follow-up Test2 visit for person $i$. The raw Test2 distribution had mean $\overline{\text{Test2}}$ and standard deviation $SD_{\text{Test2}}$. The baseline test had mean $\overline{\text{Test1}}$ and standard deviation $SD_{\text{Test1}}$. The means and standard deviations were identified in the equating sample but, as for all equating methods, the resulting transformation was applied to all participants in the second stage of equating.
The goal of equipercentile equating is to define a non-parametric transformation of one variable, in the present study a follow-up AVLT recall score, that returns the same cumulative probability plot as the baseline test. Test scores at follow-up visits were scaled to baseline scores with the same percentile rank. For example, a score of 47 on the baseline AVLT in ACTIVE had a percentile rank of 43.7%. The first annual AVLT score with that percentile ranking is 44, demonstrating that for this score range, the first annual test was more difficult than the baseline test by 3 words, assuming individuals did not truly decline after only 10 weeks. This example is over-simplified because the adaptation used for the present study does accommodate normal age-related decline. Additionally, our equipercentile equating algorithm used a loglinear function to smooth out equated score distributions (Albano, 2011).
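The percentile-matching step can be sketched as follows. This is a simplified, unsmoothed illustration with hypothetical data and function names: it omits the loglinear smoothing, weighting, and age standardization described for the present study, and it uses one common definition of percentile rank (the share scoring below a value plus half the share tied at it).

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, sample):
    """Mid-percentile rank in [0, 1]: share below plus half the share tied."""
    s = sorted(sample)
    below = bisect_left(s, score)
    tied = bisect_right(s, score) - below
    return (below + 0.5 * tied) / len(s)

def score_at_rank(p, sample):
    """Score in `sample` at percentile rank p, interpolating linearly
    between the ranks of the distinct observed scores."""
    values = sorted(set(sample))
    ranks = [percentile_rank(v, sample) for v in values]
    if p <= ranks[0]:
        return float(values[0])
    if p >= ranks[-1]:
        return float(values[-1])
    hi = bisect_left(ranks, p)  # first index with ranks[hi] >= p
    lo = hi - 1
    frac = (p - ranks[lo]) / (ranks[hi] - ranks[lo])
    return values[lo] + frac * (values[hi] - values[lo])

def equipercentile_equate(score, new_sample, ref_sample):
    """Map a new-form score to the reference-form score with the same rank."""
    return score_at_rank(percentile_rank(score, new_sample), ref_sample)

# Hypothetical samples: the new form is uniformly 3 words harder.
reference = [40, 44, 47, 47, 50, 55]
new_form = [37, 41, 44, 44, 47, 52]

equated = equipercentile_equate(44, new_form, reference)
```

In this constructed example a new-form score of 44 maps to a reference score of 47, because both sit at the same percentile rank in their respective samples. Equated scores cannot fall outside the observed range of the reference form.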
Line graphs showing mean AVLT scores over time, or time trend plots, and cumulative probability plots were constructed to compare equated scores. Equivalent tests should yield identical cumulative probability plots in the reference sample. To account for data missing at random conditional on indicators for time and group, estimated means from random effects models were used in time trend plots. Cumulative probability plots show the cumulative proportion of the sample (y-axis) who recalled up to a given number of words on the AVLT (x-axis).
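The quantity plotted in a cumulative probability plot can be computed directly, as in the sketch below. The maximum of 75 follows from summing five trials of a 15-word list; the sample data and the `max_gap` summary (the largest vertical distance between two curves) are illustrative additions, not statistics reported in the present study.

```python
def cumulative_proportions(scores, max_score=75):
    """Return (x, share of the sample scoring <= x) for x = 0..max_score."""
    n = len(scores)
    return [(x, sum(s <= x for s in scores) / n) for x in range(max_score + 1)]

def max_gap(scores_a, scores_b, max_score=75):
    """Largest vertical distance between two cumulative probability curves."""
    curve_a = cumulative_proportions(scores_a, max_score)
    curve_b = cumulative_proportions(scores_b, max_score)
    return max(abs(pa - pb) for (_, pa), (_, pb) in zip(curve_a, curve_b))

# Hypothetical recall totals at baseline and on a harder follow-up form.
baseline = [40, 44, 47, 47, 50, 55]
harder = [37, 41, 44, 44, 47, 52]

gap = max_gap(baseline, harder)
```

Two forms with identical curves have a gap of zero. A follow-up curve that sits to the left of the baseline curve indicates lower scores at the same percentile, that is, a harder form.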
The equivalence of test score means in reference groups (ACTIVE control, ADNI healthy controls) was tested for raw, mean, linear, and equipercentile equated scores using χ2 tests for nested confirmatory factor analysis models. In the first of two models for each equating set, trial recall means at each study visit were constrained to be equal. In the second model, means were freely estimated. Twice the difference in the log likelihood follows a chi-squared distribution. These tests are similar to repeated-measures ANOVA that tests for differences in means over time, but they are less stringent because they do not make the assumption that variances around the means are equal at each visit.
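The nested-model comparison can be sketched as a likelihood ratio test. The log likelihoods below are hypothetical, and the sketch assumes ordinary maximum likelihood without any scaling correction; the chi-squared tail probability is computed from the standard recurrence for integer degrees of freedom so that no external library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # tail for df = 1
    # Each step raises the degrees of freedom by 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_constrained, loglik_free, df):
    """Twice the log-likelihood difference, referred to chi-squared(df)."""
    stat = 2.0 * (loglik_free - loglik_constrained)
    return stat, chi2_sf(stat, df)

# Hypothetical fits: means constrained equal vs. freely estimated.
stat, p_value = likelihood_ratio_test(-1000.0, -995.0, df=4)
```

Here `df` is the number of equality constraints released in the free model. A small p-value indicates that constraining the means to be equal across visits worsens fit, that is, the means differ over time.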
We used multiple group latent growth models to model person-level changes in recall over time (McArdle & Bell, 2000; Muthén, 1997; Muthén & Curran, 1997). Latent factors represent initial or baseline status and trajectories of change over time. These parameters are formed from observed scores at each study visit. We fixed factor loading paths from the intercept to observed recall sum scores for each assessment at 1, and factor loadings from the latent slope to values corresponding to a linear trajectory in time. In ACTIVE, a second intercept factor was also included to accommodate immediate training gains between baseline and post-training for trained participants (Bollen & Curran, 2006). We constrained its factor loadings to 0 at the baseline visit and to 1 at follow-up visits and its variance to 0.
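The loading pattern described above can be written compactly. The notation is assumed here for illustration and is not taken from the original analysis: $y_{it}$ is the recall sum score for person $i$ at visit $t$, $\eta_{0i}$ is the intercept factor, $\eta_{1i}$ the linear slope factor, and $\eta_{2i}$ the additional intercept capturing the immediate training gain in ACTIVE.

```latex
y_{it} = 1 \cdot \eta_{0i} + \lambda_t \, \eta_{1i} + \delta_t \, \eta_{2i} + \varepsilon_{it},
\qquad
\delta_t =
\begin{cases}
0, & t = \text{baseline} \\
1, & t = \text{any follow-up visit}
\end{cases},
\qquad
\operatorname{Var}(\eta_{2i}) = 0
```

Under this sketch, $\lambda_t$ is the fixed loading for elapsed time at visit $t$ (its units are an assumption), and fixing the variance of $\eta_{2i}$ to zero means that the training gain is estimated as a group-level mean shift rather than a person-specific effect.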
Latent growth curve models and factor analyses were conducted using the Mplus (version 6.11) software package (Muthén & Muthén, 1998–2010). The models accommodate data missing at random conditional on observed covariates (Donders, van der Heijden, Stijnen, & Moons, 2006). Models using different equating methods were compared using standard model fit statistics including the root mean square error of approximation (RMSEA; Steiger, 1989) and comparative fit index (CFI; Hu & Bentler, 1999). These fit statistics were of key importance because they provide a measure of how well the model-estimated baseline levels and trajectories fit the observed trajectories under each method of equating. An RMSEA less than 0.05 and CFI greater than 0.95 are considered indicators of excellent model fit (Hu & Bentler, 1999). Graphical displays and equating algorithms were generated using Stata 12.0 (StataCorp, 2011) and R software packages (R Development Core Team, 2009).
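For readers unfamiliar with these indices, the sketch below computes both from model chi-squared statistics using their common single-group textbook definitions. The inputs are hypothetical, and software packages differ in details (for example, using N rather than N − 1, or applying a multiple-group adjustment), so the values need not match Mplus output exactly.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation (single-group form)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative fit index relative to a baseline (null) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_baseline - df_baseline, d_model, 0.0)
    return 1.0 if d_base == 0.0 else 1.0 - d_model / d_base

# Hypothetical model and baseline chi-squared values for n = 501 people.
example_rmsea = rmsea(120.0, 20, 501)
example_cfi = cfi(120.0, 20, 1020.0, 20)
```

Both indices are truncated: whenever the model chi-squared does not exceed its degrees of freedom, the RMSEA is 0 and the CFI is 1.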
Table 1 shows baseline characteristics and AVLT test scores at each follow-up time for ACTIVE and ADNI samples. ACTIVE participants were mostly white females aged 65–94 and cognitively intact at baseline. On average, ADNI participants were younger, more highly educated, and a higher proportion of them were males compared to the ACTIVE sample.
Mean recall over time for ACTIVE control and memory-trained participants under different equating methods is plotted in each panel of Figure 1. Figure 2 provides similar information using the ADNI MCI diagnostic group. In ACTIVE, plots of raw scores give the impression that both groups start at about the same level at the baseline visit, decline up to two years after training, and then recover inexplicably (Figure 1). In ADNI, the MCI group zigzags in performance at every other visit by approximately 0.3 standard deviations. AVLT test scores in Table 1 show that raw score trends in Figure 1 generalize to all intervention and diagnostic groups. The effect of nonequivalent forms is demonstrated more rigorously using cumulative probability plots in the Appendix (see Legends for a detailed interpretation). They reveal different difficulty levels across the score distribution for different waves: participants in both ACTIVE and ADNI performing at the 50th percentile on follow-up tests had systematically lower scores than participants at the 50th percentile of the baseline test. Scores at the lowest and highest performance percentiles were more comparable across visits.
Mean equating in ACTIVE produced more similar means, but overcompensated for differences in forms at the third and fifth annual visits (Figure 1). Mean equating in ADNI produced a plot of a declining trajectory, although some residual form differences remained at the 24 month visit (Figure 2). Cumulative probability plots for mean equating did not overlap as well as for other equating methods (Appendix).
Linear equating, like mean equating, revealed boosted performance at one year followed by decline in ACTIVE, but it also suggested improvement at the immediate post-training visit and a plateau in performance after the third annual visit (Figure 1). Linear equating in ADNI nearly eliminated any indication of decline in the mean level of performance over time, even though decline should be expected in a sample of MCI patients.
Equipercentile equating produced the smoothest trajectories in ACTIVE, with an expected pre-post training gain and age-related cognitive decline (Figure 1). In ADNI, the wave pattern over time was successfully removed while an average decline of 1.8 AVLT words through three years was still apparent (Figure 2).
Although the graphical displays demonstrate the superiority of equating methods in relation to raw scores, they also indicate residual imprecision of these methods: in ACTIVE, the second visit was a post-test assessment only 10 weeks after the first. The equating methods should theoretically produce equal means between the baseline and immediate post-test assessments because there was almost no attrition or aging in the control group. Mean equating does not because of small residual form differences that persist after equating. Linear and equipercentile equating produce means at baseline and immediate post-training in ACTIVE very close to each other (Figure 1).
Cumulative probability plots are shown in the Appendix. Plots for equated scores demonstrate excellent overlap, especially for equipercentile equating in ACTIVE (Figure A1) and linear and equipercentile equating in ADNI (Figure A2). Thus, equipercentile equating visually demonstrates the best adjustment for learning effects and smooths out mean trajectories.
Negligible change over time was assumed in the ADNI healthy control and ACTIVE control groups. In ADNI, raw (χ2=14.8, df=4, p=0.001) and mean equating (χ2=16.7, df=4, p<0.001) produced significantly different means over time, but linear (χ2=6.8, df=5, p=0.18) and equipercentile equating (χ2=7.6, df=4, p=0.09) produced statistically equivalent means over time in the healthy control group. In the ACTIVE control group, there were significant differences in raw AVLT recall sum score means (χ2=45.3, df=5, p<0.001), mean-equated means (χ2=58.2, df=5, p<0.001), linear-equated means (χ2=63.6, df=5, p<0.001), and equipercentile-equated means (χ2=31.9, df=5, p<0.001).
By the fifth year visit, n=749 of 1,401 (53%) participants were still in the study sample and attrition did not differ by intervention group. Follow-up in ADNI after three years was higher (n=591/819, 72%). Our propensity adjustment for sample attrition, which had a larger effect in the ACTIVE study, is likely responsible for the lack of equivalence of equipercentile-equated means, which as shown in Figure 1 demonstrates a smooth declining trajectory.
Results from ACTIVE and ADNI are shown in Tables 2 and 3, respectively, for raw, mean, linear, and equipercentile-equated scores. Means and variances of growth parameters in the Tables characterize level and variability in within-person trajectory of AVLT performance over time. An immediate contrast is the poor RMSEA and CFI model fits using raw AVLT scores, shown in Tables 2 and 3, which improve dramatically given any equating method but are best after equipercentile equating. The model with equipercentile-equated AVLT scores in ADNI fit the data perfectly (RMSEA=0.0; CFI=1.0).
Besides fit statistics, three key substantive inferences change depending on the type of equating used. First, using raw scores in ACTIVE, the immediate training “boost” appears to be in the negative direction for memory-trained participants (pre-post change: −5.0 words) but flips to a positive direction after equating (Table 2). Second, in both ACTIVE and ADNI, annual memory decline in all groups, as indicated by slope means, is overestimated or underestimated to varying degrees using raw scores, while equated scores show less annual decline in ACTIVE control participants (Table 2) and ADNI MCI and AD participants (Table 3). A third substantive change in inferences is that the correlation between initial recall and aging trajectory is overestimated using raw scores relative to equated scores. Based on model fits and substantive knowledge of trajectories of cognitive aging, latent growth models suggest differences in test difficulty are handled best with either linear or equipercentile equating.
The present study investigated different methods of equating AVLT word list versions in longitudinal aging research. We adapted accepted test equating methods using a novel approach to the study of longitudinal cognitive aging. These methods are broadly applicable to within- and between-group comparisons of test performance data in both research and clinical settings. Equipercentile equating uses observed percentiles of a distribution, and is a more generalizable non-parametric transformation than linear equating, which assumes normally distributed variables whose distributions are fully characterized by a mean and standard deviation. Graphical displays clearly show equipercentile equating accommodates tests that are more difficult than the reference test at different percentiles of performance, and models of within-person change show it also satisfactorily adjusts for practice, or retest, effects. Importantly, an implicit assumption of mean, linear, or equipercentile equating is that the populations producing two sets of scores, whether they are the same people followed over time or two different groups, have the same underlying ability. Because this may not be a valid assumption for older adults followed for years, the present study described equating procedures that used age standardization to preserve aging effects, propensity weighting to adjust for attrition, and restriction to preserve group differences due to diagnostic and intervention group membership.
ACTIVE is the largest study of cognitive training among older adults to date, and ADNI is a $60 million public-private partnership that is being used to stimulate innovative methods for evaluating progression of AD in clinical trials. The roller coaster trajectory in ACTIVE and waves in ADNI are attributable to nonequivalent AVLT forms used at different study visits. The ACTIVE study cycled through four versions of the AVLT until repeating the baseline list at the third annual follow-up. ADNI cycled between two AVLT lists, which explains the wave-like pattern. These method artifacts may be present in other settings. Indeed, important form differences are seen for the Hopkins Verbal Learning Test (Brandt & Benedict, 2001) in the ACTIVE study (data not shown) and for the ADAS-Cog word list-learning task in ADNI (Crane et al., under review). Plots and statistics similar to those presented in this study can be produced for these measures. The reason these studies used alternate word lists was to reduce practice effects, but in doing so they introduced complications for making inferences about cognitive performance. All ACTIVE publications involving comparisons of within-person memory performance over time use equipercentile-equated scores (e.g., Gross et al., 2010a, 2010b, 2011; Parisi et al., 2011). Aside from work in ACTIVE, we are not aware of equipercentile equating being used in longitudinal settings with cognitive performance data. We believe the field can benefit by being aware of and adopting these equating methods. To date, most published studies that have used longitudinal neuropsychological data from ADNI have not examined the AVLT from visits in which different AVLT forms were administered (e.g., Hinrichs et al., 2011; Murphy et al., 2010; Petersen et al., 2010). In other studies using ADNI data, word lists are treated as components in composite measures (e.g., Beckett et al., 2010), but results of some studies are potentially susceptible to nonequivalent form differences (e.g., Carmichael et al., 2010; Okonkwo et al., 2011). Future work in ADNI should pay close attention to form differences on the AVLT and ADAS-Cog.
Equating methods are powerful tools, but their use comes with several caveats. First, measures should not be equated that have different meanings. For example, it is statistically possible to equate short-delay and long-delay recall trials, but the trials measure qualitatively different constructs. Relatedly, equating methods can equate test scores but do not address qualitative differences in behaviors, such as different strategies used on more difficult tests at different measurement occasions (Crawford et al., 1989; Light, 1991). A second limitation of equating is that populations that produce two sets of test scores must have the same underlying ability to be validly equated. This is an easy assumption to make when the same cognitively normal persons are being retested over time, but may not be achievable (or measurable) in all situations. The application of equating methods in the present study would have been fairly straightforward if we had assumed this. However, in studies with several years of longitudinal follow-up such as those in the present study, one can divide the equating task into two stages as we have done: identify a subset of observations as an equating sample in which underlying abilities can be assumed to be the same over time, then apply the equating algorithm derived in that sample to the full sample. A third limitation is that, in longitudinal settings, equating procedures assume the magnitude of retest effects is exchangeable across groups. This assumption may be unreasonable when comparing patients with different clinical syndromes or diseases, such as delirium or amnesia. Fourth, a limitation specific to equipercentile equating is that the outcome should be continuously distributed and have enough range to reliably distinguish different quantiles. Applying equipercentile equating to individual AVLT trial recall scores, for example, would be more challenging. This is not a concern in linear equating, which presumes a normally distributed outcome. Another limitation of this study is that we assumed that the underlying trajectory of change in AVLT performance is in fact linear over time. We used this assumption in growth models to assess the different equating methods. Previous work in ACTIVE has demonstrated memory follows a linear pace of change following the immediate post-training visit (Gross & Rebok, 2011; Parisi et al., 2011). The assumption of linear change in cognitive function among older adults is commonly accepted in many other studies of older adults (e.g., Proust et al., 2006). Nevertheless, because true change is a latent and unobserved phenomenon, whether the AVLT in ACTIVE and ADNI in fact shows linear decline over time is uncertain. A final potential limitation specific to the ACTIVE study is that modifications in test administration of the AVLT from standard clinical administration limit the generalizability of findings from these data to clinical settings. However, our purpose in the present study was to illustrate equating methods and not to make inferences about training effects on memory function in ACTIVE, which have been reported elsewhere (e.g., Gross & Rebok, 2011; Parisi et al., 2011; Willis et al., 2006).
Mean, linear, and equipercentile equating, based in classical test theory, are not the only equating methods. Item response theory (IRT) methods can be used if populations producing two sets of scores differ in the underlying ability being measured, but require some items in common between the tests to anchor the two groups with respect to each other (Livingston, 2004). Counterbalancing is a method of adjusting for form differences in the study design before analysis, but is useful only for making inferences about group differences and not within-person change (Cozby, 2009).
Although equipercentile equating proved to be ideal for the applications of the present study, the same procedure may not apply in all cases. Mean equating is intuitive and produces the same grand mean on two tests, but it does not change an individual’s absolute difference from the mean. Thus, mean equating can lead to impossible or improbable scores among some individuals; for example, if two tests’ means are 60 and 50, and the maximum possible value is 100, then an individual scoring a 95 on the second test will have a mean-equated score of 105. Similar to mean equating, a limitation of linear equating is that extreme scores on a new test may yield equated scores outside the possible range of values in the original test. This is not a concern in equipercentile equating. The principal advantage of equipercentile equating over linear equating is that it does not assume the reference test is normally distributed, but there are cases in which that assumption is viable. The AVLT in ACTIVE and ADNI was approximately normally distributed, which explains the similarities in findings between linear and equipercentile equating.
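The arithmetic in the example above can be checked with a short sketch of mean and linear equating. This is an illustration under stated assumptions rather than the code used in the present study: the function names and score lists are hypothetical, chosen only so that the two forms have means of 60 and 50, and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev


def mean_equate(score, new_scores, ref_scores):
    """Shift a new-form score by the difference between the form means."""
    return score + (mean(ref_scores) - mean(new_scores))


def linear_equate(score, new_scores, ref_scores):
    """Rescale a new-form score to the reference form's mean and SD."""
    z = (score - mean(new_scores)) / pstdev(new_scores)
    return mean(ref_scores) + pstdev(ref_scores) * z


# Hypothetical scores on a test with a maximum possible value of 100.
MAX_SCORE = 100
ref_scores = [20, 40, 60, 80, 100]  # reference form, mean 60
new_scores = [30, 40, 50, 60, 70]   # new form, mean 50

mean_equated = mean_equate(95, new_scores, ref_scores)      # 95 + 10 = 105
linear_equated = linear_equate(95, new_scores, ref_scores)  # about 150

# Both equated values exceed the maximum possible score.
out_of_range = [v > MAX_SCORE for v in (mean_equated, linear_equated)]
```

Both methods map the score of 95 above the maximum of 100. In this toy example linear equating overshoots further because the reference form has twice the spread of the new form.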
Clinically, a patient’s test scores can only be interpreted using appropriate reference norms, but normative values are unhelpful if they come from a population different from the one to which the patient belongs. Tests shown to be equivalent in certain groups defined by education, sex, or age may not be equivalent in other subpopulations (Ivnik et al., 1990). For this reason, Schmidt (2004) reports which AVLT word lists produce similar scores specifically among older adults, in addition to which lists produce scores similar to those of other lists. Equating techniques require data from cohorts of individuals to carry out, so it would not be possible to perform similar analyses for any particular person being evaluated clinically. Nevertheless, important differences in form difficulty should be kept in mind, and if different forms are used across time, this should be documented. Data from studies similar to the one presented here may be useful to assist the practitioner in understanding whether change has occurred, and if so, its likely direction and magnitude. Ignoring differences in difficulty across forms in clinical settings could lead to unnecessary confusion at least and incorrect conclusions or diagnoses at worst. Finally, it is important to acknowledge that equated data contribute to only a small part of the clinical picture. A clinician’s judgment of change will depend on multiple test results and findings, the clinical history, non-quantitative observations about the patient’s abilities (Lezak et al., 2004), and on his or her expert judgment and prior experience (Mitrushina et al., 2005).
In conclusion, equating challenges are pervasive but often unrecognized in research studies and clinical practice. When prior knowledge about form equivalence is unavailable or unclear when planning a study, we recommend that researchers use the same form and apply established methods to control for practice effects (e.g., Ferrer et al., 2004, 2005; Rabbitt et al., 2004; Salthouse et al., 2004, 2008, 2010). Thorough data exploration is necessary both to recognize the need for equating and to understand the relative merits of different equating procedures. The replication of findings across two cohorts, utilizing special weighting adaptations, highlights the versatility and generalizability of the equating methods used in the present study.
The method of equipercentile equating may have broad applications in both clinical and research settings to enhance the ability to use nonequivalent test forms, to evaluate change over time, to quantify retest effects, and to align scores on different tests of the same construct (such as identifying cutpoints for dementia on cognitive screening tests). Equipercentile equating is a well-accepted tool for comparing psychiatric diagnostic instruments (Furukawa et al., 2009; Leucht et al., 2005; Montoya et al., 2011; Noonan et al., 2011; Schennach-Wolff et al., 2010) and for identifying clinically relevant benchmarks and crosswalks on neuropsychological tests (Fong et al., 2009, 2011). The procedure offers a robust approach to better understanding longitudinal change. The present study demonstrated an innovative application of equating methods to longitudinal settings in which participants or patients are followed over long periods of time.
Dr. Gross was supported by a National Institutes of Health Translational Research in Aging fellowship (T32AG023480-07). Dr. Inouye holds the Milton and Shirley F. Levy Family Chair in Alzheimer’s Disease. This work was supported in part by Grant No. P01AG031720 (SKI) from the National Institute on Aging. Dr. Rebok is an investigator with Compact Disc Incorporated for the development of an electronic version of the ACTIVE memory intervention. He has received no financial support from them for ACTIVE. Dr. Brandt receives royalty income from Psychological Assessment Resources, Inc., on sales of the Hopkins Verbal Learning Test-Revised. Drs. Rebok’s and Brandt’s relationships are managed by the Johns Hopkins University according to its established conflict of interest policies.
The ACTIVE intervention trials are supported by grants from the National Institute on Aging and the National Institute of Nursing Research to Hebrew Senior Life (U01NR04507), Indiana University School of Medicine (U01NR04508), Johns Hopkins University (U01AG14260), New England Research Institutes (U01AG14282), Pennsylvania State University (U01AG14263), the University of Alabama at Birmingham (U01 AG14289), and the University of Florida (U01AG14276).
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129, K01 AG030514, and the Dana Foundation.
*Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf