The Alzheimer’s Disease Assessment Scale-Cognitive (ADAS-Cog) has been used widely as a cognitive end point in Alzheimer’s Disease (AD) clinical trials. Efforts to treat AD pathology at earlier stages have also used ADAS-Cog, but failure in these trials can be difficult to interpret because the scale has well-known ceiling effects that limit its use in mild cognitive impairment (MCI) and early AD. A wealth of data exists in ADAS-Cog from both historical trials and contemporary longitudinal natural history studies that can provide insights about parts of the scale that may be better suited for MCI and early AD trials.
Using Alzheimer’s Disease Neuroimaging Initiative study data, we identified the most informative cognitive measures from the ADAS-Cog and other available scales. We used cross-sectional analyses to characterize trajectories of ADAS-Cog and its individual subscales, as well as other cognitive, functional, or global measures, across disease stages. Informative measures were identified based on the standardized mean 2-year change from baseline and were combined into novel composite endpoints. We assessed the performance of the novel endpoints based on sample size requirements for a 2-year clinical trial. A bootstrap validation procedure was also undertaken to assess the reproducibility of the standardized mean changes of the selected measures and the corresponding composites.
All proposed novel endpoints have improved standardized mean changes and thus improved statistical power compared with the ADAS-Cog 11. Further improvements were achieved by using cognitive–functional composites. Combining the novel composites with an enrichment strategy based on cerebrospinal fluid beta-amyloid (Aβ1–42) in a 2-year trial yielded gains in power of 20% to 40% over ADAS-Cog 11, regardless of the novel measure considered.
An empirical, data-driven approach with existing instruments was used to derive novel composite scales based on ADAS-Cog 11 with improved performance characteristics for MCI and early AD clinical trials. Together with patient enrichment based on Aβ1-42 pathology, these modified endpoints may allow more efficient clinical trials in these populations and can be assessed without modifying current test administration procedures in ongoing trials.
Alzheimer’s disease (AD) clinical research has entered a new era of therapeutics that aim to modify underlying disease pathology rather than ameliorate symptoms [1–3]. Among the challenges faced by disease-modifying strategies targeting early disease stages is uncertainty over appropriate endpoints for early AD and mild cognitive impairment (MCI) trials. The standard research tool for cognitive assessment in clinical trials is the Alzheimer’s Disease Assessment Scale-Cognitive (ADAS-Cog), which provides a single score based on arbitrary weightings of performance on test items relevant to AD. In its original configuration, the ADAS-Cog assessed learning and memory, language, and spatial cognition, but lacked coverage of executive function; an expanded version was developed in an attempt to address this concern.
Although the ADAS-Cog has been used successfully in trials of symptomatic treatments for mild-to-moderate AD, certain features of the ADAS-Cog limit its use in earlier stages. There is a nonlinear relationship between disease severity and rate of decline, with the fastest rate of decline seen in patients with moderate AD [7,8], and slower rates of decline in patients with mild AD or MCI. Up to half of ADAS-Cog subscales demonstrate ceiling effects in subjects with mild or moderate AD, which makes these items likely to be uninformative in earlier stages of the disease. Limited coverage of early cognitive deficits compounds the insensitivity of ADAS-Cog to mild progression. Last, variability in ADAS-Cog scores can exceed the annual rate of change in clinical trials [11,12], with variability in uninformative subscales potentially obscuring changes on items that can actually track mild deficits.
Another endpoint-related challenge for disease-modifying trials is the precedent set in symptomatic antidementia trials requiring statistical significance on separate primary endpoints of cognition and function [13,14]. This requirement translates into larger sample sizes needed to achieve the same study power compared with a single primary endpoint. Composite cognitive–functional endpoints optimized for earlier stages of disease could therefore provide an alternative to the high hurdle of co-primary endpoints and have been proposed as a methodological advance that may help early AD trials succeed.
We hypothesized that by eliminating less informative items from ADAS-Cog and substituting more responsive measures of cognition or function, we could improve sensitivity to change, reduce variability and develop a single scale optimized for MCI and early AD trials. We derived new cognitive and cognitive–functional composite scales based on ADAS-Cog and other existing instruments, and compared their power with the ADAS-Cog in these populations. We also compared the efficiency of the novel and composite endpoints with and without biomarker enrichment based on the presence of amyloid beta (Aβ) pathology in the MCI cohort. Last, we used a bootstrap procedure to assess the reproducibility of the standardized mean changes.
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Analyses were performed on data downloaded from the ADNI web portal using a data cutoff of November 15, 2010. Of 819 subjects in the data set, 798 were included in the analyses; 21 subjects who reverted from AD to MCI were excluded. Five subjects not labeled as converters from MCI to AD in the original download were relabeled as converters for the analysis to resolve discrepancies between conversion status and current diagnosis. Conversion status was extracted from the ADNI fields for clinical category. The final analysis set at baseline included 229 normal elderly control subjects (NECs), 212 MCI-nonconverters (MCI-NCs), 165 MCI-converters (MCI-Cs), and 192 subjects with AD. Of the 377 subjects with MCI at baseline, 142 (64 MCI-NCs [30.2%] and 78 MCI-Cs [47.3%]) had cerebrospinal fluid (CSF) Aβ1–42 ≤ 192 pg/mL, the enrichment cutoff we chose to follow based on published cutoffs that discriminate between MCI converters and nonconverters as defined in ADNI.
One hundred ninety-eight NECs, 138 MCI-NCs, 139 MCI-Cs, and 131 subjects with AD had data for visits at months 12 and 24; 168 NECs, 99 MCI-NCs, and 103 MCI-Cs had data for month 36. The study protocol did not follow subjects with AD beyond 2 years. We created 15 cross-sectional cohorts by clinical category and visit month utilizing subjects with data for visits at baseline and months 12, 24, and 36: NEC, months 0, 12, 24, and 36; MCI-NC, months 0, 12, 24, and 36; MCI-C, months 0, 12, 24, and 36; and AD, months 0, 12, and 24.
In this article, we considered several cognitive, functional, and global measures that are available in ADNI as described in the ADNI General Procedures Manual. We evaluated ADAS-Cog 11, ADAS-Cog 13, and the individual ADAS-Cog subscales: Word Recall (Q1), Commands (Q2), Construction (Q3), Delayed Word Recall (Q4), Naming (Q5), Ideational Praxis (Q6), Orientation (Q7), Word Recognition (Q8), Recall Instructions (Q9), Spoken Language (Q10), Word Finding Difficulty (Q11), Comprehension (Q12), and Number Cancellation (Q14). In addition, the following scales were considered: Mini-Mental State Examination (MMSE); Boston Naming Test (BNT); Clock Drawing and Clock Copying Tests; Wechsler Adult Intelligence Scale–Revised Digit Symbol Substitution Test (Digit Symbol); Digit Span Backward and Forward Tests (Digit Backward, Digit Forward); Logical Memory Immediate and Delayed Recall Tests (LM Immed, LM Delayed); Trail Making Tests A and B (Trails A and Trails B); Auditory–Verbal Learning Tests Delayed, Immediate, with Interference and Recognition (AVLT Delayed, AVLT Immed, AVLT Inter, AVLT Recog); Clinical Dementia Rating–Sum of Boxes (CDR-SB); Functional Activities Questionnaire (FAQ); and Category Fluency Tests Animals and Vegetables (Category Animals, Category Vegetables).
We used the R statistical computing platform, version 2.13.0, and libraries available there.
For the 15 cross-sectional cohorts defined in Section 2.1, values of ADAS-Cog 11, ADAS-Cog 13, all the ADAS-Cog subscales, as well as all the other measures listed in Section 2.1.1 were extracted (Supplemental Table 1). Summary statistics and boxplots for each subscale and cohort were generated to identify ceiling effects, floor effects, and change across clinical categories (e.g., MCI to AD). Ceiling effects occur when the measure cannot distinguish among the best performers (i.e., scores that plateau at 0% errors or 100% correct). We define a ceiling effect for a measure if at least 10% of the subjects in at least one cohort achieve perfect scores. Conversely, floor effects occur when the measure cannot distinguish among the poorest performers (0% correct or 100% errors). For this article, a floor effect is defined if at least 10% of the subjects in at least one cohort get no test items correct. We describe change across clinical categories as the difference in the median percent scores of the NEC and the AD cohorts at baseline.
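A minimal sketch of the ceiling, floor, and change-across-categories rules defined above, written in Python rather than the R used for the analyses. It assumes each measure's scores have already been converted to percent errors (0 = perfect, 100 = no items correct), matching the boxplot axis; the cohort labels and function names are hypothetical, and the toy data are not ADNI values.

```python
from __future__ import annotations

from statistics import median

# Hypothetical layout: cohort label -> percent-error scores for one measure
# (0.0 = perfect performance, 100.0 = no items correct).


def fraction_at(scores: list[float], value: float) -> float:
    """Fraction of subjects in a cohort whose score equals `value` exactly."""
    return sum(1 for s in scores if s == value) / len(scores)


def has_ceiling_effect(cohorts: dict[str, list[float]], cutoff: float = 0.10) -> bool:
    """Ceiling effect: at least 10% of subjects in at least one cohort make 0% errors."""
    return any(fraction_at(s, 0.0) >= cutoff for s in cohorts.values())


def has_floor_effect(cohorts: dict[str, list[float]], cutoff: float = 0.10) -> bool:
    """Floor effect: at least 10% of subjects in at least one cohort get no items correct."""
    return any(fraction_at(s, 100.0) >= cutoff for s in cohorts.values())


def change_across_categories(cohorts: dict[str, list[float]],
                             nec: str = "NEC 0", ad: str = "AD 0") -> float:
    """Median percent-error score of the baseline AD cohort minus that of baseline NEC."""
    return median(cohorts[ad]) - median(cohorts[nec])


# Toy data for illustration only.
cohorts = {
    "NEC 0": [0.0, 0.0, 0.0, 10.0, 20.0],
    "AD 0": [40.0, 50.0, 60.0, 80.0, 100.0],
}
print(has_ceiling_effect(cohorts), has_floor_effect(cohorts),
      change_across_categories(cohorts))  # True True 60.0
```

Exact equality is used for perfect and zero scores because these are the integer-valued end points of each scale; scores derived by division may need a small tolerance instead.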
Standardized measures (z scores) were calculated for all cognitive and functional scales and subscales of the ADAS-Cog described in Section 2.1.1 across all subjects in all diagnostic categories by subtracting the baseline mean and dividing by the baseline standard deviation.
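The baseline standardization can be illustrated as follows. This is a sketch rather than the authors' code: it assumes the sample SD (n − 1 denominator), which the text does not specify, and `baseline_standardizer` is a hypothetical name.

```python
from statistics import mean, stdev


def baseline_standardizer(baseline_scores):
    """Return a function mapping raw scores to z scores against the pooled baseline.

    `baseline_scores` pools baseline values from all diagnostic categories, so a
    score from any later visit is expressed in baseline-SD units.
    """
    mu = mean(baseline_scores)
    sd = stdev(baseline_scores)  # sample SD (n - 1 denominator); an assumption
    return lambda x: (x - mu) / sd


# Toy example: baseline mean 12, SD 2.
to_z = baseline_standardizer([10, 12, 14])
print([to_z(x) for x in (12, 16)])  # [0.0, 2.0]
```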
Standardized mean 2-year change from baseline was used to select informative subscales and measures for target disease-stage cohorts (AD, MCI, or MCI with Aβ1–42 pathology, defined by a threshold of CSF Aβ1–42 ≤ 192 pg/mL [MCI-Aβ]). We focused on the MCI and the MCI with Aβ1–42 pathology cohorts for this article. Measures with standardized mean changes exceeding a threshold of 0.4 for MCI were selected to develop composite scores. We selected this cutoff based on comparison with ADAS-Cog, which has an effect size of 0.54 (see Section 3.2.1), reasoning that subscales and tests with standardized effect sizes well below this size would not yield sufficient gains for inclusion in novel composites. For the component selection, only subjects with longitudinal follow-up visits were considered. In addition, differential sensitivity of ADAS-Cog 11 subscales to changes in MCI compared with AD was assessed by evaluating change at year 1 and year 2 from baseline for both groups.
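The selection rule can be sketched as below, assuming paired baseline and month-24 scores for subjects seen at both visits. Because measures such as MMSE fall as disease progresses while ADAS-Cog error scores rise, this sketch compares the absolute standardized change with 0.4; that sign handling is an assumption, as are the function names and toy data.

```python
from statistics import mean, stdev


def standardized_mean_change(baseline, month24):
    """Mean 2-year change divided by the SD of the change (paired by subject)."""
    changes = [y - x for x, y in zip(baseline, month24)]
    return mean(changes) / stdev(changes)


def select_components(measures, threshold=0.4):
    """Keep measures whose |standardized mean 2-year change| exceeds `threshold`.

    `measures` maps a measure name to (baseline, month24) score lists restricted
    to subjects seen at both visits.
    """
    selected = {}
    for name, (baseline, month24) in measures.items():
        smc = standardized_mean_change(baseline, month24)
        if abs(smc) > threshold:  # absolute value: an assumption about direction
            selected[name] = smc
    return selected


# Toy data (not ADNI values): A worsens upward, B is noise, C worsens downward.
measures = {
    "A": ([10, 10, 10, 10], [11, 12, 13, 14]),
    "B": ([5, 5, 5, 5], [6, 4, 7, 3]),
    "C": ([28, 28, 28, 28], [27, 26, 25, 24]),
}
print(sorted(select_components(measures)))  # ['A', 'C']
```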
We assessed the performance of the composite measures and compared them with existing measures by calculating the power to detect a hypothesized 25% treatment effect [21,22] in a clinical trial. For this, we built statistical models of disease progression for each measure using linear mixed-effects models with random slopes and intercepts, on which we based power calculations for a 2-year, two-arm, parallel-design clinical trial. Power calculations were done for subjects with (i) early AD, (ii) MCI, and (iii) MCI-Aβ at sample sizes ranging from 100 to 1000 per treatment arm. The disease progression model and the power calculations used all available data from visits from baseline up to and including month 24 for the 192 AD subjects, the 377 subjects with MCI, and the 142 subjects with MCI-Aβ pathology.
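The article's power calculations come from linear mixed-effects models, which are not reproduced here. As a rough cross-check only, the sketch below substitutes a simpler textbook approximation: a two-sample z test on 2-year change, assuming equal SDs in both arms and that treatment slows mean decline by 25%. Under these assumptions the between-arm effect in SD units is 0.25 times the standardized mean change, so the required sample size scales with the inverse square of that change. Function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1


def power_two_arm(smc, n_per_arm, effect=0.25, alpha=0.05):
    """Approximate power of a two-sided z test comparing 2-year change between arms.

    smc: standardized mean 2-year change (mean / SD) in the untreated arm.
    effect: fractional slowing of mean decline in the treated arm.
    The negligible probability of rejecting in the wrong direction is ignored.
    """
    d = effect * abs(smc)                       # between-arm difference in SD units
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return STD_NORMAL.cdf(d * sqrt(n_per_arm / 2) - z_crit)


def n_per_arm(smc, power=0.80, effect=0.25, alpha=0.05):
    """Smallest per-arm sample size reaching the target power under the same model."""
    d = effect * abs(smc)
    z_total = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_total / d) ** 2)


# Using the ADAS-Cog effect size of 0.54 quoted in the text:
print(n_per_arm(0.54))  # 862
```

This crude approximation gives about 860 subjects per arm for an effect size of 0.54, not the 772 reported from the mixed-effects models, and the two are not expected to agree because the methods differ. It is mainly useful for relative comparisons: because n scales with 1/d², a measure whose standardized change is 1.5 times larger needs roughly 1/2.25, or about 44%, of the sample size.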
To assess whether the selected components of the composites could be selected reproducibly using this procedure in similar populations, we performed a bootstrap validation of the entire composite component selection process. Furthermore, we assessed whether these components, and the corresponding composites, demonstrated similar standardized mean changes and, consequently, similar power for clinical trials across the bootstrap samples. This was implemented as follows: A total of 1000 bootstrap samples of subjects were selected with replacement from the population of subjects with MCI in ADNI who had visits recorded for up to 24 months. For each bootstrap sample, for each measure, standardized mean 2-year changes from baseline were calculated. The empirical distribution of these standardized mean changes for each measure based on the 1000 bootstrap samples was determined.
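The resampling step can be sketched as follows, assuming each measure's 2-year changes are stored in the same subject order. Subjects, not individual measurements, are resampled, so each bootstrap sample keeps a subject's changes on all measures together. Unlike the full procedure described above, this sketch recomputes only the standardized mean changes rather than rerunning the whole component selection; function names, the seed, and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_smc(changes_by_measure, n_boot=1000, seed=0):
    """Empirical bootstrap distribution of the standardized mean change per measure.

    changes_by_measure maps measure name -> list of 2-year changes, one per
    subject, in the same subject order for every measure. A degenerate sample
    with zero SD would raise an error; with realistic data this is vanishingly rare.
    """
    rng = random.Random(seed)
    n = len(next(iter(changes_by_measure.values())))
    dist = {name: [] for name in changes_by_measure}
    for _ in range(n_boot):
        # Draw subject indices with replacement and reuse them for every measure.
        idx = [rng.randrange(n) for _ in range(n)]
        for name, changes in changes_by_measure.items():
            sample = [changes[i] for i in idx]
            dist[name].append(mean(sample) / stdev(sample))
    return dist


def share_exceeding(values, threshold=0.4):
    """Proportion of bootstrap standardized mean changes with |value| > threshold."""
    return sum(abs(v) > threshold for v in values) / len(values)


# Toy data: one clearly responsive measure and one pure-noise measure.
changes = {
    "strong": [1.0 + 0.1 * (i % 10) for i in range(50)],
    "noise": [(-1) ** i * (1.0 + 0.01 * i) for i in range(50)],
}
dist = bootstrap_smc(changes)
print({name: share_exceeding(values) for name, values in dist.items()})
```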
Table 1 shows the baseline characteristics of the AD, MCI, and MCI-Aβ groups for MMSE, global CDR, CDR-SB, ADAS-Cog 11, age, gender, and APOE ε4 status as well as the novel composite scales, CC1, CC2, CFC1, CFC2 and CFC3. There is no significant difference in age across the three groups. The AD group has more women and has slightly less education than the MCI and MCI-Aβ groups. Both the AD and MCI-Aβ groups have more APOE ε4 carriers than the MCI group. As expected, the AD group showed greater severity at baseline on all clinical measures.
Fig. 1 shows boxplots of the trajectories across the 15 cohorts for ADAS-Cog 11 and ADAS-Cog 13, as well as each ADAS-Cog subscale. The y-axis represents errors as a percentage of the maximum possible score and the x-axis represents the 15 cohorts (from left to right: NEC 0, 12, 24, 36 through AD 0, 12, and 24). In each plot, for each disease stage (NEC to AD), the medians of the boxplots trace the course of the particular subscale over the duration observed. Placed side by side, the median traces for successive stages suggest the disease trajectory of the given subscale across the disease spectrum observable in ADNI. The analysis identified several ADAS-Cog subscales that exhibit ceiling effects in virtually every cohort: Commands (Q2), Construction (Q3), Naming (Q5), Praxis (Q6), Recall Instructions (Q9), Language (Q10), Word Finding Difficulty (Q11), and Comprehension (Q12). Orientation (Q7) and Number Cancellation (Q14) also exhibit ceiling effects to a lesser degree. Delayed Word Recall (Q4) and Word Recognition (Q8) exhibit modest ceiling effects for the NEC cohorts, although this is not visible in the plots. ADAS-Cog 11, ADAS-Cog 13, and Word Recall (Q1) do not exhibit ceiling effects for any cohort. Delayed Word Recall (Q4), Word Recognition (Q8), and Number Cancellation (Q14) exhibit floor effects in the AD cohort. Delayed Word Recall (Q4) shows the largest amount of change across clinical categories, followed by Word Recall (Q1), Word Recognition (Q8), and Number Cancellation (Q14). A summary of the findings in the boxplots is given in Supplemental Table 1.
To understand more completely the different sensitivities of ADAS-Cog subscales to progression in MCI compared with AD, we evaluated the changes from baseline for each subscale at months 12 and 24 (Supplemental Fig. 1). A similar pattern and magnitude of change was seen between MCI and AD for several subscales: Word Recall (Q1), Delayed Word Recall (Q4), Orientation (Q7), Word Recognition (Q8), Recall Instructions (Q9), and Language (Q10). Subjects with MCI changed less than subjects with AD on Commands (Q2), Construction (Q3), Naming (Q5), Praxis (Q6), Word Finding Difficulty (Q11), and Comprehension (Q12), suggesting these items may make ADAS-Cog 11 less sensitive to changes in cognitive function in subjects with MCI.
We performed the same set of analyses on MMSE, BNT, Clock Drawing and Copying, Digit Symbol, Digit Span, LM Delayed and Immed, Trails A and B, the AVLT components, CDR-SB, FAQ, and the category tests. Fig. 2 shows the resulting boxplots. Ceiling effects in the MCI group are seen in MMSE, BNT, Clock Copying and Drawing Tests, AVLT Recog, and FAQ. Floor effects, especially for the AD group, are seen in LM Delayed, AVLT Delayed, and Trails B. Less severe floor effects are seen in LM Immed, AVLT Inter, AVLT Recog, and Trails A. However, with the exception of Trails A, tests that exhibit floor effects also exhibit, in general, the largest changes across disease stages.
We plotted the mean/standard deviation (SD) versus SD of 2-year changes from baseline for the MCI cohort, for all items tested (Fig. 3). ADAS-Cog 11 and ADAS-Cog 13 are shown for comparison. Subscales of the ADAS-Cog identified in Fig. 1 as having ceiling effects—Commands (Q2), Construction (Q3), Naming (Q5), Praxis (Q6), Recall Instructions (Q9), Language (Q10), Word Finding Difficulty (Q11), Comprehension (Q12)—fall in the lower right region in Fig. 3, indicating large variability and correspondingly small effect sizes. ADAS-Cog subscales that exceeded our target standardized mean change threshold of 0.4 for the MCI cohort were Word Recall (Q1), Delayed Word Recall (Q4), and Orientation (Q7). They had small standardized 2-year changes and relatively low variability, except for Orientation. Additional cognitive measures that met the selection threshold were AVLT-Immed, which had the smallest standardized 2-year change of any item selected but also the lowest variability, and MMSE, which had a relatively large 2-year change but high variability that diminished its standardized 2-year change (compared with ADAS-Cog 11, Word Recall, and Orientation). The FAQ and CDR-SB both had greater 2-year changes than ADAS-Cog 11 or ADAS-Cog 13. Variability of 2-year changes was greater for CDR-SB than for FAQ, ADAS-Cog 11, or ADAS-Cog 13.
By eliminating subscales with poor standardized mean changes, we developed the following six composites using standardized scores for selection. ADAS-3 is the sum of Word Recall (Q1), Delayed Word Recall (Q4), and Orientation (Q7) in ADAS-Cog. Cognitive composite 1 (CC1) combines ADAS-3, AVLT-Immed, and MMSE taking directionality of change into account. Cognitive composite CC2 is composed of ADAS-3 and the cognitive portion of CDR-SB. Cognitive–functional composite CFC1 combines CC1 with FAQ, CFC2 combines CC2 with FAQ, and CFC3 combines the cognitive portion of CDR-SB with FAQ.
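A minimal sketch of how such composites might be assembled from baseline-standardized scores. The article does not state the weighting, so an unweighted sum of direction-aligned z scores is assumed here, including for the balance between cognitive and functional parts; ADAS-3 is assumed to be the raw sum of Q1, Q4, and Q7 before standardizing; and all dictionary keys (for example `CDR_cog` for the cognitive portion of CDR-SB) are hypothetical. Signs are chosen so that a higher composite always means worse performance: ADAS-Cog error scores, FAQ, and CDR-SB rise with decline, whereas MMSE and AVLT-Immed fall.

```python
from statistics import mean, stdev

# +1: higher score = worse; -1: higher score = better (assumed per component).
DIRECTION = {"ADAS3": +1, "AVLT_Immed": -1, "MMSE": -1, "CDR_cog": +1, "FAQ": +1}

# Component lists implied by the text; equal weighting is an assumption.
COMPOSITES = {
    "CC1": ["ADAS3", "AVLT_Immed", "MMSE"],
    "CC2": ["ADAS3", "CDR_cog"],
    "CFC1": ["ADAS3", "AVLT_Immed", "MMSE", "FAQ"],
    "CFC2": ["ADAS3", "CDR_cog", "FAQ"],
    "CFC3": ["CDR_cog", "FAQ"],
}


def adas3(q1, q4, q7):
    """ADAS-3: sum of Word Recall, Delayed Word Recall, and Orientation scores."""
    return q1 + q4 + q7


def composite_score(subject, baseline, name):
    """Sum of direction-aligned z scores (baseline mean and SD) for one composite."""
    total = 0.0
    for comp in COMPOSITES[name]:
        z = (subject[comp] - mean(baseline[comp])) / stdev(baseline[comp])
        total += DIRECTION[comp] * z
    return total


# Toy baseline reference values and one subject (not ADNI data).
baseline = {"ADAS3": [8, 10, 12], "AVLT_Immed": [40, 45, 50], "MMSE": [26, 28, 30],
            "CDR_cog": [0, 1, 2], "FAQ": [0, 2, 4]}
subject = {"ADAS3": adas3(6, 6, 2), "AVLT_Immed": 35, "MMSE": 24, "CDR_cog": 3, "FAQ": 6}
print(composite_score(subject, baseline, "CC1"),
      composite_score(subject, baseline, "CFC1"))  # 6.0 8.0
```

Under this equal-weight assumption the cognitive part of CFC1 outweighs FAQ three to one; a trial team adopting such a composite would need to fix the weighting explicitly, since it changes both the composite's variance and its interpretation.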
Our novel cognitive–functional composites (CFC1, CFC2, and CFC3) have the largest standardized 2-year changes of any item (Fig. 3), in part driven by smaller variability than FAQ, CDR-SB, or ADAS-Cog 11, although the variability of CFC3 was slightly greater than that of ADAS-Cog 13. Our reduced ADAS-3 had the smallest variability and smallest standardized 2-year change of all the novel composites. CC1 had a greater 2-year change than ADAS-Cog 11 or ADAS-Cog 13, with variability similar to ADAS-Cog 13 and less than ADAS-Cog 11. The novel cognitive composite CC2, which combines ADAS-3 and the cognitive portion of CDR-SB, takes advantage of the low variability of the former and the high standardized change of the latter, resulting in a cognitive composite comparable with the cognitive–functional composites.
Fig. 4 shows plots of statistical power (y-axis) as a function of sample size per treatment arm (x-axis), for a hypothesized 25% treatment effect, for each novel composite as well as for ADAS-Cog 11, ADAS-Cog 13, and CDR-SB, in the three cohorts (AD, MCI, and MCI-Aβ). Although our focus is on MCI, we also compared the utility of the novel composites in AD.
Table 2 summarizes the 2-year rate of change on ADAS-Cog 11, ADAS-Cog 13, CDR-SB, FAQ, and the novel composites for the MCI, MCI-Aβ, and AD populations. Also shown is the number of subjects per arm needed to detect a 25% treatment effect with 80% power.
In subjects with MCI, clear advantages for the novel composites are seen over ADAS-Cog 11 and ADAS-Cog 13. With a target 25% treatment effect, ADAS-Cog 11 requires 772 patients per arm for 80% power and ADAS-Cog 13 requires 582 per arm. This level of power is achieved by CDR-SB and the novel cognitive–functional composites with far fewer patients, fewer than 400 per arm. Indeed, it is possible to achieve 80% power with as few as 375 patients per arm using CDR-SB or 302 patients per arm using the best cognitive–functional composite, CFC2, given a 25% target treatment effect.
Additional gains in power are seen with the MCI-Aβ population for all scales, including ADAS-Cog 11. For a target treatment effect of 25%, it would be possible to achieve 80% power using CDR-SB with approximately 224 patients per arm, compared with 189 for CFC2, whereas this level of power would require 395 patients per arm for ADAS-Cog 13 and about 532 per arm for ADAS-Cog 11.
Supplemental Fig. 2 shows boxplots of the empirical distributions of standardized mean changes for each measure based on 1000 bootstrap samples from the subjects with MCI in ADNI with at least 2 years of visits. The empirical distributions for the measures selected from the original population lay mostly above the threshold of 0.4: Word Recall (99% of bootstrap samples exceeded 0.4), Delayed Word Recall (88%), Orientation (98%), MMSE (99%), AVLT Immed (80%), FAQ (100%), and CDR-SB (100%). Conversely, the empirical distributions for measures that had poor standardized mean changes in the original population largely fell well below the threshold. These results indicate that the standardized mean changes of the selected measures, as well as of their composites, are reproducible in populations similar to the original population.
We present several novel composite scores for early AD developed by (i) improving the ADAS-Cog for MCI populations by removing uninformative subscales, (ii) supplementing with other sensitive cognitive measures for MCI, and (iii) supplementing with a functional measure to produce a single composite end point that could be used in clinical trials in lieu of co-primary endpoints. We evaluated the resulting composite scores as outcome measures for 2-year clinical trials involving patients with MCI or MCI with Aβ pathology (based on a threshold of CSF Aβ1–42 ≤ 192 pg/mL), and for comparison with AD. Linear mixed-effects disease progression models were used to develop statistical power calculations for each novel measure and for several existing measures, and these show increased power for the novel composites compared with ADAS-Cog 11 or ADAS-Cog 13. The largest gains in power were seen with our novel cognitive–functional endpoints. With a target 25% treatment effect, ADAS-Cog 11 would require 772 subjects with MCI per arm to achieve 80% power, whereas our novel cognitive–functional composites CFC1, CFC2, and CFC3 all require fewer than 350 subjects per arm.
Ceiling effects for the subscales removed from the ADAS-Cog were also seen in three clinical trials of donepezil in AD [8,23,24], in which perfect scores on these subscales were achieved by 55% to 82% of subjects with AD. Conversely, the subscales retained in the novel composites have face validity for early AD. Impaired recall is a well-known early symptom of disease, and recall scores quickly reach “floor” levels, whereas Orientation has previously been noted in a clinical trial as the single ADAS-Cog subscale to show significant decline in mild AD. The composition of ADAS-3, which forms the core of our novel composites (with the exception of CFC3, which uses the cognitive portion of CDR-SB), is also consistent with findings from the MCI trial of donepezil and vitamin E, in which 81% of the errors among the MCI group on ADAS-Cog items stemmed from word list recall or word recognition deficits, suggesting that these items should be included in any composite designed for MCI trials.
Cognitive composite CC1 achieved modestly improved effect size and power over ADAS-3 by incorporating MMSE and AVLT-Immed. Although AVLT-Immed is a sensitive measure of episodic memory deficits, and its inclusion is perhaps unsurprising, MMSE is a less obvious choice for increasing the accuracy of a cognitive composite given its well-known variability, which is also reflected in Fig. 3. MMSE, however, is a fairly robust marker of disease severity, and rates of decline on functional as well as cognitive [28,29] scales have been shown to vary with baseline MMSE scores. Because performance on ADAS-Cog, and by extension its components, is highly correlated with MMSE, the presence of MMSE in the cognitive composite may reflect the correlation of this scale with other assessments of AD. One concern regulators and clinical “trialists” may have about using measures like ADAS-3 as outcome measures is that their performance is driven by too few domains. Including MMSE mitigates this type of concern without much loss in effect size.
Cognitive composite CC2 combines ADAS-3 and the cognitive portion of CDR-SB into the best-performing cognitive endpoint, which exceeded even some of the cognitive–functional composites in standardized mean change and power. Additional gains over the cognitive portion of CDR-SB alone may be a result of the increased specificity of ADAS-3 items for domains affected in MCI.
The regulatory requirement for co-primary endpoints in AD trials has, in general, mandated statistical significance on each co-primary endpoint at the studywise false-positive rate. This typically requires larger trials to achieve the same study power compared with a single primary endpoint. An alternative approach is to use a single composite end point, such as the CDR-SB, to evaluate a primary cognitive deficit and its clinical relevance simultaneously. The CDR-SB has been proposed for early AD trials, and it performed very well in our power calculations. Indeed, the cognitive component of CDR-SB outperformed all existing cognitive measures in the analysis of 2-year change, supporting its current use in MCI and prodromal AD trials. It is interesting to observe, however, that the effect size of the CDR-SB score in the MCI population is driven more by its cognitive component than its functional component, suggesting that the functional part of this composite could be optimized further for the mild functional deficits seen in this population. We noted that FAQ captured more functional deficits than the functional portion of CDR-SB in subjects with MCI, prompting us to use it as the functional core of all our cognitive–functional composites. CDR-SB and all three novel cognitive–functional composites (CFC1, CFC2 and CFC3) demonstrated adequate standardized mean changes for the MCI and MCI-Aβ cohorts, suggesting that any would be appropriate for MCI or prodromal AD clinical trials. However, the variability of the novel composites was less than that of CDR-SB. Another potential advantage of most of the novel composites relates to their independence from the clinical expertise of the rater. Clinician-rated scales such as CDR-SB, which perform well in ADNI, may be less reliable in clinical trial settings compared with scales that rely on quantitative assessments or on patient-reported outcomes.
A principal concern with cognitive–functional composites is that the cognitive component of a composite may drive statistical significance without movement on the functional component. In all our novel cognitive–functional composites, the standardized mean change in the functional component matched that in the cognitive component, in contrast to CDR-SB. Furthermore, FAQ shows moderately good correlations with the cognitive components of all the composites (0.51 with CC1 in CFC1, 0.62 with CC2 in CFC2, and 0.65 with the cognitive portion of CDR-SB in CFC3), suggesting reasonably well-balanced contributions from the cognitive and functional components. Previous work suggests a close relationship between changes on measures of function and cognition in AD [32–34]. The relationship is less well characterized for MCI, in which the expected deficits are different and milder. However, including measures with similar standardized mean changes and reasonable correlations should allay concerns over independent cognitive effects alone driving change in a cognitive–functional composite.
Here we focused on the statistical development of novel composites, using quantitative assessments to select components of preexisting measures for improved efficiency in clinical trials, followed by a statistical evaluation of their performance. Although we did not undertake a comprehensive psychometric assessment, we assessed features relevant to a psychometric approach, either within the main analyses or as supplemental analyses.
Internal responsiveness, representing the ability of a measure to change over time, was measured using standardized response means, calculated as the ratio of the mean change from baseline to the SD of the change from baseline for a given group. Standardized response means of the novel composites (which correspond to the standardized mean changes in Fig. 3) suggest that the internal responsiveness of these measures was moderate to large, comparable with CDR-SB and superior to ADAS-Cog 11 or ADAS-Cog 13. Floor and ceiling effects for the individual components were described above. Factor analysis with principal factors extraction and promax rotation (a type of nonorthogonal rotation) was performed on the baseline and 2-year change from baseline values of the components of each composite to explore the structural validity of the novel composites. The results are presented in Supplemental Table 2. Unlike the findings of Coley et al. on a different cohort, the cognitive and functional components of our composites show mixed loadings between the two factors for both the MCI cohorts and the AD cohort. Interestingly, the components of CDR-SB (results not shown) for these cohorts in ADNI also show mixed loadings between the two factors, suggesting that this may reflect cohort-related differences between the ADNI cohorts and the cohort considered by Coley et al. Convergent validity was assessed using Spearman’s correlations at baseline between the composites and the components, as well as reference measures MMSE, ADAS-Cog 11, CDR-SB, and FAQ in the MCI and AD cohorts (Supplemental Table 3). External responsiveness was assessed in part using Spearman’s correlations of 2-year changes from baseline between the composites and reference measures MMSE, ADAS-Cog 11, CDR-SB, and FAQ.
In both cases, moderate to high correlations (0.46–0.82) were observed (Supplemental Table 3), although there were some exceptions for ADAS-Cog 11 and MMSE with cognitive–functional composite CFC3, and for CDR-SB and FAQ with cognitive composite CC1 in the MCI cohorts. Overall, these exploratory analyses suggest adequate, if preliminary, validity of our novel composites, and support their further development and psychometric characterization.
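The two statistics used in these responsiveness checks can be sketched in a few lines: the standardized response mean is the mean change divided by the SD of the change, and Spearman's correlation is Pearson's correlation computed on ranks, with tied values given their average rank. This is an illustrative standard-library implementation with hypothetical function names and toy data; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from statistics import mean, stdev


def standardized_response_mean(changes):
    """Internal responsiveness: mean change from baseline / SD of that change."""
    return mean(changes) / stdev(changes)


def average_ranks(xs):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and xs[order[end + 1]] == xs[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


# Toy 2-year changes for a composite and a reference measure (same subject order).
composite_change = [0.5, 1.2, 0.9, 2.0, 1.5]
reference_change = [1, 3, 2, 6, 4]
print(spearman(composite_change, reference_change))  # 1.0
```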
In these analyses, we used raw scores standardized against ADNI population baseline values, as described in Section 2.2.2, rather than the normative scores typically used for cognitive measures in the psychometric literature. Our interest is primarily in longitudinal change from baseline, and because subjects act as their own controls, the need for normative scores is diminished. Furthermore, our approach to standardization allows comparison across all measures (cognitive and functional, and potentially biomarker endpoints as well), which would not be possible with normative scores because the limited normative datasets that exist cover only cognitive endpoints.
The approach presented here can be applied to data from proof-of-concept or Phase II trials to facilitate the selection of optimal endpoints to take forward into later stages of testing. Furthermore, our proposed composites can be analyzed as primary endpoints without altering current test administration procedures or materials. That is to say, any of the novel composites can be calculated from scores of ADAS-Cog or other endpoints administered intact. Interpretability of the findings, as well as clinical meaningfulness, can also be addressed further by specifying key components of the composites and clinical measures as secondary endpoints.
We confirmed that a prespecified MCI sample enrichment strategy, using a threshold of CSF Aβ1–42 ≤ 192 pg/mL, can be used in conjunction with our novel endpoints to enhance trial design. Donohue et al. also show gains for ADAS-Cog 11 (about 40%) and CDR-SB (about 70%) with an MCI-Aβ cohort using simulations. Using an enrichment strategy based on Aβ pathology yields a 20% to 40% increase in power, regardless of the outcome measure considered.
Several potential limitations of this work pertain to our methodology. Our method does not penalize redundant or correlated items. For example, 2-year changes in ADAS-3 correlate moderately with those in MMSE (r = −0.52), AVLT Immed (r = −0.47), and the cognitive component of CDR-SB (r = 0.49). Although such correlations could increase variability, interrogating the same deficit through multiple instruments, some quantitative and some clinical, should assess that deficit with better overall clinical accuracy, especially given that these instruments are highly variable in general. Variance reduction techniques could yield additional statistical efficiency in the composites, but possibly at the cost of interpretability. We undertook a bootstrap validation to assess the reproducibility of the standardized mean changes of the selected measures and of their composites. This assessment clearly demonstrated the robustness of both in populations similar to the ADNI MCI population. Understanding the limits of the scales is also of interest. The scales perform moderately well into mild AD, but their performance in moderate AD, for which they were not tailored, is unknown. Similarly, performance in early MCI has not been assessed rigorously, because subjects with early MCI were not available in ADNI at the time of writing.
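A bootstrap check of the reproducibility of a standardized mean change can be sketched as resampling subjects with replacement and recomputing the SMC each time. The sketch below is an illustration under stated assumptions. The number of resamples, the seed, and the percentile interval are illustrative choices, and the actual validation procedure is not detailed in this section.

```python
import random
from statistics import mean, stdev
from typing import Sequence


def smc(changes: Sequence[float]) -> float:
    """Standardized mean change: mean 2-year change divided by its SD."""
    return mean(changes) / stdev(changes)


def bootstrap_smc(changes: Sequence[float], n_boot: int = 2000,
                  seed: int = 0) -> tuple[float, float, float]:
    """Return (point estimate, 2.5th, 97.5th percentile) of bootstrap SMCs.

    Subjects are resampled with replacement; the percentile interval is an
    assumed choice, not necessarily the procedure used in the paper.
    """
    if len(set(changes)) < 2:
        raise ValueError("need at least two distinct change scores")
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        sample = rng.choices(changes, k=len(changes))
        if len(set(sample)) > 1:  # skip degenerate resamples with zero SD
            boots.append(smc(sample))
    boots.sort()
    lo = boots[int(0.025 * (len(boots) - 1))]
    hi = boots[int(0.975 * (len(boots) - 1))]
    return smc(changes), lo, hi
```

A narrow interval around the observed SMC suggests that the estimate, and hence the implied sample size, would carry over to a similar population. A composite can be checked the same way by passing its per-subject change scores.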
We next plan to assess the impact of covariates on each proposed endpoint and to undertake analyses based on Item Response Theory (IRT) models, to understand the clinical significance of unit changes on the novel scales and the robustness of each scale's extreme values. Incorporating other biomarkers into the composites may also increase power by reducing variability (data not shown). This suggests that a composite scale capturing cognition, function, and a pathologically relevant biomarker may be feasible. Such a global composite might further advance clinical trial methodology and provide an important link to the development of surrogate endpoints.
In conclusion, novel cognitive composites based on the reduced ADAS-3 scale yield significant gains in power over ADAS-Cog 11 and ADAS-Cog 13 for MCI trials. Larger gains are achieved with composites of cognition and function, the latter based on FAQ, which is more sensitive to mild functional deficits in MCI. Finally, a dual strategy combining enrichment with improved endpoints should yield cumulative improvements in power for MCI trials.
The authors thank Harry Chen, of J&J Medical China, and Chihshan (Sandy) Lei, of Johnson & Johnson PRD, for generating analysis-ready files from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data. The authors also thank Victor Lobanov, Rudi Verbeeck, and Tim Schultz of the Informatics Center of Excellence, Johnson & Johnson PRD, for curation and data management.
Data collection and sharing for this project was funded by ADNI (National Institutes of Health [NIH] grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; AstraZeneca AB; Bayer Schering Pharma AG; Bristol-Myers Squibb; Eisai Global Clinical Development; Elan Corporation; Eli Lilly and Company; F. Hoffmann-La Roche; GE Healthcare; Genentech; GlaxoSmithKline; Innogenetics; Johnson and Johnson; Medpace, Inc.; Merck and Company, Inc.; Novartis AG; Pfizer, Inc.; Schering-Plough; Synarc, Inc.; as well as the nonprofit partners the Alzheimer’s Association and Alzheimer’s Drug Discovery Foundation, with participation from the U.S. Food and Drug Administration. Private-sector contributions to ADNI are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for NeuroImaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514, and the Dana Foundation.
The authors have no conflicts of interest to report.
N.R., M.N.S., M.F., E.Y., G.N., V.N., and A.D. are employees of Johnson & Johnson and hold stock options in Johnson & Johnson. M.G. is a consultant to Johnson & Johnson.