Assessing cognitive and functional changes at the early stage of Alzheimer's disease (AD) and detecting treatment effects in clinical trials for early AD are challenging.
Under the assumption that transformed versions of the component scores from the Mini–Mental State Examination, the Clinical Dementia Rating Scale–Sum of Boxes, and the Alzheimer's Disease Assessment Scale–Cognitive Subscale arise from a multivariate linear mixed-effects model, we calculated the sample sizes required to detect treatment effects on the annual rates of change in these three components in clinical trials for participants with mild cognitive impairment.
Our results suggest that a large number of participants would be required to detect a clinically meaningful treatment effect in a population with preclinical or prodromal Alzheimer's disease. We found that the transformed Mini–Mental State Examination is more sensitive for detecting treatment effects in early AD than the transformed Clinical Dementia Rating Scale–Sum of Boxes and Alzheimer's Disease Assessment Scale–Cognitive Subscale. The use of optimal weights to construct powerful test statistics or sensitive composite scores/endpoints can reduce the required sample sizes needed for clinical trials.
Consideration of the multivariate/joint distribution of components' scores rather than the distribution of a single composite score when designing clinical trials can lead to an increase in power and reduced sample sizes for detecting treatment effects in clinical trials for early AD.
Much effort has been devoted to developing disease-modifying treatments that intervene in the pathobiologic processes involved in the early stage of Alzheimer's disease (AD). Any therapy that is effective at treating this early manifestation of the dementia process may provide an opportunity for managing the disease while patient function is relatively preserved. Standard instruments used to quantify cognitive and functional decline in AD are relatively insensitive to changes in early AD. This raises challenges for assessing the early changes in cognition and function across the spectrum of AD and makes detecting treatment effects in clinical trials for early AD even harder.
Power analysis is standard when designing clinical trials to detect treatment effects. Ard et al. provide a comprehensive review for clinical trials in AD. Errors in the power analysis can lead to poorly chosen sample sizes. Samples that are too large may waste time, resources, and money and may unnecessarily expose some participants to an inferior treatment if the treatment could have been shown to be more effective with fewer participants. Significant underestimation of the sample size may also waste time, as the trial would be unlikely to reach conclusive findings and would therefore be unfair to all participants taking part. In this article, we are interested in the power/sample size to detect treatment effects on the component scores in clinical trials for early AD.
In the early AD literature, many researchers have used composite scores as single endpoints for performing power analysis. A composite score is typically a linear combination of the scores of sensitive instruments. It provides a univariate summary of the component scores, avoids the multiple-hypothesis testing problem that arises when each component score is considered separately, and reduces the impact of measurement error. Furthermore, it may be more sensitive to cognitive and functional decline than its separate components.
The construction of a composite score involves the selection and weighting of the component scores. Typically, the selection of the component scores is based on a broad literature review regarding sensitivity to decline of the candidate components, with equal weighting tending to be applied, possibly naively, to the chosen components. However, more statistically driven approaches can be used to derive the weights and construct more sensitive composite scores.
We therefore classify the statistical strategies used for the construction of a composite score into two major classes. The first is focused principally on selecting the most informative composite components and using prespecified weights not derived from statistical considerations; for example, Raghavan et al. identify the informative component instruments based on the standardized mean of 2-year change from baseline for a mild cognitive impairment (MCI) cohort and sum them to create a new composite score. The other is focused on "optimizing" the weights assigned to component scores based on an appropriate optimality criterion and is therefore more data driven; for example, some previous proposals find composite weights that are sensitive to clinical decline by fitting linear mixed-effects models (LMMs) to the longitudinal composite scores. Xiong et al. propose composite weights that maximize the probability of observing a decline in one participant over a unit interval of time. Their weights can be considered a special case of the composite weights proposed by Ard et al., who use the power to detect the time effect in a clinical trial as their criterion and obtain the component weights by maximizing this criterion. Ard et al.'s approach has been applied to construct a composite atrophy index. Another approach within this class is to base the estimation of the composite weights on a criterion that considers the mean to standard deviation ratio of change over time. Wang et al. propose a further composite score, constructed by using a linear clinical decline equation to select and reweight the component scores simultaneously.
In general, using composite scores as single endpoints may lose information for detecting changes in the components; for example, a large change in one component can be masked by small changes in the other component scores. Data-driven composite scores have been further criticized. First, they may lack clinical interpretability: a clinically meaningful component score may receive a small weight in a data-driven composite score. In addition, they may not be consistent across different data sets. Donohue et al. apply cross-validation to quantify the out-of-sample performance of optimal composite scores and conclude that the overall performance of the optimal composite scores is worse than that of composite scores derived without optimization.
A limited amount of the AD literature has considered power analysis with multiple endpoints, although multiple endpoints are commonplace in AD. Under the assumption that the component scores jointly follow a multivariate linear mixed-effects model (MLMM), we compare three approaches with regard to their power to detect treatment effects on the component scores. Two of them use multiple endpoints, whereas the third uses a single composite endpoint.
Mixed-effects models are a class of statistical models useful for analyzing longitudinal data. They allow a subset of the regression parameters (the random effects) to vary randomly between participants and thereby characterize the natural heterogeneity of the target population in these parameters. The fixed effects are the regression parameters that are fixed but unknown and need to be estimated.
Assuming that all possible covariates are balanced (as would be assumed in a clinical trial through randomization), we model the component scores using an MLMM with a random intercept, fixed time, and time by treatment interaction effects. (The addition of further covariates can be easily incorporated if deemed necessary.) Such a model is able to simultaneously characterize the correlations between the component scores at each time t and the correlations across time for each component score.
Let Yntj be the j-th component score of the n-th participant at visit time t, where n = 1,…,N, t = 1,…,Tn, and j = 1,…,J. Here, the number of visits Tn is a positive integer depending on the n-th participant, and the number of component scores J is prespecified. We use a linear function to link the component scores with the mixed effects,

Yntj = β0j + bnj + (β1j + βj Zn) t + εntj,

where Zn is the treatment indicator of the n-th participant, β0j and β1j are the fixed intercept and time effects, βj is the j-th component treatment effect, bnj is the random intercept that is unique to the j-th component score of the n-th participant, and εntj is the random error of the n-th participant on the j-th component score at time t. For each n, let bn = (bn1,…,bnJ)T independently follow a multivariate normal distribution with mean vector 0 and covariance matrix Σb. Here, for any matrix or vector A, the matrix AT is the transpose of A. For each n and t, further let εnt = (εnt1,…,εntJ)T independently follow a multivariate normal distribution with mean vector 0 and covariance matrix Σε. For each n and t, the error εnt and the random effects bn are independent.
For each participant n and time t, the covariance matrix Σε characterizes the correlation structure between the component scores Ynt1,…,YntJ. For each participant n, the component scores Ynt = (Ynt1,…,YntJ)T, t = 1,…,Tn, are independent of each other through time conditional on the random effect bn, but are correlated marginally.
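As a concrete illustration, data can be simulated directly from an MLMM of this form. The sketch below uses hypothetical parameter values (the annual rates match the point estimates quoted later for the transformed components, but the covariance matrices and treatment effect are invented for illustration, not the fitted ADNI values).

```python
import numpy as np

rng = np.random.default_rng(0)

J = 3                                   # number of component scores
N = 200                                 # participants
times = np.arange(0, 2.5, 0.5)          # visits every 6 months over 2 years

# Hypothetical parameter values for illustration only
beta0 = np.zeros(J)                     # fixed intercepts
beta1 = np.array([0.079, 0.061, 0.055]) # annual rates of change (control)
beta_trt = -0.25 * beta1                # 25% reduction in the rates if treated
Sigma_b = 0.5 * np.eye(J) + 0.1         # random-intercept covariance (invented)
Sigma_e = 0.3 * np.eye(J) + 0.05        # error covariance (invented)

def simulate(N):
    """Draw component scores Y[n, t, j] from the MLMM."""
    Z = rng.integers(0, 2, size=N)      # 1:1 treatment allocation
    b = rng.multivariate_normal(np.zeros(J), Sigma_b, size=N)
    Y = np.empty((N, len(times), J))
    for n in range(N):
        for t_idx, t in enumerate(times):
            eps = rng.multivariate_normal(np.zeros(J), Sigma_e)
            Y[n, t_idx] = beta0 + b[n] + (beta1 + beta_trt * Z[n]) * t + eps
    return Y, Z

Y, Z = simulate(N)
```

Because the random intercept b[n] is shared across all visits of participant n, the simulated scores are correlated over time marginally even though the errors are drawn independently at each visit, matching the structure described above.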
We can link the LMM for the composite scores to the MLMM for the components by letting Cnt = wTYnt, β0C = wTβ0, β1C = wTβ1, βC = wTβ, an = wTbn, and δnt = wTεnt, where w = (w1,…,wJ)T is the vector of weights for the composite score. The LMM for the composite score of the n-th participant at time t is therefore

Cnt = β0C + an + (β1C + βC Zn) t + δnt,

where βC = wTβ is the treatment effect on the composite scores and, for each n, the random intercept an follows a normal distribution with mean 0 and variance wTΣbw, and, for each n and t, the random error δnt follows a normal distribution with mean 0 and variance wTΣεw.
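The mapping from MLMM parameters to the induced composite-score LMM is a few matrix products. The sketch below computes them for equal weights; the numerical values of Σb, Σε, and the component treatment effects are hypothetical.

```python
import numpy as np

# Hypothetical MLMM parameters (illustration only, not the ADNI estimates)
Sigma_b = np.array([[0.60, 0.10, 0.10],
                    [0.10, 0.60, 0.10],
                    [0.10, 0.10, 0.60]])   # random-intercept covariance
Sigma_e = np.array([[0.35, 0.05, 0.05],
                    [0.05, 0.35, 0.05],
                    [0.05, 0.05, 0.35]])   # error covariance
beta = np.array([0.0198, 0.0152, 0.0138])  # component treatment effects

w = np.ones(3) / 3.0          # equal weights, normalized to sum to one

beta_C   = w @ beta           # composite treatment effect   w^T beta
sigma_a2 = w @ Sigma_b @ w    # random-intercept variance    w^T Sigma_b w
sigma_d2 = w @ Sigma_e @ w    # error variance               w^T Sigma_e w
```

Any choice of weights yields a valid composite LMM in this way, which is what allows the same fitted MLMM to be reused when comparing different weighting strategies.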
To detect the treatment effects on the component scores, we consider three hypothesis-testing problems and their associated test statistics. Rejecting any of the null hypotheses suggests statistically significant component treatment effects.
The first hypothesis-testing problem is to test the null hypothesis of no treatment effect in any of the components against the alternative that there is at least one non-zero treatment effect:

H0J: β = 0 versus H1J: β ≠ 0,

where β = (β1,…,βJ)T is the J-dimensional vector of treatment effects. The Wald statistic ΞJ = β̂TV−1β̂ can be used, where β̂ is the maximum likelihood estimator (MLE) of β under the assumption of known covariance matrices for bn and εnt, and V is the covariance matrix of β̂. It follows that, under the null hypothesis of no treatment effect for any of the components, the Wald test statistic will be distributed as a χ2 distribution with J degrees of freedom, χ2J.
The second hypothesis-testing problem considered is for the composite treatment effect, defined as a linear combination of the component treatment effects induced by the weights w = (w1,…,wJ)T. Here, we test the null hypothesis of no composite treatment effect versus the alternative of a composite treatment effect. That is,

H0JC: wTβ = 0 versus H1JC: wTβ ≠ 0.

The Wald statistic, here, is ΞJC(w) = (wTβ̂)2/(wTVw), which is distributed as χ21 under the null, H0JC.
The last hypothesis-testing problem considers the case in which the composite scores are used as single endpoints. It aims to test a single treatment effect on the composite scores:

H0C: βC = 0 versus H1C: βC ≠ 0.

Given the variances wTΣbw and wTΣεw, let β̂C be the MLE of βC and vC be its variance. We can use the Wald statistic ΞC(w) = β̂C2/vC, which follows the χ21 distribution under H0C, to test for this type of treatment effect.
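The first two Wald statistics are simple functions of the MLE β̂ and its covariance V from the MLMM fit; ΞC(w) would instead come from refitting the LMM to the composite scores. A minimal sketch with hypothetical estimates:

```python
import numpy as np

# Hypothetical fitted values (illustration only)
beta_hat = np.array([0.020, 0.016, 0.013])   # MLE of the treatment effects
V = np.diag([4e-5, 5e-5, 4.5e-5])            # covariance of beta_hat
w = np.ones(3) / 3.0                         # equal composite weights

# Joint test of all components:  Xi_J ~ chi2 with J = 3 df under H0
Xi_J = beta_hat @ np.linalg.solve(V, beta_hat)

# Test of the composite treatment effect:  Xi_JC(w) ~ chi2 with 1 df under H0
Xi_JC = (w @ beta_hat) ** 2 / (w @ V @ w)
```

Each statistic is compared with the upper quantile of its null χ2 distribution (3 and 1 degrees of freedom, respectively) at the chosen significance level.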
The vector of weights w has a different meaning in the last two hypothesis-testing situations. In the second, the weights w combine the component treatment effects, whereas in the third, the weights w reweight the component scores themselves. These testing approaches are equivalent only in the very special case of a linear link function, as is assumed in our setting.
Table 1 summarizes these three hypothesis-testing formulations. Under an alternative model, each test statistic follows a noncentral χ2 distribution, which determines the power to reject the associated null hypothesis. Using less powerful test statistics will lead to larger required sample sizes, which may be judged unethical. In the Supplementary document, we prove that, for any given weights w, the test statistic ΞJC(w) is no worse with regard to power than ΞC(w). The test statistic ΞJ does not uniformly outperform either ΞJC(w) or ΞC(w) over the range of w.
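For the two 1-df statistics, the power calculation reduces to a two-sided Z-test, and under our assumptions the noncentrality grows linearly with the number of participants. The following stdlib-only sketch (with a hypothetical per-participant noncentrality) shows the computation; the 3-df case for ΞJ needs a noncentral χ2 routine instead.

```python
from statistics import NormalDist

_norm = NormalDist()

def power_df1(lam, alpha=0.05):
    """Power of a 1-df Wald chi-square test with noncentrality lam,
    computed via the equivalent two-sided Z-test."""
    z = _norm.inv_cdf(1 - alpha / 2)
    s = lam ** 0.5
    return _norm.cdf(s - z) + _norm.cdf(-s - z)

def sample_size(lam1, target=0.80, alpha=0.05):
    """Smallest N reaching the target power when the noncentrality grows
    linearly with the number of participants: lam(N) = N * lam1."""
    N = 1
    while power_df1(N * lam1, alpha) < target:
        N += 1
    return N
```

For example, with a (hypothetical) per-participant noncentrality of 0.1, `sample_size(0.1)` returns the familiar requirement of roughly 79 participants for 80% power at the 5% level, since 80% power for a two-sided 5% test needs a total noncentrality of about 7.85.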
For illustration, we conduct a power analysis for a two-arm randomized AD clinical trial with equal allocation probabilities. The component scores consist of the Mini–Mental State Examination (MMSE), the Clinical Dementia Rating Scale–Sum of Boxes (CDR-SB), and the Alzheimer's Disease Assessment Scale–Cognitive Subscale (ADAS-11) scores. We use data extracted from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu) to inform the specification of the various parameters required to perform the power analysis. This data set comprises 927 participants with MCI at baseline. The MMSE, the CDR-SB, and the ADAS-11 are recorded every 6 months for each participant over a total follow-up period of 10 years. To more closely satisfy the normality assumptions for the components in light of potential ceiling effects, we apply the Box-Cox transformation to the data and then rescale them by their baseline standard deviations; see the Supplementary Materials for details. The transformations applied are such that higher values of the transformed components indicate worse cognitive functioning.
We fit the MLMM to the three component scores; see the Supplementary Materials for details on how estimates of the rate of change parameters and the appropriate covariance structures necessary for us to perform the power analysis were obtained. The R function mlmmm.em() from the mlmmm package  was used to compute these estimates. The estimated annual rates of change on the transformed MMSE, the transformed CDR-SB, and the transformed ADAS-11 are 0.079 (95% confidence interval [CI]: 0.064, 0.095), 0.061 (95% CI: 0.045, 0.077), and 0.055 (95% CI: 0.040, 0.069), respectively. These annual rates of change correspond to small rates of change on the original untransformed scale and suggest that there is limited cognitive decline in those with MCI over the follow-up period. The estimated covariance matrices are
We consider various designs for our clinical trial based on choosing different follow-up periods (i.e., 2, 3, 4, 5, and 6 years) and assuming that it is of interest to detect minimally clinically meaningful treatment effects corresponding to 25% reductions in the annual rates of change in the MMSE, CDR-SB, and ADAS-11 (transformed). These 25% reductions here also correspond approximately to 25% improvements in the treated versus control arms, if the components were considered on their original scales of measurement.
We compare various weights for ΞJC(w) and ΞC(w) (optimal or otherwise) that can be used when performing a power analysis for the clinical trial designs mentioned in the earlier subsection. All the considered weight vectors are normalized. The following weighting strategies are considered:
Table 3 presents the sample sizes required for each of the aforementioned weighting specifications and under the different trial duration scenarios when the statistical power is specified at 80% and the significance level is set at 5%. Also reported are the calculated sample sizes when each component is considered separately for powering the trial, and a Bonferroni correction is applied. Here, the maximum of the three calculated sample sizes based on the three components is chosen as the sample size to be specified for the trial.
From the table, we observe that the test statistic ΞJC(w) with the optimal weights gives the smallest sample sizes (numbers highlighted in bold) for each of the clinical trial design scenarios considered. Moreover, we make the following points after examining Table 3.
A substantial number of participants may be required when a trial for early AD lasts for only 2 years, under our assumptions. We estimate that at least 17,000 participants would need to be recruited in a 2-year AD trial in an MCI population to have sufficient power (i.e., 80%) to detect a 25% reduction in the annual rate of change on each of the transformed component scores. Recruitment of such numbers may be infeasible for a 2-year clinical trial in early AD with four twice-yearly follow-up visits, and even if it were feasible, failure rates could potentially be high in early AD populations. Note that the required sample sizes decrease with increasing trial duration, assuming twice-yearly visits.
The required sample sizes to detect the treatment effect on the transformed MMSE are much smaller than those needed to detect the treatment effect on the transformed CDR-SB or ADAS-11 (compare the w(1) rows with the w(2) and w(3) rows in Table 3). Consider a clinical trial of 3 years' duration as an example. The required sample size obtained by ΞJC(w(1)) is 55.0% of that obtained by ΞJC(w(2)) and 54.6% of that obtained by ΞJC(w(3)). This implies that the transformed MMSE is a more sensitive measure for detecting a treatment effect in early AD than the transformed CDR-SB and ADAS-11 measures.
The approaches that use the optimal weights could require at least 60% fewer participants than those using w(2) or w(3). In our analysis, the performances of ΞJC(w) and ΞC(w) with wZ are comparable to those using the optimal weights. This is a consequence of the estimated parameters obtained from the analysis of the ADNI data giving rise to optimal weights that are close to wZ (Table 2). Comparable performance across these three statistics will not in general be expected when using other component outcomes.
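The optimal weights for ΞJC(w) maximize its noncentrality, which is a generalized Rayleigh quotient (wTβ)2/(wTVw); the maximum is attained, up to scale, at w ∝ V−1β. A sketch with hypothetical β and V (not the ADNI-derived values):

```python
import numpy as np

# Hypothetical treatment effects and MLE covariance (illustration only)
beta = np.array([0.0198, 0.0152, 0.0138])
V = np.array([[1.0, 0.3, 0.4],
              [0.3, 1.2, 0.5],
              [0.4, 0.5, 0.9]]) * 1e-3

def ncp(w):
    """Noncentrality of Xi_JC(w): (w^T beta)^2 / (w^T V w)."""
    return (w @ beta) ** 2 / (w @ V @ w)

w_opt = np.linalg.solve(V, beta)     # optimal direction  V^{-1} beta
w_opt /= np.abs(w_opt).sum()         # normalize for display (scale-invariant)

lam_max = beta @ np.linalg.solve(V, beta)   # maximal noncentrality
```

The noncentrality is invariant to rescaling of w, so any normalization convention can be applied after solving; suboptimal choices such as equal weights can only achieve a noncentrality at or below `lam_max`, which is why they translate into larger required sample sizes.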
The sample sizes calculated under ΞJC(w) are always smaller than those calculated under ΞC(w) for fixed weights, although the reduction may not be substantial; for example, there is a 3% reduction in sample sizes when ΞJC(w) is used. This gain in efficiency is obtained by specifying the correlation structure among the component scores in the MLMM.
We have described three approaches for performing power analysis to detect treatment effects in clinical trials for early AD. From our investigations, we found that jointly modeling the component scores and then constructing sensitive test statistics or composite scores based on optimal weights will improve the efficiency of clinical trials. Under our model assumptions, testing based on the optimal composite treatment effect will lead to the smallest required sample sizes and therefore should be recommended when powering clinical trials in AD if treatment effects on multiple components are of interest.
We end the article with the following discussion points.
We assume that the component scores jointly follow an MLMM. This may be too strong an assumption for analyzing some cognitive and functional scores in AD, because the component scores are usually discrete with strong ceiling or floor effects. Consider the CDR-SB as an example. The CDR-SB is the sum of six component scores: the Memory Score, the Orientation Score, the Judgement and Problem Solving Score, the Community Affairs Score, the Home and Hobbies Score, and the Personal Care Score. All component scores except the Personal Care Score take the discrete values 0, 0.5, 1, 2, and 3, whereas the Personal Care Score takes the values 0, 1, 2, and 3. In the ADNI data, over 30% of individuals score 0 on each component of the CDR-SB, which indicates strong floor effects (zero-heavy data). Therefore, it may not be appropriate to use an MLMM with the CDR-SB on its original scale, or even after transformation as done in this article. The use of other models that take account of zero-heavy data may be appropriate; see Farewell et al. for a comprehensive review.
In our power analysis results, we took the covariance matrices of εnt and bn to be known when fitting the MLMM. This allowed us to obtain explicit formulas for the MLEs and their covariance, which enabled us to compare the powers of the test statistics and calculate the optimal composite scores. In practice, these covariance matrices would need to be estimated. They may be obtained from previous investigations or through a pilot study. However, note that without accounting for the variability in the estimated covariance matrices, there is a tendency to underestimate the required sample sizes. Monte Carlo studies can be applied to obtain more accurate sample sizes, although these require intensive computation to obtain the optimal weights.
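A Monte Carlo power estimate of the kind referred to above repeatedly simulates a trial, refits, and counts rejections. The following deliberately simplified one-component sketch (a two-sample Wald test on the difference in means, with invented effect and variance values) illustrates the scheme; the full version would simulate from the MLMM and refit it on each replicate.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_power(N, effect, sd, n_sim=2000, alpha_crit=1.959963984540054):
    """Monte Carlo power of a two-sample Wald test with N per arm.

    The variance is re-estimated on each simulated data set, so the
    estimate reflects the variability ignored by plug-in formulas.
    """
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, sd, N)        # control arm
        y = rng.normal(effect, sd, N)     # treated arm
        diff = y.mean() - x.mean()
        se = np.sqrt(x.var(ddof=1) / N + y.var(ddof=1) / N)
        if abs(diff / se) > alpha_crit:   # two-sided 5% test
            rejections += 1
    return rejections / n_sim
```

Calling `mc_power` over a grid of N values and locating where the estimate crosses the target power gives a simulation-based sample size; wrapping the whole procedure in a bootstrap over the input parameter estimates yields interval statements like the 95% CIs reported in our Supplementary document.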
In the MLMM for the component scores, it is assumed that, for each n, the errors εnt, t = 1,…,Tn, are independent across time. This implies that the time correlation of Ynt, t = 1,…,Tn, is induced only through the random intercepts bn. This can be generalized so as to introduce autocorrelations between the εnt, t = 1,…,Tn. Such a generalization would raise computational challenges, and a bespoke program would be needed. (We were unable to find a statistical software package that would allow us to fit this more general model.)
The Wald statistics considered are used to detect the component treatment effects, but they do not distinguish between beneficial and deleterious effects. However, because there may currently be an expectation in early AD that any treatment brought forward for confirmatory testing in a phase III trial has undergone rigorous assessment at phase II to ensure that it does not confer harm, it may be of interest to investigate rejecting H0 under the alternative that all the component treatment effects are non-negative. In this situation, the Wald statistic ΞJ follows a mixture of χ2P distributions, P = 0,…,J, where the χ20 distribution is the distribution with mass 1 at the point 0. In general, it is challenging to calculate the weights that combine the χ2P distributions, P = 0,…,J.
When the weights w in ΞJC(w) and ΞC(w) are non-negative elementwise, we may modify the alternatives for H0JC and H0C to

H1JC: wTβ > 0 and H1C: βC > 0,

respectively. We can use the Z-statistics for the one-sided tests, which follow the standard normal distribution under their associated null hypotheses. However, the elements of the optimal weights may not always be non-negative.
It is crucial to obtain plausible values of the parameters needed for the power analysis, including the annual rates of change, the covariance matrix of the random effects, and the covariance matrix of the errors. These parameter values can be informed by a pilot study or by existing studies. However, there is always a concern as to whether the specified alternative truly represents the effect of interest in the clinical trial target population and how the variability of the alternatives affects the calculated sample sizes; sensitivity analysis is therefore recommended. McEvoy et al. compute 95% CIs for the sample sizes through bootstrapping. We also present 95% bootstrap CIs for the calculated sample sizes in our Supplementary document.
The effect sizes must be determined based on rationale and justification from theory and clinical experience. When the effect sizes are set as percentages of the annual rate of change, they are approximately invariant to the transformation of the component scores if the corresponding term in the MLMM is around zero.
The derivation and use of the optimal weights here were for the clinical purpose of powering a trial. We did not propose a new composite score to be used as an endpoint; rather, we constructed the most powerful test statistics and the most sensitive composite score with the optimal weights to detect treatment effects. We further argued that, given the alternatives, no extra information or model assumption beyond what is typically needed is required to calculate them. Therefore, it is helpful to compute and use the optimal weights in power analysis. For other clinical purposes, the optimal weights w as defined and clinically meaningful weights may conflict. In such situations, we suggest modifying the criterion for determining the optimal weights to take account of clinical meaningfulness.
This work has received support from the EU/EFPIA Innovative Medicines Initiative Joint Undertaking EPAD grant agreement no. 115736 and MRC programme grant MC_UP_1302/3.
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.trci.2017.04.007.