The purpose of this study was to estimate effect sizes (ES) for a set of IIV definitions and to determine which one provided the most robust definition of IIV derived using commonly utilized cognitive tests such as the MMSE and Clock Drawing Test. Results of the study suggested that IIV is an informative alternative performance summary (as compared to the total test scores), irrespective of whether it is derived from the items within a single test or from multiple tests. Consistent with our hypothesis, IIV computations based on MMSE items yielded greater ES estimates than IIV computations based on Clock Drawing Test items. The IIV estimate derived from the Clock Drawing Test predicted cognitive decline above and beyond mean scores on this test at a 12-month follow-up, but other IIV estimates did not.
To our knowledge, this is the first study to estimate effect sizes for a variety of IIV formulations, including within-test (item-level) estimation, using commonly employed cognitive tests such as the MMSE. Our results are not directly comparable to earlier work because our definitions of IIV varied and we used effect sizes to rank the utility of these definitions. For example, “inconsistency” in performance is often estimated using reaction time data, and variability across different tests is then summarized in terms of the pattern, or dispersion, of test scores (Hilborn et al., 2009); in contrast, we summarized variability across different tests by computing one estimate of IIV, based on each definition, on combined test performances (see ). In general, our results are consistent with earlier work in that we have shown that IIV, derived from within-test item-level performance or from across-test performance, will yield significant effect sizes. That being said, results of this study are not consistent with those of Christensen et al., who found that a mean-independent estimate of IIV (level-independent variability, or LIV) was a useful summary of performance on a reaction-time based measure. This may be attributable, at least in part, to the very large number of trials that they used (compared to our study) as well as to the fact that they were using reaction times, not right/wrong answers to questions as we have done here.
Neuropsychological test scores may have a great deal of measurement error (one example is explored in ), the constituent items are not exchangeable, and the typical use to which these test scores are put - comparing totals over time to estimate the number of “points lost” - may have limited interpretive utility. In the context of cognitive science, by contrast, exchangeable and infinitely replicable reaction time trials are perfectly compatible with measuring intra-individual variability (IIV), which may provide useful information regarding the underlying neural integrity of cognitive systems and help predict incident cognitive decline as well as dementia (e.g., ).
Cognitive scientists have suggested that increasing levels of IIV indicate decreasing levels of brain structure/architectural integrity. For IIV to be useful in clinical settings, it should differentiate normal from abnormal cognitive aging (e.g., MCI, AD). Results of this study suggest that IIV can be estimated from the items within tests, as well as across cognitive tests, and that the effect size obtained for IIV will depend on the test and on the definition of IIV that is used. Importantly, this study also showed that IIV can be estimated from tests that nearly every NIH-funded Alzheimer's (and clinical cognitive aging) study in the United States is already using, drawing only on item-level, rather than total-score level, information.
Methodological limitations of this study must be noted. First, only two tests were used to compute estimates of across-test IIV; this decision was based on the data available and was intended to increase the likelihood of replication in future studies. Additionally, the majority of studies on IIV conducted to date have employed reaction-time based tasks. While reaction time tasks may have greater sensitivity and reliability in measuring IIV compared to the MMSE and the Clock Drawing Test, more research is needed to evaluate this possibility empirically. More research is also needed to identify the number of tests needed to generate reliable estimates of across-test IIV (see, e.g., ), and particularly, reliable estimates of clinically meaningful change in variability. Neuropsychological task specificity (e.g., for different brain functions or neural circuits, or both) may also need to be evaluated to identify the best IIV definition for reliable, longitudinal study of cognitive aging (see also ).
A second limitation is that clinical grouping was based on clinical diagnosis. A follow-up study is currently in progress to replicate findings of the current study in neuropathologically-confirmed diagnostic groups. Our results suggest that existing datasets that contain cognitive tests at the item level, together with neuropathology and/or neuroimaging outcomes, can be used to explore the hypothesis that IIV can represent, for example, changes in frontal gray matter, white matter, or neurotransmission.
A third consideration is the many ways to conceptualize effect sizes; future work will determine the robustness of our findings to the different effect size estimators, particularly with respect to longitudinal, clinically meaningful changes in IIV. Our analyses have shown only that total scores and the coefficient of variation, both computed at baseline, tend to provide similar predictive power for 12- and 24-month changes in CDR sum of boxes; we did not evaluate changes in any of the IIV formulations. It was also unclear why such a large effect size was observed for the total score on the MMSE, although these larger-than-expected effect sizes were also seen in the MMSE IIV formulations, suggesting that the use of the MMSE at intake for these subjects might have skewed the MMSE-related results.
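One way to check whether a baseline IIV estimate adds predictive power beyond the total score is to compare the R² of two nested regressions. The following is a minimal sketch under stated assumptions: the data are synthetic, the variable names (`total`, `cv`, `change`) and the helper `r_squared` are hypothetical, and ordinary least squares is assumed here for illustration; the exact regression specification used in this study is not reproduced.

```python
import numpy as np

def r_squared(predictors, outcome):
    """R^2 of an ordinary least squares fit that includes an intercept."""
    X = np.column_stack([np.ones(len(outcome)), predictors])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    residuals = outcome - X @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((outcome - outcome.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic illustration only (not study data).
rng = np.random.default_rng(1)
n = 200
total = rng.normal(27, 2, n)                  # hypothetical baseline total score
cv = rng.normal(0.3, 0.1, n)                  # hypothetical baseline CV
change = -0.2 * total + rng.normal(0, 1, n)   # by construction depends on total only

r2_total = r_squared(total, change)                        # total score alone
r2_both = r_squared(np.column_stack([total, cv]), change)  # total score plus CV
delta_r2 = r2_both - r2_total                              # incremental R^2 for CV
```

Because the models are nested, `delta_r2` cannot be negative (up to rounding); a value near zero indicates that the added predictor explains little beyond the total score.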
A fourth consideration is that we chose to simulate data (generate “random samples”) using the mean and SD of the observed sample as “population parameters” for values following a normal distribution, rather than conduct a bootstrap which would have treated our observed means and variances as if they were the actual population; the bootstrap might have been more supportable if we had exchangeable and infinitely replicable scores. We felt that the more clinical-than-cognitive context of our study and its results supported the simulation approach over the bootstrap approach. It is possible that a bootstrap would have yielded different results, but the simulation is consistent with the way data like ours are used; we will seek to replicate these results in another sample (using simulation) in future work.
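The distinction between the two resampling approaches can be sketched as follows. This is an illustrative sketch only, assuming a small hypothetical vector of total scores (`observed`), an arbitrary number of replicates (`n_reps`), and NumPy's random generator; it is not the simulation code used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical observed total scores for one diagnostic group (not study data).
observed = np.array([29, 28, 30, 27, 26, 29, 28, 30, 25, 27], dtype=float)
n = observed.size
n_reps = 1000

# Parametric simulation: treat the sample mean and SD as the parameters
# of a normal distribution and draw fresh samples from that distribution.
mu, sigma = observed.mean(), observed.std(ddof=1)
simulated = rng.normal(loc=mu, scale=sigma, size=(n_reps, n))

# Nonparametric bootstrap: resample the observed scores with replacement,
# so every replicate is built only from values that were actually observed.
bootstrapped = rng.choice(observed, size=(n_reps, n), replace=True)

# Distribution of the replicate means under each approach.
sim_means = simulated.mean(axis=1)
boot_means = bootstrapped.mean(axis=1)
```

In this sketch the bootstrapped replicates can only repeat observed scores, whereas the simulated values are continuous and may fall outside a test's scoring range; that is a consequence of the normal assumption made here.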
Finally, Schmiedek et al. reported that correcting for either individual or group means on a reaction time-based estimate of IIV may lead to incorrect inferences. It is unclear whether the same is true for IIV estimated as CV (SD/mean) when the task is not based on reaction times. This is a new, and open, question.
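For concreteness, the CV (SD/mean) can be computed from one person's item scores as in the sketch below. The function name `coefficient_of_variation`, the use of the sample SD (`ddof=1`), and the two response vectors are assumptions made for illustration; they are not taken from the study's analysis code.

```python
import numpy as np

def coefficient_of_variation(item_scores):
    """Return SD/mean of one person's item scores, or NaN if the mean is 0."""
    scores = np.asarray(item_scores, dtype=float)
    mean = scores.mean()
    if mean == 0:
        return float("nan")  # CV is undefined when no items are answered correctly
    return scores.std(ddof=1) / mean

# Hypothetical right/wrong (1/0) item responses for two people.
person_a = [1, 1, 1, 1, 0, 1, 1, 1]   # one error out of eight items
person_b = [1, 0, 1, 0, 0, 1, 0, 1]   # four errors out of eight items

cv_a = coefficient_of_variation(person_a)
cv_b = coefficient_of_variation(person_b)
```

When every item is scored strictly 0/1 and the number of items n is fixed, the CV computed this way reduces to sqrt(n/(n-1) × (1-p)/p), where p is the proportion correct, so it is fully determined by the mean. This is one reason the question of mean dependence is worth examining for tasks that are not based on reaction times.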
Of interest are the unexpectedly large effect sizes for the MMSE total score, i.e., the estimated standardized difference (Cohen's d) was 1.45 for the normal (N) vs. MCI comparison and 1.95 for the MCI vs. AD comparison. These are uncharacteristically large effect sizes, particularly for a general test of cognitive function, in these cohorts. Although a full treatment is beyond the scope of this discussion, this particular cognitive test is well known to be very noisy and to give very weak effect sizes in general. The IIV-derived ES estimates were also relatively large in several cases. The very large effect sizes documented in the figures above could be due to the use of the MMSE in identifying which participants were recruited to the study from which the data were obtained. The MMSE is not used to diagnose, but it is sometimes used as a shorthand way of referring to, and sometimes recruiting, patients; accordingly, this influence may be driving the dramatic effect sizes that we found. By contrast, the items making up the Clock score were not used to enroll or diagnose ADNI participants. The Clock total score-based effect sizes were more modest: .54 for normal vs. MCI and .63 for MCI vs. AD.
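As a reminder of what a standardized difference of this kind represents, the sketch below computes Cohen's d for two groups. It assumes the common pooled-SD form of the estimator, which may differ from the exact estimator used in this study, and the function name `cohens_d` and the example scores are hypothetical.

```python
import numpy as np

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled sample SD (assumed form)."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = g1.size, g2.size
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Hypothetical scores for two small groups (not study data).
d_example = cohens_d([2, 4, 6], [1, 3, 5])
```

In the example, the group means differ by 1 and the pooled SD is 2, so `d_example` is 0.5. On this scale, a d of 1.45 corresponds to group means that are nearly one and a half pooled standard deviations apart.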
A final note concerns the emphasis in this study on the method of summary, i.e., total (as prescribed) or some version of variability (as shown in and ). The ADNI study is only one of a large number of similar longitudinal studies presently being conducted. The level of missingness was very low for ADNI data at baseline. Because our results targeted the simulation rather than the original data, we did not address the impact of missingness on our simulations. However, missingness could only have affected the results in and would likely have driven our observed-to-be-low estimates further towards zero. These estimates themselves were not the focus of our work; rather, we targeted the difference between using the total score and using a different summary of the same item-level information (i.e., IIV). Nor did we employ random effects models or any kind of imputation in the current study. When IIV formulations and their utility are explored longitudinally, however, missingness and random effects will be important considerations.
Despite these limitations, results of the current study underscore the potential utility of item-level and across-test estimates of IIV in large-scale studies of cognitive aging and dementia. Given that these data are readily available and being collected in longitudinal research protocols, estimates of IIV may provide an additional metric that reflects global neural integrity and may have predictive utility (e.g., ). Definitions of IIV should also be studied for their performance and characteristics longitudinally; in the current study, our simulations and analyses were all based on IIV estimates derived from baseline data. Importantly, although our regression analyses suggested that the total score and within-test coefficient of variation (IIV) at baseline did not provide much explanatory power for change in clinical functioning (and that IIV generally did not provide explanatory power independent of that of the total score), effect sizes for IIV as a summary metric were comparable to ES estimates based on the total scores on these measures. Our future work will focus on studying the performance of IIV estimates longitudinally and developing a better understanding of how variability in response can represent neural integrity and neural pathology.