Tests of global cognitive functioning are widely used in clinical and research studies to screen for and monitor cognitive impairments. The Mini-Mental State Examination (MMSE), developed in 1975, was introduced in one of the most frequently cited papers in the medical literature (1). The MMSE or similar tests have been recommended for clinically identifying patients with mild cognitive impairment (2), though most subjects with mild cognitive impairment will have MMSE scores in the normal range (3). Longer versions of the MMSE have been developed, including the Modified Mini-Mental State (3MS) (4), the Cognitive Abilities Screening Instrument (CASI) (5), and the Community Screening Instrument for Dementia (CSI 'D') (6).
Some of the large epidemiological studies employing these tests are outlined in Table 1. These studies represent a major societal investment in measuring cognitive functioning in the elderly.
Table 1. Large studies that have employed a cognitive functioning test.
Unfortunately, the different tests are not interchangeable. The MMSE has 20 questions and is scored out of 30 points. The 3MS and the CASI have 40 questions and are scored out of 100 points; the scores are not equivalent. The CSI 'D' has 45 questions and is scored out of 34 points. At present, cross-study analyses can be performed only on studies that used the same test. While a vast amount of data has been collected, until these cognitive tests are co-calibrated with each other, it is as if the tests were written in different languages, and there is no bilingual dictionary.
This situation is similar to that of the ancient Greek and Egyptian languages before the decipherment of the Rosetta Stone, a granite block discovered in 1799 and inscribed with three scripts. Scholars immediately recognized one script as Greek, used to record a proclamation. In 1824, Jean-François Champollion demonstrated that the other two scripts recorded the same proclamation in the ancient Egyptian language. Once the different languages and scripts were co-calibrated, it became possible to develop a bilingual dictionary, laying the foundations for our knowledge of ancient Egyptian culture.
The National Institutes of Health has formed a trans-NIH Project on Cognitive and Emotional Health, whose Critical Evaluation Study Committee identified 66 papers from 36 longitudinal studies of cognitive functioning with at least 500 subjects (8). The Committee stressed that “there is no agreement on the questionnaires used … making comparisons between studies and combining data from studies difficult.” They advocated developing and implementing a new cognitive and emotional questionnaire and analyzing the data it would collect in subsequent research (8). In effect, this strategy calls for discarding the hundreds of thousands of person-years of data that have already been collected. The item response theory (IRT) co-calibration scheme presented here provides a viable alternative to that strategy: once the scales have been co-calibrated, results can be combined across studies and valid findings drawn from existing data.
Even within a study that uses a single test, there may be important challenges to using standard test scoring in longitudinal analyses. The simplest example of standard scoring is a sum score for a test in which each right/wrong item is worth one point, though the same arguments apply to tests with more complicated items and/or more complicated scoring rules. Standard scoring does not consider the distribution of item difficulties: if a test happens to contain many easy items and few hard items, this fact is not reflected in scores. As we would hope, subjects with greater ability would be expected to have higher scores than subjects with lower ability at a single time point. With the passage of time, however, standard scoring of such a test may show strange results for the amount of change. Subjects with lower ability levels face more items whose difficulty is near their ability level, so their scores are estimated with more precision, and small drops in their cognitive ability are more likely to be detected as a change from success to failure on several items. Subjects with higher ability levels face fewer items with difficulty close to their ability level, so their scores may appear more stable over time, at least initially, since changes in cognitive functioning are less likely to be reflected in a change from success to failure on items. This problem of non-linear measurement properties of standard scoring of global cognitive tests has been discussed previously (9).
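This asymmetry can be made concrete with a small simulation. The sketch below is illustrative only: it assumes a Rasch-type logistic model and an invented bank of 25 easy and 5 hard items, none of which corresponds to any actual test considered here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented item bank on a logit scale: many easy items, few hard items.
difficulties = np.concatenate([rng.normal(-2.0, 0.5, 25),  # 25 easy items
                               rng.normal(+2.0, 0.5, 5)])  # 5 hard items

def expected_sum_score(theta, b):
    """Expected sum score under a Rasch-type model,
    where P(correct on item j) = logistic(theta - b_j)."""
    return np.sum(1.0 / (1.0 + np.exp(-(theta - b))))

# The same 0.5-logit decline in ability produces very different
# changes in the observed sum score at different ability levels.
for theta in (-2.0, +2.0):
    change = (expected_sum_score(theta, difficulties)
              - expected_sum_score(theta - 0.5, difficulties))
    print(f"theta {theta:+.1f} -> {theta - 0.5:+.1f}: "
          f"expected score drop = {change:.2f} points")
```

With these invented parameters, the same half-logit decline costs the low-ability subject about three sum-score points but the high-ability subject less than one, even though the underlying change in ability is identical.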
An alternative scoring technique is to use IRT. IRT scoring does not assume a pre-specified weighting for the items. Instead, the data are used to determine parameters for the difficulty of each item and, in some models, its discrimination (the strength of the relationship between the probability of success on the item and the underlying trait or ability measured by the test). IRT can detect the situation discussed above, with many easy items and few hard items. IRT scoring takes the difficulty of test items into account, resulting in a metric with linear scaling properties (10).
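As a minimal sketch of how such scoring works, the following code estimates one subject's ability under a two-parameter logistic (2PL) model by maximum likelihood. The item parameters and response pattern are invented for illustration; real analyses would use dedicated IRT software and typically richer estimation methods.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Invented 2PL parameters: a_j = discrimination, b_j = difficulty.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])

def p_correct(theta):
    """2PL item response function: P(X_j = 1 | theta) = logistic(a_j * (theta - b_j))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def neg_log_lik(theta, x):
    """Negative log-likelihood of a right/wrong response pattern x."""
    p = p_correct(theta)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

x = np.array([1, 1, 1, 0, 0])  # one subject's responses, easiest to hardest
fit = minimize_scalar(neg_log_lik, args=(x,), bounds=(-4.0, 4.0), method="bounded")
print(f"ML ability estimate: theta_hat = {fit.x:.2f}")
```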
An additional potential problem may arise in the analysis of longitudinal data from a test with many easy items and a few hard items. Standard analytic strategies implicitly assume that measurement precision is constant. However, in our example test, measurement precision for people with low ability levels is greater than for people with higher ability levels. Standard approaches do not account for varying levels of measurement precision.
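The point can be illustrated with the test information function, which quantifies measurement precision as a function of ability. The sketch below reuses the invented item bank from above (a Rasch-type model with 25 easy and 5 hard items); the standard error of measurement is the reciprocal square root of the information.

```python
import numpy as np

rng = np.random.default_rng(0)
# Same invented item mix as above: many easy items, few hard items.
b = np.concatenate([rng.normal(-2.0, 0.5, 25), rng.normal(2.0, 0.5, 5)])

def test_information(theta, b):
    """Rasch test information: sum over items of p * (1 - p),
    with p = logistic(theta - b). The standard error of the
    ability estimate is 1 / sqrt(information)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return np.sum(p * (1.0 - p))

for theta in (-2.0, 0.0, 2.0):
    info = test_information(theta, b)
    print(f"theta = {theta:+.1f}: information = {info:5.2f}, "
          f"SE = {1 / np.sqrt(info):.2f}")
```

For this hypothetical test, the standard error roughly doubles between low and high ability levels, exactly the kind of nonconstant precision that standard analytic strategies ignore.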
We reviewed the papers identified by the trans-NIH Project on Cognitive and Emotional Health Critical Evaluation Study Committee. None of these papers used IRT or otherwise accounted for the non-linearity of cognitive tests, and none of the studies summarized in Table 1 account for varying measurement precision. Two studies mentioned by the Committee treat measurement error as a constant across the entire measurement spectrum (A47; references preceded by A can be found in Appendix 3). One additional paper mentions measurement error in the discussion but does not account for it in the analytic plan (A49). The remaining 63 papers do not mention measurement precision at all (A1, …).
Our primary objective was to co-calibrate the 3MS, CASI, CSI 'D', and MMSE using IRT. We illustrate our success by showing, on the same co-calibrated scale, the cut points used in several studies that employed different tests. Our secondary objective was to discuss the measurement properties of the tests and to illustrate the potential implications of varying levels of measurement precision for longitudinal studies of cognitive change over time.
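To give a flavor of the mechanics, the sketch below applies the classic mean/sigma linking method to invented difficulty estimates for anchor items shared by two tests. This is only one of several linking strategies, and we do not claim it is the procedure used in the analyses that follow; it simply shows how shared items let scores from separately calibrated tests be expressed on a common metric.

```python
import numpy as np

# Invented difficulty estimates for anchor items that appear on two tests,
# each calibrated separately on its own arbitrary logit scale.
b_on_test1 = np.array([-1.8, -0.9, 0.1, 1.2])
b_on_test2 = np.array([-1.1, -0.2, 0.9, 2.1])

# Mean/sigma linking: the linear transformation that maps test 1's
# metric onto test 2's metric via the shared items.
A = b_on_test2.std(ddof=1) / b_on_test1.std(ddof=1)
B = b_on_test2.mean() - A * b_on_test1.mean()

theta_on_test1 = -0.5  # an ability estimate expressed on test 1's scale
print(f"slope A = {A:.3f}, intercept B = {B:.3f}")
print(f"equivalent ability on test 2's scale: {A * theta_on_test1 + B:.2f}")
```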