To co-calibrate the Mini-Mental State Examination (MMSE), the Modified Mini-Mental State (3MS), the Cognitive Abilities Screening Instrument (CASI), and the Community Screening Instrument for Dementia (CSI `D') using item response theory (IRT); to compare the screening cut-points used to identify cases of dementia in different studies; to compare the measurement properties of the tests; and to explore the implications of these measurement properties for longitudinal studies of cognitive functioning over time.
We used cross-sectional data from three large (n>1000) community-based studies of cognitive functioning in the elderly. We used IRT to co-calibrate the scales and performed simulations of longitudinal studies.
Screening cut-points varied widely across studies. The four tests have curvilinear scaling and varying levels of measurement precision, with more measurement error at higher levels of cognitive functioning. In longitudinal simulations, IRT scores always performed better than standard scoring, while a strategy to account for varying measurement precision had mixed results.
Co-calibration allows direct comparison of cognitive functioning in studies using any of these four tests. Standard scoring appears to be a poor choice for analysis of longitudinal cognitive testing data. More research is needed into the implications of varying levels of measurement precision.
Tests of global cognitive functioning are widely used in clinical and research studies to screen for and monitor cognitive impairments. The Mini-Mental State Examination (MMSE), developed in 1975, was introduced in one of the most frequently cited papers in the medical literature (1). The MMSE or similar tests have been recommended for clinically identifying patients with mild cognitive impairment (2), though most subjects with mild cognitive impairment will have MMSE scores in the normal range (3). Longer versions of the MMSE have been developed, including the Modified Mini-Mental State (3MS) (4), the Cognitive Abilities Screening Instrument (CASI) (5), and the Community Screening Instrument for Dementia (CSI `D') (6, 7). Some of the large epidemiological studies employing these tests are outlined in Table 1. These studies represent a major societal investment in measuring cognitive functioning in the elderly.
Unfortunately, the different tests are not interchangeable. The MMSE has 20 questions and is scored out of 30 points. The 3MS and the CASI each have 40 questions and are scored out of 100 points, but their scores are not equivalent. The CSI `D' has 45 questions and is scored out of 34 points. At present, cross-study analyses can be performed only on studies that used the same test. A vast amount of data has been collected, but until these cognitive tests are co-calibrated with each other, it is as if the tests were written in different languages with no bilingual dictionary.
This situation is similar to the ancient Greek and Egyptian languages before the decipherment of the Rosetta Stone, a granite block inscribed with three scripts discovered in 1799. Scholars immediately recognized one script as Greek, used to record a proclamation. In 1824, Jean-François Champollion demonstrated that the other two scripts recorded the same proclamation in the ancient Egyptian language. Once the different languages and scripts were co-calibrated, it was possible to develop a bilingual dictionary, laying foundations for our knowledge of ancient Egyptian culture.
The National Institutes of Health has formed a trans-NIH Project on Cognitive and Emotional Health whose Critical Evaluation Study Committee identified 66 papers from 36 longitudinal studies of cognitive functioning with at least 500 subjects (8). The Committee stressed “there is no agreement on the questionnaires used … making comparisons between studies and combining data from studies difficult.” They advocated development, implementation, and analysis of data collected by a new cognitive and emotional questionnaire for use in subsequent research (8). In effect, this strategy calls for discarding the hundreds of thousands of person-years of data that have already been collected. The item response theory (IRT) co-calibration scheme presented here provides a viable alternative to that strategy. Results can be combined across studies and valid findings can be drawn from existing data once the scales have been co-calibrated.
Even within a study that uses a single test, there may be important challenges to using standard test scoring in longitudinal analyses. The simplest example of standard scoring is a sum score for a test in which each right/wrong item is worth one point, though the same arguments apply to tests with more complicated items and/or more complicated scoring rules. Standard scoring does not consider the distribution of item difficulties: if a test contains many easy items and few hard items, this fact is not reflected in its scores. As we would hope, subjects with greater ability are expected to have higher scores than subjects with lower ability at a single time point. With the passage of time, however, standard scoring of such a test may show strange results for the amount of change. Subjects with lower ability levels have more items whose difficulty is near their ability level, so their scores are estimated with more precision; small drops in cognitive ability for such individuals are more likely to be detected as changes from success to failure on several items. Subjects with higher ability levels have fewer items with difficulty levels close to their ability level, so their scores may appear more stable over time, at least initially, since changes in cognitive functioning are less likely to be reflected in changes from success to failure on items. This problem of non-linear measurement properties of standard scoring of global cognitive tests has been discussed previously (9).
An alternative scoring technique is to use IRT. IRT scoring does not assume a pre-specified weighting for the items. Instead, the data are used to determine parameters for the difficulty of each item and, in some models, discrimination (the strength of the relationship between probability of success on the item and the underlying trait or ability measured by the test). IRT can detect the situation discussed above, with several easy items and few hard items. IRT scoring takes the difficulty of test items into account, resulting in a metric with linear scaling properties (10).
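As a minimal sketch of this idea, the two-parameter logistic (2PL) IRT model relates the probability of success on a binary item to the subject's ability through the item's difficulty and discrimination. (The analyses below use Samejima's graded response model, which generalizes this to polytomous items.)

```python
import math

def p_correct(theta, difficulty, discrimination=1.0):
    """2PL IRT model: probability that a subject with ability `theta`
    answers correctly an item with the given difficulty and
    discrimination (illustrative sketch, not the fitted model)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
```

A subject whose ability equals the item's difficulty has a 50% chance of success; higher ability raises that probability.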
An additional potential problem may arise in the analysis of longitudinal data from a test with many easy items and a few hard items. Standard analytic strategies implicitly assume that measurement precision is constant. However, in our example test, measurement precision for people with low ability levels is greater than for people with higher ability levels. Standard approaches do not account for varying levels of measurement precision.
We reviewed the papers identified by the trans-NIH Project on Cognitive and Emotional Health Critical Evaluation Study Committee. None of these papers used IRT or otherwise accounted for non-linearity of cognitive tests, and none of the studies summarized in Table 1 accounts for varying measurement precision. Two studies mentioned by the Committee treat measurement error as a constant across the entire measurement spectrum (A47, A48; references preceded by A can be found in Appendix 3). One additional paper mentions measurement error in the discussion but does not account for it in the analytic plan (A40). The remaining 63 papers do not mention measurement precision at all (A1-A4, A11, A13, A14, A16-A19, A25-A28, A32-A35, A41, A49-A91).
Our primary objective was to co-calibrate the 3MS, CASI, CSI `D', and MMSE using IRT. We illustrate our success by showing on the same co-calibrated scale the cut points used in several studies that employed different tests. Our secondary objective was to discuss measurement properties of the tests and illustrate the potential implications of varying levels of measurement precision on longitudinal studies of cognitive change over time.
We present two separate analyses. First, we analyzed real global cognitive test data from 3 large studies to co-calibrate the tests using IRT. Once we co-calibrated the tests, we compared screening cut-points published in the literature on the common co-calibrated scale. We also determined measurement properties of the tests. In the second analysis, we used these results to inform simulated longitudinal studies of cognitive functioning over time. We compared four different scoring strategies in terms of their bias in estimating the true rate of cognitive decline over time.
We used cross-sectional data from 3 studies, the Cardiovascular Health Study (CHS) (n= 4,978), the Adult Changes in Thought Study (ACT) (n = 3,358), and the Indianapolis site from the Indianapolis - Ibadan Dementia Project (Indianapolis) (n = 1,254) (total n = 9,590). Detailed methods from each of these studies have been published (11-14). Local institutional review boards approved each study, and written informed consent was obtained in each study.
CHS enrolled 5,201 individuals aged 65 years or older from four communities between 1989 and 1990; 4,291 participated in the study in 1992-1993. An additional 687 African-American participants were enrolled in 1992-1993. We analyzed 3MS responses from 1992-1993 from these 4,978 individuals.
ACT enrolled 2,554 individuals aged 65 years or older in 1994-1996 from a large health maintenance organization. An additional 811 subjects were enrolled in 2000-2002; 804 had valid CASI scores. We analyzed the most recent CASI results from both cohorts (total n = 3,358).
Indianapolis enrolled 2,147 African-Americans aged 65 years or older in 1992-1993. Of those, 1,254 had CSI `D' data from the second incidence wave of data collection in 1997-1998 and are included here.
Co-calibration requires either the same people taking different tests or different tests sharing common items (16); we used common items. We identified anchor items with identical content across tests and ensured that their relationship with the underlying ability tested was the same across study sites. These items were then used to anchor the scales to a common metric.
We compared test items to identify those that presented subjects with identical stimuli. If necessary, we re-coded scoring categories. For example, interlocking pentagons was scored 0 or 1 in the CSI `D', 0-10 points in the CASI, and 4, 4, and 2 points in the 3MS for the left pentagon, right pentagon, and intersection. Based on scoring rules used in the studies, a score of 1 from the CSI `D' corresponded to a score of exactly 10 from the CASI, and to scores of exactly 4, 4, and 2 from the 3MS. We re-coded CASI scores as 10 = 1, any other score = 0, and re-coded 3MS scores as (4, 4, and 2) = 1, any other score combination = 0.
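The re-coding rules above can be sketched as follows; the function names are hypothetical, and the logic simply encodes the correspondences described in the text:

```python
def recode_casi_pentagon(raw):
    # CASI pentagon item is scored 0-10; per the study scoring rules,
    # only a perfect 10 corresponds to the CSI `D' anchor category 1.
    return 1 if raw == 10 else 0

def recode_3ms_pentagon(left, right, intersection):
    # 3MS awards up to 4, 4, and 2 points for the left pentagon, right
    # pentagon, and intersection; only the exact (4, 4, 2) combination
    # corresponds to a CSI `D' score of 1.
    return 1 if (left, right, intersection) == (4, 4, 2) else 0
```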
Many items were similar but not identical; such items were not considered as candidate anchor items. For example, the CSI `D' uses the words “boat”, “house”, and “fish” while the 3MS and CASI use the words “shirt”, “brown”, and “honesty” for short-term recall.
Each item from a test has relationships with all of the candidate anchor items from that test (within-test relationships), and the candidate anchor items have relationships with candidate anchor items in the other tests (between-test relationships). We used IRT to parameterize these relationships. We generated a PARSCALE data set (17) containing item responses from all 9,590 subjects; we used Samejima's graded response model for polytomous items (18, 19).
Anchor items must have the same relationship with cognitive functioning across study populations. Violation of this assumption is called differential item functioning (DIF), defined as statistical differences across groups (in this case, across studies) in item responses when controlling for the underlying ability measured by the test (20-22). Details of the approach we used to identify items ineligible to be anchor items due to DIF (23, 24) are presented in Appendix 2.
Samejima's graded response model provides a formula for the probability of each response category for each item for any level of cognitive functioning (18, 19, 25). We used this formula to determine the most likely response for every item for every cognitive functioning level. We calculated 3MS, CASI, CSI `D', and MMSE scores using published scoring algorithms (4, 5, 7). When MMSE item parameters were available from multiple tests (which could happen only if an MMSE candidate anchor item was found to have DIF related to study site), we used parameters from the 3MS. We rescaled IRT scores by multiplying them by 15 (so a standard deviation unit is 15 points) and adding 100 (so average cognitive functioning is 100 points) (26).
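A sketch of the two pieces of this step, with illustrative (not estimated) item parameters: Samejima's graded response model yields the probability of each ordered response category from cumulative logistic curves, and a linear transformation places IRT scores on the reported metric.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def grm_category_probs(theta, a, thresholds):
    """Samejima's graded response model: probabilities of each ordered
    response category for an item with discrimination `a` and ordered
    threshold parameters `thresholds` (illustrative values only)."""
    # Cumulative probability of responding in category k or higher
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    # Category probability = difference of adjacent cumulative curves
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def rescale(theta):
    """Transform an IRT score (mean 0, SD 1) to the reported metric:
    mean 100 points, SD 15 points."""
    return 100.0 + 15.0 * theta
```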
IRT provides two important summaries of test measurement properties. The test characteristic curve is a plot of the most likely score associated with each level of cognitive functioning (10, 27). It is useful for assessing whether the relationship between standard scores and the underlying level of cognitive functioning is linear (9). Linear relationships are important for many applications (10). The test information curve depicts the measurement precision of the test at each level of cognitive functioning, which may vary. If a cognitive test includes many difficult items it will have high information (good precision) for individuals with above-average levels of cognitive functioning; if it includes few difficult items it will have less information (poor precision) for individuals with above-average levels of cognitive functioning (10, 27). The standard error of measurement is proportional to the inverse square root of the information, and is on the same scale as the cognitive functioning level. We used standard formulas (27) to plot test characteristic curves and standard errors of measurement for the MMSE, 3MS, CASI, and CSI `D'.
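To illustrate why a test built mostly from easy items loses precision at the high end, here is a sketch using the binary 2PL information function with hypothetical item parameters (the actual curves in Figure 3 come from the fitted graded response model):

```python
import math

def item_information(theta, a, b):
    # Fisher information of a binary 2PL item at ability theta:
    # a^2 * P * (1 - P), maximized where theta equals the difficulty b
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def sem(theta, items):
    """Standard error of measurement: inverse square root of the total
    test information at `theta` (items = list of (a, b) pairs)."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)
```

For a hypothetical test of ten easy items (difficulty two SDs below average), the standard error at two SDs above average is several times larger than at two SDs below.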
We reviewed the publications listed in Table 1 to determine if a 2-stage sampling design was used and, if so, what screening cut-point was used in each study. We plotted these values on the transformed co-calibrated metric.
We examined 5 different scenarios for change, each with 250 data sets of 1000 subjects seen at baseline and every 2 years for 8 years (5 data points for each subject in each data set). We drew a random intercept (baseline value) and slope (rate of decline) for each subject in each data set. We varied the mean slope and intercept across five scenarios. In each scenario the standard deviation of the intercept term was 15 and the standard deviation of the slope was 7.5 points over 8 years. This process generated true cognitive abilities for each subject at each time point in each data set. We simulated item responses to a global cognitive test for each subject based on their true cognitive ability at each time point. We chose item parameters for the simulated test to mimic the test characteristic and standard error curves obtained from Analysis 1 (see Figures 2 and 3). Further details are provided in the footnote to Table 4.
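The data-generating step can be sketched as follows; the mean slope here is a hypothetical value, not one of the five scenario means given in Table 4:

```python
import random

def simulate_true_scores(n_subjects=1000, mean_intercept=100.0,
                         mean_slope=-7.5, waves=5, years_per_wave=2):
    """Draw a random intercept and slope per subject and generate true
    cognitive ability at each biennial wave (a sketch of the paper's
    setup: intercept SD 15 points, slope SD 7.5 points over 8 years)."""
    subjects = []
    for _ in range(n_subjects):
        intercept = random.gauss(mean_intercept, 15.0)
        slope = random.gauss(mean_slope, 7.5)  # total change over 8 years
        # true ability at each wave; slope is per 8-year study period
        traj = [intercept + slope * (w * years_per_wave) / 8.0
                for w in range(waves)]
        subjects.append(traj)
    return subjects
```

Simulated item responses would then be drawn from the IRT model given each subject's true ability at each wave.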
We used four different strategies to score the observed item responses: standard total score, naïve IRT score, a single perturbed IRT score in which noise proportional to measurement error is added to each observation, and 10 perturbed IRT scores defined the same way. For IRT scores we used PARSCALE (17) and the graded response model (18, 19); we estimated item parameters from the baseline data anew for each run. These parameters were used to compute IRT scores at each follow-up time point. For the perturbed IRT scores we used the point estimate of the score from PARSCALE and added a perturbation term generated by multiplying a random normal (0, 15²) variable by the standard error of measurement term from PARSCALE.
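The perturbation step can be sketched as follows; `sem_theta` stands in for the PARSCALE standard error of measurement on the latent scale, so a N(0, 15²) draw times the SEM yields noise on the 15-points-per-SD reporting metric:

```python
import random

def perturbed_score(point_estimate, sem_theta, scale_sd=15.0):
    """One 'perturbed' (plausible-value) IRT score: add noise
    proportional to the standard error of measurement (a sketch of
    the paper's perturbation strategy)."""
    return point_estimate + random.gauss(0.0, scale_sd) * sem_theta
```

When measurement is perfect (SEM of zero), the perturbed score reduces to the point estimate; less precise scores receive proportionally more noise.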
We fit mixed effects models (28) for each scoring strategy to estimate the rate of change. We defined percent bias as the difference between the actual rate of change for the true scores and the estimated rate of change for each scoring strategy, divided by the actual rate of change for the true scores. We split the 250 data sets into two halves and numbered the data sets in each half from 1 to 125 to determine the running mean bias, defined for data set n as (total bias for data sets 1 to n)/n. We plotted running mean bias against run number to determine whether the bias estimates converged to the same number in the two halves of the data.
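The two bias summaries can be sketched as:

```python
def percent_bias(true_rate, estimated_rate):
    # Percent bias as defined in the text: (true - estimated) / true
    return 100.0 * (true_rate - estimated_rate) / true_rate

def running_means(biases):
    """Running mean over data sets 1..n: (total bias for sets 1..n) / n,
    used to check that bias estimates converge in each half of the data."""
    out, total = [], 0.0
    for n, b in enumerate(biases, start=1):
        total += b
        out.append(total / n)
    return out
```

For example, a true decline of 2 points that is estimated as a 1-point decline has 50% bias.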
Table 2 shows demographic characteristics of the study populations. The age and gender distributions of the three study populations were roughly similar, but there were large differences in ethnic and educational distributions.
The items included in the three tests and their final dispositions (anchor items, rejected anchor items due to DIF, or unique) are shown in Appendix Table 4. Two items served as anchors across all three tests, sixteen items anchored the comparison of the 3MS and the CASI, four items anchored the comparison of the 3MS and the CSI `D', and two items anchored the comparison of the CASI and the CSI `D'.
Table 3 summarizes the result of the co-calibration. Score comparisons can be made by reading across the rows of the table. For example, scores of 20 on the MMSE correspond to scores of 51 or 52 from the 3MS, scores of 73 or 74 from the CASI, and scores of 23 from the CSI `D'.
The screening cut-points used to identify subjects with poor cognitive functioning in selected epidemiological studies are shown on the IRT metric in Figure 1. The cut-points used vary dramatically, from a low around 3½ standard deviations below average (for the Chicago Healthy Aging Project and the Kinmen Study) to just about average (for the Women's Health Initiative Memory Study).
In Figure 2 we show test characteristic curves for the MMSE, the CASI, the 3MS, and the CSI `D'. The test characteristic curve for a scale with linear measurement properties would have a straight line (9). All of these cognitive screening tests produce non-linear test characteristic curves, with steeper slopes at below-normal levels of cognitive functioning and shallow slopes at normal and above-normal levels. All of these tests thus are characterized by having relatively few hard items and more easy items. A test with a curvilinear test characteristic curve is poorly suited for comparing trajectories of cognitive functioning across patients who start from different levels of cognitive ability. A patient who started in a steep portion of the curve would be expected to have a larger change in observed score for a given amount of change in cognitive functioning than a patient who started in a shallow portion of the curve. For longitudinal epidemiological and clinical studies (including those referred to in Table 1), tests with curvilinear test characteristic curves may threaten the validity of study results if standard scores are used (29).
In Figure 3 we show each test's standard error of measurement at each cognitive functioning level. For individuals with cognitive functioning levels below average (below 100), the curves for the different tests are nearly identical. For individuals with cognitive functioning levels above average (above 100) the standard error curves start to rise dramatically. Because the MMSE is a shorter test, its measurement precision is inferior to that of the other tests for individuals with average or above-average cognitive functioning. None of the tests has good measurement precision for these individuals. Between one standard deviation and two standard deviations above average (115-130 on the transformed cognitive functioning scale), each of the tests has a large standard error of measurement of at least one standard deviation (15 points). We performed simulation analyses to illustrate potential problems that may arise from this profile of varying levels of measurement precision.
Results from the simulation studies are shown in Table 4. Standard scoring performed poorly in three of the five scenarios (2, 4, and 5). Accounting for curvilinear scaling properties using naïve IRT scores provided more accurate estimates of the rate of change than standard scores in all five scenarios, though the difference was negligible in scenario 1. Further accounting for varying levels of measurement precision using the perturbation strategy proved to be the most accurate strategy for two scenarios (scenarios 2 and 4), but this strategy was not as accurate as naïve IRT scores for the other three scenarios and not as good as standard scores for scenarios 1 and 3. There was no difference in findings when the perturbation strategy was carried out once per data set or 10 times. Running means converged in the two halves of the data at the global means shown in Table 4 (see Appendix Figure 1).
We co-calibrated four commonly used tests of global cognitive functioning using IRT. Just as the deciphered Rosetta Stone permitted understanding of the Egyptian language by scholars familiar with Greek, co-calibration permits understanding across studies that use different cognitive tests. Scores from each test can be directly compared to scores on the other tests. This permits comparison of cut-points used in different studies, which we found to vary dramatically across studies. Co-calibration also permits us to compare measurement properties such as curvilinearity of the test characteristic curve and standard errors of measurement of the different tests on the same co-calibrated metric. We found that all four tests had similar curvilinear test characteristic curves, as well as poor measurement precision for high scores. We performed a simulation study to investigate the potential for bias due to curvilinear measurement properties and varying levels of measurement precision in longitudinal studies of cognitive change over time. Simulation results suggest that IRT scores are always an improvement over standard scores in recovering true rates of change over time. Further accounting for measurement error using a perturbation term strategy had varying effects on the amount of bias in estimating the rate of change over time.
The co-calibration performed here will permit direct comparison of scores across studies that employed different tests. Table 3 should thus prove useful to clinicians and researchers. One application of these co-calibrated scores is the ability to compare cut-points used across studies that used different instruments, which we found varied dramatically. Studies with higher cut-points will have detected a higher proportion of their subjects who have dementia (a higher sensitivity) at the expense of having performed many more evaluations (a lower specificity) than studies with lower cut-points. Because of the differences in cut-points, subjects with dementia identified in studies with widely divergent screening cut-points may not be comparable to each other. Subjects with mild forms of dementia are much more likely to escape detection in a study with a lower cut-point than in a study with a higher cut-point.
The curvilinear test characteristic curves of the cognitive tests (see Figure 2) create formidable impediments to analysis of cognitive changes that attempts to use traditional scoring of data from these tests. Regression and change score approaches to analyzing cognitive trajectories assume that each such trajectory is linear - that is, a change of a few points at the top end of the scale has the same implication for cognitive functioning as a change of the same few points at the bottom end of the scale (10). None of the cognitive tests considered here meets this standard. Changes of a few points at the top end of the scale imply vast differences in cognitive functioning, while changes of a few points at the bottom end of the scale imply tiny differences. Given the curvilinear nature of cognitive functioning as measured by these tests, analyses of changes in cognitive functioning over time should use IRT scoring rather than standard scoring (9, 10). These theoretical considerations are supported by our simulation results from Analysis 2, in which using naïve IRT scores was always more accurate than using standard scores. Thus instead of using the standard score equivalents from Table 3, the wisest choice would be to use IRT to obtain scores on the co-calibrated metric. Item parameters are available from the first author for these tests. It should be recalled that none of the studies identified by the Project on Cognitive and Emotional Health used IRT or any other technique appropriate for curvilinear metrics (A1-A4, A11, A13, A14, A16-A19, A25-A28, A32-A35, A40, A41, A47-A91).
Measurement precision also varied dramatically across the cognitive functioning spectrum for the cognitive screening tests that we examined, with no major differences between the tests (Figure 3). For all four tests, measurement precision is much poorer at the higher end of the cognitive functioning spectrum because of small numbers of difficult items. This lack of high-end sensitivity for all of these tests has been noted especially in conjunction with the interest in detecting early cognitive deficits such as mild cognitive impairment (MCI) (2). None of these tests should be solely relied upon for identification of early cognitive deficits, because none of them has much measurement precision at the higher end of the scale. Furthermore, as we show with our simulations, ignoring varying levels of measurement precision in an analysis of change over time may also lead to biased estimates of the rate of change. In two of the five scenarios, our perturbation strategy proved better than naïve IRT scoring in recovering the true rate of change over time. However, in the other three scenarios, naïve IRT scoring was better, and in two of the scenarios, standard scoring itself proved to be superior to the perturbation strategy. We think there are two implications of these findings. The first is that the standard error of measurement curves shown in Figure 3 provide qualitatively different information than the test characteristic curves shown in Figure 2. The second implication of these findings - especially the variability of the findings across scenarios - is that more research is needed before we can provide general guidelines on strategies to account for varying levels of measurement precision. 
Thus, while measurement error was ignored by almost all of the studies identified by the Project on Cognitive and Emotional Health (A1-A4, A11, A13, A14, A16-A19, A25-A28, A32-A35, A41, A49-A91), treated as a constant (A47, A48), or mentioned only in the discussion (A40), at present we have nothing better to offer than naïve IRT scores (a strategy employed by none of these studies). This is an active area of our ongoing research.
The perturbation strategy employed in Analysis 2 seems counterintuitive. Rather than relying on the best point estimate of each subject's score at each time point, we actually introduce noise to that point estimate in the form of the perturbation term, which in turn reduces bias in estimating rates of change. The perturbation strategy is directly analogous to the plausible values strategy employed since the 1983-1984 school year in the National Assessment of Educational Progress (NAEP) (30). This strategy is analogous to the multiple imputation framework for missing data (31, 32). In essence, every cognitive functioning estimate is treated as missing, and information is drawn from both the point estimate of the score and the certainty of that score. In our simulations and in most longitudinal studies of global cognitive functioning in the elderly, subjects on average move from a cognitive functioning level measured with less measurement precision to a cognitive functioning level measured with more precision. By incorporating knowledge of measurement precision into our analytic strategy, we try to avoid biased estimates of the rate of change. We found no difference between a single perturbed score and 10 perturbed scores, but this was averaged over 250 datasets. In any particular data set, multiple perturbations will likely reduce random fluctuations. Other potential strategies for handling varying levels of measurement precision include multi-level IRT approaches (33-39) in which a measurement model on one level is used to estimate ability, which in turn may be used in a mixed effects model at another level to estimate change over time. Further research is needed to determine optimal strategies for assessing change over time in studies that incorporate tests with widely varying measurement precision.
This study has limitations. The validity of our findings in Analysis 1 hinges on the validity of the anchor items. We reinforced our confidence in the anchor items by retaining only those that had no DIF related to study site. To our knowledge, this is the first co-calibration study to include the step of checking anchor items for DIF related to study site. Demographic characteristics for the subjects in the parent studies were very different from each other, and further studies should be performed to confirm the stability of the anchor items' item parameters across heterogeneous populations. Further targeted data collection would increase our confidence in the item parameters. The CSI `D' was only anchored to the other tests by 8 items. An additional 6 potential anchors for the CSI `D' were found to have DIF. Adding a few items from the CSI `D' to a study that routinely uses the CASI, for example, would dramatically increase the proportion of items that could be used to anchor the two tests. We also did not consider DIF related to anything other than study site in this analysis. We have previously found DIF related to age and education in the CASI (23) and the MMSE (40, 41), and would be surprised to not discover similar relationships in the 3MS or the CSI `D'. Analysis 2 is based on modest and realistic amounts of cognitive decline over time, but further simulation work is needed prior to recommending a strategy for handling varying levels of measurement precision.
In summary, we have co-calibrated widely-used screening tests of cognitive functioning. Strategies outlined here can be used to permit data to be combined across studies that used different cognitive tests. Co-calibration can serve as a Rosetta Stone to facilitate understanding across studies that otherwise would have no way to communicate. Standard analytical approaches to longitudinal cognitive testing data appear to be fundamentally flawed because they ignore curvilinear test characteristic curves common to these tests (Figure 2) and varying levels of measurement precision (Figure 3).
Drs. Crane and Gibbons and Ms. Narasimhalu were funded by NIH K08 AG 022232 from the National Institute on Aging. Dr. Crane was also supported by a New Investigator Research Grant from the Alzheimer's Association. Drs. Gibbons and van Belle were supported by grant P50 AG05136 from the National Institute on Aging. The CHS research reported in this article was supported by contracts N01-HC-85079 through N01-HC-85086, N01-HC-35129, N01 HC-15103, N01 HC-55222, and U01 HL080295 from the National Heart, Lung, and Blood Institute, National Institutes of Health, and grants AG15928 and AG 20098 from the National Institute on Aging, National Institutes of Health. A full list of participating CHS investigators and institutions can be found at http://www.chsnhlbi.org. The ACT research reported in this article was supported by contract U01-AG-06781 from the National Institute on Aging. The Indianapolis research reported in this article was supported by R01 AG 09956 from the National Institute on Aging. None of these sponsors had any role in the design of the present analyses, selection of these particular studies for co-calibration, data management, data analysis, or interpretation. None was involved in the preparation of the manuscript. The manuscript was reviewed by the Cardiovascular Health Study, which includes review by the National Heart, Lung, and Blood Institute, which approved the manuscript.
We used McDonald's bi-factor model (15) to assess whether the items in these tests of global cognitive functioning were sufficiently unidimensional to use the tools of IRT. For both the CASI and the CSI `D', specific domains are mentioned in the original papers describing the scales (5, 7). For the 3MS, we used modifications of the CASI domain structure, since both tests were developed by Evelyn Teng and their content largely overlaps. Since the MMSE is fully contained within the other tests, we did not perform any analyses specifically on the MMSE itself, reasoning that if the 3MS (say) was sufficiently unidimensional to use IRT, any subset of it (including the MMSE) would be as well.
In the bi-factor model, each item is modeled to have loadings on two factors: a global cognitive functioning factor (all the items), and a sub-domain factor (such as long-term memory, orientation, attention, etc.). The sub-domains are specified to be mutually uncorrelated and also uncorrelated with the global cognitive functioning factor. The variance for all factors is set to 1.0 to obtain standardized loadings. According to McDonald, the key result is the loadings on the general factor. If all the items have salient loadings (which he defines as standardized loadings >0.30) on the general factor, McDonald states that the items are “sufficiently homogeneous” for applications requiring homogeneity (such as IRT).
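McDonald's salience criterion amounts to a simple screen over the standardized general-factor loadings. A minimal sketch (the item names and loading values below are illustrative, not estimates from any of these studies; the actual estimation was done in MPLUS):

```python
# Sketch of McDonald's salience screen: flag items whose standardized
# loading on the general factor falls below the 0.30 threshold.
SALIENCE_THRESHOLD = 0.30

def non_salient_items(general_loadings, threshold=SALIENCE_THRESHOLD):
    """Return the items whose general-factor loading is below the threshold."""
    return [item for item, loading in general_loadings.items()
            if loading < threshold]

# Hypothetical general-factor loadings for three items.
loadings = {
    "birth_year": 0.67,
    "three_step_command": 0.26,
    "copy_pentagons": 0.51,
}
print(non_salient_items(loadings))  # -> ['three_step_command']
```

If this list is empty, the item set is "sufficiently homogeneous" in McDonald's sense for applications requiring homogeneity, such as IRT.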
We used MPLUS (42) for all factor analyses. We used the tetrachoric/polychoric correlation matrix and employed the weighted least squares with adjustments for mean and variance (WLSMV) estimator appropriate for categorical data.
McDonald does not focus on model fit in his discussion of the bi-factor model (though he emphasizes the importance of model fit elsewhere in his book on test theory). For the 3MS, the Comparative Fit Index (CFI) was low at 0.659, but the Tucker-Lewis Index (TLI) was very good at 0.973, and the root mean squared error of approximation (RMSEA) was very good at 0.036.
Appendix Table 1 shows loadings for the 3MS.
|Item||Loading, General Cognitive Function factor||Loading, sub-domain factor||Sub-domain name|
|Birth year||0.67||0.44||Long term memory|
|SHOES1||0.51||0.55||Short term memory|
|ARMLEG||0.52||0.40||Abstraction and judgment|
The standardized loading for each of the items on the general factor is greater than 0.30. There are only two loadings on the general factor in the 0.30s and none in the 0.40s.
Appendix Table 2 shows the standardized loadings for the CASI. Fit statistics for these analyses included a good CFI of 0.922, a good TLI of 0.969, and a good RMSEA of 0.027.
|Item||Loading, general cognition factor||Loading, sub-domain factor||Name of sub-domain|
|Birth place||0.65||0.07||Long term memory|
|Seconds in minute||0.84||0.11|
|Direction sun sets||0.51||0.10|
|Day of week||0.62||0.41|
|First word repeating||0.34||0.36||Attention|
|Second word repeating||0.43||0.80|
|First phrase repeat||0.43||0.29|
|Second phrase repeat||0.40||0.14|
|Digits backwards A||0.56||0.26||Concentration|
|Digits backwards B||0.40||0.45|
|Digits backwards C||0.41||0.45|
|First serial 3 subtract||0.48||0.50|
|Second serial 3 subtract||0.56||0.56|
|Third-fifth serial 3||0.51||0.55|
|First recall, first word||0.51||0.65||Short term memory|
|First recall, second word||0.45||0.72|
|First recall, third word||0.44||0.53|
|Second recall, first word||0.52||0.57|
|Second recall, second word||0.59||0.60|
|Second recall, third word||0.59||0.48|
|Follow written command||0.60||0.08||Language|
|Write a sentence||0.44||0.18|
|Identify body parts||0.57||0.26|
|Identify objects A||0.72||0.55|
|Identify objects B||0.77||0.48|
|Copy pentagons||0.51||N/a||Visual Construction|
|Animal fluency, 30 seconds||0.95||N/a||Fluency|
|Similarities||0.50||N/a||Abstraction and judgment|
With a single exception (the three-step command, with a loading of 0.26), the standardized loading on the general factor was greater than 0.30 for every item. Repeating the analysis with that item removed produced nearly identical results for the other items. Similarly, removing the birth day item (whose loading on the secondary factor was >1.0) produced virtually identical results for the other items (largest change <0.03).
Appendix Table 3 shows the standardized loadings for the CSI `D' from the Indianapolis site of the Indianapolis-Ibadan Dementia Study. The CFI was good at 0.933, the TLI was good at 0.923, and the RMSEA was adequate at 0.065 (<0.08 is adequate; <0.05 is considered good fit).
|Item||Loading, general cognition factor||Loading, sub-domain factor||Name of sub-domain|
|REMNAME 1||0.51||0.56||Short term memory|
|PENCIL||0.71||0.66||Higher cortical functioning|
|NAMECITY||0.78||0.22||Orientation to place|
|MONTH||0.72||0.62||Orientation to time|
Almost all items had loadings on the general factor >0.30. Two of the story elements ("minor" and "well") had much lower loadings. When we summed the story elements and treated them as a single item, model fit was minimally changed and none of the loadings changed very much. The pentagons item had a low loading. Excluding it caused minimal change in loadings for the other items and did not have a significant impact on model fit. Similarly, removing items that had loadings >1.0 had minimal impact on fit statistics or loadings for any of the other items.
We thus concluded that the scales were sufficiently unidimensional to use IRT.
We used IRT scores estimated using PARSCALE to evaluate items for DIF. For each study comparison selected for analysis (e.g., CHS vs. ACT; labeled here as "study"), we examined three ordinal logistic regression models for each item:

Model 1: logit P(item response ≥ k) = cut(k) + β1θ + β2(study) + β3(θ × study)

Model 2: logit P(item response ≥ k) = cut(k) + β1θ + β2(study)

Model 3: logit P(item response ≥ k) = cut(k) + β1θ

In these models, "cut" is the cutpoint for each level in the proportional odds ordinal logistic regression model, as described by McCullagh and Nelder (43), and "θ" is the IRT estimate of cognitive functioning. All DIF analyses were performed using Stata (44).
Two types of DIF are identified in the literature. In items with non-uniform DIF, the influence of group membership on the relationship between ability level and item responses differs at varying levels of cognitive functioning. In items with uniform DIF, this influence is the same across all levels of cognitive functioning. These concepts are analogous to the concepts of effect modification and confounding from epidemiology (23).
To detect non-uniform DIF, we compared the log likelihoods of models 1 and 2 using a χ2 test with an α level of 0.05. To detect uniform DIF, we determined the relative difference between the parameters associated with θ (β1 from models 2 and 3) using the formula |(β1(model 2)-β1(model 3))/β1(model 3)|. If the relative difference is large (more than 10% in this study), group membership interferes with the expected relationship between ability and item responses.
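The two DIF criteria reduce to simple computations once the models have been fit. A minimal sketch, using made-up log likelihoods and coefficients (the paper's actual analyses used Stata's difwithpar):

```python
import math

def lr_pvalue_df1(ll_full, ll_reduced):
    """p-value of a 1-df likelihood-ratio chi-square test.

    Model 1 adds a single theta-by-study interaction term to model 2,
    so the test has one degree of freedom. For 1 df, the chi-square
    survival function equals erfc(sqrt(x/2)), so no statistics library
    is needed.
    """
    lr_stat = max(2.0 * (ll_full - ll_reduced), 0.0)
    return math.erfc(math.sqrt(lr_stat / 2.0))

def uniform_dif(beta1_model2, beta1_model3, threshold=0.10):
    """Flag uniform DIF when the relative change in the theta
    coefficient between models 2 and 3 exceeds the threshold."""
    rel_diff = abs((beta1_model2 - beta1_model3) / beta1_model3)
    return rel_diff > threshold

# Hypothetical fitted values: an LR statistic of 3.84 gives p ~= 0.05.
print(round(lr_pvalue_df1(-100.0, -101.92), 3))  # ~0.050
print(uniform_dif(1.2, 1.0))   # True: 20% change in beta1
print(uniform_dif(1.05, 1.0))  # False: 5% change in beta1
```

An item is flagged for DIF if either check fires; such items were then treated separately in the two studies being compared.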
If we found that a candidate anchor item had DIF with respect to a comparison between two studies, that item was rejected as an anchor, and treated separately in the two studies. We repeated these steps until DIF-free anchor items had been selected.
False-positive and false-negative results may occur if the ability score used for DIF detection includes many items with DIF (22). We used an iterative approach to deal with this issue. Once the final set of anchor items was obtained, we confirmed that they had no further DIF related to study site. Stata .ado files for DIF analyses are available for free download; type "ssc install difwithpar" at the Stata prompt.
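The iterative anchor-selection procedure can be sketched as a loop that removes flagged candidates and re-tests until the anchor set is stable. Here `has_dif` is a hypothetical stand-in for the full PARSCALE scoring plus ordinal logistic regression workflow:

```python
# Sketch of the iterative anchor-purification loop described above.
# has_dif(item, anchors) is a placeholder for the full DIF test, which
# would re-estimate theta from the current anchor set on each pass.
def select_anchors(candidate_items, has_dif):
    """Iteratively drop candidates that show DIF until a stable,
    DIF-free anchor set remains."""
    anchors = list(candidate_items)
    while True:
        flagged = [item for item in anchors if has_dif(item, anchors)]
        if not flagged:
            return anchors  # no further DIF among the remaining anchors
        anchors = [item for item in anchors if item not in flagged]

# Toy example: "month" always shows DIF, the others never do.
def fake_dif(item, anchors):
    return item == "month"

print(select_anchors(["year", "season", "month"], fake_dif))
# -> ['year', 'season']
```

Because θ is re-estimated from the surviving anchors on each pass, the loop converges to a set with no detectable DIF relative to its own ability estimate.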
We used an α = 0.05 criterion for non-uniform DIF and a proportional change in β coefficient of 0.10 criterion for uniform DIF for this study.
|Item group||3MS||CASI||CSI `D'|
|1. Anchor items|
|1a. For all 3 tests||Year, Season||Year, Season||Year, Season|
|1b. For 3MS with CASI||Birthday||Birthday|
|Date Registration (3) and 2nd recall (3) of "shirt", "brown", and "honesty"||Date Registration (3) and 2nd recall (3) of "shirt", "brown", and "honesty"|
|First recall of "honesty"||First recall of "honesty"|
|Naming animals (30 seconds, no leg restriction) (10)||Naming animals (30 seconds, no leg restriction) (10)|
|Spatial orientation (5)||Spatial orientation (5)|
|1c. For 3MS with CSI `D'||Draw pentagons||Draw pentagons|
|Identify shoulder||Identify shoulder|
|Identify elbow||Identify elbow|
|Identify knuckle||Identify knuckle|
|1d. For CASI with CSI `D'||Month, Day||Month, Day|
|2. Rejected candidate anchor items due to differential item functioning|
|2a. Rejected for all 3||3 step command (3)||3 step command (3)||3 step command (3)|
|2b. Rejected for 3MS with CASI||Repetition of phrase (“he would like to go home”)||Repetition of phrase (“he would like to go home”)|
|Reading & writing (2)||Reading & writing (2)|
|Birth year, Similarities (3)||Birth year, Similarities (3)|
|1st recall of "shirt" and "brown" (2)||1st recall of "shirt" and "brown" (2)|
|Identify forehead||Identify forehead|
|Identify chin||Identify chin|
|2c. Rejected for 3MS with CSI `D'||Repetition of phrase (if and but)||Repetition of phrase (if and but)|
|Month||(no DIF for CASI/CSI `D')|
|Day||(no DIF for CASI/CSI `D')|
|2d. Rejected for CASI with CSI `D'||Identify shoulder||(no DIF for 3MS/CSI `D')|
|Identify elbow||(no DIF for 3MS/CSI `D')|
|(No DIF for 3MS/CSI `D')||Draw pentagons||(No DIF for 3MS/CSI `D')|
|(No DIF for 3MS/CASI)||(No DIF for 3MS/CASI)||Spatial orientation (state*, city)|
|3. Items unique to each test|
|Mental Reversal||Age||Math functions (8)*|
|Repetition of phrase (“Yellow circle”)||Identify pencil, watch, chair, shoes|
|Number of minutes/hour||Part of the day*|
|Identify wrist||Rain the previous day|
|Object identification & recall (10)||Describe bridge, hammer, church|
|Serial threes (5)||Examiner's name (2)|
|Judgment (3)||1st recall story elements (6)|
|Count backward (3)||2nd recall story elements (6)*|
|Direction of sunset||Draw circles|
|General knowledge (when was World War II*, name civil rights leader, name president*, name mayor, name governor*)|
|Naming animals (4 legged, 60 seconds)|
|Registration (3) and recall (3) of "boat", "house", and "fish"|
|Spatial orientation (street, landmark, address)|