The MATRICS Project
The data for the present study are drawn from the MATRICS (Measurement and Treatment Research to Improve Cognition in Schizophrenia) Psychometric and Standardization Study (PASS), led by the Co-Chairs of the Neurocognition Committee (Drs. Nuechterlein and Green). The multi-site MATRICS project was funded by the National Institute of Mental Health (NIMH) and involved extensive interactions among academic, NIMH, FDA, and pharmaceutical industry representatives (Nuechterlein et al., 2008; Green and Nuechterlein, 2004; Green et al., 2008). The overarching goal of MATRICS – PASS was to develop and evaluate a battery of cognitive tests that could stimulate the development and evaluation of new drugs to improve cognition in schizophrenia. To accomplish this objective, MATRICS – PASS researchers sought to identify a set of measures that had good psychometric properties and would be repeatable in clinical trial contexts (i.e., a set of "consensus cognitive performance measures").
One of the chief motivations underlying the MATRICS – PASS project was the FDA recommendation that, for any new pharmaceutical agent seeking approval as a cognitive enhancer, improvement on objective cognitive batteries would be necessary but not sufficient. According to FDA officials, additional evidence is also needed to corroborate that objectively measured change is, in fact, clinically meaningful and relevant to “real world” social, educational, and occupational functioning (Green et al., 2008). As a result, additional “co-primary” measures (e.g., SCoRS, CGI-CogS) would need to be included along with objective test results in clinical trials of cognition-enhancing medications for which FDA approval was sought.
The sample of participants for the MATRICS – PASS project was selected to be representative of stable patients with schizophrenia or schizoaffective disorder who would be typical of patients participating in clinical trial research. The participants can be described as a sample of chronic outpatients with reasonable levels of adult daily living skills. Although the sample includes few individuals with severe levels of cognitive impairment (see results below), it includes many individuals who are quite limited in terms of educational and occupational functioning.
Specifically, participants were selected according to the following criteria: diagnosis of schizophrenia or schizoaffective disorder, depressed type, based on diagnostic interview; no medication changes in previous month and none expected for the following month; stable outpatient or rehabilitation center status; age 18–65 years; no substance dependence in past 6 months; no substance abuse in past month; no clinically significant neurological disease or head injury as determined by medical history; ability to understand spoken English sufficiently to complete testing procedures; and ability to comprehend the consent form appropriately. We also excluded individuals who did not meet the substance use or dependence exclusion criteria but who had a clearly excessive amount of lifetime alcohol or drug consumption over a 10-year period or had been using alcohol or drugs heavily in the 3 days prior to testing (Nuechterlein et al., 2008).
In MATRICS – PASS, data were collected on two occasions four weeks apart between January 2004 and August 2005. At baseline, the number of participants was 176. At the one-month follow-up, nine individuals were lost, leaving 167. Combining baseline and follow-up data, the total number of possible data points for analyses was 343 (176 + 167).
However, not all individuals were able to complete both the CGI-CogS and SCoRS at follow-up. Sample sizes in this paper thus vary somewhat across analyses. Specifically, the numbers of data points for within-scale descriptive statistics and factor analyses are 319 for CGI-CogS and 323 for SCoRS. The subsequent item response theory and simulated computerized adaptive testing analyses are based on the 315 data points that had no missing data on either instrument. The raw aggregated scores derived from the CGI-CogS and SCoRS had 4-week test-retest reliability estimates of 0.80 and 0.82, respectively.
At baseline, the MATRICS participants (n = 176) had a mean age of 44.0 (11.2) years, a mean education of 12.4 (2.4) years, and were 76% male. Marital status was as follows: single 61.4%, married or cohabiting 9.7%, divorced 23.9%, widowed 2.3%, and separated 2.8%. The ethnic and racial distribution was as follows: Caucasian (59%), African American (29%), Hispanic/Latino (6%), Asian Pacific (<1%), and Native American/Alaskan (4%). Diagnosis according to DSM-IV was schizophrenia (86%) or schizoaffective disorder, depressed type (14%). On the Brief Psychiatric Rating Scale (BPRS; Ventura, Green, Shaner, & Liberman, 1993), positive symptom scores had minimum = 3, maximum = 19, and mean = 7.7 (3.77), and negative symptom scores had minimum = 3, maximum = 14, and mean = 5.98 (2.56). These BPRS scores are interpreted as indicating that participants are in the mild range of symptomatology and on average below clinically significant levels. This is consistent with guidelines for the optimal design of clinical trials to enhance cognitive function in schizophrenia, and specifically with the recommendation that these trials should include individuals who are clinically stable and have no more than moderate delusions, hallucinations, or formal thought disorder symptoms (Buchanan et al., 2005).
The collection of data on both the SCoRS and CGI-CogS instruments is complicated, and thus we briefly describe how the item responses analyzed herein were obtained. When a participant arrives for an assessment, a previously trained clinician conducts an in-person semi-structured interview centered on the CGI-CogS and SCoRS items (see description below). The clinician then rates each item based on this interview. In addition, all participants are asked to provide the name of an "informant" (e.g., family member, friend, social worker) who knows them well. The research team then contacts the informant, and the clinician conducts the same semi-structured interview either in person or over the phone and rates each CGI-CogS and SCoRS item based on this second source of information. At no time do participants or informants directly rate themselves on an item. Finally, based on these two sources of information combined with his or her judgment, the clinician makes a final rating on each item. It is these final clinician ratings that comprise the data for this study. The order of the two measures of cognitive deficits was counterbalanced between, but not within, individuals; that is, each individual received the same order at baseline and retest, but order differed across individuals.
Clinical Global Impression of Cognition in Schizophrenia (CGI-CogS)
The 21-item CGI-CogS (Ventura et al., 2008) has a general background section and includes four major categories for evaluation: 1) activities of daily living, 2) severity of impairment for cognitive domain, 3) global severity of cognitive impairment, and 4) global assessment of functioning. The CGI-CogS items were written to parallel the cognitive impairments assessed by established batteries of objective cognitive tests (see Nuechterlein et al., 2004). Specifically, the item content reflects cognitive deficits selected from the seven domains of functioning included in the MATRICS Consensus Cognitive Battery. These domains are: 1) working memory (2 items), 2) attention/vigilance (3 items), 3) verbal learning and memory (2 items), 4) spatial learning (3 items), 5) reasoning and problem solving (4 items), 6) speed of processing (3 items), and 7) social cognition (4 items). In the interview, the items are ordered by these content domains.
Each item on the CGI-CogS is phrased as a series of questions (i.e., prompts) that attempt to provide concrete examples for each concept assessed. For example, the concept underlying Item #1 is "difficulty maintaining newly learned verbal information in mind for brief periods (long enough to use)." The actual question (prompt) is: "Do you forget names of people you just met? Do you have trouble recalling telephone numbers you hear? Do you have trouble remembering what your Dr. just said during visits? Do you find you need to write down information to remember?" As the participant or informant responds and describes their functioning in these areas in the last month, the rater both makes a numeric rating and writes down specific concrete examples. The CGI-CogS uses a 7-point Likert scale for its ratings: 1) not at all impaired, 2) minimal, cognitive deficits but functioning is generally effective, 3) mild, cognitive deficits with some consistent effect on functioning, 4) moderate, cognitive deficits with clear effects on functioning, 5) serious, cognitive deficits which interfere with day-to-day functioning, 6) severe, cognitive deficits that jeopardize independent living, and 7) cognitive deficits are so severe as to present a danger to self and others. Items can also be scored N/A for not applicable or insufficient information. The cognitive deficit assessed by each item is displayed in the table of abbreviated item content and descriptive statistics for the CGI-CogS, and the content domain for each item is shown in the table of unidimensional, multidimensional, and bifactor analyses of CGI-CogS.
Abbreviated Item Content and Descriptive Statistics For the CGI-CogS Scale
Unidimensional, Multidimensional, and Bifactor Analyses of CGI-CogS
Schizophrenia Cognition Rating Scale (SCoRS)
The SCoRS (Keefe et al., 2006) is a 20-item interview-based measure of cognitive deficits and the degree to which they affect day-to-day functioning. The items were developed by a team of content experts to assess a variety of cognitive domains (i.e., memory, attention, problem solving, working memory, learning, speed of processing, and social cognition) that were chosen because of their severity of impairment in many patients with schizophrenia and the empirically established relationship of these cognitive deficits to impairments in aspects of functional outcome. Although these items were not originally developed to exactly match the seven cognitive domains included in the MATRICS Consensus Cognitive Battery, the item content of the SCoRS is very similar to that of the CGI-CogS (see Keefe et al., 2006; Ventura et al., 2008; and Ventura et al., under review). The original purpose of the SCoRS was to “serve as a co-primary measure in clinical trials of cognitive enhancing drugs for schizophrenia” (Keefe et al., 2006, p. 426).
The format for each SCoRS item differs from CGI-CogS in several ways. First, SCoRS items begin with "do you have difficulty…." and then target a more specific cognitive/behavioral phenomenon. For example, SCoRS Item #1 is "Remembering names of people you know or meet? Example: roommate, nurse, doctor, family & friends" and Item #2 is "remembering how to get places? Example: friend's house, store, restroom, own room." Second, in contrast to CGI-CogS’s 7-point scale, SCoRS items are rated on a 4-point scale: none, mild, moderate, and severe impairment. However, each item has a unique anchor for each of the 1 to 4 ratings. For example, on Item #1, anchors are: 1) no impairment, 2) mild = sometimes forgets names, 3) moderate = frequently forgets names, and 4) severe = almost always forgets names. For Item #2, anchors are: 1) no impairment, 2) mild = sometimes forgets how to get places, 3) moderate = only able to get to very familiar places, and 4) severe = unable to get any place without assistance because of difficulty with memory. In other words, the anchor points for each SCoRS item describe the frequency ("sometimes", "almost always") with which a deficit impairs a specific aspect of day-to-day functioning. The underlying concept ostensibly assessed by each SCoRS item is displayed in the table of item content and descriptive statistics for the SCoRS, and the content domain is listed in the table of unidimensional, multidimensional, and bifactor analyses of SCoRS.
Item Content and Descriptive Statistics For the SCoRS Scale
Unidimensional, Multidimensional, and Bifactor Analyses of SCoRS
Traditional Item Statistics
The item content and descriptive statistics tables for the CGI-CogS and the SCoRS display traditional item psychometrics for each measure separately. Indices include the item mean, standard deviation, corrected item-total correlation, and the response proportions in each category. These indices point to major differences between the two instruments. Most notably, the item-total correlations are strikingly higher on CGI-CogS relative to SCoRS. Moreover, coefficient alpha for CGI-CogS item responses is .95 with an average inter-item correlation of .47 (range from .15 to .65). The coefficient alpha for SCoRS item responses is .89 with an average inter-item correlation of .29 (range from .09 to .53). The difference in the average inter-item correlation is striking and could be taken to suggest that the two instruments are measuring cognitive deficits at different levels of generalization, with the CGI-CogS having more homogeneous item content relative to SCoRS. However, the raw summed scores on the two scales are correlated r = .84, a value which, when corrected for attenuation due to unreliability in observed scores, is r = .92. This strongly suggests that the scale scores reflect a common latent variable.
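The reliability figures quoted above can be roughly reproduced from the reported summary values. The sketch below is only an illustration: it assumes the standardized form of coefficient alpha (computed from the number of items and the mean inter-item correlation) and the classical correction for attenuation using the two alphas as reliabilities. The text does not state which variants were actually used, and the function names are hypothetical.

```python
import math


def standardized_alpha(n_items: int, mean_inter_item_r: float) -> float:
    """Standardized coefficient alpha from item count and mean inter-item r."""
    return n_items * mean_inter_item_r / (1.0 + (n_items - 1) * mean_inter_item_r)


def disattenuated_r(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Summary values reported in the text (rounded to two decimals there).
alpha_cgi = standardized_alpha(21, 0.47)     # CGI-CogS: 21 items
alpha_scors = standardized_alpha(20, 0.29)   # SCoRS: 20 items

# Assumption: the two coefficient alphas serve as the reliability estimates.
r_corrected = disattenuated_r(0.84, 0.95, 0.89)

print(f"alpha CGI-CogS ~ {alpha_cgi:.3f}")
print(f"alpha SCoRS    ~ {alpha_scors:.3f}")
print(f"corrected r    ~ {r_corrected:.3f}")
```

With these rounded inputs the two alphas come out at about .95 and .89, matching the reported values, while the corrected correlation comes out at about .91, slightly below the reported .92. The small gap may reflect rounding of the inputs or a different reliability estimate; that is a guess, not something the text confirms.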
Another obvious difference between the instruments is evident on inspection of the response frequency distributions in the two descriptive statistics tables. As noted previously, the two instruments use different response formats. The CGI-CogS has a one to seven scale with anchors: none, minimal, mild, moderate, serious, severe, dangerous. However, there were so few ratings in categories six (severe) and seven (dangerous) that responses in these categories were collapsed in the CGI-CogS table. SCoRS has a one to four response scale with anchors: none, mild, moderate, and severe. Given the need to collapse the extreme categories with CGI-CogS, and the low rating frequencies in category four on SCoRS, it is apparent that our sample did not contain many participants with extreme (rated severe and above) levels of cognitive deficits. This is understandable given that the sample was intentionally selected to involve schizophrenia outpatients who were on stable medications and not in an acute psychotic episode.
Despite the collapsing of the top two categories, it appears that the CGI-CogS distinction between minimal and mild impairment (versus SCoRS none to mild) reduces the floor effect, whereas at the other end of the scale, the CGI-CogS distinction between moderate and serious (versus SCoRS moderate to severe) better spreads ratings at the high end. Interestingly, judging by the item means and relative frequencies of category ratings, on average participants are rated in the minimal (rating of 2) to mild (rating of 3) impairment range on CGI-CogS items but are characterized as mildly impaired (rating of 2) on the SCoRS items. Such findings may indicate that it is easier for raters to identify impairments using the CGI-CogS instrument relative to the SCoRS. Another way of stating this is that the criterion for a high item rating on SCoRS items may be set too high. Consider for example the SCoRS “forgets names” item cited above where the extreme ratings were: 3) moderate = frequently forgets names, and 4) severe = almost always forgets names. Perhaps “frequently” and “almost always” are too extreme.
Understanding the psychometric differences between the two instruments requires acknowledging their different approaches to measurement. First, as discussed, due to differences in the number of item categories, CGI-CogS items have more response variance relative to SCoRS even after collapsing categories. Although this may partially explain the increased item-intercorrelations for CGI-CogS, the number of response categories cannot completely account for the observed differences. For example, when we collapsed responses on each instrument down to three categories for all items, CGI-CogS still retained a notable advantage in terms of item-test correlations, inter-item correlations, and coefficient alpha. To better understand the differences between the item responses produced from each instrument, we believe that it is important to consider the structure and format of the item content and scoring for the two instruments.
Although both instruments attempt to cover similar aspects of cognitive functioning, CGI-CogS contains item content that explicitly attempts to capture a theory of cognitive deficits that specifies that there are seven “domains” (Nuechterlein et al., 2004) of cognition. Accordingly, CGI-CogS contains two, three, or four items that were written explicitly to capture these domains. In addition, items are administered in order by domain. Because of this explicit domain structure, items within a domain tend to be relatively highly correlated, and thus CGI-CogS has a relatively higher average inter-item correlation. Indeed, as will be demonstrated in the subsequent factor analyses, the CGI-CogS item ratings within a domain do tend to cluster together in a more coherent way relative to SCoRS item ratings.
Another possible explanation for the difference in internal consistency lies in the different ways in which the items are written and in the content anchoring used for rating the responses. As noted previously, each CGI-CogS item is phrased as a series of examples (e.g., CGI-CogS Item #3: “Do you have trouble concentrating? Do you take breaks frequently? Do you have trouble paying attention while reading, listening to the radio, or watching television, long enough to read/listen/see a whole article/chapter/program?”). If the participant or informant acknowledges difficulties, they are then asked to discuss concrete examples, which are written down and further probed by the interviewer. Finally, the content anchors for the response categories request an evaluation of the degree of cognitive deficit and the extent to which that deficit interferes with day-to-day functioning (e.g., 4) moderate = “cognitive deficits with clear effects on functioning”). In contrast, SCoRS items tend to be more specific in content, and the content anchors for the response categories are defined by frequency or time judgments. For example, SCoRS Item #3 asks “do you have difficulty… following a TV show: examples, favorite shows or news.” Then, based on what the participant or informant describes, the rater selects: mild = “can follow a short movie or news show (1–2 hours)”, moderate = “can only follow a light 30 minute show (i.e., sitcom)”, or severe = “unable to follow a TV show for any period of time more than five minutes.”
In sum, CGI-CogS items probe a wide range of contexts where problems (e.g., concentration) may arise. In turn, CGI-CogS items may lead to a more informative interaction between the interviewer and interviewee and provide the clinical interviewer with a more precise view of overall cognitive impairment and its impact on daily functioning. In contrast, SCoRS items tend to ask about a specific context (e.g., watching a movie or TV show), and then the interviewer must judge whether there is an impairment and whether that impairment occurs rarely, frequently, or nearly all the time. One conclusion we draw from these findings is that although the high correlation between raw scores indicates that CGI-CogS and SCoRS items are measuring essentially the same underlying construct, the CGI-CogS produces more internally consistent ratings. The consequences of this will become very clear after we review the factor analytic and item response theory results in the following sections.
In this section, we explore the dimensionality of responses to the CGI-CogS and SCoRS instruments. One goal of this research is to apply an IRT model so that we can make more informed decisions regarding shortening the instruments and to better understand how the items function as a measure of cognitive deficits. Because the IRT model we use makes a “unidimensionality” assumption, we need to explore whether this assumption is reasonably satisfied in these data sets. Beyond these psychometric objectives, exploring the dimensionality of item responses derived from these instruments is also of substantive interest. For example, as noted previously, both CGI-CogS and SCoRS were written to recognize diverse aspects of cognitive functioning. It is interesting to ask, to what degree do these domains emerge as multidimensionality, and in turn, does this multidimensionality interfere with our ability to fit an IRT model and scale individuals on a common dimension?
The first step in the analyses was to estimate a polychoric correlation matrix for each scale separately. Polychorics were estimated using the POLYMAT command available in the PSYCH package (Revelle, 2009) of the R statistical software (R Development Core Team, 2008). This step was necessary to avoid the well-known problems of factor analyzing item-level data using Pearson correlations. For each scale separately, we then conducted three separate sets of exploratory factor analyses: 1) a unidimensional model, 2) a multidimensional correlated factors model, and 3) a bifactor model (Schmid & Leiman, 1957). In the unidimensional model we inspected the eigenvalues, percent of variance explained, mean and standard deviation of residuals, and a goodness-of-fit index (GFI; McDonald, 2000). The GFI reflects the proportion of common variance explained by the model. Although there are no gold-standard rules of thumb for deciding when a response matrix is “unidimensional enough” for IRT modeling (see Embretson & Reise, 2000), researchers generally look for a large ratio of the first to second eigenvalues (e.g., > 3 to 1), a goodness-of-fit of at least 0.90 (better if above 0.95), and residuals with a mean of 0.0 and a standard deviation less than 0.05.
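The rules of thumb just listed can be collected into a small screening helper. This is a minimal sketch, not the authors' procedure: the function name, the dictionary keys, and the tolerance used to decide that a residual mean is "0.0" are assumptions. The example inputs are the unidimensional-model summaries reported in the results for each scale.

```python
def unidimensionality_checks(eigenvalues, gfi, resid_mean, resid_sd,
                             min_ratio=3.0, min_gfi=0.90, max_resid_sd=0.05,
                             mean_tol=0.005):
    """Compare summary indices against the informal benchmarks in the text.

    mean_tol is an assumed tolerance for treating the residual mean as zero.
    """
    ratio = eigenvalues[0] / eigenvalues[1]
    return {
        "eigen_ratio": ratio,
        "ratio_ok": ratio > min_ratio,
        "gfi_ok": gfi >= min_gfi,
        "resid_mean_ok": abs(resid_mean) < mean_tol,
        "resid_sd_ok": resid_sd < max_resid_sd,
    }


# First two eigenvalues, GFI, and residual summaries as reported in the results.
cgi = unidimensionality_checks([11.56, 1.44], gfi=0.98,
                               resid_mean=0.0, resid_sd=0.07)
scors = unidimensionality_checks([7.86, 1.45], gfi=0.97,
                                 resid_mean=0.0, resid_sd=0.08)

for name, res in (("CGI-CogS", cgi), ("SCoRS", scors)):
    print(name, f"ratio={res['eigen_ratio']:.2f}", res)
```

Both scales clear the eigenvalue-ratio and GFI benchmarks, and both exceed the 0.05 benchmark for the residual standard deviation, which is the caveat the results acknowledge.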
We then extracted from 1 to 7 factors for each scale with direct oblimin (delta = 3) rotations. These solutions were evaluated using the same indices as described above, but in addition, we examined the substantive coherence of each factor and the correlations among the extracted factors. As argued elsewhere (Reise, Morizot, & Hays, 2007), multidimensional solutions are useful in identifying item content clusters in correlation matrices, but they are not very useful in terms of deciding whether the data are unidimensional enough for IRT analyses. As is well recognized (McDonald, 1981), the item responses that result from a measure of a substantively complex construct are rarely if ever strictly unidimensional (i.e., each item has only one common cause, and the rest of the variation is item specific and error). Typically, many psychological measures contain repeated content which, in turn, will emerge as factors in a factor analysis, and if sample size is large enough, these factors will be statistically significant (see Gibbons & Hedeker, 1992, for similar arguments). Yet, the presence of multidimensionality does not necessarily vitiate either the scoring of an instrument as a measure of a single construct or the fitting of a subsequent unidimensional IRT model. The important criterion is whether there is a dominant general factor running through the items. The way to explore this issue, as argued by Reise, Morizot, and Hays (2007) and others (e.g., Immekus & Imbrie, 2008), is through fitting a bifactor model and comparing the results to the unidimensional model.
Because exploratory (as opposed to confirmatory) bifactor analysis may be unfamiliar, we will briefly review the method. Note first that commonly used factor analytic rotation methods are fundamentally incapable of identifying (rotating to) a bifactor structure where each item loads on a common factor, and one or more item content based factors called "group" factors. The reason is that factor rotations (e.g., varimax, promax, and oblimin) that are included in commonly-used software are designed to identify "simple structures" where each item loads only on a single factor (see Browne, 2001). However, this presents a problem for researchers who want to explore alternative factor structures such as a bifactor.
There are three solutions to this dilemma. First, a researcher may use targeted factor rotations as are available in the Comprehensive Exploratory Factor Analysis program (Browne, Cudeck, Tateneni, & Mels, 2004). These rotations are technically complicated and will not be discussed further here. Second, a researcher may avoid the rotation issue by imposing a "confirmatory" bifactor structure and then use structural equations modeling software to fit a model. This topic will also not be discussed because our goal is to explore data structure, not to confirm a model or compare the statistical fit of two alternative models.
A third solution, and the one used here, is to conduct an exploratory bifactor analysis using the Schmid-Leiman (1957) orthogonalization technique. To implement the Schmid-Leiman, we used the SCHMID command available in the PSYCH package (Revelle, 2009) of the R statistical software. In an exploratory bifactor model, each item is allowed (but not forced) to load on a general factor and one or more “group” factors. The group factors potentially account for residual common variance that is left unexplained by the general factor. Typically, group factors emerge when measures contain clusters of items with highly similar content. In a bifactor model, the general and group factors are constrained to be orthogonal.
Essentially, an exploratory bifactor model (as opposed to confirmatory) is an "orthogonalization" of a second-order factor model. That is, first a set of correlated factors are extracted and rotated (e.g., using principal axis extraction with oblimin rotation). These correlated factors are called "primary" or “first-order” factors. Then, the correlations among those primary factors are factored, and the resulting factor loadings reflect the correlations between the primary and "higher-order" dimension. The Schmid-Leiman technique transforms this higher-order solution into a bifactor structure where each item can load on a common dimension and one or more group factors. Generally speaking, if the original oblique factor extraction contains three correlated factors, then the Schmid-Leiman will contain one general and three group factors. Factor identification and interpretability is judged by looking for an independent cluster basis (i.e., each factor has three or more items that load only on that factor) as explained in McDonald (1999).
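The transformation described in this paragraph can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the routine used in the study. It takes an oblique pattern matrix and the loadings of the primary factors on a single higher-order factor as given, and applies the standard Schmid-Leiman arithmetic: general loading = sum over factors of (primary loading × second-order loading), and group loading = primary loading × sqrt(1 − second-order loading²). All numeric values are hypothetical.

```python
import math


def schmid_leiman(primary_loadings, second_order_loadings):
    """Orthogonalize a higher-order solution into general and group loadings.

    primary_loadings: one row per item, one column per primary factor.
    second_order_loadings: loading of each primary factor on the
    single higher-order factor.
    """
    general, group = [], []
    for row in primary_loadings:
        pairs = list(zip(row, second_order_loadings))
        # An item's general loading passes through each primary factor.
        general.append(sum(l * g for l, g in pairs))
        # Group loadings keep only the part of each primary factor that
        # is residual to the higher-order factor.
        group.append([l * math.sqrt(1.0 - g * g) for l, g in pairs])
    return general, group


# Hypothetical three-factor oblique solution for four items.
pattern = [
    [0.7, 0.0, 0.0],
    [0.0, 0.6, 0.0],
    [0.0, 0.0, 0.8],
    [0.5, 0.2, 0.0],
]
higher_order = [0.8, 0.7, 0.6]  # hypothetical second-order loadings

general, group = schmid_leiman(pattern, higher_order)
for i, (g, grp) in enumerate(zip(general, group), start=1):
    print(f"item {i}: general={g:.2f}, group={[round(x, 2) for x in grp]}")
```

For an item that loads on a single primary factor, the squared general and group loadings sum to the squared primary loading, so the item's communality is unchanged; the variance is only re-partitioned between the general and group factors.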
In sum, in an exploratory analysis (but not confirmatory), an oblique rotation, a higher-order factor model, and a Schmid-Leiman orthogonalization are all equivalent ways of examining data and are highly interrelated (see Chen, West, & Sousa, 2006, and the citations therein for further discussion). Indeed, after specifying the number of primary dimensions (or group factors), the SCHMID command in the PSYCH package produces all three solutions. Finally, we note that the Schmid-Leiman can be easily implemented in SPSS and SAS as well (e.g., Wolff & Preising, 2005).
To evaluate the bifactor model, and ultimately the viability of a unidimensional IRT model, we first compare the percent of variance explained by the general factor with the percent explained by the group factors. To the degree that the general factor explains more of the observed variance than the group factors do, a unidimensional IRT model is viable. A second method of evaluation is to compare the factor loadings on the general factor of the bifactor model with the factor loadings in the unidimensional model. If the factor loadings are similar in the unidimensional model (i.e., a model with loadings potentially biased by multidimensionality) and on the general factor in the bifactor model (i.e., a model that takes multidimensionality into account), this is further evidence that multidimensionality can safely be treated as a nuisance and ignored.
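The second evaluation method amounts to an item-by-item comparison of two loading vectors. The following sketch shows one simple way to summarize that comparison. The function name and the loadings are hypothetical and are not taken from the study's tables.

```python
def loading_shift(unidimensional, general):
    """Summarize how far unidimensional loadings sit above general-factor loadings."""
    diffs = [u - g for u, g in zip(unidimensional, general)]
    return {
        "mean_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(abs(d) for d in diffs),
    }


# Hypothetical loadings for four items (illustration only).
uni_loadings = [0.75, 0.68, 0.81, 0.60]
general_loadings = [0.71, 0.65, 0.77, 0.56]

shift = loading_shift(uni_loadings, general_loadings)
print(f"mean difference: {shift['mean_diff']:.3f}")
print(f"largest absolute difference: {shift['max_abs_diff']:.3f}")
```

Small, uniformly positive differences of this kind would indicate that ignoring the group factors inflates the unidimensional loadings only slightly.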
The first seven eigenvalues from the CGI-CogS item responses are: 11.56, 1.44, 1.11, 0.93, 0.68, 0.62, and 0.59, respectively. This dramatic 8 to 1 ratio of the first to second eigenvalue suggests a large general factor. Under the Uni column of the CGI-CogS factor analysis table are the loadings for the unidimensional model, and it is clear that each item loads strongly on this first dimension, with loadings ranging from .57 to .81. The GFI, which reflects the degree to which common variance is accounted for, is .98, strongly supporting the adequacy of a single factor. The mean residual is zero and the standard deviation of the residuals is .07. This latter value is a little above our preferred benchmark of .05, but it is still in an acceptable range.
Despite this strong evidence for a unidimensional solution, for the sake of thoroughness, we also explored multidimensional solutions. Although solutions of five, six, and seven correlated dimensions were evaluated, these results are difficult to interpret due to Heywood cases, factors defined by single items, or factors on which no item had its highest loading. Thus, we argue that the only viable multidimensional solution is the four-factor solution, and it is this model that is reported in the middle set of columns in the CGI-CogS factor analysis table. In this model, GFI is .99, the mean residual is .03, and the standard deviation of the residuals is .03. The factors are clearly content domain factors, where Factor 1 is merged attention and reasoning domain items, Factor 2 is the four social cognition domain items, Factor 3 is spatial and verbal learning domain items, and Factor 4 consists of items from the speed of processing domain. The factors are strongly inter-related, with correlations ranging from .52 (3 and 4) to .69 (1 and 2).
The four-factor model is an “improvement” on the unidimensional model in terms of GFI and standard deviation of residuals. However, it is not surprising that a model with more factors explains the item response data better. To explore the extent to which this is a meaningful improvement, we turn to the bifactor results, which are reported in the third set of columns in the CGI-CogS factor analysis table. Given the exploratory factor analysis results, we report only the findings of the solution with one general factor and four group factors. In the exploratory bifactor model (GFI = .99), all items continue to load highly on the general factor, but the loadings are around .03 to .05 smaller on the general factor than in the unidimensional model. This suggests that the multidimensionality “biases” the loadings in the unidimensional solution, but only slightly. The group factors replicate the four-factor model above, with the exception that now the loadings are much smaller because the general factor has been controlled for in the bifactor solution. Finally, we note that in the bifactor solution, the general factor accounts for 47% of the variance (73% of the common variance), whereas the group factors explain 4%, 4%, 4%, and 5%, respectively, or 17% combined.
The first seven eigenvalues from the SCoRS were: 7.86, 1.45, 1.36, 1.15, 0.99, 0.85, and 0.74, respectively, a pattern suggesting the presence of only a single meaningful dimension. The ratio of the first to second eigenvalue is 5.42, which is smaller than for CGI-CogS but still well above the standard 3 to 1 ratio used to judge unidimensionality. In the first set of factor loadings in the SCoRS factor analysis table are the loadings for the unidimensional model, and again, each item loads strongly on this single dimension, with loadings ranging from .49 to .79. The GFI is .97, strongly supporting the adequacy of a single factor. The mean residual is 0.0 and the standard deviation of the residuals is .08. As with CGI-CogS, this latter value is above our preferred benchmark, but it is by no means a large value.
We also explored multidimensional solutions with SCoRS ranging from 1 to 7 correlated dimensions extracted. However, the interpretation of the dimensions was more challenging in SCoRS relative to the CGI-CogS. In addition, we again had concerns with factor solutions above four dimensions, namely, single-item factors, factors on which no item had its highest loading, or factors that were impossible to meaningfully interpret. For these reasons, we report the results of the four correlated factor solution in the middle columns of the SCoRS factor analysis table. In this model, GFI is .99, the mean residual is 0.0, and the standard deviation of the residuals is .04. We interpret the factors as follows. Factor 1 appears to be composed of attention and focus items. Factor 2 is a memory factor containing four items with the word "remember" in them and “learns new things.” Factor 3 has item content reflecting understanding of meaning and is similar to CGI-CogS social cognition. Factor 4 is marked by two items, “familiar tasks” and “remembers chores,” and is perhaps a demands of daily living factor. The SCoRS factors are not as highly inter-related, with correlations ranging from 0.34 (4 and 2) to 0.54 (1 and 2).
In the last set of factor loadings in , we report the solution with one general factor and four group factors. In this exploratory bifactor model (GFI = .99), all items continue to load highly on the general factor, but the loadings are around .03 to .06 smaller than in the unidimensional model. As expected, the factor loadings on the group factors are now smaller because the general factor has been controlled for in the bifactor solution. The general factor accounts for 30% of the item response variance (60% of common variance), whereas the group factors explain 6%, 5%, 4%, and 5%, respectively, or 20% combined. Clearly, the general factor is not as strong in SCoRS relative to CGI-CogS, and the factors do not cohere as one would expect based on the a priori content groupings.
Although the above analyses were important for us to learn how each instrument functions as a stand-alone measure, it is the combined analyses that are most decisive in evaluating whether it is tenable to fit an IRT model to both scales simultaneously. The first seven eigenvalues of the combined matrix are 18.22, 2.17, 1.75, 1.52, 1.51, 1.40, and 1.28, respectively, again suggesting a strong general factor. To save space, and because the multidimensional solutions were so challenging to interpret, we will not provide a table of factor loadings (available from authors) or a detailed description of the multidimensional solutions in this section. Rather, we note first that in the unidimensional solution, all items continued to load substantially. Second, although solutions containing from four to seven group factors were considered in the bifactor analysis, we decided to extract seven group factors. The GFI for this model was 0.99, with a mean residual of 0.00 and a standard deviation of 0.03.
As with the previous results, the loadings on the general factor in the bifactor solution were only slightly lower than their values in the unidimensional solution. This suggests that the multidimensionality does not interfere with our ability to use the items to scale individuals on a common dimension. Third, we note that, unsurprisingly given the above results, the CGI-CogS items dominate the general factor in terms of loadings, with the exception of SCoRS items #14 (“learns new things”) and #20 (“follows group conversations”). Finally, it is clear that when the two instruments are combined, the general factor continues to dominate the group factors. Specifically, the general factor accounts for 38% of the variance while the group factors combined account for 23%. However, the dominance of the general factor is reduced in this joint analysis compared to when the CGI-CogS was analyzed alone. Nevertheless, we argue that these findings support our application of IRT to the combined measures, a topic we now address.
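The summary figures quoted in the factor analyses can be checked with simple arithmetic. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, and the inputs are the rounded percentages and eigenvalues reported in the text, so the results are only approximate.

```python
def common_variance_share(general_pct, group_pcts):
    """Share of the common (explained) variance due to the general factor."""
    return general_pct / (general_pct + sum(group_pcts))


def first_to_second_ratio(eigenvalues):
    """Ratio of the first to the second eigenvalue."""
    return eigenvalues[0] / eigenvalues[1]


# Rounded percentages reported for the bifactor solutions.
cgi_share = common_variance_share(47, [4, 4, 4, 5])
scors_share = common_variance_share(30, [6, 5, 4, 5])
# Only the combined group-factor total (23%) is reported for the joint analysis.
combined_share = common_variance_share(38, [23])

# Eigenvalues reported for the SCoRS and for the combined item pool.
scors_ratio = first_to_second_ratio([7.86, 1.45, 1.36, 1.15, 0.99, 0.85, 0.74])
combined_ratio = first_to_second_ratio([18.22, 2.17, 1.75, 1.52, 1.51, 1.40, 1.28])

print(f"CGI-CogS general share of common variance: {cgi_share:.2f}")
print(f"SCoRS general share of common variance:    {scors_share:.2f}")
print(f"Combined general share of common variance: {combined_share:.2f}")
print(f"SCoRS eigenvalue ratio:    {scors_ratio:.2f}")
print(f"Combined eigenvalue ratio: {combined_ratio:.2f}")
```

The shares for the single-instrument analyses reproduce the 73% and 60% quoted above. The combined share and the combined eigenvalue ratio are derived here from the reported figures rather than taken from the text.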
Item Response Theory Analysis
Item response theory (IRT; Embretson & Reise, 2000) is the dominant measurement theory in aptitude, achievement, and licensure testing and is quickly becoming the default psychometric method in patient reported health outcomes measurement as well (see Reise & Waller, 2009). IRT measurement models are a device to study the relation between individual differences on a latent variable (e.g., cognitive deficits) assumed to underlie item responses and the probability of responding in a particular response category. IRT measurement models require that certain assumptions be met in order for the results to be interpretable. First and foremost, unidimensional IRT models (i.e., models with only a single latent variable) require that the data be reasonably unidimensional; that is, one common trait should account for the item responses. Another way of stating this is that IRT models assume that the estimated item parameters reflect the relation between item responses and the common trait they are assessing and are not overly distorted by multidimensionality in the data. The above factor analyses establish that although both instruments contain some multidimensionality due to content clusters, the overwhelming majority of common variance is explained by a strong general factor. For this reason, we feel comfortable moving forward with fitting an IRT model.
The Graded Response Model
The graded response model (GRM; Samejima, 1969) is appropriate when items contain ordered categorical responses (Embretson & Reise, 2000; Ostini & Nering, 2006). In the GRM, each scale item (i) is described by one item slope parameter (αi) and j = 1 … mi “threshold” parameters (βij), where mi is the number of response categories minus one. These parameter estimates allow researchers to compute category response curves (CRCs) that display the relation between a latent variable (cognitive deficits) and the probability of responding in each category. For example, displays the CRCs for SCoRS item #13 “Stays Focused”. This item will be discussed in more detail shortly.
Category Response Curves: SCoRS #13 Stay Focused.
As for interpreting the parameters, the βij parameters in the GRM represent the trait level necessary to respond above a between category boundary j with .50 probability. In other words, for a four-category item scored as 1 through 4, there are three boundaries between the categories (e.g., 1 versus 2, 3, 4; 1, 2 versus 3, 4; and 1, 2, 3, versus 4). Thus, the first threshold parameter represents the point on the latent trait continuum where the probability of responding above category 1 is 50%; the second threshold parameter represents the point on the latent trait continuum where the probability of responding above category 2 is 50%; and the third threshold parameter represents the point on the latent trait continuum where the probability of responding above category 3 is 50%. The slope parameters (αi) in the GRM are analogous to factor loadings, and are related to the discriminating power of an item. Items with relatively high slope parameters are better able to differentiate among people at different ranges of the latent trait.
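The verbal description above corresponds to the usual logistic form of the GRM, written out here as a sketch. The source does not print the equations, so the exact parameterization is an assumption, in particular the logistic metric without a 1.7 scaling constant; the symbols θ, αi, βij, and mi follow the text.

```latex
% Boundary curve: probability of responding above boundary j of item i
P^{*}_{ij}(\theta) = \frac{1}{1 + \exp\left[-\alpha_i\,(\theta - \beta_{ij})\right]},
\qquad j = 1, \dots, m_i

% Category response curve: probability of responding in category k
P_{ik}(\theta) = P^{*}_{i,k-1}(\theta) - P^{*}_{ik}(\theta),
\qquad k = 1, \dots, m_i + 1,
\qquad P^{*}_{i0}(\theta) = 1, \quad P^{*}_{i,m_i+1}(\theta) = 0
```

Setting θ = βij gives a boundary probability of exactly .50, which is the interpretation of the threshold parameters given above.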
In this research, we estimated the GRM parameters for the combined pool of 41 items (21 CGI-CogS and 20 SCoRS items) using MULTILOG (Thissen, 1991). Mostly program defaults were used with two important exceptions. First, the number of estimation cycles allowed was set to 500. This proved to be more than enough iterations to achieve a converged solution. Second, the quadrature points used to define the distribution of the latent trait in the population were specified to range from −4.5 to 4.5 in increments of 0.1. This was implemented in order to better estimate extreme threshold parameters (e.g., greater than 4.0 or less than −4.0) that may result from low response rates in the extreme categories.
Note that in IRT modeling, for identification purposes, it is customary to define the latent variable (cognitive deficits) such that the mean is zero and the standard deviation is 1.0. Thus, the item parameter estimates in this study should be interpreted in regard to this metric. Specifically, the mean of the latent trait distribution in this study refers to the mean level of cognitive deficits in a stabilized out-patient population characterized by mild to moderate levels of cognitive impairment.
Finally, because many items, especially SCoRS items, have few responses in the most impaired category, we were concerned with our ability to estimate some threshold parameters. To address this concern, we ran MULTILOG in an iterative fashion. First, we ran the program specifying that all the CGI-CogS items had five categories and all the SCoRS items had four categories. We then inspected the threshold parameters and their standard errors to check whether they were reasonable. If any estimated threshold parameter was above 4.0 or below −4.0, we collapsed categories and re-estimated the model. In the first run, we found that the fourth threshold parameter for CGI-CogS #17 was above 4.0 and had a relatively large standard error. The same occurred for the third threshold parameter of nine of the twenty SCoRS items. We thus collapsed responses in category five into category four for CGI-CogS #17, and collapsed responses in category four into category three for those nine SCoRS items. The need to collapse categories reflects both the lack of severely impaired participants and the extremity of the anchor for the extreme category (e.g., forgets names “all the time”).
After collapsing categories, the second MULTILOG run produced more reasonable and interpretable estimates (i.e., threshold estimates between −4 and 4) for all parameters. These parameter estimates along with their standard errors are reported in . In the first column of are the estimated item slope parameters. The size of the item slope parameters ranges from around 1.0 to 2.5. It is clear that the CGI-CogS items tend to be more discriminating (i.e., better able to differentiate among people) relative to the SCoRS items. This result is completely consistent with the higher item-test correlations and factor loadings in the unidimensional solution for CGI-CogS relative to SCoRS reported earlier. Within CGI-CogS, all items have reasonably high slopes and there does not appear to be any consistent link between content domain and the size of the estimated slope parameter.
Item Response Theory Item Parameter Estimates and Standard Errors For the Graded Response Model
As noted, relative to CGI-CogS, the SCoRS items tend to have relatively lower slope estimates. This remains true even for SCoRS items that did not contain collapsed categories. One reason this may occur is that the response format for rating CGI-CogS items does a better job of differentiating among people than the response format for rating SCoRS items. In particular, observe that the SCoRS items with the lowest slopes are also the ones with low frequencies in the top category. What happens with these types of items is that responses get bunched up in the lower response categories, which is another way of saying the response categories are not as discriminating. For example, consider the category response curves for SCoRS item #13 “Stays Focused” shown in . This item has a slope of 0.96 with thresholds of −1.99, 1.39, and 3.48, the latter value indicating that the model predicts that an individual must be over three standard deviations above the mean to have a 50% chance of being rated in category four. For a very wide range of the latent trait, a rating in category two is most likely. It is only very extreme individuals (above roughly 1.5 standard deviations) who are more likely to be rated a three, and only the most extremely impaired (above roughly 3.5 standard deviations from the mean) are likely to receive the four rating.
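The category response curves for SCoRS item #13 can be approximated directly from the reported slope and thresholds. The sketch below assumes the standard logistic form of the GRM without a 1.7 scaling constant; the function and variable names are hypothetical, and the code illustrates the model rather than reproducing MULTILOG output.

```python
import numpy as np


def grm_category_probs(theta, alpha, betas):
    """Category probabilities for one GRM item (rows: theta, columns: categories)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    betas = np.asarray(betas, dtype=float)
    # Boundary curves: probability of responding above each category boundary.
    p_star = 1.0 / (1.0 + np.exp(-alpha * (theta[:, None] - betas[None, :])))
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    # Each category probability is the difference of adjacent boundary curves.
    return np.hstack([ones, p_star]) - np.hstack([p_star, zeros])


# Slope and thresholds reported in the text for SCoRS item #13.
ALPHA_13 = 0.96
BETAS_13 = [-1.99, 1.39, 3.48]

theta_grid = np.linspace(-4.0, 4.0, 81)
probs = grm_category_probs(theta_grid, ALPHA_13, BETAS_13)
most_likely = probs.argmax(axis=1) + 1  # categories are scored 1 through 4

for theta, category in zip(theta_grid[::10], most_likely[::10]):
    print(f"theta = {theta:+.1f}: most likely rating = {category}")
```

Under these assumptions, category two is the modal rating across the middle of the trait range, and the probability of a category-four rating reaches .50 only at a trait level of 3.48.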
The estimated threshold parameters and their standard errors are shown in the next set of columns. Recall that these values reflect the point on the latent trait continuum where the probability of responding above a between category boundary is 0.5. For example, using the 1, 2, 3, 4, 5 response scale of CGI-CogS Item #1, the thresholds reflect the trait level necessary to be rated above category 1 (−1.93), above category 2 (−0.57), above category 3 (0.60), and above category 4 (1.85). These location parameters are critically important for a number of reasons. As explained below, they determine where on the latent trait continuum an item provides information among individuals (i.e., where on the latent trait the item discriminates best among individuals). They are also important for judging the item quality and determining how the clinical raters are using the categories to differentiate among people.
Item and Test Information and Standard Error of Measurement
An important difference between traditional measurement theory and IRT is that in the former, individuals are assumed to have equal standard errors regardless of their position on the construct. In IRT, individuals can have different standard errors depending on how discriminating or informative a set of items is in different ranges of the latent trait (cognitive deficits). Specifically, the estimated item parameters from the graded response model () can be transformed into item information curves (Ostini & Nering, 2006). Items with larger slope parameters provide relatively more information, and where that information is located on the latent trait continuum depends on the item threshold parameters; item information is maximized around the threshold parameters.
To illustrate, provides the contrast in information between CGI-CogS #14 (slope = 2.41; “judgment in situations”) and SCoRS #9 (slope = 1.34; “keep track of money”). Because information is a function of the square of the item slope, the CGI-CogS item has nearly three times the information in the middle of the trait range. In short, it would take three items like SCoRS #9 to discriminate as well as one CGI-CogS #14. For a second example, in we provide a contrast of the information curves for CGI-CogS #4 (“focus on information”) and CGI-CogS #11 (“initiate attempts to solve problems”). These two items have similar slopes, but somewhat different threshold parameters. Although the differences in item information are not dramatic, the former item provides relatively more information in the high range of cognitive deficit whereas the latter provides relatively more in the low range.
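The role of the slope in item information can be illustrated with a short sketch. It assumes the standard GRM information function (the sum over categories of the squared derivative of each category probability divided by that probability) on the logistic metric. The thresholds used below are hypothetical, because the thresholds for these two items are not quoted in the text; only the slopes 2.41 and 1.34 come from the source.

```python
import numpy as np


def grm_item_information(theta, alpha, betas):
    """Item information for one GRM item at each value of theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    betas = np.asarray(betas, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-alpha * (theta[:, None] - betas[None, :])))
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    probs = np.hstack([ones, p_star]) - np.hstack([p_star, zeros])
    # Derivative of each category probability with respect to theta.
    w = p_star * (1.0 - p_star)
    d_probs = alpha * (np.hstack([zeros, w]) - np.hstack([w, zeros]))
    # Clip to avoid division by zero far out in the tails.
    return np.sum(d_probs ** 2 / np.clip(probs, 1e-12, None), axis=1)


# Slopes reported in the text; the thresholds are HYPOTHETICAL placeholders.
SLOPE_CGI_14 = 2.41
SLOPE_SCORS_9 = 1.34
HYPOTHETICAL_BETAS = [-1.5, 0.0, 1.5]

slope_ratio_squared = (SLOPE_CGI_14 / SLOPE_SCORS_9) ** 2
info_high = grm_item_information(0.0, SLOPE_CGI_14, HYPOTHETICAL_BETAS)[0]
info_low = grm_item_information(0.0, SLOPE_SCORS_9, HYPOTHETICAL_BETAS)[0]

print(f"squared slope ratio: {slope_ratio_squared:.2f}")
print(f"information at theta = 0 (hypothetical thresholds): "
      f"{info_high:.2f} vs {info_low:.2f}")
```

The squared slope ratio is only a rough guide. The actual ratio of information at any trait level also depends on where each item's thresholds fall, which is why the curves have to be compared across the trait range.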
Item Information Functions CGI-CogS #14 Versus SCoRS #9.
Item Information Functions CGI-CogS #11 Versus CGI-CogS #4.
Information is a critically important concept in IRT for three reasons. First, information is additive across the items administered. Thus, if a researcher wanted to know how informative a scale consisting of the first five CGI-CogS items would be, all he or she would have to do is add together the item information functions for those items. For example, in is shown the scale information for all 41 items, the 21 CGI-CogS items and the 20 SCoRS items. Taken as a whole or by scale, this pool of items is most informative around the mean trait level and then tapers off sharply at both extremes. Second, information is inversely related to an individual’s standard error of measurement; the more information a measure provides in a certain trait range, the more precise an individual’s measurement in that trait range will be. This feature of IRT allows researchers to go beyond asking how reliable or precise a score is, and to study how precise a measurement is for individuals in high, medium, and low trait ranges. A third important feature of information, as demonstrated below, is that it plays a crucial role in real-world or simulated computerized adaptive testing (CAT; Wainer et al., 2000). Specifically, in an adaptive test, the item information function is used to select items that are most informative (i.e., discriminating) for a particular individual dependent on their currently estimated trait level as they progress through a computerized test. In the following section, we explore the use of simulated adaptive testing in order to further learn about the relative quality of the items and to explore the consequences of shortening the combined CGI-CogS and SCoRS measure.
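The inverse relation between information and the standard error can be made concrete with a small sketch. It uses the standard result that the standard error is one over the square root of the information, and it assumes a latent trait variance of 1.0 when converting a standard error into an approximate reliability. The function names are hypothetical.

```python
import math


def standard_error(information):
    """Standard error of measurement implied by a given amount of information."""
    return 1.0 / math.sqrt(information)


def information_needed(target_se):
    """Information required to reach a target standard error."""
    return 1.0 / target_se ** 2


def approximate_reliability(se):
    """Reliability analogue, ASSUMING the latent trait has variance 1.0."""
    return 1.0 - se ** 2


for se in (0.25, 0.30, 0.40, 0.50):
    print(f"SE = {se:.2f}: information needed = {information_needed(se):5.2f}, "
          f"error variance = {se ** 2:.4f}, "
          f"approx. reliability = {approximate_reliability(se):.4f}")
```

Because information adds across items, summing item information at a given trait level and passing the total to `standard_error` gives the precision of any candidate short form at that trait level.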
Scale Information Functions for Entire Pool, CGI-CogS, SCoRS, and 10 CAT Items.
Simulated Computerized Adaptive Testing: How Many Items are Really Needed?
In traditional psychometrics, to compare individuals on the same scale, all people must be administered the same items or parallel tests. In contrast, under an IRT framework, an individual’s position on the latent trait scale can be estimated based on any subset of items with known item parameters (Embretson & Reise, 2000). For this reason, it is often argued that one of the major advantages of IRT measurement models is that they form the basis for computerized adaptive testing. Indeed, computerized adaptive test administration has been used extensively for large-scale aptitude and licensure testing (see Wainer et al., 2000) and is increasingly popular in personality and health outcomes assessment (Reise & Waller, 2009).
In this study, we use the logic of IRT-based CAT to explore the question, if hypothetically a CAT were to be administered, which items would be selected the most, and in turn, how many items are needed to scale individuals on a cognitive deficit continuum at different levels of precision? We explored these issues by performing a computerized adaptive simulation using the estimated item parameters from the previous section, and the actual item response patterns from the 315 data points with complete responses to all items. This simulation is easily accomplished using an existing software package called FIRESTAR 0.09 (Choi, 2009). This program reads in the estimated item parameters and then computes item information functions as described in the previous section. We specified that all individuals should start at a latent trait value of 0.0. For each individual response vector, FIRESTAR then finds the item that is most informative at latent trait equals zero (note: this is always the same item for all individuals). FIRESTAR then finds the response and re-estimates an individual’s trait level and standard error using the expected a posteriori (Mislevy & Bock, 1982) method. This cycle continues until a stopping rule is encountered. In this study, we explored four different stopping criteria. Specifically, we instructed FIRESTAR to end adaptive testing when a person’s standard error was either at or below: .25, .3, .4, or .5. On a z-score metric, these values are roughly analogous to error variances of 6%, 9%, 16%, and 25%, respectively.
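The select, score, and re-estimate cycle described above can be sketched in a few lines. This is an illustration of the general logic only, not FIRESTAR's code: the function names, the standard normal prior, the quadrature grid, and the three-item bank are all assumptions made for the example, and responses are coded as 0-based category indices.

```python
import numpy as np


def category_probs(theta, alpha, betas):
    """GRM category probabilities (rows: theta, columns: categories)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p_star = 1.0 / (1.0 + np.exp(-alpha * (theta[:, None] - np.asarray(betas, dtype=float)[None, :])))
    ones, zeros = np.ones((theta.size, 1)), np.zeros((theta.size, 1))
    return np.hstack([ones, p_star]) - np.hstack([p_star, zeros])


def item_information(theta, alpha, betas):
    """GRM item information at each value of theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p_star = 1.0 / (1.0 + np.exp(-alpha * (theta[:, None] - np.asarray(betas, dtype=float)[None, :])))
    zeros = np.zeros((theta.size, 1))
    w = p_star * (1.0 - p_star)
    d_probs = alpha * (np.hstack([zeros, w]) - np.hstack([w, zeros]))
    probs = np.clip(category_probs(theta, alpha, betas), 1e-12, None)
    return np.sum(d_probs ** 2 / probs, axis=1)


def simulate_cat(responses, items, se_stop, grid=None):
    """Replay one person's observed responses through an adaptive test."""
    if grid is None:
        grid = np.linspace(-4.5, 4.5, 91)  # assumed quadrature grid
    posterior = np.exp(-0.5 * grid ** 2)   # assumed standard normal prior
    posterior /= posterior.sum()
    theta_hat, se = 0.0, np.inf            # everyone starts at theta = 0
    remaining = set(range(len(items)))
    administered = []
    while remaining and se > se_stop:
        # Select the unused item that is most informative at the current estimate.
        best = max(remaining, key=lambda i: item_information(theta_hat, *items[i])[0])
        remaining.remove(best)
        administered.append(best)
        # EAP update: multiply in the likelihood of the observed response.
        posterior = posterior * category_probs(grid, *items[best])[:, responses[best]]
        posterior /= posterior.sum()
        theta_hat = float(np.sum(grid * posterior))
        se = float(np.sqrt(np.sum((grid - theta_hat) ** 2 * posterior)))
    return theta_hat, se, administered


# HYPOTHETICAL three-item bank: (slope, thresholds) pairs invented for illustration.
bank = [(2.0, [-1.0, 0.0, 1.0]), (1.0, [-1.0, 0.0, 1.0]), (1.5, [-2.0, 0.0, 2.0])]
observed = [1, 2, 1]  # one person's 0-based category responses to the three items

theta_hat, se, administered = simulate_cat(observed, bank, se_stop=0.5)
print(f"theta = {theta_hat:.2f}, SE = {se:.2f}, items administered: {administered}")
```

Because every simulated examinee starts at the same trait value, the first item selected is always the same, as noted above. With a real item bank, the loop would be run once per observed response vector and per stopping criterion.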
Real-data simulations are typically judged by several criteria. First, how many items were needed to reach the stopping criteria? In these simulations, the average number of items administered was 14.36, 8.68, 4.39, and 2.63, for the four standard error criteria, respectively. A second criterion addresses the question: how does the number of items needed change as a function of position on the latent trait? In , the x-axis is the latent trait score estimate based on all 41 items and the y-axis is the number of items administered during simulated CAT for the 0.30 standard error condition (note: other conditions displayed similar patterns). This figure demonstrates that for people in the middle of the cognitive deficits trait range (−2 to 2), very few items are needed due to the fact that there are many very informative items in that trait range. However, for individuals at the very low end of cognitive deficits (< −2.0), even the entire pool may be unsatisfactory – the present instrument simply does not allow for high measurement precision at the low range. It is arguable that this is irrelevant for this construct because low levels of cognitive deficits would not be a target for intervention.
Plot of CAT Based Trait Level Estimates Versus Number of Items Administered to Reach a Standard Error of Measurement Less Than 0.30 Criterion.
A third way to evaluate a real-data simulation is to examine the correlation between latent trait scores estimated based on CAT versus those based on administration of all 41 items. In the present case, the more stringent the stopping criterion, the higher the correlation. Specifically, the observed correlations are r = .98, .96, .92, and .89, respectively for the four conditions. Of course, these values are inflated due to shared items in the CAT and in the full-scale trait level estimate. Given this, it is still remarkable that the full-scale score can be reproduced almost perfectly with a little over one-third the item pool (i.e., around 14 items).
Finally, it is important to inspect the frequency distribution of which items were administered. For the SEM < 0.3 condition, for example, most examinees receive the same 8 to 12 items. Specifically, CGI-CogS items 1, 2, 4, 6, 11, 14, and to a lesser extent, items 12, 3, 19, and 15 were administered to most individuals regardless of their trait standing. In we include the scale information function based on these 10 items. Interestingly, these 10 CGI-CogS items provide more information than the 20 SCoRS items. Given that a SEM of less than 0.3 implies an error variance of 0.09 (reliability 0.91), it is clear that in terms of scaling individual differences, the CGI-CogS and SCoRS combined item pool can be seriously reduced while still maintaining reasonably precise latent trait score estimates. Such results suggest that future researchers consider either a fixed-length short form or a computerized adaptive test. However, given our finding that even in a CAT, the same set of 10 items is consistently administered to most people, it would be difficult to argue for a CAT over a fixed-length short form in this context.