Measurement Invariance Tests
The accompanying figure is a graphical display of the likelihood space showing the relative differences in fit for all nested model comparisons examined. Model-data misfit is expressed as negative twice the log likelihood (−2lnL) and plotted on the Y-axis. The number of free (estimated) parameters is given on the X-axis. The upper graphic displays changes in misfit for models with age, sex, and age by sex interaction effects on the factor variance versus models allowing covariate effects on the individual MD criteria factor loadings. The lower graphic shows these same model comparison fits for the factor mean and criteria thresholds.
Although this graphical form of presenting model fitting results departs from the more familiar tabular format, we see several advantages to this type of presentation. First, relative differences in model-data misfit for all nested model comparisons can be visually comprehended as a gestalt. The amount of change in −2lnL is expressed by the steepness of the line segments connecting the −2lnL values of models with successively more parameters. Steeper lines indicate greater improvement in fit per additional parameter estimated.
Model misfits (−2lnL) used as comparison benchmarks are shown as circled numbers. Thin dotted lines originating from these reference models are contours of equivalent Akaike’s Information Criterion (AIC) (Akaike, 1981; Akaike, 1987). These AIC contours provide an alternative index for evaluating models by balancing overall fit and model complexity (i.e., number of parameters estimated). Lines with double lettered labels represent likelihood-ratio chi-square difference tests. If certain regularity conditions are met, differences in −2lnL for appropriately nested models are distributed asymptotically as chi-squared. Solid lines indicate significant likelihood-ratio difference tests (p < .05) whereas dashed-dotted lines denote non-significant tests.
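The two decision rules just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used for the analyses: the function names `chi2_sf`, `lr_test`, and `aic_contour` are hypothetical, the chi-square tail probability uses a closed-form recurrence that holds only for integer degrees of freedom, and the −2lnL of the age-on-variance model is derived here from the reported baseline value and test statistic.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail

def lr_test(m2ll_reduced, m2ll_full, extra_params, alpha=0.05):
    """Likelihood-ratio difference test for nested models (solid vs. dashed-dotted line)."""
    delta = m2ll_reduced - m2ll_full
    p = chi2_sf(delta, extra_params)
    return delta, p, p < alpha

def aic_contour(m2ll_ref, k_ref, k):
    """-2lnL at which a model with k parameters has the same AIC as the reference model."""
    return m2ll_ref - 2.0 * (k - k_ref)

# Baseline model: -2lnL = 63538.4 with 20 parameters.
# Age-on-variance model: one extra parameter; its -2lnL is derived from the reported difference.
baseline = 63538.4
age_variance = baseline - 56.7
delta, p, significant = lr_test(baseline, age_variance, extra_params=1)
below_contour = age_variance < aic_contour(baseline, k_ref=20, k=21)
```

A model plotted below the contour drawn from its reference model has the lower AIC of the two; with one added parameter the contour drops by exactly two −2lnL units.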
The baseline model (circled ‘1’ labeled ‘1-No Moderation’) has 20 estimated parameters (10 factor loadings and 10 thresholds). This model produced a −2lnL of 63538.4 and, although it was the most parsimonious model, it produced the worst fit to the data. In the upper portion of the figure (Loadings), the green labeled lines ‘AV’, ‘BV’, and ‘CV’ denote reductions in model misfit for models sequentially estimating age, sex, and an age by sex interaction on the factor variance. The single effect of age (AV) on the factor variance significantly improved the fit (Δχ²(1) = 56.7, p < .001). Adding a sex effect (BV; Δχ²(1) = 16.4, p < .001) and an age by sex interaction effect on the factor variance (CV; Δχ²(1) = 6.1, p = .01) also significantly improved the fit.
The ‘DL’, ‘EL’, and ‘FL’ lines give the improvements in fit for models allowing covariates to directly moderate the factor loadings. The multivariate test of age moderating all MD criteria factor loadings produced a significant improvement over a single age effect on the factor variance (DL; Δχ²(9) = 22.5, p = .007). This was the only multivariate test for factor loadings to fall below the AIC contour line. Including all three covariate effects on the individual loadings (FL) did produce a significant likelihood ratio difference test (solid line) but did not fall below the corresponding AIC contour.
The lower portion of the figure (Thresholds) shows model comparison results for criteria thresholds and factor mean covariate models. A different pattern of results is seen. A modest but significant effect of age on the factor mean (AM; Δχ²(1) = 9.7, p = .002) was found. Adding a sex effect produced a substantial improvement in fit (BM; Δχ²(1) = 57.1, p < .001). The age by sex interaction was also significant (CM; Δχ²(1) = 20.2, p < .001). The multivariate covariate threshold tests were more pronounced and pervasive. All threshold moderation models (DT, ET, and FT) fall well below their corresponding AIC contours. Overall, the best fitting model by the AIC criterion was model 8, which allowed moderation of all criteria thresholds by all three covariates (FT; Δχ²(27) = 119.4, p < .001).
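Because AIC equals −2lnL plus twice the number of estimated parameters, a nested comparison lowers AIC exactly when its Δχ² exceeds twice the number of added parameters. The sketch below applies that rule to the test statistics quoted in the text. The dictionary layout and the name `aic_gain` are illustrative only, and the EL, FL, DT, and ET comparisons are left out because their statistics are not reported here.

```python
# (delta chi-square, added parameters) as quoted in the text
reported = {
    "AV": (56.7, 1),    # age on factor variance
    "BV": (16.4, 1),    # sex on factor variance
    "CV": (6.1, 1),     # age by sex on factor variance
    "DL": (22.5, 9),    # age on all factor loadings
    "AM": (9.7, 1),     # age on factor mean
    "BM": (57.1, 1),    # sex on factor mean
    "CM": (20.2, 1),    # age by sex on factor mean
    "FT": (119.4, 27),  # all three covariates on all thresholds
}

def aic_gain(delta_chi2, added_params):
    """Reduction in AIC relative to the reference model (positive = below the contour)."""
    return delta_chi2 - 2 * added_params

gains = {label: round(aic_gain(*stat), 1) for label, stat in reported.items()}
for label, gain in gains.items():
    print(f"{label}: AIC reduction = {gain}")
```

A positive value corresponds to a point lying below the AIC contour of its reference model. The loading test DL clears its contour by a small margin, whereas the threshold test FT clears it by a wide one, which matches the description of the threshold effects as more pronounced.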
Test results to determine which of the individual covariate effects were responsible for the multivariate findings are presented in Table 2. To be conservative, a model including all three covariate effects on the factor variance and factor mean (FV and FM) was used as the reference model. Factor variance and mean covariate effects were estimated in two ways: (1) age and sex effects were estimated separately; and (2) age, sex, and age by sex interaction effects were estimated jointly. AIC was used to determine improvement in model fit.
| TABLE 2. Results of Testing for Effects of Age, Sex and Age x Sex Interaction on Each of Ten DSM-III-R MD Symptom Criteria |
The age-alone effect on the factor variance was significantly different from zero (0.10, [.08; .13]), indicating an increase in MD factor variance with age. When age, sex, and the age by sex interaction were included in the model together, all three had significant effects on the factor variance. Four MD criteria displayed some form of factor loading covariate moderation that departed from the factor variance covariate expectations (‘depressed mood’, ‘fatigue’, ‘feelings of worthlessness’, and the ‘duration’ criterion).
For factor mean effects, sex by itself had a significant positive effect (0.15, [.11; .16]). Females, on average, have higher MD factor scores compared to males. However, this factor mean sex difference was noticeably reduced and rendered nonsignificant (0.01, [−.07; .09]) when the age by sex interaction was included. Six criteria (‘depressed mood’, ‘loss of interest’, ‘weight problems’, ‘psychomotor problems’, ‘fatigue’, and ‘feelings of worthlessness’) were found to have significant forms of differential threshold covariate moderation. These effects are given in the last three columns of Table 2 for factor loadings (upper) and thresholds (lower).
To further examine the patterns of differential age, sex, and interaction covariate effects for the MD criteria, bootstrapping was carried out. Using a twin model that included all significant factor mean and variance covariate effects plus all significant factor loading and threshold noninvariant effects, the model was refit 500 times to random samples drawn with replacement from the original data. The accompanying figure displays these bootstrapping results. Noninvariant effects are expressed using four points with 95% confidence intervals (CI) for each MD criterion. Significant differential effects on factor loadings (A) and thresholds (B) are identified by the numbered points: (1) no covariate effects (males, no age effect), (2) male age effect, (3) sex effect (female-male sex difference), and (4) age by sex interaction effect (sex effect plus the age by sex interaction effect).
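The resampling scheme can be sketched as follows. Several choices here are assumptions that the text does not state: twin pairs are resampled as whole units, the intervals are simple percentile intervals, and a plain endorsement rate stands in for refitting the full twin model, which is beyond the scope of a short example. The names `bootstrap_percentile_ci` and `endorsement_rate` and the toy data are hypothetical.

```python
import random

def bootstrap_percentile_ci(units, estimator, n_boot=500, level=0.95, seed=1):
    """Percentile CI from n_boot resamples of the units, drawn with replacement."""
    rng = random.Random(seed)
    n = len(units)
    estimates = []
    for _ in range(n_boot):
        resample = [units[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))   # stand-in for refitting the model
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_boot)]
    upper = estimates[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lower, upper

def endorsement_rate(pairs):
    """Proportion of twins endorsing a criterion (toy estimator)."""
    return sum(t1 + t2 for t1, t2 in pairs) / (2 * len(pairs))

# Toy data (hypothetical): 200 twin pairs, each a tuple of 0/1 endorsements
data_rng = random.Random(0)
pairs = [(int(data_rng.random() < 0.3), int(data_rng.random() < 0.3))
         for _ in range(200)]

point = endorsement_rate(pairs)
lower, upper = bootstrap_percentile_ci(pairs, endorsement_rate)
print(f"estimate = {point:.3f}, 95% CI = [{lower:.3f}; {upper:.3f}]")
```

Resampling pairs rather than individual twins keeps the within-pair dependence intact in every bootstrap sample, which is why the pair is treated as the unit in this sketch.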
The geometric shapes formed by the lines connecting the points provide a visual description of the nature of the differential covariate effects. Points without numbers and CIs that are identical indicate MD criteria with no significant differential effects (e.g., ‘sleep problems’). An effect of age but not sex would appear as a parallelogram with horizontal red and blue broken lines. The ‘fatigue’ factor loading (A) and ‘psychomotor’ threshold (B) follow this pattern. Sex but no age effects appear as a parallelogram with offset horizontal green and purple lines (e.g., ‘depressed mood’ threshold (B)). Interaction effects can yield triangular or trapezoidal shapes, as is evident for the ‘feeling worthless’ factor loading (A) and ‘weight problem’ threshold (B). That factor loadings are, in general, estimated with less precision than thresholds is evident from their wider confidence intervals.
All MD criteria factor loadings (A) had estimated values at or above 0.7. The ‘depressed mood’ and ‘fatigue’ criteria displayed unexpected increases in discrimination with age. For older individuals, these MD criteria discriminate individual differences on the MD factor more sharply than would be expected given the single MD factor variance effect of age. In contrast, the ‘duration’ item had an unexpected decrease in discrimination with age. Finally, the ‘feeling worthless’ criterion had a more complex moderation pattern that included an age by sex interaction effect (i.e., triangular shape). Compared to same-aged males, females showed a significant decline in discriminating power for this criterion with increasing age.
Panel B shows unexpected covariate effect patterns for the MD criteria thresholds. Criteria thresholds were all located above the zero factor scale point or ‘average’ risk level. Therefore, all MD criteria predominantly provided information about MD factor score differences towards the high end of the risk scale in this population-based sample. Several criteria had differential effect patterns suggesting the presence of an age by sex interaction (i.e., triangular shapes). The ‘loss of interest’, ‘weight problems’, and ‘fatigue’ criteria all followed such a pattern. A differential sex effect was found for ‘depressed mood’, with females tending to endorse this criterion more often (lower threshold) than expected compared to males when accounting for the sex effect on the factor mean. The ‘psychomotor’ threshold displayed a significant age effect: this criterion tended to be endorsed more by older twins than expected based on the single factor mean age effect. An age effect in the opposite direction was found for ‘feeling worthless’, with older twins reporting this MD symptom less often than expected.
The two columns to the far right show how covariate effects on the factor variance (A) and mean (B) were impacted by the presence of measurement non-invariant effects at the criterion level. The pattern of covariate effects on the variance did not change, but the effects were somewhat reduced in magnitude. However, the age by sex interaction effect on the factor mean was noticeably altered, suggesting confounding with the differential threshold effects that were present. The likelihood ratio test comparing a model including all significant differential covariate effects against the baseline model produced a sizeable reduction in misfit (Δχ²(17) = 208.9, p < .001).
MD factor scores were estimated under four different parameterizations: (1) the baseline model, (2) allowing factor mean and variance covariate effects, (3) allowing for only significant factor loading and threshold moderation effects, and (4) both (2) and (3). To evaluate the disagreement between factor scores and algorithmically derived affected/unaffected diagnostic classifications, the percentage of twins meeting DSM-III-R requirements for major depression (11.3%) was used as a cut-off for the corresponding highest rank-ordered MD factor scores. Condition 1 produced disagreement for N = 103 twins, ~1.2% of the total sample. Under conditions 2, 3, and 4, discrepancies were N = 132 (~1.5%), N = 112 (~1.3%), and N = 132 (~1.5%), respectively. Adjusting for age, sex, and interaction effects on the factor and individual criteria increased the rate of disagreement. Disagreements between positive/negative diagnoses and factor scores below/above the 11.3% cutoff were found to be fairly symmetric.
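The rank-based comparison can be illustrated with a short sketch. It assumes the cut-off is applied by taking the top-ranked fraction of factor scores equal to the diagnostic prevalence, as the text describes; the function name `cutoff_disagreement`, the tie handling (ties keep their original order), and the ten-person toy data are all hypothetical.

```python
def cutoff_disagreement(scores, affected, prevalence):
    """Compare a top-fraction factor score cut-off with binary diagnoses.

    Returns the indices of (a) unaffected individuals ranked above the cut-off
    and (b) affected individuals ranked below it.
    """
    n_top = round(prevalence * len(scores))
    # Indices ordered from highest to lowest factor score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:n_top])
    high_unaffected = sorted(i for i in top if not affected[i])
    low_affected = sorted(i for i, a in enumerate(affected) if a and i not in top)
    return high_unaffected, low_affected

# Toy data (hypothetical): ten individuals, three of them diagnosed as affected
scores = [2.1, 1.7, 1.5, 1.2, 0.4, 0.1, -0.3, -0.8, -1.1, -1.6]
affected = [True, True, False, True, False, False, False, False, False, False]

high_unaffected, low_affected = cutoff_disagreement(scores, affected, prevalence=0.3)
rate = (len(high_unaffected) + len(low_affected)) / len(scores)
print(high_unaffected, low_affected, rate)
```

When the number of top-ranked scores equals the number of diagnosed cases exactly, the two kinds of discrepancy must occur equally often; any asymmetry can then arise only from rounding of the cut-off or from ties.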
The accompanying figure is a histogram of estimated MD factor scores adjusted for significant non-invariance covariate effects. Factor scores are partitioned into the four possible categories obtained by crossing male/female with affected/unaffected status. Approximately one percent of the total sample (N = 112) had estimated factor scores that disagreed with the binary diagnoses. Discrepancies fell in the region marked by dashed vertical lines. The solid black line is the break point separating the two types of discrepancies. Below this line, individuals had factor scores falling outside the top 11.3% of the distribution but were assigned an affected status by the diagnostic algorithm. Above this line, the discrepancies were reversed. From a measurement perspective, it is of interest to note that the location separating the different types of discrepancies coincided with the maximum level of information provided by the 10 MD criteria. Factor scores in this region are the most precisely calibrated based on the criteria information. Also, all discrepancies had one feature in common: their status on the duration criterion. Cases with factor scores above the cut-off but classified as unaffected all failed to meet the 14-day minimum duration requirement. In contrast, cases classified as affected whose factor scores fell below the cut-off all met the duration criterion.
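The notion of the "maximum level of information" can be made concrete with the standard information function for a binary item under a normal-ogive model. The sketch below is an illustration only: it assumes a factor standardized to unit variance, converts a loading and threshold to a discrimination and location in the usual way, and uses three made-up items rather than the estimated MD criteria parameters. The function names are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def item_information(theta, loading, threshold):
    """Information of one binary item at factor score theta (normal-ogive model)."""
    a = loading / math.sqrt(1.0 - loading ** 2)   # discrimination
    b = threshold / loading                       # location on the factor scale
    z = a * (theta - b)
    p = normal_cdf(z)                             # endorsement probability
    return (a * normal_pdf(z)) ** 2 / (p * (1.0 - p))

def total_information(theta, items):
    """Sum of item informations; higher values mean more precise factor scores."""
    return sum(item_information(theta, lam, tau) for lam, tau in items)

# Hypothetical (loading, threshold) pairs, NOT the estimates from this study
items = [(0.80, 1.0), (0.75, 1.2), (0.85, 1.5)]

grid = [i / 100.0 for i in range(-300, 401)]
peak = max(grid, key=lambda theta: total_information(theta, items))
print(f"information peaks near theta = {peak:.2f}")
```

Each item is most informative at its own location, so a set of criteria whose thresholds all lie above zero concentrates its information at the high end of the factor scale, which is where a prevalence-based cut-off also falls.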
We note that when a set of items is essentially unidimensional (i.e., a single factor model fits adequately), as is the case here for the 9 DSM-III-R MD criteria and duration requirement, the estimated factor scores generally correlate highly with the more straightforward sum of the binary criteria that are used to determine diagnostic case status. However, when investigating the effects of covariates on MD criteria, one might find different item-level effects for the common portion of the criteria (factor loadings) compared to those specific to each criterion. This would not be possible with sum scores. Also, if the item set is multidimensional, sum scores can produce covariate effects that may be distorted and misleading because they ignore the structure present in the item associations. In other work we have established that maximum likelihood factor scores are more accurate estimates of true factor scores than are sum scores when responses are available from only a subset of the items, that is, when some items have missing data (Estabrook and Neale, submitted).