We used a comprehensive simulation study to empirically test our set of theoretically generated research hypotheses pertaining to the performance of CFA for ordinal data with polychoric correlations using both full WLS (as per Browne, 1984, and B. Muthén, 1983) and robust WLS (as per Muthén et al., 1997) estimation. The results of our study provided support for each of our proposed hypotheses.
First, as predicted, we replicated Quiroga’s (1992) findings that polychoric correlations among ordinal variables accurately estimated the bivariate relations among normally distributed latent response variables and that modest violation of normality for latent response variables of a degree that might be expected in applied research leads to only slightly biased estimates of polychoric correlations. Both our results and Quiroga’s suggest that this finding occurs regardless of whether the same set of threshold values is applied to both latent response variables leading to the observed ordinal variables. A thorough examination of the effects of differing thresholds and level of skewness across the two variables on polychoric correlation estimates is beyond the scope of our study. However, we note that Quiroga found that polychoric correlation estimates remained accurate when the ordinal variables had differing thresholds or skewness of opposite sign. Specifically, although increasing levels of nonnormality tended to produce increasingly positively biased correlation estimates, these biases remained quite small and were typically less than .03 in the correlation metric. Furthermore, we found that variability in polychoric correlations was not affected by nonnormality and that sample size had no effect on the mean polychoric correlation estimates across cells.
These results likely occurred because nonnormality in the latent response variables, in combination with the threshold values used to categorize the data, did not produce contingency tables with low expected cell frequencies (see Brown & Benedetti, 1977; Olsson, 1979). Because our results indicated that the polychoric correlations accurately recovered the correlations among the unobserved y* variables even under modest violation of the normality assumption, we proceeded to examine the adequacy of fitting CFA models to these correlation structures.
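To make the role of the latent response variables concrete, the following minimal sketch simulates two correlated normal y* variables, dichotomizes each at a threshold, and recovers the latent correlation with a two-step tetrachoric estimate (the two-category special case of the polychoric correlation). It is an illustration only and not the estimation software used in this study; the function names, the thresholds (0.0 and 0.8), the sample size, and the seed are all hypothetical choices.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    det = 1.0 - r * r
    q = (h * h - 2.0 * r * h * k + k * k) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def bvn_cdf(h, k, rho, steps=200):
    """P(Z1 <= h, Z2 <= k). Uses the identity that the derivative of the
    bivariate normal CDF with respect to rho equals the bivariate density,
    integrated from 0 to rho with Simpson's rule (steps must be even)."""
    base = STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k)
    if rho == 0.0:
        return base
    width = rho / steps
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    return base + total * width / 3.0

def tetrachoric(x, y):
    """Two-step estimate for 0/1 data: thresholds from the margins,
    then the correlation that reproduces the observed (0, 0) proportion."""
    n = len(x)
    h = STD_NORMAL.inv_cdf(sum(1 for v in x if v == 0) / n)
    k = STD_NORMAL.inv_cdf(sum(1 for v in y if v == 0) / n)
    p00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0) / n
    lo, hi = -0.99, 0.99
    for _ in range(50):  # bisection; the CDF is increasing in rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < p00:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def pearson(x, y):
    """Ordinary product-moment correlation of the observed scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(rho, n, tau1, tau2, seed=1):
    """Draw correlated normal latent responses and dichotomize them."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        x.append(1 if z1 > tau1 else 0)
        y.append(1 if z2 > tau2 else 0)
    return x, y

if __name__ == "__main__":
    x, y = simulate(rho=0.5, n=20000, tau1=0.0, tau2=0.8)
    print("tetrachoric:", round(tetrachoric(x, y), 3))
    print("pearson    :", round(pearson(x, y), 3))
```

With normal latent responses, the tetrachoric estimate should land close to the generating correlation even though the two thresholds differ, whereas the product-moment correlation of the observed 0/1 scores is attenuated.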
As we described earlier, it has been analytically demonstrated that when CFA models are fitted using observed polychoric correlation matrices, full WLS estimation produces asymptotically correct chi-square tests of model fit and parameter standard errors (e.g., B. Muthén, 1983; B. O. Muthén & Satorra, 1995). However, in practice, the use of full WLS is often problematic, especially when models with a large number of indicators are estimated with sample sizes typically encountered in social science research. In our study, we found that when the 20-indicator model (Model 4) was estimated using N = 100, the estimated asymptotic covariance matrix W was consistently nonpositive definite and could not be inverted for any of the replications. Thus, not a single solution across all replications was obtained in this situation using full WLS estimation. This finding is consistent with Browne’s (1984) observation that this method of estimation “will tend to become infeasible” as the number of variables approaches p = 20 (p. 73) and Jöreskog and Sörbom’s (1996) recommendation that a minimum sample size be attained for estimation of W. When Model 4 was estimated with samples of size N = 200, W was usually invertible, but the full WLS fitting function frequently failed to converge to a proper CFA solution.
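A short calculation helps show why W becomes unwieldy as the number of indicators grows. The sketch below counts the distinct polychoric correlations among p indicators and the distinct elements of their asymptotic covariance matrix. It is a simplification that counts correlations only and ignores thresholds, and the function name is a hypothetical one chosen for illustration.

```python
def weight_matrix_size(p):
    """Return (number of distinct correlations, number of distinct
    elements in their covariance matrix W) for p indicators."""
    n_corr = p * (p - 1) // 2          # off-diagonal correlations
    n_w = n_corr * (n_corr + 1) // 2   # distinct elements of symmetric W
    return n_corr, n_w

# Model sizes considered in this study: 5, 10, and 20 indicators.
for p in (5, 10, 20):
    n_corr, n_w = weight_matrix_size(p)
    print(f"p = {p:2d}: {n_corr:3d} correlations, {n_w:5d} distinct elements in W")
```

For p = 20 there are 190 correlations, so W is a 190 × 190 matrix with 18,145 distinct elements. A sample of N = 100 supplies fewer observations than W has rows, which is at least consistent with the inversion failures described above.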
However, the overall rates of nonconvergence and improper solutions obtained with full WLS in the present study are lower than those found in simulation studies applying full WLS to continuously distributed data (e.g., Curran et al., 1996). Moreover, other studies applying full WLS estimation to the analysis of polychoric correlations have found rates of nonconvergence and improper solutions similar to those reported here: Neither Dolan (1994) nor Potthast (1993) obtained nonpositive definite weight matrices or improper solutions for any replications of their respective studies, which analyzed samples of size 200 and greater, whereas Babakus et al. (1987) obtained high rates of nonconvergence and improper solutions only with samples of 100.
One advantage of the robust WLS method relative to full WLS is that sample solutions for the CFA model can still be obtained with robust WLS estimation even when the weight matrix is nonpositive definite. Thus, robust WLS estimation successfully obtained solutions to Model 4 with N = 100, whereas full WLS did not. Furthermore, our findings show that the likelihood of obtaining an improper solution or encountering convergence difficulty is near zero with robust WLS estimation, even when the model is large and the sample is small.
Next, we hypothesized that full WLS estimation of CFA models for ordinal data using polychoric correlations would lead to inflated test statistics and underestimated standard errors. Our findings supported this prediction in that, on average, the chi-square test statistics were positively biased and parameter standard errors were negatively biased in nearly every cell of the study. The results also supported our more specific predictions that these biases would increase as a function of decreasing sample size and increasing model complexity. Specifically, the effect of sample size was larger for more complex models. With the five-indicator model, chi-square statistics were substantially biased at N = 100, whereas sample sizes of 200 and even 500 led to substantially biased chi-square statistics for the two 10-indicator models, which showed substantial inflation at all sample sizes below 1,000. For the 20-indicator model, full WLS estimation produced vastly inflated chi-square values across all sample sizes, including N = 1,000, virtually guaranteeing rejection of a properly specified CFA model with N = 200 or N = 500. Thus, our findings are similar to those of Dolan (1994), who concluded that a sample size of 200 is not sufficient to estimate an eight-indicator model with full WLS using polychoric correlations, and to those of Potthast (1993), who reported significant problems when nine-parameter and larger models are estimated with a sample size as large as 1,000.
The model test statistics produced with robust WLS also tended to be positively biased relative to their expected values; however, these biases were substantially less than those observed with the test statistics produced by full WLS. For the 10- and 20-indicator models, the mean RB of the robust WLS test statistics was typically less than 10% for all cells with sample size 200 or greater. Whereas full WLS estimation led to high Type I error rates for the 10- and 20-indicator models, robust WLS model rejection rates were much closer to the nominal .05 level. Thus, our prediction that robust WLS estimation might produce test statistics accurately distributed as chi-square, with degrees of freedom estimated from the data as per B. Muthén et al. (1997), was supported across all four model specifications estimated with samples of size 200 or greater.
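The link between relative bias in a test statistic and its Type I error rate can be sketched numerically. The example below is a minimal illustration under stated assumptions and not a reanalysis of the study: it draws reference chi-square variates with a hypothetical df = 5, scales them by a hypothetical inflation factor of 1.2, and compares relative bias and rejection rates at the nominal .05 level. All names, the number of replications, and the seed are illustrative.

```python
import random

def relative_bias(values, expected):
    """Percentage relative bias of the mean of `values` against `expected`."""
    mean = sum(values) / len(values)
    return 100.0 * (mean - expected) / expected

def rejection_rate(values, critical_value):
    """Proportion of test statistics that exceed the critical value."""
    return sum(1 for v in values if v > critical_value) / len(values)

def draw_chi_square(df, n_reps, inflation=1.0, seed=2024):
    """Draw chi-square(df) variates (a gamma with shape df/2 and scale 2),
    optionally multiplied by a constant to mimic an inflated statistic."""
    rng = random.Random(seed)
    return [inflation * rng.gammavariate(df / 2.0, 2.0) for _ in range(n_reps)]

DF = 5             # hypothetical model degrees of freedom
CRIT_05 = 11.0705  # upper 5% point of the chi-square distribution with 5 df

if __name__ == "__main__":
    accurate = draw_chi_square(DF, 20000)
    inflated = draw_chi_square(DF, 20000, inflation=1.2)
    for label, stats in (("accurate", accurate), ("inflated", inflated)):
        print(label,
              "RB (%):", round(relative_bias(stats, DF), 1),
              "rejection rate:", round(rejection_rate(stats, CRIT_05), 3))
```

Because the expected value of a correctly distributed chi-square statistic equals its degrees of freedom, relative bias here is computed against df. Even a uniform 20% inflation pushes the rejection rate of a correctly specified model above the nominal level.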
Because of its reliance on the complete asymptotic variance–covariance matrix, it is not surprising that full WLS estimation produced increasingly biased test statistics as a function of decreasing sample size combined with increasing model size. In contrast, robust WLS estimation is primarily a function of only the asymptotic variances, and not the covariances, among the sample correlation estimates. Therefore, solutions obtained with robust WLS estimation are not affected by inaccuracies in the full weight matrix nearly to the same extent that full WLS solutions are affected.
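The contrast between the two estimators can be written compactly. The expressions below are a sketch in generic notation that may differ from the notation used elsewhere in this article: r denotes the vector of sample polychoric correlations, σ(θ) the model-implied correlations, W the estimated asymptotic covariance matrix of r, and w_ii its diagonal elements. The label DWLS for the diagonally weighted function is our shorthand here.

```latex
\begin{align}
F_{\mathrm{WLS}}(\boldsymbol{\theta})
  &= [\mathbf{r} - \boldsymbol{\sigma}(\boldsymbol{\theta})]^{\top}\,
     \mathbf{W}^{-1}\,
     [\mathbf{r} - \boldsymbol{\sigma}(\boldsymbol{\theta})] \\
F_{\mathrm{DWLS}}(\boldsymbol{\theta})
  &= [\mathbf{r} - \boldsymbol{\sigma}(\boldsymbol{\theta})]^{\top}\,
     [\operatorname{diag}(\mathbf{W})]^{-1}\,
     [\mathbf{r} - \boldsymbol{\sigma}(\boldsymbol{\theta})]
   = \sum_{i} \frac{[r_i - \sigma_i(\boldsymbol{\theta})]^{2}}{w_{ii}}
\end{align}
```

The first function requires inverting the full matrix W, whereas the second requires only the reciprocals of its diagonal elements, so parameter estimates remain obtainable even when W itself cannot be inverted.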
Because polychoric correlations provide consistent estimates of the relationships among latent response variables, we predicted that CFA parameter estimates would be unbiased. Our findings suggest that, for normally distributed latent response variables, parameter estimates obtained with full WLS estimation tended to be somewhat positively biased with overestimation increasing as a function of increasing model size and decreasing sample size. However, these biases were relatively small across all cells of the simulation: Even when the 20-indicator model was estimated with N = 200, estimates of the population factor loading .70 were consistently less than .80, and estimates of the population factor correlation .30 were typically less than .40. Dolan (1994) found that parameters tended to be slightly overestimated with N = 400 and less, whereas Potthast (1993) concluded that parameter estimate bias was trivial across all cells of her simulation study, which only had two conditions of sample size, N = 500 and N = 1,000. Our results essentially replicated these findings. With robust WLS estimation, parameter estimates were mostly unbiased with normally distributed latent response variables.
As predicted, for both full WLS estimation and robust WLS estimation, we found that increasing levels of nonnormality in latent response variables were associated with greater positive bias in parameter estimates, echoing the tendency of polychoric correlations to be positively biased when observed ordinal data derive from nonnormal latent response variables. However, because nonnormality in latent response distributions produces only slightly biased polychoric correlations, this nonnormality introduces only slight bias in CFA parameter estimates. As we noted earlier, because the polychoric correlations resulting from extremely nonnormal continuous latent response distributions were substantially distorted, we did not fit CFA models to these correlation structures. Theory would strongly predict that the CFA parameter estimates and standard errors would be biased as a function of the distorted correlation structure.
Finally, we found that nonnormality in latent response variables contributed to a slight increase in the positive bias of chi-square values obtained with full WLS estimation but had no such effect with robust WLS estimation. Similarly, Babakus et al. (1987), Potthast (1993), and Hutchinson and Olmos (1998) found that chi-square test statistics become more biased with increasing nonnormality in the observed ordinal variables, although these researchers found a greater effect of nonnormality than we did here.
However, it is crucial to keep in mind that in the current study, we created nonnormal ordinal observed data by categorizing nonnormal continuous latent response variables. This manipulation is fundamentally different from that implemented in these other studies, in which nonnormal ordinal observed data were created by varying the thresholds used to categorize normal continuous latent response variables. As illustrated in , the nonnormal y* variables from which we generated our sample data had more extreme levels of skewness and kurtosis than the observed, ordinal variables from which polychoric correlations and CFAs were estimated. Indeed, the levels of skewness and kurtosis in our ordinal observed data were quite close to those of the normal distribution, even though our continuous latent response variables were much more nonnormal.
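The point that categorization can mask latent nonnormality is easy to demonstrate. The sketch below is only an illustration and does not reproduce the data-generation method or the distributions used in this study: it draws a skewed latent variable (a standardized chi-square with 4 df, whose population skewness is the square root of 2, about 1.41), cuts it into five categories at hypothetical symmetric thresholds, and compares sample skewness before and after categorization.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def categorize(value, thresholds):
    """Return the 1-based ordinal category implied by the thresholds."""
    return 1 + sum(1 for t in thresholds if value > t)

def skewed_latent(n, seed=7):
    """Standardized chi-square(4) draws (mean 0, variance 1, positive skew)."""
    rng = random.Random(seed)
    return [(rng.gammavariate(2.0, 2.0) - 4.0) / math.sqrt(8.0) for _ in range(n)]

THRESHOLDS = [-1.0, -0.3, 0.3, 1.0]  # hypothetical cut points, five categories

if __name__ == "__main__":
    latent = skewed_latent(20000)
    ordinal = [categorize(v, THRESHOLDS) for v in latent]
    print("latent skewness :", round(skewness(latent), 2))
    print("ordinal skewness:", round(skewness(ordinal), 2))
```

Under these assumed thresholds, the five-category variable is noticeably less skewed than the latent variable that produced it, which mirrors the pattern described above.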
As such, observed ordinal distributions obtained in practice may often have more extreme levels of skewness and kurtosis than those used here. Because prior studies (e.g., Babakus et al., 1987; Hutchinson & Olmos, 1998; Potthast, 1993; Rigdon & Ferguson, 1991) have effectively demonstrated the effects of increased skewness and kurtosis in observed ordinal variables on the estimation of CFAs, we deemed a more thorough manipulation of y variables to be beyond the scope of the current study. Rather, as stated above, our intent was to evaluate the effect of violation of a crucial theoretical assumption for estimation of CFAs using polychoric correlations, namely the latent normality assumption for y*, and the manipulations we chose for our simulations were explicitly targeted to do so. Thus, combining our findings with those from previous studies, we were able to reach three general conclusions about how the distribution of y* affects the estimation of CFA models from observed ordinal data.
First, estimation of CFA models is robust to moderate violation of the latent normality assumption for y* variables, an assumption implicit in the statistical theory underpinning the polychoric correlation, at least under conditions similar to those studied here. Because polychoric correlations provide robust estimates of the true correlation even when different sets of thresholds are applied to y* variables, it follows that estimation of CFA models is not substantially affected by whether or not threshold sets are constant across indicators.
Second, to the extent that the observed ordinal variables have nonzero skewness and kurtosis (e.g., as a result of threshold sets that lead to a dramatically different distribution shape for y relative to a normal or moderately nonnormal y*), full WLS estimation is known to produce biased chi-square test statistics and parameter standard error estimates. This latter finding likely occurs because of an increased tendency for low expected cell frequencies in observed contingency tables, especially in the context of small-to-moderate sample sizes (e.g., fewer than 1,000; Potthast, 1993).
Third, when the population y* variables are of extreme nonnormality (e.g., skewness = 5, kurtosis = 50), the likely result is that the observed ordinal variables themselves will also have exaggerated levels of skewness and kurtosis, thus again leading to low expected frequencies in observed contingency tables. With regard to consideration of the joint effects of underlying nonnormality and varying thresholds across indicators, to the extent that these factors jointly produce observed contingency tables with low (or zero) expected cell frequencies, they are likely to lead to inaccurate polychoric correlations (as shown by Brown & Benedetti, 1977), which in turn adversely affect estimation of CFA models.
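A rough sense of when expected cell frequencies become low can be had from the marginal category probabilities alone. The sketch below is a deliberately simplified illustration: it assumes standard normal latent responses and, for simplicity, independence between the two variables, so it only approximates the expected counts in a real bivariate table. The threshold sets and function names are hypothetical.

```python
from statistics import NormalDist

def category_probs(thresholds):
    """Marginal category probabilities for a standard normal latent variable."""
    cdf = NormalDist().cdf
    cuts = [0.0] + [cdf(t) for t in sorted(thresholds)] + [1.0]
    return [hi - lo for lo, hi in zip(cuts, cuts[1:])]

def min_expected_count(thresholds_x, thresholds_y, n):
    """Smallest expected cell count in the two-way table,
    assuming independence between the two variables (a simplification)."""
    px = category_probs(thresholds_x)
    py = category_probs(thresholds_y)
    return n * min(a * b for a in px for b in py)

SYMMETRIC = [-1.5, -0.5, 0.5, 1.5]  # hypothetical symmetric thresholds
SKEWED = [0.5, 1.0, 1.5, 2.0]       # hypothetical thresholds giving a skewed y

if __name__ == "__main__":
    for n in (100, 1000):
        print("N =", n,
              "| symmetric:", round(min_expected_count(SYMMETRIC, SYMMETRIC, n), 2),
              "| skewed:", round(min_expected_count(SKEWED, SKEWED, n), 2))
```

Under these assumptions, thresholds that pile most responses into one category shrink the sparsest expected cell count considerably, and smaller samples shrink it further in direct proportion to N.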
Implications for Applied Research
There are several specific implications of our findings with respect to using these analytic methods in practice. First, our findings suggest that the estimation of CFA models using polychoric correlations is robust to the moderate levels of nonnormality in the latent response variables that we considered here. Consistent with Quiroga (1992), our results showed that polychoric correlations become only slightly inflated when the latent response variables are moderately nonnormal. In turn, this bias in the correlation estimates contributes to a modest overestimation of CFA model parameters and has little effect on chi-square test statistics or parameter standard error estimates. Our findings support Pearson and Heron’s (1913) argument that the latent normality assumption is merely a mathematical convenience that has little practical importance when the latent response variables are moderately (but not extremely) nonnormal.
Although our study demonstrates robustness to violation of the latent normality assumption, it is important to stress that our study does not offer a thorough assessment of the effects of skewness and kurtosis of the observed ordinal variables. In practice, researchers may well observe ordinal distributions that are more skewed and kurtotic than those examined here. We refer applied researchers to prior studies (e.g., Babakus et al., 1987; Hutchinson & Olmos, 1998; Potthast, 1993; Rigdon & Ferguson, 1991) to further understand the effects of high skewness and kurtosis among observed ordinal variables on estimation of CFAs using polychoric correlations.
Consistent with previous studies for both continuous and ordinal data, our results demonstrate that for CFA models of realistic size (e.g., with 10 or more indicators), the desirable asymptotic properties of full WLS estimation are not observed with the types of sample sizes typically encountered in applied psychological research, even with N = 1,000. In the situation where a researcher wishes to fit a large model with N = 1,000 or fewer, our findings imply that when proper CFA solutions are obtained with full WLS (which can be quite rare at small-to-moderate sample sizes), these usually have inflated chi-square test statistics and parameter estimates and negatively biased standard errors. In contrast, robust WLS estimation nearly always produces a proper solution with test statistics, parameter estimates, and standard errors that are much less vulnerable to the effects of increasing model size and decreasing sample size. Furthermore, even for the situations in which full WLS estimation performs well (i.e., with small models and large sample size), robust WLS still produces less biased parameter estimates, standard errors, and test statistics. These results support the recommendation that applied researchers closely consider robust WLS estimation for CFAs with ordinal scales (and for SEMs more generally), particularly when testing medium-to-large models with a moderate-to-small sample size.
Given this recommendation, a word of caution is warranted. Despite its apparent finite-sample superiority to full WLS estimation, robust WLS estimation still leads to slightly biased test statistics and standard errors when large models are estimated with small samples. This inflation of the test statistic increases Type I error rates for the chi-square goodness-of-fit test, thereby causing researchers to reject correctly specified models more often than expected. It may be that researchers can supplement the chi-square goodness-of-fit test with other fit indices often computed in applications of SEM to retain a model that would otherwise be rejected on the basis of the chi-square goodness-of-fit test alone, although we did not include such measures as formal outcomes of this study. Examples of such fit indices include the comparative fit index (Bentler, 1990) and the root-mean-square error of approximation (Steiger, 1990). However, because both of these indices are based in part on the sample value of the fit function (i.e., Equation 8), these are likely to be biased to some degree as well. There has been little research explicitly evaluating the performance of these statistics in the context of CFA with polychoric correlations, although Hutchinson and Olmos (1998) reported promising initial findings for the comparative fit index.
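To see why an inflated chi-square carries over to these indices, it helps to write them as functions of the test statistic. The sketch below uses the commonly cited sample formulas for the root-mean-square error of approximation and the comparative fit index; some programs divide by N rather than N − 1, so the exact values are an assumption. The chi-square values, degrees of freedom, and sample size in the usage lines are hypothetical and are not results from this study.

```python
import math

def rmsea(chi_sq, df, n):
    """Root-mean-square error of approximation from a model chi-square."""
    return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))

def cfi(chi_sq_model, df_model, chi_sq_base, df_base):
    """Comparative fit index from model and baseline chi-square values."""
    d_model = max(chi_sq_model - df_model, 0.0)
    d_base = max(chi_sq_base - df_base, d_model)
    if d_base == 0.0:
        return 1.0
    return 1.0 - d_model / d_base

if __name__ == "__main__":
    # Hypothetical values: a statistic equal to its df versus one inflated by 50%.
    for label, chi_sq in (("accurate", 34.0), ("inflated", 51.0)):
        print(label,
              "RMSEA:", round(rmsea(chi_sq, 34, 200), 3),
              "CFI:", round(cfi(chi_sq, 34, 900.0, 45), 3))
```

Both indices signal perfect fit when the statistic equals its degrees of freedom and move away from that value as the statistic is inflated, which is the sense in which bias in the fit function propagates to them.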
Study Limitations and Directions for Future Research
We believe that the design of our study allows for empirically informed conclusions about the practical usefulness of CFA using polychoric correlations across several model specifications of increasing complexity, a broad range of sample sizes, several distributions for the latent response variables, and two types of WLS model-fitting procedures. However, this study shares the basic limitation of all Monte Carlo simulations; namely, it is possible that the findings cannot be generalized beyond the specific conditions studied here. Nonetheless, we believe that our experimental design and subsequent findings enable stronger predictions to be made about the finite-sample performance of these methods across a broad range of realistic conditions than have been previously available.
We defined our nonnormal underlying distributions by selecting levels of skewness and kurtosis to represent what we propose to be minor-to-moderate departures from normality that might be commonly encountered in behavioral research. It is important to reiterate that we were not interested in exploring how severely nonnormal the underlying distributions would need to be made for these estimators to fail. Instead, we feel that our results provide strong empirical evidence that robust WLS performs exceptionally well across a variety of commonly encountered conditions in applied research, and full WLS works almost as well for larger sample sizes and for simpler models. In sum, we are not concluding that robust or full WLS is unaffected by any degree of nonnormality of the latent response, but we do conclude that these methods are well-behaved for a variety of nonnormal distributions that might be expected in practice. Future research is needed to determine more accurately at what point full or robust WLS estimation begins to break down under violations of latent normality. Furthermore, we again remind readers that our results pertain most directly to violation of the normality assumption for latent response variables, a crucial theoretical assumption for the estimation of polychoric correlations. This is in distinct contrast to prior research that has addressed the effects of varying skewness and kurtosis among observed ordinal data in the context of normal latent response variables. We believe our findings augment this prior work in important ways.
Given the scarcity of studies that have explicitly assessed the application of CFA using polychoric correlations to realistic CFA models for ordinal variables, we felt that it was important to focus exclusively on the estimation of properly specified models. However, in practice, a given model specification is rarely exactly correct (see, e.g., Cudeck & Henly, 2003; MacCallum, 1995). Therefore, it is important that future research assess the ability of CFA with polychoric correlations to test misspecified models. In particular, our study provides promising results in support of the robust WLS method of B. Muthén et al. (1997) when applied to correctly specified models. However, the performance of this method for the estimation of misspecified models is still unknown. Because this method calculates test statistics and standard errors in a manner similar to the Satorra and Bentler (1986) method for continuous observed data, one might predict from the results of Curran et al. (1996) that robust WLS would produce significantly underestimated chi-square statistics under model misspecification, thus reducing power to reject a misspecified model. Again, future research is needed to determine whether robust WLS maintains appropriate statistical power to detect a misspecification under violations of distributional assumptions.
Finally, we considered homogeneous values for the factor loadings both within and across models. This is of course a simplification, and it would be interesting to examine the more realistic situation of heterogeneous factor loading values in future research. Although we did not empirically examine the effects of varying the values of factor loadings here, we can draw on statistical theory to make some predictions about this issue. Namely, higher factor loading values are reflective of greater factor determinacy, a condition that has many benefits in model estimation and testing (see MacCallum, Widaman, Zhang, & Hong, 1999, for an excellent review). Given our focus of interest, we would predict that larger factor loadings would likely serve to convey information more saliently about violation of the underlying normality assumption within the CFA. Similarly, weaker loadings would likely dampen these effects. Taken together, we would expect that a heterogeneous set of factor loading values would simultaneously exacerbate and weaken the effect of the violation of assumptions, the specific effect of which would depend on many other experimental design features. Thus, although future research is needed to better understand the influences of heterogeneous loading structures, we do not believe that our use of homogeneous loadings is a limitation here.
In sum, we draw the following conclusions based on our experimental design and associated empirical findings. First, with the exception of a small number of modest differences in accuracy of polychoric correlation estimation and model convergence under full WLS, there were few to no differences found in any empirical results as a function of two-category versus five-category ordinal distributions. Second, moderate nonnormality of latent response distributions did not significantly affect the accuracy of estimation of polychoric correlations, although severe nonnormality did. Third, full WLS rarely resulted in accurate test statistics, parameter estimates, and standard errors under either normal or nonnormal latent response distributions, and this accuracy only occurred at large sample sizes and for less complex models. Fourth, robust WLS resulted in accurate test statistics, parameter estimates, and standard errors under both normal and nonnormal latent response distributions across all sample sizes and model complexities studied here (although there was modest bias found at the smallest sample size). Fifth, as we studied four variations of a CFA model here, we would anticipate that these findings would generalize to SEMs of comparable complexity. Finally, all of our conclusions are based on proper model specifications, and future research must address the role of statistical power and the ability of full and robust WLS to detect misspecifications when such misspecifications truly exist.