In the present study we investigated the dimensionality and scaling properties of the EDS in a population of patients with DM2. The objectives of this study were threefold. Firstly, we examined the dimensionality of the EDS. An important practical and scientific issue is whether the items, each covering different conceptually narrow aspects (e.g., anxiety, dysphoria, and anhedonia), together constitute a proper unidimensional scale for measuring the general broad attribute of depression. Measurements of the general attribute, covering the full breadth of the construct, may have higher predictive validity than subscale scores [33]. Confirmatory factor analyses and Mokken scale analysis showed that the ten EDS items constitute a unidimensional scale for the general depression factor of interest. Respondents can be reliably ordered on this dimension using the sum score. These results justify the use of sum scores on the EDS as measurements of the underlying depression attribute. This finding corroborates the original intentions of the developers, who designed the EDS to be unidimensional [26].
The question of whether a set of items covering different aspects should be treated as one unidimensional scale for the general attribute of interest, or should be divided into smaller subscales, is an important issue in psychological assessment. Our study demonstrated that both the nonparametric IRT framework (as an exploratory approach) and bifactor models (as a confirmatory approach) provide powerful tools for rigorously examining to what extent the items in a scale together measure a broad attribute in the presence of specific aspects. Typically, a general attribute dimension is present if Mokken scale analysis (MSA) clusters all items into a single scale for medium values of the lower bound c and yields separate clusters for high values of c, and if all items have high loadings on the general factor in the bifactor model. In some instances, however, the existence of a general underlying attribute is easily derived from the factor solutions themselves. For example, De Bruin et al. [30] tested the two-factor model of Pop et al. [32] in a confirmatory factor analysis, and found a correlation of .86 between the two factors. From this high correlation, they concluded that the factors basically provide information about the same underlying construct. Such compelling evidence is the exception rather than the rule.
The dimensionality analyses revealed small local dependencies between items 1 and 2. We hypothesize that this local dependence can be explained by the fact that these two items are worded in the opposite direction to the other items in the EDS. Local dependencies related to item wording are typical for scales comprising negatively- and positively-worded items [72]. The literature provides competing explanations as to why these additional dependencies often emerge in balanced scales. One explanation states that positively-worded items are dimensionally distinct from negatively-worded ones. For example, being unhappy is different from not being happy. Other explanations include careless responding [73] and carry-over effects due to similarity in wording [72]. Whether the dependence between the two items reflects a dimensionally distinct attribute or is caused by idiosyncratic response tendencies or wording effects is difficult to tell from a single data analysis. Future research may explore a modified version of the EDS in which all the items are worded in the same direction, to see whether the local dependence vanishes.
Another scale refinement that may be pursued in future research is to remove the locally-dependent items 1 and 2 and use an eight-item version of the EDS. However, given the results of our study, we believe that removing items 1 and 2 is not to be recommended. MSA, CFA, and parametric IRT analyses consistently showed that the two items are reliable indicators of the general attribute. In addition, bifactor analyses showed that the bias in estimated scale reliability was only 0.01. Thus, from a purely practical point of view, ignoring local dependence does not impair the valid use of EDS scores. Removing the items, however, would result in a loss of information, compromise the reliability, and increase the risk of incorrect diagnosis. The two-step estimation approach adopted in our study facilitates forecasting the consequences of removing items 1 and 2 from the EDS. For example, the two items accounted for 12% to 14% of the information around the cutoff. Removing these items reduces the test information around the cutoff by a factor of 1.2. In addition, removing items 1 and 2 reduces Cronbach's alpha from 0.86 to 0.82 (results based on a simulated data set of 10,000 item-response vectors; details available from the second author). This reduction may seem small, but it should be noted that a decrease in reliability caused by shortening the test has several adverse effects, including a reduction in the power to find group differences, additional bias in the estimated regression effects of the EDS, and higher risks of classification errors (e.g., [74]). Furthermore, removal of the items necessitates determining new cutoffs for diagnosing mild and severe levels of depression, and may unduly narrow the construct since one aspect (anhedonia) may no longer be well represented.
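The arithmetic behind these figures can be checked with a minimal sketch. It rests on two assumptions that are not part of the reported analyses: that the two items carry a fixed share p of the test information, so that removing them leaves 1 − p, and that the Spearman-Brown formula (which presumes parallel items) approximates the effect of shortening the scale. The study itself obtained the alpha of 0.82 by simulation, not by this formula, and the function names below are hypothetical.

```python
def information_reduction_factor(share_removed: float) -> float:
    """Factor by which test information shrinks when the removed items
    carried `share_removed` (between 0 and 1) of the total information."""
    if not 0.0 <= share_removed < 1.0:
        raise ValueError("share_removed must lie in [0, 1)")
    return 1.0 / (1.0 - share_removed)


def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predicted reliability after changing test length by `length_ratio`
    (new length / old length), assuming parallel items."""
    return length_ratio * reliability / (1.0 + (length_ratio - 1.0) * reliability)


if __name__ == "__main__":
    # Shares of information attributed to items 1 and 2 in the text.
    for share in (0.12, 0.14):
        print(share, round(information_reduction_factor(share), 3))
    # Ten-item alpha of 0.86, shortened to eight items.
    print(round(spearman_brown(0.86, 8 / 10), 3))
```

Under these assumptions the reduction factor lies between about 1.14 and 1.16, roughly consistent with the rounded factor of 1.2, and the Spearman-Brown prediction of about 0.83 is close to the simulated value of 0.82.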
The second objective of this study was to test whether the EDS is biased with respect to gender. Significant DIF was found for items 3 (blaming), 4 (anxious) and 9 (crying), but only for item 9 did DIF lead to appreciable differences in expected responses for males and females. However, at the scale level, the presence of DIF caused no substantial differences in the expected scores between males and females. The minor impact of DIF was also evident from the small differences between latent cutoffs that were obtained separately for males and females. For example, for the screening for mild depression we found latent cutoffs of θ = 0.54 and θ = 0.60 for females and males, respectively. Such a difference is negligible given that θ is standard normally distributed. Altogether these findings indicate that the observed DIF had no practical impact, and justify using the same screening rules for males and females.
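To gauge how small the difference between the two latent cutoffs is, one can compute the share of a standard normal population whose θ falls between them. The following is a minimal sketch for illustration only, not an analysis reported in the study; the function names are hypothetical.

```python
import math


def standard_normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def proportion_between(lower: float, upper: float) -> float:
    """Share of a standard normal population with lower < theta <= upper."""
    return standard_normal_cdf(upper) - standard_normal_cdf(lower)


if __name__ == "__main__":
    # Latent cutoffs for mild depression: 0.54 (females) and 0.60 (males).
    print(round(proportion_between(0.54, 0.60), 3))
```

Only about 2% of a standard normal population lies between θ = 0.54 and θ = 0.60, which is the group for whom the choice between the two cutoffs could matter.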
The third objective was to obtain a more detailed picture of the screening properties of the EDS items. Firstly, we found that the EDS is only informative at the higher ranges of the θ scale. This is a common result for many clinical scales [43], which basically assess symptom severity with respect to a clinical condition (e.g., depression). This means that the items in the scale only assess one pole of the 'no depression-depression' continuum and constitute a quasi-attribute [43]. Secondly, we found that, for the distinction between no depression and mild depression and for that between mild depression and severe depression, item 7 (difficulty sleeping) appeared to be the most reliable indicator, followed by item 8 (sad/miserable). For the other items, differences in the relative contribution to the information between the two cutoffs were also small, which means that for differentiating respondents around the higher cutoff (X+ = 12), the relative importance of the items is the same as for differentiating around the lower cutoff (X+ = 9). In addition, the differences in screening properties between males and females were small, which again demonstrates that the impact of DIF is small and of no practical concern. Thirdly, we looked at the score profiles that further characterize the diagnostic groups at the item level. We found that the difference between mild and severe depression is most prominently reflected by differences in sleeping difficulties and anhedonia.
In this study, we used IRT-based methods to examine different aspects of the EDS. To the best of our knowledge, there are two other studies that have used IRT to validate the EDS [48, 49]. Both studies adopted a polytomous Rasch model (e.g., [42, 55]), which assumes, among other things, that all the items in a scale have the same discrimination power. This assumption is unrealistic for the EDS, as shown by the varying item-factor loadings in the factor analysis and the varying scalability coefficients in Mokken scale analysis. Therefore, the Rasch model seems to be too restrictive to adequately capture the relevant test and item characteristics of the EDS. Using an IRT model that is too restrictive yields undesirable results. Most importantly, it may lead to the removal of sound items. For example, Pallant et al. ([48], p. 28) suggested discarding item 8 from the EDS because it showed poor fit under the postulated Rasch model. However, this misfit is most likely explained by the fact that the item has higher discrimination than the other items. Under the Rasch model, such deviating item discrimination is identified as item misfit. Discarding item 8 seems to be an unfortunate choice since, as was shown in this study, it is highly informative for diagnosing mild and severe depression levels and has excellent measurement properties. Removing item 8 would unnecessarily compromise the reliability and (predictive) validity of the EDS.
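Why a highly discriminating item is valuable rather than misfitting can be illustrated with a minimal sketch. It uses a dichotomous two-parameter logistic item as a deliberate simplification; the EDS items are polytomous and the study used polytomous models, so this is not the model fitted here. The parameter values are hypothetical and chosen only to show how item information grows with discrimination.

```python
import math


def response_probability(theta: float, discrimination: float, difficulty: float) -> float:
    """Probability of endorsing a dichotomous item under the two-parameter
    logistic model (a simplification of the polytomous case)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def item_information(theta: float, discrimination: float, difficulty: float) -> float:
    """Fisher information of a two-parameter logistic item at theta:
    discrimination squared times p * (1 - p)."""
    p = response_probability(theta, discrimination, difficulty)
    return discrimination ** 2 * p * (1.0 - p)


if __name__ == "__main__":
    cutoff = 0.5  # hypothetical latent cutoff
    # Two hypothetical items located at the cutoff, differing only in discrimination.
    print(item_information(cutoff, 1.0, cutoff))
    print(item_information(cutoff, 2.0, cutoff))
```

In this simplified setting, doubling the discrimination quadruples the information at the item's location. A Rasch model forces all discriminations to be equal, so an item that discriminates better than the rest shows up as misfit even though it contributes the most information near the cutoff.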
Although the above findings support the dimensionality and reliability of the EDS, two limitations should be noted. Firstly, we limited our study to analysis of the dimensionality and measurement properties of the EDS. However, for an instrument to be a valid screening tool in patients with an elevated risk of adverse health outcomes, additional studies on the sensitivity and specificity must also be carried out. The sensitivity and specificity of the EDS have been extensively studied in pregnant [75], non-postnatal [27], and menopausal-aged [76, 77] women. Unfortunately, we had no information on clinical diagnoses derived from psychiatric diagnostic interviews at our disposal. Data from a psychiatric diagnostic interview such as the Composite International Diagnostic Interview would allow us to calculate the sensitivity and specificity of the Dutch version of the EDS for primary-care patients with type 2 diabetes.
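If interview-based diagnoses were available, the calculation itself would be straightforward. The sketch below shows one way to do it, assuming that a respondent screens positive when the sum score is at or above the cutoff (a convention assumed here, not stated in the text). The scores and diagnoses are made-up illustrative values, not study data, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    sum_scores: Sequence[int], diagnosed: Sequence[bool], cutoff: int
) -> Tuple[float, float]:
    """Sensitivity and specificity of the rule 'sum score >= cutoff'
    against a reference diagnosis (True = depression present)."""
    tp = fn = tn = fp = 0
    for score, has_depression in zip(sum_scores, diagnosed):
        screen_positive = score >= cutoff  # assumed screening convention
        if has_depression and screen_positive:
            tp += 1
        elif has_depression:
            fn += 1
        elif screen_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diagnosed and one non-diagnosed respondent")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical sum scores and interview diagnoses for eight respondents.
    scores = [3, 8, 9, 12, 15, 10, 5, 14]
    diagnoses = [False, False, False, True, True, True, False, True]
    for cutoff in (9, 12):
        print(cutoff, sensitivity_specificity(scores, diagnoses, cutoff))
```

In this made-up example, raising the cutoff from 9 to 12 trades sensitivity for specificity, which is the trade-off such a validation study would quantify for the Dutch EDS.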
The second limitation concerns the specific sample of diabetes patients used in our study. Published validation studies on other scales measuring depressive symptoms, such as the HADS [78] and the SCL-90-R [79], have sometimes yielded inconsistent results with respect to the dimensionality of the scales across different (clinical) populations. These inconsistencies can partly be explained on statistical grounds since researchers use different research strategies and model selection criteria [80]. However, it has also been hypothesized that the dimensionality of symptom scales may itself depend on the general level of negative affectivity, a concept closely related to depression [80]. This means that for a well-defined population, the dimensionality within the subpopulation with high negative affectivity may be different from that within the subpopulation with low negative affectivity. According to this hypothesis, negative affectivity serves as a so-called structure-generating factor. However, not only the general negative affectivity level but also specific characteristics of the disease status of the respondents may operate as a structure-generating factor. This means that caution must be exercised in generalizing the results from one clinical population to another.