This study found that the QIDS-C16 and the QIDS-SR16 are very similar to one another, a result that closely matches the findings of
Rush et al. (in press). The finding of greatest clinical significance is that the two versions are highly comparable. Individual domains relate equally well to overall depression with the two scales. The largest difference involved the relative infrequency with which clinicians used the most extreme category for one item, restlessness/agitation. If anything, the self-report version was slightly superior in discriminating those of average to above average depression in this sample. These results indicate the utility of both versions. The self-report performed very well, even in a somewhat poorly educated, socially disadvantaged population.
CTT findings and the present IRT results (
Trivedi et al. 2004a) produce very comparable findings. For example, the CTT item-total correlations were reflected in the
a (slope) parameter found with IRT. Similarly, findings based on the CTT item means were also noted with the
b_i (location) parameters of IRT. However, the present IRT findings provided for very straightforward testing of both kinds of differences between test versions, whereas CTT only did so for the item means. In addition, IRT afforded an explicit basis to equate scores on two different tests (
Rush et al. 2003b). For a more extensive discussion of some of the advantages and disadvantages of IRT relative to CTT, see
Nunnally and Bernstein (1994, pp. 394–396 and 433–435).
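The equating idea mentioned above can be illustrated with a minimal sketch. It assumes a Samejima-type graded response model with items scored 0 to 3, and every item parameter below is invented for illustration; none is taken from this study or from Rush et al. A total score on one version is mapped to the latent severity (theta) that yields that expected total, and the expected total on the other version is then read off at the same theta.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_item_score(theta, a, thresholds):
    # Under the graded response model, the expected item score equals the
    # sum of the cumulative probabilities P(X >= k) for k = 1..K.
    return sum(logistic(a * (theta - b)) for b in thresholds)


def expected_total(theta, items):
    # Test characteristic curve: expected total score at a given theta.
    return sum(expected_item_score(theta, a, bs) for a, bs in items)


def theta_for_total(score, items, lo=-6.0, hi=6.0, tol=1e-8):
    # The expected total increases with theta when all slopes are positive,
    # so a simple bisection search can invert it.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_total(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equate(score_a, items_a, items_b):
    # Map a total on version A to the expected total on version B.
    theta = theta_for_total(score_a, items_a)
    return expected_total(theta, items_b)


# Hypothetical parameters: (slope a, [thresholds b_1 < b_2 < b_3]) per item.
version_a = [(1.8, [-1.0, 0.0, 1.0]), (1.2, [-0.5, 0.8, 2.0]), (0.9, [0.5, 1.5, 2.5])]
version_b = [(1.7, [-0.9, 0.1, 1.1]), (1.3, [-0.4, 0.9, 2.2]), (0.9, [0.4, 1.4, 2.4])]

for s in (2, 4, 6):
    print(s, round(equate(s, version_a, version_b), 2))
```

This is only one of several ways to link two tests through a shared latent scale; a real application would use parameters estimated from data.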
Some previous psychiatric applications of IRT have employed the
Rasch (1960) model (
Bech et al. 1981;
Cialdella et al. 1992). However, the Rasch model makes one assumption that, in our view, severely limits its utility in the present context: that all items have the same slope (
a parameter). Empirically, this was clearly not the case in the present report, nor is it likely in any clinical setting. This requirement precludes the determination of the differential contributions of the various symptoms to the overall definition of depression, which is an important aspect of this investigation. Unlike applications in industrial/organizational psychology, where weakly discriminating items can be eliminated, such symptoms need to be considered in psychiatric diagnosis. For example, although Suicidal Ideation is less discriminating than Low Energy or Sad Mood, it would be a grave omission not to ask questions about less common but clinically important symptoms. It is suggested that the Rasch model is most useful in settings where the same general type of question can be asked, such as presenting randomly selected pairs of three-digit numbers for addition to evaluate arithmetic ability in young children. In our view, this model seems less appropriate for medical settings where a wide range of symptoms with varying sensitivity needs to be considered.
Consider the limitations of the Rasch model in the context of measuring the severity of a psychiatric or general medical syndrome such as major depression, schizophrenia, nephrotic syndrome, congestive heart failure, etc. Virtually all medical syndromes are based on a listing of commonly occurring clusters of signs and symptoms. No one patient is required to have each and every sign and symptom relevant to the diagnosis. The syndrome of major depression, for example, requires either sad mood or reduced interest and only four of the remaining seven criterion symptom items to qualify for the diagnosis. In some patients, some signs/symptoms will be more common, while others will be less common. Over time, some new signs/symptoms may develop. Others may abate. Thus, an analytic model like the Samejima model that allows for a grading of the severity of each diagnostic criterion sign/symptom, and that allows investigators to gauge the likelihood of each specific sign/symptom being endorsed in a heterogeneous syndrome, provides greater flexibility in assessing test performance.
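To make the contrast concrete, the following is a minimal sketch of category probabilities under a Samejima-type graded response model for items scored 0 to 3. The function name, the item labels, and all parameter values are hypothetical and are not estimates from this study.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def grm_category_probs(theta, a, thresholds):
    """Return P(X = 0), ..., P(X = K) for one graded item.

    thresholds must be in increasing order so that every category
    probability is non-negative.
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (k = 0) and 0 (k = K + 1).
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    # Each category probability is the difference of adjacent cumulative terms.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


# Hypothetical items: (slope a, [thresholds b_1 < b_2 < b_3]).
items = {
    "strongly discriminating symptom": (2.2, [-1.0, 0.0, 1.0]),
    "weakly discriminating symptom": (0.7, [1.0, 2.0, 3.0]),
}

for name, (a, bs) in items.items():
    for theta in (-1.0, 0.0, 1.0, 2.0):
        probs = grm_category_probs(theta, a, bs)
        print(name, theta, [round(p, 3) for p in probs])
```

Forcing both items to share a single value of a, as a Rasch-type model would, removes the only parameter in this sketch that expresses how sharply each symptom separates patients; the thresholds alone would then have to carry all differences between items.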
As noted in the introduction, IRT is becoming widely used to study depressive symptomatology, and much of this work has examined DIF, e.g.,
Azocar et al. (2001),
Evans et al. (2004), Iwata and Buka (2002), and
Iwata et al. (2002). Unlike these reports, this study deals with a questionnaire that was developed specifically within psychiatry and which was designed to evaluate symptoms of depression that follow from its DSM definition (
American Psychiatric Association 2000). In that sense, it is related to the work of
Gibbons et al. (1993). The papers by Azocar and Iwata et al. studied tests like the Beck Depression Inventory that have been more widely used in nonclinical populations than the QIDS. In contrast, it is perhaps more important to consider the implications of DIF, when it is present, for tests applied to clinical populations.
One possibility is to treat DIF in the present context as it is usually treated in employment (industrial/organizational) settings. In that case, it is usually interpreted as being highly undesirable, and much effort goes into eliminating or at least rewriting such items. Indeed, the term “item bias” has largely been replaced by DIF. In the present case, this would involve the appetite/weight and restlessness/agitation domains. It is possible that suitable instructions to the clinical interviewers could reduce these differences. However, there is no assurance that other domains will not show DIF when applied to different ethnic groups, genders, etc. Moreover, unless these changes can preserve the essential characteristics of the domain, one runs the risk of failing to cover the DSM criteria, which was the goal of constructing the scale in the first place. Furthermore, as more and more relevant groups are compared, the probability of finding DIF in a given item or domain in at least one group increases. For example, even though an item may not show DIF by gender or in a racial (e.g., black/white) comparison, it may in an ethnic comparison (e.g., white Hispanic vs. white non-Hispanic).
An alternative is to consider these instances of DIF as legitimate group differences that should be taken into account in diagnosis. Indeed, this may be quite necessary should DIF emerge among subtypes of depression. As noted earlier, the presence of DIF implies that a single dimension, depression in this case, is not sufficient to account for all differences among patients, i.e., patients at the same level of depression might differ in some other respect. This is a likely possibility that simply should be kept open along with the present findings that different methods of inferring QIDS responses differ slightly, but perhaps legitimately. In other words, it is reasonable to assume that scores on scales like the QIDS may be influenced by dimensions of relevance other than depression, per se.
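The notion that respondents at the same level of depression can still differ on an item can be sketched numerically. The example below assumes a graded response model and two invented parameter sets for one item, in which only the threshold for the most extreme category differs; the names `reference`, `focal`, and `dif_gap` are illustrative and do not come from the study.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_item_score(theta, a, thresholds):
    # Expected score on a graded item: sum of P(X >= k) over k = 1..K.
    return sum(logistic(a * (theta - b)) for b in thresholds)


# Hypothetical parameters for the same item under two conditions.
# Only the highest threshold differs, so the top category is harder
# to reach in the focal condition.
reference = (1.5, [-0.5, 0.5, 1.5])
focal = (1.5, [-0.5, 0.5, 2.5])


def dif_gap(theta, ref, foc):
    # Difference in expected item score at the same theta.
    # A value of zero at every theta would mean no DIF on this item.
    return expected_item_score(theta, *ref) - expected_item_score(theta, *foc)


for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta, round(dif_gap(theta, reference, focal), 3))
```

In this sketch the gap is negligible at low severity and grows at higher severity, which is the pattern one would expect when only the most extreme response category is used differently.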
Limitations
It is probable that a major factor underlying the equivalence of the clinical and self-report versions in this study is the fact that there were no major incentives to either exaggerate or minimize the symptoms of depression. It is not unreasonable to assume that the presence of such factors would lead to differences between the two methods, though it should not be forgotten that even the clinical version is based heavily upon patient report. In addition, there are a variety of other samples (e.g., bipolar depression) for whom possible equivalence has not been examined.
Conclusions
The two versions of the QIDS16 are highly similar, even in this less educated, more socially disadvantaged sample. In particular, this means that self-report is an adequate method of assessing depression and has the advantage of taking less clinician time.