Nine DSM-IV-TR criterion symptom domains are evaluated to diagnose major depressive disorder (MDD). The Quick Inventory of Depressive Symptomatology (QIDS) provides an efficient assessment of these domains and is available as a clinician rating (QIDS-C16), a self-report (QIDS-SR16), and in an automated, interactive voice response (IVR) (QIDS-IVR16) telephone system. This report compares the performance of these three versions of the QIDS and the 17-item Hamilton Rating Scale for Depression (HRSD17).
Data were acquired at baseline and exit from the first treatment step (citalopram) in the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial. Outpatients with nonpsychotic MDD who completed all four ratings within ±2 days were identified from the first 1500 STAR*D subjects. Both item response theory and classical test theory analyses were conducted.
The three methods for obtaining QIDS data produced consistent findings regarding relationships between the nine symptom domains and overall depression, demonstrating interchangeability among the three methods. The HRSD17, while generally satisfactory, rarely utilized the full range of item scores, and evidence suggested multidimensional measurement properties.
In nonpsychotic MDD outpatients without overt cognitive impairment, clinician assessment of depression severity using either the QIDS-C16 or HRSD17 may be successfully replaced by either the self-report or IVR version of the QIDS.
Accurate, time-efficient measurement of depressive symptom severity is of great importance in conducting cost-efficient clinical trials. Development of a self-report measure that accurately reflects overall symptom severity would be useful to both clinicians and researchers who wish to monitor treatment outcomes. In addition, if automated methods for obtaining such ratings over the telephone using interactive voice response (IVR) technology were available, researchers and clinicians would be able to obtain such measures at virtually any time or place.
The growing importance of symptom remission in managing depression has been recognized for several years (American Psychiatric Association 2000b; Bauer et al 2002a, 2002b; Canadian Psychiatric Association Network for Mood and Anxiety Treatments 2001; Crismon et al, 1999; Depression Guideline Panel, 1993; Reesal et al, 2001; Rush and Ryan, 2002). The field has yet to agree on and validate the best definition of remission. The ascertainment of remission or partial remission, however, based on DSM-IV-TR (American Psychiatric Association, 2000a, page 412), logically recommends that all nine diagnostic criterion symptoms that define the syndrome be assessed. Some would also recommend, however, that an assessment of anxiety, other common symptoms (e.g., irritability, pain), and even day-to-day function could also be important to fully define remission.
The most commonly used clinician ratings of depressive symptom severity, e.g., the Hamilton Rating Scale for Depression (HRSD) (Hamilton, 1960, 1967) and the Montgomery-Åsberg Depression Rating Scale (MADRS) (Montgomery and Åsberg, 1979), do not specifically identify and weigh equally each of the diagnostic criterion symptoms specified by DSM-IV-TR. It could be argued that more common criterion symptoms (e.g., sad mood) should contribute to a greater degree to total severity than less common symptoms (e.g., suicidal thinking). The DSM-IV-TR, at least, does not differentially weight the symptoms to define a major depressive episode or to establish the presence of partial remission or remission. Self-report versions of the HRSD (Carroll et al, 1981; Smouse et al, 1981; Reynolds and Kobak, 1995) and of the MADRS (Svanborg and Åsberg, 2001) are available. However, the limitations inherent in the original clinician ratings likely apply to these self-reports (e.g., confounded items, missing criterion diagnostic items, etc.) (Rush et al, 1996).
The Inventory of Depressive Symptomatology (IDS) was developed initially as a 28-item clinician rating scale and a matched 28-item self-report that included all nine criterion symptom domains, as well as commonly associated noncriterion symptoms (e.g., anxiety, irritability) (Rush et al, 1986). These 28-item versions were later enlarged to 30 items (Rush et al, 1996) to capture all DSM-IV atypical symptom features (American Psychiatric Association 2000a). The IDS scales were designed to provide a reliable method to measure symptom severity and symptom change, as well as to provide a rapid appraisal of clinically relevant symptom features (e.g., atypical, anxious, melancholic symptoms). The IDS scales have been subjected to numerous psychometric evaluations (Gullion and Rush, 1998; Corruble et al, 1999; Rush et al 2000, 2003, 2004b; Trivedi et al 2004b) and have been administered to patients with major depressive, bipolar, and dysthymic disorders. The 30-item IDS (IDS30) is sensitive to change with various types of treatments (Rush et al 2000, 2003; Trivedi et al 2004a). A recent report (Rush et al 2004b) has shown the performance of the self-report version of the IDS (IDS-SR30) was comparable to the HRSD. Conversion tables allow IDS30 total scores to be converted to equivalent total scores on the 17-item HRSD (HRSD17), 21-item HRSD (HRSD21), and 24-item HRSD (HRSD24) (Rush et al 2003).
To reduce the time needed to appraise depressive symptom severity, the 16-item Quick Inventory of Depressive Symptomatology (QIDS16) was developed (Rush et al 2003; Trivedi et al 2004b) in both a clinician-rated (QIDS-C16) and self-report (QIDS-SR16) version. The QIDS can also be administered by computer over the telephone using an IVR system (QIDS-IVR16). All three versions of the QIDS16 scales are based on 16 IDS items and obtain ratings (range 0–3) concerning all nine criterion symptom domains (Rush et al 2003; Trivedi et al 2004b). The questions are identical for the QIDS-C16 and the QIDS-SR16. The QIDS-IVR16 uses slightly different questions to obtain symptom ratings for the nine domains. For all versions of the QIDS, four items are used to assess the sleep domain (initial, middle, and late insomnia, as well as hypersomnia). Two items are used to gauge psychomotor activity (agitation and retardation). Four items assess the appetite/weight domain (i.e., appetite increase and decrease, weight increase and decrease). For each of these three domains, the highest rating on any one relevant item is used to score the domain (range 0–3). Only one item is used to score the remaining six criterion domains (each rated 0–3) (sad mood, concentration, energy, interest, guilt, suicidal ideation/intent). The QIDS16 total score ranges from 0 to 27.
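The domain-scoring rule just described can be sketched in code. This is an illustrative implementation only; the item names and the `score_qids16` helper are invented for the example and are not taken from the scale's official materials.

```python
# Illustrative sketch of QIDS-16 domain scoring as described in the text.
# All item names and the function itself are hypothetical labels, not
# identifiers from the published instrument.

def score_qids16(items):
    """Compute the nine QIDS domain scores (each 0-3) and the total (0-27).

    `items` maps the 16 illustrative item names to ratings of 0-3.
    """
    domains = {
        # Sleep: the highest rating among the four sleep items.
        "sleep": max(items[k] for k in
                     ("initial_insomnia", "middle_insomnia",
                      "late_insomnia", "hypersomnia")),
        # Psychomotor: the higher of agitation and retardation.
        "psychomotor": max(items["agitation"], items["retardation"]),
        # Appetite/weight: the highest of the four appetite/weight items.
        "appetite_weight": max(items[k] for k in
                               ("appetite_decrease", "appetite_increase",
                                "weight_decrease", "weight_increase")),
        # The remaining six criterion domains are scored by a single item each.
        "sad_mood": items["sad_mood"],
        "concentration": items["concentration"],
        "energy": items["energy"],
        "interest": items["interest"],
        "guilt": items["guilt"],
        "suicidal_ideation": items["suicidal_ideation"],
    }
    return domains, sum(domains.values())
```

Because three domains take the maximum over multiple items, 16 item ratings collapse to nine domain scores, which is why the total ranges from 0 to 27 rather than 0 to 48.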
The QIDS were designed to measure overall severity of the depressive syndrome (major depressive disorder [MDD]) by assessing each of the nine symptom domains that define the syndrome. The IDS assesses the same nine domains and other commonly associated symptoms (e.g., anxiety, irritability). Neither is intended as a diagnostic tool, though total score thresholds that indicate the presence of MDD have been reported (Rush et al, 1996).
Item response theory (IRT) analyses of QIDS-SR16 data indicate that a QIDS-SR16 total score of 5 corresponds to an HRSD17 total score of 7, a commonly used definition of remission in clinical trials. Other QIDS16 thresholds recommended to estimate depression severity are mild (6–10), moderate (11–15), severe (16–20), and very severe (≥21) depression. Corresponding HRSD17 scores would be 8 to 13, 14 to 17, 18 to 24, and ≥25, respectively (Rush et al 2003).
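The severity bands listed above translate directly into a lookup. A minimal sketch, with a hypothetical function name and the band labels taken from the text:

```python
# Sketch of the QIDS-16 severity bands reported in the text (Rush et al 2003).
# The function name and the "remission" label for scores <= 5 are illustrative.

def qids16_severity(total):
    """Map a QIDS-16 total score (0-27) to the severity band named above."""
    if total <= 5:
        return "remission"     # corresponds to HRSD17 <= 7 per the IRT mapping
    if total <= 10:
        return "mild"
    if total <= 15:
        return "moderate"
    if total <= 20:
        return "severe"
    return "very severe"
```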
The IDS and QIDS scales are in the public domain and are available in multiple languages (www.ids-qids.org). To date, evidence suggests that both the QIDS-C16 and the QIDS-SR16 have acceptable psychometric properties (Rush et al 2003; Trivedi et al 2004b).
The present study was conducted using data made available by the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study (Fava et al 2003; Rush et al 2004a). This report characterizes and compares the QIDS-C16, QIDS-SR16, QIDS-IVR16, and the HRSD17 using classical test theory (CTT) and IRT analyses in a large sample of outpatients with MDD.
The STAR*D was designed to define prospectively which of several treatments are most effective for outpatients with nonpsychotic MDD who have an unsatisfactory clinical outcome to initial and, if necessary, subsequent treatments. The STAR*D protocol was reviewed and approved by the 14 Institutional Review Boards (IRBs) governing the 14 regional centers and the IRBs at the National Coordinating Center (UT Southwestern, Dallas) and the Data Coordinating Center (University of Pittsburgh) (see acknowledgments and Rush et al 2004a).
Outpatients with nonpsychotic MDD were recruited from 18 primary and 23 specialty care settings across the United States. Eligible STAR*D participants were female and male outpatients (18–75 years of age) with nonpsychotic MDD for whom outpatient treatment with an antidepressant was deemed to be safe and appropriate by the treating clinician. The broad inclusion and minimal exclusion criteria were used to obtain a highly representative sample of persons with MDD treated in everyday practice. Participants with schizophrenia, schizoaffective disorder, bipolar disorder, or anorexia nervosa were excluded, as were those with primary diagnoses of obsessive-compulsive disorder (OCD) or bulimia nervosa. Participants with a history of nonresponse or intolerance (in the current major depressive episode) to protocol treatments and those with medical conditions contraindicating protocol treatments (e.g., seizures) were excluded. Participants taking concomitant nonpsychotropic, anxiolytic, or sedative hypnotic medications could enroll based on clinician judgment, and those with current substance abuse or dependence were eligible if inpatient detoxification was not required.
Following written informed consent, participants were evaluated by the Clinical Research Coordinators (CRCs), who worked closely with participants and clinicians, administered some of the clinician-rated instruments, ensured that all self-rated instruments were completed, and functioned as study coordinators. Ratings germane to this report are as follows. Telephone interviews with trained and certified research outcome assessors (ROAs), who were masked to treatment and located apart from any treatment site, collected the HRSD17 and the 30-item IDS clinician rating scale (IDS-C30) (from which the QIDS-C16 was extracted) following a structured interview (available at www.star-d.org). A telephone-based IVR system collected other research outcomes, including the QIDS-IVR16. The patient completed the QIDS-SR16 at the clinic visit.
This preliminary report is based on data available from the first 1500 consecutive STAR*D participants, obtained at entry into or exit from Level 1 (citalopram) treatment. The decision to include data from both the baseline and exit evaluations, rather than just the baseline evaluations, was arbitrary; the same conclusions result from analyses of either dataset. For this report, data from the HRSD17 and the QIDS-C16 extracted from the IDS-C30 obtained by the ROAs, the QIDS-SR16 obtained by the CRC during the clinic visit, and the QIDS-IVR16 obtained by the computer-automated telephone calls were used. All three QIDS16 measures and the HRSD17 must have been obtained within a period of 2 days or less to be included in these analyses. Of the 1500 participants available, 1120 met the criterion of being administered the QIDS-C16, QIDS-SR16, QIDS-IVR16, and the HRSD17 within 2 days of their baseline visit; 582 had the four tests administered within 2 days of their final (exit) visit; and 479 met both criteria. However, not all patients answered all items, so some analyses involved a smaller number of observations.
Classical test theory measures of scale consistency, including Cronbach's alpha (Cronbach, 1951), item/total scale correlations (not corrected for measurement error), item/symptom domain mean values, and mean total scale scores were computed for each of the four measures of depression. Item response theory analyses of the discriminative and informational relationships between individual scale items and total scale properties were also computed.
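The two classical-test-theory statistics named above have standard closed forms. A minimal sketch with NumPy, operating on an (n subjects × k items) score matrix; the function names are illustrative:

```python
# Sketch of the CTT statistics used above: Cronbach's alpha and uncorrected
# item/total correlations, computed from an (n_subjects, n_items) matrix.
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total variance)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_total_correlations(X):
    """Correlation of each item with the total scale score (not corrected
    for item overlap or measurement error, as in the text)."""
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    return np.array([np.corrcoef(X[:, j], total)[0, 1]
                     for j in range(X.shape[1])])
```

When every item carries the same signal, alpha approaches 1; the coefficients near .87-.89 reported below indicate high but not redundant internal consistency.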
An assumption of IRT analysis is that scale items measuring symptom severity assess only depression (i.e., they are unidimensional). Therefore, a principal components factor analysis was conducted on each measure. Parallel analysis (Horn, 1965; Humphreys and Ilgen, 1969; Humphreys and Montanelli, 1975; Montanelli and Humphreys, 1976) was used to infer the number of “real” dimensions (independent components/factors) present in the data. Like a scree plot, parallel analysis is an alternative to the traditional Kaiser-Guttman rule (eigenvalues greater than 1) for defining dimensionality but, unlike the scree criterion, incorporates an empirical rule to define the cutoff.
Parallel analysis involves 1) generating one or more correlation matrices whose individual elements are sampled from a population of null correlations, using the same number of observations and variables as the actual data; 2) extracting the principal components for each random matrix (which are orthogonal, by definition); and 3) averaging the magnitude of the eigenvalues over replications. The number of components for which the obtained eigenvalue exceeds the simulated eigenvalue defines the dimensionality of the variables.
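The three steps above can be sketched compactly. This is an illustrative NumPy implementation under the stated procedure, with random normal data standing in for the null-correlation population; the function name is an assumption:

```python
# Sketch of the parallel-analysis procedure described in the text:
# compare observed correlation-matrix eigenvalues against the average
# eigenvalues of random data with the same number of rows and columns.
import numpy as np

def parallel_analysis(X, n_reps=100, seed=0):
    """Return how many observed eigenvalues exceed the mean simulated
    eigenvalues; this count estimates the number of 'real' dimensions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 0: eigenvalues of the observed correlation matrix, descending.
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    # Steps 1-3: simulate null matrices, extract components, average.
    sim = np.zeros(p)
    for _ in range(n_reps):
        R = np.corrcoef(rng.standard_normal((n, p)), rowvar=False)
        sim += np.sort(np.linalg.eigvalsh(R))[::-1]
    sim /= n_reps
    return int(np.sum(obs > sim))
```

Applied to strongly unidimensional data, the count is 1, mirroring the QIDS16 result reported below; a second above-chance eigenvalue, as found for the HRSD17, would yield 2.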
Samejima (1997) graded IRT model item/domain parameters were estimated for each version of the QIDS16 scale and for the HRSD17. To determine whether the parameter estimates varied across the three versions of the QIDS16, the fit of an IRT model in which parameter estimates of all measures were allowed to vary freely was compared with the fit of an IRT model in which parameter estimates were constrained to be the same for each measure.
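In the Samejima graded model, each item has a slope (a) and ordered thresholds (b0, b1, b2); the probability of a given response category is the difference between adjacent boundary curves. A minimal sketch for a 4-category (0-3) item, with an illustrative function name:

```python
# Sketch of the Samejima graded-response model for a 4-category item.
# a: slope (discrimination); b = (b0, b1, b2): thresholds separating
# {0 | 1,2,3}, {0,1 | 2,3}, and {0,1,2 | 3}, as defined in the text.
import math

def graded_category_probs(theta, a, b):
    """Category probabilities (length 4) at latent severity theta."""
    def boundary(bk):
        # P(response at or above the boundary): 2-parameter logistic curve.
        return 1.0 / (1.0 + math.exp(-a * (theta - bk)))
    # Cumulative boundary probabilities, bracketed by 1 and 0.
    stars = [1.0] + [boundary(bk) for bk in b] + [0.0]
    # Category probability = difference of adjacent boundary curves.
    return [stars[k] - stars[k + 1] for k in range(4)]
```

With ordered thresholds the boundary curves never cross, so all four probabilities are nonnegative and sum to 1; a threshold pushed beyond ±3 on the latent scale (the "asymptote" criterion used below) means that response category is essentially never observed.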
Effect sizes for change from the first to the last session of Level 1 were computed for each measure and each item/domain of each measure. Effect size refers to the mean decrease in each item/domain for those patients (n = 479) seen on both occasions divided by the standard deviation of the decrease. We chose .50 as an “acceptable” effect size.
The four scales (three versions of the QIDS16 and the HRSD17) were compared on their ability to identify treatment response and remission at exit from Level 1 treatment. The strength of agreement between measures was assessed by the kappa statistic. Treatment response was defined as a 50% improvement from baseline. Remission was defined as an HRSD17 score ≤7 or a QIDS16 score ≤5, based on the clinician ratings or the standard self-report or IVR version of the scale (Rush et al, 2003).
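The kappa statistic corrects raw agreement for the agreement expected by chance from the marginal rates. A minimal sketch of Cohen's kappa for two dichotomous classifications (e.g., remitter vs. nonremitter on two scales); the function name is illustrative:

```python
# Sketch of Cohen's kappa for two equal-length binary (0/1) classifications:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).
import numpy as np

def cohens_kappa(x, y):
    x, y = np.asarray(x), np.asarray(y)
    po = np.mean(x == y)                 # observed agreement
    p1, q1 = np.mean(x), np.mean(y)      # marginal "positive" rates
    pe = p1 * q1 + (1 - p1) * (1 - q1)   # agreement expected by chance
    return (po - pe) / (1 - pe)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, so the 85-92% raw agreement figures reported below translate into kappas well above chance but below unity.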
Figures 1 and 2, respectively, contain the domain means and the domain item/total correlations (rit) for the three methods of administering the QIDS16 (clinician, self-report, and IVR). One difference not visible in Fig. 2 is that Restlessness/Agitation was never reported at level “3” in the clinician version, whereas it was, albeit rarely, in the other two methods.
The associated means (standard deviations) were 8.6 (6.3), 7.7 (5.7), and 8.8 (6.4), and the associated values of coefficient alpha were .87, .87, and .86. The means differed significantly, F(2,1162) = 48.13, MSe = 3.99. This is a small effect (η2 = .01), explainable in terms of the lower self-report scores relative to the other two and basically limited to three domains: Appetite, Concentration/Decision Making, and Energy Level, as can be seen in Fig. 1. In general, domains leading to frequent symptom reports in the companion paper (REF??), most specifically Sleep, were also reported frequently here, and domains previously reported as correlating highly with the total QIDS16 score also did so here, most specifically Sad Mood and Concentration.
Table 1 contains the item means, item/total correlations (rit), scale mean, scale standard deviation, and coefficient alpha for the HRSD17. Its scale mean (standard deviation) was 11.4 (8.6), and its coefficient alpha reliability was .89. The four measures being considered therefore differ marginally with respect to their internal consistency. Note that although most of the HRSD17 domains correlate acceptably with total score and the reliability is highly similar to the three versions of the QIDS16, Loss of Insight shows virtually no tendency to be reported and the rit for this domain is negative.
Table 2 contains the intercorrelations among the four measures. Given that the reliabilities were all nearly .9, these correlations essentially become perfect when disattenuated.
Figs. 3–6 contain the Samejima IRT a, b0, b1, and b2 parameter estimates for the three versions of the QIDS16. It may be recalled from the accompanying paper (REF??) that these represent, respectively, the slope or relation between the domain and depression in general, the threshold separating category “0” responses from higher-category responses, the threshold separating category “0” or “1” responses from category “2” or “3” responses, and the threshold separating category “0”, “1”, and “2” responses from category “3” responses. Like their CTT counterparts, the values are similar across methods of administration with one apparently large exception: Restlessness/Agitation at b2. In fact, clinicians never used the most extreme response category. Note that the ordinate uses normal-curve scaling; any value outside the range ±3.0 is effectively at asymptote. By this criterion, there were 2, 0, and 2 (7%, 0%, and 7%) extreme b parameter estimates for the clinician, self-report, and IVR results, each of which has a maximum of 27.
Table 3 contains the corresponding Samejima estimates for the HRSD17 (also see REF?? for a similar analysis). Here, a total of 14/51 (27%) parameter estimates were outside the scalable range. Basically, the most extreme response category was never chosen for 12 of the 17 items.
Inferential tests consist of comparing a model in which all parameters are allowed to vary freely with models containing constraints. The first such constrained model equated both the QIDS16 a and b parameters over the three methods of administration. In this case, the fit statistic increased by a significant G2(72) of 779.7, so there are clearly some differences among the three methods of administration. Constraining the b parameters but letting the a parameters vary freely also led to a significant G2(54) of 553.4, so there are clearly large threshold differences. Conversely, constraining the a parameters but letting the b parameters vary freely led to a significant G2(54) of 35.3. Thus, there are some slope differences, but these are of lesser magnitude.
Looking at these differences at the item level indicated that the slope differences were confined to domains 5 (Self-view) and 7 (General interest), G2(2) of 8.7 and 11.2, ps < .05 and .01. In both cases, IVR was slightly less discriminating than the other two methods. In contrast, only domain 6 (Thoughts of death or suicide) failed to differ across methods at the .05 level or better, and, among the remaining domains, only domain 7 failed to be significant beyond the .01 level.
Principal component analyses were performed separately upon the three versions of the QIDS16 and the HRSD17. The resulting scree plots (eigenvalue magnitude as a function of its serial position) using the first nine components are presented in Fig. 7. The criterion used to infer how many “real” components were present in the data was parallel analysis (REF??), which has been offered as one alternative to the limitations of the traditional Kaiser-Guttman eigenvalue-greater-than-1 rule. Parallel analysis involves generating a series of correlation matrices sampled from a population of null correlations using the same number of observations and variables as the real data, extracting the principal components for each random matrix, and averaging over replications. The last point at which the obtained eigenvalue exceeds the simulated eigenvalue defines the dimensionality. The first four simulated eigenvalues using the 9 variables of the QIDS16 were 1.19, 1.13, 1.08, and 1.03. For each QIDS16 version, the first obtained eigenvalue exceeded the first simulated eigenvalue, but all later obtained eigenvalues were smaller than the simulated eigenvalues, which supports the unidimensionality of all three versions. In contrast, the first four simulated eigenvalues using the 17 variables of the HRSD17 were 1.30, 1.24, 1.20, and 1.16. Here, the first two obtained eigenvalues exceeded the simulated eigenvalues, so there is more evidence for multidimensionality of the HRSD17 than the QIDS16.
The elements (loadings) of the first principal component were then analyzed. These define the optimally weighted linear combination of the z-score-transformed responses. They were virtually identical to the item/total correlations, which reflect the equally weighted linear combination of the raw response data.
The change in item response from baseline to exit was examined next. This represents the mean decrease among the 479 patients seen on both occasions. Tables 4 and 5 contain the data from the three versions of the QIDS16 and the HRSD17, respectively. As can be seen, the dominant change, by a large margin, is one of mood.
Another way to examine changes is to consider how well pairs of scales agree as to whether a patient has improved, defined as a 50% reduction from baseline. Table 6 contains these data. Perhaps not surprisingly, the greatest agreement is between the two clinically administered scales (the QIDS-C16 and HRSD17), which was over 92%. The agreement between the remaining pairs of scales was in the 85–87% range.
A second definition of change is whether or not a patient can be considered remitted, defined as an HRSD17 score <7, a QIDS-C16 or QIDS-SR16 score <5, and a QIDS-IVR16 score <6. Table 7 contains the relevant results. Again, the two clinically administered scales agreed most, over 91% of the time, whereas the remaining pairs of scales were in the 85–88% range.
From the Department of Psychiatry (AJR, MHT, TJC, KS-W, MMB), The University of Texas Southwestern Medical Center at Dallas, Dallas, Texas; Department of Psychology (IHB, AW), The University of Texas at Arlington, Arlington, Texas; Epidemiology Data Center (SW), Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania; Healthcare Technology Systems (JCM), Madison, Wisconsin; and Clinical Psychopharmacology Unit (AAN, MF), Massachusetts General Hospital, Boston, Massachusetts.