Data from our 2 studies totaling 6,000 patients provide strong evidence for the validity of the PHQ-9 as a brief measure of depression severity. Criterion validity was demonstrated in the sample of 580 primary care patients who underwent an independent reinterview by a mental health professional. Construct validity was established by the strong association between PHQ-9 scores and functional status, disability days, and symptom-related difficulty. External validity was achieved by replicating the findings from the 3,000 primary care patients in a second sample of 3,000 obstetrics-gynecology patients. Indeed, the similar results seen in rather different patient populations suggest that our PHQ-9 findings may be generalizable to outpatients seen in a variety of clinic settings.
Our analysis of the full range of PHQ-9 scores complements rather than supersedes the validated PHQ-9 algorithm for establishing categorical diagnoses. However, as the PHQ-9 is increasingly used as a continuous measure of depression severity, it will be helpful to know the probability of a major or subthreshold depressive disorder at various cut points. PHQ-9 scores of 5, 10, 15, and 20 represent valid and easy-to-remember thresholds demarcating the lower limits of mild, moderate, moderately severe, and severe depression. In particular, scores less than 10 seldom occur in individuals with major depression, while scores of 15 or greater usually signify the presence of major depression. In the “gray zone” of 10 to 14, increasing PHQ-9 scores are associated, as expected, with increasing specificity and declining sensitivity. However, the operating characteristics of the PHQ-9 at various cut points compare favorably with those of 9 other case-finding instruments for depression in primary care, which have an overall sensitivity of 84%, a specificity of 72%, and a positive likelihood ratio of 2.86 [1]. Likewise, the positive predictive value of the PHQ-9 (ranging from 31% to 51% depending upon the cut point) is similar to that of other instruments; of note, predictive value is related not only to a measure's sensitivity and specificity but also to the prevalence of depressive disorders.
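The severity thresholds and the dependence of predictive value on prevalence can be illustrated with a minimal sketch. The function names, the "minimal" label for scores below 5, and the prevalence values are assumptions made for illustration and do not come from the study; the 0 to 27 score range assumes the standard scoring of nine items rated 0 to 3. The sensitivity and specificity are the pooled figures quoted above for other case-finding instruments.

```python
from bisect import bisect_right

# Lower limits of each severity band, as given in the text.
CUT_POINTS = (5, 10, 15, 20)
# "minimal" (scores below 5) is an assumed label; the text names only the other four.
LABELS = ("minimal", "mild", "moderate", "moderately severe", "severe")


def phq9_severity(score: int) -> str:
    """Map a PHQ-9 total score to a severity band."""
    # Assumes the standard total range of 0 to 27 (nine items scored 0 to 3).
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 total score must be between 0 and 27")
    # bisect_right counts how many cut points are less than or equal to the score.
    return LABELS[bisect_right(CUT_POINTS, score)]


def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Bayes' rule: probability of the disorder given a positive screen."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Pooled sensitivity and specificity from the text; prevalences are hypothetical.
for prevalence in (0.05, 0.10, 0.20):
    ppv = positive_predictive_value(0.84, 0.72, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.0%}")
```

With sensitivity and specificity held fixed, the predictive value rises as prevalence rises, which is why predictive values reported in different settings are not directly comparable.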
The one depression measure that was used concurrently with the PHQ-9 in our subjects was the 5-item mental health scale of the SF-20, also known as the Mental Health Inventory (MHI-5). PHQ-9 scores were strongly correlated with MHI-5 scores in our subjects. Berwick et al. used ROC analysis to determine how well the MHI-5 and several other measures discriminated between patients with and without major depression [14]. In their study, the area under the curve (AUC) was 0.89 for the MHI-5, 0.90 for the longer MHI-18, 0.89 for the 30-item General Health Questionnaire, and 0.80 for the 28-item Somatic Symptom Inventory. In our study, the AUC for major depression was 0.95 for the PHQ-9 and 0.93 for the MHI-5. It is unlikely that other depression-specific measures would be significantly better than the PHQ-9 since an AUC of 1.0 represents a perfect test.
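To make the AUC concrete, the following minimal sketch computes it as the probability that a randomly chosen patient with the disorder scores higher than a randomly chosen patient without it, with ties counted as one half. This is the standard rank-based interpretation of the AUC, not necessarily the procedure used in either study, and the scores are hypothetical values chosen only to exercise the function.

```python
def auc(scores_with_disorder, scores_without_disorder):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    if not scores_with_disorder or not scores_without_disorder:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for case_score in scores_with_disorder:
        for control_score in scores_without_disorder:
            if case_score > control_score:
                wins += 1.0   # case correctly ranked above non-case
            elif case_score == control_score:
                wins += 0.5   # ties count as half
    return wins / (len(scores_with_disorder) * len(scores_without_disorder))


# Hypothetical scores, not study data.
cases = [12, 15, 18, 21, 9]
controls = [2, 4, 7, 9, 11, 3, 14]
print(f"AUC = {auc(cases, controls):.2f}")
```

An AUC of 1.0 arises only when every case scores higher than every non-case, and 0.5 corresponds to a measure that discriminates no better than chance.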
A particularly important characteristic of a severity measure is its sensitivity to change over time. In other words, how precisely do declining or rising scores on the measure reflect improving or worsening depression in response to effective therapy or natural history? Although an exhaustive review of depression measures is beyond the scope of this paper and can be found elsewhere [4,12], a brief discussion of selected measures is warranted. The Hamilton Rating Scale for Depression (HAM-D) has been the criterion standard outcome measure in clinical trials, but it can require 15 to 30 minutes of clinician time to administer and is therefore not feasible in many practice settings. The HAM-D is also rather complicated to score and requires substantial training to achieve reasonable inter-rater agreement. The Montgomery-Asberg Depression Rating Scale is about half as long as the HAM-D and probably just as sensitive to change [15,16]. Like the HAM-D, however, the Montgomery-Asberg scale must be administered by a clinician with special training and is still moderately time intensive. Several self-administered scales (the 21-item Beck Depression Inventory and the 20-item Zung Self-Rating Depression Scale) have also been used as outcome measures but may be somewhat less sensitive to change than the HAM-D [17]. The SCL-20 has been used as an outcome measure in primary care clinical trials [18–20], although published evidence on its sensitivity to change as well as other psychometric characteristics is limited. Epidemiological and clinical studies have established the 20-item CES-D as a valid measure for identifying depression, but there is less information regarding its sensitivity to change.
In summary, there appear to be many comparable measures for identifying depression [1,2,4,12], including a number of self-administered scales. In contrast, it is less clear what the optimal measure for monitoring response to treatment may be, especially outside the setting of a clinical trial. Sensitivity to change is clearly a necessary feature, but other pragmatic considerations include the number of items, time required for completion, mode of administration (self-rating vs interviewer-administered scale), complexity of scoring, inter-rater agreement, and special training requirements. The specific items included in the scale are another factor. One advantage of the PHQ-9 is its exclusive focus on the 9 diagnostic criteria for DSM-IV depressive disorders. On the other hand, some may argue that instruments including symptoms not in the DSM-IV criteria (e.g., loneliness, hopelessness, and anxiety) may have additional value to the clinician. At the same time, it is possible that such scales are less specific for major depression and other mood disorders and may discriminate depression less accurately from anxiety or even general psychological distress.
The major limitation of our study is its cross-sectional design. While our large sample establishes the construct and criterion validity of the PHQ-9, longitudinal studies are needed to establish its sensitivity to change. This will require the completion of several large ongoing clinical trials using the PHQ-9 in parallel with the HAM-D or other established outcome measures. It will also be useful to define the threshold that represents an adequate clinical response. A preliminary approach would be to consider a PHQ-9 score less than 10 and a 50% decline from the pretreatment score as clinically significant improvement. While any proposed threshold requires prospective verification, this approach would be consistent with that established for the HAM-D. Other study limitations are that validation was based on telephone rather than face-to-face interviews and that the time for patients to complete the PHQ-9 was not determined.
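The preliminary response criterion proposed above can be written as a short check. This is a sketch of one reading of the criterion rather than a validated rule: the function name is hypothetical, "a 50% decline" is interpreted as a decline of at least 50%, and the 0 to 27 score range assumes the standard PHQ-9 scoring of nine items rated 0 to 3.

```python
def is_clinically_significant_improvement(pretreatment: int, followup: int) -> bool:
    """Apply the proposed criterion: follow-up score below 10 AND a 50% decline."""
    for score in (pretreatment, followup):
        # Assumes the standard PHQ-9 total range of 0 to 27.
        if not 0 <= score <= 27:
            raise ValueError("PHQ-9 total score must be between 0 and 27")
    below_threshold = followup < 10
    # Assumption: "50% decline" means the follow-up score is at most half
    # the pretreatment score. Integer arithmetic avoids division by zero.
    halved = pretreatment > 0 and 2 * followup <= pretreatment
    return below_threshold and halved


# Hypothetical patients: (pretreatment score, follow-up score).
for pre, post in [(18, 8), (12, 9), (24, 11)]:
    print(pre, post, is_clinically_significant_improvement(pre, post))
```

Both conditions matter: a fall from 12 to 9 crosses the threshold of 10 without halving the score, and a fall from 24 to 11 more than halves the score while leaving it at or above 10.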
Detecting depression and initiating treatment are necessary but often insufficient steps to improve outcomes in primary care [21]. Monitoring clinical response to therapy is also critical. Multiple studies have shown that monitoring is often inadequate, resulting in clinician failure to detect medication noncompliance, increase the antidepressant dosage, change or augment pharmacotherapy, or add psychotherapy as needed [21,22]. Having a simple self-administered measure to complete either in the clinic or by telephone administration (e.g., nurse administration [23] or interactive voice recording [24]) would save clinicians the time needed to inquire about the presence and severity of each of the 9 DSM-IV symptoms to assess outcomes.
Brief measures are more likely to be used in the busy setting of clinical practice. For example, many practitioners have found it more feasible to use the 4-item CAGE questionnaire than a number of longer alcohol screening measures. Of note, as few as 1 or 2 questions have demonstrated a high sensitivity in screening for major depression [2,25]. Brevity is just as likely to be a valued attribute when it comes to assessing depression severity as it is when establishing depressive diagnoses. Brevity coupled with its construct and criterion validity makes the PHQ-9 an attractive, dual-purpose instrument for making diagnoses and assessing severity of depressive disorders. If the PHQ-9 proves sensitive to change in clinical trials, it could also be a useful measure for monitoring outcomes of depression therapy.