While considerable attention has focused on improving the detection of depression, assessment of severity is also important in guiding treatment decisions. Therefore, we examined the validity of a brief, new measure of depression severity.
The Patient Health Questionnaire (PHQ) is a self-administered version of the PRIME-MD diagnostic instrument for common mental disorders. The PHQ-9 is the depression module, which scores each of the 9 DSM-IV criteria as “0” (not at all) to “3” (nearly every day). The PHQ-9 was completed by 6,000 patients in 8 primary care clinics and 7 obstetrics-gynecology clinics. Construct validity was assessed using the 20-item Short-Form General Health Survey, self-reported sick days and clinic visits, and symptom-related difficulty. Criterion validity was assessed against an independent structured mental health professional (MHP) interview in a sample of 580 patients.
As PHQ-9 depression severity increased, there was a substantial decrease in functional status on all 6 SF-20 subscales. Also, symptom-related difficulty, sick days, and health care utilization increased. Using the MHP reinterview as the criterion standard, a PHQ-9 score ≥10 had a sensitivity of 88% and a specificity of 88% for major depression. PHQ-9 scores of 5, 10, 15, and 20 represented mild, moderate, moderately severe, and severe depression, respectively. Results were similar in the primary care and obstetrics-gynecology samples.
In addition to making criteria-based diagnoses of depressive disorders, the PHQ-9 is also a reliable and valid measure of depression severity. These characteristics plus its brevity make the PHQ-9 a useful clinical and research tool.
Depression is one of the most prevalent and treatable mental disorders and is regularly seen by a wide spectrum of health care providers, including mental health specialists, medical and surgical subspecialists, and primary care clinicians. There are a number of case-finding instruments for detecting depression in primary care, ranging from 2 to 28 items in length.1,2 Typically, these can be scored as continuous measures of depression severity and also have established cut points above which the probability of major depression is substantially increased. Scores on these various measures tend to be highly correlated,3 and it is not evident that any one measure is superior to the others.1,2,4
The Patient Health Questionnaire (PHQ) is a new instrument for making criteria-based diagnoses of depressive and other mental disorders commonly encountered in primary care. The diagnostic validity of the PHQ has recently been established in 2 studies involving 3,000 patients in 8 primary care clinics and 3,000 patients in 7 obstetrics-gynecology clinics.5,6 At 9 items, the PHQ depression scale (which we call the PHQ-9) is half the length of many other depression measures, has comparable sensitivity and specificity, and consists of the actual 9 criteria upon which the diagnosis of DSM-IV depressive disorders is based. The latter feature distinguishes the PHQ-9 from other “2-step” depression measures for which, when scores are high, additional questions must be asked to establish DSM-IV depressive diagnoses. The PHQ-9 has the potential of being a dual-purpose instrument that, with the same 9 items, can establish depressive disorder diagnoses as well as grade depressive symptom severity. In this paper, we analyze data regarding the PHQ-9 to address 3 major questions:
The Patient Health Questionnaire (PHQ) is a 3-page questionnaire that can be entirely self-administered by the patient.5 The clinician scans the completed questionnaire, verifies positive responses, and applies diagnostic algorithms that are abbreviated at the bottom of each page. The PHQ assesses 8 diagnoses, divided into threshold disorders (disorders that correspond to specific DSM-IV diagnoses: major depressive disorder, panic disorder, other anxiety disorder, and bulimia nervosa), and subthreshold disorders (disorders whose criteria encompass fewer symptoms than are required for any specific DSM-IV diagnoses: other depressive disorder, probable alcohol abuse/dependence, somatoform, and binge eating disorder).
The PHQ-9 (Appendix) is the 9-item depression module from the full PHQ. Major depression is diagnosed if 5 or more of the 9 depressive symptom criteria have been present at least “more than half the days” in the past 2 weeks, and 1 of the symptoms is depressed mood or anhedonia. Other depression is diagnosed if 2, 3, or 4 depressive symptoms have been present at least “more than half the days” in the past 2 weeks, and 1 of the symptoms is depressed mood or anhedonia. One of the 9 symptom criteria (“thoughts that you would be better off dead or of hurting yourself in some way”) counts if present at all, regardless of duration. As with the original PRIME-MD, before making a final diagnosis, the clinician is expected to rule out physical causes of depression, normal bereavement, and history of a manic episode.
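The symptom-count rule above can be sketched in code. This is a minimal illustration, not the authors' scoring software: the function name and the returned labels are hypothetical, and the result is provisional, because the clinician must still rule out physical causes, normal bereavement, and a history of mania.

```python
from typing import Sequence

# Response code printed on the questionnaire for "more than half the days".
MORE_THAN_HALF_THE_DAYS = 2


def phq9_provisional_diagnosis(responses: Sequence[int]) -> str:
    """Apply the symptom-count rule to nine item responses, each scored 0-3.

    The function name and return labels are illustrative assumptions.
    """
    if len(responses) != 9 or any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("expected nine responses, each scored 0 to 3")

    # Items 1-8 count when present at least "more than half the days".
    counted = [r >= MORE_THAN_HALF_THE_DAYS for r in responses[:8]]
    # Item 9 (thoughts of death or self-harm) counts if present at all.
    counted.append(responses[8] >= 1)

    # Item 1 is anhedonia and item 2 is depressed mood; one must be counted.
    cardinal_present = counted[0] or counted[1]
    n_symptoms = sum(counted)

    if cardinal_present and n_symptoms >= 5:
        return "major depression"
    if cardinal_present and 2 <= n_symptoms <= 4:
        return "other depression"
    return "no depressive disorder"
```

For example, a patient who marks items 1 through 5 as "more than half the days" and the rest as "not at all" meets the count for major depression, whereas six symptoms without depressed mood or anhedonia do not.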
As a severity measure, the PHQ-9 score can range from 0 to 27, since each of the 9 items can be scored from 0 (not at all) to 3 (nearly every day). An item was also added to the end of the diagnostic portion of the PHQ-9 asking patients who checked off any problems on the questionnaire: “How difficult have these problems made it for you to do your work, take care of things at home, or get along with other people?”
From May 1997 to November 1998, 3,890 patients, 18 years or older, were invited to participate in the PHQ Primary Care Study.5 There were 190 who declined to participate, 266 who started but did not complete the questionnaire (often because there was inadequate time before seeing their physician), and 434 whose questionnaires were not entered into the data set because the equivalent of approximately 1 page (20 items) was not completed. This resulted in the 3,000 primary care patients reported here (1,422 from 5 general internal medicine clinics and 1,578 from 3 family practice clinics). From May 1997 to March 1999, 3,636 patients, 18 years or older, were approached to participate in the PHQ Obstetrics-Gynecology (Ob-Gyn) Study.6 There were 245 patients who declined to participate, 127 who started but did not complete the questionnaire, and 264 whose questionnaires were not entered into the data set because the equivalent of approximately 1 page was not completed. This resulted in the 3,000 subjects from 7 obstetrics-gynecology (ob-gyn) sites. All sites used one of 2 subject selection methods to minimize sampling bias: either consecutive patients for a given clinic session or every nth patient until the intended quota for that session was achieved. Patient characteristics are summarized in Table 1. Besides being entirely women, the ob-gyn sample had a younger average age, more Hispanic subjects, lower average education, and less medical comorbidity.
A total of 62 physicians participated in the PHQ Primary Care Study (21 general internal medicine and 41 family practice [19 of whom were family practice residents]). Their mean age was 37 years (standard deviation [SD], 6.5), and 63% were male. A total of 40 physicians and 21 nurse practitioners participated in the PHQ Ob-Gyn Study. Their mean age was 39 years (SD, 8.9), and 48% were male.
Before seeing the physician, all patients completed the PHQ. Additionally, they completed the Medical Outcomes Study Short-Form General Health Survey (SF-20).7 The SF-20 measures functional status in 6 domains (all scores from 0 to 100; 100=best health). Also, patients estimated the number of physician visits and disability days during the past 3 months.
To determine the agreement of PHQ diagnoses with those of MHPs, midway through the PHQ Primary Care Study, a MHP (a PhD clinical psychologist or 1 of 3 senior psychiatric social workers) attempted to interview by telephone all subsequently entered subjects who had a telephone, agreed to be interviewed, and could be contacted within 48 hours. All except 1 site participated in these validation interviews. The MHP was blinded to the results of the PHQ. The rationale and further details of the MHP telephone interview, which used the overview from the SCID8 and diagnostic questions from the PRIME-MD, are described in the original PRIME-MD report.9 To examine test-retest reliability, the MHP graded the 9 PRIME-MD questions assessing DSM-IV symptoms using the same 4 response options as the PHQ-9 (i.e., not at all, several days, more than half the days, nearly every day).
The 580 subjects who had a MHP interview within 48 hours of completing the PHQ were, within each site, similar to patients not reinterviewed in terms of demographic profile, functional status, and frequency of psychiatric diagnoses. Agreement between the PHQ diagnoses and the MHP diagnoses was examined. One modification from the original PRIME-MD algorithm was necessary. The number of criteria required for diagnosing major depressive disorder could remain the same as in DSM-IV, i.e., 5 of 9 during the past 2 weeks. However, because the PHQ response set was expanded from the simple “yes/no” in the original PRIME-MD to 4 frequency levels, lowering the PHQ threshold from “nearly every day” to “more than half the days” raised the sensitivity from 37% to 73% while maintaining high specificity (94%).
For most analyses, the PHQ-9 score was divided into the following categories of increasing severity: 0–4, 5–9, 10–14, 15–19, and 20 or greater. These categories were chosen for several reasons. The first was pragmatic, in that the cut points of 5, 10, 15, and 20 are simple for clinicians to remember and apply. The second reason was empiric, in that using different cut points did not noticeably change the associations between increasing PHQ-9 severity and measures of construct validity.
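As a rough sketch of how the total score and these categories could be computed, the following uses the cut points of 5, 10, 15, and 20 and the severity labels given elsewhere in this paper. The function names are hypothetical, and the label for the 0–4 band is an assumption, since the text names only the four higher bands.

```python
from typing import Sequence

# Lower bound of each band, from the cut points 5, 10, 15, and 20.
SEVERITY_BANDS = [
    (20, "severe"),
    (15, "moderately severe"),
    (10, "moderate"),
    (5, "mild"),
    (0, "below mild threshold"),  # assumed label; the text does not name this band
]


def phq9_total(responses: Sequence[int]) -> int:
    """Sum nine item responses (each 0-3) into a total score of 0-27."""
    if len(responses) != 9 or any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("expected nine responses, each scored 0 to 3")
    return sum(responses)


def phq9_severity(total: int) -> str:
    """Map a total score to its severity band."""
    if not 0 <= total <= 27:
        raise ValueError("PHQ-9 total must be between 0 and 27")
    for lower_bound, label in SEVERITY_BANDS:
        if total >= lower_bound:
            return label
    raise AssertionError("unreachable: every valid total falls in some band")
```

Keeping the bands in a single table makes it easy to rerun an analysis with different cut points, which is the kind of check described in the second reason above.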
For analyses assessing the operating characteristics of various PHQ-9 intervals or cut points, diagnostic status (major depressive disorder, other depressive disorder, or no depressive disorder) was that assigned by the independent MHP structured psychiatric interview. The latter is considered the criterion standard and provides the most conservative estimate of the operating characteristics of the PHQ-9 score. Besides calculating sensitivity and specificity of the PHQ-9 over various intervals, we also determined likelihood ratios10 and conducted ROC curve analysis11 as quantitative methods for combining sensitivity and specificity into a single metric.
Construct validity of the PHQ-9 as a measure of depression severity was assessed by examining functional status (the 6 SF-20 scales), disability days, symptom-related difficulty, and health care utilization (clinic visits) over the 5 PHQ-9 intervals. Analysis of covariance was used, with PHQ-9 category as the independent variable and adjusting for age, gender, race, education, study site, and number of physical disorders. Bonferroni's correction was used to adjust for multiple comparisons.
The internal reliability of the PHQ-9 was excellent, with a Cronbach's α of 0.89 in the PHQ Primary Care Study and 0.86 in the PHQ Ob-Gyn Study. Test-retest reliability of the PHQ-9 was also excellent. Correlation between the PHQ-9 completed by the patient in the clinic and that administered telephonically by the MHP within 48 hours was 0.84, and the mean scores were nearly identical (5.08 vs 5.03).
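For readers unfamiliar with the statistic, Cronbach's α can be computed from item-level responses as sketched below. This is the standard textbook formula, not the authors' analysis code; the function name and the data layout (one row of nine item scores per patient) are assumptions.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a table with one row per respondent.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])  # number of items (9 for the PHQ-9)
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("need at least two items and equal-length rows")

    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Values near 1 indicate that the items vary together, which is what the reported coefficients of 0.89 and 0.86 reflect.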
In 85% of cases clinicians required less than 3 minutes to review responses on the full 3-page PHQ,5 which consists of 5 modules and 28 to 58 items (depending upon the number of skip-outs). Although time to review the PHQ depression items was not measured separately, it is unlikely this took more than a minute, since the PHQ-9 includes less than one third of the items contained in the full PHQ.
Table 2 shows the distribution of PHQ-9 scores according to depression diagnostic status in the 580 patients interviewed by a mental health professional who was blinded to the PHQ-9 results. The mean PHQ-9 score was 17.1 (SD, 6.1) in the 41 patients diagnosed by the MHP as having major depression; 10.4 (SD, 5.4) in the 65 patients diagnosed as other depressive disorder; and 3.3 (SD, 3.8) in the 474 patients with no depressive disorder. The vast majority of patients (93%) with no depressive disorder had a PHQ-9 score less than 10, while most patients (88%) with major depression had scores of 10 or greater. Scores less than 5 almost always signified the absence of a depressive disorder; scores of 5 to 9 predominantly represented patients with either no depression or subthreshold (i.e., other) depression; scores of 10 to 14 represented a spectrum of patients; and scores of 15 or greater usually indicated major depression.
Because PHQ-9 scores in the 10 to 15 range appear to represent an important “gray zone,” we conducted a more detailed examination of the operating characteristics of various cut points in this range. Table 3 displays the sensitivity, specificity, and likelihood ratios for different PHQ-9 thresholds in diagnosing major depression in the 580 patients who had a MHP interview. For example, a patient with major depression is 6 times more likely than a patient without major depression to have a PHQ-9 score of 9 or greater and 13.6 times more likely to have a score of 15 or greater. In this sample with a 7% prevalence of major depression (41 out of 580 patients), the positive predictive value for major depression ranged from 31% for a PHQ-9 cut point of 9 to 51% for a cut point of 15.
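The link between likelihood ratio, prevalence, and positive predictive value can be checked with a short calculation in odds form. The sketch below uses the likelihood ratios quoted above (the value of 6 is rounded in the text) and the sample prevalence of 41 out of 580; the function name is hypothetical.

```python
def positive_predictive_value(likelihood_ratio: float, prevalence: float) -> float:
    """Convert a positive likelihood ratio and prevalence into a predictive value.

    Posterior odds = likelihood ratio x prior odds; probability = odds / (1 + odds).
    """
    prior_odds = prevalence / (1 - prevalence)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)


prevalence = 41 / 580  # major depression in the MHP-interviewed sample

ppv_cut_9 = positive_predictive_value(6.0, prevalence)    # score of 9 or greater
ppv_cut_15 = positive_predictive_value(13.6, prevalence)  # score of 15 or greater

print(f"{ppv_cut_9:.0%}")   # 31%
print(f"{ppv_cut_15:.0%}")  # 51%
```

The same function shows why predictive value depends on the setting: with an identical likelihood ratio, a clinic with a higher prevalence of major depression would see a higher positive predictive value.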
Examination of likelihood ratios further confirmed the substantial association between increasing PHQ-9 scores and the likelihood of major depression. The positive likelihood ratios of PHQ-9 scores of 0–4, 5–9, 10–14, 15–19, and 20–27 for major depression were 0.04, 0.5, 2.6, 8.4, and 36.8, respectively. These likelihood ratios mean that, for example, a PHQ-9 score in the 0–4 range is only 0.04 (i.e., 1/25) times as likely in a patient with major depression as in a patient without major depression, while a score of 10 to 14 is 2.6 times as likely and a score of 15 to 19 is 8.4 times as likely. The positive likelihood ratios of these same 5 PHQ-9 intervals for any depression (i.e., major or other depressive disorder) were 0.12, 1.3, 4.9, 15.7, and 38.0, respectively.
ROC analysis showed that the area under the curve for the PHQ-9 in diagnosing major depression was 0.95, suggesting a test that discriminates well between persons with and without major depression. The area under the curve for the 5-item mental health scale of the SF-20 was 0.93.
As shown in Table 4, there was a strong association between increasing PHQ-9 depression severity scores and worsening function on all 6 SF-20 scales. Several findings should be noted. First, results were essentially the same for both the primary care and obstetrics-gynecology samples. Second, the monotonic decrease in SF-20 scores with increasing PHQ-9 scores was greatest for the scales that previous studies have shown should be most strongly related to depression, i.e., mental health, followed by social, overall, and role functioning, with a lesser relationship to pain and physical functioning.12 Third, most pairwise comparisons within each SF-20 scale between successive PHQ-9 levels were highly significant.
Figure 1 illustrates graphically the relationship between increasing PHQ-9 scores and worsening functional status. Decrements in SF-20 scores are shown in terms of effect size, which is the difference in mean SF-20 scores, expressed as the number of standard deviations, between each PHQ-9 interval subgroup and the reference group. The reference group is the group with the lowest PHQ-9 scores (i.e., 0–4), and the standard deviation used is that of the entire sample. Effect sizes of 0.5 and 0.8 are typically considered moderate and large between-group differences, respectively.13 Figure 1 shows effect sizes for the primary care sample; results for the obstetrics-gynecology sample (not displayed) were similar.
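The effect size described above can be sketched as follows. The function and parameter names are hypothetical. Two choices are assumptions not stated in the text: the sign convention (positive values mean lower SF-20 scores than the reference group) and the use of the sample rather than population standard deviation.

```python
from statistics import mean, stdev
from typing import Sequence


def sf20_effect_size(
    group_scores: Sequence[float],
    reference_scores: Sequence[float],
    all_scores: Sequence[float],
) -> float:
    """Decrement in mean SF-20 score relative to the reference group, in SD units.

    group_scores:     SF-20 scale scores for one PHQ-9 interval (e.g., 10-14)
    reference_scores: scores for the lowest PHQ-9 interval (0-4)
    all_scores:       scores for the entire sample, used for the standard deviation
    """
    difference = mean(reference_scores) - mean(group_scores)
    return difference / stdev(all_scores)
```

Under this convention, a value of 0.5 would correspond to a moderate decrement and 0.8 to a large one.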
When the PHQ-9 was examined as a continuous variable, its strength of association with the SF-20 scales was concordant with the pattern seen in Figure 1. The PHQ-9 correlated most strongly with mental health (0.73), followed by general health perceptions (0.55), social functioning (0.52), role functioning (0.43), physical functioning (0.37), and bodily pain (0.33).
Table 5 shows the association between PHQ-9 severity levels and 3 other measures of construct validity: self-reported disability days, clinic visits, and the general amount of difficulty patients attribute to their symptoms. Greater levels of depression severity were associated with a monotonic increase in disability days, health-care utilization, and symptom-related difficulty in activities and relationships. When the PHQ-9 was examined as a continuous variable, its correlation was 0.39 with disability days, 0.24 with physician visits, and 0.55 with symptom-related difficulty.
Because our sample was relatively young and disproportionately female, we examined the influence of age and gender in several ways. First, simple correlations between PHQ-9 score and measures of construct validity were similar when examined separately for women and men, while correlations were somewhat lower but still highly significant in patients 65 years and older compared to younger individuals. Second, analysis of covariance results showed age had an independent and weak effect on only one outcome (SF-20 physical functioning), while gender had no independent effect.
The single item assessing difficulty that the patients attributed to their depressive symptoms correlated strongly with impairment as measured by the SF-20 subscales, particularly those domains known to be most affected by mental disorders. Correlations of the single symptom-related difficulty item with the SF-20 scales in the primary care sample were 0.53 for mental health, 0.42 for general health perceptions, 0.40 for social functioning, 0.38 for role functioning, 0.27 for bodily pain, and 0.27 for physical functioning. Although slightly lower in the obstetrics-gynecology sample, correlations showed a similar rank order.
Data from our 2 studies totaling 6,000 patients provide strong evidence for the validity of the PHQ-9 as a brief measure of depression severity. Criterion validity was demonstrated in the sample of 580 primary care patients who underwent an independent reinterview by a mental health professional. Construct validity was established by the strong association between PHQ-9 scores and functional status, disability days, and symptom-related difficulty. External validity was achieved by replicating the findings from the 3,000 primary care patients in a second sample of 3,000 obstetrics-gynecology patients. Indeed, the similar results seen in rather different patient populations suggest our PHQ-9 findings may be generalizable to outpatients seen in a variety of clinic settings.
Our analysis of the full range of PHQ-9 scores complements rather than supersedes the validated PHQ-9 algorithm for establishing categorical diagnoses. However, as the PHQ-9 is increasingly used as a continuous measure of depression severity, it will be helpful to know the probability of a major or subthreshold depressive disorder at various cut points. PHQ-9 scores of 5, 10, 15, and 20 represent valid and easy-to-remember thresholds demarcating the lower limits of mild, moderate, moderately severe, and severe depression. In particular, scores less than 10 seldom occur in individuals with major depression, while scores of 15 or greater usually signify the presence of major depression. In the “gray zone” of 10 to 14, increasing PHQ-9 scores are associated, as expected, with increasing specificity and declining sensitivity. However, the operating characteristics of the PHQ-9 displayed at various cut points in Table 2 compare favorably to 9 other case-finding instruments for depression in primary care, which have an overall sensitivity of 84%, a specificity of 72%, and a positive likelihood ratio of 2.86.1 Likewise, the positive predictive value of the PHQ-9 (ranging from 31% to 51% depending upon the cut point) is similar to other instruments; of note, predictive value is related not only to a measure's sensitivity and specificity but also to the prevalence of depressive disorders.
The one depression measure that was used concurrently with the PHQ-9 in our subjects was the 5-item mental health scale of the SF-20, also known as the Mental Health Inventory (MHI-5). PHQ-9 scores were strongly correlated with MHI-5 scores in our subjects (Table 4 and Figure 1). Berwick et al. used ROC analysis to determine how well the MHI-5 and several other measures discriminated between patients with and without major depression.14 In their study, the area under the curve (AUC) was 0.89 for the MHI-5, 0.90 for the longer MHI-18, 0.89 for the 30-item General Health Questionnaire, and 0.80 for the 28-item Somatic Symptom Inventory. In our study, the AUC for major depression was 0.95 for the PHQ-9 and 0.93 for the MHI-5. It is unlikely that other depression-specific measures would be significantly better than the PHQ-9 since an AUC of 1.0 represents a perfect test.
A particularly important characteristic of a severity measure is its sensitivity to change over time. In other words, how precisely do declining or rising scores on the measure reflect improving or worsening depression in response to effective therapy or natural history? An exhaustive review of depression measures is beyond the scope of this paper and can be found elsewhere,4,12 but a brief discussion of selected measures is warranted. The Hamilton Rating Scale for Depression (HAM-D) has been the criterion standard outcome measure in clinical trials, but it can require 15 to 30 minutes of clinician time to administer and is therefore not feasible in many practice settings. The HAM-D is also rather complicated to score and requires substantial training in order to get reasonable inter-rater agreement. The Montgomery-Asberg Depression Rating Scale is about half as long as the HAM-D and probably just as sensitive to change.15,16 Like the HAM-D, however, the Montgomery-Asberg scale must be administered by a clinician with special training and still is moderately time intensive. Several self-administered scales (the 21-item Beck Depression Inventory and the 20-item Zung Self-Rating Depression Scale) also have been used as outcome measures but may be somewhat less sensitive to change than the HAM-D.17 The SCL-20 has been used as an outcome measure in primary care clinical trials,18–20 although published evidence on its sensitivity to change as well as other psychometric characteristics is limited. Epidemiological and clinical studies have established the 20-item CES-D as a valid measure for identifying depression, but there is less information regarding its sensitivity to change.
In summary, there appear to be many comparable measures for identifying depression,1,2,4,12 including a number of self-administered scales. In contrast, it is less clear what the optimal measure for monitoring response to treatment may be, especially outside the setting of a clinical trial. Sensitivity to change is clearly a necessary feature, but other pragmatic considerations include the number of items, time required for completion, mode of administration (self-rating vs interviewer-administered scale), complexity of scoring, inter-rater agreement, and special training requirements. The specific items included in the scale are another factor. One advantage of the PHQ-9 is its exclusive focus on the 9 diagnostic criteria for DSM-IV depressive disorders. On the other hand, some may argue that instruments including symptoms not in the DSM-IV criteria (e.g., loneliness, hopelessness, and anxiety) may have additional value to the clinician. At the same time, it is possible that such scales are less specific for major depression and other mood disorders and may discriminate depression less accurately from anxiety or even general psychological distress.
The major limitation of our study is its cross-sectional design. While our large sample establishes the construct and criterion validity of the PHQ-9, longitudinal studies are needed to establish its sensitivity to change. This will require the completion of several large ongoing clinical trials using the PHQ-9 in parallel with the HAM-D or other established outcome measures. It will also be useful to define the threshold that represents an adequate clinical response. A preliminary approach would be to consider a PHQ-9 score less than 10 and a 50% decline from the pretreatment score as clinically significant improvement. While any proposed threshold requires prospective verification, this approach would be consistent with that established for the HAM-D. Other study limitations are that validation was based on telephone rather than face-to-face interviews and the time for patients to complete the PHQ-9 was not determined.
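The preliminary response criterion proposed above can be written as a simple check. This is only a sketch of an unverified threshold: the function name is hypothetical, and reading "a 50% decline" as "a decline of at least 50%" is an assumption.

```python
def is_clinically_significant_improvement(pretreatment: int, followup: int) -> bool:
    """Preliminary criterion: follow-up score below 10 AND at least a 50% decline.

    Both arguments are PHQ-9 total scores (0-27). Treating "a 50% decline"
    as "at least 50%" is an assumption, as is returning False when the
    pretreatment score is 0 (no decline is possible).
    """
    for score in (pretreatment, followup):
        if not 0 <= score <= 27:
            raise ValueError("PHQ-9 total must be between 0 and 27")
    if pretreatment == 0:
        return False
    below_threshold = followup < 10
    halved = followup <= 0.5 * pretreatment
    return below_threshold and halved
```

For instance, a drop from 20 to 8 satisfies both conditions, whereas a drop from 12 to 9 falls below 10 but is not a 50% decline.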
Detecting depression and initiating treatment are necessary but often insufficient steps to improve outcomes in primary care.21 Monitoring clinical response to therapy is also critical. Multiple studies have shown that monitoring is often inadequate, resulting in clinician failure to detect medication noncompliance, increase the antidepressant dosage, change or augment pharmacotherapy, or add psychotherapy as needed.21,22 Having a simple self-administered measure to complete either in the clinic or by telephone administration (e.g., nurse administration23 or interactive voice recording24) would save clinicians the time needed to inquire about the presence and severity of each of the 9 DSM-IV symptoms to assess outcomes.
Brief measures are more likely to be used in the busy setting of clinical practice. For example, many practitioners have found it more feasible to use the 4-item CAGE questionnaire than a number of longer alcohol screening measures. Of note, as few as 1 or 2 questions have demonstrated a high sensitivity in screening for major depression.2,25 Brevity is just as likely to be a valued attribute when it comes to assessing depression severity as it is when establishing depressive diagnoses. Brevity coupled with its construct and criterion validity makes the PHQ-9 an attractive, dual-purpose instrument for making diagnoses and assessing severity of depressive disorders. If the PHQ-9 proves sensitive to change in clinical trials, it could also be a useful measure for monitoring outcomes of depression therapy.
The development of the PHQ-9 was underwritten by an educational grant from Pfizer US Pharmaceuticals, New York, NY. PRIME-MD is a trademark of Pfizer. Copyright held by Pfizer.
|Name ______________________ Date _________|
|Over the last 2 weeks, how often have you been bothered by any of the following problems?||Not at all||Several days||More than half the days||Nearly every day|
|1. Little interest or pleasure in doing things||0||1||2||3|
|2. Feeling down, depressed, or hopeless||0||1||2||3|
|3. Trouble falling or staying asleep, or sleeping too much||0||1||2||3|
|4. Feeling tired or having little energy||0||1||2||3|
|5. Poor appetite or overeating||0||1||2||3|
|6. Feeling bad about yourself—or that you are a failure or have let yourself or your family down||0||1||2||3|
|7. Trouble concentrating on things, such as reading the newspaper or watching television||0||1||2||3|
|8. Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual||0||1||2||3|
|9. Thoughts that you would be better off dead or of hurting yourself in some way||0||1||2||3|
|(For office coding: Total Score ____ = ____ + ____ + ____)|
If you checked off any problems, how difficult have these problems made it for you to do your work, take care of things at home, or get along with other people?
|Not difficult at all||Somewhat difficult||Very difficult||Extremely difficult|
From the Primary Care Evaluation of Mental Disorders Patient Health Questionnaire (PRIME-MD PHQ). The PHQ was developed by Drs. Robert L. Spitzer, Janet BW Williams, Kurt Kroenke, and colleagues. For research information, contact Dr. Spitzer at rls8/at/columbia.edu. PRIME-MD is a trademark of Pfizer Inc. Copyright 1999 Pfizer Inc. All rights reserved. Reproduced with permission.