Objectives To investigate the proportion of original studies included in systematic reviews and meta-analyses on the diagnostic accuracy of screening tools for depression that appropriately exclude patients who already have a diagnosis of or are receiving treatment for depression and to determine whether these systematic reviews and meta-analyses evaluate possible bias from the inclusion of such patients.
Design Systematic review.
Data sources Medline, PsycINFO, CINAHL, Embase, ISI, SCOPUS, and Cochrane databases were searched from 1 January 2005 to 29 October 2009.
Eligibility criteria for selecting studies Systematic reviews and meta-analyses in any language that reported on the diagnostic accuracy of screening tools for depression.
Results Only eight of 197 (4%) unique publications from 17 systematic reviews and meta-analyses specifically excluded patients who already had a diagnosis of or were receiving treatment for depression. No systematic reviews or meta-analyses commented on possible bias from the inclusion of such patients, even though 10 reviews used quality assessment tools with items to rate risk of bias from composition of the sample of patients.
Conclusions Studies of the accuracy of screening tools for depression rarely exclude patients who already have a diagnosis of or are receiving treatment for depression, a potential bias that is not evaluated in systematic reviews and meta-analyses. This could result in inflated estimates of accuracy on which clinical practice and preventive care guidelines are often based, a problem that takes on greater importance as the rate of diagnosed and treated depression in the population increases.
Depression is a common and disabling condition,1 and improving care has been prioritised. Routine screening for depression is one solution that has been proposed. Depression screening involves the use of screening tools to identify patients who might have depression but who are not seeking treatment for symptoms and whose depression is not otherwise recognised by their physicians so that they can be further assessed and, if appropriate, treated.2 3 Screening for depression has been recommended in several medical settings, including cardiovascular care,4 perinatal care,5 6 7 oncological care,8 and primary care,9 although no clinical trial has found better depression outcomes for screened versus unscreened patients when the same treatment and care resources are potentially available to both groups.10 11 Screening for depression can identify patients with depression who might otherwise go undetected, but it can also lead to misdiagnosis, the identification of patients as being depressed who are not, and overdiagnosis, which occurs when some patients with mild conditions are identified as depressed and exposed to the risk of labelling and treatment, even when the condition might not cause measurable morbidity or mortality. Recently, a report from the National Institute for Health and Clinical Excellence (NICE)11 noted a lack of evidence for benefit from depression screening and, rather than routine screening, recommended case identification strategies to identify depression among high risk groups of patients or patients otherwise identified by physicians as possibly having depression.
A great deal of research has been conducted to determine the diagnostic accuracy of depression screening tests in different clinical settings. Based on data from such studies, expert panels have considered the risks and benefits of screening and issued recommendations to screen for depression in various settings.9 11 Diagnostic or screening tests, however, are useful only to the extent that they distinguish between disordered and non-disordered states that are not otherwise obvious to clinicians12 and if they are accurate across the spectrum of patients who will be assessed in clinical practice.12 13 14 15 16 17 18
The term “spectrum effect” has been used to describe variations in test performance that sometimes occur across subgroups of patients that differ in demographic or clinical features. Spectrum effects raise questions about the generalisability of study results to specific populations of patients that might differ in important ways from study samples.19 The term “spectrum bias” is related and also describes situations in which the accuracy of a test is heterogeneous across subgroups of patients. Spectrum bias is said to be present when a study samples preferentially from certain portions of the patient spectrum but provides a global estimate of accuracy that could misrepresent what would be experienced in actual practice.12 13 14 15 16 17 18 19 Estimates of diagnostic accuracy that are based on case-control designs and whose samples include only obvious cases and healthy controls, for instance, have been shown to substantially overestimate diagnostic accuracy.13 14 18
Self reported depression questionnaires are used for various purposes (such as screening for unidentified cases, tracking severity of symptoms, detecting relapse). For the purpose of screening, which involves the identification of cases not previously recognised, if individuals who already have a diagnosis of depression are not specifically excluded from studies assessing the diagnostic accuracy of depression screening tools, examined cohorts will have a greater prevalence and severity of depression than if only individuals without clinically recognised depression were screened. Not excluding patients who already have a diagnosis would, in turn, lead to determinations of screening accuracy and new case yield that are inflated compared with what would be achieved if the instrument were used to screen patients in clinical practice.12 13 14 15 16 17 18
Systematic reviews and meta-analyses are highly cited and are prioritised in grading evidence for practice guidelines.20 21 If studies of the diagnostic accuracy of depression screening tools that include patients who already have a diagnosis or are receiving treatment are included in systematic reviews and meta-analyses without adjustment for potential bias, these reviews could provide misleading accuracy estimates, thereby distorting calculations of risk-benefit by expert panels and, in turn, by clinicians.
We evaluated the proportion of studies included in systematic reviews and meta-analyses of the diagnostic accuracy of depression screening tools that excluded patients who already had a diagnosis of or were receiving treatment for depression. We also assessed whether authors of systematic reviews and meta-analyses noted the possibility of spectrum bias from the inclusion of such patients in the original research studies they reviewed. We hypothesised that few studies of depression screening tools would exclude such patients and that systematic reviews and meta-analyses would not consider spectrum bias from their inclusion.
We searched Medline, PsycINFO, CINAHL, Embase, ISI, SCOPUS, and Cochrane databases from 1 January 2005 to 29 October 2009 for systematic reviews and meta-analyses of the diagnostic accuracy of depression screening tools. We restricted the search to this period to obtain recent systematic reviews and meta-analyses that reflect relatively current practice. The search terms used were ((systematic review OR meta-analysis) AND (screening OR sensitivity OR specificity) AND depression). Eligible articles included systematic reviews and meta-analyses in any language published in final form or on the internet before final publication that reviewed the accuracy of screening tools for depression compared with a diagnosis of depression. Depression screening tools included any self report measure used to attempt to identify patients with depression. We included systematic reviews and meta-analyses that reviewed diagnostic accuracy and other psychometric characteristics of depression questionnaires (such as validity and reliability) but extracted data only on diagnostic accuracy. We excluded systematic reviews and meta-analyses that compared scores only on self report screening tools with classifications of depression based on cut offs from other self report screening tools but not a diagnosis of depression. Two investigators reviewed systematic reviews and meta-analyses for eligibility independently. If either reviewer deemed a systematic review or meta-analysis potentially eligible based on a review of the title and abstract, we carried out a full text review of the systematic review or meta-analysis. Any disagreement between reviewers after full text review was resolved by consensus after consultation with an independent third reviewer. Chance corrected agreement between reviewers was assessed with Cohen’s κ.
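As an illustration of the chance corrected agreement statistic, Cohen's κ compares the observed proportion of agreement between two reviewers with the agreement expected by chance from each reviewer's marginal rates: κ = (p_o − p_e) / (1 − p_e). The sketch below is a minimal Python version. The function name and the ten include/exclude decisions are hypothetical and are not data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each label the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten abstracts (illustration only).
reviewer_1 = ["include", "include", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

In this toy example the reviewers agree on nine of ten abstracts, yet κ is noticeably lower than 0.9 because most of that agreement would be expected by chance when both reviewers exclude most abstracts.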
Two investigators independently extracted and entered on a standardised spreadsheet data items from the systematic reviews and meta-analyses, as well as from the original studies included in the reviews, with discrepancies resolved by consensus. For each systematic review or meta-analysis, they recorded whether or not original studies mentioned possible bias because of the inclusion of patients who already had a diagnosis of or were receiving treatment for depression. Investigators also determined whether or not each systematic review or meta-analysis included an assessment of the quality of included diagnostic accuracy studies. If so, they recorded the tool that was used to do this and whether or not the tool included an evaluation of the risk of spectrum bias. Investigators also recorded the impact factor of the journal in which each systematic review or meta-analysis was published, using the impact factor for the year of publication.22 In addition, they reviewed the introduction and discussion sections and recorded the described purpose for which accuracy of the screening tool was being assessed (such as screening or identification of new cases, monitoring progress of treatment, detection of relapse).
Original diagnostic accuracy studies included in the systematic reviews and meta-analyses were classified as having excluded patients who already had a diagnosis of or were receiving treatment for depression if the authors of the study specifically indicated this in the exclusion criteria. If studies did not specifically indicate that such patients were excluded they were classified as having included them.
For each systematic review or meta-analysis, and overall, we determined the number of unique publications on the diagnostic accuracy of depression screening tools, as well as the number of unique cohorts of patients. We assessed the number of publications and the number of cohorts because, in some cases, there were multiple publications from the same cohort. This occurred, for instance, when different publications reported results from different screening tools or criterion standards with the same group of patients, when one or more publications reported on a subset of the sample from another publication, or when the same patients were assessed at different time points (such as during pregnancy and after delivery). Identification of different publications from the same cohort was done by cross referencing authors and coauthors, characteristics of patients, and countries in which the research was conducted. Verification was done by comparing information in the publications. Cohort status was coded conservatively in that publications that seemed to be from the same cohort were coded as such, even if this could not be confirmed with 100% certainty.
We did not publish or register a review protocol for this study. All methods were determined a priori with the exception of reviewing the introduction and discussion sections to record the described purpose for which the accuracy of depression screening tool was being assessed. This additional step was added to the study methods after data extraction and tabulation of results to clarify whether the intention of the included systematic reviews and meta-analyses was to assess diagnostic accuracy for identification of new cases versus other possible uses of depression symptom questionnaires.
The electronic database search yielded 1216 unique titles and abstracts for review. Of these, 1160 were excluded after review of titles and abstracts because they did not report results from a systematic review or meta-analysis or because they reported data from a systematic review or meta-analysis that was not related to the diagnostic accuracy of a depression screening tool. Of the 56 articles that underwent full text review, we excluded 39, leaving 17 eligible systematic reviews and meta-analyses (figure). Chance corrected agreement on inclusion and exclusion decisions between reviewers, as assessed with Cohen’s κ, was 0.95.
Table 1 shows the characteristics of selected systematic reviews and meta-analyses. Of the 17 systematic reviews and meta-analyses included, 10 were systematic reviews,23 24 25 26 27 28 29 30 31 32 and seven were meta-analyses.33 34 35 36 37 38 39 The systematic reviews and meta-analyses included between two and 63 original studies and were published in a wide range of journals in terms of impact factor. Two meta-analyses assessed the nine item depression scale of the patient health questionnaire (PHQ-9)33 39; one systematic review23 and two meta-analyses37 38 evaluated the geriatric depression scale; seven systematic reviews24 26 27 29 30 31 32 and one meta-analysis36 assessed depression screening tools, generally, in defined medical populations; two systematic reviews assessed specific screening tools, other than the patient health questionnaire or geriatric depression scale, in defined patient populations25 28; and two meta-analyses assessed brief screening tools (for example, fewer than five items) in primary care34 and palliative care.35 All 17 systematic reviews and meta-analyses described the purpose of the review as related to determining diagnostic accuracy for new case detection by screening, and none discussed how their results might apply to other uses of depression screening tools (such as monitoring progress of treatment, detection of relapse).
The 17 systematic reviews and meta-analyses included a total of 197 unique publications on the diagnostic accuracy of screening tools for depression in 170 unique cohorts of patients. The diagnostic accuracy studies examined more than 25 different screening tools in a wide range of patients (see appendix 1 on bmj.com). Only eight of 197 unique publications (4%) and eight of 170 cohorts (5%) specifically excluded patients who already had a diagnosis of or were receiving treatment for depression (see appendix 1). As shown in table 1, 11 of the 17 systematic reviews or meta-analyses23 26 27 30 31 32 33 35 37 38 39 did not examine a single cohort of patients that specifically excluded those who already had a diagnosis of or were receiving treatment for depression.
Table 2 shows that only four of the eight studies that excluded such patients reported the number of patients who were excluded because of pre-existing mental health treatment.40 41 42 43 The proportion of patients excluded for this reason was 22% in a Veterans Affairs primary care setting in the United States (published in 2004)43; 10% in a 2003 study of patients in general practice from New Zealand42; 2% in a 2004 study of postpartum women from Turkey40; and 0.2% in a 1996 study of postpartum women from Sweden.41
As shown in table 1, 13 of the 17 systematic reviews and meta-analyses23 24 25 27 30 31 32 33 34 35 36 38 39 conducted some form of quality assessment of included studies, including two meta-analyses36 39 that used the quality assessment for diagnostic accuracy studies (QUADAS) tool44; one systematic review27 that used the diagnostic test studies evaluation tool45; one meta-analysis34 that used the Newcastle-Ottawa scale46; two systematic reviews30 32 that used methods developed by the US Preventive Services Task Force (USPSTF)47 48; one systematic review31 that based quality review on guidelines from the American Academy of Neurology49; one systematic review25 that evaluated quality items based on a system from the York Centre for Reviews and Dissemination50; one systematic review24 that used a study specific tool based on criteria identified by the Cochrane Methods Working Group on Systematic Review of Screening and Diagnostic Tests51; one meta-analysis35 that based quality ratings on a published article by Pai et al52; and one systematic review23 and two meta-analyses33 38 that used ad hoc procedures, such as extracting data on one to two items related to study quality.
Of these, 10 systematic reviews or meta-analyses24 25 27 30 31 32 34 35 36 39 used quality assessment methods that included an assessment of spectrum bias. The authors of one of these systematic reviews24 noted study limitations from the lack of non-white patients, and the authors of another32 reported that younger children were poorly represented in studies of children and adolescents. The authors of one meta-analysis reported that half of studies reviewed did not include representative samples but did not provide a rationale for this conclusion.36 The authors of another noted the possibility of a “disease progression bias” in one study of patients after stroke and indicated that none of the other 11 studies reviewed had limitations related to composition of patients.39 In one systematic review, one of four included studies was downgraded because of the description of the sample, but an explanation was not provided.27 The authors of the five other systematic reviews or meta-analyses that used quality assessment methods that included an assessment of spectrum bias did not comment specifically on quality ratings related to possible spectrum bias.21 26 27 30 31
Overall, none of the 17 systematic reviews or meta-analyses commented on possible spectrum bias from the inclusion in studies of patients who already had a diagnosis of or were receiving treatment for depression.
We found that less than 5% of studies on the diagnostic accuracy of depression screening tools appropriately excluded patients who already had a diagnosis of or were receiving treatment for depression. The importance of this finding relates to the potential effect on assessments of the accuracy of depression screening instruments and the number of new cases they will uncover and, therefore, on their utility in clinical practice. The diagnostic accuracy of a screening test is often considered a fixed characteristic of a test, but it can vary substantially in populations with different clinical features.16 Studies that have examined accuracy of diagnostic tests consistently show that increased prevalence or severity of disease in the cohort of patients being examined inflates the reported sensitivity of the test being assessed.14 If the accuracy of screening tools for depression was studied in a group of patients, some of whom had already received a diagnosis for the condition, the assessments would be biased by the inclusion of individuals with a greater prevalence and severity of depression than if the instruments were used in clinical practice to screen patients without clinically recognised depression. This would, in turn, lead to inflated, and potentially misleading, estimates of accuracy on which clinical practice and preventive care guidelines are generally based.
The potential magnitude of this problem grows as the prevalence of already diagnosed and treated depression in the population increases.53 54 Estimates of the prevalence of depression in primary care range from 5% to 13%, including 6% to 9% among adults aged 55 or older.55 Rates are somewhat higher in patients with chronic physical illness.1 Among adults aged 35 and older in the US, rates of antidepressant use increased from 8% to 14% from 1996 to 2005, with a third to a half of prescriptions specifically for psychiatric problems.53 Rates of prescriptions for antidepressants might be even higher among patients with chronic physical disease. Based on provincial data from Ontario, Canada, for instance, the rate of antidepressant prescriptions within six months of an acute myocardial infarction doubled from 8% in 1993 to 16% in 2002 among patients aged 65 and older.56 In a more recent cohort of more than 1200 outpatients with stable cardiovascular disease, just under 20% were treated with an antidepressant at the time of enrolment in the study.57 58 In addition to patients who receive treatment with antidepressants, a relatively small percentage of people receive psychotherapy for depression without drug treatment,59 and some people are recognised by their physicians as depressed but choose not to undergo treatment.
A recent meta-analysis found that general practitioners correctly identify about 50% of patients with depression without the assistance of a screening tool.60 Dichotomising a doctor’s identification or non-identification of depressive disorders, however, could underestimate the degree to which they recognise depression. A study of over 700 patients in primary care from the US and the Netherlands, for instance, found that complete disagreement between physicians’ assessments and a diagnostic interview for depression was much less common than is often thought.61 In that study, only 27% of false negative cases based on physician assessments were true false negatives. In most cases of false negatives, physicians recognised symptoms of depression but underestimated severity compared with the diagnostic interview (40%) or gave another psychiatric diagnosis (33%). Thus, in many settings, a substantial proportion of depressed patients are recognised as depressed without screening, either because they seek treatment for their depression or because a healthcare professional otherwise recognises their symptoms. Based on reported rates of prescriptions for antidepressants and estimates of physicians’ ability to recognise depression, it could be that as many as half or more of patients who are detected as cases in studies assessing the diagnostic accuracy of screening tools would not even be screened in clinical practice.
Data are not available that would allow a precise calculation of the degree by which studies that fail to exclude patients who already have a diagnosis of or are receiving treatment for depression might overestimate diagnostic accuracy and the number of new patients who would be identified through depression screening. Two reviews, however, have reported that studies of other types of diagnostic tests that have used case-control designs13 or case-control designs that compared severely affected patients and healthy controls18 substantially overestimate diagnostic accuracy (relative diagnostic odds ratios 3.013 and 4.9,18 respectively).
Even a relatively small increase in reported diagnostic accuracy resulting from the inclusion of patients who already have a diagnosis or are receiving treatment would result in a substantial overestimate of the positive predictive value and new case yield from depression screening compared with what would be expected in clinical practice. A systematic review of the diagnostic accuracy of depression screening tools in primary care found a median sensitivity of 85% and median specificity of 74%.62 Based on this, in a primary care setting with a prevalence rate of 10%,55 32% of all patients would screen positive for depression, of whom 27% would be true positive cases, equivalent to 9% of all patients screened. If existing studies overestimated the sensitivity by even 10% because of the inclusion of patients with a diagnosis or being treated (relative diagnostic odds ratio 1.9), and it is conservatively assumed that physicians recognise 50% of depressed patients without screening, the rate of screening with positive results would decrease only slightly, from 32% to 27%. Only 14% of these, however, would be true positives, and, overall, less than 4% of patients screened would be newly identified cases of depression (see appendix 2 on bmj.com).
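The arithmetic in the preceding paragraph can be retraced with a short Python sketch. It is an illustration under stated assumptions, not a reproduction of the calculation in appendix 2. It assumes that "overestimated the sensitivity by even 10%" means 10 percentage points (85% to 75%), which is the reading consistent with the stated relative diagnostic odds ratio of 1.9, that the corrected sensitivity applies to unrecognised cases, and that screen positive rates are expressed per patient in the original population. The function names and the cohort size of 1000 are hypothetical.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """Odds of a positive test in cases divided by the odds in non-cases."""
    return (sensitivity / (1 - sensitivity)) * (specificity / (1 - specificity))


def screening_yield(n_patients, prevalence, sensitivity, specificity,
                    recognised_fraction=0.0):
    """Expected results of screening one cohort.

    recognised_fraction is the share of depressed patients already
    recognised without screening; they are removed before screening.
    """
    cases = n_patients * prevalence
    non_cases = n_patients - cases
    unrecognised_cases = cases * (1 - recognised_fraction)
    screened = non_cases + unrecognised_cases
    true_positives = unrecognised_cases * sensitivity
    false_positives = non_cases * (1 - specificity)
    positives = true_positives + false_positives
    return {
        "screened": screened,
        "positives_per_patient": positives / n_patients,
        "ppv": true_positives / positives,
        "new_cases_per_patient": true_positives / n_patients,
        "new_cases_per_screened": true_positives / screened,
    }


# Scenario 1: median accuracy as reported, nobody excluded before screening.
reported = screening_yield(1000, 0.10, 0.85, 0.74)

# Scenario 2 (assumed reading): sensitivity 10 points lower, and half of
# depressed patients already recognised and therefore not screened.
adjusted = screening_yield(1000, 0.10, 0.75, 0.74, recognised_fraction=0.5)

relative_dor = diagnostic_odds_ratio(0.85, 0.74) / diagnostic_odds_ratio(0.75, 0.74)

for name, result in (("reported", reported), ("adjusted", adjusted)):
    print(f"{name}: positives {result['positives_per_patient']:.1%}, "
          f"PPV {result['ppv']:.1%}, "
          f"new cases {result['new_cases_per_patient']:.1%}")
print(f"relative diagnostic odds ratio: {relative_dor:.1f}")
```

Under these assumptions the sketch gives figures in line with those quoted in the text: roughly 32% screening positive with a positive predictive value near 27% in the first scenario, and roughly 27% screening positive with a positive predictive value near 14% in the second.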
We know of only one study, which was not included in any of the systematic reviews or meta-analyses that we reviewed, that assessed the yield of screening for depression with and without excluding patients with psychiatric disorders already treated with psychotropic drugs.63 In that study of 113 women with breast cancer, the true positive rate of screening for depression fell from 21% to 7% after exclusion of patients who were already receiving treatment for depression before screening.
Our results should be considered in the context of studies that have assessed whether screening for depression benefits patients. There are at least 11 trials in primary care,10 as well as trials in perinatal care,64 65 and cancer care,66 that have tested whether screening and referral for depression treatment improves depression outcomes, and all have had negative results. Reflecting this, the US Preventive Services Task Force recommends screening for depression only when it is supported by integrated staff assisted depression management programmes.9 To our knowledge, only one published research study has documented an attempt to screen and provide collaborative care, as recommended by the task force, in a clinical setting.67 In that study, from the Netherlands, 1687 high risk patients were invited to enrol in a screening trial, 780 participated, and 71 cases of major depression were detected. Of the 71 patients identified, 36 were already receiving treatment for depression and 18 additional patients refused treatment or did not attend their scheduled appointment. Thus, only 17 people of 1687 potentially screened started treatment for depression.
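The patient flow reported for the study from the Netherlands can be checked with a few lines of Python. The counts are those given in the paragraph above; the variable names are illustrative only.

```python
# Counts as reported for the screening trial from the Netherlands.
invited = 1687
participated = 780
detected_major_depression = 71
already_in_treatment = 36
refused_or_did_not_attend = 18

# Patients who started treatment as a result of screening.
started_treatment = (detected_major_depression
                     - already_in_treatment
                     - refused_or_did_not_attend)

share_of_invited = started_treatment / invited
share_of_participants = started_treatment / participated

print(f"started treatment: {started_treatment}")
print(f"share of invited patients: {share_of_invited:.1%}")
print(f"share of participants: {share_of_participants:.1%}")
```

About half of the detected cases (36 of 71) were already receiving treatment, which is the group that a screening accuracy study would need to exclude to estimate new case yield.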
One possible limitation of the current study is that we searched for systematic reviews and meta-analyses, rather than for original studies, and there are probably many original studies on the diagnostic accuracy of depression screening tools that were not included. Our purpose, however, was to assess whether original studies appropriately excluded patients who already had a diagnosis or were receiving treatment and to determine whether systematic reviews and meta-analyses reflected potential bias from the failure to do this, which required a review of reviews. It is unlikely that including additional studies that were not listed in recent systematic reviews or meta-analyses would have substantively altered the results.
Another potential limitation is that the proportion of patients who already had a diagnosis of or were receiving treatment for depression who were inappropriately included in the diagnostic accuracy studies reviewed is unknown. Only four of the studies that excluded such patients reported the proportion excluded for this reason, and this varied widely depending on the setting and the time period of the study. It was less than 2% in studies that collected data from 10 years ago in Turkey40 and more than 15 years ago in Sweden,41 but about 10% in a 2003 study of patients in general practice from New Zealand42 and just over 20% in a 2004 study of primary care patients treated in a US Veterans Affairs setting.43 In addition, the small number and substantial heterogeneity of studies that excluded patients who already had a diagnosis or were receiving treatment did not allow for an assessment of the effect of inclusion and exclusion decisions on diagnostic accuracy estimates. On the other hand, numerous studies have found that the inclusion of established cases among examined cohorts consistently inflates assessments of the accuracy of a diagnostic test,14 and it is likely that this would also be the case in studies of depression screening tools.
The importance of our findings relates to the use of depression questionnaires for screening, a procedure conducted to identify previously unrecognised cases.2 3 In clinical practice, depression questionnaires are sometimes used for purposes other than screening, including monitoring the severity of symptoms in patients who already have a diagnosis of depression and assessing patients for recurrence of symptoms while they are being treated. The introduction and discussion sections of the 17 systematic reviews and meta-analyses we reviewed indicate that all were intended to assess the diagnostic accuracy and utility of depression questionnaires for the purpose of screening—that is, for identification of new cases. None discussed how findings might apply to other possible uses for the questionnaires (such as monitoring progress of treatment or detection of relapse). In addition, the recommendations that have been issued by expert panels regarding depression screening in various settings discuss the use of screening instruments as a means of identifying new cases.
Screening for depression is somewhat different from many other types of screening in that a history or interview might not necessarily be part of the evaluation before a screening tool is administered. To illustrate, the US Preventive Services Task Force recommends screening for cervical cancer in women who have been sexually active and have a cervix.68 Such screening is not recommended, however, for women older than 65 or for women who have recently had a normal result on a smear test. This approach to screening is predicated on some “filtering” to determine the appropriate individuals or groups to be screened. By contrast, the task force’s recommendations regarding depression screening9 focus on issues in healthcare systems, such as the availability of staff assisted depression care, rather than on any upstream evaluation of patients before screening. In clinical settings, screening tools for depression might be routinely administered to all patients in the waiting room of a hospital, physician’s office, or clinic, as has been recommended by expert panels.4 Regardless of whether these screening tools are used with or without upstream “filtering” in clinical practice, accurate determinations of test characteristics that reflect the ability to detect previously unrecognised cases can be obtained only if this upstream “filtering” is done in studies to exclude patients who already have a diagnosis of depression. Our findings show that this is rarely done, and, as a result, existing evidence on the accuracy and case yield of depression screening tools could substantially overestimate their utility in clinical practice. Well designed studies that exclude patients who already have a diagnosis of or are receiving treatment for depression are needed to generate realistic determinations of the accuracy of depression screening tools in clinical settings to inform decisions about the risks and benefits of screening.
Appendix 1: Characteristics of diagnostic accuracy studies included in systematic reviews and meta-analyses
Appendix 2: Estimate of screening results including and excluding patients who already had diagnosis of depression
We thank Allison Leavens, Lisa R Jewett, and Brooke Levis, all of the Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada, for verification of referencing and study counts and proofreading the manuscript. They were not compensated for their contributions.
Contributors: BDT was responsible for the study concept and design, wrote the review protocol, supervised and carried out the data extraction, and drafted the manuscript with the input of the other authors. EA reviewed articles for inclusion, carried out the data extraction, contributed to the analysis, interpretation, and presentation of data, and conducted a critical revision of the manuscript. GE-B and AM participated in the design of the study, reviewed articles for inclusion, carried out the data extraction, and contributed a critical revision of the manuscript. RCZ contributed to the study design and contributed a critical revision of the manuscript. RJS contributed to the study design and analysis and interpretation of the data and contributed a critical revision of the manuscript. All authors had full access to all of the data (including statistical reports and tables) and take responsibility for the integrity of the data and the accuracy of the data analysis. BDT is guarantor.
Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. BDT is supported by a New Investigator Award from the Canadian Institutes of Health Research and an Établissement de Jeunes Chercheurs award from the Fonds de la Recherche en Santé Québec. RCZ is supported by the National Center for Complementary and Alternative Medicine (grant No R24AT004641) and the Miller Family Scholar Program of the Johns Hopkins Center for Innovative Medicine. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Center for Complementary and Alternative Medicine or the National Institutes of Health.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Ethical approval: Not required.
Data sharing: No additional data available.
Cite this as: BMJ 2011;343:d4825