Serious sequelae of youth depression, plus recent concerns over medication safety, prompt growing interest in the effects of youth psychotherapy. In previous meta-analyses, effect sizes (ESs) have averaged .99, well above conventional standards for a large effect and well above mean ES for other conditions. The authors applied rigorous analytic methods to the largest study sample to date and found a mean ES of .34, not superior but significantly inferior to mean ES for other conditions. Cognitive treatments (e.g., cognitive–behavioral therapy) fared no better than noncognitive approaches. Effects showed both generality (anxiety was reduced) and specificity (externalizing problems were not), plus short- but not long-term holding power. Youth depression treatments appear to produce effects that are significant but modest in their strength, breadth, and durability.
Depression in children and adolescents (herein referred to collectively as youths) is a significant, persistent, and recurrent public health problem that undermines social and school functioning, generates severe family stress, and prompts significant use of mental health services (Angold et al., 1998; Clarke, DeBar, & Lewinsohn, 2003). Youth depression is also linked to increased risk of other psychiatric disorders (Angold & Costello, 1993) as well as drug use and suicide (Gould et al., 1998; Rohde, Lewinsohn, & Seeley, 1991), which is the third most common cause of death among adolescents (Arias, MacDorman, Strobino, & Guyer, 2003). Relapse rates have been reported at 12% within 1 year and 33% within 4 years (Lewinsohn, Clarke, Seeley, & Rohde, 1994), and by the age of 18 years, some 20% of youths will have met criteria for a diagnosis of major depressive disorder at least once (Birmaher et al., 1996). Prospective longitudinal research has shown substantial continuity of youth depression into adulthood, with impaired functioning in work, social, and family life, and markedly elevated risk of adult suicide attempts and completed suicide (see, e.g., Costello et al., 2002; Weissman et al., 1999). The extent, impact, and long-term sequelae of youth depression underscore the need for effective treatment. A primary purpose of the current article was to assess the effects of the most extensively tested genre of youth depression treatment: psychotherapy. In this article, we seek to answer the six questions listed below.
The need to examine psychotherapy effects is underscored not only by evidence on the extent, impact, and sequelae of youth depression but also by recent debate over medication risks. Selective serotonin reuptake inhibitors (SSRIs) have become a widely used treatment for depressed youths (Safer, 1997; Treatment for Adolescents with Depression Study [TADS] Team, 2004; Weisz & Jensen, 1999), but concerns over possible risks, including suicidal ideation and suicide attempts (Vitiello & Swedo, 2004; Whittington et al., 2004), have led regulatory agencies to hold hearings, issue safety warnings, and (in the United Kingdom) classify SSRIs as “contraindicated” for pediatric use (see Committee on Safety of Medicines, 2004; U.S. Food and Drug Administration [FDA], 2004). Most recently, the FDA (2004) issued a “black box” warning on all antidepressants, not just SSRIs, to underscore the possible risk and thus encourage clinicians and parents to consider alternatives to medication. Concerns about pharmacotherapy have thus refocused attention on the most prominent medication alternative, psychotherapy, and on the question of how effective psychotherapy is with youth depression.
Efforts to answer this question can be found in multiple tests of youth psychotherapy programs conducted over the past 2 decades, and each individual trial sheds some light on treatment impact. However, quantitative experts (e.g., Cohen, 1990; Rosenthal, 1990; Schmidt, 1992) caution that reports of p values from individual studies may often be misleading and that a better way to evaluate progress is to rely on effect size (ES) values and on meta-analyses that synthesize those values across multiple studies. Like any statistical tool, meta-analysis is most informative when applied to the most representative data possible (e.g., the most complete collection of studies), and successive meta-analyses are needed to ensure accuracy as the pool of relevant studies builds over time.
These points are particularly relevant to the domain of youth depression. Three meta-analyses have been published on this topic, with findings that convey an unusually positive picture of treatment success. In the earliest of the three, Reinecke, Ryan, and DuBois (1998a) focused on a sample of six depression treatment studies with adolescents, all involving cognitive–behavioral therapy (CBT); they reported a mean ES across studies of 1.02. In response to a commentary by Harrington, Campbell, Shoebridge, and Whittaker (1998), Reinecke, Ryan, and DuBois (1998b) used an alternate computational method, generating a mean ES of 0.97, still markedly above means in the youth treatment literature generally. In a second meta-analysis, focused on CBT with depressed adolescents, Lewinsohn and Clarke (1999) reported an even larger mean ES of 1.27, on the basis of what appear to be 12 treatment–control comparison studies. In a third meta-analysis, Michael and Crowley (2002) addressed treatment of child and adolescent depression, placing no limits on the kinds of psychotherapy used. Their study collection included 14 controlled trials (plus a larger number of pre–post design studies that were analyzed separately), encompassing multiple forms of youth depression treatment, and they reported a mean ES of 0.72 for these controlled trials. Averaging across these three meta-analyses, the mean ES is 0.99, well above Cohen’s widely used benchmark of 0.80 for a large effect and markedly higher than the mean value of 0.54 reported in the most recent broad-based meta-analysis of youth therapy outcome encompassing diverse treated problems and disorders (Weisz, Weiss, Han, Granger, & Morton, 1995).
The very high mean ES across these three youth depression meta-analyses might be read as an indication that the search for highly effective nonbiological treatment has already succeeded, and that resources for treatment development and testing should be focused on youth problems and disorders other than depression, which appear to show much smaller effects. Before reaching such a conclusion, however, it may be wise to examine the prior meta-analyses closely and to consider additional evidence that is now available.
First, it should be noted that although the previous meta-analyses were valuable in relation to their specific goals, all three may have provided a somewhat less than complete picture, and new studies have appeared in the years since 1999, the date of the most recent study that was included in these previous meta-analyses. We have now identified more than twice as many randomized trials as the most complete meta-analysis to date (Michael & Crowley, 2002). Second, it should be noted that two of the prior meta-analyses (Lewinsohn & Clarke, 1999; Reinecke et al., 1998a, 1998b) included only articles published in peer-reviewed journals. Including only such peer-reviewed work may increase the risk of ES overestimation because the peer review process favors positive findings (see, e.g., McLeod & Weisz, 2004). Third, for those studies that the meta-analysts did identify, ES computation appears to have relied solely on data in the published reports, on estimates of effects for the numerous studies not reporting sufficient data, and on exclusion of studies (e.g., 50% of the studies identified, in one of the meta-analyses), rather than on contacting authors and persuading them to provide the information needed for precise ES values. Fourth, close examination suggests that the meta-analyses did not consistently require random assignment of study participants to treatment and control conditions or consistently compute ES on the basis of all depression outcome measures obtained in the studies.
Another important issue is that previous meta-analyses relied on fixed effects analyses. This approach is widely used in meta-analyses, but technically it is appropriate only if homogeneity analyses support the assumption that all ES values, across studies, are estimating the same population mean. Our examination (see below) suggests that the assumption is not valid in the case of youth depression psychotherapy studies and thus that random effects analyses are appropriate. Moreover, only random effects analyses permit generalization from the particular studies under review to other treatment studies that used different participants, methods, doses, and outcome measures.
All these issues suggest the need for a current examination of the outcome literature on youth depression treatment. In providing such an examination in the current article, we (a) obtained and calculated ES for a particularly complete pool of youth depression psychotherapy trials, including both peer-reviewed and non-peer-reviewed studies, (b) required random assignment of participants to study conditions to ensure fair tests of treatment effects, (c) contacted authors repeatedly to obtain the exact information needed to compute ES precisely for all studies in our pool, thus eliminating the need to make rough estimates of study parameters or drop studies altogether, (d) used all of the published depression outcome measures included in each study, omitting none, and (e) tested homogeneity of ES and used random effects analyses as indicated.
In addition to providing the most complete assessment of youth depression treatment effects to date, we sought to gauge the holding power of those effects. We did so by quantifying the effects found in clinical trial follow-up assessments and comparing these with the effects found at immediate posttreatment. Such a comparison provides an indication of whether the changes occurring during treatment are internalized in a way that lasts, and thus whether booster sessions and continuation treatment (see, e.g., Clarke, Rohde, Lewinsohn, Hopps, & Seeley, 1999; Weissman, 1994) may be needed to sustain treatment benefit over time. In the two previous youth depression meta-analyses that reported ES at follow-up, effects were found to diminish over time; however, these reports were based on only six studies in Reinecke et al.’s (1998b) analyses and eight studies in Michael and Crowley’s (2002) analyses. Our larger and more representative pool of studies appears to offer a more reliable picture of holding power, with 19 studies providing usable follow-up tests.
We sought to learn whether depression treatment effects were limited to depression symptoms or generalized to other conditions. Previous analyses of the effects of youth psychotherapy (in Weisz et al.’s, 1995, study) examined specificity grossly by comparing treatment effects on problems targeted by the intervention with effects on all other problems; however, in depression treatment, a more refined question arises: whether benefits of depression treatment spread to symptoms of more closely related and more distally related syndromes—that is, anxiety and conduct problems, respectively. We expected that depression effects might not carry over to symptoms so different as those in the externalizing domain; however, high correlations between youth depression and anxiety (see, e.g., Achenbach, 1990) and theories that posit common risk factors for anxiety and depression (e.g., the tripartite model of emotion; Clark, Watson, & Mineka, 1994) suggest that treatments that reduce depression may have beneficial effects on anxiety as well. This possibility is directly relevant to recent debates about whether depression and anxiety require separate treatments or can be treated by combined intervention for emotional disorders (i.e., depression plus anxiety; see Barlow, Allen, & Choate, 2004), but no previous meta-analysis has addressed the issue. Thus, we assessed the specificity of depression treatment effects by assessing outcomes separately for (a) depression measures, (b) measures of anxiety, and (c) measures of externalizing problems (e.g., disruptive conduct).
A potential weakness of most psychotherapy outcome research, including youth depression research, is that it rests on comparisons of active treatments with inert conditions such as waitlist or no treatment (see Jensen, 2003; Weisz, 2004). Such inert conditions control only for the passage of time and the natural time course of problems and disorders, not for placebo or expectancy effects and not for such nonspecific effects as improvement due to attention or to the benefits of a therapeutic relationship. Extensive research (e.g., Baskin, Tierney, Minami, & Wampold, 2003; Kazdin, Bass, Ayers, & Rodgers, 1990) has demonstrated that comparisons of two active conditions, such as treatment versus placebo, generate lower ES values than comparisons of active versus inert conditions. Accordingly, a key question for treatment of any problem, including depression, is whether active treatments show significant effects when compared with active control groups. If treatment effects of youth depression are found to be significant only in comparison with inert control groups, then one implication could be that the apparent benefit of treatments designed specifically for youth depression is somewhat illusory, resembling placebo effects or resting on more generic benefits of nonspecific factors such as the therapeutic relationship (see, e.g., Horvath & Luborsky, 1993). To address this issue, we examined effects for studies in which active treatments were compared with active control groups and for studies in which active treatments were compared with passive control groups.
The current literature in youth depression treatment heavily emphasizes approaches that stress altering unrealistic negative cognitions (e.g., CBT and cognitive restructuring treatments). However, research on adult depression treatment (e.g., Hollon, 2000; Jacobson et al., 1996; Jacobson, Martell, & Dimidjian, 2001) has raised questions about whether a cognitive emphasis is needed to generate improvement and even whether cognitive intervention adds significantly to such noncognitive approaches as behavioral activation. Promising results of some recent youth depression trials that used treatments without a cognitive emphasis suggest that this question is also relevant to youth depression treatment. To address the question, we tested the magnitude of effects for treatments with a cognitive emphasis and for treatments that do not emphasize cognitive change.
We focused an additional set of analyses on the question of whether treatment effects were evident in clinically representative conditions. Critiques of psychotherapy research have raised the concern that many of the studies are controlled efficacy trials with design features that do not resemble actual clinical practice, and that some treatments that succeed under efficacy trial conditions may not work so well under clinical practice conditions (see, e.g., Weisz, 2004; Westen, Novotny, & Thompson-Brenner, 2004). Of special interest in these critiques have been the research participants (i.e., whether participants were recruited vs. clinically referred; see, e.g., Hammen, Rudolph, Weisz, Burge, & Rao, 1999), the treating therapists (i.e., whether therapists were research employees vs. practicing clinicians; see, e.g., Michael, Huelsman, & Crowley, 2005), and the treatment setting (i.e., whether treatment took place in a research setting vs. a clinical service setting; see, e.g., Southam-Gerow, Weisz, & Kendall, 2003). We included all three elements in our analyses and examined the magnitude of effects in (a) studies that used clinically referred youths and those that used recruited youths, (b) studies that used practicing clinicians as therapists and those that used research staff therapists, and (c) studies of treatment conducted in clinical service settings and those with treatment conducted in nonclinical research settings. A finding that treatment effects were insignificant for referred youths, with practicing clinician therapists, or in service settings, would support concerns about the relevance of research results to clinical practice.
Finally, in harmony with recent recommendations in the youth treatment literature (e.g., Kazdin, 2000; Weisz, 2004), we carried out a series of secondary analyses that examined whether depression treatment showed measurable benefit across dimensions that have been identified as important potential predictors of outcome in the treatment literature. These dimensions included youth characteristics (age, gender, and whether youths were selected for study samples based on depressive disorder diagnoses or based on depression symptom measures; see, e.g., Weisz, Valeri, McCarty, & Moore, 1999); treatment intervention characteristics (group vs. individual modality, and treatment duration; see, e.g., McRoberts, Burlingame, & Hoag, 1998); and study characteristics (study attrition rates, whether studies were peer-reviewed, and whether outcomes were assessed via youth self-report vs. parent report; see, e.g., Hammen & Rudolph, 1996). These analyses, although not addressing our six primary aims, were useful in delineating the boundary conditions within which psychotherapy is beneficial.
For the purposes of this review, psychotherapy for depression was defined as an intervention designed to alleviate depressive disorders or elevated levels of depressive symptomatology through structured or unstructured interaction or a training program, which was provided by one or more individuals trained to deliver the intervention or through a program designed to be self-administered.
Studies were obtained through (a) computer searches on PsycINFO (1887–2004), Dissertation Abstracts International (1861–2004), and MEDLINE (1994–2004); (b) examination of reference lists in relevant review articles and reference trails from outcome studies; (c) hand searching all issues from 1965–2004 of those journals in which at least five psychotherapy studies had been identified through our computer and reference searches; and (d) personal communications with authors of relevant studies, asking whether they knew of any additional relevant studies. Keywords used in the computer searches included the following three diagnosis/problem terms (depression, dysthymia, major depression), with the search limited to child and adolescent populations (mean age less than 18 years) and publications that were classified as treatment outcome, clinical trial, single-blind design, or double-blind design.
To be included in the meta-analysis, a study had to meet the following criteria: (a) participants selected because of elevated levels of depressive symptoms, formal diagnosis of major depressive disorder or dysthymic disorder, or research diagnostic criteria diagnoses of minor or intermittent depression; (b) random assignment of participants to at least one active treatment group and at least one untreated, waitlist, minimally treated, or active placebo control group (one study was excluded because it compared two forms of group therapy that were described as active treatments, with no control group); (c) samples of mean age younger than 19 years; and (d) intervention intended by the investigators to target depressive symptoms or disorder. To reduce the possibility of a publication bias favoring positive results, which threatens the validity of meta-analyses (Cook et al., 1993; McLeod & Weisz, 2004; Sohn, 1996), we included non-peer-reviewed studies (e.g., book chapters) and doctoral dissertations. When separate articles were published from the same data set (e.g., articles presenting posttreatment and follow-up findings separately, or dissertations that were later published in journals), these were combined for analysis as a single study. Single-subject designs were excluded because they generate ES values that are not comparable to those for group-comparison designs, and because they pose some risk of idiosyncratic findings based on unusual characteristics of a small number of participants.
Our final sample consisted of 35 studies. Summary information about the individual studies, including their ES values, appears in Table 1. Study references appear, with asterisks, in the reference list. One of the 35 studies (Clarke et al., 1995) was labeled by its authors as “targeted prevention”; we included it because it met all our criteria for studies of psychotherapy, and its sample included only adolescents who showed elevated scores on a standardized depression measure.
Studies were coded for multiple subject, design, and method features. Two judges independently coded all the studies, with interjudge reliability assessed for continuous codes via intraclass correlation coefficients (ICCs) and for categorical codes via kappas. Kappas are conventionally categorized as slight if .01–.20, fair if .21–.40, moderate if .41–.60, substantial if .61–.80, and almost perfect if .81–1.00 (Landis & Koch, 1977). Discrepancies in coding were resolved through the coders’ joint review of study details (see also the ES Calculation Procedures and Reliability section below).
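The agreement index for categorical codes can be illustrated with a short sketch. The code below computes Cohen's kappa for two coders and maps the result onto the Landis and Koch (1977) bands quoted above. The function names and the example codes are hypothetical, and treating each band's upper bound as an inclusive cutoff is an assumption, because the quoted ranges leave small gaps (e.g., between .20 and .21).

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders' categorical codes on the same items."""
    n = len(codes_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's marginal frequencies.
    count_a, count_b = Counter(codes_a), Counter(codes_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Undefined (division by zero) if chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Label a kappa value using the Landis and Koch (1977) bands."""
    if kappa < 0.01:
        return "below slight"  # the quoted bands give no label here
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical control-group codes from two coders for four studies.
coder_1 = ["active", "active", "passive", "passive"]
coder_2 = ["active", "active", "passive", "active"]
kappa = cohen_kappa(coder_1, coder_2)
label = landis_koch_label(kappa)
```

In this made-up example the coders agree on three of four studies, chance agreement is .50, and kappa is .50, which falls in the moderate band.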
We coded outcome measures in each study for content—that is, whether measures assessed depression, anxiety, or externalizing/conduct problems (κ = .90).
We also coded outcome measures as to informant—that is, whether they were derived from youth report, parent report, or teacher report (κ = .76). Because teacher reports were rarely used, we were only able to structure tests focused on youth and parent reports.
The type of control group used varied across studies. Some used passive controls that were given no additional attention or intervention (e.g., no-treatment or waitlist-control groups); others used active control groups that provided attention and/or nonspecific treatment elements similar in dose to what was provided for the intervention group (κ = .90).
Turning to participant characteristics, we coded studies for whether a diagnostic or symptom measure was used to identify depressed youths. Diagnosed samples were those who were administered a diagnostic interview and met full diagnostic criteria (Diagnostic and Statistical Manual of Mental Disorders or research diagnostic criteria) for major depressive disorder, dysthymic disorder, minor depressive disorder, or intermittent depressive disorder. Symptom measure samples showed elevated depressive symptoms on a symptom measure (κ = 1.00).
We also coded samples according to age group. As in prior meta-analyses (e.g., Michael & Crowley, 2002; Weisz et al., 1995), we classified participants under 13 as children and those 13 or older as adolescents (κ = .94). Studies that included both children and adolescents (n = 13) were not included in age group analyses.
Limited variability in treatment methods used to date (i.e., most involved some form of behavioral intervention) left us with only two meaningful treatment method categories: cognitive emphasis (i.e., approaches that emphasize changes in beliefs or ways of thinking about events and conditions) versus no cognitive emphasis—treatments that do not emphasize changing beliefs or ways of thinking about events and conditions (κ = 1.00). A treatment was coded as having a cognitive emphasis if the treatment description gave any indication that treatment focused on changing cognitions, thoughts, beliefs, ways of thinking, or internal self-talk. Treatments that were coded as not having a cognitive emphasis included attachment-based family treatment, behavioral problem solving, group support, interpersonal psychotherapy, relaxation training, role-playing, self-modeling, social skills training, structured learning therapy, and systematic behavior family therapy.
Treatment modality was coded as group versus individual treatment (κ = .77).
We used a continuous measure of treatment duration, calculating the number of therapy hours provided (e.g., total time spent in parent, family, and youth sessions; ICC = .97).
For each study, we coded the percentage of sample attrition that occurred between the point of randomization and the posttreatment assessment (ICC = .90).
We distinguished between studies that had been published in a peer-reviewed journal and those that had not (including dissertations and book chapters; κ = 1.00).
We coded whether the majority of participants were already referred for mental health services independently of the study (clinically referred) or were recruited into the study (e.g., via advertisement; κ = 1.00).
We differentiated studies in which a majority of therapists worked primarily as clinicians from studies in which a majority of therapists were not primarily clinicians (e.g., researchers, graduate students, professors; κ = .69).
We coded whether treatment was provided in a clinical service setting (e.g., outpatient, inpatient, or day treatment program) or nonclinical setting (e.g., university lab or lab clinic, primary or secondary school; κ = .86).
ES values were calculated separately for each outcome measure, informant (e.g., parent, youth), and time of assessment (e.g., posttreatment, follow-up), then averaged up to the level of the target comparison (as we describe later). Following Smith, Glass, and Miller’s (1980) study, ES was the posttherapy difference between the control and treatment group means, divided by control group standard deviation. This resembles Cohen’s (1988) d, except that the divisor is the control group standard deviation rather than the pooled standard deviation across both groups; pooling using the posttreatment standard deviation may be inappropriate for researchers conducting psychotherapy studies because treatment may increase variability (see Weisz et al., 1995, p. 455, for supporting evidence; see Michael & Crowley, 2002, for a useful alternative to our approach). All ESs were calculated such that positive values implied an advantage for treatment over control group. In addition, we adjusted all ES values using Hedges’s small sample correction (Hedges & Olkin, 1985), which yields an unbiased estimator of ES. As sample size increases, d and ES approximate each other. However, with smaller sample sizes, variance estimates are larger, and the distribution of d values sampled tends to be skewed. To correct for this small sample bias, we used the following formula:
$$ES' = ES \times \left[1 - \frac{3}{4N - 9}\right],$$
where N is equal to the number of participants in the control group plus the number of participants in the treated group.
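The ES computation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' actual procedure: it assumes the standard Hedges and Olkin (1985) correction factor 1 - 3/(4N - 9), and the function name, parameter names, and input numbers are all hypothetical.

```python
def adjusted_effect_size(mean_treat, mean_ctrl, sd_ctrl, n_treat, n_ctrl,
                         higher_is_better=False):
    """Control-SD-standardized ES with a small-sample correction.

    Positive values indicate an advantage for treatment over control.
    """
    # On a symptom scale, lower scores are better, so the treatment
    # advantage is the control mean minus the treatment mean.
    diff = mean_ctrl - mean_treat
    if higher_is_better:
        diff = -diff
    # Divide by the control group SD, not the pooled SD.
    es = diff / sd_ctrl
    # Assumed correction factor: 1 - 3 / (4N - 9), N = total sample size.
    n_total = n_treat + n_ctrl
    return es * (1 - 3 / (4 * n_total - 9))


# Hypothetical posttreatment depression scores (lower = fewer symptoms).
es_adjusted = adjusted_effect_size(
    mean_treat=10.0, mean_ctrl=14.0, sd_ctrl=8.0, n_treat=20, n_ctrl=20
)
```

With these made-up inputs the uncorrected ES is 0.50 and the correction shrinks it by a factor of 1 - 3/151, to roughly 0.49. The shrinkage becomes negligible as N grows.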
When means, standard deviations, or other information needed for our calculations or analyses were not reported in an article or manuscript, we contacted authors, sometimes repeatedly, until we obtained the information from them. Using Bollen’s (1989) procedure of dropping ESs lying beyond the first gap of at least one standard deviation between adjacent ES values in a positive or negative direction, we conducted analyses that excluded ES outliers for individual measures. When different scales from the same measure were reported, scales were averaged for overall ES, but depression-specific scales were used in computing depression ES.
All ESs were independently calculated by two raters. Agreement, assessed via the ICC, was .98. The few interrater discrepancies were resolved by jointly reviewing the study methodology to ensure that correct information about means, standard deviations, and direction of the outcome measure (e.g., higher score indicating better outcome) had been used; when these steps were taken, no discrepancies in ES remained.
We pooled ES values up to the most conservative level appropriate to each test. For example, in calculating ES for depression measures, we collapsed across treatment groups and averaged across all depression measures to produce a single ES mean for each study; however, in computing ES means for different forms of psychotherapy, we did not collapse across treatment groups.
We analyzed our data using a weighted least squares (WLS) approach; each ES was weighted by the inverse of its variance (Hedges & Olkin, 1985), thereby adjusting for heterogeneity of variance across individual observations. To facilitate comparison with previous meta-analyses that used unweighted least squares (ULS) procedures (e.g., Michael & Crowley, 2002), we report ULS, as well as WLS values, for overall ES means and for t tests comparing these mean ES values with zero.
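The contrast between weighted and unweighted means can be made concrete with a small sketch. It shows inverse-variance weighting in its simplest (fixed-effect) form; the function names and numbers are hypothetical and are not taken from the studies reviewed.

```python
import math


def weighted_mean_es(es_values, variances):
    """Inverse-variance weighted mean ES and its standard error."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * es for w, es in zip(weights, es_values)) / total_weight
    # The SE of the weighted mean is the root of the inverse summed weight.
    se = math.sqrt(1.0 / total_weight)
    return mean, se


def unweighted_mean_es(es_values):
    """Simple (ULS-style) average that treats every study equally."""
    return sum(es_values) / len(es_values)


# Two hypothetical studies: the first is more precise (smaller variance).
es_values = [0.2, 0.6]
variances = [0.04, 0.16]
wls_mean, wls_se = weighted_mean_es(es_values, variances)
uls_mean = unweighted_mean_es(es_values)
```

Here the weighted mean (0.28) sits closer to the more precise study than the unweighted mean (0.40) does, which is the adjustment the weighting is meant to provide.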
Homogeneity analyses (Hedges & Olkin, 1985) were conducted to test the assumption that all the ES values were estimating the same population mean and to inform a decision about random versus fixed effects analyses (see below). The tests were significant for depression measures ES, Q(30) = 64.77, p < .01, and other ES, Q(29) = 43.72, p = .04, suggesting that the psychotherapy studies may not estimate common ES parameters.
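The homogeneity statistic itself is straightforward to compute. The sketch below uses the usual definition, Q as the weighted sum of squared deviations of each ES from the inverse-variance weighted mean, with k - 1 degrees of freedom for k effect sizes. The function name and inputs are hypothetical.

```python
def homogeneity_q(es_values, variances):
    """Return (Q, df) for a test that all ES values share one population mean."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * es for w, es in zip(weights, es_values)) / sum(weights)
    # Weighted sum of squared deviations from the weighted mean.
    q = sum(w * (es - mean) ** 2 for w, es in zip(weights, es_values))
    df = len(es_values) - 1
    return q, df


# Two hypothetical studies.
q, df = homogeneity_q([0.2, 0.6], [0.04, 0.16])
```

Q is referred to a chi-square distribution with df degrees of freedom. For df = 30, the .01 critical value is about 50.89, so a Q(30) of 64.77 exceeds it, consistent with the rejection of homogeneity reported above.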
The decision as to whether to use fixed or random effects models was undertaken by considering both the types of inference that we wished to make and the homogeneity of ES parameters (as discussed in Hedges & Vevea’s, 1998, study). Fixed effects analysis only takes account of uncertainty due to the particular samples included in a specific meta-analytic set of studies and thus supports only conditional inference about that set of studies; such analysis is appropriate when homogeneity is not rejected. Random effects analysis is used to make inferences about a population of studies that is larger and more diverse (e.g., in samples, designs, treatment doses, and outcome measures) than a single observed study set used for a single meta-analysis; such analysis is appropriate when homogeneity is rejected. Because homogeneity was rejected in our analyses, and we wished to support inferences about psychotherapy outcomes for depression in the general population of children and adolescents, we used random effects analyses.
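To show how a random effects model changes the weights, the sketch below estimates the between-study variance with the DerSimonian-Laird moment estimator. This is a deliberately simpler technique than the maximum likelihood estimation used in the present analyses, chosen only because it has a closed form. The function name and data are hypothetical.

```python
import math


def random_effects_mean(es_values, variances):
    """Random effects mean ES via the DerSimonian-Laird moment estimator.

    Note: a swapped-in, closed-form estimator for illustration; it is not
    the maximum likelihood procedure described in the text.
    """
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * es for wi, es in zip(w, es_values)) / sum_w
    q = sum(wi * (es - fixed_mean) ** 2 for wi, es in zip(w, es_values))
    df = len(es_values) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero when Q is below its df.
    tau2 = max(0.0, (q - df) / c)
    # Random effects weights add tau2 to each study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * es for wi, es in zip(w_star, es_values)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mean, se, tau2


# Two hypothetical, strongly discrepant studies.
re_mean, re_se, tau2 = random_effects_mean([0.0, 1.0], [0.05, 0.05])
```

In this example the estimated between-study variance is 0.45, and the standard error of the mean rises from about 0.16 under fixed effects to 0.50, reflecting the wider population of studies to which the inference is meant to generalize.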
We used paired t tests to compare outcomes on different measures obtained for the same set of studies (e.g., outcomes based on parent vs. youth reports). For comparison of ES values with zero, we used SPSS macros that generate z tests based on the absolute value of the mean ES divided by the standard error of the mean ES (Wilson, 2003). To compare mutually exclusive categories of studies, we used the Q-statistic analog to analysis of variance (Lipsey & Wilson, 2001). If the between-category variance is significant, then the mean ESs across groups differ by more than sampling error. The Q statistic is distributed as a chi-square with k–j degrees of freedom, where k is the number of ESs (Hedges & Olkin, 1985), and j is the number of groups. We conducted all analyses using maximum likelihood, random effects models weighted by the inverse of the variance.
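The two tests described above can be sketched as follows. This is not the SPSS macro code: it uses simple inverse-variance (fixed-effect) weights instead of maximum likelihood random effects weights, and the function names and numbers are hypothetical. By the usual convention for the analysis-of-variance analog, the between-groups component returned here carries j - 1 degrees of freedom (one fewer than the number of groups), and the within-groups component, which is not computed, carries k - j.

```python
import math


def z_test(mean_es, se):
    """z test of a mean ES against zero, with a two-tailed normal p value."""
    z = abs(mean_es) / se
    p = math.erfc(z / math.sqrt(2))
    return z, p


def q_between(groups):
    """Between-groups Q for mutually exclusive study categories.

    groups maps a category name to a list of (es, variance) pairs.
    Returns (Q_between, df), with df = number of groups - 1.
    """
    stats = []
    for pairs in groups.values():
        weights = [1.0 / v for _, v in pairs]
        group_weight = sum(weights)
        group_mean = sum(w * es for w, (es, _) in zip(weights, pairs)) / group_weight
        stats.append((group_weight, group_mean))
    grand_mean = sum(w * m for w, m in stats) / sum(w for w, _ in stats)
    # Weighted squared deviations of the group means from the grand mean.
    q = sum(w * (m - grand_mean) ** 2 for w, m in stats)
    return q, len(stats) - 1


# Hypothetical mean ES and standard error.
z, p = z_test(0.30, 0.15)

# Hypothetical ES values and variances for two categories of treatment.
q_b, df_b = q_between({
    "cognitive": [(0.2, 0.04), (0.4, 0.04)],
    "noncognitive": [(0.6, 0.04), (0.8, 0.04)],
})
```

In this made-up example z is 2.0 (two-tailed p of about .046) and the between-groups Q is 4.0 on 1 degree of freedom.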
To control for the possibility that confounding characteristics might produce or obscure ES differences between theoretically important subgroups of studies, we also included multivariate meta-regression analyses. On the basis of prior meta-analyses and reviews (e.g., Kazdin et al., 1990; Weisz, 2004), we identified four variables with apparent potential to influence ES—mean age, percentage of boys, recruitment status (recruited vs. clinically referred youths), and type of control group (active vs. passive). For each instance in which we tested an ES difference between two subgroups of studies, we also assessed whether the two subgroups differed significantly on any of these four variables, and any of the variables found to differentiate the two groups were included as covariates in the metaregression analysis, as a complement to our Q-statistic comparison. For analyses of within-study variables (e.g., youth self-report measures vs. parent-report measures), confounding variable differences between different subgroups of studies were not at issue, so covariates were not used.
We limited our analyses to those with acceptable power rather than report numerous underpowered tests and risk unreliable findings due to chance. We followed Cohen’s (1988) methods for estimating power on the basis of type of analysis, including those sets of analyses with mean power greater than 0.50 to detect a large effect. Following Cohen’s (1988) guidelines, we used 0.80 as our cutoff for a large effect for t tests and 0.40 as our cutoff for a large effect for F tests. These cutoffs permitted three types of planned analyses:
Given the lack of literature on the topic, our estimates of power for the z tests (tests of ES greater than zero) and Q statistics (tests of mutually exclusive study characteristics) were based on Cohen’s (1988) procedures for t tests and F tests (i.e., analogs for z and Q, respectively).
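As a rough illustration of how power depends on the number of studies, the sketch below uses a normal approximation to the two-sided power of a paired (or one-sample) test for a standardized effect d. It is not Cohen's (1988) tabled procedure, and it somewhat overstates power for very small samples because it ignores the heavier tails of the t distribution. The function name and the sample sizes shown are hypothetical.

```python
from statistics import NormalDist


def approx_power_paired(d, n_pairs, alpha=0.05):
    """Normal-approximation power for a two-sided paired/one-sample test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    ncp = d * n_pairs ** 0.5  # noncentrality: effect size times sqrt(n)
    # Probability of landing in either rejection region under the alternative.
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)


# Power to detect a large effect (d = 0.80) at several hypothetical sample sizes.
for n in (6, 10, 20):
    print(n, round(approx_power_paired(0.80, n), 2))
```

Under this approximation, power to detect a large effect is close to 0.50 with 6 paired observations and rises as the number of observations increases.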
Characteristics of the full set of studies are shown in Table 2. As the table shows, more studies focused on adolescents alone than children or mixed-age samples. Almost half of the studies included only youths with clinically diagnosed depression, and almost half of the studies used an active control condition. Fewer studies involved clinic-referred youths, provided treatment in a clinical setting, or used practicing therapists. On average, treatments involved 13 hr of psychotherapy, studies reported two depression-specific outcome measures and three nondepression measures (e.g., anxiety), and outcome information was typically available from two different informants (e.g., child and parent). Usable follow-up data were provided in slightly more than half of the studies. Three additional studies included follow-up assessments, but their waitlist-control group received treatment immediately after posttreatment, ruling out meaningful comparison between treatment and control groups.
In reporting our findings, we emphasize ES values for depression measures throughout this report, with other measures (e.g., anxiety) included in only two ways: (a) combined with depression measures in the next-to-last column of Table 1 to generate overall ES values for each of the studies and (b) in specificity analyses (see below) comparing ES for depression measures with ES for nondepression measures.
First, we focus on mean ES values for the full collection of studies using different quantitative approaches.
Across the 35 studies, the mean WLS ES for depression measures was 0.34 (SD = 0.40; range was from −0.66 to 2.02), significantly different from zero (z = 4.57, p < .01). When we entered each of the treatment versus control group ES values separately, as some previous meta-analyses have done, the mean was 0.38 (SD = 0.42; range was from −0.66 to 2.02, reflecting 44 treatment–control group comparisons), also significantly different from zero (p < .01).
We also calculated the ULS mean for the 35 depression studies to permit comparison with previous meta-analyses that did not use weighting. Our mean ULS ES was 0.40, significantly different from zero, t(34) = 4.26, p < .01. The ULS mean when treatment groups were used as the level of analysis was 0.46, also significantly different from zero, t(43) = 5.45, p < .01.
To compare the mean ES for depression with the mean ES found for treatment of problems and disorders other than depression, we pooled data from two previous broad-based meta-analyses encompassing treatments for diverse child and adolescent conditions (Weisz et al., 1995; Weisz, Weiss, Alicke, & Klotz, 1987). We identified all studies in those two meta-analyses that tested treatments for conditions other than depression (e.g., aggression, disruptive behavior, attention-deficit/hyperactivity disorder, fears). Because the two previous meta-analyses had excluded non-peer-reviewed studies, we excluded the non-peer-reviewed studies in the current meta-analysis sample from this comparison. The same ES calculation methods were used for both the depression and nondepression study sets. For the peer-reviewed studies of problems other than depression (n = 180), the mean WLS ES (for the specific problems targeted in those studies) was 0.69, significantly higher than the WLS ES of 0.37 for the peer-reviewed depression studies of the current meta-analysis, Q(1, 213) = 8.72, p = .01. Regression analyses controlling for factors that prior literature suggests might influence ES—that is, mean age, percentage of boys, recruitment status (recruited vs. referred youths), and type of control group (active vs. passive)—still resulted in significantly higher ES values for the nondepression outcome studies (B = 0.33, SE = 0.13, b = .19, p = .01).
To assess maintenance of treatment gains over time, we compared ES values obtained immediately after treatment with those obtained in follow-up assessments. In the 19 studies that used follow-up comparisons between treated and untreated groups, the mean lag between end of treatment and follow-up assessment was 37.5 weeks (range = 4–130 weeks). When follow-up assessments were conducted at multiple time points (e.g., 3 and 6 months), ES values were averaged across the time points. We found no overall difference between posttreatment and follow-up ES for depression outcomes (Ms: 0.30 vs. 0.28, respectively, p = .84), suggesting at first blush that there was no fall off in effects. However, a closer look revealed that the correlation between follow-up time lag and follow-up ES was negative and significant (r = −.50, p = .03). Follow-up assessments conducted near the end of treatment (e.g., 2–3 months) showed relatively large effects, but follow-ups with lags of 1 year or more showed essentially no treatment effect.
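The significance of the lag correlation can be checked from r and n alone, using the standard t transformation of a Pearson correlation. The sketch below assumes that all 19 follow-up studies entered the correlation (so df = 17); the function name is hypothetical, and the critical value is the usual two-tailed .05 entry from a t table.

```python
import math


def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


T_CRIT_DF17 = 2.110  # two-tailed .05 critical value for 17 df (standard t table)

t_value = t_from_r(-0.50, 19)
print(f"t(17) = {t_value:.2f}; exceeds critical value: {abs(t_value) > T_CRIT_DF17}")
```

Under that assumption, the statistic is about −2.38, which exceeds the critical value in absolute terms and is consistent with the reported p of .03.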
Next we considered the specificity question, comparing ES for measures of depression with ES for two kinds of nondepression outcome measures: anxiety symptoms and externalizing behavior. In these analyses, we used only studies that had included both depression and anxiety outcomes (n = 10) and depression and externalizing behavior outcomes (n = 11), respectively. A matched sample t test comparing depression ES (0.57) and anxiety ES (0.39) was marginally significant, t(9) = 2.05, p = .07. By contrast, the t test comparing depression ES (0.31) and externalizing ES (0.05) was significant, t(10) = 2.91, p = .02. The WLS mean ES for anxiety measures was significantly different from zero (z = 2.73, p = .01), but the mean ES for externalizing problems was not (z = −0.30, p = .77).
The most common form of control group across the pool of studies was passive (i.e., waitlist or no treatment). For the 20 studies that used such passive control groups, mean psychotherapy ES was 0.41, significantly different from zero (z = 4.36, p < .01). The remaining 15 studies used active control groups; these studies showed a more modest ES mean of 0.24, also significantly different from zero (z = 2.15, p = .03). ESs for studies that used active versus passive control groups were not significantly different, Q(1, 33) = 1.46, p = .23. Studies that used active control groups had included clinically referred youths more often than studies that used passive control groups, χ2(1, N = 35) = 9.65, p = .002; when we controlled for this clinic referral variable, the ES difference between active and passive control groups remained nonsignificant (B = 0.25, SE = 0.18, p = .16).
We computed mean ES separately for the 31 treatments that involved a cognitive change emphasis (e.g., CBT) and the 13 treatments that did not emphasize cognitive change (e.g., relaxation training). The ES mean for cognitive treatments was 0.35, significantly different from zero (z = 4.54, p < .01); the mean for noncognitive treatments was 0.47, also different from zero (z = 3.57, p < .01). The ES difference between cognitive and noncognitive treatments was not significant, Q(1, 42) = 0.63, p = .42. Studies that used cognitive treatments did not differ from studies that used noncognitive treatments in terms of mean age, gender, recruitment status, or control group, so analyses with covariates were not conducted.
Next, we turned to the issue of clinical representativeness in study design, focusing on the youths, therapists, and settings used in the studies.
Mean ES for studies that used primarily recruited participants was 0.34 (n = 29), significantly different from zero (z = 4.24, p < .01). Mean ES for studies that used clinically referred participants (n = 6) was 0.32, also different from zero (z = 1.99, p < .05). The ES difference between the two groups of studies was not significant, Q(1, 33) = 0.02, p = .89. Studies with referred youths used active control groups more often, χ2(1, N = 35) = 9.65, p < .01, but the difference between referred and recruited youths remained nonsignificant when we controlled for type of control group (B = 0.16, SE = 0.22, p = .47).
The mean ES for studies in which the majority of therapists were research therapists rather than practicing clinicians was 0.52 (n = 18), significantly different from zero (z = 3.92, p < .01). For studies in which the majority of therapists were practicing clinicians (n = 9), the mean ES was 0.27, marginally different from zero (z = 1.95, p = .05 [p < .06]). The difference between the two groups of studies was not significant, Q(1, 25) = 1.92, p = .17. Studies that used practicing clinicians had higher rates of clinically referred youths, χ2(1, N = 27) = 6.01, p = .01, but the ES difference between practicing clinicians and research therapists remained nonsignificant when we controlled for recruitment status of youths (B = −0.26, SE = 0.20, p = .19).
Twenty-two studies were conducted in nonclinical settings, generating a mean ES of 0.41, significantly different from zero (z = 4.56, p < .01). Eleven studies were conducted in clinical service settings, yielding an ES of 0.24, also significantly different from zero (z = 2.04, p = .04). The ES difference between the two sets of studies was not significant, Q(1, 31) = 1.26, p = .26. Although studies conducted in clinical service settings used referred youths more often, χ2(1, N = 33) = 8.25, p < .01, differences between studies from clinical and nonclinical settings remained nonsignificant when we controlled for recruitment status (B = −0.20, SE = 0.17, p = .24).
Next we carried out a series of secondary analyses examining ES as a function of characteristics of the study participants, the treatments provided, and the design and procedural characteristics of the studies.
The WLS mean for studies of children (seven studies with samples aged less than 13 years) was 0.41, significantly different from zero (z = 2.19, p = .03); the mean for adolescents (n = 21) was 0.33, also different from zero (z = 3.64, p < .01). The correlation between mean sample age and ES was 0.02 (p = .91). The Q test revealed no significant difference between child and adolescent ES, Q(1, 26) = 0.16, p = .69. Studies of children had higher rates of male participants than did studies of adolescents, t(25) = 2.28, p = .03. However, the youth age effect remained nonsignificant when we controlled for gender composition of the sample (B = −0.17, SE = 0.23, p = .47).
The bivariate correlation between ES and percentage of boys in the sample was −.27, p = .12.
Fifteen studies used diagnostic procedures (e.g., assessing major depressive disorder and dysthymic disorder) to identify their samples. Mean ES for these studies was 0.35, significantly different from zero (z = 3.39, p < .01). Twenty studies used depression symptom measures to identify their samples. The mean ES for these studies was 0.32, also different from zero (z = 3.25, p < .01). The two study sets did not differ significantly in ES, Q(1, 33) = 0.03, p = .86. Studies that used diagnosed samples had older participants, t(31) = 2.58, p = .02, and used more clinically referred samples, χ2(1, N = 35) = 4.84, p = .03. Differences in ES between studies that used diagnosis and symptom measures remained nonsignificant when we controlled for these two covariates (B = −0.08, SE = 0.18, p = .67).
We computed mean ES separately for the 31 group-based treatments and the 13 individually administered treatments. The WLS ES mean for group treatments was 0.38, significantly different from zero (z = 4.63, p < .01); the mean for individual treatments was 0.37, also different from zero (z = 3.22, p < .01). The ES difference between group- and individually administered treatments was not significant, Q(1, 42) = 0.01, p = .93. Studies that used group treatment also used passive control groups, χ2(1, N = 44) = 7.05, p < .01, and recruited youths, χ2(1, N = 44) = 14.57, p < .01, more frequently than did studies that used individual treatment. When we tested the effect of treatment modality controlling for type of control group and recruitment status, the effect remained nonsignificant (B = −0.13, SE = 0.19, p = .49).
Treatment duration ranged from 4 to 32 hr, with a mean of 13.5 and a median of 12. The correlation between treatment duration and overall ES was −0.06 (p = .78). A direct test based on a median split comparing studies involving less than 12 treatment hr versus those involving 12 or more treatment hr revealed no significant difference, Q(1, 28) = 1.22, p = .27. Studies that used shorter treatment durations did not differ from studies that used longer treatment durations in terms of mean age, gender, recruitment status, or control group, so analyses with covariates were not conducted.
Attrition rates varied from 0% to 71%, with a mean of 12.8% and a median of 9.5%. The bivariate correlation between attrition rate and overall ES was nonsignificant (r = −.03, p = .88).
Of the 35 studies, 27 were peer-reviewed, and these had a mean ES of 0.34, significantly different from zero (z = 4.38, p < .01). The 8 non-peer-reviewed studies had a mean ES of 0.32, only marginally different from zero (due in part to small sample size; z = 1.66, p = .10). A direct test of the ES difference between peer-reviewed and non-peer-reviewed studies was not significant, Q(1, 33) = 0.01, p = .92. Peer-reviewed studies included fewer boys in their samples than did non-peer-reviewed studies, t(32) = 2.98, p < .01, but when percentage of boys was included as a covariate in regression analyses, the effect of publication status remained non-significant (B = 0.16, SE = 0.24, p = .50).
We compared posttreatment ES values for youth versus parent report using only studies that included both youth- and parent-report depression measures (n = 6). A paired-sample t test showed a significant youth–parent difference for depression measures, t(5) = 2.56, p = .05, Ms = 0.72 for youth report, 0.24 for parent report. The mean ES for youth report was significantly different from zero (z = 4.70, p < .01), but the mean ES for parent report was not (z = 0.49, p = .62).
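A paired-sample t test of this kind treats each study as one observation with two ES values, one per informant. The sketch below shows the computation on hypothetical study-level ES values (they are not the six studies analyzed here), and the function name is an illustrative choice.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired-sample t statistic and degrees of freedom for matched ES values."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical per-study ES values for two informants.
youth_es = [0.9, 0.5, 1.1, 0.4, 0.8, 0.6]
parent_es = [0.3, 0.4, 0.2, 0.1, 0.5, 0.0]
t_stat, df_paired = paired_t(youth_es, parent_es)
print(f"t({df_paired}) = {t_stat:.2f}")
```

Because the test works on within-study differences, between-study characteristics such as sample age or control group type are held constant across the two informants.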
Given current concerns about suicide risk in youth depression treatment, we focused on the six studies in our collection that included a measure of suicidality (e.g., the suicide symptom total from a diagnostic interview or a questionnaire designed to tap suicidal thinking and behavior). The average ES for these suicidality measures across the 6 studies was 0.18, marginally different from zero (z = 1.80, p = .07).
Finally, we turned to the question of why mean ES in our analyses (0.34) was so much lower than in the three prior meta-analyses of youth depression treatment trials, discussed in the introduction (0.99). It is not possible to discern all the reasons for this discrepancy, given the numerous differences that inevitably exist between any two meta-analyses (e.g., inclusion criteria, samples of studies used, procedures used to calculate ES). However, we sought to shed some light by examining the different collections of studies and considering what ES values might emerge with changes in procedure.
Our ES mean of 0.34 was markedly lower than the mean reported by Reinecke et al.: 1.02 (range = 0.40–1.85, 95% confidence interval [CI] = 0.81–1.23) in Reinecke et al.’s (1998a) study and 0.97 in Reinecke et al.’s (1998b) study. When we applied our data-analytic methods to the sample of six studies included in Reinecke et al.’s meta-analysis, we found a mean ES of 0.87 (WLS)/0.88 (ULS). When we shifted to the methods of ES calculation and data analysis described in Reinecke et al.’s two articles (e.g., changing from our ES formula based on control group standard deviation to Reinecke’s formula based on pooled treatment group plus control group standard deviation), we found a mean ES of 0.98 (ULS)/0.96 (WLS), almost identical to the figures reported by Reinecke et al. Thus, the difference between our mean ES of 0.34 and the much higher mean ES reported by Reinecke et al. appears to relate primarily to the difference between their small, select sample of six studies and our more broadly representative sample of 35 studies (e.g., including peer-reviewed and non-peer-reviewed trials, which used any form of psychotherapy).
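The reasoning in this comparison amounts to a simple two-step decomposition, sketched below with the WLS figures given above. It is an informal reading aid, not an analysis the authors report, and the function name is hypothetical.

```python
def decompose_gap(their_methods_their_sample, our_methods_their_sample, our_methods_our_sample):
    """Split an ES gap into a part tied to methods and a part tied to study sample."""
    method_gap = their_methods_their_sample - our_methods_their_sample
    sample_gap = our_methods_their_sample - our_methods_our_sample
    sample_share = sample_gap / (method_gap + sample_gap)
    return method_gap, sample_gap, sample_share


# WLS means from the text: 0.96 (their methods, their six studies),
# 0.87 (our methods, their six studies), 0.34 (our methods, our 35 studies).
method_gap, sample_gap, sample_share = decompose_gap(0.96, 0.87, 0.34)
print(f"methods: {method_gap:.2f}, sample: {sample_gap:.2f}, sample share: {sample_share:.0%}")
```

Holding the six studies fixed and changing only the methods moves the mean by about 0.09, whereas holding the methods fixed and changing the study sample moves it by about 0.53. That is why the sample difference is identified as the primary source of the discrepancy.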
Our ES mean of 0.34 differed most from that reported by Lewinsohn and Clarke, which was 1.27 (range and CI not reported). Using our methodology to calculate ES for the 12 studies cited by Lewinsohn and Clarke that included treatment versus control comparisons, we found a mean ES of 0.49 (WLS)/0.55 (ULS), closer to the ES in the current study. As in our comparison with Reinecke et al. (1998a, 1998b), we sought to compute an ES for Lewinsohn and Clarke’s study set on the basis of their exact methods of ES calculation and data analysis; however, information on their ES calculation and analytic procedures was not available from the article or from the authors. Nonetheless, our findings with their pool of studies do suggest that much of the difference between their mean ES of 1.27 and our much more modest mean of 0.34 was due to differences between their ES calculation methods and ours, with a smaller additional portion due to differences between the two samples of studies used.
Michael and Crowley reported a mean ES of 0.72 (range = 0.03–1.84, 95% CI = 0.48–0.94), on the basis of averaging 21 treatment effects reported in 14 articles and dissertations, using treatment group as the level of analysis. Their mean ES is larger than the unweighted ES values our meta-analysis generated (0.46 treatment level and 0.40 study level) and may differ both because of the different pool of studies used and differences in our methods of computing ES. Our methods differed in at least three ways from those of Michael and Crowley: (a) Our list of depression outcome measures differed slightly from theirs in that we included the standard Child Behavior Checklist anxious–depressed scale, whereas they used a depression-specific Child Behavior Checklist scale (see Clarke, Lewinsohn, Hops, & Seeley, 1992); (b) we computed ES using the posttest standard deviation of the control group only, whereas they pooled standard deviations of experimental and control groups at pretreatment and the control group at posttreatment; and (c) we used Hedges’s correction for small sample size in computing each study’s ES, whereas they did not. Calculating ES with their pool of studies and procedures but instead using each study as the level of analysis would have yielded an overall ES of 0.61 (Kurt Michael, personal communication, October 21, 2004). When we used the same set of studies as Michael and Crowley (2002) but applied our method of ES calculation (including using each study as an observation, rather than each treatment group), we found a mean ES of 0.48 (ULS)/0.38 (WLS), only modestly larger than the ES means we had calculated for our pool of 35 studies. Our calculated ES for Michael and Crowley’s pool of studies with ESs averaged across treatments is 0.57 (WLS)/0.56 (ULS). 
Thus, it appears that some of the difference between Michael and Crowley’s relatively larger mean ES and our smaller mean may reflect differences in the study collections used, but a larger portion of the difference is attributable to different data-analytic methods.
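The methodological differences discussed above can be made concrete with a small sketch contrasting an ES based on the control group's posttest standard deviation (with a small-sample correction) against an ES based on a pooled standard deviation. Several details are assumptions made for illustration: the correction factor uses the common approximation 1 − 3/(4N − 9), the sign convention treats lower depression scores in the treated group as a positive effect, and the pooled variant is the generic two-group formula rather than the exact procedure of any prior meta-analysis. All names and numbers are hypothetical.

```python
def small_sample_correction(n_total):
    """Approximate Hedges-type correction factor, 1 - 3 / (4N - 9)."""
    return 1.0 - 3.0 / (4.0 * n_total - 9.0)


def es_control_sd(mean_t, mean_c, sd_c, n_t, n_c):
    """ES standardized by the control group's posttest SD, then corrected."""
    raw = (mean_c - mean_t) / sd_c  # positive when the treated group scores lower
    return raw * small_sample_correction(n_t + n_c)


def es_pooled_sd(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Uncorrected ES standardized by the pooled treatment and control SD."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_c - mean_t) / pooled_var ** 0.5


# Hypothetical posttest depression scores: treated group M = 10, SD = 6;
# control group M = 14, SD = 8; 15 youths per group.
es_a = es_control_sd(10, 14, 8, 15, 15)
es_b = es_pooled_sd(10, 14, 6, 8, 15, 15)
print(f"control-SD, corrected: {es_a:.2f}; pooled-SD, uncorrected: {es_b:.2f}")
```

In this example the treated group's scores are less variable than the control group's, so the pooled denominator is smaller and the uncorrected pooled ES comes out larger. The same raw means can therefore yield noticeably different ES values depending on the formula.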
We assembled the most comprehensive collection to date of youth depression treatment trials (both peer-reviewed and non-peer-reviewed), required random assignment for inclusion, contacted study authors repeatedly until we had obtained all data needed for the most precise ES calculation, and applied stringent meta-analytic methods (e.g., depression measures only, one mean ES per study, correction for small samples, weighted ES, and random effects analyses) to the data we obtained. The meta-analytic findings that emerged differed from those of previous reports in some very significant ways.
Perhaps the most striking difference between our findings and those of previous meta-analyses concerned the overall magnitude of treatment benefit. The mean effect of psychotherapy in our analyses was 0.34, falling between Cohen’s (1988) benchmarks for a small (i.e., 0.20) and medium (i.e., 0.50) effect. Psychotherapy effects in previous youth depression meta-analyses (Lewinsohn & Clarke, 1999; Michael & Crowley, 2002; Reinecke et al., 1998a, 1998b) had averaged 0.99, comparing favorably with Cohen’s benchmark of 0.80 for a large effect. The surprisingly modest treatment effect evident in our analyses suggests a new perspective on the success of youth depression psychotherapy. Our findings—including our direct comparison with previous meta-analytic findings with problems and disorders other than depression—indicate that youth depression treatment does not surpass but instead may lag significantly behind treatments for other youth conditions.
Such an inference would need to be considered with caution. The years in which the depression and nondepression treatment studies were published overlapped substantially but not perfectly, which could complicate the comparison if year of publication were associated with ES; however, such an association is not evident in the youth treatment literature (see Weisz et al., 1995). The depression versus nondepression study comparison did control for multiple factors that previous literature suggests might explain differences between different collections of studies (mean age, gender distribution, recruited vs. referred youths, active vs. passive control group, methods of ES calculation, and peer-reviewed studies vs. non-peer-reviewed); these factors did not account for the difference between depression and nondepression studies. However, different collections of studies will inevitably differ in diverse ways, so it is possible that factors not identified and controlled for in our analysis might account for the ES difference between depression and nondepression studies.
In light of our finding that psychotherapies for youth depression have relatively modest mean effects, several potentially useful next steps might be considered. These could include (a) strengthening the substance or ramping up the dose of current treatments, (b) combining currently separate depression treatments into more potent multicomponent packages, and (c) developing and testing entirely new methods that produce more substantial benefit. That said, it is important to note that ES values showed a broad range across studies in our collection; indeed, five different treatment programs generated effects exceeding 1.0. Thus, some treatments in the current armamentarium may already have strong potential.
In this connection, it must be noted that the strongest potential may not attach to the most popular treatments. In the current zeitgeist, treatments that focus on altering unrealistic, negative cognitions have particularly prominent status. Indeed, 31 of the 44 treatments in our study set emphasized cognitive change (i.e., through CBT or other cognitive approaches). This broad approach is also popular in adult depression treatment; however, some of the most provocative adult research (see, e.g., Hollon, 2000; Jacobson et al., 1996, 2001) has highlighted the potential of noncognitive behavioral-activation strategies, providing evidence that the impact of such strategies is not improved on by treatment with a cognitive focus. Our analyses of youth treatment evidence indicated, similarly, that noncognitive treatments demonstrated effects that were easily as robust as the cognitive treatments, suggesting that beneficial treatment for youth depression may not require altering cognitions.
Although our overall mean ES for psychotherapy was much lower than in previous reports, it was significantly different from zero, suggesting reliable treatment effects across the group of studies as a whole. However, these effects proved durable only in the relatively short term. ES at follow-up periods of 1 year or longer showed no lasting treatment effect. This supports the potential value of booster sessions and continuation treatment (Clarke et al., 1999; Weissman, 1994) in extending treatment benefit over time. However, two caveats should be noted. First, more than one third of the studies reviewed did not include follow-up assessments with treatment versus control comparisons; we do not know how lasting effects were in those studies. Second, only five studies included follow-up at 1 year or beyond; this limits our ability to generalize, and it highlights the need for studies with longer term follow-up.
We also assessed the generality versus specificity of treatment effects, investigating the extent to which effects on depression-related outcome measures were replicated with nondepression outcome measures. Previous meta-analytic findings on generality–specificity across an array of youth treatments and treated problems (Weisz et al., 1995) had shown significant treatment effects on both targeted and nontargeted outcomes but with effects stronger for targeted than nontargeted outcomes, suggesting specificity of benefit. However, the previous work had not focused on depression in particular or on the question of whether carryover effects might depend on conceptual similarity between depression and the outcomes being measured. When we addressed this question here, we found evidence of both generality and specificity in treatment effects. Depression treatment was associated with significant improvement in the conceptually similar domain of anxiety. In fact, following depression treatment, the reduction in anxiety symptoms was only marginally lower than the reduction in depressive symptoms. By contrast, we found that effects for the conceptually dissimilar domain of externalizing problems were significantly inferior to effects on depression measures, and that the mean effect on externalizing outcomes was not significantly different from zero. Our finding that depression treatment has beneficial effects on anxiety is consistent with growing evidence that youth depression and anxiety are closely associated empirically (see, e.g., Achenbach & Rescorla, 2001) and that they share a common core of negative affectivity (see, e.g., Cole, Peeke, Martin, Truglio, & Seroczynski, 1998; King, Ollendick, & Gullone, 1991). A useful question for future research is whether the effects of youth depression treatment on anxiety result from increased skill in addressing the negative affectivity that is apparently shared by the two syndromes. 
Whatever the answer to this question, the findings do offer some support for the possibility that youth depression and anxiety might be treated by a common intervention encompassing emotional disorders (see Barlow et al., 2004).
Taken together, these findings on the magnitude and specificity of effects may help inform the debate over alternatives to antidepressant medication, as discussed in the introduction (see Glass, 2004; Safer, 1997; TADS Team, 2004; Vitiello & Swedo, 2004; Weisz & Jensen, 1999; Whittington et al., 2004). Our results suggest that for those who seek an alternative to antidepressants, psychotherapy offers a reasonable option, generating a small to medium ES that generalizes to comorbid anxiety symptoms and shows substantial holding power for some months after treatment ends. Because recent concerns over SSRIs relate to elevated risk of suicidality, it may warrant attention that our study set included six investigations that assessed suicidality as an outcome, and that these studies averaged a small reduction in suicidality (mean ES = 0.18, marginally greater than zero).
As another perspective on these issues, one might construe psychotherapy as a potentially useful complement to, rather than a replacement for, antidepressants—that is, a form of intervention that may boost outcomes when combined with medication. This perspective is consistent with the findings of the most complete and sophisticated direct comparison, to date, of medication to psychotherapy in youth depression treatment—that is, the TADS (see TADS Team, 2004). In this study, adolescents treated with fluoxetine alone showed outcomes superior to those in a placebo condition, but adolescents treated with a combination of fluoxetine and a 12-week course of CBT showed the most positive treatment response, supporting the idea that psychotherapy may complement the effects of antidepressant medication. An important additional finding was that CBT alone did not significantly outperform the placebo condition, supporting concerns that psychotherapy alone (at least in its CBT form) may not be a very potent treatment force. If this finding were taken as definitive evidence on the potential of CBT, then the results could be quite discouraging to those who seek a psychotherapeutic alternative to medication. However, a close look at Table 1 indicates that the CBT ES generated in TADS is not characteristic of most CBT or psychotherapy effects on youth depression; 20 of the 23 other CBT programs in the table showed larger ES than the TADS version of CBT, and the mean ES value across the non-TADS CBT programs in the table was 0.48, markedly higher than the −0.07 ES associated with the TADS CBT intervention. What is not clear from the available data is whether this picture results from a low-potency version of CBT in TADS, from the unusual and challenging comparison of CBT with a medication placebo condition in TADS (see Baskin et al., 2003), from a combination of the two, or from other factors not identified.
A concern raised by some (e.g., Jensen, 2003; Weisz, 2004) about the evidence on youth treatment research in general is that so many of the trials have compared active treatment with passive control conditions, including no treatment and waitlist. This was evident in the current depression study set as well, with 20 of the 35 studies having used passive control conditions. Those studies showed a relatively strong treatment effect (mean ES = 0.41), significantly superior to zero at the .01 level. In contrast, the 15 studies comparing treatment with an active control group generated a mean ES of only 0.24, markedly lower but still superior to zero. Thus, our findings showed rather modest benefits of depression treatment when compared with the most rigorous comparison conditions (see Baskin et al., 2003). As Jensen (2003) stressed, we need studies that use “control groups comparable in intensity of exposure to the supposed active treatment. Such studies are critical, if we are to conclude that something about a given therapy is specifically effective, over and above simple compassion, friendliness, attention, and belief” (p. 37). The fact that passive control groups generate higher ES may also help explain why previous youth depression treatment meta-analyses have yielded higher mean ES than the current one, because the study collections in those prior meta-analyses involved somewhat heavier reliance than the current meta-analysis on no-treatment and waitlist comparison groups. Specifically, 40% of the studies in our meta-analysis used active control groups, in contrast to 26% averaging across the three previous youth depression meta-analyses (i.e., 14% in Michael and Crowley’s, 2002, meta-analysis; 33% in Reinecke et al.’s, 1998a, 1998b, meta-analysis; and 33% in Lewinsohn and Clarke’s, 1999, meta-analysis).
In the debate over empirically supported treatments, concern has been raised that the empirical support comes largely from efficacy studies in which experimental control is achieved at the cost of clinical representativeness (e.g., Weisz, 2004; Westen et al., 2004). One result, the argument goes, is that for many treatment programs we do not know whether the procedures actually work with clinically referred youths, treated by clinical practitioners, in clinical practice settings. Our findings may alleviate some of these concerns to some degree. Although we did find that ES values were somewhat larger for research therapists than for clinical practitioner therapists, and somewhat larger for treatments delivered in research settings than in service settings, neither difference was significant. Moreover, ES was reliably superior to zero for referred youths, practitioner therapists, and clinical service settings, suggesting that significant treatment benefit can be obtained across all three clinical representativeness dimensions.
Despite the modest overall treatment effects evident across the depression trials, we found that treatment benefit proved rather robust across some notable variations in person and treatment characteristics. For example, significant treatment effects were identified for (a) both child-majority and adolescent-majority samples considered separately, (b) samples identified as having depressive disorders and samples identified through depression symptom measures, (c) both group and individual treatments, and (d) treatments with and without a cognitive emphasis. In addition, treatment duration was not correlated with outcome, suggesting that some briefer treatments may have the potential to be as effective as lengthier ones. Thus, the benefits of psychotherapy, though modest on average, were evident across rather diverse characteristics of treated youths and across variations in the format, content, and duration of therapy.
We also found effects to be rather consistent across published and non-peer-reviewed studies, suggesting that publication bias may not be a major problem in the youth depression treatment literature thus far. In the youth psychotherapy literature generally, unpublished/non-peer-reviewed studies show significantly lower ES than published studies (see McLeod & Weisz, 2004). However, most youth depression psychotherapy research is relatively recent compared with treatment research with other youth conditions (see Weisz, Hawley, & Jensen Doss, 2004); more recent research may profit from an increased focus, in journal reviews, on the quality of the research procedures rather than on the statistical significance of intervention effects.
Our findings shed light on another area of discussion among youth depression treatment experts: intervention outcome as perceived by different informants. We found that depression-related outcomes looked significantly better when the outcome information was provided by the youngsters themselves (e.g., via symptom self-report measures) than when their parents provided the information. Moreover, youth-report outcomes were significantly better than zero, whereas parent-report depression outcomes were not. Such a finding may reflect the fact that youths themselves have better access to information on their own internal state than do outside observers (see Hammen & Rudolph, 1996). Collateral reporters rather consistently report lower levels of depression than children themselves (Angold & Costello, 1993; Capaldi & Stoolmiller, 1999; Hammen & Rudolph, 1996). It is possible that parents’ difficulty in evaluating their children’s internal states may make them relatively insensitive to the changes that would need to be noted to detect improvement at the end of treatment. It may also be relevant that parents of depressed children are more likely than other parents to be depressed themselves (Kovacs & Devlin, 1998); some have argued that relatively depressed mothers may perceive their children’s behavior in a more negative light than it actually warrants (Breslau, Davis, & Prabucki, 1988), but because maternal depression is in fact associated with more actual child disorder, it is not clear whether bias is involved (Boyle & Pickles, 1997; Richters, 1992). Whatever the reason for our findings, they do raise a concern that the evidence for beneficial effects of psychotherapy for youth depression rests almost entirely on reports by the youths themselves without confirmation from other more objective informants.
Our findings, together with the scrutiny of studies required for this meta-analysis, suggest several observations about the state of the evidence and ways to improve it. As noted previously, we need more studies designed to test whether specific depression treatments can outperform active conditions that control for attention and other nonspecific factors. In addition, the fact that more than one third of the studies in our collection generated no usable follow-up comparisons of treatment and control conditions is a reminder that we need more studies that include follow-up assessments, and in which the control condition remains unaltered throughout the follow-up period, so that treatment–control comparisons can be meaningful at the time of follow-up. The episodic nature of depression may also argue for extending the lag between posttreatment and follow-up. An important counter to this point is that maintaining control conditions for extended periods raises ethical concerns, if doing so exposes depressed youths to long periods without treatment. This point, in our view, underscores the potential value of the treatment-as-usual control condition (see, e.g., Weisz, 2004). Ethical concerns should not attach to procedures that provide youths with the intervention they would have received in the absence of the study. This suggestion and the previous one are quite compatible; no treatment and waitlist, currently the most commonly used control conditions, are not only the weakest experimentally but also the most difficult to sustain throughout a follow-up period, given ethical and humane concerns.
Another design limitation evident in the studies that we reviewed was the relative absence of intent-to-treat analyses (only 11 studies explicitly reported such analyses), a state of affairs that constrains interpretation of positive effects. Without such analyses, one cannot rule out the possibility that youths who dropped out of treatment (and were thus dropped from analyses) did so because they were not benefiting from treatment, and that the resulting ES values are overestimates.
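The logic of this concern can be illustrated with a minimal sketch. The scores below are entirely hypothetical and are not drawn from any study in the meta-analysis; the sketch assumes that higher scores mean more depressive symptoms, that two treated youths dropped out while still symptomatic, and that their last available scores stand in for posttreatment scores. The helper name cohens_d is likewise an illustrative choice, and the pooled-standard-deviation formula is one common way to compute a standardized mean difference, not necessarily the computation used for the ES values reported here.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treated, control):
    """Standardized mean difference using a pooled standard deviation.

    Computed as (control mean - treated mean) / pooled SD, so a positive
    value means the treated group ended with fewer symptoms.
    """
    n_t, n_c = len(treated), len(control)
    pooled_var = (
        (n_t - 1) * stdev(treated) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(control) - mean(treated)) / sqrt(pooled_var)


# Hypothetical posttreatment symptom scores (higher = more symptoms).
control = [22, 25, 19, 24, 21, 23, 20, 26]
treated_completers = [15, 17, 14, 18, 16, 13]
# Assumption: these two youths left treatment without improving, and their
# last available scores are used in place of posttreatment scores.
treated_dropouts = [24, 23]

# Completers-only analysis: dropouts are excluded from the treated group.
completer_es = cohens_d(treated_completers, control)

# Intent-to-treat analysis: every youth assigned to treatment is retained.
itt_es = cohens_d(treated_completers + treated_dropouts, control)

print(f"completers-only ES: {completer_es:.2f}")
print(f"intent-to-treat ES: {itt_es:.2f}")
```

Under these assumptions the completers-only ES exceeds the intent-to-treat ES, because excluding the nonimproving dropouts both lowers the treated group mean and shrinks its variance. Whether real dropouts resemble the hypothetical ones here is exactly what cannot be determined without intent-to-treat analyses.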
Two final areas of concern involve the search for moderators and mediators of treatment impact (see discussion in the following books: Kazdin, 2000; Weisz, 2004). On the moderator front, many of the studies can be faulted for a failure to characterize the samples fully enough to address the role of participant characteristics. As an example, only 13 of the 35 studies provided detailed information on the race/ethnicity of their samples, and only 6 of the 35 studies included any test of any potential moderator. Even more striking is the relative inattention to the question of what change processes underlie improvement. In only one of the studies was a candidate mediator or change process identified and its mediating role tested.
Taken together, our findings and our observations on the evidence suggest an agenda for future research on youth depression treatment. Clearly, a useful foundation has been laid, with evidence from 35 studies pointing to treatment effects that are significant, albeit markedly more modest than those reported in previous meta-analyses. Effects appear to be durable for the initial months following treatment but not when followed for 1 year or more; however, more evidence is needed regarding long-term holding power. Critical examination of the evidence suggests a need for increased use of active control conditions, meaningful follow-up assessment, intent-to-treat analysis, moderator assessment, and tests of proposed mechanisms of change. Much has been accomplished in 25 years of youth depression treatment research, but important work remains for the years ahead.
Preparation of this article was facilitated by support from the John D. and Catherine T. MacArthur Foundation (Research Network on Youth Mental Health [provided to John R. Weisz]); support from the National Alliance for Research on Schizophrenia and Depression (provided to Carolyn A. McCarty); and National Institute of Mental Health Grants K01 MH69892 (awarded to Carolyn A. McCarty), R01-MH57347, R21-MH63302, and R01-MH068806 (awarded to John R. Weisz).
We thank Bahr Weiss for generously sharing his statistical expertise and his data, David Lipsey and Will Shadish for their helpful statistical consultation, and the authors of clinical trials (including Joan Asarnow, Richard Harrington, Solveiga Miezitis, Laura Mufson, Michael Reed, and Kevin Stark) for supplying supplemental information about their samples and their outcome data to facilitate our coding and effect size computation.
John R. Weisz, Judge Baker Children’s Center, Harvard University.
Carolyn A. McCarty, University of Washington.
Sylvia M. Valeri, Brown University.
References marked with an asterisk indicate studies included in the meta-analysis.