Biases of the enriched design for maintenance efficacy
Before examining analyses of maintenance studies in MDD, we should understand how such studies are designed so as to appreciate why they are mostly biased in favor of antidepressants.
Most maintenance studies of antidepressants are preceded by an acute major depressive episode, frequently treated with the antidepressant being studied. Patients who respond to the antidepressant are then entered into the maintenance study. Those who do not respond to or tolerate the antidepressant are not included; this already biases the study in favor of the antidepressant. Patients are then followed for one to two years. Most relapses, however, occur in the first six months of follow-up. Preventing these early relapses does not prove maintenance efficacy, because the maintenance phase of treatment in MDD does not begin until one year after the acute episode ends, which is when we know that the natural remission of an acute depressive episode happens [16]. Thus, one year or longer is the relevant time frame to assess prevention of new episodes; there is a clear consensus on this issue in the depression literature. Even if one did not want to focus on one year, it would be reasonable to say that at least 6 months after the acute episode is needed to assess maintenance efficacy.
Most maintenance RCTs fail, as we will see, to pass this simple test.
This problem has been much discussed in the bipolar disorder literature [18], and we have related it previously to the maintenance studies of neuroleptics in bipolar disorder [19]. The early literature with lithium included both prophylaxis and relapse prevention methodologies. In the prophylaxis design, “all comers” are included in the study: in other words, any patient who is euthymic, no matter how that person got well, is eligible to be randomized to drug versus placebo or control, including those with recent manic or depressive episodes. In the relapse prevention design, typically only those patients who acutely respond to the drug being studied are then eligible to enter the randomized maintenance phase. Those who responded to the drug are then randomized to stay on the drug or be withdrawn from it (usually abruptly, sometimes with a taper) and switched to placebo. The “prophylactic” and “relapse prevention” designs are obviously not addressing the same questions about drug efficacy. In the lithium studies in which the relapse prevention design was used (i.e., only initial lithium responders to acute treatment were included), there was evidence of lithium withdrawal following acute treatment in the placebo group [20]. The problem is that, by design, those who reach the “maintenance” phase and are treated with placebo are in fact persons who responded acutely to the study drug (lithium) and were then abruptly discontinued. Thus, if the placebo relapse rate is very high and almost exclusively limited to the first 1–2 months after study initiation, then one is observing a withdrawal effect involving a relapse back into the same acute episode that had just been treated, rather than a new episode, i.e., a recurrence. In other words, the relapse prevention design confounds prevention of relapse back into the index episode with prevention of a new episode.
Besides the above problem of withdrawal relapse, a key aspect of the relapse prevention design is that it is definitely biased in comparison with active controls, and it is very likely biased against placebo as well. This is simply seen by realizing that though such studies are randomized, they are only randomized after preselecting all subjects to be randomized as responsive to only one of the two arms of the study. Thus, randomization is, in effect, instituted after the study has already been biased in favor of one of the two treatments.
To put it simply, if some people like chocolate ice cream and others like vanilla ice cream, and we preselect only those who like chocolate ice cream to be randomized to again receive chocolate ice cream versus vanilla ice cream, we will find that most chocolate ice cream-lovers will continue to prefer chocolate ice cream. This does not prove that chocolate ice cream is superior to vanilla ice cream.
The same principle will apply to studies in which patients are preselected to respond to the study drug, and later randomized to stay on study drug or receive placebo. Again, the study would be biased in favor of study drug, and would not prove inherent superiority of study drug over placebo.
A truly randomized study would have to either preselect subjects to be responsive to both treatments being studied, or, as in the traditional prophylaxis study, make no preselection at all.
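The ice cream analogy can be made concrete with a toy calculation. The sketch below is purely illustrative: the function name, the 50/50 split of preferences, and the simplifying assumption that each person is satisfied only by his or her preferred flavor are ours, and are not data from any study.

```python
def satisfaction_rates(share_chocolate_lovers, enriched):
    """Return (chocolate_arm, vanilla_arm) satisfaction rates.

    Toy assumption: a person is satisfied only by the preferred flavor.
    share_chocolate_lovers: fraction of the source population preferring chocolate.
    enriched: if True, only chocolate lovers are enrolled before randomization.
    """
    if not 0.0 <= share_chocolate_lovers <= 1.0:
        raise ValueError("share_chocolate_lovers must be between 0 and 1")
    if enriched:
        if share_chocolate_lovers == 0.0:
            raise ValueError("no chocolate lovers available to enroll")
        # Preselection: every enrolled subject is a chocolate lover.
        enrolled_chocolate_share = 1.0
    else:
        # "All comers": enrolled subjects mirror the source population.
        enrolled_chocolate_share = share_chocolate_lovers
    # Randomization balances the arms, so each arm has the same mix of preferences.
    chocolate_arm = enrolled_chocolate_share
    vanilla_arm = 1.0 - enrolled_chocolate_share
    return chocolate_arm, vanilla_arm


all_comers = satisfaction_rates(0.5, enriched=False)
enriched = satisfaction_rates(0.5, enriched=True)
print("all comers (chocolate, vanilla):", all_comers)
print("enriched   (chocolate, vanilla):", enriched)
```

With preferences evenly split, the all-comers comparison shows no difference between flavors, while the enriched comparison shows chocolate "winning" completely, even though nothing about the flavors has changed.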
These inherent biases of the enriched maintenance design will be key to analyzing meta-analyses of the maintenance antidepressant efficacy literature. None of those reviews, save one, understand the relevance of the enriched design, and thus they draw incorrect conclusions, both for and against antidepressants.
The standard review of maintenance efficacy of antidepressants often involves reference to the Cochrane Collaboration meta-analysis of published studies. In that report, 10 studies with SRIs (n=2080) and 15 with TCAs (n=881), mostly with one year of follow-up, showed maintenance benefit versus placebo. The longest follow-up with modern antidepressants was two years with venlafaxine [4]. An obvious problem with simply stating the results this way is that this meta-analysis does not address the issue of publication bias. If the acute antidepressant studies are any indicator, it is likely that some negative maintenance studies with antidepressants in MDD exist but are unpublished, and they would reduce this reported effect size.
A more important issue is the problem of enriched maintenance designs, which bias studies in favor of the drug for which the sample is enriched (or in favor of placebo, if analyses are enriched in the opposite direction, as discussed below). The only analysis of antidepressant RCTs in MDD which has addressed the problem of enrichment is a recent paper by Briscoe and El-Mallakh [22]. They address the problem of enrichment by limiting data analysis to 6 months or longer after the acute depressive episode. By so doing, they exclude those who relapsed soon after the maintenance study started, right after the end of the acute episode. Those who received antidepressant and were switched to placebo would relapse rapidly in the first few months of the maintenance treatment; this discontinuation effect is an artifact of the enriched design, and would not, in this view, demonstrate true recurrence of a new episode, but rather immediate relapse into the same episode that had been present in prior weeks. Only 5 RCTs provided data on relapse rates before and after 6 months. Limiting analyses to those studies, the researchers found that, as expected given the biases of the enriched design, the majority of relapses (about 2/3) occurred in the first six months of follow-up; these are not new episodes of depression, but withdrawal relapse into the same acute episode that had just occurred a few weeks or months earlier, before the maintenance study began. In the one-third of relapses occurring after 6 months, which test the proposition of whether new episodes were truly being prevented, 4 of 5 studies found no benefit with antidepressants over placebo.
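The idea of separating early relapses from later ones can be sketched in a few lines. This is a minimal illustration only: the function names and the relapse times are hypothetical, chosen so that two-thirds of relapses fall before the 6-month cutoff, and they do not reproduce data from any of the trials discussed.

```python
def split_relapses(relapse_months, cutoff_months=6.0):
    """Split relapse times (months after randomization) at the cutoff.

    Relapses before the cutoff are treated as possible withdrawal relapses
    into the index episode; relapses at or after it as possible new episodes.
    """
    early = [m for m in relapse_months if m < cutoff_months]
    late = [m for m in relapse_months if m >= cutoff_months]
    return early, late


def early_share(relapse_months, cutoff_months=6.0):
    """Fraction of all relapses that occur before the cutoff."""
    if not relapse_months:
        raise ValueError("no relapses to analyze")
    early, _ = split_relapses(relapse_months, cutoff_months)
    return len(early) / len(relapse_months)


# HYPOTHETICAL relapse times for a placebo arm, in months after randomization.
hypothetical_placebo_relapses = [0.5, 1, 1, 2, 3, 5, 7, 9, 11]

early, late = split_relapses(hypothetical_placebo_relapses)
print("relapses before 6 months:", len(early))
print("relapses at or after 6 months:", len(late))
print("share before 6 months:", round(early_share(hypothetical_placebo_relapses), 2))
```

Only the later group bears on whether new episodes are being prevented, so a drug versus placebo comparison restricted to that group is the one relevant to maintenance efficacy.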
The venlafaxine PREVENT maintenance study
Many authors cite a recent long, large study of venlafaxine (VNL) as evidence for antidepressant maintenance efficacy in MDD [5]. This study purports to show major benefits with VNL for maintenance treatment of MDD, but it really reflects what we might call “super-enrichment”: the study repeatedly picks out those who respond to venlafaxine and rerandomizes them to VNL or placebo, thus repeatedly selecting a smaller and smaller group of highly VNL-responsive patients. By two years, this small group is indeed very responsive to VNL, but the result is hardly generalizable to a patient who might newly be prescribed VNL.
The specific data are as follows. In that study, 1096 MDD patients initially received venlafaxine (VNL) or fluoxetine for acute depression. 715 responders were enrolled in a 6-month blind continuation phase on the same treatment. 258 (36.1%, 258/715) of those acute responders remained well at 6 months and entered maintenance phase A for one year of treatment (randomized to VNL vs placebo) [23]. 131 responders (83 VNL, 48 placebo) in maintenance phase A entered phase B for a second year of maintenance (VNL responders were re-randomized to VNL vs placebo; placebo responders stayed on placebo; fluoxetine responders stayed on fluoxetine).
In the first year of maintenance treatment in 258 responders, 23% of VNL-treated patients relapsed vs 42% with placebo. Thus 77% of the VNL group (n=83) stayed well for one year after already preselecting those who had stayed well for 6 months (n=258), selected after initially responding to treatment for an acute episode (n=715), as described in the previous paragraph. This is only 11.6% (83/715) of initial sustained responders.
Only 12.5% of placebo responders at one year relapsed at two years, but, in rerandomized VNL responders (another super-enrichment on top of all the prior enriched selection phases), 44.8% of the placebo group relapsed at 2 years vs 8.0% with VNL. Or, as the pharmaceutical industry marketing emphasized, 92% of venlafaxine patients remained well at 2 years of follow-up. This 92% seems like a huge number. But, because of super-enrichment, it represents the repeated selection of a tiny group of highly VNL-responsive patients: it is 92% of the 11.6% above (those who responded at one year), which is 10.7% of the original sustained responders. Once dropouts are included, the number of patients still treated at two years, out of the initial sample of over 1000 patients, was 15 subjects with placebo and 31 subjects with VNL, or 4.2% of the original sample.
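The super-enrichment arithmetic can be checked directly. The short sketch below uses only the counts and percentages quoted in the text; the variable and function names are ours and are introduced purely for illustration.

```python
# Counts and rates quoted in the text for the venlafaxine study.
initial_sample = 1096            # patients treated for acute depression
acute_responders = 715           # responders entering 6-month continuation
vnl_well_after_year_one = 83     # VNL patients entering maintenance phase B
vnl_relapse_rate_year_two = 0.08 # relapse on VNL during the second year
still_treated_at_two_years = 15 + 31  # placebo + VNL subjects remaining


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# VNL patients well at one year, as a share of acute responders.
share_year_one = pct(vnl_well_after_year_one, acute_responders)

# Those still well on VNL at two years, as a share of acute responders.
well_at_two_years = (1 - vnl_relapse_rate_year_two) * vnl_well_after_year_one
share_year_two = pct(well_at_two_years, acute_responders)

# Patients still in treatment at two years, as a share of the initial sample.
share_still_treated = pct(still_treated_at_two_years, initial_sample)

print(share_year_one, share_year_two, share_still_treated)
```

Each percentage is taken against an earlier, larger denominator, which is what makes the shrinking of the sample visible; the 92% figure, by contrast, uses only the last and smallest denominator.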
Antidepressant discontinuation meta-analysis
The most recent review of the maintenance MDD literature represents a unique analysis [8]. The authors essentially conducted an enriched study of placebo response; in other words, they selected the data they would analyze based on a sample enriched for placebo responders, and biased against drug response. Then they concluded that drugs were ineffective and even harmful.
All they really proved – once again – is that the enriched maintenance design is biased against whatever one wants to bias it against.
This is the converse of the standard enriched design maintenance study, as described above, which is enriched for drug response and biased against placebo response. The same limitations apply in both cases: enrichment does not prove the inefficacy or harm of the treatment that is not being enriched, nor does it prove the efficacy or benefit of the treatment that is being enriched.
In this review, the authors collected 7 studies of maintenance treatment with antidepressants versus placebo in which initial acute treatment was provided in both arms; in these 7 studies, the maintenance phase involved continuation of placebo in those patients who had responded to placebo acutely. In those acute placebo responders, relapse in the maintenance phase was (not surprisingly) uncommon (24.7%). In contrast, 39 trials involved acute treatment with antidepressant versus placebo, and the reviewers selected those patients who responded to antidepressants acutely and were then randomized to receive placebo in maintenance treatment. In this group, which reflects antidepressant discontinuation after acute response, there was 42.1% relapse.
The authors interpret these results as showing harm with antidepressants, which they speculatively relate to animal data on monoaminergic effects of these agents. They conclude that the biological effects of antidepressants will actually increase the risk of relapse in long-term treatment, compared to no treatment (placebo).
This interpretation ignores the problems of the enriched design, and, as a result, this kind of analysis highlights the importance of always comparing treatment results to the natural history of an illness.
This analysis enriches the results for placebo response. The patients treated acutely who respond to placebo stay on placebo; the patients treated acutely who respond to antidepressant come off antidepressant. One should ask why these acute placebo responders responded to placebo. Did they actually respond to placebo, in the sense that the inert pill, with its concomitant psychosocial warmth factors, had a direct effect producing response? Or is placebo a stand-in for natural recovery, that is, spontaneous remission, as part of the natural history of recurrent, episodic depression?
The latter is a possibility, at least for part, if not all, of the placebo “response.” Over a century of natural history research, especially before the treatment era in past decades, has established the fact that there is an episodic course to recurrent unipolar depression, in which there are periods of acute symptomatology and periods of natural remission [12]. During periods of natural remission, patients will stay well, often for years, without any treatment. The recovery of some patients with placebo, in those 7 studies, may well reflect natural cycling out of acute episodes in unipolar depression. Once patients have cycled out of acute episodes, they are in natural remission, which, in the case of recurrent unipolar depression, based on a century of research, usually involves over a year of remission before the next depressive episode [16]. In the 7 placebo maintenance response studies, no study exceeded 12 months of follow-up, and, in reading the appendix attached to the meta-analysis, it appears that the mean duration of follow-up was less than two months in six of the seven studies (range 1.4–1.9 months).
In other words, the lack of relapse really means that a patient improved spontaneously from acute depression in a two month study (the usual duration of acute depression studies), and then that patient remained well for another two months. This is not robust evidence of long-term stability on placebo. It represents the fact that when spontaneous remission occurs from acute depression, it lasts at least two months (and indeed usually up to a year), without any treatment.
In contrast, in the antidepressant discontinuation studies analyzed, all patients responded in acute treatment (usually two months in duration), and then 42% relapsed in maintenance treatment after the antidepressant was discontinued. One might question whether serotonin withdrawal syndrome, which can mimic depressive episodes, occurred in some cases. But separate from that issue, a century of natural history research has led to a clear consensus that the mean duration of a typical depressive episode in unipolar depression is 6–12 months [12]. If a patient is treated to recovery at two months, and then the treatment is stopped, such a patient will relapse into the mood episode rapidly, because the 6–12 month period of the biological persistence of a mood episode has not yet elapsed. This has been proven repeatedly, with antidepressants in depression and with neuroleptics in mania [19].
In sum, this creative analysis of the maintenance MDD literature suffers from a complete lack of awareness of the impact of the enriched design; the analysis is enriched for placebo response, and thus biased against antidepressant effect. The most conceptually parsimonious, and empirically well-supported, interpretation, based on extensive clinical literature in human beings (as opposed to speculative biological extrapolations from animal studies), would be to view these results as mainly reflective of the natural history of depression - not specific harm from antidepressants nor special benefit from placebo.
Maintenance data in STAR*D
Though STAR*D is mainly reported in terms of its acute data, one analysis so far also provides maintenance data [1], and it is perhaps underappreciated that the STAR*D maintenance data may be the best evidence we have to date on long-term efficacy with antidepressants in unipolar depression. Further, STAR*D was designed to be, and is, generalizable to the real world of complex, comorbid, recurrently depressed patients, as opposed to the cleaner populations studied in most RCTs (designed for FDA registration by the pharmaceutical industry).
As noted previously, STAR*D is a double-blind randomized study; all the following maintenance data after the first phase of treatment (i.e., with the dozen or so antidepressant treatments given besides citalopram) involve randomized, not observational, data.
The basic results are as follows. Of subjects who acutely responded or remitted with antidepressants in STAR*D, only about one-half stayed well at one year (sustained remission). In other words, even after preselecting those patients who have acute benefit with antidepressants, as noted above, only one-half maintain that benefit. Since one-half get acute benefit, and one-half of that group have sustained maintenance benefit, only one-quarter of the overall sample has long-term maintenance remission with antidepressants in unipolar depression [20]. Based on STAR*D, there appears to be much less long-term benefit with antidepressants in unipolar depression than has often been assumed.
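The one-quarter figure follows from multiplying the two proportions. The sketch below treats the "about one-half" values from the text as exact fractions, which is a simplifying assumption; the function name is ours.

```python
from fractions import Fraction


def overall_sustained_benefit(acute_benefit, sustained_given_acute):
    """Share of the whole sample with sustained benefit.

    acute_benefit: share of the sample with acute response or remission.
    sustained_given_acute: share of those acute responders who stay well.
    """
    return acute_benefit * sustained_given_acute


# Approximate proportions from the text, treated here as exact halves.
overall = overall_sustained_benefit(Fraction(1, 2), Fraction(1, 2))
print(overall)
```

The second proportion is conditional on the first, so it cannot be read as a rate for the overall sample without this multiplication.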
Objections to our critique of enriched maintenance designs
The above critique of enriched maintenance designs is neither widely known nor generally accepted. It is novel, rarely stated, and, when stated, strongly opposed by most researchers involved with maintenance studies in psychopharmacology.
There has not been much published discussion of this topic, but one objection that could be raised is that the enriched design is not biased because those who respond acutely to a drug treatment include both “true drug responders” and “placebo drug responders”, meaning that some of them would have responded to placebo had they been given placebo. Thus the design is not solely biased towards the study drug. This objection would only make sense if all patients were equally likely to respond to drugs or placebo; if 50% of patients “really” responded to drug (true drug response) and 50% would have responded to placebo had it been given (placebo drug response), then a maintenance randomization of those acutely responsive subjects to drug versus placebo would be valid. Ironically, this would only be the case if the critique of Kirsch and colleagues [2] is correct, i.e., if antidepressants are not more effective than placebo for acute depression.
If antidepressants are more effective than placebo for acute depression in most patients, as we believe we showed earlier in this article, then the percentage of true drug responders should be higher than the percentage of those who would have responded to placebo anyway (placebo drug responders). In a hypothetical group of acutely depressed patients treated with antidepressant X, and later randomized to a maintenance study of X versus placebo, the reality is that there would not have been a 50–50 split between true drug responders and placebo drug responders before maintenance randomization; the split would be 60–40, or 70–30, or even higher in favor of drug X.
In other words, since antidepressants are better than placebo acutely, enrichment for acute efficacy before maintenance RCTs is indeed biased in favor of antidepressants as opposed to later treatment with placebo. Enrichment entails bias.
Interestingly, many psychiatric researchers appear to fully understand this critique as applied to the maintenance meta-analysis by Andrews and colleagues [8]; they appreciate that such an analysis entails “apples and oranges”, picking out placebo responders and comparing how they later did when continued on placebo, versus drug responders and how they later did when switched to placebo. Placebo responders are just different from drug responders, it is said. We agree. All placebo responders, by definition, respond to placebo, while probably only some would respond to drug. Thus such analyses are biased in favor of placebo response.
But while this enriched method (a species of selection bias that is unique to maintenance clinical trial design [10]) is rejected by many in our field in relation to the claim that placebos are as good as or better than antidepressants, the exact same method is used to assert that antidepressants are more effective than placebo. The reason for such selectivity about accepting or rejecting the same research methodology is not entirely pellucid.