Multiple outcomes are commonly analyzed in randomized trials, and interpretation of the results of trials with many outcomes is not always straightforward. We characterize the prevalence of and factors associated with multiple outcomes in reports of clinical trials of depression, the methods used to account for these outcomes, and the concordance between published analyses and original protocol specifications.
We conducted a PubMed search for randomized controlled depression trials with multiple outcomes published between January 2007 and October 2008 in 6 medical journals. Original study protocols were reviewed where available. Two abstractors collected data in parallel to determine trial registration information, the number of outcomes, and the analytical methods used.
Of the 55 included trials, nearly half reported more than 1 primary outcome, while almost all (90.9%, n=50) reported more than 2 primary or secondary outcomes combined. Relatively few of the studies (5.8%, n=3) adjusted for multiple outcomes. While most studies had published protocols in clinical trial registries (76.4%, n=42), many did not specify outcomes in the protocol (n=11), and a number of protocols had discrepancies with the published report.
Multiple outcomes are prevalent in randomized controlled depression trials, and appropriate statistical analyses to account for these outcomes are rarely used. Not all studies filed protocols, and there were discrepancies between these protocols and published reports. These issues complicate interpretability of trial results, and in some cases may lead to spurious conclusions. Promulgation of guidelines to improve analysis and reporting of multiple outcomes is warranted.
Multiple outcomes are often incorporated in randomized clinical trials (RCTs) due to interest in characterizing how a treatment influences a range of responses. Long-term mental health conditions, such as depression, are particularly reflective of this practice. Reporting more than one outcome in depression trials may be appropriate because a single measure may not sufficiently characterize the effect of a treatment on a broad set of domains. A lack of clear consensus on the most important clinical outcome, combined with the need to examine clinical effectiveness on related outcomes spanning disparate domains, encourages the use of multiple outcomes.
To address statistical design considerations, researchers often specify a small set of measures to serve as the primary outcomes, with another (often larger) set listed as secondary. While it is common practice to collect, analyze and report multiple measures, the efficient and appropriate analysis of multiple outcomes is not fully established. A number of approaches to accounting for multiple outcomes have been proposed [1–3], assessed and reviewed. The most common method for analyzing multiple outcomes is separate testing of each individual outcome, sometimes with but most often without adjustment for multiple testing. Another approach involves combining the multiple outcomes into a single (composite) outcome and performing a single test. A third approach undertakes global testing using simultaneous (joint) tests.
The choice of an appropriate method for dealing with multiple outcomes is important because clinical interpretation can be difficult in the presence of multiple conflicting results. Simultaneous or joint models that provide an overall test, along with separate reports of the individual outcomes, can provide useful additional information. Moreover, joint models can be more powerful when some outcome data are missing.
Study design features should be specified prior to patient enrollment and characterized in the study protocol. However, prior work by Al-Marzouki et al and Chan et al found discrepancies in outcomes between published study protocols and clinical trial reports. These discrepancies may stem from unanticipated changes while conducting the trial, such as modifications to trial inclusion and exclusion criteria to increase enrollment. Al-Marzouki and colleagues found major differences in 11 out of 37 trials; overall, there was a median of one outcome in published protocols and a median of two outcomes in the published reports. Chan et al concluded that reported outcomes are often incomplete and inconsistent with the registered protocols, potentially yielding biased and unreliable results. Specifically, statistically significant outcomes were more likely to be fully reported than non-significant outcomes, indicating that published results may overestimate the benefits of an intervention. Turner and colleagues found a similar trend among antidepressant clinical trials. Another study, by Viereck and Boudes, found that pharmaceutical industry practices do not generally make clinical trial protocols and results accessible to the general public. The FDA Amendment Act expanded the trial registry established under the FDA Modernization Act to encompass a larger set of protocols involving treatments and devices. Along with the requirements of the International Committee of Medical Journal Editors, these new mandates are likely to improve reporting.
We assessed the prevalence of multiple outcomes in clinical trials of depression. Major depression and related disorders were chosen because they have a profound personal, social and economic cost, and are the focus of a number of prevention and intervention trials. Examples of depression measures include a clinical diagnosis and measures of depressive symptoms (such as the Center for Epidemiologic Studies–Depression Scale, the Beck Depression Inventory [17, 18], the Hamilton Depression Rating Scale, and the Montgomery-Asberg Depression Rating Scale). The use of multiple outcomes in depression trials is particularly common because the disorder is multifaceted. No one measure encompasses all aspects of the disorder, and clinicians may be interested in the impact of a new treatment on different domains. As a result, it is naive to force a single outcome. Instead, use of a broad range of clinically relevant measures, along with a procedure to assess them globally, is more realistic and useful.
We also sought to determine whether there were important associations between number of outcomes and characteristics of the depression trial and to assess the concordance between reported outcomes and those specified in published study protocols.
We reviewed the use of multiple outcomes in randomized clinical trials with depression as a primary or secondary outcome that were recently published in six top-tier psychiatry or general medical journals (American Journal of Psychiatry (AJP), Archives of General Psychiatry (AGP), British Medical Journal (BMJ), Lancet, Journal of the American Medical Association (JAMA), and the New England Journal of Medicine (NEJM)). PubMed was used to obtain articles that matched "clinical trials", included the keywords "depression" or "depressive disorder", and were published between January 2007 and October 2008 in these six journals. These journals were selected because of their high impact and relevance to clinical researchers in psychiatry.
We abstracted data from the appropriate clinical trials registries: ClinicalTrials.gov, the Australian New Zealand Clinical Trials Registry (ANZCTR), and the International Standard Randomised Controlled Trial Number (ISRCTN) registry. We calculated the number of primary and secondary outcomes described in the protocol. In situations where it was difficult to determine the number of outcomes, a consensus method was undertaken by two of the authors (KT and NH).
We extracted the number of primary outcomes, secondary outcomes, and the methods (if any) used to account for multiple outcomes. An outcome was coded as primary if it was designated as such by the researchers in the abstract, methods, results or tables. Each article was reviewed independently by two of the authors (KT and NH) and a consensus process was used to address any inconsistencies in coding. Secondary outcomes included measures that were reported as randomized trial group comparisons. Outcomes listed as "additional", "tertiary" or "exploratory" were coded as secondary outcomes [e.g. 21]. Side effects and adverse events were not included in the count. If no distinction was made between primary and secondary outcomes, all were assumed to be primary outcomes.
The total number of outcomes was calculated as the sum of the number of primary and secondary outcomes. In addition, a categorical variable for the number of primary outcomes was created (1, 2–3, or 4+). Other abstracted variables included the sample size (coded as <100, 100–399, or 400+), the journal name, and the clinical trial registry code. If no registration code was reported in the paper, the registries were searched in order to abstract the appropriate protocol information.
Fisher's exact tests were used to test associations in cross-classification tables, the Kruskal-Wallis test was used to compare count outcomes by group, and the Spearman correlation was used to assess associations between counts of outcomes in the initial protocol and the final report. A p-value of 0.05 was used to assess statistical significance. Because our goals were exploratory, we did not adjust for the five tests that we performed. All p-values are two-tailed. Analyses were undertaken using Stata version 10.1.
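The analyses above were run in Stata 10.1. For readers who want to reproduce the same three tests in other software, a minimal sketch using Python's scipy.stats is shown below; the data are entirely hypothetical stand-ins, since the trial-level data are not reproduced here.

```python
from scipy import stats

# Hypothetical 2x2 table: protocol available in a registry (rows)
# by whether >1 primary outcome was reported (columns)
table = [[20, 22],
         [6, 7]]
odds_ratio, p_fisher = stats.fisher_exact(table)

# Hypothetical counts of total outcomes for three sample-size groups
small, medium, large = [8, 10, 7, 12, 9], [4, 5, 3, 6], [9, 11, 8, 10, 13]
h_stat, p_kw = stats.kruskal(small, medium, large)

# Hypothetical paired counts: primary outcomes in protocol vs. final report
protocol = [1, 1, 2, 1, 3, 1, 2, 4, 1, 2]
report   = [2, 1, 3, 5, 2, 1, 4, 4, 2, 6]
rho, p_sp = stats.spearmanr(protocol, report)

print(f"Fisher p={p_fisher:.3f}, Kruskal-Wallis p={p_kw:.3f}, Spearman rho={rho:.2f}")
```

Note that scipy's `fisher_exact` handles only 2x2 tables; exact tests for larger cross-classifications, as used for some tables in this paper, require other routines or software.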
Of the 105 studies initially retrieved, a total of 50 were excluded because they were not RCTs (n=31) (e.g. cohort study nested within a trial), were cost effectiveness studies (n=2), or did not have a depression measure as a primary or secondary outcome (n=17). After these exclusions, there were a total of 55 articles coded and analyzed (Figure 1 and Table 1). Just over half (52.7%, n=29) of the papers reported exactly one primary outcome. The distribution was heavily skewed to the right with 25.5% (n=14) of the studies reporting at least 5 primary outcomes.
Of the primary outcomes that had a depression component, the most commonly used were the Hamilton Depression Rating Scale (22.9%, n=11), the Montgomery-Asberg Depression Rating Scale (8.3%, n=4), Clinical Global Impression Scale (8.3%, n=4), and clinical diagnosis using the DSM-IV (8.3%, n=4). Table 2 displays the distribution of primary depression outcomes used in more than one study.
Secondary outcomes were also common, with a median of three outcomes. This distribution was also heavily skewed, with a maximum of 31 secondary outcomes (25th percentile 0, 75th percentile 6 outcomes). Almost all (94.5%, n=52) of the articles reported more than one primary and secondary outcome. The number of secondary outcomes was significantly larger (p=0.003) for papers reporting only 1 primary outcome (median=5) as compared to those with more than 1 primary outcome (median=0). The median of the total number of outcomes was 7 outcomes (25th percentile 4, 75th percentile 10 outcomes). Figure 2 displays boxplots of the distribution of the number of primary, secondary and total (sum of primary + secondary) outcomes.
A total of eight articles used a Bonferroni-type adjustment; however, five papers reporting this approach applied it to comparisons within multiple treatments and not for multiple outcomes [23–27]. Of the 52 articles with multiple outcomes, only 5.8% (n=3) used a Bonferroni adjustment. While Strong et al [28] specified only one primary outcome, the seven secondary outcomes analyzed were adjusted by utilizing a modified cutoff for statistical significance at 0.01. Welton et al adjusted for the 41 outcomes by using a Bonferroni-corrected alpha level of 0.0001. Lesperance et al clearly specified a single primary outcome and single secondary outcome in the abstract, and included 6 exploratory outcomes later in their analysis. They used a Bonferroni-like multiplicity adjustment by partitioning the experiment-wise alpha level into 0.033 for the primary outcome analysis and 0.017 for the secondary outcome analysis. All additional analyses, which they designated as exploratory, used the standard alpha level of 0.05. No articles used a Hochberg or similar procedure. None of the papers reported use of joint testing methods or global tests.
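To make the two adjustment procedures named above concrete, the following sketch (with made-up p-values, not data from any of the reviewed trials) implements the simple Bonferroni rule and the Hochberg step-up procedure:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Declare each outcome significant only if p < alpha / k."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

def hochberg_significant(p_values, alpha=0.05):
    """Hochberg step-up: with p-values sorted ascending, find the largest
    rank j such that p_(j) <= alpha / (k - j + 1), and declare the
    outcomes with ranks 1..j significant."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices, ascending p
    cutoff = 0
    for rank in range(k, 0, -1):  # step up from the largest p-value
        if p_values[order[rank - 1]] <= alpha / (k - rank + 1):
            cutoff = rank
            break
    significant = [False] * k
    for rank in range(cutoff):
        significant[order[rank]] = True
    return significant

print(bonferroni_significant([0.01, 0.04, 0.03, 0.20]))
print(hochberg_significant([0.01, 0.04, 0.03, 0.20]))
```

Hochberg's procedure is uniformly at least as powerful as Bonferroni: for instance, with p-values [0.04, 0.03, 0.02, 0.01] and alpha=0.05, Bonferroni flags only the smallest, while Hochberg flags all four.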
Only two articles used a composite measure as their primary outcome. Raskin et al  combined information from four cognitive tests (Verbal Learning and Recall, Symbol Digit Substitution, Two-Digit Cancellation, and Letter-Number Sequencing). The composite cognitive score was a weighted sum (in proportion to the time spent administering the test) that ranged from 0 to 51. Goldberg et al  used a measure of either recovery (8 weeks with partial or full remission) or recovering (4 weeks with low level of symptoms).
The median sample size of the reported RCTs was 200, with a minimum of 28 and a maximum of 7380. Studies with fewer than 100 or at least 400 participants reported significantly more outcomes than studies with between 100 and 400 participants (p=0.003; see Table 3). More papers with between 100 and 400 participants reported only 1 outcome than expected.
Of the 55 eligible papers, we linked 42 (76%) to published protocols, and of those, 31 (74%) reported information on outcomes. The 13 papers without protocols in the registries that we searched were published in AJP or AGP, while the 11 papers without outcomes specified in the protocol were published in AJP, AGP, JAMA, or the NEJM. There were no statistically significant differences between the distribution of the number of primary outcomes and clinical trial registry status (p=0.92). There were statistically significant differences between the sample size groups and the proportion with complete registry information on outcomes (p=0.005; 42% available for sample size <100, 60% for sample size between 100 and 400, and 62% for sample size >=400). Of the 31 trials with reported outcomes in the protocol, 74% increased the total number of outcomes reported in the final manuscript (by an average of 4.9 outcomes). The Spearman correlation between the number of primary outcomes reported in the published manuscript and that reported in the protocol was modest (n=31, correlation=0.14; test of no association, p=0.45).
The CONSORT [34, 35] statement provides guidance and structure to investigators when reporting the results of clinical trials. These guidelines are intended to clarify the key outcomes of these investigations, and ensure that their description is detailed and consistent within the abstract, methods, results and tables. Furthermore, while the CONSORT statement recommends only a single primary outcome, it does not directly specify statistical methods for appropriately handling multiple outcomes. A recent study examining statistical problems found by reviewers in high-impact psychiatry journals demonstrated the need to improve reporting of multiple statistical tests .
The CONSORT 2010 statement strengthens the discussion of multiple outcomes, and notes that while a trial may have more than one primary outcome, "having several primary outcomes, however, incurs the problems of interpretation associated with multiplicity of analyses … and is not recommended" (p. 7).
Nearly half of depression clinical trials published between January 2007 and October 2008 in leading medical and psychiatry journals reported more than one primary outcome, while nearly all reported more than one primary or secondary outcome. The median number of total outcomes (not including our category of tertiary outcomes for side effects or similar) was seven. While depression is a multifaceted disorder that manifests itself in many ways over multiple domains, there is a need to clearly specify which outcomes are being considered and how they will be accounted for. No single primary outcome is appropriate for all depression studies.
We also found that determining the number of primary and secondary outcomes for many of the articles included in this study was not straightforward, with relatively few clearly and consistently specifying primary and secondary outcomes [e.g. 28, 30, 37].
Separate analyses, with no correction for multiplicity, were the most common method to analyze multiple outcomes. A familiar drawback of this approach is the risk of inflating the Type-I error rate (the likelihood of obtaining statistically significant results by chance alone). While it is critically important that multiple domains of a disorder are discussed, interpretation of a large number of p-values by a clinical reader can be challenging. While we focused on randomized trials, similar issues arise in observational studies. Failure to account for the multiplicity of comparisons could lead to invalid inferences and spurious conclusions. At the very least, researchers reporting a profusion of results without adjustment should address the internal consistency of their findings.
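The scale of this inflation is easy to quantify: for k independent tests of true null hypotheses, each at alpha = 0.05, the chance of at least one spurious "significant" result is 1 - 0.95^k. A quick illustration:

```python
# Familywise Type-I error rate for k independent tests at alpha = 0.05.
# At the median of 7 outcomes observed in this review, roughly 30% of
# trials with no true effects would show at least one significant result.
alpha = 0.05
for k in (1, 5, 7, 10, 31):
    fwer = 1 - (1 - alpha) ** k
    print(f"k={k:2d}: P(at least one false positive) = {fwer:.3f}")
```

The calculation assumes independent tests; correlated outcomes reduce the inflation somewhat, but the familywise error rate still exceeds the nominal 0.05 whenever more than one unadjusted test is performed.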
The appropriate use of corrections for multiplicity is not always straightforward [38, 39]. Rothman  notes that scientists need to explore multiple leads in the search for better interventions and treatments, and that inappropriate use of multiplicity adjustment may obscure possibly important findings. Nonetheless, inflation of Type I error is a serious concern, and in the setting of randomized trials, this must be accounted for in the trial protocol. Several papers employed a Bonferroni-type correction to address the issue of multiplicity. A particularly creative approach was undertaken by Lesperance et al , where the primary outcome (HAM-D) was tested at alpha=0.033 while the secondary outcome (BDI) was tested at 0.017.
A common critique of the Bonferroni method is that it will tend to be conservative when the outcomes are correlated. However, the simulations of Yoon et al indicated that for settings similar to that of the CATIE trial, with 5 outcomes, the Bonferroni adjustment performed adequately when correlations were moderate. For psychiatric studies, it is rare to have highly correlated endpoints.
Further use of more sophisticated approaches to account for multiplicity may be warranted. Joint testing is particularly attractive in this setting [4, 7]. By capitalizing on the correlation of multiple outcomes, these methods are generally more powerful than separate analyses  or Bonferroni adjustment . While more complicated than separate testing of multiple outcomes with multiplicity adjustment, these approaches are straightforward to fit in general purpose statistical software [4, 7]. Changes in the scale of research and the use of large data banks to test hypotheses will complicate future evidence-based medicine, and will likely exacerbate these issues .
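As one concrete illustration of joint testing (an illustration of ours; the methods cited in [4, 7] may use different joint models), a two-sample Hotelling's T^2 global test of equal mean vectors across two trial arms can be written in a few lines, here with simulated correlated outcomes:

```python
import numpy as np
from scipy import stats

def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2: a single global test that the mean
    vectors of all outcomes are equal across two trial arms.
    x, y: (n_subjects, n_outcomes) arrays, one per arm."""
    nx, p = x.shape
    ny = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # pooled sample covariance across the two arms
    pooled = ((nx - 1) * np.cov(x, rowvar=False) +
              (ny - 1) * np.cov(y, rowvar=False)) / (nx + ny - 2)
    t2 = (nx * ny) / (nx + ny) * diff @ np.linalg.solve(pooled, diff)
    # convert T^2 to an F statistic and return the joint p-value
    f_stat = t2 * (nx + ny - p - 1) / ((nx + ny - 2) * p)
    return stats.f.sf(f_stat, p, nx + ny - p - 1)

# Simulated example: two correlated outcomes, treatment effect on both
rng = np.random.default_rng(1)
cov = [[1.0, 0.6], [0.6, 1.0]]
control = rng.multivariate_normal([0.0, 0.0], cov, size=100)
treated = rng.multivariate_normal([0.8, 0.8], cov, size=100)
print(f"joint p-value: {hotelling_t2(control, treated):.4g}")
```

A single joint p-value of this kind summarizes the overall treatment effect across all outcomes, sidestepping the need to reconcile a collection of per-outcome tests; individual outcome estimates can still be reported alongside it.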
Another troubling problem, unrelated to the multiplicity issue, concerns missing data. When outcome data are only partially observed, separate analyses of the outcomes will lead to the inclusion of different subjects for the analysis of each outcome. The reader is then faced with interpreting treatment effects based on different samples of subjects, as well as assessing assumptions regarding missingness. Joint models are particularly attractive in this setting, since they incorporate partially observed data and pool information across outcomes.
The concordance between the published protocols in registries and the number of published outcomes was also discouragingly low, albeit similar to findings reported for cardiology, rheumatology and gastroenterology RCTs. Although the 2007 FDA Amendments Act now requires investigators and sponsors to submit information for any applicable clinical trial to the NIH/NLM, complete adherence to this act will take some time to become apparent in published trial results. While we anticipate that more investigators will publish their protocols in a more timely and complete fashion as part of new journal requirements, selective reporting remains a potential problem [8, 9]. Investigators must not "torture their data until they speak" by examining additional outcomes, undertaking unplanned subgroup analyses or similar mischief. The addition of a CONSORT checklist item to note changes in trial outcomes after the trial commences should also help with this issue.
To help improve practice in this area, we suggest that all clinical trial reports:
Widespread adoption of these recommendations, all of which flow from the CONSORT guidelines and are consistent with the FDA Modernization Act, could be easily incorporated into common practice. If implemented, they could help improve the timely dissemination and appropriate interpretation of results from clinical trials.
Partial support was provided by the National Institute of Mental Health grant R01-MH54693 and the Smith College Tomlinson Fund. Thanks to Ian White and to the anonymous reviewers for many useful comments on a previous draft.
Table 1. Full table of articles
Note: The full table of articles, with citations, outcome counts and methodology used is available as an online Appendix (see separate attachment in submission)