This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The interpretation of the results of active-control trials regarding the efficacy and safety of a new drug is important for drug registration and subsequent clinical use. It has been suggested that non-inferiority and equivalence studies are not reported with the same quantitative rigor as superiority studies.
Standard methodological criteria for non-inferiority and equivalence trials, including design, analysis and interpretation issues, were applied to 18 recently conducted large non-inferiority (15) and equivalence (3) randomized trials in the field of AIDS antiretroviral therapy. We used the continuity-corrected non-inferiority chi-square to test the treatment difference, together with its 95% confidence interval, against the predefined non-inferiority margin.
The pre-specified non-inferiority margin ranged from 10% to 15%. Only 4 studies provided justification for their choice. 39% of the studies (7/18) reported only intent-to-treat (ITT) analysis for the primary endpoint. When on-treatment (OT) and ITT statistical analyses were provided, ITT was favoured over OT for results interpretation for all but one study, inappropriately in this statistical context. All but two of the studies concluded there was "similar" efficacy of the experimental group. However, 9/18 had inconclusive results for non-inferiority.
Conclusions about non-inferiority should be drawn on the basis of the confidence interval analysis of an appropriate primary endpoint, using the predefined criteria for non-inferiority, in both OT and ITT, in compliance with the non-inferiority and equivalence CONSORT statement. We suggest that the use of the non-inferiority chi-square test may provide additional useful information.
Equivalence and non-inferiority randomized controlled trials are the standard research methodology to demonstrate that a new treatment is equivalent or non-inferior to standard therapy (active-control) in terms of efficacy. While an equivalence trial would use the 2-sided 95% confidence interval of the difference between the 2 trial arms, the non-inferiority trial would usually use the 90% confidence interval of the difference, if a 1-sided 5% rather than 2.5% significance test was considered a priori acceptable. Because it is impossible to prove exact equality, the goal in a non-inferiority trial, in situations where the effect compared to placebo is large, is to rule out differences of clinical importance in the primary outcome between the two treatments.
Issues, difficulties and controversies surrounding non-inferiority trials have long been recognized and extensively reported in many medical settings, including human immunodeficiency virus (HIV) infection [2,3]. Highly active antiretroviral therapy (HAART) delays progression to the acquired immunodeficiency syndrome (AIDS) and increases survival among HIV-infected patients. With efficacy rates of 70% and 75% respectively, the room for more efficacious antiretroviral agents has become very narrow. However, long-term toxicities, pill burden and genotypic resistance call for treatment simplification and alternative new agents. As a consequence, the number of non-inferiority trials in the AIDS therapy literature has been growing in recent years. Some authors chose to use the terms "equivalence" and "non-inferiority" interchangeably, regardless of the hypothesis of the study. Given that the question of interest is not symmetric, we think that they are better described as "non-inferiority" trials.
Because efficacy in viral suppression remains the major outcome, new drugs should first prove non-inferiority with respect to prolonged control of HIV replication, as the primary endpoint. Second, the new drugs should provide other advantages. Inevitably, there may have been some tension between marketing purposes and scientific issues in the published reports of those trials. In this paper, our objective was to verify the validity of recently published non-inferiority AIDS trials regarding the primary endpoint.
Our aim was to consider a cohort of equivalence or non-inferiority trials published in the area of HIV/AIDS, after HAART became available. We performed a MEDLINE search using the terms equivalence OR non-inferiority AND random* AND HIV (1) and abacavir AND random* (2). 64 (1) and 136 (2) articles were identified. 5 (1) and 5 (2) were selected because they fulfilled the following requirements: randomized controlled clinical trial with 48-week minimum follow-up, initially designed as a non-inferiority or equivalence trial with a prespecified non-inferiority margin, virological primary endpoint and publication in New England Journal of Medicine, JAMA, Lancet, AIDS, Clinical Infectious Diseases, Journal of Infectious Diseases or Journal of Acquired Immune Deficiency Syndrome between 2001 and 2006. Eight additional articles were identified by examining cross-references or through the authors' knowledge of their existence.
We applied traditional methodological requirements for non-inferiority and equivalence trials adapted from Kirshner, Jones et al., McAlister and Sackett, and Piaggio et al. to eighteen [10-27] active-control trials. We also applied proposed standards for the reporting of non-inferiority and equivalence trials adapted from Le Henanff.
The 95% confidence interval of the treatment difference, in intent-to-treat (ITT) or on-treatment (OT) analysis, was computed using the normal approximation, based on available data from the flow chart, results section and figures. Two selected studies (ALIZE and SEAL) predefined a 90% confidence interval of the treatment difference, but their conclusions were not affected by the use of the 95% confidence interval (which was used in this paper for homogeneity). Two other selected studies (BMS-045 and CONTEXT) defined the primary endpoint as the log10 reduction in HIV viral load, using a time-averaged difference method. For homogeneity with other studies, we considered the more pertinent criterion (closer to clinical practice) of the percentage of patients with undetectable viral load (< 50 copies/ml or < 400 copies/ml) at week 48, which was reported as a secondary endpoint.
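The confidence interval step described above can be sketched as follows. This is a minimal illustration, assuming the usual Wald (normal approximation) interval for a difference of two proportions; the function names and the example counts are hypothetical and are not taken from any of the reviewed trials.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci_difference(x_new, n_new, x_ctrl, n_ctrl, level=0.95):
    """Two-sided Wald interval for (new - control) success proportions.

    Assumption: plain normal approximation with unpooled variances.
    """
    p_new = x_new / n_new
    p_ctrl = x_ctrl / n_ctrl
    diff = p_new - p_ctrl
    se = sqrt(p_new * (1 - p_new) / n_new + p_ctrl * (1 - p_ctrl) / n_ctrl)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    return diff - z * se, diff + z * se


def is_non_inferior(lower_bound, margin):
    """Non-inferiority holds when the lower bound stays above -margin."""
    return lower_bound > -margin


# Hypothetical counts: 140/200 successes (new drug) vs 150/200 (control).
lower, upper = wald_ci_difference(140, 200, 150, 200)
print(round(lower, 3), round(upper, 3))
print("non-inferior at 12%:", is_non_inferior(lower, 0.12))
print("non-inferior at 15%:", is_non_inferior(lower, 0.15))
```

With these hypothetical counts the same data support non-inferiority at a 15% margin but not at 12%, which is why the margin has to be fixed before the data are seen.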
In case of missing data, the corresponding author of the paper was contacted. When only percentages were available, with several possible numerators due to rounding, we chose the numerator on a worst case basis. If original data were censored, we used the cumulative incidence of the primary endpoint in each arm.
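The worst case choice of numerator can be illustrated with a short sketch. It assumes that "worst case" means the count least favourable to the experimental drug (fewest successes in the experimental arm, most in the control arm); this reading, the function names and the example figures are our assumptions for illustration only.

```python
def candidate_numerators(pct, n, decimals=0):
    """All success counts k whose percentage of n rounds to pct.

    pct is the reported percentage, n the arm size, decimals the number
    of decimal places in the report (assumed, not taken from any trial).
    """
    half_width = 0.5 * 10 ** (-decimals)
    return [k for k in range(n + 1)
            if abs(100.0 * k / n - pct) <= half_width + 1e-9]


def worst_case_numerator(pct, n, arm, decimals=0):
    """Pick the count least favourable to the experimental arm.

    Assumption: fewest successes if arm == "experimental",
    most successes if arm == "control".
    """
    candidates = candidate_numerators(pct, n, decimals)
    if not candidates:
        raise ValueError("no count is compatible with the reported percentage")
    return min(candidates) if arm == "experimental" else max(candidates)


# Hypothetical report: "70%" success in an arm of 250 patients.
print(candidate_numerators(70, 250))
print(worst_case_numerator(70, 250, "experimental"))
print(worst_case_numerator(70, 250, "control"))
```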
Significance testing to establish non-inferiority between the two arms of a study was performed using the continuity-corrected chi-square of Dunnett and Gent for non-inferiority, in intent-to-treat or on-treatment analysis, also on a worst case basis. Briefly, π1 and π2 represent the true proportions of patients with treatment success according to the primary outcome in a random sample of the 2 populations of patients receiving the control treatment and the new drug, respectively. In case of non-inferiority, the expected estimates of π1 and π2 are given by:
$$\hat{\pi}_1 = \frac{x + y + n_2\Delta}{n_1 + n_2}, \qquad \hat{\pi}_2 = \frac{x + y - n_1\Delta}{n_1 + n_2}$$
where x and y are the observed numbers of successes, n1 and n2 are the 2 sample sizes in the control and the experimental study groups, respectively, and Δ is the pre-specified margin for non-inferiority.
The continuity-corrected chi-square of Dunnett and Gent (reproduced with written permission) for non-inferiority is given by:
$$\chi^2 = \left(\left|x - \hat{x}\right| - \tfrac{1}{2}\right)^2 \left[\frac{1}{\hat{x}} + \frac{1}{m - \hat{x}} + \frac{1}{n_1 - \hat{x}} + \frac{1}{n_2 - m + \hat{x}}\right]$$
where m = x + y and $\hat{x} = \hat{\pi}_1 n_1$.
If Δ is the maximal acceptable difference in success rates between the 2 treatment arms and δ is the observed difference between the experimental and control arms, the equivalence hypothesis can be formulated as a pair of one-sided hypotheses:
H01 : δ ≥ Δ versus Ha1 : δ < Δ with a type I error of α1 (1)
H02 : δ ≤ - Δ versus Ha2 : δ > - Δ with a type I error of α2 (2)
The type I error probability α for H0 rejection corresponds to the union H01 ∪ H02, so that equivalence is claimed only when both one-sided null hypotheses are rejected. Therefore, the P-value for equivalence is the lower chi-square value associated with max (α1, α2). In a non-inferiority hypothesis, only (1) is necessary. More details have been published elsewhere.
To avoid confusion between the P-values of superiority tests and the P-values of non-inferiority tests (both are reported in this paper), the latter have been renamed "D-values". When the normal approximation is a valid hypothesis, there is a general consistency between the two-sided 95% confidence interval approach (non-inferiority at α/2 < 2.5%) and the non-inferiority chi-square (D-value < 5%), as shown in Figure 1. D-values and P-values < 0.05 were considered statistically significant.
All of the antiretroviral trials outlined in Table 1 were conducted with active-controls which have previously shown efficacy. 16 studies used a composite endpoint including virologic failure, clinical progression to AIDS or death in compliance with the other new AIDS clinical trials, whereas 2 studies used log10 reduction in HIV viral load. However, they reported virologic failure as secondary endpoints.
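The D-value computation can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' own code: it assumes the usual form of the Dunnett and Gent statistic (expected control successes n1(x + y + n2Δ)/(n1 + n2) under the null hypothesis that the control is better by exactly Δ), and it reads the D-value as the upper-tail probability of a chi-square with 1 degree of freedom, which matches the stated correspondence between D < 5% and the two-sided 95% confidence interval. The function name, the handling of the unfavourable direction and the example counts are hypothetical.

```python
from math import erfc, sqrt


def dunnett_gent_d_value(x, n1, y, n2, delta):
    """Continuity-corrected non-inferiority chi-square and its D-value.

    x / n1: successes and size of the control arm.
    y / n2: successes and size of the experimental arm.
    delta:  non-inferiority margin as a proportion (e.g. 0.12).
    """
    m = x + y
    pi1_hat = (m + n2 * delta) / (n1 + n2)   # control rate under H0
    x_hat = n1 * pi1_hat                     # expected control successes
    cells = (x_hat, m - x_hat, n1 - x_hat, n2 - m + x_hat)
    if min(cells) <= 0:
        raise ValueError("expected cell count is not positive")
    deviation = x_hat - x                    # > 0 favours non-inferiority
    corrected = max(abs(deviation) - 0.5, 0.0)
    chi2 = corrected ** 2 * sum(1.0 / c for c in cells)
    if deviation <= 0:
        # Sketch convention: data at or beyond the margin in the wrong
        # direction cannot support non-inferiority.
        return chi2, 1.0
    d_value = erfc(sqrt(chi2 / 2.0))         # chi-square (1 df) upper tail
    return chi2, d_value


# Hypothetical counts: control 150/200, experimental 140/200.
for margin in (0.12, 0.15):
    chi2, d = dunnett_gent_d_value(150, 200, 140, 200, margin)
    print(margin, round(chi2, 2), "significant" if d < 0.05 else "inconclusive")
```

The standard library suffices here because, for 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(chi2 / 2)).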
All studies identified a pre-specified non-inferiority margin (criterion for selection). As shown in Figure 2, however, only 4/18 studies reported justification for their choice. In the CNAAB3005 study, the choice of the non-inferiority margin was based on discussion with clinical investigators and with the Food and Drug Administration. The margin of 12% was considered as the largest difference clinically acceptable. In the 903 study, the authors considered that the margin of 10% was a more stringent and conservative non-inferiority criterion. The authors of the CNA30024 study commented that it was the appropriate measure for distinguishing the clinical effectiveness of the 2 study treatments. Finally, the CNA30024 authors' choice relied on HIV clinicians' judgement as well as on discussion with independent reviewers. Other studies did not comment on their choice, which ranged from 10% to 15% (median: 12%). CONTEXT and BMS-045 considered a non-inferiority margin of -0.5 log10 reduction in HIV viral load, without justification. Other issues regarding design are reported in Table 1.
All but two trials reported results using the confidence interval approach. In the BEST study, the authors predefined their non-inferiority margin for sample size calculation, but the confidence interval was neither defined nor reported. In the NEFA study, although the confidence interval approach was clearly defined in the statistical analysis section of the article, none was provided in the results section. NEFA, BEST, 2NN, FTC-303, ESS40013 and SHAART studies reported non-significant superiority tests for efficacy to reinforce non-inferiority. The ALIZE and 934 studies switched from the non-inferiority to the superiority hypothesis to declare that the experimental treatment had superior efficacy in the ITT analysis set (for secondary and primary endpoints, respectively), as appropriate.
CNAAB3005, NEFA, SOLO, BEST, EPV20001, ALIZE, BMS-2004, SEAL and SHAART studies (Figure 2) published both ITT and OT analyses (9/18), but only the ALIZE, SOLO, EPV20001, BMS-2004 and SEAL studies found concordant results regarding non-inferiority in the two analyses. The BEST investigators provided separate conclusions for ITT and OT, as appropriate. The ALIZE trial group conducted ITT, OT and a worst scenario analysis. In CNAAB3005, NEFA and SHAART, the conclusion was based on ITT analysis only. 2NN, FTC-303, EPV20001, ESS40013, CNA30021 and 934 studies described sufficient details to permit alternative analyses, such as OT. We were unable to compute the OT analysis for the 903 and CNA30024 studies. Because of the nature of their primary outcome, CONTEXT and BMS-045 studies were not able to provide ITT and OT analyses. Both analyses were provided as secondary endpoints.
CNAAB3005 (12% versus 14.3%), NEFA (13.5% versus 15.8%), 2NN (10% versus 14.0%; 10% versus 14.6%), 903 (10% versus 10.3%) and SHAART (15% versus 17.4%) concluded non-inferiority inappropriately on the basis of their pre-specified margin. Accordingly, their non-inferiority D-values were above 5%, as shown in Table 2. BMS-2004 concluded that the two drugs were equally efficacious (suggesting equivalence), while the ITT lower bound of the 95% confidence interval (-11.7%) exceeded 10% in favour of the experimental drug. The main BMS-2004 hypothesis (non-inferiority of the experimental drug at 10%) was demonstrated with a D-value = 0.043 (OT analysis). In our analysis of the ESS40013 study (OT), the non-inferiority margin exceeded the pre-specified non-inferiority margin. Finally, CONTEXT and BMS-045 studies provided a conclusion in accordance with their non-inferiority margin (data not shown).
The conclusions of BEST, SOLO, FTC-303, EPV20001, ALIZE, CNA30024, SEAL, CNA30021 and 934 were appropriate, on the basis of available data.
Trials that assess non-inferiority require rigorous methods for their design, analysis and interpretation. Although the design and the sample size were appropriate for AIDS non-inferiority and equivalence trials, there is room for substantial improvement regarding statistical analysis and interpretation of the results.
Patients with HIV infection would be harmed by deferral of therapy. Consequently, the use of placebo would be unethical. Even if placebo-controlled trials of HAART are not available, a conclusion about efficacy can be reached because the great majority of patients (about 70%) will not be controlled without treatment [4,5]. Because significant inferiority to active-control would be a major problem for patients, the non-inferiority margin for a new drug should be smaller than the difference between active-control and placebo. Because this effect size is so large, only the clinically chosen margin is really an issue, but it is also highly subjective. As a result, this margin varied from the conventional 10% up to 15%. Even the same study group chose different margins in studies 903 (10%) and 934 (13%). A small decrease in margin provides greater assurance of satisfactory effect, but the cost of the study will increase because more patients are required. In the 903 study, the authors could not demonstrate non-inferiority at 10% but they pointed out in their discussion that this margin was more stringent than the 12% chosen in CNAAB3005. However, if the authors had chosen the less powerful 12% as the maximal limit for non-inferiority, fewer patients would have been enrolled and the 95% confidence interval would have been wider, possibly beyond the 12% limit. Consequently, data-driven discussion about the non-inferiority margin after completion of the study is pointless.
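The trade-off between margin and sample size can be made concrete with a short sketch. It assumes the standard normal-approximation formula for a non-inferiority comparison of two proportions with equal true success rates, n per arm = (z(1-α) + z(1-β))² × 2p(1-p) / Δ²; the one-sided α of 2.5%, the 80% power and the function name are illustrative assumptions, not values taken from the reviewed trials.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p, margin, alpha=0.025, power=0.80):
    """Patients per arm to show non-inferiority within `margin`.

    Assumptions: both arms truly have success rate p, one-sided type I
    error alpha, normal approximation, equal allocation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / margin ** 2)


# Success rate of 70% in both arms, margins of 10%, 12% and 15%.
for margin in (0.10, 0.12, 0.15):
    print(f"margin {margin:.0%}: {n_per_arm(0.70, margin)} patients per arm")
```

Because the margin enters as a square in the denominator, moving from 15% to 10% more than doubles the required enrolment under these assumptions.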
Blinding has been described as less efficient in non-inferiority than superiority trials, in particular if the primary endpoint is subjective. For example, a blinded investigator could bias the results toward a preconceived belief in equivalence by assigning similar ratings to the treatment responses of all patients, giving a "bias toward the null". Even when the primary outcome is objective (viral failure, clinical progression or death), however, we believe that blinding is important to protect against bias. Unblinded investigators may provide other effective interventions to patients in the arm that they believe superior or equivalent, such as more regular appointments or adherence support. In addition, patients or physicians may overinterpret subjective endpoints such as side-effects in open-label studies. Finally, the absence of blinding can distort the comparability of the groups regarding study withdrawal or patients' adherence, since patients participating in a non-inferiority trial may prefer to receive the simpler therapy. Among the studies observed, significantly more patients discontinued the ALIZE study medication in the control arm for personal reasons, as compared with the simpler, once-a-day experimental group (11% versus 2%, P < 0.0004). This may influence outcome, particularly in an ITT analysis, where withdrawals are considered as failure. Another example comes from the results of the 934 study, where adherence to treatment differed significantly between groups. The conclusion about superior efficacy of the experimental arm in the 934 study may be in part the consequence of greater exposure to the experimental drug. On the other hand, blinding can stand in the way of optimal drug dispensation in non-inferiority and equivalence trials, in particular if the aim is to simplify antiretroviral therapy. For example, if the purpose is to offer simpler dosage or fewer pills as compared to standard therapy, blinding may require similar regimens in both arms so that any advantages of simplification would be eliminated.
Exclusion of patients after they have been randomized sacrifices the validity of "on-treatment" analysis because it may cause major bias regarding group comparability. For this reason, intention-to-treat analysis has been recognized as the most appropriate and conservative strategy to analyse data of double-blinded trials. However, in the case of non-inferiority and equivalence trials, it is well known that this method lacks robustness because it is not conservative. For this reason, the study interpretation should also be complemented by "on-treatment" analysis [1,8,9]. If there are discrepancies in the results regarding equivalence or non-inferiority, this should be reported and acknowledged. The CNAAB3005 study illustrated how apparent equivalence can be the consequence of a dilutional effect of comparing 2 treatments in the ITT analysis (527 patients) when only 54% of the patients were on treatment. The same could apply to the ESS40013 study. The use of an "overall" log-rank test of superiority across the 3 arms in the NEFA study may also have blurred the lower efficacy of one study arm, as demonstrated by the "head-to-head" comparison between abacavir and efavirenz.
As in superiority trials, the choice of the primary outcome is also critical in non-inferiority trials. The BMS-045 study illustrated how statistical non-inferiority for viral log difference can be compatible with up to 20.4% of additional virologic failure in the experimental arm, a percentage much larger than the non-inferiority margins usually selected for this outcome in this setting.
Finally, the majority of the studies concluded that the effect of at least one experimental arm, based on their prespecified margin, was similar to the control. However, only half of these studies actually demonstrated non-inferiority. Prespecifying the non-inferiority or equivalence margin is necessary but not sufficient to guarantee methodologic quality and appropriate conclusions. We confirmed that AIDS trialists had low adherence to non-inferiority and equivalence methodological standards, as is the case in other fields. An antiretroviral drug may not prove non-inferiority in terms of efficacy but nonetheless be a good alternative because the observed difference is small and the new drug demonstrates better tolerance. This interpretation should, however, be left to the reader. To allow a risk-benefit assessment to be made, the report has a particular obligation to be as clear as possible, using standard statistical vocabulary for non-inferiority and equivalence trials, in compliance with the CONSORT statement.
Conclusions about non-inferiority should be drawn on the basis of an appropriate confidence interval using a predefined criterion for non-inferiority, shown in both OT and ITT in compliance with the non-inferiority and equivalence extension of the CONSORT statement. We describe how failure to do so will lead to erroneous conclusions. A claim of non-inferiority with a non-inferiority chi-square D-value above 5% is as incorrect as a claim of superiority with a traditional null hypothesis testing P-value above 5%. Although the 95% confidence interval approach is sufficient to reject the null hypothesis, the non-inferiority chi-square provides additional information about the actual degree of significance. Of note, item 12a of the revised CONSORT statement for superiority trials recommends reporting the actual P-values for statistical significance rather than the imprecise threshold "P < 0.05". The additional use of the continuity-corrected non-inferiority chi-square may help to avoid misleading interpretation by non-statisticians, for whom significance testing may have a higher impact than confidence intervals. The clinical relevance of the primary outcome on which non-inferiority relies should also be assessed. Reviewers and Editors need to reinforce their standards for acceptance of non-inferiority and equivalence randomized controlled trials. Finally, the importance of critical appraisal has implications for both curricular planning in schools and colleges of medicine, as well as for continuing education programs.
JJP received research or travel grants from Boehringer Ingelheim, GlaxoSmithKline, Abbott Pharmaceutical, Roche Pharma and Gilead Sciences. RV received research or travel grants from Bristol-Myers-Squibb, Merck, Boehringer Ingelheim, GlaxoSmithKline, Abbott Pharmaceutical, Roche Pharma, Pfizer and Gilead Sciences. VM declares that she has no competing interests.
No persons apart from the authors contributed to this paper. JJP had the original idea for the paper. JJP and VM performed the literature search, conducted quality assessment and data extraction and performed statistical analysis. The paper was drafted by JJP and critically appraised for intellectual content by RV and VM, who were also involved in interpretation of the data. All authors read and approved the final manuscript. The guarantor of this paper is JJP.
We would like to thank David Sackett for his useful comments on an earlier version of this manuscript, Jean-Michel Molina and Joel Gallant for helpful discussions on their work. There is no funding for this study.