The methodological challenges in conducting a noninferiority study are significant, especially because poor trial design and execution can erroneously suggest similarity. Challenges include (a) choice of an active control treatment, (b) choice of a noninferiority margin, (c) sample size estimation, and (d) statistical analysis.
As previously mentioned, these designs cannot be used without a well-established standard treatment to serve as the active control (ICH, 2001). There must be convincing prior evidence of the effectiveness of the active control compared with placebo, and its effectiveness must have been consistently demonstrated (Blackwelder, 2004). It must be clear that the active control is effective in the specific application, ideally with the specific population, used in the current study. The conditions of the trial (e.g., setting, dose, duration) should not unfairly favor one treatment over another (Hwang & Morikawa, 1999). In addition, it must be genuinely unknown whether one treatment is inferior to the other (Djulbegovic & Clarke, 2001). This can be a difficult standard to meet in studies with underserved populations in which the actual standard of care is minimal or insufficient (e.g., telemental health trials with rural patients). For example, in our ongoing trial the true “standard of care” in many of our rural locations is minimal to no care. For the purposes of the trial, however, we needed to choose an active control with a well-established evidence base; we therefore selected face-to-face delivery of CPT services as our active control.
The choice of noninferiority margin is another critical decision. Unfortunately, there is no gold-standard criterion for determining an appropriate margin (W. L. Greene, Concato, & Feinstein, 2000). In fact, the only consistent recommendations from regulators are that the margin be determined in advance and that it be no greater than the smallest effect size the active drug would be reliably expected to have compared with placebo (Hwang & Morikawa, 1999; ICH, 2001). If the margin is too large, rejecting the null hypothesis is meaningless; if it is too small, power to detect noninferiority is dramatically reduced (Wiens, 2002). Some researchers prefer to derive the margin from statistical properties, an approach that typically yields a margin relative to effect size: a fraction, usually one half or less, of the historical effect size of the standard intervention (Temple & Ellenberg, 2000). The margin can also be expressed as a percentage of the effect of the standard treatment in the current trial; for example, the novel intervention must be at least 80% as effective as the standard intervention. Perhaps the most common approach in treatment outcome studies is to set a margin based on what is considered “clinically unimportant.”
Conceptually, if the experimental treatment is almost as good as the standard treatment and there is only a trivial or unimportant difference between the two, then the margin must be smaller than an amount that would make a difference clinically. This is quantified by taking the smallest value that would be clinically meaningful and using it as the margin of noninferiority. For example, in our ongoing CPT study with a population of combat veterans with PTSD, our noninferiority margin, the minimum clinically meaningful difference between the two conditions on the primary clinical outcome measure, is a decrease of 10 points on the CAPS. A difference smaller than this magnitude would not be large enough to be considered clinically meaningful. This prespecified difference is both clinically and statistically established (Schnurr et al., 2003). Our margin is therefore set such that the mean score reduction of the experimental treatment could not be more than 10 points lower than that of the standard treatment. The most rigorous approach is to use both clinical and statistical judgment (Wiens, 2002). Whatever criterion is chosen, the margin and its derivation should be explicitly reported in the manuscript (Piaggio, Elbourne, Altman, Pocock, & Evans, 2006).
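The two effect-size-based derivations described above can be sketched numerically. All values below are illustrative assumptions, not figures from the trial described in this article:

```python
# Two common ways to derive a noninferiority margin from the effect of
# the standard treatment vs. placebo. The historical effect used here
# is a hypothetical value for illustration only.
historical_effect = 12.0  # e.g., a 12-point advantage over placebo

# (a) A fraction (here one half) of the historical effect size:
margin_fraction = 0.5 * historical_effect

# (b) Require the novel treatment to retain at least 80% of the
# standard treatment's effect, i.e., allow at most a 20% loss:
margin_retention = 0.2 * historical_effect

print(margin_fraction)  # 6.0
print(margin_retention)
```

Either quantity would then be fixed in advance as the noninferiority margin, alongside the clinical judgment described above.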
It is generally not enough for a noninferiority study to merely show that the novel treatment is not inferior to a standard treatment; it is preferable that the study also be able to demonstrate that both treatments are actually effective. Even when the standard treatment has a strong history of effectiveness, many trial-specific factors could render it ineffective, such as the choice of study population, treatment setting, and primary outcome. The most methodologically rigorous approach is to include a placebo or wait-list condition to confirm that both the active control (i.e., standard treatment) and the novel treatment are superior to placebo (Hwang & Morikawa, 1999). When a placebo is not feasible, as is frequently the case in psychotherapy research, every other aspect of the study design should be as similar as possible to the previous trials establishing the effectiveness of the active control (ICH, 2001; Temple & Ellenberg, 2000).
As with superiority trials, the required sample size depends on the specific analyses and variables to be used in the trial. Power calculations must take into account the noninferiority margin and the Type I and Type II error rates (Julious, 2004). Regulatory authorities recommend the use of a 95% confidence interval, which corresponds to a one-sided Type I error of 0.025 (Lewis, Jones, & Rohmel, 1995). To achieve power of at least 80%, Type II error can be no greater than 0.20. We have frequently heard researchers voice the misconception that equivalence and noninferiority studies necessitate extremely large sample sizes and as a result are virtually impossible to conduct. Investigators should not be discouraged from pursuing an equivalence or noninferiority design on the basis of this erroneous assumption. Although these studies may require large sample sizes, the size of any trial (superiority or otherwise) depends on the primary objective, the choice of error rates, and the differences one expects to detect. For our ongoing study, we considered that under a noninferiority design the consequence of a Type II error is analogous to the consequence of a Type I error in traditional studies, and we adjusted the values accordingly to obtain power of 0.90.
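As a rough sketch of such a calculation, the standard normal-approximation formula for comparing two means can be applied with a one-sided α of 0.025 and power of 0.90. The standard deviation below is a hypothetical value chosen for illustration, not a parameter of the trial described here:

```python
import math
from statistics import NormalDist

def noninferiority_n_per_group(margin, sd, alpha=0.025, power=0.90):
    """Approximate per-group sample size for a noninferiority
    comparison of two means (normal approximation, equal allocation,
    true difference assumed to be zero)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)  # one-sided Type I error
    z_beta = z(power)       # Type II error of 1 - power
    n = 2 * ((z_alpha + z_beta) * sd / margin) ** 2
    return math.ceil(n)

# A 10-point margin with a hypothetical between-subject standard
# deviation of 20 points:
print(noninferiority_n_per_group(margin=10, sd=20))  # 85 per group
```

Note how the required n falls when power is relaxed to 0.80, illustrating that sample size is driven by the chosen error rates and margin rather than by the design itself.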
The statistical analysis of noninferiority can be conducted either by using confidence intervals or by applying variations of null hypothesis testing. Under conventional hypothesis testing in a comparative study of two interventions, the goal is to reject the null hypothesis in favor of a difference between the two interventions. If this approach is extended to a noninferiority study, can noninferiority be claimed when the test fails to reject the null hypothesis of no difference? Some argue that this is acceptable provided a strict level of Type II error (failing to reject the null when it is false) is maintained (Jennison & Turnbull, 2000; Ng, 1995). Others argue that it is logically impossible to conclude noninferiority on the basis of failing to reject the null hypothesis (Blackwelder, 1982; Dunnett & Gent, 1977; Jones, Jarvis, Lewis, & Ebbutt, 1996); rather, failure to reject the null means only that there is insufficient evidence to accept the alternative hypothesis. Alternatively, if the null hypothesis states that the true difference is greater than or equal to the prespecified noninferiority margin, then failing to reject it can be interpreted as insufficient evidence to conclude that the difference between the two procedures is less than that margin.
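The shifted-null formulation can be sketched as a one-sided z-test. The summary statistics below are hypothetical, higher scores are taken to mean greater improvement, and the standard error of the difference is assumed known:

```python
from statistics import NormalDist

def noninferiority_z_test(mean_new, mean_std, se_diff, margin):
    """One-sided test of H0: mean_new - mean_std <= -margin (the new
    treatment is inferior by at least the margin) against
    H1: mean_new - mean_std > -margin (noninferiority)."""
    z = (mean_new - mean_std + margin) / se_diff
    p = 1 - NormalDist().cdf(z)  # small p supports noninferiority
    return z, p

# Hypothetical means and standard error, with a 10-point margin:
z, p = noninferiority_z_test(mean_new=48.0, mean_std=50.0,
                             se_diff=2.0, margin=10.0)
print(z, p)  # z = 4.0; p falls well below a one-sided alpha of 0.025
```

Here rejecting the shifted null (p < 0.025) supports a claim of noninferiority, whereas a nonsignificant result would simply be inconclusive.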
The statistical analysis plan for our ongoing trial is to employ a multilevel (also called hierarchical or random effects) modeling procedure for a noninferiority analysis of our primary outcome measure (change in PTSD symptoms on the CAPS). The magnitude of the difference in means (the effect size), as estimated by confidence intervals, will provide useful clinical information and will allow a judgment about the clinical noninferiority of the two modes of delivery. In a multilevel design such as this one, power is influenced not only by the number of participants, the effect size, and the α level, but also by the number of clusters (i.e., the number of cohorts) and the effect size variability (measured in our study as the effect size variance across cohorts). High effect size variability would reduce power, but because care will be taken to ensure that all treatment sessions follow the same procedures, there is no reason to expect it to be high.
To avoid misinterpretation of null hypothesis testing, some investigators, including those who contributed to the CONSORT guidelines, favor the confidence interval approach to showing noninferiority of two treatments (Durrleman & Simon, 1990; Jones et al., 1996; Makuch & Simon, 1978; Piaggio et al., 2006). The width of the interval signifies the extent of noninferiority, which is a favorable characteristic of this approach. If the confidence interval for the difference between the two interventions lies entirely to the right of (i.e., above) the noninferiority margin, then noninferiority can be concluded. If the interval crosses the boundary (i.e., contains the value of the margin), then noninferiority cannot be claimed. Some investigators prefer the confidence interval approach because it conveys the precision of the noninferiority estimate; others prefer hypothesis testing. The authors recommend that both confidence intervals and p values be provided to allow the audience to interpret the extent of the findings.
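The confidence interval rule can be sketched as follows; the mean difference, its standard error, and the margin are hypothetical values, with positive differences favoring the new treatment:

```python
from statistics import NormalDist

def noninferior_by_ci(mean_diff, se_diff, margin, alpha=0.025):
    """Claim noninferiority only if the lower limit of the two-sided
    95% CI for the difference (new minus standard) lies above -margin."""
    z = NormalDist().inv_cdf(1 - alpha)
    lower = mean_diff - z * se_diff
    upper = mean_diff + z * se_diff
    return lower > -margin, (round(lower, 2), round(upper, 2))

# A hypothetical difference of -2 points (new slightly worse) with a
# standard error of 2 and a 10-point margin:
ok, ci = noninferior_by_ci(mean_diff=-2.0, se_diff=2.0, margin=10.0)
print(ok, ci)  # True: the lower CI limit (about -5.92) exceeds -10
```

Reporting the interval itself, not only the yes/no conclusion, conveys the precision that makes this approach attractive.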
ICH E10 (2001) and CONSORT (Piaggio et al., 2006) guidelines recommend that noninferiority trials conduct an intent-to-treat (ITT) analysis including all participants who were randomized, regardless of their actual participation in treatment. It should be noted that although the importance of ITT analysis in a traditional superiority trial is well established, the role of the ITT population in a noninferiority trial is not equivalent to its role in a superiority trial (Brittain & Lin, 2005). An ITT analysis in a superiority trial tends to dilute the treatment effect, minimizing the difference between groups and in essence favoring the null hypothesis of no difference; this is a conservative way to view the results. In the noninferiority setting, however, because the null and alternative hypotheses are reversed, a dilution of the treatment effect actually favors the alternative hypothesis, making it more likely that true inferiority is masked. An alternative approach is to use a per-protocol population, defined as only those participants who comply with the protocol. The per-protocol analysis can be biased as well: because noncompleters are not included, it can distort the reported effectiveness of a treatment. If, for example, a significant percentage of participants dropped out of the experimental treatment because they felt it was ineffective, the per-protocol analysis would not capture that. Another approach is a modified ITT analysis, which excludes participants who never actually received the treatment but includes noncompliant participants who started the treatment and did not complete it. The most rigorous approach is to use both a per-protocol analysis and an ITT analysis, with the aim of demonstrating noninferiority in both populations (Jones et al., 1996). Regardless of the approach used, the reported findings should specify and fully describe the type of analysis population.
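The three analysis populations can be made concrete with a small sketch. The participant records below are hypothetical and include only the fields needed to define each population:

```python
# Hypothetical participant records: randomized arm, whether the
# participant started treatment, and whether they completed the
# protocol.
participants = [
    {"id": 1, "arm": "new", "started": True,  "completed": True},
    {"id": 2, "arm": "new", "started": True,  "completed": False},
    {"id": 3, "arm": "new", "started": False, "completed": False},
    {"id": 4, "arm": "std", "started": True,  "completed": True},
]

# ITT: everyone randomized, regardless of participation.
itt = participants

# Per-protocol: only participants who complied with the protocol.
per_protocol = [p for p in participants if p["completed"]]

# Modified ITT: excludes those who never started treatment but keeps
# noncompliant participants who started and did not complete it.
modified_itt = [p for p in participants if p["started"]]

print(len(itt), len(per_protocol), len(modified_itt))  # 4 2 3
```

Running the noninferiority analysis on both the ITT and per-protocol populations, as recommended above, guards against the distinct biases each population carries.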