Investigators understand intuitively, even before applying statistical rigor, how to conduct a trial to establish superiority of a novel treatment. When a new therapy is compared with a placebo control, or, if one exists, an active control, the investigator defines an outcome (such as level of pain or overall survival) and declares the new treatment superior if, at the end of the trial, the estimated value of the outcome in the treated group is 'better' than the estimate in the control group. Statistically speaking, 'better' means that the data allow rejection of the null hypothesis that the two distributions are equal, in favor of the hypothesis that the new treatment is better than the control.
Sometimes, the goal is not to show that the new treatment is better, but that the new treatment is 'equivalent' to the control. Because only with an infinite sample size would it be possible to show exact equivalence, investigators instead select a margin; call it Δ. At the end of the trial, a CI is computed around the estimated difference between the two treatments (equivalence trials typically use 90% CIs), and if the CI lies strictly within [-Δ, +Δ] the two treatments are called 'equivalent.' Such trials are used to show that a generic drug is biologically the same as the drug it is trying to mimic. They are also used to show lot consistency in vaccine trials, in which the outcome is a measure of immune response.
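The equivalence criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of a normal approximation for the CI, and the numbers are all hypothetical and are not taken from any particular trial.

```python
from statistics import NormalDist


def confidence_interval(diff, se, level=0.90):
    """Two-sided normal-approximation CI for an estimated treatment difference.

    Assumption: the estimate is approximately normal with standard error `se`.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff - z * se, diff + z * se


def is_equivalent(ci, delta):
    """True if the whole interval lies strictly inside (-delta, +delta)."""
    lower, upper = ci
    return -delta < lower and upper < delta


# Hypothetical estimate, standard error and margin, for illustration only.
ci = confidence_interval(diff=0.4, se=1.0, level=0.90)
print(ci, is_equivalent(ci, delta=3.0))
```

With a larger standard error the same point estimate produces a wider interval, which may then cross one of the two limits and fail the criterion.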
Non-inferiority is different from equivalence. In an equivalence trial, the desired conclusion is that two products are the same or 'not unacceptably different' from each other. In a non-inferiority trial, by contrast, the aim is to show that a new product is not unacceptably worse than an older one. Why might it be reasonable to pursue a product that is possibly less efficacious than an existing therapy? A new treatment that is not much worse than, or 'non-inferior to', the standard treatment may be attractive if, when compared with the standard treatment, it is expected to cause fewer side effects, or lead to improved quality of life, or if its dosing regimen is easier to tolerate.
Assume it is possible to define what 'unacceptably worse' means (think of this as a window of indistinguishability, or a margin that we will call -Δ; below we discuss how to choose such a margin), and that there is an existing treatment available against which to compare the new treatment. The new treatment could be said to be not unacceptably worse than [3] (that is, non-inferior to) the existing treatment if, when the CI around the difference in the effect size between the new and existing treatments is calculated, the lower bound of that interval does not extend beyond the window of indistinguishability defined above. One focuses on the lower bound for this non-inferiority comparison; what happens at the upper end of the CI is not the primary concern. In an equivalence trial, by contrast, investigators care about both ends of the CI, and would declare the new treatment equivalent to the existing treatment only if the entire CI falls within this margin on either side of zero.
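The contrast between the one-sided non-inferiority criterion and the two-sided equivalence criterion can be made concrete with a short sketch. The function names and the example interval are hypothetical, and the use of strict inequalities at the margin is an assumption made here for simplicity.

```python
def is_non_inferior(ci_lower, delta):
    """Non-inferiority looks only at the lower bound of the CI."""
    return ci_lower > -delta


def is_equivalent(ci_lower, ci_upper, delta):
    """Equivalence requires both bounds to lie inside (-delta, +delta)."""
    return ci_lower > -delta and ci_upper < delta


# Hypothetical interval: the new treatment may be considerably better.
lower, upper, delta = -1.0, 4.0, 3.0
print(is_non_inferior(lower, delta))       # lower bound is above -delta
print(is_equivalent(lower, upper, delta))  # upper bound exceeds +delta
```

The same interval can therefore satisfy the non-inferiority criterion while failing the equivalence criterion, because an upper bound beyond +Δ is of no concern in a non-inferiority comparison.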
Non-inferiority trials are clearly appropriate for some diseases and some treatments. When developing a new treatment to prevent tuberculosis, investigators might be willing to sacrifice some small amount of benefit (as reflected in the margin) for a simpler dosing schedule, fewer side effects, or other advantages, but they would be delighted if the new treatment were better than current therapies (hence no restriction on the upper bound of the interval) and they could also declare superiority. This would only happen if the lower bound of the interval were above zero, not simply above -Δ.
Thus far, the problem sounds straightforward. One needs to select a non-inferiority margin, run the trial comparing the experimental treatment to an active control, calculate the CI around the difference between the treatments, and examine the lower bound of the CI. If the lower bound is above the margin -Δ, the new treatment is deemed non-inferior, and the trial is a 'success'. Further, if the new treatment is statistically significantly better than the comparator (that is, the lower bound of that same CI is also above zero), then superiority of the new treatment can also be declared. Importantly, testing first for non-inferiority and then for superiority does not require a statistical 'penalty' for multiple testing, because testing first for non-inferiority before testing for superiority (while examining a single CI) uses a testing procedure that appropriately controls the overall Type I, or α, error rate of the two tests. Statisticians refer to this type of testing as 'closed testing', and such a process ensures that the overall experiment-wise error rate is maintained at the correct level when testing more than one hypothesis. The order of the testing is important; to declare superiority, a new treatment necessarily also has to be declared non-inferior. The converse (testing first for superiority and then for non-inferiority) is not always a closed procedure. Testing in that order could lead to apparently anomalous results, even when examining a single CI. A large trial with a narrow CI around the difference between the active control and the new treatment might show that the lower limit of the interval lies within the margin, meaning that the new treatment is non-inferior to the active control, but the upper limit of the interval is below zero, so the new treatment is also inferior to the active control. Bear in mind that the opposite of 'non-inferior' is not 'inferior'; it is the looking-glass opposite, 'not non-inferior'. 
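The ordered procedure described above (non-inferiority first, superiority second, both read off a single CI) might be sketched as follows. The function name and the example values are hypothetical; the sketch assumes the CI has already been computed at the prespecified level and that larger differences favor the new treatment.

```python
def noninferiority_then_superiority(ci_lower, delta):
    """Fixed-order test on one CI: superiority is examined only if
    non-inferiority has already been shown."""
    non_inferior = ci_lower > -delta
    superior = non_inferior and ci_lower > 0.0
    return non_inferior, superior


# Hypothetical lower bounds with a margin of delta = 3.
for lower in (0.5, -1.2, -3.5):
    print(lower, noninferiority_then_superiority(lower, delta=3.0))
```

Because any lower bound above zero is necessarily above -Δ, a declaration of superiority always implies non-inferiority in this ordering.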
As an example, suppose the margin -Δ is -3, and the observed 95% CI at the end of the trial is [-2.7, -1.5]. The lower limit of the CI is above -3, so the new drug is non-inferior to the old, but the upper limit of -1.5 is less than zero, so the new drug is also inferior to the old. In this case, the single CI can be used to say that the new treatment is simultaneously 'non-inferior' and 'inferior'. Although this example may seem counterintuitive, when interpreting the results of a non-inferiority trial, it must be remembered that the purpose of the trial is to estimate the lower bound of the CI, not to establish a point estimate of the treatment effect. This test, sitting on the other side of the looking glass, requires an interpretation different from the usual.
In some trials, it is statistically appropriate to perform a superiority comparison first and, if that does not show statistical benefit, to perform a non-inferiority comparison. That would be appropriate only when the non-inferiority margin had been preselected. The reason such a switch is permissible stems from the fact that we can view the test as an interpretation of a CI. The calculated CI does not know whether its purpose is to judge superiority or non-inferiority. If it sits wholly above zero, then it has shown superiority. If it sits wholly above -Δ, then it has shown non-inferiority.
A non-inferiority trial can have five possible types of outcomes, as depicted in the figure. The two vertical lines indicate zero and -Δ. Each horizontal line represents a CI, with the estimated treatment effect denoted by the dot in the center. The CI at the top of the figure sits wholly above zero; a trial with this outcome would conclude that the new treatment is superior and hence, also non-inferior, to the control. The next interval, which spans zero but lies wholly above -Δ, represents a trial that has shown non-inferiority, but not superiority. The third interval, which straddles both zero and -Δ, represents a trial that has shown neither non-inferiority nor superiority. The fourth CI illustrates the case discussed above; tucked between the two vertical lines, it shows both non-inferiority (because it lies wholly above the line for -Δ) and inferiority (because it also lies wholly below zero). The final CI on the bottom of the figure shows inferiority and does not show non-inferiority.
Possible outcomes of a non-inferiority trial.
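The five outcome types can be expressed as a small classification routine. This is a minimal sketch: the function name and the wording of the labels are hypothetical, the CI bounds are assumed to be on a scale where larger differences favor the new treatment, and bounds that fall exactly on zero or on -Δ are not treated specially.

```python
def classify_outcome(ci_lower, ci_upper, delta):
    """Map a CI for (new - control) onto the five outcome types."""
    non_inferior = ci_lower > -delta
    if ci_lower > 0:
        return "superior (and non-inferior)"
    if ci_upper < 0:
        if non_inferior:
            return "non-inferior and inferior"
        return "inferior, not non-inferior"
    if non_inferior:
        return "non-inferior, not superior"
    return "neither non-inferior nor superior"


# Hypothetical intervals, listed top to bottom, with a margin of delta = 3.
for ci in [(0.5, 2.5), (-1.0, 1.5), (-4.0, 1.0), (-2.7, -1.5), (-5.0, -0.5)]:
    print(ci, classify_outcome(*ci, delta=3.0))
```

An interval such as [-2.7, -1.5] with a margin of 3 falls into the fourth category: it lies wholly above -Δ and wholly below zero.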