Statistical testing, as described above, judges differences between treatments and is capable only of supplying evidence of a difference [1, 28], not certainty of a difference. A nonsignificant p value means we cannot conclude the two samples are drawn from different populations, but it is important to understand this does not imply that they are from the same population [1, 29]. Even if two groups in a trial produce numerically identical outcomes for a studied end point and a nonsignificant p value, one has to consider that this might be attributable to variability masking a true difference, ie, that one or even both groups have produced outliers that happen to be identical but that would not be reproduced in a new study. However, there are several reasons why it can be important to ask that a new treatment be no worse (within a margin), rather than better (Table 1). For instance, a new treatment may be designed to be safer (ie, fewer or less severe complications), less expensive, or otherwise more desirable, while providing efficacy on a primary outcome similar to that of an accepted contemporary treatment. When the new treatment may provide better efficacy but was designed to improve secondary measures, noninferiority testing is desirable because it also leaves open the possibility of establishing that the new treatment is better. Noninferiority assessment is one-sided testing, ie, it asks only whether the new treatment is worse than the active control by more than the noninferiority margin; a better treatment is also noninferior, and noninferiority testing does not preclude establishing superiority. Confidence intervals supply us with a means to test this hypothesis.
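As a minimal sketch of how a confidence interval can be read against a margin, the following Python snippet computes a Wald interval for the difference in event rates (new treatment minus active control) and checks its upper bound. The function names, the event counts, and the margin of 0.04 are hypothetical and chosen only for illustration; the snippet assumes the outcome is an undesirable event, so lower rates are better, and that the samples are large enough for the normal approximation.

```python
import math
from statistics import NormalDist


def risk_difference_ci(events_new, n_new, events_ctrl, n_ctrl, level=0.95):
    """Wald CI for (event rate, new) minus (event rate, active control)."""
    p_new = events_new / n_new
    p_ctrl = events_ctrl / n_ctrl
    diff = p_new - p_ctrl
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_ctrl * (1 - p_ctrl) / n_ctrl)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% CI
    return diff - z * se, diff + z * se


def is_noninferior(ci_upper, margin):
    """Events are undesirable, so the new treatment is noninferior when the
    upper CI bound of the rate difference stays below the margin."""
    return ci_upper < margin


# Hypothetical counts and margin, for illustration only.
low, high = risk_difference_ci(events_new=30, n_new=500, events_ctrl=32, n_ctrl=500)
print(f"95% CI for the rate difference: ({low:.4f}, {high:.4f})")
print("noninferior at margin 0.04:", is_noninferior(high, margin=0.04))
```

With these made-up numbers the interval includes zero, so no difference is shown, yet its upper bound lies below the margin, which is the pattern a noninferiority claim rests on.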

| **Table 1** Reasons for choosing noninferiority over superiority designs |

A noninferiority trial starts with defining “no worse (within a margin)” or noninferior before the beginning of the trial (Fig. ). This requires the definition of an outcome and a threshold in that outcome, below which one would consider a new treatment to be inferior to an older one. One method to define such a threshold is to take the mean effect of a current treatment less a noninferiority margin [10, 20, 24, 45]. The size of this margin should be determined during the design stage of a study, before the actual experiment. Choosing the size is difficult, as there are no explicit rules. Usually, findings from earlier studies and estimates of clinically relevant differences are combined. For example, suppose the currently used anticoagulant A serves as the active control for a study of the new anticoagulant B. A reduces the incidence of deep venous thrombosis after total hip replacement by 80%, and a loss of up to 5% of this effect is still considered “no worse”. Therefore, B would be required to reduce the incidence of deep venous thrombosis by at least 80% × 0.95, ie, 76%. To test for this, the 95% CI for the difference of the incidence of deep venous thrombosis with the new treatment minus the incidence with the active control is plotted against this threshold and interpreted (Fig. ). Alternatively, a one-sided test can be used because in noninferiority testing we are interested only in whether B is no worse than 76%, ie, only in the probability of lying in the left tail of the probability distribution curve. “Being no more than 5% worse than the effect of the active control” also could be interpreted as trying to maintain at least 95% of the effect of the active control. However, there is a risk of so-called “biocreep” [16, 18] (Fig. ), which refers to the problem of having a new treatment B that is no more than 5% worse than the current standard of care treatment A, and then comparing an even newer treatment C with treatment B using the same 5% margin, and then treatment D with treatment C, and so on. Although the noninferiority margin of 5% has never been violated in any individual comparison, treatment Z may be far from the effect of the original treatment A. The definition and use of such a margin might seem arbitrary to some, but it actually is more rigorous than a superiority design because it involves a predefined minimum difference and statistical testing for this specific difference, whereas superiority trials assess only the significance but not the size of a difference.
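The arithmetic behind the anticoagulant example and behind biocreep can be sketched in a few lines of Python. The function names are hypothetical, and the biocreep calculation assumes the worst case in which every successor treatment retains exactly 95% of the effect of its immediate comparator; real trials would rarely land exactly on the threshold.

```python
def required_effect(control_effect, retained_fraction):
    """Smallest effect the new treatment must show to count as 'no worse'."""
    return control_effect * retained_fraction


def biocreep(initial_effect, retained_fraction, generations):
    """Worst-case chain: each new treatment just meets the threshold set by
    the treatment it was compared with."""
    effects = [initial_effect]
    for _ in range(generations):
        effects.append(required_effect(effects[-1], retained_fraction))
    return effects


# Treatment A reduces deep venous thrombosis by 80%; B must retain 95% of that.
print(f"threshold for B: {required_effect(0.80, 0.95):.2f}")

# 25 successive comparisons take us from treatment A to treatment Z.
chain = biocreep(0.80, 0.95, generations=25)
print(f"worst-case effect of Z: {chain[-1]:.3f}")
```

Under this worst-case assumption, treatment Z would reduce the incidence by only about 22%, although no single comparison ever exceeded its 5% margin.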

Another method is to establish a margin that is significantly different from a putative placebo control by using data from pilot studies, earlier studies, or meta-analyses. A benefit of this approach is the inclusion of a placebo or negative control, because noninferiority testing is otherwise a closed comparison of two treatments [27]. The problem in a closed comparison is that it sometimes can be difficult to differentiate whether a new treatment approximated the effect of the well performing active control (ie, both treatments worked equally well) or whether the active control approximated the effect of an ineffective new treatment owing to a problem in the study (ie, both treatments failed clinically to produce a positive outcome). Such problems could be lack of compliance, implant failure, a poorly chosen model, patient crossover or losses to followup, or any other treatment failure. In both scenarios, the new treatment will “be no worse (within a margin)” than the active control, but it is obvious that the latter (no difference between treatments because both failed) should not be interpreted as evidence for the effectiveness of a new treatment. Both methods can be combined to define a margin that is significantly better than placebo and clinically meaningful.

Next, the investigator must establish the population to test. For superiority trials, testing the ITT population is recommended [28, 37], which means patients are analyzed by their initial allocation, regardless of attrition, missed followups, lack of compliance, crossovers, etc, to produce a conservative estimate that reflects the real life situation [6]. Currently it is not unequivocally clear whether ITT leads to a more conservative estimate in noninferiority trials, and it has been suggested that other factors related to study design, patient flow, and statistical analysis influence the conservatism of ITT analyses in noninferiority studies [34]. However, recommendations steer toward the use of ITT, mostly for the sake of consistency with superiority testing [6].

The investigator also must determine sample size for noninferiority. This is straightforward. In superiority testing, the sample size depends on the expected difference between treatments, and analogously, in noninferiority the sample size depends on the chosen noninferiority margin [44, 47, 53]. Depending on how large or small a margin is chosen, the required sample size in a noninferiority trial can be substantial. Usually, noninferiority and equivalence studies require (much) larger sample sizes than superiority studies because the typical sizes of the anticipated margins in noninferiority and equivalence studies are (much) smaller than what would be considered a clinically meaningful difference per se between groups in superiority studies.
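To illustrate why a small margin drives up the sample size, the sketch below uses the common normal-approximation formula for comparing two means with equal standard deviation, n per group = 2σ²(z_α + z_β)²/δ². For noninferiority, δ is the margin and the true difference is assumed to be zero; for superiority, δ is the expected difference. The function name and all numeric inputs (standard deviation of 10, margin of 2, expected difference of 5, 80% power) are hypothetical, and this is not necessarily the method used in the cited references.

```python
import math
from statistics import NormalDist


def n_per_group(sd, delta, alpha, power, one_sided):
    """Normal-approximation sample size per group for two means, equal SD.

    delta is the expected difference (superiority) or the noninferiority
    margin with an assumed true difference of zero (noninferiority).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical inputs, for illustration only.
n_superiority = n_per_group(sd=10, delta=5, alpha=0.05, power=0.80, one_sided=False)
n_noninferiority = n_per_group(sd=10, delta=2, alpha=0.025, power=0.80, one_sided=True)
print(n_superiority, n_noninferiority)
```

Because the sample size scales with 1/δ², shrinking δ from 5 to 2 multiplies the required number of patients per group by roughly six in this example.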

The question emerges whether noninferiority and superiority designs may be combined in one trial without affecting validity or power. As mentioned above, superiority and noninferiority have distinct testing principles; adding superiority testing to noninferiority testing therefore does not constitute multiple comparison testing and can be done without loss of power and without the need to adjust p values [28, 37, 38, 61]. It also is valid to test for noninferiority for one (group of) end point(s) and for superiority in other end points in one study [28, 36]. Nevertheless, as with all methodologic details of a study, the analysis plan, use of these approaches individually or in combination, directions of the hypothesis test(s), and required sample size have to be determined a priori.
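One way to picture the combined approach is as a fixed sequence read off a single confidence interval: noninferiority is assessed first, and superiority is examined only if noninferiority has been shown. The sketch below assumes the interval describes the difference in event rates (new treatment minus active control, lower is better) and that the margin was fixed in advance; the function name, labels, and example values are hypothetical.

```python
def interpret_upper_bound(ci_upper, margin):
    """Read the upper CI bound of (event rate new - event rate control)
    in a prespecified order: noninferiority first, then superiority."""
    if ci_upper >= margin:
        return "noninferiority not shown"
    if ci_upper < 0:
        return "noninferior and superior"
    return "noninferior, superiority not shown"


# Hypothetical upper bounds against a prespecified margin of 0.04.
for upper in (0.050, 0.026, -0.010):
    print(upper, "->", interpret_upper_bound(upper, margin=0.04))
```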