A well-planned randomized controlled trial (RCT) is the optimal study design for determining whether a novel surgical intervention differs from a prevailing one. Traditionally, when we want to show that a new surgical intervention is superior to a standard one, we analyze data from an RCT to see if the null hypothesis of “no difference” (i.e., that the 2 surgical interventions have the same effect) can be rejected. A noninferiority RCT design seeks to determine whether a new intervention is not worse than a prevailing (standard) one within an acceptable margin of risk or benefit, referred to as the “noninferiority margin.” In the last decade, we have observed an increase in the publication of noninferiority RCTs. This article explores this type of study design and discusses the tools that can be used to appraise such a study.
A well-planned randomized controlled trial (RCT) is the optimal study design for determining whether a novel surgical intervention differs from a prevailing one. Traditionally, when we want to show that a new surgical intervention is superior to a standard one, we analyze data from an RCT to see if the null hypothesis of “no difference” (i.e., that the 2 surgical interventions have the same effect) can be rejected.1,2 Let’s consider a hypothetical RCT comparing laparoscopic with open appendectomy, in which the measured outcome is a pain score on a Likert scale from 0 to 10. Suppose the mean pain score was 7 points following laparoscopic appendectomy and 8 points following open appendectomy, and that this 1-point difference was statistically significant. Such a result would be uncommon, because it would require a large sample size, but let’s accept it for now. Although the result is statistically significant, we would not consider this 1-point difference clinically relevant. This type of thinking addresses the concept of the minimum clinically important difference (MCID), a threshold that might persuade us to change our surgical practice. The MCID is usually based on the best available evidence, derived from previous systematic reviews, pilot/feasibility studies or clinical judgment informed by discussion with experts in the field.
In another hypothetical RCT, the length of stay (LOS) after laparoscopic appendectomy was observed to be 24 hours versus 30 hours after open appendectomy, with p < 0.05. It would be meaningless to conclude that the observed difference of 6 hours reflects the true difference without reporting a confidence interval (CI), as a p value alone provides no information on the degree of uncertainty (variation) in the measured difference in hospital stay.3 Briefly, a CI quantifies the uncertainty associated with the observed 6-hour difference in hospital stay; it is within the CI that the true difference is likely to lie. Let’s say that in our hypothetical example the 95% CI for the 6-hour difference was 1–11 hours in favour of the laparoscopic approach. This means we are 95% confident that the true difference lies somewhere between 1 and 11 hours, an interval wide enough to raise uncertainty about any definitive conclusion.
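The arithmetic behind such an interval can be sketched in a few lines. This is a minimal illustration using the hypothetical LOS example; the standard error of 2.55 hours is our assumption, chosen so that the resulting interval matches the 1–11 hour CI quoted above.

```python
from statistics import NormalDist

# Hypothetical values from the example: an observed mean difference in
# length of stay of 6 hours (open minus laparoscopic). The standard
# error is assumed, not reported in the text.
mean_diff = 6.0   # hours
se_diff = 2.55    # hours (assumed for illustration)

z = NormalDist().inv_cdf(0.975)   # ~1.96 for a 2-sided 95% CI
lower = mean_diff - z * se_diff
upper = mean_diff + z * se_diff
print(f"95% CI: {lower:.1f} to {upper:.1f} hours")  # 1.0 to 11.0 hours
```

A narrower standard error (i.e., a larger sample) would shrink this interval and permit a more definitive conclusion.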
In the last decade, we have observed an increase in the publication of noninferiority RCTs. This article explores this type of study design and discusses the tools that can be used to appraise such a study.
At the last cardiac surgery weekly academic rounds there was a heated exchange between 2 surgeons, who were arguing the merit of ex-vivo heart perfusion compared with cold storage as a means of preserving donor hearts before transplantation. To resolve this dilemma, the division head has assigned you, the newest member of the division, the task of finding the best evidence to answer this clinical question and reporting your findings to the group at next week’s rounds.
To identify the best evidence and inform your colleagues you begin by conducting a literature search according to the “Users’ guide to the surgical literature: how to perform a high-quality literature search.”4 You follow the PICOT format, which serves as the starting point for identifying the important key words used in the search process.5
You then conduct a literature search in PubMed Clinical Queries using the search terms “heart transplantation” AND “ex-vivo perfusion” AND “cold storage,” using the “Therapy” and “Broad” filters. You identify 10 articles: 7 ex-vivo human/animal studies,6–12 1 nonrandomized clinical study,13 1 review14 and 1 RCT.15 The RCT addresses your research question and has the benefit of being level-I evidence.16 However, when reading the article, you are perplexed that it is labelled as a “randomized noninferiority trial.”
A noninferiority RCT design seeks to determine whether a new intervention is not worse than a prevailing (standard) one within an acceptable margin of risk or benefit, referred to as the noninferiority margin.17–20 It is usually assumed that the standard intervention has already been shown to have a better (superior) clinical effect than a placebo or an earlier intervention. A noninferiority design is attractive when the new intervention offers other advantages, such as reduced costs, fewer adverse effects (harm), less invasiveness or greater convenience. In trials that investigate noninferiority, the null hypothesis is not symmetric. The new intervention is declared noninferior if its effect is not worse than that of the standard intervention by more than the noninferiority margin for a specified outcome measure. If the new intervention is found to be superior, that is an additional benefit. Tests of noninferiority should be linked to the predefined noninferiority margin and a predefined α. An α of 0.025 for a 1-sided noninferiority hypothesis corresponds to a 1-sided 97.5% confidence bound, just as an α of 0.05 for a 2-sided hypothesis corresponds to a 2-sided 95% CI.17
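The correspondence between the 1-sided α and the confidence bound can be checked numerically. A minimal sketch (variable names are ours): the critical value for a 1-sided test at α = 0.025 is the same z that defines the bound of a 2-sided 95% CI.

```python
from statistics import NormalDist

nd = NormalDist()

# Critical z for a 1-sided noninferiority test at alpha = 0.025
z_noninferiority = nd.inv_cdf(1 - 0.025)
# z defining the bound of a conventional 2-sided 95% CI (alpha = 0.05)
z_ci_bound = nd.inv_cdf(1 - 0.05 / 2)

print(round(z_noninferiority, 2), round(z_ci_bound, 2))  # 1.96 1.96
```

This is why a noninferiority conclusion at 1-sided α = 0.025 can be read directly off the 2-sided 95% CI for the treatment difference.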
Suppose the hospital administrators would like to expedite surgical patients’ hospital discharge with the adoption of this new surgical approach if it is proven to be noninferior to the standard procedure within a 3-hour noninferiority margin.
Figure 1 presents some possible scenarios for noninferiority trials observing mean differences in hospital stay following laparoscopic and open approaches. Scenario A shows that laparoscopic surgery is superior to open surgery, as the CI lies to the left of the line of no difference (zero). In scenario B, the CI includes the threshold of noninferiority; noninferiority is therefore not shown, as the true difference in hospital stay could be worse than the 3-hour predefined noninferiority margin for laparoscopic surgery. Scenarios C and D show that laparoscopic surgery is noninferior to open surgery, because the upper confidence limit lies to the left of the 3-hour noninferiority margin. Scenario D shows that laparoscopic surgery is definitely noninferior to open surgery, as the CI lies to the left of the noninferiority margin and also excludes the zero line of no difference. In scenario E, laparoscopic surgery is definitely inferior to open surgery with respect to hospital stay, as the lower bound of the CI lies to the right of the noninferiority margin. Such a scenario is less likely, as it would require a very large sample size.17
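The decision rules behind these scenarios can be expressed as a short function. This is an illustrative sketch, not part of the trial’s analysis: it takes the 2-sided CI for the difference (new minus standard, so positive values favour the standard intervention) and the prespecified margin, and returns the Figure 1 interpretation.

```python
def interpret_ci(lower: float, upper: float, margin: float) -> str:
    """Classify a noninferiority result from the CI for the difference
    (new minus standard; positive values favour the standard arm).

    `margin` is the prespecified noninferiority margin (e.g., 3 hours).
    Mirrors the scenarios of Figure 1; the labels are illustrative.
    """
    if lower > margin:
        return "inferior"                    # scenario E
    if upper >= margin:
        return "noninferiority not shown"    # scenario B (CI crosses margin)
    if upper < 0:
        return "noninferior and superior"    # scenarios A and D
    return "noninferior"                     # scenario C

# With a 3-hour margin for the hospital-stay example:
print(interpret_ci(-11, -1, 3))  # noninferior and superior
print(interpret_ci(-2, 4, 3))    # noninferiority not shown
print(interpret_ci(-1, 2, 3))    # noninferior
print(interpret_ci(4, 9, 3))     # inferior
```

Note that the conclusion hinges entirely on where the CI sits relative to the margin and the zero line, not on a p value against “no difference.”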
The choice of noninferiority margin requires sound clinical judgment.18 The noninferiority margin should be the smallest clinically meaningful difference between the 2 surgical interventions. In general, margins for mortality or serious adverse events should be more stringent than those for symptom control or quality of life.18 Many experts have stipulated that the noninferiority margin for efficacy outcomes should be no more than 50%, and preferably no more than 20% of the treatment effect for the standard treatment, as established in placebo-controlled superiority RCTs.18,19 Unfortunately no validated rules exist for calculating the noninferiority margin, and many trials use margins that statisticians consider to be too liberal.21 It is important that, whenever possible, this margin be validated by published expert consensus22 and not left to the sole discretion of the investigators or sponsors of the study.18
The article you identified is a prospective, open-label, multicentre, randomized noninferiority trial by Ardehali and colleagues15 conducted at 10 heart transplant centres in the United States and Europe. Eligible heart transplant patients were randomly assigned to receive either donor hearts preserved with the organ care system (OCS; ex-vivo heart perfusion) or standard cold storage (SCS). The key methodological characteristics of the study are summarized in Figure 2 and Table 1.15
Are the results valid?
What are the results?
How can I apply the results to my patient or clinical practice?
As with the more commonly seen superiority RCT, the noninferiority RCT is expected to minimize the risk of bias by ensuring concealment of randomization, balance of known and unknown prognostic factors, blinding of patients, surgeons and outcome assessors to treatment allocation, and complete follow-up of all patients. In reviewing the noninferiority trial by Ardehali and colleagues,15 you see that an independent biostatistician prepared sealed and masked randomization envelopes, which were assigned to the research trial sites. The investigators, however, did not report whether the envelopes were opened sequentially and one at a time. Patients, investigators and medical personnel were not blinded to group allocation; an open-label design was chosen because the method of donor heart preservation made blinding of medical staff infeasible. In reviewing Table 1 of their article, you see no glaring differences in the main demographic characteristics of patients assigned to the 2 competing approaches; these characteristics included age, sex, height, body mass index (BMI) and diagnosis of cardiomyopathy of the recipient patients, and the cause of death of the donor patients.
There was, however, some imbalance in the preservation time before the heart transplantation. The preservation time was longer in the OCS group than in the SCS group (324 ± 79 min v. 195 ± 65 min, p < 0.001); however, the mean total ischemia time was significantly shorter in the OCS group than in the SCS group (113 ± 27 min v. 195 ± 65 min, p < 0.001).
As heart transplantation is a definitive procedure, there is probably little room for differential care to affect the prognostic balance after the intervention. Figure 2 of the article by Ardehali and colleagues15 shows the flow of the patients in the 2 groups. It details the results of the randomization protocol, wherein 130 patients were randomly assigned: 67 to OCS and 63 to SCS. There appear to have been deviations from the protocol in 2 patients in the OCS group and 5 in the SCS group. It is important to note that 2 patients in the OCS group and 1 patient in the SCS group crossed over (i.e., these patients underwent transplantation using the other system).
Ardehali and colleagues15 did not report details regarding postoperative care, so you do not know if there was differential care between the 2 groups. Therefore, you cannot conclude with any certainty whether the 2 groups were balanced in this regard.
In the present noninferiority trial, the investigators declared the noninferiority margin (Δ) to be 0.10 (10%). Unfortunately, they did not provide any evidence to support this difference, which leads you to wonder if a smaller noninferiority margin (e.g., 5%) could have, or indeed should have, been accepted.
The purpose of randomization is to ensure that prognostic factors are balanced between the surgical interventions. Patients who do not adhere to the allocated treatment, as specified in the study protocol, may have a different prognosis than those who do.24 Omitting patients who do not adhere to the novel intervention is likely to bias results toward overestimating treatment effects in a superiority trial. An intention-to-treat analysis, wherein patients are analyzed according to the group to which they were assigned, provides an unbiased estimate of treatment effectiveness, irrespective of adherence to the study protocol.
Ardehali and colleagues15 conducted both the intention-to-treat and the as-per-protocol analyses and found similar results. Therefore, you remain assured that the authors’ analysis of the results may be appropriate. However, the authors could have further reassured readers by statistically addressing the missing data (e.g., with multiple imputation or best- and worst-case scenarios).
In a noninferiority RCT it is important that the investigators report the noninferiority margin and the rationale for choosing it. In the statistical analysis section of their methods, Ardehali and colleagues15 mentioned a 10% noninferiority margin, but provided no rationale for choosing it.
The investigators reported in the Methods section of their article that they calculated the 1-sided 95% upper confidence bound based on the normal approximation for the difference between the 2 population proportions. An upper confidence bound less than the 10% noninferiority margin would reject the null hypothesis. For the purpose of sample-size calculation, they assumed πOCS = 0.95 and πSCS = 0.94. On the basis of these assumptions, a normal approximation test and a 1-sided α level of 0.05, inclusion of 54 patients per treatment group would provide 80% power.
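This sample-size calculation can be reproduced with the standard normal-approximation formula for noninferiority of two proportions. The sketch below uses the assumptions stated in the article (πOCS = 0.95, πSCS = 0.94, 10% margin, 1-sided α = 0.05, 80% power); the authors’ exact method may differ in detail, but this textbook formula recovers the reported 54 patients per group.

```python
from math import ceil
from statistics import NormalDist

nd = NormalDist()

# Assumptions reported by Ardehali and colleagues
p_ocs, p_scs = 0.95, 0.94   # assumed 30-day survival proportions
margin = 0.10               # noninferiority margin
z_alpha = nd.inv_cdf(1 - 0.05)   # 1-sided alpha = 0.05 -> ~1.645
z_beta = nd.inv_cdf(0.80)        # 80% power -> ~0.84

# Normal-approximation sample size per group for noninferiority of
# two proportions (standard textbook formula)
variance = p_ocs * (1 - p_ocs) + p_scs * (1 - p_scs)
effect = margin - (p_scs - p_ocs)   # margin plus the assumed OCS advantage
n_per_group = ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)
print(n_per_group)  # 54
```

Note how sensitive `n_per_group` is to the margin: halving the margin to 5% roughly quadruples the required sample size, which is why liberal margins make noninferiority trials attractively small.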
There should be a justification for choosing a superiority versus a noninferiority study design. To some degree the authors justified the choice of the noninferiority design in that the OCS provides certain benefits, such as the potential of “distant procurement for donor hearts, thus expanding the donor pool,” in contrast to standard cold storage. The justification of the study design rests on the research question asked and the hypothesis, specifically on the clinical advantages of the novel intervention. The measured outcomes play an important role in the sample size calculation through the choice of the MCID. Survival should demand a smaller MCID than, for example, a quality of life (QOL) outcome. The choice of 10% as the noninferiority margin seems very liberal; most surgeons or patients would not accept it in a matter of life or death. A noninferiority margin of 1%–2% would likely have been a better choice. This raises the concern that the study may originally have been designed as a superiority study.
The investigators found that the 30-day patient and heart transplant survival rate (primary outcome) was 94% in the OCS group and 97% in the SCS group (p = 0.45). The intention-to-treat analysis (94% v. 97%, p = 0.36) and the as-per-protocol analysis (93% v. 97%, p = 0.39) supported the overall estimate.
Multiple clinically important outcomes were included, such as graft failure and left and right ventricular dysfunction, with a time horizon of 30 days. Some surgeons may consider this short-term time frame of limited value; a longer follow-up would have been more appropriate. Patient-important outcomes, such as quality of life, were not considered. A validated patient-reported outcome scale would have provided more information on the merits of the comparative interventions. You note this as a limitation of this noninferiority RCT.
The secondary outcomes — serious adverse events, incidence of severe rejection and median length of stay in the intensive care unit (ICU) — were similar for the 2 approaches. Based on these results, the investigators concluded that the OCS approach was not inferior to the SCS approach. You believe that their conclusion is reasonable based on the results of the study.
The precision of the results is normally presented as a confidence interval (CI). In this noninferiority study the authors provided the CI for both the primary outcomes (30-d patient and graft survival) and the secondary outcomes (cardiac-related serious adverse events, incidence of severe rejection and ICU length of stay). They provided this for the intention-to-treat, as-treated and as-per-protocol analyses (Table 1).
The authors reported the 30-day patient and graft survival rates to be 94% in the OCS group and 97% in the SCS group. The patient and graft survival rates were the same, as no repeat heart transplant surgeries were performed. The authors reported that the upper bound of the 95% CI for the percentage differences in the primary effectiveness outcome between the 2 populations was 8.8%, which is less than 10%, so the null hypothesis was rejected in favour of the alternative hypothesis. Based on this finding you concur that noninferiority was shown.
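The reported 8.8% upper bound can be approximately reconstructed from the published figures. The sketch below is illustrative only: it uses the randomized group sizes (67 OCS, 63 SCS) as denominators, whereas the numbers actually analyzed may have differed slightly, so the bound it produces is close to, but not exactly, the published value.

```python
from math import sqrt
from statistics import NormalDist

# Reported 30-day patient and graft survival: 94% OCS, 97% SCS.
# Group sizes are taken from the randomization flow (assumption: the
# analyzed denominators matched the randomized numbers).
p_ocs, n_ocs = 0.94, 67
p_scs, n_scs = 0.97, 63

diff = p_scs - p_ocs   # 0.03 in favour of SCS
se = sqrt(p_ocs * (1 - p_ocs) / n_ocs + p_scs * (1 - p_scs) / n_scs)
z = NormalDist().inv_cdf(0.95)   # 1-sided 95% upper bound

upper = diff + z * se
print(f"upper bound ~ {upper:.3f}")  # close to the reported 0.088
print(upper < 0.10)                  # True: below the 10% margin
```

Because the upper bound falls below the 0.10 margin, the noninferiority null hypothesis is rejected, matching the authors’ conclusion.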
Based on the demographic evidence provided by Ardehali and colleagues15 in Table 1 of their study (not shown here), in which they reported the mean age, weight, height, BMI and types of cardiomyopathy their patients had as well as the donor characteristics, you conclude that the patients treated in your division would be similar and that, therefore, the study’s conclusions are applicable.
The investigators included patient and graft 30-day survival as a primary outcome. A longer survival time horizon (e.g., 1- or 2-yr survival) would have been preferable. The investigators also included 30-day right and left ventricular function and length of stay in the ICU as secondary outcomes. The outcomes research movement of the last 20 years, however, expects clinical investigators to measure patients’ quality of life after medical interventions. Quality of life assessment requires longer follow-up, and this trial was designed around immediate and short-term outcome assessment. The authors might have suggested this for future investigation.
Although the authors concluded that the novel intervention was noninferior to the standard approach, you should not rush to adopt it. The investigators did not report the resource utilization associated with either approach. Many innovations are costly. The ideal study, therefore, would be one in which resource utilization and costs are captured. Health-related quality of life can also be measured using a utility scale from which quality-adjusted life years (QALYs) can be calculated. The integration of costs and QALYs in a cost–utility analysis can help determine whether the new innovation is cost-effective or not.25
There are consequences for future patients and society if incorrect inferences from poorly designed and conducted noninferiority RCTs are accepted. It is important to determine whether this noninferiority study is really a failed superiority RCT. You can do this by determining whether the authors’ noninferiority threshold was appropriate. To do so, you review the literature for similar studies to determine the upper boundary of the CI for the primary outcome (30-d patient and heart transplant survival) and examine the extent to which it exceeds the chosen threshold. If the upper boundary is substantially greater than the threshold chosen by the investigators (10%), you may choose not to adopt the new technology. This is, unfortunately, the case with the RCT by Ardehali and colleagues:15 their noninferiority margin of 0.10 (10%) was chosen without supportive documentation.
Although, in general, you are happy with the designation of this study as an RCT, you are not persuaded that it met all the criteria for its designation as a noninferiority RCT. Specifically, you are concerned that the noninferiority margin of 0.10 (10%) was chosen without supportive evidence. As a result, you advise your colleagues that the study has definite weaknesses. You then offer to review and critique a superiority RCT comparing these approaches at next week’s rounds.
Competing interests: None declared.
Contributors: All authors designed the study. D. Waltho acquired the data, which F. Farrokhyar, D. Waltho and C. Goldsmith analyzed. A. Thoma, F. Farrokhyar and D. Waltho wrote the article, which all authors reviewed and approved for publication.