Randomized clinical trials (RCTs) are the most appropriate research design for studying the effectiveness of a specific intervention. Their results are considered the highest ‘level of evidence’. Published reports on RCTs have already passed a peer review process, but a study may still have undetected major deficiencies that call the reported outcome into question. It is still up to the readers to assess the quality of publications and to ask whether the published results apply to their patients. The major points of such a critical appraisal process are reviewed and discussed with a focus on breast cancer studies.
Nowadays, clinical practitioners are usually overloaded with information from the literature. They have to appraise whether a reported piece of research is worth using in their own clinical decision-making. Many research papers have flaws (even after peer review), but many of these deficiencies are negligible and have almost no effect on the conclusions to be drawn from a study. Thus, the important question is not whether there are defects, but whether those defects matter. It is up to the readers to use their critical appraisal skills to detect the flaws of a study and to decide how they affect the usefulness of the paper. Reading a scientific article critically and evaluating whether its results can be applied to patients is a fundamental skill that all clinicians should have.
The strength of the evidence provided by a study depends on its design. The so-called ‘levels of evidence’ from the Oxford Centre for Evidence-Based Medicine include, from high to low evidence, (1) randomized clinical trials (RCTs), systematic reviews of RCTs [4, 5] and ‘all-or-none’ case series; (2) cohort studies; (3) case-control studies; (4) case series; and (5) expert opinions. As RCTs are the most appropriate research design for studying the effectiveness of a specific intervention or treatment, the critical appraisal at hand focuses on RCTs [1, 7, 8, 9, 10, 11, 12, 13, 14] only.
The first part of a critical appraisal process deals with the internal validity or accuracy of the results. It assesses whether the reported effects represent the correct direction and magnitude. That is, do the results represent an unbiased estimate of the treatment effect, or have they been influenced in some systematic fashion that leads to false conclusions? The five main questions for assessing validity are discussed in turn below.
An essential part of most prospective RCTs is the random allocation of patients to the treatment groups so that accidental or even intentional biases are avoided. Proper randomization ensures that the next treatment allocation is predictable neither for clinicians and study personnel, nor for patients. Pseudo-randomization by birth date, e.g. group A for even and group B for odd days, is predictable and thus not recommended. A coin toss is unpredictable but has the disadvantage that randomizations are not traceable later. Often, sealed envelopes containing random treatment sequences are used, and it can be traced later whether the envelopes have been opened in the assigned order. More complex randomization methods additionally ensure that important prognostic factors are balanced between treatment groups, e.g. age distribution, sex or tumor staging. This stratified randomization can be performed either by sealed envelopes or by computer programs (e.g. with the ‘Randomizer’). Sealed envelopes can only be used for cases with few strata.
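As an illustration of the stratified randomization described above, the following minimal sketch uses permuted blocks within each stratum. It is not the ‘Randomizer’ software mentioned in the text; the class name `StratifiedRandomizer`, the block size of 4, the arm labels and the stratum labels are assumptions made for this example only.

```python
import random


def permuted_block(block_size, arms, rng):
    """Return one shuffled block in which every arm occurs equally often."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    block = list(arms) * (block_size // len(arms))
    rng.shuffle(block)
    return block


class StratifiedRandomizer:
    """Hypothetical helper: one independent allocation list per stratum."""

    def __init__(self, block_size=4, arms=("A", "B"), seed=None):
        self.block_size = block_size
        self.arms = arms
        self.rng = random.Random(seed)
        self.pending = {}  # stratum -> allocations left in the current block
        self.log = []      # audit trail, so that allocations stay traceable

    def assign(self, patient_id, stratum):
        queue = self.pending.setdefault(stratum, [])
        if not queue:
            # Start a new block once the previous one is used up.
            queue.extend(permuted_block(self.block_size, self.arms, self.rng))
        arm = queue.pop()
        self.log.append((patient_id, stratum, arm))
        return arm


# Example: strata defined by age group and tumor stage (labels are assumptions).
randomizer = StratifiedRandomizer(block_size=4, seed=2024)
for pid, stratum in enumerate([("<50", "T1"), (">=50", "T2"), ("<50", "T1")]):
    print(pid, stratum, randomizer.assign(pid, stratum))
```

Within every stratum the arms are balanced after each completed block, and the log provides the traceability that a coin toss lacks. In a real trial the allocation list would have to be concealed from the recruiting personnel.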
Was follow-up complete? Every patient who entered the trial should be included in the final analysis and thus contribute to the conclusions. If a substantial number of patients are lost to follow-up, then the validity of the trial may be questioned, as missing patients may behave differently from the patients remaining in the study (informative missingness). For example, patients may fail to return for follow-up visits either because of adverse outcomes of the treatment or because they are doing so well that they do not want to spend time on further visits. The larger the proportion of missing values, the higher the risk that informative drop-outs change the results of the study.
Were all patients analyzed in the groups to which they were randomized? The so-called ‘intention-to-treat’ (ITT) principle requires that randomized patients remain in their randomized group for analysis even if the scheduled treatment was only partly applied or, even worse, if no treatment at all or a completely different treatment (than randomized) was applied. This might be surprising, but if patients do not properly take their randomized medication or receive a different one, then there will usually be prognostically relevant reasons. For example, if patients are too frail for the scheduled treatment, then the exclusion of such non-compliant patients restricts the analysis to those who may be destined to have a better outcome and destroys the unbiased comparison provided by randomization. ITT is recommended for studies designed to show differences. However, in non-inferiority and equivalence trials, the application of the ITT principle can become problematic, as treatment differences under ITT are often underestimated, so that showing non-inferiority or equivalence may become easier. Additional analyses may include per-protocol analyses, in which only patients who were treated with the randomized treatment and as described in the protocol are included. An as-treated analysis assigns patients to the treatment actually received, which is not necessarily the randomized one. Both the per-protocol and the as-treated analysis are vulnerable to bias due to excluded patients and patients changing groups, respectively. A comparison of the results of all three analysis strategies may shed light on this issue.
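The three analysis strategies can be contrasted on a toy data set. The sketch below is only an illustration under simplifying assumptions: the `Patient` record, its field names and the four example patients are hypothetical, and the outcome is a simple event proportion per arm.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    randomized: str     # arm assigned by randomization
    received: str       # arm actually received
    per_protocol: bool  # treated as specified in the protocol
    event: bool         # outcome of interest, e.g. complete remission


def event_rates(patients, population):
    """Event proportion per arm for the 'itt', 'pp' or 'at' analysis set."""
    counts = {}
    for p in patients:
        if population == "itt":
            arm = p.randomized            # analyzed as randomized
        elif population == "pp":
            if not (p.per_protocol and p.received == p.randomized):
                continue                  # protocol violators are excluded
            arm = p.randomized
        elif population == "at":
            arm = p.received              # analyzed as actually treated
        else:
            raise ValueError(f"unknown population: {population}")
        n, events = counts.get(arm, (0, 0))
        counts[arm] = (n + 1, events + int(p.event))
    return {arm: events / n for arm, (n, events) in counts.items()}


# Made-up patients: the second one was randomized to 'new' but was too
# frail and received 'standard' instead.
patients = [
    Patient("new", "new", True, True),
    Patient("new", "standard", False, False),
    Patient("standard", "standard", True, False),
    Patient("standard", "standard", True, True),
]
for population in ("itt", "pp", "at"):
    print(population, event_rates(patients, population))
```

In this made-up example the new treatment looks better under the per-protocol and as-treated analyses than under ITT, solely because the frail patient with a poor outcome is excluded or moved to the other group.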
Patients, clinicians or other study personnel may intentionally or unintentionally change their attitude and/or behavior in a systematic way if they know which treatment is applied. For example, clinicians may unintentionally give special attention to patients under the new therapy, patients may connect special expectations (or fears) with the experimental therapy, or knowledge of the treatment may affect a subjective assessment of the outcome. Such behavior introduces systematic bias and distorts the results. This is closely related to the well-known ‘placebo effect’, i.e. patients show a stronger ‘treatment effect’ under placebo treatment compared to untreated patients. Thus, treatments should be blinded to all involved persons (including the statistical analyst) whenever possible. For a drug study, smell, taste, shape and color of the drugs and frequencies of applications should be identical. However, specific side effects could still unmask the drugs. Unblinding of single patients should be possible for the responsible physician at any time in case of side effects or adverse events. However, any unblinding has to be documented and should be described, e.g. in the CONSORT flowchart.
If the sample size is sufficiently large and the randomization method is proper and robust, then the make-up of the treatment and control groups should be quite similar, selection bias should be prevented, and the only difference should be intervention versus control treatment. Still, random treatment assignment does not guarantee that the groups are equivalent at baseline. Any differences in baseline characteristics are, however, the result of chance rather than bias. Important demographic and clinical characteristics of the study groups at baseline (start of the study) are frequently summarized in a table. This gives additional information on the comparability of the groups. Despite many warnings of their inappropriateness [16, 17, 18, 19], significance tests of baseline differences between groups are still common in RCTs. Even if a group comparison is statistically non-significant, the differences can be clinically relevant, especially if the sample size is small. Thus, the description and critical discussion of the baseline characteristics for each group is usually preferable to statistical tests.
In some studies, imbalances in participant characteristics (prognostic variables) are adjusted for by using some form of multiple regression analysis [16, 20]. In RCTs, the decision to adjust should not be determined by whether baseline differences are statistically significant. Ideally, adjusted analyses should already be specified in the study protocol.
It is important that the groups are treated completely equally except for the different treatment under investigation. This can be guaranteed if contacts of health workers and study personnel with the patients are blind with respect to the treatment. Treatment applications and follow-up schedules should be identical.
After having considered these five points, one has to decide whether the results of the study can be trusted or whether it is likely that biases may have invalidated the findings of the study. Thus, the next question is: Is it worth continuing? The final assessment of validity is never a clear ‘yes’ or ‘no’ decision and has to remain subjective, at least to some extent.
The reliability of the results can be assessed by posing two questions: How large was the treatment effect, and how precise was its estimate?
The outcome measured in a study can differ with respect to the measuring scale. Often, a dichotomous or binary outcome is chosen, a so-called ‘yes’ or ‘no’ outcome, e.g. remission of the tumor after preoperative chemotherapy or no remission. The difference between two groups can be expressed as absolute risk difference, as relative risk, or as odds ratio. In the example of tumor remission, one is not interested in a risk but rather in the chance of an increased remission rate. The underlying statistical concept is the same for assessing risks or chances. For example, if the proportion of complete remissions is 30% (0.3) under the new treatment and 20% (0.2) under standard treatment, then the group difference expressed as absolute risk (chance) difference is 0.1 (= 0.3 - 0.2): The new treatment increases the chance of a complete remission by 10 percentage points (table 1). Equality of both groups would result in a value of zero. On the other hand, the group difference expressed as relative risk (chance) is 1.5 (= 0.3/0.2): The new treatment increases the probability of a complete remission by 50%. For the relative risk, a value of 1 would mean no treatment difference. The choice of the appropriate measure (here absolute risk difference or relative risk) usually depends on the interpretation of the effect and should be fixed during the planning phase of a trial.
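The three effect measures named above can be computed directly from the two proportions. The short sketch below uses the remission proportions from the text; the function name `effect_measures` is an assumption made for this example, and the odds ratio is added only because the text mentions it as a third option.

```python
def effect_measures(p_new, p_std):
    """Absolute risk difference, relative risk and odds ratio of two proportions."""
    risk_difference = p_new - p_std
    relative_risk = p_new / p_std
    odds_ratio = (p_new / (1 - p_new)) / (p_std / (1 - p_std))
    return risk_difference, relative_risk, odds_ratio


rd, rr, odds_ratio = effect_measures(0.3, 0.2)
print(round(rd, 3), round(rr, 3), round(odds_ratio, 3))  # 0.1 1.5 1.714
```

No difference between the groups corresponds to 0 for the risk difference and to 1 for both the relative risk and the odds ratio.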
If the outcome measure is metric (e.g. bone density after 1 year) and normally distributed, then the difference between groups is described by the difference of the means.
In breast cancer studies, the primary outcome is often ‘time to event’, e.g. time to death or time to disease recurrence. The time to event in each group is displayed by so-called survival curves, and differences between groups are often described by the survival probability at a specific time, e.g. 5-year survival. However, treatment differences are quantified in most cases by a hazard ratio, e.g. the risk for an event is 1.5 times higher under standard treatment than under experimental treatment. A hazard ratio of 1 expresses no group difference. The estimation of a single hazard ratio assumes that this hazard ratio is constant over the whole time period under observation, i.e. that it is the same at, say, 1 year as at 3 and 5 years. This assumption has to be checked before the treatment difference is quantified with a single number.
The true population difference between treatment groups can never be known; all we have is the estimate provided by a rigorous controlled trial, and the best estimate of the true treatment effect is that observed in the trial. However, the estimate of the treatment difference is a point estimate, and it is unlikely that it agrees exactly with the true treatment difference. It is likely, though, that a range around the estimate covers the true population effect with high probability. This range can be estimated by a confidence interval (CI). In medical research, the 95% CI is frequently given in publications, although any other confidence level (e.g. 90 or 99%) could be used as well. There is a close connection between p-values and CIs. For example, if a 95% CI, i.e. a (100 - α)% CI, for the absolute risk difference does not include zero (= no difference), then the corresponding p-value is smaller than the significance level of 5% (α). If the CI includes zero, then the p-value is larger than 5%. For the 95% CI of the relative risk, one has to check whether it includes 1, which indicates no difference. In some cases, the estimates of the absolute difference and of the relative risk can differ with respect to their main message about a treatment difference. However, the type of estimate has to be determined in advance and stated in the study protocol. It is not permissible to choose the estimator depending on the data and the outcome. The width of the CI depends on the sample size of the trial: the higher the number of observations in the groups, the more information is available for the estimation, the more precise the estimate and the narrower its CI. For example, consider again the estimated remission rates of 30% and 20% for the new and standard treatment, respectively.
If the number of patients is 100 in each group, then the 95% CI for the absolute risk (chance) difference of 0.1 is (-0.019; 0.219), and for the relative risk (chance) of 1.5, the 95% CI is (0.916; 2.456) (table 1). For 500 patients per group, the absolute risk difference and relative risk estimates would be more precise and the 95% CIs would narrow to (0.047; 0.153) and (1.203; 1.870), respectively. The larger the sample size, the narrower the CI; but when is the sample size big enough? Usually, a proper sample size calculation has to be made before starting a trial, and the underlying assumptions should be described in the paper. The significance level in a medical research setting is usually 5% and should be two-sided (in almost all cases). The power is the probability of detecting a group difference of a specific size if it really exists and is usually chosen to be 80-90%. The group difference used for the sample size calculation should be the smallest clinically relevant difference that is worth detecting, i.e. one that would change clinical routine so that the new treatment becomes standard. If no details on the sample size calculation are given, one can look at the width of the CI; it is directly related to the sample size.
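The intervals quoted above can be reproduced with standard large-sample (Wald-type) formulas, with the CI for the relative risk computed on the log scale. That the published intervals were obtained in exactly this way is an assumption; the sketch below merely yields the same numbers after rounding. The function names are chosen for this example only.

```python
from math import exp, log, sqrt

Z95 = 1.959964  # 97.5% quantile of the standard normal distribution


def ci_risk_difference(p1, n1, p2, n2, z=Z95):
    """Wald CI for the difference of two independent proportions."""
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se


def ci_relative_risk(p1, n1, p2, n2, z=Z95):
    """CI for the relative risk, computed on the log scale."""
    log_rr = log(p1 / p2)
    # Equivalent to sqrt(1/x1 - 1/n1 + 1/x2 - 1/n2) with x = number of events.
    se_log = sqrt((1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2))
    return exp(log_rr - z * se_log), exp(log_rr + z * se_log)


for n in (100, 500):
    rd_ci = ci_risk_difference(0.3, n, 0.2, n)
    rr_ci = ci_relative_risk(0.3, n, 0.2, n)
    print(n, [round(x, 3) for x in rd_ci], [round(x, 3) for x in rr_ci])
```

With 100 patients per group both intervals include the value of no difference (0 and 1, respectively), whereas with 500 patients per group neither does, although the point estimates are identical.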
The CI also helps to interpret negative studies in which the authors have found that the experimental treatment is no better than the control therapy. If the upper boundary of the CI is a value that is still clinically important, e.g. 2.456 for the relative risk (chance) 95% CI for 100 patients per group, then the study has failed to exclude an important treatment effect and the number of patients included is too small. Such studies fail to prove treatment differences, but also fail to prove that there is no difference.
It is also worth noting that, if a published study with an extremely small sample size shows a significant difference, then there will be a non-negligible chance that the significantly better treatment is in reality worse. This is the most severe error in statistical testing and is called type III error.
Once the size and precision of the treatment effect of the published study have been assessed, one can raise the final question of whether the published results can also be applied to a specific patient.
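This last step in a critical appraisal process is called external validity or applicability and can be assessed by answering three questions: Can the results be applied to the patient at hand, were all important outcomes considered, and are the probable treatment benefits worth the effort?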
It is a natural question whether a specific patient should be treated with the significantly better experimental treatment from a recent publication. If the patient at hand meets all inclusion and exclusion criteria of the study, then the published results should be applicable. Otherwise, the patient would not have been eligible for the study, and judgment is required as to whether the study results may still be applicable. If the inclusion and exclusion criteria are nearly met, e.g. the patient is 2 years too old to be included in the study, then it may still be reasonable to generalize the study results to this patient. Another approach is to ask whether there is some compelling reason why the results should not be applied to the patient.
Reports about subgroup analyses: Special care has to be given to findings about subgroups. Often the treatment effects are also shown for subgroups, which may help to find patient groups with different behavior and treat them in the best possible way. Sometimes it appears that subgroups of patients benefit from treatment and others do not. In statistical terms, this is called an interaction between the therapy and the variable defining the subgroups. In such a situation an explanation has to be found as to whether there is some powerful biologic reason behind the different effects in the subgroups.
Because of the high risk of spurious findings, subgroup analyses are often discouraged. Post hoc subgroup comparisons (analyses done after looking at the data) are especially likely not to be confirmed by further studies, so that such analyses do not have great credibility [16, 20, 23]. On the one hand, the probability increases that effects in subgroups are found by chance alone (multiplicity problem), so that false-positive results may easily be found. On the other hand, the number of observations decreases in the subgroups, so that false-negative results may arise due to decreased power (less information available). Despite that, there is still a strong temptation to perform subgroup analyses in order not to waste potential information and to arrive at preliminary conclusions. Oxman and Guyatt give some recommendations for deciding whether an effect in a subgroup is likely to be real.
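The multiplicity problem mentioned above can be made concrete with a small calculation. The sketch assumes independent subgroup tests, each performed at a significance level of 5%, and no true treatment effect in any subgroup; both the independence and the function name `familywise_error` are simplifying assumptions for this illustration.

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one false-positive result among n_tests
    independent tests when no true effect exists."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(k, round(familywise_error(k), 3))
```

Under these assumptions, ten subgroup tests already yield at least one spuriously significant result with a probability of about 40%.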
All important outcomes should be addressed by the study, e.g. a report on favorable effects of treatment on one outcome may be judged differently in the light of a harmful effect on other outcomes. For breast cancer patients, a prolongation of disease-free survival may be judged differently if the treatment has severe side effects or if the quality of life is heavily affected by the new treatment.
Furthermore, improvement of the outcome should be beneficial to the patients, e.g. a considerable remission of the tumor from preoperative chemotherapy should also have an effect on disease-free or overall survival or at least on an increase in the quality of life by a smaller proportion of breast ablations.
If the study results can be generalized to patients and its outcomes are important, the next question is whether the probable treatment benefits are worth the effort. Let us consider again the example of preoperative chemotherapy. The proportions of complete remission are 30% (0.3) under the new treatment and 20% (0.2) under standard treatment, resulting in a relative risk (chance) of 1.5, which sounds quite impressive. The ‘number needed to treat’ (NNT) is the number of patients who have to be treated with the new therapy in order to prevent, in expectation, one negative event or, in our example, to achieve one additional complete remission compared to standard therapy. With complete remission proportions of 0.3 and 0.2, the NNT is 10 (= 1/(0.3 - 0.2)). So, 10 patients have to be treated with the new therapy to have, in expectation, one more complete remission than if these 10 patients were treated by standard therapy. If the proportions for a complete remission had been 0.03 and 0.02, respectively, the resulting relative risk would also have been 1.5, but the NNT would now be 100 (= 1/(0.03 - 0.02)). This has to be kept in mind when balancing the benefits and the harms (side effects, reduced quality of life, etc.) of the new treatment.
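The NNT calculation above is simply the reciprocal of the absolute risk (chance) difference. The following sketch reproduces both examples from the text; the function name is an assumption, and the result is rounded only to remove floating-point noise.

```python
def number_needed_to_treat(p_new, p_std):
    """NNT as the reciprocal of the absolute difference in proportions."""
    difference = abs(p_new - p_std)
    if difference == 0:
        raise ValueError("NNT is undefined when the proportions are equal")
    return 1 / difference


for p_new, p_std in ((0.3, 0.2), (0.03, 0.02)):
    relative_risk = p_new / p_std
    nnt = number_needed_to_treat(p_new, p_std)
    print(round(relative_risk, 2), round(nnt))
```

Both scenarios have the same relative risk of 1.5, but the NNT differs by a factor of ten, which is why the relative risk alone says little about the practical benefit.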
When conducting a randomized clinical trial, special attention has to be given to the design of the study to avoid any bias from the beginning. Statistics and its proper application are an essential part of well-conducted trials, from study planning to study analysis and interpretation. Even the best statistical analysis strategy cannot compensate for a badly planned and conducted study.
Nowadays, the CONSORT statement gives general guidelines on how to prepare manuscripts on RCTs for publication so that the most relevant information is included and the reader is able to understand how the trial was conducted. Medical journals have improved their peer review process in recent years, and a medical-statistical review is now nearly always mandatory for high-quality journals. Still, it cannot be ruled out that flawed studies are published undetected, and it is still up to the readers to assess publications for themselves.
Checklists for the critical appraisal of publications can be downloaded from the homepage of the Centre for Evidence-Based Medicine. The Centre also offers a software tool (CATmaker) that helps to create Critically Appraised Topics (CATs) about therapy, diagnosis, prognosis, etiology/harm and systematic reviews of therapy.
The author wishes to thank Harald Heinzl for his valuable comments and suggestions.