Our simulation results demonstrate that small sample size is associated with a high risk of imbalance in PFs in individual simple RCTs. The probabilities of an absolute imbalance ≥5% in a binary PF of prevalence 0.5 are 0.42, 0.62 and 0.67 with 125, 50 and 25 patients per arm, respectively. The probability of absolute imbalance decreases as sample size increases or as the prevalence of the PF approaches 0 or 1.
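The imbalance probabilities quoted above can also be approximated without simulation. The following is a minimal sketch, assuming simple randomization with equal arm sizes and PF counts that follow independent binomial distributions in each arm; the function names are illustrative and not taken from the study.

```python
import math


def binom_pmf(n, p):
    """Binomial(n, p) probabilities for k = 0..n, computed on the log scale."""
    log_p, log_q = math.log(p), math.log(1 - p)
    return [
        math.exp(
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        for k in range(n + 1)
    ]


def prob_imbalance(n_per_arm, prevalence, threshold=0.05):
    """P(|difference in PF proportions between the two arms| >= threshold)."""
    pmf = binom_pmf(n_per_arm, prevalence)
    # Smallest difference in PF counts that reaches the threshold
    # (the small offset guards against floating-point round-off).
    min_count_diff = math.ceil(threshold * n_per_arm - 1e-9)
    total = 0.0
    for x1 in range(n_per_arm + 1):
        for x2 in range(n_per_arm + 1):
            if abs(x1 - x2) >= min_count_diff:
                total += pmf[x1] * pmf[x2]
    return total


if __name__ == "__main__":
    for n in (25, 50, 125):
        print(n, round(prob_imbalance(n, 0.5), 2))
```

Under these assumptions the exact values should fall close to the simulated probabilities reported above, and lowering the prevalence from 0.5 lowers the probability for a fixed sample size.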
Failing to adjust for a largely balanced strong PF (RR = 5) in a logistic regression model leads to bias toward no treatment effect when the actual size of treatment effect is moderate (RR = 0.75); this bias varies little for sample sizes greater than 50 patients per arm. Adjusting for such a PF reduces precision of the effect estimate but increases statistical power. The gain in power is comparatively larger when sample size is between 500 and 1000 per arm and prevalence is within 0.2–0.6, relative to other cases. When the PF is less powerful and a treatment difference exists, improvement in accuracy and efficiency associated with the adjustment for a largely balanced PF is less noticeable. When the treatment effect is zero, such covariate adjustment leads to minimal loss of precision. Overall, the simulation results based on a single binary baseline PF suggest it is critical to adjust for important PFs in trials evaluating a binary outcome. If they are ignored, substantial bias due to confounding or non-collapsibility can emerge; bias would be more marked when the PF has high predictive value and sample size is small to moderate.
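The direction of the bias described above can be reproduced in a small Monte Carlo sketch. The code below is not the authors' simulation: it uses a Mantel-Haenszel stratified odds ratio as a simple stand-in for the adjusted logistic regression, and the control-arm risks (0.1 without the PF, 0.5 with it), the conditional odds ratio of 0.6, the prevalence of 0.5, and all function names are assumptions chosen for illustration.

```python
import math
import random


def simulate_trial(n_per_arm, prevalence, risk_ctrl, cond_or, rng):
    """Simulate one two-arm trial; return a 2x2 table per PF stratum.

    Each table is [treated events, treated non-events,
                   control events, control non-events].
    """
    tables = {0: [0, 0, 0, 0], 1: [0, 0, 0, 0]}
    for arm in (1, 0):  # 1 = treated, 0 = control
        for _ in range(n_per_arm):
            pf = 1 if rng.random() < prevalence else 0
            odds = risk_ctrl[pf] / (1 - risk_ctrl[pf])
            if arm == 1:
                odds *= cond_or  # same conditional OR in both strata
            event = rng.random() < odds / (1 + odds)
            idx = (0 if arm == 1 else 2) + (0 if event else 1)
            tables[pf][idx] += 1
    return tables


def crude_log_or(tables):
    """Unadjusted log odds ratio from the table collapsed over the PF."""
    a, b, c, d = (tables[0][i] + tables[1][i] for i in range(4))
    return math.log((a * d) / (b * c))


def mh_log_or(tables):
    """Mantel-Haenszel log odds ratio, stratified by the PF."""
    num = sum(t[0] * t[3] / sum(t) for t in tables.values())
    den = sum(t[1] * t[2] / sum(t) for t in tables.values())
    return math.log(num / den)


def mean_log_ors(n_sims=1000, n_per_arm=400, seed=1):
    """Average crude and PF-adjusted log odds ratios over simulated trials."""
    rng = random.Random(seed)
    crude, adjusted = [], []
    for _ in range(n_sims):
        tables = simulate_trial(n_per_arm, 0.5, {0: 0.1, 1: 0.5}, 0.6, rng)
        try:
            c_val, a_val = crude_log_or(tables), mh_log_or(tables)
        except (ZeroDivisionError, ValueError):
            continue  # skip rare degenerate tables with empty cells
        crude.append(c_val)
        adjusted.append(a_val)
    return sum(crude) / len(crude), sum(adjusted) / len(adjusted)


if __name__ == "__main__":
    crude_mean, adjusted_mean = mean_log_ors()
    print("true conditional log OR:", round(math.log(0.6), 3))
    print("mean crude log OR:", round(crude_mean, 3))
    print("mean adjusted log OR:", round(adjusted_mean, 3))
```

In this setup the PF is balanced in expectation, so the gap between the two averages reflects non-collapsibility rather than confounding: the crude estimate should sit closer to zero than the stratified one.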
It is challenging to establish a single rule for sample size requirement focused on the probability and impact of prognostic imbalance. Multiple factors influence the requirement.
Firstly, sample size should be sufficiently large that the probability of imbalance is restricted to a reasonably low value. The adequate sample size varies with the choice of imbalance measure, the size of imbalance that is deemed important, and the prevalence of the PF. For example, suggests that if an absolute measure of imbalance ≥0.05 is deemed important, 1000 patients per arm is a reasonable size.
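How the required size depends on the tolerated imbalance and the prevalence can be sketched with a normal approximation to the difference in PF proportions between arms. This is an illustrative calculation rather than the method used in the study; the 5% ceiling on the imbalance probability and the function names are assumptions.

```python
import math
from statistics import NormalDist


def approx_prob_imbalance(n_per_arm, prevalence, threshold=0.05):
    """Normal approximation to P(|difference in PF proportions| >= threshold)."""
    se = math.sqrt(2 * prevalence * (1 - prevalence) / n_per_arm)
    return 2 * (1 - NormalDist().cdf(threshold / se))


def n_per_arm_for_imbalance(prevalence, threshold=0.05, max_prob=0.05):
    """Smallest per-arm size keeping the approximate imbalance probability
    at or below max_prob."""
    z = NormalDist().inv_cdf(1 - max_prob / 2)
    return math.ceil(2 * prevalence * (1 - prevalence) * (z / threshold) ** 2)


if __name__ == "__main__":
    print(n_per_arm_for_imbalance(0.5))
    print(round(approx_prob_imbalance(1000, 0.5), 3))
```

Because the variance term 2p(1 - p) peaks at a prevalence of 0.5, that prevalence is the most demanding case; tightening the threshold or the acceptable probability raises the required size quadratically in z / threshold.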
Secondly, sample size should be sufficient to produce a reliable estimate of treatment effect. Although it has less impact on the magnitude of bias around the mean effect estimate in the unconditional setting, sample size does affect precision. While adjusting for PF removes systematic bias, estimates from an individual trial may still deviate from the true effect in either direction due to random sampling variation. and suggest that probabilities of having an absolute deviation >0.05 (in either direction) from the true ORR are 0.87–0.93 and 0.52–0.62 for trials recruiting 125 and 2000 patients per arm, respectively. If trialists are willing to tolerate a slightly bigger deviation from the true ORR, for instance, no more than 0.1, the above probabilities decrease to 0.75–0.81 (125/arm) and 0.20–0.32 (2000/arm) for both models, and 2000 patients per arm then seems to be a reasonable sample size ( and ). As PF becomes less prevalent, larger trial sizes are required for purposes of precision. When randomization partially or completely fails, no statistical adjustment or increase in sample size can fully correct the resulting bias.
The current investigation on the likelihood of prognostic imbalance and its implications for sample size requirements is consistent with previous findings. A minimum of 100 patients per arm has been suggested to control the chance of imbalance of 20% or more in a single PF, and 1350 per arm may be needed to minimize the chance of a 5% imbalance. Although Cui et al calculated the probabilities of a 20% imbalance in at least one out of k independent PFs (k = 2, 3, and 4), situations involving multiple correlated PFs are worth further investigation.
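For independent PFs, the chance that at least one is imbalanced follows from the complement rule. The sketch below is a generic illustration, not a reproduction of the cited calculation: it assumes every PF has the same single-factor imbalance probability, and the value 0.05 used in the loop is hypothetical.

```python
def prob_any_imbalance(p_single, k):
    """P(at least one of k independent PFs is imbalanced), assuming each
    PF has the same single-factor imbalance probability p_single."""
    return 1 - (1 - p_single) ** k


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, prob_any_imbalance(0.05, k))
```

The formula no longer holds once PFs are correlated, which is one reason the correlated case needs separate investigation.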
Gail first demonstrated that omitting balanced baseline covariates in logistic regression asymptotically (i.e. for very large sample sizes) results in downward bias on the subject-specific treatment-outcome association. This is also referred to as the non-collapsibility problem, because the odds ratio as the measure of association between the treatment and the binary outcome within each category of the baseline covariates (i.e. conditional or subject-specific association) is different from the association across all categories of the covariates (i.e. the marginal or average association).
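Non-collapsibility can be seen in a short deterministic calculation with no sampling error. The sketch below assumes a single binary PF that is perfectly balanced between arms and a treatment odds ratio that is identical in both PF strata; the risks and names are hypothetical values chosen for illustration.

```python
def odds(p):
    """Convert a risk to odds."""
    return p / (1 - p)


def risk(o):
    """Convert odds to a risk."""
    return o / (1 + o)


def marginal_or(ctrl_risks, conditional_or, prevalence):
    """Marginal treatment odds ratio implied by a common conditional one.

    ctrl_risks = (control-arm risk without the PF, control-arm risk with it).
    The PF prevalence is the same in both arms, i.e. perfectly balanced.
    """
    weights = (1 - prevalence, prevalence)
    trt_risks = [risk(odds(r) * conditional_or) for r in ctrl_risks]
    p_ctrl = sum(w * r for w, r in zip(weights, ctrl_risks))
    p_trt = sum(w * r for w, r in zip(weights, trt_risks))
    return odds(p_trt) / odds(p_ctrl)


if __name__ == "__main__":
    print(marginal_or((0.2, 0.8), conditional_or=0.5, prevalence=0.5))
```

With control risks of 0.2 and 0.8, a prevalence of 0.5 and a conditional odds ratio of 0.5 in each stratum, the marginal odds ratio is 7/11, about 0.64, which is closer to 1 even though the PF is perfectly balanced. When the PF has no prognostic effect (equal control risks), the marginal and conditional odds ratios coincide.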
In their simulation study, Negassa and Hanley showed that omitting an important balanced continuous or binary covariate in a logistic regression model lowers both the coverage probability (that is, the proportion of the time that the CI contains the true value of interest in a set of hypothetical repetitions of the data collection and analysis procedure) and study power in binary trials with moderate sample sizes (n = 500 and 1000). These findings are complemented by a simulation study that explored the effect of imbalance in two continuous baseline covariates on power in a logistic regression framework when both variables were adjusted for in analyzing small trials (n = 50, 100 and 300). Others quantified the increase in statistical power resulting from covariate adjustment as a decrease in the sample size required in comparison to the unadjusted model.
It was not clear in the literature, however, how the interplay of chance imbalance, the risk of outcome and the prevalence of a binary PF affects treatment effect estimation in trials with a binary outcome. Our simulation study provided information on what constitutes an adequate sample size to guard against the potential impact of prognostic imbalance. Our results based on trials subject to chance imbalance across six sample sizes in the unconditional setting are consistent with the previous findings.
When one is confident that all important PFs are distributed similarly between treatment groups in a trial evaluating a binary outcome, it is sensible to decide whether the goal of the trial is to assess the marginal effect of treatment over patients with heterogeneous baseline prognosis, or to obtain a more individualized treatment effect estimate that is specific to a prognosis. These objectives can be achieved by using the unadjusted and adjusted logistic regression analyses, respectively. With a binary outcome, the two models produce mathematically different results in the presence of a non-zero treatment effect. A mismatch between the study objective, the statistical method, and the interpretation of results can lead to misleading messages. Due to the uncertainty around the existence or magnitude of the treatment effect and possibly different criteria to assess prognostic imbalance, we recommend reporting both the adjusted and unadjusted results in the manuscript.
The CPMP guideline recommends that including important PFs in the primary analysis can be justified only if their associations with the primary outcome are expected to be strong, based on previous evidence, and are specified a priori. What constitutes adequate justification may be a matter of judgment. Our results demonstrate the value of adjustment, and suggest the merits of avoiding excessively stringent criteria when deciding whether prior evidence of prognostic power is adequate.
Our study has several limitations. First, we included only one binary baseline PF to illustrate the probability and impact of prognostic imbalance in RCTs evaluating a binary outcome. For continuous PFs, Ciolino and colleagues proposed a rank-sum ratio to measure the level of imbalance in addition to the commonly used mean values. When multiple PFs are present at baseline, balance in the distribution of the individual PFs and in the overall prognosis needs to be assessed. Although the single binary PF considered in the current study can be conceptualized as a measure of the overall prognosis of a patient based on multiple PFs, for instance, in a propensity score framework, further investigation on the distribution and impact of multiple correlated PFs on effect estimation in RCTs is warranted.
Second, although our investigation was focused on prognostic balancing in individual RCTs, systematic reviews and meta-analyses face the same methodological challenges. The cumulative number of patients from individual RCTs and the between-study variation need to be considered to assess the impact of imbalance on obtaining an aggregated estimate of treatment effects. Future work is needed in these directions.
Our study provides useful new insights. The results can not only help to design clinical trials, but can also inform quality assessment of a body of evidence from RCTs. Our simulation findings provide insights on prognostic imbalance, which pertains to both risk of bias and imprecision. The current study was not designed to propose a single threshold value of sample size that can be readily employed to rate the quality of evidence with respect to precision. Rather, it can guide the selection of such threshold values over various combinations of trial parameters, a subjective process likely influenced by the tolerance of risk.
In summary, prognostic imbalance does not on average jeopardize the internal validity of findings from RCTs, but if neglected, may lead to chance confounding and a biased estimate of treatment effect in a single RCT. To produce an accurate estimate of the treatment-outcome relationship conditional on patients' baseline prognosis, balanced or unbalanced PFs with high predictive value should be adjusted for in the analysis. Covariate adjustment slightly reduces precision, but improves study efficiency, when PFs are largely balanced. Once chance imbalance in baseline prognosis is observed, covariate adjustment should be performed to remove chance confounding.