In the current paper we have described methods to assess the adequacy of the specification of the propensity-score model in the context of 1:1 matching on the propensity score. These methods are based on the propensity score being a balancing score: conditional on the propensity score, treated and untreated subjects will have similar distributions of observed baseline covariates. As suggested by others, one can assess the specification of the propensity-score model by examining whether matching on the propensity score balances baseline covariates between treated and untreated subjects [12]. The methods described in the current paper range from those based on comparing means and prevalences of single covariates between treated and untreated subjects to graphical comparisons of the distribution of continuous variables between treated and untreated subjects. The objective of these diagnostics is to determine whether the distribution of observed baseline covariates is similar between treated and untreated subjects matched on the estimated propensity score. Thus, balance diagnostics serve as a test of whether the propensity-score model has been adequately specified.
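As an illustration of the simplest of these diagnostics, the following minimal sketch computes standardized differences for one continuous and one dichotomous covariate, using the usual definition in which the difference in means (or prevalences) is divided by the square root of the average of the two group variances. The function names and the data are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import mean, variance


def std_diff_continuous(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def std_diff_binary(treated, control):
    """Difference in prevalences of a 0/1 covariate, scaled the same way."""
    p_t, p_c = mean(treated), mean(control)
    pooled_sd = math.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
    return (p_t - p_c) / pooled_sd


# Hypothetical matched sample (illustrative values, not data from the paper).
age_treated = [61, 70, 58, 66, 73, 64]
age_control = [60, 69, 59, 68, 71, 62]
diabetes_treated = [1, 0, 0, 1, 0, 1]
diabetes_control = [1, 0, 0, 0, 0, 1]

d_age = std_diff_continuous(age_treated, age_control)
d_diabetes = std_diff_binary(diabetes_treated, diabetes_control)
print(f"age: {d_age:.3f}, diabetes: {d_diabetes:.3f}")
```

Because the statistic is scaled by a pooled standard deviation rather than a standard error, it can be compared across covariates measured in different units.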

In RCTs, randomization will balance, in expectation, both measured and unmeasured covariates between the different treatment arms. Senn writes that, in an RCT, ‘over all the randomizations the groups are balanced; and that for a particular randomization they are unbalanced’ [35]. Thus, while randomization will, on average, balance covariates between treated and untreated subjects, it need not do so in any particular randomization. Our discussion of the sampling distribution of the standardized difference in Section 3.3 illustrates that in small RCTs, one would expect modest imbalance in some baseline covariates. In the context of RCTs, several authors have advocated that regression analysis be used to obtain estimates of treatment effect after adjusting for potential imbalance in prognostically important baseline covariates [35–38].

Regression analysis can be used within the propensity-score matched sample to adjust for residual imbalance in observed baseline covariates. However, there are limitations to this approach. An unadjusted estimate is a marginal (or population-average) measure of treatment effect, while the adjusted estimate is a conditional (or adjusted) estimate of treatment effect. When the outcome is continuous and a linear treatment effect is used, then the marginal and conditional treatment effects coincide [39]. However, when the outcome is binary or time-to-event in nature, then marginal and conditional treatment effects do not coincide for non-null treatment effects [39]. Reviews of the use of propensity-score methods in the medical literature have found that they are most frequently used to estimate the effect of treatment on binary or time-to-event outcomes, and are rarely used for continuous outcomes [6]. Similarly, in published RCTs in the medical literature, binary and time-to-event outcomes are more common than are continuous outcomes [40]. In the context of binary outcomes, the use of an adjusted analysis would most frequently be accomplished using a logistic regression, with the odds ratio being used as the measure of treatment effect. The odds ratio has been criticized by several clinical commentators who have suggested that the absolute risk reduction (and the associated number needed to treat) as well as the relative risk are more relevant for clinical decision making than is the odds ratio [41–46]. Both the absolute risk reduction and the relative risk can be computed directly in the propensity-score matched sample. Thus, while the use of regression adjustment can remove the effects of residual imbalance in baseline covariates, its usefulness is tempered by the introduction of the odds ratio as the measure of treatment effect. However, when the outcome is continuous, and the treatment effect is linear, then the use of regression adjustment within the propensity-score matched sample may be particularly useful.
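The distinction between marginal and conditional effects for a binary outcome can be made concrete with a small worked example. The sketch below assumes two equally sized strata with the same proportion of treated subjects (so there is no confounding by stratum); the outcome risks are hypothetical values chosen for convenient arithmetic. The conditional odds ratio is identical in the two strata, yet the marginal odds ratio is closer to the null, whereas the risk difference is the same at both levels.

```python
def odds_ratio(p_treated, p_control):
    """Odds ratio comparing two outcome probabilities."""
    return (p_treated / (1 - p_treated)) / (p_control / (1 - p_control))


# Hypothetical outcome risks, as (treated, control), in two equally sized
# strata with equal treatment prevalence. Illustrative values only.
strata = {"low risk": (0.5, 0.2), "high risk": (0.8, 0.5)}

# Conditional (within-stratum) measures of effect.
conditional_or = {name: odds_ratio(pt, pc) for name, (pt, pc) in strata.items()}
conditional_rd = {name: pt - pc for name, (pt, pc) in strata.items()}

# Marginal risks: average over the two equally sized strata.
marginal_pt = sum(pt for pt, _ in strata.values()) / len(strata)
marginal_pc = sum(pc for _, pc in strata.values()) / len(strata)
marginal_or = odds_ratio(marginal_pt, marginal_pc)
marginal_rd = marginal_pt - marginal_pc

print("conditional ORs:", conditional_or)
print(f"marginal OR: {marginal_or:.2f}")
print("conditional risk differences:", conditional_rd)
print(f"marginal risk difference: {marginal_rd:.2f}")
```

In this example the odds ratio is 4 within each stratum but approximately 3.45 marginally, while the risk difference is 0.30 throughout. This is why an odds ratio from a covariate-adjusted logistic regression cannot be read as the marginal effect in the matched sample.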

In propensity-score matching there are two competing issues at play. First, even if the true propensity score were known, matching on the propensity score would only in expectation produce balance between treated and untreated subjects in the matched sample. However, a particular matched sample may be subject to imbalance, just as a particular RCT may be subject to imbalance. Our examination of the sampling distribution of the standardized difference in Section 3.3 illustrates that even if the propensity-score model is correctly specified, one can still observe modest standardized differences for baseline covariates. Second, in practice the true propensity score is not known. Thus, residual systematic differences between treated and untreated subjects may be reduced by improving the specification of the propensity-score model. The quandary for the applied researcher is to determine to what degree observed differences between treated and untreated subjects in the matched sample represent a property of that specific sample (which would be observed even if the true propensity score were known) and to what degree observed differences are indicative of the fact that the propensity-score model has been mis-specified. We suggest that modifications of the propensity-score model be attempted with the objective of having the estimated standardized differences for observed baseline covariates lie within the 2.5th and 97.5th percentiles of the empirical sampling distribution of the standardized difference in a sample of the appropriate size. In small samples, despite heroic attempts to modify the propensity-score model, one may not be able to have all estimated standardized differences below some arbitrary threshold such as 0.25 (25 per cent) or 0.10 (10 per cent). However, the goal in modifying the propensity-score model should be to have the estimated standardized differences lie below the threshold that is consistent with the propensity-score model having been correctly specified. In small samples, it may be necessary to use regression adjustment to remove the effect of residual imbalance of measured confounders.
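One way to approximate such percentiles is by simulation. The sketch below is an assumption-laden illustration and not necessarily the procedure of Section 3.3: it draws a continuous covariate from the same standard normal distribution in both groups (so that any imbalance is due to chance alone), treats the two groups as independent samples, and records the standardized difference over repeated draws. The function names, group sizes, number of replicates and seed are all hypothetical choices.

```python
import math
import random
from statistics import mean, quantiles, variance


def std_diff(treated, control):
    """Standardized difference for a continuous covariate."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def null_interval(n_per_group, n_sims=1000, seed=0):
    """2.5th and 97.5th percentiles of the standardized difference when
    both groups are drawn from the same N(0, 1) distribution."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        sims.append(std_diff(treated, control))
    # n=40 yields 39 cut points in steps of 2.5 per cent.
    cuts = quantiles(sims, n=40, method="inclusive")
    return cuts[0], cuts[-1]


lo_small, hi_small = null_interval(25)
lo_large, hi_large = null_interval(400)
print(f"25 per group:  [{lo_small:.3f}, {hi_small:.3f}]")
print(f"400 per group: [{lo_large:.3f}, {hi_large:.3f}]")
```

The interval narrows as the group size grows, which is why a fixed threshold such as 0.10 can be unattainable in a small matched sample yet too lenient in a large one.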

In addressing the question of observed differences in baseline covariates between groups in the matched sample, both Ho *et al.* and Imai *et al.* suggest that imbalance should be minimized without limit [12, 29]. A difficulty with this approach is highlighted in the previous paragraph: even if the true propensity score were known, it is likely that a certain degree of residual imbalance would be observed. In Section 3 we described methods to determine the empirical sampling distribution of the standardized difference under the assumption that the propensity-score model had been correctly specified. For modest sample sizes, one could expect standardized differences that exceed 0.20 (20 per cent) even when the propensity-score model was correctly specified. The only way to eliminate all imbalance is by matching on all covariates. However, when the number of covariates is large, this is likely to either not be feasible or to result in a dramatically reduced sample size. It is to address this limitation of matching that propensity-score methods were developed.

In our empirical study of different methods for assessing balance, we observed that when we relied on visual comparison of means and proportions, and quantified differences between groups in the matched sample using standardized differences, we came to the conclusion that treated and untreated subjects in the matched sample were similar, with only negligible differences (standardized difference ≤0.030). However, when we compared ratios of the variances of continuous variables between treated and untreated subjects, it was evident that the variance of the distribution of some covariates differed between treated and untreated subjects in the matched sample. Similarly, when we used graphical displays to compare distributions, we observed that the distribution of age differed to a minor degree between the two groups. In particular, differences were most evident in the upper half of the distribution. These differences were masked when only means were compared. These observations suggest that one should consider a range of diagnostics when comparing balance in observed baseline covariates between groups in the matched sample.
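The way in which a comparison of means can mask distributional differences is illustrated by the following minimal sketch. The age values are hypothetical and deliberately constructed so that the two groups share the same mean but differ in spread; the function names are likewise illustrative. The paired quantiles are the points that a quantile–quantile plot would display.

```python
from statistics import mean, quantiles, variance


def variance_ratio(treated, control):
    """Ratio of sample variances; values far from 1 suggest imbalance."""
    return variance(treated) / variance(control)


def quantile_pairs(treated, control, n=4):
    """Pairs of matching quantiles from the two groups, i.e. the
    coordinates of the points on a quantile-quantile plot."""
    return list(zip(quantiles(treated, n=n), quantiles(control, n=n)))


# Hypothetical ages with identical means but different spread.
age_treated = [50, 55, 60, 65, 70, 75, 80]
age_control = [59, 61, 63, 65, 67, 69, 71]

mean_diff = mean(age_treated) - mean(age_control)
ratio = variance_ratio(age_treated, age_control)
pairs = quantile_pairs(age_treated, age_control)

print(f"difference in means: {mean_diff}")
print(f"variance ratio: {ratio:.2f}")
for q_treated, q_control in pairs:
    print(f"treated quantile {q_treated:.1f} vs control quantile {q_control:.1f}")
```

Here the difference in means is zero, so the standardized difference is zero as well, but the variance ratio of 6.25 and the diverging lower and upper quartiles reveal that the two distributions are not similar.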

We now summarize our recommendations for the use of balance diagnostics in propensity-score matched samples. First, descriptive statistics should always be reported comparing the means of continuous covariates and the frequency distribution of categorical variables between treated and untreated subjects in the matched sample. These are most easily communicated in a table comparing baseline characteristics between treated and untreated subjects in the matched sample. Second, standardized differences should also be reported comparing the means and prevalences of continuous and dichotomous variables between treated and untreated subjects in the matched sample. Third, variances of continuous variables should be compared between treatment groups in the matched sample. Alternatively, one can use standardized differences to compare the means of squared terms of these variables (this is equivalent to comparing second order moments of that variable). For both approaches, one can determine the range of variance ratios or standardized differences that are consistent with the propensity-score model having been adequately specified. The use of five-number summaries may serve as a rough guide for assessing imbalance in baseline covariates; however, its use is limited by difficulty in assessing the amount of variation that one would expect if the propensity-score model were correctly specified. Fourth, the means of two-way interactions between baseline covariates can be compared between treated and untreated subjects. Fifth, the use of quantile–quantile plots of selected, prognostically important covariates can provide further evidence of whether the propensity-score model has been correctly specified. This approach can be seen as a complement to comparisons of variances or second order moments. 
Sixth, methods based on comparing the distribution of the estimated propensity score between treatment groups are of limited use and provide little information as to whether the propensity score has been correctly specified. The diagnostics described in the current paper have been proposed for assessing whether the propensity-score model has been adequately specified in the context of 1:1 matching on the propensity score. Many of these methods may be modified to the context of many-to-one matching on the propensity score by the inclusion of sample weights, as described elsewhere [47].
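The third and fourth recommendations, comparing squared terms and two-way interactions, reduce to applying the standardized difference to derived variables. The sketch below is a minimal illustration for the 1:1 matched setting only (no sample weights); the helper names and the data are hypothetical.

```python
import math
from statistics import mean, variance


def std_diff(treated, control):
    """Standardized difference between two groups of values."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def squared(xs):
    """Squared term, used to compare second-order moments."""
    return [x * x for x in xs]


def interaction(xs, zs):
    """Element-wise product of two covariates (a two-way interaction)."""
    return [x * z for x, z in zip(xs, zs)]


# Hypothetical matched sample: age has the same mean in both groups.
age_treated = [50, 55, 60, 65, 70, 75, 80]
age_control = [59, 61, 63, 65, 67, 69, 71]
diabetes_treated = [0, 0, 1, 0, 1, 1, 1]
diabetes_control = [1, 0, 1, 0, 1, 0, 1]

d_age = std_diff(age_treated, age_control)
d_age_sq = std_diff(squared(age_treated), squared(age_control))
d_age_x_diabetes = std_diff(
    interaction(age_treated, diabetes_treated),
    interaction(age_control, diabetes_control),
)
print(f"age: {d_age:.3f}")
print(f"age squared: {d_age_sq:.3f}")
print(f"age x diabetes: {d_age_x_diabetes:.3f}")
```

In this constructed example the standardized difference for age is zero while that for its square is not, reflecting the difference in variance between the groups that the comparison of means alone would miss.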

In summary, we have described diagnostics for assessing whether the propensity-score model has been adequately specified when using propensity-score matching. Implementing multiple complementary methods for assessing balance allows researchers to better determine whether the propensity-score model has been adequately specified, and thus determine the degree to which matching on the estimated propensity score has reduced or eliminated systematic differences between treated and untreated subjects.