HHS Public Access author manuscript; accepted for publication in a peer-reviewed journal.
Pharmacoepidemiol Drug Saf. Author manuscript; available in PMC 2013 September 3.
Published in final edited form as:
PMCID: PMC3760213

Reweighted Mahalanobis Distance Matching for Cluster Randomized Trials with Missing Data

Robert A. Greevy, Jr, PhD; Carlos G. Grijalva, MD, MPH; Christianne L. Roumie, MD, MPH; Cole Beck, BA; Adriana M. Hung, MD; Harvey J. Murff, MD, MPH; Xulei Liu, MS; and Marie R. Griffin, MD, MPH



This paper introduces an improved tool for designing matched-pairs randomized trials. The tool allows the incorporation of clinical and other knowledge regarding the relative importance of variables used in matching and allows for multiple types of missing data. The method is illustrated in the context of a cluster-randomized trial. A web application and R package are introduced to implement the method and incorporate recent advances in the area.


Reweighted Mahalanobis Distance (RMD) matching incorporates user-specified weights and imputed values for missing data. Weights may be assigned to missingness indicators to match on missingness patterns. Three examples are presented, using real data from a cohort of 90 Veterans Health Administration sites that had at least 100 incident metformin users in 2007. Matching is used to balance seven factors aggregated at the site level. Covariate balance is assessed for 10,000 randomizations under each strategy: simple randomization, matched randomization using the Mahalanobis distance (MD), and matched randomization using the RMD.


The RMD matching achieved better balance than simple randomization or MD randomization. In the first example, simple and MD randomization resulted in a 10% chance of seeing an absolute mean difference of greater than 26% in the percent of nonwhite patients per site; the RMD dramatically reduced that to 6%. The RMD achieved significant improvement over simple randomization even with as much as 20% of the data missing.


RMD matching provides an easy-to-use tool that incorporates user knowledge and missing data.


In trials designed to reflect routine care, the cluster-randomized trial is appealing. By randomly assigning interventions to physicians or hospitals instead of directly to patients, routine care settings may be studied under experimental interventions, while problems such as treatment contamination can be prevented. However, the number of clusters being randomized is often relatively small. A study including thousands of patients may have randomized only a dozen hospitals. In this situation, assigning treatments with a simple randomization, e.g., drawing the names of half of the hospitals out of a hat, is unacceptably risky. When the number of units being randomized is small, there is substantial risk that severe imbalance in important covariates will occur by chance.1

Restricted randomization methods are commonly used to reduce this risk. Cluster-randomized trials frequently have the advantage of covariate information being available on all units prior to randomization; those units are randomized all at once or in a few batches. Stratified randomization is a commonly used restricted randomization method that creates strata based on a few important covariates and then randomly assigns half of the units in each stratum to one treatment and half to the other. While providing some benefit, this approach is limited to including only a few of the important covariates and the categorization of continuous covariates into a few bins. In spite of its limitations, the general concept is sound. The ideal stratification would contain exactly two similar units within each stratum. Matching prior to randomization achieves this without requiring categorization of continuous covariates, without severely limiting the number of covariates being balanced, and without requiring units that match perfectly to achieve balance between the study arms.

The benefits of any restricted randomization method depend on its ability to balance important covariates, the strength of the association between the covariates and the outcome, and the study’s sample size. In a non-clustered study of 132 patients randomly assigned treatment, Greevy et al. demonstrated that optimal nonbipartite matching on the Mahalanobis distance derived from 14 covariates resulted in an average increase in power equivalent to a 7% increase in sample size.1 Moreover, this approach eliminated the rare but severe imbalances that may occur with simple randomization. Despite the method’s superior performance, it presently remains less widely used than simple or stratified randomization. In cluster-randomized trials, wider adoption has been hindered by misunderstandings about matching and an absence of user-friendly tools to implement the method.

In their 2009 paper, Imai et al. dispel the major misconceptions surrounding matched-pair cluster-randomization (MPCR).2 For example, they examine the assumptions leading Martin et al. to recommend against MPCR in small samples.3 When the assumption of equal cluster sizes is relaxed, as is appropriate for most practical scenarios, the MPCR that matches on cluster size and pre-treatment covariates will improve the study’s efficiency and power over unmatched cluster-randomization, even with as few as six clusters. In a discussion of Imai et al.’s paper, Zhang and Small show the utility of optimal nonbipartite matching for achieving pre-treatment covariate balance in MPCR and for optimally selecting a set of units for study when the number of units available is greater than the number needed.4 For observational studies utilizing matching, Rosenbaum presents a method of augmenting the distance matrix to optimally choose the number of units to study for a specified level of quality of match.5 When the quality of matches is of greater concern than the exact number of units included, this approach can be very useful in the MPCR setting. Both approaches are incorporated into the methods presented here.

To fully realize the benefits of MPCR with several pre-treatment covariates, including continuous measures and cluster size, a multivariate distance measure is needed. To balance the cluster-specific covariate distributions, appropriate summary measures are chosen. Categorical variables may be summarized with proportions, e.g., the percentage of patients taking statins. Likewise, when the shape of the distributions is not highly variable, a single summary measure may suffice for continuous covariate distributions, e.g., mean low-density lipoprotein (LDL). Otherwise, multiple measures may be used, e.g., the mean and standard deviation of LDL or the 10th, 50th, and 90th percentiles. Once a continuous multivariate distance measure is developed, the optimal set of matches is the set that minimizes the average distance between pairs. Lu et al. recently released an R package and web application that takes a user-created matrix of distances between units and solves for the optimal matches.6 However, the creation of the distance matrices may create an obstacle for some researchers, and improving the utility of distance measures is an open area of research.

This paper addresses the need for a customizable distance measure that incorporates clinical and other knowledge regarding the importance of the covariates while also allowing the inclusion of covariates with missing values. The proposed method offers two ways to exclude units when more units are available than can be included in the study. We introduce two user-friendly tools to implement the methodology: a web application and an R package. To aid the development of the distance measure, the web application includes tools for assessing the quality of the matches prior to randomization and comparing them to benchmark values, which assists the user in choosing covariate weights. Once the choice of weights has been finalized, the application allows the user to perform the official randomization with a user-specified random seed for reproducibility and, if needed, to randomize additional study units added after the first set of treatment assignments has been made. The web application and instructions on downloading the R package nbpMatching are available online. Examples using real data from VHA sites are presented to illustrate the method.


MD and RMD

The Mahalanobis Distance (MD) is a multivariate distance measure akin to the familiar Euclidean Distance; however, it has two additional benefits. First, it is scale invariant, e.g., including a site’s pre-treatment mean LDL in mg/dL will yield the same results as LDL in mmol/L. Second, it incorporates the correlations between the covariates. The effect may be thought of as down-weighting a difference in one covariate that is expected based on the differences observed in the other covariates. The MD may be written as

$$\mathrm{MD}(x_i, x_j) = \sqrt{(x_i - x_j)^{T} S^{-1} (x_i - x_j)}$$

where $x_i$ is the $i$th row of the $(n \times p)$ covariate matrix $X$, with $n$ subjects in the rows and $p$ covariates in the columns, and $S$ is the $(p \times p)$ covariance matrix of $X$.
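The scale invariance described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the site values, the helper names (`covariance`, `solve`, `mahalanobis`), and the approximate LDL conversion factor of 38.67 mg/dL per mmol/L are illustrative assumptions, not part of the paper's software.

```python
import math

def covariance(X):
    """Sample covariance matrix (p x p) of the rows of X, with divisor n - 1."""
    n, p = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    return [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in X) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(A[r]) + [b[r]] for r in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    y = [0.0] * m
    for r in reversed(range(m)):
        y[r] = (M[r][m] - sum(M[r][k] * y[k] for k in range(r + 1, m))) / M[r][r]
    return y

def mahalanobis(X, i, j):
    """MD between rows i and j of X: sqrt(d^T S^{-1} d) with d = x_i - x_j."""
    S = covariance(X)
    d = [a - b for a, b in zip(X[i], X[j])]
    y = solve(S, d)  # y = S^{-1} d, computed without forming the inverse
    return math.sqrt(sum(dk * yk for dk, yk in zip(d, y)))

# Hypothetical site-level data: mean LDL (mg/dL) and percent of patients on statins.
sites = [[100.0, 40.0], [120.0, 55.0], [95.0, 35.0], [130.0, 50.0], [110.0, 60.0]]
# The same sites with LDL expressed in mmol/L (approximate conversion factor).
sites_mmol = [[ldl / 38.67, statin] for ldl, statin in sites]

d_mgdl = mahalanobis(sites, 0, 1)
d_mmol = mahalanobis(sites_mmol, 0, 1)
print(d_mgdl, d_mmol)  # the two distances agree up to floating-point error
```

Solving the linear system rather than inverting $S$ is a design choice of this sketch; it assumes $S$ is nonsingular, which need not hold for real site data with redundant covariates.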

A limitation of the MD is that the influence the covariates have on the distance is driven purely by their covariance structure, not their clinical importance. The Reweighted Mahalanobis Distance (RMD) incorporates user-specified weights, imputed values for missing covariate data, and indicators of covariate missingness. We refer to the distance as “reweighted” to distinguish it from similarly named measures used in different settings.7–10 The RMD may be written as

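$$\mathrm{RMD}(\tilde{x}_i, \tilde{x}_j) = \sqrt{(\tilde{x}_i - \tilde{x}_j)^{T} W \tilde{S}^{-1} W (\tilde{x}_i - \tilde{x}_j)}$$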

where $\tilde{X}$ is the $(n \times (p+q))$ covariate matrix consisting of $X$ with missing values replaced by imputed values and with indicator variables added for the $q$ covariates with missingness, $\tilde{x}_i$ is the $i$th row of $\tilde{X}$, $\tilde{S}$ is the $((p+q) \times (p+q))$ covariance matrix of $\tilde{X}$, and $W$ is a $((p+q) \times (p+q))$ diagonal matrix of user-specified weights. Various methods may be used to impute the missing values, provided the imputation estimates an expected value without added random noise. The application presented here currently uses the transcan function from the R package Hmisc, which transforms the covariates to have their maximum correlation with the best linear combination of the other covariates and returns expected values on the original scale that may be interpreted as an expected median or mode for continuous or categorical variables, respectively.11

The usefulness of the imputed values will depend partially on the validity of the missing-at-random (MAR) assumption on which they are based.12 If the MAR assumption is in question, researchers may wish to match on missingness patterns more than on the imputed missing values. This may be achieved through adjusting the weights for the missingness indicators. The web application currently uses the same weight for all missingness indicators and a small default value of 0.1, giving preference to the MAR assumption. A weight of 0 can be used to completely eliminate the impact of the indicators.
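The construction of the augmented covariate matrix and the role of the weights can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses column-mean imputation as a simple stand-in for transcan, it assumes the weights enter the distance as $W \tilde{S}^{-1} W$, and the site values and function names (`build_augmented`, `rmd`) are hypothetical.

```python
import math

def covariance(X):
    """Sample covariance matrix of the rows of X, with divisor n - 1."""
    n, p = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    return [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in X) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(A[r]) + [b[r]] for r in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    y = [0.0] * m
    for r in reversed(range(m)):
        y[r] = (M[r][m] - sum(M[r][k] * y[k] for k in range(r + 1, m))) / M[r][r]
    return y

def build_augmented(X):
    """Replace None with the column mean (a stand-in for transcan) and append
    one 0/1 missingness indicator for each column that has missing values."""
    p = len(X[0])
    miss_cols = [k for k in range(p) if any(row[k] is None for row in X)]
    means = []
    for k in range(p):
        observed = [row[k] for row in X if row[k] is not None]
        means.append(sum(observed) / len(observed))
    Xt = []
    for row in X:
        filled = [means[k] if row[k] is None else row[k] for k in range(p)]
        flags = [1.0 if row[k] is None else 0.0 for k in miss_cols]
        Xt.append(filled + flags)
    return Xt, miss_cols

def rmd(Xt, weights, i, j):
    """Assumed form: sqrt(d^T W S^{-1} W d), with W = diag(weights)."""
    S = covariance(Xt)
    wd = [w * (a - b) for w, a, b in zip(weights, Xt[i], Xt[j])]
    y = solve(S, wd)
    return math.sqrt(max(0.0, sum(u * v for u, v in zip(wd, y))))

# Hypothetical sites: percent nonwhite and mean BMI (None marks a missing value).
X = [[22.0, 31.0], [35.0, None], [18.0, 29.5], [40.0, 33.0], [27.0, None], [30.0, 30.0]]
Xt, miss_cols = build_augmented(X)
weights = [10.0, 1.0, 0.1]  # race up-weighted; 0.1 for the missingness indicator
print(rmd(Xt, weights, 0, 1))
```

Setting the indicator weight to 0 removes the missingness pattern from the distance, while a larger value makes sites with the same pattern of missing covariates more likely to be paired.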

Optimality and Limitation on the Number of Clusters

Using the RMD, an (n×n) matrix of distances between units is created. Unlike a retrospective cohort study, where units from one group are matched to units in another group, any of the n units may be matched to any of the other units. The optimal set of matches is the set that yields the smallest average RMD distance between the matched pairs. Thanks to advanced methods for solving this so-called optimal nonbipartite matching problem, the computational complexity no longer appreciably limits its use.13 The web application presented here has successfully handled up to 5,000 units, well beyond the typical number of clusters in a cluster-randomized trial.
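To make the optimality criterion concrete, the sketch below finds the minimum-total-distance pairing by exhaustive dynamic programming over subsets. It is only practical for a small number of units and is not the shortest-path algorithm cited above; the function name `optimal_pairs` and the distance matrix are hypothetical. Because the number of pairs is fixed, minimizing the total distance is the same as minimizing the average.

```python
from functools import lru_cache

def optimal_pairs(D):
    """Return (total distance, pairs) for the best perfect matching of an
    even number of units, where D is a symmetric distance matrix."""
    n = len(D)
    if n % 2:
        raise ValueError("an even number of units is required")

    @lru_cache(maxsize=None)
    def best(mask):
        # mask has bit u set when unit u is still unmatched
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1  # lowest-numbered unmatched unit
        rest = mask & ~(1 << i)
        result = None
        for j in range(i + 1, n):
            if (rest >> j) & 1:
                cost, pairs = best(rest & ~(1 << j))
                candidate = (D[i][j] + cost, ((i, j),) + pairs)
                if result is None or candidate[0] < result[0]:
                    result = candidate
        return result

    return best((1 << n) - 1)

# Hypothetical distances. Greedily pairing the closest units first would take
# (0, 1) at distance 1 and be forced into (2, 3) at distance 10, total 11.
D = [
    [0, 1, 2, 5],
    [1, 0, 5, 2],
    [2, 5, 0, 10],
    [5, 2, 10, 0],
]
total, pairs = optimal_pairs(D)
print(total, pairs)
```

The example shows why the matching is solved jointly rather than pair by pair: accepting a slightly worse first pair avoids a very poor second pair.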

Optimally Selecting a Subset of Clusters

In cluster-randomized studies, the cost of each hospital or experimental unit included in the study often limits the number of units that may be used, so more units may be available for participation than can be included. Typically the units that are excluded are chosen via ad hoc procedures. Some reasons to exclude a unit may be obvious, such as logistical difficulties unique to that unit. When there is no clear choice, the matching method can optimally select which units to drop by removing those that would create the greatest imbalance between the groups.4 The user specifies a number of units to exclude, say $k$ units. The application adds $k$ units to the cohort that have the special property that they match every other unit perfectly. These units are usually referred to as “sinks” and are labeled “phantoms” by the applications presented here. Units that are matched to the phantoms are excluded from the study. In an alternate approach, the user specifies a threshold for acceptable distances, $\tilde{\delta}$, that matches must meet to be included in the study. The distance matrix used for matching is augmented to select the optimal set of units satisfying the threshold.5 This approach is equivalent to adding units to the study that are distance $\tilde{\delta}$ from all other units. Any unit that is matched to one of these “near-matchers,” or “chameleons,” as they are called in our applications, is excluded from the study.
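The phantom approach can be sketched by augmenting the distance matrix before matching. In the sketch below, phantoms are at distance 0 from every real unit; giving phantom-to-phantom pairs a very large distance, so that phantoms are never paired with each other, is an assumption of this illustration, as are the function names and the one-dimensional site values. The exhaustive matcher is suitable only for small examples.

```python
from functools import lru_cache

def optimal_pairs(D):
    """Exhaustive minimum-total-distance perfect matching (small n only)."""
    n = len(D)

    @lru_cache(maxsize=None)
    def best(mask):
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1
        rest = mask & ~(1 << i)
        result = None
        for j in range(i + 1, n):
            if (rest >> j) & 1:
                cost, pairs = best(rest & ~(1 << j))
                candidate = (D[i][j] + cost, ((i, j),) + pairs)
                if result is None or candidate[0] < result[0]:
                    result = candidate
        return result

    return best((1 << n) - 1)

def select_with_phantoms(D, k, big=1e9):
    """Add k phantoms to the n real units, match, and drop the real units
    that end up paired with a phantom. Requires n + k to be even."""
    n = len(D)
    size = n + k
    A = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < n:
                A[i][j] = D[i][j]          # real-to-real: original distance
            elif i >= n and j >= n and i != j:
                A[i][j] = big              # phantom-to-phantom: discouraged
            # real-to-phantom entries stay 0.0 (a perfect match)
    _, pairs = optimal_pairs(A)
    kept = [(i, j) for i, j in pairs if j < n]
    dropped = sorted(i for i, j in pairs if i < n <= j)
    return kept, dropped

# Hypothetical one-covariate example: six sites, of which two must be excluded.
values = [1, 2, 10, 11, 50, 30]
D = [[abs(a - b) for b in values] for a in values]
kept, dropped = select_with_phantoms(D, k=2)
print(kept, dropped)
```

Here the two sites without a close partner are absorbed by the phantoms, leaving two well-matched pairs for randomization.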

Evaluating Performance

When the average difference between groups is examined over all possible randomizations, almost any randomization method appears to balance the variables well, because the expected mean difference for all of these methods is zero. However, a particular randomization may be quite poor, showing a large imbalance in the mean difference for a variable. Thus, balance is assessed through the 90th percentile of the absolute mean differences (AMD_90). The AMD_90 is empirically estimated via 10,000 randomizations for each strategy, and all standard errors are ≤0.1 unless specified otherwise. The simple randomization used here balances only the number of sites in each arm, drawing from a set of $\binom{n}{n/2}$ possible randomizations for $n$ sites. The MPCR balances the covariates of interest by restricting to a smaller set of $2^{n/2}$ possible randomizations.
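The evaluation can be sketched as a small simulation. The covariate values, the pairing of adjacent sites, the number of draws, the seed, and the simple empirical quantile rule are all assumptions of this illustration; they are not the VHA data or the exact estimator used in the paper.

```python
import math
import random

def percentile90(xs):
    """Empirical 90th percentile: the ceil(0.9 * n)-th smallest value."""
    s = sorted(xs)
    return s[(9 * len(s) + 9) // 10 - 1]

def amd90_simple(values, n_draws, rng):
    """AMD_90 when half of the sites are drawn at random for treatment."""
    n = len(values)
    diffs = []
    for _ in range(n_draws):
        treated = set(rng.sample(range(n), n // 2))
        t = [values[i] for i in range(n) if i in treated]
        c = [values[i] for i in range(n) if i not in treated]
        diffs.append(abs(sum(t) / len(t) - sum(c) / len(c)))
    return percentile90(diffs)

def amd90_matched(values, pairs, n_draws, rng):
    """AMD_90 when a fair coin picks the treated site within each pair."""
    diffs = []
    for _ in range(n_draws):
        total = 0.0
        for a, b in pairs:
            if rng.random() < 0.5:
                a, b = b, a
            total += values[a] - values[b]
        diffs.append(abs(total) / len(pairs))
    return percentile90(diffs)

# Hypothetical percent-nonwhite values for 12 sites, paired with their neighbor.
values = [5, 7, 12, 14, 20, 23, 35, 38, 50, 54, 70, 75]
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11)]

n = len(values)
n_simple = math.comb(n, n // 2)   # 924 equal-sized splits of 12 sites
n_matched = 2 ** (n // 2)         # 64 within-pair assignments

rng = random.Random(2007)         # fixed seed so the run is reproducible
simple = amd90_simple(values, 2000, rng)
matched = amd90_matched(values, pairs, 2000, rng)
print(n_simple, n_matched, simple, matched)
```

With 12 sites, matching restricts the 924 equal-sized splits to 64 within-pair assignments, and in this toy example the matched AMD_90 cannot exceed the mean of the within-pair differences.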

Conflicts of Interest and Funding

This project was funded in part by the Agency for Healthcare Research and Quality, US Department of Health and Human Services, Contract No. HHSA2902010000161, as part of the Developing Evidence to Inform Decisions about Effectiveness (DEcIDE 2) program. The authors of this research are responsible for its content. Statements here should not be construed as endorsement by the Agency for Healthcare Research and Quality or the US Department of Health and Human Services. The work of RG was supported in part by a grant from the National Institutes of Health, P60AR056116. The work of AMH was supported in full by the Career Development Program from the Department of Veterans Affairs CDA (2-031-09S) from CSR&D. There were no conflicts of interest with this research.

Example Study

As our motivating example, we consider designing a trial to study the effects of early intensification with insulin on low-density lipoprotein (LDL) at 12 months post intensification. In 2006, the American Diabetes Association recommended that metformin be used as the first-line agent unless contraindicated.14 No standard protocol currently exists for patients failing their oral anti-diabetic monotherapy shortly after starting it, i.e., those who show glycosylated hemoglobin (A1c) levels of 7–9% within 3–12 months after initiation. Those failing metformin monotherapy could remain on their current treatment, intensify with insulin, or intensify to a metformin-sulfonylurea dual therapy. We consider randomly assigning Veterans Health Administration (VHA) sites to early intensification with insulin or early intensification with dual therapy. The method proposed will be used to balance (between the two treatment arms) the distributions of covariates known to affect LDL in the VHA patient population.15


To illustrate the method, a hypothetical cohort of potential study sites is drawn from the National VHA databases, which include pharmacy, inpatient, outpatient, and laboratory records. For the VHA fiscal year 2007, 90 VHA sites serving 100–500 incident metformin users were identified. Covariate information for each patient was derived from the 365 days preceding the start of treatment; see Roumie et al. for covariate definitions.15 Matching is used to balance seven factors aggregated at the site level: percentage of nonwhite patients, percent female, percent on statins, mean systolic blood pressure (mmHg), body mass index (BMI), A1c (%), and the number of incident metformin users (N). Three examples are presented. The first compares the performance of the MD and RMD to simple randomization for 12 preselected sites. The second example illustrates the performance when selecting 12 sites out of the 90 potential sites via the use of phantoms or chameleons. The third example illustrates the performance when missingness is induced completely at random (MCAR).

Example 1

The correlations and standard deviations for the 12 sites are shown in Table 1. The comparatively large standard deviation and low correlations for race suggest that it will be more difficult to balance than the other variables. Table 2 shows the AMD_90 for four methods: simple randomization; Mahalanobis distance matching (MD); RMD matching with weight=1 for race and 0 for all other variables (RMD_race); and RMD with weights=10 for race and BMI, 5 for statin use, and 1 otherwise (RMD_race+). The AMD_90 for simple randomization was 26.8%. In other words, simple randomization yielded a 10% chance of a study having a mean difference of at least 26.8% in the percent of nonwhite patients between study arms. MD showed a small improvement with 26.1%. RMD_race dramatically reduced the AMD_90 to 4.8%, and RMD_race+ reduced it to 6.4%. The benefit of MD was seen primarily in BMI, where it reduced the AMD_90 to 0.7, compared with 1.3 for simple randomization and 1.4 for RMD_race. RMD_race+ had comparable balance on BMI at 0.8. Compared to the MD, RMD_race+ achieved dramatically better balance on race and cluster size at the small cost of slightly less balance on the other variables.

Table 1
Variable Correlations and Standard Deviations (12 preselected sites)
Table 2
90th Percentile of the Absolute Mean Difference (AMD_90) for Four Randomization Methods (12 preselected sites)

Example 2

Whereas Example 1 used 12 sites preselected from the 90, the MD and RMD can select an optimal subset via the use of phantoms or chameleons. The balance for sites selected via four different methods is presented in Table 3. For comparison, 500 sets of 12 were selected via simple random samples. On average, simple randomization performed similarly to how it did in Example 1. Utilizing 78 phantoms to optimally select 12 sites, MD yielded slightly better balance on the preselected sites than RMD_race+. In this setting RMD_race+ did not balance the number of patients per site; thus, the variable N (patients per site) was also given a weight of 5, yielding RMD_race++. As in Example 1, RMD_race++ outperformed MD in terms of balancing the difficult variables of race and number of patients per site, while performing almost as well on the other variables. When selecting an optimal subset using chameleons, set with a threshold equal to the 0.2 percentile of the distance matrix, the method selected 16 sites with performance similar to the set selected via phantoms.

Table 3
90th Percentile of the Absolute Mean Difference (AMD_90) for 12 Sites Selected from 90 via Four Methods

Example 3

The performance of simple randomization will improve as the number of sites increases. When randomizing all 90 sites, simple randomization has an AMD_90 of 7.2 for race and 60.5 for patients per site. With as many as 20% of the covariate values missing, RMD_race++ outperformed simple randomization on all variables, especially race and patients per site.


The RMD provides a user-friendly method for researchers to incorporate into the matching process their clinical knowledge and the relative difficulty of balancing important covariates. Greevy et al. have shown that matching prior to randomization outperforms unmatched randomization in non-clustered RCTs,1 and Imai et al. have shown its benefits in clustered RCTs.2 Zhang and Small have shown that optimal nonbipartite matching using an MD may outperform other matching methods,4 and the current paper shows that the RMD may yield results superior to the MD when the perceived quality of the matching depends on the relative clinical importance of the variables. Moreover, the RMD may account for missing data in a sophisticated, yet highly automated, process. Table 4 shows benefits over simple randomization with up to 20% of the data missing.

Table 4
90th Percentile of the Absolute Mean Difference (AMD_90) for RMD_race+ with Three Levels of Induced Missingness

Analysis for MPCR designs is a growing area of research. Recently Imai et al. introduced a harmonic mean estimator for which the study inferences can be justified by the study design alone.2 Zhang, Traskin, and Small have developed a robust test statistic for MPCR trials that outperforms linear mixed models for heavy-tailed distributions and performs nearly as well in the special case where the mixed model assumptions are true.16 The statistic may optionally include covariate adjustment while still relying on the study design to justify inferences via the approach developed by Rosenbaum.17 Many studies will include the covariates used in the matching as variables in the analysis model. To avoid potential bias, we discourage including the indicators for missingness that are created by RMD.18 In addition, the form of the covariates used in the model may also vary from the form used in the matching, e.g., the model may benefit from the transformation of a covariate to account for a nonlinear association with the outcome.

In situations where the potential for imbalance was low, e.g., low variability in the covariates, the benefit of up-weighting variables purely for their clinical importance was small. The choice of weights was best made with a combination of clinical knowledge and examination of the pre-randomization covariate data. Users may influence the impact of individual variables on the MD by dropping variables entirely or applying nonlinear transformations to them, e.g., rank or log. When matching observational data with highly non-normal distributions, Rosenbaum recommends using a rank-based Mahalanobis distance (MD_rank).19 The MD_rank is equivalent to an RMD that uses a rank transformation of each variable (i.e., replacing a variable with the rank ordering of the variable and using average ranks for ties) and uses weights equal to the standard deviation of the rank-transformed variables divided by what the standard deviation of the ranks would be if there were no ties. This serves to down-weight variables with ties. Ranking is particularly useful for covariates that have outliers, and it is often desirable to down-weight a variable with numerous ties, e.g., a binary variable such as “teaching hospital Y/N.” However, the clinical importance of a particular variable may discourage researchers from down-weighting that variable, or even lead them to up-weight it. Because the aggregated measures used in the examples have sufficient precision to prevent any ties, the MD_rank is equivalent to the MD on rank-transformed data. The effects of the rank transformation are presented in online Supplementary Tables 1 and 2.
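The rank-based weight described above can be computed directly. The sketch below is illustrative only; the function names and example values are hypothetical, and it simply follows the stated ratio of the standard deviation of the average ranks to the standard deviation of untied ranks.

```python
import statistics

def average_ranks(values):
    """Rank the values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def rank_weight(values):
    """SD of the average ranks divided by the SD the ranks would have with no ties."""
    ranks = average_ranks(values)
    untied = list(range(1, len(values) + 1))
    return statistics.stdev(ranks) / statistics.stdev(untied)

# Hypothetical covariates for six sites.
mean_bmi = [29.5, 30.0, 30.9, 31.0, 33.0, 28.7]  # continuous, no ties
teaching = [0, 0, 0, 1, 1, 1]                    # binary, heavily tied

print(rank_weight(mean_bmi))   # no ties, so the weight is 1
print(rank_weight(teaching))   # ties shrink the weight below 1
```

A variable without ties keeps a weight of 1, so for such data the rank-based distance reduces to the MD on rank-transformed values, as noted above.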

For missing data, a user may wish to use an imputation method that is more sophisticated than the highly automated procedure used here. The web application and R package allow more advanced users to supply a covariate matrix that incorporates their customized changes or to use their own customized distance matrix. The methods applied here provide a straightforward, easily implemented way to create optimally matched clusters for randomization in an MPCR study.

Supplementary Material


1. Greevy R, Lu B, Silber JH, Rosenbaum P. Optimal multivariate matching before randomization. Biostatistics. 2004 Apr;5(2):263–275.
2. Imai K, King G, Nall C. The essential role of pair matching in cluster-randomized experiments, with application to the Mexican Universal Health Insurance Evaluation. Statistical Science. 2009 Feb;24(1):29–53.
3. Martin DC, Diehr P, Perrin EB, Koepsell TD. The effect of matching on the power of randomized community intervention studies. Statistics in Medicine. 1993;12:329–338.
4. Zhang K, Small D. Comment: The essential role of pair matching in cluster-randomized experiments, with application to the Mexican Universal Health Insurance Evaluation. Statistical Science. 2009 Feb;24(1):59–64.
5. Rosenbaum PR. Optimal matching of an optimally chosen subset in observational studies. Journal of Computational and Graphical Statistics. 2011; in press.
6. Lu B, Greevy R, Xu X, Beck C. Optimal nonbipartite matching and its statistical applications. The American Statistician. 2011;65(1):21–30.
7. Wölfel M, Ekenel H. Feature weighted Mahalanobis distance: improved robustness for Gaussian classifiers. 13th European Signal Processing Conference (EUSIPCO); Antalya, Turkey. 2005 Sep.
8. Younis K, Karim M, Hardie R, Loomis J, Rogers S, DeSimio M. Cluster merging based on weighted Mahalanobis distance with application in digital mammograph. Proc. of IEEE Aerospace and Electronics Conference. 1998.
9. Peng J, Heisterkamp DR, Dai HK. Adaptive kernel metric nearest neighbor classification. Proc. of IEEE International Conference on Pattern Recognition. 2002.
10. Rerkrai K, Fillbrandt H. Tracking persons under partial scene occlusion using linear regression. 8th International Student Conference on Electrical Engineering. 2004.
11. Harrell FE, et al. Hmisc. R package version 3.9-1. 2012 Jan.
12. Little RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. New York: Wiley; 2002.
13. Derigs U. Solving nonbipartite matching problems via shortest path techniques. Annals of Operations Research. 1988;13:225–261.
14. Summary of revisions for the 2006 Clinical Practice Recommendations. Diabetes Care. 2006 Jan;29(Suppl 1):S3.
15. Roumie CL, Huizinga MM, Liu X, Greevy RA, Grijalva CG, Murff HJ, Hung AM, Griffin MR. The effect of incident antidiabetic regimens on lipid profiles in veterans with type 2 diabetes: a retrospective cohort. Pharmacoepidemiology and Drug Safety. 2011;20:36–44.
16. Zhang K, Traskin M, Small D. A powerful and robust test statistic for randomization inference in group-randomized trials with matched pairs of groups. Biometrics. 2011; in press.
17. Rosenbaum PR. Covariance adjustment in randomized experiments and observational studies. Statistical Science. 2002;17(3):286–327.
18. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am. J. Epidemiol. 1995;142(12):1255–1264.
19. Rosenbaum PR. Design of Observational Studies. Ch 8. Springer Series in Statistics. 2009.