2.1 A study with two comparison groups
The use of multiple comparison groups in observational studies has been discussed for some time, although not widely implemented (Rosenbaum 1987). We used administrative data from the FEHB plans and from the group of private health plans for the years 1999–2001, focusing on beneficiaries who were continuously enrolled over this period. The accompanying figure illustrates the longitudinal structure of the data and how the investigator can designate a second comparison group from it. Information for the private health plans arises from the Medstat MarketScan database, which includes health insurance claims for large self-funded employers. These employers, which offered relatively generous coverage, were not required to implement any state parity laws because the Employee Retirement Income Security Act exempts self-insured plans from state benefit regulations. Subjects included in the Medstat data form the first comparison
group. The US Office of Personnel Management required FEHB plans to offer MH/SA benefits in 2001, so we define the time period 1999–2000 as pre-parity and the year 2001 as post-parity. The parity
group consists of FEHB enrollees followed from baseline in 2000 through follow-up in 2001. A second comparison
group consists of FEHB enrollees followed from baseline in 1999 through follow-up in 2000 but prior to the implementation of parity. We refer to this group as the second comparison
group, and like the Medstat comparison group, it did not receive parity benefits. This second comparison group is composed of enrollees from the same population as the parity group, so there should be no differences in confounders, observed or unobserved, between them.
Longitudinal structure of the data allows for a natural second comparison group of FEHB enrollees observed pre-parity.
While adjustments such as matching can control for observed covariates, the first comparison group of Medstat enrollees may be quite different from the parity enrollees on unobserved baseline covariates. For example, federal employees and industry employees may differ in spending habits or in the stigma they attach to mental health services, neither of which is captured in the observed data. Industry employees may have less flexibility with regard to work schedules that would otherwise permit them to be treated for and to deal with severe illness. Indeed, the observed covariate distributions hint at such differences: among the study groups from the FEHB and Medstat plans, a higher proportion of federal employees than of Medstat employees sought treatment for MH/SA disorders. Matching takes care of this observed difference, but there may well be unobserved factors that influence the seeking of treatment. As such, an observed post-parity difference in utilization of MH/SA services between the parity and first comparison groups may be due to an unobserved bias, due to the intervention, or a combination of both. This type of ambiguity is a central concern in any observational study.
Covariate balance before and after matching for the two comparison groups. The dashed vertical lines mark bounds within which differences are considered small.
Using the second comparison group could partly address this ambiguity. The second comparison group is composed of enrollees from the same population as the parity group, and thus this group should not differ from the parity group on their observed or unobserved characteristics. The major difference between the parity and second comparison groups is that the second comparison group is followed before the implementation of parity in 2001. A contrast between these two groups could show an effect of parity, without the same concerns about unobserved selection biases that the contrast with the Medstat comparison group would have. A disadvantage with the second comparison group, however, is that its enrollees are not followed at the same time as the parity group, so any temporal changes such as changes in treatment for severe illness or accessibility of care may be confounded in its comparison with the parity group. This disadvantage, though possibly minor, is partially addressed by the first comparison group, whose enrollees from Medstat are followed at the same time as the parity group. This intuition is developed further in §2.2.
Two contrasts have been described: one between the parity and first comparison groups, and the other between the parity and second comparison groups. The first contrast may suffer from selection bias but not from a temporal trend; the second contrast does not suffer from selection bias but may be affected by a temporal trend. Taken together, these two contrasts build a more convincing body of evidence than an analysis with a single contrast could provide. At very little cost, the researcher can address these concerns about hidden biases through careful selection of a second comparison group.
First, to construct the parity and second comparison groups, we randomly split the sample of FEHB enrollees into two equally sized groups, denoted parity and second comparison. We then select individuals from the parity, first comparison, and second comparison groups on the basis of having at least one claim in the baseline year for one of the severe diagnoses under study. For the parity and first comparison groups the baseline year is 2000, and for the second comparison group it is 1999. Matched sets of triplets, each consisting of one parity enrollee, one first comparison enrollee, and one second comparison enrollee, were created. We examined distributions of observed baseline covariates to ensure balance among the triplets (Table 2).
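As a concrete illustration, the split-sample and triplet-matching steps can be sketched as follows. The sketch uses simulated enrollee records summarized by a single covariate score; all identifiers, sizes, and the greedy nearest-neighbour rule are hypothetical stand-ins for the actual FEHB and Medstat data and the full covariate-based matching.

```python
# Hypothetical sketch of the split-sample, triplet-matching design.
# Enrollee records and covariate scores below are simulated, not study data.
import random

random.seed(0)

# Simulated enrollees: (id, score), where score stands in for the
# observed baseline covariates used for matching.
fehb = [(f"fehb{i}", random.gauss(0.0, 1.0)) for i in range(200)]
medstat = [(f"med{i}", random.gauss(0.2, 1.0)) for i in range(100)]

# Step 1: randomly split the FEHB enrollees into equally sized
# parity and second-comparison groups.
random.shuffle(fehb)
half = len(fehb) // 2
parity, second_comp = fehb[:half], fehb[half:]

# Step 2: greedy nearest-neighbour matching on the covariate summary,
# forming triplets (parity, first comparison, second comparison).
def pop_closest(pool, score):
    """Remove and return the pool member closest to the given score."""
    j = min(range(len(pool)), key=lambda k: abs(pool[k][1] - score))
    return pool.pop(j)

triplets = []
for p_id, p_score in parity:
    if not medstat or not second_comp:
        break
    m = pop_closest(medstat, p_score)      # first comparison enrollee
    s = pop_closest(second_comp, p_score)  # second comparison enrollee
    triplets.append((p_id, m[0], s[0]))
```

In practice the matching would use the full covariate vector (e.g. via a distance matrix or optimal matching) rather than a single score, and balance would then be checked as in Table 2.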
Table 2. Distribution of baseline covariates in the matched sample (n = 356 matched triplets). A binary covariate for a specific diagnosis indicates whether an MH/SA claim was filed in the baseline year for that diagnosis. The important diagnostic categories ...
There are a few key points to be made here. The sampling procedure above takes advantage of the relatively large ratio between the sizes of the total FEHB group and the Medstat comparison group (approximately 5:1). Without this favorable ratio, splitting the FEHB group might incur costs with respect to finding good matched samples; for example, if the overall ratio were instead 2:1, then splitting the FEHB enrollees into the two groups would likely not permit good matches because the available pool would be severely limited. A secondary issue is that randomly splitting the FEHB group introduces variation into the analysis: the estimates and conclusions from one iteration of the sampling procedure may differ from those of another. We do not address this issue here; for a theoretical discussion of split-sample designs in observational studies, see Heller, Rosenbaum, and Small (2009). Our primary goal is to address the first-order concerns about bias that do not diminish with increasing sample size.
2.2 Appealing to logic about trends in bias
In a standard observational study, for example one comparing the FEHB parity enrollees against the Medstat comparison enrollees, a significant estimated effect could very well be due to unobserved bias. In fact, a comparison of these groups suggests that the odds of using any MH/SA services in the follow-up year were significantly greater in the parity group than in the first (Medstat) comparison group. However, we might reasonably believe that different unobserved mechanisms are producing this observed effect. As a check against this concern, we compare the parity enrollees against enrollees from the second comparison group, who were also from FEHB Program plans but were followed before parity's implementation. Using this contrast can also be viewed as incorporating a pretest–posttest design into the analysis; see Laird (1983). Likewise, the second comparison suggests that the odds of using any MH/SA services in the follow-up year were greater in the parity group than in the second comparison group, both composed of FEHB enrollees. As a final check, the two comparison groups, neither of which received parity benefits, can be contrasted to check the similarity of their utilization outcomes. For example, if we find an insignificant difference between the two comparison groups, this suggests that we can be less concerned about unobserved bias due to selection.
The trend in outcomes could also reflect a combination of unobserved selection bias and a temporal shift in utilization of MH/SA services. However, the temporal shift would have needed to occur within a relatively short period and to cover up the effect of the unobserved selection. While not implausible, this combination is much less likely than either source of bias alone. This rationale, along with other arguments for and against observed patterns of outcomes, can be used by the investigator to form a body of evidence suggesting that the intervention effect is real and not due to ambiguities such as unobserved biases.
Informally, we find a significant effect between the parity and first comparison groups, a significant effect between the parity and the second comparison groups, and finally no significant difference between the two comparison groups. Thus we can be less concerned about the specific unobserved biases discussed in §2.1 than in an analysis with the first comparison group alone; while we may have been previously concerned about access to MH/SA care, we are less concerned about it after analyzing the second comparison group. However, the basic analysis described is not suitable for several reasons. First, finding a significant difference between the parity and comparison groups does not mean that the difference is meaningful for policy. Second, failure to find a significant difference between the two comparison groups does not mean that they are the same; an equivalence test is needed. Finally, without any organizing structure the probability of rejecting a true null hypothesis is well above the nominal level due to multiple testing.
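The equivalence test called for above can be illustrated with a two-one-sided-tests (TOST) sketch for the difference in utilization proportions between the two comparison groups. The counts and the equivalence margin below are hypothetical, and the normal approximation is a simplification; the study's actual equivalence procedure may differ.

```python
# TOST sketch for equivalence of two utilization proportions.
# Counts and the 10-point margin are illustrative, not study results.
import math

def norm_cdf(z):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost_two_proportions(x1, n1, x2, n2, margin):
    """TOST for |p1 - p2| < margin using a normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    # H0a: diff <= -margin  vs  H1a: diff > -margin
    p_lower = 1.0 - norm_cdf((diff + margin) / se)
    # H0b: diff >= margin   vs  H1b: diff < margin
    p_upper = norm_cdf((diff - margin) / se)
    # Equivalence is declared when both one-sided tests reject,
    # i.e. when the larger of the two p-values is small.
    return max(p_lower, p_upper)

# Illustrative any-MH/SA-use counts for the two comparison groups.
p_val = tost_two_proportions(x1=150, n1=356, x2=144, n2=356, margin=0.10)
```

Note the logic is inverted relative to an ordinary difference test: a small TOST p-value supports similarity within the margin, whereas failing to reject an ordinary null of no difference does not.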
An intuitive way to resolve this last issue is to test the hypotheses in a logical order of priority (Rosenbaum 2008). We order the above contrasts into three steps. In Step 1, we test for a meaningful parity effect between the parity and first comparison groups. If one is found, then we proceed to Step 2, in which a contrast is made between the parity and second comparison groups. If a significant effect is found between them, then in Step 3, we characterize the similarity between the two comparison groups by an equivalence interval. The basic procedure just described comes without cost to the researcher, in the sense that the power and type I error rate of the study are essentially unchanged by the introduction of the second comparison group. We can say something important about unobserved confounding, based on logical choices, without altering the conclusions of the standard observational study. The procedure is developed formally in §3.3.
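A minimal sketch of the three-step ordering follows, using simple two-proportion z-tests as placeholder statistics; the study's actual tests, and the Step 3 equivalence interval, would replace these placeholders. All counts are hypothetical.

```python
# Sketch of the ordered (gatekeeping) testing procedure: each step is
# attempted only if the previous one rejects. Test statistics here are
# placeholder two-proportion z-tests on illustrative counts.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided z-test for p1 != p2 (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2.0 * (1.0 - norm_cdf(abs(z)))

def ordered_procedure(parity, first_comp, second_comp, alpha=0.05):
    """Test the contrasts in a fixed order; stop at the first failure.
    Each group is a (users, n) pair of any-MH/SA-use counts."""
    steps = []
    # Step 1: parity vs first comparison group.
    p1 = two_prop_pvalue(*parity, *first_comp)
    steps.append(("parity vs first", p1))
    if p1 >= alpha:
        return steps
    # Step 2: parity vs second comparison group.
    p2 = two_prop_pvalue(*parity, *second_comp)
    steps.append(("parity vs second", p2))
    if p2 >= alpha:
        return steps
    # Step 3: similarity of the two comparison groups (placeholder
    # difference test; the paper characterizes similarity by an
    # equivalence interval instead).
    p3 = two_prop_pvalue(*first_comp, *second_comp)
    steps.append(("first vs second", p3))
    return steps

results = ordered_procedure((200, 356), (150, 356), (148, 356))
```

Because each step is tested only after the previous one rejects, the ordering controls the family-wise error rate without splitting alpha across the three contrasts, which is the sense in which the second comparison group comes without cost.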