Study Background and Collaboration Framework
The Safety Assessment of Biologic Therapy (SABER) study is a broad-ranging inquiry into the safety of biologics for the treatment of auto-immune diseases. The study is based at the University of Alabama at Birmingham with working groups at other research centers around the United States. The Kaiser Permanente Division of Research in Oakland, CA serves as the study’s single data coordination center (DCC). The DCC creates standardized data definitions, facilitates transmission of data among parties, compiles study datasets, and provides standardized programming code and other analytic support.
For each specific safety question within the SABER study, the investigators sought to pool clinical and administrative data from a number of participating research organizations (“centers”). The centers maintained data from Kaiser Permanente, Medicare, Medicaid, state tumor registries, vital statistics providers, and state pharmaceutical assistance programs for low-income elderly. At the beginning of the project, programmers at each center standardized their data files based on HMO Research Network protocols and data dictionaries.⁴ Each center was subject to strict data use agreements and/or federal rules and regulations concerning patient privacy.
We anticipated that pooling basic, non-identifying data from the multiple centers would yield the number of patients and outcome events needed to precisely estimate treatment effects,⁵ but validity of the estimates remained a concern. For the outcomes under consideration, we saw strong potential for confounding by indication.³ For example, patients with more severe autoimmune disease received more potent immunosuppressant therapy, but the severity of their disease put them at risk for adverse events such as infection. Consequently, simple age/sex adjustment was not sufficient; the study required full multivariate adjustment for all outcomes.
Pooling Alternatives Considered
We considered 4 main pooling methods. Table 1 and the following text detail the privacy, statistical, and operational trade-offs involved in selecting among the techniques.
TABLE 1. Summary Comparison of Pooling Methods Considered
The first method we considered was the most straightforward: sharing individual patients’ full covariate information (“covariate sharing method”). This method requires each center to create an analytic dataset and then to transmit that dataset—with full covariate information—to the DCC. Despite its advantages with respect to study speed and flexibility, we quickly ruled out this approach due to privacy concerns.
The project team next considered aggregating like patients into cells and then adjusting for confounders based on counts of patients in each cell (“aggregated data method”). For example, there might be 140 exposed men with a history of heart disease versus 220 exposed women with that history; for these patients, the centers could transmit just the 2 summary counts. While this method would likely work with few covariates, we determined that it would not scale to the 50 or more confounders required for this complex study. In particular, the method requires creating a cell for each observed combination of exposure, outcome, and covariates; with rare outcomes and the numerous covariates, we anticipated that cells with only one patient would be common. With frequent occurrence of single-person cells, the aggregated data method would provide little more privacy than covariate sharing.
To that end, Centers for Medicare and Medicaid Services regulations require cell counts of at least 11 patients. For cells with fewer than 11 patients, one option would be to drop the cells entirely. However, in the case of the rare safety outcomes we were considering, dropping the cells would have led to the loss of most of the study’s statistical power. Alternatively, a series of smaller cells could be combined until they reached a size of 11, but this approach would have required mixing dissimilar patients into a single stratum, and thus a substantial loss of confounder information.
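The sparsity problem can be illustrated with a small simulation. The sketch below is not the study's code: the cohort is synthetic, the covariate prevalences are invented, and the helper names (`cell_counts`, `share_in_small_cells`) are hypothetical. It only shows how the share of patients falling into cells smaller than 11 grows as binary covariates are added.

```python
import random
from collections import Counter

MIN_CELL_SIZE = 11  # minimum cell count cited in the text


def cell_counts(patients, keys):
    """Count patients in each observed combination of the given fields."""
    return Counter(tuple(p[k] for k in keys) for p in patients)


def share_in_small_cells(patients, keys, min_size=MIN_CELL_SIZE):
    """Fraction of patients who fall in cells with fewer than min_size members."""
    counts = cell_counts(patients, keys)
    in_small_cells = sum(n for n in counts.values() if n < min_size)
    return in_small_cells / len(patients)


# Synthetic cohort (assumption): 5000 patients, 20 independent binary covariates,
# a rare outcome, and exposure assigned at random.
rng = random.Random(0)
N_PATIENTS, N_COVARIATES = 5000, 20
patients = [
    {
        "exposed": rng.random() < 0.5,
        "outcome": rng.random() < 0.02,
        **{f"x{j}": rng.random() < 0.3 for j in range(N_COVARIATES)},
    }
    for _ in range(N_PATIENTS)
]

few_keys = ["exposed", "outcome", "x0", "x1"]
many_keys = ["exposed", "outcome"] + [f"x{j}" for j in range(N_COVARIATES)]

print("share in small cells, 2 covariates: ", share_in_small_cells(patients, few_keys))
print("share in small cells, 20 covariates:", share_in_small_cells(patients, many_keys))
```

With only a handful of covariates, small cells are confined to the rare-outcome combinations; with many covariates, nearly every observed combination is unique to a few patients.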
The third alternative we explored was fixed or random effects meta-analysis of results (“meta-analysis”). This is the same statistical technique commonly applied to the compilation of data from multiple trials,⁶ here applied among our study centers. For each study exposure and outcome, each center would generate a point estimate and variance. The centers would then transmit the point estimates and variances to the DCC, which would in turn combine the results with standard meta-analytic techniques.
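As a rough illustration of the combining step, the following sketch applies fixed-effect (inverse-variance) pooling to center-level estimates on the log scale. It covers only the fixed-effect case; a random-effects analysis would add a between-center variance term. The center names and numbers are invented, and the function names are hypothetical.

```python
import math


def pool_fixed_effect(log_estimates, variances):
    """Inverse-variance (fixed-effect) pooled log estimate and its variance."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_estimates)) / total_weight
    return pooled, 1.0 / total_weight


def ratio_with_ci(log_estimate, variance, z=1.96):
    """Back-transform a log ratio to the ratio scale with a Wald 95% CI."""
    half_width = z * math.sqrt(variance)
    return (
        math.exp(log_estimate),
        math.exp(log_estimate - half_width),
        math.exp(log_estimate + half_width),
    )


# Hypothetical center results: (log ratio estimate, variance of that estimate).
center_results = {
    "center_1": (0.10, 0.04),
    "center_2": (0.25, 0.09),
    "center_3": (-0.05, 0.02),
}
log_estimates = [est for est, _ in center_results.values()]
variances = [var for _, var in center_results.values()]

pooled_log, pooled_var = pool_fixed_effect(log_estimates, variances)
print(ratio_with_ci(pooled_log, pooled_var))
```

Note that each center must fit its own fully specified outcome model before anything can be pooled, which is the operational burden discussed in the text.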
Statistically speaking, meta-analysis should yield very similar point estimates and confidence intervals (CIs) as compared to covariate sharing. For our study’s purposes, however, we noted several limitations: (1) the burden of the analysis would move from the DCC to the center such that each center would be required to have full analytic capabilities, including SAS software and statistically-trained staff; (2) all aspects of the outcome models would need to be specified a priori at the centers, which would limit any later-stage analytic flexibility; and (3) subgroup analyses, sensitivity analyses, and data exploration would become operationally difficult. The related method of creating matched cohorts within the centers and transmitting either unadjusted point estimates or 2 × 2 tables to the DCC would be impeded by the same issues.
Pooling Methodology Used
Because of the limitations of the 3 approaches described above, we chose a pooling method developed by 2 of the coauthors (J.A.R., S.S.).⁷ This method allows for pooling based on a propensity score (PS; “PS-based pooling”). A PS is the predicted probability of being exposed given the patient’s set of measured covariates. A PS condenses all patient covariate information into a single, opaque number. A man from the aggregated data example described above (the 140 exposed men and 220 exposed women, each with a history of heart disease) might have a PS of 0.37 (37% probability of exposure, based on his covariates), while a woman could have a PS of 0.52 (52% probability of exposure). The differences might be due to gender (women could be more likely to be exposed than men) as well as the individuals’ unique medical histories. Adjusting the outcome model by the PS generally functions as well as adjusting by individual covariates.⁸,⁹ Consequently, by pooling with PSs, we could adjust for confounders in patients’ medical histories while keeping those histories concealed.
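In symbols, the definition above can be written as follows. The notation is introduced here for illustration: E is the exposure indicator, x the vector of measured covariates, and the second equality assumes a logistic model for exposure, which is a common but not mandatory choice.

```latex
\[
  e(x) \;=\; \Pr(E = 1 \mid X = x)
       \;=\; \frac{1}{1 + \exp\{-(\beta_0 + \beta^{\top} x)\}}
\]
```

Only the scalar e(x) leaves the center; the covariate vector x and the fitted coefficients stay behind.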
In our study, we estimated a set of PSs within each center.¹⁰ Only non-identifying information (age in decades, gender) and the PSs were transmitted to the DCC, which in turn used this information to produce fully adjusted outcome models. While we chose to use only deciles of PS in our outcome models, the PS-based pooling method does not imply any specific analytic technique: the PSs can be used as continuous values, in quintiles or deciles, or as variables on which to match, and outcome models can be adjusted for the PS alone, or for both the PS and the non-identifying information.
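A minimal sketch of what the DCC-side step might look like is given below. Several things here are assumptions rather than the study's method: the record layout (`TransmittedRecord`, including exposure and outcome flags), the choice to form PS deciles within each center, and the use of a Mantel-Haenszel risk ratio as a simple stand-in for a fully specified outcome model. The data are synthetic, with no true exposure effect built in.

```python
import math
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TransmittedRecord:
    # Hypothetical layout of one patient row sent from a center to the DCC.
    center: str
    age_decade: int
    sex: str
    exposed: bool
    outcome: bool
    ps: float


def center_decile_strata(records):
    """Label each record with (center, PS decile), deciles formed within center."""
    by_center = defaultdict(list)
    for i, rec in enumerate(records):
        by_center[rec.center].append(i)
    strata = [None] * len(records)
    for center, indices in by_center.items():
        indices.sort(key=lambda i: records[i].ps)
        for rank, i in enumerate(indices):
            strata[i] = (center, rank * 10 // len(indices))
    return strata


def mantel_haenszel_rr(records, strata):
    """Mantel-Haenszel risk ratio (exposed vs unexposed) across the given strata."""
    tables = defaultdict(lambda: [0, 0, 0, 0])  # exposed events, exposed n, unexposed events, unexposed n
    for rec, stratum in zip(records, strata):
        t = tables[stratum]
        if rec.exposed:
            t[0] += rec.outcome
            t[1] += 1
        else:
            t[2] += rec.outcome
            t[3] += 1
    numerator = denominator = 0.0
    for a, n1, c, n0 in tables.values():
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator if denominator > 0 else math.nan


# Synthetic transmitted data (assumption): outcome risk rises with the PS
# but does not depend on exposure, so the crude comparison is confounded.
rng = random.Random(1)
records = []
for center in ("A", "B"):
    for _ in range(1000):
        ps = rng.uniform(0.05, 0.95)
        records.append(TransmittedRecord(
            center=center,
            age_decade=rng.choice([4, 5, 6, 7]),
            sex=rng.choice(["F", "M"]),
            exposed=rng.random() < ps,
            outcome=rng.random() < 0.05 + 0.10 * ps,
            ps=ps,
        ))

strata = center_decile_strata(records)
print("PS-decile-adjusted RR:", mantel_haenszel_rr(records, strata))
print("crude RR:", mantel_haenszel_rr(records, ["all"] * len(records)))
```

The same transmitted file could instead feed a regression model with decile indicators, a continuous PS term, or PS matching, as the text notes.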
Analytic Challenges Arising from PS-Based Pooling
Despite the advantages of the PS-based pooling method, several key analytic challenges arose. In this section, we note those challenges and describe our responses to them.
Multiple Exposure Categories
The SABER study examined a series of treatment groups (4 tumor necrosis factor [TNF] antagonists; other biologic disease-modifying antirheumatic drugs [DMARDs], eg, abatacept, rituximab, efalizumab; and several nonbiologic DMARDs [nb-DMARDs], eg, methotrexate) and then compared those groups with respect to safety outcomes. However, a PS usually predicts the probability of receiving a single study treatment versus a single referent. The PS is almost always estimated using logistic regression, which is built on the 2-category binomial distribution.¹¹ A polytomous logistic regression can predict the probability of one option among a set of multiple choices¹²: it might predict that, based on his patient characteristics, the man from the example above has a 37% chance of being treated with methotrexate, a 22% chance of a DMARD, a 23% chance of a nb-DMARD, and an 18% chance of a TNF antagonist. However, there is minimal literature on using polytomous regression for estimation of PSs.¹³⁻¹⁵ We therefore compared each study treatment individually to a single prespecified referent group. For example, we compared each of the TNF antagonists to methotrexate and, separately, compared nb-DMARDs to methotrexate. This pairwise approach allowed us to be fully confident of the underlying statistical and epidemiologic theory with little or no loss of statistical power.¹² One limitation of this method is its increased complexity and the large number of PSs that must be estimated. To minimize logistical complexity and computing time, we estimated only those pairwise PSs that were of scientific interest, as specified by study investigators.
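One way to see why pairwise comparisons stay within standard binary-outcome theory: if treatment choice followed a polytomous (multinomial) logistic model, then restricting to patients who received either treatment j or the referent r would give back an ordinary binary logistic model. This is a property of the multinomial logit form, not a description of the models fit in the study, and the notation (T for treatment, x for covariates, β_k for treatment-specific coefficients) is introduced here for illustration.

```latex
\[
  \Pr(T = j \mid x) \;=\; \frac{\exp(x^{\top}\beta_j)}{\sum_{k} \exp(x^{\top}\beta_k)}
\]
\[
  \Pr\bigl(T = j \mid T \in \{j, r\},\, x\bigr)
  \;=\; \frac{\exp(x^{\top}\beta_j)}{\exp(x^{\top}\beta_j) + \exp(x^{\top}\beta_r)}
  \;=\; \frac{1}{1 + \exp\{-x^{\top}(\beta_j - \beta_r)\}}
\]
```

The second line is a binary logistic model with coefficient vector β_j − β_r, so a pairwise PS can be estimated with ordinary logistic regression in the cohort restricted to the two treatments being compared.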
Time-Varying Exposures
Patients treated with biologics can follow a treatment pattern in which they are “stepped up” from an initial therapy to more effective medications or combinations of medications. Moreover, although there is guidance on which therapies one might select as first-line agents,¹⁶ there is only limited evidence to guide the decision to add or switch medications. Patients in our study switched frequently among the study exposures, and certain investigators therefore wished to conduct time-varying exposure assessment. The usual procedure with PSs is to estimate the probability of treatment at baseline (PSBaseline) and to use that baseline score for confounding adjustment. Because a change in drug treatment during follow-up can be informed by the patient’s condition and prognosis,³ the most conservative study choice would be either to censor the patient in an as-treated analysis or to knowingly misclassify the exposure in an intention-to-treat-style analysis. Alternatively, if second-line agents were of greater interest, investigators could modify the study design to start follow-up at the time of initial use of the second-line exposure,¹⁷ rather than at the time of initiation of the first-line therapy.
In the instances where the investigators wished to assess both the original and the second-line exposure within the same analysis, several options were possible: (1) re-estimate the PS at the time of the change of exposure (COE) and use the resulting PSCOE alone for confounding adjustment; (2) use the PSBaseline and PSCOE side-by-side for confounding adjustment; (3) use PSBaseline plus additional individual covariates measured at the time of exposure change; or (4) use only PSBaseline and do no further adjustment at the time of exposure change. Statistically, none of these options is ideal, for reasons entirely separate from the pooling method: to varying degrees, options 1, 2, and 3 each run the risk of adjusting for an intermediate on the causal pathway and thereby obscuring any safety issues with the study medication, while option 4 runs the risk of underadjustment. Given the complexity of the analytic trade-offs, the SABER investigators determined the most appropriate approach individually for each of the study’s exposures and outcomes.
Time-Varying Confounding
Certain working groups desired to adjust for confounders that varied over the course of treatment, such as concomitant oral glucocorticoid use, even if exposure did not change. While the epidemiologic issues are distinct from those of time-varying exposures, the available options for updating the confounders using the PS-based pooling method are similar to those described in the section above. In our study, the investigators chose to measure key commonly occurring time-varying confounders (eg, mean daily prednisone dose) at prespecified times during follow-up. The centers transmitted those key confounders to the DCC alongside the baseline PS.
Multiple Subgroup Analyses
Study investigators wished to assess the drugs within certain subgroups of the patient population. For this to be possible, subgroup indicators needed to be based on nonprivate information and then transmitted with the PSs. We questioned whether the PSs required re-estimation in the subgroup, or whether the PS estimated in the entire population would remain valid within a subgroup analysis. Both theory⁸ and preliminary research by 2 of the coauthors (J.A.R., S.S.) indicate that in large subgroups, and under certain assumptions, no re-estimation of the PSs is required. Because the “correct” way to handle this issue remains ambiguous, the study investigators chose to handle subgroup analyses on a case-by-case basis.
Heterogeneous Center Effects
In this multicenter study, detection and handling of heterogeneity among the centers’ populations was a key issue. We divided heterogeneity into 2 categories: (1) heterogeneity due to the information content available at each center and (2) heterogeneity due to distinct patient populations at each center.
Heterogeneity due to information content is most obvious when some centers are able to provide measured covariates, such as laboratory results, that other centers cannot provide. In such a case, each center can provide both a universally defined PS (PSUniv), which includes just the covariates available at all centers, and a local PS (PSLocal), which includes the universal covariates plus all local information available at the specific center.⁷ Note that even if each center ostensibly provides the same covariates, it is possible that certain centers can measure the covariates better than others. Consider the example of a “history of high blood pressure” variable: if one center relies on ICD-9 codes while another has access to blood pressure measurements from an electronic medical record, the center with the electronic medical record data may be able to supply superior information.
In the hypothetical example illustrated in the accompanying figure, each center-specific relative risk (RR) estimate is adjusted for each of the 2 PSs described above: a PSUniv that is estimated with the “lowest common denominator” of covariates or covariate measurement available at the centers, and a PSLocal that includes each center’s best available information. A visual inspection of the figure demonstrates the 2 differing types of heterogeneity. Overall, most centers show an RR of approximately 1.0, but Centers 3 and 6 are anomalous. Center 3 exhibits an RR that is elevated regardless of the PS used. It is likely that Center 3’s population is distinct from those of the other centers and is truly heterogeneous. A comparison of absolute event rates within Center 3 versus the other centers could shed additional light on how Center 3’s population might differ. Conversely, Center 6 shows a distinct RR, but only after adjustment with PSLocal. The use of PSLocal, which is estimated using a superset of the variables contributing to PSUniv, moved the point estimate away from that observed with adjustment just for PSUniv. It is likely that the additional information contained in PSLocal included important confounders that were not measured in the other centers. This observation is a strong indicator that additional confounder control would be necessary in the centers that lack those measurements.
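The reasoning applied to Centers 3 and 6 can be written down as a simple screening rule. The sketch below is only a mechanical companion to visual inspection, not a procedure used in the study: the function name, the tolerance of a 1.5-fold difference, and all RR values are invented for illustration and are not read from the study's figure.

```python
import math
from statistics import median


def classify_centers(rr_univ, rr_local, tol=math.log(1.5)):
    """Label each center by comparing its PSUniv- and PSLocal-adjusted RRs.

    rr_univ, rr_local: dicts mapping center -> RR adjusted for PSUniv / PSLocal.
    tol: hypothetical tolerance on the log-RR scale.
    """
    reference = median(math.log(rr) for rr in rr_univ.values())
    labels = {}
    for center in rr_univ:
        log_univ = math.log(rr_univ[center])
        log_local = math.log(rr_local[center])
        if abs(log_local - log_univ) > tol:
            # Estimate moves once local covariates are added.
            labels[center] = "information content"
        elif abs(log_univ - reference) > tol and abs(log_local - reference) > tol:
            # Estimate differs from the other centers under either PS.
            labels[center] = "patient population"
        else:
            labels[center] = "consistent"
    return labels


# Invented RRs mimicking the pattern described in the text.
rr_univ = {1: 1.00, 2: 0.95, 3: 2.10, 4: 1.05, 5: 1.00, 6: 1.00}
rr_local = {1: 1.00, 2: 0.97, 3: 2.00, 4: 1.02, 5: 0.98, 6: 1.80}

for center, label in classify_centers(rr_univ, rr_local).items():
    print(center, label)
```

A rule like this ignores the precision of each center's estimate, so in practice it would be read alongside the confidence intervals rather than in place of them.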