Cancer registries collect information on type of cancer, histological characteristics, stage at diagnosis, patient demographics, initial course of treatment (including surgery, radiation therapy, and chemotherapy), and patient survival (Hewitt and Simone, 1999). Such information can be valuable for studying variations in the quality of cancer care, for example, across racial and ethnic groups. Concerns have been raised, however, about the completeness of treatment information in cancer registries. Bickel and Chassin (2000) and Malin et al. (2002) demonstrated underreporting of adjuvant chemotherapy and radiation therapy for breast cancer in hospital and state registries, respectively. Cress et al. (2003) reported similar treatment underreporting for colorectal cancer in a state registry and showed that it was associated with both patient and hospital characteristics. Thus, studies based solely on registry data could lead to invalid results.
The classical errors-in-variables approach (Carroll et al., 2006), often used in epidemiology, would be to analyze the relationship of registry data on treatment to clinical outcomes and adjust for reporting error. This approach might involve modeling the relationship between the correct values of therapy variables in the validation sample and the misreported or misclassified ones in the registry. The error-adjustment procedures are often complicated and analysis-specific. The therapy variables, on the other hand, may be used by many researchers in analyses for various scientific purposes. Implementing the error-adjustment procedures might be challenging for analysts who do not possess the relevant specialized statistical expertise.
A more appealing strategy might be multiple imputation (Rubin, 1987). In a typical missing data problem, this method first “fills in” (imputes) missing variables several times to create multiple completed datasets. Each completed dataset can then be analyzed using standard complete-data procedures. Finally, the results obtained from the separate completed datasets are combined into a single inference using simple rules. In the presence of underreporting, this strategy is applied by imputing the uncollected correct treatment variables for registry cases outside the validation sample. The imputer may also incorporate additional information that is not generally available to other analysts, such as information from other administrative databases (Rubin, 1987). The imputation model characterizes the misclassification process and makes the adjustment. The corrected registry data can then be analyzed without any additional modeling of underreporting.
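The “simple rules” for combining results across completed datasets can be illustrated with a minimal sketch. It assumes that each completed-data analysis yields a scalar point estimate and its estimated variance; the function name `combine_rubin` and the example numbers are hypothetical and are not taken from the registry analysis. The pooled estimate is the average of the per-dataset estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by a factor of (1 + 1/m).

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Pool m completed-data results with the standard combining rules.

    estimates: point estimates, one per completed dataset
    variances: estimated variances of those point estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)       # average of the m point estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 completed datasets
pooled, total = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

An analyst receiving multiply-imputed registry data would run the same complete-data analysis on each dataset and pass the resulting estimates and variances to a function of this kind; no model of the underreporting process is needed at that stage.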
Yucel and Zaslavsky (2005) (henceforth “YZ”) proposed statistical models for imputing receipt of adjuvant chemotherapy using data from the California Cancer Registry and from medical records obtained from a physician follow-back survey, a validation sample for the registry data. Cancer treatment patterns may vary across hospitals. Similarly, the cancer registry data are aggregated from hospital registries, whose completeness of reporting may vary due to differences in registrar resources, provider network structures, and other organizational factors. Hence YZ’s model included individual and hospital level predictors, as well as hospital random effects for provision and reporting of chemotherapy. They used multiply-imputed data sets to estimate models for mortality within two years of treatment. Using the same models, Zheng et al. (2006) profiled hospitals based on imputed rates of chemotherapy for colorectal cancer.
The method proposed by YZ focused on a single treatment variable, but patients may receive multiple therapies in the course of treatment. For example, Malin et al. (2002) developed individual quality scores to measure the receipt of each treatment (surgery, lymph node dissection, radiation therapy, and tamoxifen/chemotherapy) by eligible breast cancer patients, and added these scores to summarize overall quality. Furthermore, reporting completeness for different treatments may also be associated. Ignoring such associations when correcting the registry data may bias the results of analyses concerning multiple therapies. In this paper, we extend YZ’s method to impute the underreported status of multiple treatment variables. This approach borrows strength from the validation sample to correct the misclassification in the registry system while accommodating the associations among the multiple therapies.
In Section 2, we present the statistical models. In Section 3, we analyze data from our motivating example. Finally, in Section 4, we suggest directions for future research.