Algorithm-based exposure assessments based on patterns in questionnaire responses and professional judgment can readily apply transparent exposure decision rules to thousands of jobs quickly. However, we need to better understand how algorithms compare to a one-by-one job review by an exposure assessor. We compared algorithm-based estimates of diesel exhaust exposure to those of three independent raters within the New England Bladder Cancer Study, a population-based case–control study, and identified conditions under which disparities occurred in the assessments of the algorithm and the raters.
Occupational diesel exhaust exposure was assessed previously using an algorithm and a single rater for all 14 983 jobs reported by 2631 study participants during personal interviews conducted from 2001 to 2004. Two additional raters independently assessed a random subset of 324 jobs that were selected based on strata defined by the cross-tabulations of the algorithm and the first rater’s probability assessments for each job, oversampling their disagreements. The algorithm and each rater assessed the probability, intensity and frequency of occupational diesel exhaust exposure, as well as a confidence rating for each metric. Agreement among the raters, their aggregate rating (average of the three raters’ ratings) and the algorithm were evaluated using proportion of agreement, kappa and weighted kappa (κw). Agreement analyses on the subset used inverse probability weighting to extrapolate the subset to estimate agreement for all jobs. Classification and Regression Tree (CART) models were used to identify patterns in questionnaire responses that predicted disparities in exposure status (i.e., unexposed versus exposed) between the first rater and the algorithm-based estimates.
For the probability, intensity and frequency exposure metrics, moderate to moderately high agreement was observed among raters (κw = 0.50–0.76) and between the algorithm and the individual raters (κw = 0.58–0.81). For these metrics, the algorithm estimates had consistently higher agreement with the aggregate rating (κw = 0.82) than with the individual raters. For all metrics, the agreement between the algorithm and the aggregate ratings was highest for the unexposed category (90–93%) and was poor to moderate for the exposed categories (9–64%). Lower agreement was observed for jobs with a start year <1965 versus ≥1965. For the confidence metrics, the agreement was poor to moderate among raters (κw = 0.17–0.45) and between the algorithm and the individual raters (κw = 0.24–0.61). CART models identified patterns in the questionnaire responses that predicted a fair-to-moderate (33–89%) proportion of the disagreements between the raters’ and the algorithm estimates.
The agreement between any two raters was similar to the agreement between an algorithm-based approach and individual raters, providing additional support for using the more efficient and transparent algorithm-based approach. CART models identified some patterns in disagreements between the first rater and the algorithm. Given the absence of a gold standard for estimating exposure, these patterns can be reviewed by a team of exposure assessors to determine whether the algorithm should be revised for future studies.