Background: In epidemiologic studies that rely on professional judgment to assess occupational exposures, the raters’ accurate assessment is vital to detect associations. We examined the influence of the type of questionnaire, type of industry, and type of rater on the raters’ ability to reliably and validly assess within-industry differences in exposure. Our aim was to identify areas where improvements in exposure assessment may be possible.
Methods: Subjects from three foundries (n = 72) and three textile plants (n = 74) in Shanghai, China, completed an occupational history (OH) and an industry-specific questionnaire (IQ). Six total dust measurements were collected per subject and were used to calculate a subject-specific measurement mean, which was used as the gold standard. Six raters independently ranked the intensity of each subject’s current job on an ordinal scale (1–4) based on the OH alone and on the OH and IQ together. Aggregate ratings were calculated for the group, for industrial hygienists, and for occupational physicians. We calculated intra-class correlation coefficients (ICCs) to evaluate the reliability of the raters. We calculated the correlation between the subject-specific measurement means and the ratings to evaluate the raters’ validity. Analyses were stratified by industry, type of questionnaire, and type of rater. We also examined the agreement between the ratings by exposure category, where the subject-specific measurement means were categorized into two and four categories.
Results: The reliability and validity measures were higher for the aggregate ratings than for the ratings from the individual raters. The group’s performance was maximized with three raters. Both the reliability and validity measures were higher for the foundry industry than for the textile industry. The ICCs were consistently lower in the OH/IQ round than in the OH round in both industries. In contrast, the correlations with the measurement means were higher in the OH/IQ round than in the OH round for the foundry industry (group rating, OH/IQ: Spearman rho = 0.77; OH: rho = 0.64). No pattern by questionnaire type was observed for the textile industry (group rating, Spearman rho = 0.50, both assessment rounds). For both industries, the agreement by exposure category was higher when the task was reduced to discriminating between two versus four exposure categories.
Conclusions: Assessments based on professional judgment may reduce misclassification by using two or three raters, by using questionnaires that systematically collect task information, and by defining intensity categories that are distinguishable by the raters. However, few studies have the resources to use multiple raters, and these additional efforts may not be adequate for obtaining valid subjective ratings. Thus, improving exposure assessment approaches for studies that rely on professional judgment remains an important research need.
Exposure assessment of occupational risk factors in population-based epidemiologic studies often relies on exposure assessors, such as industrial hygienists, to review study subjects’ responses to questionnaires designed to collect workplace information. In a typical study, each subject is asked to complete an occupational history (OH), with open-ended questions that cover job title, industry, work tasks, types of equipment and chemicals used, and employment dates. These open-ended questions capture some of the exposure variability among subjects with the same or similar job titles. To further refine the exposure assessments, some studies also incorporate job- or industry-specific questionnaires (IQs) to collect more detailed information about the work setting and tasks performed (Siemiatycki et al., 1981; Stewart et al., 1998). The responses to both types of questionnaires are then reviewed by one or more exposure assessors. Although recall bias is possible, the questionnaires are designed to elicit task-related information, which is likely more easily recalled than information on specific exposures (Teschke et al., 2002).
The ability of the exposure assessors to provide valid (accurate) exposure estimates is of vital importance to these studies because exposure misclassification can mask exposure-disease associations. The evaluation of the exposure assessors’ performance ideally requires gold standard measurements of exposure for multiple time points within the study’s scope, a wide breadth of exposure scenarios, multiple agents, and multiple raters. No study has yet met all these criteria. The evaluation of the validity of the exposure assessors’ ratings has primarily occurred in cohort studies of specific industries (Kromhout et al., 1987; Hertzman et al., 1988; Hawkins and Evans, 1989; Post et al., 1991; de Cock et al., 1996; Friesen et al., 2003), where exposure measurements are more readily available. The breadth of jobs and industries reported in population-based studies makes evaluating the validity of raters in population-based studies much more challenging. The few validity studies that used a population-based design were based on only a small number of measurements and were limited to examining the overall validity and the validity by agent (Benke et al., 1997; Tielemans et al., 1999; Fritschi et al., 2003). As a result, researchers have relied on studies that examined factors related to the reliability (reproducibility) of raters, rather than their validity, to identify areas to improve exposure assessment for population-based studies (Teschke et al., 2002; Mannetje et al., 2003; Correa et al., 2006; Steinsvag et al., 2007).
The present study was undertaken to identify factors that influence the validity and reliability of exposure ratings based on professional judgment. Because of the challenges in obtaining a gold standard for the wide variety of jobs and industries reported in population-based studies, this study straddled both the cohort and case–control study designs. Six exposure assessors were asked to discriminate between levels of exposure intensity within the same industry (foundry and textile) based on the subjects’ responses to questionnaires designed for case–control studies. Their ratings were then compared to personal dust measurements collected in the two industries. We examined the influence of the type of questionnaire, type of industry, type of rater, and number of raters. By identifying factors related to higher and lower validity and reliability of the subjective ratings, we have suggested areas where exposure assessment approaches may be improved.
We evaluated subjective ratings of occupational exposure intensity levels using subject-specific air measurements for dust among foundry workers and cotton dust among textile workers in Shanghai, China. Subjects were recruited from three foundries and three textile plants (foundry, n = 72; textile, n = 74); the subjects were not participants in any on-going epidemiologic studies. Each subject completed an OH questionnaire, which included open-ended questions on the job title, industry, employment dates, the job tasks, and the activities, tools, equipment, and chemicals used in each job ever held. Each participant also completed a more detailed IQ for his or her current job that asked about specific tasks and work locations to systematically capture within-job heterogeneity. Participation was voluntary and written consent was obtained from all subjects according to protocols approved by the Institutional Review Boards of the National Cancer Institute and the Shanghai CDC.
The gold standard of exposure for foundry dust and cotton dust in this study was personal, full-shift total dust measurements collected on each subject on six different occasions over 1 year (three in the summer, three in the winter). The measurements were collected and analyzed by Shanghai CDC industrial hygienists based on National Institute for Occupational Safety and Health (NIOSH) Method 0500 for ‘Particulates not otherwise regulated, total’ (NIOSH, 1994). The industrial hygienists changed the filter cassette at least once (midday) for most subjects and collected at least two blanks per sampling day per plant. The foundry industry measurements were collected in May through December 2001 and the textile industry measurements were collected in December 2001 and May through July 2002.
Subjective ratings of exposure intensity for each subject’s current job were made by three industrial hygienists and three occupational physicians over two rounds during an exposure assessment workshop in October 2002. The six raters were affiliated with the Shanghai CDC and had knowledge of the processes used in these industries. For the first assessment round, only the OH responses were available to the raters (hereafter, OH round). For the second round, held the next day, both the OH and IQ responses were available (OH/IQ round). The raters estimated the intensity of dust on an ordinal scale of 1 (very low) to 4 (high). The categories were not anchored to specific exposure levels; however, the raters were advised to consider the categories as approximately <10%, 10–50%, 50–100%, and >100% of the exposure limit. The time-weighted-average (TWA) exposure limits were 1 mg m−3 for both cotton dust and silica (<50% free SiO2) based on total dust samplers. Each rater rated the intensity without consultation with the other assessors and without access to the exposure measurements or, in the OH/IQ round, to previous intensity ratings.
To assign values to samples below the limit of detection (LOD), we first imputed the sample mass 20 times for all samples that had a blank-corrected mass below the gravimetric LOD of 0.02 mg, resulting in 20 imputed data sets per industry. The imputed values were calculated using a macro developed in SAS (SAS Institute Inc., Cary, NC, USA) that used ‘proc lifereg’ to compute maximum likelihood estimators for the left truncated data, assuming lognormally distributed data. We then used all samples collected per subject on each sampling day to calculate a TWA dust concentration for the subject on that day. The six TWA measurements per subject were then averaged to obtain the subject-specific arithmetic mean.
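The study performed this imputation with a SAS macro built on ‘proc lifereg’. Purely as an illustration of the same idea, the sketch below (Python with scipy; the function names are ours, not the study’s) fits a lognormal distribution to left-censored masses by maximum likelihood and then draws imputed values from the fitted distribution truncated above at the LOD:

```python
import numpy as np
from scipy import stats, optimize

def fit_censored_lognormal(observed, n_censored, lod):
    """Fit (mu, sigma) of the log-concentration by maximum likelihood,
    treating non-detects as left-censored at the LOD."""
    log_obs = np.log(observed)

    def neg_loglik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # parameterize to keep sigma positive
        ll = stats.norm.logpdf(log_obs, mu, sigma).sum()
        # each censored sample contributes P(X < LOD) to the likelihood
        ll += n_censored * stats.norm.logcdf((np.log(lod) - mu) / sigma)
        return -ll

    res = optimize.minimize(neg_loglik, x0=[log_obs.mean(), 0.0],
                            method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

def impute_below_lod(mu, sigma, lod, size, rng):
    """Draw imputed masses from the fitted lognormal, truncated above at the LOD."""
    p_lod = stats.norm.cdf((np.log(lod) - mu) / sigma)
    u = rng.uniform(0.0, p_lod, size)  # uniform draws below the LOD quantile
    return np.exp(mu + sigma * stats.norm.ppf(u))
```

Repeating the draw 20 times per industry would reproduce the 20 imputed data sets described above.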
The six independent exposure assessor ratings were used to calculate three aggregate ratings for each subject: the arithmetic mean of the six raters’ ratings (group rating), the three industrial hygienists’ ratings, and the three occupational physicians’ ratings, respectively.
All analyses were conducted using Stata 11.1 (StataCorp, College Station, TX, USA). All analyses were stratified by industry and assessment round; some were also stratified by type of rater. We were interested in the performance of the ‘typical’ rater rather than the performance of a specific rater. Thus, we report the central tendency (e.g. mean, median) and variability [e.g. standard deviation, interquartile range (IQR), data range] of the measures of reliability or validity across the six raters rather than reporting the measure for each rater individually.
We used the individual intra-class correlation coefficient (ICCind) to represent the reliability of the raters, which measures the consistency among the raters. The ICCind penalizes ratings that are close together less than ratings that are far apart, and its magnitude is identical to the weighted kappa for ordinal ratings when quadratic weights are used. ICCind was calculated from the between-subject (σ²bs), between-rater (σ²br), and residual (σ²res) variance components obtained from two-way analysis of variance models, assuming a random rater effect according to Shrout and Fleiss (1979) Case 2: ICCind = σ²bs/(σ²bs + σ²br + σ²res). The group ICC was also calculated [ICCgrp = σ²bs/(σ²bs + (σ²br + σ²res)/k), where k = number of raters]. ICCgrp represented the stability of the mean rating from the k raters. Since the magnitudes of ICCind and ICCgrp are influenced by the between-subject variability of the study population and the prevalence of the intensity ratings (Maclure and Willett, 1987), comparisons between groups should be made cautiously. Thus, we report only trends and do not make statistical inferences.
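As an illustration, both Case 2 ICCs can be computed directly from the two-way ANOVA mean squares. The sketch below (our own function name, not from the study) follows the Shrout and Fleiss (1979) formulas:

```python
import numpy as np

def icc_case2(ratings):
    """ICC(2,1) and ICC(2,k) from a subjects-by-raters matrix of ratings,
    following Shrout and Fleiss (1979) Case 2 (raters treated as random)."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-subject MS
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-rater MS
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual MS
    icc_ind = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_grp = (msr - mse) / (msr + (msc - mse) / n)        # stability of the k-rater mean
    return icc_ind, icc_grp
```

On the worked example in Shrout and Fleiss (1979) this returns ICC(2,1) ≈ 0.29 and ICC(2,4) ≈ 0.62.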
We assessed the relationship between the OH and OH/IQ individual and aggregate ratings using three metrics: proportion of ratings that differed between the two rounds, Spearman correlation statistic, and proportion of agreement between rounds by intensity category (i.e. number of subjects assigned category i in both rounds divided by number of subjects assigned category i in either round).
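The three between-round metrics can be sketched as follows (a hypothetical helper, assuming each rater’s ratings are stored as arrays of the 1–4 intensity codes):

```python
import numpy as np
from scipy import stats

def round_agreement(r1, r2):
    """Compare one rater's ratings from two assessment rounds:
    proportion changed, Spearman correlation, and per-category agreement
    (subjects rated category i in both rounds / rated i in either round)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    changed = np.mean(r1 != r2)
    rho, _ = stats.spearmanr(r1, r2)
    per_category = {}
    for cat in np.union1d(r1, r2):
        both = np.sum((r1 == cat) & (r2 == cat))
        either = np.sum((r1 == cat) | (r2 == cat))
        per_category[int(cat)] = both / either
    return changed, rho, per_category
```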
As a measure of the validity of the exposure assessors’ ratings, we calculated the Spearman correlation statistic between the subject-specific measurement means (as the ‘gold standard’) and the individual and aggregate ratings. 95% confidence intervals were calculated based on Fisher’s transformation.
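A minimal sketch of this calculation, assuming the common large-sample standard error 1/√(n − 3) on the Fisher z scale (the paper does not state which variant it used):

```python
import numpy as np
from scipy import stats

def spearman_ci(ratings, gold, alpha=0.05):
    """Spearman correlation between ratings and a gold standard, with an
    approximate confidence interval via Fisher's z-transform."""
    n = len(gold)
    rho, _ = stats.spearmanr(ratings, gold)
    z = np.arctanh(rho)                    # Fisher's z-transform
    se = 1.0 / np.sqrt(n - 3)              # approximate SE on the z scale
    zcrit = stats.norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
    return rho, (lo, hi)
```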
We also examined whether the exposure assessors’ ability to rank exposure intensities differed by exposure category. Because the ratings were not anchored to specific dust concentration values, we calibrated the ratings to the dust concentration scale. For each industry, we used a regression model to evaluate the relation between the log-transformed subject-specific measurement means and the group ratings (continuous scale, 1–4). The models’ predicted dust concentrations at the midpoints of the scale (1.5, 2.5, and 3.5) were used as cut points to divide the subject-specific means into four categories. We then calculated the proportion of subjects that were correctly assigned by the individual and group ratings to the measurement-based categories. We also examined whether the raters could discriminate between high and low exposures (Categories 1 and 2 versus 3 and 4).
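The calibration step might be sketched as follows (hypothetical function names; a simple least-squares line is assumed, which may differ from the exact model used in the study):

```python
import numpy as np

def calibration_cutpoints(group_rating, measured_mean):
    """Regress log measurement means on the group rating and return the
    predicted concentrations at the scale midpoints 1.5, 2.5, and 3.5
    as category cut points."""
    x = np.asarray(group_rating, dtype=float)
    y = np.log(measured_mean)
    slope, intercept = np.polyfit(x, y, 1)   # simple linear regression
    midpoints = np.array([1.5, 2.5, 3.5])
    return np.exp(intercept + slope * midpoints)

def categorize(values, cuts):
    """Assign concentrations to categories 1-4 using the cut points."""
    return np.digitize(values, cuts) + 1
```

The proportion correctly assigned is then just the share of subjects whose rating-based category matches the measurement-based category.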
To examine the influence of the number of raters on the validity of the group rating, we first calculated the mean intensity ratings for each possible combination of ratings of 1, 2, 3, 4, 5, and 6 raters. We then calculated the correlation between the mean rating of each combination and the subject-specific measurement means. We report the central tendency, the IQR, and the range of the correlation statistics for all the possible combinations for each number of raters.
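This enumeration over rater combinations can be sketched as follows (illustrative names; the test data below are synthetic, not the study’s):

```python
import numpy as np
from itertools import combinations
from scipy import stats

def correlations_by_panel_size(ratings, gold):
    """For each panel size k, the Spearman correlation between the mean
    rating of every k-rater combination and the gold-standard measurement
    means. `ratings` is an (n_subjects, n_raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings.shape[1]
    out = {}
    for k in range(1, n_raters + 1):
        rhos = []
        for combo in combinations(range(n_raters), k):
            mean_rating = ratings[:, combo].mean(axis=1)
            rho, _ = stats.spearmanr(mean_rating, gold)
            rhos.append(rho)
        out[k] = np.array(rhos)
    return out
```

The central tendency, IQR, and range reported below correspond to summaries of `out[k]` for each k.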
In the foundry industry, 13 of the 72 subjects had at least one measurement below the LOD (range: 8–27% censored measurements per subject); the 860 collected samples (predominantly partial-shift; 17 censored) resulted in 432 TWA measurements. In the textile industry, 31 of the 74 subjects had at least one measurement below the LOD (range: 14–60% censored measurements per subject); the 486 collected samples (57 censored) resulted in 443 TWA measurements. Table 1 shows the descriptive statistics of the TWA dust measurements and the subject-specific measurement means. The subject-specific measurement means were spread more across the exposure range for the foundry industry than for the textile industry (Fig. 1). The overall means of the subjective ratings were similar for the two rounds (Table 2).
ICCind and ICCgrp were generally higher for the foundry industry than for the textile industry, although the differences between industries were small for ICCgrp (Table 2). For the foundry industry, the magnitudes of the ICCs were similar between the two types of raters. Larger differences were observed for the textile industry, with greater reliability observed among the occupational physicians than among the industrial hygienists. ICCind and ICCgrp were consistently lower in the OH/IQ round than in the OH round in both industries. The magnitude of ICCgrp for both rounds generally met the criteria for ‘Almost Perfect’ reliability (Landis and Koch, 1977).
All changes in the group rating (rounded to the nearest integer) and 97% of the changes by an individual rater were within one category. On average, 35% of the individual ratings for the foundry industry (range of six raters: 17–58%) and 32% of the ratings for the textile industry (range: 9–55%) were changed between the two rounds. We observed fewer changes in the group rating between rounds (foundry: 32%; textile: 12%).
The individual raters’ ratings from the two rounds were correlated (rS across the six raters: foundry, mean 0.65, range 0.56–0.91; textile, mean 0.65, range 0.56–0.91). The correlation was higher for the group ratings (foundry: rS = 0.76; textile: rS = 0.90). In the evaluation by intensity category, we observed no patterns across the categories for either the individual or group ratings, although Category 4 tended to have the highest agreement and Category 1 the lowest (not shown).
The correlations between the subject-specific measurement means and the aggregate ratings were higher than for the individual raters in both industries and rounds (Table 3). The magnitude was similar for the group, the industrial hygienists, and the occupational physicians’ ratings in both industries and assessment rounds. For the foundry industry, the correlation was higher in the OH/IQ round than in the OH round. The correlations were nearly the same in the two rounds for the textile industry.
In Fig. 2, we show the best fit line from the regression models that were used to categorize the subject-specific measurement means into four categories based on the group ratings from the OH/IQ round, where the category cut points were defined by the predicted mean for rating values of 1.5, 2.5, and 3.5. For the foundry industry, the cut points of 0.8, 1.8, and 4.1 mg m−3 were higher than the rough guidelines provided to the raters (0.1, 0.5, and 1 mg m−3). For the textile industry, the cut point of 1 mg m−3 for the highest category was identical to the expected value, but the lower cut points of 0.4 and 0.7 were higher than expected and resulted in narrow exposure ranges for the medium and medium-high categories.
When we examined the agreement by exposure category, the agreement was generally highest for measurement-based Category 4 for the foundry industry; but this pattern was not observed for the textile industry (Supplementary data are available at Annals of Occupational Hygiene online). For the foundry industry, the overall agreement between the group ratings and the measurement-based categories improved from 46% based on four categories to ~70% based on two categories. For the textile industry, the overall agreement improved from 37% based on four categories in both rounds to ~60% based on two categories. The high exposure category had better agreement for the group ratings (foundry: 80% in OH round and 77% in OH/IQ round; textile, both rounds: 100%) than the low exposure category (foundry: 61 and 67%; textile: 58 and 52%, for OH and OH/IQ rounds, respectively).
The use of two or three raters increased the correlation between the ratings and the measurement means compared with using only one rater (Fig. 3); minimal improvement was observed when four or more raters were used. The same patterns were observed in the OH round and with the reliability measure, ICCgrp (not shown).
The exposure assessors in this study provided subjective ratings based on subjects’ responses to IQs designed for population-based studies for two industries in China. Our assessment was limited to evaluating the exposure levels in the subjects’ current jobs within an industry and could not evaluate the raters’ larger task in population-based studies to evaluate exposures across many years, jobs, and industries. Our findings identified trends but did not make statistical comparisons because of the small number of workers. Nevertheless, our study identified potential weaknesses of subjective ratings of exposure and ways to improve future exposure assessments.
The findings in this study straddled both the median validity (0.5) and reliability (0.6) measures reported in a 2002 review and in subsequent studies (Teschke et al., 2002; Mannetje et al., 2003; Correa et al., 2006; Steinsvag et al., 2007). The dusts evaluated here were directly mentioned in the IQs and were, for the most part, visible. Both of these factors were previously associated with good inter-rater reliability for metrics of exposure probability, whereas invisible and less explicit exposures generally have had poor inter-rater reliability (Mannetje et al., 2003). We found lower validity and reliability of the ratings for the textile industry than for the foundry industry. We suspect that the poorer performance was related to the textile industry’s smaller range of intensities and the strong skew in exposure distribution (GSD 3.4 versus 1.8 for textile versus foundry, respectively). For instance, we found that Categories 1, 2, and 3 each spanned an exposure range of only 0.2–0.4 mg m−3. Thus, in that industry, we asked the raters to discriminate between intensity levels when there was little between-subject difference for most subjects. The limited guidance we provided to the exposure assessors may have also contributed to the poor performance because the intensity category cut points were not anchored to specific exposure levels. As a result, raters may have felt obliged to use the entire range on the ordinal scale. The raters may have also defined each category differently from the other raters. The performance of the raters might have improved with greater a priori consideration of the exposure categories and the exposure levels commonly found in that industry. This issue is more important in industry-based studies, whereas population-based case–control studies cover a large number of industries and a wide range of exposure levels.
Interestingly, it was in the poorer performing textile industry that the exposure category cut points more closely matched the guidelines to the raters, whereas the observed cut points in the foundry industry were roughly 4-fold higher than expected. This suggests that when the exposure limits are low, it may not be possible for the raters to discriminate between exposures at these lower levels. The validity of the raters’ assessments improved when we reduced the exposure assessment task from four categories to two, suggesting that exposure assessors can successfully distinguish between high and low exposure levels within the same industry. Identifying these broad differences is important because exposure assessment approaches that maximize between-group differences have been shown to minimize the attenuation of exposure-response associations, although loss of precision may occur (van Tongeren et al., 1997; Tielemans et al., 1998; van Tongeren et al., 1999).
With the additional information from the IQs, the raters changed, on average, 20–35% of the exposure assessments. The OHs did not provide a systematic way of collecting detailed information, such that two reported jobs with different exposure scenarios may appear similar because of the limited information reported. The IQs, in contrast, provided a systematic way to differentiate the variability in the exposure scenarios among similarly reported jobs. We cannot rule out, however, the influence of recall on the similarities between rounds. For convenience, the raters rated the OH/IQ information the day after rating the OH information. Thus, we may be partly measuring the raters’ reproducibility rather than solely the influence of the questionnaire type. The influence of reproducibility was likely small, however, as the raters did not have access to the first assessment during the second assessment and the number of jobs was high.
The validity of the ratings improved for the foundry industry, but not the textile industry, when the ratings were based on the additional information from the IQs compared with ratings based on the OHs alone. The narrow exposure range for most subjects in the textile industry may have reduced the influence of the additional information in that industry. The additional information did not improve the raters’ reliability for either industry, which is consistent with previous studies (de Cock et al., 1996; Stewart et al., 2000; Steinsvag et al., 2007). However, the improved validity seen here provides some support for the additional time and cost associated with collecting more detailed information, especially when there is substantive variability in exposure that is not captured by job title alone.
We consistently observed wide variability among the raters’ ratings for the same subject. The ability of a rater to assess exposures is likely to be heavily dependent on his or her actual experience in an industry. Thus, each rater’s intensity ratings may correctly reflect his or her own experience but may not reflect the actual long-term intensity of the exposure scenario being evaluated. As a result, the best metric may not be a rating from an individual rater but rather the average rating from a group of raters. This hypothesis is supported by the improved validity seen here and in previous studies when multiple raters are used (de Cock et al., 1996; Semple et al., 2001; Steinsvag et al., 2007). This is a common phenomenon observed across multiple domains, in which the collective wisdom of independent evaluators does remarkably well at obtaining an unbiased estimate (Surowiecki, 2005).
Our results suggest that at least two but not more than three raters were necessary to obtain a stable, valid group rating, which is consistent with previous studies (de Cock et al., 1996; Semple et al., 2001; Steinsvag et al., 2007). Although obtaining a stable rating coincided with a more valid rating in this study, this pattern may not always be the case. More importantly, the improvement in stability and validity may not be sufficient. For instance, using multiple raters for the textile industry improved the correlation from 0.35 for the ‘typical’ rater (mean performance of the individual raters) to 0.44 for the group; however, the use of the group rating would still introduce substantial misclassification into the study.
Using large panels of raters is time-consuming and expensive, so identifying ways to reduce that burden remains a priority. The raters in our study were professionals with expertise in occupational disease and/or exposure monitoring but did not have specific training in exposure assessment for epidemiologic studies and were provided limited guidance on the exposure scales. This approach is common in epidemiologic studies, where experts in the work environment are considered experts in all aspects of exposure assessment. In this study, each rater had over 10 years of experience within his or her respective field, had substantial field experience in these and other work settings, and used the results of exposure monitoring in their work. Thus, their similar abilities were not surprising.
Some exposure misclassification may have been introduced by our gold standard. Our approach assumed that six measurements collected over two seasons for each subject provided an unbiased measure of the subjects’ average exposure in that job; however, some misclassification may remain, especially for subjects with highly varied tasks. It is unclear how ‘alloyed’ our gold standard was (Wacholder et al., 1993). Additional exposure misclassification may have occurred from the use of nonspecific particulate measurements rather than measurements that reflect the agent of concern (e.g. silica, cotton dust) (Friesen et al., 2007). We used the subject-specific measurement means as the gold standard because most epidemiologic studies of chronic diseases aim to assign an unbiased mean exposure level for each subject for each year of exposure, irrespective of day-to-day variability (Checkoway et al., 2004). Likewise, the questionnaires asked about the ‘usual’ or ‘typical’ work patterns rather than day-specific work patterns. The use of the mean is supported by a previous validation study that showed that the median proportion of variance explained by the ratings improved substantially (from 0.25 to 0.45) when the day-to-day variation was excluded (Kromhout et al., 1987).
There were two important design-related limitations. First, our scope was limited to evaluating the raters’ performance for only one industry per substance. As a result, the exposure assessment task was more straightforward than in population-based studies, where the same exposure would need to be evaluated across multiple industries and across time. Second, our scope was limited to evaluating the raters’ performance for only the current job; the subjects’ past jobs were not assessed here. Because of both restrictions in study design, the impact on a population-based study is not clear. Our study may represent a best-case assessment, and our findings may overestimate the performance of exposure assessors in population-based studies. For example, Teschke et al. (1996) found lower reliability for both individual and group ratings for jobs held 4 and 5 decades in the past compared to ratings for more recent jobs. On the other hand, population-based studies are likely to have a wider range of exposure levels, making it easier to discriminate among exposures.
We also did not evaluate whether exposure assessment training could improve the reliability and validity of the raters. Greater guidance and training before the assessments, including interpretation of the questionnaires, anchoring the exposure category cut points with exposure scenarios expected in the study, and consensus on how to evaluate complex exposure scenarios, might have improved the reliability and validity of the raters. For instance, Logan et al. (2009) showed that the accuracy of exposure assessments, made to identify current exposure scenarios that may result in exposure levels above the exposure limits, was improved by 50% after a data interpretation training workshop.
In summary, exposure assessments based on professional judgment may reduce misclassification by using at least two or three raters, by using questionnaires that systematically collect task information, and by defining exposure categories that are relevant to the exposure scenario and that are distinguishable by the raters. The amount of improvement in the ratings may differ by exposure agent and should be examined for other agents. Few studies have the resources to use panels of raters, so identifying ways to improve an individual rater’s assessment, such as examining the impact of exposure assessment training, remains an important research need.
Intramural Research Program at the National Cancer Institute.
We thank the six raters, the Shanghai Center for Disease Control, and the study subjects in the Shanghai foundry and textile factories who participated in this study.