|Home | About | Journals | Submit | Contact Us | Français|
A previous study demonstrated decreased diagnostic accuracy for finding fractures and decreased ability to focus on skeletal radiographs after a long working day. Skeletal radiographic examinations commonly have images that are displayed statically. This study investigated whether diagnostic accuracy for detecting pulmonary nodules in computed tomography (CT) of the chest displayed dynamically would be similarly affected by fatigue.
Twenty-two radiologists and 22 residents were given two tests searching CT chest sequences for a solitary pulmonary nodule before and after a day of clinical reading. To measure search time, ten lung CT sequences, each containing 20 consecutive sections and a single nodule, were inspected using free search and navigation. To measure diagnostic accuracy, one hundred CT sequences, each with 20 sections and half with nodules, were displayed at preset scrolling speed and duration. Accuracy was measured using ROC analysis. Visual strain was measured via dark vergence, an indicator of the ability to keep the eyes focused on the display.
Diagnostic accuracy was reduced after a day of clinical reading (p = 0.0246), but search time was not affected (p > 0.05). After a day of reading, dark vergence was significantly larger and more variable (p = 0.0098), reflecting higher levels of visual strain and subjective ratings of fatigue were also higher.
After their usual workday, radiologists experience increased fatigue and decreased diagnostic accuracy for detecting pulmonary nodules on CT. Effects of fatigue may be mitigated by active interaction with the display.
Today’s advanced imaging modalities are acquiring more and more images that must be interpreted in less and less time [1-8]. Concerns have been raised that radiologist workloads are becoming so demanding that fatigue and reduced time for interpretation are negatively impacting diagnostic accuracy [9-14]. In court, a plaintiff’s attorney has argued that a radiologist missed a breast lesion because he was overworked .
Although radiologist fatigue has been a concern for years, only recently have dedicated studies been conducted. Some early studies did not examine fatigue or viewing times directly. For example, Oestmann et al.  demonstrated that detection accuracy for lung nodules decreased as viewing time decreased, but fatigue was not examined. Bechtold  found that error rates in the interpretation of abdominal CT more than doubled when radiologists read out more than 20 studies in a day. This retrospective review and classification of errors in clinical cases was not a controlled examination of fatigue.
More recently, studies have examined reader accuracy at different times during the day, with mixed results. Taylor-Phillips et al.  examined data from the UK Breast Screening Programme for nearly 200,000 cases in an attempt to relate accuracy to time of day and reading time. They found that recall rates varied with time of day but not in the same way for the individual readers. Some readers had lower recall rates in the afternoon, while others did not. Recall rates tended to decline with increased reading time (i.e., recall rates were lower around lunch and the end of the day), but again it varied considerably among readers. The sample was too noisy to document anything significant beyond a possible trend. This study did not directly examine fatigue or conduct a controlled study in which readers read a dedicated set of cases before and after a day of clinical reading.
Al-s’adi et al.  also found that breast lesion detection varies with time of day, but that no particular time of day had a significant effect. Radiologists at a national meeting were recruited to read a set of mammograms during one of 4 reading times (7:00 – 10:00; 10:00 – 13:00; 13:00 – 16:00; 16:00 – 20:00). There were no statistically significant differences in sensitivity, specificity or area under the receiver operating characteristic (ROC) curve (AUC) as a function of time of day. Limitation of this study include readers only participating in a single session and that they could choose the time of their participation, possibly choosing a time of higher performance or motivation.
The impact of fatigue was directly studied by Krupinski et al.  using skeletal radiographs with fractures as the detection task. Forty radiologists and residents interpreted a set of 60 patient examinations before and after a day in the reading room. Resting state of accommodation (a.k.a. dark focus) was measured as an indicator of visual workload on oculomotor equilibrium. Subjective measures of physical and visual strain and/or fatigue were collected. The results indicated that diagnostic accuracy was reduced significantly from before to after the day of clinical reading (p < 0.05) and the radiologists and residents became more myopic. Subjective ratings indicating increased lack of energy, physical discomfort, sleepiness, physical exertion, lack of motivation, and eyestrain. In general, the residents exhibited greater effects of fatigue on all measures compared to the attending radiologists. The conclusion was that after a day of clinical reading, radiologists have reduced ability to focus and a reduced ability to detect fractures, and increased symptoms of fatigue and visual strain.
The results of this study probably generalize well to most radiographic modalities. However, there are usually few radiographs per patient and the images are static. Tomographic modalities such as CT, magnetic resonance imaging (MRI) and digital breast tomosynthesis are viewed in fundamentally different way than radiographs.
The sequences of tomographic sections are typically viewed in ciné-animation mode with successive sections presented one after another under the radiologist’s control. The difference between static and dynamic displays places different demands on the visual system. A very basic, yet critical, distinction in the human visual system is between channels processing static stimuli and channels processing moving or changing stimuli [18-20]. Briefly, the transient visual channel, with high temporal resolution but poor spatial resolution, serves as an “early warning system” for the sustained visual channel which has poor temporal resolution and high spatial resolution. Things that move or change attract attention and eye movements. That is why people wave when they want to attract attention and why warning signals flash off and on. It is why things that move seem blurry and things that do not move seem to be sharp. These characteristic reflect the sensitivities of the two parts of the visual system handling perception of these stimuli. As the radiologist cycles dynamically through a sequence of CT sections, the sudden onset and offset of a pulmonary nodule captures the viewer’s attention and directs it to the location of change . With dynamic images, the motion channel of visual processing which directly affects attention comes into play, and the task of guiding the eyes around the changing image in search of lesion becomes more complex. Thus, the impact of fatigue may differ for dynamic and static image interpretation.
The goal of the present study was to measure diagnostic accuracy for pulmonary nodule detection in dynamic CT chest sequences before and after a day (or night) of diagnostic image interpretation. We also investigated measure of visual strain, the resting static of convergence, often referred to as dark vergence.
This study was approved by the Institutional Review Boards at both the University of Arizona and the University of Iowa.
All images were stripped of patient identifiers to comply with Health Insurance Portability and Accountability Act standards. We used 110 chest CT examinations selected from existing databases [21-24], approximately half (60) with a solitary pulmonary nodule and half (50) nodule free. Approximately half of the nodules were moderately subtle and the other half subtle as determined in the previous studies. In order to standardize the viewing conditions for all observers, we restricted each case to 20-slice sequences. For the nodule cases, the slices (3 mm) were selected such that the nodule did not appear in the two end slices. This insured that the entire nodule would be visible without getting cut off at the boundaries. Standard lung window/level setting was used and observers were not allowed to adjust settings during testing. Additional examinations were used in a demonstration prior to testing to familiarize observers with the task, reporting procedure, and presentation software.
Observers were attending radiologists and radiology residents at the University of Arizona (AZ) and the University of Iowa (IA), with 11 attending radiologists and 11 radiology residents at each institution. Table 1 provides the gender, average age, percent wearing corrective lenses, and type of lenses worn for the observers at both institutions. Table 2 indicates how long on average they had been reading cases prior to the test sessions.
The participants were also asked to indicate whether they had a preferred order in which they viewed CT chest image areas (bone, mediastinum, lung) and in what manner they preferred to view them (e.g., cine first, right then left etc.). The preferences are shown in Figures 1 and and22.
Cases were displayed using customized WorkstationJ software developed at the (copyright 2011, the University of Iowa) . Data were collected at two points in time for each observer: once prior to any diagnostic reading activity (Early) and once after a day of diagnostic reading (Late). It should be noted that we use the terms Early and Late rather than morning and afternoon since the Early session for some readers was in the afternoon before starting a night shift and the Late session was in the morning after coming off call.
Observers at one site (AZ) completed the Swedish Occupational Fatigue Inventory (SOFI) which was developed and validated to measure perceived fatigue in work environments [26-27]. The instrument consists of 20 expressions distributed on five latent factors: Lack of Energy, Physical Exertion, Physical Discomfort, Lack of Motivation, and Sleepiness. Subjects report their ratings for each of the 20 questions using a 0 – 10 point scale where 10 indicates that they are 10 times as fatigued/stressed/unmotivated etc. than if they were reporting a 1 (i.e., interval scale data). An average score for each of the five latent variables is derived from the individual questions within the set of 20 that contribute to the latent factors [26-27]. Physical Exertion and Physical Discomfort are considered physical dimensions of fatigue, while Lack of Motivation and Sleepiness are considered primarily mental factors. Lack of Energy is a general factor reflecting both physical and mental aspects of fatigue. Lower scores indicate lower levels of perceived fatigue than higher scores. SOFI does not measure visual fatigue so it was complemented with the oculomotor strain sub-scale from the Simulator Sickness Questionnaire (SSQ) [28-29]. Subjects report their ratings on a set of seven dimensions (i.e., general discomfort, fatigue, headache, eyestrain, difficulty focusing, difficulty concentrating, blurred vision) using a 1 – 4 point scale ranging from none to severe (ordinal scale data).
Visual strain was assessed by measuring “dark vergence,” the resting state of convergence of the eyes [30-33], measured in the absence of stimuli (including light). There is evidence that prolonged near work impacts dark vergence (as it does accommodation), resulting in inducement of temporary myopia. In this study we measured dark vergence using the Vergamatic™II USB (manufactured by Steven Spadafore, Franklin and Marshall College, Lancaster, PA). The device measures dark vergence and generates two metrics called V or angle (deg) and meter-angle. Angle (V) is approximately equal to the angle between the lines from the optical center of the eyes to the point of fixation and the parallel rays that would define the gaze direction if the eyes were fixated at infinity. Meter angle is the linear equivalent of V. Measures were made before and after each reading session.
After an introduction and review of the practice cases, the observers viewed the CT sequences on a NEC MultiSync LCD 2490WUXi color display (maximum luminance 400 cd/m2; contrast ratio 800:1; resolution 1920 × 1200; screen size 24.1”) that was calibrated to the DICOM (Digital Imaging and Communications in Medicine) Grayscale Standard Display Function (GSDF) .
The test session was divided into two parts. In Part I (free scrolling), the readers were presented with 10 of the cases, each containing a nodule. In this part they used the mouse to scroll back and forth through the CT sequences at their own pace. Their task was to determine if a nodule was present, locate it with a cursor, and provide a rating of their decision confidence both in adjectival form (definite, probable, possible, suspicious) and subjective probability (10-100 in 10-point intervals) to be used in a ROC analysis. Total time spent viewing each sequence was recorded.
In Part II (fixed scrolling) 100 CT sequences were shown to the readers but at a fixed rate and number of passes through each sequence. Each sequence went through 4 passes (sections 1 to 20, 20 to 1, 1 to 20, and 20 to 1) at a rate of 0.18 sec/slice for a total of 14.18 sec total viewing time. After each sequence was displayed, the software guided them through a series of responses to indicate whether a nodule was present or absent and, if present, to indicate its location (right or left lobe, and anterior, central or posterior portion of the lung). Finally it asked them to indicate their confidence in the decision as a subjective probability (10-100% in 10% intervals), before prompting them to go to the next sequence. Each session took approximately one hour to complete.
Diagnostic accuracy was derived from the confidence data and was measured using area under the Receiver Operating Characteristic (ROC) curve (AUC). AUC was estimated for each observer in each experimental condition, and the average areas were compared using an Analysis of Variance (ANOVA). Between-subject variables were level of training (Attending, Resident), institution (Arizona, Iowa), and a within-subject (or repeated measures) variable was the reading session time-of-day (Early, Late). Two ROC methods were used. The first was PROPROC [35-37] which does not take into account lesion location, and the second was LROC [38, 39] which does take location into account. Post-hoc F-tests were used to examine individual variable differences and interactions.
The viewing times were measured in seconds (continuous ratio data) so were analyzed using an ANOVA with Early vs Late and location (AZ vs IA) as independent variables. The visual strain (dark vergence) measures (continuous ratio data) were also analyzed with an ANOVA with Early and Late pre and post-session recordings as the independent variables. Post-hoc F-tests were used to examine individual variable differences and interactions.
The SOFI survey uses interval scales for reporting so an ANOVA was used to analyze these data. The SSQ survey uses a 1-4 ordinal scale for reporting and thus a Wilcoxon Signed Rank test was used to analyze these data.
In Part I (free scrolling), number of nodules detected (of 10) and time to indicate the nodule (which ended the trial) were analyzed in two separate ANOVAs. There was no significant difference between Early and Late in the number of nodules detected (F = 1.42, p = 0.24). The attending radiologists detected 81% of the nodules on average in the Early session and 80% during the Late session. The residents detected 79% on average during the Early session and 75% during the Late session. There was also no significant difference in viewing time per image. The median viewing time for the 10 trials was computed for each reader in each treatment. Average of the median viewing time in the Early session was 26.83 sec and 26.85 sec during the Late session (F = 0.00, p = 0.99).
In Part II (fixed search) area under the ROC curve (AUC) was used to measure accuracy for detecting nodules. For the ANOVA and PROPROC AUC measures, the only significant effect was the training level by time-of-day interaction (F = 5.45, p = 0.0246). This effect is illustrated in Figure 3.
Follow-up F-tests indicate that for the attending radiologists the Early to Late change in PROPROC AUC (0.873 to 0.882) was not significant (F = 0.86, p = 0.37) and for the residents the Early to Late change in PROPROC AUC (0.906 to 0.863) was marginally significant (F = 3.93, p = 0.063).
For the ANOVA for LROC AUC measures, the only significant effect was the training level by time-of-day interaction (F = 6.40, p = 0.0154). This effect is illustrated in Figure 4.
Follow-up F-tests indicate that for the attending radiologists the Early to Late change in LROC AUC (0.706 to 0.755) was marginally significant (F = 4.13, p = 0.057) and for the residents the Early to Late change in PROPROC AUC (0.789 to 0.742) was not significant (F = 2.13, p = 0.162).
Both of the dark vergence measures (V and MA) showed increased variability for the Late versus Early reading sessions (box plots Figure 5). The MA metric revealed a statistically significant increase for Late compared to Early sessions (F = 6.793, p = 0.0098). The V metric also showed an increase for the Late session, but it did not reach statistical significance (F = 1.507, p = 0.2210).
The scores for each of the five SOFI factors (AZ readers) were analyzed with an ANOVA with session (Early vs. Late) and experience (Attending vs. Resident) as independent variables. Average rating values for each factor are shown in Table 3. It can be seen for all measures ratings were higher (more severe) for the Late compared to the Early sessions. For all of the measures the residents gave higher ratings than the attending radiologists.
For Lack of Energy (F = 9.13, p = 0.0044) and Lack of Motivation (F = 8.23, p = 0.0066) the differences were statistically significant. For Physical Exertion, Physical Discomfort and Sleepiness, the Early to Late differences were not statistically significant. For the SSQ survey the residents again had higher ratings overall than the attending and the ratings for Early were significantly lower for both groups (i.e., less severe) than for Late reporting (Z = -3.509, p = 0.0004).
Our study revealed some decreases in diagnostic accuracy as a function of the work of interpreting clinical images. Part II used automated scrolling to collect 100 ROC trials in under 50 minutes. We had judged that collecting our data in under an hour per session was necessary in order to limit adding to the fatigue levels. Both proper ROC and location-specific ROC methods demonstrated a statistically significant training-by-workload interaction with attending radiologists tending to increase in accuracy with work and residents tending to decrease accuracy with work. Attending radiologist either improved after working (LROC) or stayed the same (ROC). Residents either decreased in accuracy (ROC) or stayed the same (LROC). These significant interactions mirror our findings with fracture detection used to measure the effects of fatigue . Long reading days do impact observer performance for interpretation of dynamic CT sequences much as they do with static image interpretation.
An interesting question is why the proper ROC analysis and the LROC analysis gave differing version of the statistical interaction between training and fatigue: LROC showed increasing attending performance, while proper ROC showed decreasing resident performance. Of course, it should be noted that the direction of non-significant effects was consistent with the significant effects. The difference between ROC scoring and LROC scoring is that the former may give credit for a false positive response identifying a non-existent nodule combined with a false negative response failing to identify a real nodule. This (LROC) should provide a more accurate scoring of responses.
Part I of our experiment used free scrolling to focus on fatigue effects on visual search time, closely resembling actual clinical reading. Neither response time nor hit rate detecting nodules depended on interpretive work. Part I with only 10 target nodules and no trials without nodules was not designed to measure diagnostic accuracy. The instructions were designed to encourage observers to search until they were confident that they had located a pulmonary nodule. The purpose was to determine where visual search became less efficient. It did not. Perhaps active interaction with the workstation provides a measure of physical activity sufficient to ward off the effects of fatigue.
Dark vergence was a fairly effective measure of visual strain or fatigue at least using the MA metric. After a long day/night of clinical reading, there was much more variability in both the V and MA metrics, and for MA the values increased significantly. The results are supportive of those observed with the accommodation measure used in the bone fracture study [17, 40], readers were essentially more myopic after each reading session compared to before as well as more myopic overall Late compared to Early.
As noted earlier, induced myopia is a common finding in observers engaged in prolonged near-vision work which is exactly what radiologists are engaged in as they sit in front of computer displays for hours on end interpreting image cases. We have now established in two separate studies with two separate measures of the set point of accommodation and convergence that radiologists experience induced myopia after a long day/night of reading. However, we cannot yet establish a causal relationship between visual changes and reduced diagnostic accuracy. Evidence from other studies is mixed. For example, Safdar et al.  tested visual acuity of 23 radiologists between 7:50 – 10:30, 12:00 – 15:30 and after 15:30 on several workdays. They found no significant differences in acuity as a function of time of day. We would expect decreased acuity based on decrements in ability to keep the eyes focused on the display screen. Safdar et al. did not however report on exactly how much clinical reading the observers had been engaged in prior to each measurement.
Unno et al.  compared visual acuity, convergence, and pupil diameter of younger and older subjects before and after reading 2D vs 3D (stereoscopic) radiographs. They observed some possible trends in each of these measures with the 3D reading impacting the measures more, but no statistically significant differences were observed. The limitations of this study were that they did not use radiologists as observers and the study was focused on fatigue associated with a relatively brief reading of stereo pairs.
The SOFI and SSQ ratings are very similar to those observed in the fracture study . Both the attending radiologists and the residents subjectively felt more fatigued after a day/night of clinical reading. In both studies, the residents had higher ratings on all of the measures compared to the attending radiologists. It is interesting to note that even though the attending radiologists felt fatigued and experienced induced myopia as evidenced by the dark vergence measurements, they did not have an associated decrease in diagnostic accuracy. In the fracture study they did exhibit a decrease in performance, but as in this study the residents were clearly more impacted by fatigue than the more experienced attending radiologists. In the present study it was the residents’ drop in diagnostic accuracy that contributed more to the statistical significance than the attending radiologists’.
Further study is needed to determine why this difference between residents and attending radiologists exists, but two possibilities come to mind immediately. The first is that the residents are still in a learning phase during their routine workdays and although they clearly do not read as many images as an attending radiologist the learning process itself is quite fatiguing and stressful, thus impacting them more at the end of the day. The second possibility is that the attending radiologists are quite fatigued as well at the end of their shifts, but through experience have learned to compensate for their fatigue better perhaps by being more careful during reading and pacing themselves better than the residents.
There are limitations associated with this study. Although we did include a free search condition (Part I), the main study (Part II) was restricted to 20 contiguous sections that were scrolled through automatically by the computer for a set amount of time. This is quite unlike clinical reading, but was necessary for this study as we wanted all readers to complete the study within about 1 hour and to have read the same number of cases. Although this could have made the task less fatiguing than in true clinical reading, we still observed a statistically significant drop in diagnostic accuracy after a long day of clinical reading. If we had actually replicated clinical reading with free search of 100 cases in Phase I (and eliminated phase II) that included all of the slices, it seems likely that we would have observed an even greater decrement in performance. A future study is warranted to follow up on this possibility.
A second limitation is that the readers knew that the study was about fatigue. However, intuitively one would think that knowing the study was about fatigue would have led to readers trying to compensate for or overcome their fatigue in the late session just to “prove” their performance was not affected by fatigue. The results however indicate otherwise. Even if they were trying to combat their fatigue and maintain accuracy, at least for the residents and some of the attending this did not happen – accuracy was degraded after a long day of clinical reading. They did not rise to the occasion.
It is interesting that the attending were overall less impacted by fatigue than the residents in that their diagnostic accuracy in the main test (Phase II) was not impacted greatly Late in the day. There are two possible contributing factors. The first was noted above – the automatic scrolling and set viewing time may somehow lessen the impact of fatigue, perhaps by reducing the need to interact with the computer, decide how fast to scroll, when to stop etc. Less cognitive and physical energy was needed compared to traditional “active” reading so more attentional and cognitive resources could be devoted to the detection task. This is one avenue for potential future investigation. The second possibility is that the attending are simply much more experienced than the residents and over many years of clinical reading have developed ways to compensate for fatigue.
After a day/night of clinical reading, radiologists have increased symptoms of fatigue, and increased oculomotor strain as evidence by more variability in dark vergence. Residents have reduced detection accuracy for lesion targets in dynamic CT sequences, although paradoxically attending radiologists do not. These results parallel those for accuracy in detecting fractures in static bone images. Radiologists need to be aware of the effects of fatigue on diagnostic accuracy and take steps to mitigate these effects.
We would like to thank the 44 attending radiologists and radiology residents in the Departments of Radiology at the University of Arizona and the University of Iowa for their time and participation in this study.
This work was supported in part by grant R01 EB004987 from the National Institute of Biomedical Imaging and Bioengineering
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.