Screening luggage for threats requires high and sustained levels of vigilant attention because weak and infrequent signals must be detected continuously among high levels of background clutter. We showed that threat detection performance on a simulated luggage screening task (SLST) deteriorated during night work and sleep loss, and that SLST and PVT performance covaried over a 34-h period of total sleep deprivation, a prerequisite for the ability of the 3-min PVT to predict SLST performance. Coherence between average SLST performance across subjects and average ≥355 ms lapse frequency on the 3-min PVT across subjects was high, and 83% of the variability in SLST performance was explained by 3-min PVT outcomes. Within-subject coherence was lower, probably because some individuals were insensitive to sleep loss and showed low variability in both SLST and PVT performance. However, SLST and PVT performance were still negatively correlated within the majority of subjects (28 out of 36).
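The within-subject count reported above can be illustrated with a minimal sketch. It assumes a Pearson correlation between lapse counts and A′ per subject, which is an assumption for illustration only, since the coefficient actually used is not restated here. The function names and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences.

    Returns None when either sequence has zero variance, as can happen
    for subjects whose performance barely changes with sleep loss.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def count_negative(subjects):
    """subjects maps a subject id to (lapse counts, A' values) per bout."""
    rs = {sid: pearson(lapses, a) for sid, (lapses, a) in subjects.items()}
    n_negative = sum(r is not None and r < 0 for r in rs.values())
    return n_negative, rs

# Made-up illustration data, not taken from the study.
subjects = {
    "s1": ([2, 5, 12, 25], [0.95, 0.93, 0.88, 0.80]),  # more lapses, lower A'
    "s2": ([3, 4, 3, 5], [0.90, 0.91, 0.90, 0.92]),    # no negative relation
    "s3": ([1, 1, 1, 1], [0.94, 0.95, 0.93, 0.94]),    # no PVT variability
}
n_negative, rs = count_negative(subjects)
print(n_negative, "of", len(subjects), "subjects negatively correlated")
```

Subjects with no variability in either measure yield an undefined coefficient and are not counted as negatively correlated in this sketch.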
The purpose of fitness-for-duty testing is to detect relevantly impaired individuals who are unfit for the job. The relevance of impairment needs to be defined within the context of each predicted task, i.e. a certain level of impairment may be tolerated in one task but considered detrimental in another, especially in safety-sensitive jobs (e.g. truck driving or luggage screening). In any case, the fitness-for-duty test has to give some sort of feedback, and this feedback has to lead to consequences. The type of feedback can range from a continuous measure of impairment (e.g. the percentage level of performance relative to a standard ranging from 0%–100%) to a dichotomous outcome (fit/unfit). A continuous measure of impairment in itself is of little use. One or more decision thresholds are needed to assign ranges of fitness-for-duty test outcomes to levels of impairment that are connected with specific consequences (in the simplest form, whether or not the subject is allowed to perform the task). In our view, one threshold dividing fitness-for-duty test outcomes into “fit” and “unfit” may not be enough, because it is questionable whether subjects performing just above or below the single decision threshold are really fit or unfit to perform the task.
Here, we presented a method to find optimal decision thresholds to assign ≥355 ms lapses on the 3-min PVT to three SLST performance categories (high, medium, and low), although a higher number of categories could be used, too. The medium performance category separates the high performance category (subjects are fit for the task) from the low performance category (subjects are unfit for the task and must not perform it). The consequences for subjects falling in the medium performance category may vary depending on the relevance of the task. If subjects are allowed to perform the task, informing them about their decreased level of alertness may improve their effort and inspire them to apply countermeasures aiming at short term (e.g. break, caffeine) or long term (e.g. increasing individual sleep times) improvements of alertness. The latter was shown in a study of truck drivers (24). If subjects falling in the medium performance category are not allowed to perform the task, employers need to be aware that the increase in performance on the task comes with the cost of excluding greater numbers of employees from doing their job.
The mixed models used for analyzing differences between high, medium, and low performance groups showed that 40.5% of the variance in SLST performance A′ was attributable to inter-individual differences, demonstrating that SLST performance varied considerably between subjects. However, the PVT is only able to predict relative decreases in SLST performance within a given subject that are caused by fatigue or by other influences affecting both PVT and SLST performance (e.g. alcohol and drugs, although this was not explicitly tested in this study). This is an important distinction, as the PVT is no surrogate or measure of absolute SLST performance between subjects (i.e. high performance on the PVT in a given individual does not guarantee high performance on the SLST). Psychomotor vigilant attention is an important but not the only factor determining SLST performance, e.g., screeners differ in their ability to fixate and recognize threat objects among high levels of background clutter (25). Thus, the determination of decision thresholds based on the rank order of SLST performance within subjects seemed a natural and reasonable approach.
After pooling the data of all subjects, decision thresholds of ≤11 and >20 lapses (≥355 ms) on the 3-min PVT led to the highest percentage of correct assignments to the three performance groups. These two thresholds were optimal for the whole group, which does not imply that they were also optimal for each individual. In fact, prediction of job performance could be improved if thresholds were determined for each subject individually. However, this would require each subject to run through an experimental protocol involving sleep deprivation and both FD-PVT and SLST testing, which does not seem a practical approach.
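The threshold rule and the search for the best threshold pair can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the exhaustive grid search, the function names, the `max_lapses` bound, and the example bouts are all assumptions, and only the two thresholds (≤11 and >20 lapses) come from the text.

```python
from itertools import combinations

def classify(lapses, lower=11, upper=20):
    """Map a 3-min PVT lapse count to a predicted SLST performance category.

    lapses <= lower -> "high", lapses > upper -> "low", otherwise "medium".
    """
    if lapses <= lower:
        return "high"
    if lapses > upper:
        return "low"
    return "medium"

def percent_correct(bouts, lower, upper):
    """bouts is a list of (lapse count, observed SLST category) pairs."""
    hits = sum(classify(lapses, lower, upper) == observed
               for lapses, observed in bouts)
    return 100.0 * hits / len(bouts)

def best_thresholds(bouts, max_lapses=40):
    """Try every pair lower < upper and keep the first best-scoring one."""
    candidates = combinations(range(max_lapses + 1), 2)
    return max(candidates, key=lambda pair: percent_correct(bouts, *pair))

# Made-up pooled bouts for illustration; not data from the study.
bouts = [
    (3, "high"), (8, "high"), (11, "high"),
    (14, "medium"), (18, "medium"), (20, "medium"),
    (24, "low"), (31, "low"),
]
lower, upper = best_thresholds(bouts)
print(lower, upper, percent_correct(bouts, lower, upper))
```

Running the same search on the bouts of a single subject would give individual thresholds, which is the impractical but potentially more accurate alternative mentioned above.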
The categorization of all bouts of all subjects into the three SLST performance categories based on ≥355 ms lapses on the 3-min PVT was shown in . Homeostatic and circadian influences of sleep deprivation were well replicated by the size of the three performance groups. The categorization was sensitive to sleep loss, i.e. more subjects showed up in the medium and low SLST performance groups after 16 hours of wakefulness, which agrees well with recent findings of a chronic sleep restriction experiment reported by Van Dongen et al. (17), who observed performance decrements only after wakefulness was chronically extended beyond 15.8 hours per day. At the same time, the categorization was specific in the sense that only a minority of subjects was assigned to the low SLST performance group during the first 15 hours of sleep deprivation (79.5% of the bouts were classified as high, 14.2% as medium, and only 6.3% as low SLST performance bouts).
Five subjects were classified at least once as low performers at or before 23:00. Overall, the average number of lapses was exceptionally high in these five subjects, ranking at 31, 32, 34, 35, and 36 relative to all 36 subjects. A prior sleep debt is one possible reason for the low PVT performance of these five subjects during the day, as one night with 8 hours of time in bed in an unfamiliar environment may not have been enough to recover from this sleep debt. Sleep time prior to the study was determined with diaries and actigraphy. It averaged 8.06 h across days and subjects. Subject #3 had the fifth shortest average TST with 7.04 h and slept only 5 h in the night before the study began. Subject #8 slept above average (his average TST was 8.32 h), but only 4.75 h on the night preceding the study. Subject #35 had the second shortest individual average TST with 6.82 h, but slept 8 h in the night before the study began. Low TST prior to the study could not explain the low PVT performance of subjects #16 and #30, but subject #16 reported muscular aches on 5 of 7 days, backache on 4 of 7 days, joint pain on 3 of 7 days, feeling too cold on 5 of 7 days, and tiredness on 2 of 7 days prior to the study, and subject #30 drank many caffeinated teas and colas before the study, which were not allowed during the study. As the PVT is not specific for fatigue, other factors such as alcohol, drugs, illness, and motivation could have contributed to the overall low PVT performance levels of these five subjects, although alcohol, drugs, and illness are unlikely to have played a role because of screening tests and the controlled environment of the laboratory. This was an intention-to-treat analysis, and therefore we did not exclude any of the five subjects with low levels of PVT performance during the day.
Comparisons of high, medium, and low SLST performance groups as categorized by ≥355 ms lapse frequency on the 3-min PVT showed that threat detection performance differed significantly between the three groups, confirming the ability of the FD-PVT to predict SLST performance. HR was 7.0% and 4.0% higher in the high and the medium performance groups, respectively, compared to the low performance group. Therefore, if subjects with >20 lapses on the 3-min PVT were allowed to continue screening, between 4 and 7 out of 100 threats would potentially be missed because of fatigue. It has to be borne in mind that these estimates of HR are conservative as they are confounded by time in study (there was a prominent effect of time in study on HR, and high, medium, and low performance groups differed significantly according to time in study). After controlling for hours awake (in addition to pilot study membership), the differences in HR of the high and medium performance groups relative to the low performance group increased from 7.0% to 8.0% and from 4.0% to 4.3%, respectively.
At the same time, differences between high, medium, and low performance groups in FAR became non-significant after controlling for hours awake. Controlling for hours awake decreased the differences between performance groups in A′ marginally, but they remained significant between all groups (all p<0.01). Response bias increased from high to medium to low performance groups (although non-significantly), meaning that the willingness to classify both safe bags and threat bags as threats increased with the number of lapses on the 3-min PVT. Controlling for hours awake amplified this trend, and low and high performance groups then differed significantly from each other (p=.030). Time used to complete a 200-bag set significantly decreased from high to medium to low performance groups. The difference between groups decreased after controlling for hours awake, but the low performance group still differed significantly from the medium (−0.7 min, p=.007) and the high (−0.9 min, p<.001) performance groups.
In summary, low ≥355 ms lapse frequency on the FD-PVT was associated with high threat detection performance, with high HRs, with conservative decision criteria (subjects less willing to classify both safe bags and threat bags as threats), and with long per bag response latencies. FD-PVT ≥355 ms lapse frequencies were not related to FAR, especially after controlling for hours awake.
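For readers unfamiliar with the outcome measures, the relation between HR, FAR, and A′ can be sketched as below. The sketch assumes the commonly used non-parametric sensitivity index A′; which exact formula was applied in this study is not restated here, so the function should be read as an illustration rather than the study's computation. The function name and the example rates are hypothetical.

```python
def a_prime(hit_rate, false_alarm_rate):
    """Non-parametric sensitivity index A' from hit and false alarm rates.

    0.5 corresponds to chance performance, 1.0 to perfect discrimination
    of threat bags from safe bags.
    """
    h, f = hit_rate, false_alarm_rate
    if h == f:
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    # Below-chance case (more false alarms than hits), mirrored formula.
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))

# Made-up rates for illustration; not values from the study.
print(a_prime(0.8, 0.2))
```

Because A′ combines HR and FAR, a drop in HR at an unchanged FAR lowers A′, which matches the pattern summarized above.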
Our subject group consisted of healthy, young to middle-aged, non-professional volunteers. It is therefore unclear how the results generalize to a group of professional luggage screeners who receive special training that is refreshed on a regular basis. Also, in contrast to our subjects, professional luggage screeners do not receive detection performance feedback at the end of their shift, which may restrict the generalizability of our findings. Furthermore, it is unclear how the effects of sleep loss in the laboratory can be generalized to the operational environment. In other professions that have been studied (e.g., pilots, physicians in training, professional truck drivers), laboratory findings did generalize, although the magnitude of the effects can be smaller in a well-trained professional cohort than in unselected laboratory subjects (26). Another reason our results may not readily generalize to the airport security environment is that the 25% threat prevalence we used was higher than that found in security operations, at least for guns and knives. If all prohibited items (e.g., bottles, pocket knives, nail files/clippers, lighters, etc.) are taken into account, 25% threat prevalence is, in fact, not unusual at airport checkpoints, and there are times when the combined rate of different classes of prohibited items can exceed 25%. To determine the extent to which our findings have ecological validity, studies would have to be conducted on trained professional screeners and in operational environments.
A simulated luggage screening task was shown to be sensitive to night work and sleep loss and to covary with the number of ≥355 ms lapses on the 3-min PVT. A method was proposed to identify optimal 3-min PVT lapse decision thresholds for dividing SLST bouts into high, medium, and low performance bouts. Group classification was both specific and sensitive, i.e. only a minority of subjects was classified as low performers during the first 15 hours of sleeplessness, while most of the subjects transitioned from high to medium and low performance groups after 17 or more hours of wakefulness. It was shown that assignment to the different performance groups replicated homeostatic and circadian patterns during total sleep deprivation and that threat detection performance A′ and HR decreased significantly from high to medium to low performance groups. In conclusion, the 3-min PVT was shown to be sensitive to sleep loss and to predict performance in a simulated luggage screening task, and should therefore be further validated as a fitness-for-duty test for luggage screeners. Future studies need to prove its feasibility and usefulness in professional screeners and operational environments.