The current study investigated the interrater reliability of qualitative clock drawing test ratings made by five dementia clinicians at Boston University Medical Center. The clinicians were reliable clock drawing test raters using both dichotomous (impaired versus intact) and ordinal (0–10 impairment scale) ratings. The dichotomous system yielded a kappa of 0.85, and the ordinal ratings yielded an intraclass correlation coefficient of 0.92. These values represent excellent interrater reliability and are comparable to those obtained in our recent work comparing several widely used quantitative clock drawing test scoring systems.27
The current findings demonstrate that in the absence of objective scoring methods, the clock drawing test can be rated reliably across a cognitive severity spectrum by clinicians who specialize in dementia.
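For readers who wish to see how a chance-corrected agreement statistic for several raters and a dichotomous rating can be obtained, the following is a minimal sketch. It assumes Fleiss' kappa, which is one common multi-rater generalization; the text above does not state which kappa variant was used in this study. The function name `fleiss_kappa` and the toy ratings are hypothetical and are not study data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per subject.

    ratings: list of per-subject lists, one label per rater.
    categories: list of all possible labels.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who put subject i in category j
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])
    # Observed agreement for each subject, then averaged over subjects
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Agreement expected by chance from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 clocks, 5 raters, 1 = impaired, 0 = intact
toy_ratings = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
print(round(fleiss_kappa(toy_ratings, [0, 1]), 2))
```

In this toy example, two clocks receive unanimous ratings and two receive a single dissenting rating, so the statistic falls below 1 even though most raters agree on every clock.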
Despite these excellent reliability values, there were several individual instances in which clinicians’ ratings were widely disparate. As seen in , ratings of nine clocks (6%) differed by six or more points on the ordinal scale after rerating eliminated errors. There are multiple factors that may explain why the clinicians applied disparate ratings, including spatial configuration, participant self-corrected errors, and shape of the clock face as exemplified in . The discrepancy among the raters highlights the difficulty that clinicians face when scoring clocks subjectively.
The present study also examined the accuracy of clinician-rated clock drawing tests in differentiating among cognitively normal, mild cognitive impairment, and Alzheimer’s disease diagnostic categories. Despite the substantial37 overall agreement between raters, the results demonstrate that the qualitative ratings were less robust at differentiating diagnostic group membership. Although Alzheimer’s disease patients and comparison subjects could be differentiated with a relatively high degree of accuracy, the ratings were considerably less useful for distinguishing mild cognitive impairment from comparison subjects (less sensitive) or Alzheimer’s disease from mild cognitive impairment (less specific). Therefore, while the clock drawing test may be a good screening instrument for Alzheimer’s disease, it may not be a sensitive instrument for screening mild cognitive impairment, especially if clinicians use a dichotomous rating. When screening for mild cognitive impairment, an abnormal clock drawing test in isolation (based on subjective clinician rating) may result in a large number of false positive or false negative errors. For the mild cognitive impairment diagnosis, sensitivity and specificity were somewhat improved by using a subjective ordinal rating scale with three or more cutoff points as compared to the dichotomous scale. We therefore suggest using a 3-point subjective ordinal clock drawing test rating scale such as “normal,” “suspicious,” and “impaired,” rather than the existing dichotomous system, to improve the mild cognitive impairment diagnosis.
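The trade-off between sensitivity and specificity at different cutoff points on an ordinal rating can be illustrated with a short sketch. It assumes that higher ratings on the 0–10 scale indicate greater impairment and that a rating at or above the cutoff counts as a positive screen; both are assumptions made for illustration. The function name `sensitivity_specificity` and the toy ratings are hypothetical and are not study data.

```python
def sensitivity_specificity(scores, has_condition, cutoff):
    """Return (sensitivity, specificity) for one cutoff.

    Assumption: a rating >= cutoff is treated as a positive screen.
    """
    pairs = list(zip(scores, has_condition))
    tp = sum(1 for s, d in pairs if d and s >= cutoff)      # cases flagged
    fn = sum(1 for s, d in pairs if d and s < cutoff)       # cases missed
    tn = sum(1 for s, d in pairs if not d and s < cutoff)   # non-cases cleared
    fp = sum(1 for s, d in pairs if not d and s >= cutoff)  # non-cases flagged
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: ordinal ratings and diagnostic status for 8 clocks
toy_scores = [0, 1, 2, 3, 4, 5, 6, 8]
toy_status = [False, False, False, True, False, True, True, True]

for cutoff in (3, 5):
    sens, spec = sensitivity_specificity(toy_scores, toy_status, cutoff)
    print(cutoff, sens, spec)
```

In the toy data, lowering the cutoff captures more true cases at the cost of more false positives, which is the kind of trade-off that additional cutoff points on an ordinal scale allow a clinician to tune.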
The clinicians who served as raters for the current study are specialists in diagnosing dementia and work in a tertiary care clinical setting and research center. Therefore, these clinicians may represent a more reliable and diagnostically accurate group than nonspecialists in the community. Their expertise in dementia assessment may limit the extent to which the findings can be generalized to other settings and clinicians. Another limitation is that some clinicians were also members of the consensus team that formulated the original diagnostic impressions for our participant cohort. This overlap raises the possibility that the clinicians were not completely blinded to diagnostic group membership for the clocks being reexamined, assuming that they remembered the clocks presented in prior consensus conference meetings. However, this overlap would have affected only the diagnostic utility statistics and not the interrater reliability, which was the primary focus of the current study. We excluded individuals with dementia other than Alzheimer’s disease, individuals with visual impairment, and non-English speakers, which may have increased the diagnostic utility statistics while limiting the generalizability of our study.
Although the clock drawing test has many advantages as a screening instrument in the assessment of patients with suspected dementia, it is often used qualitatively, or subjectively, in clinical settings. As such, the reliability of these qualitative ratings between clinicians has been open to question. This is the first study to investigate the concordance among clock drawing test ratings by dementia specialists. The results indicate that dementia specialists can reliably rate clock drawing test performance using two different qualitative rating approaches. In contrast, the findings do not support the use of the clock drawing test as a standalone screening instrument, as the classification accuracy statistics suggest that in mild cognitive impairment, the clinician ratings may be susceptible to both false positive and false negative errors. However, the clinicians’ ratings had excellent sensitivity and specificity for distinguishing healthy comparison subjects from those with probable and possible mild Alzheimer’s disease. Future studies should compare the reliability and diagnostic accuracy of qualitative methods to empirically validated quantitative scoring systems.