|Home | About | Journals | Submit | Contact Us | Français|
The authors aim to study subjective ratings of clock drawing test by clinicians and determine interrater reliability and diagnostic accuracy. The clock drawing test has been advocated over the Mini-Mental State Examination as an office screening test for dementia, but use of the clock drawing test by neurologists and dementia specialist clinicians has not been validated. The authors conducted a study of clock drawing test scoring by dementia specialists. The authors randomly assigned 25 clocks from each of six predetermined groups based on consensus diagnosis (cognitive comparison subjects, subjects with a memory complaint but with normal neuropsychological testing, subjects with probable and possible mild cognitive impairment, and subjects with possible and probable Alzheimer’s disease) to dementia specialists for blinded scoring using a binary yes/no impairment system and a 0–10 scale as subjectively determined by each individual clinician rater. The authors collapsed the six groups into three (comparison subjects, mild cognitive impairment patients, and Alzheimer’s disease patients) and analyzed interrater reliability, sensitivity, and specificity for consensus diagnosis of mild cognitive impairment, and Alzheimer’s disease. The authors found excellent interrater reliability, sensitivity, and specificity for predicting consensus diagnosis. The 0–10 clock drawing test rating scale was more predictive of consensus diagnosis than the binary impairment scale. Based on the five clinicians’ average dichotomous rating, the clinicians differentiated comparison and Alzheimer’s disease participants with a sensitivity of 0.75 and a specificity of 0.81. For three of the four comparisons, a cutoff score of two or greater resulted in the maximization of sensitivity and specificity for differentiating diagnostic groups. A cutoff score of four or greater maximized sensitivity (0.54) and specificity (0.74) for differentiating Alzheimer’s disease from mild cognitive impairment. Based on rating systems, clock drawing test scoring by dementia clinicians had excellent interrater reliability and sensitivity for differentiating the mild Alzheimer’s disease subjects from comparison subjects. When utilizing a binary rating scale for the clock drawing test in the absence of clinical information, dementia specialist clinicians at the Boston Medical Center were moderately sensitive and highly specific in separating mild cognitive impairment from healthy comparison subjects. These dementia clinicians were also highly sensitive and less specific in differentiating mild cognitive impairment from Alzheimer’s disease.
Alzheimer disease is a growing public health problem1,2 and its prevalence is increasing rapidly with the aging of the baby boomer generation.3 Early diagnosis and treatment can reduce the burden this increase poses to the health care system and society.4 However, Alzheimer’s disease is often under recognized in community clinical practice settings5–7 because the diagnosis can be difficult8 and may require specialized training. Without fast and reliable screening instruments, it may be difficult for primary care physicians to identify patients who should be referred for a more comprehensive dementia workup.
The clock drawing test is widely used as a screening test for dementia. Neurologists have used clock-drawing and time telling tests extensively.9 Several factors contribute to the test’s popularity, including administration and scoring ease and evaluation of multiple cognitive domains,10,11 such as executive functioning.11–13 Compared to the Mini-Mental State Examination (MMSE), the clock drawing test is thought to have less educational bias14 and is better able to detect cognitive decline due to Alzheimer’s disease and other dementias.15 The clock drawing test has also been advocated over the MMSE as an office screening test for dementia in community clinics4 and in acute hospital settings.16 Furthermore, the clock drawing test is suitable for non-English speaking populations.14
There are two general clock drawing test scoring approaches, including qualitative and quantitative, and varied scoring systems that emphasize different facets of the clock drawing process. Early quantitative scoring systems were validated to distinguish between subjects with moderate or severe Alzheimer’s disease and cognitive healthy comparison subjects and later adapted for use in mild cognitive impairment and mild Alzheimer’s disease.17–23 Previous studies of objective clock drawing test rating systems identified Alzheimer’s disease with overall diagnostic accuracy ranging from 59% to 85%.24 However, such diagnostic accuracy has not been found in mild cognitive impairment cohorts with sensitivities ranging from 17% to 92%.24 In a retrospective study comparing several clock drawing test scoring systems,24 the scoring system by Mendez19 has been found to be the most accurate in distinguishing demented from nondemented individuals, followed closely by the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) system.25
Though diagnostically useful, quantitative clock drawing test rating schemes18,19,20,22,23 are rarely used in clinical settings, as they take more time and require trained clinical personnel to score. Moreover, when using the clock drawing test to identify dementia, qualitative ratings of naive judges may be equal to or more accurate than many quantitative scoring systems.24 Though it is widely assumed that dementia specialists are more reliable and valid than naive raters, there are no known studies that have evaluated the psychometric properties of clock drawing test ratings made by trained clinicians who use the clock drawing test as part of their regular clinical practice. Our present study was performed to determine the interrater reliability, sensitivity, and specificity of qualitative clock drawing test ratings made by clinicians specializing in the assessment of patients with dementia. Two qualitative rating approaches were utilized: a dichotomous rating of impaired versus nonimpaired and a 0–10 ordinal rating scale. A multidisciplinary consensus conference was the gold standard for dementia diagnosis in the current study.
Archival data were extracted from the Boston University Alzheimer’s Disease Core Center registry, which is an institutional review board approved, National Institute on Aging-funded, Alzheimer’s disease registry,26–28 that longitudinally follows older adults with and without memory problems. Participants performed the clock drawing test as part of an annual neurological and neuropsychological exam. All participants were at least 55 years old, English-speaking community dwellers, with no history of major psychiatric or neurological illness or head injury involving loss of consciousness, and with adequate auditory and visual acuity to complete the examination. After data query, there were 506 eligible participants in the Boston University Alzheimer’s Disease Core Center patient/comparison registry who had been diagnosed by a multidisciplinary consensus team (including at least two board-certified neurologists and two neuropsychologists) based on a clinical interview with the participant and an informant, medical history review, and neurological and neuropsychological examination results. Of the 506 participants, 168 were diagnosed as cognitively normal comparison subjects, 39 as cognitively normal comparison subjects with cognitive complaints reported by self or study partner (worried comparison subjects), 88 as “probable” mild cognitive impairment patients,29,30 106 as “possible” mild cognitive impairment patients (no complaint of cognitive decline, but with objective impairment on one or more primary neuropsychological variables), 55 as probable Alzheimer’s disease patients, and 50 as possible Alzheimer’s disease patients.31 Participants diagnosed as cognitive comparison subjects (with or without complaints) were included if they had a Clinical Dementia Rating of 0,32 an MMSE score ≥26,33 and if they were not impaired on any primary neuropsychological test variable (i.e., no scores fell more than 1.5 standard deviations below normative means). Exclusion criteria included dominant hand hemiparesis or other central or peripheral motor impairments or visual acuity impairment that would preclude clock drawing test completion. The current study utilized the participants’ most recent registry visit data.
Trained psychometricians administered the clock drawing test in a standard way to include command (i.e., “I want you to draw the face of a clock, putting in all the numbers where they should go, and set the hands at 10 after 11”) and copy conditions.27 Participants were allowed to make corrections and make attempts to draw the clock a maximum of two times. Only the command condition data were used for the current study.
For the purpose of this study, 25 command clocks were randomly selected from each of the six diagnostic strata described above, resulting in the inclusion of 150 clocks from 150 different subjects with 50 clocks equally divided among each of the three primary diagnostic groups (i.e., comparison, mild cognitive impairment, Alzheimer’s disease). The clocks were rated independently by four board-certified neurologists and a neurology nurse practitioner, all of whom specialize in dementia. Raters were blinded to participant diagnostic and demographic information. Ratings were made on a binary (normal/abnormal) and an ordinal scale (0–10 rating, where 0 signified no impairment and 10 signified complete impairment). The ordinal ratings were examined for potential outliers, and when widely discordant interexaminer ratings (score differences >5; arbitrarily selected) were found, these clocks were rerated by the original raters. The rerated clock drawing test scores were used in the analyses.
All analyses were performed in SPSS 16 (Chicago) or SAS 9.1 (Cary, NC) software. A one-way analysis of variance (ANOVA) assessed for between-group differences in age, education, MMSE, and Geriatric Depression Scale scores. Significant findings were followed-up with Tukey-Kramer post-hoc tests to determine specific group differences. Sex and racial differences among the groups were analyzed using the chi-square test of independence.
Estimates of rater agreement were calculated for both the binary and ordinal rating scales. For the binary ratings, kappa statistics were computed.34–36 Because there were five raters, the multiple agreement function in SAS was used for kappa calculations for the binary ratings. For the ordinal ratings, Kendall’s intraclass correlation coefficient (ICC) of concordance was computed using the intraclass correlation functions in SAS, and Spearman rank correlations were generated between individual raters. Agreement for ordinal ratings were evaluated by calculating the absolute score differences between raters.
To examine the diagnostic utility (based on the three primary diagnostic groups) of the clinicians’ ratings, sensitivity, specificity, and positive likelihood ratios were calculated for both the dichotomous and ordinal ratings. For the dichotomous ratings, two methods were used to summarize the five raters’ data. First, we calculated individual sensitivity and specificity statistics for each rater and then created an average for all five raters. We also created a single summary rating of “impaired” versus “intact” based on whether the majority of the raters rated the clock as impaired or intact. For the ordinal ratings, an average was calculated to summarize the five clinicians’ ratings, then sensitivities and specificities were calculated for this average rating against the true diagnostic classification. The ordinal ratings were used to examine the cutoff score that optimized sensitivity and specificity.
Details of demographics and clinical characteristics are shown in Table 1. There were significant between-group differences for age, female gender, education, and MMSE. There were no significant between-group differences in the Geriatric Depression Scale score.
The clinicians’ interrater reliability was “almost perfect”34–37 for the ordinal (ICC=0.92) system and “substantial” 37 for the dichotomous system (k=0.85). The absolute difference scores between ordinal ratings are presented in Table 2. The five clinicians’ ratings did not differ by more than three ordinal scale units for 69% of the clocks. An absolute score difference of 5 or less was observed in 91% of the clocks, and only 5 clocks (3%) had an absolute difference of more than seven units on the ordinal scale. Examples of participant clocks that were rated similarly and dissimilarly are presented in Figure 1. Spearman rank correlations between each of the individual raters ranged from 0.64 to 0.82 for the ordinal scale (Table 3).
Based on the five clinicians’ average dichotomous rating, the clinicians differentiated comparison and Alzheimer’s disease participants with a sensitivity of 0.75 and a specificity of 0.81. In comparison, the dichotomous rating based solely on the majority of raters had a sensitivity of 0.84 and specificity of 0.84. In differentiating comparison subjects from mild cognitive impairment participants, the average sensitivity of the five raters’ dichotomous classifications was 0.47 and the average specificity was 0.81. Using a majority for the calculation of sensitivity and specificity resulted in values of 0.50 and 0.84, respectively (Table 4).
The sensitivities, specificities, and positive likelihood ratios for several cutoff scores on the ordinal scale are presented in Table 5. For three of the four comparisons (i.e., Alzheimer’s disease versus comparison, mild cognitive impairment versus comparison, and Alzheimer’s disease + mild cognitive impairment versus comparison), a cutoff score of two or greater resulted in the maximization of sensitivity and specificity for differentiating diagnostic groups. For differentiating Alzheimer’s disease from mild cognitive impairment, a cutoff score of four or greater was the rating that maximized sensitivity and specificity. As can be expected, higher ordinal ratings (i.e., more impaired clocks) were associated with a greater likelihood of being diagnosed with either Alzheimer’s disease or mild cognitive impairment but at the expense of sensitivity.
The current study sought to investigate the interrater reliability of qualitative clock drawing test ratings made by five dementia clinicians at Boston University Medical Center. The clinicians were reliable clock drawing test raters using both dichotomous (impaired versus intact) and ordinal (0–10 impairment scale) ratings. The interrater reliability for the dichotomous system achieved a kappa of 0.85 and the ordinal rating resulted in an intraclass correlation coefficient of 0.92. These statistics represent excellent interrater reliability values and are comparable to those obtained in our recent work comparing several widely used quantitative clock drawing test scoring systems.27 The current findings demonstrate that in the absence of objective scoring methods, the clock drawing test can be rated reliably across a cognitive severity spectrum by clinicians who specialize in dementia.
Despite these excellent reliability values, there were several individual instances in which clinicians’ ratings were widely disparate. As seen in Table 2, ratings of nine clocks (6%) differed by six or more points on the ordinal scale after rerating eliminated errors. There are multiple factors that may explain why the clinicians applied disparate ratings, including spatial configuration, participant self-corrected errors, and shape of the clock face as exemplified in Figure 1. The discrepancy among the raters highlights the difficulty that clinicians face when scoring clocks subjectively.
The present study also examined the accuracy of clinician-rated clock drawing test in differentiating among cognitively normal, mild cognitive impairment, and Alzheimer’s disease diagnostic categories. Despite the substantial37 overall agreement between raters, the results demonstrate that the accuracy with which qualitative ratings can differentiate diagnostic group membership was less robust. Although Alzheimer’s disease patients and comparison subjects could be differentiated with a relatively high degree of accuracy, the ratings were considerably less useful when making the distinction between a diagnosis of mild cognitive impairment and comparison (less sensitive) or Alzheimer’s disease and mild cognitive impairment (less specific). Therefore, while the clock drawing test may be a good screening instrument for Alzheimer’s disease, it may not be a sensitive instrument for screening mild cognitive impairment, especially if clinicians use a dichotomous rating. When screening for mild cognitive impairment, the presence of an abnormal clock drawing test in isolation (based on subjective clinician rating) may result in a large number of false positive or false negative errors. For the mild cognitive impairment diagnosis, the sensitivity and specificity was somewhat improved by using a subjective ordinal rating scale with three or more cutoff points as compared to the dichotomous scale. We therefore suggest using a 3-point subjective ordinal clock drawing test rating scale such as “normal,” “suspicious,” and “impaired” to improve the mild cognitive impairment diagnosis rather than the existing dichotomous system.
The clinicians who served as raters for the current study are specialists in diagnosing dementia, and work in a tertiary care clinical setting and research center. Therefore, these clinicians may represent a more reliable and diagnostically accurate group than nonspecialists in the community. Their expertise in dementia assessment may limit the extent to which the findings can be generalized to other settings and clinicians. Another limitation is that some clinicians were also members of the consensus team that formulated the original diagnostic impressions for our participant cohort. This overlap raises the possibility that the clinicians may not have been completely blinded to diagnostic group membership for the clocks being reexamined, assuming that the clinicians remembered the clocks that were presented in prior consensus conference meetings. However, this overlap would have only impacted the diagnostic utility statistics and not the interrater reliability, which was the primary focus of the current study. We excluded individuals with dementia other than Alzheimer’s disease, visual impairment and non-English speakers, which may have increased the diagnostic utility statistics while limiting the generalizability of our study.
Although the clock drawing test has many advantages as a screening instrument in the assessment of patients with suspected dementia, it is often used qualitatively, or subjectively, in clinical settings. As such, the reliability of these qualitative ratings between clinicians is brought into question. This is the first study to investigate the concordance among clock drawing test ratings by dementia specialists. The current study results indicate that dementia specialists can reliably rate clock drawing test performance using two different qualitative rating approaches. In contrast, the findings do not support the use of the clock drawing test as a standalone screening instrument, as the classification accuracy statistics presented suggest that in mild cognitive impairment, the clinician ratings may be susceptible to both false positive and false negative errors. However, the clinicians’ ratings had excellent sensitivity and specificity for distinguishing healthy comparison from probable and possible mild Alzheimer’s disease. Future studies should compare the reliability and diagnostic accuracy of qualitative methods to empirically validated quantitative scoring systems.
This research was supported by National Institute of Health grants P30-AG13846 (Boston University Alzheimer’s Disease Core Center), M01-RR00533 (Boston University General Clinical Research Center), K24-AG027841 (RCG), K23-AG030962 (Paul B. Beeson Career Development Award in Aging; ALJ), The authors thank Sabrina Poon, Melissa Barrup, Laura Byerly, Sita Yermashetti, Pallavi Joshi, Mario Orozco, Amanda Gentile and Kristen Huber for their assistance with data and all the psychometricians, nurse specialists and clinicians at the BU Alzheimer’s Disease Center and GCRC for administering the clock drawing task and providing clinical assessments. In particular, the authors thank the participants of the Boston University Alzheimer’s Disease Center cohort.
A portion of this work was presented in poster format at the American Academy of Neurology Annual Meeting, April 14–16 2008, in Chicago. None of the authors have any conflicts of interest to disclose. There are no commercial associations that might pose or create a conflict of interest with the information presented in the submitted manuscript.