Arch Clin Neuropsychol. Author manuscript; available in PMC 2009 September 25.
Published in final edited form as:
PMCID: PMC2752157

Clock Drawing Performance in Cognitively Normal Elderly


The Clock Drawing Test (CDT) is a common neuropsychological measure sensitive to cognitive changes and functional skills (e.g., driving test performance) among older adults. However, normative data have not been adequately developed. We report the distribution of CDT scores using three common scoring systems (Mendez, Ala, and Underwood, 1992; Freund, Gravenstein, Ferris, Burke, & Shaheen, 2005; and Cahn, Salmon, Monsch, Butters, Wiederholt, & Corey-Bloom, 1996), among 207 cognitively normal elderly. The systems were well correlated, took little time to use, and had high inter-rater reliability. We found statistically significant differences in CDT scores based on age and WRAT-3 Reading score, a marker of education quality. We present means, standard deviations, and t- and z-scores based on these subgroups. We found that “normal” CDT performance includes a wider distribution of scores than previously reported. Our results may serve as useful comparisons for clinicians wishing to know whether their patients perform in the general range of cognitively normal elderly.

Keywords: Clock Drawing Test, clock drawing, normal aging, normative data, scoring systems, neuropsychological tests, elderly, Alzheimer’s disease, mild cognitive impairment, dementia, cognitive decline, cognitive screening

The Clock Drawing Test (CDT) has been extolled as an inexpensive, fast, “non-threatening” (Shulman, Shedletsky, & Silver, 1986), and easily administered measure of cognitive function, especially in the elderly (e.g., Brodaty & Moore, 1997; Cahn, Salmon, Monsch, Butters, Wiederholt, & Corey-Bloom, 1996; Freedman, Leach, Kaplan, Winocur, Shulman, Delis, et al., 1994). A multifaceted and multidimensional measure, the CDT is thought to test visuoconstructive and visuospatial skills, symbolic and graphomotor representation, auditory language skills, hemiattention, semantic memory, conceptual abilities, and executive function including organization, planning, and parallel processing (Freedman et al., 1994; Libon, Malamut, Swenson, Prouty Sands, & Cloud, 1996; Mendez, Ala, & Underwood, 1992; Rouleau, Salmon, Butters, Kennedy, & McGuire, 1992; Shulman, 2000; Spreen & Strauss, 1998). Deficits in these areas reflect possible frontal and temporoparietal disturbances that are often exhibited in AD (Freedman et al., 1994; Samton, Ferrando, Sanelli, Karimi, Raiteri, & Barnhill, 2005; Spreen & Strauss, 1998), and that may not easily be detected by commonly-used cognitive screening tests such as the Mini-Mental State Exam (MMSE) (Brodaty & Moore, 1997; Folstein, Folstein, & McHugh, 2001). Studies have reported associations between the CDT and other cognitive measures, including those that measure semantic memory (Libon et al., 1996), the Rey-Osterrieth Complex Figure (Osterrieth, 1944; Rey, 1941), the Symbol Digit Modalities Test (Mendez et al., 1992; Smith, 1973), and the MMSE (Brodaty & Moore, 1997; Folstein et al., 2001; Mendez et al., 1992; Shulman, 2000).
An advantage of the CDT over many other cognitive measures is its lack of reliance on verbal abilities (Spreen & Strauss, 1998; Sunderland, Hill, Mellow, Lawlor, Gundersheimer, Newhouse, et al., 1989), making it a useful screening tool for dementia in non-English speaking populations (e.g., Cacho, García-García, Arcaya, Gay, Guerrero-Peral, Gómez-Sánchez, et al., 1996; Lam, Chiu, Ng, Chan, Chan, Li, et al., 1998) and in patients with aphasia or other loss of verbal expression. Furthermore, the measure’s good test-retest reliability (Mendez et al., 1992; Spreen & Strauss, 1998; Watson, Arfken, & Birge, 1993) and high intra- and inter-rater reliabilities across clinicians and non-clinicians have led to widespread use of the CDT in neuropsychological screening batteries (Kozora & Munro, 1994; Mendez et al., 1992; Rouleau et al., 1992; Spreen & Strauss, 1998; Sunderland et al., 1989; Tuokko, Hadjistavropoulous, Rae, & O’Rourke, 2000).

Perhaps the most common method of interpreting CDT performance is “clinical judgment.” That is, many users of the CDT, including neurologists, psychiatrists, geriatricians, and neuropsychologists, use it to provide a “quick ‘cognitive scan’ and to demonstrate a patient’s difficulties to family members” (Fischer & Loring, 2004, p. 553), presumably making a clinical judgment without using a standardized scoring approach. There are obvious limitations to this approach, including poor inter-clinician agreement and the fact that most clinicians have not administered the CDT to enough healthy controls to create their own internal norms. Moreover, without some standardized scoring method, the utility of the CDT in research would be significantly limited. Therefore, over the past two decades, more than a dozen scoring systems for the CDT have been developed (Freedman et al., 1994; Freund, Gravenstein, Ferris, Burke, & Shaheen, 2005; Libon et al., 1996; Manos, 1999; Mendez et al., 1992; Roth, Tym, Mountjoy, Huppert, Hendrie, Verma, et al., 1986; Rouleau et al., 1992; Royall, Cordes, & Polk, 1998; Samton et al., 2005; Shulman et al., 1986; Sunderland et al., 1989; Tuokko, Hadjistavropoulous, Miller, & Beattie, 1992; Watson et al., 1993; Wolf-Klein, Silverstone, Levy, & Brod, 1989).

Some authors have emphasized the need for brief, quantitative scoring approaches that clinicians can easily use and interpret (cf. Juby, Tench, & Baker, 2002; Samton et al., 2005; Shulman, 2000; van der Burg, Bouwen, Stessens, Ylieff, Fontaine, de Lepeleire, et al., 2004; Wolf-Klein et al., 1989). However, the conversion of a multifaceted, multidimensional test into a single numeric score tends to sacrifice both the sensitivity and specificity of the instrument, both of which must be high for the test to function as an adequate screening tool (Fischer & Loring, 2004; Greenhalgh, 1997). For example, a scale that measures only “critical” aspects of the CDT may not detect subtle differences between healthy subjects and those with mild forms of cognitive impairment (Schramm, Berger, Muller, Kratzch, Peters, & Frolich, 2002; Seigerschmidt, Mosch, Siemen, Forstl, & Bickel, 2002; Tuokko et al., 1992). A single-score quantitative approach may also compromise the test’s ability to effectively distinguish between different error types, which might be critical in discriminating between Alzheimer’s disease and other types of cognitive dysfunction (e.g., Huntington’s disease (Rouleau et al., 1992), frontotemporal dementia (Blair, Kertesz, McMonagle, Davidson, & Bodi, 2006), or ischemic vascular dementia (Libon et al., 1996)). Original sensitivity and specificity values are often determined by comparing very healthy subjects to those with severe impairment, a practice that is not reflective of clinical realities and may overestimate the CDT’s utility as a screening instrument (Lee, Swanwick, Coen, & Lawlor, 1996). To increase the likelihood of detecting errors in clock drawing, other researchers have promoted the use of scoring approaches with multiple scales that rate quantitative and qualitative features of the production (Libon et al., 1996). However, these scoring systems are potentially more time-consuming for the clinician.

A major limitation of the existing quantitative and qualitative CDT scoring approaches is that few, if any, normative data are available. In particular, there is a dearth of data describing cognitively normal older adults’ CDT performance across commonly employed scoring systems. Moreover, normative data from a well-characterized population that account for possible differences in sex, education, and race across meaningful age intervals are lacking.

Regardless of the continued lack of a standardized, simple, reliable, valid, and commonly-accepted scoring system for clock drawings (Mendez et al., 1992; Storey, Rowland, Basic, & Conforti, 2001), many researchers still encourage its cautious use and interpretation as a supplement to other neuropsychological measures of cognitive decline (e.g., Shulman, 2000; Sunderland et al., 1989). It has gained widespread clinical use (Freedman et al., 1994), and despite ample recommendations to the contrary, has still been recommended as a primary (and single) screening measure of a critical decline in functioning. For example, the CDT is recommended as a major component in the assessment of driving safety among older adults by the National Highway Traffic Safety Administration (NHTSA) in conjunction with the American Medical Association (Wang, Kosinski, Schwartzberg, & Shanklin, 2003).

The purpose of this study was to collect data on commonly used qualitative and quantitative CDT scoring methods in a well-characterized population of cognitively healthy older adults. Our goals were to: (1) examine pertinent demographic variables in relation to CDT performance; (2) provide normative data for these scoring systems, and (3) provide a better understanding of the psychometric properties of the various scoring systems evaluated, including their inter-rater reliability, their interrelation to one another, and their clinical utility. Clinicians may use these data to determine whether their patients’ CDT performances fall within the normal range of scores for cognitively healthy elderly individuals with similar demographic characteristics. They may be further guided in their selection of a particular CDT scoring system that most adequately suits their clinical or research needs.

Methods


The study population consisted of volunteers enrolled in the patient control registry for the Boston University Alzheimer’s Disease Core Center (BU ADCC). The data collection for this study was approved by the local institutional review board and all participants provided written informed consent. As previously described (Jefferson, Wong, Bolen, Ozonoff, Green, & Stern, 2006; Jefferson, Wong, Gracer, Green, & Stern, 2007), all participants are at least 55 years old and are evaluated annually with physical and neurological examinations, informant interviews, and neuropsychological tests. All diagnoses are made at multi-disciplinary consensus meetings. Although the CDT is regularly administered as part of the neurological examination, CDT scores are not included in the consensus diagnosis of objective cognitive impairment; that is, a diagnosis of mild cognitive impairment (MCI), dementia, or other cognitive disorder is not made solely on the basis of CDT performance. In addition, all members of the consensus diagnostic conference are blinded to CDT scores.

As part of the registry’s neuropsychological evaluation, all participants are administered the Wide-Range Achievement Test-3 (WRAT-3) Reading subtest (Wilkinson, 1993) as a measure of estimated pre-morbid intelligence. Additional measures include the MMSE (Folstein et al., 2001) as a screen of cognitive impairment, and the Geriatric Depression Scale (GDS) (Yesavage, Brink, Rose, Lum, Huang, Adey, et al., 1983) as a self-report measure of depression. In addition, all participants complete a Clinical Dementia Rating (CDR) interview (Hughes, Berg, Danziger, Coben, & Martin, 1982; Morris, 1993).

For the current study, only participants who received a consensus diagnosis of non-case were included. This diagnosis includes participants who have completely normal performance (i.e., within 1.5 standard deviations of normative data) on neuropsychological tests (without consideration of the CDT) (N = 207). This group of non-case participants was further divided into “Controls” (CTL), those without any self- or informant-report of cognitive, behavioral, or functional complaint (n = 168), and “Controls – Complaint” (CTL-C), those with self- and/or informant complaints (n = 39). All participants for the current study had a global CDR = 0, an MMSE score ≥ 26, and a GDS score < 20. The distributions of demographic variables and performance on the MMSE, WRAT-3 Reading test, and GDS in the normative sample are shown in Table 1.

Table 1
Descriptive Statistics, All Subjects (N = 207).


The CDT is part of the ADCC registry annual visit. Trained examiners administer the test in the following standardized fashion: a sheet of 8 ½” × 11” white paper is folded in half such that a pre-drawn clock (set to 10 after 11) on the left side of the paper is concealed. For the Command condition, the examiner states, “I want you to draw the face of a clock, put in all the numbers where they should go, and set the hands at ten after eleven.” Once the subject has completed the Command condition, the examiner turns the paper over, exposing the pre-drawn clock on the other side for the Copy condition, and states, “I want you to copy this clock exactly as you see it. Try to make your drawing look exactly as this drawing.” The subject is allowed to make erasures, and may attempt to draw each clock two times.

Selection of Clock Drawing Scoring Systems

We selected scoring criteria from three different scoring systems: those developed by Mendez et al. (1992); Freund et al. (2005); and Cahn et al. (1996), which includes their adaptation of the quantitative score approach reported by Rouleau et al. (1992). We will refer to these three systems as Mendez, Freund, and Cahn, respectively. These scales were selected because they provide a combination of qualitative and quantitative scales for evaluation of the CDT, as well as for other specific reasons enumerated below.

The Mendez system was included because it has one of the largest Receiver Operating Characteristic (ROC) (Lusted, 1971) curve areas compared to other CDT scoring systems, as well as excellent inter- and intra-rater reliability (Storey et al., 2001). It is also one of the most detailed quantitative scoring systems available. The Mendez scale awards a maximum of 20 points based on the presence of various clock features (mostly related to the correct quantity and positioning of numbers and hands) and the absence of intrusive marks. The authors report that normal elderly subjects (n = 26) do not miss more than 2 points, versus subjects with Alzheimer’s disease (n = 46) who miss at least 3 points. Furthermore, this scoring system takes “less than a minute” to complete (Mendez et al., 1992).

The Freund system was included because of its recommended use by the NHTSA, in conjunction with the American Medical Association, as a primary component in the assessment of driving safety among older adults (Wang et al., 2003). It should be noted that a newer version of the Freund scale was used for the present study than the one recommended in the NHTSA manual. The Freund scale assigns up to 7 points based on 3 categories: time (3 points), numbers (2 points), and spacing (2 points), with a score of 7 being perfect. Among older community members referred for driving evaluations, loss of more than two points on the scale was a predictor of unsafe driving based on a STISIM Drive Simulator test (Freund et al., 2005).

The Cahn system comprises multiple parts: a quantitative scale, a qualitative scale, and a global score. The 10-point Quantitative scale was first presented by Rouleau et al. (1992) and is a revision of the scoring originally presented by Sunderland et al. (1989). A high score on the Quantitative scale indicates good performance. Cahn et al. (1996) added a Qualitative scale, which assigns one point to each of eight possible errors. Thus, a high Qualitative score is indicative of poor performance. Finally, the Cahn Global score is calculated by subtracting the qualitative score (number of errors) from the quantitative score, for a possible range of −8 (worst) to 10 (best). A global score ≤ 6 (a combination of a Quantitative score ≤ 7 and a Qualitative score ≥ 1) is highly sensitive and specific in differentiating normal subjects from those with dementia of the Alzheimer type, and was better than either of the individual scores alone (Cahn et al., 1996).
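The arithmetic of the Cahn Global score can be illustrated with a short sketch. The function names below are hypothetical and are not part of any published scoring software. The 0 to 10 range for the Quantitative scale is an assumption inferred from the stated Global range of −8 to 10. Only the subtraction rule and the ≤ 6 cutoff come from the description above.

```python
def cahn_global(quantitative: int, qualitative_errors: int) -> int:
    """Return the Cahn Global score: Quantitative score minus number of qualitative errors.

    Assumed ranges (inferred from the stated Global range of -8 to 10):
    quantitative 0..10, qualitative_errors 0..8.
    """
    if not 0 <= quantitative <= 10:
        raise ValueError("quantitative score must be between 0 and 10")
    if not 0 <= qualitative_errors <= 8:
        raise ValueError("qualitative error count must be between 0 and 8")
    return quantitative - qualitative_errors


def at_or_below_cutoff(global_score: int, cutoff: int = 6) -> bool:
    """True when the Global score falls at or below the cutoff reported by Cahn et al. (1996)."""
    return global_score <= cutoff


# Example: a Quantitative score of 7 with one qualitative error.
score = cahn_global(quantitative=7, qualitative_errors=1)
print(score, at_or_below_cutoff(score))
```

A flag from such a check is only a screening signal under the cited cutoff; it is not a diagnosis.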

Many additional scoring systems were not evaluated in this study. Some systems have less favorable sensitivity and specificity in cognitive screening than those selected for this investigation (cf. Brodaty & Moore, 1997; Storey et al., 2001; Yamamoto, Mogi, Umegaki, Suzuki, Ando, Shimokata, et al., 2004). Also, some authors base their scoring criteria on the use of a pre-drawn clock circle (Manos, 1999; Shulman et al., 1986; Watson et al., 1993; Wolf-Klein et al., 1989) and, in some cases, the administration does not require subjects to draw the hands on the clock (e.g., Watson et al., 1993; Wolf-Klein et al., 1989).

Clocks were also evaluated as to whether the “center” (i.e., a central point or intersection of the clock hands) was present and whether this center was displaced more than a specified number of millimeters horizontally or vertically from the measured center of the clock face. These types of errors have been found to be uncommon in cognitively normal elderly (Freedman et al., 1994). These errors were reported as a 3-point quantitative scale modified from Freedman et al. (1994).

Clock Scoring

In order to assure that raters scored clocks from healthy subjects with minimal bias, retrospective scoring was completed for a total of 579 clocks from the entire sample of ADCC participants. This consisted of 216 non-cases, 206 subjects diagnosed with MCI, 115 with possible or probable Alzheimer’s disease, and 42 with “ambiguous” or other diagnoses. CDT productions from the subjects’ most recent ADCC registry visits were utilized.

Scoring was completed by a group of five research assistants blinded to the participants’ diagnoses. One rater scored each clock using all of the scoring systems. All scoring system training was conducted by the ADCC’s senior neuropsychologist (RAS) for each rater. The raters found it necessary to further specify certain qualitative aspects of the systems to enhance inter-rater agreement. The compiled scoring systems used for this study, including supplemental modifications, are included in Appendix 1.

Inter-rater reliability was assessed by comparing CDT scores from two raters who each scored the same 50 clocks from subjects sampled randomly from all diagnostic groups (the controls used for the normative analysis presented here, as well as subjects with other diagnoses, such as MCI or probable Alzheimer’s disease) to assure adequate variability in scores. For every clock scored, each rater used all scoring systems arranged in random order.
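As an illustration of how agreement between two raters on the same set of clocks might be quantified, the sketch below computes an intraclass correlation coefficient. The text does not state which ICC form was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater, following Shrout and Fleiss) is an assumption, as are the function name and the example ratings.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table of ratings: one row per clock, one column per rater.

    Assumption: two-way random effects, absolute agreement, single rater.
    """
    n = len(ratings)      # number of clocks (targets)
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical Mendez totals from two raters on five clocks (not study data).
example = [[20, 19], [17, 17], [18, 19], [12, 13], [20, 20]]
print(round(icc_2_1(example), 3))
```

Other ICC forms (for example, consistency rather than absolute agreement) would give different values for the same ratings, so the form should be matched to the study design.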

In order to establish whether repeated use of the scales increased the speed of scoring (i.e., if there was a learning curve), we compared the mean time required to score both the Command and Copy conditions of the first and last 50 clocks scored by a single rater, using each of the scoring systems arranged in random order. As with the 50 clocks used in the inter-rater reliability analysis, these 100 clocks were sampled from a range of all ADCC subjects.

Statistical Analysis

We used a significance level of 0.05 for all analyses unless otherwise indicated. All cognitively normal subjects (N = 207) who met inclusion criteria were incorporated in the majority of analyses. We excluded two participants from analyses involving race: one who self-identified as Asian and one who self-identified as “other,” resulting in a sample size of 205 (n = 31 “Black/African American,” hereafter referred to as African American, and n = 174 “white”).

Because the Command condition is more commonly administered (that is, some clinicians give only the Command condition and not the Copy condition), we created norms based on the Command condition and used the resulting demographic sub-groupings to present Copy condition norms as well. Here we present descriptive statistics and other general information for both the Command and Copy conditions.

Correlations among CDT scores

We performed Pearson and Spearman correlation analyses to examine the association between the different CDT scoring systems. To avoid Type I errors due to multiple correlations, we set a conservative alpha level of 0.01.
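A minimal sketch of the two correlation coefficients follows, using only the Python standard library. The function names and the example scores are hypothetical. Spearman's rho is computed here as the Pearson correlation of average ranks, which handles the tied scores that bounded scales such as these commonly produce.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical Freund and Mendez Command totals for six clocks (not study data).
freund = [7, 6, 7, 4, 5, 7]
mendez = [20, 18, 19, 14, 17, 20]
print(round(pearson_r(freund, mendez), 3), round(spearman_rho(freund, mendez), 3))
```

Significance testing against the 0.01 alpha level is omitted here; a statistics package would normally supply the p-values.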

Differences in the Normative Sample and CDT Scores

In order to identify whether differences in cognitive complaint status, race, or sex might be related to differences in CDT scores, we compared performance on the CDT and other measures across these subgroups using independent samples t-tests and Pearson chi-square tests. Mann-Whitney U tests were used to evaluate differences on the Center scoring system. We evaluated differences in CDT scores according to education by dividing the sample into five education levels: less than a high school degree, high school graduate, some college, college graduate, and post-college education. Differences in CDT scores by education level were determined using one-way ANOVAs and Tukey’s HSD post-hoc test.

Impact of Demographic Variables on CDT Performance

To determine meaningful normative subgroups, we evaluated the degree to which demographic variables accounted for variance in CDT scores. We performed a stepwise multiple linear regression for each CDT Command condition total score. Variables included in the models were: age, years of education, WRAT-3 Reading raw score, MMSE, sex, and race.

For each model, plots of the residuals against the predicted (fitted) values suggested that the assumption of constant variability for residuals underlying regression modeling was not consistently met. Additionally, the Cahn models violated the regression assumption of equality of error variance. We therefore report the results of the regression analyses with the caveat that they should not be assumed to be a complete picture of the relations between CDT scores and other variables. They were, however, useful for guiding our further evaluation of these interactions as we determined normative subgroups.

Development of Norms

The next step in establishing normative data was to determine how and where to subdivide the different variables contributing to differences in CDT performance. We used the following iterative process, beginning with the variable that accounted for the greatest proportion of variance in CDT scores in the linear regression models. The sample was divided into potentially clinically meaningful subgroups. One-way ANOVAs and Tukey’s post-hoc comparisons were performed to determine which subgroups differed significantly from each other. This was repeated using numerous different subgroup combinations until the most parsimonious division was determined – that is, the fewest possible subgroups that showed the most statistically significant differences in CDT scores, and with a statistically sufficient number of subjects per subgroup. Once this was determined, we performed independent-samples t-tests and chi-square tests to evaluate whether other variables (e.g., education, race, WRAT-3 Reading scores) differed significantly between the subgroups. To determine whether other variables attenuated the subgroup differences in CDT scores, we entered potential covariates into type III sum of squares models of between-subjects effects.

Once this process had been completed for the first variable, we repeated the process using other variables that had shown an association with CDT scores. After the subgroupings for two variables were determined, we performed 2×2 ANOVAs to see whether all subgroups remained significant and whether there were interactions between them. We continued this process until all meaningful groupings had been determined, and used the resulting subgroups to present normative data.

Center qualitative features

Analyses of the Center score were conducted separately as a categorical, qualitative measure because the three-point Center scale was limited in range. We performed nonparametric correlations (Spearman’s rho) to evaluate the association between the qualitative features of the clock center and the variables that had been significant in the linear regression models, using a conservative alpha level of 0.01 due to multiple correlations. Once the normative subgroups had been determined, we evaluated differences in quantitative or qualitative center scores across the subgroups using Mann-Whitney U tests.

Results

Inter-rater reliability

The intraclass correlation coefficients (ICCs) for each CDT system are presented in Table 2. For every scoring system except the Cahn Qualitative scale, ICCs were in the “almost perfect” range for inter-rater reliability (Landis & Koch, 1977) for both Command and Copy conditions. The Cahn Qualitative scale was the only scale that did not show almost perfect correspondence between raters: its Command condition ICC would be considered “substantial” and its Copy condition only “moderate” (Landis & Koch, 1977).
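For readers unfamiliar with the descriptive labels, the sketch below maps an agreement coefficient to the bands proposed by Landis and Koch (1977). Those bands were originally proposed for the kappa statistic; applying them to ICCs follows the usage in this section. The function name is hypothetical.

```python
def landis_koch_label(coefficient: float) -> str:
    """Return the Landis and Koch (1977) descriptive label for an agreement coefficient."""
    if coefficient < 0.0:
        return "poor"
    # Upper bound of each band, in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if coefficient <= upper:
            return label
    raise ValueError("coefficient must not exceed 1.0")


# Hypothetical coefficients, one per band mentioned in the text.
for value in (0.92, 0.74, 0.55):
    print(value, landis_koch_label(value))
```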

Table 2
Summary of Intraclass Correlation Analysis, CDT Command and Copy conditions.

Time to score

Mean times to score using each system are displayed in Table 3. The “Cahn Scale” includes the time to complete the Quantitative and Qualitative scales and to calculate the Global score. As seen in Table 3, the Freund scale scoring took the least amount of time, while the Mendez and Cahn scales took about twice as long as Freund.

Table 3
Mean time to score per system and comparison of change over time for clocks scored by one rater.

Time to implement decreased significantly with practice for all four systems (see Table 3). The Mendez and Cahn scales showed the largest decrease in scoring time (approximately 30 seconds, or 1/3 of the total original scoring time, were reduced with practice). The Freund scale remained the least time-consuming after practice.

Clock Scores – Normative Subjects

Descriptive statistics for all CDT total scores across the entire normative sample are shown in Table 4.

Table 4
CDT Total Scores for the Entire Normative Sample, Command and Copy conditions (N = 207).

Correlations among CDT scores

The results of the Pearson correlations did not differ significantly from those obtained using the Spearman correlations; therefore, we report the results of the more robust parametric (Pearson) correlations in Table 5. All CDT Command condition scores were significantly correlated with each other. The strongest correlations between systems were for the Freund and Mendez scales, which accounted for 57.6% of each other’s variance.
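The shared-variance figure can be converted back to a correlation coefficient, since the proportion of variance two measures share is the square of their Pearson correlation. As a worked step (the rounded value is derived here and is not quoted from Table 5):

```latex
r^2 = 0.576 \quad\Longrightarrow\quad |r| = \sqrt{0.576} \approx 0.76
```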

Table 5
Correlations among CDT Command condition total scores.

Differences in the Normative Sample and CDT Scores

Cognitive complaints

CTL subjects were younger, on average, than CTL-Cs (CTL: 70.6 years, s.d. 8.0, range 55–89; CTL-C: 74.3 years, s.d. 9.0, range 56–98; t(205) = −2.53, p = 0.012; equal variances assumed, Levene’s Test for Equality of Variance, F = 0.183, p = 0.669). The CTLs also had a lower mean GDS score (CTL: 2.7, s.d. 4.7 vs. CTL-C: 5.8, s.d. 3.1; t(46) = −3.85, p < 0.001; equal variances not assumed, Levene’s Test for Equality of Variance, F = 20.60, p < 0.001). The two subgroups did not differ significantly on mean MMSE, WRAT-3 Reading raw or adjusted scores, or on any of the CDT Command or Copy condition scores. Therefore, we considered all 207 “non-cases” as one group for the purpose of establishing norms, regardless of the existence of a self or informant complaint.

Race

In our sample, African Americans were younger than whites, had fewer years of education, and had lower WRAT-3 Reading raw and adjusted scores (Table 6). MMSE and GDS scores did not differ between groups. There were significant differences in CDT Command condition total scores between whites and African Americans for the Cahn Qualitative and Cahn Global scores (Table 6). The two groups also differed on the Copy condition for the Mendez and Cahn Qualitative scores (Table 6). We therefore examined possible racial differences in our subsequent, more detailed analysis of CDT scores.

Table 6
Racial Group Differences.

Sex

The only significant between-group difference by sex was mean education level (17.5 years, s.d. 2.5 for males; 16.1 years, s.d. 2.6 for females; t(205) = 3.62, p < 0.001). There were no differences between males and females on CDT Command condition total scores. In an independent samples t-test, mean Cahn Quantitative Copy condition scores differed significantly between men (mean 8.5, s.d. 1.1) and women (mean 8.8, s.d. 0.6), t(101) = −2.61, p = 0.041.

Education

CDT Command condition scores differed significantly by education level only for the Cahn Global score (F(3, 203) = 2.98, p = 0.033). Post-hoc tests showed that the significant differences in Cahn Global scores were between the least-educated group (high school or less) and both the college graduates and those with post-college education.

Impact of Demographic Variables on CDT Performance: Multiple Linear Regression

No multiple linear regression model was able to account for more than about 10 percent of the variance of any CDT scoring system. For the Freund and Mendez systems, age and WRAT-3 Reading raw scores were the only significant predictors of Command condition total scores. For the Cahn system, age was a significant predictor of scores on all three of the subscales. The MMSE predicted some of the variance in the Quantitative scale, and race was a significant predictor of the Qualitative scale; however, age and WRAT-3 Reading were again the only significant predictors of the Cahn Global score.

WRAT-3 Reading raw score accounted for the apparent influences of MMSE, education, and race in the majority of the multiple regression analyses. A stepwise linear regression using WRAT-3 Reading raw score as the dependent variable showed that race, years of education, and age were the most significant predictors of WRAT-3 Reading, together accounting for 26.4% of its variance. Race alone accounted for 17.7% of the variance in WRAT-3 Reading scores. Therefore, in our subsequent determination of normative data based on our sample, we considered the influences of age and WRAT-3 Reading raw scores, but also separately analyzed the effects of age, education, and race since WRAT-3 Reading data are often unavailable clinically.

Development of Norms

Age

We began by determining meaningful subgroups for age because it had been the only universal predictor of CDT scores in the linear regression models. The best age group division was a split at age 75 (< 75 vs. ≥ 75 years) for all three scoring systems (Table 7).

Table 7
Comparisons between age groups split at 75 years.

When we compared the two age groups split at 75 years, we found that MMSE was significantly different between the older and younger participants. No significant differences were found between age groups for cognitive complaint (CTL vs. CTL-C), WRAT-3 Reading raw score, race (white vs. African American), or sex (Table 7).

Because of the between-group age differences in MMSE and the association between WRAT-3 Reading and CDT performance, we adjusted for WRAT-3 Reading raw scores and MMSE to determine whether these variables had any impact on the age group differences in CDT scores. Age remained statistically significant when either WRAT-3 or MMSE was covaried. WRAT-3 Reading raw scores were significant covariates in the type III sum of squares models of between-subjects effects for all CDT scores, with the exception of the Freund scale.

Education

We dichotomized participants into those with less than a college degree (“low education”) and those with at least a college degree (“high education”). Significant differences emerged between the low and high education groups on the Cahn Global (t(205) = −2.48, p = 0.014) and Quantitative (t(205) = −2.22, p = 0.027) Command conditions (Table 8). Because this categorization (low vs. high education) showed larger and more robust differences between CDT scores than the original five education levels, we used the dichotomization for the remaining education analyses.

Table 8
Differences between subjects with low and high levels of education.

Differences emerged between the two education groups for sex, race (white vs. African American), MMSE and WRAT-3 Reading raw and adjusted scores (Table 8). The education differences between the Cahn Quantitative and Global scales remained significant when race and sex were included as covariates. When WRAT-3 Reading raw score was covaried, however, neither Cahn Quantitative nor Cahn Global scores remained significantly different between the two education groups.

In a 2×2 ANOVA including both age (split at 75 years) and education (high vs. low) as independent variables and comparing CDT scores between groups, education was only a significant factor in the models for the Cahn Quantitative (F(1, 203) = 4.27, p = 0.040) and Cahn Global (F(1, 203) = 4.49, p = 0.035) Command condition scores. The interaction between the age group split at 75 years and the high/low education levels was not significant for any of the CDT Command condition scores. Again, when WRAT-3 Reading raw score was included as a covariate, education was no longer significant in the between-subjects analysis for any of the CDT scores, while age at visit remained significant for all of them. This result was consistent when MMSE was used as a covariate instead of WRAT-3 raw score.

Without data on a subject’s WRAT-3 scores, educational differences may appear to influence performance on components of the Cahn scoring system. We therefore provide means and standard deviations for the Cahn scores based on education and separately based on WRAT-3 Reading raw scores.

Sex


To determine if dichotomizing the sample at age 75 years would lead to interactions between age and sex on the CDT scores, we performed a 2×2 ANOVA using both age split at 75 years and sex as independent variables. No sex differences were detected for any of the clock scores, nor was the interaction between sex and age split significant for any CDT score.

Race


When we analyzed the overall effect of race on CDT Command condition scores in an ANCOVA, and adjusted for WRAT-3 Reading raw score, race did not have a significant effect on any of the CDT scores. When years of education or high/low education level were used as a covariate instead of WRAT-3 Reading raw score, the Cahn Qualitative score remained significantly different between whites and African Americans (covarying years of education: F(1, 202) = 5.36, p = 0.022; covarying high/low education level: F(1, 202) = 5.57, p = 0.019).

Because we planned to report normative scores based on education level for the Cahn scales, we evaluated the effect of race in a 2×2×2 factorial ANOVA including age (split at 75 years), education (high vs. low), and race (white vs. African American) as independent variables. Age and education were significant factors for all three Cahn scales. Race was a significant factor in both the Qualitative and Global scales, but not for the Quantitative scale. The interaction between education and race was significant for all three scales.

WRAT-3 Reading scores

The most significant differences in CDT scores emerged when WRAT-3 Reading raw scores were split at 49 (scores of 39–48 compared to scores of 49–57). However, because of the resulting large difference in sample size between these groups (i.e., only 22 subjects had WRAT-3 Reading raw scores < 49), this division was deemed inappropriate. Splitting WRAT-3 Reading raw scores at 52 instead (comparing scores of 39–51 to scores of 52–57) also resulted in significant differences for the Mendez, Cahn Quantitative, and Cahn Global scores, but not for the Freund or Cahn Qualitative scores. The sample size was more evenly distributed between these two groups (WRAT-3 Reading raw scores < 52: n = 67; ≥ 52: n = 138). Furthermore, in a 2×2 ANOVA of CDT scores by age (split at 75 years) and WRAT-3 Reading raw score (split at 52), age was a significant factor in every model, and WRAT-3 Reading raw score was a significant factor in every model except the Cahn Qualitative model, for which it had not been significant in the group comparison either. There were no statistically significant interaction effects between age and WRAT-3 Reading raw score in any of the models using the WRAT-3 Reading split at 52.

Presentation of Norms

Based on the above iterative processes, we present the following means and standard deviations for the CDT scores: unadjusted scores split by age (55–74 and 75–98) for every CDT scoring system (Table 9); scores split by age and WRAT-3 Reading raw score (<52 and ≥ 52) for every system (Table 10); and Cahn (Quantitative, Qualitative, and Global) scores split by age, education, and race (Table 11). Appendix 2 presents the standardized (t- and z-) scores for the raw scores on every CDT scoring system using these demographic groupings.
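As an illustration of how standardized scores of this kind can be derived, the following minimal sketch assigns a subject to an age by WRAT-3 Reading subgroup using the splits described above and converts a raw CDT score to z- and t-scores. It assumes the conventional t-score scaling (mean 50, SD 10), which the text does not spell out. The function names and the example mean and standard deviation are hypothetical placeholders and are not values from Tables 9–11.

```python
def norm_group(age, wrat3_reading_raw):
    """Return the (age band, WRAT-3 band) subgroup using the splits in the text."""
    age_band = "55-74" if age < 75 else "75-98"
    wrat_band = "<52" if wrat3_reading_raw < 52 else ">=52"
    return age_band, wrat_band


def z_score(raw, group_mean, group_sd):
    """Standardize a raw CDT score against a normative subgroup."""
    if group_sd <= 0:
        raise ValueError("group_sd must be positive")
    return (raw - group_mean) / group_sd


def t_score(raw, group_mean, group_sd):
    """Convert to a t-score, assuming the conventional mean 50 / SD 10 scaling."""
    return 50 + 10 * z_score(raw, group_mean, group_sd)


# Placeholder subgroup statistics (hypothetical, NOT taken from the tables).
EXAMPLE_MEAN, EXAMPLE_SD = 8.0, 2.0

group = norm_group(age=80, wrat3_reading_raw=50)
z = z_score(6.0, EXAMPLE_MEAN, EXAMPLE_SD)
t = t_score(6.0, EXAMPLE_MEAN, EXAMPLE_SD)
```

In practice the mean and standard deviation would be read from the table row that matches the subject's subgroup.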

Table 9
Norms by Age Split at 75 years, unadjusted.
Table 10
Norms: Age Split at 75 x WRAT-3 raw score.
Table 11
Cahn scores split by age, education, and race.

Center Score Analysis

Among the 207 normative subjects, every clock in both the Command and Copy conditions featured a center. In the Command condition, 19 (9.2%) of the clocks had centers that were displaced horizontally more than 5 mm. Eleven (5.3%) were displaced vertically: 9 of these (81.8%) were shifted more than 7 mm above the center, and 2 (18.2%) more than 5 mm below the center. Only 4 clocks (1.9%) had both horizontal and vertical displacement. In the Copy condition, 12 centers (5.8%) were horizontally displaced, and 10 (4.8%) were vertically displaced; of these, 2 clocks (1.0%) were both horizontally and vertically displaced. The vertical displacement was primarily below center in the Copy condition (6 out of 10, or 60% of those vertically displaced); only 4 centers were shifted up more than 7 mm.

There were no significant correlations between age, WRAT-3 Reading raw score, or years of education and qualitative center scores. Mann-Whitney U tests indicated no significant differences in quantitative or qualitative center scores between the normative subgroups, with one exception. In the Copy condition, the proportion of subjects who vertically displaced the center of the clock was significantly greater among those with a WRAT-3 Reading score < 52 compared to those with a WRAT-3 Reading score ≥ 52 (Table 12).

Table 12
Differences in vertical center displacement between WRAT-3 raw score groups, Copy condition.

Discussion


This paper presents normative data from a sample of 207 cognitively normal elderly aged 55–98 using three previously published CDT scoring systems. The three systems generally correlate well with each other, have high inter-rater reliability, and require only about a minute each to complete. However, in this sample of cognitively healthy elderly volunteers who underwent extensive evaluative and diagnostic procedures, previously published recommendations for what constitutes “normal performance” for all three systems were too narrow in range and overly strict, despite the relatively high level of education in this sample.

Features of the scoring systems

Inter-rater reliability

The Mendez system showed the highest inter-rater reliability on both the Command and Copy conditions, though all of the systems were near-perfect with the exception of the Cahn Qualitative scale. The Mendez was the most reliable scale, likely because of its very explicit scoring criteria, while the Cahn Qualitative scale was the least reliable, likely because of its subjective nature. Regardless, our findings suggest that all three of these systems can be used consistently across different raters with adequate (and brief) training.

Time to score

Each system took less than one minute to implement; however, the Freund system remained the quickest even after practice (mean of 42 seconds per clock). The brevity of scoring suggests that any of these systems could be implemented in clinical practice without excessive burden, and that the clinician would benefit from using a scoring system rather than relying solely on clinical judgment.

Correlations between CDT scores

The three systems are moderately to highly correlated, with the strongest correlation observed between the Mendez and Freund systems. This latter finding is likely due to the scales’ overlap in evaluating similar aspects of clock drawing.

CDT scores

Overall CDT scores

Median CDT scores among our sample were within the normal ranges estimated by the systems’ authors. However, there was variation in performance using each CDT scoring system, which would lead many of our normal subjects to be misclassified as cognitively impaired using previously published cutoffs. Our findings suggest that the range of “normal” performance is much greater than previously reported using smaller sample sizes and comparing clock drawing abilities between diagnostic groups (Cahn et al., 1996; Freund et al., 2005; Mendez et al., 1992).

Impact of demographic and testing variables

We found no difference in MMSE, WRAT-3 Reading, or any CDT scores between the CTL and CTL-C subgroups. However, CTL-Cs were, on average, older and had higher GDS scores than the CTLs. This finding is consistent with previous research documenting an association between increased memory complaints and depression even in the absence of objective cognitive decline (Derouesné, Thibault, Lagha-Pierucci, Baudouin-Madec, Ancri, & Lacomblez, 1999; Grut, Jorm, Fratiglioni, Forsell, Viitanen, & Winblad, 1993; Spitznagel & Tremont, 2005).

CDT performance was significantly worse among older subjects for all three scoring systems. This finding was best captured using a cutoff age of 75 years (i.e., <75 vs. ≥75), a result consistent with other studies reporting the natural deterioration of clock drawing ability among healthy adults with increasing age (Freedman et al., 1994; Kozora & Munro, 1994; Paganini-Hill, Clark, Henderson, & Birge, 2001; Spreen & Strauss, 1998).

Education level was not significantly related to CDT scores using the Freund or Mendez systems, but did relate to scores using the Cahn system. Likewise, racial differences appeared correlated with clock drawing performance based on the Cahn system, with whites performing better and committing fewer qualitative errors than African Americans. However, the WRAT-3 Reading test, a marker of education quality (Manly, Jacobs, Touradji, Small, & Stern, 2002), attenuated any educational and racial differences in CDT scores. Thus, WRAT-3 Reading explained differences in CDT scores more effectively than did racial and educational differences combined. These results are consistent with Manly et al.’s (Manly, Byrd, Touradji, & Stern, 2004) finding that quality of education, as operationalized by WRAT-3 Reading scores, is an important predictor of cognitive performances among African Americans.

Previous studies have found few, if any, differences in CDT scores across racial and ethnic categories (cf. La Rue, Romero, Ortiz, Liang, & Lindeman, 1999; Marcopulos & McLain, 2003). Instead, educational differences appear to have a greater effect on clock drawing (Marcopulos & McLain, 2003). As noted, however, norms based on years of education may not adequately reflect differences in the quality of education received across racial groups, and racial differences in cognitive performances may be observed even when education level is considered (Howieson, Loring, & Hannay, 2004; La Rue et al., 1999; Manly et al., 2002, 2004). Manly and her colleagues (Manly, 2006; Manly & Echemendia, 2007) further argue that presenting race-specific norms can be problematic because it may discourage consideration of other underlying factors for which race is merely a proxy (e.g., socioeconomic status differences). Thus, although we present norms separated by race and education, we recommend the use of the tables based on WRAT-3 Reading raw scores if possible, as this measure better accounts for CDT performance differences than do racial or educational differences. Because the WRAT-3 is not routinely administered in many clinical settings and is unlikely to be used as part of a quick CDT screening, norms based on education and race may be the most appropriate way to evaluate CDT performance if the Cahn system is used.

Ultimately, no regression model was able to account for more than 10.1% of the total variance in CDT scores, suggesting that, overall, CDT performance is not dramatically affected by these demographic variables. While there are significant differences in scores between the normative subgroups presented here, the absolute differences between the groups are not substantial.

Qualitative Features of Clock Drawing

A comparison of the qualitative features of the Command condition clocks that received very different scores across the three systems revealed some particular differences between the scoring systems (Figure 1). The Freund scale quantifies errors in the spacing of numbers relative to each other and to the edge of the clock face and is most affected by the presence of tics or other intrusive marks on the clock. The Freund scale does not, however, detect whether the time is correctly indicated (i.e., if the long and short hands are switched, but the hands still point to both the “2” and “11,” the clock may still receive a perfect score).

Figure 1
Examples of qualitative differences detected by each scoring system

The Cahn system assesses spatial errors as part of the quantitative score and in two particular questions on the qualitative scale. One question regards specific planning deficits (i.e., unnecessarily large gaps before the 12, 3, 6 or 9 on the clock), a feature not measured separately from other spacing errors on the other two scoring systems. Cahn is the only system of the three that explicitly requires the hands to be correctly set to 11:10. It is also the only system that evaluates the overall shape of the clock face, as the Mendez system only requires that there be a “closure figure” present, and the Freund system assumes that there is a clock circle but does not evaluate features of the circle itself. Finally, although all three systems measure whether the numbers are placed within the clock circle, this is a somewhat more significant error in the Cahn system because a subject loses points from both the quantitative and qualitative subscales if the numbers are outside of the circle.

Like Freund, the Mendez system does not require that the hands be correctly set as long as there are two hands of unequal length that point to the “2” and the “11.” Unlike Freund and Cahn, the Mendez scale only detects major spacing errors (“most symbols are distributed as a circle without major gaps”) instead of the finer aspects of number placement (Freund requires that the numbers be “spaced equally or nearly equally from each other” as well as from the edge of the circle, and Cahn begins subtracting points if there is anything beyond “minimal error in the spatial arrangement”). Because of this feature, many “normal” clocks receive lower scores using Cahn or Freund than they would using Mendez due to minor spacing errors.

Only one subject obtained a perfect score across all three scoring systems (Figure 2). Using the scoring cutoffs established by the systems’ authors, 36 of our cognitively normal subjects’ clocks would be considered abnormal using the Freund scale, 58 using Mendez, and 66 using Cahn. The NHTSA, which makes recommendations about driving based on an older eight-point version of Freund’s scale, recommends that a driving intervention is necessary if a subject demonstrates “any incorrect element” on the Freund system. Assuming the NHTSA would make the same statement regarding the seven-point Freund scale, only 7 of our 207 cognitively healthy study participants would not be referred for a driving intervention.
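The counts above can be restated as proportions of the normative sample. The short calculation below is only a worked arithmetic check using the figures reported in this paragraph; the variable names are illustrative.

```python
N_NORMAL = 207  # cognitively normal subjects in the sample

# Clocks that published cutoffs would classify as abnormal (counts from the text).
flagged_abnormal = {"Freund": 36, "Mendez": 58, "Cahn": 66}

# Percentage of the normative sample flagged by each system.
rates = {system: round(100 * count / N_NORMAL, 1)
         for system, count in flagged_abnormal.items()}

# Only 7 subjects showed no incorrect Freund element, so the rest
# would be referred under the "any incorrect element" rule.
referred = N_NORMAL - 7
referral_rate = round(100 * referred / N_NORMAL, 1)

for system, pct in rates.items():
    print(f"{system}: {pct}% of normal subjects flagged")
print(f"Driving referral under the Freund rule: {referral_rate}%")
```

That is, roughly 17% to 32% of cognitively normal subjects would be flagged depending on the system, and about 97% would be referred under the "any incorrect element" rule.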

Figure 2
The only “perfect clock” in the normative sample (receiving a perfect score on all three scoring systems).

There were some consistent features among the clocks drawn by this normative sample. Because the Mendez scale measures 20 specific aspects of clock drawings, trends in these features can be used to generalize about normative performance. No subject included numbers that went beyond the number 12, and numbers were always written as Arabic or Roman numerals. All but three subjects drew enclosed clock faces (for an example, see Figure 3). Nine subjects (4.3 percent) did not draw a clock hand pointing to the “2”; however, every subject correctly drew a clock hand pointing to the “11.” Even though the Mendez scale has less stringent spacing criteria than the other two systems, 143 subjects (nearly 70 percent) still left “major gaps” between the numbers on their clocks, and 157 (75.8 percent) did not place the numbers “about equally adjacent” to the edge of the clock face. Twenty percent of the sample failed to draw all of the numbers entirely inside the clock face. The length difference between the hands was not visible in 7.2 percent of the clocks. As measured by the Center scale, every subject drew a clock that featured a real center (or one implied by the intersection of the two hands), though this center was not necessarily located in the actual center of the clock face. Furthermore, in seven subjects’ drawings, the hands did not radiate from the direction of the closure figure center. Five of the errors on the Mendez scale were committed by one subject each, and all five unique errors were committed by a total of three subjects (Figure 3).

Figure 3
Normative clock drawings featuring unique errors (using the Mendez scoring system)

Center Scoring Criteria

Most healthy elderly subjects drew a center that was within Freedman et al.’s (1994) normative range, with 5.8 to 9.2 percent of subjects displacing the center more than 5 mm to the left or right, and 4.8 to 5.3 percent drawing the center more than 5 mm below or 7 mm above the actual center of the clock face. However, these scoring criteria only provided an absolute measurement of displacement, rather than a measure proportional to the size of the clock face itself. That is, a 5 to 7 mm displacement allows a much smaller margin of error for a subject who draws a clock that is 13 cm in diameter than for a subject who draws a 5 cm-diameter clock. Freedman et al. used these displacement indices to evaluate a condition where an 11.7-cm diameter circle was pre-drawn for the test subjects. Though it is important to note that some horizontal and vertical center displacement was present among healthy elderly subjects in the present study, a more accurate measurement of the center displacement could have been made if it were based on distances proportional to the size of the clock face. In terms of the CDT’s clinical utility, however, the calculation of a proportional measurement is likely to add significantly to the time required to complete scoring and may be impractical. Of note here is that no healthy elderly subject drew a clock without a central intersection of the hands or a central mark.
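To make the point about proportional measurement concrete, the sketch below expresses a center displacement as a fraction of the clock radius. This is one possible proportional index chosen for illustration; it is not a measure used in this study or proposed by Freedman et al. (1994), and the function name is hypothetical.

```python
def proportional_displacement(offset_mm, clock_diameter_mm):
    """Center displacement as a fraction of the clock radius (illustrative index)."""
    if clock_diameter_mm <= 0:
        raise ValueError("clock_diameter_mm must be positive")
    radius_mm = clock_diameter_mm / 2
    return abs(offset_mm) / radius_mm


# The same 5 mm offset on a 13 cm clock and on a 5 cm clock.
large_clock = proportional_displacement(5, 130)  # 5 / 65
small_clock = proportional_displacement(5, 50)   # 5 / 25
```

The same 5 mm offset is a much smaller share of the radius on the 13 cm clock than on the 5 cm clock, which is why a fixed millimeter threshold is comparatively stricter for larger drawings.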

CDT Scoring Systems: Selection for clinical use

The choice of a scoring system ultimately depends on the specific needs and goals of the clinician or researcher. For example, a person interested in obtaining detailed information about many specific qualitative aspects of clock drawing might use the Mendez system. For a quick CDT screen that is highly sensitive to spatial errors without much qualitative detail, the Freund system might be the preferred choice. The most thorough information about some of the most essential aspects of clock drawing, such as correct hand positioning and the clock’s gestalt, is probably best obtained using the three-part Cahn scale. Ultimately, all three systems have high inter-rater reliability and correlate well with each other; thus, the final decision about which system to choose will likely depend on the consumer’s preferences about detail and time.

Limitations and Further Research


A limitation of this study is the fact that our sample was highly educated compared to the general older adult U.S. population, and therefore we were unable to evaluate the effect of a wide range of educational experiences on CDT performance. Several studies examining the effect of education on clock scores suggest the presence of a “ceiling effect”; that is, healthy elderly obtain very high scores regardless of their education level, so little to no education effects are observed (Lam et al., 1998; Ratcliff, Dodge, Birzescu, & Ganguli, 2003; Shulman, 2000). Because of the proposed ceiling effect, additional normative data including healthy subjects with fewer years of education may not produce vastly different results than those presented here, where education appeared to have a minor, if any, influence on CDT performance. However, this issue appears unresolved and merits further investigation.

Longitudinal vs. Cross-sectional data

Because this was a cross-sectional study, it is possible that there were cohort differences (e.g., education, culture) between the older and younger subjects which led to differences in CDT performance (Howieson et al., 2004). Therefore, our results should not be interpreted as determining the effect of aging on CDT performance. On the other hand, in a longitudinal community study of initially non-demented adults aged 65 and older, Ratcliff et al. (2003) found that 46.6 percent of participants’ CDT ability declined over a 10-year period. Longitudinal analysis of CDT productions by a cognitively normal aging sample would be required to evaluate whether the observed decline in clock drawing ability with age is applicable to the individual level.

Addenda to published scoring systems

The high inter-rater reliability seen in the present study is likely due in part to the “Supplemental Scoring Criteria” which were used in conjunction with the published scoring systems. Thus, the intraclass correlations reported here are likely to be higher than if only the original scoring criteria had been employed. We therefore encourage the use of the more specific scoring criteria to improve consistency across different raters in clinical and research settings.

Conclusions


Clearly, most healthy elderly subjects cannot draw clocks that would be considered “perfect” using any of the scoring systems evaluated in this study. Although there may be some generalizations about what can be considered a normal clock based on these data, it is important to note that if a subject’s clock shows an error or feature not seen among this cohort, it does not necessarily mean that the person is impaired. Likewise, if a clock drawing falls into the range of “normal” performance, this does not necessarily imply that the subject is cognitively normal.

The goals of this study were to provide normative data for cognitively healthy elderly subjects using several published CDT scoring systems, accounting for meaningful demographic and test variables, and to evaluate the clinical utility of three common CDT scoring systems. This is the first presentation of such information based solely on cognitively normal elderly volunteers. Knowledge about what to expect from an unimpaired elderly individual is critical because the failure to account for variations in normal performance has serious consequences for healthy elderly who present for cognitive evaluation. Therefore, these results have implications for screening and diagnostic practices as well as clinical research outcomes. For example, current standards for CDT performance using the Freund scale might lead to 97 percent of our normative sample being referred for an on-road driving evaluation (Wang et al., 2003). Such recommendations must consider normal performance, especially if they lead to consequences that unnecessarily limit independence in everyday activities.

Although CDT performance is generally high among cognitively healthy elderly volunteers, some variation in drawing abilities, affected by variables such as age and quality of education, will be observed and should not be assumed to be abnormal. Continued research comparing well-characterized samples of cognitively healthy participants and those suffering from various degrees of cognitive decline and other disorders may further elucidate whether particular features of clock drawing are characteristic of cognitive impairment.

Acknowledgments


This work was supported by NIH Grants P30-AG13846 (Boston University Alzheimer’s Disease Core Center), M01-RR00533 (Boston University General Clinical Research Center), R03-AG026610 (ALJ), R03-AG027480 (ALJ), K12-HD043444 (ALJ), K23-AG030962 (ALJ), and K24 AG27841 (RCG).

Appendix 1. Compiled CDT scoring sheets used for data collection

Scoring Criteria for Freund (2005) Quantitative Scale

A. Instructions: score out of 7 points, even if no numbers are present.

Each element is scored 0 or 1, separately for the Command and Copy conditions.

1. Time (maximum: 3 points)
  • One hand points to “2” (or symbol representative of 2)
  • Exactly 2 hands
  • Absence of intrusive marks, e.g., writing or hands indicating incorrect time, hand points to number 10; tic marks, time written in text (11:10; ten after eleven)
  • Sum of Time elements: Command ___/3; Copy ___/3
2. Numbers (maximum: 2 points)
  • Numbers are inside the clock circle (may touch perimeter, but may not extend outside of circle)
  • All numbers 1–12 are present, no duplicates or omissions
  • Sum of Number elements: Command ___/2; Copy ___/2
3. Spacing (maximum: 2 points)
  • Numbers spaced equally or nearly equally from each other
  • Numbers spaced equally or nearly equally from the edge of the circle
  • Sum of Spacing elements: Command ___/2; Copy ___/2

B. Clock Drawing Command and Copy Score Table

1. Time: Command 0 1 2 3; Copy 0 1 2 3
2. Numbers: Command 0 1 2; Copy 0 1 2
3. Spacing: Command 0 1 2; Copy 0 1 2

Time to Complete Scoring: _____min _____sec
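The tallying on the Freund sheet above can be sketched as follows. This is a minimal illustration of the arithmetic only: the dictionary keys are hypothetical labels for the seven elements, and judging whether each element is present still requires a trained rater.

```python
# Hypothetical keys for the seven Freund elements, grouped by domain.
FREUND_ELEMENTS = {
    "time": ("hand_points_to_2", "exactly_two_hands", "no_intrusive_marks"),
    "numbers": ("numbers_inside_circle", "all_numbers_1_to_12_once"),
    "spacing": ("equal_spacing_between_numbers", "equal_spacing_from_edge"),
}


def freund_score(elements):
    """Sum rater judgments (truthy = element present) into subscores and a total.

    `elements` maps every key in FREUND_ELEMENTS to a 0/1 or bool judgment;
    a missing key raises KeyError so that no element is silently skipped.
    """
    scores = {
        domain: sum(1 if elements[key] else 0 for key in keys)
        for domain, keys in FREUND_ELEMENTS.items()
    }
    scores["total"] = sum(scores.values())  # maximum of 7
    return scores


# Example: a clock with correct time and numbers but uneven spacing.
example = {key: 1 for keys in FREUND_ELEMENTS.values() for key in keys}
example["equal_spacing_between_numbers"] = 0
example["equal_spacing_from_edge"] = 0
result = freund_score(example)
```

The same function would be applied once to the Command clock and once to the Copy clock.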

Scoring Criteria for Clock Drawing Interpretation Scale (Mendez, 1992)

A. Clock Drawing Command and Copy Score Table (score “1” per element present)

Each element is scored 0 or 1, separately for the Command and Copy conditions.

1. There is an attempt to indicate a time in any way
2. All marks or items can be classified as either part of a closure figure, a hand, or a symbol for the clock numbers
3. There is a totally closed figure (closure figure); no gap is greater than ¼″
Score Only if Symbols for Clock Numbers Are Present
4. A “2” is present and is pointed out in some way for the time
5. Most symbols are distributed as a circle without major gaps:
  • Score “0” if more than one number is not placed in a circular fashion
  • Score “0” if a “major gap” is twice the size of the smallest gap separating two consecutive numbers
6. Three or more clock quadrants have one or more appropriate numbers: 12–3, 3–6, 6–9, 9–12 per respective clockwise quadrant
7. Most symbols are ordered in a clockwise or rightward direction
8. All symbols are totally within a closure figure (may not touch the perimeter of the circle)
9. An “11” is present and is pointed out in some way for a time
10. All numbers 1–12 are indicated
11. There are no repeated or duplicated number symbols
12. There are no substitutions for Arabic or Roman numerals
13. The numbers do not go beyond the number 12
14. All symbols lie about equally adjacent to a closure figure edge
15. Seven or more of the same symbol type are ordered sequentially
Score Only if One or More Hands Are Present
16. All hands radiate from the direction of a closure figure center, or from a point that is approximately in the middle of the clock face
17. One hand is visibly longer than another hand (do not use ruler to determine)
18. There are exactly two distinct and separable hands
19. All hands are totally within a closure figure
20. There is an attempt to indicate a time with one or more hands

Time to Complete Scoring: _____min ______sec

Scoring Criteria for Cahn’s (1996) Quantitative Scale

Each item is scored separately for the Command and Copy conditions; the points awarded are listed before each description.

1. Integrity of the clock face (maximum: 2 points)
  • 2: Present without gross distortion
  • 1: Incomplete or some distortion
  • 0: Absent or totally inappropriate
  • Note: clock face is considered distorted if “smashed” horizontally or vertically, or if face is extremely slanted.
2. Presence and sequencing of numbers (maximum: 4 points)
  • 4: All present in the right order and at most minimal error in the spatial arrangement (only one number is displaced, or only one very large or very small gap between numbers)
  • 3: All present but errors in spatial arrangement
  • 2: Numbers missing or added but no gross distortions of the remaining numbers; OR numbers placed in counterclockwise direction; OR all numbers present but gross distortion in spatial layout (i.e., hemineglect, numbers outside of the clock)
  • 1: Missing or added numbers and gross spatial distortions
  • 0: Absence or poor representation of numbers
3. Presence and placement of the hands (maximum: 4 points)
  • 4: Hands are in the correct position and the size difference is respected
  • 3: Slight errors in the placement of the hands or no representation of the size difference between the hands
  • 2: Major errors in the placement of the hands (significantly out of course including 10 to 11)
  • 1: Only one hand or poor representation of two hands
  • 0: No hands or perseveration on hands

B. Clock Drawing Command and Copy Score Table

1. Integrity of the clock face: Command 0 1 2; Copy 0 1 2
2. Presence and sequencing of numbers: Command 0 1 2 3 4; Copy 0 1 2 3 4
3. Presence and placement of the hands: Command 0 1 2 3 4; Copy 0 1 2 3 4

Scoring Criteria for Cahn’s (1996) Qualitative Error Scale

Instructions: If no numbers present: score out of 8 points; give no errors for error types regarding numbers

Each error type is scored 0 (absent) or 1 (present), separately for the Command and Copy conditions.

1. Stimulus bound response: The tendency of the drawing to be dominated or guided by a single stimulus. There may be three types of stimulus bound errors:
  • The hands may be set for 10 to 11 instead of 10 after 11
  • Time is written beside the 11 or between 10 and 11 on clock
  • The hands are absent or are pointed toward “10” and/or “11.” This type of error is also considered to be a conceptual error.
2. Conceptual deficit: This error type reflects a loss or deficit in accessing knowledge of the attributes, features, and meaning of a clock. Included in this category are misrepresentation of the clock itself (a clockface without numbers or inappropriate use of numbers) and misrepresentation of the time on the clock (the hands are either absent or inadequately represented or the time is written on the clock)
3. Perseveration: The continuation or the recurrence of activity without an appropriate stimulus. In clock drawing, this error is seen in the presence of more than two hands and abnormal prolongation of numbers (writing beyond “12”)
4. Neglect of left hemispace: All attributes of the clock are written on the right half of the clockface. Possible neglect of the right hemispace was also evaluated but this type of error was never observed.
5. Planning deficit: This error type is represented by gaps before the 12, 3, 6, or 9 depending on the strategy used in drawing
6. Nonspecific spatial error: A deficit in the spatial layout of numbers, without any specific pattern in spatial disorganization
7. Numbers written on outside of clock: Numbers written either around the perimeter of the circle or on the circle itself (including touching the perimeter)
8. Numbers written counterclockwise: Arrangement of numbers with the “12” at the top of the clockface and then continuing around in a counterclockwise fashion
Total (score “1” per error type present): Command ___/8; Copy ___/8

Scoring Criteria for Cahn’s (1996) Global Clock Drawing Test (CDT) Scale

A. Instructions: This score takes into account not only the presence and correctness of the features of the clock, but also the number of error types made and strategies used in the construction of the clock. Score is determined by subtracting the Cahn Qualitative Error Score from the Cahn Quantitative Score. Maximum score is 10; minimum score is −8.
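The subtraction described in these instructions can be written as a short function. This is a sketch of the stated arithmetic only; the function name and the range checks are illustrative additions based on the maximum scores given on the Quantitative (10 points) and Qualitative Error (8 error types) sheets.

```python
def cahn_global(quantitative, qualitative_errors):
    """Cahn Global score: Quantitative score minus Qualitative Error score."""
    if not 0 <= quantitative <= 10:
        raise ValueError("quantitative score must be between 0 and 10")
    if not 0 <= qualitative_errors <= 8:
        raise ValueError("qualitative error score must be between 0 and 8")
    return quantitative - qualitative_errors


best = cahn_global(10, 0)    # perfect clock with no error types
worst = cahn_global(0, 8)    # lowest possible score
typical = cahn_global(8, 2)  # e.g., minor spacing errors plus two error types
```

The extremes reproduce the stated range of 10 to −8.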

B. Global Clock Drawing Score Table





Time to Complete Scoring (Cahn total time): _____min _____sec

Center scoring criteria (Freedman 1994)

Each criterion is scored 0 or 1 (1 = element present), separately for the Command and Copy conditions.

  • Clock has a center (drawn or inferred/extrapolated at the point where 2 hands meet)
  • No horizontal displacement: center is displaced from the vertical axis within 5.0 mm (3/16 in) to the right or left of the axis
  • No vertical displacement: center is displaced from the horizontal axis within 5.0 mm (3/16 in) below or 7.0 mm (5/16 in) above the axis

Qualitative Vertical Displacement (recorded for each condition):
  • Shifted up
  • Shifted down
  • No displacement
  • No center

Time to Complete Scoring: _______min _______sec
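The center criteria above amount to simple tolerance checks, sketched below. The function `score_center` and its parameters are hypothetical and rest on several assumptions not stated in the form: offsets are measured in millimetres, positive `dy_mm` means the center sits above the horizontal axis, "within" is treated as inclusive, both displacement items are scored 0 when no center can be identified, and the qualitative label is derived from the same vertical tolerances.

```python
def score_center(has_center, dx_mm=0.0, dy_mm=0.0):
    """Score the three center criteria and the qualitative vertical label.

    dx_mm: horizontal offset of the center from the vertical axis.
    dy_mm: vertical offset from the horizontal axis (positive = above).
    """
    if not has_center:
        # Assumption: displacement cannot be credited without a center.
        return {
            "center": 0,
            "no_horizontal_displacement": 0,
            "no_vertical_displacement": 0,
            "qualitative_vertical": "No center",
        }

    # Within 5.0 mm to the right or left of the vertical axis.
    horizontal_ok = abs(dx_mm) <= 5.0
    # Within 5.0 mm below or 7.0 mm above the horizontal axis.
    vertical_ok = -5.0 <= dy_mm <= 7.0

    if vertical_ok:
        label = "No displacement"
    elif dy_mm > 0:
        label = "Shifted up"
    else:
        label = "Shifted down"

    return {
        "center": 1,
        "no_horizontal_displacement": int(horizontal_ok),
        "no_vertical_displacement": int(vertical_ok),
        "qualitative_vertical": label,
    }
```

The asymmetric vertical tolerance (5.0 mm below, 7.0 mm above) is taken directly from the criteria; everything else in the sketch is a convenience for illustration.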

Appendix 2. t- and z-scores
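For readers who want to place a raw CDT score relative to a normative subgroup, the conversion to standardized scores can be sketched as below. This is a generic illustration under stated assumptions: it uses the conventional z-score definition and the conventional T-score scaling (mean 50, SD 10), which the manuscript text does not spell out, and the function names and the mean and SD in the example are hypothetical placeholders, not values from the tables in this appendix.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations the raw score lies from the subgroup mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd


def t_score(raw, mean, sd):
    """Conventional T-score: z rescaled to mean 50 and SD 10 (assumed scaling)."""
    return 50.0 + 10.0 * z_score(raw, mean, sd)


# Hypothetical subgroup mean and SD, chosen only to show the arithmetic.
example_z = z_score(8, mean=10, sd=2)
example_t = t_score(8, mean=10, sd=2)
```

In practice the mean and SD would be taken from the row of the normative table that matches the patient's age and WRAT-3 Reading subgroup.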

[Tables of t- and z-scores appear as images in the original and are not reproduced here.]


Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


  • Blair M, Kertesz A, McMonagle P, Davidson W, Bodi N. Quantitative and qualitative analyses of clock drawing in frontotemporal dementia and Alzheimer’s disease. Journal of the International Neuropsychological Society. 2006;12:159–165.
  • Brodaty H, Moore CM. The Clock Drawing Test for dementia of the Alzheimer’s type: A comparison of three scoring methods in a memory disorders clinic. International Journal of Geriatric Psychiatry. 1997;12:619–627.
  • Cacho J, García-García R, Arcaya J, Gay J, Guerrero-Peral AL, Gómez-Sánchez JC, et al. El test del reloj en ancianos sanos. Revista de Neurología. 1996;24:1525–1528.
  • Cahn DA, Salmon DP, Monsch AU, Butters N, Wiederholt WC, Corey-Bloom J. Screening for dementia of the Alzheimer type in the community: The utility of the Clock Drawing Test. Archives of Clinical Neuropsychology. 1996;11(6):529–539.
  • Census Bureau, U. S. Current Population Survey, 2006 Annual Social and Economic Supplement. 2006. Retrieved 2 May 2007, from
  • Derouesné C, Thibault S, Lagha-Pierucci S, Baudouin-Madec V, Ancri D, Lacomblez L. Decreased awareness of cognitive deficits in patients with mild dementia of the Alzheimer type. International Journal of Geriatric Psychiatry. 1999;14:1019–1030.
  • Fischer J, Loring D. Construction. In: Lezak M, Howieson D, Loring D, editors. Neuropsychological Assessment. 4. New York: Oxford University Press; 2004. pp. 531–568.
  • Folstein MF, Folstein SE, McHugh PR. Mini-Mental State Examination (MMSE). Lutz, FL: Psychological Assessment Resources, Inc; 2001.
  • Freedman M, Leach L, Kaplan E, Winocur G, Shulman K, Delis DC. Clock Drawing: A neuropsychological analysis. New York: Oxford University Press; 1994.
  • Freund B, Gravenstein S, Ferris R, Burke BL, Shaheen E. Drawing clocks and driving cars: Use of brief tests of cognition to screen driving competency in older adults. Journal of General Internal Medicine. 2005;20:240–244.
  • Greenhalgh T. How to read a paper: Papers that report diagnostic or screening tests. British Medical Journal. 1997;315:540–543.
  • Grut M, Jorm A, Fratiglioni L, Forsell Y, Viitanen M, Winblad B. Memory complaints of elderly people in a population survey: variation according to dementia stage and depression. Journal of the American Geriatrics Society. 1993;41:1295–1300.
  • Howieson D, Loring D, Hannay J. Neurobehavioral Variables and Diagnostic Issues. In: Lezak M, Howieson D, Loring D, editors. Neuropsychological Assessment. 4. New York: Oxford University Press; 2004. pp. 286–334.
  • Hughes C, Berg L, Danziger W, Coben L, Martin R. A new clinical scale for the staging of dementia. British Journal of Psychiatry. 1982;140:566–572.
  • Jefferson AL, Wong S, Bolen E, Ozonoff A, Green RC, Stern RA. Cognitive predictors of HVOT performance differ between individuals with mild cognitive impairment and normal controls. Archives of Clinical Neuropsychology. 2006;21:405–412.
  • Jefferson AL, Wong S, Gracer TS, Green RC, Stern RA. Geriatric performances on the Boston Naming Test-30 item even version. Applied Neuropsychology. 2007;14:1–9.
  • Juby A, Tench S, Baker V. The value of clock drawing in identifying cognitive executive dysfunction in people with a normal Mini-Mental State Examination score. Canadian Medical Association Journal. 2002;167(8):859–864.
  • Kozora E, Munro CC. Qualitative features of clock drawing in normal aging and Alzheimer’s Disease. Assessment. 1994;1(2):179–187.
  • La Rue A, Romero L, Ortiz I, Liang H, Lindeman R. Neuropsychological performance of Hispanic and non-Hispanic older adults: An epidemiologic survey. Clinical Neuropsychologist. 1999;13(4):474–486.
  • Lam LCW, Chiu HFK, Ng KO, Chan C, Chan WF, Li SW, et al. Clock-face drawing, reading and setting tests in the screening of dementia in Chinese Elderly Adults. Journal of Gerontology: Psychological Sciences. 1998;53B(6):353–357.
  • Landis J, Koch G. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
  • Lee HL, Swanwick GRJ, Coen RF, Lawlor BA. Use of the Clock Drawing Task in the diagnosis of mild and very mild Alzheimer’s Disease. International Psychogeriatrics. 1996;8(3):469–476.
  • Libon DJ, Malamut BL, Swenson R, Prouty Sands L, Cloud BS. Further analyses of clock drawings among demented and nondemented older subjects. Archives of Clinical Neuropsychology. 1996;11(3):193–205.
  • Lusted L. Decision-making studies in patient management. New England Journal of Medicine. 1971;284(8):416–424.
  • Manly JJ. Deconstructing Race and Ethnicity: Implications for Measurement of Health Outcomes. Medical Care. 2006;44:S10–S16.
  • Manly JJ, Byrd DA, Touradji P, Stern Y. Acculturation, reading level, and neuropsychological test performance among African American elders. Applied Neuropsychology. 2004;11(1):37–46.
  • Manly JJ, Echemendia R. Race-specific norms: Using the model of hypertension to understand issues of race, culture, and education in neuropsychology. Archives of Clinical Neuropsychology. 2007;22:319–325.
  • Manly JJ, Jacobs D, Touradji P, Small S, Stern Y. Reading level attenuates differences in neuropsychological test performance between African American and White elders. Journal of the International Neuropsychological Society. 2002;8:341–348.
  • Manos P. Ten-point clock test sensitivity for Alzheimer’s disease in patients with MMSE scores greater than 23. International Journal of Geriatric Psychiatry. 1999;14:454–458.
  • Marcopulos BA, McLain CA. Are our Norms “Normal”? A 4-Year Follow-Up Study of a Biracial Sample of Rural Elders with Low Education. Clinical Neuropsychologist. 2003;17(1):19–33.
  • Mendez MF, Ala T, Underwood KL. Development of scoring criteria for the Clock Drawing Task in Alzheimer’s Disease. Journal of the American Geriatrics Society. 1992;40:1095–1099.
  • Morris J. Clinical Dementia Rating (CDR): Current version and scoring rules. Neurology. 1993;43:2412–2414.
  • Osterrieth P. Le test de copie d’une figure complexe. Archives de Psychologie. 1944;30:206–356.
  • Paganini-Hill A, Clark L, Henderson V, Birge S. Clock Drawing: Analysis in a retirement community. Journal of the American Geriatrics Society. 2001;49:941–947.
  • Ratcliff G, Dodge H, Birzescu M, Ganguli M. Tracking cognitive functioning over time: Ten-year longitudinal data from a community-based study. Applied Neuropsychology. 2003;10(2):76–88.
  • Rey A. L’examen psychologique dans les cas d’encephalopathie traumatique. Archives de Psychologie. 1941;28:286–340.
  • Roth M, Tym E, Mountjoy CQ, Huppert FA, Hendrie H, Verma S, et al. CAMDEX--A standard instrument for the diagnosis of mental disorder in the elderly with special reference to the early detection of dementia. British Journal of Psychiatry. 1986;149:698–709.
  • Rouleau I, Salmon DP, Butters N, Kennedy C, McGuire K. Quantitative and qualitative analyses of clock drawings in Alzheimer’s and Huntington’s Disease. Brain and Cognition. 1992;18:70–87.
  • Royall DR, Cordes JA, Polk M. CLOX: an executive clock drawing task. Journal of Neurology, Neurosurgery and Psychiatry. 1998;64:588–594.
  • Samton JB, Ferrando SJ, Sanelli P, Karimi S, Raiteri V, Barnhill JW. The Clock Drawing Test: diagnostic, functional, and neuroimaging correlates in older medically ill adults. Journal of Neuropsychiatry and Clinical Neurosciences. 2005;17:533–540.
  • Schramm U, Berger G, Muller R, Kratzch T, Peters J, Frolich L. Psychometric properties of Clock Drawing Test and MMSE or Short Performance Test (SKT) in dementia screening in a memory clinic population (Abstract). International Journal of Geriatric Psychiatry. 2002;17(3):254–260.
  • Seigerschmidt E, Mosch E, Siemen M, Forstl H, Bickel H. The clock drawing test and questionable dementia: reliability and validity. International Journal of Geriatric Psychiatry. 2002;17:1048–1054.
  • Shulman K. Clock-drawing: is it the ideal cognitive screening test? International Journal of Geriatric Psychiatry. 2000;15:548–561.
  • Shulman K, Shedletsky R, Silver I. The challenge of time: clock-drawing and cognitive function in the elderly. International Journal of Geriatric Psychiatry. 1986;1:135–140.
  • Smith A. Symbol Digit Modalities Test. Los Angeles: Western Psychological Services; 1973.
  • Spitznagel M, Tremont G. Cognitive reserve and anosognosia in questionable and mild dementia. Archives of Clinical Neuropsychology. 2005;20:505–515.
  • Spreen O, Strauss E. A Compendium of Neuropsychological Tests: Administration, Norms, and Commentary. 2. New York: Oxford University Press; 1998.
  • Storey JE, Rowland JTJ, Basic D, Conforti DA. A comparison of five clock scoring methods using ROC (receiver operating characteristic) curve analysis. International Journal of Geriatric Psychiatry. 2001;16:394–399.
  • Sunderland T, Hill J, Mellow A, Lawlor BA, Gundersheimer J, Newhouse PA, et al. Clock drawing in Alzheimer’s Disease: a novel measure of dementia severity. Journal of the American Geriatrics Society. 1989;37:725–729.
  • Tuokko H, Hadjistavropoulous T, Miller JA, Beattie BL. The clock test: A sensitive measure to differentiate normal elderly from those with Alzheimer Disease. Journal of the American Geriatrics Society. 1992;40:579–584.
  • Tuokko H, Hadjistavropoulous T, Rae S, O’Rourke N. A comparison of alternative approaches to the scoring of Clock Drawing. Archives of Clinical Neuropsychology. 2000;15(2):137–148.
  • van der Burg M, Bouwen A, Stessens J, Ylieff M, Fontaine O, de Lepeleire J, et al. Scoring clock tests for dementia screening: a comparison of two scoring methods. International Journal of Geriatric Psychiatry. 2004;19:685–689.
  • Wang CC, Kosinski CJ, Schwartzberg JG, Shanklin AV. Physician’s Guide to Assessing and Counseling Older Drivers. Washington, D.C.: National Highway Traffic Safety Administration; 2003.
  • Watson Y, Arfken C, Birge S. Clock completion: an objective screening test for dementia. Journal of the American Geriatrics Society. 1993;41:1235–1240.
  • Wilkinson G. Wide-Range Achievement Test (WRAT-3). 3. Richmond Hill, Ontario, Canada: Psycan Corporation; 1993.
  • Wolf-Klein GP, Silverstone FA, Levy AP, Brod MS. Screening for Alzheimer’s Disease by Clock Drawing. Journal of the American Geriatrics Society. 1989;37:730–734.
  • Yamamoto S, Mogi N, Umegaki H, Suzuki Y, Ando F, Shimokata H, et al. The Clock Drawing Test as a valid screening method for mild cognitive impairment. Dementia and Geriatric Cognitive Disorders. 2004;18:172–179.
  • Yesavage J, Brink T, Rose T, Lum O, Huang V, Adey M, et al. Development and validation of a geriatric depression screening scale: A preliminary report. Journal of Psychiatric Research. 1983;17:37–49.