The Clinical Dementia Rating (CDR; [1]) is a global measure of dementia representing five stages, ranging from “no impairment” (CDR = 0) to “severe dementia” (CDR = 3) [2]. The CDR is frequently used as an entry criterion and/or a primary outcome measure in Alzheimer’s disease clinical trials [3], and it is often used as the basis for a consensus decision: CDR results are reviewed by a committee of practitioners, and transition from one severity stage to another is determined by consensus. In this context, it is critical that raters are consistent not only with other raters given the same patients and information, but also within themselves, identifying the same dementia severity level consistently across patients. More particularly, it is essential for the validity of a study that the highest level of rater accuracy be achieved *prior* to the start of enrollment. This accuracy is a typical endpoint of training for personnel across multiple sites before a multi-site clinical study begins.

Reports of the validity and reliability of the CDR have been based on 3–80 raters who applied the CDR to between 3 and 15 cases each [4–6]. Whether the consistency of only a few raters (e.g., a clinical practice) or of a large group of personnel (e.g., a multi-center clinical study) is the outcome of interest, estimates of agreement based on small test samples (<10) will have large associated standard errors (see, e.g., [7]). Additionally, large-sample methods for estimating confidence intervals around the kappa estimate assume no fewer than 20 [8–9] and preferably at least 25–50 rated cases [10]. Thus, it is important to test each rater on a larger sample set than has been reported to date. In simulations, the empirically derived lower bound of the 95% confidence interval for kappa was 0.6 when 20 ratings of dichotomous items yielded an estimated kappa of 0.80 [9]. This is not a function of the agreement statistic that is employed, but rather depends on how the confidence interval for the statistic is calculated. Cohen’s kappa statistic has many detractors (e.g., see [11], pp. 35–37; [12], p. 31) but is widely used. It is interpreted as the level to which independent raters will agree after taking into account the agreement expected by chance [13–14]. However, in any set of ratings, kappa can only be estimated; as such, the variability of this agreement characterization must be acknowledged.
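To make the chance correction concrete, the following minimal sketch computes Cohen's kappa as (observed agreement − chance agreement) / (1 − chance agreement) for one rater against a reference rating. The function name `cohen_kappa` and the example ratings are illustrative assumptions, not data or code from this study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings.

    Illustrative helper (hypothetical name), not the study's own code.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two non-empty rating sequences of equal length")
    n = len(ratings_a)

    # Observed proportion of cases on which the two ratings agree exactly.
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance: sum over categories of the product
    # of the two raters' marginal proportions.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_chance == 1.0:
        # Both raters used one identical category throughout: kappa is undefined.
        return float("nan")
    return (p_obs - p_chance) / (1.0 - p_chance)


# Made-up example: ten cases, two per CDR level, with two disagreements.
gold = [0, 0, 0.5, 0.5, 1, 1, 2, 2, 3, 3]
rater = [0, 0, 0.5, 1, 1, 1, 2, 2, 3, 2]
kappa_hat = cohen_kappa(gold, rater)  # observed 0.80, chance 0.20, kappa 0.75
```

With only ten cases, a single additional disagreement would move this estimate substantially, which is the variability the paragraph above refers to.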

Landis and Koch [13] provided guidance for classifying/interpreting kappa estimates ($\hat{\kappa}$) as reflective of “poor” ($\hat{\kappa}$ < 0.0), “slight” (0.0 ≤ $\hat{\kappa}$ ≤ 0.2), “fair” (0.21 ≤ $\hat{\kappa}$ ≤ 0.4), “moderate” (0.41 ≤ $\hat{\kappa}$ ≤ 0.6), “substantial” (0.61 ≤ $\hat{\kappa}$ ≤ 0.8), and “almost perfect” (0.81 ≤ $\hat{\kappa}$ ≤ 1.0) agreement among raters. The motivation for the present work is that, due to variability in estimation, it is the lowest probable value for this estimate of agreement (i.e., the lower bound of the confidence or credibility interval) that should be classified along this continuum, and not the estimate itself. For example, a one-sided 95% confidence interval constructed around any calculated value of $\hat{\kappa}$ should have a lower bound no smaller than 0.61, particularly when training multiple users of an instrument, such as the CDR, in a clinical study. With this criterion, the true level of agreement in the sample is likely to be captured in the interval 0.61–1.0. When too few ratings are performed, the confidence interval can be mis-estimated, resulting in a much wider interval than 0.61–1.0.
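The proposal to classify the lower bound rather than the point estimate can be sketched as follows. The function names are hypothetical, and assigning values that fall between the published two-decimal bands (e.g., 0.605) to the lower band is an assumption of this sketch, not part of the Landis and Koch guidance.

```python
# Lower edge of each Landis and Koch band, highest first.
LANDIS_KOCH_FLOORS = [
    (0.81, "almost perfect"),
    (0.61, "substantial"),
    (0.41, "moderate"),
    (0.21, "fair"),
    (0.00, "slight"),
]


def landis_koch_label(value):
    """Return the Landis and Koch label for a kappa value (or interval bound)."""
    if value > 1.0:
        raise ValueError("kappa cannot exceed 1.0")
    for floor, label in LANDIS_KOCH_FLOORS:
        if value >= floor:
            return label
    return "poor"  # negative kappa: agreement worse than chance


def lower_bound_is_acceptable(lower_bound, floor=0.61):
    """True if the lower confidence bound reaches at least 'substantial'."""
    return lower_bound >= floor


# The simulation result cited above [9]: estimate 0.80, lower bound 0.6.
estimate_label = landis_koch_label(0.80)  # "substantial"
bound_label = landis_koch_label(0.60)     # "moderate"
acceptable = lower_bound_is_acceptable(0.60)  # False
```

In that cited example, the point estimate alone would be read as “substantial” agreement, whereas the lower bound supports only “moderate”.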

In the context of a multi-center clinical trial, the ‘test’ for rater certification might provide estimates of each individual’s agreement with the ‘correct’ rating, and the precision of this estimate must be high. The ideal situation is both a high level of agreement and a high value for the lower bound of the confidence interval around the estimate. Thus, training programs should strive to achieve a calculated kappa value in the “excellent” range (i.e., at least 0.80 [13]); we also suggest that the lower bound of the confidence interval should be no lower than “substantial” (i.e., at least 0.61 [13]). This study describes the estimation of the optimal number of test case ratings, and their distribution across the possible CDR levels, for certifying this combined criterion (point estimate and lower bound) for concordance and consistency among multiple CDR raters. Simulations were carried out to establish the relative sizes of confidence intervals around kappa statistics (agreement with the gold standard rating) based on testing sample size, distribution of testing sample CDR stages, and previously reported levels of agreement.
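As a rough illustration of this kind of simulation, the sketch below estimates how the empirical lower bound of kappa (agreement with a gold standard) changes with the number of test cases. It is not the study's simulation design: the error model (a rater matches the gold standard with a fixed probability and otherwise picks another CDR level uniformly at random), the balanced distribution of cases across levels, the use of the empirical 5th percentile as a one-sided 95% lower bound, and all function and parameter names are assumptions made for illustration.

```python
import random
from collections import Counter

CDR_LEVELS = [0, 0.5, 1, 2, 3]


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    n = len(ratings_a)
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_obs - p_chance) / (1.0 - p_chance)


def simulate_lower_bound(n_cases, accuracy, n_sims=2000, seed=0):
    """Return (mean kappa, empirical 5th percentile) over simulated raters.

    Assumed error model: each rating equals the gold standard with
    probability `accuracy`; otherwise a different level is chosen uniformly.
    """
    if n_cases < 2:
        raise ValueError("need at least two cases so the gold standard varies")
    rng = random.Random(seed)
    # Balanced gold standard: cases cycle through the five CDR levels.
    gold = [CDR_LEVELS[i % len(CDR_LEVELS)] for i in range(n_cases)]
    kappas = []
    for _ in range(n_sims):
        rated = [
            g if rng.random() < accuracy
            else rng.choice([c for c in CDR_LEVELS if c != g])
            for g in gold
        ]
        kappas.append(cohen_kappa(gold, rated))
    kappas.sort()
    mean_kappa = sum(kappas) / n_sims
    lower_bound = kappas[int(0.05 * n_sims)]  # one-sided 95% empirical bound
    return mean_kappa, lower_bound


# Compare test sets of different sizes at one assumed accuracy level.
for n_cases in (10, 20, 50):
    mean_kappa, lower_bound = simulate_lower_bound(n_cases, accuracy=0.85)
    print(n_cases, round(mean_kappa, 2), round(lower_bound, 2))
```

Under these assumptions, the mean kappa is similar across test set sizes while the lower bound rises as the number of cases grows, which is the pattern that motivates choosing the test set size with the lower bound in mind.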