Arch Clin Neuropsychol. Aug 2011; 26(5): 434–444.
Published online May 18, 2011. doi:  10.1093/arclin/acr042
PMCID: PMC3142950
Difficulty and Discrimination Parameters of Boston Naming Test Items in a Consecutive Clinical Series
Otto Pedraza,* Bonnie C. Sachs, Tanis J. Ferman, Beth K. Rush, and John A. Lucas
Department of Psychiatry and Psychology, Mayo Clinic, Jacksonville, FL, USA
*Corresponding author at: Department of Psychiatry and Psychology, Mayo Clinic, Jacksonville, FL 32224, USA. Tel.: +1-904-953-7286; fax: +1-904-953-0461. E-mail address: otto.pedraza/at/ (O. Pedraza).
Accepted April 21, 2011.
The Boston Naming Test is one of the most widely used neuropsychological instruments; yet, there has been limited use of modern psychometric methods to investigate its properties at the item level. The current study used item response theory to examine each item's difficulty and discrimination properties, as well as the test's measurement precision across the range of naming ability. Participants included 300 consecutive referrals to the outpatient neuropsychology service at Mayo Clinic in Florida. Results showed that successive items do not necessarily reflect a monotonic increase in psychometric difficulty, some items are inadequate to distinguish individuals at various levels of naming ability, multiple items provide redundant psychometric information, and measurement precision is greatest for persons within a low-average range of ability. These findings may be used to develop short forms, improve reliability in future test versions by replacing psychometrically poor items, and analyze profiles of intra-individual variability.
Keywords: Boston Naming Test, Item response theory, Item difficulty, Item discriminability
The Boston Naming Test (BNT) (Kaplan, Goodglass, & Weintraub, 1983) is the most frequently used instrument for the assessment of visual naming ability (Rabin, Barr, & Burton, 2005). Its validity and reliability are well established and reviewed elsewhere (Strauss, Sherman, & Spreen, 2006). Briefly, internal consistency for the 60-item version ranges from r = .78 to .96 across studies. Test–retest stability in cognitively normal adults varies as a function of time interval and sample composition, but generally ranges from r = .59 to .92. Moreover, the BNT correlates highly (r = .76 to .86) with other naming tests, such as the Visual Naming Test from the Multilingual Aphasia Examination.
Although the psychometric properties of the BNT have been established at the global “test” level, few studies have used modern psychometric methods to evaluate the BNT at the “item” level (some studies considered item characteristics at a descriptive level, e.g., Tombaugh & Hubley, 1997). This information can be helpful to develop new short forms, improve test reliability by replacing psychometrically poor items, analyze error patterns or profiles of intra-individual variability, or take into account regional or cultural influences on individual item responses. For instance, Graves, Bezeau, Fogarty, and Blair (2004) used a one-parameter (Rasch) model to analyze the difficulty of BNT items and develop a new short form. Items were excluded from the short form if they were too easy, failed to fit the Rasch model, or had poor loadings on the first component of a principal components analysis.
Item response theory (IRT) is a state-of-the-art measurement approach that uses examinees' item responses to simultaneously estimate each person's underlying (latent) ability and the characteristics of the test items used to measure that ability (Embretson & Reise, 2000; Hambleton & Swaminathan, 1985; Hambleton, Swaminathan, & Rogers, 1991). In this framework, a person's ability level is considered a function of the pattern of unique item responses as well as the parametric properties of the test items. It thus becomes possible to estimate an item's “discrimination” (α), or the degree to which the item distinguishes persons with higher ability from those with lower ability, and “difficulty” (β), the point in the ability scale at which a person has a 50% chance of responding correctly to the item. Models that estimate both item discrimination and difficulty parameters are well suited for the investigation of cognitive tests and abilities (Teresi, 2006).
In IRT, item characteristic curves (ICCs) trace the probability of a correct item response as a function of the underlying ability construct and can be thought of as the regression of an item score on the person's latent ability. Item difficulty is depicted by the location along the x-coordinate where the probability of a correct response for a binary item is 50%, and item discrimination is represented by the slope of the trace line at that location. For instance, Fig. 1 depicts a theoretical test item with a difficulty parameter equal to zero. In this case, a person with an average ability has a 50% chance of responding correctly to the item. In contrast, Fig. 2 depicts a theoretical item with a difficulty parameter equal to −1.0. Because a lesser degree of latent ability is required to obtain a 50% chance of responding correctly, the item in Fig. 2 is considered less difficult than that in Fig. 1. Note also the differences in the discrimination parameters between the two items. The steeper slope (i.e., higher discrimination) for the item in Fig. 2 indicates that it is better at distinguishing persons within a very narrow range of ability. When item discrimination is zero, every person has an equal probability of providing a correct response. In this case, the ICC is flat and the item should be flagged for deletion or replacement from the pool of test items.
Fig. 1.
Theoretical item with discrimination (α) = 2.0 and difficulty (β) = 0.
Fig. 2.
Theoretical item with discrimination (α) = 3.0 and difficulty (β) = −1.0.
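The two theoretical items in Fig. 1 and Fig. 2 can be reproduced numerically. The following is a minimal sketch, assuming the two-parameter logistic form P(θ) = 1 / (1 + exp(−α(θ − β))) with no additional scaling constant; the source does not state which logistic metric was used, and the function name `icc_2pl` is hypothetical.

```python
import math

def icc_2pl(theta, alpha, beta):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

# Parameters taken from the captions of Fig. 1 and Fig. 2.
fig1 = {"alpha": 2.0, "beta": 0.0}
fig2 = {"alpha": 3.0, "beta": -1.0}

for theta in (-2.0, -1.0, 0.0, 1.0):
    p1 = icc_2pl(theta, **fig1)
    p2 = icc_2pl(theta, **fig2)
    print(f"theta={theta:+.1f}  P(Fig. 1 item)={p1:.3f}  P(Fig. 2 item)={p2:.3f}")
```

Under this parameterization each curve crosses 50% exactly at θ = β, its slope at that point is α/4, and an item with α = 0 returns 0.5 at every ability level, which is the flat ICC described above.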
An advantage of IRT over classical test theory methods is that reliability is not constrained to a single coefficient, but instead can be measured continuously over the entire ability spectrum. Reliability in IRT is equivalent to the concept of “information” and is inversely related to the standard error of measurement (Embretson & Reise, 2000). Item, and hence test, information is maximized by higher discrimination parameters and an adequate match between item difficulty and a person's ability level. A further attractive property of IRT models is that item information can be summed to yield a global test information function, which represents the degree of precision for the test at each level of the latent ability.
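The relation between information and standard error can be illustrated with a short sketch. It assumes the same two-parameter logistic model as the figures, for which item information is α²·P(θ)·(1 − P(θ)), and it reuses the two theoretical items as a two-item "test". The function names are hypothetical and the two-item test is for illustration only, not the BNT.

```python
import math

def icc_2pl(theta, alpha, beta):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def item_information(theta, alpha, beta):
    """Fisher information of a 2PL item: alpha^2 * P * (1 - P)."""
    p = icc_2pl(theta, alpha, beta)
    return alpha ** 2 * p * (1.0 - p)

def total_information(theta, items):
    """Test information is the sum of the item informations at theta."""
    return sum(item_information(theta, a, b) for a, b in items)

def standard_error(theta, items):
    """Standard error of measurement is the inverse square root of information."""
    return 1.0 / math.sqrt(total_information(theta, items))

# (alpha, beta) pairs for the theoretical items in Fig. 1 and Fig. 2.
items = [(2.0, 0.0), (3.0, -1.0)]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    info = total_information(theta, items)
    se = standard_error(theta, items)
    print(f"theta={theta:+.1f}  information={info:.3f}  SE={se:.3f}")
```

Each item contributes its maximum information, α²/4, at θ = β, so a test is most precise where discriminating items cluster in difficulty and least precise where few items are located.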
Recently, Pedraza and colleagues (2009) used an IRT approach to evaluate the differential response pattern for BNT items 30–60 in cognitively normal Caucasian and African American adults. Results showed that successive BNT items do not necessarily reflect an increase in psychometric difficulty, many items do not discriminate persons with low versus high naming ability, and a subset of items demonstrate comparable difficulty or discrimination properties, suggesting that these items may be psychometrically redundant. In addition, the BNT showed the greatest measurement precision for individuals with mild naming difficulty.
The current study represents an extension of Pedraza and colleagues (2009) to investigate the item-level properties of the BNT in a prospective series of adult patients with a broad range of naming ability.
Study participants included 300 consecutive referrals to the outpatient clinical neuropsychology service at Mayo Clinic in Florida. Patients were referred predominantly by the Departments of Neurology and Neurosurgery (65%) and Internal Medicine and its subspecialties (19%). Approximately half of the patients were referred for dementia evaluations, with the remainder including epilepsy, normal pressure hydrocephalus, depression, poststroke status, and other medical and neurologic conditions. All patients were evaluated between July 2009 and January 2010. Only those patients whose primary language was English were considered for inclusion. All data were obtained in full compliance with a study protocol approved by the Mayo Clinic Institutional Review Board.
The BNT was administered in ascending order beginning with item 1 and proceeding until item 60. Items were scored as correct or incorrect following standardized instructions (Kaplan et al., 1983). For the purposes of the current investigation, the BNT total raw score represents the sum of all correct items regardless of basal or discontinuation rules. A separate score using basal and discontinuation rules was recorded for the purposes of the clinical examination and will not be considered in this study.
Statistical Analyses
A fundamental assumption in IRT is that the set of test items should measure a single dimension or construct. The dimensionality of the BNT was evaluated using multiple approaches. First, internal consistency was examined using Cronbach's alpha coefficient. Although internal consistency (i.e., alpha > 0.70) does not preclude the presence of multiple constructs, it represents a necessary but insufficient component of unidimensionality and is considered in that context (Gardner, 1995; Schmitt, 1996). Second, an exploratory factor analysis was performed using unweighted least squares extraction, followed by confirmatory factor analysis (CFA) in LISREL (Jöreskog & Sörbom, 1997, 2006) on the tetrachoric covariance matrix using an asymptotic distribution-free (ADF) estimator. A limitation of ADF estimators, however, is that substantially large sample sizes are necessary to generate admissible solutions (Boomsma & Hoogland, 2001). Non-admissible solutions can result from parameter estimates failing to converge after multiple iterations or negative variance estimates due to sampling fluctuations. Given our sample size, as well as our prior experience resulting in non-admissible solutions (Pedraza et al., 2009), robust maximum-likelihood estimation was also considered. The asymptotic covariance matrix was generated using PRELIS 2.0. Model fit was evaluated with the comparative fit index (CFI, values >0.90 indicate better fit) and root-mean-square error of approximation (RMSEA, values <0.10 indicate better fit), as well as the Satorra–Bentler scaled chi-square statistic for the robust model (Satorra & Bentler, 1988). Lastly, unidimensionality was evaluated further using DIMTEST 2.0, a non-parametric conditional covariance-based test (Nandakumar & Stout, 1993; Stout, 1987; Stout, Froelich, & Gao, 2001).
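As an illustration of the first step only, Cronbach's alpha for dichotomously scored items can be computed from a persons-by-items matrix. This is a minimal sketch on a made-up five-person, four-item response matrix (not study data); the function name `cronbach_alpha` is hypothetical, and population variances are used so that the result coincides with KR-20 for 0/1 items.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-person lists of item scores."""
    k = len(responses[0])
    # Variance of each item across persons.
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    # Variance of the total score across persons.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

# Made-up responses (rows = persons, columns = items), for illustration only.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(cronbach_alpha(toy), 3))
```

An item that every person answers correctly has zero variance and therefore contributes nothing to the sum of item variances, although it still counts toward the number of items k.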
Item difficulty and discriminability parameters, standard errors, and summary statistics were obtained using marginal maximum-likelihood estimation in MULTILOG (Thissen, 2003). The characteristic curves for each item were plotted for visual inspection, and the overall test information was calculated to measure reliability across the range of naming ability.
Demographic characteristics and mean BNT data are presented in Table 1. Participants ranged in age from 22 to 92 years, and the majority were Caucasian (>95%). BNT scores were significantly correlated with age (r = −.21, p < .001) and years of education (r = .28, p < .001), but not with sex (r = −.10, p = .10). As expected, internal consistency was high (alpha = 0.91). Exploratory factor analysis revealed a 5.3:1 ratio between the first and second eigenvalues. A single-factor CFA using ADF estimators returned non-admissible solutions, but the use of robust maximum-likelihood estimation yielded a well-fitting single-factor model (CFI = 0.97; RMSEA = 0.0229; Satorra–Bentler scaled χ² = 1717.61, p < .001). A two-factor model did not result in improved fit. Moreover, the result from DIMTEST (T-statistic = 0.99, p = .16) was consistent with the prior dimensionality assessments. Altogether, these findings suggest that the BNT was sufficiently unidimensional to proceed with IRT modeling.
Table 1.
Demographic characteristics and BNT data for 300 patients
BNT total scores ranged from 22 to 60. As shown in Fig. 3, participants provided 100% correct responses to four items (BNT item numbers denoted in parentheses): “bed” (1), “tree” (2), “toothbrush” (10), and “hanger” (15). “Protractor” (59) had the fewest correct responses (19%). The graph in Fig. 3 also illustrates multiple dips or points at which there is a prominent decline in the percent of correct responses for consecutive items. For example, 92% of participants responded correctly to “wreath” (28) and 88% responded correctly to “harmonica” (30), but only 63% responded correctly to “beaver” (29). Similarly, 81% responded correctly to “asparagus” (49), yet 41% responded correctly to the following item, “compass” (50).
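The dips described above can be located programmatically. The following sketch uses only the five percent-correct values quoted in the text; the function name `find_drops` and the 20-point threshold are assumptions chosen for illustration, not criteria used in the study.

```python
# Percent correct quoted in the text, keyed by BNT item number.
percent_correct = {28: 92, 29: 63, 30: 88, 49: 81, 50: 41}
names = {28: "wreath", 29: "beaver", 30: "harmonica", 49: "asparagus", 50: "compass"}

def find_drops(pct, min_drop=20):
    """Return (previous item, item, drop) where percent correct falls by at
    least min_drop points between two consecutively numbered items."""
    drops = []
    for item in sorted(pct):
        prev = item - 1
        if prev in pct and pct[prev] - pct[item] >= min_drop:
            drops.append((prev, item, pct[prev] - pct[item]))
    return drops

for prev, item, drop in find_drops(percent_correct):
    print(f"{names[prev]} ({prev}) -> {names[item]} ({item}): {drop}-point drop")
```

With these values the sketch flags the wreath-to-beaver and asparagus-to-compass transitions; applying it to all 60 items would require the full set of percentages plotted in Fig. 3.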
Fig. 3.
Mean percent correct item responses on the BNT.
Table 2 presents the IRT item discrimination and difficulty parameters. As expected, there was no variance associated with the four items with 100% correct responses. The standard error for items with highly skewed response patterns (e.g., “scissors”, “broom”) could not be defined under maximum-likelihood estimation. Protractor (59) had a negative, near-zero discrimination parameter, suggesting that it was a poor item yielding minimal-to-no psychometric information for the IRT model.
Table 2.
Item discrimination and difficulty parameters for the BNT
Among the remaining items, “comb” (7) showed the highest magnitude of discrimination, followed by “racquet” (21), “saw” (9), “canoe” (26), and “wheelchair” (16). The least discriminating items were “flower” (8), “scissors” (6), “latch” (51), “yoke” (56), and “trellis” (57). These findings are more clearly visualized in Fig. 4, where the items with the highest degree of discrimination show the steepest slopes, and those with the lowest discrimination have relatively flat slopes.
Fig. 4.
Matrix of ICCs for the BNT (Note: ICCs not available for items 1, 2, 10, and 15).
In terms of difficulty, “abacus” (60) exhibited the highest parameter and was followed by compass (50), yoke (56), palette (58), and sphinx (55). As noted earlier, although 81% of participants responded incorrectly to protractor (59), its IRT parametric difficulty could not be properly estimated because the likelihood of responding correctly was nearly equal for any individual along the ability spectrum. Besides the four items in which all participants responded correctly, the next five easiest items were flower (8), scissors (6), broom (12), camel (17), and house (4). Several items had difficulty parameters that suggested a notable discrepancy from their ordered placement on the test. For instance, “acorn” (32) was the 19th easiest item and harp (38) the 22nd easiest item. In contrast, “octopus” (13) was the 36th easiest item and seahorse (24) the 48th easiest item. These results highlight the lack of monotonic increase in psychometric difficulty among successive items.
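The discrepancy between an item's placement and its estimated difficulty can be summarized as a simple displacement score. This sketch uses only the four items and ranks quoted in the text; the helper names are hypothetical, and because the source does not say whether the ranks count the four items that everyone answered correctly, the displacements should be read as approximate.

```python
# (item name, ordinal position on the test, rank from easiest to hardest),
# as quoted in the text.
reported = [
    ("octopus", 13, 36),
    ("seahorse", 24, 48),
    ("acorn", 32, 19),
    ("harp", 38, 22),
]

def displacement(position, difficulty_rank):
    """Positive: harder than its placement suggests; negative: easier."""
    return difficulty_rank - position

def is_nondecreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))

for name, position, rank in reported:
    print(f"{name}: position {position}, difficulty rank {rank}, "
          f"displacement {displacement(position, rank):+d}")

# Difficulty ranks listed in test order; a monotonic test would give True.
ranks_in_test_order = [rank for _, _, rank in sorted(reported, key=lambda r: r[1])]
print(is_nondecreasing(ranks_in_test_order))
```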
Figure 5 displays the global test information curve. The BNT provided the most information (reliability) for individuals in the low-average range of naming ability, or approximately −1.0 standardized units. Measurement error increased considerably when assessing individuals with at least a high-average degree of naming ability.
Fig. 5.
Test information and standard error curves for the BNT.
The present study explores the item-level psychometric properties of the BNT in a clinical outpatient sample and suggests the following: First, each successive BNT item does not necessarily confer a stepwise increase in psychometric difficulty. Easier items generally do group together in the first half of the test and harder items in the second half, but there is marked variability in difficulty levels within smaller clusters of successive items. Second, some BNT items do not discriminate well between individuals at close levels of naming ability, and a few items are simply inadequate to make such distinctions. For instance, scissors, flower, and protractor do not discriminate well within any range of naming ability, and this lack of discrimination is independent of their difficulty level. Third, a subset of items exhibits a comparable degree of difficulty or discrimination, which suggests that these items provide redundant psychometric information. For instance, there is a high degree of redundancy between the following item pairs: octopus and asparagus, racquet and canoe, wreath and harp, and igloo and volcano. It can be expected that excluding an item from each of these pairs would result in negligible psychometric loss. This may be helpful for future derivation of shorter naming tasks using BNT items without a loss of discrimination characteristics. And fourth, the BNT yields the highest degree of measurement precision near the low-average to mildly impaired range of naming ability (i.e., between −1.0 and −1.5 standardized units). Measurement precision remains acceptable in the moderate range of impairment, but declines markedly in the high-average to above-average ability range, likely due to the test's measurement ceiling. In practical terms, this suggests that the BNT is most precise for adults who present to an outpatient clinical practice with an early or mild naming deficit.
These findings in a neurologic and medical outpatient sample are consistent with those reported by Pedraza and colleagues (2009) among cognitively normal older adults. This detailed item-level psychometric information may be useful to supplement the test's total score and more clearly delineate the nature of a patient's naming deficit. A briefer short form could be created empirically for a clinical trial by selecting highly discriminating items located at equidistant intervals along the entire range of difficulty and selecting only those without differential item functioning. For instance, a brief 10-item form could include the following items, ordered from easiest to hardest: saw, comb, mushroom, racquet, harmonica, pyramid, seahorse, beaver, sphinx, and abacus. These items demonstrate relatively equidistant difficulty parameters, relatively high discriminability, and no differential item functioning with respect to Caucasian versus African American adults. These data also could be used to construct alternate test forms in which items have equivalent difficulty and discrimination, and which may be helpful in rehabilitation or other settings requiring repeated evaluations. Moreover, examining a person's pattern of BNT item responses as part of a forensic exam could yield a symptom validation index if the person tends to make a disproportionate number of errors to psychometrically easy items but responds correctly to difficult items.
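The short-form strategy described above (discriminating items, roughly equidistant in difficulty, without differential item functioning) can be sketched as a simple selection routine. Everything in this example is an assumption made for illustration: the item pool and its parameters are invented rather than taken from Table 2, the `min_alpha` cutoff is arbitrary, and `select_short_form` is a hypothetical helper, not the procedure used to derive the 10-item list.

```python
def select_short_form(pool, n_items, min_alpha=1.0):
    """Pick n_items (>= 2) eligible items whose difficulties lie closest to
    equally spaced targets spanning the eligible difficulty range."""
    eligible = [it for it in pool if it["alpha"] >= min_alpha and not it["dif"]]
    if n_items < 2 or len(eligible) < n_items:
        raise ValueError("need at least 2 items and enough eligible items")
    lo = min(it["beta"] for it in eligible)
    hi = max(it["beta"] for it in eligible)
    step = (hi - lo) / (n_items - 1)
    chosen = []
    for i in range(n_items):
        target = lo + i * step
        remaining = [it for it in eligible if it not in chosen]
        # Nearest difficulty wins; higher discrimination breaks ties.
        best = min(remaining, key=lambda it: (abs(it["beta"] - target), -it["alpha"]))
        chosen.append(best)
    return sorted(chosen, key=lambda it: it["beta"])

# Invented item pool (NOT BNT parameters), for illustration only.
pool = [
    {"name": "item_a", "alpha": 2.4, "beta": -3.0, "dif": False},
    {"name": "item_b", "alpha": 0.4, "beta": -2.5, "dif": False},  # low discrimination
    {"name": "item_c", "alpha": 1.8, "beta": -2.0, "dif": False},
    {"name": "item_d", "alpha": 2.0, "beta": -1.9, "dif": True},   # flagged for DIF
    {"name": "item_e", "alpha": 1.5, "beta": -1.0, "dif": False},
    {"name": "item_f", "alpha": 1.7, "beta": 0.1, "dif": False},
    {"name": "item_g", "alpha": 1.2, "beta": 1.0, "dif": False},
]

short_form = select_short_form(pool, n_items=3)
print([it["name"] for it in short_form])
```

A greedy nearest-target rule is only one of several reasonable designs; an alternative would be to choose the subset that maximizes test information over a target ability range.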
A few limitations are worth noting. First, although a key advantage of IRT over classical test theory is that item parameters are invariant across populations, this property holds only when the range of the ability sampled is maximized. Participants in this study obtained BNT total scores ranging from 22 to 60, and only 10 participants had scores between 22 and 29. Thus, the findings may not generalize to patients with acute, language-dominant hemisphere stroke or advanced semantic dementia, who may be expected to make a substantially greater number of errors on the BNT. Also, the clinical sample in this study has limited ethnic minority representation, and our past findings from cognitively normal adults suggest that slight differences in item parameters exist between ethnic groups (Pedraza et al., 2009). Although these results demonstrate a lack of incremental or monotonic difficulty among ordered items, the extent to which this factor may contribute to variation in total test scores under standard basal and discontinuation criteria is unknown. It seems reasonable to assume, however, that such an effect may be negligible because normative values (e.g., MOANS/MOAANS age-scaled scores; Heaton T-scores) generally comprise a range of raw scores rather than a single score. Lastly, it bears noting that these results do not negate the utility of error pattern analyses as originally intended by the BNT authors.
In summary, these results offer additional information regarding the psychometric properties of the BNT that may be useful in clinical practice, research, and future test development or refinement.
This work was supported by the National Institutes of Health (NS054722 to O.P.).
Conflict of Interest
None declared.
We would like to thank Dan Mungas, Ph.D., for helpful comments on an earlier portion of the manuscript. We also extend our gratitude to our wonderful team of psychometrists: Diana Achem, Cameron Griffin, Ashley Marshall, Jill McBride, Wendy Mercer, and Sonya Prescott.
  • Boomsma A., Hoogland J. J. The robustness of LISREL modeling revisited. In: Cudeck R., du Toit S., Sörbom D., editors. Structural equation models: Present and future. Lincolnwood, IL: Scientific Software International; 2001. pp. 139–168.
  • Embretson S. E., Reise S. P. Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates; 2000.
  • Gardner P. L. Measuring attitudes to science: Unidimensionality and internal consistency revisited. Research in Science Education. 1995;25(3):283–289. doi:10.1007/BF02357402.
  • Graves R. E., Bezeau S. C., Fogarty J., Blair R. Boston naming test short forms: A comparison of previous forms with new item response theory based forms. Journal of Clinical and Experimental Neuropsychology. 2004;26(7):891–902. doi:10.1080/13803390490510716.
  • Hambleton R. K., Swaminathan H. Item response theory. Principles and applications. Boston: Kluwer-Nijhoff Publishing; 1985.
  • Hambleton R. K., Swaminathan H., Rogers H. J. Fundamentals of item response theory. Newbury Park, CA: Sage Publications; 1991.
  • Jöreskog K. G., Sörbom D. LISREL 8: User's reference guide. 2nd ed. Chicago, IL: Scientific Software International; 1997.
  • Jöreskog K. G., Sörbom D. LISREL 8.80. Chicago, IL: Scientific Software International; 2006.
  • Kaplan E., Goodglass H., Weintraub S. The Boston Naming Test. Philadelphia: Lea & Febiger; 1983.
  • Nandakumar R., Stout W. Refinements of Stout's procedure for assessing latent trait unidimensionality. Journal of Educational Statistics. 1993;18:41–68. doi:10.2307/1165182.
  • Pedraza O., Graff-Radford N. R., Smith G. E., Ivnik R. J., Willis F. B., Petersen R. C., et al. Differential item functioning of the Boston Naming Test in cognitively normal African American and Caucasian older adults. Journal of the International Neuropsychological Society. 2009;15(5):758–768. doi:10.1017/S1355617709990361.
  • Rabin L. A., Barr W. B., Burton L. A. Assessment practices of clinical neuropsychologists in the United States and Canada: A survey of INS, NAN, and APA Division 40 members. Archives of Clinical Neuropsychology. 2005;20(1):33–65. doi:10.1016/j.acn.2004.02.005.
  • Satorra A., Bentler P. M. Scaling corrections for chi-square statistics in covariance structure analysis. American Statistical Association 1988 proceedings of the business and economics section; Alexandria, VA: American Statistical Association; 1988. pp. 308–313.
  • Schmitt N. Uses and abuses of coefficient alpha. Psychological Assessment. 1996;8(4):350–353. doi:10.1037/1040-3590.8.4.350.
  • Stout W. A nonparametric approach for assessing latent trait unidimensionality. Psychometrika. 1987;52(4):589–617. doi:10.1007/BF02294821.
  • Stout W., Froelich A., Gao F. Using resampling methods to produce an improved DIMTEST procedure. In: Boomsma A., van Duijn M. A. J., Snijders T. A. B., editors. Essays on item response theory. New York: Springer-Verlag; 2001. pp. 357–376.
  • Strauss E., Sherman E. M. S., Spreen O. A compendium of neuropsychological tests: Administration, norms, and commentary. 3rd ed. New York: Oxford University Press; 2006.
  • Teresi J. A. Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care. 2006;44(11 Suppl. 3):S152–S170. doi:10.1097/01.mlr.0000245142.74628.ab.
  • Thissen D. MULTILOG 7.0: Multiple, categorical item analysis and test scoring using item response theory. Chicago: Scientific Software International; 2003. doi:10.1111/j.1745-3984.1990.tb00754.x.
  • Tombaugh T. N., Hubley A. M. The 60-item Boston Naming Test: Norms for cognitively intact adults aged 25 to 88 years. Journal of Clinical and Experimental Neuropsychology. 1997;19(6):922–932. doi:10.1080/01688639708403773.