The present study sought to investigate differential item functioning (DIF) on the Boston Naming Test between African American and Caucasian adults aged 52 and older. Using IRT-based methodology, we found that 12 of the studied items demonstrated DIF, suggesting that the conditional probability of responding correctly to these items differed significantly between the two groups after matching on latent naming ability. The items “dominoes” and “escalator” showed both uniform and nonuniform DIF, reflecting nonequivalence in the difficulty and discriminability parameters between the two groups. The item “rhinoceros” showed DIF in the discriminability parameter only, whereas the items “muzzle,” “unicorn,” “noose,” “latch,” “tripod,” “scroll,” “tongs,” “palette,” and “protractor” showed DIF only in the difficulty parameter.
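The distinction between uniform and nonuniform DIF can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) item response function, which is consistent with the difficulty and discriminability parameters discussed here but is our assumption for illustration. All parameter values below are hypothetical and are not the study's estimates.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta),
    with discrimination a and difficulty b (assumed model form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters for one item in a reference and a focal group.
# Uniform DIF: same discrimination, different difficulty.
uniform_ref = {"a": 1.2, "b": 0.0}
uniform_focal = {"a": 1.2, "b": 0.5}
# Nonuniform DIF: discrimination differs, so the two curves cross.
nonuni_ref = {"a": 1.5, "b": 0.0}
nonuni_focal = {"a": 0.7, "b": 0.0}

def gap(theta, ref, focal):
    """Difference in P(correct) between groups at the same ability level."""
    return p_correct(theta, **ref) - p_correct(theta, **focal)

for theta in (-1.0, 0.0, 1.0):
    print(theta,
          round(gap(theta, uniform_ref, uniform_focal), 3),
          round(gap(theta, nonuni_ref, nonuni_focal), 3))
```

Under these assumed values, the uniform-DIF gap keeps the same sign at every ability level, whereas the nonuniform-DIF gap changes sign where the two curves cross.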
There has been considerable debate within the DIF literature about the extent to which item parameter estimates, and hence DIF detection, are dependent on the methods used to calculate those parameters. An emerging opinion promotes the use of multiple computational procedures for DIF detection. To minimize the likelihood that the current findings were specific to the use of IRT, we reanalyzed the data using hierarchical logistic regression models with item response as the binary dependent outcome in each model. Results showed 14 items to demonstrate DIF, with 6 of these items considered to have at least “moderate” DIF based upon suggested criteria. These 6 items (“dominoes,” “escalator,” “muzzle,” “latch,” “tripod,” “palette”) were similarly identified as demonstrating DIF through the IRT analyses and represent the strongest evidence for race/ethnicity-based DIF in the Boston Naming Test.
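The logic of a hierarchical logistic regression approach to DIF can be sketched as follows. This is an illustrative reconstruction rather than the analysis code used in the study: it assumes three nested models (matching variable only; plus group; plus the group-by-ability interaction), compares them with likelihood-ratio chi-square statistics, and runs on simulated data. The function names, the simulated effect sizes, and the use of a standard normal matching variable are all hypothetical, and the effect-size criteria used to grade DIF as “moderate” are not reproduced.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_loglik(X, y, iters=25):
    """Fit a logistic regression by Newton-Raphson and
    return the maximized log-likelihood."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += w * xi[i] * xi[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll

def simulate(n=800, seed=0):
    """Simulated responses to one item with uniform DIF only
    (hypothetical: the item is 1 logit harder for group 1)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.randint(0, 1)
        ability = rng.gauss(0.0, 1.0)
        p = sigmoid(1.2 * ability - 1.0 * group)
        rows.append((ability, group, 1 if rng.random() < p else 0))
    return rows

def dif_tests(rows):
    """Likelihood-ratio chi-squares (1 df each) for the nested models."""
    y = [r[2] for r in rows]
    m1 = [[1.0, a] for a, g, _ in rows]            # ability only
    m2 = [[1.0, a, g] for a, g, _ in rows]         # + group (uniform DIF)
    m3 = [[1.0, a, g, a * g] for a, g, _ in rows]  # + interaction (nonuniform DIF)
    ll1, ll2, ll3 = (fit_loglik(X, y) for X in (m1, m2, m3))
    return {"uniform_chi2": 2 * (ll2 - ll1), "nonuniform_chi2": 2 * (ll3 - ll2)}

print(dif_tests(simulate()))
```

Each statistic can be compared against the 1-df chi-square critical value of 3.84 at the .05 level. In practice the matching variable would be an observed ability measure such as the total score, and a statistical package would replace the hand-written fitting routine.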
The presence of DIF on the Boston Naming Test is problematic from two broad perspectives. First, it raises some concerns about the construct validity of the test when a construct-irrelevant aspect, namely race or ethnic group membership, is associated with nonequivalence in the conditional probability of obtaining a correct item response. In other words, after matching individuals on naming ability level, at minimum 6 of the 31 items studied using IRT and logistic regression methods (or 12 of 31 studied items using IRT methods only) demonstrate a significantly different probability that a person will respond correctly, solely as a result of their racial/ethnic group membership. Second, it bolsters the notion that the use of ethnicity-based norms to evaluate the clinical impact of the summed total score may mask psychometric problems that are present at the item level.
The current results present additional psychometric information about the Boston Naming Test that, to our knowledge, has not been previously reported. Specifically, examination of the IRT difficulty and discriminability parameters shows lack of a monotonic relationship among ordered items. Despite administration rules conforming to an incremental administration, each successive item does not necessarily represent a psychometric increase in difficulty when compared to its previous item. In fact, the observed pattern most closely resembles an oscillating profile of increasing and decreasing item difficulty as administration progresses from items 30 through 60. This pattern is evident in both the African American and Caucasian participant samples. Additionally, one would expect uniformity in the degree to which each item differentiates those with better or worse naming ability. However, the discrimination parameters show considerable variability from item to item throughout both groups of adults. Graves et al. (2004) previously applied a 1-parameter Rasch model to a mixed sample of 206 adults (n = 62 considered cognitively normal) in order to develop a short version of the BNT; but unfortunately, item difficulty parameters were not reported. Additional IRT-based investigations certainly appear warranted to better understand the finer psychometric properties of this instrument, with a particular emphasis on item parameters across the full range of naming performance in normal and cognitively impaired populations.
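The non-monotonic ordering described above can be checked mechanically once difficulty estimates are available. The following is a minimal sketch. The item labels and difficulty values are invented for illustration and are not the study's estimates, and the function name is hypothetical.

```python
def difficulty_reversals(items):
    """Return (earlier, later) label pairs where the later-administered item
    has a lower estimated difficulty than the item immediately before it."""
    return [
        (items[i][0], items[i + 1][0])
        for i in range(len(items) - 1)
        if items[i + 1][1] < items[i][1]
    ]

# Hypothetical (item label, estimated difficulty b) pairs in administration order.
example = [
    ("item_A", -1.2),
    ("item_B", -0.4),
    ("item_C", -0.9),
    ("item_D", 0.3),
    ("item_E", 0.1),
]

reversals = difficulty_reversals(example)
print(reversals)
```

An empty result would indicate strictly non-decreasing difficulty across the administration order. Each returned pair marks a point at which the oscillating profile turns downward.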
From a practical standpoint, these results could be utilized in a future refinement of the BNT to mitigate item bias or differential functioning. Specifically, the items shown here to be free of DIF could be retained in a future revision of the test, with reordering of those items based upon estimated rather than hypothesized difficulty parameters. Alternatively, a scoring algorithm could be devised and implemented to weight responses from DIF-free items more heavily than responses from DIF-loaded items. Given the widespread use of the BNT within and beyond neuropsychology, and the existing large normative datasets across the developmental span, a new scoring algorithm may be the more practical of the two options.
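One simple form such a scoring algorithm could take is sketched below. It is a hypothetical illustration, not a validated scoring rule. The weight applied to DIF-loaded items is arbitrary, the labels `item_X` and `item_Y` stand in for unspecified DIF-free items, and the flagged set uses the six items identified by both analytic methods.

```python
def weighted_score(responses, dif_flags, dif_weight=0.5):
    """Sum item responses, down-weighting items flagged for DIF.

    responses:  dict of item label -> 1 (correct) or 0 (incorrect)
    dif_flags:  set of item labels flagged as showing DIF
    dif_weight: hypothetical weight in [0, 1] applied to flagged items
    """
    total = 0.0
    for item, correct in responses.items():
        weight = dif_weight if item in dif_flags else 1.0
        total += weight * correct
    return total

# Items identified as showing DIF by both IRT and logistic regression.
flagged = {"dominoes", "escalator", "muzzle", "latch", "tripod", "palette"}

# Hypothetical response record for one examinee.
responses = {"dominoes": 1, "escalator": 0, "latch": 1, "item_X": 1, "item_Y": 1}

print(weighted_score(responses, flagged))
```

Setting `dif_weight` to 0 reduces this to scoring on the DIF-free items only, which corresponds to the item-retention option. Any non-unit weight would require new normative data before clinical use.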
Potential limitations to the current investigation include restricting the range of the BNT to items 30 through 60, restricting the participant sample to cognitively normal older adults, the characterization of the two groups, and possible multidimensionality of the data from Caucasian adults. First, item range restriction was necessary for psychometric reasons. The test was administered using standardized rules that instruct examiners to begin with item 30 and proceed until a basal level of 8 consecutive correct responses is reached. Fewer than one-third of our cognitively normal participant sample failed to reach this basal level. As a result, the majority of items 1-29, and particularly items 1-25, were administered to fewer than 200 subjects per group. IRT analyses of these items would have resulted in unstable and perhaps misleading parameter estimates.
Our sample was restricted to cognitively normal adults for two reasons. First, the principal goal of the study was to investigate the psychometric properties of the BNT at the item level, with a particular focus on differential functioning between two racial/ethnic groups. The relationship between item parameters and clinical dysfunction is not essential to understanding DIF, as the analytic method presupposes matching on ability level. Second, restricting the sample to cognitively normal adults minimized the likelihood that an unequal distribution between the two groups of clinical factors unrelated to naming (e.g., reduced visual acuity or perception, slowed processing speed) could have contributed to differential performance across items. The relationship between item parameters and naming dysfunction remains an important topic of investigation and will be pursued in future studies.
The two participant groups consisted of self-identified African American and Caucasian adults who have taken part in the Mayo normative studies. Numerous publications over the past 19 years have described the methodology used to select the participant sample. Test selection, administration, and scoring are comparable between the two Mayo sites in Rochester, Minnesota, and Jacksonville, Florida. Because all of the African American adults were recruited in Jacksonville, it could reasonably be argued that the current results represent DIF based upon geographic (i.e., north vs. south) rather than ethnicity-based factors. This certainly presents a caveat to the findings discussed above as well as a testable hypothesis. Unfortunately, we do not have a sufficiently large sample of African American and Caucasian adults within both sites to examine DIF based on geographic distribution.
Lastly, it is possible that the BNT data obtained from Caucasian adults were not sufficiently unidimensional to meet IRT assumptions, although it is unclear why ‘naming’ should be a more strongly unitary construct in one group than in the other. To our knowledge, the invariance properties of the BNT have yet to be uniformly established across these two groups, and this may represent another topic worthy of further study.
In sum, the current investigation highlights the benefits of modern psychometric methods in the investigation of between-group discrepancies. Our findings suggest that the unexamined use of race/ethnicity-based norms, although necessary in clinical decision-making, potentially masks underlying psychometric problems that may contribute to between-group discrepancies at the item level. The degree to which these findings extend beyond the BNT to other established and commonly used neuropsychological instruments remains largely unexplored.