The National Institutes of Health Stroke Scale (NIHSS) is a widely used measure of neurological function in clinical trials and patient assessment; inter-rater scoring variability could impact communication among clinicians and the power of clinical trials. The manner in which the rater certification test is scored yields multiple correct answers, which have changed over time. We examined the range of possible total NIHSS scores derived from answers given on certification tests by over 7,000 individual raters who completed certification testing.
We analyzed the results of all raters who completed one of two standard multiple-patient videotaped certification examinations between 1998 and 2004. The range of correct total scores, calculated using the NIHSS ‘correct answers’, was determined for each patient. The distribution of scores from those who passed the certification test was then examined.
A total of 6,268 raters scored 5 patients on Test 1; 1,123 scored 6 patients on Test 2. Using a National Stroke Association (NSA) answer key, we found that every patient had multiple correct total scores, ranging from 2 to as many as 12 different correct total scores per patient. Among raters who achieved a passing score and were therefore qualified to administer the NIHSS, score distributions were even wider, with 1 certification patient receiving 18 different total scores.
Allowing multiple acceptable answers for questions on the NIHSS certification test introduces scoring variability. It seems reasonable to assume that the wider the range of acceptable answers in the certification test, the greater the variability in the performance of the test in trials and clinical practice by certified examiners. Greater consistency may be achieved by deriving a set of ‘best’ answers through expert consensus on all questions where this is possible, then teaching raters how to derive these answers using a required interactive training module.
The National Institutes of Health Stroke Scale (NIHSS) has proven to be a valuable tool for assessing neurological impairment and predicting stroke outcome [1, 2]. The NIHSS is the culmination of a process that began with the need to develop a systematic means of clinically measuring the severity of cerebral infarction in light of the introduction of new therapies for acute stroke. Previously limited clinical measurements and unvalidated rating scales were formalized in a scale developed at the University of Cincinnati Stroke Center as a system for examining patients with acute cerebral infarction. The scale was developed further by researchers at the National Institute of Neurological Disorders and Stroke (NINDS) to quantify patient status and outcomes in stroke clinical trials. Early versions of the scale, tested among stroke researchers on both live [1, 3] and videotaped patients, all showed poor reliability for items intended to score facial movement, limb ataxia, and dysarthria, with moderate or excellent agreement on the remaining items. In 1994, a videotaped training and certifying instrument was devised to systematically train investigators and coordinators of clinical trials to rate patients as consistently as possible on the NIHSS. This is particularly important for multi-site clinical trials in which multiple investigators in different centers measure outcomes at variable times; inconsistency in the use of the scale might affect power and the ability to detect a true effect [3,4,5]. The development of a required, uniform certification process, and the use of actual patients for scoring, has been an important step in making clinical trials in stroke more rigorous.
This training/certification process was primarily aimed at skilled, trained stroke neurologists who were presumed to have prior experience in using the NIHSS. Over time, however, the NIHSS certification process has expanded; as greater numbers of institutions seek Joint Commission certified stroke center status, an increasing number of those seeking certification to administer the scale have little or no training in neurology, and include study personnel with no formal medical training [6, 7]. This group diverges appreciably from those for whom good test-retest reliability among different raters was reported by Brott et al. It has become increasingly important, therefore, to determine whether the current training portion of the certification process adequately teaches all examiners, particularly those without prior training in stroke neurology or without medical training, to score patients comparably. In addition to extending beyond its originally intended administrators, the NIHSS has been utilized beyond its original and validated purpose into areas not currently supported by high-quality clinical evidence, such as specifying clinical trial inclusion criteria [8, 9], defining clinically important change from pre- to post-treatment [1, 7], functioning as a primary [7, 10] or secondary [8,11,12,13,14,15,16] endpoint in trials, determining differences between treatment groups in some studies [1, 17], assessing stroke severity, and planning patient care by determining appropriate treatment for stroke patients [18, 19]. While the scale should not be used for purposes for which it has not been adequately validated, widespread documentation of such use places an even greater burden on the certification process.
However, the neurologic examination items that have previously been found difficult for raters to score continue to be problematic. Some of these, when viewed in a video image, are difficult to characterize despite multiple-angle presentations. To overcome the inherent limitations of video technology, the scoring algorithm for the certification test allows multiple acceptable answers for many of the 15 individual items on which each patient is scored. Because the overall NIHSS score is a summation of the scores on these 15 individual stroke characteristics, items with multiple allowable scores necessarily impact the reliability of the total score. This grading practice, whereby a single ‘best’ answer has never been agreed upon, was more feasible when certification was aimed primarily at trained stroke neurologists. However, given the much wider use of the scale, this practice may well be detrimental to the teaching aspect that is key to the certification process. We aimed to determine whether the method for setting criteria for passing the certification tests contributes to an unacceptable level of scoring variability, which could then be carried over into use of the scale in actual clinical practice. Such a finding could potentially lead to improvements to the training segment of the certification process that might increase scoring consistency of the NIHSS.
Analyses in this paper are based on the original VHS certification tapes, developed in 1994. While the current certification videos have changed, thousands of raters were certified using the original version, thus impacting research data for a number of years. These original tapes consisted of a 45-min taped training program and two taped certification tests. After watching the training tape, raters were asked to score 5 videotaped patients on Certification Tape 1. Six months later, each investigator was to review the training tape and score the 6 patients on Certification Tape 2. After completion of the NINDS rt-PA for Acute Stroke Trial, the National Stroke Association (NSA) obtained copies of the training and certification videotapes for scoring and determination of eligibility to administer the NIHSS outside of the original clinical trial. While a training module was provided, examiners were not required to complete it before attempting certification, and raters did not receive any immediate feedback on their performance, either during or after the test.
To score the test, an outlier method of grading was developed for use in the NINDS rt-PA for Acute Stroke Trial. Initially, answers chosen by at least 12% of the original 162 NINDS t-PA investigators were deemed to be acceptable, i.e., ‘correct’. Responses chosen by less than 12% were considered to be outliers. Rater tests with no more than five outliers on Test 1 or six outliers on Test 2 were considered passing; these raters were judged to be competent to administer the NIHSS during the trial. Given the expectation that real-world users would not be able to match answers given by experts who were present at the time the video was recorded, this score sheet was updated periodically (using the same 12% outlier rule and grading method) to reflect answers given by real-world raters [P.D.L., pers. commun.].
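The outlier method described above can be sketched in a few lines of code. This is a minimal illustration, not the actual NINDS/NSA implementation; the function names and the example item identifiers are hypothetical, and the pass threshold is parametrized to reflect the per-test limits described in the text.

```python
from collections import Counter

# Per the 12% rule: answers chosen by fewer than 12% of the
# reference rater pool are treated as outliers.
OUTLIER_THRESHOLD = 0.12

def acceptable_answers(responses):
    """Given one item's answers from the reference rater pool,
    return the set of non-outlier ('correct') answers."""
    counts = Counter(responses)
    n = len(responses)
    return {ans for ans, c in counts.items() if c / n >= OUTLIER_THRESHOLD}

def count_outliers(test_answers, answer_key):
    """Count a candidate's outlying responses across all items.
    answer_key maps item id -> set of acceptable answers."""
    return sum(1 for item, ans in test_answers.items()
               if ans not in answer_key[item])

def passes(test_answers, answer_key, max_outliers):
    # Outliers are totaled over all items on all patients; a test with
    # no more than max_outliers outlying responses is a pass.
    return count_outliers(test_answers, answer_key) <= max_outliers
```

Note that because the answer key is re-derived from rater responses over time, the set returned by `acceptable_answers` can drift as the rater population changes, which is the adaptive behavior the paper critiques.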
In order to determine how much variability this grading system potentially introduced into the NIHSS score, we examined data from all certification tests submitted to the NSA between December 1, 1998 and August 24, 2004, using an NSA answer key that was based on the original NINDS scoring algorithm. The two available certification videotapes included a total of 11 unique patients, with 15 items (counting sub-items) for each patient. A total NIHSS score could be calculated only for those patients for whom all 15 questions were answered; we included only those certification tests for which a complete score could be calculated for every patient included in the test. For raters who took the same test more than once, we included only their first test, whether or not they received a passing score on this test.
Using NIHSS ‘acceptable answers’, we first calculated ‘correct’ scores for each of the 11 test patients. This consisted of summing the answers for the 15 items pertaining to each patient, taking into account those items for which multiple answers were considered to be acceptable. We thus determined for each videotaped patient the NIHSS scores that would be deemed correct. We then scored rater certification tests to determine which raters had passed the test, and examined the distribution of individual NIHSS patient scores in this group of raters (i.e., those judged to be qualified to administer the stroke scale).
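The calculation of ‘correct’ total scores amounts to summing one acceptable answer per item over every combination. A minimal sketch follows; the per-item acceptable sets in the usage example are hypothetical and much smaller than a real 15-item patient, since they are meant only to show how multiple acceptable answers multiply into a range of correct totals.

```python
from itertools import product

def correct_totals(acceptable_per_item):
    """Given a list of sets (one set of acceptable answers per scale
    item), return every total score obtainable by summing one
    acceptable answer per item, in ascending order."""
    return sorted({sum(combo) for combo in product(*acceptable_per_item)})

# Hypothetical 3-item patient: one item is unambiguous, two items
# each allow two answers; four distinct totals are possible at most,
# but overlapping sums can collapse the set.
example = [{0}, {0, 1}, {2, 3}]
```

Applied to `example`, the acceptable totals are 2, 3, and 4, i.e., three different ‘correct’ NIHSS scores for a single patient from only two ambiguous items.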
Our study included a total of 9,171 certification examinations completed over a 6-year period. Of these, 7,681 raters took Test 1 and 1,480 raters took Test 2. After excluding incomplete and repeat tests, 7,405 tests were eligible for these analyses: 6,268 unique raters each scoring 5 patients on Test 1 and 1,123 scoring 6 patients on Test 2. An unknown number of these users would have received training in using the scale, but this training was not required.
As a result of the outlier method of scoring, we found in our representative set of correct answers that 20 of 75 questions (27%) on Test 1 and 26 of 90 questions (29%) on Test 2 had multiple ‘correct’ answers. When combined into total NIHSS scores for each patient, a range of ‘correct’ total scores was possible for each, and no patient was characterized by a single best total score (table 1). In the best case, 1 patient had 2 acceptable total scores; in the most extreme case, a patient could be correctly assigned any of 12 different total scores. For this latter patient, as a result, any score between 24 and 35 would be considered correct by the established grading standards. The average number of ‘correct’ total scores per patient was 5.5.
By accepted grading standards, a rater was considered to have passed the certification test if no more than five outlying responses for Test 1 or six outliers for Test 2 were chosen. (While this reflects one outlier per patient on average, in actuality, outliers are totaled over all patients, and not on a per patient basis.) Of the 6,268 unique raters who scored Test 1, 4,396 (70%) received a passing score; 1,059 (94%) of the 1,123 raters who took Test 2 passed the certification test. Among these raters who achieved a passing score and therefore were certified to administer the NIHSS, the distribution of total scores for the individual patients varied even more widely. The minimum number of different assigned scores was 4 for 1 patient, while the maximum number was 18. These variations were independent of the severity of the symptoms (fig. 1).
Due to the limitations of video representations of patient examinations, the range of acceptable answers differed between individual examination components. Test items that are well known to be difficult to score consistently – aphasia, dysarthria, and facial paresis – ended up with multiple allowable answers as a result of the scoring algorithm used. For example, 9 of the 11 patients (82%) had multiple correct answers for the facial paresis item, consistent with the fact that videography cannot portray facial weakness as accurately as an in-person examination. Of 11 patients, 8 (73%) had multiple correct answers for the language item, a difficulty that has never been satisfactorily explained, and could be due to the method of training. As a result, 4 patients could be correctly classified as having either no or mild aphasia, 3 could be scored as having either mild or moderate aphasia, and 1 could be correctly identified as having either moderate or severe aphasia. In 3 patients, three answers were acceptable: 2 patients could be correctly classified as having either no, minor, or partial paralysis on the facial palsy item, and another patient could be categorized as having either no ataxia, ataxia in one limb, or ataxia in more than one limb. In another instance, a patient was ‘correctly’ diagnosed on the extinction item with either no abnormality or profound hemi-inattention, with the intermediary category counting as an outlier. For the items defining dysarthria and extinction, 5 of 11 patients (45%) had multiple correct answers. Items for which consistency would appear to be more easily reached, such as motor function, demonstrated inconsistency, with the left arm and right leg having multiple answers allowed in 27 and 36% of the patients, respectively. Level of consciousness was not deemed to be precisely determinable in all cases, with multiple answers acceptable for 2 of the 11 patients (18%).
The video training and certification of raters on the NIHSS was developed by the NINDS to train clinical trial investigators and coordinators (primarily trained, skilled stroke neurologists) to measure data on patients reliably and consistently across sites and over time [7, 20]. Unfortunately, as self-certification has spread beyond stroke neurologists to many raters without training in the field of stroke neurology or even any formal medical education, and often without required training, there has been considerable degradation in the quality of these measurements. Both the American Stroke Association and the NSA offer a free, online version of the test through their websites. However, the accompanying training module is not mandated prior to taking the certification examination. Even when utilized, the training module provides no immediate feedback to allow raters to recognize and learn from their mistakes before attempting the certification test, despite the fact that studies have shown feedback to be one of the most inexpensive and influential methods of learning available [23,24,25]. While not all institutions choose to rely solely on the self-certification process, the fact that the majority of raters will be certified in this manner makes it imperative that the certification process be as effective as possible at teaching raters to score patients consistently on the NIHSS.
From its inception, problems with reliability in scoring of the NIHSS – whether live or via video – have been noted [3, 26]. When the certification tapes were first designed and tested, agreement beyond chance for Test 1 (scored by 162 raters) was poor for one-third of the 15 questions comprising the scale, moderate for another third, and excellent only for the remaining third. Although agreement was somewhat better on Test 2, only 64 original raters had actually scored these patients.
In 2001, a modified version of the NIHSS, the mNIHSS, was devised. This test dropped problematic (facial weakness, ataxia, and dysarthria) and redundant (level of consciousness) items. The number of items showing poor agreement decreased to 3 of 22 (14%) items on the modified scale, but only 12 of 22 items (55%) exhibited excellent rater agreement (up from 40% for the 30 combined items of the original test). Despite general consensus about the problematic nature of some of the test questions and the apparent improvement of the scale by their removal, the modified scale has not been widely adopted. Neither has there been discussion about items not currently included in the scale. The NIHSS, which might benefit from the omission of some items, may also be strengthened by the addition of components to test attention, memory, and visual-spatial function, inclusions which could lead to more balanced non-dominant hemispheric scores.
In 2005, an updated version of the certification test for the full NIHSS was developed, encompassing a more balanced representation of patient disabilities and utilizing DVD technology, an improvement over the prior VHS technology. The outlier grading system continued to be utilized, however, and inadequate rater agreement continued to be observed. Of 15 items, 13 proved to have either poor (2) or moderate (11) levels of rater agreement. Clearly, a new approach to improve reliability is warranted. A possible alternative to changing the stroke scale itself may lie in endeavoring to improve the ability of the certification process to teach raters to derive answers that are as consistent as possible.
It is important to remember that a patient's final NIHSS score is a summation of 15 individually scored questions. The scoring system developed for NIHSS certification in the original clinical trial allows for multiple ‘correct’ answers to many of these questions (about 30% of questions across all videotaped patients per certification test version) to adjust for the limitations of video representations of patient examinations. Furthermore, the ‘correct’ answers have been periodically reevaluated based on rater responses to the questions, such that the same videotaped patient could have different total correct scores depending on when the certification test was taken. As greater numbers of users enter the certification system with little or no training, the number of ‘outliers’ necessarily increases. This adaptive scoring approach results in greater variability in the total ‘correct’ scores, and ultimately reaches a point where a rater can be certified despite an unacceptably large number of mistakes on individual test items. In the original two videotapes, fully 73% of the 11 unique patients in the two certification tests have between 5 and 12 ‘correct’ total scores, and no patient has a single correct total score. In the most extreme case, the videotaped patient can be correctly scored with an overall NIHSS score ranging from 24 to 35. Allowing substantial variability in individual patient scores increases the range of a passing score on the certification test, thus setting the pass/fail bar lower than should be acceptable. The certification process may therefore fail to teach raters to correctly categorize future stroke patients, thus leading to raters with less reliability, who therefore affect the results of clinical trials in which they participate. Outside of the clinical trial setting, this variability must have worsened over the ensuing years as well.
A precise scale ideally should have a single correct answer for each of the items tested. Consensus as to what constitutes a correct answer should, if possible, also take into account potential language and cultural differences in scoring, given the increasing number of international clinical trials. In a digital training resource for modified Rankin Scale (mRS) assessment, e.g., fully translated training packages, with native speakers overdubbing the patient interviews, have been made available for a variety of languages. An analysis of the variability between countries using these packages revealed that differences in scoring were not entirely a function of language, but appeared to have some association with sociocultural factors related to perceptions of disability and handicap. Others, however, in studying the effect of language on scoring of the NIHSS itself, have found that all venues – website, group, individual – yield similar results, and that items that film poorly in English remain poor in other languages, making the scale robust to such important variables [P.D.L., pers. commun.]. The absence of geographic information made such an effect impossible to assess in our current data.
However, it must be recognized that expecting raters to achieve exact agreement on a 42-point scale composed of 15 different questions per patient is unrealistic. Exactly what should constitute a passing score on a certification test is a complex question, given that each rater scores 15 questions on each of 5 or 6 patients. A training package for the mRS, a much less complex scale with only one question per patient, faces similar issues of agreement, and classifies a rater's ranking of a patient as ‘correct’ if agreed upon by two experts and more than 50% of trainees, and as ‘acceptable’ if one expert and a substantial minority of assessors in the pilot study chose it. A passing score on the certification test is then based on the number of correct and acceptable responses given across all patients. This is roughly similar to the approach taken to each of the 15 individual questions on the NIHSS; however, allowing a number of ‘acceptable’ individual answers has been shown to lead to a very wide range in ‘acceptable’ total scores. A more effective approach would be to derive a correct, or best, answer for each individual item through expert consensus, keeping to an absolute minimum the number of responses where multiple answers cannot be avoided due to problems of interpretation caused by videography (if these items were not eliminated altogether). A determination could then be made as to what range of total patient scores would be acceptable.
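The mRS-style classification rule described above can be expressed as a simple decision function. This is a hypothetical sketch: the function name is ours, and the numeric cutoff standing in for a ‘substantial minority’ of pilot assessors is an assumed parameter, since the source does not specify one.

```python
def classify_answer(expert_count, trainee_fraction, minority_cutoff=0.25):
    """Classify one rater response under the mRS-style rule:
    'correct'    -> chosen by two experts and >50% of trainees;
    'acceptable' -> chosen by at least one expert and a substantial
                    minority of assessors (cutoff is an assumption);
    'outlier'    -> anything else."""
    if expert_count >= 2 and trainee_fraction > 0.5:
        return "correct"
    if expert_count >= 1 and trainee_fraction >= minority_cutoff:
        return "acceptable"
    return "outlier"
```

A passing score would then be computed from the counts of ‘correct’ and ‘acceptable’ responses across all patients, in contrast to the NIHSS outlier method, which collapses both categories into a single ‘acceptable’ set.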
Such a program would necessarily be supported by a computer testing procedure that includes an expanded, highly interactive and required training process. One possibility would allow raters to score a number of sample patients and receive feedback about their answer choices after every question before going on to take the actual certification test. They thus would have an opportunity (1) to gain a better understanding of what experts consider to be the correct scoring of a given patient, (2) to see what mistakes they have made and why, and (3) to learn from their mistakes throughout the training process. This new system could then be evaluated in comparison to the current system, in order to determine if greater rater consistency was indeed achieved.
While such changes to the NIHSS certification process may require a greater time commitment on the part of those who wish to administer the test, the gains in reliability and consistency could well outweigh the disadvantages and costs of development. In the meantime, caution should be exercised in the use of the NIHSS for purposes for which it has not been validated. Until a new system can be evaluated and shown to have less variability in the raters who are certified, a patient's NIHSS score should not be used as a standard of exclusion or primary endpoint in clinical trials, nor should it be heavily weighted in the making of treatment decisions or communication between practitioners about patient status. Given the increasing pressure to extend the use of the NIHSS to include these functions, and the often-stated desire that it truly provide a common language with which to compare patients across different sites, raters, and points in time, such an effort would seem highly justified.
This study was supported by the NIH/NINDS (NS02254 and P50NS44148).