HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Stroke. Author manuscript; available in PMC 2010 July 1.
Published in final edited form as:
PMCID: PMC2726278

NIHSS Certification is Reliable Across Multiple Venues


NIH Stroke Scale certification is required for participation in modern stroke clinical trials and as part of good clinical care in stroke centers. A new training and demonstration DVD was produced to replace existing training and certification videotapes. Previously, this DVD, with 18 patients representing all possible scores on 15 scale items, was shown to be reliable among expert users. The DVD is now the standard for NIH stroke scale training but the videos have not been validated among general (i.e., non-expert) users. We sought to measure inter-rater reliability of the certification DVD among general users using methodology previously published for the DVD. All raters who used the DVD certification through the American Heart Association website were included in this study. Each rater evaluated one of 3 certification groups. Responses were received from 8214 raters overall, 7419 raters using the internet and 795 raters using other venues. Among raters from other venues, 33% of all responses came from registered nurses, 23% from Emergency Department MD/Other ED/other physicians, and 44% from neurologists. One half (51%) of raters were previously NIHSS certified and 93% were from United States/Canada. Item responses were tabulated, scoring performed as previously published, and agreement measured with unweighted kappa coefficients for individual items and an intraclass correlation coefficient for the overall score. In addition, agreement in this study was compared to the agreement obtained in the original DVD validation study to determine if there were differences between novice and experienced users. Kappas ranged from 0.15 (Ataxia) to 0.81 (LOC-C questions). Of 15 items, 2 showed poor, 11 moderate, and 2 excellent agreement, based on kappa scores. Agreement was slightly lower than that obtained from expert users for LOCC, Best Gaze, Visual Fields, Facial Weakness, Motor Left Arm, Motor Right Arm and Sensory Loss.
The intraclass correlation coefficient for total score was 0.85 (95% CI 0.72, 0.90). Reliability scores were similar among specialists and there were no major differences between nurses and physicians, though scores tended to be lower for neurologists and trended higher among raters not previously certified. Scores were similar across various certification settings. The data suggest that certification using the NINDS DVDs is robust and surprisingly reliable for NIHSS certification across multiple venues.

Keywords: Stroke, Clinimetrics, Scales, Reliability


Neurologists who care for stroke patients are required to certify in use of the National Institutes of Health Stroke Scale (NIHSS) now that Disease-Specific Specialty Designation as a Primary Stroke Center is available from the Joint Commission on Accreditation of Healthcare Organizations1, 2. The NIHSS is a widely used stroke deficit assessment tool used in nearly all large clinical stroke trials to document baseline and outcome severity3-5. A training and certification process exists to assure that raters use the NIHSS in a uniform manner6, 7: videotapes were used for training and certification from 1988-2006. To update the training and certification process, the NINDS produced a DVD in 2006 that is distributed widely by the American Academy of Neurology, the American Heart Association, and the National Stroke Association. Originally the DVD was validated in 3 select stroke centers to obtain a best-case impression of how the DVD patients should be scored among expert users8. The DVD was designed, however, for a non-expert single user to view at home or in an office, and its use among non-experts has not been validated. In addition, DVD certification in group settings has not been validated. Also, scores may not be generally applicable when novice users view the training DVD and then attempt certification. Hence, we collected scores from single use, group use, and a website to determine the reliability of the DVD certification outside of experienced centers and across multiple venues.


The training DVD includes 18 patients divided into 3 groups balanced for severity and stroke side. Raters were asked to certify using one of the three patient groups. Details on the DVD and the certification method have been described8.

We obtained certification scores from users in the following venues: single user (home or desktop), small groups, large groups, and a website. Single users took the DVD home or to an office, watched the training video, and then watched the certification video cases. Small group certifications occurred at single sites where the training video was shown and then no more than a dozen users watched the certification video and marked score sheets individually. Large group certifications occurred at meetings of trial investigators participating in a variety of clinical trials; the training video was shown and then certification patients were shown. In the large group settings, each user marked their own score sheet without discussion among other users. From all venues score sheets were faxed to the UCSD Stroke Clinical Trial Coordinating center for scoring using the published algorithm7. The training/certification website is sponsored by the American Heart Association. Users were encouraged to watch the training video over the Internet before certifying on one of the 3 certification groups; scores were recorded on the website and then raw data was transmitted to UCSD.

Descriptive analysis was performed on all data in the dataset. The number of raters who certified using this DVD was tabulated by setting (individual, small group, PI meeting, and website), as well as specialty (RN, ED MD, Neurology, Other ED, Other), prior certification status (Y, N) and country (US/Canada, Others), if collected. Summaries of the individual item scores as well as the total NIHSS were generated.

Reliability was assessed for the individual items of the NIHSS as well as the overall score. Scores of the individual items were tabulated. Agreement for the individual items among raters was assessed using the unweighted kappa statistic (κ) for multiple raters9 with a 95% confidence interval obtained using the bootstrap resampling technique with 1000 replicates. The methods used here are similar to the methods used in the original DVD validation study to allow comparison between the two studies8. In this study, the bootstrap technique was used instead of the jackknife technique since there are several instances when the jackknife technique is not appropriate10. Agreement in this study and in the original DVD study was considered to be statistically different if the estimated kappa (κ) in the original study did not fall into the 95% confidence interval for κ in this study. Using similar methods, reliability of the individual items was assessed separately for the subgroups of patients by setting, as well as specialty, certification status and country, if available. Comparison of κ statistics across subgroups was done using the bootstrap technique for correlated data12. 95% confidence intervals for differences in κ between two subgroups were calculated. The Bonferroni correction was used to adjust for multiple comparisons within each subgroup comparison. In addition, scatter plots of the item scores for each subject were used to compare and confirm graphically the reliability and the consistency of item scores by group.
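The item-level analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it uses the standard multi-rater (Fleiss) form of the unweighted kappa, and it assumes that bootstrap replicates are drawn by resampling raters with replacement and that the interval is a simple percentile interval, since the text does not state either choice. The function names and the toy ratings are hypothetical.

```python
import random


def counts_from_ratings(ratings, n_categories):
    """ratings[r][s] = category (0-based) that rater r gave to subject s."""
    n_subjects = len(ratings[0])
    counts = [[0] * n_categories for _ in range(n_subjects)]
    for rater in ratings:
        for s, category in enumerate(rater):
            counts[s][category] += 1
    return counts


def fleiss_kappa(counts):
    """Unweighted kappa for multiple raters.

    counts[i][j] = number of raters who assigned subject i to category j.
    Each subject needs at least two ratings.
    """
    total = sum(sum(row) for row in counts)
    n_cat = len(counts[0])
    # Overall proportion of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cat)]
    # Observed agreement: share of agreeing rater pairs, averaged over subjects.
    p_obs = sum(
        sum(c * (c - 1) for c in row) / (sum(row) * (sum(row) - 1))
        for row in counts
    ) / len(counts)
    # Agreement expected by chance from the category proportions.
    p_exp = sum(p * p for p in p_cat)
    if p_exp >= 1.0:
        return float("nan")  # kappa is undefined when only one category is used
    return (p_obs - p_exp) / (1 - p_exp)


def bootstrap_kappa_ci(ratings, n_categories, n_boot=1000, seed=0):
    """Percentile 95% CI, resampling raters with replacement (an assumption)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        k = fleiss_kappa(counts_from_ratings(sample, n_categories))
        if k == k:  # drop undefined (NaN) replicates
            estimates.append(k)
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[int(0.975 * len(estimates)) - 1]
    return lower, upper


# Hypothetical data: 40 raters score 6 patients on a 3-level item.
_rng = random.Random(42)
_true_scores = [0, 1, 2, 0, 1, 2]
toy_ratings = [
    [s if _rng.random() < 0.8 else _rng.randrange(3) for s in _true_scores]
    for _ in range(40)
]
toy_kappa = fleiss_kappa(counts_from_ratings(toy_ratings, 3))
toy_ci = bootstrap_kappa_ci(toy_ratings, 3)
print(round(toy_kappa, 2), tuple(round(x, 2) for x in toy_ci))
```

Under the comparison rule stated above, a previously published κ for the same item would be called statistically different when it falls outside the interval returned by `bootstrap_kappa_ci`.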

Agreement on the overall total NIHSS was assessed with an intraclass correlation coefficient (ICC) obtained using a one-way random-effects model for repeated measurements with continuous outcomes (with ratings nested within patients)11. The bootstrap resampling technique was used to obtain 95% confidence intervals for the ICC. Two comparisons are of interest in this study: (a) the ICC in the current study versus that obtained in the DVD validation study, and (b) the ICC in this study among the subgroups. The first was assessed by determining if the 95% confidence interval for the ICC in this study contained the ICC from the DVD validation study. If it did, there was no evidence to indicate a difference in ICC between the two studies. ICCs in the present study were compared between subgroups for setting, specialty, prior certification status and country by calculating the 95% confidence interval for the difference in ICC for correlated data between two subgroups. If zero was included in the confidence interval, there was no evidence to indicate a difference. To compare the ICC between the 3 groups of patients (A, B and C), Fisher's Z transformation for comparison of independent ICCs was used12. In both instances, the Bonferroni correction was applied to adjust for multiple comparisons. As with the item scores, scatter plots of the total NIHSS for each subject were used to visualize the variability of scores by subgroup.
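As an illustration of the one-way random-effects ICC with ratings nested within patients, the following Python sketch uses the standard analysis-of-variance estimator, with an adjusted average cluster size to allow for unequal numbers of ratings per patient. Whether the authors used exactly this estimator is an assumption; the function name and the example scores are hypothetical.

```python
def icc_oneway(groups):
    """One-way random-effects ICC.

    groups[i] = total NIHSS scores given to patient i by different raters.
    Patients may have different numbers of ratings.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-patient and within-patient sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Adjusted average number of ratings per patient (equals n when balanced).
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Hypothetical total scores for three patients with unequal numbers of raters.
toy_groups = [[2, 3, 2, 4], [11, 12, 10], [20, 22, 21, 19, 21]]
toy_icc = icc_oneway(toy_groups)
print(round(toy_icc, 3))
```

A value near 1 means that most of the variance in total score lies between patients rather than between raters scoring the same patient.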

To assess the mean effect of the covariates on the total NIHSS, a random intercept mixed-effects regression model was fit to the data.


We received score sheets from 379 single users, 178 small group users, 238 large group users, and 7419 web users. Among the 49284 expected responses (8214 × 6), we received 49272 ratings (99.9% completion rate). Responses were received from 8214 individual raters (4796 raters scored patients in group A, 2762 in group B and 656 in group C) who each rated between 3 and 6 patients. As a result, each patient had between 655 and 4796 ratings (unequal cluster sizes). Among the raters who provided demographic information, 33% of all responses came from registered nurses, 23% from ED/other physicians, and 44% from neurologists. Most of the raters (93%) were from the United States and about half (51%) were previously NIHSS certified. Item responses were tabulated, scoring performed as described above, and agreement measured with unweighted kappa coefficients for individual items and an intraclass correlation coefficient for the overall score.

Table 1 indicates the range of values obtained on each item over all 18 patients. The mean NIHSS total score was 8.0 ± 6.6 (median=7; range 0 to 41). The spread of responses in individual items and total scores appeared similar among the subgroups, namely, sites, specialties and prior NIHSS certification status.

Table 1
Distribution of Responses by NIHSS Item

Table 2 compares the agreement obtained using the unweighted kappa from the current dataset with that of the original DVD study8. The agreements ranged from 0.15 (ataxia) to 0.81 (LOCC) using the current data set. The agreements obtained from this group of raters were similar to those of the original DVD study on all items of the NIHSS except for seven items with lower agreement (LOCC, Best Gaze, Visual Fields, Facial Weakness, Motor Left Arm, Motor Right Arm and Sensory Loss).

Table 2
Interobserver Agreement for NIHSS items

Among all 18 certification patients, the agreement was similar across all subgroups and among all venues. Results were remarkably similar to the results in the original DVD validation study except for some small inconsistent differences across certain subgroups (data not shown). Agreement in 4 fields (LOCQ, LOCC, Visual Fields and Motor Left Leg) was higher in other countries compared to USA/Canada. Among specialties, ED MDs had higher agreement in Motor Right Leg compared to nurses, in LOCC, Motor Right Leg and Sensory Loss compared to neurologists, and in Motor Left Leg and Motor Right Leg compared to other specialties; nurses showed greater agreement in Dysarthria compared to neurologists, and in Motor Left Arm and Motor Left Leg when compared to other specialties. Agreement in LOCQ was higher in non-certified raters than in certified raters. Comparing venues, individual users showed higher agreement in Extinction/Neglect compared to the large group setting and higher agreement in Visual Fields and Motor Left Arm compared to web users; in the large group setting, scores showed lower agreement in Extinction/Neglect compared to the web setting; the small group setting showed higher agreement in Motor Left Arm than web users. There was no significant difference in agreement across the 3 certification groups.

Table 3 lists the intraclass correlation coefficient for the overall total NIHSS score, and total NIHSS by subgroup. There continues to be very good agreement in the total NIHSS score across all venues and subgroups (overall intraclass correlation coefficient 0.85; 95% CI 0.72, 0.90). There were no statistically significant differences in mean NIHSS scores by country and prior NIHSS certification status. There was a statistically significant interaction between specialty and setting in mean NIHSS scores (p=0.046); however, there were no clinically significant differences. Although there were slight differences in ICC across covariates, in all cases the agreement still remained very high. Agreement was lower among raters from the United States/Canada compared to the raters from other countries. The ICC was slightly lower among neurologists compared to the nurses, ED MDs, Other MDs and other physicians. Similarly, the raters with prior certification had slightly lower agreement than those who were not certified previously. The ICC was slightly lower in the small group setting than in the individual, PI meeting, or web settings. The ICCs for certification groups A and B were slightly lower than for group C.

Table 3
Intra-class Correlation Coefficients (ICC) for NIHSS Total Score


Our data show that NIHSS training and certification using the DVD is valid and reliable among general users. The certification process showed remarkable consistency across widely differing venues, including single users, small groups, large groups, and certification data from the American Heart Association website. The individuals in this study included novice users, who viewed the training video and then attempted certification, as well as previously certified users. The reliability assessments of this certification DVD among these novice users were similar to what was found using the experienced stroke centers, indicating that the DVD is a surprisingly valid and reliable replacement for the previous videotapes. The agreement among the items was similar whether it was used by a single user or in a group setting.

We found no differences in the ICC of the total NIHSS when the DVD was used by neurologists, ED physicians, and nurses, suggesting that the NIHSS may be appropriate for use in clinical research trials, as well as in daily communication among health care providers. Agreement among those identifying themselves as neurologists was slightly lower than among individuals identifying themselves as registered nurses, ED/Other MDs or Other specialties, but the results were statistically similar and generally excellent. Agreement across various settings was similar and generally moderate to excellent.

The DVD format has some advantages over videotape. The digital images can be loaded onto a website, and the American Heart Association successfully implemented a web-based training campus using our images. This website allows raters to view the training and certification patient videos online. The DVD technology is more widely available now than videotapes, so NIHSS certification should be possible for many more years, even if videotapes become obsolete.

This study contains certain limitations, the most important of which is that most of the raters were from the United States and Canada. We were able to determine that the scoring sheet works well for novice as well as experienced users in North America. However, these scores may not be generally applicable for non-English speakers or raters in other countries. Therefore, we continue to collect scores from the website to determine if the same scoring sheet generally works well outside of North America. Another inherent limitation is that video technology is a poor substitute for direct examination. In the absence of widespread proctored certification, however, no other option is available. Video certification is now widely used in many disciplines, with reasonable validity and reliability2. It is likely that web-based video training and certification will become more widespread, as the cost efficiencies are significant. Finally, the website does not require viewing of the training video prior to attempted certification, so an unknown number of novice users could have tried to certify without proper training.

Due to the unbalanced group sizes, small cells for item scores and a crossed study design, we did not use weighted Kappa statistics. Unweighted Kappa scores may underestimate agreement, yet in this study, the unweighted Kappa scores were comparable to the unweighted scores obtained in the primary DVD study and the weighted scores obtained in previous videotape studies. Therefore, the agreement among the viewers was at least as good and likely better than that seen previously with the videotapes. Agreement using the DVD continues to be surprisingly good and consistent among experienced as well as novice users.


The authors acknowledge the diligent effort and expertise of Ms. Alyssa Chardi and Karen Rapp, RN.

This work was supported by the NINDS P50 NS044148, and the Veterans Affairs Medical Research Service.


1. Alberts MJ, Hademenos G, Latchaw RE, Jagoda A, Marley J, Mayberg MR, Starke RD, Todd HW, Viste KM, Girgus M, Shephard T, Emr M, Shwayder P, Walker MD. Recommendations for the establishment of primary stroke centers. Brain Attack Coalition. Journal of the American Medical Association. 2000;283:3102.
2. Mohammad YM, Divani AA, Jradi H, Hussein HM, Hoonjan A, Qureshi AI. Primary stroke center: basic components and recommendations. South Med J. 2006;99(7):749–52.
3. Lyden P, Lu M, Jackson C, Marler J, Kothari R, Brott T, Zivin J. Underlying Structure of the National Institutes of Health Stroke Scale: Results of a Factor Analysis. NINDS tPA Stroke Trial Investigators. Stroke. 1999;30(11):2347–54.
4. Goldstein L, Samsa G. Reliability of the National Institutes of Health Stroke Scale. Stroke. 1997;28(2):307.
5. Goldstein LB, Bartels C, Davis JN. Interrater reliability of the NIH stroke scale. Archives of Neurology. 1989;46:660.
6. Albanese MA, Clarke WR, Adams HP, Jr, Woolson RF. Ensuring reliability of outcome measures on multicenter clinical trials of treatments for acute ischemic stroke: the program developed for the trial of ORG 10172 in acute stroke treatment (TOAST). Stroke. 1994;25:1746.
7. Lyden P, Brott T, Tilley B, Welch KM, Mascha EJ, Levine S, Haley HC, Grotta J, Marler J. Improved reliability of the NIH stroke scale using video training. NINDS TPA Stroke Study Group. Stroke. 1994;25(11):2220–6.
8. Lyden P, Raman R, Liu L, Grotta J, Broderick J, Olson S, Shaw S, Spilker S, Meyer B, Emr M, Warren M, Marler J. NIHSS training and certification using a new digital video disk is reliable. Stroke. 2005;36(11):2446–9.
9. Fleiss JL. Statistical methods for rates and proportions. New York: John Wiley and Sons; 1981.
10. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall/CRC; 1993. p. 436.
11. Zar JH. Biostatistical Analysis. 4th. Prentice Hall; New Jersey: 1999. pp. 390–392.
12. McKenzie DP, Mackinnon AJ, Peladeau N, Onghena P, Bruce PC, Clarke DM, Harrigan S, McGorry PD. Comparing Correlated Kappas by Resampling: Is one level of agreement significantly different from another? Journal of Psychiatric Research. 1996;30:483.