To describe the Neonatal Research Network’s (NRN) efforts to improve the certification process for the Follow-up Study neurologic exam and to evaluate inter-rater agreement before and after two annual training workshops.
The NRN Follow-up Study is a multi-center observational study that, from 1998 to 2010, examined more than 11,500 infants born at ≤ 26 weeks gestational age for neurodevelopmental outcome at 18 – 22 months corrected age. The percentages of examiners who agreed with the Gold Standard examiner on four neurodevelopmental outcomes on the initial training video and a test video were calculated. Consistency among examiners was assessed with the first-order agreement coefficient (AC1) statistic.
Improvements in agreement among examiners occurred between 2009 and 2010 and between initial training and test. Examiner agreement with the Gold Standard during the initial training was 83% – 91% in 2009 and 89% – 99% in 2010. Examiner agreement on the workshop test video increased from 2009 to 2010 with agreement reaching 100% for all four neurodevelopmental outcomes examined in 2010. AC1 values for the four neurodevelopmental outcomes on the training videos ranged from 0.64 – 0.82 in 2009 and 0.77 – 0.97 in 2010.
We demonstrate the importance of annual certification and the benefits of evaluation and revision of certification protocols to achieve high levels of confidence in neurodevelopmental study outcomes for multi-center networks.
Multi-center trials encounter challenges of standardizing research procedures across sites. It is important to periodically examine center differences, particularly for primary outcomes, to assess whether research procedures are being implemented uniformly. Periodical review of examiner training procedures can aid in identifying areas for improvement.
Examiner training for research protocol purposes is often achieved using video vignettes and to a lesser extent, the examination of actual patients during centralized training. Examiner training generally improves scoring accuracy and inter-rater reliability. For example, examiner training improved the motor development classification of infants as normal or abnormal1 as well as the reliability of upper limb function scores.2 However, there are instances where training has not improved inter-rater reliability and accuracy.3 Consideration has also been given to whether training and certification procedures are valid and reliable for novice examiners in addition to expert examiners.4
Training multiple examiners in a network to achieve inter-rater reliability in the diagnosis of cerebral palsy (CP) and neurologic status can be particularly challenging. Neurologic findings are substantially heterogeneous, ranging from very mild to severe, and young children may not clearly fall into a specific category. Examining the research standardization and training procedures of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) Neonatal Research Network (NRN) Follow-up Study may be beneficial for those conducting neurodevelopmental outcomes research in neonatal medicine as well as other fields where annual examiner certification is required. The NRN Follow-up Study evaluates infants who were born at less than or equal to 26 completed weeks gestational age for neurodevelopmental outcome at 18 – 22 months corrected age. The Follow-up Study assessment battery includes demographic and medical history, a physical examination, a standardized neurologic examination,5 the Palisano et al 6,7 Gross Motor Function Classification, a developmental assessment using the Bayley Scales of Infant and Toddler Development, 3rd edition (Bayley Scales-III),8 and a behavior assessment using the Brief Infant-Toddler Social Emotional Assessment (BITSEA).9 From 1998 to 2010, more than 11,500 infants completed the 18 – 22 month Follow-up Study visit.
Our objective is to describe the NRN’s implementation of a protocol to improve the certification process for the neurologic exam by NRN center examiners by evaluating inter-rater agreement before and after the annual training workshops in 2009 and 2010. It was hypothesized that inter-rater agreement would improve between initial training and test as well as between 2009 and 2010.
The NICHD Neonatal Research Network maintains a study manual with standard definitions (Table I). A diagnosis of neurodevelopmental impairment (NDI) is used for both randomized trials and observational follow-up studies. NDI is defined as the presence of any of the following: moderate to severe CP, bilateral blindness, bilateral hearing impairment requiring amplification, a gross motor function classification score (GMFCS)6,7 greater than or equal to level 2 reflecting moderate to severe motor impairment, or a Bayley Scales-III cognitive or motor score less than 70. The NRN Follow-up Study neurologic examination5 includes a standard assessment of reflexes, muscle tone and strength, functional motor skills, hearing and vision. Assessment of vision and hearing is completed by: (1) reviewing the history with the family, (2) reviewing records from subspecialty ophthalmology and audiology services, (3) determining whether the child is receiving services for the blind, from a deaf educator, or for speech/language or sign language, or has an amplification/FM system, (4) examining the eyes, including tracking, strabismus, glasses, surgical procedures, etc., and (5) determining whether the child can follow a simple verbal direction provided by the examiner. Hospital records, for the most part, contain this information and parents are usually aware of the formal diagnosis of blindness or permanent hearing loss.
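The NDI definition above is a logical "any of" rule, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Assessment` record, its field names, and the encoding of CP severity as strings and GMFCS as an integer level are hypothetical and are not taken from the NRN study manual or data system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    # All field names and encodings are hypothetical, for illustration only.
    cp_severity: str                      # "none", "mild", "moderate", or "severe"
    bilateral_blindness: bool
    bilateral_hearing_amplified: bool     # bilateral hearing impairment requiring amplification
    gmfcs_level: int                      # gross motor function classification level
    bayley_cognitive: Optional[float] = None  # None if the score is not available
    bayley_motor: Optional[float] = None


def has_ndi(a: Assessment) -> bool:
    """Return True if any component of the NDI definition is present."""
    low_cognitive = a.bayley_cognitive is not None and a.bayley_cognitive < 70
    low_motor = a.bayley_motor is not None and a.bayley_motor < 70
    return (
        a.cp_severity in ("moderate", "severe")
        or a.bilateral_blindness
        or a.bilateral_hearing_amplified
        or a.gmfcs_level >= 2
        or low_cognitive
        or low_motor
    )


# Hypothetical children, not study data.
mild_cp_only = Assessment("mild", False, False, 1, 95, 88)
low_bayley = Assessment("none", False, False, 0, 65, 90)
print(has_ndi(mild_cp_only))  # mild CP with GMFCS level 1 does not meet the definition
print(has_ndi(low_bayley))    # a cognitive score below 70 alone is sufficient
```

Because the rule is a disjunction, a single qualifying finding is enough, which is why examiner disagreement on one component (for example GMFCS level 1 versus 2) can change the NDI classification.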
Based on the neurologic examination, a determination is made of whether the child is normal or abnormal, and whether the child has CP as well as the severity of CP. CP is defined with the following three criteria: (1) Definite abnormalities observed in the classical neuromotor exam, which includes measurement of tone, deep tendon reflexes, coordination and movement. Any one definite abnormality in the classic neuromotor exam, as defined, except for isolated low tone (hypotonia) or toe walking without tight ankles is sufficient. (2) A delay in motor milestones with a disorder of motor function must be present. This may or may not be reflected in a motor quotient less than 70. In mild cases, there may be a subtle difference in hand functioning with a fine pincer grasp in one hand and a raking grasp in the other hand. Some disorder of motor function must be present. (3) Aberrations in primitive reflexes and postural reactions may be present.
The hierarchical classification of CP subtypes10 is included in the annual examiner training, with the aim of achieving inter-rater reliability in all subtypes including spastic, dyskinetic, and hypotonic CP. An assessment of gross motor function level is completed based on the Palisano et al.6,7 Gross Motor Function Classification system to provide a standardized classification of the severity of motor disability. This method is useful for assessing the level of severity of CP.6,7,11 Bimanual hand function is assessed in two ways: first, with a 3-level system comprising (1) no problems with bimanual tasks, (2) some difficulty with bimanual tasks, and (3) no functional bimanual tasks; and second, with a 5-level assessment of each hand that ranges from 1 (fine pincer grasp) to 5 (unable to grasp/makes no attempt to grasp).
Since 2005, there have been 16 participating NRN centers. Each center designates a primary neurologic examiner, who is generally the Follow-up Study Principal Investigator (PI) of the center, based on their expertise in the field of newborn follow-up, neurologic examination of children, and developmental assessment. The primary neurologic examiner is responsible for fulfilling annual certification requirements and for training and certifying other examiners at their center to conduct the neurologic exam.
The annual certification process for the NRN has evolved over time (Table II; available at www.jpeds.com). The 1994 workshop protocol for the 1993 cohort was in place for five years (1994 to 1999). The training consisted of bringing six to seven children to the workshop at 18–22 months corrected age with a variety of neurologic diagnoses. Initially, there were 12 examiners who split into groups of three and rotated into the exam rooms to evaluate the children and score the neurologic exam sheets. The drawback with this approach was that by the time the third or fourth group entered the exam room, the child was intolerant of further examiners. Although this method of certification provided an opportunity to discuss the various diagnoses of each child, it was considered sub-optimal due to patient fatigue. In addition, not every examiner had a “hands-on” opportunity.
In 2000, the training protocol was modified to use videos of neurologic exams. Each center submitted a video of their site’s primary neurologic examiner completing the neurologic examination on an 18–22 month corrected age former extremely low birth weight infant. The videos were all reviewed by two NRN Gold-Standard examiners and feedback was provided to each center examiner. Six high quality exams with a variety of diagnoses were identified for the training workshop. These six exams were viewed by the center examiners during the workshop, scored independently on the neurologic exam forms and discussed. Examiners reached consensus on the diagnoses after discussion and were certified as their center’s primary neurologic examiner. An annual certification DVD was made with the selected exams following the workshop for the certified center examiners to certify additional developmentalists at their centers. During the 2008 workshop, however, it was proposed that the certification process be formalized further to train and clearly document inter-rater agreement on all test items including the NRN primary outcome of NDI, and to provide focused training during the workshops on areas of identified examiner weakness. Certification itself would be determined by inter-rater agreement on four specific outcomes: normal neurologic exam, CP, GMFCS greater than or equal to level 2, and NDI.
Starting in 2009, although the initial submission process of videos by the primary examiners to the NRN Gold-standard examiners remained the same, the scoring of certification videos by each of the centers was formalized (Table II). Five to six selected exam videos that were representative of a spectrum of neurologic findings were collated on a certification DVD and sent to primary examiners to view and score using the NRN neurologic exam forms prior to the workshop. The forms were then keyed into the center’s NRN data management system and transmitted to the data coordinating center at RTI International for analysis. The data from each center were then analyzed by the center examiner’s correct response and by the group’s consensus with the video findings. The NRN Gold-Standard examiners were then able to identify strengths and weaknesses of the center examiners’ performance. With this process, the annual certification workshop in October 2009 began to focus training on scoring discrepancies and inter-rater agreement allowing for targeted training for problematic items. The workshop concluded with a “test” video, which examiners had not viewed before the annual certification workshop. Examiners viewed and scored the “test” video using NRN neurologic exam forms which were then entered into the NRN database for analysis. Examiners received feedback on both their training DVD and the scoring of their “test” video.
All analyses were performed using SAS 9.1 for Windows.12 Statistical analyses focused on four primary outcomes of the neurologic assessment: (1) whether the neurologic exam was considered normal, (2) a diagnosis of moderate to severe CP, (3) a GMFCS greater than or equal to level 2 indicating moderate to severe gross motor impairment, and (4) whether the child met the neurologic or neurosensory component of the NRN definition of neurodevelopmental impairment (moderate/severe CP or GMFCS greater than or equal to level 2 or bilateral vision or hearing impairment). The percentage of examiners who agreed with the NRN Gold Standard raters on each video was computed. Consistency among examiners (i.e., inter-rater reliability) was assessed by computing the first-order agreement coefficient (AC1) statistic, using the AC1AC2 SAS macro.13 Even though the kappa statistic has traditionally been used to assess inter-rater reliability, researchers have noted several limitations of kappa, including the sensitivity of its chance-agreement correction to trait prevalence and the kappa paradox wherein kappa can be low even though agreement is high.14 The AC1 statistic was developed to address these limitations.15 To help with interpretation, we used the general guidelines for reliability coefficients suggested by Fleiss,16 with values less than 0.40 indicating poor agreement, 0.40–0.75 indicating good agreement, and greater than 0.75 indicating excellent agreement.
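The two quantities described above can be illustrated with a short Python sketch. It is not the AC1AC2 SAS macro used in the study; it is a minimal implementation of the multi-rater AC1 formula as commonly stated, assuming every video is scored by at least two examiners. The function names and the example ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(ratings, gold):
    """Share of examiners whose rating matches the gold-standard rating."""
    return sum(r == gold for r in ratings) / len(ratings)


def gwet_ac1(table, categories=(0, 1)):
    """First-order agreement coefficient for several raters.

    table: one list per subject (video) holding each examiner's rating.
    categories: all possible rating values (two for a yes/no outcome).
    """
    n = len(table)
    q = len(categories)
    observed = 0.0                          # running sum of per-subject agreement
    prevalence = {k: 0.0 for k in categories}
    for ratings in table:
        r = len(ratings)                    # number of examiners for this subject
        counts = Counter(ratings)
        # Proportion of examiner pairs that agree on this subject.
        observed += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        for k in categories:
            prevalence[k] += counts.get(k, 0) / r / n
    p_a = observed / n
    # Chance agreement as defined for AC1.
    p_e = sum(p * (1 - p) for p in prevalence.values()) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Hypothetical data: 3 videos, 4 examiners, 1 = outcome present.
videos = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 0]]
print(percent_agreement(videos[2], gold=1))  # 0.75
print(round(gwet_ac1(videos), 3))            # 0.676
```

In this toy example the observed pairwise agreement is 5/6 and the chance term is 35/72, giving AC1 = 25/37. Percent agreement compares each examiner with the Gold Standard, whereas AC1 measures how consistent the examiners are with one another, which is why the study reports both.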
The four neurologic outcomes on the training videos for 2009 and 2010 are shown in Table III to illustrate the diagnoses chosen for the training process. In 2009, agreement of the centers’ pre-workshop DVD scoring with the Gold Standard rater ranged from 83% for normal neurologic exam to 91% for GMFCS ≥ level 2 across the six videos (Table IV). During the workshop for real-time certification training in 2009, agreement on the test video (Video 7) increased to 100% for the diagnoses of abnormal neurologic exam and presence of CP, but was 65% for GMFCS ≥ level 2 and NDI. The findings of agreement and inconsistency for the 2009 exams are described in greater detail in an effort to clarify some of the challenges encountered in achieving consensus.
Video 1 was scored as normal by the Gold Standard examiners. Differences were related to examiner perceptions of possible mild hip laxity, tight ankles, or a hearing loss. After discussion and review, the consensus was that the child was normal. For Video 2, some examiners had scored hypertonicity rather than spastic diplegia. Because resistance to dorsiflexion of the ankles and intermittent toe walking were present, the diagnosis of mild spastic diplegia was accepted. For Video 3, all examiners correctly classified CP, GMFCS and NDI; however, only 47% identified mild hypotonia with joint laxity. The lack of agreement on Video 4 was secondary to a diagnosis of hypotonia versus hypotonic CP with ataxia/athetosis. Because of the associated movement disorder, a diagnosis of CP was considered correct. On Video 5, there was 100% agreement except on GMFCS level by one examiner. For Video 6, there was debate regarding the GMFCS level relative to the child’s sitting posture. He was able to maintain a stable sit in a W-position or in a chair but not on the exam table. Final consensus was reached that it was GMFCS level 2 in a child with moderate spastic diplegia. The test video (Video 7) also showed a child with spastic diplegia. The lack of agreement for this child arose because examiners differed on whether to score the child as GMFCS level 2 or level 1. The inter-rater analyses showed that weaker inter-rater agreement was more frequently present for diagnoses of disorders characterized by hypotonia and mild neurologic disorders.
Agreement of the centers’ pre-workshop DVD scores in 2010 was higher than in the initial training in 2009, with values ranging from 89% for GMFCS ≥ level 2 to 99% for normal neurologic exam across the five videos. During the certification training in 2010, agreement on the test video reached 100% for all four outcomes. In addition, we computed the percentage of agreement between motor delay, defined as not walking at 18 months, and the classifications for moderate/severe CP, GMFCS level 2 or higher, and NDI. Percentages of agreement were high, ranging from 82% to 83%.
The AC1 values show a similar pattern for agreement among examiners themselves (Table V). During 2009, AC1 values for the initial training phase ranged from 0.64 for normal neurologic exam to 0.82 for the GMFCS ≥ level 2, indicating good to excellent agreement. The examiners were entirely consistent with respect to normal neurologic exam and CP on the 2009 workshop test video (AC1=1.00), but considerably less consistent on the GMFCS level and NDI for this particular video (AC1=0.24, in the poor range).
The examiners demonstrated good to excellent agreement during the initial training in 2010 with AC1 values ranging from 0.77 for GMFCS level to 0.97 for normal neurologic exam. All examiners were consistent in their ratings of the workshop test video in 2010 as reflected in the values of 1 for AC1.
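To make the interpretation of these coefficients concrete, the sketch below applies the Fleiss guideline bands described in the statistical methods (less than 0.40 poor, 0.40–0.75 good, greater than 0.75 excellent) to the AC1 values reported in the text. The function name and the dictionary labels are hypothetical; only the numeric values come from the results above.

```python
def interpret_agreement(coefficient: float) -> str:
    """Map a reliability coefficient to the guideline bands used in this study."""
    if coefficient < 0.40:
        return "poor"
    if coefficient <= 0.75:
        return "good"
    return "excellent"


# AC1 values as reported in the text; the labels are shorthand for this sketch.
reported_ac1 = {
    "2009 training, normal neurologic exam": 0.64,
    "2009 training, GMFCS >= level 2": 0.82,
    "2009 test video, normal exam and CP": 1.00,
    "2009 test video, GMFCS level and NDI": 0.24,
    "2010 training, GMFCS level": 0.77,
    "2010 training, normal neurologic exam": 0.97,
    "2010 test video, all four outcomes": 1.00,
}

for label, value in reported_ac1.items():
    print(f"{label}: AC1={value:.2f} -> {interpret_agreement(value)}")
```

Under these bands, only the 2009 test video result for GMFCS level and NDI falls in the poor range, and every reported 2010 value falls in the excellent range.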
Challenges remain in the attempt to achieve inter-rater reliability for the neurologic exam and for the diagnosis of CP in young children. Although use of the GMFCS in conjunction with the classic neurologic exam has improved examiner accuracy, diagnosing severity of impairment is not a perfect science.11 A number of investigators have made substantial progress in initiatives to simplify and categorize the methodology for diagnosing CP.10,17,18 The NRN has begun to use the classification system developed by the Surveillance of Cerebral Palsy in Europe working group.10 The definition and classification of CP was recently extensively reviewed and updated in a supplement to Developmental Medicine and Child Neurology.18 Implementation of this new system and categorization will allow for greater agreement for each extremity involved. The revised system includes a flow chart which has definitions for specific clinical features applied to each extremity. The NRN has found that a number of former preterm children have mixed findings, which has made it challenging to place these children in a specific category of CP based on the classic neurologic exam. Implementation of the revised system should produce enhanced reliability for networks. A second enhancement to be considered is the new Bimanual Fine Motor Function Test reported by Beckung and Hagberg,19 which uses a 5-level classification system for the upper extremities similar to the GMFCS for the lower extremities.
Despite limitations, improvements were observed in examiner agreement with the new NRN neurologic exam certification process between 2009 and 2010. Examiner agreement with the NRN Gold Standard during the initial training in 2010 was higher than initial training in 2009. Also, an increase in examiner agreement on the workshop “test” video occurred between 2009 and 2010 with agreement reaching 100% for all four neurologic outcomes examined in 2010. The AC1 values provided further support for these improvements.
There are several benefits to the new certification process. First, quantifying the certification process facilitated evaluation of annual certification by making it possible to assess inter-rater agreement. Second, annual neurologic exam training is now focused on scoring discrepancies which allows for targeted training of problematic items. Third, examiners who are outliers can be identified using the standardized videos, and can receive additional training, thus improving cohort data collection in this difficult area of judgment. Fourth, face-to-face meeting time for training is shorter than it was before 2009 because examiners view the training videos and score them prior to the training workshop. The time previously spent on scoring exams during the training workshop is now allocated towards setting the NRN Follow-up Study research agenda and discussing new protocols.
A limitation of this certification process is the difficulty of seeing all angles clearly and inability to assess muscle tone by palpation on training videos. To address this, examiners were asked to narrate the neurologic exam as it is being video-recorded so that the viewer has knowledge of the tactile information that is needed to score the exam. The only child information included on the DVDs was the child’s age at the time of the exam. A complete medical history was included in the early years of the NRN though this information is no longer included due to the potential for examiner bias.
The study findings clearly demonstrate the importance of annual certification and assessment of inter-rater agreement as well as the benefits of evaluation and revision of certification protocols to achieve high levels of confidence in neurodevelopmental study outcomes for multi-center networks. The NRN will continue to review processes for training and annual certification to identify areas for improvement. Evaluating inter-rater agreement before and after annual training workshops is one way to identify areas where additional training may aid in the standardization of research procedures across sites.
The National Institutes of Health and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) provided grant support for the Neonatal Research Network’s Follow-up Study. Data collected at participating sites of the NICHD Neonatal Research Network were transmitted to RTI International, the data coordinating center for the network, which stored, managed and analyzed the data for this study. As the Data Coordinating Center coordinator for the Follow-up Study, J.N.’s time was supported by the NICHD Neonatal Research Network.
We are indebted to our medical and nursing colleagues and the infants and their parents who agreed to take part in this study.
The authors declare no conflicts of interest.