Neurosonography can assist clinicians and can provide researchers with documentation of brain lesions. Unfortunately, we know little about the reliability of sonographically derived diagnoses.
We sought to evaluate observer variability among experienced neurosonologists.
We collected all protocol US scans of 1,450 infants born before the 28th postmenstrual week. Each set of scans was read by two independent sonologists for the presence of intraventricular hemorrhage (IVH) and moderate/severe ventriculomegaly, as well as hyperechoic and hypoechoic lesions in the cerebral white matter. Scans read discordantly for any of these four characteristics were sent to a tie-breaking third sonologist.
Ventriculomegaly, hypoechoic lesions and IVH had similar rates of positive agreement (68–76%), negative agreement (92–97%), and kappa values (0.62 to 0.68). Hyperechoic lesions, however, had considerably lower values of positive agreement (48%), negative agreement (84%), and kappa (0.32). No sonologist identified all abnormalities more or less often than his/her peers. Approximately 40% of the time, the tie-breaking reader agreed with the reader who identified IVH, ventriculomegaly, or a hypoechoic lesion in the white matter. Only about 25% of the time did the third party agree with the reader who reported a white matter hyperechoic lesion.
Obtaining concordance seems to be an acceptable way to assure a reasonably high quality of interpretation of images needed for clinical research.
Imaging in medicine requires interplay between the imaged subject and the viewer. The intent, most often, is to maximize resolution and minimize the contribution of observer variability. Yet, physicians sometimes interpret images differently, including images generated from US studies in adults [1–4] and newborns [5–9]. This has obvious clinical as well as research implications, and may affect the value of studies that seek to identify the correlates and antecedents of US scan images.
We are aware of only a small number of studies that have evaluated how reliably experienced sonologists read head US scans of infants [5–9]. Except for one study, only a small proportion of infants in any of these studies was born before 28 weeks, when the prevalence of abnormalities is highest.
Our ELGAN study, so-called because all subjects were extremely low gestational age newborns (ELGANs), required all protocol US scans of 1,450 infants to be read separately by two independent readers. Each reader was asked to identify the four most common cranial sonographic findings associated with subsequent neurological impairment in ELGANs: intraventricular hemorrhage (IVH), ventricular enlargement, white matter hyperechoic lesions, and white matter hypoechoic lesions. Scans not read entirely concordantly by the first two readers were sent to another independent reader for a tie-breaking interpretation. These structural aspects of the ELGAN study provided the opportunity to assess observer variability among 14 highly experienced sonologists who each read more than 200 sets of scans also read by others.
The sample for this study consisted of all babies born between 23 and 27 completed weeks of gestation at 14 hospitals in 11 cities in five states between March 2002 and August 2004. The babies had at least one cranial US scan, and their parents consented to their participation. The Institutional Review Board at each participating institution approved this study.
Routine scans were performed by sonographers at all of the hospitals using high-frequency transducers (7.5 and 10 MHz), most often with Acuson Sequoia, Philips iU55, and GE Logiq9 systems. US studies always included the six standard quasicoronal views and five para-sagittal views using the anterior fontanel as the sonographic window.
A total of 1,450 infants had at least one set of US scans, and 895 had three sets. A set of scans was obtained in 1,123 infants between the 1st and 4th days of life, in 1,302 between the 5th and 14th days of life, and in 1,268 between the 15th day of life and 40 weeks postmenstrual age.
All sonologists reviewed a manual and data collection form used in a previous study [12–14], and some suggested revisions. Subsequently, all agreed to the revised manual and data collection form. The manual and data collection form included templates of multiple levels of ventriculomegaly. The cerebral white matter in each hemisphere was divided into eight zones chosen for ease of identification by ultrasonographic landmarks. The lesions in each zone could be further characterized as hyperechoic and/or hypoechoic.
The need for additional revision was assessed with scans that illustrated a variety of abnormalities that could be interpreted differently by experienced sonologists. These scans were distributed and served as discussion points for conference calls intended to work out ways to minimize variation in interpretation. Potential sources of variation were identified, most of which had already been identified and addressed in the manual.
Intraventricular hemorrhage (IVH) was defined as blood within the ventricles. By definition, this excluded hemorrhage localized to the subependymal region. Ventriculomegaly, categorized as mild, moderate, and severe, was defined visually with a template that was on the data collection form. Our emphasis was on moderate/severe ventriculomegaly, which was diagnosed if the lateral ventricle was at least moderately enlarged in any of four sections (frontal horn, body, and occipital horn) on either side.
All US scans were read by two independent readers who were not provided with clinical information. One reader was the study sonologist at the infant’s birth institution, who, for this study, read all the protocol scans for each baby at one time. The other was the study sonologist at another ELGAN study institution, who was provided with the scans as electronic images on a CD embedded in the software eFilm Workstation (Merge Healthcare, Milwaukee, Wis.). The eFilm program allowed the second reader to see the images as the first reader had seen them, and to adjust and enhance the images, including the ability to zoom and alter contrast and brightness.
Definitions of discrepancies that required resolution by a third (tie-breaking) reader included:
| Finding | Discrepancy requiring a third reading |
|---|---|
| IVH | One reader classified an infant’s scans as having no or an uncertain IVH and the other reader classified the scans as having probable or definite IVH. |
| Ventriculomegaly | One reader classified an infant’s scans as having no or mild ventriculomegaly and the other reader classified the scans as having moderate or severe ventriculomegaly. |
| Hyperechoic lesion | One reader classified an infant’s scans as having no hyperechoic lesion in the white matter and the other reader reported the presence of a hyperechoic lesion. |
| Hypoechoic lesion | One reader classified an infant’s scans as having no hypoechoic lesion in the white matter and the other reader reported the presence of a hypoechoic lesion. |
Approximately 40% of scan sets were not read entirely concordantly by the first two readers. These were sent to a third reader, who was not informed of the nature of the discrepancy and was asked to complete the entire data collection form for all of the infant’s protocol scans.
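The discrepancy definitions and tie-breaking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Reading` class, the rating strings, and the function names are hypothetical and are not taken from the study's data collection form or software.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Hypothetical rating codes (assumed for illustration only).
IVH_POSITIVE = {"probable", "definite"}   # versus "none", "uncertain"
VM_POSITIVE = {"moderate", "severe"}      # versus "none", "mild"


@dataclass(frozen=True)
class Reading:
    """One reader's summary of an infant's protocol scans."""
    ivh: str                # "none" | "uncertain" | "probable" | "definite"
    ventriculomegaly: str   # "none" | "mild" | "moderate" | "severe"
    hyperechoic: bool       # white matter hyperechoic lesion reported
    hypoechoic: bool        # white matter hypoechoic lesion reported


def dichotomize(reading: Reading) -> dict[str, bool]:
    """Collapse each finding to present/absent as in the discrepancy table."""
    return {
        "ivh": reading.ivh in IVH_POSITIVE,
        "ventriculomegaly": reading.ventriculomegaly in VM_POSITIVE,
        "hyperechoic": reading.hyperechoic,
        "hypoechoic": reading.hypoechoic,
    }


def discordant_findings(first: Reading, second: Reading) -> list[str]:
    """Findings on which the first two readers disagree."""
    a, b = dichotomize(first), dichotomize(second)
    return [name for name in a if a[name] != b[name]]


def resolve(first: Reading, second: Reading,
            third: Optional[Reading] = None) -> dict[str, bool]:
    """Final call per finding: the two-reader agreement where it exists,
    otherwise the third reader's call, which sides with one of the two."""
    a = dichotomize(first)
    disputed = discordant_findings(first, second)
    if disputed and third is None:
        raise ValueError("discordant readings require a third reader")
    c = dichotomize(third) if third is not None else {}
    return {name: (c[name] if name in disputed else a[name]) for name in a}
```

In this sketch a set of scans goes to a third reader whenever `discordant_findings` returns a non-empty list, and the third reading only changes the outcome for the disputed findings.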
This descriptive study characterizes the more common discrepancies seen when cranial US scans were read by multiple readers. Kappa values are used as a statistical descriptor of observer variability, but we recognize their limitations [16, 17].
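For readers unfamiliar with these descriptors, the following minimal sketch computes kappa together with positive and negative agreement from a 2×2 table of paired readings. The text does not give the formulas used, so the commonly used proportions of specific agreement are assumed here; the function name and the counts in the usage note are illustrative and are not study data.

```python
import math


def agreement_stats(a: int, b: int, c: int, d: int) -> dict[str, float]:
    """Agreement descriptors for two readers and one present/absent finding.

    a: both readers positive        b: reader 1 positive, reader 2 negative
    c: reader 1 negative, reader 2 positive        d: both readers negative
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("at least one paired reading is required")

    observed = (a + d) / n
    # Agreement expected by chance, from each reader's marginal rates.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else math.nan

    # Proportions of specific agreement (assumed definition).
    pos_den = 2 * a + b + c
    neg_den = 2 * d + b + c
    return {
        "observed": observed,
        "kappa": kappa,
        "positive_agreement": 2 * a / pos_den if pos_den else math.nan,
        "negative_agreement": 2 * d / neg_den if neg_den else math.nan,
    }
```

For example, `agreement_stats(20, 5, 10, 65)` describes 100 hypothetical paired readings with 15 discordant pairs. Kappa is lower than the observed agreement because it discounts the agreement expected from the readers' marginal rates alone.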
We evaluated the following hypotheses:
According to congruent readings, the overall prevalences of the four target findings were: IVH, 24%; ventriculomegaly, 12%; hyperechoic lesion, 16%; and hypoechoic lesion, 7.8% (Table 1). Ventriculomegaly, hypoechoic lesions, and IVH tended to be grouped together in their positive agreement rates (68–76%), negative agreement rates (92–97%), and kappa values (0.62 to 0.68; Table 2). In contrast, hyperechoic lesions had considerably lower values of positive agreement (48%), negative agreement (84%), and kappa (0.32).
A total of 1,450 sets of scans were read separately by two readers. Of all these scans, 12% were read discordantly for IVH, 9% for moderate/severe ventriculomegaly, 24% for hyperechoic lesion and 6% for hypoechoic lesion.
We compared how each reader fared in relation to all other second readers who read the same scans in an attempt to identify whether particular readers stood out as consistently identifying abnormalities more or less often than others (Table 3). Several patterns emerged from these analyses.
First, although certain individual readers identified a lesion more or less frequently than colleagues, this tendency was most often lesion-specific. For example, reader H identified IVH a bit more often than others who read the same sets of scans, but did not identify ventriculomegaly or white matter lesions more frequently.
Second, the identification of a white matter hyperechoic lesion showed the most variability (Table 3). One reader recognized a hyperechoic finding in almost half, and five other readers identified a hyperechoic lesion in approximately 20%, of scans that another colleague interpreted as having no hyperechoic lesion. Eight readers differed from their colleagues in their recognition of a hyperechoic lesion in at least 25% of the scan sets.
Third, a hypoechoic lesion had the highest agreement rates, even higher than those for intraventricular hemorrhage.
Fourth, the choice of discrepancy definition strongly influenced the concordance rate for ventriculomegaly. When readings were allocated to one of two categories, either normal (combining all normal or mildly abnormal readings) or abnormal (combining all moderately and severely abnormal readings), 9% of readings were discordant for ventriculomegaly. When the categories were not collapsed into two groups and readings were viewed as concordant if they were within one category of each other, the rate of discordance was only 3%.
In analyses that maintained four categories of ventriculomegaly, the larger the ventriculomegaly the higher the association with the other three sonographic findings. This was most evident for hyperechoic and hypoechoic lesions, with the risk of these characteristics doubling with each increase in size of ventriculomegaly (Table 4).
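The effect of the discrepancy definition on ventriculomegaly can be illustrated with a small sketch that applies both rules to the same paired readings. The category labels, function names, and the example pairs in the usage note are assumptions for illustration and are not the study's data.

```python
from typing import Callable, Iterable, Tuple

# Four ordered categories of ventriculomegaly (labels assumed for illustration).
VM_LEVELS = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}


def discordant_dichotomized(first: str, second: str) -> bool:
    """Discordant if one reading is none/mild and the other moderate/severe."""
    return (VM_LEVELS[first] >= 2) != (VM_LEVELS[second] >= 2)


def discordant_nonadjacent(first: str, second: str) -> bool:
    """Discordant only if the readings are more than one category apart."""
    return abs(VM_LEVELS[first] - VM_LEVELS[second]) > 1


def discordance_rate(pairs: Iterable[Tuple[str, str]],
                     rule: Callable[[str, str], bool]) -> float:
    """Fraction of paired readings that are discordant under the given rule."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one paired reading is required")
    return sum(rule(a, b) for a, b in pairs) / len(pairs)
```

With the hypothetical pairs `[("mild", "moderate"), ("none", "none"), ("none", "moderate"), ("moderate", "severe")]`, the dichotomized rule flags two of four pairs and the nonadjacent rule flags one. A mild versus moderate pair counts as a disagreement only under the dichotomized rule, which is why that rule yields the higher discordance rate.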
To address whether familiarity with an institution’s equipment influenced readings, we compared concordance rates between the second and third readers and the first and third readers. The data from readings for ventriculomegaly and for a hyperechoic lesion supported this view of greater similarity between the second and third readers (Table 5). On the other hand, for intraventricular hemorrhage and hypoechoic lesion, the agreement between first and third readers was very similar to the agreement between second and third readers.
For intraventricular hemorrhage, moderate/severe ventriculomegaly, and a hypoechoic lesion, the third reader read approximately 40% of discrepant scans as positive, which clearly indicates a tendency to identify these lesions less often than earlier readers. This tendency was even more pronounced for a hyperechoic lesion. Third readers identified a hyperechoic lesion on 19% of 165 scans read by first readers and on 26% of 183 scans read by second readers (23% of scans overall; Table 5).
In this analysis, we compared the rates of discrepancy between first and second readers separately for each set of protocol scans. The first two protocol scans were obtained in the first 2 weeks of life, while the third protocol scan was obtained closer to term (Table 6). A hyperechoic lesion was the only finding for which agreement improved from early to late scans, but even this improvement in agreement was modest.
Cranial US readings in this cohort of 1,450 ELGANs were most reliable for ventriculomegaly, a hypoechoic lesion, and IVH, and least reliable for a hyperechoic lesion. The kappa value for a hyperechoic lesion, an index of the overall level of reliability that corrects for the level of agreement that would be expected by chance alone, was half that for the other findings, which ranged from 0.62 to 0.68. Some readers thought their assessments were limited by not having additional (posterior fontanel and trans-mastoid) views that were obtained routinely at their institution, by not having cine clips, and by not being able to scan the infant themselves. Nevertheless, the reading discrepancies among the sonologists in our study could not be explained by whether the studies were read by a reader from the birth hospital or from an outside collaborating institution, nor by whether the scans were performed closer to term or in the first week or two of life.
Cranial US variability has been investigated by at least four groups [6–9]. The positive and negative agreement rates for IVH in this study are comparable to those of one study and modestly lower than those of the three others [6, 8, 9]. In one of the previous studies, “echolucent or echodense periventricular leukomalacia” had a kappa of 0.09. In another study, in which kappa values for periventricular leukomalacia were found to range from 0.17 to 0.22, the “gold standard” outside reviewers found periventricular leukomalacia to be present five times more often than the initial home-center readers. Unlike two studies that had both local and central readers [9, 18], local readers in our study functioned both as outside readers for scans obtained at other institutions and as local readers for scans obtained at their own institutions. Perhaps that is why the prevalence of findings seen by the second, outside readers did not differ substantially from that of the first. All four US findings were defined as either present or absent for the purpose of requiring a third reading. This simplicity minimizes the variability that arises when staging and grading schemes are offered. We did, however, explore finer gradations of ventriculomegaly.
Defining differences as any nonadjacent category reduced the variability. We advise caution in drawing inferences based on this finding, in large part because we did not seek third readings based on nonadjacent discrepancies.
This is the largest study in which interobserver variability in the reading of cranial US studies of infants born before the 28th postmenstrual week has been evaluated. The rates of agreement and variability found in our study are likely to be generalizable. They are comparable to those reported for mammograms [20, 21], cranial CT scans in assessments of subarachnoid blood, cerebral angiograms for determining the completeness of aneurysm therapy, CT scans for assessing patients for intravenous rtPA therapy, and MRI looking at white matter changes in adults.
In a previous study we used a consensus approach to the reading of cranial US scans of very low birth weight infants [13, 14]. Three sonologists reading together were able to agree on the presence or absence of all abnormalities on 90% of scans sent to them because a screening sonologist identified an abnormality. Others, too, have found that reliability is enhanced by consensus readings [26, 27]. Although we prefer a consensus approach, we found it to be not feasible for a large number of sonologists to come together to review more than 500 sets of studies. In addition, at that time we did not have the capability to share all images comparably during conference calls. With the availability of CDs and visualization techniques identical to those available at the home institution, conference calls among three or more readers are certainly possible, and this is the approach we recommend.
Efforts are underway to improve the reading of scans. Some have employed scales to assess how confident physicians are in reading fetal MRI scans. Evaluating the correlates of confidence might help identify characteristics of scans or readers that contribute to observer variability. Still others are trying to improve the quality of cranial US images. The higher the quality of the data, the more likely an epidemiologic study will provide useful information. The quality of US scans can be assessed against a gold standard, such as neuropathology or MRI. Until early and repeated MRI scans are obtained routinely, however, cranial sonograms will continue to be used for clinical and research purposes.
Although variability in cranial US readings is higher than we would like, seeking concordance (i.e. at least two readers must agree) seems to be an acceptable way to assure a reasonably high quality of interpretation of images needed for clinical research. Observer variability can also be reduced by additional training designed specifically to reduce interreader disparities, or by having all scans read simultaneously by several readers working to achieve consensus.
We have shown that IVH, moderate/severe ventriculomegaly and a hypoechoic lesion in the white matter can be assessed with acceptable levels of observer agreement. On the other hand, hyperechoic lesions pose a problem. Additional work is needed to see how best to reduce observer variability associated with hyperechoic lesions.
This work was funded by a cooperative agreement with the National Institute of Neurological Disorders and Stroke (1 U01 NS 40069-01A2) and a program project grant from the National Institute of Child Health and Human Development (NIH-P30-HD-18655). The authors are also grateful for the assistance of all their colleagues, and the cooperation of the families of the infants who are the focus of our attention.
Karl Kuban, Division of Pediatric Neurology, Boston University Medical Center, Boston University School of Medicine, Boston, MA, USA.
Ira Adler, Eastern Radiologists, Inc., Greenville, NC, USA.
Elizabeth N. Allred, Neuroepidemiology Unit, Children’s Hospital Boston, Harvard Medical School, Harvard School of Public Health, Boston, MA, USA.
Daniel Batton, Departments of Pediatrics and Neonatology, William Beaumont Hospital, Royal Oak, MI, USA.
Steven Bezinque, Department of Radiology, DeVos Children’s Hospital, Grand Rapids, MI, USA.
Bradford W. Betz, Department of Radiology, DeVos Children’s Hospital, Grand Rapids, MI, USA.
Ellen Cavenagh, Department of Radiology, Sparrow Hospital, Lansing, MI, USA.
Sara Durfee, Department of Radiology, Brigham & Women’s Hospital, Harvard Medical School, Boston, MA, USA.
Kirsten Ecklund, Department of Radiology, Children’s Hospital Boston, Harvard Medical School, Boston, MA, USA.
Kate Feinstein, Department of Radiology, University of Chicago Hospital, University of Chicago, Chicago, IL, USA.
Lynn Ansley Fordham, Department of Radiology, University of North Carolina School of Medicine, Chapel Hill, NC, USA.
Frederick Hampf, Department of Radiology, Baystate Medical Center, Springfield, MA, USA.
Joseph Junewick, Department of Radiology, DeVos Children’s Hospital, Grand Rapids, MI, USA.
Robert Lorenzo, Department of Radiology, Children’s Healthcare of Atlanta, Emory University School of Medicine, Atlanta, GA, USA.
Roy McCauley, Department of Radiology, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, MA, USA.
Cindy Miller, Department of Radiology, Yale-New Haven Hospital, Yale University School of Medicine, New Haven, CT, USA.
Joanna Seibert, Department of Radiology, Arkansas Children’s Hospital, University of Arkansas Medical School, Little Rock, AR, USA.
Barbara Specter, Department of Radiology, Forsyth Hospital, Baptist Medical Center, Wake Forest University School of Medicine, Winston-Salem, NC, USA.
Jacqueline Wellman, Department of Radiology, Milford Regional Medical Center, Milford, MA, USA.
Sjirk Westra, Division of Pediatric Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
Alan Leviton, Neuroepidemiology Unit, Children’s Hospital Boston, Harvard Medical School, 1 Autumn St. #720, Boston, MA 02215-5393, USA, Email: alan.leviton/at/childrens.harvard.edu.