Assessment of the reliability of standardized magnetic resonance imaging (MRI) interpretations and measurements.
To determine the intra- and inter-reader reliability of MRI parameters relevant to patients with intervertebral disc herniation (IDH), including disc morphology classification, degree of thecal sac compromise, grading of nerve root impingement, and measurements of cross-sectional area of the spinal canal, thecal sac, and disc fragment.
MRI is increasingly used to assess patients with sciatica and IDH, but the relationship between specific imaging characteristics and patient outcomes remains uncertain. Although other studies have evaluated the reliability of certain MRI characteristics, comprehensive evaluation of the reliability of readings of herniated disc features on MRI is lacking.
Sixty randomly selected MR images from patients with IDH enrolled in the Spine Patient Outcomes Research Trial were each rated according to defined criteria by 4 independent readers (3 radiologists and 1 orthopedic surgeon). Quantitative measurements were performed separately by 2 other radiologists. A sample of 20 MRIs was re-evaluated by each reader at least 1 month later. Agreement for rating data was assessed with kappa statistics using linear weights. Reliability of the quantitative measurements was assessed using intraclass correlation coefficients (ICCs) and summaries of measurement error.
Inter-reader reliability was substantial for disc morphology [overall kappa 0.81 (95% confidence interval (CI): 0.78, 0.85)], moderate for thecal sac compression [overall kappa 0.54 (95% CI: 0.37, 0.68)], and moderate for grading nerve root impingement [overall kappa 0.47 (95% CI: 0.36, 0.56)]. Quantitative measures showed high ICCs of 0.87 to 0.96 for spinal canal and thecal sac cross-sectional areas. Measures of disc fragment area had moderate ICCs of 0.65 to 0.83. Mean absolute differences between measurements ranged from approximately 15% to 20%.
Classification of disc morphology showed substantial intra- and inter-reader agreement, whereas thecal sac and nerve root compression showed more moderate reader reliability. Quantitative measures of canal and thecal sac area showed good reliability, whereas measurement of disc fragment area showed more modest reliability.
Low back pain is one of the most prevalent and costly health problems in the industrial world. Magnetic resonance imaging (MRI) is increasingly used to assess patients with lumbar spine problems, particularly those with sciatica and intervertebral disc herniation (IDH). It is considered the diagnostic imaging procedure of choice for IDH,1 as it can provide exquisite morphologic detail of the disc abnormality.2,3 Unfortunately, the relationship between findings on MRI and clinical course remains controversial, with several studies showing a high prevalence of disc “herniations” in asymptomatic subjects.4-7
Efforts have been made to improve the specificity of MRI interpretation by developing more precise morphologic terminology than simply “herniation.”5,8 Although disc “extrusions” are much less common in asymptomatic subjects, the reliability of this determination has been variable.9 Recently, a grading system for determining nerve root compression has been proposed that seems to have substantial reliability, but awaits confirmation in additional studies.10 Another approach has been to look at quantitative measurement of disc fragment size and canal morphology; preliminary results of this approach appear promising.11 However, a comprehensive evaluation of the reliability of different features of IDH is lacking.
In this study, we used baseline MRIs collected from patients enrolled in the Spine Patient Outcomes Research Trial (SPORT) with a diagnosis of lumbar IDH. We evaluated the reliability of MRI readings, i.e., the variability in the interpretation and measurements of the same MR image by different readers. Interpretations and measurements were performed by readers using multiple predefined criteria.
SPORT enrolled 1244 patients with IDH defined on the basis of 3 factors: radicular pain with a positive nerve root tension sign or neurologic deficit, a confirmatory imaging study demonstrating IDH corresponding to their symptoms, and presence of symptoms for at least 6 weeks. Baseline MRIs were available and archived for 763 patients. Of these, 92 were collected electronically, deidentified for patient confidentiality, and stored directly as DICOM files. Six hundred seventy-one were collected as printed films and then digitized using a high-definition scanner, deidentified, and stored in DICOM format. No standard imaging protocol was used; clinical films obtained at each participating site were used “as is.” Images were provided to the readers on CDs using eFilm Lite software (Merge Technologies; Milwaukee, WI) as a viewer. Display monitors were not standardized across readers.
We randomly selected 60 complete MRI studies for use in this reliability study. Complete images were defined as those containing at least T1 and T2 sagittal series and a T2 axial series. The images were read by 4 clinical experts in spine MRI interpretation, including 3 musculoskeletal radiologists with subspecialty experience in spine imaging, and 1 orthopedic spine surgeon. Image quality was assessed as good, fair, or inadequate for interpretation. Images deemed inadequate for interpretation were excluded from the study. Image interpretation was recorded using a standardized data collection form that prompted the reader to select from multiple choice lists of findings for imaging characteristics at each level. Images were prepared in monthly batches of approximately 12 studies, including some from patients with IDH and some from patients with spinal stenosis. To assess intrareader reliability, a random subsample of 20 MRIs was selected and reread by each reader at least 1 month after the initial reading.
Each reader received a handbook containing standardized definitions of imaging characteristics. Pictorial and diagrammatic examples were provided where appropriate, derived from the literature or by consensus when no relevant publication was available. Before beginning the study, the readers evaluated a sample set of images and then met in person to review each image and refine the standardized definitions.
The features assessed for IDH included disc morphology, using the published classification scheme of “normal,” “bulge,” “broad-based protrusion,” “focal protrusion,” “extrusion,” and “sequestered.”12 For analytic purposes this scheme was collapsed into 3 categories: “normal/bulge,” “protrusion,” and “extrusion/sequestered.” This was rated for all available lumbar disc levels. Additional features of thecal sac compression, nerve root impingement, apical location, and sagittal extent of the disc herniation and signal characteristics of the epidural mass were evaluated for all levels that were rated as protrusion or extrusion/sequestered. Thecal sac compression by the disc fragment was characterized as “none,” “<1/3,” “1/3 to 2/3,” or “>2/3.”12 Nerve root impingement was evaluated using the grading system of Pfirrmann et al and was characterized as “no impingement,” “touching” (contact), “displaced” (deviation), or “compressed.”10
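The collapsing of the six morphology labels into three analytic categories can be sketched as a simple lookup. This is only an illustration: the names COLLAPSED, collapse_morphology, and needs_additional_features are hypothetical and do not come from the study's data collection forms or analysis code.

```python
# Hypothetical lookup from the six published morphology labels to the
# three analytic categories described in the text.
COLLAPSED = {
    "normal": "normal/bulge",
    "bulge": "normal/bulge",
    "broad-based protrusion": "protrusion",
    "focal protrusion": "protrusion",
    "extrusion": "extrusion/sequestered",
    "sequestered": "extrusion/sequestered",
}


def collapse_morphology(label: str) -> str:
    """Return the collapsed category for one rated disc level."""
    return COLLAPSED[label.strip().lower()]


def needs_additional_features(label: str) -> bool:
    """Additional features were rated only for levels with a herniation,
    i.e., levels rated as protrusion or extrusion/sequestered."""
    return collapse_morphology(label) != "normal/bulge"
```

Under this sketch, a level rated “bulge” would not be assessed for thecal sac compression or nerve root impingement, whereas a level rated “focal protrusion” would.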
Additional characteristics that were evaluated included the axial location (left extraforaminal/foraminal, left paracentral, central, right paracentral, right extraforaminal/foraminal) and sagittal extent of the herniation when present.12 In addition, the T2-weighted signal intensity (bright, intermediate, dark) and the signal homogeneity (homogeneous, heterogeneous) of the epidural material were rated.
In addition to the readings described above, 2 other independent radiologists made quantitative measurements of selected imaging characteristics. For scanned images, scaling was taken from the printed centimeter scale when available; images without any scale were excluded. Measurements were made using ImageJ software’s built-in measurement tools (Rasband, W.S., ImageJ, U.S. National Institutes of Health, Bethesda, MD, http://rsb.info.nih.gov/ij/, 1997-2006.) All area measurements were made using freehand areas (Appendix 1, available online through Article Plus). Bony and soft tissue canal area and thecal sac area were measured at all available disc and pedicle levels. Bony canal measurements used the osseous borders posterolaterally and the disc margin anteriorly. Soft tissue canal measurements used the ligamentous borders posterolaterally. Disc fragment area and the thecal sac area at the level of the largest disc fragment were measured only for those levels at which a disc herniation (protrusion, extrusion, or sequestered fragment) was identified.
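As a rough illustration of how a traced freehand outline and a centimeter scale combine into an area in mm2, the following sketch applies the shoelace formula to the outline's pixel coordinates. It is not a description of ImageJ's internal method, which the text does not detail; the function name polygon_area_mm2 and the mm_per_pixel parameter are assumptions made for the example.

```python
def polygon_area_mm2(vertices_px, mm_per_pixel):
    """Area enclosed by a traced outline, converted from pixels to mm^2.

    vertices_px  : list of (x, y) pixel coordinates in tracing order
    mm_per_pixel : scale factor, e.g. derived from the printed centimeter
                   scale (10 mm divided by the pixels spanned by 1 cm)
    """
    n = len(vertices_px)
    if n < 3:
        raise ValueError("an outline needs at least 3 vertices")
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices_px[i]
        x2, y2 = vertices_px[(i + 1) % n]  # wrap around to close the outline
        twice_area += x1 * y2 - x2 * y1    # shoelace cross term
    area_px2 = abs(twice_area) / 2.0
    # An area scales with the square of the linear scale factor.
    return area_px2 * mm_per_pixel ** 2
```

For example, a 10 by 10 pixel square traced on an image scaled at 0.5 mm per pixel has an area of 25 mm2. Because the scale enters as a square, a small error in reading the printed scale produces roughly twice that relative error in the area, which is one reason images without any scale were excluded.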
A detailed handbook was provided to each reader with precise standardized definitions for each measured quantity. Each quantitative reader performed measurements on a training set of images, followed by a feedback session and refinement of the handbook before beginning the study. Measurements were checked for consistency and anatomic plausibility and returned to the readers for remeasurement or rescaling when necessary.
Initial analyses focused on the distribution of selected categories across readers for each imaging characteristic to look for systematic differences in the use of particular categories based on χ2 tests. The means of the quantitative image measurements were compared between readers using paired t tests.
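The comparison of mean measurements between the 2 quantitative readers rests on the paired t statistic, which can be sketched as follows. The function name paired_t_statistic is hypothetical, and the sketch returns only the statistic and its degrees of freedom; converting these to a P value requires the t distribution, which is left to a statistics package.

```python
import math
import statistics


def paired_t_statistic(reader_1, reader_2):
    """Paired t statistic for two readers measuring the same structures.

    Returns (t, degrees_of_freedom). Each position in the two lists must
    refer to the same measured level.
    """
    if len(reader_1) != len(reader_2) or len(reader_1) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(reader_1, reader_2)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

A large absolute t indicates a systematic offset between readers, which is a different question from how closely the individual measurements track each other.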
The kappa statistic13 was used to summarize intrareader and inter-reader reliability of the rating data. Kappa statistics were calculated with linear weights to give less importance to disagreements closer together on an ordinal scale. Intrareader kappas were calculated for each reader individually and interreader kappas were calculated for each reader pair using the disc level as the unit of analysis. For the intrareader kappas, the bootstrap procedure was implemented using 1000 samples of size 20 from the individual image records included in the reliability study. A stratified estimate of the overall weighted intrareader kappa was formed at each bootstrap iteration. To accommodate the presence of multiple levels per image, overall inter-reader kappas and 95% confidence intervals (CIs) were calculated using the bootstrap technique with 1000 samples of size 58 taken with replacement from the individual image records. A weighted average of the pairwise kappas was taken using weights based on their estimated standard errors. The mean of the bootstrap distribution of the weighted averages was taken as the estimate of the inter-rater kappa. Interpretation of strength of agreement based on kappa values followed the schema of Landis and Koch13: <0 = Poor; 0 to 0.20 = Slight; 0.21 to 0.40 = Fair; 0.41 to 0.60 = Moderate; 0.61 to 0.80 = Substantial; 0.81 to 1.00 = Almost perfect.
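A minimal sketch of a linearly weighted kappa for one reader pair, together with the Landis and Koch labels quoted above, is given here. It assumes ratings are coded as integers 0 to k-1 on an ordinal scale; the function names are hypothetical, and the bootstrap and the standard-error weighting across reader pairs described in the text are not reproduced.

```python
from collections import Counter


def linear_weighted_kappa(ratings_a, ratings_b, n_categories):
    """Cohen's kappa with linear agreement weights w = 1 - |i - j| / (k - 1).

    ratings_a, ratings_b : integer category codes (0 .. k-1), one per level
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n, k = len(ratings_a), n_categories
    joint = Counter(zip(ratings_a, ratings_b))
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = 1.0 - abs(i - j) / (k - 1)  # full credit on the diagonal
            observed += w * joint[(i, j)] / n
            expected += w * (marg_a[i] / n) * (marg_b[j] / n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Strength-of-agreement label following the schema quoted in the text."""
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")
```

With linear weights on a 3-category scale, a disagreement between adjacent categories receives half credit and a disagreement between the extreme categories receives none, which is what gives less importance to disagreements that are closer together.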
The primary outcome measure for the quantitative measurements was the intraclass correlation coefficient (ICC) for both inter- and intrareader data. ICC and confidence interval values were calculated using analysis of variance methods as defined by Shrout and Fleiss.14
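The analysis of variance calculation behind an ICC can be sketched as below. The text does not state which Shrout and Fleiss form was used, so the choice of ICC(2,1) (two-way random effects, single measurement, absolute agreement) is an assumption made for illustration, as is the function name icc_2_1. Confidence intervals are not computed.

```python
def icc_2_1(data):
    """ICC(2,1) from a complete targets-by-raters table.

    data : list of rows, one row per measured structure (target),
           each row holding one measurement per rater
    """
    n, k = len(data), len(data[0])
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need a complete table with >= 2 rows and columns")
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Illustrative table only (6 targets, 4 raters); not data from this study.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))  # 0.29
```

In this form a systematic offset between raters (a large between-rater mean square) lowers the coefficient, so it penalizes the kind of between-reader differences in mean measurements that a consistency-only form would ignore.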
The sample size for this study was planned so that an intrareader kappa based on the 20 repeat readings of a feature with a prevalence of 0.44 would have an approximate predicted standard error of 0.11. For a kappa of 0.6 or higher, this would yield a coefficient of variation of less than 20%. For the inter-reader kappa, a sample of approximately 55 readings by the 4 readers was chosen to give an approximate predicted standard error of 0.028, or a coefficient of variation of less than 5% for kappas of 0.6 or higher.
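The coefficient of variation quoted here is simply the predicted standard error divided by the kappa value. A short check of the arithmetic, using only the figures given in the text (the function name is hypothetical):

```python
def coefficient_of_variation(standard_error, kappa):
    """Standard error expressed as a fraction of the estimate."""
    if kappa <= 0:
        raise ValueError("defined here only for positive kappa")
    return standard_error / kappa


# Intrareader design: SE 0.11 at kappa 0.6
intra_cv = coefficient_of_variation(0.11, 0.6)   # about 0.183, below 0.20
# Inter-reader design: SE 0.028 at kappa 0.6
inter_cv = coefficient_of_variation(0.028, 0.6)  # about 0.047, below 0.05
```

Because the standard error is held fixed, larger kappas give smaller coefficients of variation, so the stated bounds hold for any kappa of 0.6 or higher.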
Of the 60 selected MRIs, 2 were found to be inadequate for interpretation, leaving a total sample size of 58 studies included in the ratings analysis. Of these, 8 did not have an appropriate scale, leaving 50 studies for inclusion in the analysis of quantitative measurements.
Characteristics of the study population are shown in Table 1. The average age was 42.3 years, about half were women, most were white and nonhispanic, most had nerve root tension signs and neurologic deficits, and average Oswestry Disability scores were 45 at baseline. These characteristics were generally similar to the overall IDH population in SPORT.15,16
The distribution of ratings by the 4 expert readers for the main imaging characteristics is shown in Figure 1. Disc morphology had relatively similar distributions across readers, although Reader C endorsed more protrusions and fewer extrusions than the other readers. Systematic differences in response patterns were more evident for degree of thecal sac compression and most striking for nerve root impingement.
Intrareader reliability for major characteristics is summarized in Table 2. Disc morphology showed almost perfect agreement, thecal sac compression showed substantial to almost perfect agreement, and nerve root impingement showed moderate to substantial agreement. The number of levels evaluated for thecal sac compression and nerve root impingement is substantially smaller than for disc morphology because only levels with a disc herniation present were assessed for these features.
Inter-reader reliability for major characteristics is summarized in Figure 2. Reliability across reader pairs was quite consistent for disc morphology, with an overall substantial to almost perfect agreement [summary kappa = 0.81 (95% CI: 0.78, 0.85)]. Thecal sac compression showed somewhat poorer overall agreement between reader pairs and more variability between pairs. The overall agreement was moderate, with a summary kappa of 0.54 (95% CI: 0.37, 0.68). Nerve root impingement was similar, with overall moderate agreement and a summary kappa of 0.47 (95% CI: 0.36, 0.56).
The axial location showed substantial intrareader reliability [summary kappa 0.78 (95% CI: 0.61, 0.94)] and inter-reader reliability [average kappa 0.76 (95% CI: 0.66, 0.86)]. The sagittal extent of the herniation showed substantial reliability with an intrareader summary kappa of 0.67 (95% CI: 0.51, 0.79) and an inter-reader summary kappa of 0.63 (95% CI: 0.54, 0.70). Ratings of T2 signal characteristics showed fair intrareader agreement [summary kappa 0.38 (95% CI: -0.02, 0.72)] and moderate inter-reader agreement [summary kappa 0.43 (95% CI: 0.31, 0.54)]. Ratings of signal homogeneity had moderate intrareader reliability [summary kappa 0.58 (95% CI: 0.39, 0.75)] but only slight inter-reader reliability [summary kappa 0.12 (95% CI: 0.05, 0.20)].
The results of the measurements by each of the 2 quantitative readers are summarized in Table 3. The mean soft tissue canal area measured at the disc level was 225 mm2 and the mean thecal sac area was 141 mm2, with no significant differences between the 2 readers. There were systematic differences between the readers in measures of the bony canal area, the area of the disc fragment, and most markedly in the anterior-posterior length of the disc fragment.
Intra- and inter-reader reliabilities for the quantitative measures are summarized in Table 4. There was excellent intra- and inter-reader reliability for all measures of the canal and thecal sac area. Disc fragment area was somewhat less reliable and the anterior-posterior length of the fragment showed the worst agreement.
Inter-reader agreement for thecal sac area at the disc level is shown graphically in Figure 3. The mean absolute difference between measurements by the 2 readers was 22 mm2, approximately 15% of the mean 144 mm2 size. Disc fragment area showed the largest discrepancy between readers. The absolute mean intrarater differences were approximately 19 mm2 for each reader, with mean disc fragment sizes of 106 and 75 mm2. The mean interrater difference in measurements was 39 mm2.
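The measurement error summaries in this paragraph are mean absolute differences, optionally expressed relative to the mean size of the structure. A minimal sketch, with hypothetical function names:

```python
def mean_absolute_difference(reader_1, reader_2):
    """Average of |a - b| over paired measurements of the same levels."""
    if len(reader_1) != len(reader_2) or not reader_1:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(reader_1, reader_2)) / len(reader_1)


def relative_difference(mean_abs_diff, mean_size):
    """Mean absolute difference as a fraction of the mean measured size."""
    return mean_abs_diff / mean_size


# Figures quoted in the text for thecal sac area at the disc level:
# 22 mm2 mean absolute difference against a mean size of 144 mm2.
thecal_sac_relative = relative_difference(22, 144)  # about 0.15
```

The same 22 mm2 difference would be a much larger relative error for a small disc fragment than for the thecal sac, which is worth keeping in mind when comparing absolute differences across structures of different size.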
We also examined agreement between the rater assessments and the quantitative measurements for the 1 parameter that was directly comparable between the 2: degree of thecal sac compression caused by the disc herniation. We compared the subjective assessments of “<1/3,” “1/3 to 2/3” and “>2/3” with the measured ratio of the thecal sac area at the level of the disc herniation to the thecal sac area at the level of the pedicle above the herniation (<33%, 33%-67%, >67%). The agreement for the subjective rating of thecal sac compression was moderate, with a kappa of 0.54 as previously described. The corresponding measured ratio of thecal sac area to the area at the pedicle level showed substantial intrareader agreement (kappa 0.63) and moderate inter-reader agreement (kappa 0.46). The agreement between the ratings of compression and the measured decrement in thecal sac area was fair, with an overall kappa of 0.22 (95% CI: 0.04, 0.41).
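One way the measured areas could be binned for comparison with the subjective ratings is sketched below. The text does not spell out the exact conversion, so this sketch assumes that compression is taken as the decrement in thecal sac area relative to the pedicle level above (1 minus the area ratio) and that the cut points are 1/3 and 2/3; the function name and the handling of values exactly at a cut point are likewise assumptions.

```python
def compression_category(area_at_herniation_mm2, area_at_pedicle_mm2):
    """Bin the measured decrement in thecal sac area into three categories.

    Assumption: decrement = 1 - (area at herniation / area at pedicle above).
    """
    if area_at_pedicle_mm2 <= 0:
        raise ValueError("reference area must be positive")
    decrement = 1.0 - area_at_herniation_mm2 / area_at_pedicle_mm2
    if decrement < 1 / 3:
        return "<1/3"
    if decrement <= 2 / 3:
        return "1/3 to 2/3"
    return ">2/3"
```

Under these assumptions, a thecal sac measuring 75 mm2 at the herniation and 150 mm2 at the pedicle above would fall in the “1/3 to 2/3” category. Measurements near a cut point can change category with a small measurement error, which may partly explain why agreement between the binned measurements and the subjective ratings was only fair.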
We found excellent intra- and inter-reader reliability for most of the MRI features assessed in this study. Disc morphology, rated as “normal/bulge,” “protrusion,” or “extrusion/sequestered,” showed near-perfect intrareader agreement and substantial to near-perfect inter-reader agreement. The degree of thecal sac compression was also highly reliable, whereas the grading of nerve root impingement was only moderate. Our quantitative measurements generally had excellent intra- and inter-reader reliability by ICC, with modest absolute error sizes on remeasurement. Disc fragment area, however, was less reliably measured.
Our results for the reliability of the disc morphology classification compare favorably with prior studies. Brant-Zawadzki et al found substantial intrareader agreement (unweighted kappa 0.68) and moderate inter-reader agreement (unweighted kappa 0.59) using the terminology “normal,” “bulge,” “protrusion,” or “extrusion.”8 Jarvik et al also found moderate to substantial inter-reader agreement for this classification with weighted kappas of 0.50 to 0.75 across reader pairs.9 Similarly, Weishaupt et al7 and Sorensen et al17 found substantial agreement for classifying disc morphology, with inter-reader kappas of 0.79 and 0.68, respectively. Our reliability was slightly better than in these prior studies, which may be related to efforts to review criteria and build consensus before undertaking the readings. It may also represent increased familiarity and comfort with this classification system over time.
Despite these efforts, our reliability for grading nerve root impingement was only moderate (overall weighted kappa 0.47). This is lower than initial reliability estimates by Pfirrmann et al, who showed substantial interreader reliability for this grading system, with kappas of 0.62 to 0.67 across reader pairs.10 Our finding of somewhat poorer reliability may be related to the lack of a consistent imaging protocol. Although Pfirrmann et al used standard imaging sequences on a single scanner, we used clinically available images with varying image acquisition protocols, field strength, slice orientation, etc. This may have contributed to poorer reliability on imaging characteristics that were more finely detailed than disc morphology. The somewhat variable appearance of nerve roots depending on the level and orientation of each slice may also contribute to decreased reliability of grading nerve root compression. It may, however, reflect the type of reliability that could be expected in clinical practice where there is substantial variability in image quality and characteristics.
Although the quantitative measurements showed good reliability in terms of intraclass correlations, the absolute measurement errors were larger than in previous studies. Carragee and Kim’s finding of absolute measurement errors of <3% in repeated measures of disc herniations and canal size is much lower than the approximately 15% to 20% seen in the current study.11 Carlisle et al also reported a 3% intraobserver overall measurement variability in a study looking at disc fragment and spinal canal areas.18 The higher variability in our study may relate to heterogeneity among the images and the limited interaction before and during the study between the 2 radiologists who performed the measurements. The extent to which this degree of measurement error might impede the use of these imaging characteristics to predict clinical findings or outcomes is unknown. The measure of disc fragment area showed somewhat lower reader reliability, perhaps related to the fact that these structures were irregular, sometimes small, and varied in terms of the level of maximal extent.
This study had a number of important limitations. As noted above, heterogeneity among the images used is a potential shortcoming in terms of determining the ideal reliability, but may be more representative of actual clinical practice. In addition, there was no standardization across readers in terms of how the readings were done (e.g., all in 1 sitting vs. a few at a time) or the monitor on which they were viewed. This could have created substantial differences across readers. In addition, the readers themselves were heterogeneous, with 3 radiologists and an orthopedic spine surgeon. Differences in training and background may have affected the inter-reader reliability. Interestingly, however, when we assessed reliability across reader pairs, we did not see any systematic differences in inter-reader agreement based on reader specialty.
It is important to note our use of prestudy meetings, detailed handbooks of definitions, and standardized reporting forms with multiple choice categories for each parameter at each level. These features allowed the assessments to be structured far more than is possible in general clinical practice. Thus, our results may overestimate the reliability that might be expected among readers doing routine clinical assessments. In addition, while the readers were not provided with specific clinical data on subjects except their age and sex, they were aware that all the images were from patients with either disc herniation or spinal stenosis severe enough to qualify them to be surgical candidates and enter SPORT. How this knowledge may have affected the readers’ interpretations is unknown.
Finally, we studied only the reliability of different readings of the same images. We did not assess the reliability of interpretations between different scans on the same patients, or different imaging protocols. These other factors may introduce entirely separate challenges and create additional possibilities for disagreement.
Disagreements between readers in our study were fairly modest overall. However, when they did occur, we had no gold standard by which to decide between differing interpretations. For example, it is unclear whether the measured thecal sac area or the subjective rating of thecal sac compression is the more “valid.” The standard for preferring 1 assessment over the other should not be based on reliability, but rather on whether 1 assessment is able, or better able, to predict patient symptoms or outcome.
The assessment of reliability is merely the first step in this process. The imaging characteristics in this study generally had moderate to substantial intra- and inter-reader reliability. Carlisle et al showed that larger disc fragment size, smaller canal area, and larger proportion of canal compromise predicted the need for surgery using a clinical algorithm.18 Carragee and Kim showed that larger disc fragment size predicted surgical outcomes, but not nonoperative outcomes.11 Future studies should evaluate whether these features may have potential prognostic implications for the outcomes of surgery compared to nonoperative care in patients with IDH.
Acknowledgment date: June 20, 2007. First revision date: September 14, 2007. Second revision date: November 29, 2007. Acceptance date: December 3, 2007.
The manuscript submitted does not contain information about medical device(s)/drug(s).
Federal funds were received in support of this work. No benefits in any form have been or will be received from a commercial party related directly or indirectly to the subject of this manuscript.
The authors acknowledge funding from the following sources: The National Institute of Arthritis and Musculoskeletal and Skin Diseases (U01-AR45444-01A1) and the Office of Research on Women’s Health, the National Institutes of Health, and the National Institute of Occupational Safety and Health, the Centers for Disease Control and Prevention. The Multidisciplinary Clinical Research Center in Musculoskeletal Diseases is funded by NIAMS (P60-AR048094-01A1). Dr. Lurie is supported by a Research Career Award from NIAMS (1 K23 AR 048138-01).
Appendix available online through Article Plus.