We found substantial reliability for many of the qualitative and quantitative MRI features of SPS assessed in this study. Agreement on the severity of central canal stenosis and foraminal stenosis was good, whereas subarticular zone stenosis showed markedly variable agreement between reader pairs. The measurements of soft tissue canal area and thecal sac area were reasonably reliable, though the trichotomized stenosis ratio showed less reliability than the ratings of central stenosis as mild, moderate, or severe. These findings are important because they suggest that some MRI features may be measured reliably enough to be examined as correlates of prognosis.
To judge the clinical applicability of the levels of agreement seen in this study, we can compare them to the reliability of physical examination features that have been studied in various spine populations. The substantial agreement for central stenosis (κ = 0.73) was similar to the most reliable physical examination features studied, such as calf wasting with a κ of 0.80 and crossed straight leg raising in patients with disc herniation with a κ
The moderate agreement for foraminal stenosis (κ = 0.58) is similar to the reliability of the assessment of pain with bending (κ = 0.56) or pain with resisted external hip rotation (κ = 0.63).11
The lowest agreement in our study (κ = 0.45) was similar to the agreement seen for reproducibility of bony tenderness (κ = 0.40) or Achilles reflex deficit (κ
Our results compare favorably with prior studies of imaging interpretation in SPS. Speciale et al reported an overall interobserver κ of 0.26 for ratings of stenosis severity.5
This much poorer agreement may stem from the lack of discussion or definition of what constituted mild, moderate, or severe in that study. We attempted to define in advance all ratings used in our study, convening in-person meetings to review cases in order to reach consensus on an approach. In addition, the agreement reported by Speciale et al seems to include foraminal and lateral recess stenosis along with central stenosis. We found wide variability in the agreement between reader pairs for subarticular (lateral recess) stenosis ratings.
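The κ values compared above are chance-corrected agreement statistics. As a minimal sketch of how such a value is obtained for one reader pair, the following computes an unweighted Cohen's κ from two readers' categorical ratings. The function name `cohen_kappa` and the ratings are hypothetical and purely illustrative; they are not study data, and this section does not state which κ variant (for example, weighted or unweighted) was used in the study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of cases where both raters chose the same category.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over categories.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical central stenosis ratings for 10 levels (illustrative only).
reader_1 = ["mild", "mild", "moderate", "severe", "severe",
            "moderate", "mild", "severe", "moderate", "mild"]
reader_2 = ["mild", "moderate", "moderate", "severe", "severe",
            "moderate", "mild", "moderate", "moderate", "mild"]

kappa = cohen_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the readers agree on 8 of 10 levels (80%), but agreement expected by chance from their marginal frequencies is 33%, so κ is about 0.70 rather than 0.80.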
The quantitative measurements showed reasonably good reliability in terms of intraclass correlations. The differences between measurements ranged from 4.8% to 13%. The reliability of the thecal sac area has been previously studied. Hamanishi reported a correlation coefficient for the dural sac area of 0.92 and Weiner a correlation of 0.91.12,13 However, these values were for Pearson's correlation coefficient rather than the intraclass correlation, and values for absolute differences between measurers were not reported. In our current study, the measured thecal sac stenosis ratio did not seem to be more reliable than the subjective rating of central canal stenosis severity.
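The distinction between Pearson's correlation and the intraclass correlation matters because Pearson's coefficient ignores systematic offsets between measurers. The sketch below illustrates this with hypothetical thecal sac areas in which one reader measures exactly 20 mm² larger than the other at every level. The ICC form used here (two-way, absolute agreement, single measurement) is an assumption chosen for illustration; this section does not specify which ICC form the study used, and the function names and data are not from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_absolute_agreement(x, y):
    """Two-way, absolute-agreement, single-measurement ICC for two readers."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    subject_means = [(a + b) / k for a, b in zip(x, y)]
    reader_means = [sum(x) / n, sum(y) / n]
    # Mean squares for subjects (rows), readers (columns), and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in reader_means) / (k - 1)
    ss_err = sum(
        (value - subject_means[i] - reader_means[j] + grand) ** 2
        for j, column in enumerate((x, y))
        for i, value in enumerate(column)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical thecal sac areas in mm^2 (illustrative only).
reader_a = [60, 80, 100, 120, 140]
reader_b = [area + 20 for area in reader_a]  # constant 20 mm^2 offset

r = pearson_r(reader_a, reader_b)
icc = icc_absolute_agreement(reader_a, reader_b)
print(f"Pearson r = {r:.3f}, ICC = {icc:.3f}")
```

With these made-up values Pearson's r is 1.0 even though the readers never agree on the area, whereas the absolute-agreement ICC drops to about 0.83 because it penalizes the constant offset.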
This study had a number of important limitations. Despite our efforts to define terms and reach consensus on rating procedures, we relied on clinically available images with varying image acquisition protocols, field strength, slice orientation, etc. This may have contributed to poorer reliability for some imaging characteristics. However, it is likely to reflect the level of reliability that could be expected in clinical practice where there is substantial variability in image quality.14
In addition, there was no standardization across readers in terms of the setting or equipment on which the readings were done. This could have contributed to the differences between readers. The readers themselves were also heterogeneous (3 radiologists and an orthopedic spine surgeon); however, when we assessed reliability across reader pairs, we did not see any systematic differences in inter-reader agreement based on reader specialty.
It is important to note our use of prestudy meetings, detailed handbooks of definitions, and standardized reporting forms with multiple choice categories for each parameter at each level. These features allowed the assessments to be structured far more than is possible in general clinical practice. Thus, our results may overestimate the reliability that might be expected among readers doing routine clinical assessments. In addition, although the readers were not provided with specific clinical data on subjects except their age and sex, they were aware that all the images were from patients with either disc herniation or SPS severe enough to qualify them as surgical candidates. How the lack of “normal” studies may have affected the readers’ interpretations is unknown.
Disagreements between readers in our study were fairly modest overall. However, when they did occur, we had no gold standard by which to decide between differing interpretations. For example, it is unclear whether the measured thecal sac area or the subjective rating of central stenosis severity is the more “valid.” The standard for preferring 1 assessment over another should not be based on reliability alone, but rather on whether 1 assessment is able, or better able, to predict patient symptoms or outcome. The assessment of reliability is merely the first step in this process. Future studies should assess these ratings and measurements for their prognostic value in predicting outcomes.
- In this cohort of patients with spinal stenosis and neurogenic claudication with or without associated degenerative spondylolisthesis, ratings of central stenosis, foraminal stenosis, and thecal sac area showed moderate to substantial intra-reader and inter-reader reliability.
- Rating of subarticular zone stenosis and measures of osseous canal area were less reliable.
- Future studies should assess the prognostic significance of these findings.