During an arbitrarily chosen two-month period (October and November 2007), all patients discharged from five Danish university hospitals (orthopaedic ward or emergency unit) with a diagnosis of a fracture of the proximal humerus were identified. Imaging material was collected, stored, and presented electronically by the first author, who did not serve as an observer. Plain anterior-posterior and scapular-lateral radiographs had to be available for each case. We excluded pathological fractures, humeral shaft fractures, healed fractures, fractures in skeletally immature patients, pseudarthroses, and miscoded cases. No further selection of images was allowed.
Five fellowship-trained shoulder surgeons (BO, AS, LF, MK, SJ) served as observers. They were not informed of the purpose and design of the study and were blinded to the identity of the patients, the institutions, and the treatments given.
The five observers independently assessed and classified all sets of radiographs on two occasions three months apart. The observers were allowed to use a goniometer, a numbered diagram of the original 16-category Neer classification [1], and a written definition of displacement. There was no time limit. First, the observers were asked to assess whether the quality of the imaging material was sufficient for classification and treatment purposes in each case. Second, all pairs of radiographs were classified according to the Neer classification. Third, the observers were asked to recommend one of three treatment modalities for each case: non-operative treatment, locking plate osteosynthesis, or hemiarthroplasty.
Three months later the observers independently re-assessed and re-classified all sets of radiographs in a new, random order. In this second classification round the observers were additionally provided with the patient’s age.
Mean kappa-values for inter-observer agreement and ninety-five percent confidence intervals were calculated. Kappa-values were interpreted qualitatively according to Landis and Koch [3]: kappa-values below 0 indicate poor agreement, 0.00-0.20 slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and 0.81-1.00 excellent agreement.
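For concreteness, the unweighted kappa for one pair of observers and the Landis and Koch labelling can be sketched in base R as follows; the function names and the exact handling of the band edges are ours, not taken from the study's analysis code:

    # Unweighted Cohen's kappa for two raters' categorical ratings.
    cohen_kappa <- function(r1, r2) {
      cats  <- sort(unique(c(r1, r2)))
      tab   <- table(factor(r1, levels = cats), factor(r2, levels = cats))
      n     <- sum(tab)
      p_obs <- sum(diag(tab)) / n                      # observed agreement
      p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
      (p_obs - p_exp) / (1 - p_exp)
    }

    # Qualitative label after Landis and Koch; the cut-points sit between
    # the band edges quoted above (e.g. 0.20 -> slight, 0.21 -> fair).
    landis_koch <- function(k) {
      labels <- c("poor", "slight", "fair", "moderate", "substantial", "excellent")
      labels[findInterval(k, c(0, 0.205, 0.405, 0.605, 0.805)) + 1]
    }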
For both classification rounds, mean kappa-values and ninety-five percent confidence intervals were calculated for inter-observer agreement on 1) the adequacy of the radiographs for classification and treatment purposes, 2) classification according to the 16-category Neer classification, and 3) treatment recommendation: non-operative treatment, locking plate osteosynthesis, or hemiarthroplasty.
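The kappa statistics themselves were computed in Stata (see below). Purely to illustrate the quantity being estimated, a base-R sketch of the mean pairwise kappa over the five observers, with a percentile bootstrap confidence interval, might look as follows; it relies on cohen_kappa from the sketch above, and the data layout (a cases-by-observers matrix) is an assumption of ours:

    # Mean of Cohen's kappa over all observer pairs.
    # ratings: a cases x observers matrix of category codes.
    mean_pairwise_kappa <- function(ratings) {
      pairs <- combn(ncol(ratings), 2)
      mean(apply(pairs, 2, function(p) cohen_kappa(ratings[, p[1]], ratings[, p[2]])))
    }

    # Percentile bootstrap 95% confidence interval, resampling cases
    # with replacement.
    kappa_boot_ci <- function(ratings, B = 2000) {
      n    <- nrow(ratings)
      boot <- replicate(B, mean_pairwise_kappa(ratings[sample(n, replace = TRUE), ]))
      quantile(boot, c(0.025, 0.975))
    }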
Changes in mean kappa-values for inter-observer agreement on classification and treatment recommendations between the first and second classification rounds were analyzed. The statistical significance of the observed differences in mean kappa-values was assessed using a bootstrapping technique.
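The paper does not specify the exact resampling scheme; one common construction of such a bootstrap test, which resamples the same cases in both rounds so that the comparison stays paired, is sketched below with our own function names:

    # Bootstrap test for the change in mean pairwise kappa between rounds.
    # round1, round2: cases x observers matrices with cases in the same order.
    kappa_diff_boot <- function(round1, round2, B = 2000) {
      n   <- nrow(round1)
      obs <- mean_pairwise_kappa(round2) - mean_pairwise_kappa(round1)
      diffs <- replicate(B, {
        idx <- sample(n, replace = TRUE)  # same resampled cases in both rounds
        mean_pairwise_kappa(round2[idx, ]) - mean_pairwise_kappa(round1[idx, ])
      })
      # Two-sided p-value: how often a bootstrap difference, centred at zero,
      # is at least as extreme as the observed difference.
      p <- mean(abs(diffs - mean(diffs)) >= abs(obs))
      list(difference = obs, ci = quantile(diffs, c(0.025, 0.975)), p = p)
    }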
For all cases in which the classification category changed from the first to the second round, we recorded whether the change in classification was accompanied by a change in treatment recommendation.
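In code this amounts to a simple cross-tabulation; a sketch under the same assumed data layout, where neer1/neer2 and tx1/tx2 are hypothetical cases-by-observers matrices of Neer categories and treatment recommendations per round:

    # Among observer-case pairs whose Neer category changed between rounds,
    # count how often the treatment recommendation changed as well.
    changed_class <- neer1 != neer2
    changed_tx    <- tx1   != tx2
    table(treatment_changed = changed_tx[changed_class])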
Finally, we conducted a sensitivity analysis by omitting the most extreme observer and repeating the calculations.
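A leave-one-out loop over the observers is one way to identify the most extreme observer and gauge the effect of omitting them; a sketch reusing mean_pairwise_kappa from above:

    # Recompute the mean pairwise kappa with each observer omitted in turn;
    # the observer whose omission moves the estimate most is the most extreme.
    sapply(seq_len(ncol(ratings)),
           function(j) mean_pairwise_kappa(ratings[, -j, drop = FALSE]))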
Stata, version 11.0 (StataCorp, 2009, College Station, Texas, USA) was used for the calculation of kappa statistics and confidence intervals. The ‘bootstrap’ package of R statistical software, version 2.12.1 (R Foundation for Statistical Computing, 2010, Vienna, Austria) was used for bootstrapping.