In previously reported studies, fracture classification systems of apparent merit were found to lack the reliability necessary for clinical use. In the present study, we investigated the possibility that the application of a classification may be hindered by human error, and that a consensus classification approach may improve reliability.
We performed a traditional reliability experiment on a modified Garden classification. We reported a mean kappa coefficient for inter-observer reliability of 0.69, comparable to the 0.73 value found by Oakes et al and the 0.68 value determined by Thomsen et al. We also found that the mean rate of agreement with the consensus was 82%. This corresponds to the accuracy rate implied by the observed kappa coefficient of 0.69, as accuracy is approximated by the square root of kappa. The concordance of the consensus-agreement rate, 82%, with the square root of kappa, 0.83, suggests that in cases lacking a reference standard, consensus-agreement rates can be used as a proxy for individual accuracy rates.
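As a quick arithmetic check of that correspondence (an illustrative sketch only, not part of the study's analysis), the square root of the observed kappa can be computed directly:

```python
import math

kappa = 0.69                # mean inter-observer kappa reported above
consensus_agreement = 0.82  # mean rate of agreement with the consensus

# accuracy approximated as the square root of kappa
print(f"sqrt(kappa) = {math.sqrt(kappa):.2f} vs consensus agreement = {consensus_agreement:.2f}")
# sqrt(0.69) ~ 0.83, close to the 82% consensus-agreement rate
```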
In the second step of the analysis, we formed virtual reader groups to assess the effect of consensus classification. These virtual reader groups were indeed able to classify hip fractures with greater reliability: compared with the individual inter-observer reliability of 0.69, the mean reliability was 0.77 for classification by 3-member virtual reader groups and 0.80 for 5-member groups.
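The mechanism can be made concrete with a minimal Monte Carlo sketch. It is not the study's actual methodology: it assumes a binary (displaced versus nondisplaced) decision, independent readers with a fixed accuracy of 0.82, and simple majority voting within each virtual group, so the absolute kappa values it produces will not match those reported above. It does, however, reproduce the qualitative effect, namely that two majority-vote panels agree with each other more often than two individual readers do.

```python
import numpy as np

rng = np.random.default_rng(0)

def cohen_kappa(a, b):
    """Cohen's kappa for two binary rating vectors."""
    po = np.mean(a == b)                                            # observed agreement
    pe = np.mean(a) * np.mean(b) + np.mean(1 - a) * np.mean(1 - b)  # chance agreement
    return (po - pe) / (1 - pe)

def simulate(n_cases=100_000, accuracy=0.82, panel_size=3):
    truth = rng.integers(0, 2, n_cases)              # 1 = displaced, 0 = nondisplaced

    def reader():
        correct = rng.random(n_cases) < accuracy     # each case read correctly with prob `accuracy`
        return np.where(correct, truth, 1 - truth)

    def panel():
        votes = np.stack([reader() for _ in range(panel_size)])
        return (votes.sum(axis=0) * 2 > panel_size).astype(int)  # majority vote (odd panel size)

    return cohen_kappa(reader(), reader()), cohen_kappa(panel(), panel())

k_individual, k_panel = simulate(panel_size=3)
print(f"kappa between two individual readers: {k_individual:.2f}")
print(f"kappa between two 3-member panels:    {k_panel:.2f}")
```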
To test the feasibility of assembling groups of readers in real time, we timed the responses of 40 volunteers to email queries for image interpretation. We found that a group of 9 readers (a group that, in theory, would be 99% accurate) could be assembled within three hours, even on a weekend.
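The "theoretically 99% accurate" figure can be read as a Condorcet-style binomial calculation: if each of 9 readers independently makes a correct binary call with a probability near the 82% consensus-agreement rate, the majority verdict is correct roughly 99% of the time. The independence and binary-decision assumptions in the sketch below are ours:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that the majority of n independent readers, each correct
    with probability p, makes the correct binary call (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(f"9-reader majority: {majority_accuracy(0.82, 9):.3f}")  # ~0.99
```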
Limitations of our study must be considered. To begin, the effect of group reading was studied separately from the assessment of feasibility. That split reflects the chronology of discovery: first, the improved accuracy of group reading was detected, and only later, to address the issue of practicality, was a determination of response times undertaken.
Furthermore, it may seem that the results presented are obvious; that it should be apparent that having more readers leads to greater accuracy. That is not always true: increasing the number of readers only amplifies the tendencies of the individual readers. If individual accuracy were less than 50%, increasing the number of readers would actually decrease accuracy. Thus, one must demonstrate that individual reader accuracy exceeds 50%. A second necessary finding is that no subset of cases was disproportionately difficult. If overall accuracy were, say, 80%, yet that rate reflected 100% accuracy in the 80% of cases that are "easy" and 0% accuracy in the remaining 20% of cases that are "hard", increasing the number of readers would not improve matters: the hard cases would continue to vex the readers.
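Both caveats can be illustrated with the same binomial model (again assuming independent, binary decisions; a sketch, not a reanalysis of our data): below 50% individual accuracy a majority vote amplifies error rather than correcting it, and a fixed subset of uniformly "hard" cases caps overall accuracy no matter how many readers vote.

```python
from math import comb

def majority_accuracy(p, n):
    """P(the majority of n independent readers is correct) for a binary call, n odd."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# Caveat 1: at 45% individual accuracy, a 9-reader majority is worse than a single reader.
print(majority_accuracy(0.45, 1), round(majority_accuracy(0.45, 9), 2))  # 0.45 -> ~0.38

# Caveat 2: 100% accuracy on the 80% of "easy" cases and 0% on the 20% of "hard" cases
# yields 80% overall accuracy regardless of group size.
for n in (1, 3, 9):
    overall = 0.80 * majority_accuracy(1.0, n) + 0.20 * majority_accuracy(0.0, n)
    print(n, round(overall, 2))  # always 0.8
```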
It must also be considered that our email response time experiment represents a “best-case scenario”: the task was easy and of low stakes, and a series of ten may have been too short to evoke fatigue, apathy, or other causes of waning interest. That said, the study population was small, and employing a larger group may more than compensate for the inevitable drop-outs. Additionally, if group members were to be reciprocally rewarded, so to speak, by having their own cases read by their peers, attrition may be less of a concern.
Two general criticisms of fracture classification studies such as ours apply here: first, that the volunteer reviewers simply do not care as much as attending surgeons and therefore devote less mental effort to the task of diagnosis; and second, that the cases were not representative of the true distribution seen in clinical practice (a form of spectrum bias). These cannot be answered beyond the equally general reply, namely, that this is a feature of all studies of this type.
In sum, we have found that harvesting the wisdom of the crowd may help improve fracture classification reliability, suggesting that group efforts might improve diagnostic accuracy in general. This is consistent with the experimental behavioral investigations reported in Science by Woolley et al, who found “converging evidence of a general collective intelligence factor that explains a group's performance on a wide variety of tasks.” Of course, not all crowds are wise: crowds can be susceptible to “madness” and “extraordinary delusions”. To create a wise crowd, we need to have diversity of opinion; we need to ensure that opinions are based on some form of knowledge; and we must make certain that an individual's opinions remain independent of others' opinions. Those criteria can be met in the case of fracture classification, and perhaps in other clinical problems in orthopaedic surgery and medicine. The advice of a wise crowd can be used to supplement (and not supplant) our individual powers of reason. In turn, crowd intelligence may help us reduce error and improve the quality of care at low additional cost.