This study tested a novel method of researching agreement in the interpretation of clinical signs among expert clinicians who were widely dispersed geographically. The successful use of the internet to host these videos, and of a version stored on a memory stick where internet access is still poor, suggests that this approach can be further developed to include clinicians even in remote areas, provided they have access to a computer. Such methodologies have obvious extensions to teaching new skills to students and health workers. We then extended the approach, using a group presentation, to explore the ability of health workers in routine practice to identify consensus-defined clinical signs.
It is possible that a different set of experts would have classified the signs presented on videos differently. However, we included experts from a wide variety of settings internationally. It is also possible that agreement within local hospital panels was high because we used an open presentation, despite our attempts to limit contamination between observers. Despite these potential limitations, we believe the study demonstrated that very clear consensus can be reached amongst experts over the presence (or absence/grade) of specific clinical signs. It also demonstrated that where experts have a clear view on a clinical sign, health workers of a wide variety of cadres and with widely different levels of clinical experience in routine practice, at least in Kenya, are also able to identify that sign. This provides some reassurance that teaching or guidelines based on these clinical signs have the potential to be understood and implemented widely. However, the study also demonstrated that for many clinical videos experts showed only moderate or even poor agreement. Where experts found it hard to agree, health workers in routine settings also found it hard to agree. This finding has several implications.
Firstly, clinical signs may be better depicted as a spectrum from obviously present to obviously not present, with the position on the spectrum for any one child or video best represented by the proportionate agreement amongst multiple expert observers. The consequence is that training people to interpret clinical signs might best be done using videos where possible, together with a standard set of examples defined by proportionate agreement amongst experts. It also follows that any research study or aspect of clinical practice based on clinical sign criteria, whether an observational study, a randomised controlled trial or a guideline, will suffer to a greater or lesser degree from misclassification errors, as lack of agreement in interpreting clinical signs is not uncommon. Standard sets of video records could help improve clinical research and the generalisability of results.
The mean sensitivity scores were marginally higher than the specificity scores. Sensitivity was based on the ability of health workers to detect truly positive clinical signs, while specificity was based on the health workers' ability to detect truly negative clinical signs. Scoring higher for sensitivity than for specificity may be interpreted as indicating that the health workers tend to overdiagnose; that is, any person attending hospital is likely to be labelled as being sick. The clinicians' cautiousness would ensure that sick patients are identified and subsequently treated, but the lower specificity may result in overtreatment of children attending hospital who did not need treatment.
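The two definitions above can be made concrete with a minimal sketch. It assumes that each video carries an expert consensus label (1 for sign present, 0 for sign absent) that is treated as the truth, and that one health worker has rated the same videos. The function name `sensitivity_specificity` and the example ratings are hypothetical illustrations, not data or code from the study.

```python
def sensitivity_specificity(ratings, consensus):
    """Score one observer's binary ratings against consensus labels.

    ratings, consensus: equal-length sequences of 0/1 values, where 1 means
    "sign present". The consensus labels are treated as the truth
    (an assumption made for this sketch).
    """
    if len(ratings) != len(consensus):
        raise ValueError("ratings and consensus must have the same length")

    tp = fn = tn = fp = 0
    for rated, true in zip(ratings, consensus):
        if true == 1 and rated == 1:
            tp += 1  # sign present and detected
        elif true == 1 and rated == 0:
            fn += 1  # sign present but missed
        elif true == 0 and rated == 0:
            tn += 1  # sign absent and correctly ruled out
        else:
            fp += 1  # sign absent but reported as present

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative video")

    sensitivity = tp / (tp + fn)  # share of truly positive videos detected
    specificity = tn / (tn + fp)  # share of truly negative videos ruled out
    return sensitivity, specificity


# Hypothetical observer who tends to call signs "present".
consensus = [1, 1, 1, 0, 0, 0, 0, 1]
ratings = [1, 1, 0, 1, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(ratings, consensus)
```

In this invented example the observer detects three of the four positive videos (sensitivity 0.75) but correctly rules out only two of the four negative ones (specificity 0.50), which is the overdiagnosis pattern described above.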
When investigating agreement between observers, researchers have long used kappa and other chance-adjusted measures, commonly interpreting kappa on the scale derived by Landis and Koch in 1977. However, the appropriateness of kappa as a measure of agreement has recently been debated. The dependence of kappa on trait prevalence and on the marginal totals in the cross-tabulation used in its calculation predisposes kappa to two paradoxes. Counterintuitively, studies can have high kappa values at relatively low levels of crude agreement and, conversely, low kappa values at correspondingly high levels of crude agreement. These limitations mean that kappa scores are not comparable across studies and suggest that simple scales for their interpretation are unhelpful. A relatively new statistic, the AC1 statistic, has been suggested by Gwet to adjust for chance in agreement studies. In this study we compared crude agreement and chance-adjusted agreement using Fleiss' kappa and the AC1 statistic. At the extremes of crude agreement the AC1 and Fleiss' kappa scores approximated each other. For the other values of crude agreement, kappa scores were usually lower than AC1 scores and were not linearly correlated with crude agreement.
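The prevalence paradox described above can be illustrated with a short sketch that computes crude agreement, Fleiss' kappa and Gwet's AC1 from the same ratings. It uses the standard textbook formulas for a fixed number of raters per item and is not the analysis code used in this study. The function names and the example counts are hypothetical. Both statistics have the form (Pa - Pe) / (1 - Pe) and differ only in the chance-agreement term Pe.

```python
def crude_agreement(counts):
    """Mean pairwise agreement across items.

    counts[i][j] is the number of raters who put item i in category j.
    Every item is assumed to be rated by the same number of raters (n >= 2).
    """
    n = sum(counts[0])
    per_item = [sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts]
    return sum(per_item) / len(counts)


def category_shares(counts):
    """Overall share of all ratings that fall in each category."""
    n = sum(counts[0])
    total = n * len(counts)
    return [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]


def fleiss_kappa(counts):
    pa = crude_agreement(counts)
    # Chance agreement: sum of squared category shares.
    pe = sum(p * p for p in category_shares(counts))
    return (pa - pe) / (1 - pe)  # undefined when pe == 1 (not handled here)


def gwet_ac1(counts):
    pa = crude_agreement(counts)
    shares = category_shares(counts)
    q = len(shares)
    # Chance agreement: sum of p * (1 - p) over categories, scaled by 1 / (q - 1).
    pe = sum(p * (1 - p) for p in shares) / (q - 1)
    return (pa - pe) / (1 - pe)


# Hypothetical panel: 10 videos, 5 raters, columns = [present, absent].
# The sign is rated "present" almost everywhere (high prevalence).
counts = [[5, 0]] * 8 + [[4, 1]] * 2

pa = crude_agreement(counts)
kappa = fleiss_kappa(counts)
ac1 = gwet_ac1(counts)
```

With these invented counts crude agreement is 0.92, yet Fleiss' kappa is about -0.04 while AC1 is about 0.91. The gap arises because, when nearly all ratings fall in one category, kappa's chance term approaches 1 whereas AC1's chance term approaches 0.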
In conclusion, we have shown that there can be widespread agreement in identifying obvious examples of clinical signs amongst all types of clinicians. However, greater attention should be paid to establishing, where possible, standardised thresholds for deciding when a sign is or is not present, as appropriate, to delineate a particular condition. Video records provide one possible means to achieve this. Clinicians should also be more aware of developments in the statistical theory underpinning measures of agreement, to avoid well-described pitfalls. This study adds to the wider body of evidence on health workers' abilities to recognise the signs recommended by IMCI.