The overall reproducibility of the assessment of the quality of reporting of diagnostic accuracy studies published in 12 medical journals with a high impact factor, using the STARD statement, was good, with a Cohen's kappa statistic of 0.70 and an ICC of 0.79. Substantial disagreement was found for some items, including (a) the reporting of the rationale for the reference standard, (b) the number of included participants who underwent the index tests and/or the reference standard, with a description of why participants failed to undergo either test, (c) the distribution of the severity of disease in those with the target condition and of other diagnoses in participants without the target condition, (d) a cross tabulation of the results of the index test by the results of the reference standard, and (e) how indeterminate results, missing data and outliers of the index test were handled. Both intra-observer and inter-observer agreement were low (≤ 75%) for these same items, indicating that reviewers had difficulty assessing them. High intra-observer but low inter-observer reproducibility would have pointed to differences between reviewers in the interpretation of the items. Because agreement was low in both cases, the disagreements in our study were probably not caused by differences in interpretation of the items, but rather by difficulties in assessing the reporting of these items in the articles.
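To make the two agreement measures discussed above concrete, the following is a minimal sketch of how percentage agreement and Cohen's kappa can be computed for a single item scored by two reviewers. The function names and the example scores are hypothetical illustrations and are not data from this study; 1 denotes "item reported" and 0 denotes "item not reported".

```python
from collections import Counter
import math


def percent_agreement(scores_a, scores_b):
    """Proportion of studies on which the two assessments give the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both assessments must score the same, non-empty set of studies")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(scores_a)
    p_observed = percent_agreement(scores_a, scores_b)
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of the two marginal proportions, summed over categories.
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return math.nan  # kappa is undefined when both assessments use one single category
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical scores for one STARD item across ten articles (not study data).
reviewer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
reviewer_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

agreement = percent_agreement(reviewer_1, reviewer_2)
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this hypothetical example the reviewers agree on 8 of 10 articles (80%), yet kappa is only about 0.52, because a large part of that agreement would be expected by chance given how often both reviewers score the item as reported. The same functions apply to intra-observer agreement if the two lists hold the first and second assessment by the same reviewer.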
Stengel and colleagues found similar results in 62 diagnostic accuracy studies on ultrasonography for trauma. The inter-observer agreement in the assessment of STARD items ranged from poor for the specification of the number of patients who dropped out (58%) to almost perfect for the specification of the selection criteria (98%). [25]
Four reviewers acted as second reviewer. Because each of them assessed only a small number of studies, we decided to ignore differences in scoring between these four reviewers and not to calculate stratified reproducibility statistics. As the agreement of each of these reviewers with the first reviewer was comparable, this is unlikely to have influenced our results.
A flow diagram presenting the design of the study and the flow of patients through the study would help to improve the quality of reporting of diagnostic accuracy studies, as it explicitly clarifies the items that appeared to be difficult to assess. The optimal flow diagram presents the target population (the setting, location and characteristics of potentially eligible persons, representing the individuals to whom the results are expected to apply), the eligible population (the proportion of potential participants who undergo screening and are eligible to enroll), and the actual research population (eligible patients who are willing to participate and give informed consent). The number of participants who did not satisfy the eligibility criteria, the reasons for exclusion, the number of participants who failed to receive one of the tests, and the results of the index tests (including indeterminate and missing results) by the results of the reference standard, representing the true positives, true negatives, false positives and false negatives, can easily be reported in a flow diagram. The intra- and inter-observer disagreements regarding items 16 and 22 were caused by confusion about the final research population. If patients in a study were excluded from the analysis because they did not receive one of the tests (missing) and the reasons for exclusion were not specified, it was unclear whether these patients belonged to the actual research population or not.
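The accounting that such a flow diagram makes explicit can be sketched as follows. This is only an illustration under stated assumptions: the function name, the category labels and the participant records are hypothetical, and a result of `None` stands for a test that was not received or gave an indeterminate result.

```python
def cross_tabulate(participants):
    """Cross tabulate index test results by reference standard results.

    `participants` is a list of (index_result, reference_result) pairs, where
    True is positive, False is negative and None is missing or indeterminate.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0, "missing_or_indeterminate": 0}
    for index_result, reference_result in participants:
        if index_result is None or reference_result is None:
            # These participants must be reported explicitly, not silently dropped.
            counts["missing_or_indeterminate"] += 1
        elif index_result and reference_result:
            counts["TP"] += 1
        elif index_result and not reference_result:
            counts["FP"] += 1
        elif not index_result and reference_result:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical research population of eight enrolled participants (not study data).
enrolled = [
    (True, True), (True, True), (True, False), (False, True),
    (False, False), (False, False), (None, True), (True, None),
]

table = cross_tabulate(enrolled)
print(table)
# Every enrolled participant is accounted for in exactly one cell.
print(sum(table.values()) == len(enrolled))
```

When the cells of the cross tabulation and the count of missing or indeterminate results add up to the number of enrolled participants, a reviewer can tell directly whether excluded patients belonged to the research population.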
A wide variety of diseases and tests was included in our study. This was a result of our decision to select all diagnostic accuracy studies published in 2000 in general medical journals and discipline-specific journals. Although a pilot evaluation of the quality of reporting was carried out among all reviewers, no additional criteria were defined for the specification of the severity of all diseases (item 18) described in the studies. The evaluation of the reporting of this item was therefore affected by differences in the subjective judgment of the reviewers. A similar observation was made for the reporting of the rationale for the reference standard. More detailed specification of these items is possible if the STARD statement is used to evaluate papers on a specific subject. Items such as the recruitment period, adverse events of the tests and presentation of a flow diagram are less susceptible to the subjective judgment of the reviewers, resulting in higher inter-assessment, intra-observer and inter-observer agreement.
High inter-assessment agreement was observed for the reporting of those items that are associated with biased estimates of diagnostic accuracy, such as the blinding of the readers of the index and reference test to the results of the other test and to the clinical information, and the description of the study population. [26] Furthermore, the results of this reproducibility study indicate the need for at least two reviewers who independently assess the quality of reporting of diagnostic accuracy studies.
We recommend that the STARD steering committee discuss the results of this reproducibility study and decide whether certain items of the STARD statement should be further clarified, as these items cause difficulties in the assessment of the quality of reporting. In our opinion, including a flow diagram in all reports on diagnostic accuracy studies would be very helpful for both readers and reviewers.
As this reproducibility study yielded important information on the applicability of the STARD statement, similar studies could also be informative for other guidelines such as CONSORT, QUOROM, MOOSE and STROBE.