Twenty clinical cases were generated for the evaluation. The evaluation process involved 5 assessors and 2 referees. Each clinical case was assigned to three assessors. So, in summary, the parameters of the evaluation were
- clinical cases: 20
- number of assessors: 5
- number of assessors per case: 3
- cases per assessor: 12
- number of referees: 2.
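The parameters above are mutually consistent: 20 cases with 3 assessors each give 60 assignments, or 12 per assessor when shared among 5 assessors. The sketch below checks this and builds one possible balanced allocation. The assessor labels and the allocation scheme are assumptions made for illustration only; the text does not describe how cases were actually distributed.

```python
from collections import Counter
from itertools import combinations

N_CASES = 20
ASSESSORS = ["A1", "A2", "A3", "A4", "A5"]  # hypothetical labels
PER_CASE = 3

# Consistency check: total assignments shared evenly among the assessors.
cases_per_assessor = N_CASES * PER_CASE // len(ASSESSORS)

# One possible balanced design (an assumption, not the authors' procedure):
# cycle through the 10 distinct triples of assessors, using each triple twice.
triples = list(combinations(ASSESSORS, PER_CASE))
assignment = {case: triples[case % len(triples)] for case in range(N_CASES)}

# Number of cases each assessor receives under this design.
load = Counter(a for panel in assignment.values() for a in panel)

print(cases_per_assessor, dict(load))
```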
The full results obtained from the evaluation can be found online at [38]. A value of N/A in the tables means "not applicable": some diseases were diagnosed neither by the assessors nor by the ML-DDSS system and therefore have no calculable parameters.
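A minimal sketch of why such entries arise, assuming the per-disease metrics are ratios of confusion-matrix counts: when a disease has no positives of any kind, the denominators of sensitivity and precision are zero. The function names and the counts are hypothetical and are not taken from the published tables.

```python
from typing import Dict, Optional


def ratio(num: int, den: int) -> Optional[float]:
    """Return num/den, or None when the denominator is zero (shown as N/A)."""
    return num / den if den else None


def disease_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, Optional[float]]:
    """Per-disease metrics from confusion-matrix counts."""
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


def show(value: Optional[float]) -> str:
    """Format a metric for a results table."""
    return "N/A" if value is None else f"{value:.2f}"


# Hypothetical disease that is absent from the reference diagnoses and was
# never proposed by the rater: there are no positives of any kind.
never_diagnosed = disease_metrics(tp=0, fp=0, tn=20, fn=0)
print({name: show(value) for name, value in never_diagnosed.items()})
```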
4.1. Results of Entire Knowledge Base
summarizes the results, showing the values obtained for the entire system, in comparison with the five assessors (anonymized as EX-NNNN in the figure columns). Given that the most constrained calculation for the system is when the values are calculated for the intersection of the arbitration, we have used these as representative values.
Results of the evaluation (comparison between system and all the assessors).
When the accuracy is used as a traditional quality metric, the system performs similarly to the best experts. However, the results are quite different from one another, reinforcing the need to use additional metrics in the evaluation. When looking at the MCC, another value that tries to summarize the overall quality, there is a difference of 30% between system and experts. Although the global quality is being measured, the MCC takes into account balance between accuracy and specificity, which is worse in the experts than in the system.
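To illustrate why accuracy alone can hide differences that the MCC exposes, the following sketch computes both metrics from confusion-matrix counts using their standard definitions. The two raters and their counts are invented for the example and do not correspond to any assessor or to the system.

```python
from math import sqrt
from typing import Optional


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all decisions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp: int, fp: int, tn: int, fn: int) -> Optional[float]:
    """Matthews correlation coefficient; None if any marginal total is zero."""
    den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if den == 0:
        return None
    return (tp * tn - fp * fn) / sqrt(den)


# Hypothetical raters with 100 decisions each and identical accuracy.
rater_a = dict(tp=8, fp=4, tn=86, fn=2)  # finds most positives, some false alarms
rater_b = dict(tp=4, fp=0, tn=90, fn=6)  # no false alarms, misses most positives

acc_a, acc_b = accuracy(**rater_a), accuracy(**rater_b)
mcc_a, mcc_b = mcc(**rater_a), mcc(**rater_b)
print(f"accuracy: {acc_a:.2f} vs {acc_b:.2f}; MCC: {mcc_a:.3f} vs {mcc_b:.3f}")
```

In this invented example both raters reach the same accuracy, yet the MCC separates them because it uses all four cells of the confusion matrix.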
Although the experts were free to provide as many diagnoses as they saw fit, their sensitivity is lower than that of the automated system. The experts did, however, perform better than the system on the specificity metric. Given that both results are near 95%, it is easier to perform statistically worse than to perform statistically better. This explains why the experts have a slight advantage in specificity while having an important disadvantage in sensitivity.
These results suggest that the system would be beneficial as a supporting tool for experts, where the system suggests diagnoses and the experts confirm them. This would be similar to, for example, a pair of experts where one has the highest sensitivity and the other has the highest specificity; the combination would likely generate better diagnoses than a lone expert. Finally, the precision is much lower for the experts than for the system. Mathematically, this is because the quotient FP (false positive)/TP (true positive) is larger for the experts than for the system, which in practice is because the number of TP is greater for the system (as shown by its higher sensitivity values). This has the unexpected consequence that positive predictions from the system are more likely to be true.
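The dependence of precision on the ratio of false to true positives can be made explicit with the standard definition of precision (the notation is introduced here for illustration and is not taken from the original tables):

```latex
\[
\text{precision} \;=\; \frac{TP}{TP + FP} \;=\; \frac{1}{1 + \dfrac{FP}{TP}}
\]
```

Precision therefore depends only on the quotient FP/TP: a rater who produces more true positives for a similar number of false positives obtains a higher precision.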
To determine whether the observed differences are statistically significant, a t-test was applied to the metrics. The differences between the assessors and the system in precision, accuracy, and specificity were not significant.
The conclusions are supported by small confidence intervals for the system, indicating enough data has been gathered to accurately perform the evaluation. It is difficult to extract information about the precision given the wide interval and overlap between experts and system. However, it is possible to extract some conclusions from the MCC and recall metrics, even with wide confidence intervals for the experts, as they do not overlap. More experts or diagnostic cases will be useful in order to narrow those intervals; but the data is useful enough in its current form to draw several conclusions.
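As an illustration of this kind of comparison, the sketch below runs a one-sample t-test of five expert values against a single system value and derives a 95% confidence interval for the expert mean. Both the choice of a one-sample test and all the numbers are assumptions made for the example; the text does not specify the exact test variant, and the values are not the published results.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
# It is only valid for samples of exactly five values.
T_CRIT_DF4 = 2.776


def one_sample_t(values, reference):
    """t statistic for the hypothesis that mean(values) equals reference."""
    standard_error = stdev(values) / sqrt(len(values))
    return (mean(values) - reference) / standard_error


def confidence_interval(values, t_crit):
    """Two-sided confidence interval for the mean of values."""
    centre = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width


expert_recall = [0.50, 0.60, 0.55, 0.65, 0.45]  # hypothetical, one per assessor
system_recall = 0.85                            # hypothetical

t_stat = one_sample_t(expert_recall, system_recall)
low, high = confidence_interval(expert_recall, T_CRIT_DF4)
significant = abs(t_stat) > T_CRIT_DF4

print(f"t = {t_stat:.2f}, 95% CI for experts = ({low:.3f}, {high:.3f})")
print("significant at the 5% level" if significant else "not significant")
```

With these invented values the system lies outside the experts' interval, which is the non-overlap situation described above for the MCC and recall.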
4.2. Results for Common Diseases
For more frequent diseases, the system can perform as well as the experts, in some cases with 100% accuracy, as shown for influenza. For gastroenteritis, however, the experts failed to provide the diagnosis in some cases, as sensitivity does not reach 100%. This can probably be attributed to rare cases of the disease, as some experts missed the same cases. The modeling may also be at fault; with only three symptoms in the diagnostic rules for gastroenteritis, it is particularly sensitive to a lack of evidentiary symptoms.
There are also important differences among the assessors, particularly with respect to the MCC and precision metrics, which suggests that the physicians on the panel have different levels of familiarity with these diseases. These interevaluator differences were consistent across most of the common diseases: the best experts in the influenza case match the best in the gastroenteritis case. However, they do not match the global results, which suggests that these experts are worse at diagnosing less common diseases.
The specificity metric is the focal point of the analysis for common diseases; high sensitivity is expected because these diseases are almost always considered during an expert's differential diagnosis. Often, a high specificity is preferable, so that other options can be considered quickly when a common disease does not match. Here, the system has surprisingly good results, showing that experts may be biased towards overdiagnosing these common diseases.
4.3. Results for Less Common Diseases
As predicted, there is much less agreement among experts for the rare diseases, where experts tend to over- or underdiagnose the disease, as shown by dramatic differences between sensitivity (recall) and specificity, depending on expert and disease. The results of this behavior are shown for pneumonia and pyelonephritis. For both diseases, there was an expert whose diagnoses closely matched those of the system.
It is possible to interpret these results as being indicative of “niche” knowledge, where experts in that niche can accurately diagnose the disease better than other experts. Additionally, the system's overall behavior is very similar to the best expert for each disease, making it comparable to a team of experts covering all disciplines.
Specificity and precision for these rare diseases are generally high, as they usually require more symptoms to be diagnosed, but surprisingly the experts do not rank much higher than the system (which was designed to diagnose a disease with just one matching symptom). The more interesting metric for these diseases is sensitivity, as they can be easily overlooked. In this case, the system shows a clear advantage over the experts.
It should be noted that these results are based on the diseases presented in the examples. Rare diseases are sometimes characterized by one or two findings that reveal the real diagnosis: if that specific finding is known, the disease becomes easier to diagnose accurately, but if the observation is lacking, it is more difficult to diagnose.