Although this study was conducted by hepatologists experienced in hepatotoxicity, the RUCAM largely failed to meet the minimum thresholds for a reliable instrument. Discrepancies of three and four points in the continuous version and one point in the categorized version may seem small; however, reliability concerns variability rather than bias. In this regard, there was considerable variability in the RUCAM between the two occasions and among the three reviewers. Under the best-case scenario, the test-retest reliability among the site PIs was only 0.65, while the interrater reliability among the external reviewers was unacceptably low at 0.46. A test-retest reliability of 0.8 and an interrater reliability of 0.6 are typically expected, and only the U95CL for the former exceeded its threshold.
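The distinction between variability and bias can be illustrated numerically. In the hypothetical sketch below (illustrative data only, using the Pearson correlation as a stand-in for the intraclass coefficients actually used), a rater who scores every case a constant three points higher shows perfect consistency, whereas random case-to-case variability of the same magnitude degrades the coefficient.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation, used here as a simple consistency coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
occasion1 = [random.randint(3, 12) for _ in range(50)]  # hypothetical RUCAM totals

# Constant bias: every score shifted up by 3 points -- consistency is still perfect
biased = [s + 3 for s in occasion1]

# Random variability: scores jitter by up to 3 points in either direction
noisy = [s + random.choice([-3, -2, -1, 0, 1, 2, 3]) for s in occasion1]

print(round(pearson(occasion1, biased), 2))  # 1.0 despite the systematic bias
print(round(pearson(occasion1, noisy), 2))   # noticeably below 1.0
```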
Multi-item questionnaires are frequently used in other clinical areas to quantify the level of disease activity. These instruments also require clinical judgment, and it is instructive to compare our results against the reliability coefficients of these questionnaires. For example, test-retest reliabilities of 0.85 to 0.93 were reported for the Recent-Onset Arthritis Disability Index,27 0.71 to 0.95 for the Dyspnea Management Questionnaire,28 and 0.87 to 0.97 for the Inflammatory Bowel Disease Questionnaire.29 Similarly, the interrater reliability of the Pediatric Ulcerative Colitis Activity Index30 was reported as 0.87, while that of the Myositis Assessment Scale31 was 0.89. Interestingly, a disease activity index was also developed using a consensus process among clinical experts in idiopathic inflammatory myopathy.32 Even after an initial training series, however, interrater reliabilities of 0.32 to 0.74 were observed.
There are a number of limitations to the generalizability of these results. First, the ILIAD drugs are well-known hepatotoxins and were selected largely for their known DILI signatures. Cases were enrolled only if the site PI felt a priori that there was a significant degree of association between the liver injury and the implicated drug. Moreover, the liver injuries were severe: 73% of cases were hospitalized, many were jaundiced, and 6% required liver transplantation. In effect, these are “classic” DILI cases and, compared to other drugs, should have resulted in greater agreement over time and among the reviewers. On the other hand, the study was conducted retrospectively, with cases going back to 1994. Many medical records and charts for older cases were missing or incomplete; data on death, fulminant hepatic failure, and dechallenge were not always available; and other competing causes may not have been excluded completely. It will be of interest to see whether reliability is greater with more complete data collected prospectively.
Poor reliability has well-recognized statistical implications in clinical research. Specifically, the level of association between a measure with poor reliability and other variables, as assessed by correlation, regression, or analysis of variance, is shrunk toward zero, making it more difficult to declare statistical significance. Statistical power is reduced, so the sample size must be increased correspondingly. The sensitivity and specificity of the instrument are also attenuated, giving rise to classification errors and impairing its utility as a diagnostic marker. This has significant implications for the RUCAM's ability to detect DILI signals and declare DILI cases.
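The shrinkage toward zero can be quantified with the classical attenuation formula, under which the observed correlation between two measures is the true correlation multiplied by the geometric mean of their reliabilities. The sketch below uses hypothetical numbers (a true correlation of 0.50 is an assumption for illustration) together with the 0.46 interrater reliability observed here.

```python
def attenuated_r(true_r, rel_x, rel_y):
    """Classical attenuation formula: the observed correlation equals the
    true correlation scaled by the geometric mean of the two reliabilities."""
    return true_r * (rel_x * rel_y) ** 0.5

# Hypothetical true correlation between the RUCAM score and some outcome,
# with a perfectly reliable outcome and RUCAM reliability of 0.46
print(round(attenuated_r(0.50, 0.46, 1.0), 2))  # 0.34

# Since the sample size needed to detect a correlation scales roughly with
# 1/r^2, this degree of attenuation more than doubles the required n.
```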
There are two approaches to overcome these limitations. One is to categorize the instrument. However, our analysis reveals that this maneuver improved matters only marginally: there was complete agreement in only a small majority of cases, and the test-retest and interrater reliabilities remained low. The other is to have m reviewers perform the evaluation independently and take the average. Lachin33 showed that if ρ is the reliability coefficient of a single assessment, the reliability of the average of m assessments is given by mρ/[1 + (m − 1)ρ]. With the three independent reviewers in ILIAD, this would raise the interrater reliability from 0.46 to 0.71. This brings the reliability into a more acceptable range and is strongly recommended for research purposes.
Site PIs tended to assign higher causality scores than the external reviewers, which raises important issues. Because the site PI enrolled the case and served as its “champion” in the Causality Committee, he or she may have been more zealous in attributing the event to drug-induced liver injury. Alternatively, site PIs may have been more intimately familiar with nuances of the cases not captured completely in the CRF subset and narrative, and may have selectively emphasized certain components of the instrument. Either way, this suggests that the RUCAM is a “subjective” instrument and casts doubt on its utility as an “objective” measure of DILI causality. It also raises the possibility that causality should be assessed only by reviewers at arm's length from the case. This might avoid bias, but from a reliability perspective it would be a mistake: the site PIs were consistently more reliable than the external reviewers. Written instructions, explicit criteria for competing causes, and evidence-based revisions, pilot-tested in prospective cohorts, would go a long way toward overcoming these limitations.
Smaller MADs among the three reviewers were observed on the second occasion compared to the first. This may reflect accumulating experience and familiarity with the RUCAM as time progressed. It may also reflect accumulating experience with the “gestalt” of the monthly Causality Committee teleconferences. Nobody wants to be an outlier, and reviewers may have become more adept at anticipating how their colleagues would weigh the evidence and score the case. This weakens the assumption of reviewers working independently as time progressed, and argues that special attention must be paid to this operating assumption.
Finally, there are many who would argue that because of its idiosyncratic nature, the gold standard for adjudicating cases of DILI can only be the clinical judgment of expert hepatologists. Indeed, DILIN is applying an expert opinion process in its clinical studies. However, this is not practical in a clinical setting, and the reliability among practitioners is likely to be lower. Thus, over the long term, priority should be given to developing an authoritative, evidence-based causality instrument that would be easily accessible to the clinical and research communities; for example, over the internet. In the interim, modifications to the RUCAM, including improved instructions, updated criteria for competing causes of liver injury, and a central reference for prior reports of hepatotoxicity, are needed to improve its performance characteristics as an investigational tool.