Validated outcome measures play an important role in standardizing patient care and in developing reliable clinical trials by objectively measuring disease severity. The scientific method emphasizes the importance of reproducible results; an outcome measure must therefore be reproducible in order to function adequately in future clinical trials. An outcome measure's reliability, which quantifies its reproducibility, is thus essential and is necessary for attaining validity (Klein et al., 2008; Downing, 2004). ICC values were compared via the method described by Steel et al. (Steel et al., 1997). Although post-hoc power analysis showed that the difference in ICC scores did not reach statistical significance, there is a trend toward the CDASI having good inter-rater reliability for its Activity and Total scores, while the CAT-BM has moderate and poor inter-rater reliability for its Activity and Total scores, respectively (). The structure of the instruments likely gives the CDASI a higher inter-rater reliability, even though the CAT-BM, as a binary instrument, might be expected to yield more consistent ratings. First, one CAT-BM item observed to have a large standard deviation among raters was the item scoring the presence of non-sun-exposed erythema. Because the CDASI has five to six items that would qualify as non-sun-exposed erythema, in addition to a larger total number of items contributing to the activity score, a single disagreement among physicians has less impact on its overall reliability than it does in the CAT-BM. Second, because the CDASI explicitly steps through all anatomical areas, it may prompt the rater to examine each area more systematically than the CAT-BM does. Third, the ambiguity of certain items in the CAT-BM may have contributed to its lower reliability. For example, the items scoring the presence of cuticular overgrowth or subcutaneous edema showed large standard deviations among raters. Although the CDASI is not a binary system, the measures of activity that it scores (erythema, scale, and erosions) are more clearly defined among physicians than certain measures of activity in the CAT-BM. Notably, the initial study exploring the CAT-BM (Huber, Lachenbruch, et al., 2008) reported an inter-rater ICC for activity of 0.60 (95% CI 0.06-0.83), in contrast to our value of 0.34. Although our value lies within that 95% CI, making statistical variability the most likely cause of the difference, the differing patient populations between the studies (adult vs. juvenile) may also have played a role.
Interestingly, inter-rater reliability for damage measurements was lower across the CDASI, CAT-BM, and PGA scales (ICC: CDASI Damage 0.563; CAT-BM Damage 0.340; PGA Damage 0.506; PGA Damage Likert 0.542). This is consistent with other outcome instruments that contain a damage subscore, such as the CAT and the previous version of the CDASI, suggesting that physicians have difficulty agreeing with one another in their assessment of damage (21). During the physician training session, it was noted that physicians' concepts of poikiloderma varied. Additionally, in a previous study, inter-rater agreement on poikiloderma was likewise poor (Klein et al., 2008). Poikiloderma accounts for almost half, less than 10%, and theoretically 100% of the maximum damage score in the CDASI, the CAT-BM, and the PGA Damage scales, respectively. Since the CAT-BM weights poikiloderma the least yet had the lowest inter-rater reliability for damage, this suggests that another factor, perhaps an inherent limitation of the outcome measure, explains the poorer inter-rater reliability of the CAT-BM compared with the CDASI or PGA Damage scales.
The intra-rater reliability of the CDASI was almost perfect for activity and total scores and good for damage scores. The CAT-BM had lower intra-rater reliability across activity, damage, and total scores, although it remained good in all three (). While this shows a trend toward the CDASI having better intra-rater reliability, post-hoc power analysis showed that the difference did not reach statistical significance.
Although an outcome instrument may be reliable, its usefulness is limited if it does not have adequate construct validity, that is, the ability to measure what it is designed to measure. Both the CDASI and the CAT-BM were shown to be significant predictors of the PGA scales, the ‘gold standard’, and thus to have good construct validity. While both the CDASI and the CAT-BM were found to have good content validity as stated above, one physician noted that the CAT-BM does not sufficiently assess scalp disease, which can be very troublesome for patients and is found in over 80% of the DM population (Tilstra et al., 2009; Kasteler, 1994).
It is also important for an outcome instrument to capture the disease state of patients at the extremes of disease, particularly those with the highest disease activity. In this study, the maximum CDASI Activity and CAT-BM Activity scores reached were 61 (61% of the maximum activity score) and 14 (82% of the maximum activity score), respectively. This suggests that the CAT-BM may reach its ceiling sooner than the CDASI and therefore fail to capture differences in disease activity among more severe patients.
To implement an outcome instrument for use in clinical trials, it is essential that it be able to measure change in disease severity. The CDASI had the best responsiveness when compared with the CAT-BM and PGA scales. Furthermore, all physicians anticipated that the CDASI would be a more effective response tool than the CAT-BM. This result was not surprising; many of the physician raters' comments predicted that the CAT-BM would have this limitation, as it documents only the presence or absence of a given finding, whereas the CDASI documents its degree of severity.
Another important factor when comparing outcome instruments is completion time: even a reliable and valid tool would not be practical in a clinical research setting if it took too long to complete. Although the CAT-BM took significantly less time to complete than the CDASI (mean completion time: CAT-BM 3.19 minutes; CDASI 4.76 minutes; p<0.001), the mean difference of about 90 seconds may not be practically relevant.
There were limitations to this study. First, as the patient population was relatively small, the external validity of our findings may be limited. Second, the small patient population may have allowed the physician raters to recall how they had evaluated a patient when completing their repeat evaluation, potentially inflating the intra-rater reliability above its true value. To minimize this, physicians were asked to perform their repeat evaluation on a patient they had evaluated during the morning session, maximizing the interval between evaluations and thus reducing the likelihood of recall. Third, as the study session lasted about 7 hours, the physicians may have experienced fatigue that affected their patient evaluations. This was minimized by offering snacks and lunch during the day and allowing physicians to rate patients at their own pace. Fourth, five of the ten physician participants had previously used both the CAT and the original version of the CDASI, which may have inflated the reliability and validity estimates for both instruments owing to increased familiarity. Despite these limitations, we conclude that the CDASI appears to be a more effective tool than the CAT-BM for evaluating cutaneous severity in DM.