While inter- and intra-rater reliability and consistency are fundamental principles in the development and use of outcome instruments, ease of administration, applicability of content, and demonstration of clinical responsiveness are also critical components. We focused our analyses and discussions with study participants on both reliability and validity.
Validity refers to the interpretation of a test result for the purpose for which the test was designed. Since validity is not a property of the instrument, but rather the interpretation and inferences made from the test results, it must be established for each intended purpose [7]. In terms of content, each instrument used in this study was developed by leading academic dermatologists and rheumatologists with considerable expertise in DM. Since the ultimate purpose of these instruments is use in clinical trials and longitudinal patient assessment, it is crucial that scores increase and decrease with changing clinical and systemic disease activity and damage. To assess responsiveness of disease, the instruments must have a range of scores that will fluctuate with inflammatory states of DM. The goal of this study was to theoretically determine whether the tools clinically and accurately reflect current disease states by displaying a range of scores that could potentially translate into longitudinal trends; however, a clinical trial will ultimately be needed to assess whether the instruments are indeed responsive to fluctuating disease states.
While reliability is a prerequisite to validity, the internal construct of an instrument must be assessed to demonstrate clinical usefulness, inference, and appropriateness [8]. In terms of our validity analysis comparing the three outcome instruments to the GPhS, GPaS, and GIS, all scales correlated better with physician (GPhS) than patient (GPaS, GIS) global measures. The trichotomized GPhS validation assessment demonstrated that the CDASI total score had the highest correlation with the global score and the best spread of the 3 indices, thus illustrating clinical responsiveness and applicability. Since the global measures were used as “gold standards” in this study for validation, it would also be appropriate to use the GPhS in clinical trials to correlate with the quantitative outcome measures being studied.
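To make the validation step concrete, the following is a minimal sketch of how an instrument score could be correlated with a trichotomized global score. The text does not state which correlation coefficient was used, so the choice of a Spearman rank correlation is an assumption, as are the function names and the example scores, which are hypothetical and not study data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical illustration only (not study data):
# instrument total scores and a global score coded 1 = mild, 2 = moderate, 3 = severe.
instrument_scores = [4, 9, 12, 18, 25, 31]
global_trichotomized = [1, 1, 2, 2, 3, 3]
rho = spearman(instrument_scores, global_trichotomized)
```

A rank-based coefficient is a natural fit here because a trichotomized global score is ordinal and contains many ties; other coefficients could be substituted in the same way.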
The interquartile range analysis also demonstrated that the DSSI and CAT activity scores do not have as large a spread as the CDASI activity score, thus also potentially limiting clinical applicability. While both the DSSI and CDASI demonstrated near perfect ICC results (0.93 and 0.86, respectively), it remains to be determined if both measures are responsive if used during a clinical trial. Further testing is required to confirm this possibility. While the CAT and CDASI were normally distributed, the CAT’s lower test-retest intra-rater reliability (ICC=0.74 for activity and ICC=0.58 for damage) and inter-rater reliability (ICC=0.60 for activity, ICC=0.43 for damage) demonstrate that this instrument performs less well in temporal stability, inter-rater agreement, and generalizability [9].
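As an illustration of the two quantities discussed above, the sketch below computes an intraclass correlation coefficient and an interquartile range. The text does not specify which ICC form was used; the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, and the function names and ratings are hypothetical rather than study data.

```python
from statistics import mean, quantiles


def icc_2_1(ratings):
    """ICC(2,1) for a complete table: ratings[i][j] is rater j's score for subject i."""
    n = len(ratings)       # number of subjects (rows)
    k = len(ratings[0])    # number of raters (columns)
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def interquartile_range(scores):
    """Q3 minus Q1, using the default 'exclusive' method of statistics.quantiles."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 - q1


# Hypothetical illustration only (not study data): three subjects, two raters,
# with the second rater scoring every subject exactly one point higher.
offset_ratings = [[1, 2], [2, 3], [3, 4]]
icc_offset = icc_2_1(offset_ratings)
```

Under this absolute-agreement form, a systematic offset between raters lowers the ICC even when the raters order subjects identically, which is one reason the choice of ICC form matters when comparing instruments. A wider interquartile range indicates that scores are spread over more of the scale.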
More study is needed to develop the optimal method to measure disease activity and damage. The reversibility of some clinical findings (e.g., activity rather than damage) is unknown. Most notably, the term “poikiloderma” was controversial during training, which is reflected in our assessments. While intra-rater reliability for this term was consistent, inter-rater reliability and validity of “poikiloderma” were extremely poor. This study revealed significant confusion surrounding the term poikiloderma, signifying the need for a more precise definition.
Reviewing the experiences of the physicians for ease of administration and use of the indices, the CDASI was deemed the favored outcome instrument for clinical appropriateness. The placement of the CDASI first in the packet for each patient may have artificially decreased the time to complete the other instruments, since patients were often examined during the CDASI time period and then the two other instruments were completed without re-examination. Thus, the CDASI time is really the only accurate and valid time in this study as it was always performed first. In this regard, another limitation of this study is the fact that the instruments were not completed in a randomized order, which may have biased other aspects of the measures as well. For example, if the CDASI was performed during the examination, the examiners may have remembered the severity of the different dimensions with greater detail when completing this first instrument than the others. Moreover, fatigue may have limited the examiners’ performance on later measures and limited the reliability of the latter instruments. Additionally, only a small number of patients were examined during this study, and thus, the generalizability of our results may be limited. Regardless, in combination with the normal distribution of the CDASI, the excellent ICC score, and the validity assessment, it appears that the CDASI is a useful outcome measure for studies of cutaneous dermatomyositis. Further studies are needed to determine if modifications might be beneficial and to determine responsiveness to change for all of the instruments.