To properly evaluate therapies for cutaneous dermatomyositis (DM), it is essential to administer an outcome instrument that is reliable, valid, and responsive to clinical change, particularly when measuring disease activity.
The purpose of this study is to compare two skin-severity DM outcome measures, the Cutaneous Dermatomyositis Disease Area and Severity Index (CDASI) and the Cutaneous Assessment Tool-Binary Method (CAT-BM), with the physician global assessment (PGA) as the ‘gold standard’.
Ten dermatologists evaluated fourteen patients with DM using the CDASI, CAT-BM, and PGA scales. Inter- and intra-rater reliability, validity, responsiveness, and completion time were compared for each outcome instrument. Responsiveness was assessed from a different study population, where one physician evaluated 35 patients with 110 visits.
The CDASI was found to have higher inter- and intra-rater reliability. Regarding construct validity, both the CDASI and the CAT-BM were significant predictors of the PGA scales. The CDASI had the best responsiveness among the three outcome instruments examined. The CDASI had a statistically significantly longer completion time than the CAT-BM, by about 1.5 minutes.
The small patient population may limit the external validity of the findings observed.
The CDASI is a better clinical tool to assess skin severity in DM.
Dermatomyositis (DM) is a chronic systemic autoimmune disease categorized among the idiopathic inflammatory myopathies (Dugan et al., 2009). DM is often associated with extramuscular and extracutaneous pathology, with involvement of the joints, heart (cardiomyopathy and conduction defects), and lungs (Lorizzo et al., 2008). The most widely accepted classification criteria for DM have traditionally emphasized the importance of clinical, laboratory, histopathologic, or electrophysiological evidence of muscle inflammation for making the diagnosis (Bohan et al., 1975). Subtypes of dermatomyositis, amyopathic and hypomyopathic dermatomyositis, have been described for patients with no or minor muscle findings, respectively (Gerami et al., 2006).
Characteristic inflammatory skin changes are seen in a large majority of individuals with DM (Callen et al., 2006). Nevertheless, the cutaneous manifestations of DM are among the least systematically studied aspects of the disease. This has resulted in part from the lack of validated tools to reliably determine the activity of the cutaneous manifestations of DM, especially relative to other dermatologic diseases such as psoriasis and atopic dermatitis, where disease-specific skin severity outcome instruments have been used extensively (Gaines et al., 2008; Feldman, 2005; Kunz et al., 1997; Mrowietz, 2006). The FDA has developed guidelines for researchers on how to measure clinical response through measuring disease activity, disease-induced damage, the response as determined by the patient, and health-related quality of life (Gaines et al., 2008; Concept paper, pamphlet, 2003). From these guidelines, researchers must develop an outcome instrument that will capture appropriate elements of the disease to determine clinical response. Currently, effective treatments for the cutaneous manifestations of dermatomyositis are limited. There are a number of new biological therapies that may be beneficial for patients with DM (Lorizzo et al., 2008). There is a critical need to develop optimal validated instruments to quantify organ-specific disease activity, so that the efficacy of medications can be methodically and quantitatively evaluated.
We have previously validated a cutaneous severity outcome instrument, the Cutaneous Dermatomyositis Disease Area and Severity Index (CDASI), and have shown that it may be a more effective and reliable tool compared to other outcome measures, namely the Dermatomyositis Skin Severity Index (DSSI), and the Cutaneous Assessment Tool (CAT) (Klein et al., 2008). In order to further simplify the CDASI, we have revised the original CDASI and have shown that the modified version correlates almost perfectly with the original CDASI (Yassaee, in press). The CAT was originally developed with similar goals to the CDASI and was found to have appropriate reliability, construct validity, and responsiveness in the juvenile dermatomyositis population (Huber et al., 2008 and 2007). Recently, the CAT has also been simplified, and has been validated in the juvenile population (Huber, Lachenbruch, et al., 2008). The modified versions of the CAT, named CAT-Binary Method (CAT-BM) and CAT-Maximum Method (CAT-MM), stem from an alternative scoring method of the CAT. The CAT-BM has been shown to correlate almost perfectly to the original CAT (Huber et al., 2008). There have yet to be any studies comparing the modified CDASI and the CAT-BM for use in longitudinal clinical research.
The current study evaluates and compares the modified tools, with a goal to provide partial validation of each tool for use in the adult DM population and to determine the optimal effective research tool for measuring the severity of cutaneous disease in adult DM. The goal is to establish an appropriate tool for evaluating DM within and between studies to evaluate therapeutic responses most effectively.
CDASI Total and CAT-BM Total scores had a normal distribution with scores ranging from 1-72 and 1-20, respectively (CDASI Total: Mean 24.25 +/- 14.67; CAT-BM Total: Mean 9.24 +/- 4.17).
Inter-rater reliability was assessed by determining the agreement between the CDASI and the CAT-BM scores from the ten physician raters. The CDASI was found to have good inter-rater reliability for activity and total scores and moderate inter-rater reliability for damage scores, meaning that scores among physicians were in good accordance with one another for activity and total scores and in moderate accordance for damage scores. In contrast, the CAT-BM was found to have moderate inter-rater reliability for activity scores and poor inter-rater reliability for damage and total scores. The CDASI had the best inter-rater reliability overall when compared to the CAT-BM and PGA scales (Activity: CDASI 0.748, CAT-BM 0.563, PGA Activity 0.721, PGA Activity Likert 0.653; Damage: CDASI 0.563, CAT-BM 0.340, PGA Damage 0.506, PGA Damage Likert 0.542; Total: CDASI 0.726, CAT-BM 0.432, PGA Overall 0.632, PGA Overall Likert 0.694) (Table 1).
Intra-rater reliability measures the degree of agreement of multiple outcome scores performed by a single physician. It was assessed by determining the agreement between initial and repeat scores, using the ICC, for each outcome instrument as well as determining the significance of a difference between mean initial scores and mean repeat scores for each outcome instrument. The CDASI was found to have almost perfect intra-rater reliability for activity and total scores and good intra-rater reliability for damage scores (ICC: Activity 0.868; Damage 0.800; Total 0.903). No significant difference between mean initial and mean repeat activity, damage, and total scores was found (Mean difference: Activity 0.00, p=1.00; Damage 0.40, p=0.728; Total -0.40, p=0.541). The CAT-BM was found to have good intra-rater reliability for activity, damage, and total scores (ICC: Activity 0.714; Damage 0.792; Total 0.800). No significant difference between mean initial and mean repeat activity, damage, and total scores was found (Mean difference: Activity 0.2, p=0.713; Damage 0.35, p=0.496; Total -0.15, p=0.634). PGA scales were found to have almost perfect intra-rater reliability in all assessments except for PGA Activity Likert and PGA Damage Likert (ICC 0.737 and 0.708, respectively). There was also a significant difference between initial and repeat mean scores for PGA Overall and PGA Activity Likert (Mean difference: PGA Overall 0.63, p=0.019; PGA Activity Likert -0.24, p=0.021) (Table 2).
Validity was assessed for the CDASI and the CAT-BM by using a linear mixed model. Both the CDASI and the CAT-BM were found to be significant predictors of the ‘gold standard’ comparator, the PGA scales, using both the VAS and the Likert scale (all p≤0.001 among total, activity, and damage scores) (Table 3), indicating that both the CDASI and the CAT-BM were good predictors of both the VAS and the Likert PGA scales.
As another means to assess construct validity and linearity, CDASI and CAT-BM scores were grouped by Likert scores. All CDASI and CAT-BM mean scores (Total, Activity, and Damage) showed statistically significant distinct values when grouped by Likert scores (all p values ≤ 0.001) (Table 4), reaffirming that both tools are good predictors of the Likert PGA scales. Furthermore, both the CDASI and CAT-BM showed a significant, near-perfect fit for linearity, with all coefficient of determination (r²) values ≥ 0.947 (highest p=0.026).
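The linearity check described above can be illustrated with a minimal sketch that fits a least-squares line to the mean instrument score in each Likert group and reports r². The group means below are hypothetical placeholders, not values from Table 4, and the function name is an assumption made for illustration.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # For simple linear regression, r^2 equals the squared Pearson correlation.
    return (sxy ** 2) / (sxx * syy)


# Likert PGA levels (0-4) and HYPOTHETICAL mean instrument scores per level.
likert_levels = [0, 1, 2, 3, 4]
hypothetical_group_means = [2.0, 11.0, 19.0, 31.0, 40.0]

print(round(r_squared(likert_levels, hypothetical_group_means), 3))
```

An r² close to 1 indicates that mean scores rise in roughly equal steps across Likert levels, which is the property the linearity analysis is meant to capture.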
All physicians felt that the CDASI was complete, though one physician noted that it may be useful to have a mechanism to capture lipoatrophy from panniculitis in patients. 9/10 physicians felt that the CAT-BM was complete. One physician felt that the CAT-BM did not adequately assess the scalp.
Responsiveness was measured by using the SRM, defined as the ratio of the mean of the differences (i.e. CDASI and CAT-BM scores before and after a clinical change was noted) between two time points to the standard deviation of the differences. The CDASI had the highest SRM among outcome instruments (SRM: CDASI 1.25; CAT-BM 0.93; PGA Activity 1.03; PGA Activity Likert 0.61). The CDASI was the only instrument to have an SRM > 1, indicating that the mean change between visits was greater than the standard deviation of the change between visits. As mentioned above, the CDASI had the highest intra-rater reliability among all compared outcome instruments (Table 2).
The CDASI had a statistically significantly longer completion time than the CAT-BM (Completion Time: CDASI 4.76 minutes; CAT-BM 3.19 minutes; p<0.001) with a mean time difference of 1.58 minutes (95% Confidence Interval: 1.18 minutes – 1.97 minutes).
6/10 physicians felt that the CDASI would be more easily incorporated in a clinical setting than the CAT-BM. Those who preferred the CDASI mentioned the likelihood that it would be a more effective instrument to assess responsiveness as well as the order in which the anatomical locations were organized. In contrast, those who preferred the CAT-BM stated that it was a quicker instrument to complete. 6/10 physicians felt that the CAT-BM was less difficult to use. Those who preferred the CAT-BM mentioned it was quicker to complete whereas those who preferred the CDASI stated that the CAT-BM was “poorly organized” and that they would need to “jump around” while completing it. 10/10 physicians felt that the CDASI was a better instrument to grade skin severity and improvement over time. Physicians commented that the CDASI measures the “degree of intensity of an eruption” whereas a “binary [method] won't be helpful in estimating response to treatment” and would “need to have complete resolution to capture change.” Furthermore, one physician commented that the CAT-BM included livedo reticularis in its scoring, which “would not be expected to improve with most therapy.”
Validated outcome measures play an important role in standardizing patient care and in developing reliable clinical trials by objectively measuring the severity of disease. The scientific method states the importance of attaining reproducible results. An outcome measure, therefore, must also be reproducible in order to adequately function in future clinical trials. An outcome measure's reliability, which measures reproducibility, is therefore necessary for attaining validity (Klein et al., 2008; Downing, 2004). ICC values were compared via the method described by Steel et al. (1997). Though post-hoc power analysis showed that the difference in ICC scores did not reach statistical significance, there is a trend that the CDASI has good inter-rater reliability with regard to its Activity and Total measurements while the CAT-BM has moderate and poor inter-rater reliability for its Activity and Total measurements, respectively (Table 1). Likely, the nature of the instruments gives the CDASI a higher inter-rater reliability even though the CAT-BM is a binary instrument. For example, an item on the CAT-BM that was seen to have a large standard deviation among raters was the item scoring the presence of non-sun exposed erythema. Since the CDASI has five to six items that would qualify as non-sun exposed erythema in addition to a larger number of items contributing to the activity score, it lends itself to having an intrinsically high inter-rater reliability, since one disagreement among physicians would have less of an impact on the overall reliability than in the CAT-BM. Additionally, it is also possible that since the CDASI specifically goes through all anatomical parts, it gives more “pressure” to the rater to look through all the parts more efficiently than in the CAT-BM. Thirdly, the ambiguity of certain question items in the CAT-BM may have contributed to a lower reliability. For example, the items scoring the presence of cuticular overgrowth or subcutaneous edema were seen to have a large standard deviation among raters. Although the CDASI may not be a binary system, the measures of activity that it scores (erythema, scale, and erosions) are defined more clearly among physicians than certain measures of activity in the CAT-BM. Notably, the initial study exploring the CAT-BM (Huber, Lachenbruch, et al., 2008) reports an inter-rater ICC of 0.6 (95% CI 0.06-0.83) among activity scores, in contrast to our reported value of 0.34. Although our value of 0.34 lies within the 95% CI, making statistical variability the most likely cause of the difference, the differing patient populations between the studies (adult vs. juvenile) may have also played a role.
Interestingly, inter-rater reliability of damage measurements was lower in the CDASI, the CAT-BM, and the PGA scales (Table 1; ICC: CDASI Damage 0.563; CAT-BM 0.340; PGA Damage 0.506, PGA Damage Likert 0.542). This is consistent with other outcome instruments that contain a damage subscore, such as the CAT and the previous version of the CDASI, suggesting that physicians have difficulty agreeing with one another in their assessment of damage. It was noted in the physician training session that the concept of poikiloderma varied among physicians. Additionally, in a previous study, agreement on physicians' perception of poikiloderma was poor as well (Klein et al., 2008). Poikiloderma accounts for almost half, less than 10%, and theoretically 100% of the maximum damage score in the CDASI, the CAT-BM, and the PGA Damage scales, respectively. This suggests that there is another factor, perhaps an inherent limitation of the outcome measure, explaining the poor, and lower, inter-rater reliability of the CAT-BM when compared to the CDASI or PGA Damage scales.
The intra-rater reliability of the CDASI was almost perfect in activity and total scores and good across damage scores. The CAT-BM had a lower intra-rater reliability across activity, damage, and total scores, with good intra-rater reliability in all realms (Table 2). Although this shows a trend that the CDASI has a better intra-rater reliability, post-hoc power analysis showed that the difference did not reach statistical significance.
Although an outcome instrument may be reliable, if it does not have adequate construct validity, or the ability to measure what it has been designed to measure effectively, then its usefulness is limited. Both the CDASI and the CAT-BM were shown to be significant predictors of the PGA scales, which are the ‘gold standard’, and thus to have good construct validity. While both the CDASI and the CAT-BM were found to have good content validity as stated above, a physician noted that the CAT-BM did not sufficiently assess scalp disease, which can be very troublesome for patients and is found in over 80% of the DM population (Tilstra et al., 2009; Kasteler, 1994).
It is also important for an outcome instrument to be able to capture the disease state of patients at the extremes of disease. This is particularly important in patients with extreme disease activity. In this study, the maximum CDASI Activity and CAT-BM Activity scores reached were 61 (61% of maximum activity score) and 14 (82% of maximum activity score), respectively. This suggests that the CAT-BM may be more prone to reaching its maximum limit than the CDASI and may therefore be unable to capture differences in disease activity in more severe patients.
To implement an outcome instrument for use in clinical trials, it is essential that it be able to measure change in disease severity. The CDASI had the best responsiveness when compared to the CAT-BM and PGA scales. Furthermore, all physicians anticipated that the CDASI would be a more effective response tool than the CAT-BM. This was not a surprising result: many of the physician raters' comments predicted that the CAT-BM would have this limitation, as it only documents the presence or absence of a certain measure whereas the CDASI documents the degree of severity of a certain measure.
Another important factor when comparing outcome instruments is completion time. Even a tool that is reliable and valid would not be practical in a clinical research setting if it takes too long to complete. Although the CAT-BM took significantly less time to complete than the CDASI (Mean Completion Time: CAT-BM 3.19 minutes; CDASI 4.76 minutes; p<0.001), the mean difference in completion time was about 90 seconds and may not be practically relevant.
There were limitations to the study. Firstly, as the patient population was relatively small, the external validity of our findings may be limited. Secondly, the relatively small patient population may have allowed the physician raters to recall how they evaluated a patient when completing their repeat evaluation. This could potentially raise the intra-rater reliability above its true value. To minimize this impact, physicians were asked to perform their repeat evaluation on a patient they had evaluated during the morning session, thus minimizing the likelihood of recall. Thirdly, as the study session lasted about 7 hours, it is possible that the physicians experienced fatigue that affected their patient evaluations. This was minimized by offering snacks and lunch during the day and allowing physicians to rate patients at their own pace. Fourthly, five of the ten physician participants had previously used both the CAT and the original version of the CDASI, which may have falsely elevated the reliability and validity scores of both instruments owing to the physicians' increased familiarity with them. Regardless of the limitations above, we can conclude that the CDASI appears to be a more effective tool than the CAT-BM in evaluating cutaneous severity in DM.
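The percentages quoted above follow from the activity subscore maxima stated in the instrument descriptions (100 for the CDASI and 17 for the CAT-BM). A minimal sketch of that arithmetic, with a function name chosen only for illustration:

```python
def percent_of_maximum(observed, maximum):
    """Return the observed score as a percentage of the scale maximum."""
    return 100.0 * observed / maximum


# Highest activity scores observed in the study, with each scale's maximum.
cdasi_pct = percent_of_maximum(61, 100)
cat_bm_pct = percent_of_maximum(14, 17)

print(round(cdasi_pct), round(cat_bm_pct))
```

The closer the highest observed score sits to the scale maximum, the less headroom the instrument has to register further worsening.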
This study was approved by the local IRB. Declaration of Helsinki protocols were adhered to, and physician and patient participants gave their written, informed consent prior to study initiation.
Ten dermatology-boarded physicians were invited to participate in the one-day study at the Hospital of the University of Pennsylvania. Physicians were given the CDASI and the CAT-BM as well as corresponding literature prior to the study session day so that they might better familiarize themselves with the tools. On the study session day, prior to initiating the study, the physicians were given a training session with visual examples in order to score all study instruments correctly. Adequate time was given to the physicians to address any questions and/or clarifications they may have had regarding the outcome instruments.
Fourteen patients with clinical and/or pathological evidence of DM were invited to participate in the study at the Hospital of the University of Pennsylvania. Patients represented a wide spectrum of disease. The patient population consisted of fourteen Caucasians, 3 males and 11 females, with varying degrees of muscle and cutaneous involvement (PGA Activity scores ranging from 0 to 9.3 with a mean of 3.2 +/- 2.8, PGA Damage scores ranging from 0 to 9.4 with a mean of 2.8 +/- 2.6, and PGA Overall scores ranging from 0.2 to 9.2 with a mean of 3.4 +/- 2.5). Average age of participants was 53 +/- 16 years. Average duration of disease among patients was not recorded.
The study day was divided into Session 1 and Session 2. Each physician was given a randomized number from 1-10 and consequently a folder corresponding to their number. Based on the assigned number, physicians were divided into two groups of five physicians, Group 1Ph and Group 2Ph. One physician group received folders with packets of each outcome instrument in the order of CDASI, CAT-BM, and PGA scales for Session 1 and in the order of CAT-BM, CDASI, and PGA scales for Session 2. The remaining physician group received folders with the reverse order of packets (i.e. CAT-BM, CDASI, and PGA scales for Session 1). All folders from both physician groups also contained two packets of each outcome instrument for re-rates. All physicians evaluated fourteen patients. All physicians also re-evaluated two patients. At the end of the study session, physicians were given an exit questionnaire consisting of seven questions, each of which included a short-answer part and four of which also included a multiple-choice part. Patients were randomized and divided into two groups, Group 1P, consisting of 8 patients, and Group 2P, consisting of 6 patients. During Session 1, Group 1Ph evaluated Group 1P and Group 2Ph evaluated Group 2P. During Session 2, Group 1Ph evaluated Group 2P and Group 2Ph evaluated Group 1P. No more than one physician was permitted per patient encounter at any time.
The CDASI is a one-page, partially validated outcome instrument used to determine the severity of cutaneous disease specific to DM. Total scores range from 0-132. Scores are divided into activity and damage, with scores ranging from 0-100 and 0-32, respectively. Neither activity nor damage is scored by percentage of body surface area involvement. Disease activity is assessed by the degree of erythema, scale, and the presence of erosions or ulceration in 15 different anatomical locations. Disease damage is assessed by the presence of poikiloderma or calcinosis in the 15 different anatomical locations. Periungual changes are scored from 0-2, with zero indicating no periungual changes, one indicating periungual erythema, and two indicating visible telangiectasias. Alopecia scores range from 0-1, with zero indicating no alopecia in the last 30 days and one indicating the presence of alopecia in the last 30 days. Gottron's sign on the knuckles is assessed similarly to the erythema scale used in other anatomical locations. When Gottron's papules are present, the erythema score obtained on the knuckles is doubled.
The CAT-BM is a 1-page, normally distributed validated outcome instrument derived from an alternative scoring method of the CAT that is used to determine the severity of cutaneous disease in DM. Total scores range from 0-28, 0-17 for activity and 0-11 for damage. Neither activity nor damage is scored by percentage of body surface area involvement. Activity scores are based on the presence of erythema in 7 different anatomic areas and presence of other characteristic DM lesions. Secondary changes such as scale, erosions, or necrosis are not captured. Disease damage is scored by the presence of atrophy or dyspigmentation without erythema in the same seven different anatomic areas, as well as presence of poikiloderma, calcinosis, lipoatrophy, or a depressed scar anywhere on the body.
To assess intra-rater reliability, after a physician participant had completed all patient encounters, they were asked to re-evaluate two patients whom they had seen during the morning session (to minimize physician recollection of scoring). All physicians re-rated two patients. Though physicians arbitrarily decided which patient to re-rate based on patient availability, it was ensured that no patient would be re-rated more than twice. Inter-rater reliability, used to assess accordance of scores among physicians, was determined from the ten physicians who evaluated all fourteen patients. Physicians also recorded the time to complete each instrument for each patient encounter.
In order to assess and compare validity among different outcome instruments, three validation measures were used: 1) the Overall Skin-Physician Global Assessment (PGA Overall), 2) the Skin Activity-Physician Global Assessment (PGA Activity), and 3) the Skin Damage-Physician Global Assessment (PGA Damage). Scores were captured using visual analogue scales (VAS) and Likert scales. The VAS is a continuous scale ranging from 0-10 where 10 represents extremely active disease. The Likert Scale ranges from 0-4, where 4 represents extremely severe disease.
Specifically, convergent construct validity was determined by comparing the Skin Activity-PGA to the activity subscore of the outcome instruments, comparing the Skin Damage-PGA to the damage subscore of the outcome instruments, and comparing the Overall Skin-PGA to the overall score of the outcome instruments. Convergent construct validity refers to the degree to which one measure (i.e. the CDASI or the CAT-BM) correlates with another measure (i.e. the corresponding PGA) that it theoretically should correlate with. The PGAs were also used to determine whether either of the outcome instruments was skewed in any direction, which could potentially limit its usefulness in longitudinal studies. Content validity was determined by administering the Physician Exit Questionnaire, which includes the question, “Was there any information missing from any of the measures that you feel should be added?”
Responsiveness was assessed from prospective visit data collected separately from the inter-rater, intra-rater validation study. This included assessments of the CDASI, CAT-BM, and PGA scale scores, as well as an overall evaluation from the physician as to whether the patient had improved, worsened, or had no change from their previous research visit. 35 patients with a cumulative 110 visits were obtained from this data source. There were 27 visits in which a clinical change was noted. The largest clinical change per patient, defined as the largest difference in the PGA-Activity score between two consecutive visits, was included in the analysis. The standardized response mean (SRM) was used to determine responsiveness for the CDASI and the CAT-BM. The SRM measures the ratio of the mean of the differences (i.e. CDASI and CAT-BM scores before and after a clinical change was noted) between two time points to the standard deviation of the differences. The absolute mean change between visits was used to account for both improvement and worsening of disease. This approach has been used in the past (Ruperto et al., 2010; Beaton et al., 1997).
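The SRM definition above can be written as a short sketch. It assumes that "absolute mean change" means taking the absolute value of each paired difference before computing the mean and the sample standard deviation; the text does not specify this detail, so the `absolute` flag and the function name are illustrative assumptions, and the scores used are made up.

```python
from statistics import mean, stdev


def standardized_response_mean(before, after, absolute=True):
    """SRM = mean of paired differences / sample SD of those differences.

    ASSUMPTION: with absolute=True, each difference is made non-negative
    first, so improvement and worsening both count as change.
    Raises ZeroDivisionError if all differences are identical (SD of 0).
    """
    diffs = [a - b for b, a in zip(before, after)]
    if absolute:
        diffs = [abs(d) for d in diffs]
    return mean(diffs) / stdev(diffs)


# HYPOTHETICAL activity scores at two consecutive visits for four patients.
scores_before = [10, 20, 30, 25]
scores_after = [4, 28, 21, 20]

srm = standardized_response_mean(scores_before, scores_after)
print(round(srm, 2))
```

An SRM above 1 means that the average change is larger than the spread of changes across patients, which is how the comparison between instruments is read.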
Statistical analyses were performed using the statistical programs STATA and SPSS. Inter-rater reliability was determined by the intraclass correlation coefficient (ICC), type ICC (2,1) per the Shrout and Fleiss convention (Shrout, 1979). Previous research has considered an ICC between 0.50 and 0.70 to be moderate, between 0.70 and 0.81 to be good, and ≥ 0.81 to be almost perfect (Landis, 1977; Klein et al., 2008). Intra-rater reliability was determined by ICC (2,1) and a paired, two-tailed t-test comparing mean scores between initial and repeat scores of each instrument. Construct validity was assessed by testing the association between each outcome measure (CDASI or CAT-BM) and the corresponding validation measure. Because each patient and each physician had repeated measures, we used a linear mixed model for this test, adjusting for within-patient and within-physician variations. Other covariates, such as age and gender, were not seen to have an influence. Physician and patient subject numbers were entered as random effect factors, while PGA scores were entered as a fixed effect covariate. Likert scores were also used as an additional means to assess construct validity. Differences in CDASI and CAT-BM scores when grouped by corresponding Likert scores were evaluated using one-way ANOVA. Linear regression was also used on mean CDASI and CAT-BM scores of each Likert group to determine linearity.
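For readers unfamiliar with ICC (2,1), the following is a minimal sketch of the two-way random-effects, single-rater form computed from ANOVA mean squares, together with the interpretation bands listed above. It is an illustration only and not the STATA or SPSS procedure used in the study. The ratings matrix is a small made-up example, and the "poor" label for values below 0.50 is an assumption inferred from how the Results describe ICCs of 0.432 and 0.340.

```python
def icc_2_1(ratings):
    """ICC(2,1): rows are patients (targets), columns are raters."""
    n = len(ratings)       # number of patients
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-patient mean square
    msc = ss_cols / (k - 1)                 # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret_icc(icc):
    """Bands from the text; 'poor' below 0.50 is an inferred assumption."""
    if icc >= 0.81:
        return "almost perfect"
    if icc >= 0.70:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


# ILLUSTRATIVE ratings: 6 patients scored by 4 raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
value = icc_2_1(example)
print(round(value, 2), interpret_icc(value))
```

Because ICC (2,1) includes the between-rater mean square in its denominator, systematic differences between raters lower the coefficient, which suits a design in which every physician rates every patient.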
This material is based upon work supported by Celgene Corporation.
Conflicts of Interest
The authors state no conflicts of interest.