J Invest Dermatol. Author manuscript; available in PMC 2010 December 27.
Published in final edited form as:
PMCID: PMC3010359
NIHMSID: NIHMS225248

Reliability and convergent validity of two outcome instruments for pemphigus

Abstract

A major obstacle in performing multicenter controlled trials for pemphigus is the lack of a validated disease activity scoring system. Here we assess the reliability and convergent validity of the PDAI (pemphigus disease area index). A group of 10 dermatologists scored 15 patients with pemphigus to estimate the inter- and intra-rater reliability of the PDAI and the recently described ABSIS (autoimmune bullous skin disorder intensity score) instrument. To assess convergent validity, these tools were also correlated with the Physician’s Global Assessment (PGA). Reliability studies demonstrated an intra-class correlation coefficient (ICC) for inter-rater reliability of 0.76 [95% CI = 0.61–0.91] for the PDAI and 0.77 [0.63–0.91] for the ABSIS. The tools differed most in reliability of assessing skin activity, with an ICC of 0.39 [0.17–0.60] for the ABSIS and 0.86 [0.76–0.95] for the PDAI. Intra-rater test-retest reliability demonstrated an ICC of 0.98 [0.96–1.0] for the PDAI and 0.80 [0.65–0.96] for the ABSIS. The PDAI also correlated more closely with the PGA. We conclude that the PDAI is more reproducible and correlates better with physician impression of extent. Subset analysis suggests that for this population of mild to moderate disease activity, the PDAI captures more variability in cutaneous disease than the ABSIS.

Introduction

Our scientific understanding of the pathogenesis of pemphigus has greatly improved in recent years. As we try to evaluate therapeutic options to develop treatment paradigms for pemphigus, however, a major obstacle in performing large, multicenter controlled trials is the lack of a validated scoring system to rate disease activity. An ongoing Cochrane review of clinical trials studying pemphigus has revealed a total of 116 outcome measures described in 96 articles over the past 25 years (Martin and Murrell, 2006). This lack of uniformity makes it challenging to compare treatment regimens and to choose among therapeutic options. Here we evaluate a new measurement instrument (PDAI – pemphigus disease area index), which was developed by the International Pemphigus Committee in an effort to create a tool that can reliably capture all ranges of cutaneous and mucosal disease extent, and we compare it to a recently developed scoring system, the autoimmune bullous skin disorder intensity score (ABSIS), which has not yet been evaluated for inter-rater or intra-rater reliability. We also compared these two instruments to the Physician’s Global Assessment (PGA), an instrument often used for inflammatory skin disease, although it has not been specifically validated for pemphigus. Validated scoring systems are needed to allow proper evaluation of potential therapies, as well as to compare results from different trials. These, combined with other evaluations, including laboratory testing of autoantibodies and quality of life measures, will help determine the efficacy of therapies. The goal of these instruments is to allow standardized assessment of disease extent in patients with pemphigus and thereby make it easier for physicians to conduct multicenter trials.

Results

Distribution of Scores

PDAI total score was approximately normally distributed with total scores ranging from 0–44 (mean 13.1±9.03) out of a possible 263 maximum score for the instrument. The interquartile range (25% to 75% of scores) for the PDAI total score was 6.3–19. The PDAI skin activity subscore ranged from 0–28.3 (mean 5.57±6.48) out of a possible 130, with an interquartile range of 0–8.0. The PDAI mucous membrane activity subscore ranged from 0–40 (mean 4.62±7.55) out of a possible 120, with an interquartile range of 0–6.0. The PDAI total activity subscores (skin plus mucous membrane) ranged from 0–43 (mean 10.22±8.14) out of a possible 250, with an interquartile range of 4.2–14. The PDAI damage subscore ranged from 0–12 (mean 2.91±3.28) out of a possible 13, with an interquartile range of 0–6.0.

ABSIS scores were skewed towards the low end of the scale, with 50% of the scores between 0 and 4. ABSIS total scores ranged from 0–38 out of a possible 206, with a median score of 4.0 and an interquartile range of 2–8.5. ABSIS skin involvement subscores ranged from 0–30 out of a possible 150, with a median score of 1.0 and an interquartile range of 0–3.5. ABSIS oral involvement subscore ranged from 0–11 out of a possible 11, with a median score of 1.0 and an interquartile range of 0–4.0. ABSIS oral subjective discomfort subscores ranged from 0–35 out of a possible 45, with a median score of 0.0 and an interquartile range of 0–0 (eighty percent of the scores were zero). Notably seven out of ten raters recorded values of “less than 1%” when commenting on BSA involved with pemphigus. These values were treated as if they were exactly 1%, since it is impossible to quantify a specific value for “less than 1%.”
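The handling of “less than 1%” entries can be illustrated with a short sketch. The function name and the accepted input spellings below are hypothetical and are not taken from the study's data-entry procedure; the only rule drawn from the text is that an entry of “less than 1%” is treated as exactly 1%.

```python
def parse_bsa_percent(raw: str) -> float:
    """Convert a recorded BSA entry (hypothetical free-text format) to a number.

    Entries such as "<1%" or "less than 1%" are mapped to exactly 1.0,
    following the convention described in the text. Any other entry is
    read as a plain percentage.
    """
    text = raw.strip().lower().replace("%", "").strip()
    for prefix in ("less than", "<"):
        if text.startswith(prefix):
            bound = float(text[len(prefix):].strip())
            if bound != 1.0:
                # The text only describes a rule for "less than 1%".
                raise ValueError(f"no rule defined for entry: {raw!r}")
            return 1.0
    return float(text)


if __name__ == "__main__":
    for entry in ["less than 1%", "<1%", "3.5%", "12"]:
        print(entry, "->", parse_bsa_percent(entry))
```

Because every such entry is mapped to the same value, any differences among lesions covering less than 1% of BSA are lost in the analysis.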

PGA total score had an approximately normal distribution. The total scores ranged from 0–8.0 (mean 2.54±1.58) out of a possible 10, with an interquartile range of 1–3.0.

Reliability

Test-retest reliability

Physicians’ agreement between their initial scoring and re-rating of the same subject demonstrated an overall ICC [95% confidence interval] of 0.98 [0.96–1.0] for the PDAI, 0.80 [0.65–0.96] for the ABSIS, and 0.75 [0.55–0.94] for the PGA (Table 1). The ICC [95% CI] for the PDAI skin activity was 0.98 [0.96–1.0], for mucous membrane activity was 0.98 [0.97–1.0], and for damage was 0.89 [0.80–0.98]. The ICC for the ABSIS skin involvement was 0.87 [0.77–0.98], for mucosal involvement was 0.99 [0.97–1.0], and for subjective discomfort it was not calculable (due to too little variability among subjects). The test-retest correlation (Spearman’s rho) was rsp=0.94 [0.86–1.0] for the total PDAI, and rsp=0.91 [0.79–1.0] for the total ABSIS scale.

Table 1
Intra-rater test-retest reliability (n=20)

Inter-rater reliability

The consistency among physicians’ rating of subjects using each tool had an overall ICC of 0.76 [0.61–0.91] for the PDAI, 0.77 [0.63–0.91] for the ABSIS, and 0.44 [0.22–0.65] for the PGA (Table 2). The ICC for the PDAI skin activity was 0.86 [0.76–0.95], for mucous membrane activity was 0.84 [0.73–0.95], for total activity (skin and mucous membrane) was 0.77 [0.62–0.91], and for damage was 0.69 [0.52–0.86]. The ICC for ABSIS skin involvement was 0.39 [0.17–0.60], for mucosal involvement was 0.85 [0.75–0.95], and for subjective discomfort was 0.89 [0.82–0.97].

Table 2
Inter-rater reliability: Score distribution and convergent validity

Validity: Correlation with the PGA

The PDAI and ABSIS scores were then correlated with the PGA. The PDAI had a correlation of 0.60 [0.49–0.71] compared to the ABSIS correlation of 0.43 [0.30–0.55]. Figure 1 provides the mean PDAI and ABSIS score for each PGA. There is a statistically significant linear trend between PGA and PDAI (F=49.75; df=1; p<0.0001) and between PGA and ABSIS (F=12.38; df=1; p<0.0006). The mean scores for both PDAI and ABSIS increase (indicating a worsening condition) as PGA increases. In pairwise comparisons of PDAI scores at different levels of PGA, several comparisons (PGA 1 vs 3, 1 vs 4, 2 vs 4, 3 vs 4) are statistically significantly different. For ABSIS scores, only one pair (PGA 1 vs 3) shows a statistically significant difference.

Figure 1
ABSIS scoring sheet, adapted from Pfutze et al., 2007

Time for Instrument Completion

The mean time for the PDAI was 4.7 minutes [± 0.18] and for the ABSIS was 3.9 minutes [± 0.18]; PGA time was not recorded.

Discussion

While both scoring systems were heavily weighted towards the low end of their respective scales, the PDAI data were normally distributed, whereas the ABSIS data were non-normal and right-skewed, with most scores clustered at the low end of the scale. The inter-rater reliability for both tools was similar, with an ICC of 0.76 for the PDAI compared to 0.77 for the ABSIS. The comparison of skin activity between the two instruments, however, shows the PDAI skin activity had an ICC of 0.86 compared to 0.39 for the ABSIS. The ABSIS scoring tool achieves much of its inter-rater reliability from the subjective component, patient report of discomfort with foods, as opposed to the objective, physician-dependent portion of the scaling system. The discrepancy between skin activity ICC scores suggests that at mild-to-moderate levels of disease, such as those represented in this study population, the PDAI may be better able to detect small differences in pemphigus cutaneous disease extent. This is further suggested by the distribution of scores, with the PDAI allowing for a normal distribution of scores even in this limited population of mild-to-moderate disease activity, while the ABSIS data were right-skewed. In fact, many of the raters reported difficulty grading the patients using the ABSIS scale, and seven out of ten raters recorded values less than the minimum 1% (all such scores were treated as 1% for study statistics), which suggests the ABSIS as designed may be failing to capture clinically detectable differences in disease activity.
In the original ABSIS publication, the authors note that using ‘the rule of nine’ to calculate body surface area can be difficult and lead to inter-rater disagreement when scorers are untrained; in this study, all of the raters were experienced dermatologists familiar with ‘the rule of nine.’ However, there is evidence that using skin area assessments to measure inflammatory skin disease can be difficult, even for physicians (Tilin-Grosse and Rees, 1993; Charman and Williams, 2000; Charman et al, 1999).

The intra-rater reliability was examined by having physicians re-evaluate patients they had previously scored, using both metrics again. The test-retest analysis showed an overall ICC of 0.98 [0.96–1.0] for the PDAI, 0.80 [0.65–0.96] for the ABSIS, and 0.75 [0.55–0.94] for the PGA. This suggests that not only was the PDAI more consistent among multiple raters, but that the scores were more reproducible for each individual physician rater as well.

After the patient scoring, physician feedback was obtained. These responses suggested that the majority of physicians involved in this study felt that both the PDAI and ABSIS were too difficult to be incorporated into routine practice, but that the PDAI allowed for more accurate representation of the spectrum of disease extent. Many physicians had trouble with the ABSIS subjective discomfort component. Some physicians felt that the PGA was sufficient, although the relatively low inter-rater (ICC=0.44) and intra-rater (ICC=0.75) reliability seen in this study suggests that the PGA is a poor tool to objectively assess pemphigus disease activity. Notably, while the PDAI took longer to complete on average, the time to use either tool was not markedly different (4.7 minutes for the PDAI, 3.9 minutes for the ABSIS).

The patients evaluated in this study had limited disease extent. For both instruments, the highest score (representing most severe disease extent) was less than 25% of the maximum possible score. This limits the conclusions that can be drawn from these data. However, at the low end of the spectrum of pemphigus disease extent, it appears that the PDAI is able to capture small differences in disease extent. The ability of the PDAI to allow raters to consistently assess small differences in even mild disease activity suggests that the PDAI may represent a scoring system capable of allowing researchers to follow disease activity in a clinical trial setting and compare data amongst individual physicians as well as across multiple research centers. Furthermore, it is our belief that given the nature of pemphigus, which tends to be dramatically responsive to glucocorticoid therapy, it is crucial that an instrument to monitor disease be able to detect activity at the low end of the disease spectrum. Notably, both instruments correlated well with the PGA at PGA scores of 0–4, with a statistically significant linear trend (data not shown). The small number of patients with more severe disease extent (with only 19 PGA scores rated as a 5 or higher, out of 150 total scores) limits the ability to draw conclusions at more severe levels of disease.

One of the limitations of this study is that the patients evaluated represented the low end of the disease activity spectrum. Therefore these results cannot be extrapolated to patients with more severe disease. We are currently performing additional studies to evaluate the PDAI and validate it for use in more severe disease populations. A portion of this study will take place using web-based digital photographs to allow for representation of a wider spectrum of disease and to allow members of the international pemphigus community to participate in the evaluation and validation of these scoring instruments.

In conclusion, the PDAI demonstrates reasonable convergent validity and was found to be a reliable, quick, and easy-to-use method to capture the extent of skin and mucosal lesions in patients with mild to moderate PV and PF. It should be considered as a possible outcome measure that needs further study to examine responsiveness.

Methods

Ten physicians and fifteen patients were brought to the dermatology clinic at the Hospital of the University of Pennsylvania on one day to complete this study. The morning of the study, physicians completed a training session on the PDAI, ABSIS and Physician Global Assessment (PGA) with visual images of skin manifestations of pemphigus vulgaris (PV), pemphigus foliaceus (PF), and paraneoplastic pemphigus (PNP) to familiarize themselves with the three instruments and discuss scoring methods. Physicians and patients were divided into two groups. Physicians scored the first group of patients using the ABSIS scale first, followed by the PDAI and PGA. They reversed the order of instruments for scoring the second group. Physicians were instructed to individually document their start and stop times for each tool. Physicians rotated among patient rooms, with only one physician and patient in a room at any one time. Each physician then returned to the original group and re-rated two patients selected at random by patient availability, with at least two hours between the initial and subsequent re-rating. In order to minimize recall, neither the physicians nor patients were told at the beginning of the study that there would be a re-scoring session at the end to assess intra-rater reliability. Re-rating was done on a random basis based on patient availability, not requiring each patient to be re-rated but rather requiring each physician to re-rate two patients. Of the 15 patients, all but 4 were re-rated at least once; 6 patients were re-rated by 1 physician, 3 patients were re-rated by 2 physicians, 1 patient was re-rated by 3 physicians, and 1 patient was re-rated by 5 physicians.

Patients

The patients were volunteers from the outpatient clinic of the Department of Dermatology of the University of Pennsylvania, a tertiary care center in Philadelphia, Pennsylvania, or members of the International Pemphigus and Pemphigoid Foundation (IPPF) who volunteered to participate. All patients had a clinical exam, histologic result from a skin biopsy, and immunofluorescence studies consistent with a diagnosis of PV, PF, or PNP.

Physicians

All physicians were board-certified dermatologists with extensive experience diagnosing, treating and managing patients with pemphigus. Physician questions regarding the instruments were addressed in a group discussion mediated by the principal investigator immediately before study commencement. All physicians scored all fifteen patients and all physicians re-scored two patients. At the end of the session all physicians were polled for feedback regarding both scoring tools.

Description of Instruments

The purpose of these instruments is to allow clinicians and researchers to measure disease extent and to potentially monitor patients longitudinally.

ABSIS

The ABSIS instrument was developed by Pfutze et al. in 2007 (Pfutze et al, 2007). Using body surface area (BSA) and lesion type as weighting factors, the ABSIS incorporates skin activity and oral involvement with a subjective severity scale based on discomfort during eating and drinking. Using higher scores to denote worse disease, the ABSIS has a possible scoring range of 0 to 206, with 150 points for skin involvement, 11 points for oral involvement, and 45 points for subjective discomfort (Figure 1).

PDAI

The PDAI instrument used in this study was developed by leading academic dermatologists with extensive experience in the management of pemphigus. Starting in 2005, there were several meetings of the International Pemphigus Committee to develop a tool to measure pemphigus disease activity; the PDAI is the result of an international consensus (Murrell et al, 2008). Using higher scores to denote worse disease, the PDAI has a total possible score ranging from 0 to 263, with 250 points representing disease activity (120 points for skin activity, 10 points for scalp activity, and 120 points for mucosal activity) and 13 points representing disease damage (Figure 2). The tool is used by examining each area as indicated, with scores assigned to each anatomic region based on number and size of lesions in that region. The skin, scalp, and mucosa are scored separately, and there is a damage component incorporated to capture areas affected by pemphigus, such as patches of post-inflammatory hyperpigmentation, which lack primary lesions. When raters scored an anatomic region with a 1 (1–3 lesions, none greater than 2 cm), they also recorded the number of lesions at that site, ranging from one to three lesions. This lesion count was incorporated into the scoring by giving each region a score of 1 if 1 lesion was present, a score of 1.3 if 2 lesions were present, and a score of 1.6 if 3 lesions were present.
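The lesion-count adjustment and the composition of the maximum score described above can be sketched as follows. This is an illustrative sketch only: the function and constant names are hypothetical, and category scores other than 1 are passed through unchanged because their definitions are given on the scoring sheet (Figure 2) rather than in the text.

```python
from typing import Optional

# Maximum points per PDAI component, as stated in the text.
PDAI_MAX = {"skin": 120, "scalp": 10, "mucosa": 120, "damage": 13}
PDAI_MAX_ACTIVITY = PDAI_MAX["skin"] + PDAI_MAX["scalp"] + PDAI_MAX["mucosa"]
PDAI_MAX_TOTAL = PDAI_MAX_ACTIVITY + PDAI_MAX["damage"]

# Study-specific refinement of a region score of 1 (1-3 lesions, none > 2 cm).
LESION_COUNT_SCORE = {1: 1.0, 2: 1.3, 3: 1.6}


def adjusted_region_score(category_score: int,
                          lesion_count: Optional[int] = None) -> float:
    """Return the region score used in the analysis.

    A category score of 1 is refined by the recorded lesion count;
    every other category score is returned unchanged.
    """
    if category_score == 1:
        if lesion_count not in LESION_COUNT_SCORE:
            raise ValueError("a score of 1 requires a lesion count of 1, 2, or 3")
        return LESION_COUNT_SCORE[lesion_count]
    return float(category_score)


if __name__ == "__main__":
    print(PDAI_MAX_ACTIVITY, PDAI_MAX_TOTAL)
    print(adjusted_region_score(1, 2))
```

This adjustment is why fractional values, such as the maximum skin activity subscore of 28.3, appear in the reported PDAI scores.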

Figure 2
Pemphigus Disease Area Index (PDAI)

Physician Global Assessment (PGA)

The PGA is a ten-point visual analog scale ranging from 0 = perfect health to 10 = worst skin condition imaginable (Figure 3). This scoring system allows physicians to rate disease activity by general overall impression; its use has been validated in studying psoriasis (Langley and Ellis, 2004), and it has been used in studies of patients with cutaneous lymphoma (Heald et al, 2003), eczema (Guzzo et al, 1991), dermatomyositis (Hundley et al, 2006), and other inflammatory dermatoses. Because there is no gold-standard instrument used to assess pemphigus disease activity, we used the PGA as part of an assessment of convergent validity, expecting a positive correlation between the PGA and both the PDAI and ABSIS.

Figure 3
Physician’s Global Assessment

Statistical Methods

Scale Distribution

Summary statistics were used to describe the sample’s distribution of scores for each instrument; the Shapiro-Wilk test was applied to assess normality of the distributions.
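The kind of summary reported in the Results (mean ± SD, median, and interquartile range) can be sketched with the Python standard library. The function name and the example scores are hypothetical, and the quartile interpolation rule ("inclusive") is an assumption, since the text does not state which rule was used. The Shapiro-Wilk test is not reimplemented here; it is available, for example, as `scipy.stats.shapiro` in SciPy.

```python
import statistics
from typing import Dict, Sequence


def summarize_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Describe a set of instrument scores with basic summary statistics."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),       # sample standard deviation
        "median": statistics.median(scores),
        "q1": q1,
        "q3": q3,
    }


if __name__ == "__main__":
    example = [1, 2, 3, 4, 5]  # hypothetical scores, not study data
    print(summarize_scores(example))
```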

Reliability and Validity

To analyze and describe the change in physician scores from first to second rating on the same patient, test-retest intra-rater reliability was assessed using the intra-class correlation coefficient (ICC). The Spearman’s rho correlation was also assessed for all of the test-retest data. Inter-rater reliability was also assessed using ICC. All ICCs were calculated using the one-way random-effect ANOVA model. Based on previous research, an ICC of 0.5–0.7 is considered minimally acceptable, while an ICC above 0.81 is considered excellent (Shrout and Fleiss, 1979).
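For readers unfamiliar with the one-way random-effect ICC, a minimal sketch is given here. It assumes a balanced design and the single-rating form ICC = (MSB − MSW) / (MSB + (k − 1)·MSW), where MSB and MSW are the between-subject and within-subject mean squares and k is the number of ratings per subject. The text does not state whether the single-rating or average-rating form was used, so this choice, like the function name, is an assumption. Confidence intervals are not computed.

```python
from typing import Sequence


def icc_oneway(ratings: Sequence[Sequence[float]]) -> float:
    """One-way random-effects ICC for a single rating (balanced data).

    `ratings` has one row per subject and one column per rating.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 subjects, each with the same k >= 2 ratings")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # One-way ANOVA sums of squares: between subjects and within subjects.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


if __name__ == "__main__":
    # Hypothetical scores: 3 subjects, 2 ratings each (not study data).
    print(icc_oneway([[10, 11], [20, 19], [30, 31]]))
```

For the inter-rater analysis, with ten physicians each scoring all fifteen patients, `ratings` would hold fifteen rows of ten scores. The undefined case raised above corresponds to the situation in which a subscale shows too little variability for an ICC to be calculated.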

Validity was assessed using the Spearman’s rho to correlate the physician’s global assessment (PGA) to each instrument (PDAI and ABSIS) as a means of identifying whether the tools were appropriately reflecting the current overall level of disease. To assess whether the instruments demonstrated linear trends over increasing intervals of the PGA, the GLM-ANOVA F-test for linear trends was used, with the Scheffe test to adjust for multiple pairwise comparisons. Subjects with PGA levels of 1, 2, 3, and 4 were included in the analysis, and those with levels 5 through 10 were omitted due to their small sample size (i.e., fewer than 10 subject scores in this range).
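The Spearman correlation used for convergent validity can be sketched as a Pearson correlation computed on ranks, with tied values sharing their average rank. The function names and the example scores are hypothetical. The trend F-test and the Scheffe adjustment are not shown.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    pga = [1, 2, 2, 3, 4]          # hypothetical PGA scores
    pdai = [3.0, 8.0, 6.3, 14.0, 19.0]  # hypothetical PDAI totals
    print(spearman_rho(pga, pdai))
```

Because the PGA takes only a few distinct values, ties are common, which is why the rank step averages tied positions.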

Time for Instrument Completion

All physicians individually recorded the time they spent in each patient’s room, as well as the start and end time for each instrument. The means and standard errors were computed for the PDAI and ABSIS instruments.
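The timing summary can be sketched as follows, assuming start and stop times were recorded as clock times in an HH:MM format. That format, the function names, and the example values are assumptions made for illustration. The standard error is computed as the sample standard deviation divided by the square root of the number of timings.

```python
import math
import statistics
from datetime import datetime
from typing import Sequence, Tuple


def minutes_between(start: str, stop: str, fmt: str = "%H:%M") -> float:
    """Elapsed minutes between two same-day clock times (hypothetical format)."""
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60.0


def mean_and_standard_error(times: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for completion times."""
    if len(times) < 2:
        raise ValueError("need at least two timings")
    mean = statistics.mean(times)
    standard_error = statistics.stdev(times) / math.sqrt(len(times))
    return mean, standard_error


if __name__ == "__main__":
    # Hypothetical start/stop pairs for one instrument (not study data).
    recorded = [("09:05", "09:10"), ("09:20", "09:24"), ("09:40", "09:46")]
    durations = [minutes_between(a, b) for a, b in recorded]
    print(mean_and_standard_error(durations))
```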

Acknowledgements

This study was supported by a grant to VPW from the International Pemphigus and Pemphigoid Foundation and by the National Institutes of Health (NIH K24-AR 02207, to VPW). We would also like to acknowledge specific assistance from Janet Segall, William J. Zrnchik II, and David Sirois.

This work was presented at the Society for Investigative Dermatology in May 2008, and is published in abstract form in the Journal of Investigative Dermatology JID (abstract) 43: S75, 2008.

Footnotes

Conflict of Interest:

There are no conflicts of interest reported.

References

1. Charman C, Williams H. Outcome measures of disease severity in atopic eczema. Arch Derm. 2000;136:763–769.
2. Charman CR, Venn AJ, Williams HC. Measurement of body surface area involvement in atopic eczema: An impossible task? Br J Dermatol. 1999;25:406–411.
3. Guzzo CA, Weiss JS, Mogavero HS, Ellis CN, Zaias N, Lowe NJ, et al. A review of two controlled multicenter trials comparing 0.05% halobetasol propionate ointment to its vehicle in the treatment of chronic eczematous dermatoses. J Am Acad Dermatol. 1991;25:1179–1183.
4. Heald P, Mehlmauer M, Martin AG, Crowley CA, Yocum RC, Reich SD. Topical bexarotene therapy for patients with refractory or persistent early-stage cutaneous T-cell lymphoma: results of the phase III clinical trial. J Am Acad Dermatol. 2003;49:801–815.
5. Hundley JL, Carroll CL, Lang W, Snively B, Yosipovitch G, Feldman SR, et al. Cutaneous symptoms of dermatomyositis significantly impact patients’ quality of life. J Am Acad Dermatol. 2006;54:217–220.
6. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
7. Langley RG, Ellis CN. Evaluating psoriasis with Psoriasis Area and Severity Index, Psoriasis Global Assessment, and Lattice System Physician's Global Assessment. J Am Acad Dermatol. 2004;51:563–569.
8. Martin L, Murrell DF. Measuring the Immeasurable: A Systematic Review of Outcome Measures in Pemphigus. Paper. Australas J Dermatol. 2006;47 Suppl1:A32–A33.
9. Murrell DF, Dick S, Ahmed AR, Amagai M, Barnadas MA, Borradori L, et al. Consensus statement on definitions of disease, end points, and therapeutic response for pemphigus. J Am Acad Dermatol. 2008;58:1043–1046.
10. Pfutze M, Niedermeier A, Hertl M, Eming R. Introducing a novel Autoimmune Bullous Skin Disorder Intensity Score (ABSIS) in pemphigus. Eur J Dermatol. 2007;17:4–11.
11. Shrout PE, Fleiss JL. Intraclass correlations: Uses in assessing rater reliability. Psychol Bulletin. 1979;86:420–427.
12. Tilin-Grosse S, Rees J. Assessment of area of involvement in skin disease: A study using schematic figure outlines. Br J Dermatol. 1993;128:69–74.