Traditional metrics for evaluating the severity of psoriasis are subjective, which complicates efforts to measure effective treatments in clinical trials.
We collected images of psoriasis plaques and calibrated the coloration of the images according to an included color card. Features were extracted from the images and used to train a linear discriminant analysis classifier with cross-validation to automatically classify the degree of erythema. The results were tested against numerical scores obtained by a panel of dermatologists using a standard rating system.
Quantitative measures of erythema based on the digital color images showed good agreement with subjective assessment of erythema severity (κ = 0.4203). The color calibration process improved the agreement from κ = 0.2364 to κ = 0.4203.
We propose a method for the objective measurement of the psoriasis severity parameter of erythema and show that the calibration process improved the results.
Psoriasis is a chronic, inflammatory disease that affects the skin and joints. The most common form is plaque psoriasis, which presents with scaly red and white patches of the epidermal layer of the skin (1). These plaques often occur on the elbows and knees, but they can affect any part of the body. The cause of the condition is not fully understood, and there is no cure currently available (1).
Examination of the National Health and Nutrition Examination Survey suggests the prevalence of diagnosed psoriasis in the United States includes roughly 5 million adults (2). Out of these patients, 17% have moderate to severe psoriasis and 25% report that the condition poses a significant problem in their daily lives (2). Furthermore, the condition is associated with widespread treatment dissatisfaction indicating a need for new, effective therapies (3).
Treatment of psoriasis is currently an active area of research. The introduction of injectable immune-modulating medications called “biologics” has provided patients increasing relief from the effects of their disease when more traditional treatments have failed (4). The side effects of these medications are not trivial: they carry an increased risk of serious opportunistic infections and tuberculosis reactivation, which makes objective evaluation of new therapeutic agents essential (5).
One of the major challenges to developing more effective treatment of psoriasis is difficulty in tracking the progression of psoriasis given the subjective nature of assessing its severity. Even experienced physicians can show wide variation in evaluating the severity of psoriatic plaques (6). The lack of an objective metric for psoriasis severity inhibits tracking of patient progress and establishment of treatment goals (7). This is particularly problematic for studies intended to compare and evaluate different treatments, because it increases the difficulty of establishing an objective improvement (7). The necessity of a physician’s evaluation of severity parameters also significantly increases the cost and duration of these studies.
Currently, the most widespread method of evaluation of psoriasis for clinical trials is the psoriasis area and severity index (PASI) (7). In this semi-quantitative method, the body is split up into 4 sections (head, arms, trunk and legs) and each section is given a specific weight based on the percentage of the body’s total skin in that region. A physician evaluates the severity of the psoriasis in each of those regions on a 0–4 scale based on the erythema (redness), desquamation (scaling) and induration (thickness) of the plaques as well as the proportion of skin affected (7). All of these values are entered into a formula that yields a value from 0–72 indicating the overall severity. In this study, we only consider the erythema scores.
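As an illustration of how the regional scores combine, the following is a minimal sketch of the commonly cited PASI formula. The region weights (0.1, 0.2, 0.3, 0.4) and the 0–6 area score are the standard formulation and are assumptions here, since the text above does not state them; the function name `pasi` and the input layout are hypothetical.

```python
# Standard PASI region weights (assumed; not stated in the text above).
REGION_WEIGHTS = {"head": 0.1, "arms": 0.2, "trunk": 0.3, "legs": 0.4}


def pasi(scores):
    """Compute a PASI value from per-region scores.

    scores maps each region name to a tuple
    (erythema, desquamation, induration, area), where the three severity
    values are on the 0-4 scale and the area score is assumed to be 0-6.
    """
    total = 0.0
    for region, weight in REGION_WEIGHTS.items():
        erythema, desquamation, induration, area = scores[region]
        for severity in (erythema, desquamation, induration):
            if not 0 <= severity <= 4:
                raise ValueError(f"severity out of range in {region}")
        if not 0 <= area <= 6:
            raise ValueError(f"area score out of range in {region}")
        # Each region contributes weight * (sum of severities) * area score.
        total += weight * (erythema + desquamation + induration) * area
    return total


# Worst case in every region; the result is close to 72, the top of the scale.
worst_case = {region: (4, 4, 4, 6) for region in REGION_WEIGHTS}
print(pasi(worst_case))
```

Under these assumptions the maximum is (0.1 + 0.2 + 0.3 + 0.4) × 12 × 6 = 72, which matches the 0–72 range given above.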
However, PASI has significant limitations. Despite efforts to refine the PASI formula that have yielded marginal improvements, all variations of the PASI score suffer from similar drawbacks (8). The evaluation of the severity parameters is still a relatively subjective endeavor, which decreases the utility of the PASI score.
Consequently, there have been efforts to develop new tools for the automated evaluation of psoriasis based on clinical images. These new methods are based on the traditional PASI model, and so seek to automate the evaluation of the established parameters: erythema, scaling, and induration. Objective, quantitative evaluation of these parameters could greatly enhance the reliability of the PASI score for evaluating psoriasis.
Some prior studies classified the severity of psoriasis plaques using erythema, desquamation, or both (9), (10), (11), (12). However, all of the existing studies used a standard camera to acquire the images, and little effort has been made to create a calibration method to standardize the images across cameras and lighting conditions. This is a key issue since a variety of cameras and lighting conditions are present in clinical practice. Moreover, prior work has explored only a few of the feature sets that could potentially be used to quantify properties such as redness or scaling.
The goal of this study was to algorithmically calculate a measure of erythema from clinical photographs and evaluate it with respect to an expert assessment of erythema by a panel of dermatologists. A color calibration process was used to help control for different lighting conditions and other inconsistencies in acquired images.
Standard clinical photographs were taken of 20 patients exhibiting psoriatic plaques in accordance with the protocol approved by the Seton Institutional Review Board. Patients were recruited from Seton clinics including the University Medical Center Brackenridge Dermatology Clinic, Seton Family of Doctors at Hays, and Trinity Clinic. Images were taken of both knees and both elbows for each patient regardless of whether all of those areas exhibited plaques, giving a total of 80 images. Pictures were taken with either a Canon PowerShot SX230 HS (Canon U.S.A., Melville, New York) (N=76) or a Canon PowerShot ELPH 520 HS (N=4), and the field of view was large enough to encompass the plaque as well as some surrounding skin for comparison. We also included a 4×6 color card (CameraTrax, Las Vegas, Nevada) with 24 colors in the images in order to calibrate the coloration.
Photographs were typically taken against a uniform blue background. The normal lighting of the clinic rooms was used to illuminate the target regions.
The images were independently rated by each member of a panel of five dermatologists. The panel included one attending physician and four experienced dermatology residents. The raters were given a chart with examples of each score for reference during the rating, and the images were shown to each rater using the same screen. Each of the raters reported a score from 0–4 for the erythema severity of each of the images. Examples from each category are shown in Table I.
The subjective erythema scores were analyzed for agreement and consistency. The intraclass correlation coefficient (ICC) was calculated assuming the raters were fixed and the different images represented a random sample of possible images. The ICC test for agreement indicated that r = 0.7306 and the test for consistency showed that r = 0.7492. These values indicate a high level of uniformity for both parameters of concordance. The median of the scores was taken as a composite overall assessment in order to ensure that the images were put into discrete categories.
One of our goals was to create a process that could function with any camera. This was challenging since different cameras and lighting conditions can produce differences in sharpness, coloration, and noise that can subvert image analysis.
In order to remedy the problem, we implemented a method proposed by Marguir et al. for calibrating the photographs using a color card with pre-defined color values so that the images appear similar (13). This method has been applied successfully in existing cosmetics and dermatology applications. The authors standardized the coloration of different types of skin despite varying levels of illumination and different cameras, and their method made the resulting images look very consistent despite widely varying lighting conditions. The only difference in our study was that we used a simpler 4×6 color card instead of the more extensive ones that they employed.
The method works through computing a linear color transform that minimizes the mean square error between the color values detected in the image and the reference color values. This transform is then applied to the entire image for the calibration.
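The least-squares step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes a 3×3 matrix without an offset term, 8-bit RGB values, and patch colors that have already been located and averaged. The function names `fit_color_transform` and `apply_color_transform` are hypothetical.

```python
import numpy as np


def fit_color_transform(measured, reference):
    """Fit a 3x3 matrix M minimizing the squared error of measured @ M vs. reference.

    measured, reference: arrays of shape (n_patches, 3) holding the mean RGB
    value of each color-card patch in the photograph and its known target value.
    """
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    M, *_ = np.linalg.lstsq(measured, reference, rcond=None)
    return M


def apply_color_transform(image, M):
    """Apply the fitted transform to every pixel of an (H, W, 3) RGB image."""
    height, width, _ = image.shape
    flat = image.reshape(-1, 3).astype(float)
    corrected = flat @ M
    # Clip to the 8-bit range (an assumption about the image encoding).
    return np.clip(corrected, 0, 255).reshape(height, width, 3)
```

With a 24-patch card the system is overdetermined (24 equations per output channel for 3 unknowns), so the fit averages out noise in individual patches. An affine variant would append a column of ones to `measured` before fitting.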
After obtaining the calibrated images, features were extracted from the plaques. We compiled a list of features from other similar studies that were intended to match the degree of redness in an image (9), (10), (11). This is the complete list of features considered:
Before the features were evaluated, a human operator (A.R.) familiar with psoriasis images selected two representative areas on each of the images manually. One area was taken to be representative of the erythema of the skin and the other was representative of the unaffected skin of the patient. A bounding box was drawn enclosing each selected area for isolation. This manual area selection removed the need for an automated segmentation process to isolate the plaques. This is key since psoriasis plaques are highly variable and hence not amenable to automated segmentation. Most images had a standard blue background, but others included the surrounding clinic in the images. Moreover, since there can be substantial variation even within a given psoriasis location, selecting an area that is typical helps focus the subsequent analysis.
The following features were extracted from the representative area associated with erythema: (1) mean, (2) standard deviation, and (3) kurtosis of the red channel; (4) mean of the proportion of redness in the RGB image; (5) mean, (6) standard deviation, and (7) kurtosis of the difference between the blue and green channels. The choice of features was empirically optimized. Each feature from the initial list was added one at a time to the classifier, and if it improved the resulting correlation then it was retained.
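The seven retained features could be computed along the following lines. This is a sketch under stated assumptions: the "proportion of redness" is taken to be R / (R + G + B) per pixel, the channel difference is taken as blue minus green, and kurtosis is the non-excess (Pearson) definition; none of these details are specified above. The function names are hypothetical.

```python
import numpy as np


def kurtosis(values):
    """Pearson kurtosis (fourth standardized moment) of a set of values."""
    x = np.asarray(values, dtype=float).ravel()
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    if variance == 0:
        # Kurtosis is undefined for a constant region; 0.0 is a placeholder.
        return 0.0
    return np.mean(centered ** 4) / variance ** 2


def erythema_features(region):
    """Return the 7 features for an (H, W, 3) RGB crop of the plaque area."""
    region = np.asarray(region, dtype=float)
    r, g, b = region[..., 0], region[..., 1], region[..., 2]
    total = r + g + b
    # Assumed definition of redness proportion: R / (R + G + B), 0 for black pixels.
    redness = np.divide(r, total, out=np.zeros_like(r), where=total > 0)
    blue_minus_green = b - g
    return np.array([
        r.mean(), r.std(), kurtosis(r),              # features 1-3
        redness.mean(),                              # feature 4
        blue_minus_green.mean(),                     # feature 5
        blue_minus_green.std(),                      # feature 6
        kurtosis(blue_minus_green),                  # feature 7
    ])
```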
The features and corresponding ratings of erythema were used to train a linear discriminant analysis classifier using a leave-one-out cross-validation procedure. Specifically, the features and the associated expert ratings of all 4 images of all but one of the subjects were used to train a classifier to predict the severity of erythema for the 4 images of the one patient who was excluded; the process was repeated such that each patient was held out for testing.
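The hold-one-patient-out procedure can be sketched as below. This is a compact NumPy version of linear discriminant analysis (shared covariance, class priors estimated from the training data) written for illustration; it is not the software used in the study, and the function names are hypothetical.

```python
import numpy as np


def lda_fit(X, y):
    """Estimate class means, priors, and the pooled covariance inverse."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    # Pooled within-class covariance shared by all classes.
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / max(len(X) - len(classes), 1)
    return classes, means, priors, np.linalg.pinv(cov)


def lda_predict(model, X):
    """Assign each row of X to the class with the largest linear discriminant."""
    classes, means, priors, cov_inv = model
    proj = means @ cov_inv  # one row per class
    scores = X @ proj.T - 0.5 * np.sum(proj * means, axis=1) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]


def leave_one_subject_out(X, y, subjects):
    """Predict each subject's images with a model trained on all other subjects."""
    predictions = np.empty_like(y)
    for subject in np.unique(subjects):
        held_out = subjects == subject
        model = lda_fit(X[~held_out], y[~held_out])
        predictions[held_out] = lda_predict(model, X[held_out])
    return predictions
```

Here `X` would be the 80 × 7 feature matrix, `y` the median expert scores, and `subjects` an array of 80 patient identifiers. Grouping by patient keeps all 4 images of the held-out patient out of the training set.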
Additionally, the same overall process was repeated without the color calibration step in order to compare the results of using the classifier on the original and calibrated images. This served to evaluate the influence of the calibration on the results.
Most previous studies on erythema scoring evaluate the algorithm’s performance using the metric of accuracy, defined as the number of exact matches in PASI score in the predicted and actual data divided by the total number of images. However, we believe that this is not the best metric to use for evaluating the results. A linearly weighted Cohen’s kappa coefficient is preferred because this measure accounts for different degrees of concordance. If the predicted and expert ratings are close together, the overall score will be penalized less than if the ratings are far apart. In contrast, the accuracy will be the same whether the predicted and actual scores differ by a small degree or a large degree.
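For reference, a linearly weighted kappa for two sets of 0–4 scores can be computed as in the sketch below. It uses the standard definition with disagreement weights |i − j| / (k − 1); the function name and the default of five categories are assumptions made for this example.

```python
import numpy as np


def linear_weighted_kappa(rater_a, rater_b, n_categories=5):
    """Linearly weighted Cohen's kappa for integer scores in 0..n_categories-1."""
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    # Joint distribution of the two sets of scores.
    observed = np.zeros((n_categories, n_categories))
    np.add.at(observed, (a, b), 1)
    observed /= observed.sum()
    # Distribution expected by chance from the marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Disagreement weights grow linearly with the distance between scores.
    idx = np.arange(n_categories)
    weights = np.abs(idx[:, None] - idx[None, :]) / (n_categories - 1)
    chance_disagreement = np.sum(weights * expected)
    if chance_disagreement == 0:
        return float("nan")  # undefined when both raters use a single category
    return 1.0 - np.sum(weights * observed) / chance_disagreement
```

A prediction that misses by one category adds a weight of 0.25 to the observed disagreement, while a miss by four categories adds 1.0; plain accuracy counts both as a single error.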
An additional data set was obtained in order to test the effect of the calibration process on images from different cameras. The following six cameras were used: Canon EOS Rebel T3i DSLR Camera (Canon U.S.A., Melville, New York), HTC Sensation Camera (HTC Corporation, New Taipei City, Taiwan), iPhone 5C Camera (Apple Inc., Cupertino, California), iPhone 6 Camera (Apple Inc., Cupertino, California), Kodak PlaySport Camera (Kodak, Rochester, New York), and the OnePlus One camera (OnePlus, Shenzhen, Guangdong province, China). A subject with normal, non-psoriatic skin was selected. For each camera, five images were taken with a field of view including both the subject’s elbow and the CameraTrax color card.
Each of the images in the calibration data set was calibrated using the same method described previously. A human operator (A.R.) drew a bounding box on each image indicating the region containing the elbow. The mean of the red channel in the resulting window was calculated for each image to evaluate redness. The standard deviation of the mean red channel value of the set of calibrated images and the set of uncalibrated images was calculated. If the calibration process effectively standardizes the coloration of the images, then the variance of the mean red channel value should decrease in the calibrated image set. An F-test was used to compare the variances.
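The variance-ratio statistic underlying such an F-test can be sketched as follows. This is an illustrative helper with a hypothetical name; it computes only the statistic and its degrees of freedom, and the p-value would then come from the upper tail of the F distribution (for example `scipy.stats.f.sf(f_stat, dfn, dfd)`).

```python
import numpy as np


def variance_ratio(uncalibrated_means, calibrated_means):
    """Return the F statistic and degrees of freedom for two sets of values.

    Each input holds one mean red-channel value per image. The uncalibrated
    variance is placed in the numerator because the hypothesis is that
    calibration reduces the spread.
    """
    u = np.asarray(uncalibrated_means, dtype=float)
    c = np.asarray(calibrated_means, dtype=float)
    f_stat = u.var(ddof=1) / c.var(ddof=1)  # ratio of sample variances
    return f_stat, len(u) - 1, len(c) - 1
```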
The quantitative erythema scores had roughly the same distribution as the expert ratings of erythema, although the classifier tended to overestimate the severity (Table II). The agreement between the quantitative scores computed by our algorithm (with image calibration) and the subjective ratings by experienced dermatologists was κ = 0.4203, and the accuracy was 48.75%, which constitutes a good but not exceptional degree of agreement.
In contrast, when the same algorithm was applied to the images without calibration, the agreement between the quantitative scores and the subjective ratings was κ = 0.2364 and the accuracy was 42.5%. This means that the calibration process produced a moderate improvement in the accuracy of the classifier evaluated with respect to subjective rating of erythema by dermatologists.
However, there is not a published hypothesis test for a linearly weighted Cohen’s kappa. Therefore, this analysis was accomplished using a bootstrap method. The calibrated and uncalibrated rating sets were each sampled 80 times with replacement and the linearly weighted kappa was calculated between the predicted and dermatologists’ ratings. The process was repeated 1000 times in order to simulate a distribution for each set. The 95% confidence intervals were computed with the nonparametric method, where the values of the 2.5th and 97.5th percentiles of each set were obtained. The results indicate that the 95% intervals for the calibrated data are 0.2522–0.5575 and those of the uncalibrated data are 0.0307–0.2940 (p=0.08). The modest overlap in the confidence intervals estimated from this small data set (N = 20) suggests that the differences between calibrated and uncalibrated data may be significant for moderate sample sizes.
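The resampling procedure described above can be sketched as follows. This is a minimal illustration with hypothetical function names and an arbitrary random seed; it resamples image indices with replacement, recomputes the linearly weighted kappa each time, and reports the 2.5th and 97.5th percentiles.

```python
import numpy as np


def linear_weighted_kappa(a, b, n_categories=5):
    """Linearly weighted Cohen's kappa for integer scores in 0..n_categories-1."""
    observed = np.zeros((n_categories, n_categories))
    np.add.at(observed, (np.asarray(a, dtype=int), np.asarray(b, dtype=int)), 1)
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    idx = np.arange(n_categories)
    weights = np.abs(idx[:, None] - idx[None, :]) / (n_categories - 1)
    chance = np.sum(weights * expected)
    if chance == 0:
        return float("nan")  # undefined when only one category is present
    return 1.0 - np.sum(weights * observed) / chance


def bootstrap_kappa_ci(predicted, expert, n_resamples=1000, seed=0):
    """Nonparametric 95% confidence interval for the weighted kappa."""
    rng = np.random.default_rng(seed)
    predicted = np.asarray(predicted)
    expert = np.asarray(expert)
    n = len(expert)
    kappas = np.empty(n_resamples)
    for i in range(n_resamples):
        # Draw n image indices with replacement, keeping rating pairs together.
        idx = rng.integers(0, n, size=n)
        kappas[i] = linear_weighted_kappa(predicted[idx], expert[idx])
    # Resamples where kappa is undefined are ignored.
    return np.nanpercentile(kappas, [2.5, 97.5])
```

Calling `bootstrap_kappa_ci` once with the calibrated predictions and once with the uncalibrated predictions would give the two intervals to compare.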
Additionally, the results from the calibration data set suggest that the calibration process effectively standardizes the red levels across different cameras. The F-test indicates that the standard deviation of the average red channel value of the calibrated images (s = 10) is significantly less than the standard deviation of the averaged red channel value of the uncalibrated images (s = 18) (p=0.0008).
Our method for calibrating the coloration of the images is promising for improving automatic classification of erythema severity. Previous work has identified the influence of different sources of noise in the coloration as an important source of error in evaluating erythema (11). This suggests that the usage of a color card is an effective way to reduce noise introduced by environmental factors as well as the camera itself.
The accuracy of classification in this study was not as high as those reported by previous studies. A recent study on the subject from Lu et al. reported a much higher overall accuracy of 78.85% for the correct categorization of the erythema severity (10). However, the results reported here cannot be simply compared to those in the literature since both the methodology and the data set are different. To enable direct comparison of methods, one must apply them to the same data set.
Hence, in order to directly compare our results to prior work, we implemented the analysis method described by Lu et al. and applied it to our own data set. Lu et al. used five total features, including the difference between the plaque and the skin in the mean of the red, green, and blue channels as well as the differences in what they define as the average hemoglobin and melanin components of the image (10). They applied a k-nearest neighbor classification algorithm (k=5) with 10-fold cross-validation. When we applied their method to our data set of calibrated images, the accuracy was 30.0%, which is lower than the accuracy obtained using the algorithm we propose here (48.75%). The algorithm of Lu et al. achieved a similar accuracy when applied to the uncalibrated images. This result suggests that the overall lower classification accuracy obtained in this study is likely to be due to characteristics of the data set.
It should be noted that the images in our study require a larger field of view in order to include the entire plaque, the color card, and surrounding skin. However, a larger field of view does not permit as much detail in the image of the plaque itself, and also introduces a greater risk of error from shadows and uneven illumination. In some of the images in our data set, the illumination was not completely uniform and shadows were visible in the image. This introduced significant noise into the calculations, particularly where the shadow fell over the plaque but not on the color card, because the color calibration could not account for the shadows.
The images from other studies appear to have been taken with a smaller field of view and more consistent illumination, which could help explain why higher accuracy was obtained (9), (10). A more detailed image of the plaque could include more information, making it easier for the algorithms to distinguish the images.
In conclusion, we present a quantitative method for measuring erythema in psoriasis plaques and show that color calibration using a color card leads to improved results. This technique could serve to reduce error from environmental factors such as illumination and could theoretically standardize image collection from different camera systems.
However, the larger field of view needed to include the color card in the images can also make classification more challenging by reducing the amount of information specifically from the plaque.
We would like to thank Dr. Ashley Brown and Dr. Donald Warren for helping with the rating of the images as well as the Seton hospital system for accommodating our data collection.