marinbenc

Paper test

Introduction

Analyzing dermatological images using deep neural networks (DNNs) is a very active research field with a large number of published papers. A subset of those papers specifically focuses on semantic segmentation of lesion images, which can make it easier to analyze the lesion, infer the pathology, and diagnose melanomas and other conditions. Additionally, lesion segmentation is commonly employed as a preprocessing step in the evaluation of lesion analysis models and various other skin condition studies. Because segmentation is a pivotal part of skin lesion analysis, it is imperative to identify and understand any biases present in lesion segmentation models.

Numerous researchers have found evidence that DNNs for classifying dermatoscopic images are biased against dark-skinned individuals due to dataset imbalance as well as lower contrast in images of darker-skinned individuals. However, despite the growing awareness of biases in classification models, there are no papers describing a comprehensive investigation into skin color bias in lesion segmentation models.

Unlike classification models, segmentation models directly evaluate each pixel, offering potential benefits in reducing bias. Nevertheless, biases encountered in these models can serve as indicators of underlying issues in the data itself, warranting a thorough examination of dataset collection and labeling practices.

In this paper, we present a thorough evaluation of skin tone bias within commonly-used deep neural networks for skin lesion segmentation. Our study begins by developing two distinct methods to estimate skin color from clinical dermatological images: a convolutional neural network classifier and an algorithm based on preprocessing and k-means clustering. Subsequently, we investigate bias in commonly-used models by training a variety of U-Net-based models using combinations of three different widely used datasets. We then conduct an in-depth statistical evaluation of the segmentation performance for different estimated skin colors, drawing inspiration from the field of artificial intelligence fairness. Additionally, to validate bias independently of skin color prediction errors, we manually select a sample of dark-skinned subjects from the dataset.

Our findings reveal a large and statistically significant correlation between segmentation performance and skin lightness, indicating that common DNNs consistently segment lesions worse in individuals with darker skin tones than in those with lighter skin tones. This bias is evident both within and outside the training dataset, and across multiple publicly available datasets for both estimated and manually determined skin tones. Furthermore, we present a qualitative evaluation of biased predictions and assess several commonly used preprocessing techniques aimed at reducing bias, but find that they fail to significantly alleviate skin color bias.

In light of these discoveries, we propose several suggestions for future dataset collection, labeling, and model development, aiming to foster more equitable and unbiased skin lesion segmentation models.

To the best of our knowledge, this study represents the first comprehensive evaluation of bias in skin lesion segmentation models.

There have been several studies evaluating skin color bias in neural network classifiers. Prior work has manually categorized widely used facial analysis benchmarks into Fitzpatrick (FP) skin types and found disparities in model predictions based on skin color. Similar work has been done using Individual Typology Angle (ITA) as a measure of skin color.

In the context of dermatological images, most research has focused on classification bias rather than segmentation bias. One study evaluated the distribution of ITA on two widely used datasets of dermoscopic images. Its authors did not find significant correlations between accuracy and ITA; however, the datasets they evaluated contain very few dark-skinned subjects. Another study evaluated commonly used classifiers on a hand-labeled dataset and found that accuracy is lower for less-represented skin colors, but it did not perform any statistical analysis beyond reporting mean values for different skin types. A third study curated and labeled a diverse dataset of dermatological images with subjects of various skin colors. Its authors found that models trained on widely used datasets perform worse on dark-skinned subjects. They also found that dermatologists diagnosing subjects from images perform worse when compared to the gold standard diagnosis. However, even after fine-tuning models on their diverse and balanced dataset, a significant difference remained between the accuracy of classifying dark-skinned and light-skinned subjects.

Other work has used common methods of bias unlearning (Learning Not To Learn and Turning a Blind Eye) to reduce dermatological classifier bias. However, its authors only report the mean results across all subjects and do not analyze the differences between different skin colors.

To our knowledge, only one prior study addresses skin color bias in lesion segmentation. Its authors develop a technique for augmenting the skin color of an image to synthetically increase dataset diversity, improving both segmentation and classification performance. However, much like the bias unlearning work above, only the mean results are reported and no statistical analysis of bias is performed. Thus, it is hard to tell whether the model is less biased or whether the results have generally improved but remained biased towards light-skinned subjects.

Generally, the topic of bias in medical image segmentation models is not well researched, with the notable exception of studies in cardiac MRI. This paper presents a detailed evaluation of skin color bias in skin lesion segmentation on several widely used datasets. To our knowledge, this is the first analysis of fairness in widely used skin lesion segmentation models.

Skin color quantification methods

We employ the Fitzpatrick scale and the Individual Typology Angle (ITA) to classify skin tones and represent skin darkness, respectively, as they are commonly used in skin color bias studies.

The Fitzpatrick scale categorizes skin types into six groups based on UV response, ranging from type I (palest, never tans, always burns) to type VI (darkest, never burns). While subjectively evaluating skin types from images poses challenges, it can be effective with large sample sizes to evaluate bias. To reduce label noise caused by raters’ disagreements, we use less granular labels, classifying images into type I-II, III-IV, or V-VI.

ITA, being a more objective measure based on colorimetry, quantifies the skin’s constitutive pigmentation. Higher ITA values correspond to lighter skin, and vice versa. ITA can be estimated from images in the CIE-Lab colorspace as an indication of relative skin darkness within the dataset:

$$\operatorname{ITA}(L^*, b^*) = \arctan\left(\frac{L^* - 50}{b^*}\right) \cdot \frac{180}{\pi},$$

where L* and b* are the lightness and blue-yellow opponent values of the CIELAB colorspace, respectively. However, this estimate is highly dependent on lighting conditions and image contents, precluding its use for classification purposes.
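As a minimal sketch of the formula above, the following function computes ITA in degrees from a single (L*, b*) pair. The function name and the example values are illustrative assumptions and are not taken from our pipeline; the code follows the formula literally and therefore assumes b* is nonzero.

```python
import math

def ita_degrees(l_star: float, b_star: float) -> float:
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    Implements arctan((L* - 50) / b*) * 180 / pi; assumes b_star != 0.
    """
    return math.degrees(math.atan((l_star - 50.0) / b_star))

# Hypothetical values: a higher L* (lighter skin) yields a larger angle.
lighter = ita_degrees(70.0, 15.0)
darker = ita_degrees(35.0, 15.0)
```

With these inputs `lighter` is positive and `darker` is negative, matching the statement that higher ITA corresponds to lighter skin.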

Skin color estimation from images

In the realm of dermatological images, researchers have proposed various methods to estimate skin tones. One approach used a neural network to segment and remove lesion regions from the image. The remaining pixels were then transformed to the CIE-Lab colorspace, and the ITA was estimated from the mean pixel value after removing outliers. A second approach estimated color by sampling mean colors from different patches of the image and selecting the lightest patch, under the assumption that skin is generally lighter than lesion regions. A third used Shades of Gray color constancy to estimate the illuminant of an image. However, all of these approaches are sensitive to lighting conditions because they rely on the pixel values themselves.

To improve robustness under varying lighting conditions, a more effective approach is to use a neural network to classify skin color into Fitzpatrick skin types. Prior work employed this strategy by creating a dataset of clinical skin disease images labeled by Fitzpatrick skin type and training a neural network classifier on it to estimate the Fitzpatrick skin type of a given image. However, both the human labelers and the neural network achieved relatively low accuracy. Nevertheless, this approach can still provide valuable insights given a large enough sample size.

Methods

Skin tone extraction methods

Due to the absence of patient skin color information in widely available lesion segmentation datasets, we devise two methods to estimate skin color. Firstly, we use a neural network to classify images into light, medium, and dark skin (Fitzpatrick types I-II, III-IV, and V-VI, respectively). Secondly, traditional image processing and k-means clustering techniques are applied to estimate the dominant skin color in the image, from which we calculate the ITA. These methods enable us to approximate skin color for segmentation datasets lacking explicit skin color labels.

Neural network-based Fitzpatrick type classification

We trained a VGG16-based network to classify skin color into three Fitzpatrick type classes: I-II, III-IV, and V-VI. The network was initially pre-trained on the ImageNet dataset, followed by the Fitzpatrick-17k dataset, which contains clinical skin disease images. Since these images differ in domain from the dermatological images used for segmentation, we further fine-tuned the model on the Diverse Dermatology Images and PAD-UFES-20 datasets, which contain clinical images of skin lesions only. Augmentation techniques were applied to increase out-of-sample robustness, including random scaling, translation, rotation, as well as horizontal and vertical flipping.

To address class imbalance, we used cross entropy loss with increased error weight for less-represented classes (w_c = 1/n_c, where n_c is the number of samples of class c in the training dataset). Five-fold cross-validation was employed for model evaluation during training. During inference on segmentation datasets, an ensemble of the five fold models was used to predict skin type through majority voting.
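The weighting and voting steps can be sketched as follows. This is an illustration under stated assumptions rather than our training code: the helper names are hypothetical, labels are plain strings, and ties in the vote resolve to the label encountered first, a detail the text does not specify.

```python
from collections import Counter

def inverse_frequency_weights(train_labels):
    """Map each class c to w_c = 1 / n_c, with n_c its training-set count."""
    counts = Counter(train_labels)
    return {label: 1.0 / n for label, n in counts.items()}

def majority_vote(fold_predictions):
    """Return the label predicted by most fold models.

    Assumption: ties resolve to the label that appears first in the list.
    """
    return Counter(fold_predictions).most_common(1)[0][0]

# Hypothetical, heavily imbalanced training labels.
labels = ["I-II"] * 8 + ["III-IV"] * 4 + ["V-VI"] * 2
weights = inverse_frequency_weights(labels)

# Hypothetical predictions of the five fold models for one image.
ensemble_label = majority_vote(["V-VI", "I-II", "V-VI", "III-IV", "V-VI"])
```

The rarest class receives the largest weight, so each misclassified V-VI image contributes more to the loss than a misclassified I-II image.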

We also explored deeper backbone architectures like ResNet18 and ResNet34, but validation results showed no significant improvement. Therefore, we opted for the smaller VGG16 backbone.

k-means-based ITA estimation

The image undergoes extensive preprocessing to extract healthy skin regions, involving contrast-limited adaptive histogram equalization of the L* channel in the CIE-Lab colorspace, followed by artifact removal using Dullrazor. The HSV colorspace is then employed to threshold the value channel using Otsu thresholding, and the resulting mask is morphologically expanded and subtracted from the image, leaving behind healthy skin areas with background, hairs, lesions, and pigmentations removed.
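To make the thresholding step concrete, here is a minimal NumPy sketch of Otsu's method applied to a value channel. It is an illustration only: it assumes 8-bit values in [0, 255], the function name is hypothetical, and the equalization, Dullrazor, and morphological steps are not reproduced.

```python
import numpy as np

def otsu_threshold(values) -> float:
    """Return the threshold that maximizes between-class variance.

    Assumes 8-bit intensities; pixels strictly below the returned
    threshold form the dark class.
    """
    values = np.asarray(values, dtype=float)
    hist, edges = np.histogram(values.ravel(), bins=256, range=(0, 256))
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(prob)                 # weight of the dark class
    w1 = 1.0 - w0                        # weight of the bright class
    cum_mean = np.cumsum(prob * centers)
    total_mean = cum_mean[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    # Bins where one class is empty carry no separation information.
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return float(edges[int(np.argmax(between)) + 1])

# Hypothetical bimodal value channel: dark lesion pixels and bright skin.
value_channel = np.array([40] * 100 + [200] * 100)
threshold = otsu_threshold(value_channel)
dark_mask = value_channel < threshold
```

In this toy input the mask selects exactly the dark pixels; in the procedure described above, such a mask would then be expanded and subtracted from the image.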

After extracting the skin region, we convert the image back to the CIE-Lab colorspace and perform k-means clustering on all skin pixel value vectors (L*_i, a*_i, b*_i). The optimal value of k is automatically determined for each image following a previously published method, and we identify the most populated cluster as the estimated skin color of the subject.
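The clustering step can be sketched with a small Lloyd's-algorithm implementation in NumPy. This is a simplified illustration: k is fixed by the caller instead of being selected automatically per image, the function names and synthetic pixel values are assumptions, and the input is taken to be an (N, 3) array of CIELAB skin pixels.

```python
import numpy as np

def _assign(pixels, centroids):
    """Index of the nearest centroid for every pixel."""
    dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def dominant_lab_color(pixels, k=3, iters=25, seed=0):
    """Return the centroid of the most populated k-means cluster."""
    pixels = np.asarray(pixels, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = _assign(pixels, centroids)
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:         # keep the old centroid if empty
                centroids[j] = members.mean(axis=0)
    labels = _assign(pixels, centroids)
    counts = np.bincount(labels, minlength=k)
    return centroids[int(counts.argmax())]

# Synthetic pixels: a large skin-colored group and a smaller darker group.
rng = np.random.default_rng(1)
skin = rng.normal([70.0, 10.0, 20.0], 1.0, size=(300, 3))
residue = rng.normal([30.0, 15.0, 10.0], 1.0, size=(100, 3))
l_star, a_star, b_star = dominant_lab_color(np.vstack([skin, residue]), k=2)
ita = float(np.degrees(np.arctan((l_star - 50.0) / b_star)))
```

The ITA is then computed from the L* and b* components of the dominant centroid, as in the last line.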

It is important to emphasize that while ITA and Fitzpatrick types may show correlation, they are not interchangeable. The ITA calculation serves as an estimate and is only intended to represent relative skin color lightness among images in the dataset. Its absolute value may not directly correspond to the actual ITA of the individual subjects.

Skin color classification results and experiments

The results of skin color classification, as tested on FP17K

A scatterplot of the ITA predicted by the k-means algorithm against the probability of class V-VI predicted by the neural network classifier. The dots are colored according to the dominant colors predicted by the k-means algorithm.

Quantifying bias in lesion segmentation

To evaluate bias in neural networks, we train various U-Net based models using a ResNet-18 encoder, a widely used architecture in lesion segmentation. To ensure representative results, we pretrain each model using the ISIC dataset, although we do not calculate segmentation metrics for ISIC images due to their lack of dark-skinned images. Instead, we fine-tune the models using 5-fold cross-validation on a combination of the PH2, Dermofit and Waterloo datasets. Additionally, we assess out-of-sample performance by employing a leave-one-dataset-out approach, training models on two datasets and evaluating them on the left-out dataset.
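The leave-one-dataset-out protocol amounts to the following split generation, shown here as a small sketch. The generator name is hypothetical, and the sketch only enumerates dataset names; loading, training, and evaluation are omitted.

```python
def leave_one_dataset_out(dataset_names):
    """Yield (training_datasets, held_out_dataset) pairs, one per dataset."""
    for held_out in dataset_names:
        train = [name for name in dataset_names if name != held_out]
        yield train, held_out

splits = list(leave_one_dataset_out(["PH2", "Dermofit", "Waterloo"]))
```

Each dataset is held out exactly once, so every image receives one out-of-sample prediction from a model that never saw its source dataset during fine-tuning.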

The main segmentation performance metrics we use are the Dice Score Coefficient (DSC), Hausdorff distance (HD) and average symmetric surface distance (ASSD). DSC is a measure of both sensitivity and precision, and thus provides a comprehensive evaluation of the similarity between predicted and ground truth segmentation masks. On the other hand, HD and ASSD focus on boundary quality, with HD measuring the maximum distance between the predicted and ground truth boundary, while ASSD represents the average boundary distance. Therefore, these three metrics capture different problems in lesion boundary segmentation.
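For reference, a minimal NumPy sketch of the DSC for binary masks follows. The function name is an assumption, as is the convention of returning 1.0 when both masks are empty; HD and ASSD are not implemented here.

```python
import numpy as np

def dice_score(pred, truth) -> float:
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for two binary masks."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    denom = int(pred.sum()) + int(truth.sum())
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * int(np.logical_and(pred, truth).sum()) / denom

# Toy masks: the prediction covers the true pixel plus one false positive.
pred_mask = np.array([[1, 1], [0, 0]])
true_mask = np.array([[1, 0], [0, 0]])
dsc = dice_score(pred_mask, true_mask)
```

Here one pixel overlaps while the masks contain two and one foreground pixels, so the score is 2/3. Equivalently, DSC is the harmonic mean of precision (1/2) and sensitivity (1).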

To assess skin color bias, we use several proxies to quantify skin darkness. Firstly, we use the NN-based classification as a categorical variable, comparing the distributions of segmentation metrics between the light, medium and dark skin color categories. Secondly, we use the NN-based classifier’s predicted probability of the dark-skinned class (p(FP = V − VI)) as a continuous proxy for skin darkness, examining correlations between segmentation metrics and p(FP = V − VI). Similarly, we employ the predicted ITA as a continuous proxy to evaluate correlations with segmentation metrics.

To ensure normality for the ANOVA, Tukey’s honestly significant difference (HSD) and t-tests, we use log(1 − DSC), log(HD) and log(ASSD) for DSC, HD and ASSD, respectively. To address multiple testing, we use a significance threshold of p < 0.01 for all of these analyses.
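The transformation and the ANOVA F statistic can be sketched in plain Python as below. The function names are illustrative assumptions, the sketch assumes every DSC is strictly below 1, and it stops at the F statistic and its degrees of freedom; p values and Tukey's HSD would come from a statistics package.

```python
import math

def log_one_minus_dsc(dsc_values):
    """Apply the log(1 - DSC) transform; assumes every DSC < 1."""
    return [math.log(1.0 - d) for d in dsc_values]

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, means)
    )
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, means) for x in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy groups chosen so the result is easy to check by hand.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

For three groups the between-group degrees of freedom are 2, and the within-group degrees of freedom are the total number of images minus 3.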

Finally, we also manually labeled each image in the evaluation datasets as either belonging to dark-skinned (FP 5 or 6) or non-dark-skinned classes, and we then compare the distributions of these two classes. These labels were assigned by a computer scientist on an individual basis, ensuring conservative classification and minimizing the risk of false-positive dark-skinned images to avoid overestimating bias. Consequently, we adopt a significance threshold of p < 0.05 for tests on the manually classified groups.

Results

Datasets used for bias evaluation

We evaluate bias on three publicly available datasets:

  1. PH2, a set of 200 dermatoscopic images.

  2. The Waterloo dataset consisting of 191 photographs of skin lesions from two different databases.

  3. The Dermofit dataset consisting of 1300 dermatoscopic images with internal color standards.

Additionally, to pre-train the models we use ISIC 2018 challenge Task 1 data consisting of 3594 multi-source dermatoscopic images. However, due to the limited diversity in the ISIC dataset, we do not use it to evaluate skin color bias.

In Fig. 2, we present the distribution of predicted Individual Typology Angle (ITA) and the probability of belonging to the dark-skinned class using the k-means and NN models, respectively. Notably, the ISIC dataset demonstrates the least diversity among the four datasets, while the Waterloo and Dermofit databases exhibit higher diversity. Even in these datasets, though, there is a limited number of individuals classified as FP 5 or 6, necessitating the use of proxy values to evaluate bias in these groups.

The per-dataset distribution of predicted ITA by the k-means algorithm as well as predicted probability of the dark-skinned class (FP 5 or 6) by the neural network classifier.

In-sample bias quantification results

When using the NN-based classifier to evaluate in-sample bias, we observe a large difference in both the mean and standard deviation of DSC, HD and ASSD for light (FP 1 or 2), medium (FP 3 or 4) and dark (FP 5 or 6) individuals. This difference is presented in Table 1. A one-way ANOVA confirms a statistically significant difference between the mean DSC of the groups (F(2, 1755) = 18.842, p < 0.0001). Subsequent Tukey’s HSD test indicates that the mean DSC of dark-skinned individuals was significantly lower than that of light-skinned (p < 0.0001, 95% CI = [0.192, 0.479]) as well as medium-skinned (p < 0.0001, 95% CI = [0.158, 0.449]) individuals. In other words, in terms of DSC, the in-sample segmentation is worse for dark individuals than for other groups.

In-sample results for baseline lesion segmentation on different skin types as classified by the neural network. One-way ANOVA statistics and p values are shown for each metric.

  Skin Type   N      DSC             HD                ASSD
  ----------- ------ --------------- ----------------- ----------------
  Light       1587   0.904 ± 0.092   22.179 ± 17.285   6.190 ± 6.453
  Medium      166    0.878 ± 0.111   20.854 ± 18.684   6.494 ± 7.826
  Dark        5      0.744 ± 0.321   37.342 ± 39.039   10.814 ± 9.758
  ANOVA       -      p < 0.0001      p = 0.044         p = 0.182

Further evidence of bias arises when using the predicted probability of belonging to the dark-skinned class (p(FP = V − VI)) and the predicted ITA as two proxies for skin darkness. Using Spearman’s rank correlation, we find a significant negative correlation between DSC and p(FP = V − VI) (r(1756) = −0.320, p < 0.0001) as well as a positive correlation between DSC and ITA (r(1756) = 0.283, p < 0.0001). Higher ITA corresponds to lighter skin, so both of these results strongly indicate that segmentation is worse for individuals with darker skin.
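Spearman's rank correlation is the Pearson correlation of the ranks of the two variables. The following plain-Python sketch illustrates the computation with tied values receiving their average rank; the function names and toy numbers are assumptions, and no p value is computed.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy example: DSC falls monotonically as the dark-class probability rises.
p_dark = [0.1, 0.2, 0.9]
dsc = [0.95, 0.90, 0.60]
rho = spearman_rho(p_dark, dsc)
```

Because the relationship in the toy data is perfectly monotone decreasing, the coefficient is -1; the sign convention matches the negative correlation reported above.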

However, it is important to note that we do not find a significant in-sample difference between manually classified dark and non-dark groups. We deliberately included only high-certainty images in the dark class to reduce false positives, which might result in an underestimation of bias when using manual classification. As described later in this section, this test yields a statistically significant difference in the out-of-sample evaluation.

Out-of-sample bias quantification results

Given that segmentation models are often deployed in diverse settings and tasked with segmenting images outside their initial training domain, assessing out-of-sample performance becomes crucial in evaluating the fairness and safety of a model. To address this, we adopt a leave-one-dataset-out training procedure to comprehensively evaluate the model’s performance beyond its training data.

Out-of-sample results for baseline lesion segmentation on different skin types as classified by the neural network. One-way ANOVA statistics and p values are shown for each metric.

  Skin Type   N      DSC             HD                ASSD
  ----------- ------ --------------- ----------------- -----------------
  Light       1587   0.843 ± 0.150   30.141 ± 23.379   9.685 ± 10.122
  Medium      166    0.770 ± 0.197   33.772 ± 27.507   12.399 ± 13.450
  Dark        5      0.532 ± 0.293   65.052 ± 39.922   30.098 ± 19.174
  ANOVA       -      p < 0.0001      p = 0.031         p < 0.0001

Out-of-sample, the models are worse in terms of both segmentation metrics and bias, as is evident in Table 2. The ANOVA reveals significant differences between the three groups in DSC (F(2, 1755) = 25.810, p < 0.0001) as well as in ASSD (F(2, 1755) = 8.014, p < 0.0001), and these differences are notably larger than the in-sample differences. Subsequent testing using Tukey’s HSD confirms significant differences between the DSCs of all groups as well as between the ASSD of the light and dark groups. The results of the Tukey’s HSD test are presented in Table 3.

Results of the Tukey HSD test on out-of-sample evaluation. For brevity, only significant differences (p < 0.01) are shown.

  Metric   Skin Types        p          95% CI
  -------- ----------------- ---------- ------------------
  DSC      Light vs Dark     < 0.0001   [0.342, 0.916]
  DSC      Light vs Medium   < 0.0001   [0.062, 0.166]
  DSC      Medium vs Dark    < 0.0001   [0.224, 0.806]
  ASSD     Light vs Dark     0.003      [-2.365, -0.408]

Finally, a one-tailed independent samples Welch’s t-test indicates that the DSC scores of manually labeled dark-skinned subjects (M = 0.75, SD = 0.22) and non-dark-skinned subjects (M = 0.84, SD = 0.16) are statistically different (t(1756) = 2.14, p = 0.020). This can be seen in Table 4.
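For readers unfamiliar with Welch's test, the statistic and its approximate degrees of freedom can be sketched in plain Python as follows. The function name and toy samples are assumptions; the sketch does not compute a p value and does not reproduce the numbers reported above.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance test of mean(a) - mean(b).

    df uses the Welch-Satterthwaite approximation.
    """
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Toy samples with clearly different variances.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike the pooled-variance t-test, this form does not assume equal group variances, which matters when a small group is compared against a much larger one.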

As can be seen in Figure 3, the out-of-sample segmentation generally exhibits lower performance compared to in-sample segmentation. Moreover, this decrement is particularly pronounced for individuals with darker skin colors. Bias is significantly increased when the models are used on out-of-sample images, as would be the case in real-world scenarios, underscoring the importance of comprehensively assessing out-of-sample performance for a robust and fair evaluation of the model’s performance.

Evaluation of methods for skin color bias correction

Various methods, such as stratified sampling or utilizing the CIE-Lab colorspace, are commonly employed to reduce bias in lesion segmentation models. While these methods are intuitively expected to mitigate bias, their quantitative impact remains largely unexplored. To address this gap, we conduct a comparative study involving three different pre-processing procedures: (1) RGB images with minimal pre-processing (Baseline), (2) conversion to the CIE-Lab colorspace (CIE-Lab), and (3) stratified sampling to increase representation of under-represented classes.

The stratified sampling was implemented using the NN classification results. When each training batch is sampled, an image is selected from one of the three skin color classes (light, medium, dark) with a weight of 1/Nc, where Nc represents the number of samples in the training set belonging to that specific class.
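A minimal sketch of this weighted sampling, using only the Python standard library, is shown below. The helper names, the toy class sizes, and the use of `random.choices` (sampling with replacement) are assumptions for illustration, not a description of our data loader.

```python
import random
from collections import Counter

def sampling_weights(class_labels):
    """Per-image weight 1 / N_c, where N_c is the size of the image's class."""
    counts = Counter(class_labels)
    return [1.0 / counts[label] for label in class_labels]

def sample_batch(items, class_labels, batch_size, rng):
    """Draw a batch with replacement so each class is equally likely."""
    weights = sampling_weights(class_labels)
    return rng.choices(items, weights=weights, k=batch_size)

# Hypothetical training set: 90 light, 9 medium and 1 dark image.
labels = ["light"] * 90 + ["medium"] * 9 + ["dark"]
image_ids = list(range(len(labels)))
batch = sample_batch(image_ids, labels, 3000, random.Random(0))
drawn = Counter(labels[i] for i in batch)
```

Since the weights within each class sum to 1, every class is drawn with probability 1/3 regardless of its size, which means images from very small classes are repeated often.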

An independent samples t-test revealed a significant difference (t(1756) = 3.57, p = 0.0004) between the baseline DSCs (M = 0.835, SD = 0.16) and those obtained from the stratified CIE-Lab model (M = 0.815, SD = 0.17). Although both stratified sampling and CIE-Lab conversion slightly reduced the correlation between p(FP = V − VI) and DSC, the bias persisted without significant reduction, as can be seen in Figure 3.

A box plot of out-of-sample DSC and ASSD scores for groups binned by predicted probability of belonging to the dark-skinned class (p(FP = V − VI)) for two different pre-processing procedures.

A one-way ANOVA of the results of the model using stratification and CIE-Lab still revealed significant differences for DSC (F(2, 1755) = 15.605, p < 0.0001), HD (F(2, 1755) = 5.453, p = 0.004) and ASSD (F(2, 1755) = 6.334, p = 0.002) between the light, medium and dark skin color groups.

In addition, when examining subjects manually classified as dark (FP 5 or 6) or non-dark, a one-tailed Welch’s independent samples t-test indicated a significant out-of-sample difference between the groups for both the baseline and stratified CIE-Lab models, as presented in Table 4.

A comparison of segmentation results of subjects manually labeled as dark (nd = 23) and the rest of the dataset (nr = 1735). p values of a one-sided Welch’s t-test with 10,000 permutations are reported. Significant results (p < 0.05) are marked with an asterisk.

  Model                                      Skin Type   DSC            t      p
  ------------------------------------------ ----------- -------------- ------ --------
  Baseline, in-sample                        Dark        0.84 ± 0.22    1.31   0.0953
                                             Rest        0.90 ± 0.09
  CIE-Lab & strat. sampling, in-sample       Dark        0.85 ± 0.18    1.31   0.0968
                                             Rest        0.90 ± 0.09
  Baseline, out-of-sample                    Dark        0.75 ± 0.22*   2.14   0.0200
                                             Rest        0.84 ± 0.16*
  CIE-Lab & strat. sampling, out-of-sample   Dark        0.73 ± 0.24*   2.13   0.0225
                                             Rest        0.82 ± 0.17*

Qualitative assessment

Examples of out-of-sample predictions on manually labeled dark-skinned subjects are presented in Figure 4. Qualitatively, the predicted lesion border follows the ground truth border better on light-skinned subjects. On dark-skinned subjects, the border exhibits both false positives and false negatives.

Examples of predictions of the out-of-sample baseline RGB model. The top two rows (A-H) show subjects manually labeled as dark-skinned, while the bottom two rows (I-P) show non-dark-skinned subjects.

A significant challenge for the model lies in accurately identifying depigmented regions, as evident in Figure 4(D) and 4(E). The predicted area in Figure 4(D) contains a sizable area that appears depigmented but was not labeled as such by the dermatologist. Conversely, the ground truth border of Figure 4(E) includes the depigmented area; however, the predicted border encompasses an even larger area of surrounding skin. Depigmented areas with gradual borders pose difficulties for both manual labelers and fully automatic models in defining precise borders. Consequently, these examples might inadvertently influence the model to incorporate surrounding skin around the lesion, even when no depigmentation occurs, as observed in Figure 4(H). Although this issue is also present in light-skinned subjects, the underrepresentation of dark-skinned subjects accentuates its prominence in these images.

To address this issue, a potential approach is to adopt more descriptive labels. Instead of solely labeling binary lesion/non-lesion areas, incorporating labels for differently colored areas such as white globules, yellow or orange areas, black lacunae, blue-gray areas, as well as structures like hypopigmented areas, structureless areas, and blue-white veils would be valuable. By providing more detailed labels, the model can learn border contrasts, shapes, smoothness, and other relevant features for each distinct structure. This, in turn, would enhance the accuracy of segmentation for areas such as depigmented skin.

Furthermore, considering the challenges posed by gradual lesion borders, it may be beneficial to incorporate affordances for varying degrees of lesion border contrast. The inherent difficulty in precisely defining gradual borders suggests that introducing fuzzy labels could facilitate the development of new models that account for the intrinsic uncertainty in these cases. By allowing for fuzzy labels, the models can better capture the ambiguity associated with gradual borders and make more nuanced predictions.

These approaches could also be extended to post-processing techniques. Instead of solely binarizing predictions into lesion and non-lesion areas, post-processing methods could be employed to treat smooth borders probabilistically. This would lead to more refined and reliable segmentation results, particularly for cases with ambiguous or gradual lesion borders.

Lastly, the existing datasets have been compiled from various sources and annotated by different dermatologists without standardized guidelines. Consequently, there are instances where depigmented areas and surrounding lesion tissue are inconsistently incorporated into the lesion area. Additionally, there is variability in annotation methods even within one dataset, with some images manually labeled using polygons while others employ semi-automatic pixel-level labels. All of these issues lead to a large degree of intra- and inter-observer variability within the datasets.

These inconsistencies present challenges in generating consistent predictions and undermine the reliability of the evaluation procedures. To address these issues, future datasets should adhere to specific guidelines and establish consistent criteria for lesion area inclusion. By implementing standardized annotation protocols, we can enhance the consistency and reliability of data, leading to more accurate and robust skin lesion analysis models.

Conclusions and suggestions

We have estimated skin color from dermatological images using several methods: a neural network classifier, traditional image processing with k-means clustering, and manual classification. In all cases except the in-sample comparison of manually labeled groups, we have found significant bias in skin lesion segmentation against darker-skinned individuals, both in- and out-of-sample. Given our use of commonly employed neural network architectures and publicly available datasets, these findings suggest that such bias is likely pervasive in published lesion segmentation methods.

It is important to acknowledge a limitation of our study, namely the utilization of algorithmic estimates of skin color from relatively small skin areas, which inherently introduces error. However, we have deliberately erred on the side of underestimating bias in order to ensure the robustness of our results, which consistently reveal the presence of bias even when accounting for errors in predicted skin colors.

One of the key reasons for this bias is the lack of diverse publicly available datasets. We have shown that dark skin is severely underrepresented in widely used datasets including ISIC 2018, PH2, Dermofit and Waterloo. Furthermore, the absence of information regarding skin color, race, or ethnicity within these datasets poses challenges for tracking and evaluating fairness, inadvertently incentivizing mean results as the primary focus.

Additionally, the bias observed may be attributed to characteristics inherent in the images and labels themselves. Most lesions are harder to segment in dark-skinned individuals due to lower contrast between the lesion and the surrounding tissue. This could lead to noisier segmentation labels as well as more challenging automatic segmentation.

Despite employing stratified sampling and the CIE-Lab colorspace, our attempts to mitigate bias did not yield significant improvements. It is noteworthy that the mean in-sample results did show some improvement with the utilization of CIE-Lab and stratified sampling; however, these changes did not translate into reduced bias, and in fact, the mean results worsened when evaluated out-of-sample. This highlights the importance of not solely relying on reporting the mean in-sample result for lesion segmentation, as it may not accurately reflect the model’s performance in real-world scenarios.

Current lesion segmentation models, while producing impressive mean results, are inadequate for practical deployment in real-world scenarios involving diverse patient populations. We summarize our results in the following suggestions for future dataset curation and lesion border segmentation research:

By addressing these issues, we can strive for more equitable and accurate lesion segmentation for all patient populations.