A Method for Detecting Skin Cancer Disease Based on Deep Learning in Dermoscopic Images
1. Introduction
Skin cancer is one of the most common cancers worldwide. It can be caused by a number of factors, including exposure to the sun, artificial UV rays (such as those from tanning booths), exposure to hazardous substances, a family history of skin cancer, fair skin, or a large number of moles. People with a history of sunburns or immunosuppressive treatments may also be at increased risk [1]. According to the WHO (World Health Organization), the number of new cases of melanoma, one of the most serious forms of skin cancer, has increased by 5% per year over the last 50 years, faster than for any other cancer. If diagnosed early in the course of the lesion, the relative 5-year survival rate is 88% for localized stages. In contrast, the 5-year survival of patients with advanced melanoma and metastases is less than 20% [2]. Detecting skin cancer at an early stage, when relatively few signs of malignancy are visible, is therefore highly desirable. The diagnosis and detection of skin cancers are traditionally carried out by manual screening and visual inspection. However, this remains a difficult task, as lesions of different classes bear a strong visual resemblance to one another, with similar colors, shapes, and sizes, which can lead to misinterpretation of their characteristics [3] [4]. Dermatologists face difficulties in manual screening, which makes automatic computerized diagnostic systems indispensable for analyzing skin lesions and enabling dermatologists to make faster and more accurate diagnoses. Compared with approaches in the literature, we chose the EfficientNet-B0 model for several reasons. Firstly, EfficientNet-B0 is designed to use fewer parameters and computations, making it well suited for deployment on resource-constrained devices. Compared with other popular transfer learning networks such as ResNet-50, Inception-v4, and MobileNet-v2, EfficientNet-B0 shows excellent performance on various benchmarks, including dermoscopic image classification tasks. The remainder of this paper is organized as follows: Section 2 describes related work, and Section 3 describes the proposed method and related preprocessing steps. Section 4 presents the results of the study, and Section 5 presents the conclusions and suggestions for further research.
2. Related Work
For decades, a large part of the scientific community has been looking for solutions to reduce the major public health problem of skin cancer. In the literature, numerous studies have been conducted on the design of automatic systems for the analysis and classification of pigmented skin lesions. In the context of our work, we focus on those that implement a deep learning approach.
In 2014, Ramezani et al. [5] proposed a method based on principal component analysis (PCA), in which lesions are classified as malignant or benign using a support vector machine (SVM) classifier. For their work, the authors used a set of 282 macroscopic images of pigmented skin lesions collected from various online dermatology atlases, such as DermNet, DermIS, and DermQuest. To improve performance, the authors applied a morphological bottom-hat transformation followed by a morphological opening to remove thick hairs from the images. This method achieved an accuracy of 82.2%, a sensitivity of 77%, and a specificity of 86.93%.
In a similar vein, Sagar and Saini [6] proposed a simple and effective method for the automatic segmentation of clinical images captured by mobile cameras, using color space analysis and a binary thresholding algorithm. Hair was eliminated by morphological closing operations, with hair pixels interpolated from neighboring pixels. The efficiency of the overall segmentation process was ensured by computing similarity matrices for the lesions segmented from each color channel used. By selecting the preferred color channel, the framework can differentiate and extract cancerous lesions from the background skin with an accuracy of approximately 94%.
In 2016, Majtner et al. [7] used an approach that combined two different classifiers. The first classifier is an SVM with a Gaussian kernel and standardized predictors; the method combines RSurf and LBP features to estimate the class of a given input image, with an input feature vector of 2768 components. Each image class label is predicted with a score representing the a posteriori probability that an observation belongs to a particular class given the data [8]. The second classifier is the original version of AlexNet combined with a nonlinear SVM. AlexNet contains five convolutional layers (CONV) and three fully connected layers (FC), each followed by a pointwise nonlinear activation layer of rectified linear units (ReLU). A local response normalization (LRN) layer follows each of the first two convolutional layers. AlexNet has three max pooling (MAX) layers: two after the LRN layers, and the third after the fifth convolutional layer. The core of AlexNet lies between the second and third max pooling layers and contains three convolutional layers, each with 3 × 3 convolution kernels. With this model, they achieved an accuracy of 82.6%, a sensitivity of 50.3%, and a specificity of 89.8% [8].
In their study, Lopez et al. [9] selected the VGG16 architecture, based on VGGNet, because it has been shown to generalize well to other datasets. The network's input layer expects a 224 × 224 pixel RGB image. The input image passes through five convolutional blocks consisting of convolutional filters with a 3 × 3 receptive field. Each block performs a 2D convolution (the number of filters changes between blocks); all hidden layers use a ReLU (Rectified Linear Unit) activation function, and each block includes spatial pooling via a max pooling layer. The network ends with a classifier block composed of three fully connected layers. The authors used a publicly available set of dermoscopic images of skin lesions (the ISIC Archive dataset), which contains 1279 high-quality images of skin lesions collected from 10 different classes. With this model, they obtained an accuracy of 68.67%, a sensitivity of 33.11%, and a specificity of 49.5%. In 2019, Abbas et al. implemented a method that fuses visual feature extraction with feature extraction by an autoencoder, then uses an RNN model to optimize feature selection and an SVM to classify skin lesions based on the fused features; the method achieved an overall accuracy of 95%.
The method proposed by Uckuner et al. [10] focused on skin cancer detection and classification using a deep learning approach. The methods described in that paper include screening of the MNIST: HAM10000 dataset, which consists of seven different types of skin lesions with a sample size of 10,015 images, and the PH2 dataset, which contains 200 images of skin lesions. The pipeline includes data augmentation, and models are trained using deep learning architectures such as MobileNet and VGG-16 [11]. Accuracy was 81.52% with MobileNet and 80.07% with VGG-16. In 2022, W. Gouda et al. [12] proposed a method based on the use of ESRGAN as a preprocessing step. They used a CNN to detect the two main types of tumor, malignant and benign, on the ISIC 2018 dataset. The proposed method achieved an accuracy of 83.2%, comparable to the accuracy of pre-trained models on the same database: ResNet50 (83.7%), InceptionV3 (85.8%), and Inception-ResNet (84%).
In 2023, Gururaj et al. [8] introduced a method for skin cancer classification on the HAM10000 dataset based on data preprocessing techniques, such as oversampling and undersampling, hair removal (DullRazor), and segmentation using an autoencoder-decoder, together with transfer learning techniques such as DenseNet169 and ResNet50. The method achieved an accuracy of 83% using the oversampling technique with ResNet50, and 91.20% using the undersampling technique with DenseNet169. More recently, Bazgir et al. [13] established a deep neural network model that can automatically classify several types of skin cancer as melanoma or non-melanoma with a significant level of accuracy. They proposed an optimized Inception architecture in which the InceptionNet model was enhanced with augmented data and additional base layers; the proposed InceptionNet provided accuracies of 84.39% and 85.94% for the Adam and Nadam optimizers, respectively. In the same vein, Rahman et al. (2024) [14] proposed a DCNN-based model capable of automatically classifying different types of skin cancer, for which they proposed an optimized NASNet architecture in which the NASNet model is enhanced with additional data and an extra base layer added to the CNN. The proposed strategy improves the model's ability to handle incomplete and inconsistent data. The optimized NASNet Mobile and NASNet Large provide accuracies of 85.62% and 83.62%, respectively, with the Adam optimizer.
3. Methodology
In this study, we propose a method based on transfer learning with the EfficientNet-B0 model, whose structure we modified for the detection of skin cancer so that it makes the most of the information contained in the images to extract features for better classification. Our research was based on the ISIC-2019 dermoscopic image dataset. We aim to help physicians diagnose skin cancer from the dermoscopic images of an individual.
The modification made to EfficientNet-B0 is the progressive unfreezing of layers. This modification prevents the model from over-specializing too quickly in the target task, which could impair its generalization capabilities. It also helps the model converge more quickly and efficiently toward an optimal solution, as it avoids large, sudden parameter updates during the initial training steps.
3.1. Data Set and Analysis
The ISIC-2019 archive is a large, constantly expanding, open-access database of skin images (Figure 1). It serves as a public resource for teaching, researching, developing, and testing artificial intelligence algorithms for diagnosis. The dataset contains 2368 images of malignant and benign oncological diseases. These images were obtained from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, University of Queensland, and University of Athens Medical School. The dataset was pre-partitioned into two folders: a train folder containing 2250 images and a test folder containing 118 images. The features of each image are as follows:
• Resolution: 1024 × 1024 pixels (width × height);
• Each pixel is represented by 3 numerical values in the range [0, 255], corresponding respectively to the intensities of the fundamental RGB colors: red, green, and blue;
• Each image was labeled according to 9 different skin lesion classes;
• Extension: jpg.
Figure 1. The different skin cancer classes present in the data set.
We examined the distribution of images in each class for our data and noted an over-representation of the “benign pigmented keratosis”, “melanoma” and “basal cell carcinoma” classes (Figure 2).
Figure 2. Data distribution by class.
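As an illustration of this analysis step, the short sketch below counts the images available per class. It assumes, hypothetically, that the training images are organized in one subfolder per class under a `data/train/` directory; the folder names are placeholders, not the dataset's actual layout.

```python
# Minimal sketch: count images per class to inspect the class
# distribution shown in Figure 2. Assumes (hypothetically) one
# subfolder per class under "data/train/".
from pathlib import Path

train_dir = Path("data/train")  # assumed folder layout

counts = {
    class_dir.name: sum(1 for _ in class_dir.glob("*.jpg"))
    for class_dir in sorted(train_dir.iterdir())
    if class_dir.is_dir()
}

# Print classes from most to least represented.
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:35s} {n:5d}")
```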
3.2. Morphological Closing Operation and Application of the Gaussian Blur Filter
Morphological operators are essential tools in image processing. They allow images to be transformed and features, objects, or measurements to be extracted through an analysis that combines the properties of the objects and their context. In this study, we perform a morphological closing operation to remove redundant information from the images, such as hairs and vessel strands. Morphological erosion and dilation have the disadvantage of strongly modifying the size of structures in the image. To reduce this effect, they are often used in combination, such as dilation followed by erosion.
Suppose $X$ is an image, i.e. a set of pixels. For a structuring element $B$, the dilation of $X$ by $B$ is the set obtained by replacing each pixel $x$ of $X$ by its window $B_x$:

$$X \oplus B = \bigcup_{x \in X} B_x \qquad (1)$$

Let $X$ be an image and $B$ a structuring element. The erosion of $X$ by $B$ is the set of pixels $x$ such that the window $B_x$ is included in $X$:

$$X \ominus B = \{\, x \mid B_x \subseteq X \,\} \qquad (2)$$

Closing by $B$ is therefore the composition of dilation by $B$ followed by erosion by $B$:

$$X \bullet B = (X \oplus B) \ominus B \qquad (3)$$
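To make Equations (1)-(3) concrete, the sketch below applies them to a toy binary image with `scipy.ndimage`. The 3 × 3 all-ones array plays the role of the structuring element $B$; the array `X` and the element size are illustrative choices, not the paper's settings.

```python
# Illustrative sketch of Equations (1)-(3) on a toy binary image.
import numpy as np
from scipy import ndimage

X = np.array([[0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 0, 0, 0]], dtype=bool)
B = np.ones((3, 3), dtype=bool)  # structuring element

dilated = ndimage.binary_dilation(X, structure=B)  # Eq. (1)
eroded = ndimage.binary_erosion(X, structure=B)    # Eq. (2)
closed = ndimage.binary_erosion(                   # Eq. (3): dilation
    ndimage.binary_dilation(X, structure=B),       # followed by erosion
    structure=B)

# binary_closing computes the same composition in one call:
assert np.array_equal(closed, ndimage.binary_closing(X, structure=B))
```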
The morphological closing operator slightly blurred the images (Figure 3). To address this issue, we employed a Gaussian filter. A Gaussian filter is a low-pass filter used to reduce noise (high-frequency components). It is implemented as an odd-sized symmetric kernel that is passed over each pixel of the region of interest to achieve the desired effect. Pixels located toward the center of the kernel have more weight in the final value than those on the periphery. The kernel can be considered a discrete approximation of a Gaussian function. The formula for a two-dimensional Gaussian function is:
$$G(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (4)$$

where:
• $x$ is the horizontal coordinate of the pixel;
• $y$ is the vertical coordinate of the pixel;
• $\sigma$ is the standard deviation;
• $\pi$ is the mathematical constant pi (≈ 3.14).
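The sketch below samples Equation (4) to build such a kernel; the kernel size (5) and $\sigma = 1.0$ are illustrative assumptions. The kernel is normalized so its weights sum to 1, which preserves overall image brightness when filtering.

```python
# Sketch: build a discrete kernel from the 2D Gaussian of Eq. (4).
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Odd-sized symmetric kernel sampled from G(x, y)."""
    assert size % 2 == 1, "kernel size must be odd"
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return g / g.sum()  # normalize so brightness is preserved

kernel = gaussian_kernel(5, sigma=1.0)
# The center pixel carries the largest weight, as described above:
print(kernel.round(4))
```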
Figure 3. Rendering of the morphological closing operation and Gaussian blur filtering.
The morphological closing operations and the application of a Gaussian filter to our images yielded the results shown in Figure 4, which enabled us to reduce unwanted artifacts. For each image in our initial database, the process consists of taking a color image containing unwanted artifacts, transforming it into a binary image, successively applying the morphological operations of dilation and erosion (morphological closing), and finally transforming the binary image back into a new color image from which the artifacts initially present have been eliminated. We then apply a Gaussian filter to correct the quality of the resulting image, thereby avoiding any significant loss of information and obtaining a new image to be used by our model.
Figure 4. Morphological closure method and application of the Gaussian Filter.
ALGORITHM 1: Pseudocode for morphological dilation
Data: original image of fixed size, structuring element of fixed size
Output: dilated image
For each pixel of the original image do:
    If at least one pixel of the structuring element centered on this pixel lies inside the object in the original image, then:
        Dilated image[pixel] = 1
    Else:
        Dilated image[pixel] = 0
    End if
Return dilated image
ALGORITHM 2: Pseudocode for morphological closing
Data: original image of fixed size, structuring element of fixed size
Output: closed image
Compute the dilated image using Algorithm 1, then:
For each pixel of the dilated image do:
    If all pixels of the structuring element centered on this pixel lie inside the object in the dilated image, then:
        Closed image[pixel] = 1
    Else:
        Closed image[pixel] = 0
    End if
Return closed image
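A hedged OpenCV sketch of this artifact-removal step follows. The kernel shape and sizes are illustrative assumptions rather than the paper's exact settings, and the file name is hypothetical; `cv2.morphologyEx` with `MORPH_CLOSE` performs the dilation-then-erosion composition of Algorithms 1 and 2 channel-wise on a color image.

```python
# Hedged sketch of the closing + Gaussian blur pipeline described above.
import cv2
import numpy as np

def remove_artifacts(path: str) -> np.ndarray:
    img = cv2.imread(path)  # BGR color image
    # Structuring element for the morphological closing (assumed size).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    # Closing = dilation followed by erosion (Algorithms 1 and 2).
    closed = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)
    # Gaussian filter to smooth the result, as in Figure 4.
    return cv2.GaussianBlur(closed, (5, 5), 0)

cleaned = remove_artifacts("lesion.jpg")  # hypothetical file name
cv2.imwrite("lesion_cleaned.jpg", cleaned)
```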
3.3. Class Balance and Data Augmentation
Data augmentation, as illustrated in Figure 5, is a technique that involves artificially manipulating or augmenting existing data to create new, diverse samples by rotating, flipping, zooming, or changing colors. This approach is particularly important in model training and machine learning, where augmented data can greatly enhance a model's ability to generalize from its training data to unseen data.
Figure 5. Data augmentation.
It is important to address the class imbalance detected in the data analysis. If we do not explicitly take action against it, the results will be sub-optimal because the network will be biased toward over-represented classes and will not have a chance to learn the distributions of under-represented classes. To solve this problem, we used a re-sampling method, more precisely an undersampling method, because we did not have a large database. This method reduces the number of observations of the majority class(es) to obtain a satisfactory ratio of the minority class to the majority class. Figure 6 illustrates the data distribution after class balancing.
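A minimal sketch of these two steps follows. The per-class folder layout, the cap value, and the augmentation parameters (flip, rotation, zoom) are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of class balancing by undersampling plus data augmentation.
import random
from pathlib import Path
import tensorflow as tf

def undersample(train_dir: str, cap: int) -> dict:
    """Keep at most `cap` randomly chosen images per class folder."""
    per_class = {}
    for class_dir in Path(train_dir).iterdir():
        if class_dir.is_dir():
            files = list(class_dir.glob("*.jpg"))
            random.shuffle(files)
            per_class[class_dir.name] = files[:cap]  # drop the excess
    return per_class

balanced = undersample("data/train", cap=250)  # assumed paths/values

# Augmentation pipeline: random flips, rotations, zooms (Figure 5).
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
```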
3.4. Cropping and Resizing
During the preprocessing phase, images are resized to adjust their spatial resolution [15]. In this case, we set the image resolution to 224 × 224 pixels, which facilitates image manipulation by the proposed model. Before resizing, each image is first cropped to eliminate certain parts; for example, we remove part of the background so that the lesion area becomes the main feature. Figure 7 shows the original image, the cropped image, and the resized image; a minimal code sketch of this step is given after Figure 7.
Figure 6. Data distribution after class balancing.
Figure 7. Cropping and resizing images.
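The sketch below illustrates the cropping and resizing step. The centered-crop strategy and the 10% margin are assumptions (the paper does not specify how the crop region is chosen); the 224 × 224 target size matches the resolution stated above.

```python
# Sketch: centered crop (assumed strategy) followed by resizing
# to the 224 x 224 input resolution used by the model.
import cv2
import numpy as np

def crop_and_resize(img: np.ndarray, margin: float = 0.1,
                    size: int = 224) -> np.ndarray:
    h, w = img.shape[:2]
    dy, dx = int(h * margin), int(w * margin)  # border to discard
    cropped = img[dy:h - dy, dx:w - dx]
    return cv2.resize(cropped, (size, size), interpolation=cv2.INTER_AREA)

resized = crop_and_resize(cv2.imread("lesion_cleaned.jpg"))  # hypothetical file
```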
3.5. EfficientNet-B0 Architecture
EfficientNet is a CNN architecture that employs compound scaling, and its primary objective is to improve accuracy [16]. This compound scaling method scales uniformly across all dimensions of width, depth, and resolution using a compound coefficient [17]. There are several EfficientNet models, including models B0-B7. The EfficientNet-B0 architecture is based on the components contained in MobileNetV2, namely the Mobile Inverted Bottleneck Conv (MBConv) with the addition of Squeeze and Excitation (SE) optimization [18] [19]. The use of MBConv and SE blocks has been demonstrated to increase accuracy with a minimum number of parameters, which enables them to be used on mobile devices [20]. Figure 8 illustrates the layout of the layers in EfficientNet-B0.
Figure 8. EfficientNet-B0 architecture.
3.6. Description of the Model and Illustration of the Proposed Method
In the proposed approach, we work with the EfficientNet-B0 model, on which we perform progressive layer unfreezing. Progressive layer unfreezing is a technique that consists of unlocking (or "unfreezing") the layers of the model progressively rather than unlocking everything immediately. The lower layers of EfficientNet-B0 were trained on a large set of general data to learn basic visual features. By first unfreezing only the upper layers, these basic features are left intact, and the focus is placed on learning features specific to the target task. In addition, gradual unfreezing prevents the model from overspecializing too quickly in the target task, which could impair its generalizability. By slowly unfreezing the layers, the model can adjust its parameters gradually without risking the loss of transferred knowledge. Unfreezing only certain layers also serves as a form of regularization, which limits the model's ability to quickly adapt to specific training data. This method helps the model converge more quickly and efficiently toward an optimal solution because it avoids sudden and extensive parameter updates during the early stages of training.
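A hedged sketch of such an unfreezing schedule is given below. The number of layers unfrozen per stage is an illustrative assumption (the paper does not publish its exact schedule); in practice, one would train for a few epochs and recompile the model between stages so the new trainable flags take effect.

```python
# Hedged sketch of progressive layer unfreezing on EfficientNet-B0.
import tensorflow as tf

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # start with the whole backbone frozen

def unfreeze_top(base_model: tf.keras.Model, n_layers: int) -> None:
    """Unfreeze the last `n_layers` layers, keeping earlier ones frozen."""
    for layer in base_model.layers[-n_layers:]:
        # BatchNorm layers are commonly kept frozen during fine-tuning.
        if not isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.trainable = True

# Example schedule (assumed): unfreeze progressively more of the
# backbone at each stage, training a few epochs between stages.
for n in (20, 60, len(base.layers)):
    unfreeze_top(base, n)
```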
Therefore, the proposed prediction model employs a modified EfficientNet-B0 feature extractor, to which we attach a global pooling layer, a dropout layer, and a flatten layer. For classification, the extractor is linked to two fully connected layers: the first contains 180 neurons and the last contains 9; a softmax function is used as the classifier because this is a multi-class classification problem. Figure 9 shows a simplified description of the architecture of our model, while Figure 10 illustrates the methodology proposed in this work.
Figure 9. Simplified illustration of the architecture of our model.
Figure 10. Illustration of proposed method.
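For concreteness, a sketch of the classification head described above follows: EfficientNet-B0 as feature extractor, then global pooling, dropout, flatten, a 180-neuron dense layer, and a 9-way softmax. The dropout rate and the ReLU activation on the 180-neuron layer are assumed values not stated in the text.

```python
# Sketch of the proposed model head on top of EfficientNet-B0.
import tensorflow as tf

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.3)(x)   # rate assumed
x = tf.keras.layers.Flatten()(x)      # no-op after global pooling;
                                      # kept to mirror the description
x = tf.keras.layers.Dense(180, activation="relu")(x)  # activation assumed
outputs = tf.keras.layers.Dense(9, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```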
4. Experimental Results and Analysis
This research used dermoscopic skin images and classified them into nine different types of skin cancer: actinic keratosis, basal cell carcinoma, dermatofibroma, melanoma, mole, benign pigmented keratosis, seborrheic keratosis, squamous cell carcinoma, and vascular lesion. We analyzed the performance of the proposed method in terms of classification accuracy and the area under the receiver operating characteristic curve (AUC-ROC) using the ISIC-2019 dataset. The receiver operating characteristic (ROC) curve is a probability curve that plots the true positive rate (TPR) against the false positive rate (FPR) for different threshold values. The area under the curve (AUC) represents the degree or measure of separability, indicating how well the model can distinguish between classes.
Model evaluation was performed using the following metrics:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (6)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (7)$$
The confusion matrix contains the numbers of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP).
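The sketch below shows how these quantities and Equations (5)-(7) can be computed from model predictions with scikit-learn; the arrays are small illustrative placeholders, not our experimental outputs.

```python
# Sketch: confusion matrix, Eqs. (5)-(7), and multi-class AUC.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 2, 1, 0, 2, 1])   # true class indices (toy data)
y_prob = np.random.rand(6, 3)            # per-class scores (toy data)
y_prob /= y_prob.sum(axis=1, keepdims=True)
y_pred = y_prob.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred)
# Per-class TP/FN/FP/TN in the one-vs-rest sense:
TP = np.diag(cm)
FN = cm.sum(axis=1) - TP
FP = cm.sum(axis=0) - TP
TN = cm.sum() - (TP + FN + FP)

accuracy = (TP + TN) / (TP + TN + FP + FN)   # Eq. (5), per class
sensitivity = TP / (TP + FN)                 # Eq. (6)
specificity = TN / (TN + FP)                 # Eq. (7)
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")
```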
We trained our model over 20 epochs using stochastic gradient descent with momentum as the optimizer, at a learning rate of 0.001. Following the data augmentation and class rebalancing preprocessing, we increased the training data from 2250 to 9000 images and retained 7200 images for training (80% of the total) and 1800 images for validation (20% of the total). Note that the test set is unchanged.
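A hedged sketch of this training configuration follows, reusing the `model` built in the earlier sketch. The momentum value (0.9), batch size, and directory name are assumptions; the learning rate (0.001), the 80/20 split, and the 20 epochs match the values stated above.

```python
# Sketch of the training setup: SGD with momentum, lr=0.001, 20 epochs.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train_balanced",  # assumed folder of preprocessed images
    image_size=(224, 224), batch_size=32, label_mode="categorical",
    validation_split=0.2, subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train_balanced",
    image_size=(224, 224), batch_size=32, label_mode="categorical",
    validation_split=0.2, subset="validation", seed=42)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(train_ds, validation_data=val_ds, epochs=20)
```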
Figure 11 shows the results of training our model, in particular its accuracy and loss over the training epochs. We obtained a total accuracy of 99.76% on the training data with a loss value of 0.01052, and 88% accuracy on the test data with a loss value of 0.613939, with an AUC above 80% for 6 of the 9 classes (Figure 12 and Figure 13). Indeed, an excellent model has an AUC close to 100.0%, meaning it has a good measure of separability; a poor model has an AUC close to 0.0%, meaning it has the worst measure of separability; and an AUC close to 0.5 means that the model classifies at random, i.e. it has no class separation capability.
Figure 11. Model learning curve: accuracy and losses as a function of epochs.
Figure 12. Confusion matrix and model evaluation and classification report results for each class.
Figure 13. Different AUCs of the model for different classes.
Table 1 and Table 2 present an in-depth comparative study of our method against the different approaches from the literature.
For our work, we used a laptop with the following specifications:
• Brand: HP EliteBook mt40;
• Processor: Intel Celeron CPU N3060 Dual Core;
• Processor frequency: 1.6 GHz;
• Memory: 6 GB;
• Hard disk: 500 GB;
• Graphics card: Intel HD Graphics;
• Total graphics memory: 2132 MB;
• Dedicated video memory: 128 MB.
However, to minimize learning times, the CNN model must be trained on machines with very high capacities in terms of RAM and computing units. Therefore, we used the Google Colab platform, which provides a virtual machine with the following features:
• RAM: 334.56 GB;
• Disk: 225.33 GB.
The training time was 4 hours 36 minutes.
Table 1. Comparison with the literature.

| Method | Model | Preprocessing | Dataset | Accuracy | Specificity | Sensitivity | Classification type |
|---|---|---|---|---|---|---|---|
| Our method | EfficientNet-B0, modified mainly by gradual unfreezing of layers | Morphological closing + Gaussian blur filter; class balancing by undersampling; data augmentation; cropping + resizing | ISIC-2019 (2368 images) | 88% | 0.88 | 0.88 | Multiclass (9 classes) |
| Romero Lopez et al. [9] | VGG16 based on the VGGNet architecture | K-fold cross-validation technique; image cropping + resizing; morphological closing | ISIC Archive (1279 images) | 68.67% | 0.3311 | 0.495 | Multiclass (9 classes) |
| Majtner et al. [7] | SVM with Gaussian kernel and standardized predictors + AlexNet combined with a nonlinear SVM | RGB-to-grayscale transformation; RSurf features; Gaussian filter to remove image noise; undersampling to balance classes | ISIC Archive | 82.6% | 0.503 | 0.898 | Binary |
| Gouda et al. [12] | CNN inspired by the ResNet50 and InceptionV3 architectures | ESRGAN for image enhancement and retouching; data enhanced, normalized, and resized | ISIC-2018 (3533 images) | 83.2% | 0.7818 | 0.7606 | Binary |
Table 2. Comparison with the literature (continued).

| Method | Model | Preprocessing | Dataset | Accuracy | Specificity | Sensitivity | Classification type |
|---|---|---|---|---|---|---|---|
| Y. Filali et al. [11] | MobileNet; VGG-16 | Decomposition of the image into two components using a partial differential equation (PDE); image segmentation | HAM10000 | 81.52% (MobileNet); 80.07% (VGG-16) | 0.7320; 0.6309 | 0.7689; 0.6843 | Multiclass (7 classes) |
| Bazgir et al. [13] | InceptionNet with Adam optimizer; InceptionNet with Nadam optimizer | Gaussian filtering of images; data augmentation; resizing | ISIC Archive (2637 images) | 84.39% (Adam); 85.94% (Nadam) | 0.8669; 0.8689 | 0.8179; 0.8198 | Binary |
5. Conclusion
This paper analyzes skin cancer images and proposes solutions based on image processing and deep learning techniques for the skin cancer image classification problem. We proposed a transfer learning model based on EfficientNet-B0, which we refined for the specific needs of this study. We used a dermoscopic image database provided by the International Skin Imaging Collaboration (ISIC-2019), on which we performed preprocessing to place the images in a more exploitable form, thereby enabling the model to perform well. For evaluation purposes, we derived various characteristics of the classifier; our aim in designing it was to maximize the accuracy, sensitivity, and specificity of the classification. As suggested by the results in Table 1, we obtained results comparable to the state-of-the-art methods presented in the literature review. In the future, when higher accuracy is required, we may consider using more advanced architectures to achieve enhanced performance.