Multimodal deep learning with attention-based fusion for skin cancer diagnosis

Wiem Abdelbaki; Hend Alshaya; Inzamam Mashood Nasir; Sara Tehsin; Wided Bouchelligua

doi:10.3390/bioengineering13050564

Title	Multimodal deep learning with attention-based fusion for skin cancer diagnosis
Authors	Abdelbaki, Wiem ; Alshaya, Hend ; Nasir, Inzamam Mashood ; Tehsin, Sara ; Bouchelligua, Wided
DOI	10.3390/bioengineering13050564
Is Part of	Bioengineering. Basel : MDPI. 2026, vol. 13, iss. 5, art. no. 564, p. 1-35. ISSN 2306-5354
Keywords [eng]	attention mechanisms ; clinical data fusion ; cross-dataset generalization ; dermoscopic image analysis ; multimodal deep learning ; skin cancer diagnosis
Abstract [eng]	Skin cancer diagnosis remains challenging because of the high variability introduced by differing imaging conditions in clinical settings. This paper proposes a multimodal deep learning framework that addresses this challenge by combining auxiliary clinical information with dermoscopic image features. The proposed architecture pairs an attention-based feature encoder with a structured multimodal fusion approach to exploit an integrated feature representation across all channels. The architecture was evaluated on a range of benchmark datasets, including ISIC 2019, ISIC 2020, and HAM10000, under a unified experimental protocol. The model achieved accuracies of 90.5%, 88.7%, and 91.8% and AUCs of 95.8%, 94.6%, and 96.3% on ISIC 2019, ISIC 2020, and HAM10000, respectively. Compared with the baseline models, ResNet50 and EfficientNet-B4, the approach increased the AUC by 6.5% and the F1 score by 8.0%. Furthermore, in cross-dataset evaluation, the model achieved an AUC of 90.9%, indicating strong generalization. In the ablation analysis, removing the key attention and multimodal fusion components decreased the AUC by 4.1%, confirming their effectiveness. With 34.7 million parameters and an average inference time of 19.3 ms, the model is efficient enough to deploy in a real clinical setting without sacrificing performance. Additionally, the improvements were statistically significant across all evaluation metrics (p = 0.01). Overall, the proposed multimodal framework demonstrates strong performance and robustness across multiple benchmark datasets.
Published	Basel : MDPI
Type	Journal article
Language	English
Publication date	2026
CC license
