| Abstract [eng] |
The diagnosis of skin cancer remains challenging because of the high variability introduced by varying imaging conditions in clinical settings. This paper proposes a multimodal deep learning framework that addresses this challenge by combining auxiliary clinical information with dermoscopic image features. The architecture uses an attention-based feature encoder with a structured multimodal fusion approach to build an integrated feature representation across all channels. The architecture was evaluated on a range of benchmark datasets, including ISIC 2019, ISIC 2020, and HAM10000, under a unified experimental protocol. The model achieved accuracies of 90.5%, 88.7%, and 91.8% and AUCs of 95.8%, 94.6%, and 96.3%, respectively, on these datasets. Compared with the baseline models, ResNet50 and EfficientNet-B4, our approach increased the AUC by 6.5% and the F1 score by 8.0%. Furthermore, in cross-dataset evaluation, the model achieved an AUC of 90.9%, suggesting strong generalization. In the ablation analysis, removing key components (the attention and multimodal fusion mechanisms) decreased the AUC by 4.1%, confirming their effectiveness. With 34.7 million parameters and an average inference time of 19.3 ms, the model is efficient enough to deploy in a real clinical setting without sacrificing performance. Additionally, the improvements were statistically significant across all evaluation metrics (p = 0.01). The proposed multimodal framework demonstrates strong performance and robustness across multiple benchmark datasets. |