DFSNet-VLM: a hybrid frequency-aware and vision-language framework for remote sensing scene classification and semantic image explanation

Muhammad John Abbas; Muhammad Attique Khan; Ameer Hamza; Shrooq Alsenan; Areej Alasiry; Mehrez Marzougui; Jungpil Shin; Yunyoung Nam

doi:10.1109/JSTARS.2026.3677131

Title	DFSNet-VLM: a hybrid frequency-aware and vision-language framework for remote sensing scene classification and semantic image explanation
Authors	Abbas, Muhammad John ; Khan, Muhammad Attique ; Hamza, Ameer ; Alsenan, Shrooq ; Alasiry, Areej ; Marzougui, Mehrez ; Shin, Jungpil ; Nam, Yunyoung
DOI	10.1109/JSTARS.2026.3677131
Is Part of	IEEE Journal of selected topics in applied earth observations and remote sensing. Piscataway, NJ : IEEE. 2026, vol. 19, p. 12933-12955. ISSN 1939-1404. eISSN 2151-1535
Keywords [eng]	deep learning ; explainable AI ; land cover ; land use ; Remote sensing ; spatial information ; vegetation ; VLM
Abstract [eng]	Remote sensing has long attracted researchers because of its role in Earth monitoring, which supports planning for agriculture, construction, reforestation, and climate change. Transformer architectures achieve strong performance in remote sensing image classification; however, they come with the trade-off of higher computational complexity. In this paper, we propose a novel deep learning framework, DFSNet-VLM (Cross Domain Fusion based Texture-Sensitive Dual Stream Network), for high-precision remote sensing scene understanding. The proposed framework includes a classification model, “DFSNet,” that improves feature representation by employing both spatial and frequency domain features, which help detect global patterns and textures alongside local features. The model also promotes information exchange between the two streams, so that each type of feature complements the other, by integrating cross-domain fusion blocks at multiple stages. Additionally, a pretrained VLM, “BLIP-2,” is integrated to provide semantic descriptions of classified images. Bayesian optimization is applied to fine-tune hyperparameters, reducing overfitting and improving model performance. The proposed model is evaluated on six diverse publicly available datasets and achieves improved accuracies of 97.13% on MLRSNet, 94.67% on NWPU-RESISC-45, 98.00% on EuroSAT, 92.25% on GeoSceneNet16k, 98.25% on cloud, and 96.03% on the Bijie-landslide dataset. Detailed ablation studies, comparative analysis, and Grad-CAM++-based model explainability demonstrate that the proposed model is generalizable and scalable, and that it achieves improved accuracy. In addition, the proposed model can be easily implemented in a real-time environment for diverse applications. Links to the trained model are available in the data availability section.
Published	Piscataway, NJ : IEEE
Type	Journal article
Language	English
Publication date	2026

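The abstract's pairing of spatial and frequency domain features can be illustrated with a toy example. This is a minimal sketch and not the authors' DFSNet implementation: the function names (`dft2`, `non_dc_energy_ratio`, `describe`), the 4x4 patches, and the choice of mean intensity and non-DC spectral energy as stand-in "features" are all assumptions made for illustration. It shows two patches that a simple spatial statistic cannot tell apart but a frequency statistic can, which is the kind of complementarity the abstract attributes to the two streams.

```python
import cmath


def dft2(img):
    """Naive 2D discrete Fourier transform of a small grayscale patch.

    `img` is a list of equal-length rows. This is O(N^2) per coefficient
    and only meant for tiny illustrative inputs.
    """
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    s += img[y][x] * cmath.exp(angle)
            out[u][v] = s
    return out


def non_dc_energy_ratio(img):
    """Share of spectral energy outside the DC (zero-frequency) term.

    Hypothetical texture cue: close to 0 for a flat patch, larger when
    the patch contains repeating structure.
    """
    spec = dft2(img)
    total = sum(abs(c) ** 2 for row in spec for c in row)
    if total == 0:
        return 0.0
    dc = abs(spec[0][0]) ** 2
    return (total - dc) / total


def describe(img):
    """Concatenate one spatial and one frequency statistic (toy 'fusion')."""
    h, w = len(img), len(img[0])
    mean_intensity = sum(sum(row) for row in img) / (h * w)
    return [mean_intensity, non_dc_energy_ratio(img)]


# Two patches with the same mean intensity but different texture.
flat = [[0.5] * 4 for _ in range(4)]
checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]

flat_features = describe(flat)
checker_features = describe(checker)
```

Here `flat_features` and `checker_features` share the same first entry (mean intensity) and differ only in the second (non-DC energy), so a classifier that sees both statistics can separate the patches where the spatial statistic alone cannot. A real dual-stream network would learn such features rather than hand-code them.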