1. School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
2. Jiangxi Province Key Laboratory of Maglev Rail Transit Equipment, Jiangxi University of Science and Technology, Ganzhou 341000, China
To address the difficulty of multi-scale feature extraction and the inaccurate segmentation of object edges in remote sensing images by existing methods, a new semantic segmentation algorithm was proposed. A dual encoder composed of a CNN and an Efficient Transformer was constructed to decouple context information from spatial information. A feature fusion module was proposed to strengthen the information interaction between the two encoders and to effectively fuse global context with local detail. A hierarchical Transformer structure was constructed to extract features at different scales, allowing the encoder to attend effectively to objects of different sizes. An edge thinning loss function was proposed to mitigate inaccurate segmentation of object edges. Experimental results showed that the proposed algorithm achieved a mean intersection over union (MIoU) of 72.45% and 82.29% on the ISPRS Vaihingen and ISPRS Potsdam datasets, respectively. On the SOTA, SIOR, and FAST subsets of the SAMRS dataset, the MIoU of the proposed algorithm was 88.81%, 97.29%, and 86.65%, respectively; both the overall accuracy (OA) and the MIoU were better than those of the comparison models. The proposed algorithm delivers good segmentation performance on various types of targets at different scales.
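To make the dual-encoder idea concrete, the sketch below pairs a small CNN branch (local spatial detail) with a Transformer branch on image patches (global context) and fuses the two feature maps with a simple channel-attention module before a light segmentation head. This is a minimal illustration only: the channel sizes, the patch size of 16, the fusion design, and the names FeatureFusionModule and DualEncoderSegmenter are assumptions for the example, not the paper's actual Efficient Transformer, feature fusion module, or edge thinning loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusionModule(nn.Module):
    """Hypothetical fusion: concatenate CNN and Transformer features,
    project with 1x1 conv, then re-weight channels with a squeeze-style gate."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn, f_trans):
        fused = self.proj(torch.cat([f_cnn, f_trans], dim=1))
        return fused * self.gate(fused)  # channel re-weighting, spatial layout kept


class DualEncoderSegmenter(nn.Module):
    """Toy dual-encoder segmenter: CNN branch at 1/4 resolution for detail,
    patch-token Transformer branch for context, fused before a 1x1 classifier."""

    def __init__(self, in_ch=3, channels=64, num_classes=6, patch=16):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.patch_embed = nn.Conv2d(in_ch, channels, kernel_size=patch, stride=patch)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=channels, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.fuse = FeatureFusionModule(channels)
        self.head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        f_cnn = self.cnn(x)                       # B x C x H/4 x W/4
        tokens = self.patch_embed(x)              # B x C x H/16 x W/16
        b, c, h, w = tokens.shape
        tokens = self.transformer(tokens.flatten(2).transpose(1, 2))  # B x N x C
        f_trans = tokens.transpose(1, 2).reshape(b, c, h, w)
        f_trans = F.interpolate(f_trans, size=f_cnn.shape[-2:],
                                mode="bilinear", align_corners=False)
        logits = self.head(self.fuse(f_cnn, f_trans))
        return F.interpolate(logits, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = DualEncoderSegmenter()
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 6, 256, 256])
```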
Zhenli ZHANG, Xinkai HU, Fan LI, Zhicheng FENG, Zhichao CHEN. Semantic segmentation algorithm for multiscale remote sensing images based on CNN and Efficient Transformer. Journal of Zhejiang University (Engineering Science), 2025, 59(4): 778-786.
Fig.1 Network structure of proposed semantic segmentation algorithm for remote sensing images
Fig.2 Structure of main encoder
Fig.3 Structure of efficient multi-head self-attention module
Fig.4 Structure of feature extraction layer
Fig.5 Structure of feature fusion module
Fig.6 Edge thinning loss function
| Model | IoU/% Impervious surface | IoU/% Building | IoU/% Low vegetation | IoU/% Tree | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 78.81 | 85.45 | 65.56 | 74.76 | 24.25 | 86.49 | 65.56 |
| DANet[6] | 77.80 | 84.81 | 63.55 | 68.33 | 36.05 | 84.99 | 66.11 |
| HRNet[24] | 78.35 | 82.72 | 63.21 | 75.94 | 38.49 | 86.17 | 67.74 |
| DeepLabV3[25] | 79.28 | 86.34 | 66.05 | 77.22 | 30.49 | 86.34 | 67.88 |
| Segformer[27] | 78.88 | 83.18 | 61.04 | 75.39 | 45.22 | 85.94 | 68.74 |
| UNet[26] | 77.38 | 83.85 | 61.04 | 75.05 | 34.03 | 84.43 | 66.27 |
| TransUNet[28] | 76.68 | 81.05 | 63.46 | 74.08 | 46.53 | 84.80 | 68.36 |
| SwinUNet[29] | 74.16 | 77.85 | 62.01 | 73.46 | 35.62 | 83.50 | 64.62 |
| Ours | 79.98 | 84.88 | 65.27 | 74.44 | 57.69 | 86.51 | 72.45 |

Tab.1 Comparison of segmentation results of different models on the ISPRS Vaihingen dataset
Fig.7 Visualization of segmentation results of different models on the ISPRS Vaihingen dataset
| Model | IoU/% Impervious surface | IoU/% Building | IoU/% Low vegetation | IoU/% Tree | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 76.31 | 83.23 | 64.65 | 66.03 | 68.78 | 86.04 | 71.80 |
| DANet[6] | 77.34 | 82.52 | 64.73 | 70.78 | 79.87 | 86.94 | 75.05 |
| HRNet[24] | 79.11 | 84.97 | 67.95 | 70.53 | 81.65 | 87.78 | 76.84 |
| DeepLabV3[25] | 78.90 | 85.23 | 68.68 | 70.91 | 83.17 | 87.73 | 77.38 |
| Segformer[27] | 79.96 | 86.70 | 69.72 | 65.21 | 77.64 | 87.09 | 75.85 |
| UNet[26] | 76.86 | 83.74 | 65.90 | 63.69 | 79.13 | 86.01 | 73.86 |
| TransUNet[28] | 79.79 | 86.13 | 68.94 | 66.30 | 78.63 | 86.41 | 75.96 |
| SwinUNet[29] | 73.01 | 76.29 | 61.74 | 54.27 | 68.88 | 80.49 | 66.83 |
| Ours | 86.05 | 92.60 | 74.93 | 73.68 | 84.17 | 90.50 | 82.29 |

Tab.2 Comparison of segmentation results of different models on the ISPRS Potsdam dataset
| Model | IoU/% Large vehicle | IoU/% Swimming pool | IoU/% Airplane | IoU/% Small vehicle | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 72.28 | 68.57 | 80.53 | 80.31 | 84.85 | 75.42 |
| DANet[6] | 70.54 | 77.65 | 72.14 | 71.24 | 82.14 | 72.89 |
| HRNet[24] | 77.61 | 79.78 | 83.28 | 83.12 | 85.45 | 75.16 |
| DeepLabV3[25] | 83.20 | 82.69 | 91.12 | 87.37 | 93.37 | 86.09 |
| Segformer[27] | 73.49 | 85.79 | 74.81 | 76.24 | 87.26 | 77.58 |
| UNet[26] | 75.61 | 74.34 | 80.37 | 83.08 | 87.75 | 78.35 |
| TransUNet[28] | 79.24 | 81.07 | 91.38 | 83.98 | 91.59 | 83.91 |
| SwinUNet[29] | 64.92 | 77.92 | 64.42 | 66.90 | 78.77 | 68.54 |
| Ours | 87.05 | 84.42 | 92.98 | 90.78 | 94.98 | 88.81 |

Tab.3 Comparison of segmentation results of different models on the SAMRS SOTA dataset
| Model | IoU/% Airplane | IoU/% Baseball field | IoU/% Ship | IoU/% Tennis court | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 82.29 | 95.61 | 95.41 | 95.59 | 96.97 | 92.22 |
| DANet[6] | 78.32 | 96.11 | 96.03 | 96.36 | 94.59 | 91.71 |
| HRNet[24] | 83.01 | 93.62 | 94.74 | 95.67 | 96.47 | 91.76 |
| DeepLabV3[25] | 90.34 | 96.10 | 97.69 | 95.34 | 97.83 | 94.87 |
| Segformer[27] | 73.65 | 94.10 | 95.19 | 91.43 | 92.65 | 88.59 |
| UNet[26] | 77.38 | 92.03 | 92.34 | 96.62 | 93.47 | 89.59 |
| TransUNet[28] | 92.76 | 96.45 | 97.18 | 97.48 | 97.88 | 95.97 |
| SwinUNet[29] | 80.88 | 95.43 | 92.65 | 93.85 | 94.49 | 90.70 |
| Ours | 94.38 | 98.65 | 97.77 | 98.38 | 98.93 | 97.29 |

Tab.4 Comparison of segmentation results of different models on the SAMRS SIOR dataset
| Model | IoU/% Baseball field | IoU/% Bridge | IoU/% Football field | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 80.21 | 94.17 | 90.26 | 63.76 | 90.63 | 82.10 |
| DANet[6] | 84.97 | 87.29 | 91.66 | 55.01 | 87.88 | 80.23 |
| HRNet[24] | 92.32 | 84.83 | 91.83 | 51.51 | 88.64 | 80.12 |
| DeepLabV3[25] | 93.57 | 95.75 | 96.98 | 54.97 | 92.89 | 85.32 |
| Segformer[27] | 87.95 | 93.87 | 94.48 | 42.71 | 86.77 | 79.75 |
| UNet[26] | 84.30 | 92.78 | 93.53 | 59.70 | 90.21 | 82.58 |
| TransUNet[28] | 94.46 | 85.84 | 95.66 | 60.33 | 91.83 | 84.07 |
| SwinUNet[29] | 87.83 | 90.05 | 95.29 | 43.72 | 87.49 | 79.22 |
| Ours | 93.79 | 92.91 | 95.98 | 63.93 | 93.45 | 86.65 |

Tab.5 Comparison of segmentation results of different models on the SAMRS FAST dataset
Fig.8 Ablation results of dual encoder structure
| Model | IoU/% Impervious surface | IoU/% Building | IoU/% Low vegetation | IoU/% Tree | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B | 76.87 | 81.57 | 62.45 | 73.06 | 48.27 | 84.70 | 68.44 |
| B+FFM | 77.29 | 82.06 | 62.51 | 73.74 | 51.05 | 85.01 | 69.33 |
| B+ETL | 77.56 | 83.04 | 63.49 | 74.29 | 44.58 | 85.40 | 68.59 |
| B+ETL+FFM | 79.98 | 84.88 | 65.27 | 74.44 | 57.69 | 86.51 | 72.45 |

Tab.6 Results of module ablation experiment on the ISPRS Vaihingen dataset
Fig.9 Ablation experiments of efficient multi-head self-attention
[1]
XIAO D, KANG Z, FU Y, et al Csswin-UNet: a Swin-UNet network for semantic segmentation of remote sensing images by aggregating contextual information and extracting spatial information[J]. International Journal of Remote Sensing, 2023, 44 (23): 7598- 7625
doi: 10.1080/01431161.2023.2285738
[2]
FENG Zhicheng, YANG Jie, CHEN Zhichao. Urban road network extraction method based on lightweight Transformer [J]. Journal of Zhejiang University: Engineering Science, 2024, 58(1): 40-49.
[3]
PAN T, ZUO R, WANG Z Geological mapping via convolutional neural network based on remote sensing and geochemical survey data in vegetation coverage areas[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 3485- 3494
doi: 10.1109/JSTARS.2023.3260584
[4]
JIA P, CHEN C, ZHANG D, et al Semantic segmentation of deep learning remote sensing images based on band combination principle: application in urban planning and land use[J]. Computer Communications, 2024, 217: 97- 106
doi: 10.1016/j.comcom.2024.01.032
[5]
ZHENG Z, ZHONG Y, WANG J, et al Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: from natural disasters to man-made disasters[J]. Remote Sensing of Environment, 2021, 265: 112636
doi: 10.1016/j.rse.2021.112636
[6]
FU J, LIU J, TIAN H, et al. Dual attention network for scene segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 3141–3149.
[7]
HU X, ZHANG P, ZHANG Q, et al GLSANet: global-local self-attention network for remote sensing image semantic segmentation[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 6000105
[8]
CHEN H, QIN Y, LIU X, et al An improved DeepLabv3+ lightweight network for remote-sensing image semantic segmentation[J]. Complex and Intelligent Systems, 2024, 10 (2): 2839- 2849
doi: 10.1007/s40747-023-01304-z
[9]
DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. (2021−06−03)[2024−05−20]. https://arxiv.org/pdf/2010.11929.
[10]
ZHENG S, LU J, ZHAO H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 6877–6886.
[11]
WANG L, LI R, DUAN C, et al A novel Transformer based semantic segmentation scheme for fine-resolution remote sensing images[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19: 6506105
[12]
GAO L, LIU H, YANG M, et al STransFuse: fusing swin Transformer and convolutional neural network for remote sensing image semantic segmentation[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 10990- 11003
doi: 10.1109/JSTARS.2021.3119654
[13]
LEI Tao, ZHAI Yujie, XU Yetong, et al. Edge guided and dynamically deformable Transformer network for remote sensing images change detection [J]. Acta Electronica Sinica, 2024, 52(1): 107-117.
doi: 10.12263/DZXB.20230583
[14]
ZHANG Q, YANG Y B ResT: an efficient Transformer for visual recognition[J]. Advances in Neural Information Processing Systems, 2021, 34: 15475- 15485
[15]
YUAN L, CHEN Y, WANG T, et al. Tokens-to-token ViT: training vision Transformers from scratch on ImageNet [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 538–547.
[16]
ULYANOV D, VEDALDI A, LEMPITSKY V. Instance normalization: the missing ingredient for fast stylization [EB/OL]. (2017−11−06) [2024−05−20]. https://arxiv.org/pdf/1607.08022.
[17]
HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 7132–7141.
[18]
HE X, ZHOU Y, ZHAO J, et al Swin Transformer embedding UNet for remote sensing image semantic segmentation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4408715
[19]
STERGIOU A, POPPE R, KALLIATAKIS G. Refining activation downsampling with SoftPool [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 10337–10346.
[20]
HE Xiaoying, XU Weiming, PAN Kaixiang, et al. Classification of high-resolution remote sensing images based on Swin Transformer and convolutional neural network [J]. Laser and Optoelectronics Progress, 2024, 61(14): 1428002.
doi: 10.3788/LOP232003
[21]
XU Z, ZHANG W, ZHANG T, et al Efficient Transformer for remote sensing image segmentation[J]. Remote Sensing, 2021, 13 (18): 3585
doi: 10.3390/rs13183585
[22]
WANG D, ZHANG J, DU B, et al. SAMRS: scaling-up remote sensing segmentation dataset with segment anything model [EB/OL]. (2023−10−13)[2024−05−20]. https://arxiv.org/pdf/2305.02034.
[23]
LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 3431–3440.
[24]
SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 5693–5703.
[25]
CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. (2017−12−05) [2024−05−20]. https://arxiv.org/pdf/1706.05587.
[26]
RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// Medical Image Computing and Computer-Assisted Intervention . [S.l.]: Springer, 2015: 234–241.
[27]
XIE E, WANG W, YU Z, et al. SegFormer: simple and efficient design for semantic segmentation with Transformers [EB/OL]. (2021−10−28)[2024−05−20]. https://arxiv.org/pdf/2105.15203.
[28]
CHEN J, LU Y, YU Q, et al. TransUNet: Transformers make strong encoders for medical image segmentation [EB/OL]. (2021−02−08)[2024−05−20]. https://arxiv.org/pdf/2102.04306.