1. School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
2. Jiangxi Province Key Laboratory of Maglev Rail Transit Equipment, Jiangxi University of Science and Technology, Ganzhou 341000, China
To address the difficulty of multi-scale feature extraction and the inaccurate segmentation of object edges in remote sensing images by existing methods, a new semantic segmentation algorithm was proposed. A dual encoder composed of a CNN and an Efficient Transformer was constructed to decouple context information from spatial information. A feature fusion module was proposed to strengthen the information interaction between the two encoders and to effectively fuse global context with local detail. A hierarchical Transformer structure was constructed to extract features at different scales, allowing the encoder to attend effectively to objects of different sizes. An edge thinning loss function was proposed to mitigate inaccurate segmentation of object edges. Experimental results showed that the proposed algorithm achieved a mean intersection over union (MIoU) of 72.45% and 82.29% on the ISPRS Vaihingen and ISPRS Potsdam datasets, respectively. On the SOTA, SIOR, and FAST subsets of the SAMRS dataset, the MIoU of the proposed algorithm was 88.81%, 97.29%, and 86.65%, respectively; both the overall accuracy (OA) and the MIoU were better than those of the comparison models. The proposed algorithm delivers good segmentation performance on various types of targets at different scales.
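To make the dual-encoder idea concrete, the sketch below pairs a small CNN branch (local spatial detail) with a Transformer branch on image patches (global context) and fuses the two feature maps with a simple channel-attention module before a light segmentation head. This is a minimal illustration only: the channel sizes, the patch size of 16, the fusion design, and the names FeatureFusionModule and DualEncoderSegmenter are assumptions for the example, not the paper's actual Efficient Transformer, feature fusion module, or edge thinning loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusionModule(nn.Module):
    """Hypothetical fusion: concatenate CNN and Transformer features,
    project with 1x1 conv, then re-weight channels with a squeeze-style gate."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn, f_trans):
        fused = self.proj(torch.cat([f_cnn, f_trans], dim=1))
        return fused * self.gate(fused)  # channel re-weighting, spatial layout kept


class DualEncoderSegmenter(nn.Module):
    """Toy dual-encoder segmenter: CNN branch at 1/4 resolution for detail,
    patch-token Transformer branch for context, fused before a 1x1 classifier."""

    def __init__(self, in_ch=3, channels=64, num_classes=6, patch=16):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.patch_embed = nn.Conv2d(in_ch, channels, kernel_size=patch, stride=patch)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=channels, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.fuse = FeatureFusionModule(channels)
        self.head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        f_cnn = self.cnn(x)                       # B x C x H/4 x W/4
        tokens = self.patch_embed(x)              # B x C x H/16 x W/16
        b, c, h, w = tokens.shape
        tokens = self.transformer(tokens.flatten(2).transpose(1, 2))  # B x N x C
        f_trans = tokens.transpose(1, 2).reshape(b, c, h, w)
        f_trans = F.interpolate(f_trans, size=f_cnn.shape[-2:],
                                mode="bilinear", align_corners=False)
        logits = self.head(self.fuse(f_cnn, f_trans))
        return F.interpolate(logits, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = DualEncoderSegmenter()
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 6, 256, 256])
```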
Zhenli ZHANG, Xinkai HU, Fan LI, Zhicheng FENG, Zhichao CHEN. Semantic segmentation algorithm for multiscale remote sensing images based on CNN and Efficient Transformer. Journal of Zhejiang University (Engineering Science), 2025, 59(4): 778-786.
Fig.1 Network structure of proposed semantic segmentation algorithm for remote sensing images
Fig.2 Structure of main encoder
Fig.3 Structure of efficient multi-head self-attention module
Fig.4 Structure of feature extraction layer
Fig.5 Structure of feature fusion module
Fig.6 Edge thinning loss function
| Model | IoU/% Impervious surface | IoU/% Building | IoU/% Low vegetation | IoU/% Tree | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 78.81 | 85.45 | 65.56 | 74.76 | 24.25 | 86.49 | 65.56 |
| DANet[6] | 77.80 | 84.81 | 63.55 | 68.33 | 36.05 | 84.99 | 66.11 |
| HRNet[24] | 78.35 | 82.72 | 63.21 | 75.94 | 38.49 | 86.17 | 67.74 |
| DeepLabV3[25] | 79.28 | 86.34 | 66.05 | 77.22 | 30.49 | 86.34 | 67.88 |
| Segformer[27] | 78.88 | 83.18 | 61.04 | 75.39 | 45.22 | 85.94 | 68.74 |
| UNet[26] | 77.38 | 83.85 | 61.04 | 75.05 | 34.03 | 84.43 | 66.27 |
| TransUNet[28] | 76.68 | 81.05 | 63.46 | 74.08 | 46.53 | 84.80 | 68.36 |
| SwinUNet[29] | 74.16 | 77.85 | 62.01 | 73.46 | 35.62 | 83.50 | 64.62 |
| Ours | 79.98 | 84.88 | 65.27 | 74.44 | 57.69 | 86.51 | 72.45 |

Tab.1 Comparison of segmentation results of different models on the ISPRS Vaihingen dataset
Fig.7 Visualization of segmentation results of different models on the ISPRS Vaihingen dataset
| Model | IoU/% Impervious surface | IoU/% Building | IoU/% Low vegetation | IoU/% Tree | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 76.31 | 83.23 | 64.65 | 66.03 | 68.78 | 86.04 | 71.80 |
| DANet[6] | 77.34 | 82.52 | 64.73 | 70.78 | 79.87 | 86.94 | 75.05 |
| HRNet[24] | 79.11 | 84.97 | 67.95 | 70.53 | 81.65 | 87.78 | 76.84 |
| DeepLabV3[25] | 78.90 | 85.23 | 68.68 | 70.91 | 83.17 | 87.73 | 77.38 |
| Segformer[27] | 79.96 | 86.70 | 69.72 | 65.21 | 77.64 | 87.09 | 75.85 |
| UNet[26] | 76.86 | 83.74 | 65.90 | 63.69 | 79.13 | 86.01 | 73.86 |
| TransUNet[28] | 79.79 | 86.13 | 68.94 | 66.30 | 78.63 | 86.41 | 75.96 |
| SwinUNet[29] | 73.01 | 76.29 | 61.74 | 54.27 | 68.88 | 80.49 | 66.83 |
| Ours | 86.05 | 92.60 | 74.93 | 73.68 | 84.17 | 90.50 | 82.29 |

Tab.2 Comparison of segmentation results of different models on the ISPRS Potsdam dataset
| Model | IoU/% Large vehicle | IoU/% Swimming pool | IoU/% Airplane | IoU/% Small vehicle | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 72.28 | 68.57 | 80.53 | 80.31 | 84.85 | 75.42 |
| DANet[6] | 70.54 | 77.65 | 72.14 | 71.24 | 82.14 | 72.89 |
| HRNet[24] | 77.61 | 79.78 | 83.28 | 83.12 | 85.45 | 75.16 |
| DeepLabV3[25] | 83.20 | 82.69 | 91.12 | 87.37 | 93.37 | 86.09 |
| Segformer[27] | 73.49 | 85.79 | 74.81 | 76.24 | 87.26 | 77.58 |
| UNet[26] | 75.61 | 74.34 | 80.37 | 83.08 | 87.75 | 78.35 |
| TransUNet[28] | 79.24 | 81.07 | 91.38 | 83.98 | 91.59 | 83.91 |
| SwinUNet[29] | 64.92 | 77.92 | 64.42 | 66.90 | 78.77 | 68.54 |
| Ours | 87.05 | 84.42 | 92.98 | 90.78 | 94.98 | 88.81 |

Tab.3 Comparison of segmentation results of different models on the SAMRS SOTA dataset
| Model | IoU/% Airplane | IoU/% Baseball field | IoU/% Ship | IoU/% Tennis court | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 82.29 | 95.61 | 95.41 | 95.59 | 96.97 | 92.22 |
| DANet[6] | 78.32 | 96.11 | 96.03 | 96.36 | 94.59 | 91.71 |
| HRNet[24] | 83.01 | 93.62 | 94.74 | 95.67 | 96.47 | 91.76 |
| DeepLabV3[25] | 90.34 | 96.10 | 97.69 | 95.34 | 97.83 | 94.87 |
| Segformer[27] | 73.65 | 94.10 | 95.19 | 91.43 | 92.65 | 88.59 |
| UNet[26] | 77.38 | 92.03 | 92.34 | 96.62 | 93.47 | 89.59 |
| TransUNet[28] | 92.76 | 96.45 | 97.18 | 97.48 | 97.88 | 95.97 |
| SwinUNet[29] | 80.88 | 95.43 | 92.65 | 93.85 | 94.49 | 90.70 |
| Ours | 94.38 | 98.65 | 97.77 | 98.38 | 98.93 | 97.29 |

Tab.4 Comparison of segmentation results of different models on the SAMRS SIOR dataset
| Model | IoU/% Baseball field | IoU/% Bridge | IoU/% Football field | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 80.21 | 94.17 | 90.26 | 63.76 | 90.63 | 82.10 |
| DANet[6] | 84.97 | 87.29 | 91.66 | 55.01 | 87.88 | 80.23 |
| HRNet[24] | 92.32 | 84.83 | 91.83 | 51.51 | 88.64 | 80.12 |
| DeepLabV3[25] | 93.57 | 95.75 | 96.98 | 54.97 | 92.89 | 85.32 |
| Segformer[27] | 87.95 | 93.87 | 94.48 | 42.71 | 86.77 | 79.75 |
| UNet[26] | 84.30 | 92.78 | 93.53 | 59.70 | 90.21 | 82.58 |
| TransUNet[28] | 94.46 | 85.84 | 95.66 | 60.33 | 91.83 | 84.07 |
| SwinUNet[29] | 87.83 | 90.05 | 95.29 | 43.72 | 87.49 | 79.22 |
| Ours | 93.79 | 92.91 | 95.98 | 63.93 | 93.45 | 86.65 |

Tab.5 Comparison of segmentation results of different models on the SAMRS FAST dataset
Fig.8 Ablation results of dual encoder structure
| Model | IoU/% Impervious surface | IoU/% Building | IoU/% Low vegetation | IoU/% Tree | IoU/% Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B | 76.87 | 81.57 | 62.45 | 73.06 | 48.27 | 84.70 | 68.44 |
| B+FFM | 77.29 | 82.06 | 62.51 | 73.74 | 51.05 | 85.01 | 69.33 |
| B+ETL | 77.56 | 83.04 | 63.49 | 74.29 | 44.58 | 85.40 | 68.59 |
| B+ETL+FFM | 79.98 | 84.88 | 65.27 | 74.44 | 57.69 | 86.51 | 72.45 |

Tab.6 Results of module ablation experiment on the ISPRS Vaihingen dataset
Fig.9 Ablation experiments of efficient multi-head self-attention
[1]
XIAO D, KANG Z, FU Y, et al Csswin-UNet: a Swin-UNet network for semantic segmentation of remote sensing images by aggregating contextual information and extracting spatial information[J]. International Journal of Remote Sensing, 2023, 44 (23): 7598- 7625
doi: 10.1080/01431161.2023.2285738
[2]
FENG Zhicheng, YANG Jie, CHEN Zhichao. Urban road network extraction method based on lightweight Transformer [J]. Journal of Zhejiang University: Engineering Science, 2024, 58(1): 40-49.
[3]
PAN T, ZUO R, WANG Z Geological mapping via convolutional neural network based on remote sensing and geochemical survey data in vegetation coverage areas[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 3485- 3494
doi: 10.1109/JSTARS.2023.3260584
[4]
JIA P, CHEN C, ZHANG D, et al Semantic segmentation of deep learning remote sensing images based on band combination principle: application in urban planning and land use[J]. Computer Communications, 2024, 217: 97- 106
doi: 10.1016/j.comcom.2024.01.032
[5]
ZHENG Z, ZHONG Y, WANG J, et al Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: from natural disasters to man-made disasters[J]. Remote Sensing of Environment, 2021, 265: 112636
doi: 10.1016/j.rse.2021.112636
[6]
FU J, LIU J, TIAN H, et al. Dual attention network for scene segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 3141–3149.
[7]
HU X, ZHANG P, ZHANG Q, et al GLSANet: global-local self-attention network for remote sensing image semantic segmentation[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 6000105
[8]
CHEN H, QIN Y, LIU X, et al An improved DeepLabv3+ lightweight network for remote-sensing image semantic segmentation[J]. Complex and Intelligent Systems, 2024, 10 (2): 2839- 2849
doi: 10.1007/s40747-023-01304-z
[9]
DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. (2021−06−03)[2024−05−20]. https://arxiv.org/pdf/2010.11929.
[10]
ZHENG S, LU J, ZHAO H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 6877–6886.
[11]
WANG L, LI R, DUAN C, et al A novel Transformer based semantic segmentation scheme for fine-resolution remote sensing images[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19: 6506105
[12]
GAO L, LIU H, YANG M, et al STransFuse: fusing swin Transformer and convolutional neural network for remote sensing image semantic segmentation[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 10990- 11003
doi: 10.1109/JSTARS.2021.3119654
[13]
LEI Tao, ZHAI Yujie, XU Yetong, et al. Edge guided and dynamically deformable Transformer network for remote sensing images change detection [J]. Acta Electronica Sinica, 2024, 52(1): 107-117.
doi: 10.12263/DZXB.20230583
[14]
ZHANG Q, YANG Y B ResT: an efficient Transformer for visual recognition[J]. Advances in Neural Information Processing Systems, 2021, 34: 15475- 15485
[15]
YUAN L, CHEN Y, WANG T, et al. Tokens-to-token ViT: training vision Transformers from scratch on ImageNet [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 538–547.
[16]
ULYANOV D, VEDALDI A, LEMPITSKY V. Instance normalization: the missing ingredient for fast stylization [EB/OL]. (2017−11−06) [2024−05−20]. https://arxiv.org/pdf/1607.08022.
[17]
HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 7132–7141.
[18]
HE X, ZHOU Y, ZHAO J, et al Swin Transformer embedding UNet for remote sensing image semantic segmentation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4408715
[19]
STERGIOU A, POPPE R, KALLIATAKIS G. Refining activation downsampling with SoftPool [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 10337–10346.
[20]
HE Xiaoying, XU Weiming, PAN Kaixiang, et al. Classification of high-resolution remote sensing images based on Swin Transformer and convolutional neural network [J]. Laser and Optoelectronics Progress, 2024, 61(14): 1428002.
doi: 10.3788/LOP232003
[21]
XU Z, ZHANG W, ZHANG T, et al Efficient Transformer for remote sensing image segmentation[J]. Remote Sensing, 2021, 13 (18): 3585
doi: 10.3390/rs13183585
[22]
WANG D, ZHANG J, DU B, et al. SAMRS: scaling-up remote sensing segmentation dataset with segment anything model [EB/OL]. (2023−10−13)[2024−05−20]. https://arxiv.org/pdf/2305.02034.
[23]
LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 3431–3440.
[24]
SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 5693–5703.
[25]
CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. (2017−12−05) [2024−05−20]. https://arxiv.org/pdf/1706.05587.
[26]
RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// Medical Image Computing and Computer-Assisted Intervention . [S.l.]: Springer, 2015: 234–241.
[27]
XIE E, WANG W, YU Z, et al. SegFormer: simple and efficient design for semantic segmentation with Transformers [EB/OL]. (2021−10−28)[2024−05−20]. https://arxiv.org/pdf/2105.15203.
[28]
CHEN J, LU Y, YU Q, et al. TransUNet: Transformers make strong encoders for medical image segmentation [EB/OL]. (2021−02−08)[2024−05−20]. https://arxiv.org/pdf/2102.04306.