Journal of Zhejiang University (Engineering Science), 2025, Vol. 59, Issue 4: 778-786    DOI: 10.3785/j.issn.1008-973X.2025.04.013
Computer Technology and Control Engineering
Semantic segmentation algorithm for multiscale remote sensing images based on CNN and Efficient Transformer
Zhenli ZHANG 1,2, Xinkai HU 1,2, Fan LI 1,2, Zhicheng FENG 1,2, Zhichao CHEN 1,2
1. School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
2. Jiangxi Province Key Laboratory of Maglev Rail Transit Equipment, Jiangxi University of Science and Technology, Ganzhou 341000, China
Abstract:

A new semantic segmentation algorithm was proposed to address two problems of existing methods on remote sensing images: the difficulty of extracting multi-scale ground-object features and the inaccuracy of target edge segmentation. A CNN and an Efficient Transformer were used to construct a dual encoder that decouples contextual information from spatial information. A feature fusion module was proposed to strengthen the information interaction between the two encoders, effectively fusing global contextual information with local detail. A hierarchical Transformer structure was constructed to extract feature information at different scales, allowing the encoder to focus effectively on objects of different sizes. An edge refinement loss function was proposed to mitigate inaccurate segmentation of target edges. Experimental results showed that the proposed algorithm achieved mean intersection over union (MIoU) of 72.45% and 82.29% on the ISPRS Vaihingen and ISPRS Potsdam datasets, respectively. On the SOTA, SIOR, and FAST subsets of the SAMRS dataset, the MIoU of the proposed algorithm was 88.81%, 97.29%, and 86.65%, respectively, and both its overall accuracy and MIoU were better than those of the comparison models. The proposed algorithm delivers good segmentation performance on targets of various scales.

Key words: remote sensing image; semantic segmentation; dual encoder structure; feature fusion; Efficient Transformer
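The abstract describes a dual-encoder design in which a CNN branch supplies local spatial detail and an Efficient Transformer branch supplies global context, coupled by a feature fusion module (FFM). The full paper is not reproduced on this page, so the following PyTorch sketch is only an illustration of that idea; the module structure, channel widths, and fusion design below are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """CNN encoder stage: extracts local spatial detail."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class TransformerBranch(nn.Module):
    """Transformer encoder stage: models global context over a token grid."""
    def __init__(self, in_ch, dim, num_heads=4, depth=2):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, 3, stride=2, padding=1)  # patch embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           dim_feedforward=2 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.embed(x)                                     # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.encoder(x.flatten(2).transpose(1, 2))   # (B, HW, C)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class FeatureFusionModule(nn.Module):
    """Assumed FFM: concatenate both branches, then reweight channels."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, 1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, f_cnn, f_trans):
        f = self.fuse(torch.cat([f_cnn, f_trans], dim=1))
        return f + f * self.gate(f)   # residual channel attention

if __name__ == "__main__":
    x = torch.randn(1, 3, 64, 64)
    f_cnn = ConvBranch(3, 64)(x)             # (1, 64, 32, 32)
    f_trans = TransformerBranch(3, 64)(x)    # (1, 64, 32, 32)
    print(FeatureFusionModule(64)(f_cnn, f_trans).shape)  # torch.Size([1, 64, 32, 32])
```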
Received: 2024-03-27    Published: 2025-04-25
CLC:  TP 751  
Supported by: the National Natural Science Foundation of China (62063009) and the National Key Research and Development Program of China (2023YFB4302100).
About the author: ZHANG Zhenli (1976—), male, associate professor, M.S., engaged in artificial intelligence research. orcid.org/0009-0004-7539-9260. E-mail: zhangzhenli@jxust.edu.cn

Cite this article:

Zhenli ZHANG, Xinkai HU, Fan LI, Zhicheng FENG, Zhichao CHEN. Semantic segmentation algorithm for multiscale remote sensing images based on CNN and Efficient Transformer. Journal of Zhejiang University (Engineering Science), 2025, 59(4): 778-786.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.04.013        https://www.zjujournals.com/eng/CN/Y2025/V59/I4/778

Fig. 1  Network structure of the proposed semantic segmentation algorithm for remote sensing images
Fig. 2  Structure of the main encoder
Fig. 3  Structure of the efficient multi-head self-attention module
Fig. 4  Structure of the feature extraction layer
Fig. 5  Structure of the feature fusion module
Fig. 6  Edge refinement loss function
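Figure 6 refers to the edge refinement loss function named in the abstract. Its exact formulation is not given on this page, so the sketch below shows only one plausible reading under stated assumptions: a pixel-wise cross-entropy whose weight is raised on label boundaries detected with a Laplacian kernel. The kernel choice and the weight factor are illustrative, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def edge_refinement_loss(logits, target, edge_weight=2.0, ignore_index=255):
    """logits: (B, C, H, W) raw scores; target: (B, H, W) integer class map."""
    # Unreduced per-pixel cross-entropy, so edge pixels can be reweighted.
    ce = F.cross_entropy(logits, target, reduction="none", ignore_index=ignore_index)
    # Label-boundary map: the Laplacian of the class-index image is nonzero
    # exactly where neighboring pixels carry different labels (zero padding
    # also flags the image border, which is acceptable for a sketch).
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=target.device).view(1, 1, 3, 3)
    edges = F.conv2d(target.unsqueeze(1).float(), lap, padding=1).abs().squeeze(1) > 0
    weight = torch.where(edges, torch.full_like(ce, edge_weight), torch.ones_like(ce))
    return (ce * weight).mean()

logits = torch.randn(2, 6, 64, 64, requires_grad=True)
target = torch.randint(0, 6, (2, 64, 64))
edge_refinement_loss(logits, target).backward()  # differentiable like ordinary CE
```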
| Model | Impervious surface | Building | Low vegetation | Tree | Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 78.81 | 85.45 | 65.56 | 74.76 | 24.25 | 86.49 | 65.56 |
| DANet[6] | 77.80 | 84.81 | 63.55 | 68.33 | 36.05 | 84.99 | 66.11 |
| HRNet[24] | 78.35 | 82.72 | 63.21 | 75.94 | 38.49 | 86.17 | 67.74 |
| DeepLabV3[25] | 79.28 | 86.34 | 66.05 | 77.22 | 30.49 | 86.34 | 67.88 |
| Segformer[27] | 78.88 | 83.18 | 61.04 | 75.39 | 45.22 | 85.94 | 68.74 |
| UNet[26] | 77.38 | 83.85 | 61.04 | 75.05 | 34.03 | 84.43 | 66.27 |
| TransUNet[28] | 76.68 | 81.05 | 63.46 | 74.08 | 46.53 | 84.80 | 68.36 |
| SwinUNet[29] | 74.16 | 77.85 | 62.01 | 73.46 | 35.62 | 83.50 | 64.62 |
| Ours | 79.98 | 84.88 | 65.27 | 74.44 | 57.69 | 86.51 | 72.45 |

Table 1  Comparison of segmentation results of different models on the ISPRS Vaihingen dataset (per-class entries are IoU/%)
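Tables 1-6 report per-class IoU, overall accuracy (OA), and mean IoU (MIoU). These are standard segmentation metrics; as a reference point, the NumPy sketch below computes all three from a confusion matrix using their conventional definitions (this is not code from the paper).

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    valid = (gt >= 0) & (gt < num_classes)
    # Confusion matrix: rows index ground truth, columns index prediction.
    cm = np.bincount(num_classes * gt[valid] + pred[valid],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    iou = tp / (cm.sum(0) + cm.sum(1) - tp)  # per-class intersection over union
    oa = tp.sum() / cm.sum()                 # overall (pixel) accuracy
    return 100 * iou, 100 * oa, 100 * np.nanmean(iou)  # percentages, as in the tables

rng = np.random.default_rng(0)
pred, gt = rng.integers(0, 5, (256, 256)), rng.integers(0, 5, (256, 256))
iou, oa, miou = segmentation_metrics(pred, gt, num_classes=5)
print(iou.round(2), round(oa, 2), round(miou, 2))
```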
Fig. 7  Visualized segmentation results of different models on the ISPRS Vaihingen dataset
| Model | Impervious surface | Building | Low vegetation | Tree | Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 76.31 | 83.23 | 64.65 | 66.03 | 68.78 | 86.04 | 71.80 |
| DANet[6] | 77.34 | 82.52 | 64.73 | 70.78 | 79.87 | 86.94 | 75.05 |
| HRNet[24] | 79.11 | 84.97 | 67.95 | 70.53 | 81.65 | 87.78 | 76.84 |
| DeepLabV3[25] | 78.90 | 85.23 | 68.68 | 70.91 | 83.17 | 87.73 | 77.38 |
| Segformer[27] | 79.96 | 86.70 | 69.72 | 65.21 | 77.64 | 87.09 | 75.85 |
| UNet[26] | 76.86 | 83.74 | 65.90 | 63.69 | 79.13 | 86.01 | 73.86 |
| TransUNet[28] | 79.79 | 86.13 | 68.94 | 66.30 | 78.63 | 86.41 | 75.96 |
| SwinUNet[29] | 73.01 | 76.29 | 61.74 | 54.27 | 68.88 | 80.49 | 66.83 |
| Ours | 86.05 | 92.60 | 74.93 | 73.68 | 84.17 | 90.50 | 82.29 |

Table 2  Comparison of segmentation results of different models on the ISPRS Potsdam dataset (per-class entries are IoU/%)
| Model | Large vehicle | Swimming pool | Plane | Small vehicle | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 72.28 | 68.57 | 80.53 | 80.31 | 84.85 | 75.42 |
| DANet[6] | 70.54 | 77.65 | 72.14 | 71.24 | 82.14 | 72.89 |
| HRNet[24] | 77.61 | 79.78 | 83.28 | 83.12 | 85.45 | 75.16 |
| DeepLabV3[25] | 83.20 | 82.69 | 91.12 | 87.37 | 93.37 | 86.09 |
| Segformer[27] | 73.49 | 85.79 | 74.81 | 76.24 | 87.26 | 77.58 |
| UNet[26] | 75.61 | 74.34 | 80.37 | 83.08 | 87.75 | 78.35 |
| TransUNet[28] | 79.24 | 81.07 | 91.38 | 83.98 | 91.59 | 83.91 |
| SwinUNet[29] | 64.92 | 77.92 | 64.42 | 66.90 | 78.77 | 68.54 |
| Ours | 87.05 | 84.42 | 92.98 | 90.78 | 94.98 | 88.81 |

Table 3  Comparison of segmentation results of different models on the SAMRS SOTA dataset (per-class entries are IoU/%)
| Model | Airplane | Baseball field | Ship | Tennis court | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 82.29 | 95.61 | 95.41 | 95.59 | 96.97 | 92.22 |
| DANet[6] | 78.32 | 96.11 | 96.03 | 96.36 | 94.59 | 91.71 |
| HRNet[24] | 83.01 | 93.62 | 94.74 | 95.67 | 96.47 | 91.76 |
| DeepLabV3[25] | 90.34 | 96.10 | 97.69 | 95.34 | 97.83 | 94.87 |
| Segformer[27] | 73.65 | 94.10 | 95.19 | 91.43 | 92.65 | 88.59 |
| UNet[26] | 77.38 | 92.03 | 92.34 | 96.62 | 93.47 | 89.59 |
| TransUNet[28] | 92.76 | 96.45 | 97.18 | 97.48 | 97.88 | 95.97 |
| SwinUNet[29] | 80.88 | 95.43 | 92.65 | 93.85 | 94.49 | 90.70 |
| Ours | 94.38 | 98.65 | 97.77 | 98.38 | 98.93 | 97.29 |

Table 4  Comparison of segmentation results of different models on the SAMRS SIOR dataset (per-class entries are IoU/%)
| Model | Baseball field | Bridge | Football field | Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- |
| FCN[23] | 80.21 | 94.17 | 90.26 | 63.76 | 90.63 | 82.10 |
| DANet[6] | 84.97 | 87.29 | 91.66 | 55.01 | 87.88 | 80.23 |
| HRNet[24] | 92.32 | 84.83 | 91.83 | 51.51 | 88.64 | 80.12 |
| DeepLabV3[25] | 93.57 | 95.75 | 96.98 | 54.97 | 92.89 | 85.32 |
| Segformer[27] | 87.95 | 93.87 | 94.48 | 42.71 | 86.77 | 79.75 |
| UNet[26] | 84.30 | 92.78 | 93.53 | 59.70 | 90.21 | 82.58 |
| TransUNet[28] | 94.46 | 85.84 | 95.66 | 60.33 | 91.83 | 84.07 |
| SwinUNet[29] | 87.83 | 90.05 | 95.29 | 43.72 | 87.49 | 79.22 |
| Ours | 93.79 | 92.91 | 95.98 | 63.93 | 93.45 | 86.65 |

Table 5  Comparison of segmentation results of different models on the SAMRS FAST dataset (per-class entries are IoU/%)
Fig. 8  Ablation experiment on the dual-encoder structure
| Model | Impervious surface | Building | Low vegetation | Tree | Car | OA/% | MIoU/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B | 76.87 | 81.57 | 62.45 | 73.06 | 48.27 | 84.70 | 68.44 |
| B+FFM | 77.29 | 82.06 | 62.51 | 73.74 | 51.05 | 85.01 | 69.33 |
| B+ETL | 77.56 | 83.04 | 63.49 | 74.29 | 44.58 | 85.40 | 68.59 |
| B+ETL+FFM | 79.98 | 84.88 | 65.27 | 74.44 | 57.69 | 86.51 | 72.45 |

Table 6  Module ablation results on the ISPRS Vaihingen dataset (per-class entries are IoU/%)
Fig. 9  Ablation experiment on the efficient multi-head self-attention
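Figure 9 ablates the efficient multi-head self-attention. A common way to make self-attention affordable on dense remote-sensing feature maps, used by the cited ResT [14] and SegFormer [27], is to shrink the key/value grid with a strided convolution before attending. The sketch below follows that spatial-reduction pattern; the paper's exact module may differ.

```python
import torch
import torch.nn as nn

class EfficientMHSA(nn.Module):
    """Spatial-reduction multi-head self-attention (PVT/ResT-style sketch)."""
    def __init__(self, dim, num_heads=4, sr_ratio=2):
        super().__init__()
        # Strided conv shrinks the key/value grid, cutting the cost of
        # attention by roughly sr_ratio ** 2.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                   # (B, HW, C) full-res queries
        kv = self.norm(self.sr(x).flatten(2).transpose(1, 2))  # (B, HW/r^2, C)
        out, _ = self.attn(q, kv, kv, need_weights=False)  # attend to reduced K/V
        return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 64, 32, 32)
print(EfficientMHSA(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```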
1 XIAO D, KANG Z, FU Y, et al Csswin-UNet: a Swin-UNet network for semantic segmentation of remote sensing images by aggregating contextual information and extracting spatial information[J]. International Journal of Remote Sensing, 2023, 44 (23): 7598- 7625
doi: 10.1080/01431161.2023.2285738
2 FENG Zhicheng, YANG Jie, CHEN Zhichao. Urban road network extraction method based on lightweight Transformer[J]. Journal of Zhejiang University (Engineering Science), 2024, 58(1): 40-49
3 PAN T, ZUO R, WANG Z Geological mapping via convolutional neural network based on remote sensing and geochemical survey data in vegetation coverage areas[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 3485- 3494
doi: 10.1109/JSTARS.2023.3260584
4 JIA P, CHEN C, ZHANG D, et al Semantic segmentation of deep learning remote sensing images based on band combination principle: application in urban planning and land use[J]. Computer Communications, 2024, 217: 97- 106
doi: 10.1016/j.comcom.2024.01.032
5 ZHENG Z, ZHONG Y, WANG J, et al Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: from natural disasters to man-made disasters[J]. Remote Sensing of Environment, 2021, 265: 112636
doi: 10.1016/j.rse.2021.112636
6 FU J, LIU J, TIAN H, et al. Dual attention network for scene segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 3141–3149.
7 HU X, ZHANG P, ZHANG Q, et al GLSANet: global-local self-attention network for remote sensing image semantic segmentation[J]. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 6000105
8 CHEN H, QIN Y, LIU X, et al An improved DeepLabv3+ lightweight network for remote-sensing image semantic segmentation[J]. Complex and Intelligent Systems, 2024, 10 (2): 2839- 2849
doi: 10.1007/s40747-023-01304-z
9 DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. (2021−06−03)[2024−05−20]. https://arxiv.org/pdf/2010.11929.
10 ZHENG S, LU J, ZHAO H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 6877–6886.
11 WANG L, LI R, DUAN C, et al A novel Transformer based semantic segmentation scheme for fine-resolution remote sensing images[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19: 6506105
12 GAO L, LIU H, YANG M, et al STransFuse: fusing swin Transformer and convolutional neural network for remote sensing image semantic segmentation[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 10990- 11003
doi: 10.1109/JSTARS.2021.3119654
13 LEI Tao, ZHAI Yujie, XU Yetong, et al. Edge guided and dynamically deformable Transformer network for remote sensing images change detection[J]. Acta Electronica Sinica, 2024, 52(1): 107-117
doi: 10.12263/DZXB.20230583
14 ZHANG Q, YANG Y B ResT: an efficient Transformer for visual recognition[J]. Advances in Neural Information Processing Systems, 2021, 34: 15475- 15485
15 YUAN L, CHEN Y, WANG T, et al. Tokens-to-token ViT: training vision Transformers from scratch on ImageNet [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 538–547.
16 ULYANOV D, VEDALDI A, LEMPITSKY V. Instance normalization: the missing ingredient for fast stylization [EB/OL]. (2017−11−06) [2024−05−20]. https://arxiv.org/pdf/1607.08022.
17 HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 7132–7141.
18 HE X, ZHOU Y, ZHAO J, et al Swin Transformer embedding UNet for remote sensing image semantic segmentation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4408715
19 STERGIOU A, POPPE R, KALLIATAKIS G. Refining activation downsampling with SoftPool [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 10337–10346.
20 HE Xiaoying, XU Weiming, PAN Kaixiang, et al. Classification of high-resolution remote sensing images based on Swin Transformer and convolutional neural network[J]. Laser and Optoelectronics Progress, 2024, 61(14): 1428002
doi: 10.3788/LOP232003
21 XU Z, ZHANG W, ZHANG T, et al Efficient Transformer for remote sensing image segmentation[J]. Remote Sensing, 2021, 13 (18): 3585
doi: 10.3390/rs13183585
22 WANG D, ZHANG J, DU B, et al. SAMRS: scaling-up remote sensing segmentation dataset with segment anything model [EB/OL]. (2023−10−13)[2024−05−20]. https://arxiv.org/pdf/2305.02034.
23 LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 3431–3440.
24 SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 5693–5703.
25 CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. (2017−12−05) [2024−05−20]. https://arxiv.org/pdf/1706.05587.
26 RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// Medical Image Computing and Computer-Assisted Intervention . [S.l.]: Springer, 2015: 234–241.
27 XIE E, WANG W, YU Z, et al. SegFormer: simple and efficient design for semantic segmentation with Transformers [EB/OL]. (2021−10−28)[2024−05−20]. https://arxiv.org/pdf/2105.15203.
28 CHEN J, LU Y, YU Q, et al. TransUNet: Transformers make strong encoders for medical image segmentation [EB/OL]. (2021−02−08)[2024−05−20]. https://arxiv.org/pdf/2102.04306.