Journal of Zhejiang University (Engineering Science)  2023, Vol. 57 Issue (6): 1205-1214    DOI: 10.3785/j.issn.1008-973X.2023.06.016
Computer and Control Engineering
Efficient and adaptive semantic segmentation network based on Transformer
Hai-bo ZHANG 1,2, Lei CAI 1,2, Jun-ping REN 1,2, Ru-yan WANG 1, Fu LIU 3
1. School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2. Chongqing Key Laboratory of Ubiquitous Sensing and Networking, Chongqing 400065, China
3. Chongqing Urban Lighting Center, Chongqing 400023, China
Full text: PDF (1465 KB)   HTML
Abstract:

Transformer-based semantic segmentation networks suffer from two problems: segmentation accuracy drops significantly when the input resolution changes, and the computational complexity of self-attention is excessive. To address these problems, an adaptive convolutional positional encoding module was proposed, exploiting the property that zero-padded convolution retains positional information. A joint resampling self-attention module was proposed to reduce the cost of self-attention, exploiting the property that the dimensions of specific matrices in the self-attention computation can cancel each other. A decoder was designed to fuse feature maps from different stages, yielding EA-Former, an efficient segmentation network that adapts to inputs of different resolutions. The best mean intersection over union of EA-Former was 51.0% on ADE20K and 83.9% on Cityscapes. Compared with mainstream segmentation methods, EA-Former achieved competitive accuracy with lower computational complexity, and the degradation in segmentation performance caused by variation of the input resolution was alleviated.

Key words: semantic segmentation    Transformer    self-attention    position encoding    neural network
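
The efficiency claim above hinges on how the matrix products inside self-attention are grouped. As a simplified counting argument (ignoring the softmax normalization, and not necessarily the paper's exact derivation), for N tokens of width d:

```latex
% Both groupings give the same product, but at very different cost:
%   (Q K^T) V   materializes an N x N attention map  ->  O(N^2 d)
%   Q (K^T V)   first forms a d x d matrix           ->  O(N d^2)
\underbrace{(QK^{\top})\,V}_{O(N^{2}d)} \;=\; \underbrace{Q\,(K^{\top}V)}_{O(Nd^{2})},
\qquad Q,\,K,\,V \in \mathbb{R}^{N\times d}.
```

When d (or the resampled key/value token count) is much smaller than N, the right-hand grouping keeps the cost roughly linear in the number of tokens, which is the sense in which the dimensions of specific matrices can "cancel each other".
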
Received: 2022-06-24    Published: 2023-06-30
CLC:  TP 391  
Funding: National Natural Science Foundation of China (62271094); Program for Changjiang Scholars and Innovative Research Team in University (IRT16R72); Chongqing Entrepreneurship and Innovation Support Program for Returned Overseas Scholars, Innovation Category (cx2020059)
About the author: Hai-bo ZHANG (1979—), male, associate professor, Ph.D., research interests: Internet of Vehicles and computer vision. orcid.org/0000-0003-2719-9956. E-mail: zhanghb@cqupt.edu.cn

Cite this article:

Hai-bo ZHANG, Lei CAI, Jun-ping REN, Ru-yan WANG, Fu LIU. Efficient and adaptive semantic segmentation network based on Transformer. Journal of Zhejiang University (Engineering Science), 2023, 57(6): 1205-1214.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.06.016        https://www.zjujournals.com/eng/CN/Y2023/V57/I6/1205

Fig. 1  EA-Former network structure
Fig. 2  Structure of the adaptive convolutional positional encoding module
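
To make the idea behind the adaptive convolutional positional encoding concrete, the following is a minimal PyTorch-style sketch rather than the paper's exact module: it relies only on the stated property that a zero-padded convolution leaks absolute position, so the encoding adapts to any input resolution. The class name, depthwise 3×3 kernel, and residual connection are illustrative assumptions.

```python
import torch.nn as nn

class AdaptiveConvPosEnc(nn.Module):
    """Sketch of a convolutional positional encoding (hypothetical names).

    The zero padding lets the convolution sense distance to the image
    border, so position information is injected without a fixed-length
    positional table, and any input resolution is handled.
    """
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2,  # zero padding encodes position
                              groups=dim)                # depthwise keeps the cost low

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N = h * w
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.proj(feat)                  # position-aware feature map
        feat = feat.flatten(2).transpose(1, 2)  # back to (B, N, C)
        return x + feat                         # residual add keeps the content path
```
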
Fig. 3  Structure of the joint resampling self-attention module
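
One plausible reading of the joint resampling self-attention is sketched below under stated assumptions: keys and values are resampled to a small fixed number of tokens (here by adaptive average pooling with an assumed pool size), so the attention map is N×M rather than N×N. The actual resampling operator, head layout, and reduction ratio in EA-Former may differ.

```python
import torch.nn as nn

class ResampledSelfAttention(nn.Module):
    """Self-attention with resampled keys/values (illustrative sketch)."""
    def __init__(self, dim, pool_size=7):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # resample K/V to pool_size^2 tokens
        self.out = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        q = self.q(x)                                   # (B, N, C)
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.pool(kv).flatten(2).transpose(1, 2)   # (B, M, C), M = pool_size^2
        k, v = self.kv(kv).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, M): the N x N map never appears
        attn = attn.softmax(dim=-1)
        return self.out(attn @ v)                       # (B, N, C)
```
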
Fig. 4  Structure of the feature fusion decoder
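
A lightweight way to fuse feature maps from different encoder stages, in the spirit of the decoder in Fig. 4 but with assumed channel widths, class count, and upsampling mode:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionDecoder(nn.Module):
    """Project, upsample, concatenate, and classify multi-stage features (sketch)."""
    def __init__(self, in_dims=(64, 128, 320, 512), embed_dim=256, num_classes=150):
        super().__init__()
        # project every stage to a common channel width before fusion
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i), highest resolution first
        target = feats[0].shape[2:]
        ups = [F.interpolate(p(f), size=target, mode='bilinear', align_corners=False)
               for p, f in zip(self.proj, feats)]
        x = self.fuse(torch.cat(ups, dim=1))  # concatenate stages and mix channels
        return self.classify(x)               # per-pixel class logits
```
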
Method  Backbone  N/10^6 (params)  GFLOPs  mIoU/%
FCN[1] ResNet-101[19] 68.6 275.7 39.9
PSPNet[3] ResNet-101 68.1 256.4 44.3
DeepLab-V3+[7] ResNet-101 62.7 255.1 45.4
DeepLab-V3+ ResNeSt-101[21] 66.3 262.9 46.9
UperNet[22] DeiT[23] 120.5 90.1 45.3
UperNet Swin-S[24] 81.0 259.3 49.3
UperNet Convnext[25] 60.2 234.6 46.1
UperNet Focal-B[26] 126.0 49.0
SETR[11] ViT[12] 318.5 213.6 47.3
DPT[27] ViT 109.7 171.0 46.9
Segmenter Mask[28] ViT 102.5 71.1 49.6
Semantic FPN[29] PVTv2-B3[30] 49.0 62.0 47.3
Semantic FPN VAN-B3[31] 49.0 68.0 48.1
SeMask-B FPN[32] SeMask Swin 96.0 107.0 49.4
Segformer[20] MiT[20] 83.9 110.5 50.1
EA-Former MiT 136.4 61.3 49.3
Segformer* MiT 83.9 172.7 52.1
EA-Former* MiT 136.4 95.8 51.0
Table 1  Evaluation results of different segmentation models on the ADE20K dataset
Method  Backbone  GFLOPs  FPS/(frame·s^-1)  mIoU/%
SETR[11] MiT-B0[20] 25.3 28.7 34.8
Segformer[20] MiT-B0 8.6 50.5 37.5
UperNet[22] MiT-B0 28.5 29.6 39.3
Segmenter[28] MiT-B0 7.9 49.2 35.9
Semantic FPN[29] MiT-B0 23.0 46.4 37.1
EA-Former-T MiT-B0 7.1 51.1 38.1
Table 2  Evaluation results of lightweight segmentation models on the ADE20K dataset
Fig. 5  mIoU of different segmentation models during training on the ADE20K dataset
Method  Backbone  N/10^6 (params)  GFLOPs  mIoU/%
FCN[1] ResNet-101[19] 68.4 619.6 75.5
PSPNet[3] ResNet-101 67.9 576.3 79.7
DeepLabV3+[7] ResNet-101 62.5 571.6 80.6
CCnet[9] ResNet-101 68.8 625.7 79.4
UperNet[22] ResNet-101 85.4 576.5 80.1
DeepLabV3[6] ResNeSt-101[21] 90.8 798.9 80.4
OCRNet[33] HRNet[34] 70.3 364.7 80.7
SETR[11] ViT[12] 318.3 818.2 79.3
Segformer[20] MiT[20] 83.9 597.6 81.8
EA-Former MiT 136.4 137.7 82.1
Segformer# MiT 83.9 735.2 84.1
EA-Former# MiT 136.4 191.8 83.9
Table 3  Evaluation results of different segmentation models on the Cityscapes dataset
Fig. 6  Comparison of segmentation results of different models on the Cityscapes dataset
Method  Backbone  FPS/(frame·s^-1) on ADE20K  FPS/(frame·s^-1) on Cityscapes
FCN[1] ResNet-101[19] 20.7 1.7
PSPNet[3] ResNet-101 20.3 1.8
DeepLabV3+[7] ResNet-101 18.7 1.6
DeepLabV3+ ResNeSt-101[21] 16.1 2.5
UperNet[22] Swin-S[24] 20.1
UperNet Convnext[25] 17.1
SETR[11] ViT[12] 8.3
DPT[27] ViT 20.5
Segmenter Mask[28] ViT 21.3
Segformer[20] MiT[20] 18.6 2.5
EA-Former MiT 21.9 2.8
Segformer* MiT 15.7
EA-Former* MiT 18.1
UperNet ResNet-101 2.3
CCnet[9] ResNet-101 1.7
DeepLabV3[6] ResNeSt-101 2.4
SETR ViT 0.4
Segformer# MiT 2.3
EA-Former# MiT 2.5
Table 4  Inference speed of different segmentation models on the ADE20K and Cityscapes datasets
Resolution  mIoU/% (SETR[11], ViT[12])  mIoU/% (EA-Former without ACPE)  mIoU/% (EA-Former with ACPE)
768×768  79.3  81.9  82.1
832×832  79.0  81.7  82.0
1024×1024  78.4  81.2  81.8
1024×2048  75.4  78.6  81.2
Table 5  Effect of the adaptive convolutional positional encoding module on segmentation accuracy
Joint resampling self-attention  Dimension reduction  FPS/(frame·s^-1)  GFLOPs  mIoU/%
× × 32.4 8.4 37.5
× 11.5 9.0 37.6
51.1 7.1 38.1
Table 6  Effect of the joint resampling self-attention module on EA-Former-T performance
1 LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3431-3440.
2 EVERINGHAM M, ESLAMI S M, VAN GOOL L, et al. The Pascal visual object classes challenge: a retrospective [J]. International Journal of Computer Vision, 2015, 111: 98-136
doi: 10.1007/s11263-014-0733-5
3 ZHAO H, SHI J, QI X, et al. Pyramid scene parsing network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 2881-2890.
4 CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs [EB/OL]. (2016-06-07)[2022-04-25]. https://arxiv.org/pdf/1412.7062.pdf.
5 CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848
6 CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. (2017-06-17)[2022-04-26]. https://arxiv.org/abs/1706.05587.
7 CHEN L C, ZHU Y, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 801-818.
8 ZHAO H, ZHANG Y, LIU S, et al. PSANet: point-wise spatial attention network for scene parsing [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 267-283.
9 HUANG Z, WANG X, HUANG L, et al. CCNet: criss-cross attention for semantic segmentation [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 603-612.
10 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 5998-6008.
11 ZHENG S, LU J, ZHAO H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 6881-6890.
12 DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. (2020-10-22)[2022-04-27]. https://arxiv.org/pdf/2010.11929.pdf.
13 ZHOU B, ZHAO H, PUIG X, et al. Scene parsing through ADE20K dataset [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 633-641.
14 ISLAM M A, JIA S, BRUCE N D B. How much position information do convolutional neural networks encode? [EB/OL]. (2020-01-22)[2022-04-28]. https://arxiv.org/pdf/2001.08248.pdf.
15 CHU X, TIAN Z, ZHANG B, et al. Conditional positional encodings for vision transformers [EB/OL]. (2021-02-22)[2022-04-29]. https://arxiv.org/pdf/2102.10882.pdf.
16 YUAN K, GUO S, LIU Z, et al. Incorporating convolution designs into visual transformers [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 579-588.
17 WU H, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision transformers [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 22-31.
18 CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213-3223.
19 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
20 XIE E, WANG W, YU Z, et al. SegFormer: simple and efficient design for semantic segmentation with transformers [C]// Advances in Neural Information Processing Systems. [S.l.]: MIT Press, 2021: 12077-12090.
21 ZHANG H, WU C, ZHANG Z, et al. ResNeSt: split-attention networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New Orleans: IEEE, 2022: 2736-2746.
22 XIAO T, LIU Y, ZHOU B, et al. Unified perceptual parsing for scene understanding [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 418-434.
23 TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention [C]// Proceedings of the 38th International Conference on Machine Learning. [S.l.]: PMLR, 2021: 10347-10357.
24 LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10012-10022.
25 LIU Z, MAO H, WU C Y, et al. A convnet for the 2020s [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 11976-11986.
26 YANG J, LI C, ZHANG P, et al. Focal self-attention for local-global interactions in vision transformers [EB/OL]. (2021-07-01)[2022-05-06]. https://arxiv.org/pdf/2107.00641.pdf.
27 CHEN Z, ZHU Y, ZHAO C, et al. DPT: deformable patch-based transformer for visual recognition [C]// Proceedings of the 29th ACM International Conference on Multimedia. [S.l.]: ACM, 2021: 2899-2907.
28 STRUDEL R, GARCIA R, LAPTEV I, et al. Segmenter: transformer for semantic segmentation [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 7262-7272.
29 KIRILLOV A, GIRSHICK R, HE K, et al. Panoptic feature pyramid networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 6399-6408.
30 WANG W, XIE E, LI X, et al. PVT v2: improved baselines with pyramid vision transformer [J]. Computational Visual Media, 2022, 8: 415-424
doi: 10.1007/s41095-022-0274-8
31 GUO M H, LU C Z, LIU Z N, et al. Visual attention network [EB/OL]. (2022-02-20)[2022-05-16]. https://arxiv.org/pdf/2202.09741.pdf.
32 JAIN J, SINGH A, ORLOV N, et al. SeMask: semantically masked transformers for semantic segmentation [EB/OL]. (2021-12-23)[2022-05-23]. https://arxiv.org/pdf/2112.12782.pdf.
33 YUAN Y, CHEN X, WANG J. Object-contextual representations for semantic segmentation [C]// European Conference on Computer Vision. [S.l.]: Springer, 2020: 173-190.