Journal of ZheJiang University (Engineering Science)  2023, Vol. 57 Issue (6): 1205-1214    DOI: 10.3785/j.issn.1008-973X.2023.06.016
    
Efficient and adaptive semantic segmentation network based on Transformer
Hai-bo ZHANG1,2, Lei CAI1,2, Jun-ping REN1,2, Ru-yan WANG1, Fu LIU3
1. School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2. Chongqing Key Laboratory of Ubiquitous Sensing and Networking, Chongqing 400065, China
3. Chongqing Urban Lighting Center, Chongqing 400023, China

Abstract  

Two problems exist in Transformer-based semantic segmentation networks: a significant drop in segmentation accuracy caused by variation of the input resolution, and the high computational complexity of self-attention. An adaptive convolutional positional encoding module was proposed, exploiting the property that zero-padding convolution retains positional information. A joint resampling self-attention module was proposed to reduce the computational burden, exploiting the property that the dimensions of specific matrices cancel each other out in the self-attention computation. A decoder was designed to fuse feature maps from different stages, yielding an efficient segmentation network, EA-Former, that adapts to inputs of different resolutions. The mean intersection over union of EA-Former was 51.0% on ADE20K and 83.9% on Cityscapes. Compared with mainstream segmentation methods, the proposed network achieved competitive accuracy with lower computational complexity, and the degradation of segmentation performance caused by variation of the input resolution was alleviated.
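To make the two modules named above concrete, the following is a minimal, illustrative PyTorch sketch. It assumes that the adaptive convolutional positional encoding follows the zero-padded depthwise-convolution pattern of conditional positional encodings [14-17], and that the joint resampling self-attention shortens the key/value sequence before the attention product, in the spirit of the spatial-reduction attention in [20, 30]; the class names, reduction ratio, and layer choices are assumptions for illustration, not the authors' released implementation.

# Illustrative sketch only: module names, the reduction ratio, and layer
# choices are assumptions; this is not the authors' released code.
import torch
import torch.nn as nn


class AdaptiveConvPositionalEncoding(nn.Module):
    """Zero-padded depthwise convolution used as an implicit positional encoding.

    Zero padding at the borders lets the convolution recover positional
    information (refs. [14]-[17]), so the encoding works at any input
    resolution instead of being fixed like learned or sinusoidal embeddings.
    """

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # groups=dim -> depthwise; padding=kernel_size//2 -> zero padding
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) token sequence with N = h * w
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)           # tokens -> feature map
        return x + self.proj(feat).flatten(2).transpose(1, 2)  # residual add


class JointResamplingSelfAttention(nn.Module):
    """Self-attention whose keys and values are jointly resampled to M << N
    tokens, so the attention product costs O(N*M*d) instead of O(N^2*d)."""

    def __init__(self, dim: int, num_heads: int = 8, reduction: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # strided convolution resamples the key/value feature map once,
        # shared by K and V ("joint" resampling)
        self.resample = nn.Conv2d(dim, dim, reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape
        d = c // self.num_heads
        q = self.q(x).reshape(b, n, self.num_heads, d).transpose(1, 2)

        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.resample(feat).flatten(2).transpose(1, 2)   # (B, M, C)
        m = feat.shape[1]
        kv = self.kv(self.norm(feat)).reshape(b, m, 2, self.num_heads, d)
        k, v = kv.permute(2, 0, 3, 1, 4)                        # each (B, heads, M, d)

        attn = (q @ k.transpose(-2, -1)) * self.scale           # (B, heads, N, M)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

With N = H·W query tokens and M = N/r² jointly resampled key/value tokens, the attention matrix becomes N×M instead of N×N, which is where the reduction in computational complexity comes from; and because the depthwise convolution sees zero padding only at the image border, the positional signal it injects adapts automatically to the input resolution.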



Key words: semantic segmentation; Transformer; self-attention; position encoding; neural network
Received: 24 June 2022      Published: 30 June 2023
CLC:  TP 391  
Fund:  National Natural Science Foundation of China (62271094); Program for Changjiang Scholars and Innovative Research Team in University (IRT16R72); Chongqing Entrepreneurship and Innovation Program for Returned Overseas Scholars (cx2020059)
Cite this article:

Hai-bo ZHANG, Lei CAI, Jun-ping REN, Ru-yan WANG, Fu LIU. Efficient and adaptive semantic segmentation network based on Transformer. Journal of ZheJiang University (Engineering Science), 2023, 57(6): 1205-1214.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2023.06.016     OR     https://www.zjujournals.com/eng/Y2023/V57/I6/1205


Fig.1 EA-Former network structure
Fig.2 Adaptive convolutional positional encoding structure
Fig.3 Joint resampling self-attention structure
Fig.4 Feature fusion decoder structure
Algorithm  Backbone  N/10⁶  GFLOPs  mIoU/%
FCN[1] ResNet-101[19] 68.6 275.7 39.9
PSPNet[3] ResNet-101 68.1 256.4 44.3
DeepLab-V3+[7] ResNet-101 62.7 255.1 45.4
DeepLab-V3+ ResNeSt-101[21] 66.3 262.9 46.9
UperNet[22] DeiT[23] 120.5 90.1 45.3
UperNet Swin-S[24] 81.0 259.3 49.3
UperNet Convnext[25] 60.2 234.6 46.1
UperNet Focal-B[26] 126.0 49.0
SETR[11] ViT[12] 318.5 213.6 47.3
DPT[27] ViT 109.7 171.0 46.9
Segmenter Mask[28] ViT 102.5 71.1 49.6
Semantic FPN[29] PVTv2-B3[30] 49.0 62.0 47.3
Semantic FPN VAN-B3[31] 49.0 68.0 48.1
SeMask-B FPN[32] SeMask Swin 96.0 107.0 49.4
Segformer[20] MiT[20] 83.9 110.5 50.1
EA-Former MiT 136.4 61.3 49.3
Segformer* MiT 83.9 172.7 52.1
EA-Former* MiT 136.4 95.8 51.0
Tab.1 Model evaluation results of different segmentation models on ADE20K dataset
Algorithm  Backbone  GFLOPs  FPS/(frame·s⁻¹)  mIoU/%
SETR[11] MiT-B0[20] 25.3 28.7 34.8
Segformer[20] MiT-B0 8.6 50.5 37.5
UperNet[22] MiT-B0 28.5 29.6 39.3
Segmenter[28] MiT-B0 7.9 49.2 35.9
Semantic FPN[29] MiT-B0 23.0 46.4 37.1
EA-Former-T MiT-B0 7.1 51.1 38.1
Tab.2 Model evaluation results of lightweight segmentation models on ADE20K dataset
Fig.5 Changes in mean intersection over union during training on ADE20K dataset for different segmentation models
Algorithm  Backbone  N/10⁶  GFLOPs  mIoU/%
FCN[1] ResNet-101[19] 68.4 619.6 75.5
PSPNet[3] ResNet-101 67.9 576.3 79.7
DeepLabV3+[7] ResNet-101 62.5 571.6 80.6
CCNet[9] ResNet-101 68.8 625.7 79.4
UperNet[22] ResNet-101 85.4 576.5 80.1
DeepLabV3[6] ResNeSt-101[21] 90.8 798.9 80.4
OCRNet[33] HRNet[34] 70.3 364.7 80.7
SETR[11] ViT[12] 318.3 818.2 79.3
Segformer[20] MiT[20] 83.9 597.6 81.8
EA-Former MiT 136.4 137.7 82.1
Segformer# MiT 83.9 735.2 84.1
EA-Former# MiT 136.4 191.8 83.9
Tab.3 Model evaluation results of different segmentation models on Cityscapes dataset
Fig.6 Comparison of image segmentation effects of different models on Cityscapes dataset
Algorithm  Backbone  FPS/(frame·s⁻¹): ADE20K  Cityscapes
FCN[1] ResNet-101[19] 20.7 1.7
PSPNet[3] ResNet-101 20.3 1.8
DeepLabV3+[7] ResNet-101 18.7 1.6
DeepLabV3+ ResNeSt-101[21] 16.1 2.5
UperNet[22] Swin-S[24] 20.1
UperNet Convnext[25] 17.1
SETR[11] ViT[12] 8.3
DPT[27] ViT 20.5
Segmenter Mask[28] ViT 21.3
Segformer[20] MiT[20] 18.6 2.5
EA-Former MiT 21.9 2.8
Segformer* MiT 15.7
EA-Former* MiT 18.1
UperNet ResNet-101 2.3
CCNet[9] ResNet-101 1.7
DeepLabV3[6] ResNeSt-101 2.4
SETR ViT 0.4
Segformer# MiT 2.3
EA-Former# MiT 2.5
Tab.4 Evaluation results of inference speed for different segmentation models on ADE20K dataset and Cityscapes dataset
Resolution  mIoU/%: SETR[11] (ViT[12])  EA-Former (without ACPE)  EA-Former (with ACPE)
$ 768 \times 768 $ 79.3 81.9 82.1
$ 832 \times 832 $ 79.0 81.7 82.0
$1\;024 \times 1\;024$ 78.4 81.2 81.8
$1\;024 \times 2\;048$ 75.4 78.6 81.2
Tab.5 Influence of adaptive convolutional position encoding module on model segmentation accuracy
Joint resampling self-attention  Dimensionality reduction  FPS/(frame·s⁻¹)  GFLOPs  mIoU/%
× × 32.4 8.4 37.5
√ × 11.5 9.0 37.6
√ √ 51.1 7.1 38.1
Tab.6 Influence of joint resampling self-attention module on algorithm performance of EA-Former-T
[1]   LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3431-3440.
[2]   EVERINGHAM M, ESLAMI S M, VAN GOOL L, et al. The Pascal visual object classes challenge: a retrospective[J]. International Journal of Computer Vision, 2015, 111: 98-136
doi: 10.1007/s11263-014-0733-5
[3]   ZHAO H, SHI J, QI X, et al. Pyramid scene parsing network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 2881-2890.
[4]   CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs [EB/OL]. (2016-06-07)[2022-04-25]. https://arxiv.org/pdf/1412.7062.pdf.
[5]   CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848
[6]   CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. (2017-06-17)[2022-04-26]. https://arxiv.org/abs/1706.05587.
[7]   CHEN L C, ZHU Y, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 801-818.
[8]   ZHAO H, ZHANG Y, LIU S, et al. PSANet: point-wise spatial attention network for scene parsing [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 267-283.
[9]   HUANG Z, WANG X, HUANG L, et al. CCNet: criss-cross attention for semantic segmentation [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 603-612.
[10]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 5998-6008.
[11]   ZHENG S, LU J, ZHAO H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers [C]// Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Nashville: IEEE, 2021: 6881-6890.
[12]   DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. (2020-10-22)[2022-04-27]. https://arxiv.org/pdf/2010.11929.pdf.
[13]   ZHOU B, ZHAO H, PUIG X, et al. Scene parsing through ADE20K dataset [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 633-641.
[14]   ISLAM M A, JIA S, BRUCE N D B. How much position information do convolutional neural networks encode? [EB/OL]. (2020-01-22)[2022-04-28]. https://arxiv.org/pdf/2001.08248.pdf.
[15]   CHU X, TIAN Z, ZHANG B, et al. Conditional positional encodings for vision transformers [EB/OL]. (2021-02-22)[2022-04-29]. https://arxiv.org/pdf/2102.10882.pdf.
[16]   YUAN K, GUO S, LIU Z, et al. Incorporating convolution designs into visual transformers [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 579-588.
[17]   WU H, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision transformers [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 22-31.
[18]   CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213-3223.
[19]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[20]   XIE E, WANG W, YU Z, et al. SegFormer: simple and efficient design for semantic segmentation with transformers [C]// Advances in Neural Information Processing Systems. [S.l.]: MIT Press, 2021: 12077-12090.
[21]   ZHANG H, WU C, ZHANG Z, et al. ResNeSt: split-attention networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New Orleans: IEEE, 2022: 2736-2746.
[22]   XIAO T, LIU Y, ZHOU B, et al. Unified perceptual parsing for scene understanding [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 418-434.
[23]   TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention [C]// Proceedings of the 38th International Conference on Machine Learning. [S.l.]: PMLR, 2021: 10347-10357.
[24]   LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10012-10022.
[25]   LIU Z, MAO H, WU C Y, et al. A convnet for the 2020s [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 11976-11986.
[26]   YANG J, LI C, ZHANG P, et al. Focal self-attention for local-global interactions in vision transformers [EB/OL]. (2021-07-01)[2022-05-06]. https://arxiv.org/pdf/2107.00641.pdf.
[27]   CHEN Z, ZHU Y, ZHAO C, et al. DPT: deformable patch-based transformer for visual recognition [C]// Proceedings of the 29th ACM International Conference on Multimedia. [S.l.]: ACM, 2021: 2899-2907.
[28]   STRUDEL R, GARCIA R, LAPTEV I, et al. Segmenter: transformer for semantic segmentation [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 7262-7272.
[29]   KIRILLOV A, GIRSHICK R, HE K, et al. Panoptic feature pyramid networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 6399-6408.
[30]   WANG W, XIE E, LI X, et al. PVT v2: improved baselines with pyramid vision transformer[J]. Computational Visual Media, 2022, 8: 415-424
doi: 10.1007/s41095-022-0274-8
[31]   GUO M H, LU C Z, LIU Z N, et al. Visual attention network [EB/OL]. (2022-02-20)[2022-05-16]. https://arxiv.org/pdf/2202.09741.pdf.
[32]   JAIN J, SINGH A, ORLOV N, et al. SeMask: semantically masked transformers for semantic segmentation [EB/OL]. (2021-12-23)[2022-05-23]. https://arxiv.org/pdf/2112.12782.pdf.
[33]   YUAN Y, CHEN X, WANG J. Object-contextual representations for semantic segmentation [C]// European Conference on Computer Vision. [S.l.]: Springer, 2020: 173-190.