Journal of ZheJiang University (Engineering Science)  2024, Vol. 58 Issue (4): 684-695    DOI: 10.3785/j.issn.1008-973X.2024.04.004
    
Traffic scene perception algorithm with joint semantic segmentation and depth estimation
Kang FAN1, Ming’en ZHONG1,*, Jiawei TAN2, Zehui ZHAN1, Yan FENG1
1. Fujian Key Laboratory of Bus Advanced Design and Manufacture, Xiamen University of Technology, Xiamen 361024, China
2. School of Aerospace Engineering, Xiamen University, Xiamen 361102, China

Abstract  

Inspired by the idea that feature information from different pixel-level visual tasks can guide and optimize each other, a traffic scene perception algorithm for joint semantic segmentation and depth estimation was proposed based on multi-task learning theory. A bidirectional cross-task attention mechanism was proposed to explicitly model the global correlation between tasks, guiding the network to fully mine and exploit the complementary pattern information between the two tasks. A multi-task Transformer was constructed to enhance the spatial global representation of task-specific features, implicitly model the cross-task global context, and promote the fusion of complementary pattern information between tasks. An encoder-decoder fusion upsampling module was designed to effectively fuse the spatial details contained in the encoder and generate fine-grained high-resolution task-specific features. Experimental results on the Cityscapes dataset showed that the proposed algorithm reached a mean IoU of 79.2% for semantic segmentation, a root mean square error of 4.485 for depth estimation, and a mean relative error of 6.1% for distance estimation of five typical classes of traffic participants. Compared with mainstream algorithms, the proposed algorithm achieves better overall performance at lower computational complexity.
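As an illustration of the bidirectional cross-task attention mechanism described above, the following is a minimal PyTorch sketch; the single-head formulation and the separate per-direction projections are assumptions for illustration, not necessarily the paper's exact design. Each task's features form the queries against the other task's keys and values, so the cross-task global correlation is modeled explicitly in both directions.

```python
import torch
import torch.nn as nn


class BidirectionalCrossTaskAttention(nn.Module):
    """Minimal single-head sketch of bidirectional cross-task attention:
    each task's features query the other task's features, and the result
    is added back to the querying task as a residual."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        # Separate projections per direction (an assumption for illustration).
        self.q_seg = nn.Linear(dim, dim)
        self.kv_dep = nn.Linear(dim, 2 * dim)
        self.q_dep = nn.Linear(dim, dim)
        self.kv_seg = nn.Linear(dim, 2 * dim)

    @staticmethod
    def _attend(q, kv, scale):
        k, v = kv.chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * scale  # (B, N, N) cross-task correlation
        return attn.softmax(dim=-1) @ v

    def forward(self, f_seg, f_dep):
        # f_seg, f_dep: (B, N, C) flattened spatial features of the two tasks.
        seg = f_seg + self._attend(self.q_seg(f_seg), self.kv_dep(f_dep), self.scale)
        dep = f_dep + self._attend(self.q_dep(f_dep), self.kv_seg(f_seg), self.scale)
        return seg, dep
```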



Key words: perception of traffic environment; multi-task learning; semantic segmentation; depth estimation; Transformer
Received: 06 September 2023      Published: 27 March 2024
CLC:  TP 391.4  
Fund: Supported by the Natural Science Foundation of Fujian Province (2023J011439, 2019J01859).
Corresponding Authors: Ming’en ZHONG     E-mail: 476863019@qq.com;zhongmingen@xmut.edu.cn
Cite this article:

Kang FAN,Ming’en ZHONG,Jiawei TAN,Zehui ZHAN,Yan FENG. Traffic scene perception algorithm with joint semantic segmentation and depth estimation. Journal of ZheJiang University (Engineering Science), 2024, 58(4): 684-695.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.04.004     OR     https://www.zjujournals.com/eng/Y2024/V58/I4/684


Fig.1 Overall structure of SDFormer
Fig.2 Overall structure of bidirectional cross-task attention module
Fig.3 Overall structure of multi-task Transformer module
Fig.4 Overall structure of encoder-decoder fusion upsampling module
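As a hedged sketch of the encoder-decoder fusion upsampling step in Fig.4, the following assumes a common upsample-concatenate-convolve form (the module's exact design may differ): the low-resolution task-specific feature is bilinearly upsampled to the encoder feature's resolution and fused with that skip connection, so spatial detail from the encoder is recovered in the high-resolution output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionUpsample(nn.Module):
    """Sketch: bilinearly upsample the decoder's task feature, concatenate
    the same-resolution encoder feature (skip connection), and fuse the
    pair with a 3x3 convolution."""

    def __init__(self, dec_ch: int, enc_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(dec_ch + enc_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, dec_feat, enc_feat):
        # dec_feat: (B, dec_ch, h, w); enc_feat: (B, enc_ch, H, W) with H > h.
        dec_feat = F.interpolate(dec_feat, size=enc_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([dec_feat, enc_feat], dim=1))
```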
| Model | MIoU/% | RMSE | ARE | Np/10^6 | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| STL-Seg | 76.3 | – | – | 50.5 | 132.5 |
| STL-Depth | – | 4.910 | 0.174 | 50.5 | 132.5 |
| MTL | 73.2 | 5.355 | 0.227 | 66.3 | 131.1 |
| +BCTA | 75.8 | 5.028 | 0.186 | 72.6 | 153.8 |
| +MT-T | 77.0 | 4.894 | 0.161 | 73.4 | 162.7 |
| +EDFU | 77.6 | 4.781 | 0.156 | 74.5 | 167.0 |
Tab.1 Results of SDFormer ablation experiments
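For reference, the metrics reported in Tab.1 and the following tables can be assumed to follow the standard formulations:

```latex
\mathrm{MIoU}=\frac{1}{K}\sum_{k=1}^{K}\frac{TP_k}{TP_k+FP_k+FN_k},\quad
\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{d}_i-d_i\bigr)^{2}},\quad
\mathrm{ARE}=\frac{1}{N}\sum_{i=1}^{N}\frac{\lvert\hat{d}_i-d_i\rvert}{d_i}
```

where $TP_k$, $FP_k$ and $FN_k$ are the per-class true positives, false positives and false negatives over the $K$ semantic classes, and $\hat{d}_i$, $d_i$ are the predicted and ground-truth depths over $N$ valid pixels.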
Fig.5 Comparison of semantic segmentation effects between MTL and SDFormer
Fig.6 Comparison of depth estimation effects between MTL and SDFormer
| Module | MIoU/% | RMSE | ARE | GFLOPs |
| --- | --- | --- | --- | --- |
| FPT | 76.1 | 5.058 | 0.204 | 169.2 |
| SPT | 76.7 | 4.924 | 0.180 | 173.8 |
| BCTA | 77.6 | 4.781 | 0.156 | 167.0 |
Tab.2 Performance comparison of different bidirectional cross-task feature interaction modules
| Model | MIoU/% | RMSE | ARE | GFLOPs | f/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| SDFormer-noT | 76.0 | 4.957 | 0.184 | 155.2 | 31.7 |
| SDFormer-s | 76.7 | 4.870 | 0.175 | 170.1 | 18.5 |
| SDFormer-noD | 77.8 | 4.815 | 0.153 | 178.3 | 9.7 |
| SDFormer | 77.6 | 4.781 | 0.156 | 167.0 | 23.6 |
Tab.3 Results of multi-task Transformer module ablation experiments
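A plausible minimal sketch of the multi-task Transformer compared in Tab.3, assuming a joint-sequence formulation (an assumption for illustration, not the paper's confirmed design): self-attention runs over the concatenation of both tasks' token sequences, so every token can attend to tokens of the other task, implicitly modeling the cross-task global context while enhancing each task's spatial global representation.

```python
import torch
import torch.nn as nn


class MultiTaskTransformerBlock(nn.Module):
    """Sketch: a standard pre-norm Transformer block applied to the
    concatenation of both tasks' token sequences; the output is split
    back into the two task-specific feature streams."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, f_seg, f_dep):
        n = f_seg.shape[1]
        x = torch.cat([f_seg, f_dep], dim=1)   # joint sequence of length 2N
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x[:, :n], x[:, n:]              # split back into the two tasks
```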
| Encoder | MIoU/% | RMSE | ARE | Np/10^6 | f/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| ResNet-50 | 74.8 | 5.287 | 0.226 | 54.1 | 65.1 |
| ResNet-101 | 76.4 | 5.053 | 0.184 | 73.1 | 34.8 |
| Swin-T | 75.3 | 5.128 | 0.206 | 58.7 | 47.6 |
| Swin-S | 77.6 | 4.781 | 0.156 | 74.5 | 23.6 |
| Swin-B | 79.2 | 4.485 | 0.132 | 116.7 | 11.2 |
Tab.4 Experimental results of performance comparison for different encoders
| Method | Encoder | MIoU/% | Np/10^6 | GFLOPs |
| --- | --- | --- | --- | --- |
| PSPNet | ResNet-101 | 78.5 | 68.5 | 546.8 |
| SETR | ViT-B | 78.0 | 97.6 | 457.2 |
| SegFormer | MiT-B3 | 80.3 | 48.2 | 224.6 |
| Mask2Former | Swin-B | 81.2 | 107.0 | 343.5 |
| SDFormer | Swin-B | 79.2 | 116.7 | 247.5 |
Tab.5 Comparison of semantic segmentation performance between SDFormer and single-task algorithms
| Method | Encoder | RMSE | ARE | Np/10^6 | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| Lap-Depth | ResNet-101 | 4.553 | 0.154 | 73.4 | 186.5 |
| DepthFormer | Swin-B | 4.326 | 0.127 | 151.3 | 282.0 |
| pixelFormer | Swin-B | 4.258 | 0.115 | 146.1 | 346.4 |
| SDFormer | Swin-B | 4.485 | 0.132 | 116.7 | 247.5 |
Tab.6 Comparison of depth estimation performance between SDFormer and single-task algorithms
| Method | Encoder | MIoU/% | RMSE | ARE | Np/10^6 | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| MTAN | ResNet-101 | 74.4 | 5.846 | 0.246 | 88.3 | 204.9 |
| PAD-Net | ResNet-101 | 76.3 | 5.275 | 0.208 | 84.5 | 194.5 |
| PSD-Net | ResNet-101 | 75.6 | 5.581 | 0.225 | 101.8 | 224.6 |
| MTI-Net | ResNet-101 | 76.8 | 5.135 | 0.194 | 112.4 | 269.4 |
| InvPT | Swin-B | 77.8 | 4.657 | 0.154 | 136.9 | 431.5 |
| SDFormer | Swin-B | 79.2 | 4.485 | 0.132 | 116.7 | 247.5 |
Tab.7 Performance comparison results of different multi-task algorithms
Fig.7 Comparison of semantic segmentation effects between SDFormer and InvPT
Fig.8 Comparison of depth estimation effects between SDFormer and InvPT
| Method | MRE/% (person) | MRE/% (rider) | MRE/% (car) | MRE/% (bus) | MRE/% (truck) | mMRE/% |
| --- | --- | --- | --- | --- | --- | --- |
| MTAN | 14.5 | 15.5 | 13.0 | 16.1 | 14.7 | 14.7 |
| PAD-Net | 12.6 | 13.3 | 10.7 | 12.4 | 11.3 | 12.0 |
| PSD-Net | 11.7 | 12.6 | 11.8 | 12.7 | 12.2 | 12.2 |
| MTI-Net | 12.3 | 12.8 | 9.8 | 11.4 | 10.7 | 11.4 |
| InvPT | 7.6 | 7.2 | 6.6 | 7.8 | 8.3 | 7.5 |
| SDFormer | 6.1 | 5.4 | 7.4 | 5.2 | 6.5 | 6.1 |
Tab.8 Comparison of distance estimation errors of different multi-task algorithms
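The mean relative error (MRE) in Tab.8 can be read as the relative distance error per traffic-participant class. Below is a minimal NumPy sketch, assuming a pixel-level proxy in which predicted and ground-truth depths are compared within each class's segmentation mask; per-object evaluation would first pool depth over each instance mask before comparing.

```python
import numpy as np


def class_mre(pred_depth, gt_depth, seg_mask, class_id):
    """Mean relative error |d_hat - d| / d over valid pixels of one class.
    The pixel-level formulation is an assumption for illustration."""
    m = (seg_mask == class_id) & (gt_depth > 0)
    if not m.any():
        return float("nan")
    return float(np.mean(np.abs(pred_depth[m] - gt_depth[m]) / gt_depth[m]))


# mMRE over the five traffic-participant classes (Cityscapes train ids:
# 11 person, 12 rider, 13 car, 14 truck, 15 bus):
# mmre = np.nanmean([class_mre(p, g, s, c) for c in (11, 12, 13, 14, 15)])
```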
| Distance range | MRE/% (person) | MRE/% (rider) | MRE/% (car) | MRE/% (bus) | MRE/% (truck) | mMRE/% |
| --- | --- | --- | --- | --- | --- | --- |
|  | 3.5 | 4.8 | 5.1 | 4.3 | 3.2 | 4.1 |
|  | 5.4 | 5.6 | 4.8 | 5.2 | 5.7 | 5.3 |
|  | 13.7 | 11.2 | 9.8 | 9.5 | 10.1 | 10.8 |
Tab.9 Distance estimation errors of SDFormer in different distance ranges
Fig.9 Display of distance prediction effects of SDFormer in different distance ranges
[1] LI Linhui, QIAN Bo, LIAN Jing, et al. Study on traffic scene semantic segmentation method based on convolutional neural network [J]. Journal on Communications, 2018, 39(4): 2018053.
[2] PAN H, HONG Y, SUN W, et al. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes [J]. IEEE Transactions on Intelligent Transportation Systems, 2023, 24(3): 3448–3460. doi: 10.1109/TITS.2022.3228042
[3] ZHANG Haibo, CAI Lei, REN Junping, et al. Efficient and adaptive semantic segmentation network based on Transformer [J]. Journal of Zhejiang University: Engineering Science, 2023, 57(6): 1205–1214.
[4] EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2366–2374.
[5] SONG M, LIM S, KIM W. Monocular depth estimation using Laplacian pyramid-based depth residuals [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(11): 4381–4393. doi: 10.1109/TCSVT.2021.3049869
[6] LI Z, CHEN Z, LIU X, et al. DepthFormer: exploiting long-range correlation and local information for accurate monocular depth estimation [J]. Machine Intelligence Research, 2023, 20(6): 837–854. doi: 10.1007/s11633-023-1458-0
[7] WANG P, SHEN X, LIN Z, et al. Towards unified depth and semantic prediction from a single image [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 2800–2809.
[8] SHOOURI S, YANG M, FAN Z, et al. Efficient computation sharing for multi-task visual scene understanding [EB/OL]. (2023-08-14)[2023-08-23]. https://arxiv.org/pdf/2303.09663.pdf.
[9] VANDENHENDE S, GEORGOULIS S, VAN GANSBEKE W, et al. Multi-task learning for dense prediction tasks: a survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3614–3633.
[10] YE H, XU D. Inverted pyramid multi-task transformer for dense scene understanding [C]// European Conference on Computer Vision. [S.l.]: Springer, 2022: 514–530.
[11] XU D, OUYANG W, WANG X, et al. PAD-Net: multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 675–684.
[12] VANDENHENDE S, GEORGOULIS S, VAN GOOL L. MTI-Net: multi-scale task interaction networks for multi-task learning [C]// European Conference on Computer Vision. [S.l.]: Springer, 2020: 527–543.
[13] ZHOU L, CUI Z, XU C, et al. Pattern-structure diffusion for multi-task learning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 4514–4523.
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 5998–6008.
[15] ZHANG X, ZHOU L, LI Y, et al. Transfer vision patterns for multi-task pixel learning [C]// Proceedings of the 29th ACM International Conference on Multimedia. [S.l.]: ACM, 2021: 97–106.
[16] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10012–10022.
[17] ZHANG X, CHEN Y, ZHANG H, et al. When visual disparity generation meets semantic segmentation: a mutual encouragement approach [J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 22(3): 1853–1867. doi: 10.1109/TITS.2020.3027556
[18] LAINA I, RUPPRECHT C, BELAGIANNIS V, et al. Deeper depth prediction with fully convolutional residual networks [C]// 2016 Fourth International Conference on 3D Vision. Stanford: IEEE, 2016: 239–248.
[19] CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213–3223.
[20] XIE E, WANG W, YU Z, et al. SegFormer: simple and efficient design for semantic segmentation with transformers [C]// Advances in Neural Information Processing Systems. [S.l.]: MIT Press, 2021: 12077–12090.
[21] WANG W, XIE E, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 568–578.
[22] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[23] ZHAO H, SHI J, QI X, et al. Pyramid scene parsing network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 2881–2890.
[24] ZHENG S, LU J, ZHAO H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 6881–6890.
[25] CHENG B, MISRA I, SCHWING A G, et al. Masked-attention mask transformer for universal image segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1290–1299.
[26] AGARWAL A, ARORA C. Attention attention everywhere: monocular depth prediction with skip attention [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 5861–5870.