Journal of Zhejiang University (Engineering Science), 2024, Vol. 58, Issue (4): 684-695    DOI: 10.3785/j.issn.1008-973X.2024.04.004
Computer and Control Engineering
Traffic scene perception algorithm with joint semantic segmentation and depth estimation
Kang FAN 1, Ming’en ZHONG 1,*, Jiawei TAN 2, Zehui ZHAN 1, Yan FENG 1
1. Fujian Key Laboratory of Bus Advanced Design and Manufacture, Xiamen University of Technology, Xiamen 361024, China
2. School of Aerospace Engineering, Xiamen University, Xiamen 361102, China
Abstract:

Inspired by the idea that feature information from different pixel-level visual tasks can guide and optimize each other, a traffic scene perception algorithm that jointly performs semantic segmentation and depth estimation was proposed on the basis of multi-task learning. A bidirectional cross-task attention mechanism was proposed to explicitly model the global correlation between the two tasks, guiding the network to fully mine and exploit complementary pattern information across tasks. A multi-task Transformer was constructed to strengthen the spatial global representation of task-specific features, implicitly model cross-task global context, and promote the fusion of complementary pattern information between tasks. An encoder-decoder fusion upsampling module was designed to effectively fuse the spatial details preserved in the encoder and generate fine high-resolution task-specific features. Experimental results on the Cityscapes dataset showed that the proposed algorithm reached a mean IoU of 79.2% for semantic segmentation, a root mean square error of 4.485 for depth estimation, and a mean relative error of 6.1% for distance estimation of five typical classes of traffic participants, achieving better overall performance than existing mainstream algorithms at lower computational complexity.

Key words: traffic environment perception; multi-task learning; semantic segmentation; depth estimation; Transformer
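To make the bidirectional cross-task attention idea from the abstract concrete, the following is a minimal, illustrative PyTorch sketch, not the authors' implementation: the segmentation and depth features each query the other task's features with standard multi-head cross-attention, so complementary information flows in both directions. All module names, dimensions, and hyperparameters here are assumptions.

```python
import torch
import torch.nn as nn

class BidirectionalCrossTaskAttention(nn.Module):
    """Illustrative sketch of bidirectional cross-task attention (assumed design):
    each task's feature tokens attend to the other task's tokens, with residual
    connections preserving the original task-specific information."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.seg_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_seg = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_seg = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, seg_feat: torch.Tensor, depth_feat: torch.Tensor):
        # seg_feat, depth_feat: (B, H*W, C) flattened feature maps of the two tasks
        seg_upd, _ = self.seg_from_depth(seg_feat, depth_feat, depth_feat)
        depth_upd, _ = self.depth_from_seg(depth_feat, seg_feat, seg_feat)
        return self.norm_seg(seg_feat + seg_upd), self.norm_depth(depth_feat + depth_upd)

if __name__ == "__main__":
    bcta = BidirectionalCrossTaskAttention()
    s = torch.randn(2, 64 * 32, 256)   # e.g. a 64x32 feature map flattened into tokens
    d = torch.randn(2, 64 * 32, 256)
    s2, d2 = bcta(s, d)
    print(s2.shape, d2.shape)          # both torch.Size([2, 2048, 256])
```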
Received: 2023-09-06    Published: 2024-03-27
CLC: TP 391.4
Funding: Natural Science Foundation of Fujian Province (2023J011439, 2019J01859)
Corresponding author: Ming’en ZHONG, E-mail: 476863019@qq.com; zhongmingen@xmut.edu.cn
First author: Kang FAN (1999—), male, master's student, engaged in machine vision and intelligent transportation research. ORCID: 0009-0006-2101-8973. E-mail: 476863019@qq.com
Cite this article:

Kang FAN, Ming’en ZHONG, Jiawei TAN, Zehui ZHAN, Yan FENG. Traffic scene perception algorithm with joint semantic segmentation and depth estimation. Journal of Zhejiang University (Engineering Science), 2024, 58(4): 684-695.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.04.004        https://www.zjujournals.com/eng/CN/Y2024/V58/I4/684

Fig. 1  Overall architecture of SDFormer
Fig. 2  Structure of the bidirectional cross-task attention module
Fig. 3  Structure of the multi-task Transformer module
Fig. 4  Structure of the encoder-decoder fusion upsampling module
Model | MIoU/% | RMSE | ARE | Np/10^6 | GFLOPs
STL-Seg | 76.3 | — | — | 50.5 | 132.5
STL-Depth | — | 4.910 | 0.174 | 50.5 | 132.5
MTL | 73.2 | 5.355 | 0.227 | 66.3 | 131.1
+BCTA | 75.8 | 5.028 | 0.186 | 72.6 | 153.8
+MT-T | 77.0 | 4.894 | 0.161 | 73.4 | 162.7
+EDFU | 77.6 | 4.781 | 0.156 | 74.5 | 167.0
Table 1  Ablation results of SDFormer
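The tables report MIoU for segmentation and RMSE/ARE for depth estimation. For reference, a minimal sketch of these metrics using their standard definitions is given below; the paper's exact evaluation protocol (valid-pixel masking, depth cap, etc.) is not stated in this section, so the details here are assumptions.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union, averaged over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def depth_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Root mean square error over pixels with valid (positive) ground-truth depth."""
    valid = gt > 0
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))

def depth_are(pred: np.ndarray, gt: np.ndarray) -> float:
    """Absolute relative error over pixels with valid ground-truth depth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))
```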
Fig. 5  Comparison of semantic segmentation results between MTL and SDFormer
Fig. 6  Comparison of depth estimation results between MTL and SDFormer
Module | MIoU/% | RMSE | ARE | GFLOPs
FPT | 76.1 | 5.058 | 0.204 | 169.2
SPT | 76.7 | 4.924 | 0.180 | 173.8
BCTA | 77.6 | 4.781 | 0.156 | 167.0
Table 2  Performance comparison of different bidirectional cross-task feature interaction modules
Model | MIoU/% | RMSE | ARE | GFLOPs | f/(frame·s⁻¹)
SDFormer-noT | 76.0 | 4.957 | 0.184 | 155.2 | 31.7
SDFormer-s | 76.7 | 4.870 | 0.175 | 170.1 | 18.5
SDFormer-noD | 77.8 | 4.815 | 0.153 | 178.3 | 9.7
SDFormer | 77.6 | 4.781 | 0.156 | 167.0 | 23.6
Table 3  Ablation results of the multi-task Transformer module
Encoder | MIoU/% | RMSE | ARE | Np/10^6 | f/(frame·s⁻¹)
ResNet-50 | 74.8 | 5.287 | 0.226 | 54.1 | 65.1
ResNet-101 | 76.4 | 5.053 | 0.184 | 73.1 | 34.8
Swin-T | 75.3 | 5.128 | 0.206 | 58.7 | 47.6
Swin-S | 77.6 | 4.781 | 0.156 | 74.5 | 23.6
Swin-B | 79.2 | 4.485 | 0.132 | 116.7 | 11.2
Table 4  Performance comparison of different encoders
Method | Encoder | MIoU/% | Np/10^6 | GFLOPs
PSPNet | ResNet-101 | 78.5 | 68.5 | 546.8
SETR | ViT-B | 78.0 | 97.6 | 457.2
SegFormer | MiT-B3 | 80.3 | 48.2 | 224.6
Mask2Former | Swin-B | 81.2 | 107.0 | 343.5
SDFormer | Swin-B | 79.2 | 116.7 | 247.5
Table 5  Semantic segmentation performance of SDFormer compared with single-task algorithms
Method | Encoder | RMSE | ARE | Np/10^6 | GFLOPs
Lap-Depth | ResNet-101 | 4.553 | 0.154 | 73.4 | 186.5
DepthFormer | Swin-B | 4.326 | 0.127 | 151.3 | 282.0
pixelFormer | Swin-B | 4.258 | 0.115 | 146.1 | 346.4
SDFormer | Swin-B | 4.485 | 0.132 | 116.7 | 247.5
Table 6  Depth estimation performance of SDFormer compared with single-task algorithms
Method | Encoder | MIoU/% | RMSE | ARE | Np/10^6 | GFLOPs
MTAN | ResNet-101 | 74.4 | 5.846 | 0.246 | 88.3 | 204.9
PAD-Net | ResNet-101 | 76.3 | 5.275 | 0.208 | 84.5 | 194.5
PSD-Net | ResNet-101 | 75.6 | 5.581 | 0.225 | 101.8 | 224.6
MTI-Net | ResNet-101 | 76.8 | 5.135 | 0.194 | 112.4 | 269.4
InvPT | Swin-B | 77.8 | 4.657 | 0.154 | 136.9 | 431.5
SDFormer | Swin-B | 79.2 | 4.485 | 0.132 | 116.7 | 247.5
Table 7  Performance comparison of different multi-task algorithms
Fig. 7  Comparison of semantic segmentation results between SDFormer and InvPT
Fig. 8  Comparison of depth estimation results between SDFormer and InvPT
Method | MRE/% (person) | MRE/% (rider) | MRE/% (car) | MRE/% (bus) | MRE/% (truck) | mMRE/%
MTAN | 14.5 | 15.5 | 13.0 | 16.1 | 14.7 | 14.7
PAD-Net | 12.6 | 13.3 | 10.7 | 12.4 | 11.3 | 12.0
PSD-Net | 11.7 | 12.6 | 11.8 | 12.7 | 12.2 | 12.2
MTI-Net | 12.3 | 12.8 | 9.8 | 11.4 | 10.7 | 11.4
InvPT | 7.6 | 7.2 | 6.6 | 7.8 | 8.3 | 7.5
SDFormer | 6.1 | 5.4 | 7.4 | 5.2 | 6.5 | 6.1
Table 8  Comparison of distance estimation errors of different multi-task algorithms
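Table 8 gives the mean relative error (MRE) of distance estimation per traffic participant class. The exact measurement protocol is not described in this section; the sketch below shows one plausible way such a per-object error could be computed, by summarizing the predicted depth over the pixels the segmentation assigns to an object and comparing it with a reference distance. The function name, inputs, and the median-depth summary are assumptions, not the authors' procedure.

```python
import numpy as np

def object_distance_mre(pred_depth: np.ndarray, instance_mask: np.ndarray,
                        gt_distances: dict) -> dict:
    """Relative distance error per segmented object instance (assumed protocol).

    pred_depth    : (H, W) predicted depth map in metres
    instance_mask : (H, W) integer mask, one id per segmented object instance
    gt_distances  : instance id -> reference distance in metres
    """
    errors = {}
    for obj_id, gt_dist in gt_distances.items():
        pixels = instance_mask == obj_id
        if pixels.any() and gt_dist > 0:
            est = float(np.median(pred_depth[pixels]))  # robust summary of object depth
            errors[obj_id] = abs(est - gt_dist) / gt_dist
    return errors
```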
Distance range | MRE/% (person) | MRE/% (rider) | MRE/% (car) | MRE/% (bus) | MRE/% (truck) | mMRE/%
— | 3.5 | 4.8 | 5.1 | 4.3 | 3.2 | 4.1
— | 5.4 | 5.6 | 4.8 | 5.2 | 5.7 | 5.3
— | 13.7 | 11.2 | 9.8 | 9.5 | 10.1 | 10.8
Table 9  Distance estimation errors of SDFormer over different distance ranges
Fig. 9  Distance prediction results of SDFormer over different distance ranges