Journal of Zhejiang University (Engineering Science), 2025, Vol. 59, Issue 9: 1784-1792    DOI: 10.3785/j.issn.1008-973X.2025.09.002
Computer Technology
Traffic scene perception algorithm based on cross-task bidirectional feature interaction
Pengzhi LIN1, Ming’en ZHONG1,*, Kang FAN2, Jiawei TAN2, Zhiqiang LIN1
1. School of Mechanical and Automotive Engineering, Xiamen University of Technology, Xiamen 361024, China
2. School of Aerospace Engineering, Xiamen University, Xiamen 361005, China
Abstract:

A traffic scene perception algorithm, SDFormer++, was proposed for autonomous driving in urban street scenes to improve the overall performance of traffic scene perception. Based on the principle of cross-task bidirectional feature interaction, the algorithm exploits the explicit and implicit correlations between the semantic segmentation and depth estimation tasks. An interaction gated linear unit was added in the cross-task feature extraction stage to form high-quality task-specific feature representations. A multi-task feature interaction module using a bidirectional attention mechanism was constructed to enhance the initial task-specific features with feature information shared across the task domains. A multi-scale feature fusion module was designed to integrate information from different levels and obtain fine high-resolution features. Experimental results on the Cityscapes dataset showed that the algorithm achieved a mean intersection over union (mIoU) of 82.4% for pixel segmentation, a root mean square error (RMSE) of 4.453 and an absolute relative error (ARE) of 0.130 for depth estimation, and an average distance estimation error of 6.0% for five typical classes of traffic participants, outperforming mainstream multi-task algorithms such as InvPT++ and SDFormer.

Key words: cross-task interaction    multi-task learning    traffic environment perception    semantic segmentation    depth estimation
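This page does not include implementation code. The following is a minimal PyTorch-style sketch of the bidirectional cross-task interaction idea described above, in which each task's features attend to the other task's features and are enhanced through a residual connection. All class and function names (CrossTaskAttention, BidirectionalTaskInteraction) are illustrative assumptions, not the authors' implementation of the multi-task feature interaction module.

```python
# Hypothetical sketch of bidirectional cross-task feature interaction
# (names and design choices are assumptions, not the paper's code).
import torch
import torch.nn as nn


class CrossTaskAttention(nn.Module):
    """One direction of interaction: queries from the target task,
    keys/values from the source task."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target, source: (B, N, C) token sequences of task-specific features
        enhanced, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + enhanced)  # residual connection


class BidirectionalTaskInteraction(nn.Module):
    """Exchange information between segmentation and depth features
    in both directions, as sketched from the abstract's description."""

    def __init__(self, dim: int):
        super().__init__()
        self.seg_from_depth = CrossTaskAttention(dim)
        self.depth_from_seg = CrossTaskAttention(dim)

    def forward(self, seg_feat: torch.Tensor, depth_feat: torch.Tensor):
        seg_out = self.seg_from_depth(seg_feat, depth_feat)
        depth_out = self.depth_from_seg(depth_feat, seg_feat)
        return seg_out, depth_out


if __name__ == "__main__":
    seg = torch.randn(2, 1024, 256)    # (batch, tokens, channels)
    depth = torch.randn(2, 1024, 256)
    module = BidirectionalTaskInteraction(dim=256)
    seg2, depth2 = module(seg, depth)
    print(seg2.shape, depth2.shape)    # torch.Size([2, 1024, 256]) each
```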
Received: 2024-12-05    Published: 2025-08-25
CLC:  TP 391.4  
Funding: Supported by the Natural Science Foundation of Fujian Province (2023J011439).
Corresponding author: Ming’en ZHONG     E-mail: 2477541661@qq.com; zhongmingen@xmut.edu.cn
About the first author: Pengzhi LIN (born in 2000), male, master's degree candidate, engaged in research on machine vision and intelligent transportation. orcid.org/0009-0005-8197-9429. E-mail: 2477541661@qq.com

Cite this article:

Pengzhi LIN, Ming’en ZHONG, Kang FAN, Jiawei TAN, Zhiqiang LIN. Traffic scene perception algorithm based on cross-task bidirectional feature interaction. Journal of Zhejiang University (Engineering Science), 2025, 59(9): 1784-1792.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.09.002        https://www.zjujournals.com/eng/CN/Y2025/V59/I9/1784

Fig. 1  Overall architecture of the traffic scene perception algorithm SDFormer++
Fig. 2  Structure of the cross-task feature extraction module
Fig. 3  Structure of the multi-task feature interaction module
Fig. 4  Structure of the multi-scale feature fusion module
Module   mIoU/%   RMSE    ARE     Np/10^6   GFLOPs
MTL      73.2     5.355   0.227   66.3      131.1
+CFE     77.8     5.068   0.179   74.5      157.4
+MFI     78.6     4.790   0.168   75.6      168.4
+MFF     79.3     4.698   0.154   76.1      177.0
Table 1  Ablation results of different network components
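The metrics reported in Tables 1 to 5 follow their standard definitions. For reference, the sketch below (illustrative only, not the authors' evaluation code) computes mIoU from class-wise intersections and unions, and RMSE and ARE over valid depth pixels, using NumPy.

```python
# Standard-definition sketch of the reported metrics (not the paper's code).
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))


def depth_errors(pred: np.ndarray, gt: np.ndarray):
    """RMSE and absolute relative error (ARE) on valid (gt > 0) pixels."""
    mask = gt > 0
    rmse = np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2))
    are = np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask])
    return float(rmse), float(are)


if __name__ == "__main__":
    seg_pred = np.random.randint(0, 19, (512, 1024))
    seg_gt = np.random.randint(0, 19, (512, 1024))
    d_pred = np.random.uniform(1, 80, (512, 1024))
    d_gt = np.random.uniform(1, 80, (512, 1024))
    print(mean_iou(seg_pred, seg_gt, 19), depth_errors(d_pred, d_gt))
```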
Fig. 5  Visual comparison of attention patterns of different feature extraction modules
Module   Backbone   mIoU/%   RMSE    ARE     FPS
MFFS     Swin-S     75.8     4.811   0.176   30.8
MFFM     Swin-S     79.3     4.698   0.154   26.2
MFFL     Swin-S     79.5     4.662   0.151   10.6
Table 2  Ablation results of the multi-scale feature fusion module
Algorithm    Backbone     mIoU/%   RMSE    ARE     Np/10^6
JTR          SegNet       72.3     5.582   0.163   79.6
MTPSL        SegNet       73.6     5.135   0.165   84.5
DenseMTL     ResNet-101   75.0     6.649   0.194   124.3
SwinMTL      Swin-B       76.4     4.489   0.134   65.2
SDFormer     Swin-B       79.2     4.485   0.132   116.7
InvPT++      ViT-B        82.0     4.527   0.146   156.9
SDFormer++   Swin-B       82.4     4.453   0.130   129.4
Table 3  Performance comparison of different multi-task algorithms
Fig. 6  Comparison of inference results of SDFormer++, SDFormer and the second-best algorithm on the semantic segmentation task
Fig. 7  Comparison of inference results of SDFormer++, SDFormer and the second-best algorithm on the depth estimation task
Method            Backbone          mIoU/%   Np/10^6   GFLOPs
CSFNet-2          STDC2             76.3     19.4      47.8
WaveMix           WaveMix           80.7     63.2      161.5
DSNet-Base        DSNet-Base        82.0     68.0      226.6
CMX(B4)           MiT-B4            82.6     140.0     134.0
EfficientViT-B3   EfficientViT-L2   83.2     53.1      396.2
SDFormer++        Swin-B            82.4     129.4     272.5
Table 4  Performance comparison between SDFormer++ and single-task semantic segmentation algorithms
Method        Backbone   RMSE    ARE     Np/10^6   GFLOPs
Manydepth2    HRNet 16   5.827   0.097   123.1     246.4
DepthFormer   Swin-B     4.326   0.127   151.3     282.0
PixelFormer   Swin-B     4.258   0.115   146.1     346.4
SDFormer++    Swin-B     4.453   0.130   129.4     272.5
Table 5  Performance comparison between SDFormer++ and single-task depth estimation algorithms
Method       Pedestrian   Cyclist   Car   Bus   Truck   Avg
DenseMTL     7.7          8.6       8.8   6.7   8.2     8.0
JTR          8.5          6.5       8.0   7.1   6.7     7.3
MTPSL        8.6          6.3       7.7   5.4   7.3     7.0
SwinMTL      7.3          6.8       7.0   6.4   6.7     6.8
InvPT++      6.6          6.2       6.6   5.8   6.3     6.3
SDFormer     6.1          5.4       7.4   5.2   6.5     6.1
SDFormer++   5.8          5.5       7.2   4.9   6.4     6.0
Table 6  Comparison of distance estimation errors (MRE/%) for different traffic participants
Distance   Pedestrian   Cyclist   Car   Bus   Truck   Avg
           3.3          4.7       5.0   4.1   3.0     4.0
           5.2          5.3       4.5   5.0   5.4     5.1
           11.7         10.2      9.5   9.3   9.8     10.1
Table 7  Distance estimation errors (MRE/%) over different distance ranges
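Tables 6 and 7 report a mean relative error (MRE) of distance estimation per class of traffic participant. The exact evaluation protocol is not given on this page; the sketch below shows one plausible way to compute such a per-class MRE under the assumption that each object's distance is taken as the median predicted depth inside its mask and compared with a ground-truth distance. The function name per_class_mre and the data layout are hypothetical.

```python
# Illustrative sketch (assumption, not the paper's protocol): estimate each
# object's distance as the median predicted depth inside its mask, then
# average the relative errors per class.
from collections import defaultdict
import numpy as np


def per_class_mre(pred_depth, instance_masks, gt_distances, classes):
    """pred_depth: (H, W) depth map; instance_masks: list of (H, W) bool masks;
    gt_distances: true distance per instance; classes: class label per instance."""
    errors = defaultdict(list)
    for mask, gt_d, cls in zip(instance_masks, gt_distances, classes):
        est_d = np.median(pred_depth[mask])           # robust per-object distance
        errors[cls].append(abs(est_d - gt_d) / gt_d)  # relative error
    return {cls: 100.0 * float(np.mean(errs)) for cls, errs in errors.items()}


if __name__ == "__main__":
    depth = np.random.uniform(5, 60, (256, 512))
    masks = [np.zeros((256, 512), dtype=bool) for _ in range(2)]
    masks[0][100:120, 200:240] = True   # e.g. a car region
    masks[1][50:90, 300:360] = True     # e.g. a bus region
    print(per_class_mre(depth, masks, gt_distances=[20.0, 35.0],
                        classes=["car", "bus"]))
```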
Fig. 8  Distance prediction results for typical traffic participants under different lighting and weather conditions
1 JIN Lisheng, HUA Qiang, GUO Baicang, et al Multi-target tracking of vehicles based on optimized DeepSort[J]. Journal of Zhejiang University: Engineering Science, 2021, 55 (6): 1056-1064
2 XIAO X, ZHAO Y, ZHANG F, et al BASeg: boundary aware semantic segmentation for autonomous driving[J]. Neural Networks, 2023, 157 (12): 460- 470
3 ABDIGAPPOROV S, MIRALIEV S, KAKANI V, et al Joint multiclass object detection and semantic segmentation for autonomous driving[J]. IEEE Access, 2023, 11: 37637- 37649
doi: 10.1109/ACCESS.2023.3266284
4 LV J, TONG H, PAN Q, et al. Importance-aware image segmentation-based semantic communication for autonomous driving [EB/OL]. (2024-01-06) [2024-12-05]. https://arxiv.org/pdf/2401.10153.
5 LAHIRI S, REN J, LIN X Deep learning-based stereopsis and monocular depth estimation techniques: a review[J]. Vehicles, 2024, 6 (1): 305- 351
doi: 10.3390/vehicles6010013
6 JUN W, YOO J, LEE S Synthetic data enhancement and network compression technology of monocular depth estimation for real-time autonomous driving system[J]. Sensors, 2024, 24 (13): 4205
doi: 10.3390/s24134205
7 RAJAPAKSHA U, SOHEL F, LAGA H, et al Deep learning-based depth estimation methods from monocular image and videos: a comprehensive survey[J]. ACM Computing Surveys, 2024, 56 (12): 1- 51
8 FENG Y, SUN X, DIAO W, et al Height aware understanding of remote sensing images based on cross-task interaction[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2023, 195 (4): 233- 249
9 SAMANT R M, BACHUTE M R, GITE S, et al Framework for deep learning-based language models using multi-task learning in natural language understanding: a systematic literature review and future directions[J]. IEEE Access, 2022, 10: 17078- 17097
doi: 10.1109/ACCESS.2022.3149798
10 ZHANG H, LIU H, KIM C Semantic and instance segmentation in coastal urban spatial perception: a multi-task learning framework with an attention mechanism[J]. Sustainability, 2024, 16 (2): 833
doi: 10.3390/su16020833
11 AGAND P, MAHDAVIAN M, SAVVA M, et al. LeTFuser: light-weight end-to-end Transformer-based sensor fusion for autonomous driving with multi-task learning [EB/OL]. (2023-10-19) [2024-12-05]. https://arxiv.org/pdf/2310.13135.
12 YAO J, LI Y, LIU C, et al EHSINet: efficient high-order spatial interaction multi-task network for adaptive autonomous driving perception[J]. Neural Processing Letters, 2023, 55 (8): 11353- 11370
doi: 10.1007/s11063-023-11379-x
13 TAN G, WANG C, LI Z, et al A multi-task network based on dual-neck structure for autonomous driving perception[J]. Sensors, 2024, 24 (5): 1547
doi: 10.3390/s24051547
14 WEI X, CHEN Y. Joint extraction of long-distance entity relation by aggregating local- and semantic-dependent features [J]. Wireless Communications and Mobile Computing, 2022: 3763940.
15 YE H, XU D InvPT++: inverted pyramid multi-task Transformer for visual scene understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46 (12): 7493- 7508
doi: 10.1109/TPAMI.2024.3397031
16 FAN Kang, ZHONG Ming’en, TAN Jiawei, et al Traffic scene perception algorithm with joint semantic segmentation and depth estimation[J]. Journal of Zhejiang University: Engineering Science, 2024, 58 (4): 684-695
17 CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213–3223.
18 NISHI K, KIM J, LI W, et al. Joint-task regularization for partially labeled multi-task learning [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 16152–16162.
19 LI W, LIU X, BILEN H. Learning multiple dense prediction tasks from partially annotated data [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 18857–18867.
20 LOPES I, VU T H, CHARETTE R. Cross-task attention mechanism for dense multi-task learning [C]// IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 2328–2337.
21 TAGHAVI P, LANGARI R, PANDEY G. SwinMTL: a shared architecture for simultaneous depth estimation and semantic segmentation from monocular camera images [EB/OL]. (2024-03-15) [2024-12-05]. https://arxiv.org/abs/2403.10662.
22 QASHQAI D, MOUSAVIAN E, SHOKOUHI S B, et al. CSFNet: a cosine similarity fusion network for real-time RGB-X semantic segmentation of driving scenes [EB/OL]. (2024-07-01) [2024-12-05]. https://arxiv.org/pdf/2407.01328.
23 JEEVAN P, VISWANATHAN K, SETHI A. WaveMix: a resource-efficient neural network for image analysis [EB/OL]. (2024-03-28) [2024-12-05]. https://arxiv.org/pdf/2205.14375.
24 GUO Z, BIAN L, HUANG X, et al. DSNet: a novel way to use atrous convolutions in semantic segmentation [EB/OL]. (2024-06-06) [2024-12-05]. https://arxiv.org/pdf/2406.03702.
25 ZHANG J, LIU H, YANG K, et al CMX: cross-modal fusion for RGB-X semantic segmentation with Transformers[J]. IEEE Transactions on Intelligent Transportation Systems, 2023, 24 (12): 14679- 14694
doi: 10.1109/TITS.2023.3300537
26 CAI H, LI J, HU M, et al. EfficientViT: multi-scale linear attention for high-resolution dense prediction [EB/OL]. (2024-02-06) [2024-12-05]. https://arxiv.org/pdf/2205.14756.
27 ZHOU K, BIAN J, XIE Q, et al. Manydepth2: motion-aware self-supervised multi-frame monocular depth estimation in dynamic scenes [EB/OL]. (2024-10-11) [2024-12-05]. https://arxiv.org/pdf/2312.15268v6.
28 LI Z, CHEN Z, LIU X, et al DepthFormer: exploiting long-range correlation and local information for accurate monocular depth estimation[J]. Machine Intelligence Research, 2023, 20 (6): 837- 854
doi: 10.1007/s11633-023-1458-0