Traffic scene perception algorithm based on cross-task bidirectional feature interaction

doi:10.3785/j.issn.1008-973X.2025.09.002

Journal of ZheJiang University (Engineering Science)

2025, Vol. 59

Issue (9): 1784-1792 DOI: 10.3785/j.issn.1008-973X.2025.09.002

Traffic scene perception algorithm based on cross-task bidirectional feature interaction

Pengzhi LIN1(

),Ming’en ZHONG1,*(

),Kang FAN2,Jiawei TAN2,Zhiqiang LIN1

1. School of Mechanical and Automotive Engineering, Xiamen University of Technology, Xiamen 361024, China
2. School of Aerospace Engineering, Xiamen University, Xiamen 361005, China

Download:

HTML

PDF(2358KB) HTML
Export: BibTeX | EndNote (RIS)

Abstract

A traffic scene perception algorithm (SDFormer++) based on the principle of cross-task bidirectional feature interaction for autonomous driving in urban street scenarios was proposed by leveraging the explicit and implicit correlations between the semantic segmentation tasks and the depth estimation tasks to improve the overall performance of traffic scene perception algorithms. An interaction-gated linear unit was added into the cross-task feature extraction stage to form high-quality task-specific feature representations. A multi-task feature interaction module that used the bidirectional attention mechanism was constructed to enhance the initial task-specific features by utilizing the feature information of shared cross-domain tasks. A multi-scale feature fusion module was designed to integrate information at different levels to obtain fine high-resolution features. Experimental results on the Cityscapes dataset showed that the algorithm achieved a mean intersection over union (mIoU) of 82.4% for pixel segmentation, a root mean square error (RMSE) of 4.453 for depth estimation, an absolute relative error (ARE) of 0.130 for depth estimation, and an average distance estimation error of 6.0% for five typical traffic participants, all of which outperformed the existing mainstream multi-task algorithms such as InvPT++ and SDFormer.

Key words： cross-task interaction multi-task learning traffic environment perception semantic segmentation depth estimation

Received: 05 December 2024 Published: 25 August 2025

CLC:

TP 391.4

Fund: 福建省自然科学基金资助项目（2023J011439）.

Corresponding Authors: Ming’en ZHONG E-mail: 2477541661@qq.com;zhongmingen@xmut.edu.cn

	Service
	E-mail this article
	Add to my bookshelf
	Add to citation manager
	E-mail Alert
	RSS
	Articles by authors
	Pengzhi LIN
	Ming’en ZHONG
	Kang FAN
	Jiawei TAN
	Zhiqiang LIN

Cite this article:

Pengzhi LIN,Ming’en ZHONG,Kang FAN,Jiawei TAN,Zhiqiang LIN. Traffic scene perception algorithm based on cross-task bidirectional feature interaction. Journal of ZheJiang University (Engineering Science), 2025, 59(9): 1784-1792.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.09.002 OR https://www.zjujournals.com/eng/Y2025/V59/I9/1784

基于跨任务双向特征交互的交通场景感知算法

为了提高交通场景感知算法的整体性能，利用语义分割任务和深度估计任务之间的显式和隐式相关性，依据跨任务双向特征交互原理，提出面向城市街道自动驾驶的感知算法SDFormer++. 在跨任务特征提取阶段加入交互门控线性单元，形成高质量的特定任务特征表达；构建多任务特征交互模块，应用双向注意力机制，借助跨域共享任务的特征信息来增强初始特定任务特征；设计多尺度特征融合模块，整合不同层次的信息，以获取精细的高分辨率特征. 在Cityscapes数据集上的实验结果表明，算法的像素分割平均交并比mIoU为82.4%，深度估计平均平方根误差RMSE和绝对相对误差ARE分别为4.453和0.130，针对5类典型交通参与者的平均距离估计误差为6.0%，均超越InvPT++、SDFormer等主流多任务算法.

关键词： 跨任务交互, 多任务学习, 交通环境感知, 语义分割, 深度估计

Fig.1 Overall structure of traffic scene perception algorithm SDFormer++

Fig.2 Structure diagram of cross-task feature extraction module

Fig.3 Structure diagram of multi-task feature interaction module

Fig.4 Structure diagram of multi-scale feature fusion module

Tab.1 Ablation study results of different network components

Fig.5 Visualization comparison of attention patterns in different feature extraction modules

Tab.2 Ablation experimental results of multi-scale feature fusion module

Tab.3 Performance comparison results of different multi-task algorithms

Fig.6 Comparison of semantic segmentation inference performance of SDFormer++, SDFormer and suboptimal algorithm

Fig.7 Comparison of depth estimation inference performance of SDFormer++, SDFormer and suboptimal algorithm

Tab.4 Performance comparison of SDFormer++ and single-task semantic segmentation algorithms

Tab.5 Performance comparison of SDFormer++ and single-task depth estimation algorithms

Tab.6 Comparison of distance estimation errors for different traffic participants

Tab.7 Distance estimation errors under different distance ranges

Fig.8 Distance prediction performance of typical traffic participants under different lighting and weather conditions


[1]	金立生, 华强, 郭柏苍, 等基于优化DeepSort的前方车辆多目标跟踪[J]. 浙江大学学报: 工学版, 2021, 55 (6): 1056- 1064 JIN Lisheng, HUA Qiang, GUO Baicang, et al Multi-target tracking of vehicles based on optimized DeepSort[J]. Journal of Zhejiang University: Engineering Science, 2021, 55 (6): 1056- 1064

[2]	XIAO X, ZHAO Y, ZHANG F, et al BASeg: boundary aware semantic segmentation for autonomous driving[J]. Neural Networks, 2023, 157 (12): 460- 470

[3]	ABDIGAPPOROV S, MIRALIEV S, KAKANI V, et al Joint multiclass object detection and semantic segmentation for autonomous driving[J]. IEEE Access, 2023, 11: 37637- 37649 doi: 10.1109/ACCESS.2023.3266284

[4]	LV J, TONG H, PAN Q, et al. Importance-aware image segmentation-based semantic communication for autonomous driving [EB/OL]. (2024-01-06) [2024-12-05]. https://arxiv.org/pdf/2401.10153.

[5]	LAHIRI S, REN J, LIN X Deep learning-based stereopsis and monocular depth estimation techniques: a review[J]. Vehicles, 2024, 6 (1): 305- 351 doi: 10.3390/vehicles6010013

[6]	JUN W, YOO J, LEE S Synthetic data enhancement and network compression technology of monocular depth estimation for real-time autonomous driving system[J]. Sensors, 2024, 24 (13): 4205 doi: 10.3390/s24134205

[7]	RAJAPAKSHA U, SOHEL F, LAGA H, et al Deep learning-based depth estimation methods from monocular image and videos: a comprehensive survey[J]. ACM Computing Surveys, 2024, 56 (12): 1- 51

[8]	FENG Y, SUN X, DIAO W, et al Height aware understanding of remote sensing images based on cross-task interaction[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2023, 195 (4): 233- 249

[9]	SAMANT R M, BACHUTE M R, GITE S, et al Framework for deep learning-based language models using multi-task learning in natural language understanding: a systematic literature review and future directions[J]. IEEE Access, 2022, 10: 17078- 17097 doi: 10.1109/ACCESS.2022.3149798

[10]	ZHANG H, LIU H, KIM C Semantic and instance segmentation in coastal urban spatial perception: a multi-task learning framework with an attention mechanism[J]. Sustainability, 2024, 16 (2): 833 doi: 10.3390/su16020833

[11]	AGAND P, MAHDAVIAN M, SAVVA M, et al. LeTFuser: light-weight end-to-end Transformer-based sensor fusion for autonomous driving with multi-task learning [EB/OL]. (2023-10-19) [2024-12-05]. https://arxiv.org/pdf/2310.13135.

[12]	YAO J, LI Y, LIU C, et al EHSINet: efficient high-order spatial interaction multi-task network for adaptive autonomous driving perception[J]. Neural Processing Letters, 2023, 55 (8): 11353- 11370 doi: 10.1007/s11063-023-11379-x

[13]	TAN G, WANG C, LI Z, et al A multi-task network based on dual-neck structure for autonomous driving perception[J]. Sensors, 2024, 24 (5): 1547 doi: 10.3390/s24051547

[14]	WEI X, CHEN Y. Joint extraction of long-distance entity relation by aggregating local- and semantic-dependent features [J]. Wireless Communications and Mobile Computing, 2022: 3763940.

[15]	YE H, XU D InvPT++: inverted pyramid multi-task Transformer for visual scene understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46 (12): 7493- 7508 doi: 10.1109/TPAMI.2024.3397031

[16]	范康, 钟铭恩, 谭佳威, 等联合语义分割和深度估计的交通场景感知算法[J]. 浙江大学学报: 工学版, 2024, 58 (4): 684- 695 FAN Kang, ZHONG Ming’en, TAN Jiawei, et al Traffic scene perception algorithm with joint semantic segmentation and depth estimation[J]. Journal of Zhejiang University: Engineering Science, 2024, 58 (4): 684- 695

[17]	CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213–3223.

[18]	NISHI K, KIM J, LI W, et al. Joint-task regularization for partially labeled multi-task learning [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 16152–16162.

[19]	LI W, LIU X, BILEN H. Learning multiple dense prediction tasks from partially annotated data [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 18857–18867.

[20]	LOPES I, VU T H, CHARETTE R. Cross-task attention mechanism for dense multi-task learning [C]// IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 2328–2337.

[21]	TAGHAVI P, LANGARI R, PANDEY G. SwinMTL: a shared architecture for simultaneous depth estimation and semantic segmentation from monocular camera images [EB/OL]. (2024-03-15) [2024-12-05]. https://arxiv.org/abs/2403.10662.

[22]	QASHQAI D, MOUSAVIAN E, SHOKOUHI S B, et al. CSFNet: a cosine similarity fusion network for real-time RGB-X semantic segmentation of driving scenes [EB/OL]. (2024-07-01) [2024-12-05]. https://arxiv.org/pdf/2407.01328.

[23]	JEEVAN P, VISWABATHAN K, SETHI A. WaveMix: a resource-efficient neural network for image analysis [EB/OL]. (2024-03-28) [2024-12-05]. https://arxiv.org/pdf/2205.143755.

[24]	GUO Z, BIAN L, HUANG X, et al. DSNet: a novel way to use atrous convolutions in semantic segmentation [EB/OL]. (2024-06-06) [2024-12-05]. https://arxiv.org/pdf/2406.03702.

[25]	ZHANG J, LIU H, YANG K, et al CMX: cross-modal fusion for RGB-X semantic segmentation with Transformers[J]. IEEE Transactions on Intelligent Transportation Systems, 2023, 24 (12): 14679- 14694 doi: 10.1109/TITS.2023.3300537

[26]	CAI H, LI J, HU M, et al. EfficientViT: multi-scale linear attention for high-resolution dense prediction [EB/OL]. (2024-02-06) [2024-12-05]. https://arxiv.org/pdf/2205.14756.

[27]	ZHOU K, BIAN J, XIE Q, et al. Manydepth2: motion-aware self-supervised multi-frame monocular depth estimation in dynamic scenes [EB/OL]. (2024-10-11) [2024-12-05]. https://arxiv.org/pdf/2312.15268v6.

[28]	LI Z, CHEN Z, LIU X, et al DepthFormer: exploiting long-range correlation and local information for accurate monocular depth estimation[J]. Machine Intelligence Research, 2023, 20 (6): 837- 854 doi: 10.1007/s11633-023-1458-0

[1]	Xinyu WEI,Lei RAO,Guangyu FAN,Niansheng CHEN,Songlin CHENG,Dingyu YANG. High-precision real-time semantic segmentation network for UAV remote sensing images[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(7): 1411-1420.

[2]	Shenchong LI,Xinhua ZENG,Chuanqu LIN. Multi-task environment perception algorithm for autonomous driving based on axial attention[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(4): 769-777.

[3]	Zhengyu GU,Feifei LAI,Chen GENG,Ximing WANG,Yakang DAI. Knowledge-guided infarct segmentation of ischemic stroke[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(4): 814-820.

[4]	Zhenli ZHANG,Xinkai HU,Fan LI,Zhicheng FENG,Zhichao CHEN. Semantic segmentation algorithm for multiscale remote sensing images based on CNN and Efficient Transformer[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(4): 778-786.

[5]	Fan LI,Jie YANG,Zhicheng FENG,Zhichao CHEN,Yunxiao FU. Pantograph-catenary contact point detection method based on image recognition[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(9): 1801-1810.

[6]	Jun YANG,Chen ZHANG. Semantic segmentation of 3D point cloud based on boundary point estimation and sparse convolution neural network[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(6): 1121-1132.

[7]	Yi LIU,Yidan CHEN,Lin GAO,Jiao HONG. Lightweight road extraction model based on multi-scale feature fusion[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 951-959.

[8]	Kang FAN,Ming’en ZHONG,Jiawei TAN,Zehui ZHAN,Yan FENG. Traffic scene perception algorithm with joint semantic segmentation and depth estimation[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(4): 684-695.

[9]	Canlin LI,Wenjiao ZHANG,Zhiwen SHAO,Lizhuang MA,Xinyue WANG. Semantic segmentation method on nighttime road scene based on Trans-nightSeg[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(2): 294-303.

[10]	Longxue LIANG,Chenglong HE,Xiaosuo WU,Haowen YAN. Remote sensing image semantic segmentation network based on global information extraction and reconstruction[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(11): 2270-2279.

[11]	Zhicheng FENG,Jie YANG,Zhichao CHEN. Urban road network extraction method based on lightweight Transformer[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(1): 40-49.

[12]	Chun-juan LIU,Ze QIAO,Hao-wen YAN,Xiao-suo WU,Jia-wei WANG,Yu-qiang XIN. Semantic segmentation network for remote sensing image based on multi-scale mutual attention[J]. Journal of ZheJiang University (Engineering Science), 2023, 57(7): 1335-1344.

[13]	Hao-ran GUO,Ji-chang GUO,Yu-dong WANG. Lightweight semantic segmentation network for underwater image[J]. Journal of ZheJiang University (Engineering Science), 2023, 57(7): 1278-1286.

[14]	Hai-bo ZHANG,Lei CAI,Jun-ping REN,Ru-yan WANG,Fu LIU. Efficient and adaptive semantic segmentation network based on Transformer[J]. Journal of ZheJiang University (Engineering Science), 2023, 57(6): 1205-1214.

[15]	Qiao-hong CHEN,Jia-jin SUN,Yang-bo LOU,Zhi-jian FANG. Multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer[J]. Journal of ZheJiang University (Engineering Science), 2023, 57(12): 2421-2429.

Viewed

Full text

Abstract

Cited

Shared

Discussed