Multi-task environment perception algorithm for autonomous driving based on axial attention
Shenchong LI 1, Xinhua ZENG 2,*, Chuanqu LIN 1
1. School of Information Engineering, Huzhou University, Huzhou 313000, China; 2. Academy for Engineering and Technology, Fudan University, Shanghai 200433, China
A new algorithm based on a shared backbone network was proposed to meet the environment perception requirements of autonomous driving and to improve the synergy among multiple task models. An axial attention mechanism was added to the backbone network to establish connections between global key points while keeping feature extraction lightweight, enhancing the location representation of the model. An adaptive weight allocation method combined with a three-dimensional attention mechanism was devised to mitigate the information conflict among features of different scales during multi-scale information extraction. The loss function was optimized for hard sample regions, strengthening the capacity of the proposed algorithm to capture fine details in those regions. Experimental results on the BDD100K dataset showed that, compared with YOLOP, the proposed algorithm improved the mean average precision of traffic target detection (at IoU=50%) by 3.3 percentage points, the mIoU of drivable area segmentation by 1.0 percentage point, and the accuracy of lane line detection by 6.7 percentage points, while reaching an inference speed of 223.7 frames per second. The proposed algorithm performs well in traffic target detection, drivable area segmentation, and lane line detection, and achieves a good balance between detection accuracy and inference speed.
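To make the coupling of the adaptive weight allocation with the parameter-free three-dimensional attention more concrete, the following minimal PyTorch sketch combines a SimAM-style reweighting (following the energy formulation of [26]) with a learnable, normalized fusion of two same-resolution feature maps. The names simam and AdaptiveWeightFusion, the restriction to two input branches, and the ReLU-normalized scalar weights are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hypothetical sketch: SimAM-style 3D attention applied after an
# adaptive, learnable-weight fusion of two feature maps.
import torch
import torch.nn as nn


def simam(x: torch.Tensor, e_lambda: float = 1e-4) -> torch.Tensor:
    """Parameter-free 3D attention following the SimAM energy function."""
    _, _, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # (x - mu)^2 per channel
    v = d.sum(dim=(2, 3), keepdim=True) / n             # spatial variance estimate
    e_inv = d / (4 * (v + e_lambda)) + 0.5              # inverse energy of each neuron
    return x * torch.sigmoid(e_inv)                     # reweight every neuron in 3D


class AdaptiveWeightFusion(nn.Module):
    """Fuse two same-shape feature maps with learnable, normalized weights,
    then apply SimAM so features from conflicting scales are re-balanced."""

    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))            # one scalar weight per branch
        self.eps = eps

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.w)                          # keep weights non-negative
        w = w / (w.sum() + self.eps)                    # normalize so weights sum to ~1
        return simam(w[0] * x1 + w[1] * x2)


# Usage: fuse two pyramid features already resized to the same resolution.
if __name__ == "__main__":
    f1 = torch.randn(1, 128, 40, 40)
    f2 = torch.randn(1, 128, 40, 40)
    print(AdaptiveWeightFusion()(f1, f2).shape)         # torch.Size([1, 128, 40, 40])
```

Under these assumptions the fusion adds only two scalar parameters per fusion point, which is consistent with the abstract's emphasis on keeping feature extraction lightweight.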
Shenchong LI, Xinhua ZENG, Chuanqu LIN. Multi-task environment perception algorithm for autonomous driving based on axial attention. Journal of Zhejiang University (Engineering Science), 2025, 59(4): 769–777.
Fig.1 Framework of multi-task environment perception algorithm for autonomous driving based on axial attention
Fig.2 Framework of Sea-Attention module
Fig.3 Structure of improved cross stage partial module
Fig.4 Structure of adaptive-weight fusion module
Fig.5 Modular structure of SimAM 3D attention mechanism
Fig.6 Structure of decoupled detection head
Fig.7 Examples of different scenarios and weather images in BDD100K dataset
Algorithm     R/%    mAP50/%   FPS/(frame·s⁻¹)
YOLOv5s*      86.8   77.2      458.5
YOLOv8s*      82.2   75.1      420.4
HybridNets    92.8   77.3      69.6
YOLOP         89.2   76.5      214.5
T-YOLOP       89.1   79.8      223.7

Tab.1 Traffic target detection results of different algorithms on BDD100K dataset
Algorithm               mIoU/%   FPS/(frame·s⁻¹)
DeepLabV3 (ResNet18)*   88.23    191.0
BiSeNet (ResNet18)*     89.67    349.3
STDC (STDC1446)*        91.06    274.5
HybridNets              90.5     69.6
YOLOP                   91.5     214.5
T-YOLOP                 92.5     223.7

Tab.2 Drivable area segmentation results of different algorithms on BDD100K dataset
Algorithm    Acc/%   IoU/%
ENet*        34.12   14.6
SCNN*        35.79   15.8
ENet-SAD*    36.56   16.0
STDC*        65.05   23.8
HybridNets   85.4    31.6
YOLOP        70.5    26.2
T-YOLOP      77.2    26.8

Tab.3 Lane line detection results of different algorithms on BDD100K dataset
Fig.8 Comparison of different algorithms for traffic environment perception in daytime road scenes
Fig.9 Comparison of different algorithms for traffic environment perception in night road scenes
Model No.   AFM   Loss   Sea-Attention   Head   SimAM   R/%    mAP50/%   mIoU/%   Acc/%   IoU/%   FPS/(frame·s⁻¹)
1           —     —      —               —      —       89.6   76.2      91.1     69.8    24.1    283.4
2           √     —      —               —      —       89.7   76.4      91.2     71.6    24.3    260.9
3           √     √      —               —      —       90.1   77.0      91.9     73.7    25.5    260.9
4           —     √      √               —      —       90.1   77.4      92.0     74.4    26.4    239.9
5           √     √      √               —      —       90.4   78.5      92.3     75.9    26.5    200.1
6           √     —      √               √      —       86.6   78.1      91.9     73.8    26.0    228.7
7           √     √      √               √      —       88.8   80.1      92.5     77.3    26.8    228.7
8           √     √      √               √      √       89.1   79.8      92.5     77.2    26.8    223.7

Tab.4 Experimental results of modular ablation for proposed algorithm
Fig.10 Visualizations before and after adding Sea-Attention module
[1] LIANG X, NIU M, HAN J, et al. Visual exemplar driven task-prompting for unified perception in autonomous driving [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 9611–9621.
[2] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector [C]// Computer Vision – ECCV 2016. [S.l.]: Springer, 2016: 21–37.
[3] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 779–788.
[4] REDMON J, FARHADI A. YOLOv3: an incremental improvement [EB/OL]. (2018–04–08)[2024–03–08]. https://www.arxiv.org/pdf/1804.02767.
[5] BOCHKOVSKIY A, WANG C Y, LIAO H M, et al. YOLOv4: optimal speed and accuracy of object detection [EB/OL]. (2020–04–23)[2024–01–22]. https://arxiv.org/pdf/2004.10934.
[6] JIANG Chao, ZHANG Hao, ZHANG Enze, et al. Pedestrian and vehicle target detection algorithm based on the improved YOLOv5s [J]. Journal of Yangzhou University: Natural Science Edition, 2022, 25(6): 45–49.
[7] HAN Jun, YUAN Xiaoping, WANG Zhun, et al. UAV dense small target detection algorithm based on YOLOv5s [J]. Journal of Zhejiang University: Engineering Science, 2023, 57(6): 1224–1233.
[8] GIRSHICK R. Fast R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 1440–1448.
[9] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031
[10] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848.
[11] LIN G, MILAN A, SHEN C, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 5168–5177.
[12] ZHAO H, SHI J, QI X, et al. Pyramid scene parsing network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6230–6239.
[13] YU C, WANG J, PENG C, et al. BiSeNet: bilateral segmentation network for real-time semantic segmentation [C]// Computer Vision – ECCV 2018. [S.l.]: Springer, 2018: 334–349.
[14] FAN M, LAI S, HUANG J, et al. Rethinking BiSeNet for real-time semantic segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 9711–9720.
[15] WANG Z, REN W, QIU Q. LaneNet: real-time lane detection networks for autonomous driving [EB/OL]. (2018–07–04)[2024–03–08]. https://arxiv.org/pdf/1807.01726.
[16] TEICHMANN M, WEBER M, ZOELLNER M, et al. MultiNet: real-time joint semantic reasoning for autonomous driving [EB/OL]. (2016–12–22)[2024–03–08]. https://arxiv.org/pdf/1612.07695.
[17] QIAN Y, DOLAN J M, YANG M. DLT-Net: joint detection of drivable areas, lane lines, and traffic objects [J]. IEEE Transactions on Intelligent Transportation Systems, 2020, 21(11): 4670–4679.
[18] WU D, LIAO M W, ZHANG W T, et al. YOLOP: you only look once for panoptic driving perception [J]. Machine Intelligence Research, 2022, 19(6): 550–562.
[19] VU D, NGO B, PHAN H. HybridNets: end-to-end perception network [EB/OL]. (2022–03–17)[2024–03–08]. https://arxiv.org/pdf/2203.09035.
[20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale [EB/OL]. (2021–06–03)[2024–03–08]. https://arxiv.org/pdf/2010.11929.
[21] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers [C]// Computer Vision – ECCV 2020. [S.l.]: Springer, 2020: 213–229.
[22] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 9992–10002.
[23] WAN Q, HUANG Z, LU J, et al. SeaFormer++: squeeze-enhanced axial transformer for mobile semantic segmentation [EB/OL]. (2023–06–09)[2024–03–08]. https://arxiv.org/pdf/2301.13156.
[24] LIU S, QI L, QIN H, et al. Path aggregation network for instance segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 8759–8768.
[25] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// Computer Vision – ECCV 2018. [S.l.]: Springer, 2018: 3–19.
[26] YANG L, ZHANG R Y, LI L, et al. SimAM: a simple, parameter-free attention module for convolutional neural networks [C]// Proceedings of the 38th International Conference on Machine Learning. Vienna: ACM, 2021: 11863–11874.
[27] DUAN K, BAI S, XIE L, et al. CenterNet: keypoint triplets for object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 6569–6578.
[28] ZHANG Y, ZHANG W, YU J, et al. Complete and accurate holly fruits counting using YOLOX object detection [J]. Computers and Electronics in Agriculture, 2022, 198: 107062.
[29] LI X, WANG W, WU L, et al. Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection [J]. Advances in Neural Information Processing Systems, 2020, 33: 21002–21012.