Traffic scene perception algorithm with joint semantic segmentation and depth estimation
Kang FAN1, Ming'en ZHONG1,*, Jiawei TAN2, Zehui ZHAN1, Yan FENG1
1. Fujian Key Laboratory of Bus Advanced Design and Manufacture, Xiamen University of Technology, Xiamen 361024, China
2. School of Aerospace Engineering, Xiamen University, Xiamen 361102, China
Inspired by the idea that feature information from different pixel-level visual tasks can guide and optimize each other, a traffic scene perception algorithm for joint semantic segmentation and depth estimation was proposed based on multi-task learning. A bidirectional cross-task attention mechanism was proposed to explicitly model the global correlation between tasks and to guide the network to fully explore and exploit the complementary pattern information they share. A multi-task Transformer was constructed to enhance the spatial global representation of task-specific features, implicitly model the cross-task global context, and promote the fusion of complementary pattern information between tasks. An encoder-decoder fusion upsampling module was designed to effectively fuse the spatial details preserved in the encoder and generate fine-grained, high-resolution task-specific features. Experimental results on the Cityscapes dataset showed that the proposed algorithm reached a mean IoU of 79.2% for semantic segmentation, a root mean square error of 4.485 for depth estimation, and a mean relative error of 6.1% for distance estimation of five typical classes of traffic participants. Compared with mainstream algorithms, the proposed algorithm achieves better overall performance with lower computational complexity.
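As a rough illustration of the bidirectional cross-task attention idea described above, segmentation features can query depth features while depth features simultaneously query segmentation features. The following is a minimal sketch only, not the authors' released code: the module name, feature shapes, residual/LayerNorm placement, and the use of PyTorch's nn.MultiheadAttention are all assumptions.

    # Minimal sketch (assumptions, not the paper's implementation):
    # both task features are flattened maps of shape (B, H*W, C).
    import torch
    import torch.nn as nn

    class BidirectionalCrossTaskAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            # Segmentation queries attend to depth features, and vice versa.
            self.seg_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.depth_from_seg = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_seg = nn.LayerNorm(dim)
            self.norm_depth = nn.LayerNorm(dim)

        def forward(self, f_seg: torch.Tensor, f_depth: torch.Tensor):
            # Cross-attention in both directions, with residual connections so each
            # task keeps its own representation while absorbing complementary cues.
            seg_ctx, _ = self.seg_from_depth(f_seg, f_depth, f_depth)
            depth_ctx, _ = self.depth_from_seg(f_depth, f_seg, f_seg)
            return self.norm_seg(f_seg + seg_ctx), self.norm_depth(f_depth + depth_ctx)

    # Example: batch of 2, a 32x32 feature map flattened to 1024 tokens of width 256.
    bcta = BidirectionalCrossTaskAttention(dim=256)
    f_seg = torch.randn(2, 32 * 32, 256)
    f_depth = torch.randn(2, 32 * 32, 256)
    f_seg, f_depth = bcta(f_seg, f_depth)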
Kang FAN, Ming'en ZHONG, Jiawei TAN, Zehui ZHAN, Yan FENG. Traffic scene perception algorithm with joint semantic segmentation and depth estimation. Journal of Zhejiang University (Engineering Science), 2024, 58(4): 684-695.
Fig.2 Overall structure of bidirectional cross-task attention module
Fig.3 Overall structure of multi-task Transformer module
Fig.4 Overall structure of encoder-decoder fusion upsampling module
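For reference, the evaluation metrics reported in the following tables (MIoU for segmentation, RMSE and ARE for depth) presumably follow their standard definitions; with C semantic classes, N valid pixels, predicted depth \hat{d}_i and ground-truth depth d_i:

\mathrm{MIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FP}_c+\mathrm{FN}_c},\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(d_i-\hat{d}_i\bigr)^2},\qquad
\mathrm{ARE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert d_i-\hat{d}_i\rvert}{d_i}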
Model       MIoU/%   RMSE    ARE     Np/10⁶   GFLOPs
STL-Seg     76.3     —       —       50.5     132.5
STL-Depth   —        4.910   0.174   50.5     132.5
MTL         73.2     5.355   0.227   66.3     131.1
+BCTA       75.8     5.028   0.186   72.6     153.8
+MT-T       77.0     4.894   0.161   73.4     162.7
+EDFU       77.6     4.781   0.156   74.5     167.0
Tab.1 Results of SDFormer ablation experiments
Fig.5 Comparison of semantic segmentation effects between MTL and SDFormer
Fig.6 Comparison of depth estimation effects between MTL and SDFormer
Module   MIoU/%   RMSE    ARE     GFLOPs
FPT      76.1     5.058   0.204   169.2
SPT      76.7     4.924   0.180   173.8
BCTA     77.6     4.781   0.156   167.0
Tab.2 Performance comparison of different bidirectional cross-task feature interaction modules
Model          MIoU/%   RMSE    ARE     GFLOPs   f/(frame·s⁻¹)
SDFormer-noT   76.0     4.957   0.184   155.2    31.7
SDFormer-s     76.7     4.870   0.175   170.1    18.5
SDFormer-noD   77.8     4.815   0.153   178.3    9.7
SDFormer       77.6     4.781   0.156   167.0    23.6
Tab.3 Results of multi-task Transformer module ablation experiments
Encoder      MIoU/%   RMSE    ARE     Np/10⁶   f/(frame·s⁻¹)
ResNet-50    74.8     5.287   0.226   54.1     65.1
ResNet-101   76.4     5.053   0.184   73.1     34.8
Swin-T       75.3     5.128   0.206   58.7     47.6
Swin-S       77.6     4.781   0.156   74.5     23.6
Swin-B       79.2     4.485   0.132   116.7    11.2
Tab.4 Experimental results of performance comparison for different encoders
Method        Encoder      MIoU/%   Np/10⁶   GFLOPs
PSPNet        ResNet-101   78.5     68.5     546.8
SETR          ViT-B        78.0     97.6     457.2
SegFormer     MiT-B3       80.3     48.2     224.6
Mask2Former   Swin-B       81.2     107.0    343.5
SDFormer      Swin-B       79.2     116.7    247.5
Tab.5 Comparison of semantic segmentation performance between SDFormer and single-task algorithms
Method        Encoder      RMSE    ARE     Np/10⁶   GFLOPs
Lap-Depth     ResNet-101   4.553   0.154   73.4     186.5
DepthFormer   Swin-B       4.326   0.127   151.3    282.0
PixelFormer   Swin-B       4.258   0.115   146.1    346.4
SDFormer      Swin-B       4.485   0.132   116.7    247.5
Tab.6 Comparison of depth estimation performance between SDFormer and single-task algorithms
Method     Encoder      MIoU/%   RMSE    ARE     Np/10⁶   GFLOPs
MTAN       ResNet-101   74.4     5.846   0.246   88.3     204.9
PAD-Net    ResNet-101   76.3     5.275   0.208   84.5     194.5
PSD-Net    ResNet-101   75.6     5.581   0.225   101.8    224.6
MTI-Net    ResNet-101   76.8     5.135   0.194   112.4    269.4
InvPT      Swin-B       77.8     4.657   0.154   136.9    431.5
SDFormer   Swin-B       79.2     4.485   0.132   116.7    247.5
Tab.7 Performance comparison results of different multi-task algorithms
Fig.7 Comparison of semantic segmentation effects between SDFormer and InvPT
Fig.8 Comparison of depth estimation effects between SDFormer and InvPT
Method     MRE/%                                         mMRE/%
           person   rider    car      bus      truck
MTAN       14.5     15.5     13.0     16.1     14.7      14.7
PAD-Net    12.6     13.3     10.7     12.4     11.3      12.0
PSD-Net    11.7     12.6     11.8     12.7     12.2      12.2
MTI-Net    12.3     12.8     9.8      11.4     10.7      11.4
InvPT      7.6      7.2      6.6      7.8      8.3       7.5
SDFormer   6.1      5.4      7.4      5.2      6.5       6.1
Tab.8 Comparison of distance estimation errors of different multi-task algorithms
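In Tab.8 and Tab.9, MRE/% is reported per traffic-participant category and mMRE/% is the mean over the five categories. Assuming the usual relative-error formulation (an assumption; the paper's exact formula is not reproduced here), with \hat{z}_j the estimated distance and z_j the ground-truth distance over the N_c samples of category c:

\mathrm{MRE}_c = \frac{1}{N_c}\sum_{j=1}^{N_c}\frac{\lvert z_j-\hat{z}_j\rvert}{z_j}\times 100\%,\qquad
\mathrm{mMRE} = \frac{1}{5}\sum_{c=1}^{5}\mathrm{MRE}_c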
Distance   MRE/%                                         mMRE/%
           person   rider    car      bus      truck
Near       3.5      4.8      5.1      4.3      3.2       4.1
Medium     5.4      5.6      4.8      5.2      5.7       5.3
Far        13.7     11.2     9.8      9.5      10.1      10.8
Tab.9 Distance estimation errors of SDFormer in different distance ranges
Fig.9 Display of distance prediction effects of SDFormer in different distance ranges
[1] LI Linhui, QIAN Bo, LIAN Jing, et al. Study on traffic scene semantic segmentation method based on convolutional neural network [J]. Journal on Communications, 2018, 39(4): 2018053. (in Chinese)
[2] PAN H, HONG Y, SUN W, et al. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes [J]. IEEE Transactions on Intelligent Transportation Systems, 2023, 24(3): 3448-3460. doi: 10.1109/TITS.2022.3228042
[3] ZHANG Haibo, CAI Lei, REN Junping, et al. Efficient and adaptive semantic segmentation network based on Transformer [J]. Journal of Zhejiang University: Engineering Science, 2023, 57(6): 1205-1214. (in Chinese)
[4] EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2366-2374.
[5] SONG M, LIM S, KIM W. Monocular depth estimation using Laplacian pyramid-based depth residuals [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(11): 4381-4393. doi: 10.1109/TCSVT.2021.3049869
[6] LI Z, CHEN Z, LIU X, et al. DepthFormer: exploiting long-range correlation and local information for accurate monocular depth estimation [J]. Machine Intelligence Research, 2023, 20(6): 837-854. doi: 10.1007/s11633-023-1458-0
[7] WANG P, SHEN X, LIN Z, et al. Towards unified depth and semantic prediction from a single image [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 2800-2809.
[8] SHOOURI S, YANG M, FAN Z, et al. Efficient computation sharing for multi-task visual scene understanding [EB/OL]. (2023-08-14)[2023-08-23]. https://arxiv.org/pdf/2303.09663.pdf.
[9] VANDENHENDE S, GEORGOULIS S, VAN GANSBEKE W, et al. Multi-task learning for dense prediction tasks: a survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3614-3633.
[10] YE H, XU D. Inverted pyramid multi-task transformer for dense scene understanding [C]// European Conference on Computer Vision. [S.l.]: Springer, 2022: 514-530.
[11] XU D, OUYANG W, WANG X, et al. PAD-Net: multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 675-684.
[12] VANDENHENDE S, GEORGOULIS S, VAN GOOL L. MTI-Net: multi-scale task interaction networks for multi-task learning [C]// European Conference on Computer Vision. [S.l.]: Springer, 2020: 527-543.
[13] ZHOU L, CUI Z, XU C, et al. Pattern-structure diffusion for multi-task learning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 4514-4523.
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. Long Beach: MIT Press, 2017: 5998-6008.
[15] ZHANG X, ZHOU L, LI Y, et al. Transfer vision patterns for multi-task pixel learning [C]// Proceedings of the 29th ACM International Conference on Multimedia. [S.l.]: ACM, 2021: 97-106.
[16] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10012-10022.
[17] ZHANG X, CHEN Y, ZHANG H, et al. When visual disparity generation meets semantic segmentation: a mutual encouragement approach [J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 22(3): 1853-1867. doi: 10.1109/TITS.2020.3027556
[18] LAINA I, RUPPRECHT C, BELAGIANNIS V, et al. Deeper depth prediction with fully convolutional residual networks [C]// 2016 Fourth International Conference on 3D Vision. Stanford: IEEE, 2016: 239-248.
[19] CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213-3223.
[20] XIE E, WANG W, YU Z, et al. SegFormer: simple and efficient design for semantic segmentation with transformers [C]// Advances in Neural Information Processing Systems. [S.l.]: MIT Press, 2021: 12077-12090.
[21] WANG W, XIE E, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 568-578.
[22] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[23] ZHAO H, SHI J, QI X, et al. Pyramid scene parsing network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 2881-2890.
[24] ZHENG S, LU J, ZHAO H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 6881-6890.
[25] CHENG B, MISRA I, SCHWING A G, et al. Masked-attention mask transformer for universal image segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1290-1299.
[26] AGARWAL A, ARORA C. Attention attention everywhere: monocular depth prediction with skip attention [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 5861-5870.