Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (1): 89-99    DOI: 10.3785/j.issn.1008-973X.2025.01.009
Monocular 3D object detection based on context information enhancement and depth guidance
Jiayi YU 1, Qin WU 1,2,*
1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
2. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computing Intelligence, Jiangnan University, Wuxi 214122, China

Abstract  

A method based on context information enhancement and depth guidance was proposed to fully utilize the feature information provided by a monocular image. An efficient context information enhancement module was designed to adaptively enhance the context information of multi-scale objects using multiple large-kernel convolutions, and depth-wise separable convolution and strip convolution were adopted to effectively reduce the parameter count and computational complexity of the large-kernel convolutions. The prediction errors of each attribute of the 3D bounding box were analyzed statistically, and inaccurate prediction of the length and depth of the 3D object was found to be the primary cause of large deviations in the predicted box. A depth error weighted loss function was therefore proposed to supervise the length and depth predictions during training, improving the prediction accuracy of these two attributes and thereby the accuracy of the 3D bounding box. Experiments on the KITTI dataset showed that the proposed method achieved higher accuracy than existing monocular 3D object detection methods at multiple difficulty levels of the dataset.
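The abstract describes supervising length and depth predictions with a depth error weighted loss, but does not give its exact functional form. The sketch below is a hypothetical illustration of the idea: an L1 depth term plus a length term whose weight grows with the current depth error, so that objects with poorly predicted depth receive stronger length supervision. The names and the weighting `1 + alpha * depth_err` are assumptions, not the paper's actual formulation.

```python
def depth_error_weighted_loss(pred_depth, gt_depth, pred_length, gt_length, alpha=1.0):
    """Hypothetical sketch of a depth-error-weighted regression loss.

    The depth term supervises depth directly; the length term is scaled
    by the current depth error, so samples whose depth is predicted
    poorly receive stronger length supervision.
    """
    depth_err = abs(pred_depth - gt_depth)       # L1 depth error
    weight = 1.0 + alpha * depth_err             # grows with depth error
    length_err = abs(pred_length - gt_length)    # L1 length error
    return depth_err + weight * length_err
```

With a perfect prediction the loss is zero; for a fixed length error, a larger depth error yields a strictly larger total loss, which is the coupling between the two attributes that the method exploits.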



Key words: monocular 3D object detection; large kernel convolution; depth-wise separable convolution; strip convolution; multi-scale object
Received: 29 November 2023      Published: 18 January 2025
CLC:  TP 391  
Fund: Supported by the National Natural Science Foundation of China (61972180).
Corresponding Authors: Qin WU     E-mail: 3076710949@qq.com;qinwu@jiangnan.edu.cn
Cite this article:

Jiayi YU, Qin WU. Monocular 3D object detection based on context information enhancement and depth guidance. Journal of ZheJiang University (Engineering Science), 2025, 59(1): 89-99.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.01.009     OR     https://www.zjujournals.com/eng/Y2025/V59/I1/89


Fig.1 Architecture of context information enhancement and depth guidance model
Fig.2 Histogram of prediction errors of different attributes by baseline model on KITTI validation set
Model            Venue      AP_3D|R40                AP_BEV|R40
                            Easy   Mod.   Hard      Easy   Mod.   Hard
CaDDN[21]        CVPR21     19.17  13.41  11.46     27.94  18.91  17.19
Monodle[22]      CVPR21     17.23  12.26  10.29     24.79  18.89  16.00
GrooMeD-NMS[23]  CVPR21     18.10  12.32   9.65     26.19  18.27  14.05
MonoEF[24]       CVPR21     21.29  13.87  11.71     29.03  19.70  17.26
MonoFlex[25]     CVPR21     19.94  13.89  12.07     28.23  19.75  16.89
AutoShape[7]     ICCV21     22.47  14.17  11.36     30.66  20.08  15.95
GUPNet[10]       ICCV21     22.26  15.02  13.12     30.29  21.19  18.20
PCT[26]          NeurIPS21  21.00  13.37  11.31     29.65  19.03  15.92
MonoGround[27]   CVPR22     21.37  14.36  12.62     30.07  20.47  17.74
HomoLoss[28]     CVPR22     21.75  14.94  13.07     29.60  20.68  17.81
MonoDTR[14]      CVPR22     21.99  15.39  12.73     28.59  20.38  17.14
MonoJSG[29]      CVPR22     24.69  16.14  13.64     32.59  21.26  18.18
DCD[9]           ECCV22     23.81  15.90  13.21     32.55  21.50  18.25
DEVIANT[30]      ECCV22     21.88  14.46  11.89     29.65  20.44  17.43
DID-M3D[17]      ECCV22     24.40  16.29  13.75     32.95  22.76  19.83
SGM3D[31]        RAL22      22.46  14.65  12.97     31.49  21.37  18.43
MonoCon[32]      AAAI22     22.50  16.46  13.95     31.12  22.10  19.00
MonoRCNN++[33]   WACV23     20.08  13.72  11.34
MonoEdge[34]     WACV23     21.08  14.47  12.73     28.80  20.35  17.57
Ours                        26.74  16.67  14.33     34.73  22.84  19.52
Tab.1 Comparison of monocular 3D object detection accuracy (%) of different object detection models on the KITTI test set
Exp.  ECIE  L_r^d  L_r^l   AP_3D|R40                AP_BEV|R40
                           Easy   Mod.   Hard      Easy   Mod.   Hard
1                          25.42  17.09  14.08     33.90  23.30  19.51
2                          26.73  18.25  15.19     34.20  23.72  20.90
3                          26.03  17.58  14.56     34.61  24.59  21.05
4                          26.06  17.84  14.72     33.02  24.17  20.63
5                          27.28  18.11  14.93     35.35  24.56  20.97
6                          27.11  18.23  15.04     35.43  24.69  21.01
7                          27.03  18.25  15.00     35.13  23.96  20.97
8                          27.56  18.32  15.13     35.85  24.82  21.19
Tab.2 Monocular 3D object detection accuracy (%) of different method combinations
k=7  k=11  k=21   AP_3D|R40                AP_BEV|R40
                  Easy   Mod.   Hard      Easy   Mod.   Hard
                  24.54  17.12  14.14     32.65  22.95  20.04
                  26.02  17.35  14.38     32.63  22.69  19.99
                  25.18  17.36  14.35     33.29  23.20  20.39
                  25.88  17.64  14.52     33.58  23.26  20.42
                  26.60  17.73  14.68     32.67  22.80  19.37
                  26.62  17.88  14.72     34.38  23.58  20.67
                  26.73  18.25  15.19     34.20  23.72  20.90
Tab.3 Monocular 3D object detection accuracy (%) of different kernel sizes in efficient context information enhancement module
Fig.3 Variation of object detection accuracy with hyper-parameters in depth error weighted loss
Convolution operation                        Params/10^6   Complexity/10^9
Standard convolution                         2.609         80.153
Depth-wise separable convolution             0.045         1.376
Depth-wise separable + strip convolution     0.011         0.328
Tab.4 Comparison of parameter count and computational complexity of different convolution operations in efficient context information enhancement module
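The reduction in Tab.4 follows from simple parameter arithmetic. The sketch below compares the parameter count of a standard k×k convolution against a depth-wise separable variant and a depth-wise strip variant (1×k plus k×1, followed by a 1×1 point-wise convolution); the 64-channel width and 21×21 kernel are illustrative choices, not the paper's actual configuration, so the absolute numbers differ from Tab.4 while the ratios show the same trend.

```python
def standard_conv_params(c_in, c_out, k):
    # standard k x k convolution (bias omitted)
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    # depth-wise k x k convolution + 1 x 1 point-wise convolution
    return c_in * k * k + c_in * c_out

def dw_strip_params(c_in, c_out, k):
    # depth-wise 1 x k and k x 1 strip convolutions + 1 x 1 point-wise
    return c_in * (k + k) + c_in * c_out

# Illustrative setting: 64 channels, 21 x 21 large kernel
print(standard_conv_params(64, 64, 21))   # 1806336
print(dw_separable_params(64, 64, 21))    # 32320
print(dw_strip_params(64, 64, 21))        # 6784
```

The strip decomposition scales linearly rather than quadratically in k, which is why it stays cheap even for very large kernels.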
L_r        AP_3D (IoU=0.7, IoU=0.5)
           Overall       d∈(0,30] m    d∈(30,50] m   d>50 m
without    1.95, 9.27    5.63, 19.56   0.91, 6.72    0.15, 1.70
with       2.33, 11.47   7.23, 23.52   0.73, 7.10    0.24, 2.53
Tab.5 Monocular 3D object detection accuracy (%) of depth error weighted loss on Waymo dataset
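The distance buckets used in Tab.5 can be reproduced with a small helper that assigns each detection to a depth range before computing per-bucket accuracy; the depths in the usage example are hypothetical.

```python
def depth_bucket(d):
    """Assign an object at depth d (metres) to a bucket from Tab.5."""
    if d <= 30.0:
        return "(0,30]"
    if d <= 50.0:
        return "(30,50]"
    return ">50"

# Group hypothetical detection depths by bucket
depths = [12.5, 29.9, 30.1, 47.0, 66.3]
buckets = {}
for d in depths:
    buckets.setdefault(depth_bucket(d), []).append(d)
print(buckets)  # {'(0,30]': [12.5, 29.9], '(30,50]': [30.1, 47.0], '>50': [66.3]}
```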
Fig.4 Visualization of object detection results of different object detection models in KITTI validation set
[1]   LIU Y X, YUAN Y X, LIU M. Ground-aware monocular 3D object detection for autonomous driving [J]. IEEE Robotics and Automation Letters, 2021, 6(2): 919-926. doi: 10.1109/LRA.2021.3052442
[2]   SIMONELLI A, BULÒ S R, PORZI L, et al. Disentangling monocular 3D object detection [C]// IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 1991–1999.
[3]   WANG Y, CHAO W L, GARG D, et al. Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 8445–8453.
[4]   MA X Z, LIU S N, XIA Z Y, et al. Rethinking pseudo-LiDAR representation [C]// European Conference on Computer Vision . Glasgow: Springer, 2020: 311–327.
[5]   PENG L, LIU F, YU Z X, et al. LiDAR point cloud guided monocular 3D object detection [C]// European Conference on Computer Vision . Tel Aviv: Springer, 2022: 123–139.
[6]   HONG Y, DAI H, DING Y. Cross-modality knowledge distillation network for monocular 3D object detection [C]// European Conference on Computer Vision . Tel Aviv: Springer, 2022: 87–104.
[7]   LIU Z D, ZHOU D F, LU F X, et al. AutoShape: real-time shape-aware monocular 3D object detection [C]// IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 15641–15650.
[8]   ZHANG Junning, SU Qunxing, LIU Pengyuan, et al. Adaptive monocular 3D object detection algorithm based on spatial constraint [J]. Journal of Zhejiang University: Engineering Science, 2020, 54(6): 1138-1146.
[9]   LI Y Y, CHEN Y T, HE J W, et al. Densely constrained depth estimator for monocular 3D object detection [C]// European Conference on Computer Vision . Tel Aviv: Springer, 2022: 718–734.
[10]   LU Y, MA X Z, YANG L, et al. Geometry uncertainty projection network for monocular 3D object detection [C]// IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 3111–3121.
[11]   LIU Z C, WU Z Z, TÓTH R. SMOKE: single-stage monocular 3D object detection via keypoint estimation [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops . Seattle: IEEE, 2020: 996–997.
[12]   BRAZIL G, LIU X M. M3D-RPN: monocular 3D region proposal network for object detection [C]// IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 9287–9296.
[13]   ZHANG R R, QIU H, WANG T, et al. MonoDETR: depth-guided transformer for monocular 3D object detection [C]// IEEE/CVF International Conference on Computer Vision . Paris: IEEE, 2023: 9155–9166.
[14]   HUANG K C, WU T H, SU H T, et al. MonoDTR: monocular 3D object detection with depth-aware transformer [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 4012–4021.
[15]   YU F, WANG D Q, SHELHAMER E, et al. Deep layer aggregation [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 2403–2412.
[16]   ZHOU X Y, WANG D Q, KRÄHENBÜHL P. Objects as points [EB/OL]. (2019–04–25)[2023–11–29]. https://arxiv.org/pdf/1904.07850.
[17]   PENG L, WU X P, YANG Z, et al. DID-M3D: decoupling instance depth for monocular 3D object detection [C]// European Conference on Computer Vision . Tel Aviv: Springer, 2022: 71–88.
[18]   GEIGER A, LENZ P, URTASUN R. Are we ready for autonomous driving? The KITTI vision benchmark suite [C]// IEEE Conference on Computer Vision and Pattern Recognition . Providence: IEEE, 2012: 3354–3361.
[19]   MOUSAVIAN A, ANGUELOV D, FLYNN J, et al. 3D bounding box estimation using deep learning and geometry [C]// IEEE Conference on Computer Vision and Pattern Recognition . Honolulu: IEEE, 2017: 7074–7082.
[20]   SUN P, KRETZSCHMAR H, DOTIWALLA X, et al. Scalability in perception for autonomous driving: Waymo open dataset [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle: IEEE, 2020: 2446–2454.
[21]   READING C, HARAKEH A, CHAE J, et al. Categorical depth distribution network for monocular 3D object detection [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 8555–8564.
[22]   MA X Z, ZHANG Y M, XU D, et al. Delving into localization errors for monocular 3D object detection [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 4721–4730.
[23]   KUMAR A, BRAZIL G, LIU X M. GrooMeD-NMS: grouped mathematically differentiable NMS for monocular 3D object detection [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 8973–8983.
[24]   ZHOU Y S, HE Y, ZHU H Z, et al. Monocular 3D object detection: an extrinsic parameter free approach [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 7556–7566.
[25]   ZHANG Y P, LU J W, ZHOU J. Objects are different: flexible monocular 3D object detection [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 3289–3298.
[26]   WANG L, ZHANG L, ZHU Y, et al. Progressive coordinate transforms for monocular 3D object detection [C]// The 35th International Conference on Neural Information Processing Systems . [S. l.]: Curran Associates, 2021: 13364–13377.
[27]   QIN Z Q, LI X. MonoGround: detecting monocular 3D objects from the ground [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 3793–3802.
[28]   GU J Q, WU B J, FAN L B, et al. Homography loss for monocular 3D object detection [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 1080–1089.
[29]   LIAN Q, LI P L, CHEN X Z. MonoJSG: joint semantic and geometric cost volume for monocular 3D object detection [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 1070–1079.
[30]   KUMAR A, BRAZIL G, CORONA E, et al. DEVIANT: depth equivariant network for monocular 3D object detection [C]// European Conference on Computer Vision . Tel Aviv: Springer, 2022: 664–683.
[31]   ZHOU Z Y, DU L, YE X Q, et al. SGM3D: stereo guided monocular 3D object detection [J]. IEEE Robotics and Automation Letters, 2022, 7(4): 10478-10485. doi: 10.1109/LRA.2022.3191849
[32]   LIU X P, XUE N, WU T F. Learning auxiliary monocular contexts helps monocular 3D object detection [C]// AAAI Conference on Artificial Intelligence . Vancouver: AAAI, 2022: 1810–1818.
[33]   SHI X P, CHEN Z X, KIM T K. Multivariate probabilistic monocular 3D object detection [C]// IEEE/CVF Winter Conference on Applications of Computer Vision . Waikoloa: IEEE, 2023: 4281–4290.