Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (11): 2277-2284    DOI: 10.3785/j.issn.1008-973X.2025.11.006
    
6D pose estimation of binocular vision object based on 3D key point
Kaixu NING, Qing LU, Heng YANG*, Shaohan WANG
School of Mechanical Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China

Abstract  

A binocular dataset creation method based on multi-view geometry and StereoNet, a 3D key point-based object 6D pose estimation network, were proposed to remove the dependence of traditional pose estimation methods on CAD models. The 3D key points of the object were obtained through a 3D key point estimation network, into which a parallax attention module was introduced to improve the accuracy of key point prediction. The structure-from-motion (SfM) method was employed to reconstruct a sparse point cloud model of the object. The 3D points from the query image and those from the SfM model were fed into a graph attention network (GATs) for matching, and the 6D pose of the object was computed with the RANSAC and PnP algorithms. The experimental results showed that in 3D key point estimation the MAE of KeypointNet and KeyPose was 1.2–1.6 times that of StereoNet. In 6D pose estimation, StereoNet outperformed HLoc, OnePose, and Gen6D on the 5 cm 5° and 3 cm 3° metrics, achieving an average accuracy of 82.1%. The network has strong generalization ability and accuracy.
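As described in the abstract, the final stage computes the pose from matched 3D–2D correspondences with RANSAC and PnP. A minimal sketch of that stage using OpenCV's solvePnPRansac follows; the helper name, threshold values, and input layout are illustrative assumptions, and the GAT matching stage that produces the correspondences is not reproduced here.

import cv2
import numpy as np

def estimate_pose(model_points, image_points, K, dist_coeffs=None):
    # model_points: (N, 3) 3D points from the SfM sparse model that were
    # matched to the query; image_points: (N, 2) corresponding 2D keypoints
    # in the query image; K: (3, 3) camera intrinsic matrix.
    # RANSAC rejects outlier matches; EPnP solves for the pose on inliers.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_points.astype(np.float64),
        image_points.astype(np.float64),
        K.astype(np.float64),
        dist_coeffs,
        iterationsCount=100,          # assumed iteration budget
        reprojectionError=3.0,        # assumed inlier threshold, in pixels
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers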



Key words: 6D pose; dataset creation; binocular vision; 3D key point matching; perspective-n-point (PnP) algorithm
Received: 04 November 2024      Published: 30 October 2025
CLC:  TP 183  
Corresponding Authors: Heng YANG     E-mail: 2654223903@qq.com;93328173@qq.com
Cite this article:

Kaixu NING, Qing LU, Heng YANG, Shaohan WANG. 6D pose estimation of binocular vision object based on 3D key point. Journal of ZheJiang University (Engineering Science), 2025, 59(11): 2277-2284.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.11.006     OR     https://www.zjujournals.com/eng/Y2025/V59/I11/2277


Fig.1 Flow chart of dataset creation
Fig.2 Binocular image automatic acquisition platform
Fig.3 Result of image annotation
Fig.4 CAD models of some objects
Fig.5 Overall architecture of StereoNet network
Fig.6 Sparse point cloud model of some objects
Fig.7 3D key point estimation network
Fig.8 Structure framework of GATs network
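The keypoint network of Fig.7 operates on rectified binocular pairs, so a matched left/right keypoint can be lifted to 3D by the standard depth-from-disparity relation z = f·b/d. The sketch below shows only this textbook geometry with assumed calibration values; it is not the paper's network, and the parallax attention module that refines the predictions is not reproduced.

import numpy as np

def triangulate_keypoints(kp_left, kp_right, f, baseline, cx, cy):
    # kp_left, kp_right: (N, 2) pixel coordinates (u, v) of matched keypoints
    # in a rectified stereo pair (matched points share the same row v).
    # f: focal length in pixels; baseline: camera separation in metres;
    # (cx, cy): principal point. All calibration values are assumed examples.
    disparity = kp_left[:, 0] - kp_right[:, 0]   # d = u_left - u_right
    z = f * baseline / disparity                 # depth from disparity
    x = (kp_left[:, 0] - cx) * z / f             # back-project u to X
    y = (kp_left[:, 1] - cy) * z / f             # back-project v to Y
    return np.stack([x, y, z], axis=1)           # (N, 3) points, camera frame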
Method        MAE/mm
              Box     Bottle  Cup
KeypointNet   6.4     6.0     10.5
KeyPose       6.6     5.8     9.9
StereoNet     5.0     4.7     6.1
Tab.1 Estimation results of category-level 3D key points
Method        MAE/mm
              Mouse   Rubber duck
KeypointNet   42.8    55.0
KeyPose       38.2    49.1
StereoNet     18.2    13.6
Tab.2 3D key point estimation results for unseen objects
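The MAE reported in Tab.1 and Tab.2 measures, in millimetres, how far the predicted 3D key points lie from ground truth. One common reading, sketched below, averages the per-keypoint Euclidean error; the paper's exact aggregation may differ.

import numpy as np

def keypoint_mae(pred, gt):
    # pred, gt: (N, K, 3) arrays of N samples with K 3D keypoints each,
    # in millimetres. Interpreted here as the mean Euclidean distance per
    # keypoint; the paper's exact aggregation may differ.
    return np.linalg.norm(pred - gt, axis=-1).mean()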
Method      Bottle           Mug              Cup              Mouse
            3 cm 3° 5 cm 5°  3 cm 3° 5 cm 5°  3 cm 3° 5 cm 5°  3 cm 3° 5 cm 5°
HLoc        0.703   0.813    0.793   0.831    0.739   0.837    0.729   0.832
OnePose     0.733   0.836    0.806   0.828    0.729   0.832    0.711   0.819
Gen6D†      0.572   0.613    0.575   0.608    0.468   0.515    0.508   0.613
Gen6D       0.591   0.633    0.598   0.631    0.502   0.583    0.566   0.631
StereoNet   0.773   0.865    0.813   0.844    0.791   0.854    0.787   0.845
Tab.3 Comparison of pose estimation accuracy
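The 3 cm 3° and 5 cm 5° columns of Tab.3 report the fraction of test samples whose translation error and rotation error fall below both thresholds simultaneously. A minimal sketch of the per-sample check, assuming poses are given as rotation matrices and translations in metres:

import numpy as np

def pose_within_threshold(R_pred, t_pred, R_gt, t_gt,
                          t_thresh=0.05, r_thresh=5.0):
    # t_thresh in metres (0.05 = 5 cm), r_thresh in degrees (5 deg).
    t_err = np.linalg.norm(t_pred - t_gt)
    # Geodesic rotation error: angle of the relative rotation R_pred^T R_gt.
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return bool(t_err < t_thresh and r_err < r_thresh)

The reported accuracy is then the mean of this indicator over the test set.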
Fig.9 Effect of reconstruction error on pose estimation accuracy
Fig.10 Effect of object distance on pose estimation accuracy
Object   Textured            Untextured
         3 cm 3°   5 cm 5°   3 cm 3°   5 cm 5°
Cup      0.781     0.826     0.422     –
Box      0.798     0.848     0.459     –
Mug      0.801     0.832     0.403     –
Tab.4 Comparison of pose estimation accuracy for textured and untextured objects
[1]   LABBÉ Y, CARPENTIER J, AUBRY M, et al. Cosypose: consistent multi-view multi-object 6D pose estimation [C]// European Conference on Computer Vision. Glasgow: Springer, 2020: 574-591.
[2]   PENG S D, LIU Y, HUANG Q X, et al. Pvnet: pixel-wise voting network for 6dof pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4561-4570.
[3]   HE Y S, SUN W, HUANG H B, et al. Pvn3d: a deep point-wise 3d keypoints voting network for 6dof pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11632-11641.
[4]   HE Y S, HUANG H B, FAN H Q, et al. Ffb6d: a full flow bidirectional fusion network for 6d pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3003-3013.
[5]   LI Z G, WANG G, JI X Y. CDPN: coordinates-based disentangled pose network for real time rgb-based 6-DoF object pose estimation [C]// IEEE International Conference on Computer Vision. Seoul: IEEE, 2019: 7678-7687.
[6]   TREMBLAY J, TO T, SUNDARALINGAM B, et al. Deep object pose estimation for semantic robotic grasping of household objects [C]//2nd Conference on Robot Learning. Zurich: PMLR, 2018: 306-316.
[7]   GAO G, LAURI M, WANG Y L, et al. 6d object pose regression via supervised learning on point clouds [C]//IEEE International Conference on Robotics and Automation. Paris: IEEE, 2020: 3643-3649.
[8]   CHEN W, JIA X, CHANG H J, et al. g2l-net: global to local network for real-time 6d pose estimation with embedding vector features [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 4233-4242.
[9]   AHMADYAN A, ZHANG L K, ABLAVATSKI A, et al. Objectron: a large scale dataset of object-centric videos in the wild with pose annotations [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 7822-7831.
[10]   WANG H, SRIDHAR S, HUANG J W, et al. Normalized object coordinate space for category-level 6d object pose and size estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 2642-2651.
[11]   ZHANG R D, DI Y, MANHARDT F, et al. SSP-pose: symmetry-aware shape prior deformation for direct category-level object pose estimation [C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. Kyoto: IEEE, 2022: 7452-7459.
[12]   HE Y S, WANG Y, FAN H Q, et al. Fs6d: few-shot 6d pose estimation of novel objects [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 6814-6824.
[13]   WU J, WANG Y, XIONG R. Unseen object pose estimation via registration [C]// IEEE International Conference on Real-time Computing and Robotics. Guangzhou: IEEE, 2021: 974-979.
[14]   SUN J M, WANG Z H, ZHANG S Y, et al. Onepose: one-shot object pose estimation without cad models [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 6825-6834.
[15]   CHEN K, JAMES S, SUI C Y, et al. Stereopose: category-level 6d transparent object pose estimation from stereo images via back-view nocs [C]//IEEE International Conference on Robotics and Automation. London: IEEE, 2023: 2855-2861.
[16]   YIN M H, YAO Z L, CAO Y, et al. Disentangled non-local neural networks [C]// European Conference on Computer Vision. Glasgow: Springer, 2020: 191-207.
[17]   WANG Y Q, YING X Y, WANG L G, et al. Symmetric parallax attention for stereo image super-resolution [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 766-775.
[18]   VELIČKOVIĆ P, CUCURULL G, CASANOVA A, et al. Graph attention networks [C]// International Conference on Learning Representations. Vancouver: OpenReview, 2018.
[19]   SUWAJANAKORN S, SNAVELY N, TOMPSON J J, et al. Discovery of latent 3d keypoints via end-to-end geometric reasoning [C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc, 2018: 2067-2074.
[20]   LIU X Y, JONSCHKOWSKI R, ANGELOVA A, et al. Keypose: multi-view 3d labeling and keypoint estimation for transparent objects [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11602-11610.
[21]   SARLIN P E, CADENA C, SIEGWART R, et al. From coarse to fine: robust hierarchical localization at large scale [C] //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 12716-12725.