Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (11): 2277-2284    DOI: 10.3785/j.issn.1008-973X.2025.11.006
    
6D pose estimation of binocular vision object based on 3D key point
Kaixu NING, Qing LU, Heng YANG*, Shaohan WANG
School of Mechanical Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China

Abstract  

A binocular dataset creation method based on multi-view geometry and StereoNet, a 3D key point-based object 6D pose estimation network, were proposed to remove the dependence of traditional pose estimation methods on CAD models. The 3D key points of the object were obtained through a 3D key point estimation network, into which a parallax attention module was introduced to improve the accuracy of key point prediction. The structure-from-motion (SfM) method was employed to reconstruct a sparse point cloud model of the object. The 3D points from the query image and those from the SfM model were fed into a graph attention network (GATs) for matching, and the 6D pose of the object was computed with the RANSAC and PnP algorithms. The experimental results showed that in 3D key point estimation the MAE of KeypointNet and KeyPose was 1.2–1.6 times that of StereoNet. In 6D pose estimation, StereoNet outperformed HLoc, OnePose, and Gen6D on the 5 cm 5° and 3 cm 3° metrics, achieving an average accuracy of 82.1%. The network has strong generalization ability and accuracy.
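As described in the abstract, the final stage computes the pose from matched 3D–2D correspondences with RANSAC and PnP. A minimal sketch of that stage using OpenCV's solvePnPRansac follows; the helper name, threshold values, and input layout are illustrative assumptions, and the GAT matching stage that produces the correspondences is not reproduced here.

import cv2
import numpy as np

def estimate_pose(model_points, image_points, K, dist_coeffs=None):
    # model_points: (N, 3) 3D points from the SfM sparse model that were
    # matched to the query; image_points: (N, 2) corresponding 2D keypoints
    # in the query image; K: (3, 3) camera intrinsic matrix.
    # RANSAC rejects outlier matches; EPnP solves for the pose on inliers.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_points.astype(np.float64),
        image_points.astype(np.float64),
        K.astype(np.float64),
        dist_coeffs,
        iterationsCount=100,          # assumed iteration budget
        reprojectionError=3.0,        # assumed inlier threshold, in pixels
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers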



Key words: 6D pose; dataset creation; binocular vision; 3D key point matching; perspective-n-point (PnP) algorithm
Received: 04 November 2024      Published: 30 October 2025
CLC:  TP 183  
Corresponding Authors: Heng YANG     E-mail: 2654223903@qq.com;93328173@qq.com
Cite this article:

Kaixu NING, Qing LU, Heng YANG, Shaohan WANG. 6D pose estimation of binocular vision object based on 3D key point. Journal of ZheJiang University (Engineering Science), 2025, 59(11): 2277-2284.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.11.006     OR     https://www.zjujournals.com/eng/Y2025/V59/I11/2277


Fig.1 Flow chart of dataset creation
Fig.2 Binocular image automatic acquisition platform
Fig.3 Result of image annotation
Fig.4 CAD models of some objects
Fig.5 Overall architecture of StereoNet network
Fig.6 Sparse point cloud model of some objects
Fig.7 3D key point estimation network
Fig.8 Structure framework of GATs network
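The keypoint network of Fig.7 operates on rectified binocular pairs, so a matched left/right keypoint can be lifted to 3D by the standard depth-from-disparity relation z = f·b/d. The sketch below shows only this textbook geometry with assumed calibration values; it is not the paper's network, and the parallax attention module that refines the predictions is not reproduced.

import numpy as np

def triangulate_keypoints(kp_left, kp_right, f, baseline, cx, cy):
    # kp_left, kp_right: (N, 2) pixel coordinates (u, v) of matched keypoints
    # in a rectified stereo pair (matched points share the same row v).
    # f: focal length in pixels; baseline: camera separation in metres;
    # (cx, cy): principal point. All calibration values are assumed examples.
    disparity = kp_left[:, 0] - kp_right[:, 0]   # d = u_left - u_right
    z = f * baseline / disparity                 # depth from disparity
    x = (kp_left[:, 0] - cx) * z / f             # back-project u to X
    y = (kp_left[:, 1] - cy) * z / f             # back-project v to Y
    return np.stack([x, y, z], axis=1)           # (N, 3) points, camera frame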
Method        MAE/mm
              Box     Bottle  Cup
KeypointNet   6.4     6.0     10.5
KeyPose       6.6     5.8     9.9
StereoNet     5.0     4.7     6.1
Tab.1 Estimation results of category-level 3D key points
Method        MAE/mm
              Mouse   Rubber duck
KeypointNet   42.8    55.0
KeyPose       38.2    49.1
StereoNet     18.2    13.6
Tab.2 3D key point estimation results for unseen objects
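The MAE reported in Tab.1 and Tab.2 measures, in millimetres, how far the predicted 3D key points lie from ground truth. One common reading, sketched below, averages the per-keypoint Euclidean error; the paper's exact aggregation may differ.

import numpy as np

def keypoint_mae(pred, gt):
    # pred, gt: (N, K, 3) arrays of N samples with K 3D keypoints each,
    # in millimetres. Interpreted here as the mean Euclidean distance per
    # keypoint; the paper's exact aggregation may differ.
    return np.linalg.norm(pred - gt, axis=-1).mean()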
Method      Bottle           Mug              Cup              Mouse
            3 cm 3° 5 cm 5°  3 cm 3° 5 cm 5°  3 cm 3° 5 cm 5°  3 cm 3° 5 cm 5°
HLoc        0.703   0.813    0.793   0.831    0.739   0.837    0.729   0.832
OnePose     0.733   0.836    0.806   0.828    0.729   0.832    0.711   0.819
Gen6D†      0.572   0.613    0.575   0.608    0.468   0.515    0.508   0.613
Gen6D       0.591   0.633    0.598   0.631    0.502   0.583    0.566   0.631
StereoNet   0.773   0.865    0.813   0.844    0.791   0.854    0.787   0.845
Tab.3 Comparison of pose estimation accuracy
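The 3 cm 3° and 5 cm 5° columns of Tab.3 report the fraction of test samples whose translation error and rotation error fall below both thresholds simultaneously. A minimal sketch of the per-sample check, assuming poses are given as rotation matrices and translations in metres:

import numpy as np

def pose_within_threshold(R_pred, t_pred, R_gt, t_gt,
                          t_thresh=0.05, r_thresh=5.0):
    # t_thresh in metres (0.05 = 5 cm), r_thresh in degrees (5 deg).
    t_err = np.linalg.norm(t_pred - t_gt)
    # Geodesic rotation error: angle of the relative rotation R_pred^T R_gt.
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return bool(t_err < t_thresh and r_err < r_thresh)

The reported accuracy is then the mean of this indicator over the test set.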
Fig.9 Effect of reconstruction error on pose estimation accuracy
Fig.10 Effect of object distance on pose estimation accuracy
Object   Textured            Untextured
         3 cm 3°   5 cm 5°   3 cm 3°   5 cm 5°
Cup      0.781     0.826     0.422     –
Box      0.798     0.848     0.459     –
Mug      0.801     0.832     0.403     –
Tab.4 Comparison of pose estimation accuracy for textured and untextured objects
[1]   LABBÉ Y, CARPENTIER J, AUBRY M, et al. Cosypose: consistent multi-view multi-object 6D pose estimation [C]// European Conference on Computer Vision. Glasgow: Springer, 2020: 574-591.
[2]   PENG S D, LIU Y, HUANG Q X, et al. Pvnet: pixel-wise voting network for 6dof pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4561-4570.
[3]   HE Y S, SUN W, HUANG H B, et al. Pvn3d: a deep point-wise 3d keypoints voting network for 6dof pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11632-11641.
[4]   HE Y S, HUANG H B, FAN H Q, et al. Ffb6d: a full flow bidirectional fusion network for 6d pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3003-3013.
[5]   LI Z G, WANG G, JI X Y. CDPN: coordinates-based disentangled pose network for real time rgb-based 6-DoF object pose estimation [C]// IEEE International Conference on Computer Vision. Seoul: IEEE, 2019: 7678-7687.
[6]   TREMBLAY J, TO T, SUNDARALINGAM B, et al. Deep object pose estimation for semantic robotic grasping of household objects [C]//2nd Conference on Robot Learning. Zurich: PMLR, 2018: 306-316.
[7]   GAO G, LAURI M, WANG Y L, et al. 6d object pose regression via supervised learning on point clouds [C]//IEEE International Conference on Robotics and Automation. Paris: IEEE, 2020: 3643-3649.
[8]   CHEN W, JIA X, CHANG H J, et al. g2l-net: global to local network for real-time 6d pose estimation with embedding vector features [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 4233-4242.
[9]   AHMADYAN A, ZHANG L K, ABLAVATSKI A, et al. Objectron: a large scale dataset of object-centric videos in the wild with pose annotations [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 7822-7831.
[10]   WANG H, SRIDHAR S, HUANG J W, et al. Normalized object coordinate space for category-level 6d object pose and size estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 2642-2651.
[11]   ZHANG R D, DI Y, MANHARDT F, et al. SSP-pose: symmetry-aware shape prior deformation for direct category-level object pose estimation [C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. Kyoto: IEEE, 2022: 7452-7459.
[12]   HE Y S, WANG Y, FAN H Q, et al. Fs6d: few-shot 6d pose estimation of novel objects [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 6814-6824.
[13]   WU J, WANG Y, XIONG R. Unseen object pose estimation via registration [C]// IEEE International Conference on Real-time Computing and Robotics. Guangzhou: IEEE, 2021: 974-979.
[14]   SUN J M, WANG Z H, ZHANG S Y, et al. Onepose: one-shot object pose estimation without cad models [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 6825-6834.
[15]   CHEN K, JAMES S, SUI C Y, et al. Stereopose: category-level 6d transparent object pose estimation from stereo images via back-view nocs [C]//IEEE International Conference on Robotics and Automation. London: IEEE, 2023: 2855-2861.
[16]   YIN M H, YAO Z L, CAO Y, et al. Disentangled non-local neural networks [C]// European Conference on Computer Vision. Glasgow: Springer, 2020: 191-207.
[17]   WANG Y Q, YING X Y, WANG L G, et al. Symmetric parallax attention for stereo image super-resolution [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 766-775.
[18]   VELIČKOVIĆ P, CUCURULL G, CASANOVA A, et al. Graph attention networks [C]// International Conference on Learning Representations. Vancouver: OpenReview, 2018.
[19]   SUWAJANAKORN S, SNAVELY N, TOMPSON J J, et al. Discovery of latent 3d keypoints via end-to-end geometric reasoning [C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc, 2018: 2067-2074.
[20]   LIU X Y, JONSCHKOWSKI R, ANGELOVA A, et al. Keypose: multi-view 3d labeling and keypoint estimation for transparent objects [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11602-11610.
[21]   SARLIN P E, CADENA C, SIEGWART R, et al. From coarse to fine: robust hierarchical localization at large scale [C] //Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 12716-12725.