Journal of Zhejiang University (Engineering Science)  2025, Vol. 59, Issue (1): 18−26    DOI: 10.3785/j.issn.1008-973X.2025.01.002
Computer and Control Engineering
3D hand pose estimation method based on monocular RGB images
Bing YANG 1,2, Chuyang XU 1,2, Jinliang YAO 1,2, Xueqin XIANG 3
1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
2. Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou 310018, China
3. Hangzhou Lingban Technology Limited Company, Hangzhou 311121, China

Abstract:

Most existing 3D hand pose estimation methods are based on the Transformer and do not fully exploit local spatial information at high resolution. To address this, a 3D hand pose estimation method based on an improved FastMETRO was proposed. A deformable attention mechanism was introduced so that the encoder design was no longer limited by the length of the image feature sequence. An interleaved-update multi-scale feature encoder was employed to fuse multi-scale features and strengthen hand pose generation, and a graph convolution residual module was employed to exploit the explicit semantic connections between mesh vertices. To validate the effectiveness of the proposed method, training and evaluation experiments were conducted on the FreiHAND, HO3D V2 and HO3D V3 datasets. Results showed that the regression accuracy of the proposed method was better than that of existing state-of-the-art methods, with Procrustes-aligned mean per-joint position errors (PA-MPJPE) of 5.8, 10.0 and 10.5 mm on FreiHAND, HO3D V2 and HO3D V3, respectively.

Key words: 3D hand pose estimation    Transformer    deformable attention    interleaved update multi-scale feature encoder    neural network
Received: 2024-01-22    Published: 2025-01-18
CLC:  TP 391  
Supported by: Basic Public Welfare Research Program of Zhejiang Province (LGG22F020027).
About the author: YANG Bing (1985—), female, associate professor, Ph.D.; research interests: computer vision and machine learning. orcid.org/0000-0002-0585-0579. E-mail: yb@hdu.edu.cn

Cite this article:


Bing YANG, Chuyang XU, Jinliang YAO, Xueqin XIANG. 3D hand pose estimation method based on monocular RGB images. Journal of Zhejiang University (Engineering Science), 2025, 59(1): 18−26.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.01.002        https://www.zjujournals.com/eng/CN/Y2025/V59/I1/18

Fig. 1  Model structure of the proposed 3D hand pose estimation method
Fig. 2  Computation flow of deformable attention
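The deformable attention idea referenced above (following Deformable DETR[16]) can be sketched as follows. This is a minimal single-query NumPy illustration with assumed shapes and projection matrices, not the authors' implementation: each query predicts K sampling offsets around its reference point, the feature map is bilinearly sampled at those locations, and the samples are combined with query-dependent attention weights.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H, W, C) at a continuous location (x, y)."""
    H, W, _ = feat.shape
    x0 = min(max(int(np.floor(x)), 0), W - 1)
    y0 = min(max(int(np.floor(y)), 0), H - 1)
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx = np.clip(x - x0, 0.0, 1.0)
    wy = np.clip(y - y0, 0.0, 1.0)
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

def deformable_attention(query, ref_xy, feat, W_off, W_attn, K=4):
    """One deformable-attention head for a single query vector.

    query:  (C,) query feature
    ref_xy: (2,) reference point in (x, y) pixel coordinates
    feat:   (H, W, C) feature map
    W_off:  (C, 2K) projection producing K (dx, dy) sampling offsets
    W_attn: (C, K) projection producing K attention logits
    """
    offsets = (query @ W_off).reshape(K, 2)        # query-dependent offsets
    logits = query @ W_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                       # softmax over the K samples
    samples = np.stack([bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy)
                        for dx, dy in offsets])    # (K, C) sampled features
    return weights @ samples                       # weighted sum -> (C,)
```

Because each query attends to only K sampled locations rather than all H×W positions, the per-query cost is independent of the feature-sequence length, which is what frees the encoder to work on high-resolution multi-scale maps.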
Fig. 3  Interleaved update of multi-scale features
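One way to picture the interleaved update of multi-scale features named above is the following speculative NumPy sketch; the token counts, the plain dot-product cross-attention stand-in, and the alternation schedule are all assumptions for illustration, not the paper's actual encoder. Coarse and fine token sets take turns being updated, each round letting one scale attend to the other so the two resolutions are fused progressively.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_tokens, kv_tokens):
    """Plain dot-product cross-attention with a residual update."""
    d = q_tokens.shape[-1]
    attn = softmax(q_tokens @ kv_tokens.T / np.sqrt(d))  # (Nq, Nkv)
    return q_tokens + attn @ kv_tokens

def interleaved_update(coarse, fine, rounds=2):
    """Alternately refresh coarse (Nc, d) and fine (Nf, d) token sets."""
    for _ in range(rounds):
        coarse = cross_attend(coarse, fine)  # coarse queries gather fine detail
        fine = cross_attend(fine, coarse)    # fine queries read updated context
    return coarse, fine
```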
Fig. 4  Graph convolution residual module
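A graph convolution residual module of the kind named above can be sketched minimally in NumPy as follows; the two-layer layout, feature sizes, and the 4-vertex toy graph are illustrative assumptions, not the paper's exact block. Vertex features are propagated along mesh edges through a normalized adjacency matrix, and a skip connection preserves the original features.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_residual_block(X, A_norm, W1, W2):
    """Two graph convolutions with ReLU, wrapped in a residual skip.

    X: (V, C) per-vertex features; A_norm: (V, V) normalized adjacency.
    """
    H = np.maximum(A_norm @ X @ W1, 0.0)  # graph conv + ReLU
    H = A_norm @ H @ W2                   # second graph conv
    return X + H                          # residual connection

# Illustrative 4-vertex chain standing in for a hand-mesh neighborhood.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
```

The residual skip lets the block refine, rather than replace, the per-vertex features, which is why such modules can expose explicit semantic relations between mesh vertices without destabilizing the regression.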
Method              PA-MPJPE/mm  PA-MPVPE/mm  F@5    F@15
MobileHand[22]      13.1         —            0.439  0.902
FreiHAND[7]         11.0         10.9         0.516  0.934
Pose2Mesh[23]       7.8          7.7          0.674  0.969
I2L-MeshNet[24]     7.4          7.6          0.681  0.973
CMR[15]             6.9          7.0          0.715  0.977
I2UV-HandNet[25]    6.7          6.9          0.707  0.977
METRO[4]            6.3          6.5          0.731  0.984
RealisticHands[26]  —            7.8          0.662  0.971
FastMETRO[5]        6.5          7.2          0.688  0.983
PA-Tran[27]         6.0          6.3          —      —
FastViT-MA36[28]    6.6          6.7          0.722  0.981
Ours (ResNet)       6.4          6.7          0.714  0.982
Ours (HRNet)        5.8          6.2          0.749  0.986
Table 1  Performance comparison of 3D hand pose estimation methods on the FreiHAND dataset
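The PA-MPJPE reported in these tables is the mean per-joint position error after Procrustes alignment[21], i.e., after removing global rotation, scale, and translation between prediction and ground truth. A minimal NumPy sketch of the metric (the joint count and inputs below are illustrative):

```python
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-align pred (N, 3) to gt (N, 3): rotation, scale, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Reflection-safe rotation (force det(R) = +1).
    signs = np.array([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = (Vt.T * signs) @ U.T
    scale = (S * signs).sum() / (P ** 2).sum()
    return scale * P @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """Mean per-joint position error after Procrustes alignment."""
    aligned = procrustes_align(pred, gt)
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

For example, a prediction that differs from the ground truth only by a rigid rotation, a uniform scale, and a translation scores a PA-MPJPE of zero, so the metric isolates articulation error from global pose error.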
Fig. 5  Comparison of ground-truth hand poses and hand poses predicted by the proposed method
Method         NTP/10⁶  NOP/10⁶  FPS/(frame·s⁻¹)  PA-MPJPE/mm  PA-MPVPE/mm  FLOPs/10⁹
METRO[4]       102.3    230.4    37±3             6.3          6.5          108.7
FastMETRO[5]   24.9     153.0    52±3             6.5          7.2          71.0
Ours (HRNet)   13.9     129.6    38±3             5.8          6.2          80.0
Table 2  Performance statistics of Transformer-based methods on the FreiHAND dataset
Method                    PA-MPJPE/mm  PA-MPVPE/mm  F@5    F@15
Pose2Mesh[23]             12.5         12.7         0.441  0.909
METRO[4]                  10.4         11.1         0.484  0.946
ArtiBoost[29]             11.4         10.9         0.488  0.944
Keypoint Transformer[30]  10.8         —            —      —
Ours (HRNet)              10.0         10.0         0.512  0.951
Table 3  Performance comparison of 3D hand pose estimation methods on the HO3D V2 dataset
Method                    PA-MPJPE/mm  PA-MPVPE/mm  F@5    F@15
ArtiBoost[29]             10.8         10.4         0.507  0.946
Keypoint Transformer[30]  10.9         —            —      —
HandOccNet[31]            10.7         10.4         0.479  0.935
Ours (HRNet)              10.5         10.3         0.491  0.941
Table 4  Performance comparison of 3D hand pose estimation methods on the HO3D V3 dataset
Table 5  Module ablation on the FreiHAND dataset (unit: mm): the four evaluated combinations of the interleaved-update multi-scale feature encoder, deformable attention, and graph convolution residual module yield PA-MPJPE/PA-MPVPE of 8.2/8.4 (no modules), 7.0/7.3, 6.8/7.0, and 6.4/6.7 (all three modules), respectively.
1 LI R, LIU Z, TAN J A survey on 3D hand pose estimation: cameras, methods, and datasets[J]. Pattern Recognition, 2019, 93: 251−272
doi: 10.1016/j.patcog.2019.04.026
2 LIU Y, JIANG J, SUN J. Hand pose estimation from RGB images based on deep learning: a survey [C]// 2021 IEEE 7th International Conference on Virtual Reality . Foshan: IEEE, 2021: 82−89.
3 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems . [S.l.]: Curran Associates, 2017: 6000−6010.
4 LIN K, WANG L, LIU Z. End-to-end human pose and mesh reconstruction with transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 1954−1963.
5 CHO J, KIM Y, OH T H. Cross-attention of disentangled modalities for 3D human mesh recovery with transformers [C]// European Conference on Computer Vision . [S.l.]: Springer, 2022: 342−359.
6 ZHENG C, WU W, CHEN C, et al Deep learning-based human pose estimation: a survey[J]. ACM Computing Surveys, 2023, 56 (1): 11
7 ROMERO J, TZIONAS D, BLACK M J Embodied hands: modeling and capturing hands and bodies together[J]. ACM Transactions on Graphics, 2017, 36 (6): 245
8 ZHANG X, LI Q, MO H, et al. End-to-end hand mesh recovery from a monocular RGB image [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 2354−2364.
9 BAEK S, KIM K I, KIM T K. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 1067−1076.
10 BOUKHAYMA A, DE BEM R, TORR P H S. 3D hand shape and pose from images in the wild [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 10835−10844.
11 ZIMMERMANN C, BROX T. Learning to estimate 3D hand pose from single RGB images [C]// Proceedings of the IEEE International Conference on Computer Vision . Venice: IEEE, 2017: 4913−4921.
12 IQBAL U, MOLCHANOV P, BREUEL T, et al. Hand pose estimation via latent 2.5D heatmap regression [C]// Proceedings of the European Conference on Computer Vision . Munich: Springer, 2018: 125−143.
13 KULON D, GÜLER R A, KOKKINOS I, et al. Weakly-supervised mesh-convolutional hand reconstruction in the wild [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle: IEEE, 2020: 4989−4999.
14 GE L, REN Z, LI Y, et al. 3D hand shape and pose estimation from a single RGB image [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 10825−10834.
15 CHEN X, LIU Y, MA C, et al. Camera-space hand mesh recovery via semantic aggregation and adaptive 2D-1D registration [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 13269−13278.
16 ZHU X, SU W, LU L, et al. Deformable DETR: deformable transformers for end-to-end object detection [EB/OL]. (2021−03−18)[2023−11−20]. https://arxiv.org/abs/2010.04159.
17 LIN K, WANG L, LIU Z. Mesh graphormer [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 12919−12928.
18 ZIMMERMANN C, CEYLAN D, YANG J, et al. FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 813−822.
19 HAMPALI S, RAD M, OBERWEGER M, et al. HOnnotate: a method for 3D annotation of hand and object poses [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle: IEEE, 2020: 3193−3203.
20 HAMPALI S, SARKAR S D, LEPETIT V. HO-3D_v3: improving the accuracy of hand-object annotations of the HO-3D dataset [EB/OL]. (2021−07−02)[2024−03−17]. https://arxiv.org/abs/2107.00887.
21 GOWER J C Generalized procrustes analysis[J]. Psychometrika, 1975, 40: 33−51
doi: 10.1007/BF02291478
22 LIM G M, JATESIKTAT P, ANG W T. MobileHand: real-time 3D hand shape and pose estimation from color image [C]// International Conference on Neural Information Processing . [S.l.]: Springer, 2020: 450−459.
23 CHOI H, MOON G, LEE K M. Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose [C]// European Conference on Computer Vision . [S.l.]: Springer, 2020: 769−787.
24 MOON G, LEE K M. I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image [C]// European Conference on Computer Vision . [S.l.]: Springer, 2020: 752−768.
25 CHEN P, CHEN Y, YANG D, et al. I2UV-HandNet: image-to-UV prediction network for accurate and high-fidelity 3D hand mesh modeling [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Montreal: IEEE, 2021: 12909−12918.
26 SEEBER M, PORANNE R, POLLEFEYS M, et al. RealisticHands: a hybrid model for 3D hand reconstruction [C]// 2021 International Conference on 3D Vision . London: IEEE, 2021: 22−31.
27 YU T, BIDULKA L, MCKEOWN M J, et al PA-Tran: learning to estimate 3D hand pose with partial annotation[J]. Sensors, 2023, 23 (3): 1555
doi: 10.3390/s23031555
28 VASU P K A, GABRIEL J, ZHU J, et al. FastViT: a fast hybrid vision Transformer using structural reparameterization [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Paris: IEEE, 2023: 5762−5772.
29 YANG L, LI K, ZHAN X, et al. ArtiBoost: boosting articulated 3D hand-object pose estimation via online exploration and synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 2740−2750.
30 HAMPALI S, SARKAR S D, RAD M, et al. Keypoint Transformer: solving joint identification in challenging hands and object interactions for accurate 3D pose estimation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 11080−11090.