Journal of Zhejiang University (Engineering Science)  2026, Vol. 60 Issue (3): 565-573    DOI: 10.3785/j.issn.1008-973X.2026.03.012
Computer Technology, Control Engineering
3D human pose estimation based on multi-scale encoder fusion
Xiaoan BAO1, Enlin CHEN1, Na ZHANG1, Xiaomei TU2, Biao WU3, Qingqi ZHANG4,*
1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2. School of Civil Engineering and Architecture, Zhejiang Guangsha Vocational and Technical University of Construction, Dongyang 322100, China
3. School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
4. Graduate School of East Asian Studies, Yamaguchi University, Yamaguchi 753-8514, Japan
Abstract:

A 3D human pose estimation method based on multi-scale encoder fusion was proposed to address the contradiction between redundant-information interference and the need for information completeness. The method consisted of a key-frame spatial-temporal encoder (KFSTE) and a global retention self-attention encoder (GRSAE). In KFSTE, the skeletal feature sequence was filtered by a key-frame selector, and local spatial-temporal dependencies were then modeled by a temporal encoder. In GRSAE, global single-stage encoding was performed by a retention encoder to capture global skeletal sequence features, thereby avoiding the information loss caused by key-frame selection bias. The 3D human pose coordinates were predicted by concatenating the features of the two encoders and feeding them to a regression module. Experimental results on the large-scale Human3.6M dataset demonstrated that the proposed method reduced the mean per-joint position error (MPJPE) by 3% compared with MixSTE and achieved the best performance on 11 actions.

Key words: three-dimensional human pose estimation; spatial-temporal encoder; key-frame extraction; retentive self-attention encoding; multi-encoder feature fusion
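The dual-branch pipeline summarized in the abstract (key-frame filtering in KFSTE, full-sequence encoding in GRSAE, fusion by concatenation before regression) can be sketched compactly. The PyTorch-style code below is a minimal illustration, not the authors' implementation: the scoring head used for key-frame selection, the vanilla Transformer encoder standing in for both the temporal encoder and the retention encoder, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MSEFNSketch(nn.Module):
    """Minimal sketch of the dual-encoder fusion idea (illustrative only)."""

    def __init__(self, num_joints=17, dim=32, num_keyframes=27):
        super().__init__()
        d_model = num_joints * dim
        self.kf = num_keyframes
        self.score = nn.Linear(d_model, 1)  # hypothetical key-frame scoring head
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.temporal_enc = nn.TransformerEncoder(layer, num_layers=4)  # KFSTE branch
        self.global_enc = nn.TransformerEncoder(layer, num_layers=4)    # stand-in for GRSAE
        self.regress = nn.Linear(2 * d_model, num_joints * 3)

    def forward(self, x):
        # x: (B, F, J, C) skeleton feature sequence
        b, f, j, c = x.shape
        tokens = x.reshape(b, f, j * c)
        # KFSTE: keep the kf highest-scoring frames (restored to temporal order).
        idx = self.score(tokens).squeeze(-1).topk(self.kf, dim=1).indices
        idx = idx.sort(dim=1).values
        key = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        local = self.temporal_enc(key).mean(dim=1)
        # GRSAE: single-stage global encoding of the full, unfiltered sequence.
        glob = self.global_enc(tokens).mean(dim=1)
        # Fuse by concatenation, then regress 3D joint coordinates.
        return self.regress(torch.cat([local, glob], dim=-1)).reshape(b, j, 3)

# usage: 2 clips, 81 frames, 17 joints, 32-d joint features -> (2, 17, 3)
out = MSEFNSketch()(torch.randn(2, 81, 17, 32))
```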
Received: 2025-03-13    Published: 2026-02-04
CLC number: TP 393
Funding: National Natural Science Foundation of China (6207050141); Key Research and Development Program of Zhejiang Province (2020C03094); General Scientific Research Project of the Zhejiang Provincial Department of Education (Y202147659); Zhejiang Provincial Department of Education projects (Y202250706, Y202250677); Zhejiang Provincial Basic Public Welfare Research Program (QY19E050003).
Corresponding author: Qingqi ZHANG    E-mail: baoxiaoan@zstu.edu.cn; c503snw@yamaguchi-u.ac.jp
About the author: Xiaoan BAO (b. 1973), male, professor, research field: machine vision. orcid.org/0000-0001-8305-0369. E-mail: baoxiaoan@zstu.edu.cn

Cite this article:


Xiaoan BAO, Enlin CHEN, Na ZHANG, Xiaomei TU, Biao WU, Qingqi ZHANG. 3D human pose estimation based on multi-scale encoder fusion. Journal of Zhejiang University (Engineering Science), 2026, 60(3): 565-573.

Link this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.03.012        https://www.zjujournals.com/eng/CN/Y2026/V60/I3/565

Fig. 1  Network structure of the 3D human pose estimation algorithm based on multi-scale encoder fusion
Fig. 2  Schematic diagram of the key-frame spatial-temporal encoder
Fig. 3  Structure of the spatial-temporal encoder module
Fig. 4  Structure of the retentive self-attention encoding
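The retention encoding of Fig. 4 follows the retentive-network line of work cited as [20] and [23]: the softmax of self-attention is replaced by a decay-weighted dot product, so temporally distant frames are attenuated by a fixed geometric factor. The snippet below is a minimal sketch of one parallel-form, bidirectional variant; the decay schedule, gating, and normalization of the paper's encoder may differ.

```python
import torch

def retention_sketch(q, k, v, gamma=0.9):
    """Parallel-form retention-style mixing (illustrative only).

    q, k, v: (B, F, D) projections of F frame tokens. A fixed mask
    D[n, m] = gamma**|n - m| replaces the softmax of self-attention.
    """
    f = q.shape[1]
    pos = torch.arange(f)
    decay = gamma ** (pos[:, None] - pos[None, :]).abs().float()  # (F, F)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5         # scaled dot product
    return (scores * decay) @ v                                   # decay-weighted mixing

out = retention_sketch(torch.randn(2, 8, 16), torch.randn(2, 8, 16), torch.randn(2, 8, 16))
```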
CPN detection, Protocol 1 (MPJPE/mm):

| Method | Venue | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ref. [32] (f=243) | CVPR'19 | 45.2 | 46.7 | 43.3 | 45.6 | 48.1 | 55.1 | 44.6 | 44.3 | 57.3 | 65.8 | 47.1 | 44.0 | 49.0 | 32.8 | 33.9 | 46.8 |
| Ref. [33] | NeurIPS'19 | 44.8 | 46.1 | 43.3 | 46.4 | 49.0 | 55.2 | 44.6 | 44.0 | 58.3 | 62.7 | 47.1 | 43.9 | 48.6 | 32.7 | 33.3 | 46.7 |
| Ref. [9] (f=243) | CVPR'20 | 41.8 | 44.8 | 41.1 | 44.9 | 47.4 | 54.1 | 43.4 | 42.2 | 56.2 | 63.6 | 45.3 | 43.5 | 45.3 | 31.3 | 32.2 | 45.1 |
| SRNet [11] | ECCV'20 | 46.6 | 47.1 | 43.9 | 41.6 | 45.8 | 49.6 | 46.5 | 40.0 | 53.4 | 61.1 | 46.1 | 42.6 | 43.1 | 31.5 | 32.6 | 44.8 |
| UGCN [34] (f=96) | ECCV'20 | 41.3 | 43.9 | 44.0 | 42.2 | 48.0 | 57.1 | 42.2 | 43.2 | 57.3 | 61.3 | 47.0 | 43.5 | 47.0 | 32.6 | 31.8 | 45.6 |
| Ref. [8] (f=81) | TCSVT'21 | 42.1 | 43.8 | 41.0 | 43.8 | 46.1 | 53.5 | 42.4 | 43.1 | 53.9 | 60.5 | 45.7 | 42.1 | 46.2 | 32.2 | 33.8 | 44.6 |
| PoseFormer [13] (f=81) | ICCV'21 | 41.5 | 44.8 | 39.8 | 42.5 | 46.5 | 51.6 | 42.0 | 42.0 | 53.3 | 60.7 | 45.5 | 43.3 | 46.1 | 31.8 | 32.2 | 44.3 |
| MHFormer [21] (f=351) | CVPR'22 | 39.2 | 43.1 | 40.1 | 40.9 | 44.9 | 51.2 | 40.6 | 41.3 | 53.5 | 60.3 | 43.7 | 41.1 | 43.8 | 29.8 | 30.6 | 43.0 |
| MixSTE [15] (f=243) | CVPR'22 | 37.6 | 40.9 | 37.3 | 39.7 | 42.3 | 49.9 | 40.1 | 39.8 | 51.7 | 55.0 | 42.1 | 39.8 | 41.0 | 27.9 | 27.9 | 40.9 |
| KTPFormer [30] (f=243) | CVPR'24 | 37.3 | 39.2 | 35.9 | 37.6 | 42.5 | 48.2 | 38.6 | 39.0 | 51.4 | 55.9 | 41.6 | 39.0 | 40.0 | 27.0 | 27.4 | 40.1 |
| TCPFormer [31] (f=81) | AAAI'25 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 40.5 |
| MSEFN (f=81, kf=27) | – | 36.8 | 39.9 | 35.8 | 39.1 | 41.0 | 47.5 | 37.5 | 37.6 | 51.5 | 47.9 | 39.8 | 38.2 | 42.1 | 25.6 | 26.8 | 39.8 |
CPN detection, Protocol 2 (MPJPE/mm):

| Method | Venue | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ref. [7] | CVPR'18 | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8 |
| Ref. [35] (f=7) | ICCV'19 | 35.7 | 37.8 | 36.9 | 40.7 | 39.6 | 45.2 | 37.4 | 34.5 | 46.9 | 50.1 | 40.5 | 36.1 | 41.0 | 29.6 | 32.3 | 39.0 |
| Ref. [9] (f=243) | CVPR'20 | 32.3 | 35.2 | 33.3 | 35.8 | 35.9 | 41.5 | 33.2 | 32.7 | 44.6 | 50.9 | 37.0 | 32.4 | 37.0 | 25.2 | 27.2 | 35.6 |
| UGCN [34] (f=96) | ECCV'20 | 32.9 | 35.2 | 35.6 | 34.4 | 36.4 | 42.7 | 31.2 | 32.5 | 45.6 | 50.2 | 37.3 | 32.8 | 36.3 | 26.0 | 23.9 | 35.5 |
| PoseFormer [13] (f=81) | ICCV'21 | 32.5 | 34.8 | 32.6 | 34.6 | 35.3 | 39.5 | 32.1 | 32.0 | 42.8 | 48.5 | 34.8 | 32.4 | 35.3 | 24.5 | 26.0 | 34.6 |
| MHFormer [21] (f=351) | CVPR'22 | 31.5 | 34.9 | 32.8 | 33.6 | 35.3 | 39.6 | 32.0 | 32.2 | 43.5 | 48.1 | 36.4 | 32.6 | 34.3 | 23.9 | 25.1 | 34.4 |
| MixSTE [15] (f=243) | CVPR'22 | 30.8 | 33.1 | 30.3 | 31.8 | 33.1 | 39.1 | 31.1 | 30.5 | 42.5 | 44.5 | 34.0 | 30.8 | 32.7 | 22.1 | 22.9 | 32.6 |
| KTPFormer [30] (f=243) | CVPR'24 | 30.1 | 32.3 | 29.6 | 30.8 | 32.3 | 37.3 | 30.0 | 30.2 | 41.0 | 45.3 | 33.6 | 29.9 | 31.4 | 21.5 | 22.6 | 31.9 |
| TCPFormer [31] (f=81) | AAAI'25 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | 33.7 |
| MSEFN (f=81, kf=27) | – | 27.5 | 30.5 | 28.5 | 31.8 | 31.5 | 36.4 | 28.5 | 28.0 | 41.7 | 46.6 | 32.0 | 28.3 | 31.8 | 19.8 | 21.3 | 31.0 |
Table 1  MPJPE results of MSEFN and other methods on the Human3.6M dataset under Protocol 1 and Protocol 2 (CPN-detected 2D input)
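Protocols 1 and 2 above are the standard Human3.6M conventions: Protocol 1 reports MPJPE between predicted and ground-truth joints after root alignment, while Protocol 2 reports the error after an additional similarity (Procrustes) alignment. The NumPy sketch below illustrates both metrics under those standard definitions; it is not the authors' evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Protocol 1: mean per-joint position error in mm; (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Protocol 2: MPJPE after Procrustes (similarity) alignment of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)      # cross-covariance -> optimal rotation
    if np.linalg.det(vt.T @ u.T) < 0:      # guard against reflections
        vt[-1] *= -1
        s[-1] *= -1
    r = vt.T @ u.T
    scale = s.sum() / (p ** 2).sum()       # optimal isotropic scale
    return mpjpe(scale * p @ r.T + mu_g, gt)
```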
Ground-truth 2D input, Protocol 1 (MPJPE/mm):

| Method | Venue | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PoseFormer [13] (f=81) | ICCV'21 | 30.0 | 33.6 | 29.9 | 31.0 | 30.2 | 33.3 | 34.8 | 31.4 | 37.8 | 38.6 | 31.7 | 31.5 | 29.0 | 23.3 | 23.1 | 31.3 |
| MHFormer [21] (f=351) | CVPR'22 | 27.7 | 32.1 | 29.1 | 28.9 | 30.0 | 33.9 | 33.0 | 31.2 | 37.0 | 39.3 | 30.0 | 31.0 | 29.4 | 22.2 | 23.0 | 30.5 |
| POT [36] (f=81) | AAAI'23 | 32.9 | 38.3 | 28.3 | 33.8 | 34.9 | 38.7 | 37.2 | 30.7 | 34.5 | 39.7 | 33.9 | 34.7 | 34.3 | 26.1 | 28.9 | 33.8 |
| MSEFN (f=81, kf=27) | – | 27.1 | 28.0 | 25.3 | 26.5 | 24.6 | 27.7 | 29.8 | 26.0 | 31.3 | 33.6 | 26.5 | 28.7 | 28.1 | 16.3 | 18.3 | 26.5 |
Table 2  MPJPE results of MSEFN and other methods on the Human3.6M dataset under Protocol 1 (ground-truth 2D pose input)
| Method | PCK/% | AUC/% | MPJPE/mm |
| --- | --- | --- | --- |
| VideoPose3D [37] | 86.0 | 51.9 | 84.0 |
| UGCN [34] | 86.9 | 62.1 | 68.1 |
| Anatomy3D [8] | 87.9 | 54.0 | 78.8 |
| MixSTE [15] | 94.4 | 66.5 | 54.9 |
| PoseFormer [13] | 95.4 | 63.2 | 57.7 |
| MHFormer [21] | 93.8 | 63.3 | 58.0 |
| P-STMO [22] | 97.9 | 75.8 | 32.2 |
| MSEFN | 99.1 | 76.4 | 31.2 |
Table 3  Detailed quantitative comparison of three metrics on MPI-INF-3DHP
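PCK, AUC and MPJPE on MPI-INF-3DHP follow the dataset's usual conventions: PCK is the percentage of joints whose error falls below 150 mm, and AUC averages PCK over a sweep of thresholds (0 to 150 mm in 5 mm steps is one common choice). A minimal sketch under those assumed conventions:

```python
import numpy as np

def mpi_inf_3dhp_metrics(pred, gt, pck_thresh=150.0):
    """pred, gt: (N, J, 3) joint positions in millimetres."""
    err = np.linalg.norm(pred - gt, axis=-1)        # (N, J) per-joint errors
    pck = 100.0 * (err < pck_thresh).mean()         # PCK/% at 150 mm
    thresholds = np.arange(0.0, 151.0, 5.0)
    auc = 100.0 * np.mean([(err < t).mean() for t in thresholds])
    return pck, auc, err.mean()                     # PCK/%, AUC/%, MPJPE/mm
```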
| Method | f | FLOPs/10^6 | Np/10^6 | MPJPE/mm |
| --- | --- | --- | --- | --- |
| PoseFormer [13] | 27 | 542.1 | 9.65 | 47.0 |
| StridedTran [14] | 81 | 392 | 4.06 | 45.4 |
| MHFormer [21] | 27 | 1031.8 | 18.92 | 45.9 |
| MSEFN | 81 (kf=27) | 553 | 9.97 | 39.8 |
Table 4  Quantitative comparison of FLOPs, Np and MPJPE on Human3.6M
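The Np column (trainable parameters, in millions) is easy to reproduce for any PyTorch model with the helper below; FLOPs are typically measured with a third-party profiler and depend on the input clip length f. This is a generic utility, not tied to the paper's code.

```python
import torch.nn as nn

def count_params_millions(model: nn.Module) -> float:
    """Trainable parameter count in millions (the Np/10^6 column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```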
| Module combination | MPJPE/mm |
| --- | --- |
| Baseline | 31.3 |
| GRSAE | 34.7 |
| Baseline+GRSAE | 28.2 |
| Baseline+KFS (KFSTE) | 29.6 |
| Baseline+KFS+GRSAE (MSEFN) | 26.5 |
Table 5  Ablation study of different modules of the MSEFN algorithm
Fig. 5  Visual evaluation of 3D HPE results of the proposed method on the Human3.6M dataset
| Method | f | kf | FLOPs/10^6 | Np/10^6 | v/(frame·s^-1) | MPJPE/mm |
| --- | --- | --- | --- | --- | --- | --- |
| PoseFormer [13] | 27 | 27 | 542.1 | 9.65 | 428 | 47.0 |
| StridedTran [14] | 81 | 81 | 392 | 4.06 | 199 | 45.4 |
| MHFormer [21] | 27 | 27 | 1031.8 | 18.92 | 33 | 45.9 |
| MSEFN | 27 | 3 | 409 | 6.23 | 470 | 47.8 |
| MSEFN | 27 | 9 | 467 | 7.55 | 450 | 45.5 |
| MSEFN | 81 | 3 | 458 | 7.11 | 455 | 42.8 |
| MSEFN | 81 | 9 | 521 | 7.96 | 449 | 40.1 |
| MSEFN | 81 | 27 | 553 | 9.97 | 403 | 39.8 |
Table 6  Quantitative comparison in the ablation study on the number of selected key frames
| tc | tl | MPJPE/mm | tc | tl | MPJPE/mm |
| --- | --- | --- | --- | --- | --- |
| 16 | 2 | 42.0 | 32 | 6 | 40.3 |
| 16 | 4 | 41.9 | 48 | 2 | 40.6 |
| 16 | 6 | 41.8 | 48 | 4 | 40.4 |
| 32 | 2 | 41.1 | 48 | 6 | 41.2 |
| 32 | 4 | 39.8 | – | – | – |
Table 7  Ablation results under different dimensions and numbers of layers
Fig. 6  Visualization of the proposed method on the Human3.6M dataset
1 ZHANG C, YANG T, WENG J, et al. Unsupervised pre-training for temporal action localization tasks [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 14011–14021.
2 CHEN H, HE J Y, XIANG W, et al. HDFormer: high-order directed transformer for 3D human pose estimation [C]//Proceedings of the 32nd International Joint Conference on Artificial Intelligence. Macao: ACM, 2023: 581-589.
3 LIU M, YUAN J. Recognizing human actions as the evolution of pose estimation maps [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1159–1168.
4 ZHANG Q, BAO X, WU R, et al A skeleton temporal fusion graph convolutional network for elderly action recognition[J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2025, 108 (5): 704- 713
5 MEHTA D, SRIDHAR S, SOTNYCHENKO O, et al VNect: real-time 3D human pose estimation with a single RGB camera[J]. ACM Transactions on Graphics, 2017, 36 (4): 1- 14
6 MOON G, LEE K M. I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image [C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 752–768.
7 PAVLAKOS G, ZHOU X, DANIILIDIS K. Ordinal depth supervision for 3D human pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7307–7316.
8 CHEN T, FANG C, SHEN X, et al Anatomy-aware 3D human pose estimation with bone-based pose decomposition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32 (1): 198- 209
doi: 10.1109/TCSVT.2021.3057267
9 LIU R, SHEN J, WANG H, et al. Attention mechanism exploits temporal contexts: real-time 3D human pose reconstruction [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 5063–5072.
10 WANG J, YAN S, XIONG Y, et al. Motion guided 3D pose estimation from videos [C]// Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 764–780.
11 ZENG A, SUN X, HUANG F, et al. SRNet: improving generalization in 3D human pose estimation with a split-and-recombine approach [C]// Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 507–523.
12 CAO Z, SIMON T, WEI S E, et al. Realtime multi-person 2D pose estimation using part affinity fields [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7291-7299.
13 ZHENG C, ZHU S, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 11636–11645.
14 LI W, LIU H, DING R, et al Exploiting temporal contexts with strided transformer for 3D human pose estimation[J]. IEEE Transactions on Multimedia, 2023, 25: 1282- 1293
doi: 10.1109/TMM.2022.3141231
15 ZHANG J, TU Z, YANG J, et al. MixSTE: seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 13222–13232.
16 ZHU W, MA X, LIU Z, et al. MotionBERT: a unified perspective on learning human motion representations [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 15039–15053.
17 TANG Z, QIU Z, HAO Y, et al. 3D human pose estimation with spatio-temporal criss-cross attention [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 4790–4799.
18 CHEN X, HAN Y, WANG X, et al Action keypoint network for efficient video recognition[J]. IEEE Transactions on Image Processing, 2022, 31: 4980- 4993
doi: 10.1109/TIP.2022.3191461
19 EINFALT M, LUDWIG K, LIENHART R. Uplift and upsample: efficient 3D human pose estimation with uplifting transformers [C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 2902–2912.
20 FAN Q, HUANG H, CHEN M, et al. RMT: retentive networks meet vision transformers [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5641-5651.
21 LI W, LIU H, TANG H, et al. MHFormer: multi-hypothesis transformer for 3D human pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 13137–13146.
22 SHAN W, LIU Z, ZHANG X, et al. P-STMO: pre-trained spatial temporal many-to-one model for 3D human pose estimation [C]//European Conference on Computer Vision. Cham: Springer, 2022: 461–478.
23 FAN Q, HUANG H, CHEN M, et al. RMT: retentive networks meet vision transformers [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5641–5651.
24 IONESCU C, PAPAVA D, OLARU V, et al Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36 (7): 1325- 1339
doi: 10.1109/TPAMI.2013.248
25 MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision [C]//Proceedings of the International Conference on 3D Vision. Qingdao: IEEE, 2018: 506–516.
26 ZHENG C, WU W, CHEN C, et al Deep learning-based human pose estimation: a survey[J]. ACM Computing Surveys, 2024, 56 (1): 1- 37
27 MARGOSSIAN C C A review of automatic differentiation and its efficient implementation[J]. WIREs Data Mining and Knowledge Discovery, 2019, 9 (4): e1305
doi: 10.1002/widm.1305
28 FINDER S E, AMOYAL R, TREISTER E, et al. Wavelet convolutions for large receptive fields [C]//European Conference on Computer Vision. Cham: Springer, 2024: 363-380.
29 CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7103–7112.
30 PENG J, ZHOU Y, MOK P Y. KTPFormer: kinematics and trajectory prior knowledge-enhanced transformer for 3D human pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 1123–1132.
31 LIU J, LIU M, LIU H, et al. TCPFormer: learning temporal correlation with implicit pose proxy for 3D human pose estimation [C]//Proceedings of the AAAI Conference on Artificial Intelligence. Philadelphia: AAAI, 2025, 39(5): 5478−5486.
32 PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7745–7754.
33 YEH R, HU Y T, SCHWING A. Chirality nets for human pose regression [J]. Advances in Neural Information Processing Systems, 2019, 32: 8161–8171.
34 WANG J, YAN S, XIONG Y, et al. Motion guided 3d pose estimation from videos [C]//European Conference on Computer Vision. Cham: Springer, 2020: 764−780.
35 CAI Y, GE L, LIU J, et al. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 2272–2281.
36 LI H, SHI B, DAI W, et al Pose-oriented transformer with uncertainty-guided refinement for 2D-to-3D human pose estimation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37 (1): 1296- 1304
doi: 10.1609/aaai.v37i1.25213