Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (3): 565-573    DOI: 10.3785/j.issn.1008-973X.2026.03.012
    
3D human pose estimation based on multi-scale encoder fusion
Xiaoan BAO1, Enlin CHEN1, Na ZHANG1, Xiaomei TU2, Biao WU3, Qingqi ZHANG4,*
1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2. School of Civil Engineering and Architecture, Zhejiang Guangsha Vocational and Technical University of Construction, Dongyang 322100, China
3. School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
4. Graduate School of East Asian Studies, Yamaguchi University, Yamaguchi 753-8514, Japan

Abstract  

A 3D human pose estimation method based on multi-scale encoder fusion was proposed to resolve the conflict between interference from redundant information and the need for complete sequence information. The method consisted of a key-frame spatial-temporal encoder (KFSTE) and a global retention self-attention encoder (GRSAE). In KFSTE, the skeletal feature sequence was filtered by a key-frame selector, and local spatial-temporal dependencies were then modeled by a temporal encoder. In GRSAE, global single-stage encoding was performed by a retention encoder to capture global skeletal sequence features, avoiding the information loss caused by key-frame selection bias. The 3D human pose coordinates were predicted by concatenating the features of the two encoders and passing them through a regression module. Experimental results on the large-scale Human3.6M dataset showed that the proposed method reduced the mean per-joint position error (MPJPE) by 3% compared with MixSTE and achieved the best results on 11 actions.
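The abstract describes a dual-branch design: KFSTE filters the skeleton sequence with a key-frame selector and encodes local spatial-temporal dependencies, GRSAE encodes the full sequence in a single retention-based pass, and the two feature streams are concatenated and regressed to 3D joint coordinates. The PyTorch sketch below only illustrates that data flow; the submodules (key-frame scoring, the stand-in for the retention encoder, all layer sizes) are generic assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Minimal sketch of the KFSTE + GRSAE fusion described in the abstract.
    All submodules are generic stand-ins, not the paper's actual layers."""
    def __init__(self, num_joints=17, in_dim=2, feat_dim=64, num_keyframes=27):
        super().__init__()
        self.embed = nn.Linear(num_joints * in_dim, feat_dim)   # per-frame skeleton embedding
        self.kf_score = nn.Linear(feat_dim, 1)                  # key-frame selector (scores each frame)
        self.num_keyframes = num_keyframes
        # KFSTE branch: local spatio-temporal encoding of the selected key frames
        self.kfste = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True), num_layers=2)
        # GRSAE branch: single-stage global encoding of the whole sequence
        # (a vanilla Transformer encoder stands in for the retention encoder here)
        self.grsae = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True), num_layers=2)
        self.regress = nn.Linear(2 * feat_dim, num_joints * 3)  # fuse and regress 3D joints

    def forward(self, x):                                       # x: (B, F, J, 2) 2D keypoints
        B, F, J, C = x.shape
        h = self.embed(x.reshape(B, F, J * C))                  # (B, F, D)
        # key-frame selection: keep the top-k highest-scoring frames, in temporal order
        scores = self.kf_score(h).squeeze(-1)                   # (B, F)
        idx = scores.topk(self.num_keyframes, dim=1).indices.sort(dim=1).values
        h_kf = torch.gather(h, 1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        local_feat = self.kfste(h_kf).mean(dim=1)               # (B, D) local key-frame feature
        global_feat = self.grsae(h).mean(dim=1)                 # (B, D) global sequence feature
        fused = torch.cat([local_feat, global_feat], dim=-1)    # feature concatenation
        return self.regress(fused).reshape(B, J, 3)             # predicted 3D pose

x = torch.randn(2, 81, 17, 2)                                   # 81-frame 2D skeleton sequences
print(DualEncoderFusion()(x).shape)                             # torch.Size([2, 17, 3])
```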



Key words: three-dimensional human pose estimation; spatial-temporal encoder; key-frame extraction; retentive self-attention encoding; multi-encoder feature fusion
Received: 13 March 2025      Published: 04 February 2026
CLC:  TP 393  
Fund: National Natural Science Foundation of China (6207050141); Key Research and Development Program of Zhejiang Province (2020C03094); General Scientific Research Project of the Zhejiang Provincial Department of Education (Y202147659); Zhejiang Provincial Department of Education Projects (Y202250706, Y202250677); Zhejiang Provincial Basic Public Welfare Research Program (QY19E050003).
Corresponding Authors: Qingqi ZHANG     E-mail: baoxiaoan@zstu.edu.cn;c503snw@yamaguchi-u.ac.jp
Cite this article:

Xiaoan BAO,Enlin CHEN,Na ZHANG,Xiaomei TU,Biao WU,Qingqi ZHANG. 3D human pose estimation based on multi-scale encoder fusion. Journal of ZheJiang University (Engineering Science), 2026, 60(3): 565-573.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.03.012     OR     https://www.zjujournals.com/eng/Y2026/V60/I3/565


Fig.1 Network structure of three-dimensional human pose estimation algorithm based on multi-scale encoder fusion
Fig.2 Schematic diagram of key-frame spatial-temporal encoder
Fig.3 Structure of spatial-temporal transformer encoder
Fig.4 Structure of retentive self-attention encoder
CPN input, protocol 1 (MPJPE/mm)
Method | Venue | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg.
Ref. [32] (f = 243) | CVPR'19 | 45.2 | 46.7 | 43.3 | 45.6 | 48.1 | 55.1 | 44.6 | 44.3 | 57.3 | 65.8 | 47.1 | 44 | 49 | 32.8 | 33.9 | 46.8
Ref. [33] | NeurIPS'19 | 44.8 | 46.1 | 43.3 | 46.4 | 49.0 | 55.2 | 44.6 | 44 | 58.3 | 62.7 | 47.1 | 43.9 | 48.6 | 32.7 | 33.3 | 46.7
Ref. [9] (f = 243) | CVPR'20 | 41.8 | 44.8 | 41.1 | 44.9 | 47.4 | 54.1 | 43.4 | 42.2 | 56.2 | 63.6 | 45.3 | 43.5 | 45.3 | 31.3 | 32.2 | 45.1
SRNet [11] | ECCV'20 | 46.6 | 47.1 | 43.9 | 41.6 | 45.8 | 49.6 | 46.5 | 40.0 | 53.4 | 61.1 | 46.1 | 42.6 | 43.1 | 31.5 | 32.6 | 44.8
UGCN [34] (f = 96) | ECCV'20 | 41.3 | 43.9 | 44.0 | 42.2 | 48.0 | 57.1 | 42.2 | 43.2 | 57.3 | 61.3 | 47.0 | 43.5 | 47.0 | 32.6 | 31.8 | 45.6
Ref. [8] (f = 81) | TCSVT'21 | 42.1 | 43.8 | 41.0 | 43.8 | 46.1 | 53.5 | 42.4 | 43.1 | 53.9 | 60.5 | 45.7 | 42.1 | 46.2 | 32.2 | 33.8 | 44.6
PoseFormer [13] (f = 81) | ICCV'21 | 41.5 | 44.8 | 39.8 | 42.5 | 46.5 | 51.6 | 42 | 42 | 53.3 | 60.7 | 45.5 | 43.3 | 46.1 | 31.8 | 32.2 | 44.3
MHFormer [21] (f = 351) | CVPR'22 | 39.2 | 43.1 | 40.1 | 40.9 | 44.9 | 51.2 | 40.6 | 41.3 | 53.5 | 60.3 | 43.7 | 41.1 | 43.8 | 29.8 | 30.6 | 43.0
MixSTE [15] (f = 243) | CVPR'22 | 37.6 | 40.9 | 37.3 | 39.7 | 42.3 | 49.9 | 40.1 | 39.8 | 51.7 | 55.0 | 42.1 | 39.8 | 41.0 | 27.9 | 27.9 | 40.9
KTPFormer [30] (f = 243) | CVPR'24 | 37.3 | 39.2 | 35.9 | 37.6 | 42.5 | 48.2 | 38.6 | 39.0 | 51.4 | 55.9 | 41.6 | 39.0 | 40.0 | 27.0 | 27.4 | 40.1
TCPFormer [31] (f = 81) | AAAI'25 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 40.5
MSEFN (f = 81, kf = 27) | — | 36.8 | 39.9 | 35.8 | 39.1 | 41.0 | 47.5 | 37.5 | 37.6 | 51.5 | 47.9 | 39.8 | 38.2 | 42.1 | 25.6 | 26.8 | 39.8
CPN input, protocol 2 (MPJPE/mm)
Method | Venue | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg.
Ref. [7] | CVPR'18 | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8
Ref. [35] (f = 7) | ICCV'19 | 35.7 | 37.8 | 36.9 | 40.7 | 39.6 | 45.2 | 37.4 | 34.5 | 46.9 | 50.1 | 40.5 | 36.1 | 41.0 | 29.6 | 32.3 | 39.0
Ref. [9] (f = 243) | CVPR'20 | 32.3 | 35.2 | 33.3 | 35.8 | 35.9 | 41.5 | 33.2 | 32.7 | 44.6 | 50.9 | 37.0 | 32.4 | 37.0 | 25.2 | 27.2 | 35.6
UGCN [34] (f = 96) | ECCV'20 | 32.9 | 35.2 | 35.6 | 34.4 | 36.4 | 42.7 | 31.2 | 32.5 | 45.6 | 50.2 | 37.3 | 32.8 | 36.3 | 26.0 | 23.9 | 35.5
PoseFormer [13] (f = 81) | ICCV'21 | 32.5 | 34.8 | 32.6 | 34.6 | 35.3 | 39.5 | 32.1 | 32.0 | 42.8 | 48.5 | 34.8 | 32.4 | 35.3 | 24.5 | 26 | 34.6
MHFormer [21] (f = 351) | CVPR'22 | 31.5 | 34.9 | 32.8 | 33.6 | 35.3 | 39.6 | 32.0 | 32.2 | 43.5 | 48.1 | 36.4 | 32.6 | 34.3 | 23.9 | 25.1 | 34.4
MixSTE [15] (f = 243) | CVPR'22 | 30.8 | 33.1 | 30.3 | 31.8 | 33.1 | 39.1 | 31.1 | 30.5 | 42.5 | 44.5 | 34.0 | 30.8 | 32.7 | 22.1 | 22.9 | 32.6
KTPFormer [30] (f = 243) | CVPR'24 | 30.1 | 32.3 | 29.6 | 30.8 | 32.3 | 37.3 | 30.0 | 30.2 | 41.0 | 45.3 | 33.6 | 29.9 | 31.4 | 21.5 | 22.6 | 31.9
TCPFormer [31] (f = 81) | AAAI'25 | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 33.7
MSEFN (f = 81, kf = 27) | — | 27.5 | 30.5 | 28.5 | 31.8 | 31.5 | 36.4 | 28.5 | 28 | 41.7 | 46.6 | 32 | 28.3 | 31.8 | 19.8 | 21.3 | 31.0
Tab.1 Comparison of MPJPE results of MSEFN and different methods under protocol 1 and protocol 2 on Human3.6M dataset (CPN-detected 2D pose input)
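Both halves of Tab.1 report MPJPE in millimetres. On Human3.6M, protocol 1 is conventionally the mean per-joint position error on root-aligned poses, and protocol 2 applies a rigid Procrustes alignment (rotation, translation, scale) before the error is averaged. The NumPy sketch below assumes these standard definitions; it is not a restatement of the paper's evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Protocol 1: mean per-joint position error in mm.
    pred, gt: (J, 3) root-aligned joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Protocol 2: MPJPE after rigid Procrustes alignment (rotation, translation, scale)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g                       # center both poses
    u, s, vt = np.linalg.svd(p.T @ g)                   # SVD of the 3x3 cross-covariance
    d = np.ones(3)
    d[-1] = np.sign(np.linalg.det(vt.T @ u.T))          # guard against reflections
    r = vt.T @ np.diag(d) @ u.T                         # optimal rotation
    scale = (s * d).sum() / (p ** 2).sum()              # optimal isotropic scale
    aligned = scale * p @ r.T + mu_g
    return mpjpe(aligned, gt)

pred = np.random.rand(17, 3) * 1000                     # toy 17-joint pose in mm
gt = pred + np.random.randn(17, 3) * 20
print(f"MPJPE={mpjpe(pred, gt):.1f} mm  P-MPJPE={p_mpjpe(pred, gt):.1f} mm")
```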
GT input, protocol 1 (MPJPE/mm)
Method | Venue | Dir. | Disc. | Eat. | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg.
PoseFormer [13] (f = 81) | ICCV'21 | 30.0 | 33.6 | 29.9 | 31.0 | 30.2 | 33.3 | 34.8 | 31.4 | 37.8 | 38.6 | 31.7 | 31.5 | 29.0 | 23.3 | 23.1 | 31.3
MHFormer [21] (f = 351) | CVPR'22 | 27.7 | 32.1 | 29.1 | 28.9 | 30.0 | 33.9 | 33.0 | 31.2 | 37 | 39.3 | 30.0 | 31.0 | 29.4 | 22.2 | 23.0 | 30.5
POT [36] (f = 81) | AAAI'23 | 32.9 | 38.3 | 28.3 | 33.8 | 34.9 | 38.7 | 37.2 | 30.7 | 34.5 | 39.7 | 33.9 | 34.7 | 34.3 | 26.1 | 28.9 | 33.8
MSEFN (f = 81, kf = 27) | — | 27.1 | 28.0 | 25.3 | 26.5 | 24.6 | 27.7 | 29.8 | 26.0 | 31.3 | 33.6 | 26.5 | 28.7 | 28.1 | 16.3 | 18.3 | 26.5
Tab.2 Comparison of MPJPE results of MSEFN and different methods under protocol 1 on Human3.6M dataset (ground-truth 2D pose input)
Method | PCK/% | AUC/% | MPJPE/mm
VideoPose3D [37] | 86.0 | 51.9 | 84.0
UGCN [34] | 86.9 | 62.1 | 68.1
Anatomy3D [8] | 87.9 | 54.0 | 78.8
MixSTE [15] | 94.4 | 66.5 | 54.9
PoseFormer [13] | 95.4 | 63.2 | 57.7
MHFormer [21] | 93.8 | 63.3 | 58.0
P-STMO [22] | 97.9 | 75.8 | 32.2
MSEFN | 99.1 | 76.4 | 31.2
Tab.3 Quantitative comparison of PCK, AUC and MPJPE metrics on MPI-INF-3DHP dataset
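Tab.3 follows the usual MPI-INF-3DHP evaluation: PCK is the percentage of joints whose 3D error falls below a threshold (commonly 150 mm), and AUC averages PCK over a sweep of thresholds. The snippet below assumes these standard definitions; the exact threshold grid used by the authors is not stated on this page.

```python
import numpy as np

def pck(pred, gt, thresh=150.0):
    """Percentage of correct keypoints: per-joint 3D error below thresh (mm).
    pred, gt: (N, J, 3) arrays of joint positions in mm."""
    err = np.linalg.norm(pred - gt, axis=-1)             # (N, J) per-joint errors
    return 100.0 * (err < thresh).mean()

def auc(pred, gt, thresholds=np.linspace(0, 150, 31)):
    """Area under the PCK curve, averaged over a threshold sweep."""
    return np.mean([pck(pred, gt, t) for t in thresholds])

pred = np.random.rand(4, 17, 3) * 1000                    # toy predictions (mm)
gt = pred + np.random.randn(4, 17, 3) * 60
print(f"PCK={pck(pred, gt):.1f}%  AUC={auc(pred, gt):.1f}%")
```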
Method | f | FLOPs/10^6 | Np/10^6 | MPJPE/mm
PoseFormer [13] | 27 | 542.1 | 9.65 | 47.0
StridedTran [14] | 81 | 392 | 4.06 | 45.4
MHFormer [21] | 27 | 1031.8 | 18.92 | 45.9
MSEFN | 81 (kf = 27) | 553 | 9.97 | 39.8
Tab.4 Quantitative comparison of FLOPs, Np and MPJPE on Human3.6M dataset
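Np in Tab.4 and Tab.6 denotes the number of learnable parameters (in units of 10^6); for any PyTorch model this count can be read directly from the parameter tensors. A minimal sketch with a stand-in module (not the authors' network) follows.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(34, 256), nn.ReLU(), nn.Linear(256, 51))  # stand-in network
num_params = sum(p.numel() for p in model.parameters())                   # total learnable parameters
print(f"Np = {num_params / 1e6:.3f} x 10^6")
```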
Module combination | MPJPE/mm
Baseline | 31.3
GRSAN | 34.7
Baseline + GRSAN | 28.2
Baseline + KFS (KFSTE) | 29.6
Baseline + KFS + GRSAN (MSEFN) | 26.5
Tab.5 Ablation study of different modules of MSEFN
Fig.5 3D HPE visual evaluation of proposed method on Human3.6M dataset
Method | f | kf | FLOPs/10^6 | Np/10^6 | v/(frame·s⁻¹) | MPJPE/mm
PoseFormer [13] | 27 | 27 | 542.1 | 9.65 | 428 | 47.0
StridedTran [14] | 81 | 81 | 392 | 4.06 | 199 | 45.4
MHFormer [21] | 27 | 27 | 1031.8 | 18.92 | 33 | 45.9
MSEFN | 27 | 3 | 409 | 6.23 | 470 | 47.8
MSEFN | 27 | 9 | 467 | 7.55 | 450 | 45.5
MSEFN | 81 | 3 | 458 | 7.11 | 455 | 42.8
MSEFN | 81 | 9 | 521 | 7.96 | 449 | 40.1
MSEFN | 81 | 27 | 553 | 9.97 | 403 | 39.8
Tab.6 Quantitative results of ablation study on selection of input frame number f and key-frame number kf
tc | tl | MPJPE/mm | tc | tl | MPJPE/mm
16 | 2 | 42.0 | 32 | 6 | 40.3
16 | 4 | 41.9 | 48 | 2 | 40.6
16 | 6 | 41.8 | 48 | 4 | 40.4
32 | 2 | 41.1 | 48 | 6 | 41.2
32 | 4 | 39.8 |  |  | 
Tab.7 Ablation study results under different dimensions and numbers of layers
Fig.6 Visualization result of proposed method on Human3.6M dataset
[1]   ZHANG C, YANG T, WENG J, et al. Unsupervised pre-training for temporal action localization tasks [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 14011–14021.
[2]   CHEN H, HE J Y, XIANG W, et al. HDFormer: high-order directed transformer for 3D human pose estimation [C]//Proceedings of the 32nd International Joint Conference on Artificial Intelligence. Macao: ACM, 2023: 581-589.
[3]   LIU M, YUAN J. Recognizing human actions as the evolution of pose estimation maps [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1159–1168.
[4]   ZHANG Q, BAO X, WU R, et al. A skeleton temporal fusion graph convolutional network for elderly action recognition [J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2025, 108(5): 704-713
[5]   MEHTA D, SRIDHAR S, SOTNYCHENKO O, et al. VNect: real-time 3D human pose estimation with a single RGB camera [J]. ACM Transactions on Graphics, 2017, 36(4): 1-14
[6]   MOON G, LEE K M. I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image [C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 752–768.
[7]   PAVLAKOS G, ZHOU X, DANIILIDIS K. Ordinal depth supervision for 3D human pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7307–7316.
[8]   CHEN T, FANG C, SHEN X, et al. Anatomy-aware 3D human pose estimation with bone-based pose decomposition [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(1): 198-209
doi: 10.1109/TCSVT.2021.3057267
[9]   LIU R, SHEN J, WANG H, et al. Attention mechanism exploits temporal contexts: real-time 3D human pose reconstruction [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 5063–5072.
[10]   WANG J, YAN S, XIONG Y, et al. Motion guided 3D pose estimation from videos [C]// Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 764–780.
[11]   ZENG A, SUN X, HUANG F, et al. SRNet: improving generalization in 3D human pose estimation with a split-and-recombine approach [C]// Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 507–523.
[12]   CAO Z, SIMON T, WEI S E, et al. Realtime multi-person 2D pose estimation using part affinity fields [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7291-7299.
[13]   ZHENG C, ZHU S, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 11636–11645.
[14]   LI W, LIU H, DING R, et al. Exploiting temporal contexts with strided transformer for 3D human pose estimation [J]. IEEE Transactions on Multimedia, 2023, 25: 1282-1293
doi: 10.1109/TMM.2022.3141231
[15]   ZHANG J, TU Z, YANG J, et al. MixSTE: seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 13222–13232.
[16]   ZHU W, MA X, LIU Z, et al. MotionBERT: a unified perspective on learning human motion representations [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 15039–15053.
[17]   TANG Z, QIU Z, HAO Y, et al. 3D human pose estimation with spatio-temporal criss-cross attention [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 4790–4799.
[18]   CHEN X, HAN Y, WANG X, et al. Action keypoint network for efficient video recognition [J]. IEEE Transactions on Image Processing, 2022, 31: 4980-4993
doi: 10.1109/TIP.2022.3191461
[19]   EINFALT M, LUDWIG K, LIENHART R. Uplift and upsample: efficient 3D human pose estimation with uplifting transformers [C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 2902–2912.
[20]   FAN Q, HUANG H, CHEN M, et al. RMT: retentive networks meet vision transformers [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5641-5651.
[21]   LI W, LIU H, TANG H, et al. MHFormer: multi-hypothesis transformer for 3D human pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 13137–13146.
[22]   SHAN W, LIU Z, ZHANG X, et al. P-STMO: pre-trained spatial temporal many-to-one model for 3D human pose estimation [C]//European Conference on Computer Vision. Cham: Springer, 2022: 461–478.
[23]   FAN Q, HUANG H, CHEN M, et al. RMT: retentive networks meet vision transformers [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5641–5651.
[24]   IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1325-1339
doi: 10.1109/TPAMI.2013.248
[25]   MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision [C]//Proceedings of the International Conference on 3D Vision. Qingdao: IEEE, 2018: 506–516.
[26]   ZHENG C, WU W, CHEN C, et al. Deep learning-based human pose estimation: a survey [J]. ACM Computing Surveys, 2024, 56(1): 1-37
[27]   MARGOSSIAN C C. A review of automatic differentiation and its efficient implementation [J]. WIREs Data Mining and Knowledge Discovery, 2019, 9(4): e1305
doi: 10.1002/widm.1305
[28]   FINDER S E, AMOYAL R, TREISTER E, et al. Wavelet convolutions for large receptive fields [C]//European Conference on Computer Vision. Cham: Springer, 2024: 363-380.
[29]   CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7103–7112.
[30]   PENG J, ZHOU Y, MOK P Y. KTPFormer: kinematics and trajectory prior knowledge-enhanced transformer for 3D human pose estimation [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 1123–1132.
[31]   LIU J, LIU M, LIU H, et al. TCPFormer: learning temporal correlation with implicit pose proxy for 3D human pose estimation [C]//Proceedings of the AAAI Conference on Artificial Intelligence. Washington: AAAI, 2025, 39(5): 5478−5486.
[32]   PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7745–7754.
[33]   YEH R, HU Y T, SCHWING A. Chirality nets for human pose regression [J]. Advances in Neural Information Processing Systems, 2019, 32: 8161–8171.
[34]   WANG J, YAN S, XIONG Y, et al. Motion guided 3D pose estimation from videos [C]//European Conference on Computer Vision. Cham: Springer, 2020: 764−780.
[35]   CAI Y, GE L, LIU J, et al. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 2272–2281.
[36]   LI H, SHI B, DAI W, et al. Pose-oriented transformer with uncertainty-guided refinement for 2D-to-3D human pose estimation [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(1): 1296-1304
doi: 10.1609/aaai.v37i1.25213