Journal of ZheJiang University (Engineering Science)  2024, Vol. 58 Issue (1): 50-60    DOI: 10.3785/j.issn.1008-973X.2024.01.006
    
Lightweight and efficient human pose estimation with enhanced priori skeleton structure
Xuefei SUN, Ruifeng ZHANG, Xin GUAN, Qiang LI*
School of Microelectronics, Tianjin University, Tianjin 300072, China

Abstract  

A lightweight and efficient human pose estimation method with an enhanced priori skeleton structure was proposed to better utilize the unique distribution properties of human pose keypoints. A high-resolution network was used to better preserve spatial location information, and a lightweight inverse residual module (LIRM) was employed to reduce the number of model parameters. A postural enhancement module (PEM) was designed to strengthen the prior information of the human torso and the connections between keypoints by using global spatial features and context information. To address the loss of keypoint spatial information caused by blurred pixel positions and directional shifts in convolution-kernel optimization when multi-resolution feature maps are fused, a direction-enhanced convolution module (DCM) was proposed; it exploits the horizontal and vertical distribution properties of keypoints on the torso to fuse the prior keypoint distribution efficiently. The experimental results demonstrate that the network estimates human pose efficiently. The model achieves an average precision of 78.4 on the COCO test-dev set with 17.4×10⁶ fewer parameters than the benchmark network, balancing accuracy and efficiency.
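No code accompanies the page, but the lightweight inverse residual module is described in terms of the inverted-residual design of MobileNetV2 [6]: a pointwise expansion, a cheap depthwise convolution, and a linear pointwise projection. A minimal PyTorch sketch of that underlying block (the module name and expansion ratio are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual: pointwise expansion,
    depthwise 3x3 convolution, then a linear pointwise projection,
    with an identity shortcut when input and output shapes match."""

    def __init__(self, channels: int, expand_ratio: int = 4):
        super().__init__()
        hidden = channels * expand_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),  # pointwise expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # depthwise: one 3x3 filter per channel, 9*hidden weights in total
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),  # linear projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # identity shortcut

x = torch.randn(1, 32, 64, 48)
print(InvertedResidual(32)(x).shape)  # torch.Size([1, 32, 64, 48])
```

The saving comes from the depthwise step: a depthwise 3×3 layer over C channels costs 9·C weights instead of the 9·C² of a dense 3×3 convolution, which is consistent with the parameter drops reported in Tab.1 below.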



Key words: human pose estimation; keypoints detection; deep learning; postural enhancement; convolution direction enhancement
Received: 03 March 2023      Published: 07 November 2023
CLC:  TP 391  
Fund: National Natural Science Foundation of China (61471263); Natural Science Foundation of Tianjin (16JCZDJC31100); Independent Innovation Foundation of Tianjin University (2021XZC-0024)
Corresponding Author: Qiang LI     E-mail: 2020232080@tju.edu.cn; liqiang@tju.edu.cn
Cite this article:

Xuefei SUN, Ruifeng ZHANG, Xin GUAN, Qiang LI. Lightweight and efficient human pose estimation with enhanced priori skeleton structure. Journal of ZheJiang University (Engineering Science), 2024, 58(1): 50-60.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.01.006     OR     https://www.zjujournals.com/eng/Y2024/V58/I1/50


Fig.1 General architecture of human pose estimation network with enhanced priori skeleton structure
Fig.2 Structures of bottleneck and basicblock modules
Fig.3 Structure of lightweight inverse residual module
Fig.4 Structure of postural enhancement module
Fig.5 Structure of direction-enhanced convolution module
Fig.6 Asymmetric convolution structure
Fig.7 Transposed convolution structure
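Fig.6 and Fig.7 depict two standard building blocks: asymmetric convolution [19], which adds explicit horizontal (1×3) and vertical (3×1) kernel responses matching the row/column distribution of torso keypoints, and transposed convolution [20], which performs learned upsampling when a low-resolution feature map is fused with a higher-resolution branch. A hedged PyTorch sketch of how such a direction-enhanced fusion unit could be assembled (the composition is inferred from the abstract, not taken from the authors' code):

```python
import torch
import torch.nn as nn

class DirectionEnhancedFusion(nn.Module):
    """Asymmetric-convolution branch (square 3x3 plus horizontal 1x3
    and vertical 3x1 kernels, summed as in ACNet) followed by a
    transposed convolution that upsamples a low-resolution map 2x
    before fusion with a higher-resolution branch."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.square = nn.Conv2d(in_ch, out_ch, (3, 3), padding=(1, 1))
        self.horiz = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))  # row response
        self.vert = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0))   # column response
        self.up = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, low_res: torch.Tensor) -> torch.Tensor:
        y = self.square(low_res) + self.horiz(low_res) + self.vert(low_res)
        return self.up(y)  # learned 2x upsampling instead of interpolation

f = torch.randn(1, 64, 32, 24)  # low-resolution branch feature map
print(DirectionEnhancedFusion(64, 32)(f).shape)  # torch.Size([1, 32, 64, 48])
```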
Method   | Stage 1 module | Stage 2 module | Stage 3 module | Stage 4 module | Np/10⁶ | FLOPs/10⁹ | PCKh mean/%
Baseline | Bottleneck     | Basicblock     | Basicblock     | Basicblock     | 28.5   | 9.4       | 90.3
Model 1  | Bottleneck     | Basicblock     | Basicblock     | LIRM           | 18.3   | 6.8       | 90.3
Model 2  | Bottleneck     | Basicblock     | LIRM           | LIRM           | 15.1   | 4.4       | 89.3
Model 3  | Bottleneck     | LIRM           | LIRM           | LIRM           | 14.9   | 4.1       | 89.0
Tab.1 Mean values of PCKh for different backbone networks at threshold of 0.5 on MPII dataset
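PCKh is the standard MPII metric [21]: a predicted keypoint counts as correct when its distance to the ground truth is within a threshold (here 0.5) times the annotated head segment length. A minimal NumPy sketch of the computation (array shapes and values are illustrative):

```python
import numpy as np

def pckh(pred: np.ndarray, gt: np.ndarray, head_len: np.ndarray,
         alpha: float = 0.5) -> np.ndarray:
    """Per-joint PCKh in %: a prediction is correct when its distance
    to the ground truth is at most alpha times the person's head
    segment length (taken from the MPII annotations).
    pred, gt: (N, K, 2) keypoint coordinates; head_len: (N,)."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) pixel distances
    correct = dist <= alpha * head_len[:, None]      # per-person threshold
    return correct.mean(axis=0) * 100.0

# toy check: 2 people, 3 joints, predictions off by about 1.4 px
gt = np.array([[[10., 10.], [20., 20.], [30., 30.]],
               [[12., 11.], [22., 19.], [35., 33.]]])
pred = gt + np.array([1.0, -1.0])
print(pckh(pred, gt, head_len=np.array([10.0, 8.0])))  # [100. 100. 100.]
```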
Method                | Np/10⁶ | FLOPs/10⁹ | Head | Shoulder | Elbow | Wrist | Hip  | Knee | Ankle | Mean/%
Baseline              | 28.5   | 9.4       | 97.1 | 95.9     | 90.3  | 86.4  | 89.1 | 87.1 | 83.3  | 90.3
Baseline+LIRM         | 18.3   | 6.8       | 96.9 | 96.0     | 90.3  | 85.7  | 89.2 | 86.8 | 83.2  | 90.3
Baseline+LIRM+PEM     | 19.0   | 7.0       | 97.3 | 96.1     | 90.7  | 86.3  | 89.3 | 87.3 | 83.4  | 90.4
Baseline+LIRM+PEM+DCM | 21.3   | 8.4       | 97.8 | 96.4     | 90.8  | 86.7  | 89.5 | 87.4 | 83.6  | 90.5
Tab.2 PCKh values for different backbone networks at threshold of 0.5 on MPII dataset
Method                | Np/10⁶ | FLOPs/10⁹ | AP/% | AP0.5/% | AP0.75/% | APM/% | APL/% | AR/%
Baseline              | 28.5   | 7.1       | 74.4 | 90.5    | 81.9     | 70.8  | 81.0  | 79.8
Baseline+LIRM         | 18.3   | 5.1       | 74.9 | 91.0    | 82.4     | 71.2  | 81.3  | 79.9
Baseline+LIRM+PEM     | 18.8   | 5.3       | 75.6 | 92.1    | 83.0     | 72.2  | 81.5  | 80.1
Baseline+LIRM+PEM+DCM | 21.1   | 6.3       | 76.7 | 93.6    | 84.8     | 73.3  | 81.8  | 80.7
Tab.3 Average precision and average recall ablation results of different networks on COCO validation set
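The COCO AP numbers [22] are built on object keypoint similarity (OKS), a Gaussian similarity between predicted and ground-truth keypoints scaled by object area and per-keypoint constants; AP then averages precision over OKS thresholds 0.50 to 0.95 in steps of 0.05, with APM and APL restricted to medium and large objects. A sketch of the OKS core (the constants and coordinates below are illustrative, not the full 17-keypoint COCO set):

```python
import numpy as np

def oks(pred: np.ndarray, gt: np.ndarray, labeled: np.ndarray,
        area: float, k: np.ndarray) -> float:
    """Object keypoint similarity: Gaussian similarity
    exp(-d^2 / (2 * s^2 * k_i^2)) averaged over labeled keypoints,
    where s^2 is the object's segment area and k_i is a per-keypoint
    falloff constant (COCO uses k_i = 2 * sigma_i)."""
    d2 = ((pred - gt) ** 2).sum(axis=-1)        # squared distances, shape (K,)
    sim = np.exp(-d2 / (2.0 * area * k ** 2))
    return float(sim[labeled].mean())

k = np.array([0.052, 0.050, 0.050])             # illustrative falloff constants
gt = np.array([[100., 100.], [120., 100.], [110., 130.]])
pred = gt + 2.0                                 # uniform 2-pixel error
print(oks(pred, gt, np.array([True, True, True]), area=5000.0, k=k))
```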
Method                 | Pretrained | Input size | Np/10⁶ | FLOPs/10⁹ | AP/% | AP0.5/% | AP0.75/% | APM/% | APL/% | AR/%
8-stage Hourglass[2]   | N          | 256×192    | 25.1   | 14.3      | 66.9 | –       | –        | –     | –     | –
CPN50[3]               | Y          | 256×192    | 27.0   | 6.2       | 68.6 | –       | –        | –     | –     | –
Simple Baseline152[4]  | Y          | 256×192    | 68.6   | 15.7      | 72.0 | 89.3    | 79.8     | 68.7  | 78.9  | 77.8
HRNet(W32)[5]          | Y          | 256×192    | 28.5   | 7.1       | 74.4 | 90.5    | 81.9     | 70.8  | 81.0  | 79.8
HRNet(W48)[5]          | Y          | 256×192    | 63.6   | 14.6      | 75.1 | 90.6    | 82.2     | 71.5  | 81.8  | 80.4
RAM-GPRNet(W32)[23]    | Y          | 256×192    | 31.4   | 7.7       | 76.0 | –       | –        | –     | –     | –
RAM-GPRNet(W48)[23]    | Y          | 256×192    | 70.0   | 15.8      | 76.5 | –       | –        | –     | –     | –
HRFormer-B[24]         | Y          | 256×192    | 43.2   | 12.2      | 75.6 | 90.8    | 82.8     | 71.7  | 82.6  | 80.8
HRGCNet(W32)[25]       | Y          | 256×192    | 29.6   | 7.11      | 76.6 | 93.6    | 84.6     | 73.9  | 80.7  | 79.3
HRGCNet(W48)[25]       | Y          | 256×192    | 64.6   | 14.6      | 77.4 | 93.6    | 84.8     | 74.6  | 81.7  | 80.1
AMHRNet(W32)[26]       | –          | 256×192    | 36.4   | –         | 76.1 | 91.0    | 82.7     | 71.5  | 82.9  | 81.2
AMHRNet(W48)[26]       | –          | 256×192    | 71.8   | –         | 76.4 | 91.1    | 83.1     | 72.2  | 83.3  | 81.4
Ours(W32)              | Y          | 256×192    | 21.1   | 6.3       | 76.7 | 93.6    | 84.8     | 73.3  | 81.8  | 80.7
Ours(W48)              | Y          | 256×192    | 46.2   | 12.3      | 77.4 | 93.7    | 85.0     | 74.4  | 82.3  | 81.4
CPN50[3]               | Y          | 384×288    | –      | 13.9      | 70.6 | –       | –        | –     | –     | –
Simple Baseline152[4]  | Y          | 384×288    | 68.6   | 35.3      | 74.3 | 89.6    | 81.1     | 70.5  | 81.6  | 79.7
HRNet(W32)[5]          | Y          | 384×288    | 28.5   | 16.0      | 75.8 | 90.6    | 82.5     | 72.0  | 82.7  | 80.9
HRNet(W48)[5]          | Y          | 384×288    | 63.6   | 32.9      | 76.3 | 90.8    | 82.9     | 72.3  | 83.4  | 81.2
RAM-GPRNet(W32)[23]    | Y          | 384×288    | 31.4   | 17.2      | 77.3 | –       | –        | –     | –     | –
RAM-GPRNet(W48)[23]    | Y          | 384×288    | 70.0   | 35.6      | 77.7 | –       | –        | –     | –     | –
HRFormer-B[24]         | Y          | 384×288    | 43.2   | 26.8      | 77.2 | 91.0    | 83.6     | 73.2  | 84.2  | 82.0
HRGCNet(W32)[25]       | Y          | 384×288    | 29.6   | 16.1      | 78.0 | 93.6    | 84.8     | 75.0  | 82.6  | 80.5
HRGCNet(W48)[25]       | Y          | 384×288    | 64.6   | 32.9      | 78.4 | 93.6    | 85.8     | 75.3  | 83.5  | 81.3
Ours(W32)              | Y          | 384×288    | 21.1   | 14.8      | 78.2 | 93.7    | 85.0     | 75.4  | 82.9  | 81.2
Ours(W48)              | Y          | 384×288    | 46.2   | 28.5      | 78.5 | 93.7    | 85.8     | 75.5  | 83.7  | 81.9
Tab.4 Comparison results of average precision and average recall for different networks on COCO validation set
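For reference, the Np/10⁶ column is the trainable parameter count in millions, reproducible for any PyTorch model as below; FLOPs additionally depend on the input size (256×192 or 384×288 here). The ResNet-50 stand-in is only to make the snippet runnable:

```python
import torch
import torchvision.models as models

def params_in_millions(model: torch.nn.Module) -> float:
    """Trainable parameter count divided by 1e6, matching Np/10^6."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# ResNet-50 as a runnable stand-in for any backbone
print(f"{params_in_millions(models.resnet50()):.1f}")  # 25.6
```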
Method                 | Pretrained | Input size | Np/10⁶ | FLOPs/10⁹ | AP/% | AP0.5/% | AP0.75/% | APM/% | APL/% | AR/%
CPN50[3]               | –          | 384×288    | –      | –         | 72.6 | 86.1    | 69.7     | 78.3  | 64.1  | –
Simple Baseline152[4]  | Y          | 256×192    | 68.6   | 15.7      | 71.6 | 91.2    | 80.1     | 68.7  | 77.2  | 77.3
HRNet(W32)[5]          | Y          | 384×288    | 28.5   | 16.0      | 74.9 | 92.5    | 82.8     | 71.3  | 80.9  | 80.1
HRNet(W48)[5]          | Y          | 384×288    | 63.6   | 32.9      | 75.5 | 92.5    | 83.3     | 71.9  | 81.5  | 80.5
RAM-GPRNet(W32)[23]    | Y          | 384×288    | 31.4   | 17.2      | 76.5 | –       | –        | –     | –     | –
RAM-GPRNet(W48)[23]    | Y          | 384×288    | 70.0   | 35.6      | 77.0 | –       | –        | –     | –     | –
HRFormer-B[24]         | Y          | 384×288    | 43.2   | 26.8      | 76.2 | 92.7    | 83.8     | 72.5  | 82.3  | 81.2
HRGCNet(W32)[25]       | Y          | 384×288    | 29.6   | 16.1      | 77.9 | 93.6    | 84.8     | 74.8  | 82.9  | 80.6
HRGCNet(W48)[25]       | Y          | 384×288    | 64.6   | 32.9      | 78.3 | 93.6    | 85.7     | 75.3  | 83.5  | 81.2
Ours(W32)              | Y          | 384×288    | 21.1   | 14.8      | 78.1 | 93.6    | 85.0     | 75.2  | 83.1  | 81.2
Ours(W48)              | Y          | 384×288    | 46.2   | 28.5      | 78.4 | 93.7    | 85.5     | 75.5  | 83.6  | 81.7
Tab.5 Comparison results of average precision and average recall for different networks on COCO test set
Fig.8 Comparison of visualization results
Fig.9 Visualization of experimental results with partial zoom comparison
[1]   REIS E S, SEEWALD L A, ANTUNES R S, et al. Monocular multi-person pose estimation: a survey [J]. Pattern Recognition, 2021, 118: 108046.
[2]   NEWELL A, YANG K, DENG J. Stacked hourglass networks for human pose estimation [C]// European Conference on Computer Vision. Amsterdam: Springer, 2016: 483–499.
[3]   CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation [C]// IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7103–7112.
[4]   XIAO B, WU H, WEI Y. Simple baselines for human pose estimation and tracking [C]// European Conference on Computer Vision. Munich: Springer, 2018: 472–487.
[5]   SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation [C]// IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5686–5696.
[6]   SANDLER M, HOWARD A, ZHU M, et al. MobileNetV2: inverted residuals and linear bottlenecks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4510-4520.
[7]   ZHANG X, ZHOU X, LIN M, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices [C]// IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6848-6856.
[8]   QIAO S, CHEN L C, YUILLE A. DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution [C]// IEEE Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 10208-10219.
[9]   LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection [C]// IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 936-944.
[10]   SU H, JAMPANI V, SUN D, et al. Pixel-adaptive convolutional neural networks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 11158-11167.
[11]   CHEN Y, DAI X, LIU M, et al. Dynamic convolution: attention over convolution kernels [C]// IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11027-11036.
[12]   WANG Q, WU B, ZHU P, et al. ECA-Net: efficient channel attention for deep convolutional neural networks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11531-11539.
[13]   LI X, WANG W, HU X, et al. Selective kernel networks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 510-519.
[14]   RAJAMANI K, GOWDA S D, TEJ V N, et al. Deformable attention (DANet) for semantic image segmentation [C]// Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Glasgow: IEEE, 2022: 3781-3784.
[15]   LIU Yong. Research on two-dimensional object pose estimation based on key-point detection [D]. Chengdu: Institute of Optics and Electronics, Chinese Academy of Sciences, 2021.
[16]   LIU Z, MAO H, WU C Y, et al. A ConvNet for the 2020s [C]// IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 11966-11976.
[17]   CHEN J, HE T, ZHUO W, et al. TVConv: efficient translation variant convolution for layout-aware visual processing [C]// IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 12538-12548.
[18]   CAO Y, XU J, LIN S, et al. GCNet: non-local networks meet squeeze-excitation networks and beyond [C]// IEEE International Conference on Computer Vision Workshop. Seoul: IEEE, 2019: 1971-1980.
[19]   DING X, GUO Y, DING G, et al. ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks [C]// IEEE International Conference on Computer Vision. Seoul: IEEE, 2019: 1911-1920.
[20]   ZEILER M D, KRISHNAN D, TAYLOR G W, et al. Deconvolutional networks [C]// IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco: IEEE, 2010: 2528-2535.
[21]   ANDRILUKA M, PISHCHULIN L, GEHLER P, et al. 2D human pose estimation: new benchmark and state of the art analysis [C]// IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 3686-3693.
[22]   LIN T, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Zurich: Springer, 2014: 740-755.
[23]   ZHANG K, HE P, YAO P, et al. Learning enhanced resolution-wise features for human pose estimation [C]// IEEE International Conference on Image Processing. Abu Dhabi: IEEE, 2020: 2256-2260.
[24]   YUAN Y, FU R, HUANG L, et al. HRFormer: high-resolution transformer for dense prediction [C]// Advances in Neural Information Processing Systems. Curran Associates, 2021.
[25]   WANG K, LI C, REN R. High-resolution with global context network for human pose estimation [C]// Asia Pacific Conference on Communications. Jeju Island: IEEE, 2022: 621-626.