Journal of Zhejiang University (Engineering Science)  2024, Vol. 58 Issue (1): 50-60    DOI: 10.3785/j.issn.1008-973X.2024.01.006
Computer Technology
Lightweight and efficient human pose estimation with enhanced priori skeleton structure
Xuefei SUN, Ruifeng ZHANG, Xin GUAN, Qiang LI*
School of Microelectronics, Tianjin University, Tianjin 300072, China
Abstract:

A lightweight and efficient human pose estimation method with an enhanced prior skeleton structure was proposed to better exploit the distinctive distribution properties of human pose keypoints. A high-resolution network was used to preserve spatial location information, and a lightweight inverted residual module (LIRM) was introduced to further reduce the number of model parameters. A postural enhancement module (PEM) was designed to strengthen the prior information of torso positions and the connections between keypoints by using global spatial features and context information. To address the loss of keypoint spatial features caused by blurred pixel positions and shifted optimization directions of convolution kernels during multi-resolution feature fusion, a direction-enhanced convolution module (DCM) was proposed, which efficiently fuses the prior keypoint distribution by exploiting the horizontal and vertical distribution of keypoints along the torso. Experimental results demonstrate that the network estimates human pose efficiently. Compared with the benchmark network, the model achieves an average precision of 78.4 on the COCO test-dev set with 17.4×10^6 fewer parameters, balancing accuracy and efficiency.
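For orientation, the lightweight inverted residual module builds on the inverted residual design of MobileNetV2 [6]. The following is a minimal PyTorch sketch of that cited building block, not the paper's exact LIRM; the class name, expansion ratio, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual block [6]: expand with a 1x1
    convolution, filter with a depthwise 3x3 convolution, then project
    back with a linear 1x1 convolution. Illustrative sketch only; the
    paper's LIRM may differ in details such as the expansion ratio."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),         # depthwise 3x3
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),   # linear projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # identity shortcut (stride 1, equal channels)
```

The depthwise 3x3 convolution is what makes the block lightweight: its cost grows with the channel count rather than its square, which is consistent with the parameter reductions reported in Table 1.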

Key words: human pose estimation    keypoint detection    deep learning    postural enhancement    convolution direction enhancement
Received: 2023-03-03  Published: 2023-11-07
CLC:  TP 391  
Funding: National Natural Science Foundation of China (61471263); Natural Science Foundation of Tianjin (16JCZDJC31100); Independent Innovation Foundation of Tianjin University (2021XZC-0024)
Corresponding author: Qiang LI. E-mail: 2020232080@tju.edu.cn; liqiang@tju.edu.cn
About the first author: Xuefei SUN (1998―), female, master's student, engaged in research on deep learning for image processing. orcid.org/0009-0000-6125-3202. E-mail: 2020232080@tju.edu.cn

Cite this article:


Xuefei SUN,Ruifeng ZHANG,Xin GUAN,Qiang LI. Lightweight and efficient human pose estimation with enhanced priori skeleton structure. Journal of ZheJiang University (Engineering Science), 2024, 58(1): 50-60.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.01.006        https://www.zjujournals.com/eng/CN/Y2024/V58/I1/50

Fig. 1  Overall architecture of the human pose estimation network with enhanced prior skeleton structure
Fig. 2  Structures of the bottleneck layer and the basic block
Fig. 3  Structure of the lightweight inverted residual module
Fig. 4  Structure of the postural enhancement module
Fig. 5  Structure of the direction-enhanced convolution module
Fig. 6  Asymmetric convolution structure
Fig. 7  Transposed convolution structure
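Figures 6 and 7 show the asymmetric and transposed convolution structures used inside the direction-enhanced convolution module. As a hedged illustration of these two standard operations (asymmetric convolution as in ACNet [19], transposed convolution as in [20]), here is a minimal PyTorch sketch; the channel counts, kernel sizes, and class name are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """Parallel 3x3, 1x3 and 3x1 branches whose outputs are summed,
    following ACNet [19]; the horizontal (1x3) and vertical (3x1)
    kernels match the row/column distribution of torso keypoints."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.square = nn.Conv2d(in_ch, out_ch, (3, 3), padding=(1, 1))
        self.hor = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))  # horizontal
        self.ver = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0))  # vertical

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.square(x) + self.hor(x) + self.ver(x)

# Transposed convolution (Fig. 7): learned 2x upsampling of the kind used
# when fusing multi-resolution branches; kernel/stride values illustrative.
upsample = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 32, 24)
print(AsymmetricConv(64, 64)(x).shape)  # torch.Size([1, 64, 32, 24])
print(upsample(x).shape)                # torch.Size([1, 64, 64, 48])
```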
Method | Stage 1 module | Stage 2 module | Stage 3 module | Stage 4 module | Np/10^6 | FLOPs/10^9 | Mean PCKh/%
Baseline | Bottleneck | Basicblock | Basicblock | Basicblock | 28.5 | 9.4 | 90.3
Model 1 | Bottleneck | Basicblock | Basicblock | LIRM | 18.3 | 6.8 | 90.3
Model 2 | Bottleneck | Basicblock | LIRM | LIRM | 15.1 | 4.4 | 89.3
Model 3 | Bottleneck | LIRM | LIRM | LIRM | 14.9 | 4.1 | 89.0
Table 1  Mean PCKh of different backbone networks on the MPII dataset at a threshold of 0.5
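For reference, the PCKh metric in Tables 1 and 2 counts a predicted keypoint as correct when its distance to the ground truth is at most the threshold (here 0.5) times the MPII head segment length [21]. A minimal NumPy sketch under that standard definition; the function name and array shapes are assumptions.

```python
import numpy as np

def pckh(pred, gt, head_sizes, thr=0.5):
    """PCKh: percentage of keypoints whose prediction falls within
    thr * head segment length of the ground truth.

    pred, gt   : (N, K, 2) arrays of keypoint coordinates
    head_sizes : (N,) head segment length per person (MPII convention)
    """
    dist = np.linalg.norm(pred - gt, axis=-1)      # (N, K) pixel errors
    correct = dist <= thr * head_sizes[:, None]    # normalize per person
    return correct.mean() * 100.0                  # percentage

# Usage: pckh(pred, gt, head_sizes) -> mean PCKh in percent at threshold 0.5;
# averaging over a single joint column gives the per-joint PCK of Table 2.
```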
Method | Np/10^6 | FLOPs/10^9 | PCK/%: Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean PCKh/%
Baseline | 28.5 | 9.4 | 97.1 | 95.9 | 90.3 | 86.4 | 89.1 | 87.1 | 83.3 | 90.3
Baseline+LIRM | 18.3 | 6.8 | 96.9 | 96.0 | 90.3 | 85.7 | 89.2 | 86.8 | 83.2 | 90.3
Baseline+LIRM+PEM | 19.0 | 7.0 | 97.3 | 96.1 | 90.7 | 86.3 | 89.3 | 87.3 | 83.4 | 90.4
Baseline+LIRM+PEM+DCM | 21.3 | 8.4 | 97.8 | 96.4 | 90.8 | 86.7 | 89.5 | 87.4 | 83.6 | 90.5
Table 2  PCKh of different networks on the MPII dataset at a threshold of 0.5
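Table 2 shows the gain from adding the postural enhancement module, which the abstract describes as using global spatial features and context information. The paper's exact PEM is not reproduced here; as context, the following is a minimal PyTorch sketch of the global context block of GCNet [18], a representative way to inject such global information into every spatial position; all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """GCNet-style global context [18]: a 1x1 conv + softmax pools the
    feature map into one global vector, and a bottleneck transform turns
    it into a per-channel correction broadcast back to all positions.
    Illustrative sketch; the paper's PEM may differ in structure."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mask = nn.Conv2d(channels, 1, 1)  # spatial attention logits
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.LayerNorm([channels // reduction, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        attn = self.mask(x).view(n, 1, h * w).softmax(-1)            # (N,1,HW)
        ctx = torch.bmm(attn, x.view(n, c, h * w).transpose(1, 2))   # (N,1,C)
        ctx = ctx.transpose(1, 2).view(n, c, 1, 1)                   # global vector
        return x + self.transform(ctx)                               # broadcast add
```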
Method | Np/10^6 | FLOPs/10^9 | AP/% | AP^0.5/% | AP^0.75/% | AP^M/% | AP^L/% | AR/%
Baseline | 28.5 | 7.1 | 74.4 | 90.5 | 81.9 | 70.8 | 81.0 | 79.8
Baseline+LIRM | 18.3 | 5.1 | 74.9 | 91.0 | 82.4 | 71.2 | 81.3 | 79.9
Baseline+LIRM+PEM | 18.8 | 5.3 | 75.6 | 92.1 | 83.0 | 72.2 | 81.5 | 80.1
Baseline+LIRM+PEM+DCM | 21.1 | 6.3 | 76.7 | 93.6 | 84.8 | 73.3 | 81.8 | 80.7
Table 3  Ablation results for average precision and average recall of different networks on the COCO validation set
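The COCO metrics in Tables 3 to 5 derive from object keypoint similarity (OKS): AP averages precision over OKS thresholds 0.50 to 0.95 in steps of 0.05, AP^0.5 and AP^0.75 fix the threshold, AP^M and AP^L restrict evaluation to medium and large instances, and AR is the corresponding recall [22]. A minimal NumPy sketch of the standard OKS formula; the function name and array layout are assumptions.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object keypoint similarity for one instance:
    OKS = mean over labeled keypoints of exp(-d_i^2 / (2 s^2 k_i^2)),
    where s^2 is the object area and k_i is the per-keypoint falloff
    constant (twice the COCO sigma for that keypoint).

    pred, gt : (K, 2) keypoint coordinates
    visible  : (K,) boolean mask of labeled keypoints
    area     : object segment area (s^2 in the formula)
    k        : (K,) per-keypoint constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)              # squared distances
    e = d2 / (2.0 * area * k ** 2 + np.spacing(1))      # normalized error
    return np.exp(-e)[visible].mean()
```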
Method | Pretrained | Input size | Np/10^6 | FLOPs/10^9 | AP/% | AP^0.5/% | AP^0.75/% | AP^M/% | AP^L/% | AR/%
8-stage Hourglass [2] | N | 256×192 | 25.1 | 14.3 | 66.9 | — | — | — | — | —
CPN50 [3] | Y | 256×192 | 27.0 | 6.2 | 68.6 | — | — | — | — | —
Simple Baseline152 [4] | Y | 256×192 | 68.6 | 15.7 | 72.0 | 89.3 | 79.8 | 68.7 | 78.9 | 77.8
HRNet (W32) [5] | Y | 256×192 | 28.5 | 7.1 | 74.4 | 90.5 | 81.9 | 70.8 | 81.0 | 79.8
HRNet (W48) [5] | Y | 256×192 | 63.6 | 14.6 | 75.1 | 90.6 | 82.2 | 71.5 | 81.8 | 80.4
RAM-GPRNet (W32) [23] | Y | 256×192 | 31.4 | 7.7 | 76.0 | — | — | — | — | —
RAM-GPRNet (W48) [23] | Y | 256×192 | 70.0 | 15.8 | 76.5 | — | — | — | — | —
HRFormer-B [24] | Y | 256×192 | 43.2 | 12.2 | 75.6 | 90.8 | 82.8 | 71.7 | 82.6 | 80.8
HRGCNet (W32) [25] | Y | 256×192 | 29.6 | 7.11 | 76.6 | 93.6 | 84.6 | 73.9 | 80.7 | 79.3
HRGCNet (W48) [25] | Y | 256×192 | 64.6 | 14.6 | 77.4 | 93.6 | 84.8 | 74.6 | 81.7 | 80.1
AMHRNet (W32) [26] | — | 256×192 | 36.4 | — | 76.1 | 91.0 | 82.7 | 71.5 | 82.9 | 81.2
AMHRNet (W48) [26] | — | 256×192 | 71.8 | — | 76.4 | 91.1 | 83.1 | 72.2 | 83.3 | 81.4
Ours (W32) | Y | 256×192 | 21.1 | 6.3 | 76.7 | 93.6 | 84.8 | 73.3 | 81.8 | 80.7
Ours (W48) | Y | 256×192 | 46.2 | 12.3 | 77.4 | 93.7 | 85.0 | 74.4 | 82.3 | 81.4
CPN50 [3] | Y | 384×288 | — | 13.9 | 70.6 | — | — | — | — | —
Simple Baseline152 [4] | Y | 384×288 | 68.6 | 35.3 | 74.3 | 89.6 | 81.1 | 70.5 | 81.6 | 79.7
HRNet (W32) [5] | Y | 384×288 | 28.5 | 16.0 | 75.8 | 90.6 | 82.5 | 72.0 | 82.7 | 80.9
HRNet (W48) [5] | Y | 384×288 | 63.6 | 32.9 | 76.3 | 90.8 | 82.9 | 72.3 | 83.4 | 81.2
RAM-GPRNet (W32) [23] | Y | 384×288 | 31.4 | 17.2 | 77.3 | — | — | — | — | —
RAM-GPRNet (W48) [23] | Y | 384×288 | 70.0 | 35.6 | 77.7 | — | — | — | — | —
HRFormer-B [24] | Y | 384×288 | 43.2 | 26.8 | 77.2 | 91.0 | 83.6 | 73.2 | 84.2 | 82.0
HRGCNet (W32) [25] | Y | 384×288 | 29.6 | 16.1 | 78.0 | 93.6 | 84.8 | 75.0 | 82.6 | 80.5
HRGCNet (W48) [25] | Y | 384×288 | 64.6 | 32.9 | 78.4 | 93.6 | 85.8 | 75.3 | 83.5 | 81.3
Ours (W32) | Y | 384×288 | 21.1 | 14.8 | 78.2 | 93.7 | 85.0 | 75.4 | 82.9 | 81.2
Ours (W48) | Y | 384×288 | 46.2 | 28.5 | 78.5 | 93.7 | 85.8 | 75.5 | 83.7 | 81.9
Table 4  Comparison of average precision and average recall of different networks on the COCO validation set
Method | Pretrained | Input size | Np/10^6 | FLOPs/10^9 | AP/% | AP^0.5/% | AP^0.75/% | AP^M/% | AP^L/% | AR/%
CPN50 [3] | — | 384×288 | — | — | 72.6 | 86.1 | 69.7 | 78.3 | 64.1 | —
Simple Baseline152 [4] | Y | 256×192 | 68.6 | 15.7 | 71.6 | 91.2 | 80.1 | 68.7 | 77.2 | 77.3
HRNet (W32) [5] | Y | 384×288 | 28.5 | 16.0 | 74.9 | 92.5 | 82.8 | 71.3 | 80.9 | 80.1
HRNet (W48) [5] | Y | 384×288 | 63.6 | 32.9 | 75.5 | 92.5 | 83.3 | 71.9 | 81.5 | 80.5
RAM-GPRNet (W32) [23] | Y | 384×288 | 31.4 | 17.2 | 76.5 | — | — | — | — | —
RAM-GPRNet (W48) [23] | Y | 384×288 | 70.0 | 35.6 | 77.0 | — | — | — | — | —
HRFormer-B [24] | Y | 384×288 | 43.2 | 26.8 | 76.2 | 92.7 | 83.8 | 72.5 | 82.3 | 81.2
HRGCNet (W32) [25] | Y | 384×288 | 29.6 | 16.1 | 77.9 | 93.6 | 84.8 | 74.8 | 82.9 | 80.6
HRGCNet (W48) [25] | Y | 384×288 | 64.6 | 32.9 | 78.3 | 93.6 | 85.7 | 75.3 | 83.5 | 81.2
Ours (W32) | Y | 384×288 | 21.1 | 14.8 | 78.1 | 93.6 | 85.0 | 75.2 | 83.1 | 81.2
Ours (W48) | Y | 384×288 | 46.2 | 28.5 | 78.4 | 93.7 | 85.5 | 75.5 | 83.6 | 81.7
Table 5  Comparison of average precision and average recall of different networks on the COCO test-dev set
Fig. 8  Comparison of visualization results
Fig. 9  Zoomed-in comparison of local regions of visualization results
1 REIS E S, SEEWALD L A, ANTUNES R S, et al. Monocular multi-person pose estimation: a survey [J]. Pattern Recognition, 2021, 118: 108046.
2 NEWELL A, YANG K, DENG J. Stacked hourglass networks for human pose estimation [C]// European Conference on Computer Vision. Amsterdam: Springer, 2016: 483–499.
3 CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation [C]// IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7103–7112.
4 XIAO B, WU H, WEI Y. Simple baselines for human pose estimation and tracking [C]// European Conference on Computer Vision. Munich: Springer, 2018: 472–487.
5 SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation [C]// IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5686–5696.
6 SANDLER M, HOWARD A, ZHU M, et al. MobileNetV2: inverted residuals and linear bottlenecks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4510-4520.
7 ZHANG X, ZHOU X, LIN M, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices [C]// IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6848-6856.
8 QIAO S, CHEN L C, YUILLE A. DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution [C]// IEEE Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 10208-10219.
9 LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection [C]// IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 936-944.
10 SU H, JAMPANI V, SUN D, et al. Pixel-adaptive convolutional neural networks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 11158-11167.
11 CHEN Y, DAI X, LIU M, et al. Dynamic convolution: attention over convolution kernels [C]// IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11027-11036.
12 WANG Q, WU B, ZHU P, et al. ECA-Net: efficient channel attention for deep convolutional neural networks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11531-11539.
13 LI X, WANG W, HU X, et al. Selective kernel networks [C]// IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 510-519.
14 RAJAMANI K, GOWDA S D, TEJ V N, et al. Deformable attention (DANet) for semantic image segmentation [C]// Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Glasgow: IEEE, 2022: 3781-3784.
15 LIU Yong. Research on two-dimensional object pose estimation based on key-point detection [D]. Chengdu: Institute of Optics and Electronics, Chinese Academy of Sciences, 2021.
16 LIU Z, MAO H, WU C Y, et al. A ConvNet for the 2020s [C]// IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 11966-11976.
17 CHEN J, HE T, ZHUO W, et al. TVConv: efficient translation variant convolution for layout-aware visual processing [C]// IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 12538-12548.
18 CAO Y, XU J, LIN S, et al. GCNet: non-local networks meet squeeze-excitation networks and beyond [C]// IEEE International Conference on Computer Vision Workshop. Seoul: IEEE, 2019: 1971-1980.
19 DING X, GUO Y, DING G, et al. ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks [C]// IEEE International Conference on Computer Vision. Seoul: IEEE, 2019: 1911-1920.
20 ZEILER M D, KRISHNAN D, TAYLOR G W, et al. Deconvolutional networks [C]// IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco: IEEE, 2010: 2528-2535.
21 ANDRILUKA M, PISHCHULIN L, GEHLER P, et al. 2D human pose estimation: new benchmark and state of the art analysis [C]// IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 3686-3693.
22 LIN T, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Zurich: Springer, 2014: 740-755.
23 ZHANG K, HE P, YAO P, et al. Learning enhanced resolution-wise features for human pose estimation [C]// IEEE International Conference on Image Processing. Abu Dhabi: IEEE, 2020: 2256-2260.
24 YUAN Y, FU R, HUANG L, et al. HRFormer: high-resolution transformer for dense prediction [C]// Neural Information Processing Systems. Vancouver: MIT Press, 2021.
25 WANG K, LI C, REN R. High-resolution with global context network for human pose estimation [C]// Asia Pacific Conference on Communications. Jeju Island: IEEE, 2022: 621-626.