Journal of Zhejiang University (Engineering Science)  2021, Vol. 55 Issue (10): 1815-1824    DOI: 10.3785/j.issn.1008-973X.2021.10.003
Computer Technology
Static gesture real-time recognition method based on ShuffleNetv2-YOLOv3 model
Wen-bin XIN1, Hui-min HAO1, Ming-long BU1,2, Yuan LAN1, Jia-hai HUANG1,*, Xiao-yan XIONG1
1. School of Mechanical and Transportation Engineering, Taiyuan University of Technology, Taiyuan 030024, China
2. Harbin Electric Machinery Limited Company, Harbin 150040, China
Abstract:

An efficient static gesture real-time recognition method based on an integrated ShuffleNetv2-YOLOv3 network was proposed to reduce the model's demand on hardware computing power, targeting the limited computing resources and small storage space of mobile terminal platforms. The computational complexity of the model was reduced by replacing Darknet-53 with the lightweight ShuffleNetv2 as the backbone network. The CBAM attention module was introduced to strengthen the network's attention to spatial and channel features. The K-means clustering algorithm was used to regenerate the aspect ratios and the number of anchors, so that the regenerated anchor sizes located targets more precisely and improved the detection accuracy of the model. Experimental results showed that the average recognition accuracy of the proposed algorithm on gesture recognition was 99.2% at a recognition speed of 44 frames/s; the inference time for a single 416×416 image was 15 ms on the GPU and 58 ms on the CPU, and the model occupied 15.1 MB of memory. The method offers high recognition accuracy, fast recognition speed and a low memory footprint, which facilitates deployment on mobile terminals.

Key words: YOLOv3    lightweight ShuffleNetv2 network    CBAM attention mechanism    gesture recognition    mobile terminal
Received: 2020-09-24    Published: 2021-10-27
CLC:  TP 391
Funding: National Key Research and Development Program of China (2018YFB1308700); 2020 Shanxi Provincial Key Core Technology and Common Technology R&D Special Project (2020XXX009, 2020XXX001)
Corresponding author: Jia-hai HUANG    E-mail: huangjiahai@tyut.edu.cn
About the first author: Wen-bin XIN (born 1995), male, master's student, engaged in computer vision research. orcid.org/0000-0002-6891-8235. E-mail: 2878095493@qq.com
Cite this article:

Wen-bin XIN, Hui-min HAO, Ming-long BU, Yuan LAN, Jia-hai HUANG, Xiao-yan XIONG. Static gesture real-time recognition method based on ShuffleNetv2-YOLOv3 model. Journal of Zhejiang University (Engineering Science), 2021, 55(10): 1815-1824.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2021.10.003        https://www.zjujournals.com/eng/CN/Y2021/V55/I10/1815

Fig. 1  Overall structure of the YOLOv3 model
Fig. 2  Overall structure of the ShuffleNetv2-YOLOv3 model
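As a reading aid for Fig. 2, the integration can be pictured as the ShuffleNetv2 backbone exposing three feature maps (52×52, 26×26 and 13×13 for a 416×416 input, per Table 1) that feed YOLOv3-style prediction heads. The PyTorch sketch below only illustrates that wiring under stated assumptions: it uses the torchvision ShuffleNetv2-1.0× as a stand-in backbone, a placeholder class count, two anchors per scale (six in total, matching the six-anchor configuration of Table 5) and plain 1×1 heads; the CBAM modules and feature-fusion path of the actual model are omitted.

```python
# Illustrative wiring only (not the authors' released code): a ShuffleNetv2
# backbone exposing three feature scales that feed simple YOLOv3-style
# 1x1 prediction heads. CBAM and the feature-fusion path of Fig. 2 are omitted.
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0


class ShuffleNetV2YOLOv3Sketch(nn.Module):
    def __init__(self, num_classes=10, anchors_per_scale=2):
        # num_classes is a placeholder; set it to the size of the gesture vocabulary.
        super().__init__()
        m = shufflenet_v2_x1_0()                       # torchvision stand-in backbone (1.0x width)
        self.stem = nn.Sequential(m.conv1, m.maxpool)  # 416 -> 208 -> 104
        self.stage2, self.stage3, self.stage4 = m.stage2, m.stage3, m.stage4
        out = anchors_per_scale * (5 + num_classes)    # (tx, ty, tw, th, obj) + class scores
        self.head52 = nn.Conv2d(116, out, 1)           # 52x52 scale, 116 channels at 1.0x width
        self.head26 = nn.Conv2d(232, out, 1)           # 26x26 scale
        self.head13 = nn.Conv2d(464, out, 1)           # 13x13 scale

    def forward(self, x):
        x = self.stem(x)
        c3 = self.stage2(x)     # stride 8
        c4 = self.stage3(c3)    # stride 16
        c5 = self.stage4(c4)    # stride 32
        return self.head52(c3), self.head26(c4), self.head13(c5)


if __name__ == "__main__":
    preds = ShuffleNetV2YOLOv3Sketch()(torch.randn(1, 3, 416, 416))
    print([tuple(p.shape) for p in preds])  # (1, 30, 52, 52), (1, 30, 26, 26), (1, 30, 13, 13)
```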
Layer        Os         Ks       S    R    Oc (0.5×)   Oc (1.0×)   Oc (1.5×)   Oc (2.0×)
Image        416×416                        3           3           3           3
Conv1        208×208    3×3      2    1    24          24          24          24
MaxPool      104×104    3×3      2    1    24          24          24          24
Stage2       52×52                2    1    48          116         176         244
Stage2       52×52                1    3    48          116         176         244
Stage3       26×26                2    1    96          232         352         488
Stage3       26×26                1    7    96          232         352         488
Stage4       13×13                2    1    192         464         704         976
Stage4       13×13                1    3    192         464         704         976
Conv5        13×13      1×1      1    1    1024        1024        1024        2048
GlobalPool   1×1        13×13
FC                                          1000        1000        1000        1000
Table 1  Network structure of ShuffleNetv2
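For quick reference, Table 1's stage settings can also be written down as plain data. The snippet below is just that restatement (not code from the paper): the per-stage unit counts (one stride-2 unit plus the repeated stride-1 units), the output channels for each width multiplier, and the halving of the feature-map size from the 416×416 input down to 13×13.

```python
# Table 1 restated as plain data (not code from the paper): unit counts per stage
# (one stride-2 unit plus the repeated stride-1 units) and output channels for
# each width multiplier, plus the feature-map sizes for a 416x416 input.
SHUFFLENETV2_STAGES = {
    # stage: (total units, {width multiplier: output channels})
    "stage2": (1 + 3, {0.5: 48,  1.0: 116, 1.5: 176, 2.0: 244}),
    "stage3": (1 + 7, {0.5: 96,  1.0: 232, 1.5: 352, 2.0: 488}),
    "stage4": (1 + 3, {0.5: 192, 1.0: 464, 1.5: 704, 2.0: 976}),
}


def feature_map_sizes(input_size=416):
    """Conv1, MaxPool and the first unit of each stage all halve the resolution."""
    sizes, s = [], input_size
    for name in ("conv1", "maxpool", "stage2", "stage3", "stage4"):
        s //= 2
        sizes.append((name, s))
    return sizes


if __name__ == "__main__":
    print(feature_map_sizes())
    # [('conv1', 208), ('maxpool', 104), ('stage2', 52), ('stage3', 26), ('stage4', 13)]
```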
Fig. 3  Flow chart of gesture recognition
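The flow in Fig. 3 reduces to a capture-resize-detect-display loop. Below is a minimal OpenCV sketch of such a loop; `run_detector` is a hypothetical stand-in for the trained ShuffleNetv2-YOLOv3 model (it returns no boxes here), and the plain resize glosses over letterboxing details.

```python
# Minimal capture-resize-detect-display loop (illustrative only).
# run_detector is a hypothetical stand-in for the trained ShuffleNetv2-YOLOv3
# model; here it returns no detections so the loop stays runnable on its own.
import cv2


def run_detector(img_416):
    """Placeholder detector: should return (x1, y1, x2, y2, label, score) boxes
    in 416x416 pixel coordinates."""
    return []


def main(camera_index=0, input_size=416):
    cap = cv2.VideoCapture(camera_index)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        img = cv2.resize(frame, (input_size, input_size))   # plain resize; letterboxing omitted
        for x1, y1, x2, y2, label, score in run_detector(img):
            cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(img, f"{label} {score:.2f}", (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imshow("gesture", img)
        if cv2.waitKey(1) & 0xFF == ord("q"):                # press q to quit
            break
    cap.release()
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
```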
Fig. 4  Samples from the self-made gesture dataset
Fig. 5  Samples from the Microsoft Kinect and Leap Motion dataset
Fig. 6  Samples from the Creative Senz3D dataset
Hardware  Model  Quantity
Main board Asus WS X299 SAGE 1
CPU Intel I9-9900X 20
Memory The Corsair 32 GB DDR4 2
CUDA Geforce RTX 2080Ti 4
Solid-state drives Corsair 4.0T 4
Hard disk Western digital 976.5 GB 1
Table 2  Hardware configuration for algorithm training
Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)
YOLOv3 ShuffleNetv2-0.5× 0.952 1.482 9.7 45
YOLOv3 ShuffleNetv2-1.0× 0.966 1.481 14.6 45
YOLOv3 ShuffleNetv2-1.5× 0.972 1.493 20.6 44
YOLOv3 ShuffleNetv2-2.0× 0.978 1.645 36.0 43
Table 3  Test results of ShuffleNetv2 with different numbers of output channels
Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)
YOLOv3+CBAM ShuffleNetv2-0.5× 0.979 1.587 10.2 44
YOLOv3+CBAM ShuffleNetv2-1.0× 0.992 1.594 15.1 44
YOLOv3+CBAM ShuffleNetv2-1.5× 0.990 1.620 21.1 43
YOLOv3+CBAM ShuffleNetv2-2.0× 0.987 1.680 36.5 43
Table 4  Test results of ShuffleNetv2+CBAM with different numbers of output channels
Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)
YOLOv3+6Anchors ShuffleNetv2-1.0× 0.966 1.481 14.6 45
YOLOv3+9Anchors ShuffleNetv2-1.0× 0.968 1.724 33.7 42
YOLOv3+CBAM+6Anchors ShuffleNetv2-1.0× 0.992 1.594 15.1 44
YOLOv3+CBAM+9Anchors ShuffleNetv2-1.0× 0.982 1.745 34.2 41
Table 5  Test results of model accuracy with different anchors
Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)
YOLOv3 ShuffleNetv2-1.0× 0.966 1.481 14.6 45
YOLOv3+CBAM ShuffleNetv2-1.0× 0.982 1.534 15.1 44
Table 6  Test results with the CBAM attention module
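Table 6 isolates the contribution of CBAM. For readers unfamiliar with the module, the sketch below is a generic re-implementation of CBAM's two steps, channel attention followed by spatial attention, using the original CBAM paper's default reduction ratio of 16 and a 7×7 spatial kernel; it is not the authors' exact module.

```python
# Generic CBAM re-implementation for reference (not the authors' exact module):
# channel attention followed by spatial attention, with the original paper's
# defaults of reduction ratio 16 and a 7x7 spatial convolution.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for avg- and max-pooled vectors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))               # global max pooling
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # channel-wise average
        mx = x.amax(dim=1, keepdim=True)                # channel-wise max
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.sa(self.ca(x))                      # channel attention first, then spatial


if __name__ == "__main__":
    y = CBAM(116)(torch.randn(1, 116, 52, 52))          # e.g. a 52x52 ShuffleNetv2-1.0x feature map
    print(y.shape)                                      # torch.Size([1, 116, 52, 52])
```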
Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)
YOLOv3 Darknet-53 0.98 4.138 246.6 41
YOLOv3+K-means Darknet-53 0.99 4.104 246.6 41
YOLOv3+CBAM ShuffleNetv2-1.0× 0.982 1.534 15.1 44
YOLOv3+CBAM+K-means ShuffleNetv2-1.0× 0.992 1.594 15.1 44
Table 7  Test results with anchors clustered by K-means
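Table 7 evaluates anchors regenerated by K-means. The usual procedure, as in the YOLO literature, clusters the labelled boxes' widths and heights using 1 − IoU as the distance; the NumPy sketch below illustrates it on toy data with k = 6 to match the six anchors used here. It is a generic illustration, not the authors' clustering script.

```python
# K-means over ground-truth box sizes with d(box, centroid) = 1 - IoU, as
# popularized by the YOLO detectors. Generic illustration on toy data, not the
# authors' clustering script; k=6 matches the six anchors used in this work.
import numpy as np


def wh_iou(wh, centroids):
    """IoU between boxes and centroids given only (w, h), both anchored at the origin."""
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    union = (wh[:, 0] * wh[:, 1])[:, None] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union


def kmeans_anchors(wh, k=6, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - wh_iou(wh, centroids), axis=1)   # nearest centroid by 1 - IoU
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]          # sort anchors by area


if __name__ == "__main__":
    # Toy data: (w, h) of labelled boxes in 416x416 image coordinates.
    boxes = np.abs(np.random.default_rng(1).normal(loc=120, scale=40, size=(500, 2)))
    print(kmeans_anchors(boxes, k=6).round(1))
```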
Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)
YOLOv2 Darknet-19 0.930 3.105 202.4 47
YOLOv3-Tiny Tiny 0.974 1.256 34.8 46
YOLOv3 ResNet-50 0.984 2.821 161.2 40
YOLOv3 Darknet-53 0.990 4.104 246.6 41
YOLOv3 MobileNetv2 0.955 2.051 28.0 37
SSD MobileNetv2 0.882 4.390 24.1 19
YOLOv3 ShuffleNetv2-1.0× 0.992 1.594 15.1 44
Table 8  Test results of different backbone networks
Dataset  Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)
Self-made dataset YOLOv3 Darknet-53 0.990 4.104 246.6 41
Self-made dataset YOLOv3 ShuffleNetv2-1.0× 0.992 1.594 15.1 44
Kinect dataset YOLOv3 Darknet-53 0.987 2.394 246.6 11
Kinect dataset YOLOv3 ShuffleNetv2-1.0× 0.987 1.289 15.1 13
Senz3D dataset YOLOv3 Darknet-53 0.990 2.136 246.6 28
Senz3D dataset YOLOv3 ShuffleNetv2-1.0× 0.991 0.966 15.1 31
Table 9  Test results on different datasets
Network model  Backbone network  mAP  TT/h  Ws/MB  v/(frame·s−1)  tGPU/ms  tCPU/ms
YOLOv3 Darknet-53 0.990 4.104 246.6 41 22 170
YOLOv3 ShuffleNetv2-1.0× 0.992 1.594 15.1 44 15 58
Table 10  Test results of the improved model on hardware performance
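The per-image latencies in Table 10 correspond to timing repeated forward passes after a warm-up, synchronizing the GPU before reading the clock. The sketch below shows one common way to measure this; it is generic benchmarking code, not the authors' script, and uses a torchvision ShuffleNetv2 as a stand-in model.

```python
# Generic single-image latency measurement (not the authors' benchmark code):
# warm-up iterations and torch.cuda.synchronize() keep CUDA launch overhead and
# lazy initialization out of the timed region.
import time
import torch


@torch.no_grad()
def mean_latency_ms(model, device, runs=100, warmup=10, size=416):
    model = model.to(device).eval()
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(warmup):                 # warm-up forward passes
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / runs


if __name__ == "__main__":
    from torchvision.models import shufflenet_v2_x1_0   # stand-in model for the sketch
    net = shufflenet_v2_x1_0()
    print(f"CPU: {mean_latency_ms(net, torch.device('cpu')):.1f} ms")
    if torch.cuda.is_available():
        print(f"GPU: {mean_latency_ms(net, torch.device('cuda')):.1f} ms")
```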
Fig. 7  Recognition results for single-target gestures
Fig. 8  Recognition results for multi-target gestures