Journal of ZheJiang University (Engineering Science)  2021, Vol. 55 Issue (10): 1815-1824    DOI: 10.3785/j.issn.1008-973X.2021.10.003
    
Static gesture real-time recognition method based on ShuffleNetv2-YOLOv3 model
Wen-bin XIN1, Hui-min HAO1, Ming-long BU1,2, Yuan LAN1, Jia-hai HUANG1,*, Xiao-yan XIONG1
1. School of Mechanical and Transportation Engineering, Taiyuan University of Technology, Taiyuan 030024, China
2. Harbin Electric Machinery Limited Company, Harbin 150040, China

Abstract  

An efficient real-time static gesture recognition method integrating ShuffleNetv2 and YOLOv3 was proposed to reduce the model's demand on hardware computing power, addressing the limited computing resources and small storage space of mobile platforms. The computational complexity of the model was reduced by replacing Darknet-53 with the lightweight ShuffleNetv2 as the backbone network. The CBAM attention mechanism module was introduced to strengthen the network's attention to spatial and channel features. The K-means clustering algorithm was used to regenerate the aspect ratios and number of anchors, so that the regenerated anchor sizes locate targets accurately and improve the detection accuracy of the model. Experimental results showed that the average recognition accuracy of the proposed algorithm on gesture recognition was 99.2%, and the recognition speed was 44 frames/s. The inference time for a single 416×416 image was 15 ms on the GPU and 58 ms on the CPU, and the model occupied 15.1 MB of memory. The method offers high recognition accuracy, fast recognition speed, and low memory occupancy, which facilitates model deployment on mobile terminals.
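The anchor-regeneration step described above can be sketched as K-means clustering over ground-truth box shapes with distance defined as 1 − IoU, the convention used in the YOLO literature. This is an illustrative sketch under assumed function names, not the authors' implementation; the deterministic area-based initialization is also an assumption.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs compared at a common top-left origin.

    boxes: (N, 2) array of widths and heights; anchors: (k, 2)."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=6, iters=100):
    """Cluster box shapes into k anchors using 1 - IoU as the distance."""
    # Deterministic init: pick boxes evenly spaced by area.
    order = np.argsort(boxes.prod(axis=1))
    anchors = boxes[order[np.linspace(0, len(boxes) - 1, k).astype(int)]].copy()
    assign = np.full(len(boxes), -1)
    for _ in range(iters):
        new_assign = (1.0 - iou_wh(boxes, anchors)).argmin(axis=1)
        if (new_assign == assign).all():   # converged
            break
        assign = new_assign
        for j in range(k):                 # move each anchor to its cluster mean
            if (assign == j).any():
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area
```

With six anchors (as in Tab. 5's best configuration), the clustered sizes would then be assigned to the detection scales from small to large.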



Key words: YOLOv3; lightweight ShuffleNetv2 network; CBAM attention mechanism; gesture recognition; mobile terminal
Received: 24 September 2020      Published: 27 October 2021
CLC:  TP 391  
Fund: National Key Research and Development Program of China (2018YFB1308700); 2020 Shanxi Province Special Projects for R&D of Key Core and Common Technologies (2020XXX009, 2020XXX001)
Corresponding Authors: Jia-hai HUANG     E-mail: 2878095493@qq.com;lanyuan@tyut.edu.cn;huangjiahai@tyut.edu.cn
Cite this article:

Wen-bin XIN,Hui-min HAO,Ming-long BU,Yuan LAN,Jia-hai HUANG,Xiao-yan XIONG. Static gesture real-time recognition method based on ShuffleNetv2-YOLOv3 model. Journal of ZheJiang University (Engineering Science), 2021, 55(10): 1815-1824.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2021.10.003     OR     https://www.zjujournals.com/eng/Y2021/V55/I10/1815


Fig.1 Overall structure of YOLOv3 model
Fig.2 Overall structure of ShuffleNetv2-YOLOv3 model
| Layer | Output size (Os) | Kernel size (Ks) | Stride (S) | Repeat (R) | Oc (0.5×) | Oc (1.0×) | Oc (1.5×) | Oc (2.0×) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Image | 416×416 | − | − | − | 3 | 3 | 3 | 3 |
| Conv1 | 208×208 | 3×3 | 2 | 1 | 24 | 24 | 24 | 24 |
| MaxPool | 104×104 | 3×3 | 2 | 1 | 24 | 24 | 24 | 24 |
| Stage2 | 52×52 | − | 2 | 1 | 48 | 116 | 176 | 244 |
| Stage2 | 52×52 | − | 1 | 3 | 48 | 116 | 176 | 244 |
| Stage3 | 26×26 | − | 2 | 1 | 96 | 232 | 352 | 488 |
| Stage3 | 26×26 | − | 1 | 7 | 96 | 232 | 352 | 488 |
| Stage4 | 13×13 | − | 2 | 1 | 192 | 464 | 704 | 976 |
| Stage4 | 13×13 | − | 1 | 3 | 192 | 464 | 704 | 976 |
| Conv5 | 13×13 | 1×1 | 1 | 1 | 1024 | 1024 | 1024 | 2048 |
| GlobalPool | 1×1 | 13×13 | − | − | − | − | − | − |
| FC | − | − | − | − | 1000 | 1000 | 1000 | 1000 |
Tab.1 Network structure of ShuffleNetv2
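The ShuffleNetv2 stages in Tab. 1 are built from grouped convolution units that rely on a channel-shuffle operation to exchange information between groups. A minimal NumPy sketch of that operation (the function name is an assumption, not the authors' code):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups, as in ShuffleNet.

    x: feature map of shape (N, C, H, W); C must be divisible by groups."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # Reshape into (N, groups, C/groups, H, W), swap the two channel axes,
    # then flatten back so channels from different groups alternate.
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))
```

For six channels and two groups, the channel order [0, 1, 2, 3, 4, 5] becomes [0, 3, 1, 4, 2, 5], so the next grouped convolution sees channels from both groups.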
Fig.3 Flowchart of gesture recognition
Fig.4 Self-made gesture dataset samples
Fig.5 Samples of Microsoft Kinect and Leap Motion dataset
Fig.6 Samples of Creative Senz3D dataset
| Hardware | Model | Quantity |
| --- | --- | --- |
| Main board | Asus WS X299 SAGE | 1 |
| CPU | Intel i9-9900X | 20 |
| Memory | Corsair 32 GB DDR4 | 2 |
| GPU (CUDA) | GeForce RTX 2080Ti | 4 |
| Solid-state drive | Corsair 4.0 TB | 4 |
| Hard disk | Western Digital 976.5 GB | 1 |
Tab.2 Algorithm training hardware environment configuration
| Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| YOLOv3 | ShuffleNetv2-0.5× | 0.952 | 1.482 | 9.7 | 45 |
| YOLOv3 | ShuffleNetv2-1.0× | 0.966 | 1.481 | 14.6 | 45 |
| YOLOv3 | ShuffleNetv2-1.5× | 0.972 | 1.493 | 20.6 | 44 |
| YOLOv3 | ShuffleNetv2-2.0× | 0.978 | 1.645 | 36.0 | 43 |
Tab.3 Test results of ShuffleNetv2 different output channels
| Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| YOLOv3+CBAM | ShuffleNetv2-0.5× | 0.979 | 1.587 | 10.2 | 44 |
| YOLOv3+CBAM | ShuffleNetv2-1.0× | 0.992 | 1.594 | 15.1 | 44 |
| YOLOv3+CBAM | ShuffleNetv2-1.5× | 0.990 | 1.620 | 21.1 | 43 |
| YOLOv3+CBAM | ShuffleNetv2-2.0× | 0.987 | 1.680 | 36.5 | 43 |
Tab.4 Test results of ShuffleNetv2+CBAM different output channels
| Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| YOLOv3+6 Anchors | ShuffleNetv2-1.0× | 0.966 | 1.481 | 14.6 | 45 |
| YOLOv3+9 Anchors | ShuffleNetv2-1.0× | 0.968 | 1.724 | 33.7 | 42 |
| YOLOv3+CBAM+6 Anchors | ShuffleNetv2-1.0× | 0.992 | 1.594 | 15.1 | 44 |
| YOLOv3+CBAM+9 Anchors | ShuffleNetv2-1.0× | 0.982 | 1.745 | 34.2 | 41 |
Tab.5 Test results of model accuracy by different Anchors
| Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| YOLOv3 | ShuffleNetv2-1.0× | 0.966 | 1.481 | 14.6 | 45 |
| YOLOv3+CBAM | ShuffleNetv2-1.0× | 0.982 | 1.534 | 15.1 | 44 |
Tab.6 Test results of using CBAM attention mechanism module
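The channel-attention half of CBAM, whose effect is measured in Tab. 6, rescales each feature channel with a shared two-layer MLP applied to average- and max-pooled channel descriptors. A minimal NumPy sketch; the function names, weight shapes `w1`/`w2`, and reduction ratio are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """CBAM-style channel attention.

    x: feature map (N, C, H, W).
    w1: (C, C/r) and w2: (C/r, C) form the shared MLP with reduction r."""
    avg = x.mean(axis=(2, 3))                   # (N, C) average-pooled descriptor
    mx = x.max(axis=(2, 3))                     # (N, C) max-pooled descriptor
    def mlp(v):
        return np.maximum(v @ w1, 0.0) @ w2     # ReLU hidden layer
    scale = sigmoid(mlp(avg) + mlp(mx))         # per-channel weights in (0, 1)
    return x * scale[:, :, None, None]          # rescale each channel
```

The full CBAM module follows this with an analogous spatial-attention map; both are lightweight enough that Tab. 6 shows only a 0.5 MB and 1 frame/s cost.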
| Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| YOLOv3 | Darknet-53 | 0.98 | 4.138 | 246.6 | 41 |
| YOLOv3+K-means | Darknet-53 | 0.99 | 4.104 | 246.6 | 41 |
| YOLOv3+CBAM | ShuffleNetv2-1.0× | 0.982 | 1.534 | 15.1 | 44 |
| YOLOv3+CBAM+K-means | ShuffleNetv2-1.0× | 0.992 | 1.594 | 15.1 | 44 |
Tab.7 Test results of using K-means to cluster Anchors
| Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| YOLOv2 | Darknet-19 | 0.930 | 3.105 | 202.4 | 47 |
| YOLOv3-Tiny | Tiny | 0.974 | 1.256 | 34.8 | 46 |
| YOLOv3 | ResNet-50 | 0.984 | 2.821 | 161.2 | 40 |
| YOLOv3 | Darknet-53 | 0.990 | 4.104 | 246.6 | 41 |
| YOLOv3 | MobileNetv2 | 0.955 | 2.051 | 28.0 | 37 |
| SSD | MobileNetv2 | 0.882 | 4.390 | 24.1 | 19 |
| YOLOv3 | ShuffleNetv2-1.0× | 0.992 | 1.594 | 15.1 | 44 |
Tab.8 Test results of different backbone networks
| Dataset | Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- | --- |
| Self-made dataset | YOLOv3 | Darknet-53 | 0.990 | 4.104 | 246.6 | 41 |
| Self-made dataset | YOLOv3 | ShuffleNetv2-1.0× | 0.992 | 1.594 | 15.1 | 44 |
| Kinect dataset | YOLOv3 | Darknet-53 | 0.987 | 2.394 | 246.6 | 11 |
| Kinect dataset | YOLOv3 | ShuffleNetv2-1.0× | 0.987 | 1.289 | 15.1 | 13 |
| Senz3D dataset | YOLOv3 | Darknet-53 | 0.990 | 2.136 | 246.6 | 28 |
| Senz3D dataset | YOLOv3 | ShuffleNetv2-1.0× | 0.991 | 0.966 | 15.1 | 31 |
Tab.9 Test results on different datasets
| Network model | Backbone | mAP | TT/h | Ws/MB | v/(frame·s⁻¹) | tGPU/ms | tCPU/ms |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv3 | Darknet-53 | 0.990 | 4.104 | 246.6 | 41 | 22 | 170 |
| YOLOv3 | ShuffleNetv2-1.0× | 0.992 | 1.594 | 15.1 | 44 | 15 | 58 |
Tab.10 Improved model test results on hardware performance
Fig.7 Recognition results of single target gesture
Fig.8 Recognition results of multi-target gesture