Journal of ZheJiang University (Engineering Science)  2023, Vol. 57 Issue (3): 446-454    DOI: 10.3785/j.issn.1008-973X.2023.03.002
    
Worker behavior recognition based on temporal and spatial self-attention of vision Transformer
Yu-xiang LU 1, Guan-hua XU 2,*, Bo TANG 1,3
1. College of Metrology and Measurement Engineering, China Jiliang University, Hangzhou 310018, China
2. Zhejiang Province’s Key Laboratory of 3D Printing Process and Equipment, State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University, Hangzhou 310027, China
3. Ningbo Water Meter (Group) Limited Company, Ningbo 315033, China

Abstract  

A video human behavior recognition model based on the Transformer network structure was proposed to address worker behavior recognition in the special scenario of human-robot collaboration. The self-attention mechanism at the core of the Transformer network was used to reduce the structural complexity of the network and to improve its performance. Building on the extraction of spatial features from individual frames, temporal feature analysis was added, so that video data were processed along both the spatial and the temporal dimensions. A classification vector was then extracted from the processed data and passed to the classification module to obtain the final recognition result. To verify the effectiveness of the model, human behavior recognition experiments were conducted on the public UCF101 dataset and on a self-built dataset of routine worker behaviors collected in the laboratory. Experimental results showed that the average recognition accuracy of the model was 93.44% on UCF101 and 98.54% on the self-built dataset.
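The abstract describes a temporal-then-spatial factorization of self-attention over video. A minimal sketch of that idea is given below; it assumes a divided space-time attention scheme, and all names and sizes (SpaceTimeBlock, 768-d tokens, 12 heads) are illustrative choices, not the authors' code.

```python
# Illustrative sketch of divided space-time self-attention over video tokens:
# temporal attention across frames, then spatial attention across patches.
# Module names and dimensions are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (B, T, N, D), frames x patches
        B, T, N, D = x.shape
        # temporal attention: each patch position attends across the T frames
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.temporal(self.norm1(t), self.norm1(t), self.norm1(t))
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # spatial attention: the N patches of each frame attend to one another
        s = x.reshape(B * T, N, D)
        s, _ = self.spatial(self.norm2(s), self.norm2(s), self.norm2(s))
        return x + s.reshape(B, T, N, D)
        # in the full model, a classification vector aggregated from such
        # blocks is passed to the classification module (a linear head)

# toy usage: batch of 2 clips, 8 sampled frames, 196 patches, 768-d embeddings
video_tokens = torch.randn(2, 8, 196, 768)
print(SpaceTimeBlock()(video_tokens).shape)   # torch.Size([2, 8, 196, 768])
```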



Key words: human-robot collaboration; Transformer; temporal and spatial self-attention; worker action; behavior recognition
Received: 20 May 2022      Published: 31 March 2023
CLC:  TP 391  
Fund: National Natural Science Foundation of China (51805477)
Corresponding Authors: Guan-hua XU     E-mail: yuxiang_lu1996@163.com;xuguanhua@zju.edu.cn
Cite this article:

Yu-xiang LU, Guan-hua XU, Bo TANG. Worker behavior recognition based on temporal and spatial self-attention of vision Transformer. Journal of ZheJiang University (Engineering Science), 2023, 57(3): 446-454.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2023.03.002     OR     https://www.zjujournals.com/eng/Y2023/V57/I3/446


Fig.1 Working example of self-attention module
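For reference alongside Fig.1, scaled dot-product self-attention as defined in ref. [14] reduces to a few lines. The function name and sizes in this sketch are illustrative.

```python
# Scaled dot-product self-attention (ref. [14]); sizes are illustrative.
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, dim) token embeddings; w_*: (dim, dim) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # similarity of every token pair
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v                        # weighted mix of value vectors

dim = 64
x = torch.randn(10, dim)                      # 10 tokens
out = self_attention(x, *(torch.randn(dim, dim) for _ in range(3)))
print(out.shape)                              # torch.Size([10, 64])
```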
Fig.2 Flowchart of worker behavior recognition model
Fig.3 Image block
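Fig.3 concerns dividing each frame into image blocks (patches). In ViT-style models (ref. [17]) this is commonly implemented as a strided convolution; the sketch below assumes 16×16 patches and 768-d embeddings as in ViT-Base, which may differ from the paper's exact settings.

```python
# Patch embedding sketch: split a frame into 16x16 blocks and linearly embed
# each block as one token, using the common strided-Conv2d equivalent.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # one patch -> 768-d token
frame = torch.randn(1, 3, 224, 224)
tokens = patch_embed(frame).flatten(2).transpose(1, 2)      # (1, 196, 768): 14x14 patches
print(tokens.shape)
```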
Fig.4 Screenshots of self-built dataset with eight categories of video
Model            Acc_min/%   Acc_max/%   Acc_avg/%   TPR/%    F1
C3D [23]         85.17       85.42       85.32       98.35    0.7412
ViT [17]         88.39       88.71       88.54       96.48    0.8557
P3D [24]         88.51       88.65       88.59       96.43    0.7994
Conv-LSTM [25]   88.53       88.68       88.61       98.16    0.8235
Proposed model   93.25       93.68       93.44       99.21    0.9226
Tab.1 Results evaluated by different models on UCF101 dataset
Video category                 Acc_min/%   Acc_max/%   Acc_avg/%
Human-object interaction       92.62       93.70       93.16
Body motion only               92.19       92.26       92.28
Human-human interaction        96.89       96.96       96.93
Playing musical instruments    97.76       98.50       98.13
Sports                         92.64       93.53       93.09
Tab.2 Recognition accuracy of proposed model for UCF101 video categories
Fig.5 Variation of recognition accuracy for different image recognition models
Fig.6 Variation of training loss rate for different image recognition models
Model            Acc_min/%   Acc_max/%   Acc_avg/%   TPR/%    F1
ViT              92.55       92.68       92.65       97.54    0.8903
Proposed model   98.50       98.58       98.54       100.00   0.9812
Tab.3 Results evaluated by different image recognition models on self-built dataset
Fig.7 Recognition accuracy of different image recognition models by category
Sampled frames   Acc (validation)/%   Acc (test)/%   TPR/%
2                92.21                93.67          97.71
4                93.03                94.30          98.64
8                94.85                95.17          99.73
16               98.54                99.73          100.00
32               95.67                99.25          100.00
Tab.4 Evaluation experimental results of proposed model for different sampling frames
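Tab.4 varies the number of frames sampled per clip, with accuracy peaking at 16 frames. A common way to draw a fixed number of evenly spaced (discrete) frames is sketched below; the paper's exact sampling scheme (cf. the fixed-frame and key-frame variants in Tab.5) is not given here, so the index arithmetic is an assumption.

```python
# Uniformly sample a fixed number of frame indices from a clip (assumed scheme).
import numpy as np

def sample_frame_indices(total_frames: int, num_samples: int) -> np.ndarray:
    """Pick num_samples frame indices spread evenly over the clip."""
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)

print(sample_frame_indices(120, 16))  # 16 evenly spaced indices in [0, 119]
```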
Model            Acc (consecutive, fixed)/%   Acc (consecutive, key)/%   Acc (discrete, key)/%
ViT              90.54                        92.62                      92.55
Proposed model   97.29                        98.54                      96.97
Tab.5 Recognition accuracy of proposed model for different types of input frames
Structural adjustment                     Acc/%
No pre-trained parameter initialization   89.72
Head = 4                                  88.71
Head = 8                                  94.44
Head = 12                                 98.54
Head = 16                                 97.31
Dual spatial self-attention modules       92.96
Tab.6 Experimental results of structure ablation of proposed model
[1] LASOTA P A, ROSSANO G F, SHAH J A. Toward safe close-proximity human-robot interaction with standard industrial robots [C]// Proceedings of the 2014 IEEE International Conference on Automation Science and Engineering. [S. l.]: IEEE, 2014: 339-344.
[2] SCHMIDT B, WANG L. Depth camera based collision avoidance via active robot control [J]. Journal of Manufacturing Systems, 2014, 33(4): 711-718. doi: 10.1016/j.jmsy.2014.04.004
[3] FU Qian. Analysis of human behavior recognition [J]. China Computer and Communication, 2017(24): 146-147. (in Chinese) doi: 10.3969/j.issn.1003-9767.2017.24.058
[4] ZANCHETTIN A M, CASALINO A, PIRODDI L, et al. Prediction of human activity patterns for human-robot collaborative assembly tasks [J]. IEEE Transactions on Industrial Informatics, 2019, 15(7): 3934-3942. doi: 10.1109/TII.2018.2882741
[5] ZANCHETTIN A M, ROCCO P. Probabilistic inference of human arm reaching target for effective human-robot collaboration [C]// Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vancouver: IEEE, 2017: 6595-6600.
[6] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/1406.2199.pdf.
[7] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1933-1941.
[8] YE Q, LIANG Z, ZHONG H, et al. Human behavior recognition based on time correlation sampling two stream heterogeneous grafting network [J]. Optik, 2022, 251: 168402. doi: 10.1016/j.ijleo.2021.168402
[9] PENG B, YAO Z, WU Q, et al. 3D convolutional neural network for human behavior analysis in intelligent sensor network [J]. Mobile Networks and Applications, 2022, 27: 1559-1568. doi: 10.1007/s11036-021-01873-8
[10] ZHANG Chuan-lei, WU Da-shuo, XIANG Qi-huai, et al. Office staff behavior recognition based on ResNet-LSTM attention mechanism [J]. Journal of Tianjin University of Science and Technology, 2020, 35(6): 72-80. (in Chinese) doi: 10.13364/j.issn.1672-6510.20190252
[11] YU S, CHENG Y, XIE L, et al. A novel recurrent hybrid network for feature fusion in action recognition [J]. Journal of Visual Communication and Image Representation, 2017, 49: 192-203. doi: 10.1016/j.jvcir.2017.09.007
[12] TANBERK S, KILIMCI Z H, TÜKEL D B, et al. A hybrid deep model using deep learning and dense optical flow approaches for human activity recognition [J]. IEEE Access, 2020, 8: 19799-19809. doi: 10.1109/ACCESS.2020.2968529
[13] WU J, YANG X, XI M, et al. Research on behavior recognition algorithm based on SE-I3D-GRU network [J]. High Technology Letters, 2021, 27(2): 163-172.
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/1706.03762.pdf.
[15] PARMAR N, VASWANI A, USZKOREIT J, et al. Image transformer [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/1802.05751.pdf.
[16] ZHU X, SU W, LU L, et al. Deformable DETR: deformable transformers for end-to-end object detection [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/2010.04159.pdf.
[17] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/2010.11929.pdf.
[18] ZHOU D, SHI Y, KANG B, et al. Refiner: refining self-attention for vision transformers [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/2106.03714.pdf.
[19] CHEN C F, FAN Q F, PANDA R. CrossViT: cross-attention multi-scale vision transformer for image classification [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 347-356.
[20] LI G, LIU Z, CAI L, et al. Standing-posture recognition in human-robot collaboration based on deep learning and the Dempster-Shafer evidence theory [J]. Sensors, 2020, 20(4): 1158. doi: 10.3390/s20041158
[21] JIANG J, NAN Z, CHEN H, et al. Predicting short-term next-active-object through visual attention and hand position [J]. Neurocomputing, 2021, 433: 212-222. doi: 10.1016/j.neucom.2020.12.069
[22] WANG Tao, WANG Hong-zhang, XIA Yi, et al. Human gait recognition based on convolutional neural network and attention model [J]. Chinese Journal of Sensors and Actuators, 2019, 32(7): 1027-1033. (in Chinese) doi: 10.3969/j.issn.1004-1699.2019.07.012
[23] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 4489-4497.
[24] QIU Z, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5533-5541.