基于视觉Transformer时空自注意力的工人行为识别

doi:10.3785/j.issn.1008-973X.2023.03.002

浙江大学学报(工学版)

2023, Vol. 57

Issue (3): 446-454 DOI: 10.3785/j.issn.1008-973X.2023.03.002

计算机与控制工程

基于视觉Transformer时空自注意力的工人行为识别

陆昱翔1(

),徐冠华2,*(

),唐波1,3

1. 中国计量大学计量测试工程学院，浙江杭州 310018
2. 浙江大学浙江省三维打印工艺与装备重点实验室，流体动力与机电系统国家重点实验室，浙江杭州 310027
3. 宁波水表（集团）股份有限公司，浙江宁波 315033

Worker behavior recognition based on temporal and spatial self-attention of vision Transformer

Yu-xiang LU1(

),Guan-hua XU2,*(

),Bo TANG1,3

1. College of Metrology and Measurement Engineering, China Jiliang University, Hangzhou 310018, China
2. Zhejiang Province’s Key Laboratory of 3D Printing Process and Equipment, State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University, Hangzhou 310027, China
3. Ningbo Water Meter (Group) Limited Company, Ningbo 315033, China

全文: PDF(2240 KB) HTML

摘要：

针对人机协作特殊场景中工人行为识别的问题，提出基于Transformer网络的视频人体行为识别模型，利用Transformer网络核心的自注意力机制，减少网络的结构复杂度，提升网络的性能. 模型在提取图像空间特征的基础上，增加时间特征的分析，从空间和时间2个维度实现对视频数据的处理. 在处理后的数据中提取分类向量传入分类模块，得到最终的识别结果. 为了验证模型的有效性，分别在公开数据集UCF101和实验室采集的工人常规行为（自建）数据集上进行人体行为识别实验. 实验结果显示，在UCF101上模型平均识别准确率为93.44%，在自建数据集上模型平均识别准确率为98.54%.

关键词： 人机协作; Transformer; 时空自注意力; 工人行为; 行为识别

Abstract:

A video human behavior recognition model based on Transformer network structure was proposed, in order to solve the problem of worker behavior recognition in the special scene of human-robot collaboration. The self-attention mechanism at the core of Transformer network was used to reduce the structure complexity and boost the performance of the network. On the basis of extracting the spatial features of the image, a method of adding time features analysis was used to process the video data from two dimensions of space and time. After that, the classification vector was extracted from the processed data, and passed into the classification module to get the final recognition result. Human behavior recognition experiments were carried out on the public dataset UCF101 and the routine behavior dataset of workers collected in the laboratory (a self-built dataset) respectively, in order to verify the effectiveness of the model. Experimental results showed that the average recognition accuracy of the model on UCF101 was 93.44%, and the average recognition accuracy of the model on the self-built dataset was 98.54%.

Key words: human-robot collaboration Transformer temporal and spatial self-attention worker action behavior recognition

收稿日期: 2022-05-20 出版日期: 2023-03-31

CLC:

TP 391

基金资助: 国家自然科学基金资助项目（51805477）

通讯作者: 徐冠华 E-mail: yuxiang_lu1996@163.com;xuguanhua@zju.edu.cn

作者简介: 陆昱翔（1996—），男，硕士生，从事图像处理及机器人自动化应用研究. orcid.org/0000-0001-8285-8796.E-mail： yuxiang_lu1996@163.com

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	作者相关文章
	陆昱翔
	徐冠华
	唐波

引用本文:

陆昱翔,徐冠华,唐波. 基于视觉Transformer时空自注意力的工人行为识别[J]. 浙江大学学报(工学版), 2023, 57(3): 446-454.

Yu-xiang LU,Guan-hua XU,Bo TANG. Worker behavior recognition based on temporal and spatial self-attention of vision Transformer. Journal of ZheJiang University (Engineering Science), 2023, 57(3): 446-454.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.03.002 或 https://www.zjujournals.com/eng/CN/Y2023/V57/I3/446

图 1 自注意力模块的工作示例

图 2 工人行为识别模型的流程图

图 3 图像分块

图 4 自建数据集8类视频的截图

表 1 不同模型在UCF101数据集上的评估实验结果

表 2 本研究模型对UCF101各类别视频的识别精度

图 5 不同图像识别模型的识别准确率变化

图 6 不同图像识别模型的训练损失率变化

表 3 不同图像识别模型在自建数据集上的评估实验结果

图 7 不同图像识别模型的类别识别准确率

表 4 本研究模型在不同采样帧数下的评估实验结果

表 5 本研究模型在不同类型输入帧下的识别精度

表 6 本研究模型的结构消融实验结果

1	LASOTA P A, ROSSANO G F, SHAH J A. Toward safe close-proximity human-robot interaction with standard industrial robots [C]// Proceeding of 2014 IEEE International Conference on Automation Science and Engineering. [S. l.]: IEEE, 2014: 339-344.
2	SCHMIDT B, WANG L Depth camera based collision avoidance via active robot control[J]. Journal of Manufacturing Systems, 2014, 33 (4): 711- 718 doi: 10.1016/j.jmsy.2014.04.004
3	富倩人体行为识别研究[J]. 信息与电脑, 2017, (24): 146- 147 FU Qian Analysis of human behavior recognition[J]. China Computer and Communication, 2017, (24): 146- 147 doi: 10.3969/j.issn.1003-9767.2017.24.058
4	ZANCHETTIN A M, CASALINO A, PIRODDI L, et al Prediction of human activity patterns for human–robot collaborative assembly tasks[J]. IEEE Transactions on Industrial Informatics, 2019, 15 (7): 3934- 3942 doi: 10.1109/TII.2018.2882741
5	ZANCHETTIN A M, ROCCO P. Probabilistic inference of human arm reaching target for effective human-robot collaboration [C]// Proceeding of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems. Vancouver: IEEE, 2017: 6595-6600.
6	SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/1406.2199.pdf.
7	FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1933-1941.
8	YE Q, LIANG Z, ZHONG H, et al Human behavior recognition based on time correlation sampling two stream heterogeneous grafting network[J]. Optik, 2022, 251: 168402 doi: 10.1016/j.ijleo.2021.168402
9	PENG B, YAO Z, WU Q, et al 3D convolutional neural network for human behavior analysis in intelligent sensor network[J]. Mobile Networks and Applications, 2022, 27: 1559- 1568 doi: 10.1007/s11036-021-01873-8
10	张传雷, 武大硕, 向启怀, 等基于ResNet-LSTM的具有注意力机制的办公人员行为视频识别[J]. 天津科技大学学报, 2020, 35 (6): 72- 80 ZHANG Chuan-lei, WU Da-shuo, XIANG Qi-huai, et al Office staff behavior recognition based on ResNET-LSTM attention mechanism[J]. Journal of Tianjin University of Science and Technology, 2020, 35 (6): 72- 80 doi: 10.13364/j.issn.1672-6510.20190252
11	YU S, CHENG Y, XIE L, et al A novel recurrent hybrid network for feature fusion in action recognition[J]. Journal of Visual Communication and Image Representation, 2017, 49: 192- 203 doi: 10.1016/j.jvcir.2017.09.007
12	TANBERK S, KILIMCI Z H, TÜKEL D B, et al A hybrid deep model using deep learning and dense optical flow approaches for human activity recognition[J]. IEEE Access, 2020, 8: 19799- 19809 doi: 10.1109/ACCESS.2020.2968529
13	WU J, YANG X, XI M, et al Research on behavior recognition algorithm based on SE-I3D-GRU network[J]. High Technology Letters, 2021, 27 (2): 163- 172
14	VASWANI A, SHAZZER N, PARMAR N, et al. Attention is all you need [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/1706.03762.pdf.
15	PARMAR N, VASWANI A, USZKOREIT J, et al. Image transformer [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/1802.05751.pdf.
16	ZHU X, SU W, LU L, et al. Deformable DETR: deformable transformers for end-to-end object detection [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/2010.04159.pdf.
17	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. [2022-05-03]. https://arxiv.org/pdf/2010.11929.pdf.
18	ZHOU D, SHI Y, KANG B, et al. Refiner: refining self-attention for vision-transformers[EB/OL]. [2022-05-03]. https://arxiv.org/pdf/2106.03714.pdf.
19	CHEN C F, FAN Q F, PANDA R. CrossViT: cross-attention multi-scale vision transformer for image classification [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 347-356.
20	LI G, LIU Z, CAI L, et al Standing-posture recognition in human–robot collaboration based on deep learning and the dempster–shafer evidence theory[J]. Sensors, 2020, 20 (4): 1158 doi: 10.3390/s20041158
21	JIANG J, NAN Z, CHEN H, et al Predicting short-term next-active-object through visual attention and hand position[J]. Neurocomputing, 2021, 433: 212- 222 doi: 10.1016/j.neucom.2020.12.069
22	汪涛, 汪泓章, 夏懿, 等基于卷积神经网络与注意力模型的人体步态识别[J]. 传感技术学报, 2019, 32 (7): 1027- 1033 WANG Tao, WANG Hong-zhang, XIA Yi, et al Human gait recognition based on convolutional neural network and attention model[J]. Chinese Journal of Sensors and Actuators, 2019, 32 (7): 1027- 1033 doi: 10.3969/j.issn.1004-1699.2019.07.012
23	TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 4489-4497.
24	QIU Z, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5533-5541.

[1]	王万良,王铁军,陈嘉诚,尤文波. 融合多尺度和多头注意力的医疗图像分割方法[J]. 浙江大学学报(工学版), 2022, 56(9): 1796-1805.
[2]	袁天乐,袁巨龙,朱勇建,郑翰辰. 基于改进YOLOv5的推力球轴承表面缺陷检测算法[J]. 浙江大学学报(工学版), 2022, 56(12): 2349-2357.
[3]	刘德斌,王旦,陈柏,王尧尧,宋立瑶. 外肢体机器人研究综述[J]. 浙江大学学报(工学版), 2021, 55(2): 251-258.
[4]	汤自林,高霄,肖晓晖. 基于模仿学习的变刚度人机协作搬运控制[J]. 浙江大学学报(工学版), 2021, 55(11): 2091-2099.
[5]	王钰翔, 李晟洁, 王皓, 马钧轶, 王亚沙, 张大庆. 基于Wi-Fi的非接触式行为识别研究综述[J]. 浙江大学学报(工学版), 2017, 51(4): 648-654.
[6]	胡晨, 吴新科, 彭方正, 钱照明. 变压器级联的双路均流准谐振反激LED驱动器[J]. 浙江大学学报(工学版), 2015, 49(4): 740-748.
[7]	叶芳芳,许力,杜鉴豪,杨洁. 以头部为基准的人体轮廓模型[J]. J4, 2011, 45(7): 1175-1180.

Viewed

Full text

Abstract

Cited

Shared

Discussed