Journal of Zhejiang University (Engineering Science)  2024, Vol. 58, Issue (1): 10-19    DOI: 10.3785/j.issn.1008-973X.2024.01.002
Computer Technology
Video object detection algorithm based on multi-level feature aggregation under mixed sampler
Siyi QIN 1,2, Shaoyan GAI 1,2,*, Feipeng DA 1,2
1. School of Automation, Southeast University, Nanjing 210096, China
2. Key Laboratory of Measurement and Control of Complex Engineering Systems, Ministry of Education, Southeast University, Nanjing 210096, China
Abstract:

A video object detection algorithm based on a mixed weighted reference-frame sampler and multi-level feature aggregation attention was proposed on top of the YOLOX-S single-stage detector, aiming at the problem that existing deep-learning-based video object detection algorithms fail to meet accuracy and efficiency requirements simultaneously. The mixed weighted reference-frame sampler (MWRS) combined a weighted random sampling operation with a local consecutive sampling operation to fully exploit effective global information and inter-frame local information. The multi-level feature aggregation attention (MFAA) module refined the classification features extracted by YOLOX-S with a self-attention mechanism, encouraging the network to learn richer feature information from features at different levels. Experimental results demonstrated that the proposed algorithm achieved an average precision AP50 of 77.8% on the ImageNet VID dataset with an average detection speed of 11.5 ms per frame. Its object classification and localization performance was significantly better than that of YOLOX-S, indicating that the proposed algorithm attains high accuracy with fast detection speed.

Key words: machine vision    video object detection    feature aggregation    attention mechanism    YOLOX
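The MFAA module described above refines classification features with a self-attention mechanism over features from different levels. As a rough illustration only, not the authors' implementation (the single-head form, the absence of learned query/key/value projections, and the input shape are simplifying assumptions), scaled dot-product self-attention over stacked per-level feature vectors can be sketched as:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x: (n, d) array, one row per feature vector, e.g. pooled features
    from the n pyramid levels that MFAA aggregates.
    Returns refined features of the same shape.
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                # (n, n) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
    return attn @ x                              # aggregate across levels
```

Each output row is a convex combination of the input rows, so every level's refined feature mixes in information from the other levels, which is the intuition behind learning "richer feature information from features at different levels".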
Received: 2023-06-13    Published online: 2023-11-07
CLC:  TP 391  
Fund program: Special Project on Frontier Leading Technology Basic Research of Jiangsu Province (BK20192004C); Priority Academic Program Development of Jiangsu Higher Education Institutions
Corresponding author: Shaoyan GAI    E-mail: qin.siyi@foxmail.com; qxxymm@163.com
About the author: Siyi QIN (1999—), female, master's student, engaged in research on object detection. orcid.org/0009-0004-8702-1230. E-mail: qin.siyi@foxmail.com

Cite this article:


Siyi QIN, Shaoyan GAI, Feipeng DA. Video object detection algorithm based on multi-level feature aggregation under mixed sampler. Journal of Zhejiang University (Engineering Science), 2024, 58(1): 10-19.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.01.002        https://www.zjujournals.com/eng/CN/Y2024/V58/I1/10

Fig. 1  Structure of the SA and MFAA modules
Fig. 2  Network architecture of the MMNet algorithm
Fig. 3  Structure of the mixed weighted reference-frame sampling strategy
$k_{\rm g}:k_{\rm l}$   AP50/%
1:2                     72.6
1:1                     74.6
2:1                     76.1
3:1                     77.1
4:1                     77.8
5:1                     77.7
Table 1  Accuracy of different sampling ratios on the ImageNet VID validation set
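The $k_{\rm g}:k_{\rm l}$ ratio in Table 1 controls how many reference frames come from global weighted random sampling versus local consecutive sampling around the key frame. A minimal sketch of such a mixed sampler follows; the function name, the uniform placeholder weights, and the local-window handling are illustrative assumptions, not the paper's implementation:

```python
import random

def mwrs_sample(num_frames, key_idx, k_g=4, k_l=1, weights=None):
    """Illustrative mixed weighted reference-frame sampling.

    Draws k_g global reference frames by weighted random sampling over
    the whole video and k_l frames consecutive to the key frame;
    k_g:k_l = 4:1 is the best-performing ratio in Table 1.
    """
    if weights is None:
        # Uniform placeholder weights; the actual strategy weights
        # frames non-uniformly, which is not reproduced here.
        weights = [1.0] * num_frames
    global_refs = random.choices(range(num_frames), weights=weights, k=k_g)
    # k_l consecutive frames after the key frame, clipped to the video end.
    local_refs = [min(key_idx + d, num_frames - 1) for d in range(1, k_l + 1)]
    return global_refs, local_refs
```

For example, `mwrs_sample(100, 50)` returns four globally sampled frame indices plus the frame immediately following the key frame, mirroring the 4:1 setting.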
Model          Backbone         t/ms     AP50/%
FGFA[1]        ResNet-101       104.2    76.3
MEGA[10]       ResNet-101       230.4    82.9
VOD-MT[30]     VGG-16           —        73.2
LSTS[11]       ResNet-101       43.5     77.2
TIAM[31]       ResNet-101       —        74.9
QueryProp[7]   ResNet-50        21.9(T)  80.3
SALISA[32]     EfficientNet-B3  —        75.4
YOLOX-S[20]    Modified CSP v5  9.4      69.5
MMNet          Modified CSP v5  11.5     77.8
Table 2  Experimental results of different algorithms on the ImageNet VID validation set
Fig. 4  Visual comparison of detection results of the two algorithms on the ImageNet VID test set
AP50/%
Speed     YOLOX-S   YOLOX-S+MWRS
Slow      80.1      81.5
Medium    71.4      75.6
Fast      55.3      59.3
Average   69.5      77.1
Table 3  Accuracy of the mixed weighted reference-frame sampling strategy for objects of different speeds on the ImageNet VID validation set
AP50/%
Speed     YOLOX-S   YOLOX-S+SA   YOLOX-S+MFAA
Slow      80.1      81.8         81.9
Medium    71.4      75.4         75.6
Fast      55.3      58.3         59.3
Average   69.5      76.9         77.5
Table 4  Accuracy of the multi-level feature aggregation attention module for objects of different speeds on the ImageNet VID validation set
MWRS   MFAA   P/10^6   FLOPs/10^9   t/ms   AP50/%
×      ×      8.95     21.63        9.4    69.5
✓      ×      8.95     21.63        9.6    77.1
×      ✓      10.41    26.88        11.3   77.5
✓      ✓      10.41    26.88        11.5   77.8
Table 5  Ablation results of the proposed algorithm on the ImageNet VID validation set
1 SHI Yuhu, ZHANG Qigui. Method for fast video object detection based on local attention [J]. Computer Engineering, 2022, 48(5): 314-320.
2 ZHU X, WANG Y, DAI J, et al. Flow-guided feature aggregation for video object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Venice: IEEE, 2017: 408-417.
3 ZHU X, XIONG Y, DAI J, et al. Deep feature flow for video recognition [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 2349-2358.
4 FEICHTENHOFER C, PINZ A, ZISSERMAN A. Detect to track and track to detect [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Venice: IEEE, 2017: 3038-3046.
5 KANG K, LI H, YAN J, et al. T-CNN: tubelets with convolutional neural networks for object detection from videos [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 28(10): 2896-2907.
6 HAN M, WANG Y, CHANG X, et al. Mining inter-video proposal relations for video object detection [C]// Proceedings of the European Conference on Computer Vision. Glasgow: Springer, 2020: 431-446.
7 HE F, GAO N, JIA J, et al. QueryProp: object query propagation for high-performance video object detection [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2022, 36(1): 834-842.
8 JIAO L, ZHANG R, LIU F, et al. New generation deep learning for video object detection: a survey [J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(8): 3195-3215.
doi: 10.1109/TNNLS.2021.3053249
9 WU H, CHEN Y, WANG N, et al. Sequence level semantics aggregation for video object detection [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 9217-9225.
10 CHEN Y, CAO Y, HU H, et al. Memory enhanced global-local aggregation for video object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 10337-10346.
11 JIANG Z, LIU Y, YANG C, et al. Learning where to focus for efficient video object detection [C]// Proceedings of European Conference on Computer Vision. Berlin: Springer, 2020: 18-34.
12 RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
doi: 10.1007/s11263-015-0816-y
13 GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 580-587.
14 REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2016: 779-788.
15 LI Kai, LIN Yushun, WU Xiaolin, et al. Small target vehicle detection based on multi-scale fusion technology and attention mechanism [J]. Journal of Zhejiang University (Engineering Science), 2022, 56(11): 2241-2250.
16 REDMON J, FARHADI A. YOLO9000: better, faster, stronger [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Honolulu: IEEE, 2017: 6517–6525.
17 REDMON J, FARHADI A. YOLOv3: an incremental improvement [EB/OL]. (2018-04-08)[2023-07-31]. https://arxiv.org/abs/1804.02767.
18 LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector [C]// Proceedings of the European Conference on Computer Vision. [S. l. ]: Springer, 2016: 21-37.
19 YAN B, FAN P, LEI X, et al. A real-time apple targets detection method for picking robot based on improved YOLOv5 [J]. Remote Sensing, 2021, 13(9): 1619-1627.
doi: 10.3390/rs13091619
20 GE Z, LIU S, WANG F, et al. YOLOX: exceeding YOLO series in 2021 [EB/OL]. (2021-08-06)[2023-07-31]. https://arxiv.org/abs/2107.08430.
21 YU Nanjing, FAN Xiaobiao, DENG Tianmin, et al. Ship detection algorithm in complex backgrounds via multi-head self-attention [J]. Journal of Zhejiang University (Engineering Science), 2022, 56(12): 2392-2402.
22 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
23 NEUBECK A, VAN GOOL L. Efficient non-maximum suppression [C]// 18th International Conference on Pattern Recognition. Hong Kong: IEEE, 2006: 850-855.
24 ZHANG Na, QI Xulei, BAO Xiaoan, et al. Single-stage object detection algorithm based on optimizing position prediction [J]. Journal of Zhejiang University (Engineering Science), 2022, 56(4): 783-794.
25 SUN G, HUA Y, HU G, et al. Mamba: multi-level aggregation via memory bank for video object detection [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2021: 2620-2627.
26 WANG H, TANG J, LIU X, et al. PTSEFormer: progressive temporal-spatial enhanced TransFormer towards video object detection [C]// Proceedings of the European Conference on Computer Vision. Tel Aviv: Springer, 2022: 732-747.
27 EFRAIMIDIS P S, SPIRAKIS P G. Weighted random sampling with a reservoir [J]. Information Processing Letters, 2006, 97(5): 181-185.
doi: 10.1016/j.ipl.2005.11.003
28 TAN M, LE Q. EfficientNet: rethinking model scaling for convolutional neural networks [C]// International Conference on Machine Learning. Long Beach: [s.n.], 2019: 6105-6114.
29 ZHENG Z, WANG P, LIU W, et al. Distance-IoU loss: faster and better learning for bounding box regression [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2020: 12993-13000.
30 KIM J, KOH J, LEE B, et al. Video object detection using object's motion context and spatio-temporal feature aggregation [C]// 25th International Conference on Pattern Recognition. Milan: IEEE, 2021: 1604-1610.
31 CAI Qiang, LI Hanyu, LI Nan, et al. Video object detection with temporal information and attention mechanism [J]. Computer Simulation, 2021, 38(12): 380-385.
doi: 10.3969/j.issn.1006-9348.2021.12.078