A method based on an attention mechanism and multi-scale information fusion was proposed to resolve the problem of the low accuracy of the traditional single shot multibox detector (SSD) algorithm in detecting small targets, and the method was applied to the vehicle detection task. The feature maps of the target detection branches were fused with five branches and two branches respectively, combining the advantages of the shallow feature maps and the deep feature maps. An attention mechanism module was added between the basic network layers to make the model focus on the channels containing more information. Experimental results showed that the mean average precision (mAP) on the self-built vehicle dataset reached 90.2%, which was 10.0% higher than that of the traditional SSD algorithm, and the detection accuracy for small objects was improved by 17.9%. The mAP on the PASCAL VOC 2012 dataset was 83.1%, which was 6.4% higher than that of the mainstream YOLOv5 algorithm. The detection speed of the proposed algorithm on a GTX 1660 Ti PC reached 25 frames/s, which satisfied the demand for real-time performance.
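The channel-attention step described above follows the squeeze-and-excitation (SE) pattern: globally pool each channel, pass the pooled vector through a bottleneck of two fully connected layers, and use the resulting sigmoid gates to reweight the channels. The sketch below is a minimal NumPy illustration of that pattern, not the paper's implementation; the channel count, reduction ratio, and random weights are assumptions for demonstration only.

```python
import numpy as np

def se_block(fm, w1, w2):
    """Minimal squeeze-and-excitation over a feature map of shape (C, H, W)."""
    # Squeeze: global average pooling yields one descriptor per channel.
    z = fm.mean(axis=(1, 2))                      # (C,)
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gate.
    s = np.maximum(w1 @ z, 0.0)                   # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))           # (C,), each gate in (0, 1)
    # Scale: channels with larger gates keep more of their activation,
    # so the network attends to the more informative channels.
    return fm * s[:, None, None]

rng = np.random.default_rng(0)
C, r = 64, 16                                     # channels, reduction ratio (assumed)
fm = rng.standard_normal((C, 8, 8))               # stand-in feature map
w1 = rng.standard_normal((C // r, C)) * 0.1       # squeeze FC weights (assumed)
w2 = rng.standard_normal((C, C // r)) * 0.1       # excitation FC weights (assumed)
out = se_block(fm, w1, w2)
print(out.shape)                                  # (64, 8, 8): shape is preserved
```

Because the gates lie strictly between 0 and 1, the block only rescales channels; it never changes the spatial resolution, which is why it can be inserted between existing backbone layers.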
Kai LI, Yu-shun LIN, Xiao-lin WU, Fei-yu LIAO. Small target vehicle detection based on multi-scale fusion technology and attention mechanism. Journal of Zhejiang University (Engineering Science), 2022, 56(11): 2241-2250.
Fig. 1 Network structure of the small target detection method based on the SE module and multi-scale feature fusion technology
Fig. 2 Attention mechanism module based on SENet
Fig. 3 Conv7~Conv11_2 feature fusion
Fig. 4 Conv4_3 feature fusion
Fig. 5 Sample images of the self-built vehicle dataset
Fig. 6 Schematic diagram of target detection
| Method | mAP/% | AP/% (small) | AP/% (medium) | AP/% (large) |
| --- | --- | --- | --- | --- |
| CenterNet | 73.5 | 52.3 | 82.1 | 86.1 |
| SSD | 78.5 | 65.6 | 88.2 | 81.7 |
| YOLOv4 | 79.3 | 67.2 | 83.2 | 87.5 |
| YOLOv5 | 85.7 | 74.8 | 86.8 | 95.5 |
| OURS | 90.2 | 83.5 | 91.2 | 95.9 |

Tab. 1 Test results of each method on the self-built vehicle dataset
| Method | mAP/% | AP/% (small) | AP/% (medium) | AP/% (large) | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| SSD+FPN | 85.1 | 74.6 | 87.3 | 93.4 | 32.0 |
| SSD+MSIF | 88.5 | 81.2 | 90.1 | 94.2 | 28.0 |
| SSD+MSIF+SE | 90.2 | 83.5 | 91.2 | 95.9 | 25.0 |

Tab. 2 Comparison of test performance of various methods on the self-built vehicle dataset
| Fusion method | mAP/% | v/(frame·s⁻¹) |
| --- | --- | --- |
| 3-branch fusion | 86.4 | 31 |
| 4-branch fusion | 87.2 | 30 |
| 5-branch fusion | 88.5 | 28 |
| 6-branch fusion | 88.6 | 25 |

Tab. 3 Performance comparison of different fusion methods at the Conv4_3 layer
Fig. 7 Comparison of SSD and proposed method test results
| Model | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FasterRcnn | 84.9 | 79.8 | 79.8 | 74.3 | 53.9 | 77.5 | 75.9 | 88.5 | 45.6 | 77.1 |
| CenterNet | 81.0 | 75.0 | 66.0 | 52.0 | 43.0 | 78.0 | 80.0 | 87.0 | 59.0 | 72.0 |
| SSD | 83.1 | 84.7 | 74.0 | 69.6 | 49.5 | 85.4 | 86.2 | 85.2 | 60.4 | 81.5 |
| RP-SSD | 88.0 | 83.8 | 74.8 | 73.2 | 48.9 | 83.9 | 86.8 | 91.0 | 63.2 | 81.9 |
| DSSD | 83.6 | 85.2 | 74.5 | 70.1 | 50.4 | 85.6 | 86.7 | 85.6 | 61.0 | 82.1 |
| FSSD | 84.9 | 86.4 | 74.8 | 63.3 | 50.6 | 84.6 | 87.9 | 86.9 | 63.1 | 83.2 |
| YOLOv4 | 83.6 | 84.0 | 73.8 | 59.2 | 72.2 | 91.0 | 90.0 | 70.7 | 60.9 | 64.9 |
| YOLOv5 | 84.2 | 87.6 | 65.9 | 63.3 | 77.0 | 80.2 | 91.5 | 83.7 | 66.5 | 66.4 |
| OURS | 89.8 | 89.8 | 85.4 | 75.5 | 61.5 | 82.5 | 87.5 | 90.5 | 73.9 | 95.6 |

| Model | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FasterRcnn | 55.3 | 86.9 | 81.7 | 80.9 | 79.6 | 40.1 | 72.6 | 60.9 | 81.2 | 61.5 |
| CenterNet | 54.0 | 81.0 | 70.0 | 68.0 | 74.0 | 41.0 | 71.0 | 58.0 | 82.0 | 70.0 |
| SSD | 75.1 | 82.0 | 85.9 | 85.3 | 77.7 | 49.6 | 76.1 | 80.0 | 87.4 | 74.4 |
| RP-SSD | 76.3 | 81.2 | 85.3 | 84.6 | 79.3 | 63.5 | 78.9 | 83.4 | 87.9 | 73.9 |
| DSSD | 75.4 | 82.5 | 86.2 | 85.4 | 78.6 | 51.2 | 75.9 | 80.5 | 86.7 | 75.1 |
| FSSD | 76.8 | 83.1 | 85.0 | 83.2 | 77.3 | 57.9 | 78.4 | 82.1 | 86.5 | 73.2 |
| YOLOv4 | 67.3 | 89.6 | 77.4 | 65.2 | 86.0 | 47.7 | 77.4 | 72.3 | 82.6 | 83.3 |
| YOLOv5 | 59.8 | 82.8 | 86.6 | 83.1 | 85.4 | 56.4 | 70.3 | 62.9 | 87.9 | 90.8 |
| OURS | 78.4 | 90.7 | 89.5 | 82.1 | 75.6 | 63.1 | 81.5 | 93.9 | 89.7 | 85.9 |

Tab. 4 AP values (%) of various methods on the PASCAL VOC dataset
Fig. 8 Comparison of detection results of the SSD algorithm and the proposed method on small targets
| Method | GPU model | Backbone network | mAP/% | v/(frame·s⁻¹) |
| --- | --- | --- | --- | --- |
| Faster Rcnn | Titan X | VGG-16 | 70.4 | 7.0 |
| YOLOv4 | 1660 Ti | CSPDarknet53 | 75.0 | 35.0 |
| YOLOv5 | 1660 Ti | FOCUS+CSP | 76.6 | 38.0 |
| SSD | Titan X | VGG-16 | 75.6 | 46.0 |
| RP-SSD | 1080 Ti | VGG-16 | 78.4 | 32.0 |
| OURS | 1660 Ti | VGG-16 | 83.1 | 25.0 |

Tab. 5 Comparison of detection speed of various methods