Journal of ZheJiang University (Engineering Science)  2020, Vol. 54 Issue (3): 529-539    DOI: 10.3785/j.issn.1008-973X.2020.03.013
Computer Technology and Image Processing     
Object detection enhanced context model
Chen-bin ZHENG1, Yong ZHANG1,*, Hang HU2, Ying-rui WU1, Guang-jing HUANG3
1. School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing 100191, China
2. Unit 66133 of PLA, Beijing 100144, China
3. School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China

Abstract  

A double-atrous convolution structure was used in the enhanced context module (ECM) of the enhanced context model to reduce the number of parameters while expanding the effective receptive field, thereby enhancing the context information of shallow layers. The ECM acted flexibly on the middle and shallow prediction layers with little damage to the original SSD, forming the enhanced context model network (ECMNet). With an input image size of 300×300, ECMNet obtained a mean average precision of 80.52% on the PASCAL VOC2007 test set and ran at 73.5 frames per second on a 1080Ti. The experimental results show that ECMNet can effectively enhance context information and achieves a better trade-off among parameters, speed, and accuracy, outperforming many state-of-the-art object detectors.
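The full ECM definition is given in the paper rather than on this page; the following PyTorch sketch only illustrates the double-atrous idea summarized above, i.e. stacking two dilated 3×3 convolutions on channel-reduced branches and fusing them with the input feature map. The channel widths, dilation rates, and fusion rule are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch of a "double-atrous" context branch in the spirit of the
# ECM described above; channel widths, dilation rates and the fusion rule are
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn


class DoubleAtrousBranch(nn.Module):
    """1x1 channel reduction followed by two stacked 3x3 atrous convolutions.

    Stacking two dilated 3x3 convolutions enlarges the receptive field with far
    fewer parameters than one large dense kernel would need.
    """

    def __init__(self, in_ch, mid_ch, dilation):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=dilation,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=dilation,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.branch(x)


class SimpleECM(nn.Module):
    """Fuse double-atrous branches with the identity to enrich context."""

    def __init__(self, channels, mid_ch=64, dilations=(2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [DoubleAtrousBranch(channels, mid_ch, d) for d in dilations])
        self.fuse = nn.Conv2d(channels + mid_ch * len(dilations), channels,
                              kernel_size=1)

    def forward(self, x):
        feats = [x] + [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))


if __name__ == "__main__":
    # e.g. a 38x38 SSD prediction layer with 512 channels
    x = torch.randn(1, 512, 38, 38)
    print(SimpleECM(512)(x).shape)  # torch.Size([1, 512, 38, 38])
```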



Key words: object detection; context information; effective receptive field; enhanced context module (ECM); one-stage object detector
Received: 01 March 2019      Published: 05 March 2020
CLC:  TP 391  
Corresponding Authors: Yong ZHANG     E-mail: 13171087@buaa.edu.cn;06952@buaa.edu.cn
Cite this article:

Chen-bin ZHENG,Yong ZHANG,Hang HU,Ying-rui WU,Guang-jing HUANG. Object detection enhanced context model. Journal of ZheJiang University (Engineering Science), 2020, 54(3): 529-539.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2020.03.013     OR     http://www.zjujournals.com/eng/Y2020/V54/I3/529


Fig.1 Network structure diagrams of the three kinds of independent modules
Fig.2 Double-atrous convolution structure and its theoretical receptive field
Fig.3 Enhanced context model network architecture
Fig.4 Example and schematic diagram of small-angle rotation transformation
Network model  5×5 independent module  $\varPsi $/10^6  $\varPhi $/%
PPMNet  No  ~30.00  79.96
PPMNet  Yes  ~30.42  79.96
ASPPMNet  No  ~30.15  80.12
ASPPMNet  Yes  ~30.61  80.20
ECMNet  No  ~29.96  80.30
ECMNet  Yes  ~30.33  80.52
Tab.1 Test results of three different independent modules on VOC2007 test set
Fig.5 Effective receptive fields on two kinds of prediction layers of five network models
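Effective receptive fields such as those in Fig.5 are commonly visualized with the gradient-based method of Luo et al. [3]: back-propagate a unit gradient from the centre activation of a feature map to the input and plot the gradient magnitude. The sketch below follows that assumption; the torchvision VGG16 backbone is used only to keep the example self-contained and is not necessarily the exact script behind Fig.5.

```python
# Sketch of the gradient-based effective-receptive-field visualization of [3];
# the torchvision VGG16 backbone here is an assumption made only to keep the
# example self-contained.
import torch
import torchvision


def effective_receptive_field(model, in_size=300, device="cpu"):
    """Return |d y_center / d x| (averaged over input channels) for a centre unit."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, in_size, in_size, device=device, requires_grad=True)
    y = model(x)                          # (1, C, H, W) feature map
    h, w = y.shape[2] // 2, y.shape[3] // 2
    # Unit gradient at the spatial centre, summed over channels.
    y[0, :, h, w].sum().backward()
    return x.grad.abs().mean(dim=1)[0]    # (in_size, in_size) magnitude map


if __name__ == "__main__":
    vgg_features = torchvision.models.vgg16().features
    erf = effective_receptive_field(vgg_features)
    print(erf.shape, float(erf.max()))
```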
$38 \times 38$  $19 \times 19$  $10 \times 10$  $5 \times 5$  $\varPhi $/%
Note: "√" indicates that the ECM acts on the feature map of the corresponding resolution; "—" indicates that it does not.
√  —  —  —  79.88
√  √  —  —  80.33
√  √  √  —  80.30
√  √  √  √  80.52
Tab.2 Test results of feature map range acted by ECM
Method  $\varPsi $/10^6  $\varPhi $/%
Note: 1) the mean average precision of 80.50% is the value reported in [1]; our own test obtained only 80.42%, see the second-to-last row of Tab.6; ECMNet (no rotation) 300 denotes the model trained without the small-angle rotation transformation.
SSD300*  ~26.29  77.51
RFB Net300  ~34.19  80.50 1)
ECMNet (no rotation) 300  ~30.33  80.29
ECMNet300  ~30.33  80.52
Tab.4 Parameters of each network and mean average precision on VOC2007 test set
5×5 independent module  Small-angle rotation transformation  $\varPhi $/%
No  No  80.17
No  Yes  80.30
Yes  No  80.29
Yes  Yes  80.52
Tab.3 Test results of small-angle rotation transformation on VOC2007 test set
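The small-angle rotation transformation evaluated in Tab.3 (and illustrated in Fig.4) is a data-augmentation step applied during training. One plausible implementation, sketched below, rotates the image by a few degrees about its centre and re-fits axis-aligned boxes to the rotated corners; the ±5° range and the border handling are assumptions, not the authors' exact recipe.

```python
# Illustrative small-angle rotation augmentation for detection training.
# The +/-5 degree range and the strategy of re-fitting axis-aligned boxes to
# the rotated corners are assumptions for this sketch, not the paper's recipe.
import random

import cv2
import numpy as np


def rotate_small_angle(image, boxes, max_deg=5.0):
    """Rotate the image and its xyxy boxes about the centre by a small random angle."""
    h, w = image.shape[:2]
    angle = random.uniform(-max_deg, max_deg)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)  # 2x3 affine
    rotated = cv2.warpAffine(image, m, (w, h), borderMode=cv2.BORDER_REPLICATE)

    new_boxes = []
    for x1, y1, x2, y2 in boxes:
        # Rotate the four corners, then take the tight axis-aligned box.
        corners = np.array([[x1, y1, 1], [x2, y1, 1], [x2, y2, 1], [x1, y2, 1]])
        pts = corners @ m.T                       # (4, 2) rotated corners
        nx1, ny1 = pts.min(axis=0)
        nx2, ny2 = pts.max(axis=0)
        new_boxes.append([max(nx1, 0), max(ny1, 0), min(nx2, w), min(ny2, h)])
    return rotated, np.array(new_boxes, dtype=np.float32)


if __name__ == "__main__":
    img = np.zeros((300, 300, 3), dtype=np.uint8)
    rotated_img, rotated_boxes = rotate_small_angle(img, [[50, 60, 120, 140]])
    print(rotated_img.shape, rotated_boxes)
```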
Method  Backbone  Framework  GPU  Number of anchors  Input size  v/(frame·s$^{-1}$)  $\varPhi $/%
Note: 1) the official versions of these models are implemented in Caffe, and their hardware and environment configurations differ from ours; for a fair comparison of detection speed, SSD and FSSD were re-implemented in PyTorch and tested in the same environment; 2) the hardware and environment configuration of this model also differs from ours, and it was likewise re-tested in the same environment.
Faster R-CNN[19] VGG16 Caffe K40 300 ~1 000×600 5.0 73.17
ION[20] VGG16 Caffe Titan X 3 000 ~1 000×600 1.3 75.55
R-FCN[21] ResNet-101 Caffe K40 300 ~1 000×600 5.9 79.51
CoupleNet[22] ResNet-101 Caffe Titan X 300 ~1 000×600 9.8 81.70
YOLOv2[14] Darknet-19 darknet Titan X — 352×352 81.0 73.70
YOLOv2[14] Darknet-19 darknet Titan X — 544×544 40.0 78.60
SSD300*[12] VGG16 Caffe Titan X 8 732 300×300 46.0 77.51
SSD300 1) VGG16 PyTorch 1080Ti 8 732 300×300 95.3 77.51
DSOD300[23] DS/64-192-48-1 Caffe Titan X 8 732 300×300 17.4 77.66
DSSD321[12] ResNet-101 Caffe Titan X 17 080 321×321 9.5 78.63
R-SSD300[5] VGG16 Caffe Titan X 8 732 300×300 35.0 78.50
FSSD300[6] VGG16 Caffe 1080Ti 11 570 300×300 65.8 78.77
FSSD300 1) VGG16 PyTorch 1080Ti 11 570 300×300 85.7 78.77
RefineDet320[24] VGG16 Caffe Titan X 6 375 320×320 40.3 79.97
RFB Net300[1] VGG16 PyTorch Titan X 11 620 300×300 83.0 80.50
RFB Net300 2) VGG16 PyTorch 1080Ti 11 620 300×300 70.0 80.42
ECMNet300 VGG16 PyTorch 1080Ti 11 620 300×300 73.5 80.52
Tab.5 Detection results of each object detector on VOC2007 test set
Method  $\varPhi $/%  aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
Note: some papers do not report complete detection results on the VOC2007 test set; 1) results obtained in this work using the weight files publicly released by the corresponding papers.
Faster R-CNN[19] 73.17 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
ION[20] 75.55 79.2 83.1 77.6 65.6 54.9 85.4 85.1 87.0 54.4 80.6 73.8 85.3 82.2 82.2 74.4 47.1 75.8 72.7 84.2 80.4
R-FCN[21] 79.51 82.5 83.7 80.3 69.0 69.2 87.5 88.4 88.4 65.4 87.3 72.1 87.9 88.3 81.3 79.8 54.1 79.6 78.8 87.1 79.5
SSD300*[12] 77.51 79.5 83.9 76.0 69.6 50.5 87.0 85.7 88.1 60.3 81.5 77.0 86.1 87.5 84.0 79.4 52.3 77.9 79.5 87.6 76.8
DSOD300 1) 77.66 80.5 85.5 76.7 70.9 51.5 87.4 87.9 87.1 61.7 79.3 77.1 83.2 87.1 85.6 80.9 48.5 78.7 80.2 86.7 76.7
DSSD321[12] 78.63 81.9 84.9 80.5 68.4 53.9 85.6 86.2 88.9 61.1 83.5 78.7 86.7 88.7 86.7 79.7 51.7 78.0 80.9 87.2 79.4
FSSD300 1) 78.77 82.3 85.8 78.2 73.6 56.8 86.3 86.4 88.1 60.3 85.8 77.7 85.3 87.7 85.4 79.9 54.1 77.9 78.7 88.4 76.7
RefineDet320[24] 79.97 83.9 85.4 81.4 75.5 60.2 86.4 88.1 89.1 62.7 83.9 77.0 85.4 87.1 86.7 82.6 55.3 82.7 78.5 88.1 79.4
RFB Net300 1) 80.42 83.7 87.6 78.9 74.8 59.8 88.8 87.5 87.9 65.0 85.0 77.1 86.1 88.4 86.6 81.7 58.1 81.5 81.2 88.4 80.2
ECMNet300 80.52 83.9 88.3 79.9 73.1 61.8 88.7 87.9 87.8 64.1 85.7 78.9 86.2 88.5 86.9 82.4 56.8 79.6 81.3 88.4 80.2
Tab.6 Complete detection results of each object detector on VOC2007 test set
Fig.6 Some detection results of ECMNet (even rows) and SSD (odd rows) on VOC2007 test set
Fig.7 More visualization results of ECMNet on VOC2007 test set
[1]   LIU S T, HUANG D, WANG Y H. Receptive field block net for accurate and fast object detection [C] // European Conference on Computer Vision. Munich: Springer, 2018: 404-418.
[2]   LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector [C] // European Conference on Computer Vision. Amsterdam: Springer, 2016: 21-37.
[3]   LUO W J, LI Y J, URTASUN R, et al. Understanding the effective receptive field in deep convolutional neural networks [C] // Neural Information Processing Systems. Barcelona: [s. n.], 2016: 4898-4906.
[4]   LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 936-944.
[5]   JEONG J, PARK H, KWAK N. Enhancement of SSD by concatenating feature maps for object detection [EB/OL]. (2017-05-26)[2019-02-26]. https://arxiv.xilesou.top/abs/1705.09587.
[6]   LI Z X, ZHOU F Q. FSSD: feature fusion single shot multibox detector [EB/OL]. (2018-05-17)[2019-02-26]. https://arxiv.org/abs/1712.00960.
[7]   SHELHAMER E, LONG J, DARRELL T Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39 (4): 640- 651
[8]   BADRINARAYANAN V, KENDALL A, CIPOLLA R Segnet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39 (12): 2481- 2495
doi: 10.1109/TPAMI.2016.2644615
[9]   ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6230-6239.
[10]   CHEN L C, PAPANDREOU G, KOKKINOS I, et al DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40 (4): 834- 848
[11]   CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. (2017-12-25)[2019-02-26]. https://arxiv.org/abs/1706.05587.
[12]   FU C, LIU W, RANGA A, et al. DSSD: deconvolutional single shot detector [EB/OL]. (2017-01-23)[2019-02-26]. https://arxiv.org/abs/1701.06659.
[13]   WANDELL B A, WINAWER J Computational neuroimaging and population receptive fields[J]. Trends in Cognitive Sciences, 2015, 19 (6): 349- 357
doi: 10.1016/j.tics.2015.03.009
[14]   REDMON J, FARHADI A. YOLO9000: better, faster, stronger [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7263-7271.
[15]   REDMON J, FARHADI A. YOLOv3: an incremental improvement [EB/OL]. (2018-04-08)[2019-02-26]. https://arxiv.org/abs/1804.02767.
[16]   HE K M, ZHANG X Y, REN S Q, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification [C] // International Conference on Computer Vision. Santiago: IEEE, 2015: 1026-1034.
[17]   EVERINGHAM M, GOOL L V, WILLIAMS C K I, et al The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88 (2): 303- 338
doi: 10.1007/s11263-009-0275-4
[18]   HUANG J, RATHOD V, SUN C, ZHU M, et al. Speed/accuracy trade-offs for modern convolutional object detectors [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 3296-3297.
[19]   REN S Q, HE K M, GIRSHICK R, et al Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39 (6): 1137- 1149
doi: 10.1109/TPAMI.2016.2577031
[20]   BELL S, ZITNICK C L, BALA K, et al. Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks [C] // Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 2874-2883.
[21]   DAI J F, LI Y, HE K M, et al. R-FCN: object detection via region-based fully convolutional networks [C] // Neural Information Processing Systems. Barcelona: [s. n.], 2016: 379-387.
[22]   ZHU Y S, ZHAO C Y, WANG J Q, et al. CoupleNet: coupling global structure with local parts for object detection [C] // International Conference on Computer Vision. Venice: IEEE, 2017: 4146-4154.
[23]   SHEN Z Q, LIU Z, LI J G, et al. DSOD: learning deeply supervised object detectors from scratch [C] // International Conference on Computer Vision. Venice: IEEE, 2017: 1937-1945.