Journal of Zhejiang University (Engineering Science)  2020, Vol. 54, Issue 3: 529-539    DOI: 10.3785/j.issn.1008-973X.2020.03.013
Computer Technology and Image Processing
Object detection enhanced context model
Chen-bin ZHENG1, Yong ZHANG1,*, Hang HU2, Ying-rui WU1, Guang-jing HUANG3
1. School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing 100191, China
2. Unit 66133 of PLA, Beijing 100144, China
3. School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
Full text: PDF (1492 KB)   HTML
Abstract:

The enhanced context module (ECM) of the enhanced context model uses a double-atrous convolution structure to enlarge the effective receptive field and thereby strengthen the context information of shallow layers while saving parameters. The ECM acts flexibly on the shallow and middle prediction layers with little disruption to the original SSD network, forming the enhanced context model network (ECMNet). With a 300×300 input image, ECMNet achieves a mean average precision (mAP) of 80.52% on the PASCAL VOC2007 test set at 73.5 frames per second on a 1080Ti. The experimental results show that ECMNet effectively enhances context information and achieves a favorable trade-off among parameter count, speed, and accuracy, outperforming many state-of-the-art object detectors.

Key words: object detection; context information; effective receptive field; enhanced context module (ECM); one-stage object detector
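The double-atrous convolution structure at the core of the ECM can be illustrated with a short PyTorch sketch. This is a hypothetical reconstruction for illustration only, not the authors' released code; the module name DoubleAtrousBlock, the channel sizes, and the dilation pair (2, 4) are all assumptions:

```python
import torch
import torch.nn as nn

class DoubleAtrousBlock(nn.Module):
    """Minimal sketch of a double-atrous branch (illustrative, not the
    paper's exact ECM): a 1x1 channel reduction followed by two stacked
    3x3 dilated convolutions, which enlarges the theoretical receptive
    field while keeping the parameter count low."""

    def __init__(self, in_ch: int, mid_ch: int, dilations=(2, 4)):
        super().__init__()
        d1, d2 = dilations
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),  # cheap channel reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=d1, dilation=d1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, in_ch, kernel_size=3, padding=d2, dilation=d2),
        )

    def forward(self, x):
        # A residual connection leaves the original shallow features intact,
        # so the block can be attached to an SSD prediction layer with
        # little disruption to the base network.
        return torch.relu(x + self.branch(x))

feat = torch.randn(1, 512, 38, 38)       # e.g. the 38x38 SSD prediction layer
out = DoubleAtrousBlock(512, 128)(feat)  # same spatial size, larger receptive field
```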
Received: 2019-03-01   Published: 2020-03-05
CLC:  TP 391  
Corresponding author: Yong ZHANG   E-mail: 13171087@buaa.edu.cn; 06952@buaa.edu.cn
About the author: ZHENG Chen-bin (1995—), male, master's student, research interests: deep learning and image processing. orcid.org/0000-0002-6413-613X. E-mail: 13171087@buaa.edu.cn

Cite this article:


Chen-bin ZHENG, Yong ZHANG, Hang HU, Ying-rui WU, Guang-jing HUANG. Object detection enhanced context model. Journal of Zhejiang University (Engineering Science), 2020, 54(3): 529-539.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2020.03.013        http://www.zjujournals.com/eng/CN/Y2020/V54/I3/529

Fig. 1  Detailed network structures of the three independent modules
Fig. 2  Double-atrous convolution structure and its theoretical receptive field
Fig. 3  Architecture of the enhanced context model network (ECMNet)
Fig. 4  Example and schematic of the small-angle rotation transformation
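Fig. 2 relates the double-atrous structure to its theoretical receptive field. For a stack of convolutions, the standard recurrence is r_out = r_in + (k_eff − 1)·j, with effective kernel k_eff = k + (k − 1)(d − 1) for dilation d and output jump j. A small sketch of this calculation (my own illustration; the dilation pair is assumed, not taken from the paper):

```python
def receptive_field(layers, r=1, jump=1):
    """Theoretical receptive field of a convolution stack.
    layers: list of (kernel, stride, dilation) tuples."""
    for k, s, d in layers:
        k_eff = k + (k - 1) * (d - 1)  # dilation inflates the effective kernel
        r += (k_eff - 1) * jump        # grow the receptive field
        jump *= s                      # spacing between adjacent outputs
    return r

# Two stacked 3x3 atrous convolutions (stride 1) with dilations 2 and 4:
print(receptive_field([(3, 1, 2), (3, 1, 4)]))  # -> 13, vs. 5 for two plain 3x3s
```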
Network model | Has 5×5 independent module | Ψ/10⁶ | Φ/%
PPMNet | no | ~30.00 | 79.96
PPMNet | yes | ~30.42 | 79.96
ASPPMNet | no | ~30.15 | 80.12
ASPPMNet | yes | ~30.61 | 80.20
ECMNet | no | ~29.96 | 80.30
ECMNet | yes | ~30.33 | 80.52
Table 1  Test results of the three independent modules on the VOC2007 test set (Ψ: parameter count; Φ: mean average precision)
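The Ψ/10⁶ column here and in Tables 4-5 reports parameter counts in millions. For a PyTorch model this is a one-line count; a sketch for reproducing such a column (the model object passed in is hypothetical):

```python
import torch.nn as nn

def params_in_millions(model: nn.Module) -> float:
    """Trainable parameter count in units of 10^6 (the psi column above)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```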
Fig. 5  Effective receptive fields of the five network models on two prediction layers
38×38 | 19×19 | 10×10 | 5×5 | Φ/%
✓ | – | – | – | 79.88
✓ | ✓ | – | – | 80.33
✓ | ✓ | ✓ | – | 80.30
✓ | ✓ | ✓ | ✓ | 80.52
Note: "✓" indicates that the ECM is applied to the feature map of the corresponding resolution; "–" indicates that it is not.
Table 2  Test results for the range of feature maps acted on by the ECM
Method | Ψ/10⁶ | Φ/%
SSD300* | ~26.29 | 77.51
RFB Net300 | ~34.19 | 80.50 1)
ECMNet (no rotation) 300 | ~30.33 | 80.29
ECMNet300 | ~30.33 | 80.52
Note: 1) the mAP of 80.50% is the value reported in [1]; our own test gives only 80.42% (see the second-to-last row of Table 6). "ECMNet (no rotation) 300" denotes ECMNet300 trained without the small-angle rotation transformation.
Table 4  Parameter count of each network and mAP on the VOC2007 test set
Has 5×5 independent module | Small-angle rotation transformation | Φ/%
– | – | 80.17
– | ✓ | 80.30
✓ | – | 80.29
✓ | ✓ | 80.52
Table 3  Test results of the small-angle rotation transformation on the VOC2007 test set
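Table 3 isolates the contribution of the small-angle rotation transformation used for data augmentation. A hedged sketch of such an augmentation (my own reconstruction; the angle range and the axis-aligned box handling are assumptions, not the paper's exact recipe): rotate the image by a small random angle and take the axis-aligned bounds of each rotated box.

```python
import numpy as np
import cv2

def small_angle_rotate(image, boxes, max_deg=5.0):
    """Rotate an image and its boxes by a small random angle.
    boxes: (N, 4) array of [x1, y1, x2, y2] in pixels."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_deg, max_deg)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)  # 2x3 affine matrix
    rotated = cv2.warpAffine(image, M, (w, h))

    new_boxes = []
    for x1, y1, x2, y2 in boxes:
        corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]],
                           dtype=np.float32)
        ones = np.ones((4, 1), dtype=np.float32)
        rot = np.hstack([corners, ones]) @ M.T        # rotate the four corners
        x_min, y_min = rot.min(axis=0)
        x_max, y_max = rot.max(axis=0)
        new_boxes.append([max(0.0, x_min), max(0.0, y_min),
                          min(float(w), x_max), min(float(h), y_max)])
    return rotated, np.asarray(new_boxes, dtype=np.float32)
```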
Method | Backbone | Framework | GPU | Number of anchor boxes | Input size | v/(frame·s⁻¹) | Φ/%
Faster R-CNN [19] | VGG16 | Caffe | K40 | 300 | ~1 000×600 | 5.0 | 73.17
ION [20] | VGG16 | Caffe | Titan X | 3 000 | ~1 000×600 | 1.3 | 75.55
R-FCN [21] | ResNet-101 | Caffe | K40 | 300 | ~1 000×600 | 5.9 | 79.51
CoupleNet [22] | ResNet-101 | Caffe | Titan X | 300 | ~1 000×600 | 9.8 | 81.70
YOLOv2 [14] | Darknet-19 | darknet | Titan X | – | 352×352 | 81.0 | 73.70
YOLOv2 [14] | Darknet-19 | darknet | Titan X | – | 544×544 | 40.0 | 78.60
SSD300* [12] | VGG16 | Caffe | Titan X | 8 732 | 300×300 | 46.0 | 77.51
SSD300 1) | VGG16 | PyTorch | 1080Ti | 8 732 | 300×300 | 95.3 | 77.51
DSOD300 [23] | DS/64-192-48-1 | Caffe | Titan X | 8 732 | 300×300 | 17.4 | 77.66
DSSD321 [12] | ResNet-101 | Caffe | Titan X | 17 080 | 321×321 | 9.5 | 78.63
R-SSD300 [5] | VGG16 | Caffe | Titan X | 8 732 | 300×300 | 35.0 | 78.50
FSSD300 [6] | VGG16 | Caffe | 1080Ti | 11 570 | 300×300 | 65.8 | 78.77
FSSD300 1) | VGG16 | PyTorch | 1080Ti | 11 570 | 300×300 | 85.7 | 78.77
RefineDet320 [24] | VGG16 | Caffe | Titan X | 6 375 | 320×320 | 40.3 | 79.97
RFB Net300 [1] | VGG16 | PyTorch | Titan X | 11 620 | 300×300 | 83.0 | 80.50
RFB Net300 2) | VGG16 | PyTorch | 1080Ti | 11 620 | 300×300 | 70.0 | 80.42
ECMNet300 | VGG16 | PyTorch | 1080Ti | 11 620 | 300×300 | 73.5 | 80.52
Note: 1) the official versions of these models are implemented in Caffe, with hardware and software environments different from ours; for a fair speed comparison, SSD and FSSD were re-implemented in PyTorch and tested in the same environment as ECMNet. 2) the hardware and environment of this model also differ from ours, so it was likewise re-tested in the same environment.
Table 5  Detection results of the object detectors on the VOC2007 test set
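The speeds (v) in Table 5 depend heavily on hardware and framework, which is why the notes re-test SSD, FSSD, and RFB Net in one environment. A minimal timing sketch for a PyTorch detector, with warm-up and torch.cuda.synchronize() so queued GPU kernels do not distort the measurement (the model object is hypothetical):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, size=300, iters=200, warmup=20):
    """Frames per second for batch-1 inference on the current GPU."""
    model.eval().cuda()
    x = torch.randn(1, 3, size, size, device="cuda")
    for _ in range(warmup):       # warm-up: let cuDNN autotuning settle
        model(x)
    torch.cuda.synchronize()      # flush queued kernels before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - t0)
```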
Method Φ/% aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
Faster R-CNN [19] 73.17 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
ION [20] 75.55 79.2 83.1 77.6 65.6 54.9 85.4 85.1 87.0 54.4 80.6 73.8 85.3 82.2 82.2 74.4 47.1 75.8 72.7 84.2 80.4
R-FCN [21] 79.51 82.5 83.7 80.3 69.0 69.2 87.5 88.4 88.4 65.4 87.3 72.1 87.9 88.3 81.3 79.8 54.1 79.6 78.8 87.1 79.5
SSD300* [12] 77.51 79.5 83.9 76.0 69.6 50.5 87.0 85.7 88.1 60.3 81.5 77.0 86.1 87.5 84.0 79.4 52.3 77.9 79.5 87.6 76.8
DSOD300 1) 77.66 80.5 85.5 76.7 70.9 51.5 87.4 87.9 87.1 61.7 79.3 77.1 83.2 87.1 85.6 80.9 48.5 78.7 80.2 86.7 76.7
DSSD321 [12] 78.63 81.9 84.9 80.5 68.4 53.9 85.6 86.2 88.9 61.1 83.5 78.7 86.7 88.7 86.7 79.7 51.7 78.0 80.9 87.2 79.4
FSSD300 1) 78.77 82.3 85.8 78.2 73.6 56.8 86.3 86.4 88.1 60.3 85.8 77.7 85.3 87.7 85.4 79.9 54.1 77.9 78.7 88.4 76.7
RefineDet320 [24] 79.97 83.9 85.4 81.4 75.5 60.2 86.4 88.1 89.1 62.7 83.9 77.0 85.4 87.1 86.7 82.6 55.3 82.7 78.5 88.1 79.4
RFB Net300 1) 80.42 83.7 87.6 78.9 74.8 59.8 88.8 87.5 87.9 65.0 85.0 77.1 86.1 88.4 86.6 81.7 58.1 81.5 81.2 88.4 80.2
ECMNet300 80.52 83.9 88.3 79.9 73.1 61.8 88.7 87.9 87.8 64.1 85.7 78.9 86.2 88.5 86.9 82.4 56.8 79.6 81.3 88.4 80.2
Note: some papers do not report complete per-class results on the VOC2007 test set; 1) results obtained in this work using the weight files publicly released for the corresponding paper.
Table 6  Complete detection results of the object detectors on the VOC2007 test set
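The per-class figures in Table 6 are PASCAL VOC2007 average precisions, conventionally computed with 11-point interpolation. A sketch of that metric given per-class precision/recall arrays (the IoU matching that produces those arrays is omitted):

```python
import numpy as np

def voc07_ap(recall, precision):
    """11-point interpolated AP as used for PASCAL VOC2007.
    recall, precision: arrays ordered by descending detection score."""
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):   # recall thresholds 0, 0.1, ..., 1.0
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0  # interpolated precision
        ap += p / 11.0
    return ap
```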
Fig. 6  Sample detection results of ECMNet (even rows) and SSD (odd rows) on the VOC2007 test set
Fig. 7  Further visualization results of ECMNet on the VOC2007 test set
1 LIU S T, HUANG D, WANG Y H. Receptive field block net for accurate and fast object detection [C] // European Conference on Computer Vision. Munich: Springer, 2018: 404-418.
2 LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector [C] // European Conference on Computer Vision. Amsterdam: Springer, 2016: 21-37.
3 LUO W J, LI Y J, URTASUN R, et al. Understanding the effective receptive field in deep convolutional neural networks [C] // Neural Information Processing Systems. Barcelona: [s. n.], 2016: 4898-4906.
4 LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 936-944.
5 JEONG J, PARK H, KWAK N. Enhancement of SSD by concatenating feature maps for object detection [EB/OL]. (2017-05-26)[2019-02-26]. https://arxiv.xilesou.top/abs/1705.09587.
6 LI Z X, ZHOU F Q. FSSD: feature fusion single shot multibox detector [EB/OL]. (2018-05-17)[2019-02-26]. https://arxiv.org/abs/1712.00960.
7 SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651
8 BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. doi: 10.1109/TPAMI.2016.2644615
9 ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6230-6239.
10 CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848
11 CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation [EB/OL]. (2017-12-25)[2019-02-26]. https://arxiv.org/abs/1706.05587.
12 FU C, LIU W, RANGA A, et al. DSSD: deconvolutional single shot detector [EB/OL]. (2017-01-23)[2019-02-26]. https://arxiv.org/abs/1701.06659.
13 WANDELL B A, WINAWER J. Computational neuroimaging and population receptive fields [J]. Trends in Cognitive Sciences, 2015, 19(6): 349-357. doi: 10.1016/j.tics.2015.03.009
14 REDMON J, FARHADI A. YOLO9000: better, faster, stronger [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7263-7271.
15 REDMON J, FARHADI A. YOLOv3: an incremental improvement [EB/OL]. (2018-04-08)[2019-02-26]. https://arxiv.org/abs/1804.02767.
16 HE K M, ZHANG X Y, REN S Q, et al. Delving deep into rectifiers: surpassing human-level performance on imagenet classification [C] // International Conference on Computer Vision. Santiago: IEEE, 2015: 1026-1034.
17 EVERINGHAM M, GOOL L V, WILLIAMS C K I, et al. The PASCAL visual object classes (VOC) challenge [J]. International Journal of Computer Vision, 2010, 88(2): 303-338. doi: 10.1007/s11263-009-0275-4
18 HUANG J, RATHOD V, SUN C, et al. Speed/accuracy trade-offs for modern convolutional object detectors [C] // Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 3296-3297.
19 REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
20 BELL S, ZITNICK C L, BALA K, et al. Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks [C] // Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 2874-2883.
21 DAI J F, LI Y, HE K M, et al. R-FCN: object detection via region-based fully convolutional networks [C] // Neural Information Processing Systems. Barcelona: [s. n.], 2016: 379-387.
22 ZHU Y S, ZHAO C Y, WANG J Q, et al. CoupleNet: coupling global structure with local parts for object detection [C] // International Conference on Computer Vision. Venice: IEEE, 2017: 4146-4154.
23 SHEN Z Q, LIU Z, LI J G, et al. DSOD: learning deeply supervised object detectors from scratch [C] // International Conference on Computer Vision. Venice: IEEE, 2017: 1937-1945.