Journal of Zhejiang University (Engineering Science)  2020, Vol. 54 Issue (1): 126-134    DOI: 10.3785/j.issn.1008-973X.2020.01.015
Computer Technology, Information Engineering
Image captioning based on global-local feature and adaptive-attention
Xiao-hu ZHAO1,2, Liang-fei YIN1,2, Cheng-long ZHAO1,2
1. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, China University of Mining and Technology, Xuzhou 221008, China
2. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
Abstract:

An image captioning algorithm was proposed in order to bridge the gap between low-level image visual features and high-level semantic concepts. The algorithm can determine the image focus, mine higher-level semantic information, and refine the details of the generated sentences. Local features were added during image visual feature extraction, and the global-local feature of the input image, combining global and local features, served as the visual input. Thus the focus of the image at different time steps was determined, and more details of the image were captured. An attention mechanism was added to weight the image features during decoding, so that the dependence of the output word at the current time step on visual versus semantic information could be adaptively adjusted, effectively improving image captioning performance. Experimental results show that the proposed method achieves more competitive captioning results than other image captioning algorithms: it describes the input image more accurately and comprehensively, and its recognition accuracy for tiny objects is higher.

Key words: image captioning    image focus    higher-level semantic information    description detail    global-local feature extraction    adaptive-attention mechanism
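The adaptive weighting between visual and semantic information described in the abstract follows the visual-sentinel idea: at each decoding step the model scores the region features and a "sentinel" (language-context) vector against the decoder hidden state, and the sentinel's share of the attention mass decides how much the next word relies on language rather than vision. A minimal NumPy sketch of that gating (the weight matrices, dimensions, and function name here are illustrative assumptions, not the paper's actual parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_attention(regions, sentinel, hidden, W_v, W_s, W_h, w_a):
    """Blend visual region features with a visual sentinel.

    regions : (k, d) local region features
    sentinel: (d,)   language-context (sentinel) vector
    hidden  : (m,)   decoder hidden state
    Returns the adaptive context vector, the (k+1) attention
    weights, and beta, the sentinel's weight.
    """
    # attention scores for the k regions
    z = np.tanh(regions @ W_v + hidden @ W_h) @ w_a      # (k,)
    # one extra score for the sentinel
    z_s = np.tanh(sentinel @ W_s + hidden @ W_h) @ w_a   # scalar
    alpha = softmax(np.append(z, z_s))                   # (k+1,), sums to 1
    beta = alpha[-1]                                     # reliance on language
    c_visual = alpha[:-1] @ regions                      # weighted visual context
    context = beta * sentinel + c_visual                 # adaptive context
    return context, alpha, beta
```

A word like "sitting" should push beta toward 1 (language suffices), while an object word like "frisbee" should keep beta near 0 so the visual context dominates.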
Received: 2019-04-29    Published: 2020-01-05
CLC:  TP 391  
Supported by: National Key Research and Development Program of China (2017YFC0804400)
About the author: ZHAO Xiao-hu (1976—), male, professor, engaged in research on the mine Internet of Things and intelligent computing. orcid.org/0000-0002-7352-103X. E-mail: 525815788@qq.com

Cite this article:

Xiao-hu ZHAO, Liang-fei YIN, Cheng-long ZHAO. Image captioning based on global-local feature and adaptive-attention. Journal of Zhejiang University (Engineering Science), 2020, 54(1): 126-134.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2020.01.015        http://www.zjujournals.com/eng/CN/Y2020/V54/I1/126

Fig. 1  Structure of the CNN-LSTM based image captioning model
Fig. 2  Structure of Faster R-CNN
Fig. 3  Structure of RPN
Fig. 4  Local feature extraction
Fig. 5  Schematic of R-CNN
Fig. 6  Global-local feature extraction
Fig. 7  Traditional attention mechanism vs. adaptive attention mechanism
Fig. 8  Details of the adaptive attention mechanism
Fig. 9  Overall model structure
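The global-local extraction shown in Fig. 6 combines a whole-image CNN feature with the Faster R-CNN region features from Figs. 2-4. One plausible fusion, sketched below, is to pool the per-region vectors and concatenate them with the global vector; the mean-pooling and concatenation here are assumptions for illustration, not necessarily the paper's exact operator:

```python
import numpy as np

def fuse_global_local(global_feat, region_feats):
    """Fuse a global CNN feature (d_g,) with k local region
    features (k, d_l) into one visual input vector.
    """
    local_feat = region_feats.mean(axis=0)          # pool k regions -> (d_l,)
    return np.concatenate([global_feat, local_feat])  # (d_g + d_l,)
```

The global half keeps scene-level context while the pooled half preserves detail from detected objects, which is what lets the decoder attend to small objects the global feature alone would wash out.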
| Dataset | Language | Size (images) |
| --- | --- | --- |
| Flickr30k | English | 31 783 |
| MS COCO | English | 123 000 |

Table 1  The Flickr30k and MS COCO experimental datasets
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NIC | 66.6 | 46.1 | 32.9 | 24.6 | — | — | — |
| MS Captivator | 71.5 | 54.3 | 40.7 | 30.8 | 24.8 | 52.6 | 93.1 |
| m-RNN | 67 | 45 | 35 | 25 | — | — | — |
| LRCN | 62.79 | 44.19 | 30.41 | 21 | — | — | — |
| MSR | 73.0 | 56.5 | 42.9 | 32.5 | 25.1 | — | 98.6 |
| ATT-EK | 74.0 | 56.0 | 42.0 | 31.0 | 26.0 | — | — |
| Soft-attention | 70.7 | 49.2 | 34.4 | 24.3 | 23.9 | — | — |
| Hard-attention | 71.8 | 50.4 | 35.7 | 25.0 | 23.0 | — | — |
| ATT-FCN | 70.9 | 53.7 | 40.2 | 30.4 | 24.3 | — | — |
| Aligning-ATT | 69.7 | 51.9 | 38.1 | 28.20 | 23.5 | 50.9 | 83.8 |
| ERD | — | — | — | 29.0 | 23.7 | — | 88.6 |
| Areas-ATT | — | — | — | 30.7 | 24.5 | — | 93.8 |
| Ours | 74.0 | 60.1 | 43.9 | 35.2 | 27.5 | 52.4 | 98.7 |

Table 2  Performance comparison on the Microsoft COCO dataset
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR |
| --- | --- | --- | --- | --- | --- |
| NIC | 66.6 | 42.3 | 27.7 | 18.3 | — |
| Soft-attention | 66.7 | 43.4 | 28.8 | 19.1 | 18.49 |
| Hard-attention | 44.9 | 43.9 | 28.6 | 19.9 | 18.46 |
| ATT-FCN | 64.7 | 46.0 | 32.4 | 23.0 | 18.9 |
| Ours | 68.1 | 48.1 | 32.7 | 25.7 | 18.9 |

Table 3  Performance comparison on the Flickr30k dataset
Fig. 10  Comparison of image captioning results
1 FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images [C] // European Conference on Computer Vision. Heraklion: Springer, 2010: 15-29.
2 MAO J, XU W, YANG Y, et al. Deep captioning with multimodal recurrent neural networks (m-RNN) [EB/OL]. [2014-12-20]. https://arxiv.org/abs/1412.6632.
3 VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C] // IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3156-3164.
4 WU Q, SHEN C, LIU L, et al. What value do explicit high level concepts have in vision to language problems? [J]. Computer Science, 2016, 12 (1): 1640-1649.
5 ZHOU L, XU C, KOCH P, et al. Watch what you just said: image captioning with text-conditional attention [C] // Proceedings of the Thematic Workshops of ACM Multimedia. [S.l.]: Association for Computing Machinery, 2017: 305-313.
6 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C] // IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1179-1195.
7 SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [J]. Computer Science, 2014, 32 (2): 67-85.
8 REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39 (6): 1137-1149.
9 XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C] // International Conference on Machine Learning. Lille: IMLS, 2015: 2048-2057.
10 FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back [C] // IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1473-1482.
11 DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 39 (4): 677-691.
12 WU Q, SHEN C, WANG P, et al. Image captioning and visual question answering based on attributes and external knowledge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40 (6): 1367-1381.
13 YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes [C] // IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 4904-4912.
14 YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention [C] // IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 4651-4659.
15 JIN J, FU K, CUI R, et al. Aligning where to see and what to tell: image caption with region-based attention and scene factorization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39 (12): 2321-2334.
16 YANG Z, YUAN Y, WU Y, et al. Encode, review, and decode: reviewer module for caption generation [C] // International Conference on Neural Information Processing Systems. Barcelona: [s. n.], 2016.