Journal of Zhejiang University (Engineering Science)  2020, Vol. 54 Issue (1): 126-134    DOI: 10.3785/j.issn.1008-973X.2020.01.015
Computer Technology, Information Engineering
Image captioning based on global-local feature and adaptive-attention
Xiao-hu ZHAO1,2, Liang-fei YIN1,2, Cheng-long ZHAO1,2
1. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, China University of Mining and Technology, Xuzhou 221008, China
2. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
Abstract:

An image captioning algorithm was proposed in order to bridge the gap between low-level image visual features and high-level semantic concepts. The algorithm can determine the image focus, mine higher-level semantic information, and refine the details of the generated sentences. Local features were added during image visual feature extraction, and the global-local feature of the input image, combining global and local features, served as the visual input. Thus the focus of the image at different time steps was determined, and more details of the image were captured. An attention mechanism was added to weight the image features during decoding, so that the dependence of the output word at the current time step on visual versus semantic information could be adaptively adjusted, effectively improving image captioning performance. Experimental results show that the proposed method achieves more competitive captioning results than other image captioning algorithms: it describes the input image more accurately and comprehensively, and its recognition accuracy for tiny objects is higher.

Key words: image captioning    image focus    higher-level semantic information    description detail    global-local feature extraction    adaptive-attention mechanism
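The adaptive weighting between visual and semantic information described in the abstract follows the visual-sentinel idea: at each decoding step the model scores the region features and a "sentinel" (language-context) vector against the decoder hidden state, and the sentinel's share of the attention mass decides how much the next word relies on language rather than vision. A minimal NumPy sketch of that gating (the weight matrices, dimensions, and function name here are illustrative assumptions, not the paper's actual parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_attention(regions, sentinel, hidden, W_v, W_s, W_h, w_a):
    """Blend visual region features with a visual sentinel.

    regions : (k, d) local region features
    sentinel: (d,)   language-context (sentinel) vector
    hidden  : (m,)   decoder hidden state
    Returns the adaptive context vector, the (k+1) attention
    weights, and beta, the sentinel's weight.
    """
    # attention scores for the k regions
    z = np.tanh(regions @ W_v + hidden @ W_h) @ w_a      # (k,)
    # one extra score for the sentinel
    z_s = np.tanh(sentinel @ W_s + hidden @ W_h) @ w_a   # scalar
    alpha = softmax(np.append(z, z_s))                   # (k+1,), sums to 1
    beta = alpha[-1]                                     # reliance on language
    c_visual = alpha[:-1] @ regions                      # weighted visual context
    context = beta * sentinel + c_visual                 # adaptive context
    return context, alpha, beta
```

A word like "sitting" should push beta toward 1 (language suffices), while an object word like "frisbee" should keep beta near 0 so the visual context dominates.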
Received: 2019-04-29    Published: 2020-01-05
CLC:  TP 391  
Supported by: National Key Research and Development Program of China (2017YFC0804400)
About the author: ZHAO Xiao-hu (1976—), male, professor, engaged in research on the mine Internet of Things and intelligent computing. orcid.org/0000-0002-7352-103X. E-mail: 525815788@qq.com

Cite this article:

Xiao-hu ZHAO, Liang-fei YIN, Cheng-long ZHAO. Image captioning based on global-local feature and adaptive-attention. Journal of Zhejiang University (Engineering Science), 2020, 54(1): 126-134.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2020.01.015        http://www.zjujournals.com/eng/CN/Y2020/V54/I1/126

Fig. 1  Structure of the CNN-LSTM based image captioning model
Fig. 2  Structure of Faster R-CNN
Fig. 3  Structure of RPN
Fig. 4  Local feature extraction
Fig. 5  Schematic of R-CNN
Fig. 6  Global-local feature extraction
Fig. 7  Traditional attention mechanism vs. adaptive attention mechanism
Fig. 8  Details of the adaptive attention mechanism
Fig. 9  Overall model structure
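The global-local extraction shown in Fig. 6 combines a whole-image CNN feature with the Faster R-CNN region features from Figs. 2-4. One plausible fusion, sketched below, is to pool the per-region vectors and concatenate them with the global vector; the mean-pooling and concatenation here are assumptions for illustration, not necessarily the paper's exact operator:

```python
import numpy as np

def fuse_global_local(global_feat, region_feats):
    """Fuse a global CNN feature (d_g,) with k local region
    features (k, d_l) into one visual input vector.
    """
    local_feat = region_feats.mean(axis=0)          # pool k regions -> (d_l,)
    return np.concatenate([global_feat, local_feat])  # (d_g + d_l,)
```

The global half keeps scene-level context while the pooled half preserves detail from detected objects, which is what lets the decoder attend to small objects the global feature alone would wash out.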
| Dataset | Language | Size (images) |
| --- | --- | --- |
| Flickr30k | English | 31 783 |
| MS COCO | English | 123 000 |

Table 1  The Flickr30k and MS COCO experimental datasets
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NIC | 66.6 | 46.1 | 32.9 | 24.6 | — | — | — |
| MS Captivator | 71.5 | 54.3 | 40.7 | 30.8 | 24.8 | 52.6 | 93.1 |
| m-RNN | 67 | 45 | 35 | 25 | — | — | — |
| LRCN | 62.79 | 44.19 | 30.41 | 21 | — | — | — |
| MSR | 73.0 | 56.5 | 42.9 | 32.5 | 25.1 | — | 98.6 |
| ATT-EK | 74.0 | 56.0 | 42.0 | 31.0 | 26.0 | — | — |
| Soft-attention | 70.7 | 49.2 | 34.4 | 24.3 | 23.9 | — | — |
| Hard-attention | 71.8 | 50.4 | 35.7 | 25.0 | 23.0 | — | — |
| ATT-FCN | 70.9 | 53.7 | 40.2 | 30.4 | 24.3 | — | — |
| Aligning-ATT | 69.7 | 51.9 | 38.1 | 28.20 | 23.5 | 50.9 | 83.8 |
| ERD | — | — | — | 29.0 | 23.7 | — | 88.6 |
| Areas-ATT | — | — | — | 30.7 | 24.5 | — | 93.8 |
| Ours | 74.0 | 60.1 | 43.9 | 35.2 | 27.5 | 52.4 | 98.7 |

Table 2  Performance comparison on the Microsoft COCO dataset
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR |
| --- | --- | --- | --- | --- | --- |
| NIC | 66.6 | 42.3 | 27.7 | 18.3 | — |
| Soft-attention | 66.7 | 43.4 | 28.8 | 19.1 | 18.49 |
| Hard-attention | 44.9 | 43.9 | 28.6 | 19.9 | 18.46 |
| ATT-FCN | 64.7 | 46.0 | 32.4 | 23.0 | 18.9 |
| Ours | 68.1 | 48.1 | 32.7 | 25.7 | 18.9 |

Table 3  Performance comparison on the Flickr30k dataset
Fig. 10  Comparison of image captioning results
1 FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images [C] // European Conference on Computer Vision. Heraklion: Springer, 2010: 15-29.
2 MAO J, XU W, YANG Y, et al. Deep captioning with multimodal recurrent neural networks (m-RNN) [EB/OL]. [2014-12-20]. https://arxiv.org/abs/1412.6632.
3 VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C] // IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3156-3164.
4 WU Q, SHEN C, LIU L, et al. What value do explicit high level concepts have in vision to language problems? [J]. Computer Science, 2016, 12 (1): 1640-1649.
5 ZHOU L, XU C, KOCH P, et al. Watch what you just said: image captioning with text-conditional attention [C] // Proceedings of the Thematic Workshops of ACM Multimedia. [S.l.]: Association for Computing Machinery, 2017: 305-313.
6 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C] // IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1179-1195.
7 SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [J]. Computer Science, 2014, 32 (2): 67-85.
8 REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39 (6): 1137-1149.
9 XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C] // International Conference on Machine Learning. Lille: IMLS, 2015: 2048-2057.
10 FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back [C] // IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1473-1482.
11 DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 39 (4): 677-691.
12 WU Q, SHEN C, WANG P, et al. Image captioning and visual question answering based on attributes and external knowledge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40 (6): 1367-1381.
13 YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes [C] // IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 4904-4912.
14 YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention [C] // IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 4651-4659.
15 JIN J, FU K, CUI R, et al. Aligning where to see and what to tell: image caption with region-based attention and scene factorization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39 (12): 2321-2334.
16 YANG Z, YUAN Y, WU Y, et al. Encode, review, and decode: reviewer module for caption generation [C] // International Conference on Neural Information Processing Systems. Barcelona: [s. n.], 2016.