Journal of Zhejiang University (Engineering Science)  2022, Vol. 56, Issue 3: 542-549    DOI: 10.3785/j.issn.1008-973X.2022.03.013
Computer and Control Engineering
Image caption based on relational reasoning and context gate mechanism
Qiao-hong CHEN, Hao-lei PEI, Qi SUN
School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
Abstract:

A visual relationship reasoning module was proposed in order to explore the modeling of, and reasoning about, the relationships between visual regions required for image scene understanding. Based on the different semantic and spatial context information in an image, the module dynamically encodes the relationship patterns between related visual objects and infers the semantic features most relevant to the relationship word currently being generated. A context gate mechanism was introduced to dynamically weigh the contributions of the visual attention module and the visual relationship reasoning module according to the type of word being generated. Experimental results show that the proposed method outperforms previous attention-based image captioning methods; the proposed modules can dynamically model and reason about the features most relevant to each type of generated word, and describe the object relationships in the input image more accurately.

Key words: image caption    visual relationship reasoning    multimodal encoding    context gate mechanism    attention mechanism
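
As a concrete illustration of the mechanism the abstract describes, the sketch below encodes pairwise relation patterns between detected regions from appearance and spatial cues, then attends over the pairs with the current decoder state. It is a minimal sketch under assumed names and dimensions (RelationReasoning, feat_dim, geo_dim, hid_dim are all hypothetical), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationReasoning(nn.Module):
    def __init__(self, feat_dim=2048, geo_dim=8, hid_dim=512):
        super().__init__()
        # fuse two region appearances plus their relative geometry into a relation pattern
        self.pair_enc = nn.Linear(2 * feat_dim + geo_dim, hid_dim)
        self.query = nn.Linear(hid_dim, hid_dim)   # decoder state -> attention query
        self.score = nn.Linear(hid_dim, 1)         # additive attention score

    def forward(self, regions, geometry, h_t):
        # regions:  (N, feat_dim)   region appearance features
        # geometry: (N, N, geo_dim) pairwise spatial features (e.g. box offsets)
        # h_t:      (hid_dim,)      current decoder hidden state
        n = regions.size(0)
        a = regions.unsqueeze(1).expand(n, n, -1)  # subject features
        b = regions.unsqueeze(0).expand(n, n, -1)  # object features
        pairs = torch.tanh(self.pair_enc(torch.cat([a, b, geometry], dim=-1)))
        # score every relation pattern against the current word context
        logits = self.score(torch.tanh(pairs + self.query(h_t))).squeeze(-1)
        alpha = F.softmax(logits.view(-1), dim=0).view(n, n)
        # relation context vector for the word being generated
        return (alpha.unsqueeze(-1) * pairs).sum(dim=(0, 1))
```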
Received: 2021-04-25    Published: 2022-03-29
CLC:  TP 181  
About the author: CHEN Qiao-hong (1978—), female, associate professor, engaged in research on computer-aided design and machine learning. orcid.org/0000-0003-0595-341X. E-mail: chen_lisa@zstu.edu.cn
Cite this article:

Qiao-hong CHEN, Hao-lei PEI, Qi SUN. Image caption based on relational reasoning and context gate mechanism. Journal of Zhejiang University (Engineering Science), 2022, 56(3): 542-549.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2022.03.013        https://www.zjujournals.com/eng/CN/Y2022/V56/I3/542

Fig. 1  Overall architecture of the image captioning model based on visual relationship reasoning and the context gate mechanism
Fig. 2  Flowchart of the visual relationship reasoning module
Fusion method     BLEU-1   BLEU-4   METEOR   ROUGE   CIDEr
Addition           75.8     35.2     27.1     56.2    113.3
Concatenation      76.3     35.7     27.7     56.6    115.5
Gate mechanism     77.4     36.9     28.1     57.3    118.7
Table 1  Ablation study of the context gate mechanism
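
The three fusion variants in Table 1 can be read as: addition computes c_att + c_rel, concatenation passes [c_att; c_rel] through a linear layer, and the gate mechanism learns a per-dimension blend conditioned on the decoder state. Below is a minimal sketch of the gated variant; the class name and dimensions are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Gated fusion of the attention context c_att and the relation
    context c_rel, conditioned on the decoder hidden state h_t."""
    def __init__(self, hid_dim=512):
        super().__init__()
        self.gate = nn.Linear(3 * hid_dim, hid_dim)

    def forward(self, c_att, c_rel, h_t):
        # g close to 1 favours the visual attention context (object words);
        # g close to 0 favours the relation context (relationship words).
        g = torch.sigmoid(self.gate(torch.cat([c_att, c_rel, h_t], dim=-1)))
        return g * c_att + (1 - g) * c_rel
```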
m    BLEU-1   BLEU-4   METEOR   ROUGE   CIDEr   SPICE
3     76.2     36.1     27.1     55.2    116.1    20.5
5     76.9     36.5     27.8     56.4    117.8    20.8
7     77.4     36.9     28.1     57.3    118.7    21.1
11    77.2     36.6     27.9     57.2    118.3    20.9
Table 2  Parameter selection for the attention-value-based visual object filtering mechanism
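
Table 2 varies m, the number of regions retained by the attention-value-based filtering before relation reasoning; m = 7 performs best, while larger m re-admits weakly attended, noisy regions. A hypothetical one-function sketch of that filtering step (the function name and interface are illustrative):

```python
import torch

def filter_by_attention(features: torch.Tensor, attn: torch.Tensor, m: int = 7):
    """Keep only the m regions with the largest attention values, so that
    pairwise relation reasoning runs over m*m instead of N*N pairs.
    features: (N, D) region features; attn: (N,) attention weights."""
    idx = torch.topk(attn, k=min(m, attn.numel())).indices
    return features[idx], attn[idx]
```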
Model              BLEU-1   BLEU-4   METEOR   ROUGE   CIDEr   SPICE
Att2in (XE)          –       31.3     26.0     54.3    101.3     –
GL-Att (XE)         74.0     35.2     27.5     52.4     98.7     –
LRCA (XE)           75.9     35.8     27.8     56.4    111.3     –
Adaptive (XE)       74.2     33.2     26.6      –      108.5    19.5
NBT (XE)            75.5     34.7     27.1      –      107.2    20.1
Updown (XE)         77.2     36.2     27.0     56.4    113.5    20.3
POS-SCAN (XE)       76.6     36.5     27.9      –      114.9    20.8
RFNet (XE)          77.5     36.8     27.2     56.8    115.3    20.5
Ours (XE)           77.4     36.9     28.1     57.3    118.7    21.1
Att2in (CIDEr)       –       33.3     26.3     55.3    111.4     –
Updown (CIDEr)      79.8     36.3     27.7     56.9    120.1    21.4
POS-SCAN (CIDEr)    80.1     37.8     28.3      –      125.9    22.0
RFNet (CIDEr)       79.1     36.5     27.7     57.3    121.9    21.2
JCRR (CIDEr)         –       37.7     28.2      –      120.1    21.6
Ours (CIDEr)        80.1     38.1     29.0     58.7    127.1    22.1
Table 3  Performance comparison on the Microsoft COCO dataset
Model            BLEU-1   BLEU-4   METEOR   CIDEr
Hard-Attention    66.9     19.9     18.5      –
GL-Att            68.1     25.7     18.9      –
LRCA              69.8     27.7     21.5     57.4
Adaptive          67.7     25.1     20.4     53.1
NBT               69.0     27.1     21.7     57.5
Ours (XE)         73.6     30.1     23.8     60.2
Table 4  Performance comparison on the Flickr30k dataset
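
The scores in Tables 3 and 4 come from the standard COCO caption evaluation toolkit [25-28]. As a toy illustration of what BLEU-1 and BLEU-4 measure (unigram versus up-to-4-gram precision), here is a sentence-level example with nltk; the candidate and reference sentences are invented and the snippet is not how the tables were computed.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

refs = [["a", "man", "riding", "a", "horse", "on", "a", "beach"]]
hyp = ["a", "man", "rides", "a", "horse", "on", "the", "beach"]
smooth = SmoothingFunction().method1  # avoid zero scores on short sentences

bleu1 = sentence_bleu(refs, hyp, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(refs, hyp, smoothing_function=smooth)  # default 4-gram weights
print(f"BLEU-1 = {bleu1:.3f}, BLEU-4 = {bleu4:.3f}")
```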
Fig. 3  Visualization of the visual reasoning module
Fig. 4  Visualization of the context gate mechanism
1 HEIKKILÄ M, PIETIKÄINEN M, SCHMID C. Description of interest regions with local binary patterns [J]. Pattern Recognition, 2009, 42(3): 425-436
doi: 10.1016/j.patcog.2008.08.014
2 LINDEBERG T. Scale invariant feature transform [J]. Scholarpedia, 2012, 7(5): 10491.
3 DALAL N, TRIGGS B. Histograms of oriented gradients for human detection [C]// 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego: IEEE, 2005: 886-893.
4 SUYKENS J A, VANDEWALLE J. Least squares support vector machine classifiers [J]. Neural Processing Letters, 1999, 9(3): 293-300
doi: 10.1023/A:1018628609742
5 FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images [M]// DANIILIDIS K, MARAGOS P, PARAGIOS N. Computer vision: ECCV 2010. [S. l.]: Springer, 2010: 15-29.
6 KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models [EB/OL].[2021-03-05]. https://arxiv.org/pdf/1411.2539.pdf.
7 HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780
doi: 10.1162/neco.1997.9.8.1735
8 XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [EB/OL]. [2021-03-05]. https://arxiv.org/pdf/1502.03044.pdf.
9 LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 375-383.
10 ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6077-6086.
11 GU J, CAI J, WANG G, et al. Stack-captioning: coarse-to-fine learning for image captioning [C]// Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans: AAAI, 2018: 12266.
12 WANG W, CHEN Z, HU H. Hierarchical attention network for image captioning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu: AAAI, 2019: 8957-8964.
13 ZHAO Xiao-hu, YIN Liang-fei, ZHAO Cheng-long. Image captioning based on global-local feature and adaptive-attention [J]. Journal of Zhejiang University: Engineering Science, 2020, 54(1): 126-134
14 WANG J, WANG W, WANG L, et al Learning visual relationship and context-aware attention for image captioning[J]. Pattern Recognition, 2020, 98: 107075
doi: 10.1016/j.patcog.2019.107075
15 KE L, PEI W, LI R, et al. Reflective decoding network for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 8888-8897.
16 ZHOU Y, WANG M, LIU D, et al. More grounded image captioning by distilling image-text matching model [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 4777-4786.
17 HOU J, WU X, ZHANG X, et al. Joint commonsense and relation reasoning for image and video captioning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New York: [s. n.], 2020: 10973-10980.
18 REN S, HE K, GIRSHICK R, et al Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39 (6): 1137- 1149
doi: 10.1109/TPAMI.2016.2577031
19 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [EB/OL].[2021-03-05]. https://arxiv.org/pdf/1706.03762.pdf.
20 WANG J, JIANG W, MA L, et al. Bidirectional attentive fusion with context gating for dense video captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7190-7198.
21 VEDANTAM R, LAWRENCE ZITNICK C, PARIKH D. CIDEr: consensus-based image description evaluation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 4566-4575.
22 LIN T-Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [M]// FLEET D, PAJDLA T, SCHIELE B, et al. Computer vision: ECCV 2014. [S.l.]: Springer, 2014: 740-755.
23 PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models [J]. International Journal of Computer Vision, 2017, 123: 74-93.
24 JOHNSON J, KARPATHY A, LI F-F. DenseCap: fully convolutional localization networks for dense captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 4565-4574.
25 PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation [C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia: ACL, 2002: 311-318.
26 DENKOWSKI M, LAVIE A. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems [C]// Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh: ACL, 2011: 85-91.
27 LIN C Y. ROUGE: a package for automatic evaluation of summaries [C]// Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004. Barcelona: [s. n.], 2004: 1-10.
28 ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation [M]// LEIBE B, MATAS J, SEBE N, et al. Computer vision: ECCV 2016. [S. l.]: Springer, 2016: 382-398.
29 KRISHNA R, ZHU Y, GROTH O, et al Visual genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123 (1): 32- 73
doi: 10.1007/s11263-016-0981-7
30 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
31 PENNINGTON J, SOCHER R, MANNING C D. Glove: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. [S. l.]: ACL, 2014: 1532-1543.
32 KINGMA D P, BA J L. Adam: a method for stochastic optimization [EB/OL].[2021-03-05]. https://arxiv.org/pdf/1412.6980.pdf.
33 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7008-7024.