Journal of Zhejiang University (Engineering Science)  2022, Vol. 56, Issue (1): 36-46    DOI: 10.3785/j.issn.1008-973X.2022.01.004
Computer Technology, Information and Electronic Engineering
Visual question answering method based on relational reasoning and gating mechanism
Xin WANG, Qiao-hong CHEN*, Qi SUN, Yu-bo JIA
School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
Abstract:

A relational reasoning module and an adaptive gating mechanism were added on top of the attention mechanism, aiming at the problems that existing attention mechanisms lack the ability to understand the relationships between visual objects and achieve poor accuracy. The attention mechanism was used to focus on multiple visual regions related to the question, and the binary and multiple relational reasoning in the relational reasoning module were used to strengthen the connections between these visual regions. The resulting visual attention feature and visual relation feature were fed into the adaptive gate, which dynamically controlled the contribution of the two features to the predicted answer. Experimental results on the VQA 1.0 and VQA 2.0 datasets showed that the overall accuracy of the model improved by about 2% compared with advanced models such as DCN, MFB, MFH and MCB. The model based on relational reasoning and the gating mechanism can better understand image content and effectively improve the accuracy of visual question answering.

Key words: visual question answering (VQA); attention mechanism; visual region; relational reasoning; adaptive gating
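
For intuition, a minimal sketch of the two mechanisms the abstract describes is given below: binary (pairwise) relational reasoning over question-conditioned visual regions, in the spirit of the relation networks of Santoro et al. [9], and a sigmoid gate that adaptively mixes the visual attention feature with the visual relation feature. This is an illustrative reading of the abstract, not the authors' implementation; the module names, dimensions, and the exact gate form are assumptions.

```python
# Hypothetical sketch of binary relational reasoning + adaptive gating (PyTorch).
# Not the paper's code: layer sizes and the sigmoid-gate form are assumptions.
import torch
import torch.nn as nn


class BinaryRelationReasoning(nn.Module):
    """Score every ordered pair of region features together with the
    question feature, then sum the pair messages (cf. Santoro et al. [9])."""

    def __init__(self, v_dim: int, q_dim: int, h_dim: int):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * v_dim + q_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, h_dim), nn.ReLU(),
        )

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (B, N, v_dim) region features; q: (B, q_dim) question feature
        B, N, _ = v.shape
        vi = v.unsqueeze(2).expand(B, N, N, -1)                # region i
        vj = v.unsqueeze(1).expand(B, N, N, -1)                # region j
        qq = q.unsqueeze(1).unsqueeze(1).expand(B, N, N, -1)   # broadcast question
        pairs = torch.cat([vi, vj, qq], dim=-1)                # all ordered pairs
        return self.g(pairs).sum(dim=(1, 2))                   # (B, h_dim) relation feature


class AdaptiveGate(nn.Module):
    """Dynamically weigh the attention feature against the relation feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, f_att: torch.Tensor, f_rel: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([f_att, f_rel], dim=-1)))
        return g * f_att + (1.0 - g) * f_rel                   # convex mixture


if __name__ == "__main__":
    v = torch.randn(2, 36, 512)            # 36 attended regions per image
    q = torch.randn(2, 512)                # encoded question
    f_rel = BinaryRelationReasoning(512, 512, 512)(v, q)
    f_att = v.mean(dim=1)                  # stand-in for the attention-pooled feature
    fused = AdaptiveGate(512)(f_att, f_rel)
    print(fused.shape)                     # torch.Size([2, 512])
```

The gate makes the trade-off data-dependent: for questions about a single object the network can lean on the attention feature, and for questions about interactions between objects it can lean on the relation feature, which is what "dynamically controlled the contribution of the two features" suggests.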
Received: 2021-03-19    Published: 2022-01-05
CLC:  TP 391  
Fund: Supported by the Natural Science Foundation of Zhejiang Province (LY17E050028)
Corresponding author: Qiao-hong CHEN    E-mail: xinwang952021@163.com; chen_lisa@zstu.edu.cn
About the author: Xin WANG (1995—), male, master's student, engaged in research on computer-aided design and machine learning. orcid.org/0000-0002-5589-5628. E-mail: xinwang952021@163.com

Cite this article:


Xin WANG, Qiao-hong CHEN, Qi SUN, Yu-bo JIA. Visual question answering method based on relational reasoning and gating mechanism. Journal of Zhejiang University (Engineering Science), 2022, 56(1): 36-46.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2022.01.004        https://www.zjujournals.com/eng/CN/Y2022/V56/I1/36

Fig. 1  Visual question answering model based on relational reasoning and gating mechanism
Fig. 2  Relational reasoning module
Fig. 3  Visualization of the relational reasoning process
Model                                                             Accuracy/%
Baseline model                                                    58.70
Baseline model + relational reasoning module                      61.89
Baseline model + attention mechanism                              62.45
Baseline model + attention mechanism + binary relational reasoning    62.78
Baseline model + attention mechanism + multiple relational reasoning  62.84
Baseline model + attention mechanism + relational reasoning module    63.10
Full model                                                        64.26
Table 1  Ablation results of the visual question answering model based on relational reasoning and gating mechanism
Model            test-dev set                           test-std set
                 Overall   Other   Number   Yes/No      Overall   Other   Number   Yes/No
MAN[19]          63.80     54.00   39.00    81.50       64.10     54.70   37.60    81.70
DAN[20]          64.30     53.90   39.10    83.00       64.20     54.00   38.10    82.80
MFB[21]          65.90     56.20   39.80    84.00       65.80     56.30   38.90    83.80
MFH[22]          66.80     57.40   39.70    85.00       66.90     57.40   39.50    85.00
DCN[23]          66.83     57.44   41.66    84.48       66.66     56.83   41.27    84.61
Proposed model   68.24     58.56   42.32    84.65       68.37     58.21   47.44    84.48
Table 2  Experimental results of the visual question answering model based on relational reasoning and gating mechanism on the VQA 1.0 dataset (unit: %)
Model            test-dev set                           test-std set
                 Overall   Other   Number   Yes/No      Overall   Other   Number   Yes/No
LSTM+CNN[24]     —         —       —        —           54.22     41.83   35.18    73.46
MCB[24]          —         —       —        —           62.27     53.36   38.28    78.82
Adelaide[12]     65.32     56.05   44.21    81.82       65.67     56.26   43.90    82.20
DCN[23]          66.60     56.72   46.60    83.50       67.00     56.90   46.93    83.89
MuRel[25]        68.03     57.85   49.84    84.77       68.41     —       —        —
DFAF[26]         70.22     60.49   53.32    86.09       70.34     —       —        —
TRRNet[27]       70.80     61.02   51.89    87.27       71.20     —       —        —
Proposed model   68.16     58.46   47.78    84.00       68.51     58.11   47.36    84.36
Table 3  Experimental results of the visual question answering model based on relational reasoning and gating mechanism on the VQA 2.0 dataset (unit: %; — indicates results not reported)
Fig. 4  Visualization of the visual question answering model
1 NIU Yu-lei, ZHANG Han-wang. A survey of visual question answering and dialogue [J]. Computer Science, 2021, 48(3): 10.
2 REN M, KIROS R, ZEMEL R. Exploring models and data for image question answering [C]// Advances in Neural Information Processing Systems. Montreal: [s. n.], 2015: 2953–2961.
3 ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6077-6086.
4 CHEN F, MENG F, XU J, et al. DMRM: a dual-channel multi-hop reasoning model for visual dialog [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020, 34(5): 7504-7511.
5 YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 6281-6290.
6 ZHU Z, YU J, WANG Y, et al. Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering [EB/OL]. (2020-11-04)[2021-03-19]. https://arxiv.org/abs/2006.09073.
7 JOHNSON J, HARIHARAN B, VAN DER MAATEN L, et al. Inferring and executing programs for visual reasoning [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2989-2998.
8 QIU Zhen-na, ZHANG Li-hong, TAO Yun-song. Research on visual question answering method based on object detection and relational reasoning [J]. Journal of Testing Technology, 2020, 34(5): 8.
9 SANTORO A, RAPOSO D, BARRETT D G T, et al. A simple neural network module for relational reasoning [EB/OL]. (2017-06-05)[2021-03-19]. https://arxiv.org/abs/1706.01427.
10 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
11 REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(6): 1137-1149.
12 TENEY D, ANDERSON P, HE X, et al. Tips and tricks for visual question answering: learnings from the 2017 challenge [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4223-4232.
13 PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL, 2014: 1532-1543.
14 HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735
15 PIRSIAVASH H, RAMANAN D, FOWLKES C C. Bilinear classifiers for visual recognition [C]// Advances in Neural Information Processing Systems. Denver, USA: Curran Associates, 2009: 3.
16 PEI H, CHEN Q, WANG J, et al. Visual relational reasoning for image caption [C]// 2020 International Joint Conference on Neural Networks. Glasgow, UK: IEEE, 2020: 1-8.
17 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
18 ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 2425-2433.
19 MA C, SHEN C, DICK A, et al. Visual question answering with memory-augmented networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6975-6984.
20 NAM H, HA J W, KIM J. Dual attention networks for multimodal reasoning and matching [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 299-307.
21 YU Z, YU J, FAN J, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1821-1830.
22 YU Z, YU J, XIANG C, et al. Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering [J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(12): 5947-5959. doi: 10.1109/TNNLS.2018.2817340
23 NGUYEN D K, OKATANI T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6087-6096.
24 GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6904-6913.
25 CADENE R, BEN-YOUNES H, CORD M, et al. MuRel: multimodal relational reasoning for visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1989-1998.
26 GAO P, JIANG Z, YOU H, et al. Dynamic fusion with intra- and inter-modality attention flow for visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 6639-6648.