Journal of ZheJiang University (Engineering Science)  2022, Vol. 56 Issue (1): 36-46    DOI: 10.3785/j.issn.1008-973X.2022.01.004
    
Visual question answering method based on relational reasoning and gating mechanism
Xin WANG(),Qiao-hong CHEN*(),Qi SUN,Yu-bo JIA
School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

Abstract  

To address the problems that existing attention mechanisms lack the ability to understand relationships between visual objects and achieve low accuracy, a relational reasoning module and an adaptive gating mechanism were added on top of the attention mechanism. The attention mechanism was used to focus on multiple visual regions related to the question, and the dual relational reasoning and multiple relational reasoning in the relational reasoning module were used to strengthen the connections between these visual regions. The resulting visual attention feature and visual relationship feature were fed into the adaptive gate, which dynamically controlled the contribution of the two features to the predicted answer. Experimental results on the VQA 1.0 and VQA 2.0 data sets showed that the overall accuracy of the model improved by about 2% compared with advanced models such as DCN, MFB, MFH and MCB. The model based on relational reasoning and the gating mechanism can better understand image content and effectively improves the accuracy of visual question answering.
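The three-stage pipeline described in the abstract (question-guided attention over visual regions, pairwise relational reasoning, and adaptive gated fusion) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all names, dimensions, and the specific scoring functions are assumptions.

```python
# Illustrative sketch of attention -> dual relational reasoning -> adaptive gate.
# All module names and weight shapes are hypothetical, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

K, D = 6, 16                      # K visual regions, feature dimension D
V = rng.standard_normal((K, D))   # region features (e.g. from Faster R-CNN)
q = rng.standard_normal(D)        # question embedding (e.g. from an LSTM)

# 1) Question-guided attention: weight each region by its relevance to q.
att_weights = softmax(V @ q)              # (K,), sums to 1
v_att = att_weights @ V                   # attended visual feature, (D,)

# 2) Dual (pairwise) relational reasoning: build a representation for
#    every ordered region pair conditioned on the question, then pool.
W_rel = rng.standard_normal((3 * D, D)) * 0.1
pair_feats = [
    np.concatenate([V[i], V[j], q]) @ W_rel
    for i in range(K) for j in range(K) if i != j
]
v_rel = np.maximum(np.mean(pair_feats, axis=0), 0.0)   # relation feature, (D,)

# 3) Adaptive gate: dynamically control, per dimension, how much each
#    feature contributes to the fused representation fed to the classifier.
W_g = rng.standard_normal((2 * D, D)) * 0.1
g = sigmoid(np.concatenate([v_att, v_rel]) @ W_g)      # gate values in (0, 1)
fused = g * v_att + (1.0 - g) * v_rel                  # (D,)

print(fused.shape)  # (16,)
```

The gate replaces a fixed concatenation or sum: when a question hinges on object relations ("what is left of the cat?"), the learned gate can shift weight toward the relation feature, and toward the attention feature otherwise.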



Key words: visual question answering (VQA); attention mechanism; visual region; relational reasoning; adaptive gating
Received: 19 March 2021      Published: 05 January 2022
CLC:  TP 391  
Fund: Zhejiang Provincial Natural Science Foundation of China (LY17E050028)
Corresponding Authors: Qiao-hong CHEN     E-mail: xinwang952021@163.com;chen_lisa@zstu.edu.cn
Cite this article:

Xin WANG,Qiao-hong CHEN,Qi SUN,Yu-bo JIA. Visual question answering method based on relational reasoning and gating mechanism. Journal of ZheJiang University (Engineering Science), 2022, 56(1): 36-46.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2022.01.004     OR     https://www.zjujournals.com/eng/Y2022/V56/I1/36


Fig.1 Visual question answering model diagram based on relational reasoning and gating mechanism
Fig.2 Relational reasoning module
Fig.3 Visualization of relational reasoning process
Model                                                            Accuracy/%
Baseline model                                                   58.70
Baseline model + relational reasoning module                     61.89
Baseline model + attention mechanism                             62.45
Baseline model + attention mechanism + dual relational reasoning 62.78
Baseline model + attention mechanism + multiple relational reasoning 62.84
Baseline model + attention mechanism + relational reasoning module   63.10
Full model                                                       64.26
Tab.1 Ablation experiment results based on relational reasoning and gating mechanism
Unit: %
Model           test-dev                          test-std
                Overall Other  Number Yes/No      Overall Other  Number Yes/No
MAN[19]         63.80   54.00  39.00  81.50       64.10   54.70  37.60  81.70
DAN[20]         64.30   53.90  39.10  83.00       64.20   54.00  38.10  82.80
MFB[21]         65.90   56.20  39.80  84.00       65.80   56.30  38.90  83.80
MFH[22]         66.80   57.40  39.70  85.00       66.90   57.40  39.50  85.00
DCN[23]         66.83   57.44  41.66  84.48       66.66   56.83  41.27  84.61
Proposed model  68.24   58.56  42.32  84.65       68.37   58.21  47.44  84.48
Tab.2 Experimental results of visual question answering model based on relational reasoning and gating mechanism on VQA 1.0 data set
Unit: %
Model           test-dev                          test-std
                Overall Other  Number Yes/No      Overall Other  Number Yes/No
LSTM+CNN[24]    ?       ?      ?      ?           54.22   41.83  35.18  73.46
MCB[24]         ?       ?      ?      ?           62.27   53.36  38.28  78.82
Adelaide[12]    65.32   56.05  44.21  81.82       65.67   56.26  43.90  82.20
DCN[23]         66.60   56.72  46.60  83.50       67.00   56.90  46.93  83.89
MuRel[25]       68.03   57.85  49.84  84.77       68.41   ?      ?      ?
DFAF[26]        70.22   60.49  53.32  86.09       70.34   ?      ?      ?
TRRNet[27]      70.80   61.02  51.89  87.27       71.20   ?      ?      ?
Proposed model  68.16   58.46  47.78  84.00       68.51   58.11  47.36  84.36
Tab.3 Experimental results of visual question answering model based on relational reasoning and gating mechanism on VQA 2.0 data set
Fig.4 Visualization of visual question answering model
[1]   NIU Yu-lei, ZHANG Han-wang. A survey of visual question answering and dialogue [J]. Computer Science, 2021, 48(3): 10.
[2]   REN M, KIROS R, ZEMEL R. Exploring models and data for image question answering [C]// Advances in Neural Information Processing Systems. Montreal: [s. n.], 2015: 2953–2961.
[3]   ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6077-6086.
[4]   CHEN F, MENG F, XU J, et al. Dmrm: a dual-channel multi-hop reasoning model for visual dialog [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020, 34(5): 7504-7511.
[5]   YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 6281-6290.
[6]   ZHU Z, YU J, WANG Y, et al. Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering [EB/OL]. (2020-11-04)[2021-03-19]. https://arxiv.org/abs/2006.09073.
[7]   JOHNSON J, HARIHARAN B, VAN DER MAATEN L, et al. Inferring and executing programs for visual reasoning [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2989-2998.
[8]   QIU Zhen-na, ZHANG Li-hong, TAO Yun-song. Research on visual question answering method based on object detection and relational reasoning [J]. Journal of Testing Technology, 2020, 34(5): 8.
[9]   SANTORO A, RAPOSO D, BARRETT D G T, et al. A simple neural network module for relational reasoning [EB/OL]. (2017-06-05)[2021-03-19]. https://arxiv.org/abs/1706.01427.
[10]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[11]   REN S, HE K, GIRSHICK R, et al Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39 (6): 1137- 1149
[12]   TENEY D, ANDERSON P, HE X, et al. Tips and tricks for visual question answering: learnings from the 2017 challenge [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4223-4232.
[13]   PENNINGTON J, SOCHER R, MANNING C D. Glove: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL, 2014: 1532-1543.
[14]   HOCHREITER S, SCHMIDHUBER J Long short-term memory[J]. Neural computation, 1997, 9 (8): 1735- 1780
doi: 10.1162/neco.1997.9.8.1735
[15]   PIRSIAVASH H, RAMANAN D, FOWLKES C C. Bilinear classifiers for visual recognition [C]// Advances in Neural Information Processing Systems. Denver, USA: Curran Associates, 2009: 3.
[16]   PEI H, CHEN Q, WANG J, et al. Visual relational reasoning for image caption [C]// 2020 International Joint Conference on Neural Networks. Glasgow, UK: IEEE, 2020: 1-8.
[17]   LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft coco: common objects in context [C]// European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[18]   ANTOL S, AGRAWAL A, LU J, et al. Vqa: visual question answering [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 2425-2433.
[19]   MA C, SHEN C, DICK A, et al. Visual question answering with memory-augmented networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6975-6984.
[20]   NAM H, HA J W, KIM J. Dual attention networks for multimodal reasoning and matching [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 299-307.
[21]   YU Z, YU J, FAN J, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1821-1830.
[22]   YU Z, YU J, XIANG C, et al Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29 (12): 5947- 5959
doi: 10.1109/TNNLS.2018.2817340
[23]   NGUYEN D K, OKATANI T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6087-6096.
[24]   GOYAL Y, KHOT T, SUMMER-STAY D, et al. Making the v in vqa matter: elevating the role of image understanding in visual question answering [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 6904-6913.
[25]   CADENE R, BEN-YOUNES H, CORD M, et al. Murel: multimodal relational reasoning for visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1989-1998.
[26]   GAO P, JIANG Z, YOU H, et al. Dynamic fusion with intra-and inter-modality attention flow for visual question answering[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 6639-6648.