基于语义增强特征融合的多模态图像检索模型

doi:10.3785/j.issn.1008-973X.2023.02.005

浙江大学学报(工学版)

2023, Vol. 57

Issue (2): 252-258 DOI: 10.3785/j.issn.1008-973X.2023.02.005

计算机技术

基于语义增强特征融合的多模态图像检索模型

杨帆(

),宁博*(

),李怀清,周新,李冠宇

大连海事大学信息科学技术学院，辽宁大连 116026

Multimodal image retrieval model based on semantic-enhanced feature fusion

Fan YANG(

),Bo NING*(

),Huai-qing LI,Xin ZHOU,Guan-yu LI

School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China

全文: PDF(928 KB) HTML

摘要：

为了在多模态图像检索任务中建立文本特征与图像特征的相关性，提出基于语义增强特征融合的多模态图像检索模型（SEFM）. 该模型通过文本语义增强模块、图像语义增强模块2部分在特征融合时对组合特征进行语义增强. 在文本语义增强模块建立多模态双重注意力机制，利用双重注意力建立文本与图像之间的关联以增强文本语义；在图像语义增强模块引入保留强度和更新强度，控制组合特征中查询图像特征的保留和更新程度. 基于以上2个模块可以优化组合特征使其更接近目标图像特征. 在MIT-States和Fashion IQ这2个数据集上对该模型进行评估，实验结果表明在多模态图像检索任务上该模型与现有方法相比在召回率和准确率上都有所提升.

关键词： 多模态; 语义增强; 特征融合; 图像检索; 注意力机制

Abstract:

A multimodal image retrieval model based on semantic-enhanced feature fusion (SEFM) was proposed to establish the correlation between text features and image features in multimodal image retrieval tasks. Semantic enhancement was conducted on the combined features during feature fusion by two proposed modules including the text semantic enhancement module and the image semantic enhancement module. Firstly, to enhance the text semantics, a multimodal dual attention mechanism was established in the text semantic enhancement module, which associated the multimodal correlation between text and image. Secondly, to enhance the image semantics, the retain intensity and update intensity were introduced in the image semantic enhancement module, which controlled the retaining and updating degrees of the query image features in combined features. Based on the above two modules, the combined features can be optimized, and be closer to the target image features. In the experiment part, the SEFM model was evaluated on MIT-States and Fashion IQ datasets, and experimental results show that the proposed model performs better than the existing works on recall and precision metrics.

Key words: multimodality semantic enhancement feature fusion image retrieval attention mechanism

收稿日期: 2022-07-29 出版日期: 2023-02-28

CLC:

TP 391

基金资助: 国家自然科学基金资助项目 (61976032, 62002039)；辽宁省教育厅科学研究面上资助项目(LJKZ0063)

通讯作者: 宁博 E-mail: yangfany116@163.com;ningbo@dlmu.edu.cn

作者简介: 杨帆（1997—），女，硕士生，从事多模态图像检索研究. orcid.org/0000-0002-9733-8694. E-mail: yangfany116@163.com

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	作者相关文章
	杨帆
	宁博
	李怀清
	周新
	李冠宇

引用本文:

杨帆,宁博,李怀清,周新,李冠宇. 基于语义增强特征融合的多模态图像检索模型[J]. 浙江大学学报(工学版), 2023, 57(2): 252-258.

Fan YANG,Bo NING,Huai-qing LI,Xin ZHOU,Guan-yu LI. Multimodal image retrieval model based on semantic-enhanced feature fusion. Journal of ZheJiang University (Engineering Science), 2023, 57(2): 252-258.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.02.005 或 https://www.zjujournals.com/eng/CN/Y2023/V57/I2/252

图 1 基于语义增强特征融合的多模态图像检索模型（SEFM）整体架构

图 2 用于文本语义增强模块的多模态双重注意力MDA结构

表 1 MIT-States 数据集上不同算法的召回率结果对比

表 2 Fashion IQ 数据集上不同算法的召回率结果对比

表 3 MIT-States 数据集上不同算法的准确率结果对比

表 4 MIT-States 数据集上消融实验召回率结果对比

表 5 Fashion IQ 数据集上消融实验结果召回率结果对比

表 6 MIT-States 数据集上消融实验准确率结果对比

图 3 基于语义增强特征融合的多模态图像检索模型的检索示例

1	DUBEY S R A decade survey of content based image retrieval using deep learning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32 (5): 2687- 2704 doi: 10.1109/TCSVT.2021.3080920
2	PANG K T, LI K, YANG Y X , et al. Generalising fine-grained sketch-based image retrieval [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 677-686.
3	LIN T Y, CUI Y, BELONGIE S, et al. Learning deep representations for ground-to-aerial geolocalization [C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 5007-5015.
4	ZHANG M, MAIDMENT T, DIAB A, et al. Domain-robust VQA with diverse datasets and methods but no target labels [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [s.l.]: IEEE, 2021: 7046-7056.
5	CHEN L, JIANG Z, XIAO J, et al. Human-like controllable image captioning with verb-specific semantic roles [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [s.l.]: IEEE, 2021: 16846-16856.
6	SANTORO A, RAPOSO D, BARRETT D G T, et al. A simple neural network module for relational reasoning [C]// Advances in Neural Information Processing Systems 30. Long Beach: Curran Associates, 2017: 4967-4976.
7	PEREZ E, STRUB F, VRIES H D, et al. FiLM: visual reasoning with a general conditioning layer [C]// 32nd AAAI Conference on Artificial Intelligence. New Orleans: AAAI, 2018: 3942-3951.
8	NAGARAJAN T, GRAUMAN K. Attributes as operators: factorizing unseen attribute-object compositions [C]// European Conference on Computer Vision. Munich: Springer, 2018: 172-190.
9	VO N , LU J, CHEN S, et al. Composing text and image for image retrieval: an Empirical Odyssey[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019 : 6439-6448.
10	ANWAAR M U, LABINTCEV E, KLEINSTEUBER M. Compositional learning of image-text query for image retrieval [C]// 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2021: 1139-1148.
11	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
12	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: Association for Computational Linguistics, 2019: 4171–4186.
13	HUANG L, WANG W M, CHEN J, et al. Attention on attention for image captioning[C]// 2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 4633-4642.
14	ISOLA P, LIM J J, ADELSON E H. Discovering states and transformations in image collections [C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1383-1391.

[1]	王林涛,毛齐. 基于RGB与深度信息融合的管片抓取位置测量方法[J]. 浙江大学学报(工学版), 2023, 57(1): 47-54.
[2]	凤丽洲,杨阳,王友卫,杨贵军. 基于Transformer和知识图谱的新闻推荐新方法[J]. 浙江大学学报(工学版), 2023, 57(1): 133-143.
[3]	郝琨,王阔,王贝贝. 基于改进Mobilenet-YOLOv3的轻量级水下生物检测算法[J]. 浙江大学学报(工学版), 2022, 56(8): 1622-1632.
[4]	莫仁鹏,司小胜,李天梅,朱旭. 基于多尺度特征与注意力机制的轴承寿命预测[J]. 浙江大学学报(工学版), 2022, 56(7): 1447-1456.
[5]	王友卫,童爽,凤丽洲,朱建明,李洋,陈福. 基于图卷积网络的归纳式微博谣言检测新方法[J]. 浙江大学学报(工学版), 2022, 56(5): 956-966.
[6]	鞠晓臣,赵欣欣,钱胜胜. 基于自注意力机制的桥梁螺栓检测算法[J]. 浙江大学学报(工学版), 2022, 56(5): 901-908.
[7]	张雪芹,李天任. 基于Cycle-GAN和改进DPN网络的乳腺癌病理图像分类[J]. 浙江大学学报(工学版), 2022, 56(4): 727-735.
[8]	许萌,王丹,李致远,陈远方. IncepA-EEGNet: 融合Inception网络和注意力机制的P300信号检测方法[J]. 浙江大学学报(工学版), 2022, 56(4): 745-753, 782.
[9]	柳长源,何先平,毕晓君. 融合注意力机制的高效率网络车型识别[J]. 浙江大学学报(工学版), 2022, 56(4): 775-782.
[10]	张娜,戚旭磊,包晓安,吴彪,涂小妹,金瑜婷. 基于优化预测定位的单阶段目标检测算法[J]. 浙江大学学报(工学版), 2022, 56(4): 783-794.
[11]	陈巧红,裴皓磊,孙麒. 基于视觉关系推理与上下文门控机制的图像描述[J]. 浙江大学学报(工学版), 2022, 56(3): 542-549.
[12]	农元君,王俊杰,陈红,孙文涵,耿慧,李书悦. 基于注意力机制和编码-解码架构的施工场景图像描述方法[J]. 浙江大学学报(工学版), 2022, 56(2): 236-244.
[13]	刘英莉,吴瑞刚,么长慧,沈韬. 铝硅合金实体关系抽取数据集的构建方法[J]. 浙江大学学报(工学版), 2022, 56(2): 245-253.
[14]	黄新宇,游帆,张沛,张昭,张柏礼,吕建华,徐立臻. 基于多分类及特征融合的静默活体检测算法[J]. 浙江大学学报(工学版), 2022, 56(2): 263-270.
[15]	袁天乐,袁巨龙,朱勇建,郑翰辰. 基于改进YOLOv5的推力球轴承表面缺陷检测算法[J]. 浙江大学学报(工学版), 2022, 56(12): 2349-2357.

Viewed

Full text

Abstract

Cited

Shared

Discussed