Journal of ZheJiang University (Engineering Science)  2023, Vol. 57 Issue (2): 252-258    DOI: 10.3785/j.issn.1008-973X.2023.02.005
    
Multimodal image retrieval model based on semantic-enhanced feature fusion
Fan YANG, Bo NING*, Huai-qing LI, Xin ZHOU, Guan-yu LI
School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China

Abstract  

A multimodal image retrieval model based on semantic-enhanced feature fusion (SEFM) was proposed to establish the correlation between text features and image features in multimodal image retrieval tasks. Semantic enhancement was applied to the combined features during feature fusion through two proposed modules: a text semantic enhancement module and an image semantic enhancement module. Firstly, to enhance the text semantics, a multimodal dual attention mechanism was established in the text semantic enhancement module, which models the cross-modal correlation between text and image. Secondly, to enhance the image semantics, a retain intensity and an update intensity were introduced in the image semantic enhancement module, which control the degree to which the query image features are retained and updated in the combined features. Based on these two modules, the combined features can be optimized so that they move closer to the target image features. In the experiments, the SEFM model was evaluated on the MIT-States and Fashion IQ datasets, and the results show that the proposed model outperforms existing methods on recall and precision metrics.
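The image semantic enhancement step described above, in which a retain intensity and an update intensity control how much of the query image feature survives in the combined feature, can be sketched as a simple gating computation. This is an illustrative sketch only: the weight matrices `W_r` and `W_u`, the sigmoid gates, and the concatenation scheme are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, txt_feat, W_r, W_u):
    """Illustrative gating in the spirit of the image semantic
    enhancement module: a retain gate controls how much of the query
    image feature is kept, and an update gate controls how much of the
    text-conditioned modification is written into the combined feature.
    W_r and W_u are hypothetical learned projections."""
    joint = np.concatenate([img_feat, txt_feat])
    retain = sigmoid(W_r @ joint)   # retain intensity, elementwise in (0, 1)
    update = sigmoid(W_u @ joint)   # update intensity, elementwise in (0, 1)
    combined = retain * img_feat + update * txt_feat
    return combined
```

In training, the gates would be learned jointly with the rest of the model so that the combined feature lands near the target image feature in the embedding space.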



Key words: multimodality; semantic enhancement; feature fusion; image retrieval; attention mechanism
Received: 29 July 2022      Published: 28 February 2023
CLC:  TP 391  
Fund: National Natural Science Foundation of China (61976032, 62002039); General Scientific Research Project of the Education Department of Liaoning Province (LJKZ0063)
Corresponding Authors: Bo NING     E-mail: yangfany116@163.com;ningbo@dlmu.edu.cn
Cite this article:

Fan YANG,Bo NING,Huai-qing LI,Xin ZHOU,Guan-yu LI. Multimodal image retrieval model based on semantic-enhanced feature fusion. Journal of ZheJiang University (Engineering Science), 2023, 57(2): 252-258.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2023.02.005     OR     https://www.zjujournals.com/eng/Y2023/V57/I2/252


Fig.1 General architecture of multimodal image retrieval model based on semantic-enhanced feature fusion (SEFM)
Fig.2 Multimodal dual attention structure for text semantic enhancement module
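A minimal sketch of a dual attention computation of the kind Fig.2 depicts, where text tokens attend to image regions and image regions attend back to text tokens. The scaled dot-product form, the shared dimensionality, and the mean-pooling fusion are assumptions for illustration, not the model's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention over a set of key/value vectors."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

def dual_attention(txt_tokens, img_regions):
    """Illustrative dual attention: text tokens (T, d) attend to image
    regions (R, d) and vice versa; pooling and fusion by averaging are
    assumptions, not the paper's design."""
    txt2img = attend(txt_tokens, img_regions, img_regions)  # (T, d)
    img2txt = attend(img_regions, txt_tokens, txt_tokens)   # (R, d)
    return 0.5 * (txt2img.mean(axis=0) + img2txt.mean(axis=0))
```

Attending in both directions lets the fused vector reflect which image regions the modification text refers to, and which words are grounded in the image.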
Model  R@1/%  R@5/%  R@10/%
Attributes as operators 8.8±0.1 27.3±0.3 39.1±0.3
Relationship 12.3±0.5 31.9±0.7 42.9±0.9
FiLM 10.1±0.3 27.7±0.7 38.3±0.7
TIRG 12.2±0.4 31.9±0.3 43.1±0.3
TIRG-Bert 12.3±0.6 32.5±0.3 43.3±0.5
ComposeAE 13.9±0.5 35.3±0.8 47.9±0.7
SEFM 15.5±0.8 37.7±1.0 49.6±1.0
Tab.1 Comparison of recall results of different algorithms on MIT-States dataset
Model  R@10/% (dress, shirt, top&tee)  R@50/% (dress, shirt, top&tee)
TIRG 2.2±0.2 4.3±0.2 3.7±0.2 8.2±0.3 10.7±0.3 8.9±0.2
TIRG-Bert 11.7±0.5 10.9±0.5 11.7±0.3 30.1±0.3 27.9±0.4 28.1±0.3
ComposeAE 11.2±0.6 9.9±0.5 10.5±0.4 29.5±0.5 25.1±0.3 26.1±0.6
SEFM 11.9±0.3 11.2±0.5 11.7±0.3 29.6±0.5 27.4±0.5 27.5±0.3
Tab.2 Comparison of recall results of different algorithms on Fashion IQ dataset
Model  P@5/%  P@10/%
TIRG 11.2±0.2 10.2±0.2
TIRG-Bert 11.2±0.2 10.1±0.2
ComposeAE 11.8±0.5 10.6±0.3
SEFM 12.6±0.4 11.2±0.2
Tab.3 Comparison of precision results of different algorithms on MIT-States dataset
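The R@k figures reported in Tab.1 and Tab.2 measure the fraction of queries whose target image appears among the top k ranked candidates. A minimal sketch of that computation; the function name and the similarity-matrix input convention are illustrative, not from the paper.

```python
import numpy as np

def recall_at_k(similarity, target_idx, k):
    """R@k: fraction of queries whose target image is ranked in the
    top-k candidates. `similarity` is a (num_queries, num_candidates)
    score matrix; `target_idx[i]` is the index of query i's target."""
    topk = np.argsort(-similarity, axis=1)[:, :k]       # top-k candidate indices per query
    hits = [t in row for t, row in zip(target_idx, topk)]
    return float(np.mean(hits))
```

For example, with two queries whose targets are ranked first, `recall_at_k` returns 1.0 at k=1; a target ranked outside the top k counts as a miss.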
Model  R@1/%  R@5/%  R@10/%
SEFM (without text semantic enhancement) 13.4±0.7 35.2±0.8 47.6±1.0
SEFM (without image semantic enhancement) 14.6±0.8 34.5±0.9 47.7±0.8
SEFM(Lbase) 14.7±0.7 35.7±0.5 46.2±0.7
SEFM(Lbase+LRI) 14.7±0.6 34.9±0.5 46.8±0.7
SEFM(Lbase+LRT) 14.9±0.6 36.2±0.5 47.5±0.7
SEFM 15.5±0.8 37.7±1.0 49.6±1.0
Tab.4 Comparison of ablation recall results on MIT-States dataset
Model  R@10/% (dress, shirt, top&tee)
SEFM (without text semantic enhancement) 10.2±0.4 10.2±0.2 11.0±0.2
SEFM (without image semantic enhancement) 10.8±0.5 9.1±0.5 11.5±0.5
SEFM(Lbase) 11.2±0.3 10.7±0.3 11.6±0.3
SEFM(Lbase+LRI) 11.3±0.3 11.0±0.2 11.5±0.3
SEFM(Lbase+LRT) 11.6±0.4 11.3±0.3 11.7±0.4
SEFM 11.9±0.3 11.2±0.5 11.7±0.3
Tab.5 Comparison of ablation recall results on Fashion IQ dataset
Model  P@5/%  P@10/%
SEFM (without text semantic enhancement) 11.0±0.3 10.4±0.3
SEFM (without image semantic enhancement) 12.1±0.4 10.9±0.2
SEFM(Lbase) 12.0±0.5 11.2±0.3
SEFM(Lbase+LRI) 12.0±0.3 11.1±0.1
SEFM(Lbase+LRT) 11.9±0.3 11.0±0.2
SEFM 12.6±0.4 11.2±0.2
Tab.6 Comparison of ablation precision results on MIT-States dataset
Fig.3 Retrieval example of multimodal image retrieval model based on semantic-enhanced feature fusion
[1]   DUBEY S R. A decade survey of content based image retrieval using deep learning [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(5): 2687-2704.
doi: 10.1109/TCSVT.2021.3080920
[2]   PANG K T, LI K, YANG Y X, et al. Generalising fine-grained sketch-based image retrieval [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 677-686.
[3]   LIN T Y, CUI Y, BELONGIE S, et al. Learning deep representations for ground-to-aerial geolocalization [C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 5007-5015.
[4]   ZHANG M, MAIDMENT T, DIAB A, et al. Domain-robust VQA with diverse datasets and methods but no target labels [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [s.l.]: IEEE, 2021: 7046-7056.
[5]   CHEN L, JIANG Z, XIAO J, et al. Human-like controllable image captioning with verb-specific semantic roles [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [s.l.]: IEEE, 2021: 16846-16856.
[6]   SANTORO A, RAPOSO D, BARRETT D G T, et al. A simple neural network module for relational reasoning [C]// Advances in Neural Information Processing Systems 30. Long Beach: Curran Associates, 2017: 4967-4976.
[7]   PEREZ E, STRUB F, VRIES H D, et al. FiLM: visual reasoning with a general conditioning layer [C]// 32nd AAAI Conference on Artificial Intelligence. New Orleans: AAAI, 2018: 3942-3951.
[8]   NAGARAJAN T, GRAUMAN K. Attributes as operators: factorizing unseen attribute-object compositions [C]// European Conference on Computer Vision. Munich: Springer, 2018: 172-190.
[9]   VO N, LU J, CHEN S, et al. Composing text and image for image retrieval: an empirical odyssey [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 6439-6448.
[10]   ANWAAR M U, LABINTCEV E, KLEINSTEUBER M. Compositional learning of image-text query for image retrieval [C]// 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2021: 1139-1148.
[11]   HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
[12]   DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: Association for Computational Linguistics, 2019: 4171–4186.
[13]   HUANG L, WANG W M, CHEN J, et al. Attention on attention for image captioning[C]// 2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 4633-4642.
[14]   ISOLA P, LIM J J, ADELSON E H. Discovering states and transformations in image collections [C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1383-1391.