Journal of Zhejiang University (Engineering Science)  2026, Vol. 60 Issue (6): 1205-1212    DOI: 10.3785/j.issn.1008-973X.2026.06.007
Computer Technology
Image captioning generation based on multiple-view cross-modal feature fusion
Naizhou ZHANG1, Yunchao ZHAO1, Wei CAO2, Xiaojian ZHANG1
1. College of Computer and Information Engineering, Henan University of Economics and Law, Zhengzhou 450046, China
2. School of Data Science and E-commerce, Henan University of Economics and Law, Zhengzhou 450046, China
Abstract:

A new image captioning method based on multi-view cross-modal feature enhancement and fusion was proposed to address the loss of visual information during visual feature extraction. Multiple pre-trained visual feature extractors were employed to map image data into different feature spaces, and a cross-attention dual-stream mechanism was introduced to achieve dynamic enhancement and complementary fusion of the multi-view cross-modal features. In this way, multiple kinds of visual features were fused cooperatively, and the complementarity between different visual feature representations was exploited to reduce the visual information lost during feature encoding. The quality of the generated captions was significantly improved by optimizing the encoder-decoder architecture. Experimental results showed that the proposed model significantly outperformed existing mainstream methods on multiple image captioning metrics, validating the effectiveness of multi-view feature collaboration.

Key words: image captioning; visual feature extraction; cross-modal feature fusion; attention mechanism; contrastive language-image pre-training (CLIP)
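
The cross-attention dual-stream idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes two feature views that have already been projected to the same width, omits learned query/key/value projections, and uses hypothetical names (`grid_feats`, `clip_feats`, `dual_stream_fuse`) with random stand-in features. Each view attends to the other, the attended context is added back as a residual, and the two enhanced streams are concatenated along the token axis.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention in which `queries` attend over `context`.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*context)]  # transpose of context
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, context), weights


def dual_stream_fuse(view_a, view_b):
    """Enhance each view with the other (residual add), then concatenate tokens."""
    a_ctx, _ = cross_attention(view_a, view_b)
    b_ctx, _ = cross_attention(view_b, view_a)
    enhanced_a = [[x + y for x, y in zip(r, c)] for r, c in zip(view_a, a_ctx)]
    enhanced_b = [[x + y for x, y in zip(r, c)] for r, c in zip(view_b, b_ctx)]
    return enhanced_a + enhanced_b


rng = random.Random(0)
# Hypothetical stand-ins for two feature views already projected to a shared width of 8.
grid_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
clip_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]

fused = dual_stream_fuse(grid_feats, clip_feats)
_, attn = cross_attention(grid_feats, clip_feats)
print(len(fused), len(fused[0]))
```

The fused sequence keeps every token from both views, so a downstream encoder-decoder could attend over all of them; a real model would add learned projections, multiple heads, and normalization.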
Received: 2025-09-20    Published: 2026-05-06
CLC:  TP 391  
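Funding: National Natural Science Foundation of China (62072156); Henan Province Science and Technology Research Project (262102210047); Key Scientific Research Project Plan of Higher Education Institutions of Henan Province, Special Fund for Basic Research (25ZX012).
About the author: ZHANG Naizhou (born 1970), male, professor, Ph.D., works on artificial intelligence, natural language processing, and image processing. orcid.org/0009-0003-8222-8999. E-mail: zhangnz@126.com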

Cite this article:

Naizhou ZHANG, Yunchao ZHAO, Wei CAO, Xiaojian ZHANG. Image captioning generation based on multiple-view cross-modal feature fusion. Journal of Zhejiang University (Engineering Science), 2026, 60(6): 1205-1212.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.06.007
https://www.zjujournals.com/eng/CN/Y2026/V60/I6/1205

Fig. 1  Overall architecture of the MVCMFAF image captioning model

Table 1  Performance comparison with other state-of-the-art models on the MSCOCO test set (single model; "-" marks a value not given in the table)
Model               BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
SCST[6]             -       34.2    26.7    55.7     114.0  -
AoANet[9]           80.2    38.9    29.2    58.8     129.8  22.4
X-Transformer[10]   80.9    39.7    29.5    59.1     132.8  23.4
M2Transformer[11]   80.8    39.1    29.2    58.6     131.2  22.6
GET[12]             81.5    39.5    29.3    58.9     131.6  22.8
RSTNet[13]          81.8    40.1    29.8    59.5     135.6  23.3
DLCT[14]            81.4    39.8    29.5    59.1     133.8  23.0
Xmodal-Ctx[16]      81.5    39.7    30.0    59.5     135.9  23.7
DIFNet[8]           81.7    40.0    29.7    59.4     136.2  23.2
PureT[21]           82.1    40.9    30.2    60.1     138.2  24.2
VRCDA[33]           80.6    37.9    28.4    58.2     123.7  21.8
EVCAP[34]           -       41.5    31.2    -        140.1  24.7
MVCMFAF (ours)      83.2    41.6    30.4    60.5     140.6  24.4

Table 2  Performance comparison with other state-of-the-art models on the MSCOCO test set (ensemble of models; "-" marks a value not given in the table)
Model               BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
SCST[6]             -       35.4    27.1    56.6     117.5  -
AoANet[9]           81.6    40.2    29.3    59.4     132.0  22.8
X-Transformer[10]   81.7    40.7    29.9    59.7     135.3  23.8
M2Transformer[11]   82.0    40.5    29.7    59.5     134.5  23.5
GET[12]             82.1    40.6    29.8    59.6     135.1  23.8
DLCT[14]            82.2    40.8    29.9    59.8     137.5  23.3
PureT[21]           83.4    42.1    30.4    60.8     141.0  24.3
MVCMFAF (ours)      83.5    42.7    30.6    61.1     142.3  24.5
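
Table 3  Performance comparison with other state-of-the-art models on the Flickr30k dataset (unit: %; "-" marks a value not given in the table)
Model                   BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr
Soft-Attention[4]       66.7    19.1    18.5    -        -
Hard-Attention[4]       66.9    19.9    18.5    -        -
Adaptive-Attention[5]   67.7    25.1    20.4    -        53.1
A_R_L[35]               69.8    27.7    21.5    48.5     57.4
IVAIC[36]               70.8    30.6    22.5    49.8     63.0
VRCDA[33]               73.2    30.6    22.7    50.6     66.0
MVCMFAF (ours)          75.2    33.7    34.2    52.1     75.6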

Table 4  Ablation results on the MSCOCO test set
Module           Model                                BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
CAMVCMFF module  CAMVCMFF (w/o grid features)         82.1    40.7    30.1    59.8     137.3  23.7
                 CAMVCMFF (w/o region features)       80.9    40.2    29.2    59.7     132.9  22.9
                 CAMVCMFF (w/o clip features)         80.1    39.6    28.8    57.7     130.2  22.5
                 CAMVCMFF (w/o clip-txt)              80.5    39.9    29.1    58.6     131.8  22.7
                 CAMVCMFF (w/o clip-visual)           81.7    40.7    29.9    59.7     135.3  23.8
Swin Encoder     Swin Encoder (w/o global features)   82.4    41.1    30.1    60.2     138.3  24.1
                 Swin Encoder (w/ Transformer)        82.2    40.2    29.8    59.6     133.5  23.3
Full model       MVCMFAF (ours)                       83.2    41.6    30.4    60.5     140.6  24.4
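
To make the ablation results in Table 4 easier to read, the following sketch computes how many CIDEr points each ablated variant loses relative to the full MVCMFAF model. The scores are copied from the table; the function and variable names are illustrative only, and CIDEr is used here simply as one representative metric.

```python
# CIDEr scores copied from Table 4 (MSCOCO test set).
FULL_MODEL_CIDER = 140.6
ABLATION_CIDER = {
    "CAMVCMFF (w/o grid features)": 137.3,
    "CAMVCMFF (w/o region features)": 132.9,
    "CAMVCMFF (w/o clip features)": 130.2,
    "CAMVCMFF (w/o clip-txt)": 131.8,
    "CAMVCMFF (w/o clip-visual)": 135.3,
    "Swin Encoder (w/o global features)": 138.3,
    "Swin Encoder (w/ Transformer)": 133.5,
}


def cider_drops(full, variants):
    """Return (variant, drop) pairs sorted from largest to smallest CIDEr drop."""
    drops = {name: round(full - score, 1) for name, score in variants.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


ranked = cider_drops(FULL_MODEL_CIDER, ABLATION_CIDER)
for name, drop in ranked:
    print(f"{name}: -{drop:.1f} CIDEr")
```

On these numbers, removing the CLIP features costs the most CIDEr (10.4 points), while removing the Swin encoder's global features costs the least (2.3 points).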
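
Table 5  Comparison of MVCMFAF with other models in terms of computational cost (FLOPs), number of parameters (Np), and inference time (t)
Model            FLOPs/10^9  Np/MB    t/ms
Xmodal-Ctx[16]   127.614     35.439   137.107
DIFNet[8]        137.412     28.395   98.244
PureT[21]        882.301     224.201  238.937
MVCMFAF (ours)   137.461     175.769  446.157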
1 RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]//Proceedings of the International Conference on Machine Learning. Vienna: PMLR, 2021: 8748–8763.
2 LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation [C]//Proceedings of the International Conference on Machine Learning. Baltimore: PMLR, 2022: 12888–12900.
3 LI Zhixin, WEI Haiyang, ZHANG Canlong, et al. Research progress on image captioning [J]. Journal of Computer Research and Development, 2021, 58(9): 1951–1974. doi: 10.7544/issn1000-1239.2021.20200281
4 XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]//Proceedings of the International Conference on Machine Learning. Lille: JMLR, 2015: 2048–2057.
5 LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 3242–3250.
6 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1179–1195.
7 JIANG H, MISRA I, ROHRBACH M, et al. In defense of grid features for visual question answering [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 10264–10273.
8 WU M, ZHANG X, SUN X, et al. DIFNet: boosting visual information flow for image captioning [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 17999–18008.
9 HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 4633–4642.
10 PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 10968–10977.
11 CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory transformer for image captioning [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 10575–10584.
12 JI J, LUO Y, SUN X, et al. Improving image captioning by leveraging intra- and inter-layer global representation in transformer network [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1655–1663. doi: 10.1609/aaai.v35i2.16258
13 ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 15460–15469.
14 LUO Y, JI J, SUN X, et al. Dual-level collaborative transformer for image captioning [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2286–2293.
15 LI X, YIN X, LI C, et al. Oscar: object-semantics aligned pre-training for vision-language tasks [C]//Proceedings of the 16th European Conference on Computer Vision. Cham: Springer, 2020: 121–137.
16 KUO C W, KIRA Z. Beyond a pre-trained object detector: cross-modal textual and visual context for image captioning [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 17948–17958.
17 KUO C W, KIRA Z. HAAV: hierarchical aggregation of augmented views for image captioning [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 11039–11049.
18 LIU Z, LIU J, MA F. Improving cross-modal alignment with synthetic pairs for text-only image captioning [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(4): 3864–3872. doi: 10.1609/aaai.v38i4.28178
19 QIU L, NING S, HE X. Mining fine-grained image-text alignment for zero-shot captioning via text-only training [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(5): 4605–4613. doi: 10.1609/aaai.v38i5.28260
20 LEE J R, SHIN Y, SON G, et al. Diffusion bridge: leveraging diffusion model to reduce the modality gap between text and vision for zero-shot image captioning [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2025: 4050–4059.
21 WANG Y, XU J, SUN Y. End-to-end transformer based model for image captioning [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(3): 2585–2594. doi: 10.1609/aaai.v36i3.20160
22 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]//Annual Conference on Neural Information Processing Systems. Long Beach: NeurIPS Foundation, 2017: 5998–6008.
23 LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2022: 9992–10002.
24 XIONG Y, LIAO R, ZHAO H, et al. UPSNet: a unified panoptic segmentation network [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2020: 8810–8818.
25 KRISHNA R, ZHU Y, GROTH O, et al. Visual genome: connecting language and vision using crowdsourced dense image annotations [J]. International Journal of Computer Vision, 2017, 123(1): 32–73. doi: 10.1007/s11263-016-0981-7
26 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]//Proceedings of the 13th European Conference on Computer Vision. Cham: Springer, 2014: 740–755.
27 KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664–676. doi: 10.1109/TPAMI.2016.2598339
28 PAPINENI K, ROUKOS S, WARD T, et al. Bleu: a method for automatic evaluation of machine translation [C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia: ACL, 2002: 311–318.
29 LAVIE A, AGARWAL A. Meteor: an automatic metric for MT evaluation with high levels of correlation with human judgments [C]//Proceedings of the 2nd Workshop on Statistical Machine Translation. Prague: ACL, 2007: 228–231.
30 LIN C Y. ROUGE: a package for automatic evaluation of summaries [C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. Barcelona: ACL, 2004.
31 VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 4566–4575.
32 ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation [C]//Proceedings of the 14th European Conference on Computer Vision. Cham: Springer, 2016: 382–398.
33 LIU Maofu, SHI Qi, NIE Liqiang. Image captioning based on visual relevance and context dual attention [J]. Journal of Software, 2022, 33(9): 3210–3222. doi: 10.13328/j.cnki.jos.006623
34 LI J, VO D M, SUGIMOTO A, et al. EVCap: retrieval-augmented image captioning with external visual-name memory for open-world comprehension [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 13733–13742.
35 WANG J, WANG W, WANG L, et al. Learning visual relationship and context-aware attention for image captioning [J]. Pattern Recognition, 2020, 98: 107075. doi: 10.1016/j.patcog.2019.107075
36 LI Zhixin, WEI Haiyang, HUANG Feicheng, et al. Combine visual features and scene semantics for image captioning [J]. Chinese Journal of Computers, 2020, 43(9): 1624–1640. doi: 10.11897/SP.J.1016.2020.01624