基于跨模态级联扩散模型的图像描述方法

doi:10.3785/j.issn.1008-973X.2025.04.014

浙江大学学报(工学版)

2025, Vol. 59

Issue (4): 787-794 DOI: 10.3785/j.issn.1008-973X.2025.04.014

计算机技术与控制工程

基于跨模态级联扩散模型的图像描述方法

陈巧红(

),郭孟浩,方贤*(

),孙麒

浙江理工大学计算机科学与技术学院，浙江杭州 310018

Image captioning based on cross-modal cascaded diffusion model

Qiaohong CHEN(

),Menghao GUO,Xian FANG*(

),Qi SUN

School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

全文: PDF(1612 KB) HTML

摘要：

现有文本扩散模型方法无法有效根据语义条件控制扩散过程，扩散模型训练过程的收敛较为困难，为此提出基于跨模态级联扩散模型的非自回归图像描述方法. 引入跨模态语义对齐模块用于对齐视觉模态和文本模态之间的语义关系，将对齐后的语义特征向量作为后续扩散模型的语义条件. 通过设计级联式的扩散模型逐步引入丰富的语义信息，确保生成的图像描述贴近整体语境. 增强文本扩散过程中的噪声计划以提升模型对文本信息的敏感性，充分训练模型以增强模型的整体性能. 实验结果表明，所提方法能够生成比传统图像描述生成方法更准确和丰富的文本描述. 所提方法在各项评价指标上均明显优于其他非自回归文本生成方法，展现了在图像描述任务中使用扩散模型的有效性和潜力.

关键词： 深度学习; 图像描述; 扩散模型; 多模态编码器; 级联结构

Abstract:

Current text diffusion model methods are ineffective in controlling the diffusion process based on semantic conditions, and the convergence of the diffusion model training process is challenging. A non-autoregressive image captioning method was proposed based on a cross-modal cascaded diffusion model. A cross-modal semantic alignment module was introduced to align the semantic relationships between visual and text modalities, with the aligned semantic feature vectors serving as the semantic condition for the subsequent diffusion model. By designing a cascaded diffusion model, rich semantic information was gradually introduced to ensure that the generated image description closely aligns with the overall context. A noise schedule was enhanced during the text diffusion process to increase the model’s sensitivity to text information, and the model was fully trained to enhance the overall performance of the model. Experimental results show that the proposed method generates more accurate and rich text descriptions than traditional image captioning methods. The proposed method significantly outperforms other non-autoregressive text generation methods in various evaluation metrics, which showcases the effectiveness and potential of using diffusion models in the task of image captioning.

Key words: deep learning image captioning diffusion model multi-model encoder cascaded structure

收稿日期: 2024-01-18 出版日期: 2025-04-25

CLC:

TP 181

基金资助: 浙江省自然科学基金资助项目（LQ23F020021）.

通讯作者: 方贤 E-mail: chen_lisa@zstu.edu.cn;xianfang@zstu.edu.cn

作者简介: 陈巧红（1978—），女，教授，从事计算机辅助设计及机器学习技术研究. orcid.org/0000-0003-0595-341X. E-mail：chen_lisa@zstu.edu.cn

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	作者相关文章
	陈巧红
	郭孟浩
	方贤
	孙麒

引用本文:

陈巧红,郭孟浩,方贤,孙麒. 基于跨模态级联扩散模型的图像描述方法[J]. 浙江大学学报(工学版), 2025, 59(4): 787-794.

Qiaohong CHEN,Menghao GUO,Xian FANG,Qi SUN. Image captioning based on cross-modal cascaded diffusion model. Journal of ZheJiang University (Engineering Science), 2025, 59(4): 787-794.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.04.014 或 https://www.zjujournals.com/eng/CN/Y2025/V59/I4/787

图 1 跨模态级联扩散模型的总体结构

图 2 扩散模型的级联结构

图 3 扩散模型的正向过程和反向过程

表 1 模型在2个数据集上的模块消融实验

表 2 跨模态语义对齐模块在2个数据集上的消融实验

表 3 基于级联扩散模型的层参数选择实验

图 4 扩散模型不同层参数的文本生成描述结果

表 4 扩散模型的噪声增强级别选择实验

图 5 不同噪声增强级别对准确率的影响

表 5 Microsoft COCO 数据集中不同图像描述模型的性能对比

表 6 Flickr30k 数据集中不同图像描述模型的性能对比

图 6 不同图像描述模型的参数量和推理速度

图 7 不同非自回归方法在 Microsoft COCO 数据集上生成的图像描述

1	VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 3156–3164.
2	XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]// Proceedings of the 32nd International Conference on Machine Learning . Lille: [s. n.], 2015: 2048–2057.
3	陈巧红, 裴皓磊, 孙麒基于视觉关系推理与上下文门控机制的图像描述[J]. 浙江大学学报: 工学版, 2022, 56 (3): 542- 549 CHEN Qiaohong, PEI Haolei, SUN Qi Image caption based on relational reasoning and context gate mechanism[J]. Journal of Zhejiang University: Engineering Science, 2022, 56 (3): 542- 549
4	王鑫, 宋永红, 张元林基于显著性特征提取的图像描述算法[J]. 自动化学报, 2022, 48 (3): 735- 746 WANG Xin, SONG Yonghong, ZHANG Yuanlin Salient feature extraction mechanism for image captioning[J]. Acta Automatica Sinica, 2022, 48 (3): 735- 746
5	卓亚琦, 魏家辉, 李志欣基于双注意模型的图像描述生成方法研究[J]. 电子学报, 2022, 50 (5): 1123- 1130 ZHUO Yaqi, WEI Jiahui, LI Zhixin Research on image captioning based on double attention model[J]. Acta Electronica Sinica, 2022, 50 (5): 1123- 1130 doi: 10.12263/DZXB.20210696
6	SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems . Montreal: ACM, 2014: 3104–3112.
7	MAO J, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks [EB/OL]. (2014–10–04)[ 2023–10–20]. https://arxiv.org/pdf/1410.1090.
8	ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 6077–6086.
9	刘茂福, 施琦, 聂礼强基于视觉关联与上下文双注意力的图像描述生成方法[J]. 软件学报, 2022, 33 (9): 3210- 3222 LIU Maofu, SHI Qi, NIE Liqiang Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33 (9): 3210- 3222
10	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems . [S. l.]: Curran Associates Inc. , 2017: 6000–6010.
11	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019–05–24)[2023–10–20]. https://arxiv.org/pdf/1810.04805.
12	HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 4634–4643.
13	PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle: IEEE, 2020: 10971–10980.
14	CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory transformer for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle: IEEE, 2020: 10578–10587.
15	WANG Y, XU J, SUN Y End-to-end transformer based model for image captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36 (3): 2585- 2594 doi: 10.1609/aaai.v36i3.20160
16	GAO J, MENG X, WANG S, et al. Masked non-autoregressive image captioning [EB/OL]. (2019–06–03) [2023–10–20]. https://arxiv.org/pdf/1906.00717.
17	GUO L, LIU J, ZHU X, et al. Non-autoregressive image captioning with counterfactuals-critical multi-agent learning [C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence . Yokohama: ACM, 2021: 767–773.
18	HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems . Vancouver: ACM, 2020: 6840–6851.
19	SONG J, MENG C, ERMON S. Denoising diffusion implicit models [C]// International Conference on Learning Representations . [S.l.]: ICLR, 2020: 1–20.
20	ZHOU Y, ZHANG Y, HU Z, et al. Semi-autoregressive transformer for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops . Montreal: IEEE, 2021: 3139–3143.
21	CHEN T, ZHANG R, HINTON G. Analog bits: generating discrete data using diffusion models with self-conditioning [C]// International Conference on Learning Representations . [S.l.]: ICLR, 2023: 1–23.
22	HE Y, CAI Z, GAN X, et al. DiffCap: exploring continuous diffusion on image captioning [EB/OL]. (2023–05–20) [2023–10–20]. https://arxiv.org/pdf/2305.12144.
23	HO J, SAHARIA C, CHAN W, et al Cascaded diffusion models for high fidelity image generation[J]. Journal of Machine Learning Research, 2022, 23 (1): 2249- 2281
24	LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Computer Vision – ECCV 2014 . [S.l.]: Springer, 2014: 740–755.
25	PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models [C]// Proceedings of the IEEE International Conference on Computer Vision . Santiago: IEEE, 2015: 2641–2649.
26	KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 3128–3137.
27	PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation [C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics . Philadelphia: ACL, 2002: 311–318.
28	DENKOWSKI M, LAVIE A. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems [C]// Proceedings of the Sixth Workshop on Statistical Machine Translation . Edinburgh: ACL, 2011: 85–91.
29	LIN CY. ROUGE: a package for automatic evaluation of summaries [C]// Text Summarization Branches Out . Barcelona: ACL, 2004: 74–81.
30	VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 4566–4575.
31	LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation [C]// Proceedings of the International Conference on Machine Learning . [S. l.]: PMLR, 2022: 12888–12900.
32	RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// Proceedings of the International Conference on Machine Learning . [S. l.]: PMLR, 2021: 8748–8763.
33	KINGMA D P, BA J. ADAM: a method for stochastic optimization [EB/OL]. (2017–03–30)[2023–10–20]. https://arxiv.org/pdf/1412.6980.
34	LU J, YANG J, BATRA D, et al. Neural baby talk [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 7219–7228.
35	RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Honolulu: IEEE, 2017: 1179–1195.
36	JIANG W, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning [C]// Computer Vision – ECCV 2018 . [S.l.]: Springer, 2018: 510–526.
37	YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning [C]// Computer Vision – ECCV 2018 . [S.l.]: Springer, 2018: 711–727.
38	HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: transforming objects into words [EB/OL]. (2020–01–11)[2023–10–20]. https://arxiv.org/pdf/1906.05963.
39	ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 15465–15474.
40	WANG N, XIE J, WU J, et al Controllable image captioning via prompting[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37 (2): 2617- 2625 doi: 10.1609/aaai.v37i2.25360
41	YU H, LIU Y, QI B, et al. End-to-end non-autoregressive image captioning [C]// Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing . Rhodes Island: IEEE, 2023: 1–5.
42	LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Honolulu: IEEE, 2017: 375–383.

[1]	马莉,王永顺,胡瑶,范磊. 预训练长短时空交错Transformer在交通流预测中的应用[J]. 浙江大学学报(工学版), 2025, 59(4): 669-678.
[2]	姚明辉,王悦燕,吴启亮,牛燕,王聪. 基于小样本人体运动行为识别的孪生网络算法[J]. 浙江大学学报(工学版), 2025, 59(3): 504-511.
[3]	梁礼明,龙鹏威,金家新,李仁杰,曾璐. 基于改进YOLOv8s的钢材表面缺陷检测算法[J]. 浙江大学学报(工学版), 2025, 59(3): 512-522.
[4]	杨凯博,钟铭恩,谭佳威,邓智颖,周梦丽,肖子佶. 基于半监督学习的多场景火灾小规模稀薄烟雾检测[J]. 浙江大学学报(工学版), 2025, 59(3): 546-556.
[5]	陈智超,杨杰,李凡,冯志成. 基于深度学习的列车运行环境感知关键算法研究综述[J]. 浙江大学学报(工学版), 2025, 59(1): 1-17.
[6]	刘登峰,陈世海,郭文静,柴志雷. 基于轻量残差网络的高效半色调算法[J]. 浙江大学学报(工学版), 2025, 59(1): 62-69.
[7]	赵顗,安醇,李铭浩,马健霄,怀硕. 城市快速路互通交织区车辆的换道持续距离选择[J]. 浙江大学学报(工学版), 2025, 59(1): 205-212.
[8]	李凡,杨杰,冯志成,陈智超,付云骁. 基于图像识别的弓网接触点检测方法[J]. 浙江大学学报(工学版), 2024, 58(9): 1801-1810.
[9]	肖力,曹志刚,卢浩冉,黄志坚,蔡袁强. 基于深度学习和梯度优化的弹性超材料设计[J]. 浙江大学学报(工学版), 2024, 58(9): 1892-1901.
[10]	吴书晗,王丹,陈远方,贾子钰,张越棋,许萌. 融合注意力的滤波器组双视图图卷积运动想象脑电分类[J]. 浙江大学学报(工学版), 2024, 58(7): 1326-1335.
[11]	李林睿,王东升,范红杰. 基于法条知识的事理型类案检索方法[J]. 浙江大学学报(工学版), 2024, 58(7): 1357-1365.
[12]	马现伟,范朝辉,聂为之,李东,朱逸群. 对失效传感器具备鲁棒性的故障诊断方法[J]. 浙江大学学报(工学版), 2024, 58(7): 1488-1497.
[13]	宋娟,贺龙喜,龙会平. 基于深度学习的隧道衬砌多病害检测算法[J]. 浙江大学学报(工学版), 2024, 58(6): 1161-1173.
[14]	钟博,王鹏飞,王乙乔,王晓玲. 基于深度学习的EEG数据分析技术综述[J]. 浙江大学学报(工学版), 2024, 58(5): 879-890.
[15]	魏翠婷,赵唯坚,孙博超,刘芸怡. 基于改进Mask R-CNN与双目视觉的智能配筋检测[J]. 浙江大学学报(工学版), 2024, 58(5): 1009-1019.

Viewed

Full text

Abstract

Cited

Shared

Discussed