Image captioning based on cross-modal cascaded diffusion model

doi:10.3785/j.issn.1008-973X.2025.04.014

Journal of ZheJiang University (Engineering Science)

2025, Vol. 59

Issue (4): 787-794 DOI: 10.3785/j.issn.1008-973X.2025.04.014

Image captioning based on cross-modal cascaded diffusion model

Qiaohong CHEN(

),Menghao GUO,Xian FANG*(

),Qi SUN

School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

Download:

HTML

PDF(1612KB) HTML
Export: BibTeX | EndNote (RIS)

Abstract

Current text diffusion model methods are ineffective in controlling the diffusion process based on semantic conditions, and the convergence of the diffusion model training process is challenging. A non-autoregressive image captioning method was proposed based on a cross-modal cascaded diffusion model. A cross-modal semantic alignment module was introduced to align the semantic relationships between visual and text modalities, with the aligned semantic feature vectors serving as the semantic condition for the subsequent diffusion model. By designing a cascaded diffusion model, rich semantic information was gradually introduced to ensure that the generated image description closely aligns with the overall context. A noise schedule was enhanced during the text diffusion process to increase the model’s sensitivity to text information, and the model was fully trained to enhance the overall performance of the model. Experimental results show that the proposed method generates more accurate and rich text descriptions than traditional image captioning methods. The proposed method significantly outperforms other non-autoregressive text generation methods in various evaluation metrics, which showcases the effectiveness and potential of using diffusion models in the task of image captioning.

Key words： deep learning image captioning diffusion model multi-model encoder cascaded structure

Received: 18 January 2024 Published: 25 April 2025

CLC:

TP 181

Fund: 浙江省自然科学基金资助项目（LQ23F020021）.

Corresponding Authors: Xian FANG E-mail: chen_lisa@zstu.edu.cn;xianfang@zstu.edu.cn

	Service
	E-mail this article
	Add to my bookshelf
	Add to citation manager
	E-mail Alert
	RSS
	Articles by authors
	Qiaohong CHEN
	Menghao GUO
	Xian FANG
	Qi SUN

Cite this article:

Qiaohong CHEN,Menghao GUO,Xian FANG,Qi SUN. Image captioning based on cross-modal cascaded diffusion model. Journal of ZheJiang University (Engineering Science), 2025, 59(4): 787-794.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.04.014 OR https://www.zjujournals.com/eng/Y2025/V59/I4/787

基于跨模态级联扩散模型的图像描述方法

现有文本扩散模型方法无法有效根据语义条件控制扩散过程，扩散模型训练过程的收敛较为困难，为此提出基于跨模态级联扩散模型的非自回归图像描述方法. 引入跨模态语义对齐模块用于对齐视觉模态和文本模态之间的语义关系，将对齐后的语义特征向量作为后续扩散模型的语义条件. 通过设计级联式的扩散模型逐步引入丰富的语义信息，确保生成的图像描述贴近整体语境. 增强文本扩散过程中的噪声计划以提升模型对文本信息的敏感性，充分训练模型以增强模型的整体性能. 实验结果表明，所提方法能够生成比传统图像描述生成方法更准确和丰富的文本描述. 所提方法在各项评价指标上均明显优于其他非自回归文本生成方法，展现了在图像描述任务中使用扩散模型的有效性和潜力.

关键词： 深度学习, 图像描述, 扩散模型, 多模态编码器, 级联结构

Fig.1 Overall structure of cross-modal cascaded diffusion model

Fig.2 Cascaded structure of diffusion model

Fig.3 forward process and reverse process of diffusion model

Tab.1 Module ablation experiment of model in two datasets

Tab.2 Ablation experiment of cross-modal semantic alignment module in two datasets

Tab.3 Selection experiment of layer parameter based on cascaded diffusion model

Fig.4 Results of text generation by different layer parameters of diffusion model

Tab.4 Selection experiment of noise enhancement level for diffusion model

Fig.5 Impact of different noise enhancement levels on accuracy

Tab.5 Performance comparison of different image description models in Microsoft COCO dataset

Tab.6 Performance comparison of different image description models in Flickr30k dataset

Fig.6 Parameter quantity and inference speed for different image description models

Fig.7 Image descriptions generated by different non-autoregressive methods in Microsoft COCO dataset


[1]	VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 3156–3164.

[2]	XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]// Proceedings of the 32nd International Conference on Machine Learning . Lille: [s. n.], 2015: 2048–2057.

[3]	陈巧红, 裴皓磊, 孙麒基于视觉关系推理与上下文门控机制的图像描述[J]. 浙江大学学报: 工学版, 2022, 56 (3): 542- 549 CHEN Qiaohong, PEI Haolei, SUN Qi Image caption based on relational reasoning and context gate mechanism[J]. Journal of Zhejiang University: Engineering Science, 2022, 56 (3): 542- 549

[4]	王鑫, 宋永红, 张元林基于显著性特征提取的图像描述算法[J]. 自动化学报, 2022, 48 (3): 735- 746 WANG Xin, SONG Yonghong, ZHANG Yuanlin Salient feature extraction mechanism for image captioning[J]. Acta Automatica Sinica, 2022, 48 (3): 735- 746

[5]	卓亚琦, 魏家辉, 李志欣基于双注意模型的图像描述生成方法研究[J]. 电子学报, 2022, 50 (5): 1123- 1130 ZHUO Yaqi, WEI Jiahui, LI Zhixin Research on image captioning based on double attention model[J]. Acta Electronica Sinica, 2022, 50 (5): 1123- 1130 doi: 10.12263/DZXB.20210696

[6]	SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems . Montreal: ACM, 2014: 3104–3112.

[7]	MAO J, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks [EB/OL]. (2014–10–04)[ 2023–10–20]. https://arxiv.org/pdf/1410.1090.

[8]	ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 6077–6086.

[9]	刘茂福, 施琦, 聂礼强基于视觉关联与上下文双注意力的图像描述生成方法[J]. 软件学报, 2022, 33 (9): 3210- 3222 LIU Maofu, SHI Qi, NIE Liqiang Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33 (9): 3210- 3222

[10]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems . [S. l.]: Curran Associates Inc. , 2017: 6000–6010.

[11]	DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019–05–24)[2023–10–20]. https://arxiv.org/pdf/1810.04805.

[12]	HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 4634–4643.

[13]	PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle: IEEE, 2020: 10971–10980.

[14]	CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory transformer for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Seattle: IEEE, 2020: 10578–10587.

[15]	WANG Y, XU J, SUN Y End-to-end transformer based model for image captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36 (3): 2585- 2594 doi: 10.1609/aaai.v36i3.20160

[16]	GAO J, MENG X, WANG S, et al. Masked non-autoregressive image captioning [EB/OL]. (2019–06–03) [2023–10–20]. https://arxiv.org/pdf/1906.00717.

[17]	GUO L, LIU J, ZHU X, et al. Non-autoregressive image captioning with counterfactuals-critical multi-agent learning [C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence . Yokohama: ACM, 2021: 767–773.

[18]	HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems . Vancouver: ACM, 2020: 6840–6851.

[19]	SONG J, MENG C, ERMON S. Denoising diffusion implicit models [C]// International Conference on Learning Representations . [S.l.]: ICLR, 2020: 1–20.

[20]	ZHOU Y, ZHANG Y, HU Z, et al. Semi-autoregressive transformer for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops . Montreal: IEEE, 2021: 3139–3143.

[21]	CHEN T, ZHANG R, HINTON G. Analog bits: generating discrete data using diffusion models with self-conditioning [C]// International Conference on Learning Representations . [S.l.]: ICLR, 2023: 1–23.

[22]	HE Y, CAI Z, GAN X, et al. DiffCap: exploring continuous diffusion on image captioning [EB/OL]. (2023–05–20) [2023–10–20]. https://arxiv.org/pdf/2305.12144.

[23]	HO J, SAHARIA C, CHAN W, et al Cascaded diffusion models for high fidelity image generation[J]. Journal of Machine Learning Research, 2022, 23 (1): 2249- 2281

[24]	LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Computer Vision – ECCV 2014 . [S.l.]: Springer, 2014: 740–755.

[25]	PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models [C]// Proceedings of the IEEE International Conference on Computer Vision . Santiago: IEEE, 2015: 2641–2649.

[26]	KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 3128–3137.

[27]	PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation [C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics . Philadelphia: ACL, 2002: 311–318.

[28]	DENKOWSKI M, LAVIE A. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems [C]// Proceedings of the Sixth Workshop on Statistical Machine Translation . Edinburgh: ACL, 2011: 85–91.

[29]	LIN CY. ROUGE: a package for automatic evaluation of summaries [C]// Text Summarization Branches Out . Barcelona: ACL, 2004: 74–81.

[30]	VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Boston: IEEE, 2015: 4566–4575.

[31]	LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation [C]// Proceedings of the International Conference on Machine Learning . [S. l.]: PMLR, 2022: 12888–12900.

[32]	RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// Proceedings of the International Conference on Machine Learning . [S. l.]: PMLR, 2021: 8748–8763.

[33]	KINGMA D P, BA J. ADAM: a method for stochastic optimization [EB/OL]. (2017–03–30)[2023–10–20]. https://arxiv.org/pdf/1412.6980.

[34]	LU J, YANG J, BATRA D, et al. Neural baby talk [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 7219–7228.

[35]	RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Honolulu: IEEE, 2017: 1179–1195.

[36]	JIANG W, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning [C]// Computer Vision – ECCV 2018 . [S.l.]: Springer, 2018: 510–526.

[37]	YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning [C]// Computer Vision – ECCV 2018 . [S.l.]: Springer, 2018: 711–727.

[38]	HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: transforming objects into words [EB/OL]. (2020–01–11)[2023–10–20]. https://arxiv.org/pdf/1906.05963.

[39]	ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Nashville: IEEE, 2021: 15465–15474.

[40]	WANG N, XIE J, WU J, et al Controllable image captioning via prompting[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37 (2): 2617- 2625 doi: 10.1609/aaai.v37i2.25360

[41]	YU H, LIU Y, QI B, et al. End-to-end non-autoregressive image captioning [C]// Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing . Rhodes Island: IEEE, 2023: 1–5.

[42]	LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Honolulu: IEEE, 2017: 375–383.

[1]	Li MA,Yongshun WANG,Yao HU,Lei FAN. Pre-trained long-short spatiotemporal interleaved Transformer for traffic flow prediction applications[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(4): 669-678.

[2]	Minghui YAO,Yueyan WANG,Qiliang WU,Yan NIU,Cong WANG. Siamese networks algorithm based on small human motion behavior recognition[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(3): 504-511.

[3]	Liming LIANG,Pengwei LONG,Jiaxin JIN,Renjie LI,Lu ZENG. Steel surface defect detection algorithm based on improved YOLOv8s[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(3): 512-522.

[4]	Kaibo YANG,Mingen ZHONG,Jiawei TAN,Zhiying DENG,Mengli ZHOU,Ziji XIAO. Small-scale sparse smoke detection in multiple fire scenarios based on semi-supervised learning[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(3): 546-556.

[5]	Zhichao CHEN,Jie YANG,Fan LI,Zhicheng FENG. Review on deep learning-based key algorithm for train running environment perception[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(1): 1-17.

[6]	Dengfeng LIU,Shihai CHEN,Wenjing GUO,Zhilei CHAI. Efficient halftone algorithm based on lightweight residual networks[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(1): 62-69.

[7]	Yi ZHAO,Chun AN,Minghao LI,Jianxiao MA,Shuo HUAI. Selection of lane-changing distance for vehicles in urban expressway interchange weaving section[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(1): 205-212.

[8]	Fan LI,Jie YANG,Zhicheng FENG,Zhichao CHEN,Yunxiao FU. Pantograph-catenary contact point detection method based on image recognition[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(9): 1801-1810.

[9]	Li XIAO,Zhigang CAO,Haoran LU,Zhijian HUANG,Yuanqiang CAI. Elastic metamaterial design based on deep learning and gradient optimization[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(9): 1892-1901.

[10]	Shuhan WU,Dan WANG,Yuanfang CHEN,Ziyu JIA,Yueqi ZHANG,Meng XU. Attention-fused filter bank dual-view graph convolution motor imagery EEG classification[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(7): 1326-1335.

[11]	Linrui LI,Dongsheng WANG,Hongjie FAN. Fact-based similar case retrieval methods based on statutory knowledge[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(7): 1357-1365.

[12]	Xianwei MA,Chaohui FAN,Weizhi NIE,Dong LI,Yiqun ZHU. Robust fault diagnosis method for failure sensors[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(7): 1488-1497.

[13]	Juan SONG,Longxi HE,Huiping LONG. Deep learning-based algorithm for multi defect detection in tunnel lining[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(6): 1161-1173.

[14]	Bo ZHONG,Pengfei WANG,Yiqiao WANG,Xiaoling WANG. Survey of deep learning based EEG data analysis technology[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 879-890.

[15]	Cuiting WEI,Weijian ZHAO,Bochao SUN,Yunyi LIU. Intelligent rebar inspection based on improved Mask R-CNN and stereo vision[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 1009-1019.

Viewed

Full text

Abstract

Cited

Shared

Discussed