Computer Technology and Control Engineering

Image captioning based on cross-modal cascaded diffusion model
Qiaohong CHEN, Menghao GUO, Xian FANG*, Qi SUN
School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

Cite this article:
Qiaohong CHEN, Menghao GUO, Xian FANG, Qi SUN. Image captioning based on cross-modal cascaded diffusion model [J]. Journal of Zhejiang University (Engineering Science), 2025, 59(4): 787–794.

Link to this article:
https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.04.014
or
https://www.zjujournals.com/eng/CN/Y2025/V59/I4/787

1 VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3156–3164.
2 XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]// Proceedings of the 32nd International Conference on Machine Learning. Lille: [s.n.], 2015: 2048–2057.
3 CHEN Qiaohong, PEI Haolei, SUN Qi. Image caption based on relational reasoning and context gate mechanism [J]. Journal of Zhejiang University (Engineering Science), 2022, 56(3): 542–549.
4 WANG Xin, SONG Yonghong, ZHANG Yuanlin. Salient feature extraction mechanism for image captioning [J]. Acta Automatica Sinica, 2022, 48(3): 735–746.
5 ZHUO Yaqi, WEI Jiahui, LI Zhixin. Research on image captioning based on double attention model [J]. Acta Electronica Sinica, 2022, 50(5): 1123–1130. doi: 10.12263/DZXB.20210696
6 SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal: ACM, 2014: 3104–3112.
7 MAO J, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks [EB/OL]. (2014-10-04)[2023-10-20]. https://arxiv.org/pdf/1410.1090.
8 ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6077–6086.
9 LIU Maofu, SHI Qi, NIE Liqiang. Image captioning based on visual relevance and context dual attention [J]. Journal of Software, 2022, 33(9): 3210–3222.
10 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2017: 6000–6010.
11 DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24)[2023-10-20]. https://arxiv.org/pdf/1810.04805.
12 HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 4634–4643.
13 PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 10971–10980.
14 CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory transformer for image captioning [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 10578–10587.
15 WANG Y, XU J, SUN Y. End-to-end transformer based model for image captioning [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(3): 2585–2594. doi: 10.1609/aaai.v36i3.20160
16 GAO J, MENG X, WANG S, et al. Masked non-autoregressive image captioning [EB/OL]. (2019-06-03)[2023-10-20]. https://arxiv.org/pdf/1906.00717.
17 GUO L, LIU J, ZHU X, et al. Non-autoregressive image captioning with counterfactuals-critical multi-agent learning [C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence. Yokohama: ACM, 2021: 767–773.
18 HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver: ACM, 2020: 6840–6851.
19 SONG J, MENG C, ERMON S. Denoising diffusion implicit models [C]// International Conference on Learning Representations. [S.l.]: ICLR, 2020: 1–20.
20 ZHOU Y, ZHANG Y, HU Z, et al. Semi-autoregressive transformer for image captioning [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Montreal: IEEE, 2021: 3139–3143.
21 CHEN T, ZHANG R, HINTON G. Analog bits: generating discrete data using diffusion models with self-conditioning [C]// International Conference on Learning Representations. [S.l.]: ICLR, 2023: 1–23.
22 HE Y, CAI Z, GAN X, et al. DiffCap: exploring continuous diffusion on image captioning [EB/OL]. (2023-05-20)[2023-10-20]. https://arxiv.org/pdf/2305.12144.
23 HO J, SAHARIA C, CHAN W, et al. Cascaded diffusion models for high fidelity image generation [J]. Journal of Machine Learning Research, 2022, 23(1): 2249–2281.
24 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Computer Vision – ECCV 2014. [S.l.]: Springer, 2014: 740–755.
25 PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 2641–2649.
26 KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3128–3137.
27 PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation [C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia: ACL, 2002: 311–318.
28 DENKOWSKI M, LAVIE A. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems [C]// Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh: ACL, 2011: 85–91.
29 LIN C Y. ROUGE: a package for automatic evaluation of summaries [C]// Text Summarization Branches Out. Barcelona: ACL, 2004: 74–81.
30 VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 4566–4575.
31 LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation [C]// Proceedings of the International Conference on Machine Learning. [S.l.]: PMLR, 2022: 12888–12900.
32 RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// Proceedings of the International Conference on Machine Learning. [S.l.]: PMLR, 2021: 8748–8763.
33 KINGMA D P, BA J. Adam: a method for stochastic optimization [EB/OL]. (2017-03-30)[2023-10-20]. https://arxiv.org/pdf/1412.6980.
34 LU J, YANG J, BATRA D, et al. Neural baby talk [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7219–7228.
35 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1179–1195.
36 JIANG W, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning [C]// Computer Vision – ECCV 2018. [S.l.]: Springer, 2018: 510–526.
37 YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning [C]// Computer Vision – ECCV 2018. [S.l.]: Springer, 2018: 711–727.
38 HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: transforming objects into words [EB/OL]. (2020-01-11)[2023-10-20]. https://arxiv.org/pdf/1906.05963.
39 ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 15465–15474.
40 WANG N, XIE J, WU J, et al. Controllable image captioning via prompting [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(2): 2617–2625. doi: 10.1609/aaai.v37i2.25360
41 YU H, LIU Y, QI B, et al. End-to-end non-autoregressive image captioning [C]// Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1–5.
42 LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 375–383.