Journal of Zhejiang University (Engineering Science)  2026, Vol. 60, Issue (2): 360-369    DOI: 10.3785/j.issn.1008-973X.2026.02.014
Computer Technology and Control Engineering
Text-to-image generation method based on multimodal semantic information
Bing YANG1,2, Jiahui ZHOU1,2, Jinliang YAO1,2, Xueqin XIANG3
1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
2. Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou 310018, China
3. Hangzhou Lingban Technology Limited Company, Hangzhou 311121, China
Abstract:

A new method was proposed to address text-image semantic inconsistencies and detail deficiencies in text-to-image generation. A discrimination mechanism was established that integrates real-image semantics with textual descriptions, mitigating text’s inherent information sparsity to alleviate detail omission or distortion in synthesized images. Deformable and star product convolutions were incorporated into a generator, enhancing the structural adaptability of the generator to improve fine-grained rendering and overall fidelity. To validate the effectiveness of the proposed method, model training and evaluation were conducted on the CUB and COCO datasets. Compared to generative adversarial networks trained with contrastive language-image pretraining (GALIP), the proposed method offers significant advantages in detail representation, semantic consistency, and overall image quality, while achieving efficient generation.

Key words: text-to-image generation    multimodal semantics    deformable convolution    star product convolution    semantic alignment discriminator
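The star module convolution named in the keywords builds on the star operation of Ma et al. [7], whose core is an element-wise product of two linear branches. Below is a minimal PyTorch sketch of such a block; the layer widths, activation, and residual layout are illustrative assumptions, not the paper's actual generator configuration.

```python
# Minimal sketch of the "star" operation (element-wise product of two branches) [7].
import torch
import torch.nn as nn

class StarBlock(nn.Module):
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.f1 = nn.Conv2d(channels, hidden, kernel_size=1)    # branch 1
        self.f2 = nn.Conv2d(channels, hidden, kernel_size=1)    # branch 2
        self.proj = nn.Conv2d(hidden, channels, kernel_size=1)  # project back
        self.act = nn.ReLU6()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multiplying the branches implicitly maps features into a
        # high-dimensional nonlinear space without widening the network.
        y = self.act(self.f1(x)) * self.f2(x)
        return x + self.proj(y)                                  # residual connection

x = torch.randn(1, 64, 32, 32)
print(StarBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```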
Received: 2025-01-23    Published: 2026-02-03
CLC:  TP 391  
Funding: Basic Public Welfare Research Project of Zhejiang Province (LGG22F020027).
About the author: YANG Bing (b. 1985), female, associate professor, Ph.D., research interests include computer vision and machine learning. orcid.org/0000-0002-0585-0579. E-mail: yb@hdu.edu.cn

Cite this article:


Bing YANG, Jiahui ZHOU, Jinliang YAO, Xueqin XIANG. Text-to-image generation method based on multimodal semantic information. Journal of Zhejiang University (Engineering Science), 2026, 60(2): 360-369.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.02.014        https://www.zjujournals.com/eng/CN/Y2026/V60/I2/360

Fig. 1  Framework of the text-to-image generation method based on multimodal semantic information
Fig. 2  Structure of deformable convolution
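As a hedged illustration of the deformable convolution shown in Fig. 2, the sketch below predicts per-pixel sampling offsets from the input and applies torchvision.ops.DeformConv2d; the channel counts and kernel size are assumptions and do not reproduce the paper's actual generator layers.

```python
# Minimal sketch (assumption: PyTorch + torchvision; sizes are illustrative only).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """3x3 deformable convolution whose sampling offsets are predicted from the input."""
    def __init__(self, channels: int):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset_pred = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(x)        # (N, 18, H, W)
        return self.deform_conv(x, offsets)  # sampling grid shifts per spatial location

feat = torch.randn(1, 64, 32, 32)
print(DeformableBlock(64)(feat).shape)       # torch.Size([1, 64, 32, 32])
```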
Fig. 3  Structure of the semantic alignment block
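The paper's semantic alignment block is only depicted in Fig. 3 and is not reproduced here; the generic sketch below shows one way a discriminator head could score generated-image features against both a CLIP text embedding and a real-image CLIP embedding, which is the kind of multimodal criterion the abstract describes. All names, layer sizes, and the scoring rule are hypothetical.

```python
# Purely illustrative sketch, not the paper's actual block design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAlignmentHead(nn.Module):
    def __init__(self, feat_channels: int = 256, clip_dim: int = 512):
        super().__init__()
        # Project spatial image features into the CLIP embedding dimension.
        self.img_proj = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(feat_channels, clip_dim))

    def forward(self, img_feat, text_emb, real_img_emb):
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(text_emb, dim=-1)
        r = F.normalize(real_img_emb, dim=-1)
        # Alignment score that combines text semantics and real-image semantics.
        return (v * t).sum(-1) + (v * r).sum(-1)

head = SemanticAlignmentHead()
score = head(torch.randn(2, 256, 8, 8), torch.randn(2, 512), torch.randn(2, 512))
print(score.shape)  # torch.Size([2])
```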
Model | CUB: FID↓ | CUB: IS↑ | CUB: S_CLIP | COCO: FID↓ | COCO: IS↑ | COCO: S_CLIP
VQ-Diffusion [23] | 10.32 | – | 0.3224 | 13.86 | – | 0.3382
DF-GAN | 14.81 | 5.10 | 0.2920 | 19.32 | 35.16 | 0.2972
RAT-GAN | 13.91 | 5.36 | – | 14.60 | 36.42 | –
DMF-GAN [24] | 13.21 | 5.42 | – | 15.83 | 36.72 | –
SAW-GAN [25] | 10.45 | 4.63 | – | 11.17 | 35.17 | –
GALIP | 10.08 | 5.92 | 0.3164 | 5.85 | 37.11 | 0.3338
Ours | 9.56 | 6.04 | 0.3252 | 5.62 | 37.36 | 0.3405
Table 1  Comparison of evaluation metrics of different models on the two datasets
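S_CLIP in Tables 1 and 3 is not defined on this page; assuming it is the standard CLIP similarity score, it corresponds to the cosine similarity between CLIP embeddings of the caption and of the generated image, averaged over a test set. The sketch below uses the Hugging Face transformers CLIP implementation; the checkpoint, the caption, and the placeholder image are assumptions.

```python
# Hedged sketch of a CLIP similarity (S_CLIP-style) score for one image-caption pair.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "a small bird with a red head and a short beak"
image = Image.new("RGB", (224, 224))  # placeholder for a generated image

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity in CLIP's joint embedding space.
s_clip = torch.nn.functional.cosine_similarity(txt, img).item()
print(f"CLIP similarity: {s_clip:.4f}")
```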
Model | Type | t_g/s | n_p/10^9 | ZS-FID↓
Make-a-scene | Autoregressive | 9.40 | 8.0 | 11.84
LDM | Diffusion | 15.00 | 1.5 | 12.63
UFOGen | Diffusion + GAN | 0.09 | 0.9 | 12.78
Ours | GAN | 0.04 | 0.3 | 12.48
Table 2  Comparison of image generation speed of different models on the COCO dataset
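For reference, the two quantities compared in Table 2 are typically measured as the wall-clock time of one generation forward pass (t_g) and the total parameter count (n_p). The sketch below uses a stand-in module rather than the paper's model, so the printed numbers are meaningless; only the measurement procedure is illustrated.

```python
# Hedged sketch of measuring per-image generation time and parameter count.
import time
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 3 * 64 * 64))

n_p = sum(p.numel() for p in generator.parameters())  # total parameter count
z = torch.randn(1, 100)
with torch.no_grad():
    start = time.perf_counter()
    _ = generator(z)                                   # one image's forward pass
    t_g = time.perf_counter() - start

print(f"n_p = {n_p / 1e9:.4f} x 10^9, t_g = {t_g:.4f} s")
```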
Fig. 4  Comparison of text-to-image generation by different models on the CUB dataset
Fig. 5  Comparison of text-to-image generation by different models on the COCO dataset
Module combination | CUB: FID↓ | CUB: S_CLIP | COCO: FID↓ | COCO: S_CLIP
Baseline | 10.08 | 0.3164 | 5.85 | 0.3338
 | 9.70 | 0.3184 | 5.76 | 0.3352
 | 9.89 | 0.3176 | 5.80 | 0.3343
 | 9.97 | 0.3191 | 5.72 | 0.3368
 | 9.68 | 0.3205 | 5.69 | 0.3341
 | 9.62 | 0.3222 | 5.65 | 0.3382
 | 9.93 | 0.3198 | 5.71 | 0.3375
Baseline + star module convolution + deformable convolution + semantic alignment discriminator | 9.56 | 0.3259 | 5.62 | 0.3405
Table 3  Module ablation of the text-to-image generation method based on multimodal semantic information (rows are combinations of the star module convolution, deformable convolution, and semantic alignment discriminator added to the baseline)
Number of semantic alignment blocks | FID↓ | S_CLIP
1 | 10.18 | 0.3158
2 | 10.03 | 0.3174
3 | 9.97 | 0.3191
4 | 10.05 | 0.3185
Table 4  Ablation results for the semantic alignment discriminator
Fig. 6  Validation of the model's smooth latent space on the CUB dataset
1 GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. [S.l.]: MIT Press, 2014: 2672–2680.
2 HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [EB/OL]. (2020−12−16)[2024−12−23]. https://arxiv.org/pdf/2006.11239.
3 QIAO T, ZHANG J, XU D, et al. MirrorGAN: learning text-to-image generation by redescription [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1505–1514.
4 TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16494–16504.
5 TAO M, BAO B K, TANG H, et al. GALIP: generative adversarial CLIPs for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 14214–14223.
6 RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// Proceedings of the 38th International Conference on Machine Learning. [S.l.]: PMLR, 2021: 8748–8763.
7 MA X, DAI X, BAI Y, et al. Rewrite the stars [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5694–5703.
8 XIONG Y, LI Z, CHEN Y, et al. Efficient deformable ConvNets: rethinking dynamic and sparse operator for vision applications [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5652–5661.
9 REED S, AKATA Z, YAN X, et al. Generative adversarial text to image synthesis [C]// Proceedings of the 33rd International Conference on Machine Learning. [S.l.]: PMLR, 2016: 1060–1069.
10 ZHANG H, XU T, LI H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5908–5916.
11 TAN H, LIU X, YIN B, et al. DR-GAN: distribution regularization for text-to-image generation [J]. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(12): 10309–10323.
doi: 10.1109/TNNLS.2022.3165573
12 YE S, WANG H, TAN M, et al. Recurrent affine transformation for text-to-image synthesis [J]. IEEE Transactions on Multimedia, 2024, 26: 462–473.
doi: 10.1109/TMM.2023.3266607
13 ZHOU Y, ZHANG R, CHEN C, et al. Towards language-free training for text-to-image generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 17886–17896.
14 RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. (2022−04−13)[2024−12−23]. https://arxiv.org/pdf/2204.06125.
15 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2017: 6000–6010.
16 DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. (2021−06−03)[2024−12−23]. https://arxiv.org/pdf/2010.11929.
17 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
18 WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD birds-200-2011 dataset [DB/OL]. (2022−08−12)[2024−06−23]. https://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf.
19 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Computer Vision – ECCV 2014. [S.l.]: Springer, 2014: 740–755.
20 HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2017: 6629–6640.
21 WU C, LIANG J, JI L, et al. NÜWA: visual synthesis pre-training for neural visual world creation [C]// Computer Vision – ECCV 2022. [S.l.]: Springer, 2022: 720–736.
22 SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2016: 2234–2242.
23 GU S, CHEN D, BAO J, et al. Vector quantized diffusion model for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10686–10696.
24 YANG B, XIANG X, KONG W, et al. DMF-GAN: deep multimodal fusion generative adversarial networks for text-to-image synthesis [J]. IEEE Transactions on Multimedia, 2024, 26: 6956–6967.
doi: 10.1109/TMM.2024.3358086
25 JIN D, YU Q, YU L, et al. SAW-GAN: multi-granularity text fusion generative adversarial networks for text-to-image generation [J]. Knowledge-Based Systems, 2024, 294: 111795.
doi: 10.1016/j.knosys.2024.111795
26 GAFNI O, POLYAK A, ASHUAL O, et al. Make-a-scene: scene-based text-to-image generation with human priors [C]// Computer Vision – ECCV 2022. [S.l.]: Springer, 2022: 89–106.
27 ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10674–10685.