Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (2): 360-369    DOI: 10.3785/j.issn.1008-973X.2026.02.014
    
Text-to-image generation method based on multimodal semantic information
Bing YANG 1,2, Jiahui ZHOU 1,2, Jinliang YAO 1,2, Xueqin XIANG 3
1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
2. Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, Hangzhou Dianzi University, Hangzhou 310018, China
3. Hangzhou Lingban Technology Limited Company, Hangzhou 311121, China

Abstract  

A new method was proposed to address text-image semantic inconsistency and insufficient detail in text-to-image generation. A discrimination mechanism was established that integrates real-image semantics with textual descriptions, compensating for the low information density of text and thereby alleviating missing or distorted details in synthesized images. Deformable convolution and star product convolution were incorporated into the generator, enhancing its representational capacity and improving both fine-grained detail and overall image quality. To validate the effectiveness of the proposed method, the model was trained and evaluated on the CUB and COCO datasets. Compared with the generative adversarial CLIP model (GALIP), the proposed method offers significant advantages in detail representation, semantic consistency, and overall image quality while retaining efficient generation.
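The star product convolution mentioned above follows the "star operation" of [7], in which two parallel linear projections of the same feature map are fused by element-wise multiplication. The following is a minimal PyTorch sketch of such a block; the hidden width, activation, and residual layout are assumptions rather than the paper's exact generator design.

import torch
from torch import nn

class StarBlock(nn.Module):
    """Star-product block in the spirit of [7]: two parallel 1x1 projections
    are fused by element-wise multiplication (the "star" operation), which
    implicitly lifts features into a high-dimensional nonlinear space.
    The hidden width and residual layout here are assumptions."""

    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.f1 = nn.Conv2d(channels, hidden, 1)
        self.f2 = nn.Conv2d(channels, hidden, 1)
        self.act = nn.ReLU6()
        self.proj = nn.Conv2d(hidden, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.dw(x)                            # local mixing (depthwise conv)
        y = self.act(self.f1(y)) * self.f2(y)     # star product fusion
        return x + self.proj(y)                   # residual connection

feat = torch.randn(2, 64, 32, 32)                 # e.g. a generator feature map
print(StarBlock(64)(feat).shape)                  # torch.Size([2, 64, 32, 32])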



Key words: text-to-image generation, multimodal semantics, deformable convolution, star product convolution, semantic alignment discriminator
Received: 23 January 2025      Published: 03 February 2026
CLC:  TP 391  
Fund: Zhejiang Provincial Basic Public Welfare Research Program (LGG22F020027).
Cite this article:

Bing YANG, Jiahui ZHOU, Jinliang YAO, Xueqin XIANG. Text-to-image generation method based on multimodal semantic information. Journal of ZheJiang University (Engineering Science), 2026, 60(2): 360-369.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.02.014     OR     https://www.zjujournals.com/eng/Y2026/V60/I2/360


Fig.1 Framework diagram of text-to-image generation method based on multimodal semantic information
Fig.2 Structure of deformable convolution
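As context for Fig. 2: deformable convolution augments a standard convolution with learned per-position sampling offsets, so the kernel can follow object geometry rather than a fixed grid. A minimal sketch using torchvision's DeformConv2d is shown below; the offset-branch design and kernel size are assumptions, not the paper's exact configuration.

import torch
from torch import nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """A plain conv predicts 2 offsets (dx, dy) per kernel element and per
    output position; DeformConv2d then samples the input at those shifted
    locations."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offset(x))

x = torch.randn(1, 64, 32, 32)
print(DeformBlock(64, 64)(x).shape)  # torch.Size([1, 64, 32, 32])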
Fig.3 Structure of semantic alignment block
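The semantic alignment block of Fig. 3 is what lets the discriminator judge images against both the caption and real-image semantics. The sketch below is a hypothetical rendering of that idea: a text embedding and a CLIP embedding of a real reference image are fused into one semantic vector that conditions patch-level discrimination. All layer choices and names are assumptions; the actual block in the paper may differ.

import torch
from torch import nn

class SemanticAlignmentBlock(nn.Module):
    def __init__(self, feat_ch: int, sem_dim: int = 512):
        super().__init__()
        # fuse text semantics with real-image semantics into one vector
        self.fuse = nn.Sequential(nn.Linear(2 * sem_dim, sem_dim), nn.LeakyReLU(0.2))
        self.img_proj = nn.Conv2d(feat_ch, sem_dim, 1)
        self.score = nn.Conv2d(sem_dim, 1, 4)      # patch-level consistency score

    def forward(self, img_feat, text_emb, real_img_emb):
        sem = self.fuse(torch.cat([text_emb, real_img_emb], dim=-1))
        joint = self.img_proj(img_feat) * sem[..., None, None]  # condition features on semantics
        return self.score(joint)

block = SemanticAlignmentBlock(feat_ch=256)
out = block(torch.randn(2, 256, 8, 8), torch.randn(2, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 1, 5, 5])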
Model | CUB FID↓ | CUB IS↑ | CUB SCLIP | COCO FID↓ | COCO IS↑ | COCO SCLIP
VQ-Diffusion[23] | 10.32 | — | 0.3224 | 13.86 | — | 0.3382
DFGAN | 14.81 | 5.10 | 0.2920 | 19.32 | 35.16 | 0.2972
RATGAN | 13.91 | 5.36 | — | 14.60 | 36.42 | —
DMF-GAN[24] | 13.21 | 5.42 | — | 15.83 | 36.72 | —
SAW-GAN[25] | 10.45 | 4.63 | — | 11.17 | 35.17 | —
GALIP | 10.08 | 5.92 | 0.3164 | 5.85 | 37.11 | 0.3338
This study | 9.56 | 6.04 | 0.3252 | 5.62 | 37.36 | 0.3405
Tab.1 Comparison of evaluation metrics for different models on two datasets
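SCLIP in Tab. 1 is a CLIP-based text-image similarity score. Assuming it is the cosine similarity between the CLIP embeddings of a generated image and its caption, averaged over the test set (the ViT-B/32 backbone below is an assumption), a minimal evaluation sketch is:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of one image and one caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()

# e.g. clip_score("generated_bird.png", "a small bird with a red head and white belly")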
Model | Type | tg/s | np/10^9 | ZS-FID↓
Make-a-scene | Autoregressive | 9.40 | 8.0 | 11.84
LDM | Diffusion | 15.00 | 1.5 | 12.63
UFOGen | Diffusion + GAN | 0.09 | 0.9 | 12.78
This study | GAN | 0.04 | 0.3 | 12.48
Tab.2 Comparison of image generation speed for different models on COCO dataset
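Tab. 2 reports the per-image generation time tg (in seconds) and the parameter count np (in billions), together with the zero-shot FID. A rough way to measure the first two quantities is sketched below; the generator call signature (noise, text embedding) is an assumption.

import time
import torch

def generation_stats(generator, noise, text_emb, warmup: int = 3, runs: int = 10):
    """Return (average seconds per forward pass, total parameter count)."""
    n_params = sum(p.numel() for p in generator.parameters())
    with torch.no_grad():
        for _ in range(warmup):                   # warm-up runs, excluded from timing
            generator(noise, text_emb)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            generator(noise, text_emb)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs, n_params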
Fig.4 Text-to-image generation comparison of different models on CUB dataset
Fig.5 Text-to-image generation comparison of different models on COCO dataset
Baseline | Star product conv. | Deformable conv. | Semantic alignment discriminator | CUB FID↓ | CUB SCLIP | COCO FID↓ | COCO SCLIP
 | | | | 10.08 | 0.3164 | 5.85 | 0.3338
 | | | | 9.70 | 0.3184 | 5.76 | 0.3352
 | | | | 9.89 | 0.3176 | 5.80 | 0.3343
 | | | | 9.97 | 0.3191 | 5.72 | 0.3368
 | | | | 9.68 | 0.3205 | 5.69 | 0.3341
 | | | | 9.62 | 0.3222 | 5.65 | 0.3382
 | | | | 9.93 | 0.3198 | 5.71 | 0.3375
 | | | | 9.56 | 0.3259 | 5.62 | 0.3405
Tab.3 Module ablation study of text-to-image generation method based on multimodal semantic information
Number of semantic alignment blocks | FID↓ | SCLIP
1 | 10.18 | 0.3158
2 | 10.03 | 0.3174
3 | 9.97 | 0.3191
4 | 10.05 | 0.3185
Tab.4 Ablation experiment results of semantic alignment discriminator
Fig.6 Validation of the smoothness of the model's latent space on CUB dataset
[1]   GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. [S.l.]: MIT Press, 2014: 2672–2680.
[2]   HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [EB/OL]. (2020−12−16)[2024−12−23]. https://arxiv.org/pdf/2006.11239.
[3]   QIAO T, ZHANG J, XU D, et al. MirrorGAN: learning text-to-image generation by redescription [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1505–1514.
[4]   TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16494–16504.
[5]   TAO M, BAO B K, TANG H, et al. GALIP: generative adversarial CLIPs for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 14214–14223.
[6]   RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// Proceedings of the International Conference on Machine Learning. [S.l.]: PMLR, 2021: 8748–8763.
[7]   MA X, DAI X, BAI Y, et al. Rewrite the stars [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5694–5703.
[8]   XIONG Y, LI Z, CHEN Y, et al. Efficient deformable ConvNets: rethinking dynamic and sparse operator for vision applications [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 5652–5661.
[9]   REED S, AKATA Z, YAN X, et al. Generative adversarial text to image synthesis [C]// Proceedings of the 33rd International Conference on International Conference on Machine Learning. [S.l.]: ACM, 2016: 1060–1069.
[10]   ZHANG H, XU T, LI H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5908–5916.
[11]   TAN H, LIU X, YIN B, et al. DR-GAN: distribution regularization for text-to-image generation [J]. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(12): 10309–10323. doi: 10.1109/TNNLS.2022.3165573
[12]   YE S, WANG H, TAN M, et al. Recurrent affine transformation for text-to-image synthesis [J]. IEEE Transactions on Multimedia, 2024, 26: 462–473. doi: 10.1109/TMM.2023.3266607
[13]   ZHOU Y, ZHANG R, CHEN C, et al. Towards language-free training for text-to-image generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 17886–17896.
[14]   RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. (2022−04−13)[2024−12−23]. https://arxiv.org/pdf/2204.06125.
[15]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2017: 6000–6010.
[16]   DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. (2021−06−03)[2024−12−23]. https://arxiv.org/pdf/2010.11929.
[17]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[18]   WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD birds-200-2011 dataset [DB/OL]. (2022−08−12)[2024−06−23]. https://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf.
[19]   LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Computer Vision – ECCV 2014. [S.l.]: Springer, 2014: 740–755.
[20]   HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. [S.l.]: ACM, 2017: 6629–6640.
[21]   WU C, LIANG J, JI L, et al. NÜWA: visual synthesis pre-training for neural visual world creation [C]// Computer Vision – ECCV 2022. [S.l.]: Springer, 2022: 720–736.
[22]   SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. [S.l.]: ACM, 2016: 2234–2242.
[23]   GU S, CHEN D, BAO J, et al. Vector quantized diffusion model for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10686–10696.
[24]   YANG B, XIANG X, KONG W, et al. DMF-GAN: deep multimodal fusion generative adversarial networks for text-to-image synthesis [J]. IEEE Transactions on Multimedia, 2024, 26: 6956–6967. doi: 10.1109/TMM.2024.3358086
[25]   JIN D, YU Q, YU L, et al. SAW-GAN: multi-granularity text fusion generative adversarial networks for text-to-image generation [J]. Knowledge-Based Systems, 2024, 294: 111795. doi: 10.1016/j.knosys.2024.111795
[26]   GAFNI O, POLYAK A, ASHUAL O, et al. Make-a-scene: scene-based text-to-image generation with human priors [C]// Computer Vision – ECCV 2022. [S.l.]: Springer, 2022: 89–106.
[27]   ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10674–10685.