Journal of ZheJiang University (Engineering Science)  2024, Vol. 58 Issue (4): 674-683    DOI: 10.3785/j.issn.1008-973X.2024.04.003
    
Generative adversarial network based two-stage generation of high-quality images from text
Yin CAO 1,2, Junping QIN 1,2,*, Tong GAO 2,3, Qianli MA 1,2, Jiaqi REN 1,2
1. College of Data Science and Applications, Inner Mongolia University of Technology, Hohhot 010051, China
2. Inner Mongolia Autonomous Region Engineering Technology Research Center of Big Data Based Software Service, Hohhot 010000, China
3. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China

Abstract  

A generative adversarial network with deep fusion attention (DFA-GAN) was proposed, using multiple loss functions as constraints, to address the poor image quality and the inconsistency between text descriptions and generated images that affect traditional text-to-image generation methods. A two-stage image generation process was employed with a single-level generative adversarial network (GAN) as the backbone. The initial blurry image generated in the first stage was fed into the second stage, where it was regenerated at high quality to enhance the overall generation quality. In the first stage, a visual-text fusion module was designed to deeply integrate text features with image features, so that text information was adequately fused during image sampling at different scales. In the second stage, an image generator using an improved Vision Transformer as its encoder was proposed to fully fuse image features with the word-level features of the text description. Quantitative and qualitative experimental results showed that the proposed method outperformed other mainstream models in both image quality and consistency with the text descriptions.
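As a rough illustration of the pipeline described above, the following PyTorch sketch shows the two-stage flow: a text encoder yields sentence- and word-level features, stage one produces the initial image from noise and the sentence feature, and stage two regenerates it conditioned on the word features. All class and argument names here are hypothetical placeholders, not the paper's actual modules (those are specified in Figs. 2, 3 and 5).

```python
# Minimal sketch of a two-stage text-to-image flow in the spirit of DFA-GAN.
# All names are illustrative; the paper's exact layer configuration differs.
import torch
import torch.nn as nn

class TwoStageTextToImage(nn.Module):
    def __init__(self, text_encoder: nn.Module, stage1: nn.Module, stage2: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder  # returns sentence- and word-level features
        self.stage1 = stage1  # single-level GAN backbone with visual-text fusion
        self.stage2 = stage2  # regenerator whose encoder is an improved Vision Transformer

    def forward(self, tokens: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        sent_emb, word_embs = self.text_encoder(tokens)
        coarse = self.stage1(noise, sent_emb)      # stage 1: initial blurry image
        refined = self.stage2(coarse, word_embs)   # stage 2: high-quality regeneration
        return refined
```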



Key words: text-to-image; deep fusion; generative adversarial network (GAN); multi-scale feature fusion; semantic consistency
Received: 22 May 2023      Published: 27 March 2024
CLC:  TP 391  
Fund: National Natural Science Foundation of China (61962044); Natural Science Foundation of Inner Mongolia Autonomous Region (2019MS06005); Science and Technology Major Project of Inner Mongolia Autonomous Region (2021ZD0015); Basic Scientific Research Funds for Universities Directly Under the Inner Mongolia Autonomous Region (JY20220327).
Corresponding Authors: Junping QIN     E-mail: c1122335966@163.com;qinjunping30999@sina.com
Cite this article:

Yin CAO,Junping QIN,Tong GAO,Qianli MA,Jiaqi REN. Generative adversarial network based two-stage generation of high-quality images from text. Journal of ZheJiang University (Engineering Science), 2024, 58(4): 674-683.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.04.003     OR     https://www.zjujournals.com/eng/Y2024/V58/I4/674


Fig.1 Problems with mainstream text-to-image generation methods
Fig.2 Model architecture diagram of proposed method
Fig.3 Architecture of image generation stage with deep fusion of text features
Fig.4 Multi-scale fusion of text features and image features
Fig.5 Architecture of image generation stage with attention mechanism optimization
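As a rough sketch of the multi-scale visual-text fusion pictured in Fig. 4, the block below modulates an image feature map with per-channel scale and shift parameters predicted from the sentence embedding, in the spirit of the affine fusion used by DF-GAN [9]; the paper's actual module may differ, and all names and sizes here are illustrative assumptions.

```python
# Hedged sketch of one visual-text fusion block: the sentence embedding
# predicts per-channel scale and shift terms that modulate the image
# features. One such block would follow each up-sampling step so that
# text information conditions the features at every scale.
import torch
import torch.nn as nn

class VisualTextFusion(nn.Module):
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, channels)  # text -> per-channel scale
        self.to_beta = nn.Linear(text_dim, channels)   # text -> per-channel shift

    def forward(self, feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; sent_emb: (B, text_dim)
        gamma = self.to_gamma(sent_emb)[:, :, None, None]
        beta = self.to_beta(sent_emb)[:, :, None, None]
        return feat * (1.0 + gamma) + beta
```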
Model                CUB IS↑    CUB FID↓    COCO FID↓
StackGAN[6]          3.70       35.51       74.05
StackGAN++[7]        3.84       –           –
AttnGAN[2]           4.36       24.37       35.49
MirrorGAN[8]         4.56       18.34       34.71
textStyleGAN[28]     4.78       –           –
DM-GAN[29]           4.75       16.09       32.64
SD-GAN[30]           4.67       –           –
DF-GAN[9]            5.10       14.81       19.32
SSA-GAN[31]          5.17       15.61       19.37
RAT-GAN[32]          5.36       13.91       14.60
CogView2[33]         –          –           17.70
KNN-Diffusion[16]    –          –           16.66
DFA-GAN (stage 1)    4.53       16.07       25.09
DFA-GAN (stage 2)    5.34       10.96       19.17
Tab.1 Comparison of evaluation metrics of text-to-image generation methods on two datasets
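For reference, the FID values in Tab. 1 follow Heusel et al. [25]: Inception features of real and generated images are modeled as Gaussians and compared by FID = ||μ_r − μ_g||² + Tr(C_r + C_g − 2(C_r C_g)^{1/2}), with lower values indicating better quality. A minimal sketch, assuming the feature matrices have already been extracted:

```python
# FID computed from pre-extracted Inception activations [25].
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    # real_feats, gen_feats: (N, D) Inception features of real/generated images
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g).real  # matrix square root (real part)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```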
Model                        IS↑     RP/%↑
AttnGAN                      4.36    67.83
AttnGAN + DFA-GAN stage 2    5.11    70.06
DF-GAN                       5.10    44.83
DF-GAN + DFA-GAN stage 2     5.32    70.80
DFA-GAN                      5.34    72.67
Tab.2 Ablation experiments of different models on CUB dataset
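R-precision (RP) in Tab. 2 measures semantic consistency by retrieval: following the protocol of AttnGAN [2], each generated image's embedding is ranked against candidate text embeddings (its ground-truth description plus randomly sampled mismatched ones), and a retrieval counts as correct when the ground truth ranks first. A minimal sketch, assuming pre-computed embeddings with the true caption at index 0:

```python
# R-precision from pre-computed image and candidate-text embeddings.
import numpy as np

def r_precision(img_embs: np.ndarray, cand_text_embs: np.ndarray) -> float:
    # img_embs: (N, D); cand_text_embs: (N, K, D), true caption at index 0
    img = img_embs / np.linalg.norm(img_embs, axis=-1, keepdims=True)
    txt = cand_text_embs / np.linalg.norm(cand_text_embs, axis=-1, keepdims=True)
    sims = np.einsum("nd,nkd->nk", img, txt)  # cosine similarity per candidate
    return float((sims.argmax(axis=1) == 0).mean())
```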
Fig.6 Comparison of images generated by different models on CUB dataset
Fig.7 Comparison of images generated by different models on COCO dataset
Fig.8 Comparison of images generated in two stages of proposed model on different datasets
[1]   GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems . Cambridge: MIT Press, 2014: 2672–2680.
[2]   XU T, ZHANG P, HUANG Q, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 1316–1324.
[3]   HAN Shuang. Research on text-to-image generation techniques based on generative adversarial networks [D]. Daqing: Northeast Petroleum University, 2022.
[4]   QIAO T, ZHANG J, XU D, et al. Learn, imagine and create: text-to-image generation from prior knowledge [C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver: [s. n.], 2019: 887–897.
[5]   LIANG J, PEI W, LU F. CPGAN: content-parsing generative adversarial networks for text-to-image synthesis [C]// Proceedings of the 16th European Conference on Computer Vision. [S. l.]: Springer, 2020: 491–508.
[6]   ZHANG H, XU T, LI H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks [C]// 2017 IEEE International Conference on Computer Vision . Venice: IEEE, 2017: 5908–5916.
[7]   ZHANG H, XU T, LI H, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947–1962.
[8]   QIAO T, ZHANG J, XU D, et al. MirrorGAN: learning text-to-image generation by redescription [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 1505–1514.
[9]   TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis [C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16515–16525.
[10]   DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. (2021-06-03)[2023-09-17]. https://arxiv.org/pdf/2010.11929.pdf.
[11]   REED S, AKATA Z, YAN X, et al. Generative adversarial text to image synthesis [C]// Proceedings of the 33rd International Conference on Machine Learning . New York: ACM, 2016: 1060–1069.
[12]   ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks [C]// 2017 IEEE International Conference on Computer Vision . Venice: IEEE, 2017: 2223–2232.
[13]   HE Xiaofeng, MAO Lin, YANG Dawei. Semantic-spatial feature enhancement algorithm for text-to-image generation [J]. Journal of Dalian Minzu University, 2022, 24(5): 401–406.
[14]   XUE Zhihang, XU Zheming, LANG Congyan, et al. Text-to-image generation method based on image-text semantic consistency [J]. Journal of Computer Research and Development, 2023, 60(9): 2180–2190.
[15]   LYU Wenhan, CHE Jin, ZHAO Zewei, et al. Image generation method based on dynamic convolution and text data augmentation [EB/OL]. (2023-04-28)[2023-09-17]. https://doi.org/10.19678/j.issn.1000-3428.0066470.
[16]   SHEYNIN S, ASHUAL O, POLYAK A, et al. KNN-diffusion: image generation via large-scale retrieval [EB/OL]. (2022-10-02)[2023-09-17]. https://arxiv.org/pdf/2204.02849.pdf.
[17]   NICHOL A Q, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models [C]// Proceedings of the 39th International Conference on Machine Learning. Baltimore: PMLR, 2022: 16784–16804.
[18]   TIAN Feng, SUN Xiaoqiang, LIU Fang, et al. Chinese image caption with dual attention and multi-label image [J]. Computer Systems and Applications, 2021, 30(7): 32–40.
[19]   HUANG Z, XU W, YU K. Bidirectional LSTM-CRF models for sequence tagging [EB/OL]. (2015-08-09)[2023-09-17]. https://arxiv.org/pdf/1508.01991.pdf.
[20]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: [s. n.], 2017: 6000–6010.
[21]   MIRZA M, OSINDERO S. Conditional generative adversarial nets [EB/OL]. (2014-11-06)[2023-09-17]. https://arxiv.org/pdf/1411.1784.pdf.
[22]   WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset [EB/OL]. (2022-08-12)[2023-09-17]. https://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf.
[23]   LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision . [S. l.]: Springer, 2014: 740–755.
[24]   SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: [s. n.], 2016: 2234–2242.
[25]   HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local nash equilibrium [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems . Long Beach: [s.n.], 2017: 6629–6640.
[26]   WANG Jiayu. Image generation based on generative adversarial networks [D]. Hefei: University of Science and Technology of China, 2021.
[27]   WANG Lei. Text-to-image synthesis based on semantic correlation mining [D]. Xi’an: Xidian University, 2020.
[28]   STAP D, BLEEKER M, IBRAHIMI S, et al. Conditional image generation and manipulation for user-specified content [EB/OL]. (2020-05-11)[2023-09-17]. https://arxiv.org/pdf/2005.04909.pdf.
[29]   ZHU M, PAN P, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 5802–5810.
[30]   YIN G, LIU B, SHENG L, et al. Semantics disentangling for text-to-image generation [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 2327–2336.
[31]   LIAO W, HU K, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN [C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 18187–18196.
[32]   YE S, WANG H, TAN M, et al. Recurrent affine transformation for text-to-image synthesis [J]. IEEE Transactions on Multimedia, 2023, 26: 462–473.
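[33]   DING M, ZHENG W, HONG W, et al. CogView2: faster and better text-to-image generation via hierarchical transformers [C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans: [s. n.], 2022.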