Journal of Zhejiang University (Engineering Science)  2024, Vol. 58 Issue (4): 674-683    DOI: 10.3785/j.issn.1008-973X.2024.04.003
Computer and Control Engineering
Generative adversarial network based two-stage generation of high-quality images from text
Yin CAO1,2, Junping QIN1,2,*, Tong GAO2,3, Qianli MA1,2, Jiaqi REN1,2
1. College of Data Science and Applications, Inner Mongolia University of Technology, Hohhot 010051, China
2. Inner Mongolia Autonomous Region Engineering Technology Research Center of Big Data Based Software Service, Hohhot 010000, China
3. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Full text: PDF (7066 KB)   HTML
Abstract:

A generative adversarial network with deep fusion attention (DFA-GAN) was proposed, with multiple loss functions as constraints, to address the poor image quality and the inconsistency between text descriptions and generated images that affect traditional text-to-image generation methods. A two-stage image generation process was employed, with a single-level generative adversarial network (GAN) as the backbone. The initial blurry image generated in the first stage was fed into the second stage and regenerated at high quality, improving the overall quality of the generated images. In the first stage, a visual-text fusion module was designed to deeply integrate text features with image features, so that text information was fully fused into the image sampling process at different scales. In the second stage, an image generator using an improved Vision Transformer as its encoder was proposed to fully fuse image features with word-level text features. Quantitative and qualitative experimental results showed that the proposed method outperformed other mainstream models in both the quality of the generated images and their alignment with the text descriptions.
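To make the two-stage design concrete, the following PyTorch-style sketch mirrors the pipeline the abstract describes: a stage-1 generator maps noise plus a sentence embedding to a coarse image, and a stage-2 refiner re-generates it at higher resolution while attending to word-level text features. All module names, layer choices, and tensor sizes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of the two-stage
# text-to-image pipeline described in the abstract.
import torch
import torch.nn as nn

class Stage1Generator(nn.Module):
    """Single-level GAN backbone: noise + sentence embedding -> coarse image."""
    def __init__(self, z_dim=100, sent_dim=256):
        super().__init__()
        # The paper's multi-scale visual-text fusion is abstracted
        # into a single fully connected mapping here.
        self.fc = nn.Linear(z_dim + sent_dim, 3 * 64 * 64)

    def forward(self, z, sent_emb):
        x = torch.tanh(self.fc(torch.cat([z, sent_emb], dim=1)))
        return x.view(-1, 3, 64, 64)  # initial blurry image

class Stage2Refiner(nn.Module):
    """Re-generates the coarse image while attending to word features
    (a stand-in for the paper's improved-ViT-encoder generator)."""
    def __init__(self, word_dim=256, d_model=256):
        super().__init__()
        self.patch = nn.Conv2d(3, d_model, kernel_size=8, stride=8)   # 64x64 -> 8x8 tokens
        self.kv = nn.Linear(word_dim, d_model)                        # project word features
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.up = nn.ConvTranspose2d(d_model, 3, kernel_size=32, stride=32)  # 8x8 -> 256x256

    def forward(self, coarse_img, word_embs):
        tokens = self.patch(coarse_img).flatten(2).transpose(1, 2)    # (B, 64, d)
        kv = self.kv(word_embs)                                       # (B, L, d)
        fused, _ = self.attn(tokens, kv, kv)                          # cross-attend to text
        fused = fused.transpose(1, 2).reshape(-1, 256, 8, 8)
        return torch.tanh(self.up(fused))                             # refined image

z, sent = torch.randn(2, 100), torch.randn(2, 256)   # noise, sentence embedding
words = torch.randn(2, 18, 256)                      # word-level embeddings
coarse = Stage1Generator()(z, sent)                  # (2, 3, 64, 64)
refined = Stage2Refiner()(coarse, words)             # (2, 3, 256, 256)
```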

Key words: text-to-image    deep fusion    generative adversarial network (GAN)    multi-scale feature fusion    semantic consistency
Received: 2023-05-22   Published: 2024-03-27
CLC:  TP 391  
Funding: National Natural Science Foundation of China (61962044); Natural Science Foundation of Inner Mongolia Autonomous Region (2019MS06005); Science and Technology Major Project of Inner Mongolia Autonomous Region (2021ZD0015); Basic Scientific Research Funds for Universities Directly under the Inner Mongolia Autonomous Region (JY20220327).
Corresponding author: Junping QIN   E-mail: c1122335966@163.com; qinjunping30999@sina.com
About the author: Yin CAO (1998—), male, master's student, engaged in computer vision research. orcid.org/0000-0002-8759-0888. E-mail: c1122335966@163.com
Cite this article:

Yin CAO, Junping QIN, Tong GAO, Qianli MA, Jiaqi REN. Generative adversarial network based two-stage generation of high-quality images from text. Journal of Zhejiang University (Engineering Science), 2024, 58(4): 674-683.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.04.003        https://www.zjujournals.com/eng/CN/Y2024/V58/I4/674

Fig. 1  Problems of images generated by mainstream methods
Fig. 2  Model structure of the proposed method
Fig. 3  Structure of the image generation stage with deep fusion of text features
Fig. 4  Multi-scale fusion of text features and image features
Fig. 5  Structure of the image generation stage optimized by the attention mechanism
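Fig. 4's multi-scale fusion can be read as injecting the text embedding into the image features at every upsampling scale. The sketch below illustrates one common way to realize this pattern, text-conditioned affine modulation in the spirit of DF-GAN [9]; it is an assumed illustration, not the paper's exact module, and the names TextAffine and FusionUpBlock are hypothetical.

```python
# Illustrative sketch (an assumption based on the deep-fusion idea in the
# abstract and Fig. 4): the text embedding predicts per-channel scale and
# shift parameters that modulate image features at each upsampling scale.
import torch
import torch.nn as nn

class TextAffine(nn.Module):
    def __init__(self, text_dim, channels):
        super().__init__()
        self.gamma = nn.Linear(text_dim, channels)  # per-channel scale
        self.beta = nn.Linear(text_dim, channels)   # per-channel shift

    def forward(self, feat, text):
        g = self.gamma(text).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(text).unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + g) + b

class FusionUpBlock(nn.Module):
    """One upsampling step with text fused into the image features."""
    def __init__(self, in_ch, out_ch, text_dim=256):
        super().__init__()
        self.affine = TextAffine(text_dim, in_ch)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.LeakyReLU(0.2))

    def forward(self, feat, text):
        return self.up(self.affine(feat, text))

# Text is injected again at every scale: 8x8 -> 16x16 -> 32x32 -> 64x64.
blocks = nn.ModuleList([FusionUpBlock(256, 128), FusionUpBlock(128, 64),
                        FusionUpBlock(64, 32)])
feat, text = torch.randn(2, 256, 8, 8), torch.randn(2, 256)
for blk in blocks:
    feat = blk(feat, text)
print(feat.shape)  # torch.Size([2, 32, 64, 64])
```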
Model                IS (CUB)   FID (CUB)   FID (COCO)
StackGAN[6]          3.70       35.51       74.05
StackGAN++[7]        3.84       —           —
AttnGAN[2]           4.36       24.37       35.49
MirrorGAN[8]         4.56       18.34       34.71
textStyleGAN[28]     4.78       —           —
DM-GAN[29]           4.75       16.09       32.64
SD-GAN[30]           4.67       —           —
DF-GAN[9]            5.10       14.81       19.32
SSA-GAN[31]          5.17       15.61       19.37
RAT-GAN[32]          5.36       13.91       14.60
CogView2[33]         —          —           17.70
KNN-Diffusion[16]    —          —           16.66
DFA-GAN (stage 1)    4.53       16.07       25.09
DFA-GAN (stage 2)    5.34       10.96       19.17
Table 1  Comparison of evaluation metrics of text-to-image generation methods on the two datasets
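For reference, the two metrics in Table 1 are the standard Inception score (IS) of [24], where higher is better, and Fréchet inception distance (FID) of [25], where lower is better. With p(y|x) the Inception classifier's label distribution for image x, and (μ_r, Σ_r), (μ_g, Σ_g) Gaussian fits to the Inception features of real and generated images, the usual definitions are:

\mathrm{IS} = \exp\!\big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big)

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big)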
Model                        IS     RP/%
AttnGAN                      4.36   67.83
AttnGAN + DFA-GAN stage 2    5.11   70.06
DF-GAN                       5.10   44.83
DF-GAN + DFA-GAN stage 2     5.32   70.80
DFA-GAN                      5.34   72.67
Table 2  Ablation experiments of different models on the CUB dataset
Fig. 6  Comparison of images generated by different models on the CUB dataset
Fig. 7  Comparison of images generated by different models on the COCO dataset
Fig. 8  Comparison of images generated in the two stages of the proposed model on different datasets
1 GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2672–2680.
2 XU T, ZHANG P, HUANG Q, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316–1324.
3 HAN Shuang. Research on text-to-image generation techniques based on generative adversarial networks [D]. Daqing: Northeast Petroleum University, 2022. (in Chinese)
4 QIAO T, ZHANG J, XU D, et al. Learn, imagine and create: text-to-image generation from prior knowledge [C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver: [s.n.], 2019: 887–897.
5 LIANG J, PEI W, LU F. CPGAN: content-parsing generative adversarial networks for text-to-image synthesis [C]// Proceedings of the 16th European Conference on Computer Vision. [S.l.]: Springer, 2020: 491–508.
6 ZHANG H, XU T, LI H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks [C]// 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5908–5916.
7 ZHANG H, XU T, LI H, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947–1962.
8 QIAO T, ZHANG J, XU D, et al. MirrorGAN: learning text-to-image generation by redescription [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1505–1514.
9 TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis [C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16515–16525.
10 DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. (2021-06-03)[2023-09-17]. https://arxiv.org/pdf/2010.11929.pdf.
11 REED S, AKATA Z, YAN X, et al. Generative adversarial text to image synthesis [C]// Proceedings of the 33rd International Conference on Machine Learning . New York: ACM, 2016: 1060–1069.
12 ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks [C]// 2017 IEEE International Conference on Computer Vision . Venice: IEEE, 2017: 2223–2232.
13 HE Xiaofeng, MAO Lin, YANG Dawei. Semantic-spatial feature enhancement algorithm for text-to-image generation [J]. Journal of Dalian Minzu University, 2022, 24(5): 401–406. (in Chinese)
14 XUE Zhihang, XU Zheming, LANG Congyan, et al. Text-to-image generation method based on image-text semantic consistency [J]. Journal of Computer Research and Development, 2023, 60(9): 2180–2190. (in Chinese)
15 LYU Wenhan, CHE Jin, ZHAO Zewei, et al. Image generation method based on dynamic convolution and text data augmentation [EB/OL]. (2023-04-28)[2023-09-17]. https://doi.org/10.19678/j.issn.1000-3428.0066470. (in Chinese)
16 SHEYNIN S, ASHUAL O, POLYAK A, et al. KNN-diffusion: image generation via large-scale retrieval [EB/OL]. (2022-10-02)[2023-09-17]. https://arxiv.org/pdf/2204.02849.pdf.
17 NICHOL A Q, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models [C]// Proceedings of the 39th International Conference on Machine Learning. Baltimore: PMLR, 2022: 16784–16804.
18 TIAN Feng, SUN Xiaoqiang, LIU Fang, et al. Chinese image caption with dual attention and multi-label image [J]. Computer Systems and Applications, 2021, 30(7): 32–40. (in Chinese)
19 HUANG Z, XU W, YU K. Bidirectional LSTM-CRF models for sequence tagging [EB/OL]. (2015-08-09)[2023-09-17]. https://arxiv.org/pdf/1508.01991.pdf.
20 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: [s.n.], 2017: 6000–6010.
21 MIRZA M, OSINDERO S. Conditional generative adversarial nets [EB/OL]. (2014-11-06)[2023-09-17]. https://arxiv.org/pdf/1411.1784.pdf.
22 WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset [EB/OL]. (2022-08-12)[2023-09-17]. https://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf.
23 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision . [S. l.]: Springer, 2014: 740–755.
24 SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: [s.n.], 2016: 2234–2242.
25 HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: [s.n.], 2017: 6629–6640.
26 WANG Jiayu. Image generation based on generative adversarial networks [D]. Hefei: University of Science and Technology of China, 2021. (in Chinese)
27 WANG Lei. Text-to-image synthesis based on semantic correlation mining [D]. Xi’an: Xidian University, 2020. (in Chinese)
28 STAP D, BLEEKER M, IBRAHIMI S, et al. Conditional image generation and manipulation for user-specified content [EB/OL]. (2020-05-11)[2023-09-17]. https://arxiv.org/pdf/2005.04909.pdf.
29 ZHU M, PAN P, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 5802–5810.
30 YIN G, LIU B, SHENG L, et al. Semantics disentangling for text-to-image generation [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 2327–2336.
31 LIAO W, HU K, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN [C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 18187–18196.
32 YE S, WANG H, TAN M, et al. Recurrent affine transformation for text-to-image synthesis [J]. IEEE Transactions on Multimedia, 2023, 26: 462–473.