Journal of Zhejiang University (Engineering Science), 2024, Vol. 58, Issue 2: 219-238    DOI: 10.3785/j.issn.1008-973X.2024.02.001
Computer Technology and Communication Technology
Survey of text-to-image synthesis
Yin CAO1,2, Junping QIN1,2,*, Qianli MA1,2, Hao SUN1,2, Kai YAN1,2, Lei WANG1,2, Jiaqi REN1,2
1. College of Data Science and Applications, Inner Mongolia University of Technology, Hohhot 010000, China
2. Inner Mongolia Autonomous Region Engineering Technology Research Center of Big Data Based Software Service, Hohhot 010000, China
Abstract:

A comprehensive evaluation and categorization of text-to-image generation methods was conducted. Based on the principle of image generation, text-to-image methods were classified into three major categories: methods based on the generative adversarial network (GAN) architecture, methods based on the autoregressive model architecture, and methods based on the diffusion model architecture. GAN-based text-to-image methods were further grouped into six subcategories according to the technique they improve: adoption of multi-level nested architectures, application of attention mechanisms, use of Siamese networks, incorporation of cycle-consistency methods, deep fusion of text features, and improvement of unconditional models. Through the analysis of the different methods, the common evaluation metrics and datasets for existing text-to-image methods were summarized and discussed.
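To make the GAN-based category concrete, the sketch below shows the conditional-GAN idea that the surveyed methods build on (cf. Fig. 1 and refs. [2, 4]): a generator maps a noise vector concatenated with a sentence embedding to an image, so that different noise vectors yield different images for the same text. This is a minimal illustration only; the layer widths, the 64×64 output resolution, and the stand-in random text embedding are assumptions made for the example, not values from any surveyed paper.

```python
# Minimal conditional generator in the spirit of GAN-INT-CLS [2] / cGAN [4]:
# the text condition is injected by concatenating a sentence embedding with
# the noise vector before upsampling to an image. Sizes are illustrative.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim: int = 100, embed_dim: int = 256, img_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            # (z_dim+embed_dim) x 1 x 1 -> 512 x 4 x 4, then upsample to 64 x 64
            nn.ConvTranspose2d(z_dim + embed_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
        # Condition on the text by concatenating embedding and noise.
        code = torch.cat([z, text_embed], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(code)

# Usage: one batch of 64 x 64 images conditioned on pre-computed text embeddings.
g = ConditionalGenerator()
images = g(torch.randn(8, 100), torch.randn(8, 256))  # -> (8, 3, 64, 64)
```

The multi-level nested architectures discussed in the survey (e.g. StackGAN[10]) chain several such generators, with each later stage refining the lower-resolution output of the previous one.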

Key words: AI-generated content; text-to-image; generative adversarial network; autoregressive model; diffusion model
Received: 2023-07-06    Published: 2024-01-23
CLC:  TP 391  
Supported by: National Natural Science Foundation of China (61962044); Natural Science Foundation of Inner Mongolia (2019MS06005); Science and Technology Major Project of Inner Mongolia Autonomous Region (2021ZD0015); Basic Scientific Research Funds for Universities Directly under the Inner Mongolia Autonomous Region (JY20220327).
Corresponding author: Junping QIN. E-mail: c1122335966@163.com; qinjunping@imut.edu.cn
First author: Yin CAO (1998—), male, master's degree candidate, engaged in computer vision research. orcid.org/0000-0002-8759-0888. E-mail: c1122335966@163.com

Cite this article:

Yin CAO, Junping QIN, Qianli MA, Hao SUN, Kai YAN, Lei WANG, Jiaqi REN. Survey of text-to-image synthesis. Journal of Zhejiang University (Engineering Science), 2024, 58(2): 219-238.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.02.001
https://www.zjujournals.com/eng/CN/Y2024/V58/I2/219

Fig. 1  Generative adversarial network and conditional generative adversarial network
Fig. 2  Network structure of autoencoder
Fig. 3  Structure of contrastive model
Fig. 4  Schematic diagram of autoregressive model methods
Fig. 5  Working principle of diffusion model
Fig. 6  Schematic diagram of text-to-image generation
Fig. 7  Development history of the text-to-image generation task
Fig. 8  Different multi-level nested architectures for text-to-image generation
Fig. 9  Structure of AttnGAN model
Fig. 10  Structure of SD-GAN model
Fig. 11  Structure of MirrorGAN model
Fig. 12  Structure of DM-GAN model
Fig. 13  Structure of DF-GAN model
Fig. 14  Structure of textStyleGAN model
Fig. 15  Structure of ERNIE-ViLG model
Fig. 16  Structure of CogView model
Fig. 17  Structure of DALL-E 2 model
Fig. 18  Examples from each dataset used for text-to-image generation
| Method | CUB: IS | CUB: FID | CUB: R-precision/% | Oxford-102: IS | Oxford-102: FID | COCO: IS | COCO: FID | COCO: R-precision/% |
|---|---|---|---|---|---|---|---|---|
| GAN-INT-CLS[2] | 2.32 | 68.79 | — | 2.66 | 79.55 | 7.95 | 60.62 | — |
| StackGAN[10] | 3.70 | 51.89 | — | 3.20 | 55.28 | 8.45 | 74.65 | — |
| StackGAN++[25] | 4.04 | 15.30 | — | 3.26 | 48.68 | 8.30 | — | — |
| AttnGAN[11] | 4.36 | 23.98 | 67.83 | — | — | 25.89 | 35.49 | 85.47 |
| DM-GAN[53] | 4.75 | 16.09 | 72.31 | — | — | 30.49 | 32.62 | 88.56 |
| ControlGAN[36] | 4.58 | — | 39.33 | — | — | 24.06 | — | 82.43 |
| SEGAN[34] | 4.67 | 18.16 | — | — | — | 27.86 | 32.28 | — |
| MirrorGAN[41] | 4.56 | — | 57.67 | — | — | 26.47 | — | 74.52 |
| XMC-GAN[60] | — | — | — | — | — | 30.45 | 9.33 | 71.00 |
| CI-GAN[66] | 5.72 | 9.78 | — | — | — | — | — | — |
| DF-GAN[57] | 5.10 | 14.81 | 44.83 | — | — | — | 21.42 | 67.97 |
| DiverGAN[37] | 4.98 | 15.63 | — | 3.99 | 20.52 | — | — | — |
| SSA-GAN[58] | 5.17 | 15.61 | 75.9 | — | — | — | 19.37 | 90.60 |
| Adma-GAN[59] | 5.28 | 8.57 | 55.94 | — | — | 29.07 | 12.39 | 88.74 |
| TVBi-GAN[71] | 5.03 | 11.83 | — | — | — | 31.01 | 31.97 | — |
| Method of ref. [67] | 4.23 | 11.17 | — | 3.71 | 16.47 | — | — | — |
| Bridge-GAN[92] | 4.74 | — | — | — | — | 16.40 | — | — |
| textStyleGAN[63] | 4.78 | — | 49.56 | — | — | 33.00 | — | 88.23 |
| Method of ref. [45] | 3.58 | 18.14 | — | 2.90 | 34.97 | 8.94 | 27.07 | — |
| SD-GAN[38] | 4.67 | — | — | — | — | 35.69 | — | — |
| PPAN[29] | 4.38 | — | — | 3.52 | — | — | — | — |
| HfGAN[27] | 4.48 | — | — | 3.57 | — | 27.53 | — | — |
| HDGAN[26] | 4.15 | — | — | 3.45 | — | 11.86 | — | — |
| DSE-GAN[61] | 5.13 | 13.23 | 53.25 | — | — | 26.71 | 15.30 | 76.31 |

Tab. 1  Comparison of evaluation metrics of GAN-based text-to-image generation methods on the CUB birds, Oxford-102 flowers, and COCO datasets
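For reference, the two metrics reported throughout Tab. 1 are the Inception score (IS)[89], where higher is better, and the Fréchet inception distance (FID)[90], where lower is better; both are computed from the outputs of a pretrained Inception network:

```latex
% Inception score [89]: expected KL divergence between the label distribution
% conditioned on a generated image x and the marginal label distribution.
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \Big)

% Frechet inception distance [90]: distance between Gaussians fitted to the
% Inception features of real (r) and generated (g) images.
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \mathrm{Tr}\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)
```

R-precision, introduced with AttnGAN[11], instead measures text-image alignment: the generated image is encoded and used to retrieve its ground-truth caption from a pool of candidates, and the retrieval accuracy is reported as a percentage.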
| Method | FID (MS-COCO) | Zero-shot FID (MS-COCO) |
|---|---|---|
| DALL-E[15] | — | 28.0 |
| ERNIE-ViLG[73] | 14.7 | — |
| CogView[16] | 27.1 | — |
| CogView2[17] | 17.7 | 24.0 |
| Parti[18] | 3.22 | 7.23 |
| KNN-Diffusion[75] | — | 16.66 |
| GLIDE[76] | — | 12.24 |
| DALL-E 2[77] | — | 10.39 |
| Imagen[78] | — | 7.27 |
| Stable Diffusion[79] | — | 12.63 |
| Re-Imagen[19] | 5.25 | 6.88 |

Tab. 2  Comparison of text-to-image generation methods based on autoregressive model and diffusion model architectures on MS-COCO
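The sketch below shows how an FID value such as those in Tab. 1 and Tab. 2 is computed from feature statistics [90]. It assumes Inception-v3 features have already been extracted for the real and generated image sets; the random arrays are placeholders for those features, and the 64-dimensional feature size is an assumption chosen to keep the example fast.

```python
# FID [90]: fit a Gaussian to the Inception features of each image set and
# compute the Frechet distance between the two Gaussians.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Placeholder features: 2048 samples per set, 64 dimensions each.
rng = np.random.default_rng(0)
fid = frechet_distance(rng.normal(size=(2048, 64)), rng.normal(size=(2048, 64)))
print(f"FID: {fid:.2f}")
```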
1 ZHU X, GOLDBERG A B, ELDAWY M, et al. A text-to-picture synthesis system for augmenting communication [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2007: 1590-1595.
2 REED S, AKATA Z, YAN X, et al. Generative adversarial text to image synthesis [C]// International Conference on Machine Learning. New York: ACM, 2016: 1060-1069.
3 GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks [J]. Communications of the ACM, 2020, 63(11): 139-144. doi: 10.1145/3422622
4 MIRZA M, OSINDERO S. Conditional generative adversarial nets [EB/OL]. [2014-11-06]. https://arxiv.org/pdf/1411.1784.pdf.
5 RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// International Conference on Machine Learning. [S. l.]: ACM, 2021: 8748-8763.
6 ISOLA P, ZHU J Y, ZHOU T, et al. Image-to-image translation with conditional adversarial networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1125-1134.
7 ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2223-2232.
8 WANG T, ZHANG T, LIU L, et al. CannyGAN: edge-preserving image translation with disentangled features [C]// IEEE International Conference on Image Processing. Taipei: IEEE, 2019: 514-518.
9 ZHANG T, WILIEM A, YANG S, et al. TV-GAN: generative adversarial network based thermal to visible face recognition [C]// International Conference on Biometrics. Gold Coast: IEEE, 2018: 174-181.
10 ZHANG H, XU T, LI H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5907-5915.
11 XU T, ZHANG P, HUANG Q, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316-1324.
12 LEE D D, PHAM P, LARGMAN Y, et al. Advances in neural information processing systems 22 [R]. Long Beach: IEEE, 2009.
13 HE Xiaofeng, MAO Lin, YANG Dawei. Semantic-spatial feature enhancement algorithm for text-to-image generation [J]. Journal of Dalian Minzu University, 2022, 24(5): 401-406.
14 SHI X, CHEN Z, WANG H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting [J]. Advances in Neural Information Processing Systems, 2015, 28: 156-167.
15 RAMESH A, PAVLOV M, GOH G, et al. Zero-shot text-to-image generation [C]// International Conference on Machine Learning. [S. l.]: ACM, 2021: 8821-8831.
16 DING M, YANG Z, HONG W, et al. CogView: mastering text-to-image generation via transformers [J]. Advances in Neural Information Processing Systems, 2021, 34: 19822-19835.
17 DING M, ZHENG W, HONG W, et al. CogView2: faster and better text-to-image generation via hierarchical transformers [EB/OL]. [2022-05-27]. https://arxiv.org/pdf/2204.14217.
18 YU J, XU Y, KOH J Y, et al. Scaling autoregressive models for content-rich text-to-image generation [EB/OL]. [2022-06-22]. https://arxiv.org/pdf/2206.10789.
19 CHEN W, HU H, SAHARIA C, et al. Re-Imagen: retrieval-augmented text-to-image generator [EB/OL]. [2022-11-22]. https://arxiv.org/pdf/2209.14491.
20 HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [J]. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.
21 DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis [J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794.
22 DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [EB/OL]. [2019-05-24]. https://arxiv.org/pdf/1810.04805.
23 KARRAS T, LAINE S, AILA T. A style-based generator architecture for generative adversarial networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4401-4410.
24 REED S, AKATA Z, LEE H, et al. Learning deep representations of fine-grained visual descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 49-58.
25 ZHANG H, XU T, LI H, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947-1962.
26 ZHANG Z, XIE Y, YANG L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6199-6208.
27 HUANG X, WANG M, GONG M. Hierarchically-fused generative adversarial network for text to realistic image synthesis [C]// 16th Conference on Computer and Robot Vision. Kingston: IEEE, 2019: 73-80.
28 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
29 GAO L, CHEN D, SONG J, et al. Perceptual pyramid adversarial networks for text-to-image synthesis [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu: AAAI, 2019, 33(1): 8312-8319.
30 LAI W S, HUANG J B, AHUJA N, et al. Deep Laplacian pyramid networks for fast and accurate super-resolution [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 624-632.
31 HAN Shuang. Research on text-to-image generation techniques based on generative adversarial networks [D]. Daqing: Northeast Petroleum University, 2022.
32 WANG Jiayu. Research on image generation based on generative adversarial networks [D]. Hefei: University of Science and Technology of China, 2021.
33 TIAN Feng, SUN Xiaoqiang, LIU Fang, et al. Image caption generation method combining dual attention and multi-labels [J]. Computer Systems and Applications, 2021, 30(7): 32-40. doi: 10.15888/j.cnki.csa.008010
34 TAN H, LIU X, LI X, et al. Semantics-enhanced adversarial nets for text-to-image synthesis [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 10501-10510.
35 HUANG W, DA XU R Y, OPPERMANN I. Realistic image generation using region-phrase attention [C]// Asian Conference on Machine Learning. Nagoya: ACM, 2019: 284-299.
36 LI B, QI X, LUKASIEWICZ T, et al. Controllable text-to-image generation [J]. Advances in Neural Information Processing Systems, 2019, 32: 2065-2075.
37 ZHANG Z, SCHOMAKER L. DiverGAN: an efficient and effective single-stage framework for diverse text-to-image generation [J]. Neurocomputing, 2022, 473: 182-198.
38 YIN G, LIU B, SHENG L, et al. Semantics disentangling for text-to-image generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 2327-2336.
39 HADSELL R, CHOPRA S, LECUN Y. Dimensionality reduction by learning an invariant mapping [C]// IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2006: 1735-1742.
40 CHEN Z, LUO Y. Cycle-consistent diverse image synthesis from natural language [C]// IEEE International Conference on Multimedia and Expo Workshops. Shanghai: IEEE, 2019: 459-464.
41 QIAO T, ZHANG J, XU D, et al. MirrorGAN: learning text-to-image generation by redescription [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1505-1514.
42 KARPATHY A, FEI-FEI L. Deep visual-semantic alignments for generating image descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3128-3137.
43 VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3156-3164.
44 DUMOULIN V, BELGHAZI I, POOLE B, et al. Adversarially learned inference [EB/OL]. [2017-02-21]. https://arxiv.org/pdf/1606.00704.pdf.
45 LAO Q, HAVAEI M, PESARANGHADER A, et al. Dual adversarial inference for text-to-image synthesis [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 7567-7576.
46 WANG Lei. Research on text-to-image generation algorithm based on semantic association mining [D]. Xi'an: Xidian University, 2020.
47 LYU Wenhan, CHE Jin, ZHAO Zewei, et al. Image generation method based on dynamic convolution and text data augmentation [EB/OL]. [2023-07-01]. https://kns.cnki.net/kcms2/article/abstract?v=sSXGFc3NEDIAReRhgp48uNl5G2T_5G24IJmVa17AFT4XZFr932Jmsa2EZrM7rxoIWSwHni_2CiKpa4phSwe9hcwvepEs3fO1pcWCTfWKZ7gIU_jFpQgmgw==&uniplatform=NZKPT.
48 XUE Zhihang, XU Zheming, LANG Congyan, et al. Text-to-image generation method based on image-text semantic consistency [J]. Journal of Computer Research and Development, 2023, 60(9): 2180-2190.
49 WU Chunyan, PAN Longyue, YANG You. Text-to-image generation method based on feature-enhanced generative adversarial network [J]. Microelectronics and Computers, 2023, (6): 51-61. doi: 10.19304/J.ISSN1000-7180.2022.0629
50 WANG Wei, LI Yujie, GUO Fulin, et al. A survey on generative adversarial networks and text-image synthesis [J]. Computer Engineering and Applications, 2022, 58(19): 14-36.
51 LI Xinwei. Research on text-generated image based on multi-deep neural network [D]. Dalian: Dalian University of Technology, 2022.
52 YE Long, WANG Zhengyong, HE Xiaohai. Image generation from text based on multi-modal fusion [J]. Intelligent Computer and Applications, 2022, 12(11): 9-17.
53 ZHU M, PAN P, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5802-5810.
54 GULCEHRE C, CHANDAR S, CHO K, et al. Dynamic neural Turing machine with continuous and discrete addressing schemes [J]. Neural Computation, 2018, 30(4): 857-884. doi: 10.1162/neco_a_01060
55 SUKHBAATAR S, WESTON J, FERGUS R. End-to-end memory networks [J]. Advances in Neural Information Processing Systems, 2015, 28: 2440-2448.
56 TAI K S, SOCHER R, MANNING C D. Improved semantic representations from tree-structured long short-term memory networks [EB/OL]. [2015-05-30]. https://arxiv.org/pdf/1503.00075.pdf.
57 TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16515-16525.
58 LIAO W, HU K, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 18187-18196.
59 WU X, ZHAO H, ZHENG L, et al. Adma-GAN: attribute-driven memory augmented GANs for text-to-image generation [C]// Proceedings of the 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 1593-1602.
60 ZHANG H, KOH J Y, BALDRIDGE J, et al. Cross-modal contrastive learning for text-to-image generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [S. l.]: IEEE, 2021: 833-842.
61 HUANG M, MAO Z, WANG P, et al. DSE-GAN: dynamic semantic evolution generative adversarial network for text-to-image generation [C]// Proceedings of the 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 4345-4354.
62 BROCK A, DONAHUE J, SIMONYAN K. Large scale GAN training for high fidelity natural image synthesis [EB/OL]. [2019-02-25]. https://arxiv.org/pdf/1809.11096.pdf.
63 STAP D, BLEEKER M, IBRAHIMI S, et al. Conditional image generation and manipulation for user-specified content [EB/OL]. [2020-05-11]. https://arxiv.org/pdf/2005.04909.pdf.
64 ZHANG Y, LU H. Deep cross-modal projection learning for image-text matching [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 686-701.
65 LEDIG C, THEIS L, HUSZÁR F, et al. Photo-realistic single image super-resolution using a generative adversarial network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 4681-4690.
66 WANG H, LIN G, HOI S C H, et al. Cycle-consistent inverse GAN for text-to-image synthesis [C]// Proceedings of the 29th ACM International Conference on Multimedia. [S. l.]: ACM, 2021: 630-638.
67 SOUZA D M, WEHRMANN J, RUIZ D D. Efficient neural architecture for text-to-image synthesis [C]// International Joint Conference on Neural Networks. Glasgow: IEEE, 2020: 1-8.
68 ROMBACH R, ESSER P, OMMER B. Network-to-network translation with conditional invertible neural networks [J]. Advances in Neural Information Processing Systems, 2020, 33: 2784-2797.
69 DINH L, KRUEGER D, BENGIO Y. NICE: non-linear independent components estimation [EB/OL]. [2015-04-10]. https://arxiv.org/pdf/1410.8516.
70 DINH L, SOHL-DICKSTEIN J, BENGIO S. Density estimation using Real NVP [EB/OL]. [2017-02-27]. https://arxiv.org/pdf/1605.08803.
71 WANG Z, QUAN Z, WANG Z J, et al. Text to image synthesis with bidirectional generative adversarial network [C]// IEEE International Conference on Multimedia and Expo. London: IEEE, 2020: 1-6.
72 DONAHUE J, KRÄHENBÜHL P, DARRELL T. Adversarial feature learning [EB/OL]. [2017-04-03]. https://arxiv.org/pdf/1605.09782.pdf.
73 ZHANG H, YIN W, FANG Y, et al. ERNIE-ViLG: unified generative pre-training for bidirectional vision-language generation [EB/OL]. [2023-07-01]. https://arxiv.org/abs/2112.15283.
74 LIU X, PARK D H, AZADI S, et al. More control for free! Image synthesis with semantic diffusion guidance [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 289-299.
75 SHEYNIN S, ASHUAL O, POLYAK A, et al. KNN-Diffusion: image generation via large-scale retrieval [EB/OL]. [2022-10-02]. https://arxiv.org/pdf/2204.02849.
76 NICHOL A Q, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models [C]// International Conference on Machine Learning. Baltimore: ACM, 2022: 16784-16804.
77 RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. [2022-04-13]. https://arxiv.org/pdf/2204.06125.
78 SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding [J]. Advances in Neural Information Processing Systems, 2022, 35: 36479-36494.
79 ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10684-10695.
80 RUIZ N, LI Y, JAMPANI V, et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 22500-22510.
81 ZHANG L, AGRAWALA M. Adding conditional control to text-to-image diffusion models [EB/OL]. [2023-02-10]. https://arxiv.org/pdf/2302.05543.
82 CHEFER H, ALALUF Y, VINKER Y, et al. Attend-and-Excite: attention-based semantic guidance for text-to-image diffusion models [J]. ACM Transactions on Graphics, 2023, 42(4): 1-10.
83 MOU C, WANG X, XIE L, et al. T2I-Adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models [EB/OL]. [2023-03-20]. https://arxiv.org/pdf/2302.08453.
84 NILSBACK M E, ZISSERMAN A. Automated flower classification over a large number of classes [C]// 6th Indian Conference on Computer Vision, Graphics and Image Processing. Bhubaneswar: IEEE, 2008: 722-729.
85 WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset [EB/OL]. [2023-07-06]. https://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf.
86 GUO Y, ZHANG L, HU Y, et al. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition [C]// European Conference on Computer Vision. Amsterdam: [s. n.], 2016: 87-102.
87 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Zurich: [s. n.], 2014: 740-755.
88 FROLOV S, HINZ T, RAUE F, et al. Adversarial text-to-image synthesis: a review [J]. Neural Networks, 2021, 144: 187-209.
89 SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs [C]// Proceedings of the 29th Advances in Neural Information Processing Systems. Barcelona: [s. n.], 2016: 2226-2234.
90 HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]// Proceedings of the 30th Advances in Neural Information Processing Systems. Long Beach: [s. n.], 2017: 6626-6637.
91 XU T, ZHANG P, HUANG Q, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316-1324.