Journal of ZheJiang University (Engineering Science)  2024, Vol. 58 Issue (2): 219-238    DOI: 10.3785/j.issn.1008-973X.2024.02.001
Survey of text-to-image synthesis
Yin CAO1,2, Junping QIN1,2,*, Qianli MA1,2, Hao SUN1,2, Kai YAN1,2, Lei WANG1,2, Jiaqi REN1,2
1. College of Data Science and Applications, Inner Mongolia University of Technology, Hohhot 010000, China
2. Inner Mongolia Autonomous Region Engineering Technology Research Center of Big Data Based Software Service, Hohhot 010000, China

Abstract  

Text-to-image generation methods were comprehensively reviewed and categorized. Based on the underlying generation principle, the methods were classified into three major categories: text-to-image generation based on the generative adversarial network (GAN) architecture, text-to-image generation based on the autoregressive model architecture, and text-to-image generation based on the diffusion model architecture. The GAN-based methods were further grouped into six subcategories according to the technique they improve: adoption of multi-level hierarchically nested architectures, application of attention mechanisms, use of siamese networks, incorporation of cycle-consistency methods, deep fusion of text features, and improvement of unconditional models. Through the analysis of the different methods, the common evaluation metrics and datasets for text-to-image generation were summarized and discussed.
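To make the GAN-based category concrete, the sketch below shows the text-conditional adversarial objective (cf. Fig. 1 and refs. [3-4]) that the surveyed methods extend. The two-layer networks, dimensions, and optimizer settings are illustrative placeholders, not the configuration of any surveyed model.

```python
# Minimal text-conditional GAN training step (illustrative only; all
# dimensions and modules are placeholder assumptions, not any surveyed model).
import torch
import torch.nn as nn

TXT, Z, IMG = 256, 100, 64 * 64 * 3  # text embedding, noise, flattened image dims

G = nn.Sequential(nn.Linear(TXT + Z, 512), nn.ReLU(), nn.Linear(512, IMG), nn.Tanh())
D = nn.Sequential(nn.Linear(TXT + IMG, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(txt_emb, real_img):
    """One conditional-GAN update: D learns real/fake given the text, G fools D."""
    b = txt_emb.size(0)
    z = torch.randn(b, Z)
    fake = G(torch.cat([txt_emb, z], dim=1))

    # Discriminator: (text, real image) -> 1, (text, fake image) -> 0.
    d_loss = bce(D(torch.cat([txt_emb, real_img], 1)), torch.ones(b, 1)) + \
             bce(D(torch.cat([txt_emb, fake.detach()], 1)), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: make D classify the fake pair as real.
    g_loss = bce(D(torch.cat([txt_emb, fake], 1)), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Usage with random stand-in data:
print(train_step(torch.randn(8, TXT), torch.randn(8, IMG)))
```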



Key words: AI-generated content; text-to-image; generative adversarial network; autoregressive model; diffusion model
Received: 06 July 2023      Published: 23 January 2024
CLC:  TP 391  
Fund: National Natural Science Foundation of China (61962044); Natural Science Foundation of Inner Mongolia (2019MS06005); Science and Technology Major Project of Inner Mongolia Autonomous Region (2021ZD0015); Basic Scientific Research Fund for Universities Directly under the Inner Mongolia Autonomous Region (JY20220327).
Corresponding Authors: Junping QIN     E-mail: c1122335966@163.com;qinjunping@imut.edu.cn
Cite this article:

Yin CAO,Junping QIN,Qianli MA,Hao SUN,Kai YAN,Lei WANG,Jiaqi REN. Survey of text-to-image synthesis. Journal of ZheJiang University (Engineering Science), 2024, 58(2): 219-238.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.02.001     OR     https://www.zjujournals.com/eng/Y2024/V58/I2/219


Fig.1 Generative adversarial network and conditional generative adversarial network
Fig.2 Network architecture of auto-encoder
Fig.3 Structure diagram of CLIP
Fig.4 Schematic diagram of autoregressive models
Fig.5 Schematic diagram of diffusion model
Fig.6 Schematic diagram of text-to-image generation
Fig.7 History of text-to-image generation tasks
Fig.8 Different multilevel hierarchical nesting methods in text-to-image generation
Fig.9 Structure diagram of AttnGAN method model
Fig.10 Structure diagram of SD-GAN method model
Fig.11 Structure diagram of MirrorGAN method model
Fig.12 Structure diagram of DM-GAN method model
Fig.13 Structure diagram of DF-GAN method model
Fig.14 Structure diagram of textStyleGAN method model
Fig.15 Structure diagram of ERNIE-ViLG method model
Fig.16 Structure diagram of CogView method model
Fig.17 Structure diagram of DALL-E 2 method model
Fig.18 Examples of datasets for text-to-image generation
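As a complement to the schematic in Fig. 5, the following is a minimal sketch of DDPM-style ancestral sampling from ref. [20]. The linear noise schedule, tensor shapes, and the zero-returning eps_model are stand-ins for a trained noise-prediction network, not the setup of any surveyed system.

```python
# Illustrative DDPM-style ancestral sampling loop (cf. Fig. 5 and ref. [20]).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product \bar{alpha}_t

def eps_model(x_t, t):
    # Placeholder for a learned noise predictor epsilon_theta(x_t, t).
    return torch.zeros_like(x_t)

@torch.no_grad()
def sample(shape=(1, 3, 64, 64)):
    """Start from pure noise x_T and iteratively denoise to x_0."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, t)
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * z  # variance choice sigma_t^2 = beta_t
    return x

img = sample()
print(img.shape)
```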
| Method | CUB IS | CUB FID | CUB R-prec | Oxford-102 IS | Oxford-102 FID | COCO IS | COCO FID | COCO R-prec |
|---|---|---|---|---|---|---|---|---|
| GAN-INT-CLS[2] | 2.32 | 68.79 | — | 2.66 | 79.55 | 7.95 | 60.62 | — |
| StackGAN[10] | 3.70 | 51.89 | — | 3.20 | 55.28 | 8.45 | 74.65 | — |
| StackGAN++[25] | 4.04 | 15.30 | — | 3.26 | 48.68 | 8.30 | — | — |
| AttnGAN[11] | 4.36 | 23.98 | 67.83 | — | — | 25.89 | 35.49 | 85.47 |
| DM-GAN[53] | 4.75 | 16.09 | 72.31 | — | — | 30.49 | 32.62 | 88.56 |
| ControlGAN[36] | 4.58 | — | 39.33 | — | — | 24.06 | — | 82.43 |
| SEGAN[34] | 4.67 | 18.16 | — | — | — | 27.86 | 32.28 | — |
| MirrorGAN[41] | 4.56 | — | 57.67 | — | — | 26.47 | — | 74.52 |
| XMC-GAN[60] | — | — | — | — | — | 30.45 | 9.33 | 71.00 |
| CI-GAN[66] | 5.72 | 9.78 | — | — | — | — | — | — |
| DF-GAN[57] | 5.10 | 14.81 | 44.83 | — | — | — | 21.42 | 67.97 |
| DiverGAN[37] | 4.98 | 15.63 | — | 3.99 | 20.52 | — | — | — |
| SSA-GAN[58] | 5.17 | 15.61 | 75.9 | — | — | — | 19.37 | 90.60 |
| Adma-GAN[59] | 5.28 | 8.57 | 55.94 | — | — | 29.07 | 12.39 | 88.74 |
| TVBi-GAN[71] | 5.03 | 11.83 | — | — | — | 31.01 | 31.97 | — |
| Method of Ref. [67] | 4.23 | 11.17 | — | 3.71 | 16.47 | — | — | — |
| Bridge-GAN[92] | 4.74 | — | — | — | — | 16.40 | — | — |
| textStyleGAN[63] | 4.78 | — | 49.56 | — | — | 33.00 | — | 88.23 |
| Method of Ref. [45] | 3.58 | 18.14 | — | 2.90 | 34.97 | 8.94 | 27.07 | — |
| SD-GAN[38] | 4.67 | — | — | — | — | 35.69 | — | — |
| PPAN[29] | 4.38 | — | — | 3.52 | — | — | — | — |
| HfGAN[27] | 4.48 | — | — | 3.57 | — | 27.53 | — | — |
| HDGAN[26] | 4.15 | — | — | 3.45 | — | 11.86 | — | — |
| DSE-GAN[61] | 5.13 | 13.23 | 53.25 | — | — | 26.71 | 15.30 | 76.31 |
Tab.1 Comparison of various metrics for GAN-based text-to-image generation methods on different datasets
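For reference when reading Tab. 1, the metrics are the standard ones from refs. [89-91]; the notation below is a summary, not a reproduction of the original derivations.

```latex
% Inception score (IS, higher is better), ref. [89]; p(y|x) is the Inception-v3
% class posterior for a generated image x, and p(y) its marginal:
\mathrm{IS} = \exp\Big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\Vert\, p(y) \big) \Big)

% Frechet inception distance (FID, lower is better), ref. [90]; (\mu_r, \Sigma_r)
% and (\mu_g, \Sigma_g) are Gaussian statistics of real and generated features:
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
             + \operatorname{Tr}\big( \Sigma_r + \Sigma_g - 2\, (\Sigma_r \Sigma_g)^{1/2} \big)
```

R-precision (higher is better) [91] ranks the ground-truth caption of a generated image against 99 mismatched candidates by image-text feature similarity and reports the top-1 retrieval accuracy.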
| Method | MS-COCO FID | MS-COCO zero-shot FID |
|---|---|---|
| DALL-E[15] | — | 28.0 |
| ERNIE-ViLG[73] | 14.7 | — |
| CogView[16] | — | 27.1 |
| CogView2[17] | 17.7 | 24.0 |
| Parti[18] | 3.22 | 7.23 |
| KNN-Diffusion[75] | — | 16.66 |
| GLIDE[76] | — | 12.24 |
| DALL-E 2[77] | — | 10.39 |
| Imagen[78] | — | 7.27 |
| Stable Diffusion[79] | — | 12.63 |
| Re-Imagen[19] | 5.25 | 6.88 |
Tab.2 Comparison of text-to-image generation methods based on autoregressive model architecture and diffusion model architecture
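The FID values in Tab. 1 and Tab. 2 are computed from Inception features of real and generated images [90]. Below is a minimal sketch of that computation; feature extraction is omitted, and real_feats/fake_feats are assumed to be (N, 2048) activations from the same Inception-v3 layer.

```python
# Minimal FID computation from pre-extracted Inception features (cf. ref. [90]).
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(0), fake_feats.mean(0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)   # matrix square root of the product
    if np.iscomplexobj(covmean):         # discard tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(sigma_r + sigma_g - 2 * covmean))

# Toy usage with random features (real evaluations use >= 10k images per side):
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(64, 16)), rng.normal(size=(64, 16))))
```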
[1]   ZHU X, GOLDBERG A B, ELDAWY M, et al. A text-to-picture synthesis system for augmenting communication [C]// Proceedings of the AAAI Conference on Artificial Intelligence . British Columbia: AAAI, 2007, 7: 1590-1595.
[2]   REED S, AKATA Z, YAN X, et al. Generative adversarial text to image synthesis [C]// International Conference on Machine Learning . New York: ACM, 2016: 1060-1069.
[3]   GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks [J]. Communications of the ACM, 2020, 63(11): 139-144.
doi: 10.1145/3422622
[4]   MIRZA M, OSINDERO S. Conditional generative adversarial nets [EB/OL]. [2014-11-06]. https://arxiv.org/pdf/1411.1784.pdf.
[5]   RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// International Conference on Machine Learning . [S. l. ]: ACM, 2021: 8748-8763.
[6]   ISOLA P, ZHU J Y, ZHOU T, et al. Image-to-image translation with conditional adversarial networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Honolulu: IEEE, 2017: 1125-1134.
[7]   ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2223-2232.
[8]   WANG T, ZHANG T, LIU L, et al. CannyGAN: edge-preserving image translation with disentangled features [C]// IEEE International Conference on Image Processing . Taipei: IEEE, 2019: 514-518.
[9]   ZHANG T, WILIEM A, YANG S, et al. TV-GAN: generative adversarial network based thermal to visible face recognition [C]// International Conference on Biometrics . Gold Coast: IEEE, 2018: 174-181.
[10]   ZHANG H, XU T, LI H, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 5907-5915.
[11]   XU T, ZHANG P, HUANG Q, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Salt Lake City: IEEE, 2018: 1316-1324.
[12]   LEE D D, PHAM P, LARGMAN Y, et al. Advances in Neural Information Processing Systems 22 [R]. Vancouver: NIPS, 2009.
[13]   贺小峰, 毛琳, 杨大伟. 文本生成图像中语义-空间特征增强算法 [J]. 大连民族大学学报, 2022, 24(5): 401-406.
HE Xiaofeng, MAO Lin, YANG Dawei. Semantic-spatial feature enhancement algorithm for text-to-image generation [J]. Journal of Dalian Minzu University, 2022, 24(5): 401-406.
[14]   SHI X, CHEN Z, WANG H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting [J]. Advances in Neural Information Processing Systems, 2015, 28: 156-167.
[15]   RAMESH A, PAVLOV M, GOH G, et al. Zero-shot text-to-image generation [C]// International Conference on Machine Learning . [S. l. ]: ACM, 2021: 8821-8831.
[16]   DING M, YANG Z, HONG W, et al. CogView: mastering text-to-image generation via transformers [J]. Advances in Neural Information Processing Systems, 2021, 34: 19822-19835.
[17]   DING M, ZHENG W, HONG W, et al. CogView2: faster and better text-to-image generation via hierarchical transformers [EB/OL]. [2022-05-27]. https://arxiv.org/pdf/2204.14217.
[18]   YU J, XU Y, KOH J Y, et al. Scaling autoregressive models for content-rich text-to-image generation [EB/OL]. [2022-06-22]. https://arxiv.org/pdf/2206.10789.
[19]   CHEN W, HU H, SAHARIA C, et al. Re-Imagen: retrieval-augmented text-to-image generator [EB/OL]. [2022-11-22]. https://arxiv.org/pdf/2209.14491.
[20]   HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [J]. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.
[21]   DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis [J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794.
[22]   DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [EB/OL]. [2019-05-24]. https://arxiv.org/pdf/1810.04805.
[23]   KARRAS T, LAINE S, AILA T. A style-based generator architecture for generative adversarial networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 4401-4410.
[24]   REED S, AKATA Z, LEE H, et al. Learning deep representations of fine-grained visual descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Las Vegas: IEEE, 2016: 49-58.
[25]   ZHANG H, XU T, LI H, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947-1962.
[26]   ZHANG Z, XIE Y, YANG L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6199-6208.
[27]   HUANG X, WANG M, GONG M. Hierarchically-fused generative adversarial network for text to realistic image synthesis [C]// 16th Conference on Computer and Robot Vision . Kingston: IEEE, 2019: 73-80.
[28]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . Las Vegas: IEEE, 2016: 770-778.
[29]   GAO L, CHEN D, SONG J, et al. Perceptual pyramid adversarial networks for text-to-image synthesis [C]// Proceedings of the AAAI Conference on Artificial Intelligence . Honolulu: AAAI, 2019, 33(1): 8312-8319.
[30]   LAI W S, HUANG J B, AHUJA N, et al. Deep Laplacian pyramid networks for fast and accurate super-resolution [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 624-632.
[31]   韩爽. 基于生成对抗网络的文本到图像生成技术研究[D]. 大庆: 东北石油大学, 2022.
HAN Shuang. Research on text-to-image generation techniques based on generative adversarial networks [D]. Daqing: Northeast Petroleum University, 2022.
[32]   王家喻. 基于生成对抗网络的图像生成研究[D]. 合肥: 中国科学技术大学, 2021.
WANG Jiayu. Research on image generation based on generative adversarial networks [D]. Hefei: University of Science and Technology of China, 2021.
[33]   田枫, 孙小强, 刘芳, 等. 融合双注意力与多标签的图像中文描述生成方法 [J]. 计算机系统应用, 2021, 30(7): 32-40.
TIAN Feng, SUN Xiaoqiang, LIU Fang, et al. Image caption generation method combining dual attention and multi-labels [J]. Computer Systems and Applications, 2021, 30(7): 32-40.
doi: 10.15888/j.cnki.csa.008010
[34]   TAN H, LIU X, LI X, et al. Semantics-enhanced adversarial nets for text-to-image synthesis [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 10501-10510.
[35]   HUANG W, DA XU R Y, OPPERMANN I. Realistic image generation using region-phrase attention [C]// Asian Conference on Machine Learning . Nagoya: ACM, 2019: 284-299.
[36]   LI B, QI X, LUKASIEWICZ T, et al. Controllable text-to-image generation [J]. Advances in Neural Information Processing Systems, 2019, 32: 2065-2075.
[37]   ZHANG Z, SCHOMAKER L. DiverGAN: an efficient and effective single-stage framework for diverse text-to-image generation [J]. Neurocomputing, 2022, 473: 182-198.
[38]   YIN G, LIU B, SHENG L, et al. Semantics disentangling for text-to-image generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 2327-2336.
[39]   HADSELL R, CHOPRA S, LECUN Y. Dimensionality reduction by learning an invariant mapping [C]// IEEE Computer Society Conference on Computer Vision and Pattern Recognition . New York: IEEE, 2006: 1735-1742.
[40]   CHEN Z, LUO Y. Cycle-consistent diverse image synthesis from natural language [C]// IEEE International Conference on Multimedia and Expo Workshops . Shanghai: IEEE, 2019: 459-464.
[41]   QIAO T, ZHANG J, XU D, et al. MirrorGAN: learning text-to-image generation by redescription [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1505-1514.
[42]   KARPATHY A, FEI-FEI L. Deep visual-semantic alignments for generating image descriptions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3128-3137.
[43]   VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3156-3164.
[44]   DUMOULIN V, BELGHAZI I, POOLE B, et al. Adversarially learned inference [EB/OL]. [2017-02-21]. https://arxiv.org/pdf/1606.00704.pdf.
[45]   LAO Q, HAVAEI M, PESARANGHADER A, et al. Dual adversarial inference for text-to-image synthesis [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 7567-7576.
[46]   王蕾. 基于关联语义挖掘的文本生成图像算法研究[D]. 西安: 西安电子科技大学, 2020.
WANG Lei. Research on text-to-image generation algorithm based on semantic association mining [D]. Xi’an: Xidian University, 2020.
[47]   吕文涵, 车进, 赵泽纬, 等. 基于动态卷积与文本数据增强的图像生成方法 [EB/OL]. [2023-07-01]. https://kns.cnki.net/kcms2/article/abstract?v=sSXGFc3NEDIAReRhgp48uNl5G2T_5G24IJmVa17AFT4XZFr932Jmsa2EZrM7rxoIWSwHni_2CiKpa4phSwe9hcwvepEs3fO1pcWCTfWKZ7gIU_jFpQgmgw==&uniplatform=NZKPT.
LYU Wenhan, CHE Jin, ZHAO Zewei, et al. Image generation method based on dynamic convolution and text data augmentation [EB/OL]. [2023-07-01]. https://kns.cnki.net/kcms2/article/abstract?v=sSXGFc3NEDIAReRhgp48uNl5G2T_5G24IJmVa17AFT4XZFr932Jmsa2EZrM7rxoIWSwHni_2CiKpa4phSwe9hcwvepEs3fO1pcWCTfWKZ7gIU_jFpQgmgw==&uniplatform=NZKPT.
[48]   薛志杭, 许喆铭, 郎丛妍, 等. 基于图像-文本语义一致性的文本生成图像方法[J]. 计算机研究与发展, 2023, 60(9): 2180-2190.
XUE Zhihang, XU Zheming, LANG Congyan, et al. Text-to-image generation method based on image-text semantic consistency [J]. Journal of Computer Research and Development , 2023, 60(9): 2180-2190.
[49]   吴春燕, 潘龙越, 杨有. 基于特征增强生成对抗网络的文本生成图像方法 [J]. 微电子学与计算机, 2023, (6): 51-61.
WU Chunyan, PAN Longyue, YANG You. Text-to-image generation method based on feature-enhanced generative adversarial network [J]. Microelectronics and Computers, 2023, (6): 51-61.
doi: 10.19304/J.ISSN1000-7180.2022.0629
[50]   王威, 李玉洁, 郭富林, 等. 生成对抗网络及其文本图像合成综述[J]. 计算机工程与应用, 2022, 58(19): 14-36.
WANG Wei, LI Yujie, GUO Fulin, et al. A survey on generative adversarial networks and text-image synthesis [J]. Computer Engineering and Applications, 2022, 58(19): 14-36.
[51]   李欣炜. 基于多深度神经网络的文本生成图像研究[D]. 大连: 大连理工大学, 2022.
LI Xinwei. Research on text-generated image based on multi-deep neural network [D]. Dalian: Dalian University of Technology, 2022.
[52]   叶龙, 王正勇, 何小海. 基于多模态融合的文本生成图像[J]. 智能计算机与应用, 2022, 12(11): 9-17.
YE Long, WANG Zhengyong, HE Xiaohai. Image generation from text based on multi-modal fusion [J]. Intelligent Computer and Application, 2022, 12(11): 9-17.
[53]   ZHU M, PAN P, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5802-5810.
[54]   GULCEHRE C, CHANDAR S, CHO K, et al. Dynamic neural Turing machine with continuous and discrete addressing schemes [J]. Neural Computation, 2018, 30(4): 857-884.
doi: 10.1162/neco_a_01060
[55]   SUKHBAATAR S, WESTON J, FERGUS R. End-to-end memory networks [J]. Advances in Neural Information Processing Systems, 2015, 28: 576-575.
[56]   TAI K S, SOCHER R, MANNING C D. Improved semantic representations from tree-structured long short-term memory networks [EB/OL]. [2015-05-30]. https://arxiv.org/pdf/1503.00075.pdf.
[57]   TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16515-16525.
[58]   LIAO W, HU K, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 18187-18196.
[59]   WU X, ZHAO H, ZHENG L, et al. Adma-GAN: attribute-driven memory augmented GANs for text-to-image generation [C]// Proceedings of the 30th ACM International Conference on Multimedia . Lisboa: ACM, 2022: 1593-1602.
[60]   ZHANG H, KOH J Y, BALDRIDGE J, et al. Cross-modal contrastive learning for text-to-image generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 833-842.
[61]   HUANG M, MAO Z, WANG P, et al. DSE-GAN: dynamic semantic evolution generative adversarial network for text-to-image generation [C]// Proceedings of the 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 4345-4354.
[62]   BROCK A, DONAHUE J, SIMONYAN K. Large scale GAN training for high fidelity natural image synthesis [EB/OL]. [2019-02-25]. https://arxiv.org/pdf/1809.11096.pdf.
[63]   STAP D, BLEEKER M, IBRAHIMI S, et al. Conditional image generation and manipulation for user-specified content [EB/OL]. [2020-05-11]. https://arxiv.org/pdf/2005.04909.pdf.
[64]   ZHANG Y, LU H. Deep cross-modal projection learning for image-text matching [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 686-701.
[65]   LEDIG C, THEIS L, HUSZÁR F, et al. Photo-realistic single image super-resolution using a generative adversarial network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 4681-4690.
[66]   WANG H, LIN G, HOI S C H, et al. Cycle-consistent inverse GAN for text-to-image synthesis [C]// Proceedings of the 29th ACM International Conference on Multimedia . [S. l.]: ACM, 2021: 630-638.
[67]   SOUZA D M, WEHRMANN J, RUIZ D D. Efficient neural architecture for text-to-image synthesis [C]// International Joint Conference on Neural Networks. Glasgow: IEEE, 2020: 1-8.
[68]   ROMBACH R, ESSER P, OMMER B. Network-to-network translation with conditional invertible neural networks [J]. Advances in Neural Information Processing Systems, 2020, 33: 2784-2797.
[69]   DINH L, KRUEGER D, BENGIO Y. NICE: non-linear independent components estimation [EB/OL]. [2015-04-10]. https://arxiv.org/pdf/1410.8516.
[70]   DINH L, SOHL-DICKSTEIN J, BENGIO S. Density estimation using real NVP [EB/OL]. [2017-02-27]. https://arxiv.org/pdf/1605.08803.
[71]   WANG Z, QUAN Z, WANG Z J, et al. Text to image synthesis with bidirectional generative adversarial network [C]// IEEE International Conference on Multimedia and Expo. London: IEEE, 2020: 1-6.
[72]   DONAHUE J, KRÄHENBÜHL P, DARRELL T. Adversarial feature learning [EB/OL]. [2017-04-03]. https://arxiv.org/pdf/1605.09782.pdf.
[73]   ZHANG H, YIN W, FANG Y, et al. ERNIE-ViLG: unified generative pre-training for bidirectional vision-language generation [EB/OL]. [2023-07-01]. https://arxiv.org/abs/2112.15283.
[74]   LIU X, PARK D H, AZADI S, et al. More control for free! image synthesis with semantic diffusion guidance [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 289-299.
[75]   SHEYNIN S, ASHUAL O, POLYAK A, et al. KNN-Diffusion: image generation via large-scale retrieval [EB/OL]. [2022-10-02]. https://arxiv.org/pdf/2204.02849.
[76]   NICHOL A Q, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models [C]// International Conference on Machine Learning. Baltimore: ACM, 2022: 16784-16804.
[77]   RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents [EB/OL]. [2022-04-13]. https://arxiv.org/pdf/2204.06125.
[78]   SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding [J]. Advances in Neural Information Processing Systems, 2022, 35: 36479-36494.
[79]   ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10684-10695.
[80]   RUIZ N, LI Y, JAMPANI V, et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 22500-22510.
[81]   ZHANG L, AGRAWALA M. Adding conditional control to text-to-image diffusion models [EB/OL]. [2023-02-10]. https://arxiv.org/pdf/2302.05543.
[82]   CHEFER H, ALALUF Y, VINKER Y, et al. Attend-and-excite: attention-based semantic guidance for text-to-image diffusion models [J]. ACM Transactions on Graphics, 2023, 42(4): 1-10.
[83]   MOU C, WANG X, XIE L, et al. T2I-Adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models [EB/OL]. [2023-03-20]. https://arxiv.org/pdf/2302.08453.
[84]   NILSBACK M E, ZISSERMAN A. Automated flower classification over a large number of classes [C]// 6th Indian Conference on Computer Vision, Graphics and Image Processing. Bhubaneswar: IEEE, 2008: 722-729.
[85]   WAH C, BRANSON S, WELINDER P, et al. The caltech-ucsd birds-200-2011 dataset [EB/OL]. [2023-07-06]. https://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf.
[86]   GUO Y, ZHANG L, HU Y, et al. MS-Celeb-1M: a dataset and benchmark for large-scale face recognition [C]// European Conference on Computer Vision. Amsterdam: [s. n.], 2016: 87-102.
[87]   LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Zurich: [s. n.], 2014: 740-755.
[88]   FROLOV S, HINZ T, RAUE F, et al. Adversarial text-to-image synthesis: a review [J]. Neural Networks, 2021, 144: 187-209.
[89]   SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs [C]// Proceedings of the 29th Advances in Neural Information Processing Systems. Barcelona: [s. n.], 2016: 2226-2234.
[90]   HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]// Proceedings of the 30th Advances in Neural Information Processing Systems. Long Beach: [s. n.], 2017: 6626-6637.
[91]   XU T, ZHANG P, HUANG Q, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316-1324.