Journal of Zhejiang University (Science Edition)  2023, Vol. 50 Issue (6): 651-667    DOI: 10.3785/j.issn.1008-9497.2023.06.001
CCF CAD/CG 2023     
A review of conditional image generation based on diffusion models
Zerun LIU1, Yufei YIN1,2, Wenhao XUE1,3, Rui GUO1, Lechao CHENG1
1. Zhejiang Lab, Hangzhou 311121, China
2. CAS Key Laboratory of GIPAS, University of Science and Technology of China, Hefei 230026, China
3. School of Automation, Northwestern Polytechnical University, Xi'an 710072, China

Abstract  

Artificial intelligence generated content (AIGC) has recently received significant attention. Among the numerous generative models proposed, the emerging diffusion model has attracted extensive attention due to its highly interpretable mathematical properties and its ability to generate high-quality and diverse results. Diffusion models have achieved remarkable results in the field of condition-guided image generation. This achievement promotes the development of diffusion models in other conditional tasks and has led to various applications in areas such as movies, games, painting, and virtual reality. For instance, in text-guided image generation tasks, diffusion models can generate high-resolution images while ensuring the quality of the generated images. In this paper, we first introduce the definition and background of diffusion models. Then, we review the development history and latest progress of conditional image generation based on diffusion models. Finally, we conclude this survey with a discussion of challenges and future research directions for diffusion models.



Key words: diffusion model; conditional image generation; application
Received: 10 May 2023      Published: 30 November 2023
CLC:  TP 391.41  
Corresponding Authors: Lechao CHENG     E-mail: chenglc@zhejianglab.com
Cite this article:

Zerun LIU, Yufei YIN, Wenhao XUE, Rui GUO, Lechao CHENG. A review of conditional image generation based on diffusion models. Journal of Zhejiang University (Science Edition), 2023, 50(6): 651-667.

URL:

https://www.zjujournals.com/sci/EN/Y2023/V50/I6/651


A review of condition-guided image generation based on diffusion models

Content generated with artificial intelligence technology (artificial intelligence generated content, AIGC) has become a hot topic. Among the many generative models, diffusion models have attracted wide attention for their highly interpretable mathematical properties and their high-quality, diverse results; they have achieved remarkable success in condition-guided image generation and are widely applied in fields such as movies, games, painting, and virtual reality. In text-guided image generation tasks, diffusion models can not only generate high-resolution images but also guarantee the quality of the generated images. This paper first introduces the definition and background of diffusion models, then focuses on the development history and latest progress of diffusion models in condition-guided image generation, and finally discusses the challenges facing diffusion models and their potential research directions, aiming to provide researchers with an overview of the field and its frontier developments.


Keywords: diffusion model; condition-guided image generation; application
Fig.1 Forward and reverse diffusion processes
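Fig. 1 depicts the two processes that define a diffusion model: a forward process that gradually corrupts an image with Gaussian noise, and a learned reverse process that removes the noise step by step. As a minimal illustration only (not the implementation of any specific method surveyed here), the sketch below follows the DDPM formulation of Ho et al. [14]; the noise-prediction network eps_model (e.g., a U-Net) and the linear beta schedule are assumed, illustrative choices.

```python
import torch

# Illustrative linear beta schedule as in DDPM [14]; T and the endpoints are common choices.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise):
    """Forward process: sample x_t ~ q(x_t | x_0) in closed form for a batch of timesteps t."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise

@torch.no_grad()
def p_sample(eps_model, shape):
    """Reverse process: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = eps_model(x, torch.full((shape[0],), t))      # predicted noise at step t
        coef = betas[t] / (1.0 - alpha_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()              # posterior mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)    # add sigma_t * z
    return x
```

The conditional variants covered in this survey keep this same structure and additionally feed the condition (a text embedding, sketch, layout, etc.) into the noise-prediction network or into a guidance term during sampling.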
Fig. 2 Condition-based classification of conditional image generation methods
Dataset | Images | Texts | URL
Flickr30K [43] | 32 K | 158 K | http://shannon.cs.illinois.edu/DenotationGraph/
MS-COCO [44] | 330 K | 1.5 M | https://cocodataset.org/
CC [45] | 12 M | 12 M | https://github.com/google-research-datasets/conceptual-12m
WIT [46] | 11.5 M | 37.6 M | https://github.com/google-research-datasets/wit
WuKong [47] | 100 M | 100 M | https://wukong-dataset.github.io/wukong-dataset/
LAION-400M [48] | 400 M | 400 M | https://laion.ai/blog/laion-400-open-dataset/
COYO [49] | 700 M | 700 M | https://github.com/kakaobrain/coyo-dataset
LAION-5B [50] | 5 B | 5 B | https://laion.ai/projects/
Table 1 Large-scale image-text datasets
Model | Parameters/B | FID-30K (↓) | Zero-shot FID (↓)
DM-GAN [57] | - | 20.79 | -
XMC-GAN [58] | - | 9.33 | -
LAFITE [59] | 0.2 | 8.12 | -
CogView2 [60] | 6.0 | 17.70 | 24.00
Make-A-Scene [61] | 4.0 | 7.55 | 11.84
Parti [62] | 20.0 | 3.22 | 7.23
GLIDE [25] | 5.0 | - | 12.24
DALL-E 2 [33] | 6.5 | - | 10.39
Stable Diffusion [32] | 1.4 | - | 8.59
Simple Diffusion [53] | - | - | 8.30
Imagen [34] | 7.9 | - | 7.27
eDiff-I [52] | 9.1 | - | 6.95
ERNIE-ViLG 2.0 [51] | 24.0 | - | 6.72
Table 2 Comparison of FID on MS-COCO dataset
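The FID values in Table 2 measure the Fréchet distance between Gaussians fitted to Inception-v3 features of real and generated images (lower is better); FID-30K is computed with 30 000 samples on MS-COCO, while zero-shot FID refers to models evaluated on MS-COCO without being trained on it. A minimal sketch of the distance itself, assuming the feature means and covariances have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu_r, cov_r, mu_g, cov_g):
    """FID between two Gaussians (mean, covariance) fitted to Inception features
    of real (r) and generated (g) images."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):                           # drop tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

In practice the feature statistics are extracted with a pretrained Inception-v3 network, e.g., via the pytorch-fid or torchmetrics packages.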
Fig.3 Framework of text-guided image generation methods based on retrieval enhancement
Model | Text consistency | Image consistency
Textual Inversion [77] | 0.183 | 0.689
DreamBooth [78] | 0.249 | 0.827
DreamArtist [79] | 0.286 | 0.739
Custom Diffusion [80] | 0.231 | 0.868
ELITE [81] | 0.266 | 0.804
Cones [82] | 0.237 | 0.853
SVDiff [83] | 0.323 | 0.716
Table 3 Comparison of models for subject-driven generation
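The text- and image-consistency scores in Table 3 are commonly reported as CLIP similarities: between each generated image and its prompt, and between the generated image and reference images of the personalized subject; the exact evaluation protocol varies across the cited papers. A minimal sketch under that assumption, using the public openai/clip-vit-base-patch32 checkpoint:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def text_consistency(generated: Image.Image, prompt: str) -> float:
    """Cosine similarity between the generated image and its text prompt (CLIP-T style)."""
    inputs = processor(text=[prompt], images=generated, return_tensors="pt", padding=True)
    img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"]), dim=-1)
    return float((img * txt).sum())

@torch.no_grad()
def image_consistency(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between the generated image and a reference subject image (CLIP-I style)."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    feats = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    return float(feats[0] @ feats[1])
```

Higher values indicate better alignment; the absolute scale depends on the CLIP backbone used, so scores are only comparable under a fixed evaluation setup.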
Model | Condition form | Datasets
PITI [86] | Sketch, layout map | ADE20K, DIODE, COCO-Stuff
Sketch-Guided [87] | Sketch and text | Sketchy, Edge2shoes
DiSS [89] | Sketch, color map | COCO-Stuff, Visual
Sketch2Photo [88] | Sketch, text | LAION, GeoPose3K, LSUN Church
DiffFaceSketch [90] | Sketch | CelebA-HQ
Table 4 Comparison of layout-based image generation methods
Fig. 4 Common layout forms and the corresponding generated results
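Methods such as those in Table 4 condition the denoising network on a spatial input (sketch, edge map, segmentation, or layout) in addition to text. As a practical illustration only, and not one of the methods listed in the table, the sketch below uses the open-source diffusers implementation of ControlNet [98], where the edge map fixes the spatial layout and the prompt controls appearance; the image path is a placeholder.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Publicly released ControlNet trained on Canny edge maps, attached to Stable Diffusion v1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edge_map = load_image("edges.png")  # placeholder: a pre-computed Canny edge image
result = pipe(
    "a stone cottage in a snowy forest",   # text controls appearance
    image=edge_map,                        # edge map controls spatial layout
    num_inference_steps=30,
).images[0]
result.save("controlnet_result.png")
```

The same pattern applies to other spatial conditions (segmentation maps, depth, human pose) by swapping in the corresponding ControlNet checkpoint.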
[114]   HAO Y R, CHI Z W, DONG L, et al. Optimizing Prompts For Text-To-Image Generation[Z]. (2022-12-19). https://doi.org/10.48550/arXiv.2212.09611.
[115]   WITTEVEEN S, ANDREWS M. Investigating Prompt Engineering in Diffusion Models[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.15462.
[116]   WANG Z J, MONTOYA E, MUNECHIKA D, et al. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models[Z]. (2022-10-26). https://doi.org/10.48550/arXiv.2210.14896.
[117]   SONG J M, MENG C L, ERMON S. Denoising Diffusion Implicit Models[Z]. (2020-10-06). https://doi.org/10.48550/arXiv.2010.02502.
[118]   LU C, ZHOU Y H, BAO F, et al. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps[Z]. (2022-06-02). https://doi.org/10.48550/arXiv.2206.00927.
[119]   LU C, ZHOU Y H, BAO F, et al. DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models[Z]. (2022-11-02). https://doi.org/10.48550/arXiv.2211.01095.
[120]   ZHANG Q S, TAO M L, CHEN Y X. gDDIM: Generalized Denoising Diffusion Implicit Models[Z]. (2022-06-11). https://doi.org/10.48550/arXiv.2206.05564.
[121]   BAO F, LI C X, ZHU J, et al. Analytic-DPM: An Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models[Z]. (2022-01-17). https://doi.org/10.48550/arXiv.2201.06503.
[122]   LUHMAN E, LUHMAN T. Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed[Z]. (2021-01-07). https://doi.org/10.48550/arXiv.2101.02388.
[123]   SALIMANS T, HO J. Progressive Distillation for Fast Sampling of Diffusion Models[Z]. (2022-02-01). https://doi.org/10.48550/arXiv.2202.00512.
[124]   MENG C L, ROMBACH R, GAO R Q, et al. On distillation of guided diffusion models[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 14297-14306. DOI:10.1109/cvpr52729.2023.01374
[125]   BAO F, NIE S, XUE K W, et al. All are worth words: A ViT backbone for diffusion models[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 22669-22679. DOI:10.1109/cvpr52729.2023.02171
[126]   PEEBLES W, XIE S. Scalable Diffusion Models with Transformers[Z]. (2022-12-19). https://doi.org/10.48550/arXiv.2212.09748.
[1]   WEI L Y, LEFEBVRE S, KWATRA V, et al. State of the Art in Example-Based Texture Synthesis[R]. Eindhoven: Eurographics Association, 2009: 93-117.
[2]   HAN C, RISSER E, RAMAMOORTHI R, et al. Multiscale texture synthesis[J]. ACM Transactions on Graphics, 2008, 27(3): 1-8. DOI:10.1145/1360612.1360650
[3]   MAKTHAL S, ROSS A. Synthesis of iris images using Markov random fields[C]// 2005 13th European Signal Processing Conference. Antalya: IEEE, 2005: 1-4.
[4]   OSINDERO S, HINTON G E. Modeling image patches with a directed hierarchy of Markov random fields[J]. Advances in Neural Information Processing Systems, 2008, 20: 1121-1128.
[5]   GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144. DOI:10.1007/978-3-030-50017-7_10
[6]   MIRZA M, OSINDERO S. Conditional Generative Adversarial Nets[Z]. (2014-11-06). https://arXiv.org/abs/1411.1784.
[7]   OORD A V D, KALCHBRENNER N, VINYALS O, et al. Conditional Image Generation with PixelCNN Decoders[Z]. (2016-06-16). https://doi.org/10.48550/arXiv.1606.05328.
[8]   SALIMANS T, KARPATHY A, CHEN X, et al. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications[Z]. (2017-01-19). https://doi.org/10.48550/arXiv.1701.05517.
[9]   KINGMA D P, WELLING M. Auto-Encoding Variational Bayes[Z]. (2013-12-20). https://doi.org/10.48550/arXiv.1312.6114.
[10]   DINH L, KRUEGER D, BENGIO Y. NICE: Non-Linear Independent Components Estimation[Z]. (2014-10-30). https://doi.org/10.48550/arXiv.1410.8516.
[11]   DINH L, SOHL-DICKSTEIN J, BENGIO S. Density Estimation Using Real NVP[Z]. (2016-05-27). https://doi.org/10.48550/arXiv.1605.08803.
[12]   LECUN Y, CHOPRA S, HADSELL R, et al. A tutorial on energy-based learning[C]//BAKIR G, HOFMAN T, SCHÖLKOPF B. Predicting Structured Data. Cambridge: MIT Press, 2006. doi:10.7551/mitpress/7443.003.0014
[13]   NGIAM J, CHEN Z, KOH P W, et al. Learning deep energy models[C]// 28th International Conference on Machine Learning. Bellevue: Omnipress, 2011: 1105-1112.
[14]   HO J, JAIN A, ABBEEL P. Denoising Diffusion Probabilistic Models[Z]. (2020-06-19). https://doi.org/10.48550/arXiv.2006.11239.
[15]   SONG Y, ERMON S. Generative modeling by estimating gradients of the data distribution[C]// Thirty-third Conference on Neural Information Processing Systems (NeurIPS). Vancouver: NeurIPS, 2019.
[16]   ZHANG H, XU T, LI H S, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]// 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017: 5908-5916. DOI:10.1109/iccv.2017.629
[17]   XU T, ZHANG P C, HUANG Q Y, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316-1324. DOI:10.1109/cvpr.2018.00143
[18]   DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794.
[19]   RAMESH A, PAVLOV M, GOH G, et al. Zero-shot text-to-image generation[C]// International Conference on Machine Learning. Online: PMLR, 2021: 8821-8831.
[20]   KARRAS T, LAINE S, AILA T. A style-based generator architecture for generative adversarial networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4401-4410. DOI:10.1109/cvpr.2019.00453
[21]   WU H H, SEETHARAMAN P, KUMAR K, et al. Wav2clip: Learning robust audio representations from clip[C]// 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Online: IEEE, 2022: 4563-4567. DOI:10.1109/icassp43922.2022.9747669
[22]   SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al. Score-Based Generative Modeling Through Stochastic Differential Equations[Z]. (2020-11-26). https://doi.org/10.48550/arXiv.2011.13456.
[23]   DHARIWAL P, NICHOL A. Diffusion models beat gans on image synthesis[J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794.
[24]   NICHOL A, DHARIWAL P. Improved Denoising Diffusion Probabilistic Models[Z]. (2021-02-18). https://arxiv.org/abs/2102.09672.
[25]   NICHOL A, DHARIWAL P, RAMESH A, et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models[Z]. (2021-12-20). https://arxiv.org/abs/2112.10741.
[26]   RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]// 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich: MICCAI, 2015: 234-241. doi:10.1007/978-3-319-24574-4_28
[27]   SOHL-DICKSTEIN J, WEISS E, MAHESWARANATHAN N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]// 32nd International Conference on Machine Learning. Lille: PMLR, 2015: 2256-2265.
[28]   SONG Y, ERMON S. Improved techniques for training score-based generative models[J]. Advances in Neural Information Processing Systems, 2020, 33: 12438-12448. doi:10.48550/arXiv.2006.09011
[29]   SONG Y, DURKAN C, MURRAY I, et al. Maximum likelihood training of score-based diffusion models[J]. Advances in Neural Information Processing Systems, 2021, 34: 1415-1428.
[30]   BROCK A, DONAHUE J, SIMONYAN K. Large Scale GAN Training for High Fidelity Natural Image Synthesis[Z]. (2018-09-28). https://doi.org/10.48550/arXiv.1809.11096.
[31]   HO J, SALIMANS T. Classifier-Free Diffusion Guidance[Z]. (2022-07-26). https://doi.org/10.48550/arXiv.2207.12598.
[32]   ROMBACH R, BLATTMANN A, LORENZ D, et al. High-Resolution Image Synthesis with Latent Diffusion Models[Z]. (2021-12-20). https://doi.org/10.48550/arXiv.2112.10752.
[33]   RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents[Z]. (2022-04-13). https://doi.org/10.48550/arXiv.2204.06125.
[34]   SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding[Z]. (2022-03-23). https://doi.org/10.48550/arXiv.2205.11487.
[35]   RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]// International Conference on Machine Learning. Online: PMLR, 2021: 8748-8763.
[36]   LIU X, PARK D H, AZADI S, et al. More control for free image synthesis with semantic diffusion guidance[C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Vancouver: IEEE, 2023: 289-299. doi:10.1109/wacv56688.2023.00037
[37]   AVRAHAMI O, LISCHINSKI D, FRIED O. Blended diffusion for text-driven editing of natural images[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022: 18187-18197. DOI:10.1109/CVPR52688.2022.01767
[38]   KWON M, JEONG J, UH Y. Diffusion Models Already Have a Semantic Latent Space[Z]. (2022-10-20). https://doi.org/10.48550/arXiv.2210.10960.
[39]   KIM G, KWON T, YE J C. Diffusionclip: Text-guided diffusion models for robust image manipulation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 2426-2435. DOI:10.1109/cvpr52688.2022.00246
[40]   GU S Y, CHEN D, BAO J M, et al. Vector quantized diffusion model for text-to-image synthesis[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10696-10706. DOI:10.1109/cvpr52688.2022.01043
[41]   HO J, SAHARIA C, CHAN W, et al. Cascaded diffusion models for high fidelity image generation[J]. The Journal of Machine Learning Research, 2022, 23(47): 1-33.
[42]   SAHARIA C, HO J, CHAN W, et al. Image super-resolution via iterative refinement[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(4): 4713-4726. DOI:10.1109/tpami.2022.3204461
[43]   YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78. DOI:10.1162/tacl_a_00166
[44]   LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft coco: Common objects in context[C]// 13th European Conference on Computer Vision (ECCV). Zurich: Springer, 2014: 740-755. doi:10.1007/978-3-319-10602-1_48
[45]   CHANGPINYO S, SHARMA P, DING N, et al. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3558-3568. DOI:10.1109/cvpr46437.2021.00356
[46]   SRINIVASAN K, RAMAN K, CHEN J, et al. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning[C]// 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. Online: ACM, 2021: 2443-2449. DOI:10.1145/3404835.3463257
[47]   GU J X, MENG X J, LU G S, et al. Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and a Foundation Framework[Z]. (2022-02-14). https://doi.org/10.48550/arXiv.2202.06767.
[48]   SCHUHMANN C, VENCU R, BEAUMONT R, et al. Laion-400M: Open Dataset of Clip-Filtered 400 Million Image-Text Pairs[Z]. (2021-11-03). https://doi.org/10.48550/arXiv.2111.02114.
[49]   MINWOO B, BEOMHEE P, HAECHEON K, et al. COYO-700M: Image-Text Pair Dataset[Z]. https://github.com/kakaobrain/coyo-dataset.
[50]   SCHUHMANN C, BEAUMONT R, VENCU R, et al. Laion-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models[Z]. (2022-10-16). https://doi.org/10.48550/arXiv.2210.08402.
[51]   FENG Z, ZHANG Z, YU X, et al. ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts[Z]. (2022-10-27). https://doi.org/10.48550/arXiv.2210.15257.
[52]   BALAJI Y, NAH S, HUANG X, et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers[Z]. (2022-11-02). https://doi.org/10.48550/arXiv.2211.01324.
[53]   HOOGEBOOM E, HEEK J, SALIMANS T. Simple Diffusion: End-to-End Diffusion for High Resolution Images[Z]. (2023-01-26). https://doi.org/10.48550/arXiv.2301.11093.
[54]   ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 12868-12878. DOI:10.1109/cvpr46437.2021.01268
[55]   RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
[56]   BORJI A. Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2[Z]. (2022-10-02). https://doi.org/10.48550/arXiv.2210.00586.
[57]   YE H, YANG X, TAKAC M, et al. Improving Text-to-Image Synthesis Using Contrastive Learning[Z]. (2021-07-06). https://doi.org/10.48550/arXiv.2107.02423.
[58]   ZHANG H, KOH J Y, BALDRIDGE J, et al. Cross-modal contrastive learning for text-to-image generation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 833-842. doi:10.1109/cvpr46437.2021.00089
[59]   ZHOU Y F, ZHANG R Y, CHEN C Y, et al. LAFITE: Towards Language-Free Training for Text-to-Image Generation[Z]. (2021-11-27). https://doi.org/10.48550/arXiv.2111.13792.
[60]   DING M, ZHENG W D, HONG W Y, et al. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers[Z]. (2022-04-28). https://doi.org/10.48550/arXiv.2204.14217.
[61]   GAFNI O, POLYAK A, ASHUAL O, et al. Make-a-scene: Scene-based text-to-image generation with human priors[C]// 17th European Conference on Computer Vision. Israel: Springer, 2022: 89-106. doi:10.1007/978-3-031-19784-0_6
[62]   YU J H, XU Y Z, KOH J Y, et al. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation[Z]. (2022-06-22). https://doi.org/10.48550/arXiv.2206.10789.
[63]   LEE K M, LIU H, RYU M, et al. Aligning Text-To-Image Models Using Human Feedback[Z]. (2023-02-23). https://doi.org/10.48550/arXiv.2302.12192.
[64]   ZHANG Q S, SONG J M, HUANG X, et al. DiffCollage: Parallel Generation of Large Content with Diffusion Models[Z]. (2023-03-30). https://doi.org/10.48550/arXiv.2303.17076.
[65]   SCHRAMOWSKI P, BRACK M, DEISEROTH B, et al. Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models[Z]. (2022-11-09). https://doi.org/10.48550/arXiv.2211.05105.
[66]   FRIEDRICH F, SCHRAMOWSKI P, BRACK M, et al. Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness[Z]. (2023-02-07). https://doi.org/10.48550/arXiv.2302.10893.
[67]   ZHU Y, WU Y, OLSZEWSKI K, et al. Discrete contrastive diffusion for cross-modal music and image generation[C]// The Eleventh International Conference on Learning Representations. Kigali: ICLR, 2023.
[68]   LIU N, LI S, DU Y L, et al. Compositional visual generation with composable diffusion models[C]// 17th European Conference on Computer Vision. Israel: Springer, 2022: 423-439. doi:10.1007/978-3-031-19790-1_26
[69]   LIEW J H, YAN H, ZHOU D, et al. MagicMix: Semantic Mixing with Diffusion Models[Z]. (2022-10-28). https://doi.org/10.48550/arXiv.2210.16056.
[70]   MA W D K, LEWIS J P, KLEIJN W B, et al. Directed Diffusion: Direct Control of Object Placement through Attention Guidance[Z]. (2023-02-25). https://doi.org/10.48550/arXiv.2302.13153.
[71]   CHEFER H, ALALUF Y, VINKER Y, et al. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models[Z]. (2023-01-31). https://doi.org/10.48550/arXiv.2301.13826.
[72]   GRAVE E, JOULIN A, USUNIER N. Improving Neural Language Models with a Continuous Cache[Z]. (2016-12-13). https://doi.org/10.48550/arXiv.1612.04426.
[73]   ROMBACH R, BLATTMANN A, OMMER B. Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models[Z]. (2022-07-26). https://doi.org/10.48550/arXiv.2207.13038.
[74]   BLATTMANN A, ROMBACH R, OKTAY K, et al. Retrieval-augmented diffusion models[J]. Advances in Neural Information Processing Systems, 2022, 35: 15309-15324.
[75]   CHEN W H, HU H X, SAHARIA C, et al. Re-Imagen: Retrieval-Augmented Text-to-Image Generator[Z]. (2022-09-29). https://doi.org/10.48550/arXiv.2209.14491.
[76]   SHEYNIN S, ASHUAL O, POLYAK A, et al. KNN-Diffusion: Image Generation via Large-Scale Retrieval[Z]. (2022-04-06). https://doi.org/10.48550/arXiv.2204.02849.
[77]   GAL R, ALALUF Y, ATZMON Y, et al. An Image is Worth One Word: Personalizing Text-to-Image Generation Using Textual Inversion[Z]. (2022-08-02). https://doi.org/10.48550/arXiv.2208.01618.
[78]   RUIZ N, LI Y, JAMPANI V, et al. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation[Z]. (2022-08-25). https://doi.org/10.48550/arXiv.2208.12242.
[79]   DONG Z Y, WEI P X, LIN L. DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.11337.
[80]   KUMARI N, ZHANG B, ZHANG R, et al. Multi-Concept Customization of Text-to-Image Diffusion[Z]. (2022-12-08). https://doi.org/10.48550/arXiv.2212.04488.
[81]   WEI Y, ZHANG Y, JI Z, et al. ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation[Z]. (2023-02-27). https://doi.org/10.48550/arXiv.2302.13848.
[82]   LIU Z H, FENG R L, ZHU K, et al. Cones: Concept Neurons in Diffusion Models for Customized Generation[Z]. (2023-03-09). https://doi.org/10.48550/arXiv.2303.05125.
[83]   HAN L G, LI Y X, ZHANG H, et al. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning[Z]. (2023-03-20). https://doi.org/10.48550/arXiv.2303.11305.
[84]   PATASHNIK O, GARIBI D, AZURI I, et al. Localizing Object-Level Shape Variations with Text-to-Image Diffusion Models[Z]. (2023-03-20). https://doi.org/10.48550/arXiv.2303.11306.
[85]   HUANG Z Q, WU T X, JIANG Y M, et al. ReVersion: Diffusion-Based Relation Inversion from Images[Z]. (2023-03-23). https://doi.org/10.48550/arXiv.2303.13495.
[86]   WANG T F, ZHANG T, ZHANG B, et al. Pretraining is All You Need for Image-to-Image Translation[Z]. (2022-05-25). https://doi.org/10.48550/arXiv.2205.12952.
[87]   VOYNOV A, ABERMAN K, COHEN-OR D. Sketch-Guided Text-to-Image Diffusion Models[Z]. (2022-11-24). https://doi.org/10.48550/arXiv.2211.13752.
[88]   MAUNGMAUNG A, SHING M, MITSUI K, et al. Text-Guided Scene Sketch-to-Photo Synthesis[Z]. (2023-02-14). https://doi.org/10.48550/arXiv.2302.06883.
[89]   CHENG S I, CHEN Y J, CHIU W C, et al. Adaptively-realistic image generation from stroke and sketch with diffusion model[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 4043-4051. DOI:10.1109/wacv56688.2023.00404
[90]   PENG Y C, ZHAO C Q, XIE H R, et al. DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model[Z]. (2023-02-14). https://doi.org/10.48550/arXiv.2302.06908.
[91]   CHENG J X, LIANG X, SHI X J, et al. LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation[Z]. (2023-02-16). https://doi.org/10.48550/arXiv.2302.08908.
[92]   BAR-TAL O, YARIV L, LIPMAN Y, et al. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation[Z]. (2023-02-16). https://doi.org/10.48550/arXiv.2302.08113.
[93]   AVRAHAMI O, HAYES T, GAFNI O, et al. SpaText: Spatio-textual representation for controllable image generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver: IEEE, 2023: 18370-18380. DOI:10.1109/CVPR52729.2023.01762
[94]   HAM C, HAYS J, LU J, et al. Modulating Pretrained Diffusion Models for Multimodal Image Synthesis[Z]. (2023-02-24). https://doi.org/10.48550/arXiv.2302.12764.
[95]   YANG L, HUANG Z L, SONG Y, et al. Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.11138.
[96]   LI Y H, LIU H T, WU Q Y, et al. GLIGEN: Open-Set Grounded Text-to-Image Generation[Z]. (2023-01-17). https://doi.org/10.48550/arXiv.2301.07093.
[97]   SARUKKAI V, LI L, MA A, et al. Collage Diffusion[Z]. (2023-03-01). https://doi.org/10.48550/arXiv.2303.00262.
[98]   ZHANG L, AGRAWALA M. Adding Conditional Control to Text-to-Image Diffusion Models[Z]. (2023-02-10). https://doi.org/10.48550/arXiv.2302.05543.
[99]   HUANG L H, CHEN D, LIU Y, et al. Composer: Creative and Controllable Image Synthesis with Composable Conditions[Z]. (2023-02-20). https://doi.org/10.48550/arXiv.2302.09778.
[100]   YU J W, WANG Y H, ZHAO C, et al. FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model[Z]. (2023-03-17). https://doi.org/10.48550/arXiv.2303.09833.
[101]   LUGMAYR A, DANELLJAN M, ROMERO A, et al. RePaint: Inpainting using Denoising Diffusion Probabilistic Models[Z]. (2022-01-24). https://doi.org/10.48550/arXiv.2201.09865.
[102]   LI W B, YU X, ZHOU K, et al. SDM: Spatial Diffusion Model for Large Hole Image Inpainting[Z]. (2022-12-06). https://doi.org/10.48550/arXiv.2212.02963.
[103]   LI R, TAN R T, CHEONG L F. All in one bad weather removal using architectural search[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 3172-3182. DOI:10.1109/cvpr42600.2020.00324
[104]   CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing transformer[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 12294-12305. DOI:10.1109/cvpr46437.2021.01212
[105]   ZHU Y R, WANG T Y, FU X Y, et al. Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 21747-21758. DOI:10.1109/cvpr52729.2023.02083
[106]   KAWAR B, ELAD M, ERMON S, et al. Denoising Diffusion Restoration Models[Z]. (2022-01-27). https://doi.org/10.48550/arXiv.2201.11793.
[107]   WANG Y H, YU J W, ZHANG J. Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model[Z]. (2022-12-01). https://doi.org/10.48550/arXiv.2212.00490.
[108]   SAHARIA C, CHAN W, CHANG H, et al. Palette: Image-to-image diffusion models[C]// ACM SIGGRAPH 2022. Vancouver: ACM, 2022: 1-10. DOI:10.1145/3528233.3530757
[109]   PAN X C, QIN P D, LI Y H, et al. Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models[Z]. (2022-11-20). https://doi.org/10.48550/arXiv.2211.10950.
[110]   JEONG H, KWON G, YE J C. Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models[Z]. (2023-02-08). https://doi.org/10.48550/arXiv.2302.03900.
[111]   NIKANKIN Y, HAIM N, IRANI M. SinFusion: Training Diffusion Models on a Single Image or Video[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.11743.
[112]   ZHAO Y Q, PANG T Y, DU C, et al. A Recipe for Watermarking Diffusion Models[Z]. (2023-03-17). https://doi.org/10.48550/arXiv.2303.10137.