Journal of Zhejiang University (Science Edition), 2023, Vol. 50, Issue 6: 651-667    DOI: 10.3785/j.issn.1008-9497.2023.06.001
Special Topic of the 26th National Conference on Computer-Aided Design and Computer Graphics
A review of conditional image generation based on diffusion models
Zerun LIU1, Yufei YIN1,2, Wenhao XUE1,3, Rui GUO1, Lechao CHENG1
1. Zhejiang Lab, Hangzhou 311121, China
2. MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, University of Science and Technology of China, Hefei 230026, China
3. School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
Abstract:

Artificial intelligence generated content (AIGC) has recently attracted significant attention. Among the numerous generative models proposed, the emerging diffusion model stands out for its highly interpretable mathematical properties and its ability to generate high-quality, diverse results. Diffusion models have achieved remarkable success in condition-guided image generation; this success has in turn promoted their development in other conditional tasks and enabled applications in areas such as film, games, painting, and virtual reality. In text-guided image generation, for example, diffusion models can generate high-resolution images while maintaining high image quality. In this paper, we first introduce the definition and background of diffusion models. We then review the development history and latest progress of conditional image generation based on diffusion models. Finally, we conclude the survey with a discussion of the challenges and potential future directions of diffusion models.

Key words: diffusion model; conditional image generation; application
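
For readers new to the area, the following minimal sketch illustrates the text-guided generation task described above, using the latent diffusion model of ref. [32] as served by the Hugging Face diffusers library; the library, checkpoint name, and prompt are illustrative assumptions, not part of this survey.

```python
# A minimal text-to-image inference sketch (assumes the diffusers and torch
# packages, the runwayml/stable-diffusion-v1-5 checkpoint, and a CUDA GPU).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# guidance_scale controls classifier-free guidance [31]: larger values trade
# diversity for closer adherence to the text prompt.
image = pipe(
    "an oil painting of a lighthouse at dawn",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("lighthouse.png")
```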
Received: 2023-05-10    Published: 2023-11-30
CLC: TP 391.41
Corresponding author: Lechao CHENG, E-mail: chenglc@zhejianglab.com
About the first author: Zerun LIU (1998—), ORCID: https://orcid.org/0009-0001-4493-6025, male, master's student, research interests: image processing.

Cite this article:

Zerun LIU, Yufei YIN, Wenhao XUE, Rui GUO, Lechao CHENG. A review of conditional image generation based on diffusion models. Journal of Zhejiang University (Science Edition), 2023, 50(6): 651-667.

Link to this article:

https://www.zjujournals.com/sci/CN/10.3785/j.issn.1008-9497.2023.06.001        https://www.zjujournals.com/sci/CN/Y2023/V50/I6/651

Fig. 1  Forward and reverse diffusion processes of the diffusion model
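
As a companion to Fig. 1, the forward and reverse processes in the standard DDPM formulation [14] can be written as follows; the notation is taken from ref. [14], not from the figure itself.

```latex
% Forward (noising) process: a fixed Markov chain with variance schedule \beta_t.
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t\mathbf{I}\right)
% With \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s, any step admits the closed form:
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\right)
% Reverse (denoising) process: learned Gaussian transitions with parameters \theta.
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\right)
```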
Fig. 2  Conditions and corresponding methods for condition-guided image generation
Dataset | Images | Texts | URL
Flickr30K[43] | 32 K | 158 K | http://shannon.cs.illinois.edu/DenotationGraph/
MS-COCO[44] | 330 K | 1.5 M | https://cocodataset.org/
CC[45] | 12 M | 12 M | https://github.com/google-research-datasets/conceptual-12m
WIT[46] | 11.5 M | 37.6 M | https://github.com/google-research-datasets/wit
WuKong[47] | 100 M | 100 M | https://wukong-dataset.github.io/wukong-dataset/
LAION-400M[48] | 400 M | 400 M | https://laion.ai/blog/laion-400-open-dataset/
COYO[49] | 700 M | 700 M | https://github.com/kakaobrain/coyo-dataset
LAION-5B[50] | 5 B | 5 B | https://laion.ai/projects/
Table 1  Large-scale image-text datasets
Model | Params/B | FID-30K (↓) | Zero-shot FID (↓)
DM-GAN[57] | - | 20.79 | -
XMC-GAN[58] | - | 9.33 | -
LAFITE[59] | 0.2 | 8.12 | -
CogView2[60] | 6.0 | 17.70 | 24.00
Make-A-Scene[61] | 4.0 | 7.55 | 11.84
Parti[62] | 20.0 | 3.22 | 7.23
GLIDE[25] | 5.0 | - | 12.24
DALL·E 2[33] | 6.5 | - | 10.39
Stable Diffusion[32] | 1.4 | - | 8.59
Simple Diffusion[53] | - | - | 8.30
Imagen[34] | 7.9 | - | 7.27
eDiff-I[52] | 9.1 | - | 6.95
ERNIE-ViLG 2.0[51] | 24.0 | - | 6.72
Table 2  FID of different models on the MS-COCO 256×256 dataset
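
The FID values in Table 2 are those reported by the original papers, each under its own evaluation protocol. As a hedged illustration of how FID is computed in practice, the sketch below uses the torchmetrics implementation; the tensors are random placeholders standing in for real and generated image batches.

```python
# Sketch of an FID computation (assumes the torchmetrics package).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # pooled InceptionV3 features

# FID expects uint8 image batches of shape (N, 3, H, W); real evaluations use
# tens of thousands of images, not the toy batches shown here.
real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)   # accumulate reference statistics
fid.update(fake_images, real=False)  # accumulate generated statistics
print(float(fid.compute()))          # lower is better
```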
Fig. 3  Framework of retrieval-augmented text-to-image generation models
Model | Text alignment | Image alignment
Textual Inversion[77] | 0.183 | 0.689
DreamBooth[78] | 0.249 | 0.827
DreamArtist[79] | 0.286 | 0.739
Custom Diffusion[80] | 0.231 | 0.868
ELITE[81] | 0.266 | 0.804
Cones[82] | 0.237 | 0.853
SVDiff[83] | 0.323 | 0.716
Table 3  Comparison of image generation models for subject-driven generation
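
The text- and image-alignment scores in Table 3 are typically CLIP-based [35]: cosine similarity between the prompt embedding and the generated-image embedding, and between the generated image and reference images of the subject. A minimal sketch, assuming the transformers library and hypothetical file names:

```python
# Sketch of CLIP-based alignment scores (assumes transformers, torch, pillow).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

generated = Image.open("generated.png")   # hypothetical generated sample
reference = Image.open("reference.png")   # hypothetical subject photo
prompt = "a photo of my dog on the beach"

inputs = processor(text=[prompt], images=[generated, reference],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Normalize the projected embeddings before taking cosine similarities.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

text_alignment = float(txt[0] @ img[0])   # prompt vs. generated image
image_alignment = float(img[0] @ img[1])  # generated vs. reference subject
```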
Model | Condition type | Datasets
PITI[86] | sketch, layout | ADE20K, DIODE, COCO-Stuff
Sketch-Guided[87] | sketch and text | Sketchy, Edge2shoes
DiSS[89] | sketch, color map | COCO-Stuff, Visual
Sketch2Photo[88] | sketch, text | LAION, GeoPose3K, LSUN-Church
DiffFaceSketch[90] | sketch | CelebA-HQ
Table 4  Sketch-conditioned image generation models
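
As a concrete example of the sketch-conditioned generation surveyed in Table 4, the snippet below drives a text-to-image model with a scribble map via ControlNet [98] in the diffusers library; the two checkpoint names and the input file are illustrative assumptions rather than models from the table.

```python
# Sketch-conditioned generation with ControlNet (assumes diffusers and torch).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

sketch = load_image("sketch.png")  # hypothetical input scribble/sketch map
image = pipe(
    "a cozy wooden cabin in a snowy forest",
    image=sketch,                  # the spatial condition
    num_inference_steps=30,
).images[0]
image.save("cabin.png")
```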
Fig. 4  Common layout forms and the corresponding generation results
1 WEI L Y, LEFEBVRE S, KWATRA V, et al. State of the Art in Example-Based Texture Synthesis[R]. Eindhoven: Eurographics Association, 2009: 93-117.
2 HAN C, RISSER E, RAMAMOORTHI R, et al. Multiscale texture synthesis[J]. ACM Transactions on Graphics, 2008, 27(3): 1-8. DOI:10.1145/1360612.1360650
3 MAKTHAL S, ROSS A. Synthesis of iris images using Markov random fields[C]// 2005 13th European Signal Processing Conference. Antalya: IEEE, 2005: 1-4.
4 OSINDERO S, HINTON G E. Modeling image patches with a directed hierarchy of Markov random fields[J]. Advances in Neural Information Processing Systems, 2008, 20: 1121-1128.
5 GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144. DOI:10.1145/3422622
6 MIRZA M, OSINDERO S. Conditional Generative Adversarial Nets[Z]. (2014-11-06). https://doi.org/10.48550/arXiv.1411.1784.
7 OORD A V D, KALCHBRENNER N, VINYALS O, et al. Conditional Image Generation with PixelCNN Decoders[Z]. (2016-06-16). https://doi.org/10.48550/arXiv.1606.05328.
8 SALIMANS T, KARPATHY A, CHEN X, et al. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications[Z]. (2017-01-19). https://doi.org/10.48550/arXiv.1701.05517.
9 KINGMA D P, WELLING M. Auto-Encoding Variational Bayes[Z]. (2013-12-20). https://doi.org/10.48550/arXiv.1312.6114.
10 DINH L, KRUEGER D, BENGIO Y. NICE: Non-Linear Independent Components Estimation[Z]. (2014-10-30). https://doi.org/10.48550/arXiv.1410.8516.
11 DINH L, SOHL-DICKSTEIN J, BENGIO S. Density Estimation Using Real NVP[Z]. (2016-05-27). https://doi.org/10.48550/arXiv.1605.08803.
12 LECUN Y, CHOPRA S, HADSELL R, et al. A tutorial on energy-based learning[C]// BAKIR G, HOFMANN T, SCHÖLKOPF B. Predicting Structured Data. Cambridge: MIT Press, 2006. DOI:10.7551/mitpress/7443.003.0014
13 NGIAM J, CHEN Z, KOH P W, et al. Learning deep energy models[C]// 28th International Conference on International Conference on Machine Learning. Bellevue: Omnipress, 2011: 1105-1112.
14 HO J, JAIN A, ABBEEL P. Denoising Diffusion Probabilistic Models[Z]. (2020-06-19). https://doi.org/10.48550/arXiv.2006.11239.
15 SONG Y, ERMON S. Generative modeling by estimating gradients of the data distribution[C]//Thirty-third Conference on Neural Information Processing Systems(NeurIPS). Vancouver: NeurIPS, 2019.
16 ZHANG H, XU T, LI H S, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]// 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017: 5908-5916. DOI:10.1109/iccv.2017.629
17 XU T, ZHANG P C, HUANG Q Y, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 1316-1324. DOI:10.1109/cvpr.2018.00143
18 DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794.
19 RAMESH A, PAVLOV M, GOH G, et al. Zero-shot text-to-image generation[C]// International Conference on Machine Learning. Online: PMLR, 2021: 8821-8831.
20 KARRAS T, LAINE S, AILA T. A style-based generator architecture for generative adversarial networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4401-4410. DOI:10.1109/cvpr.2019.00453
21 WU H H, SEETHARAMAN P, KUMAR K, et al. Wav2CLIP: Learning robust audio representations from CLIP[C]// 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Online: IEEE, 2022: 4563-4567. DOI:10.1109/icassp43922.2022.9747669
22 SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al. Score-Based Generative Modeling Through Stochastic Differential Equations[Z]. (2020-11-26). https://doi.org/10.48550/arXiv.2011.13456.
23 DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794.
24 NICHOL A, DHARIWAL P. Improved Denoising Diffusion Probabilistic Models[Z]. (2021-02-18). https://doi.org/10.48550/arXiv.2102.09672.
25 NICHOL A, DHARIWAL P, RAMESH A, et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models[Z]. (2021-12-20). https://doi.org/10.48550/arXiv.2112.10741.
26 RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]// 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich: MICCAI, 2015: 234-241. DOI:10.1007/978-3-319-24574-4_28
27 SOHL-DICKSTEIN J, WEISS E, MAHESWARANATHAN N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]// 32nd International Conference on Machine Learning. Lille: PMLR, 2015: 2256-2265.
28 SONG Y, ERMON S. Improved techniques for training score-based generative models[J]. Advances in Neural Information Processing Systems, 2020, 33: 12438-12448. DOI:10.48550/arXiv.2006.09011
29 SONG Y, DURKAN C, MURRAY I, et al. Maximum likelihood training of score-based diffusion models[J]. Advances in Neural Information Processing Systems, 2021, 34: 1415-1428.
30 BROCK A, DONAHUE J, SIMONYAN K. Large Scale GAN Training for High Fidelity Natural Image Synthesis[Z]. (2018-09-28). https://doi.org/10.48550/arXiv.1809.11096.
31 HO J, SALIMANS T. Classifier-Free Diffusion Guidance[Z]. (2022-07-26). https://doi.org/10.48550/arXiv.2207.12598.
32 ROMBACH R, BLATTMANN A, LORENZ D, et al. High-Resolution Image Synthesis with Latent Diffusion Models[Z]. (2021-12-20). https://doi.org/10.48550/arXiv.2112.10752.
33 RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents[Z]. (2022-04-13). https://doi.org/10.48550/arXiv.2204.06125.
34 SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding[Z]. (2022-05-23). https://doi.org/10.48550/arXiv.2205.11487.
35 RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]// International Conference on Machine Learning. Online: PMLR, 2021: 8748-8763.
36 LIU X, PARK D H, AZADI S, et al. More control for free! Image synthesis with semantic diffusion guidance[C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Vancouver: IEEE, 2023: 289-299. DOI:10.1109/wacv56688.2023.00037
37 AVRAHAMI O, LISCHINSKI D, FRIED O. Blended diffusion for text-driven editing of natural images[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022: 18187-18197. DOI:10.1109/CVPR52688.2022.01767
38 KWON M, JEONG J, UH Y. Diffusion Models Already Have a Semantic Latent Space[Z]. (2022-10-20). https://doi.org/10.48550/arXiv.2210.10960.
39 KIM G, KWON T, YE J C. DiffusionCLIP: Text-guided diffusion models for robust image manipulation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 2426-2435. DOI:10.1109/cvpr52688.2022.00246
40 GU S Y, CHEN D, BAO J M, et al. Vector quantized diffusion model for text-to-image synthesis[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10696-10706. DOI:10.1109/cvpr52688.2022.01043
41 HO J, SAHARIA C, CHAN W, et al. Cascaded diffusion models for high fidelity image generation[J]. The Journal of Machine Learning Research, 2022, 23(47): 1-33.
42 SAHARIA C, HO J, CHAN W, et al. Image super-resolution via iterative refinement[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(4): 4713-4726. DOI:10.1109/tpami.2022.3204461
43 YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78. DOI:10.1162/tacl_a_00166
44 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]// 13th European Conference on Computer Vision (ECCV). Zurich: Springer, 2014: 740-755. DOI:10.1007/978-3-319-10602-1_48
45 CHANGPINYO S, SHARMA P, DING N, et al. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 3558-3568. DOI:10.1109/cvpr46437.2021.00356
46 SRINIVASAN K, RAMAN K, CHEN J, et al. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning[C]// 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. Online: ACM, 2021: 2443-2449. DOI:10.1145/3404835.3463257
47 GU J X, MENG X J, LU G S, et al. Wukong: 100 Million Large-Scale Chinese Cross-Modal Pre-Training Dataset and a Foundation Framework[Z]. (2022-02-14). https://doi.org/10.48550/arXiv.2202.06767.
48 SCHUHMANN C, VENCU R, BEAUMONT R, et al. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs[Z]. (2021-11-03). https://doi.org/10.48550/arXiv.2111.02114.
49 BYEON M, PARK B, KIM H, et al. COYO-700M: Image-Text Pair Dataset[Z]. https://github.com/kakaobrain/coyo-dataset.
50 SCHUHMANN C, BEAUMONT R, VENCU R, et al. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models[Z]. (2022-10-16). https://doi.org/10.48550/arXiv.2210.08402.
51 FENG Z, ZHANG Z, YU X, et al. ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts[Z]. (2022-10-27). https://doi.org/10.48550/arXiv.2210.15257.
52 BALAJI Y, NAH S, HUANG X, et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers[Z]. (2022-11-02). https://doi.org/10.48550/arXiv.2211.01324.
53 HOOGEBOOM E, HEEK J, SALIMANS T. Simple Diffusion: End-to-End Diffusion for High Resolution Images[Z]. (2023-01-26). https://doi.org/10.48550/arXiv.2301.11093.
54 ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 12868-12878. DOI:10.1109/cvpr46437.2021.01268
55 RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
56 BORJI A. Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL·E 2[Z]. (2022-10-02). https://doi.org/10.48550/arXiv.2210.00586.
57 YE H, YANG X, TAKAC M, et al. Improving Text-to-Image Synthesis Using Contrastive Learning[Z]. (2021-07-06). https://doi.org/10.48550/arXiv.2107.02423.
58 ZHANG H, KOH J Y, BALDRIDGE J, et al. Cross-modal contrastive learning for text-to-image generation[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 833-842. doi:10.1109/cvpr46437.2021.00089
59 ZHOU Y F, ZHANG R Y, CHEN C Y, et al. LAFITE: Towards Language-Free Training for Text-to-Image Generation[Z]. (2021-11-27). https://doi.org/10.48550/arXiv.2111.13792.
60 DING M, ZHENG W D, HONG W Y, et al. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers[Z]. (2022-04-28). https://doi.org/10.48550/arXiv.2204.14217.
61 GAFNI O, POLYAK A, ASHUAL O, et al. Make-a-scene: Scene-based text-to-image generation with human priors[C]// 17th European Conference on Computer Vision. Tel Aviv: Springer, 2022: 89-106. DOI:10.1007/978-3-031-19784-0_6
62 YU J H, XU Y Z, KOH J Y, et al. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation[Z]. (2022-06-22). https://doi.org/10.48550/arXiv.2206.10789.
63 LEE K M, LIU H, RYU M, et al. Aligning Text-To-Image Models Using Human Feedback[Z]. (2023-02-23). https://doi.org/10.48550/arXiv.2302.12192.
64 ZHANG Q S, SONG J M, HUANG X, et al. DiffCollage: Parallel Generation of Large Content with Diffusion Models[Z]. (2023-03-30). https://doi.org/10.48550/arXiv.2303.17076.
65 SCHRAMOWSKI P, BRACK M, DEISEROTH B, et al. Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models[Z]. (2022-11-09). https://doi.org/10.48550/arXiv.2211.05105.
66 FRIEDRICH F, SCHRAMOWSKI P, BRACK M, et al. Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness[Z]. (2023-02-07). https://doi.org/10.48550/arXiv.2302.10893.
67 ZHU Y, WU Y, OLSZEWSKI K, et al. Discrete contrastive diffusion for cross-modal music and image generation[C]// The Eleventh International Conference on Learning Representations. Kigali: ICLR, 2023.
68 LIU N, LI S, DU Y L, et al. Compositional visual generation with composable diffusion models[C]// 17th European Conference on Computer Vision. Tel Aviv: Springer, 2022: 423-439. DOI:10.1007/978-3-031-19790-1_26
69 LIEW J H, YAN H, ZHOU D, et al. MagicMix: Semantic Mixing with Diffusion Models[Z]. (2022-10-28). https://doi.org/10.48550/arXiv.2210.16056.
70 MA W D K, LEWIS J P, KLEIJN W B, et al. Directed Diffusion: Direct Control of Object Placement through Attention Guidance[Z]. (2023-02-25). https://doi.org/10.48550/arXiv.2302.13153.
71 CHEFER H, ALALUF Y, VINKER Y, et al. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models[Z]. (2023-01-31). https://doi.org/10.48550/arXiv.2301.13826.
72 GRAVE E, JOULIN A, USUNIER N. Improving Neural Language Models with a Continuous Cache[Z]. (2016-12-13). https://doi.org/10.48550/arXiv.1612.04426.
73 ROMBACH R, BLATTMANN A, OMMER B. Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models[Z]. (2022-07-26). https://doi.org/10.48550/arXiv.2207.13038.
74 BLATTMANN A, ROMBACH R, OKTAY K, et al. Retrieval-augmented diffusion models[J]. Advances in Neural Information Processing Systems, 2022, 35: 15309-15324.
75 CHEN W H, HU H X, SAHARIA C, et al. Re-Imagen: Retrieval-Augmented Text-to-Image Generator[Z]. (2022-09-29). https://doi.org/10.48550/arXiv.2209.14491.
76 SHEYNIN S, ASHUAL O, POLYAK A, et al. KNN-Diffusion: Image Generation via Large-Scale Retrieval[Z]. (2022-04-06). https://doi.org/10.48550/arXiv.2204.02849.
77 GAL R, ALALUF Y, ATZMON Y, et al. An Image is Worth One Word: Personalizing Text-to-Image Generation Using Textual Inversion[Z]. (2022-08-02). https://doi.org/10.48550/arXiv.2208.01618.
78 RUIZ N, LI Y, JAMPANI V, et al. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation[Z]. (2022-08-25). https://doi.org/10.48550/arXiv.2208.12242.
79 DONG Z Y, WEI P X, LIN L. DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.11337.
80 KUMARI N, ZHANG B, ZHANG R, et al. Multi-Concept Customization of Text-to-Image Diffusion[Z]. (2022-12-08). https://doi.org/10.48550/arXiv.2212.04488.
81 WEI Y, ZHANG Y, JI Z, et al. ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation[Z]. (2023-02-27). https://doi.org/10.48550/arXiv.2302.13848.
82 LIU Z H, FENG R L, ZHU K, et al. Cones: Concept Neurons in Diffusion Models for Customized Generation[Z]. (2023-03-09). https://doi.org/10.48550/arXiv.2303.05125.
83 HAN L G, LI Y X, ZHANG H, et al. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning[Z]. (2023-03-20). https://doi.org/10.48550/arXiv.2303.11305.
84 PATASHNIK O, GARIBI D, AZURI I, et al. Localizing Object-Level Shape Variations with Text-to-Image Diffusion Models[Z]. (2023-03-20). https://doi.org/10.48550/arXiv.2303.11306.
85 HUANG Z Q, WU T X, JIANG Y M, et al. ReVersion: Diffusion-Based Relation Inversion from Images[Z]. (2023-03-23). https://doi.org/10.48550/arXiv.2303.13495.
86 WANG T F, ZHANG T, ZHANG B, et al. Pretraining is All You Need for Image-to-Image Translation[Z]. (2022-05-25). https://doi.org/10.48550/arXiv.2205.12952.
87 VOYNOV A, ABERMAN K, COHEN-OR D. Sketch-Guided Text-to-Image Diffusion Models[Z]. (2022-11-24). https://doi.org/10.48550/arXiv.2211.13752.
88 MAUNGMAUNG A, SHING M, MITSUI K, et al. Text-Guided Scene Sketch-to-Photo Synthesis[Z]. (2023-02-14). https://doi.org/10.48550/arXiv.2302.06883.
89 CHENG S I, CHEN Y J, CHIU W C, et al. Adaptively-realistic image generation from stroke and sketch with diffusion model[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2023: 4043-4051. DOI:10.1109/wacv56688.2023.00404
90 PENG Y C, ZHAO C Q, XIE H R, et al. DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model[Z]. (2023-02-14). https://doi.org/10.48550/arXiv.2302.06908.
91 CHENG J X, LIANG X, SHI X J, et al. LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation[Z]. (2023-02-16). https://doi.org/10.48550/arXiv.2302.08908.
92 BAR-TAL O, YARIV L, LIPMAN Y, et al. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation[Z]. (2023-02-16). https://doi.org/10.48550/arXiv.2302.08113.
93 AVRAHAMI O, HAYES T, GAFNI O, et al. SpaText: Spatio-textual representation for controllable image generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver: IEEE, 2023: 18370-18380. DOI:10.1109/CVPR52729.2023.01762
94 HAM C, HAYS J, LU J, et al. Modulating Pretrained Diffusion Models for Multimodal Image Synthesis[Z]. (2023-02-24). https://doi.org/10.48550/arXiv.2302.12764.
95 YANG L, HUANG Z L, SONG Y, et al. Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.11138.
96 LI Y H, LIU H T, WU Q Y, et al. GLIGEN: Open-Set Grounded Text-to-Image Generation[Z]. (2023-01-17). https://doi.org/10.48550/arXiv.2301.07093.
97 SARUKKAI V, LI L, MA A, et al. Collage Diffusion[Z]. (2023-03-01). https://doi.org/10.48550/arXiv.2303.00262.
98 ZHANG L, AGRAWALA M. Adding Conditional Control to Text-to-Image Diffusion Models[Z]. (2023-02-10). https://doi.org/10.48550/arXiv.2302.05543.
99 HUANG L H, CHEN D, LIU Y, et al. Composer: Creative and Controllable Image Synthesis with Composable Conditions[Z]. (2023-02-20). https://doi.org/10.48550/arXiv.2302.09778.
100 YU J W, WANG Y H, ZHAO C, et al. FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model[Z]. (2023-03-17). https://doi.org/10.48550/arXiv.2303.09833.
101 LUGMAYR A, DANELLJAN M, ROMERO A, et al. RePaint: Inpainting using Denoising Diffusion Probabilistic Models[Z]. (2022-01-24). https://doi.org/10.48550/arXiv.2201.09865.
102 LI W B, YU X, ZHOU K, et al. SDM: Spatial Diffusion Model for Large Hole Image Inpainting[Z]. (2022-12-06). https://doi.org/10.48550/arXiv.2212.02963.
103 LI R, TAN R T, CHEONG L F. All in one bad weather removal using architectural search[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 3172-3182. DOI:10.1109/cvpr42600.2020.00324
104 CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing transformer[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 12294-12305. DOI:10.1109/cvpr46437.2021.01212
105 ZHU Y R, WANG T Y, FU X Y, et al. Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 21747-21758. DOI:10.1109/cvpr52729.2023.02083
106 KAWAR B, ELAD M, ERMON S, et al. Denoising Diffusion Restoration Models[Z]. (2022-01-27). https://doi.org/10.48550/arXiv.2201.11793.
107 WANG Y H, YU J W, ZHANG J. Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model[Z]. (2022-12-01). https://doi.org/10.48550/arXiv.2212.00490.
108 SAHARIA C, CHAN W, CHANG H, et al. Palette: Image-to-image diffusion models[C]// ACM SIGGRAPH 2022. Vancouver: ACM, 2022: 1-10. DOI:10.1145/3528233.3530757
109 PAN X C, QIN P D, LI Y H, et al. Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models[Z]. (2022-11-20). https://doi.org/10.48550/arXiv.2211.10950.
110 JEONG H, KWON G, YE J C. Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models[Z]. (2023-02-08). https://doi.org/10.48550/arXiv.2302.03900.
111 NIKANKIN Y, HAIM N, IRANI M. SinFusion: Training Diffusion Models on a Single Image or Video[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.11743.
112 ZHAO Y Q, PANG T Y, DU C, et al. A Recipe for Watermarking Diffusion Models[Z]. (2023-03-17). https://doi.org/10.48550/arXiv.2303.10137.
114 HAO Y R, CHI Z W, DONG L, et al. Optimizing Prompts for Text-to-Image Generation[Z]. (2022-12-19). https://doi.org/10.48550/arXiv.2212.09611.
115 WITTEVEEN S, ANDREWS M. Investigating Prompt Engineering in Diffusion Models[Z]. (2022-11-21). https://doi.org/10.48550/arXiv.2211.15462.
116 WANG Z J, MONTOYA E, MUNECHIKA D, et al. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models[Z]. (2022-10-26). https://doi.org/10.48550/arXiv.2210.14896.
117 SONG J M, MENG C L, ERMON S. Denoising Diffusion Implicit Models[Z]. (2020-10-06). https://doi.org/10.48550/arXiv.2010.02502.
118 LU C, ZHOU Y H, BAO F, et al. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps[Z]. (2022-06-02). https://doi.org/10.48550/arXiv.2206.00927.
119 LU C, ZHOU Y H, BAO F, et al. DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models[Z]. (2022-11-02). https://doi.org/10.48550/arXiv.2211.01095.
120 ZHANG Q S, TAO M L, CHEN Y X. gDDIM: Generalized Denoising Diffusion Implicit Models[Z]. (2022-06-11). https://doi.org/10.48550/arXiv.2206.05564.
121 BAO F, LI C X, ZHU J, et al. Analytic-DPM: An Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models[Z]. (2022-01-17). https://doi.org/10.48550/arXiv.2201.06503.
122 LUHMAN E, LUHMAN T. Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed[Z]. (2021-01-07). https://doi.org/10.48550/arXiv.2101.02388.
123 SALIMANS T, HO J. Progressive Distillation for Fast Sampling of Diffusion Models[Z]. (2022-02-01). https://doi.org/10.48550/arXiv.2202.00512.
124 MENG C L, ROMBACH R, GAO R Q, et al. On distillation of guided diffusion models[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 14297-14306. DOI:10.1109/cvpr52729.2023.01374
125 BAO F, NIE S, XUE K W, et al. All are worth words: A ViT backbone for diffusion models[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 22669-22679. DOI:10.1109/cvpr52729.2023.02171
126 PEEBLES W, XIE S. Scalable Diffusion Models with Transformers[Z]. (2022-12-19). https://doi.org/10.48550/arXiv.2212.09748.