Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (1): 43-51    DOI: 10.3785/j.issn.1008-973X.2026.01.004
    
Image generation for power personnel behaviors based on diffusion model with multimodal prompts
Zhihang ZHU1, Yunfeng YAN1,2, Donglian QI1,2,*
1. College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
2. Hainan Institute of Zhejiang University, Sanya 572025, China

Abstract  

A multimodal conditional-control image generation model for power personnel behaviors, PoseNet, was established to address the challenge that the unique and complex nature of such behaviors makes image data scarce, hindering data-driven behavior recognition. On the basis of the stable diffusion model, human skeleton, mask and text description information were fully integrated, and a keypoint loss function was added, enabling the model to generate high-quality, controllable human body images. An image filter based on keypoint similarity was designed to remove erroneous and low-quality generated images, and a two-stage training strategy, pre-training on generic data followed by fine-tuning on private data, was used to improve model performance. For the behavioral characteristics of power personnel, an evaluation metric set combining generic and specialized metrics was designed, and image generation performance under the different metrics was analyzed. Experimental results showed that, compared with the mainstream human generation models ControlNet and HumanSD, the proposed model achieved more accurate and realistic generation results.



Key words: conditional image generation model; data augmentation; human body keypoint; image segmentation; diffusion model; deep learning
Received: 17 December 2024      Published: 15 December 2025
CLC:  TP 391  
Corresponding Authors: Donglian QI     E-mail: 22210044@zju.edu.cn;qidl@zju.edu.cn
Cite this article:

Zhihang ZHU,Yunfeng YAN,Donglian QI. Image generation for power personnel behaviors based on diffusion model with multimodal prompts. Journal of ZheJiang University (Engineering Science), 2026, 60(1): 43-51.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.01.004     OR     https://www.zjujournals.com/eng/Y2026/V60/I1/43


Fig.1 Image generation architecture for power personnel behaviors
Fig.2 Basic process of stable diffusion model
Fig.3 Architectures of stable diffusion model and PoseNet
Fig.4 Image self-labeling flowchart
Fig.5 Image filtering flowchart
Metric | Definition | Purpose
FID | Fréchet distance between the feature vectors of two image sets extracted by an Inception model. | Measures the overall similarity of the two image sets; a lower FID indicates more similar sets and higher generated-image quality.
KID | Unbiased kernel-matrix estimate over the feature vectors of two image sets extracted by an Inception model. | Measures the overall similarity of the two image sets; a lower KID indicates more similar sets and higher generated-image quality.
CLIP-Score (image-image / image-text) | Cosine similarity between the feature vectors of two images, or of an image-text pair, extracted by a pre-trained CLIP model. | Evaluates the similarity of two images or the match of an image-text pair; a higher score indicates more similar images or a better match, i.e., higher image quality or image-text consistency.
PCK | Proportion of predicted keypoints whose distance to the ground-truth keypoints falls below a threshold, i.e., the percentage of correctly detected keypoints. | Evaluates pose-estimation accuracy; a higher PCK indicates more accurate pose predictions and thus more accurate keypoints in generated images.
OKS | Distance between two keypoint sets, accounting for keypoint visibility, body scale and per-keypoint weighting. | Evaluates how well two human poses match; a higher OKS indicates closer keypoint sets and thus more accurate keypoints in generated images.
Generation efficiency $ \eta $ | For one personnel-behavior image, the model generates an arbitrary number of images; $ \eta $ is the ratio of generated images in which a behavior-recognition model detects the personnel behavior to the total number generated. | Evaluates generation efficiency; a higher ratio means a generated image is more likely to be usable, i.e., higher generation efficiency.
Tab.1 Definitions of different metrics in evaluation metric set
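The FID entry in Tab. 1 can be made concrete: given Inception feature matrices for the two image sets, FID is the Fréchet distance between the Gaussians fitted to them. A minimal NumPy sketch (function and variable names are illustrative, not from the paper; a symmetric-PSD eigendecomposition stands in for a general matrix square root):

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues (numerical noise)
    return (v * np.sqrt(w)) @ v.T

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two sets of
    Inception feature vectors, each of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((C_r C_g)^(1/2)) computed via the symmetric form C_r^(1/2) C_g C_r^(1/2)
    cr_half = _sqrtm_psd(cov_r)
    cross = np.trace(_sqrtm_psd(cr_half @ cov_g @ cr_half))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * cross)
```

For identical feature sets the distance collapses to zero, matching the "lower is better" reading of the table.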
Method | FID | KID | CLIP-Score (image-image/image-text) | PCK/% | OKS
ControlNet[18] | 274.72 | 6.62 | 67.33/30.31 | 47.5 | 0.872
HumanSD[20] | 331.05 | 11.65 | 62.21/29.24 | 89.4 | 0.946
PoseNet | 130.79 | 5.12 | 87.23/30.32 | 75.4 | 0.889
PoseNet + image filter | 130.43 | 4.90 | 89.81/30.43 | 90.4 | 0.978
PoseNet + image filter + two-stage training | 128.25 | 4.56 | 91.02/31.44 | 94.2 | 0.979
Tab.2 Quantitative results of different methods under multiple metrics
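The "image filter" rows in Tab. 2 rely on keypoint similarity between a generated image's detected skeleton and the conditioning skeleton. A sketch using the standard COCO-style OKS formula (the threshold, `kappa` constants and function names are assumptions for illustration; the paper's exact filter parameters are not given here):

```python
import numpy as np

def oks(pred, gt, visible, scale, kappa):
    """COCO-style object keypoint similarity between predicted and
    reference keypoints, each of shape (K, 2). `visible` flags which
    reference keypoints count, `scale` is the body scale, and `kappa`
    holds the per-keypoint tolerance constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)
    sim = np.exp(-d2 / (2.0 * scale ** 2 * kappa ** 2))
    v = visible.astype(float)
    return float((sim * v).sum() / v.sum())

def keypoint_filter(generated_kps, target_kps, visible, scale, kappa, thresh=0.8):
    """Indices of generated images whose detected keypoints match the
    conditioning skeleton with OKS >= thresh (threshold is illustrative)."""
    return [i for i, kps in enumerate(generated_kps)
            if oks(kps, target_kps, visible, scale, kappa) >= thresh]
```

Images whose detected pose drifts far from the conditioning skeleton score near zero and are discarded, which is consistent with the OKS and PCK gains the filter rows show.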
Algorithm | $ {\eta }_{\mathrm{d}} $/% | $ {\eta }_{\mathrm{k}} $/% | $ {\eta }_{\mathrm{p}} $/%
ControlNet[18] | 53 | 30 | 68
HumanSD[20] | 77 | 57 | 71
PoseNet | 100 | 100 | 100
Tab.3 Comparison of generation efficiency for different behaviors with different algorithms
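The generation-efficiency metric $ \eta $ reported in Tab. 3 is simply the fraction of generated images in which a downstream behavior-recognition model still detects the personnel behavior. A minimal sketch, where `detects_behavior` is a hypothetical stand-in for that recognition model:

```python
def generation_efficiency(generated_images, detects_behavior):
    """eta: fraction of generated images in which the behavior-recognition
    model (stand-in callable `detects_behavior`) finds the target behavior."""
    hits = sum(1 for img in generated_images if detects_behavior(img))
    return hits / len(generated_images)
```

Under this reading, PoseNet's 100% rows mean every generated image remained usable for recognition, whereas roughly half of ControlNet's outputs did not.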
Fig.6 Generation effect comparison of PoseNet with or without image filter and two-stage training method
Fig.7 Comparison of generation effect between proposed method and ControlNet
Fig.8 Comparison of generation effects between ControlNet and proposed method under complex postures and occlusion condition
Fig.9 Comparison of generation effects in replacement and addition scenarios
Fig.10 Generation effect diagram of proposed method under climbing, falling and ladder-carrying postures
[1]   WANG Liuwang. A review of the application of machine vision in power safety monitoring [J]. Zhejiang Electric Power, 2022, 41(10): 16–26.
[2]   ZHAO Zhenbing, ZHANG Wei, ZHAI Yongjie, et al. Concept, research status and prospect of electric power vision technology [J]. Electric Power Science and Engineering, 2020, 36(1): 1–8.
[3]   QI Donglian, HAN Yifeng, ZHOU Ziqiang, et al. Review of defect detection technology of power equipment based on video images [J]. Journal of Electronics and Information Technology, 2022, 44(11): 3709–3720. doi: 10.11999/JEIT211588
[4]   YAN Yunfeng, CHEN Xi, JIN Haoyuan, et al. Research status and development of computer-vision-based power workers' behavior analysis [J]. High Voltage Engineering, 2024, 50(5): 1842–1854.
[5]   CHEN Foji, ZHU Feng, WU Qingxiao, et al. A survey about image generation with generative adversarial nets [J]. Chinese Journal of Computers, 2021, 44(2): 347–369. doi: 10.11897/SP.J.1016.2021.00347
[6]   GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al Generative adversarial networks[J]. Communications of the ACM, 2020, 63 (11): 139- 144
doi: 10.1145/3422622
[7]   HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver: NeurIPS Foundation, 2020: 6840–6851.
[8]   NICHOL A, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models [EB/OL]. (2022−03−08) [2025−01−14]. https://arxiv.org/abs/2112.10741.
[9]   SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding [C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans: NeurIPS Foundation, 2022: 36479–36494.
[10]   ZHANG Meifeng, TAN Yikun, CHEN Shijun, et al. Infrared image generation technology and application of small sample of electrical equipment based on DAGAN [J]. Electric Engineering, 2023(6): 76–79.
[11]   HE Yuhao, SONG Yunhai, HE Sen, et al. A few-shot image generation method for power defect scenarios [J]. Zhejiang Electric Power, 2024, 43(1): 126–132.
[12]   YANG Jianfeng, QIN Zhong, PANG Xiaolong, et al. Foreign body intrusion monitoring and recognition method based on Dense-YOLOv3 deep learning network [J]. Power System Protection and Control, 2021, 49(4): 37–44.
[13]   WANG Dewen, LI Yedong. Insulator object detection based on image deblurring by WGAN [J]. Electric Power Automation Equipment, 2020, 40(5): 188–198.
[14]   HUANG Wenqi, XU Aidong, MING Zhe, et al. Prediction method for the behavior of substation staff based on generative adversarial network [J]. Southern Power System Technology, 2019, 13(2): 45–50.
[15]   SHAO Zhenguo, ZHANG Chengsheng, CHEN Feixiong, et al. A review on generative adversarial networks for power system applications [J]. Proceedings of the CSEE, 2023, 43(3): 987–1004.
[16]   ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10674–10685.
[17]   RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// Medical Image Computing and Computer-Assisted Intervention. Munich: Springer, 2015: 234–241.
[18]   ZHANG L, RAO A, AGRAWALA M. Adding conditional control to text-to-image diffusion models [C]// IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 3813–3824.
[19]   MOU C, WANG X, XIE L, et al. T2I-adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models [C]// AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 4296–4304.
[20]   JU X, ZENG A, ZHAO C, et al. HumanSD: a native skeleton-guided diffusion model for human image generation [C]// IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 15942–15952.
[21]   LIU X, REN J, SIAROHIN A, et al. HyperHuman: hyper-realistic human generation with latent structural diffusion [EB/OL]. (2024−03−15) [2025−01−14]. https://arxiv.org/abs/2310.08579.
[22]   YAN Zhengbin. Research on robust multi-pose human image generation method [D]. Tianjin: Tianjin University of Technology, 2023.
[23]   ZUO Ran, HU Haoxiang, DENG Xiaoming, et al. Survey on deep learning methods for freehand-sketch-based visual content generation [J]. Journal of Software, 2024, 35(7): 3497–3530.
[24]   WEN Yuanbo, GAO Tao, AN Yisheng, et al. Weather-degraded image restoration based on visual prompt learning [J]. Chinese Journal of Computers, 2024, 47(10): 2401–2416.
[25]   CORDTS M, OMRAN M, RAMOS S, et al. The cityscapes dataset for semantic urban scene understanding [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213–3223.
[26]   REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 779–788.
[27]   CHENG B, MISRA I, SCHWING A G, et al. Masked-attention mask Transformer for universal image segmentation [C]// IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1280–1289.
[28]   XU Y, ZHANG J, ZHANG Q, et al. ViTPose: simple vision Transformer baselines for human pose estimation [C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans: NeurIPS Foundation, 2022: 38571–38584.
[29]   LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation [C]// International Conference on Machine Learning. Baltimore: PMLR, 2022: 12888–12900.
[30]   LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Zurich: Springer, 2014: 740–755.
[31]   HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: NeurIPS Foundation, 2017: 6629–6640.
[32]   BIŃKOWSKI M, SUTHERLAND D J, ARBEL M, et al. Demystifying MMD GANs [C]// International Conference on Learning Representations. Vancouver: ICLR, 2018: 1–36.