Journal of Zhejiang University (Engineering Science)  2026, Vol. 60 Issue (1): 43-51    DOI: 10.3785/j.issn.1008-973X.2026.01.004
Computer Technology
Image generation for power personnel behaviors based on diffusion model with multimodal prompts
Zhihang ZHU1(),Yunfeng YAN1,2,Donglian QI1,2,*()
1. College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
2. Hainan Institute of Zhejiang University, Sanya 572025, China
Abstract:

A multimodal conditional-control image generation model, PoseNet, was established for power personnel behaviors to address the challenge that the unique and complex nature of these behaviors makes image data scarce, hindering data-driven behavior recognition. Built on the stable diffusion model, PoseNet fully integrates human skeleton, mask and text description information and adds a keypoint loss function, enabling the generation of high-quality, controllable human images. An image filter based on keypoint similarity was designed to remove erroneous and low-quality generated images, and a two-stage training strategy was adopted: the model was pre-trained on generic data and fine-tuned on private data to improve performance. Considering the behavioral characteristics of power personnel, a set of evaluation metrics combining generic and specialized metrics was designed, and image generation performance under the different metrics was analyzed. Experimental results showed that, compared with the mainstream human image generation models ControlNet and HumanSD, the proposed model produced more accurate and realistic results.
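The keypoint-similarity filter described above can be sketched as follows. This is a minimal illustration assuming an OKS-style similarity between the conditioning skeleton and the pose re-estimated from the generated image; the threshold `thr = 0.7`, the per-keypoint constants `kappa`, and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def oks(gt, pred, visible, scale, kappa):
    """Object keypoint similarity between one ground-truth and one
    predicted pose. gt/pred are (K, 2) arrays of (x, y) coordinates,
    visible is a (K,) 0/1 mask, scale is the person scale, and kappa
    holds per-keypoint falloff constants."""
    d2 = np.sum((gt - pred) ** 2, axis=1)            # squared distances
    e = np.exp(-d2 / (2.0 * scale**2 * kappa**2))    # per-keypoint similarity
    return float(np.sum(e * visible) / np.sum(visible))

def keep_image(cond_kpts, est_kpts, visible, scale, kappa, thr=0.7):
    """Filter decision: keep a generated image only if the pose
    re-estimated from it (est_kpts) matches the conditioning pose
    (cond_kpts) well enough. thr = 0.7 is an assumed threshold."""
    return oks(cond_kpts, est_kpts, visible, scale, kappa) >= thr
```

In practice, `est_kpts` would come from running a pose estimator on the generated image; images failing the check are discarded before the data is used for recognition training.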

Key words: conditional image generation model    data augmentation    human body keypoint    image segmentation    diffusion model    deep learning
Received: 2024-12-17    Published: 2025-12-15
CLC number: TP 391
Corresponding author: Donglian QI    E-mail: 22210044@zju.edu.cn; qidl@zju.edu.cn
About the author: Zhihang ZHU (2000—), male, master's student, engaged in research on computer vision. orcid.org/0009-0000-8952-5249. E-mail: 22210044@zju.edu.cn

Cite this article:

Zhihang ZHU, Yunfeng YAN, Donglian QI. Image generation for power personnel behaviors based on diffusion model with multimodal prompts. Journal of Zhejiang University (Engineering Science), 2026, 60(1): 43-51.

Link this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.01.004        https://www.zjujournals.com/eng/CN/Y2026/V60/I1/43

Fig. 1  Architecture of power personnel behavior image generation
Fig. 2  Basic pipeline of the stable diffusion model
Fig. 3  Architectures of the stable diffusion model and PoseNet
Fig. 4  Flow chart of image self-annotation
Fig. 5  Flow chart of image filtering
| Metric | Definition | Purpose |
| --- | --- | --- |
| FID | Fréchet distance between the feature vectors of two image sets extracted by an Inception model. | Measures the overall similarity of two image sets; a lower FID means the sets are more similar and the generated images are of higher quality. |
| KID | Unbiased kernel-matrix estimate between the feature vectors of two image sets extracted by an Inception model. | Measures the overall similarity of two image sets; a lower KID means the sets are more similar and the generated images are of higher quality. |
| CLIP-Score (image-image / image-text) | Cosine similarity between the feature vectors of two image sets, or of image-text pairs, extracted by a pretrained CLIP model. | Evaluates the similarity of two image sets or the match of image-text pairs; a higher CLIP-Score means more similar images or better image-text consistency. |
| PCK | Proportion of predicted keypoints whose distance to the ground-truth keypoints is below a threshold, i.e., the percentage of correctly detected keypoints. | Evaluates pose-estimation accuracy; a higher PCK means more accurate keypoints in the generated images. |
| OKS | Distance between two sets of keypoints, taking keypoint visibility, human scale and weight assignment into account. | Evaluates the match between two human poses; a higher OKS means the two keypoint sets are closer and the generated keypoints are more accurate. |
| Generation efficiency $ \eta $ | For one personnel behavior image, the ratio of generated images in which the behavior recognition model detects the personnel behavior to the total number of generated images. | Evaluates generation efficiency; a higher ratio means generated images are more likely to be usable. |

Table 1  Definitions of the metrics in the evaluation metric set
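The PCK definition in Table 1 reduces to a thresholded distance count; a minimal sketch (the function name and the choice of normalization for the threshold are assumptions for illustration):

```python
import numpy as np

def pck(gt, pred, threshold):
    """Percentage of correct keypoints: fraction of predicted keypoints
    whose Euclidean distance to the ground truth is below `threshold`.
    gt/pred are (K, 2) arrays; in practice the threshold is usually a
    fraction of the person's bounding-box size (an assumed convention)."""
    d = np.linalg.norm(gt - pred, axis=1)   # per-keypoint distances
    return float(np.mean(d < threshold))    # fraction within threshold
```

For example, with one keypoint off by 1 px and another off by 20 px under a 2 px threshold, PCK is 0.5.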
| Method | FID | KID | CLIP-Score (img-img/img-text) | PCK/% | OKS |
| --- | --- | --- | --- | --- | --- |
| ControlNet[18] | 274.72 | 6.62 | 67.33/30.31 | 47.5 | 0.872 |
| HumanSD[20] | 331.05 | 11.65 | 62.21/29.24 | 89.4 | 0.946 |
| PoseNet | 130.79 | 5.12 | 87.23/30.32 | 75.4 | 0.889 |
| PoseNet + image filter | 130.43 | 4.90 | 89.81/30.43 | 90.4 | 0.978 |
| PoseNet + image filter + two-stage training | 128.25 | 4.56 | 91.02/31.44 | 94.2 | 0.979 |

Table 2  Quantitative results of different methods under multiple metrics
| Algorithm | $ {\eta }_{\mathrm{d}} $/% | $ {\eta }_{\mathrm{k}} $/% | $ {\eta }_{\mathrm{p}} $/% |
| --- | --- | --- | --- |
| ControlNet[18] | 53 | 30 | 68 |
| HumanSD[20] | 77 | 57 | 71 |
| PoseNet | 100 | 100 | 100 |

Table 3  Comparison of generation efficiency of different algorithms for different behaviors
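The generation efficiency $ \eta $ defined in Table 1 is a simple ratio; a sketch, where `detects_behavior` is a placeholder predicate standing in for an actual behavior recognition model:

```python
def generation_efficiency(generated_images, detects_behavior):
    """eta: fraction of generated images in which the behavior
    recognition model detects the target personnel behavior.
    `detects_behavior` is a hypothetical stand-in for the recognizer."""
    if not generated_images:
        return 0.0
    hits = sum(1 for img in generated_images if detects_behavior(img))
    return hits / len(generated_images)
```

An $ \eta $ of 100% (as PoseNet reports in Table 3) means every generated image passes the downstream recognition check.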
Fig. 6  Comparison of generation results of PoseNet with and without the image filter and two-stage training
Fig. 7  Comparison of generation results between the proposed method and ControlNet
Fig. 8  Comparison of generation results between ControlNet and the proposed method under complex poses and occlusion
Fig. 9  Comparison of generation results in replacement and addition cases
Fig. 10  Generation results of the proposed method for pole-climbing, falling and ladder-carrying poses
1 WANG Liuwang. A review of the application of machine vision in power safety monitoring [J]. Zhejiang Electric Power, 2022, 41(10): 16–26.
2 ZHAO Zhenbing, ZHANG Wei, ZHAI Yongjie, et al. Concept, research status and prospect of electric power vision technology [J]. Electric Power Science and Engineering, 2020, 36(1): 1–8.
3 QI Donglian, HAN Yifeng, ZHOU Ziqiang, et al. Review of defect detection technology of power equipment based on video images [J]. Journal of Electronics and Information Technology, 2022, 44(11): 3709–3720. doi: 10.11999/JEIT211588
4 YAN Yunfeng, CHEN Xi, JIN Haoyuan, et al. Research status and development of computer-vision-based power workers' behavior analysis [J]. High Voltage Engineering, 2024, 50(5): 1842–1854.
5 CHEN Foji, ZHU Feng, WU Qingxiao, et al. A survey about image generation with generative adversarial nets [J]. Chinese Journal of Computers, 2021, 44(2): 347–369. doi: 10.11897/SP.J.1016.2021.00347
6 GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks [J]. Communications of the ACM, 2020, 63(11): 139–144. doi: 10.1145/3422622
7 HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver: NeurIPS Foundation, 2020: 6840–6851.
8 NICHOL A, DHARIWAL P, RAMESH A, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models [EB/OL]. (2022−03−08) [2025−01−14]. https://arxiv.org/abs/2112.10741.
9 SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding [C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans: NeurIPS Foundation, 2022: 36479–36494.
10 ZHANG Meifeng, TAN Yikun, CHEN Shijun, et al. Infrared image generation technology and application of small sample of electrical equipment based on DAGAN [J]. Electric Engineering, 2023(6): 76–79.
11 HE Yuhao, SONG Yunhai, HE Sen, et al. A few-shot image generation method for power defect scenarios [J]. Zhejiang Electric Power, 2024, 43(1): 126–132.
12 YANG Jianfeng, QIN Zhong, PANG Xiaolong, et al. Foreign body intrusion monitoring and recognition method based on Dense-YOLOv3 deep learning network [J]. Power System Protection and Control, 2021, 49(4): 37–44.
13 WANG Dewen, LI Yedong. Insulator object detection based on image deblurring by WGAN [J]. Electric Power Automation Equipment, 2020, 40(5): 188–198.
14 HUANG Wenqi, XU Aidong, MING Zhe, et al. Prediction method for the behavior of substation staff based on generative adversarial network [J]. Southern Power System Technology, 2019, 13(2): 45–50.
15 SHAO Zhenguo, ZHANG Chengsheng, CHEN Feixiong, et al. A review on generative adversarial networks for power system applications [J]. Proceedings of the CSEE, 2023, 43(3): 987–1004.
16 ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10674–10685.
17 RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// Medical Image Computing and Computer-Assisted Intervention. Munich: Springer, 2015: 234–241.
18 ZHANG L, RAO A, AGRAWALA M. Adding conditional control to text-to-image diffusion models [C]// IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 3813–3824.
19 MOU C, WANG X, XIE L, et al. T2I-Adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models [C]// AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 4296–4304.
20 JU X, ZENG A, ZHAO C, et al. HumanSD: a native skeleton-guided diffusion model for human image generation [C]// IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 15942–15952.
21 LIU X, REN J, SIAROHIN A, et al. HyperHuman: hyper-realistic human generation with latent structural diffusion [EB/OL]. (2024−03−15) [2025−01−14]. https://arxiv.org/abs/2310.08579.
22 YAN Zhengbin. Research on robust multi-pose human image generation method [D]. Tianjin: Tianjin University of Technology, 2023.
23 ZUO Ran, HU Haoxiang, DENG Xiaoming, et al. Survey on deep learning methods for freehand-sketch-based visual content generation [J]. Journal of Software, 2024, 35(7): 3497–3530.
24 WEN Yuanbo, GAO Tao, AN Yisheng, et al. Weather-degraded image restoration based on visual prompt learning [J]. Chinese Journal of Computers, 2024, 47(10): 2401–2416.
25 CORDTS M, OMRAN M, RAMOS S, et al. The Cityscapes dataset for semantic urban scene understanding [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 3213–3223.
26 REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 779–788.
27 CHENG B, MISRA I, SCHWING A G, et al. Masked-attention mask transformer for universal image segmentation [C]// IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1280–1289.
28 XU Y, ZHANG J, ZHANG Q, et al. ViTPose: simple vision transformer baselines for human pose estimation [C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans: NeurIPS Foundation, 2022: 38571–38584.
29 LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation [C]// International Conference on Machine Learning. Baltimore: PMLR, 2022: 12888–12900.
30 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Zurich: Springer, 2014: 740–755.
31 HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: NeurIPS Foundation, 2017: 6629–6640.
32 BIŃKOWSKI M, SUTHERLAND D J, ARBEL M, et al. Demystifying MMD GANs [C]// International Conference on Learning Representations. Vancouver: ICLR, 2018: 1–36.