Journal of Zhejiang University (Engineering Science)  2025, Vol. 59 Issue (7): 1471-1480    DOI: 10.3785/j.issn.1008-973X.2025.07.015
Computer Technology and Control Engineering
Missing value imputation algorithm based on accelerated diffusion model
Shengju WANG(),Zan ZHANG*()
School of Electronics and Control Engineering, Chang’an University, Xi’an 710064, China
Abstract:

To address the adverse effects that missing data in tabular datasets have on downstream tasks, an imputation method based on diffusion models was proposed. To overcome the long generation time of the original diffusion model, an imputation method based on an accelerated diffusion model (PNDM_Tab) was designed. The forward process of the diffusion model was realized through Gaussian noise addition, and pseudo-numerical methods for diffusion models were employed to accelerate the reverse process. A network structure combining U-Net with an attention mechanism was used to extract salient features from the data efficiently and to predict the noise accurately. To provide supervised targets during the training phase, the training data were randomly masked to generate new missing entries. Comparative experiments on nine datasets showed that PNDM_Tab achieved the lowest root mean square error on six of them. The results also demonstrate that, compared with the original diffusion model, using pseudo-numerical methods in the reverse process reduces the number of sampling steps while maintaining generative performance.
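The two data-preparation steps described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: `forward_noise` samples the Gaussian forward process q(x_t | x_0), and `random_training_mask` hides part of the observed entries so the hidden values serve as supervised targets; all names and shapes are assumptions.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar)*x_0, (1 - a_bar)*I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps  # eps is the regression target for the noise predictor

def random_training_mask(observed_mask, p_mis, rng):
    """Hide a fraction p_mis of the observed entries to create new 'missing'
    cells; the hidden ground-truth values supervise the training loss."""
    hide = rng.random(observed_mask.shape) < p_mis
    cond_mask = observed_mask & ~hide    # entries the model conditions on
    target_mask = observed_mask & hide   # entries the loss is computed on
    return cond_mask, target_mask

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 5))        # a small numeric table
obs = rng.random((8, 5)) < 0.9          # ~10% genuinely missing
xt, eps = forward_noise(x0, alpha_bar_t=0.5, rng=rng)
cond, target = random_training_mask(obs, p_mis=0.2, rng=rng)
```

By construction the conditioning and target masks are disjoint and together cover exactly the observed entries.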

Key words: tabular data    diffusion model    data imputation    attention mechanism    deep learning
Received: 2024-06-04    Published: 2025-07-25
CLC:  TP 391  
Corresponding author: Zan ZHANG     E-mail: wangshengju@chd.edu.cn; z.zhang@chd.edu.cn
About the author: Shengju WANG (1999—), male, master's student, engaged in research on deep learning. orcid.org/0009-0008-4078-3026. E-mail: wangshengju@chd.edu.cn

Cite this article:


Shengju WANG, Zan ZHANG. Missing value imputation algorithm based on accelerated diffusion model. Journal of Zhejiang University (Engineering Science), 2025, 59(7): 1471-1480.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.07.015        https://www.zjujournals.com/eng/CN/Y2025/V59/I7/1471

Fig. 1  Training process of the data imputation method based on the accelerated diffusion model
Fig. 2  Imputation process of the data imputation method based on the accelerated diffusion model
Fig. 3  Structure of the data imputation method based on the accelerated diffusion model
Fig. 4  Network structure of the residual block in U-Net
Fig. 5  Structure of the self-attention mechanism
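The self-attention operation sketched in Fig. 5 is the standard scaled dot-product form, softmax(QK^T/√d)V. A minimal NumPy sketch (illustrative only; the projection matrices `Wq`, `Wk`, `Wv` and shapes are assumptions, and the paper's multi-head variant with H=4 heads is omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention: softmax(QK^T/sqrt(d)) V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    w = softmax(q @ k.T / np.sqrt(d))   # (n, n) attention weights, rows sum to 1
    return w @ v

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)     # shape (6, 4)
```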
Stage | Module | K | S | P | Cout | H
Initial layer | Conv layer | 3×3 | 1 | 1×1 | 16 | —
Encoder | Residual block | 3×3 | 1 | 1×1 | 16 | —
 | | 3×3 | 1 | 1×1 | 16 | —
 | Residual block | 3×3 | 1 | 1×1 | 16 | —
 | | 3×3 | 1 | 1×1 | 16 | —
 | Self-attention block | — | — | — | — | 4
 | Downsampling block (all levels but the last) | 5×5 | 2 | 1×1 | 128 | —
 | Downsampling block (last level) | 3×3 | 1 | 1×1 | 128 | —
Middle layer | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | Self-attention block | — | — | — | — | 4
 | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
Decoder | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | Self-attention block | — | — | — | — | 4
 | Upsampling block (all levels but the last) | 2×2 | 1 | 1×1 | 128 | —
 | Upsampling block (last level) | 3×3 | 1 | 1×1 | 16 | —
 | (upsampling implemented with nn.Upsample(scale_factor = 2)) | | | | |
Output layer | Conv layer | 1×1 | 1 | 0 | 1 | —
Table 1  Network parameters of the data imputation method based on the accelerated diffusion model
Dataset | Samples | Features | Categorical features | Numerical features
Heart | 1 025 | 13 | 8 | 5
FIR | 6 118 | 51 | 0 | 51
CO | 1 030 | 9 | 0 | 9
Libras | 360 | 91 | 0 | 91
GC | 1 000 | 24 | 0 | 24
WR | 5 456 | 24 | 0 | 24
AD | 10 000 | 14 | 8 | 6
Students | 1 000 | 37 | 27 | 10
Breast | 699 | 10 | 0 | 10
Table 2  Information of the datasets used in the imputation method comparison experiments
Fig. 6  Feature embedding module of FT-Transformer
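The FT-Transformer feature embedding of Fig. 6 maps each column of a table to a d-dimensional token: a numeric feature x_i becomes x_i · W_i + b_i, and a categorical feature becomes an embedding-table lookup. An illustrative NumPy sketch under assumed names and shapes (not the paper's or the rtdl library's implementation):

```python
import numpy as np

def feature_tokens(x_num, x_cat, W, b, tables):
    """Embed numeric values and categorical indices into d-dim tokens,
    stacked into a (n_features, d) sequence for the transformer."""
    num_tokens = x_num[:, None] * W + b   # (n_num, d): x_i * W_i + b_i
    cat_tokens = np.stack([tables[j][x_cat[j]] for j in range(len(x_cat))])
    return np.concatenate([num_tokens, cat_tokens], axis=0)

rng = np.random.default_rng(0)
d = 8
x_num = np.array([0.3, -1.2])             # two numeric features
x_cat = np.array([1, 0])                  # two categorical indices
W, b = rng.standard_normal((2, d)), rng.standard_normal((2, d))
tables = [rng.standard_normal((3, d)),    # 3-category feature
          rng.standard_normal((2, d))]    # 2-category feature
tokens = feature_tokens(x_num, x_cat, W, b, tables)   # shape (4, 8)
```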
Method | Breast | Libras | CO | FIR | GC | WR
Mean | 0.251±0.012 | 0.103±0.003 | 0.225±0.008 | 0.135±0.004 | 0.210±0.006 | 0.239±0.002
ICE | 0.145±0.005 | 0.029±0.003 | 0.152±0.006 | 0.042±0.005 | 0.232±0.004 | 0.191±0.003
EM | 0.146±0.006 | 0.027±0.008 | 0.153±0.006 | 0.043±0.005 | 0.189±0.002 | 0.192±0.003
GAIN | 0.177±0.010 | 0.050±0.007 | 0.220±0.005 | 0.086±0.003 | 0.249±0.009 | 0.227±0.005
MissForest | 0.148±0.003 | 0.036±0.002 | 0.167±0.006 | 0.057±0.004 | 0.206±0.005 | 0.180±0.002
MIWAE | 0.481±0.022 | 0.654±0.007 | 0.245±0.006 | 0.146±0.004 | 0.275±0.007 | 0.257±0.003
TabCSDI | 0.152±0.005 | 0.010±0.001 | 0.135±0.012 | 0.051±0.004 | 0.214±0.004 | 0.197±0.004
PNDM_Tab | 0.143±0.006 | 0.008±0.000 | 0.124±0.005 | 0.028±0.004 | 0.192±0.006 | 0.175±0.003
Table 3  RMSE of different imputation methods on datasets with purely numerical features
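The RMSE reported in Tables 3-8 is computed only over the entries that were artificially masked out, comparing the imputed values against the held-out ground truth. A small sketch (function name is illustrative):

```python
import numpy as np

def rmse_on_missing(x_true, x_imputed, missing_mask):
    """Root mean square error over the imputed (originally missing) entries."""
    diff = (x_true - x_imputed)[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_imp = np.array([[1.0, 0.0], [3.0, 4.0]])        # one imputed cell, off by 2
mask = np.array([[False, True], [False, False]])  # only cell (0, 1) was missing
score = rmse_on_missing(x_true, x_imp, mask)      # → 2.0
```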
Method | AD: RMSE | AD: RE | Heart: RMSE | Heart: RE | Students: RMSE | Students: RE
Mean | 0.131±0.003 | 0.628±0.010 | 0.164±0.006 | 0.487±0.021 | 0.231±0.007 | 0.531±0.014
ICE | 0.122±0.003 | 0.571±0.011 | 0.145±0.005 | 0.391±0.024 | 0.186±0.011 | 0.432±0.013
EM | 0.122±0.003 | 0.564±0.006 | 0.145±0.006 | 0.393±0.010 | 0.187±0.010 | 0.414±0.010
GAIN | 0.135±0.002 | 0.637±0.007 | 0.158±0.003 | 0.403±0.019 | 0.257±0.002 | 0.488±0.014
MissForest | 0.118±0.003 | 0.560±0.005 | 0.140±0.004 | 0.336±0.026 | 0.169±0.006 | 0.414±0.013
MIWAE | 0.136±0.004 | 0.500±0.003 | 0.185±0.013 | 0.477±0.040 | 0.305±0.009 | 0.528±0.007
TabCSDI | 0.107±0.004 | 0.393±0.006 | 0.147±0.002 | 0.389±0.032 | 0.221±0.013 | 0.402±0.010
PNDM_Tab | 0.111±0.003 | 0.391±0.002 | 0.139±0.004 | 0.351±0.027 | 0.190±0.009 | 0.343±0.012
Table 4  Performance comparison of different imputation methods on mixed-feature datasets
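For the categorical features of the mixed datasets, the tables additionally report an error rate (RE): the fraction of imputed categorical cells whose predicted category differs from the held-out truth. An illustrative sketch (names are assumptions):

```python
import numpy as np

def error_rate(y_true, y_imputed, missing_mask):
    """Fraction of imputed categorical entries that are wrong."""
    wrong = (y_true != y_imputed) & missing_mask
    return float(wrong.sum() / missing_mask.sum())

y_true = np.array([0, 2, 1, 1])
y_imp = np.array([0, 1, 1, 0])
mask = np.array([False, True, True, True])  # three cells were imputed
re = error_rate(y_true, y_imp, mask)        # 2 of 3 imputed cells wrong
```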
Dataset | Model | step=10 | step=20 | step=150
Breast | DDPM_Tab | 0.309±0.019 | 0.229±0.013 | 0.141±0.008
Breast | PNDM_Tab | 0.143±0.006 | 0.143±0.006 | 0.140±0.006
GC | DDPM_Tab | 0.344±0.009 | 0.296±0.007 | 0.191±0.006
GC | PNDM_Tab | 0.192±0.006 | 0.190±0.006 | 0.187±0.006
Table 5  RMSE of different reverse-process models on two datasets
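The acceleration compared above comes from the pseudo-numerical reverse process: instead of one small stochastic DDPM step per timestep, PNDM combines the last four noise estimates with linear-multistep (Adams-Bashforth) coefficients and applies a deterministic "transfer" step, so far fewer sampling steps are needed. The sketch below follows the published PNDM formulas but is a simplified illustration with a dummy noise history, not the paper's implementation:

```python
import numpy as np

def transfer(x_t, eps, a_t, a_s):
    """PNDM transfer step: move x_t from noise level a_t = alpha_bar_t
    to a_s = alpha_bar_s using the noise estimate eps."""
    coef = (a_s - a_t) / (
        np.sqrt(a_t) * (np.sqrt((1 - a_s) * a_t) + np.sqrt((1 - a_t) * a_s))
    )
    return np.sqrt(a_s / a_t) * x_t - coef * eps

def plms_step(x_t, eps_history, a_t, a_s):
    """4th-order pseudo linear multistep: blend the last four noise
    estimates with Adams-Bashforth coefficients, then transfer."""
    e1, e2, e3, e4 = eps_history[-1], eps_history[-2], eps_history[-3], eps_history[-4]
    eps = (55 * e1 - 59 * e2 + 37 * e3 - 9 * e4) / 24
    return transfer(x_t, eps, a_t, a_s)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 3))
history = [rng.standard_normal((4, 3)) for _ in range(4)]  # dummy estimates
x_s = plms_step(x_t, history, a_t=0.5, a_s=0.7)
```

With a zero noise estimate the transfer step reduces to pure rescaling by √(ᾱ_s/ᾱ_t), which is a quick sanity check on the formula.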
Fig. 7  Effect of the number of iterations on RMSE on the Breast dataset
Dataset | p_mis=0.2 | p_mis=0.5 | p_mis=0.8 | p_mis random
Breast | 0.148±0.005 | 0.140±0.007 | 0.145±0.008 | 0.143±0.006
Libras | 0.010±0.001 | 0.020±0.004 | 0.029±0.007 | 0.008±0.000
Table 6  RMSE under different missing rates in the model training stage
Method | WR: p_mis=10% | WR: p_mis=30% | WR: p_mis=50% | FIR: p_mis=10% | FIR: p_mis=30% | FIR: p_mis=50%
Mean | 0.238±0.002 | 0.238±0.001 | 0.238±0.001 | 0.133±0.002 | 0.136±0.004 | 0.135±0.003
ICE | 0.188±0.004 | 0.194±0.002 | 0.202±0.001 | 0.036±0.002 | 0.047±0.005 | 0.059±0.003
EM | 0.188±0.004 | 0.194±0.002 | 0.202±0.001 | 0.039±0.009 | 0.050±0.004 | 0.057±0.002
GAIN | 0.233±0.004 | 0.232±0.004 | 0.268±0.003 | 0.080±0.001 | 0.098±0.005 | 0.188±0.003
MissForest | 0.175±0.003 | 0.184±0.001 | 0.195±0.001 | 0.052±0.002 | 0.061±0.004 | 0.070±0.003
MIWAE | 0.257±0.002 | 0.255±0.002 | 0.256±0.001 | 0.145±0.003 | 0.147±0.005 | 0.147±0.004
TabCSDI | 0.193±0.004 | 0.198±0.003 | 0.204±0.003 | 0.047±0.003 | 0.054±0.005 | 0.060±0.004
PNDM_Tab | 0.170±0.003 | 0.178±0.002 | 0.190±0.002 | 0.022±0.002 | 0.032±0.004 | 0.042±0.003
Table 7  RMSE of different imputation methods on two datasets under different missing rates
Method | p_mis=10%: RMSE | RE | p_mis=30%: RMSE | RE | p_mis=50%: RMSE | RE
Mean | 0.132±0.006 | 0.629±0.013 | 0.131±0.002 | 0.629±0.008 | 0.131±0.003 | 0.627±0.006
ICE | 0.124±0.006 | 0.577±0.009 | 0.123±0.001 | 0.578±0.009 | 0.126±0.003 | 0.581±0.008
EM | 0.124±0.006 | 0.567±0.007 | 0.123±0.001 | 0.572±0.007 | 0.126±0.003 | 0.578±0.007
GAIN | 0.135±0.006 | 0.629±0.007 | 0.137±0.002 | 0.650±0.004 | 0.205±0.032 | 0.668±0.006
MissForest | 0.118±0.006 | 0.560±0.010 | 0.119±0.002 | 0.566±0.006 | 0.123±0.003 | 0.573±0.006
MIWAE | 0.137±0.007 | 0.500±0.010 | 0.137±0.002 | 0.499±0.007 | 0.137±0.004 | 0.499±0.006
TabCSDI | 0.104±0.008 | 0.388±0.010 | 0.109±0.003 | 0.403±0.008 | 0.115±0.004 | 0.419±0.005
PNDM_Tab | 0.110±0.007 | 0.384±0.009 | 0.113±0.003 | 0.401±0.005 | 0.119±0.004 | 0.423±0.005
Table 8  Performance comparison of different imputation methods on the AD dataset under different missing rates
Dataset | Dataset size | Batch size | t_imp/s
Breast | 699 | 16 | 107
FIR | 6 118 | 512 | 1 619
AD | 10 000 | 512 | 899
Table 9  Imputation time of the proposed method on datasets of different sizes
Method | t_imp/s
Mean | 0.026
ICE | 7.330
EM | 1.510
GAIN | 6.470
MissForest | 53.400
MIWAE | 5.120
TabCSDI | 792.000
PNDM_Tab | 107.000
Table 10  Imputation time of different methods on the Breast dataset