Journal of Zhejiang University (Engineering Science)  2025, Vol. 59 Issue (7): 1471-1480    DOI: 10.3785/j.issn.1008-973X.2025.07.015
Computer Technology and Control Engineering
Missing value imputation algorithm based on accelerated diffusion model
Shengju WANG(),Zan ZHANG*()
School of Electronics and Control Engineering, Chang’an University, Xi’an 710064, China
Abstract:

To address the adverse effects that missing data in tabular datasets have on downstream tasks, an imputation method based on diffusion models was proposed. To overcome the long generation time of the original diffusion model, an imputation method based on an accelerated diffusion model (PNDM_Tab) was designed. The forward process of the diffusion model was realized through Gaussian noise addition, and pseudo-numerical methods for diffusion models were employed to accelerate the reverse process. A network structure combining U-Net with an attention mechanism was used to extract salient features from the data efficiently and to predict the noise accurately. To provide supervised targets during the training phase, the training data were randomly masked to generate new missing entries. Comparative experiments on nine datasets showed that PNDM_Tab achieved the lowest root mean square error on six of them. The results also demonstrate that, compared with the original diffusion model, using pseudo-numerical methods in the reverse process reduces the number of sampling steps while maintaining generative performance.
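The two data-preparation steps described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: `forward_noise` samples the Gaussian forward process q(x_t | x_0), and `random_training_mask` hides part of the observed entries so the hidden values serve as supervised targets; all names and shapes are assumptions.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar)*x_0, (1 - a_bar)*I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps  # eps is the regression target for the noise predictor

def random_training_mask(observed_mask, p_mis, rng):
    """Hide a fraction p_mis of the observed entries to create new 'missing'
    cells; the hidden ground-truth values supervise the training loss."""
    hide = rng.random(observed_mask.shape) < p_mis
    cond_mask = observed_mask & ~hide    # entries the model conditions on
    target_mask = observed_mask & hide   # entries the loss is computed on
    return cond_mask, target_mask

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 5))        # a small numeric table
obs = rng.random((8, 5)) < 0.9          # ~10% genuinely missing
xt, eps = forward_noise(x0, alpha_bar_t=0.5, rng=rng)
cond, target = random_training_mask(obs, p_mis=0.2, rng=rng)
```

By construction the conditioning and target masks are disjoint and together cover exactly the observed entries.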

Key words: tabular data    diffusion model    data imputation    attention mechanism    deep learning
Received: 2024-06-04    Published: 2025-07-25
CLC:  TP 391  
Corresponding author: Zan ZHANG     E-mail: wangshengju@chd.edu.cn; z.zhang@chd.edu.cn
About the author: Shengju WANG (1999—), male, master's student, engaged in research on deep learning. orcid.org/0009-0008-4078-3026. E-mail: wangshengju@chd.edu.cn

Cite this article:


Shengju WANG, Zan ZHANG. Missing value imputation algorithm based on accelerated diffusion model. Journal of Zhejiang University (Engineering Science), 2025, 59(7): 1471-1480.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.07.015        https://www.zjujournals.com/eng/CN/Y2025/V59/I7/1471

Fig. 1  Training process of the data imputation method based on the accelerated diffusion model
Fig. 2  Imputation process of the data imputation method based on the accelerated diffusion model
Fig. 3  Structure of the data imputation method based on the accelerated diffusion model
Fig. 4  Network structure of the residual block in U-Net
Fig. 5  Structure of the self-attention mechanism
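The self-attention operation sketched in Fig. 5 is the standard scaled dot-product form, softmax(QK^T/√d)V. A minimal NumPy sketch (illustrative only; the projection matrices `Wq`, `Wk`, `Wv` and shapes are assumptions, and the paper's multi-head variant with H=4 heads is omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention: softmax(QK^T/sqrt(d)) V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    w = softmax(q @ k.T / np.sqrt(d))   # (n, n) attention weights, rows sum to 1
    return w @ v

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)     # shape (6, 4)
```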
Stage | Module | K | S | P | Cout | H
Initial layer | Conv layer | 3×3 | 1 | 1×1 | 16 | —
Encoder | Residual block | 3×3 | 1 | 1×1 | 16 | —
 | | 3×3 | 1 | 1×1 | 16 | —
 | Residual block | 3×3 | 1 | 1×1 | 16 | —
 | | 3×3 | 1 | 1×1 | 16 | —
 | Self-attention block | — | — | — | — | 4
 | Downsampling block (all levels but the last) | 5×5 | 2 | 1×1 | 128 | —
 | Downsampling block (last level) | 3×3 | 1 | 1×1 | 128 | —
Middle layer | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | Self-attention block | — | — | — | — | 4
 | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
Decoder | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | Residual block | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | | 3×3 | 1 | 1×1 | 128 | —
 | Self-attention block | — | — | — | — | 4
 | Upsampling block (all levels but the last) | 2×2 | 1 | 1×1 | 128 | —
 | Upsampling block (last level) | 3×3 | 1 | 1×1 | 16 | —
 | (upsampling implemented with nn.Upsample(scale_factor = 2)) | | | | |
Output layer | Conv layer | 1×1 | 1 | 0 | 1 | —
Table 1  Network parameters of the data imputation method based on the accelerated diffusion model
Dataset | Samples | Features | Categorical features | Numerical features
Heart | 1 025 | 13 | 8 | 5
FIR | 6 118 | 51 | 0 | 51
CO | 1 030 | 9 | 0 | 9
Libras | 360 | 91 | 0 | 91
GC | 1 000 | 24 | 0 | 24
WR | 5 456 | 24 | 0 | 24
AD | 10 000 | 14 | 8 | 6
Students | 1 000 | 37 | 27 | 10
Breast | 699 | 10 | 0 | 10
Table 2  Information of the datasets used in the imputation method comparison experiments
Fig. 6  Feature embedding module of FT-Transformer
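The FT-Transformer feature embedding of Fig. 6 maps each column of a table to a d-dimensional token: a numeric feature x_i becomes x_i · W_i + b_i, and a categorical feature becomes an embedding-table lookup. An illustrative NumPy sketch under assumed names and shapes (not the paper's or the rtdl library's implementation):

```python
import numpy as np

def feature_tokens(x_num, x_cat, W, b, tables):
    """Embed numeric values and categorical indices into d-dim tokens,
    stacked into a (n_features, d) sequence for the transformer."""
    num_tokens = x_num[:, None] * W + b   # (n_num, d): x_i * W_i + b_i
    cat_tokens = np.stack([tables[j][x_cat[j]] for j in range(len(x_cat))])
    return np.concatenate([num_tokens, cat_tokens], axis=0)

rng = np.random.default_rng(0)
d = 8
x_num = np.array([0.3, -1.2])             # two numeric features
x_cat = np.array([1, 0])                  # two categorical indices
W, b = rng.standard_normal((2, d)), rng.standard_normal((2, d))
tables = [rng.standard_normal((3, d)),    # 3-category feature
          rng.standard_normal((2, d))]    # 2-category feature
tokens = feature_tokens(x_num, x_cat, W, b, tables)   # shape (4, 8)
```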
Method | Breast | Libras | CO | FIR | GC | WR
Mean | 0.251±0.012 | 0.103±0.003 | 0.225±0.008 | 0.135±0.004 | 0.210±0.006 | 0.239±0.002
ICE | 0.145±0.005 | 0.029±0.003 | 0.152±0.006 | 0.042±0.005 | 0.232±0.004 | 0.191±0.003
EM | 0.146±0.006 | 0.027±0.008 | 0.153±0.006 | 0.043±0.005 | 0.189±0.002 | 0.192±0.003
GAIN | 0.177±0.010 | 0.050±0.007 | 0.220±0.005 | 0.086±0.003 | 0.249±0.009 | 0.227±0.005
MissForest | 0.148±0.003 | 0.036±0.002 | 0.167±0.006 | 0.057±0.004 | 0.206±0.005 | 0.180±0.002
MIWAE | 0.481±0.022 | 0.654±0.007 | 0.245±0.006 | 0.146±0.004 | 0.275±0.007 | 0.257±0.003
TabCSDI | 0.152±0.005 | 0.010±0.001 | 0.135±0.012 | 0.051±0.004 | 0.214±0.004 | 0.197±0.004
PNDM_Tab | 0.143±0.006 | 0.008±0.000 | 0.124±0.005 | 0.028±0.004 | 0.192±0.006 | 0.175±0.003
Table 3  RMSE of different imputation methods on datasets with purely numerical features
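The RMSE reported in Tables 3-8 is computed only over the entries that were artificially masked out, comparing the imputed values against the held-out ground truth. A small sketch (function name is illustrative):

```python
import numpy as np

def rmse_on_missing(x_true, x_imputed, missing_mask):
    """Root mean square error over the imputed (originally missing) entries."""
    diff = (x_true - x_imputed)[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_imp = np.array([[1.0, 0.0], [3.0, 4.0]])        # one imputed cell, off by 2
mask = np.array([[False, True], [False, False]])  # only cell (0, 1) was missing
score = rmse_on_missing(x_true, x_imp, mask)      # → 2.0
```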
Method | AD: RMSE | AD: RE | Heart: RMSE | Heart: RE | Students: RMSE | Students: RE
Mean | 0.131±0.003 | 0.628±0.010 | 0.164±0.006 | 0.487±0.021 | 0.231±0.007 | 0.531±0.014
ICE | 0.122±0.003 | 0.571±0.011 | 0.145±0.005 | 0.391±0.024 | 0.186±0.011 | 0.432±0.013
EM | 0.122±0.003 | 0.564±0.006 | 0.145±0.006 | 0.393±0.010 | 0.187±0.010 | 0.414±0.010
GAIN | 0.135±0.002 | 0.637±0.007 | 0.158±0.003 | 0.403±0.019 | 0.257±0.002 | 0.488±0.014
MissForest | 0.118±0.003 | 0.560±0.005 | 0.140±0.004 | 0.336±0.026 | 0.169±0.006 | 0.414±0.013
MIWAE | 0.136±0.004 | 0.500±0.003 | 0.185±0.013 | 0.477±0.040 | 0.305±0.009 | 0.528±0.007
TabCSDI | 0.107±0.004 | 0.393±0.006 | 0.147±0.002 | 0.389±0.032 | 0.221±0.013 | 0.402±0.010
PNDM_Tab | 0.111±0.003 | 0.391±0.002 | 0.139±0.004 | 0.351±0.027 | 0.190±0.009 | 0.343±0.012
Table 4  Performance comparison of different imputation methods on mixed-feature datasets
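For the categorical features of the mixed datasets, the tables additionally report an error rate (RE): the fraction of imputed categorical cells whose predicted category differs from the held-out truth. An illustrative sketch (names are assumptions):

```python
import numpy as np

def error_rate(y_true, y_imputed, missing_mask):
    """Fraction of imputed categorical entries that are wrong."""
    wrong = (y_true != y_imputed) & missing_mask
    return float(wrong.sum() / missing_mask.sum())

y_true = np.array([0, 2, 1, 1])
y_imp = np.array([0, 1, 1, 0])
mask = np.array([False, True, True, True])  # three cells were imputed
re = error_rate(y_true, y_imp, mask)        # 2 of 3 imputed cells wrong
```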
Dataset | Model | step=10 | step=20 | step=150
Breast | DDPM_Tab | 0.309±0.019 | 0.229±0.013 | 0.141±0.008
Breast | PNDM_Tab | 0.143±0.006 | 0.143±0.006 | 0.140±0.006
GC | DDPM_Tab | 0.344±0.009 | 0.296±0.007 | 0.191±0.006
GC | PNDM_Tab | 0.192±0.006 | 0.190±0.006 | 0.187±0.006
Table 5  RMSE of different reverse-process models on two datasets
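The acceleration compared above comes from the pseudo-numerical reverse process: instead of one small stochastic DDPM step per timestep, PNDM combines the last four noise estimates with linear-multistep (Adams-Bashforth) coefficients and applies a deterministic "transfer" step, so far fewer sampling steps are needed. The sketch below follows the published PNDM formulas but is a simplified illustration with a dummy noise history, not the paper's implementation:

```python
import numpy as np

def transfer(x_t, eps, a_t, a_s):
    """PNDM transfer step: move x_t from noise level a_t = alpha_bar_t
    to a_s = alpha_bar_s using the noise estimate eps."""
    coef = (a_s - a_t) / (
        np.sqrt(a_t) * (np.sqrt((1 - a_s) * a_t) + np.sqrt((1 - a_t) * a_s))
    )
    return np.sqrt(a_s / a_t) * x_t - coef * eps

def plms_step(x_t, eps_history, a_t, a_s):
    """4th-order pseudo linear multistep: blend the last four noise
    estimates with Adams-Bashforth coefficients, then transfer."""
    e1, e2, e3, e4 = eps_history[-1], eps_history[-2], eps_history[-3], eps_history[-4]
    eps = (55 * e1 - 59 * e2 + 37 * e3 - 9 * e4) / 24
    return transfer(x_t, eps, a_t, a_s)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 3))
history = [rng.standard_normal((4, 3)) for _ in range(4)]  # dummy estimates
x_s = plms_step(x_t, history, a_t=0.5, a_s=0.7)
```

With a zero noise estimate the transfer step reduces to pure rescaling by √(ᾱ_s/ᾱ_t), which is a quick sanity check on the formula.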
Fig. 7  Effect of the number of iterations on RMSE on the Breast dataset
Dataset | p_mis=0.2 | p_mis=0.5 | p_mis=0.8 | p_mis random
Breast | 0.148±0.005 | 0.140±0.007 | 0.145±0.008 | 0.143±0.006
Libras | 0.010±0.001 | 0.020±0.004 | 0.029±0.007 | 0.008±0.000
Table 6  RMSE under different missing rates in the model training stage
Method | WR: p_mis=10% | WR: p_mis=30% | WR: p_mis=50% | FIR: p_mis=10% | FIR: p_mis=30% | FIR: p_mis=50%
Mean | 0.238±0.002 | 0.238±0.001 | 0.238±0.001 | 0.133±0.002 | 0.136±0.004 | 0.135±0.003
ICE | 0.188±0.004 | 0.194±0.002 | 0.202±0.001 | 0.036±0.002 | 0.047±0.005 | 0.059±0.003
EM | 0.188±0.004 | 0.194±0.002 | 0.202±0.001 | 0.039±0.009 | 0.050±0.004 | 0.057±0.002
GAIN | 0.233±0.004 | 0.232±0.004 | 0.268±0.003 | 0.080±0.001 | 0.098±0.005 | 0.188±0.003
MissForest | 0.175±0.003 | 0.184±0.001 | 0.195±0.001 | 0.052±0.002 | 0.061±0.004 | 0.070±0.003
MIWAE | 0.257±0.002 | 0.255±0.002 | 0.256±0.001 | 0.145±0.003 | 0.147±0.005 | 0.147±0.004
TabCSDI | 0.193±0.004 | 0.198±0.003 | 0.204±0.003 | 0.047±0.003 | 0.054±0.005 | 0.060±0.004
PNDM_Tab | 0.170±0.003 | 0.178±0.002 | 0.190±0.002 | 0.022±0.002 | 0.032±0.004 | 0.042±0.003
Table 7  RMSE of different imputation methods on two datasets under different missing rates
Method | p_mis=10%: RMSE | RE | p_mis=30%: RMSE | RE | p_mis=50%: RMSE | RE
Mean | 0.132±0.006 | 0.629±0.013 | 0.131±0.002 | 0.629±0.008 | 0.131±0.003 | 0.627±0.006
ICE | 0.124±0.006 | 0.577±0.009 | 0.123±0.001 | 0.578±0.009 | 0.126±0.003 | 0.581±0.008
EM | 0.124±0.006 | 0.567±0.007 | 0.123±0.001 | 0.572±0.007 | 0.126±0.003 | 0.578±0.007
GAIN | 0.135±0.006 | 0.629±0.007 | 0.137±0.002 | 0.650±0.004 | 0.205±0.032 | 0.668±0.006
MissForest | 0.118±0.006 | 0.560±0.010 | 0.119±0.002 | 0.566±0.006 | 0.123±0.003 | 0.573±0.006
MIWAE | 0.137±0.007 | 0.500±0.010 | 0.137±0.002 | 0.499±0.007 | 0.137±0.004 | 0.499±0.006
TabCSDI | 0.104±0.008 | 0.388±0.010 | 0.109±0.003 | 0.403±0.008 | 0.115±0.004 | 0.419±0.005
PNDM_Tab | 0.110±0.007 | 0.384±0.009 | 0.113±0.003 | 0.401±0.005 | 0.119±0.004 | 0.423±0.005
Table 8  Performance comparison of different imputation methods on the AD dataset under different missing rates
Dataset | Dataset size | Batch size | t_imp/s
Breast | 699 | 16 | 107
FIR | 6 118 | 512 | 1 619
AD | 10 000 | 512 | 899
Table 9  Imputation time of the proposed method on datasets of different sizes
Method | t_imp/s
Mean | 0.026
ICE | 7.330
EM | 1.510
GAIN | 6.470
MissForest | 53.400
MIWAE | 5.120
TabCSDI | 792.000
PNDM_Tab | 107.000
Table 10  Imputation time of different methods on the Breast dataset