Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (7): 1471-1480    DOI: 10.3785/j.issn.1008-973X.2025.07.015
    
Missing value imputation algorithm based on accelerated diffusion model
Shengju WANG, Zan ZHANG*
School of Electronics and Control Engineering, Chang’an University, Xi’an 710064, China

Abstract  

To address the adverse effects that missing values in tabular data have on downstream tasks, an imputation method based on diffusion models was proposed. An accelerated diffusion model-based imputation method (PNDM_Tab) was designed to address the problem that original diffusion models are time-consuming during generation. The forward process of the diffusion model was realized through Gaussian noise addition, and pseudo-numerical methods for diffusion models were employed to accelerate the reverse process. A network structure combining U-Net with attention mechanisms was used to extract salient features from the data efficiently and to predict the noise accurately. To provide supervised targets during the training phase, new missing entries were generated by randomly masking the training data. Comparative experiments were conducted on nine datasets, and the results showed that PNDM_Tab achieved the lowest root mean square error on six of them compared with other imputation methods. Experimental results demonstrate that, compared with original diffusion models, using pseudo-numerical methods in the reverse process reduces the number of sampling steps while maintaining equivalent generative performance.
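The two training ingredients named above, the Gaussian forward process and random masking for supervised targets, can be sketched as follows. This is a minimal NumPy illustration; the array shapes, noise schedule, and function names are our own assumptions, not the paper's code.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form DDPM forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

def random_mask(x, p_mis, rng):
    """Hide a random fraction p_mis of entries so the hidden values can serve
    as supervised imputation targets during training."""
    mask = rng.random(x.shape) >= p_mis  # True = entry stays observed
    return np.where(mask, x, 0.0), mask

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))       # a toy mini-batch of tabular rows
betas = np.linspace(1e-4, 0.02, 150)   # a common linear noise schedule
x_t, eps = forward_diffuse(x0, 10, betas, rng)
x_obs, mask = random_mask(x0, 0.5, rng)
```

The noise-prediction network is then trained to recover `eps` from `x_t`, conditioned on the still-observed entries.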



Key words: tabular data; diffusion model; data imputation; attention mechanism; deep learning
Received: 04 June 2024      Published: 25 July 2025
CLC:  TP 391  
Corresponding Authors: Zan ZHANG     E-mail: wangshengju@chd.edu.cn;z.zhang@chd.edu.cn
Cite this article:

Shengju WANG, Zan ZHANG. Missing value imputation algorithm based on accelerated diffusion model. Journal of ZheJiang University (Engineering Science), 2025, 59(7): 1471-1480.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.07.015     OR     https://www.zjujournals.com/eng/Y2025/V59/I7/1471


Fig.1 Training process for accelerated diffusion model-based imputation method
Fig.2 Imputation process for accelerated diffusion model-based imputation method
Fig.3 Architecture diagram of accelerated diffusion model-based imputation method
Fig.4 Network structure diagram of residual blocks in U-Net
Fig.5 Structure schematic of self-attention mechanism
Stage         | Module                                | K   | S | P   | Cout | H
Initial layer | Conv layer                            | 3×3 | 1 | 1×1 | 16   | -
Encoder       | Residual block                        | 3×3 | 1 | 1×1 | 16   | -
              |                                       | 3×3 | 1 | 1×1 | 16   | -
              | Residual block                        | 3×3 | 1 | 1×1 | 16   | -
              |                                       | 3×3 | 1 | 1×1 | 16   | -
              | Self-attention block                  | -   | - | -   | -    | 4
              | Downsampling block (all but last level) | 5×5 | 2 | 1×1 | 128 | -
              | Downsampling block (last level)       | 3×3 | 1 | 1×1 | 128  | -
Middle layer  | Residual block                        | 3×3 | 1 | 1×1 | 128  | -
              |                                       | 3×3 | 1 | 1×1 | 128  | -
              | Self-attention block                  | -   | - | -   | -    | 4
              | Residual block                        | 3×3 | 1 | 1×1 | 128  | -
              |                                       | 3×3 | 1 | 1×1 | 128  | -
Decoder       | Residual block                        | 3×3 | 1 | 1×1 | 128  | -
              |                                       | 3×3 | 1 | 1×1 | 128  | -
              |                                       | 3×3 | 1 | 1×1 | 128  | -
              | Residual block                        | 3×3 | 1 | 1×1 | 128  | -
              |                                       | 3×3 | 1 | 1×1 | 128  | -
              |                                       | 3×3 | 1 | 1×1 | 128  | -
              | Self-attention block                  | -   | - | -   | -    | 4
              | Upsampling block (all but last level; nn.Upsample(scale_factor = 2)) | 2×2 | 1 | 1×1 | 128 | -
              | Upsampling block (last level)         | 3×3 | 1 | 1×1 | 16   | -
Output layer  | Conv layer                            | 1×1 | 1 | 0   | 1    | -
Tab.1 Network parameters of accelerated diffusion model-based imputation method (K: kernel size; S: stride; P: padding; Cout: output channels; H: attention heads)
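The self-attention blocks in Tab.1 use H = 4 heads. As a point of reference, multi-head scaled dot-product attention can be sketched in NumPy as below; the toy dimensions and weight initialization are our own, not the paper's configuration.

```python
import numpy as np

def multi_head_self_attention(x, wq, wk, wv, n_heads):
    """Scaled dot-product self-attention split across n_heads heads."""
    n, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    dh = d // n_heads
    outs = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
        outs.append(weights @ v[:, s])
    return np.concatenate(outs, axis=1)  # heads concatenated back to width d

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))  # 6 tokens, model width 16
wq, wk, wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
y = multi_head_self_attention(x, wq, wk, wv, n_heads=4)
```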
Dataset  | Samples | Features | Categorical features | Numerical features
Heart    | 1 025   | 13       | 8                    | 5
FIR      | 6 118   | 51       | 0                    | 51
CO       | 1 030   | 9        | 0                    | 9
Libras   | 360     | 91       | 0                    | 91
GC       | 1 000   | 24       | 0                    | 24
WR       | 5 456   | 24       | 0                    | 24
AD       | 10 000  | 14       | 8                    | 6
Students | 1 000   | 37       | 27                   | 10
Breast   | 699     | 10       | 0                    | 10
Tab.2 Dataset information for performance comparison experiments of imputation methods
Fig.6 Feature embedding module of FT-Transformer
Method     | RMSE (Breast) | RMSE (Libras) | RMSE (CO)   | RMSE (FIR)  | RMSE (GC)   | RMSE (WR)
Mean       | 0.251±0.012   | 0.103±0.003   | 0.225±0.008 | 0.135±0.004 | 0.210±0.006 | 0.239±0.002
ICE        | 0.145±0.005   | 0.029±0.003   | 0.152±0.006 | 0.042±0.005 | 0.232±0.004 | 0.191±0.003
EM         | 0.146±0.006   | 0.027±0.008   | 0.153±0.006 | 0.043±0.005 | 0.189±0.002 | 0.192±0.003
GAIN       | 0.177±0.010   | 0.050±0.007   | 0.220±0.005 | 0.086±0.003 | 0.249±0.009 | 0.227±0.005
MissForest | 0.148±0.003   | 0.036±0.002   | 0.167±0.006 | 0.057±0.004 | 0.206±0.005 | 0.180±0.002
MIWAE      | 0.481±0.022   | 0.654±0.007   | 0.245±0.006 | 0.146±0.004 | 0.275±0.007 | 0.257±0.003
TabCSDI    | 0.152±0.005   | 0.010±0.001   | 0.135±0.012 | 0.051±0.004 | 0.214±0.004 | 0.197±0.004
PNDM_Tab   | 0.143±0.006   | 0.008±0.000   | 0.124±0.005 | 0.028±0.004 | 0.192±0.006 | 0.175±0.003
Tab.3 Root mean square error of different imputation methods on purely numerical-feature datasets
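The RMSE figures in Tab.3 are presumably computed only over the entries that were masked out and then imputed, rather than over the whole table; under that reading of the protocol, the metric is simply:

```python
import numpy as np

def rmse_on_missing(x_true, x_imp, miss_mask):
    """RMSE evaluated only over the masked (imputed) entries.
    miss_mask is True where a value was hidden and had to be imputed."""
    diff = (x_true - x_imp)[miss_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

x_true = np.array([1.0, 2.0, 3.0, 4.0])
x_imp = np.array([1.0, 2.0, 5.0, 4.0])        # one imputed entry, off by 2
miss = np.array([False, False, True, False])
err = rmse_on_missing(x_true, x_imp, miss)
```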
Method     | AD RMSE     | AD RE       | Heart RMSE  | Heart RE    | Students RMSE | Students RE
Mean       | 0.131±0.003 | 0.628±0.010 | 0.164±0.006 | 0.487±0.021 | 0.231±0.007   | 0.531±0.014
ICE        | 0.122±0.003 | 0.571±0.011 | 0.145±0.005 | 0.391±0.024 | 0.186±0.011   | 0.432±0.013
EM         | 0.122±0.003 | 0.564±0.006 | 0.145±0.006 | 0.393±0.010 | 0.187±0.010   | 0.414±0.010
GAIN       | 0.135±0.002 | 0.637±0.007 | 0.158±0.003 | 0.403±0.019 | 0.257±0.002   | 0.488±0.014
MissForest | 0.118±0.003 | 0.560±0.005 | 0.140±0.004 | 0.336±0.026 | 0.169±0.006   | 0.414±0.013
MIWAE      | 0.136±0.004 | 0.500±0.003 | 0.185±0.013 | 0.477±0.040 | 0.305±0.009   | 0.528±0.007
TabCSDI    | 0.107±0.004 | 0.393±0.006 | 0.147±0.002 | 0.389±0.032 | 0.221±0.013   | 0.402±0.010
PNDM_Tab   | 0.111±0.003 | 0.391±0.002 | 0.139±0.004 | 0.351±0.027 | 0.190±0.009   | 0.343±0.012
Tab.4 Performance comparison results of different imputation methods on mixed-feature datasets
Dataset | Model    | RMSE (step=10) | RMSE (step=20) | RMSE (step=150)
Breast  | DDPM_Tab | 0.309±0.019    | 0.229±0.013    | 0.141±0.008
        | PNDM_Tab | 0.143±0.006    | 0.143±0.006    | 0.140±0.006
GC      | DDPM_Tab | 0.344±0.009    | 0.296±0.007    | 0.191±0.006
        | PNDM_Tab | 0.192±0.006    | 0.190±0.006    | 0.187±0.006
Tab.5 Root mean square error of different reverse process models in two datasets
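The step-count advantage in Tab.5 comes from the pseudo-numerical update of Liu et al. [8]: each reverse step combines the last four noise estimates with a 4th-order linear multistep rule, then applies a deterministic DDIM-style transfer. A schematic NumPy sketch, with the noise network abstracted away and the function names our own:

```python
import numpy as np

def ddim_transfer(x_t, eps, abar_t, abar_s):
    """Deterministic transfer from step t to step s given a noise estimate eps."""
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_s) * x0_hat + np.sqrt(1.0 - abar_s) * eps

def plms_eps(e_hist):
    """4th-order linear multistep combination of the last four noise estimates:
    eps' = (55*e_t - 59*e_{t-1} + 37*e_{t-2} - 9*e_{t-3}) / 24."""
    return (55 * e_hist[-1] - 59 * e_hist[-2] + 37 * e_hist[-3] - 9 * e_hist[-4]) / 24
```

Because the multistep combination reuses stored estimates, each sampling step still costs a single network evaluation, which is how the step count drops without losing accuracy.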
Fig.7 Impact of number of cycles on root mean square error in Breast dataset
Dataset | RMSE ($p_{\mathrm{mis}}$=0.2) | RMSE ($p_{\mathrm{mis}}$=0.5) | RMSE ($p_{\mathrm{mis}}$=0.8) | RMSE ($p_{\mathrm{mis}}$ random)
Breast  | 0.148±0.005 | 0.140±0.007 | 0.145±0.008 | 0.143±0.006
Libras  | 0.010±0.001 | 0.020±0.004 | 0.029±0.007 | 0.008±0.000
Tab.6 Root mean square error for different missing rates in model training phase
Method     | WR RMSE, $p_{\mathrm{mis}}$=10% | =30% | =50% | FIR RMSE, $p_{\mathrm{mis}}$=10% | =30% | =50%
Mean       | 0.238±0.002 | 0.238±0.001 | 0.238±0.001 | 0.133±0.002 | 0.136±0.004 | 0.135±0.003
ICE        | 0.188±0.004 | 0.194±0.002 | 0.202±0.001 | 0.036±0.002 | 0.047±0.005 | 0.059±0.003
EM         | 0.188±0.004 | 0.194±0.002 | 0.202±0.001 | 0.039±0.009 | 0.050±0.004 | 0.057±0.002
GAIN       | 0.233±0.004 | 0.232±0.004 | 0.268±0.003 | 0.080±0.001 | 0.098±0.005 | 0.188±0.003
MissForest | 0.175±0.003 | 0.184±0.001 | 0.195±0.001 | 0.052±0.002 | 0.061±0.004 | 0.070±0.003
MIWAE      | 0.257±0.002 | 0.255±0.002 | 0.256±0.001 | 0.145±0.003 | 0.147±0.005 | 0.147±0.004
TabCSDI    | 0.193±0.004 | 0.198±0.003 | 0.204±0.003 | 0.047±0.003 | 0.054±0.005 | 0.060±0.004
PNDM_Tab   | 0.170±0.003 | 0.178±0.002 | 0.190±0.002 | 0.022±0.002 | 0.032±0.004 | 0.042±0.003
Tab.7 Root mean square error of different imputation methods under different missing rates in two datasets
Method     | RMSE ($p_{\mathrm{mis}}$=10%) | RE (10%) | RMSE (30%) | RE (30%) | RMSE (50%) | RE (50%)
Mean       | 0.132±0.006 | 0.629±0.013 | 0.131±0.002 | 0.629±0.008 | 0.131±0.003 | 0.627±0.006
ICE        | 0.124±0.006 | 0.577±0.009 | 0.123±0.001 | 0.578±0.009 | 0.126±0.003 | 0.581±0.008
EM         | 0.124±0.006 | 0.567±0.007 | 0.123±0.001 | 0.572±0.007 | 0.126±0.003 | 0.578±0.007
GAIN       | 0.135±0.006 | 0.629±0.007 | 0.137±0.002 | 0.650±0.004 | 0.205±0.032 | 0.668±0.006
MissForest | 0.118±0.006 | 0.560±0.010 | 0.119±0.002 | 0.566±0.006 | 0.123±0.003 | 0.573±0.006
MIWAE      | 0.137±0.007 | 0.500±0.010 | 0.137±0.002 | 0.499±0.007 | 0.137±0.004 | 0.499±0.006
TabCSDI    | 0.104±0.008 | 0.388±0.010 | 0.109±0.003 | 0.403±0.008 | 0.115±0.004 | 0.419±0.005
PNDM_Tab   | 0.110±0.007 | 0.384±0.009 | 0.113±0.003 | 0.401±0.005 | 0.119±0.004 | 0.423±0.005
Tab.8 Performance comparison results of different imputation methods under different missing rates in AD dataset
Dataset | Dataset size | Batch size | t_imp/s
Breast  | 699          | 16         | 107
FIR     | 6 118        | 512        | 1 619
AD      | 10 000       | 512        | 899
Tab.9 Imputation time comparison of proposed method in datasets of different sizes
Method     | t_imp/s
Mean       | 0.026
ICE        | 7.330
EM         | 1.510
GAIN       | 6.470
MissForest | 53.400
MIWAE      | 5.120
TabCSDI    | 792.000
PNDM_Tab   | 107.000
Tab.10 Imputation time comparison of different methods in Breast dataset
[1] VAN BUUREN S. Flexible imputation of missing data [M]. [S.l.]: CRC Press, 2018.
[2] STEKHOVEN D J, BÜHLMANN P. MissForest: non-parametric missing value imputation for mixed-type data [J]. Bioinformatics, 2012, 28(1): 112-118. doi: 10.1093/bioinformatics/btr597
[3] RESCHE-RIGON M, WHITE I R. Multiple imputation by chained equations for systematically and sporadically missing multilevel data [J]. Statistical Methods in Medical Research, 2018, 27(6): 1634-1649. doi: 10.1177/0962280216666564
[4] MAZUMDER R, HASTIE T, TIBSHIRANI R. Spectral regularization algorithms for learning large incomplete matrices [J]. Journal of Machine Learning Research, 2010, 11: 2287-2322.
[5] YOON J, JORDON J, SCHAAR M. GAIN: missing data imputation using generative adversarial nets [C]// Proceedings of the 35th International Conference on Machine Learning. Stockholm: ACM, 2018: 5689-5698.
[6] MATTEI P A, FRELLSEN J. MIWAE: deep generative modelling and imputation of incomplete data sets [C]// Proceedings of the 36th International Conference on Machine Learning. Long Beach: ACM, 2019: 4413-4423.
[7] ZHENG S, CHAROENPHAKDEE N. Diffusion models for missing value imputation in tabular data [EB/OL]. (2023-03-11)[2023-07-12]. https://arxiv.org/pdf/2210.17128.
[8] LIU L, REN Y, LIN Z, et al. Pseudo numerical methods for diffusion models on manifolds [EB/OL]. (2022-10-31)[2023-08-19]. https://arxiv.org/pdf/2202.09778.
[9] MCKNIGHT P E, MCKNIGHT K M, SIDANI S, et al. Missing data: a gentle introduction [M]. [S.l.]: Guilford Press, 2007.
[10] MALARVIZHI R, THANAMANI A S. K-nearest neighbor in missing data imputation [J]. International Journal of Engineering Research and Development, 2012, 5(1): 5-7.
[11] PANG Xinsheng. A comparative study on missing data imputation methods [J]. Statistics and Decision, 2012, 28(24): 18-22. (in Chinese)
[12] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2020: 6840-6851.
[13] SONG J M, MENG C L, ERMON S. Denoising diffusion implicit models [EB/OL]. (2022-10-05)[2023-08-23]. https://arxiv.org/pdf/2010.02502.
[14] SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al. Score-based generative modeling through stochastic differential equations [EB/OL]. (2021-02-10)[2023-08-25]. https://arxiv.org/pdf/2011.13456.
[15] MAOUTSA D, REICH S, OPPER M. Interacting particle solutions of Fokker-Planck equations through gradient-log-density estimation [J]. Entropy, 2020, 22(8): 802. doi: 10.3390/e22080802
[16] SALIMANS T, HO J. Progressive distillation for fast sampling of diffusion models [EB/OL]. (2022-06-07)[2024-01-23]. https://arxiv.org/pdf/2202.00512.
[17] TASHIRO Y, SONG J, SONG Y, et al. CSDI: conditional score-based diffusion models for probabilistic time series imputation [C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2021: 24804-24816.
[18] NICHOL A Q, DHARIWAL P. Improved denoising diffusion probabilistic models [C]// Proceedings of the 38th International Conference on Machine Learning. Vienna: ACM, 2021: 8162-8171.
[19] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]// Medical Image Computing and Computer-Assisted Intervention. [S.l.]: Springer, 2015: 234-241.
[20] DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis [C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2021: 8780-8794.
[21] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2017: 6000-6010.
[22] GARCÍA-LAENCINA P J, SANCHO-GÓMEZ J L, FIGUEIRAS-VIDAL A R. Pattern classification with missing data: a review [J]. Neural Computing and Applications, 2010, 19(2): 263-282. doi: 10.1007/s00521-009-0295-6
[23] GORISHNIY Y, RUBACHEV I, KHRULKOV V, et al. Revisiting deep learning models for tabular data [C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. [S.l.]: Curran Associates Inc., 2021: 18932-18943.
[24] JARRETT D, CEBERE B C, LIU T, et al. HyperImpute: generalized iterative imputation with automatic model selection [C]// Proceedings of the 39th International Conference on Machine Learning. Baltimore: ACM, 2022: 9916-9937.