Optimization method of approximate aggregate query based on variational auto-encoder

doi:10.3785/j.issn.1008-973X.2024.05.006

Journal of ZheJiang University (Engineering Science)

2024, Vol. 58

Issue (5): 931-940 DOI: 10.3785/j.issn.1008-973X.2024.05.006

Optimization method of approximate aggregate query based on variational auto-encoder

Longsen HUANG1,2(

),Jun FANG1,2,*(

),Yunliang ZHOU1,2,Zhicheng GUO1,2

1. College of Information, North China University of Technology, Beijing 100144, China
2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, Beijing 100144, China

Download:

HTML

PDF(1290KB) HTML
Export: BibTeX | EndNote (RIS)

Abstract

An optimized variational self-encoder-based approximate aggregation query method was proposed for the problem of imbalanced distribution of biased data, which makes it difficult to sample biased distribution data with traditional approximate aggregation query methods. The effect of approximate aggregation query method on the accuracy of approximate aggregation query for biased distribution data was analyzed. The bias-distributed data were hierarchically grouped in the preprocessing stage, and the network structure and loss function of the variational self-encoder generation model were optimized to reduce the approximate aggregated query relative error. The experimental results show that the query relative error of the approximate aggregation query is smaller for skewness distribution data compared with the benchmark method, and the rising trend of the query relative error is smoother as the skewness coefficient increases.

Key words： approximate query processing skewness distribution machine learning variational auto-encoder group sampling

Received: 24 October 2023 Published: 26 April 2024

CLC:

TP 393

Fund: 国家自然科学基金国际（地区）合作与交流项目（62061136006）；国家自然科学基金重点资助项目（61832004）.

Corresponding Authors: Jun FANG E-mail: 1091232176@qq.com;fangjun@ncut.edu.cn

	Service
	E-mail this article
	Add to my bookshelf
	Add to citation manager
	E-mail Alert
	RSS
	Articles by authors
	Longsen HUANG
	Jun FANG
	Yunliang ZHOU
	Zhicheng GUO

Cite this article:

Longsen HUANG,Jun FANG,Yunliang ZHOU,Zhicheng GUO. Optimization method of approximate aggregate query based on variational auto-encoder. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 931-940.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.05.006 OR https://www.zjujournals.com/eng/Y2024/V58/I5/931

基于变分自编码器的近似聚合查询优化方法

针对偏态数据分布不平衡，传统近似聚合查询方法难以抽样生成偏态分布数据的问题，提出基于优化的变分自编码器的近似聚合查询方法，研究近似聚合查询方法对偏态分布数据的近似聚合查询准确率的影响. 在预处理阶段对偏态分布数据进行分层分组，对变分自编码器生成模型的网络结构和损失函数进行优化，降低近似聚合查询相对误差. 实验结果表明，与基准方法相比，近似聚合查询对偏态分布数据的查询相对误差更小，且随着偏态系数的提高，查询相对误差的上升趋势更平缓.

关键词： 近似查询处理, 偏态分布, 机器学习, 变分自编码器, 分组抽样

Fig.1 Comparison of left-skewed and right-skewed distribution with normal distribution

Tab.1 Example of PM2.5 dataset

Tab.2 Relative errors of bias coefficients for different data under different models

Fig.2 Flowchart of GVAE method

Tab.3 Comparison of subgroup bases

Tab.4 Comparison of sample distribution for interval grouping　　　　　

Fig.3 Neural network structure of variational auto-encoder

Fig.4 CDF for different datasets

Fig.5 Relative errors for queries with different bias coefficients

Fig.6 Ablation experiment with different dataset and skewed coefficient

Fig.7 Model comparison in different dimensions

Fig.8 Relative error of different sampling rate


[1]	CHAUDURI S, DING B, KANDULA S. Approximate query processing: no silver bullet [C]// Proceedings of the 2017 ACM International Conference on Management of Data . New York: ACM, 2017: 511-519.

[2]	ZHANG Meifan, WANG Hongzhi LAQP: learning-based approximate query processing[J]. Information Sciences, 2021, 546: 1113- 1134 doi: 10.1016/j.ins.2020.09.070

[3]	LI Kaiyu, LI Guoliang. Approximate query processing: what is new and where to go? a survey on approximate query processing [J]. Data Science and Engineering , 2018: 379-397.

[4]	SANCA V, AILAMAKI A. Sampling-based AQP in modern analytical engines [C]// Proceedings of the 18th International Workshop on Data Management on New Hardware . New York: ACM, 2022: 1-8.

[5]	MA Qingzhi, TRIANTAFILLOU P. Dbest: revisiting approximate query processing engines with machine learning models [C]// Proceedings of the 2019 International Conference on Management of Data . New York: ACM, 2019: 1553-1570.

[6]	MA Qingzhi, SHANGHOOSHABAD A M, ALMASI M, et al. Learned approximate query processing: make it light, accurate and fast [C]// Conference on Innovative Data Systems. New York: ACM, 2021.

[7]	HILPRECHT B, SCHMIDT A, KULESSA M, et al DeepDB: learn from data, not from queries![J]. Proceedings of the VLDB Endowment, 2020, 13 (7): 992- 1005 doi: 10.14778/3384345.3384349

[8]	LEE T, PARK C S, NAM K, et al. Query transformation for approximate query processing using synthetic data from deep generative models [C]// IEEE International Conference on Consumer Electronics-Asia . Yeosu: IEEE, 2022: 1-4.

[9]	SHEORAN N, MITRA S, PORWAL V, et al. Conditional generative model based predicate-aware query approximation [C]// Proceedings of the AAAI Conference on Artificial Intelligence . Palo Alto: AAAI, 2022, 36(8): 8259-8266.

[10]	OLKEN F, ROTEM D. Simple random sampling from relational databases [C]// Proceedings of the 12th International Conference on Very Large Data Bases . San Francisco: Morgan Kaufmann Publishers Inc, 1986: 160-169.

[11]	PENG Jinglin, ZHANG Dongxiang, WANG Jiannan, et al. Aqp++ connecting approximate query processing with aggregate precomputation for interactive analytics [C]// Proceedings of the 2018 International Conference on Management of Data . New York: ACM, 2018: 1477-1492.

[12]	HASAN S, THIRUMURUGANATHAN S, AUGUSTINE J, et al. Deep learning models for selectivity estimation of multi-attribute queries [C]// Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data . New York: ACM, 2020: 1035-1050.

[13]	LI Feifei, WU Bin, YI Ke, et al. Wander join: online aggregation via random walks [C]// Proceedings of the 2016 International Conference on Management of Data . New York: ACM, 2016: 615-629.

[14]	CHAUDHURI S, DAYAL U Data warehousing and OLAP for decision support[J]. ACM Sigmod Record, 1997, 26 (2): 507- 508 doi: 10.1145/253262.253373

[15]	AGARWAL S, MOZAFARI B, PANDA A, et al. BlinkDB: queries with bounded errors and bounded response times on very large data [C]// Proceedings of the 8th ACM European conference on computer systems . New York: ACM, 2013: 29-42.

[16]	CORMODE G, GAROFALAKIS M, HAAS P J, et al Synopses for massive data: samples, histograms, wavelets, sketches[J]. Foundations and Trends in Databases, 2011, 4 (1-3): 1- 294 doi: 10.1561/1900000004

[17]	LI Kaiyu, ZHANG Yong, LI Guoliang, et al. Bounded approximate query processing [J]. IEEE Transactions on Knowledge and Data Engineering , 2019, 31(12): 2262-2276.

[18]	LEE T, NAM K, PARK C S, et al. Exploiting machine learning models for approximate query processing [C]// IEEE International Conference on Big Data . Osaka: IEEE, 2022: 6752-6754.

[19]	KINGMA D P, WELLING M. Auto-encoding variational Bayes [EB/OL]. (2013-12-20). https://doi.org/10.48550/arXiv.1312.6114.

[20]	GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [EB/OL].[2023-10-14]. https://arxiv.org/abs/1406.2661.

[21]	THIRUMURUGANATHAN S, HASAN S, KOUDAS N, et al. Approximate query processing for data exploration using deep generative models [C]// IEEE 36th International Conference on Data Engineering . Los Alamitos: IEEE, 2020: 1309-1320.

[22]	ZHANG Meifang, WANG Hongzhi. Approximate query processing for group-by queries based on conditional generative models [EB/OL]. (2021-01-08). https://doi.org/10.48550/arXiv.2101.02914.

[23]	SEGLEN P O The skewness of science[J]. Journal of the American Society for Information Science, 1992, 43 (9): 628- 638 doi: 10.1002/(SICI)1097-4571(199210)43:9<628::AID-ASI5>3.0.CO;2-0

[24]	TSOUMAKAS G, KATAKIS I Multi-label classification: an overview[J]. International Journal of Data Warehousing and Mining, 2007, 3 (3): 1- 13 doi: 10.4018/jdwm.2007070101

[25]	FU Hao, LI Chunyuan, LIU Xiaodong, et al. Cyclical annealing schedule: a simple approach to mitigating KL vanishing [C]// Proceedings of NAACL-HLT . Stroudsburg: ACL, 2019: 240-250.

[1]	Su LI,Ze CHEN,Baoyan SONG,Haolin ZHANG. Enterprise composite blockchain construction method for business environment evaluation[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 891-899.

[2]	Yi-cong GAO,Yan-kun WANG,Shao-mei FEI,Qiong LIN. Intelligent proofreading method of engineering drawing based on transfer learning[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(5): 856-863, 889.

[3]	Wen-chao BAI,Xi-xian HAN,Jin-bao WANG. Efficient approximate query processing framework based on conditional generative model[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(5): 995-1005.

[4]	Peng ZHANG,Zi-du TIAN,Hao WANG. Flight parameter data anomaly detection method based on improved generative adversarial network[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(10): 1967-1976.

[5]	Fa-ming HUANG,Li-han PAN,Chi YAO,Chuang-bing ZHOU,Qing-hui JIANG,Zhi-lu CHANG. Landslide susceptibility prediction modelling based on semi-supervised machine learning[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(9): 1705-1713.

[6]	Jia-hao REN,Hai-ou WANG,Jiang-kuan XING,Kun LUO,Jian-ren FAN. Lower-dimensional approximation models of tangential strain rate of turbulent flames[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(6): 1128-1134.

[7]	You ZHAN,Qiang LI,Xiao-tian MA,Chen-ping WANG,Yan-jun QIU. Macro and micro texture based prediction of pavement surface friction[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(4): 684-694.

[8]	Yong YU,Jing-yuan XUE,Sheng DAI,Qiang-wei BAO,Gang ZHAO. Quality prediction and process parameter optimization method for machining parts[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(3): 441-447.

[9]	Qiao-hong CHEN,YI CHEN,Wen-shu Li,Yu-bo JIA. Clothing image classification based on multi-scale SE-Xception[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(9): 1727-1735.

[10]	Hui-fang WANG,Chen-yu ZHANG. Prediction of voltage stability margin in power system based on extreme gradient boosting algorithm[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(3): 606-613.

[11]	Le XIE,Xi-dan HENG,Yang LIU,Qi-long JIANG,Dong LIU. Transformer fault diagnosis based on linear discriminant analysis and step-by-step machine learning[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(11): 2266-2272.

[12]	Zhi-yuan WAN,Jia-heng TAO,Jia-kun LIANG,Zhen-gong CAI,Cheng CHANG,Lin QIAO,Qiao-ni ZHOU. Large-scale empirical study on machine learning related questions on Stack Overflow[J]. Journal of ZheJiang University (Engineering Science), 2019, 53(5): 819-828.

[13]	Dong-xiang KE,Li-min PAN,Sen-lin LUO,Han-qing ZHANG. Android malicious behavior recognition and classification method based on random forest algorithm[J]. Journal of ZheJiang University (Engineering Science), 2019, 53(10): 2013-2023.

[14]	HU Li-sha, WANG Su-zhen, CHEN Yi-qiang, GAO Chen-long, HU Chun-yu, JIANG Xin-long, CHEN Zhen-yu, GAO Xing-yu. Fall detection algorithms based on wearable device: a review[J]. Journal of ZheJiang University (Engineering Science), 2018, 52(9): 1717-1728.

[15]	WANG Hong-kai, CHEN Zhong-hua, ZHOU Zong-wei, LI Ying-ci, LU Pei-ou, WANG Wen-zhi, LIU Wan-yu, YU Li-juan. Evaluation of machine learning classifiers for diagnosing mediastinal lymph node metastasis of lung cancer from PET/CT images[J]. Journal of ZheJiang University (Engineering Science), 2018, 52(4): 788-797.

Viewed

Full text

Abstract

Cited

Shared

Discussed