Please wait a minute...
Journal of ZheJiang University (Engineering Science)  2024, Vol. 58 Issue (5): 931-940    DOI: 10.3785/j.issn.1008-973X.2024.05.006
    
Optimization method of approximate aggregate query based on variational auto-encoder
Longsen HUANG1,2(),Jun FANG1,2,*(),Yunliang ZHOU1,2,Zhicheng GUO1,2
1. College of Information, North China University of Technology, Beijing 100144, China
2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, Beijing 100144, China
Download: HTML     PDF(1290KB) HTML
Export: BibTeX | EndNote (RIS)      

Abstract  

An optimized variational self-encoder-based approximate aggregation query method was proposed for the problem of imbalanced distribution of biased data, which makes it difficult to sample biased distribution data with traditional approximate aggregation query methods. The effect of approximate aggregation query method on the accuracy of approximate aggregation query for biased distribution data was analyzed. The bias-distributed data were hierarchically grouped in the preprocessing stage, and the network structure and loss function of the variational self-encoder generation model were optimized to reduce the approximate aggregated query relative error. The experimental results show that the query relative error of the approximate aggregation query is smaller for skewness distribution data compared with the benchmark method, and the rising trend of the query relative error is smoother as the skewness coefficient increases.



Key wordsapproximate query processing      skewness distribution      machine learning      variational auto-encoder      group sampling     
Received: 24 October 2023      Published: 26 April 2024
CLC:  TP 393  
Fund:  国家自然科学基金国际(地区)合作与交流项目(62061136006);国家自然科学基金重点资助项目(61832004).
Corresponding Authors: Jun FANG     E-mail: 1091232176@qq.com;fangjun@ncut.edu.cn
Cite this article:

Longsen HUANG,Jun FANG,Yunliang ZHOU,Zhicheng GUO. Optimization method of approximate aggregate query based on variational auto-encoder. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 931-940.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.05.006     OR     https://www.zjujournals.com/eng/Y2024/V58/I5/931


基于变分自编码器的近似聚合查询优化方法

针对偏态数据分布不平衡,传统近似聚合查询方法难以抽样生成偏态分布数据的问题,提出基于优化的变分自编码器的近似聚合查询方法,研究近似聚合查询方法对偏态分布数据的近似聚合查询准确率的影响. 在预处理阶段对偏态分布数据进行分层分组,对变分自编码器生成模型的网络结构和损失函数进行优化,降低近似聚合查询相对误差. 实验结果表明,与基准方法相比,近似聚合查询对偏态分布数据的查询相对误差更小,且随着偏态系数的提高,查询相对误差的上升趋势更平缓.


关键词: 近似查询处理,  偏态分布,  机器学习,  变分自编码器,  分组抽样 
Fig.1 Comparison of left-skewed and right-skewed distribution with normal distribution
年份月份ρB/(μg·m?3)CBWDIWS/(m·s?1)
20101129NE1.79
2011562CV0.89
20127126SE20.57
Tab.1 Example of PM2.5 dataset
FieldSKRq/%
VAE-AQPCWGAN
Year02.903.38
Month?0.0093.803.91
CBWD?0.1144.015.27
PM251.8026.387.59
Tab.2 Relative errors of bias coefficients for different data under different models
Fig.2 Flowchart of GVAE method
算法 1 分层分组算法
输入: 已编码和标准化的数据$ {\mathrm{Data}} $,数据量$ N $,聚合查询语句$ Q $,有$ K $个层,每层$ i $大小为$ {N}_{i} $,分组基数为$ n $输出: 分组后的数据$ \mathrm{S}\mathrm{a}\mathrm{m}\mathrm{p}\mathrm{l}\mathrm{e} $ 1)计算每层 $ i $ 占总体的比例 $ {f}_{i}={N}_{{i}}/N $. 2)对于每个层$ i $,计算出需要从该层抽取的样本数量 $ {n}_{i}=\lfloor {f}_{i} n\rfloor $. 3)计算分组数$ {\mathrm{Group}}={\mathrm{div}}\;(N,n) $ . 4)记录已经被抽取的样本数量 $ {m}_{\mathrm{g}}=0 $. 5)for each $ g $ in ${\mathrm{ Group }}$ do6) for each layer $ i $ do7)  如果$ {n}_{i}\le $ 0,则跳过该层i. 8)  如果$ {m}_{{\mathrm{g}}} > n $,则跳出循环. 9)  for each n in layer i in random order do10) 如果已经从该层中抽取了$ {n}_{i} $个样本,$ {{\mathrm{Sample}}}_{g} $ = $ {n}_{i} $,跳出循环. 11)以相等概率从该层中抽取一个样本,并将其记录下来. 12) $ {m}_{{\mathrm{g}}} $增加1. 13) end for14) end for15)end for
 
Ng$ {T}_{\mathrm{a}\mathrm{v}\mathrm{g}} $/sAcc/%
10013.2696.20
100016.1398.10
10000124.0298.24
Tab.3 Comparison of subgroup bases
分组方式数据取值meanStdKL
区间分组总体?0.006 50.961 4$ 6.599\times {10}^{-5} $
区间分组总体?0.004 10.976 2$ 6.218\times {10}^{-5} $
区间分组单列?0.021 21.002 5$ 2.751\times {10}^{-4} $
区间分组单列0.010 70.996 9$ 2.616\times {10}^{-4} $
分层分组总体0.003 60.993 0$ 4.957\times {10}^{-7} $
分层分组总体0.002 90.993 3$ 4.923\times {10}^{-7} $
分层分组单列0.001 10.999 80
分层分组单列0.001 10.999 80
Tab.4 Comparison of sample distribution for interval grouping     
Fig.3 Neural network structure of variational auto-encoder
算法2 生成模型近似查询
输入: GVAE模型库$ \mathrm{M}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}\mathrm{s} $、聚合查询语句$ {\mathrm{Aggr}} $、初始采样比例P、数据库$ D $. 输出: 聚合查询结果$ R $和查询误差$ {R}_{\mathrm{A}\mathrm{g}\mathrm{g}\mathrm{r}} $1)聚合查询字段$ G $$ {\mathrm{Aggr}} $2)$ {\mathrm{model}} $$ G,\mathrm{M}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}\mathrm{s} $3)$ D $$ {\mathrm{model}},P $4)$ \theta ,\tilde \theta $$ {\mathrm{Aggr}},D $5)$ R $ $ \theta = \left\{ \begin{gathered} \theta = \{ {\theta _1},{\theta _2}, \cdots ,{\theta _n}\} \\ \tilde \theta = \{ {{\tilde \theta }_1},{{\tilde \theta }_2}, \cdots ,{{\tilde \theta }_m}\} \\ \end{gathered} \right.,m \leqslant n $6)$ {R_q} = \dfrac{1}{n}\left( {\displaystyle \sum\nolimits_{i = 1}^m {\left| {\frac{{{\theta _i} - {{\tilde \theta }_i}}}{{{\theta _i}}}} \right|} +(n - m)} \right) $
 
Fig.4 CDF for different datasets
Fig.5 Relative errors for queries with different bias coefficients
Fig.6 Ablation experiment with different dataset and skewed coefficient
Fig.7 Model comparison in different dimensions
Fig.8 Relative error of different sampling rate
[1]   CHAUDURI S, DING B, KANDULA S. Approximate query processing: no silver bullet [C]// Proceedings of the 2017 ACM International Conference on Management of Data . New York: ACM, 2017: 511-519.
[2]   ZHANG Meifan, WANG Hongzhi LAQP: learning-based approximate query processing[J]. Information Sciences, 2021, 546: 1113- 1134
doi: 10.1016/j.ins.2020.09.070
[3]   LI Kaiyu, LI Guoliang. Approximate query processing: what is new and where to go? a survey on approximate query processing [J]. Data Science and Engineering , 2018: 379-397.
[4]   SANCA V, AILAMAKI A. Sampling-based AQP in modern analytical engines [C]// Proceedings of the 18th International Workshop on Data Management on New Hardware . New York: ACM, 2022: 1-8.
[5]   MA Qingzhi, TRIANTAFILLOU P. Dbest: revisiting approximate query processing engines with machine learning models [C]// Proceedings of the 2019 International Conference on Management of Data . New York: ACM, 2019: 1553-1570.
[6]   MA Qingzhi, SHANGHOOSHABAD A M, ALMASI M, et al. Learned approximate query processing: make it light, accurate and fast [C]// Conference on Innovative Data Systems. New York: ACM, 2021.
[7]   HILPRECHT B, SCHMIDT A, KULESSA M, et al DeepDB: learn from data, not from queries![J]. Proceedings of the VLDB Endowment, 2020, 13 (7): 992- 1005
doi: 10.14778/3384345.3384349
[8]   LEE T, PARK C S, NAM K, et al. Query transformation for approximate query processing using synthetic data from deep generative models [C]// IEEE International Conference on Consumer Electronics-Asia . Yeosu: IEEE, 2022: 1-4.
[9]   SHEORAN N, MITRA S, PORWAL V, et al. Conditional generative model based predicate-aware query approximation [C]// Proceedings of the AAAI Conference on Artificial Intelligence . Palo Alto: AAAI, 2022, 36(8): 8259-8266.
[10]   OLKEN F, ROTEM D. Simple random sampling from relational databases [C]// Proceedings of the 12th International Conference on Very Large Data Bases . San Francisco: Morgan Kaufmann Publishers Inc, 1986: 160-169.
[11]   PENG Jinglin, ZHANG Dongxiang, WANG Jiannan, et al. Aqp++ connecting approximate query processing with aggregate precomputation for interactive analytics [C]// Proceedings of the 2018 International Conference on Management of Data . New York: ACM, 2018: 1477-1492.
[12]   HASAN S, THIRUMURUGANATHAN S, AUGUSTINE J, et al. Deep learning models for selectivity estimation of multi-attribute queries [C]// Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data . New York: ACM, 2020: 1035-1050.
[13]   LI Feifei, WU Bin, YI Ke, et al. Wander join: online aggregation via random walks [C]// Proceedings of the 2016 International Conference on Management of Data . New York: ACM, 2016: 615-629.
[14]   CHAUDHURI S, DAYAL U Data warehousing and OLAP for decision support[J]. ACM Sigmod Record, 1997, 26 (2): 507- 508
doi: 10.1145/253262.253373
[15]   AGARWAL S, MOZAFARI B, PANDA A, et al. BlinkDB: queries with bounded errors and bounded response times on very large data [C]// Proceedings of the 8th ACM European conference on computer systems . New York: ACM, 2013: 29-42.
[16]   CORMODE G, GAROFALAKIS M, HAAS P J, et al Synopses for massive data: samples, histograms, wavelets, sketches[J]. Foundations and Trends in Databases, 2011, 4 (1-3): 1- 294
doi: 10.1561/1900000004
[17]   LI Kaiyu, ZHANG Yong, LI Guoliang, et al. Bounded approximate query processing [J]. IEEE Transactions on Knowledge and Data Engineering , 2019, 31(12): 2262-2276.
[18]   LEE T, NAM K, PARK C S, et al. Exploiting machine learning models for approximate query processing [C]// IEEE International Conference on Big Data . Osaka: IEEE, 2022: 6752-6754.
[19]   KINGMA D P, WELLING M. Auto-encoding variational Bayes [EB/OL]. (2013-12-20). https://doi.org/10.48550/arXiv.1312.6114.
[20]   GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [EB/OL].[2023-10-14]. https://arxiv.org/abs/1406.2661.
[21]   THIRUMURUGANATHAN S, HASAN S, KOUDAS N, et al. Approximate query processing for data exploration using deep generative models [C]// IEEE 36th International Conference on Data Engineering . Los Alamitos: IEEE, 2020: 1309-1320.
[22]   ZHANG Meifang, WANG Hongzhi. Approximate query processing for group-by queries based on conditional generative models [EB/OL]. (2021-01-08). https://doi.org/10.48550/arXiv.2101.02914.
[23]   SEGLEN P O The skewness of science[J]. Journal of the American Society for Information Science, 1992, 43 (9): 628- 638
doi: 10.1002/(SICI)1097-4571(199210)43:9<628::AID-ASI5>3.0.CO;2-0
[24]   TSOUMAKAS G, KATAKIS I Multi-label classification: an overview[J]. International Journal of Data Warehousing and Mining, 2007, 3 (3): 1- 13
doi: 10.4018/jdwm.2007070101
[25]   FU Hao, LI Chunyuan, LIU Xiaodong, et al. Cyclical annealing schedule: a simple approach to mitigating KL vanishing [C]// Proceedings of NAACL-HLT . Stroudsburg: ACL, 2019: 240-250.
[1] Su LI,Ze CHEN,Baoyan SONG,Haolin ZHANG. Enterprise composite blockchain construction method for business environment evaluation[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 891-899.
[2] Yi-cong GAO,Yan-kun WANG,Shao-mei FEI,Qiong LIN. Intelligent proofreading method of engineering drawing based on transfer learning[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(5): 856-863, 889.
[3] Wen-chao BAI,Xi-xian HAN,Jin-bao WANG. Efficient approximate query processing framework based on conditional generative model[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(5): 995-1005.
[4] Peng ZHANG,Zi-du TIAN,Hao WANG. Flight parameter data anomaly detection method based on improved generative adversarial network[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(10): 1967-1976.
[5] Fa-ming HUANG,Li-han PAN,Chi YAO,Chuang-bing ZHOU,Qing-hui JIANG,Zhi-lu CHANG. Landslide susceptibility prediction modelling based on semi-supervised machine learning[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(9): 1705-1713.
[6] Jia-hao REN,Hai-ou WANG,Jiang-kuan XING,Kun LUO,Jian-ren FAN. Lower-dimensional approximation models of tangential strain rate of turbulent flames[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(6): 1128-1134.
[7] You ZHAN,Qiang LI,Xiao-tian MA,Chen-ping WANG,Yan-jun QIU. Macro and micro texture based prediction of pavement surface friction[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(4): 684-694.
[8] Yong YU,Jing-yuan XUE,Sheng DAI,Qiang-wei BAO,Gang ZHAO. Quality prediction and process parameter optimization method for machining parts[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(3): 441-447.
[9] Qiao-hong CHEN,YI CHEN,Wen-shu Li,Yu-bo JIA. Clothing image classification based on multi-scale SE-Xception[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(9): 1727-1735.
[10] Hui-fang WANG,Chen-yu ZHANG. Prediction of voltage stability margin in power system based on extreme gradient boosting algorithm[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(3): 606-613.
[11] Le XIE,Xi-dan HENG,Yang LIU,Qi-long JIANG,Dong LIU. Transformer fault diagnosis based on linear discriminant analysis and step-by-step machine learning[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(11): 2266-2272.
[12] Zhi-yuan WAN,Jia-heng TAO,Jia-kun LIANG,Zhen-gong CAI,Cheng CHANG,Lin QIAO,Qiao-ni ZHOU. Large-scale empirical study on machine learning related questions on Stack Overflow[J]. Journal of ZheJiang University (Engineering Science), 2019, 53(5): 819-828.
[13] Dong-xiang KE,Li-min PAN,Sen-lin LUO,Han-qing ZHANG. Android malicious behavior recognition and classification method based on random forest algorithm[J]. Journal of ZheJiang University (Engineering Science), 2019, 53(10): 2013-2023.
[14] HU Li-sha, WANG Su-zhen, CHEN Yi-qiang, GAO Chen-long, HU Chun-yu, JIANG Xin-long, CHEN Zhen-yu, GAO Xing-yu. Fall detection algorithms based on wearable device: a review[J]. Journal of ZheJiang University (Engineering Science), 2018, 52(9): 1717-1728.
[15] WANG Hong-kai, CHEN Zhong-hua, ZHOU Zong-wei, LI Ying-ci, LU Pei-ou, WANG Wen-zhi, LIU Wan-yu, YU Li-juan. Evaluation of machine learning classifiers for diagnosing mediastinal lymph node metastasis of lung cancer from PET/CT images[J]. Journal of ZheJiang University (Engineering Science), 2018, 52(4): 788-797.