Please wait a minute...
浙江大学学报(工学版)  2024, Vol. 58 Issue (5): 931-940    DOI: 10.3785/j.issn.1008-973X.2024.05.006
计算机技术、通信技术     
基于变分自编码器的近似聚合查询优化方法
黄龙森1,2(),房俊1,2,*(),周云亮1,2,郭志城1,2
1. 北方工业大学 信息学院,北京 100144
2. 北方工业大学 大规模流数据集成与分析技术北京市重点实验室,北京 100144
Optimization method of approximate aggregate query based on variational auto-encoder
Longsen HUANG1,2(),Jun FANG1,2,*(),Yunliang ZHOU1,2,Zhicheng GUO1,2
1. College of Information, North China University of Technology, Beijing 100144, China
2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, Beijing 100144, China
 全文: PDF(1290 KB)   HTML
摘要:

针对偏态数据分布不平衡,传统近似聚合查询方法难以抽样生成偏态分布数据的问题,提出基于优化的变分自编码器的近似聚合查询方法,研究近似聚合查询方法对偏态分布数据的近似聚合查询准确率的影响. 在预处理阶段对偏态分布数据进行分层分组,对变分自编码器生成模型的网络结构和损失函数进行优化,降低近似聚合查询相对误差. 实验结果表明,与基准方法相比,近似聚合查询对偏态分布数据的查询相对误差更小,且随着偏态系数的提高,查询相对误差的上升趋势更平缓.

关键词: 近似查询处理偏态分布机器学习变分自编码器分组抽样    
Abstract:

An optimized variational self-encoder-based approximate aggregation query method was proposed for the problem of imbalanced distribution of biased data, which makes it difficult to sample biased distribution data with traditional approximate aggregation query methods. The effect of approximate aggregation query method on the accuracy of approximate aggregation query for biased distribution data was analyzed. The bias-distributed data were hierarchically grouped in the preprocessing stage, and the network structure and loss function of the variational self-encoder generation model were optimized to reduce the approximate aggregated query relative error. The experimental results show that the query relative error of the approximate aggregation query is smaller for skewness distribution data compared with the benchmark method, and the rising trend of the query relative error is smoother as the skewness coefficient increases.

Key words: approximate query processing    skewness distribution    machine learning    variational auto-encoder    group sampling
收稿日期: 2023-10-24 出版日期: 2024-04-26
CLC:  TP 393  
基金资助: 国家自然科学基金国际(地区)合作与交流项目(62061136006);国家自然科学基金重点资助项目(61832004).
通讯作者: 房俊     E-mail: 1091232176@qq.com;fangjun@ncut.edu.cn
作者简介: 黄龙森(2000—),男,硕士生,从事大数据研究. orcid.org/0009-0004-2110-1721.E-mail:1091232176@qq.com
服务  
把本文推荐给朋友
加入引用管理器
E-mail Alert
作者相关文章  
黄龙森
房俊
周云亮
郭志城

引用本文:

黄龙森,房俊,周云亮,郭志城. 基于变分自编码器的近似聚合查询优化方法[J]. 浙江大学学报(工学版), 2024, 58(5): 931-940.

Longsen HUANG,Jun FANG,Yunliang ZHOU,Zhicheng GUO. Optimization method of approximate aggregate query based on variational auto-encoder. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 931-940.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.05.006        https://www.zjujournals.com/eng/CN/Y2024/V58/I5/931

图 1  左偏分布、右偏分布与正态分布的比较
年份月份ρB/(μg·m?3)CBWDIWS/(m·s?1)
20101129NE1.79
2011562CV0.89
20127126SE20.57
表 1  PM2.5数据集的示例
FieldSKRq/%
VAE-AQPCWGAN
Year02.903.38
Month?0.0093.803.91
CBWD?0.1144.015.27
PM251.8026.387.59
表 2  不同模型下不同数据偏态系数的相对误差
图 2  GVAE方法的流程图
算法 1 分层分组算法
输入: 已编码和标准化的数据$ {\mathrm{Data}} $,数据量$ N $,聚合查询语句$ Q $,有$ K $个层,每层$ i $大小为$ {N}_{i} $,分组基数为$ n $输出: 分组后的数据$ \mathrm{S}\mathrm{a}\mathrm{m}\mathrm{p}\mathrm{l}\mathrm{e} $ 1)计算每层 $ i $ 占总体的比例 $ {f}_{i}={N}_{{i}}/N $. 2)对于每个层$ i $,计算出需要从该层抽取的样本数量 $ {n}_{i}=\lfloor {f}_{i} n\rfloor $. 3)计算分组数$ {\mathrm{Group}}={\mathrm{div}}\;(N,n) $ . 4)记录已经被抽取的样本数量 $ {m}_{\mathrm{g}}=0 $. 5)for each $ g $ in ${\mathrm{ Group }}$ do6) for each layer $ i $ do7)  如果$ {n}_{i}\le $ 0,则跳过该层i. 8)  如果$ {m}_{{\mathrm{g}}} > n $,则跳出循环. 9)  for each n in layer i in random order do10) 如果已经从该层中抽取了$ {n}_{i} $个样本,$ {{\mathrm{Sample}}}_{g} $ = $ {n}_{i} $,跳出循环. 11)以相等概率从该层中抽取一个样本,并将其记录下来. 12) $ {m}_{{\mathrm{g}}} $增加1. 13) end for14) end for15)end for
  
Ng$ {T}_{\mathrm{a}\mathrm{v}\mathrm{g}} $/sAcc/%
10013.2696.20
100016.1398.10
10000124.0298.24
表 3  分组基数的比较
分组方式数据取值meanStdKL
区间分组总体?0.006 50.961 4$ 6.599\times {10}^{-5} $
区间分组总体?0.004 10.976 2$ 6.218\times {10}^{-5} $
区间分组单列?0.021 21.002 5$ 2.751\times {10}^{-4} $
区间分组单列0.010 70.996 9$ 2.616\times {10}^{-4} $
分层分组总体0.003 60.993 0$ 4.957\times {10}^{-7} $
分层分组总体0.002 90.993 3$ 4.923\times {10}^{-7} $
分层分组单列0.001 10.999 80
分层分组单列0.001 10.999 80
表 4  区间分组样本分布的比较
图 3  变分自编码器的神经网络结构
算法2 生成模型近似查询
输入: GVAE模型库$ \mathrm{M}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}\mathrm{s} $、聚合查询语句$ {\mathrm{Aggr}} $、初始采样比例P、数据库$ D $. 输出: 聚合查询结果$ R $和查询误差$ {R}_{\mathrm{A}\mathrm{g}\mathrm{g}\mathrm{r}} $1)聚合查询字段$ G $$ {\mathrm{Aggr}} $2)$ {\mathrm{model}} $$ G,\mathrm{M}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}\mathrm{s} $3)$ D $$ {\mathrm{model}},P $4)$ \theta ,\tilde \theta $$ {\mathrm{Aggr}},D $5)$ R $ $ \theta = \left\{ \begin{gathered} \theta = \{ {\theta _1},{\theta _2}, \cdots ,{\theta _n}\} \\ \tilde \theta = \{ {{\tilde \theta }_1},{{\tilde \theta }_2}, \cdots ,{{\tilde \theta }_m}\} \\ \end{gathered} \right.,m \leqslant n $6)$ {R_q} = \dfrac{1}{n}\left( {\displaystyle \sum\nolimits_{i = 1}^m {\left| {\frac{{{\theta _i} - {{\tilde \theta }_i}}}{{{\theta _i}}}} \right|} +(n - m)} \right) $
  
图 4  不同数据集的CDF
图 5  不同偏态系数查询的相对误差
图 6  不同数据集和偏态系数的消融实验
图 7  不同维度下的模型比较
图 8  不同采样比例的相对误差
1 CHAUDURI S, DING B, KANDULA S. Approximate query processing: no silver bullet [C]// Proceedings of the 2017 ACM International Conference on Management of Data . New York: ACM, 2017: 511-519.
2 ZHANG Meifan, WANG Hongzhi LAQP: learning-based approximate query processing[J]. Information Sciences, 2021, 546: 1113- 1134
doi: 10.1016/j.ins.2020.09.070
3 LI Kaiyu, LI Guoliang. Approximate query processing: what is new and where to go? a survey on approximate query processing [J]. Data Science and Engineering , 2018: 379-397.
4 SANCA V, AILAMAKI A. Sampling-based AQP in modern analytical engines [C]// Proceedings of the 18th International Workshop on Data Management on New Hardware . New York: ACM, 2022: 1-8.
5 MA Qingzhi, TRIANTAFILLOU P. Dbest: revisiting approximate query processing engines with machine learning models [C]// Proceedings of the 2019 International Conference on Management of Data . New York: ACM, 2019: 1553-1570.
6 MA Qingzhi, SHANGHOOSHABAD A M, ALMASI M, et al. Learned approximate query processing: make it light, accurate and fast [C]// Conference on Innovative Data Systems. New York: ACM, 2021.
7 HILPRECHT B, SCHMIDT A, KULESSA M, et al DeepDB: learn from data, not from queries![J]. Proceedings of the VLDB Endowment, 2020, 13 (7): 992- 1005
doi: 10.14778/3384345.3384349
8 LEE T, PARK C S, NAM K, et al. Query transformation for approximate query processing using synthetic data from deep generative models [C]// IEEE International Conference on Consumer Electronics-Asia . Yeosu: IEEE, 2022: 1-4.
9 SHEORAN N, MITRA S, PORWAL V, et al. Conditional generative model based predicate-aware query approximation [C]// Proceedings of the AAAI Conference on Artificial Intelligence . Palo Alto: AAAI, 2022, 36(8): 8259-8266.
10 OLKEN F, ROTEM D. Simple random sampling from relational databases [C]// Proceedings of the 12th International Conference on Very Large Data Bases . San Francisco: Morgan Kaufmann Publishers Inc, 1986: 160-169.
11 PENG Jinglin, ZHANG Dongxiang, WANG Jiannan, et al. Aqp++ connecting approximate query processing with aggregate precomputation for interactive analytics [C]// Proceedings of the 2018 International Conference on Management of Data . New York: ACM, 2018: 1477-1492.
12 HASAN S, THIRUMURUGANATHAN S, AUGUSTINE J, et al. Deep learning models for selectivity estimation of multi-attribute queries [C]// Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data . New York: ACM, 2020: 1035-1050.
13 LI Feifei, WU Bin, YI Ke, et al. Wander join: online aggregation via random walks [C]// Proceedings of the 2016 International Conference on Management of Data . New York: ACM, 2016: 615-629.
14 CHAUDHURI S, DAYAL U Data warehousing and OLAP for decision support[J]. ACM Sigmod Record, 1997, 26 (2): 507- 508
doi: 10.1145/253262.253373
15 AGARWAL S, MOZAFARI B, PANDA A, et al. BlinkDB: queries with bounded errors and bounded response times on very large data [C]// Proceedings of the 8th ACM European conference on computer systems . New York: ACM, 2013: 29-42.
16 CORMODE G, GAROFALAKIS M, HAAS P J, et al Synopses for massive data: samples, histograms, wavelets, sketches[J]. Foundations and Trends in Databases, 2011, 4 (1-3): 1- 294
doi: 10.1561/1900000004
17 LI Kaiyu, ZHANG Yong, LI Guoliang, et al. Bounded approximate query processing [J]. IEEE Transactions on Knowledge and Data Engineering , 2019, 31(12): 2262-2276.
18 LEE T, NAM K, PARK C S, et al. Exploiting machine learning models for approximate query processing [C]// IEEE International Conference on Big Data . Osaka: IEEE, 2022: 6752-6754.
19 KINGMA D P, WELLING M. Auto-encoding variational Bayes [EB/OL]. (2013-12-20). https://doi.org/10.48550/arXiv.1312.6114.
20 GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [EB/OL].[2023-10-14]. https://arxiv.org/abs/1406.2661.
21 THIRUMURUGANATHAN S, HASAN S, KOUDAS N, et al. Approximate query processing for data exploration using deep generative models [C]// IEEE 36th International Conference on Data Engineering . Los Alamitos: IEEE, 2020: 1309-1320.
22 ZHANG Meifang, WANG Hongzhi. Approximate query processing for group-by queries based on conditional generative models [EB/OL]. (2021-01-08). https://doi.org/10.48550/arXiv.2101.02914.
23 SEGLEN P O The skewness of science[J]. Journal of the American Society for Information Science, 1992, 43 (9): 628- 638
doi: 10.1002/(SICI)1097-4571(199210)43:9<628::AID-ASI5>3.0.CO;2-0
24 TSOUMAKAS G, KATAKIS I Multi-label classification: an overview[J]. International Journal of Data Warehousing and Mining, 2007, 3 (3): 1- 13
doi: 10.4018/jdwm.2007070101
25 FU Hao, LI Chunyuan, LIU Xiaodong, et al. Cyclical annealing schedule: a simple approach to mitigating KL vanishing [C]// Proceedings of NAACL-HLT . Stroudsburg: ACL, 2019: 240-250.
[1] 李素,陈泽,宋宝燕,张浩林. 营商环境评估的企业级复合区块链构建方法[J]. 浙江大学学报(工学版), 2024, 58(5): 891-899.
[2] 高一聪,王彦坤,费少梅,林琼. 基于迁移学习的机械制图智能评阅方法[J]. 浙江大学学报(工学版), 2022, 56(5): 856-863, 889.
[3] 白文超,韩希先,王金宝. 基于条件生成模型的高效近似查询处理框架[J]. 浙江大学学报(工学版), 2022, 56(5): 995-1005.
[4] 张鹏,田子都,王浩. 基于改进生成对抗网络的飞参数据异常检测方法[J]. 浙江大学学报(工学版), 2022, 56(10): 1967-1976.
[5] 黄发明,潘李含,姚池,周创兵,姜清辉,常志璐. 基于半监督机器学习的滑坡易发性预测建模[J]. 浙江大学学报(工学版), 2021, 55(9): 1705-1713.
[6] 任嘉豪,王海鸥,邢江宽,罗坤,樊建人. 湍流火焰切向应变率的低维近似模型[J]. 浙江大学学报(工学版), 2021, 55(6): 1128-1134.
[7] 战友,李强,马啸天,王郴平,邱延峻. 基于宏微观纹理特征融合的路面摩擦性能预测[J]. 浙江大学学报(工学版), 2021, 55(4): 684-694.
[8] 于勇,薛静远,戴晟,鲍强伟,赵罡. 机加零件质量预测与工艺参数优化方法[J]. 浙江大学学报(工学版), 2021, 55(3): 441-447.
[9] 陈巧红,陈翊,李文书,贾宇波. 多尺度SE-Xception服装图像分类[J]. 浙江大学学报(工学版), 2020, 54(9): 1727-1735.
[10] 王慧芳,张晨宇. 采用极限梯度提升算法的电力系统电压稳定裕度预测[J]. 浙江大学学报(工学版), 2020, 54(3): 606-613.
[11] 谢乐,衡熙丹,刘洋,蒋启龙,刘东. 基于线性判别分析和分步机器学习的变压器故障诊断[J]. 浙江大学学报(工学版), 2020, 54(11): 2266-2272.
[12] 万志远,陶嘉恒,梁家坤,才振功,苌程,乔林,周巧妮. Stack Overflow上机器学习相关问题的大规模实证研究[J]. 浙江大学学报(工学版), 2019, 53(5): 819-828.
[13] 柯懂湘,潘丽敏,罗森林,张寒青. 基于随机森林算法的Android恶意行为识别与分类方法[J]. 浙江大学学报(工学版), 2019, 53(10): 2013-2023.
[14] 忽丽莎, 王素贞, 陈益强, 高晨龙, 胡春雨, 蒋鑫龙, 陈振宇, 高兴宇. 基于可穿戴设备的跌倒检测算法综述[J]. 浙江大学学报(工学版), 2018, 52(9): 1717-1728.
[15] 王洪凯, 陈中华, 周纵苇, 李迎辞, 陆佩欧, 王文志, 刘宛予, 于丽娟. 机器学习算法诊断PET/CT纵膈淋巴结性能评估[J]. 浙江大学学报(工学版), 2018, 52(4): 788-797.