基于变分自编码器的近似聚合查询优化方法

doi:10.3785/j.issn.1008-973X.2024.05.006

浙江大学学报(工学版)

2024, Vol. 58

Issue (5): 931-940 DOI: 10.3785/j.issn.1008-973X.2024.05.006

计算机技术、通信技术

基于变分自编码器的近似聚合查询优化方法

黄龙森1,2(

),房俊1,2,*(

),周云亮1,2,郭志城1,2

1. 北方工业大学信息学院，北京 100144
2. 北方工业大学大规模流数据集成与分析技术北京市重点实验室，北京 100144

Optimization method of approximate aggregate query based on variational auto-encoder

Longsen HUANG1,2(

),Jun FANG1,2,*(

),Yunliang ZHOU1,2,Zhicheng GUO1,2

1. College of Information, North China University of Technology, Beijing 100144, China
2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, Beijing 100144, China

全文: PDF(1290 KB) HTML

摘要：

针对偏态数据分布不平衡，传统近似聚合查询方法难以抽样生成偏态分布数据的问题，提出基于优化的变分自编码器的近似聚合查询方法，研究近似聚合查询方法对偏态分布数据的近似聚合查询准确率的影响. 在预处理阶段对偏态分布数据进行分层分组，对变分自编码器生成模型的网络结构和损失函数进行优化，降低近似聚合查询相对误差. 实验结果表明，与基准方法相比，近似聚合查询对偏态分布数据的查询相对误差更小，且随着偏态系数的提高，查询相对误差的上升趋势更平缓.

关键词： 近似查询处理; 偏态分布; 机器学习; 变分自编码器; 分组抽样

Abstract:

An optimized variational self-encoder-based approximate aggregation query method was proposed for the problem of imbalanced distribution of biased data, which makes it difficult to sample biased distribution data with traditional approximate aggregation query methods. The effect of approximate aggregation query method on the accuracy of approximate aggregation query for biased distribution data was analyzed. The bias-distributed data were hierarchically grouped in the preprocessing stage, and the network structure and loss function of the variational self-encoder generation model were optimized to reduce the approximate aggregated query relative error. The experimental results show that the query relative error of the approximate aggregation query is smaller for skewness distribution data compared with the benchmark method, and the rising trend of the query relative error is smoother as the skewness coefficient increases.

Key words: approximate query processing skewness distribution machine learning variational auto-encoder group sampling

收稿日期: 2023-10-24 出版日期: 2024-04-26

CLC:

TP 393

基金资助: 国家自然科学基金国际（地区）合作与交流项目（62061136006）；国家自然科学基金重点资助项目（61832004）.

通讯作者: 房俊 E-mail: 1091232176@qq.com;fangjun@ncut.edu.cn

作者简介: 黄龙森（2000—），男，硕士生，从事大数据研究. orcid.org/0009-0004-2110-1721.E-mail：1091232176@qq.com

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	作者相关文章
	黄龙森
	房俊
	周云亮
	郭志城

引用本文:

黄龙森,房俊,周云亮,郭志城. 基于变分自编码器的近似聚合查询优化方法[J]. 浙江大学学报(工学版), 2024, 58(5): 931-940.

Longsen HUANG,Jun FANG,Yunliang ZHOU,Zhicheng GUO. Optimization method of approximate aggregate query based on variational auto-encoder. Journal of ZheJiang University (Engineering Science), 2024, 58(5): 931-940.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.05.006 或 https://www.zjujournals.com/eng/CN/Y2024/V58/I5/931

图 1 左偏分布、右偏分布与正态分布的比较

表 1 PM2.5数据集的示例

表 2 不同模型下不同数据偏态系数的相对误差

图 2 GVAE方法的流程图

表 3 分组基数的比较

表 4 区间分组样本分布的比较

图 3 变分自编码器的神经网络结构

图 4 不同数据集的CDF

图 5 不同偏态系数查询的相对误差

图 6 不同数据集和偏态系数的消融实验

图 7 不同维度下的模型比较

图 8 不同采样比例的相对误差

1	CHAUDURI S, DING B, KANDULA S. Approximate query processing: no silver bullet [C]// Proceedings of the 2017 ACM International Conference on Management of Data . New York: ACM, 2017: 511-519.
2	ZHANG Meifan, WANG Hongzhi LAQP: learning-based approximate query processing[J]. Information Sciences, 2021, 546: 1113- 1134 doi: 10.1016/j.ins.2020.09.070
3	LI Kaiyu, LI Guoliang. Approximate query processing: what is new and where to go? a survey on approximate query processing [J]. Data Science and Engineering , 2018: 379-397.
4	SANCA V, AILAMAKI A. Sampling-based AQP in modern analytical engines [C]// Proceedings of the 18th International Workshop on Data Management on New Hardware . New York: ACM, 2022: 1-8.
5	MA Qingzhi, TRIANTAFILLOU P. Dbest: revisiting approximate query processing engines with machine learning models [C]// Proceedings of the 2019 International Conference on Management of Data . New York: ACM, 2019: 1553-1570.
6	MA Qingzhi, SHANGHOOSHABAD A M, ALMASI M, et al. Learned approximate query processing: make it light, accurate and fast [C]// Conference on Innovative Data Systems. New York: ACM, 2021.
7	HILPRECHT B, SCHMIDT A, KULESSA M, et al DeepDB: learn from data, not from queries![J]. Proceedings of the VLDB Endowment, 2020, 13 (7): 992- 1005 doi: 10.14778/3384345.3384349
8	LEE T, PARK C S, NAM K, et al. Query transformation for approximate query processing using synthetic data from deep generative models [C]// IEEE International Conference on Consumer Electronics-Asia . Yeosu: IEEE, 2022: 1-4.
9	SHEORAN N, MITRA S, PORWAL V, et al. Conditional generative model based predicate-aware query approximation [C]// Proceedings of the AAAI Conference on Artificial Intelligence . Palo Alto: AAAI, 2022, 36(8): 8259-8266.
10	OLKEN F, ROTEM D. Simple random sampling from relational databases [C]// Proceedings of the 12th International Conference on Very Large Data Bases . San Francisco: Morgan Kaufmann Publishers Inc, 1986: 160-169.
11	PENG Jinglin, ZHANG Dongxiang, WANG Jiannan, et al. Aqp++ connecting approximate query processing with aggregate precomputation for interactive analytics [C]// Proceedings of the 2018 International Conference on Management of Data . New York: ACM, 2018: 1477-1492.
12	HASAN S, THIRUMURUGANATHAN S, AUGUSTINE J, et al. Deep learning models for selectivity estimation of multi-attribute queries [C]// Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data . New York: ACM, 2020: 1035-1050.
13	LI Feifei, WU Bin, YI Ke, et al. Wander join: online aggregation via random walks [C]// Proceedings of the 2016 International Conference on Management of Data . New York: ACM, 2016: 615-629.
14	CHAUDHURI S, DAYAL U Data warehousing and OLAP for decision support[J]. ACM Sigmod Record, 1997, 26 (2): 507- 508 doi: 10.1145/253262.253373
15	AGARWAL S, MOZAFARI B, PANDA A, et al. BlinkDB: queries with bounded errors and bounded response times on very large data [C]// Proceedings of the 8th ACM European conference on computer systems . New York: ACM, 2013: 29-42.
16	CORMODE G, GAROFALAKIS M, HAAS P J, et al Synopses for massive data: samples, histograms, wavelets, sketches[J]. Foundations and Trends in Databases, 2011, 4 (1-3): 1- 294 doi: 10.1561/1900000004
17	LI Kaiyu, ZHANG Yong, LI Guoliang, et al. Bounded approximate query processing [J]. IEEE Transactions on Knowledge and Data Engineering , 2019, 31(12): 2262-2276.
18	LEE T, NAM K, PARK C S, et al. Exploiting machine learning models for approximate query processing [C]// IEEE International Conference on Big Data . Osaka: IEEE, 2022: 6752-6754.
19	KINGMA D P, WELLING M. Auto-encoding variational Bayes [EB/OL]. (2013-12-20). https://doi.org/10.48550/arXiv.1312.6114.
20	GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [EB/OL].[2023-10-14]. https://arxiv.org/abs/1406.2661.
21	THIRUMURUGANATHAN S, HASAN S, KOUDAS N, et al. Approximate query processing for data exploration using deep generative models [C]// IEEE 36th International Conference on Data Engineering . Los Alamitos: IEEE, 2020: 1309-1320.
22	ZHANG Meifang, WANG Hongzhi. Approximate query processing for group-by queries based on conditional generative models [EB/OL]. (2021-01-08). https://doi.org/10.48550/arXiv.2101.02914.
23	SEGLEN P O The skewness of science[J]. Journal of the American Society for Information Science, 1992, 43 (9): 628- 638 doi: 10.1002/(SICI)1097-4571(199210)43:9<628::AID-ASI5>3.0.CO;2-0
24	TSOUMAKAS G, KATAKIS I Multi-label classification: an overview[J]. International Journal of Data Warehousing and Mining, 2007, 3 (3): 1- 13 doi: 10.4018/jdwm.2007070101
25	FU Hao, LI Chunyuan, LIU Xiaodong, et al. Cyclical annealing schedule: a simple approach to mitigating KL vanishing [C]// Proceedings of NAACL-HLT . Stroudsburg: ACL, 2019: 240-250.

[1]	李素,陈泽,宋宝燕,张浩林. 营商环境评估的企业级复合区块链构建方法[J]. 浙江大学学报(工学版), 2024, 58(5): 891-899.
[2]	高一聪,王彦坤,费少梅,林琼. 基于迁移学习的机械制图智能评阅方法[J]. 浙江大学学报(工学版), 2022, 56(5): 856-863, 889.
[3]	白文超,韩希先,王金宝. 基于条件生成模型的高效近似查询处理框架[J]. 浙江大学学报(工学版), 2022, 56(5): 995-1005.
[4]	张鹏,田子都,王浩. 基于改进生成对抗网络的飞参数据异常检测方法[J]. 浙江大学学报(工学版), 2022, 56(10): 1967-1976.
[5]	黄发明,潘李含,姚池,周创兵,姜清辉,常志璐. 基于半监督机器学习的滑坡易发性预测建模[J]. 浙江大学学报(工学版), 2021, 55(9): 1705-1713.
[6]	任嘉豪,王海鸥,邢江宽,罗坤,樊建人. 湍流火焰切向应变率的低维近似模型[J]. 浙江大学学报(工学版), 2021, 55(6): 1128-1134.
[7]	战友,李强,马啸天,王郴平,邱延峻. 基于宏微观纹理特征融合的路面摩擦性能预测[J]. 浙江大学学报(工学版), 2021, 55(4): 684-694.
[8]	于勇,薛静远,戴晟,鲍强伟,赵罡. 机加零件质量预测与工艺参数优化方法[J]. 浙江大学学报(工学版), 2021, 55(3): 441-447.
[9]	陈巧红,陈翊,李文书,贾宇波. 多尺度SE-Xception服装图像分类[J]. 浙江大学学报(工学版), 2020, 54(9): 1727-1735.
[10]	王慧芳,张晨宇. 采用极限梯度提升算法的电力系统电压稳定裕度预测[J]. 浙江大学学报(工学版), 2020, 54(3): 606-613.
[11]	谢乐,衡熙丹,刘洋,蒋启龙,刘东. 基于线性判别分析和分步机器学习的变压器故障诊断[J]. 浙江大学学报(工学版), 2020, 54(11): 2266-2272.
[12]	万志远,陶嘉恒,梁家坤,才振功,苌程,乔林,周巧妮. Stack Overflow上机器学习相关问题的大规模实证研究[J]. 浙江大学学报(工学版), 2019, 53(5): 819-828.
[13]	柯懂湘,潘丽敏,罗森林,张寒青. 基于随机森林算法的Android恶意行为识别与分类方法[J]. 浙江大学学报(工学版), 2019, 53(10): 2013-2023.
[14]	忽丽莎, 王素贞, 陈益强, 高晨龙, 胡春雨, 蒋鑫龙, 陈振宇, 高兴宇. 基于可穿戴设备的跌倒检测算法综述[J]. 浙江大学学报(工学版), 2018, 52(9): 1717-1728.
[15]	王洪凯, 陈中华, 周纵苇, 李迎辞, 陆佩欧, 王文志, 刘宛予, 于丽娟. 机器学习算法诊断PET/CT纵膈淋巴结性能评估[J]. 浙江大学学报(工学版), 2018, 52(4): 788-797.

Viewed

Full text

Abstract

Cited

Shared

Discussed