Please wait a minute...
浙江大学学报(理学版)  2024, Vol. 51 Issue (2): 153-161    DOI: 10.3785/j.issn.1008-9497.2024.02.003
地理信息系统(GIS)专栏     
基于改进的K-means聚类分区均匀化空间学习索引
傅晨华1,2(),张丰1,2(),胡林舒1,2,王立君1,2
1.浙江大学 地球科学学院,浙江 杭州 310058
2.浙江大学 浙江省资源与环境信息系统重点实验室,浙江 杭州 310058
Develop spatial learning indexing using improved K-means clustering partition
Chenhua FU1,2(),Feng ZHANG1,2(),Linshu HU1,2,Lijun WANG1,2
1.Earth Science College,Zhejiang University,Hangzhou 310058,China
2.Zhejiang Provincial Key Laboratory of Geographic Information Science,Zhejiang University,Hangzhou 310058,China
 全文: PDF(3277 KB)   HTML( 5 )
摘要:

传统空间索引的体量随数据量的增加而膨胀,查询效率较低。学习索引的体量不随数据量的增加而膨胀,同时避免了层级比较查询,性能优异。将学习索引应用于空间索引存在2个难点:一是选取合适的降维方法实现空间数据的排序;二是对降维后数据序列进行有效的简化分布计算,使其易于拟合。基于此,提出了一种网格混合聚类分区学习索引(grid-ml),用z曲线进行降维,用双层网格结构优化查询策略,用改进的K-means聚类算法进行数据分区,实现数据分布均匀化。对比实验发现,grid-ml构建速度快、存储空间小、查询效率高,较传统空间索引优势显著。

关键词: 学习索引K-means聚类空间填充曲线空间索引    
Abstract:

With the rapid increase of data size, the defects of traditional spatial indexing become more and more apparent. In comparison, learning indexing is based on data distribution. Its volume will not expand with the increase of the amount of data, and can achieve better performance without performing hierarchical comparison. Nevertheless, there are still two difficulties in applying the idea of learning indexing to spatial data: (1) How to choose appropriate dimension reduction method to sort the spatial data. (2) How to simplify data distribution of the dimension reduced data and make it easy to fit. This paper proposes a new type of grid mixed cluster partition learning indexing (grid-ml) based on the idea of learning indexing. In view of the above two difficulties, grid-ml uses z curve to reduce the dimension, and deals with the jumping problem with double-layer grid structure. Then, the improved K-means clustering method is used to simplify data distribution. The results show that grid-ml builds fast with small spatial storage volume, and can query fast as well, demonstrating significant advantages over the traditional spatial indexing approach.

Key words: learned index    K-means clustering    space filling curve    spatial index
收稿日期: 2022-11-02 出版日期: 2024-03-08
CLC:  P 208  
基金资助: 国家自然科学基金资助项目(42271466);高分综合交通遥感应用示范系统(二期)(07-Y30B30-9001-19/21)
通讯作者: 张丰     E-mail: 22038030@zju.edu.cn;zfcarnation@zju.edu.cn
作者简介: 傅晨华(1998—),ORCID:https://orcid.org/0009-0002-0683-624X,女,硕士研究生,主要从事空间学习索引研究,E-mail:22038030@zju.edu.cn.
服务  
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章  
傅晨华
张丰
胡林舒
王立君

引用本文:

傅晨华,张丰,胡林舒,王立君. 基于改进的K-means聚类分区均匀化空间学习索引[J]. 浙江大学学报(理学版), 2024, 51(2): 153-161.

Chenhua FU,Feng ZHANG,Linshu HU,Lijun WANG. Develop spatial learning indexing using improved K-means clustering partition. Journal of Zhejiang University (Science Edition), 2024, 51(2): 153-161.

链接本文:

https://www.zjujournals.com/sci/CN/10.3785/j.issn.1008-9497.2024.02.003        https://www.zjujournals.com/sci/CN/Y2024/V51/I2/153

图1  空间点数据潜在的均匀分布特性
图2  双层网格优化示意注 红色虚线框代表范围查询框;红色z曲线经过的格网区域为需二次过滤的区域。
图3  z曲线结构与范围查询区域示意注 红色虚线框代表查询框,红色z曲线经过的区域为需二次查询的区域。Pa 为经纬度最小点,Pb 为经纬度最大点。
图4  grid-ml的整体结构
图5  grid-ml点查询流程
图6  对均匀网格实施范围查询注 红色虚线框表示查询框,蓝色虚线框表示不需要二次过滤的区间。Pa 为经纬度最小点,Pb 为经纬度最大点。
图7  美国邮局数据集空间分布
scale查询框面积占总覆盖面积的比例
0.320.102 4
0.160.025 6
0.080.006 4
0.040.001 6
0.020.000 4
0.010.000 1
0.0050.000 025
0.002 50.000 006 25
表1  scale的取值和对应查询框面积占比
图8  不同数据集的存储空间和点查询效率对比
图9  不同数据集的范围查询效率对比
1 FINKEL R A, BENTLEY J L. Quad trees: A data structure for retrieval on composite keys[J]. Acta Informatica, 1974, 4(1): 1-9. DOI: 10.1007/bf00288933
doi: 10.1007/bf00288933
2 BENTLEY J L. Multidimensional binary search trees used for associative searching[J]. Communications of the ACM, 1975, 18(9): 509-517. DOI: 10.1145/361002.361007
doi: 10.1145/361002.361007
3 GUTTMAN A. R-trees: A dynamic index structure for spatial searching[C]// Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery, 1984: 47-57. DOI: 10.1145/971697.602266
doi: 10.1145/971697.602266
4 张洲, 金培权, 谢希科. 学习索引:现状与研究展望[J]. 软件学报, 2021, 32(4): 1129-1150. DOI: 10.13328/j.cnki.jos.006168
ZHANG Z, JIN P Q, XIE X K. Learning index: A review[J]. Journal of Software, 2021, 32(4): 1129-1150. DOI: 10.13328/j.cnki.jos.006168
doi: 10.13328/j.cnki.jos.006168
5 KRASKA T, BEUTEL A, CHI E H, et al. The case for learned index structures[C]// Proceedings of the 2018 International Conference on Management of Data. New York: Association for Computing Machinery, 2018: 489-504. DOI: 10.1145/3183713. 3196909
doi: 10.1145/3183713. 3196909
6 WANG H X, FU X, XU J, et al. Learned index for spatial queries[C]// Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM). Hongkong: IEEE, 2019: 569-574. DOI: 10.1109/mdm.2019.00121
doi: 10.1109/mdm.2019.00121
7 杨国平, 乔少杰, 屈露露, 等. 学习型数据库索引推荐技术综述[J]. 重庆理工大学学报(自然科学), 2022, 36(6):189-199. DOI: 10.3969/j.issn.1674-8425(z).2022.06.023
YANG G P, QIAO S J, QU L L, et al. Review of index recommendation technology for learning database [J]. Journal of Chongqing University of Technology (Natural Science), 2022, 36(6):189-199. DOI: 10.3969/j.issn.1674-8425(z).2022.06.023
doi: 10.3969/j.issn.1674-8425(z).2022.06.023
8 LI X, LI J, WANG X. ASLM: Adaptive single layer model for learned index[C]// Proceedings of the International Conference on Database Systems for Advanced Applications. Chiang Myron: Springer International Publishing, 2019: 80-95. DOI: 10.1007/978-3-030-18590-9_6
doi: 10.1007/978-3-030-18590-9_6
9 JAGADISH H V, OOI B C, TAN K L, et al. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search[J]. ACM Transactions on Database Systems, 2005, 30(2): 364-397. DOI: 10.1145/1071610.1071612
doi: 10.1145/1071610.1071612
10 DAVITKOVA A, MILCHEVSKI E, MICHEL S. The ML-index: A multidimensional, learned index for point, range, and nearest-neighbor queries[C]// Proceedings of the 23rd International Conference on Extending Databased Technology. Copenhagen: IEEE,2020: 407-410. DOI:10.5441/002/edbt. 2020.44
doi: 10.5441/002/edbt. 2020.44
11 QI J, LIU G, JENSEN C S, et al. Effectively learning spatial indices[J]. Proccedings of the VLDB Endowment, 2020, 13(12): 2341-2354. DOI:10. 14778/3407790.3407829
doi: 10. 14778/3407790.3407829
12 高远宁, 叶金标, 杨念祖, 等. 基于中间层的可扩展学习索引技术[J]. 软件学报, 2020, 31(3): 620-633. DOI: 10.13328/j.cnki.jos.005910
GAO Y N, YE J B, YANG N Z, et al. Scalable learning index based on middle layer [J]. Journal of Software, 2020, 31(3): 620-633. DOI: 10.13328/j.cnki.jos.005910
doi: 10.13328/j.cnki.jos.005910
13 LI P, HUA Y, ZUO P, et al. A Scalable Learned Index Scheme in Storage Systems[Z/OL]. (2019-05-08). . doi:10.14778/3489496.3489512
doi: 10.14778/3489496.3489512
14 LI P, LU H, ZHENG Q, et al. LISA: A learned index structure for spatial data[C]// 2020 ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery, 2020: 2119-2133. DOI: 10.1145/3318464.3389703
doi: 10.1145/3318464.3389703
15 MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]// Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. California: University of California Press, 1967: 281-297.
16 WANG N, XU J. Spatial queries based on learned index[C]// The First International Conference on Spatial Data and Intelligence. Hongkong: Springer International Publishing, 2020: 245-257. DOI: 10.1007/978-3-030-69873-7_18
doi: 10.1007/978-3-030-69873-7_18
17 RAMSAK F, MARKL V, FENK R, et al. Integrating the UB-tree into a database system kernel[C]// Proceedings of the 26th International Conference on Very Large Data Bases. San Francisco: Morgan Kaufmann Publishers Inc, 2000: 263-272.
18 LEE K C, ZHENG B, LI H, et al. Approaching the skyline in Z order[C]// Proceedings of the 33rd International Conference on Very Large Data Bases. Vienna: VLDB Endowment, 2007: 279-290.
19 KRASKA T, ALIZADEH M, BEUTEL A, et al. SageDB: A learned database system[C]// 2019 9th Biennial Conference on Innovative Data Systems Research. Asilomar: CIDR, 2019.
20 MARKL V, BAYER R. Processing relational OLAP queries with UB-trees and multidimensional hierarchical clustering[C]// Proceedings on the International Workshop on Design and Management of Data. Stockholm: CEUR-WS, 2000: 28.
21 王千, 王成, 冯振元, 等. K-means 聚类算法研究综述[J]. 电子设计工程, 2012, 20(7): 21-24. DOI: 10.3969/j.issn.1674-6236.2012.07.008
WANG Q, WANG C, FENG Z Y, et al. A review of K-means clustering algorithm[J]. Electronic Design Engineering, 2012, 20(7): 21-24. DOI: 10.3969/j.issn.1674-6236.2012.07.008
doi: 10.3969/j.issn.1674-6236.2012.07.008
[1] 孙克染,王颖志,张丰,刘仁义. 考虑抵达时间成本的道路交通事故风险评估方法[J]. 浙江大学学报(理学版), 2024, 51(2): 143-152.
[2] 汪超宇,杜震洪,汪愿愿. 结合通道交互空间组注意力与金字塔池化的高分影像语义分割网络[J]. 浙江大学学报(理学版), 2024, 51(2): 131-142.
[3] 李悦艺,张丰,杜震洪,刘仁义. 基于LSM-OCTree的时空流分布式调度和存储方案[J]. 浙江大学学报(理学版), 2023, 50(2): 204-212.
[4] 王昱文, 杜震洪, 戴震, 刘仁义, 张丰. 基于复合神经网络的多元水质指标预测模型[J]. 浙江大学学报(理学版), 2022, 49(3): 354-362.
[5] 汤求毅,王超,杜震洪,张丰,刘仁义. 基于时空老化模型的服务端瓦片缓存置换算法[J]. 浙江大学学报(理学版), 2022, 49(2): 210-218.