Please wait a minute...
Journal of Zhejiang University (Science Edition)  2024, Vol. 51 Issue (2): 153-161    DOI: 10.3785/j.issn.1008-9497.2024.02.003
Geographic Information Science     
Develop spatial learning indexing using improved K-means clustering partition
Chenhua FU1,2(),Feng ZHANG1,2(),Linshu HU1,2,Lijun WANG1,2
1.Earth Science College,Zhejiang University,Hangzhou 310058,China
2.Zhejiang Provincial Key Laboratory of Geographic Information Science,Zhejiang University,Hangzhou 310058,China
Download: HTML( 5 )   PDF(3277KB)
Export: BibTeX | EndNote (RIS)      

Abstract  

With the rapid increase of data size, the defects of traditional spatial indexing become more and more apparent. In comparison, learning indexing is based on data distribution. Its volume will not expand with the increase of the amount of data, and can achieve better performance without performing hierarchical comparison. Nevertheless, there are still two difficulties in applying the idea of learning indexing to spatial data: (1) How to choose appropriate dimension reduction method to sort the spatial data. (2) How to simplify data distribution of the dimension reduced data and make it easy to fit. This paper proposes a new type of grid mixed cluster partition learning indexing (grid-ml) based on the idea of learning indexing. In view of the above two difficulties, grid-ml uses z curve to reduce the dimension, and deals with the jumping problem with double-layer grid structure. Then, the improved K-means clustering method is used to simplify data distribution. The results show that grid-ml builds fast with small spatial storage volume, and can query fast as well, demonstrating significant advantages over the traditional spatial indexing approach.



Key wordslearned index      K-means clustering      space filling curve      spatial index     
Received: 02 November 2022      Published: 08 March 2024
CLC:  P 208  
Corresponding Authors: Feng ZHANG     E-mail: 22038030@zju.edu.cn;zfcarnation@zju.edu.cn
Cite this article:

Chenhua FU,Feng ZHANG,Linshu HU,Lijun WANG. Develop spatial learning indexing using improved K-means clustering partition. Journal of Zhejiang University (Science Edition), 2024, 51(2): 153-161.

URL:

https://www.zjujournals.com/sci/EN/Y2024/V51/I2/153


基于改进的K-means聚类分区均匀化空间学习索引

传统空间索引的体量随数据量的增加而膨胀,查询效率较低。学习索引的体量不随数据量的增加而膨胀,同时避免了层级比较查询,性能优异。将学习索引应用于空间索引存在2个难点:一是选取合适的降维方法实现空间数据的排序;二是对降维后数据序列进行有效的简化分布计算,使其易于拟合。基于此,提出了一种网格混合聚类分区学习索引(grid-ml),用z曲线进行降维,用双层网格结构优化查询策略,用改进的K-means聚类算法进行数据分区,实现数据分布均匀化。对比实验发现,grid-ml构建速度快、存储空间小、查询效率高,较传统空间索引优势显著。


关键词: 学习索引,  K-means聚类,  空间填充曲线,  空间索引 
Fig.1 The potential uniform distribution of spatial point sets
Fig.2 Schematic illustration of double layer grid optimizationNote The red dotted box represents the range query box. The grid area through which the red z curve passes is the spatial area requiring secondary filtering.
Fig.3 z-curve structure and range query area schematics.The red dotted line box represents the query boxNote The area that the red z curve passes through is the area that needs to be queried twice. Pa is the minimum point of latitude and longitude, Pb is the maximum point of latitude and longitude.
Fig.4 Structure of grid-ml
Fig.5 Point query process of grid-ml
Fig.6 Implement range query on uniform gridNote The red dotted box indicates the query box, and the blue dotted box indicates the range that does not require secondary filtering. Pa is the minimum point of latitude and longitude, Pb is the maximum point of latitude and longitude.
Fig.7 Spatial distribution of USA post office data set
scale查询框面积占总覆盖面积的比例
0.320.102 4
0.160.025 6
0.080.006 4
0.040.001 6
0.020.000 4
0.010.000 1
0.0050.000 025
0.002 50.000 006 25
Table 1 The ratio of scale to covered area
Fig.8 Comparison of storage space and point guery efficiency in different data set
Fig.9 Comparison of query efficiency in different data set scopes
[1]   FINKEL R A, BENTLEY J L. Quad trees: A data structure for retrieval on composite keys[J]. Acta Informatica, 1974, 4(1): 1-9. DOI: 10.1007/bf00288933
doi: 10.1007/bf00288933
[2]   BENTLEY J L. Multidimensional binary search trees used for associative searching[J]. Communications of the ACM, 1975, 18(9): 509-517. DOI: 10.1145/361002.361007
doi: 10.1145/361002.361007
[3]   GUTTMAN A. R-trees: A dynamic index structure for spatial searching[C]// Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery, 1984: 47-57. DOI: 10.1145/971697.602266
doi: 10.1145/971697.602266
[4]   张洲, 金培权, 谢希科. 学习索引:现状与研究展望[J]. 软件学报, 2021, 32(4): 1129-1150. DOI: 10.13328/j.cnki.jos.006168
ZHANG Z, JIN P Q, XIE X K. Learning index: A review[J]. Journal of Software, 2021, 32(4): 1129-1150. DOI: 10.13328/j.cnki.jos.006168
doi: 10.13328/j.cnki.jos.006168
[5]   KRASKA T, BEUTEL A, CHI E H, et al. The case for learned index structures[C]// Proceedings of the 2018 International Conference on Management of Data. New York: Association for Computing Machinery, 2018: 489-504. DOI: 10.1145/3183713. 3196909
doi: 10.1145/3183713. 3196909
[6]   WANG H X, FU X, XU J, et al. Learned index for spatial queries[C]// Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM). Hongkong: IEEE, 2019: 569-574. DOI: 10.1109/mdm.2019.00121
doi: 10.1109/mdm.2019.00121
[7]   杨国平, 乔少杰, 屈露露, 等. 学习型数据库索引推荐技术综述[J]. 重庆理工大学学报(自然科学), 2022, 36(6):189-199. DOI: 10.3969/j.issn.1674-8425(z).2022.06.023
YANG G P, QIAO S J, QU L L, et al. Review of index recommendation technology for learning database [J]. Journal of Chongqing University of Technology (Natural Science), 2022, 36(6):189-199. DOI: 10.3969/j.issn.1674-8425(z).2022.06.023
doi: 10.3969/j.issn.1674-8425(z).2022.06.023
[8]   LI X, LI J, WANG X. ASLM: Adaptive single layer model for learned index[C]// Proceedings of the International Conference on Database Systems for Advanced Applications. Chiang Myron: Springer International Publishing, 2019: 80-95. DOI: 10.1007/978-3-030-18590-9_6
doi: 10.1007/978-3-030-18590-9_6
[9]   JAGADISH H V, OOI B C, TAN K L, et al. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search[J]. ACM Transactions on Database Systems, 2005, 30(2): 364-397. DOI: 10.1145/1071610.1071612
doi: 10.1145/1071610.1071612
[10]   DAVITKOVA A, MILCHEVSKI E, MICHEL S. The ML-index: A multidimensional, learned index for point, range, and nearest-neighbor queries[C]// Proceedings of the 23rd International Conference on Extending Databased Technology. Copenhagen: IEEE,2020: 407-410. DOI:10.5441/002/edbt. 2020.44
doi: 10.5441/002/edbt. 2020.44
[11]   QI J, LIU G, JENSEN C S, et al. Effectively learning spatial indices[J]. Proccedings of the VLDB Endowment, 2020, 13(12): 2341-2354. DOI:10. 14778/3407790.3407829
doi: 10. 14778/3407790.3407829
[12]   高远宁, 叶金标, 杨念祖, 等. 基于中间层的可扩展学习索引技术[J]. 软件学报, 2020, 31(3): 620-633. DOI: 10.13328/j.cnki.jos.005910
GAO Y N, YE J B, YANG N Z, et al. Scalable learning index based on middle layer [J]. Journal of Software, 2020, 31(3): 620-633. DOI: 10.13328/j.cnki.jos.005910
doi: 10.13328/j.cnki.jos.005910
[13]   LI P, HUA Y, ZUO P, et al. A Scalable Learned Index Scheme in Storage Systems[Z/OL]. (2019-05-08). . doi:10.14778/3489496.3489512
doi: 10.14778/3489496.3489512
[14]   LI P, LU H, ZHENG Q, et al. LISA: A learned index structure for spatial data[C]// 2020 ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery, 2020: 2119-2133. DOI: 10.1145/3318464.3389703
doi: 10.1145/3318464.3389703
[15]   MACQUEEN J. Some methods for classification and analysis of multivariate observations[C]// Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. California: University of California Press, 1967: 281-297.
[16]   WANG N, XU J. Spatial queries based on learned index[C]// The First International Conference on Spatial Data and Intelligence. Hongkong: Springer International Publishing, 2020: 245-257. DOI: 10.1007/978-3-030-69873-7_18
doi: 10.1007/978-3-030-69873-7_18
[17]   RAMSAK F, MARKL V, FENK R, et al. Integrating the UB-tree into a database system kernel[C]// Proceedings of the 26th International Conference on Very Large Data Bases. San Francisco: Morgan Kaufmann Publishers Inc, 2000: 263-272.
[18]   LEE K C, ZHENG B, LI H, et al. Approaching the skyline in Z order[C]// Proceedings of the 33rd International Conference on Very Large Data Bases. Vienna: VLDB Endowment, 2007: 279-290.
[19]   KRASKA T, ALIZADEH M, BEUTEL A, et al. SageDB: A learned database system[C]// 2019 9th Biennial Conference on Innovative Data Systems Research. Asilomar: CIDR, 2019.
[20]   MARKL V, BAYER R. Processing relational OLAP queries with UB-trees and multidimensional hierarchical clustering[C]// Proceedings on the International Workshop on Design and Management of Data. Stockholm: CEUR-WS, 2000: 28.
[21]   王千, 王成, 冯振元, 等. K-means 聚类算法研究综述[J]. 电子设计工程, 2012, 20(7): 21-24. DOI: 10.3969/j.issn.1674-6236.2012.07.008
WANG Q, WANG C, FENG Z Y, et al. A review of K-means clustering algorithm[J]. Electronic Design Engineering, 2012, 20(7): 21-24. DOI: 10.3969/j.issn.1674-6236.2012.07.008
doi: 10.3969/j.issn.1674-6236.2012.07.008
[1] Keran SUN,Yingzhi WANG,Feng ZHANG,Renyi LIU. A road traffic accident risk assessment method considering the arrival time cost[J]. Journal of Zhejiang University (Science Edition), 2024, 51(2): 143-152.
[2] Chaoyu WANG,Zhenhong DU,Yuanyuan WANG. High-resolution image semantic segmentation network combining channel interaction spatial group attention and pyramid pooling[J]. Journal of Zhejiang University (Science Edition), 2024, 51(2): 131-142.
[3] Yueyi LI,Feng ZHANG,Zhenhong DU,Renyi LIU. Distributed scheduling and storage scheme based on LSM-OCTree for spatiotemporal stream[J]. Journal of Zhejiang University (Science Edition), 2023, 50(2): 204-212.
[4] Yuwen WANG, Zhenhong DU, Zhen DAI, Renyi LIU, Feng ZHANG. Multivariate water quality parameter prediction model based on hybrid neural network[J]. Journal of Zhejiang University (Science Edition), 2022, 49(3): 354-362.
[5] Qiuyi TANG,Chao WANG,Zhenhong DU,Feng ZHANG,Renyi LIU. Server-side cache replacement algorithm based on spatiotemporal aging model for tiles[J]. Journal of Zhejiang University (Science Edition), 2022, 49(2): 210-218.