Journal of Zhejiang University (Engineering Science)  2020, Vol. 54 Issue (9): 1768-1776    DOI: 10.3785/j.issn.1008-973X.2020.09.013
Computer Technology
Spatial vector data parallel conversion algorithm based on two-step decoding
Le-le SUN1, Bao-xuan JIN2,*
1. College of Tourism and Geography Science, Yunnan Normal University, Kunming 650500, China
2. Yunnan Provincial Department of Natural Resources, Kunming 650224, China
Abstract:

Traditional single-machine conversion tools and RangePartitioner-based parallel conversion methods suffer from poor scalability and data skew. To address this, a parallel conversion algorithm for spatial vector data (SVD) based on two-step decoding was proposed. The storage encoding schema of SVD in the geospatial database (GDB) was summarized, and an optimized geometry decoding function was built on it as the basic decoding tool. In the first decoding step, only the spatial metadata was parsed, and the parsing tasks were balanced according to geometric complexity, so that the parsing workload was better matched to the data volume. In the second decoding step, the compressed geometry bytes were extracted and parsed with a parallel geometry parsing mechanism to improve conversion efficiency. The algorithm was implemented on Apache Spark and compared with the ArcGIS single-machine conversion tool and a RangePartitioner-based parallel query conversion algorithm. The experimental results show that the proposed algorithm has significant advantages in efficiency and performance scalability: conversion efficiency improved by 2.5 to 117 times, and the data skew caused by uneven geometric complexity was greatly reduced.

Key words: geographic information system    spatial vector data (SVD)    data parallel conversion    data skew
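The first decoding step described in the abstract balances parsing tasks by geometric complexity. The page does not give the balancing rule itself, so the following is only a minimal sketch under stated assumptions: complexity is approximated by each feature's point count (as it might be read from the spatial metadata), and tasks are filled with a generic greedy "heaviest first, lightest task next" heuristic. The function name `balance_by_complexity` and the sample feature IDs are hypothetical and are not taken from the paper.

```python
import heapq
from typing import Dict, List, Tuple


def balance_by_complexity(point_counts: Dict[int, int], n_tasks: int) -> List[List[int]]:
    """Assign feature IDs to n_tasks groups with roughly equal total point counts.

    ASSUMPTION: point count stands in for geometric complexity, and the greedy
    longest-processing-time heuristic stands in for the paper's balancing rule.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    # Min-heap of (current load, task index): the lightest task is always on top.
    heap: List[Tuple[int, int]] = [(0, t) for t in range(n_tasks)]
    heapq.heapify(heap)
    tasks: List[List[int]] = [[] for _ in range(n_tasks)]
    # Place the most complex features first so small ones can fill the gaps.
    for fid, cost in sorted(point_counts.items(), key=lambda kv: kv[1], reverse=True):
        load, t = heapq.heappop(heap)
        tasks[t].append(fid)
        heapq.heappush(heap, (load + cost, t))
    return tasks


if __name__ == "__main__":
    # Hypothetical metadata: feature ID -> number of points in its geometry.
    counts = {1: 9000, 2: 100, 3: 120, 4: 4000, 5: 4500, 6: 80}
    groups = balance_by_complexity(counts, n_tasks=2)
    for i, group in enumerate(groups):
        print(i, group, sum(counts[f] for f in group))
```

Splitting by feature count alone would put three features in each task regardless of size; weighting by point count instead keeps one very complex feature from dominating a task, which is the kind of skew the abstract attributes to uneven geometric complexity.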
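Received: 2019-08-02    Published: 2020-09-22
CLC:  P 208
Funding: National Natural Science Foundation of China (41661086)
Corresponding author: Bao-xuan JIN     E-mail: sycamoresun@foxmail.com; jinbx163@163.com
About the author: Le-le SUN (born 1992), male, doctoral student, researching spatial big data computing. orcid.org/0000-0003-4855-0629. E-mail: sycamoresun@foxmail.com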

Cite this article:

Le-le SUN, Bao-xuan JIN. Spatial vector data parallel conversion algorithm based on two-step decoding[J]. Journal of Zhejiang University (Engineering Science), 2020, 54(9): 1768-1776.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2020.09.013
http://www.zjujournals.com/eng/CN/Y2020/V54/I9/1768

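Byte(s)   Content
0         Byte-count flag; the high bit is the sign bit; fixed value 4
1~4       Table-header metadata; can be skipped
5~6       Total byte count of the STRUCT, in binary
7~m       The ST_GEOMETRY encoded block; m is the last position of the byte array
Table 1  Encoding scheme of SHAPE in ArcSDE for Oracle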
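The byte layout in Table 1 can be sketched as a small splitter that skips the header and returns the ST_GEOMETRY block for later decoding. This is an illustrative sketch only: the table does not state the byte order of the two-byte length field, so big-endian unsigned is an assumption here, and the names `ShapeParts` and `split_shape` are hypothetical rather than part of ArcSDE or of the paper's implementation.

```python
import struct
from typing import NamedTuple


class ShapeParts(NamedTuple):
    flag: int              # byte 0: byte-count flag (Table 1 gives a fixed value of 4)
    struct_len: int        # bytes 5~6: total byte count of the STRUCT
    geometry_block: bytes  # bytes 7~m: the ST_GEOMETRY encoded block


def split_shape(raw: bytes) -> ShapeParts:
    """Split a raw SHAPE byte array according to the layout in Table 1."""
    if len(raw) < 7:
        raise ValueError("SHAPE value too short: need at least 7 bytes")
    flag = raw[0]
    # Bytes 1~4 hold header metadata that Table 1 says can be skipped.
    # ASSUMPTION: bytes 5~6 are a big-endian unsigned 16-bit integer.
    (struct_len,) = struct.unpack_from(">H", raw, 5)
    return ShapeParts(flag, struct_len, raw[7:])


if __name__ == "__main__":
    # Synthetic bytes built for illustration, not real database output.
    sample = bytes([4]) + b"\x00" * 4 + (300).to_bytes(2, "big") + b"\xaa\xbb"
    print(split_shape(sample))
```

Decoding the compressed geometry inside `geometry_block` is a separate step and is not attempted here, since the page does not describe that encoding.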
Fig. 1  Implementation process of the two-step decoding parallel conversion algorithm for spatial vector data
Dataset   Region               Features (millions)   Points (millions)   Size (GB)
$D_0$     Nujiang Prefecture   0.103                 11.939              0.512
$D_1$     33° zone             0.041                 44.305              1.300
$D_2$     35° zone             1.996                 192.056             6.300
$D_3$     Yunnan Province      8.280                 730.922             22.900
$D_4$     Yunnan Province      8.396                 2069.057            56.730
Table 2  Experimental datasets
Fig. 2  Query response time of the ST_AsText and ST_AsBinary functions for different query data volumes
Conversion method    $D_0$   $D_1$   $D_2$   $D_3$    $D_4$
ArcGIS               5.60    16.42   87.80   252.40   689.22
PSGD                 1.90    2.68    8.93    49.98    53.70
One-step decoding    0.53    2.08    2.67    4.67     6.73
Two-step decoding    0.51    1.08    1.73    4.03     5.87
Table 3  Execution time of each conversion method on different datasets
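The improvement range quoted in the abstract (2.5 to 117 times) can be checked against Table 3 by dividing each baseline time by the two-step decoding time on the same dataset. The short sketch below does that arithmetic. It assumes the range refers to the ArcGIS and PSGD rows compared with two-step decoding, which the page does not state explicitly; the variable and function names are illustrative. Because the result is a ratio, the time unit (not given on this page) does not matter.

```python
from typing import Dict, List

# Execution times copied from Table 3, in dataset order D0..D4.
TIMES: Dict[str, List[float]] = {
    "ArcGIS": [5.60, 16.42, 87.80, 252.40, 689.22],
    "PSGD": [1.90, 2.68, 8.93, 49.98, 53.70],
    "One-step decoding": [0.53, 2.08, 2.67, 4.67, 6.73],
    "Two-step decoding": [0.51, 1.08, 1.73, 4.03, 5.87],
}


def speedups(baseline: str, target: str = "Two-step decoding") -> List[float]:
    """Return baseline time divided by target time for each dataset."""
    return [b / t for b, t in zip(TIMES[baseline], TIMES[target])]


if __name__ == "__main__":
    ratios = speedups("ArcGIS") + speedups("PSGD")
    for name in ("ArcGIS", "PSGD"):
        print(name, [round(r, 2) for r in speedups(name)])
    print("min:", round(min(ratios), 2), "max:", round(max(ratios), 2))
```

Under that assumption the smallest ratio comes from PSGD on $D_1$ (about 2.5) and the largest from ArcGIS on $D_4$ (about 117), which agrees with the range in the abstract.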
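Fig. 3  Cluster CPU utilization of each parallel method when converting different datasets
Fig. 4  Cluster disk write rate of each parallel method when converting different datasets
Fig. 5  Time-consumption skew index of each parallel method when converting different datasets
Fig. 6  Block skew index of each parallel method when converting different datasets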