基于两级过滤的时间序列近似查询

doi:10.3785/j.issn.1008-973X.2016.07.010

浙江大学学报(工学版)

计算机技术、信息工程

基于两级过滤的时间序列近似查询

蔡青林,陈岭,梅寒蕾,孙建伶

浙江大学计算机科学与技术学院,浙江杭州 310027

Two-step filtering based time series similarity search

CAI Qing lin, CHEN Ling, MEI Han lei, SUN Jian ling

College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

全文: PDF(926 KB) HTML

摘要：

针对现有的近似查询模型对查询精度的可控性较差,后续处理效率较低的问题,提出基于两级过滤的查询模型.通过采用不同粒度的SAX表示方法提取时间序列的字符型特征向量,可以将高维的时间序列映射到低维的特征空间；将不同粒度的特征向量以向量近似文件（VA-File）的结构进行存储,有效引入了倒排索引.在查询过程中,设计了启发式的查询过滤算法,根据粗粒度特征向量查询细粒度特征向量,实现第一级过滤；针对VA-File设计了高效的边界剪枝算法,实现第二级过滤.模型基于多粒度的SAX特征向量进行构建,可以对查询精度进行有效控制；在第二级过滤中采用的边界剪枝算法可以有效地提高后续处理的执行效率.实验结果表明,提出的查询模型具有较高的性能,对时间序列长度、kNN查询规模及数据集规模具有稳定的扩展性.

Abstract:

A two-step filtering based search model was proposed in order to solve the problms that existing approximated search models have the poor controllability on the accuracy and have the low efficiency at the post-processing step. The symbolic aggregate approximation (SAX) method was employed to extract the symbolic feature vectors from time series. The high-dimensional time series was projected into the low-dimensional feature space. The symbolic feature vectors were saved as the form of vector approximation file (VA-File), and the efficient inverted index was introduced. A heuristic query and filtering algorithm was proposed at the procedure of searching in order to perform the first filtering, and an efficient boundary pruning method was proposed over the VA-File to reach the second filtering. The model can effectively control the search accuracy by the SAX feature vectors with multiple precision. Experimental results show that the proposed model has the efficient performance and the robust scalability on the series length, the kNN query scale and the dataset size．

出版日期: 2016-07-23

TP 39

基金资助:

中国工业和信息化部“核高基”重大专项基金资助项目（2010ZX01042-002-003-001）；中国工程科技知识中心基金资助项目（CKCEST-2014-1-5）；国家自然科学基金资助项目（61332017）.

通讯作者: 陈岭，男，副教授. ORCID：0000-0003-1934-5992. E-mail: lingchen@zju.edu.cn

作者简介: 蔡青林(1987-)，男，博士生，从事数据挖掘、信息检索、模式识别的研究. ORCID：0000-0001-5219-7771. E-mail: qlcai@zju.edu.cn

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章

引用本文:

蔡青林,陈岭,梅寒蕾,孙建伶. 基于两级过滤的时间序列近似查询[J]. 浙江大学学报(工学版), 10.3785/j.issn.1008-973X.2016.07.010.

CAI Qing lin, CHEN Ling, MEI Han lei, SUN Jian ling. Two-step filtering based time series similarity search. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 10.3785/j.issn.1008-973X.2016.07.010.

链接本文:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2016.07.010 或 http://www.zjujournals.com/eng/CN/Y2016/V50/I7/1290

［1］ ESLING P, AGON C. Timeseries data mining ［J］. ACM Computing Surveys, 2012, 45(1): 134．
［2］ CONG F, CHEN J, DONG G, et al. Shorttime matrix series based singular value decomposition for rolling bearing fault diagnosis ［J］. Mechanical Systems and Signal Processing, 2013, 34(1): 218-230．
［3］ AGRAWAL R, FALOUTSOS C, SWAMI A. Efficient similarity search in sequence databases ［J］. Foundations of Data Organization and Algorithms, 1993, 730(1): 69-84．
［4］ CHAOVALIT P, GANGOPADHYAY A, KARABATIS G, et al. Discrete wavelet transformbased time series analysis and mining ［J］. ACM Computing Surveys, 2011, 43(2): 137．
［5］ KEOGH E, CHAKRABARTI K, PAZZANI M, et al. Dimensionality reduction for fast similarity search in large time series databases ［J］. Knowledge and Information Systems, 2001, 3(3): 263-286．
［6］ LIN J, KEOGH E, WEI L, et al. Experiencing SAX: a novel symbolic representation of time series ［J］. Data Mining and Knowledge Discovery, 2007, 15(2): 107-144．
［7］ FALOUTSOS C, RANGANATHAN M, MANOLOPOULOS Y. Fast subsequence matching in timeseries databases ［C］∥Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. New York: ACM, 1994: 419-429．
［8］ ATHITSOS V, PAPAPETROU P, POTAMIAS M, et al. Approximate embeddingbased subsequence matching of time series ［C］∥ Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2008: 365-378．
［9］ SHIEH J, KEOGH E. iSAX: diskaware mining and indexing of massive time series datasets ［J］. Data Mining and Knowledge Discovery, 2009, 19(1): 24-57．
［10］ CAMERRA A, SHIEH J, PALPANAS T, et al. Beyond one billion time series: indexing and mining very large time series collections with iSAX2+ ［J］. Knowledge and Information Systems, 2013, 39(1): 129．
［11］ LI Y, YIU M L, GONG Z. Discovering longestlasting correlation in sequence databases ［J］. Proceedings of the VLDB Endowment, 2013, 6(14): 1666-1677．
［12］ KEOGH E, CHAKRABARTI K, PAZZANI M, et al. Locally adaptive dimensionality reduction for indexing large time series databases ［J］. ACM SIGMOD Record, 2001, 30(2): 151-162．
［13］ KEOGH E, CHU S, HART D, et al. Segmenting time series: a survey and novel approach ［J］. Data Mining in Time Series Databases, 2004, 57(1): 122．
［14］ GULLO F, PONTI G, TAGARELLI A, et al. A time series representation model for accurate and fast similarity detection ［J］. Pattern Recognition, 2009, 42(11): 2998-3014．
［15］ PAPADIMITRIOU S, SUN J, FALOUTSOS C. Streaming pattern discovery in multiple timeseries ［C］∥ Proceedings of the 31st International Conference on Very Large Data Bases. New York: ACM, 2005: 697-708．
［16］ RAKTHANMANON T, CAMPANA B, MUEEN A, et al. Searching and mining trillions of time series subsequences under dynamic time warping ［C］∥ Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data mining. New York: ACM, 2012: 262-270．
［17］ MANNING, CHRISTOPHER D, PRABHAKAR R, et al. Introduction to information retrieval ［M］. Cambridge: Cambridge University Press, 2008．
［18］ BLOTT S, WEBER R. A simple vectorapproximation file for similarity search in highdimensional vector spaces ［R］.Zurich: ETH Zentrum, 1997．
［19］ YAHOO! Finance. ［2015-05-01］. http:∥finance.yahoo.com/.
［20］ Monte Carlo simulated stock price generator. ［2015-05-01］. http:∥25yearsofprogramm- ing.com/．

[1]	何雪军, 王进, 陆国栋, 刘振宇, 陈立, 金晶. 基于三角网切片及碰撞检测的工业机器人三维头像雕刻[J]. 浙江大学学报(工学版), 2017, 51(6): 1104-1110.
[2]	王桦, 韩同阳, 周可. 公安情报中基于关键图谱的群体发现算法[J]. 浙江大学学报(工学版), 2017, 51(6): 1173-1180.
[3]	游录金, 卢兴见, 何高奇. 云环境亚健康研究[J]. 浙江大学学报(工学版), 2017, 51(6): 1181-1189.
[4]	尤海辉, 马增益, 唐义军, 王月兰, 郑林, 俞钟, 吉澄军. 循环流化床入炉垃圾热值软测量[J]. 浙江大学学报(工学版), 2017, 51(6): 1163-1172.
[5]	张欣欣, 徐恪, 钟宜峰, 苏辉. 网络服务提供商合作行为的演化博弈分析[J]. 浙江大学学报(工学版), 2017, 51(6): 1214-1224.
[6]	李建丽, 丁丁, 李涛. 基于二次聚类的多目标混合云任务调度算法[J]. 浙江大学学报(工学版), 2017, 51(6): 1233-1241.
[7]	毕晓君, 王佳荟. 基于混合学习策略的教与学优化算法[J]. 浙江大学学报(工学版), 2017, 51(5): 1024-1031.
[8]	穆晶晶, 赵昕玥, 何再兴, 张树有. 基于凹凸变换与圆周拟合的重叠气泡轮廓重构[J]. 浙江大学学报(工学版), 2017, 51(4): 714-721.
[9]	黄正宇, 蒋鑫龙, 刘军发, 陈益强, 谷洋. 基于融合特征的半监督流形约束定位方法[J]. 浙江大学学报(工学版), 2017, 51(4): 655-662.
[10]	蒋鑫龙, 陈益强, 刘军发, 忽丽莎, 沈建飞. 面向自闭症患者社交距离认知的可穿戴系统[J]. 浙江大学学报(工学版), 2017, 51(4): 637-647.
[11]	王钰翔, 李晟洁, 王皓, 马钧轶, 王亚沙, 张大庆. 基于Wi-Fi的非接触式行为识别研究综述[J]. 浙江大学学报(工学版), 2017, 51(4): 648-654.
[12]	郭彤, 郭斌, 张佳凡, 於志文, 周兴社. 多源社交数据融合的多角度旅游信息感知[J]. 浙江大学学报(工学版), 2017, 51(4): 663-668.
[13]	王亮, 於志文, 郭斌. 基于双层多粒度知识发现的移动轨迹预测模型[J]. 浙江大学学报(工学版), 2017, 51(4): 669-674.
[14]	景瑶, 郭斌, 王柱, 於志文, 周兴社. 基于群体智能挖掘的个性化商品评论呈现方法[J]. 浙江大学学报(工学版), 2017, 51(4): 675-681.
[15]	钱良芳, 张森林, 刘妹琴. 基于预约的数据队列水下无线传感器网络MAC协议[J]. 浙江大学学报(工学版), 2017, 51(4): 691-696.

Viewed

Full text

Abstract

Cited

Shared

Discussed