Computer Technology, Information Engineering |
|
|
|
|
Two-step filtering based time series similarity search |
CAI Qing lin, CHEN Ling, MEI Han lei, SUN Jian ling |
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China |
|
|
Abstract A two-step filtering based search model was proposed in order to solve the problms that existing approximated search models have the poor controllability on the accuracy and have the low efficiency at the post-processing step. The symbolic aggregate approximation (SAX) method was employed to extract the symbolic feature vectors from time series. The high-dimensional time series was projected into the low-dimensional feature space. The symbolic feature vectors were saved as the form of vector approximation file (VA-File), and the efficient inverted index was introduced. A heuristic query and filtering algorithm was proposed at the procedure of searching in order to perform the first filtering, and an efficient boundary pruning method was proposed over the VA-File to reach the second filtering. The model can effectively control the search accuracy by the SAX feature vectors with multiple precision. Experimental results show that the proposed model has the efficient performance and the robust scalability on the series length, the kNN query scale and the dataset size.
|
Published: 23 July 2016
|
|
基于两级过滤的时间序列近似查询
针对现有的近似查询模型对查询精度的可控性较差,后续处理效率较低的问题,提出基于两级过滤的查询模型.通过采用不同粒度的SAX表示方法提取时间序列的字符型特征向量,可以将高维的时间序列映射到低维的特征空间;将不同粒度的特征向量以向量近似文件(VA-File)的结构进行存储,有效引入了倒排索引.在查询过程中,设计了启发式的查询过滤算法,根据粗粒度特征向量查询细粒度特征向量,实现第一级过滤;针对VA-File设计了高效的边界剪枝算法,实现第二级过滤.模型基于多粒度的SAX特征向量进行构建,可以对查询精度进行有效控制;在第二级过滤中采用的边界剪枝算法可以有效地提高后续处理的执行效率.实验结果表明,提出的查询模型具有较高的性能,对时间序列长度、kNN查询规模及数据集规模具有稳定的扩展性.
|
|
[1] ESLING P, AGON C. Timeseries data mining [J]. ACM Computing Surveys, 2012, 45(1): 134.
[2] CONG F, CHEN J, DONG G, et al. Shorttime matrix series based singular value decomposition for rolling bearing fault diagnosis [J]. Mechanical Systems and Signal Processing, 2013, 34(1): 218-230.
[3] AGRAWAL R, FALOUTSOS C, SWAMI A. Efficient similarity search in sequence databases [J]. Foundations of Data Organization and Algorithms, 1993, 730(1): 69-84.
[4] CHAOVALIT P, GANGOPADHYAY A, KARABATIS G, et al. Discrete wavelet transformbased time series analysis and mining [J]. ACM Computing Surveys, 2011, 43(2): 137.
[5] KEOGH E, CHAKRABARTI K, PAZZANI M, et al. Dimensionality reduction for fast similarity search in large time series databases [J]. Knowledge and Information Systems, 2001, 3(3): 263-286.
[6] LIN J, KEOGH E, WEI L, et al. Experiencing SAX: a novel symbolic representation of time series [J]. Data Mining and Knowledge Discovery, 2007, 15(2): 107-144.
[7] FALOUTSOS C, RANGANATHAN M, MANOLOPOULOS Y. Fast subsequence matching in timeseries databases [C]∥Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. New York: ACM, 1994: 419-429.
[8] ATHITSOS V, PAPAPETROU P, POTAMIAS M, et al. Approximate embeddingbased subsequence matching of time series [C]∥ Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2008: 365-378.
[9] SHIEH J, KEOGH E. iSAX: diskaware mining and indexing of massive time series datasets [J]. Data Mining and Knowledge Discovery, 2009, 19(1): 24-57.
[10] CAMERRA A, SHIEH J, PALPANAS T, et al. Beyond one billion time series: indexing and mining very large time series collections with iSAX2+ [J]. Knowledge and Information Systems, 2013, 39(1): 129.
[11] LI Y, YIU M L, GONG Z. Discovering longestlasting correlation in sequence databases [J]. Proceedings of the VLDB Endowment, 2013, 6(14): 1666-1677.
[12] KEOGH E, CHAKRABARTI K, PAZZANI M, et al. Locally adaptive dimensionality reduction for indexing large time series databases [J]. ACM SIGMOD Record, 2001, 30(2): 151-162.
[13] KEOGH E, CHU S, HART D, et al. Segmenting time series: a survey and novel approach [J]. Data Mining in Time Series Databases, 2004, 57(1): 122.
[14] GULLO F, PONTI G, TAGARELLI A, et al. A time series representation model for accurate and fast similarity detection [J]. Pattern Recognition, 2009, 42(11): 2998-3014.
[15] PAPADIMITRIOU S, SUN J, FALOUTSOS C. Streaming pattern discovery in multiple timeseries [C]∥ Proceedings of the 31st International Conference on Very Large Data Bases. New York: ACM, 2005: 697-708.
[16] RAKTHANMANON T, CAMPANA B, MUEEN A, et al. Searching and mining trillions of time series subsequences under dynamic time warping [C]∥ Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data mining. New York: ACM, 2012: 262-270.
[17] MANNING, CHRISTOPHER D, PRABHAKAR R, et al. Introduction to information retrieval [M]. Cambridge: Cambridge University Press, 2008.
[18] BLOTT S, WEBER R. A simple vectorapproximation file for similarity search in highdimensional vector spaces [R].Zurich: ETH Zentrum, 1997.
[19] YAHOO! Finance. [2015-05-01]. http:∥finance.yahoo.com/.
[20] Monte Carlo simulated stock price generator. [2015-05-01]. http:∥25yearsofprogramm- ing.com/. |
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|