Set similarity join using partition index

doi:10.3785/j.issn.1008-973X.2012.02.017

2012, Vol. 46, Issue 2: 286-293

Computer Technology

HONG Yin-jie (洪银杰), CHEN Gang (陈刚), CHEN Ke (陈珂)

Department of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China


Abstract:

Traditional indexing and filtering algorithms handle online similarity joins poorly. To address this, we propose a new indexing method and a new filtering algorithm that improve on inverted-index-based and signature-based schemes. Starting from an inverted index, we partition the index according to each item's position and each record's length, which reduces the search space and makes index probing more efficient. In addition, we design a weighted signature filtering scheme that estimates an upper bound on the overlap (the size of the intersection) between two sets, which improves filtering effectiveness. Set similarity joins are typically processed in a filtering-refinement framework with two steps: first generate a candidate set using filtering schemes, then verify the candidates to produce the final results. The proposed schemes integrate seamlessly into this framework alongside other filtering schemes to process set similarity joins online. Experiments on real datasets show that the proposed schemes significantly improve computational efficiency.
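The filtering-refinement idea with an index partitioned by record length and item position can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given in the abstract; it assumes Jaccard similarity, standard prefix, length, and positional filters, and an index layout `index[token][length] -> [(rid, pos)]`. All function names (`canonical`, `build_index`, `probe`) are hypothetical.

```python
import math
from collections import defaultdict

EPS = 1e-9  # guards ceil/floor against floating-point noise


def canonical(tokens):
    """Deduplicate and sort tokens under one global order (assumed: lexicographic)."""
    return sorted(set(tokens))


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def prefix_len(n, t):
    """Prefix length for a record of size n under Jaccard threshold t."""
    if n == 0:
        return 0
    return n - math.ceil(t * n - EPS) + 1


def build_index(records, t):
    """Inverted index partitioned by record length: index[token][length] -> [(rid, pos)]."""
    index = defaultdict(lambda: defaultdict(list))
    for rid, rec in enumerate(records):
        for pos in range(prefix_len(len(rec), t)):
            index[rec[pos]][len(rec)].append((rid, pos))
    return index


def probe(query, records, index, t):
    """Return ids of records whose Jaccard similarity with query is at least t."""
    n = len(query)
    lo = math.ceil(t * n - EPS)      # shortest record that can still match
    hi = math.floor(n / t + EPS)     # longest record that can still match
    candidates = set()
    for i in range(prefix_len(n, t)):
        by_length = index.get(query[i])
        if by_length is None:
            continue
        for m, postings in by_length.items():
            if m < lo or m > hi:
                continue  # the whole length partition is skipped
            need = math.ceil(t / (1 + t) * (n + m) - EPS)  # required overlap
            for rid, j in postings:
                # Positional bound: at most min(i, j) shared tokens before the
                # match, the match itself, and min(remaining) tokens after it.
                bound = min(i, j) + 1 + min(n - i - 1, m - j - 1)
                if bound >= need:
                    candidates.add(rid)
    # Refinement step: verify every surviving candidate exactly.
    return sorted(rid for rid in candidates if jaccard(query, records[rid]) >= t)
```

For example, with the records `abcd`, `abce`, `xyz`, and `abcde` (one token per letter, passed through `canonical`) and threshold 0.7, probing with `abcd` returns `[0, 3]`: the record `xyz` is never touched, and `abce` survives filtering but fails verification.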

Key words: similarity join; partition; weighted signature; filter; similarity function
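The abstract does not specify how the weighted signature is built, so the following is only one plausible reading, offered as a sketch: each set is summarized by per-bucket token counts (the "weights"), and the sum of bucket-wise minima gives an upper bound on the overlap of two sets. The hashing scheme, the bucket count, and every name below are assumptions for illustration.

```python
import math
import zlib


def signature(tokens, buckets=16):
    """Count how many distinct tokens fall into each hash bucket."""
    sig = [0] * buckets
    for tok in set(tokens):
        sig[zlib.crc32(tok.encode("utf-8")) % buckets] += 1
    return sig


def overlap_upper_bound(sig_x, sig_y):
    """Shared tokens land in the same bucket, so each bucket contributes
    at most the smaller of its two counts to the true overlap."""
    return sum(min(a, b) for a, b in zip(sig_x, sig_y))


def required_overlap(n, m, t):
    """Minimum overlap two sets of sizes n and m need for Jaccard >= t."""
    return math.ceil(t / (1 + t) * (n + m) - 1e-9)


def may_match(x, y, t, buckets=16):
    """Filter step: False means the pair can be pruned without verification."""
    bound = overlap_upper_bound(signature(x, buckets), signature(y, buckets))
    return bound >= required_overlap(len(set(x)), len(set(y)), t)
```

Because the bound never underestimates the true overlap, pruning on `may_match(...) == False` cannot discard a real result; pairs that pass still need exact verification in the refinement step.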

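Published: 2012-03-02

CLC number: TP 311.13

Funding:

National Natural Science Foundation of China (60803003, 60970124)

Corresponding author: CHEN Gang, male, professor, doctoral supervisor. E-mail: cg@zju.edu.cn

About the author: HONG Yin-jie (born 1982), male, Ph.D. candidate; research interests: databases and data mining. E-mail: hongyj@zju.edu.cn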


Cite this article:

HONG Yin-jie, CHEN Gang, CHEN Ke. Set similarity join using partition index[J]. Journal of Zhejiang University (Engineering Science), 2012, 46(2): 286-293.

Link to this article:

http://www.zjujournals.com/xueshu/eng/CN/10.3785/j.issn.1008-973X.2012.02.017 or http://www.zjujournals.com/xueshu/eng/CN/Y2012/V46/I2/286

