J4  2012, Vol. 46 Issue (2): 286-293    DOI: 10.3785/j.issn.1008-973X.2012.02.017
Set similarity join using partition index
HONG Yin-jie, CHEN Gang, CHEN Ke
Department of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

To address the shortcomings of traditional indexing and filtering algorithms for online set similarity joins, we propose several novel filtering approaches that improve on inverted-index-based and signature-based schemes. First, we enhance the inverted index by partitioning it according to each item's position and each record's length, which reduces the search space. Second, we design a novel weighted signature filtering scheme that estimates an upper bound on the overlap between two sets to make filtering more effective. Set similarity joins are typically processed in a filtering-refinement framework, which generates candidates with filtering schemes and then refines those candidates to produce the final results. The proposed schemes integrate seamlessly into this framework alongside other filtering schemes to process set similarity joins online. Extensive experiments on real datasets show the efficiency of the proposed schemes.
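
The filtering-refinement framework mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes Jaccard similarity, uses the standard prefix and length filters from the general set similarity join literature, and approximates the idea of a length-partitioned inverted index by bucketing each posting list by record length. The names `similarity_self_join`, `jaccard`, and the `index[token][length]` layout are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_self_join(records, threshold):
    """Return sorted (i, j, similarity) triples with Jaccard >= threshold."""
    sets = [set(rec) for rec in records]

    # Global token order: rare tokens first, so prefixes hold selective tokens.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    # Assumed layout: index[token][length] -> ids of indexed records of that
    # length. Bucketing by length lets a probe skip whole partitions.
    index = defaultdict(lambda: defaultdict(list))
    results = []

    # Probe in order of increasing length, so every indexed record is no
    # longer than the record currently probing the index.
    for rid in sorted(range(len(sets)), key=lambda i: len(sets[i])):
        toks = ordered[rid]
        n = len(toks)
        if n == 0:
            continue
        # Smallest overlap (and smallest partner length) that can still reach
        # the threshold; the epsilon guards against floating-point round-up.
        min_len = math.ceil(threshold * n - 1e-9)
        prefix = toks[: n - min_len + 1]  # prefix filter

        # Filtering step: collect candidates that share a prefix token and
        # sit in a length partition that passes the length filter.
        candidates = set()
        for tok in prefix:
            for length, ids in index[tok].items():
                if length >= min_len:
                    candidates.update(ids)

        # Refinement step: verify each candidate with the exact similarity.
        for cid in candidates:
            sim = jaccard(sets[rid], sets[cid])
            if sim >= threshold:
                results.append((min(rid, cid), max(rid, cid), sim))

        # Index the probing record's prefix tokens under its length.
        for tok in prefix:
            index[tok][n].append(rid)
    return sorted(results)
```

Calling `similarity_self_join(records, 0.7)` on a list of token lists returns the qualifying pairs as index pairs with their similarity. The exact similarity is computed only for candidates that survive both filters, which is the point of separating filtering from refinement.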

Published: 20 March 2012
CLC:  TP 311.13  
Cite this article:

HONG Yin-jie, CHEN Gang, CHEN Ke. Set similarity join using partition index. J4, 2012, 46(2): 286-293.

[1] XIAO Chuan, WANG Wei, LIN Xuemin, et al. Efficient similarity joins for near duplicate detection [C]∥ Proceedings of the 17th International Conference on World Wide Web. Beijing: ACM, 2008: 131-140.
[2] ARASU A, GANTI V, KAUSHIK R. Efficient exact setsimilarity joins [C]∥ Proceedings of the 32nd International Conference on Very Large Data Bases. Seoul: ACM, 2006: 918-929.
[3] AGRAWAL P, ARASU A, KAUSHIK R. On indexing errortolerant set containment [C]∥ Proceedings of the ACM SIGMOD International Conference on Management of Data. Indianapolis: ACM, 2010: 927-938.
[4] THEOBALD M, SIDDHARTH J, PAEPCKE A. Spotsigs: robust and efficient near duplicate detection in large web collections [C]∥ Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore: ACM, 2008: 563-570.
[5] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning [C]∥ Proceedings of the 22nd International Conference on Data Engineering. Atlanta: IEEE Computer Society, 2006: 5.
[6] SARAWAGI S, KIRPAL A. Efficient set joins on similarity predicates [C]∥ Proceedings of the ACM SIGMOD International Conference on Management of Data. Paris: ACM, 2004: 743-754.
[7] GRAVANO L, IPEIROTIS P G, JAGADISH H V, et al. Approximate string joins in a database (almost) for free [C]∥ Proceedings of 27th International Conference on Very Large Data Bases. Roma. Morgan Kaufmann, 2001: 491-500.
[8] XIAO Chuan, WANG Wei, LIN Xuemin. Edjoin: an efficient algorithm for similarity joins with edit distance constraints [J]. PVLDB, 2008(1): 933-944.
[9] RIBEIRO L, HRDER T. Efficient set similarity joins using minprexes [C]∥ Advances in Databases and Information Systems, 13th East European Conference. Riga: Springer, 2009: 88-102.
[10] BAYARDO R J, MA Y, SRIKANT R. Scaling up all pairs similarity search [C]∥ Proceedings of the 16th International Conference on World Wide Web. Alberta: ACM, 2007: 131-140.
[11] MAMOULIS N. Efficient processing of joins on setvalued attributes [C]∥ Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. California: ACM, 2003: 157-168.
