Please wait a minute...
J4  2010, Vol. 44 Issue (12): 2284-2290    DOI: 10.3785/j.issn.1008-973X.2010.12.009
    
TrigSigs: an effective record linkage algorithm for unstructured data
WU Yu, SHENG Zhen-hua, SHOU Li-dan, CHEN Gang
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Download:   PDF(0KB) HTML
Export: BibTeX | EndNote (RIS)      

Abstract  

A novel clustering algorithm named TrigSigs was proposed to overcome the problem of record linkage models for unstructured data from network. It focuses on mining the associations of hidden attributes as the signatures of objects in unstructured data by trigger-pair model. It can group tokens which help identify objects and filter out noise. Then it assigns weight to tokens properly which makes feature vectors more representative for identifying objects. After these steps, it gains fine-grained object-based clustering result from unstructured data. Experiments on real datasets show that this algorithm can filter out most noise and assign weight for features properly, and improves the clustering results greatly.



Published: 01 December 2010
CLC:  TP 393.08  
Cite this article:

WU Yu, SHENG Zhen-hua, SHOU Li-dan, CHEN Gang. TrigSigs: an effective record linkage algorithm for unstructured data. J4, 2010, 44(12): 2284-2290.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2010.12.009     OR     http://www.zjujournals.com/eng/Y2010/V44/I12/2284


TrigSigs:一种有效的非结构化记录关联合并算法

为了解决从网络数据源提取的非结构化数据的处理问题,提出一种基于触发对的聚类算法TrigSigs,利用触发对挖掘非结构化数据中隐含属性间的关联关系作为辨别实体的标志.该算法能够聚集对辨别实体起到关键作用的特征组合,过滤噪音词汇,并且根据辨别实体的分辨力,为每个特征词汇赋予合理的权重,使记录的特征向量对辨别实体更具代表性,最终提高聚类结果的细粒度,很好地解决了非结构化数据的记录关联合并问题.实验结果表明:该算法可以过滤绝大部分噪音词汇,并且根据词汇的分辨力合理分配权重,使最终聚类结果的准确率有很大的提升.

[1] CHRISTEN P, GOISER K. Quality and complexity measures for data linkage and deduplication [M]. Heidelberg: Springer Berlin, 2007.
[2] AHMED K E, PANAGIOTIS G I, VASSILIOS S V. Duplicate record detection: a survey [J]. IEEE Transaction on Knowledge and Data Eng, 2007, 19(1): 1-16.
[3] RUI X, WUNSCH D. Survey of clustering algorithms [J]. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
[4] JAIN A, MURTY M, FLYNN P. Data clustering: a review [J]. ACM Computing Surveys, 1999, 31(3): 264-323.
[5] LAU, ROSENFELD R, ROUKOS R. Triggerbased language models: a maximum entropy approach [C]∥ IEEE International Conference on Acoustics, Speech, and Signal Processing. Minneapolis: IEEE, 1993,2: 45-48.
[6] 赵岩,王晓龙,刘秉权,等.融合聚类触发对特征的最大熵词性标注模型[J].计算机研究与发展,2006,43(2): 268-274.
ZHAO Yan, WANG Xiaolong, LIU Bingquan, et al. Fusion of clustering TriggerPair features for POS tagging based on maximum entropy model [J]. Journal of Computer Research and Development, 2006, 43(2): 268-274(in Chinese).
[7] BAEZA R A, RIBEIRO B. Modern information retrieval [M]. New Jersey: AddisonWesley Longman Publishing Co., Inc, 1999.
[8] CHOWDHURY A, FRIEDER O, GROSSMAN D, et al. Collection statistics for fast duplicate document detection [J]. ACM Transaction of Information Systems, 2002, 20(2): 171-191.

[9] BRODER A Z, GLASSMAN S C, MANASSE M S. Syntactic clustering of the Web [J]. Computer Networks, 1997, 29(813): 1157-1166.
[10] IVAN P F, ALAN B S. A theory for record linkage [J]. Journal of the American Statistical Association, 1969,64(328): 1183-1210.
[11] PETER C. Automatic record linkage using seeded nearest neighbour and support vector machine classification [C]∥ Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Nevada: ACM, 2008: 151-159.
[12] CHURCHES T, CHRISTEN P, LIM K, et al. Preparation of name and address data for record linkage using hidden Markov models [J/OL]. BioMed Central Medical Informatics and Decision Making, 2002, 2(9). \
[20080220\]. http:∥www.biomedcentral.com/1472-6947/2/9/.
[13] BILENKO M, MOONEY R J. Adaptive duplicate detection using learnable string similarity measures [C]∥Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vancouver: ACM, 2003: 185-194.
[14] INDRAJIT B, LISE G. Iterative record linkage for cleaning and integration [C]∥Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. Paris: ACM, 2004: 11-18.
[15] GU L, BAXTER R. Decision models for record linkage [J]. In Selected Papers from AusDM, 2006, 3755: 146-160.
[16] MARTIN T, JONATHAN S, ANDREAS P. SpotSigs: robust and efficient near duplicate detection in large web collections [C] ∥Proc of the 31st SIGIR Conference on Research and Development In Information Retrieval. Singapore: ACM, 2008: 563-570.
[17] HUNG C, XIAOTIE D. A new suffix tree similarity measure for document clustering [C] ∥ Proc of the 16th International Conference on World Wide Web. Canada: ACM, 2007: 121-130.
[18] CIOS K, PEDRYCS W, SWINIARSKI R. Data mining methods for knowledge discovery [M]. Boston: Kluwer Academic Publishers, 1998: 381-390.
[19] HAMMOUDA K M, KAMEL M S. Efficient phrasebased document indexing for Web document clustering[J]. Knowledge and Data Engineering, 2004, 16(10): 1279-1296.

[1] XU Chang, SHOU Li-dan, CHEN Gang, HU Tian-lei. An flash-based hybrid storage model for database[J]. J4, 2012, 46(2): 294-300.
[2] SHOU Li-dan, LIAO Ding-bai, XU Chang, CHEN Gang. PWLRU: a buffer replacement algorithm for flash-based Database[J]. J4, 2010, 44(12): 2257-2262.
[3] PI Dun-Bei, CHEN Ke, CHEN Gang, DONG Jin-Xiang. Privacy protection method based on user profile of two-step sorting[J]. J4, 2010, 44(9): 1659-1665.
[4] WEI Wei, DONG E-Bei, LU Dong-Meng. Multi-resource max-min fairness and support vector machine based DDoS defense[J]. J4, 2010, 44(2): 265-270.