TrigSigs：一种有效的非结构化记录关联合并算法

doi:10.3785/j.issn.1008-973X.2010.12.009

2010, Vol. 44

Issue (12): 2284-2290 DOI: 10.3785/j.issn.1008-973X.2010.12.009

自动化技术、计算机技术

TrigSigs：一种有效的非结构化记录关联合并算法

吴羽, 盛振华, 寿黎但, 陈刚

浙江大学计算机科学与技术学院,浙江杭州 310027

TrigSigs: an effective record linkage algorithm for unstructured data

WU Yu, SHENG Zhen-hua, SHOU Li-dan, CHEN Gang

College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

全文: PDF HTML

摘要：

为了解决从网络数据源提取的非结构化数据的处理问题,提出一种基于触发对的聚类算法TrigSigs,利用触发对挖掘非结构化数据中隐含属性间的关联关系作为辨别实体的标志.该算法能够聚集对辨别实体起到关键作用的特征组合,过滤噪音词汇,并且根据辨别实体的分辨力,为每个特征词汇赋予合理的权重,使记录的特征向量对辨别实体更具代表性,最终提高聚类结果的细粒度,很好地解决了非结构化数据的记录关联合并问题.实验结果表明：该算法可以过滤绝大部分噪音词汇,并且根据词汇的分辨力合理分配权重,使最终聚类结果的准确率有很大的提升.

Abstract:

A novel clustering algorithm named TrigSigs was proposed to overcome the problem of record linkage models for unstructured data from network. It focuses on mining the associations of hidden attributes as the signatures of objects in unstructured data by trigger-pair model. It can group tokens which help identify objects and filter out noise. Then it assigns weight to tokens properly which makes feature vectors more representative for identifying objects. After these steps, it gains fine-grained object-based clustering result from unstructured data. Experiments on real datasets show that this algorithm can filter out most noise and assign weight for features properly, and improves the clustering results greatly.

出版日期: 2010-12-01

TP 393.08

基金资助:

国家自然科学基金资助项目(60603044, 60803003)；浙江省科技计划项目重大科技攻关资助项目(2006c11108).

通讯作者: 陈刚，男，教授. E-mail: cg@zju.edu.cn

作者简介: 吴羽（1982—），男，江苏苏州人，博士生，主要从事垂直搜索引擎和社会化网络研究，E-mail: yuwu@zju.edu.cn

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章

引用本文:

吴羽, 盛振华, 寿黎但, 陈刚. TrigSigs：一种有效的非结构化记录关联合并算法[J]. J4, 2010, 44(12): 2284-2290.

WU Yu, SHENG Zhen-hua, SHOU Li-dan, CHEN Gang. TrigSigs: an effective record linkage algorithm for unstructured data. J4, 2010, 44(12): 2284-2290.

链接本文:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2010.12.009 或 http://www.zjujournals.com/eng/CN/Y2010/V44/I12/2284

［1］ CHRISTEN P, GOISER K. Quality and complexity measures for data linkage and deduplication ［M］. Heidelberg: Springer Berlin, 2007.
［2］ AHMED K E, PANAGIOTIS G I, VASSILIOS S V. Duplicate record detection: a survey ［J］. IEEE Transaction on Knowledge and Data Eng, 2007, 19(1): 1-16.
［3］ RUI X, WUNSCH D. Survey of clustering algorithms ［J］. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
［4］ JAIN A, MURTY M, FLYNN P. Data clustering: a review ［J］. ACM Computing Surveys, 1999, 31(3): 264-323.
［5］ LAU, ROSENFELD R, ROUKOS R. Triggerbased language models: a maximum entropy approach ［C］∥ IEEE International Conference on Acoustics, Speech, and Signal Processing. Minneapolis: IEEE, 1993,2: 45-48.
［6］赵岩,王晓龙,刘秉权,等.融合聚类触发对特征的最大熵词性标注模型［J］.计算机研究与发展,2006,43(2): 268-274.
ZHAO Yan, WANG Xiaolong, LIU Bingquan, et al. Fusion of clustering TriggerPair features for POS tagging based on maximum entropy model ［J］. Journal of Computer Research and Development, 2006, 43(2): 268-274(in Chinese).
［7］ BAEZA R A, RIBEIRO B. Modern information retrieval ［M］. New Jersey: AddisonWesley Longman Publishing Co., Inc, 1999.
［8］ CHOWDHURY A, FRIEDER O, GROSSMAN D, et al. Collection statistics for fast duplicate document detection ［J］. ACM Transaction of Information Systems, 2002, 20(2): 171-191.

［9］ BRODER A Z, GLASSMAN S C, MANASSE M S. Syntactic clustering of the Web ［J］. Computer Networks, 1997, 29(813): 1157-1166.
［10］ IVAN P F, ALAN B S. A theory for record linkage ［J］. Journal of the American Statistical Association, 1969,64（328）： 1183-1210.
［11］ PETER C. Automatic record linkage using seeded nearest neighbour and support vector machine classification ［C］∥ Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Nevada: ACM, 2008: 151-159.
［12］ CHURCHES T, CHRISTEN P, LIM K, et al. Preparation of name and address data for record linkage using hidden Markov models ［J/OL］. BioMed Central Medical Informatics and Decision Making, 2002, 2(9). \
[20080220\]. http:∥www.biomedcentral.com/1472-6947/2/9/.
［13］ BILENKO M, MOONEY R J. Adaptive duplicate detection using learnable string similarity measures ［C］∥Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vancouver: ACM, 2003: 185-194.
［14］ INDRAJIT B, LISE G. Iterative record linkage for cleaning and integration ［C］∥Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. Paris: ACM, 2004: 11-18.
［15］ GU L, BAXTER R. Decision models for record linkage ［J］. In Selected Papers from AusDM, 2006, 3755: 146-160.
［16］ MARTIN T, JONATHAN S, ANDREAS P. SpotSigs: robust and efficient near duplicate detection in large web collections ［C］ ∥Proc of the 31st SIGIR Conference on Research and Development In Information Retrieval. Singapore: ACM, 2008: 563-570.
［17］ HUNG C, XIAOTIE D. A new suffix tree similarity measure for document clustering ［C］ ∥ Proc of the 16th International Conference on World Wide Web. Canada: ACM, 2007: 121-130.
［18］ CIOS K, PEDRYCS W, SWINIARSKI R. Data mining methods for knowledge discovery ［M］. Boston: Kluwer Academic Publishers, 1998: 381-390.
［19］ HAMMOUDA K M, KAMEL M S. Efficient phrasebased document indexing for Web document clustering［J］. Knowledge and Data Engineering, 2004, 16(10): 1279-1296.

[1]	徐昶, 寿黎但, 陈刚, 胡天磊. 一种基于闪存的数据库复合存储模型[J]. J4, 2012, 46(2): 294-300.
[2]	寿黎但, 廖定柏, 徐昶, 陈刚. PWLRU: 一种面向闪存数据库的缓冲区存取算法[J]. J4, 2010, 44(12): 2257-2262.
[3]	皮俊波, 陈珂, 陈刚, 董金祥. 基于用户兴趣模型两段式排序的隐私保护方法[J]. J4, 2010, 44(9): 1659-1665.
[4]	魏蔚, 董亚波, 鲁东明. 基于支持向量机和多资源最大最小公平的DDoS防御[J]. J4, 2010, 44(2): 265-270.

Viewed

Full text

Abstract

Cited

Shared

Discussed