[1] CHRISTEN P, GOISER K. Quality and complexity measures for data linkage and deduplication [M]. Heidelberg: Springer Berlin, 2007.
[2] ELMAGARMID A K, IPEIROTIS P G, VERYKIOS V S. Duplicate record detection: a survey [J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1-16.
[3] XU R, WUNSCH D. Survey of clustering algorithms [J]. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
[4] JAIN A, MURTY M, FLYNN P. Data clustering: a review [J]. ACM Computing Surveys, 1999, 31(3): 264-323.
[5] LAU R, ROSENFELD R, ROUKOS S. Trigger-based language models: a maximum entropy approach [C]// IEEE International Conference on Acoustics, Speech, and Signal Processing. Minneapolis: IEEE, 1993, 2: 45-48.
[6] ZHAO Yan, WANG Xiaolong, LIU Bingquan, et al. Fusion of clustering trigger-pair features for POS tagging based on maximum entropy model [J]. Journal of Computer Research and Development, 2006, 43(2): 268-274 (in Chinese).
[7] BAEZA-YATES R, RIBEIRO-NETO B. Modern information retrieval [M]. New Jersey: Addison-Wesley Longman Publishing Co., Inc, 1999.
[8] CHOWDHURY A, FRIEDER O, GROSSMAN D, et al. Collection statistics for fast duplicate document detection [J]. ACM Transactions on Information Systems, 2002, 20(2): 171-191.
[9] BRODER A Z, GLASSMAN S C, MANASSE M S. Syntactic clustering of the Web [J]. Computer Networks and ISDN Systems, 1997, 29(8-13): 1157-1166.
[10] FELLEGI I P, SUNTER A B. A theory for record linkage [J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[11] CHRISTEN P. Automatic record linkage using seeded nearest neighbour and support vector machine classification [C]// Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Nevada: ACM, 2008: 151-159.
[12] CHURCHES T, CHRISTEN P, LIM K, et al. Preparation of name and address data for record linkage using hidden Markov models [J/OL]. BioMed Central Medical Informatics and Decision Making, 2002, 2(9). [2008-02-20]. http://www.biomedcentral.com/1472-6947/2/9/.
[13] BILENKO M, MOONEY R J. Adaptive duplicate detection using learnable string similarity measures [C]// Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vancouver: ACM, 2003: 185-194.
[14] BHATTACHARYA I, GETOOR L. Iterative record linkage for cleaning and integration [C]// Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. Paris: ACM, 2004: 11-18.
[15] GU L, BAXTER R. Decision models for record linkage [J]. Selected Papers from AusDM, 2006, 3755: 146-160.
[16] THEOBALD M, SIDDHARTH J, PAEPCKE A. SpotSigs: robust and efficient near duplicate detection in large web collections [C]// Proceedings of the 31st SIGIR Conference on Research and Development in Information Retrieval. Singapore: ACM, 2008: 563-570.
[17] CHIM H, DENG X. A new suffix tree similarity measure for document clustering [C]// Proceedings of the 16th International Conference on World Wide Web. Canada: ACM, 2007: 121-130.
[18] CIOS K, PEDRYCZ W, SWINIARSKI R. Data mining methods for knowledge discovery [M]. Boston: Kluwer Academic Publishers, 1998: 381-390.
[19] HAMMOUDA K M, KAMEL M S. Efficient phrase-based document indexing for Web document clustering [J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(10): 1279-1296.