A multi-stage disambiguation algorithm based on the construction of a text feature space was proposed. Because query terms often also occur as common words, a heuristic rule was applied after document pre-processing to determine whether the query term is used as a personal name. Named entities and occupations were then extracted according to feature templates. The sentential semantic model was used for sentential semantic analysis and for extracting sentential semantic features, and word frequencies were counted with the bag-of-words model. A three-layer feature space was constructed from these features. Rule-based classification and a two-stage hierarchical clustering algorithm were then used to perform name disambiguation, with the overlap coefficient introduced to compute the similarity of the sentential semantic features. Experiments on the dataset built for the CLP2012 Chinese personal name disambiguation task showed that the F-measure reached 88.79%, which indicates that the proposed approach can improve the performance of cross-document personal name disambiguation.
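The abstract names the overlap coefficient as the similarity measure for sentential semantic features but does not give its formula. The sketch below assumes the standard definition, |A ∩ B| / min(|A|, |B|), applied to two feature sets. The function name, the handling of empty sets, and the sample features are illustrative assumptions and are not taken from the paper.

```python
def overlap_coefficient(features_a, features_b):
    """Overlap coefficient of two feature collections: |A & B| / min(|A|, |B|)."""
    set_a, set_b = set(features_a), set(features_b)
    if not set_a or not set_b:
        # Assumption: an empty feature set is treated as having zero similarity.
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Hypothetical sentential semantic features of two documents that mention the same name.
doc1_features = {"professor", "teach", "university", "physics"}
doc2_features = {"professor", "university", "publish"}

similarity = overlap_coefficient(doc1_features, doc2_features)
print(round(similarity, 4))
```

Because the denominator is the size of the smaller set, a short document whose features are fully contained in a longer one scores 1.0, which may suit documents of very different lengths.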
ZHANG Han, LUO Sen-lin, ZOU Li-li, SHI Xiu-min. Cross-document personal name disambiguation merging sentential semantic analysis. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(4): 717-723.