A method for automatically generating large-scale named entity recognition corpora from Chinese Wikipedia

doi:10.1631/FITEE.1500067

Front. Inform. Technol. Electron. Eng., 2015, Vol. 16, Issue 11: 940-956

Jie Zhou, Bi-cheng Li, Gang Chen

Department of Signal Analysis and Information Processing, Zhengzhou Information Science and Technology Institute, Zhengzhou 450002, China

Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

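Summary: Objective: Named entity recognition (NER) is an important foundational task in natural language processing, and the current mainstream approaches are based on supervised machine learning. These approaches depend on annotated corpora for specific languages and domains, and annotating such corpora consumes substantial manpower and resources. This paper proposes a method for automatically generating large-scale NER corpora from Chinese Wikipedia. The method automatically extracts and labels sentences from Chinese Wikipedia, providing effective corpus support for Chinese NER tasks.
Innovations: Based on the characteristics of Chinese Wikipedia, four types of heuristic rules are designed and combined with a supervised named entity classifier to identify the entity types of Chinese Wikipedia entries accurately and comprehensively. To avoid missing annotations caused by missing wiki links, the boundary information of outgoing links is used to discover implicit mentions in wiki documents, and entity linking is used to identify the entity types of ambiguous mentions. A corpus selection method based on core entry expansion is proposed to achieve domain adaptation to the test data.
Methods: The overall workflow is shown in Fig. 2 of the original paper. The method consists of three main steps: entity classification of explicit mentions, type identification of implicit mentions, and annotated corpus selection. For entity classification of explicit mentions, heuristic rules are combined with a supervised entity classifier to achieve accurate and comprehensive type identification. For type identification of implicit mentions, a new method is proposed to discover implicit mentions in wiki documents and identify the entity types of ambiguous mentions. For corpus selection, a method based on core entry expansion is proposed to achieve domain adaptation to the test data.
Conclusions: The experimental results show that the proposed method can automatically generate large-scale Chinese NER corpora. Moreover, NER models trained on the generated corpora combined with gold-standard corpora perform better.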
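The implicit-mention step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the anchor texts of an article's outgoing links have already been mapped to entity types, and that any unlinked occurrence of such an anchor text in the same article can be tagged with that type. The names `tag_sentence`, `explicit`, and `surface_types` are hypothetical, and ambiguous surface forms (which the paper resolves with entity linking) are assumed to have been left out of `surface_types`.

```python
from typing import Dict, List, Tuple


def tag_sentence(sentence: str,
                 explicit: List[Tuple[int, int, str]],
                 surface_types: Dict[str, str]) -> List[str]:
    """Return one BIO tag per character of `sentence`.

    explicit: (start, end, type) spans covered by wiki links (assumed input).
    surface_types: anchor text -> entity type, collected from the article's
        outgoing links; assumed to hold unambiguous surface forms only.
    """
    tags = ["O"] * len(sentence)

    def mark(start: int, end: int, entity_type: str) -> None:
        # Skip spans that overlap characters which are already tagged.
        if any(tag != "O" for tag in tags[start:end]):
            return
        tags[start] = "B-" + entity_type
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type

    # Explicit mentions: spans that carry a wiki link.
    for start, end, entity_type in explicit:
        mark(start, end, entity_type)

    # Implicit mentions: unlinked occurrences of known anchor texts.
    # Longer surface forms are matched first so that they take priority.
    for surface in sorted(surface_types, key=len, reverse=True):
        pos = sentence.find(surface)
        while pos != -1:
            mark(pos, pos + len(surface), surface_types[surface])
            pos = sentence.find(surface, pos + 1)
    return tags


sentence = "浙江大学位于杭州，杭州是省会城市。"
links = [(0, 4, "ORG"), (6, 8, "LOC")]          # linked spans in this sentence
known = {"浙江大学": "ORG", "杭州": "LOC"}        # anchor texts seen in the article
tags = tag_sentence(sentence, links, known)
```

In this example the second occurrence of 杭州 carries no link, yet it receives the same LOC tag as the linked first occurrence. Character-level tags are used because Chinese text has no whitespace word boundaries.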

Abstract: Named entity recognition (NER) is a core component of many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotation for each language and domain. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel, language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; combining the two improves precision and coverage. We then identify the types of implicit mentions by using the boundary information of outgoing links. By selecting sentences related to the domains of the test data, we can train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of the automatically annotated corpora, and the trained NER models achieve the best performance when our silver-standard corpora are combined with gold-standard corpora.
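One way to combine heuristic rules with a supervised classifier is sketched below. The abstract does not list the four rule types or say how the two components are combined, so everything here is an assumption for illustration: the rule `category_suffix_rule`, its keyword table, the stand-in `dummy_classifier`, and the rules-first, classifier-as-fallback order are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Entry = Dict[str, List[str]]
Rule = Callable[[Entry], Optional[str]]

# Hypothetical keyword table: category-name suffix -> entity type.
CATEGORY_SUFFIX_TYPES = {"人物": "PER", "城市": "LOC", "公司": "ORG"}


def category_suffix_rule(entry: Entry) -> Optional[str]:
    """Illustrative rule: infer the type from the end of a category name."""
    for category in entry.get("categories", []):
        for suffix, entity_type in CATEGORY_SUFFIX_TYPES.items():
            if category.endswith(suffix):
                return entity_type
    return None  # The rule does not fire for this entry.


def classify_entry(entry: Entry,
                   rules: List[Rule],
                   fallback: Callable[[Entry], str]) -> str:
    """Use the first rule that fires; otherwise ask the classifier."""
    for rule in rules:
        entity_type = rule(entry)
        if entity_type is not None:
            return entity_type
    return fallback(entry)


def dummy_classifier(entry: Entry) -> str:
    """Stand-in for a trained supervised classifier (assumption)."""
    return "NON"


city = {"categories": ["浙江城市"]}
topic = {"categories": ["计算机科学"]}
city_type = classify_entry(city, [category_suffix_rule], dummy_classifier)
topic_type = classify_entry(topic, [category_suffix_rule], dummy_classifier)
```

In this arrangement the rules supply high-precision decisions where they apply, and the classifier extends coverage to entries that no rule matches.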

Key words: NER corpora; Chinese Wikipedia; Entity classification; Domain adaptation; Corpus selection

Received: 2015-03-07; Published: 2015-11-04

CLC number: TP391

Cite this article:

Jie Zhou, Bi-cheng Li, Gang Chen. Automatically building large-scale named entity recognition corpora from Chinese Wikipedia. Front. Inform. Technol. Electron. Eng., 2015, 16(11): 940-956.

Link to this article:

http://www.zjujournals.com/xueshu/fitee/CN/10.1631/FITEE.1500067 or http://www.zjujournals.com/xueshu/fitee/CN/Y2015/V16/I11/940
