Journal of Zhejiang University (Engineering Science)
Computer Science and Technology
High-accuracy Chinese predicate recognition method combining lexical and syntactic features
HAN Lei (韩磊), LUO Sen-lin (罗森林), PAN Li-min (潘丽敏), WEI Chao (魏超)
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Abstract:

To support the systematic study of Chinese predicates, a predicate recognition method is proposed that fuses lexical and syntactic features and combines the C4.5 machine learning algorithm with rules. The method extracts features separately from the lexical and the syntactic information of a sentence. Lexical feature extraction yields the candidate predicates in the sentence and their number. Manually summarized rules are then applied to the lexical features: samples that satisfy a rule are given a result directly, while for samples that satisfy no rule the lexical and syntactic features are fused and classified with C4.5 to obtain the recognition result. Using the Beijing Forest Studio Chinese Tagged Corpus (BFS-CTC), an annotated corpus containing more than 20 000 predicates, a series of experiments was carried out on feature and parameter selection, syntactic feature verification, training data size selection, and algorithm accuracy, in order to study how these factors affect predicate recognition. The results show that syntactic features effectively improve predicate recognition, that accuracy levels off as the amount of training data increases, and that the method reaches a high accuracy of 99%.
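The two-stage flow described in the abstract (rules first, classifier for whatever the rules leave undecided) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the candidate POS tags, the single rule "a lone candidate is the predicate", the `depth` field used as a stand-in syntactic feature, and the `stub_tree` function standing in for a trained C4.5 tree are all assumptions introduced here, since the abstract does not list the actual rules or features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# POS tags treated as candidate-predicate tags (assumed for illustration).
CANDIDATE_TAGS = {"v", "a"}


@dataclass
class Token:
    word: str
    pos: str    # lexical information: part-of-speech tag
    depth: int  # hypothetical syntactic information: depth in a parse tree


def candidate_predicates(sentence: List[Token]) -> List[int]:
    """Lexical step: indices of tokens whose POS tag marks a candidate."""
    return [i for i, tok in enumerate(sentence) if tok.pos in CANDIDATE_TAGS]


def rule_filter(candidates: List[int]) -> Optional[int]:
    """Hypothetical rule: a lone candidate is accepted directly.

    Returns None when the rule does not apply.
    """
    return candidates[0] if len(candidates) == 1 else None


def fused_features(sentence: List[Token], index: int, n_candidates: int) -> Dict:
    """Fuse lexical and (stand-in) syntactic features for one candidate."""
    tok = sentence[index]
    return {"pos": tok.pos, "n_candidates": n_candidates, "depth": tok.depth}


def recognize_predicate(
    sentence: List[Token], classify: Callable[[Dict], bool]
) -> Optional[int]:
    """Return the index of the recognized predicate, or None if none is found."""
    candidates = candidate_predicates(sentence)
    if not candidates:
        return None
    decided = rule_filter(candidates)
    if decided is not None:
        return decided  # resolved by the rule, no classifier needed
    for i in candidates:  # classifier handles the remaining samples
        if classify(fused_features(sentence, i, len(candidates))):
            return i
    return None


def stub_tree(features: Dict) -> bool:
    """Stand-in for a trained C4.5 decision tree (assumed behavior)."""
    return features["depth"] == 1


sentence = [
    Token("他", "r", 2),
    Token("喜欢", "v", 1),
    Token("读", "v", 2),
    Token("书", "n", 3),
]
print(recognize_predicate(sentence, stub_tree))  # 1
```

In a real system `stub_tree` would be replaced by a decision tree trained on the annotated corpus; the point of the sketch is only that the rule stage short-circuits easy cases so the classifier sees only the samples the rules cannot decide.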

Published: 2014-12-01
CLC number: TP 391
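Funding:

Supported by the Beijing Institute of Technology special program for graduate science and technology innovation activities.

Corresponding author: PAN Li-min, female, engineer. E-mail: panlimin_bit@126.com
About the author: HAN Lei (born 1985), male, Ph.D. candidate, working on Chinese information processing. E-mail: lei.glory@gmail.com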
Cite this article:

HAN Lei, LUO Sen-lin, PAN Li-min, WEI Chao. High-accuracy Chinese predicate recognition method combining lexical and syntactic features[J]. Journal of Zhejiang University (Engineering Science), 10.3785/j.issn.1008-973X.2014.12.002.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2014.12.002
http://www.zjujournals.com/eng/CN/Y2014/V48/I12/2107

