Short text expansion and classification based on pseudo-relevance feedback

doi:10.3785/j.issn.1008-973X.2014.10.018

Journal of Zhejiang University (Engineering Science)

Computer Technology, Telecommunication Technology

WANG Meng, LIN Lan-fen, WANG Feng

College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Abstract:

A short text expansion and classification method based on pseudo-relevance feedback (PRF) is proposed to address the sparseness problem in short text classification. Each short text is expanded with semantically similar content retrieved from the web, so that its meaning is preserved. Existing feature extraction methods for the expanded corpus use only local features; here they are extended with global feature extraction, and the global and local features are combined into a better feature vector. This alleviates the highly sparse feature matrix that the limited length of short texts causes during classification. Experiments on an open dataset, together with a comparison against results reported in other literature, indicate that the method performs well on short text classification.
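
The abstract does not spell out the retrieval step or define local and global features, so the following Python sketch is only a rough illustration of the general idea, not the authors' algorithm. It assumes a small in-memory list (`CORPUS`, made-up toy data) in place of web search, takes the top-k documents by cosine similarity as the pseudo-relevant set, treats "local" features as term frequencies over one short text plus its own feedback documents, and treats "global" features as an IDF weight computed over the whole collection. All function names are hypothetical.

```python
import math
from collections import Counter

# Toy stand-in for web search results (made-up data, not from the paper).
CORPUS = [
    "apple releases new iphone with faster chip",
    "stock market falls as investors worry about rates",
    "iphone sales lift apple quarterly revenue",
    "football team wins championship final",
]

def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would do more."""
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def expand(short_text, corpus, k=2):
    """Pseudo-relevance feedback: assume the top-k most similar docs are relevant."""
    query = Counter(tokenize(short_text))
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]

def local_features(short_text, feedback_docs):
    """Assumed 'local' view: normalized term counts over the text and its feedback docs."""
    counts = Counter(tokenize(short_text))
    for doc in feedback_docs:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def global_idf(corpus):
    """Assumed 'global' view: smoothed inverse document frequency over the collection."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}

def combined_features(short_text, corpus, k=2):
    """Weight each local term frequency by its global IDF."""
    feedback = expand(short_text, corpus, k)
    local = local_features(short_text, feedback)
    idf = global_idf(corpus)
    unseen = math.log(1 + len(corpus)) + 1.0  # IDF for terms absent from the corpus
    return {term: w * idf.get(term, unseen) for term, w in local.items()}
```

For example, `combined_features("apple iphone", CORPUS)` yields weights for terms such as `chip` and `revenue` that the two-word input never contained, which is the sense in which expansion makes the feature vector less sparse.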

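Published: 2014-10-01

CLC number: TP 391

Funding:

Supported by the Doctoral Program Fund (No. 20110101110065) and the National Science and Technology Support Program of the 12th Five-Year Plan (Nos. 2012BAD35B01-3, 2013BAF02B10).

Corresponding author: LIN Lan-fen, female, professor, doctoral supervisor. E-mail: llf@zju.edu.cn

About the author: WANG Meng (born 1986), male, Ph.D. candidate, working on natural language processing and data mining. E-mail: wangmeng@zju.edu.cn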

Cite this article:

WANG Meng, LIN Lan-fen, WANG Feng. Short text expansion and classification based on pseudo-relevance feedback [J]. Journal of Zhejiang University (Engineering Science), 10.3785/j.issn.1008-973X.2014.10.018.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2014.10.018 or http://www.zjujournals.com/eng/CN/Y2014/V48/I10/1835

