基于在线层次化非负矩阵分解的文本流主题检测

doi:10.3785/j.issn.1008-973X.2016.08.026

浙江大学学报(工学版)

计算机技术

基于在线层次化非负矩阵分解的文本流主题检测

涂鼎, 陈岭, 陈根才, 吴勇, 王敬昌

1.浙江大学计算机科学与技术学院,浙江杭州 310027；
2.浙江鸿程计算机系统有限公司,浙江杭州 310009

Hierarchical online NMF for detecting and tracking topics

TU Ding, CHEN Ling, CHEN Gen cai, WU Yong, WANG Jing chang

1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;

2. Zhejiang Hongcheng Computer Systems Company Limited, Hangzhou 310009, China

全文: PDF(1228 KB) HTML

摘要：

针对文本流主题检测中存在的主题结构扁平问题,提出在线的层次化非负矩阵分解方法,在每个时间片中根据归一化累计折损增益选择主题节点进行分解,接着反复将文档分配给最相关的主题节点构建主题层次,该过程中假设主题在由不同时间片中相似主题节点构成的序列中连续再演化,在当前时间片对主题节点进行分解时考虑过去时间片中主题节点的分解结果.该方法不仅能在线的发现和更新文本流中的主题,而且还可揭示主题间的结构关系.在Nist TDT2数据集上的实验结果表明,该方法在NMI、Micro F1、MAP和NDCG等指标下均显著超过了其他动态NMF方法,并在时间效率上显示出一定优势.

Abstract:

An online hierarchical non-negative matrix fraction method was proposed to address the problem of flat topic structure of text stream topic detecting methods. At every time slot, the topic nodes were oplited-splited according to the normalized discounted cumulative gain and a topic hierarchy was built by iteratively assigning documents to the most related topic nodes. The hierarchy construction process refers the previous topic hierarchy. The underlying assumption is that the topics are evolving among the similar topic nodes in different time slots. The method can detect and track topics in stream in an online way, which reveals many useful relationships between the topics. Experiments on Nist TDT2 dataset show that our method outperforms the contrasting methods under different metrics, e.g. NMI, Micro F1, MAP and NDCG, and uses less execution time.

出版日期: 2016-08-01

TP 311

基金资助:

国家自然科学基金资助项目（60703040，61332017）；浙江省科技计划资助项目（2011C13042，2015C33002）;“核高基”国家科技重大专项资助项目（2010ZX01042-002-003）|中国工程科技知识中心资助项目（CKCEST-2014-1-5）.

通讯作者: 陈岭，男，副教授.ORCID: 0000-0003-1934-5992. E-mail: lingchen@cs.zju.edu.cn

作者简介: 涂鼎，(1987—)，男，博士，从事非结构化数据管理、数据挖掘等研究.ORCID: 0000-0001-8579-2737. E-mail：tututu111@yeah.net

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章

引用本文:

涂鼎, 陈岭, 陈根才, 吴勇, 王敬昌. 基于在线层次化非负矩阵分解的文本流主题检测[J]. 浙江大学学报(工学版), 10.3785/j.issn.1008-973X.2016.08.026.

TU Ding, CHEN Ling, CHEN Gen cai, WU Yong, WANG Jing chang. Hierarchical online NMF for detecting and tracking topics. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 10.3785/j.issn.1008-973X.2016.08.026.

链接本文:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2016.08.026 或 http://www.zjujournals.com/eng/CN/Y2016/V50/I8/1618

［1］ BLEI D M, JOHN D L. Dynamic topic models ［C］∥Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, Pennsylvania: ACM, 2006: 113-120.

［2］ WANG C, BLEI D M, HECKERMAN D. Continuous time dynamic topic models ［C］∥Uncertainty in Artificial Intelligence. Helsinki, Finland: AUAI Press, 2008:579-586.

［3］ WANG X, ANDREW M. Topics over time: a nonmarkov continuoustime model of topical trends ［C］∥Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, PA： ACM, 2006: 424-433.

［4］ MAIRAL J, FRANCIS B, JEAN P, et al. Online learning for matrix factorization and sparse coding ［J］. Journal of Machine Learning Research. 2010, 11: 19-60.

［5］ SAHA A, VIKAS S. Learning evolving and emerging topics in social media: a dynamic nmf approach with temporal regularization ［C］∥Proceedings of the Fifth ACM International Conference on Web Search and Data mining. Seattle, Washington: ACM, 2012: 693-702.

［6］ VACA C K, AMIN M, ALEJANDRO J, MARCO S. A Timebased collective factorization for topic discovery and monitoring in news ［C］∥Proceedings of the 23rd International Conference on World Wide Web. Seoul： ACM, 2014: 527-538.

［7］ BLEI D M, NG A Y, JORDAN M I. Latent dirichlet allocation ［J］. Journal of Machine Learning Research, 2003, 3: 993-1022.

［8］ THOMAS H. Probabilistic latent semantic analysis ［C］∥Uncertainty in artificial intelligence. Stockholm, Sweden: Morgan Kaufmann, 1999: 289-296.

［9］ GAUSSIER E, GOUTTE C. Relation between PLSA and NMF and implications ［C］∥Acm Sigir Conference on Research & Development in Information Retrieval. Salvador, Bahia, Brazil: ACM, 2005: 601-602.

［10］ KUANG D, HAESUN P. Fast rank2 nonnegative matrix factorization for hierarchical document clustering ［C］∥Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. Chicago, Illinois： ACM,2013: 739-747.

［11］ JENATTON R, MAIRAL J, OBOZINSKI G, et al.. Proximal methods for sparse hierarchical dictionary learning ［C］∥International Conference on Machine Learning. Haifa, Israel: Omnipress, 2010: 487-494.

［12］景丽萍,朱岩,于剑.层次非负矩阵分解及在文本聚类中的应用［J］.计算机科学与探索,2011, 5(10): 904-913.

JING Liping, ZHU Yan, YU jian. Hierarchical nonnegative matrix factorization for text clustering ［J］. Journal of Frontiers of Computer Science & Technology, 2011, 5(10): 904-913.

［13］ BLEI D M, GRIFFITHS T L, JORDAN M I, et al. Hierarchical topic models and the nested chinese restaurant process ［C］∥Proceedings of the 17th Annual Conference on Neural Information Processing Systems. Vancouver, British Columbia, Canada: MIT Press, 2003: 17-24.

［14］ HU L, LI JZ, ZHANG J, et al. oHETM: an online hierarchical entity topic model for news streams ［C］∥ Proceedings of the 19th PacificAsia Conference on Knowledge Discovery and Data Mining. Ho Chi Minh City, Vietnam: Springer, 2015: 696-707.

［15］ CAI D, MEI Q, HAN J, et al. Modeling hidden topics on document manifold ［C］∥Proceedings of the 17th ACM conference on Information and knowledge management. Napa Valley, California: ACM, 2008: 911-920.

［16］ CICHOCKI A, ZDUNEK R, PHAN A H, et al. Nonnegative matrix and tensor factorizations: applications to exploratory multiway data analysis and blind source separation ［M］. Hoboken, New Jersey. John Wiley & Sons, 2009: 131-139.

［17］ RITOV Y, BICKEL P J, TSYBAKOV A B. Simultaneous analysis of lasso and dantzig selector ［J］. Annals of Statistics, 2009, 37(4):1705-1732.

[1]	袁友伟, 余佳, 郑宏升, 王娇娇. 基于新颖性排名和多服务质量的云工作流调度算法[J]. 浙江大学学报(工学版), 2017, 51(6): 1190-1196.
[2]	许荣斌, 石军, 张鹏飞, 谢莹. Petri网的映射变迁关系相似性度量[J]. 浙江大学学报(工学版), 2017, 51(6): 1205-1213.
[3]	王海艳, 程严. 基于离散系数的双向服务选择方法[J]. 浙江大学学报(工学版), 2017, 51(6): 1197-1204.
[4]	常超, 刘克胜, 谭龙丹, 贾文超. 基于图模型的C程序数据流分析[J]. 浙江大学学报(工学版), 2017, 51(5): 1007-1015.
[5]	王继奎. 贝叶斯冲突Web数据可信度算法[J]. 浙江大学学报(工学版), 2016, 50(12): 2380-2385.
[6]	杨莎, 叶振宇, 王淑刚, 陶海, 李石坚, 潘纲, 朱斌. 感认知增强的智能机械手系统[J]. 浙江大学学报(工学版), 2016, 50(6): 1155-1159.
[7]	罗林, 苏宏业, 班岚. Dirichlet过程混合模型在非线性过程监控中的应用[J]. 浙江大学学报(工学版), 2015, 49(11): 2230-2236.
[8]	汪宏浩, 王慧泉, 金仲和. 基于增量链接的可回滚星载软件在轨更新方法[J]. 浙江大学学报(工学版), 2015, 49(4): 724-731.
[9]	王继奎, 李少波. 基于真值发现的冲突数据源质量评价算法[J]. 浙江大学学报(工学版), 2015, 49(2): 303-318.
[10]	蔡华林,陈刚,陈珂. 多类别复合资源的空间匹配[J]. 浙江大学学报(工学版), 2015, 49(1): 69-78.
[11]	俞东进,殷昱煜,吴萌萌,刘愉. 基于混合协同过滤的Web服务QoS预测方法[J]. 浙江大学学报(工学版), 2014, 48(11): 2039-2045.
[12]	柯海丰,应晶. 基于R-ELM的实时车牌字符识别技术[J]. 浙江大学学报(工学版), 2014, 48(7): 1209-1216.
[13]	刘智慧, 张泉灵. 大数据技术研究综述[J]. 浙江大学学报(工学版), 2014, 48(6): 957-972.
[14]	田甜,巩敦卫. 基于覆盖难度选择路径的测试数据进化生成[J]. 浙江大学学报(工学版), 2014, 48(5): 948-994.
[15]	柯海丰,应晶. 基于R-ELM的实时车牌字符识别技术[J]. J4, 2014, 48(2): 0-0.

Viewed

Full text

Abstract

Cited

Shared

Discussed