J4  2011, Vol. 45 Issue (6): 1006-1012    DOI: 10.3785/j.issn.1008-973X.2011.06.007
Automation Technology, Computer Technology
Online event detection in news stream
CHEN Wei1, ZHANG Cheng2, WANG Can1, BU Jia-jun1, CHEN Chun1, CHEN Hong3
1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China; 2. Information Center of China Disabled Persons' Federation, Beijing 100034, China; 3. Information Center, Zhejiang University of Science and Technology, Hangzhou 310023, China
Abstract:

Event detection in news streams is an important research area in the topic detection and tracking community, yet existing event detection algorithms for news streams fall short in timeliness and accuracy. This paper proposes an online event detection method for news streams. The occurrence of an event is usually accompanied by a marked rise, during the corresponding time span, in the frequency of the features (keywords) that make up the event; such features are called bursty features. A goodness-of-fit test is applied to detect whether the distribution of a feature's frequency across the news reports in a given time span has changed significantly, and a left-side significance test is then used to confirm all the bursty features in that time span. The correlations among bursty features are analyzed, and evolutionary spectral clustering groups highly correlated bursty features into events. Experiments on Reuters Corpus Volume 1 show that the method effectively identifies bursty features and detects events in a timely manner, and that the detected events agree closely with the corresponding real-world events.
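
The two stages named in the abstract, finding bursty features and then grouping correlated ones into events, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the exact test statistics, so a one-sided binomial tail test stands in for the goodness-of-fit and left-side tests, and connected components over Jaccard overlap stand in for evolutionary spectral clustering. The function names, the smoothing, and the thresholds `alpha` and `threshold` are all assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bursty_features(history, current, n_hist, n_cur, alpha=0.01):
    """Return features whose document frequency in the current window is
    significantly higher than their historical rate (assumed stand-in test).

    history/current map feature -> number of documents containing it;
    n_hist/n_cur are the total document counts of each period.
    """
    bursty = []
    for feature, k in current.items():
        # Add-one smoothing so unseen features get a small nonzero baseline.
        p0 = (history.get(feature, 0) + 1) / (n_hist + 2)
        if binomial_upper_tail(k, n_cur, p0) < alpha:
            bursty.append(feature)
    return sorted(bursty)


def group_features(doc_sets, threshold=0.5):
    """Group features whose document sets overlap (Jaccard >= threshold).

    Connected components are a simplified stand-in for spectral clustering.
    """
    parent = {f: f for f in doc_sets}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path compression
            f = parent[f]
        return f

    for a, b in combinations(sorted(doc_sets), 2):
        union = doc_sets[a] | doc_sets[b]
        if union and len(doc_sets[a] & doc_sets[b]) / len(union) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for f in sorted(doc_sets):
        groups.setdefault(find(f), []).append(f)
    return sorted(groups.values())


# Hypothetical counts: 1000 historical documents, 100 in the current window.
history = {"crash": 5, "flight": 20, "market": 300}
current = {"crash": 30, "flight": 35, "market": 32}
print(bursty_features(history, current, n_hist=1000, n_cur=100))

# Hypothetical sets of document ids in which each feature appears.
doc_sets = {"crash": {1, 2, 3, 4}, "flight": {2, 3, 4, 5}, "market": {9, 10}}
print(group_features(doc_sets))
```

With these made-up counts, "crash" and "flight" are flagged because their current frequency far exceeds their historical rate, while "market" stays near its baseline. The grouping step then places "crash" and "flight" in one candidate event. An evolutionary clustering method would additionally smooth cluster assignments across consecutive time windows, which this sketch omits.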

Published: 2011-07-14
CLC number: TP 391
Funding:

Supported by the National Key Technology R&D Program of China (2008BAH26B00).

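Corresponding author: WANG Can, male, lecturer. E-mail: wcan@zju.edu.cn
About the author: CHEN Wei (born 1983), male, Ph.D. candidate, working on information retrieval and data mining. E-mail: chenw@zju.edu.cn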

Cite this article:

CHEN Wei, ZHANG Cheng, WANG Can, BU Jia-jun, CHEN Chun, CHEN Hong. Online event detection in news stream[J]. J4, 2011, 45(6): 1006-1012.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2011.06.007
https://www.zjujournals.com/eng/CN/Y2011/V45/I6/1006

