J4  2011, Vol. 45 Issue (6): 1006-1012    DOI: 10.3785/j.issn.1008-973X.2011.06.007
    
Online event detection in news stream
CHEN Wei1, ZHANG Cheng2, WANG Can1, BU Jia-jun1, CHEN Chun1, CHEN Hong3
1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
2. Information Center of China Disabled Persons’ Federation, Beijing 100034, China
3. Information Center, Zhejiang University of Science and Technology, Hangzhou 310023, China

Abstract  

Event detection in news streams is an important research area in the topic detection and tracking community, yet most existing event detection methods are offline and inaccurate. This paper introduces an online event detection algorithm for news streams. An event is modeled as a set of bursty features, that is, keywords whose frequency rises sharply as the related event emerges. A goodness-of-fit test is applied to find features whose term-frequency distribution in news documents changes markedly within a time span. A left-sided significance test is then used to validate all the bursty features that occur in that time span. Finally, evolutionary spectral clustering groups highly correlated bursty features into bursty events. Experiments on the Reuters Corpus Volume 1 show that the proposed method identifies bursty features effectively and detects events in a timely manner, and that the detected events are consistent with the corresponding real-life events.
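
The bursty-feature step can be illustrated with a minimal sketch. The abstract does not give the paper's test statistics, so everything below is an assumption for illustration: a Pearson chi-square goodness-of-fit statistic with one degree of freedom compares the number of documents containing a term in the current window against a historical baseline rate, and a simple directional check stands in for the one-sided test by keeping only terms whose frequency rose. The names `chi_square_stat`, `is_bursty`, and `bursty_features`, the 0.05 significance level, and the sample counts are all hypothetical.

```python
from typing import Dict, List

# Chi-square critical value for 1 degree of freedom at alpha = 0.05 (assumed level).
CHI2_CRIT_DF1_ALPHA_05 = 3.841


def chi_square_stat(k: int, n: int, p0: float) -> float:
    """Pearson goodness-of-fit statistic for k of n documents containing a term,
    tested against a baseline document rate p0 (two cells: with / without term)."""
    expected_with = n * p0
    expected_without = n * (1.0 - p0)
    return ((k - expected_with) ** 2 / expected_with
            + ((n - k) - expected_without) ** 2 / expected_without)


def is_bursty(k: int, n: int, p0: float,
              crit: float = CHI2_CRIT_DF1_ALPHA_05) -> bool:
    """A term is flagged only if its distribution changed significantly
    AND the change is a rise (directional check, an illustrative stand-in
    for the one-sided test)."""
    if n == 0 or not 0.0 < p0 < 1.0:
        return False  # no data, or no usable baseline
    rises = k / n > p0
    return rises and chi_square_stat(k, n, p0) > crit


def bursty_features(window_counts: Dict[str, int], n_docs: int,
                    baseline_rates: Dict[str, float]) -> List[str]:
    """Return the sorted terms flagged as bursty in the current window."""
    return sorted(
        term for term, k in window_counts.items()
        if is_bursty(k, n_docs, baseline_rates.get(term, 0.0))
    )


# Hypothetical window of 200 documents.
baseline = {"crash": 0.02, "market": 0.30, "election": 0.10}
counts = {"crash": 25, "market": 62, "election": 5}
print(bursty_features(counts, 200, baseline))
```

In this made-up window, "election" also departs significantly from its baseline, but downward, so the directional check discards it. That is the role a one-sided test plays after a goodness-of-fit test, which by itself cannot tell a rise from a fall.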



Published: 14 July 2011
CLC:  TP 391  
Cite this article:

CHEN Wei, ZHANG Cheng, WANG Can, BU Jia-jun, CHEN Chun, CHEN Hong. Online event detection in news stream. J4, 2011, 45(6): 1006-1012.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2011.06.007     OR     https://www.zjujournals.com/eng/Y2011/V45/I6/1006


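Online event detection in news data streams

To address the shortcomings of existing news-stream event detection algorithms in timeliness and accuracy, an online event detection method for news streams is proposed. The occurrence of an event is usually accompanied by a marked rise, within the corresponding time period, in the frequency of the features (i.e., keywords) that make up the event; such features are called bursty features. A goodness-of-fit test is used to check whether the distribution of the frequency with which a feature appears in news reports has changed markedly within a given time period, and a left-sided test is then used to confirm all bursty features in that period. The correlations among bursty features are analyzed, and an evolutionary spectral clustering algorithm groups highly correlated bursty features together to form events. The algorithm was applied to the Reuters Corpus Volume 1, verifying that the method can effectively discover bursty features and detect events in real time, and that the detected events agree closely with actual events.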

