Journal of Zhejiang University (Engineering Science)
Computer Technology
Microblog topics summarization algorithm merging sentential semantic structure model
LIN Meng, LUO Sen-lin, JIA Cong-fei, HAN Lei, YUAN Yu-jiao, PAN Li-min
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Abstract:

To help users quickly grasp the core content of a topic from massive numbers of microblog posts, a microblog topic summarization method merging the sentential semantic structure model was proposed. The method used the sentential semantic structure model to extract the semantic cases of each sentence, which yielded the sentential semantic features. Based on the latent Dirichlet allocation (LDA) topic model, the sentential semantic structure was used to compute pairwise semantic similarities between sentences and to construct a similarity matrix. Sentences were then clustered into subtopics, which yielded the sentential relationship features. The semantic features and relationship features were combined, and the most informative sentences within each subtopic were selected as the summary. At compression ratios of 0.5%, 1.0% and 1.5%, the ROUGE values were all clearly higher than those of the comparison systems. At a compression ratio of 1.5%, ROUGE-1 reached 51.30% and ROUGE-SU* reached 25.27%. The results indicate that analysis with the sentential semantic structure model deepens the level of sentence semantic analysis, and that the extracted sentential semantic features strengthen the expression of semantic information. A sentence weighting method that considers both semantic features and relationship features enriches the feature representation of sentences, reduces the loss of semantic information, increases the semantic relevance of data in the same class, and effectively reduces the impact of noise, thereby improving the relevance of the summary to the topic. In addition, the proposed method generalizes well across different topics and is widely applicable.
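
The pipeline described in the abstract (similarity matrix, subtopic clustering, selection of the highest-weighted sentence per subtopic) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: bag-of-words cosine similarity stands in for the LDA-based similarity over sentential semantic structure, the greedy clustering rule is a hypothetical choice, and `semantic_scores`, `threshold` and `alpha` are illustrative names that do not appear in the paper.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(sentences):
    """Pairwise sentence similarity matrix.

    Assumption: plain bag-of-words cosine is used here as a stand-in for
    the LDA / sentential-semantic-structure similarity in the abstract.
    """
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    return [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]


def cluster_subtopics(sim, threshold=0.4):
    """Greedy single-pass clustering (a hypothetical choice).

    A sentence joins the first cluster whose seed sentence is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for i in range(len(sim)):
        for members in clusters:
            if sim[i][members[0]] >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def summarize(sentences, semantic_scores, threshold=0.4, alpha=0.5):
    """Pick one sentence per subtopic by a combined weight.

    weight = alpha * semantic feature + (1 - alpha) * relationship feature,
    where the relationship feature is the mean similarity of a sentence to
    the other members of its cluster (an illustrative definition).
    """
    sim = similarity_matrix(sentences)
    summary = []
    for members in cluster_subtopics(sim, threshold):
        def weight(i, members=members):
            others = [sim[i][j] for j in members if j != i]
            relation = sum(others) / len(others) if others else 0.0
            return alpha * semantic_scores[i] + (1 - alpha) * relation

        summary.append(sentences[max(members, key=weight)])
    return summary
```

Calling `summarize(sentences, semantic_scores)` with one score per sentence returns one sentence per detected subtopic. In a closer reproduction, `similarity_matrix` and the per-sentence semantic scores would be replaced by the LDA-based and semantic-case-based features the abstract describes.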

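Published: 2015-12-31
CLC number: TP 391
Funding:

Supported by the National "242" Information Security Program of China (2005C48) and the Major Project Cultivation Fund of the Science and Technology Innovation Program of Beijing Institute of Technology (2011CX01015).

Corresponding author: LUO Sen-lin, male, professor. ORCID: 0000-0002-5330-3705. E-mail: luosenlin@bit.edu.cn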

Cite this article:

LIN Meng, LUO Sen-lin, JIA Cong-fei, HAN Lei, YUAN Yu-jiao, PAN Li-min. Microblog topics summarization algorithm merging sentential semantic structure model [J]. Journal of Zhejiang University (Engineering Science). DOI: 10.3785/j.issn.1008-973X.2015.12.011.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2015.12.011
http://www.zjujournals.com/eng/CN/Y2015/V49/I12/2316

