Computer Technology
Microblog topics summarization algorithm merging sentential semantic structure model
LIN Meng, LUO Sen-lin, JIA Cong-fei, HAN Lei, YUAN Yu-jiao, PAN Li-min
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Abstract A microblog summarization framework based on the sentential semantic structure model was proposed to provide concise summaries that help users quickly grasp the essence of a topic. Sentential semantic features were extracted with the sentential semantic structure model. The latent Dirichlet allocation (LDA) topic model was used to calculate pairwise sentence similarities from the sentential semantic structure and to construct the similarity matrix. Sentences were then clustered into several subtopics, from which the sentential relationship features were obtained. The most informative sentences were extracted from each subtopic by combining the sentential semantic features and the relationship features. The ROUGE values of the proposed method exceeded those of the contrast algorithms at compression ratios of 0.5%, 1.0% and 1.5%. At a compression ratio of 1.5%, ROUGE-1 was 51.30% and ROUGE-SU* was 25.27%. Results indicate that introducing the sentential semantic structure model deepens the analysis of sentence semantics, and that the extracted semantic features strengthen the representation of semantic information. Using both sentential semantic features and relationship features enriches the feature representation, reduces information loss, increases the semantic relevance of similar data, and reduces the impact of noise. The proposed method also generalizes well and can be applied to various topics.
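The selection pipeline described in the abstract (similarity matrix, subtopic clustering, one representative sentence per subtopic) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each sentence is assumed to already have an LDA topic distribution and a scalar semantic-feature score, cosine similarity stands in for the paper's similarity based on sentential semantic structure, a greedy threshold clustering stands in for the unspecified clustering step, and the names `summarize`, `threshold`, and `alpha` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(topic_vectors):
    """Pairwise similarity matrix over all sentences."""
    n = len(topic_vectors)
    return [[cosine(topic_vectors[i], topic_vectors[j]) for j in range(n)]
            for i in range(n)]


def cluster_subtopics(sim, threshold=0.8):
    """Greedy single-pass clustering (an assumption, not the paper's method):
    a sentence joins the first cluster whose seed sentence is similar enough,
    otherwise it starts a new cluster."""
    clusters = []
    for i in range(len(sim)):
        for members in clusters:
            if sim[i][members[0]] >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def summarize(topic_vectors, semantic_scores, threshold=0.8, alpha=0.5):
    """Return the index of the highest-weighted sentence in each subtopic.

    The weight mixes a semantic-feature score with a relationship feature,
    taken here (hypothetically) as the mean similarity to the other
    sentences in the same subtopic.
    """
    sim = similarity_matrix(topic_vectors)
    summary = []
    for members in cluster_subtopics(sim, threshold):
        def weight(i):
            others = [sim[i][j] for j in members if j != i]
            relation = sum(others) / len(others) if others else 0.0
            return alpha * semantic_scores[i] + (1 - alpha) * relation
        summary.append(max(members, key=weight))
    return summary


# Toy input: four sentences over two topics, with made-up semantic scores.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
semantic_scores = [0.2, 0.9, 0.7, 0.3]
selected = summarize(topic_vectors, semantic_scores)
```

In this toy input the first two sentences share one dominant topic and the last two share the other, so two subtopics form and one sentence is chosen from each. The equal weighting `alpha=0.5` is arbitrary; the abstract does not state how the two feature types are combined.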
Published: 31 December 2015
About the author: LIN Meng (b. 1991), female, master's student, working on Chinese information processing. ORCID: 0000-0002-1970-5532. E-mail: lemon0919@bit.edu.cn
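Microblog topic summarization algorithm fusing the sentential semantic structure model (融合句义结构模型的微博话题摘要算法)
To obtain the core content of a topic more quickly from massive numbers of microblog posts, a microblog topic summarization method that fuses the sentential semantic structure model is proposed. The method uses the sentential semantic structure model to extract the semantic cases of each sentence as its semantic features. Based on the LDA topic model, it uses the sentential semantic structure to compute the pairwise semantic similarity between sentences, builds a similarity matrix, partitions the sentences into subtopic classes, and obtains the relationship features of the sentences. The semantic features and relationship features are fused, and the most informative sentences within each subtopic are selected as the summary. At compression ratios of 0.5%, 1.0% and 1.5%, the ROUGE values are all clearly better than those of the comparison systems. At a compression ratio of 1.5%, ROUGE-1 reaches 51.30% and ROUGE-SU* reaches 25.27%. The experimental results show that analysis with the sentential semantic structure model deepens the semantic analysis of sentences, and that the extracted sentential semantic features strengthen the expression of semantic information. A sentence weighting method that considers both semantic features and relationship features enriches the feature representation of sentences, reduces the loss of semantic information, strengthens the semantic relevance of data in the same class, and effectively reduces the impact of noise, thereby improving the relevance of the summary to the topic. In addition, the proposed method generalizes fairly well across different topics and has a wide range of application.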
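For readers unfamiliar with the ROUGE-1 figure reported above, the following is a minimal sketch of unigram recall, the quantity at the core of ROUGE-1. It assumes pre-tokenized input and a single reference summary; the function name `rouge_1_recall` is hypothetical, and the source does not state which ROUGE variant (recall, precision, or F-score) or which tokenization underlies the reported values.

```python
from collections import Counter


def rouge_1_recall(candidate_tokens, reference_tokens):
    """Fraction of reference unigrams that also occur in the candidate,
    with each token's matches clipped to its count in the candidate."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


# Toy example: five of the six reference tokens are matched.
candidate = "the cat sat on the mat".split()
reference = "the cat was on the mat".split()
score = rouge_1_recall(candidate, reference)
```

For Chinese text the tokens would typically be characters or segmented words rather than whitespace-separated strings, which is a preprocessing choice outside this sketch.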