A template extraction method for composite log

doi:10.3785/j.issn.1008-973X.2020.08.014

Journal of Zhejiang University (Engineering Science)

2020, Vol. 54, Issue (8): 1557-1561

Computer Technology

Qi WU, Xiao-hong HUANG, Yan MA, Qun CONG

1. Information Network Center, Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Beijing Wrdtech Co. Ltd, Beijing 100876, China


Abstract:

To address the problem that composite logs cannot be parsed correctly by existing template extraction algorithms, a new algorithm, the composite-log extraction algorithm (CLEA), was designed. CLEA uses symbols to divide all logs into clusters and extracts the log templates of each cluster with the Drain template extraction algorithm. The extraction results are stored and cached, and the cached templates are updated whenever a cluster is updated. A difference calculation is introduced into the simple common word algorithm, which makes the algorithm more sensitive to the words that differ between templates, and the result is used to compute the similarity between templates. The BMerge algorithm is designed to merge templates whose similarity is greater than a threshold, and the merged, lossless log is output as the final result. The proposed method is suitable for processing composite logs and achieves high accuracy.

Key words: template extraction; composite log; simple common word; similarity; JSON; log extraction
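The steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the cluster key (`symbol_signature`), the similarity formula (shared words minus a difference penalty), the greedy merge loop, the `<*>` wildcard, and the threshold value are all hypothetical choices. The Drain step is omitted, and templates are assumed to be already extracted as token lists.

```python
from collections import Counter, defaultdict

WILDCARD = "<*>"  # placeholder for a variable field (assumed notation)


def symbol_signature(line):
    """Return the symbol characters of a line, in order (assumed cluster key)."""
    return "".join(ch for ch in line if not ch.isalnum() and not ch.isspace())


def cluster_by_symbols(lines):
    """Group raw log lines whose symbol signatures are identical."""
    clusters = defaultdict(list)
    for line in lines:
        clusters[symbol_signature(line)].append(line)
    return dict(clusters)


def similarity(a, b):
    """Common-word ratio reduced by a difference penalty (assumed formula).

    common = number of words shared by both templates (multiset intersection)
    differ = number of words that appear in only one of the templates
    score  = max(0, common - differ) / max(len(a), len(b))
    """
    if not a or not b:
        return 0.0
    common = sum((Counter(a) & Counter(b)).values())
    differ = (len(a) - common) + (len(b) - common)
    return max(0, common - differ) / max(len(a), len(b))


def merge_pair(a, b):
    """Merge two equal-length templates, masking positions that disagree."""
    return [x if x == y else WILDCARD for x, y in zip(a, b)]


def merge_templates(templates, threshold=0.5):
    """Greedily fold each template into the first kept one that is similar enough."""
    merged = []
    for tpl in templates:
        for i, kept in enumerate(merged):
            if len(kept) == len(tpl) and similarity(kept, tpl) > threshold:
                merged[i] = merge_pair(kept, tpl)
                break
        else:
            merged.append(list(tpl))
    return merged


if __name__ == "__main__":
    templates = [
        "query from <*> for <*> type A".split(),
        "query from <*> for <*> type AAAA".split(),
        "link <*> changed state to down".split(),
    ]
    for tpl in merge_templates(templates):
        print(" ".join(tpl))
```

Subtracting the difference count is one simple way to make a shared-word score drop faster when templates disagree. A real implementation would follow the definitions in the full paper.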

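Received: 2019-09-24  Published: 2020-08-28

CLC: TP 301

Funding: Fundamental Research Funds for the Central Universities (2019RC53); National CNGI Special Project (CNGI-12-03-001)

Corresponding author: Yan MA  E-mail: njwuqi123@126.com; mayan@bupt.edu.cn

About the author: Qi WU (b. 1995), male, master's student, engaged in software engineering research. orcid.org/0000-0002-6320-2246. E-mail: njwuqi123@126.com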


Cite this article:

Qi WU, Xiao-hong HUANG, Yan MA, Qun CONG. A template extraction method for composite log[J]. Journal of Zhejiang University (Engineering Science), 2020, 54(8): 1557-1561.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2020.08.014 or http://www.zjujournals.com/eng/CN/Y2020/V54/I8/1557

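Fig. 1 Example of the log classification tree structure in the CLEA algorithm

Tab. 1 Experimental environment for validating the CLEA log template extraction algorithm

Tab. 2 Accuracy of CLEA and Drain in partitioning different logs

Tab. 3 Processing time of several log template extraction algorithms on different logs

Fig. 2 Accuracy of several log template extraction algorithms in parsing DNS logs

Fig. 3 Accuracy of several log template extraction algorithms in parsing Huawei switch logs

Tab. 4 Final accuracy of several log template extraction algorithms on different logs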

