Journal of ZheJiang University (Engineering Science)  2020, Vol. 54 Issue (9): 1753-1760    DOI: 10.3785/j.issn.1008-973X.2020.09.011
Detection of DNS tunnels based on log statistics feature
Qi WANG1, Kun XIE1, Yan MA1,*, Qun CONG2
1. Information Network Center, Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Beijing Wrdtech Co. Ltd, Beijing 100876, China
DNS server logs were used as the data source to extract multi-dimensional statistical features for each second-level domain, such as the entropy of the domain name, the number of subdomains, and the cache hit ratio, so that the logs were quantized into a set of feature vectors. A random forest model was trained on this feature vector set, and its parameters were tuned by ten-fold cross-validation to improve the overall detection accuracy. Comparative experiments were then run with different classification algorithms, and the results were compared with existing research methods. The experimental results show that the proposed detection method achieves a precision of no less than 90% at a recall of 98.5%, an improvement in detection accuracy. The proposed algorithm can therefore effectively detect DNS tunnels.

Key words: DNS tunnel, log analysis, DNS cache, random forest, malicious domain name
Received: 24 September 2019      Published: 22 September 2020
CLC:  TP 302  
Corresponding Author: Yan MA
Qi WANG, Kun XIE, Yan MA, Qun CONG. Detection of DNS tunnels based on log statistics feature. Journal of ZheJiang University (Engineering Science), 2020, 54(9): 1753-1760.

Fig.1 Flow chart of DNS tunnel detection method based on log statistical characteristics
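Parameter              Meaning
$E$                    Entropy of the second-level domain name
${P_{\rm{t}}}$         Proportion of queries for the A/AAAA resource record types
${C_{\rm{s}}}$         Number of subdomains under the second-level domain
${P_{\rm{s}}}$         Proportion of unique subdomains under the second-level domain
$L$                    Domain name length
${L_{{\rm{mvd}}}}$     Longest vowel distance
${P_{\rm{c}}}$         Cache hit ratio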
Tab.1 Log features used by proposed detection algorithm
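The page lists the features in Tab.1 without giving their formulas, so the following Python sketch only illustrates one plausible way to compute them. Everything in it is an assumption: the log record layout `(qname, qtype, cache_hit)`, the helper names, character-level Shannon entropy for $E$, reading ${L_{{\rm{mvd}}}}$ as the longest run of non-vowel characters, and taking ${P_{\rm{s}}}$ as distinct names divided by total queries. The second-level domain is taken naively as the last two labels.

```python
import math
from collections import Counter, defaultdict

VOWELS = set("aeiou")


def shannon_entropy(text):
    """Character-level Shannon entropy in bits (assumed definition of E)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def longest_vowel_distance(text):
    """Longest run of consecutive non-vowel characters (assumed reading of L_mvd)."""
    longest = run = 0
    for ch in text.lower():
        if ch in VOWELS:
            run = 0
        else:
            run += 1
            longest = max(longest, run)
    return longest


def second_level_domain(qname):
    """Naive second-level domain: the last two labels of the query name."""
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def aggregate(records):
    """Group hypothetical (qname, qtype, cache_hit) records by second-level domain."""
    groups = defaultdict(list)
    for qname, qtype, cache_hit in records:
        name = qname.rstrip(".").lower()
        groups[second_level_domain(name)].append((name, qtype, cache_hit))

    features = {}
    for sld, rows in groups.items():
        total = len(rows)
        names = {name for name, _, _ in rows}
        features[sld] = {
            "C_s": len(names),                      # distinct names seen under the domain
            "P_s": len(names) / total,              # distinct names per query (assumed)
            "P_t": sum(t in ("A", "AAAA") for _, t, _ in rows) / total,
            "P_c": sum(bool(hit) for _, _, hit in rows) / total,
        }
    return features
```

Under these assumptions, a tunnel domain that encodes data in ever-changing subdomains would tend to show a high ${C_{\rm{s}}}$ and ${P_{\rm{s}}}$ and a low ${P_{\rm{c}}}$, because each new name misses the cache. The domain length $L$ is simply `len(name)`.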
Fig.2 Training flow of the DNS tunnel detection model
Confusion matrix     Predicted positive     Predicted negative
Actual positive      TP                     FN
Actual negative      FP                     TN
Tab.2 Confusion matrix
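The entries of Tab.2 feed the usual evaluation metrics. The Python sketch below uses the standard textbook definitions of precision, recall, and false positive rate. It is an assumption that these match the page's ${R_{\rm{pre}}}$ and ${R_{\rm{re}}}$ exactly, since the formulas are not reproduced here. The function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """Fraction of actual negatives flagged as positive: FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0
```

Recall and false positive rate are also the two axes of an ROC curve, so sweeping the classifier's decision threshold and recomputing them at each step traces out a curve like the sample in Fig.3.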
Fig.3 Receiver operating characteristic (ROC) curve sample
Fig.4 Distribution of longest vowel distance between normal flow and tunnel flow
Fig.5 Cache hit ratio distribution of normal traffic and tunnel traffic
Classification algorithm     ${\delta}$     ${R_{\rm{pre}}}$/%     ${R_{\rm{re}}}$/%
Logistic regression          0.902          80.7                   95.4
Naive Bayes                  0.818          72.7                   90.1
SVM                          0.951          93.6                   97.2
Random forest                0.991          95.2                   98.5
Tab.3 Comparison of experimental results by different classification algorithms
Method                    ${R_{\rm{pre}}}$/%     ${R_{\rm{re}}}$/%
Proposed method           95.2                   98.5
Comparison experiment     88.1                   97.9
Tab.4 Comparison of experimental results under different feature dimensions
Fig.6 Comparison of ten-fold cross-validation model scores under different feature dimensions
