1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China 2. Beijing Key Laboratory of Trusted Computing, Beijing 100124, China 3. National Engineering Laboratory for Critical Technologies of Information Security Classified Protection, Beijing 100124, China 4. Artificial Intelligence and Machine Learning (AIML) Lab, School of Computer Science, Carleton University, Ottawa K1S 5B6, Canada
A code vulnerability detection method based on contextual feature fusion was proposed to address the high false positive and false negative rates of existing code vulnerability detection methods. Code features were decoupled into local features of code blocks and global features of the context. The local features of a code block focus on the semantics of key tokens and short-distance dependencies, while the global context features are obtained by fusing the local block features to capture long-distance dependencies across lines of code. By learning local and global information collaboratively, the feature learning ability of the model was improved and the programming patterns of code vulnerabilities were discovered more accurately. A code vulnerability comparison mapping module was introduced to widen the distance between positive and negative samples in the embedding space, so that the model can distinguish positive from negative samples more accurately. Experimental results on a real-world dataset drawn from the source code of nine open-source software projects show that precision is improved by up to 29% and recall by up to 16%.
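The three modules named above (local feature extraction, global feature fusion, and vulnerability comparison mapping; cf. Fig.2-Fig.4) can be pictured with a minimal sketch. The sketch below assumes PyTorch, multi-size 1-D convolutions over code-block tokens, a bidirectional GRU for cross-line fusion, and a margin-based contrastive loss; these concrete layer choices, dimensions, and class names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the three modules described in the abstract (assumed layers).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalFeatureExtractor(nn.Module):
    """Code block local features: key-token semantics and short-distance dependencies."""

    def __init__(self, vocab_size: int, emb_dim: int = 128,
                 kernel_sizes=(5, 15, 25, 35), channels: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, block_tokens: torch.Tensor) -> torch.Tensor:
        # block_tokens: (batch, n_blocks, block_len) token ids
        b, n, l = block_tokens.shape
        x = self.embed(block_tokens.reshape(b * n, l)).transpose(1, 2)   # (b*n, emb, l)
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1).reshape(b, n, -1)                 # (batch, n_blocks, feat)


class GlobalFeatureFusion(nn.Module):
    """Fuse block-level features to capture long-distance, cross-line context."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, block_feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(block_feats)      # (batch, n_blocks, 2*hidden)
        return out.mean(dim=1)              # pooled function-level feature


class ContrastiveMappingHead(nn.Module):
    """Map samples so that positive and negative samples are pushed apart."""

    def __init__(self, in_dim: int, proj_dim: int = 64, margin: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(in_dim, proj_dim)
        self.margin = margin

    def forward(self, g1, g2, same_label):
        # same_label: 1.0 if the pair shares a label, 0.0 otherwise
        z1 = F.normalize(self.proj(g1), dim=1)
        z2 = F.normalize(self.proj(g2), dim=1)
        d = (z1 - z2).norm(dim=1)
        # Margin-based contrastive loss: pull same-class pairs together,
        # push different-class pairs at least `margin` apart.
        return (same_label * d.pow(2)
                + (1 - same_label) * F.relu(self.margin - d).pow(2)).mean()
```

In such a design, the pooled global feature would feed both an ordinary classifier and the contrastive head, so the supervised loss and the pair-distance loss can be optimized jointly.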
Fig.1 General framework of code vulnerability detection method based on contextual feature fusion
Fig.2 Code local feature extraction module
Fig.3 Code global feature fusion module
Fig.4 Code vulnerability comparison mapping module
| Open-source software project | Nnon | Nvul |
| --- | --- | --- |
| Asterisk | 17 755 | 94 |
| FFmpeg | 5 552 | 249 |
| HTTPD | 3 850 | 57 |
| LibPNG | 577 | 45 |
| LibTIFF | 731 | 123 |
| OpenSSL | 7 068 | 159 |
| Pidgin | 8 626 | 29 |
| VLC Player | 6 115 | 44 |
| Xen | 9 023 | 671 |
| Total | 59 297 | 1 471 |

Tab.1 Number of functions in the dataset (Nvul: vulnerable functions, Nnon: non-vulnerable functions)
| Split | Nvul | Nnon |
| --- | --- | --- |
| Training set | 883 | 35 575 |
| Test set | 294 | 11 861 |
| Validation set | 289 | 11 855 |

Tab.2 Number of functions in the training, test, and validation sets
| l | P@1 | R@1 | P@10 | R@10 | P@20 | R@20 | P@50 | R@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 500 | 98 | 40 | 22 | 94 | 20 | 95 | 20 | 95 |
| 1 800 | 99 | 41 | 23 | 97 | 11 | 98 | 6 | 99 |
| 2 000 | 97 | 40 | 22 | 91 | 14 | 93 | 14 | 93 |

Tab.3 Results for different token sequence lengths l (values in %)

Fig.5 Top K% recall rate corresponding to different token sequence lengths
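Tab.3-Tab.9 report P@K and R@K for K = 1, 10, 20, 50. The excerpt does not define these metrics explicitly; a common reading, assumed here rather than taken from the paper, is that functions are ranked by predicted vulnerability score, the top K% are retained, and precision and recall are computed over that retained set:

```python
# Hedged helper for the Top-K% metrics used in the tables (assumed ranking definition).
import numpy as np


def precision_recall_at_k(scores: np.ndarray, labels: np.ndarray, k_percent: float):
    """P@K%: fraction of truly vulnerable functions among the top K% ranked ones.
    R@K%: fraction of all vulnerable functions that fall inside that top K%."""
    order = np.argsort(-scores)                        # highest predicted score first
    top_n = max(1, int(len(scores) * k_percent / 100))
    hits = labels[order[:top_n]].sum()
    return hits / top_n, hits / max(1, labels.sum())


# Tiny example: 1 000 functions, 30 truly vulnerable, noisy scores.
rng = np.random.default_rng(0)
labels = np.zeros(1000, dtype=int)
labels[rng.choice(1000, size=30, replace=False)] = 1
scores = rng.random(1000) + 0.8 * labels               # vulnerable ones tend to score higher
for k in (1, 10, 20, 50):
    print(k, precision_recall_at_k(scores, labels, k))
```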
| Convolutional kernel sizes | P@1 | R@1 | P@10 | R@10 | P@20 | R@20 | P@50 | R@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| {5, 15, 25, 35} | 99 | 41 | 23 | 97 | 11 | 98 | 6 | 99 |
| {15, 25, 35, 45} | 85 | 35 | 20 | 84 | 14 | 88 | 14 | 88 |
| {25, 35, 45, 55} | 71 | 29 | 16 | 68 | 11 | 73 | 11 | 73 |

Tab.4 Results for different convolutional kernel sizes (values in %)

Fig.6 Top K% recall rate corresponding to different convolutional kernel sizes
| Model | P@1 | R@1 | P@10 | R@10 | P@20 | R@20 | P@50 | R@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bi-LSTM [21] | 54 | 22 | 18 | 75 | 10 | 87 | 5 | 99 |
| Text-CNN [21] | 70 | 29 | 20 | 81 | 11 | 90 | 5 | 97 |
| DNN [21] | 44 | 18 | 15 | 62 | 10 | 80 | 5 | 96 |
| Proposed method | 99 | 41 | 23 | 97 | 11 | 98 | 6 | 99 |

Tab.5 Precision and recall rates of the comparative experiment with Bi-LSTM, Text-CNN, and DNN (values in %)
| Model | ACC/% | F1/% |
| --- | --- | --- |
| Devign [32] | 89.27 | 41.12 |
| Proposed method | 96.62 | 79.83 |

Tab.6 Accuracy and F1 score results of the comparative experiment with Devign
| Sample balancing method used | P@1 | R@1 | P@10 | R@10 | P@20 | R@20 | P@50 | R@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No | 95 | 39 | 21 | 90 | 16 | 92 | 16 | 92 |
| Yes | 99 | 41 | 23 | 97 | 11 | 98 | 6 | 99 |

Tab.7 Ablation experiment results of the sample balancing method (values in %)
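Tab.7 compares training with and without sample balancing. The excerpt does not state which balancing technique is used, although SMOTE [34] and related work on data balancing [35] appear in the references. Purely as an illustrative sketch of minority-class oversampling, and not as the authors' procedure, a SMOTE-style interpolation over vulnerable samples could look like this:

```python
# Illustrative minority-class oversampling (SMOTE-style interpolation, after [34]).
# This is a sketch only; the paper's actual balancing procedure is not described here.
import numpy as np


def smote_like_oversample(x_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0):
    """Create n_new synthetic minority samples by interpolating each picked
    sample with one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(x_min[:, None, :] - x_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]           # k nearest per sample
    base = rng.integers(0, len(x_min), size=n_new)
    mate = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    return x_min[base] + lam * (x_min[mate] - x_min[base])


# Example: balance 30 vulnerable feature vectors against 1 000 non-vulnerable ones.
vul = np.random.default_rng(1).normal(size=(30, 16))
synthetic = smote_like_oversample(vul, n_new=970)
print(synthetic.shape)    # (970, 16)
```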
| Model | Code local feature extraction module | Code global feature fusion module | Code vulnerability comparison mapping module |
| --- | --- | --- | --- |
| M1 | √ | √ | √ |
| M2 | × | √ | √ |
| M3 | √ | × | √ |
| M4 | √ | √ | × |

Tab.8 Ablation of different modules
| Model | P@1 | R@1 | P@10 | R@10 | P@20 | R@20 | P@50 | R@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 | 99 | 41 | 23 | 97 | 11 | 98 | 6 | 99 |
| M2 | 57 | 23 | 27 | 41 | 27 | 41 | 27 | 41 |
| M3 | 63 | 26 | 14 | 61 | 11 | 66 | 11 | 66 |
| M4 | 66 | 27 | 32 | 50 | 32 | 50 | 32 | 50 |

Tab.9 Ablation experiment results of different modules (values in %)

Fig.7 Top K% recall rate corresponding to ablation of modules
[1] SECURESOFTWARE. Rough auditing tool for security (RATS) [EB/OL]. [2021-12-23]. http://www.securesoftware.com/resources/download_rats.html.
[2] WHEELER D A. Flawfinder software official website [EB/OL]. [2018-08-02]. https://www.dwheeler.com/flawfinder/.
[3] JANG J, AGRAWAL A, BRUMLEY D. ReDeBug: finding unpatched code clones in entire OS distributions [C]// IEEE Symposium on Security and Privacy. California: IEEE Computer Society, 2012: 48-62.
[4] KIM S, WOO S, LEE H, et al. VUDDY: a scalable approach for vulnerable code clone discovery [C]// IEEE Symposium on Security and Privacy (SP). San Jose: IEEE Computer Society, 2017: 595-614.
[5] AVGERINOS T, CHA S K, REBERT A, et al. Automatic exploit generation [J]. Communications of the ACM, 2014, 57(2): 74-84. doi: 10.1145/2560217.2560219.
[6] RAMOS D A, ENGLER D. Under-constrained symbolic execution: correctness checking for real code [C]// Proceedings of the 24th USENIX Security Symposium (USENIX Security 15). Washington: USENIX Association, 2015: 49-64.
[7] NEUHAUS S, ZIMMERMANN T, HOLLER C, et al. Predicting vulnerable software components [C]// Proceedings of the 14th ACM Conference on Computer and Communications Security. Alexandria: ACM, 2007: 529-540.
[8] SHIN Y, WILLIAMS L. An empirical model to predict security vulnerabilities using code complexity metrics [C]// Proceedings of the 2nd ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. Kaiserslautern: ACM, 2008: 315-317.
[9] SHIN Y, WILLIAMS L. Can traditional fault prediction models be used for vulnerability prediction? [J]. Empirical Software Engineering, 2013, 18(1): 25-59. doi: 10.1007/s10664-011-9190-8.
[10] PERL H, DECHAND S, SMITH M, et al. VCCFinder: finding potential vulnerabilities in open-source projects to assist code audits [C]// Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. Denver: ACM, 2015: 426-437.
[11] SHIN Y, MENEELY A, WILLIAMS L, et al. Evaluating complexity, code churn and developer activity metrics as indicators of software vulnerabilities [J]. IEEE Transactions on Software Engineering, 2010, 37(6): 772-787.
[12] GHAFFARIAN S M, SHAHRIARI H R. Software vulnerability analysis and discovery using machine-learning and data-mining techniques: a survey [J]. ACM Computing Surveys (CSUR), 2017, 50(4): 1-36.
[13] LIU L, DE VEL O, HAN Q-L, et al. Detecting and preventing cyber insider threats: a survey [J]. IEEE Communications Surveys and Tutorials, 2018, 20(2): 1397-1417. doi: 10.1109/COMST.2018.2800740.
[14] SUN N, ZHANG J, RIMBA P, et al. Data-driven cybersecurity incident prediction: a survey [J]. IEEE Communications Surveys and Tutorials, 2018, 21(2): 1744-1772.
[15] JIANG J, WEN S, YU S, et al. Identifying propagation sources in networks: state-of-the-art and comparative studies [J]. IEEE Communications Surveys and Tutorials, 2016, 19(1): 465-481.
[16] WU T, WEN S, XIANG Y, et al. Twitter spam detection: survey of new approaches and comparative study [J]. Computers and Security, 2018, 76: 265-284. doi: 10.1016/j.cose.2017.11.013.
[17] GOODFELLOW I, BENGIO Y, COURVILLE A. Deep learning [M]. Cambridge, Massachusetts, USA: MIT Press, 2016.
[18] SESTILI C D, SNAVELY W S, VANHOUDNOS N M. Towards security defect prediction with AI [EB/OL]. [2018-08-29]. https://doi.org/10.48550/arXiv.1808.09897.
[19] LI Z, ZOU D, XU S, et al. SySeVR: a framework for using deep learning to detect software vulnerabilities [J]. IEEE Transactions on Dependable and Secure Computing, 2021, 19(4): 2244-2258.
[20] DAM H K, TRAN T, PHAM T, et al. Automatic feature learning for vulnerability prediction [EB/OL]. [2017-08-08]. https://doi.org/10.48550/arXiv.1708.02368.
[21] LIN G, XIAO W, ZHANG J, et al. Deep learning-based vulnerable function detection: a benchmark [C]// International Conference on Information and Communications Security. Beijing: Springer, 2019: 219-232.
[22] LI Z, ZOU D, XU S, et al. VulDeePecker: a deep learning-based system for vulnerability detection [EB/OL]. [2018-01-05]. https://doi.org/10.48550/arXiv.1801.01681.
[23] LIN G, ZHANG J, LUO W, et al. Cross-project transfer representation learning for vulnerable function discovery [J]. IEEE Transactions on Industrial Informatics, 2018, 14(7): 3289-3297. doi: 10.1109/TII.2018.2821768.
[24] DUAN Xu, WU Jing-zheng, LUO Tian-yue, et al. Vulnerability mining method based on code property graph and attention BiLSTM [J]. Journal of Software, 2020, 31(11): 3404-3420. (in Chinese)
[25] PENG H, MOU L, LI G, et al. Building program vector representations for deep learning [C]// International Conference on Knowledge Science, Engineering and Management. Chongqing: Springer, 2015: 547-553.
[26] LEE Y J, CHOI S H, KIM C, et al. Learning binary code with deep learning to detect software weakness [C]// KSII the 9th International Conference on Internet (ICONI). Vientiane: Symposium, 2017: 245-249.
[27] RUSSELL R, KIM L, HAMILTON L, et al. Automated vulnerability detection in source code using deep representation learning [C]// 17th IEEE International Conference on Machine Learning and Applications (ICMLA). Orlando: Institute of Electrical and Electronics Engineers, 2018: 757-762.
[28] AL-ALYAN A, AL-AHMADI S. Robust URL phishing detection based on deep learning [J]. KSII Transactions on Internet and Information Systems (TIIS), 2020, 14(7): 2752-2768.
[29] YAMAGUCHI F, LOTTMANN M, RIECK K. Generalized vulnerability extrapolation using abstract syntax trees [C]// Proceedings of the 28th Annual Computer Security Applications Conference. Orlando: ACM, 2012: 359-368.
[30] SUNEJA S, ZHENG Y, ZHUANG Y, et al. Learning to map source code to software vulnerability using code-as-a-graph [EB/OL]. [2020-06-15]. https://arxiv.org/abs/2006.08614.
[31] YAMAGUCHI F, GOLDE N, ARP D, et al. Modeling and discovering vulnerabilities with code property graphs [C]// IEEE Symposium on Security and Privacy. Berkeley: IEEE Computer Society, 2014: 590-604.
[32] ZHOU Y, LIU S, SIOW J, et al. Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks [C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver: NIPS Foundation, 2019: 10197-10207.
[33] CAO S, SUN X, BO L, et al. BGNN4VD: constructing bidirectional graph neural-network for vulnerability detection [J]. Information and Software Technology, 2021, 136: 106576. doi: 10.1016/j.infsof.2021.106576.
[34] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: synthetic minority over-sampling technique [J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357. doi: 10.1613/jair.953.
[35] AGRAWAL A, MENZIES T. Is "better data" better than "better data miners"? [C]// IEEE/ACM 40th International Conference on Software Engineering. Gothenburg: ACM, 2018: 1050-1061.