Bimodal software classification model based on bidirectional encoder representation from transformer
Xiaofeng FU1, Weiqi CHEN2, Yao SUN2, Yuze PAN2
1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China; 2. School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
A bimodal software categorization method based on bidirectional encoder representations from transformers (BERT) was proposed to address the limitations of existing methods, which consider only a single factor in software categorization and suffer from low precision. The method followed the latest national standard for software classification and integrated the advantages of the code-based BERT (CodeBERT) and the masked language model as correction BERT (MacBERT), both of which use bidirectional encoding. CodeBERT was used for in-depth analysis of source code, while MacBERT handled textual descriptions such as comments and documentation. The two modalities were used jointly to generate word embeddings; a convolutional neural network (CNN) was then applied for local feature extraction, and the proposed cross self-attention mechanism (CSAM) fused the outputs of the two branches to achieve accurate classification of complex software systems. Experimental results show that the method achieves a precision of 93.3% when both text and source code are available, which is 5.4% higher on average than the BERT and CodeBERT models trained on datasets built from the Orginone and Gitee platforms. The results demonstrate the efficiency and accuracy of bidirectional encoding and bimodal classification in software categorization, and confirm the practicality of the proposed approach.
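The following is a minimal PyTorch sketch of how the described pipeline could fit together: CodeBERT encodes source code, MacBERT encodes textual descriptions, a 1D CNN extracts local features from each token sequence, and a cross self-attention module fuses the two branches before classification. The checkpoint names, kernel size, head count, and the layout of the fusion module are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the bimodal BERT + CNN + cross self-attention classifier.
# Assumptions: "microsoft/codebert-base" and "hfl/chinese-macbert-base" checkpoints,
# kernel size 3, 8 attention heads, mean pooling before the classifier.
import torch
import torch.nn as nn
from transformers import AutoModel

class CrossSelfAttention(nn.Module):
    """Fuse two modality representations by letting each attend to the other."""
    def __init__(self, hidden_size=768, num_heads=8):
        super().__init__()
        self.code_to_text = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.text_to_code = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, code_feat, text_feat):
        # Queries come from one modality, keys/values from the other.
        c, _ = self.code_to_text(code_feat, text_feat, text_feat)
        t, _ = self.text_to_code(text_feat, code_feat, code_feat)
        # Mean-pool each fused sequence and concatenate the two modalities.
        return torch.cat([c.mean(dim=1), t.mean(dim=1)], dim=-1)

class BimodalClassifier(nn.Module):
    def __init__(self, num_classes=12, hidden_size=768):
        super().__init__()
        self.code_encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        self.text_encoder = AutoModel.from_pretrained("hfl/chinese-macbert-base")
        # 1D CNNs for local feature extraction over the token embeddings.
        self.code_cnn = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding=1)
        self.text_cnn = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding=1)
        self.fusion = CrossSelfAttention(hidden_size)
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, code_inputs, text_inputs):
        code_h = self.code_encoder(**code_inputs).last_hidden_state   # (B, Lc, H)
        text_h = self.text_encoder(**text_inputs).last_hidden_state   # (B, Lt, H)
        # Conv1d expects (B, H, L); apply the CNN and restore (B, L, H).
        code_h = torch.relu(self.code_cnn(code_h.transpose(1, 2))).transpose(1, 2)
        text_h = torch.relu(self.text_cnn(text_h.transpose(1, 2))).transpose(1, 2)
        fused = self.fusion(code_h, text_h)
        return self.classifier(self.dropout(fused))
```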
Xiaofeng FU, Weiqi CHEN, Yao SUN, Yuze PAN. Bimodal software classification model based on bidirectional encoder representation from transformer. Journal of Zhejiang University (Engineering Science), 2024, 58(11): 2239-2246.
Fig. 2 Data distribution of each category in software classification dataset
Fig. 3 Cross self-attention mechanism
Fig. 4 BERT-based bimodal software classification model
Fig. 5 BERT embedding layer representation
Parameter | Value
Epochs | 30
Batch size | 32
Sentence length | code: 512, text: 256
Learning rate | 10⁻⁵
Hidden size | 768
Dropout | 0.5
CNN hidden layers | 1
Loss function | Cross-entropy loss
Optimizer | Adamax
Tab. 1 Parameter settings of bimodal classification model
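A hedged sketch of a training loop that reflects the settings in Tab. 1 (30 epochs, batch size 32, learning rate 10⁻⁵, cross-entropy loss, Adamax) is shown below; the data loader format and the model object are placeholders carried over from the architecture sketch above, not the paper's actual training script.

```python
# Training-loop sketch matching the hyperparameters in Tab. 1.
# Assumes train_loader yields (code_inputs, text_inputs, labels) batches.
import torch
import torch.nn as nn

def train(model, train_loader, device, epochs=30, lr=1e-5):
    model.to(device)
    criterion = nn.CrossEntropyLoss()                       # loss function from Tab. 1
    optimizer = torch.optim.Adamax(model.parameters(), lr=lr)  # optimizer from Tab. 1
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for code_inputs, text_inputs, labels in train_loader:
            code_inputs = {k: v.to(device) for k, v in code_inputs.items()}
            text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
            labels = labels.to(device)
            optimizer.zero_grad()
            logits = model(code_inputs, text_inputs)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss = {total_loss / len(train_loader):.4f}")
```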
Model | P | R | F1
BERT | 0.645 | 0.632 | 0.638
CodeBERT | 0.786 | 0.764 | 0.775
MacBERT | 0.692 | 0.676 | 0.684
Neural network-based text classification method | 0.765 | 0.743 | 0.753
BERT-based bimodal software classification model | 0.786 | 0.764 | 0.775
Tab. 2 Classification results of each model without text description
Model | P | R | F1
BERT | 0.875 | 0.863 | 0.869
CodeBERT | 0.883 | 0.876 | 0.880
MacBERT | 0.903 | 0.901 | 0.902
Neural network-based text classification method | 0.913 | 0.912 | 0.913
BERT-based bimodal software classification model | 0.933 | 0.926 | 0.930
Tab. 3 Classification results of each model with text description
Software category | P | R
Industry application | 0.908 | 0.913
Network communication | 0.951 | 0.948
Language | 0.964 | 0.961
Multimedia | 0.919 | 0.922
Geographic information | 0.895 | 0.897
Artificial intelligence | 0.973 | 0.975
Browser | 0.928 | 0.934
Middleware | 0.902 | 0.901
Development support | 0.957 | 0.946
Database | 0.943 | 0.948
Operating system | 0.915 | 0.912
Information security | 0.934 | 0.936
Tab. 4 Classification results of each software category
Fig. 6 Average F1 score for various models