Bimodal software classification model based on bidirectional encoder representation from transformer
Xiaofeng FU1, Weiqi CHEN2, Yao SUN2, Yuze PAN2
1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China; 2. School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
A bimodal software categorization method based on bidirectional encoder representations from transformers (BERT) was proposed to address the limitations of existing methods, which consider only a single factor in software categorization and suffer from low precision. The method followed the latest national standard for software classification and integrated the advantages of the code-based BERT (CodeBERT) and the masked language model as correction BERT (MacBERT), both of which use bidirectional encoding. CodeBERT was used for in-depth analysis of source code, while MacBERT handled textual descriptions such as comments and documentation. The two modalities were used jointly to generate word embeddings; a convolutional neural network (CNN) was then applied for local feature extraction, and the proposed cross self-attention mechanism (CSAM) fused the outputs of the two branches to achieve accurate classification of complex software systems. Experimental results show that the method achieves a precision of 93.3% when both text and source code are available, which is 5.4% higher on average than the BERT and CodeBERT models trained on datasets built from the Orginone and Gitee platforms. The results demonstrate the efficiency and accuracy of bidirectional encoding and bimodal classification in software categorization, and confirm the practicality of the proposed approach.
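The following is a minimal PyTorch sketch of how the described pipeline could fit together: CodeBERT encodes source code, MacBERT encodes textual descriptions, a 1D CNN extracts local features from each token sequence, and a cross self-attention module fuses the two branches before classification. The checkpoint names, kernel size, head count, and the layout of the fusion module are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the bimodal BERT + CNN + cross self-attention classifier.
# Assumptions: "microsoft/codebert-base" and "hfl/chinese-macbert-base" checkpoints,
# kernel size 3, 8 attention heads, mean pooling before the classifier.
import torch
import torch.nn as nn
from transformers import AutoModel

class CrossSelfAttention(nn.Module):
    """Fuse two modality representations by letting each attend to the other."""
    def __init__(self, hidden_size=768, num_heads=8):
        super().__init__()
        self.code_to_text = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.text_to_code = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, code_feat, text_feat):
        # Queries come from one modality, keys/values from the other.
        c, _ = self.code_to_text(code_feat, text_feat, text_feat)
        t, _ = self.text_to_code(text_feat, code_feat, code_feat)
        # Mean-pool each fused sequence and concatenate the two modalities.
        return torch.cat([c.mean(dim=1), t.mean(dim=1)], dim=-1)

class BimodalClassifier(nn.Module):
    def __init__(self, num_classes=12, hidden_size=768):
        super().__init__()
        self.code_encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        self.text_encoder = AutoModel.from_pretrained("hfl/chinese-macbert-base")
        # 1D CNNs for local feature extraction over the token embeddings.
        self.code_cnn = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding=1)
        self.text_cnn = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding=1)
        self.fusion = CrossSelfAttention(hidden_size)
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, code_inputs, text_inputs):
        code_h = self.code_encoder(**code_inputs).last_hidden_state   # (B, Lc, H)
        text_h = self.text_encoder(**text_inputs).last_hidden_state   # (B, Lt, H)
        # Conv1d expects (B, H, L); apply the CNN and restore (B, L, H).
        code_h = torch.relu(self.code_cnn(code_h.transpose(1, 2))).transpose(1, 2)
        text_h = torch.relu(self.text_cnn(text_h.transpose(1, 2))).transpose(1, 2)
        fused = self.fusion(code_h, text_h)
        return self.classifier(self.dropout(fused))
```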
Xiaofeng FU, Weiqi CHEN, Yao SUN, Yuze PAN. Bimodal software classification model based on bidirectional encoder representation from transformer. Journal of Zhejiang University (Engineering Science), 2024, 58(11): 2239-2246.
Fig. 2 Data distribution of each category in software classification dataset
Fig. 3 Cross self-attention mechanism
Fig. 4 BERT-based bimodal software classification model
Fig. 5 BERT embedding layer representation
Parameter | Value
Epochs | 30
Batch size | 32
Sentence length | code: 512, text: 256
Learning rate | 10⁻⁵
Hidden size | 768
Dropout | 0.5
CNN hidden layers | 1
Loss function | Cross-entropy loss
Optimizer | Adamax
Tab. 1 Parameter settings of bimodal classification model
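A hedged sketch of a training loop that reflects the settings in Tab. 1 (30 epochs, batch size 32, learning rate 10⁻⁵, cross-entropy loss, Adamax) is shown below; the data loader format and the model object are placeholders carried over from the architecture sketch above, not the paper's actual training script.

```python
# Training-loop sketch matching the hyperparameters in Tab. 1.
# Assumes train_loader yields (code_inputs, text_inputs, labels) batches.
import torch
import torch.nn as nn

def train(model, train_loader, device, epochs=30, lr=1e-5):
    model.to(device)
    criterion = nn.CrossEntropyLoss()                       # loss function from Tab. 1
    optimizer = torch.optim.Adamax(model.parameters(), lr=lr)  # optimizer from Tab. 1
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for code_inputs, text_inputs, labels in train_loader:
            code_inputs = {k: v.to(device) for k, v in code_inputs.items()}
            text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
            labels = labels.to(device)
            optimizer.zero_grad()
            logits = model(code_inputs, text_inputs)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss = {total_loss / len(train_loader):.4f}")
```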
Model | P | R | F1
BERT | 0.645 | 0.632 | 0.638
CodeBERT | 0.786 | 0.764 | 0.775
MacBERT | 0.692 | 0.676 | 0.684
Neural network-based text classification method | 0.765 | 0.743 | 0.753
BERT-based bimodal software classification model | 0.786 | 0.764 | 0.775
Tab. 2 Classification results of each model without text description
Model | P | R | F1
BERT | 0.875 | 0.863 | 0.869
CodeBERT | 0.883 | 0.876 | 0.880
MacBERT | 0.903 | 0.901 | 0.902
Neural network-based text classification method | 0.913 | 0.912 | 0.913
BERT-based bimodal software classification model | 0.933 | 0.926 | 0.930
Tab. 3 Classification results of each model with text description
Software category | P | R
Industry application | 0.908 | 0.913
Network communication | 0.951 | 0.948
Language | 0.964 | 0.961
Multimedia | 0.919 | 0.922
Geographic information | 0.895 | 0.897
Artificial intelligence | 0.973 | 0.975
Browser | 0.928 | 0.934
Middleware | 0.902 | 0.901
Development support | 0.957 | 0.946
Database | 0.943 | 0.948
Operating system | 0.915 | 0.912
Information security | 0.934 | 0.936
Tab. 4 Classification results of each software category
Fig. 6 Average F1 score for various models