Journal of Zhejiang University (Engineering Science), 2024, Vol. 58, Issue 11: 2239-2246    DOI: 10.3785/j.issn.1008-973X.2024.11.005
Computer Technology and Control Engineering
Bimodal software classification model based on bidirectional encoder representation from transformer
Xiaofeng FU1, Weiqi CHEN2, Yao SUN2, Yuze PAN2
1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
2. School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
Abstract:

A bimodal software classification method based on bidirectional encoder representations from transformers (BERT) was proposed to address the limitations of existing methods, which consider only a single factor in software classification and suffer from low precision. The method followed the latest national standard for software classification and integrated the bidirectional-encoding strengths of code-based BERT (CodeBERT) and masked-language-model-as-correction BERT (MacBERT): CodeBERT was used for in-depth analysis of source code content, while MacBERT handled textual descriptions such as comments and documentation, and these two modalities were used jointly to generate word embeddings. A convolutional neural network (CNN) was combined for local feature extraction, and the proposed cross self-attention mechanism (CSAM) was employed to fuse the model outputs, achieving accurate classification of complex software systems. Experimental results demonstrate that the method reaches a precision of 93.3% when both text and source code data are considered, 5.4% higher on average than the BERT and CodeBERT models trained on datasets collected and processed from the Orginone and Gitee platforms. The results show the efficiency and accuracy of bidirectional encoding and bimodal classification in software categorization and prove the practicality of the proposed approach.

Key words: software classification; bidirectional encoder representation from transformer (BERT); convolutional neural network; bimodal; cross self-attention mechanism
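
The page itself carries no code, so the following is a minimal PyTorch sketch of the dual-encoder pipeline the abstract describes: CodeBERT encodes source code, MacBERT encodes textual descriptions, a 1D CNN extracts local features from each token sequence, and a cross self-attention module fuses the two modalities before classification. The Hugging Face checkpoints (microsoft/codebert-base, hfl/chinese-macbert-base), the CNN kernel size, and the exact form of the fusion are assumptions for illustration, not the authors' implementation; only the hidden size (768), the dropout rate (0.5), and the twelve output categories follow Tables 1 and 4 below.

```python
# Hedged sketch of the bimodal classifier outlined in the abstract.
# Assumptions (not from the paper): checkpoints, kernel size, and the
# exact form of the cross self-attention fusion.
import torch
import torch.nn as nn
from transformers import AutoModel

class CrossSelfAttentionFusion(nn.Module):
    """One plausible reading of CSAM: each modality attends to the other,
    then the two attended sequences are mean-pooled and concatenated."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.code_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_code = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, code_feats, text_feats):
        c, _ = self.code_to_text(code_feats, text_feats, text_feats)  # code queries text
        t, _ = self.text_to_code(text_feats, code_feats, code_feats)  # text queries code
        return torch.cat([c.mean(dim=1), t.mean(dim=1)], dim=-1)      # (B, 2*dim)

class BimodalClassifier(nn.Module):
    def __init__(self, num_classes=12, dim=768):                 # 12 categories (Table 4)
        super().__init__()
        self.code_encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        self.text_encoder = AutoModel.from_pretrained("hfl/chinese-macbert-base")
        # 1D CNNs over token embeddings for local feature extraction
        self.code_cnn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.text_cnn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.fusion = CrossSelfAttentionFusion(dim)
        self.dropout = nn.Dropout(0.5)                           # Table 1: dropout 0.5
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, code_inputs, text_inputs):
        code = self.code_encoder(**code_inputs).last_hidden_state   # (B, Lc, dim)
        text = self.text_encoder(**text_inputs).last_hidden_state   # (B, Lt, dim)
        code = self.code_cnn(code.transpose(1, 2)).transpose(1, 2)  # local code features
        text = self.text_cnn(text.transpose(1, 2)).transpose(1, 2)  # local text features
        fused = self.fusion(code, text)
        return self.head(self.dropout(fused))                       # class logits
```

In this reading, CSAM lets each modality's tokens attend to the other's before pooling; other fusions consistent with the abstract, such as self-attention over the concatenated sequences, would be equally plausible.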
Received: 2023-07-03    Published: 2024-10-23
CLC: TP391
Foundation item: National Natural Science Foundation of China (61672199)
About the author: Xiaofeng FU (1981—), female, associate professor; her research covers artificial intelligence, deep learning, and software asset classification. ORCID: 0000-0003-4903-5266. E-mail: fuxiaofeng@hdu.edu.cn

Cite this article:

Xiaofeng FU, Weiqi CHEN, Yao SUN, Yuze PAN. Bimodal software classification model based on bidirectional encoder representation from transformer. Journal of Zhejiang University (Engineering Science), 2024, 58(11): 2239-2246.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.11.005        https://www.zjujournals.com/eng/CN/Y2024/V58/I11/2239

Fig. 1  Self-attention mechanism
Fig. 2  Data distribution of each category in the software classification dataset
Fig. 3  Cross self-attention mechanism
Fig. 4  BERT-based bimodal software classification model
Fig. 5  Representation of the BERT embedding layer
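
Fig. 1 refers to the standard scaled dot-product self-attention of the transformer, whose usual form is

$$
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $Q$, $K$, $V$ are the query, key, and value projections and $d_k$ is the key dimension. The cross self-attention mechanism of Fig. 3 is the paper's own variant; one plausible reading, used in the sketch above, has each modality supply the queries against the other modality's keys and values. The exact CSAM formulation is given in the full text.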
Parameter            Value
Epochs               30
Batch size           32
Sentence length      512 (code), 256 (text)
Learning rate        10^-5
Hidden size          768
Dropout              0.5
CNN hidden layers    1
Loss function        Cross-entropy loss
Optimizer            Adamax
Table 1  Parameter settings of the bimodal classification model
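
As a usage sketch, the Table 1 settings can be wired onto the hypothetical BimodalClassifier above. The sample strings and the single-example loop are placeholders (the real corpus comes from Orginone and Gitee, batched at 32); the maximum lengths, optimizer, learning rate, loss, and epoch count mirror the table.

```python
# Hedged training setup mirroring Table 1; BimodalClassifier and the
# one-example "dataset" are illustrative placeholders.
import torch
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
text_tok = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")

code_inputs = code_tok("def add(a, b):\n    return a + b",
                       truncation=True, max_length=512,      # sentence length: code 512
                       padding="max_length", return_tensors="pt")
text_inputs = text_tok("A small utility library for arithmetic helpers.",
                       truncation=True, max_length=256,      # sentence length: text 256
                       padding="max_length", return_tensors="pt")
labels = torch.tensor([3])                                   # hypothetical category index

model = BimodalClassifier(num_classes=12)
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-5)  # Adamax, learning rate 10^-5
criterion = torch.nn.CrossEntropyLoss()                      # cross-entropy loss

model.train()
for epoch in range(30):                                      # 30 epochs
    optimizer.zero_grad()
    loss = criterion(model(code_inputs, text_inputs), labels)
    loss.backward()
    optimizer.step()
```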
Model                                         P      R      F1
BERT                                          0.645  0.632  0.638
CodeBERT                                      0.786  0.764  0.775
MacBERT                                       0.692  0.676  0.684
Neural-network-based text classification      0.765  0.743  0.753
BERT-based bimodal classification model       0.786  0.764  0.775
Table 2  Classification results of each model without text descriptions
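
The P, R, and F1 columns in Tables 2 and 3 are the standard precision, recall, and F1-score, presumably averaged across the twelve categories (the averaging scheme is not stated on this page):

$$
P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN},\qquad F_1=\frac{2PR}{P+R}
$$

with $TP$, $FP$, and $FN$ the true positives, false positives, and false negatives of a class.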
Model                                         P      R      F1
BERT                                          0.875  0.863  0.869
CodeBERT                                      0.883  0.876  0.880
MacBERT                                       0.903  0.901  0.902
Neural-network-based text classification      0.913  0.912  0.913
BERT-based bimodal classification model       0.933  0.926  0.930
Table 3  Classification results of each model with text descriptions
Software category         P      R
Industry applications     0.908  0.913
Network communication     0.951  0.948
Languages                 0.964  0.961
Multimedia                0.919  0.922
Geographic information    0.895  0.897
Artificial intelligence   0.973  0.975
Browsers                  0.928  0.934
Middleware                0.902  0.901
Development support       0.957  0.946
Databases                 0.943  0.948
Operating systems         0.915  0.912
Information security      0.934  0.936
Table 4  Classification results for each software category
Fig. 6  Average F1-score of each model