Journal of Zhejiang University (Engineering Science)  2025, Vol. 59, Issue (8): 1653-1661    DOI: 10.3785/j.issn.1008-973X.2025.08.012
Computer Technology, Control Engineering, and Communication Technology
Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor
Yishan LIN1, Jing ZUO1, Shuhua LU1,2,*
1. College of Information and Cyber Security, People’s Public Security University of China, Beijing 102600, China
2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 102600, China
Abstract:

A multimodal sentiment analysis method based on a multi-head self-attention mechanism and an MLP-Interactor was proposed to address the poor quality of unimodal features and the insufficient interaction among multimodal features in multimodal sentiment analysis. An intra-modal feature interaction module based on the multi-head self-attention mechanism realized feature interaction within each modality and improved the quality of unimodal features. The MLP-Interactor mechanism realized full interaction among multimodal features and learned the consistency information between different modalities. Extensive experiments were conducted with the proposed method on two public datasets, CMU-MOSI and CMU-MOSEI. Results show that the proposed method surpasses many state-of-the-art methods and effectively improves the accuracy of multimodal sentiment analysis.

Key words: multimodal sentiment analysis; MLP-Interactor; multi-head self-attention mechanism; feature interaction
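To make the intra-modal interaction described in the abstract concrete, the following is a minimal PyTorch sketch of a multi-head self-attention block applied within a single modality's feature sequence. The module name IntraModalInteraction and all hyperparameters are illustrative assumptions, not the authors' implementation.

# Minimal sketch, assuming each modality's sequence is refined by standard
# multi-head self-attention before any cross-modal fusion.
import torch
import torch.nn as nn


class IntraModalInteraction(nn.Module):
    """Refine one modality's sequence with multi-head self-attention."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) features of a single modality
        attn_out, _ = self.attn(x, x, x)      # query = key = value = x
        x = self.norm1(x + attn_out)          # residual + layer norm
        x = self.norm2(x + self.ffn(x))       # position-wise feed-forward
        return x


if __name__ == "__main__":
    text = torch.randn(8, 50, 128)            # e.g. a batch of 50-token text sequences
    print(IntraModalInteraction()(text).shape) # torch.Size([8, 50, 128])

Applying one such block per modality (text, audio, video) before fusion is one plausible way to realize the "intra-modal feature interaction module" the abstract describes.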
Received: 2024-08-22; Published: 2025-07-28
CLC number: TP 391
Fund program: Special Innovation Research Project for the "Double First-Class" Initiative in Security Prevention Engineering of the People's Public Security University of China (2023SYL08); 2024 Fundamental Research Funds Project for Research on Cross-Modal Data Fusion and Intelligent Interrogation Technology (2024JKF10).
Corresponding author: Shuhua LU, E-mail: 375568222@qq.com; lushuhua@ppsuc.edu.cn
About the author: Yishan LIN (1999—), male, master's student, engaged in research on multimodal sentiment analysis. orcid.org/0000-0000-0000-0000. E-mail: 375568222@qq.com
Cite this article:

Yishan LIN, Jing ZUO, Shuhua LU. Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor. Journal of Zhejiang University (Engineering Science), 2025, 59(8): 1653-1661.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.08.012        https://www.zjujournals.com/eng/CN/Y2025/V59/I8/1653

Fig. 1  Overall structure of proposed multimodal sentiment analysis model
Fig. 2  Multi-head attention mechanism
Fig. 3  MLP-Interactor
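Fig. 3 names the MLP-Interactor, but this page gives no equations, so the following is only a hedged sketch of how an MLP-based cross-modal interactor could be built in the spirit of MLP-Mixer/CubeMLP-style mixing: utterance-level text, audio, and video vectors are stacked along a modality axis, and two MLPs alternately mix across the modality axis and the feature-channel axis. The class name MLPInteractor and all dimensions are assumptions, not the paper's code.

# Hedged sketch of cross-modal mixing with MLPs, assuming utterance-level
# features of shape (batch, n_modalities, d_model) stacked as [text, audio, video].
import torch
import torch.nn as nn


class MLPInteractor(nn.Module):
    def __init__(self, d_model: int = 128, n_modalities: int = 3, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Mixes information across the modality axis (text <-> audio <-> video).
        self.modality_mix = nn.Sequential(
            nn.Linear(n_modalities, hidden), nn.GELU(), nn.Linear(hidden, n_modalities)
        )
        self.norm2 = nn.LayerNorm(d_model)
        # Mixes information across feature channels within each modality.
        self.channel_mix = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_modalities, d_model)
        y = self.norm1(x).transpose(1, 2)              # (batch, d_model, n_modalities)
        x = x + self.modality_mix(y).transpose(1, 2)   # modality mixing, residual
        x = x + self.channel_mix(self.norm2(x))        # channel mixing, residual
        return x


if __name__ == "__main__":
    fused = MLPInteractor()(torch.randn(8, 3, 128))
    print(fused.shape)                                  # torch.Size([8, 3, 128])

The alternating token/channel mixing pattern is borrowed from the all-MLP vision literature cited in the references [27, 31]; the actual MLP-Interactor may differ in depth, gating, or normalization.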
Dataset       r           LS    Lmax    N
CMU-MOSI      3.146×10−5  32    50      3
CMU-MOSEI     7.600×10−6  32    50      3
Table 1  Experimental parameter settings of multimodal sentiment analysis with multi-head self-attention mechanism and MLP-Interactor
Model                   MAE     corr    A2              A7      F1
TFN[40] (2017)          0.901   0.698   — / 80.8        34.9    — / 80.7
LMF[41] (2018)          0.917   0.695   — / 82.5        33.2    — / 82.4
MFN[42] (2018)          0.965   0.632   77.4 / —        34.1    77.3 / —
MulT[20] (2019)         0.871   0.698   — / 83.0        40.0    — / 82.8
BBFN[11] (2021)         0.776   0.755   — / 84.3        45.0    — / 84.3
Self-MM[24] (2021)      0.713   0.798   84.0 / 85.98            84.42 / 85.95
MISA[7] (2020)          0.783   0.761   81.8 / 83.4     42.3    81.7 / 83.6
MAG-BERT[43] (2020)     0.731   0.798   82.5 / 84.3             82.6 / 84.3
CubeMLP[31] (2022)      0.770   0.767   — / 85.6        45.5    — / 85.5
PS-Mixer[30] (2023)     0.794   0.748   80.3 / 82.1     44.31   80.3 / 82.1
MTSA[44] (2022)         0.696   0.806   — / 86.8        46.4    — / 86.8
AOBERT[10] (2023)       0.856   0.700   85.2 / 85.6     40.2    85.4 / 86.4
TETFN[25] (2023)        0.717   0.800   84.05 / 86.10           83.83 / 86.07
TMRN[45] (2023)         0.704   0.784   83.67 / 85.67   48.68   83.45 / 85.52
MTAMW[46] (2024)        0.712   0.794   84.40 / 86.59   46.84   84.20 / 86.46
MIBSA[47] (2024)        0.728   0.798   — / 87.00       43.10   — / 87.20
FRDIN[48] (2024)        0.682   0.813   85.8 / 87.4     46.59   85.3 / 87.5
CRNet[49] (2024)        0.712   0.797   — / 86.4        47.40   — / 86.4
Proposed model          0.575   0.868   87.6 / 89.6     52.23   87.7 / 89.6
Table 2  Comparison of performance with other baseline models on CMU-MOSI dataset
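For readers unfamiliar with the column headers in Tables 2-4, the sketch below shows how these metrics are commonly computed for CMU-MOSI/CMU-MOSEI regression outputs in [-3, 3]; the paired "x / y" entries in the A2 and F1 columns conventionally correspond to the negative/non-negative and negative/positive evaluation protocols. This reflects common practice in the cited literature, not a computation confirmed by this page.

# Hedged sketch of the standard CMU-MOSI/MOSEI metrics (MAE, corr, A2, A7, F1),
# assuming model predictions and labels are continuous sentiment scores in [-3, 3].
import numpy as np
from sklearn.metrics import f1_score


def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]                  # Pearson correlation
    # Acc-7: clamp to [-3, 3] and round to one of seven sentiment classes.
    acc7 = np.mean(np.clip(np.round(preds), -3, 3)
                   == np.clip(np.round(labels), -3, 3))
    # Acc-2 / F1, negative vs. non-negative (zero counted as non-negative).
    acc2_non_neg = np.mean((preds >= 0) == (labels >= 0))
    f1_non_neg = f1_score(labels >= 0, preds >= 0, average="weighted")
    # Acc-2 / F1, negative vs. positive (zero-labeled samples excluded).
    nz = labels != 0
    acc2_pos = np.mean((preds[nz] > 0) == (labels[nz] > 0))
    f1_pos = f1_score(labels[nz] > 0, preds[nz] > 0, average="weighted")
    return {"MAE": mae, "corr": corr, "A7": acc7,
            "A2": (acc2_non_neg, acc2_pos), "F1": (f1_non_neg, f1_pos)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.uniform(-3, 3, size=200)
    preds = labels + rng.normal(0, 0.5, size=200)
    print(mosi_metrics(preds, labels))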
Model                   MAE     corr    A2              A7      F1
TFN[40] (2017)          0.593   0.700   — / 82.5        50.2    — / 82.1
LMF[41] (2018)          0.623   0.677   — / 82.0        48.0    — / 82.1
MulT[20] (2019)         0.580   0.703   — / 82.5        51.8    — / 82.3
BBFN[11] (2021)         0.529   0.767   — / 86.2        54.8    — / 86.1
Self-MM[24] (2021)      0.530   0.765   82.81 / 85.17           82.53 / 85.30
MISA[7] (2020)          0.555   0.756   83.6 / 85.5     52.2    83.8 / 85.3
MAG-BERT[43] (2020)     0.543   0.755   82.51 / 84.82           82.77 / 84.71
CubeMLP[31] (2022)      0.529   0.760   — / 85.1        54.9    — / 84.5
PS-Mixer[30] (2023)     0.537   0.765   83.1 / 86.1     53.0    83.1 / 86.1
MTSA[44] (2022)         0.541   0.774   — / 85.5        52.9    — / 85.3
AOBERT[10] (2023)       0.515   0.763   84.9 / 86.2     54.5    85.0 / 85.9
TETFN[25] (2023)        0.551   0.748   84.25 / 85.18           84.18 / 85.27
TMRN[45] (2023)         0.535   0.762   83.39 / 86.19   53.65   83.67 / 86.08
MTAMW[46] (2024)        0.525   0.782   83.09 / 86.49   53.73   83.48 / 86.45
MIBSA[47] (2024)        0.568   0.753   — / 86.70       52.40   — / 85.80
FRDIN[48] (2024)        0.525   0.778   83.30 / 86.30   54.40   83.70 / 86.20
CRNet[49] (2024)        0.541   0.771   — / 86.20       53.80   — / 86.10
Proposed model          0.512   0.794   83.0 / 86.8     54.5    82.5 / 86.8
Table 3  Comparison of performance with other baseline models on CMU-MOSEI dataset
Method                            MAE     corr    A2              A7      F1
BERT                              0.799   0.746   80.80 / 82.87   40.77   80.91 / 82.90
DeBERTa                           1.154   0.486   66.96 / 68.06   28.86   66.93 / 67.93
RoBERTa                           0.664   0.824   83.63 / 85.65   45.53   83.65 / 85.64
DistilBERT                        0.754   0.768   81.40 / 83.54   40.48   81.48 / 83.55
ALBERT                            0.928   0.670   76.93 / 79.00   34.67   77.10 / 79.08
w/o Intra-Modality Interaction    0.613   0.851   86.90 / 89.10   47.92   86.90 / 89.10
w/o MLP-Interactor                0.590   0.868   85.57 / 87.54   50.00   85.66 / 87.59
w/o audio                         0.635   0.853   86.46 / 88.16   44.79   86.48 / 88.15
w/o video                         0.583   0.861   87.35 / 89.42   49.55   87.45 / 89.47
w/o text                          1.460   0.052   45.95 / 47.47   14.58   50.86 / 52.50
Proposed model                    0.575   0.868   87.60 / 89.60   52.23   87.70 / 89.60
Table 4  Ablation experiment results on CMU-MOSI dataset
Fig. 4  Feature visualization
1 ZHU L, ZHU Z, ZHANG C, et al Multimodal sentiment analysis based on fusion methods: a survey[J]. Information Fusion, 2023, 95: 306- 325
doi: 10.1016/j.inffus.2023.02.028
2 GANDHI A, ADHVARYU K, PORIA S, et al Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions[J]. Information Fusion, 2023, 91: 424- 444
doi: 10.1016/j.inffus.2022.09.025
3 CAO R, YE C, ZHOU H. Multimodal sentiment analysis with self-attention [C]// Proceedings of the Future Technologies Conference. [S. l. ]: Springer, 2021: 16-26.
4 BALTRUSAITIS T, AHUJA C, MORENCY L P Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41 (2): 423- 443
5 GUO W, WANG J, WANG S Deep multimodal representation learning: a survey[J]. IEEE Access, 2019, 7: 63373- 63394
doi: 10.1109/ACCESS.2019.2916887
6 HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis [EB/OL]. (2021-09-16)[2025-05-28]. https://arxiv.org/pdf/2109.00412.
7 HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and specific representations for multimodal sentiment analysis [C]// Proceedings of the 28th ACM International Conference on Multimedia. Seattle: ACM, 2020: 1122-1131.
8 TANG J, LIU D, JIN X, et al Bafn: bi-direction attention based fusion network for multimodal sentiment analysis[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 33 (4): 1966- 1978
9 WU Y, LIN Z, ZHAO Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis [C]// Findings of the Association for Computational Linguistics. [S. l.]: ACL, 2021: 4730-4738.
10 KIM K, PARK S AOBERT: all-modalities-in-one BERT for multimodal sentiment analysis[J]. Information Fusion, 2023, 92: 37- 45
doi: 10.1016/j.inffus.2022.11.022
11 HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis [C]// Proceedings of the 2021 International Conference on Multimodal Interaction. Montréal: ACM, 2021: 6-15.
12 LI Z, GUO Q, PAN Y, et al Multi-level correlation mining framework with self-supervised label generation for multimodal sentiment analysis[J]. Information Fusion, 2023, 99: 101891
doi: 10.1016/j.inffus.2023.101891
13 MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web [C]// Proceedings of the 13th International Conference on Multimodal Interfaces. Alicante: ACM, 2011: 169-176.
14 ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5642-5649.
15 PORIA S, CHATURVEDI I, CAMBRIA E, et al. Convolutional MKL based multimodal emotion recognition and sentiment analysis [C]//IEEE 16th International Conference on Data Mining. Barcelona: IEEE, 2016: 439-448.
16 ALAM F, RICCARDI G. Predicting personality traits using multimodal information [C]// Proceedings of the 2014 ACM Multimedia on Workshop on Computational Personality Recognition. Orlando: ACM, 2014: 15-18.
17 CAI G, XIA B. Convolutional neural networks for multimedia sentiment analysis [C]// Natural Language Processing and Chinese Computing: 4th CCF Conference. Nanchang: Springer, 2015: 159-167.
18 GKOUMAS D, LI Q, LIOMA C, et al What makes the difference? an empirical comparison of fusion strategies for multimodal language analysis[J]. Information Fusion, 2021, 66: 184- 197
doi: 10.1016/j.inffus.2020.09.005
19 LIN T, WANG Y, LIU X, et al A survey of transformers[J]. AI Open, 2022, 3: 111- 132
doi: 10.1016/j.aiopen.2022.10.001
20 TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence: ACL, 2019: 6558-6569.
21 CHEN C, HONG H, GUO J, et al Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1476- 1488
doi: 10.1109/TASLP.2023.3263801
22 YUAN Z, LI W, XU H, et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis [C]// Proceedings of the 29th ACM International Conference on Multimedia. Chengdu: ACM, 2021: 4400-4407.
23 MA L, YAO Y, LIANG T, et al. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos [EB/OL]. (2022-06-17)[2025-05-28]. https://arxiv.org/pdf/2206.07981.
24 YU W, XU H, YUAN Z, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S. l. ]: AAAI Press, 2021: 10790-10797.
25 WANG D, GUO X, TIAN Y, et al TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259
doi: 10.1016/j.patcog.2022.109259
26 MELAS-KYRIAZI L. Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet [EB/OL]. (2021-05-06)[2025-05-28]. https://arxiv.org/pdf/2105.02723.
27 TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al MLP-Mixer: an all-MLP architecture for vision[J]. Advances in Neural Information Processing Systems, 2021, 34: 24261- 24272
28 TOUVRON H, BOJANOWSKI P, CARON M, et al ResMLP: feedforward networks for image classification with data-efficient training[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45 (4): 5314- 5321
29 NIE Y, LI L, GAN Z, et al. MLP architectures for vision-and-language modeling: an empirical study [EB/OL]. (2021-12-08)[2025-05-28]. https://arxiv.org/pdf/2112.04453.
30 LIN H, ZHANG P, LING J, et al PS-mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J]. Information Processing and Management, 2023, 60 (2): 103229
doi: 10.1016/j.ipm.2022.103229
31 SUN H, WANG H, LIU J, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation [C]// Proceedings of the 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 3722-3729.
32 BAIRAVEL S, KRISHNAMURTHY M Novel OGBEE-based feature selection and feature-level fusion with MLP neural network for social media multimodal sentiment analysis[J]. Soft Computing, 2020, 24 (24): 18431- 18445
doi: 10.1007/s00500-020-05049-6
33 KE P, JI H, LIU S, et al. SentiLARE: sentiment-aware language representation learning with linguistic knowledge [EB/OL]. (2020-09-24)[2025-05-28]. https://arxiv.org/pdf/1911.02493.
34 LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach [EB/OL]. (2019-07-26)[2025-05-28]. https://arxiv.org/pdf/1907.11692.
35 DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies [C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence: IEEE, 2014: 960-964.
36 ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos [EB/OL]. (2016-08-11)[2025-05-28]. https://arxiv.org/pdf/1606.06259.
37 ZADEH A A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: ACL, 2018: 2236-2246.
38 CHEONG J H, JOLLY E, XIE T, et al Py-feat: Python facial expression analysis toolbox[J]. Affective Science, 2023, 4 (4): 781- 796
doi: 10.1007/s42761-023-00191-4
39 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. California: Curran Associates Inc., 2017: 5998-6008.
40 ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis [EB/OL]. (2017-07-23)[2025-05-28]. https://arxiv.org/pdf/1707.07250.
41 LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors [EB/OL]. (2018-05-31)[2025-05-28]. https://arxiv.org/pdf/1806.00064.
42 ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5634-5641.
43 RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained Transformers [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. [S. l.]: ACL, 2020: 2359-2369.
44 YANG B, SHAO B, WU L, et al Multimodal sentiment analysis with unidirectional modality translation[J]. Neurocomputing, 2022, 467: 130- 137
doi: 10.1016/j.neucom.2021.09.041
45 LEI Y, YANG D, LI M, et al. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences [C]// CAAI International Conference on Artificial Intelligence. Singapore: Springer, 2023: 189-200.
46 WANG Y, HE J, WANG D, et al Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis[J]. Neurocomputing, 2024, 572: 127181
doi: 10.1016/j.neucom.2023.127181
47 LIU W, CAO S, ZHANG S Multimodal consistency-specificity fusion based on information bottleneck for sentiment analysis[J]. Journal of King Saud University-Computer and Information Sciences, 2024, 36 (2): 101943
doi: 10.1016/j.jksuci.2024.101943
48 ZENG Y, LI Z, CHEN Z, et al A feature-based restoration dynamic interaction network for multimodal sentiment analysis[J]. Engineering Applications of Artificial Intelligence, 2024, 127: 107335
doi: 10.1016/j.engappai.2023.107335