Journal of Zhejiang University (Engineering Science)  2025, Vol. 59, Issue (8): 1653-1661    DOI: 10.3785/j.issn.1008-973X.2025.08.012
Computer Technology, Control Engineering, and Communication Technology
Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor
Yishan LIN1, Jing ZUO1, Shuhua LU1,2,*
1. College of Information and Cyber Security, People’s Public Security University of China, Beijing 102600, China
2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 102600, China
Abstract:

A multimodal sentiment analysis method based on a multi-head self-attention mechanism and an MLP-Interactor was proposed to address the poor quality of unimodal features and the insufficient interaction among multimodal features in multimodal sentiment analysis. An intra-modal feature interaction module based on the multi-head self-attention mechanism realized feature interaction within each modality and improved the quality of unimodal features. The MLP-Interactor mechanism realized full interaction among multimodal features and learned the consistency information between different modalities. Extensive experiments were conducted with the proposed method on two public datasets, CMU-MOSI and CMU-MOSEI. Results show that the proposed method surpasses many state-of-the-art methods and effectively improves the accuracy of multimodal sentiment analysis.

Key words: multimodal sentiment analysis; MLP-Interactor; multi-head self-attention mechanism; feature interaction
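To make the intra-modal interaction described in the abstract concrete, the following is a minimal PyTorch sketch of a multi-head self-attention block applied within a single modality's feature sequence. The module name IntraModalInteraction and all hyperparameters are illustrative assumptions, not the authors' implementation.

# Minimal sketch, assuming each modality's sequence is refined by standard
# multi-head self-attention before any cross-modal fusion.
import torch
import torch.nn as nn


class IntraModalInteraction(nn.Module):
    """Refine one modality's sequence with multi-head self-attention."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) features of a single modality
        attn_out, _ = self.attn(x, x, x)      # query = key = value = x
        x = self.norm1(x + attn_out)          # residual + layer norm
        x = self.norm2(x + self.ffn(x))       # position-wise feed-forward
        return x


if __name__ == "__main__":
    text = torch.randn(8, 50, 128)            # e.g. a batch of 50-token text sequences
    print(IntraModalInteraction()(text).shape) # torch.Size([8, 50, 128])

Applying one such block per modality (text, audio, video) before fusion is one plausible way to realize the "intra-modal feature interaction module" the abstract describes.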
Received: 2024-08-22; Published: 2025-07-28
CLC number: TP 391
Fund program: Special Innovation Research Project for the "Double First-Class" Initiative in Security Prevention Engineering of the People's Public Security University of China (2023SYL08); 2024 Fundamental Research Funds Project for Research on Cross-Modal Data Fusion and Intelligent Interrogation Technology (2024JKF10).
Corresponding author: Shuhua LU, E-mail: 375568222@qq.com; lushuhua@ppsuc.edu.cn
About the author: Yishan LIN (1999—), male, master's student, engaged in research on multimodal sentiment analysis. orcid.org/0000-0000-0000-0000. E-mail: 375568222@qq.com
Cite this article:

Yishan LIN, Jing ZUO, Shuhua LU. Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor. Journal of Zhejiang University (Engineering Science), 2025, 59(8): 1653-1661.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.08.012        https://www.zjujournals.com/eng/CN/Y2025/V59/I8/1653

Fig. 1  Overall structure of proposed multimodal sentiment analysis model
Fig. 2  Multi-head attention mechanism
Fig. 3  MLP-Interactor
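Fig. 3 names the MLP-Interactor, but this page gives no equations, so the following is only a hedged sketch of how an MLP-based cross-modal interactor could be built in the spirit of MLP-Mixer/CubeMLP-style mixing: utterance-level text, audio, and video vectors are stacked along a modality axis, and two MLPs alternately mix across the modality axis and the feature-channel axis. The class name MLPInteractor and all dimensions are assumptions, not the paper's code.

# Hedged sketch of cross-modal mixing with MLPs, assuming utterance-level
# features of shape (batch, n_modalities, d_model) stacked as [text, audio, video].
import torch
import torch.nn as nn


class MLPInteractor(nn.Module):
    def __init__(self, d_model: int = 128, n_modalities: int = 3, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Mixes information across the modality axis (text <-> audio <-> video).
        self.modality_mix = nn.Sequential(
            nn.Linear(n_modalities, hidden), nn.GELU(), nn.Linear(hidden, n_modalities)
        )
        self.norm2 = nn.LayerNorm(d_model)
        # Mixes information across feature channels within each modality.
        self.channel_mix = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_modalities, d_model)
        y = self.norm1(x).transpose(1, 2)              # (batch, d_model, n_modalities)
        x = x + self.modality_mix(y).transpose(1, 2)   # modality mixing, residual
        x = x + self.channel_mix(self.norm2(x))        # channel mixing, residual
        return x


if __name__ == "__main__":
    fused = MLPInteractor()(torch.randn(8, 3, 128))
    print(fused.shape)                                  # torch.Size([8, 3, 128])

The alternating token/channel mixing pattern is borrowed from the all-MLP vision literature cited in the references [27, 31]; the actual MLP-Interactor may differ in depth, gating, or normalization.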
Dataset       r           LS    Lmax    N
CMU-MOSI      3.146×10−5  32    50      3
CMU-MOSEI     7.600×10−6  32    50      3
Table 1  Experimental parameter settings of multimodal sentiment analysis with multi-head self-attention mechanism and MLP-Interactor
Model                   MAE     corr    A2              A7      F1
TFN[40] (2017)          0.901   0.698   — / 80.8        34.9    — / 80.7
LMF[41] (2018)          0.917   0.695   — / 82.5        33.2    — / 82.4
MFN[42] (2018)          0.965   0.632   77.4 / —        34.1    77.3 / —
MulT[20] (2019)         0.871   0.698   — / 83.0        40.0    — / 82.8
BBFN[11] (2021)         0.776   0.755   — / 84.3        45.0    — / 84.3
Self-MM[24] (2021)      0.713   0.798   84.0 / 85.98            84.42 / 85.95
MISA[7] (2020)          0.783   0.761   81.8 / 83.4     42.3    81.7 / 83.6
MAG-BERT[43] (2020)     0.731   0.798   82.5 / 84.3             82.6 / 84.3
CubeMLP[31] (2022)      0.770   0.767   — / 85.6        45.5    — / 85.5
PS-Mixer[30] (2023)     0.794   0.748   80.3 / 82.1     44.31   80.3 / 82.1
MTSA[44] (2022)         0.696   0.806   — / 86.8        46.4    — / 86.8
AOBERT[10] (2023)       0.856   0.700   85.2 / 85.6     40.2    85.4 / 86.4
TETFN[25] (2023)        0.717   0.800   84.05 / 86.10           83.83 / 86.07
TMRN[45] (2023)         0.704   0.784   83.67 / 85.67   48.68   83.45 / 85.52
MTAMW[46] (2024)        0.712   0.794   84.40 / 86.59   46.84   84.20 / 86.46
MIBSA[47] (2024)        0.728   0.798   — / 87.00       43.10   — / 87.20
FRDIN[48] (2024)        0.682   0.813   85.8 / 87.4     46.59   85.3 / 87.5
CRNet[49] (2024)        0.712   0.797   — / 86.4        47.40   — / 86.4
Proposed model          0.575   0.868   87.6 / 89.6     52.23   87.7 / 89.6
Table 2  Comparison of performance with other baseline models on CMU-MOSI dataset
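For readers unfamiliar with the column headers in Tables 2-4, the sketch below shows how these metrics are commonly computed for CMU-MOSI/CMU-MOSEI regression outputs in [-3, 3]; the paired "x / y" entries in the A2 and F1 columns conventionally correspond to the negative/non-negative and negative/positive evaluation protocols. This reflects common practice in the cited literature, not a computation confirmed by this page.

# Hedged sketch of the standard CMU-MOSI/MOSEI metrics (MAE, corr, A2, A7, F1),
# assuming model predictions and labels are continuous sentiment scores in [-3, 3].
import numpy as np
from sklearn.metrics import f1_score


def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]                  # Pearson correlation
    # Acc-7: clamp to [-3, 3] and round to one of seven sentiment classes.
    acc7 = np.mean(np.clip(np.round(preds), -3, 3)
                   == np.clip(np.round(labels), -3, 3))
    # Acc-2 / F1, negative vs. non-negative (zero counted as non-negative).
    acc2_non_neg = np.mean((preds >= 0) == (labels >= 0))
    f1_non_neg = f1_score(labels >= 0, preds >= 0, average="weighted")
    # Acc-2 / F1, negative vs. positive (zero-labeled samples excluded).
    nz = labels != 0
    acc2_pos = np.mean((preds[nz] > 0) == (labels[nz] > 0))
    f1_pos = f1_score(labels[nz] > 0, preds[nz] > 0, average="weighted")
    return {"MAE": mae, "corr": corr, "A7": acc7,
            "A2": (acc2_non_neg, acc2_pos), "F1": (f1_non_neg, f1_pos)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.uniform(-3, 3, size=200)
    preds = labels + rng.normal(0, 0.5, size=200)
    print(mosi_metrics(preds, labels))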
Model                   MAE     corr    A2              A7      F1
TFN[40] (2017)          0.593   0.700   — / 82.5        50.2    — / 82.1
LMF[41] (2018)          0.623   0.677   — / 82.0        48.0    — / 82.1
MulT[20] (2019)         0.580   0.703   — / 82.5        51.8    — / 82.3
BBFN[11] (2021)         0.529   0.767   — / 86.2        54.8    — / 86.1
Self-MM[24] (2021)      0.530   0.765   82.81 / 85.17           82.53 / 85.30
MISA[7] (2020)          0.555   0.756   83.6 / 85.5     52.2    83.8 / 85.3
MAG-BERT[43] (2020)     0.543   0.755   82.51 / 84.82           82.77 / 84.71
CubeMLP[31] (2022)      0.529   0.760   — / 85.1        54.9    — / 84.5
PS-Mixer[30] (2023)     0.537   0.765   83.1 / 86.1     53.0    83.1 / 86.1
MTSA[44] (2022)         0.541   0.774   — / 85.5        52.9    — / 85.3
AOBERT[10] (2023)       0.515   0.763   84.9 / 86.2     54.5    85.0 / 85.9
TETFN[25] (2023)        0.551   0.748   84.25 / 85.18           84.18 / 85.27
TMRN[45] (2023)         0.535   0.762   83.39 / 86.19   53.65   83.67 / 86.08
MTAMW[46] (2024)        0.525   0.782   83.09 / 86.49   53.73   83.48 / 86.45
MIBSA[47] (2024)        0.568   0.753   — / 86.70       52.40   — / 85.80
FRDIN[48] (2024)        0.525   0.778   83.30 / 86.30   54.40   83.70 / 86.20
CRNet[49] (2024)        0.541   0.771   — / 86.20       53.80   — / 86.10
Proposed model          0.512   0.794   83.0 / 86.8     54.5    82.5 / 86.8
Table 3  Comparison of performance with other baseline models on CMU-MOSEI dataset
Method                            MAE     corr    A2              A7      F1
BERT                              0.799   0.746   80.80 / 82.87   40.77   80.91 / 82.90
DeBERTa                           1.154   0.486   66.96 / 68.06   28.86   66.93 / 67.93
RoBERTa                           0.664   0.824   83.63 / 85.65   45.53   83.65 / 85.64
DistilBERT                        0.754   0.768   81.40 / 83.54   40.48   81.48 / 83.55
ALBERT                            0.928   0.670   76.93 / 79.00   34.67   77.10 / 79.08
w/o Intra-Modality Interaction    0.613   0.851   86.90 / 89.10   47.92   86.90 / 89.10
w/o MLP-Interactor                0.590   0.868   85.57 / 87.54   50.00   85.66 / 87.59
w/o audio                         0.635   0.853   86.46 / 88.16   44.79   86.48 / 88.15
w/o video                         0.583   0.861   87.35 / 89.42   49.55   87.45 / 89.47
w/o text                          1.460   0.052   45.95 / 47.47   14.58   50.86 / 52.50
Proposed model                    0.575   0.868   87.60 / 89.60   52.23   87.70 / 89.60
Table 4  Ablation experiment results on CMU-MOSI dataset
Fig. 4  Feature visualization
1 ZHU L, ZHU Z, ZHANG C, et al Multimodal sentiment analysis based on fusion methods: a survey[J]. Information Fusion, 2023, 95: 306- 325
doi: 10.1016/j.inffus.2023.02.028
2 GANDHI A, ADHVARYU K, PORIA S, et al Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions[J]. Information Fusion, 2023, 91: 424- 444
doi: 10.1016/j.inffus.2022.09.025
3 CAO R, YE C, ZHOU H. Multimodal sentiment analysis with self-attention [C]// Proceedings of the Future Technologies Conference. [S. l. ]: Springer, 2021: 16-26.
4 BALTRUSAITIS T, AHUJA C, MORENCY L P Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41 (2): 423- 443
5 GUO W, WANG J, WANG S Deep multimodal representation learning: a survey[J]. IEEE Access, 2019, 7: 63373- 63394
doi: 10.1109/ACCESS.2019.2916887
6 HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis [EB/OL]. (2021-09-16)[2025-05-28]. https://arxiv.org/pdf/2109.00412.
7 HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and specific representations for multimodal sentiment analysis [C]// Proceedings of the 28th ACM International Conference on Multimedia. Seattle: ACM, 2020: 1122-1131.
8 TANG J, LIU D, JIN X, et al Bafn: bi-direction attention based fusion network for multimodal sentiment analysis[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 33 (4): 1966- 1978
9 WU Y, LIN Z, ZHAO Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis [C]// Findings of the Association for Computational Linguistics. [S. l.]: ACL, 2021: 4730-4738.
10 KIM K, PARK S AOBERT: all-modalities-in-one BERT for multimodal sentiment analysis[J]. Information Fusion, 2023, 92: 37- 45
doi: 10.1016/j.inffus.2022.11.022
11 HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis [C]// Proceedings of the 2021 International Conference on Multimodal Interaction. Montréal: ACM, 2021: 6-15.
12 LI Z, GUO Q, PAN Y, et al Multi-level correlation mining framework with self-supervised label generation for multimodal sentiment analysis[J]. Information Fusion, 2023, 99: 101891
doi: 10.1016/j.inffus.2023.101891
13 MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web [C]// Proceedings of the 13th International Conference on Multimodal Interfaces. Alicante: ACM, 2011: 169-176.
14 ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5642-5649.
15 PORIA S, CHATURVEDI I, CAMBRIA E, et al. Convolutional MKL based multimodal emotion recognition and sentiment analysis [C]//IEEE 16th International Conference on Data Mining. Barcelona: IEEE, 2016: 439-448.
16 ALAM F, RICCARDI G. Predicting personality traits using multimodal information [C]// Proceedings of the 2014 ACM Multimedia on Workshop on Computational Personality Recognition. Orlando: ACM, 2014: 15-18.
17 CAI G, XIA B. Convolutional neural networks for multimedia sentiment analysis [C]// Natural Language Processing and Chinese Computing: 4th CCF Conference. Nanchang: Springer, 2015: 159-167.
18 GKOUMAS D, LI Q, LIOMA C, et al What makes the difference? an empirical comparison of fusion strategies for multimodal language analysis[J]. Information Fusion, 2021, 66: 184- 197
doi: 10.1016/j.inffus.2020.09.005
19 LIN T, WANG Y, LIU X, et al A survey of transformers[J]. AI Open, 2022, 3: 111- 132
doi: 10.1016/j.aiopen.2022.10.001
20 TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence: ACL, 2019: 6558-6569.
21 CHEN C, HONG H, GUO J, et al Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1476- 1488
doi: 10.1109/TASLP.2023.3263801
22 YUAN Z, LI W, XU H, et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis [C]// Proceedings of the 29th ACM International Conference on Multimedia. Chengdu: ACM, 2021: 4400-4407.
23 MA L, YAO Y, LIANG T, et al. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos [EB/OL]. (2022-06-17)[2025-05-28]. https://arxiv.org/pdf/2206.07981.
24 YU W, XU H, YUAN Z, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S. l. ]: AAAI Press, 2021: 10790-10797.
25 WANG D, GUO X, TIAN Y, et al TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259
doi: 10.1016/j.patcog.2022.109259
26 MELAS-KYRIAZI L. Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet [EB/OL]. (2021-05-06)[2025-05-28]. https://arxiv.org/pdf/2105.02723.
27 TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al MLP-Mixer: an all-MLP architecture for vision[J]. Advances in Neural Information Processing Systems, 2021, 34: 24261- 24272
28 TOUVRON H, BOJANOWSKI P, CARON M, et al ResMLP: feedforward networks for image classification with data-efficient training[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45 (4): 5314- 5321
29 NIE Y, LI L, GAN Z, et al. MLP architectures for vision-and-language modeling: an empirical study [EB/OL]. (2021-12-08)[2025-05-28]. https://arxiv.org/pdf/2112.04453.
30 LIN H, ZHANG P, LING J, et al PS-mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J]. Information Processing and Management, 2023, 60 (2): 103229
doi: 10.1016/j.ipm.2022.103229
31 SUN H, WANG H, LIU J, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation [C]// Proceedings of the 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 3722-3729.
32 BAIRAVEL S, KRISHNAMURTHY M Novel OGBEE-based feature selection and feature-level fusion with MLP neural network for social media multimodal sentiment analysis[J]. Soft Computing, 2020, 24 (24): 18431- 18445
doi: 10.1007/s00500-020-05049-6
33 KE P, JI H, LIU S, et al. SentiLARE: sentiment-aware language representation learning with linguistic knowledge [EB/OL]. (2020-09-24)[2025-05-28]. https://arxiv.org/pdf/1911.02493.
34 LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach [EB/OL]. (2019-07-26)[2025-05-28]. https://arxiv.org/pdf/1907.11692.
35 DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies [C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence: IEEE, 2014: 960-964.
36 ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos [EB/OL]. (2016-08-11)[2025-05-28]. https://arxiv.org/pdf/1606.06259.
37 ZADEH A A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: ACL, 2018: 2236-2246.
38 CHEONG J H, JOLLY E, XIE T, et al Py-feat: Python facial expression analysis toolbox[J]. Affective Science, 2023, 4 (4): 781- 796
doi: 10.1007/s42761-023-00191-4
39 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. California: Curran Associates Inc., 2017: 5998-6008.
40 ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis [EB/OL]. (2017-07-23)[2025-05-28]. https://arxiv.org/pdf/1707.07250.
41 LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors [EB/OL]. (2018-05-31)[2025-05-28]. https://arxiv.org/pdf/1806.00064.
42 ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5634-5641.
43 RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained Transformers [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. [S. l.]: ACL, 2020: 2359-2369.
44 YANG B, SHAO B, WU L, et al Multimodal sentiment analysis with unidirectional modality translation[J]. Neurocomputing, 2022, 467: 130- 137
doi: 10.1016/j.neucom.2021.09.041
45 LEI Y, YANG D, LI M, et al. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences [C]// CAAI International Conference on Artificial Intelligence. Singapore: Springer, 2023: 189-200.
46 WANG Y, HE J, WANG D, et al Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis[J]. Neurocomputing, 2024, 572: 127181
doi: 10.1016/j.neucom.2023.127181
47 LIU W, CAO S, ZHANG S Multimodal consistency-specificity fusion based on information bottleneck for sentiment analysis[J]. Journal of King Saud University-Computer and Information Sciences, 2024, 36 (2): 101943
doi: 10.1016/j.jksuci.2024.101943
48 ZENG Y, LI Z, CHEN Z, et al A feature-based restoration dynamic interaction network for multimodal sentiment analysis[J]. Engineering Applications of Artificial Intelligence, 2024, 127: 107335
doi: 10.1016/j.engappai.2023.107335