Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (8): 1653-1661    DOI: 10.3785/j.issn.1008-973X.2025.08.012
    
Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor
Yishan LIN1, Jing ZUO1, Shuhua LU1,2,*
1. College of Information and Cyber Security, People’s Public Security University of China, Beijing 102600, China
2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 102600, China

Abstract  

A multimodal sentiment analysis method based on the multi-head self-attention mechanism and MLP-Interactor was proposed in order to solve the problems of poor unimodal feature quality and insufficient multimodal feature interaction in multimodal sentiment analysis. The intra-modality feature interaction module based on the multi-head self-attention mechanism realized feature interaction within each modality and improved the quality of single-modal features. The MLP-Interactor mechanism realized full interaction among multimodal features and learned the consistency information between different modalities. Extensive experiments were conducted with the proposed method on two public datasets, CMU-MOSI and CMU-MOSEI. Results show that the proposed method surpasses many state-of-the-art methods and can effectively improve the accuracy of multimodal sentiment analysis.
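The abstract describes a two-stage design: each modality's token sequence is first refined by multi-head self-attention within that modality, and the refined features then interact across modalities before a regression head predicts the sentiment score. The following PyTorch sketch illustrates that flow under stated assumptions only; the module names, feature dimensions, mean pooling, and concatenation-based fusion are hypothetical and not taken from the paper, and a placeholder stands in for the MLP-Interactor (an illustrative mixing block is sketched after Fig.3 below).

```python
import torch
import torch.nn as nn

class IntraModalInteraction(nn.Module):
    """Refine one modality's features with multi-head self-attention (illustrative sizes)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        out, _ = self.attn(x, x, x)             # tokens of one modality attend to each other
        return self.norm(x + out)               # residual connection

class SentimentModel(nn.Module):
    """High-level flow: per-modality refinement -> cross-modal interaction -> regression."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.modalities = ("text", "audio", "video")
        self.intra = nn.ModuleDict({m: IntraModalInteraction(dim) for m in self.modalities})
        self.interactor = nn.Identity()         # placeholder for the MLP-Interactor module
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats):                   # feats[m]: (batch, seq_len, dim)
        refined = [self.intra[m](feats[m]).mean(dim=1) for m in self.modalities]
        fused = self.interactor(torch.cat(refined, dim=-1))
        return self.head(fused)                 # scalar sentiment score per sample
```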



Key words: multimodal sentiment analysis, MLP-Interactor, multi-head self-attention mechanism, feature interaction
Received: 22 August 2024      Published: 28 July 2025
CLC:  TP 391  
Fund: Double First-Class Innovation Research Special Project in Security Prevention Engineering of the People's Public Security University of China (2023SYL08); 2024 Fundamental Research Funds Project on Cross-modal Data Fusion and Intelligent Interrogation Technology (2024JKF10).
Corresponding Authors: Shuhua LU     E-mail: 375568222@qq.com;lushuhua@ppsuc.edu.cn
Cite this article:

Yishan LIN,Jing ZUO,Shuhua LU. Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor. Journal of ZheJiang University (Engineering Science), 2025, 59(8): 1653-1661.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.08.012     OR     https://www.zjujournals.com/eng/Y2025/V59/I8/1653


Fig.1 Overall structure of proposed multimodal sentiment analysis model
Fig.2 Multi-head Attention mechanism
Fig.3 MLP-Interactor
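The paper's Fig.3 defines the MLP-Interactor; its exact structure is not reproduced here. As a rough illustration of how an all-MLP block can let modalities exchange information, the sketch below applies MLP-Mixer-style mixing [27] across a stacked modality axis, followed by channel mixing within each modality. All names, dimensions, and the residual layout are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalMLPMixer(nn.Module):
    """Illustrative MLP-based cross-modal interaction (assumption: MLP-Mixer-style
    mixing over the modality axis; not the paper's exact MLP-Interactor)."""
    def __init__(self, dim: int, n_modalities: int = 3, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # mixes information across the modality dimension
        self.modality_mix = nn.Sequential(
            nn.Linear(n_modalities, n_modalities), nn.GELU(), nn.Linear(n_modalities, n_modalities))
        self.norm2 = nn.LayerNorm(dim)
        # mixes information within each modality's feature channels
        self.channel_mix = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                               # x: (batch, n_modalities, dim)
        y = self.norm1(x).transpose(1, 2)               # (batch, dim, n_modalities)
        x = x + self.modality_mix(y).transpose(1, 2)    # cross-modal mixing + residual
        x = x + self.channel_mix(self.norm2(x))         # per-modality channel mixing + residual
        return x

# usage: stack text/audio/video vectors along a modality axis and mix them
feats = torch.randn(8, 3, 128)                          # (batch, modalities, dim)
mixer = CrossModalMLPMixer(dim=128)
print(mixer(feats).shape)                               # torch.Size([8, 3, 128])
```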
Dataset | r_L | S | L_max | N
CMU-MOSI | 3.146×10⁻⁵ | 32 | 50 | 3
CMU-MOSEI | 7.600×10⁻⁶ | 32 | 50 | 3
Tab.1 Experimental parameter setting of multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor
Model | MAE | corr | A2 | A7 | F1
TFN[40](2017) | 0.901 | 0.698 | —/80.8 | 34.9 | —/80.7
LMF[41](2018) | 0.917 | 0.695 | —/82.5 | 33.2 | —/82.4
MFN[42](2018) | 0.965 | 0.632 | 77.4/— | 34.1 | 77.3/—
MulT[20](2019) | 0.871 | 0.698 | —/83.0 | 40.0 | —/82.8
BBFN[11](2021) | 0.776 | 0.755 | —/84.3 | 45.0 | —/84.3
Self-MM[24](2021) | 0.713 | 0.798 | 84.0/85.98 | — | 84.42/85.95
MISA[7](2020) | 0.783 | 0.761 | 81.8/83.4 | 42.3 | 81.7/83.6
MAG-BERT[43](2020) | 0.731 | 0.798 | 82.5/84.3 | — | 82.6/84.3
CubeMLP[31](2022) | 0.770 | 0.767 | —/85.6 | 45.5 | —/85.5
PS-Mixer[30](2023) | 0.794 | 0.748 | 80.3/82.1 | 44.31 | 80.3/82.1
MTSA[44](2022) | 0.696 | 0.806 | —/86.8 | 46.4 | —/86.8
AOBERT[10](2023) | 0.856 | 0.700 | 85.2/85.6 | 40.2 | 85.4/86.4
TETFN[25](2023) | 0.717 | 0.800 | 84.05/86.10 | — | 83.83/86.07
TMRN[45](2023) | 0.704 | 0.784 | 83.67/85.67 | 48.68 | 83.45/85.52
MTAMW[46](2024) | 0.712 | 0.794 | 84.40/86.59 | 46.84 | 84.20/86.46
MIBSA[47](2024) | 0.728 | 0.798 | —/87.00 | 43.10 | —/87.20
FRDIN[48](2024) | 0.682 | 0.813 | 85.8/87.4 | 46.59 | 85.3/87.5
CRNet[49](2024) | 0.712 | 0.797 | —/86.4 | 47.40 | —/86.4
Proposed model | 0.575 | 0.868 | 87.6/89.6 | 52.23 | 87.7/89.6
Tab.2 Comparison of performance on CMU-MOSI dataset with other benchmark models
Model | MAE | corr | A2 | A7 | F1
TFN[40](2017) | 0.593 | 0.700 | —/82.5 | 50.2 | —/82.1
LMF[41](2018) | 0.623 | 0.677 | —/82.0 | 48.0 | —/82.1
MulT[20](2019) | 0.580 | 0.703 | —/82.5 | 51.8 | —/82.3
BBFN[11](2021) | 0.529 | 0.767 | —/86.2 | 54.8 | —/86.1
Self-MM[24](2021) | 0.530 | 0.765 | 82.81/85.17 | — | 82.53/85.30
MISA[7](2020) | 0.555 | 0.756 | 83.6/85.5 | 52.2 | 83.8/85.3
MAG-BERT[43](2020) | 0.543 | 0.755 | 82.51/84.82 | — | 82.77/84.71
CubeMLP[31](2022) | 0.529 | 0.760 | —/85.1 | 54.9 | —/84.5
PS-Mixer[30](2023) | 0.537 | 0.765 | 83.1/86.1 | 53.0 | 83.1/86.1
MTSA[44](2022) | 0.541 | 0.774 | —/85.5 | 52.9 | —/85.3
AOBERT[10](2023) | 0.515 | 0.763 | 84.9/86.2 | 54.5 | 85.0/85.9
TETFN[25](2023) | 0.551 | 0.748 | 84.25/85.18 | — | 84.18/85.27
TMRN[45](2023) | 0.535 | 0.762 | 83.39/86.19 | 53.65 | 83.67/86.08
MTAMW[46](2024) | 0.525 | 0.782 | 83.09/86.49 | 53.73 | 83.48/86.45
MIBSA[47](2024) | 0.568 | 0.753 | —/86.70 | 52.40 | —/85.80
FRDIN[48](2024) | 0.525 | 0.778 | 83.30/86.30 | 54.40 | 83.70/86.20
CRNet[49](2024) | 0.541 | 0.771 | —/86.20 | 53.80 | —/86.10
Proposed model | 0.512 | 0.794 | 83.0/86.8 | 54.5 | 82.5/86.8
Tab.3 Comparison of performance on CMU-MOSEI dataset with other benchmark models
Method | MAE | corr | A2 | A7 | F1
BERT | 0.799 | 0.746 | 80.80/82.87 | 40.77 | 80.91/82.90
DeBERTa | 1.154 | 0.486 | 66.96/68.06 | 28.86 | 66.93/67.93
RoBERTa | 0.664 | 0.824 | 83.63/85.65 | 45.53 | 83.65/85.64
DistilBERT | 0.754 | 0.768 | 81.40/83.54 | 40.48 | 81.48/83.55
ALBERT | 0.928 | 0.670 | 76.93/79.00 | 34.67 | 77.10/79.08
w/o Intra-Modality Interaction | 0.613 | 0.851 | 86.90/89.10 | 47.92 | 86.90/89.10
w/o MLP-Interactor | 0.590 | 0.868 | 85.57/87.54 | 50.00 | 85.66/87.59
w/o audio | 0.635 | 0.853 | 86.46/88.16 | 44.79 | 86.48/88.15
w/o video | 0.583 | 0.861 | 87.35/89.42 | 49.55 | 87.45/89.47
w/o text | 1.460 | 0.052 | 45.95/47.47 | 14.58 | 50.86/52.50
Proposed model | 0.575 | 0.868 | 87.60/89.60 | 52.23 | 87.70/89.60
Tab.4 Result of ablation experiment on CMU-MOSI dataset
Fig.4 Feature visualization
[1]   ZHU L, ZHU Z, ZHANG C, et al Multimodal sentiment analysis based on fusion methods: a survey[J]. Information Fusion, 2023, 95: 306- 325
doi: 10.1016/j.inffus.2023.02.028
[2]   GANDHI A, ADHVARYU K, PORIA S, et al Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions[J]. Information Fusion, 2023, 91: 424- 444
doi: 10.1016/j.inffus.2022.09.025
[3]   CAO R, YE C, ZHOU H. Multimodal sentiment analysis with self-attention [C]// Proceedings of the Future Technologies Conference. [S. l. ]: Springer, 2021: 16-26.
[4]   BALTRUSAITIS T, AHUJA C, MORENCY L P Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41 (2): 423- 443
[5]   GUO W, WANG J, WANG S Deep multimodal representation learning: a survey[J]. IEEE Access, 2019, 7: 63373- 63394
doi: 10.1109/ACCESS.2019.2916887
[6]   HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis [EB/OL]. (2021-09-16)[2025-05-28]. https://arxiv.org/pdf/2109.00412.
[7]   HAZARIKA D, ZIMMERMANN R, PORIA S. Misa: modality-invariant and specific representations for multimodal sentiment analysis [C]// Proceedings of the 28th ACM International Conference on Multimedia. Seattle: ACM, 2020: 1122-1131.
[8]   TANG J, LIU D, JIN X, et al Bafn: bi-direction attention based fusion network for multimodal sentiment analysis[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 33 (4): 1966- 1978
[9]   WU Y, LIN Z, ZHAO Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis [C]// Findings of the Association for Computational Linguistics. [S. l.]: ACL, 2021: 4730-4738.
[10]   KIM K, PARK S AOBERT: all-modalities-in-one BERT for multimodal sentiment analysis[J]. Information Fusion, 2023, 92: 37- 45
doi: 10.1016/j.inffus.2022.11.022
[11]   HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis [C]// Proceedings of the 2021 International Conference on Multimodal Interaction. Montréal: ACM, 2021: 6-15.
[12]   LI Z, GUO Q, PAN Y, et al Multi-level correlation mining framework with self-supervised label generation for multimodal sentiment analysis[J]. Information Fusion, 2023, 99: 101891
doi: 10.1016/j.inffus.2023.101891
[13]   MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web [C]// Proceedings of the 13th International Conference on Multimodal Interfaces. Alicante: ACM, 2011: 169-176.
[14]   ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5642-5649.
[15]   PORIA S, CHATURVEDI I, CAMBRIA E, et al. Convolutional MKL based multimodal emotion recognition and sentiment analysis [C]//IEEE 16th International Conference on Data Mining. Barcelona: IEEE, 2016: 439-448.
[16]   ALAM F, RICCARDI G. Predicting personality traits using multimodal information [C]// Proceedings of the 2014 ACM Multimedia on Workshop on Computational Personality Recognition. Orlando: ACM, 2014: 15-18.
[17]   CAI G, XIA B. Convolutional neural networks for multimedia sentiment analysis [C]// Natural Language Processing and Chinese Computing: 4th CCF Conference. Nanchang: Springer, 2015: 159-167.
[18]   GKOUMAS D, LI Q, LIOMA C, et al What makes the difference? an empirical comparison of fusion strategies for multimodal language analysis[J]. Information Fusion, 2021, 66: 184- 197
doi: 10.1016/j.inffus.2020.09.005
[19]   LIN T, WANG Y, LIU X, et al A survey of transformers[J]. AI Open, 2022, 3: 111- 132
doi: 10.1016/j.aiopen.2022.10.001
[20]   TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence: ACL, 2019: 6558-6569.
[21]   CHEN C, HONG H, GUO J, et al Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1476- 1488
doi: 10.1109/TASLP.2023.3263801
[22]   YUAN Z, LI W, XU H, et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis [C]// Proceedings of the 29th ACM International Conference on Multimedia. Chengdu: ACM, 2021: 4400-4407.
[23]   MA L, YAO Y, LIANG T, et al. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos [EB/OL]. (2022-06-17)[2025-05-28]. https://arxiv.org/pdf/2206.07981.
[24]   YU W, XU H, YUAN Z, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S. l. ]: AAAI Press, 2021: 10790-10797.
[25]   WANG D, GUO X, TIAN Y, et al TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259
doi: 10.1016/j.patcog.2022.109259
[26]   MELAS-KYRIAZI L. Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet [EB/OL]. (2021-05-06)[2025-05-28]. https://arxiv.org/pdf/2105.02723.
[27]   TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al Mlp-mixer: an all-mlp architecture for vision[J]. Advances in Neural Information Processing Systems, 2021, 34: 24261- 24272
[28]   TOUVRON H, BOJANOWSKI P, CARON M, et al Resmlp: feedforward networks for image classification with data-efficient training[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45 (4): 5314- 5321
[29]   NIE Y, LI L, GAN Z, et al. Mlp architectures for vision-and-language modeling: an empirical study [EB/OL]. (2021-12-08)[2025-05-28]. https://arxiv.org/pdf/2112.04453.
[30]   LIN H, ZHANG P, LING J, et al PS-mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J]. Information Processing and Management, 2023, 60 (2): 103229
doi: 10.1016/j.ipm.2022.103229
[31]   SUN H, WANG H, LIU J, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation [C]// Proceedings of the 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 3722-3729.
[32]   BAIRAVEL S, KRISHNAMURTHY M Novel OGBEE-based feature selection and feature-level fusion with MLP neural network for social media multimodal sentiment analysis[J]. Soft Computing, 2020, 24 (24): 18431- 18445
doi: 10.1007/s00500-020-05049-6
[33]   KE P, JI H, LIU S, et al. SentiLARE: sentiment-aware language representation learning with linguistic knowledge [EB/OL]. (2020-09-24)[2025-05-28]. https://arxiv.org/pdf/1911.02493.
[34]   LIU Y, OTT M, GOYAL N, et al. Roberta: a robustly optimized bert pretraining approach [EB/OL]. (2019-07-26)[2025-05-28]. https://arxiv.org/pdf/1907.11692.
[35]   DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies [C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence: IEEE, 2014: 960-964.
[36]   ZADEH A, ZELLERS R, PINCUS E, et al. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos [EB/OL]. (2016-08-11)[2025-05-28]. https://arxiv.org/pdf/1606.06259.
[37]   ZADEH A A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: ACL, 2018: 2236-2246.
[38]   CHEONG J H, JOLLY E, XIE T, et al Py-feat: Python facial expression analysis toolbox[J]. Affective Science, 2023, 4 (4): 781- 796
doi: 10.1007/s42761-023-00191-4
[39]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. California: Curran Associates Inc., 2017: 5998-6008.
[40]   ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis [EB/OL]. (2017-07-23)[2025-05-28]. https://arxiv.org/pdf/1707.07250.
[41]   LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors [EB/OL]. (2018-05-31)[2025-05-28]. https://arxiv.org/pdf/1806.00064.
[42]   ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5634-5641.
[43]   RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained Transformers [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. [S. l.]: ACL, 2020: 2359-2369.
[44]   YANG B, SHAO B, WU L, et al Multimodal sentiment analysis with unidirectional modality translation[J]. Neurocomputing, 2022, 467: 130- 137
doi: 10.1016/j.neucom.2021.09.041
[45]   LEI Y, YANG D, LI M, et al. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences [C]// CAAI International Conference on Artificial Intelligence. Singapore: Springer, 2023: 189-200.
[46]   WANG Y, HE J, WANG D, et al Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis[J]. Neurocomputing, 2024, 572: 127181
doi: 10.1016/j.neucom.2023.127181
[47]   LIU W, CAO S, ZHANG S Multimodal consistency-specificity fusion based on information bottleneck for sentiment analysis[J]. Journal of King Saud University-Computer and Information Sciences, 2024, 36 (2): 101943
doi: 10.1016/j.jksuci.2024.101943
[48]   ZENG Y, LI Z, CHEN Z, et al A feature-based restoration dynamic interaction network for multimodal sentiment analysis[J]. Engineering Applications of Artificial Intelligence, 2024, 127: 107335
doi: 10.1016/j.engappai.2023.107335