Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (8): 1653-1661    DOI: 10.3785/j.issn.1008-973X.2025.08.012
    
Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor
Yishan LIN1, Jing ZUO1, Shuhua LU1,2,*
1. College of Information and Cyber Security, People’s Public Security University of China, Beijing 102600, China
2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 102600, China

Abstract  

A multimodal sentiment analysis method based on the multi-head self-attention mechanism and MLP-Interactor was proposed in order to solve the problems of poor unimodal feature quality and insufficient multimodal feature interaction in multimodal sentiment analysis. The intra-modality feature interaction module based on the multi-head self-attention mechanism realized feature interaction within each modality and improved the quality of single-modal features. The MLP-Interactor mechanism realized full interaction among multimodal features and learned the consistency information between different modalities. Extensive experiments were conducted with the proposed method on two public datasets, CMU-MOSI and CMU-MOSEI. Results show that the proposed method surpasses many state-of-the-art methods and can effectively improve the accuracy of multimodal sentiment analysis.
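The abstract describes a two-stage design: each modality's token sequence is first refined by multi-head self-attention within that modality, and the refined features then interact across modalities before a regression head predicts the sentiment score. The following PyTorch sketch illustrates that flow under stated assumptions only; the module names, feature dimensions, mean pooling, and concatenation-based fusion are hypothetical and not taken from the paper, and a placeholder stands in for the MLP-Interactor (an illustrative mixing block is sketched after Fig.3 below).

```python
import torch
import torch.nn as nn

class IntraModalInteraction(nn.Module):
    """Refine one modality's features with multi-head self-attention (illustrative sizes)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        out, _ = self.attn(x, x, x)             # tokens of one modality attend to each other
        return self.norm(x + out)               # residual connection

class SentimentModel(nn.Module):
    """High-level flow: per-modality refinement -> cross-modal interaction -> regression."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.modalities = ("text", "audio", "video")
        self.intra = nn.ModuleDict({m: IntraModalInteraction(dim) for m in self.modalities})
        self.interactor = nn.Identity()         # placeholder for the MLP-Interactor module
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats):                   # feats[m]: (batch, seq_len, dim)
        refined = [self.intra[m](feats[m]).mean(dim=1) for m in self.modalities]
        fused = self.interactor(torch.cat(refined, dim=-1))
        return self.head(fused)                 # scalar sentiment score per sample
```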



Key words: multimodal sentiment analysis, MLP-Interactor, multi-head self-attention mechanism, feature interaction
Received: 22 August 2024      Published: 28 July 2025
CLC:  TP 391  
Fund: Double First-Class Innovation Research Special Project in Security Prevention Engineering of the People's Public Security University of China (2023SYL08); 2024 Fundamental Research Funds Project on Cross-modal Data Fusion and Intelligent Interrogation Technology (2024JKF10).
Corresponding Authors: Shuhua LU     E-mail: 375568222@qq.com;lushuhua@ppsuc.edu.cn
Cite this article:

Yishan LIN,Jing ZUO,Shuhua LU. Multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor. Journal of ZheJiang University (Engineering Science), 2025, 59(8): 1653-1661.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.08.012     OR     https://www.zjujournals.com/eng/Y2025/V59/I8/1653


Fig.1 Overall structure of proposed multimodal sentiment analysis model
Fig.2 Multi-head Attention mechanism
Fig.3 MLP-Interactor
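The paper's Fig.3 defines the MLP-Interactor; its exact structure is not reproduced here. As a rough illustration of how an all-MLP block can let modalities exchange information, the sketch below applies MLP-Mixer-style mixing [27] across a stacked modality axis, followed by channel mixing within each modality. All names, dimensions, and the residual layout are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalMLPMixer(nn.Module):
    """Illustrative MLP-based cross-modal interaction (assumption: MLP-Mixer-style
    mixing over the modality axis; not the paper's exact MLP-Interactor)."""
    def __init__(self, dim: int, n_modalities: int = 3, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # mixes information across the modality dimension
        self.modality_mix = nn.Sequential(
            nn.Linear(n_modalities, n_modalities), nn.GELU(), nn.Linear(n_modalities, n_modalities))
        self.norm2 = nn.LayerNorm(dim)
        # mixes information within each modality's feature channels
        self.channel_mix = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                               # x: (batch, n_modalities, dim)
        y = self.norm1(x).transpose(1, 2)               # (batch, dim, n_modalities)
        x = x + self.modality_mix(y).transpose(1, 2)    # cross-modal mixing + residual
        x = x + self.channel_mix(self.norm2(x))         # per-modality channel mixing + residual
        return x

# usage: stack text/audio/video vectors along a modality axis and mix them
feats = torch.randn(8, 3, 128)                          # (batch, modalities, dim)
mixer = CrossModalMLPMixer(dim=128)
print(mixer(feats).shape)                               # torch.Size([8, 3, 128])
```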
Dataset | r_L | S | L_max | N
CMU-MOSI | 3.146×10⁻⁵ | 32 | 50 | 3
CMU-MOSEI | 7.600×10⁻⁶ | 32 | 50 | 3
Tab.1 Experimental parameter setting of multimodal sentiment analysis based on multi-head self-attention mechanism and MLP-Interactor
Model | MAE | corr | A2 | A7 | F1
TFN[40](2017) | 0.901 | 0.698 | —/80.8 | 34.9 | —/80.7
LMF[41](2018) | 0.917 | 0.695 | —/82.5 | 33.2 | —/82.4
MFN[42](2018) | 0.965 | 0.632 | 77.4/— | 34.1 | 77.3/—
MulT[20](2019) | 0.871 | 0.698 | —/83.0 | 40.0 | —/82.8
BBFN[11](2021) | 0.776 | 0.755 | —/84.3 | 45.0 | —/84.3
Self-MM[24](2021) | 0.713 | 0.798 | 84.0/85.98 | — | 84.42/85.95
MISA[7](2020) | 0.783 | 0.761 | 81.8/83.4 | 42.3 | 81.7/83.6
MAG-BERT[43](2020) | 0.731 | 0.798 | 82.5/84.3 | — | 82.6/84.3
CubeMLP[31](2022) | 0.770 | 0.767 | —/85.6 | 45.5 | —/85.5
PS-Mixer[30](2023) | 0.794 | 0.748 | 80.3/82.1 | 44.31 | 80.3/82.1
MTSA[44](2022) | 0.696 | 0.806 | —/86.8 | 46.4 | —/86.8
AOBERT[10](2023) | 0.856 | 0.700 | 85.2/85.6 | 40.2 | 85.4/86.4
TETFN[25](2023) | 0.717 | 0.800 | 84.05/86.10 | — | 83.83/86.07
TMRN[45](2023) | 0.704 | 0.784 | 83.67/85.67 | 48.68 | 83.45/85.52
MTAMW[46](2024) | 0.712 | 0.794 | 84.40/86.59 | 46.84 | 84.20/86.46
MIBSA[47](2024) | 0.728 | 0.798 | —/87.00 | 43.10 | —/87.20
FRDIN[48](2024) | 0.682 | 0.813 | 85.8/87.4 | 46.59 | 85.3/87.5
CRNet[49](2024) | 0.712 | 0.797 | —/86.4 | 47.40 | —/86.4
Proposed model | 0.575 | 0.868 | 87.6/89.6 | 52.23 | 87.7/89.6
Tab.2 Comparison of performance on CMU-MOSI dataset with other benchmark models
Model | MAE | corr | A2 | A7 | F1
TFN[40](2017) | 0.593 | 0.700 | —/82.5 | 50.2 | —/82.1
LMF[41](2018) | 0.623 | 0.677 | —/82.0 | 48.0 | —/82.1
MulT[20](2019) | 0.580 | 0.703 | —/82.5 | 51.8 | —/82.3
BBFN[11](2021) | 0.529 | 0.767 | —/86.2 | 54.8 | —/86.1
Self-MM[24](2021) | 0.530 | 0.765 | 82.81/85.17 | — | 82.53/85.30
MISA[7](2020) | 0.555 | 0.756 | 83.6/85.5 | 52.2 | 83.8/85.3
MAG-BERT[43](2020) | 0.543 | 0.755 | 82.51/84.82 | — | 82.77/84.71
CubeMLP[31](2022) | 0.529 | 0.760 | —/85.1 | 54.9 | —/84.5
PS-Mixer[30](2023) | 0.537 | 0.765 | 83.1/86.1 | 53.0 | 83.1/86.1
MTSA[44](2022) | 0.541 | 0.774 | —/85.5 | 52.9 | —/85.3
AOBERT[10](2023) | 0.515 | 0.763 | 84.9/86.2 | 54.5 | 85.0/85.9
TETFN[25](2023) | 0.551 | 0.748 | 84.25/85.18 | — | 84.18/85.27
TMRN[45](2023) | 0.535 | 0.762 | 83.39/86.19 | 53.65 | 83.67/86.08
MTAMW[46](2024) | 0.525 | 0.782 | 83.09/86.49 | 53.73 | 83.48/86.45
MIBSA[47](2024) | 0.568 | 0.753 | —/86.70 | 52.40 | —/85.80
FRDIN[48](2024) | 0.525 | 0.778 | 83.30/86.30 | 54.40 | 83.70/86.20
CRNet[49](2024) | 0.541 | 0.771 | —/86.20 | 53.80 | —/86.10
Proposed model | 0.512 | 0.794 | 83.0/86.8 | 54.5 | 82.5/86.8
Tab.3 Comparison of performance on CMU-MOSEI dataset with other benchmark models
Method | MAE | corr | A2 | A7 | F1
BERT | 0.799 | 0.746 | 80.80/82.87 | 40.77 | 80.91/82.90
DeBERTa | 1.154 | 0.486 | 66.96/68.06 | 28.86 | 66.93/67.93
RoBERTa | 0.664 | 0.824 | 83.63/85.65 | 45.53 | 83.65/85.64
DistilBERT | 0.754 | 0.768 | 81.40/83.54 | 40.48 | 81.48/83.55
ALBERT | 0.928 | 0.670 | 76.93/79.00 | 34.67 | 77.10/79.08
w/o Intra-Modality Interaction | 0.613 | 0.851 | 86.90/89.10 | 47.92 | 86.90/89.10
w/o MLP-Interactor | 0.590 | 0.868 | 85.57/87.54 | 50.00 | 85.66/87.59
w/o audio | 0.635 | 0.853 | 86.46/88.16 | 44.79 | 86.48/88.15
w/o video | 0.583 | 0.861 | 87.35/89.42 | 49.55 | 87.45/89.47
w/o text | 1.460 | 0.052 | 45.95/47.47 | 14.58 | 50.86/52.50
Proposed model | 0.575 | 0.868 | 87.60/89.60 | 52.23 | 87.70/89.60
Tab.4 Result of ablation experiment on CMU-MOSI dataset
Fig.4 Feature visualization
[1]   ZHU L, ZHU Z, ZHANG C, et al Multimodal sentiment analysis based on fusion methods: a survey[J]. Information Fusion, 2023, 95: 306- 325
doi: 10.1016/j.inffus.2023.02.028
[2]   GANDHI A, ADHVARYU K, PORIA S, et al Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions[J]. Information Fusion, 2023, 91: 424- 444
doi: 10.1016/j.inffus.2022.09.025
[3]   CAO R, YE C, ZHOU H. Multimodal sentiment analysis with self-attention [C]// Proceedings of the Future Technologies Conference. [S. l. ]: Springer, 2021: 16-26.
[4]   BALTRUSAITIS T, AHUJA C, MORENCY L P Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41 (2): 423- 443
[5]   GUO W, WANG J, WANG S Deep multimodal representation learning: a survey[J]. IEEE Access, 2019, 7: 63373- 63394
doi: 10.1109/ACCESS.2019.2916887
[6]   HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis [EB/OL]. (2021-09-16)[2025-05-28]. https://arxiv.org/pdf/2109.00412.
[7]   HAZARIKA D, ZIMMERMANN R, PORIA S. Misa: modality-invariant and specific representations for multimodal sentiment analysis [C]// Proceedings of the 28th ACM International Conference on Multimedia. Seattle: ACM, 2020: 1122-1131.
[8]   TANG J, LIU D, JIN X, et al Bafn: bi-direction attention based fusion network for multimodal sentiment analysis[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 33 (4): 1966- 1978
[9]   WU Y, LIN Z, ZHAO Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis [C]// Findings of the Association for Computational Linguistics. [S. l.]: ACL, 2021: 4730-4738.
[10]   KIM K, PARK S AOBERT: all-modalities-in-one BERT for multimodal sentiment analysis[J]. Information Fusion, 2023, 92: 37- 45
doi: 10.1016/j.inffus.2022.11.022
[11]   HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis [C]// Proceedings of the 2021 International Conference on Multimodal Interaction. Montréal: ACM, 2021: 6-15.
[12]   LI Z, GUO Q, PAN Y, et al Multi-level correlation mining framework with self-supervised label generation for multimodal sentiment analysis[J]. Information Fusion, 2023, 99: 101891
doi: 10.1016/j.inffus.2023.101891
[13]   MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web [C]// Proceedings of the 13th International Conference on Multimodal Interfaces. Alicante: ACM, 2011: 169-176.
[14]   ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5642-5649.
[15]   PORIA S, CHATURVEDI I, CAMBRIA E, et al. Convolutional MKL based multimodal emotion recognition and sentiment analysis [C]//IEEE 16th International Conference on Data Mining. Barcelona: IEEE, 2016: 439-448.
[16]   ALAM F, RICCARDI G. Predicting personality traits using multimodal information [C]// Proceedings of the 2014 ACM Multimedia on Workshop on Computational Personality Recognition. Orlando: ACM, 2014: 15-18.
[17]   CAI G, XIA B. Convolutional neural networks for multimedia sentiment analysis [C]// Natural Language Processing and Chinese Computing: 4th CCF Conference. Nanchang: Springer, 2015: 159-167.
[18]   GKOUMAS D, LI Q, LIOMA C, et al What makes the difference? an empirical comparison of fusion strategies for multimodal language analysis[J]. Information Fusion, 2021, 66: 184- 197
doi: 10.1016/j.inffus.2020.09.005
[19]   LIN T, WANG Y, LIU X, et al A survey of transformers[J]. AI Open, 2022, 3: 111- 132
doi: 10.1016/j.aiopen.2022.10.001
[20]   TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence: ACL, 2019: 6558-6569.
[21]   CHEN C, HONG H, GUO J, et al Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1476- 1488
doi: 10.1109/TASLP.2023.3263801
[22]   YUAN Z, LI W, XU H, et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis [C]// Proceedings of the 29th ACM International Conference on Multimedia. Chengdu: ACM, 2021: 4400-4407.
[23]   MA L, YAO Y, LIANG T, et al. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos [EB/OL]. (2022-06-17)[2025-05-28]. https://arxiv.org/pdf/2206.07981.
[24]   YU W, XU H, YUAN Z, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S. l. ]: AAAI Press, 2021: 10790-10797.
[25]   WANG D, GUO X, TIAN Y, et al TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259
doi: 10.1016/j.patcog.2022.109259
[26]   MELAS-KYRIAZI L. Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet [EB/OL]. (2021-05-06)[2025-05-28]. https://arxiv.org/pdf/2105.02723.
[27]   TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al Mlp-mixer: an all-mlp architecture for vision[J]. Advances in Neural Information Processing Systems, 2021, 34: 24261- 24272
[28]   TOUVRON H, BOJANOWSKI P, CARON M, et al Resmlp: feedforward networks for image classification with data-efficient training[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45 (4): 5314- 5321
[29]   NIE Y, LI L, GAN Z, et al. Mlp architectures for vision-and-language modeling: an empirical study [EB/OL]. (2021-12-08)[2025-05-28]. https://arxiv.org/pdf/2112.04453.
[30]   LIN H, ZHANG P, LING J, et al PS-mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J]. Information Processing and Management, 2023, 60 (2): 103229
doi: 10.1016/j.ipm.2022.103229
[31]   SUN H, WANG H, LIU J, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation [C]// Proceedings of the 30th ACM International Conference on Multimedia. Lisboa: ACM, 2022: 3722-3729.
[32]   BAIRAVEL S, KRISHNAMURTHY M Novel OGBEE-based feature selection and feature-level fusion with MLP neural network for social media multimodal sentiment analysis[J]. Soft Computing, 2020, 24 (24): 18431- 18445
doi: 10.1007/s00500-020-05049-6
[33]   KE P, JI H, LIU S, et al. SentiLARE: sentiment-aware language representation learning with linguistic knowledge [EB/OL]. (2020-09-24)[2025-05-28]. https://arxiv.org/pdf/1911.02493.
[34]   LIU Y, OTT M, GOYAL N, et al. Roberta: a robustly optimized bert pretraining approach [EB/OL]. (2019-07-26)[2025-05-28]. https://arxiv.org/pdf/1907.11692.
[35]   DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies [C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence: IEEE, 2014: 960-964.
[36]   ZADEH A, ZELLERS R, PINCUS E, et al. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos [EB/OL]. (2016-08-11)[2025-05-28]. https://arxiv.org/pdf/1606.06259.
[37]   ZADEH A A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: ACL, 2018: 2236-2246.
[38]   CHEONG J H, JOLLY E, XIE T, et al Py-feat: Python facial expression analysis toolbox[J]. Affective Science, 2023, 4 (4): 781- 796
doi: 10.1007/s42761-023-00191-4
[39]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. California: Curran Associates Inc., 2017: 5998-6008.
[40]   ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis [EB/OL]. (2017-07-23)[2025-05-28]. https://arxiv.org/pdf/1707.07250.
[41]   LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors [EB/OL]. (2018-05-31)[2025-05-28]. https://arxiv.org/pdf/1806.00064.
[42]   ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5634-5641.
[43]   RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained Transformers [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. [S. l.]: ACL, 2020: 2359-2369.
[44]   YANG B, SHAO B, WU L, et al Multimodal sentiment analysis with unidirectional modality translation[J]. Neurocomputing, 2022, 467: 130- 137
doi: 10.1016/j.neucom.2021.09.041
[45]   LEI Y, YANG D, LI M, et al. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences [C]// CAAI International Conference on Artificial Intelligence. Singapore: Springer, 2023: 189-200.
[46]   WANG Y, HE J, WANG D, et al Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis[J]. Neurocomputing, 2024, 572: 127181
doi: 10.1016/j.neucom.2023.127181
[47]   LIU W, CAO S, ZHANG S Multimodal consistency-specificity fusion based on information bottleneck for sentiment analysis[J]. Journal of King Saud University-Computer and Information Sciences, 2024, 36 (2): 101943
doi: 10.1016/j.jksuci.2024.101943
[48]   ZENG Y, LI Z, CHEN Z, et al A feature-based restoration dynamic interaction network for multimodal sentiment analysis[J]. Engineering Applications of Artificial Intelligence, 2024, 127: 107335
doi: 10.1016/j.engappai.2023.107335