Multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer |
Qiao-hong CHEN, Jia-jin SUN, Yang-bo LOU, Zhi-jian FANG
School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China |
Abstract A new multimodal sentiment analysis model (MTSA) was proposed on the basis of the cross-modal Transformer, aiming at the difficulty of retaining modal feature heterogeneity during single-modal feature extraction and at the feature redundancy introduced by cross-modal feature fusion. Long short-term memory (LSTM) networks and a multi-task learning framework were used to extract single-modal contextual semantic information; noise was filtered out and modal feature heterogeneity was preserved by summing the auxiliary modal task losses. A multi-task gating mechanism was used to adjust cross-modal feature fusion, and the text, audio and visual modal features were fused in a stacked cross-modal Transformer structure to increase fusion depth and avoid feature redundancy. MTSA was evaluated on the MOSEI and SIMS datasets. Results show that, compared with other advanced models, MTSA achieves better overall performance, with binary classification accuracies of 83.51% and 84.18%, respectively.
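
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of those ideas, not the authors' implementation: per-modality LSTM encoders, a gate on each modality before fusion, a stack of cross-modal Transformer layers in which the text stream attends to the audio and then the visual features, and a training loss that adds the auxiliary single-modal task losses to the main fused-task loss. The class names (MTSASketch, CrossModalLayer, mtsa_loss), all feature dimensions (e.g. 768-dimensional text, 74-dimensional acoustic, 35-dimensional facial features), the stack depth and the auxiliary loss weight are illustrative assumptions, and nn.MultiheadAttention stands in for the cross-modal attention block.

# Minimal sketch of the MTSA ideas (assumed sizes and names, not the released code).
import torch
import torch.nn as nn


class CrossModalLayer(nn.Module):
    """One cross-modal Transformer layer: queries come from the target modality,
    keys/values from the source modality, followed by a feed-forward block."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, query: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(query, source, source)
        x = self.norm1(query + attn_out)
        return self.norm2(x + self.ffn(x))


class MTSASketch(nn.Module):
    def __init__(self, dims=(768, 74, 35), d_model: int = 128, depth: int = 3):
        super().__init__()
        # One LSTM encoder per modality (text, audio, visual).
        self.encoders = nn.ModuleList(
            [nn.LSTM(d, d_model, batch_first=True) for d in dims])
        # Gates that rescale each modality's features before fusion.
        self.gates = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid()) for _ in dims])
        # Stacked cross-modal layers: text attends to audio, then to visual, repeatedly.
        self.t2a = nn.ModuleList([CrossModalLayer(d_model) for _ in range(depth)])
        self.t2v = nn.ModuleList([CrossModalLayer(d_model) for _ in range(depth)])
        # Main (fused) prediction head plus one auxiliary head per modality.
        self.main_head = nn.Linear(d_model, 1)
        self.aux_heads = nn.ModuleList([nn.Linear(d_model, 1) for _ in dims])

    def forward(self, text, audio, visual):
        feats = []
        for enc, gate, x in zip(self.encoders, self.gates, (text, audio, visual)):
            h, _ = enc(x)                # (batch, seq, d_model)
            feats.append(gate(h) * h)    # gated sequence features
        t, a, v = feats
        for t2a, t2v in zip(self.t2a, self.t2v):
            t = t2v(t2a(t, a), v)        # one stacked fusion stage
        fused = t.mean(dim=1)            # pool over time
        main_pred = self.main_head(fused)
        aux_preds = [head(f.mean(dim=1)) for head, f in zip(self.aux_heads, feats)]
        return main_pred, aux_preds


def mtsa_loss(main_pred, aux_preds, label, aux_weight: float = 0.1):
    """Main-task loss plus the summed auxiliary single-modal losses (illustrative weight)."""
    mse = nn.MSELoss()
    loss = mse(main_pred, label)
    for pred in aux_preds:
        loss = loss + aux_weight * mse(pred, label)
    return loss


if __name__ == "__main__":
    model = MTSASketch()
    text = torch.randn(2, 50, 768)    # e.g. BERT token features
    audio = torch.randn(2, 50, 74)    # e.g. librosa acoustic features
    visual = torch.randn(2, 50, 35)   # e.g. OpenFace facial features
    label = torch.randn(2, 1)
    main_pred, aux_preds = model(text, audio, visual)
    print(mtsa_loss(main_pred, aux_preds, label).item())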
Received: 11 February 2023
Published: 27 December 2023
Fund: Young and Middle-aged Backbone Talent Cultivation Funding Project of Zhejiang Sci-Tech University
Keywords:
multimodal sentiment analysis,
long short-term memory (LSTM),
Transformer,
multi-task learning,
cross-modal feature fusion