Journal of ZheJiang University (Engineering Science)  2023, Vol. 57 Issue (12): 2421-2429    DOI: 10.3785/j.issn.1008-973X.2023.12.009
    
Multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer
Qiao-hong CHEN, Jia-jin SUN, Yang-bo LOU, Zhi-jian FANG
School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

Abstract  

A new multimodal sentiment analysis model (MTSA) based on the cross-modal Transformer was proposed to address two difficulties: preserving modal feature heterogeneity during single-modal feature extraction, and avoiding feature redundancy during cross-modal feature fusion. Long short-term memory (LSTM) networks within a multi-task learning framework were used to extract single-modal contextual semantic information; noise was filtered out and modal feature heterogeneity was preserved by summing the auxiliary modal task losses. A multi-task gating mechanism was used to adjust cross-modal feature fusion, and text, audio and visual features were fused in a stacked cross-modal Transformer structure to increase fusion depth and avoid feature redundancy. MTSA was evaluated on the MOSEI and SIMS datasets. The results show that, compared with other advanced models, MTSA achieves better overall performance, with binary classification accuracies of 83.51% and 84.18%, respectively.
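The multi-task branch described above can be sketched as follows: an LSTM context encoder per modality with an auxiliary sentiment head, and a training loss that adds the summed auxiliary per-modality losses to the main-task loss. This is a minimal PyTorch-style sketch under assumed layer sizes and loss functions, not the authors' released implementation; on SIMS, the fine-grained per-modality annotations [8] could serve directly as the auxiliary targets.

```python
# Minimal sketch (assumed sizes/names): one LSTM branch per modality plus an
# auxiliary head, and a loss that sums the auxiliary unimodal losses with the
# main multimodal loss. Illustrative only, not the paper's exact implementation.
import torch
import torch.nn as nn

class UnimodalBranch(nn.Module):
    """LSTM context encoder with an auxiliary sentiment head (one per modality)."""
    def __init__(self, in_dim, hid_dim=100):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.aux_head = nn.Linear(hid_dim, 1)      # auxiliary unimodal prediction

    def forward(self, x):                          # x: (batch, seq_len, in_dim)
        h, _ = self.lstm(x)                        # contextual features (batch, seq_len, hid_dim)
        return h, self.aux_head(h[:, -1])          # features + unimodal sentiment score

def multitask_loss(main_pred, aux_preds, y_main, y_unimodal, criterion=nn.L1Loss()):
    """Main-task loss plus the summed auxiliary per-modality losses."""
    loss = criterion(main_pred, y_main)
    for pred, target in zip(aux_preds, y_unimodal):   # text, audio, visual targets
        loss = loss + criterion(pred, target)
    return loss

# Usage sketch:
# branch_t = UnimodalBranch(768); h_t, aux_t = branch_t(torch.randn(16, 50, 768))
```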



Key words: multimodal sentiment analysis; long short-term memory (LSTM); Transformer; multi-task learning; cross-modal feature fusion
Received: 11 February 2023      Published: 27 December 2023
CLC:  TP 391  
Fund: Young and Middle-aged Key Talent Cultivation Fund of Zhejiang Sci-Tech University
Cite this article:

Qiao-hong CHEN, Jia-jin SUN, Yang-bo LOU, Zhi-jian FANG. Multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer. Journal of ZheJiang University (Engineering Science), 2023, 57(12): 2421-2429.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2023.12.009     OR     https://www.zjujournals.com/eng/Y2023/V57/I12/2421


Fig.1 Structure of multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer
Fig.2 Comparison of two fusion architectures
Fig.3 Structure of cross-modal gated Transformer
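As a rough illustration of the cross-modal gated Transformer block in Fig.3, the sketch below lets target-modality queries attend to source-modality keys and values, with a sigmoid gate scaling how much cross-modal context enters the residual stream; the exact gate form, normalization placement and feed-forward width are assumptions rather than the structure reported in the figure.

```python
# Sketch of a gated cross-modal Transformer layer (assumed details): queries come
# from the target modality, keys/values from the source modality, and a learned
# sigmoid gate controls how much cross-modal context is admitted.
import torch
import torch.nn as nn

class GatedCrossModalLayer(nn.Module):
    def __init__(self, dim=100, heads=10, ffn_mult=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)        # gate over [target; cross-modal context]
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.ReLU(), nn.Linear(ffn_mult * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tgt, src):                   # tgt: (B, Lt, dim), src: (B, Ls, dim)
        ctx, _ = self.attn(query=tgt, key=src, value=src)   # cross-modal attention
        g = torch.sigmoid(self.gate(torch.cat([tgt, ctx], dim=-1)))
        h = self.norm1(tgt + g * ctx)              # gated residual fusion
        return self.norm2(h + self.ffn(h))

# Stacking two such layers fuses a third modality into an already fused pair,
# e.g. fuse audio into text first, then fuse visual into that result.
```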
Parameter SIMS MOSEI
Batch size 32 16
Learning rate 0.001 0.001
Text feature dimension 768 768
Audio feature dimension 33 74
Visual feature dimension 709 35
Text sequence length 39 50
Audio sequence length 400 500
Visual sequence length 55 375
Text LSTM hidden dimension 100 100
Audio LSTM hidden dimension 100 100
Visual LSTM hidden dimension 100 100
Transformer encoder dimension 100 100
Number of attention heads 10 10
Number of Transformer layers 2 2
Dropout rate 0.1 0.1
Tab.1 Experimental parameter settings of proposed model in two datasets
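For convenience, the settings in Tab.1 can also be written as a plain configuration dictionary; the values are taken from the table, while the key names are illustrative.

```python
# Hyperparameters from Tab.1 as a configuration dict (key names are illustrative).
CONFIG = {
    "SIMS": {
        "batch_size": 32, "learning_rate": 1e-3,
        "feature_dim": {"text": 768, "audio": 33, "visual": 709},
        "seq_len": {"text": 39, "audio": 400, "visual": 55},
        "lstm_hidden": 100, "transformer_dim": 100,
        "attention_heads": 10, "transformer_layers": 2, "dropout": 0.1,
    },
    "MOSEI": {
        "batch_size": 16, "learning_rate": 1e-3,
        "feature_dim": {"text": 768, "audio": 74, "visual": 35},
        "seq_len": {"text": 50, "audio": 500, "visual": 375},
        "lstm_hidden": 100, "transformer_dim": 100,
        "attention_heads": 10, "transformer_layers": 2, "dropout": 0.1,
    },
}
```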
Model Acc2/% F1/% Acc3/% MAE Corr T/s N/MB
TFN 77.63 77.67 66.39 42.91 58.04 40.5 3.7
MFN 80.0 79.74 68.01 43.63 57.74 123 5.3
LMF 77.78 76.22 65.99 44.13 56.96 13 5.5
MULT 81.03 80.44 68.67 45.32 56.41 164 40.5
MISA 77.73 77.31 66.13 41.94 56.74 314 15.2
M_TFN 82.21 81.1 70.46 40.27 63.96 65 3.7
M_MFN 82.22 81.96 70.85 41.42 63.60 21 5.5
CMFIB 80.28 80.27 – 40.93 59.82 137 7.9
MTSA 84.18 83.98 71.42 38.85 64.89 111 4.9
Tab.2 Performance comparison of different models in SIMS dataset
Model Acc2/% F1/% Acc3/% MAE Corr T/s N/MB
TFN 81.89 81.74 66.63 57.26 71.74 175 3.7
MFN 82.86 82.85 66.59 57.33 71.82 354 5.3
LMF 83.48 83.36 66.59 57.57 71.69 68 5.5
MULT 83.43 83.32 67.04 55.93 73.71 402 40.5
MISA 84.64 84.66 67.63 55.75 75.15 1101 15.2
CMFIB 85.72 85.72 67.89 58.36 79.37 705 7.9
MTSA 83.51 83.38 67.13 55.67 73.46 351 4.9
Tab.3 Performance comparison of different models in MOSEI dataset
Model Acc2/% F1/% Acc3/% MAE Corr
MTSA 84.18 83.98 71.42 38.85 64.89
MTSA+WA 84.36 84.11 71.57 38.71 64.72
MTSA-G 82.32 82.08 71.06 40.17 61.31
MTSA-GMT 82.27 81.98 70.37 42.91 58.04
MTSA-SMT 81.08 81.03 70.11 43.63 57.74
Tab.4 Results of ablation experiments of proposed model in SIMS dataset
Fusion order Acc2/% F1/% Acc3/% MAE Corr
V→(A→T) 84.18 83.89 71.42 38.85 64.89
A→(V→T) 83.04 83.02 70.41 43.74 57.35
V→(T→A) 83.04 82.82 70.55 43.33 58.55
T→(V→A) 83.82 83.63 70.59 43.08 59.62
T→(A→V) 82.94 83.02 70.41 42.92 58.42
A→(T→V) 83.71 83.88 71.07 41.08 61.16
Tab.5 Results of different modality fusion orders in SIMS dataset
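The fusion orders in Tab.5 can be read as compositions of the gated cross-modal layer sketched after Fig.3: interpreting A→T as "text queries attend to audio", the best-performing order V→(A→T) first fuses audio into text and then fuses visual into that intermediate result. The hypothetical helper below only illustrates this composition; the function and its argument format are not from the paper.

```python
# Hypothetical helper for composing fusion orders such as V->(A->T).
# "A->T" is read as: the T features are queries attending to the A features.
def fuse_in_order(order, feats, layers):
    """order: e.g. ["A->T", "V->fused"]; feats: dict of (batch, len, dim) tensors."""
    fused = None
    for step, layer in zip(order, layers):
        src, tgt = step.split("->")
        target = fused if tgt == "fused" else feats[tgt]
        fused = layer(target, feats[src])          # target queries attend to source
    return fused

# Example, using the GatedCrossModalLayer sketch given after Fig.3:
# fused = fuse_in_order(["A->T", "V->fused"],
#                       {"T": h_t, "A": h_a, "V": h_v},
#                       [GatedCrossModalLayer(), GatedCrossModalLayer()])
```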
Video ID / multimodal information / predicted sentiment / ground-truth sentiment (the video-frame images of the original table are not reproduced)

video_0001/0001
T: I don't want to marry Li Cha
A: fast speech rate, low tone
V: frowning, dejected expression, angry
Prediction: T: negative, A: negative, V: negative, M: negative
Ground truth: T: negative, A: negative, V: negative, M: negative

video_0009/0002
T: Your little sister dares not have any presumptuous thoughts
A: calm tone
V: head lowered, smiling
Prediction: T: negative, A: neutral, V: positive, M: negative
Ground truth: T: negative, A: positive, V: positive, M: negative

video_0054/0032
T: A really distinguished feeling; I felt very scared, how was I supposed to act opposite him, we were not close and had never acted together
A: nervous tone
V: smiling, hands clasped tightly
Prediction: T: positive, A: neutral, V: positive, M: positive
Ground truth: T: positive, A: negative, V: positive, M: positive

video_0056/0051
T: Why take his passport? You can't use it anyway
A: calm tone
V: frowning at first, then smiling, scratching head
Prediction: T: negative, A: neutral, V: positive, M: neutral
Ground truth: T: neutral, A: negative, V: positive, M: neutral
Tab.6 Sample analysis of proposed model in SIMS dataset
[1]   HUANG Y, DU C, XUE Z, et al. What makes multi-modal learning better than single [C]// Advances in Neural Information Processing Systems. [S.l.]: NIPS, 2021: 10944-10956.
[2]   WANG H, MEGHAWAT A, MORENCY L P, et al. Select-additive learning: improving generalization in multimodal sentiment analysis [C]// 2017 IEEE International Conference on Multimedia and Expo. Hong Kong: IEEE, 2017: 949-954.
[3]   WILLIAMS J, COMANESCU R, RADU O, et al. DNN multimodal fusion techniques for predicting video sentiment [C]// Proceedings of Grand Challenge and Workshop on Human Multimodal Language. [S.l.]: ACL, 2018: 64-72.
[4]   NOJAVANASGHARI B, GOPINATH D, KOUSHIK J, et al. Deep multimodal fusion for persuasiveness prediction [C]// Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo: ACM, 2016: 284-288.
[5]   CAMBRIA E, HAZARIKA D, PORIA S, et al. Benchmarking multimodal sentiment analysis [C]// Computational Linguistics and Intelligent Text Processing. Budapest: Springer, 2018: 166-179.
[6]   WANG Y, SHEN Y, LIU Z, et al. Words can shift: dynamically adjusting word representations using nonverbal behaviors [C]// Proceedings of the Thirty-third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. [S.l.]: AAAI, 2019: 7216-7223.
[7]   ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis [C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen: ACL, 2017: 1103-1114.
[8]   YU W, XU H, MENG F, et al. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. [S.l.]: ACL, 2020: 3718-3727.
[9]   YU W, XU H, YUAN Z, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2021, 35(12): 10790-10797.
[10]   TSAI Y H H, BAI S, LIANG P P, et al. Multimodal Transformer for unaligned multimodal language sequences [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence: ACL, 2019: 6558-6569.
[11]   WU J, MAI S, HU H. Graph capsule aggregation for unaligned multimodal sequences [C]// Proceedings of the 2021 International Conference on Multimodal Interaction. [S.l.]: ACM, 2021: 521-529.
[12]   MA H, HAN Z, ZHANG C, et al. Trustworthy multimodal regression with mixture of normal-inverse gamma distributions [C]// Advances in Neural Information Processing Systems. [S.l.]: NIPS, 2021: 6881-6893.
[13]   SIRIWARDHANA S, KALUARACHCHI T, BILLINGHURST M, et al. Multimodal emotion recognition with Transformer-based self supervised feature fusion [J]. IEEE Access, 2020, 8: 176274-176285. doi: 10.1109/ACCESS.2020.3026823
[14]   MAJUMDER N, HAZARIKA D, GELBUKH A, et al. Multimodal sentiment analysis using hierarchical fusion with context modeling [J]. Knowledge-Based Systems, 2018, 161: 124-133. doi: 10.1016/j.knosys.2018.07.041
[15]   DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis: ACL, 2019: 4171-4186.
[16]   MCFEE B, RAFFEL C, LIANG D, et al. librosa: audio and music signal analysis in python [C]// Proceedings of the Python in Science Conference (SCIPY 2015). Austin: SciPy, 2015.
[17]   BALTRUSAITIS T, ZADEH A, LIM Y C, et al. OpenFace 2.0: facial behavior analysis toolkit [C]// 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition. Xi’an: IEEE, 2018: 59-66.
[18]   HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735
[19]   HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis [C]// Proceedings of the 2021 International Conference on Multimodal Interaction. [S.l.]: ACM, 2021: 6-15.
[20]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. [S.l.]: NIPS, 2017: 6000-6010.
[21]   CHEN Z, BADRINARAYANAN V, LEE C Y, et al. GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks [C]// Proceedings of the 35th International Conference on Machine Learning. [S.l.]: PMLR, 2018: 794-803.
[22]   ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: ACL, 2018: 2236-2246.
[23]   ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos [EB/OL]. (2016-08-12)[2022-11-25]. https://arxiv.org/ftp/arxiv/papers/1606/1606.06259.pdf.
[24]   LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. [S.l.]: ACL, 2018: 2247-2256.
[25]   ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension [EB/OL]. (2018-02-03)[2022-11-29]. https://arxiv.org/pdf/1802.00923.pdf.
[26]   HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and-specific representations for multimodal sentiment analysis [C]// Proceedings of the 28th ACM International Conference on Multimedia. Nice: ACM, 2020: 1122-1131.