Journal of Zhejiang University (Engineering Science)  2023, Vol. 57, Issue (12): 2421-2429    DOI: 10.3785/j.issn.1008-973X.2023.12.009
Computer Technology
Multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer
Qiao-hong CHEN, Jia-jin SUN, Yang-bo LOU, Zhi-jian FANG
School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
Abstract:

A new multimodal sentiment analysis model (MTSA) was proposed on the basis of the cross-modal Transformer, aiming at the difficulty of preserving modal feature heterogeneity in single-modal feature extraction and at the feature redundancy arising in cross-modal feature fusion. Long short-term memory (LSTM) networks and a multi-task learning framework were used to extract single-modal contextual semantic information; by accumulating the auxiliary single-modal task losses, noise was filtered out while the heterogeneity of the modal features was preserved. A multi-task gating mechanism was used to adjust cross-modal feature fusion, and the text, audio and visual features were fused through a stacked cross-modal Transformer structure, which increases the fusion depth and avoids redundancy in the fused features. Experimental results on the two public datasets MOSEI and SIMS show that, compared with other advanced models, MTSA achieves better overall performance, with binary classification accuracies of 83.51% and 84.18%, respectively.

Key words: multimodal sentiment analysis; long short-term memory (LSTM); Transformer; multi-task learning; cross-modal feature fusion
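
As a rough illustration of the pipeline summarized in the abstract, the following PyTorch-style sketch combines unimodal LSTM encoders, a placeholder for the stacked cross-modal fusion, and per-modality prediction heads whose auxiliary losses are summed into the total training loss. All module names, the fusion stand-in and the loss form are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an MTSA-style pipeline (assumed PyTorch implementation).
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """LSTM encoder kept separate per modality to preserve modality-specific context."""
    def __init__(self, in_dim, hid_dim=100):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)

    def forward(self, x):                         # x: (batch, seq_len, in_dim)
        out, _ = self.lstm(x)
        return out                                # (batch, seq_len, hid_dim)

class MTSASketch(nn.Module):
    """Illustrative pipeline: unimodal LSTMs -> fusion placeholder -> multi-task heads."""
    def __init__(self, dims=None, hid=100):
        super().__init__()
        dims = dims or {"t": 768, "a": 33, "v": 709}   # SIMS feature dimensions (see Table 1)
        self.enc = nn.ModuleDict({m: UnimodalEncoder(d, hid) for m, d in dims.items()})
        # Stand-in for the stacked cross-modal Transformer fusion described in the paper.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hid, nhead=10, dropout=0.1, batch_first=True),
            num_layers=2)
        self.heads = nn.ModuleDict({m: nn.Linear(hid, 1) for m in ["t", "a", "v", "m"]})

    def forward(self, text, audio, vision):
        h = {"t": self.enc["t"](text), "a": self.enc["a"](audio), "v": self.enc["v"](vision)}
        fused = self.fusion(torch.cat([h["t"], h["a"], h["v"]], dim=1))  # crude concatenation fusion
        pooled = {"t": h["t"][:, -1], "a": h["a"][:, -1],
                  "v": h["v"][:, -1], "m": fused.mean(dim=1)}
        return {m: self.heads[m](pooled[m]).squeeze(-1) for m in self.heads}

def multitask_loss(preds, labels):
    """Total loss: multimodal (main) task loss plus the summed auxiliary unimodal losses."""
    mse = nn.MSELoss()
    return sum(mse(preds[m], labels[m]) for m in ("m", "t", "a", "v"))
```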
Received: 2023-02-11    Published: 2023-12-27
CLC:  TP 391  
Supported by: the Young and Middle-aged Backbone Talent Training Fund Project of Zhejiang Sci-Tech University
About the author: CHEN Qiao-hong (born 1978), female, professor, engaged in research on intelligent optimization design of parallel robots and machine learning techniques. orcid.org/0000-0003-0595-341X. E-mail: chen_lisa@zstu.edu.cn
Cite this article:

Qiao-hong CHEN, Jia-jin SUN, Yang-bo LOU, Zhi-jian FANG. Multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer. Journal of Zhejiang University (Engineering Science), 2023, 57(12): 2421-2429.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.12.009        https://www.zjujournals.com/eng/CN/Y2023/V57/I12/2421

Fig. 1  Structure of the multimodal sentiment analysis model based on multi-task learning and stacked cross-modal Transformer
Fig. 2  Comparison of the two fusion structures
Fig. 3  Structure of the cross-modal gated Transformer
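
Figure 3 depicts the cross-modal gated Transformer used for fusion. In the multimodal Transformer literature, cross-modal attention typically takes the target modality as the query and the source modality as the key and value; the sketch below adds a simple learned sigmoid gate that scales the cross-modal update. The gating form shown here is an assumption for illustration, not necessarily the exact mechanism of the paper.

```python
# Hedged sketch of one cross-modal attention block with an assumed output gate.
import torch
import torch.nn as nn

class GatedCrossModalBlock(nn.Module):
    def __init__(self, d_model=100, n_heads=10, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.gate = nn.Linear(2 * d_model, d_model)   # assumed gating form

    def forward(self, target, source):
        # The target modality attends to the source modality: query = target, key/value = source.
        attended, _ = self.attn(query=target, key=source, value=source)
        g = torch.sigmoid(self.gate(torch.cat([target, attended], dim=-1)))
        x = self.norm(target + g * attended)          # the gate scales the cross-modal update
        return self.norm(x + self.ffn(x))
```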
Parameter                              SIMS     MOSEI
Batch size                             32       16
Learning rate                          0.001    0.001
Text feature dimension                 768      768
Audio feature dimension                33       74
Visual feature dimension               709      35
Text feature length                    39       50
Audio feature length                   400      500
Visual feature length                  55       375
Text LSTM hidden dimension             100      100
Audio LSTM hidden dimension            100      100
Visual LSTM hidden dimension           100      100
Transformer encoder layer dimension    100      100
Number of attention heads              10       10
Number of Transformer layers           2        2
Dropout rate                           0.1      0.1
Table 1  Experimental parameter settings of the proposed model on the two datasets
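
For reference, the settings in Table 1 can be restated as per-dataset configuration dictionaries; the key names below are chosen for illustration only.

```python
# Restatement of Table 1 as per-dataset configuration dictionaries (key names assumed).
CONFIGS = {
    "SIMS":  {"batch_size": 32, "lr": 1e-3,
              "feat_dim": {"text": 768, "audio": 33,  "vision": 709},
              "seq_len":  {"text": 39,  "audio": 400, "vision": 55},
              "lstm_hidden": 100, "d_model": 100, "n_heads": 10,
              "n_layers": 2, "dropout": 0.1},
    "MOSEI": {"batch_size": 16, "lr": 1e-3,
              "feat_dim": {"text": 768, "audio": 74,  "vision": 35},
              "seq_len":  {"text": 50,  "audio": 500, "vision": 375},
              "lstm_hidden": 100, "d_model": 100, "n_heads": 10,
              "n_layers": 2, "dropout": 0.1},
}
```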
Model    Acc2/%   F1/%    Acc3/%   MAE     Corr    T/s    N/MB
TFN      77.63    77.67   66.39    42.91   58.04   40.5   3.7
MFN      80.0     79.74   68.01    43.63   57.74   123    5.3
LMF      77.78    76.22   65.99    44.13   56.96   13     5.5
MULT     81.03    80.44   68.67    45.32   56.41   164    40.5
MISA     77.73    77.31   66.13    41.94   56.74   314    15.2
M_TFN    82.21    81.1    70.46    40.27   63.96   65     3.7
M_MFN    82.22    81.96   70.85    41.42   63.60   21     5.5
CMFIB    80.28    80.27   -        40.93   59.82   137    7.9
MTSA     84.18    83.98   71.42    38.85   64.89   111    4.9
Table 2  Performance comparison of different models on the SIMS dataset
Model    Acc2/%   F1/%    Acc3/%   MAE     Corr    T/s    N/MB
TFN      81.89    81.74   66.63    57.26   71.74   175    3.7
MFN      82.86    82.85   66.59    57.33   71.82   354    5.3
LMF      83.48    83.36   66.59    57.57   71.69   68     5.5
MULT     83.43    83.32   67.04    55.93   73.71   402    40.5
MISA     84.64    84.66   67.63    55.75   75.15   1101   15.2
CMFIB    85.72    85.72   67.89    58.36   79.37   705    7.9
MTSA     83.51    83.38   67.13    55.67   73.46   351    4.9
Table 3  Performance comparison of different models on the MOSEI dataset
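
Tables 2 and 3 report binary and ternary classification accuracy (Acc2, Acc3), F1 score, mean absolute error (MAE) and Pearson correlation (Corr) between predicted and annotated sentiment scores, together with training time T and model size N. The helper below shows one plausible way to compute the five quality metrics from regression outputs; the zero threshold for Acc2, the three-way split for Acc3 and any scaling of MAE/Corr are assumptions, not the authors' exact evaluation code.

```python
# Plausible computation of the quality metrics in Tables 2-3 from regression outputs.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def sentiment_metrics(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    acc2 = accuracy_score(y_true > 0, y_pred > 0)                  # positive vs. non-positive (assumed split)
    f1   = f1_score(y_true > 0, y_pred > 0, average="weighted")
    to3  = lambda y: np.digitize(y, bins=[-1e-8, 1e-8])            # negative / neutral / positive
    acc3 = accuracy_score(to3(y_true), to3(y_pred))
    mae  = np.mean(np.abs(y_pred - y_true))
    corr = np.corrcoef(y_pred, y_true)[0, 1]                       # Pearson correlation
    return {"Acc2": acc2, "F1": f1, "Acc3": acc3, "MAE": mae, "Corr": corr}
```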
Model       Acc2/%   F1/%    Acc3/%   MAE     Corr
MTSA        84.18    83.98   71.42    38.85   64.89
MTSA+WA     84.36    84.11   71.57    38.71   64.72
MTSA-G      82.32    82.08   71.06    40.17   61.31
MTSA-GMT    82.27    81.98   70.37    42.91   58.04
MTSA-SMT    81.08    81.03   70.11    43.63   57.74
Table 4  Ablation results of the proposed model on the SIMS dataset
Fusion order   Acc2/%   F1/%    Acc3/%   MAE     Corr
V→(A→T)        84.18    83.89   71.42    38.85   64.89
A→(V→T)        83.04    83.02   70.41    43.74   57.35
V→(T→A)        83.04    82.82   70.55    43.33   58.55
T→(V→A)        83.82    83.63   70.59    43.08   59.62
T→(A→V)        82.94    83.02   70.41    42.92   58.42
A→(T→V)        83.71    83.88   71.07    41.08   61.16
Table 5  Results for different modal fusion orders on the SIMS dataset
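
Table 5 varies the order in which the three modalities are passed through the stacked cross-modal fusion. Reading A→T as audio information being injected into the text stream (an assumption about the notation), the nesting can be written explicitly, reusing the gated block sketched earlier; this only illustrates the ordering, not the exact stacking used in the paper.

```python
# Illustration of the nesting implied by the fusion orders in Table 5.
# `block` is any cross-modal fusion callable, e.g. the GatedCrossModalBlock sketched above.
def fuse_V_A_T(feats, block):          # V→(A→T), the first row of Table 5
    inner = block(target=feats["T"], source=feats["A"])   # A→T: inject audio into text
    return block(target=inner, source=feats["V"])         # V→(A→T): then inject vision

def fuse_T_A_V(feats, block):          # T→(A→V)
    inner = block(target=feats["V"], source=feats["A"])   # A→V
    return block(target=inner, source=feats["T"])         # T→(A→V)
```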
Video ID: video_0001/0001
  T: "我不想嫁给李茶" (I don't want to marry Li Cha)
  A: fast speech rate, low and gloomy tone
  V: frowning, dejected and angry expression
  Prediction:   T negative, A negative, V negative, M negative
  Ground truth: T negative, A negative, V negative, M negative

Video ID: video_0009/0002
  T: "妹妹不敢有任何非分之想" (I, your younger sister, dare not have any improper thoughts)
  A: calm tone
  V: head lowered, smiling
  Prediction:   T negative, A neutral, V positive, M negative
  Ground truth: T negative, A positive, V positive, M negative

Video ID: video_0054/0032
  T: "特别尊贵的感觉,感觉很害怕,怎么和他对戏啊,也不熟两个人也没有演过戏" (A very lofty presence; I felt really scared, how could I act opposite him, we were not familiar and had never acted together)
  A: nervous tone
  V: smiling, hands clasped tightly
  Prediction:   T positive, A neutral, V positive, M positive
  Ground truth: T positive, A negative, V positive, M positive

Video ID: video_0056/0051
  T: "拿他护照干吗你又用不了" (Why take his passport? You cannot use it anyway)
  A: calm tone
  V: first frowning, then smiling while scratching the head
  Prediction:   T negative, A neutral, V positive, M neutral
  Ground truth: T neutral, A negative, V positive, M neutral

Table 6  Case analysis of the proposed model on the SIMS dataset (T: text, A: audio, V: visual, M: multimodal)