Journal of Zhejiang University (Engineering Science)  2023, Vol. 57, Issue (9): 1865-1875    DOI: 10.3785/j.issn.1008-973X.2023.09.018
Computer Technology
Fusing generative adversarial network and temporal convolutional network for Mandarin emotion recognition
Hai-feng LI 1, Xue-ying ZHANG 1,*, Shu-fei DUAN 1, Hai-rong JIA 1, Hui-zhi LIANG 2
1. College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, China
2. School of Computing, Newcastle University, Newcastle upon Tyne NE1 7RU, United Kingdom
Full text: PDF (2044 KB)   HTML
Abstract:

An emotion recognition system integrating acoustic and articulatory feature conversion was proposed to investigate the influence of acoustic-to-articulatory and articulatory-to-acoustic conversion on Mandarin emotion recognition. First, a multimodal audio-visual emotional Mandarin database was recorded based on the human articulation mechanism. Then, a bi-directional mapping generative adversarial network (Bi-MGAN) was designed to solve the feature conversion problem between the two modalities, and generator loss functions and mapping loss functions were defined to optimize the network. Finally, a residual temporal convolutional network with feature-dimension attention (ResTCN-FDA) was constructed, in which attention mechanisms adaptively assign different weights to different types of features and to different dimension channels. Experimental results show that the conversion accuracy of Bi-MGAN outperforms mainstream conversion network algorithms in both the forward and the reverse mapping tasks, and that the evaluation metrics of ResTCN-FDA on the given emotion datasets are much higher than those of traditional emotion recognition algorithms. Fusing the real features with the mapped features significantly improves the accuracy with which emotions are recognized, demonstrating the positive effect of the mapping on Mandarin emotion recognition.

Key words: cycle generative adversarial network    emotion recognition    acoustic and articulatory conversions    temporal convolutional network    attention mechanism
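As a rough illustration of the fusion step described in the abstract, the sketch below concatenates real acoustic features with articulatory features mapped from them before emotion classification. It is only a sketch: `forward_generator` and `emotion_classifier` are hypothetical stand-ins for a trained Bi-MGAN generator and a trained ResTCN-FDA model, and the feature dimensions follow Table 6.

```python
import torch

# Illustrative sketch only (not the authors' released code): fuse real acoustic
# features with articulatory features mapped from them, then classify emotion.
# "forward_generator" and "emotion_classifier" are hypothetical trained models.
def fuse_and_classify(acoustic_real, forward_generator, emotion_classifier):
    # acoustic_real: (batch, frames, 60) real acoustic features
    with torch.no_grad():
        articulatory_mapped = forward_generator(acoustic_real)  # (batch, frames, 28)
    # 60-D acoustic + 28-D mapped articulatory -> 88-D fused input (cf. Table 6)
    fused = torch.cat([acoustic_real, articulatory_mapped], dim=-1)
    return emotion_classifier(fused)  # emotion logits
```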
Received: 2022-06-23    Published: 2023-10-16
CLC:  TP 183  
Funding: National Natural Science Foundation of China (12004275); Graduate Innovation Project of Shanxi Province (2022Y235); Program for Selected Science and Technology Activities of Overseas Scholars of Shanxi Province (20200017); Research Project for Returned Overseas Scholars of Shanxi Province (2019025, 2020042); Scientific Research Start-up Fund for Introduced Talents of Taiyuan University of Technology (tyut-rc201405b); General Program of the Applied Basic Research Program of Shanxi Province (20210302123186)
Corresponding author: Xue-ying ZHANG     E-mail: 2244812211@qq.com; tyzhangxy@163.com
About the author: Hai-feng LI (1995—), male, Ph.D. candidate, researching signal processing and affective computing. orcid.org/0000-0002-7203-3894. E-mail: 2244812211@qq.com

Cite this article:


Hai-feng LI, Xue-ying ZHANG, Shu-fei DUAN, Hai-rong JIA, Hui-zhi LIANG. Fusing generative adversarial network and temporal convolutional network for Mandarin emotion recognition. Journal of Zhejiang University (Engineering Science), 2023, 57(9): 1865-1875.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.09.018        https://www.zjujournals.com/eng/CN/Y2023/V57/I9/1865

Fig. 1  Process of acquiring acoustic and articulatory data with the electromagnetic articulograph
Fig. 2  Sensor placement for data acquisition with the electromagnetic articulograph
Score | Speech intelligibility | Emotional expressiveness
3 | No noise and clear semantics | Strong emotional expression
2 | Slight noise but clear semantics | Moderate emotional expression
1 | Obvious noise that impairs semantic delivery | Slight emotional expression
0 | Noisy and unintelligible audio | No emotional expression
Table 1  Rating scale for evaluating the acoustic data
Fig. 3  Overall structure of the emotion recognition algorithm fusing Bi-MGAN and ResTCN-FDA
Fig. 4  Schematic of the bi-directional mapping generative adversarial network
Fig. 5  Overall structure of ResTCN-FDA
Fig. 6  Block diagram of the feature-dimension attention mechanism
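Figure 6 describes the feature-dimension attention (FDA) mechanism only at the block-diagram level. The snippet below is a hedged sketch of the adaptive re-weighting idea, assuming a squeeze-and-excitation style gating over feature channels plus a gate over frame positions; the class name, layer sizes, and reduction ratio are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hedged sketch of a feature-dimension style attention block: one branch learns
# weights over feature channels, the other over frame positions. The actual
# ResTCN-FDA design is specified in the paper; this only illustrates the idea.
class FeatureDimensionAttention(nn.Module):
    def __init__(self, num_channels, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.Linear(num_channels, num_channels // reduction), nn.ReLU(),
            nn.Linear(num_channels // reduction, num_channels), nn.Sigmoid())
        self.frame_gate = nn.Sequential(nn.Linear(num_channels, 1), nn.Sigmoid())

    def forward(self, x):                                # x: (batch, frames, channels)
        w_channel = self.channel_gate(x.mean(dim=1))     # (batch, channels)
        w_frame = self.frame_gate(x)                     # (batch, frames, 1)
        return x * w_channel.unsqueeze(1) * w_frame      # adaptively re-weighted features
```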
Algorithm | Forward mapping MAE / RMSE | Reverse mapping MAE / RMSE
GAN[22] | 1.217 / 1.642 | 0.946 / 1.189
CycleGAN[21] | 1.127 / 1.428 | 0.811 / 0.919
Bi-MGAN(G) | 1.034 / 1.341 | 0.801 / 0.908
Bi-MGAN(M) | 0.879 / 1.134 | 0.642 / 0.881
Bi-MGAN(GM) | 0.703 / 0.920 | 0.501 / 0.683
Table 2  Ablation experiments of the conversion network algorithms
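Table 2 ablates the generator loss (G), the mapping loss (M), and their combination (GM). The exact loss definitions are given in the paper; the sketch below only illustrates, with PyTorch-style placeholders, how an adversarial generator term and an L1 mapping term could be combined in a bi-directional acoustic-articulatory setup. All names and the weighting factor are assumptions.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of one generator update for a bi-directional mapping GAN.
# g_fwd: acoustic -> articulatory, g_rev: articulatory -> acoustic;
# d_art / d_aco are discriminators for the two modalities. Illustrative only.
def generator_step(g_fwd, g_rev, d_art, d_aco, acoustic, articulatory, lam=10.0):
    art_fake = g_fwd(acoustic)                 # forward mapping
    aco_fake = g_rev(articulatory)             # reverse mapping
    pred_art, pred_aco = d_art(art_fake), d_aco(aco_fake)

    # generator (adversarial) term: mapped features should fool the discriminators
    loss_gen = (F.binary_cross_entropy_with_logits(pred_art, torch.ones_like(pred_art))
                + F.binary_cross_entropy_with_logits(pred_aco, torch.ones_like(pred_aco)))

    # mapping term: mapped features should match the paired real features
    loss_map = F.l1_loss(art_fake, articulatory) + F.l1_loss(aco_fake, acoustic)

    return loss_gen + lam * loss_map           # Bi-MGAN(GM)-style combination
```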
Algorithm | Forward mapping MAE / RMSE | Reverse mapping MAE / RMSE
DNN[14] | 1.479 / 1.613 | 1.143 / 1.259
BiLSTM[11] | 1.298 / 1.422 | 1.003 / 1.217
PSO-LSSVM[6] | 1.185 / 1.252 | 0.967 / 1.136
DRMDN[23] | 0.884 / 0.948 | 0.831 / 0.939
Bi-MGAN | 0.703 / 0.908 | 0.501 / 0.683
Table 3  Mapping performance comparison of conversion network algorithms
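MAE and RMSE in Tables 2 and 3 score how closely the mapped features match the real ones (lower is better); a minimal reference computation:

```python
import numpy as np

# Mean absolute error and root-mean-square error over mapped vs. real features.
def mae(pred, target):
    return float(np.mean(np.abs(pred - target)))

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```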
Database | Algorithm | ACC/% | F1/% | AUC/%
CASIA | TCN[20] | 70.69 | 69.18 | 70.79
CASIA | ResTCN[18] | 72.25 | 72.16 | 72.67
CASIA | ResTCN-FA | 76.25 | 76.56 | 76.96
CASIA | ResTCN-DA | 73.75 | 73.71 | 73.97
CASIA | ResTCN-FDA | 80.41 | 81.22 | 81.43
STEM-E2VA | TCN[20] | 64.71 | 64.52 | 66.69
STEM-E2VA | ResTCN[18] | 68.31 | 67.68 | 71.14
STEM-E2VA | ResTCN-FA | 73.63 | 73.78 | 74.83
STEM-E2VA | ResTCN-DA | 72.85 | 72.61 | 73.36
STEM-E2VA | ResTCN-FDA | 75.63 | 75.44 | 76.82
EMO-DB | TCN[20] | 71.26 | 68.67 | 71.71
EMO-DB | ResTCN[18] | 73.71 | 74.28 | 74.83
EMO-DB | ResTCN-FA | 76.22 | 76.62 | 76.94
EMO-DB | ResTCN-DA | 77.29 | 75.81 | 77.91
EMO-DB | ResTCN-FDA | 80.16 | 80.78 | 81.58
RADVESS | TCN[20] | 59.07 | 59.02 | 59.66
RADVESS | ResTCN[18] | 62.41 | 61.15 | 61.77
RADVESS | ResTCN-FA | 63.93 | 62.40 | 63.98
RADVESS | ResTCN-DA | 64.07 | 63.82 | 64.90
RADVESS | ResTCN-FDA | 66.55 | 65.57 | 66.86
Table 4  Ablation experiments of the emotion recognition network algorithms
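Tables 4 and 5 report accuracy, F1, and AUC in percent. A minimal sketch of how such metrics could be computed with scikit-learn follows; the macro averaging and the one-vs-rest AUC are assumptions, since the page does not state the averaging scheme used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Sketch of ACC / F1 / AUC (in %) for multi-class emotion labels.
# y_true: (n,) integer labels; y_prob: (n, n_classes) predicted probabilities.
def emotion_metrics(y_true, y_prob):
    y_pred = np.argmax(y_prob, axis=1)
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro")           # averaging scheme assumed
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr")   # one-vs-rest AUC assumed
    return 100 * acc, 100 * f1, 100 * auc
```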
Database | Algorithm | ACC/% | F1/% | AUC/%
CASIA | CNN[12] | 63.00 | 62.43 | 63.19
CASIA | HS-TCN[24] | 76.25 | 76.64 | 76.91
CASIA | DRN[25] | 76.91 | 76.67 | 76.94
CASIA | ResTCN-FDA | 80.41 | 81.22 | 81.43
STEM-E2VA | CNN[12] | 56.77 | 56.84 | 57.77
STEM-E2VA | HS-TCN[24] | 72.81 | 72.59 | 72.92
STEM-E2VA | DRN[25] | 68.15 | 68.25 | 68.86
STEM-E2VA | ResTCN-FDA | 75.63 | 75.44 | 76.82
EMO-DB | CNN[12] | 69.72 | 69.09 | 69.86
EMO-DB | HS-TCN[24] | 74.76 | 73.60 | 75.51
EMO-DB | DRN[25] | 76.64 | 74.72 | 76.96
EMO-DB | ResTCN-FDA | 80.16 | 80.78 | 81.58
RADVESS | CNN[12] | 57.29 | 55.67 | 57.85
RADVESS | HS-TCN[24] | 63.29 | 63.56 | 63.58
RADVESS | DRN[25] | 63.46 | 61.88 | 63.90
RADVESS | ResTCN-FDA | 66.55 | 65.57 | 66.86
Table 5  Comparison of emotion evaluation metrics of emotion recognition network algorithms
Feature type | Feature set | Input features | Dimension | ACC/% | F1/% | AUC/%
Articulatory | Articulatory(C) | Mapped articulatory features | 28 | 53.02 | 52.03 | 53.61
Articulatory | Articulatory(R) | Real articulatory features | 28 | 63.56 | 62.96 | 63.87
Acoustic | Acoustic(C) | Mapped acoustic features | 60 | 59.23 | 58.69 | 59.82
Acoustic | Acoustic(R) | Real acoustic features | 60 | 75.63 | 75.44 | 76.82
Acoustic + articulatory | Acoustic(R) + Articulatory(C) | Real acoustic + mapped articulatory features | 88 | 79.51 | 79.69 | 79.97
Acoustic + articulatory | Acoustic(C) + Articulatory(R) | Real articulatory + mapped acoustic features | 88 | 72.47 | 72.45 | 72.95
Acoustic + articulatory | Acoustic(R) + Articulatory(R) | Real acoustic + real articulatory features | 88 | 83.77 | 83.64 | 83.97
Pre-trained | HuBERT[26] | 48-layer Transformer | 1280 | 89.66 | 89.85 | 91.96
Pre-trained | Wav2vec 2.0[27] | 24-layer Transformer | 1024 | 82.57 | 82.25 | 83.93
Pre-trained + dimensionality reduction | HuBERT[26] | 48-layer Transformer + PCA | 60 | 78.54 | 78.93 | 79.15
Pre-trained + dimensionality reduction | HuBERT[26] | 48-layer Transformer + PCA | 88 | 80.16 | 80.01 | 80.42
Pre-trained + dimensionality reduction | Wav2vec 2.0[27] | 24-layer Transformer + PCA | 60 | 75.90 | 75.45 | 76.88
Pre-trained + dimensionality reduction | Wav2vec 2.0[27] | 24-layer Transformer + PCA | 88 | 76.18 | 76.65 | 76.96
Table 6  Comparison of emotion evaluation metrics for different acoustic and articulatory features
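The pre-trained rows in Table 6 use 48-layer HuBERT embeddings (1280-D) and 24-layer wav2vec 2.0 embeddings (1024-D), optionally reduced to 60 or 88 dimensions with principal component analysis. The sketch below shows one plausible extraction pipeline using the Hugging Face transformers API; the checkpoint name and fitting PCA on a single batch of frames are assumptions for illustration, not details taken from the paper.

```python
import torch
from sklearn.decomposition import PCA
from transformers import Wav2Vec2FeatureExtractor, HubertModel

# Hedged sketch of "pre-trained + dimensionality reduction" (Table 6): frame-level
# HuBERT embeddings reduced with PCA. The checkpoint below is an assumed 48-layer,
# 1280-D model; the paper does not name the exact weights on this page.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-xlarge-ll60k")
hubert = HubertModel.from_pretrained("facebook/hubert-xlarge-ll60k").eval()

def reduced_embeddings(waveforms, n_components=88, sr=16000):
    inputs = extractor(waveforms, sampling_rate=sr, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state          # (batch, frames, 1280)
    frames = hidden.reshape(-1, hidden.shape[-1]).numpy()    # pool all frames
    return PCA(n_components=n_components).fit_transform(frames)
```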
Fig. 7  Confusion matrices for different feature sets
1 LEI J J, ZHU X W, WANG Y. BAT: block and token self-attention for speech emotion recognition [J]. Neural Networks, 2022, 156: 67-80. doi: 10.1016/j.neunet.2022.09.022
2 LI Y, TAO J, CHAO L, et al. CHEAVD: a Chinese natural emotional audio visual database [J]. Journal of Ambient Intelligence and Humanized Computing, 2017, 8(6): 913-924. doi: 10.1007/s12652-016-0406-z
3 CHOU H C, LIN W C, CHANG L C, et al. NNIME: the NTHU-NTUA Chinese interactive multimodal emotion corpus [C]// 2017 Seventh International Conference on Affective Computing and Intelligent Interaction. San Antonio: IEEE, 2017: 292-298.
4 BUSSO C, BULUT M, LEE C, et al. IEMOCAP: interactive emotional dyadic motion capture database [J]. Language Resources and Evaluation, 2008, 42(4): 335-359. doi: 10.1007/s10579-008-9076-6
5 QIN C, CARREIRA M A. An empirical investigation of the nonuniqueness in the acoustic-to-articulatory mapping [C]// Eighth Annual Conference of the International Speech Communication Association. Antwerp: [s.n.], 2007: 27-31.
6 REN G, FU J, SHAO G, et al. Articulatory to acoustic conversion of Mandarin emotional speech based on PSO-LSSVM [J]. Complexity, 2021, 29(3): 696-706.
7 HOGDEN J, LOFQVIST A, GRACCO V, et al. Accurate recovery of articulator positions from acoustics: new conclusions based on human data [J]. The Journal of the Acoustical Society of America, 1996, 100(3): 1819-1834. doi: 10.1121/1.416001
8 LING Z H, RICHMOND K, YAMAGISHI J, et al. Integrating articulatory features into HMM-based parametric speech synthesis [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(6): 1171-1185. doi: 10.1109/TASL.2009.2014796
9 LI M, KIM J, LAMMERT A, et al. Speaker verification based on the fusion of speech acoustics and inverted articulatory signals [J]. Computer Speech and Language, 2016, 36: 196-211. doi: 10.1016/j.csl.2015.05.003
10 GUO L, WANG L, DANG J, et al. Learning affective representations based on magnitude and dynamic relative phase information for speech emotion recognition [J]. Speech Communication, 2022, 136(4): 118-127.
11 CHEN Q, HUANG G. A novel dual attention based BLSTM with hybrid features in speech emotion recognition [J]. Engineering Applications of Artificial Intelligence, 2021, 102(5): 104277.
12 ZHANG Jing, ZHANG Xue-ying, CHEN Gui-jun, et al. EEG emotion recognition based on the 3D-CNN and spatial-frequency attention mechanism [J]. Journal of Xidian University, 2022, 49(3): 191-198. (in Chinese) doi: 10.19665/j.issn1001-2400.2022.03.021
13 KUMARAN U, RADHA R S, NAGARAJAN S M, et al. Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN [J]. International Journal of Speech Technology, 2021, 24(2): 303-314. doi: 10.1007/s10772-020-09792-x
14 LIESKOVSKA E, JAKUBEC M, JARINA R, et al. A review on speech emotion recognition using deep learning and attention mechanism [J]. Electronics, 2021, 10(10): 1163. doi: 10.3390/electronics10101163
15 ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: ICCV, 2017: 2223-2232.
16 YUAN J, BAO C. CycleGAN based speech enhancement for the unpaired training data [C]// 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Lanzhou: APSIPA, 2019: 878-883.
17 SU B H, LEE C. Unsupervised cross-corpus speech emotion recognition using a multi-source CycleGAN [J]. IEEE Transactions on Affective Computing, 2022, 48(8): 650-715.
18 LIN J, WIJNGAARDEN A J L, WANG K C, et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3440-3450. doi: 10.1109/TASLP.2021.3125143
19 PANDEY A, WANG D L. TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain [C]// 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: ICASSP, 2019: 6875-6879.
20 ZHANG L, SHI Z, HAN J, et al. Furcanext: end-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks [C]// International Conference on Multimedia Modeling. Daejeon: ICMM, 2020: 653-665.
21 JIANG Z, ZHANG R, GUO Y, et al. Noise interference reduction in vision module of intelligent plant cultivation robot using better Cycle GAN [J]. IEEE Sensors Journal, 2022, 22(11): 11045-11055. doi: 10.1109/JSEN.2022.3164915
22 GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [J]. Advances in Neural Information Processing Systems, 2014, 27: 42-51.
23 LIU P, YU Q, WU Z, et al. A deep recurrent approach for acoustic-to-articulatory inversion [C]// 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Brisbane: ICASSP, 2015: 4450-4454.
24 CHENG Y, XU Y, ZHONG H, et al. Leveraging semisupervised hierarchical stacking temporal convolutional network for anomaly detection in IoT communication [J]. IEEE Internet of Things Journal, 2020, 8(1): 144-155.
25 ZHAO Z P, LI Q F, ZHANG Z X, et al. Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition [J]. Neural Networks, 2021, 141: 52-60. doi: 10.1016/j.neunet.2021.03.013
26 CHANG X K. An exploration of self-supervised pretrained representations for end-to-end speech recognition [C]// 2021 IEEE Automatic Speech Recognition and Understanding Workshop. Cartagena: ASRU, 2021: 228-235.