融合生成对抗网络与时间卷积网络的普通话情感识别
李海烽1,张雪英1,*,段淑斐1,贾海蓉1,Huizhi Liang2
1. 太原理工大学 电子信息与光学工程学院,山西 太原 030024
2. 纽卡斯尔大学 计算机学院,泰恩-威尔 泰恩河畔纽卡斯尔 NE1 7RU

Fusing generative adversarial network and temporal convolutional network for Mandarin emotion recognition
Hai-feng LI1, Xue-ying ZHANG1,*, Shu-fei DUAN1, Hai-rong JIA1, Hui-zhi LIANG2
1. College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, China
2. School of Computing, Newcastle University, Newcastle upon Tyne NE1 7RU, United Kingdom
Cite this article:
李海烽,张雪英,段淑斐,贾海蓉,Huizhi Liang. 融合生成对抗网络与时间卷积网络的普通话情感识别[J]. 浙江大学学报(工学版), 2023, 57(9): 1865-1875.
Hai-feng LI, Xue-ying ZHANG, Shu-fei DUAN, Hai-rong JIA, Hui-zhi LIANG. Fusing generative adversarial network and temporal convolutional network for Mandarin emotion recognition. Journal of ZheJiang University (Engineering Science), 2023, 57(9): 1865-1875.
Link to this article:
https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.09.018
or
https://www.zjujournals.com/eng/CN/Y2023/V57/I9/1865
References:
[1] LEI J J, ZHU X W, WANG Y. BAT: block and token self-attention for speech emotion recognition [J]. Neural Networks, 2022, 156: 67-80. doi: 10.1016/j.neunet.2022.09.022
[2] LI Y, TAO J, CHAO L, et al. CHEAVD: a Chinese natural emotional audio visual database [J]. Journal of Ambient Intelligence and Humanized Computing, 2017, 8(6): 913-924. doi: 10.1007/s12652-016-0406-z
[3] CHOU H C, LIN W C, CHANG L C, et al. NNIME: the NTHU-NTUA Chinese interactive multimodal emotion corpus [C]// 2017 Seventh International Conference on Affective Computing and Intelligent Interaction. San Antonio: IEEE, 2017: 292-298.
[4] BUSSO C, BULUT M, LEE C, et al. IEMOCAP: interactive emotional dyadic motion capture database [J]. Language Resources and Evaluation, 2008, 42(4): 335-359. doi: 10.1007/s10579-008-9076-6
[5] QIN C, CARREIRA-PERPINAN M A. An empirical investigation of the nonuniqueness in the acoustic-to-articulatory mapping [C]// Eighth Annual Conference of the International Speech Communication Association. Antwerp: [s.n.], 2007: 27-31.
[6] REN G, FU J, SHAO G, et al. Articulatory-to-acoustic conversion of Mandarin emotional speech based on PSO-LSSVM [J]. Complexity, 2021, 29(3): 696-706.
[7] HOGDEN J, LOFQVIST A, GRACCO V, et al. Accurate recovery of articulator positions from acoustics: new conclusions based on human data [J]. The Journal of the Acoustical Society of America, 1996, 100(3): 1819-1834. doi: 10.1121/1.416001
[8] LING Z H, RICHMOND K, YAMAGISHI J, et al. Integrating articulatory features into HMM-based parametric speech synthesis [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(6): 1171-1185. doi: 10.1109/TASL.2009.2014796
[9] LI M, KIM J, LAMMERT A, et al. Speaker verification based on the fusion of speech acoustics and inverted articulatory signals [J]. Computer Speech and Language, 2016, 36: 196-211. doi: 10.1016/j.csl.2015.05.003
[10] GUO L, WANG L, DANG J, et al. Learning affective representations based on magnitude and dynamic relative phase information for speech emotion recognition [J]. Speech Communication, 2022, 136(4): 118-127.
[11] CHEN Q, HUANG G. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition [J]. Engineering Applications of Artificial Intelligence, 2021, 102(5): 104277.
[12] 张静, 张雪英, 陈桂军, 等. 结合3D-CNN和频-空注意力机制的EEG情感识别 [J]. 西安电子科技大学学报, 2022, 49(3): 191-198.
ZHANG Jing, ZHANG Xue-ying, CHEN Gui-jun, et al. EEG emotion recognition based on the 3D-CNN and spatial-frequency attention mechanism [J]. Journal of Xidian University, 2022, 49(3): 191-198. doi: 10.19665/j.issn1001-2400.2022.03.021
[13] KUMARAN U, RADHA R S, NAGARAJAN S M, et al. Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN [J]. International Journal of Speech Technology, 2021, 24(2): 303-314. doi: 10.1007/s10772-020-09792-x
[14] LIESKOVSKA E, JAKUBEC M, JARINA R, et al. A review on speech emotion recognition using deep learning and attention mechanism [J]. Electronics, 2021, 10(10): 1163. doi: 10.3390/electronics10101163
[15] ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks [C]// Proceedings of the IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, 2017: 2223-2232.
[16] YUAN J, BAO C. CycleGAN-based speech enhancement for the unpaired training data [C]// 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Lanzhou: IEEE, 2019: 878-883.
[17] SU B H, LEE C. Unsupervised cross-corpus speech emotion recognition using a multi-source CycleGAN [J]. IEEE Transactions on Affective Computing, 2022, 48(8): 650-715.
[18] LIN J, WIJNGAARDEN A J L, WANG K C, et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3440-3450. doi: 10.1109/TASLP.2021.3125143
[19] PANDEY A, WANG D L. TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain [C]// 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton: IEEE, 2019: 6875-6879.
[20] ZHANG L, SHI Z, HAN J, et al. FurcaNeXt: end-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks [C]// International Conference on Multimedia Modeling (MMM). Daejeon: Springer, 2020: 653-665.
[21] JIANG Z, ZHANG R, GUO Y, et al. Noise interference reduction in vision module of intelligent plant cultivation robot using better Cycle GAN [J]. IEEE Sensors Journal, 2022, 22(11): 11045-11055. doi: 10.1109/JSEN.2022.3164915
[22] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [J]. Advances in Neural Information Processing Systems, 2014, 27: 42-51.
[23] LIU P, YU Q, WU Z, et al. A deep recurrent approach for acoustic-to-articulatory inversion [C]// 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brisbane: IEEE, 2015: 4450-4454.
[24] CHENG Y, XU Y, ZHONG H, et al. Leveraging semisupervised hierarchical stacking temporal convolutional network for anomaly detection in IoT communication [J]. IEEE Internet of Things Journal, 2020, 8(1): 144-155.
[25] ZHAO Z P, LI Q F, ZHANG Z X, et al. Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition [J]. Neural Networks, 2021, 141: 52-60. doi: 10.1016/j.neunet.2021.03.013
[26] CHANG X K, et al. An exploration of self-supervised pretrained representations for end-to-end speech recognition [C]// 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Cartagena: IEEE, 2021: 228-235.