Journal of ZheJiang University (Engineering Science)  2023, Vol. 57 Issue (9): 1865-1875    DOI: 10.3785/j.issn.1008-973X.2023.09.018
    
Fusing generative adversarial network and temporal convolutional network for Mandarin emotion recognition
Hai-feng LI1(),Xue-ying ZHANG1,*(),Shu-fei DUAN1,Hai-rong JIA1,Hui-zhi LIANG2
1. College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, China
2. School of Computing, Newcastle University, Newcastle upon Tyne NE1 7RU, United Kingdom

Abstract  

An emotion recognition system that integrates acoustic and articulatory feature conversion was proposed to investigate the influence of acoustic-articulatory conversion on Mandarin emotion recognition. First, a multimodal emotional Mandarin database was recorded based on the human articulation mechanism. Then, a bi-directional mapping generative adversarial network (Bi-MGAN) was designed to solve the feature conversion problem between the two modalities, and generator loss functions and mapping loss functions were proposed to optimise the network. Finally, a residual temporal convolutional network with feature-dimension attention (ResTCN-FDA) was constructed, which uses attention mechanisms to adaptively assign different weights to different feature types and different dimension channels. Experimental results show that the conversion accuracy of Bi-MGAN outperforms state-of-the-art conversion networks in both the forward and the reverse mapping tasks, and that the evaluation metrics of ResTCN-FDA on the given emotion datasets are much higher than those of traditional emotion recognition algorithms. Fusing the real features with the mapped features significantly increased the emotion recognition accuracy, demonstrating the positive effect of the mapping on Mandarin emotion recognition.
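The exact generator and mapping loss formulations are given in the full paper; as a rough illustration of the bi-directional mapping idea, the PyTorch-style sketch below combines an adversarial generator term with a cycle-style mapping consistency term. The generator and discriminator names and the weighting factor lambda_map are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def bimgan_generator_loss(G_fwd, G_rev, D_art, D_ac, acoustic, articulatory,
                          lambda_map=10.0):
    """Illustrative sketch only, not the authors' exact Bi-MGAN losses.
    G_fwd: acoustic -> articulatory generator (forward mapping);
    G_rev: articulatory -> acoustic generator (reverse mapping);
    D_art / D_ac: discriminators on articulatory / acoustic features."""
    fake_art = G_fwd(acoustic)        # forward-mapped articulatory features
    fake_ac = G_rev(articulatory)     # reverse-mapped acoustic features

    # Adversarial generator terms: converted features should fool the discriminators.
    logits_art = D_art(fake_art)
    logits_ac = D_ac(fake_ac)
    adv = (F.binary_cross_entropy_with_logits(logits_art, torch.ones_like(logits_art))
           + F.binary_cross_entropy_with_logits(logits_ac, torch.ones_like(logits_ac)))

    # Mapping (cycle-consistency style) terms: mapping forward and then back
    # should reconstruct the original features in each modality.
    mapping = (F.l1_loss(G_rev(fake_art), acoustic)
               + F.l1_loss(G_fwd(fake_ac), articulatory))

    return adv + lambda_map * mapping
```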



Key words: cycle generative adversarial network; emotion recognition; acoustic and articulatory conversions; temporal convolutional network; attention mechanism
Received: 23 June 2022      Published: 16 October 2023
CLC: TP 183; TP 391
Fund: National Natural Science Foundation of China (12004275); Graduate Innovation Project of Shanxi Province (2022Y235); Shanxi Province Merit-based Funding Project for Scientific and Technological Activities of Returned Overseas Scholars (20200017); Scientific Research Funding Project for Returned Overseas Scholars of Shanxi Province (2019025, 2020042); Scientific Research Start-up Fund for Introduced Talents of Taiyuan University of Technology (tyut-rc201405b); General Program of the Shanxi Province Applied Basic Research Program (20210302123186)
Corresponding Authors: Xue-ying ZHANG     E-mail: 2244812211@qq.com;tyzhangxy@163.com
Cite this article:

Hai-feng LI,Xue-ying ZHANG,Shu-fei DUAN,Hai-rong JIA,Hui-zhi LIANG. Fusing generative adversarial network and temporal convolutional network for Mandarin emotion recognition. Journal of ZheJiang University (Engineering Science), 2023, 57(9): 1865-1875.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2023.09.018     OR     https://www.zjujournals.com/eng/Y2023/V57/I9/1865


Fig.1 Electromagnetic articulography acquisition process for acoustic and articulatory data
Fig.2 Sensor settings for data acquisition by electromagnetic articulography
Score | Speech intelligibility | Emotional expressiveness
3 | No noise, semantics clear | Strong emotional expression
2 | Slight noise, semantics still clear | Moderate emotional expression
1 | Obvious noise that hinders semantic transmission | Mild emotional expression
0 | Noisy, unintelligible audio | No emotional expression
Tab.1 Acoustic data assessment scale
Fig.3 Overall structure of emotion recognition algorithm fusing Bi-MGAN and ResTCN-FDA
Fig.4 Network schematic of bi-directional mapping generative adversarial network
Fig.5 Overall structure of ResTCN-FDA
Fig.6 Overall structural framework of feature-dimensional attention mechanism
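Fig.6 describes the feature-dimension attention used in ResTCN-FDA as adaptively weighting different feature types and different dimension channels. A minimal squeeze-and-excitation style sketch of that idea is given below; the module structure, the names, and the grouping by feature type are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class FeatureDimensionAttention(nn.Module):
    """Minimal sketch of a feature-dimension attention block (illustrative,
    not the authors' exact ResTCN-FDA module): one branch learns a weight per
    feature dimension, the other a weight per feature group (e.g. acoustic
    vs. articulatory), and both re-scale the input."""

    def __init__(self, n_dims, n_groups, reduction=4):
        super().__init__()
        self.n_groups = n_groups
        # dimension (channel) attention over individual feature dimensions
        self.dim_fc = nn.Sequential(
            nn.Linear(n_dims, n_dims // reduction), nn.ReLU(),
            nn.Linear(n_dims // reduction, n_dims), nn.Sigmoid())
        # feature-type attention over coarse feature groups
        self.grp_fc = nn.Sequential(nn.Linear(n_groups, n_groups), nn.Sigmoid())

    def forward(self, x, group_index):
        # x: (batch, time, n_dims); group_index: LongTensor (n_dims,) giving
        # the feature group of each dimension
        pooled = x.mean(dim=1)                              # squeeze over time
        dim_w = self.dim_fc(pooled)                         # (batch, n_dims)
        grp_pooled = torch.stack(
            [pooled[:, group_index == g].mean(dim=1)
             for g in range(self.n_groups)], dim=1)         # (batch, n_groups)
        grp_w = self.grp_fc(grp_pooled)[:, group_index]     # weight per dimension
        return x * (dim_w * grp_w).unsqueeze(1)             # re-weighted features

# Example: 88-dimensional fused features, 60 acoustic + 28 articulatory dims.
fda = FeatureDimensionAttention(n_dims=88, n_groups=2)
groups = torch.tensor([0] * 60 + [1] * 28)
out = fda(torch.randn(4, 100, 88), groups)    # -> (4, 100, 88)
```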
Algorithm | Forward mapping MAE | Forward mapping RMSE | Reverse mapping MAE | Reverse mapping RMSE
GAN[22] | 1.217 | 1.642 | 0.946 | 1.189
CycleGAN[21] | 1.127 | 1.428 | 0.811 | 0.919
Bi-MGAN(G) | 1.034 | 1.341 | 0.801 | 0.908
Bi-MGAN(M) | 0.879 | 1.134 | 0.642 | 0.881
Bi-MGAN(GM) | 0.703 | 0.920 | 0.501 | 0.683
Tab.2 Ablation experiment of conversion network algorithms (MAE and RMSE in mm)
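The MAE and RMSE values in Tab.2 and Tab.3 are reported in millimetres, the unit of the EMA sensor coordinates used as articulatory targets. A minimal NumPy sketch of the two error metrics (variable names are illustrative):

```python
import numpy as np

def mae_rmse(pred, target):
    """Mean absolute error and root-mean-square error; both carry the unit of
    the inputs (mm for EMA sensor coordinates)."""
    err = np.asarray(pred) - np.asarray(target)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```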
Algorithm | Forward mapping MAE | Forward mapping RMSE | Reverse mapping MAE | Reverse mapping RMSE
DNN[14] | 1.479 | 1.613 | 1.143 | 1.259
BiLSTM[11] | 1.298 | 1.422 | 1.003 | 1.217
PSO-LSSVM[6] | 1.185 | 1.252 | 0.967 | 1.136
DRMDN[23] | 0.884 | 0.948 | 0.831 | 0.939
Bi-MGAN | 0.703 | 0.908 | 0.501 | 0.683
Tab.3 Comparison of mapping performance for conversion network algorithms (MAE and RMSE in mm)
Database | Algorithm | ACC | F1 | AUC
CASIA | TCN[20] | 70.69 | 69.18 | 70.79
CASIA | ResTCN[18] | 72.25 | 72.16 | 72.67
CASIA | ResTCN-FA | 76.25 | 76.56 | 76.96
CASIA | ResTCN-DA | 73.75 | 73.71 | 73.97
CASIA | ResTCN-FDA | 80.41 | 81.22 | 81.43
STEM-E2VA | TCN[20] | 64.71 | 64.52 | 66.69
STEM-E2VA | ResTCN[18] | 68.31 | 67.68 | 71.14
STEM-E2VA | ResTCN-FA | 73.63 | 73.78 | 74.83
STEM-E2VA | ResTCN-DA | 72.85 | 72.61 | 73.36
STEM-E2VA | ResTCN-FDA | 75.63 | 75.44 | 76.82
EMO-DB | TCN[20] | 71.26 | 68.67 | 71.71
EMO-DB | ResTCN[18] | 73.71 | 74.28 | 74.83
EMO-DB | ResTCN-FA | 76.22 | 76.62 | 76.94
EMO-DB | ResTCN-DA | 77.29 | 75.81 | 77.91
EMO-DB | ResTCN-FDA | 80.16 | 80.78 | 81.58
RAVDESS | TCN[20] | 59.07 | 59.02 | 59.66
RAVDESS | ResTCN[18] | 62.41 | 61.15 | 61.77
RAVDESS | ResTCN-FA | 63.93 | 62.40 | 63.98
RAVDESS | ResTCN-DA | 64.07 | 63.82 | 64.90
RAVDESS | ResTCN-FDA | 66.55 | 65.57 | 66.86
Tab.4 Ablation experiment of emotion recognition network algorithms (ACC, F1 and AUC in %)
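ACC, F1 and AUC in Tab.4 and Tab.5 are given in percent. The table captions do not state the averaging scheme; the scikit-learn sketch below assumes macro averaging and one-vs-rest AUC for the multi-class emotion labels.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def emotion_metrics(y_true, y_prob):
    """ACC, F1 and AUC (in %) for a multi-class emotion classifier.
    y_prob: (n_samples, n_classes) class probabilities. Macro averaging and
    one-vs-rest AUC are assumptions, not stated in the table captions."""
    y_pred = y_prob.argmax(axis=1)
    acc = 100 * accuracy_score(y_true, y_pred)
    f1 = 100 * f1_score(y_true, y_pred, average="macro")
    auc = 100 * roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    return acc, f1, auc
```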
Database | Algorithm | ACC | F1 | AUC
CASIA | CNN[12] | 63.00 | 62.43 | 63.19
CASIA | HS-TCN[24] | 76.25 | 76.64 | 76.91
CASIA | DRN[25] | 76.91 | 76.67 | 76.94
CASIA | ResTCN-FDA | 80.41 | 81.22 | 81.43
STEM-E2VA | CNN[12] | 56.77 | 56.84 | 57.77
STEM-E2VA | HS-TCN[24] | 72.81 | 72.59 | 72.92
STEM-E2VA | DRN[25] | 68.15 | 68.25 | 68.86
STEM-E2VA | ResTCN-FDA | 75.63 | 75.44 | 76.82
EMO-DB | CNN[12] | 69.72 | 69.09 | 69.86
EMO-DB | HS-TCN[24] | 74.76 | 73.60 | 75.51
EMO-DB | DRN[25] | 76.64 | 74.72 | 76.96
EMO-DB | ResTCN-FDA | 80.16 | 80.78 | 81.58
RAVDESS | CNN[12] | 57.29 | 55.67 | 57.85
RAVDESS | HS-TCN[24] | 63.29 | 63.56 | 63.58
RAVDESS | DRN[25] | 63.46 | 61.88 | 63.90
RAVDESS | ResTCN-FDA | 66.55 | 65.57 | 66.86
Tab.5 Comparison of emotion evaluation metrics for emotion recognition network algorithms (ACC, F1 and AUC in %)
Feature type | Feature set | Input modality | Dimension | ACC/% | F1/% | AUC/%
Articulatory | Articulatory(C) | Mapped articulatory features | 28 | 53.02 | 52.03 | 53.61
Articulatory | Articulatory(R) | Real articulatory features | 28 | 63.56 | 62.96 | 63.87
Acoustic | Acoustic(C) | Mapped acoustic features | 60 | 59.23 | 58.69 | 59.82
Acoustic | Acoustic(R) | Real acoustic features | 60 | 75.63 | 75.44 | 76.82
Acoustic & articulatory | Acoustic(R) + Articulatory(C) | Real acoustic + mapped articulatory features | 88 | 79.51 | 79.69 | 79.97
Acoustic & articulatory | Acoustic(C) + Articulatory(R) | Real articulatory + mapped acoustic features | 88 | 72.47 | 72.45 | 72.95
Acoustic & articulatory | Acoustic(R) + Articulatory(R) | Real acoustic + real articulatory features | 88 | 83.77 | 83.64 | 83.97
Pre-trained | HuBERT[26] | 48-layer Transformer | 1280 | 89.66 | 89.85 | 91.96
Pre-trained | Wav2vec 2.0[27] | 24-layer Transformer | 1024 | 82.57 | 82.25 | 83.93
Pre-trained + dimensionality reduction | HuBERT[26] | 48-layer Transformer + PCA | 60 | 78.54 | 78.93 | 79.15
Pre-trained + dimensionality reduction | HuBERT[26] | 48-layer Transformer + PCA | 88 | 80.16 | 80.01 | 80.42
Pre-trained + dimensionality reduction | Wav2vec 2.0[27] | 24-layer Transformer + PCA | 60 | 75.90 | 75.45 | 76.88
Pre-trained + dimensionality reduction | Wav2vec 2.0[27] | 24-layer Transformer + PCA | 88 | 76.18 | 76.65 | 76.96
Tab.6 Comparison of emotion evaluation metrics for different acoustic and articulatory features
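Tab.6 compares hand-crafted acoustic and articulatory feature sets with pre-trained HuBERT and Wav2vec 2.0 embeddings, the latter reduced to 60 or 88 dimensions by principal component analysis before classification. The scikit-learn sketch below illustrates that reduction and the fusion of real acoustic with mapped articulatory features; the array names and shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in arrays for illustration: one row per utterance.
hubert_emb = np.random.randn(500, 1280)          # 48-layer HuBERT embeddings
acoustic_real = np.random.randn(500, 60)         # real acoustic feature set
articulatory_mapped = np.random.randn(500, 28)   # articulatory features mapped by Bi-MGAN

# Pre-trained embeddings reduced by PCA to match the hand-crafted dimensionality.
hubert_60 = PCA(n_components=60).fit_transform(hubert_emb)               # 1280 -> 60

# Fusion of real acoustic and mapped articulatory features (60 + 28 = 88 dims).
fused_88 = np.concatenate([acoustic_real, articulatory_mapped], axis=1)
```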
Fig.7 Confusion matrix for different feature sets
[1] LEI J J, ZHU X W, WANG Y. BAT: block and token self-attention for speech emotion recognition[J]. Neural Networks, 2022, 156: 67-80. doi: 10.1016/j.neunet.2022.09.022
[2] LI Y, TAO J, CHAO L, et al. CHEAVD: a Chinese natural emotional audio-visual database[J]. Journal of Ambient Intelligence and Humanized Computing, 2017, 8(6): 913-924. doi: 10.1007/s12652-016-0406-z
[3] CHOU H C, LIN W C, CHANG L C, et al. NNIME: the NTHU-NTUA Chinese interactive multimodal emotion corpus[C]// 2017 Seventh International Conference on Affective Computing and Intelligent Interaction. San Antonio: IEEE, 2017: 292-298.
[4] BUSSO C, BULUT M, LEE C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359. doi: 10.1007/s10579-008-9076-6
[5] QIN C, CARREIRA M A. An empirical investigation of the nonuniqueness in the acoustic-to-articulatory mapping[C]// Eighth Annual Conference of the International Speech Communication Association. Antwerp: [s.n.], 2007: 27-31.
[6] REN G, FU J, SHAO G, et al. Articulatory-to-acoustic conversion of Mandarin emotional speech based on PSO-LSSVM[J]. Complexity, 2021, 29(3): 696-706.
[7] HOGDEN J, LOFQVIST A, GRACCO V, et al. Accurate recovery of articulator positions from acoustics: new conclusions based on human data[J]. The Journal of the Acoustical Society of America, 1996, 100(3): 1819-1834. doi: 10.1121/1.416001
[8] LING Z H, RICHMOND K, YAMAGISHI J, et al. Integrating articulatory features into HMM-based parametric speech synthesis[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2009, 17(6): 1171-1185. doi: 10.1109/TASL.2009.2014796
[9] LI M, KIM J, LAMMERT A, et al. Speaker verification based on the fusion of speech acoustics and inverted articulatory signals[J]. Computer Speech and Language, 2016, 36: 196-211. doi: 10.1016/j.csl.2015.05.003
[10] GUO L, WANG L, DANG J, et al. Learning affective representations based on magnitude and dynamic relative phase information for speech emotion recognition[J]. Speech Communication, 2022, 136(4): 118-127.
[11] CHEN Q, HUANG G. A novel dual attention based BLSTM with hybrid features in speech emotion recognition[J]. Engineering Applications of Artificial Intelligence, 2021, 102(5): 104277.
[12] ZHANG Jing, ZHANG Xue-ying, CHEN Gui-jun, et al. EEG emotion recognition based on the 3D-CNN and spatial-frequency attention mechanism[J]. Journal of Xidian University, 2022, 49(3): 191-198. (in Chinese) doi: 10.19665/j.issn1001-2400.2022.03.021
[13] KUMARAN U, RADHA R S, NAGARAJAN S M, et al. Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN[J]. International Journal of Speech Technology, 2021, 24(2): 303-314. doi: 10.1007/s10772-020-09792-x
[14] LIESKOVSKA E, JAKUBEC M, JARINA R, et al. A review on speech emotion recognition using deep learning and attention mechanism[J]. Electronics, 2021, 10(10): 1163. doi: 10.3390/electronics10101163
[15] ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2223-2232.
[16] YUAN J, BAO C. CycleGAN based speech enhancement for the unpaired training data[C]// 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Lanzhou: APSIPA, 2019: 878-883.
[17] SU B H, LEE C. Unsupervised cross-corpus speech emotion recognition using a multi-source CycleGAN[J]. IEEE Transactions on Affective Computing, 2022, 48(8): 650-715.
[18] LIN J, WIJNGAARDEN A J L, WANG K C, et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3440-3450. doi: 10.1109/TASLP.2021.3125143
[19] PANDEY A, WANG D L. TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain[C]// 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton: IEEE, 2019: 6875-6879.
[20] ZHANG L, SHI Z, HAN J, et al. FurcaNeXt: end-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks[C]// International Conference on Multimedia Modeling. Daejeon: Springer, 2020: 653-665.
[21] JIANG Z, ZHANG R, GUO Y, et al. Noise interference reduction in vision module of intelligent plant cultivation robot using better Cycle GAN[J]. IEEE Sensors Journal, 2022, 22(11): 11045-11055. doi: 10.1109/JSEN.2022.3164915
[22] GOODFELLOW I, POUGET A J, MIRZA M, et al. Generative adversarial nets[J]. Advances in Neural Information Processing Systems, 2014, 27: 42-51.
[23] LIU P, YU Q, WU Z, et al. A deep recurrent approach for acoustic-to-articulatory inversion[C]// 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Brisbane: IEEE, 2015: 4450-4454.
[24] CHENG Y, XU Y, ZHONG H, et al. Leveraging semisupervised hierarchical stacking temporal convolutional network for anomaly detection in IoT communication[J]. IEEE Internet of Things Journal, 2020, 8(1): 144-155.
[25] ZHAO Z P, LI Q F, ZHANG Z X, et al. Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition[J]. Neural Networks, 2021, 141: 52-60. doi: 10.1016/j.neunet.2021.03.013
[26] CHANG X K, et al. An exploration of self-supervised pretrained representations for end-to-end speech recognition[C]// 2021 IEEE Automatic Speech Recognition and Understanding Workshop. Cartagena: IEEE, 2021: 228-235.
[27] BAEVSKI A, ZHOU Y, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[J]. Advances in Neural Information Processing Systems, 2020, 33: 12449-12460.