Vis Inf 2019, Vol. 3, Issue 3: 150-155. DOI: 10.1016/j.visinf.2019.10.003
Article
Multi-modal Correlated Network for emotion recognition in speech
Minjie Ren, Weizhi Nie, Anan Liu, Yuting Su
The School of Electrical and Information Engineering, Tianjin University, China
Full text: PDF
Abstract: With the growing demand for automatic emotion recognition systems, emotion recognition is becoming increasingly important for human–computer interaction (HCI) research. Recently, the performance of automatic emotion recognition has improved steadily owing to advances in both hardware and deep learning methods. However, because emotion is an abstract concept with multiple forms of expression, automatic emotion recognition remains a challenging task. In this paper, we propose a novel Multi-modal Correlated Network for emotion recognition in speech, which exploits information from both the audio and visual channels to achieve more robust and accurate detection. In the proposed method, the audio and visual signals are first preprocessed for feature extraction. After preprocessing, we obtain Mel-spectrograms, which can be treated as images, and representative frames from the visual segments. The Mel-spectrograms are then fed to a convolutional neural network (CNN) to obtain the audio features, while the representative frames are fed to a CNN and an LSTM to obtain the visual features. Specifically, we employ a triplet loss to increase inter-class differentiation. Meanwhile, we propose a novel correlated loss to reduce intra-class variation. Finally, we apply a feature fusion method to combine the audio and visual features for emotion recognition. Experimental results on the AFEW dataset demonstrate that the correlation information across modalities is crucial for automatic emotion recognition, and that the proposed method achieves state-of-the-art performance on the classification task.
Key words: Multi-modal; Emotion recognition; Neural networks
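
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two feature branches and the combined objective. It is an illustration under stated assumptions, not the authors' released implementation: the layer sizes, the 128-dimensional embeddings, the triplet margin of 1.0, the 0.5 weight on the correlated loss, and the specific form of that loss (mean pairwise distance among same-class embeddings) are all assumptions, and the triplet sampling is deliberately simplified.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioBranch(nn.Module):
        # CNN over Mel-spectrograms treated as single-channel images.
        def __init__(self, embed_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.fc = nn.Linear(64, embed_dim)

        def forward(self, mel):                      # mel: (B, 1, n_mels, T)
            return self.fc(self.conv(mel).flatten(1))

    class VisualBranch(nn.Module):
        # Frame-wise CNN followed by an LSTM over the representative frames.
        def __init__(self, embed_dim=128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.lstm = nn.LSTM(32, embed_dim, batch_first=True)

        def forward(self, frames):                   # frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
            _, (h, _) = self.lstm(feats)
            return h[-1]                             # (B, embed_dim)

    def correlated_loss(emb, labels):
        # Assumed form of the correlated loss: mean pairwise distance among
        # same-class embeddings, pulling features of one emotion together.
        loss, groups = emb.new_zeros(()), 0
        for c in labels.unique():
            members = emb[labels == c]
            if len(members) > 1:
                loss, groups = loss + torch.pdist(members).mean(), groups + 1
        return loss / max(groups, 1)

    audio_net, visual_net = AudioBranch(), VisualBranch()
    classifier = nn.Linear(256, 7)                   # 7 basic emotion classes
    triplet = nn.TripletMarginLoss(margin=1.0)       # margin is an assumption

    mel = torch.randn(8, 1, 64, 100)                 # dummy Mel-spectrograms
    video = torch.randn(8, 16, 3, 64, 64)            # 16 frames per clip (dummy)
    labels = torch.randint(0, 7, (8,))

    fused = torch.cat([audio_net(mel), visual_net(video)], dim=1)  # fusion
    # Triplet sampling is simplified: rolled batches stand in for mined
    # positive/negative examples; a real run would mine them by class label.
    loss = (F.cross_entropy(classifier(fused), labels)
            + triplet(fused, fused.roll(1, 0), fused.roll(2, 0))
            + 0.5 * correlated_loss(fused, labels))  # 0.5 weight: assumption
    loss.backward()

Concatenation is used for the audio–visual fusion here only for simplicity; the fusion strategy and loss weighting actually used by the authors should be taken from the full text.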
Published: 2019-12-05
Corresponding author: Weizhi Nie. E-mail: weizhinie@tju.edu.cn

Cite this article:

Minjie Ren, Weizhi Nie, Anan Liu, Yuting Su. Multi-modal Correlated Network for emotion recognition in speech. Vis Inf, 2019, 3(3): 150-155.

Link to this article:

http://www.zjujournals.com/vi/CN/10.1016/j.visinf.2019.10.003        http://www.zjujournals.com/vi/CN/Y2019/V3/I3/150
