Journal of Zhejiang University (Engineering Science)  2026, Vol. 60, Issue (4): 782-790    DOI: 10.3785/j.issn.1008-973X.2026.04.010
Computer Technology
Speech emotion recognition with unsupervised graph contrastive learning
Xuemei ZHANG(),Ying SUN*(),Xueying ZHANG
College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China

Abstract:

A speech emotion recognition network based on unsupervised graph contrastive learning (SERUGCL) was proposed to address the sparsity of labeled data in most speech datasets and the difficulty of modeling high-dimensional speech features. The method was trained on unlabeled data. Firstly, an original view of the speech features was constructed based on feature similarity, and a graph structure was used to model the dependencies between speech frames, thereby alleviating the computational pressure of modeling high-dimensional features directly. Then, two augmented views were generated: one via the fast gradient sign method (FGSM) and one via a combination of subgraph sampling and edge perturbation. All views were processed by differentiated encoders, and a weighted pooling mechanism was adopted to obtain the global embedding. Finally, a support vector machine (SVM) was used for emotion classification. The SERUGCL model achieved an unweighted accuracy (UA) of 69.96% and a weighted accuracy (WA) of 70.24% on the IEMOCAP dataset, and a UA of 91.04% and a WA of 90.29% on the EMO-DB dataset. Compared with DSTCNet, the UA and WA of SERUGCL improved by 8.18 and 8.44 percentage points on IEMOCAP and by 4.49 and 1.50 percentage points on EMO-DB, respectively. The results of comparative and ablation experiments also verified the effectiveness of the model.

Key words: speech emotion recognition    unsupervised learning    graph contrastive learning    feature augmentation    weighted pooling
Received: 2025-05-14    Published: 2026-03-19
CLC: TP 391.4
Corresponding author: SUN Ying    E-mail: 2929164474@qq.com; tyutsy@163.com
About the author: ZHANG Xuemei (2000—), female, master's student, engaged in research on speech emotion recognition. orcid.org/0009-0008-4842-4159. E-mail: 2929164474@qq.com
Cite this article:

Xuemei ZHANG, Ying SUN, Xueying ZHANG. Speech emotion recognition with unsupervised graph contrastive learning. Journal of Zhejiang University (Engineering Science), 2026, 60(4): 782-790.

Link this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.04.010        https://www.zjujournals.com/eng/CN/Y2026/V60/I4/782

Fig. 1  SERUGCL framework
Fig. 2  Flow of speech graph construction based on feature similarity
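As an illustration of the similarity-based graph construction in Fig. 2, the following is a minimal sketch that links each speech frame to its k most similar frames by cosine similarity. The function name, the k-nearest-neighbour rule, and the choice of cosine similarity are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def build_speech_graph(X, k=5):
    """Build a k-nearest-neighbour graph over speech frames (sketch).

    X: (n_frames, n_features) matrix of frame-level features.
    Returns a symmetric 0/1 adjacency matrix that connects each
    frame to its k most similar frames under cosine similarity.
    """
    # Cosine similarity between every pair of frames
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)          # exclude self-loops

    A = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(S[i])[-k:]      # indices of k most similar frames
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)             # symmetrise the graph

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 16))             # 20 frames, 16-dim features
A = build_speech_graph(X, k=3)
```

Symmetrisation means a node may end up with more than k neighbours; this keeps the graph undirected, which is what message-passing encoders such as GCN and GIN expect.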
Algorithm: FGSM augmentation $(G,\epsilon)$
Input: graph $G=(V,E,\boldsymbol{X})$, perturbation strength $\epsilon$
Output: augmented graph $ {T}_{{\mathrm{t}}}(G)=(V,E,{\boldsymbol{X}}^{\prime}) $
1: use the encoder $f(\cdot)$ to generate graph embeddings $\mathbf{gx}$ and $\mathbf{gx}_{1}$
2: compute the loss $\mathcal{L}(\mathbf{gx},\mathbf{gx}_{1})$ between $\mathbf{gx}$ and $\mathbf{gx}_{1}$
3: back-propagate to obtain the gradient of the loss with respect to $\boldsymbol{X}$: $ {\nabla}_{\boldsymbol{X}}\mathcal{L}(\mathbf{gx},\mathbf{gx}_{1}) $
4: generate the adversarial feature matrix $ {\boldsymbol{X}}^{\prime}=\boldsymbol{X}+\epsilon\cdot\mathrm{sign}\;({\nabla}_{\boldsymbol{X}}\mathcal{L}(\mathbf{gx},\mathbf{gx}_{1})) $
5: construct the augmented graph $ {T}_{{\mathrm{t}}}(G) $, keeping the topology unchanged
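The FGSM flow above can be sketched as follows. The paper's encoder and contrastive loss are not reproduced here, so a toy mean-pooled linear encoder and a squared-distance loss (whose gradient is computed analytically in place of back-propagation) stand in for steps 1-3; step 4 is the signed-gradient update on the node features, and the topology is left untouched (step 5). All names are illustrative.

```python
import numpy as np

def fgsm_augment(X, W, g1, eps=0.05):
    """One FGSM step on node features X (toy stand-in for steps 1-4).

    Assumed toy encoder: mean-pooled linear embedding g = mean(X @ W).
    Assumed loss: squared distance ||g - g1||^2 to a reference
    embedding g1. Returns X' = X + eps * sign(dL/dX); edges unchanged.
    """
    n = X.shape[0]
    g = (X @ W).mean(axis=0)                                  # graph embedding gx
    # Analytic gradient of ||g - g1||^2 w.r.t. X (same for every row)
    grad = (2.0 / n) * np.outer(np.ones(n), (g - g1) @ W.T)
    return X + eps * np.sign(grad)                            # step 4

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 16))     # 20 nodes, 16-dim features
W = rng.normal(size=(16, 8))      # toy encoder weights
g1 = rng.normal(size=8)           # reference embedding gx1
Xp = fgsm_augment(X, W, g1, eps=0.05)
```

Because only sign information is used, every feature moves by at most eps, which bounds the perturbation regardless of the gradient's magnitude.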
  
Fig. 3  Augmentation via the combination of subgraph sampling and edge perturbation
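The combined augmentation of Fig. 3 can be sketched as: sample an induced subgraph over a random subset of nodes, then randomly drop a fraction of the remaining edges. The ratios and the drop-only form of edge perturbation are assumptions; the paper's exact sampling scheme may differ.

```python
import numpy as np

def subgraph_edge_perturb(A, node_ratio=0.8, edge_drop=0.1, seed=0):
    """Combined subgraph sampling and edge perturbation (sketch).

    1) Keep a random subset of nodes and take the induced subgraph.
    2) Drop each surviving edge independently with probability edge_drop.
    A is a symmetric 0/1 adjacency matrix; ratios are assumptions.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    keep = np.sort(rng.choice(n, size=int(n * node_ratio), replace=False))
    sub = A[np.ix_(keep, keep)].copy()        # induced subgraph

    # Perturb upper-triangular edges, then mirror to keep symmetry
    iu = np.triu_indices_from(sub, k=1)
    mask = rng.random(len(iu[0])) >= edge_drop
    sub[iu] = sub[iu] * mask
    sub[(iu[1], iu[0])] = sub[iu]
    return keep, sub

rng = np.random.default_rng(2)
A = (rng.random((30, 30)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                                   # random symmetric 0/1 graph
keep, sub = subgraph_edge_perturb(A, node_ratio=0.8, edge_drop=0.2, seed=3)
```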
Fig. 4  Weighted pooling mechanism
| Model | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| Model 1 | 60.86 | 61.23 | 88.16 | 87.01 |
| Model 2 | 63.42 | 63.49 | 89.10 | 87.69 |
| Model 3 | 63.94 | 64.04 | 89.45 | 88.48 |
| Model 4 | 68.84 | 69.05 | 90.50 | 89.68 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 1  Model comparison under different augmentation strategies and pooling methods
| Model | Year | UA/% | WA/% |
| --- | --- | --- | --- |
| GA-GRU[28] | 2020 | 63.80 | 62.27 |
| LSTM-GIN[29] | 2021 | 65.53 | 64.65 |
| CoGCN[13] | 2022 | 63.67 | 62.64 |
| GLNN[20] | 2023 | 68.65 | 68.11 |
| MTL[30] | 2024 | 69.16 | — |
| DSTCNet[31] | 2025 | 61.78 | 61.80 |
| APIN[32] | 2025 | 60.35 | 60.80 |
| SERUGCL | 2025 | 69.96 | 70.24 |

Table 2  Comparison with other speech emotion recognition models on the IEMOCAP dataset
| Model | Year | UA/% | WA/% |
| --- | --- | --- | --- |
| AMSNet[33] | 2023 | 88.56 | 88.34 |
| SER-Graph[4] | 2024 | 77.80 | — |
| DSTCNet-BLSTM[31] | 2025 | 84.72 | 85.98 |
| DSTCNet[31] | 2025 | 86.55 | 88.79 |
| APIN[32] | 2025 | 86.00 | 87.85 |
| CL[34] | 2025 | 89.60 | — |
| SERUGCL | 2025 | 91.04 | 90.29 |

Table 3  Comparison with other speech emotion recognition models on the EMO-DB dataset
Fig. 5  Confusion matrices of the SERUGCL model
| Method | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| S-F | 69.19 | 69.42 | 90.53 | 89.48 |
| S-SE | 65.57 | 65.93 | 89.05 | 87.70 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 4  Ablation results for the augmentation module
| Method | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| mean | 67.15 | 67.41 | 90.61 | 89.77 |
| max | 66.89 | 67.08 | 89.46 | 87.63 |
| sum | 68.84 | 69.05 | 90.50 | 89.68 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 5  Ablation results for the pooling module
| Method | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| GIN | 68.11 | 68.24 | 89.45 | 88.43 |
| GCN | 68.95 | 69.12 | 90.23 | 89.33 |
| GCN-GIN | 68.54 | 68.81 | 89.90 | 89.02 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 6  Ablation results for the encoder module
| Pooling scheme | max | mean | soft |
| --- | --- | --- | --- |
| A | 0.50 | 0.25 | 0.25 |
| B | 0.25 | 0.50 | 0.25 |
| C | 0.25 | 0.25 | 0.50 |
| D | 0.60 | 0.20 | 0.20 |
| E | 0.20 | 0.60 | 0.20 |
| F | 0.20 | 0.20 | 0.60 |
| G | 0.30 | 0.30 | 0.30 |

Table 7  Weights of each pooling method under the named pooling schemes
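A weighted pooling of the kind weighted in Table 7 can be sketched as a convex combination of max, mean, and a soft (softmax-attention) pooling of the node embeddings. The per-node score used for the soft branch is an assumption made for illustration; the paper's soft pooling may be parameterised differently.

```python
import numpy as np

def weighted_pooling(H, w_max=0.5, w_mean=0.25, w_soft=0.25):
    """Weighted pooling over node embeddings H of shape (n_nodes, d).

    Combines max, mean, and softmax-attention pooling with scalar
    weights, e.g. scheme A in Table 7 (0.50 / 0.25 / 0.25). The soft
    branch scores each node by the sum of its embedding (assumption).
    """
    p_max = H.max(axis=0)                        # element-wise max pooling
    p_mean = H.mean(axis=0)                      # mean pooling
    scores = H.sum(axis=1)                       # per-node attention score
    att = np.exp(scores - scores.max())
    att /= att.sum()                             # softmax over nodes
    p_soft = att @ H                             # attention-weighted sum
    return w_max * p_max + w_mean * p_mean + w_soft * p_soft

rng = np.random.default_rng(4)
H = rng.normal(size=(10, 6))                     # 10 nodes, 6-dim embeddings
z = weighted_pooling(H, 0.50, 0.25, 0.25)        # scheme A of Table 7
```

Because the three weights sum to 1, the pooled embedding stays on the same scale as the individual pooling outputs, so the schemes in Table 7 are directly comparable.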
Fig. 6  Experimental results for different FGSM perturbation strengths and graph-level augmentation ratios
Fig. 7  Experimental results for different weighted-pooling weights
1 HU Y, TANG Y, HUANG H, et al. A graph isomorphism network with weighted multiple aggregators for speech emotion recognition [C]// Interspeech 2022. Incheon: ISCA, 2022: 4705−4709.
2 SUN Ying, HU Yanxiang, ZHANG Xueying, et al. Prediction of emotional dimensions PAD for emotional speech recognition [J]. Journal of Zhejiang University: Engineering Science, 2019, 53(10): 2041−2048
3 SUN Zhi, WANG Guan. CNN-GRU speech emotion recognition algorithm for self-supervised comparative learning [J]. Journal of Xidian University, 2024, 51(6): 182−193
doi: 10.19665/j.issn1001-2400.20241109
4 PENTARI A, KAFENTZIS G, TSIKNAKIS M Speech emotion recognition via graph-based representations[J]. Scientific Reports, 2024, 14: 4484
doi: 10.1038/s41598-024-52989-2
5 ABDELHAMID A A, EL-KENAWY E M, ALOTAIBI B, et al Robust speech emotion recognition using CNN+LSTM based on stochastic fractal search optimization algorithm[J]. IEEE Access, 2022, 10: 49265- 49284
doi: 10.1109/ACCESS.2022.3172954
6 ZHU Z, DAI W, HU Y, et al Speech emotion recognition model based on Bi-GRU and focal loss[J]. Pattern Recognition Letters, 2020, 140: 358- 365
doi: 10.1016/j.patrec.2020.11.009
7 LI M, YANG B, LEVY J, et al. Contrastive unsupervised learning for speech emotion recognition [C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6329−6333.
8 GERCZUK M, AMIRIPARIAN S, OTTL S, et al EmoNet: a transfer learning framework for multi-corpus speech emotion recognition[J]. IEEE Transactions on Affective Computing, 2023, 14 (2): 1472- 1487
doi: 10.1109/TAFFC.2021.3135152
9 XU X, DENG J, COUTINHO E, et al Connecting subspace learning and extreme learning machine in speech emotion recognition[J]. IEEE Transactions on Multimedia, 2019, 21 (3): 795- 808
doi: 10.1109/TMM.2018.2865834
10 PENTARI A, KAFENTZIS G, TSIKNAKIS M. Investigating graph-based features for speech emotion recognition [C]// IEEE-EMBS International Conference on Biomedical and Health Informatics. Ioannina: IEEE, 2022: 1–5.
11 MELO D F P, FADIGAS I S, PEREIRA H B B Graph-based feature extraction: a new proposal to study the classification of music signals outside the time-frequency domain[J]. PLoS One, 2020, 15 (11): e0240915
doi: 10.1371/journal.pone.0240915
12 SHIRIAN A, GUHA T. Compact graph architecture for speech emotion recognition [C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6284–6288.
13 KIM J, KIM J. Representation learning with graph neural networks for speech emotion recognition [EB/OL]. (2022–01–26) [2025–06–10]. https://arxiv.org/abs/2208.09830.
14 GHAYEKHLOO M, NICKABADI A Supervised contrastive learning for graph representation enhancement[J]. Neurocomputing, 2024, 588: 127710
doi: 10.1016/j.neucom.2024.127710
15 YOU Y N, CHEN T L, SUI Y D, et al. Graph contrastive learning with augmentations [C]// 34th International Conference on Neural Information Processing Systems. Vancouver: NeurIPS, 2020: 5812−5823.
16 SHIRIAN A, SOMANDEPALLI K, GUHA T Self-supervised graphs for audio representation learning with limited labeled data[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16 (6): 1391- 1401
doi: 10.1109/JSTSP.2022.3190083
17 ESKIMEZ S E, DUAN Z, HEINZELMAN W. Unsupervised learning approach to feature analysis for automatic speech emotion recognition [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 5099–5103.
18 KANG H, XU Y, JIN G, et al FCAN: speech emotion recognition network based on focused contrastive learning[J]. Biomedical Signal Processing and Control, 2024, 96: 106545
doi: 10.1016/j.bspc.2024.106545
19 SONG X, HUANG L, XUE H, et al. Supervised prototypical contrastive learning for emotion recognition in conversation [C]// Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, Stroudsburg: ACL, 2022: 5197−5206.
20 LI Y, WANG Y, YANG X, et al Speech emotion recognition based on Graph-LSTM neural network[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2023, (1): 40
doi: 10.1186/s13636-023-00303-9
21 XU Y, WANG J, GUANG M, et al Graph contrastive learning with Min-max mutual information[J]. Information Sciences, 2024, 665: 120378
doi: 10.1016/j.ins.2024.120378
22 WONG E, RICE L, KOLTER J. Fast is better than free: revisiting adversarial training [C]// 8th International Conference on Learning Representations. [S. l. ]: ICLR, 2020.
23 XU K, HU W H, LESKOVEC J, et al. How powerful are graph neural networks? [C]// 7th International Conference on Learning Representations. New Orleans: ICLR, 2019.
24 KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks [C]// 5th International Conference on Learning Representation. Toulon: ICLR, 2017.
25 BUSSO C, BULUT M, LEE C C, et al IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42 (4): 335- 359
doi: 10.1007/s10579-008-9076-6
26 BURKHARDT F, PAESCHKE A, ROLFES M, et al. A database of German emotional speech [C]// Interspeech 2005. Lisbon: ISCA, 2005: 1517−1520.
27 SCHULLER B, STEIDL S, BATLINER A, et al. The INTERSPEECH 2010 paralinguistic challenge [C]// Interspeech 2010. Chiba: ISCA, 2010: 2794−2797.
28 PANDEY S K, SHEKHAWAT H S, PRASANNA S R M Attention gated tensor neural network architectures for speech emotion recognition[J]. Biomedical Signal Processing and Control, 2022, 71: 103173
doi: 10.1016/j.bspc.2021.103173
29 LIU J, WANG H. Graph isomorphism network for speech emotion recognition [C]// Interspeech 2021. Brno: ISCA, 2021: 3405−3409.
30 ULGEN I R, DU Z, BUSSO C, et al. Revealing emotional clusters in speaker embeddings: a contrastive learning strategy for speech emotion recognition [C]// 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul: IEEE, 2024: 12081–12085.
31 GUO L, DING S, WANG L, et al DSTCNet: deep spectro-temporal-channel attention network for speech emotion recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36 (1): 188- 197
doi: 10.1109/TNNLS.2023.3304516
32 GUO L, LI J, DING S, et al APIN: amplitude- and phase-aware interaction network for speech emotion recognition[J]. Speech Communication, 2025, 169: 103201
doi: 10.1016/j.specom.2025.103201
33 CHEN Z, LI J, LIU H, et al Learning multi-scale features for speech emotion recognition with connection attention mechanism[J]. Expert Systems with Applications, 2023, 214: 118943
doi: 10.1016/j.eswa.2022.118943