Journal of Zhejiang University (Engineering Science)  2026, Vol. 60, Issue (4): 782-790    DOI: 10.3785/j.issn.1008-973X.2026.04.010
Computer Technology
Speech emotion recognition with unsupervised graph contrastive learning
Xuemei ZHANG(),Ying SUN*(),Xueying ZHANG
College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China

Abstract:

A speech emotion recognition network based on unsupervised graph contrastive learning (SERUGCL) was proposed to address the sparsity of labeled data in most speech datasets and the difficulty of modeling high-dimensional speech features. The method was trained on unlabeled data. Firstly, an original view of the speech features was constructed based on feature similarity, and a graph structure was used to model the dependencies between speech frames, thereby alleviating the computational pressure of modeling high-dimensional features directly. Then, two augmented views were generated: one via the fast gradient sign method (FGSM) and one via a combination of subgraph sampling and edge perturbation. All views were processed by differentiated encoders, and a weighted pooling mechanism was adopted to obtain the global embedding. Finally, a support vector machine (SVM) was used for emotion classification. The SERUGCL model achieved an unweighted accuracy (UA) of 69.96% and a weighted accuracy (WA) of 70.24% on the IEMOCAP dataset, and a UA of 91.04% and a WA of 90.29% on the EMO-DB dataset. Compared with DSTCNet, the UA and WA of SERUGCL improved by 8.18 and 8.44 percentage points on IEMOCAP and by 4.49 and 1.50 percentage points on EMO-DB, respectively. The results of comparative and ablation experiments also verified the effectiveness of the model.

Key words: speech emotion recognition    unsupervised learning    graph contrastive learning    feature augmentation    weighted pooling
Received: 2025-05-14    Published: 2026-03-19
CLC: TP 391.4
Corresponding author: SUN Ying    E-mail: 2929164474@qq.com; tyutsy@163.com
About the author: ZHANG Xuemei (2000—), female, master's student, engaged in research on speech emotion recognition. orcid.org/0009-0008-4842-4159. E-mail: 2929164474@qq.com
Cite this article:

Xuemei ZHANG, Ying SUN, Xueying ZHANG. Speech emotion recognition with unsupervised graph contrastive learning. Journal of Zhejiang University (Engineering Science), 2026, 60(4): 782-790.

Link this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.04.010        https://www.zjujournals.com/eng/CN/Y2026/V60/I4/782

Fig. 1  SERUGCL framework
Fig. 2  Flow of speech graph construction based on feature similarity
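As an illustration of the similarity-based graph construction in Fig. 2, the following is a minimal sketch that links each speech frame to its k most similar frames by cosine similarity. The function name, the k-nearest-neighbour rule, and the choice of cosine similarity are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def build_speech_graph(X, k=5):
    """Build a k-nearest-neighbour graph over speech frames (sketch).

    X: (n_frames, n_features) matrix of frame-level features.
    Returns a symmetric 0/1 adjacency matrix that connects each
    frame to its k most similar frames under cosine similarity.
    """
    # Cosine similarity between every pair of frames
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)          # exclude self-loops

    A = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(S[i])[-k:]      # indices of k most similar frames
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)             # symmetrise the graph

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 16))             # 20 frames, 16-dim features
A = build_speech_graph(X, k=3)
```

Symmetrisation means a node may end up with more than k neighbours; this keeps the graph undirected, which is what message-passing encoders such as GCN and GIN expect.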
Algorithm: FGSM augmentation $(G,\epsilon)$
Input: graph $G=(V,E,\boldsymbol{X})$, perturbation strength $\epsilon$
Output: augmented graph $ {T}_{{\mathrm{t}}}(G)=(V,E,{\boldsymbol{X}}^{\prime}) $
1: use the encoder $f(\cdot)$ to generate graph embeddings $\mathbf{gx}$ and $\mathbf{gx}_{1}$
2: compute the loss $\mathcal{L}(\mathbf{gx},\mathbf{gx}_{1})$ between $\mathbf{gx}$ and $\mathbf{gx}_{1}$
3: back-propagate to obtain the gradient of the loss with respect to $\boldsymbol{X}$: $ {\nabla}_{\boldsymbol{X}}\mathcal{L}(\mathbf{gx},\mathbf{gx}_{1}) $
4: generate the adversarial feature matrix $ {\boldsymbol{X}}^{\prime}=\boldsymbol{X}+\epsilon\cdot\mathrm{sign}\;({\nabla}_{\boldsymbol{X}}\mathcal{L}(\mathbf{gx},\mathbf{gx}_{1})) $
5: construct the augmented graph $ {T}_{{\mathrm{t}}}(G) $, keeping the topology unchanged
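The FGSM flow above can be sketched as follows. The paper's encoder and contrastive loss are not reproduced here, so a toy mean-pooled linear encoder and a squared-distance loss (whose gradient is computed analytically in place of back-propagation) stand in for steps 1-3; step 4 is the signed-gradient update on the node features, and the topology is left untouched (step 5). All names are illustrative.

```python
import numpy as np

def fgsm_augment(X, W, g1, eps=0.05):
    """One FGSM step on node features X (toy stand-in for steps 1-4).

    Assumed toy encoder: mean-pooled linear embedding g = mean(X @ W).
    Assumed loss: squared distance ||g - g1||^2 to a reference
    embedding g1. Returns X' = X + eps * sign(dL/dX); edges unchanged.
    """
    n = X.shape[0]
    g = (X @ W).mean(axis=0)                                  # graph embedding gx
    # Analytic gradient of ||g - g1||^2 w.r.t. X (same for every row)
    grad = (2.0 / n) * np.outer(np.ones(n), (g - g1) @ W.T)
    return X + eps * np.sign(grad)                            # step 4

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 16))     # 20 nodes, 16-dim features
W = rng.normal(size=(16, 8))      # toy encoder weights
g1 = rng.normal(size=8)           # reference embedding gx1
Xp = fgsm_augment(X, W, g1, eps=0.05)
```

Because only sign information is used, every feature moves by at most eps, which bounds the perturbation regardless of the gradient's magnitude.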
  
Fig. 3  Augmentation via the combination of subgraph sampling and edge perturbation
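The combined augmentation of Fig. 3 can be sketched as: sample an induced subgraph over a random subset of nodes, then randomly drop a fraction of the remaining edges. The ratios and the drop-only form of edge perturbation are assumptions; the paper's exact sampling scheme may differ.

```python
import numpy as np

def subgraph_edge_perturb(A, node_ratio=0.8, edge_drop=0.1, seed=0):
    """Combined subgraph sampling and edge perturbation (sketch).

    1) Keep a random subset of nodes and take the induced subgraph.
    2) Drop each surviving edge independently with probability edge_drop.
    A is a symmetric 0/1 adjacency matrix; ratios are assumptions.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    keep = np.sort(rng.choice(n, size=int(n * node_ratio), replace=False))
    sub = A[np.ix_(keep, keep)].copy()        # induced subgraph

    # Perturb upper-triangular edges, then mirror to keep symmetry
    iu = np.triu_indices_from(sub, k=1)
    mask = rng.random(len(iu[0])) >= edge_drop
    sub[iu] = sub[iu] * mask
    sub[(iu[1], iu[0])] = sub[iu]
    return keep, sub

rng = np.random.default_rng(2)
A = (rng.random((30, 30)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                                   # random symmetric 0/1 graph
keep, sub = subgraph_edge_perturb(A, node_ratio=0.8, edge_drop=0.2, seed=3)
```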
Fig. 4  Weighted pooling mechanism
| Model | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| Model 1 | 60.86 | 61.23 | 88.16 | 87.01 |
| Model 2 | 63.42 | 63.49 | 89.10 | 87.69 |
| Model 3 | 63.94 | 64.04 | 89.45 | 88.48 |
| Model 4 | 68.84 | 69.05 | 90.50 | 89.68 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 1  Model comparison under different augmentation strategies and pooling methods
| Model | Year | UA/% | WA/% |
| --- | --- | --- | --- |
| GA-GRU[28] | 2020 | 63.80 | 62.27 |
| LSTM-GIN[29] | 2021 | 65.53 | 64.65 |
| CoGCN[13] | 2022 | 63.67 | 62.64 |
| GLNN[20] | 2023 | 68.65 | 68.11 |
| MTL[30] | 2024 | 69.16 | — |
| DSTCNet[31] | 2025 | 61.78 | 61.80 |
| APIN[32] | 2025 | 60.35 | 60.80 |
| SERUGCL | 2025 | 69.96 | 70.24 |

Table 2  Comparison with other speech emotion recognition models on the IEMOCAP dataset
| Model | Year | UA/% | WA/% |
| --- | --- | --- | --- |
| AMSNet[33] | 2023 | 88.56 | 88.34 |
| SER-Graph[4] | 2024 | 77.80 | — |
| DSTCNet-BLSTM[31] | 2025 | 84.72 | 85.98 |
| DSTCNet[31] | 2025 | 86.55 | 88.79 |
| APIN[32] | 2025 | 86.00 | 87.85 |
| CL[34] | 2025 | 89.60 | — |
| SERUGCL | 2025 | 91.04 | 90.29 |

Table 3  Comparison with other speech emotion recognition models on the EMO-DB dataset
Fig. 5  Confusion matrices of the SERUGCL model
| Method | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| S-F | 69.19 | 69.42 | 90.53 | 89.48 |
| S-SE | 65.57 | 65.93 | 89.05 | 87.70 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 4  Ablation results for the augmentation module
| Method | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| mean | 67.15 | 67.41 | 90.61 | 89.77 |
| max | 66.89 | 67.08 | 89.46 | 87.63 |
| sum | 68.84 | 69.05 | 90.50 | 89.68 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 5  Ablation results for the pooling module
| Method | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/% |
| --- | --- | --- | --- | --- |
| GIN | 68.11 | 68.24 | 89.45 | 88.43 |
| GCN | 68.95 | 69.12 | 90.23 | 89.33 |
| GCN-GIN | 68.54 | 68.81 | 89.90 | 89.02 |
| SERUGCL | 69.96 | 70.24 | 91.04 | 90.29 |

Table 6  Ablation results for the encoder module
| Pooling scheme | max | mean | soft |
| --- | --- | --- | --- |
| A | 0.50 | 0.25 | 0.25 |
| B | 0.25 | 0.50 | 0.25 |
| C | 0.25 | 0.25 | 0.50 |
| D | 0.60 | 0.20 | 0.20 |
| E | 0.20 | 0.60 | 0.20 |
| F | 0.20 | 0.20 | 0.60 |
| G | 0.30 | 0.30 | 0.30 |

Table 7  Weights of each pooling method under the named pooling schemes
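A weighted pooling of the kind weighted in Table 7 can be sketched as a convex combination of max, mean, and a soft (softmax-attention) pooling of the node embeddings. The per-node score used for the soft branch is an assumption made for illustration; the paper's soft pooling may be parameterised differently.

```python
import numpy as np

def weighted_pooling(H, w_max=0.5, w_mean=0.25, w_soft=0.25):
    """Weighted pooling over node embeddings H of shape (n_nodes, d).

    Combines max, mean, and softmax-attention pooling with scalar
    weights, e.g. scheme A in Table 7 (0.50 / 0.25 / 0.25). The soft
    branch scores each node by the sum of its embedding (assumption).
    """
    p_max = H.max(axis=0)                        # element-wise max pooling
    p_mean = H.mean(axis=0)                      # mean pooling
    scores = H.sum(axis=1)                       # per-node attention score
    att = np.exp(scores - scores.max())
    att /= att.sum()                             # softmax over nodes
    p_soft = att @ H                             # attention-weighted sum
    return w_max * p_max + w_mean * p_mean + w_soft * p_soft

rng = np.random.default_rng(4)
H = rng.normal(size=(10, 6))                     # 10 nodes, 6-dim embeddings
z = weighted_pooling(H, 0.50, 0.25, 0.25)        # scheme A of Table 7
```

Because the three weights sum to 1, the pooled embedding stays on the same scale as the individual pooling outputs, so the schemes in Table 7 are directly comparable.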
Fig. 6  Experimental results for different FGSM perturbation strengths and graph-level augmentation ratios
Fig. 7  Experimental results for different weighted-pooling weights
1 HU Y, TANG Y, HUANG H, et al. A graph isomorphism network with weighted multiple aggregators for speech emotion recognition [C]// Interspeech 2022. Incheon: ISCA, 2022: 4705−4709.
2 SUN Ying, HU Yanxiang, ZHANG Xueying, et al. Prediction of emotional dimensions PAD for emotional speech recognition [J]. Journal of Zhejiang University: Engineering Science, 2019, 53(10): 2041−2048
3 SUN Zhi, WANG Guan. CNN-GRU speech emotion recognition algorithm for self-supervised comparative learning [J]. Journal of Xidian University, 2024, 51(6): 182−193
doi: 10.19665/j.issn1001-2400.20241109
4 PENTARI A, KAFENTZIS G, TSIKNAKIS M Speech emotion recognition via graph-based representations[J]. Scientific Reports, 2024, 14: 4484
doi: 10.1038/s41598-024-52989-2
5 ABDELHAMID A A, EL-KENAWY E M, ALOTAIBI B, et al Robust speech emotion recognition using CNN+LSTM based on stochastic fractal search optimization algorithm[J]. IEEE Access, 2022, 10: 49265- 49284
doi: 10.1109/ACCESS.2022.3172954
6 ZHU Z, DAI W, HU Y, et al Speech emotion recognition model based on Bi-GRU and focal loss[J]. Pattern Recognition Letters, 2020, 140: 358- 365
doi: 10.1016/j.patrec.2020.11.009
7 LI M, YANG B, LEVY J, et al. Contrastive unsupervised learning for speech emotion recognition [C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6329−6333.
8 GERCZUK M, AMIRIPARIAN S, OTTL S, et al EmoNet: a transfer learning framework for multi-corpus speech emotion recognition[J]. IEEE Transactions on Affective Computing, 2023, 14 (2): 1472- 1487
doi: 10.1109/TAFFC.2021.3135152
9 XU X, DENG J, COUTINHO E, et al Connecting subspace learning and extreme learning machine in speech emotion recognition[J]. IEEE Transactions on Multimedia, 2019, 21 (3): 795- 808
doi: 10.1109/TMM.2018.2865834
10 PENTARI A, KAFENTZIS G, TSIKNAKIS M. Investigating graph-based features for speech emotion recognition [C]// IEEE-EMBS International Conference on Biomedical and Health Informatics. Ioannina: IEEE, 2022: 1–5.
11 MELO D F P, FADIGAS I S, PEREIRA H B B Graph-based feature extraction: a new proposal to study the classification of music signals outside the time-frequency domain[J]. PLoS One, 2020, 15 (11): e0240915
doi: 10.1371/journal.pone.0240915
12 SHIRIAN A, GUHA T. Compact graph architecture for speech emotion recognition [C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6284–6288.
13 KIM J, KIM J. Representation learning with graph neural networks for speech emotion recognition [EB/OL]. (2022–01–26) [2025–06–10]. https://arxiv.org/abs/2208.09830.
14 GHAYEKHLOO M, NICKABADI A Supervised contrastive learning for graph representation enhancement[J]. Neurocomputing, 2024, 588: 127710
doi: 10.1016/j.neucom.2024.127710
15 YOU Y N, CHEN T L, SUI Y D, et al. Graph contrastive learning with augmentations [C]// 34th International Conference on Neural Information Processing Systems. Vancouver: NeurIPS, 2020: 5812−5823.
16 SHIRIAN A, SOMANDEPALLI K, GUHA T Self-supervised graphs for audio representation learning with limited labeled data[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16 (6): 1391- 1401
doi: 10.1109/JSTSP.2022.3190083
17 ESKIMEZ S E, DUAN Z, HEINZELMAN W. Unsupervised learning approach to feature analysis for automatic speech emotion recognition [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 5099–5103.
18 KANG H, XU Y, JIN G, et al FCAN: speech emotion recognition network based on focused contrastive learning[J]. Biomedical Signal Processing and Control, 2024, 96: 106545
doi: 10.1016/j.bspc.2024.106545
19 SONG X, HUANG L, XUE H, et al. Supervised prototypical contrastive learning for emotion recognition in conversation [C]// Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, Stroudsburg: ACL, 2022: 5197−5206.
20 LI Y, WANG Y, YANG X, et al Speech emotion recognition based on Graph-LSTM neural network[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2023, (1): 40
doi: 10.1186/s13636-023-00303-9
21 XU Y, WANG J, GUANG M, et al Graph contrastive learning with Min-max mutual information[J]. Information Sciences, 2024, 665: 120378
doi: 10.1016/j.ins.2024.120378
22 WONG E, RICE L, KOLTER J. Fast is better than free: revisiting adversarial training [C]// 8th International Conference on Learning Representations. [S. l. ]: ICLR, 2020.
23 XU K, HU W H, LESKOVEC J, et al. How powerful are graph neural networks? [C]// 7th International Conference on Learning Representations. New Orleans: ICLR, 2019.
24 KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks [C]// 5th International Conference on Learning Representation. Toulon: ICLR, 2017.
25 BUSSO C, BULUT M, LEE C C, et al IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42 (4): 335- 359
doi: 10.1007/s10579-008-9076-6
26 BURKHARDT F, PAESCHKE A, ROLFES M, et al. A database of German emotional speech [C]// Interspeech 2005. Lisbon: ISCA, 2005: 1517−1520.
27 SCHULLER B, STEIDL S, BATLINER A, et al. The INTERSPEECH 2010 paralinguistic challenge [C]// Interspeech 2010. Chiba: ISCA, 2010: 2794−2797.
28 PANDEY S K, SHEKHAWAT H S, PRASANNA S R M Attention gated tensor neural network architectures for speech emotion recognition[J]. Biomedical Signal Processing and Control, 2022, 71: 103173
doi: 10.1016/j.bspc.2021.103173
29 LIU J, WANG H. Graph isomorphism network for speech emotion recognition [C]// Interspeech 2021. Brno: ISCA, 2021: 3405−3409.
30 ULGEN I R, DU Z, BUSSO C, et al. Revealing emotional clusters in speaker embeddings: a contrastive learning strategy for speech emotion recognition [C]// 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul: IEEE, 2024: 12081–12085.
31 GUO L, DING S, WANG L, et al DSTCNet: deep spectro-temporal-channel attention network for speech emotion recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36 (1): 188- 197
doi: 10.1109/TNNLS.2023.3304516
32 GUO L, LI J, DING S, et al APIN: amplitude- and phase-aware interaction network for speech emotion recognition[J]. Speech Communication, 2025, 169: 103201
doi: 10.1016/j.specom.2025.103201
33 CHEN Z, LI J, LIU H, et al Learning multi-scale features for speech emotion recognition with connection attention mechanism[J]. Expert Systems with Applications, 2023, 214: 118943
doi: 10.1016/j.eswa.2022.118943