Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (4): 782-790    DOI: 10.3785/j.issn.1008-973X.2026.04.010
    
Speech emotion recognition with unsupervised graph contrastive learning
Xuemei ZHANG, Ying SUN*, Xueying ZHANG
College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China

Abstract  

A speech emotion recognition network based on unsupervised graph contrastive learning (SERUGCL) was proposed to address the sparsity of labeled data in most speech datasets and the difficulty of modeling high-dimensional speech features. The method was trained on unlabeled data. First, an original view of the speech features was constructed from feature similarity, and the graph structure was used to model dependencies between speech frames, alleviating the computational burden of modeling high-dimensional features directly. Two augmented views were then generated by combining the fast gradient sign method (FGSM) with subgraph sampling and edge perturbation. All views were processed by differentiated encoders, and a weighted pooling mechanism was adopted to obtain the global embedding. Finally, a support vector machine (SVM) was used for emotion classification. SERUGCL achieved an unweighted accuracy (UA) of 69.96% and a weighted accuracy (WA) of 70.24% on the IEMOCAP dataset, and a UA of 91.04% and a WA of 90.29% on the EMO-DB dataset. Compared with DSTCNet, the UA and WA of SERUGCL improved by 8.18 and 8.44 percentage points on IEMOCAP and by 4.49 and 1.50 percentage points on EMO-DB. Comparative and ablation experiments further verified the effectiveness of the model.
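The frame-level graph construction described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the cosine measure, the threshold `tau`, and the toy frame features are all assumptions introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two frame feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def build_frame_graph(frames, tau=0.8):
    """Treat each speech frame as a node and connect two frames
    when their feature similarity exceeds tau (illustrative value)."""
    n = len(frames)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if cosine(frames[i], frames[j]) > tau]

# Toy frame features: frames 0 and 1 point in nearly the same direction.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(build_frame_graph(frames))  # → [(0, 1)]
```

Connecting frames by similarity rather than feeding the full high-dimensional feature matrix into the encoder is what lets the graph structure carry the inter-frame dependencies.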



Key words: speech emotion recognition; unsupervised learning; graph contrastive learning; feature augmentation; weighted pooling
Received: 14 May 2025      Published: 19 March 2026
CLC:  TP 391.4  
Corresponding Authors: Ying SUN     E-mail: 2929164474@qq.com;tyutsy@163.com
Cite this article:

Xuemei ZHANG, Ying SUN, Xueying ZHANG. Speech emotion recognition with unsupervised graph contrastive learning. Journal of ZheJiang University (Engineering Science), 2026, 60(4): 782-790.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.04.010     OR     https://www.zjujournals.com/eng/Y2026/V60/I4/782


Fig.1 Framework of SERUGCL
Fig.2 Speech graph construction based on feature similarity
FGSM algorithm $(G,\epsilon)$
Input: graph $G=(V,E,\boldsymbol{X})$, perturbation strength $\epsilon$
Output: augmented graph $T_{\mathrm{t}}(G)=(V,E,\boldsymbol{X}')$
1: Use the encoder $f(\cdot)$ to generate graph embeddings $\mathbf{gx}$ and $\mathbf{gx}_1$
2: Compute the loss $\mathcal{L}(\mathbf{gx},\mathbf{gx}_1)$ between $\mathbf{gx}$ and $\mathbf{gx}_1$
3: Backpropagate to obtain the gradient of the loss with respect to $\boldsymbol{X}$: $\nabla_{\boldsymbol{X}}\mathcal{L}(\mathbf{gx},\mathbf{gx}_1)$
4: Generate the adversarial feature matrix $\boldsymbol{X}'=\boldsymbol{X}+\epsilon\cdot\mathrm{sign}\,(\nabla_{\boldsymbol{X}}\mathcal{L}(\mathbf{gx},\mathbf{gx}_1))$
5: Construct the augmented graph $T_{\mathrm{t}}(G)$ with the topology unchanged
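Step 4 of the algorithm can be sketched as below. The gradient matrix is assumed to have already been produced by backpropagation (step 3); only the sign-based perturbation itself is shown, and the example values are illustrative.

```python
def fgsm_perturb(X, grad, eps=0.01):
    """X' = X + eps * sign(dL/dX): one-step adversarial perturbation
    of the node-feature matrix. X and grad share the same shape;
    the graph topology (V, E) is left untouched."""
    sign = lambda g: (g > 0) - (g < 0)  # -1, 0, or +1
    return [[x + eps * sign(g) for x, g in zip(row, grow)]
            for row, grow in zip(X, grad)]

X = [[1.0, 2.0]]
grad = [[0.5, -0.3]]  # assumed to come from backpropagation
print(fgsm_perturb(X, grad, eps=0.1))  # → [[1.1, 1.9]]
```

Because only the sign of the gradient is used, each feature moves by exactly $\pm\epsilon$, which keeps the perturbation bounded regardless of gradient magnitude.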
 
Fig.3 Enhancement method combining subgraph sampling and edge perturbation
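The combined augmentation of Fig.3 can be sketched as follows. The sampling ratio, edge-drop ratio, and fixed seed are illustrative assumptions, not the paper's settings.

```python
import random

def subgraph_edge_perturb(edges, n_nodes, keep_ratio=0.8,
                          drop_ratio=0.1, seed=0):
    """Sample a node-induced subgraph, then randomly drop a fraction
    of the surviving edges (ratios are illustrative)."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(n_nodes), max(1, int(n_nodes * keep_ratio))))
    # keep only edges whose both endpoints survived the node sampling
    sub = [(u, v) for u, v in edges if u in kept and v in kept]
    # edge perturbation: drop each remaining edge with prob. drop_ratio
    return [e for e in sub if rng.random() >= drop_ratio]

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(subgraph_edge_perturb(edges, n_nodes=4))
```

The two operations are complementary: subgraph sampling perturbs which frames are visible, while edge perturbation rewires the dependencies among the frames that remain.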
Fig.4 Weighted pooling mechanism
Model   | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/%
Model 1 | 60.86        | 61.23        | 88.16       | 87.01
Model 2 | 63.42        | 63.49        | 89.10       | 87.69
Model 3 | 63.94        | 64.04        | 89.45       | 88.48
Model 4 | 68.84        | 69.05        | 90.50       | 89.68
SERUGCL | 69.96        | 70.24        | 91.04       | 90.29
Tab.1 Performance comparison of models under different augmentation strategies and pooling methods
Model        | Year | UA/%  | WA/%
GA-GRU[28]   | 2020 | 63.80 | 62.27
LSTM-GIN[29] | 2021 | 65.53 | 64.65
CoGCN[13]    | 2022 | 63.67 | 62.64
GLNN[20]     | 2023 | 68.65 | 68.11
MTL[30]      | 2024 | 69.16 | —
DSTCNet[31]  | 2025 | 61.78 | 61.80
APIN[32]     | 2025 | 60.35 | 60.80
SERUGCL      | 2025 | 69.96 | 70.24
Tab.2 Comparison with other speech emotion recognition models in IEMOCAP dataset
Model              | Year | UA/%  | WA/%
AMSNet[33]         | 2023 | 88.56 | 88.34
SER-Graph[4]       | 2024 | 77.80 | —
DSTCNet-BLSTM[31]  | 2025 | 84.72 | 85.98
DSTCNet[31]        | 2025 | 86.55 | 88.79
APIN[32]           | 2025 | 86.00 | 87.85
CL[34]             | 2025 | 89.60 | —
SERUGCL            | 2025 | 91.04 | 90.29
Tab.3 Comparison with other speech emotion recognition models in EMO-DB dataset
Fig.5 Confusion matrix of SERUGCL model
Method  | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/%
S-F     | 69.19        | 69.42        | 90.53       | 89.48
S-SE    | 65.57        | 65.93        | 89.05       | 87.70
SERUGCL | 69.96        | 70.24        | 91.04       | 90.29
Tab.4 Results of enhanced module ablation experiment
Method  | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/%
mean    | 67.15        | 67.41        | 90.61       | 89.77
max     | 66.89        | 67.08        | 89.46       | 87.63
sum     | 68.84        | 69.05        | 90.50       | 89.68
SERUGCL | 69.96        | 70.24        | 91.04       | 90.29
Tab.5 Results of pooling module ablation experiment
Method  | IEMOCAP UA/% | IEMOCAP WA/% | EMO-DB UA/% | EMO-DB WA/%
GIN     | 68.11        | 68.24        | 89.45       | 88.43
GCN     | 68.95        | 69.12        | 90.23       | 89.33
GCN-GIN | 68.54        | 68.81        | 89.90       | 89.02
SERUGCL | 69.96        | 70.24        | 91.04       | 90.29
Tab.6 Results of encoder module ablation experiment
Pooling name | max  | mean | soft
A            | 0.50 | 0.25 | 0.25
B            | 0.25 | 0.50 | 0.25
C            | 0.25 | 0.25 | 0.50
D            | 0.60 | 0.20 | 0.20
E            | 0.20 | 0.60 | 0.20
F            | 0.20 | 0.20 | 0.60
G            | 0.30 | 0.30 | 0.30
Tab.7 Weights of each pooling method corresponding to pooling name
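The weighted pooling combinations in Tab.7 can be sketched as a per-dimension blend of three graph read-outs. The weights below correspond to scheme D (0.60/0.20/0.20); the softmax-weighted definition of soft pooling is an assumption introduced for this sketch, not taken from the paper.

```python
import math

def weighted_pool(node_embs, w_max=0.60, w_mean=0.20, w_soft=0.20):
    """Blend max, mean, and soft pooling over node embeddings,
    dimension by dimension (scheme D weights from Tab.7)."""
    out = []
    for d in range(len(node_embs[0])):
        col = [e[d] for e in node_embs]
        mx = max(col)
        mn = sum(col) / len(col)
        # soft pooling: softmax-weighted average (assumed definition)
        exps = [math.exp(c - mx) for c in col]
        soft = sum(c * w for c, w in zip(col, exps)) / sum(exps)
        out.append(w_max * mx + w_mean * mn + w_soft * soft)
    return out

# Two nodes with 1-D embeddings: the blend lies between mean and max.
print(weighted_pool([[1.0], [3.0]]))
```

Blending the three read-outs lets the global embedding keep both salient peaks (max) and overall level (mean), with soft pooling interpolating between the two.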
Fig.6 Results of FGSM with different perturbation intensities and different proportions of graph-level augmentations
Fig.7 Results of different weighted pooling weights
[1]   HU Y, TANG Y, HUANG H, et al. A graph isomorphism network with weighted multiple aggregators for speech emotion recognition [C]// Interspeech 2022. Incheon: ISCA, 2022: 4705−4709.
[2]   SUN Ying, HU Yanxiang, ZHANG Xueying, et al Prediction of emotional dimensions PAD for emotional speech recognition[J]. Journal of Zhejiang University: Engineering Science, 2019, 53 (10): 2041- 2048
[3]   SUN Zhi, WANG Guan CNN-GRU speech emotion recognition algorithm for self-supervised contrastive learning[J]. Journal of Xidian University, 2024, 51 (6): 182- 193
doi: 10.19665/j.issn1001-2400.20241109
[4]   PENTARI A, KAFENTZIS G, TSIKNAKIS M Speech emotion recognition via graph-based representations[J]. Scientific Reports, 2024, 14: 4484
doi: 10.1038/s41598-024-52989-2
[5]   ABDELHAMID A A, EL-KENAWY E M, ALOTAIBI B, et al Robust speech emotion recognition using CNN+LSTM based on stochastic fractal search optimization algorithm[J]. IEEE Access, 2022, 10: 49265- 49284
doi: 10.1109/ACCESS.2022.3172954
[6]   ZHU Z, DAI W, HU Y, et al Speech emotion recognition model based on Bi-GRU and focal loss[J]. Pattern Recognition Letters, 2020, 140: 358- 365
doi: 10.1016/j.patrec.2020.11.009
[7]   LI M, YANG B, LEVY J, et al. Contrastive unsupervised learning for speech emotion recognition [C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6329−6333.
[8]   GERCZUK M, AMIRIPARIAN S, OTTL S, et al EmoNet: a transfer learning framework for multi-corpus speech emotion recognition[J]. IEEE Transactions on Affective Computing, 2023, 14 (2): 1472- 1487
doi: 10.1109/TAFFC.2021.3135152
[9]   XU X, DENG J, COUTINHO E, et al Connecting subspace learning and extreme learning machine in speech emotion recognition[J]. IEEE Transactions on Multimedia, 2019, 21 (3): 795- 808
doi: 10.1109/TMM.2018.2865834
[10]   PENTARI A, KAFENTZIS G, TSIKNAKIS M. Investigating graph-based features for speech emotion recognition [C]// IEEE-EMBS International Conference on Biomedical and Health Informatics. Ioannina: IEEE, 2022: 1–5.
[11]   MELO D F P, FADIGAS I S, PEREIRA H B B Graph-based feature extraction: a new proposal to study the classification of music signals outside the time-frequency domain[J]. PLoS One, 2020, 15 (11): e0240915
doi: 10.1371/journal.pone.0240915
[12]   SHIRIAN A, GUHA T. Compact graph architecture for speech emotion recognition [C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6284–6288.
[13]   KIM J, KIM J. Representation learning with graph neural networks for speech emotion recognition [EB/OL]. (2022–01–26) [2025–06–10]. https://arxiv.org/abs/2208.09830.
[14]   GHAYEKHLOO M, NICKABADI A Supervised contrastive learning for graph representation enhancement[J]. Neurocomputing, 2024, 588: 127710
doi: 10.1016/j.neucom.2024.127710
[15]   YOU Y N, CHEN T L, SUI Y D, et al. Graph contrastive learning with augmentations [C]// 34th International Conference on Neural Information Processing Systems. Vancouver: NeurIPS, 2020: 5812−5823.
[16]   SHIRIAN A, SOMANDEPALLI K, GUHA T Self-supervised graphs for audio representation learning with limited labeled data[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16 (6): 1391- 1401
doi: 10.1109/JSTSP.2022.3190083
[17]   ESKIMEZ S E, DUAN Z, HEINZELMAN W. Unsupervised learning approach to feature analysis for automatic speech emotion recognition [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 5099–5103.
[18]   KANG H, XU Y, JIN G, et al FCAN: speech emotion recognition network based on focused contrastive learning[J]. Biomedical Signal Processing and Control, 2024, 96: 106545
doi: 10.1016/j.bspc.2024.106545
[19]   SONG X, HUANG L, XUE H, et al. Supervised prototypical contrastive learning for emotion recognition in conversation [C]// Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, Stroudsburg: ACL, 2022: 5197−5206.
[20]   LI Y, WANG Y, YANG X, et al Speech emotion recognition based on Graph-LSTM neural network[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2023, (1): 40
doi: 10.1186/s13636-023-00303-9
[21]   XU Y, WANG J, GUANG M, et al Graph contrastive learning with Min-max mutual information[J]. Information Sciences, 2024, 665: 120378
doi: 10.1016/j.ins.2024.120378
[22]   WONG E, RICE L, KOLTER J. Fast is better than free: revisiting adversarial training [C]// 8th International Conference on Learning Representations. [S. l. ]: ICLR, 2020.
[23]   XU K, HU W H, LESKOVEC J, et al. How powerful are graph neural networks? [C]// 7th International Conference on Learning Representations. New Orleans: ICLR, 2019.
[24]   KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks [C]// 5th International Conference on Learning Representation. Toulon: ICLR, 2017.
[25]   BUSSO C, BULUT M, LEE C C, et al IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42 (4): 335- 359
doi: 10.1007/s10579-008-9076-6
[26]   BURKHARDT F, PAESCHKE A, ROLFES M, et al. A database of German emotional speech [C]// Interspeech 2005. Lisbon: ISCA, 2005: 1517−1520.
[27]   SCHULLER B, STEIDL S, BATLINER A, et al. The INTERSPEECH 2010 paralinguistic challenge [C]// Interspeech 2010. Chiba: ISCA, 2010: 2794−2797.
[28]   PANDEY S K, SHEKHAWAT H S, PRASANNA S R M Attention gated tensor neural network architectures for speech emotion recognition[J]. Biomedical Signal Processing and Control, 2022, 71: 103173
doi: 10.1016/j.bspc.2021.103173
[29]   LIU J, WANG H. Graph isomorphism network for speech emotion recognition [C]// Interspeech 2021. Brno: ISCA, 2021: 3405−3409.
[30]   ULGEN I R, DU Z, BUSSO C, et al. Revealing emotional clusters in speaker embeddings: a contrastive learning strategy for speech emotion recognition [C]// 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul: IEEE, 2024: 12081–12085.
[31]   GUO L, DING S, WANG L, et al DSTCNet: deep spectro-temporal-channel attention network for speech emotion recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36 (1): 188- 197
doi: 10.1109/TNNLS.2023.3304516
[32]   GUO L, LI J, DING S, et al APIN: amplitude- and phase-aware interaction network for speech emotion recognition[J]. Speech Communication, 2025, 169: 103201
doi: 10.1016/j.specom.2025.103201
[33]   CHEN Z, LI J, LIU H, et al Learning multi-scale features for speech emotion recognition with connection attention mechanism[J]. Expert Systems with Applications, 2023, 214: 118943
doi: 10.1016/j.eswa.2022.118943