Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (5): 1092-1099    DOI: 10.3785/j.issn.1008-973X.2026.05.018
    
Emotional speech synthesis approach via feature mapping model
Jie LUO, Jian YANG*
School of Information Science and Engineering, Yunnan University, Kunming 650504, China

Abstract  

A new emotional speech synthesis method was proposed to address the coarse emotional expression and degraded speech quality caused by text variations in existing methods. Principal component analysis and linear discriminant analysis were applied to filter and classify audio features, reducing the negative impact of redundant information on speech quality. A cross-attention mechanism and a mixture-of-experts (MoE) structure were introduced to learn the mapping among text features, emotion features, and integrated features, so that emotional features could be generated adaptively according to the learned mapping patterns, improving the adaptability to text variations. In addition, a U-Net structure with improved residual connections was applied to convert concrete features into abstract features, reducing the information loss arising during processing. With VITS as the baseline model, an emotional speech synthesis system was built using the proposed method, and the effectiveness of the method was validated through evaluation experiments. Results showed that the emotional speech synthesis quality and text adaptability of the improved model were superior to those of the comparison models, and the proposed method can effectively enhance the emotional speech synthesis ability and text generalization ability of the model.
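The abstract describes the feature mapping idea only at a high level. As a rough, non-authoritative illustration of combining cross-attention with a small mixture-of-experts layer to map text and emotion features into integrated features, a minimal PyTorch sketch is given below; the module name, dimensions, expert count, and gating scheme are illustrative assumptions, not the authors' actual architecture.

```python
# Illustrative sketch only: cross-attention + mixture-of-experts feature mapping.
# Dimensions, expert count, and gating are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionMoEMapper(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_experts=4):
        super().__init__()
        # Text features act as queries; emotion features supply keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Each expert is a small feed-forward network; a gate mixes their outputs.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, text_feat, emo_feat):
        # text_feat: (B, T_text, dim), emo_feat: (B, T_emo, dim)
        attended, _ = self.cross_attn(text_feat, emo_feat, emo_feat)
        h = self.norm(text_feat + attended)            # residual connection + norm
        weights = F.softmax(self.gate(h), dim=-1)      # (B, T_text, n_experts)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-2)
        # Weighted sum of expert outputs -> emotion-aware "integrated" features.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)


if __name__ == "__main__":
    mapper = CrossAttentionMoEMapper()
    text = torch.randn(2, 50, 256)      # dummy text/phoneme encoder output
    emotion = torch.randn(2, 10, 256)   # dummy emotion-embedding sequence
    print(mapper(text, emotion).shape)  # torch.Size([2, 50, 256])
```

In such a design, the gate lets different experts specialize in different emotion/text combinations, which is one plausible way to realize the adaptive feature generation described above.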



Key words: emotional speech synthesis; VITS; adaptive feature generation; feature transformation; feature pre-extraction
Received: 09 June 2025      Published: 06 May 2026
CLC:  TP 393  
Fund: National Key Research and Development Program of China (2020AAA0107901); National Natural Science Foundation of China (61961043).
Corresponding Authors: Jian YANG     E-mail: luojie_lc5f@stu.ynu.edu.cn;jianyang@ynu.edu.cn
Cite this article:

Jie LUO, Jian YANG. Emotional speech synthesis approach via feature mapping model. Journal of ZheJiang University (Engineering Science), 2026, 60(5): 1092-1099.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.05.018     OR     https://www.zjujournals.com/eng/Y2026/V60/I5/1092


Fig.1 VITS training flow
Fig.2 VITS inference flow
Fig.3 Comparison of linear discriminant analysis results before and after speaker classification
Fig.4 Comprehensive feature processing flow of improved model
Fig.5 Feature mapping module
Fig.6 Feature transformation module
Fig.7 Improved VITS
Fig.8 Emotion recognition classifier
Model | Speakers | MOS↑ | SMOS↑ | EMOS↑ | RTF↓ | WER↓/% | ECA↑/% | EECS↑/%
Ground-truth audio | — | 4.51 | 4.44 | — | — | 7.52 | 91.6 | —
VITS | single speaker | 4.02 | 4.17 | 3.92 | 0.0618 | 19.55 | 80.0 | 98.30
VITS2 | single speaker | 4.07 | 4.28 | 4.05 | 0.0601 | 19.89 | 77.3 | 99.18
AR_VITS | single speaker | 4.11 | 4.26 | 4.15 | 0.1196 | 17.77 | 82.0 | 99.33
FM_VITS | single speaker | 4.18 | 4.26 | 4.11 | 0.0649 | 15.20 | 82.3 | 99.52
VITS | multiple speakers | 4.05 | 4.22 | 3.93 | 0.0622 | 16.54 | 75.6 | 98.14
VITS2 | multiple speakers | 4.16 | 4.36 | 3.98 | 0.0606 | 16.88 | 72.0 | 99.14
AR_VITS | multiple speakers | 4.14 | 4.28 | 4.12 | 0.1179 | 18.80 | 77.7 | 99.27
FM_VITS | multiple speakers | 4.29 | 4.32 | 4.07 | 0.0604 | 10.52 | 78.9 | 99.33
Tab.1 Subjective and objective evaluation results of different speech synthesis models
Model | WER↓/% (Neutral) | WER↓/% (Happy) | WER↓/% (Sad) | WER↓/% (Surprise) | WER↓/% (Angry)
VITS | 16.1 | 19.2 | 15.9 | 14.9 | 14.8
VITS2 | 16.3 | 15.0 | 13.2 | 20.9 | 17.8
AR_VITS | 22.8 | 19.4 | 18.3 | 24.9 | 9.3
FM_VITS | 12.1 | 11.1 | 8.9 | 12.9 | 8.3
Tab.2 Evaluation results of emotional speech quality for different speech synthesis models
Model | MOS↑ | EMOS↑ | WER↓/%
VITS | 4.11 | 4.07 | 13.57
VITS2 | 4.15 | 4.14 | 13.29
AR_VITS | 3.24 | 3.67 | 46.70
FM_VITS | 4.24 | 4.18 | 10.63
Tab.3 Speech quality evaluation results of different speech synthesis models
Preference proportion/%
Comparison | FM_VITS | No preference | Baseline
FM_VITS vs VITS | 44.3 | 31.4 | 24.3
FM_VITS vs VITS2 | 35.7 | 33.6 | 30.7
Tab.4 Test results of speech quality preference of different speech synthesis models
Ablated module | WER↓/% | ECA↑/% | EECS↑/%
Remove MP | +9.78 | −8.30 | −0.36
Remove U-Net | +9.51 | −4.90 | −0.37
Remove FL | +2.26 | −47.6 | −1.53
Tab.5 Ablation study results of improved VITS modules