Emotional speech synthesis approach via feature mapping model
Jie LUO, Jian YANG

Table 1 Subjective and objective evaluation results of different speech synthesis models

Model     Speakers   MOS↑   SMOS↑   EMOS↑   RTF↓     WER↓/%   ECA↑/%   EECS↑/%
VITS      Single     4.02   4.17    3.92    0.0618   19.55    80.0     98.30
VITS2     Single     4.07   4.28    4.05    0.0601   19.89    77.3     99.18
AR_VITS   Single     4.11   4.26    4.15    0.1196   17.77    82.0     99.33
FM_VITS   Single     4.18   4.26    4.11    0.0649   15.20    82.3     99.52
VITS      Multiple   4.05   4.22    3.93    0.0622   16.54    75.6     98.14
VITS2     Multiple   4.16   4.36    3.98    0.0606   16.88    72.0     99.14
AR_VITS   Multiple   4.14   4.28    4.12    0.1179   18.80    77.7     99.27
FM_VITS   Multiple   4.29   4.32    4.07    0.0604   10.52    78.9     99.33

Ground truth (real audio): 4.51, 4.44, 7.52, 91.6. The source gives only these four values for this row, and which columns they belong to cannot be recovered from it.