Emotional speech synthesis approach via feature mapping model

Jie LUO (罗杰), Jian YANG (杨鉴)

Table 1. Subjective and objective evaluation results of different speech synthesis models

| Model | Number of speakers | MOS↑ | SMOS↑ | EMOS↑ | RTF↓ | WER↓ (%) | ECA↑ (%) | EECS↑ (%) |
|---|---|---|---|---|---|---|---|---|
| Ground-truth audio | - | 4.51 | - | 4.44 | - | 7.52 | 91.6 | - |
| VITS | Single | 4.02 | 4.17 | 3.92 | 0.0618 | 19.55 | 80.0 | 98.30 |
| VITS2 | Single | 4.07 | 4.28 | 4.05 | 0.0601 | 19.89 | 77.3 | 99.18 |
| AR_VITS | Single | 4.11 | 4.26 | 4.15 | 0.1196 | 17.77 | 82.0 | 99.33 |
| FM_VITS | Single | 4.18 | 4.26 | 4.11 | 0.0649 | 15.20 | 82.3 | 99.52 |
| VITS | Multiple | 4.05 | 4.22 | 3.93 | 0.0622 | 16.54 | 75.6 | 98.14 |
| VITS2 | Multiple | 4.16 | 4.36 | 3.98 | 0.0606 | 16.88 | 72.0 | 99.14 |
| AR_VITS | Multiple | 4.14 | 4.28 | 4.12 | 0.1179 | 18.80 | 77.7 | 99.27 |
| FM_VITS | Multiple | 4.29 | 4.32 | 4.07 | 0.0604 | 10.52 | 78.9 | 99.33 |