Low-resource speech synthesis using a ConvNeXt decoder and fundamental frequency prediction
Meng WANG, Jian YANG

Tab. 1 Performance evaluation results of different speech synthesis models on three language datasets
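| Language | Model | MOS (↑) | MCD (↓) | RMSE (↓) | PESQ (↑) |
|---|---|---|---|---|---|
| Burmese | Real audio | 4.46 | 0.00 | | 4.50 |
| Burmese | FastSpeech2 | 3.09 | 9.48 | 0.31 | 2.45 |
| Burmese | Glow-TTS | 3.04 | 9.25 | 0.25 | 2.50 |
| Burmese | VITS | 3.10 | 9.34 | 0.30 | 2.52 |
| Burmese | Improved model | 3.44 | 7.99 | 0.21 | 2.84 |
| Vietnamese | Real audio | 4.79 | 0.00 | | 4.50 |
| Vietnamese | FastSpeech2 | 3.31 | 6.07 | 0.29 | 2.62 |
| Vietnamese | Glow-TTS | 3.15 | 5.81 | 0.26 | 2.71 |
| Vietnamese | VITS | 3.19 | 5.29 | 0.28 | 2.69 |
| Vietnamese | Improved model | 4.45 | 4.96 | 0.22 | 2.87 |
| Thai | Real audio | 4.65 | 0.00 | | 4.50 |
| Thai | FastSpeech2 | 3.07 | 5.78 | 0.24 | 2.56 |
| Thai | Glow-TTS | 3.21 | 5.24 | 0.21 | 2.62 |
| Thai | VITS | 3.01 | 5.66 | 0.23 | 2.60 |
| Thai | Improved model | 4.10 | 4.66 | 0.17 | 2.89 |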
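
To make the size of the gains in Tab. 1 easier to read, the following minimal Python sketch computes, for each language, how much lower the improved model's MCD is than that of the best baseline. The MCD values are copied from the table; the dictionary layout and the helper name `relative_mcd_reduction` are illustrative assumptions and not part of the original work.

```python
# MCD values (lower is better) copied from Tab. 1.
# The data layout and helper name below are illustrative assumptions.
mcd = {
    "Burmese": {"FastSpeech2": 9.48, "Glow-TTS": 9.25, "VITS": 9.34, "Improved model": 7.99},
    "Vietnamese": {"FastSpeech2": 6.07, "Glow-TTS": 5.81, "VITS": 5.29, "Improved model": 4.96},
    "Thai": {"FastSpeech2": 5.78, "Glow-TTS": 5.24, "VITS": 5.66, "Improved model": 4.66},
}


def relative_mcd_reduction(scores, proposed="Improved model"):
    """Return (best baseline name, fractional MCD reduction of `proposed` against it)."""
    baselines = {name: value for name, value in scores.items() if name != proposed}
    best = min(baselines, key=baselines.get)  # baseline with the lowest MCD
    return best, (baselines[best] - scores[proposed]) / baselines[best]


for language, scores in mcd.items():
    best, reduction = relative_mcd_reduction(scores)
    print(f"{language}: {reduction:.1%} lower MCD than {best}")
```

Comparing against the strongest baseline per language, rather than a fixed one, is a conservative choice: it reports the smallest relative gain the table supports.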