Journal of Zhejiang University (Engineering Science)  2019, Vol. 53, Issue (12): 2372-2380    DOI: 10.3785/j.issn.1008-973X.2019.12.015
Computer Science and Artificial Intelligence
Improved deep belief network and its application in voice conversion
Wen-hao WANG, Xiao ZHANG, Yong-jing WAN*
School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
Full text: PDF (1364 KB)   HTML
Abstract:

An improved voice conversion method based on a deep belief network (DBN) was proposed, comprehensively considering the relationship between speech frames and the effect of the post-processing network. The method used a linear predictive analysis-synthesis model to extract the feature parameters of a speaker's linear predictive spectrum, and constructed regional fusion spectral feature parameters to pretrain the DBN. After the fine-tuning stage, an error correction network was introduced to compensate the detailed spectral features of the converted speech. The comparison results show that the spectral distortion of the converted speech decreases as the number of training speech frames increases. Meanwhile, when the number of training speech frames was small, the spectral distortion of the proposed method was less than 50% for cross-gender conversion and less than 60% for same-gender conversion. The experimental results show that the spectral distortion of the proposed method is about 6.5% lower than that of the traditional method, that the improvement is more pronounced for same-gender than for cross-gender conversion, and that the naturalness and intelligibility of the converted speech are significantly improved in both subjective evaluations.

Key words: deep belief network (DBN); voice conversion; regional fusion spectral feature; error correction network; spectral distortion
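The spectral distortion quoted in the abstract compares converted and target spectral envelopes frame by frame. The paper does not give its exact formula here, so the sketch below uses a common log-spectral distortion definition; the function name and the toy spectra are illustrative assumptions, not taken from the source:

```python
import numpy as np

def log_spectral_distortion(conv_spec, target_spec, eps=1e-12):
    """Mean log-spectral distortion (dB) between two sequences of
    magnitude spectra with shape (frames, bins)."""
    conv = np.maximum(np.asarray(conv_spec, dtype=float), eps)
    tgt = np.maximum(np.asarray(target_spec, dtype=float), eps)
    diff = 20.0 * np.log10(conv / tgt)                # per-bin difference in dB
    per_frame = np.sqrt(np.mean(diff ** 2, axis=1))   # RMS over frequency bins
    return float(np.mean(per_frame))                  # average over frames

# Identical spectra give zero distortion.
spec = np.abs(np.random.default_rng(0).standard_normal((5, 64))) + 1.0
print(log_spectral_distortion(spec, spec))  # 0.0
```

A lower value means the converted envelope is closer to the target; reporting it relative to a baseline gives the percentage figures used in the abstract.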
Received: 2018-10-10    Published: 2019-12-17
CLC:  TP 391  
Fund program: National Natural Science Foundation of China (61872143)
Corresponding author: Yong-jing WAN    E-mail: 13122386132@163.com; wanyongjing@ecust.edu.cn
About the author: Wen-hao WANG (1994—), male, master's student, engaged in research on speech signal processing and pattern recognition. orcid.org/0000-0002-3199-2618. E-mail: 13122386132@163.com

Cite this article:

Wen-hao WANG, Xiao ZHANG, Yong-jing WAN. Improved deep belief network and its application in voice conversion. Journal of Zhejiang University (Engineering Science), 2019, 53(12): 2372-2380.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2019.12.015
http://www.zjujournals.com/eng/CN/Y2019/V53/I12/2372

Figure 1  Flowchart of voice conversion in the training and conversion stages
Figure 2  Graphical models of the restricted Boltzmann machine and the deep belief network
Figure 3  Sigmoid function curves with parameter values 1 and 2
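Figure 3 plots the sigmoid activation for two parameter settings. A minimal sketch of such a parameterized sigmoid is shown below, assuming the parameter scales the input, i.e. 1 / (1 + e^(-a·x)), which is a common form but is not confirmed by the source:

```python
import math

def sigmoid(x, a=1.0):
    """Parameterized sigmoid 1 / (1 + exp(-a * x)); a larger `a`
    steepens the transition around x = 0."""
    return 1.0 / (1.0 + math.exp(-a * x))

# Compare the two parameter settings from the figure caption.
for a in (1, 2):
    print(a, [round(sigmoid(x, a), 3) for x in (-2, 0, 2)])
```

Both curves pass through 0.5 at x = 0; the a = 2 curve saturates faster on either side.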
Figure 4  Training architecture of the traditional deep belief network (DBN) method
Figure 5  Error correction network for compensating the detailed spectral features of converted speech
Figure 6  Flowchart of the voice conversion algorithm based on the improved DBN
Figure 7  Comparison of average spectral distortion under different spectral feature conversion models
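The DBN shown in Figures 2 and 4 is pretrained by greedily training stacked restricted Boltzmann machines. A minimal CD-1 (contrastive divergence) update for a binary RBM might look as follows; the layer sizes, learning rate, and variable names are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1):
    """One CD-1 step for a binary RBM on a data batch v0 (batch, visible)."""
    # Positive phase: sample hidden units given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Approximate log-likelihood gradient: data term minus model term.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_vis += lr * np.mean(v0 - p_v1, axis=0)
    b_hid += lr * np.mean(p_h0 - p_h1, axis=0)
    return W, b_vis, b_hid

# Toy run: 8 visible and 4 hidden units on random binary data.
W = 0.01 * rng.standard_normal((8, 4))
b_vis, b_hid = np.zeros(8), np.zeros(4)
data = (rng.random((16, 8)) > 0.5).astype(float)
for _ in range(5):
    W, b_vis, b_hid = cd1_update(data, W, b_vis, b_hid)
```

Each trained RBM's hidden activations feed the next RBM in the stack; the resulting weights initialize the DBN before the supervised fine-tuning stage described in the abstract.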
| Algorithm | Frames/10³ | MOS F-M | MOS F-F | MOS M-F | MOS M-M | ABX/% F-M | ABX/% F-F | ABX/% M-F | ABX/% M-M |
|---|---|---|---|---|---|---|---|---|---|
| DBN | 4 | 2.5 | 2.2 | 2.5 | 2.3 | 86.2 | 74.8 | 86.5 | 76.2 |
| DBN | 10 | 2.7 | 2.4 | 2.7 | 2.5 | 88.4 | 75.2 | 88.9 | 78.4 |
| DBN | 16 | 2.8 | 2.5 | 2.9 | 2.7 | 90.1 | 76.4 | 91.2 | 79.6 |
| NDBN | 4 | 2.5 | 2.4 | 2.6 | 2.4 | 87.5 | 75.3 | 88.1 | 79.1 |
| NDBN | 10 | 2.8 | 2.6 | 2.8 | 2.6 | 91.1 | 76.1 | 91.6 | 80.4 |
| NDBN | 16 | 2.8 | 2.6 | 2.9 | 2.8 | 91.5 | 77.2 | 92.8 | 82.1 |
| EDBN | 4 | 2.6 | 2.3 | 2.8 | 2.3 | 88.4 | 75.1 | 89.1 | 77.8 |
| EDBN | 10 | 2.9 | 2.5 | 2.9 | 2.5 | 92.5 | 75.8 | 92.9 | 79.1 |
| EDBN | 16 | 2.9 | 2.5 | 3.0 | 2.8 | 92.8 | 76.9 | 93.1 | 81.7 |
| NEDBN | 4 | 2.7 | 2.7 | 2.9 | 2.5 | 89.5 | 80.0 | 90.3 | 81.7 |
| NEDBN | 10 | 2.9 | 2.9 | 2.9 | 2.7 | 92.6 | 81.2 | 92.9 | 82.3 |
| NEDBN | 16 | 3.0 | 2.7 | 3.1 | 2.8 | 93.5 | 82.1 | 94.5 | 83.5 |

Table 1  Comparison of MOS and ABX values of four conversion algorithms (F: female, M: male; e.g. F-M denotes female-to-male conversion)
Figure 8  Waveform comparison of the source, target, and converted speech