Computer Science and Artificial Intelligence
Improved deep belief network and its application in voice conversion |
Wen-hao WANG, Xiao ZHANG, Yong-jing WAN*
School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China |
Abstract: An improved voice conversion method based on a deep belief network (DBN) is proposed, which jointly considers the relationship between adjacent speech frames and the effect of a post-processing network. The method uses a linear predictive analysis-synthesis model to extract the feature parameters of a speaker's linear predictive spectrum, and constructs regional fusion spectral feature parameters with which the DBN is pretrained. After fine-tuning, an error correction network is introduced to compensate the detailed spectral features. Comparative experiments show that the spectral distortion of the converted speech tends to decrease as the number of training speech frames increases. Even when the number of training frames is small, the spectral distortion of the proposed method stays below 50% for inter-gender conversion and below 60% for intra-gender conversion. Overall, the spectral distortion of the proposed method is about 6.5% lower than that of the traditional method, and the improvement is more evident for intra-gender conversion than for inter-gender conversion. Two different subjective evaluations confirm that the proposed method significantly improves the naturalness and intelligibility of the converted speech.
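To make the pipeline described above concrete, the following is a minimal NumPy sketch of the pretraining stage only, under stated assumptions: Bernoulli restricted Boltzmann machines (RBMs) trained with one-step contrastive divergence (CD-1) and stacked greedily, with the regional fusion features approximated by simple neighbour-frame stacking. All layer sizes, the learning rate, and the helper names (RBM, pretrain_dbn) are illustrative assumptions, not the authors' implementation.

# Minimal sketch of DBN pretraining for voice conversion (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli RBM trained with one step of contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr = lr

    def train_batch(self, v0):
        # Positive phase: hidden activations given the data.
        p_h0 = sigmoid(v0 @ self.W + self.b_h)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again.
        p_v1 = sigmoid(h0 @ self.W.T + self.b_v)
        p_h1 = sigmoid(p_v1 @ self.W + self.b_h)
        # CD-1 gradient approximation.
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        self.b_v += self.lr * (v0 - p_v1).mean(axis=0)
        self.b_h += self.lr * (p_h0 - p_h1).mean(axis=0)

    def transform(self, v):
        return sigmoid(v @ self.W + self.b_h)

def pretrain_dbn(features, layer_sizes, epochs=10):
    """Greedy layer-wise pretraining: each RBM learns on the previous layer's output."""
    rbms, x = [], features
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.train_batch(x)
        rbms.append(rbm)
        x = rbm.transform(x)
    return rbms

# Toy frames standing in for linear predictive spectral parameters. "Regional
# fusion" is assumed here to mean stacking each frame with its neighbours;
# the paper's exact construction may differ.
frames = rng.random((200, 30))
fused = np.hstack([np.roll(frames, 1, axis=0), frames, np.roll(frames, -1, axis=0)])
dbn = pretrain_dbn(fused, layer_sizes=[64, 32])

In the full method, the stacked RBM weights would initialize a regression network that is fine-tuned by backpropagation to map source spectra to target spectra, after which a separate error correction network is trained on the residual to compensate spectral detail, as the abstract describes; both stages are omitted here for brevity.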
Received: 10 October 2018
Published: 17 December 2019
Corresponding author: Yong-jing WAN
E-mail: 13122386132@163.com; wanyongjing@ecust.edu.cn
Keywords: deep belief network (DBN); voice conversion; regional fusion spectral features; error correction network; spectral distortion
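The percentage figures quoted for spectral distortion are consistent with a norm-ratio definition of the metric. The sketch below shows one common form of that definition, assuming rows are frame-wise spectral envelopes; the abstract does not give the paper's exact formula, so this is an assumption.

# One common normalized spectral-distortion measure (assumed, not the paper's formula).
import numpy as np

def spectral_distortion(converted, target, eps=1e-12):
    """Mean per-frame spectral error norm, normalized by the target norm, in %."""
    num = np.linalg.norm(converted - target, axis=1)
    den = np.linalg.norm(target, axis=1) + eps
    return float(np.mean(num / den) * 100.0)

# Usage with random stand-in spectra (rows are frames):
rng = np.random.default_rng(1)
tgt = rng.random((100, 30))
conv = tgt + 0.1 * rng.normal(size=tgt.shape)
print(f"SD = {spectral_distortion(conv, tgt):.1f}%")

Under a definition of this shape, lower percentages mean the converted spectra lie closer to the target spectra, which matches the reported trend of distortion falling as training frames increase.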