Journal of Zhejiang University (Engineering Science)
Reliable silence model based voice activity detection approach in speaker diarization
YANG Deng-zhou1,2, XU Jia-ming1,2, LIU Jia3, XIA Shan-hong1
1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China; 2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China; 3. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Full text: PDF (1086 KB)   HTML
Abstract:

The performance of traditional voice activity detection (VAD) methods is unstable because the extracted speech segments are contaminated with silence and because frequent frame-to-frame switching produces short speech fragments. An efficient and stable VAD approach for speaker diarization was presented. Exploiting the strong discrimination of a reliable silence model against speech, stable speech and silence segments were obtained by iteratively retraining the speech and silence models until the segmentation remained essentially unchanged. The implementation comprised several components: modeling silence with a Gaussian model and speech with a Gaussian mixture model (GMM), performing uncertainty decoding under a frame-continuity rule to obtain the class of each frame, and removing short, low-energy fragments to complete the detection. Tests were conducted on the rich transcription speaker diarization dataset. Experimental results show that, because the silence model is described more reliably, the method reduces frame-to-frame switching and the misclassification of speech frames as silence, and it shows a clear advantage over the sub-band entropy order statistics filter (SE-OSF) VAD algorithm.
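To make the processing pipeline concrete, the following is a minimal Python sketch of the kind of iterative scheme the abstract describes, under several assumed details that the abstract does not specify: MFCC-like frame features, a single-Gaussian silence model, an 8-component GMM speech model, an energy-quantile bootstrap for the initial labels, and simple median smoothing standing in for the paper's uncertainty decoding with the frame-continuity rule. The helper names (initial_labels, remove_short_segments) and all thresholds are illustrative only, not taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.ndimage import median_filter

def initial_labels(frame_energy, quantile=0.3):
    # Bootstrap labels: the lowest-energy frames are assumed to be silence (0), the rest speech (1).
    return (frame_energy > np.quantile(frame_energy, quantile)).astype(int)

def remove_short_segments(labels, frame_energy, min_len):
    # Post-processing: relabel speech runs that are both short and low-energy as silence.
    out = labels.copy()
    start = None
    for i, v in enumerate(np.append(labels, 0)):
        if v == 1 and start is None:
            start = i
        elif v == 0 and start is not None:
            if i - start < min_len and frame_energy[start:i].mean() < np.median(frame_energy):
                out[start:i] = 0
            start = None
    return out

def iterative_vad(feats, frame_energy, n_iter=10, min_speech_frames=30):
    # feats: (n_frames, n_dims) feature matrix; frame_energy: (n_frames,) per-frame energy.
    labels = initial_labels(frame_energy)
    for _ in range(n_iter):
        # Retrain a single-Gaussian silence model and a GMM speech model on the current split.
        sil = GaussianMixture(n_components=1, covariance_type="diag").fit(feats[labels == 0])
        spk = GaussianMixture(n_components=8, covariance_type="diag").fit(feats[labels == 1])
        # Frame-wise log-likelihood ratio: positive values favour speech.
        llr = spk.score_samples(feats) - sil.score_samples(feats)
        # Median smoothing enforces frame continuity by suppressing single-frame jumps;
        # it is a simple stand-in for the paper's uncertainty decoding step.
        new_labels = median_filter((llr > 0).astype(int), size=11)
        if np.array_equal(new_labels, labels):  # converged to a stable split
            break
        labels = new_labels
    return remove_short_segments(labels, frame_energy, min_speech_frames)

The convergence test mirrors the abstract's criterion of iterating until the speech/silence partition stops changing; the cap on the number of iterations simply guards against oscillation in this sketch.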

Published: 2016-03-31
CLC number: TN 912
Fund program:

National Natural Science Foundation of China (61370034, 61403224).

Corresponding author: LIU Jia, male, professor, doctoral supervisor. ORCID: 0000-0002-7156-380X. E-mail: liuj@tsinghua.edu.cn
About the author: YANG Deng-zhou (1986-), male, PhD candidate, engaged in research on speaker recognition. ORCID: 0000-0003-0696-1497. E-mail: yangdengzhou@sina.com

Cite this article:

YANG Deng-zhou, XU Jia-ming, LIU Jia, XIA Shan-hong. Reliable silence model based voice activity detection approach in speaker diarization [J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 10.3785/j.issn.1008-973X.2016.01.022.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2016.01.022        http://www.zjujournals.com/eng/CN/Y2016/V50/I1/151

[1] The 2009 (RT09) rich transcription meeting recognition evaluation plan. 2009-02-24. http://itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf.
[2] NWE T L, MA B, LI H, et al. Speaker diarization in meeting audio for single distant microphone [J]. Interspeech, 2010(1): 4073-4076.
[3] MA Y, NISHIHARA A. Efficient voice activity detection algorithm using long-term spectral flatness measure [J]. EURASIP Journal on Audio, Speech, and Music Processing, 2013, 2013(1): 1-18.
[4] MALEGAONKAR S, ARIYAEEINIA A M, SIVAKUMARAN P. Efficient speaker change detection using adapted Gaussian mixture models [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(6): 1859-1869.
[5] FRIEDLAND G, JANIN A, IMSENG D, et al. The ICSI RT-09 speaker diarization system [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(2): 371-381.
[6] RABINER L R, SAMBUR M R. An algorithm for determining the endpoints of isolated utterances [J]. The Bell System Technical Journal, 1975, 54(2): 297-315.
[7] SHEN J L, HUANG J W, LEE L S. Robust entropy-based endpoint detection for speech recognition in noisy environments [C]∥ Proceedings of the International Conference on Spoken Language Processing. Sydney: ICSLP, 1998: 232-235.
[8] YANG C H. A novel approach to robust speech endpoint detection in car environments [C]∥ IEEE International Conference on Acoustics, Speech, and Signal Processing. Istanbul: IEEE, 2000: 1751-1754.
[9] WANG H, XU Y, LI M. Study on the MFCC similarity-based voice activity detection algorithm [C]∥ 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce. Zhengzhou: IEEE, 2011: 4391-4394.
[10] KINNUNEN T, CHERNENKO E, TUONONEN M, et al. Voice activity detection using MFCC features and support vector machine [C]∥International Conference on Speech and Computer. Moscow: Springer, 2007: 556-561.
[11] RAMIREZ J, SEGURA J C, BENITEZ C, et al. An effective subband OSF-based VAD with noise reduction for robust speech recognition [J]. IEEE Transactions on Speech and Audio Processing, 2005, 13(6): 1119-1129.
[12] RESTREPO A, HINCAPIE G, PARRA A. On the detection of edges using order statistic filters [C]∥ 1994 IEEE International Conference of Image Processing. Austin: IEEE, 1994: 308-312.
[13] SADJADI S O, HANSEN J H L. Unsupervised speech activity detection using voicing measures and perceptual spectral flux [J]. IEEE Signal Processing Letters, 2013, 20(3): 197-200.
[14] DUNN R B, REYNOLDS D A, QUATIERI T F. Approaches to speaker detection and tracking in conversational speech [J]. Digital Signal Processing, 2000, 10(1): 93-112.
[15] REYNOLDS D A, QUATIERI T F, DUNN R B. Speaker verification using adapted Gaussian mixture models [J]. Digital Signal Processing, 2000, 10(1): 19-41.
[16] SCALART P. Speech enhancement based on a priori signal to noise estimation [C]∥ 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing. Atlanta: IEEE, 1996: 629-632.
[17] NWE T L, SUN H, LI H, et al. Speaker diarization in meeting audio [C]∥ IEEE International Conference on Acoustics, Speech and Signal Processing. Taipei: IEEE, 2009: 4073-4076.
[18] DEMPSTER A P, LAIRD N M, RUBIN D B. Maximum likelihood from incomplete data via the EM algorithm [J]. Journal of the Royal Statistical Society, Series B (Methodological),1977,39(1): 1-38.
[19] YU S Z, KOBAYASHI H. Practical implementation of an efficient forward-backward algorithm for an explicit-duration hidden Markov model [J]. IEEE Transactions on Signal Processing, 2006, 54(5): 1947-1951.
[20] NIST. Rich Transcription Spring 2006 Evaluation. 2006-02-27. http://www.itl.nist.gov/iad/mig/tests/rt/2006-spring/docs/rt06s-meeting-eval-plan-V2.pdf.
[21] TEMKO A, MACHO D, NADEU C. Enhanced SVM training for robust speech activity detection [C]∥ IEEE International Conference on Acoustics, Speech and Signal Processing. Honolulu: IEEE, 2007: IV-1025-IV-1028.
[22] FREDOUILLE C, BOZONNET S, EVANS N. The LIA-EURECOM RT09 speaker diarization system [C]∥ RT09, NIST Rich Transcription Workshop. Melbourne: NIST, 2009.
[23] NWE T L, SUN H, MA B, et al. Speaker clustering and cluster purification methods for RT07 and RT09 evaluation meeting data [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(2): 461-473.
