JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE)
Automatic Technology, Telecommunication Technology     
Reliable silence model based voice activity detection approach in speaker diarization
YANG Deng-zhou (1,2), XU Jia-ming (1,2), LIU Jia (3), XIA Shan-hong (1)
1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China; 2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China; 3. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Abstract  

Traditional voice activity detection (VAD) methods perform unstably in speaker diarization: the speech segments they produce are contaminated with silence, and frequent frame-to-frame switching between speech and silence generates short speech fragments. An efficient and stable VAD approach for the speaker diarization task is presented. Exploiting the discriminative power of a reliable silence model, the approach iteratively retrains the speech and silence models until the segmentation is relatively stable. The implementation has several components: silence is modeled with a single Gaussian and speech with a Gaussian mixture model (GMM); uncertainty decoding with a frame-continuity rule assigns each frame to a class; and short, low-energy fragments are removed to complete the detection. Tests were conducted on the rich transcription speaker diarization dataset. The experimental results show that the method reduces frame-to-frame switching and the misclassification of speech as silence, and that it consistently outperforms the sub-band entropy order statistics filter (SE-OSF) VAD algorithm.
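The iterative retraining loop summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional log-energy feature, the mean-energy bootstrap, the two-component mixture, the variance floor, and the stopping tolerance are all assumptions chosen for brevity, and every function name is hypothetical. Silence is modeled with a single Gaussian and speech with a small GMM, and both are re-estimated until the frame labels stop changing.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def mean_var(values, var_floor=1e-3):
    """Sample mean and floored (biased) variance of a list of floats."""
    m = sum(values) / len(values)
    v = sum((t - m) ** 2 for t in values) / len(values)
    return m, max(v, var_floor)

def logsumexp(terms):
    """Numerically stable log(sum(exp(t) for t in terms))."""
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def fit_gmm(values, n_comp=2, n_iter=15, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture by EM; returns a list of (weight, mean, var)."""
    ordered = sorted(values)
    _, v0 = mean_var(values, var_floor)
    # Initialise component means at evenly spaced quantiles of the data.
    comps = [(1.0 / n_comp, ordered[int((k + 0.5) / n_comp * len(ordered))], v0)
             for k in range(n_comp)]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        resp = []
        for t in values:
            logs = [math.log(w) + log_gauss(t, m, v) for w, m, v in comps]
            norm = logsumexp(logs)
            resp.append([math.exp(l - norm) for l in logs])
        # M-step: re-estimate weight, mean and variance of each component.
        new_comps = []
        for k in range(n_comp):
            nk = sum(r[k] for r in resp) + 1e-10
            m = sum(r[k] * t for r, t in zip(resp, values)) / nk
            v = sum(r[k] * (t - m) ** 2 for r, t in zip(resp, values)) / nk
            new_comps.append((nk / len(values), m, max(v, var_floor)))
        comps = new_comps
    return comps

def gmm_loglik(x, comps):
    """Log-likelihood of x under a mixture returned by fit_gmm."""
    return logsumexp([math.log(w) + log_gauss(x, m, v) for w, m, v in comps])

def iterative_vad(log_energy, max_iter=20, tol=0.01):
    """Return a list of booleans, True where a frame is labelled speech."""
    x = list(log_energy)
    # Bootstrap (assumption): frames above the mean log-energy start as speech.
    threshold = sum(x) / len(x)
    labels = [t > threshold for t in x]
    for _ in range(max_iter):
        sil = [t for t, s in zip(x, labels) if not s]
        sp = [t for t, s in zip(x, labels) if s]
        if len(sil) < 2 or len(sp) < 2:
            break
        sil_mean, sil_var = mean_var(sil)   # single Gaussian for silence
        speech_gmm = fit_gmm(sp)            # GMM for speech
        new_labels = [gmm_loglik(t, speech_gmm) > log_gauss(t, sil_mean, sil_var)
                      for t in x]
        changed = sum(a != b for a, b in zip(labels, new_labels)) / len(x)
        labels = new_labels
        if changed < tol:                   # labels are (nearly) stable
            break
    return labels
```

Calling `iterative_vad` on a list of per-frame log energies returns one speech/silence decision per frame. The sketch classifies each frame independently; the frame-continuity decoding described in the abstract is not modeled here.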



Published: 31 March 2016
CLC:  TN 912  
Cite this article:

YANG Deng-zhou, XU Jia-ming, LIU Jia, XIA Shan-hong. Reliable silence model based voice activity detection approach in speaker diarization. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2016, 50(1): 151-157.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2016.01.022     OR     http://www.zjujournals.com/eng/Y2016/V50/I1/151


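Voice activity detection method based on a reliable silence model in speaker diarization

To address the problems that speech segments separated by traditional voice activity detection (VAD) techniques are mixed with silence and that frequent frame-to-frame switching produces short speech fragments, a method that performs voice activity detection efficiently and stably in speaker diarization is proposed. The method exploits the high discriminability of a reliable silence model with respect to speech and obtains a stable segmentation through iterative convergence. Silence and speech models are built, uncertainty decoding based on the frame-continuity principle yields the class of each frame, and post-processing of low-energy, short-duration speech fragments completes the detection. In tests on the rich transcription speaker diarization dataset, the experimental results show that, because the silence model is described more reliably, the method reduces frame-to-frame switching and the erroneous absorption of speech by the silence model, and clearly outperforms the method based on sub-band entropy order statistics filtering (SE-OSF).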
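The final post-processing step, which cleans up short, low-energy speech fragments, might look like the sketch below. The abstract does not state the exact rule, so this assumes a fragment is relabelled as silence only when it is both shorter than `min_frames` and quieter on average than `energy_floor`; the function name and both thresholds are hypothetical.

```python
def remove_short_low_energy(labels, log_energy, min_frames=20, energy_floor=-5.0):
    """Relabel speech runs that are both short and quiet as silence.

    labels      -- per-frame booleans, True for speech
    log_energy  -- per-frame log energies, same length as labels
    """
    out = list(labels)
    n = len(out)
    start = 0
    while start < n:
        if not out[start]:
            start += 1
            continue
        # Find the end (exclusive) of the current run of speech frames.
        end = start
        while end < n and out[end]:
            end += 1
        run_len = end - start
        mean_energy = sum(log_energy[start:end]) / run_len
        # Assumed rule: drop the run only if it is short AND low in energy.
        if run_len < min_frames and mean_energy < energy_floor:
            for i in range(start, end):
                out[i] = False
        start = end
    return out
```

With these defaults, a three-frame run at a log energy of -6 is removed, while a three-frame run at -1 or a thirty-frame run at any energy is kept.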

