Front. Inform. Technol. Electron. Eng.  2015, Vol. 16 Issue (6): 457-465    DOI: 10.1631/FITEE.1400352
    
Topic modeling for large-scale text data
Xi-ming Li, Ji-hong Ouyang, You Lu
College of Computer Science and Technology, Jilin University, Changchun 130012, China; MOE Key Laboratory of Symbolic Computation and Knowledge Engineering, Jilin University, Changchun 130012, China

Abstract  This paper develops a novel online algorithm, moving average stochastic variational inference (MASVI), which uses the results of previous iterations to smooth out noisy natural gradients. We analyze the convergence of the proposed algorithm and conduct a set of experiments on two large-scale collections that contain millions of documents. Experimental results indicate that, compared with stochastic variational inference (SVI) and SGRLD, our algorithm achieves a faster convergence rate and better performance.

Key words: Latent Dirichlet allocation (LDA), Topic modeling, Online learning, Moving average
Received: 15 October 2014      Published: 04 June 2015
CLC:  TP391.1  
Cite this article:

Xi-ming Li, Ji-hong Ouyang, You Lu. Topic modeling for large-scale text data. Front. Inform. Technol. Electron. Eng., 2015, 16(6): 457-465.

URL:

http://www.zjujournals.com/xueshu/fitee/10.1631/FITEE.1400352     OR     http://www.zjujournals.com/xueshu/fitee/Y2015/V16/I6/457


Topic modeling for large-scale text data

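Objective: To study online inference algorithms for topic models on large-scale data. To address the large stochastic-gradient error in stochastic variational inference, a moving average stochastic variational inference algorithm is proposed.
Innovation: The moving average of the stochastic gradients from multiple iterations is used as an approximate replacement for the current stochastic gradient, which reduces the error between the stochastic gradient and the true gradient.
Methods: The study is built on latent Dirichlet allocation, the basic topic model. Because the document subsets sampled at different iterations contain different vocabularies (Table 1), the moving average of the stochastic terms from different iterations is used to approximate the stochastic term of the current stochastic gradient. To preserve accuracy as far as possible, only the stochastic terms of the most recent R iterations are used (Fig. 2), and the convergence of the proposed algorithm is verified.
Conclusions: Building on stochastic variational inference, a moving average stochastic variational inference algorithm is proposed that achieves better topic modeling results on text and a faster convergence rate.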
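
The windowed averaging described in the summary above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `MovingAverageSmoother`, the function `svi_style_step`, and the parameters `window` (standing in for R), `prior`, and `step_size` are hypothetical names chosen for illustration, and the equal weighting of the last R terms is an assumption; the exact weighting in the paper may differ. The update form `(1 - step_size) * lam + step_size * (prior + term)` follows the standard stochastic variational inference update for a global parameter.

```python
from collections import deque


class MovingAverageSmoother:
    """Average the most recent R noisy stochastic terms (hypothetical helper)."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be >= 1")
        # A bounded deque drops the oldest term once `window` terms are stored.
        self.history = deque(maxlen=window)

    def update(self, noisy_term):
        """Store one noisy term (a list of floats) and return the windowed mean."""
        self.history.append(list(noisy_term))
        count = len(self.history)
        # zip(*history) groups the same coordinate across stored iterations.
        return [sum(column) / count for column in zip(*self.history)]


def svi_style_step(lam, smoothed_term, prior, step_size):
    """Blend the current parameter with prior + smoothed stochastic term."""
    return [(1.0 - step_size) * l + step_size * (prior + s)
            for l, s in zip(lam, smoothed_term)]


# Toy usage with R = 2 and two-dimensional stochastic terms.
smoother = MovingAverageSmoother(window=2)
first = smoother.update([2.0, 4.0])    # only one term stored so far
second = smoother.update([4.0, 0.0])   # mean of the two stored terms
third = smoother.update([0.0, 2.0])    # oldest term has been dropped
new_lam = svi_style_step([1.0, 1.0], second, prior=0.0, step_size=0.5)
```

A larger `window` averages over more iterations and so damps more noise, at the cost of relying on terms computed from older parameter values; the summary above notes that only the most recent R terms are kept to limit this loss of accuracy.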

Keywords: Latent Dirichlet allocation, Topic model, Online learning, Moving average