Short text manifold representation based on AutoEncoder network
WEI Chao, LUO Sen-lin, ZHANG Jing, PAN Li-min
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Abstract A short text manifold representation method based on an AutoEncoder network is proposed to address the sparsity and the curse of dimensionality of short text representations. First, the AutoEncoder network reconstructs the text data to find a manifold mapping, which extracts the manifold features of short text and performs non-linear dimensionality reduction. Then, the mapping is tuned as a whole according to the global pairwise relationship between each label and its multiple documents in the high-dimensional observation space, which expands the short text information and yields the optimal manifold representation model. The short text manifold representation is obtained using this model. Combined with SVM, KNN, and Naive Bayes classifiers, the method achieves better classification results than VSM, LDA, and LSI, with a Macro_F1 above 97.8%. The experimental results indicate that the manifold representation describes the features of short text more accurately and in a non-sparse form, leading to a significant improvement in classification.
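The reconstruction step described in the abstract can be illustrated with a minimal sketch. It assumes a single hidden layer, sigmoid activations, a mean squared reconstruction error, and full-batch gradient descent on a toy binary bag-of-words matrix; none of these choices, nor the names `train_autoencoder` and `encode`, come from the paper, and the label-based tuning of the mapping is not covered here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, code_dim=2, epochs=2000, lr=1.0, seed=0):
    """Fit a one-hidden-layer autoencoder by full-batch gradient descent.

    Hypothetical helper for illustration only. Returns the learned
    parameters and the reconstruction loss recorded at each epoch.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W1 = rng.normal(0.0, 0.1, size=(n_terms, code_dim))  # encoder weights
    b1 = np.zeros(code_dim)
    W2 = rng.normal(0.0, 0.1, size=(code_dim, n_terms))  # decoder weights
    b2 = np.zeros(n_terms)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)   # low-dimensional code
        R = sigmoid(H @ W2 + b2)   # reconstruction of the input
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error through both sigmoid layers.
        dZ2 = (2.0 * err / err.size) * R * (1.0 - R)
        dW2 = H.T @ dZ2
        db2 = dZ2.sum(axis=0)
        dZ1 = (dZ2 @ W2.T) * H * (1.0 - H)
        dW1 = X.T @ dZ1
        db1 = dZ1.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses

def encode(X, params):
    """Map documents to their dense low-dimensional codes."""
    W1, b1, _, _ = params
    return sigmoid(X @ W1 + b1)

# Toy corpus: 6 short texts over an 8-term vocabulary (sparse, binary).
X = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
], dtype=float)

params, losses = train_autoencoder(X)
codes = encode(X, params)
print(codes.shape)  # (6, 2): each document becomes a dense 2-dimensional vector
```

The dense codes returned by `encode` play the role of the reduced representation that would then be passed to a classifier such as SVM, KNN, or Naive Bayes.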
Published: 01 August 2015
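Short text manifold representation method based on autoencoder networks
To address the high-dimensional sparsity of text representations in short text classification, a short text manifold representation method based on an autoencoder network is proposed. The autoencoder network reconstructs the text to obtain a manifold mapping, extracting the manifold features of short texts and achieving non-linear dimensionality reduction. Based on the global mapping between labels and multiple documents in the high-dimensional observation space, the existing manifold mapping is adjusted as a whole, expanding the short text information to obtain the optimal manifold representation model, which is then used to produce the short text manifold representation. Combined with three classification algorithms (SVM, KNN, and Naive Bayes), the method achieves Macro_F1 scores above 97.8% on public data sources, outperforming VSM, LDA, and LSI. The results show that the manifold representation generated by the model describes short text features more accurately and in a non-sparse form, significantly improving classification performance.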
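The Macro_F1 figure reported above is a class-averaged score. As a hedged illustration, the sketch below computes it as the unweighted mean of per-class F1 values, which is one common definition; the source does not state which variant was used, and the function name `macro_f1` and the toy labels are assumptions made for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example with three classes and one misclassified document.
y_true = ["sports", "sports", "tech", "tech", "finance", "finance"]
y_pred = ["sports", "tech", "tech", "tech", "finance", "finance"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # per-class F1: sports 2/3, tech 0.8, finance 1.0
```

Because every class contributes equally to the mean, a classifier cannot reach a high Macro_F1 by performing well only on the most frequent categories.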