A short-text manifold representation method based on an AutoEncoder network is proposed to address the sparsity of short text and the curse of dimensionality. The main idea is first to extract manifold features of short text for non-linear dimensionality reduction with an AutoEncoder network, by reconstructing the text data and finding the manifold mapping. The short text is then extended, and the optimal manifold representation model is obtained by tuning the mapping based on the global pair-wise pattern between a label and its multiple documents in the high-dimensional observation space. The manifold representation of short text can then be obtained using this model. Combined with SVM, KNN, and Naive Bayes classifiers, the method achieves better classification results than VSM, LDA, and LSI, and its Macro_F1 can exceed 97.8%. The experimental results indicate that the manifold representation describes the features of short text more accurately and without sparsity, leading to a significant improvement in classification.
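The first step described above, learning a low-dimensional code by reconstructing the text data, can be illustrated with a minimal sketch. This is not the authors' network: the toy texts, the binary bag-of-words input, the single sigmoid hidden layer of size 2, the learning rate, and the helper names `bag_of_words`, `train_autoencoder`, and `encode` are all assumptions made for illustration. The text-extension and label-based tuning steps are omitted.

```python
import numpy as np

def bag_of_words(texts):
    """Build a binary term-document matrix (rows = texts, columns = vocabulary)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            X[i, index[w]] = 1.0
    return X, vocab

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=2, lr=1.0, epochs=500, seed=0):
    """Train a one-hidden-layer autoencoder with full-batch gradient descent
    on the mean squared reconstruction error. Hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)   # encoder: sparse input -> dense code
        Y = sigmoid(H @ W2 + b2)   # decoder: code -> reconstruction
        err = Y - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error through both sigmoid layers.
        dY = (2.0 * err / err.size) * Y * (1.0 - Y)
        dH = (dY @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ dY)
        b2 -= lr * dY.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return (W1, b1, W2, b2), losses

def encode(X, params):
    """Map inputs to their low-dimensional, non-sparse codes."""
    W1, b1, _, _ = params
    return sigmoid(X @ W1 + b1)

# Toy short texts (hypothetical data, not from the paper's corpus).
texts = [
    "cheap flights to paris",
    "paris hotel deals",
    "book cheap hotel",
    "python list sort",
    "sort array in python",
    "python array slice",
]
X, vocab = bag_of_words(texts)
params, losses = train_autoencoder(X)
codes = encode(X, params)
print(codes.shape, losses[0], losses[-1])
```

Each row of `codes` is a dense vector that could be passed to a classifier such as SVM or KNN in place of the sparse bag-of-words row.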
WEI Chao, LUO Sen-lin, ZHANG Jing, PAN Li-min. Short text manifold representation based on AutoEncoder network. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(8): 1591-1599.