Short text manifold representation based on AutoEncoder network

doi:10.3785/j.issn.1008-973X.2015.08.027

JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE)

WEI Chao, LUO Sen-lin, ZHANG Jing, PAN Li-min

School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China


Abstract

To address the sparsity and the curse of dimensionality that affect short text representation, a short text manifold representation method based on an AutoEncoder network was proposed. First, the AutoEncoder network reconstructs the text data to find a manifold mapping, from which manifold features of the short text are extracted for non-linear dimensionality reduction. Then the mapping is tuned as a whole according to the global pair-wise pattern between each label and its multiple documents in the high-dimensional observation space, which extends the short text information and yields the optimal manifold representation model. The short text manifold representation is obtained with this model. Combined with SVM, KNN and Naive Bayes classifiers, the method achieves better classification results than VSM, LDA and LSI, with Macro_F1 above 97.8%. The experimental results indicate that the manifold representation describes the features of short text more accurately and in a non-sparse form, leading to a significant improvement in classification.
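
As a rough illustration of the first stage described in the abstract (reconstructing text vectors so that a hidden layer gives a dense, low-dimensional code), the following is a minimal sketch of a one-hidden-layer autoencoder in NumPy. It is not the authors' network: the toy term matrix, the layer size, the learning rate, and the function names `train_autoencoder` and `encode` are all assumptions made for illustration, and the label-based tuning stage is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=2, lr=0.5, epochs=500, seed=0):
    """Fit a one-hidden-layer autoencoder by full-batch gradient descent.

    Hypothetical helper: minimizes squared reconstruction error and
    returns the learned parameters plus the loss recorded at each epoch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_features))
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)   # encode: sparse input -> dense code
        R = sigmoid(H @ W2 + b2)   # decode: dense code -> reconstruction
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the squared error through both sigmoid layers.
        d_out = err * R * (1.0 - R) / n_samples
        d_hid = (d_out @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

def encode(X, params):
    """Map inputs to their hidden-layer (low-dimensional) codes."""
    W1, b1, _, _ = params
    return sigmoid(X @ W1 + b1)

# Toy binary term-occurrence matrix: 6 short texts over 6 terms (made up).
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

params, losses = train_autoencoder(X)
Z = encode(X, params)  # one dense 2-dimensional code per text
```

Each row of `Z` is a dense vector with no zero entries, in contrast to the sparse rows of `X`; such codes could then be passed to a classifier such as SVM or KNN.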

Published: 01 August 2015

CLC:

TP 391


Cite this article:

WEI Chao, LUO Sen-lin, ZHANG Jing, PAN Li-min. Short text manifold representation based on AutoEncoder network. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(8): 1591-1599.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2015.08.027 OR http://www.zjujournals.com/eng/Y2015/V49/I8/1591

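Short text manifold representation method based on an autoencoder network

To address the high-dimensional sparsity of text representation in short text classification, a short text manifold representation method based on an autoencoder network is proposed. The autoencoder network reconstructs the text to obtain a manifold mapping and extracts manifold features of the short text, achieving non-linear dimensionality reduction. Based on the global mapping relation between labels and multiple texts in the high-dimensional observation space, the existing manifold mapping is adjusted as a whole, extending the short text information to obtain the optimal manifold representation model, which is then used to produce the short text manifold representation. Combined with three classification algorithms (SVM, KNN and Naive Bayes), the method achieves Macro_F1 above 97.8% on public data sources, outperforming VSM, LDA and LSI. The results show that the manifold representation generated by the model describes short text feature information more accurately and in a non-sparse form, significantly improving classification.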
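
The Macro_F1 figure reported above is the unweighted mean of the per-class F1 scores. The sketch below shows how such a score is commonly computed; the function name `macro_f1`, the toy labels, and the choice to average over every class that appears in either the true or the predicted labels are assumptions, not details taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Made-up labels: one "b" text is misclassified as "c".
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "c", "c", "c"]
score = macro_f1(y_true, y_pred)  # per-class F1: a = 1, b = 2/3, c = 4/5
```

Because every class contributes equally, a poorly classified rare class lowers Macro_F1 as much as a poorly classified frequent one.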


