Journal of Zhejiang University (Engineering Science)
Telecommunication Technology
Short text manifold representation based on AutoEncoder network
WEI Chao, LUO Sen-lin, ZHANG Jing, PAN Li-min
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Abstract:

To address the high dimensionality and sparsity of text representations in short text classification, a short text manifold representation method based on an AutoEncoder network is proposed. The AutoEncoder network first reconstructs the text data to learn a manifold mapping, which extracts the manifold features of short texts and performs nonlinear dimensionality reduction. The learned mapping is then tuned as a whole according to the global mapping between each label and its multiple documents in the high-dimensional observation space. This tuning expands the short text information and yields the optimal manifold representation model, which is used to produce the short text manifold representation. Combined with three classification algorithms (SVM, KNN, and Naive Bayes), the method achieves a Macro_F1 above 97.8% on public datasets, and its classification results are better than those of VSM, LDA, and LSI. The results indicate that the manifold representation generated by the model describes the features of short texts more accurately and in a non-sparse form, which significantly improves classification performance.
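The first step of the abstract, reconstructing sparse text vectors to obtain a dense low-dimensional code, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a single hidden layer, sigmoid activations, squared reconstruction error, plain batch gradient descent, a toy binary term matrix, and the hypothetical helper names `train_autoencoder` and `encode`. The label-based tuning step described in the abstract is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=2, lr=0.5, epochs=200, seed=0):
    """Train a one-hidden-layer autoencoder with batch gradient descent.

    Returns the parameters and the mean squared reconstruction error
    recorded at the start of each epoch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W1 = rng.normal(0.0, 0.1, size=(n_features, n_hidden))  # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_features))  # decoder weights
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)       # dense low-dimensional code
        X_hat = sigmoid(H @ W2 + b2)   # reconstruction of the input
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Gradient of 0.5 * squared error, averaged over samples,
        # backpropagated through both sigmoid layers.
        d_out = err * X_hat * (1.0 - X_hat) / n_samples
        d_hid = (d_out @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

def encode(X, params):
    """Map sparse input vectors to their dense hidden-layer codes."""
    W1, b1, _, _ = params
    return sigmoid(X @ W1 + b1)

# Toy binary term matrix (assumed data): 6 short texts over 8 terms.
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
], dtype=float)

params, losses = train_autoencoder(X)
codes = encode(X, params)  # shape (6, 2): one dense code per short text
```

The rows of `codes` could then be passed to any classifier in place of the original sparse vectors; how closely this matches the paper's network depth, loss, and training schedule is not stated in the abstract.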

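Published: 2015-08-01
CLC number: TP 391
Funding:

Supported by the National 242 Information Security Program (2005(48)) and the Beijing Institute of Technology Science and Technology Innovation Program, Major Project Cultivation Fund (2011CX01015).

Corresponding author: LUO Sen-lin, male, professor. E-mail: Luosenlin@bit.edu.cn
About the author: WEI Chao (born 1985), male, Ph.D. candidate, working on Chinese information processing. E-mail: weichaolx@gmaill.com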

Cite this article:

WEI Chao, LUO Sen-lin, ZHANG Jing, PAN Li-min. Short text manifold representation based on AutoEncoder network[J]. Journal of Zhejiang University (Engineering Science), DOI: 10.3785/j.issn.1008-973X.2015.08.027.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2015.08.027
http://www.zjujournals.com/eng/CN/Y2015/V49/I8/1591

