Front. Inform. Technol. Electron. Eng.  2012, Vol. 13 Issue (9): 649-659    DOI: 10.1631/jzus.C1100373
    
Short text classification based on strong feature thesaurus
Bing-kun Wang, Yong-feng Huang, Wan-xia Yang, Xing Li
Information Cognitive and Intelligent System Research Institute, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; Information Technology National Laboratory, Tsinghua University, Beijing 100084, China
Abstract: Data sparseness, the defining characteristic of short text, has long been regarded as the main cause of low accuracy when short texts are classified with statistical methods, and it has been studied intensively over the past decade. However, most researchers have overlooked a second factor: ignoring the semantic importance of certain feature terms may also contribute to low classification accuracy. In this paper we present a new method that addresses this problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. Giving larger weights to the feature terms in the SFT can improve classification accuracy. In particular, our method appears to be more effective for finer-grained classification. Experiments on two short text datasets demonstrate that our approach improves on state-of-the-art methods, including support vector machine (SVM) and Naïve Bayes Multinomial.
Key words: Short text    Classification    Data sparseness    Semantic    Strong feature thesaurus (SFT)    Latent Dirichlet allocation (LDA)
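
The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it ranks terms by information gain only and omits the LDA step, since the abstract does not say how the two models are combined. The function names, the toy data, the `top_k` cutoff, and the `boost` factor are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' w.r.t. the class."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    h_class = entropy(Counter(labels).values())
    h_cond = sum(sum(part.values()) / n * entropy(part.values())
                 for part in (with_term, without_term))
    return h_class - h_cond


def build_thesaurus(docs, labels, top_k):
    """Assumed stand-in for the SFT: the top_k terms ranked by IG alone."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:top_k])


def weighted_counts(doc, thesaurus, boost=2.0):
    """Term counts in which thesaurus terms are scaled by a hypothetical boost."""
    return {t: c * (boost if t in thesaurus else 1.0)
            for t, c in Counter(doc).items()}


# Toy corpus (invented for illustration): each document is a list of tokens.
docs = [["cheap", "flight", "deal"], ["flight", "delay", "news"],
        ["cheap", "phone", "deal"], ["phone", "launch", "news"]]
labels = ["ad", "news", "ad", "news"]

sft = build_thesaurus(docs, labels, top_k=3)
features = weighted_counts(docs[0], sft)
```

The boosted counts in `features` could then be passed to any standard classifier in place of raw term frequencies; how large the boost should be, and how LDA topics enter the selection, is left open here because the abstract does not specify it.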
Received: 2011-12-19 Published: 2012-09-05
CLC:  TP391.4  

Cite this article:

Bing-kun Wang, Yong-feng Huang, Wan-xia Yang, Xing Li. Short text classification based on strong feature thesaurus. Front. Inform. Technol. Electron. Eng., 2012, 13(9): 649-659.

Link to this article:

http://www.zjujournals.com/xueshu/fitee/CN/10.1631/jzus.C1100373
http://www.zjujournals.com/xueshu/fitee/CN/Y2012/V13/I9/649
