Front. Inform. Technol. Electron. Eng.  2012, Vol. 13 Issue (9): 649-659    DOI: 10.1631/jzus.C1100373
    
Short text classification based on strong feature thesaurus
Bing-kun Wang, Yong-feng Huang, Wan-xia Yang, Xing Li
Information Cognitive and Intelligent System Research Institute, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; Information Technology National Laboratory, Tsinghua University, Beijing 100084, China

Abstract  Data sparseness, the most evident characteristic of short text, has long been regarded as the main cause of low accuracy when short texts are classified with statistical methods. Intensive research has been conducted in this area over the past decade. However, most researchers have overlooked that ignoring the semantic importance of certain feature terms may also contribute to low classification accuracy. In this paper we present a new method that tackles the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. Giving larger weights to feature terms in the SFT improves classification accuracy. In particular, our method appears to be more effective as the classification becomes more detailed. Experiments on two short text datasets demonstrate that our approach improves on state-of-the-art methods, including support vector machine (SVM) and Naïve Bayes Multinomial.
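The weighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it covers only an information-gain ranking and a term-weight boost, and it omits the LDA component entirely. The function names, the `top_k` cutoff, the `boost` factor, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(label_counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(label_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in label_counts.values() if c)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C | t present) - P(not t) H(C | t absent)."""
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    gain = entropy(Counter(labels))
    for part in (with_t, without_t):
        size = sum(part.values())
        if size:
            gain -= (size / len(docs)) * entropy(part)
    return gain


def strong_terms(docs, labels, top_k):
    """Hypothetical stand-in for a strong feature set: the top_k terms by IG."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:top_k])


def weighted_vector(doc, strong, boost=2.0):
    """Term counts, with terms in the strong set scaled by an assumed boost."""
    return {t: c * (boost if t in strong else 1.0)
            for t, c in Counter(doc).items()}


# Toy corpus (invented for this sketch): tokenized short texts and labels.
docs = [["cheap", "flight", "deal"],
        ["flight", "delay", "news"],
        ["goal", "match", "news"],
        ["match", "score", "goal"]]
labels = ["travel", "travel", "sport", "sport"]

strong = strong_terms(docs, labels, top_k=3)
vector = weighted_vector(["flight", "news"], strong)
```

In this toy corpus a term such as "news" appears equally in both classes and so carries no information gain, while class-specific terms rank highest and receive the larger weight. The resulting vectors could then be fed to any standard classifier.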

Key words: Short text, Classification, Data sparseness, Semantic, Strong feature thesaurus (SFT), Latent Dirichlet allocation (LDA)
Received: 19 December 2011      Published: 05 September 2012
CLC:  TP391.4  
Cite this article:

Bing-kun Wang, Yong-feng Huang, Wan-xia Yang, Xing Li. Short text classification based on strong feature thesaurus. Front. Inform. Technol. Electron. Eng., 2012, 13(9): 649-659.

URL:

http://www.zjujournals.com/xueshu/fitee/10.1631/jzus.C1100373     OR     http://www.zjujournals.com/xueshu/fitee/Y2012/V13/I9/649

