Please wait a minute...
Front. Inform. Technol. Electron. Eng.  2013, Vol. 14 Issue (6): 407-416    DOI: 10.1631/jzus.C1200303
    
Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization
Min Du, Xing-shu Chen
School of Computer Science, Sichuan University, Chengdu 610065, China
Download:   PDF(0KB)
Export: BibTeX | EndNote (RIS)      

Abstract  Text categorization is a significant technique to manage the surging text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: usually, near points in space are also near when they are projected into a direction, which means that distant points in the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. Specifically, it is superior in applications that have large and high-dimensional datasets.

Key wordsk-nearest neighbors (kNN)      Text categorization      Accelerating strategy      Principal component analysis (PCA)     
Received: 24 October 2012      Published: 04 June 2013
CLC:  TP391  
Cite this article:

Min Du, Xing-shu Chen. Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization. Front. Inform. Technol. Electron. Eng., 2013, 14(6): 407-416.

URL:

http://www.zjujournals.com/xueshu/fitee/10.1631/jzus.C1200303     OR     http://www.zjujournals.com/xueshu/fitee/Y2013/V14/I6/407


Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization

Text categorization is a significant technique to manage the surging text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: usually, near points in space are also near when they are projected into a direction, which means that distant points in the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. Specifically, it is superior in applications that have large and high-dimensional datasets.

关键词: k-nearest neighbors (kNN),  Text categorization,  Accelerating strategy,  Principal component analysis (PCA) 
[1] Xiu-rui Geng, Lu-yan Ji, Kang Sun. Non-negative matrix factorization based unmixing for principal component transformed hyperspectral data[J]. Front. Inform. Technol. Electron. Eng., 2016, 17(5): 403-412.
[2] Gurmanik Kaur, Ajat Shatru Arora, Vijender Kumar Jain. Using hybrid models to predict blood pressure reactivity to unsupported back based on anthropometric characteristics[J]. Front. Inform. Technol. Electron. Eng., 2015, 16(6): 474-485.
[3] Ding-cheng Feng, Feng Chen, Wen-li Xu. Learning robust principal components from L1-norm maximization[J]. Front. Inform. Technol. Electron. Eng., 2012, 13(12): 901-908.