Front. Inform. Technol. Electron. Eng.  2011, Vol. 12 Issue (8): 615-628    DOI: 10.1631/jzus.C1000330
    
Clustering feature decision trees for semi-supervised classification from high-speed data streams
Wen-hua Xu (1), Zheng Qin* (2), Yang Chang (2)
(1) Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
(2) School of Software, Tsinghua University, Beijing 100084, China

Abstract  Most stream data classification algorithms use supervised learning, which requires large amounts of labeled data. Such approaches are often impractical because labeled data are usually hard to obtain. In this paper, we build a clustering feature decision tree model, CFDT, from data streams that contain both unlabeled examples and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once and provides the statistical summaries needed for incremental decision tree induction. Micro-clusters also serve as classifiers in the tree leaves, which improves classification accuracy and reinforces the any-time property. Our experiments on synthetic and real-world datasets show that CFDT scales well to data streams while achieving high classification accuracy at high speed.

Key words: Clustering feature vector, Decision tree, Semi-supervised learning, Stream data classification, Very fast decision tree
Received: 25 September 2010      Published: 03 August 2011
CLC:  TP391  
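
The one-pass micro-clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' CFDT implementation. It assumes a BIRCH-style clustering feature (count, linear sum, sum of squares) as the statistical summary, and it assumes a simple rule in which a point joins the nearest micro-cluster when it lies within a distance threshold. The names `ClusteringFeature`, `micro_cluster`, and `threshold` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


class ClusteringFeature:
    """Assumed BIRCH-style summary (N, LS, SS) of one micro-cluster."""

    def __init__(self, dim):
        self.n = 0                # number of points absorbed
        self.ls = [0.0] * dim     # per-dimension linear sum
        self.ss = [0.0] * dim     # per-dimension sum of squares
        self.labels = Counter()   # class counts from labeled examples only

    def add(self, x, label=None):
        # Incremental update: the raw point is not stored.
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        if label is not None:
            self.labels[label] += 1

    def centroid(self):
        return [s / self.n for s in self.ls]

    def majority_label(self):
        # None when the cluster has seen no labeled example.
        return self.labels.most_common(1)[0][0] if self.labels else None


def micro_cluster(stream, dim, threshold):
    """Single scan over (point, label_or_None) pairs (hypothetical rule)."""
    clusters = []
    for x, label in stream:
        nearest = min(
            clusters,
            key=lambda c: math.dist(x, c.centroid()),
            default=None,
        )
        if nearest is not None and math.dist(x, nearest.centroid()) <= threshold:
            nearest.add(x, label)
        else:
            cf = ClusteringFeature(dim)
            cf.add(x, label)
            clusters.append(cf)
    return clusters
```

In this sketch, unlabeled examples still refine a cluster's centroid, while the few labeled examples determine the class that the cluster predicts. That is one plausible reading of how micro-clusters could act as classifiers in tree leaves.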
Cite this article:

Wen-hua Xu, Zheng Qin, Yang Chang. Clustering feature decision trees for semi-supervised classification from high-speed data streams. Front. Inform. Technol. Electron. Eng., 2011, 12(8): 615-628.

URL:

http://www.zjujournals.com/xueshu/fitee/10.1631/jzus.C1000330     OR     http://www.zjujournals.com/xueshu/fitee/Y2011/V12/I8/615

