Clustering feature decision trees for semi-supervised classification from high-speed data streams

doi:10.1631/jzus.C1000330

Front. Inform. Technol. Electron. Eng., 2011, Vol. 12, Issue 8: 615-628

Wen-hua Xu¹, Zheng Qin*², Yang Chang²

¹ Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
² School of Software, Tsinghua University, Beijing 100084, China


Abstract Most stream data classification algorithms apply a supervised learning strategy, which requires massive amounts of labeled data. Such approaches are impractical because labeled data are usually hard to obtain in practice. In this paper, we build a clustering feature decision tree model, CFDT, from data streams containing unlabeled examples together with a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide the statistical summaries of the data needed for incremental decision tree induction. Micro-clusters also serve as classifiers in tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while achieving high classification accuracy at high speed.
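The one-pass micro-clustering idea in the abstract can be illustrated with a minimal sketch. It assumes a BIRCH-style clustering feature (count, linear sum, squared sum) extended with label counts, and a fixed absorption radius. The abstract does not give CFDT's actual data structure, thresholds, or split criterion, so the names `MicroCluster`, `MicroClusterSummary`, and `radius` are hypothetical and the decision tree induction step is left out.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Assumed clustering feature: count, linear sum, squared sum, label counts."""
    dim: int
    n: int = 0
    ls: list = field(default_factory=list)   # per-dimension linear sum
    ss: list = field(default_factory=list)   # per-dimension squared sum
    labels: Counter = field(default_factory=Counter)

    def __post_init__(self):
        if not self.ls:
            self.ls = [0.0] * self.dim
        if not self.ss:
            self.ss = [0.0] * self.dim

    def add(self, x, label=None):
        # Additive update: the raw example is never stored.
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        if label is not None:
            self.labels[label] += 1

    def centroid(self):
        return [s / self.n for s in self.ls]

    def majority_label(self):
        return self.labels.most_common(1)[0][0] if self.labels else None


class MicroClusterSummary:
    """Single-scan summary of a stream with labeled and unlabeled examples."""

    def __init__(self, dim, radius):
        self.dim = dim
        self.radius = radius  # hypothetical absorption threshold
        self.clusters = []

    def _nearest(self, x, candidates):
        return min(candidates, key=lambda c: math.dist(x, c.centroid()))

    def update(self, x, label=None):
        # Absorb x into the nearest cluster if close enough, else open a new one.
        if self.clusters:
            c = self._nearest(x, self.clusters)
            if math.dist(x, c.centroid()) <= self.radius:
                c.add(x, label)
                return
        c = MicroCluster(self.dim)
        c.add(x, label)
        self.clusters.append(c)

    def predict(self, x):
        # Use the nearest cluster that has seen at least one label.
        labeled = [c for c in self.clusters if c.labels]
        if not labeled:
            return None
        return self._nearest(x, labeled).majority_label()


summary = MicroClusterSummary(dim=2, radius=1.0)
stream = [((0.0, 0.0), "a"), ((0.2, 0.1), None), ((5.0, 5.0), "b"),
          ((5.1, 4.9), None), ((0.1, 0.2), None)]
for x, y in stream:
    summary.update(x, y)
print(len(summary.clusters), summary.predict((0.3, 0.3)))
```

Each example is seen once and only the running sums are kept, so memory grows with the number of micro-clusters rather than with the length of the stream. Unlabeled examples still refine the centroids, and the few labeled examples supply the class votes.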

Key words: Clustering feature vector, Decision tree, Semi-supervised learning, Stream data classification, Very fast decision tree

Received: 25 September 2010 Published: 03 August 2011

CLC: TP391


Cite this article:

Wen-hua Xu, Zheng Qin, Yang Chang. Clustering feature decision trees for semi-supervised classification from high-speed data streams. Front. Inform. Technol. Electron. Eng., 2011, 12(8): 615-628.

URL:

http://www.zjujournals.com/xueshu/fitee/10.1631/jzus.C1000330 OR http://www.zjujournals.com/xueshu/fitee/Y2011/V12/I8/615


