Improved AdaBoost algorithm using group degree and membership degree based noise detection and dynamic feature selection

doi:10.3785/j.issn.1008-973X.2021.02.017

Journal of ZheJiang University (Engineering Science)

2021, Vol. 55

Issue (2): 367-376 DOI: 10.3785/j.issn.1008-973X.2021.02.017

Improved AdaBoost algorithm using group degree and membership degree based noise detection and dynamic feature selection

You-wei WANG1(

),Li-zhou FENG2,*(

)

1. School of Information, Central University of Finance and Economics, Beijing 100081, China
2. School of Statistics, Tianjin University of Finance and Economics, Tianjin 300222, China

Download:

HTML

PDF(1050KB) HTML
Export: BibTeX | EndNote (RIS)

Abstract

An improved AdaBoost algorithm using group degree and membership degree based noise detection and dynamic feature selection was proposed in order to improve the performance of AdaBoost ensemble learning algorithm on data classification. Firstly, the similarity between a sample and its neighbor samples and the membership relationship between a sample and the categories were comprehensively considered. The conceptions of group degree and membership degree were introduced, and a new noise detection method was proposed. On this basis, for the purpose of selecting the features those can effectively distinguish the misclassified samples, a general and sample weight combined dynamic feature selection method was proposed based on the traditional feature selections of filters, improving the classification ability of AdaBoost algorithm on misclassified samples. Experiments were carried out by using support vector machine as the weak classifier on eight typical datasets from three aspects of noise detection, feature selection and existing algorithms comparison. Experimental results show that the proposed method comprehensively considers the influences of sample density and sample weights on the classification results of AdaBoost algorithm, and obtains significant improvement on classification performance compared to traditional algorithms.

Key words： ensemble learning data classification noise detection feature selection sample weight

Received: 12 March 2020 Published: 09 March 2021

CLC:

TP 311

Fund: 国家自然科学基金资助项目（61906220）；教育部人文社科资助项目（19YJCZH178）；国家社会科学基金资助项目（18CTJ008）；天津市自然科学基金资助项目（18JCQNJC69600）

Corresponding Authors: Li-zhou FENG E-mail: ywwang15@126.com;lzfeng15@126.com

	Service
	E-mail this article
	Add to my bookshelf
	Add to citation manager
	E-mail Alert
	RSS
	Articles by authors
	You-wei WANG
	Li-zhou FENG

Cite this article:

You-wei WANG,Li-zhou FENG. Improved AdaBoost algorithm using group degree and membership degree based noise detection and dynamic feature selection. Journal of ZheJiang University (Engineering Science), 2021, 55(2): 367-376.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2021.02.017 OR http://www.zjujournals.com/eng/Y2021/V55/I2/367

基于合群度-隶属度噪声检测及动态特征选择的改进AdaBoost算法

为了提高AdaBoost集成学习算法的数据分类性能，提出基于合群度-隶属度噪声检测及动态特征选择的改进AdaBoost算法. 综合考虑待检测样本与邻居样本的相似度及与不同类别样本集的隶属关系，引入合群度和隶属度的概念，提出新的噪声检测方法. 在此基础上，为了更好地选择那些能够有效区分错分样本的特征，在传统过滤器特征选择方法的基础上提出通用的结合样本权重的动态特征选择方法，以提高AdaBoost算法针对错分样本的分类能力. 以支持向量机作为弱分类器，在8个典型数据集上分别从噪声检测、特征选择及现有方法比较3个方面进行实验. 结果表明，所提算法充分考虑了噪声样本和样本权重对AdaBoost分类结果的影响，相对于传统算法在分类性能上获得显著提升.

关键词： 集成学习, 数据分类, 噪声检测, 特征选择, 样本权重

Fig.1 Effect of sample density on noise detection

Fig.2 Execution flowchart of proposed algorithm

Tab.1 Information of experimental datasets

Tab.2 Comparison of average F₁ values (F_a) of different noise detection algorithms

Tab.3 Comparison of consuming timeofdifferent noise detection algorithms

Tab.4 Comparison of time complexities of different feature selection methods

Fig.3 Comparison of average F₁ values（F_a）of different feature selection methods on different datasets

Tab.5 Comparison of average increments of average F₁ value of different feature selection methods

Tab.6 Comparison of average increments of average running time values of different feature selection methods

Fig.4 Comparison of average F₁ values of different algorithms under different iterations


[1]	刘金平, 何捷舟, 马天雨, 等基于KELM选择性集成的复杂网络环境入侵检测[J]. 电子学报, 2019, 47 (5): 96- 104 LIU Jin-ping, HE Jie-zhou, MA Tian-yu, et al Selective ensemble of KELM-based complex network intrusion detection[J]. Acta Electronica Sinica, 2019, 47 (5): 96- 104

[2]	GALAR M, FERNANDEZ A, BARRENECHEA E, et al A review on ensembles for the class imbalance problem: bagging, boosting, and hybrid-based approaches[J]. IEEE Transactions on Systems Man & Cybernetics Part C Applications & Reviews, 2012, 42 (4): 463- 484

[3]	FREUND Y, SCHAPIRE R E. Experiments with a new boosting algorithm [C]// Proceedings of the 13th International Conference on Machine Learning. Bari: ACM, 1996.

[4]	FREUND Y, SCHAPIRE R E A decision-theoretic generalization of online learning and an application to boosting[J]. Journal of Computer and System Sciences, 1997, 55 (1): 119- 139 doi: 10.1006/jcss.1997.1504

[5]	SCHAPIRER E, SINGER Y Improved boosting algorithms using confidence-rated predictions[J]. Machine Learning, 1999, 37 (3): 297- 336 doi: 10.1023/A:1007614523901

[6]	ZHU J, ZOU H, ROSSET S, et al Multi-class AdaBoost[J]. Statistics and its Interface, 2009, 2 (3): 349- 360 doi: 10.4310/SII.2009.v2.n3.a8

[7]	杨新武, 马壮, 袁顺基于弱分类器调整的多分类AdaBoost算法[J]. 电子与信息学报, 2016, 38 (2): 373- 380 YANG Xin-wu, MA Zhuang, YUAN Shun Multi-class AdaBoost algorithm based on the adjusted weak classifier[J]. Journal of Electronics and Information Technology, 2016, 38 (2): 373- 380

[8]	楼晓俊, 孙雨轩, 刘海涛聚类边界过采样不平衡数据分类方法[J]. 浙江大学学报: 工学版, 2013, 47 (6): 944- 950 LOU Xiao-jun, SUN Yu-xuan, LIU Hai-tao Clustering boundary over-sampling classification method for imbalanced data sets[J]. Journal of Zhejiang University: Engineering Science, 2013, 47 (6): 944- 950

[9]	CAO J, KWONG S, WANG R A noise-detection based AdaBoost algorithm for mislabeled data[J]. Pattern Recognition, 2012, 45 (1): 4451- 4465

[10]	张子祥, 陈优广基于样本噪声检测的AdaBoost算法改进[J]. 计算机系统应用, 2017, 26 (12): 186- 190 ZHANG Zi-xiang, CHEN You-guang Improvement of AdaBoost algorithm based on sample noise detection[J]. Computer Systems and Applications, 2017, 26 (12): 186- 190

[11]	YANG P, WANG D, WEI Z, et al An outlier detection approach based on improved self-organizing feature map clustering algorithm[J]. IEEE Access, 2019, 7: 115914- 115925 doi: 10.1109/ACCESS.2019.2922004

[12]	姚旭, 王晓丹, 张玉玺, 等基于随机子空间和AdaBoost的自适应集成方法[J]. 电子学报, 2013, 41 (4): 810- 814 YAO Xu, WANG Xiao-dan, ZHANG Yu-xi, et al A self-adaption ensemble algorithm based on random subspace and AdaBoost[J]. Acta Electronica Sinica, 2013, 41 (4): 810- 814 doi: 10.3969/j.issn.0372-2112.2013.04.031

[13]	曹莹, 刘家辰, 苗启广, 等 AdaBoost恶意程序行为检测新算法[J]. 西安电子科技大学学报, 2013, 40 (6): 116- 124 CAO Ying, LIU Jia-chen, MIAO Qi-guang, et al Improved behavior-based malware detection algorithm with AdaBoost[J]. Journal of Xidian University: Natural Science, 2013, 40 (6): 116- 124

[14]	SUN B, CHEN S, WANG J, et al A robust multi-class AdaBoost algorithm for mislabeled noisy data[J]. Knowledge-Based Systems, 2016, 102 (5): 87- 102

[15]	YOUSEFI M, YOUSEFI M, FERREIRA R P M, et al Chaotic genetic algorithm and AdaBoost ensemble metamodeling approach for optimum resource planning in emergency departments[J]. Artificial Intelligence in Medicine, 2018, 84: 23- 33 doi: 10.1016/j.artmed.2017.10.002

[16]	CHEN Y B, DOU P, YANG X J Improving land use/cover classification witha multiple classifier system using AdaBoost integration technique[J]. Remote Sensing, 2017, 9 (10): 1055- 1075 doi: 10.3390/rs9101055

[17]	LIN H T, LIN C J, WENG R C A note on Platt’s probabilistic outputs for support vector machines[J]. Machine Learning, 2007, 68 (3): 267- 276 doi: 10.1007/s10994-007-5018-6

[18]	刘宏伟, 黄静基于朴素贝叶斯算法的垃圾邮件网关[J]. 微计算机信息, 2006, 22 (18): 73- 75 LIU Hong-wei, HUANG Jing Spam filtering gateway based on NB algorithm[J]. Microcomputer Information, 2006, 22 (18): 73- 75 doi: 10.3969/j.issn.1008-0570.2006.18.025

[19]	YANG Y, PEDERSEN J O. A comparative study on feature selection in text categorization [C]// Proceedings of the 14th International Conference on Machine Learning. Nashville: ACM, 1997.

[20]	WANG Y W, FENG L Z, ZHU J M Novel artificial bee colony based feature selection for filtering redundant information[J]. Applied Intelligence, 2017, 48 (3): 868- 885

[21]	BOMMERT A, SUN X, BISCHL B, et al Benchmark for filter methods for feature selection in high-dimensional classification data[J]. Computational Statistics and Data Analysis, 2019, 143: 106839

[22]	YANG J, LIU Y, ZHU X, et al A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization[J]. Information Processing and Management, 2012, 48 (4): 741- 754 doi: 10.1016/j.ipm.2011.12.005

[23]	DADANEH B Z, MARKID H Y, ZAKEROLHOSSEINI A Unsupervised probabilistic feature selection using ant colony optimization[J]. Expert Systems with Applications, 2016, 53: 27- 42 doi: 10.1016/j.eswa.2016.01.021

[1]	Yun-qing HU,Qing-ying QIU,Xiu YU,Jian-wei WU. Semi-supervised patent text classification method based on improved Tri-training algorithm[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(2): 331-339.

[2]	Ya-jing WANG,Qun WANG,Bo-wen LI,Zhi-wen LIU,Yuan-yuan PIAO,Tao YU. Seizure prediction based on pre-ictal period selection of EEG signal[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(11): 2258-2265.

[3]	LIU Ru-hui, HUANG Wei-ping, WANG Kai, LIU Chuang, LIANG Jun. Semi-supervised constraint ensemble clustering by fast search and find of density peaks[J]. Journal of ZheJiang University (Engineering Science), 2018, 52(11): 2191-2200.

[4]	ZHU Xiao-en, HAO Xin, XIA Shun-ren. Feature selection algorithm based on Levy flight[J]. Journal of ZheJiang University (Engineering Science), 2013, 47(4): 638-643.

[5]	LUO Jian-hong, CHEN De-zhao. Application of adaptive ensemble algorithm based on correctness and diversity[J]. Journal of ZheJiang University (Engineering Science), 2011, 45(3): 557-562.

[6]	LI Wei-tao, ZHOU Xiao-jie, CHAI Tian-you. Gabor filter and latent semantic analysi based burning state recognition[J]. Journal of ZheJiang University (Engineering Science), 2011, 45(12): 2120-2126.

[7]	ZHANG Yu-hong, HU Xue-gang, YANG Qiu-jie. A feature selection approach suitable for data stream classification[J]. Journal of ZheJiang University (Engineering Science), 2011, 45(12): 2247-2251.

[8]	SHANG Jian, DIAO Li-Jie, YUE Heng, CHAI Tian-You. Soft sensor for ball mill load based on multisource data feature fusion[J]. Journal of ZheJiang University (Engineering Science), 2010, 44(7): 1406-1413.

[9]	XIE Jian-Fang, BO Xiao-Hong, WANG Zheng-Xiao, et al. Scheduling feature selection based on immune binary partial swarm optimization[J]. Journal of ZheJiang University (Engineering Science), 2009, 43(12): 2203-2207.

Viewed

Full text

Abstract

Cited

Shared

Discussed