Journal of Zhejiang University (Engineering Science)  2021, Vol. 55 Issue (2): 367-376    DOI: 10.3785/j.issn.1008-973X.2021.02.017
Computer and Control Engineering
Improved AdaBoost algorithm using group degree and membership degree based noise detection and dynamic feature selection
You-wei WANG¹, Li-zhou FENG²,*
1. School of Information, Central University of Finance and Economics, Beijing 100081, China
2. School of Statistics, Tianjin University of Finance and Economics, Tianjin 300222, China
Abstract:

An improved AdaBoost algorithm based on group degree and membership degree noise detection and dynamic feature selection was proposed in order to improve the data classification performance of the AdaBoost ensemble learning algorithm. The similarity between a sample under test and its neighbor samples, together with its membership relationship to the sample sets of different categories, was considered comprehensively; the concepts of group degree and membership degree were introduced, and a new noise detection method was proposed. On this basis, in order to better select the features that can effectively distinguish misclassified samples, a general dynamic feature selection method combining sample weights was proposed on the basis of traditional filter feature selection methods, improving the classification ability of the AdaBoost algorithm on misclassified samples. Experiments were carried out on eight typical datasets with a support vector machine as the weak classifier, covering three aspects: noise detection, feature selection, and comparison with existing methods. The results show that the proposed algorithm fully considers the influence of noisy samples and sample weights on the classification results of AdaBoost and achieves a significant improvement in classification performance over traditional algorithms.

Key words: ensemble learning; data classification; noise detection; feature selection; sample weight
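
As a rough illustration of the weight-aware feature selection idea described in the abstract, the sketch below computes information gain with sample weights in place of raw counts, then re-ranks two features after one misclassified sample has been up-weighted. The weighting formula, the function names (`weighted_entropy`, `weighted_info_gain`, `rank_features`) and the toy data are assumptions made for illustration only; this page does not give the paper's definitions of its weighted filters, group degree or membership degree.

```python
import math
from collections import defaultdict


def weighted_entropy(labels, weights):
    """Shannon entropy (bits) of the labels, each sample counted by its weight."""
    total = sum(weights)
    mass = defaultdict(float)
    for y, w in zip(labels, weights):
        mass[y] += w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)


def weighted_info_gain(feature, labels, weights):
    """Information gain of one discrete feature under the given sample weights."""
    total = sum(weights)
    groups = defaultdict(lambda: ([], []))  # feature value -> (labels, weights)
    for x, y, w in zip(feature, labels, weights):
        groups[x][0].append(y)
        groups[x][1].append(w)
    conditional = sum(
        (sum(ws) / total) * weighted_entropy(ys, ws) for ys, ws in groups.values()
    )
    return weighted_entropy(labels, weights) - conditional


def rank_features(columns, labels, weights):
    """Return feature indices sorted from highest to lowest weighted gain."""
    scores = [weighted_info_gain(col, labels, weights) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


# Toy data (hypothetical): feature 0 separates the classes except for the
# last sample; feature 1 gets the last sample right but errs on two others.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 1, 1],
]
uniform = [1.0] * 8
boosted = [1.0] * 7 + [8.0]  # last sample up-weighted, as after a boosting round

print(rank_features(columns, labels, uniform))
print(rank_features(columns, labels, boosted))
```

With uniform weights the first feature ranks higher; once the hard sample carries most of the weight, the second feature overtakes it, which is the kind of per-round re-selection the abstract attributes to the dynamic method.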
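Received: 2020-03-12    Published: 2021-03-09
CLC:  TP 311
Funding: National Natural Science Foundation of China (61906220); Humanities and Social Sciences Project of the Ministry of Education (19YJCZH178); National Social Science Foundation of China (18CTJ008); Natural Science Foundation of Tianjin (18JCQNJC69600)
Corresponding author: Li-zhou FENG     E-mail: ywwang15@126.com; lzfeng15@126.com
About the author: You-wei WANG (1987-), male, associate professor, Ph.D., engaged in research on machine learning and data mining. orcid.org/0000-0002-3925-3422. E-mail: ywwang15@126.com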

Cite this article:

You-wei WANG, Li-zhou FENG. Improved AdaBoost algorithm using group degree and membership degree based noise detection and dynamic feature selection[J]. Journal of Zhejiang University (Engineering Science), 2021, 55(2): 367-376.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2021.02.017
http://www.zjujournals.com/eng/CN/Y2021/V55/I2/367

Fig. 1  Effect of sample density on noise detection
Fig. 2  Execution flow of the proposed algorithm
Dataset      Samples  Features  Classes  Largest class size  Smallest class size
Spambase     4601     58        2        2788                1813
AD           3279     1558      2        2820                459
KDD99        1951     42        6        1407                2
DrivFace     606      6400      3        546                 3
Arrhythmia   452      279       16       245                 2
AntiVirus    371      531       2        301                 70
Dermatology  366      34        6        112                 20
Amazon       300      10000     10       30                  30
Table 1  Information on the experimental datasets
Dataset      Fa
             Ref. [8]  Ref. [9]  Ref. [10]  Ref. [11]  This study
Spambase     0.849     0.864     0.867      0.867      0.875
AD           0.952     0.953     0.955      0.951      0.964
KDD99        0.987     0.992     0.985      0.985      0.991
DrivFace     0.963     0.965     0.972      0.971      0.979
Arrhythmia   0.606     0.602     0.618      0.598      0.626
AntiVirus    0.988     0.991     0.992      0.986      0.991
Dermatology  0.959     0.965     0.976      0.964      0.981
Amazon       0.783     0.783     0.786      0.787      0.792
Table 2  Comparison of mean F1 (Fa) for different noise detection algorithms
Dataset      ts (s)
             Ref. [8]  Ref. [9]  Ref. [10]  Ref. [11]  This study
Spambase     2.215     2.852     2.882      67.287     2.583
AD           0.646     0.935     1.051      252.627    0.917
KDD99        0.342     0.472     0.542      21.612     0.433
DrivFace     0.086     0.102     0.112      367.372    0.089
Arrhythmia   0.035     0.043     0.052      18.223     0.038
AntiVirus    0.021     0.034     0.039      18.658     0.031
Dermatology  0.023     0.041     0.036      5.329      0.033
Amazon       0.036     0.089     0.083      88.173     0.077
Table 3  Comparison of running time for different noise detection algorithms
Feature selection method  Time complexity
IG       $O\left(N\left(M + L + ML + \log_2 N\right)\right)$
CHI      $O\left(N\left(M + 3L + 2ML + \log_2 N\right)\right)$
MRMR     $O\left(\sum\limits_{S=0}^{N_1-1} S\left(N-S\right)\left(2M+2L\right)\right)$
CMFS     $O\left(N\left(M + L + ML + \log_2 N\right)\right)$
IGW      $O\left(N\left(2M + L + ML + \log_2 N\right)\right)$
CHIW     $O\left(N\left(2M + 3L + ML + \log_2 N\right)\right)$
MRMRW    $O\left(\sum\limits_{S=0}^{N_1-1} \left(N-S\right)\left(S\left(2M+2L\right)+M\right)\right)$
CMFSW    $O\left(N\left(2M + L + ML + \log_2 N\right)\right)$
Table 4  Comparison of time complexity for different feature selection methods
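
To get a feel for the expressions in Table 4, the sketch below evaluates four of them numerically. The symbols are not defined on this page, so the reading used here is an assumption: N as the number of features, M as the number of samples, L as the number of classes, and N1 as the number of features to select. The sizes for AD come from Table 1 under that reading; the value of `N1` and all function names are hypothetical.

```python
import math


def cost_ig(N, M, L):
    """Operation count inside the O(...) given for IG."""
    return N * (M + L + M * L + math.log2(N))


def cost_igw(N, M, L):
    """Operation count inside the O(...) given for IGW."""
    return N * (2 * M + L + M * L + math.log2(N))


def cost_mrmr(N, M, L, N1):
    """Operation count inside the O(...) given for MRMR."""
    return sum(S * (N - S) * (2 * M + 2 * L) for S in range(N1))


def cost_mrmrw(N, M, L, N1):
    """Operation count inside the O(...) given for MRMRW."""
    return sum((N - S) * (S * (2 * M + 2 * L) + M) for S in range(N1))


# AD dataset sizes from Table 1, under the assumed meaning of the symbols.
N, M, L = 1558, 3279, 2
N1 = 100  # hypothetical number of selected features

print("IGW / IG     :", cost_igw(N, M, L) / cost_ig(N, M, L))
print("MRMRW / MRMR :", cost_mrmrw(N, M, L, N1) / cost_mrmr(N, M, L, N1))
```

Term by term, the IGW expression exceeds the IG expression by exactly N·M, and the MRMRW sum exceeds the MRMR sum by M·(N − S) at each step S, so in these expressions the weighted variants add one pass over the M term rather than changing the order of growth.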
Fig. 3  Comparison of mean F1 (Fa) of different feature selection methods on different datasets
Dataset       Fai
              IGW vs IG  CHIW vs CHI  MRMRW vs MRMR  CMFSW vs CMFS
Spambase      -0.065     0.007        0.004          0.002
AD            0.004      0.028        0.012          0.002
KDD99         0.061      0.082        0.000          0.078
DrivFace      0.020      0.118        0.012          0.059
Arrhythmia    0.061      0.036        0.009          0.010
AntiVirus     -0.002     0.024        0.002          0.011
Dermatology   0.077      0.314        0.023          0.015
Amazon        0.073      0.029        0.034          0.067
All datasets  0.028      0.079        0.012          0.030
Table 5  Comparison of the mean increase in mean F1 for different feature selection methods
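
The "All datasets" row of Table 5 appears to be the plain average of the eight per-dataset rows. The short check below recomputes it under that assumption, taking the Spambase and AntiVirus entries in the IGW vs IG column as negative (also an assumption); the variable and function names are illustrative.

```python
# Per-dataset values copied from Table 5, in the column order
# IGW vs IG, CHIW vs CHI, MRMRW vs MRMR, CMFSW vs CMFS.
# Assumption: the Spambase and AntiVirus IGW entries are negative.
rows = [
    (-0.065, 0.007, 0.004, 0.002),  # Spambase
    (0.004, 0.028, 0.012, 0.002),   # AD
    (0.061, 0.082, 0.000, 0.078),   # KDD99
    (0.020, 0.118, 0.012, 0.059),   # DrivFace
    (0.061, 0.036, 0.009, 0.010),   # Arrhythmia
    (-0.002, 0.024, 0.002, 0.011),  # AntiVirus
    (0.077, 0.314, 0.023, 0.015),   # Dermatology
    (0.073, 0.029, 0.034, 0.067),   # Amazon
]
reported = (0.028, 0.079, 0.012, 0.030)  # "All datasets" row
names = ("IGW vs IG", "CHIW vs CHI", "MRMRW vs MRMR", "CMFSW vs CMFS")


def column_means(table):
    """Average each column of a list of equal-length rows."""
    n = len(table)
    return [sum(col) / n for col in zip(*table)]


for name, mean, rep in zip(names, column_means(rows), reported):
    print(f"{name:14s} recomputed={mean:.4f} reported={rep:.3f}")
```

Because the per-dataset entries are themselves rounded to three decimals, a recomputed mean can differ from the reported one in the last digit.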
Dataset       rai
              IGW vs IG  CHIW vs CHI  MRMRW vs MRMR  CMFSW vs CMFS
Spambase      0.007      0.011        0.013          0.016
AD            0.126      0.131        26.927         0.107
KDD99         0.013      0.016        0.005          0.018
DrivFace      0.806      1.275        4.572          0.586
Arrhythmia    0.172      0.265        0.004          0.052
AntiVirus     0.005      0.008        0.006          0.001
Dermatology   0.000      0.001        0.000          0.000
Amazon        0.412      0.307        53.862         0.465
All datasets  0.192      0.251        10.673         0.155
Table 6  Comparison of the mean increase in average running time for different feature selection methods
Fig. 4  Comparison of mean F1 of different algorithms under different numbers of iterations
1 LIU Jin-ping, HE Jie-zhou, MA Tian-yu, et al. Selective ensemble of KELM-based complex network intrusion detection[J]. Acta Electronica Sinica, 2019, 47(5): 96-104
2 GALAR M, FERNANDEZ A, BARRENECHEA E, et al. A review on ensembles for the class imbalance problem: bagging, boosting, and hybrid-based approaches[J]. IEEE Transactions on Systems Man & Cybernetics Part C Applications & Reviews, 2012, 42(4): 463-484
3 FREUND Y, SCHAPIRE R E. Experiments with a new boosting algorithm[C]// Proceedings of the 13th International Conference on Machine Learning. Bari: ACM, 1996
4 FREUND Y, SCHAPIRE R E. A decision-theoretic generalization of online learning and an application to boosting[J]. Journal of Computer and System Sciences, 1997, 55(1): 119-139
doi: 10.1006/jcss.1997.1504
5 SCHAPIRE R E, SINGER Y. Improved boosting algorithms using confidence-rated predictions[J]. Machine Learning, 1999, 37(3): 297-336
doi: 10.1023/A:1007614523901
6 ZHU J, ZOU H, ROSSET S, et al. Multi-class AdaBoost[J]. Statistics and its Interface, 2009, 2(3): 349-360
doi: 10.4310/SII.2009.v2.n3.a8
7 YANG Xin-wu, MA Zhuang, YUAN Shun. Multi-class AdaBoost algorithm based on the adjusted weak classifier[J]. Journal of Electronics and Information Technology, 2016, 38(2): 373-380
8 LOU Xiao-jun, SUN Yu-xuan, LIU Hai-tao. Clustering boundary over-sampling classification method for imbalanced data sets[J]. Journal of Zhejiang University: Engineering Science, 2013, 47(6): 944-950
9 CAO J, KWONG S, WANG R. A noise-detection based AdaBoost algorithm for mislabeled data[J]. Pattern Recognition, 2012, 45(1): 4451-4465
10 ZHANG Zi-xiang, CHEN You-guang. Improvement of AdaBoost algorithm based on sample noise detection[J]. Computer Systems and Applications, 2017, 26(12): 186-190
11 YANG P, WANG D, WEI Z, et al. An outlier detection approach based on improved self-organizing feature map clustering algorithm[J]. IEEE Access, 2019, 7: 115914-115925
doi: 10.1109/ACCESS.2019.2922004
12 YAO Xu, WANG Xiao-dan, ZHANG Yu-xi, et al. A self-adaption ensemble algorithm based on random subspace and AdaBoost[J]. Acta Electronica Sinica, 2013, 41(4): 810-814
doi: 10.3969/j.issn.0372-2112.2013.04.031
13 CAO Ying, LIU Jia-chen, MIAO Qi-guang, et al. Improved behavior-based malware detection algorithm with AdaBoost[J]. Journal of Xidian University: Natural Science, 2013, 40(6): 116-124
14 SUN B, CHEN S, WANG J, et al. A robust multi-class AdaBoost algorithm for mislabeled noisy data[J]. Knowledge-Based Systems, 2016, 102(5): 87-102
15 YOUSEFI M, YOUSEFI M, FERREIRA R P M, et al. Chaotic genetic algorithm and AdaBoost ensemble metamodeling approach for optimum resource planning in emergency departments[J]. Artificial Intelligence in Medicine, 2018, 84: 23-33
doi: 10.1016/j.artmed.2017.10.002
16 CHEN Y B, DOU P, YANG X J. Improving land use/cover classification with a multiple classifier system using AdaBoost integration technique[J]. Remote Sensing, 2017, 9(10): 1055-1075
doi: 10.3390/rs9101055
17 LIN H T, LIN C J, WENG R C. A note on Platt’s probabilistic outputs for support vector machines[J]. Machine Learning, 2007, 68(3): 267-276
doi: 10.1007/s10994-007-5018-6
18 LIU Hong-wei, HUANG Jing. Spam filtering gateway based on NB algorithm[J]. Microcomputer Information, 2006, 22(18): 73-75
doi: 10.3969/j.issn.1008-0570.2006.18.025
19 YANG Y, PEDERSEN J O. A comparative study on feature selection in text categorization[C]// Proceedings of the 14th International Conference on Machine Learning. Nashville: ACM, 1997
20 WANG Y W, FENG L Z, ZHU J M. Novel artificial bee colony based feature selection for filtering redundant information[J]. Applied Intelligence, 2017, 48(3): 868-885
21 BOMMERT A, SUN X, BISCHL B, et al. Benchmark for filter methods for feature selection in high-dimensional classification data[J]. Computational Statistics and Data Analysis, 2019, 143: 106839
22 YANG J, LIU Y, ZHU X, et al. A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization[J]. Information Processing and Management, 2012, 48(4): 741-754
doi: 10.1016/j.ipm.2011.12.005
23 DADANEH B Z, MARKID H Y, ZAKEROLHOSSEINI A. Unsupervised probabilistic feature selection using ant colony optimization[J]. Expert Systems with Applications, 2016, 53: 27-42
doi: 10.1016/j.eswa.2016.01.021