Journal of Zhejiang University (Engineering Science)  2013, Vol. 47 Issue (6): 944-950    DOI: 10.3785/j.issn.1008-973X.2013.06.003
Computer Technology
Clustering boundary over-sampling classification method for imbalanced data sets
LOU Xiao-jun1, SUN Yu-xuan1, LIU Hai-tao1,2
1. Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China;  2. Wuxi SensingNet Industrialization Research Institute, Wuxi 214135, China
Abstract:

The synthetic minority over-sampling technique (SMOTE) is widely used for imbalanced data classification. However, SMOTE synthesizes new samples without any guidance, which makes it sensitive to noise and prone to over-fitting. To resolve this problem, a novel over-sampling classification method for imbalanced data sets, called the cluster boundary synthetic minority over-sampling technique (CB-SMOTE), was proposed. A clustering consistency index was introduced to identify the boundary samples of the minority class. A k-nearest-neighbor density was then defined to reject noise samples and to determine the number of synthetic samples, and the SMOTE rule for synthesizing new samples was modified accordingly. CB-SMOTE is therefore a guided over-sampling method, and the samples it generates are more beneficial for classifier learning. Six classification methods were compared on University of California Irvine (UCI) data sets. Experimental results show that the proposed method outperforms the other methods on both the minority and majority classes, and that it is more stable across different over-sampling rates.
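The abstract outlines the boundary-detection step but does not give the exact formula for the clustering consistency index. The Python sketch below illustrates one plausible reading, assuming scikit-learn is available: cluster the whole training set with k-means and flag a minority sample as a boundary point when its k nearest neighbours are split across clusters. The function name boundary_minority_samples, the choice of k-means, and the threshold parameter are illustrative assumptions, not the authors' definition.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def boundary_minority_samples(X, y, minority_label=1, n_clusters=5, k=5, threshold=0.6):
    """Indices of minority samples whose k-NN neighbourhood is cluster-inconsistent."""
    # Cluster all samples; boundary points tend to lie where clusters meet.
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # k + 1 neighbours because each query point is returned as its own neighbour.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)

    boundary = []
    for i in np.where(y == minority_label)[0]:
        neighbours = idx[i, 1:]                        # drop the sample itself
        consistency = np.mean(clusters[neighbours] == clusters[i])
        if consistency < threshold:                    # mixed neighbourhood: boundary candidate
            boundary.append(i)
    return np.array(boundary, dtype=int)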

Key words: imbalanced data sets    over-sampling    clustering boundary    k-nearest density    synthetic samples
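As a companion to the sketch above, the following code illustrates the remaining two steps the abstract describes: rejecting noise points and allocating the synthesis budget according to a k-nearest-neighbour density, then interpolating between each retained boundary sample and a random minority-class neighbour in the usual SMOTE fashion. The density definition (inverse mean distance to the k nearest neighbours), the noise rule (no minority point among the k nearest neighbours), and the proportional budget split are hypothetical stand-ins for the paper's exact rules.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_boundary(X, y, boundary_idx, minority_label=1, k=5, n_new=100, seed=0):
    """Generate n_new synthetic minority samples around the given boundary points."""
    rng = np.random.default_rng(seed)
    minority_mask = (y == minority_label)
    X_min = X[minority_mask]

    # Neighbourhoods in the full data set are used both to reject noise and to
    # compute a (hypothetical) k-nearest-neighbour density for each boundary point.
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X[boundary_idx])
    has_minority_neighbour = minority_mask[idx[:, 1:]].any(axis=1)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)

    kept = np.asarray(boundary_idx)[has_minority_neighbour]    # noise points dropped
    weights = density[has_minority_neighbour]
    counts = rng.multinomial(n_new, weights / weights.sum())   # synthesis budget per point

    # Interpolation partners come from the minority class only, as in plain SMOTE.
    _, min_idx = NearestNeighbors(n_neighbors=min(k, len(X_min) - 1) + 1).fit(X_min).kneighbors(X[kept])

    synthetic = []
    for row, c in enumerate(counts):
        for _ in range(c):
            partner = X_min[rng.choice(min_idx[row, 1:])]      # random minority neighbour
            gap = rng.random()                                 # position on the connecting segment
            synthetic.append(X[kept[row]] + gap * (partner - X[kept[row]]))
    return np.asarray(synthetic)

In a typical experiment the two functions would be chained, e.g. boundary_idx = boundary_minority_samples(X, y) followed by X_syn = oversample_boundary(X, y, boundary_idx), and the training set augmented with X_syn would then be passed to any standard classifier; the actual CB-SMOTE rules may of course differ from these simplified choices.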
Published: 2013-07-09
CLC number: TP 391.4
Supported by: the National Basic Research Program (973 Program) of China (No. 2011CB302906) and the National Science and Technology Major Project of China (No. 2010ZX03006-004).

Corresponding author: LIU Hai-tao, male, research fellow. E-mail: liuhaitao@wsn.cn
About the author: LOU Xiao-jun (1984-), male, Ph.D. candidate, engaged in research on information processing and pattern recognition in sensor networks. E-mail: louxjanan@gmail.com

Cite this article:


LOU Xiao-jun, SUN Yu-xuan, LIU Hai-tao. Clustering boundary over-sampling classification method for imbalanced data sets[J]. Journal of Zhejiang University (Engineering Science), 2013, 47(6): 944-950.

Link to this article:

http://www.zjujournals.com/xueshu/eng/CN/10.3785/j.issn.1008-973X.2013.06.003        http://www.zjujournals.com/xueshu/eng/CN/Y2013/V47/I6/944
