J4  2013, Vol. 47 Issue (6): 944-950    DOI: 10.3785/j.issn.1008-973X.2013.06.003
Computer Technology
Clustering boundary over-sampling classification method for imbalanced data sets
LOU Xiao-jun1, SUN Yu-xuan1, LIU Hai-tao1,2
1. Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China;
2. Wuxi SensingNet Industrialization Research Institute, Wuxi 214135, China
Abstract:

The synthetic minority over-sampling technique (SMOTE) is widely used for imbalanced data classification. However, SMOTE generates synthetic samples blindly, which makes it sensitive to noise and prone to over-fitting. To address these problems, an improved over-sampling method, the cluster boundary synthetic minority over-sampling technique (CB-SMOTE), is proposed. A clustering consistency index is introduced to find the boundary samples of the minority class. The k-nearest density of the boundary samples is then used to reject noise samples and to determine how many synthetic samples to generate, and the SMOTE rule for synthesizing new samples is modified accordingly. Because the over-sampling is guided, the synthetic samples are more useful for classifier learning. Six methods were compared on University of California Irvine (UCI) public data sets. The experimental results show that CB-SMOTE outperforms the other methods in classification accuracy on both minority and majority samples, and that it is more stable as the over-sampling rate varies.
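For orientation, the baseline SMOTE synthesis rule that the abstract criticizes can be sketched as follows. This is a minimal illustration of standard SMOTE only, not of CB-SMOTE: the abstract does not give the definitions of the clustering consistency index or the k-nearest density, so they are not reproduced here. The function names (`k_nearest`, `smote`), the parameters, and the toy data are illustrative assumptions.

```python
import math
import random


def k_nearest(idx, points, k):
    """Return indices of the k points closest to points[idx], excluding itself."""
    dists = [
        (math.dist(points[idx], p), j)
        for j, p in enumerate(points)
        if j != idx
    ]
    dists.sort()
    return [j for _, j in dists[:k]]


def smote(minority, n_per_sample=2, k=3, seed=0):
    """Baseline SMOTE: for every minority sample, interpolate toward a randomly
    chosen one of its k nearest minority neighbors. Every sample is treated the
    same way, with no boundary detection and no noise rejection."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        neighbors = k_nearest(i, minority, k)
        for _ in range(n_per_sample):
            nn = minority[rng.choice(neighbors)]
            gap = rng.random()  # uniform in [0, 1)
            # New point lies on the segment between x and its neighbor.
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic


# Toy minority class (hypothetical data, two features per sample).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_per_sample=2, k=3)
print(len(new_points))  # 8 synthetic samples: 4 minority samples x 2 each
```

Because every minority sample receives the same number of synthetic points regardless of where it sits, a noisy sample deep inside the majority region is amplified just like any other. According to the abstract, this is the behavior that CB-SMOTE changes by restricting synthesis to boundary samples and setting the per-sample count from a density measure.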

Published: 2013-11-22
CLC number: TP 391.4
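Funding:

Supported by the National Basic Research Program of China (973 Program) (2011CB302906) and the National Science and Technology Major Project of China (2010ZX03006-004).

Corresponding author: LIU Hai-tao, male, researcher. E-mail: liuhaitao@wsn.cn
About the author: LOU Xiao-jun (1984-), male, Ph.D. candidate, engaged in research on information processing in sensor networks and pattern recognition. E-mail: louxjanan@gmail.com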

Cite this article:

LOU Xiao-jun, SUN Yu-xuan, LIU Hai-tao. Clustering boundary over-sampling classification method for imbalanced data sets [J]. J4, 2013, 47(6): 944-950.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2013.06.003
http://www.zjujournals.com/eng/CN/Y2013/V47/I6/944

