J4  2013, Vol. 47 Issue (6): 944-950    DOI: 10.3785/j.issn.1008-973X.2013.06.003
    
Clustering boundary over-sampling classification method for imbalanced data sets
LOU Xiao-jun1, SUN Yu-xuan1, LIU Hai-tao1,2
1. Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China;  2. Wuxi SensingNet Industrialization Research Institute, Wuxi 214135, China

Abstract  

The synthetic minority over-sampling technique (SMOTE) is widely used for classifying imbalanced data. However, SMOTE synthesizes new samples without any guidance, which can make it sensitive to noise and prone to over-fitting. To address this problem, a novel over-sampling classification method for imbalanced data sets, called cluster boundary-synthetic minority over-sampling technique (CB-SMOTE), is proposed. A clustering consistency index is introduced to find the boundary minority samples. A k-nearest density is then defined to determine the number of new samples to synthesize and to reject noise samples, and SMOTE's rule for synthesizing new samples is modified. Because the over-sampling is guided, the new samples it generates are more beneficial for classifier learning. Six classification methods were compared on University of California Irvine (UCI) data sets. Experimental results show that the proposed method outperforms the other methods on both minority and majority samples and is more stable across different over-sampling rates.
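
For readers unfamiliar with the baseline, the following is a minimal Python sketch of plain SMOTE-style interpolation as it is commonly described, not the CB-SMOTE method of this article. The function names `k_nearest` and `smote`, the `per_sample` parameter, and the toy data are assumptions made for illustration. It is meant to show what "without any guidance" looks like in practice: every minority sample, including an isolated one that may be noise, serves as a base for new samples.

```python
import math
import random


def k_nearest(index, points, k):
    """Indices of the k points closest (Euclidean) to points[index], excluding itself."""
    dists = sorted(
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    )
    return [j for _, j in dists[:k]]


def smote(minority, per_sample, k=5, seed=0):
    """Create per_sample synthetic points for every minority sample.

    Each new point lies on the line segment between a minority sample and
    one of its k nearest minority neighbours (plain SMOTE-style interpolation).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, base in enumerate(minority):       # every sample is treated alike: no guidance
        neighbours = k_nearest(i, minority, k)
        for _ in range(per_sample):
            neighbour = minority[rng.choice(neighbours)]
            gap = rng.random()                # interpolation factor in [0, 1)
            synthetic.append(
                tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
            )
    return synthetic


# Toy minority class (illustrative data); the last point behaves like an outlier.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (5.0, 5.0)]
new_points = smote(minority, per_sample=2, k=2)
for point in new_points:
    print(point)
```

In this toy run the outlier at (5.0, 5.0) also receives synthetic neighbours, which is the kind of noise amplification the abstract attributes to unguided over-sampling.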



Published: 22 November 2013
CLC:  TP 391.4  
Cite this article:

LOU Xiao-jun, SUN Yu-xuan, LIU Hai-tao. Clustering boundary over-sampling classification method for imbalanced data sets. J4, 2013, 47(6): 944-950.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2013.06.003     OR     http://www.zjujournals.com/eng/Y2013/V47/I6/944


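Clustering boundary over-sampling classification method for imbalanced data

The traditional SMOTE over-sampling method generates synthetic samples blindly, is sensitive to noise, and is prone to over-fitting. To address these problems, an improved clustering boundary sample over-sampling method (CB-SMOTE) is proposed. A "clustering consistency coefficient" is introduced to find the boundary of the minority-class samples; the nearest-neighbor density of the boundary samples is used to remove noise points and to determine the number of synthetic samples; and the SMOTE rule for synthesizing new samples is optimized. The method is a guided over-sampling method whose synthetic samples are more favorable for classifier learning. Experiments comparing the classification performance of six different methods on UCI public data sets show that CB-SMOTE achieves high classification accuracy on both minority-class and majority-class samples and is more stable under changes in the over-sampling rate.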
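
The abstract reports accuracy separately for the minority and majority classes. The short Python sketch below illustrates why that distinction matters on imbalanced data; it is not the article's evaluation code, and the function name `per_class_accuracy` and the labels are invented for illustration. It computes, for each true class, the fraction of its samples that were predicted correctly, alongside the overall accuracy.

```python
def per_class_accuracy(y_true, y_pred):
    """Map each true label to the fraction of its samples predicted correctly."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Illustrative labels: 8 majority samples (class 0) and 2 minority samples (class 1).
y_true = [0] * 8 + [1] * 2
# A classifier that gets every majority sample right but misses one minority sample.
y_pred = [0] * 8 + [0, 1]

acc = per_class_accuracy(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc)      # accuracy for class 0 and class 1 separately
print(overall)  # single overall accuracy
```

Here the overall accuracy is 0.9 even though only half of the minority samples are recognized, which is why results on imbalanced data are usually reported per class.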

