Cluster-Based Under-Sampling with Random Forest for Multi-Class Imbalanced Classification
Abstract
Multi-class imbalanced classification has emerged as a very challenging research area in machine learning for data mining applications. It occurs when the number of training instances in the majority classes is much higher than the number in the minority classes. Existing machine learning algorithms achieve good accuracy when classifying majority class instances, but often misclassify minority class instances. However, the minority class instances hold the most vital information, and misclassifying them can lead to serious problems. Several sampling techniques combined with ensemble learning have been proposed for binary-class imbalanced classification in the last decade. In this work, we propose a new ensemble learning technique that employs cluster-based under-sampling with the random forest algorithm to deal with highly imbalanced multi-class data classification. The proposed approach clusters the majority class instances and then selects the most informative majority class instances in each cluster to form several balanced datasets. The random forest algorithm is then applied to each balanced dataset, and majority voting is used to classify test or new instances. We compared the performance of the proposed method against popular boosting and sampling-with-boosting methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 benchmark imbalanced datasets. The experimental results show that the proposed cluster-based under-sampling with random forest technique achieved high accuracy in classifying both majority and minority class instances compared with existing methods.
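The pipeline described in the abstract (cluster the larger classes, draw from every cluster to build several balanced datasets, train a random forest on each, and combine predictions by majority vote) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the abstract does not say which clustering algorithm is used or how "most informative" instances are chosen, so the sketch assumes k-means and a random draw within each cluster, proportional to cluster size. The function names `balanced_subsets`, `fit_ensemble`, and `predict_majority`, and all parameter values, are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier


def balanced_subsets(X, y, n_subsets=5, n_clusters=3, seed=0):
    """Yield roughly balanced (X, y) subsets via cluster-based under-sampling.

    Assumptions (not from the abstract): k-means clustering, and a random
    draw inside each cluster standing in for "most informative" selection.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()  # size of the smallest class
    for _ in range(n_subsets):
        keep = []
        for cls, count in zip(classes, counts):
            idx = np.flatnonzero(y == cls)
            if count == target:
                keep.append(idx)  # smallest class is kept in full
                continue
            k = min(n_clusters, target)
            km = KMeans(n_clusters=k, n_init=10,
                        random_state=int(rng.integers(1 << 31)))
            labels = km.fit_predict(X[idx])
            for c in range(k):
                members = idx[labels == c]
                if len(members) == 0:
                    continue
                # Quota proportional to cluster size, at least one instance.
                quota = max(1, int(round(target * len(members) / count)))
                quota = min(quota, len(members))
                keep.append(rng.choice(members, size=quota, replace=False))
        chosen = np.concatenate(keep)
        yield X[chosen], y[chosen]


def fit_ensemble(X, y, n_subsets=5, seed=0):
    """Train one random forest per balanced subset."""
    models = []
    for i, (Xb, yb) in enumerate(balanced_subsets(X, y, n_subsets, seed=seed)):
        rf = RandomForestClassifier(n_estimators=50, random_state=seed + i)
        models.append(rf.fit(Xb, yb))
    return models


def predict_majority(models, X):
    """Combine the forests by majority vote (ties go to the smallest label)."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    classes = np.unique(votes)
    tallies = np.stack([(votes == c).sum(axis=0) for c in classes])
    return classes[tallies.argmax(axis=0)]
```

Calling `fit_ensemble(X_train, y_train)` followed by `predict_majority(models, X_test)` would give the ensemble prediction for NumPy arrays `X_train`, `y_train`, and `X_test`. Because the quotas are rounded per cluster, each subset is only approximately balanced; a different selection criterion (for example, distance to the cluster centroid) could replace the random draw without changing the rest of the sketch.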
Collections
- M.Sc Thesis/Project