Abstract:
Imbalanced data sets have an unequal distribution of samples among the classes,
which challenges learning algorithms because the minority class concepts become hard
to learn. Synthetic oversampling techniques address this problem by generating
synthetic minority samples to balance the distribution between the majority and
minority classes. This thesis observes that most existing synthetic oversampling
techniques can generate erroneous synthetic samples in some scenarios, which makes the
learning task harder. To address this, the thesis presents a new synthetic oversampling
method, called Majority Weighted Minority Oversampling Technique (MWMOTE),
for handling imbalanced data sets efficiently. The term ‘majority weighted minority
oversampling’ means that the minority samples important for oversampling are identified,
weighted using their nearest majority samples, and then used for oversampling. To do
this, MWMOTE uses information from both the minority and majority samples in the
data set. First, it identifies hard-to-learn informative minority samples and assigns them
weights according to their importance using distance information from the nearest ma-
jority samples. MWMOTE then identifies the clusters in the minority data set and uses
weighted informative minority samples to generate synthetic samples inside the clusters.
This is done in order to ensure that generated samples always lie inside some minority
cluster and do not overlap with majority regions.
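
The weighting and within-cluster generation steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the inverse-distance weighting, the externally supplied cluster labels (cluster_ids), and all function names are assumptions made for illustration. Interpolating between two members of the same cluster keeps each synthetic sample on the segment joining them, which approximates, but does not by itself guarantee, the separation from majority regions that MWMOTE aims for.

```python
import math
import random


def nearest_majority_distance(x, majority):
    """Euclidean distance from minority sample x to its closest majority sample."""
    return min(math.dist(x, m) for m in majority)


def selection_weights(minority, majority):
    """Assumed weighting: samples closer to the majority class get larger weights."""
    raw = [1.0 / (nearest_majority_distance(x, majority) + 1e-12) for x in minority]
    total = sum(raw)
    return [w / total for w in raw]  # normalised to sum to 1


def oversample(minority, majority, cluster_ids, n_new, seed=0):
    """Generate n_new synthetic samples, each between two members of one cluster.

    cluster_ids[i] is the (hypothetical, precomputed) cluster label of minority[i].
    """
    rng = random.Random(seed)
    weights = selection_weights(minority, majority)
    synthetic = []
    for _ in range(n_new):
        # Pick a seed sample with probability proportional to its weight.
        i = rng.choices(range(len(minority)), weights=weights, k=1)[0]
        # Pick a partner from the same minority cluster (possibly the seed itself).
        mates = [j for j, c in enumerate(cluster_ids) if c == cluster_ids[i]]
        j = rng.choice(mates)
        # Linear interpolation between the seed and its partner.
        alpha = rng.random()
        x, y = minority[i], minority[j]
        synthetic.append(tuple(a + alpha * (b - a) for a, b in zip(x, y)))
    return synthetic


# Toy data: two minority clusters and two majority samples (illustrative only).
minority = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
cluster_ids = [0, 0, 1, 1]
majority = [(0.5, 0.5), (2.0, 2.0)]
new_points = oversample(minority, majority, cluster_ids, n_new=6)
```

In this toy setting, every element of new_points falls inside the bounding box of one of the two minority clusters, because each is a convex combination of two points from the same cluster.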
Finally, the thesis presents a new stand-alone ensemble algorithm, called MWMOTE-Boost,
which integrates MWMOTE into the well-known AdaBoost.M2 boosting procedure by
performing MWMOTE oversampling inside the boosting iteration of AdaBoost.M2. The
integration is similar to that of the recent state-of-the-art RAMOBoost algorithm,
except that MWMOTE replaces RAMOBoost’s RAMO oversampling procedure. The proposed
methods, MWMOTE and MWMOTE-Boost, have been evaluated extensively on four artificial
and seventeen real-world data sets using several classifier models, such as neural
networks, decision trees, k-nearest neighbors, and ensemble classifiers. The simulation results
show that MWMOTE and MWMOTE-Boost perform better than or comparably to some other
existing methods in terms of various assessment metrics, such as precision, recall,
F-measure, G-mean, and the area under the receiver operating characteristic (ROC)
curve, commonly known as the area under the curve (AUC).
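
For reference, the threshold-based metrics named above can be computed from a binary confusion matrix as in the following sketch. It assumes the minority class is treated as the positive class, uses the balanced F-measure (beta = 1), and takes G-mean as the geometric mean of recall and specificity; the thesis may use a differently parameterised variant. The function name and the example counts are hypothetical. AUC is omitted because it requires classifier scores rather than counts.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def imbalance_metrics(tp, fn, fp, tn):
    """Precision, recall, F-measure (beta = 1) and G-mean for the positive class."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)        # true positive rate (minority accuracy)
    specificity = _ratio(tn, tn + fp)   # true negative rate (majority accuracy)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical counts: 50 minority (positive) and 950 majority (negative) samples.
m = imbalance_metrics(tp=30, fn=20, fp=10, tn=940)
```

Unlike overall accuracy, both F-measure and G-mean drop sharply when the minority class is poorly recognised, which is why they are common choices for imbalanced problems.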