Quick Comparison Between Random Oversampling and SMOTE

Collinloo
3 min read · Sep 29, 2020


During a recent class project on binary classification, the sample data had a class imbalance issue: positive labels made up only about 15% of the observations.

There are a few ways to deal with class imbalance, such as setting the class weight parameter in Logistic Regression. Random oversampling and the Synthetic Minority Oversampling Technique (SMOTE) are two other common approaches. The purpose of this writing is to share findings on how random oversampling and SMOTE affect the Area Under the Curve (AUC) scores of several supervised learning methods.
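As a quick illustration of the class-weight approach, here is a minimal sketch (assuming X_train and y_train are already defined) that lets scikit-learn reweight the classes inversely to their frequencies:

from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely proportional to its frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)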

Here is a good link on class imbalances. Additional resources can be found at the bottom of the article.

Data Cleaning & Pre-modeling

  • Dropped irrelevant columns.
  • Changed the target label from text to binary, 0 & 1.
  • Filled null values, if any.
  • Dropped columns showing high correlation. Four columns showed 100% correlation; left untreated, they could hurt the Logistic Regression's performance.
  • One-hot encoded the categorical columns (see the sketch after this list).
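For reference, the pre-modeling steps roughly translate into the sketch below, assuming the raw data is loaded into a DataFrame df. The column lists (drop_cols, corr_cols, cat_cols) and label values are placeholders, not the project's actual columns.

import pandas as pd

# hypothetical column lists -- the real project columns differ
drop_cols = ['id', 'notes']          # irrelevant columns
corr_cols = ['dup_1', 'dup_2']       # 100%-correlated duplicates
cat_cols = ['state', 'plan_type']    # categorical columns

df = df.drop(columns=drop_cols + corr_cols)
df['target'] = df['target'].map({'no': 0, 'yes': 1})   # text label -> 0/1
df = df.fillna(0)                                       # fill nulls, if any
df_ohe = pd.get_dummies(df, columns=cat_cols)           # one-hot encoding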

Classification Algorithms

Logistic Regression, K-Nearest Neighbors, Random Forest and XGBoost were selected for evaluation. The one that produces the best AUC score will be our final model.

Baseline Model

After separating the dataset into train and test sets, we fit each model with its default classifier parameters. The figure shows the ROC curves and AUC scores for the four classifiers.
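A rough sketch of the baseline run, assuming the one-hot-encoded features live in X_ohe and the binary target in y (the split parameters are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# the four classifiers under evaluation, with default parameters
clfs = [LogisticRegression(max_iter=1000), KNeighborsClassifier(),
        RandomForestClassifier(), XGBClassifier()]

X_train_ohe, X_test_ohe, y_train, y_test = train_test_split(
    X_ohe, y, test_size=0.25, random_state=36, stratify=y)

for clf in clfs:
    clf.fit(X_train_ohe, y_train)
    y_hat_test = clf.predict(X_test_ohe)
    print(type(clf).__name__, round(roc_auc_score(y_test, y_hat_test), 3))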

Sampling with SMOTE

SMOTE comes with several tuning parameters. The 'sampling_strategy' parameter controls how large the minority class becomes relative to the majority class after resampling.
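For instance, with the training set from above, a sampling_strategy of 0.5 asks SMOTE to synthesize minority samples until the minority class is roughly half the size of the majority class:

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.5, random_state=36)
X_res, y_res = smote.fit_resample(X_train_ohe, y_train)
# after resampling, minority count ~ 0.5 * majority count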

The function below loops through a list of ratios and, for each classifier, reports back the ratio that delivers the highest AUC score.

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score

def eval_smote_size(X_train, y_train, X_test, y_test, clfs):
    # list of minority-to-majority ratios to test
    ratios = [0.25, 0.33, 0.5, 0.7, 1]
    names = ['0.25', '0.33', '0.5', '0.7', '1']
    score_dict = {}

    for n, ratio in enumerate(ratios):
        smote = SMOTE(sampling_strategy=ratio, random_state=36)
        X_train_resamp, y_train_resamp = smote.fit_resample(X_train, y_train)
        temp = []
        # loop through the list of classifiers and calculate the AUC score
        for clf in clfs:
            clf.fit(X_train_resamp, y_train_resamp)
            y_hat_test = clf.predict(X_test)
            auc = roc_auc_score(y_test, y_hat_test)
            # collect the AUC obtained at this ratio
            temp.append(auc)
        # update dict: one entry of scores per ratio
        score_dict[names[n]] = temp

    # convert score dict to a DataFrame (rows = ratios, columns = classifiers)
    pd_col = [type(x).__name__ for x in clfs]
    df_auc = pd.DataFrame.from_dict(score_dict).T
    df_auc.columns = pd_col

    # for each classifier, keep the ratio that gives the highest AUC
    best_ratio = {}
    for name in df_auc.columns:
        max_v = df_auc[name].max()
        best_ratio[name] = float(df_auc.index[df_auc[name] == max_v].values[0])

    return best_ratio

Running the function:

smote_ratio = eval_smote_size(X_train_ohe, y_train, X_test_ohe, y_test, clfs)

Best Ratio

{'LogisticRegression': 0.7,
'KNeighborsClassifier': 0.33,
'RandomForestClassifier': 1.0,
'XGBClassifier': 0.25}

The best ratio was then applied to each classifier's training set; the plot displays each classifier's ROC curve and AUC score.
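A rough sketch of that step, reusing the smote_ratio dictionary returned above (the ROC plotting details are omitted):

for clf in clfs:
    # pick the best ratio found for this classifier
    ratio = smote_ratio[type(clf).__name__]
    smote = SMOTE(sampling_strategy=ratio, random_state=36)
    X_res, y_res = smote.fit_resample(X_train_ohe, y_train)
    clf.fit(X_res, y_res)
    y_hat_test = clf.predict(X_test_ohe)
    print(type(clf).__name__, round(roc_auc_score(y_test, y_hat_test), 3))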

Random Oversampling

The same procedure was followed to find the best ratio for imblearn's RandomOverSampler.
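Only the sampler changes; inside the same evaluation loop, the SMOTE call is swapped out roughly as sketched here, with ratio again taken from the list of candidate ratios:

from imblearn.over_sampling import RandomOverSampler

# duplicate existing minority samples instead of synthesizing new ones
ros = RandomOverSampler(sampling_strategy=ratio, random_state=36)
X_train_resamp, y_train_resamp = ros.fit_resample(X_train, y_train)

This produced the best ratios below.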

{'LogisticRegression': 1.0,
'KNeighborsClassifier': 0.33,
'RandomForestClassifier': 0.7,
'XGBClassifier': 0.7}

The figure shows each classifier's ROC curve and AUC score with its best ratio applied.

Conclusion

The following figure shows the AUC scores of the four classifiers under the different sampling techniques, compared to the baseline model. For this particular dataset, resampling the data does not appear to have much impact on the models' AUC scores.
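A comparison chart like that can be produced with a simple grouped bar plot; a rough sketch, assuming the AUC scores have been collected into dictionaries keyed by classifier name (baseline_auc, smote_auc, and ros_auc are placeholders):

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical results -- substitute the actual AUC scores
auc_scores = pd.DataFrame({
    'baseline': baseline_auc,            # e.g. {'LogisticRegression': 0.78, ...}
    'SMOTE': smote_auc,
    'random oversampling': ros_auc,
})
auc_scores.plot(kind='bar', figsize=(8, 4))
plt.ylabel('AUC')
plt.tight_layout()
plt.show()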

The conclusion drawn here is by no means definitive, as this is a quick and simple analysis of one particular dataset. In the future, I will try a different dataset and compare the results.
