How to deal with imbalanced datasets?

Using the correct performance metric

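Accuracy can be very misleading on imbalanced datasets: a model that always predicts the majority class scores high accuracy while never detecting the minority class. Metrics that look at each class separately, such as the confusion matrix, precision, recall, and F1 score, are more informative.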
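To illustrate why accuracy misleads, here is a minimal sketch using only the Python standard library. The 95/5 class split and the "always predict the majority class" model are made-up assumptions for the example, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (assumption): 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# Accuracy is 0.95, yet recall on the minority class is 0.0.
print(acc, precision, recall, f1)
```

In practice a library such as scikit-learn provides these metrics; the hand-written versions here are only meant to show what each number measures.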

Resampling Techniques

Always split into train and test sets BEFORE resampling, and apply resampling ONLY to the training set. Otherwise duplicated or synthetic samples leak into the test set and the evaluation looks better than it really is.

Oversample: by adding more copies of the minority class.

Undersample: by removing observations from the majority class.
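The split-then-resample order can be sketched with the standard library alone. This is a simplified illustration, not a reference implementation: the helper names, the 90/10 toy dataset, and the choice to balance classes exactly are all assumptions made for the example.

```python
import random


def group_by_label(rows):
    """Group (features, label) rows into a dict keyed by label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    return groups


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split each class separately so both sets keep the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for class_rows in group_by_label(rows).values():
        shuffled = class_rows[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


def random_oversample(train_rows, seed=0):
    """Add random copies of smaller classes until all match the largest."""
    rng = random.Random(seed)
    groups = group_by_label(train_rows)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=target - len(g)))
    return balanced


def random_undersample(train_rows, seed=0):
    """Randomly drop rows of larger classes until all match the smallest."""
    rng = random.Random(seed)
    groups = group_by_label(train_rows)
    target = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, target))
    return balanced


# Toy dataset (assumption): 90 majority rows (label 0), 10 minority rows (label 1).
data = [((float(i),), 0) for i in range(90)] + [((float(i),), 1) for i in range(10)]

# 1) Split first. 2) Resample the training set only; the test set is untouched.
train, test = stratified_split(data, test_fraction=0.2)
train_over = random_oversample(train)
train_under = random_undersample(train)

print(len(train), len(test), len(train_over), len(train_under))
```

Oversampling keeps all the original information but can encourage overfitting to the repeated rows, while undersampling discards majority-class data; which trade-off is acceptable depends on how much data is available.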

Create synthetic samples

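  • SMOTE (Synthetic Minority Oversampling Technique): instead of duplicating existing rows, it creates new minority-class samples by interpolating between a minority sample and one of its k nearest minority-class neighbors. The synthetic samples are added to the training set only.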
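The core interpolation idea behind SMOTE can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions (numeric features only, Euclidean distance, a hypothetical function name `smote_sketch`), not the reference algorithm; for real work, the imbalanced-learn package offers a `SMOTE` class with a `fit_resample` method.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Each new point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbors.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for i, p in enumerate(minority) if i != base_index]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position between the base point and the neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class (assumption): four points on the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority_points, n_new=5, k=2)
print(new_points)
```

Because every synthetic point is a blend of two real minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.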

Apply an ensemble algorithm

Ensemble methods
