How to deal with imbalanced datasets?
Using the correct performance metric
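Accuracy can be very misleading on imbalanced datasets: a model that always predicts the majority class scores high accuracy while never detecting the minority class. It's better to try metrics that look at each class separately, such as the confusion matrix, precision, recall, F1 score or ROC AUC.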
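To make the problem with accuracy concrete, here is a minimal sketch that scores a "model" which always predicts the majority class. The labels are made up for illustration, and the helper names (confusion_counts, safe_div) are hypothetical; in practice a library such as scikit-learn offers equivalent metric functions.

```python
# Toy labels (made up): 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks good here even though the model never finds a single minority example, which is exactly what recall and F1 expose.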
Resampling Techniques
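Always split into train and test sets BEFORE resampling, and apply resampling ONLY to the training set! Otherwise copies of the same observation can end up in both sets, and the test score no longer reflects the real class distribution.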
Oversample: by adding more copies of the minority class.
Undersample: by removing observations from the majority class.
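The order of operations (split first, then resample only the training set) can be sketched as follows. This is a minimal illustration using only the standard library; the toy dataset and the helper names (stratified_split, random_oversample, random_undersample) are assumptions made for this example, not part of any library.

```python
import random

rng = random.Random(0)

# Hypothetical toy dataset of (feature, label) rows:
# 90 majority rows (label 0) and 10 minority rows (label 1).
data = [(rng.gauss(0, 1), 0) for _ in range(90)]
data += [(rng.gauss(3, 1), 1) for _ in range(10)]


def stratified_split(rows, test_fraction, rng):
    """Split each class separately so both sets keep the class ratio."""
    train, test = [], []
    for label in sorted({lab for _, lab in rows}):
        group = [row for row in rows if row[1] == label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def group_by_label(rows):
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    return groups


def random_oversample(rows, rng):
    """Add random copies of smaller classes until all match the largest."""
    groups = group_by_label(rows)
    target = max(len(g) for g in groups.values())
    out = []
    for group in groups.values():
        out.extend(group)
        out.extend(rng.choices(group, k=target - len(group)))
    return out


def random_undersample(rows, rng):
    """Keep a random subset of larger classes, matching the smallest."""
    groups = group_by_label(rows)
    target = min(len(g) for g in groups.values())
    out = []
    for group in groups.values():
        out.extend(rng.sample(group, target))
    return out


# 1) Split first. 2) Resample the training set only; the test set stays untouched.
train, test = stratified_split(data, test_fraction=0.2, rng=rng)
train_over = random_oversample(train, rng)
train_under = random_undersample(train, rng)

print(len(train), len(test), len(train_over), len(train_under))
```

The test set keeps the original imbalance, so metrics computed on it still describe how the model would behave on real data.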
Create synthetic samples
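SMOTE (Synthetic Minority Oversampling Technique): uses k-nearest neighbors to generate new, synthetic minority-class samples. Each new sample is placed between an existing minority sample and one of its nearest minority-class neighbors, so the training set gains new points rather than exact copies.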
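The core idea of SMOTE can be sketched in a few lines. This is a simplified illustration, not a reference implementation: the function name smote_sketch, its parameters and the sample points are all assumptions made for this example. For real work, a maintained implementation such as the one in the imbalanced-learn package is a better choice.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, rng=None):
    """Generate n_new synthetic points from a list of minority-class points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up minority-class points with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)

for point in new_points:
    print(point)
```

As with the other resampling techniques, synthetic samples should be generated from the training set only.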
Apply an ensemble algorithm
Ensemble methods