Ensemble methods

Overview

Ensemble learning is a machine learning paradigm in which multiple weak learners are trained to solve the same problem and combined to obtain better results. The main hypothesis is that, when weak models are correctly combined, we can obtain more accurate and/or robust models.

Most of the time, these weak models do not perform well on their own, either because they have high bias or because they have too much variance to be robust (models with many degrees of freedom, for example).

The idea of ensemble methods is to reduce the bias or variance of such weak learners by combining several of them.

Notice that it is important for the aggregation method to be consistent with the type of weak learner. For instance, when selecting models with high variance, we should combine them with an aggregation method that helps to reduce variance.

The goals:

  • Bagging ⇒ reduces the model’s variance.

  • Boosting ⇒ reduces the model’s bias.

  • Stacking ⇒ increases the predictive force of a set of models.

Bagging

  • Effective method when data is limited.

  • It can be parallelized.

  • The main idea is to fit several independent models and to average their predictions in order to obtain a model with a lower variance.

  • It generates bootstrap samples (representative of the original dataset, yet approximately independent from each other) to fit each model.

  • There are several possible ways to aggregate the multiple models fitted in parallel (see the sketch after this list):

    • For a regression problem, we can literally average the outputs.

    • For classification, we can consider:

      • a hard-voting system: each model output is a vote, and the class with the most votes is returned.

      • a soft-voting system: models return probabilities and the class with the highest averaged probability is returned.
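A minimal bagging sketch, assuming scikit-learn is available; the synthetic dataset and hyperparameters are only illustrative. `BaggingClassifier` draws the bootstrap samples internally, and `BaggingRegressor` would average the outputs for a regression problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base tree is fitted on a bootstrap sample of the training set.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # high-variance base learner
    n_estimators=50,
    bootstrap=True,            # sample with replacement
    n_jobs=-1,                 # the independent fits can run in parallel
    random_state=0,
)
bagging.fit(X_train, y_train)

# predict() aggregates the trees: soft voting (averaged predict_proba) when the
# base learner exposes probabilities, otherwise a hard majority vote.
print("accuracy:", bagging.score(X_test, y_test))
print("averaged class probabilities (first test sample):", bagging.predict_proba(X_test[:1]))
```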

Random Forest
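A random forest is essentially bagging applied to decision trees, with an extra source of randomness: each split only considers a random subset of the features, which further decorrelates the trees. A minimal sketch, again assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each tree is grown on a bootstrap sample; max_features="sqrt" restricts the
# features considered at each split, decorrelating the trees before averaging.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1, random_state=0)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```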

Boosting

Boosting consists of fitting multiple weak learners SEQUENTIALLY, in a very adaptive way:

  • Each model in the sequence is fitted while giving more importance to the observations that were poorly handled by the previous models.

  • So, each new model focuses its efforts on the most difficult observations to fit.

At the end of the process, we get a strong learner with a LOWER BIAS.

  • Base models that are often considered for boosting are models with low variance but high bias (shallow decision trees, for example).

  • Unlike bagging, this method cannot be easily parallelized, since each model in the sequence depends on the previous ones.

Adaptive boosting

Gradient boosting
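Both variants are implemented in scikit-learn; the sketch below uses synthetic data and illustrative hyperparameters. AdaBoost reweights the training observations after each weak learner, while gradient boosting fits each new learner to the negative gradient of the loss (the residuals, for squared loss) of the current ensemble:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Adaptive boosting: decision stumps (low variance, high bias) are fitted one after
# another, with sample weights increased on observations the previous stumps missed.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))

# Gradient boosting: each new shallow tree corrects the errors of the current
# ensemble by fitting the gradient of the loss, scaled by a learning rate.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
gb.fit(X_train, y_train)
print("Gradient boosting accuracy:", gb.score(X_test, y_test))
```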

Stacking

  • It typically trains multiple heterogeneous weak learners (different types of models), whereas bagging and boosting usually combine learners of the same type.

  • It combines their outputs by training a meta-learner that produces the final predictions from the multiple predictions returned by these weak models (see the sketch below).
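A minimal stacking sketch, assuming scikit-learn; the base learners and the meta-learner below are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Heterogeneous base learners...
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True)),
]

# ...whose out-of-fold predictions become the input features of a meta-learner.
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression(), cv=5)
print("cross-validated accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```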

"Keep in mind just by adding layers and more models to your stacking algorithm does not mean you’ll get a better predictor. There are no free lunches in machine learning."
