Why boosting works
Gradient boosting is one of the most effective ML techniques out there. In this post I take a look at why boosting works. TL;DR: boosting corrects the mistakes of previous learners by fitting the patterns in their residuals.
Boosting
In this post I take a look at boosting, with a focus on building an intuition for why this technique works. Most people who work in data science and machine learning will know that gradient boosting is one of the most powerful and effective algorithms out there. It continues to be one of the most successful techniques in Kaggle competitions and is widely used in practice across a variety of use cases.
To build that intuition, I'll implement a simple booster using the scikit-learn decision tree implementation. It goes without saying that the go-to package for gradient boosting in practice is the excellent XGBoost. This post should help to develop your understanding of why boosting is so effective in predictive modelling problems.
At a high level, boosting sits within the ensemble family of ML algorithms. Boosting involves sequentially training weak learners - where a weak learner is a simple, high-bias estimator that underfits the data - to predict some outcome. The interesting thing is that each learner does not predict the original target. Instead, each learner attempts to predict the mistakes of the previous learner. In practice this means that learner-1 will attempt to predict the target outcome directly and learner-2 will attempt to predict the residual of learner-1. This process of predicting residuals continues through to the final learner. The final prediction can then be made by taking the sum of all the individual learners' predictions. This is an extremely effective way of predicting things, but the intuition for why isn't always immediately clear. I hope to make that intuition more accessible in this post.
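As a toy illustration for a single example (the numbers and variable names here are made up purely for illustration), suppose the true value is 10 and the first learner predicts 7:
y_true = 10                    # the target value for one example
pred_1 = 7                     # learner 1's prediction
residual_1 = y_true - pred_1   # = 3, the value learner 2 is trained to predict
pred_2 = 2                     # learner 2's prediction of that residual
boosted = pred_1 + pred_2      # = 9, closer to the target than learner 1 alone
residual_2 = y_true - boosted  # = 1, the value learner 3 would be trained on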
Let's start by importing some dependencies:
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_style("whitegrid")
Let's create some toy data:
n = 100
X = np.linspace(0, 10, n)
y = X**2 + 10 - (20 * np.random.random(n))
X = X[:, np.newaxis]
plt.figure(figsize=(15, 4))
plt.scatter(X, y, alpha=.7);
The functions below create a boosted regression learner using decision trees. Decision trees are one of the most popular and effective learners for ensembles (though technically you can boost any algorithm). Decision trees are a nice choice, however, because 1) they train quickly and 2) they can model non-linearities.
I have tried to keep the functions below super simple. many_trees returns a list of decision trees; boost fits the decision trees sequentially, with tree-0 predicting the target outcome y and trees 1 to n predicting the residuals; predict iterates through the list of fitted decision trees and returns each tree's predictions; and plot_fits is a convenience function that sums the predictions of the first n trees and plots the fitted line and the residuals.
def many_trees(n_trees, clf=False, **kwargs):
    # Build a list of unfitted regression trees; **kwargs (e.g. max_depth)
    # is passed straight through to DecisionTreeRegressor.
    trees = [DecisionTreeRegressor(**kwargs) for _ in range(n_trees)]
    return trees

def boost(trees, X, y, clf=False):
    # Fit the trees sequentially: the first tree is fit to y, and each
    # subsequent tree is fit to the residuals left by the trees before it.
    fitted = []
    for tree in trees:
        tree.fit(X, y)
        yhat = tree.predict(X)
        y = y - yhat  # the next tree's target is the current residual
        fitted.append(tree)
    return fitted

def predict(trees, X):
    # Return an (n_samples, n_trees) array with one column of predictions per
    # tree; summing across the columns gives the boosted prediction.
    return np.array([tree.predict(X) for tree in trees]).T
With the boosting functionality implemented, we can now fit the trees. Given the simplicity of the prediction problem, I'll make the learners extremely weak by setting the max_depth of each tree to 1. This limits each tree to one split of X when predicting y.
learners = many_trees(30, max_depth=1, clf=False)
fitted = boost(learners, X, y, clf=False)
boosted_yhat = predict(fitted, X)
xfit = np.linspace(0, 10, 100).reshape(-1, 1)  # evenly spaced grid over the range of X, used for plotting
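As a quick sanity check on that "one split" claim (this line isn't part of the boosting workflow): a depth-1 tree is a decision stump, so its predictions can only take two distinct values, one per leaf.
print(np.unique(boosted_yhat[:, 0]))  # exactly two distinct values for a depth-1 tree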
def plot_fits(n_trees, row):
    # Sum the predictions of the first n_trees trees to get the boosted fit.
    preds_t = boosted_yhat[:, :n_trees]
    boosted_pred = preds_t.sum(1)
    res = y - boosted_pred
    sample_ix = np.arange(len(res))  # index of each sample, for the residual plot
    axes[row, 0].plot(xfit, boosted_pred, c='red')
    axes[row, 0].scatter(X, y)
    axes[row, 0].set_title(f"Fit after {n_trees} trees", fontsize=15)
    axes[row, 1].scatter(sample_ix, res, alpha=0.7)
    axes[row, 1].plot(res, color='r', alpha=0.7)
    axes[row, 1].set_title(f"Residuals after {n_trees} trees", fontsize=15)
Boosting works by fitting successive models to the residuals of the previous model. It is common to plot residuals as part of the model evaluation process. Typically you check residuals to ensure that they are random and that there are no obvious patterns. If there are patterns in the residuals, it is a sign that you are missing key information about the target variable and are underfitting the data. Essentially, patterns in residuals represent information about the relationship between X and y that can still be modelled.
Boosting takes advantage of this insight by fitting the residuals across the sequence of models. This is why we use weak learners - such as highly regularised decision trees - in boosting: we want each individual learner to underfit the data, thereby affording the next learner the opportunity to correct its mistakes (possibly with a different sample and feature space to the previous learner).
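Here is a minimal sketch of that idea using the toy data above (the variable names are mine, for illustration only): a single depth-1 tree underfits, and its residuals still contain structure that a second tree can learn.
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
residuals = y - stump.predict(X)
# If the residuals were pure noise, a second tree would have nothing to learn.
# Here it explains a non-trivial share of their variance, which is exactly the
# signal the next round of boosting picks up.
stump_on_residuals = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
print(stump_on_residuals.score(X, residuals))  # R^2 on the residuals, well above 0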
We can see this process play out in the plots below. Each row shows the line fitted to the data by successive boosted trees, alongside the corresponding residuals. The first panel shows the prediction from a single tree, and it is easy to see that this leaves a clear pattern in the residuals. The next panel shows the fit after five boosted trees: boosting has given the model more flexibility to fit the data, but it still leaves clear, exploitable patterns in the residuals. The remaining four panels increase the number of boosted trees, and by the final panel the residuals begin to look quite random and the line appears to be a decent fit for the data.
fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(20,25))
plot_fits(1, 0)
plot_fits(5, 1)
plot_fits(15, 2)
plot_fits(20, 3)
plot_fits(25, 4)
plot_fits(30, 5)
fig.tight_layout()
It's also informative to plot each intermediate fit by cumulatively adding each successive tree's predictions and plotting the resulting line.
plt.figure(figsize=(15, 4))
plt.scatter(X, y, alpha=.4)
pred = 0
for i in range(len(learners)):
    # Add each successive tree's predictions and plot the cumulative fit.
    pred += boosted_yhat[:, i]
    plt.plot(xfit, pred)
plt.xlabel("X")
plt.ylabel("y");
There is obviously a lot more to boosting than described in this post, but I think it is useful to have an intuitive understanding of the core reason this technique works, and hopefully this post has made that clear! One final thought: when you use boosting you need to validate your model carefully, as this approach can easily overfit. For example, if we increase the max_depth of the trees to 4, we can observe that the model begins to fit individual data points rather than the general trend in the data.
learners = many_trees(30, max_depth=4, clf=False)
fitted = boost(learners, X, y, clf=False)
boosted_yhat = predict(fitted, X)
plt.figure(figsize=(15, 4))
plt.scatter(X, y, alpha=.4)
pred = 0
for i in range(len(learners)):
    pred += boosted_yhat[:, i]
    plt.plot(xfit, pred)
plt.xlabel("X")
plt.ylabel("y");
sample_ix = np.arange(X.shape[0])
plt.figure(figsize=(15, 4))
plt.scatter(sample_ix, y - pred)
plt.plot(y - pred, color='r', alpha=0.4)
Rather than leaving a randomly distributed set of residuals, the model has learned to predict y from X almost perfectly on the training data. This might look nice, but in effect the model has become a lookup table and hasn't learned the fundamental structure of the data. A good habit when training boosted models is therefore to plot the residuals - this will give you a clear steer on whether or not your model is overfitting.
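Another simple guard, sketched below, is to compare the training error with the error on held-out data (the split and metric here are arbitrary illustrative choices; train_test_split comes from scikit-learn). With max_depth=4 you should see the training error drop far below the held-out error, which is a classic sign of overfitting.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 4):
    fitted = boost(many_trees(30, max_depth=depth), X_train, y_train)
    train_mse = np.mean((y_train - predict(fitted, X_train).sum(1)) ** 2)
    test_mse = np.mean((y_test - predict(fitted, X_test).sum(1)) ** 2)
    print(f"max_depth={depth}: train MSE {train_mse:.2f}, held-out MSE {test_mse:.2f}")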
Hope you found this useful!