Ensemble Learning

Stuart Burrows 14 Sep 2017 5 minute read
Ensemble Learning

When I hear the word “ensemble”, I think of two things: the grand canonical ensemble, and the brass ensemble. But today we’re taking a look at ensemble learning, which is the canonical way that the grand top brass of data scientists augment and combine models.

The need for ensembling arises when you’re working at the cutting edge and a good model is not available. One model might anticipate a fraction of the cases, another a different fraction, and a third a different set again. If any one model is used in isolation, it will give poor predictions, which will lead to poor decisions. This risk can be mitigated by ensemble learning, which combines the strengths of several poor models to yield more robust predictions.

Overview

Ensemble learning begins with the notion that while each of our models may be poor in isolation, they may collectively cover the predictive landscape pretty well. If you have one poor model, then ensembling can effectively create a bunch of derived models, and recombine them into a better model. If you already have a few poor models, then ensemble methods can combine them by a judicious weighted average. Several techniques have emerged since ensembling became an active research field in the 1990s.

Some of the most popular algorithms fall into the resampling category and include bagging, boosting, and adaptive boosting. These are applicable when you have one poor base model. Bagging, which is short for bootstrap aggregating, involves resampling the data several times, creating a new model consisting of the base model trained on each sample, and holding a vote among the bootstrap models. Boosting constructs three classifiers from the base model and consults them all to make its final decision. The first is trained on resampled data; the second on another resample, this time made up of equal proportions of cases that the first classifier gets right and wrong; and the third on cases upon which the first two classifiers disagree. Adaptive boosting, or AdaBoost, trains a number of classifiers on resamples which are successively stripped of cases which the last-trained classifier gets right, in proportion to its expected rightness.

We shall explore two ensembling examples using Bayesian model weighting, which takes a number of poor models as its starting point, infers the model weights from the data, and recombines the models accordingly.

Examples

1) Will it rain tomorrow?

We wish to determine the probability that it will rain tomorrow, given whether today is cloudy or clear, and we can consult three experts, Bill, Cindy, and Dirk, each of whom we will consider as a model. Their prediction models are as follows:

Model Cloud state today Prob(rain tomorrow)
Bill Cloudy 0.3
Clear 0.5
Cindy Cloudy 0.9
Clear 0.2
Dirk Cloudy 0.5
Clear 0.7

We also have some relevant weather records, e.g.:

Cloudy? (day i) Rain? (day i + 1)
False True
True True
False False

What is the probability that it will rain tomorrow, conditional upon whether today is cloudy?

Solution

We now run the ensembling algorithm for datasets of size 10, 100, and 1000, which yield the following final predictions:

Dataset size Cloud state today Prob(rain tomorrow)
10 Cloudy 0.507
Clear 0.513
100 Cloudy 0.758
Clear 0.321
1000 Cloudy 0.670
Clear 0.425

Now the datasets were generated by computer from a mixture distribution, so we know what the truth is. The model weights are (0.2,0.5,0.3), which yield rain probabilities of 0.66 and 0.41 for cloudy and clear days respectively. The final prediction with 1000 data points is quite close to those, and much better than the individual model predictions. The following images will help us understand what is happening in the ensembler.

SimplicesCrop2 - Copy

There is one triangle for each dataset size. Each triangle is the set of possible weightings among the models. The vertex w_1 is the weighting (1,0,0) representing model 1 (Bill), and so on; a point in the interior of the triangle represents a weighted average among the three models whose weight for model i is roughly the closeness to vertex w_i. The rainbow contour shading indicates the shape of the density function for the weight vector, after taking account of the data. And the pink dot is the true weight vector.

As the dataset size grows, we gain more information and are able to form an increasingly accurate inference about the relative correctness of the models, which can be seen in the convergence of the density towards the true weight vector.

2) When will my new car break down?

There are two companies vying for supremacy in the breakdown-prediction market. Each has its own model, giving a probability distribution for the time to breakdown. SchmoozeCorp thinks that the brand b of the car is the decisive factor, and says

    \[\text{Prob}(t\:|\:\text{Schmooze}\;b\;I) \sim \begin{cases} \text{Gamma}\left(30, \frac{1}{10}\right) & \quad b = \text{FastBack}\\ \text{Gamma}\left(25, \frac{1}{5}\right) & \quad b = \text{SlowPoke}\\ \end{cases}\]

giving a shorter lifetime mean of 3 years to the zippy FastBack and a longer lifespan mean of 5 years to the sedate SlowPoke. The Malfunction Mafia has satisfied itself that the breakdown time is Weibull-distributed with parameters \alpha = \text{number of gears}, and \beta = \frac{\text{cost}}{10000}:

    \[\text{Prob}(t\:|\:\text{Mafia}\;g\;c\;I) \sim \text{Weibull}\left(g, \frac{c}{10000}\right)\]

with g as the number of gears, and c as the cost of the car. In machine learning jargon, SchmoozeCorp is using only a single feature—the brand—and Malfunction Mafia are using two features—the number of gears and the cost.

We have a dataset of how long it has taken a number of new cars to break down. You just bought a new FastBack for $71000, and it has 6 gears. When will it break down?

Solution

We run the ensembler for datasets of size 10, 50, and 400. Below we plot the SchmoozeCorp and Malfunction Mafia predictive distributions with the ensembler’s distribution and the real distribution that generated the datasets, for each dataset size.

CarDist10

CarDist50

CarDist400

The ensembler’s distribution converges upon the real distribution as the dataset size increases. The predictions made by each company are quite bad, but the ensembler’s combination of them is good. The ensembler makes an inference about the weighting, which is shown in the following graphs, with the rainbow-coloured distribution being the inferred density on the weight for SchmoozeCorp, and the magenta line indicating the actual weight used in the generating distribution.

CarWeight3

Conclusion

Ensembling provides a way to improve weak models by combining them. It is roughly analogous to Fourier analysis, where each basis function is a poor approximator of the target function, but a suitable combination of the basis functions can yield a good fit. If the component models are so bad that they collectively provide poor coverage, then ensembling will achieve little. Often enough, though, it can turn lacklustre results into actionable insight, especially in difficult analytics problems where good models are hard to find.

Further Reading

Robi Polikar (2009) Ensemble Learning. Scholarpedia, 4(1):2776. Viewed 8 September 2017.