Ensemble models (2)

Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save).


This tutorial extends the previous ensemble tutorial with a homemade Random Forest regressor on the Boston dataset.

using MLJ
using PyPlot
using PrettyPrinting
using StableRNGs
import DataFrames

X, y = @load_boston
sch = schema(X)
p = length(sch.names)
n = sch.nrows
@show (n, p)
describe(y)
(n, p) = (506, 12)
Summary Stats:
Length:         506
Missing Count:  0
Mean:           22.532806
Minimum:        5.000000
1st Quartile:   17.025000
Median:         21.200000
3rd Quartile:   25.000000
Maximum:        50.000000
Type:           Float64

Let's load the decision tree regressor

@load DecisionTreeRegressor
DecisionTreeRegressor(
    max_depth = -1,
    min_samples_leaf = 5,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0) @386

Let's first check the performance of a single Decision Tree Regressor (DTR for short):

tree = machine(DecisionTreeRegressor(), X, y)
e = evaluate!(tree, resampling=Holdout(fraction_train=0.8),
              measure=[rms, rmslp1])
e |> pprint # use PrettyPrinting
(measure = [rms, rmslp1],
 measurement = [7.054322052311652, 0.32730942581047084],
 per_fold = [[7.054322052311652], [0.32730942581047084]],
 per_observation = [missing, missing])

Note that multiple measures can be reported simultaneously.

Random forest

Let's create an ensemble of DTRs and fix the number of subfeatures to 3 for now.

forest = EnsembleModel(atom=DecisionTreeRegressor())
forest.atom.n_subfeatures = 3

(NB: we could have fixed n_subfeatures in the DTR constructor too).
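For instance, the two lines above could be collapsed into one (a sketch equivalent to the construction above, using the same constructors):

```julia
# Equivalent construction: fix n_subfeatures directly in the atomic model.
forest = EnsembleModel(atom=DecisionTreeRegressor(n_subfeatures=3))
```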

To get an idea of how many trees are needed, we can track the evaluation error (say, the RMS) for an increasing number of trees.

rng = StableRNG(5123) # for reproducibility
m = machine(forest, X, y)
r = range(forest, :n, lower=10, upper=1000)
curves = learning_curve!(m, resampling=Holdout(fraction_train=0.8, rng=rng),
                         range=r, measure=rms);

Let's plot the curve:

plot(curves.parameter_values, curves.measurements)
ylabel("Root Mean Squared error", fontsize=16)
xlabel("Number of trees", fontsize=16)
xticks([10, 250, 500, 750, 1000], fontsize=14)
Figure: RMS vs number of trees

The curve is pretty noisy but let's just go for 150 trees:

forest.n = 150;


As forest is a composite model, it has nested hyperparameters:

params(forest) |> pprint
(atom = (max_depth = -1,
         min_samples_leaf = 5,
         min_samples_split = 2,
         min_purity_increase = 0.0,
         n_subfeatures = 3,
         post_prune = false,
         merge_purity_threshold = 1.0),
 atomic_weights = [],
 bagging_fraction = 0.8,
 rng = Random._GLOBAL_RNG(),
 n = 150,
 acceleration = CPU1{Nothing}(nothing),
 out_of_bag_measure = [])

Let's define a range for the number of subfeatures and for the bagging fraction:

r_sf = range(forest, :(atom.n_subfeatures), lower=1, upper=12)
r_bf = range(forest, :bagging_fraction, lower=0.4, upper=1.0);

And build a tuned model as usual, which we fit on an 80/20 split. We use a low-resolution grid here to keep this tutorial fast, but you could of course use a finer grid.

tuned_forest = TunedModel(model=forest,
                          tuning=Grid(resolution=3),
                          resampling=CV(nfolds=6, rng=StableRNG(32)),
                          ranges=[r_sf, r_bf],
                          measure=rms);
m = machine(tuned_forest, X, y)
e = evaluate!(m, resampling=Holdout(fraction_train=0.8),
              measure=[rms, rmslp1])
e |> pprint
(measure = [rms, rmslp1],
 measurement = [3.9331681188967704, 0.25013647146434703],
 per_fold = [[3.9331681188967704], [0.25013647146434703]],
 per_observation = [missing, missing])


Again, you could visualize the effect of the two hyperparameters, e.g. with a filled contour plot:

r = report(m)
res = r.plotting

vals_sf = res.parameter_values[:, 1]
vals_bf = res.parameter_values[:, 2]

tricontourf(vals_sf, vals_bf, res.measurements)
xticks(1:3:12, fontsize=12)
xlabel("Number of sub-features", fontsize=14)
yticks(0.4:0.2:1, fontsize=12)
ylabel("Bagging fraction", fontsize=14)

Even though we've only done a very rough search, it seems that around 7 sub-features and a bagging fraction of around 0.75 work well.
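To confirm this, you could inspect which hyperparameters the grid search actually selected; a sketch, assuming the tuning report exposes a `best_model` field (as in MLJ's tuning API):

```julia
# The tuning report records the best model found during the grid search.
best = report(m).best_model    # an EnsembleModel
@show best.atom.n_subfeatures
@show best.bagging_fraction
```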

Now that the machine m is trained, you can use it for predictions (implicitly, this will use the best model). For instance, we could look at predictions on the whole dataset:

ŷ = predict(m, X)
rms(ŷ, y)