# Ensemble models

*Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save).*

## Preliminary steps

Let's start by loading the relevant packages and generating some dummy data.

```
using MLJ
import DataFrames
import Statistics
using PrettyPrinting
using StableRNGs
rng = StableRNG(512)
Xraw = rand(rng, 300, 3)
y = exp.(Xraw[:,1] - Xraw[:,2] - 2Xraw[:,3] + 0.1*rand(rng, 300))
X = DataFrames.DataFrame(Xraw)
train, test = partition(eachindex(y), 0.7);
```

Let's also load a simple model:

```
@load KNNRegressor
knn_model = KNNRegressor(K=10)
```

```
KNNRegressor(
    K = 10,
    algorithm = :kdtree,
    metric = Distances.Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = :uniform) @373
```

As before, let's instantiate a machine that wraps the model and data:

```
knn = machine(knn_model, X, y)
```

```
Machine{KNNRegressor} @321 trained 0 times.
  args:
    1: Source @835 ⏎ `Table{AbstractArray{Continuous,1}}`
    2: Source @786 ⏎ `AbstractArray{Continuous,1}`
```

and fit it:

```
fit!(knn, rows=train)
ŷ = predict(knn, X[test, :]) # or use rows=test
rms(ŷ, y[test])
```

`0.06389980172436367`

The few steps above are equivalent to just calling `evaluate!`:

```
evaluate!(knn, resampling=Holdout(fraction_train=0.7, rng=StableRNG(666)),
measure=rms) |> pprint
```

```
(measure = [rms],
 measurement = [0.1236738803266666],
 per_fold = [[0.1236738803266666]],
 per_observation = [missing])
```

## Homogeneous ensembles

MLJ offers basic support for ensembling, such as *bagging*. Defining such an ensemble of simple "atomic" models is done via the `EnsembleModel` constructor:

```
ensemble_model = EnsembleModel(atom=knn_model, n=20);
```

where `n=20` indicates how many models are present in the ensemble.
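To build intuition for what such an ensemble does, here is a minimal bagging sketch in plain Julia (no MLJ; the names `fit_atom`, `predict_atom`, `bagged_predict`, `n_models` and `frac` are illustrative, not part of MLJ's API): each atomic model is trained on a random subsample of the data, and the ensemble prediction is the element-wise average of the atomic predictions.

```
using Random, Statistics

# Toy "atomic model": predicts the mean of its training targets.
fit_atom(y) = mean(y)
predict_atom(m, n) = fill(m, n)

# Bagging sketch: train each atom on a random `frac` of the rows,
# then average the atomic predictions element-wise.
function bagged_predict(y; n_models=20, frac=0.8, rng=MersenneTwister(1))
    n = length(y)
    k = round(Int, frac * n)
    atoms = [fit_atom(y[randperm(rng, n)[1:k]]) for _ in 1:n_models]
    return mean([predict_atom(m, n) for m in atoms])
end
```

`EnsembleModel` does the analogue of this with the wrapped `KNNRegressor` as the atom, with `bagging_fraction` playing the role of `frac`.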

### Training and testing an ensemble

Now that we've instantiated an ensemble, it can be trained and tested the same as any other model:

```
ensemble = machine(ensemble_model, X, y)
estimates = evaluate!(ensemble, resampling=CV())
estimates |> pprint
```

```
(measure = [rms],
 measurement = [0.08524739829927208],
 per_fold = [[0.08866161072583786,
              0.11204787388781821,
              0.07793015151727334,
              0.08916673799997521,
              0.06632003095260786,
              0.06902857419121246]],
 per_observation = [missing])
```

Here the implicit measure is `rms` (the default for regression). The `measurement` is the aggregate taken over the folds; note that for `rms` the aggregation is itself a root mean square rather than an arithmetic mean, so it differs slightly from `mean(per_fold)`:

```
@show estimates.measurement[1]
@show mean(estimates.per_fold[1])
```

```
estimates.measurement[1] = 0.08524739829927208
mean(estimates.per_fold[1]) = 0.08385916321245417
```
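The small discrepancy between the two numbers is expected: per-fold `rms` values are combined as a root mean square, not an arithmetic mean. A quick check, using the per-fold values reported above:

```
using Statistics

per_fold = [0.08866161072583786, 0.11204787388781821, 0.07793015151727334,
            0.08916673799997521, 0.06632003095260786, 0.06902857419121246]

# Root-mean-square aggregation over folds recovers `measurement`.
sqrt(mean(per_fold .^ 2))  # ≈ 0.0852474
```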

Note that multiple measurements can be specified jointly. Here only one measurement is (implicitly) specified, but we still have to select the corresponding results (whence the `[1]` for both the `measurement` and `per_fold`).
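For example, continuing with the `ensemble` machine from above, several measures can be passed as a vector (a sketch; `mae` is another measure exported by MLJ):

```
# Each measure then gets its own entry in `measurement` and `per_fold`.
evaluate!(ensemble, resampling=CV(), measure=[rms, mae]) |> pprint
```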

### Systematic tuning

Let's simultaneously tune the ensemble's `bagging_fraction` and the K-nearest-neighbour hyperparameter `K`. Since one of our models is a field of the other, we have nested hyperparameters:

```
params(ensemble_model) |> pprint
```

```
(atom = (K = 10,
         algorithm = :kdtree,
         metric = Distances.Euclidean(0.0),
         leafsize = 10,
         reorder = true,
         weights = :uniform),
 atomic_weights = [],
 bagging_fraction = 0.8,
 rng = Random._GLOBAL_RNG(),
 n = 20,
 acceleration = CPU1{Nothing}(nothing),
 out_of_bag_measure = [])
```

To define a tuning grid, we construct ranges for the two parameters and collate these ranges:

```
B_range = range(ensemble_model, :bagging_fraction,
                lower=0.5, upper=1.0)
K_range = range(ensemble_model, :(atom.K),
                lower=1, upper=20);
```
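As an aside on scales: `range` also accepts a `scale` keyword, so a logarithmic grid for `K` could be declared as follows (a sketch; check the `range` docstring in your MLJ version):

```
# Hypothetical log10-scaled alternative for the K range:
K_log_range = range(ensemble_model, :(atom.K),
                    lower=1, upper=20, scale=:log10);
```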

The scale for a tuning grid is linear by default but can be specified as `:log10` for logarithmic ranges. Now we have to define a `TunedModel` and fit it:

```
tm = TunedModel(model=ensemble_model,
                tuning=Grid(resolution=10), # 10x10 grid
                resampling=Holdout(fraction_train=0.8, rng=StableRNG(42)),
                ranges=[B_range, K_range])
tuned_ensemble = machine(tm, X, y)
fit!(tuned_ensemble, rows=train);
```

Note that `rng=StableRNG(42)` seeds the random number generator for reproducibility of this example.

### Reporting results

The best model can be accessed like so:

```
best_ensemble = fitted_params(tuned_ensemble).best_model
@show best_ensemble.atom.K
@show best_ensemble.bagging_fraction
```

```
best_ensemble.atom.K = 3
best_ensemble.bagging_fraction = 0.5
```

The `report` method gives more detailed information on the tuning process:

```
r = report(tuned_ensemble);
```

For instance, `r.measurements` holds the measurements for all pairs of hyperparameters, which you can visualise nicely:

```
using PyPlot
figure(figsize=(8,6))
res = r.plotting
vals_b = res.parameter_values[:, 1]
vals_k = res.parameter_values[:, 2]
tricontourf(vals_b, vals_k, res.measurements)
xticks(0.5:0.1:1, fontsize=12)
xlabel("Bagging fraction", fontsize=14)
yticks([1, 5, 10, 15, 20], fontsize=12)
ylabel("Number of neighbors - K", fontsize=14)
```

Finally, you can always evaluate the model by reporting `rms` on the test set:

```
ŷ = predict(tuned_ensemble, rows=test)
rms(ŷ, y[test])
```