AMES

Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save).

Baby steps

Let's load a reduced version of the well-known Ames House Price data set, containing six of the more important categorical features and six of the more important numerical features. Like the iris data, this dataset is common enough that you can load the full version directly with @load_ames and the reduced version with @load_reduced_ames.

using MLJ
using PrettyPrinting
import DataFrames
import Statistics

X, y = @load_reduced_ames
X = DataFrames.DataFrame(X)
@show size(X)
first(X, 3) |> pretty
size(X) = (1456, 12)
┌──────────────────────────────────────────────────┬────────────┬───────────────────────────────────────────────────┬────────────┬─────────────┬────────────┬────────────┬────────────┬───────────────────────────────────────────────────┬────────────┬──────────────┬───────────┐
│ OverallQual                                      │ GrLivArea  │ Neighborhood                                      │ x1stFlrSF  │ TotalBsmtSF │ BsmtFinSF1 │ LotArea    │ GarageCars │ MSSubClass                                        │ GarageArea │ YearRemodAdd │ YearBuilt │
│ CategoricalArrays.CategoricalValue{Int64,UInt32} │ Float64    │ CategoricalArrays.CategoricalValue{String,UInt32} │ Float64    │ Float64     │ Float64    │ Float64    │ Int64      │ CategoricalArrays.CategoricalValue{String,UInt32} │ Float64    │ Int64        │ Int64     │
│ OrderedFactor{10}                                │ Continuous │ Multiclass{25}                                    │ Continuous │ Continuous  │ Continuous │ Continuous │ Count      │ Multiclass{15}                                    │ Continuous │ Count        │ Count     │
├──────────────────────────────────────────────────┼────────────┼───────────────────────────────────────────────────┼────────────┼─────────────┼────────────┼────────────┼────────────┼───────────────────────────────────────────────────┼────────────┼──────────────┼───────────┤
│ 5                                                │ 816.0      │ Mitchel                                           │ 816.0      │ 816.0       │ 816.0      │ 6600.0     │ 2          │ _20                                               │ 816.0      │ 2003         │ 1982      │
│ 8                                                │ 2028.0     │ Timber                                            │ 2028.0     │ 1868.0      │ 1460.0     │ 11443.0    │ 3          │ _20                                               │ 880.0      │ 2006         │ 2005      │
│ 7                                                │ 1509.0     │ Gilbert                                           │ 807.0      │ 783.0       │ 0.0        │ 7875.0     │ 2          │ _60                                               │ 393.0      │ 2003         │ 2003      │
└──────────────────────────────────────────────────┴────────────┴───────────────────────────────────────────────────┴────────────┴─────────────┴────────────┴────────────┴────────────┴───────────────────────────────────────────────────┴────────────┴──────────────┴───────────┘

and the target is a continuous vector:

@show y[1:3]
scitype(y)
y[1:3] = [138000.0, 369900.0, 180000.0]
AbstractArray{Continuous,1}

so this is a standard regression problem with a mix of categorical and continuous inputs.
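
If you want to double-check how MLJ interprets each column, you can inspect the scientific types directly with schema (a quick sketch; schema is re-exported by MLJ):

schema(X)   # lists the machine type and scientific type of every column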

Dummy model

Remember that a model is just a container for hyperparameters; let's take a particularly simple one: the constant regressor.

creg = ConstantRegressor()
ConstantRegressor(
    distribution_type = Distributions.Normal) @149

Binding the model to data creates a machine, which will store the training outcomes (fit-results):

cmach = machine(creg, X, y)
Machine{ConstantRegressor} @651 trained 0 times.
  args: 
    1:	Source @296 ⏎ `Table{Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}, AbstractArray{Multiclass{15},1}, AbstractArray{Multiclass{25},1}, AbstractArray{OrderedFactor{10},1}}}`
    2:	Source @795 ⏎ `AbstractArray{Continuous,1}`

You can now train the machine, specifying the rows it should be trained on (if unspecified, all the data will be used):

train, test = partition(eachindex(y), 0.70, shuffle=true); # 70:30 split
fit!(cmach, rows=train)
ŷ = predict(cmach, rows=test)
ŷ[1:3] |> pprint
[Distributions.Normal{Float64}(μ=179338.79587831208, σ=72945.29606828818),
 Distributions.Normal{Float64}(μ=179338.79587831208, σ=72945.29606828818),
 Distributions.Normal{Float64}(μ=179338.79587831208, σ=72945.29606828818)]

Observe that the output is probabilistic: each element is a univariate normal distribution (all with the same mean and variance, since the model is constant).
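
Since each prediction is an ordinary Distributions.jl object, you can query it like any distribution. A small sketch (the price of 150_000 below is just an arbitrary illustrative value):

import Distributions
Statistics.mean(ŷ[1]), Statistics.std(ŷ[1])  # parameters of the first predicted distribution
Distributions.pdf(ŷ[1], 150_000)             # density the model assigns to a sale price of 150,000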

You can recover deterministic output either by computing the mean of the predictions yourself or by using predict_mean directly (the mean function can be applied to any distribution from Distributions.jl):

ŷ = predict_mean(cmach, rows=test)
ŷ[1:3]
3-element Array{Float64,1}:
 179338.79587831208
 179338.79587831208
 179338.79587831208

You can then call one of the loss functions to assess the quality of the model by comparing its predictions with the ground truth on the test set:

rmsl(ŷ, y[test])
0.423173647088951
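
Instead of managing the holdout split by hand, you could also let MLJ do the resampling and scoring with evaluate! (a minimal sketch assuming a simple holdout; note that this retrains the machine):

evaluate!(cmach, resampling=Holdout(fraction_train=0.7, shuffle=true),
          measure=rmsl, verbosity=0)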

KNN-Ridge blend

Let's try something a bit fancier than a constant regressor.

  • one-hot encode the categorical inputs

  • log-transform the target

  • fit both a KNN regression and a Ridge regression on the data

  • compute a weighted average of the individual model predictions

  • inverse-transform (exponentiate) the blended prediction

You will first define a fixed model where all hyperparameters are specified or set to default. Then you will see how to create a model around a learning network that can be tuned.

@load RidgeRegressor pkg="MultivariateStats"
@load KNNRegressor
KNNRegressor(
    K = 5,
    algorithm = :kdtree,
    metric = Distances.Euclidean(0.0),
    leafsize = 10,
    reorder = true,
    weights = :uniform) @250

Using the expanded syntax

Let's start by defining the source nodes:

Xs = source(X)
ys = source(y)
Source @174 ⏎ `AbstractArray{Continuous,1}`

On the "first layer", there's one hot encoder and a log transform, these will respectively lead to node W and node z:

hot = machine(OneHotEncoder(), Xs)

W = transform(hot, Xs)
z = log(ys);

On the "second layer", there's a KNN regressor and a ridge regressor, these lead to node ẑ₁ and ẑ₂

knn   = machine(KNNRegressor(K=5), W, z)
ridge = machine(RidgeRegressor(lambda=2.5), W, z)

ẑ₁ = predict(ridge, W)
ẑ₂ = predict(knn, W)
Node{Machine{KNNRegressor}} @873
  args:
    1:	Node{Machine{OneHotEncoder}} @846
  formula:
    predict(
        Machine{KNNRegressor} @456, 
        transform(
            Machine{OneHotEncoder} @746, 
            Source @759))

On the "third layer", there's a weighted combination of the two regression models:

ẑ = 0.3ẑ₁ + 0.7ẑ₂;

And finally we need to invert the initial transformation of the target (which was a log):

ŷ = exp(ẑ);

You've now defined a full learning network which you can fit and use for prediction:

fit!(ŷ, rows=train)
ypreds = ŷ(rows=test)
rmsl(y[test], ypreds)
0.1896640122741723
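
Once fitted, the terminal node can also be called on new input data rather than on row indices; here is a small sketch where Xnew is just a hypothetical batch of new observations:

Xnew = X[test[1:3], :]  # hypothetical "new" observations (taken here from the test rows)
ŷ(Xnew)                 # passes the data through the whole network and returns predictions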

Using the "arrow" syntax

If you're using Julia 1.3 or later, you can use the following "arrow" syntax to build an equivalent network. Note that below ẑ₁ is the KNN prediction and ẑ₂ the ridge prediction, so the 0.3/0.7 weights are assigned to the two models the other way around compared with the previous section; this is why the final score differs.

First layer: one-hot encoding and log transform:

W = Xs |> OneHotEncoder()
z = ys |> log;

Second layer: KNN regression and ridge regression:

ẑ₁ = (W, z) |> KNNRegressor(K=5)
ẑ₂ = (W, z) |> RidgeRegressor(lambda=2.5);

Third layer: weighted sum of the two models:

ẑ = 0.3ẑ₁ + 0.7ẑ₂;

Then the inverse transform:

ŷ = exp(ẑ);

You can then fit and evaluate the model as usual:

fit!(ŷ, rows=train)
rmsl(y[test], ŷ(rows=test))
0.1474789307963238

Tuning the model

So far the hyperparameters were given explicitly, but it makes more sense to learn them. For this, we define a model around the learning network, which can then be trained and tuned like any other model:

mutable struct KNNRidgeBlend <: DeterministicNetwork
    knn_model::KNNRegressor
    ridge_model::RidgeRegressor
    knn_weight::Float64
end

We must specify how such a model should be fit: this is effectively just the learning network we defined before, except that the hyperparameters are now taken from the struct:

function MLJ.fit(model::KNNRidgeBlend, verbosity::Int, X, y)
    Xs = source(X)
    ys = source(y)
    # first layer: one-hot encode the inputs and log-transform the target
    hot = machine(OneHotEncoder(), Xs)
    W = transform(hot, Xs)
    z = log(ys)
    # second layer: the two regressors, with hyperparameters taken from the struct
    ridge_model = model.ridge_model
    knn_model = model.knn_model
    ridge = machine(ridge_model, W, z)
    knn = machine(knn_model, W, z)
    # third layer: blend the predictions, then invert the log transform of the target
    ẑ = model.knn_weight * predict(knn, W) + (1.0 - model.knn_weight) * predict(ridge, W)
    ŷ = exp(ẑ)
    # wrap the network in a learning network machine, fit it, and return the result
    mach = machine(Deterministic(), Xs, ys; predict=ŷ)
    fit!(mach, verbosity=verbosity - 1)
    return mach()
end

Note: you really want to keep the verbosity low here, otherwise the tuning step will generate a lot of verbose output!

You can now instantiate and fit such a model:

krb = KNNRidgeBlend(KNNRegressor(K=5), RidgeRegressor(lambda=2.5), 0.3)
mach = machine(krb, X, y)
fit!(mach, rows=train)

preds = predict(mach, rows=test)
rmsl(y[test], preds)
0.1474789307963238

But more interestingly, the hyperparameters of the model can be tuned.

Before we get started, it's important to note that the hyperparameters of the model have different levels of nesting. This becomes explicit when trying to access elements:

@show krb.knn_weight
@show krb.knn_model.K
@show krb.ridge_model.lambda
krb.knn_weight = 0.3
krb.knn_model.K = 5
krb.ridge_model.lambda = 2.5
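
Since KNNRidgeBlend is a mutable struct, you can change any of these nested hyperparameters in place and refit the same machine; a quick sketch:

krb.knn_weight = 0.5    # e.g. give more weight to the KNN predictions
fit!(mach, rows=train)  # retrains the blend with the updated hyperparameter
krb.knn_weight = 0.3    # restore the value used in the rest of this tutorial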

You can also see all the hyperparameters using the params function:

params(krb) |> pprint
(knn_model = (K = 5,
              algorithm = :kdtree,
              metric = Distances.Euclidean(0.0),
              leafsize = 10,
              reorder = true,
              weights = :uniform),
 ridge_model = (lambda = 2.5,),
 knn_weight = 0.3)

The ranges of values to tune over should follow the nesting structure reflected by params:

k_range = range(krb, :(knn_model.K), lower=2, upper=100, scale=:log10)
l_range = range(krb, :(ridge_model.lambda), lower=1e-4, upper=10, scale=:log10)
w_range = range(krb, :(knn_weight), lower=0.1, upper=0.9)

ranges = [k_range, l_range, w_range]
3-element Array{MLJBase.NumericRange{T,MLJBase.Bounded,Symbol} where T,1}:
 NumericRange{Int64,…} @426
 NumericRange{Float64,…} @046
 NumericRange{Float64,…} @442
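
To see which candidate values a grid of a given resolution would actually try for one of these ranges, you can use the iterator function (a small sketch):

iterator(k_range, 3)  # the three values of K a resolution-3 grid would try
iterator(l_range, 3)  # likewise for lambda, log-spaced between 1e-4 and 10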

It remains to define how the tuning should be done; let's specify a very coarse grid search with cross-validation and instantiate a tuned model:

tuning = Grid(resolution=3)
resampling = CV(nfolds=6)

tm = TunedModel(model=krb, tuning=tuning, resampling=resampling,
                ranges=ranges, measure=rmsl)
DeterministicTunedModel(
    model = KNNRidgeBlend(
            knn_model = KNNRegressor @108,
            ridge_model = RidgeRegressor @235,
            knn_weight = 0.3),
    tuning = Grid(
            goal = nothing,
            resolution = 3,
            shuffle = true,
            rng = Random._GLOBAL_RNG()),
    resampling = CV(
            nfolds = 6,
            shuffle = false,
            rng = Random._GLOBAL_RNG()),
    measure = rmsl(),
    weights = nothing,
    operation = MLJModelInterface.predict,
    range = MLJBase.NumericRange{T,MLJBase.Bounded,Symbol} where T[NumericRange{Int64,…} @426, NumericRange{Float64,…} @046, NumericRange{Float64,…} @442],
    train_best = true,
    repeats = 1,
    n = nothing,
    acceleration = CPU1{Nothing}(nothing),
    acceleration_resampling = CPU1{Nothing}(nothing),
    check_measure = true) @429

which we can now finally fit...

mtm = machine(tm, X, y)
fit!(mtm, rows=train);

To retrieve the best model, you can use:

krb_best = fitted_params(mtm).best_model
@show krb_best.knn_model.K
@show krb_best.ridge_model.lambda
@show krb_best.knn_weight
krb_best.knn_model.K = 2
krb_best.ridge_model.lambda = 0.03162277660168379
krb_best.knn_weight = 0.1

You can also use mtm to make predictions (these will be made using the best model):

preds = predict(mtm, rows=test)
rmsl(y[test], preds)
0.13932972436067093
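
Finally, the machine's report contains diagnostics from the tuning step, such as the history of evaluated models and their measurements; the exact field names depend on your MLJ version, so this is only a sketch:

report(mtm) |> pprint  # inspect the tuning history and related diagnostics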