# Lab 6b - Ridge and Lasso regression

In this tutorial, we explore the application of Ridge and Lasso regression to the `Hitters` R dataset.

## Getting started

```julia
using MLJ
import RDatasets: dataset
using PrettyPrinting
import Distributions
const D = Distributions

# load the regressor types from MLJLinearModels so they can be instantiated below
@load LinearRegressor pkg=MLJLinearModels
@load RidgeRegressor pkg=MLJLinearModels
@load LassoRegressor pkg=MLJLinearModels
```

```
LassoRegressor(
    lambda = 1.0,
    fit_intercept = true,
    penalize_intercept = false,
    solver = nothing) @053
```

We load the dataset using the `dataset` function, which takes the package and dataset names as arguments.

```julia
hitters = dataset("ISLR", "Hitters")
@show size(hitters)
names(hitters) |> pprint
```

```
size(hitters) = (322, 20)
["AtBat",
 "Hits",
 "HmRun",
 "Runs",
 "RBI",
 "Walks",
 "Years",
 "CAtBat",
 "CHits",
 "CHmRun",
 "CRuns",
 "CRBI",
 "CWalks",
 "League",
 "Division",
 "PutOuts",
 "Assists",
 "Errors",
 "Salary",
 "NewLeague"]
```

Let's unpack the dataset with the `unpack` function. In this case, the target is `Salary` (`==(:Salary)`) and all other columns are features (`col->true`).

```julia
y, X = unpack(hitters, ==(:Salary), col->true);
```

The target has missing values, which we will simply drop. We first build a boolean mask selecting the rows where the target is not missing. Note the use of the element-wise (broadcast) operator `.`.

```julia
no_miss = .!ismissing.(y);
```

We collect the non-missing values of the target in an array.

```julia
# And keep only the corresponding feature rows.
y = collect(skipmissing(y))
X = X[no_miss, :]

# Let's now split our dataset into train and test sets.
train, test = partition(eachindex(y), 0.5, shuffle=true, rng=424);
```

Let's have a look at the target.

```julia
using PyPlot

figure(figsize=(8,6))
plot(y, ls="none", marker="o")

xticks(fontsize=12); yticks(fontsize=12)
xlabel("Index", fontsize=14); ylabel("Salary", fontsize=14)
```

That looks quite skewed; let's have a look at a histogram:

```julia
figure(figsize=(8,6))
hist(y, bins=50, density=true)

xticks(fontsize=12); yticks(fontsize=12)
xlabel("Salary", fontsize=14); ylabel("Density", fontsize=14)

edfit = D.fit_mle(D.Exponential, y)
xx = range(minimum(y), 2500, length=100)
yy = pdf.(edfit, xx)
plot(xx, yy, lw=3, label="Exponential distribution fit")

legend(fontsize=12)
```

### Data preparation

Most features are currently encoded as integers, but we will treat them as continuous. To coerce these integer (`Count`) features to `Continuous`, we nest the `autotype` function inside the `coerce` function: `autotype` returns a dictionary mapping column names to suggested scientific types, which is then passed to `coerce`. For more details on the use of `autotype`, see the ScientificTypes documentation.

```julia
Xc = coerce(X, autotype(X, rules=(:discrete_to_continuous,)))
scitype(Xc)
```

```
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{2},1}}}
```
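If you are curious what `autotype` suggests here, you can inspect the returned dictionary directly (a quick sketch, not part of the original lab):

```julia
# The suggested coercions: a dictionary mapping column names to the scientific
# type they should be coerced to (the integer-valued columns map to Continuous).
autotype(X, rules=(:discrete_to_continuous,)) |> pprint
```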

There are a few categorical features left, which we will one-hot encode.
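To see which columns these are, we can look at the schema (a small sketch, not part of the original lab):

```julia
# Show each column's machine type and scientific type; the Multiclass columns
# (League, Division, NewLeague) are the ones the one-hot encoder will expand.
schema(Xc)
```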

## Ridge pipeline

### Baseline

Let's first fit a simple pipeline with a standardizer, a one-hot-encoder and a basic linear regression:

```julia
model = @pipeline(Standardizer(),
                  OneHotEncoder(),
                  LinearRegressor())

pipe = machine(model, Xc, y)
fit!(pipe, rows=train)
ŷ = predict(pipe, rows=test)
round(rms(ŷ, y[test])^2, sigdigits=4)  # squared rms = mean squared error on the test set
```

```
123500.0
```
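Since this figure comes from a single train/test split, one could also estimate the error by cross-validation on the training rows as a sanity check; a minimal sketch (not part of the original lab):

```julia
# 3-fold cross-validation of the baseline pipeline, restricted to the training
# rows; `evaluate!` refits the machine on each fold and reports the rms per fold.
evaluate!(pipe, rows=train, resampling=CV(nfolds=3, rng=424), measure=rms)
```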

Let's get a feel for how we're doing:

```julia
figure(figsize=(8,6))

res = ŷ .- y[test]
stem(res)

xticks(fontsize=12); yticks(fontsize=12)
xlabel("Index", fontsize=14); ylabel("Residual (ŷ - y)", fontsize=14)

ylim([-1300, 1000])
```

```julia
figure(figsize=(8,6))
hist(res, bins=30, density=true, color="green")

xx = range(-1100, 1100, length=100)
ndfit = D.fit_mle(D.Normal, res)
lfit  = D.fit_mle(D.Laplace, res)

plot(xx, pdf.(ndfit, xx), lw=3, color="orange", label="Normal fit")
plot(xx, pdf.(lfit, xx), lw=3, color="magenta", label="Laplace fit")

legend(fontsize=12)

xticks(fontsize=12); yticks(fontsize=12)
xlabel("Residual (ŷ - y)", fontsize=14); ylabel("Density", fontsize=14)
xlim([-1100, 1100])
```

### Basic Ridge

Let's now swap the linear regressor for a ridge regressor, keeping the default penalty (`lambda = 1`). We can modify the supervised model in the pipeline directly:

```julia
pipe.model.linear_regressor = RidgeRegressor()
fit!(pipe, rows=train)
ŷ = predict(pipe, rows=test)
round(rms(ŷ, y[test])^2, sigdigits=4)
```

```
109600.0
```

OK, that's a bit better, but surely we can do better still with an appropriate choice of the hyperparameter.

### Cross validating

What penalty should you use? Let's do a simple CV to try to find out:

```julia
r  = range(model, :(linear_regressor.lambda), lower=1e-2, upper=100_000, scale=:log10)
tm = TunedModel(model=model, ranges=r, tuning=Grid(resolution=50),
                resampling=CV(nfolds=3, rng=4141), measure=rms)
mtm = machine(tm, Xc, y)
fit!(mtm, rows=train)

best_mdl = fitted_params(mtm).best_model
round(best_mdl.linear_regressor.lambda, sigdigits=4)
```

```
5.179
```
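To see how the cross-validated error varies with `lambda`, you can dig into the tuning report; a rough sketch (the `plotting` field and its layout are assumptions about the MLJTuning report, not part of the original lab):

```julia
# The tuning report collects, for every lambda on the grid, the corresponding
# cross-validated rms; these two vectors can then be plotted against each other.
rep     = report(mtm)
lambdas = vec(rep.plotting.parameter_values)   # the 50 lambda values tried
errors  = rep.plotting.measurements            # the associated CV rms values
```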

Right, and with that we get:

```julia
ŷ = predict(mtm, rows=test)
round(rms(ŷ, y[test])^2, sigdigits=4)
```

```
96690.0
```

Let's see:

```julia
figure(figsize=(8,6))

res = ŷ .- y[test]
stem(res)

xticks(fontsize=12); yticks(fontsize=12)
xlabel("Index", fontsize=14); ylabel("Residual (ŷ - y)", fontsize=14)
xlim(1, length(res))

ylim([-1300, 1000])
```

You can compare this with the residuals obtained earlier.

## Lasso pipeline

Let's do the same as above but using a Lasso model and adjusting the range a bit:

```julia
mtm.model.model.linear_regressor = LassoRegressor()
mtm.model.range = range(model, :(linear_regressor.lambda), lower=500, upper=100_000, scale=:log10)
fit!(mtm, rows=train)

best_mdl = fitted_params(mtm).best_model
round(best_mdl.linear_regressor.lambda, sigdigits=4)
```

```
2531.0
```

OK, let's see how that does:

```julia
ŷ = predict(mtm, rows=test)
round(rms(ŷ, y[test])^2, sigdigits=4)
```

```
98330.0
```

Pretty good! And the parameters are reasonably sparse, as expected:

```julia
coefs, intercept = fitted_params(mtm.fitresult).linear_regressor
@show coefs
@show intercept
```

```
coefs = [:AtBat => -0.0, :Hits => 0.0, :HmRun => -0.0, :Runs => 82.33487884749951, :RBI => 0.0, :Walks => 37.88394248222143, :Years => -0.0, :CAtBat => 0.0, :CHits => 152.33401765158266, :CHmRun => 0.0, :CRuns => 0.0, :CRBI => 28.535678620243, :CWalks => -0.0, :League__A => -0.0, :League__N => 0.0, :Division__E => 22.920702987660867, :Division__W => -45.33006538883337, :PutOuts => 64.09339067499211, :Assists => 0.0, :Errors => -0.0, :NewLeague__A => -0.0, :NewLeague__N => 0.0]
intercept = 555.6709756637167
```

with around 70% sparsity (15 of the 22 coefficients are zero):

```julia
coef_vals = [last(c) for c in coefs]   # extract the values from the name => value pairs
sum(coef_vals .≈ 0) / length(coefs)
```

```
0.6818181818181818
```
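Equivalently, one can list just the features the lasso retained; a small sketch (not part of the original lab):

```julia
# Keep only the name => value pairs whose coefficient is not numerically zero
kept = [name => val for (name, val) in coefs if !isapprox(val, 0, atol=1e-8)]
kept |> pprint
```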

Let's visualise this:

```julia
figure(figsize=(8,6))
stem(coef_vals)

# names of the features, including one-hot-encoded ones
all_names = [:AtBat, :Hits, :HmRun, :Runs, :RBI, :Walks, :Years,
             :CAtBat, :CHits, :CHmRun, :CRuns, :CRBI, :CWalks,
             :League__A, :League__N, :Div_E, :Div_W,
             :PutOuts, :Assists, :Errors, :NewLeague_A, :NewLeague_N]

idxshow = collect(1:length(coef_vals))[abs.(coef_vals) .> 10]
xticks(idxshow .- 1, all_names[idxshow], rotation=45, fontsize=12)
yticks(fontsize=12)
ylabel("Amplitude", fontsize=14)
```

## Elastic net pipeline

The elastic net combines the ridge and lasso penalties: in `ElasticNetRegressor`, `lambda` controls the L2 (ridge) part and `gamma` the L1 (lasso) part, so we tune both.

```julia
@load ElasticNetRegressor pkg=MLJLinearModels

mtm.model.model.linear_regressor = ElasticNetRegressor()
mtm.model.range = [range(model, :(linear_regressor.lambda), lower=0.1, upper=100, scale=:log10),
                   range(model, :(linear_regressor.gamma),  lower=500, upper=10_000, scale=:log10)]
mtm.model.tuning = Grid(resolution=10)
fit!(mtm, rows=train)

best_mdl = fitted_params(mtm).best_model
@show round(best_mdl.linear_regressor.lambda, sigdigits=4)
@show round(best_mdl.linear_regressor.gamma, sigdigits=4)
```

```
round(best_mdl.linear_regressor.lambda, sigdigits = 4) = 46.42
round(best_mdl.linear_regressor.gamma, sigdigits = 4) = 972.9
```

And it's not too bad in terms of accuracy either:

```julia
ŷ = predict(mtm, rows=test)
round(rms(ŷ, y[test])^2, sigdigits=4)
```

```
97140.0
```

But the simple ridge regression, with a tuned penalty, still seems to work best here.