Lab 3 - Linear Regression

Simple linear regression

MLJ essentially serves as a unified interface to many existing Julia packages, each of which provides its own models and follows its own conventions.

Simple linear regression demonstrates this. Several packages offer it (beyond just using the backslash operator): here we will use MLJLinearModels, but we could also have used GLM, ScikitLearn, etc.

To load a model from a given package, use @load ModelName pkg=PackageName:

using MLJ

@load LinearRegressor pkg=MLJLinearModels
LinearRegressor(
    fit_intercept = true,
    solver = nothing) @833

Note: in order to be able to load this, you must have the relevant package in your environment; if you don't, you can always add it (using Pkg; Pkg.add("MLJLinearModels")).

Let's load the Boston data set:

import RDatasets: dataset
import DataFrames: describe, select, Not, rename!
boston = dataset("MASS", "Boston")
first(boston, 3)
3×14 DataFrame
│ Row │ Crim    │ Zn      │ Indus   │ Chas  │ NOx     │ Rm      │ Age     │ Dis     │ Rad   │ Tax   │ PTRatio │ Black   │ LStat   │ MedV    │
│     │ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.00632 │ 18.0    │ 2.31    │ 0     │ 0.538   │ 6.575   │ 65.2    │ 4.09    │ 1     │ 296   │ 15.3    │ 396.9   │ 4.98    │ 24.0    │
│ 2   │ 0.02731 │ 0.0     │ 7.07    │ 0     │ 0.469   │ 6.421   │ 78.9    │ 4.9671  │ 2     │ 242   │ 17.8    │ 396.9   │ 9.14    │ 21.6    │
│ 3   │ 0.02729 │ 0.0     │ 7.07    │ 0     │ 0.469   │ 7.185   │ 61.1    │ 4.9671  │ 2     │ 242   │ 17.8    │ 392.83  │ 4.03    │ 34.7    │

Let's get a feel for the data

describe(boston, :mean, :std, :eltype)
14×4 DataFrame
│ Row │ variable │ mean     │ std      │ eltype   │
│     │ Symbol   │ Float64  │ Float64  │ DataType │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ Crim     │ 3.61352  │ 8.60155  │ Float64  │
│ 2   │ Zn       │ 11.3636  │ 23.3225  │ Float64  │
│ 3   │ Indus    │ 11.1368  │ 6.86035  │ Float64  │
│ 4   │ Chas     │ 0.06917  │ 0.253994 │ Int64    │
│ 5   │ NOx      │ 0.554695 │ 0.115878 │ Float64  │
│ 6   │ Rm       │ 6.28463  │ 0.702617 │ Float64  │
│ 7   │ Age      │ 68.5749  │ 28.1489  │ Float64  │
│ 8   │ Dis      │ 3.79504  │ 2.10571  │ Float64  │
│ 9   │ Rad      │ 9.54941  │ 8.70726  │ Int64    │
│ 10  │ Tax      │ 408.237  │ 168.537  │ Int64    │
│ 11  │ PTRatio  │ 18.4555  │ 2.16495  │ Float64  │
│ 12  │ Black    │ 356.674  │ 91.2949  │ Float64  │
│ 13  │ LStat    │ 12.6531  │ 7.14106  │ Float64  │
│ 14  │ MedV     │ 22.5328  │ 9.1971   │ Float64  │

So there are no missing values and most variables are encoded as floating-point numbers. In MLJ it's important to specify the interpretation of the features (should a feature be considered Continuous, a Count, ...?); see also the tutorial section on scientific types.
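
Before doing any coercion, you can inspect how each column is currently interpreted using schema, which MLJ re-exports:

schema(boston)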

Here we will simply interpret the integer features as continuous, since we will just use a basic linear regression:

data = coerce(boston, autotype(boston, :discrete_to_continuous));
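
If you prefer to be explicit about which columns to convert, an equivalent coercion (the integer columns here are Chas, Rad and Tax) would be along these lines:

data = coerce(boston, :Chas => Continuous, :Rad => Continuous, :Tax => Continuous);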

Let's also extract the target variable (MedV):

y = data.MedV
X = select(data, Not(:MedV));
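
As an aside, MLJ also provides unpack to split off the target in one call; the exact signature has varied a little across MLJ versions, but a sketch along these lines should work:

y, X = unpack(data, ==(:MedV), colname -> true);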

Let's declare a simple multivariate linear regression model:

mdl = LinearRegressor()
LinearRegressor(
    fit_intercept = true,
    solver = nothing) @003

First, let's do a very simple univariate regression. In order to fit the model on the data, we need to wrap it in a machine, which, in MLJ, is the combination of a model and the data to apply the model on:

X_uni = select(X, :LStat) # only a single feature
mach_uni = machine(mdl, X_uni, y)
fit!(mach_uni)
Machine{LinearRegressor} @470 trained 1 time.
  args: 
    1:	Source @957 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @518 ⏎ `AbstractArray{Continuous,1}`

You can then retrieve the fitted parameters using fitted_params:

fp = fitted_params(mach_uni)
@show fp.coefs
@show fp.intercept
fp.coefs = [:LStat => -0.950049353757991]
fp.intercept = 34.553840879383095
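
As a quick sanity check (not part of the original lab), these fitted parameters should agree with the closed-form least-squares solution obtained with the backslash operator:

A = hcat(X_uni.LStat, ones(length(y)))  # design matrix with an intercept column
β = A \ y                               # β[1] ≈ -0.95 (slope), β[2] ≈ 34.55 (intercept)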

You can also visualise this

using PyPlot

figure(figsize=(8,6))
plot(X.LStat, y, ls="none", marker="o")
Xnew = (LStat = collect(range(extrema(X.LStat)..., length=100)),)
plot(Xnew.LStat, predict(mach_uni, Xnew))
Univariate regression

The multivariate case is very similar

mach = machine(mdl, X, y)
fit!(mach)

fp = fitted_params(mach)
coefs = fp.coefs
intercept = fp.intercept
for (name, val) in coefs
    println("$(rpad(name, 8)):  $(round(val, sigdigits=3))")
end
println("Intercept: $(round(intercept, sigdigits=3))")
Crim    :  -0.108
Zn      :  0.0464
Indus   :  0.0206
Chas    :  2.69
NOx     :  -17.8
Rm      :  3.81
Age     :  0.000692
Dis     :  -1.48
Rad     :  0.306
Tax     :  -0.0123
PTRatio :  -0.953
Black   :  0.00931
LStat   :  -0.525
Intercept: 36.5
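
If you prefer a tabular view of the coefficients, one option (a small sketch, assuming coefs is the vector of name => value pairs shown above) is to collect them into a DataFrame:

import DataFrames: DataFrame
coef_table = DataFrame(feature = first.(coefs), coefficient = last.(coefs))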

You can also use the machine to predict values and, for instance, compute the root mean squared error:

ŷ = predict(mach, X)
round(rms(ŷ, y), sigdigits=4)
4.679
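
The rms measure is simply the root mean squared error, which you can verify by computing it by hand:

import Statistics: mean
sqrt(mean((ŷ .- y).^2))  # should match rms(ŷ, y)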

Let's see what the residuals look like

figure(figsize=(8,6))
res = ŷ .- y
stem(res)
Plot of the residuals

Maybe a histogram is more appropriate here:

figure(figsize=(8,6))
hist(res, density=true)
Histogram of the residuals

Interaction and transformation

Let's say we also want to consider an interaction term between LStat and Age. To do this, just create a new DataFrame with an additional column corresponding to the interaction term:

X2 = hcat(X, X.LStat .* X.Age);

So here we have a DataFrame with one extra column corresponding to the elementwise products of :LStat and :Age. DataFrames gives this a default name (:x1), which we can change:

rename!(X2, :x1 => :interaction);
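
Equivalently, you could build the interaction column in one step with DataFrames' transform (a sketch, assuming a reasonably recent version of DataFrames):

import DataFrames: transform, ByRow
X2 = transform(X, [:LStat, :Age] => ByRow(*) => :interaction);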

Ok cool, now let's try the linear regression again

mach = machine(mdl, X2, y)
fit!(mach)
ŷ = predict(mach, X2)
round(rms(ŷ, y), sigdigits=4)
4.676

We get slightly better results but nothing spectacular.
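
Note that these are in-sample errors, which tend to be optimistic. To get a rough out-of-sample estimate, you could ask MLJ to cross-validate the machine, for instance (a sketch):

evaluate!(mach, resampling=CV(nfolds=5, shuffle=true), measure=rms)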

Let's get back to the lab, where they consider regressing the target variable on LStat and LStat^2; again, it's essentially a case of defining the right table of features:

X3 = (LStat = X.LStat, LStat2 = X.LStat.^2)  # a column table with the squared feature
mach = machine(mdl, X3, y)
fit!(mach)
ŷ = predict(mach, X3)
round(rms(ŷ, y), sigdigits=4)
5.507

which again, we can visualise:

Xnew = (LStat = Xnew.LStat, LStat2 = Xnew.LStat.^2)

figure(figsize=(8,6))
plot(X.LStat, y, ls="none", marker="o")
plot(Xnew.LStat, predict(mach, Xnew))
Polynomial regression