Lab 4 - Logistic Regression, LDA, QDA, KNN

Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save).

Stock market data

Let's load the usual packages and the data

using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, describe, select, Not
import StatsBase: countmap, cor, var
using PrettyPrinting

smarket = dataset("ISLR", "Smarket")
@show size(smarket)
@show names(smarket)
size(smarket) = (1250, 9)
names(smarket) = ["Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume", "Today", "Direction"]

Since we often want to only show a few significant digits for the metrics etc, let's introduce a very simple function that does that:

r3 = x -> round(x, sigdigits=3)
r3(pi)
3.14

Let's get a description too

describe(smarket, :mean, :std, :eltype)
9×4 DataFrame
│ Row │ variable  │ mean      │ std      │ eltype                         │
│     │ Symbol    │ Union…    │ Union…   │ DataType                       │
├─────┼───────────┼───────────┼──────────┼────────────────────────────────┤
│ 1   │ Year      │ 2003.02   │ 1.40902  │ Float64                        │
│ 2   │ Lag1      │ 0.0038344 │ 1.1363   │ Float64                        │
│ 3   │ Lag2      │ 0.0039192 │ 1.13628  │ Float64                        │
│ 4   │ Lag3      │ 0.001716  │ 1.1387   │ Float64                        │
│ 5   │ Lag4      │ 0.001636  │ 1.13877  │ Float64                        │
│ 6   │ Lag5      │ 0.0056096 │ 1.14755  │ Float64                        │
│ 7   │ Volume    │ 1.4783    │ 0.360357 │ Float64                        │
│ 8   │ Today     │ 0.0031384 │ 1.13633  │ Float64                        │
│ 9   │ Direction │           │          │ CategoricalValue{String,UInt8} │

The target variable is :Direction:

y = smarket.Direction
X = select(smarket, Not(:Direction));

We can compute all the pairwise correlations; we use Matrix so that the dataframe entries are considered as one matrix of numbers with the same type (otherwise cor won't work):

cm = X |> Matrix |> cor
round.(cm, sigdigits=1)
8×8 Array{Float64,2}:
 1.0    0.03    0.03    0.03    0.04    0.03    0.5    0.03
 0.03   1.0    -0.03   -0.01   -0.003  -0.006   0.04  -0.03
 0.03  -0.03    1.0    -0.03   -0.01   -0.004  -0.04  -0.01
 0.03  -0.01   -0.03    1.0    -0.02   -0.02   -0.04  -0.002
 0.04  -0.003  -0.01   -0.02    1.0    -0.03   -0.05  -0.007
 0.03  -0.006  -0.004  -0.02   -0.03    1.0    -0.02  -0.03
 0.5    0.04   -0.04   -0.04   -0.05   -0.02    1.0    0.01
 0.03  -0.03   -0.01   -0.002  -0.007  -0.03    0.01   1.0

Let's see what the :Volume feature looks like:

using PyPlot
figure(figsize=(8,6))
plot(X.Volume)
xlabel("Tick number", fontsize=14)
ylabel("Volume", fontsize=14)
xticks(fontsize=12)
yticks(fontsize=12)
volume

Logistic Regression

We will now try to train models; the target :Direction has two classes: Up and Down; it needs to be interpreted as a categorical object, and we will mark it as a ordered factor to specify that 'Up' is positive and 'Down' negative (for the confusion matrix later):

y = coerce(y, OrderedFactor)
classes(y[1])
2-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Down"
 "Up"

Note that in this case the default order comes from the lexicographic order which happens to map to our intuition since D comes before U.

figure(figsize=(8,6))
cm = countmap(y)
bar([1, 2], [cm["Down"], cm["Up"]])
xticks([1, 2], ["Down", "Up"], fontsize=12)
yticks(fontsize=12)
ylabel("Number of occurences", fontsize=14)

Seems pretty balanced.

Let's now try fitting a simple logistic classifier (aka logistic regression) not using :Year and :Today:

@load LogisticClassifier pkg=MLJLinearModels
X2 = select(X, Not([:Year, :Today]))
clf = machine(LogisticClassifier(), X2, y)
Machine{LogisticClassifier} @216 trained 0 times.
  args: 
    1:	Source @540 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:	Source @218 ⏎ `AbstractArray{OrderedFactor{2},1}`

Let's fit it to the data and try to reproduce the output:

fit!(clf)
ŷ = MLJ.predict(clf, X2)
ŷ[1:3]
3-element MLJBase.UnivariateFiniteArray{OrderedFactor{2},String,UInt8,Float64,1}:
 UnivariateFinite{OrderedFactor{2}}(Down=>0.493, Up=>0.507)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.518, Up=>0.482)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.519, Up=>0.481)

Note that here the are scores. We can recover the average cross-entropy loss:

cross_entropy(ŷ, y) |> mean |> r3
0.691

in order to recover the class, we could use the mode and compare the misclassification rate:

ŷ = predict_mode(clf, X2)
misclassification_rate(ŷ, y) |> r3
0.479

Well that's not fantastic...

Let's visualise how we're doing building a confusion matrix, first is predicted, second is truth:

cm = confusion_matrix(ŷ, y)
              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │    Down     │     Up      │
├─────────────┼─────────────┼─────────────┤
│    Down     │     144     │     141     │
├─────────────┼─────────────┼─────────────┤
│     Up      │     458     │     507     │
└─────────────┴─────────────┴─────────────┘

We can then compute the accuracy or precision, etc. easily for instance:

@show false_positive(cm)
@show accuracy(ŷ, y)  |> r3
@show accuracy(cm)    |> r3  # same thing
@show precision(ŷ, y) |> r3
@show recall(ŷ, y)    |> r3
@show f1score(ŷ, y)   |> r3
false_positive(cm) = 458
accuracy(ŷ, y) |> r3 = 0.521
accuracy(cm) |> r3 = 0.521
precision(ŷ, y) |> r3 = 0.525
recall(ŷ, y) |> r3 = 0.782
f1score(ŷ, y) |> r3 = 0.629

Let's now train on the data before 2005 and use it to predict on the rest. Let's find the row indices for which the condition holds

train = 1:findlast(X.Year .< 2005)
test = last(train)+1:length(y);

We can now just re-fit the machine that we've already defined just on those rows and predict on the test:

fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)
accuracy(ŷ, y[test]) |> r3
0.484

Well, that's not very good... Let's retrain a machine using only :Lag1 and :Lag2:

X3 = select(X2, [:Lag1, :Lag2])
clf = machine(LogisticClassifier(), X3, y)
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)
accuracy(ŷ, y[test]) |> r3
0.56

Interesting... it has higher accuracy than the model with more features! This could be investigated further by increasing the regularisation parameter but we'll leave that aside for now.

We can use a trained machine to predict on new data:

Xnew = (Lag1 = [1.2, 1.5], Lag2 = [1.1, -0.8])
ŷ = MLJ.predict(clf, Xnew)
ŷ |> pprint
UnivariateFinite{OrderedFactor{2},String,UInt8,Float64}[UnivariateFinite{OrderedFactor{2}}(Down=>0.521, Up=>0.479), UnivariateFinite{OrderedFactor{2}}(Down=>0.504, Up=>0.496)]

Note: when specifying data, we used a simple NamedTuple; we could also have defined a dataframe or any other compatible tabular container. Note also that we retrieved the raw predictions here i.e.: a score for each class; we could have used predict_mode or indeed

mode.(ŷ)
2-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Down"
 "Down"

LDA

Let's do a similar thing but with a LDA model this time:

@load BayesianLDA pkg=MultivariateStats

clf = machine(BayesianLDA(), X3, y)
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)

accuracy(ŷ, y[test]) |> r3
0.56

Note: BayesianLDA is LDA using a multivariate normal model for each class with a default prior inferred from the proportions for each class in the training data. You can also use the bare LDA model which does not make these assumptions and allows using a different metric in the transformed space, see the docs for details.

@load LDA pkg=MultivariateStats
using Distances

clf = machine(LDA(dist=CosineDist()), X3, y)
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)

accuracy(ŷ, y[test]) |> r3
0.548

QDA

Bayesian QDA is available via ScikitLearn:

@load BayesianQDA pkg=ScikitLearn
BayesianQDA(
    priors = nothing,
    reg_param = 0.0,
    store_covariance = false,
    tol = 0.0001) @838

Using it is done in much the same way as before:

clf = machine(BayesianQDA(), X3, y)
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)

accuracy(ŷ, y[test]) |> r3
0.599

KNN

We can use K-Nearest Neighbors models via the NearestNeighbors package:

@load KNNClassifier pkg=NearestNeighbors

knnc = KNNClassifier(K=1)
clf = machine(knnc, X3, y)
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)
accuracy(ŷ, y[test]) |> r3
0.5

Pretty bad... let's try with three neighbors

knnc.K = 3
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)
accuracy(ŷ, y[test]) |> r3
0.532

A bit better but not hugely so.

Caravan insurance data

The caravan dataset is part of ISLR as well:

caravan  = dataset("ISLR", "Caravan")
size(caravan)
(5822, 86)

The target variable is Purchase, effectively a categorical

purchase = caravan.Purchase
vals     = unique(purchase)
2-element Array{String,1}:
 "No"
 "Yes"

Let's see how many of each we have

nl1 = sum(purchase .== vals[1])
nl2 = sum(purchase .== vals[2])
println("#$(vals[1]) ", nl1)
println("#$(vals[2]) ", nl2)
#No 5474
#Yes 348

we can also visualise this as was done before:

figure(figsize=(8,6))
cm = countmap(purchase)
bar([1, 2], [cm["No"], cm["Yes"]])
xticks([1, 2], ["No", "Yes"], fontsize=12)
yticks(fontsize=12)
ylabel("Number of occurences", fontsize=14)

that's quite unbalanced.

Apart from the target, all other variables are numbers; we can standardize the data:

y, X = unpack(caravan, ==(:Purchase), col->true)

mstd = machine(Standardizer(), X)
fit!(mstd)
Xs = transform(mstd, X)

var(Xs[:,1]) |> r3
1.0

Note: in MLJ, it is recommended to work with pipelines / networks when possible and not do "step-by-step" transformation and fitting of the data as this is more error prone. We do it here to stick to the ISL tutorial.

We split the data in the first 1000 rows for testing and the rest for training:

test = 1:1000
train = last(test)+1:nrows(Xs);

Let's now fit a KNN model and check the misclassification rate

clf = machine(KNNClassifier(K=3), Xs, y)
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)

accuracy(ŷ, y[test]) |> r3
0.925

that looks good but recall the problem is very unbalanced

mean(y[test] .!= "No") |> r3
0.059

Let's fit a logistic classifier to this problem

clf = machine(LogisticClassifier(), Xs, y)
fit!(clf, rows=train)
ŷ = predict_mode(clf, rows=test)

accuracy(ŷ, y[test]) |> r3
0.934

ROC and AUC

Since we have a probabilistic classifier, we can also check metrics that take scores into account such as the area under the ROC curve (AUC):

ŷ = MLJ.predict(clf, rows=test)

auc(ŷ, y[test])
0.7434211711306039

We can also display the curve itself

fprs, tprs, thresholds = roc(ŷ, y[test])

figure(figsize=(8,6))
plot(fprs, tprs)

xlabel("False Positive Rate", fontsize=14)
ylabel("True Positive Rate", fontsize=14)
xticks(fontsize=12)
yticks(fontsize=12)
ROC