Lab 10 - PCA and Clustering


Getting started

using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, select, Not, describe
using Random

data = dataset("datasets", "USArrests")
names(data)
5-element Array{String,1}:
 "State"
 "Murder"
 "Assault"
 "UrbanPop"
 "Rape"

Let's have a look at the mean and standard deviation of each feature:

describe(data, :mean, :std)
5×3 DataFrame
│ Row │ variable │ mean   │ std     │
│     │ Symbol   │ Union… │ Union…  │
├─────┼──────────┼────────┼─────────┤
│ 1   │ State    │        │         │
│ 2   │ Murder   │ 7.788  │ 4.35551 │
│ 3   │ Assault  │ 170.76 │ 83.3377 │
│ 4   │ UrbanPop │ 65.54  │ 14.4748 │
│ 5   │ Rape     │ 21.232 │ 9.36638 │

The features are on very different scales (for instance Assault has a standard deviation of about 83 against roughly 4.4 for Murder), something to keep in mind for PCA. Let's extract the numerical features and coerce the integer-valued columns to Continuous:

X = select(data, Not(:State))
X = coerce(X, :UrbanPop=>Continuous, :Assault=>Continuous);
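As a quick sanity check, all four columns should now have the Continuous scientific type, which schema (loaded with MLJ) lets us verify:

schema(X)   # every column should now be Continuous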

PCA pipeline

PCA is usually best done after standardization, but we won't standardize here:

@load PCA pkg=MultivariateStats

pca_mdl = PCA(pratio=1)   # keep as many components as needed to explain 100% of the variance
pca = machine(pca_mdl, X)
fit!(pca)
W = transform(pca, X);

W is the PCA-transformed data; since we set pratio=1, the model keeps as many components as are needed to explain all of the variance, so all 4 are recovered:

schema(W).names
(:x1, :x2, :x3, :x4)
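If instead you only wanted the first two components, you could bound the output dimension; a minimal sketch, assuming the maxoutdim hyperparameter of the MultivariateStats-backed PCA:

pca2_mdl = PCA(maxoutdim=2)    # keep at most 2 principal components
pca2 = machine(pca2_mdl, X)
fit!(pca2)
W2 = transform(pca2, X);       # W2 now has columns :x1 and :x2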

Let's inspect the fit:

r = report(pca)
cumsum(r.principalvars ./ r.tvar)
4-element Array{Float64,1}:
 0.9655342205668823
 0.9933515571990573
 0.9991510921213993
 1.0

The second line gives the cumulative proportion of variance explained by the first k principal components: the first component alone accounts for about 96.5% of the variance, and with two components we recover over 99%.
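To see how the original features combine into each component you can inspect the loadings; a minimal sketch, assuming the fitted parameters of this PCA model expose a projection field:

fitted_params(pca).projection   # columns hold the loadings of each principal component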

More interesting data...

Instead of just playing with toy data, let's load the orange juice data and extract only the columns corresponding to price data:

data = dataset("ISLR", "OJ")

X = select(data, [:PriceCH, :PriceMM, :DiscCH, :DiscMM, :SalePriceMM,
                  :SalePriceCH, :PriceDiff, :PctDiscMM, :PctDiscCH]);
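Before building the pipeline it is worth checking the scales of these columns, since the price and discount variables are measured on rather different scales, which motivates the standardization step below:

describe(X, :mean, :std)   # compare the scales of the price and discount columns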

PCA pipeline

Random.seed!(1515)

SPCA = @pipeline(Standardizer(),
                 PCA(pratio=1-1e-4))

spca = machine(SPCA, X)
fit!(spca)
W = transform(spca, X)
names(W)
6-element Array{String,1}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"

How much of the variance can we explain?

# the pipeline report holds one sub-report per machine; the second entry is the PCA report
rpca = collect(values(report(spca).report_given_machine))[2]
cs = cumsum(rpca.principalvars ./ rpca.tvar)
6-element Array{Float64,1}:
 0.4174696748484731
 0.7233074812209765
 0.9436456234171869
 0.9997505816044248
 0.9998956501446636
 0.9999999999999998

Let's visualise this with a bar chart of the cumulative explained variance:

using PyPlot

figure(figsize=(8,6))

bar(1:length(cs), cs)
plot(1:length(cs), cs, color="red", marker="o")

xlabel("Number of PCA features", fontsize=14)
ylabel("Ratio of explained variance", fontsize=14)
Figure: PCA explained variance (cumulative ratio against the number of PCA features).

So 4 PCA features are enough to recover almost all (over 99.9%) of the variance.
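If you wanted the pipeline itself to stop at four components, one option is to bound the output dimension of the PCA step; a sketch along the same lines as before, again assuming the maxoutdim hyperparameter:

SPCA4 = @pipeline(Standardizer(),
                  PCA(maxoutdim=4))      # keep at most 4 principal components

spca4 = machine(SPCA4, X)
fit!(spca4)
W4 = transform(spca4, X);                # W4 has 4 columns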

Clustering

Random.seed!(1515)

@load KMeans pkg=Clustering
SPCA2 = @pipeline(Standardizer(),
                  PCA(),
                  KMeans(k=3))

spca2 = machine(SPCA2, X)
fit!(spca2)

# extract the cluster assignments from the KMeans sub-report (third machine in the pipeline)
assignments = collect(values(report(spca2).report_given_machine))[3].assignments
mask1 = assignments .== 1
mask2 = assignments .== 2
mask3 = assignments .== 3;
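Before plotting, a quick check of how many observations fall in each cluster, using the masks just defined:

sum(mask1), sum(mask2), sum(mask3)   # number of points in each of the three clusters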

Now we can visualise the clusters in the plane of the first two principal components:

using PyPlot

figure(figsize=(8, 6))
for (m, c) in zip((mask1, mask2, mask3), ("red", "green", "blue"))
    plot(W[m, 1], W[m, 2], ls="none", marker=".", markersize=10, color=c)
end

xlabel("PCA-1", fontsize=13)
ylabel("PCA-2", fontsize=13)
legend(["Group 1", "Group 2", "Group 3"], fontsize=13)