# Lab 10 - PCA and Clustering

Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save).

## Getting started

``````using MLJ
import RDatasets: dataset
import DataFrames: DataFrame, select, Not, describe
using Random

data = dataset("datasets", "USArrests")
names(data)``````
``````5-element Array{String,1}:
"State"
"Murder"
"Assault"
"UrbanPop"
"Rape"``````

Let's have a look at the mean and standard deviation of each feature:

``describe(data, :mean, :std)``
``````5×3 DataFrame
│ Row │ variable │ mean   │ std     │
│     │ Symbol   │ Union… │ Union…  │
├─────┼──────────┼────────┼─────────┤
│ 1   │ State    │        │         │
│ 2   │ Murder   │ 7.788  │ 4.35551 │
│ 3   │ Assault  │ 170.76 │ 83.3377 │
│ 4   │ UrbanPop │ 65.54  │ 14.4748 │
│ 5   │ Rape     │ 21.232 │ 9.36638 │``````

Let's extract the numerical features and coerce the integer-valued columns (`UrbanPop` and `Assault`) to `Continuous`. Note the very different scales above (Assault's standard deviation is roughly twenty times Murder's), which is worth keeping in mind for PCA:

``````X = select(data, Not(:State))
X = coerce(X, :UrbanPop=>Continuous, :Assault=>Continuous);``````
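
To verify the coercion, we can inspect the scientific types of the columns; all four features should now be `Continuous` (a quick sanity check):

``schema(X)``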

## PCA pipeline

PCA usually works best after standardization, but we won't standardize here:

``````@load PCA pkg=MultivariateStats

pca_mdl = PCA(pratio=1)
pca = machine(pca_mdl, X)
fit!(pca)
W = transform(pca, X);``````

W is the PCA-transformed data; since we set `pratio=1`, all four principal components are retained:

``schema(W).names``
``(:x1, :x2, :x3, :x4)``
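
We can also inspect the fitted projection matrix via `fitted_params`; the field name below follows the MultivariateStats interface in MLJ and may differ across versions:

``fitted_params(pca).projection``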

Let's inspect the fit:

``````r = report(pca)
cumsum(r.principalvars ./ r.tvar)``````
``````4-element Array{Float64,1}:
0.9655342205668823
0.9933515571990573
0.9991510921213993
1.0``````

Each entry gives the cumulative proportion of variance explained by the first k principal components: the first component alone explains over 96% of the variance, and the first two together recover over 99%.
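
If we only wanted enough components to explain, say, 99% of the variance, we could set `pratio` accordingly; a minimal sketch (given the cumulative ratios above, this should retain two components; the names `pca99_mdl`, `pca99` are ours):

``````pca99_mdl = PCA(pratio=0.99)
pca99 = machine(pca99_mdl, X)
fit!(pca99)
schema(transform(pca99, X)).names  # expect (:x1, :x2) for this data``````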

## More interesting data...

Instead of just playing with toy data, let's load the orange juice data and extract only the columns corresponding to price data:

``````data = dataset("ISLR", "OJ")

X = select(data, [:PriceCH, :PriceMM, :DiscCH, :DiscMM, :SalePriceMM,
:SalePriceCH, :PriceDiff, :PctDiscMM, :PctDiscCH]);``````
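
As with the toy data, these columns live on rather different scales, which we can confirm as before (output omitted) and which motivates the standardization step below:

``describe(X, :mean, :std)``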

### PCA pipeline

``````Random.seed!(1515)

SPCA = @pipeline(Standardizer(),
PCA(pratio=1-1e-4))

spca = machine(SPCA, X)
fit!(spca)
W = transform(spca, X)
names(W)``````
``````6-element Array{String,1}:
"x1"
"x2"
"x3"
"x4"
"x5"
"x6"``````

What kind of variance can we explain?

``````rpca = collect(values(report(spca).report_given_machine))[2]  # sub-report of the PCA machine
cs = cumsum(rpca.principalvars ./ rpca.tvar)``````
``````6-element Array{Float64,1}:
0.4174696748484731
0.7233074812209765
0.9436456234171869
0.9997505816044248
0.9998956501446636
0.9999999999999998``````

Let's visualise this:

``````using PyPlot

figure(figsize=(8,6))

bar(1:length(cs), cs)
plot(1:length(cs), cs, color="red", marker="o")

xlabel("Number of PCA features", fontsize=14)
ylabel("Ratio of explained variance", fontsize=14)`````` So 4 PCA features are enough to recover most of the variance.

### Clustering

``````Random.seed!(1515)

@load KMeans pkg=Clustering

SPCA2 = @pipeline(Standardizer(),
                  PCA(),
                  KMeans(k=3))

spca2 = machine(SPCA2, X)
fit!(spca2)

# the pipeline report has one sub-report per machine; the K-means
# sub-report (third in the pipeline) carries the cluster assignments
assignments = collect(values(report(spca2).report_given_machine))[3].assignments;``````
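
As a quick sanity check on the clustering, we can count how many observations land in each cluster (the exact counts depend on the seed):

``[count(==(k), assignments) for k in 1:3]``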

Now we can try visualising the clusters in the plane of the first two PCA features:

``````using PyPlot

figure(figsize=(8, 6))
# sketch completion: colour the first two PCA features by cluster assignment
scatter(W.x1, W.x2, c=assignments)
xlabel("PCA-1", fontsize=14); ylabel("PCA-2", fontsize=14)``````