HierarchicalClustering

A model type for constructing a hierarchical clusterer, based on Clustering.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

HierarchicalClustering = @load HierarchicalClustering pkg=Clustering

Do model = HierarchicalClustering() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in HierarchicalClustering(linkage=...).

Hierarchical clustering is an algorithm that organizes the data in a dendrogram, based on distances between groups of points, and computes cluster assignments by cutting the dendrogram at a given height. More information is available in the Clustering.jl documentation. Use predict to get cluster assignments. The dendrogram and the dendrogram cutter are accessed from the machine report (see below).

This is a static implementation, i.e., it does not generalize to new data instances, and there is no training data. For clusterers that do generalize, see KMeans or KMedoids.

In MLJ or MLJBase, create a machine with

mach = machine(model)

Hyper-parameters

  • linkage = :single: linkage method (:single, :average, :complete, :ward, :ward_presquared)
  • metric = SqEuclidean: distance metric (see Distances.jl for available metrics)
  • branchorder = :r: ordering of the dendrogram branches (:r, :barjoseph, :optimal)
  • h = nothing: height at which the dendrogram is cut
  • k = 3: number of clusters

If both k and h are specified, it is guaranteed that the number of clusters is not less than k and their height is not above h.
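The interplay of k and h can be tried out with a minimal sketch such as the following. The data set (make_blobs), the linkage, and the particular values k = 2 and h = 1.0 are illustrative assumptions, not recommendations from this documentation.

```julia
using MLJ

# Load the model type quietly (requires MLJClusteringInterface to be installed).
HierarchicalClustering = @load HierarchicalClustering pkg=Clustering verbosity=0

# Illustrative synthetic data: 100 points in 2 dimensions around 3 centers.
X, _ = make_blobs(100, 2; centers=3, rng=1)

# Specify both a minimum number of clusters and a maximum cut height.
model = HierarchicalClustering(linkage = :average, k = 2, h = 1.0)
mach = machine(model)

yhat = predict(mach, X)

# Number of distinct clusters actually produced; per the guarantee above,
# this should be at least k.
nclusters = length(unique(yhat))
```

Lowering h should tend to produce more, smaller clusters, while k acts as a floor on the cluster count.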

Operations

  • predict(mach, X): return cluster label assignments, as an unordered CategoricalVector. Here X is any table of input features (e.g., a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X).

Report

After calling predict(mach, X), the fields of report(mach) are:

  • dendrogram: the dendrogram that was computed when calling predict.
  • cutter: a dendrogram cutter that can be called with a height h or a number of clusters k, to obtain a new assignment of the data points to clusters (see example below).
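As a hedged sketch of re-cutting by cluster count rather than height, the snippet below assumes the cutter accepts k as a keyword argument, by analogy with the h keyword used in the example for this model. The data set and the :ward linkage are illustrative choices only.

```julia
using MLJ

HierarchicalClustering = @load HierarchicalClustering pkg=Clustering verbosity=0

# Illustrative synthetic data: 60 points in 2 dimensions around 3 centers.
X, _ = make_blobs(60, 2; centers=3, rng=1)

mach = machine(HierarchicalClustering(linkage = :ward))

# The report is only populated once predict has been called.
predict(mach, X)

# Reuse the stored dendrogram to request a different number of clusters
# (assumption: `k` is accepted as a keyword by the cutter).
cutter = report(mach).cutter
assignments = cutter(k = 2)
```

Because the cutter reuses the dendrogram already stored in the report, trying several values of k or h this way avoids recomputing the pairwise distances.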

Examples

using MLJ

X, labels = make_moons(400, noise=0.09, rng=1) ## synthetic data with 2 clusters

HierarchicalClustering = @load HierarchicalClustering pkg=Clustering
model = HierarchicalClustering(linkage = :complete)
mach = machine(model)

## compute and output cluster assignments for observations in `X`:
yhat = predict(mach, X)

## plot dendrogram:
using StatsPlots
plot(report(mach).dendrogram)

## make new predictions by cutting the dendrogram at another height
report(mach).cutter(h = 2.5)