Transformers and other unsupervised models

Several unsupervised models used to perform common transformations, such as one-hot encoding, are available in MLJ out-of-the-box. These are detailed in Built-in transformers below.

A transformer is static if it has no learned parameters. While such a transformer is tantamount to an ordinary function, realizing it as an MLJ static transformer (subtype of Static <: Unsupervised) can be useful, especially if the function depends on parameters the user would like to manipulate (which become hyper-parameters of the model). The necessary syntax for defining your own static transformers is described in Static transformers below.

Some unsupervised models, such as clustering algorithms, have a predict method in addition to a transform method. We give an example of this in Transformers that also predict.

Finally, we note that models that fit a distribution (or, more generally, a sampler object) to some data, which are sometimes viewed as unsupervised, are treated in MLJ as supervised models. See Models that learn a probability distribution for an example.

Built-in transformers

MLJModels.Standardizer (Type)
Standardizer(; features=Symbol[],
               ignore=false,
               ordered_factor=false,
               count=false)

Unsupervised model for standardizing (whitening) the columns of tabular data. If features is unspecified then all columns having Continuous element scitype are standardized. Otherwise, the features standardized are the Continuous features named in features (ignore=false) or Continuous features not named in features (ignore=true). To allow standardization of Count or OrderedFactor features as well, set the appropriate flag to true.

Instead of supplying a features vector, a Bool-valued callable with one argument can also be specified. For example, specifying Standardizer(features = name -> name in [:x1, :x3], ignore = true, count=true) has the same effect as Standardizer(features = [:x1, :x3], ignore = true, count=true), namely to standardize all Continuous and Count features, with the exception of :x1 and :x3.

The inverse_transform method is supported provided count=false and ordered_factor=false at time of fit.
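
As a rough illustration of what standardizing a single Continuous column amounts to, here is a minimal sketch in plain Julia. The helper names fit_standardizer, standardize and unstandardize are hypothetical and are not part of MLJ. The sketch assumes the corrected sample standard deviation (the default of Statistics.std), which is consistent with the outputs shown in the example that follows.

```julia
using Statistics

# Hypothetical helpers (not MLJ API): learn the mean and standard
# deviation of a column, then rescale with them.
fit_standardizer(x) = (mu = mean(x), sigma = std(x))
standardize(p, x) = (x .- p.mu) ./ p.sigma
unstandardize(p, z) = z .* p.sigma .+ p.mu   # plays the role of inverse_transform

x = [10.0, 20.0, 30.0]
p = fit_standardizer(x)        # mean 20.0, corrected std 10.0
z = standardize(p, x)          # [-1.0, 0.0, 1.0]
x_back = unstandardize(p, z)   # recovers x
```

The learned pair (mu, sigma) is what makes this a non-static transformer: it is computed once on the training data and reused on any new data.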

Example

X = (ordinal1 = [1, 2, 3],
     ordinal2 = coerce([:x, :y, :x], OrderedFactor),
     ordinal3 = [10.0, 20.0, 30.0],
     ordinal4 = [-20.0, -30.0, -40.0],
     nominal = coerce(["Your father", "he", "is"], Multiclass));
stand1 = Standardizer();
julia> transform(fit!(machine(stand1, X)), X)
[ Info: Training Machine{Standardizer} @ 7…97.
(ordinal1 = [1, 2, 3],
 ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
 ordinal3 = [-1.0, 0.0, 1.0],
 ordinal4 = [1.0, 0.0, -1.0],
 nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)

stand2 = Standardizer(features=[:ordinal3, ], ignore=true, count=true);
julia> transform(fit!(machine(stand2, X)), X)
[ Info: Training Machine{Standardizer} @ 1…87.
(ordinal1 = [-1.0, 0.0, 1.0],
 ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
 ordinal3 = [10.0, 20.0, 30.0],
 ordinal4 = [1.0, 0.0, -1.0],
 nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)

MLJModels.OneHotEncoder (Type)
OneHotEncoder(; features=Symbol[],
                ignore=false,
                ordered_factor=true,
                drop_last=false)

Unsupervised model for one-hot encoding the Finite features (columns) of some table. If features is unspecified all features with Finite element scitype are encoded. Otherwise, encoding is applied to all Finite features named in features (ignore=false) or all Finite features not named in features (ignore=true).

If ordered_factor=false then the above holds with Finite replaced with Multiclass, i.e., OrderedFactor features are not transformed.

Specify drop_last=true if the column for the last level of each categorical feature is to be dropped.

New data to be transformed may lack features present in the fit data, but no new features can be present.

Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column is the same in new data being transformed as it is in the data used to fit the transformer.
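
To make the effect of drop_last concrete, the following is a minimal sketch of one-hot encoding a single categorical column in plain Julia. The function one_hot is a hypothetical helper, not MLJ's implementation, and it assumes the levels are supplied explicitly (here, the sorted unique values).

```julia
# Hypothetical helper (not MLJ API): one indicator column per level,
# optionally omitting the column for the last level.
function one_hot(col, levels; drop_last=false)
    kept = drop_last ? levels[1:end-1] : levels
    # rows index observations, columns index the retained levels
    return [Float64(v == l) for v in col, l in kept]
end

names = ["Danesh", "Lee", "Mary", "John"]
levels = sort(unique(names))        # ["Danesh", "John", "Lee", "Mary"]

full = one_hot(names, levels)                      # 4×4 indicator matrix
dropped = one_hot(names, levels; drop_last=true)   # 4×3, no "Mary" column
```

With drop_last=true an observation belonging to the last level is encoded as a row of zeros, so no information is lost, and the indicator columns are no longer linearly dependent.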

Example

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical([:A, :B, :A, :C], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67],
     n_devices=[3, 2, 4, 3])
schema(X)

┌───────────┬─────────────────────────────────┬──────────────────┐
│ _.names   │ _.types                         │ _.scitypes       │
├───────────┼─────────────────────────────────┼──────────────────┤
│ name      │ CategoricalValue{String,UInt32} │ Multiclass{4}    │
│ grade     │ CategoricalValue{Symbol,UInt32} │ OrderedFactor{3} │
│ height    │ Float64                         │ Continuous       │
│ n_devices │ Int64                           │ Count            │
└───────────┴─────────────────────────────────┴──────────────────┘
_.nrows = 4

hot = OneHotEncoder(ordered_factor=true);
mach = fit!(machine(hot, X))
transform(mach, X) |> schema

┌──────────────┬─────────┬────────────┐
│ _.names      │ _.types │ _.scitypes │
├──────────────┼─────────┼────────────┤
│ name__Danesh │ Float64 │ Continuous │
│ name__John   │ Float64 │ Continuous │
│ name__Lee    │ Float64 │ Continuous │
│ name__Mary   │ Float64 │ Continuous │
│ grade__A     │ Float64 │ Continuous │
│ grade__B     │ Float64 │ Continuous │
│ grade__C     │ Float64 │ Continuous │
│ height       │ Float64 │ Continuous │
│ n_devices    │ Int64   │ Count      │
└──────────────┴─────────┴────────────┘
_.nrows = 4

MLJModels.ContinuousEncoder (Type)
ContinuousEncoder(one_hot_ordered_factors=false, drop_last=false)

Unsupervised model for arranging all features (columns) of a table to have Continuous element scitype, by applying the following protocol to each feature ftr:

  • If ftr is already Continuous retain it.

  • If ftr is Multiclass, one-hot encode it.

  • If ftr is OrderedFactor, replace it with coerce(ftr, Continuous) (vector of floating point integers), unless one_hot_ordered_factors=true is specified, in which case one-hot encode it.

  • If ftr is Count, replace it with coerce(ftr, Continuous).

  • If ftr is of some other element scitype, or was not observed in fitting the encoder, drop it from the table.

If drop_last=true is specified, then one-hot encoding always drops the last class indicator column.

Warning: This transformer assumes that levels(col) for any Multiclass or OrderedFactor column is the same in new data being transformed as it is in the data used to fit the transformer.
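
The two coercion rules in the protocol above can be pictured with a small plain-Julia sketch. It assumes that coercing an OrderedFactor feature to Continuous yields the position of each value in the ordered list of levels, as a float; the names ordered_levels and level_code are hypothetical and the sketch does not call MLJ.

```julia
# Assumed ordering of the levels of an OrderedFactor feature
ordered_levels = [:A, :B, :C]
grade = [:A, :B, :A, :C]

# Hypothetical helper: position of a value in the ordered levels
level_code(g) = findfirst(==(g), ordered_levels)

# OrderedFactor -> Continuous: "floating point integers"
grade_continuous = Float64[level_code(g) for g in grade]   # [1.0, 2.0, 1.0, 3.0]

# Count -> Continuous: a plain conversion to floating point
n_devices = [3, 2, 4, 3]
n_devices_continuous = float.(n_devices)                   # [3.0, 2.0, 4.0, 3.0]
```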

Example

X = (name=categorical(["Danesh", "Lee", "Mary", "John"]),
     grade=categorical([:A, :B, :A, :C], ordered=true),
     height=[1.85, 1.67, 1.5, 1.67],
     n_devices=[3, 2, 4, 3],
     comments=["the force", "be", "with you", "too"])
schema(X)

┌───────────┬─────────────────────────────────┬──────────────────┐
│ _.names   │ _.types                         │ _.scitypes       │
├───────────┼─────────────────────────────────┼──────────────────┤
│ name      │ CategoricalValue{String,UInt32} │ Multiclass{4}    │
│ grade     │ CategoricalValue{Symbol,UInt32} │ OrderedFactor{3} │
│ height    │ Float64                         │ Continuous       │
│ n_devices │ Int64                           │ Count            │
│ comments  │ String                          │ Textual          │
└───────────┴─────────────────────────────────┴──────────────────┘
_.nrows = 4

cont = ContinuousEncoder(drop_last=true);
mach = fit!(machine(cont, X))
transform(mach, X) |> schema

┌──────────────┬─────────┬────────────┐
│ _.names      │ _.types │ _.scitypes │
├──────────────┼─────────┼────────────┤
│ name__Danesh │ Float64 │ Continuous │
│ name__John   │ Float64 │ Continuous │
│ name__Lee    │ Float64 │ Continuous │
│ grade        │ Float64 │ Continuous │
│ height       │ Float64 │ Continuous │
│ n_devices    │ Float64 │ Continuous │
└──────────────┴─────────┴────────────┘
_.nrows = 4

MLJModels.FeatureSelector (Type)
FeatureSelector(features=Symbol[], ignore=false)

An unsupervised model for filtering features (columns) of a table. Only those features encountered during fitting will appear in transformed tables if features is empty (the default). Alternatively, if a non-empty features is specified, then only the specified features encountered during fitting are used (ignore=false) or all features encountered during fitting which are not named in features are used (ignore=true).

Throws an error if a recorded or specified feature is not present in the transformation input.

Instead of supplying a features vector, a Bool-valued callable with one argument can also be specified. For example, specifying FeatureSelector(features = name -> name in [:x1, :x3], ignore = true) has the same effect as FeatureSelector(features = [:x1, :x3], ignore = true), namely to select all features, with the exception of :x1 and :x3.
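
The equivalence between a features vector and a Bool-valued callable can be sketched on a column table (a named tuple of vectors) in plain Julia. The function select_features is a hypothetical stand-in for the selection logic described above and is not MLJ's implementation; in particular, it ignores the distinction between fitting and transforming.

```julia
# Hypothetical helper (not MLJ API): keep or drop columns of a named tuple
# according to a vector of names or a Bool-valued callable on names.
function select_features(X::NamedTuple, features; ignore=false)
    names = collect(keys(X))
    # an empty vector means "all features"
    features isa AbstractVector && isempty(features) && return X
    pred = features isa AbstractVector ? (name -> name in features) : features
    kept = filter(name -> pred(name) != ignore, names)
    return NamedTuple{Tuple(kept)}(X)
end

X = (x1 = [1, 2], x2 = [3, 4], x3 = [5, 6])

a = select_features(X, [:x1, :x3]; ignore=true)                  # only :x2 remains
b = select_features(X, name -> name in [:x1, :x3]; ignore=true)  # same result
c = select_features(X, [:x1, :x3])                               # :x1 and :x3
```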

Example

julia> X = (ordinal1 = [1, 2, 3],
            ordinal2 = coerce([:x, :y, :x], OrderedFactor),
            ordinal3 = [10.0, 20.0, 30.0],
            ordinal4 = [-20.0, -30.0, -40.0],
            nominal = coerce(["Your father", "he", "is"], Multiclass));

julia> select1 = FeatureSelector();

julia> transform(fit!(machine(select1, X)), X)
[ Info: Training Machine{FeatureSelector} @811.
(ordinal1 = [1, 2, 3],
 ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
 ordinal3 = [10.0, 20.0, 30.0],
 ordinal4 = [-20.0, -30.0, -40.0],
 nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)

julia> select2 = FeatureSelector(features=[:ordinal3, ], ignore=true);

julia> transform(fit!(machine(select2, X)), X)
[ Info: Training Machine{FeatureSelector} @721.
(ordinal1 = [1, 2, 3],
 ordinal2 = CategoricalValue{Symbol,UInt32}[:x, :y, :x],
 ordinal4 = [-20.0, -30.0, -40.0],
 nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)

MLJModels.UnivariateBoxCoxTransformer (Type)
UnivariateBoxCoxTransformer(; n=171, shift=false)

Unsupervised model specifying a univariate Box-Cox transformation of a single variable taking non-negative values, with a possible preliminary shift. Such a transformation is of the form

x -> ((x + c)^λ - 1)/λ for λ not 0
x -> log(x + c) for λ = 0

On fitting to data, n different values of the Box-Cox exponent λ (between -0.4 and 3) are searched to find the value maximizing normality. If shift=true and zero values are encountered in the data, then the transformation sought includes a preliminary positive shift c of 0.2 times the data mean. If there are no zero values, then no shift is applied.
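
The transformation itself, for a given exponent and shift, can be written down directly from the two formulas above. The following plain-Julia sketch does only that; the names boxcox and inv_boxcox are hypothetical, and the search over λ that the model performs during fitting is not reproduced here.

```julia
using Statistics

# Box-Cox transformation with exponent λ and preliminary shift c
boxcox(x, λ; c=0.0) = λ == 0 ? log(x + c) : ((x + c)^λ - 1) / λ

# Its inverse, obtained by solving the formulas above for x
inv_boxcox(y, λ; c=0.0) = λ == 0 ? exp(y) - c : (λ * y + 1)^(1 / λ) - c

x = [0.5, 1.0, 2.0, 4.0]
y_half = boxcox.(x, 0.5)    # square-root-like transformation
y_log = boxcox.(x, 0.0)     # plain logarithm, since c = 0

# Data containing a zero: shift by 0.2 times the data mean, as described above
data = [0.0, 1.0, 2.0, 5.0]
c = 0.2 * mean(data)        # 0.4
shifted = boxcox.(data, 0.0; c=c)
```

Without the shift, the λ = 0 case would map a zero value to -Inf, which is why a positive c is needed when zeros are present.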

MLJModels.UnivariateDiscretizer (Type)
UnivariateDiscretizer(n_classes=512)

Returns an MLJModel for discretizing any continuous vector v (scitype(v) <: AbstractVector{Continuous}), where n_classes describes the resolution of the discretization.

Transformed output w is a vector of ordered factors (scitype(w) <: AbstractVector{<:OrderedFactor}). Specifically, w is a CategoricalVector, with element type CategoricalValue{R,R}, where R <: Unsigned is optimized.

The transformation is chosen so that the vector on which the transformer is fit has, in transformed form, an approximately uniform distribution of values.
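
One way to obtain an approximately uniform distribution of classes is to place the cut-points at evenly spaced quantiles of the training vector. The sketch below illustrates that idea in plain Julia; it is an assumption about the general technique, not a description of how UnivariateDiscretizer is implemented, and the names fit_cutpoints and discretize are hypothetical.

```julia
using Statistics

# Hypothetical helper: n_classes - 1 cut-points at evenly spaced quantiles
fit_cutpoints(v, n_classes) = quantile(v, (1:n_classes-1) ./ n_classes)

# Hypothetical helper: class index in 1:n_classes for a new value x
discretize(cuts, x) = searchsortedfirst(cuts, x)

v = collect(1.0:100.0)                  # training vector
cuts = fit_cutpoints(v, 4)              # three cut-points
classes = discretize.(Ref(cuts), v)     # integer class for each training value

# Number of training values falling in each of the four classes
counts = [count(==(k), classes) for k in 1:4]
```

Because the cut-points are learned from the training vector, new values outside its range are simply assigned to the first or last class.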

Example

using MLJ
t = UnivariateDiscretizer(n_classes=10)
discretizer = machine(t, randn(1000))
fit!(discretizer)
v = rand(10)
w = transform(discretizer, v)
v_approx = inverse_transform(discretizer, w) # reconstruction of v from w

MLJModels.FillImputer (Type)
FillImputer(
 features        = [],
 continuous_fill = e -> skipmissing(e) |> median,
 count_fill      = e -> skipmissing(e) |> (f -> round(eltype(f), median(f))),
 finite_fill     = e -> skipmissing(e) |> mode)

Imputes missing data with a fixed value computed on the non-missing values. A different imputing function can be specified for Continuous, Count and Finite data.

Fields

  • continuous_fill: function to use on Continuous data, by default the median

  • count_fill: function to use on Count data, by default the rounded median

  • finite_fill: function to use on Multiclass and OrderedFactor data (including binary data), by default the mode
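
The default continuous_fill and count_fill functions shown in the signature can be tried out on their own, without MLJ. In the sketch below the two fill functions are copied from the signature above, while the helper impute is hypothetical; finite_fill is omitted because mode is not part of Julia's standard library.

```julia
using Statistics

# Default fill functions, as given in the signature above
continuous_fill = e -> skipmissing(e) |> median
count_fill = e -> skipmissing(e) |> (f -> round(eltype(f), median(f)))

# Hypothetical helper: replace each missing entry by the value fill(v)
impute(v, fill) = coalesce.(v, fill(v))

height = [1.85, missing, 1.5, 1.67]
n_devices = [3, missing, 4, 2, 3]

height_filled = impute(height, continuous_fill)   # missing -> median 1.67
devices_filled = impute(n_devices, count_fill)    # missing -> rounded median 3
```

Note that the fill value is computed from the non-missing entries only, and that rounding with eltype(f) keeps the imputed counts as integers.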

Static transformers

The main use-case for static transformers is for insertion into a @pipeline or other exported learning network (see Composing Models). If a static transformer has no hyper-parameters, it is tantamount to an ordinary function. An ordinary function can be inserted directly into a @pipeline; the situation for learning networks is only slightly more complicated; see Static operations on nodes.

The following example defines a new model type Averager to perform the weighted average of two vectors (target predictions, for example). We suppose the weighting is normalized, and therefore controlled by a single hyper-parameter, mix.

mutable struct Averager <: Static
    mix::Float64
end

MLJ.transform(a::Averager, _, y1, y2) = (1 - a.mix)*y1 + a.mix*y2

Important. Note the sub-typing <: Static.

Such static transformers with (unlearned) parameters can have arbitrarily many inputs, but only one output. In the single input case an inverse_transform can also be defined. Since they have no real learned parameters, you bind a static transformer to a machine without specifying training arguments.

mach = machine(Averager(0.5)) |> fit!
transform(mach, [1, 2, 3], [3, 2, 1])
3-element Vector{Float64}:
 2.0
 2.0
 2.0

Let's see how we can include our Averager in a learning network (see Composing Models) to mix the predictions of two regressors, with one-hot encoding of the inputs:

X = source()
y = source()

ridge = (@load RidgeRegressor pkg=MultivariateStats)()
knn = (@load KNNRegressor)()
averager = Averager(0.5)

hotM = machine(OneHotEncoder(), X)
W = transform(hotM, X) # one-hot encode the input

ridgeM = machine(ridge, W, y)
y1 = predict(ridgeM, W)

knnM = machine(knn, W, y)
y2 = predict(knnM, W)

averagerM = machine(averager)
yhat = transform(averagerM, y1, y2)
Node{Machine{Averager,…}} @732
  args:
    1:	Node{Machine{RidgeRegressor,…}} @464
    2:	Node{Machine{KNNRegressor,…}} @225
  formula:
    transform(
        Machine{Averager,…} @276, 
        predict(
            Machine{RidgeRegressor,…} @375, 
            transform(
                Machine{OneHotEncoder,…} @155, 
                Source @556)),
        predict(
            Machine{KNNRegressor,…} @908, 
            transform(
                Machine{OneHotEncoder,…} @155, 
                Source @556)))

Now we export to obtain a Deterministic composite model type, and then instantiate the composite model:

learning_mach = machine(Deterministic(), X, y; predict=yhat)
Machine{DeterministicSurrogate} @772 trained 0 times.
  args:
    1:	Source @415 ⏎ `Unknown`
    2:	Source @389 ⏎ `Unknown`


@from_network learning_mach struct DoubleRegressor
    regressor1=ridge
    regressor2=knn
    averager=averager
end

julia> composite = DoubleRegressor()
DoubleRegressor(
    regressor1 = RidgeRegressor(
            lambda = 1.0),
    regressor2 = KNNRegressor(
            K = 5,
            algorithm = :kdtree,
            metric = Distances.Euclidean(0.0),
            leafsize = 10,
            reorder = true,
            weights = :uniform),
    averager = Averager(
            mix = 0.5)) @301

which can be evaluated like any other model:

composite.averager.mix = 0.25 # adjust mix from default of 0.5
julia> evaluate(composite, (@load_reduced_ames)..., measure=rms)
Evaluating over 6 folds: 100%[=========================] Time: 0:00:00
┌───────────┬───────────────┬────────────────────────────────────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold                                             │
├───────────┼───────────────┼────────────────────────────────────────────────────────┤
│ rms       │ 26800.0       │ [21400.0, 23700.0, 26800.0, 25900.0, 30800.0, 30700.0] │
└───────────┴───────────────┴────────────────────────────────────────────────────────┘
_.per_observation = [missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]

Transformers that also predict

Commonly, clustering algorithms learn to label data by identifying a collection of "centroids" in the training data. Any new input observation is labeled with the cluster to which it is closest (this is the output of predict) while the vector of all distances from the centroids defines a lower-dimensional representation of the observation (the output of transform). In the following example a K-means clustering algorithm assigns one of three labels 1, 2, 3 to the input features of the iris data set and compares them with the actual species recorded in the target (not seen by the algorithm).

import Random.seed!
seed!(123)

X, y = @load_iris;
KMeans = @load KMeans pkg=ParallelKMeans
kmeans = KMeans()
mach = machine(kmeans, X) |> fit!

# transforming:
Xsmall = transform(mach);
julia> selectrows(Xsmall, 1:4) |> pretty
┌─────────────────────┬────────────────────┬────────────────────┐
│ x1                  │ x2                 │ x3                 │
│ Float64             │ Float64            │ Float64            │
│ Continuous          │ Continuous         │ Continuous         │
├─────────────────────┼────────────────────┼────────────────────┤
│ 0.0215920000000267  │ 25.314260355029603 │ 11.645232464391299 │
│ 0.19199200000001326 │ 25.882721893491123 │ 11.489658693899486 │
│ 0.1699920000000077  │ 27.58656804733728  │ 12.674412792260142 │
│ 0.26919199999998966 │ 26.28656804733727  │ 11.64392098898145  │
└─────────────────────┴────────────────────┴────────────────────┘

# predicting:
yhat = predict(mach);
compare = zip(yhat, y) |> collect;
compare[1:8]
8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}:
 (1, "setosa")
 (1, "setosa")
 (1, "setosa")
 (1, "setosa")
 (1, "setosa")
 (1, "setosa")
 (1, "setosa")
 (1, "setosa")

compare[51:58]
8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}:
 (2, "versicolor")
 (3, "versicolor")
 (2, "versicolor")
 (3, "versicolor")
 (3, "versicolor")
 (3, "versicolor")
 (3, "versicolor")
 (3, "versicolor")

compare[101:108]
8-element Array{Tuple{CategoricalValue{Int64,UInt32},CategoricalString{UInt32}},1}:
 (2, "virginica")
 (3, "virginica")
 (2, "virginica")
 (2, "virginica")
 (2, "virginica")
 (2, "virginica")
 (3, "virginica")
 (2, "virginica")
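
The distinction between predict and transform for a centroid-based clusterer can also be seen in a few lines of plain Julia. The sketch below assumes three centroids have already been learned and uses the ordinary Euclidean distance; the centroid values and the names cluster_distances and cluster_label are hypothetical, and a particular K-means implementation may report a different distance (for example, a squared distance).

```julia
using LinearAlgebra

# Three centroids in a four-dimensional feature space (made-up values)
centroids = [[0.0, 0.0, 0.0, 0.0],
             [3.0, 4.0, 0.0, 0.0],
             [0.0, 0.0, 6.0, 7.0]]

# "transform": distances from an observation to every centroid
cluster_distances(centroids, x) = [norm(x - c) for c in centroids]

# "predict": index of the closest centroid
cluster_label(centroids, x) = argmin(cluster_distances(centroids, x))

x = [0.0, 0.0, 0.0, 1.0]
d = cluster_distances(centroids, x)   # 3 numbers representing a 4-feature input
label = cluster_label(centroids, x)   # closest centroid is the first one
```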