Common MLJ Workflows

Data ingestion

import RDatasets
channing = RDatasets.dataset("boot", "channing")

julia> first(channing, 4)
4×5 DataFrame
 Row │ Sex   Entry  Exit   Time   Cens
     │ Cat…  Int32  Int32  Int32  Int32
─────┼──────────────────────────────────
   1 │ Male    782    909    127      1
   2 │ Male   1020   1128    108      1
   3 │ Male    856    969    113      1
   4 │ Male    915    957     42      1
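
Any Tables.jl-compatible source can be ingested in the same way. Here is a minimal sketch for reading a comma-separated file instead, assuming a hypothetical local file channing.csv with the same columns:

import CSV, DataFrames
channing = CSV.read("channing.csv", DataFrames.DataFrame)  # hypothetical file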

Inspecting metadata, including column scientific types:

schema(channing)
┌─────────┬──────────────────────────────────┬───────────────┐
│ _.names │ _.types                          │ _.scitypes    │
├─────────┼──────────────────────────────────┼───────────────┤
│ Sex     │ CategoricalValue{String, UInt32} │ Multiclass{2} │
│ Entry   │ Int64                            │ Count         │
│ Exit    │ Int64                            │ Count         │
│ Time    │ Int64                            │ Count         │
│ Cens    │ Int64                            │ Count         │
└─────────┴──────────────────────────────────┴───────────────┘
_.nrows = 462

Unpacking data (to also shuffle the rows, pass the keyword argument rng=123):

y, X = unpack(channing,
              ==(:Exit),            # y is the :Exit column
              !=(:Time));           # X is the rest, except :Time

Note: Before Julia 1.2, replace !=(:Time) with col -> col != :Time.

julia> y[1:4]
4-element Vector{Int32}:
  909
 1128
  969
  957

Fixing wrong scientific types in X:

X = coerce(X, :Entry=>Continuous, :Cens=>Multiclass)

julia> schema(X)
┌─────────┬─────────────────────────────────┬───────────────┐
│ _.names │ _.types                         │ _.scitypes    │
├─────────┼─────────────────────────────────┼───────────────┤
│ Sex     │ CategoricalValue{String, UInt8} │ Multiclass{2} │
│ Entry   │ Float64                         │ Continuous    │
│ Cens    │ CategoricalValue{Int32, UInt32} │ Multiclass{2} │
└─────────┴─────────────────────────────────┴───────────────┘
_.nrows = 462
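
coerce also accepts scitype-to-scitype pairs, which is convenient when many columns need the same fix. A quick sketch, applied to the original table:

coerce(channing, Count => Continuous) |> schema
# Entry, Exit, Time and Cens are now all Continuous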

Loading a built-in supervised dataset:

table = load_iris()
schema(table)
┌──────────────┬──────────────────────────────────┬───────────────┐
│ _.names      │ _.types                          │ _.scitypes    │
├──────────────┼──────────────────────────────────┼───────────────┤
│ sepal_length │ Float64                          │ Continuous    │
│ sepal_width  │ Float64                          │ Continuous    │
│ petal_length │ Float64                          │ Continuous    │
│ petal_width  │ Float64                          │ Continuous    │
│ target       │ CategoricalValue{String, UInt32} │ Multiclass{3} │
└──────────────┴──────────────────────────────────┴───────────────┘
_.nrows = 150

Loading a built-in data set already split into X and y:

X, y = @load_iris;
selectrows(X, 1:4) # selectrows works for any Tables.jl table
(sepal_length = [5.1, 4.9, 4.7, 4.6],
 sepal_width = [3.5, 3.0, 3.2, 3.1],
 petal_length = [1.4, 1.4, 1.3, 1.5],
 petal_width = [0.2, 0.2, 0.2, 0.2],)
y[1:4]
4-element CategoricalArray{String,1,UInt32}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"

Model search

Reference: Model Search

Searching for a supervised model:

X, y = @load_boston
models(matching(X, y))
59-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype), T} where T<:Tuple}:
 (name = ARDRegressor, package_name = ScikitLearn, ... )
 (name = AdaBoostRegressor, package_name = ScikitLearn, ... )
 (name = BaggingRegressor, package_name = ScikitLearn, ... )
 (name = BayesianRidgeRegressor, package_name = ScikitLearn, ... )
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = DecisionTreeRegressor, package_name = DecisionTree, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = DummyRegressor, package_name = ScikitLearn, ... )
 (name = ElasticNetCVRegressor, package_name = ScikitLearn, ... )
 ⋮
 (name = RidgeRegressor, package_name = MultivariateStats, ... )
 (name = RidgeRegressor, package_name = ScikitLearn, ... )
 (name = RobustRegressor, package_name = MLJLinearModels, ... )
 (name = SGDRegressor, package_name = ScikitLearn, ... )
 (name = SVMLinearRegressor, package_name = ScikitLearn, ... )
 (name = SVMNuRegressor, package_name = ScikitLearn, ... )
 (name = SVMRegressor, package_name = ScikitLearn, ... )
 (name = TheilSenRegressor, package_name = ScikitLearn, ... )
 (name = XGBoostRegressor, package_name = XGBoost, ... )
models(matching(X, y))[6]
A simple Decision Tree for regression with support for Missing data, from the Beta Machine Learning Toolkit (BetaML).
→ based on [BetaML](https://github.com/sylvaticus/BetaML.jl).
→ do `@load DecisionTreeRegressor pkg="BetaML"` to use the model.
→ do `?DecisionTreeRegressor` for documentation.
(name = "DecisionTreeRegressor",
 package_name = "BetaML",
 is_supervised = true,
 abstract_type = Deterministic,
 deep_properties = (),
 docstring = "A simple Decision Tree for regression with support for Missing data, from the Beta Machine Learning Toolkit (BetaML).\n→ based on [BetaML](https://github.com/sylvaticus/BetaML.jl).\n→ do `@load DecisionTreeRegressor pkg=\"BetaML\"` to use the model.\n→ do `?DecisionTreeRegressor` for documentation.",
 fit_data_scitype = Tuple{Table{_s52} where _s52<:Union{AbstractVector{_s51} where _s51<:Known, AbstractVector{_s51} where _s51<:Missing}, AbstractVector{_s690} where _s690<:Continuous},
 hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing, nothing),
 hyperparameter_types = ("Int64", "Float64", "Int64", "Int64", "Function", "Random.AbstractRNG"),
 hyperparameters = (:maxDepth, :minGain, :minRecords, :maxFeatures, :splittingCriterion, :rng),
 implemented_methods = [:predict, :fit],
 inverse_transform_scitype = Unknown,
 is_pure_julia = true,
 is_wrapper = false,
 iteration_parameter = nothing,
 load_path = "BetaML.Trees.DecisionTreeRegressor",
 package_license = "MIT",
 package_url = "https://github.com/sylvaticus/BetaML.jl",
 package_uuid = "024491cd-cc6b-443e-8034-08ea7eb7db2b",
 predict_scitype = AbstractVector{_s690} where _s690<:Continuous,
 prediction_type = :deterministic,
 supports_class_weights = false,
 supports_online = false,
 supports_training_losses = false,
 supports_weights = false,
 transform_scitype = Unknown,
 input_scitype = Table{_s52} where _s52<:Union{AbstractVector{_s51} where _s51<:Known, AbstractVector{_s51} where _s51<:Missing},
 target_scitype = AbstractVector{_s690} where _s690<:Continuous,
 output_scitype = Unknown,)

More refined searches:

models() do model
    matching(model, X, y) &&
    model.prediction_type == :deterministic &&
    model.is_pure_julia
end
20-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype), T} where T<:Tuple}:
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = DecisionTreeRegressor, package_name = DecisionTree, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = ElasticNetRegressor, package_name = MLJLinearModels, ... )
 (name = EvoTreeRegressor, package_name = EvoTrees, ... )
 (name = HuberRegressor, package_name = MLJLinearModels, ... )
 (name = KNNRegressor, package_name = NearestNeighborModels, ... )
 (name = KPLSRegressor, package_name = PartialLeastSquaresRegressor, ... )
 (name = LADRegressor, package_name = MLJLinearModels, ... )
 (name = LassoRegressor, package_name = MLJLinearModels, ... )
 (name = LinearRegressor, package_name = MLJLinearModels, ... )
 (name = LinearRegressor, package_name = MultivariateStats, ... )
 (name = NeuralNetworkRegressor, package_name = MLJFlux, ... )
 (name = PLSRegressor, package_name = PartialLeastSquaresRegressor, ... )
 (name = QuantileRegressor, package_name = MLJLinearModels, ... )
 (name = RandomForestRegressor, package_name = BetaML, ... )
 (name = RandomForestRegressor, package_name = DecisionTree, ... )
 (name = RidgeRegressor, package_name = MLJLinearModels, ... )
 (name = RidgeRegressor, package_name = MultivariateStats, ... )
 (name = RobustRegressor, package_name = MLJLinearModels, ... )
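
Model metadata can also be searched textually. A sketch, assuming the string/regex form of models, which matches against model names and docstrings:

models("Tree")    # all models whose metadata mentions "Tree"
models(r"Ridge")  # regex variant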

Searching for an unsupervised model:

models(matching(X))
52-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype), T} where T<:Tuple}:
 (name = ABODDetector, package_name = OutlierDetectionNeighbors, ... )
 (name = ABODDetector, package_name = OutlierDetectionPython, ... )
 (name = AEDetector, package_name = OutlierDetectionNetworks, ... )
 (name = AffinityPropagation, package_name = ScikitLearn, ... )
 (name = AgglomerativeClustering, package_name = ScikitLearn, ... )
 (name = Birch, package_name = ScikitLearn, ... )
 (name = CBLOFDetector, package_name = OutlierDetectionPython, ... )
 (name = COFDetector, package_name = OutlierDetectionNeighbors, ... )
 (name = COFDetector, package_name = OutlierDetectionPython, ... )
 (name = COPODDetector, package_name = OutlierDetectionPython, ... )
 ⋮
 (name = PCA, package_name = MultivariateStats, ... )
 (name = PCADetector, package_name = OutlierDetectionPython, ... )
 (name = PPCA, package_name = MultivariateStats, ... )
 (name = RODDetector, package_name = OutlierDetectionPython, ... )
 (name = SODDetector, package_name = OutlierDetectionPython, ... )
 (name = SOSDetector, package_name = OutlierDetectionPython, ... )
 (name = SpectralClustering, package_name = ScikitLearn, ... )
 (name = Standardizer, package_name = MLJModels, ... )
 (name = TSVDTransformer, package_name = TSVD, ... )

Getting the metadata entry for a given model type:

info("PCA")
info("RidgeRegressor", pkg="MultivariateStats") # a model type in multiple packages
Ridge regressor with regularization parameter lambda. Learns a
linear regression with a penalty on the l2 norm of the coefficients.

→ based on [MultivariateStats](https://github.com/JuliaStats/MultivariateStats.jl).
→ do `@load RidgeRegressor pkg="MultivariateStats"` to use the model.
→ do `?RidgeRegressor` for documentation.
(name = "RidgeRegressor",
 package_name = "MultivariateStats",
 is_supervised = true,
 abstract_type = Deterministic,
 deep_properties = (),
 docstring = "Ridge regressor with regularization parameter lambda. Learns a\nlinear regression with a penalty on the l2 norm of the coefficients.\n\n→ based on [MultivariateStats](https://github.com/JuliaStats/MultivariateStats.jl).\n→ do `@load RidgeRegressor pkg=\"MultivariateStats\"` to use the model.\n→ do `?RidgeRegressor` for documentation.",
 fit_data_scitype = Tuple{Table{_s52} where _s52<:(AbstractVector{_s51} where _s51<:Continuous), AbstractVector{Continuous}},
 hyperparameter_ranges = (nothing, nothing),
 hyperparameter_types = ("Union{Real, AbstractVecOrMat{T} where T}", "Bool"),
 hyperparameters = (:lambda, :bias),
 implemented_methods = [:predict, :clean!, :fit, :fitted_params],
 inverse_transform_scitype = Unknown,
 is_pure_julia = true,
 is_wrapper = false,
 iteration_parameter = nothing,
 load_path = "MLJMultivariateStatsInterface.RidgeRegressor",
 package_license = "MIT",
 package_url = "https://github.com/JuliaStats/MultivariateStats.jl",
 package_uuid = "6f286f6a-111f-5878-ab1e-185364afe411",
 predict_scitype = AbstractVector{Continuous},
 prediction_type = :deterministic,
 supports_class_weights = false,
 supports_online = false,
 supports_training_losses = false,
 supports_weights = false,
 transform_scitype = Unknown,
 input_scitype = Table{_s52} where _s52<:(AbstractVector{_s51} where _s51<:Continuous),
 target_scitype = AbstractVector{Continuous},
 output_scitype = Unknown,)

Instantiating a model

Reference: Getting Started, Loading Model Code

Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(min_samples_split=5, max_depth=4)
DecisionTreeClassifier(
    max_depth = 4,
    min_samples_leaf = 1,
    min_samples_split = 5,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

or

tree = (@load DecisionTreeClassifier)()
tree.min_samples_split = 5
tree.max_depth = 4

Evaluating a model

Reference: Evaluating Model Performance

X, y = @load_boston
KNN = @load KNNRegressor
knn = KNN()
evaluate(knn, X, y, resampling=CV(nfolds=5), measure=[RootMeanSquaredError(), MeanAbsoluteError()])
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬────────────────────────────
│ measure                │ measurement │ operation │ per_fold                  ⋯
├────────────────────────┼─────────────┼───────────┼────────────────────────────
│ RootMeanSquaredError() │ 8.77        │ predict   │ [8.53, 8.8, 10.7, 9.43, 5 ⋯
│ MeanAbsoluteError()    │ 6.02        │ predict   │ [6.52, 5.7, 7.65, 6.09, 4 ⋯
└────────────────────────┴─────────────┴───────────┴────────────────────────────
                                                                1 column omitted

Note that RootMeanSquaredError() has the alias rms, and MeanAbsoluteError() the alias mae.

Run measures() to list all losses and scores and their aliases.
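
The PerformanceEvaluation object returned above can also be queried programmatically; a minimal sketch using the measurement and per_fold fields listed in its printout:

e = evaluate(knn, X, y, resampling=CV(nfolds=5),
             measure=[RootMeanSquaredError(), MeanAbsoluteError()])
e.measurement[1]   # aggregated root mean squared error
e.per_fold[2]      # per-fold mean absolute errors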

Basic fit/evaluate/predict by hand

Reference: Getting Started, Machines, Evaluating Model Performance, Performance Measures

crabs = load_crabs() |> DataFrames.DataFrame
schema(crabs)
┌─────────┬──────────────────────────────────┬───────────────┐
│ _.names │ _.types                          │ _.scitypes    │
├─────────┼──────────────────────────────────┼───────────────┤
│ sp      │ CategoricalValue{String, UInt32} │ Multiclass{2} │
│ sex     │ CategoricalValue{String, UInt32} │ Multiclass{2} │
│ index   │ Int64                            │ Count         │
│ FL      │ Float64                          │ Continuous    │
│ RW      │ Float64                          │ Continuous    │
│ CL      │ Float64                          │ Continuous    │
│ CW      │ Float64                          │ Continuous    │
│ BD      │ Float64                          │ Continuous    │
└─────────┴──────────────────────────────────┴───────────────┘
_.nrows = 200
y, X = unpack(crabs, ==(:sp), !in([:index, :sex]); rng=123)


Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(max_depth=2)
DecisionTreeClassifier(
    max_depth = 2,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5,
    rng = Random._GLOBAL_RNG())

Bind the model and data together in a machine, which will additionally store the learned parameters (the fitresult) when fit:

mach = machine(tree, X, y)
Machine{DecisionTreeClassifier,…} trained 0 times; caches data
  args: 
    1:	Source @683 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @957 ⏎ `AbstractVector{Multiclass{2}}`

Split row indices into training and evaluation rows:

train, test = partition(eachindex(y), 0.7); # 70:30 split
([1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  131, 132, 133, 134, 135, 136, 137, 138, 139, 140], [141, 142, 143, 144, 145, 146, 147, 148, 149, 150  …  191, 192, 193, 194, 195, 196, 197, 198, 199, 200])
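
partition also accepts multiple fractions and further keyword options; a sketch, assuming its shuffle, rng and stratify keyword arguments:

train, valid, test = partition(eachindex(y), 0.6, 0.2, shuffle=true, rng=123)  # 60:20:20
train2, test2 = partition(eachindex(y), 0.7, stratify=y)  # preserve class proportions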

Fit on train and evaluate on test:

fit!(mach, rows=train)
yhat = predict(mach, X[test,:])
mean(LogLoss(tol=1e-4)(yhat, y[test]))
1.0788055664326648

Note LogLoss() has aliases log_loss and cross_entropy.

Run measures() to list all losses and scores and their aliases ("instances").

Predict on new data:

Xnew = (FL = rand(3), RW = rand(3), CL = rand(3), CW = rand(3), BD = rand(3))
predict(mach, Xnew)      # a vector of distributions
3-element MLJBase.UnivariateFiniteVector{Multiclass{2}, String, UInt32, Float64}:
 UnivariateFinite{Multiclass{2}}(B=>0.667, O=>0.333)
 UnivariateFinite{Multiclass{2}}(B=>0.667, O=>0.333)
 UnivariateFinite{Multiclass{2}}(B=>0.667, O=>0.333)
predict_mode(mach, Xnew) # a vector of point-predictions
3-element CategoricalArray{String,1,UInt32}:
 "B"
 "B"
 "B"

More performance evaluation examples

Evaluating model + data directly:

evaluate(tree, X, y,
         resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
         measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────────┬─────────────┬──────────────┬──────────┐
│ measure                    │ measurement │ operation    │ per_fold │
├────────────────────────────┼─────────────┼──────────────┼──────────┤
│ LogLoss(tol = 2.22045e-16) │ 1.12        │ predict      │ [1.12]   │
│ Accuracy()                 │ 0.683       │ predict_mode │ [0.683]  │
└────────────────────────────┴─────────────┴──────────────┴──────────┘

If a machine is already defined, as above:

evaluate!(mach,
          resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
          measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────────┬─────────────┬──────────────┬──────────┐
│ measure                    │ measurement │ operation    │ per_fold │
├────────────────────────────┼─────────────┼──────────────┼──────────┤
│ LogLoss(tol = 2.22045e-16) │ 1.12        │ predict      │ [1.12]   │
│ Accuracy()                 │ 0.683       │ predict_mode │ [0.683]  │
└────────────────────────────┴─────────────┴──────────────┴──────────┘

Using cross-validation:

evaluate!(mach, resampling=CV(nfolds=5, shuffle=true, rng=1234),
          measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────────┬─────────────┬──────────────┬─────────────────────
│ measure                    │ measurement │ operation    │ per_fold           ⋯
├────────────────────────────┼─────────────┼──────────────┼─────────────────────
│ LogLoss(tol = 2.22045e-16) │ 0.748       │ predict      │ [0.552, 0.534, 0.4 ⋯
│ Accuracy()                 │ 0.7         │ predict_mode │ [0.775, 0.7, 0.8,  ⋯
└────────────────────────────┴─────────────┴──────────────┴─────────────────────
                                                                1 column omitted

With user-specified train/test pairs of row indices:

f1, f2, f3 = 1:13, 14:26, 27:36
pairs = [(f1, vcat(f2, f3)), (f2, vcat(f3, f1)), (f3, vcat(f1, f2))];
evaluate!(mach,
          resampling=pairs,
          measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────────┬─────────────┬──────────────┬─────────────────────
│ measure                    │ measurement │ operation    │ per_fold           ⋯
├────────────────────────────┼─────────────┼──────────────┼─────────────────────
│ LogLoss(tol = 2.22045e-16) │ 4.87        │ predict      │ [5.1, 6.48, 3.01]  ⋯
│ Accuracy()                 │ 0.735       │ predict_mode │ [0.696, 0.739, 0.7 ⋯
└────────────────────────────┴─────────────┴──────────────┴─────────────────────
                                                                1 column omitted

Changing a hyperparameter and re-evaluating:

tree.max_depth = 3
evaluate!(mach,
          resampling=CV(nfolds=5, shuffle=true, rng=1234),
          measure=[LogLoss(), Accuracy()])
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────────┬─────────────┬──────────────┬─────────────────────
│ measure                    │ measurement │ operation    │ per_fold           ⋯
├────────────────────────────┼─────────────┼──────────────┼─────────────────────
│ LogLoss(tol = 2.22045e-16) │ 1.19        │ predict      │ [1.26, 0.2, 0.199, ⋯
│ Accuracy()                 │ 0.865       │ predict_mode │ [0.8, 0.95, 0.975, ⋯
└────────────────────────────┴─────────────┴──────────────┴─────────────────────
                                                                1 column omitted
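
For noisier measures, resampling can be repeated, and folds can be processed in parallel; a sketch, assuming the repeats and acceleration options of evaluate!:

evaluate!(mach,
          resampling=CV(nfolds=5, shuffle=true, rng=1234),
          repeats=3,                   # three independent rounds of 5-fold CV
          acceleration=CPUThreads(),   # distribute folds over Julia threads
          measure=[LogLoss(), Accuracy()])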

Inspecting training results

Fit an ordinary least squares model to some synthetic data:

x1 = rand(100)
x2 = rand(100)

X = (x1=x1, x2=x2)
y = x1 - 2x2 + 0.1*rand(100);

OLS = @load LinearRegressor pkg=GLM
ols = OLS()
mach = machine(ols, X, y) |> fit!
Machine{LinearRegressor,…} trained 1 time; caches data
  args: 
    1:	Source @974 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @018 ⏎ `AbstractVector{Continuous}`

Get a named tuple representing the learned parameters, human-readable if appropriate:

fitted_params(mach)
(features = ["x1", "x2"],
 coef = [0.9998434572144332, -1.9976407299283758, 0.049727672595657364],
 intercept = 0.049727672595657364,)

Get other training-related information:

report(mach)
(deviance = 0.08061707073384658,
 dof_residual = 97.0,
 stderror = [0.010419870942299733, 0.010425589373216122, 0.008810148562054832],
 vcov = [0.00010857371045418233 1.5268598379893988e-5 -6.487222205144533e-5; 1.5268598379893988e-5 0.00010869291377891694 -6.617233864319732e-5; -6.487222205144533e-5 -6.617233864319732e-5 7.761871768547682e-5],)
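
The reported standard errors line up with the coefficients returned by fitted_params (intercept last), so approximate 95% confidence intervals can be sketched by hand, assuming normality of the estimates:

fp = fitted_params(mach)
rep = report(mach)
ci = [(c - 1.96s, c + 1.96s) for (c, s) in zip(fp.coef, rep.stderror)]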

Basic fit/transform for unsupervised models

Load data:

X, y = @load_iris
train, test = partition(eachindex(y), 0.97, shuffle=true, rng=123)
([125, 100, 130, 9, 70, 148, 39, 64, 6, 107  …  110, 59, 139, 21, 112, 144, 140, 72, 109, 41], [106, 147, 47, 5])

Instantiate and fit the model/machine:

PCA = @load PCA
pca = PCA(maxoutdim=2)
mach = machine(pca, X)
fit!(mach, rows=train)
Machine{PCA,…} trained 1 time; caches data
  args: 
    1:	Source @483 ⏎ `Table{AbstractVector{Continuous}}`

Transform selected data bound to the machine:

transform(mach, rows=test)
(x1 = [-3.3942826854483243, -1.5219827578765068, 2.538247455185219, 2.7299639893931373],
 x2 = [0.5472450223745241, -0.36842368617126214, 0.5199299511335698, 0.3448466122232363],)

Transform new data:

Xnew = (sepal_length=rand(3), sepal_width=rand(3),
        petal_length=rand(3), petal_width=rand(3));
transform(mach, Xnew)
(x1 = [4.132002426682904, 4.3386388210170646, 5.074302024688248],
 x2 = [-4.699432290274227, -4.612078475716781, -5.083631280955615],)
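
The learned projection and training statistics can be inspected as for any machine; a sketch, assuming the field names used by the MultivariateStats interface (projection in fitted_params, principalvars in the report):

fitted_params(mach).projection   # 4×2 projection matrix
report(mach).principalvars       # variances along the principal components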

Inverting learned transformations

y = rand(100);
stand = Standardizer()
mach = machine(stand, y)
fit!(mach)
z = transform(mach, y);
@assert inverse_transform(mach, z) ≈ y # true
[ Info: Training Machine{Standardizer,…}.
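
Other built-in transformers invert the same way. A sketch using UnivariateBoxCoxTransformer, which, like Standardizer, ships with MLJ:

v = rand(100) .+ 0.1   # Box-Cox requires strictly positive data
mach = machine(UnivariateBoxCoxTransformer(), v) |> fit!
w = transform(mach, v);
@assert inverse_transform(mach, w) ≈ v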

Nested hyperparameter tuning

Reference: Tuning Models

Define a model with nested hyperparameters:

Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree()
forest = EnsembleModel(atom=tree, n=300)
ProbabilisticEnsembleModel(
    atom = DecisionTreeClassifier(
            max_depth = -1,
            min_samples_leaf = 1,
            min_samples_split = 2,
            min_purity_increase = 0.0,
            n_subfeatures = 0,
            post_prune = false,
            merge_purity_threshold = 1.0,
            pdf_smoothing = 0.0,
            display_depth = 5,
            rng = Random._GLOBAL_RNG()),
    atomic_weights = Float64[],
    bagging_fraction = 0.8,
    rng = Random._GLOBAL_RNG(),
    n = 300,
    acceleration = CPU1{Nothing}(nothing),
    out_of_bag_measure = Any[])

Define ranges for hyperparameters to be tuned:

r1 = range(forest, :bagging_fraction, lower=0.5, upper=1.0, scale=:log10)
NumericRange(0.5 ≤ bagging_fraction ≤ 1.0; origin=0.75, unit=0.25) on log10 scale
r2 = range(forest, :(atom.n_subfeatures), lower=1, upper=4) # nested
NumericRange(1 ≤ atom.n_subfeatures ≤ 4; origin=2.5, unit=1.5)

Wrap the model in a tuning strategy:

tuned_forest = TunedModel(model=forest,
                          tuning=Grid(resolution=12),
                          resampling=CV(nfolds=6),
                          ranges=[r1, r2],
                          measure=BrierScore())
ProbabilisticTunedModel(
    model = ProbabilisticEnsembleModel(
            atom = DecisionTreeClassifier,
            atomic_weights = Float64[],
            bagging_fraction = 0.8,
            rng = Random._GLOBAL_RNG(),
            n = 300,
            acceleration = CPU1{Nothing}(nothing),
            out_of_bag_measure = Any[]),
    tuning = Grid(
            goal = nothing,
            resolution = 12,
            shuffle = true,
            rng = Random._GLOBAL_RNG()),
    resampling = CV(
            nfolds = 6,
            shuffle = false,
            rng = Random._GLOBAL_RNG()),
    measure = BrierScore(),
    weights = nothing,
    operation = nothing,
    range = MLJBase.NumericRange{T, MLJBase.Bounded, Symbol} where T[NumericRange(0.5 ≤ bagging_fraction ≤ 1.0; origin=0.75, unit=0.25) on log10 scale, NumericRange(1 ≤ atom.n_subfeatures ≤ 4; origin=2.5, unit=1.5)],
    selection_heuristic = MLJTuning.NaiveSelection(nothing),
    train_best = true,
    repeats = 1,
    n = nothing,
    acceleration = CPU1{Nothing}(nothing),
    acceleration_resampling = CPU1{Nothing}(nothing),
    check_measure = true,
    cache = true)
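
The grid strategy can be swapped for another without changing the rest of the workflow; a sketch using RandomSearch (shown purely as an alternative; the grid version is the one used below), where n is the number of models to sample:

tuned_forest_rs = TunedModel(model=forest,
                             tuning=RandomSearch(rng=123),
                             resampling=CV(nfolds=6),
                             range=[r1, r2],
                             n=25,
                             measure=BrierScore())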

Bound the wrapped model to data:

mach = machine(tuned_forest, X, y)
Machine{ProbabilisticTunedModel{Grid,…},…} trained 0 times; caches data
  args: 
    1:	Source @319 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @137 ⏎ `AbstractVector{Multiclass{3}}`

Fitting the resultant machine optimizes the hyperparameters specified in range, using the specified tuning and resampling strategies and performance measure (possibly a vector of measures), and retrains on all data bound to the machine:

fit!(mach)
Machine{ProbabilisticTunedModel{Grid,…},…} trained 1 time; caches data
  args: 
    1:	Source @319 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @137 ⏎ `AbstractVector{Multiclass{3}}`

Inspecting the optimal model:

F = fitted_params(mach)
(best_model = ProbabilisticEnsembleModel{DecisionTreeClassifier},
 best_fitted_params = (fitresult = WrappedEnsemble{Tuple{Node{Float64,…},…},…},),)
F.best_model
ProbabilisticEnsembleModel(
    atom = DecisionTreeClassifier(
            max_depth = -1,
            min_samples_leaf = 1,
            min_samples_split = 2,
            min_purity_increase = 0.0,
            n_subfeatures = 3,
            post_prune = false,
            merge_purity_threshold = 1.0,
            pdf_smoothing = 0.0,
            display_depth = 5,
            rng = Random._GLOBAL_RNG()),
    atomic_weights = Float64[],
    bagging_fraction = 0.5671562610977313,
    rng = Random._GLOBAL_RNG(),
    n = 300,
    acceleration = CPU1{Nothing}(nothing),
    out_of_bag_measure = Any[])

Inspecting details of tuning procedure:

r = report(mach);
keys(r)
(:best_model, :best_history_entry, :history, :best_report, :plotting)
r.history[[1,end]]
2-element Vector{NamedTuple{(:model, :measure, :measurement, :per_fold), Tuple{MLJEnsembles.ProbabilisticEnsembleModel{MLJDecisionTreeInterface.DecisionTreeClassifier}, Vector{BrierScore}, Vector{Float64}, Vector{Vector{Float64}}}}}:
 (model = ProbabilisticEnsembleModel{DecisionTreeClassifier}, measure = [BrierScore()], measurement = [-0.12081184090534976], per_fold = [[-0.025783111111111187, -0.003321777777777859, -0.1664670987654322, -0.15373140740740715, -0.1602489343209875, -0.21531871604938274]])
 (model = ProbabilisticEnsembleModel{DecisionTreeClassifier}, measure = [BrierScore()], measurement = [-0.10730281481481464], per_fold = [[0.0, 0.0, -0.14540444444444428, -0.15306133333333288, -0.13258044444444422, -0.2127706666666665]])

Visualizing these results:

using Plots
plot(mach)

Predicting on new data using the optimized model:

predict(mach, Xnew)
3-element Vector{UnivariateFinite{Multiclass{3}, String, UInt32, Float64}}:
 UnivariateFinite{Multiclass{3}}(versicolor=>0.417, virginica=>0.03, setosa=>0.553)
 UnivariateFinite{Multiclass{3}}(versicolor=>0.0, virginica=>0.0, setosa=>1.0)
 UnivariateFinite{Multiclass{3}}(versicolor=>0.0, virginica=>0.0, setosa=>1.0)

Constructing a linear pipeline

Reference: Composing Models

Constructing a linear (unbranching) pipeline with a learned target transformation/inverse transformation:

X, y = @load_reduced_ames
KNN = @load KNNRegressor
pipe = @pipeline(X -> coerce(X, :age=>Continuous),
                 OneHotEncoder,
                 KNN(K=3),
                 target = Standardizer)
Pipeline264(
    one_hot_encoder = OneHotEncoder(
            features = Symbol[],
            drop_last = false,
            ordered_factor = true,
            ignore = false),
    knn_regressor = KNNRegressor(
            K = 3,
            algorithm = :kdtree,
            metric = Distances.Euclidean(0.0),
            leafsize = 10,
            reorder = true,
            weights = NearestNeighborModels.Uniform()),
    target = Standardizer(
            features = Symbol[],
            ignore = false,
            ordered_factor = false,
            count = false))

Evaluating the pipeline (just as you would any other model):

pipe.knn_regressor.K = 2
pipe.one_hot_encoder.drop_last = true
evaluate(pipe, X, y, resampling=Holdout(), measure=RootMeanSquaredError(), verbosity=2)
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────┬─────────────┬───────────┬───────────┐
│ measure                │ measurement │ operation │ per_fold  │
├────────────────────────┼─────────────┼───────────┼───────────┤
│ RootMeanSquaredError() │ 53100.0     │ predict   │ [53100.0] │
└────────────────────────┴─────────────┴───────────┴───────────┘

Inspecting the learned parameters in a pipeline:

mach = machine(pipe, X, y) |> fit!
F = fitted_params(mach)
F.one_hot_encoder
(fitresult = OneHotEncoderResult,)

Constructing a linear (unbranching) pipeline with a static (unlearned) target transformation/inverse transformation:

Tree = @load DecisionTreeRegressor pkg=DecisionTree
pipe2 = @pipeline(X -> coerce(X, :age=>Continuous),
                  OneHotEncoder,
                  Tree(max_depth=4),
                  target = y -> log.(y),
                  inverse = z -> exp.(z))
Pipeline275(
    one_hot_encoder = OneHotEncoder(
            features = Symbol[],
            drop_last = false,
            ordered_factor = true,
            ignore = false),
    decision_tree_regressor = DecisionTreeRegressor(
            max_depth = 4,
            min_samples_leaf = 5,
            min_samples_split = 2,
            min_purity_increase = 0.0,
            n_subfeatures = 0,
            post_prune = false,
            merge_purity_threshold = 1.0,
            rng = Random._GLOBAL_RNG()),
    target = WrappedFunction(
            f = Main.ex-workflows.var"#26#27"()),
    inverse = WrappedFunction(
            f = Main.ex-workflows.var"#28#29"()))

Creating a homogeneous ensemble of models

Reference: Homogeneous Ensembles

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree()
forest = EnsembleModel(atom=tree, bagging_fraction=0.8, n=300)
mach = machine(forest, X, y)
evaluate!(mach, measure=LogLoss())
PerformanceEvaluation object with these fields:
  measure, measurement, operation, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_pairs
Extract:
┌────────────────────────────┬─────────────┬───────────┬────────────────────────
│ measure                    │ measurement │ operation │ per_fold              ⋯
├────────────────────────────┼─────────────┼───────────┼────────────────────────
│ LogLoss(tol = 2.22045e-16) │ 0.427       │ predict   │ [3.66e-15, 3.66e-15,  ⋯
└────────────────────────────┴─────────────┴───────────┴────────────────────────
                                                                1 column omitted
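
Ensembles can also estimate performance out-of-bag during training; a sketch using the out_of_bag_measure field visible in the ensemble printouts earlier, with oob_measurements as the assumed name of the corresponding report field:

forest_oob = EnsembleModel(atom=tree, bagging_fraction=0.8, n=300,
                           out_of_bag_measure=[LogLoss()])
mach_oob = machine(forest_oob, X, y) |> fit!
report(mach_oob).oob_measurements   # assumed field name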

Performance curves

Generate a plot of performance as a function of some hyperparameter (building on the preceding example).

Single performance curve:

r = range(forest, :n, lower=1, upper=1000, scale=:log10)
curve = learning_curve(mach,
                       range=r,
                       resampling=Holdout(),
                       resolution=50,
                       measure=LogLoss(),
                       verbosity=0)
(parameter_name = "n",
 parameter_scale = :log10,
 parameter_values = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11  …  281, 324, 373, 429, 494, 569, 655, 754, 869, 1000],
 measurements = [8.009700753137142, 8.179136730607352, 4.126918408642807, 7.3150943051434805, 2.7363826980491766, 5.899649738665178, 4.202257368617792, 2.0084289011186898, 2.736160551407981, 1.9088885842149823  …  1.2349401623105083, 1.2640707545070622, 1.2592286199339435, 1.2438363186078858, 1.245816334023486, 1.2422934743252563, 1.248757356658255, 1.2542963845170354, 0.5854460735584636, 1.2558153847324776],)
using Plots
plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale)

Multiple curves:

curve = learning_curve(mach,
                       range=r,
                       resampling=Holdout(),
                       measure=LogLoss(),
                       resolution=50,
                       rng_name=:rng,
                       rngs=4,
                       verbosity=0)
(parameter_name = "n",
 parameter_scale = :log10,
 parameter_values = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11  …  281, 324, 373, 429, 494, 569, 655, 754, 869, 1000],
 measurements = [4.004850376568572 8.009700753137142 15.218431430960573 9.611640903764572; 4.004850376568572 7.316553572577199 2.710975639523342 8.040507294495363; … ; 1.2512395949721171 1.2413921831693864 1.2465367076273322 1.2560896612560073; 1.2523443415257007 1.252338537502407 1.2419338305358867 1.2590311153904017],)
plot(curve.parameter_values, curve.measurements,
     xlab=curve.parameter_name, xscale=curve.parameter_scale)