# Evaluating Model Performance

MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.

In addition to hold-out and cross-validation, the user can specify their own list of train/test pairs of row indices for resampling, or define their own re-usable resampling strategies.

For simultaneously evaluating multiple models and/or data sets, see Benchmarking.

## Evaluating against a single measure

julia> using MLJ

julia> X = (a=rand(12), b=rand(12), c=rand(12));

julia> y = X.a + 2X.b + 0.05*rand(12);

julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)()
RidgeRegressor(
lambda = 1.0,
bias = true)

julia> cv=CV(nfolds=3)
CV(
nfolds = 3,
shuffle = false,
rng = Random._GLOBAL_RNG())

julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
PerformanceEvaluation object with these fields:
measure, measurement, operation, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_pairs
Extract:
┌───────────────┬─────────────┬───────────┬───────────────────────┐
│ measure       │ measurement │ operation │ per_fold              │
├───────────────┼─────────────┼───────────┼───────────────────────┤
│ LPLoss(p = 2) │ 0.156       │ predict   │ [0.27, 0.106, 0.0907] │
└───────────────┴─────────────┴───────────┴───────────────────────┘

Alternatively, instead of applying evaluate to a model + data, one may call evaluate! on an existing machine wrapping the model in data:

julia> mach = machine(model, X, y)
Machine{RidgeRegressor,…} trained 0 times; caches data
args:
1:	Source @296 ⏎ Table{AbstractVector{Continuous}}
2:	Source @916 ⏎ AbstractVector{Continuous}

julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
PerformanceEvaluation object with these fields:
measure, measurement, operation, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_pairs
Extract:
┌───────────────┬─────────────┬───────────┬───────────────────────┐
│ measure       │ measurement │ operation │ per_fold              │
├───────────────┼─────────────┼───────────┼───────────────────────┤
│ LPLoss(p = 2) │ 0.156       │ predict   │ [0.27, 0.106, 0.0907] │
└───────────────┴─────────────┴───────────┴───────────────────────┘

(The latter call is mutating, as the learned parameters stored in the machine may change.)

## Multiple measures

julia> evaluate!(mach,
resampling=cv,
measure=[l1, rms, rmslp1], verbosity=0)
PerformanceEvaluation object with these fields:
measure, measurement, operation, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_pairs
Extract:
┌───────────────────────────────────────────────────┬─────────────┬─────────────
│ measure                                           │ measurement │ operation  ⋯
├───────────────────────────────────────────────────┼─────────────┼─────────────
│ LPLoss(p = 1)                                     │ 0.293       │ predict    ⋯
│ RootMeanSquaredError()                            │ 0.395       │ predict    ⋯
│ RootMeanSquaredLogProportionalError(offset = 1.0) │ 0.167       │ predict    ⋯
└───────────────────────────────────────────────────┴─────────────┴─────────────
1 column omitted

## Custom measures and weighted measures

julia> my_loss(yhat, y) = maximum((yhat - y).^2);

julia> my_per_observation_loss(yhat, y) = abs.(yhat - y);

julia> MLJ.reports_each_observation(::typeof(my_per_observation_loss)) = true;

julia> my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y));

julia> my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w));

julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;

julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;

julia> holdout = Holdout(fraction_train=0.8)
Holdout(
fraction_train = 0.8,
shuffle = false,
rng = Random._GLOBAL_RNG())

julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];

julia> evaluate!(mach,
resampling=CV(nfolds=3),
measure=[my_loss, my_per_observation_loss, my_weighted_score, l1],
weights=weights, verbosity=0)
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ my_loss, my_per_observation_loss
└ @ MLJBase ~/.julia/packages/MLJBase/HZmTU/src/resampling.jl:675
PerformanceEvaluation object with these fields:
measure, measurement, operation, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_pairs
Extract:
┌─────────────────────────┬─────────────┬───────────┬───────────────────────┐
│ measure                 │ measurement │ operation │ per_fold              │
├─────────────────────────┼─────────────┼───────────┼───────────────────────┤
│ my_loss                 │ 0.471       │ predict   │ [1.04, 0.203, 0.172]  │
│ my_per_observation_loss │ 0.293       │ predict   │ [0.346, 0.269, 0.264] │
│ my_weighted_score       │ 5.44        │ predict   │ [3.12, 7.02, 6.19]    │
│ LPLoss(p = 1)           │ 0.468       │ predict   │ [0.374, 0.539, 0.49]  │
└─────────────────────────┴─────────────┴───────────┴───────────────────────┘
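To see what each custom function returns before handing it to evaluate!, it can help to call it directly on small vectors. The following is a minimal sketch using the same definitions as in the transcript above; the toy values of yhat, y and w are made up for illustration, and Statistics is loaded explicitly so the snippet runs without MLJ.

```julia
using Statistics  # provides `mean`

# Same definitions as in the transcript above.
my_loss(yhat, y) = maximum((yhat - y).^2)
my_per_observation_loss(yhat, y) = abs.(yhat - y)
my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y))
my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w))

# Toy data (illustrative values only, not from the example above).
yhat = [1.0, 2.0, 4.0]
y    = [1.0, 3.0, 2.0]
w    = [1, 2, 2]

aggregate_loss  = my_loss(yhat, y)                  # a single number for the whole vector
per_observation = my_per_observation_loss(yhat, y)  # one number per observation
unweighted      = my_weighted_score(yhat, y)        # two-argument method
weighted        = my_weighted_score(yhat, y, w)     # three-argument method, uses w
```

The first function returns a scalar while the second returns a vector of the same length as y, which is why only the second is declared with reports_each_observation.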

## User-specified train/test sets

Users can provide their own list of train/test pairs of row indices for resampling, as in this example:

julia> fold1 = 1:6; fold2 = 7:12;

julia> evaluate!(mach,
resampling = [(fold1, fold2), (fold2, fold1)],
measure=[l1, l2], verbosity=0)
PerformanceEvaluation object with these fields:
measure, measurement, operation, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_pairs
Extract:
┌───────────────┬─────────────┬───────────┬────────────────┐
│ measure       │ measurement │ operation │ per_fold       │
├───────────────┼─────────────┼───────────┼────────────────┤
│ LPLoss(p = 1) │ 0.28        │ predict   │ [0.24, 0.321]  │
│ LPLoss(p = 2) │ 0.145       │ predict   │ [0.0802, 0.21] │
└───────────────┴─────────────┴───────────┴────────────────┘

Alternatively, users can define their own re-usable ResamplingStrategy objects; see Custom resampling strategies below.
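When there are more than two folds, writing the pairs out by hand becomes tedious. The sketch below builds the list from a collection of disjoint folds, using each fold once as the test set and the rest as the training set. The helper name pairs_from_folds is hypothetical (it is not part of MLJ); the result has the same shape as the resampling argument in the example above.

```julia
# Hypothetical helper, not part of MLJ: build (train, test) pairs from
# a list of disjoint folds, using each fold once as the test set.
function pairs_from_folds(folds)
    length(folds) >= 2 || throw(ArgumentError("need at least two folds"))
    return map(eachindex(folds)) do i
        # Training rows are all rows outside fold i, in their original order.
        train = reduce(vcat, [collect(f) for (j, f) in enumerate(folds) if j != i])
        (train, collect(folds[i]))
    end
end

folds = [1:4, 5:8, 9:12]
tt_pairs = pairs_from_folds(folds)
```

The resulting vector could then be passed as resampling=tt_pairs in place of a built-in strategy.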

## Built-in resampling strategies

MLJBase.Holdout - Type
holdout = Holdout(; fraction_train=0.7,
shuffle=nothing,
rng=nothing)

Holdout resampling strategy, for use in evaluate!, evaluate and in tuning.

train_test_pairs(holdout, rows)

Returns the pair [(train, test)], where train and test are vectors such that rows=vcat(train, test) and length(train)/length(rows) is approximately equal to fraction_train.

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the Holdout keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is specified.
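The behaviour described above can be sketched in plain Julia. This is only an illustration, not the MLJBase implementation: the function name holdout_pair is hypothetical, and rounding fraction_train * length(rows) to the nearest integer is an assumption about how the "approximately equal" split is made.

```julia
using Random

# Hypothetical stand-in for a holdout split; not the MLJBase code.
function holdout_pair(rows, fraction_train; shuffle=false, rng=Random.default_rng())
    0 < fraction_train < 1 ||
        throw(ArgumentError("fraction_train must be between 0 and 1"))
    # Optionally pre-shuffle the rows; otherwise keep their order.
    idx = shuffle ? Random.shuffle(rng, collect(rows)) : collect(rows)
    # Assumed rounding rule for the size of the training set.
    ntrain = round(Int, fraction_train * length(idx))
    return [(idx[1:ntrain], idx[ntrain+1:end])]
end

unshuffled = holdout_pair(1:10, 0.7)
shuffled = holdout_pair(1:10, 0.7; shuffle=true, rng=MersenneTwister(1))
```

In both cases the train and test vectors together contain every row exactly once; only the order differs when shuffling is switched on.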

MLJBase.CV - Type
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)

Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning.

train_test_pairs(cv, rows)

Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector. With no row pre-shuffling, the order of rows is preserved, in the sense that rows coincides precisely with the concatenation of the test vectors, in the order they are generated. The first r test vectors have length n + 1, where n, r = divrem(length(rows), nfolds), and the remaining test vectors have length n.
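The fold-length rule in the previous paragraph can be checked with a small stand-alone sketch. The function names cv_test_folds and cv_pairs are hypothetical and the code is not the MLJBase implementation; it only reproduces the unshuffled behaviour as described.

```julia
# Hypothetical sketch of unshuffled cross-validation folds.
function cv_test_folds(rows, nfolds)
    n, r = divrem(length(rows), nfolds)
    # The first r test folds get n + 1 rows, the remaining ones get n.
    sizes = [i <= r ? n + 1 : n for i in 1:nfolds]
    stops = cumsum(sizes)
    starts = stops .- sizes .+ 1
    return [rows[a:b] for (a, b) in zip(starts, stops)]
end

# Each train vector is the complement of the corresponding test vector.
cv_pairs(rows, nfolds) =
    [(setdiff(rows, test), test) for test in cv_test_folds(rows, nfolds)]

test_folds = cv_test_folds(1:10, 3)
tt_pairs = cv_pairs(1:10, 3)
```

With 10 rows and 3 folds, divrem(10, 3) gives n = 3 and r = 1, so the first test fold has four rows and the other two have three.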

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the CV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.

MLJBase.StratifiedCV - Type
stratified_cv = StratifiedCV(; nfolds=6,
shuffle=false,
rng=Random.GLOBAL_RNG)

Stratified cross-validation resampling strategy, for use in evaluate!, evaluate and in tuning. Applies only to classification problems (OrderedFactor or Multiclass targets).

train_test_pairs(stratified_cv, rows, y)

Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices) where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector.

Unlike regular cross-validation, the distribution of the levels of the target y corresponding to each train and test is constrained, as far as possible, to replicate that of y[rows] as a whole.

The stratified train_test_pairs algorithm is invariant to label renaming. For example, if you run replace!(y, 'a' => 'b', 'b' => 'a') and then re-run train_test_pairs, the returned (train, test) pairs will be the same.

Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the StratifiedCV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.

If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
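To make the idea of stratification concrete, here is a deliberately simple sketch that deals the rows of each class out to the folds in round-robin fashion. It is an illustration of the general technique only and is not the algorithm used by StratifiedCV; the function name stratified_test_folds is hypothetical.

```julia
# Hypothetical round-robin stratification; NOT the StratifiedCV algorithm.
function stratified_test_folds(y, nfolds)
    folds = [Int[] for _ in 1:nfolds]
    k = 0  # running counter, keeps fold sizes balanced across classes
    for level in unique(y)              # levels in order of first appearance
        for i in findall(==(level), y)  # row indices carrying this level
            k += 1
            push!(folds[mod1(k, nfolds)], i)
        end
    end
    return map(sort!, folds)
end

y = ['a', 'a', 'a', 'a', 'b', 'b']
folds = stratified_test_folds(y, 2)

# Swapping the labels leaves the folds unchanged, because only the
# grouping of rows matters, not the names of the levels.
y_swapped = ['b', 'b', 'b', 'b', 'a', 'a']
folds_swapped = stratified_test_folds(y_swapped, 2)
```

Each test fold here contains two rows of the majority class and one of the minority class, mirroring the 4:2 ratio in y.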

MLJBase.TimeSeriesCV - Type

tscv = TimeSeriesCV(; nfolds=4)

Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning, when observations are chronological and not expected to be independent.

train_test_pairs(tscv, rows)

Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The rows are partitioned sequentially into nfolds + 1 approximately equal length partitions, where the first partition is the first train set, and the second partition is the first test set. The second train set consists of the first two partitions, and the second test set consists of the third partition, and so on for each fold.

The first partition (which is the first train set) has length n + r, where n, r = divrem(length(rows), nfolds + 1), and the remaining partitions (all of the test folds) have length n.
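The partitioning rule just described can be written out as a short stand-alone sketch. The function name ts_pairs is hypothetical and this is not the MLJBase source; it only follows the description above, with the first train set of length n + r and every test fold of length n.

```julia
# Hypothetical sketch of expanding-window time series folds.
function ts_pairs(rows, nfolds)
    n, r = divrem(length(rows), nfolds + 1)
    n > 0 || throw(ArgumentError("too few rows for the requested number of folds"))
    first_stop = n + r  # last row of the first train set
    return map(1:nfolds) do k
        train_stop = first_stop + (k - 1) * n
        # Train on everything up to train_stop, test on the next n rows.
        (rows[1:train_stop], rows[train_stop + 1:train_stop + n])
    end
end

tt_pairs = ts_pairs(1:10, 3)
```

For 10 rows and 3 folds, divrem(10, 4) gives n = 2 and r = 2, so the first train set has four rows and each test fold has two.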

Examples

julia> MLJBase.train_test_pairs(TimeSeriesCV(nfolds=3), 1:10)
3-element Vector{Tuple{UnitRange{Int64}, UnitRange{Int64}}}:
(1:4, 5:6)
(1:6, 7:8)
(1:8, 9:10)

julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)();

julia> X = (lag1 = data.sunspot_number[2:end-1],
lag2 = data.sunspot_number[1:end-2]);

julia> y = data.sunspot_number[3:end];

julia> tscv = TimeSeriesCV(nfolds=3);

julia> evaluate(model, X, y, resampling=tscv, measure=rmse, verbosity=0)
┌───────────────────────────┬───────────────┬────────────────────┐
│ _.measure                 │ _.measurement │ _.per_fold         │
├───────────────────────────┼───────────────┼────────────────────┤
│ RootMeanSquaredError @753 │ 21.7          │ [25.4, 16.3, 22.4] │
└───────────────────────────┴───────────────┴────────────────────┘
_.per_observation = [missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
_.train_test_rows = [ … ]

## Custom resampling strategies

To define your own resampling strategy, make relevant parameters of your strategy the fields of a new type MyResamplingStrategy <: MLJ.ResamplingStrategy, and implement one of the following methods:

MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)

Each method takes a vector of indices rows and returns a vector [(t1, e1), (t2, e2), ... (tk, ek)] of train/test pairs of row indices selected from rows. Here X, y are the input and target data (ignored in simple strategies, such as Holdout and CV).

Here is the code for the Holdout strategy as an example:

struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}

    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("fraction_train must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end

# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng===nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end

function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test),]
end
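As a further illustration of the same pattern, here is a sketch of a made-up strategy that tests on the last ntest rows and leaves a gap of unused rows before them. The type GappedHoldout is hypothetical. So that the snippet runs without MLJ installed, it declares a local stand-in abstract type and a local train_test_pairs function; in real use one would instead subtype MLJ.ResamplingStrategy and add a method to MLJ.train_test_pairs, as described above.

```julia
# Stand-in for MLJ.ResamplingStrategy, so the sketch is self-contained.
abstract type ResamplingStrategy end

# Hypothetical strategy: strategy parameters become fields of the type.
struct GappedHoldout <: ResamplingStrategy
    ntest::Int  # number of trailing rows used for testing
    gap::Int    # number of rows skipped between train and test
end

# Local stand-in for a method of MLJ.train_test_pairs.
function train_test_pairs(strategy::GappedHoldout, rows)
    n = length(rows)
    ntrain = n - strategy.ntest - strategy.gap
    ntrain > 0 || throw(ArgumentError("not enough rows for this strategy"))
    # A single (train, test) pair, returned as a one-element vector.
    return [(rows[1:ntrain], rows[n - strategy.ntest + 1:n])]
end

tt_pairs = train_test_pairs(GappedHoldout(3, 2), 1:10)
```

As with Holdout, the method returns a vector of pairs even though there is only one split.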

## API

MLJBase.evaluate! - Function
evaluate!(mach,
resampling=CV(),
measure=nothing,
rows=nothing,
weights=nothing,
class_weights=nothing,
operation=nothing,
repeats=1,
acceleration=default_resource(),
force=false,
verbosity=1,
check_measure=true)

Estimate the performance of a machine mach wrapping a supervised model in data, using the specified resampling strategy (defaulting to 6-fold cross-validation) and measure, which can be a single measure or vector.

Do subtypes(MLJ.ResamplingStrategy) to obtain a list of available resampling strategies. If resampling is not an object of type MLJ.ResamplingStrategy, then a vector of pairs of the form (train_rows, test_rows) is expected. For example, setting

resampling = [(1:100, 101:200),
              (101:200, 1:100)]

gives two-fold cross-validation using the first 200 rows of data.

The type of operation (predict, predict_mode, etc) to be associated with measure is automatically inferred from measure traits where possible. For example, predict_mode will be used for a Multiclass target, if model is probabilistic but measure is deterministic. The operations applied can be inspected from the operation field of the object returned. Alternatively, operations can be explicitly specified using operation=.... If measure is a vector, then operation must be a single operation, which will be associated with all measures, or a vector of the same length as measure.

The resampling strategy is applied repeatedly (Monte Carlo resampling) if repeats > 1. For example, if repeats = 10, then resampling = CV(nfolds=5, shuffle=true) generates a total of 50 (train, test) pairs for evaluation and subsequent aggregation.
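The count of 50 pairs is simply repeats times nfolds. The sketch below generates such pairs in plain Julia to make this explicit; the function name repeated_cv_pairs is hypothetical and the code is not what evaluate! runs internally. It assumes a fresh shuffle of the rows at the start of every repeat.

```julia
using Random

# Hypothetical sketch of repeated, shuffled k-fold resampling.
function repeated_cv_pairs(rows, nfolds, repeats; rng=MersenneTwister(0))
    all_pairs = Tuple{Vector{Int},Vector{Int}}[]
    for _ in 1:repeats
        shuffled = shuffle(rng, collect(rows))  # new row order for each repeat
        n, r = divrem(length(shuffled), nfolds)
        start = 1
        for i in 1:nfolds
            stop = start + n - 1 + (i <= r)     # first r folds get one extra row
            test = shuffled[start:stop]
            train = vcat(shuffled[1:start-1], shuffled[stop+1:end])
            push!(all_pairs, (train, test))
            start = stop + 1
        end
    end
    return all_pairs
end

tt_pairs = repeated_cv_pairs(1:20, 5, 10)
```

Within each block of five consecutive pairs the test sets partition the rows, so every row is tested exactly once per repeat.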

If resampling isa MLJ.ResamplingStrategy then one may optionally restrict the data used in evaluation by specifying rows.

An optional weights vector may be passed for measures that support sample weights (MLJ.supports_weights(measure) == true), which is ignored by those that don't. These weights are not to be confused with any weights w bound to mach (as in mach = machine(model, X, y, w)). To pass these to the performance evaluation measures you must explicitly specify weights=w in the evaluate! call.

Additionally, an optional class_weights dictionary may be passed for measures that support class weights (MLJ.supports_class_weights(measure) == true), which is ignored by those that don't. These weights are not to be confused with any weights class_w bound to mach (as in mach = machine(model, X, y, class_w)). To pass these to the performance evaluation measures you must explicitly specify class_weights=class_w in the evaluate! call.

User-defined measures are supported; see the manual for details.

If no measure is specified, then default_measure(mach.model) is used, unless this default is nothing, in which case an error is thrown.

The acceleration keyword argument is used to specify the compute resource (a subtype of ComputationalResources.AbstractResource) that will be used to accelerate/parallelize the resampling operation.

Although evaluate! is mutating, mach.model and mach.args are untouched.

Summary of keyword arguments

• resampling - resampling strategy (default is CV(nfolds=6))

• measure/measures - measure or vector of measures (losses, scores, etc)

• rows - vector of observation indices from which both train and test folds are constructed (default is all observations)

• weights - per-sample weights for measures that support them (not to be confused with weights used in training)

• class_weights - dictionary of per-class weights for use with measures that support these, in classification problems (not to be confused with per-sample weights or with class weights used in training)

• operation/operations - One of predict, predict_mean, predict_mode, predict_median, or predict_joint, or a vector of these of the same length as measure/measures. Automatically inferred if left unspecified.

• repeats - default is 1; set to a higher value for repeated (Monte Carlo) resampling

• acceleration - parallelization option; currently supported options are instances of CPU1 (single-threaded computation), CPUThreads (multi-threaded computation) and CPUProcesses (multi-process computation); default is default_resource().

• force - default is false; set to true to force a cold restart of each training event

• verbosity - verbosity level, an integer defaulting to 1

• check_measure - default is true

Return value

A property-accessible object of type PerformanceEvaluation with these properties:

• measure: the vector of specified measures

• measurement: the corresponding measurements, aggregated across the test folds using the aggregation method defined for each measure (do aggregation(measure) to inspect)

• operation: for each measure, the operation applied; one of: predict, predict_mean, predict_mode, predict_median, or predict_joint.

• per_fold: a vector of vectors of individual test fold evaluations (one vector per measure)

• per_observation: a vector of vectors of individual observation evaluations of those measures for which reports_each_observation(measure) is true, which is otherwise reported missing

• fitted_params_per_fold: a vector containing fitted_params(mach) for each machine mach trained during resampling

• report_per_fold: a vector containing report(mach) for each machine mach trained during resampling
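The relationship between per_fold and measurement can be checked against the single-measure example earlier on this page. The sketch below assumes the aggregate is the plain mean of the per-fold values, which is consistent with the displayed numbers (the three folds there have equal size); other measures may aggregate differently, as reported by aggregation(measure).

```julia
using Statistics  # provides `mean`

# Per-fold l2 losses copied from the single-measure example above.
per_fold = [0.27, 0.106, 0.0907]

# Assumed aggregation: the plain mean across test folds.
measurement = mean(per_fold)

# Round to three significant digits, as in the displayed table.
displayed = round(measurement, sigdigits=3)
```

The result agrees with the value 0.156 shown in the measurement column of that example.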