# Evaluating Model Performance

MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.

In addition to hold-out and cross-validation, the user can specify their own list of train/test pairs of row indices for resampling, or define their own re-usable resampling strategies.

For simultaneously evaluating *multiple* models and/or data sets, see Benchmarking.

## Evaluating against a single measure

```
julia> using MLJ
julia> X = (a=rand(12), b=rand(12), c=rand(12));
julia> y = X.a + 2X.b + 0.05*rand(12);
julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)()
RidgeRegressor(
    lambda = 1.0,
    bias = true) @806
julia> cv = CV(nfolds=3)
CV(
    nfolds = 3,
    shuffle = false,
    rng = Random._GLOBAL_RNG()) @201
julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
┌────────────────────┬───────────────┬────────────────────────┐
│ _.measure          │ _.measurement │ _.per_fold             │
├────────────────────┼───────────────┼────────────────────────┤
│ LPLoss{Int64} @535 │ 0.256         │ [0.336, 0.0345, 0.397] │
└────────────────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.0282, 1.21, ..., 0.0255], [0.000377, 0.0867, ..., 0.0417], [0.297, 0.329, ..., 0.831]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
```

Alternatively, instead of applying `evaluate` to a model + data, one may call `evaluate!` on an existing machine wrapping the model in data:

```
julia> mach = machine(model, X, y)
Machine{RidgeRegressor,…} @798 trained 0 times; caches data
  args:
    1:  Source @976 ⏎ `Table{AbstractVector{Continuous}}`
    2:  Source @992 ⏎ `AbstractVector{Continuous}`
julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
┌────────────────────┬───────────────┬────────────────────────┐
│ _.measure          │ _.measurement │ _.per_fold             │
├────────────────────┼───────────────┼────────────────────────┤
│ LPLoss{Int64} @535 │ 0.256         │ [0.336, 0.0345, 0.397] │
└────────────────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.0282, 1.21, ..., 0.0255], [0.000377, 0.0867, ..., 0.0417], [0.297, 0.329, ..., 0.831]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
```

(The latter is a mutating call, as the learned parameters stored in the machine may change.)

## Multiple measures

```
julia> evaluate!(mach,
                 resampling=cv,
                 measure=[l1, rms, rmslp1], verbosity=0)
┌───────────────────────────────────────────────────┬───────────────┬───────────
│ _.measure                                         │ _.measurement │ _.per_fo ⋯
├───────────────────────────────────────────────────┼───────────────┼───────────
│ LPLoss{Int64} @237                                │ 0.393         │ [0.429, ⋯
│ RootMeanSquaredError @815                         │ 0.506         │ [0.58, 0 ⋯
│ RootMeanSquaredLogProportionalError{Float64} @763 │ 0.216         │ [0.268, ⋯
└───────────────────────────────────────────────────┴───────────────┴───────────
                                                                1 column omitted
_.per_observation = [[[0.168, 1.1, ..., 0.16], [0.0194, 0.294, ..., 0.204], [0.545, 0.574, ..., 0.911]], missing, missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
```

## Custom measures and weighted measures

```
julia> my_loss(yhat, y) = maximum((yhat - y).^2);
julia> my_per_observation_loss(yhat, y) = abs.(yhat - y);
julia> MLJ.reports_each_observation(::typeof(my_per_observation_loss)) = true;
julia> my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y));
julia> my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w));
julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;
julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;
julia> holdout = Holdout(fraction_train=0.8)
Holdout(
    fraction_train = 0.8,
    shuffle = false,
    rng = Random._GLOBAL_RNG()) @738
julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];
julia> evaluate!(mach,
                 resampling=CV(nfolds=3),
                 measure=[my_loss, my_per_observation_loss, my_weighted_score, l1],
                 weights=weights, verbosity=0)
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ my_loss, my_per_observation_loss
└ @ MLJBase ~/.julia/packages/MLJBase/swtDd/src/resampling.jl:474
┌─────────────────────────┬───────────────┬───────────────────────┐
│ _.measure               │ _.measurement │ _.per_fold            │
├─────────────────────────┼───────────────┼───────────────────────┤
│ my_loss                 │ 0.708         │ [1.21, 0.0867, 0.831] │
│ my_per_observation_loss │ 0.393         │ [0.429, 0.154, 0.598] │
│ my_weighted_score       │ 5.9           │ [2.65, 12.9, 2.18]    │
│ LPLoss{Int64} @237      │ 0.566         │ [0.501, 0.275, 0.922] │
└─────────────────────────┴───────────────┴───────────────────────┘
_.per_observation = [missing, [[0.168, 1.1, ..., 0.16], [0.0194, 0.294, ..., 0.204], [0.545, 0.574, ..., 0.911]], missing, [[0.168, 1.1, ..., 0.16], [0.0194, 0.589, ..., 0.204], [0.545, 1.15, ..., 0.911]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
```

## User-specified train/test sets

Users can either provide their own list of train/test pairs of row indices for resampling, as in this example:

```
julia> fold1 = 1:6; fold2 = 7:12;
julia> evaluate!(mach,
                 resampling = [(fold1, fold2), (fold2, fold1)],
                 measure=[l1, l2], verbosity=0)
┌────────────────────┬───────────────┬───────────────┐
│ _.measure          │ _.measurement │ _.per_fold    │
├────────────────────┼───────────────┼───────────────┤
│ LPLoss{Int64} @237 │ 0.403         │ [0.49, 0.316] │
│ LPLoss{Int64} @535 │ 0.262         │ [0.295, 0.23] │
└────────────────────┴───────────────┴───────────────┘
_.per_observation = [[[0.225, 0.184, ..., 0.832], [0.164, 1.1, ..., 0.0359]], [[0.0505, 0.034, ..., 0.692], [0.0267, 1.22, ..., 0.00129]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
```

Or they can define their own re-usable `ResamplingStrategy` objects; see Custom resampling strategies below.

## Built-in resampling strategies

`MLJBase.Holdout` — Type

```
holdout = Holdout(; fraction_train=0.7,
                    shuffle=nothing,
                    rng=nothing)
```

Holdout resampling strategy, for use in `evaluate!`, `evaluate` and in tuning.

`train_test_pairs(holdout, rows)`

Returns the pair `[(train, test)]`, where `train` and `test` are vectors such that `rows=vcat(train, test)` and `length(train)/length(rows)` is approximately equal to `fraction_train`.

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `Holdout` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is specified.
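The split arithmetic can be illustrated with a minimal sketch in plain Julia. This is not MLJ's actual implementation (the real `Holdout` delegates to `partition`, as shown under Custom resampling strategies below), and `simple_holdout_pair` is a hypothetical helper name:

```julia
# Illustrative holdout split without shuffling: roughly the first
# fraction_train of the rows become the training set, the rest the test set.
function simple_holdout_pair(rows, fraction_train)
    n_train = round(Int, fraction_train * length(rows))
    return [(rows[1:n_train], rows[n_train+1:end])]
end

simple_holdout_pair(1:10, 0.7)   # one (train, test) pair: (1:7, 8:10)
```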

`MLJBase.CV` — Type

`cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)`

Cross-validation resampling strategy, for use in `evaluate!`, `evaluate` and tuning.

`train_test_pairs(cv, rows)`

Returns an `nfolds`-length iterator of `(train, test)` pairs of vectors (row indices), where each `train` and `test` is a sub-vector of `rows`. The `test` vectors are mutually exclusive and exhaust `rows`. Each `train` vector is the complement of the corresponding `test` vector. With no row pre-shuffling, the order of `rows` is preserved, in the sense that `rows` coincides precisely with the concatenation of the `test` vectors, in the order they are generated. The first `r` test vectors have length `n + 1`, where `n, r = divrem(length(rows), nfolds)`, and the remaining test vectors have length `n`.
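The fold-length rule just stated can be sketched in plain Julia. This is an illustration of the documented behavior, not MLJ's internal code, and `cv_test_folds` is a hypothetical helper name:

```julia
# Illustrative computation of CV test folds without shuffling: the test
# vectors partition `rows` in order, and the first r folds each receive
# one extra observation, where n, r = divrem(length(rows), nfolds).
function cv_test_folds(rows::AbstractVector, nfolds::Int)
    n, r = divrem(length(rows), nfolds)
    folds = Vector{Vector{eltype(rows)}}(undef, nfolds)
    i = 1
    for k in 1:nfolds
        len = k <= r ? n + 1 : n      # first r folds get an extra row
        folds[k] = rows[i:i+len-1]
        i += len
    end
    return folds
end

cv_test_folds(collect(1:8), 3)   # fold lengths 3, 3, 2
```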

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `CV` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is explicitly specified.

`MLJBase.StratifiedCV` — Type

```
stratified_cv = StratifiedCV(; nfolds=6,
                               shuffle=false,
                               rng=Random.GLOBAL_RNG)
```

Stratified cross-validation resampling strategy, for use in `evaluate!`, `evaluate` and in tuning. Applies only to classification problems (`OrderedFactor` or `Multiclass` targets).

`train_test_pairs(stratified_cv, rows, y)`

Returns an `nfolds`-length iterator of `(train, test)` pairs of vectors (row indices) where each `train` and `test` is a sub-vector of `rows`. The `test` vectors are mutually exclusive and exhaust `rows`. Each `train` vector is the complement of the corresponding `test` vector.

Unlike regular cross-validation, the distribution of the levels of the target `y` corresponding to each `train` and `test` is constrained, as far as possible, to replicate that of `y[rows]` as a whole.
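The idea behind stratification can be sketched in plain Julia: deal the rows belonging to each class into the folds round-robin, so each test fold's class distribution approximates that of `y[rows]`. This is an illustration only, not MLJ's actual algorithm, and `stratified_test_folds` is a hypothetical helper name:

```julia
# Illustrative round-robin stratification: rows for each level of y are
# distributed across the folds in turn, so each fold's class proportions
# mirror those of y[rows] as closely as possible.
function stratified_test_folds(rows, y, nfolds)
    folds = [Int[] for _ in 1:nfolds]
    for level in unique(y[rows])
        level_rows = [r for r in rows if y[r] == level]
        for (j, r) in enumerate(level_rows)
            push!(folds[mod1(j, nfolds)], r)
        end
    end
    return folds
end

y = ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b']
folds = stratified_test_folds(1:8, y, 2)   # each fold gets two 'a's and two 'b's
```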

The stratified `train_test_pairs` algorithm is invariant to label renaming. For example, if you run `replace!(y, 'a' => 'b', 'b' => 'a')` and then re-run `train_test_pairs`, the returned `(train, test)` pairs will be the same.

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `StratifiedCV` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is explicitly specified.

## Custom resampling strategies

To define your own resampling strategy, make relevant parameters of your strategy the fields of a new type `MyResamplingStrategy <: MLJ.ResamplingStrategy`, and implement one of the following methods:

```
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)
```

Each method takes a vector of indices `rows` and returns a vector `[(t1, e1), (t2, e2), ... (tk, ek)]` of train/test pairs of row indices selected from `rows`. Here `X`, `y` are the input and target data (ignored in simple strategies, such as `Holdout` and `CV`).

Here is the code for the `Holdout` strategy as an example:

```
struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}

    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("`fraction_train` must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end

# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng === nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end

function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test),]
end
```
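Following the same pattern, here is a sketch of a hypothetical user-defined strategy for ordered data, in which each fold trains on an initial segment of `rows` and tests on the block that follows it (sometimes called forward chaining). `ForwardChaining` is invented for illustration, and the stand-alone `ResamplingStrategy` and `train_test_pairs` definitions below stand in for `MLJ.ResamplingStrategy` and `MLJ.train_test_pairs` so the sketch runs on its own:

```julia
# Stand-in for MLJ.ResamplingStrategy, so this sketch is self-contained.
abstract type ResamplingStrategy end

# Hypothetical strategy: nfolds forward-chaining (train, test) pairs.
struct ForwardChaining <: ResamplingStrategy
    nfolds::Int
end

function train_test_pairs(strategy::ForwardChaining, rows)
    n = length(rows)
    # boundaries of nfolds + 1 roughly equal consecutive blocks
    bounds = round.(Int, range(0, n, length=strategy.nfolds + 2))
    # fold k: train on everything up to block k, test on block k + 1
    return [(rows[1:bounds[k+1]], rows[bounds[k+1]+1:bounds[k+2]])
            for k in 1:strategy.nfolds]
end

tt_pairs = train_test_pairs(ForwardChaining(3), 1:12)
# [(1:3, 4:6), (1:6, 7:9), (1:9, 10:12)]
```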

## API

`MLJBase.evaluate!` — Function

```
evaluate!(mach,
          resampling=CV(),
          measure=nothing,
          rows=nothing,
          weights=nothing,
          class_weights=nothing,
          operation=predict,
          repeats=1,
          acceleration=default_resource(),
          force=false,
          verbosity=1,
          check_measure=true)
```

Estimate the performance of a machine `mach` wrapping a supervised model in data, using the specified `resampling` strategy (defaulting to 6-fold cross-validation) and `measure`, which can be a single measure or a vector.

Do `subtypes(MLJ.ResamplingStrategy)` to obtain a list of available resampling strategies. If `resampling` is not an object of type `MLJ.ResamplingStrategy`, then a vector of pairs of the form `(train_rows, test_rows)` is expected. For example, setting

```
resampling = [(1:100, 101:200),
              (101:200, 1:100)]
```

gives two-fold cross-validation using the first 200 rows of data.

The resampling strategy is applied repeatedly (Monte Carlo resampling) if `repeats > 1`. For example, if `repeats = 10` and `resampling = CV(nfolds=5, shuffle=true)`, then a total of 50 `(train, test)` pairs are generated for evaluation and subsequent aggregation.

If `resampling isa MLJ.ResamplingStrategy`, then one may optionally restrict the data used in evaluation by specifying `rows`.

An optional `weights` vector may be passed for measures that support sample weights (`MLJ.supports_weights(measure) == true`); it is ignored by measures that don't. These weights are not to be confused with any weights `w` bound to `mach` (as in `mach = machine(model, X, y, w)`). To pass these to the performance evaluation measures you must explicitly specify `weights=w` in the `evaluate!` call.

Additionally, an optional `class_weights` dictionary may be passed for measures that support class weights (`MLJ.supports_class_weights(measure) == true`); it is ignored by measures that don't. These weights are not to be confused with any weights `class_w` bound to `mach` (as in `mach = machine(model, X, y, class_w)`). To pass these to the performance evaluation measures you must explicitly specify `class_weights=w` in the `evaluate!` call.

User-defined measures are supported; see the manual for details.

If no measure is specified, then `default_measure(mach.model)` is used, unless this default is `nothing`, in which case an error is thrown.

The `acceleration` keyword argument is used to specify the compute resource (a subtype of `ComputationalResources.AbstractResource`) that will be used to accelerate/parallelize the resampling operation.

Although `evaluate!` is mutating, `mach.model` and `mach.args` are untouched.

**Summary of keyword arguments**

- `resampling` - resampling strategy (default is `CV(nfolds=6)`)
- `measure`/`measures` - measure or vector of measures (losses, scores, etc)
- `rows` - vector of observation indices from which both train and test folds are constructed (default is all observations)
- `weights` - per-sample weights for measures that support them (not to be confused with weights used in training)
- `class_weights` - dictionary of per-class weights for use with measures that support these, in classification problems (not to be confused with per-sample `weights` or with class weights used in training)
- `operation` - `predict`, `predict_mean`, `predict_mode` or `predict_median`; `predict` is the default but cannot be used with a deterministic measure if `model isa Probabilistic`
- `repeats` - default is 1; set to a higher value for repeated (Monte Carlo) resampling
- `acceleration` - parallelization option; currently supported options are instances of `CPU1` (single-threaded computation), `CPUThreads` (multi-threaded computation) and `CPUProcesses` (multi-process computation); default is `default_resource()`
- `force` - default is `false`; set to `true` to force cold-restart of each training event
- `verbosity` - verbosity level, an integer defaulting to 1
- `check_measure` - default is `true`

**Return value**

A property-accessible object of type `PerformanceEvaluation` with these properties:

- `measure`: the vector of specified measures
- `measurements`: the corresponding measurements, aggregated across the test folds using the aggregation method defined for each measure (do `aggregation(measure)` to inspect)
- `per_fold`: a vector of vectors of individual test fold evaluations (one vector per measure)
- `per_observation`: a vector of vectors of individual observation evaluations for those measures for which `reports_each_observation(measure)` is true, and `missing` otherwise
- `fitted_params_per_fold`: a vector containing `fitted_params(mach)` for each machine `mach` trained during resampling
- `report_per_fold`: a vector containing `report(mach)` for each machine `mach` trained during resampling

`MLJModelInterface.evaluate` — Function

Some meta-models may choose to implement the `evaluate` operation.