# Evaluating Model Performance

MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.

In addition to hold-out and cross-validation, the user can specify their own list of train/test pairs of row indices for resampling, or define their own re-usable resampling strategies.

For simultaneously evaluating *multiple* models and/or data sets, see Benchmarking.

## Evaluating against a single measure

```
julia> using MLJ
julia> X = (a=rand(12), b=rand(12), c=rand(12));
julia> y = X.a + 2X.b + 0.05*rand(12);
julia> model = @load RidgeRegressor pkg=MultivariateStats
RidgeRegressor(
    lambda = 1.0) @994
julia> cv=CV(nfolds=3)
CV(
    nfolds = 3,
    shuffle = false,
    rng = MersenneTwister(UInt32[0xf88b2cd5, 0x0b67973f, 0xa0e4f6c0, 0x05d66c97]) @ 306) @359
julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
┌───────────┬───────────────┬────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold             │
├───────────┼───────────────┼────────────────────────┤
│ l2        │ 0.144         │ [0.0634, 0.186, 0.182] │
└───────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.132, 0.00688, ..., 0.0416], [0.0107, 0.41, ..., 0.131], [0.122, 0.376, ..., 0.189]]]
```

Alternatively, instead of applying `evaluate` to a model + data, one may call `evaluate!` on an existing machine wrapping the model in data:

```
julia> mach = machine(model, X, y)
Machine{RidgeRegressor} @227 trained 0 times.
  args:
    1:  Source @059 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:  Source @432 ⏎ `AbstractArray{Continuous,1}`
julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
┌───────────┬───────────────┬────────────────────────┐
│ _.measure │ _.measurement │ _.per_fold             │
├───────────┼───────────────┼────────────────────────┤
│ l2        │ 0.144         │ [0.0634, 0.186, 0.182] │
└───────────┴───────────────┴────────────────────────┘
_.per_observation = [[[0.132, 0.00688, ..., 0.0416], [0.0107, 0.41, ..., 0.131], [0.122, 0.376, ..., 0.189]]]
```

(The latter call is mutating, as the learned parameters stored in the machine potentially change.)

## Multiple measures

```
julia> evaluate!(mach,
                 resampling=cv,
                 measure=[l1, rms, rmslp1], verbosity=0)
┌───────────┬───────────────┬───────────────────────┐
│ _.measure │ _.measurement │ _.per_fold            │
├───────────┼───────────────┼───────────────────────┤
│ l1        │ 0.338         │ [0.23, 0.386, 0.4]    │
│ rms       │ 0.379         │ [0.252, 0.431, 0.427] │
│ rmslp1    │ 0.156         │ [0.093, 0.167, 0.191] │
└───────────┴───────────────┴───────────────────────┘
_.per_observation = [[[0.363, 0.0829, ..., 0.204], [0.103, 0.641, ..., 0.362], [0.349, 0.613, ..., 0.435]], missing, missing]
```

## Custom measures and weighted measures

```
julia> my_loss(yhat, y) = maximum((yhat - y).^2);
julia> my_per_observation_loss(yhat, y) = abs.(yhat - y);
julia> MLJ.reports_each_observation(::typeof(my_per_observation_loss)) = true;
julia> my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y));
julia> my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w));
julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;
julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;
julia> holdout = Holdout(fraction_train=0.8)
Holdout(
    fraction_train = 0.8,
    shuffle = false,
    rng = MersenneTwister(UInt32[0xf88b2cd5, 0x0b67973f, 0xa0e4f6c0, 0x05d66c97]) @ 306) @864
julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];
julia> evaluate!(mach,
                 resampling=CV(nfolds=3),
                 measure=[my_loss, my_per_observation_loss, my_weighted_score, l1],
                 weights=weights, verbosity=0)
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ my_loss, my_per_observation_loss
└ @ MLJBase ~/.julia/packages/MLJBase/cJmIS/src/resampling.jl:601
┌─────────────────────────┬───────────────┬───────────────────────┐
│ _.measure               │ _.measurement │ _.per_fold            │
├─────────────────────────┼───────────────┼───────────────────────┤
│ my_loss                 │ 0.306         │ [0.132, 0.41, 0.376]  │
│ my_per_observation_loss │ 0.338         │ [0.23, 0.386, 0.4]    │
│ my_weighted_score       │ 4.38          │ [5.53, 4.17, 3.42]    │
│ l1                      │ 0.572         │ [0.298, 0.764, 0.654] │
└─────────────────────────┴───────────────┴───────────────────────┘
_.per_observation = [missing, [[0.363, 0.0829, ..., 0.204], [0.103, 0.641, ..., 0.362], [0.349, 0.613, ..., 0.435]], missing, [[0.363, 0.0829, ..., 0.204], [0.103, 1.28, ..., 0.362], [0.349, 1.23, ..., 0.435]]]
```

## User-specified train/test sets

Users can provide their own list of train/test pairs of row indices for resampling, as in this example:

```
julia> fold1 = 1:6; fold2 = 7:12;
julia> evaluate!(mach,
                 resampling = [(fold1, fold2), (fold2, fold1)],
                 measure=[l1, l2], verbosity=0)
┌───────────┬───────────────┬────────────────┐
│ _.measure │ _.measurement │ _.per_fold     │
├───────────┼───────────────┼────────────────┤
│ l1        │ 0.376         │ [0.395, 0.356] │
│ l2        │ 0.2           │ [0.201, 0.199] │
└───────────┴───────────────┴────────────────┘
_.per_observation = [[[0.576, 0.125, ..., 0.5], [0.571, 0.22, ..., 0.778]], [[0.332, 0.0157, ..., 0.25], [0.326, 0.0486, ..., 0.605]]]
```

Alternatively, users can define their own re-usable `ResamplingStrategy` objects; see Custom resampling strategies below.

## Built-in resampling strategies

`MLJBase.Holdout` — Type

```
holdout = Holdout(; fraction_train=0.7,
                    shuffle=nothing,
                    rng=nothing)
```

Holdout resampling strategy, for use in `evaluate!`, `evaluate` and in tuning.

`train_test_pairs(holdout, rows)`

Returns the one-element vector `[(train, test)]`, where `train` and `test` are vectors such that `rows=vcat(train, test)` and `length(train)/length(rows)` is approximately equal to `fraction_train`.

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `Holdout` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is specified.
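The contract described above can be illustrated without MLJ. The following is a minimal sketch in plain Julia, not the library's implementation: the helper name `holdout_pair` is hypothetical, and rounding `fraction_train * length(rows)` to the nearest integer is an assumption about how "approximately equal" is resolved.

```julia
using Random

# Hypothetical helper (not part of MLJ): split `rows` into a single
# (train, test) pair, optionally shuffling first.
function holdout_pair(rows; fraction_train=0.7, shuffle=false,
                      rng=Random.default_rng())
    rows = collect(rows)
    if shuffle
        rows = Random.shuffle(rng, rows)
    end
    # Assumption: the training size is rounded to the nearest integer.
    n_train = round(Int, fraction_train * length(rows))
    train = rows[1:n_train]
    test = rows[n_train+1:end]
    return [(train, test)]
end

splits = holdout_pair(1:10; fraction_train=0.7)
train, test = splits[1]
println(train)  # rows 1 to 7
println(test)   # rows 8 to 10
```

Without shuffling, `vcat(train, test)` reproduces `rows` in order, which is the property stated in the docstring.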

`MLJBase.CV` — Type

`cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)`

Cross-validation resampling strategy, for use in `evaluate!`, `evaluate` and tuning.

`train_test_pairs(cv, rows)`

Returns an `nfolds`-length iterator of `(train, test)` pairs of vectors (row indices), where each `train` and `test` is a sub-vector of `rows`. The `test` vectors are mutually exclusive and exhaust `rows`. Each `train` vector is the complement of the corresponding `test` vector. With no row pre-shuffling, the order of `rows` is preserved, in the sense that `rows` coincides precisely with the concatenation of the `test` vectors, in the order they are generated. The first `r` test vectors have length `n + 1`, where `n, r = divrem(length(rows), nfolds)`, and the remaining test vectors have length `n`.

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `CV` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is explicitly specified.
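The fold-size rule above (`n, r = divrem(length(rows), nfolds)`) is easy to check by hand. Here is a minimal sketch of the unshuffled case in plain Julia; the helper `cv_pairs` is hypothetical and is only meant to reproduce the properties stated in the docstring, not MLJ's own code.

```julia
# Hypothetical helper (not MLJ code): unshuffled k-fold (train, test) pairs.
function cv_pairs(rows, nfolds)
    rows = collect(rows)
    n, r = divrem(length(rows), nfolds)
    out = Tuple{Vector{eltype(rows)},Vector{eltype(rows)}}[]
    start = 1
    for k in 1:nfolds
        len = k <= r ? n + 1 : n      # the first r folds get one extra row
        stop = start + len - 1
        test = rows[start:stop]
        train = vcat(rows[1:start-1], rows[stop+1:end])  # complement of test
        push!(out, (train, test))
        start = stop + 1
    end
    return out
end

folds = cv_pairs(1:14, 4)             # n = 3, r = 2
println([length(test) for (_, test) in folds])
```

For 14 rows and 4 folds the test vectors have lengths 4, 4, 3 and 3, and concatenating them in order gives back `1:14`.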

`MLJBase.StratifiedCV` — Type

```
stratified_cv = StratifiedCV(; nfolds=6,
                               shuffle=false,
                               rng=Random.GLOBAL_RNG)
```

Stratified cross-validation resampling strategy, for use in `evaluate!`, `evaluate` and in tuning. Applies only to classification problems (`OrderedFactor` or `Multiclass` targets).

`train_test_pairs(stratified_cv, rows, y)`

Returns an `nfolds`-length iterator of `(train, test)` pairs of vectors (row indices) where each `train` and `test` is a sub-vector of `rows`. The `test` vectors are mutually exclusive and exhaust `rows`. Each `train` vector is the complement of the corresponding `test` vector.

Unlike regular cross-validation, the distribution of the levels of the target `y` corresponding to each `train` and `test` is constrained, as far as possible, to replicate that of `y[rows]` as a whole.

The stratified `train_test_pairs` algorithm is invariant to label renaming. For example, if you run `replace!(y, 'a' => 'b', 'b' => 'a')` and then re-run `train_test_pairs`, the returned `(train, test)` pairs will be the same.

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `StratifiedCV` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is explicitly specified.
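To make the idea of stratification and label-renaming invariance concrete, here is a small sketch in plain Julia. It is not MLJ's algorithm: the helper `stratified_pairs` is hypothetical and simply deals the rows of each class round-robin into the folds, visiting classes in order of first appearance.

```julia
# Hypothetical helper (not MLJ's algorithm): deal the rows of each class
# round-robin into `nfolds` test sets, so class proportions are spread evenly.
function stratified_pairs(rows, y, nfolds)
    rows = collect(rows)
    tests = [eltype(rows)[] for _ in 1:nfolds]
    counter = 0
    # Classes are visited in order of first appearance, so the result
    # depends on where the labels occur, not on what they are called.
    for level in unique(y[rows])
        for i in rows
            y[i] == level || continue
            counter += 1
            push!(tests[mod1(counter, nfolds)], i)
        end
    end
    return [(setdiff(rows, test), test) for test in tests]
end

y = ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b']
splits = stratified_pairs(1:9, y, 3)
for (train, test) in splits
    println(test, " => ", y[test])
end
```

In this toy data each test fold receives two `'a'` rows and one `'b'` row, mirroring the 2:1 ratio in `y`, and swapping the two labels leaves the returned pairs unchanged.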

## Custom resampling strategies

To define your own resampling strategy, make relevant parameters of your strategy the fields of a new type `MyResamplingStrategy <: MLJ.ResamplingStrategy`, and implement one of the following methods:

```
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)
```

Each method takes a vector of indices `rows` and returns a vector `[(t1, e1), (t2, e2), ... (tk, ek)]` of train/test pairs of row indices selected from `rows`. Here `X`, `y` are the input and target data (ignored in simple strategies, such as `Holdout` and `CV`).

Here is the code for the `Holdout` strategy as an example:

```
struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}

    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("`fraction_train` must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end

# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng===nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end

function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test),]
end
```
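As a further illustration of the recipe, the sketch below defines a hypothetical `ExpandingWindow` strategy, in which each test block is preceded by all earlier rows as training data. So that it runs without MLJ installed, it declares a stand-in abstract type and a local `train_test_pairs` function; with MLJ loaded you would instead subtype `MLJ.ResamplingStrategy` and extend `MLJ.train_test_pairs`, as described above. The strategy itself and its block-size rule are assumptions made for the example.

```julia
# Stand-in so the sketch runs on its own; with MLJ loaded, subtype
# `MLJ.ResamplingStrategy` and extend `MLJ.train_test_pairs` instead.
abstract type ResamplingStrategy end

# Hypothetical strategy: train on all rows that precede each test block.
struct ExpandingWindow <: ResamplingStrategy
    nfolds::Int
end

function train_test_pairs(strategy::ExpandingWindow, rows)
    rows = collect(rows)
    block = length(rows) ÷ (strategy.nfolds + 1)
    block > 0 || error("Too few rows for the requested number of folds.")
    return map(1:strategy.nfolds) do k
        train = rows[1:k*block]
        # The last test block absorbs any leftover rows.
        stop = k == strategy.nfolds ? length(rows) : (k + 1) * block
        test = rows[k*block+1:stop]
        (train, test)
    end
end

for (train, test) in train_test_pairs(ExpandingWindow(2), 1:12)
    println(train, " / ", test)
end
```

Only the `rows` signature is implemented here because the strategy does not need to inspect `X` or `y`.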

## API

`MLJBase.evaluate!` — Function

```
evaluate!(mach,
          resampling=CV(),
          measure=nothing,
          rows=nothing,
          weights=nothing,
          operation=predict,
          repeats=1,
          acceleration=default_resource(),
          force=false,
          verbosity=1,
          check_measure=true)
```

Estimate the performance of a machine `mach` wrapping a supervised model in data, using the specified `resampling` strategy (defaulting to 6-fold cross-validation) and `measure`, which can be a single measure or vector.

Do `subtypes(MLJ.ResamplingStrategy)` to obtain a list of available resampling strategies. If `resampling` is not an object of type `MLJ.ResamplingStrategy`, then a vector of pairs (of the form `(train_rows, test_rows)`) is expected. For example, setting

```
resampling = [((1:100), (101:200)),
              ((101:200), (1:100))]
```

gives two-fold cross-validation using the first 200 rows of data.

The resampling strategy is applied repeatedly (Monte Carlo resampling) if `repeats > 1`. For example, if `repeats = 10`, then `resampling = CV(nfolds=5, shuffle=true)` generates a total of 50 `(train, test)` pairs for evaluation and subsequent aggregation.
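The count of 50 pairs is simply 10 repeats times 5 folds. The sketch below reproduces that bookkeeping in plain Julia; `shuffled_cv_pairs` is a hypothetical helper, and the strided fold assignment it uses after shuffling is an assumption chosen for brevity rather than a description of how MLJ builds its folds.

```julia
using Random

# Hypothetical helper (not MLJ code): shuffled k-fold pairs for one repeat.
function shuffled_cv_pairs(rng, rows, nfolds)
    rows = shuffle(rng, collect(rows))
    # After shuffling, take every nfolds-th row as one test fold.
    tests = [rows[k:nfolds:end] for k in 1:nfolds]
    return [(setdiff(rows, test), test) for test in tests]
end

rng = MersenneTwister(123)
repeats, nfolds = 10, 5

# Each repeat reshuffles the rows and contributes `nfolds` pairs.
all_pairs = reduce(vcat, [shuffled_cv_pairs(rng, 1:100, nfolds) for _ in 1:repeats])
println(length(all_pairs))  # 10 repeats × 5 folds = 50
```

Because the same `rng` is advanced between repeats, each repeat sees a different shuffle, which is what makes repeated resampling informative.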

If `resampling isa MLJ.ResamplingStrategy` then one may optionally restrict the data used in evaluation by specifying `rows`.

An optional `weights` vector may be passed for measures that support sample weights (`MLJ.supports_weights(measure) == true`), which is ignored by those that don't.

*Important:* If `mach` already wraps sample weights `w` (as in `mach = machine(model, X, y, w)`) then these weights, which are used for *training*, are automatically passed to the measures for evaluation. However, for evaluation purposes, any `weights` specified as a keyword argument will take precedence over `w`.

User-defined measures are supported; see the manual for details.

If no measure is specified, then `default_measure(mach.model)` is used. If this default is `nothing`, an error is thrown.

The `acceleration` keyword argument is used to specify the compute resource (a subtype of `ComputationalResources.AbstractResource`) that will be used to accelerate/parallelize the resampling operation.

Although `evaluate!` is mutating, `mach.model` and `mach.args` are untouched.

**Summary of key-word arguments**

- `resampling` - resampling strategy (default is `CV(nfolds=6)`)
- `measure`/`measures` - measure or vector of measures (losses, scores, etc)
- `rows` - vector of observation indices from which both train and test folds are constructed (default is all observations)
- `weights` - per-sample weights for training and measures; see important note above
- `operation` - `predict`, `predict_mean`, `predict_mode` or `predict_median`; `predict` is the default but cannot be used with a deterministic measure if `model isa Probabilistic`
- `repeats` - default is 1; set to a higher value for repeated (Monte Carlo) resampling
- `acceleration` - parallelization option; currently supported options are instances of `CPU1` (single-threaded computation), `CPUThreads` (multi-threaded computation) and `CPUProcesses` (multi-process computation); default is `default_resource()`
- `force` - default is `false`; set to `true` to force a cold restart of each training event
- `verbosity` - verbosity level, an integer defaulting to 1
- `check_measure` - default is `true`

**Return value**

A property-accessible object of type `PerformanceEvaluation` with these properties:

- `measure`: the vector of specified measures
- `measurement`: the corresponding measurements, aggregated across the test folds using the aggregation method defined for each measure (do `aggregation(measure)` to inspect)
- `per_fold`: a vector of vectors of individual test fold evaluations (one vector per measure)
- `per_observation`: a vector of vectors of individual observation evaluations of those measures for which `reports_each_observation(measure)` is true, which is otherwise reported `missing`.
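As a quick sanity check on how per-fold values relate to the aggregated measurement, the snippet below recomputes two of the numbers displayed in the tables earlier on this page. The aggregation rules used (a plain mean for `l2`, a root mean square for `rms`) are assumptions inferred from those displayed numbers, not read from the library; `aggregation(measure)` is the authoritative way to check.

```julia
using Statistics

# Per-fold values copied from the `l2` and `rms` tables shown earlier.
l2_per_fold  = [0.0634, 0.186, 0.182]
rms_per_fold = [0.252, 0.431, 0.427]

# Assumed aggregation rules: plain mean for `l2`, root mean square for `rms`.
l2_aggregate  = mean(l2_per_fold)
rms_aggregate = sqrt(mean(rms_per_fold .^ 2))

println(round(l2_aggregate, sigdigits=3))   # 0.144, as in the l2 table
println(round(rms_aggregate, sigdigits=3))  # 0.379, as in the rms table
```

The folds in those examples all have the same size (4 rows each), so no fold weighting is needed to reproduce the reported values.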

See also `evaluate`.

`MLJModelInterface.evaluate` — Function

Some meta-models may choose to implement the `evaluate` operations.