# Evaluating Model Performance

MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.

In addition to hold-out and cross-validation, the user can specify an explicit list of train/test pairs of row indices for resampling, or define new resampling strategies.

For simultaneously evaluating *multiple* models, see Comparing models of different type and nested cross-validation.

For externally logging the outcomes of performance evaluation experiments, see Logging Workflows.

## Evaluating against a single measure

`julia> using MLJ`

`julia> X = (a=rand(12), b=rand(12), c=rand(12));`

`julia> y = X.a + 2X.b + 0.05*rand(12);`

`julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)()`

`RidgeRegressor( lambda = 1.0, bias = true)`

`julia> cv=CV(nfolds=3)`

`CV( nfolds = 3, shuffle = false, rng = Random._GLOBAL_RNG())`

`julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)`

`PerformanceEvaluation object with these fields: model, measure, operation, measurement, per_fold, per_observation, fitted_params_per_fold, report_per_fold, train_test_rows, resampling, repeats Extract: ┌──────────┬───────────┬─────────────┬─────────┬───────────────────────┐ │ measure │ operation │ measurement │ 1.96*SE │ per_fold │ ├──────────┼───────────┼─────────────┼─────────┼───────────────────────┤ │ LPLoss( │ predict │ 0.194 │ 0.0758 │ [0.236, 0.132, 0.215] │ │ p = 2) │ │ │ │ │ └──────────┴───────────┴─────────────┴─────────┴───────────────────────┘`

Alternatively, instead of applying `evaluate` to a model + data, one may call `evaluate!` on an existing machine wrapping the model in data:

`julia> mach = machine(model, X, y)`

`untrained Machine; caches model-specific representations of data model: RidgeRegressor(lambda = 1.0, …) args: 1: Source @432 ⏎ Table{AbstractVector{Continuous}} 2: Source @475 ⏎ AbstractVector{Continuous}`

`julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)`

`PerformanceEvaluation object with these fields: model, measure, operation, measurement, per_fold, per_observation, fitted_params_per_fold, report_per_fold, train_test_rows, resampling, repeats Extract: ┌──────────┬───────────┬─────────────┬─────────┬───────────────────────┐ │ measure │ operation │ measurement │ 1.96*SE │ per_fold │ ├──────────┼───────────┼─────────────┼─────────┼───────────────────────┤ │ LPLoss( │ predict │ 0.194 │ 0.0758 │ [0.236, 0.132, 0.215] │ │ p = 2) │ │ │ │ │ └──────────┴───────────┴─────────────┴─────────┴───────────────────────┘`

(The latter call is mutating, as the learned parameters stored in the machine may change.)

## Multiple measures

Multiple measures are specified as a vector:

`julia> evaluate!( mach, resampling=cv, measures=[l1, rms, rmslp1], verbosity=0, )`

`PerformanceEvaluation object with these fields: model, measure, operation, measurement, per_fold, per_observation, fitted_params_per_fold, report_per_fold, train_test_rows, resampling, repeats Extract: ┌──────────────────────────────────────┬───────────┬─────────────┬─────────┬──── │ measure │ operation │ measurement │ 1.96*SE │ p ⋯ ├──────────────────────────────────────┼───────────┼─────────────┼─────────┼──── │ LPLoss( │ predict │ 0.384 │ 0.117 │ [ ⋯ │ p = 1) │ │ │ │ ⋯ │ RootMeanSquaredError() │ predict │ 0.441 │ 0.09 │ [ ⋯ │ RootMeanSquaredLogProportionalError( │ predict │ 0.18 │ 0.0479 │ [ ⋯ │ offset = 1) │ │ │ │ ⋯ └──────────────────────────────────────┴───────────┴─────────────┴─────────┴──── 1 column omitted`

Custom measures can also be provided.

## Specifying weights

Per-observation weights can be passed to measures. If a measure does not support weights, the weights are ignored:

`julia> holdout = Holdout(fraction_train=0.8)`

`Holdout( fraction_train = 0.8, shuffle = false, rng = Random._GLOBAL_RNG())`

`julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];`

`julia> evaluate!( mach, resampling=CV(nfolds=3), measure=[l2, rsquared], weights=weights, )`

`┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported: │ RSquared() └ @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/resampling.jl:809 Evaluating over 3 folds: 67%[================> ] ETA: 0:00:00 Evaluating over 3 folds: 100%[=========================] Time: 0:00:00 PerformanceEvaluation object with these fields: model, measure, operation, measurement, per_fold, per_observation, fitted_params_per_fold, report_per_fold, train_test_rows, resampling, repeats Extract: ┌────────────┬───────────┬─────────────┬─────────┬──────────────────────────┐ │ measure │ operation │ measurement │ 1.96*SE │ per_fold │ ├────────────┼───────────┼─────────────┼─────────┼──────────────────────────┤ │ LPLoss( │ predict │ 0.299 │ 0.134 │ [0.253, 0.234, 0.41] │ │ p = 2) │ │ │ │ │ │ RSquared() │ predict │ 0.255 │ 0.48 │ [-0.00538, 0.647, 0.122] │ └────────────┴───────────┴─────────────┴─────────┴──────────────────────────┘`

In classification problems, use `class_weights=...` to specify a class weight dictionary.

`MLJBase.evaluate!` — Function

`evaluate!(mach; resampling=CV(), measure=nothing, options...)`

Estimate the performance of a machine `mach` wrapping a supervised model in data, using the specified `resampling` strategy (defaulting to 6-fold cross-validation) and `measure`, which can be a single measure or vector. Returns a `PerformanceEvaluation` object.

Available resampling strategies are `CV`, `Holdout`, `StratifiedCV` and `TimeSeriesCV`. If `resampling` is not an instance of one of these, then a vector of tuples of the form `(train_rows, test_rows)` is expected. For example, setting

```
resampling = [((1:100), (101:200)),
              ((101:200), (1:100))]
```

gives two-fold cross-validation using the first 200 rows of data.

Any measure conforming to the StatisticalMeasuresBase.jl API can be provided, assuming it can consume multiple observations.

Although `evaluate!` is mutating, `mach.model` and `mach.args` are not mutated.

**Additional keyword options**

- `rows` - vector of observation indices from which both train and test folds are constructed (default is all observations)

- `operation`/`operations=nothing` - One of `predict`, `predict_mean`, `predict_mode`, `predict_median`, or `predict_joint`, or a vector of these of the same length as `measure`/`measures`. Automatically inferred if left unspecified. For example, `predict_mode` will be used for a `Multiclass` target, if `model` is a probabilistic predictor, but `measure` expects literal (point) target predictions. Operations actually applied can be inspected from the `operation` field of the object returned.

- `weights` - per-sample `Real` weights for measures that support them (not to be confused with weights used in training, such as the `w` in `mach = machine(model, X, y, w)`).

- `class_weights` - dictionary of `Real` per-class weights for use with measures that support these, in classification problems (not to be confused with weights used in training, such as the `w` in `mach = machine(model, X, y, w)`).

- `repeats::Int=1`: set to a higher value for repeated (Monte Carlo) resampling. For example, if `repeats = 10`, then `resampling = CV(nfolds=5, shuffle=true)` generates a total of 50 `(train, test)` pairs for evaluation and subsequent aggregation.

- `acceleration=CPU1()`: acceleration/parallelization option; can be any instance of `CPU1` (single-threaded computation), `CPUThreads` (multi-threaded computation) or `CPUProcesses` (multi-process computation); default is `default_resource()`. These types are owned by ComputationalResources.jl.

- `force=false`: set to `true` to force cold-restart of each training event

- `verbosity::Int=1`: logging level; can be negative

- `check_measure=true`: whether to screen measures for possible incompatibility with the model. Will not catch all incompatibilities.

- `per_observation=true`: whether to calculate estimates for individual observations; if `false` the `per_observation` field of the returned object is populated with `missing`s. Setting to `false` may reduce compute time and allocations.

- `logger` - a logger object (see `MLJBase.log_evaluation`)
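As a rough illustration of some of these keyword options, the sketch below combines `repeats` with a shuffled `CV` strategy and switches off per-observation measurements. It assumes MLJ is installed; `ConstantRegressor` (a built-in baseline model) is an assumption made for self-containment and stands in for whatever supervised model is actually being evaluated.

```julia
using MLJ

# Synthetic regression data, in the style of the examples earlier in this section.
X = (a=rand(12), b=rand(12), c=rand(12))
y = X.a + 2X.b + 0.05*rand(12)

# ASSUMPTION: `ConstantRegressor` is only a convenient stand-in model.
model = ConstantRegressor()

e = evaluate(
    model, X, y;
    resampling=CV(nfolds=3, shuffle=true, rng=123),
    measure=[l1, l2],
    repeats=2,              # 2 repeats x 3 folds = 6 train/test pairs
    per_observation=false,  # skip per-observation measurements
    verbosity=0,
)

n_pairs = length(e.train_test_rows)   # number of train/test pairs generated
operations = e.operation              # operations inferred for each measure
```

Because `shuffle=true`, each repeat is expected to draw a different partition of the rows, which is what makes repeated resampling informative.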

See also `evaluate`, `PerformanceEvaluation`

`MLJModelInterface.evaluate` — Function

Some meta-models may choose to implement the `evaluate` operations.

`MLJBase.PerformanceEvaluation` — Type

`PerformanceEvaluation`

Type of object returned by `evaluate` (for models plus data) or `evaluate!` (for machines). Such objects encode estimates of the performance (generalization error) of a supervised model or outlier detection model.

When `evaluate`/`evaluate!` is called, a number of train/test pairs ("folds") of row indices are generated, according to the options provided, which are discussed in the `evaluate!` doc-string. Rows correspond to observations. The generated train/test pairs are recorded in the `train_test_rows` field of the `PerformanceEvaluation` struct, and the corresponding estimates, aggregated over all train/test pairs, are recorded in `measurement`, a vector with one entry for each measure (metric) recorded in `measure`.

When displayed, a `PerformanceEvaluation` object includes a value under the heading `1.96*SE`, derived from the standard error of the `per_fold` entries. This value is suitable for constructing a formal 95% confidence interval for the given `measurement`. Such intervals should be interpreted with caution. See, for example, Bates et al. (2021).
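To make the `1.96*SE` column more concrete, the sketch below recomputes it from the rounded `per_fold` values displayed earlier for `l2` under 3-fold cross-validation. The formula `1.96 * std(per_fold) / sqrt(n - 1)` is an assumption inferred from the numbers displayed in this section, not a statement of the library's implementation, and the helper name `half_width` is hypothetical.

```julia
using Statistics

# Rounded per-fold l2 losses, copied from the 3-fold CV output shown earlier.
per_fold = [0.236, 0.132, 0.215]

# Hypothetical helper: half-width of a nominal 95% interval.
# ASSUMPTION: the divisor sqrt(n - 1) is inferred from the displayed values.
function half_width(per_fold; factor=1.96)
    n = length(per_fold)
    return factor * std(per_fold) / sqrt(n - 1)
end

measurement = mean(per_fold)            # aggregate over folds
delta = half_width(per_fold)            # compare with the `1.96*SE` column
interval = (measurement - delta, measurement + delta)
```

With these rounded inputs, `delta` comes out close to the displayed `0.0758`; a small discrepancy is expected because the inputs are rounded to three digits.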

**Fields**

These fields are part of the public API of the `PerformanceEvaluation` struct.

- `model`: model used to create the performance evaluation. In the case of a tuning model, this is the best model found.

- `measure`: vector of measures (metrics) used to evaluate performance

- `measurement`: vector of measurements - one for each element of `measure` - aggregating the performance measurements over all train/test pairs (folds). The aggregation method applied for a given measure `m` is `StatisticalMeasuresBase.external_aggregation_mode(m)` (commonly `Mean()` or `Sum()`)

- `operation` (e.g., `predict_mode`): the operations applied for each measure to generate predictions to be evaluated. Possibilities are: `predict`, `predict_mean`, `predict_mode`, `predict_median`, or `predict_joint`.

- `per_fold`: a vector of vectors of individual test fold evaluations (one vector per measure). Useful for obtaining a rough estimate of the variance of the performance estimate.

- `per_observation`: a vector of vectors of vectors containing individual per-observation measurements: for an evaluation `e`, `e.per_observation[m][f][i]` is the measurement for the `i`th observation in the `f`th test fold, evaluated using the `m`th measure. Useful for some forms of hyper-parameter optimization. Note that an aggregated measurement for some measure `measure` is repeated across all observations in a fold if `StatisticalMeasures.can_report_unaggregated(measure) == true`. If `e` has been computed with the `per_observation=false` option, then `e.per_observation` is a vector of `missing`s.

- `fitted_params_per_fold`: a vector containing `fitted_params(mach)` for each machine `mach` trained during resampling - one machine per train/test pair. Use this to extract the learned parameters for each individual training event.

- `report_per_fold`: a vector containing `report(mach)` for each machine `mach` trained during resampling - one machine per train/test pair.

- `train_test_rows`: a vector of tuples, each of the form `(train, test)`, where `train` and `test` are vectors of row (observation) indices for training and evaluation respectively.

- `resampling`: the resampling strategy used to generate the train/test pairs.

- `repeats`: the number of times the resampling strategy was repeated.
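The indexing conventions for these fields can be sketched as follows. The example assumes MLJ is installed and uses `ConstantRegressor`, a built-in baseline model chosen here only for self-containment; the variable names are illustrative.

```julia
using MLJ

X = (a=rand(12), b=rand(12), c=rand(12))
y = X.a + 2X.b + 0.05*rand(12)

# ASSUMPTION: `ConstantRegressor` is used only as a convenient built-in model.
mach = machine(ConstantRegressor(), X, y)
e = evaluate!(mach, resampling=CV(nfolds=3), measure=[l1, l2], verbosity=0)

m, f, i = 2, 1, 3                       # measure, fold and observation indices
aggregate = e.measurement[m]            # l2, aggregated over all folds
fold_values = e.per_fold[m]             # one l2 value per test fold
single = e.per_observation[m][f][i]     # l2 for the 3rd observation of fold 1
train, test = e.train_test_rows[f]      # row indices used for fold 1
```

Here the first index always selects the measure, so `e.per_fold[2]` and `e.per_observation[2]` both refer to `l2`, the second entry of the `measure` vector.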

## User-specified train/test sets

Users can provide an explicit list of train/test pairs of row indices for resampling, as in this example:

`julia> fold1 = 1:6; fold2 = 7:12;`

`julia> evaluate!( mach, resampling = [(fold1, fold2), (fold2, fold1)], measures=[l1, l2], verbosity=0, )`

`PerformanceEvaluation object with these fields: model, measure, operation, measurement, per_fold, per_observation, fitted_params_per_fold, report_per_fold, train_test_rows, resampling, repeats Extract: ┌──────────┬───────────┬─────────────┬──────────┬────────────────┐ │ measure │ operation │ measurement │ 1.96*SE │ per_fold │ ├──────────┼───────────┼─────────────┼──────────┼────────────────┤ │ LPLoss( │ predict │ 0.681 │ 0.000723 │ [0.681, 0.682] │ │ p = 1) │ │ │ │ │ │ LPLoss( │ predict │ 0.593 │ 0.0453 │ [0.577, 0.61] │ │ p = 2) │ │ │ │ │ └──────────┴───────────┴─────────────┴──────────┴────────────────┘`

Alternatively, users can define their own reusable `ResamplingStrategy` objects; see Custom resampling strategies below.

## Built-in resampling strategies

`MLJBase.Holdout` — Type

```
holdout = Holdout(; fraction_train=0.7,
                    shuffle=nothing,
                    rng=nothing)
```

Holdout resampling strategy, for use in `evaluate!`, `evaluate` and in tuning.

`train_test_pairs(holdout, rows)`

Returns the pair `[(train, test)]`, where `train` and `test` are vectors such that `rows=vcat(train, test)` and `length(train)/length(rows)` is approximately equal to `fraction_train`.

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `Holdout` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is explicitly specified.

`MLJBase.CV` — Type

`cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)`

Cross-validation resampling strategy, for use in `evaluate!`, `evaluate` and tuning.

`train_test_pairs(cv, rows)`

Returns an `nfolds`-length iterator of `(train, test)` pairs of vectors (row indices), where each `train` and `test` is a sub-vector of `rows`. The `test` vectors are mutually exclusive and exhaust `rows`. Each `train` vector is the complement of the corresponding `test` vector. With no row pre-shuffling, the order of `rows` is preserved, in the sense that `rows` coincides precisely with the concatenation of the `test` vectors, in the order they are generated. The first `r` test vectors have length `n + 1`, where `n, r = divrem(length(rows), nfolds)`, and the remaining test vectors have length `n`.
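The fold-size rule just described can be sketched in plain Julia, without MLJ. The helper `cv_pairs` is hypothetical, assumes no pre-shuffling, and is not MLJBase's own code; it only mirrors the description above.

```julia
# Hypothetical helper mirroring the fold-size rule described above.
# ASSUMPTION: no pre-shuffling; this is not MLJBase's own code.
function cv_pairs(rows, nfolds)
    rows = collect(rows)
    n, r = divrem(length(rows), nfolds)
    pairs = Tuple{Vector{Int},Vector{Int}}[]
    first_index = 1
    for k in 1:nfolds
        len = k <= r ? n + 1 : n           # first r test folds get one extra row
        test = rows[first_index:first_index + len - 1]
        train = setdiff(rows, test)        # complement of the test fold
        push!(pairs, (train, test))
        first_index += len
    end
    return pairs
end

pairs = cv_pairs(1:11, 3)                  # here n = 3 and r = 2
test_lengths = [length(test) for (_, test) in pairs]
```

For 11 rows and 3 folds, the first two test folds have 4 rows and the last has 3, and concatenating the test folds in order recovers `1:11`.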

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `CV` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is explicitly specified.

`MLJBase.StratifiedCV` — Type

```
stratified_cv = StratifiedCV(; nfolds=6,
                               shuffle=false,
                               rng=Random.GLOBAL_RNG)
```

Stratified cross-validation resampling strategy, for use in `evaluate!`, `evaluate` and in tuning. Applies only to classification problems (`OrderedFactor` or `Multiclass` targets).

`train_test_pairs(stratified_cv, rows, y)`

Returns an `nfolds`-length iterator of `(train, test)` pairs of vectors (row indices) where each `train` and `test` is a sub-vector of `rows`. The `test` vectors are mutually exclusive and exhaust `rows`. Each `train` vector is the complement of the corresponding `test` vector.

Unlike regular cross-validation, the distribution of the levels of the target `y` corresponding to each `train` and `test` is constrained, as far as possible, to replicate that of `y[rows]` as a whole.

The stratified `train_test_pairs` algorithm is invariant to label renaming. For example, if you run `replace!(y, 'a' => 'b', 'b' => 'a')` and then re-run `train_test_pairs`, the returned `(train, test)` pairs will be the same.

Pre-shuffling of `rows` is controlled by `rng` and `shuffle`. If `rng` is an integer, then the `StratifiedCV` keyword constructor resets it to `MersenneTwister(rng)`. Otherwise some `AbstractRNG` object is expected.

If `rng` is left unspecified, `rng` is reset to `Random.GLOBAL_RNG`, in which case rows are only pre-shuffled if `shuffle=true` is explicitly specified.

`MLJBase.TimeSeriesCV` — Type

`tscv = TimeSeriesCV(; nfolds=4)`

Cross-validation resampling strategy, for use in `evaluate!`, `evaluate` and tuning, when observations are chronological and not expected to be independent.

`train_test_pairs(tscv, rows)`

Returns an `nfolds`-length iterator of `(train, test)` pairs of vectors (row indices), where each `train` and `test` is a sub-vector of `rows`. The rows are partitioned sequentially into `nfolds + 1` approximately equal length partitions, where the first partition is the first train set, and the second partition is the first test set. The second train set consists of the first two partitions, and the second test set consists of the third partition, and so on for each fold.

The first partition (which is the first train set) has length `n + r`, where `n, r = divrem(length(rows), nfolds + 1)`, and the remaining partitions (all of the test folds) have length `n`.

**Examples**

```
julia> MLJBase.train_test_pairs(TimeSeriesCV(nfolds=3), 1:10)
3-element Vector{Tuple{UnitRange{Int64}, UnitRange{Int64}}}:
 (1:4, 5:6)
 (1:6, 7:8)
 (1:8, 9:10)

julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)();
julia> data = @load_sunspots;
julia> X = (lag1 = data.sunspot_number[2:end-1],
            lag2 = data.sunspot_number[1:end-2]);
julia> y = data.sunspot_number[3:end];
julia> tscv = TimeSeriesCV(nfolds=3);
julia> evaluate(model, X, y, resampling=tscv, measure=rmse, verbosity=0)
┌───────────────────────────┬───────────────┬────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────────────────┼───────────────┼────────────────────┤
│ RootMeanSquaredError @753 │ 21.7 │ [25.4, 16.3, 22.4] │
└───────────────────────────┴───────────────┴────────────────────┘
_.per_observation = [missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
_.train_test_rows = [ … ]
```
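The sequential partitioning rule for `TimeSeriesCV` can also be sketched in plain Julia. The helper `tscv_pairs` is hypothetical and assumes `rows` is a contiguous range; it is not MLJBase's own code, but for `nfolds=3` and rows `1:10` it yields the same pairs as the first example above.

```julia
# Hypothetical helper mirroring the partitioning rule described above.
# ASSUMPTION: `rows` is a contiguous range; this is not MLJBase's own code.
function tscv_pairs(rows::UnitRange{Int}, nfolds)
    n, r = divrem(length(rows), nfolds + 1)
    first_len = n + r                       # first partition absorbs the remainder
    pairs = Tuple{UnitRange{Int},UnitRange{Int}}[]
    for k in 1:nfolds
        train_stop = first(rows) + first_len + (k - 1) * n - 1
        train = first(rows):train_stop      # all partitions up to the k-th
        test = (train_stop + 1):(train_stop + n)
        push!(pairs, (train, test))
    end
    return pairs
end

pairs = tscv_pairs(1:10, 3)                 # here n = 2 and r = 2
```

Each training set is a prefix of the rows and each test set immediately follows it, so no test observation precedes the data used to train on it.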

## Custom resampling strategies

To define a new resampling strategy, make relevant parameters of your strategy the fields of a new type `MyResamplingStrategy <: MLJ.ResamplingStrategy`, and implement one of the following methods:

```
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)
```

Each method takes a vector of indices `rows` and returns a vector `[(t1, e1), (t2, e2), ... (tk, ek)]` of train/test pairs of row indices selected from `rows`. Here `X`, `y` are the input and target data (ignored in simple strategies, such as `Holdout` and `CV`).

Here is the code for the `Holdout` strategy as an example:

```
struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}

    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("`fraction_train` must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end

# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng===nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end

function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test),]
end
```
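For comparison with the `Holdout` code above, here is a smaller user-defined strategy together with a usage sketch. The type `LastRowsHoldout` is hypothetical, and so are two further assumptions: that the method to extend is reachable as `MLJ.MLJBase.train_test_pairs`, and that `ConstantRegressor` is available as a built-in model. MLJ must be installed.

```julia
using MLJ

# Hypothetical strategy: a single split that tests on the last `ntest` rows.
struct LastRowsHoldout <: MLJ.ResamplingStrategy
    ntest::Int
end

# ASSUMPTION: the method is extended via the qualified name `MLJ.MLJBase.train_test_pairs`.
function MLJ.MLJBase.train_test_pairs(strategy::LastRowsHoldout, rows)
    0 < strategy.ntest < length(rows) ||
        error("`ntest` must be positive and smaller than the number of rows.")
    split = length(rows) - strategy.ntest
    return [(rows[1:split], rows[split + 1:end])]
end

# Direct call: one (train, test) pair over ten rows.
pairs = MLJ.MLJBase.train_test_pairs(LastRowsHoldout(3), 1:10)

# Usage in an evaluation; `ConstantRegressor` is only a stand-in model.
X = (a=rand(12), b=rand(12), c=rand(12))
y = X.a + 2X.b + 0.05*rand(12)
e = evaluate(ConstantRegressor(), X, y;
             resampling=LastRowsHoldout(3), measure=l2, verbosity=0)
```

Only the `rows` method is implemented here because the strategy ignores `X` and `y`, as the simple built-in strategies do.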