Evaluating Model Performance
MLJ allows quick evaluation of a supervised model's performance against a battery of selected losses or scores. For more on available performance measures, see Performance Measures.
In addition to hold-out and cross-validation, the user can specify an explicit list of train/test pairs of row indices for resampling, or define new resampling strategies.
For simultaneously evaluating multiple models and/or data sets, see Benchmarking.
Evaluating against a single measure
julia> using MLJ
julia> X = (a=rand(12), b=rand(12), c=rand(12));
julia> y = X.a + 2X.b + 0.05*rand(12);
julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)()
RidgeRegressor(
lambda = 1.0,
bias = true)
julia> cv=CV(nfolds=3)
CV(
nfolds = 3,
shuffle = false,
rng = Random._GLOBAL_RNG())
julia> evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌──────────┬───────────┬─────────────┬─────────┬────────────────────────┐
│ measure │ operation │ measurement │ 1.96*SE │ per_fold │
├──────────┼───────────┼─────────────┼─────────┼────────────────────────┤
│ LPLoss( │ predict │ 0.131 │ 0.0723 │ [0.0707, 0.165, 0.156] │
│ p = 2) │ │ │ │ │
└──────────┴───────────┴─────────────┴─────────┴────────────────────────┘
Alternatively, instead of applying evaluate to a model + data, one may call evaluate! on an existing machine wrapping the model in data:
julia> mach = machine(model, X, y)
untrained Machine; caches model-specific representations of data
model: RidgeRegressor(lambda = 1.0, …)
args:
1: Source @983 ⏎ Table{AbstractVector{Continuous}}
2: Source @073 ⏎ AbstractVector{Continuous}
julia> evaluate!(mach, resampling=cv, measure=l2, verbosity=0)
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌──────────┬───────────┬─────────────┬─────────┬────────────────────────┐
│ measure │ operation │ measurement │ 1.96*SE │ per_fold │
├──────────┼───────────┼─────────────┼─────────┼────────────────────────┤
│ LPLoss( │ predict │ 0.131 │ 0.0723 │ [0.0707, 0.165, 0.156] │
│ p = 2) │ │ │ │ │
└──────────┴───────────┴─────────────┴─────────┴────────────────────────┘
(The latter call is mutating, as the learned parameters stored in the machine may change.)
Multiple measures
julia> evaluate!(mach,
resampling=cv,
measure=[l1, rms, rmslp1], verbosity=0)
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌──────────────────────────────────────┬───────────┬─────────────┬─────────┬────
│ measure │ operation │ measurement │ 1.96*SE │ p ⋯
├──────────────────────────────────────┼───────────┼─────────────┼─────────┼────
│ LPLoss( │ predict │ 0.315 │ 0.0835 │ [ ⋯
│ p = 1) │ │ │ │ ⋯
│ RootMeanSquaredError() │ predict │ 0.362 │ N/A │ [ ⋯
│ RootMeanSquaredLogProportionalError( │ predict │ 0.137 │ N/A │ [ ⋯
│ offset = 1.0) │ │ │ │ ⋯
└──────────────────────────────────────┴───────────┴─────────────┴─────────┴────
1 column omitted
Custom measures and weighted measures
julia> my_loss(yhat, y) = maximum((yhat - y).^2);
julia> my_per_observation_loss(yhat, y) = abs.(yhat - y);
julia> MLJ.reports_each_observation(::typeof(my_per_observation_loss)) = true;
julia> my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y));
julia> my_weighted_score(yhat, y, w) = 1/mean(abs.((yhat - y).^w));
julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;
julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;
julia> holdout = Holdout(fraction_train=0.8)
Holdout(
fraction_train = 0.8,
shuffle = false,
rng = Random._GLOBAL_RNG())
julia> weights = [1, 1, 2, 1, 1, 2, 3, 1, 1, 2, 3, 1];
julia> evaluate!(mach,
resampling=CV(nfolds=3),
measure=[my_loss, my_per_observation_loss, my_weighted_score, l1],
weights=weights, verbosity=0)
┌ Warning: Sample weights ignored in evaluations of the following measures, as unsupported:
│ my_loss, my_per_observation_loss
└ @ MLJBase ~/.julia/packages/MLJBase/97P9U/src/resampling.jl:777
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌──────────────────────────────────────────────────────────┬───────────┬────────
│ measure │ operation │ measu ⋯
├──────────────────────────────────────────────────────────┼───────────┼────────
│ my_loss (generic function with 1 method) │ predict │ 0.312 ⋯
│ my_per_observation_loss (generic function with 1 method) │ predict │ 0.315 ⋯
│ my_weighted_score (generic function with 2 methods) │ predict │ 4.87 ⋯
│ LPLoss( │ predict │ 0.543 ⋯
│ p = 1) │ │ ⋯
└──────────────────────────────────────────────────────────┴───────────┴────────
3 columns omitted
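Because the custom measures above are ordinary Julia functions of (yhat, y), they can be sanity-checked on small vectors before being passed to evaluate!. The following is a minimal sketch that repeats three of the definitions from the example so that it runs without MLJ; the toy vectors yhat and y are made up for illustration.

```julia
using Statistics  # provides mean

# Same definitions as in the example above, repeated so the block runs on its own.
my_loss(yhat, y) = maximum((yhat - y).^2)
my_per_observation_loss(yhat, y) = abs.(yhat - y)
my_weighted_score(yhat, y) = 1/mean(abs.(yhat - y))

# Toy predictions and ground truth (illustrative values only).
yhat = [1.0, 2.0, 4.0]
y    = [1.0, 2.5, 3.0]

worst_squared_error = my_loss(yhat, y)                  # a single number
absolute_errors     = my_per_observation_loss(yhat, y)  # one value per observation
score               = my_weighted_score(yhat, y)        # larger is better
```

The second function returns a vector, which is why it is the one declared with reports_each_observation in the example.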
User-specified train/test sets
Users can either provide an explicit list of train/test pairs of row indices for resampling, as in this example:
julia> fold1 = 1:6; fold2 = 7:12;
julia> evaluate!(mach,
resampling = [(fold1, fold2), (fold2, fold1)],
measure=[l1, l2], verbosity=0)
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌──────────┬───────────┬─────────────┬─────────┬─────────────────┐
│ measure │ operation │ measurement │ 1.96*SE │ per_fold │
├──────────┼───────────┼─────────────┼─────────┼─────────────────┤
│ LPLoss( │ predict │ 0.324 │ 0.105 │ [0.363, 0.286] │
│ p = 1) │ │ │ │ │
│ LPLoss( │ predict │ 0.133 │ 0.106 │ [0.172, 0.0952] │
│ p = 2) │ │ │ │ │
└──────────┴───────────┴─────────────┴─────────┴─────────────────┘
Or define their own reusable ResamplingStrategy objects; see Custom resampling strategies below.
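An explicit list of train/test pairs need not be typed out by hand. As a rough sketch, the helper below (leave_one_out_pairs, a hypothetical name that is not part of MLJ) builds leave-one-out pairs in plain Julia; the resulting vector has the same shape as the list passed to resampling in the example above.

```julia
# Hypothetical helper (not part of MLJ): one (train, test) pair per row,
# where the test set is a single row and the train set is everything else.
function leave_one_out_pairs(rows)
    return [(setdiff(rows, [i]), [i]) for i in rows]
end

loo = leave_one_out_pairs(1:4)
# Each element is a tuple (train, test) of index vectors,
# for example loo[1] == ([2, 3, 4], [1]).
```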
Built-in resampling strategies
MLJBase.Holdout — Type
holdout = Holdout(; fraction_train=0.7,
                    shuffle=nothing,
                    rng=nothing)
Holdout resampling strategy, for use in evaluate!, evaluate and in tuning.
train_test_pairs(holdout, rows)
Returns the pair [(train, test)], where train and test are vectors such that rows=vcat(train, test) and length(train)/length(rows) is approximately equal to fraction_train.
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the Holdout keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is specified.
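The split described above can be sketched in a few lines of plain Julia. This is an illustration only, not MLJ's code: holdout_split is a hypothetical helper, and the rounding rule used to turn fraction_train into a row count is an assumption.

```julia
using Random

# Hypothetical stand-in for the holdout split described above.
# Assumption: the number of training rows is round(fraction_train * nrows).
function holdout_split(rows; fraction_train=0.7, shuffle=false,
                       rng=Random.default_rng())
    ordered = shuffle ? Random.shuffle(rng, collect(rows)) : collect(rows)
    n_train = round(Int, fraction_train * length(ordered))
    return ordered[1:n_train], ordered[n_train+1:end]
end

# Without shuffling, the order of rows is preserved.
train, test = holdout_split(1:10)

# With shuffling, a seeded generator makes the split reproducible.
strain, stest = holdout_split(1:10; shuffle=true, rng=MersenneTwister(123))
```

In the unshuffled case vcat(train, test) reproduces the original rows, matching the property stated in the docstring.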
MLJBase.CV — Type
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning.
train_test_pairs(cv, rows)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector. With no row pre-shuffling, the order of rows is preserved, in the sense that rows coincides precisely with the concatenation of the test vectors, in the order they are generated. The first r test vectors have length n + 1, where n, r = divrem(length(rows), nfolds), and the remaining test vectors have length n.
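The fold-length rule above is easy to check by hand. The following sketch builds unshuffled cross-validation pairs from that rule; cv_pairs is a hypothetical helper written for illustration (it is not MLJ's implementation) and assumes rows holds distinct integer indices.

```julia
# Hypothetical helper (not MLJ's implementation): unshuffled CV pairs
# following the rule "the first r test folds have length n + 1, the rest n".
function cv_pairs(rows, nfolds)
    all_rows = collect(rows)
    n, r = divrem(length(all_rows), nfolds)
    result = Tuple{Vector{Int},Vector{Int}}[]
    first_index = 1
    for k in 1:nfolds
        len = k <= r ? n + 1 : n                  # first r folds get one extra row
        test = all_rows[first_index:first_index + len - 1]
        train = setdiff(all_rows, test)           # complement of the test fold
        push!(result, (train, test))
        first_index += len
    end
    return result
end

folds = cv_pairs(1:10, 3)
test_folds = [te for (_, te) in folds]
# Here n, r = divrem(10, 3) = (3, 1), so the test folds have lengths 4, 3, 3.
```

Concatenating the test folds in order gives back the original rows, as the docstring states for the unshuffled case.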
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the CV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
MLJBase.StratifiedCV — Type
stratified_cv = StratifiedCV(; nfolds=6,
                               shuffle=false,
                               rng=Random.GLOBAL_RNG)
Stratified cross-validation resampling strategy, for use in evaluate!, evaluate and in tuning. Applies only to classification problems (OrderedFactor or Multiclass targets).
train_test_pairs(stratified_cv, rows, y)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices) where each train and test is a sub-vector of rows. The test vectors are mutually exclusive and exhaust rows. Each train vector is the complement of the corresponding test vector.
Unlike regular cross-validation, the distribution of the levels of the target y corresponding to each train and test is constrained, as far as possible, to replicate that of y[rows] as a whole.
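To make the stratification constraint concrete, here is a deliberately simple sketch that deals the rows of each class to the folds in round-robin fashion. This is a swapped-in illustration, not the algorithm MLJ uses; stratified_test_folds is a hypothetical helper and the label vector y is made up.

```julia
# Hypothetical helper: NOT MLJ's algorithm, only a round-robin illustration
# of keeping class proportions similar across test folds.
function stratified_test_folds(y, nfolds)
    folds = [Int[] for _ in 1:nfolds]
    next = 1                                  # fold receiving the next row
    for level in unique(y)                    # process one class at a time
        for i in findall(==(level), y)        # rows carrying this label
            push!(folds[next], i)
            next = mod1(next + 1, nfolds)     # cycle through the folds
        end
    end
    return sort!.(folds)
end

# Toy target with eight 'a' labels and four 'b' labels.
y = ['a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'a']
folds = stratified_test_folds(y, 2)
b_counts = [count(==('b'), y[f]) for f in folds]
```

Each of the two test folds ends up with the same 2:1 ratio of 'a' to 'b' as the full target, which is the property the docstring describes.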
The stratified train_test_pairs algorithm is invariant to label renaming. For example, if you run replace!(y, 'a' => 'b', 'b' => 'a') and then re-run train_test_pairs, the returned (train, test) pairs will be the same.
Pre-shuffling of rows is controlled by rng and shuffle. If rng is an integer, then the StratifiedCV keyword constructor resets it to MersenneTwister(rng). Otherwise some AbstractRNG object is expected.
If rng is left unspecified, rng is reset to Random.GLOBAL_RNG, in which case rows are only pre-shuffled if shuffle=true is explicitly specified.
MLJBase.TimeSeriesCV — Type
tscv = TimeSeriesCV(; nfolds=4)
Cross-validation resampling strategy, for use in evaluate!, evaluate and tuning, when observations are chronological and not expected to be independent.
train_test_pairs(tscv, rows)
Returns an nfolds-length iterator of (train, test) pairs of vectors (row indices), where each train and test is a sub-vector of rows. The rows are partitioned sequentially into nfolds + 1 approximately equal length partitions, where the first partition is the first train set, and the second partition is the first test set. The second train set consists of the first two partitions, and the second test set consists of the third partition, and so on for each fold.
The first partition (which is the first train set) has length n + r, where n, r = divrem(length(rows), nfolds + 1), and the remaining partitions (all of the test folds) have length n.
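The partitioning rule just described can be written out directly. The sketch below is an illustration under that rule, not MLJ's source; time_series_pairs is a hypothetical helper name.

```julia
# Hypothetical helper (not MLJ's implementation): expanding-window pairs.
# The first partition has length n + r; every test fold has length n.
function time_series_pairs(rows, nfolds)
    n, r = divrem(length(rows), nfolds + 1)
    return map(1:nfolds) do k
        train_end = n + r + (k - 1) * n       # the train window grows by n each fold
        (rows[1:train_end], rows[train_end+1:train_end+n])
    end
end

ts_pairs = time_series_pairs(1:10, 3)
# With n, r = divrem(10, 4) = (2, 2) this gives
# (1:4, 5:6), (1:6, 7:8), (1:8, 9:10)
```

For rows 1:10 and nfolds=3 these are the same pairs that train_test_pairs returns in the docstring's own example.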
Examples
julia> MLJBase.train_test_pairs(TimeSeriesCV(nfolds=3), 1:10)
3-element Vector{Tuple{UnitRange{Int64}, UnitRange{Int64}}}:
(1:4, 5:6)
(1:6, 7:8)
(1:8, 9:10)
julia> model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)();
julia> data = @load_sunspots;
julia> X = (lag1 = data.sunspot_number[2:end-1],
lag2 = data.sunspot_number[1:end-2]);
julia> y = data.sunspot_number[3:end];
julia> tscv = TimeSeriesCV(nfolds=3);
julia> evaluate(model, X, y, resampling=tscv, measure=rmse, verbosity=0)
┌───────────────────────────┬───────────────┬────────────────────┐
│ _.measure │ _.measurement │ _.per_fold │
├───────────────────────────┼───────────────┼────────────────────┤
│ RootMeanSquaredError @753 │ 21.7 │ [25.4, 16.3, 22.4] │
└───────────────────────────┴───────────────┴────────────────────┘
_.per_observation = [missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
_.train_test_rows = [ … ]
Custom resampling strategies
To define a new resampling strategy, make relevant parameters of your strategy the fields of a new type MyResamplingStrategy <: MLJ.ResamplingStrategy, and implement one of the following methods:
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, y)
MLJ.train_test_pairs(my_strategy::MyResamplingStrategy, rows, X, y)
Each method takes a vector of indices rows and returns a vector [(t1, e1), (t2, e2), ... (tk, ek)] of train/test pairs of row indices selected from rows. Here X, y are the input and target data (ignored in simple strategies, such as Holdout and CV).
Here is the code for the Holdout strategy as an example:
struct Holdout <: ResamplingStrategy
    fraction_train::Float64
    shuffle::Bool
    rng::Union{Int,AbstractRNG}

    function Holdout(fraction_train, shuffle, rng)
        0 < fraction_train < 1 ||
            error("`fraction_train` must be between 0 and 1.")
        return new(fraction_train, shuffle, rng)
    end
end

# Keyword Constructor
function Holdout(; fraction_train::Float64=0.7, shuffle=nothing, rng=nothing)
    if rng isa Integer
        rng = MersenneTwister(rng)
    end
    if shuffle === nothing
        shuffle = ifelse(rng===nothing, false, true)
    end
    if rng === nothing
        rng = Random.GLOBAL_RNG
    end
    return Holdout(fraction_train, shuffle, rng)
end

function train_test_pairs(holdout::Holdout, rows)
    train, test = partition(rows, holdout.fraction_train,
                            shuffle=holdout.shuffle, rng=holdout.rng)
    return [(train, test),]
end
API
MLJBase.evaluate! — Function
evaluate!(mach,
          resampling=CV(),
          measure=nothing,
          rows=nothing,
          weights=nothing,
          class_weights=nothing,
          operation=nothing,
          repeats=1,
          acceleration=default_resource(),
          force=false,
          verbosity=1,
          check_measure=true)
Estimate the performance of a machine mach wrapping a supervised model in data, using the specified resampling strategy (defaulting to 6-fold cross-validation) and measure, which can be a single measure or vector.
Do subtypes(MLJ.ResamplingStrategy) to obtain a list of available resampling strategies. If resampling is not an object of type MLJ.ResamplingStrategy, then a vector of tuples of the form (train_rows, test_rows) is expected. For example, setting
resampling = [((1:100), (101:200)),
              ((101:200), (1:100))]
gives two-fold cross-validation using the first 200 rows of data.
The type of operation (predict, predict_mode, etc) to be associated with measure is automatically inferred from measure traits where possible. For example, predict_mode will be used for a Multiclass target, if model is probabilistic but measure is deterministic. The operations applied can be inspected from the operation field of the object returned. Alternatively, operations can be explicitly specified using the operation keyword (operation=...). If measure is a vector, then operation must be a single operation, which will be associated with all measures, or a vector of the same length as measure.
The resampling strategy is applied repeatedly (Monte Carlo resampling) if repeats > 1. For example, if repeats = 10, then resampling = CV(nfolds=5, shuffle=true) generates a total of 50 (train, test) pairs for evaluation and subsequent aggregation.
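The count of 50 pairs in that example is simply repeats times nfolds. A small sketch in plain Julia makes the bookkeeping explicit; shuffled_cv_pairs is a hypothetical helper that only mimics a shuffled cross-validation, and nothing here reflects how MLJ itself manages the random number generator between repeats.

```julia
using Random

# Hypothetical helper: shuffled k-fold (train, test) pairs over rows 1:nrows.
function shuffled_cv_pairs(rng, nrows, nfolds)
    perm = randperm(rng, nrows)                       # fresh shuffle on each call
    n, r = divrem(nrows, nfolds)
    stops = cumsum([k <= r ? n + 1 : n for k in 1:nfolds])
    starts = [1; stops[1:end-1] .+ 1]
    return [(setdiff(perm, perm[a:b]), perm[a:b]) for (a, b) in zip(starts, stops)]
end

rng = MersenneTwister(0)
repeats, nfolds = 10, 5

# Apply the strategy `repeats` times and pool all resulting pairs.
all_pairs = reduce(vcat, [shuffled_cv_pairs(rng, 20, nfolds) for _ in 1:repeats])
n_pairs = length(all_pairs)   # repeats * nfolds
```

Shuffling matters here: without it every repeat would regenerate identical folds and the repetition would add no information.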
If resampling isa MLJ.ResamplingStrategy then one may optionally restrict the data used in evaluation by specifying rows.
An optional weights vector may be passed for measures that support sample weights (MLJ.supports_weights(measure) == true), which is ignored by those that don't. These weights are not to be confused with any weights w bound to mach (as in mach = machine(model, X, y, w)). To pass these to the performance evaluation measures you must explicitly specify weights=w in the evaluate! call.
Additionally, an optional class_weights dictionary may be passed for measures that support class weights (MLJ.supports_class_weights(measure) == true), which is ignored by those that don't. These weights are not to be confused with any weights class_w bound to mach (as in mach = machine(model, X, y, class_w)). To pass these to the performance evaluation measures you must explicitly specify class_weights=class_w in the evaluate! call.
User-defined measures are supported; see the manual for details.
If no measure is specified, then default_measure(mach.model) is used, unless this default is nothing, in which case an error is thrown.
The acceleration keyword argument is used to specify the compute resource (a subtype of ComputationalResources.AbstractResource) that will be used to accelerate/parallelize the resampling operation.
Although evaluate! is mutating, mach.model and mach.args are untouched.
Summary of keyword arguments
resampling - resampling strategy (default is CV(nfolds=6))
measure/measures - measure or vector of measures (losses, scores, etc)
rows - vector of observation indices from which both train and test folds are constructed (default is all observations)
weights - per-sample weights for measures that support them (not to be confused with weights used in training)
class_weights - dictionary of per-class weights for use with measures that support these, in classification problems (not to be confused with per-sample weights or with class weights used in training)
operation/operations - one of predict, predict_mean, predict_mode, predict_median, or predict_joint, or a vector of these of the same length as measure/measures. Automatically inferred if left unspecified.
repeats - default is 1; set to a higher value for repeated (Monte Carlo) resampling
acceleration - parallelization option; currently supported options are instances of CPU1 (single-threaded computation), CPUThreads (multi-threaded computation) and CPUProcesses (multi-process computation); default is default_resource().
force - default is false; set to true to force a cold restart of each training event
verbosity - verbosity level, an integer defaulting to 1
check_measure - default is true
Return value
A PerformanceEvaluation object. See PerformanceEvaluation for details.
MLJModelInterface.evaluate — Function
Some meta-models may choose to implement the evaluate operations.
MLJBase.PerformanceEvaluation — Type
PerformanceEvaluation
Type of object returned by evaluate (for models plus data) or evaluate! (for machines). Such objects encode estimates of the performance (generalization error) of a supervised model or outlier detection model.
When evaluate/evaluate! is called, a number of train/test pairs ("folds") of row indices are generated, according to the options provided, which are discussed in the evaluate! doc-string. Rows correspond to observations. The generated train/test pairs are recorded in the train_test_rows field of the PerformanceEvaluation struct, and the corresponding estimates, aggregated over all train/test pairs, are recorded in measurement, a vector with one entry for each measure (metric) recorded in measure.
When displayed, a PerformanceEvaluation object includes a value under the heading 1.96*SE, derived from the standard error of the per_fold entries. This value is suitable for constructing a formal 95% confidence interval for the given measurement. Such intervals should be interpreted with caution. See, for example, Bates et al. (2021).
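As a worked illustration, the numbers displayed in the first example on this page can be approximately recovered from the per_fold values alone. The formula used for the half-width below, 1.96*std(per_fold)/sqrt(n - 1), is an assumption inferred from the printed tables (it matches them up to rounding); it is not taken from the MLJ source, and the per_fold values are the rounded ones shown in the display.

```julia
using Statistics  # provides mean and std

# Per-fold l2 values as displayed (rounded) in the first example on this page.
per_fold = [0.0707, 0.165, 0.156]

# Aggregate over folds; for this example the displayed measurement is about 0.131.
measurement = mean(per_fold)

# Assumption inferred from the printed tables, not from the MLJ source:
# half-width = 1.96 * std / sqrt(n - 1); the display shows about 0.0723.
half_width = 1.96 * std(per_fold) / sqrt(length(per_fold) - 1)

# Nominal 95% interval, to be read with the caution noted above.
interval = (measurement - half_width, measurement + half_width)
```

With only three folds the interval is wide relative to the estimate, which is one practical reason for the caution recommended above.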
Fields
These fields are part of the public API of the PerformanceEvaluation struct.
measure: vector of measures (metrics) used to evaluate performance
measurement: vector of measurements, one for each element of measure, aggregating the performance measurements over all train/test pairs (folds). The aggregation method applied for a given measure m is aggregation(m) (commonly Mean or Sum)
operation (e.g., predict_mode): the operations applied for each measure to generate predictions to be evaluated. Possibilities are: predict, predict_mean, predict_mode, predict_median, or predict_joint.
per_fold: a vector of vectors of individual test fold evaluations (one vector per measure). Useful for obtaining a rough estimate of the variance of the performance estimate.
per_observation: a vector of vectors of individual observation evaluations of those measures for which reports_each_observation(measure) is true, which is otherwise reported missing. Useful for some forms of hyper-parameter optimization.
fitted_params_per_fold: a vector containing fitted_params(mach) for each machine mach trained during resampling, one machine per train/test pair. Use this to extract the learned parameters for each individual training event.
report_per_fold: a vector containing report(mach) for each machine mach trained during resampling, one machine per train/test pair.
train_test_rows: a vector of tuples, each of the form (train, test), where train and test are vectors of row (observation) indices for training and evaluation respectively.