Performance Measures

In MLJ, loss functions, scoring rules, sensitivities, and so on are collectively referred to as measures. These include re-exported loss functions from the LossFunctions.jl library, overloaded to behave the same way as the built-in measures.

To list all measures, run measures(). Further measures for probabilistic predictors, such as proper scoring rules, and for constructing multi-target product measures, are planned. If you'd like to see a measure added to MLJ, post a comment here.

Note for developers: The measures interface and the built-in measures described here are defined in MLJBase, but will ultimately live in a separate package.

Using built-in measures

These measures all have the common calling syntax

measure(ŷ, y)

or

measure(ŷ, y, w)

where y iterates over observations of some target variable, and ŷ iterates over predictions (Distribution or Sampler objects in the probabilistic case). Here w is an optional vector of sample weights, or a dictionary of class weights, when these are supported by the measure.

julia> using MLJ

julia> y = [1, 2, 3, 4];

julia> ŷ = [2, 3, 3, 3];

julia> w = [1, 2, 2, 1];

julia> rms(ŷ, y) # reports an aggregate loss
0.8660254037844386

julia> l2(ŷ, y, w) # reports per observation losses
4-element Vector{Int64}:
 1
 2
 0
 1

julia> y = coerce(["male", "female", "female"], Multiclass)
3-element CategoricalArray{String,1,UInt32}:
 "male"
 "female"
 "female"

julia> d = UnivariateFinite(["male", "female"], [0.55, 0.45], pool=y);

julia> ŷ = [d, d, d];

julia> log_loss(ŷ, y)
3-element Vector{Float64}:
 0.7985076962177716
 0.5978370007556204
 0.5978370007556204

The measures rms, l2 and log_loss illustrated here are actually instances of measure types. For example, l2 = LPLoss(p=2) and log_loss = LogLoss() = LogLoss(tol=eps()). Common aliases are provided:

julia> cross_entropy
LogLoss(tol = 2.220446049250313e-16) @449
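
The numbers reported in the session above can be checked by hand. The following is a minimal sketch in plain Julia (no MLJ required). It assumes that sample weights simply multiply the per-observation losses, which is consistent with the l2 output shown above, and it ignores the small tol clamping applied by LogLoss. The names rms_by_hand, l2_by_hand and log_losses are illustrative only.

```julia
using Statistics

y = [1, 2, 3, 4]
ŷ = [2, 3, 3, 3]
w = [1, 2, 2, 1]

# root mean squared error: a single value aggregated over all observations
rms_by_hand = sqrt(mean((ŷ .- y) .^ 2))

# squared error for each observation, scaled by that observation's weight
l2_by_hand = w .* (ŷ .- y) .^ 2

# log loss for one observation is -log(p), where p is the probability
# the prediction assigns to the class actually observed; the two distinct
# values in the log_loss output correspond to p = 0.55 and p = 0.45
log_losses = -log.([0.55, 0.45])
```

Here rms_by_hand agrees with rms(ŷ, y), l2_by_hand agrees with l2(ŷ, y, w), and log_losses is approximately [0.598, 0.799].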

Traits and custom measures

Notice that l2 reports per-sample evaluations (as does l1), while rms reports only an aggregated result. This and other behavior can be gleaned from measure traits, which are summarized by the info method:

julia> info(l1)
`LPLoss` - lp loss type with instances `l1`, `l2`.
(name = "LPLoss",
 instances = ["l1", "l2"],
 human_name = "lp loss",
 target_scitype = Union{AbstractArray{var"#s1071", N} where {var"#s1071"<:Union{Missing, Continuous}, N}, AbstractArray{var"#s1070", N} where {var"#s1070"<:Union{Missing, Count}, N}},
 supports_weights = true,
 supports_class_weights = false,
 prediction_type = :deterministic,
 orientation = :loss,
 reports_each_observation = true,
 aggregation = StatisticalTraits.Mean(),
 is_feature_dependent = false,
 docstring = "`LPLoss` - lp loss type with instances `l1`, `l2`. ",
 distribution_type = Unknown,)

Query the doc-string for a measure using the name of its type:

julia> rms
RootMeanSquaredError() @852

julia> @doc RootMeanSquaredError # same as `?RootMeanSquaredError`
  MLJBase.RootMeanSquaredError

  A measure type for root mean squared error, which includes the instance(s):
  rms, rmse, root_mean_squared_error.

  RootMeanSquaredError()(ŷ, y)
  RootMeanSquaredError()(ŷ, y, w)

  Evaluate the root mean squared error on predictions ŷ, given ground truth
  observations y. Optionally specify per-sample weights, w.

  \text{root mean squared error} = \sqrt{n^{-1}∑ᵢ|yᵢ-ŷᵢ|^2} or \text{root
  mean squared error} = \sqrt{\frac{∑ᵢwᵢ|yᵢ-ŷᵢ|^2}{∑ᵢwᵢ}}

  Requires scitype(y) to be a subtype of Union{AbstractArray{var"#s1071", N}
  where {var"#s1071"<:Union{Missing, ScientificTypesBase.Continuous}, N},
  AbstractArray{var"#s1070", N} where {var"#s1070"<:Union{Missing,
  ScientificTypesBase.Count}, N}}; ŷ must be an array of deterministic
  predictions.

  For more information, run info(RootMeanSquaredError).

Use measures() to list all measures, and measures(conditions...) to search for measures with given traits (as you would query models). The instances trait lists the actual callable instances of a given measure type (typically aliases for the default instance).

MLJBase.measures - Method
measures()

List all measures as named-tuples keyed on measure traits.

measures(filters...)

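List all measures satisfying every one of the given filters, where each filter is a function mapping a measure's named-tuple of traits to true or false (see the example below).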

measures(needle::Union{AbstractString,Regex})

List all measures with needle in a measure's name or docstring.

Example

Find all classification measures supporting sample weights:

measures(m -> m.target_scitype <: AbstractVector{<:Finite} &&
              m.supports_weights)

Find all measures in the rms family:

measures("rms")
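
The filtering idea can be illustrated without MLJ. In the sketch below, toy_measures is a hand-written stand-in for the output of measures(); the entries and their names are hypothetical and do not describe real MLJ measures. Only the pattern (a predicate applied to named tuples of traits) mirrors the examples above.

```julia
# Toy stand-in for the output of measures(): hand-written named tuples.
# These entries are hypothetical and are NOT real MLJ measures.
toy_measures = [
    (name = "ToyAbsoluteLoss", orientation = :loss,  supports_weights = true),
    (name = "ToySquaredLoss",  orientation = :loss,  supports_weights = false),
    (name = "ToyHitRate",      orientation = :score, supports_weights = true),
]

# a filter is any function mapping one named tuple of traits to true or false
weighted_losses = filter(m -> m.orientation == :loss && m.supports_weights,
                         toy_measures)

# substring search on the name, analogous in spirit to the `needle` form
squared = filter(m -> occursin("Squared", m.name), toy_measures)
```

With this toy data, weighted_losses contains only the "ToyAbsoluteLoss" entry and squared contains only the "ToySquaredLoss" entry.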

A user-defined measure in MLJ can be passed to the evaluate! method, and elsewhere in MLJ, provided it is a function or callable object conforming to the above syntactic conventions. By default, a custom measure is understood to:

  • be a loss function (rather than a score)

  • report an aggregated value (rather than per-sample evaluations)

  • be feature-independent

To override this behaviour one simply overloads the appropriate trait, as shown in the following examples:

julia> y = [1, 2, 3, 4];

julia> ŷ = [2, 3, 3, 3];

julia> w = [1, 2, 2, 1];

julia> my_loss(ŷ, y) = maximum((ŷ - y).^2);

julia> my_loss(ŷ, y)
1

julia> my_per_sample_loss(ŷ, y) = abs.(ŷ - y);

julia> MLJ.reports_each_observation(::typeof(my_per_sample_loss)) = true;

julia> my_per_sample_loss(ŷ, y)
4-element Vector{Int64}:
 1
 1
 0
 1

julia> my_weighted_score(ŷ, y) = 1/mean(abs.(ŷ - y));

julia> my_weighted_score(ŷ, y, w) = 1/mean(abs.((ŷ - y).^w));

julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;

julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;

julia> my_weighted_score(ŷ, y)
1.3333333333333333

julia> X = (x=rand(4), penalty=[1, 2, 3, 4]);

julia> my_feature_dependent_loss(ŷ, X, y) = sum(abs.(ŷ - y) .* X.penalty)/sum(X.penalty);

julia> MLJ.is_feature_dependent(::typeof(my_feature_dependent_loss)) = true;

julia> my_feature_dependent_loss(ŷ, X, y)
0.7

The possible signatures for custom measures are: measure(ŷ, y), measure(ŷ, y, w), measure(ŷ, X, y) and measure(ŷ, X, y, w), each measure implementing one non-weighted version, and possibly a second weighted version.
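
Since a custom measure may be a callable object as well as a function, the two signatures measure(ŷ, y) and measure(ŷ, y, w) can also be implemented on a struct. The following is a minimal sketch in plain Julia. The type TolerantMAE and its field tol are hypothetical, and the choice to normalize by the sum of the weights is an assumption made for this example. Any trait overloads needed by MLJ are omitted here.

```julia
using Statistics

# Hypothetical measure type: mean absolute error that ignores
# errors no larger than `tol`
struct TolerantMAE
    tol::Float64
end

# non-weighted signature: measure(ŷ, y)
function (m::TolerantMAE)(ŷ, y)
    errors = abs.(ŷ .- y)
    return mean(e -> e > m.tol ? e : zero(e), errors)
end

# weighted signature: measure(ŷ, y, w), normalized by the sum of weights
# (an assumption made for this sketch)
function (m::TolerantMAE)(ŷ, y, w)
    errors = abs.(ŷ .- y)
    kept = map(e -> e > m.tol ? e : zero(e), errors)
    return sum(w .* kept) / sum(w)
end

tolerant_mae = TolerantMAE(0.5)
unweighted = tolerant_mae([2, 3, 3, 3], [1, 2, 3, 4])
weighted = tolerant_mae([2, 3, 3, 3], [1, 2, 3, 4], [1, 2, 2, 1])
```

Because the object is called exactly like the built-in measures, the same instance can carry a hyperparameter (here tol) while still conforming to the calling conventions described above.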

Implementation detail: Internally, every measure is evaluated using the syntax

MLJ.value(measure, ŷ, X, y, w)

and the traits determine what can be ignored and how measure is actually called. If w=nothing then the non-weighted form of measure is dispatched.
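
To make the role of the traits concrete, here is a toy dispatcher in plain Julia. It is not the MLJ implementation: toy_value, toy_is_feature_dependent and toy_supports_weights are hypothetical names invented for this sketch, which only mimics the behavior described above (ignore X unless the measure is feature-dependent, and use the non-weighted form when w is nothing).

```julia
# Toy stand-ins for traits; these are NOT the MLJ trait functions
toy_is_feature_dependent(measure) = false
toy_supports_weights(measure) = false

# Uniform entry point: always receives ŷ, X, y, w and decides what to pass on
function toy_value(measure, ŷ, X, y, w)
    args = toy_is_feature_dependent(measure) ? (ŷ, X, y) : (ŷ, y)
    if w === nothing || !toy_supports_weights(measure)
        return measure(args...)       # non-weighted form
    else
        return measure(args..., w)    # weighted form
    end
end

# a feature-dependent loss, declared as such through the toy trait
penalized(ŷ, X, y) = sum(abs.(ŷ .- y) .* X.penalty) / sum(X.penalty)
toy_is_feature_dependent(::typeof(penalized)) = true

# an ordinary loss relying on the trait defaults
worst(ŷ, y) = maximum((ŷ .- y) .^ 2)

X = (penalty = [1, 2, 3, 4],)
a = toy_value(penalized, [2, 3, 3, 3], X, [1, 2, 3, 4], nothing)
b = toy_value(worst, [2, 3, 3, 3], X, [1, 2, 3, 4], nothing)
```

Both calls use the same five-argument form, yet X reaches only the measure that declares itself feature-dependent.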

Using measures from LossFunctions.jl

The LossFunctions.jl package includes "distance loss" functions for Continuous targets, and "marginal loss" functions for Finite{2} (binary) targets. While the LossFunctions.jl interface differs from the present one (for example, binary observations must be +1 or -1), MLJ has overloaded instances of the LossFunctions.jl types to behave the same as the built-in types.

Note that the "distance losses" in the package apply to deterministic predictions, while the "marginal losses" apply to probabilistic predictions.
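
To illustrate the +1/-1 convention mentioned above, the sketch below encodes a two-class target by hand and evaluates the textbook L1 hinge loss, max(0, 1 - y·ŷ), in plain Julia. The labels and scores are hypothetical. How MLJ converts probabilistic predictions into the scores expected by a marginal loss is not covered in this section and is not assumed here.

```julia
# Hypothetical two-class target and raw real-valued scores
labels = ["spam", "ham", "spam", "ham"]
scores = [0.8, -1.5, -0.2, 0.3]   # positive values favor "spam"

# encode the target as +1 ("spam") or -1 ("ham")
signed = [label == "spam" ? 1 : -1 for label in labels]

# textbook L1 hinge loss per observation: zero once the score agrees
# with the label by a margin of at least 1
hinge = max.(0, 1 .- signed .* scores)
```

Only the second observation, whose score agrees with its label by a margin greater than 1, incurs zero loss.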

List of measures

All measures listed below have a doc-string associated with the measure's type. So, for example, use ?LPLoss rather than ?l2.

using DataFrames

ms = measures()
types = map(m -> m.name, ms)           # measure type names
instances = map(m -> m.instances, ms)  # callable instances of each type
table = (type=types, instances=instances)
DataFrame(table)

61 rows × 2 columns

    type                                 instances
    String                               Array…
1   BrierLoss                            ["brier_loss"]
2   BrierScore                           ["brier_score"]
3   LPLoss                               ["l1", "l2"]
4   LogCoshLoss                          ["log_cosh", "log_cosh_loss"]
5   LogLoss                              ["log_loss", "cross_entropy"]
6   LogScore                             ["log_score"]
7   SphericalScore                       ["spherical_score"]
8   Accuracy                             ["accuracy"]
9   AreaUnderCurve                       ["area_under_curve", "auc"]
10  BalancedAccuracy                     ["balanced_accuracy", "bacc", "bac"]
11  ConfusionMatrix                      ["confusion_matrix", "confmat"]
12  FScore                               ["f1score"]
13  FalseDiscoveryRate                   ["false_discovery_rate", "falsediscovery_rate", "fdr"]
14  FalseNegative                        ["false_negative", "falsenegative"]
15  FalseNegativeRate                    ["false_negative_rate", "falsenegative_rate", "fnr", "miss_rate"]
16  FalsePositive                        ["false_positive", "falsepositive"]
17  FalsePositiveRate                    ["false_positive_rate", "falsepositive_rate", "fpr", "fallout"]
18  MatthewsCorrelation                  ["matthews_correlation", "mcc"]
19  MeanAbsoluteError                    ["mae", "mav", "mean_absolute_error", "mean_absolute_value"]
20  MeanAbsoluteProportionalError        ["mape"]
21  MisclassificationRate                ["misclassification_rate", "mcr"]
22  MulticlassFScore                     ["macro_f1score", "micro_f1score", "multiclass_f1score"]
23  MulticlassFalseDiscoveryRate         ["multiclass_falsediscovery_rate", "multiclass_fdr"]
24  MulticlassFalseNegative              ["multiclass_false_negative", "multiclass_falsenegative"]
25  MulticlassFalseNegativeRate          ["multiclass_false_negative_rate", "multiclass_fnr", "multiclass_miss_rate", "multiclass_falsenegative_rate"]
26  MulticlassFalsePositive              ["multiclass_false_positive", "multiclass_falsepositive"]
27  MulticlassFalsePositiveRate          ["multiclass_false_positive_rate", "multiclass_fpr", "multiclass_fallout", "multiclass_falsepositive_rate"]
28  MulticlassNegativePredictiveValue    ["multiclass_negative_predictive_value", "multiclass_negativepredictive_value", "multiclass_npv"]
29  MulticlassPrecision                  ["multiclass_positive_predictive_value", "multiclass_ppv", "multiclass_positivepredictive_value", "multiclass_recall"]
30  MulticlassTrueNegative               ["multiclass_true_negative", "multiclass_truenegative"]
31  MulticlassTrueNegativeRate           ["multiclass_true_negative_rate", "multiclass_tnr", "multiclass_specificity", "multiclass_selectivity", "multiclass_truenegative_rate"]
32  MulticlassTruePositive               ["multiclass_true_positive", "multiclass_truepositive"]
33  MulticlassTruePositiveRate           ["multiclass_true_positive_rate", "multiclass_tpr", "multiclass_sensitivity", "multiclass_recall", "multiclass_hit_rate", "multiclass_truepositive_rate"]
34  NegativePredictiveValue              ["negative_predictive_value", "negativepredictive_value", "npv"]
35  Precision                            ["positive_predictive_value", "ppv", "positivepredictive_value", "precision"]
36  RootMeanSquaredError                 ["rms", "rmse", "root_mean_squared_error"]
37  RootMeanSquaredLogError              ["rmsl", "rmsle", "root_mean_squared_log_error"]
38  RootMeanSquaredLogProportionalError  ["rmslp1"]
39  RootMeanSquaredProportionalError     ["rmsp"]
40  TrueNegative                         ["true_negative", "truenegative"]
41  TrueNegativeRate                     ["true_negative_rate", "truenegative_rate", "tnr", "specificity", "selectivity"]
42  TruePositive                         ["true_positive", "truepositive"]
43  TruePositiveRate                     ["true_positive_rate", "truepositive_rate", "tpr", "sensitivity", "recall", "hit_rate"]
44  DWDMarginLoss                        ["dwd_margin_loss"]
45  ExpLoss                              ["exp_loss"]
46  L1HingeLoss                          ["l1_hinge_loss"]
47  L2HingeLoss                          ["l2_hinge_loss"]
48  L2MarginLoss                         ["l2_margin_loss"]
49  LogitMarginLoss                      ["logit_margin_loss"]
50  ModifiedHuberLoss                    ["modified_huber_loss"]
51  PerceptronLoss                       ["perceptron_loss"]
52  SigmoidLoss                          ["sigmoid_loss"]
53  SmoothedL1HingeLoss                  ["smoothed_l1_hinge_loss"]
54  ZeroOneLoss                          ["zero_one_loss"]
55  HuberLoss                            ["huber_loss"]
56  L1EpsilonInsLoss                     ["l1_epsilon_ins_loss"]
57  L2EpsilonInsLoss                     ["l2_epsilon_ins_loss"]
58  LPDistLoss                           ["lp_dist_loss"]
59  LogitDistLoss                        ["logit_dist_loss"]
60  PeriodicLoss                         ["periodic_loss"]
61  QuantileLoss                         ["quantile_loss"]

In MLJ one computes a confusion matrix by calling an instance of the ConfusionMatrix measure type on the data:

MLJBase.ConfusionMatrix - Type
MLJBase.ConfusionMatrix

A measure type for confusion matrix, which includes the instance(s): confusion_matrix, confmat.

ConfusionMatrix()(ŷ, y)

Evaluate the default instance of ConfusionMatrix on predictions ŷ, given ground truth observations y.

If r is the return value, then the raw confusion matrix is r.mat, whose rows correspond to predictions, and columns to ground truth. The ordering follows that of levels(y).

Use ConfusionMatrix(perm=[2, 1]) to reverse the class order for binary data. For more than two classes, specify an appropriate permutation, as in ConfusionMatrix(perm=[2, 3, 1]).

Requires scitype(y) to be a subtype of AbstractArray{<:OrderedFactor{2}} (binary classification where choice of "true" affects the measure); ŷ must be an array of deterministic predictions.

For more information, run info(ConfusionMatrix).
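
The layout of the raw matrix (rows for predictions, columns for ground truth) can be reproduced by hand. The sketch below uses plain Julia and hypothetical data. The variable class_order plays the role of levels(y), and applying perm to both rows and columns is one reading of "reverse the class order", assumed here for illustration rather than taken from the MLJ source.

```julia
# Hypothetical class order, standing in for levels(y)
class_order = ["negative", "positive"]

y = ["positive", "negative", "positive", "positive", "negative"]  # ground truth
ŷ = ["positive", "positive", "negative", "positive", "negative"]  # predictions

# position of each class in the chosen order
index = Dict(level => i for (i, level) in enumerate(class_order))

# rows correspond to predictions, columns to ground truth
mat = zeros(Int, 2, 2)
for (pred, truth) in zip(ŷ, y)
    mat[index[pred], index[truth]] += 1
end

# reversing the class order permutes rows and columns together
# (an assumption about what perm=[2, 1] means)
perm = [2, 1]
mat_reversed = mat[perm, perm]
```

With this data, mat[2, 1] counts the single observation predicted "positive" whose true class is "negative".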

MLJBase.roc_curve - Function
fprs, tprs, ts = roc_curve(ŷ, y) = roc(ŷ, y)

Return the ROC curve for a two-class probabilistic prediction ŷ given the ground truth y. The false positive rates, the true positive rates, and the thresholds ts at which they are evaluated are returned, in that order. Note that if there are k unique scores, there are correspondingly k thresholds and k+1 "bins" over which the FPR and TPR are constant:

  • [0.0 - thresh[1]]
  • [thresh[1] - thresh[2]]
  • ...
  • [thresh[k] - 1]

Consequently, tprs and fprs have length k+1 when ts has length k.

To draw the curve using your favorite plotting backend, do plot(fprs, tprs).
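
The relationship between k thresholds and k+1 rate values can be seen in a hand computation. The following sketch uses plain Julia and hypothetical scores. It adopts one plausible convention (predict the positive class when the score is at least the cut, plus one extra rule that predicts nothing positive); the ordering and boundary handling used by roc_curve itself may differ.

```julia
# Hypothetical predicted probabilities of the positive class, and the truth
scores = [0.9, 0.8, 0.8, 0.3, 0.1]
truth  = [true, true, false, true, false]

ts = sort(unique(scores))   # the k unique scores serve as thresholds
cuts = vcat(ts, Inf)        # k + 1 decision rules: positive when score >= cut

npos = count(truth)         # number of actual positives
nneg = count(!, truth)      # number of actual negatives

# one (fpr, tpr) pair for each decision rule
tprs = [count((scores .>= c) .& truth) / npos for c in cuts]
fprs = [count((scores .>= c) .& (.!truth)) / nneg for c in cuts]
```

Here there are four unique scores, so ts has length 4 while tprs and fprs each have length 5, running from the rule that labels everything positive down to the rule that labels nothing positive.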