Performance Measures

In MLJ, loss functions, scoring rules, sensitivities, and so on, are collectively referred to as measures. These include re-exported loss functions from the LossFunctions.jl library, overloaded so that they behave in the same way as the built-in measures.

To see the list of all measures, run measures(). Further measures for probabilistic predictors, such as proper scoring rules, and for constructing multi-target product measures, are planned. If you'd like to see a measure added to MLJ, post a comment here.

Note for developers: The measures interface and the built-in measures described here are defined in MLJBase, but will ultimately live in a separate package.

Using built-in measures

These measures all have the common calling syntax

measure(ŷ, y)

or

measure(ŷ, y, w)

where y iterates over observations of some target variable, and ŷ iterates over the corresponding predictions (Distribution or Sampler objects in the probabilistic case). Here w is an optional vector of sample weights, or a dictionary of class weights, when these are supported by the measure.

julia> using MLJ

julia> y = [1, 2, 3, 4];

julia> ŷ = [2, 3, 3, 3];

julia> w = [1, 2, 2, 1];

julia> rms(ŷ, y) # reports an aggregate loss
0.8660254037844386

julia> l2(ŷ, y, w) # reports per-observation losses
4-element Vector{Int64}:
 1
 2
 0
 1

julia> y = coerce(["male", "female", "female"], Multiclass)
3-element CategoricalArray{String,1,UInt32}:
 "male"
 "female"
 "female"

julia> d = UnivariateFinite(["male", "female"], [0.55, 0.45], pool=y);

julia> ŷ = [d, d, d];

julia> log_loss(ŷ, y)
3-element Vector{Float64}:
 0.5978370007556204
 0.7985076962177716
 0.7985076962177716

The measures rms, l2 and log_loss illustrated here are actually instances of measure types. For example, l2 = LPLoss(p=2) and log_loss = LogLoss() = LogLoss(tol=eps()). Common aliases are provided:

julia> cross_entropy
LogLoss(
  tol = 2.220446049250313e-16)
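
Because measures are just instances of these types, an instance with non-default parameters can be constructed and called in the usual way. A minimal sketch, assuming only the LPLoss and LogLoss constructors mentioned above (data values are illustrative):

using MLJ

l3 = LPLoss(p=3)                  # an LPLoss instance with a non-default exponent
l3([2, 3, 3, 3], [1, 2, 3, 4])    # per-observation losses, as for l1 and l2

strict_log_loss = LogLoss(tol=1e-10)   # a LogLoss instance with a custom tolerance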

Traits and custom measures

Notice that l2 (like its sibling l1) reports per-observation evaluations, while rms reports only an aggregated result. This and other behavior can be gleaned from measure traits, which are summarized by the info method:

julia> info(l1)
`LPLoss` - lp loss type with instances `l1`, `l2`.
(name = "LPLoss",
 instances = ["l1", "l2"],
 human_name = "lp loss",
 target_scitype = Union{AbstractArray{<:Union{Missing, Continuous}}, AbstractArray{<:Union{Missing, Count}}},
 supports_weights = true,
 supports_class_weights = false,
 prediction_type = :deterministic,
 orientation = :loss,
 reports_each_observation = true,
 aggregation = StatisticalTraits.Mean(),
 is_feature_dependent = false,
 docstring = "`LPLoss` - lp loss type with instances `l1`, `l2`. ",
 distribution_type = Unknown,)
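
Individual traits can also be queried directly. The sketch below assumes the trait functions appearing in the info output are accessible via the MLJ namespace and accept measure instances as well as types:

using MLJ

MLJ.reports_each_observation(l2)    # true
MLJ.reports_each_observation(rms)   # false
MLJ.orientation(l2)                 # :loss
MLJ.supports_weights(rms)           # true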

Query the doc-string for a measure using the name of its type:

julia> rms
RootMeanSquaredError()

julia> @doc RootMeanSquaredError # same as `?RootMeanSquaredError`
  MLJBase.RootMeanSquaredError

  A measure type for root mean squared error, which includes the instance(s):
  rms, rmse, root_mean_squared_error.

  RootMeanSquaredError()(ŷ, y)
  RootMeanSquaredError()(ŷ, y, w)

  Evaluate the root mean squared error on predictions ŷ, given ground truth
  observations y. Optionally specify per-sample weights, w.

  \text{root mean squared error} = \sqrt{n^{-1}\sum_i |y_i - \hat{y}_i|^2}

  or

  \text{root mean squared error} = \sqrt{\frac{\sum_i w_i |y_i - \hat{y}_i|^2}{\sum_i w_i}}

  Requires scitype(y) to be a subtype of
  AbstractArray{<:Union{Infinite,Missing}}; ŷ must be an array of
  deterministic predictions.

  For more information, run info(RootMeanSquaredError).

Use measures() to list all measures, and measures(conditions...) to search for measures with given traits (as you would query models). The trait instances lists the actual callable instances of a given measure type (typically aliases for the default instance).

MLJBase.measures — Method
measures()

List all measures as named-tuples keyed on measure traits.

measures(filters...)

List all measures m for which filter(m) is true, for each filter in filters.

measures(needle::Union{AbstractString,Regex})

List all measures with needle in a measure's name, instances, or docstring.

Example

Find all classification measures supporting sample weights:

measures(m -> m.target_scitype <: AbstractVector{<:Finite} &&
              m.supports_weights)

Find all measures in the "rms" family:

measures("rms")

A user-defined measure can be passed to the evaluate! method, and used elsewhere in MLJ, provided it is a function or callable object conforming to the above syntactic conventions. By default, a custom measure is understood to:

  • be a loss function (rather than a score)

  • report an aggregated value (rather than per-sample evaluations)

  • be feature-independent

To override this behavior one simply overloads the appropriate trait, as shown in the following examples:

julia> y = [1, 2, 3, 4];

julia> ŷ = [2, 3, 3, 3];

julia> w = [1, 2, 2, 1];

julia> my_loss(ŷ, y) = maximum((ŷ - y).^2);

julia> my_loss(ŷ, y)
1

julia> my_per_sample_loss(ŷ, y) = abs.(ŷ - y);

julia> MLJ.reports_each_observation(::typeof(my_per_sample_loss)) = true;

julia> my_per_sample_loss(ŷ, y)
4-element Vector{Int64}:
 1
 1
 0
 1

julia> my_weighted_score(ŷ, y) = 1/mean(abs.(ŷ - y));

julia> my_weighted_score(ŷ, y, w) = 1/mean(abs.((ŷ - y).^w));

julia> MLJ.supports_weights(::typeof(my_weighted_score)) = true;

julia> MLJ.orientation(::typeof(my_weighted_score)) = :score;

julia> my_weighted_score(ŷ, y)
1.3333333333333333

julia> X = (x=rand(4), penalty=[1, 2, 3, 4]);

julia> my_feature_dependent_loss(ŷ, X, y) = sum(abs.(ŷ - y) .* X.penalty)/sum(X.penalty);

julia> MLJ.is_feature_dependent(::typeof(my_feature_dependent_loss)) = true;

julia> my_feature_dependent_loss(ŷ, X, y)
0.7

The possible signatures for custom measures are: measure(ŷ, y), measure(ŷ, y, w), measure(ŷ, X, y) and measure(ŷ, X, y, w), each measure implementing one non-weighted version, and possibly a second weighted version.

Implementation detail: Internally, every measure is evaluated using the syntax

MLJ.value(measure, ŷ, X, y, w)

and the traits determine what can be ignored and how measure is actually called. If w=nothing then the non-weighted form of measure is dispatched.
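
For example, continuing the definitions above, the two calls below should be equivalent, because my_weighted_score is declared weight-supporting but not feature-dependent (a sketch relying only on the MLJ.value syntax just described):

MLJ.value(my_weighted_score, ŷ, nothing, y, w)   # X is ignored; weights are used
my_weighted_score(ŷ, y, w)                       # the method actually dispatched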

Using measures from LossFunctions.jl

The LossFunctions.jl package includes "distance loss" functions for Continuous targets, and "marginal loss" functions for Finite{2} (binary) targets. While the LossFunctions.jl interface differs from the present one (for example, binary observations must be +1 or -1), MLJ has overloaded instances of the LossFunctions.jl types to behave the same as the built-in measures.

Note that the "distance losses" in the package apply to deterministic predictions, while the "marginal losses" apply to probabilistic predictions.

List of measures

All measures listed below have a doc-string associated with the measure's type. So, for example, query ?LPLoss, not ?l2.

using MLJ, DataFrames   # DataFrames provides `DataFrame`

ms = measures()
types = map(ms) do m
    m.name
end
instances = map(ms) do m m.instances end
table = (type=types, instances=instances)
DataFrame(table)
63×2 DataFrame
 Row │ type                                 instances
     │ String                               Array…
─────┼────────────────────────────────────────────────────────────────────────
   1 │ BrierLoss                            ["brier_loss"]
   2 │ BrierScore                           ["brier_score"]
   3 │ LPLoss                               ["l1", "l2"]
   4 │ LogCoshLoss                          ["log_cosh", "log_cosh_loss"]
   5 │ LogLoss                              ["log_loss", "cross_entropy"]
   6 │ LogScore                             ["log_score"]
   7 │ SphericalScore                       ["spherical_score"]
   8 │ Accuracy                             ["accuracy"]
   9 │ AreaUnderCurve                       ["area_under_curve", "auc"]
  10 │ BalancedAccuracy                     ["balanced_accuracy", "bacc", "bac"]
  11 │ ConfusionMatrix                      ["confusion_matrix", "confmat"]
  12 │ FScore                               ["f1score"]
  13 │ FalseDiscoveryRate                   ["false_discovery_rate", "falsediscovery_rate", "fdr"]
  14 │ FalseNegative                        ["false_negative", "falsenegative"]
  15 │ FalseNegativeRate                    ["false_negative_rate", "falsenegative_rate", "fnr", "miss_rate"]
  16 │ FalsePositive                        ["false_positive", "falsepositive"]
  17 │ FalsePositiveRate                    ["false_positive_rate", "falsepositive_rate", "fpr", "fallout"]
  18 │ Kappa                                ["kappa"]
  19 │ MatthewsCorrelation                  ["matthews_correlation", "mcc"]
  20 │ MeanAbsoluteError                    ["mae", "mav", "mean_absolute_error", "mean_absolute_value"]
  21 │ MeanAbsoluteProportionalError        ["mape"]
  22 │ MisclassificationRate                ["misclassification_rate", "mcr"]
  23 │ MulticlassFScore                     ["macro_f1score", "micro_f1score", "multiclass_f1score"]
  24 │ MulticlassFalseDiscoveryRate         ["multiclass_falsediscovery_rate", "multiclass_fdr", "multiclass_false_discovery_rate"]
  25 │ MulticlassFalseNegative              ["multiclass_false_negative", "multiclass_falsenegative"]
  26 │ MulticlassFalseNegativeRate          ["multiclass_false_negative_rate", "multiclass_fnr", "multiclass_miss_rate", "multiclass_falsenegative_rate"]
  27 │ MulticlassFalsePositive              ["multiclass_false_positive", "multiclass_falsepositive"]
  28 │ MulticlassFalsePositiveRate          ["multiclass_false_positive_rate", "multiclass_fpr", "multiclass_fallout", "multiclass_falsepositive_rate"]
  29 │ MulticlassNegativePredictiveValue    ["multiclass_negative_predictive_value", "multiclass_negativepredictive_value", "multiclass_npv"]
  30 │ MulticlassPrecision                  ["multiclass_positive_predictive_value", "multiclass_ppv", "multiclass_positivepredictive_value", "multiclass_precision"]
  31 │ MulticlassTrueNegative               ["multiclass_true_negative", "multiclass_truenegative"]
  32 │ MulticlassTrueNegativeRate           ["multiclass_true_negative_rate", "multiclass_tnr", "multiclass_specificity", "multiclass_selectivity", "multiclass_truenegative_rate"]
  33 │ MulticlassTruePositive               ["multiclass_true_positive", "multiclass_truepositive"]
  34 │ MulticlassTruePositiveRate           ["multiclass_true_positive_rate", "multiclass_tpr", "multiclass_sensitivity", "multiclass_recall", "multiclass_hit_rate", "multiclass_truepositive_rate"]
  35 │ NegativePredictiveValue              ["negative_predictive_value", "negativepredictive_value", "npv"]
  36 │ Precision                            ["positive_predictive_value", "ppv", "positivepredictive_value", "precision"]
  37 │ RSquared                             ["rsq", "rsquared"]
  38 │ RootMeanSquaredError                 ["rms", "rmse", "root_mean_squared_error"]
  39 │ RootMeanSquaredLogError              ["rmsl", "rmsle", "root_mean_squared_log_error"]
  40 │ RootMeanSquaredLogProportionalError  ["rmslp1"]
  41 │ RootMeanSquaredProportionalError     ["rmsp"]
  42 │ TrueNegative                         ["true_negative", "truenegative"]
  43 │ TrueNegativeRate                     ["true_negative_rate", "truenegative_rate", "tnr", "specificity", "selectivity"]
  44 │ TruePositive                         ["true_positive", "truepositive"]
  45 │ TruePositiveRate                     ["true_positive_rate", "truepositive_rate", "tpr", "sensitivity", "recall", "hit_rate"]
  46 │ DWDMarginLoss                        ["dwd_margin_loss"]
  47 │ ExpLoss                              ["exp_loss"]
  48 │ L1HingeLoss                          ["l1_hinge_loss"]
  49 │ L2HingeLoss                          ["l2_hinge_loss"]
  50 │ L2MarginLoss                         ["l2_margin_loss"]
  51 │ LogitMarginLoss                      ["logit_margin_loss"]
  52 │ ModifiedHuberLoss                    ["modified_huber_loss"]
  53 │ PerceptronLoss                       ["perceptron_loss"]
  54 │ SigmoidLoss                          ["sigmoid_loss"]
  55 │ SmoothedL1HingeLoss                  ["smoothed_l1_hinge_loss"]
  56 │ ZeroOneLoss                          ["zero_one_loss"]
  57 │ HuberLoss                            ["huber_loss"]
  58 │ L1EpsilonInsLoss                     ["l1_epsilon_ins_loss"]
  59 │ L2EpsilonInsLoss                     ["l2_epsilon_ins_loss"]
  60 │ LPDistLoss                           ["lp_dist_loss"]
  61 │ LogitDistLoss                        ["logit_dist_loss"]
  62 │ PeriodicLoss                         ["periodic_loss"]
  63 │ QuantileLoss                         ["quantile_loss"]

In MLJ one computes a confusion matrix by calling an instance of the ConfusionMatrix measure type on the data:

MLJBase.ConfusionMatrix — Type
MLJBase.ConfusionMatrix

A measure type for confusion matrix, which includes the instance(s): confusion_matrix, confmat.

ConfusionMatrix()(ŷ, y)

Evaluate the default instance of ConfusionMatrix on predictions ŷ, given ground truth observations y.

If r is the return value, then the raw confusion matrix is r.mat, whose rows correspond to predictions, and columns to ground truth. The ordering follows that of levels(y).

Use ConfusionMatrix(perm=[2, 1]) to reverse the class order for binary data. For more than two classes, specify an appropriate permutation, as in ConfusionMatrix(perm=[2, 3, 1]).

Requires scitype(y) to be a subtype of AbstractArray{<:Union{OrderedFactor{2},Missing}} (binary classification, where the choice of "true" class affects the measure); ŷ must be an array of deterministic predictions.

For more information, run info(ConfusionMatrix).
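
For example (a small sketch; the labels are invented for illustration):

using MLJ

y = coerce(["a", "b", "a", "b", "b"], OrderedFactor)
ŷ = coerce(["a", "a", "a", "b", "b"], OrderedFactor)

cm = confusion_matrix(ŷ, y)
cm.mat   # raw confusion matrix; rows are predictions, columns are ground truth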

MLJBase.roc_curve — Function
fprs, tprs, ts = roc_curve(ŷ, y) = roc(ŷ, y)

Return the ROC curve for a two-class probabilistic prediction ŷ, given ground truth observations y: the false positive rates fprs and true positive rates tprs computed over a range of thresholds ts. Note that if there are k unique scores, there are correspondingly k thresholds and k+1 "bins" over which the FPR and TPR are constant:

  • [0.0 - thresh[1]]
  • [thresh[1] - thresh[2]]
  • ...
  • [thresh[k] - 1]

consequently, tprs and fprs are of length k+1 if ts is of length k.

To draw the curve using your favorite plotting backend, do plot(fprs, tprs).
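
A minimal sketch (the scores are invented; plot is assumed to come from a plotting package such as Plots.jl):

using MLJ

y = coerce([0, 0, 1, 1], OrderedFactor)
probs = [0.1, 0.4, 0.35, 0.8]        # predicted probabilities of the positive class
ŷ = [UnivariateFinite([0, 1], [1 - p, p], pool=y) for p in probs]

fprs, tprs, ts = roc_curve(ŷ, y)
# using Plots; plot(fprs, tprs)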