Getting Started

MLJ homepage

Getting Started

Basic supervised training and testing

julia> using MLJ
julia> using RDatasets
julia> iris = dataset("datasets", "iris"); # a DataFrame

In MLJ one can either wrap data for supervised learning in a formal task (see Working with Tasks), or work directly with the data, split into its input and target parts:

julia> const X = iris[:, 1:4];
julia> const y = iris[:, 5];

A model is a container for hyperparameters. Assuming the DecisionTree package is in your installation load path, we can instantiate a DecisionTreeClassifier model like this:

julia> @load DecisionTreeClassifier
import MLJModels ✔
import DecisionTree ✔
import MLJModels.DecisionTree_.DecisionTreeClassifier ✔

julia> tree_model = DecisionTreeClassifier(target_type=String, max_depth=2)
DecisionTreeClassifier(target_type = String,
                       pruning_purity = 1.0,
                       max_depth = 2,
                       min_samples_leaf = 1,
                       min_samples_split = 2,
                       min_purity_increase = 0.0,
                       n_subfeatures = 0.0,
                       display_depth = 5,
                       post_prune = false,
                       merge_purity_threshold = 0.9,) @ 1…72

Wrapping the model in data creates a machine which will store training outcomes:

julia> tree = machine(tree_model, X, y)
Machine @ 5…78

Training and testing on a hold-out set:

julia> train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
julia> fit!(tree, rows=train)
julia> yhat = predict(tree, X[test,:]);
julia> misclassification_rate(yhat, y[test]);

┌ Info: Training Machine{DecisionTreeClassifier{S…} @ 1…36.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:68
0.08888888888888889

Or, in one line:

julia> evaluate!(tree, resampling=Holdout(fraction_train=0.7, shuffle=true), measure=misclassification_rate)
0.08888888888888889

Changing a hyperparameter and re-evaluating:

julia> tree_model.max_depth = 3
julia> evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true), measure=misclassification_rate)
0.06666666666666667

Next steps

To learn a little more about what MLJ can do, take the MLJ tour. Read the remainder of this document before considering more serious use of MLJ.

Prerequisites

MLJ assumes some familiarity with the CategoricalArrays.jl package, used for representing categorical data. For probabilistic predictors, a basic acquaintance with Distributions.jl is also assumed.

Data containers and scientific types

The MLJ user should acquaint themselves with some basic assumptions about the form of data expected by MLJ, as outlined below.

machine(model::Supervised, X, y) 
machine(model::Unsupervised, X)

The input X in the above machine constructors can be any table, where table means any data type supporting the Tables.jl interface.

At present our API is more restrictive; see this issue with Tables.jl. If your Tables.jl compatible format is not working in MLJ, please post an issue.

In particular, DataFrame, JuliaDB.IndexedTable and TypedTables.Table objects are supported, as are two Julia native formats: column tables (named tuples of equal length vectors) and row tables (vectors of named tuples sharing the same keys).

For models which handle only univariate inputs (input_is_multivariate(model)=false) X cannot be a table but is expected to have Vector or CategoricalVector type.

The target y in the first constructor above must be a Vector or CategoricalVector. A multivariate target y will be vector of tuples. The tuples need not have uniform length, so some forms of sequence prediction are supported.

Conversely, the choice of element types used to represent your data has implicit consequences about how MLJ will interpret that data.

WIP: Eventually users will use task constructors to coerce element types into the requisite form. At present, the user must do so manually, before passing data to the constructors.

To articulate MLJ's conventions about data representation, MLJ distinguishes between machine data types on the one hand (Float64, Bool, String, etc) and scientific data types on the other, represented by new Julia types: Continuous, Multiclass{N}, FiniteOrderedFactor{N}, Count (unbounded ordered factor), and Unknown, with obvious interpretations. These types are organized in a type hierarchy rooted in a new abstract type Found:

A scientific type is any subtype of Union{Missing,Found}. Scientific types have no instances.

Scientific types appear when querying model metadata, as in this example:

julia> info("DecisionTreeClassifier")[:target_scitype_union]

Union{Multiclass,FiniteOrderedFactor}

Basic data convention. The scientific type of data that a Julia object x can represent is defined by scitype(x). If scitype(x) == Unknown, then the interpretation of x by an MLJ model is unpredictable.

(scitype(42), scitype(π), scitype("Julia"))
(Count, Continuous, MLJBase.Unknown)

In particular, integers cannot be used to represent Multiclass or FiniteOrderedFactor data; these can be represented by an unordered or ordered CategoricalValue or CategoricalString:

Tscitype(x) for x::T
MissingMissing
RealContinuous
IntegerCount
CategoricalValueMulticlass{N} where N = nlevels(x), provided x.pool.ordered == false
CategoricalStringMulticlass{N} where N = nlevels(x), provided x.pool.ordered == false
CategoricalValueFiniteOrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true
CategoricalStringFiniteOrderedFactor{N} where N = nlevels(x) provided x.pool.ordered == true
IntegerCount

Here nlevels(x) = length(levels(x.pool)).