Preparing Data

Splitting data

MLJ has two tools for splitting data. To split data vertically (that is, to split by observations) use partition. This is commonly applied to a vector of observation indices, but can also be applied to datasets themselves, provided they are vectors, matrices or tables.

To split tabular data horizontally (i.e., break up a table based on feature names) use unpack.
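
These two functions are commonly used together. Here is a minimal sketch (the table and its column names are hypothetical): unpack pulls out the target column, and partition generates train/test row indices.

using MLJ

# a toy column table; the column names are hypothetical
table = (target = [1.0, 2.0, 3.0, 4.0],
         x1     = [10.0, 20.0, 30.0, 40.0],
         x2     = [0.1, 0.2, 0.3, 0.4])

# horizontal split: extract the target; the filter _ -> true collects all remaining columns
y, X = unpack(table, ==(:target), _ -> true)

# vertical split: 75%/25% train/test row indices, shuffled reproducibly
train, test = partition(eachindex(y), 0.75, rng=123)

The index vectors train and test can subsequently be passed to operations such as fit! via the rows keyword, as in fit!(mach, rows=train).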

MLJBase.partition — Function
partition(X, fractions...;
          shuffle=nothing,
          rng=Random.GLOBAL_RNG,
          stratify=nothing)

Splits the vector or matrix X into a tuple of vectors or matrices whose vertical concatenation is X. The number of rows in each component of the return value is determined by the corresponding fractions of nrows(X), where valid fractions lie in (0,1) and sum to less than one. The last fraction is not provided, as it is inferred from the preceding ones.

X can also be any object which implements the Tables.jl interface according to Tables.istable.

So, for example,

julia> partition(1:1000, 0.8)
([1,...,800], [801,...,1000])

julia> partition(1:1000, 0.2, 0.7)
([1,...,200], [201,...,900], [901,...,1000])

julia> partition(reshape(1:10, 5, 2), 0.2, 0.4)
([1 6], [2 7; 3 8], [4 9; 5 10])

X, _ = make_regression()          # a table
Xtrain, Xtest = partition(X, 0.8) # the table split on rows

Keywords

  • shuffle=nothing: if set to true, shuffles the rows before taking fractions.
  • rng=Random.GLOBAL_RNG: specifies the random number generator to be used; can be an integer seed. If rng is specified, then shuffle === nothing is interpreted as true.
  • stratify=nothing: if a vector is specified, the partition will match the stratification of the given vector; in that case, shuffle cannot be false. See the example below.
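
For example, a class-stratified split of row indices might be obtained as follows (a sketch; y is a hypothetical vector of class labels):

y = ["a", "b", "a", "a", "b", "a", "b", "a"]

# train and test each approximately preserve the class proportions in y;
# passing rng implies shuffle=true, as stratification requires
train, test = partition(eachindex(y), 0.75, stratify=y, rng=123)
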
MLJBase.unpack — Function
t1, t2, ..., tk = unpack(table, f1, f2, ..., fk;
                         wrap_singles=false,
                         shuffle=false,
                         rng::Union{AbstractRNG,Int,Nothing}=nothing)

Horizontally split any Tables.jl compatible table into smaller tables (or vectors) t1, t2, ..., tk by making column selections without replacement, successively applying the column name filters f1, f2, ..., fk. A filter is any object f such that f(name) is true or false for each column name::Symbol of table. For example, use the filter _ -> true to pick up all remaining columns of the table.

Whenever a returned table contains a single column, it is converted to a vector unless wrap_singles=true.

Scientific type conversions can optionally be specified (note the semicolon):

unpack(table, f...; col1=>scitype1, col2=>scitype2, ... )

If shuffle=true then the rows of table are first shuffled, using the global RNG, unless rng is specified; if rng is an integer, it is used as the seed of an automatically generated MersenneTwister. If rng is specified then shuffle=true is implicit.

Example

julia> using DataFrames
julia> table = DataFrame(x=[1,2], y=['a', 'b'], z=[10.0, 20.0], w=["A", "B"])
julia> Z, XY = unpack(table, ==(:z), !=(:w);
               :x=>Continuous, :y=>Multiclass)
julia> XY
2×2 DataFrame
│ Row │ x       │ y            │
│     │ Float64 │ Categorical… │
├─────┼─────────┼──────────────┤
│ 1   │ 1.0     │ 'a'          │
│ 2   │ 2.0     │ 'b'          │

julia> Z
2-element Array{Float64,1}:
 10.0
 20.0

Bridging the gap between data type and model requirements

As outlined in Getting Started, it is important that the scientific type of data matches the requirements of the model of interest. For example, while the majority of supervised learning models require input features to be Continuous, newcomers to MLJ are sometimes surprised at the disappointing results of model queries such as this one:

X = (height   = [185, 153, 163, 114, 180],
     time     = [2.3, 4.5, 4.2, 1.8, 7.1],
     mark     = ["D", "A", "C", "B", "A"],
     admitted = ["yes", "no", missing, "yes", missing]);
y = [12.4, 12.5, 12.0, 31.9, 43.0]
models(matching(X, y))
2-element Vector{NamedTuple{(:name, :package_name, …), T} where T<:Tuple}:
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )

Or they are unsure about the source of the following warning:

Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0
tree = Tree();
julia> machine(tree, X, y)
┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeRegressor @378`:                                                                
│ scitype(X) = Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}
│ input_scitype(model) = Table{var"#s46"} where var"#s46"<:Union{AbstractVector{var"#s9"} where var"#s9"<:Continuous, AbstractVector{var"#s9"} where var"#s9"<:Count, AbstractVector{var"#s9"} where var"#s9"<:OrderedFactor}.
└ @ MLJBase ~/Dropbox/Julia7/MLJ/MLJBase/src/machines.jl:103
Machine{DecisionTreeRegressor,…} @198 trained 0 times; caches data
  args: 
    1:  Source @628 ⏎ `Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}`
    2:  Source @544 ⏎ `AbstractVector{Continuous}`

The meaning of the warning is:

  • The input X is a table with column scitypes Continuous, Count, Textual, and Union{Missing, Textual}, as we can also see by inspecting the schema:
schema(X)
┌──────────┬────────────────────────┬─────────────────────────┐
│ _.names  │ _.types                │ _.scitypes              │
├──────────┼────────────────────────┼─────────────────────────┤
│ height   │ Int64                  │ Count                   │
│ time     │ Float64                │ Continuous              │
│ mark     │ String                 │ Textual                 │
│ admitted │ Union{Missing, String} │ Union{Missing, Textual} │
└──────────┴────────────────────────┴─────────────────────────┘
_.nrows = 5
  • The model requires a table whose columns all have element scitypes subtyping Continuous, Count, or OrderedFactor, so the Textual and Union{Missing, Textual} columns are incompatible.
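
The incompatibility can also be confirmed programmatically; the following sketch uses the objects defined above:

julia> scitype(X) <: input_scitype(tree)
false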

Common data preprocessing workflows

There are two tools for addressing data-model type mismatches like the one above, with links to further documentation given below:

Scientific type coercion: We coerce machine types to obtain the intended scientific interpretation. If, in the above example, height is intended to be Continuous, mark is supposed to be OrderedFactor, and admitted a (binary) Multiclass, then we can do

X_coerced = coerce(X, :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass);
schema(X_coerced)
┌──────────┬──────────────────────────────────────────────────┬─────────────────
│ _.names  │ _.types                                          │ _.scitypes     ⋯
├──────────┼──────────────────────────────────────────────────┼─────────────────
│ height   │ Float64                                          │ Continuous     ⋯
│ time     │ Float64                                          │ Continuous     ⋯
│ mark     │ CategoricalValue{String, UInt32}                 │ OrderedFactor{ ⋯
│ admitted │ Union{Missing, CategoricalValue{String, UInt32}} │ Union{Missing, ⋯
└──────────┴──────────────────────────────────────────────────┴─────────────────
                                                                1 column omitted
_.nrows = 5
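
Because mark is now an OrderedFactor, it is worth confirming that the level ordering is the intended one. By default, coerce orders levels lexicographically; they can be reordered as described in Working with Categorical Data. A quick check (levels is provided by CategoricalArrays):

julia> using CategoricalArrays

julia> levels(X_coerced.mark)
4-element Vector{String}:
 "A"
 "B"
 "C"
 "D"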

Data transformations: We carry out conventional data transformations, such as missing value imputation and feature encoding:

imputer = FillImputer()
mach = machine(imputer, X_coerced) |> fit!
X_imputed = transform(mach, X_coerced);
schema(X_imputed)
┌──────────┬──────────────────────────────────┬──────────────────┐
│ _.names  │ _.types                          │ _.scitypes       │
├──────────┼──────────────────────────────────┼──────────────────┤
│ height   │ Float64                          │ Continuous       │
│ time     │ Float64                          │ Continuous       │
│ mark     │ CategoricalValue{String, UInt32} │ OrderedFactor{4} │
│ admitted │ CategoricalValue{String, UInt32} │ Multiclass{2}    │
└──────────┴──────────────────────────────────┴──────────────────┘
_.nrows = 5

encoder = ContinuousEncoder()
mach = machine(encoder, X_imputed) |> fit!
X_encoded = transform(mach, X_imputed)
(height = [185.0, 153.0, 163.0, 114.0, 180.0],
 time = [2.3, 4.5, 4.2, 1.8, 7.1],
 mark = [4.0, 1.0, 3.0, 2.0, 1.0],
 admitted__no = [0.0, 1.0, 0.0, 0.0, 0.0],
 admitted__yes = [1.0, 0.0, 1.0, 1.0, 1.0],)

schema(X_encoded)
┌───────────────┬─────────┬────────────┐
│ _.names       │ _.types │ _.scitypes │
├───────────────┼─────────┼────────────┤
│ height        │ Float64 │ Continuous │
│ time          │ Float64 │ Continuous │
│ mark          │ Float64 │ Continuous │
│ admitted__no  │ Float64 │ Continuous │
│ admitted__yes │ Float64 │ Continuous │
└───────────────┴─────────┴────────────┘
_.nrows = 5

Such transformations can also be combined in a pipeline; see Linear Pipelines.
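
For example, the imputation and encoding steps above might be composed into a single model using the |> pipeline syntax (a sketch):

pipe = FillImputer() |> ContinuousEncoder()

# train the composite model and apply both transformations in one call
mach = machine(pipe, X_coerced) |> fit!
X_ready = transform(mach, X_coerced)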

Scientific type coercion

Scientific type coercion is documented in detail at ScientificTypesBase.jl. See also the tutorial at this MLJ Workshop (specifically, here) and this Data Science in Julia tutorial.

Also relevant is the section, Working with Categorical Data.

Data transformation

MLJ's built-in transformers are documented at Transformers and Other Unsupervised Models. The most relevant in the present context are ContinuousEncoder, OneHotEncoder, FeatureSelector, and FillImputer. A Gaussian mixture model imputer is provided by BetaML, which can be loaded with

MissingImputator = @load MissingImputator pkg=BetaML
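
Once loaded, the imputer is used like any other unsupervised MLJ model. A sketch, assuming X_numeric is a hypothetical table of numeric features containing missing values:

imputer = MissingImputator()                 # default hyperparameters
mach = machine(imputer, X_numeric) |> fit!   # bind to data and train
W = transform(mach, X_numeric)               # missings replaced by model-based estimates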

This MLJ Workshop and the "End-to-end examples" in the Data Science in Julia tutorials give further illustrations of data preprocessing in MLJ.