ScientificTypes.jl

A light-weight Julia interface for implementing conventions about the scientific interpretation of data, and for performing type coercions enforcing those conventions.

The package distinguishes between machine type and scientific type:

  • the machine type is a Julia type the data is currently encoded as (for instance: Float64)
  • the scientific type is a type defined by this package which encapsulates how the data should be interpreted in the rest of the code (for instance: Continuous or Multiclass)

As a motivating example, the data might contain a column recording a number of transactions. The machine type in that case could be an Int, whereas the scientific type would be Count.

The usefulness of this machinery becomes evident when the machine type does not directly connect with a scientific type; taking the previous example, the data could have been encoded as a Float64 whereas the meaning should still be a Count.

Features

The package ScientificTypes provides:

  • A hierarchy of new Julia types representing scientific data types for use in method dispatch (eg, for trait values). Instances of the types play no role:
Found
├─ Known
│  ├─ Finite
│  │  ├─ Multiclass
│  │  └─ OrderedFactor
│  ├─ Infinite
│  │  ├─ Continuous
│  │  └─ Count
│  ├─ Image
│  │  ├─ ColorImage
│  │  └─ GrayImage
│  ├─ Textual
│  └─ Table
└─ Unknown
  • A single method scitype for articulating a convention about what scientific type each Julia object can represent. For example, one might declare scitype(::AbstractFloat) = Continuous.
  • A default convention called MLJ, based on dependencies CategoricalArrays, ColorTypes, and Tables, which includes a convenience method coerce for performing scientific type coercion on AbstractVectors and columns of tabular data (any table implementing the Tables.jl interface).
  • A schema method for tabular data, based on the optional Tables dependency, for inspecting the machine and scientific types of tabular data, in addition to column names and number of rows.
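
To make the dispatch idea concrete, here is a minimal, self-contained sketch in plain Julia. The Toy* types and the functions toy_scitype and describe are hypothetical names invented for illustration; they are not the package's source and mirror only a small part of the tree above.

```julia
# Hypothetical stand-ins for a few nodes of the hierarchy (illustration only).
abstract type ToyFound end
abstract type ToyKnown <: ToyFound end
abstract type ToyInfinite <: ToyKnown end
abstract type ToyContinuous <: ToyInfinite end
abstract type ToyCount <: ToyInfinite end
abstract type ToyUnknown <: ToyFound end

# A toy convention: one method per machine type, plus a fallback.
toy_scitype(::Any) = ToyUnknown
toy_scitype(::AbstractFloat) = ToyContinuous
toy_scitype(::Integer) = ToyCount

# The types are used purely for dispatch; no instances are ever created.
describe(::Type{<:ToyInfinite}) = "numeric"
describe(::Type{<:ToyFound}) = "other"

describe(toy_scitype(3.14))   # dispatches on ToyContinuous
describe(toy_scitype(7))      # dispatches on ToyCount
describe(toy_scitype("abc"))  # falls back to ToyUnknown
```

Because a convention is just a set of methods on one function, changing how a machine type is interpreted amounts to changing a single method definition.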

Getting started

The package is registered and can be installed via the package manager with add ScientificTypes.

To get the scientific type of a Julia object according to the convention in use, call scitype:

scitype(3.14)
Continuous

For a vector, you can use scitype or scitype_union (which will give you a scitype corresponding to the elements):

scitype([1,2,3,missing])
AbstractArray{Union{Missing, Count},1}
scitype_union([1,2,3,missing])
Union{Missing, Count}

Type coercion work-flow for tabular data

The standard workflow involves the following two steps:

  1. inspect the schema of the data and the scitypes in particular
  2. provide pairs or a dictionary mapping column names to scitypes for any changes you want, and coerce the data to those scitypes

using DataFrames, Tables
X = DataFrame(
     name=["Siri", "Robo", "Alexa", "Cortana"],
     height=[152, missing, 148, 163],
     rating=[1, 5, 2, 1])
schema(X)
_.table = 
┌─────────┬───────────────────────┬───────────────────────┐
│ _.names │ _.types               │ _.scitypes            │
├─────────┼───────────────────────┼───────────────────────┤
│ name    │ String                │ Textual               │
│ height  │ Union{Missing, Int64} │ Union{Missing, Count} │
│ rating  │ Int64                 │ Count                 │
└─────────┴───────────────────────┴───────────────────────┘
_.nrows = 4

inspecting the scitypes:

schema(X).scitypes
(Textual, Union{Missing, Count}, Count)

but in this case you may want to map the names to Multiclass, the height to Continuous and the ratings to OrderedFactor; to do so:

Xfixed = coerce(X, :name=>Multiclass,
                   :height=>Continuous,
                   :rating=>OrderedFactor)
schema(Xfixed).scitypes
(Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})

Note that, because height contains missing values, its scitype was coerced to Union{Missing,Continuous}.

One can also make a replacement based on existing scientific type, instead of feature name:

X  = (x = [1, 2, 3],
      y = rand(3),
      z = [10, 20, 30])
Xfixed = coerce(X, Count=>Continuous)
schema(Xfixed).scitypes
(Continuous, Continuous, Continuous)

Finally there is a coerce! method that does in-place coercion provided the data structure allows it (at the moment only DataFrames.DataFrame is supported).

Notes

  • We regard the built-in Julia type Missing as a scientific type. The new scientific types introduced in the current package are rooted in the abstract type Found (see tree above) and we export the alias Scientific = Union{Missing, Found}.
  • Finite{N}, Multiclass{N} and OrderedFactor{N} are all parametrised by the number of levels N. We export the alias Binary = Finite{2}.
  • Image{W,H}, GrayImage{W,H} and ColorImage{W,H} are all parametrised by the image width and height dimensions, (W, H).
  • The function scitype has the fallback value Unknown.
  • Since Tables is an optional dependency, the scitype of a Tables.jl supported table is Unknown unless Tables has been imported.
  • Developers can define their own conventions using the code in src/conventions/mlj/ as a template. The active convention is controlled by the value of ScientificTypes.CONVENTION[1].

Special note on binary data

ScientificTypes does not define a separate "binary" scientific type. Rather, when binary data has an intrinsic "true" class (for example pass/fail in a product test), then it should be assigned an OrderedFactor{2} scitype, while data with no such class (e.g., gender) should be assigned a Multiclass{2} scitype. In the former case we recommend that the "true" class come after "false" in the ordering (corresponding to the usual assignment "false=0" and "true=1"). Of course, Finite{2} covers both cases of binary data.
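
The recommendation above could be followed along these lines. This is a sketch assuming the MLJ convention with CategoricalArrays loaded; the variable names and data are made up for illustration.

```julia
using ScientificTypes, CategoricalArrays

# Pass/fail has an intrinsic "true" class, so make the vector ordered
# and put the "true" class last.
result = categorical(["fail", "pass", "pass", "fail"], ordered=true)
levels!(result, ["fail", "pass"])

# Two classes with no natural "true" class: leave the vector unordered.
colour = categorical(["red", "blue", "blue", "red"])

scitype(result[1]) <: OrderedFactor{2}
scitype(colour[1]) <: Multiclass{2}

# Both cases are covered by the Binary alias.
scitype(result[1]) <: Binary && scitype(colour[1]) <: Binary
```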

Detailed usage examples

using ScientificTypes
# activate a convention
ScientificTypes.set_convention(MLJ) # redundant as it's the default

scitype((2.718, 42))
Tuple{Continuous,Count}

Let's try with categorical valued objects:

using CategoricalArrays
v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])
OrderedFactor{3}

and

scitype_union(v)
Union{Missing, OrderedFactor{3}}

you could coerce this to Multiclass:

w = coerce(v, Multiclass)
scitype_union(w)
Union{Missing, Multiclass{3}}

Working with tables

using Tables
data = (x1=rand(10), x2=rand(10), x3=collect(1:10))
scitype(data)
Table{Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}}}

you can also use schema:

schema(data)
_.table = 
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
│ x3      │ Int64   │ Count      │
└─────────┴─────────┴────────────┘
_.nrows = 10

and use <: for type checks:

scitype(data) <: Table(Continuous)
false
scitype(data) <: Table(Infinite)
true

or specify multiple types directly:

data = (x=rand(10), y=collect(1:10), z = [1,2,3,1,2,3,1,2,3,1])
data = coerce(data, :z=>OrderedFactor)
scitype(data) <: Table(Continuous,Count,OrderedFactor)
true

The scientific type of tuples, arrays and tables

Important Definition 1. Under any convention, the scitype of a tuple is a Tuple type parametrised by scientific types:

scitype((1, 4.5))
Tuple{Count,Continuous}

Important Definition 2. The scitype of an AbstractArray, A, is always AbstractArray{U} where U is the union of the scitypes of the elements of A, with one exception: If typeof(A) <: AbstractArray{Union{Missing,T}} for some T different from Any, then the scitype of A is AbstractArray{Union{Missing, U}}, where U is the union over all non-missing elements, even if A has no missing elements.

This exception is made for performance reasons. If one wants to override it, one uses scitype(A, tight=true).

v = [1.3, 4.5, missing]
scitype(v)
AbstractArray{Union{Missing, Continuous},1}
scitype(v[1:2])
AbstractArray{Union{Missing, Continuous},1}
scitype(v[1:2], tight=true)
AbstractArray{Continuous,1}

Performance note: Computing type unions over large arrays is expensive and, depending on the convention's implementation and the array eltype, computing the scitype can be slow. In the MLJ convention this is mitigated with the help of the ScientificTypes.Scitype method, of which other conventions could make use. Do ?ScientificTypes.Scitype for details. An eltype of Any may lead to poor performance; consider replacing an array A with broadcast(identity, A) to collapse the eltype and speed up the computation.
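
The eltype-collapsing trick from the performance note needs only base Julia. A small sketch with made-up data:

```julia
# An array whose elements are all Float64 but whose eltype is Any.
A = Any[1.5, 2.0, 3.25]

# Broadcasting identity rebuilds the array with a concrete eltype.
B = broadcast(identity, A)

eltype(A)   # Any
eltype(B)   # Float64
```

The contents are unchanged; only the container's element type is narrowed, which is what lets later type computations avoid inspecting every element.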

Provided the Tables.jl package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:

using CategoricalArrays, Tables
X = (x1=rand(10),
     x2=rand(10),
     x3=categorical(rand("abc", 10)),
     x4=categorical(rand("01", 10)))
schema(X)
_.table = 
┌─────────┬─────────────────────────────────────────────────┬───────────────┐
│ _.names │ _.types                                         │ _.scitypes    │
├─────────┼─────────────────────────────────────────────────┼───────────────┤
│ x1      │ Float64                                         │ Continuous    │
│ x2      │ Float64                                         │ Continuous    │
│ x3      │ CategoricalArrays.CategoricalValue{Char,UInt32} │ Multiclass{3} │
│ x4      │ CategoricalArrays.CategoricalValue{Char,UInt32} │ Multiclass{2} │
└─────────┴─────────────────────────────────────────────────┴───────────────┘
_.nrows = 10

Important Definition 3. Specifically, if X has columns c1, ..., cn, then

scitype(X) == Table{Union{scitype(c1), ..., scitype(cn)}}

With this definition, common type checks can be performed with tables. For instance, you could check that each column of X has an element scitype that is either Continuous or Finite:

scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
true

A built-in Table constructor provides a shorthand for the right-hand side:

scitype(X) <: Table(Continuous, Finite)
true

Note that Table(Continuous,Finite) is a type union and not a Table instance.

The MLJ convention

The table below summarises the MLJ convention for representing scientific types:

Type T | scitype(x) for x::T | package required
Missing | Missing |
AbstractFloat | Continuous |
Integer | Count |
String | Textual |
CategoricalValue | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays
CategoricalString | Multiclass{N} where N = nlevels(x), provided x.pool.ordered == false | CategoricalArrays
CategoricalValue | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays
CategoricalString | OrderedFactor{N} where N = nlevels(x), provided x.pool.ordered == true | CategoricalArrays
AbstractArray{<:Gray,2} | GrayImage{W,H} where (W, H) = size(x) | ColorTypes
AbstractArray{<:AbstractRGB,2} | ColorImage{W,H} where (W, H) = size(x) | ColorTypes
any table type T supported by Tables.jl | Table{K} where K=Union{column_scitypes...} | Tables

Here nlevels(x) = length(levels(x.pool)).
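
One consequence of counting levels in the pool is that the parameter N need not equal the number of classes actually present in a given slice. A short sketch, assuming the MLJ convention with CategoricalArrays loaded (the data is made up):

```julia
using ScientificTypes, CategoricalArrays

v = categorical(["a", "b", "c", "a"])
w = v[1:2]            # only "a" and "b" occur in w

length(levels(w))     # the slice keeps all levels of the original pool
scitype(w[1])         # parametrised by the pool's level count, not by what occurs in w
```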

Automatic type conversion for tabular data

The autotype function applies specific rules to guess appropriate scientific types for the data. Such rules are typically more constraining than those implied by the active convention. autotype returns a dictionary giving a suggested type for each column in the data; if none of the specified rules applies, the ambient convention is used as a fallback.

The function is called as:

autotype(X)

If the keyword only_changes is set to true, then only the columns whose suggested type differs from that provided by the convention are returned.

autotype(X; only_changes=true)

To specify which rules are applied, use the rules keyword with a tuple of symbols referring to specific rules. The default rule is :few_to_finite, which applies a heuristic to columns with relatively few distinct values and encodes them with an appropriate Finite type. Note that the order in which the rules are specified matters: rules are applied in that order.

autotype(X; rules=(:few_to_finite,))

Finally, you can also use the following shorthands:

autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))

Available rules

Rule symbol | scitype suggestion
:few_to_finite | an appropriate Finite subtype for columns with few distinct values
:discrete_to_continuous | if not Finite, then Continuous for any Count or Integer scitypes/types
:string_to_multiclass | Multiclass for any string-like column

Autotype can be used in conjunction with coerce:

X_coerced = coerce(X, autotype(X))

Examples

By default, autotype applies only the :few_to_finite rule:

n = 50
X = (a = rand("abc", n),         # 3 values, not number        --> Multiclass
     b = rand([1,2,3,4], n),     # 4 values, number            --> OrderedFactor
     c = rand([true,false], n),  # 2 values, number but only 2 --> Multiclass
     d = randn(n),               # many values                 --> unchanged
     e = rand(collect(1:n), n))  # many values                 --> unchanged
autotype(X, only_changes=true)
Dict{Symbol,Type{#s18} where #s18<:Union{Missing, Found}} with 3 entries:
  :a => Multiclass
  :b => OrderedFactor
  :c => OrderedFactor

For example, we could first apply the :discrete_to_continuous rule, followed by the :few_to_finite rule. The first rule applies to b and e, but once the second rule is applied the result is the same as before except for e (which will be Continuous):

autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))
Dict{Symbol,Type{#s18} where #s18<:Union{Missing, Found}} with 4 entries:
  :a => Multiclass
  :b => OrderedFactor
  :e => Continuous
  :c => OrderedFactor

One should check and possibly modify the returned dictionary before passing to coerce.
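
A possible check-then-coerce workflow is sketched below. The data and the override for column c are illustrative choices, and which columns autotype flags depends on its heuristics, so no particular suggestion is assumed here.

```julia
using ScientificTypes, Tables

X = (a = ["x", "y", "z", "x", "y", "z", "x", "y"],
     c = [true, false, true, true, false, false, true, false])

# Suggested changes only, as a dictionary of column name => scitype.
suggestions = autotype(X, only_changes=true)

# Inspect the suggestions, then override one by hand before coercing.
suggestions[:c] = Multiclass

Xc = coerce(X, suggestions)
schema(Xc).scitypes
```

Overriding an entry in the dictionary keeps the remaining suggestions intact, so only the columns you disagree with need manual attention.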