# ScientificTypes.jl

A light-weight Julia interface for implementing conventions about the scientific interpretation of data, and for performing type coercions enforcing those conventions.

The package distinguishes between **machine type** and **scientific type**:

- the *machine type* is the Julia type the data is currently encoded as (for instance, `Float64`)
- the *scientific type* is a type defined by this package which encapsulates how the data should be *interpreted* in the rest of the code (for instance, `Continuous` or `Multiclass`)

As a motivating example, the data might contain a column corresponding to a *number of transactions*. The machine type in that case could be an `Int`, whereas the scientific type would be a `Count`.

The usefulness of this machinery becomes evident when the machine type does not directly connect with a scientific type; taking the previous example, the data could have been encoded as a `Float64` whereas the meaning should still be a `Count`.

## Features

The package `ScientificTypes` provides:

- A hierarchy of new Julia types representing scientific data types for use in method dispatch (eg, for trait values). Instances of the types play no role:

```
Found
├─ Known
│ ├─ Finite
│ │ ├─ Multiclass
│ │ └─ OrderedFactor
│ ├─ Infinite
│ │ ├─ Continuous
│ │ └─ Count
│ ├─ Image
│ │ ├─ ColorImage
│ │ └─ GrayImage
│ ├─ Textual
│ └─ Table
└─ Unknown
```

- A single method `scitype` for articulating a convention about what scientific type each Julia object can represent. For example, one might declare `scitype(::AbstractFloat) = Continuous`.
- A default convention called *MLJ*, based on dependencies `CategoricalArrays`, `ColorTypes`, and `Tables`, which includes a convenience method `coerce` for performing scientific type coercion on `AbstractVectors` and columns of tabular data (any table implementing the Tables.jl interface).
- A `schema` method for tabular data, based on the optional Tables dependency, for inspecting the machine and scientific types of tabular data, in addition to column names and number of rows.
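To make the first two points concrete, here is a minimal, package-free sketch of how a type hierarchy and a `scitype` convention can drive method dispatch. The module `ToySciTypes` and the helper `describe` are hypothetical names introduced for illustration; only a fragment of the tree is reproduced, and this is not the package's actual source.

```julia
# Toy re-creation of part of the hierarchy shown above (illustrative only).
module ToySciTypes

abstract type Found end
abstract type Known <: Found end
abstract type Unknown <: Found end
abstract type Infinite <: Known end
abstract type Continuous <: Infinite end
abstract type Count <: Infinite end

# A convention is a set of methods mapping objects to scientific types.
scitype(::Any) = Unknown              # fallback
scitype(::AbstractFloat) = Continuous
scitype(::Integer) = Count

# The types are used as trait values for dispatch; they are never instantiated.
describe(x) = describe(scitype(x), x)
describe(::Type{<:Infinite}, x) = "numeric quantity"
describe(::Type{Unknown}, x) = "no interpretation available"

end # module

ToySciTypes.scitype(3.14)       # ToySciTypes.Continuous
ToySciTypes.describe(42)        # "numeric quantity"
ToySciTypes.describe("hello")   # "no interpretation available"
```

Because `describe` dispatches on `Type{<:Infinite}`, a single method covers both `Continuous` and `Count` data.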

## Getting started

The package is registered and can be installed via the package manager with `add ScientificTypes`.

To get the scientific type of a Julia object according to the convention in use, call `scitype`:

`scitype(3.14)`

`Continuous`

For a vector, you can use `scitype` or `scitype_union` (which will give you a scitype corresponding to the elements):

`scitype([1,2,3,missing])`

`AbstractArray{Union{Missing, Count},1}`

`scitype_union([1,2,3,missing])`

`Union{Missing, Count}`

### Type coercion work-flow for tabular data

The standard workflow involves the following two steps:

- inspect the `schema` of the data, and the `scitypes` in particular
- provide pairs or a dictionary with column names and scitypes for any changes you may want, and coerce the data to those scitypes

```
using ScientificTypes, DataFrames, Tables
X = DataFrame(
    name=["Siri", "Robo", "Alexa", "Cortana"],
    height=[152, missing, 148, 163],
    rating=[1, 5, 2, 1])
schema(X)
```

```
_.table =
┌─────────┬───────────────────────┬───────────────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼───────────────────────┼───────────────────────┤
│ name │ String │ Textual │
│ height │ Union{Missing, Int64} │ Union{Missing, Count} │
│ rating │ Int64 │ Count │
└─────────┴───────────────────────┴───────────────────────┘
_.nrows = 4
```

To inspect only the scitypes:

`schema(X).scitypes`

`(Textual, Union{Missing, Count}, Count)`

In this case, however, you may want to map the names to `Multiclass`, the height to `Continuous` and the ratings to `OrderedFactor`. To do so:

```
Xfixed = coerce(X, :name=>Multiclass,
:height=>Continuous,
:rating=>OrderedFactor)
schema(Xfixed).scitypes
```

`(Multiclass{4}, Union{Missing, Continuous}, OrderedFactor{3})`

Note that, as it encountered missing values in `height`, it coerced the type to `Union{Missing,Continuous}`.

One can also make a replacement based on existing scientific type, instead of feature name:

```
X = (x = [1, 2, 3],
y = rand(3),
z = [10, 20, 30])
Xfixed = coerce(X, Count=>Continuous)
schema(Xfixed).scitypes
```

`(Continuous, Continuous, Continuous)`

Finally, there is a `coerce!` method that does in-place coercion provided the data structure allows it (at the moment only `DataFrames.DataFrame` is supported).

## Notes

- We regard the built-in Julia type `Missing` as a scientific type. The new scientific types introduced in the current package are rooted in the abstract type `Found` (see tree above) and we export the alias `Scientific = Union{Missing, Found}`.
- `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all parametrised by the number of levels `N`. We export the alias `Binary = Finite{2}`.
- `Image{W,H}`, `GrayImage{W,H}` and `ColorImage{W,H}` are all parametrised by the image width and height dimensions, `(W, H)`.
- The function `scitype` has the fallback value `Unknown`.
- Since Tables is an optional dependency, the `scitype` of a `Tables.jl` supported table is `Unknown` unless Tables has been imported.
- Developers can define their own conventions using the code in `src/conventions/mlj/` as a template. The active convention is controlled by the value of `ScientificTypes.CONVENTION[1]`.

## Special note on binary data

ScientificTypes does not define a separate "binary" scientific type. Rather, when binary data has an intrinsic "true" class (for example pass/fail in a product test), then it should be assigned an `OrderedFactor{2}` scitype, while data with no such class (e.g., gender) should be assigned a `Multiclass{2}` scitype. In the former case we recommend that the "true" class come after "false" in the ordering (corresponding to the usual assignment "false=0" and "true=1"). Of course, `Finite{2}` covers both cases of binary data.
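As a rough illustration of the recommended ordering, the sketch below builds an ordered categorical vector with CategoricalArrays and places the "true" class last. The variable name `results` and the pass/fail data are made up for the example.

```julia
using CategoricalArrays

# Hypothetical pass/fail outcomes from a product test.
results = categorical(["pass", "fail", "pass", "pass"], ordered=true)

# Set the level order explicitly so the "true" class ("pass") comes last,
# rather than relying on the default ordering of the labels.
levels!(results, ["fail", "pass"])

levels(results)            # ["fail", "pass"]
results[2] < results[1]    # true, since "fail" precedes "pass" in the ordering
```

Setting the levels explicitly is a design choice that guards against labels whose default order does not match the intended false/true order.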

## Detailed usage examples

```
using ScientificTypes
# activate a convention
ScientificTypes.set_convention(MLJ) # redundant as it's the default
scitype((2.718, 42))
```

Let's try with categorical valued objects:

```
using CategoricalArrays
v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])
```

`OrderedFactor{3}`

and

`scitype_union(v)`

`Union{Missing, OrderedFactor{3}}`

You could coerce this to `Multiclass`:

```
w = coerce(v, Multiclass)
scitype_union(w)
```

`Union{Missing, Multiclass{3}}`

### Working with tables

```
using Tables
data = (x1=rand(10), x2=rand(10), x3=collect(1:10))
scitype(data)
```

`Table{Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}}}`

You can also use `schema`:

`schema(data)`

```
_.table =
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Int64 │ Count │
└─────────┴─────────┴────────────┘
_.nrows = 10
```

and use `<:` for type checks:

`scitype(data) <: Table(Continuous)`

`false`

`scitype(data) <: Table(Infinite)`

`true`

or specify multiple types directly:

```
data = (x=rand(10), y=collect(1:10), z = [1,2,3,1,2,3,1,2,3,1])
data = coerce(data, :z=>OrderedFactor)
scitype(data) <: Table(Continuous,Count,OrderedFactor)
```

`true`

### The scientific type of tuples, arrays and tables

**Important Definition 1** Under any convention, the scitype of a tuple is a `Tuple` type parametrised by scientific types:

`scitype((1, 4.5))`

`Tuple{Count,Continuous}`

**Important Definition 2** The scitype of an `AbstractArray`, `A`, is always `AbstractArray{U}` where `U` is the union of the scitypes of the elements of `A`, with one exception: if `typeof(A) <: AbstractArray{Union{Missing,T}}` for some `T` different from `Any`, then the scitype of `A` is `AbstractArray{Union{Missing, U}}`, where `U` is the union over all non-missing elements, **even if `A` has no missing elements**.

This exception is made for performance reasons. If one wants to override it, one uses `scitype(A, tight=true)`.

```
v = [1.3, 4.5, missing]
scitype(v)
```

`AbstractArray{Union{Missing, Continuous},1}`

`scitype(v[1:2])`

`AbstractArray{Union{Missing, Continuous},1}`

`scitype(v[1:2], tight=true)`

`AbstractArray{Continuous,1}`
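The behaviour above mirrors how plain Julia treats element types. The following sketch uses only Base and shows that slicing away the missing value does not narrow the eltype; presumably this is why a non-tight scitype can be cheaper than a tight one, which has to account for the actual elements.

```julia
v = [1.3, 4.5, missing]
w = v[1:2]                      # no missing values left in the slice

eltype(w)                       # Union{Missing, Float64}
any(ismissing, w)               # false
nonmissingtype(eltype(w))       # Float64
```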

*Performance note:* Computing type unions over large arrays is expensive and, depending on the convention's implementation and the array eltype, computing the scitype can be slow. In the *MLJ* convention this is mitigated with the help of the `ScientificTypes.Scitype` method, of which other conventions could make use. Do `?ScientificTypes.Scitype` for details. An eltype of `Any` may lead to poor performance, and you may want to consider replacing an array `A` with `broadcast(identity, A)` to collapse the eltype and speed up the computation.
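The effect of `broadcast(identity, A)` can be checked with Base Julia alone. In this small sketch the array contents are arbitrary:

```julia
A = Any[1.0, 2.5, 3.0]           # every element is a Float64, but eltype is Any
B = broadcast(identity, A)       # same values, rebuilt with a narrower eltype

eltype(A)    # Any
eltype(B)    # Float64
A == B       # true
```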

Provided the Tables.jl package is loaded, any table implementing the Tables interface has a scitype encoding the scitypes of its columns:

```
using CategoricalArrays, Tables
X = (x1=rand(10),
x2=rand(10),
x3=categorical(rand("abc", 10)),
x4=categorical(rand("01", 10)))
schema(X)
```

```
_.table =
┌─────────┬─────────────────────────────────────────────────┬───────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────────────────────────────────────────────┼───────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ CategoricalArrays.CategoricalValue{Char,UInt32} │ Multiclass{3} │
│ x4 │ CategoricalArrays.CategoricalValue{Char,UInt32} │ Multiclass{2} │
└─────────┴─────────────────────────────────────────────────┴───────────────┘
_.nrows = 10
```

**Important Definition 3** Specifically, if `X` has columns `c1, ..., cn`, then

`scitype(X) == Table{Union{scitype(c1), ..., scitype(cn)}}`

With this definition, common type checks can be performed with tables. For instance, you could check that each column of `X` has an element scitype that is either `Continuous` or `Finite`:

`scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}`

`true`

A built-in `Table` constructor provides a shorthand for the right-hand side:

`scitype(X) <: Table(Continuous, Finite)`

`true`

Note that `Table(Continuous,Finite)` is a *type* union and not a `Table` *instance*.
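To see why such checks reduce to ordinary Julia subtyping, here is a package-free sketch. The module `ToyTables`, the type `ToyTable` and the helper `toy_table` are hypothetical stand-ins for illustration; the package's own `Table` constructor may be implemented differently.

```julia
module ToyTables

abstract type Infinite end
abstract type Continuous <: Infinite end
abstract type Count <: Infinite end

# Stand-in for a table scitype parametrised by the union of column scitypes.
abstract type ToyTable{K} end

# Returns a *type* bounding the admissible column scitypes, not an instance.
toy_table(Ts...) = ToyTable{<:Union{map(T -> AbstractVector{<:T}, Ts)...}}

end # module

using .ToyTables: ToyTable, toy_table, Infinite, Continuous, Count

# Scitype of a table with one Continuous column and one Count column.
S = ToyTable{Union{AbstractVector{Continuous}, AbstractVector{Count}}}

S <: toy_table(Continuous)          # false: the Count column is not Continuous
S <: toy_table(Infinite)            # true: both columns are Infinite
S <: toy_table(Continuous, Count)   # true
```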

## The MLJ convention

The table below summarises the *MLJ* convention for representing scientific types:

Type `T` | `scitype(x)` for `x::T` | package required |
---|---|---|
`Missing` | `Missing` | |
`AbstractFloat` | `Continuous` | |
`Integer` | `Count` | |
`String` | `Textual` | |
`CategoricalValue` | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false` | CategoricalArrays |
`CategoricalString` | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false` | CategoricalArrays |
`CategoricalValue` | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true` | CategoricalArrays |
`CategoricalString` | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true` | CategoricalArrays |
`AbstractArray{<:Gray,2}` | `GrayImage{W,H}` where `(W, H) = size(x)` | ColorTypes |
`AbstractArray{<:AbstractRGB,2}` | `ColorImage{W,H}` where `(W, H) = size(x)` | ColorTypes |
any table type `T` supported by Tables.jl | `Table{K}` where `K=Union{column_scitypes...}` | Tables |

Here `nlevels(x) = length(levels(x.pool))`.

## Automatic type conversion for tabular data

The `autotype` function applies specific rules to guess appropriate scientific types for the data. Such rules would typically be more constraining than the ones implied by the active convention. When `autotype` is used, a dictionary of suggested types is returned for each column in the data; if none of the specified rules apply, the ambient convention is used as a fallback.

The function is called as:

`autotype(X)`

If the keyword `only_changes` is set to `true`, then only the column names for which the suggested type is different from that provided by the convention are returned.

`autotype(X; only_changes=true)`

To specify which rules are to be applied, use the `rules` keyword and specify a tuple of symbols referring to specific rules. The default rule is `:few_to_finite`, which applies a heuristic to columns that have relatively few values and encodes them with an appropriate `Finite` type. The order in which the rules are specified matters: rules are applied in that order.

`autotype(X; rules=(:few_to_finite,))`

Finally, you can also use the following shorthands:

```
autotype(X, :few_to_finite)
autotype(X, (:few_to_finite, :discrete_to_continuous))
```

### Available rules

Rule symbol | scitype suggestion |
---|---|
`:few_to_finite` | an appropriate `Finite` subtype for columns with few distinct values |
`:discrete_to_continuous` | if not `Finite`, then `Continuous` for any `Count` or `Integer` scitypes/types |
`:string_to_multiclass` | `Multiclass` for any string-like column |

Autotype can be used in conjunction with `coerce`:

`X_coerced = coerce(X, autotype(X))`

### Examples

By default, `autotype` applies only the `:few_to_finite` rule:

```
n = 50
X = (a = rand("abc", n), # 3 values, not number --> Multiclass
b = rand([1,2,3,4], n), # 4 values, number --> OrderedFactor
c = rand([true,false], n), # 2 values, number but only 2 --> Multiclass
d = randn(n), # many values --> unchanged
e = rand(collect(1:n), n)) # many values --> unchanged
autotype(X, only_changes=true)
```

```
Dict{Symbol,Type{#s18} where #s18<:Union{Missing, Found}} with 3 entries:
:a => Multiclass
:b => OrderedFactor
:c => OrderedFactor
```

For example, we could first apply the `:discrete_to_continuous` rule, followed by the `:few_to_finite` rule. The first rule will apply to `b` and `e`, but the subsequent application of the second rule means we get the same result as before apart from `e` (which will be `Continuous`):

`autotype(X, only_changes=true, rules=(:discrete_to_continuous, :few_to_finite))`

```
Dict{Symbol,Type{#s18} where #s18<:Union{Missing, Found}} with 4 entries:
:a => Multiclass
:b => OrderedFactor
:e => Continuous
:c => OrderedFactor
```

One should check and possibly modify the returned dictionary before passing it to `coerce`.