# Working with Categorical Data

## Scientific types for discrete data

Recall that models articulate their data requirements using scientific types (see Getting Started or the MLJScientificTypes.jl documentation). There are three scientific types discrete data can have: `Count`, `OrderedFactor` and `Multiclass`.

### Count data

In MLJ you cannot use integers to represent (finite) categorical data. Integers are reserved for discrete data you want interpreted as `Count <: Infinite`:

`scitype([1, 4, 5, 6])`

`AbstractArray{Count,1}`

The `Count` scientific type includes things like the number of phone calls, or city populations, and other "frequency" data of a generally unbounded nature.

That said, you may have data that is theoretically `Count`, but which you coerce to `OrderedFactor` to enable the use of more models, relying on your knowledge of how those models work to inform an appropriate interpretation.
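As a minimal sketch of such a coercion (the data is hypothetical, and `using MLJ` is assumed to bring `coerce`, `scitype` and `levels` into scope, as in the other examples in this section):

```julia
using MLJ

# Hypothetical count data: number of phone calls received per day
calls = [0, 3, 1, 3, 7]
scitype(calls) == AbstractVector{Count}                    # true

# Reinterpret the observed values as levels of an ordered factor
calls_factor = coerce(calls, OrderedFactor)
scitype(calls_factor) == AbstractVector{OrderedFactor{4}}  # true
levels(calls_factor)                                       # [0, 1, 3, 7]
```

Only the values actually observed become levels, so a count of 2 turning up in later data would not belong to this pool.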

### OrderedFactor and Multiclass data

Other integer data, such as the number of an animal's legs, or the number of rooms in a home, is generally coerced to `OrderedFactor <: Finite`. The other categorical scientific type is `Multiclass <: Finite`, which is for *unordered* categorical data. Coercing data to one of these two forms is discussed under Detecting and coercing improperly represented categorical data below.
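The practical difference between the two types can be sketched as follows, using hypothetical data and assuming `using MLJ` provides `coerce` and `levels`:

```julia
using MLJ

# Hypothetical data: leg counts have a natural order, colours do not
legs   = coerce([4, 2, 4, 6], OrderedFactor)
colour = coerce(["red", "blue", "red", "green"], Multiclass)

levels(legs)        # [2, 4, 6]
legs[1] < legs[4]   # true: elements of an ordered factor can be compared
levels(colour)      # ["blue", "green", "red"]
```

The levels of `colour` are still listed in a fixed (sorted) sequence, but that sequence carries no meaning for an unordered factor.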

### Binary data

There is no separate scientific type for binary data. Binary data is either `OrderedFactor{2}`, if ordered, or `Multiclass{2}` otherwise. Data with type `OrderedFactor{2}` is considered to have an intrinsic "positive" class, e.g., the outcome of a medical test or the "pass/fail" outcome of an exam. MLJ measures such as `true_positive` assume the *second* class in the ordering is the "positive" class. Inspecting and changing order is discussed in the next section.

If data has type `Bool` it is considered `Count` data (as `Bool <: Integer`), and users will generally want to coerce such data to `Multiclass` or `OrderedFactor`.
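For instance, a `Bool` vector might be coerced as in this small sketch (the data is hypothetical; `using MLJ` is assumed to export the names used):

```julia
using MLJ

passed = [true, false, false, true]              # hypothetical exam outcomes
scitype(passed) == AbstractVector{Count}         # true, since Bool <: Integer

y = coerce(passed, OrderedFactor)
scitype(y) == AbstractVector{OrderedFactor{2}}   # true
levels(y)   # [false, true], so `true` is the second ("positive") class
```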

## Detecting and coercing improperly represented categorical data

One inspects the scientific type of data using `scitype`, as shown above. To inspect all column scientific types in a table simultaneously, use `schema`. (The `scitype(X)` of a table `X` contains a condensed form of this information used in type dispatch.)

```
using MLJ, DataFrames  # MLJ provides schema, coerce and scitype

# A small table whose categorical columns are not yet properly typed
X = DataFrame(
    name = ["Siri", "Robo", "Alexa", "Cortana"],
    gender = ["male", "male", "Female", "female"],
    likes_soup = [true, false, false, true],
    height = [152, missing, 148, 163],
    rating = [2, 5, 2, 1],
    outcome = ["rejected", "accepted", "accepted", "rejected"])
schema(X)
```

```
┌────────────┬───────────────────────┬───────────────────────┐
│ _.names    │ _.types               │ _.scitypes            │
├────────────┼───────────────────────┼───────────────────────┤
│ name       │ String                │ Textual               │
│ gender     │ String                │ Textual               │
│ likes_soup │ Bool                  │ Count                 │
│ height     │ Union{Missing, Int64} │ Union{Missing, Count} │
│ rating     │ Int64                 │ Count                 │
│ outcome    │ String                │ Textual               │
└────────────┴───────────────────────┴───────────────────────┘
_.nrows = 4
```

Coercing a single column:

`X.outcome = coerce(X.outcome, OrderedFactor)`

```
4-element CategoricalArray{String,1,UInt32}:
"rejected"
"accepted"
"accepted"
"rejected"
```

The *machine* type of the result is a `CategoricalArray`. For more on this type see Under the hood: CategoricalValue and CategoricalArray below.

Inspecting the order of the levels:

`levels(X.outcome)`

```
2-element Array{String,1}:
"accepted"
"rejected"
```

Since we wish to regard "accepted" as the positive class, it should appear second, which we correct with the `levels!` function:

```
levels!(X.outcome, ["rejected", "accepted"])
levels(X.outcome)
```

```
2-element Array{String,1}:
"rejected"
"accepted"
```

The order of levels should generally be changed early in your data science workflow and then not again. Similar remarks apply to *adding* levels (which is possible; see the CategoricalArrays.jl documentation). MLJ supervised and unsupervised models assume levels and their order do not change.

Coercing all remaining types simultaneously:

```
Xnew = coerce(X, :gender => Multiclass,
              :likes_soup => OrderedFactor,
              :height => Continuous,
              :rating => OrderedFactor)
schema(Xnew)
```

```
┌────────────┬─────────────────────────────────┬────────────────────────────┐
│ _.names    │ _.types                         │ _.scitypes                 │
├────────────┼─────────────────────────────────┼────────────────────────────┤
│ name       │ String                          │ Textual                    │
│ gender     │ CategoricalValue{String,UInt32} │ Multiclass{3}              │
│ likes_soup │ CategoricalValue{Bool,UInt32}   │ OrderedFactor{2}           │
│ height     │ Union{Missing, Float64}         │ Union{Missing, Continuous} │
│ rating     │ CategoricalValue{Int64,UInt32}  │ OrderedFactor{3}           │
│ outcome    │ CategoricalValue{String,UInt32} │ OrderedFactor{2}           │
└────────────┴─────────────────────────────────┴────────────────────────────┘
_.nrows = 4
```

For `DataFrame`s there is also in-place coercion, using `coerce!`.
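A minimal sketch of in-place coercion, using a small hypothetical table `df` and assuming `coerce!` accepts the same column/scitype pairs as `coerce`:

```julia
using MLJ, DataFrames

# Hypothetical table, independent of `X` above
df = DataFrame(rating = [2, 5, 2, 1],
               likes_soup = [true, false, false, true])

# Mutates `df` itself instead of returning a coerced copy
coerce!(df, :rating => OrderedFactor, :likes_soup => Multiclass)

scitype(df.rating) == AbstractVector{OrderedFactor{3}}    # true
scitype(df.likes_soup) == AbstractVector{Multiclass{2}}   # true
```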

## Tracking all levels

The key property of vectors of scientific type `OrderedFactor` and `Multiclass` is that the pool of all levels is not lost when separating out one or more elements:

`v = Xnew.rating`

```
4-element CategoricalArray{Int64,1,UInt32}:
2
5
2
1
```

`levels(v)`

```
3-element Array{Int64,1}:
1
2
5
```

`levels(v[1:2])`

```
3-element Array{Int64,1}:
1
2
5
```

`levels(v[2])`

```
3-element Array{Int64,1}:
1
2
5
```

By tracking all classes in this way, MLJ avoids common pain points around categorical data, such as evaluating models on an evaluation set, only to crash your code because classes appear there which were not seen during training.
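The following sketch illustrates the point with a hypothetical train/test split in which one class never appears in the training portion (`using MLJ` is assumed to provide `coerce` and `levels`):

```julia
using MLJ

y = coerce(["a", "b", "a", "c"], Multiclass)

# Hypothetical split: "c" does not occur among the training elements
train, test = y[1:2], y[3:4]

levels(train) == ["a", "b", "c"]   # true: "c" is tracked despite being unseen
levels(train) == levels(test)      # true: both share the pool of `y`
```

Had the split been applied to the raw strings before coercion, the training portion would have carried no record of `"c"`.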

## Under the hood: CategoricalValue and CategoricalArray

In MLJ the objects with `OrderedFactor` or `Multiclass` scientific type have machine type `CategoricalValue`, from the CategoricalArrays.jl package. In some sense `CategoricalValue`s are an implementation detail users can ignore for the most part, as shown above. However, you may want some basic understanding of these types, and those implementing MLJ's model interface for new algorithms will have to understand them. For the complete API, see the CategoricalArrays.jl documentation. Here are the basics:

To construct an `OrderedFactor` or `Multiclass` vector directly from raw labels, one uses `categorical`:

```
v = categorical([:A, :B, :A, :A, :C])
typeof(v)
```

`CategoricalArray{Symbol,1,UInt32,Symbol,CategoricalValue{Symbol,UInt32},Union{}}`

(Equivalent to the more idiomatic MLJ call `v = coerce([:A, :B, :A, :A, :C], Multiclass)`.)

`scitype(v)`

`AbstractArray{Multiclass{3},1}`

`v = categorical([:A, :B, :A, :A, :C], ordered=true, compress=true)`

```
5-element CategoricalArray{Symbol,1,UInt8}:
:A
:B
:A
:A
:C
```

`scitype(v)`

`AbstractArray{OrderedFactor{3},1}`

When you index a `CategoricalVector` you don't get a raw label, but instead an instance of `CategoricalValue`. As explained above, this value knows the complete pool of levels of the vector from which it came. Use `get(val)` to extract the raw label from a value `val`.

Despite the distinction that exists between a value (element) and a label, the two are the same from the point of view of `==` and `in`:

```
v[1] == :A # true
:A in v # true
```
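The distinction itself can be seen in a short sketch like the one below, which assumes `using MLJ` and builds the vector with `coerce`:

```julia
using MLJ

v = coerce([:A, :B, :A, :A, :C], Multiclass)
x = v[1]

x == :A        # true: the element compares equal to its raw label
x isa Symbol   # false: the element is a categorical value, not a Symbol
x == v[3]      # true: two elements carrying the same label are equal
```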

## Probabilistic predictions of categorical data

Recall from Getting Started that probabilistic classifiers ordinarily predict `UnivariateFinite` distributions, not raw probabilities (which are instead accessed using the `pdf` method). Here's how to construct such a distribution yourself:

```
v = coerce([:yes, :no, :yes, :yes, :maybe], Multiclass)
d = UnivariateFinite([v[1], v[2]], [0.9, 0.1])
```

`UnivariateFinite{Multiclass{3}}(no=>0.1, yes=>0.9)`

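Alternatively, supply raw labels and specify the pool explicitly (note that the probabilities are assigned the other way around here):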

`d = UnivariateFinite([:no, :yes], [0.9, 0.1], pool=v)`

`UnivariateFinite{Multiclass{3}}(no=>0.9, yes=>0.1)`

This distribution tracks *all* levels, not just the ones to which you have assigned probabilities:

`pdf(d, :maybe)`

`0.0`

However, `pdf(d, :dunno)` will throw an error.
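Putting these queries together in one self-contained sketch (assuming `using MLJ` exports `UnivariateFinite`, `pdf` and `mode`; the printed message is illustrative only):

```julia
using MLJ

v = coerce([:yes, :no, :yes, :yes, :maybe], Multiclass)
d = UnivariateFinite([:no, :yes], [0.9, 0.1], pool=v)

pdf(d, :no)      # ≈ 0.9
pdf(d, :maybe)   # 0.0: in the pool, but assigned no probability
mode(d) == :no   # true: the most probable class

# A label outside the pool is rejected rather than given probability zero
try
    pdf(d, :dunno)
catch err
    println("pdf failed for :dunno with ", typeof(err))
end
```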

You can declare `pool=missing`, but then `:maybe` will not be tracked:

```
d = UnivariateFinite([:no, :yes], [0.9, 0.1], pool=missing)
levels(d)
```

```
2-element Array{Symbol,1}:
:no
:yes
```

To construct a whole *vector* of `UnivariateFinite` distributions, simply give the constructor a matrix of probabilities:

```
yes_probs = rand(5)
probs = hcat(1 .- yes_probs, yes_probs)
d_vec = UnivariateFinite([:no, :yes], probs, pool=v)
```

```
5-element MLJBase.UnivariateFiniteArray{Multiclass{3},Symbol,UInt32,Float64,1}:
UnivariateFinite{Multiclass{3}}(no=>0.254, yes=>0.746)
UnivariateFinite{Multiclass{3}}(no=>0.729, yes=>0.271)
UnivariateFinite{Multiclass{3}}(no=>0.918, yes=>0.0823)
UnivariateFinite{Multiclass{3}}(no=>0.308, yes=>0.692)
UnivariateFinite{Multiclass{3}}(no=>0.619, yes=>0.381)
```

Or, equivalently:

`d_vec = UnivariateFinite([:no, :yes], yes_probs, augment=true, pool=v)`
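To check what `augment=true` does, the sketch below uses fixed probabilities in place of `rand(5)` and reads them back by broadcasting `pdf`. It assumes that broadcasting `pdf` over the resulting vector with a raw label is supported in your MLJ version:

```julia
using MLJ

v = coerce([:yes, :no, :yes, :yes, :maybe], Multiclass)
yes_probs = [0.2, 0.7, 0.5]   # fixed values, chosen for reproducibility

# Only the :yes probabilities are supplied; :no is filled in as 1 - p
d_vec = UnivariateFinite([:no, :yes], yes_probs, augment=true, pool=v)

pdf.(d_vec, :yes)   # ≈ [0.2, 0.7, 0.5]
pdf.(d_vec, :no)    # ≈ [0.8, 0.3, 0.5]
```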

For more options, see `UnivariateFinite`.