Handling categorical data

This tutorial loosely follows the CategoricalArrays.jl documentation.

Defining a categorical vector

using CategoricalArrays

v = categorical(["AA", "BB", "CC", "AA", "BB", "CC"])
6-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "AA"
 "BB"
 "CC"
 "AA"
 "BB"
 "CC"

This creates a categorical vector, i.e. a vector whose entries each represent a group or category. You can retrieve the group labels using levels:

levels(v)
3-element Array{String,1}:
 "AA"
 "BB"
 "CC"

By default, levels returns the labels in sorted order, which for strings is lexicographic order.
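To make the link between entries and levels concrete, here is a small sketch using levelcode, which gives the integer position of each entry within levels(v). The variable name codes is only for illustration.

```julia
using CategoricalArrays

v = categorical(["AA", "BB", "CC", "AA", "BB", "CC"])

# Each entry is stored as a reference to one of the levels;
# levelcode gives the position of that level in levels(v).
codes = levelcode.(v)

# Pair each label with its code to see the mapping
for (label, code) in zip(levels(v), 1:length(levels(v)))
    println(label, " => ", code)
end
```

With the default sorted levels, "AA" maps to 1, "BB" to 2 and "CC" to 3.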

Working with categoricals

Ordered categoricals

You can declare that the categories are ordered by passing ordered=true; the order then follows that of the levels. If you wish to change that order, you need to use the levels! function. Let's see two examples.

v = categorical([1, 2, 3, 1, 2, 3, 1, 2, 3], ordered=true)

levels(v)
3-element Array{Int64,1}:
 1
 2
 3

Here the default (numeric) order of the levels matches what we want, so there is no need to change it. Since we've specified that the categories are ordered, we can compare entries:

v[1] < v[2]
true
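For contrast, here is a short sketch of what happens without ordered=true. It assumes the behaviour of recent CategoricalArrays versions, where < on unordered values throws an ArgumentError; the names u and threw are only for illustration. The ordered! function switches an existing vector to ordered.

```julia
using CategoricalArrays

u = categorical([1, 2, 3, 1, 2, 3])   # no ordered=true here
was_ordered = isordered(u)

# Comparing unordered values with < is expected to throw
threw = try
    u[1] < u[2]
    false
catch err
    err isa ArgumentError
end

# Mark the existing vector as ordered; comparisons then work
ordered!(u, true)
now_less = u[1] < u[2]
```

This is why the ordered flag matters: it is what tells the package that comparing two categories is meaningful.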

Let's now consider another example:

v = categorical(["high", "med", "low", "high", "med", "low"], ordered=true)

levels(v)
3-element Array{String,1}:
 "high"
 "low"
 "med"

The levels follow the lexicographic order, which is not what we want, since it ranks "high" below "med":

v[1] < v[2]
true

To specify the intended order, we need to use levels!:

levels!(v, ["low", "med", "high"])
6-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "high"
 "med"
 "low"
 "high"
 "med"
 "low"

Now the levels are ordered as intended, and "high" is no longer less than "med":

v[1] < v[2]
false
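As an alternative to calling levels! afterwards, the order can be given when the vector is built. The sketch below assumes your installed CategoricalArrays version supports the levels keyword of categorical (recent versions do); the names w and sorted are only for illustration. It also shows that sort follows the level order rather than the alphabetical one.

```julia
using CategoricalArrays

# Assumption: this CategoricalArrays version accepts the `levels` keyword
w = categorical(["high", "med", "low", "high", "med", "low"],
                levels=["low", "med", "high"], ordered=true)

high_lt_med = w[1] < w[2]   # compares "high" with "med"

# Sorting uses the level order: low, then med, then high
sorted = sort(w)
sorted_codes = levelcode.(sorted)
```

Setting the levels at construction avoids the intermediate state where the vector is ordered but in the wrong order.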

Missing values

You can also have a categorical vector with missing values:

v = categorical(["AA", "BB", missing, "AA", "BB", "CC"]);

The missing entry is not treated as a level, so the levels are unchanged:

levels(v)
3-element Array{String,1}:
 "AA"
 "BB"
 "CC"