Manipulating a DataFrame

Download the notebook, the raw script, or the annotated script for this tutorial (right-click on the link and save). This tutorial is loosely adapted from this pandas tutorial as well as the DataFrames.jl documentation. It is by no means meant to be a complete introduction, rather, it focuses on some key functionalities that are particularly useful in a classical machine learning context.

Basics

To start with, we will use the Boston dataset which is very simple.

using RDatasets
using DataFrames

boston = dataset("MASS", "Boston");

The dataset function returns a DataFrame object:

typeof(boston)
DataFrames.DataFrame

Accessing data

Intuitively a DataFrame is just a wrapper around a number of columns, each of which is a Vector of some type with a name"

names(boston)
14-element Array{String,1}:
 "Crim"
 "Zn"
 "Indus"
 "Chas"
 "NOx"
 "Rm"
 "Age"
 "Dis"
 "Rad"
 "Tax"
 "PTRatio"
 "Black"
 "LStat"
 "MedV"

You can view the first few rows using first and specifying a number of rows:

first(boston, 4)
4×14 DataFrame
│ Row │ Crim    │ Zn      │ Indus   │ Chas  │ NOx     │ Rm      │ Age     │ Dis     │ Rad   │ Tax   │ PTRatio │ Black   │ LStat   │ MedV    │
│     │ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.00632 │ 18.0    │ 2.31    │ 0     │ 0.538   │ 6.575   │ 65.2    │ 4.09    │ 1     │ 296   │ 15.3    │ 396.9   │ 4.98    │ 24.0    │
│ 2   │ 0.02731 │ 0.0     │ 7.07    │ 0     │ 0.469   │ 6.421   │ 78.9    │ 4.9671  │ 2     │ 242   │ 17.8    │ 396.9   │ 9.14    │ 21.6    │
│ 3   │ 0.02729 │ 0.0     │ 7.07    │ 0     │ 0.469   │ 7.185   │ 61.1    │ 4.9671  │ 2     │ 242   │ 17.8    │ 392.83  │ 4.03    │ 34.7    │
│ 4   │ 0.03237 │ 0.0     │ 2.18    │ 0     │ 0.458   │ 6.998   │ 45.8    │ 6.0622  │ 3     │ 222   │ 18.7    │ 394.63  │ 2.94    │ 33.4    │

You can access one of those columns easily using .colname, this returns a vector that you can access like any Julia vector:

boston.Crim[1:5]
5-element Array{Float64,1}:
 0.00632
 0.02731
 0.02729
 0.03237
 0.06905

You can also just access the dataframe as you would a big matrix:

boston[3, 5]
0.469

or specifying a range of rows/columns:

boston[1:5, [:Crim, :Zn]]
5×2 DataFrame
│ Row │ Crim    │ Zn      │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 0.00632 │ 18.0    │
│ 2   │ 0.02731 │ 0.0     │
│ 3   │ 0.02729 │ 0.0     │
│ 4   │ 0.03237 │ 0.0     │
│ 5   │ 0.06905 │ 0.0     │

or, similarly,

boston[1:5, 1:2]
5×2 DataFrame
│ Row │ Crim    │ Zn      │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 0.00632 │ 18.0    │
│ 2   │ 0.02731 │ 0.0     │
│ 3   │ 0.02729 │ 0.0     │
│ 4   │ 0.03237 │ 0.0     │
│ 5   │ 0.06905 │ 0.0     │

The select function is very convenient to get sub dataframes of interest:

b1 = select(boston, [:Crim, :Zn, :Indus])
first(b1, 2)
2×3 DataFrame
│ Row │ Crim    │ Zn      │ Indus   │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 0.00632 │ 18.0    │ 2.31    │
│ 2   │ 0.02731 │ 0.0     │ 7.07    │

The Not syntax is also very useful:

b2 = select(boston, Not(:NOx))
first(b2, 2)
2×13 DataFrame
│ Row │ Crim    │ Zn      │ Indus   │ Chas  │ Rm      │ Age     │ Dis     │ Rad   │ Tax   │ PTRatio │ Black   │ LStat   │ MedV    │
│     │ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.00632 │ 18.0    │ 2.31    │ 0     │ 6.575   │ 65.2    │ 4.09    │ 1     │ 296   │ 15.3    │ 396.9   │ 4.98    │ 24.0    │
│ 2   │ 0.02731 │ 0.0     │ 7.07    │ 0     │ 6.421   │ 78.9    │ 4.9671  │ 2     │ 242   │ 17.8    │ 396.9   │ 9.14    │ 21.6    │

Finally, if you would like to drop columns, you can use select! which will mutate the dataframe in place:

select!(b1, Not(:Crim))
first(b1, 2)
2×2 DataFrame
│ Row │ Zn      │ Indus   │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 18.0    │ 2.31    │
│ 2   │ 0.0     │ 7.07    │

Describing the data

StatsBase offers a convenient describe function which you can use on a DataFrame to get an overview of the data:

using StatsBase
describe(boston, :min, :max, :mean, :median, :std)
14×6 DataFrame
│ Row │ variable │ min     │ max     │ mean     │ median  │ std      │
│     │ Symbol   │ Real    │ Real    │ Float64  │ Float64 │ Float64  │
├─────┼──────────┼─────────┼─────────┼──────────┼─────────┼──────────┤
│ 1   │ Crim     │ 0.00632 │ 88.9762 │ 3.61352  │ 0.25651 │ 8.60155  │
│ 2   │ Zn       │ 0.0     │ 100.0   │ 11.3636  │ 0.0     │ 23.3225  │
│ 3   │ Indus    │ 0.46    │ 27.74   │ 11.1368  │ 9.69    │ 6.86035  │
│ 4   │ Chas     │ 0       │ 1       │ 0.06917  │ 0.0     │ 0.253994 │
│ 5   │ NOx      │ 0.385   │ 0.871   │ 0.554695 │ 0.538   │ 0.115878 │
│ 6   │ Rm       │ 3.561   │ 8.78    │ 6.28463  │ 6.2085  │ 0.702617 │
│ 7   │ Age      │ 2.9     │ 100.0   │ 68.5749  │ 77.5    │ 28.1489  │
│ 8   │ Dis      │ 1.1296  │ 12.1265 │ 3.79504  │ 3.20745 │ 2.10571  │
│ 9   │ Rad      │ 1       │ 24      │ 9.54941  │ 5.0     │ 8.70726  │
│ 10  │ Tax      │ 187     │ 711     │ 408.237  │ 330.0   │ 168.537  │
│ 11  │ PTRatio  │ 12.6    │ 22.0    │ 18.4555  │ 19.05   │ 2.16495  │
│ 12  │ Black    │ 0.32    │ 396.9   │ 356.674  │ 391.44  │ 91.2949  │
│ 13  │ LStat    │ 1.73    │ 37.97   │ 12.6531  │ 11.36   │ 7.14106  │
│ 14  │ MedV     │ 5.0     │ 50.0    │ 22.5328  │ 21.2    │ 9.1971   │

You can pass a number of symbols to the describe function to indicate which statistics to compute for each feature:

  • mean, std, min, max, median, first, last are all fairly self explanatory

  • q25, q75 are respectively for the 25th and 75th percentile,

  • eltype, nunique, nmissing can also be used

You can also pass your custom function with a pair name => function for instance:

foo(x) = sum(abs.(x)) / length(x)
d = describe(boston, :mean, :median, :foo => foo)
first(d, 3)
3×4 DataFrame
│ Row │ variable │ mean    │ median  │ foo     │
│     │ Symbol   │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼─────────┼─────────┼─────────┤
│ 1   │ Crim     │ 3.61352 │ 0.25651 │ 3.61352 │
│ 2   │ Zn       │ 11.3636 │ 0.0     │ 11.3636 │
│ 3   │ Indus    │ 11.1368 │ 9.69    │ 11.1368 │

The describe function returns a derived object with one row per feature and one column per required statistic.

Further to StatsBase, Statistics offers a range of useful functions for data analysis.

using Statistics

Converting the data

If you want to get the content of the dataframe as one big matrix, use convert:

mat = convert(Matrix, boston)
mat[1:3, 1:3]
3×3 Array{Float64,2}:
 0.00632  18.0  2.31
 0.02731   0.0  7.07
 0.02729   0.0  7.07

Adding columns

Adding a column to a dataframe is very easy:

boston.Crim_x_Zn = boston.Crim .* boston.Zn;

that's it! Remember also that you can drop columns or make subselections with select and select!.

Missing values

Let's load a dataset with missing values

mao = dataset("gap", "mao")
describe(mao, :nmissing)
19×2 DataFrame
│ Row │ variable │ nmissing │
│     │ Symbol   │ Union…   │
├─────┼──────────┼──────────┤
│ 1   │ ID       │          │
│ 2   │ Type     │          │
│ 3   │ Gender   │          │
│ 4   │ Age      │ 188      │
│ 5   │ AAO      │ 296      │
│ 6   │ AAD      │ 295      │
│ 7   │ UPDRS    │ 301      │
│ 8   │ MAOAI2   │          │
│ 9   │ AI2Code  │          │
│ 10  │ MAOBI2   │          │
│ 11  │ BI2Code  │          │
│ 12  │ GTBEX3   │          │
│ 13  │ BEX3Code │          │
│ 14  │ MAOAVNTR │ 46       │
│ 15  │ VNTRCode │ 46       │
│ 16  │ VNTRCod2 │ 46       │
│ 17  │ MAOA31   │ 36       │
│ 18  │ MAO31COD │ 36       │
│ 19  │ MAO31CO2 │ 36       │

Lots of missing values... If you wanted to compute simple functions on columns, they may just return missing:

std(mao.Age)
missing

The skipmissing function can help counter this easily:

std(skipmissing(mao.Age))
11.551715894987558

Split-Apply-Combine

This is a shorter version of the DataFrames.jl tutorial.

iris = dataset("datasets", "iris")
first(iris, 3)
3×5 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│     │ Float64     │ Float64    │ Float64     │ Float64    │ Cat…    │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa  │

groupby

The groupby function allows to form "sub-dataframes" corresponding to groups of rows. This can be very convenient to run specific analyses for specific groups without copying the data.

The basic usage is groupby(df, cols) where cols specifies one or several columns to use for the grouping.

Consider a simple example: in iris there is a Species column with 3 species:

unique(iris.Species)
3-element Array{String,1}:
 "setosa"
 "versicolor"
 "virginica"

We can form views for each of these:

gdf = groupby(iris, :Species);

The gdf object now corresponds to views of the original dataframe for each of the 3 species; the first species is "setosa" with:

subdf_setosa = gdf[1]
describe(subdf_setosa, :min, :mean, :max)
5×4 DataFrame
│ Row │ variable    │ min    │ mean   │ max    │
│     │ Symbol      │ Any    │ Union… │ Any    │
├─────┼─────────────┼────────┼────────┼────────┤
│ 1   │ SepalLength │ 4.3    │ 5.006  │ 5.8    │
│ 2   │ SepalWidth  │ 2.3    │ 3.428  │ 4.4    │
│ 3   │ PetalLength │ 1.0    │ 1.462  │ 1.9    │
│ 4   │ PetalWidth  │ 0.1    │ 0.246  │ 0.6    │
│ 5   │ Species     │ setosa │        │ setosa │

Note that subdf_setosa is a SubDataFrame meaning that it is just a view of the parent dataframe iris; if you modify that parent dataframe then the sub dataframe is also modified.

See ?groupby for more information.

combine

The combine function allows to derive a new dataframe out of transformations of an existing one. Here's an example taken from the official doc (see ?combine):

df = DataFrame(a=1:3, b=4:6)
combine(df, :a => sum, nrow)
1×2 DataFrame
│ Row │ a_sum │ nrow  │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 6     │ 3     │

what happened here is that the derived DataFrame has two columns obtained respectively by (1) computing the sum of the first column and (2) applying the nrow function on the df.

The transformation can produce one or several values, combine will try to concatenate these columns as it can, for instance:

foo(v) = v[1:2]
combine(df, :a => maximum, :b => foo)
2×2 DataFrame
│ Row │ a_maximum │ b_foo │
│     │ Int64     │ Int64 │
├─────┼───────────┼───────┤
│ 1   │ 3         │ 4     │
│ 2   │ 3         │ 5     │

here the maximum value of a is copied twice so that the two columns have the same number of rows.

bar(v) = v[end-1:end]
combine(df, :a => foo, :b => bar)
2×2 DataFrame
│ Row │ a_foo │ b_bar │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 5     │
│ 2   │ 2     │ 6     │

combine with groupby

Combining groupby with combine is very useful. For instance you might want to compute statistics across groups for different variables:

combine(groupby(iris, :Species), :PetalLength => mean)
3×2 DataFrame
│ Row │ Species    │ PetalLength_mean │
│     │ Cat…       │ Float64          │
├─────┼────────────┼──────────────────┤
│ 1   │ setosa     │ 1.462            │
│ 2   │ versicolor │ 4.26             │
│ 3   │ virginica  │ 5.552            │

let's decompose that:

  1. the groupby(iris, :Species) creates groups using the :Species column (which has values setosa, versicolor, virginica)

  2. the combine creates a derived dataframe by applying the mean function to the :PetalLength column

  3. since there are three groups, we get one column (mean of PetalLength) and three rows (one per group).

You can do this for several columns/statistics at the time and give new column names to the results:

gdf = groupby(iris, :Species)
combine(gdf, :PetalLength => mean => :MPL, :PetalLength => std => :SPL)
3×3 DataFrame
│ Row │ Species    │ MPL     │ SPL      │
│     │ Cat…       │ Float64 │ Float64  │
├─────┼────────────┼─────────┼──────────┤
│ 1   │ setosa     │ 1.462   │ 0.173664 │
│ 2   │ versicolor │ 4.26    │ 0.469911 │
│ 3   │ virginica  │ 5.552   │ 0.551895 │

so here we assign the names :MPL and :SPL to the derived columns. If you want to apply something on all columns apart from the grouping one, using names and Not comes in handy:

combine(gdf, names(iris, Not(:Species)) .=> std)
3×5 DataFrame
│ Row │ Species    │ SepalLength_std │ SepalWidth_std │ PetalLength_std │ PetalWidth_std │
│     │ Cat…       │ Float64         │ Float64        │ Float64         │ Float64        │
├─────┼────────────┼─────────────────┼────────────────┼─────────────────┼────────────────┤
│ 1   │ setosa     │ 0.35249         │ 0.379064       │ 0.173664        │ 0.105386       │
│ 2   │ versicolor │ 0.516171        │ 0.313798       │ 0.469911        │ 0.197753       │
│ 3   │ virginica  │ 0.63588         │ 0.322497       │ 0.551895        │ 0.27465        │

where

names(iris, Not(:Species))
4-element Array{String,1}:
 "SepalLength"
 "SepalWidth"
 "PetalLength"
 "PetalWidth"

and note the use of . in .=> to indicate that we broadcast the function over each column.