Loading and elementary processing of data

In this short tutorial we discuss two ways to easily load data in Julia:

  1. loading a standard dataset via RDatasets.jl,

  2. loading a local file with CSV.jl.

Using RDatasets

The package RDatasets.jl provides access to most of the many datasets listed on this page. These are well-known, standard datasets, such as iris, crabs and Boston, that can be used to get started with data processing and classical machine learning.

To load such a dataset, you will need to specify which R package it belongs to as well as its name; for instance Boston is part of MASS.

using RDatasets

boston = dataset("MASS", "Boston");

The fact that Boston is part of MASS is clearly indicated on the list linked to earlier. While it can be a bit slow, loading a dataset via RDatasets is very simple and convenient as you don't have to worry about setting the names of columns etc.

The dataset function returns a DataFrame object from the DataFrames.jl package.

typeof(boston)
DataFrames.DataFrame

For a short introduction to DataFrame objects, see this tutorial.
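As a quick check after loading, a few DataFrames.jl functions give an overview of the table. This is a minimal sketch, assuming RDatasets.jl and DataFrames.jl are installed; outputs are omitted since their display depends on the package versions.

```julia
using RDatasets, DataFrames

boston = dataset("MASS", "Boston")

size(boston)        # (number of rows, number of columns)
names(boston)       # column names
first(boston, 3)    # first three rows
describe(boston)    # per-column summary statistics
```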

Using CSV

The package CSV.jl offers a powerful way to read arbitrary CSV files efficiently. In particular, the CSV.read function lets you read a file and return its content as a DataFrame.

Basic usage

Let's say you have a file foo.csv at some path fpath=joinpath("data", "foo.csv") with the content

col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true

You can read it with CSV using

using CSV, DataFrames
data = CSV.read(fpath, DataFrame)
3×8 DataFrame
│ Row │ col1    │ col2  │ col3    │ col4    │ col5   │ col6       │ col7                │ col8 │
│     │ Missing │ Int64 │ Float64 │ Float64 │ String │ Dates.Date │ Dates.DateTime      │ Bool │
├─────┼─────────┼───────┼─────────┼─────────┼────────┼────────────┼─────────────────────┼──────┤
│ 1   │ missing │ 1     │ 1.0     │ 1.0     │ one    │ 2019-01-01 │ 2019-01-01T00:00:00 │ 1    │
│ 2   │ missing │ 2     │ 2.0     │ 2.0     │ two    │ 2019-01-02 │ 2019-01-02T00:00:00 │ 0    │
│ 3   │ missing │ 3     │ 3.0     │ 3.14    │ three  │ 2019-01-03 │ 2019-01-03T00:00:00 │ 1    │

Note that we use joinpath here for compatibility with our system, but you can pass any valid path on your system, for instance CSV.read("path/to/file.csv", DataFrame). Here too, the data is returned as a DataFrame:

typeof(data)
DataFrames.DataFrame
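To try this without creating the file by hand, the sketch below writes the same content to a temporary directory and reads it back. It assumes a recent version of CSV.jl, where CSV.read takes the target type (here DataFrame) as its second argument; the temporary directory is only a stand-in for whatever path you actually use.

```julia
using CSV, DataFrames

# Same content as foo.csv above; the first field of every row is empty.
content = """
col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true
"""

# Write the file to a fresh temporary directory (stand-in for "data").
fpath = joinpath(mktempdir(), "foo.csv")
write(fpath, content)

data = CSV.read(fpath, DataFrame)

size(data)                # (3, 8)
eltype.(eachcol(data))    # inferred element type of each column
```

Column col4 mixes 1, 2 and 3.14, so it is read as floating point, and col1 contains only missing values because its fields are empty.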

Some of the useful keyword arguments of CSV.read are:

  • header= to specify whether there is a header, which line the header is on, or a full header of your own,

  • skipto= to specify the row number at which the data starts (earlier rows are skipped),

  • limit= to specify a maximum number of rows to parse,

  • missingstring= to specify a string or vector of strings that should be parsed as missing values,

  • delim=',' a char or string to specify how columns are separated.

For more details see ?CSV.File.
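The keyword arguments above can be combined. The following is a minimal sketch on made-up, in-memory data (the content of raw is purely illustrative), assuming a recent CSV.jl that accepts an IOBuffer as input:

```julia
using CSV, DataFrames

# Hypothetical semicolon-separated data with a leading note line
# and "NA" used as the marker for missing values.
raw = """
# exported by a hypothetical instrument
id;value;label
1;0.5;a
2;NA;b
3;1.5;c
4;2.5;d
"""

df = CSV.read(IOBuffer(raw), DataFrame;
              header=2,             # column names are on line 2
              skipto=3,             # data starts on line 3
              limit=3,              # parse at most 3 rows
              missingstring="NA",   # treat "NA" as missing
              delim=';')            # columns are separated by ';'

size(df)    # (3, 3)
```

Here skipto=3 is already the default once header=2 is given; it is spelled out only to show how the two arguments relate.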

Example 1

Let's consider this dataset, the content of which we saved in a file at path fpath.

It doesn't have a header so we have to provide it ourselves.

header = ["CIC0", "SM1_Dz", "GATS1i",
          "NdsCH", "NdssC", "MLOGP", "LC50"]
data = CSV.read(fpath, DataFrame, header=header)
first(data, 3)
3×7 DataFrame
│ Row │ CIC0    │ SM1_Dz  │ GATS1i  │ NdsCH │ NdssC │ MLOGP   │ LC50    │
│     │ Float64 │ Float64 │ Float64 │ Int64 │ Int64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼───────┼───────┼─────────┼─────────┤
│ 1   │ 3.26    │ 0.829   │ 1.676   │ 0     │ 1     │ 1.453   │ 3.77    │
│ 2   │ 2.189   │ 0.58    │ 0.863   │ 0     │ 0     │ 1.348   │ 3.115   │
│ 3   │ 2.125   │ 0.638   │ 0.831   │ 0     │ 0     │ 1.348   │ 3.531   │
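When the header is given as a vector of names, the first line of the file is treated as data rather than as column names. A small sketch illustrating this, using the first two rows of the output above as in-memory input (comma-separated here for simplicity, which may differ from the actual file):

```julia
using CSV, DataFrames

header = ["CIC0", "SM1_Dz", "GATS1i",
          "NdsCH", "NdssC", "MLOGP", "LC50"]

# Two rows copied from the table above, without a header line.
raw = """
3.26,0.829,1.676,0,1,1.453,3.77
2.189,0.58,0.863,0,0,1.348,3.115
"""

sample = CSV.read(IOBuffer(raw), DataFrame; header=header)

size(sample)     # (2, 7): both lines are read as data
names(sample)    # the names supplied in `header`
```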

Example 2

Let's consider this dataset, the content of which we saved at fpath.

It does not have a header, and missing values are indicated by ?.

data = CSV.read(fpath, DataFrame, header=false, missingstring="?")
first(data[:, 1:5], 3)
3×5 DataFrame
│ Row │ Column1 │ Column2 │ Column3 │ Column4 │ Column5 │
│     │ Int64   │ Int64?  │ Int64   │ Int64   │ Int64?  │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 1       │ 0       │ 1       │ 0       │ 0       │
│ 2   │ 0       │ missing │ 0       │ 0       │ 0       │
│ 3   │ 1       │ 0       │ 1       │ 1       │ 0       │
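Once the data is loaded, a natural first processing step is to check where the missing values are and decide how to handle them. The sketch below reuses the three rows shown above as in-memory input; dropping incomplete rows is only one possible choice, shown for illustration.

```julia
using CSV, DataFrames

# The three rows displayed above, with "?" marking a missing value.
raw = """
1,0,1,0,0
0,?,0,0,0
1,0,1,1,0
"""

data = CSV.read(IOBuffer(raw), DataFrame; header=false, missingstring="?")

# Number of missing values in each column.
nmissing = [count(ismissing, col) for col in eachcol(data)]

# Keep only the rows that have no missing value.
complete = dropmissing(data)

nmissing          # [0, 1, 0, 0, 0]
nrow(complete)    # 2
```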