Lab 2

Basic commands

This is a very brief and rough primer if you're new to Julia and wondering how to do simple things that are relevant for data analysis.

Defining a vector

x = [1, 3, 2, 5]
@show x
@show length(x)
x = [1, 3, 2, 5]
length(x) = 4

Operations between vectors

y = [4, 5, 6, 1]
z = x .+ y # elementwise operation
4-element Array{Int64,1}:
 5
 8
 8
 6
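As a small sketch of other operations between the same two vectors: the dot prefix works with any arithmetic operator, and the inner product is available through dot from the LinearAlgebra standard library. The variable names below are only illustrative.

```julia
using LinearAlgebra  # standard library; provides dot

x = [1, 3, 2, 5]
y = [4, 5, 6, 1]

elementwise_product = x .* y   # multiply entry by entry: [4, 15, 12, 5]
scaled = 2 .* x                # scalar broadcast over the vector: [2, 6, 4, 10]
inner = dot(x, y)              # 1*4 + 3*5 + 2*6 + 5*1 = 36

@show elementwise_product
@show scaled
@show inner
```

Note that x * y (without the dot) is not defined for two column vectors, which is why the elementwise form is written explicitly.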

Defining a matrix

X = [1  2; 3 4]
2×2 Array{Int64,2}:
 1  2
 3  4

You can also build a matrix from a vector:

X = reshape([1, 2, 3, 4], 2, 2)
2×2 Array{Int64,2}:
 1  3
 2  4

Be careful, though: reshape fills the matrix column by column, so to get the same result as before you need to permute the dimensions:

X = permutedims(reshape([1, 2, 3, 4], 2, 2))
2×2 Array{Int64,2}:
 1  2
 3  4
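The column-by-column fill is easier to see with a non-square shape. The following is a minimal sketch (the names col_major and row_major are made up for illustration): to fill a 2×3 matrix row by row, reshape to the swapped dimensions 3×2 first and then permute.

```julia
v = collect(1:6)

# reshape fills down each column in turn
col_major = reshape(v, 2, 3)                 # [1 3 5; 2 4 6]

# to fill row by row, reshape with swapped dimensions, then permute
row_major = permutedims(reshape(v, 3, 2))    # [1 2 3; 4 5 6]

@show col_major
@show row_major
@show vec(col_major) == v   # flattening reads the columns back in order
```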

Function calls can be chained with the |> (pipe) operator, which passes the value on its left as the argument to the function on its right, so the above can also be written

X = reshape([1,2,3,4], 2, 2) |> permutedims
2×2 Array{Int64,2}:
 1  2
 3  4

You don't have to use it, of course, but we will sometimes do so in these tutorials.
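Since |> expects a one-argument function on its right, a call that needs extra arguments (such as reshape) can be wrapped in an anonymous function. This is just one possible way to write it:

```julia
# each stage receives the result of the previous one
X = [1, 2, 3, 4] |> (v -> reshape(v, 2, 2)) |> permutedims

# a pipe can end in any one-argument function
total = X |> sum

@show X
@show total
```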

There's a wealth of functions available for simple math operations:

x = 4
@show x^2
@show sqrt(x)
x ^ 2 = 16
sqrt(x) = 2.0

Elementwise operations on a collection can be done with the dot syntax:

sqrt.([4, 9, 16])
3-element Array{Float64,1}:
 2.0
 3.0
 4.0
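The dot syntax is not limited to built-in functions. As a short sketch, here it is applied to a user-defined function (f is a hypothetical example), together with the @. macro, which adds the dots to every call and operator in an expression:

```julia
f(t) = t^2 + 1            # hypothetical scalar function

squares_plus_one = f.([1, 2, 3])   # apply f to each element: [2, 5, 10]

v = [4, 9, 16]
w = @. sqrt(v) + 1        # same as sqrt.(v) .+ 1: [3.0, 4.0, 5.0]

@show squares_plus_one
@show w
```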

The packages Statistics (from the standard library) and StatsBase offer a number of useful functions for statistics:

using Statistics, StatsBase

Note that if you don't have StatsBase, you can install it by running using Pkg; Pkg.add("StatsBase"). Right, let's now compute some simple statistics:

x = randn(1_000) # 1_000 points iid from a N(0, 1)
μ = mean(x)
σ = std(x)
@show (μ, σ)
(μ, σ) = (-0.023363181706442294, 0.9757686582990799)
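Because the sample is random, the numbers above change from run to run. A minimal sketch of how one might make the draw repeatable and look at a few more summaries, using only the Random and Statistics standard libraries (the seed value 1234 is arbitrary):

```julia
using Random, Statistics

Random.seed!(1234)        # fix the seed so the same sample is drawn on each run
x = randn(1_000)

m = median(x)
v = var(x)                          # sample variance, equal to std(x)^2
q = quantile(x, [0.25, 0.75])       # first and third quartiles

@show m
@show v
@show q
```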

Indexing starts at 1; use : to select the full range along a dimension:

X = [1 2; 3 4; 5 6]
@show X[1, 2]
@show X[:, 1]
@show X[1, :]
@show X[[1, 2], [1, 2]]
X[1, 2] = 2
X[:, 1] = [1, 3, 5]
X[1, :] = [1, 2]
X[[1, 2], [1, 2]] = [1 2; 3 4]

size gives the dimensions as (nrows, ncolumns):

size(X)
(3, 2)
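A few other indexing forms that come up often in data analysis are sketched below: end for the last index, ranges, and a Boolean mask to select rows. The variable names are only illustrative.

```julia
X = [1 2; 3 4; 5 6]

last_row = X[end, :]             # end refers to the last index: [5, 6]
tail_of_col2 = X[2:3, 2]         # a range of rows in column 2: [4, 6]
mask = X[:, 1] .> 1              # true for rows whose first entry exceeds 1
selected = X[mask, :]            # keep only those rows: [3 4; 5 6]
n_rows = size(X, 1)              # size along a single dimension: 3

@show last_row
@show tail_of_col2
@show selected
@show n_rows
```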

Loading data

There are many ways to load data in Julia; a convenient one is the CSV package.

using CSV

Many datasets are available via the RDatasets package

using RDatasets

And finally, the DataFrames package lets you manipulate data easily

using DataFrames
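As a rough sketch of how CSV and DataFrames fit together, the snippet below writes a tiny made-up file to a temporary directory and reads it back. The file name and its contents are hypothetical, and the call CSV.read(path, DataFrame) assumes a reasonably recent version of CSV.jl.

```julia
using CSV, DataFrames

# hypothetical example file, created only for this illustration
path = joinpath(mktempdir(), "example.csv")
write(path, "name,score\nalice,1.5\nbob,2.5\n")

# parse the file and materialise it as a DataFrame
df = CSV.read(path, DataFrame)

@show size(df)
@show names(df)
```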

Let's load some data from RDatasets (the full list of datasets is given in the RDatasets documentation).

auto = dataset("ISLR", "Auto")
first(auto, 3)
3×9 DataFrame
│ Row │ MPG     │ Cylinders │ Displacement │ Horsepower │ Weight  │ Acceleration │ Year    │ Origin  │ Name                      │
│     │ Float64 │ Float64   │ Float64      │ Float64    │ Float64 │ Float64      │ Float64 │ Float64 │ String                    │
├─────┼─────────┼───────────┼──────────────┼────────────┼─────────┼──────────────┼─────────┼─────────┼───────────────────────────┤
│ 1   │ 18.0    │ 8.0       │ 307.0        │ 130.0      │ 3504.0  │ 12.0         │ 70.0    │ 1.0     │ chevrolet chevelle malibu │
│ 2   │ 15.0    │ 8.0       │ 350.0        │ 165.0      │ 3693.0  │ 11.5         │ 70.0    │ 1.0     │ buick skylark 320         │
│ 3   │ 18.0    │ 8.0       │ 318.0        │ 150.0      │ 3436.0  │ 11.0         │ 70.0    │ 1.0     │ plymouth satellite        │

The describe function gives a quick summary of each column:

describe(auto, :mean, :median, :std)
9×4 DataFrame
│ Row │ variable     │ mean    │ median │ std      │
│     │ Symbol       │ Union…  │ Union… │ Union…   │
├─────┼──────────────┼─────────┼────────┼──────────┤
│ 1   │ MPG          │ 23.4459 │ 22.75  │ 7.80501  │
│ 2   │ Cylinders    │ 5.47194 │ 4.0    │ 1.70578  │
│ 3   │ Displacement │ 194.412 │ 151.0  │ 104.644  │
│ 4   │ Horsepower   │ 104.469 │ 93.5   │ 38.4912  │
│ 5   │ Weight       │ 2977.58 │ 2803.5 │ 849.403  │
│ 6   │ Acceleration │ 15.5413 │ 15.5   │ 2.75886  │
│ 7   │ Year         │ 75.9796 │ 76.0   │ 3.68374  │
│ 8   │ Origin       │ 1.57653 │ 1.0    │ 0.805518 │
│ 9   │ Name         │         │        │          │

To retrieve column names, you can use names:

names(auto)
9-element Array{String,1}:
 "MPG"
 "Cylinders"
 "Displacement"
 "Horsepower"
 "Weight"
 "Acceleration"
 "Year"
 "Origin"
 "Name"

Columns can be accessed in different ways:

mpg = auto.MPG
mpg = auto[:, 1]
mpg = auto[:, :MPG]
mpg |> mean
23.44591836734694
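These forms are not entirely equivalent. In recent versions of DataFrames, df[:, :col] returns a copy of the column, while df.col and df[!, :col] return the stored column itself. The sketch below uses a small made-up DataFrame (not the auto data) to show the difference:

```julia
using DataFrames

# hypothetical stand-in for a real dataset
df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

col_copy = df[:, :a]     # a copy: changing it leaves df untouched
col_copy[1] = 100
after_copy_edit = df.a[1]

col_ref = df[!, :a]      # the stored column itself: changes show up in df
col_ref[1] = 100
after_ref_edit = df.a[1]

@show after_copy_edit
@show after_ref_edit
```

Use the copying form when you intend to modify the result without affecting the original table.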

To get the dimensions, you can use size, nrow, and ncol:

@show size(auto)
@show nrow(auto)
@show ncol(auto)
size(auto) = (392, 9)
nrow(auto) = 392
ncol(auto) = 9

For more detailed tutorials on basic data wrangling in Julia, consider

Plotting data

There are multiple libraries that can be used to plot things in Julia. In these tutorials we use PyPlot, but you could of course use another package.

using PyPlot

figure(figsize=(8,6))
plot(mpg)
[Figure: simple plot]
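A plot is usually easier to read with axis labels and a title. The following is a minimal sketch using standard PyPlot calls; the data values and the output file name are made up for illustration and are not taken from the Auto dataset.

```julia
using PyPlot

# made-up values, used only to have something to draw
values = [18.0, 15.5, 21.0, 19.5, 24.0, 22.5]

figure(figsize=(8, 6))
plot(values, marker="o")
xlabel("Observation index")
ylabel("Value")
title("Simple labelled plot")

# hypothetical output location
outfile = joinpath(mktempdir(), "simple_plot.png")
savefig(outfile)
```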