Module 2: Hands-on session¶
In this hands-on notebook we’ll be getting familiar with working with the European Quality of Life data in pandas by:
Creating a Local Python Environment
Downloading the data
Loading and exploring the data
Understanding the different data types present
Making the data more readable
Considering how missing data is represented
Exploring some differences between the 2007 and 2011 data
Comparing data from UK and Spain respondents with respect to data missingness
We have marked the exercise parts of this notebook as # TODO
with some comments as guidance. You can refer back to the taught material, or external documentation, for further information.
Running the notebook¶
Prerequisites¶
To set up and run the commands in this notebook you will need a (preferably bash or similar) shell with these installed:
Python 3.7 or later
Check by running
python --version
or python3 --version
in your shell
Git (optional)
Check by running
git --version
in your shell
Curl (optional)
Check by running
curl --version
in your shell
If you don’t have these we have instructions in our Research Software Engineering course.
Clone the Course Repository¶
In order to work locally with this notebook, you should clone the course repository.
Go to the GitHub repository in a web browser: https://github.com/alan-turing-institute/rds-course
Click on the green “Code” button and copy the address under “Clone - HTTPS”.
In your shell, run the following command from a sensible location (this will create a new directory for the course in the current directory):
git clone https://github.com/alan-turing-institute/rds-course.git
Change directory to the repository root
cd rds-course
We’re currently using the
develop
branch, so check that out: git checkout develop
Troubleshooting:
If you don’t have git: We recommend using git, but if you don’t have it installed you can download a zip of the code by clicking on “Download Zip” in step 2 above instead, and unpack it locally.
If you previously cloned/downloaded the repo: Please run git checkout develop and then git pull from the rds-course directory to ensure you have the latest version of the material.
Create a Local Python Environment¶
We need to install the third-party packages necessary for the course, pinned to the same versions the course was developed with, to ensure compatibility and reproducibility.
Managing Python Versions¶
As well as the versions of packages, your codebase should specify which version(s) of Python itself it’s compatible with. The code for this course should run with Python 3.7 or above. We don’t cover it here to speed up setup, but if you need to use multiple versions of Python on your system we recommend Pyenv and Conda.
Creating a Virtual Environment with Poetry¶
The Python ecosystem has many different ways of managing packaging and installing dependencies (this page lists some). The most well-known is pip
with dependencies listed in a requirements.txt
file.
In this course we use the tool Poetry, which can help manage multiple environments, in particular switching between them.
Dependencies are listed in pyproject.toml
and have versions fixed in poetry.lock
. Poetry
will pick these files up and install the required packages in a predictable manner, and into a virtual environment isolated from other projects on your system.
Install
Poetry
by following their instructions here.
Change to the rds-course directory (the directory of the git repository cloned above), if you’re not there already: cd /path/to/rds-course
Set the relevant Python executable for Poetry to use:
If
python --version
returns a version number of 3.7 or above: skip to step 4
If
python --version
is less than 3.7 (e.g., 2.7), but python3 --version
gives 3.7 or above: run
poetry env use python3
If you have a Python 3.7+ environment available somewhere else:
Run
poetry env use /full/path/to/python
If you don’t have Python 3.7+ installed or don’t know where to find it:
Refer back to the instructions in the prerequisites and/or ask for help.
Run the following command to create the virtual environment and install the third-party packages necessary for the course:
poetry install
Check the details of the virtual environment that’s been created:
poetry env info
Initialise the environment:
poetry shell
The last step creates a new shell set up to use the Python virtual environment we just created (e.g., which python should now show the path returned earlier by poetry env info above, rather than the path to your global Python executable). If you want to stop using the virtual environment, exit the shell.
Troubleshooting:
If you don’t have curl: curl is used to download a Python script (currently this script, but check the Poetry documentation for the latest URL). Save this script as get-poetry.py and then run python get-poetry.py to install Poetry.
If you don’t want to use Poetry: You can install the course dependencies by running pip install . from the rds-course directory, but we recommend doing this in an alternative virtual environment of your choice (not in your global Python installation).
Jupyter¶
We recommend using JupyterLab for running through the hands-on notebooks in this course.
JupyterLab was installed into your Poetry environment in the previous step. We can launch a local instance, from the Poetry environment terminal, with:
jupyter lab  # run this from the root of the cloned repository (the "rds-course" directory)
We recommend following the rest of the notebook via the JupyterLab instance that should spawn!
Click through the file explorer in the left-hand pane to bring up this notebook.
The notebook should be present at: rds-course/coursebook/modules/m2/2-hands-on.ipynb
If you’ve not used Jupyter
before you might find their Notebook basics and Running code documentation helpful.
Download Data¶
As described in module 1, we can find the data available for download from the UK Data Service here
Observe the licence and take this opportunity to check your obligations
Download the CSV data
Unzip the data to $PROJECT_ROOT/data; this should give us $PROJECT_ROOT/data/UKDA-7724-csv
$PROJECT_ROOT
is the root of the cloned github repository for this course. If you change the directory that you hold this data in, you’ll need to make appropriate changes to the DATA_ROOT_PATH
variable below.
Notebook Setup¶
Import necessary packages for this work:
import os
import pandas as pd
# these are utility functions to check some answers in the notebook
from handson_utils import (
parse_country_values_2011,
check_dataset_load,
check_dataset_explored,
check_column_mapping,
set_column_mapping
)
Define path to data. This relies on the data being saved in the specified location below.
DATA_ROOT_PATH = "../m4/data/UKDA-7724-csv" # should match the path you unzipped the data to
COMBINED_CSV_PATH = os.path.join(DATA_ROOT_PATH, "csv/eqls_2007and2011.csv")
MAPPING_CSV_PATH = os.path.join(DATA_ROOT_PATH, "mrdoc/excel/eqls_api_map.csv")
# consts for loading categorical data value maps
CATEGORICAL_VALUES_XLSX_PATH = os.path.join(DATA_ROOT_PATH, "mrdoc/excel/eqls_concordance_grid.xlsx")
CATEGORICAL_VALUES_XLSX_SHEET_NAME = "Values"
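Before going further, it can be worth checking that these paths actually point at the downloaded files. A minimal sketch, using the constants defined above:
# quick sanity check that the data files are where we expect them to be
for path in [COMBINED_CSV_PATH, MAPPING_CSV_PATH, CATEGORICAL_VALUES_XLSX_PATH]:
    print(path, "exists" if os.path.exists(path) else "NOT FOUND - check DATA_ROOT_PATH")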
Exploring the Downloaded Files¶
Take some time to familiarise yourself with the file structure of the downloaded files (in the UKDA-7724-csv
directory), opening them in Excel/Numbers/relevant application of your choice initially. In particular:
The table of files
mrdoc/excel/7724_file_information_csv.csv
The
csv
directory: what data does each file contain?
The user guide
mrdoc/pdf/7724_eqls_2007-2011_user_guide_v2.pdf
The “Variables” and “Values” worksheets in
mrdoc/excel/eqls_concordance_grid.xlsx
How many variables are there? Can they be grouped in any way?
# TODO
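One possible starting point is sketched below; it uses the paths from the setup cell and simply pulls the file listing and the “Variables” worksheet into pandas so you can browse them in the notebook.
# an overview of what was downloaded: the table of files and the "Variables" worksheet
file_info_df = pd.read_csv(
    os.path.join(DATA_ROOT_PATH, "mrdoc/excel/7724_file_information_csv.csv"),
    encoding="latin1",  # like the mapping file, this may not be saved as utf-8
)
print(file_info_df)

variables_df = pd.read_excel(CATEGORICAL_VALUES_XLSX_PATH, sheet_name="Variables")
print(len(variables_df), "rows in the Variables worksheet")
variables_df.head()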
Pandas Help¶
You’ll be using pandas for most of the hands-on session. For each question there’s a list of sections from the teaching material that may give you hints for how to approach it.
You might also like to refer to the Pandas documentation.
Or you can get a function’s documentation in the notebook by:
Hitting SHIFT + TAB in a code cell with your text cursor on a function name.
Running
help(<function_name>)
in a code cell (e.g. help(pd.read_csv))
Load and Explore¶
Relevant sections: 2.1.4 (Data Sources and Formats), 2.2.1 (Data Consistency)
Read the combined 2007 and 2011 data into a pandas DataFrame
# TODO read in the file and display the head
# your code here
df = None
# checks - will produce an `AssertionError` until DataFrame loaded correctly
check_dataset_load(df)
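If you get stuck, loading the file might look something like this (a sketch; you may need extra arguments to pd.read_csv depending on how you want to treat the data):
# read the combined 2007/2011 CSV into a DataFrame and show the first few rows
df = pd.read_csv(COMBINED_CSV_PATH)
df.head()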
Take a look at some summary statistics and use these to assign the variables below correctly.
You can assign the variables manually from inspecting printed output or assign them results from methods/properties.
# TODO - code to explore dataframe can go here
# TODO - fill in the below with the correct values
n_columns = 0 # how many columns are there in the DataFrame? Does this match number of variables you expected?
n_float64_columns = 0 # how many columns are of dtype float64?
Y11_Q31_mean_value = 0 # what is the mean of the values in the Y11_Q31 column?
Y11_Country_n_unique = 0 # how many unique values in the Y11_Country column?
# function to check your answers (run this cell)
check_dataset_explored(
n_columns,
n_float64_columns,
Y11_Q31_mean_value,
Y11_Country_n_unique
)
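Some pandas methods that may help (a sketch; use it as a starting point for filling in the variables above):
print(df.shape)                       # (number of rows, number of columns)
print(df.dtypes.value_counts())       # how many columns there are of each dtype
print(df["Y11_Q31"].mean())           # mean of the Y11_Q31 column
print(df["Y11_Country"].nunique())    # number of unique values in Y11_Country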
Different Data Types¶
Relevant Sections: 2.2.1 (Data Consistency)
Are there columns containing the following types of data? If so, give examples.
Numeric data (without a set of pre-defined categories)
Binary data
Categorical data, unordered
Categorical data, ordered
Text data
Time or date data
Note:
The dataset contains at least 4 of these; try to have another look if you found fewer.
You’ll need to refer to the dataset documentation to fully understand which columns contain which types (
mrdoc/excel/eqls_concordance_grid.xlsx
may be helpful, in particular).
# TODO
For one example of each type, what are the possible values the data can take, according to the documentation? What does each value mean? Is each possible value present in the dataset?
# TODO
What is the minimum, maximum and mean value in each of your example columns? Taking into consideration the type of data in the column, are all of these values meaningful?
# TODO
For one of the categorical columns, replace the numerical encodings with their actual meaning (category title). You can do this by manually creating a Python dictionary with the values to replace (we’ll look at extracting them with code later). What is the most common value?
# TODO
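The general pattern could look like the sketch below; the column name, codes and category titles here are hypothetical placeholders, so take the real ones from the documentation.
# map numeric category codes to their titles, then see which is most common
example_mapping = {1: "Category A", 2: "Category B", 3: "Category C"}   # hypothetical codes/titles
labelled = df["SOME_CATEGORICAL_COLUMN"].map(example_mapping)           # hypothetical column name
labelled.value_counts()                                                 # the most common value is listed first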
Making Things More Readable¶
Relevant Sections: 2.2.2 (Modifying Columns and Indices), 2.2.4.2 (Text Data)
At the moment, we’ve got column headings such as 'Y11_Country'
, 'Y11_Q31'
and 'Y11_Q32'
in our data. These aren’t particularly helpful at a glance and we’d need to do some cross-referencing with eqls_api_map.csv
to make sense of them.
To make things more readable, let’s rename our columns according to the 'VariableLabel'
column in mrdoc/excel/eqls_api_map.csv
.
However, spaces and non-word characters can make . (attribute) access a bit tricky, so we’d like to make sure we don’t have any in our new column names! For consistency, we’d like everything to be lower case.
# we have to explicitly use latin1 encoding as the file is not saved in utf-8 (our platform default)
eqls_api_map_df = pd.read_csv(MAPPING_CSV_PATH, encoding='latin1')
eqls_api_map_df.head()
TODO:
replace column names in
df
with the corresponding entry in the 'VariableLabel'
column from eqls_api_map_df
ensure all column names are entirely lowercase
ensure only characters [a-zA-Z0-9_] are present in column names
remove apostrophes ("'")
replace otherwise non-matching characters (e.g., whitespace or '/') with '_'
we don’t want consecutive
'_'
characters (e.g., no_of_children rather than no__of_children)
keep a map (python
dict
) that shows the old -> new column mapping in case we ever want to invert this transformation.
Example manual mapping (you should produce this with a general code solution!):
{
...,
'Y11_Q32' -> 'no_of_children',
...,
'Y11_Q67_4' -> 'citizenship_dont_know',
...,
}
You may find it helpful to use a site like regex101 to experiment with creating a suitable regular expression.
⚠️ Note: ⚠️ This is a reasonably tricky regex question. If you prefer to move on to the following questions there is a code cell you can run below to set the updated column names.
# TODO
column_mapping = {}
# TODO
# Apply your column mapping to df
# check your answer
check_column_mapping(column_mapping, df)
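If you get stuck, one possible shape of a solution is sketched below. It assumes eqls_api_map.csv also has a column holding the original variable names (called 'VariableName' here; check eqls_api_map_df.head() for the real name) alongside 'VariableLabel'.
import re

def tidy_label(label):
    """Lowercase a label, drop apostrophes and collapse other non-word characters to single underscores."""
    label = str(label).lower().replace("'", "")   # str() guards against missing labels
    label = re.sub(r"[^a-z0-9_]+", "_", label)    # runs of anything else become one '_'
    return label.strip("_")

# build the old -> new mapping and apply it to the DataFrame
column_mapping = {
    row["VariableName"]: tidy_label(row["VariableLabel"])   # 'VariableName' is an assumption here
    for _, row in eqls_api_map_df.iterrows()
}
df = df.rename(columns=column_mapping)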
If you would like to skip this question, uncomment the code and run this cell:
### THIS CODE MODIFIES df TO THE EXPECTED RESULT AFTER THE MAKING THINGS MORE READABLE EXERCISE
# !!! uncomment the line below and run the cell to set the answer !!!
# df = set_column_mapping(df, eqls_api_map_df)
df.head()
Self-Reported Health¶
Relevant Sections: 2.2.1 (Data Consistency), 2.2.3 (Feature Engineering), 2.2.4.3 (Categorical Data)
The research question we defined in Module 1 and will explore in more detail in Modules 3 and 4 aims to explore the contribution of various factors on self-reported health.
Which column in the dataset contains self-reported health values? How many people had “Very good” self-reported health?
# TODO
For the models we develop in Module 4 we’ll convert self-reported health into a binary variable. What might be a sensible way to group the categories into only two possible values? Create a new column with your proposed binary encoding for self-reported health.
# TODO
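One hypothetical encoding is sketched below; the column name and which categories count as “good” are assumptions for you to replace with your own choices from the documentation.
# binary self-reported health: 1 if the response is in the "good" group, 0 otherwise
good_codes = {1, 2}                                                                  # hypothetical: e.g. "Very good" and "Good"
df["srh_binary"] = df["SELF_REPORTED_HEALTH_COLUMN"].isin(good_codes).astype(int)    # hypothetical column name
df["srh_binary"].value_counts()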
Missing Values¶
Relevant Sections: 2.2.1 (Data Consistency), 2.2.7 (Missing Data)
There are at least three different ways missing data and unknown information are represented in this dataset.
Look at one row of the data. What value does Pandas use to represent missing data? How many missing values are there in the row you picked?
# TODO
Looking at the possible values for each column in the dataset documentation, find two different columns and at least two different values that are also used to represent missing or unknown information.
# TODO
For the columns you picked:
How many missing values does Pandas say the column has?
How many values match the numeric representation of missing data you found from the documentation (e.g., if the documentation says -99 means unknown, how many -99 values are there in the column)?
Does Pandas include the numeric representation in its count of missing values?
# TODO
Replace the numbers representing missing values in the columns you picked with the NaN
type from numpy
. What is the Pandas count of missing values now?
# TODO
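The pattern might look like the sketch below; the column names and missing-data codes are placeholders for the ones you found in the documentation.
import numpy as np

# replace documented "missing/unknown" codes with NaN so that pandas treats them as missing
df = df.replace({"COLUMN_A": {-99: np.nan}, "COLUMN_B": {-88: np.nan, -99: np.nan}})   # placeholder names/codes
df[["COLUMN_A", "COLUMN_B"]].isnull().sum()                                            # recount the missing values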
Are there different types of missing data in the dataset (different reasons values can be unknown)? Does this have any implications for the way you would analyse the data?
# TODO
2007 vs 2011¶
Relevant Sections: 2.2.1 (Data Consistency), 2.2.7 (Missing Data)
Which column should be used to distinguish between the collection years? (2007 and 2011)
How many rows do we have for each year?
For each collection year, what % of null values do we have for each column?
Why is this?
Display these percentages in descending order, sorted by the 2007 values and then the 2011 values
# TODO
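A sketch of one way to compute this, assuming a column (called "wave" below) that records the collection year; check the documentation for its real name and values.
# percentage of null values in each column, split by collection year
null_pct = df.isnull().groupby(df["wave"]).mean().T * 100   # "wave" is a placeholder column name
null_pct.sort_values(by=list(null_pct.columns), ascending=False)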
UK vs Spain¶
Further to the missing data we saw above, grouped by wave/year, how do missing values look for each country of collection?
Compare the UK with Spain:
are there columns that have all values for one country but some are missing for the other?
are there columns that don’t have any values for one country but at least some are present for the other?
What implications are there from your answers to the above questions?
For simplicity, just look at 2011 data.
df_2011 = None # TODO
# Some code for parsing the data in mrdoc/excel/eqls_concordance_grid.xlsx
# E.g. country data is categorically encoded in our DataFrame but not human readable
# We can get the human readable categories from this file
# load the categorical value data from excel workbook, specifying the appropriate sheet
cat_vals_df = pd.read_excel(CATEGORICAL_VALUES_XLSX_PATH, sheet_name=CATEGORICAL_VALUES_XLSX_SHEET_NAME)
# display head
cat_vals_df.head()
You can write your own function to extract the data you need from the cat_vals_df
dataframe, input it manually, or use our prewritten function.
country_mapping_2011 = None # TODO
# uncomment below if you want to use our pre-written function to parse the data for country mappings
#country_mapping_2011 = parse_country_values_2011(categorical_values_df=cat_vals_df)
# display
country_mapping_2011
# TODO - Null values in UK vs Spain
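A sketch of one possible comparison, assuming placeholder column names ("wave", "country") and country labels; adjust these to match your renamed columns and the mapping you extracted above.
# restrict to the 2011 wave and compare per-column missing counts for the UK and Spain
df_2011 = df[df["wave"] == 3]                                  # placeholder: code assumed for the 2011 wave
country_names = df_2011["country"].map(country_mapping_2011)   # placeholder column name
uk_nulls = df_2011[country_names == "UK"].isnull().sum()       # labels assumed; check your mapping
es_nulls = df_2011[country_names == "Spain"].isnull().sum()
pd.DataFrame({"UK": uk_nulls, "Spain": es_nulls})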
Your Turn!¶
We now leave it to you to learn something interesting from the dataset, using any of the techniques we’ve learned. You could attempt to explore questions like the ones below, for example, but pick anything that appeals to you. Always take care to consider whether the values you’ve calculated are likely to be representative.
Which age group is most optimistic about the future?
Which country has the most trust in the police? And the least?
Are there differences between genders for the highest level of education completed? Does this vary by country and/or age group?
If you prefer, you could also do this with one of the example datasets we used during teaching:
World Bank percentage of people living in urban environments (Section 2.1.4 Data Sources and Formats)
Palmer Penguins dataset (Section 2.2.1 Data Consistency)
Anthropometric Survey of US Army Personnel (Section 2.2.3 Feature Engineering)
New York Patient Characteristics Survey (Section 2.2.4.3 Categorical Data)
These datasets are stored in the coursebook/modules/m2/data
directory, but you may prefer to download the original versions as we made modifications for teaching.
# TODO