Module 2: Hands-on session (Solutions)¶
⚠️ Warning ⚠️ These are sample solutions to the exercises for the Module 2 hands-on session. We don’t recommend looking here before you’ve attempted them!
Notebook Setup¶
Import necessary packages for this work:
import os
import pandas as pd
from handson_utils import (
parse_country_values_2011,
check_dataset_load,
check_dataset_explored,
check_column_mapping
)
Define path to data. This relies on the data being saved in the specified location below.
DATA_ROOT_PATH = "../m4/data/UKDA-7724-csv" # should match the path you unzipped the data to
COMBINED_CSV_PATH = os.path.join(DATA_ROOT_PATH, "csv/eqls_2007and2011.csv")
MAPPING_CSV_PATH = os.path.join(DATA_ROOT_PATH, "mrdoc/excel/eqls_api_map.csv")
# consts for loading categorical data value maps
CATEGORICAL_VALUES_XLSX_PATH = os.path.join(DATA_ROOT_PATH, "mrdoc/excel/eqls_concordance_grid.xlsx")
CATEGORICAL_VALUES_XLSX_SHEET_NAME = "Values"
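If DATA_ROOT_PATH doesn't point at the unzipped data, every later cell will fail with a FileNotFoundError, so it's worth checking up front. A minimal sketch:
# sanity check: make sure the expected files exist before we try to load them
for path in (COMBINED_CSV_PATH, MAPPING_CSV_PATH, CATEGORICAL_VALUES_XLSX_PATH):
    assert os.path.exists(path), f"Not found: {path} - check DATA_ROOT_PATH"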
Exploring the Downloaded Files¶
Take some time to familiarise yourself with the file structure of the downloaded files (in the UKDA-7724-csv
directory), opening them in Excel/Numbers/relevant application of your choice initially. In particular:
- The table of files: mrdoc/excel/7724_file_information_csv.csv
- The csv directory. What data does each file contain?
- The user guide: mrdoc/pdf/7724_eqls_2007-2011_user_guide_v2.pdf
- The "Variables" and "Values" worksheets in mrdoc/excel/eqls_concordance_grid.xlsx. How many variables are there? Can they be grouped in any way?
Answer: There are 199 variables listed in mrdoc/excel/eqls_concordance_grid.xlsx. There are a number of "Variable Groups" (e.g., "Family and Social Life") and "Topic Classifications" (e.g., "Social stratification and groupings - Family life and marriage").
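These counts can also be produced with code rather than by eye. A sketch, assuming the "Values" sheet has the "Variable Groups" column shown later in this notebook:
# count the variables per "Variable Group" from the concordance grid
grid_df = pd.read_excel(CATEGORICAL_VALUES_XLSX_PATH, sheet_name=CATEGORICAL_VALUES_XLSX_SHEET_NAME)
print(len(grid_df))                               # 199 variables listed
print(grid_df["Variable Groups"].value_counts())  # variables per group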
Load and Explore¶
Relevant sections: 2.1.4 (Data Sources and Formats), 2.2.1 (Data Consistency)
Read the combined 2007 and 2011 data into a pandas DataFrame
# ANSWER read in the file and display the head
####
df = pd.read_csv(COMBINED_CSV_PATH)
display(df.head())
Wave | Y11_Country | Y11_Q31 | Y11_Q32 | Y11_ISCEDsimple | Y11_Q49 | Y11_Q67_1 | Y11_Q67_2 | Y11_Q67_3 | Y11_Q67_4 | ... | DV_Q54a | DV_Q54b | DV_Q55 | DV_Q56 | DV_Q8 | DV_Q10 | ISO3166_Country | RowID | URIRowID | UniqueID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 1 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000083 |
1 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 2 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000126 |
2 | 2 | 1 | 1.0 | 2.0 | 3.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 3 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000267 |
3 | 2 | 1 | 2.0 | 0.0 | 3.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 4 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000268 |
4 | 2 | 1 | 4.0 | 0.0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 5 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000427 |
5 rows × 199 columns
# checks - will produce an `AssertionError` until DataFrame loaded correctly
check_dataset_load(df)
✅ df loaded correctly
Take a look at some summary statistics and use these to assign the variables below correctly.
You can assign the variables manually from inspecting printed output or assign them results from methods/properties.
# ANSWER - example functions to explore and answer questions below
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79270 entries, 0 to 79269
Columns: 199 entries, Wave to UniqueID
dtypes: float64(187), int64(9), object(3)
memory usage: 120.4+ MB
df.describe()
Wave | Y11_Country | Y11_Q31 | Y11_Q32 | Y11_ISCEDsimple | Y11_Q49 | Y11_Q67_1 | Y11_Q67_2 | Y11_Q67_3 | Y11_Q67_4 | ... | DV_Q7 | DV_Q67 | DV_Q43Q44 | DV_Q54a | DV_Q54b | DV_Q55 | DV_Q56 | DV_Q8 | DV_Q10 | RowID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 79270.000000 | 79270.000000 | 78756.000000 | 78769.000000 | 78556.000000 | 79082.000000 | 43636.000000 | 43636.000000 | 43636.000000 | 43636.000000 | ... | 2225.000000 | 43636.000000 | 78312.000000 | 43636.000000 | 43636.000000 | 43636.000000 | 43636.000000 | 43636.000000 | 43636.000000 | 79270.000000 |
mean | 2.550473 | 16.841138 | 1.856049 | 1.598141 | 4.019146 | 2.640955 | 1.959368 | 1.023673 | 1.019204 | 1.001971 | ... | 52.612135 | 1.086465 | 2.485992 | 2.815565 | 2.925635 | 0.303442 | 0.231437 | 3.931708 | 3.283482 | 39635.500000 |
std | 0.497449 | 9.358320 | 1.186271 | 1.276425 | 1.368993 | 0.987352 | 0.197437 | 0.152030 | 0.137244 | 0.044351 | ... | 15.696943 | 0.460388 | 0.838558 | 0.721642 | 0.568403 | 0.881979 | 0.827727 | 0.436254 | 1.130667 | 22883.422256 |
min | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 2.000000 | 9.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 43.000000 | 1.000000 | 2.000000 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 4.000000 | 2.000000 | 19818.250000 |
50% | 3.000000 | 16.000000 | 1.000000 | 2.000000 | 4.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 50.000000 | 1.000000 | 3.000000 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 4.000000 | 4.000000 | 39635.500000 |
75% | 3.000000 | 25.000000 | 3.000000 | 2.000000 | 5.000000 | 4.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 61.000000 | 1.000000 | 3.000000 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 4.000000 | 4.000000 | 59452.750000 |
max | 3.000000 | 35.000000 | 4.000000 | 5.000000 | 8.000000 | 4.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | ... | 80.000000 | 6.000000 | 3.000000 | 6.000000 | 6.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 79270.000000 |
8 rows × 196 columns
df["Y11_Country"].nunique()
35
# ANSWER
n_columns = 199 # how many columns are there in the DataFrame? This matches the number expected from the documentation.
n_float64_columns = 187 # how many columns are of dtype float64?
Y11_Q31_mean_value = 1.856049 # what is the mean of the values in the Y11_Q31 column?
Y11_Country_n_unique = 35 # how many unique values in the Y11_Country column?
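Alternatively, the same answers can be derived from the DataFrame rather than typed in. A sketch (note the mean comes back as a full-precision float rather than the rounded display value, which the checker may or may not accept):
# derive the answers from the DataFrame directly
n_columns = df.shape[1]                                  # 199
n_float64_columns = int((df.dtypes == "float64").sum())  # 187
Y11_Q31_mean_value = df["Y11_Q31"].mean()                # ~1.856049
Y11_Country_n_unique = df["Y11_Country"].nunique()       # 35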
# function to check your answers (run this cell)
check_dataset_explored(
n_columns,
n_float64_columns,
Y11_Q31_mean_value,
Y11_Country_n_unique
)
n_columns answer ✅ correct
n_float64_columns answer ✅ correct
Y11_Q31_mean_value answer ✅ correct
Y11_Country_n_unique answer ✅ correct
Different Data Types¶
Relevant Sections: 2.2.1 (Data Consistency)
Are there columns containing the following types of data? If so give examples.
- Numeric data (without a set of pre-defined categories)
- Binary data
- Categorical data, unordered
- Categorical data, ordered
- Text data
- Time or date data
Note:
- The dataset contains at least 4 of these; try to have another look if you found fewer.
- You'll need to refer to the dataset documentation to fully understand which columns contain which types (mrdoc/excel/eqls_concordance_grid.xlsx may be helpful, in particular).
Answer:
A few examples:
- Numeric data: Y11_Q8 (How many hours would you prefer to work), Y11_Q52 (How much time spent on travelling to work/study)
- Binary data: Y11_Q67_1 (I am a citizen of the country), Y11_HH2a (Gender - discussion point re inclusivity)
- Categorical data, unordered: Y11_Country, Y11_EmploymentStatus (e.g., no clear ordering between retired, unable to work, homemaker and unemployed)
- Categorical data, ordered: the majority of the data can be treated as ordered categories, e.g., Y11_Agecategory (Age), Y11_Q53a (Quality of health services)
- Text data: there are no free text fields, but there are a couple of columns with strings (IDs/URLs mostly), e.g., ISO3166_Country, UniqueID
- Time or date data: none in this dataset (though one could claim Wave is, as it represents a year)
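A quick way to confirm which columns pandas stores as strings is to select the object-dtype columns, e.g.:
# the three object-dtype columns hold the ID/URL strings mentioned above
print(df.select_dtypes(include="object").columns.tolist())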
For one example of each type, what are the possible values the data can take, according to the documentation? What does each value mean? Is each possible value present in the dataset?
# Numeric example:
# Note the question of "what are the possible values" only really makes sense for categorical data,
# but we could talk about the range of values in the data.
df["Y11_Q8"].describe()
# units are hours
count 37958.000000
mean 29.658043
std 16.476718
min 0.000000
25% 20.000000
50% 36.000000
75% 40.000000
max 120.000000
Name: Y11_Q8, dtype: float64
# Binary example:
df["Y11_Q67_1"].value_counts()
# 1 means No and 2 means Yes
2.0 41863
1.0 1773
Name: Y11_Q67_1, dtype: int64
# Unordered categorical example:
df["Y11_EmploymentStatus"].value_counts()
# Values represent employment status: 1 = Employed (includes on leave), 2 = Unemployed, 3 = Unable to work - disability/illness,
# 4 = Retired, 5 = Homemaker, 6 = Student, 7 = Other"
# There are rows with all 7 values present.
1 36381
4 22925
5 7372
2 5375
6 4259
3 1720
7 1238
Name: Y11_EmploymentStatus, dtype: int64
# Ordered categorical example:
df["Y11_Q53a"].value_counts().sort_index()
# Values are respondent's rating of the quality of health services, from 1 (very poor) to 10 (very high).
# There are rows with all 10 values present.
1.0 3481
2.0 2876
3.0 5041
4.0 5963
5.0 12780
6.0 9705
7.0 13805
8.0 13970
9.0 5672
10.0 4715
Name: Y11_Q53a, dtype: int64
What is the minimum, maximum and mean value in each of your example columns? Taking into consideration the type of data in the column, are all of these values meaningful?
# Can use "describe" or the "min", "max" and "mean" functions
df["Y11_Q8"].describe() # Numeric
count 37958.000000
mean 29.658043
std 16.476718
min 0.000000
25% 20.000000
50% 36.000000
75% 40.000000
max 120.000000
Name: Y11_Q8, dtype: float64
df["Y11_Q67_1"].describe() # Binary
# The mean represents the proportion of people that answered "Yes" here (1.96 means 96% of
# people answered Yes (2), and 4% no (1). Would be clearer if no was 0 and yes was 1.
count 43636.000000
mean 1.959368
std 0.197437
min 1.000000
25% 2.000000
50% 2.000000
75% 2.000000
max 2.000000
Name: Y11_Q67_1, dtype: float64
df["Y11_EmploymentStatus"].describe() # Unordered category
# Note: The mean of an unordered category is not meaningful!
# E.g. if we had 2 people, one with a value of 4 (retired), and one with 5 (homemaker), we'd compute
# a mean of 4.5 - does that mean they're part-time retired, part-time homemaker?!
count 79270.000000
mean 2.713145
std 1.793318
min 1.000000
25% 1.000000
50% 2.000000
75% 4.000000
max 7.000000
Name: Y11_EmploymentStatus, dtype: float64
df["Y11_Q53a"].describe() # Ordered category
# Need to be careful about interpreting the mean for ordered categories. For this column the
# categories are ratings from 1 to 10, and the mean rating represents what you'd expect. Others may
# be more nuanced, e.g. the age category column has categories with different sized bins (18-24
# covers 7 years, 35-49 covers 15 years, 65+ covers N years).
count 78008.000000
mean 6.113873
std 2.293152
min 1.000000
25% 5.000000
50% 6.000000
75% 8.000000
max 10.000000
Name: Y11_Q53a, dtype: float64
For one of the categorical columns, replace the numerical encodings with their actual meaning (category title). You can do this by manually creating a Python dictionary with the values to replace (we’ll look at extracting them with code later). What is the most common value?
status_encoding = {
1: "Employed (includes on leave)",
2: "Unemployed",
3: "Unable to work - disability/illness",
4: "Retired",
5: "Homemaker",
6: "Student",
7: "Other"
}
employment_status = df["Y11_EmploymentStatus"].replace(status_encoding)
employment_status.value_counts()
Employed (includes on leave) 36381
Retired 22925
Homemaker 7372
Unemployed 5375
Student 4259
Unable to work - disability/illness 1720
Other 1238
Name: Y11_EmploymentStatus, dtype: int64
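value_counts is sorted by frequency, so the top row already answers the "most common value" question; mode makes this explicit:
# most common value (equivalent to the first row of value_counts above)
print(employment_status.mode().iloc[0])  # Employed (includes on leave)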
Making Things More Readable¶
Relevant Sections: 2.2.2 (Modifying Columns and Indices), 2.2.4.2 (Text Data)
At the moment, we've got column headings such as 'Y11_Country', 'Y11_Q31' and 'Y11_Q32' in our data. These aren't particularly helpful at a glance and we'd need to do some cross-referencing with eqls_api_map.csv to make sense of them.
To make things more readable, let's rename our columns according to the 'VariableLabel' column in mrdoc/excel/eqls_api_map.csv.
However, because spaces and non-word characters make attribute-style (.) access a bit tricky, we'd like to make sure we don't have any in our new column names! For consistency, we'd also like everything to be lower case.
# we have to explicitly use latin1 encoding as the file is not saved in utf-8 (our platform default)
eqls_api_map_df = pd.read_csv(MAPPING_CSV_PATH, encoding='latin1')
eqls_api_map_df.head()
VariableName | VariableLabel | Question | TopicValue | KeywordValue | VariableGroupValue | |
---|---|---|---|---|---|---|
0 | Wave | EQLS Wave | EQLS Wave | NaN | NaN | Administrative Variables |
1 | Y11_Country | Country | Country | Geographies | NaN | Household Grid and Country |
2 | Y11_Q31 | Marital status | Marital status | Social stratification and groupings - Family l... | Marital status | Family and Social Life |
3 | Y11_Q32 | No. of children | Number of children of your own | Social stratification and groupings - Family l... | Children | Family and Social Life |
4 | Y11_ISCEDsimple | Education completed | Highest level of education completed | Education - Higher and further | Education levels | Education |
TODO:
- replace column names in df with the corresponding entry in the 'VariableLabel' column from eqls_api_map_df
- ensure all column names are entirely lowercase
- ensure only characters [a-zA-Z0-9_] are present in column names:
  - remove apostrophes ("'")
  - replace otherwise non-matching characters (e.g., whitespace or '/') with '_'
  - we don't want consecutive '_' characters (e.g., no_of_children rather than no__of_children)
- keep a map (python dict) that shows the old -> new column mapping in case we ever want to invert this transformation.
Example manual mapping (you should produce this with a general code solution!):
{
    ...,
    'Y11_Q32': 'no_of_children',
    ...,
    'Y11_Q67_4': 'citizenship_dont_know',
    ...,
}
You may find it helpful to use a site like regex101 to experiment with creating a suitable regular expression.
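It can also help to test the pattern on a single label with Python's re module before applying it to the whole column. A sketch reproducing the no_of_children example above:
import re

label = "No. of children"  # the VariableLabel for Y11_Q32
cleaned = re.sub(r"_+", "_", re.sub(r"[^\w]", "_", label.replace("'", "").lower()))
print(cleaned)  # no_of_children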
# Answer
old_cols = eqls_api_map_df["VariableName"]
# The four .str calls below:
#   1. convert to lowercase
#   2. remove apostrophes
#   3. replace any remaining non-word character with "_"
#   4. collapse consecutive underscores
new_cols = (
    eqls_api_map_df["VariableLabel"]
    .str.lower()
    .str.replace("'", "", regex=False)
    .str.replace(r"[^\w]", "_", regex=True)
    .str.replace(r"_+", "_", regex=True)
)
column_mapping = dict(zip(old_cols, new_cols))
# Apply your column mapping to df
df = df.rename(columns=column_mapping)
display(df.head())
eqls_wave | country | marital_status | no_of_children | education_completed | rural_urban_living | citizenship_country | citizenship_another_eu_member | citizenship_a_non_eu_country | citizenship_dont_know | ... | dv_anyone_used_would_have_like_to_use_child_care_last_12_months_ | dv_anyone_used_would_have_like_to_use_long_term_care_last_12_months_ | dv_no_of_factors_which_made_it_difficult_to_use_child_care_ | dv_no_of_factors_which_made_it_difficult_to_use_long_term_care_ | dv_preferred_working_hours_3_groups_ | dv_preferred_working_hours_of_respondents_partner_3_groups_ | iso3166_country_url | rowid_for_the_uk_data_service_public_api | root_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_api | unique_respondent_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 1 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000083 |
1 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 2 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000126 |
2 | 2 | 1 | 1.0 | 2.0 | 3.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 3 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000267 |
3 | 2 | 1 | 2.0 | 0.0 | 3.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 4 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000268 |
4 | 2 | 1 | 4.0 | 0.0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 5 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000427 |
5 rows × 199 columns
# check your answer
check_column_mapping(column_mapping, df)
Checking each column...
✅ Column mapping correct... Checking df columns set correctly...
✅ Success!
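As the TODO notes, keeping column_mapping makes the rename invertible. A sketch (the check above confirms the new names are unique, so the inversion is safe):
# invert old -> new to recover the original column names if ever needed
inverse_mapping = {new: old for old, new in column_mapping.items()}
# df.rename(columns=inverse_mapping) would undo the renaming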
Self-Reported Health¶
Relevant Sections: 2.2.1 (Data Consistency), 2.2.3 (Feature Engineering), 2.2.4.3 (Categorical Data)
The research question we defined in Module 1 and will explore in more detail in Modules 3 and 4 aims to explore the contribution of various factors on self-reported health.
Which column in the dataset contains self-reported health values? How many people had “Very good” self-reported health?
# The relevant column is "health_condition" (Y11_Q42 before renaming)
df["health_condition"].value_counts().sort_index()
# 16898 people reported "very good" health (value of 1 below)
1.0 16898
2.0 31248
3.0 22265
4.0 6916
5.0 1802
Name: health_condition, dtype: int64
For the models we develop in Module 4 we’ll convert self-reported health into a binary variable. What might be a sensible way to group the categories into only two possible values? Create a new column with your proposed binary encoding for self-reported health.
# Could assign 1, 2, 3 (very good, good, fair) as "good health", and 4, 5 (bad, very bad) as "bad health".
# Note: NaN fails the comparison, so the 141 rows with missing health_condition end up as 0 ("good").
df["bad_health"] = (df["health_condition"] > 3).astype(int)
display(df[["health_condition", "bad_health"]].sample(5))
# Note this creates very imbalanced classes.
df["bad_health"].value_counts()
health_condition | bad_health | |
---|---|---|
28292 | 2.0 | 0 |
43984 | 2.0 | 0 |
37485 | 3.0 | 0 |
76012 | 1.0 | 0 |
49608 | 1.0 | 0 |
0 70552
1 8718
Name: bad_health, dtype: int64
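To quantify that imbalance, value_counts can return proportions directly:
# roughly 11% of respondents fall into the "bad health" class
df["bad_health"].value_counts(normalize=True)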
Missing Values¶
Relevant Sections: 2.2.1 (Data Consistency), 2.2.7 (Missing Data)
There are at least three different ways missing data and unknown information are represented in this dataset.
Look at one row of the data. What value does Pandas use to represent missing data? How many missing values are there in the row you picked?
df.iloc[0].values
# missing values represented by "nan"
array([2, 1, 4.0, 0.0, 4.0, 4.0, nan, nan, nan, nan, nan, 1, 1, 4, nan, 5,
2.0, 6, nan, 2.0, nan, 2, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, 1.0, 2.0, nan, 4.0, 3.0, 1.0, 2.0, 2.0,
2.0, 2.0, 1.0, 4.0, nan, nan, nan, nan, nan, nan, 2.0, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, 7.0, 6.0, 8.0, 7.0,
nan, nan, 6.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, 6.0, 2.0, 2.0, 2.0, 2.0,
2.0, 2.0, nan, 7.0, 6.0, 7.0, 8.0, 8.0, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, 2.0, nan, nan, 2.0, 3.0, 1.0, 2.0, 4.0,
nan, 3.5, nan, nan, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0,
nan, nan, nan, nan, 0.0, 2.0, nan, nan, nan, 3.0, 9.0, nan, 9.0,
7.0, 8.0, 10.0, 10.0, nan, 7.0, 3.0, 2.0, 2.0, 1.0, 2.0, nan, nan,
nan, 80.0, 3.6421, 1.80451061034635, 1.80982660081538, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, 2.0, 5.0, 2.0, nan,
nan, nan, nan, nan, nan, nan, nan, nan, 3.0, nan, nan, nan, nan,
nan, nan, 'https://www.iso.org/obp/ui/#iso:code:3166:AT', 1,
'https://api.ukdataservice.ac.uk/V1/datasets/eqls/rows/1',
'AT9000083', 0], dtype=object)
df.iloc[0].isna().sum()
# no. of missing values in this row
113
Looking at the possible values for each column in the dataset documentation, find two different columns and at least two different values that are also used to represent missing or unknown information.
Answer: Examples include:
education_3_groups
(Y11_Education
before renaming)5 = “Don’t know”, 6 = “Refused to Answer”.
4 = “Educated abroad”: Could also be considered missing (as we don’t know whether it was primary/secondary/tertiary education)
direct_contact_with_children
(Y11_Q33a
before renaming)6 = “NA”
For the columns you picked:
How many missing values does Pandas say the column has?
How many values match the numeric representation of missing data you found from the documentation (e.g., if the documentation says -99 means unknown, how many -99 values are there in the column)?
Does Pandas include the numeric representation in its count of missing values?
column = "education_3_groups"
print(column, ":", df[column].isna().sum(), "Pandas missing values")
print(column, ":", (df[column] > 3).sum(), "Numeric representation of missing values\n")
column = "direct_contact_with_children"
print(column, ":", df[column].isna().sum(), "Pandas missing values")
print(column, ":", (df[column] == 6).sum(), "Numeric representation of missing values")
# In both cases Pandas does NOT know to include the numeric values in its count of missing data
education_3_groups : 623 Pandas missing values
education_3_groups : 276 Numeric representation of missing values
direct_contact_with_children : 47519 Pandas missing values
direct_contact_with_children : 6860 Numeric representation of missing values
Replace the numbers representing missing values in the columns you picked with the NaN type from numpy. What is the Pandas count of missing values now?
import numpy as np
column = "education_3_groups"
df[column] = df[column].replace([4,5,6], np.nan)
print(column, ":", df[column].isna().sum(), "Pandas missing values")
column = "direct_contact_with_children"
df[column] = df[column].replace(6, np.nan)
print(column, ":", df[column].isna().sum(), "Pandas missing values")
education_3_groups : 899 Pandas missing values
direct_contact_with_children : 54379 Pandas missing values
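Alternatively, these sentinel codes can be declared missing at load time via read_csv's na_values argument. A sketch (this runs before the renaming, so it uses the original column names Y11_Education and Y11_Q33a):
# map per-column sentinel codes to NaN while reading the file
df_alt = pd.read_csv(
    COMBINED_CSV_PATH,
    na_values={"Y11_Education": [4, 5, 6], "Y11_Q33a": [6]},
)
print(df_alt["Y11_Education"].isna().sum())  # should match the count above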
Are there different types of missing data in the dataset (different reasons values can be unknown)? Does this have any implications for the way you would analyse the data?
Answer: Yes. As described above, education_3_groups values could be unknown because they're missing, because the person refused to answer, or because the person was educated abroad. Being educated abroad is quite a different reason for not knowing what level of education they received (and could, for example, be correlated with having more opportunities or wealth).
2007 vs 2011¶
Relevant Sections: 2.2.1 (Data Consistency), 2.2.7 (Missing Data)
Which column should be used to distinguish between the collection years? (2007 and 2011)
How many rows do we have for each year?
For each collection year, what % of null values do we have for each column?
Why is this?
Display these %s in descending order sorted by: 2007 then 2011
# Column to use is "eqls_wave"
# 2 is 2007, and 3 is 2011
df["eqls_wave"].value_counts()
3 43636
2 35634
Name: eqls_wave, dtype: int64
missing = df.groupby("eqls_wave").apply(lambda g: g.isna().mean() * 100).round(3).transpose()
missing.sort_values(by=[2,3], ascending=False)
eqls_wave | 2 | 3 |
---|---|---|
difficult_to_use_long_term_care_because_of_quality_of_care_ | 100.0 | 89.603 |
difficult_to_use_long_term_care_because_of_cost_ | 100.0 | 89.534 |
difficult_to_use_long_term_care_because_of_access_ | 100.0 | 89.495 |
difficult_to_use_long_term_care_because_of_availability_ | 100.0 | 89.433 |
difficult_to_use_child_care_because_of_quality_of_care_ | 100.0 | 83.401 |
... | ... | ... |
iso3166_country_url | 0.0 | 0.000 |
rowid_for_the_uk_data_service_public_api | 0.0 | 0.000 |
root_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_api | 0.0 | 0.000 |
unique_respondent_id | 0.0 | 0.000 |
bad_health | 0.0 | 0.000 |
200 rows × 2 columns
Why? In UKDA-7724-csv/mrdoc/excel/eqls_concordance_grid.xlsx we can see that some variables were not collected in a given year.
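We can confirm this from the missing table computed above by counting the columns that are entirely missing in each wave:
# number of columns with 100% missing values per wave (2 = 2007, 3 = 2011)
print((missing == 100.0).sum())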
UK vs Spain¶
Further to the missing data we saw above, grouped by wave/year, how do missing values look for each country of collection?
Compare the UK with Spain:
- are there columns that have all values for one country but some are missing for the other?
- are there columns that don't have any values for one country but at least some are present for the other?
What implications are there from your answers to the above questions?
For simplicity, just look at 2011 data.
# Select 2011 data (taking a copy so we can add columns without a SettingWithCopyWarning)
df_2011 = df[df["eqls_wave"] == 3].copy()
# Some code for parsing the data in mrdoc/excel/eqls_concordance_grid.xlsx"
# E.g. country data is categorically encoded in our DataFrame but not human readable
# We can get the human readable categories from this file
# load the categorical value data from excel workbook, specifying the appropriate sheet
cat_vals_df = pd.read_excel(CATEGORICAL_VALUES_XLSX_PATH, sheet_name=CATEGORICAL_VALUES_XLSX_SHEET_NAME)
# display head
cat_vals_df.head()
Variable name | Variable label | Variable question text | Topic Classifications | Keywords | Variable Groups | Values if present - 2007 | Values if present - 2011 | |
---|---|---|---|---|---|---|---|---|
0 | Wave | EQLS Wave | EQLS Wave | NaN | NaN | Administrative Variables | 2 = 2007\n3 = 2011 | 2 = 2007\n3 = 2011 |
1 | Y11_Country | Country | Country | Geographies | NaN | Household Grid and Country | 1 = Austria\n2 = Belgium\n3 = Bulgaria\n4 = Cy... | 1 = Austria\n2 = Belgium\n3 = Bulgaria\n4 = Cy... |
2 | Y11_Q31 | Marital status | Marital status | Social stratification and groupings - Family l... | Marital status | Family and Social Life | 1 = Married or living with partner\n2 = Separa... | 1 = Married or living with partner\n2 = Separa... |
3 | Y11_Q32 | No. of children | Number of children of your own | Social stratification and groupings - Family l... | Children | Family and Social Life | 5 = 5 or more children | 5 = 5 or more children |
4 | Y11_ISCEDsimple | Education completed | Highest level of education completed | Education - Higher and further | Education levels | Education | 1 = No education completed (ISCED 0)\n2 = Prim... | 1 = No education completed (ISCED 0)\n2 = Prim... |
You can write your own function to extract the data you need from the cat_vals_df dataframe, input it manually, or use our prewritten function.
country_mapping_2011 = parse_country_values_2011(categorical_values_df=cat_vals_df)
# display
country_mapping_2011
{1: 'Austria',
2: 'Belgium',
3: 'Bulgaria',
4: 'Cyprus',
5: 'Czech Republic',
6: 'Germany',
7: 'Denmark',
8: 'Estonia',
9: 'Greece',
10: 'Spain',
11: 'Finland',
12: 'France',
13: 'Hungary',
14: 'Ireland',
15: 'Italy',
16: 'Lithuania',
17: 'Luxembourg',
18: 'Latvia',
19: 'Malta',
20: 'Netherlands',
21: 'Poland',
22: 'Portugal',
23: 'Romania',
24: 'Sweden',
25: 'Slovenia',
26: 'Slovakia',
27: 'UK',
28: 'Turkey',
29: 'Croatia',
30: 'Macedonia (FYROM)',
31: 'Kosovo',
32: 'Serbia',
33: 'Montenegro',
34: 'Iceland',
35: 'Norway'}
# add country_human column with mapped values
df_2011["country_human"] = df_2011["country"].map(country_mapping_2011)
# % missing values in each column for each country
by_country = df_2011.groupby("country_human").apply(lambda g: g.isna().mean() * 100).round(3).transpose()
# extract UK and Spain
uk_vs_spain = by_country[["UK", "Spain"]]
# columns where exactly one of the two countries has 0% missing values
uk_vs_spain[(uk_vs_spain == 0.0).sum(axis=1) == 1]
country_human | UK | Spain |
---|---|---|
rural_urban_living | 0.577 | 0.000 |
dv_rural_urban_living | 0.577 | 0.000 |
degree_of_urbanisation | 5.595 | 0.000 |
health_condition | 0.044 | 0.000 |
how_frequently_take_part_in_sports_or_exercise_ | 0.089 | 0.000 |
how_frequently_participate_in_social_activities_ | 0.089 | 0.000 |
how_often_worked_unpaid_for_community_services_last_12_months_ | 0.000 | 0.794 |
how_often_worked_unpaid_for_education_cultural_etc_organisation_last_12_months_ | 0.000 | 0.794 |
how_satisfied_with_present_standard_of_living_ | 0.044 | 0.000 |
how_satisfied_with_accommodation_ | 0.044 | 0.000 |
how_satisfied_with_family_life_ | 0.755 | 0.000 |
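The same pattern answers the second question, columns with no values at all for one country but at least some for the other. A sketch:
# columns that are 100% missing for exactly one of the two countries
uk_vs_spain[(uk_vs_spain == 100.0).sum(axis=1) == 1]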
Your Turn!¶
We now leave it to you to learn something interesting from the dataset, using any of the techniques we’ve learned. You could attempt to explore questions like the ones below, for example, but pick anything that appeals to you. Always take care to consider whether the values you’ve calculated are likely to be representative.
- Which age group is most optimistic about the future?
- Which country has the most trust in the police? And the least?
- Are there differences between genders for the highest level of education completed? Does this vary by country and/or age group?
If you prefer, you could also do this with one of the example datasets we used during teaching:
- World Bank percentage of people living in urban environments (Section 2.1.4 Data Sources and Formats)
- Palmer Penguins dataset (Section 2.2.1 Data Consistency)
- Anthropometric Survey of US Army Personnel (Section 2.2.3 Feature Engineering)
- New York Patient Characteristics Survey (Section 2.2.4.3 Categorical Data)
These datasets are stored in the coursebook/modules/m2/data
directory, but you may prefer to download the original versions as we made modifications for teaching.
# Which age group is most optimistic about the future?
df.groupby("age")["i_am_optimistic_about_the_future"].mean()
# Younger people seem to be more optimistic on average
# (lower scores indicate stronger agreement with the statement).
age
1 2.163824
2 2.346278
3 2.548597
4 2.679574
5 2.752320
Name: i_am_optimistic_about_the_future, dtype: float64
# Which country has the most trust in the police? And the least?
df_2011.groupby("country_human")["how_much_trust_the_police_"].mean().sort_values()
# Bulgaria has the least trust, Finland the most.
# Caveat for all of this: only ~1000 respondents per country (check the user guide for details on data collection)!
country_human
Bulgaria 4.168737
Serbia 4.331643
Montenegro 4.531773
Romania 4.583611
Macedonia (FYROM) 4.660144
Croatia 4.662614
Slovakia 4.718782
Cyprus 4.771574
Greece 4.876754
Latvia 4.958420
Slovenia 4.971660
Hungary 5.078764
Czech Republic 5.098802
Poland 5.243012
Lithuania 5.260073
Kosovo 5.314096
Portugal 5.587649
Italy 5.750224
France 5.783880
Belgium 5.892430
Spain 6.129593
Malta 6.338809
Estonia 6.411224
Netherlands 6.498004
UK 6.529096
Luxembourg 6.575605
Ireland 6.644359
Turkey 6.708880
Sweden 6.716000
Germany 6.842923
Austria 6.927734
Denmark 7.830059
Iceland 7.952953
Finland 8.138643
Name: how_much_trust_the_police_, dtype: float64
# Are there differences between genders for the highest level of education completed? Does this vary
# by country and/or age group?
df.groupby("gender")["education_completed"].mean()
# Males seem to reach (are given the opportunity to reach) a slightly higher education level
# on average, but we'd need to check the significance of this.
gender
1 4.106716
2 3.952754
Name: education_completed, dtype: float64
edu_age_m = df[df["gender"] == 1].groupby("age")["education_completed"].mean()
edu_age_f = df[df["gender"] == 2].groupby("age")["education_completed"].mean()
edu_age_m - edu_age_f
# The difference appears to be larger in older people and disappears (maybe even reverses)
# in younger people.
age
1 -0.049220
2 -0.007889
3 0.043056
4 0.145832
5 0.453163
Name: education_completed, dtype: float64
edu_cntry_m = df_2011[df_2011["gender"] == 1].groupby("country_human")["education_completed"].mean()
edu_cntry_f = df_2011[df_2011["gender"] == 2].groupby("country_human")["education_completed"].mean()
(edu_cntry_m - edu_cntry_f).sort_values()
# Ireland, Estonia, Latvia : Top 3 countries with higher female education than male
# Turkey, Romania, Malta : Top 3 countries with higher male education than female
country_human
Ireland -0.375420
Estonia -0.181662
Latvia -0.171925
Finland -0.126074
Lithuania -0.087181
Bulgaria -0.032975
Iceland -0.017156
Slovenia 0.004494
Hungary 0.026447
Denmark 0.030770
Sweden 0.047772
France 0.060102
Austria 0.072093
Italy 0.097751
Portugal 0.098896
Poland 0.107906
Cyprus 0.113697
UK 0.127559
Czech Republic 0.131751
Montenegro 0.138855
Belgium 0.150959
Serbia 0.166398
Spain 0.185198
Macedonia (FYROM) 0.185369
Netherlands 0.187195
Slovakia 0.194347
Croatia 0.258897
Greece 0.294329
Germany 0.297801
Kosovo 0.333768
Luxembourg 0.385009
Malta 0.387355
Romania 0.486371
Turkey 0.562414
Name: education_completed, dtype: float64
CAVEAT! The dataset includes weighting factors that should be used when computing some stats and comparing countries. We’ve simplified and not included that here. See the user guide for details if needed.