{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 2: Hands-on session (Solutions)\n", "\n", "⚠️ **Warning** ⚠️ These are sample solutions to the exercises for the Module 2 hands-on session. We don't recommend looking here before you've attempted them!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Setup\n", "\n", "Import necessary packages for this work:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import pandas as pd\n", "\n", "from handson_utils import (\n", " parse_country_values_2011,\n", " check_dataset_load,\n", " check_dataset_explored,\n", " check_column_mapping\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define path to data. This relies on the data being saved in the specified location below." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "DATA_ROOT_PATH = \"../m4/data/UKDA-7724-csv\" # should match the path you unzipped the data to\n", "\n", "COMBINED_CSV_PATH = os.path.join(DATA_ROOT_PATH, \"csv/eqls_2007and2011.csv\")\n", "MAPPING_CSV_PATH = os.path.join(DATA_ROOT_PATH, \"mrdoc/excel/eqls_api_map.csv\")\n", "\n", "# consts for loading categorical data value maps\n", "CATEGORICAL_VALUES_XLSX_PATH = os.path.join(DATA_ROOT_PATH, \"mrdoc/excel/eqls_concordance_grid.xlsx\")\n", "CATEGORICAL_VALUES_XLSX_SHEET_NAME = \"Values\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Downloaded Files\n", "\n", "Take some time to familiarise yourself with the file structure of the downloaded files (in the `UKDA-7724-csv` directory), opening them in Excel/Numbers/relevant application of your choice initially. In particular:\n", "\n", "- The table of files `mrdoc/excel/7724_file_information_csv.csv`\n", "- The `csv` directory\n", " - What data does each file contain?\n", "- The user guide `mrdoc/pdf/7724_eqls_2007-2011_user_guide_v2.pdf`\n", "- The \"Variables\" and \"Values\" worksheet in `mrdoc/excel/eqls_concordance_grid.xlsx`\n", " - How many variables are there? Can they be grouped in any way?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:** There are 199 variables listed in `mrdoc/excel/eqls_concordance_grid.xlsx`. There are a number of \"Variable Groups\" (e.g., \"Family and Social Life\") and \"Topic Classifications\" (e.g., \"Social stratification and groupings - Family life and marriage\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and Explore\n", "\n", "***Relevant sections:*** *2.1.4 (Data Sources and Formats), 2.2.1 (Data Consistency)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the combined 2007 and 2011 data into a pandas `DataFrame`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WaveY11_CountryY11_Q31Y11_Q32Y11_ISCEDsimpleY11_Q49Y11_Q67_1Y11_Q67_2Y11_Q67_3Y11_Q67_4...DV_Q54aDV_Q54bDV_Q55DV_Q56DV_Q8DV_Q10ISO3166_CountryRowIDURIRowIDUniqueID
0214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT1https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000083
1214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT2https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000126
2211.02.03.02.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT3https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000267
3212.00.03.01.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT4https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000268
4214.00.03.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT5https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000427
\n", "

5 rows × 199 columns

\n", "
" ], "text/plain": [ " Wave Y11_Country Y11_Q31 Y11_Q32 Y11_ISCEDsimple Y11_Q49 Y11_Q67_1 \\\n", "0 2 1 4.0 0.0 4.0 4.0 NaN \n", "1 2 1 4.0 0.0 4.0 4.0 NaN \n", "2 2 1 1.0 2.0 3.0 2.0 NaN \n", "3 2 1 2.0 0.0 3.0 1.0 NaN \n", "4 2 1 4.0 0.0 3.0 4.0 NaN \n", "\n", " Y11_Q67_2 Y11_Q67_3 Y11_Q67_4 ... DV_Q54a DV_Q54b DV_Q55 DV_Q56 \\\n", "0 NaN NaN NaN ... NaN NaN NaN NaN \n", "1 NaN NaN NaN ... NaN NaN NaN NaN \n", "2 NaN NaN NaN ... NaN NaN NaN NaN \n", "3 NaN NaN NaN ... NaN NaN NaN NaN \n", "4 NaN NaN NaN ... NaN NaN NaN NaN \n", "\n", " DV_Q8 DV_Q10 ISO3166_Country RowID \\\n", "0 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 1 \n", "1 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 2 \n", "2 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 3 \n", "3 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 4 \n", "4 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 5 \n", "\n", " URIRowID UniqueID \n", "0 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000083 \n", "1 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000126 \n", "2 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000267 \n", "3 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000268 \n", "4 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000427 \n", "\n", "[5 rows x 199 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# ANSWER read in the file and display the head\n", "####\n", "\n", "df = pd.read_csv(COMBINED_CSV_PATH)\n", "display(df.head())\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ df loaded correctly\n" ] } ], "source": [ "# checks - will produce an `AssertionError` until DataFrame loaded correctly\n", "check_dataset_load(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at some summary statistics and use these to assign the variables below correctly.\n", "\n", "You can assign the variables manually from inspecting printed output or assign them results from methods/properties." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 79270 entries, 0 to 79269\n", "Columns: 199 entries, Wave to UniqueID\n", "dtypes: float64(187), int64(9), object(3)\n", "memory usage: 120.4+ MB\n" ] } ], "source": [ "# ANSWER - example functions to explore and answer questions below\n", "\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WaveY11_CountryY11_Q31Y11_Q32Y11_ISCEDsimpleY11_Q49Y11_Q67_1Y11_Q67_2Y11_Q67_3Y11_Q67_4...DV_Q7DV_Q67DV_Q43Q44DV_Q54aDV_Q54bDV_Q55DV_Q56DV_Q8DV_Q10RowID
count79270.00000079270.00000078756.00000078769.00000078556.00000079082.00000043636.00000043636.00000043636.00000043636.000000...2225.00000043636.00000078312.00000043636.00000043636.00000043636.00000043636.00000043636.00000043636.00000079270.000000
mean2.55047316.8411381.8560491.5981414.0191462.6409551.9593681.0236731.0192041.001971...52.6121351.0864652.4859922.8155652.9256350.3034420.2314373.9317083.28348239635.500000
std0.4974499.3583201.1862711.2764251.3689930.9873520.1974370.1520300.1372440.044351...15.6969430.4603880.8385580.7216420.5684030.8819790.8277270.4362541.13066722883.422256
min2.0000001.0000001.0000000.0000001.0000001.0000001.0000001.0000001.0000001.000000...5.0000001.0000001.0000001.0000001.0000000.0000000.0000001.0000001.0000001.000000
25%2.0000009.0000001.0000000.0000003.0000002.0000002.0000001.0000001.0000001.000000...43.0000001.0000002.0000003.0000003.0000000.0000000.0000004.0000002.00000019818.250000
50%3.00000016.0000001.0000002.0000004.0000003.0000002.0000001.0000001.0000001.000000...50.0000001.0000003.0000003.0000003.0000000.0000000.0000004.0000004.00000039635.500000
75%3.00000025.0000003.0000002.0000005.0000004.0000002.0000001.0000001.0000001.000000...61.0000001.0000003.0000003.0000003.0000000.0000000.0000004.0000004.00000059452.750000
max3.00000035.0000004.0000005.0000008.0000004.0000002.0000002.0000002.0000002.000000...80.0000006.0000003.0000006.0000006.0000004.0000004.0000004.0000004.00000079270.000000
\n", "

8 rows × 196 columns

\n", "
" ], "text/plain": [ " Wave Y11_Country Y11_Q31 Y11_Q32 \\\n", "count 79270.000000 79270.000000 78756.000000 78769.000000 \n", "mean 2.550473 16.841138 1.856049 1.598141 \n", "std 0.497449 9.358320 1.186271 1.276425 \n", "min 2.000000 1.000000 1.000000 0.000000 \n", "25% 2.000000 9.000000 1.000000 0.000000 \n", "50% 3.000000 16.000000 1.000000 2.000000 \n", "75% 3.000000 25.000000 3.000000 2.000000 \n", "max 3.000000 35.000000 4.000000 5.000000 \n", "\n", " Y11_ISCEDsimple Y11_Q49 Y11_Q67_1 Y11_Q67_2 \\\n", "count 78556.000000 79082.000000 43636.000000 43636.000000 \n", "mean 4.019146 2.640955 1.959368 1.023673 \n", "std 1.368993 0.987352 0.197437 0.152030 \n", "min 1.000000 1.000000 1.000000 1.000000 \n", "25% 3.000000 2.000000 2.000000 1.000000 \n", "50% 4.000000 3.000000 2.000000 1.000000 \n", "75% 5.000000 4.000000 2.000000 1.000000 \n", "max 8.000000 4.000000 2.000000 2.000000 \n", "\n", " Y11_Q67_3 Y11_Q67_4 ... DV_Q7 DV_Q67 \\\n", "count 43636.000000 43636.000000 ... 2225.000000 43636.000000 \n", "mean 1.019204 1.001971 ... 52.612135 1.086465 \n", "std 0.137244 0.044351 ... 15.696943 0.460388 \n", "min 1.000000 1.000000 ... 5.000000 1.000000 \n", "25% 1.000000 1.000000 ... 43.000000 1.000000 \n", "50% 1.000000 1.000000 ... 50.000000 1.000000 \n", "75% 1.000000 1.000000 ... 61.000000 1.000000 \n", "max 2.000000 2.000000 ... 80.000000 6.000000 \n", "\n", " DV_Q43Q44 DV_Q54a DV_Q54b DV_Q55 DV_Q56 \\\n", "count 78312.000000 43636.000000 43636.000000 43636.000000 43636.000000 \n", "mean 2.485992 2.815565 2.925635 0.303442 0.231437 \n", "std 0.838558 0.721642 0.568403 0.881979 0.827727 \n", "min 1.000000 1.000000 1.000000 0.000000 0.000000 \n", "25% 2.000000 3.000000 3.000000 0.000000 0.000000 \n", "50% 3.000000 3.000000 3.000000 0.000000 0.000000 \n", "75% 3.000000 3.000000 3.000000 0.000000 0.000000 \n", "max 3.000000 6.000000 6.000000 4.000000 4.000000 \n", "\n", " DV_Q8 DV_Q10 RowID \n", "count 43636.000000 43636.000000 79270.000000 \n", "mean 3.931708 3.283482 39635.500000 \n", "std 0.436254 1.130667 22883.422256 \n", "min 1.000000 1.000000 1.000000 \n", "25% 4.000000 2.000000 19818.250000 \n", "50% 4.000000 4.000000 39635.500000 \n", "75% 4.000000 4.000000 59452.750000 \n", "max 4.000000 4.000000 79270.000000 \n", "\n", "[8 rows x 196 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "35" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_Country\"].nunique()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "n_columns answer ✅ correct\n", "n_float64_columns answer ✅ correct\n", "Y11_Q31_mean_value answer ✅ correct\n", "Y11_Country_n_unique answer ✅ correct\n" ] } ], "source": [ "# ANSWER\n", "n_columns = 199 # how many columns are there in the DataFrame? 
This matches the number expected from the documentation.\n", "n_float64_columns = 187 # how many columns are of dtype float64?\n", "Y11_Q31_mean_value = 1.856049 # what is the mean of the values in the Y11_Q31 column?\n", "Y11_Country_n_unique = 35 # how many unique values in the Y11_Country column?\n", "\n", "\n", "# function to check your answers (run this cell)\n", "check_dataset_explored(\n", "    n_columns,\n", "    n_float64_columns,\n", "    Y11_Q31_mean_value,\n", "    Y11_Country_n_unique\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Different Data Types\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency)*\n", "\n", "Are there columns containing the following types of data? If so, give examples.\n", "\n", "- Numeric data (without a set of pre-defined categories)\n", "- Binary data\n", "- Categorical data, unordered\n", "- Categorical data, ordered\n", "- Text data\n", "- Time or date data\n", "\n", "**Note:**\n", "\n", "- The dataset contains at least 4 of these, so try to have another look if you found fewer.\n", "- You'll need to refer to the dataset documentation to fully understand which columns contain which types (`mrdoc/excel/eqls_concordance_grid.xlsx` may be helpful, in particular)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**:\n", "\n", "A few examples:\n", "\n", "- Numeric data\n", "  - `Y11_Q8` (How many hours would you prefer to work)\n", "  - `Y11_Q52` (How much time spent on travelling to work/study)\n", "\n", "- Binary data\n", "  - `Y11_Q67_1` (I am a citizen of the country)\n", "  - `Y11_HH2a` (Gender - discussion point re inclusivity)\n", "\n", "- Categorical data, unordered\n", "  - `Country`\n", "  - `Y11_EmploymentStatus` (e.g., no clear ordering between retired, unable to work, homemaker and unemployed)\n", "\n", "- Categorical data, ordered\n", "  - The majority of the data can be treated as ordered categories.\n", "  - `Y11_Agecategory` (Age)\n", "  - `Y11_Q53a` (Quality of Health services)\n", "\n", "- Text data\n", "  - There are no free text fields, but there are a couple of columns with strings (IDs/URLs mostly), e.g., `ISO3166_Country`, `UniqueID`.\n", "\n", "- Time or date data\n", "  - None in this dataset (though you could argue that `Wave` is, as it represents a collection year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For one example of each type, what are the possible values the data can take, according to the documentation? What does each value mean? Is each possible value present in the dataset?"
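, "\n", "\n", "One generic way to check whether every documented code actually shows up in the data is to compare sets of codes. This is only a sketch: the `documented_codes` set below is hypothetical and would need to be filled in from the \"Values\" sheet of `eqls_concordance_grid.xlsx` for the column you picked.\n", "\n", "```python\n", "documented_codes = {1, 2, 3, 4, 5, 6, 7}  # hypothetical: copy the codes listed in the documentation\n", "observed_codes = set(df[\"Y11_EmploymentStatus\"].dropna().unique())\n", "\n", "print(\"Documented but never observed:\", documented_codes - observed_codes)\n", "print(\"Observed but not documented:\", observed_codes - documented_codes)\n", "```\n"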
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 37958.000000\n", "mean 29.658043\n", "std 16.476718\n", "min 0.000000\n", "25% 20.000000\n", "50% 36.000000\n", "75% 40.000000\n", "max 120.000000\n", "Name: Y11_Q8, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Numeric example:\n", "# Note the question of \"what are the possible values\" only really makes sense for categorical data,\n", "# but we could talk about the range of values in the data.\n", "\n", "df[\"Y11_Q8\"].describe()\n", "# units are hours" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.0 41863\n", "1.0 1773\n", "Name: Y11_Q67_1, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Binary example:\n", "\n", "df[\"Y11_Q67_1\"].value_counts()\n", "# 1 means No and 2 means Yes" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 36381\n", "4 22925\n", "5 7372\n", "2 5375\n", "6 4259\n", "3 1720\n", "7 1238\n", "Name: Y11_EmploymentStatus, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unordered categorical example:\n", "df[\"Y11_EmploymentStatus\"].value_counts()\n", "\n", "# Values represent employment status: 1 = Employed (includes on leave), 2 = Unemployed, 3 = Unable to work - disability/illness,\n", "# 4 = Retired, 5 = Homemaker, 6 = Student, 7 = Other\"\n", "# There are rows with all 7 values present.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 3481\n", "2.0 2876\n", "3.0 5041\n", "4.0 5963\n", "5.0 12780\n", "6.0 9705\n", "7.0 13805\n", "8.0 13970\n", "9.0 5672\n", "10.0 4715\n", "Name: Y11_Q53a, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ordered categorical example:\n", "df[\"Y11_Q53a\"].value_counts().sort_index()\n", "\n", "# Values are respondent's rating of the quality of health services, from 1 (very poor) to 10 (very high).\n", "# There are rows with all 10 values present." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the minimum, maximum and mean value in each of your example columns? Taking into consideration the type of data in the column, are all of these values meaningful? 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 37958.000000\n", "mean 29.658043\n", "std 16.476718\n", "min 0.000000\n", "25% 20.000000\n", "50% 36.000000\n", "75% 40.000000\n", "max 120.000000\n", "Name: Y11_Q8, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Can use \"describe\" or the \"min\", \"max\" and \"mean\" functions\n", "\n", "df[\"Y11_Q8\"].describe() # Numeric" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 43636.000000\n", "mean 1.959368\n", "std 0.197437\n", "min 1.000000\n", "25% 2.000000\n", "50% 2.000000\n", "75% 2.000000\n", "max 2.000000\n", "Name: Y11_Q67_1, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_Q67_1\"].describe() # Binary\n", "\n", "# The mean represents the proportion of people that answered \"Yes\" here (1.96 means 96% of\n", "# people answered Yes (2), and 4% no (1). Would be clearer if no was 0 and yes was 1." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 79270.000000\n", "mean 2.713145\n", "std 1.793318\n", "min 1.000000\n", "25% 1.000000\n", "50% 2.000000\n", "75% 4.000000\n", "max 7.000000\n", "Name: Y11_EmploymentStatus, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_EmploymentStatus\"].describe() # Unordered category\n", "\n", "# Note: The mean of an unordered category is not meaningful!\n", "# E.g. if we had 2 people, one with a value of 4 (retired), and one with 5 (homemaker), we'd compute\n", "# a mean of 4.5 - does that mean they're part-time retired, part-time homemaker?!" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 78008.000000\n", "mean 6.113873\n", "std 2.293152\n", "min 1.000000\n", "25% 5.000000\n", "50% 6.000000\n", "75% 8.000000\n", "max 10.000000\n", "Name: Y11_Q53a, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_Q53a\"].describe() # Ordered category\n", "\n", "# Need to be careful about interpreting the mean for ordered categories. For this column the\n", "# categories are ratings from 1 to 10, and the mean rating represents what you'd expect. Others may\n", "# be more nuanced, e.g. the age category column has categories with different sized bins (18-24\n", "# covers 7 years, 35-49 covers 15 years, 65+ covers N years)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For one of the categorical columns, replace the numerical encodings with their actual meaning (category title). You can do this by manually creating a Python dictionary with the values to replace (we'll look at extracting them with code later). What is the most common value?" 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Employed (includes on leave) 36381\n", "Retired 22925\n", "Homemaker 7372\n", "Unemployed 5375\n", "Student 4259\n", "Unable to work - disability/illness 1720\n", "Other 1238\n", "Name: Y11_EmploymentStatus, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "status_encoding = {\n", " 1: \"Employed (includes on leave)\",\n", " 2: \"Unemployed\",\n", " 3: \"Unable to work - disability/illness\",\n", " 4: \"Retired\",\n", " 5: \"Homemaker\",\n", " 6: \"Student\",\n", " 7: \"Other\"\n", "}\n", "\n", "employment_status = df[\"Y11_EmploymentStatus\"].replace(status_encoding)\n", "employment_status.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making Things More Readable\n", "\n", "***Relevant Sections:*** *2.2.2 (Modifying Columns and Indices), 2.2.4.2 (Text Data)*\n", "\n", "At the moment, we've got column headings such as `'Y11_Country'`, `'Y11_Q31'` and `'Y11_Q32'` in our data. These aren't particularly helpful at a glance and we'd need to do some cross-referencing with `eqls_api_map.csv` to make sense of them.\n", "\n", "To make things more readable, let's rename our columns according to the `'VariableLabel'` column in `mrdoc/excel/eqls_api_map.csv`.\n", "\n", "However, because it can make `.` access a bit tricky, we'd like to make sure we don't have any spaces or non-word characters in our new column names! For consistency, we'd like everything to be lower case." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VariableNameVariableLabelQuestionTopicValueKeywordValueVariableGroupValue
0WaveEQLS WaveEQLS WaveNaNNaNAdministrative Variables
1Y11_CountryCountryCountryGeographiesNaNHousehold Grid and Country
2Y11_Q31Marital statusMarital statusSocial stratification and groupings - Family l...Marital statusFamily and Social Life
3Y11_Q32No. of childrenNumber of children of your ownSocial stratification and groupings - Family l...ChildrenFamily and Social Life
4Y11_ISCEDsimpleEducation completedHighest level of education completedEducation - Higher and furtherEducation levelsEducation
\n", "
" ], "text/plain": [ " VariableName VariableLabel Question \\\n", "0 Wave EQLS Wave EQLS Wave \n", "1 Y11_Country Country Country \n", "2 Y11_Q31 Marital status Marital status \n", "3 Y11_Q32 No. of children Number of children of your own \n", "4 Y11_ISCEDsimple Education completed Highest level of education completed \n", "\n", " TopicValue KeywordValue \\\n", "0 NaN NaN \n", "1 Geographies NaN \n", "2 Social stratification and groupings - Family l... Marital status \n", "3 Social stratification and groupings - Family l... Children \n", "4 Education - Higher and further Education levels \n", "\n", " VariableGroupValue \n", "0 Administrative Variables \n", "1 Household Grid and Country \n", "2 Family and Social Life \n", "3 Family and Social Life \n", "4 Education " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we have to explicitly use latin1 encoding as the file is not saved in utf-8 (our platform default)\n", "eqls_api_map_df = pd.read_csv(MAPPING_CSV_PATH, encoding='latin1')\n", "eqls_api_map_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO:\n", "- replace column names in `df` with corresponding entry in `'VariableLabel'` column from `eqls_api_map_df`\n", "- ensure all column names are entirely lowercase\n", "- ensure only characters [a-zA-Z0-9_] are present in column names\n", " - remove apostrophes (`\"'\"`)\n", " - replace otherwise non-matching (e.g., whitespace or `'/'`) character with `'_'`\n", " - we don't want consecutive `'_'` characters (e.g., `no_of_children` rather than `no__of_children`)\n", "- keep a map (python `dict`) that shows the old -> new column mapping in case we ever want to invert this transformation.\n", "\n", "Example manual mapping (you should produce this with a general code solution!):\n", "\n", "```json\n", "{\n", " ...,\n", " 'Y11_Q32' -> 'no_of_children'\n", " ...,\n", " 'Y11_Q67_4' -> 'citizenship_dont_know',\n", " ...,\n", "}\n", "```\n", "\n", "You may find it helpful to use a site like [regex101](https://regex101.com/) to experiment with creating a\n", "suitable regex expression." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_77880/3644821443.py:9: FutureWarning: The default value of regex will change from True to False in a future version.\n", " new_cols = eqls_api_map_df[\"VariableLabel\"].str.lower()\\\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
eqls_wavecountrymarital_statusno_of_childreneducation_completedrural_urban_livingcitizenship_countrycitizenship_another_eu_membercitizenship_a_non_eu_countrycitizenship_dont_know...dv_anyone_used_would_have_like_to_use_child_care_last_12_months_dv_anyone_used_would_have_like_to_use_long_term_care_last_12_months_dv_no_of_factors_which_made_it_difficult_to_use_child_care_dv_no_of_factors_which_made_it_difficult_to_use_long_term_care_dv_preferred_working_hours_3_groups_dv_preferred_working_hours_of_respondents_partner_3_groups_iso3166_country_urlrowid_for_the_uk_data_service_public_apiroot_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_apiunique_respondent_id
0214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT1https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000083
1214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT2https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000126
2211.02.03.02.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT3https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000267
3212.00.03.01.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT4https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000268
4214.00.03.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT5https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000427
\n", "

5 rows × 199 columns

\n", "
" ], "text/plain": [ " eqls_wave country marital_status no_of_children education_completed \\\n", "0 2 1 4.0 0.0 4.0 \n", "1 2 1 4.0 0.0 4.0 \n", "2 2 1 1.0 2.0 3.0 \n", "3 2 1 2.0 0.0 3.0 \n", "4 2 1 4.0 0.0 3.0 \n", "\n", " rural_urban_living citizenship_country citizenship_another_eu_member \\\n", "0 4.0 NaN NaN \n", "1 4.0 NaN NaN \n", "2 2.0 NaN NaN \n", "3 1.0 NaN NaN \n", "4 4.0 NaN NaN \n", "\n", " citizenship_a_non_eu_country citizenship_dont_know ... \\\n", "0 NaN NaN ... \n", "1 NaN NaN ... \n", "2 NaN NaN ... \n", "3 NaN NaN ... \n", "4 NaN NaN ... \n", "\n", " dv_anyone_used_would_have_like_to_use_child_care_last_12_months_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_anyone_used_would_have_like_to_use_long_term_care_last_12_months_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_no_of_factors_which_made_it_difficult_to_use_child_care_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_no_of_factors_which_made_it_difficult_to_use_long_term_care_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_preferred_working_hours_3_groups_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_preferred_working_hours_of_respondents_partner_3_groups_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " iso3166_country_url \\\n", "0 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "1 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "2 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "3 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "4 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "\n", " rowid_for_the_uk_data_service_public_api \\\n", "0 1 \n", "1 2 \n", "2 3 \n", "3 4 \n", "4 5 \n", "\n", " root_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_api \\\n", "0 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "1 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "2 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "3 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "4 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "\n", " unique_respondent_id \n", "0 AT9000083 \n", "1 AT9000126 \n", "2 AT9000267 \n", "3 AT9000268 \n", "4 AT9000427 \n", "\n", "[5 rows x 199 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Answer\n", "old_cols = eqls_api_map_df[\"VariableName\"]\n", "\n", "# The 4 .str calls below:\n", "# convert to lowercase\n", "# remove apostrophes\n", "# replace whitespace with _\n", "# rremove consecutive underscores\n", "new_cols = eqls_api_map_df[\"VariableLabel\"].str.lower()\\\n", " .str.replace(\"'\",'')\\\n", " .str.replace('[^\\w]','_')\\\n", " .str.replace(\"_+\", \"_\")\n", "\n", "column_mapping = dict(zip(old_cols, new_cols))\n", "\n", "\n", "# Apply your column mapping to df\n", "df = df.rename(columns=column_mapping)\n", "\n", "display(df.head())" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking each column...\n", "✅ Column mapping correct... 
Checking df columns set correctly...\n", "✅ Success!\n" ] } ], "source": [ "# check your answer\n", "check_column_mapping(column_mapping, df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Self-Reported Health\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency), 2.2.3 (Feature Engineering), 2.2.4.3 (Categorical Data)*\n", "\n", "The research question we defined in Module 1 and will explore in more detail in Modules 3 and 4 aims to explore the contribution of various factors on self-reported health.\n", "\n", "Which column in the dataset contains self-reported health values? How many people had \"Very good\" self-reported health?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 16898\n", "2.0 31248\n", "3.0 22265\n", "4.0 6916\n", "5.0 1802\n", "Name: health_condition, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The relevant column is \"health_condition\" (Y11_Q42 before renaming)\n", "df[\"health_condition\"].value_counts().sort_index()\n", "\n", "# 16898 people reported \"very good\" health (value of 1 below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the models we develop in Module 4 we'll convert self-reported health into a binary variable. What might be a sensible way to group the categories into only two possible values? Create a new column with your proposed binary encoding for self-reported health." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
health_conditionbad_health
282922.00
439842.00
374853.00
760121.00
496081.00
\n", "
" ], "text/plain": [ " health_condition bad_health\n", "28292 2.0 0\n", "43984 2.0 0\n", "37485 3.0 0\n", "76012 1.0 0\n", "49608 1.0 0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 70552\n", "1 8718\n", "Name: bad_health, dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Could assign 1, 2, 3 (very good, good, fair) as \"good health\", and 4, 5 (bad, very bad) as \"bad health\".\n", "df[\"bad_health\"] = (df[\"health_condition\"] > 3).astype(int)\n", "\n", "display(df[[\"health_condition\", \"bad_health\"]].sample(5))\n", "\n", "# Note this creates very imbalanced classes.\n", "df[\"bad_health\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Values\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency), 2.2.7 (Missing Data)*\n", "\n", "There are at least three different ways missing data and unknown information are represented in this dataset.\n", "\n", "Look at one row of the data. What value does Pandas use to represent missing data? How many missing values are there in the row you picked?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 1, 4.0, 0.0, 4.0, 4.0, nan, nan, nan, nan, nan, 1, 1, 4, nan, 5,\n", " 2.0, 6, nan, 2.0, nan, 2, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, 1.0, 2.0, nan, 4.0, 3.0, 1.0, 2.0, 2.0,\n", " 2.0, 2.0, 1.0, 4.0, nan, nan, nan, nan, nan, nan, 2.0, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, 7.0, 6.0, 8.0, 7.0,\n", " nan, nan, 6.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, 6.0, 2.0, 2.0, 2.0, 2.0,\n", " 2.0, 2.0, nan, 7.0, 6.0, 7.0, 8.0, 8.0, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, 2.0, nan, nan, 2.0, 3.0, 1.0, 2.0, 4.0,\n", " nan, 3.5, nan, nan, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0,\n", " nan, nan, nan, nan, 0.0, 2.0, nan, nan, nan, 3.0, 9.0, nan, 9.0,\n", " 7.0, 8.0, 10.0, 10.0, nan, 7.0, 3.0, 2.0, 2.0, 1.0, 2.0, nan, nan,\n", " nan, 80.0, 3.6421, 1.80451061034635, 1.80982660081538, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, 2.0, 5.0, 2.0, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, 3.0, nan, nan, nan, nan,\n", " nan, nan, 'https://www.iso.org/obp/ui/#iso:code:3166:AT', 1,\n", " 'https://api.ukdataservice.ac.uk/V1/datasets/eqls/rows/1',\n", " 'AT9000083', 0], dtype=object)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0].values\n", "# missing values represented by \"nan\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "113" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0].isna().sum()\n", "# no. of missing values in this row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the possible values for each column in the dataset documentation, find two different columns and at least two different values that are also used to represent missing or unknown information." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:** Examples include:\n", "- `education_3_groups` (`Y11_Education` before renaming)\n", " - 5 = \"Don't know\", 6 = \"Refused to Answer\".\n", " - 4 = \"Educated abroad\": Could also be considered missing (as we don't know whether it was primary/secondary/tertiary education)\n", " \n", "- `direct_contact_with_children` (`Y11_Q33a` before renaming)\n", " - 6 = \"NA\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the columns you picked:\n", "- How many missing values does Pandas say the column has?\n", "- How many values match the numeric representation of missing data you found from the documentation (e.g., if the documentation says -99 means unknown, how many -99 values are there in the column)?\n", "- Does Pandas include the numeric representation in its count of missing values?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "education_3_groups : 623 Pandas missing values\n", "education_3_groups : 276 Numeric representation of missing values\n", "\n", "direct_contact_with_children : 47519 Pandas missing values\n", "direct_contact_with_children : 6860 Numeric representation of missing values\n" ] } ], "source": [ "column = \"education_3_groups\"\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n", "print(column, \":\", (df[column] > 3).sum(), \"Numeric representation of missing values\\n\")\n", "\n", "column = \"direct_contact_with_children\"\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n", "print(column, \":\", (df[column] == 6).sum(), \"Numeric representation of missing values\")\n", "\n", "# In both cases Pandas does NOT know to include the numeric values in its count of missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace the numbers representing missing values in the columns you picked with the `NaN` type from `numpy`. What is the Pandas count of missing values now?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "education_3_groups : 899 Pandas missing values\n", "direct_contact_with_children : 54379 Pandas missing values\n" ] } ], "source": [ "import numpy as np\n", "\n", "column = \"education_3_groups\"\n", "df[column] = df[column].replace([4,5,6], np.nan)\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n", "\n", "column = \"direct_contact_with_children\"\n", "df[column] = df[column].replace(6, np.nan)\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there different types of missing data in the dataset (different reasons values can be unknown)? Does this have any implications for the way you would analyse the data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:** Yes, as described above in the `education_3_groups` values could be unknown becuase they're missing, the person refused to anwer, or the person was educated abroad. Being educated abroad, for example, is quite a different reason for not knowing what level of education they received (and perhaps could be correlated with having more opportunities or wealth)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2007 vs 2011\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency), 2.2.7 (Missing Data)*\n", "\n", "- Which column should be used to distinguish between the collection years? (2007 and 2011)\n", "- How many rows do we have for each year?\n", "- For each collection year, what % of null values do we have for each column?\n", " - Why is this?\n", " - Display these %s in descending order sorted by: 2007 then 2011" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 43636\n", "2 35634\n", "Name: eqls_wave, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Column to use is \"eqls_wave\"\n", "# 2 is 2007, and 3 is 2011\n", "df[\"eqls_wave\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
eqls_wave23
difficult_to_use_long_term_care_because_of_quality_of_care_100.089.603
difficult_to_use_long_term_care_because_of_cost_100.089.534
difficult_to_use_long_term_care_because_of_access_100.089.495
difficult_to_use_long_term_care_because_of_availability_100.089.433
difficult_to_use_child_care_because_of_quality_of_care_100.083.401
.........
iso3166_country_url0.00.000
rowid_for_the_uk_data_service_public_api0.00.000
root_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_api0.00.000
unique_respondent_id0.00.000
bad_health0.00.000
\n", "

200 rows × 2 columns

\n", "
" ], "text/plain": [ "eqls_wave 2 3\n", "difficult_to_use_long_term_care_because_of_qual... 100.0 89.603\n", "difficult_to_use_long_term_care_because_of_cost_ 100.0 89.534\n", "difficult_to_use_long_term_care_because_of_access_ 100.0 89.495\n", "difficult_to_use_long_term_care_because_of_avai... 100.0 89.433\n", "difficult_to_use_child_care_because_of_quality_... 100.0 83.401\n", "... ... ...\n", "iso3166_country_url 0.0 0.000\n", "rowid_for_the_uk_data_service_public_api 0.0 0.000\n", "root_uri_for_a_row_respondent_that_displays_all... 0.0 0.000\n", "unique_respondent_id 0.0 0.000\n", "bad_health 0.0 0.000\n", "\n", "[200 rows x 2 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "missing = df.groupby(\"eqls_wave\").apply(lambda g: g.isna().mean() * 100).round(3).transpose()\n", "missing.sort_values(by=[2,3], ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why? In `UKDA-7724-csv/mrdoc/excel/eqls_concordance_grid.xlsx` we can see some variables not collected for given year." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## UK vs Spain\n", "\n", "Further to the missing data we saw above, grouped by wave/year, how do missing values look for each country of collection?\n", "\n", "Compare the UK with Spain:\n", "- are there columns that have all values for one country but some are missing for the other?\n", "- are there columns that don't have any values for one country but at least some are present for the other?\n", "\n", "What implications are there from your answers to the above questions?\n", "\n", "For simplicity, just look at 2011 data." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Select 2011 data\n", "df_2011 = df[df[\"eqls_wave\"] == 3]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Variable nameVariable labelVariable question textTopic ClassificationsKeywordsVariable GroupsValues if present - 2007Values if present - 2011
0WaveEQLS WaveEQLS WaveNaNNaNAdministrative Variables2 = 2007\\n3 = 20112 = 2007\\n3 = 2011
1Y11_CountryCountryCountryGeographiesNaNHousehold Grid and Country1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy...1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy...
2Y11_Q31Marital statusMarital statusSocial stratification and groupings - Family l...Marital statusFamily and Social Life1 = Married or living with partner\\n2 = Separa...1 = Married or living with partner\\n2 = Separa...
3Y11_Q32No. of childrenNumber of children of your ownSocial stratification and groupings - Family l...ChildrenFamily and Social Life5 = 5 or more children5 = 5 or more children
4Y11_ISCEDsimpleEducation completedHighest level of education completedEducation - Higher and furtherEducation levelsEducation1 = No education completed (ISCED 0)\\n2 = Prim...1 = No education completed (ISCED 0)\\n2 = Prim...
\n", "
" ], "text/plain": [ " Variable name Variable label Variable question text \\\n", "0 Wave EQLS Wave EQLS Wave \n", "1 Y11_Country Country Country \n", "2 Y11_Q31 Marital status Marital status \n", "3 Y11_Q32 No. of children Number of children of your own \n", "4 Y11_ISCEDsimple Education completed Highest level of education completed \n", "\n", " Topic Classifications Keywords \\\n", "0 NaN NaN \n", "1 Geographies NaN \n", "2 Social stratification and groupings - Family l... Marital status \n", "3 Social stratification and groupings - Family l... Children \n", "4 Education - Higher and further Education levels \n", "\n", " Variable Groups \\\n", "0 Administrative Variables \n", "1 Household Grid and Country \n", "2 Family and Social Life \n", "3 Family and Social Life \n", "4 Education \n", "\n", " Values if present - 2007 \\\n", "0 2 = 2007\\n3 = 2011 \n", "1 1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy... \n", "2 1 = Married or living with partner\\n2 = Separa... \n", "3 5 = 5 or more children \n", "4 1 = No education completed (ISCED 0)\\n2 = Prim... \n", "\n", " Values if present - 2011 \n", "0 2 = 2007\\n3 = 2011 \n", "1 1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy... \n", "2 1 = Married or living with partner\\n2 = Separa... \n", "3 5 = 5 or more children \n", "4 1 = No education completed (ISCED 0)\\n2 = Prim... " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Some code for parsing the data in mrdoc/excel/eqls_concordance_grid.xlsx\"\n", "# E.g. country data is categorically encoded in our DataFrame but not human readable\n", "# We can get the human readable categories from this file\n", "\n", "\n", "# load the categorical value data from excel workbook, specifying the appropriate sheet\n", "cat_vals_df = pd.read_excel(CATEGORICAL_VALUES_XLSX_PATH, sheet_name=CATEGORICAL_VALUES_XLSX_SHEET_NAME)\n", "\n", "# display head\n", "cat_vals_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can write your own function to extract the data you need from the `cat_vals_df` dataframe, input it manually, or use our prewritten function" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{1: 'Austria',\n", " 2: 'Belgium',\n", " 3: 'Bulgaria',\n", " 4: 'Cyprus',\n", " 5: 'Czech Republic',\n", " 6: 'Germany',\n", " 7: 'Denmark',\n", " 8: 'Estonia',\n", " 9: 'Greece',\n", " 10: 'Spain',\n", " 11: 'Finland',\n", " 12: 'France',\n", " 13: 'Hungary',\n", " 14: 'Ireland',\n", " 15: 'Italy',\n", " 16: 'Lithuania',\n", " 17: 'Luxembourg',\n", " 18: 'Latvia',\n", " 19: 'Malta',\n", " 20: 'Netherlands',\n", " 21: 'Poland',\n", " 22: 'Portugal',\n", " 23: 'Romania',\n", " 24: 'Sweden',\n", " 25: 'Slovenia',\n", " 26: 'Slovakia',\n", " 27: 'UK',\n", " 28: 'Turkey',\n", " 29: 'Croatia',\n", " 30: 'Macedonia (FYROM)',\n", " 31: 'Kosovo',\n", " 32: 'Serbia',\n", " 33: 'Montenegro',\n", " 34: 'Iceland',\n", " 35: 'Norway'}" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "country_mapping_2011 = parse_country_values_2011(categorical_values_df=cat_vals_df)\n", "\n", "# display\n", "country_mapping_2011" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_77880/1548146086.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using 
.loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df_2011[\"country_human\"] = df_2011[\"country\"].map(country_mapping_2011) # ignore set with copy warning\n" ] } ], "source": [ "# add country_human column with mapped value\n", "df_2011[\"country_human\"] = df_2011[\"country\"].map(country_mapping_2011) # ignore set with copy warning\n", "\n", "# % missing values in each column for each country\n", "by_country = df_2011.groupby(\"country_human\").apply(lambda g: g.isna().mean() * 100).round(3).transpose()\n", "\n", "# extract UK and Spain\n", "uk_vs_spain = by_country[[\"UK\", \"Spain\"]]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
country_humanUKSpain
rural_urban_living0.5770.000
dv_rural_urban_living0.5770.000
degree_of_urbanisation5.5950.000
health_condition0.0440.000
how_frequently_take_part_in_sports_or_exercise_0.0890.000
how_frequently_participate_in_social_activities_0.0890.000
how_often_worked_unpaid_for_community_services_last_12_months_0.0000.794
how_often_worked_unpaid_for_education_cultural_etc_organisation_last_12_months_0.0000.794
how_satisfied_with_present_standard_of_living_0.0440.000
how_satisfied_with_accommodation_0.0440.000
how_satisfied_with_family_life_0.7550.000
\n", "
" ], "text/plain": [ "country_human UK Spain\n", "rural_urban_living 0.577 0.000\n", "dv_rural_urban_living 0.577 0.000\n", "degree_of_urbanisation 5.595 0.000\n", "health_condition 0.044 0.000\n", "how_frequently_take_part_in_sports_or_exercise_ 0.089 0.000\n", "how_frequently_participate_in_social_activities_ 0.089 0.000\n", "how_often_worked_unpaid_for_community_services_... 0.000 0.794\n", "how_often_worked_unpaid_for_education_cultural_... 0.000 0.794\n", "how_satisfied_with_present_standard_of_living_ 0.044 0.000\n", "how_satisfied_with_accommodation_ 0.044 0.000\n", "how_satisfied_with_family_life_ 0.755 0.000" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# one 0% NA but not the other\n", "uk_vs_spain[(uk_vs_spain == 0.0).apply(sum, axis=1) == 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Your Turn!\n", "\n", "We now leave it to you to learn something interesting from the dataset, using any of the techniques we've learned. You could attempt to explore questions like the ones below, for example, but pick anything that appeals to you. Always take care to consider whether the values you've calculated are likely to be representative.\n", "\n", "- Which age group is most optimistic about the future?\n", "- Which country has the most trust in the police? And the least?\n", "- Are there differences between genders for the highest level of education completed? Does this vary by country and/or age group?\n", "\n", "If you prefer, you could also do this with one of the example datasets we used during teaching:\n", "- World Bank percentage of people living in urban environments (Section 2.1.4 Data Sources and Formats)\n", "- Palmer Penguins dataset (Section 2.2.1 Data Consistency)\n", "- Anthropometric Survey of US Army Personnel (Section 2.2.3 Feature Engineering)\n", "- New York Patient Characteristics Survey (Section 2.2.4.3 Categorical Data)\n", "\n", "These datasets are stored in the `coursebook/modules/m2/data` directory, but you may prefer to download the original versions as we made modifications for teaching." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age\n", "1 2.163824\n", "2 2.346278\n", "3 2.548597\n", "4 2.679574\n", "5 2.752320\n", "Name: i_am_optimistic_about_the_future, dtype: float64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Which age group is most optimistic about the future?\n", "\n", "df.groupby(\"age\")[\"i_am_optimistic_about_the_future\"].mean()\n", "# younger people seem to be more optimistic on average" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country_human\n", "Bulgaria 4.168737\n", "Serbia 4.331643\n", "Montenegro 4.531773\n", "Romania 4.583611\n", "Macedonia (FYROM) 4.660144\n", "Croatia 4.662614\n", "Slovakia 4.718782\n", "Cyprus 4.771574\n", "Greece 4.876754\n", "Latvia 4.958420\n", "Slovenia 4.971660\n", "Hungary 5.078764\n", "Czech Republic 5.098802\n", "Poland 5.243012\n", "Lithuania 5.260073\n", "Kosovo 5.314096\n", "Portugal 5.587649\n", "Italy 5.750224\n", "France 5.783880\n", "Belgium 5.892430\n", "Spain 6.129593\n", "Malta 6.338809\n", "Estonia 6.411224\n", "Netherlands 6.498004\n", "UK 6.529096\n", "Luxembourg 6.575605\n", "Ireland 6.644359\n", "Turkey 6.708880\n", "Sweden 6.716000\n", "Germany 6.842923\n", "Austria 6.927734\n", "Denmark 7.830059\n", "Iceland 7.952953\n", "Finland 8.138643\n", "Name: how_much_trust_the_police_, dtype: float64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Which country has the most trust in the police? And the least?\n", "\n", "df_2011.groupby(\"country_human\")[\"how_much_trust_the_police_\"].mean().sort_values()\n", "# Bulgaria have the least trust, Finland the most\n", "\n", "# Caveat for all of this: ~1000 respondents per country (and check user guide for details on data collection)!" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gender\n", "1 4.106716\n", "2 3.952754\n", "Name: education_completed, dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Are there differences between genders for the highest level of education completed? Does this vary\n", "# by country and/or age group?\n", "\n", "df.groupby(\"gender\")[\"education_completed\"].mean()\n", "# Males seem to reach (are given the opportunity to reach) a slightly higher education level\n", "# on average, but we'd need to check the significance of this." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age\n", "1 -0.049220\n", "2 -0.007889\n", "3 0.043056\n", "4 0.145832\n", "5 0.453163\n", "Name: education_completed, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "edu_age_m = df[df[\"gender\"] == 1].groupby(\"age\")[\"education_completed\"].mean()\n", "edu_age_f = df[df[\"gender\"] == 2].groupby(\"age\")[\"education_completed\"].mean()\n", "\n", "edu_age_m - edu_age_f\n", "# The difference appears to be larger in older people and disappears (maybe even reverts)\n", "# in younger people." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country_human\n", "Ireland -0.375420\n", "Estonia -0.181662\n", "Latvia -0.171925\n", "Finland -0.126074\n", "Lithuania -0.087181\n", "Bulgaria -0.032975\n", "Iceland -0.017156\n", "Slovenia 0.004494\n", "Hungary 0.026447\n", "Denmark 0.030770\n", "Sweden 0.047772\n", "France 0.060102\n", "Austria 0.072093\n", "Italy 0.097751\n", "Portugal 0.098896\n", "Poland 0.107906\n", "Cyprus 0.113697\n", "UK 0.127559\n", "Czech Republic 0.131751\n", "Montenegro 0.138855\n", "Belgium 0.150959\n", "Serbia 0.166398\n", "Spain 0.185198\n", "Macedonia (FYROM) 0.185369\n", "Netherlands 0.187195\n", "Slovakia 0.194347\n", "Croatia 0.258897\n", "Greece 0.294329\n", "Germany 0.297801\n", "Kosovo 0.333768\n", "Luxembourg 0.385009\n", "Malta 0.387355\n", "Romania 0.486371\n", "Turkey 0.562414\n", "Name: education_completed, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "edu_cntry_m = df_2011[df_2011[\"gender\"] == 1].groupby(\"country_human\")[\"education_completed\"].mean()\n", "edu_cntry_f = df_2011[df_2011[\"gender\"] == 2].groupby(\"country_human\")[\"education_completed\"].mean()\n", "\n", "(edu_cntry_m - edu_cntry_f).sort_values()\n", "# Ireland, Estonia, Latvia : Top 3 countries with higher female education than male\n", "# Turkey, Romania, Malta : Top 3 countries with higher male education than female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**CAVEAT!** The dataset includes weighting factors that should be used when computing some stats and comparing countries. We've simplified and not included that here. See the user guide for details if needed." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "vscode": { "interpreter": { "hash": "b9e8921bd18fbd36d3a09ae9691fc21c58beec206524d0083259030e87e84f05" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }