{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 2: Hands-on session (Solutions)\n", "\n", "⚠️ **Warning** ⚠️ These are sample solutions to the exercises for the Module 2 hands-on session. We don't recommend looking here before you've attempted them!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Setup\n", "\n", "Import necessary packages for this work:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import pandas as pd\n", "\n", "from handson_utils import (\n", " parse_country_values_2011,\n", " check_dataset_load,\n", " check_dataset_explored,\n", " check_column_mapping\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define path to data. This relies on the data being saved in the specified location below." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "DATA_ROOT_PATH = \"../m4/data/UKDA-7724-csv\" # should match the path you unzipped the data to\n", "\n", "COMBINED_CSV_PATH = os.path.join(DATA_ROOT_PATH, \"csv/eqls_2007and2011.csv\")\n", "MAPPING_CSV_PATH = os.path.join(DATA_ROOT_PATH, \"mrdoc/excel/eqls_api_map.csv\")\n", "\n", "# consts for loading categorical data value maps\n", "CATEGORICAL_VALUES_XLSX_PATH = os.path.join(DATA_ROOT_PATH, \"mrdoc/excel/eqls_concordance_grid.xlsx\")\n", "CATEGORICAL_VALUES_XLSX_SHEET_NAME = \"Values\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Downloaded Files\n", "\n", "Take some time to familiarise yourself with the file structure of the downloaded files (in the `UKDA-7724-csv` directory), opening them in Excel/Numbers/relevant application of your choice initially. In particular:\n", "\n", "- The table of files `mrdoc/excel/7724_file_information_csv.csv`\n", "- The `csv` directory\n", " - What data does each file contain?\n", "- The user guide `mrdoc/pdf/7724_eqls_2007-2011_user_guide_v2.pdf`\n", "- The \"Variables\" and \"Values\" worksheet in `mrdoc/excel/eqls_concordance_grid.xlsx`\n", " - How many variables are there? Can they be grouped in any way?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:** There are 199 variables listed in `mrdoc/excel/eqls_concordance_grid.xlsx`. There are a number of \"Variable Groups\" (e.g., \"Family and Social Life\") and \"Topic Classifications\" (e.g., \"Social stratification and groupings - Family life and marriage\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and Explore\n", "\n", "***Relevant sections:*** *2.1.4 (Data Sources and Formats), 2.2.1 (Data Consistency)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the combined 2007 and 2011 data into a pandas `DataFrame`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WaveY11_CountryY11_Q31Y11_Q32Y11_ISCEDsimpleY11_Q49Y11_Q67_1Y11_Q67_2Y11_Q67_3Y11_Q67_4...DV_Q54aDV_Q54bDV_Q55DV_Q56DV_Q8DV_Q10ISO3166_CountryRowIDURIRowIDUniqueID
0214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT1https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000083
1214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT2https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000126
2211.02.03.02.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT3https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000267
3212.00.03.01.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT4https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000268
4214.00.03.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT5https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000427
\n", "

5 rows × 199 columns

\n", "
" ], "text/plain": [ " Wave Y11_Country Y11_Q31 Y11_Q32 Y11_ISCEDsimple Y11_Q49 Y11_Q67_1 \\\n", "0 2 1 4.0 0.0 4.0 4.0 NaN \n", "1 2 1 4.0 0.0 4.0 4.0 NaN \n", "2 2 1 1.0 2.0 3.0 2.0 NaN \n", "3 2 1 2.0 0.0 3.0 1.0 NaN \n", "4 2 1 4.0 0.0 3.0 4.0 NaN \n", "\n", " Y11_Q67_2 Y11_Q67_3 Y11_Q67_4 ... DV_Q54a DV_Q54b DV_Q55 DV_Q56 \\\n", "0 NaN NaN NaN ... NaN NaN NaN NaN \n", "1 NaN NaN NaN ... NaN NaN NaN NaN \n", "2 NaN NaN NaN ... NaN NaN NaN NaN \n", "3 NaN NaN NaN ... NaN NaN NaN NaN \n", "4 NaN NaN NaN ... NaN NaN NaN NaN \n", "\n", " DV_Q8 DV_Q10 ISO3166_Country RowID \\\n", "0 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 1 \n", "1 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 2 \n", "2 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 3 \n", "3 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 4 \n", "4 NaN NaN https://www.iso.org/obp/ui/#iso:code:3166:AT 5 \n", "\n", " URIRowID UniqueID \n", "0 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000083 \n", "1 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000126 \n", "2 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000267 \n", "3 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000268 \n", "4 https://api.ukdataservice.ac.uk/V1/datasets/eq... AT9000427 \n", "\n", "[5 rows x 199 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# ANSWER read in the file and display the head\n", "####\n", "\n", "df = pd.read_csv(COMBINED_CSV_PATH)\n", "display(df.head())\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ df loaded correctly\n" ] } ], "source": [ "# checks - will produce an `AssertionError` until DataFrame loaded correctly\n", "check_dataset_load(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at some summary statistics and use these to assign the variables below correctly.\n", "\n", "You can assign the variables manually from inspecting printed output or assign them results from methods/properties." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 79270 entries, 0 to 79269\n", "Columns: 199 entries, Wave to UniqueID\n", "dtypes: float64(187), int64(9), object(3)\n", "memory usage: 120.4+ MB\n" ] } ], "source": [ "# ANSWER - example functions to explore and answer questions below\n", "\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WaveY11_CountryY11_Q31Y11_Q32Y11_ISCEDsimpleY11_Q49Y11_Q67_1Y11_Q67_2Y11_Q67_3Y11_Q67_4...DV_Q7DV_Q67DV_Q43Q44DV_Q54aDV_Q54bDV_Q55DV_Q56DV_Q8DV_Q10RowID
count79270.00000079270.00000078756.00000078769.00000078556.00000079082.00000043636.00000043636.00000043636.00000043636.000000...2225.00000043636.00000078312.00000043636.00000043636.00000043636.00000043636.00000043636.00000043636.00000079270.000000
mean2.55047316.8411381.8560491.5981414.0191462.6409551.9593681.0236731.0192041.001971...52.6121351.0864652.4859922.8155652.9256350.3034420.2314373.9317083.28348239635.500000
std0.4974499.3583201.1862711.2764251.3689930.9873520.1974370.1520300.1372440.044351...15.6969430.4603880.8385580.7216420.5684030.8819790.8277270.4362541.13066722883.422256
min2.0000001.0000001.0000000.0000001.0000001.0000001.0000001.0000001.0000001.000000...5.0000001.0000001.0000001.0000001.0000000.0000000.0000001.0000001.0000001.000000
25%2.0000009.0000001.0000000.0000003.0000002.0000002.0000001.0000001.0000001.000000...43.0000001.0000002.0000003.0000003.0000000.0000000.0000004.0000002.00000019818.250000
50%3.00000016.0000001.0000002.0000004.0000003.0000002.0000001.0000001.0000001.000000...50.0000001.0000003.0000003.0000003.0000000.0000000.0000004.0000004.00000039635.500000
75%3.00000025.0000003.0000002.0000005.0000004.0000002.0000001.0000001.0000001.000000...61.0000001.0000003.0000003.0000003.0000000.0000000.0000004.0000004.00000059452.750000
max3.00000035.0000004.0000005.0000008.0000004.0000002.0000002.0000002.0000002.000000...80.0000006.0000003.0000006.0000006.0000004.0000004.0000004.0000004.00000079270.000000
\n", "

8 rows × 196 columns

\n", "
" ], "text/plain": [ " Wave Y11_Country Y11_Q31 Y11_Q32 \\\n", "count 79270.000000 79270.000000 78756.000000 78769.000000 \n", "mean 2.550473 16.841138 1.856049 1.598141 \n", "std 0.497449 9.358320 1.186271 1.276425 \n", "min 2.000000 1.000000 1.000000 0.000000 \n", "25% 2.000000 9.000000 1.000000 0.000000 \n", "50% 3.000000 16.000000 1.000000 2.000000 \n", "75% 3.000000 25.000000 3.000000 2.000000 \n", "max 3.000000 35.000000 4.000000 5.000000 \n", "\n", " Y11_ISCEDsimple Y11_Q49 Y11_Q67_1 Y11_Q67_2 \\\n", "count 78556.000000 79082.000000 43636.000000 43636.000000 \n", "mean 4.019146 2.640955 1.959368 1.023673 \n", "std 1.368993 0.987352 0.197437 0.152030 \n", "min 1.000000 1.000000 1.000000 1.000000 \n", "25% 3.000000 2.000000 2.000000 1.000000 \n", "50% 4.000000 3.000000 2.000000 1.000000 \n", "75% 5.000000 4.000000 2.000000 1.000000 \n", "max 8.000000 4.000000 2.000000 2.000000 \n", "\n", " Y11_Q67_3 Y11_Q67_4 ... DV_Q7 DV_Q67 \\\n", "count 43636.000000 43636.000000 ... 2225.000000 43636.000000 \n", "mean 1.019204 1.001971 ... 52.612135 1.086465 \n", "std 0.137244 0.044351 ... 15.696943 0.460388 \n", "min 1.000000 1.000000 ... 5.000000 1.000000 \n", "25% 1.000000 1.000000 ... 43.000000 1.000000 \n", "50% 1.000000 1.000000 ... 50.000000 1.000000 \n", "75% 1.000000 1.000000 ... 61.000000 1.000000 \n", "max 2.000000 2.000000 ... 80.000000 6.000000 \n", "\n", " DV_Q43Q44 DV_Q54a DV_Q54b DV_Q55 DV_Q56 \\\n", "count 78312.000000 43636.000000 43636.000000 43636.000000 43636.000000 \n", "mean 2.485992 2.815565 2.925635 0.303442 0.231437 \n", "std 0.838558 0.721642 0.568403 0.881979 0.827727 \n", "min 1.000000 1.000000 1.000000 0.000000 0.000000 \n", "25% 2.000000 3.000000 3.000000 0.000000 0.000000 \n", "50% 3.000000 3.000000 3.000000 0.000000 0.000000 \n", "75% 3.000000 3.000000 3.000000 0.000000 0.000000 \n", "max 3.000000 6.000000 6.000000 4.000000 4.000000 \n", "\n", " DV_Q8 DV_Q10 RowID \n", "count 43636.000000 43636.000000 79270.000000 \n", "mean 3.931708 3.283482 39635.500000 \n", "std 0.436254 1.130667 22883.422256 \n", "min 1.000000 1.000000 1.000000 \n", "25% 4.000000 2.000000 19818.250000 \n", "50% 4.000000 4.000000 39635.500000 \n", "75% 4.000000 4.000000 59452.750000 \n", "max 4.000000 4.000000 79270.000000 \n", "\n", "[8 rows x 196 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "35" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_Country\"].nunique()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "n_columns answer ✅ correct\n", "n_float64_columns answer ✅ correct\n", "Y11_Q31_mean_value answer ✅ correct\n", "Y11_Country_n_unique answer ✅ correct\n" ] } ], "source": [ "# ANSWER\n", "n_columns = 199 # how many columns are there in the DataFrame? 
This matches the number expected from the documentation.\n", "n_float64_columns = 187 # how many columns are of dtype float64?\n", "Y11_Q31_mean_value = 1.856049 # what is the mean of the values in the Y11_Q31 column?\n", "Y11_Country_n_unique = 35 # how many unique values in the Y11_Country column?\n", "\n", "\n", "# function to check your answers (run this cell)\n", "check_dataset_explored(\n", "    n_columns,\n", "    n_float64_columns,\n", "    Y11_Q31_mean_value,\n", "    Y11_Country_n_unique\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Different Data Types\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency)*\n", "\n", "Are there columns containing the following types of data? If so, give examples.\n", "\n", "- Numeric data (without a set of pre-defined categories)\n", "- Binary data\n", "- Categorical data, unordered\n", "- Categorical data, ordered\n", "- Text data\n", "- Time or date data\n", "\n", "**Note:**\n", "\n", "- The dataset contains at least 4 of these, so try to have another look if you found fewer.\n", "- You'll need to refer to the dataset documentation to fully understand which columns contain which types (`mrdoc/excel/eqls_concordance_grid.xlsx` may be helpful, in particular)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer**:\n", "\n", "A few examples:\n", "\n", "- Numeric data\n", "  - `Y11_Q8` (How many hours would you prefer to work)\n", "  - `Y11_Q52` (How much time spent on travelling to work/study)\n", "\n", "- Binary data\n", "  - `Y11_Q67_1` (I am a citizen of the country)\n", "  - `Y11_HH2a` (Gender - discussion point re inclusivity)\n", "\n", "- Categorical data, unordered\n", "  - `Country`\n", "  - `Y11_EmploymentStatus` (e.g., no clear ordering between retired, unable to work, homemaker and unemployed)\n", "\n", "- Categorical data, ordered\n", "  - The majority of the data can be treated as ordered categories.\n", "  - `Y11_Agecategory` (Age)\n", "  - `Y11_Q53a` (Quality of Health services)\n", "\n", "- Text data\n", "  - There are no free text fields, but there are a couple of columns with strings (IDs/URLs mostly), e.g., `ISO3166_Country`, `UniqueID`.\n", "\n", "- Time or date data\n", "  - None in this dataset (though you could argue that `Wave` is, as it represents a collection year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For one example of each type, what are the possible values the data can take, according to the documentation? What does each value mean? Is each possible value present in the dataset?"
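, "\n", "\n", "One generic way to check whether every documented code actually shows up in the data is to compare sets of codes. This is only a sketch: the `documented_codes` set below is hypothetical and would need to be filled in from the \"Values\" sheet of `eqls_concordance_grid.xlsx` for the column you picked.\n", "\n", "```python\n", "documented_codes = {1, 2, 3, 4, 5, 6, 7}  # hypothetical: copy the codes listed in the documentation\n", "observed_codes = set(df[\"Y11_EmploymentStatus\"].dropna().unique())\n", "\n", "print(\"Documented but never observed:\", documented_codes - observed_codes)\n", "print(\"Observed but not documented:\", observed_codes - documented_codes)\n", "```\n"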
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 37958.000000\n", "mean 29.658043\n", "std 16.476718\n", "min 0.000000\n", "25% 20.000000\n", "50% 36.000000\n", "75% 40.000000\n", "max 120.000000\n", "Name: Y11_Q8, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Numeric example:\n", "# Note the question of \"what are the possible values\" only really makes sense for categorical data,\n", "# but we could talk about the range of values in the data.\n", "\n", "df[\"Y11_Q8\"].describe()\n", "# units are hours" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.0 41863\n", "1.0 1773\n", "Name: Y11_Q67_1, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Binary example:\n", "\n", "df[\"Y11_Q67_1\"].value_counts()\n", "# 1 means No and 2 means Yes" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 36381\n", "4 22925\n", "5 7372\n", "2 5375\n", "6 4259\n", "3 1720\n", "7 1238\n", "Name: Y11_EmploymentStatus, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Unordered categorical example:\n", "df[\"Y11_EmploymentStatus\"].value_counts()\n", "\n", "# Values represent employment status: 1 = Employed (includes on leave), 2 = Unemployed, 3 = Unable to work - disability/illness,\n", "# 4 = Retired, 5 = Homemaker, 6 = Student, 7 = Other\"\n", "# There are rows with all 7 values present.\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 3481\n", "2.0 2876\n", "3.0 5041\n", "4.0 5963\n", "5.0 12780\n", "6.0 9705\n", "7.0 13805\n", "8.0 13970\n", "9.0 5672\n", "10.0 4715\n", "Name: Y11_Q53a, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Ordered categorical example:\n", "df[\"Y11_Q53a\"].value_counts().sort_index()\n", "\n", "# Values are respondent's rating of the quality of health services, from 1 (very poor) to 10 (very high).\n", "# There are rows with all 10 values present." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the minimum, maximum and mean value in each of your example columns? Taking into consideration the type of data in the column, are all of these values meaningful? 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 37958.000000\n", "mean 29.658043\n", "std 16.476718\n", "min 0.000000\n", "25% 20.000000\n", "50% 36.000000\n", "75% 40.000000\n", "max 120.000000\n", "Name: Y11_Q8, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Can use \"describe\" or the \"min\", \"max\" and \"mean\" functions\n", "\n", "df[\"Y11_Q8\"].describe() # Numeric" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 43636.000000\n", "mean 1.959368\n", "std 0.197437\n", "min 1.000000\n", "25% 2.000000\n", "50% 2.000000\n", "75% 2.000000\n", "max 2.000000\n", "Name: Y11_Q67_1, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_Q67_1\"].describe() # Binary\n", "\n", "# The mean represents the proportion of people that answered \"Yes\" here (1.96 means 96% of\n", "# people answered Yes (2), and 4% no (1). Would be clearer if no was 0 and yes was 1." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 79270.000000\n", "mean 2.713145\n", "std 1.793318\n", "min 1.000000\n", "25% 1.000000\n", "50% 2.000000\n", "75% 4.000000\n", "max 7.000000\n", "Name: Y11_EmploymentStatus, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_EmploymentStatus\"].describe() # Unordered category\n", "\n", "# Note: The mean of an unordered category is not meaningful!\n", "# E.g. if we had 2 people, one with a value of 4 (retired), and one with 5 (homemaker), we'd compute\n", "# a mean of 4.5 - does that mean they're part-time retired, part-time homemaker?!" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 78008.000000\n", "mean 6.113873\n", "std 2.293152\n", "min 1.000000\n", "25% 5.000000\n", "50% 6.000000\n", "75% 8.000000\n", "max 10.000000\n", "Name: Y11_Q53a, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Y11_Q53a\"].describe() # Ordered category\n", "\n", "# Need to be careful about interpreting the mean for ordered categories. For this column the\n", "# categories are ratings from 1 to 10, and the mean rating represents what you'd expect. Others may\n", "# be more nuanced, e.g. the age category column has categories with different sized bins (18-24\n", "# covers 7 years, 35-49 covers 15 years, 65+ covers N years)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For one of the categorical columns, replace the numerical encodings with their actual meaning (category title). You can do this by manually creating a Python dictionary with the values to replace (we'll look at extracting them with code later). What is the most common value?" 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Employed (includes on leave) 36381\n", "Retired 22925\n", "Homemaker 7372\n", "Unemployed 5375\n", "Student 4259\n", "Unable to work - disability/illness 1720\n", "Other 1238\n", "Name: Y11_EmploymentStatus, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "status_encoding = {\n", " 1: \"Employed (includes on leave)\",\n", " 2: \"Unemployed\",\n", " 3: \"Unable to work - disability/illness\",\n", " 4: \"Retired\",\n", " 5: \"Homemaker\",\n", " 6: \"Student\",\n", " 7: \"Other\"\n", "}\n", "\n", "employment_status = df[\"Y11_EmploymentStatus\"].replace(status_encoding)\n", "employment_status.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making Things More Readable\n", "\n", "***Relevant Sections:*** *2.2.2 (Modifying Columns and Indices), 2.2.4.2 (Text Data)*\n", "\n", "At the moment, we've got column headings such as `'Y11_Country'`, `'Y11_Q31'` and `'Y11_Q32'` in our data. These aren't particularly helpful at a glance and we'd need to do some cross-referencing with `eqls_api_map.csv` to make sense of them.\n", "\n", "To make things more readable, let's rename our columns according to the `'VariableLabel'` column in `mrdoc/excel/eqls_api_map.csv`.\n", "\n", "However, because it can make `.` access a bit tricky, we'd like to make sure we don't have any spaces or non-word characters in our new column names! For consistency, we'd like everything to be lower case." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VariableNameVariableLabelQuestionTopicValueKeywordValueVariableGroupValue
0WaveEQLS WaveEQLS WaveNaNNaNAdministrative Variables
1Y11_CountryCountryCountryGeographiesNaNHousehold Grid and Country
2Y11_Q31Marital statusMarital statusSocial stratification and groupings - Family l...Marital statusFamily and Social Life
3Y11_Q32No. of childrenNumber of children of your ownSocial stratification and groupings - Family l...ChildrenFamily and Social Life
4Y11_ISCEDsimpleEducation completedHighest level of education completedEducation - Higher and furtherEducation levelsEducation
\n", "
" ], "text/plain": [ " VariableName VariableLabel Question \\\n", "0 Wave EQLS Wave EQLS Wave \n", "1 Y11_Country Country Country \n", "2 Y11_Q31 Marital status Marital status \n", "3 Y11_Q32 No. of children Number of children of your own \n", "4 Y11_ISCEDsimple Education completed Highest level of education completed \n", "\n", " TopicValue KeywordValue \\\n", "0 NaN NaN \n", "1 Geographies NaN \n", "2 Social stratification and groupings - Family l... Marital status \n", "3 Social stratification and groupings - Family l... Children \n", "4 Education - Higher and further Education levels \n", "\n", " VariableGroupValue \n", "0 Administrative Variables \n", "1 Household Grid and Country \n", "2 Family and Social Life \n", "3 Family and Social Life \n", "4 Education " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we have to explicitly use latin1 encoding as the file is not saved in utf-8 (our platform default)\n", "eqls_api_map_df = pd.read_csv(MAPPING_CSV_PATH, encoding='latin1')\n", "eqls_api_map_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO:\n", "- replace column names in `df` with corresponding entry in `'VariableLabel'` column from `eqls_api_map_df`\n", "- ensure all column names are entirely lowercase\n", "- ensure only characters [a-zA-Z0-9_] are present in column names\n", " - remove apostrophes (`\"'\"`)\n", " - replace otherwise non-matching (e.g., whitespace or `'/'`) character with `'_'`\n", " - we don't want consecutive `'_'` characters (e.g., `no_of_children` rather than `no__of_children`)\n", "- keep a map (python `dict`) that shows the old -> new column mapping in case we ever want to invert this transformation.\n", "\n", "Example manual mapping (you should produce this with a general code solution!):\n", "\n", "```json\n", "{\n", " ...,\n", " 'Y11_Q32' -> 'no_of_children'\n", " ...,\n", " 'Y11_Q67_4' -> 'citizenship_dont_know',\n", " ...,\n", "}\n", "```\n", "\n", "You may find it helpful to use a site like [regex101](https://regex101.com/) to experiment with creating a\n", "suitable regex expression." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_77880/3644821443.py:9: FutureWarning: The default value of regex will change from True to False in a future version.\n", " new_cols = eqls_api_map_df[\"VariableLabel\"].str.lower()\\\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
eqls_wavecountrymarital_statusno_of_childreneducation_completedrural_urban_livingcitizenship_countrycitizenship_another_eu_membercitizenship_a_non_eu_countrycitizenship_dont_know...dv_anyone_used_would_have_like_to_use_child_care_last_12_months_dv_anyone_used_would_have_like_to_use_long_term_care_last_12_months_dv_no_of_factors_which_made_it_difficult_to_use_child_care_dv_no_of_factors_which_made_it_difficult_to_use_long_term_care_dv_preferred_working_hours_3_groups_dv_preferred_working_hours_of_respondents_partner_3_groups_iso3166_country_urlrowid_for_the_uk_data_service_public_apiroot_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_apiunique_respondent_id
0214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT1https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000083
1214.00.04.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT2https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000126
2211.02.03.02.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT3https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000267
3212.00.03.01.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT4https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000268
4214.00.03.04.0NaNNaNNaNNaN...NaNNaNNaNNaNNaNNaNhttps://www.iso.org/obp/ui/#iso:code:3166:AT5https://api.ukdataservice.ac.uk/V1/datasets/eq...AT9000427
\n", "

5 rows × 199 columns

\n", "
" ], "text/plain": [ " eqls_wave country marital_status no_of_children education_completed \\\n", "0 2 1 4.0 0.0 4.0 \n", "1 2 1 4.0 0.0 4.0 \n", "2 2 1 1.0 2.0 3.0 \n", "3 2 1 2.0 0.0 3.0 \n", "4 2 1 4.0 0.0 3.0 \n", "\n", " rural_urban_living citizenship_country citizenship_another_eu_member \\\n", "0 4.0 NaN NaN \n", "1 4.0 NaN NaN \n", "2 2.0 NaN NaN \n", "3 1.0 NaN NaN \n", "4 4.0 NaN NaN \n", "\n", " citizenship_a_non_eu_country citizenship_dont_know ... \\\n", "0 NaN NaN ... \n", "1 NaN NaN ... \n", "2 NaN NaN ... \n", "3 NaN NaN ... \n", "4 NaN NaN ... \n", "\n", " dv_anyone_used_would_have_like_to_use_child_care_last_12_months_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_anyone_used_would_have_like_to_use_long_term_care_last_12_months_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_no_of_factors_which_made_it_difficult_to_use_child_care_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_no_of_factors_which_made_it_difficult_to_use_long_term_care_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_preferred_working_hours_3_groups_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " dv_preferred_working_hours_of_respondents_partner_3_groups_ \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " iso3166_country_url \\\n", "0 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "1 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "2 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "3 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "4 https://www.iso.org/obp/ui/#iso:code:3166:AT \n", "\n", " rowid_for_the_uk_data_service_public_api \\\n", "0 1 \n", "1 2 \n", "2 3 \n", "3 4 \n", "4 5 \n", "\n", " root_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_api \\\n", "0 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "1 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "2 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "3 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "4 https://api.ukdataservice.ac.uk/V1/datasets/eq... \n", "\n", " unique_respondent_id \n", "0 AT9000083 \n", "1 AT9000126 \n", "2 AT9000267 \n", "3 AT9000268 \n", "4 AT9000427 \n", "\n", "[5 rows x 199 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Answer\n", "old_cols = eqls_api_map_df[\"VariableName\"]\n", "\n", "# The 4 .str calls below:\n", "# convert to lowercase\n", "# remove apostrophes\n", "# replace whitespace with _\n", "# rremove consecutive underscores\n", "new_cols = eqls_api_map_df[\"VariableLabel\"].str.lower()\\\n", " .str.replace(\"'\",'')\\\n", " .str.replace('[^\\w]','_')\\\n", " .str.replace(\"_+\", \"_\")\n", "\n", "column_mapping = dict(zip(old_cols, new_cols))\n", "\n", "\n", "# Apply your column mapping to df\n", "df = df.rename(columns=column_mapping)\n", "\n", "display(df.head())" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking each column...\n", "✅ Column mapping correct... 
Checking df columns set correctly...\n", "✅ Success!\n" ] } ], "source": [ "# check your answer\n", "check_column_mapping(column_mapping, df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Self-Reported Health\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency), 2.2.3 (Feature Engineering), 2.2.4.3 (Categorical Data)*\n", "\n", "The research question we defined in Module 1 and will explore in more detail in Modules 3 and 4 aims to explore the contribution of various factors on self-reported health.\n", "\n", "Which column in the dataset contains self-reported health values? How many people had \"Very good\" self-reported health?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 16898\n", "2.0 31248\n", "3.0 22265\n", "4.0 6916\n", "5.0 1802\n", "Name: health_condition, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The relevant column is \"health_condition\" (Y11_Q42 before renaming)\n", "df[\"health_condition\"].value_counts().sort_index()\n", "\n", "# 16898 people reported \"very good\" health (value of 1 below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the models we develop in Module 4 we'll convert self-reported health into a binary variable. What might be a sensible way to group the categories into only two possible values? Create a new column with your proposed binary encoding for self-reported health." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
health_conditionbad_health
282922.00
439842.00
374853.00
760121.00
496081.00
\n", "
" ], "text/plain": [ " health_condition bad_health\n", "28292 2.0 0\n", "43984 2.0 0\n", "37485 3.0 0\n", "76012 1.0 0\n", "49608 1.0 0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 70552\n", "1 8718\n", "Name: bad_health, dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Could assign 1, 2, 3 (very good, good, fair) as \"good health\", and 4, 5 (bad, very bad) as \"bad health\".\n", "df[\"bad_health\"] = (df[\"health_condition\"] > 3).astype(int)\n", "\n", "display(df[[\"health_condition\", \"bad_health\"]].sample(5))\n", "\n", "# Note this creates very imbalanced classes.\n", "df[\"bad_health\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Values\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency), 2.2.7 (Missing Data)*\n", "\n", "There are at least three different ways missing data and unknown information are represented in this dataset.\n", "\n", "Look at one row of the data. What value does Pandas use to represent missing data? How many missing values are there in the row you picked?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 1, 4.0, 0.0, 4.0, 4.0, nan, nan, nan, nan, nan, 1, 1, 4, nan, 5,\n", " 2.0, 6, nan, 2.0, nan, 2, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, 1.0, 2.0, nan, 4.0, 3.0, 1.0, 2.0, 2.0,\n", " 2.0, 2.0, 1.0, 4.0, nan, nan, nan, nan, nan, nan, 2.0, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, 7.0, 6.0, 8.0, 7.0,\n", " nan, nan, 6.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, 6.0, 2.0, 2.0, 2.0, 2.0,\n", " 2.0, 2.0, nan, 7.0, 6.0, 7.0, 8.0, 8.0, nan, nan, nan, nan, nan,\n", " nan, nan, nan, nan, nan, 2.0, nan, nan, 2.0, 3.0, 1.0, 2.0, 4.0,\n", " nan, 3.5, nan, nan, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0,\n", " nan, nan, nan, nan, 0.0, 2.0, nan, nan, nan, 3.0, 9.0, nan, 9.0,\n", " 7.0, 8.0, 10.0, 10.0, nan, 7.0, 3.0, 2.0, 2.0, 1.0, 2.0, nan, nan,\n", " nan, 80.0, 3.6421, 1.80451061034635, 1.80982660081538, nan, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, nan, 2.0, 5.0, 2.0, nan,\n", " nan, nan, nan, nan, nan, nan, nan, nan, 3.0, nan, nan, nan, nan,\n", " nan, nan, 'https://www.iso.org/obp/ui/#iso:code:3166:AT', 1,\n", " 'https://api.ukdataservice.ac.uk/V1/datasets/eqls/rows/1',\n", " 'AT9000083', 0], dtype=object)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0].values\n", "# missing values represented by \"nan\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "113" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0].isna().sum()\n", "# no. of missing values in this row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the possible values for each column in the dataset documentation, find two different columns and at least two different values that are also used to represent missing or unknown information." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:** Examples include:\n", "- `education_3_groups` (`Y11_Education` before renaming)\n", " - 5 = \"Don't know\", 6 = \"Refused to Answer\".\n", " - 4 = \"Educated abroad\": Could also be considered missing (as we don't know whether it was primary/secondary/tertiary education)\n", " \n", "- `direct_contact_with_children` (`Y11_Q33a` before renaming)\n", " - 6 = \"NA\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the columns you picked:\n", "- How many missing values does Pandas say the column has?\n", "- How many values match the numeric representation of missing data you found from the documentation (e.g., if the documentation says -99 means unknown, how many -99 values are there in the column)?\n", "- Does Pandas include the numeric representation in its count of missing values?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "education_3_groups : 623 Pandas missing values\n", "education_3_groups : 276 Numeric representation of missing values\n", "\n", "direct_contact_with_children : 47519 Pandas missing values\n", "direct_contact_with_children : 6860 Numeric representation of missing values\n" ] } ], "source": [ "column = \"education_3_groups\"\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n", "print(column, \":\", (df[column] > 3).sum(), \"Numeric representation of missing values\\n\")\n", "\n", "column = \"direct_contact_with_children\"\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n", "print(column, \":\", (df[column] == 6).sum(), \"Numeric representation of missing values\")\n", "\n", "# In both cases Pandas does NOT know to include the numeric values in its count of missing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace the numbers representing missing values in the columns you picked with the `NaN` type from `numpy`. What is the Pandas count of missing values now?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "education_3_groups : 899 Pandas missing values\n", "direct_contact_with_children : 54379 Pandas missing values\n" ] } ], "source": [ "import numpy as np\n", "\n", "column = \"education_3_groups\"\n", "df[column] = df[column].replace([4,5,6], np.nan)\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n", "\n", "column = \"direct_contact_with_children\"\n", "df[column] = df[column].replace(6, np.nan)\n", "print(column, \":\", df[column].isna().sum(), \"Pandas missing values\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there different types of missing data in the dataset (different reasons values can be unknown)? Does this have any implications for the way you would analyse the data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer:** Yes, as described above in the `education_3_groups` values could be unknown becuase they're missing, the person refused to anwer, or the person was educated abroad. Being educated abroad, for example, is quite a different reason for not knowing what level of education they received (and perhaps could be correlated with having more opportunities or wealth)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2007 vs 2011\n", "\n", "***Relevant Sections:*** *2.2.1 (Data Consistency), 2.2.7 (Missing Data)*\n", "\n", "- Which column should be used to distinguish between the collection years? (2007 and 2011)\n", "- How many rows do we have for each year?\n", "- For each collection year, what % of null values do we have for each column?\n", " - Why is this?\n", " - Display these %s in descending order sorted by: 2007 then 2011" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 43636\n", "2 35634\n", "Name: eqls_wave, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Column to use is \"eqls_wave\"\n", "# 2 is 2007, and 3 is 2011\n", "df[\"eqls_wave\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
eqls_wave23
difficult_to_use_long_term_care_because_of_quality_of_care_100.089.603
difficult_to_use_long_term_care_because_of_cost_100.089.534
difficult_to_use_long_term_care_because_of_access_100.089.495
difficult_to_use_long_term_care_because_of_availability_100.089.433
difficult_to_use_child_care_because_of_quality_of_care_100.083.401
.........
iso3166_country_url0.00.000
rowid_for_the_uk_data_service_public_api0.00.000
root_uri_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_uk_data_service_public_api0.00.000
unique_respondent_id0.00.000
bad_health0.00.000
\n", "

200 rows × 2 columns

\n", "
" ], "text/plain": [ "eqls_wave 2 3\n", "difficult_to_use_long_term_care_because_of_qual... 100.0 89.603\n", "difficult_to_use_long_term_care_because_of_cost_ 100.0 89.534\n", "difficult_to_use_long_term_care_because_of_access_ 100.0 89.495\n", "difficult_to_use_long_term_care_because_of_avai... 100.0 89.433\n", "difficult_to_use_child_care_because_of_quality_... 100.0 83.401\n", "... ... ...\n", "iso3166_country_url 0.0 0.000\n", "rowid_for_the_uk_data_service_public_api 0.0 0.000\n", "root_uri_for_a_row_respondent_that_displays_all... 0.0 0.000\n", "unique_respondent_id 0.0 0.000\n", "bad_health 0.0 0.000\n", "\n", "[200 rows x 2 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "missing = df.groupby(\"eqls_wave\").apply(lambda g: g.isna().mean() * 100).round(3).transpose()\n", "missing.sort_values(by=[2,3], ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why? In `UKDA-7724-csv/mrdoc/excel/eqls_concordance_grid.xlsx` we can see some variables not collected for given year." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## UK vs Spain\n", "\n", "Further to the missing data we saw above, grouped by wave/year, how do missing values look for each country of collection?\n", "\n", "Compare the UK with Spain:\n", "- are there columns that have all values for one country but some are missing for the other?\n", "- are there columns that don't have any values for one country but at least some are present for the other?\n", "\n", "What implications are there from your answers to the above questions?\n", "\n", "For simplicity, just look at 2011 data." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Select 2011 data\n", "df_2011 = df[df[\"eqls_wave\"] == 3]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Variable nameVariable labelVariable question textTopic ClassificationsKeywordsVariable GroupsValues if present - 2007Values if present - 2011
0WaveEQLS WaveEQLS WaveNaNNaNAdministrative Variables2 = 2007\\n3 = 20112 = 2007\\n3 = 2011
1Y11_CountryCountryCountryGeographiesNaNHousehold Grid and Country1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy...1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy...
2Y11_Q31Marital statusMarital statusSocial stratification and groupings - Family l...Marital statusFamily and Social Life1 = Married or living with partner\\n2 = Separa...1 = Married or living with partner\\n2 = Separa...
3Y11_Q32No. of childrenNumber of children of your ownSocial stratification and groupings - Family l...ChildrenFamily and Social Life5 = 5 or more children5 = 5 or more children
4Y11_ISCEDsimpleEducation completedHighest level of education completedEducation - Higher and furtherEducation levelsEducation1 = No education completed (ISCED 0)\\n2 = Prim...1 = No education completed (ISCED 0)\\n2 = Prim...
\n", "
" ], "text/plain": [ " Variable name Variable label Variable question text \\\n", "0 Wave EQLS Wave EQLS Wave \n", "1 Y11_Country Country Country \n", "2 Y11_Q31 Marital status Marital status \n", "3 Y11_Q32 No. of children Number of children of your own \n", "4 Y11_ISCEDsimple Education completed Highest level of education completed \n", "\n", " Topic Classifications Keywords \\\n", "0 NaN NaN \n", "1 Geographies NaN \n", "2 Social stratification and groupings - Family l... Marital status \n", "3 Social stratification and groupings - Family l... Children \n", "4 Education - Higher and further Education levels \n", "\n", " Variable Groups \\\n", "0 Administrative Variables \n", "1 Household Grid and Country \n", "2 Family and Social Life \n", "3 Family and Social Life \n", "4 Education \n", "\n", " Values if present - 2007 \\\n", "0 2 = 2007\\n3 = 2011 \n", "1 1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy... \n", "2 1 = Married or living with partner\\n2 = Separa... \n", "3 5 = 5 or more children \n", "4 1 = No education completed (ISCED 0)\\n2 = Prim... \n", "\n", " Values if present - 2011 \n", "0 2 = 2007\\n3 = 2011 \n", "1 1 = Austria\\n2 = Belgium\\n3 = Bulgaria\\n4 = Cy... \n", "2 1 = Married or living with partner\\n2 = Separa... \n", "3 5 = 5 or more children \n", "4 1 = No education completed (ISCED 0)\\n2 = Prim... " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Some code for parsing the data in mrdoc/excel/eqls_concordance_grid.xlsx\"\n", "# E.g. country data is categorically encoded in our DataFrame but not human readable\n", "# We can get the human readable categories from this file\n", "\n", "\n", "# load the categorical value data from excel workbook, specifying the appropriate sheet\n", "cat_vals_df = pd.read_excel(CATEGORICAL_VALUES_XLSX_PATH, sheet_name=CATEGORICAL_VALUES_XLSX_SHEET_NAME)\n", "\n", "# display head\n", "cat_vals_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can write your own function to extract the data you need from the `cat_vals_df` dataframe, input it manually, or use our prewritten function" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{1: 'Austria',\n", " 2: 'Belgium',\n", " 3: 'Bulgaria',\n", " 4: 'Cyprus',\n", " 5: 'Czech Republic',\n", " 6: 'Germany',\n", " 7: 'Denmark',\n", " 8: 'Estonia',\n", " 9: 'Greece',\n", " 10: 'Spain',\n", " 11: 'Finland',\n", " 12: 'France',\n", " 13: 'Hungary',\n", " 14: 'Ireland',\n", " 15: 'Italy',\n", " 16: 'Lithuania',\n", " 17: 'Luxembourg',\n", " 18: 'Latvia',\n", " 19: 'Malta',\n", " 20: 'Netherlands',\n", " 21: 'Poland',\n", " 22: 'Portugal',\n", " 23: 'Romania',\n", " 24: 'Sweden',\n", " 25: 'Slovenia',\n", " 26: 'Slovakia',\n", " 27: 'UK',\n", " 28: 'Turkey',\n", " 29: 'Croatia',\n", " 30: 'Macedonia (FYROM)',\n", " 31: 'Kosovo',\n", " 32: 'Serbia',\n", " 33: 'Montenegro',\n", " 34: 'Iceland',\n", " 35: 'Norway'}" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "country_mapping_2011 = parse_country_values_2011(categorical_values_df=cat_vals_df)\n", "\n", "# display\n", "country_mapping_2011" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_77880/1548146086.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using 
.loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df_2011[\"country_human\"] = df_2011[\"country\"].map(country_mapping_2011) # ignore set with copy warning\n" ] } ], "source": [ "# add country_human column with mapped value\n", "df_2011[\"country_human\"] = df_2011[\"country\"].map(country_mapping_2011) # ignore set with copy warning\n", "\n", "# % missing values in each column for each country\n", "by_country = df_2011.groupby(\"country_human\").apply(lambda g: g.isna().mean() * 100).round(3).transpose()\n", "\n", "# extract UK and Spain\n", "uk_vs_spain = by_country[[\"UK\", \"Spain\"]]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
country_humanUKSpain
rural_urban_living0.5770.000
dv_rural_urban_living0.5770.000
degree_of_urbanisation5.5950.000
health_condition0.0440.000
how_frequently_take_part_in_sports_or_exercise_0.0890.000
how_frequently_participate_in_social_activities_0.0890.000
how_often_worked_unpaid_for_community_services_last_12_months_0.0000.794
how_often_worked_unpaid_for_education_cultural_etc_organisation_last_12_months_0.0000.794
how_satisfied_with_present_standard_of_living_0.0440.000
how_satisfied_with_accommodation_0.0440.000
how_satisfied_with_family_life_0.7550.000
\n", "
" ], "text/plain": [ "country_human UK Spain\n", "rural_urban_living 0.577 0.000\n", "dv_rural_urban_living 0.577 0.000\n", "degree_of_urbanisation 5.595 0.000\n", "health_condition 0.044 0.000\n", "how_frequently_take_part_in_sports_or_exercise_ 0.089 0.000\n", "how_frequently_participate_in_social_activities_ 0.089 0.000\n", "how_often_worked_unpaid_for_community_services_... 0.000 0.794\n", "how_often_worked_unpaid_for_education_cultural_... 0.000 0.794\n", "how_satisfied_with_present_standard_of_living_ 0.044 0.000\n", "how_satisfied_with_accommodation_ 0.044 0.000\n", "how_satisfied_with_family_life_ 0.755 0.000" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# one 0% NA but not the other\n", "uk_vs_spain[(uk_vs_spain == 0.0).apply(sum, axis=1) == 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Your Turn!\n", "\n", "We now leave it to you to learn something interesting from the dataset, using any of the techniques we've learned. You could attempt to explore questions like the ones below, for example, but pick anything that appeals to you. Always take care to consider whether the values you've calculated are likely to be representative.\n", "\n", "- Which age group is most optimistic about the future?\n", "- Which country has the most trust in the police? And the least?\n", "- Are there differences between genders for the highest level of education completed? Does this vary by country and/or age group?\n", "\n", "If you prefer, you could also do this with one of the example datasets we used during teaching:\n", "- World Bank percentage of people living in urban environments (Section 2.1.4 Data Sources and Formats)\n", "- Palmer Penguins dataset (Section 2.2.1 Data Consistency)\n", "- Anthropometric Survey of US Army Personnel (Section 2.2.3 Feature Engineering)\n", "- New York Patient Characteristics Survey (Section 2.2.4.3 Categorical Data)\n", "\n", "These datasets are stored in the `coursebook/modules/m2/data` directory, but you may prefer to download the original versions as we made modifications for teaching." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age\n", "1 2.163824\n", "2 2.346278\n", "3 2.548597\n", "4 2.679574\n", "5 2.752320\n", "Name: i_am_optimistic_about_the_future, dtype: float64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Which age group is most optimistic about the future?\n", "\n", "df.groupby(\"age\")[\"i_am_optimistic_about_the_future\"].mean()\n", "# younger people seem to be more optimistic on average" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country_human\n", "Bulgaria 4.168737\n", "Serbia 4.331643\n", "Montenegro 4.531773\n", "Romania 4.583611\n", "Macedonia (FYROM) 4.660144\n", "Croatia 4.662614\n", "Slovakia 4.718782\n", "Cyprus 4.771574\n", "Greece 4.876754\n", "Latvia 4.958420\n", "Slovenia 4.971660\n", "Hungary 5.078764\n", "Czech Republic 5.098802\n", "Poland 5.243012\n", "Lithuania 5.260073\n", "Kosovo 5.314096\n", "Portugal 5.587649\n", "Italy 5.750224\n", "France 5.783880\n", "Belgium 5.892430\n", "Spain 6.129593\n", "Malta 6.338809\n", "Estonia 6.411224\n", "Netherlands 6.498004\n", "UK 6.529096\n", "Luxembourg 6.575605\n", "Ireland 6.644359\n", "Turkey 6.708880\n", "Sweden 6.716000\n", "Germany 6.842923\n", "Austria 6.927734\n", "Denmark 7.830059\n", "Iceland 7.952953\n", "Finland 8.138643\n", "Name: how_much_trust_the_police_, dtype: float64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Which country has the most trust in the police? And the least?\n", "\n", "df_2011.groupby(\"country_human\")[\"how_much_trust_the_police_\"].mean().sort_values()\n", "# Bulgaria have the least trust, Finland the most\n", "\n", "# Caveat for all of this: ~1000 respondents per country (and check user guide for details on data collection)!" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gender\n", "1 4.106716\n", "2 3.952754\n", "Name: education_completed, dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Are there differences between genders for the highest level of education completed? Does this vary\n", "# by country and/or age group?\n", "\n", "df.groupby(\"gender\")[\"education_completed\"].mean()\n", "# Males seem to reach (are given the opportunity to reach) a slightly higher education level\n", "# on average, but we'd need to check the significance of this." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age\n", "1 -0.049220\n", "2 -0.007889\n", "3 0.043056\n", "4 0.145832\n", "5 0.453163\n", "Name: education_completed, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "edu_age_m = df[df[\"gender\"] == 1].groupby(\"age\")[\"education_completed\"].mean()\n", "edu_age_f = df[df[\"gender\"] == 2].groupby(\"age\")[\"education_completed\"].mean()\n", "\n", "edu_age_m - edu_age_f\n", "# The difference appears to be larger in older people and disappears (maybe even reverts)\n", "# in younger people." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country_human\n", "Ireland -0.375420\n", "Estonia -0.181662\n", "Latvia -0.171925\n", "Finland -0.126074\n", "Lithuania -0.087181\n", "Bulgaria -0.032975\n", "Iceland -0.017156\n", "Slovenia 0.004494\n", "Hungary 0.026447\n", "Denmark 0.030770\n", "Sweden 0.047772\n", "France 0.060102\n", "Austria 0.072093\n", "Italy 0.097751\n", "Portugal 0.098896\n", "Poland 0.107906\n", "Cyprus 0.113697\n", "UK 0.127559\n", "Czech Republic 0.131751\n", "Montenegro 0.138855\n", "Belgium 0.150959\n", "Serbia 0.166398\n", "Spain 0.185198\n", "Macedonia (FYROM) 0.185369\n", "Netherlands 0.187195\n", "Slovakia 0.194347\n", "Croatia 0.258897\n", "Greece 0.294329\n", "Germany 0.297801\n", "Kosovo 0.333768\n", "Luxembourg 0.385009\n", "Malta 0.387355\n", "Romania 0.486371\n", "Turkey 0.562414\n", "Name: education_completed, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "edu_cntry_m = df_2011[df_2011[\"gender\"] == 1].groupby(\"country_human\")[\"education_completed\"].mean()\n", "edu_cntry_f = df_2011[df_2011[\"gender\"] == 2].groupby(\"country_human\")[\"education_completed\"].mean()\n", "\n", "(edu_cntry_m - edu_cntry_f).sort_values()\n", "# Ireland, Estonia, Latvia : Top 3 countries with higher female education than male\n", "# Turkey, Romania, Malta : Top 3 countries with higher male education than female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**CAVEAT!** The dataset includes weighting factors that should be used when computing some stats and comparing countries. We've simplified and not included that here. See the user guide for details if needed." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "vscode": { "interpreter": { "hash": "b9e8921bd18fbd36d3a09ae9691fc21c58beec206524d0083259030e87e84f05" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }