{
"cells": [
{
"cell_type": "markdown",
"id": "e88ea460",
"metadata": {
"tags": []
},
"source": [
"# 2.2.1 Data Consistency\n",
"\n",
"In [Section 2.1.4](2-01-04-DataSourcesAndFormats.md) we saw how to load data of different types into pandas. In this section we'll look at first steps you can perform to look at the data, understand what it contains, and a few common issues that may come up with data consistency like missing or mis-interpreted values."
]
},
{
"cell_type": "markdown",
"id": "9c0c3e66",
"metadata": {},
"source": [
"## Domain Knowledge\n",
"\n",
"Before jumping into looking at and analysing the data you should check any information you've been given about it, so you know what to expect. In this section we'll be using the Palmer penguins dataset, adapted from the version created by Allison Horst available [here (doi:10.5281/zenodo.3960218)](https://allisonhorst.github.io/palmerpenguins/).\n",
"\n",
"We've made changes to the data to demonstrate the concepts we're teaching, adding missing values and other common data issues. The (cleaner) original file is available [here](https://github.com/allisonhorst/palmerpenguins/blob/master/inst/extdata/penguins.csv).\n",
"\n",
"The dataset was originally collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/), and published in the [PLOS ONE journal (doi:10.1371/journal.pone.0090081)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081) in 2014 .\n",
"\n",
"It includes measurements of the bill size, flipper length and weight of three different species of penguin (Adélie, Chinstrap, Gentoo) on three different islands (Biscoe, Dream, Torgersen) in the Palmer Archipelago, Antarctica. The [dataset homepage](https://allisonhorst.github.io/palmerpenguins/) contains more information about the columns and data types we expect. To reiterate, it's always important to check the documentation and associated literature first.\n",
"\n",
"|  | \n",
"|:--:| \n",
"| *Artwork by [@allison_horst](https://twitter.com/allison_horst).* |"
]
},
{
"cell_type": "markdown",
"id": "611701e4",
"metadata": {},
"source": [
"## Having a First Look at the Data\n",
"\n",
"The dataset is saved in `data/penguins.csv` and we can load it with [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv), as seen previously:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9cad4dc7",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "69741e99",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/penguins.csv\")"
]
},
{
"cell_type": "markdown",
"id": "554cc3b1",
"metadata": {},
"source": [
"Display the first few ten rows of the data:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "597e146f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 47.8 | \n",
" 15.0 | \n",
" 215.0 | \n",
" 5650.0 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 40.9 | \n",
" 16.6 | \n",
" 187.0 | \n",
" 3200.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.8 | \n",
" 15.7 | \n",
" 226.0 | \n",
" 5200.0 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.2 | \n",
" 17.2 | \n",
" 187.0 | \n",
" 3150.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 43.5 | \n",
" 14.2 | \n",
" 220.0 | \n",
" 4700.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36.0 | \n",
" 17.1 | \n",
" 187.0 | \n",
" 3700.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.4 | \n",
" 18.1 | \n",
" 193.0 | \n",
" 285.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41.1 | \n",
" 18.1 | \n",
" 205.0 | \n",
" 4300.0 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51.4 | \n",
" 19.0 | \n",
" 201.0 | \n",
" 3950.0 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.7 | \n",
" 15.0 | \n",
" 223.0 | \n",
" 5550.0 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"0 P-179 Gentoo Biscoe 47.8 15.0 \n",
"1 P-306 Chinstrap Dream 40.9 16.6 \n",
"2 P-247 Gentoo Biscoe 50.8 15.7 \n",
"3 P-120 Adelie Torgersen 36.2 17.2 \n",
"4 P-220 Gentoo Biscoe 43.5 14.2 \n",
"5 P-150 Adelie Dream 36.0 17.1 \n",
"6 P-348 Adelie Biscoe 36.4 18.1 \n",
"7 P-091 Adelie Dream 41.1 18.1 \n",
"8 P-327 Chinstrap Dream 51.4 19.0 \n",
"9 P-221 Gentoo Biscoe 50.7 15.0 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"0 215.0 5650.0 male 2007 \n",
"1 187.0 3200.0 female 2008 \n",
"2 226.0 5200.0 male 2009 \n",
"3 187.0 3150.0 female 2009 \n",
"4 220.0 4700.0 female 2008 \n",
"5 187.0 3700.0 female 2009 \n",
"6 193.0 285.0 female 2007 \n",
"7 205.0 4300.0 male 2008 \n",
"8 201.0 3950.0 male 2009 \n",
"9 223.0 5550.0 male 2008 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "markdown",
"id": "6289868f",
"metadata": {},
"source": [
"We could also look at the last few rows of the data with `df.tail()`, or a random sample of rows with `df.sample()`.\n",
"\n",
"To check the number of rows and columns we can use:"
]
},
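{
"cell_type": "code",
"execution_count": null,
"id": "b7f3a2c1",
"metadata": {},
"outputs": [],
"source": [
"# a random sample of five rows (random_state makes the sample reproducible);\n",
"# df.tail() would show the last few rows in the same way\n",
"df.sample(5, random_state=0)"
]
},
{
"cell_type": "markdown",
"id": "b7f3a2c2",
"metadata": {},
"source": [
"To check the number of rows and columns we can use:"
]
},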
{
"cell_type": "code",
"execution_count": 4,
"id": "7bce880c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(351, 9)\n"
]
}
],
"source": [
"print(df.shape)"
]
},
{
"cell_type": "markdown",
"id": "867edbd9",
"metadata": {},
"source": [
"Our data has 351 rows and 9 columns. It might also be useful to look at the column names (especially for larger datasets with many columns where they may not all be displayed by `df.head()`):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dfa54809",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Id', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',\n",
" 'flipper_length_mm', 'body_mass_g', 'sex', 'year'],\n",
" dtype='object')\n"
]
}
],
"source": [
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "37ed133b",
"metadata": {},
"source": [
"A useful command that summarises much of this information is `df.info()`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ff5c537c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 351 entries, 0 to 350\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Id 351 non-null object \n",
" 1 species 351 non-null object \n",
" 2 island 351 non-null object \n",
" 3 bill_length_mm 347 non-null float64\n",
" 4 bill_depth_mm 349 non-null object \n",
" 5 flipper_length_mm 349 non-null float64\n",
" 6 body_mass_g 349 non-null float64\n",
" 7 sex 340 non-null object \n",
" 8 year 351 non-null int64 \n",
"dtypes: float64(3), int64(1), object(5)\n",
"memory usage: 24.8+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "f8ed6aaf",
"metadata": {},
"source": [
"This gives us the number of rows (entries) and columns at the top, and then a table with the name, number of non-null values, and data type of each column. Finally, it gives the amount of memory the data frame is using. Pandas can use a lot of memory, which may cause problems when analysing large datasets. The [Scaling to large datasets](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html) page in the Pandas documentation gives pointers for what you can try in that case."
]
},
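{
"cell_type": "markdown",
"id": "c1d2e3f4",
"metadata": {},
"source": [
"As a quick optional check of where that memory goes, `df.memory_usage(deep=True)` reports the bytes used per column, and converting a repetitive text column to the `category` dtype is one common way to shrink it. This is just a sketch for illustration; we won't modify the data frame here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1d2e3f5",
"metadata": {},
"outputs": [],
"source": [
"# memory used by each column, counting the actual string contents (deep=True)\n",
"print(df.memory_usage(deep=True))\n",
"\n",
"# the species column stored as strings vs. as a categorical (usually much smaller)\n",
"print(df[\"species\"].memory_usage(deep=True))\n",
"print(df[\"species\"].astype(\"category\").memory_usage(deep=True))"
]
},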
{
"cell_type": "markdown",
"id": "498b00b3",
"metadata": {},
"source": [
"## Null Values\n",
"\n",
"The data frame info shows we have 351 \"non-null\" values in the `Id`, `species`, `island` and `year` columns, but fewer in the other columns.\n",
"\n",
"\"Null values\" is Pandas' way of describing data that is missing. Under the hood, these are encoded as NumPy's NaN (not a number) constant (see [here](https://numpy.org/doc/stable/reference/constants.html#numpy.nan)), which has type `float64` so numeric columns with NaN values still have a numeric type and can have numeric operations applied to them.\n",
"\n",
"To find missing values in a column we can use the `isnull()` function:\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "541355b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"346 False\n",
"347 False\n",
"348 False\n",
"349 False\n",
"350 False\n",
"Name: bill_length_mm, Length: 351, dtype: bool\n"
]
}
],
"source": [
"is_missing = df[\"bill_length_mm\"].isnull()\n",
"print(is_missing)"
]
},
{
"cell_type": "markdown",
"id": "8129dc9c",
"metadata": {},
"source": [
"This returns a boolean series which is True if that row's value is NaN, which can then be used to filter the data frame and show only the rows with missing data in the `bill_length_mm` column:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ad52255a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 233 | \n",
" P-344 | \n",
" Chinstrap | \n",
" Dream | \n",
" NaN | \n",
" 19.2 | \n",
" 197.0 | \n",
" 4000.0 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 286 | \n",
" P-003 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2007 | \n",
"
\n",
" \n",
" 307 | \n",
" P-271 | \n",
" Gentoo | \n",
" Biscoe | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2009 | \n",
"
\n",
" \n",
" 312 | \n",
" P-345 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" 18.0 | \n",
" 193.0 | \n",
" 43400.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"233 P-344 Chinstrap Dream NaN 19.2 \n",
"286 P-003 Adelie Torgersen NaN NaN \n",
"307 P-271 Gentoo Biscoe NaN NaN \n",
"312 P-345 Adelie Torgersen NaN 18.0 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"233 197.0 4000.0 male 2008 \n",
"286 NaN NaN NaN 2007 \n",
"307 NaN NaN NaN 2009 \n",
"312 193.0 43400.0 female 2009 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[is_missing]"
]
},
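{
"cell_type": "markdown",
"id": "d2e3f4a5",
"metadata": {},
"source": [
"For a quick overview of how much data is missing, we can also count the null values in every column (summing a boolean series counts the `True` entries):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2e3f4a6",
"metadata": {},
"outputs": [],
"source": [
"# number of missing values in each column\n",
"df.isnull().sum()"
]
},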
{
"cell_type": "markdown",
"id": "4eadf9d4",
"metadata": {},
"source": [
"There are many reasons data could be missing and how you choose to deal with them is an important part of any research project. We'll revisit this later. "
]
},
{
"cell_type": "markdown",
"id": "d924d5b4",
"metadata": {},
"source": [
"## Unexpected Column Types\n",
"\n",
"Looking at the first few rows of our data (the output of `df.head()` above) it looks like we expect the `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g` and `year` columns to have a numeric type. Comparing with the output of `df.info()` above most of them do, having a `dtype` (data type) of either `int64` or `float64`.\n",
"\n",
"However, the `bill_depth_mm` column has a dtype of `object`, which usually means the column is being treated as strings/text data. This will generally be because there is at least one row in the column that Pandas was not able to parse as a number. Common reasons this might happen include:\n",
"- Data entry errors and typos, for example \"23/15\" instead of \"23.15\".\n",
"- Encoding of missing values: The `pd.read_csv()` function checks for common string representations of missing values like \"NA\" or \"NULL\" and converts these to `numpy.nan` when loading the data. But many different conventions for missing data exist, such as more verbose representations like \"UNKNOWN\", and Pandas will load these as strings by default. This can be customised with the `na_values` parameter of [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).\n",
"- Additional metadata incorrectly loaded into the data frame, such as CSV files with headers and footers (as seen in the [Data Sources & Formats section](2-01-04-DataSourcesAndFormats) previously).\n",
"\n",
"To see what's wrong with the `bill_depth_mm` column we can try to convert it to a numeric type with the [`pd.to_numeric`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) function:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a1547ef1",
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [
{
"ename": "ValueError",
"evalue": "Unable to parse string \"14,2\" at position 142",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2369\u001b[0m, in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Unable to parse string \"14,2\"",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbill_depth_mm\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_numeric\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mbill_depth_mm\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/tools/numeric.py:185\u001b[0m, in \u001b[0;36mto_numeric\u001b[0;34m(arg, errors, downcast)\u001b[0m\n\u001b[1;32m 183\u001b[0m coerce_numeric \u001b[38;5;241m=\u001b[39m errors \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mignore\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mraise\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 184\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 185\u001b[0m values, _ \u001b[38;5;241m=\u001b[39m \u001b[43mlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmaybe_convert_numeric\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 186\u001b[0m \u001b[43m \u001b[49m\u001b[43mvalues\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mset\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcoerce_numeric\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcoerce_numeric\u001b[49m\n\u001b[1;32m 187\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 188\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m (\u001b[38;5;167;01mValueError\u001b[39;00m, \u001b[38;5;167;01mTypeError\u001b[39;00m):\n\u001b[1;32m 189\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m errors \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mraise\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n",
"File \u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2411\u001b[0m, in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Unable to parse string \"14,2\" at position 142"
]
}
],
"source": [
"df[\"bill_depth_mm\"] = pd.to_numeric(df[\"bill_depth_mm\"])"
]
},
{
"cell_type": "markdown",
"id": "b6ba77b9",
"metadata": {},
"source": [
"The error above tells us Pandas has encountered a value \"14,2\", which it doesn't know how to convert into a number. It also says the problem is at index 142, which we can access ourselves to check the value directly:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0b1cff3c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'14,2'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[142, \"bill_depth_mm\"]"
]
},
{
"cell_type": "markdown",
"id": "f8023508",
"metadata": {},
"source": [
"In this case it looks like a typo, the person entering the data probably meant to write `14.2`, but we should check this first. There may be information in the data documentation, or you could ask the data provider.\n",
"\n",
"Let's say we're confident it is a typo. We can fix it ourselves and then convert the column to a numeric type:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "48a448ec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 351 entries, 0 to 350\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Id 351 non-null object \n",
" 1 species 351 non-null object \n",
" 2 island 351 non-null object \n",
" 3 bill_length_mm 347 non-null float64\n",
" 4 bill_depth_mm 349 non-null float64\n",
" 5 flipper_length_mm 349 non-null float64\n",
" 6 body_mass_g 349 non-null float64\n",
" 7 sex 340 non-null object \n",
" 8 year 351 non-null int64 \n",
"dtypes: float64(4), int64(1), object(4)\n",
"memory usage: 24.8+ KB\n"
]
}
],
"source": [
"# set the incorrectly typed number to its intended value\n",
"df.loc[142, \"bill_depth_mm\"] = 14.2\n",
"# convert the column to a numeric type\n",
"df[\"bill_depth_mm\"] = pd.to_numeric(df[\"bill_depth_mm\"])\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "7ade8d5d",
"metadata": {},
"source": [
"The `bill_depth_mm` now has type `float64` as we originally expected.\n",
"\n",
"This was a simple example with just one strange value - we'll see more approaches for handling and sanitising strings later."
]
},
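{
"cell_type": "markdown",
"id": "e3f4a5b6",
"metadata": {},
"source": [
"As an aside, if a column contains many values that can't be parsed and you're happy to treat them all as missing, `pd.to_numeric` also accepts `errors=\"coerce\"`, which converts anything unparseable to NaN instead of raising an error. A sketch (not run here, since we've already fixed the column):\n",
"\n",
"```python\n",
"df[\"bill_depth_mm\"] = pd.to_numeric(df[\"bill_depth_mm\"], errors=\"coerce\")\n",
"```\n",
"\n",
"Use this with care: it silently discards unparseable values, so it's worth counting how many become NaN afterwards."
]
},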
{
"cell_type": "markdown",
"id": "24eca291",
"metadata": {},
"source": [
"## Sanity Checking Values\n",
"\n",
"### Numeric Columns\n",
"\n",
"The pandas `describe()` function gives summary statistics for the numeric columns in our data (the mean, standard deviation, minimum and maximum value, and quartiles for each column):"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e2b5f300",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 347.000000 | \n",
" 349.000000 | \n",
" 349.000000 | \n",
" 349.000000 | \n",
" 351.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 43.923055 | \n",
" 17.152722 | \n",
" 200.088825 | \n",
" 4305.329513 | \n",
" 2008.022792 | \n",
"
\n",
" \n",
" std | \n",
" 5.491795 | \n",
" 1.967049 | \n",
" 21.320100 | \n",
" 2256.300048 | \n",
" 0.820832 | \n",
"
\n",
" \n",
" min | \n",
" 32.100000 | \n",
" 13.100000 | \n",
" -99.000000 | \n",
" 285.000000 | \n",
" 2007.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 39.200000 | \n",
" 15.600000 | \n",
" 190.000000 | \n",
" 3550.000000 | \n",
" 2007.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 44.500000 | \n",
" 17.300000 | \n",
" 197.000000 | \n",
" 4050.000000 | \n",
" 2008.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 48.500000 | \n",
" 18.700000 | \n",
" 213.000000 | \n",
" 4775.000000 | \n",
" 2009.000000 | \n",
"
\n",
" \n",
" max | \n",
" 59.600000 | \n",
" 21.500000 | \n",
" 231.000000 | \n",
" 43400.000000 | \n",
" 2009.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n",
"count 347.000000 349.000000 349.000000 349.000000 \n",
"mean 43.923055 17.152722 200.088825 4305.329513 \n",
"std 5.491795 1.967049 21.320100 2256.300048 \n",
"min 32.100000 13.100000 -99.000000 285.000000 \n",
"25% 39.200000 15.600000 190.000000 3550.000000 \n",
"50% 44.500000 17.300000 197.000000 4050.000000 \n",
"75% 48.500000 18.700000 213.000000 4775.000000 \n",
"max 59.600000 21.500000 231.000000 43400.000000 \n",
"\n",
" year \n",
"count 351.000000 \n",
"mean 2008.022792 \n",
"std 0.820832 \n",
"min 2007.000000 \n",
"25% 2007.000000 \n",
"50% 2008.000000 \n",
"75% 2009.000000 \n",
"max 2009.000000 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
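{
"cell_type": "markdown",
"id": "f4a5b6c7",
"metadata": {},
"source": [
"`describe()` only summarises numeric columns by default. Passing `include=\"object\"` (or `include=\"all\"`) also covers the text columns, reporting the number of non-null values, the number of unique values, the most frequent value and its count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4a5b6c8",
"metadata": {},
"outputs": [],
"source": [
"# summary of the text (object dtype) columns\n",
"df.describe(include=\"object\")"
]
},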
{
"cell_type": "markdown",
"id": "4c98310a",
"metadata": {},
"source": [
"Even though `bill_length_mm`, and several of the other columns, have missing (NaN) values, Pandas is able to compute statistics for that column. When calculating these Pandas will ignore all NaN values by default. To change this behaviour, some functions have a `skipna` argument, for example `df[\"bill_length_mm\"].mean(skipna=False)` will return NaN if there are _any_ NaN values in the column.\n",
"\n",
"You should think carefully about which approach is more suitable for your data (for example, if a column only has a few non-null values will the mean be representative?)\n"
]
},
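{
"cell_type": "markdown",
"id": "a5b6c7d8",
"metadata": {},
"source": [
"For example, comparing the two behaviours on the `bill_length_mm` column (which we know has a few missing values):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5b6c7d9",
"metadata": {},
"outputs": [],
"source": [
"# NaN values are skipped by default\n",
"print(df[\"bill_length_mm\"].mean())\n",
"\n",
"# with skipna=False a single NaN makes the result NaN\n",
"print(df[\"bill_length_mm\"].mean(skipna=False))"
]
},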
{
"cell_type": "markdown",
"id": "878ebe30",
"metadata": {},
"source": [
"Looking at these values gives us a better idea of what our data contains, but also allows us to perform some sanity checks. For example, do the minimum and maximum values in each column make sense given what we know about the dataset?\n",
"\n",
"There are two things that might standout. First, the `flipper_length_mm` column has a minimum value of -99, but all the other values in the data are positive as we'd expect for measurements of lengths, widths and weights. In some datasets missing data is represented with negative values (but this may not always be the case so, as always, make sure to check what they mean in any data you're using).\n",
"\n",
"If we're sure `-99` is meant to be a missing value, we can replace those with `numpy.nan` so Pandas will treat them correctly:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f844b9a3",
"metadata": {},
"outputs": [],
"source": [
"df = df.replace(-99, numpy.nan)"
]
},
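{
"cell_type": "markdown",
"id": "b6c7d8e9",
"metadata": {},
"source": [
"Note that `df.replace(-99, numpy.nan)` replaces `-99` wherever it appears in the data frame. If `-99` could be a legitimate value in some other column, it's safer to target just the column you mean, for example:\n",
"\n",
"```python\n",
"df[\"flipper_length_mm\"] = df[\"flipper_length_mm\"].replace(-99, numpy.nan)\n",
"```"
]
},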
{
"cell_type": "markdown",
"id": "3fa8a2fc",
"metadata": {},
"source": [
"With these values replaced, the \"actual\" minimum value of `flipper_length_mm` is 172 mm:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5d7a7d38",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"172.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"flipper_length_mm\"].min()"
]
},
{
"cell_type": "markdown",
"id": "00e8111a",
"metadata": {},
"source": [
"The second thing that may stand out is the minimum value of 285 grams in `body_mass_g`, which looks far smaller than the other values (e.g., the 25% quartile of `body_mass_g` is only 3550g). Excluding the 285g value the next lightest penguin weighs 2700g:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8bb3868a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2700.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# True for each row with body_mass_g greater than the min value of 285g\n",
"smaller_petals = df[\"body_mass_g\"] > df[\"body_mass_g\"].min()\n",
"\n",
"# Lowest penguin weight out of all rows with weights above 285g\n",
"df.loc[smaller_petals, \"body_mass_g\"].min()"
]
},
{
"cell_type": "markdown",
"id": "107e7525",
"metadata": {},
"source": [
"Another way to see this would be to sort the data frame by body mass using the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) function:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ce349c0f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.4 | \n",
" 18.1 | \n",
" 193.0 | \n",
" 285.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 236 | \n",
" P-314 | \n",
" Chinstrap | \n",
" Dream | \n",
" 46.9 | \n",
" 16.6 | \n",
" 192.0 | \n",
" 2700.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 248 | \n",
" P-058 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.5 | \n",
" 16.6 | \n",
" 181.0 | \n",
" 2850.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 311 | \n",
" P-064 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.4 | \n",
" 17.1 | \n",
" 184.0 | \n",
" 2850.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 349 | \n",
" P-098 | \n",
" Adelie | \n",
" Dream | \n",
" 33.1 | \n",
" 16.1 | \n",
" 178.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 227 | \n",
" P-298 | \n",
" Chinstrap | \n",
" Dream | \n",
" 43.2 | \n",
" 16.6 | \n",
" 187.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 270 | \n",
" P-116 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.6 | \n",
" 17.0 | \n",
" 188.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 17 | \n",
" P-054 | \n",
" Adelie | \n",
" Biscoe | \n",
" 34.5 | \n",
" 18.1 | \n",
" 187.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 137 | \n",
" P-104 | \n",
" Adelie | \n",
" Biscoe | \n",
" 37.9 | \n",
" 18.6 | \n",
" 193.0 | \n",
" 2925.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 337 | \n",
" P-047 | \n",
" Adelie | \n",
" Dream | \n",
" 37.5 | \n",
" 18.9 | \n",
" 179.0 | \n",
" 2975.0 | \n",
" NaN | \n",
" 2007 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"6 P-348 Adelie Biscoe 36.4 18.1 \n",
"236 P-314 Chinstrap Dream 46.9 16.6 \n",
"248 P-058 Adelie Biscoe 36.5 16.6 \n",
"311 P-064 Adelie Biscoe 36.4 17.1 \n",
"349 P-098 Adelie Dream 33.1 16.1 \n",
"227 P-298 Chinstrap Dream 43.2 16.6 \n",
"270 P-116 Adelie Torgersen 38.6 17.0 \n",
"17 P-054 Adelie Biscoe 34.5 18.1 \n",
"137 P-104 Adelie Biscoe 37.9 18.6 \n",
"337 P-047 Adelie Dream 37.5 18.9 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"6 193.0 285.0 female 2007 \n",
"236 192.0 2700.0 female 2008 \n",
"248 181.0 2850.0 female 2008 \n",
"311 184.0 2850.0 female 2008 \n",
"349 178.0 2900.0 female 2008 \n",
"227 187.0 2900.0 female 2007 \n",
"270 188.0 2900.0 female 2009 \n",
"17 187.0 2900.0 female 2008 \n",
"137 193.0 2925.0 female 2009 \n",
"337 179.0 2975.0 NaN 2007 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(by=\"body_mass_g\").head(10)"
]
},
{
"cell_type": "markdown",
"id": "d31ea54b",
"metadata": {},
"source": [
"By default `sort_values` sorts values from smallest to largest, you can change that by setting `ascending=False`. \n",
"\n",
"Again, we see the 2nd smallest value in the column is only 2700g. This could be another data entry error (perhaps the weight was meant to be 2850g rather than 285g), or perhaps that penguin is a chick and the rest are adults. This type of issue is much more nuanced and difficult to spot in real world scenarios. Visualizing the data (and distributions in the data) can be very helpful here, which is the focus of the next module."
]
},
{
"cell_type": "markdown",
"id": "57aaa8ca",
"metadata": {},
"source": [
"### Text and Categorical Columns\n",
"\n",
"Note that the `species`, `island`, and `sex` columns do not appear when we use `describe()` above as they contain text. For both text and numeric columns, it can be helpful to know the number of unique values in each column:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "b0e00076",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Id 350\n",
"species 4\n",
"island 3\n",
"bill_length_mm 164\n",
"bill_depth_mm 80\n",
"flipper_length_mm 55\n",
"body_mass_g 96\n",
"sex 2\n",
"year 3\n",
"dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.nunique()"
]
},
{
"cell_type": "markdown",
"id": "85a7b47f",
"metadata": {},
"source": [
"The measurement and `Id` columns have many unique values, whereas columns like `island` have only a few different unique values (categories). Looking closely, the `species` column has four different values, but from the dataset documentation we only expect there to be three penguin species.\n",
"\n",
"The `value_counts()` function, applied to the `species` column, shows the number of occurrences of the different values in that column:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "82ae93ef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Adelie 155\n",
"Gentoo 125\n",
"Chinstrap 70\n",
"UNKNOWN 1\n",
"Name: species, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"species\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "01273fce",
"metadata": {},
"source": [
"The \"Adélie\", \"Chinstrap\", and \"Gentoo\" species described in the documentation all appear, but there's also an \"UNKNOWN\" entry. This looks like it should have been treated as a missing value. To make Pandas correctly treat it as missing we can replace it with `numpy.nan` using the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html?highlight=replace#pandas.DataFrame.replace) method:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "bfadf286",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Adelie 155\n",
"Gentoo 125\n",
"Chinstrap 70\n",
"Name: species, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"species\"] = df[\"species\"].replace(\"UNKNOWN\", numpy.nan)\n",
"df[\"species\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "cd98d7fa",
"metadata": {},
"source": [
"By default, the `value_counts` will not display the number of missing values in the column. To show that you can use `df[\"species\"].value_counts(dropna=False)` instead. You can also try `df[\"species\"].value_counts(normalize=True)` to show the fraction of data with each value, rather than the count.\n",
"\n",
"We'll look at more approaches for manipulating strings and categories in Sections [2.2.4.2](2-02-04-02-TextData) and [2.2.4.3](2-02-04-02-TextData) of this module.\n",
"\n",
"Finally, it may be interesting to look at how the measurements vary between the species. We can do that with the Pandas [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function: "
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ee83308a",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_55040/2880954085.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n",
" df.groupby(\"species\").mean()\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" year | \n",
"
\n",
" \n",
" species | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Adelie | \n",
" 38.757516 | \n",
" 18.335714 | \n",
" 189.993464 | \n",
" 3934.805195 | \n",
" 2008.006452 | \n",
"
\n",
" \n",
" Chinstrap | \n",
" 48.800000 | \n",
" 18.424286 | \n",
" 195.785714 | \n",
" 3733.571429 | \n",
" 2007.957143 | \n",
"
\n",
" \n",
" Gentoo | \n",
" 47.486290 | \n",
" 14.975806 | \n",
" 217.241935 | \n",
" 5080.241935 | \n",
" 2008.072000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n",
"species \n",
"Adelie 38.757516 18.335714 189.993464 3934.805195 \n",
"Chinstrap 48.800000 18.424286 195.785714 3733.571429 \n",
"Gentoo 47.486290 14.975806 217.241935 5080.241935 \n",
"\n",
" year \n",
"species \n",
"Adelie 2008.006452 \n",
"Chinstrap 2007.957143 \n",
"Gentoo 2008.072000 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(\"species\").mean()"
]
},
{
"cell_type": "markdown",
"id": "e6b4a401",
"metadata": {},
"source": [
"`df.groupby(\"species\")` splits the date frame into sub-groups with the same value in the \"species\" column. We then must specify a function we want to use to summarize the members of each group, in this case the mean. It looks like, on average, \"Chinstrap\" penguins have the largest bills, but \"Gentoo\" penguins have the largest flippers and body mass. For more information about using `groupby` see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)."
]
},
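{
"cell_type": "markdown",
"id": "c7d8e9f0",
"metadata": {},
"source": [
"A couple of useful variations: selecting the columns of interest before aggregating keeps the output focused (and avoids the deprecation warning above about non-numeric columns), and `agg` computes several summary statistics at once. A quick sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7d8e9f1",
"metadata": {},
"outputs": [],
"source": [
"# mean of two measurement columns, per species\n",
"print(df.groupby(\"species\")[[\"bill_length_mm\", \"body_mass_g\"]].mean())\n",
"\n",
"# several summary statistics of body mass at once\n",
"df.groupby(\"species\")[\"body_mass_g\"].agg([\"mean\", \"min\", \"max\"])"
]
},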
{
"cell_type": "markdown",
"id": "ae5ed4d5",
"metadata": {},
"source": [
"## Duplicate Data\n",
"\n",
"In the output of `df.nunique()` above we see the `Id` column has 350 unique values, one fewer than the 351 rows in the dataset. We expect `Id` to be a unique identifier, so to have 351 unique values (1 for each row). What's going on?\n",
"\n",
"One explanation could be that there are duplicate rows in the data. The [`duplicated`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) method of a data frame returns True for any rows that appear multiple times in the data (with an exact copy of all values):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "23341c3d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"346 False\n",
"347 False\n",
"348 False\n",
"349 False\n",
"350 False\n",
"Length: 351, dtype: bool"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.duplicated()"
]
},
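{
"cell_type": "markdown",
"id": "d8e9f0a1",
"metadata": {},
"source": [
"Summing this boolean series gives a quick count of duplicated rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8e9f0a2",
"metadata": {},
"outputs": [],
"source": [
"# number of rows flagged as duplicates of an earlier row\n",
"df.duplicated().sum()"
]
},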
{
"cell_type": "markdown",
"id": "0024b9ac",
"metadata": {},
"source": [
"We can use this to filter the data frame and show only the duplicated rows:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "90f205fc",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 36 | \n",
" P-276 | \n",
" Chinstrap | \n",
" Dream | \n",
" 46.5 | \n",
" 17.9 | \n",
" 192.0 | \n",
" 3500.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 324 | \n",
" P-276 | \n",
" Chinstrap | \n",
" Dream | \n",
" 46.5 | \n",
" 17.9 | \n",
" 192.0 | \n",
" 3500.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"36 P-276 Chinstrap Dream 46.5 17.9 \n",
"324 P-276 Chinstrap Dream 46.5 17.9 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"36 192.0 3500.0 female 2007 \n",
"324 192.0 3500.0 female 2007 "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.duplicated(keep=False)]"
]
},
{
"cell_type": "markdown",
"id": "3692d88d",
"metadata": {},
"source": [
"By default, the `duplicated` function only marks the second and subsequent instances of the same data as being duplicates. Setting `keep=False` marks the first instance as a duplicate as well.\n",
"\n",
"We see there are two entries for a penguin with id `P-276` in the data. Why might that be the case? It could be caused by a data entry/processing issue and be there by mistake, or be a genuine repeated measurement for this penguin, for example. It's important to understand the context before taking any action.\n",
"\n",
"In some cases it may be appropriate to delete the duplicate data. This can be done with the [`drop_duplicates`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html?highlight=drop_duplicates#pandas.DataFrame.drop_duplicates) method:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "39c214f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Length before removing duplicates: 351 rows\n",
"Length after removing duplicates: 350 rows\n"
]
}
],
"source": [
"print(\"Length before removing duplicates:\", len(df), \"rows\")\n",
"df = df.drop_duplicates()\n",
"print(\"Length after removing duplicates:\", len(df), \"rows\")"
]
},
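{
"cell_type": "markdown",
"id": "e9f0a1b2",
"metadata": {},
"source": [
"Both `duplicated` and `drop_duplicates` also accept a `subset` argument if you only want to compare particular columns. For example, to flag rows that share an `Id` even if their other values differ (a sketch, not run here):\n",
"\n",
"```python\n",
"df[df.duplicated(subset=[\"Id\"], keep=False)]\n",
"```"
]
},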
{
"cell_type": "markdown",
"id": "76933b08",
"metadata": {},
"source": [
"## Displaying Data Frames with Style 😎\n",
"\n",
"You can get fancy with how you display data frames by highlighting and formatting cells differently using its `style` attribute. There are a few examples below, for more details see the [Table Visualization page in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Styler-Object-and-HTML).\n",
"\n",
"Change the precision with which numbers are displayed:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "f0de84c8",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 48 | \n",
" 15 | \n",
" 215 | \n",
" 5650 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 41 | \n",
" 17 | \n",
" 187 | \n",
" 3200 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 51 | \n",
" 16 | \n",
" 226 | \n",
" 5200 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36 | \n",
" 17 | \n",
" 187 | \n",
" 3150 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 44 | \n",
" 14 | \n",
" 220 | \n",
" 4700 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36 | \n",
" 17 | \n",
" 187 | \n",
" 3700 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36 | \n",
" 18 | \n",
" 193 | \n",
" 285 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41 | \n",
" 18 | \n",
" 205 | \n",
" 4300 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51 | \n",
" 19 | \n",
" 201 | \n",
" 3950 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 51 | \n",
" 15 | \n",
" 223 | \n",
" 5550 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_top10 = df.head(10) # just style the first 10 rows for demo purposes here\n",
"\n",
"# round values to nearest integer (0 decimal places)\n",
"df_top10.style.format(precision=0)"
]
},
{
"cell_type": "markdown",
"id": "3798286c",
"metadata": {},
"source": [
"Apply a colour gradient to each column based on each cell's value:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "ac0b9191",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 47.800000 | \n",
" 15.000000 | \n",
" 215.000000 | \n",
" 5650.000000 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 40.900000 | \n",
" 16.600000 | \n",
" 187.000000 | \n",
" 3200.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.800000 | \n",
" 15.700000 | \n",
" 226.000000 | \n",
" 5200.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.200000 | \n",
" 17.200000 | \n",
" 187.000000 | \n",
" 3150.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 43.500000 | \n",
" 14.200000 | \n",
" 220.000000 | \n",
" 4700.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36.000000 | \n",
" 17.100000 | \n",
" 187.000000 | \n",
" 3700.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.400000 | \n",
" 18.100000 | \n",
" 193.000000 | \n",
" 285.000000 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41.100000 | \n",
" 18.100000 | \n",
" 205.000000 | \n",
" 4300.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51.400000 | \n",
" 19.000000 | \n",
" 201.000000 | \n",
" 3950.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.700000 | \n",
" 15.000000 | \n",
" 223.000000 | \n",
" 5550.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_top10.style.background_gradient()"
]
},
{
"cell_type": "markdown",
"id": "6d8c4ff0",
"metadata": {},
"source": [
"Highlight the smallest value in each column:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "ed5623ff",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 47.800000 | \n",
" 15.000000 | \n",
" 215.000000 | \n",
" 5650.000000 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 40.900000 | \n",
" 16.600000 | \n",
" 187.000000 | \n",
" 3200.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.800000 | \n",
" 15.700000 | \n",
" 226.000000 | \n",
" 5200.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.200000 | \n",
" 17.200000 | \n",
" 187.000000 | \n",
" 3150.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 43.500000 | \n",
" 14.200000 | \n",
" 220.000000 | \n",
" 4700.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36.000000 | \n",
" 17.100000 | \n",
" 187.000000 | \n",
" 3700.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.400000 | \n",
" 18.100000 | \n",
" 193.000000 | \n",
" 285.000000 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41.100000 | \n",
" 18.100000 | \n",
" 205.000000 | \n",
" 4300.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51.400000 | \n",
" 19.000000 | \n",
" 201.000000 | \n",
" 3950.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.700000 | \n",
" 15.000000 | \n",
" 223.000000 | \n",
" 5550.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_top10.style.highlight_min()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}