{
"cells": [
{
"cell_type": "markdown",
"id": "e88ea460",
"metadata": {
"tags": []
},
"source": [
"# 2.2.1 Data Consistency\n",
"\n",
"In [Section 2.1.4](2-01-04-DataSourcesAndFormats.md) we saw how to load data of different types into pandas. In this section we'll look at first steps you can perform to look at the data, understand what it contains, and a few common issues that may come up with data consistency like missing or mis-interpreted values."
]
},
{
"cell_type": "markdown",
"id": "9c0c3e66",
"metadata": {},
"source": [
"## Domain Knowledge\n",
"\n",
"Before jumping into looking at and analysing the data you should check any information you've been given about it, so you know what to expect. In this section we'll be using the Palmer penguins dataset, adapted from the version created by Allison Horst available [here (doi:10.5281/zenodo.3960218)](https://allisonhorst.github.io/palmerpenguins/).\n",
"\n",
"We've made changes to the data to demonstrate the concepts we're teaching, adding missing values and other common data issues. The (cleaner) original file is available [here](https://github.com/allisonhorst/palmerpenguins/blob/master/inst/extdata/penguins.csv).\n",
"\n",
"The dataset was originally collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/), and published in the [PLOS ONE journal (doi:10.1371/journal.pone.0090081)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081) in 2014 .\n",
"\n",
"It includes measurements of the bill size, flipper length and weight of three different species of penguin (Adélie, Chinstrap, Gentoo) on three different islands (Biscoe, Dream, Torgersen) in the Palmer Archipelago, Antarctica. The [dataset homepage](https://allisonhorst.github.io/palmerpenguins/) contains more information about the columns and data types we expect. To reiterate, it's always important to check the documentation and associated literature first.\n",
"\n",
"|  | \n",
"|:--:| \n",
"| *Artwork by [@allison_horst](https://twitter.com/allison_horst).* |"
]
},
{
"cell_type": "markdown",
"id": "611701e4",
"metadata": {},
"source": [
"## Having a First Look at the Data\n",
"\n",
"The dataset is saved in `data/penguins.csv` and we can load it with [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv), as seen previously:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9cad4dc7",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "69741e99",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/penguins.csv\")"
]
},
{
"cell_type": "markdown",
"id": "554cc3b1",
"metadata": {},
"source": [
"Display the first few ten rows of the data:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "597e146f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 47.8 | \n",
" 15.0 | \n",
" 215.0 | \n",
" 5650.0 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 40.9 | \n",
" 16.6 | \n",
" 187.0 | \n",
" 3200.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.8 | \n",
" 15.7 | \n",
" 226.0 | \n",
" 5200.0 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.2 | \n",
" 17.2 | \n",
" 187.0 | \n",
" 3150.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 43.5 | \n",
" 14.2 | \n",
" 220.0 | \n",
" 4700.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36.0 | \n",
" 17.1 | \n",
" 187.0 | \n",
" 3700.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.4 | \n",
" 18.1 | \n",
" 193.0 | \n",
" 285.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41.1 | \n",
" 18.1 | \n",
" 205.0 | \n",
" 4300.0 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51.4 | \n",
" 19.0 | \n",
" 201.0 | \n",
" 3950.0 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.7 | \n",
" 15.0 | \n",
" 223.0 | \n",
" 5550.0 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"0 P-179 Gentoo Biscoe 47.8 15.0 \n",
"1 P-306 Chinstrap Dream 40.9 16.6 \n",
"2 P-247 Gentoo Biscoe 50.8 15.7 \n",
"3 P-120 Adelie Torgersen 36.2 17.2 \n",
"4 P-220 Gentoo Biscoe 43.5 14.2 \n",
"5 P-150 Adelie Dream 36.0 17.1 \n",
"6 P-348 Adelie Biscoe 36.4 18.1 \n",
"7 P-091 Adelie Dream 41.1 18.1 \n",
"8 P-327 Chinstrap Dream 51.4 19.0 \n",
"9 P-221 Gentoo Biscoe 50.7 15.0 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"0 215.0 5650.0 male 2007 \n",
"1 187.0 3200.0 female 2008 \n",
"2 226.0 5200.0 male 2009 \n",
"3 187.0 3150.0 female 2009 \n",
"4 220.0 4700.0 female 2008 \n",
"5 187.0 3700.0 female 2009 \n",
"6 193.0 285.0 female 2007 \n",
"7 205.0 4300.0 male 2008 \n",
"8 201.0 3950.0 male 2009 \n",
"9 223.0 5550.0 male 2008 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "markdown",
"id": "6289868f",
"metadata": {},
"source": [
"We could also look at the last few rows of the data with `df.tail()`, or a random sample of rows with `df.sample()`.\n",
"\n",
"To check the number of rows and columns we can use:"
]
},
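{
"cell_type": "code",
"execution_count": null,
"id": "b7f3a2c1",
"metadata": {},
"outputs": [],
"source": [
"# a random sample of five rows (random_state makes the sample reproducible);\n",
"# df.tail() would show the last few rows in the same way\n",
"df.sample(5, random_state=0)"
]
},
{
"cell_type": "markdown",
"id": "b7f3a2c2",
"metadata": {},
"source": [
"To check the number of rows and columns we can use:"
]
},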
{
"cell_type": "code",
"execution_count": 4,
"id": "7bce880c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(351, 9)\n"
]
}
],
"source": [
"print(df.shape)"
]
},
{
"cell_type": "markdown",
"id": "867edbd9",
"metadata": {},
"source": [
"Our data has 351 rows and 9 columns. It might also be useful to look at the column names (especially for larger datasets with many columns where they may not all be displayed by `df.head()`):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dfa54809",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Id', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',\n",
" 'flipper_length_mm', 'body_mass_g', 'sex', 'year'],\n",
" dtype='object')\n"
]
}
],
"source": [
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "37ed133b",
"metadata": {},
"source": [
"A useful command that summarises much of this information is `df.info()`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ff5c537c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 351 entries, 0 to 350\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Id 351 non-null object \n",
" 1 species 351 non-null object \n",
" 2 island 351 non-null object \n",
" 3 bill_length_mm 347 non-null float64\n",
" 4 bill_depth_mm 349 non-null object \n",
" 5 flipper_length_mm 349 non-null float64\n",
" 6 body_mass_g 349 non-null float64\n",
" 7 sex 340 non-null object \n",
" 8 year 351 non-null int64 \n",
"dtypes: float64(3), int64(1), object(5)\n",
"memory usage: 24.8+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "f8ed6aaf",
"metadata": {},
"source": [
"This gives us the number of rows (entries) and columns at the top, and then a table with the name, number of non-null values, and data type of each column. Finally, it gives the amount of memory the data frame is using. Pandas can use a lot of memory, which may cause problems when analysing large datasets. The [Scaling to large datasets](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html) page in the Pandas documentation gives pointers for what you can try in that case."
]
},
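{
"cell_type": "markdown",
"id": "c1d2e3f4",
"metadata": {},
"source": [
"As a quick optional check of where that memory goes, `df.memory_usage(deep=True)` reports the bytes used per column, and converting a repetitive text column to the `category` dtype is one common way to shrink it. This is just a sketch for illustration; we won't modify the data frame here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1d2e3f5",
"metadata": {},
"outputs": [],
"source": [
"# memory used by each column, counting the actual string contents (deep=True)\n",
"print(df.memory_usage(deep=True))\n",
"\n",
"# the species column stored as strings vs. as a categorical (usually much smaller)\n",
"print(df[\"species\"].memory_usage(deep=True))\n",
"print(df[\"species\"].astype(\"category\").memory_usage(deep=True))"
]
},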
{
"cell_type": "markdown",
"id": "498b00b3",
"metadata": {},
"source": [
"## Null Values\n",
"\n",
"The data frame info shows we have 351 \"non-null\" values in the `Id`, `species`, `island` and `year` columns, but fewer in the other columns.\n",
"\n",
"\"Null values\" is Pandas' way of describing data that is missing. Under the hood, these are encoded as NumPy's NaN (not a number) constant (see [here](https://numpy.org/doc/stable/reference/constants.html#numpy.nan)), which has type `float64` so numeric columns with NaN values still have a numeric type and can have numeric operations applied to them.\n",
"\n",
"To find missing values in a column we can use the `isnull()` function:\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "541355b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"346 False\n",
"347 False\n",
"348 False\n",
"349 False\n",
"350 False\n",
"Name: bill_length_mm, Length: 351, dtype: bool\n"
]
}
],
"source": [
"is_missing = df[\"bill_length_mm\"].isnull()\n",
"print(is_missing)"
]
},
{
"cell_type": "markdown",
"id": "8129dc9c",
"metadata": {},
"source": [
"This returns a boolean series which is True if that row's value is NaN, which can then be used to filter the data frame and show only the rows with missing data in the `bill_length_mm` column:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ad52255a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 233 | \n",
" P-344 | \n",
" Chinstrap | \n",
" Dream | \n",
" NaN | \n",
" 19.2 | \n",
" 197.0 | \n",
" 4000.0 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 286 | \n",
" P-003 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2007 | \n",
"
\n",
" \n",
" 307 | \n",
" P-271 | \n",
" Gentoo | \n",
" Biscoe | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2009 | \n",
"
\n",
" \n",
" 312 | \n",
" P-345 | \n",
" Adelie | \n",
" Torgersen | \n",
" NaN | \n",
" 18.0 | \n",
" 193.0 | \n",
" 43400.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"233 P-344 Chinstrap Dream NaN 19.2 \n",
"286 P-003 Adelie Torgersen NaN NaN \n",
"307 P-271 Gentoo Biscoe NaN NaN \n",
"312 P-345 Adelie Torgersen NaN 18.0 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"233 197.0 4000.0 male 2008 \n",
"286 NaN NaN NaN 2007 \n",
"307 NaN NaN NaN 2009 \n",
"312 193.0 43400.0 female 2009 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[is_missing]"
]
},
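{
"cell_type": "markdown",
"id": "d2e3f4a5",
"metadata": {},
"source": [
"For a quick overview of how much data is missing, we can also count the null values in every column (summing a boolean series counts the `True` entries):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2e3f4a6",
"metadata": {},
"outputs": [],
"source": [
"# number of missing values in each column\n",
"df.isnull().sum()"
]
},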
{
"cell_type": "markdown",
"id": "4eadf9d4",
"metadata": {},
"source": [
"There are many reasons data could be missing and how you choose to deal with them is an important part of any research project. We'll revisit this later. "
]
},
{
"cell_type": "markdown",
"id": "d924d5b4",
"metadata": {},
"source": [
"## Unexpected Column Types\n",
"\n",
"Looking at the first few rows of our data (the output of `df.head()` above) it looks like we expect the `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g` and `year` columns to have a numeric type. Comparing with the output of `df.info()` above most of them do, having a `dtype` (data type) of either `int64` or `float64`.\n",
"\n",
"However, the `bill_depth_mm` column has a dtype of `object`, which usually means the column is being treated as strings/text data. This will generally be because there is at least one row in the column that Pandas was not able to parse as a number. Common reasons this might happen include:\n",
"- Data entry errors and typos, for example \"23/15\" instead of \"23.15\".\n",
"- Encoding of missing values: The `pd.read_csv()` function checks for common string representations of missing values like \"NA\" or \"NULL\" and converts these to `numpy.nan` when loading the data. But many different conventions for missing data exist, such as more verbose representations like \"UNKNOWN\", and Pandas will load these as strings by default. This can be customised with the `na_values` parameter of [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).\n",
"- Additional metadata incorrectly loaded into the data frame, such as CSV files with headers and footers (as seen in the [Data Sources & Formats section](2-01-04-DataSourcesAndFormats) previously).\n",
"\n",
"To see what's wrong with the `bill_depth_mm` column we can try to convert it to a numeric type with the [`pd.to_numeric`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) function:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a1547ef1",
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [
{
"ename": "ValueError",
"evalue": "Unable to parse string \"14,2\" at position 142",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"File \u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2369\u001b[0m, in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Unable to parse string \"14,2\"",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbill_depth_mm\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_numeric\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mbill_depth_mm\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/tools/numeric.py:185\u001b[0m, in \u001b[0;36mto_numeric\u001b[0;34m(arg, errors, downcast)\u001b[0m\n\u001b[1;32m 183\u001b[0m coerce_numeric \u001b[38;5;241m=\u001b[39m errors \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mignore\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mraise\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 184\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 185\u001b[0m values, _ \u001b[38;5;241m=\u001b[39m \u001b[43mlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmaybe_convert_numeric\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 186\u001b[0m \u001b[43m \u001b[49m\u001b[43mvalues\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mset\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcoerce_numeric\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcoerce_numeric\u001b[49m\n\u001b[1;32m 187\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 188\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m (\u001b[38;5;167;01mValueError\u001b[39;00m, \u001b[38;5;167;01mTypeError\u001b[39;00m):\n\u001b[1;32m 189\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m errors \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mraise\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n",
"File \u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2411\u001b[0m, in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Unable to parse string \"14,2\" at position 142"
]
}
],
"source": [
"df[\"bill_depth_mm\"] = pd.to_numeric(df[\"bill_depth_mm\"])"
]
},
{
"cell_type": "markdown",
"id": "b6ba77b9",
"metadata": {},
"source": [
"The error above tells us Pandas has encountered a value \"14,2\", which it doesn't know how to convert into a number. It also says the problem is at index 142, which we can access ourselves to check the value directly:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0b1cff3c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'14,2'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[142, \"bill_depth_mm\"]"
]
},
{
"cell_type": "markdown",
"id": "f8023508",
"metadata": {},
"source": [
"In this case it looks like a typo, the person entering the data probably meant to write `14.2`, but we should check this first. There may be information in the data documentation, or you could ask the data provider.\n",
"\n",
"Let's say we're confident it is a typo. We can fix it ourselves and then convert the column to a numeric type:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "48a448ec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 351 entries, 0 to 350\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Id 351 non-null object \n",
" 1 species 351 non-null object \n",
" 2 island 351 non-null object \n",
" 3 bill_length_mm 347 non-null float64\n",
" 4 bill_depth_mm 349 non-null float64\n",
" 5 flipper_length_mm 349 non-null float64\n",
" 6 body_mass_g 349 non-null float64\n",
" 7 sex 340 non-null object \n",
" 8 year 351 non-null int64 \n",
"dtypes: float64(4), int64(1), object(4)\n",
"memory usage: 24.8+ KB\n"
]
}
],
"source": [
"# set the incorrectly typed number to its intended value\n",
"df.loc[142, \"bill_depth_mm\"] = 14.2\n",
"# convert the column to a numeric type\n",
"df[\"bill_depth_mm\"] = pd.to_numeric(df[\"bill_depth_mm\"])\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "7ade8d5d",
"metadata": {},
"source": [
"The `bill_depth_mm` now has type `float64` as we originally expected.\n",
"\n",
"This was a simple example with just one strange value - we'll see more approaches for handling and sanitising strings later."
]
},
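{
"cell_type": "markdown",
"id": "e3f4a5b6",
"metadata": {},
"source": [
"As an aside, if a column contains many values that can't be parsed and you're happy to treat them all as missing, `pd.to_numeric` also accepts `errors=\"coerce\"`, which converts anything unparseable to NaN instead of raising an error. A sketch (not run here, since we've already fixed the column):\n",
"\n",
"```python\n",
"df[\"bill_depth_mm\"] = pd.to_numeric(df[\"bill_depth_mm\"], errors=\"coerce\")\n",
"```\n",
"\n",
"Use this with care: it silently discards unparseable values, so it's worth counting how many become NaN afterwards."
]
},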
{
"cell_type": "markdown",
"id": "24eca291",
"metadata": {},
"source": [
"## Sanity Checking Values\n",
"\n",
"### Numeric Columns\n",
"\n",
"The pandas `describe()` function gives summary statistics for the numeric columns in our data (the mean, standard deviation, minimum and maximum value, and quartiles for each column):"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e2b5f300",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 347.000000 | \n",
" 349.000000 | \n",
" 349.000000 | \n",
" 349.000000 | \n",
" 351.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 43.923055 | \n",
" 17.152722 | \n",
" 200.088825 | \n",
" 4305.329513 | \n",
" 2008.022792 | \n",
"
\n",
" \n",
" std | \n",
" 5.491795 | \n",
" 1.967049 | \n",
" 21.320100 | \n",
" 2256.300048 | \n",
" 0.820832 | \n",
"
\n",
" \n",
" min | \n",
" 32.100000 | \n",
" 13.100000 | \n",
" -99.000000 | \n",
" 285.000000 | \n",
" 2007.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 39.200000 | \n",
" 15.600000 | \n",
" 190.000000 | \n",
" 3550.000000 | \n",
" 2007.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 44.500000 | \n",
" 17.300000 | \n",
" 197.000000 | \n",
" 4050.000000 | \n",
" 2008.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 48.500000 | \n",
" 18.700000 | \n",
" 213.000000 | \n",
" 4775.000000 | \n",
" 2009.000000 | \n",
"
\n",
" \n",
" max | \n",
" 59.600000 | \n",
" 21.500000 | \n",
" 231.000000 | \n",
" 43400.000000 | \n",
" 2009.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n",
"count 347.000000 349.000000 349.000000 349.000000 \n",
"mean 43.923055 17.152722 200.088825 4305.329513 \n",
"std 5.491795 1.967049 21.320100 2256.300048 \n",
"min 32.100000 13.100000 -99.000000 285.000000 \n",
"25% 39.200000 15.600000 190.000000 3550.000000 \n",
"50% 44.500000 17.300000 197.000000 4050.000000 \n",
"75% 48.500000 18.700000 213.000000 4775.000000 \n",
"max 59.600000 21.500000 231.000000 43400.000000 \n",
"\n",
" year \n",
"count 351.000000 \n",
"mean 2008.022792 \n",
"std 0.820832 \n",
"min 2007.000000 \n",
"25% 2007.000000 \n",
"50% 2008.000000 \n",
"75% 2009.000000 \n",
"max 2009.000000 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
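{
"cell_type": "markdown",
"id": "f4a5b6c7",
"metadata": {},
"source": [
"`describe()` only summarises numeric columns by default. Passing `include=\"object\"` (or `include=\"all\"`) also covers the text columns, reporting the number of non-null values, the number of unique values, the most frequent value and its count:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4a5b6c8",
"metadata": {},
"outputs": [],
"source": [
"# summary of the text (object dtype) columns\n",
"df.describe(include=\"object\")"
]
},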
{
"cell_type": "markdown",
"id": "4c98310a",
"metadata": {},
"source": [
"Even though `bill_length_mm`, and several of the other columns, have missing (NaN) values, Pandas is able to compute statistics for that column. When calculating these Pandas will ignore all NaN values by default. To change this behaviour, some functions have a `skipna` argument, for example `df[\"bill_length_mm\"].mean(skipna=False)` will return NaN if there are _any_ NaN values in the column.\n",
"\n",
"You should think carefully about which approach is more suitable for your data (for example, if a column only has a few non-null values will the mean be representative?)\n"
]
},
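{
"cell_type": "markdown",
"id": "a5b6c7d8",
"metadata": {},
"source": [
"For example, comparing the two behaviours on the `bill_length_mm` column (which we know has a few missing values):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5b6c7d9",
"metadata": {},
"outputs": [],
"source": [
"# NaN values are skipped by default\n",
"print(df[\"bill_length_mm\"].mean())\n",
"\n",
"# with skipna=False a single NaN makes the result NaN\n",
"print(df[\"bill_length_mm\"].mean(skipna=False))"
]
},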
{
"cell_type": "markdown",
"id": "878ebe30",
"metadata": {},
"source": [
"Looking at these values gives us a better idea of what our data contains, but also allows us to perform some sanity checks. For example, do the minimum and maximum values in each column make sense given what we know about the dataset?\n",
"\n",
"There are two things that might standout. First, the `flipper_length_mm` column has a minimum value of -99, but all the other values in the data are positive as we'd expect for measurements of lengths, widths and weights. In some datasets missing data is represented with negative values (but this may not always be the case so, as always, make sure to check what they mean in any data you're using).\n",
"\n",
"If we're sure `-99` is meant to be a missing value, we can replace those with `numpy.nan` so Pandas will treat them correctly:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f844b9a3",
"metadata": {},
"outputs": [],
"source": [
"df = df.replace(-99, numpy.nan)"
]
},
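{
"cell_type": "markdown",
"id": "b6c7d8e9",
"metadata": {},
"source": [
"Note that `df.replace(-99, numpy.nan)` replaces `-99` wherever it appears in the data frame. If `-99` could be a legitimate value in some other column, it's safer to target just the column you mean, for example:\n",
"\n",
"```python\n",
"df[\"flipper_length_mm\"] = df[\"flipper_length_mm\"].replace(-99, numpy.nan)\n",
"```"
]
},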
{
"cell_type": "markdown",
"id": "3fa8a2fc",
"metadata": {},
"source": [
"With these values replaced, the \"actual\" minimum value of `flipper_length_mm` is 172 mm:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5d7a7d38",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"172.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"flipper_length_mm\"].min()"
]
},
{
"cell_type": "markdown",
"id": "00e8111a",
"metadata": {},
"source": [
"The second thing that may stand out is the minimum value of 285 grams in `body_mass_g`, which looks far smaller than the other values (e.g., the 25% quartile of `body_mass_g` is only 3550g). Excluding the 285g value the next lightest penguin weighs 2700g:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8bb3868a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2700.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# True for each row with body_mass_g greater than the min value of 285g\n",
"smaller_petals = df[\"body_mass_g\"] > df[\"body_mass_g\"].min()\n",
"\n",
"# Lowest penguin weight out of all rows with weights above 285g\n",
"df.loc[smaller_petals, \"body_mass_g\"].min()"
]
},
{
"cell_type": "markdown",
"id": "107e7525",
"metadata": {},
"source": [
"Another way to see this would be to sort the data frame by body mass using the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) function:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ce349c0f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.4 | \n",
" 18.1 | \n",
" 193.0 | \n",
" 285.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 236 | \n",
" P-314 | \n",
" Chinstrap | \n",
" Dream | \n",
" 46.9 | \n",
" 16.6 | \n",
" 192.0 | \n",
" 2700.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 248 | \n",
" P-058 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.5 | \n",
" 16.6 | \n",
" 181.0 | \n",
" 2850.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 311 | \n",
" P-064 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.4 | \n",
" 17.1 | \n",
" 184.0 | \n",
" 2850.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 349 | \n",
" P-098 | \n",
" Adelie | \n",
" Dream | \n",
" 33.1 | \n",
" 16.1 | \n",
" 178.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 227 | \n",
" P-298 | \n",
" Chinstrap | \n",
" Dream | \n",
" 43.2 | \n",
" 16.6 | \n",
" 187.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 270 | \n",
" P-116 | \n",
" Adelie | \n",
" Torgersen | \n",
" 38.6 | \n",
" 17.0 | \n",
" 188.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 17 | \n",
" P-054 | \n",
" Adelie | \n",
" Biscoe | \n",
" 34.5 | \n",
" 18.1 | \n",
" 187.0 | \n",
" 2900.0 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 137 | \n",
" P-104 | \n",
" Adelie | \n",
" Biscoe | \n",
" 37.9 | \n",
" 18.6 | \n",
" 193.0 | \n",
" 2925.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 337 | \n",
" P-047 | \n",
" Adelie | \n",
" Dream | \n",
" 37.5 | \n",
" 18.9 | \n",
" 179.0 | \n",
" 2975.0 | \n",
" NaN | \n",
" 2007 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"6 P-348 Adelie Biscoe 36.4 18.1 \n",
"236 P-314 Chinstrap Dream 46.9 16.6 \n",
"248 P-058 Adelie Biscoe 36.5 16.6 \n",
"311 P-064 Adelie Biscoe 36.4 17.1 \n",
"349 P-098 Adelie Dream 33.1 16.1 \n",
"227 P-298 Chinstrap Dream 43.2 16.6 \n",
"270 P-116 Adelie Torgersen 38.6 17.0 \n",
"17 P-054 Adelie Biscoe 34.5 18.1 \n",
"137 P-104 Adelie Biscoe 37.9 18.6 \n",
"337 P-047 Adelie Dream 37.5 18.9 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"6 193.0 285.0 female 2007 \n",
"236 192.0 2700.0 female 2008 \n",
"248 181.0 2850.0 female 2008 \n",
"311 184.0 2850.0 female 2008 \n",
"349 178.0 2900.0 female 2008 \n",
"227 187.0 2900.0 female 2007 \n",
"270 188.0 2900.0 female 2009 \n",
"17 187.0 2900.0 female 2008 \n",
"137 193.0 2925.0 female 2009 \n",
"337 179.0 2975.0 NaN 2007 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(by=\"body_mass_g\").head(10)"
]
},
{
"cell_type": "markdown",
"id": "d31ea54b",
"metadata": {},
"source": [
"By default `sort_values` sorts values from smallest to largest, you can change that by setting `ascending=False`. \n",
"\n",
"Again, we see the 2nd smallest value in the column is only 2700g. This could be another data entry error (perhaps the weight was meant to be 2850g rather than 285g), or perhaps that penguin is a chick and the rest are adults. This type of issue is much more nuanced and difficult to spot in real world scenarios. Visualizing the data (and distributions in the data) can be very helpful here, which is the focus of the next module."
]
},
{
"cell_type": "markdown",
"id": "57aaa8ca",
"metadata": {},
"source": [
"### Text and Categorical Columns\n",
"\n",
"Note that the `species`, `island`, and `sex` columns do not appear when we use `describe()` above as they contain text. For both text and numeric columns, it can be helpful to know the number of unique values in each column:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "b0e00076",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Id 350\n",
"species 4\n",
"island 3\n",
"bill_length_mm 164\n",
"bill_depth_mm 80\n",
"flipper_length_mm 55\n",
"body_mass_g 96\n",
"sex 2\n",
"year 3\n",
"dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.nunique()"
]
},
{
"cell_type": "markdown",
"id": "85a7b47f",
"metadata": {},
"source": [
"The measurement and `Id` columns have many unique values, whereas columns like `island` have only a few different unique values (categories). Looking closely, the `species` column has four different values, but from the dataset documentation we only expect there to be three penguin species.\n",
"\n",
"The `value_counts()` function, applied to the `species` column, shows the number of occurrences of the different values in that column:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "82ae93ef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Adelie 155\n",
"Gentoo 125\n",
"Chinstrap 70\n",
"UNKNOWN 1\n",
"Name: species, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"species\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "01273fce",
"metadata": {},
"source": [
"The \"Adélie\", \"Chinstrap\", and \"Gentoo\" species described in the documentation all appear, but there's also an \"UNKNOWN\" entry. This looks like it should have been treated as a missing value. To make Pandas correctly treat it as missing we can replace it with `numpy.nan` using the [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html?highlight=replace#pandas.DataFrame.replace) method:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "bfadf286",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Adelie 155\n",
"Gentoo 125\n",
"Chinstrap 70\n",
"Name: species, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"species\"] = df[\"species\"].replace(\"UNKNOWN\", numpy.nan)\n",
"df[\"species\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "cd98d7fa",
"metadata": {},
"source": [
"By default, the `value_counts` will not display the number of missing values in the column. To show that you can use `df[\"species\"].value_counts(dropna=False)` instead. You can also try `df[\"species\"].value_counts(normalize=True)` to show the fraction of data with each value, rather than the count.\n",
"\n",
"We'll look at more approaches for manipulating strings and categories in Sections [2.2.4.2](2-02-04-02-TextData) and [2.2.4.3](2-02-04-02-TextData) of this module.\n",
"\n",
"Finally, it may be interesting to look at how the measurements vary between the species. We can do that with the Pandas [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function: "
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ee83308a",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_55040/2880954085.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n",
" df.groupby(\"species\").mean()\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" year | \n",
"
\n",
" \n",
" species | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Adelie | \n",
" 38.757516 | \n",
" 18.335714 | \n",
" 189.993464 | \n",
" 3934.805195 | \n",
" 2008.006452 | \n",
"
\n",
" \n",
" Chinstrap | \n",
" 48.800000 | \n",
" 18.424286 | \n",
" 195.785714 | \n",
" 3733.571429 | \n",
" 2007.957143 | \n",
"
\n",
" \n",
" Gentoo | \n",
" 47.486290 | \n",
" 14.975806 | \n",
" 217.241935 | \n",
" 5080.241935 | \n",
" 2008.072000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g \\\n",
"species \n",
"Adelie 38.757516 18.335714 189.993464 3934.805195 \n",
"Chinstrap 48.800000 18.424286 195.785714 3733.571429 \n",
"Gentoo 47.486290 14.975806 217.241935 5080.241935 \n",
"\n",
" year \n",
"species \n",
"Adelie 2008.006452 \n",
"Chinstrap 2007.957143 \n",
"Gentoo 2008.072000 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(\"species\").mean()"
]
},
{
"cell_type": "markdown",
"id": "e6b4a401",
"metadata": {},
"source": [
"`df.groupby(\"species\")` splits the date frame into sub-groups with the same value in the \"species\" column. We then must specify a function we want to use to summarize the members of each group, in this case the mean. It looks like, on average, \"Chinstrap\" penguins have the largest bills, but \"Gentoo\" penguins have the largest flippers and body mass. For more information about using `groupby` see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)."
]
},
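{
"cell_type": "markdown",
"id": "c7d8e9f0",
"metadata": {},
"source": [
"A couple of useful variations: selecting the columns of interest before aggregating keeps the output focused (and avoids the deprecation warning above about non-numeric columns), and `agg` computes several summary statistics at once. A quick sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7d8e9f1",
"metadata": {},
"outputs": [],
"source": [
"# mean of two measurement columns, per species\n",
"print(df.groupby(\"species\")[[\"bill_length_mm\", \"body_mass_g\"]].mean())\n",
"\n",
"# several summary statistics of body mass at once\n",
"df.groupby(\"species\")[\"body_mass_g\"].agg([\"mean\", \"min\", \"max\"])"
]
},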
{
"cell_type": "markdown",
"id": "ae5ed4d5",
"metadata": {},
"source": [
"## Duplicate Data\n",
"\n",
"In the output of `df.nunique()` above we see the `Id` column has 350 unique values, one fewer than the 351 rows in the dataset. We expect `Id` to be a unique identifier, so to have 351 unique values (1 for each row). What's going on?\n",
"\n",
"One explanation could be that there are duplicate rows in the data. The [`duplicated`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) method of a data frame returns True for any rows that appear multiple times in the data (with an exact copy of all values):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "23341c3d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"346 False\n",
"347 False\n",
"348 False\n",
"349 False\n",
"350 False\n",
"Length: 351, dtype: bool"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.duplicated()"
]
},
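{
"cell_type": "markdown",
"id": "d8e9f0a1",
"metadata": {},
"source": [
"Summing this boolean series gives a quick count of duplicated rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8e9f0a2",
"metadata": {},
"outputs": [],
"source": [
"# number of rows flagged as duplicates of an earlier row\n",
"df.duplicated().sum()"
]
},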
{
"cell_type": "markdown",
"id": "0024b9ac",
"metadata": {},
"source": [
"We can use this to filter the data frame and show only the duplicated rows:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "90f205fc",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 36 | \n",
" P-276 | \n",
" Chinstrap | \n",
" Dream | \n",
" 46.5 | \n",
" 17.9 | \n",
" 192.0 | \n",
" 3500.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 324 | \n",
" P-276 | \n",
" Chinstrap | \n",
" Dream | \n",
" 46.5 | \n",
" 17.9 | \n",
" 192.0 | \n",
" 3500.0 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id species island bill_length_mm bill_depth_mm \\\n",
"36 P-276 Chinstrap Dream 46.5 17.9 \n",
"324 P-276 Chinstrap Dream 46.5 17.9 \n",
"\n",
" flipper_length_mm body_mass_g sex year \n",
"36 192.0 3500.0 female 2007 \n",
"324 192.0 3500.0 female 2007 "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.duplicated(keep=False)]"
]
},
{
"cell_type": "markdown",
"id": "3692d88d",
"metadata": {},
"source": [
"By default, the `duplicated` function only marks the second and subsequent instances of the same data as being duplicates. Setting `keep=False` marks the first instance as a duplicate as well.\n",
"\n",
"We see there are two entries for a penguin with id `P-276` in the data. Why might that be the case? It could be caused by a data entry/processing issue and be there by mistake, or be a genuine repeated measurement for this penguin, for example. It's important to understand the context before taking any action.\n",
"\n",
"In some cases it may be appropriate to delete the duplicate data. This can be done with the [`drop_duplicates`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html?highlight=drop_duplicates#pandas.DataFrame.drop_duplicates) method:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "39c214f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Length before removing duplicates: 351 rows\n",
"Length after removing duplicates: 350 rows\n"
]
}
],
"source": [
"print(\"Length before removing duplicates:\", len(df), \"rows\")\n",
"df = df.drop_duplicates()\n",
"print(\"Length after removing duplicates:\", len(df), \"rows\")"
]
},
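{
"cell_type": "markdown",
"id": "e9f0a1b2",
"metadata": {},
"source": [
"Both `duplicated` and `drop_duplicates` also accept a `subset` argument if you only want to compare particular columns. For example, to flag rows that share an `Id` even if their other values differ (a sketch, not run here):\n",
"\n",
"```python\n",
"df[df.duplicated(subset=[\"Id\"], keep=False)]\n",
"```"
]
},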
{
"cell_type": "markdown",
"id": "76933b08",
"metadata": {},
"source": [
"## Displaying Data Frames with Style 😎\n",
"\n",
"You can get fancy with how you display data frames by highlighting and formatting cells differently using its `style` attribute. There are a few examples below, for more details see the [Table Visualization page in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Styler-Object-and-HTML).\n",
"\n",
"Change the precision with which numbers are displayed:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "f0de84c8",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 48 | \n",
" 15 | \n",
" 215 | \n",
" 5650 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 41 | \n",
" 17 | \n",
" 187 | \n",
" 3200 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 51 | \n",
" 16 | \n",
" 226 | \n",
" 5200 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36 | \n",
" 17 | \n",
" 187 | \n",
" 3150 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 44 | \n",
" 14 | \n",
" 220 | \n",
" 4700 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36 | \n",
" 17 | \n",
" 187 | \n",
" 3700 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36 | \n",
" 18 | \n",
" 193 | \n",
" 285 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41 | \n",
" 18 | \n",
" 205 | \n",
" 4300 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51 | \n",
" 19 | \n",
" 201 | \n",
" 3950 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 51 | \n",
" 15 | \n",
" 223 | \n",
" 5550 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_top10 = df.head(10) # just style the first 10 rows for demo purposes here\n",
"\n",
"# round values to nearest integer (0 decimal places)\n",
"df_top10.style.format(precision=0)"
]
},
{
"cell_type": "markdown",
"id": "3798286c",
"metadata": {},
"source": [
"Apply a colour gradient to each column based on each cell's value:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "ac0b9191",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 47.800000 | \n",
" 15.000000 | \n",
" 215.000000 | \n",
" 5650.000000 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 40.900000 | \n",
" 16.600000 | \n",
" 187.000000 | \n",
" 3200.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.800000 | \n",
" 15.700000 | \n",
" 226.000000 | \n",
" 5200.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.200000 | \n",
" 17.200000 | \n",
" 187.000000 | \n",
" 3150.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 43.500000 | \n",
" 14.200000 | \n",
" 220.000000 | \n",
" 4700.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36.000000 | \n",
" 17.100000 | \n",
" 187.000000 | \n",
" 3700.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.400000 | \n",
" 18.100000 | \n",
" 193.000000 | \n",
" 285.000000 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41.100000 | \n",
" 18.100000 | \n",
" 205.000000 | \n",
" 4300.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51.400000 | \n",
" 19.000000 | \n",
" 201.000000 | \n",
" 3950.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.700000 | \n",
" 15.000000 | \n",
" 223.000000 | \n",
" 5550.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_top10.style.background_gradient()"
]
},
{
"cell_type": "markdown",
"id": "6d8c4ff0",
"metadata": {},
"source": [
"Highlight the smallest value in each column:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "ed5623ff",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | \n",
" Id | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" P-179 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 47.800000 | \n",
" 15.000000 | \n",
" 215.000000 | \n",
" 5650.000000 | \n",
" male | \n",
" 2007 | \n",
"
\n",
" \n",
" 1 | \n",
" P-306 | \n",
" Chinstrap | \n",
" Dream | \n",
" 40.900000 | \n",
" 16.600000 | \n",
" 187.000000 | \n",
" 3200.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 2 | \n",
" P-247 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.800000 | \n",
" 15.700000 | \n",
" 226.000000 | \n",
" 5200.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 3 | \n",
" P-120 | \n",
" Adelie | \n",
" Torgersen | \n",
" 36.200000 | \n",
" 17.200000 | \n",
" 187.000000 | \n",
" 3150.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 4 | \n",
" P-220 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 43.500000 | \n",
" 14.200000 | \n",
" 220.000000 | \n",
" 4700.000000 | \n",
" female | \n",
" 2008 | \n",
"
\n",
" \n",
" 5 | \n",
" P-150 | \n",
" Adelie | \n",
" Dream | \n",
" 36.000000 | \n",
" 17.100000 | \n",
" 187.000000 | \n",
" 3700.000000 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 6 | \n",
" P-348 | \n",
" Adelie | \n",
" Biscoe | \n",
" 36.400000 | \n",
" 18.100000 | \n",
" 193.000000 | \n",
" 285.000000 | \n",
" female | \n",
" 2007 | \n",
"
\n",
" \n",
" 7 | \n",
" P-091 | \n",
" Adelie | \n",
" Dream | \n",
" 41.100000 | \n",
" 18.100000 | \n",
" 205.000000 | \n",
" 4300.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
" 8 | \n",
" P-327 | \n",
" Chinstrap | \n",
" Dream | \n",
" 51.400000 | \n",
" 19.000000 | \n",
" 201.000000 | \n",
" 3950.000000 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 9 | \n",
" P-221 | \n",
" Gentoo | \n",
" Biscoe | \n",
" 50.700000 | \n",
" 15.000000 | \n",
" 223.000000 | \n",
" 5550.000000 | \n",
" male | \n",
" 2008 | \n",
"
\n",
" \n",
"
\n"
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_top10.style.highlight_min()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}