{
"cells": [
{
"cell_type": "markdown",
"id": "768291be-1495-4111-bffd-ada2e779e332",
"metadata": {},
"source": [
"# 2.1.3 Pandas intro\n",
"\n",
"The Pandas library is a core part of the Python data science ecosystem.\n",
"It provides easy-to-use data structures and data analysis tools.\n",
"\n",
"Pandas has some great resources for getting started, including guides\n",
"tailored to those familiar with other software for manipulating data:\n",
"https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started\n",
".\n",
"\n",
"For now, we’ll stick just to what we need for this course."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "94a5d9cd",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "8c8c3fee",
"metadata": {},
"source": [
"## Structures\n",
"\n",
"Pandas has two main **labelled** data structures: \n",
"- Series\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "81846b73",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0.3\n",
"1 4.0\n",
"2 1.0\n",
"3 NaN\n",
"4 9.0\n",
"dtype: float64\n"
]
}
],
"source": [
"s = pd.Series([0.3, 4, 1, None, 9])\n",
"print(s)"
]
},
{
"cell_type": "markdown",
"id": "db418663",
"metadata": {},
"source": [
"- DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "369b98c0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" random_A | \n",
" random_B | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 1.425158 | \n",
" -0.169013 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.299078 | \n",
" 0.244578 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.503473 | \n",
" -0.465702 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.245454 | \n",
" -0.106239 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.027438 | \n",
" -1.415794 | \n",
"
\n",
" \n",
" 8 | \n",
" -1.414463 | \n",
" -0.493611 | \n",
"
\n",
" \n",
" 9 | \n",
" -0.623091 | \n",
" -0.350707 | \n",
"
\n",
" \n",
" 10 | \n",
" 1.775940 | \n",
" -1.448867 | \n",
"
\n",
" \n",
" 11 | \n",
" 1.201266 | \n",
" -0.084514 | \n",
"
\n",
" \n",
" 12 | \n",
" 1.041766 | \n",
" -1.319784 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" random_A random_B\n",
"3 1.425158 -0.169013\n",
"4 -0.299078 0.244578\n",
"5 0.503473 -0.465702\n",
"6 1.245454 -0.106239\n",
"7 0.027438 -1.415794\n",
"8 -1.414463 -0.493611\n",
"9 -0.623091 -0.350707\n",
"10 1.775940 -1.448867\n",
"11 1.201266 -0.084514\n",
"12 1.041766 -1.319784"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"df = pd.DataFrame(np.random.randn(10,2), index=np.arange(3, 13), columns=[\"random_A\", \"random_B\"]) \n",
"df"
]
},
{
"cell_type": "markdown",
"id": "f0b6dc95",
"metadata": {},
"source": [
"\n",
"Once we have data in these Pandas structures, we can perform some useful\n",
"operations such as:\n",
"\n",
"- `info()` (`DataFrame` only)\n",
" - prints a concise summary of a `DataFrame`\n",
"\n",
"- `value_counts()`\n",
" - returns a `Series` containing counts of unique values in the structure\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7dbb1700",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 1\n",
"4 0\n",
"5 1\n",
"6 0\n",
"7 0\n",
"8 1\n",
"9 0\n",
"dtype: int64\n",
"\n",
"value counts:\n",
"0 7\n",
"1 3\n",
"dtype: int64\n"
]
}
],
"source": [
"s = pd.Series(np.random.randint(0,2,10))\n",
"print(s)\n",
"\n",
"print(\"\\nvalue counts:\")\n",
"print(s.value_counts())"
]
},
{
"cell_type": "markdown",
"id": "f8636002",
"metadata": {},
"source": [
"\n",
"We’ll see more on how to use these structures, and other Pandas\n",
"capabilities, later.\n",
"\n",
"## Indexing\n",
"\n",
"Again, we’re just covering some basics here. For a complete guide to\n",
"indexing in Pandas see\n",
"[here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).\n",
"\n",
"Pandas allows us to use the same basic `[]` indexing and `.` attribute\n",
"operators that we’re used to with Python and NumPy. However, Pandas also\n",
"provides the (often preferred) `.loc` labelled indexing method and the\n",
"`.iloc` position indexing methods.\n",
"\n",
"### `[]` Indexing\n",
"\n",
"For basic `[]` indexing, we can select columns from a DataFrame and\n",
"items from a Series.\n",
"\n",
"#### DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ececba48",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"single column from DataFrame, gives us a Series:\n"
]
},
{
"data": {
"text/plain": [
"3 1.425158\n",
"4 -0.299078\n",
"5 0.503473\n",
"6 1.245454\n",
"7 0.027438\n",
"8 -1.414463\n",
"9 -0.623091\n",
"10 1.775940\n",
"11 1.201266\n",
"12 1.041766\n",
"Name: random_A, dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select a single column\n",
"print(\"single column from DataFrame, gives us a Series:\")\n",
"\n",
"df[\"random_A\"]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7e68ba96",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"two columns from DataFrame, gives us a DataFrame:\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" random_A | \n",
" random_B | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" 1.425158 | \n",
" -0.169013 | \n",
"
\n",
" \n",
" 4 | \n",
" -0.299078 | \n",
" 0.244578 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.503473 | \n",
" -0.465702 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.245454 | \n",
" -0.106239 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.027438 | \n",
" -1.415794 | \n",
"
\n",
" \n",
" 8 | \n",
" -1.414463 | \n",
" -0.493611 | \n",
"
\n",
" \n",
" 9 | \n",
" -0.623091 | \n",
" -0.350707 | \n",
"
\n",
" \n",
" 10 | \n",
" 1.775940 | \n",
" -1.448867 | \n",
"
\n",
" \n",
" 11 | \n",
" 1.201266 | \n",
" -0.084514 | \n",
"
\n",
" \n",
" 12 | \n",
" 1.041766 | \n",
" -1.319784 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" random_A random_B\n",
"3 1.425158 -0.169013\n",
"4 -0.299078 0.244578\n",
"5 0.503473 -0.465702\n",
"6 1.245454 -0.106239\n",
"7 0.027438 -1.415794\n",
"8 -1.414463 -0.493611\n",
"9 -0.623091 -0.350707\n",
"10 1.775940 -1.448867\n",
"11 1.201266 -0.084514\n",
"12 1.041766 -1.319784"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select two columns\n",
"print(\"two columns from DataFrame, gives us a DataFrame:\")\n",
"\n",
"df[[\"random_A\", \"random_B\"]]"
]
},
{
"cell_type": "markdown",
"id": "a33c8c9e",
"metadata": {},
"source": [
"Note that we can't do:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e239e96e",
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [
{
"ename": "KeyError",
"evalue": "5",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3360\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3361\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 5",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_71539/528749764.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3456\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3457\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3458\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3459\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3460\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3361\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3363\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3364\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3365\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhasnans\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 5"
]
}
],
"source": [
"df[5]"
]
},
{
"cell_type": "markdown",
"id": "c40e1b53",
"metadata": {},
"source": [
"as this tries to access a row, not a column. But you can do this with a series (a single column):\n",
"\n",
"#### Series"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "14cb1131",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"single item from Series, gives us an item (of type numpy.int64,in this case):\n"
]
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select single item\n",
"print(\"single item from Series, gives us an item (of type numpy.int64,in this case):\")\n",
"\n",
"s[2]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "daac3c60",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"two items from Series, gives us a Series:\n"
]
},
{
"data": {
"text/plain": [
"2 0\n",
"4 0\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select two items\n",
"print(\"two items from Series, gives us a Series:\")\n",
"\n",
"s[[2, 4]]"
]
},
{
"cell_type": "markdown",
"id": "1c86d07c",
"metadata": {},
"source": [
"### Attribute Access\n",
"\n",
"Similarly, we can access a column from a DataFrame and an item from a\n",
"Series using as an attribute. However, we can’t do this when the label\n",
"is not a valid identifier."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f25ccd09",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3 1.677205\n",
"4 1.543037\n",
"5 0.689541\n",
"6 -0.414047\n",
"7 0.155389\n",
"8 0.126505\n",
"9 0.807657\n",
"10 0.282047\n",
"11 -0.369983\n",
"12 -1.738524\n",
"Name: random_A, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.random_A"
]
},
{
"cell_type": "markdown",
"id": "07ffba84",
"metadata": {},
"source": [
"### `.loc`\n",
"\n",
"`.loc` provides label-based indexing. `.loc` can also be used for\n",
"slicing and we can even provide a `callable` as its input! However, here\n",
"we’ll just show single item access.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9c5935ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"random_A 0.689541\n",
"random_B 1.616827\n",
"Name: 5, dtype: float64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.loc[5]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "251af820",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# and for a Series\n",
"s.loc[2]"
]
},
{
"cell_type": "markdown",
"id": "6430b57c",
"metadata": {},
"source": [
"\n",
"### `.iloc`\n",
"\n",
"`.iloc` provides integer-based indexing. This closely resembles Python and NumPy slicing. Again, we'll just show single item access."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "f891c7ab",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"random_A -1.414463\n",
"random_B -0.493611\n",
"Name: 8, dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# for DataFrame\n",
"df.iloc[5]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "eaf20b9d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# and for a Series\n",
"s.iloc[2]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
},
"vscode": {
"interpreter": {
"hash": "bfb6be6a1e2928ff2d2ab7904db3c9037c4b5ca4f399bd5e7064365a54a3f091"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
|