{ "cells": [ { "cell_type": "markdown", "id": "768291be-1495-4111-bffd-ada2e779e332", "metadata": {}, "source": [ "# 2.1.3 Pandas intro\n", "\n", "The Pandas library is a core part of the Python data science ecosystem.\n", "It provides easy-to-use data structures and data analysis tools.\n", "\n", "Pandas has some great resources for getting started, including guides\n", "tailored to those familiar with other software for manipulating data:\n", "https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started\n", ".\n", "\n", "For now, we’ll stick just to what we need for this course." ] }, { "cell_type": "code", "execution_count": 2, "id": "94a5d9cd", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "8c8c3fee", "metadata": {}, "source": [ "## Structures\n", "\n", "Pandas has two main **labelled** data structures: \n", "- Series\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "81846b73", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0.3\n", "1 4.0\n", "2 1.0\n", "3 NaN\n", "4 9.0\n", "dtype: float64\n" ] } ], "source": [ "s = pd.Series([0.3, 4, 1, None, 9])\n", "print(s)" ] }, { "cell_type": "markdown", "id": "db418663", "metadata": {}, "source": [ "- DataFrame" ] }, { "cell_type": "code", "execution_count": 4, "id": "369b98c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random_Arandom_B
31.425158-0.169013
4-0.2990780.244578
50.503473-0.465702
61.245454-0.106239
70.027438-1.415794
8-1.414463-0.493611
9-0.623091-0.350707
101.775940-1.448867
111.201266-0.084514
121.041766-1.319784
\n", "
" ], "text/plain": [ " random_A random_B\n", "3 1.425158 -0.169013\n", "4 -0.299078 0.244578\n", "5 0.503473 -0.465702\n", "6 1.245454 -0.106239\n", "7 0.027438 -1.415794\n", "8 -1.414463 -0.493611\n", "9 -0.623091 -0.350707\n", "10 1.775940 -1.448867\n", "11 1.201266 -0.084514\n", "12 1.041766 -1.319784" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "df = pd.DataFrame(np.random.randn(10,2), index=np.arange(3, 13), columns=[\"random_A\", \"random_B\"]) \n", "df" ] }, { "cell_type": "markdown", "id": "f0b6dc95", "metadata": {}, "source": [ "\n", "Once we have data in these Pandas structures, we can perform some useful\n", "operations such as:\n", "\n", "- `info()` (`DataFrame` only)\n", " - prints a concise summary of a `DataFrame`\n", "\n", "- `value_counts()`\n", " - returns a `Series` containing counts of unique values in the structure\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "7dbb1700", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 0\n", "2 0\n", "3 1\n", "4 0\n", "5 1\n", "6 0\n", "7 0\n", "8 1\n", "9 0\n", "dtype: int64\n", "\n", "value counts:\n", "0 7\n", "1 3\n", "dtype: int64\n" ] } ], "source": [ "s = pd.Series(np.random.randint(0,2,10))\n", "print(s)\n", "\n", "print(\"\\nvalue counts:\")\n", "print(s.value_counts())" ] }, { "cell_type": "markdown", "id": "f8636002", "metadata": {}, "source": [ "\n", "We’ll see more on how to use these structures, and other Pandas\n", "capabilities, later.\n", "\n", "## Indexing\n", "\n", "Again, we’re just covering some basics here. For a complete guide to\n", "indexing in Pandas see\n", "[here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).\n", "\n", "Pandas allows us to use the same basic `[]` indexing and `.` attribute\n", "operators that we’re used to with Python and NumPy. However, Pandas also\n", "provides the (often preferred) `.loc` labelled indexing method and the\n", "`.iloc` position indexing methods.\n", "\n", "### `[]` Indexing\n", "\n", "For basic `[]` indexing, we can select columns from a DataFrame and\n", "items from a Series.\n", "\n", "#### DataFrame" ] }, { "cell_type": "code", "execution_count": 6, "id": "ececba48", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "single column from DataFrame, gives us a Series:\n" ] }, { "data": { "text/plain": [ "3 1.425158\n", "4 -0.299078\n", "5 0.503473\n", "6 1.245454\n", "7 0.027438\n", "8 -1.414463\n", "9 -0.623091\n", "10 1.775940\n", "11 1.201266\n", "12 1.041766\n", "Name: random_A, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select a single column\n", "print(\"single column from DataFrame, gives us a Series:\")\n", "\n", "df[\"random_A\"]" ] }, { "cell_type": "code", "execution_count": 8, "id": "7e68ba96", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "two columns from DataFrame, gives us a DataFrame:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random_Arandom_B
31.425158-0.169013
4-0.2990780.244578
50.503473-0.465702
61.245454-0.106239
70.027438-1.415794
8-1.414463-0.493611
9-0.623091-0.350707
101.775940-1.448867
111.201266-0.084514
121.041766-1.319784
\n", "
" ], "text/plain": [ " random_A random_B\n", "3 1.425158 -0.169013\n", "4 -0.299078 0.244578\n", "5 0.503473 -0.465702\n", "6 1.245454 -0.106239\n", "7 0.027438 -1.415794\n", "8 -1.414463 -0.493611\n", "9 -0.623091 -0.350707\n", "10 1.775940 -1.448867\n", "11 1.201266 -0.084514\n", "12 1.041766 -1.319784" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select two columns\n", "print(\"two columns from DataFrame, gives us a DataFrame:\")\n", "\n", "df[[\"random_A\", \"random_B\"]]" ] }, { "cell_type": "markdown", "id": "a33c8c9e", "metadata": {}, "source": [ "Note that we can't do:" ] }, { "cell_type": "code", "execution_count": 12, "id": "e239e96e", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "KeyError", "evalue": "5", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3360\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3361\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 5", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/xv/d5nvn2ps5r3fcf276w707n01qdmpqf/T/ipykernel_71539/528749764.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3456\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3457\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3458\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3459\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3460\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Library/Caches/pypoetry/virtualenvs/rds-course-5zqYD5aN-py3.9/lib/python3.9/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3361\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3363\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3364\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3365\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhasnans\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 5" ] } ], "source": [ "df[5]" ] }, { "cell_type": "markdown", "id": "c40e1b53", "metadata": {}, "source": [ "as this tries to access a row, not a column. But you can do this with a series (a single column):\n", "\n", "#### Series" ] }, { "cell_type": "code", "execution_count": 13, "id": "14cb1131", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "single item from Series, gives us an item (of type numpy.int64,in this case):\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select single item\n", "print(\"single item from Series, gives us an item (of type numpy.int64,in this case):\")\n", "\n", "s[2]" ] }, { "cell_type": "code", "execution_count": 14, "id": "daac3c60", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "two items from Series, gives us a Series:\n" ] }, { "data": { "text/plain": [ "2 0\n", "4 0\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select two items\n", "print(\"two items from Series, gives us a Series:\")\n", "\n", "s[[2, 4]]" ] }, { "cell_type": "markdown", "id": "1c86d07c", "metadata": {}, "source": [ "### Attribute Access\n", "\n", "Similarly, we can access a column from a DataFrame and an item from a\n", "Series using as an attribute. However, we can’t do this when the label\n", "is not a valid identifier." ] }, { "cell_type": "code", "execution_count": 7, "id": "f25ccd09", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 1.677205\n", "4 1.543037\n", "5 0.689541\n", "6 -0.414047\n", "7 0.155389\n", "8 0.126505\n", "9 0.807657\n", "10 0.282047\n", "11 -0.369983\n", "12 -1.738524\n", "Name: random_A, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.random_A" ] }, { "cell_type": "markdown", "id": "07ffba84", "metadata": {}, "source": [ "### `.loc`\n", "\n", "`.loc` provides label-based indexing. `.loc` can also be used for\n", "slicing and we can even provide a `callable` as its input! However, here\n", "we’ll just show single item access.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "9c5935ec", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "random_A 0.689541\n", "random_B 1.616827\n", "Name: 5, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.loc[5]" ] }, { "cell_type": "code", "execution_count": 9, "id": "251af820", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# and for a Series\n", "s.loc[2]" ] }, { "cell_type": "markdown", "id": "6430b57c", "metadata": {}, "source": [ "\n", "### `.iloc`\n", "\n", "`.iloc` provides integer-based indexing. This closely resembles Python and NumPy slicing. Again, we'll just show single item access." ] }, { "cell_type": "code", "execution_count": 16, "id": "f891c7ab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "random_A -1.414463\n", "random_B -0.493611\n", "Name: 8, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# for DataFrame\n", "df.iloc[5]" ] }, { "cell_type": "code", "execution_count": 15, "id": "eaf20b9d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# and for a Series\n", "s.iloc[2]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "vscode": { "interpreter": { "hash": "bfb6be6a1e2928ff2d2ab7904db3c9037c4b5ca4f399bd5e7064365a54a3f091" } } }, "nbformat": 4, "nbformat_minor": 5 }