2.1.3 Pandas intro

The Pandas library is a core part of the Python data science ecosystem. It provides easy-to-use data structures and data analysis tools.

The Pandas documentation has some great resources for getting started, including guides tailored to people familiar with other data manipulation software: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started

For now, we’ll stick just to what we need for this course.

import pandas as pd

Structures

Pandas has two main labelled data structures:

  • Series (a one-dimensional labelled array)

s = pd.Series([0.3, 4, 1, None, 9])
print(s)
0    0.3
1    4.0
2    1.0
3    NaN
4    9.0
dtype: float64
  • DataFrame (a two-dimensional labelled table, whose columns are each a Series)

import numpy as np

df = pd.DataFrame(np.random.randn(10, 2), index=np.arange(3, 13), columns=["random_A", "random_B"])
df
    random_A  random_B
3   0.591502 -0.453185
4  -0.870349 -1.213743
5   0.332426 -0.127790
6   0.219025  0.801325
7  -0.699849 -0.256481
8  -0.671411  1.131856
9   1.733254 -1.113464
10  2.537998  1.169176
11 -1.934276 -0.295280
12  2.125796 -1.515066
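
The values above come from np.random.randn, so they will differ on every run. As a minimal sketch, the block below builds the same two structures but uses NumPy's default_rng with a fixed seed (an assumption, not something the original cells do) so the numbers are repeatable. The names s_example, rng and df_example are purely illustrative.

```python
import numpy as np
import pandas as pd

# None is stored as NaN, which forces a floating-point dtype
s_example = pd.Series([0.3, 4, 1, None, 9])
print(s_example.dtype)          # float64
print(s_example.isna().sum())   # 1

# A seeded generator makes the "random" values repeatable between runs
rng = np.random.default_rng(seed=0)
df_example = pd.DataFrame(
    rng.standard_normal((10, 2)),
    index=np.arange(3, 13),               # row labels 3 through 12
    columns=["random_A", "random_B"],     # column labels
)
print(df_example.shape)         # (10, 2)
print(df_example.index.tolist())
```

Note that the row labels start at 3 rather than 0. This is deliberate, and it matters once we look at the difference between label-based and position-based indexing.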

Once we have data in these Pandas structures, we can perform some useful operations such as:

  • info() (DataFrame only)

    • prints a concise summary of a DataFrame

  • value_counts()

    • returns a Series containing counts of unique values in the structure

# ten random integers, each either 0 or 1
s = pd.Series(np.random.randint(0, 2, 10))
print(s)

print("\nvalue counts:")
print(s.value_counts())
0    0
1    1
2    0
3    1
4    1
5    1
6    0
7    0
8    1
9    0
dtype: int64

value counts:
0    5
1    5
dtype: int64
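
info() is listed above but not demonstrated. A small sketch of both operations on hand-written data might look like the following. The pets DataFrame and its column names are hypothetical, chosen only so the counts are easy to check by eye.

```python
import pandas as pd

# Hypothetical data, small enough to verify by hand
pets = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "bird", "cat"],
    "age": [3, 5, 2, 1, 7],
})

# info() prints its summary (index, columns, non-null counts, dtypes)
# directly and returns None, so there is no need to wrap it in print()
pets.info()

# value_counts() returns a Series indexed by the unique values,
# sorted so that the most frequent value comes first
counts = pets["animal"].value_counts()
print(counts)
print(counts["cat"])   # 3
```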

We’ll see more on how to use these structures, and other Pandas capabilities, later.

Indexing

Again, we’re just covering some basics here. For a complete guide to indexing in Pandas, see the “Indexing and selecting data” section of the Pandas user guide.

Pandas supports the same basic [] indexing and . attribute access that we’re used to from Python and NumPy. However, Pandas also provides the (often preferred) .loc indexer for label-based indexing and the .iloc indexer for position-based indexing.

[] Indexing

For basic [] indexing, we can select columns from a DataFrame and items from a Series.

DataFrame

# select a single column
print("single column from DataFrame, gives us a Series:")

df["random_A"]
single column from DataFrame, gives us a Series:
3     0.591502
4    -0.870349
5     0.332426
6     0.219025
7    -0.699849
8    -0.671411
9     1.733254
10    2.537998
11   -1.934276
12    2.125796
Name: random_A, dtype: float64
# select two columns
print("two columns from DataFrame, gives us a DataFrame:")

df[["random_A", "random_B"]]
two columns from DataFrame, gives us a DataFrame:
    random_A  random_B
3   0.591502 -0.453185
4  -0.870349 -1.213743
5   0.332426 -0.127790
6   0.219025  0.801325
7  -0.699849 -0.256481
8  -0.671411  1.131856
9   1.733254 -1.113464
10  2.537998  1.169176
11 -1.934276 -0.295280
12  2.125796 -1.515066

Note that we can’t do:

df[5]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 5

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 df[5]

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 5

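This fails because [] on a DataFrame looks the key up among the column labels, and 5 is a row label, not a column label. The same kind of lookup does work on a Series (a single column), where [] selects items by their index label: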

Series

# select single item
print("single item from Series, gives us an item (of type numpy.int64, in this case):")

s[2]
single item from Series, gives us an item (of type numpy.int64, in this case):
0
# select two items
print("two items from Series, gives us a Series:")

s[[2, 4]]
two items from Series, gives us a Series:
2    0
4    1
dtype: int64
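
The different behaviour of [] on the two structures can be sketched side by side. The df_demo frame below is a hypothetical three-row stand-in for df with fixed values, and the try/except is only there so the example runs to completion instead of stopping at the KeyError.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"random_A": [0.1, 0.2, 0.3], "random_B": [1.0, 2.0, 3.0]},
    index=[3, 4, 5],
)

# On a DataFrame, [] looks the key up among the column labels
try:
    df_demo[5]
    found = True
except KeyError:
    found = False   # 5 is a row label, not a column label
print(found)        # False

# On a Series, [] looks the key up in the index
col = df_demo["random_A"]
print(col[5])       # 0.3
```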

Attribute Access

Similarly, we can access a column of a DataFrame, or an item of a Series, as an attribute. However, this only works when the label is a valid Python identifier, so it would not work for the integer labels of s.

df.random_A
3     0.591502
4    -0.870349
5     0.332426
6     0.219025
7    -0.699849
8    -0.671411
9     1.733254
10    2.537998
11   -1.934276
12    2.125796
Name: random_A, dtype: float64
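
To illustrate where attribute access stops working, here is a small sketch with hypothetical column names. It covers the case mentioned above (a label that is not a valid identifier) and one further case not covered above: a label that clashes with an existing DataFrame method such as count.

```python
import pandas as pd

df_attr = pd.DataFrame({
    "random_A": [1, 2, 3],
    "random B": [4, 5, 6],   # contains a space, so not a valid identifier
    "count": [7, 8, 9],      # same name as the DataFrame.count method
})

# A valid identifier works as an attribute
print(df_attr.random_A.tolist())      # [1, 2, 3]

# "df_attr.random B" would be a syntax error, so [] is required
print(df_attr["random B"].tolist())   # [4, 5, 6]

# Attribute access finds the method rather than the column
print(callable(df_attr.count))        # True
print(df_attr["count"].tolist())      # [7, 8, 9]
```

Because [] works in all three cases, it is the safer choice whenever column names are not under our control.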

.loc

.loc provides label-based indexing. .loc can also be used for slicing and we can even provide a callable as its input! However, here we’ll just show single item access.

df.loc[5]
random_A    0.332426
random_B   -0.127790
Name: 5, dtype: float64
# and for a Series
s.loc[2]
0
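
The slicing and callable forms of .loc mentioned above can be sketched as follows. The df_loc frame is a hypothetical stand-in with fixed values so the results can be checked by hand.

```python
import pandas as pd

df_loc = pd.DataFrame(
    {"random_A": [0.5, -0.9, 0.3, 0.2], "random_B": [-0.4, -1.2, -0.1, 0.8]},
    index=[3, 4, 5, 6],
)

# Label slices include both endpoints: this selects rows labelled 4 and 5
print(df_loc.loc[4:5])

# A row label and a column label together select a single value
print(df_loc.loc[5, "random_A"])   # 0.3

# A callable receives the DataFrame and returns a valid indexer,
# here a boolean mask keeping the rows where random_A is positive
positive = df_loc.loc[lambda d: d["random_A"] > 0]
print(positive.index.tolist())     # [3, 5, 6]
```

Unlike ordinary Python slices, a .loc slice includes its end label.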

.iloc

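.iloc provides integer position-based indexing, counting from 0 regardless of the index labels. This closely resembles Python and NumPy slicing. Again, we’ll just show single item access. Note that for df, position 5 is the sixth row, whose label is 8.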

# for DataFrame
df.iloc[5]
random_A   -0.671411
random_B    1.131856
Name: 8, dtype: float64
# and for a Series
s.iloc[2]
0
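
Since the same integers can mean different things to .iloc and .loc, a short side-by-side sketch may help. The Series s_pos is hypothetical, with labels deliberately offset from the positions.

```python
import pandas as pd

s_pos = pd.Series([10, 20, 30, 40], index=[3, 4, 5, 6])

# .iloc counts positions from 0 and ignores the index labels
print(s_pos.iloc[0])              # 10
print(s_pos.iloc[-1])             # 40

# Position slices exclude the end, like Python lists: positions 1 and 2
print(s_pos.iloc[1:3].tolist())   # [20, 30]

# .loc treats the same kind of numbers as labels, and includes the end label
print(s_pos.loc[3:4].tolist())    # [10, 20]
```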