2.1.3 Pandas intro
The Pandas library is a core part of the Python data science ecosystem. It provides easy-to-use data structures and data analysis tools.
Pandas has some great resources for getting started, including guides tailored to those familiar with other data-manipulation software: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started
For now, we’ll stick just to what we need for this course.
import pandas as pd
Structures
Pandas has two main labelled data structures:
Series
s = pd.Series([0.3, 4, 1, None, 9])
print(s)
0 0.3
1 4.0
2 1.0
3 NaN
4 9.0
dtype: float64
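As a minimal sketch of what "labelled" means here, the same values can be given explicit labels instead of the default 0 to 4. The name s_labelled and the labels "a" to "e" are made up for illustration.

```python
import pandas as pd

# Same values as above, with hypothetical string labels instead of 0..4
s_labelled = pd.Series([0.3, 4, 1, None, 9], index=["a", "b", "c", "d", "e"])

print(s_labelled.index.tolist())  # the labels
print(s_labelled["b"])            # look up a value by its label
print(s_labelled.isna().sum())    # number of missing values (None is stored as NaN)
print(s_labelled.dtype)           # float64, as in the output above
```

The None entry is stored as NaN, which is why the integers in the list end up as floats.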
DataFrame
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(3, 13), columns=["random_A", "random_B"])
df
| | random_A | random_B |
|---|---|---|
| 3 | 0.591502 | -0.453185 |
| 4 | -0.870349 | -1.213743 |
| 5 | 0.332426 | -0.127790 |
| 6 | 0.219025 | 0.801325 |
| 7 | -0.699849 | -0.256481 |
| 8 | -0.671411 | 1.131856 |
| 9 | 1.733254 | -1.113464 |
| 10 | 2.537998 | 1.169176 |
| 11 | -1.934276 | -0.295280 |
| 12 | 2.125796 | -1.515066 |
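A DataFrame can also be built from a dictionary that maps column names to values. The sketch below is one possible way to do this; the name df_demo and the seeded generator are assumptions added so that the result is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for this sketch) so reruns give the same numbers
rng = np.random.default_rng(0)

# Dictionary keys become column labels; index supplies the row labels 3..12
df_demo = pd.DataFrame(
    {"random_A": rng.standard_normal(10), "random_B": rng.standard_normal(10)},
    index=np.arange(3, 13),
)

print(df_demo.shape)             # (rows, columns)
print(df_demo.index.tolist())    # row labels
print(df_demo.columns.tolist())  # column labels
```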
Once we have data in these Pandas structures, we can perform some useful operations such as:
info() (DataFrame only) prints a concise summary of a DataFrame
value_counts() returns a Series containing counts of unique values in the structure
s = pd.Series(np.random.randint(0,2,10))
print(s)
print("\nvalue counts:")
print(s.value_counts())
0 0
1 1
2 0
3 1
4 1
5 1
6 0
7 0
8 1
9 0
dtype: int64
value counts:
0 5
1 5
dtype: int64
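Only value_counts() is demonstrated above, so here is a small sketch of info() as well. The DataFrame df_info and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric column (containing a NaN) and one text column
df_info = pd.DataFrame({"x": [1.0, np.nan, 3.0], "label": ["a", "b", "a"]})

# info() prints its summary (index, columns, non-null counts, dtypes)
# directly and returns None, so there is no need to wrap it in print()
df_info.info()

# value_counts() on a single column returns a Series indexed by the unique values
counts = df_info["label"].value_counts()
print(counts)
```

The non-null counts reported by info() are a quick way to spot columns with missing values.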
We’ll see more on how to use these structures, and other Pandas capabilities, later.
Indexing
Again, we’re just covering some basics here. For a complete guide to indexing in Pandas, see the "Indexing and selecting data" section of the Pandas user guide.
Pandas allows us to use the same basic [] indexing and . attribute operators that we’re used to with Python and NumPy. However, Pandas also provides the (often preferred) .loc label-based indexing method and the .iloc position-based indexing method.
[] Indexing
For basic [] indexing, we can select columns from a DataFrame and items from a Series.
DataFrame
# select a single column
print("single column from DataFrame, gives us a Series:")
df["random_A"]
single column from DataFrame, gives us a Series:
3 0.591502
4 -0.870349
5 0.332426
6 0.219025
7 -0.699849
8 -0.671411
9 1.733254
10 2.537998
11 -1.934276
12 2.125796
Name: random_A, dtype: float64
# select two columns
print("two columns from DataFrame, gives us a DataFrame:")
df[["random_A", "random_B"]]
two columns from DataFrame, gives us a DataFrame:
| | random_A | random_B |
|---|---|---|
| 3 | 0.591502 | -0.453185 |
| 4 | -0.870349 | -1.213743 |
| 5 | 0.332426 | -0.127790 |
| 6 | 0.219025 | 0.801325 |
| 7 | -0.699849 | -0.256481 |
| 8 | -0.671411 | 1.131856 |
| 9 | 1.733254 | -1.113464 |
| 10 | 2.537998 | 1.169176 |
| 11 | -1.934276 | -0.295280 |
| 12 | 2.125796 | -1.515066 |
Note that we can’t do:
df[5]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3801 try:
-> 3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 5
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[7], line 1
----> 1 df[5]
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
3805 if self.columns.nlevels > 1:
3806 return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
3808 if is_integer(indexer):
3809 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
3807 # InvalidIndexError. Otherwise we fall through and re-raise
3808 # the TypeError.
3809 self._check_indexing_error(key)
KeyError: 5
because [] on a DataFrame looks up column labels, and 5 is a row label, not a column label. On a Series (a single column), however, [] does select items by their index label:
Series
# select single item
print("single item from Series, gives us an item (of type numpy.int64, in this case):")
s[2]
single item from Series, gives us an item (of type numpy.int64, in this case):
0
# select two items
print("two items from Series, gives us a Series:")
s[[2, 4]]
two items from Series, gives us a Series:
2 0
4 1
dtype: int64
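The s[2] lookup above works because 2 is one of the labels in the Series index. As a rough sketch, using a hypothetical Series s_demo whose labels start at 3, [] on an integer-indexed Series matches labels rather than positions:

```python
import pandas as pd

# Hypothetical Series whose integer labels (3, 4, 5) differ from its positions (0, 1, 2)
s_demo = pd.Series([10, 20, 30], index=[3, 4, 5])

print(s_demo[3])  # label 3 -> 10

# 0 is a valid position but not a label, so [] raises KeyError
raised = False
try:
    s_demo[0]
except KeyError as err:
    raised = True
    print("KeyError:", err)
```

When positions and labels differ like this, .loc and .iloc (covered below) make the intent explicit.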
Attribute Access
Similarly, we can access a column of a DataFrame or an item of a Series as an attribute. However, this does not work when the label is not a valid Python identifier.
df.random_A
3 0.591502
4 -0.870349
5 0.332426
6 0.219025
7 -0.699849
8 -0.671411
9 1.733254
10 2.537998
11 -1.934276
12 2.125796
Name: random_A, dtype: float64
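To illustrate the limitation, here is a small sketch with hypothetical labels. A label containing a space cannot be written after the dot, and a label that matches an existing method name (such as count) gives the method rather than the column. In both cases [] still works.

```python
import pandas as pd

# Hypothetical frame: one awkward label, one label that clashes with a method
df_attr = pd.DataFrame({"random A": [1, 2], "count": [3, 4], "ok_name": [5, 6]})

print(df_attr.ok_name.tolist())      # valid identifier: attribute access works
print(df_attr["random A"].tolist())  # `df_attr.random A` would be a SyntaxError
print(callable(df_attr.count))       # True: this is the DataFrame.count method
print(df_attr["count"].tolist())     # [] reaches the column

# The same idea for a Series with string labels
s_attr = pd.Series([1.5, 2.5], index=["first", "second"])
print(s_attr.first_valid_index())    # an ordinary method call
print(s_attr.second)                 # item with label "second"
```

This is one reason [] or .loc is often the safer choice in reusable code.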
.loc
.loc provides label-based indexing. .loc can also be used for slicing and we can even provide a callable as its input! However, here we’ll just show single item access.
df.loc[5]
random_A 0.332426
random_B -0.127790
Name: 5, dtype: float64
# and for a Series
s.loc[2]
0
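The slicing and callable forms mentioned above can be sketched as follows. The small DataFrame df_loc and its values are made up for illustration. It also shows why a row label needs .loc rather than []: the label lives in the index, not in the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with row labels 3..6
df_loc = pd.DataFrame(
    {"random_A": [0.5, -0.8, 0.3, 0.2], "random_B": [-0.4, -1.2, -0.1, 0.8]},
    index=np.arange(3, 7),
)

print(5 in df_loc.columns)  # False: 5 is not a column label
print(5 in df_loc.index)    # True: 5 is a row label

row = df_loc.loc[5]                 # single row, returned as a Series
block = df_loc.loc[4:5]             # label slice: both end labels are included
value = df_loc.loc[5, "random_A"]   # single cell by row and column label

# Callable: receives the DataFrame and returns a boolean mask (or labels)
positive = df_loc.loc[lambda d: d["random_A"] > 0]

print(block)
print(value)
print(positive)
```

Unlike ordinary Python slices, a .loc label slice includes its end label.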
.iloc
.iloc provides integer-based indexing. This closely resembles Python and NumPy slicing. Again, we’ll just show single item access.
# for DataFrame
df.iloc[5]
random_A -0.671411
random_B 1.131856
Name: 8, dtype: float64
# and for a Series
s.iloc[2]
0
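For comparison, here is a sketch of .iloc slicing on a hypothetical DataFrame df_pos. Positions start at 0 regardless of the row labels, and slices exclude the end position, as in Python and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with row labels 3..6, so positions and labels differ
df_pos = pd.DataFrame(
    {"random_A": [0.5, -0.8, 0.3, 0.2], "random_B": [-0.4, -1.2, -0.1, 0.8]},
    index=np.arange(3, 7),
)

first_two = df_pos.iloc[0:2]  # positions 0 and 1 (end position excluded)
cell = df_pos.iloc[2, 0]      # third row, first column
last_row = df_pos.iloc[-1]    # negative positions count from the end

print(first_two.index.tolist())  # labels of the rows at positions 0 and 1
print(cell)
print(last_row.name)             # label of the last row
```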