2.2.2 Modifying Columns and Indices
Contents
2.2.2 Modifying Columns and Indices¶
Dropping Named Columns and Indices¶
We may wish to drop columns or indices entirely from a Pandas DataFrame.
This may be where we’ve already done some analysis and know the labels
of the column(s) to drop. We use the
DataFrame.drop
method:
from IPython.display import display
import pandas as pd
import numpy as np
# construct df with null values
df = pd.DataFrame.from_dict({ "col_a": [1, 2, np.nan, 4], "col_b": ["aaa", "bbb", "ccc", "ddd"], "col_c": ["AAA", "BBB", "CCC", None],"col_d": [np.nan, np.nan, np.nan, np.nan], })
# show the dataframe
print("original dataframe:")
display(df)
print("drop ‘col_b’:")
display(df.drop(columns=["col_b"]))
original dataframe:
col_a | col_b | col_c | col_d | |
---|---|---|---|---|
0 | 1.0 | aaa | AAA | NaN |
1 | 2.0 | bbb | BBB | NaN |
2 | NaN | ccc | CCC | NaN |
3 | 4.0 | ddd | None | NaN |
drop ‘col_b’:
col_a | col_c | col_d | |
---|---|---|---|
0 | 1.0 | AAA | NaN |
1 | 2.0 | BBB | NaN |
2 | NaN | CCC | NaN |
3 | 4.0 | None | NaN |
We can also use this method to drop indices:
# show the dataframe
print("drop rows 1 and 3")
display(df.drop(index=[1,3]))
drop rows 1 and 3
col_a | col_b | col_c | col_d | |
---|---|---|---|---|
0 | 1.0 | aaa | AAA | NaN |
2 | NaN | ccc | CCC | NaN |
Dropping Columns and Indices with Nulls¶
It may be that we wish to drop columns according to whether they have
missing data. Here, we use the
DataFrame.dropna
method:
# show the dataframe
print("original dataframe recap:")
display(df)
print("drop columns where *all* values are NA:")
display(df.dropna(axis="columns", how="all"))
print("drop columns where *any* values are NA:")
print("note that None is treated like NaN.")
display(df.dropna(axis="columns", how="any"))
original dataframe recap:
col_a | col_b | col_c | col_d | |
---|---|---|---|---|
0 | 1.0 | aaa | AAA | NaN |
1 | 2.0 | bbb | BBB | NaN |
2 | NaN | ccc | CCC | NaN |
3 | 4.0 | ddd | None | NaN |
drop columns where *all* values are NA:
col_a | col_b | col_c | |
---|---|---|---|
0 | 1.0 | aaa | AAA |
1 | 2.0 | bbb | BBB |
2 | NaN | ccc | CCC |
3 | 4.0 | ddd | None |
drop columns where *any* values are NA:
note that None is treated like NaN.
col_b | |
---|---|
0 | aaa |
1 | bbb |
2 | ccc |
3 | ddd |
Once again, we can use this method to drop indices:
print("drop rows where any of 'col_a', 'col_b', 'col_c' are NA:")
df.dropna(
axis="index",
how="any",
subset=["col_a", "col_b", "col_c"]
)
drop rows where any of 'col_a', 'col_b', 'col_c' are NA:
col_a | col_b | col_c | col_d | |
---|---|---|---|---|
0 | 1.0 | aaa | AAA | NaN |
1 | 2.0 | bbb | BBB | NaN |
Relabelling Columns and Indices¶
Sometimes we’ll be dealing with data that is inconveniently named.
Pandas provides an easy way to rename columns:
df = pd.DataFrame({"long_column_name": [1, 2, 3], "short": [4, 5, 6]})
print("original df:")
display(df)
df = df.rename(columns={"long_column_name": "long"})
print("renamed df:")
display(df)
original df:
long_column_name | short | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
renamed df:
long | short | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
We can also modify indices in this fashion.
print("original df:")
display(df)
df = df.rename(index={0: 10})
print("renamed df:")
display(df)
original df:
long | short | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
renamed df:
long | short | |
---|---|---|
10 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |