2.2.2 Modifying Columns and Indices

Dropping Named Columns and Indices

We may wish to drop columns or indices entirely from a Pandas DataFrame.

This may be where we’ve already done some analysis and know the labels of the column(s) to drop. We use the DataFrame.drop method:

from IPython.display import display

import pandas as pd 
import numpy as np

# construct df with null values

df = pd.DataFrame.from_dict({ "col_a": [1, 2, np.nan, 4], "col_b": ["aaa", "bbb", "ccc", "ddd"], "col_c": ["AAA", "BBB", "CCC", None],"col_d": [np.nan, np.nan, np.nan, np.nan], })

# show the dataframe
print("original dataframe:") 
display(df)

print("drop ‘col_b’:") 
display(df.drop(columns=["col_b"]))
original dataframe:
col_a col_b col_c col_d
0 1.0 aaa AAA NaN
1 2.0 bbb BBB NaN
2 NaN ccc CCC NaN
3 4.0 ddd None NaN
drop ‘col_b’:
col_a col_c col_d
0 1.0 AAA NaN
1 2.0 BBB NaN
2 NaN CCC NaN
3 4.0 None NaN

We can also use this method to drop indices:

# show the dataframe
print("drop rows 1 and 3")
display(df.drop(index=[1,3]))
drop rows 1 and 3
col_a col_b col_c col_d
0 1.0 aaa AAA NaN
2 NaN ccc CCC NaN

Dropping Columns and Indices with Nulls

It may be that we wish to drop columns according to whether they have missing data. Here, we use the DataFrame.dropna method:

# show the dataframe 
print("original dataframe recap:")
display(df)

print("drop columns where *all* values are NA:")
display(df.dropna(axis="columns", how="all"))

print("drop columns where *any* values are NA:")
print("note that None is treated like NaN.") 
display(df.dropna(axis="columns", how="any"))
original dataframe recap:
col_a col_b col_c col_d
0 1.0 aaa AAA NaN
1 2.0 bbb BBB NaN
2 NaN ccc CCC NaN
3 4.0 ddd None NaN
drop columns where *all* values are NA:
col_a col_b col_c
0 1.0 aaa AAA
1 2.0 bbb BBB
2 NaN ccc CCC
3 4.0 ddd None
drop columns where *any* values are NA:
note that None is treated like NaN.
col_b
0 aaa
1 bbb
2 ccc
3 ddd

Once again, we can use this method to drop indices:

print("drop rows where any of  'col_a', 'col_b', 'col_c' are NA:")

df.dropna(
    axis="index",
    how="any",
    subset=["col_a", "col_b", "col_c"]
)
drop rows where any of  'col_a', 'col_b', 'col_c' are NA:
col_a col_b col_c col_d
0 1.0 aaa AAA NaN
1 2.0 bbb BBB NaN

Relabelling Columns and Indices

Sometimes we’ll be dealing with data that is inconveniently named.

Pandas provides an easy way to rename columns:

df = pd.DataFrame({"long_column_name": [1, 2, 3], "short": [4, 5, 6]}) 
print("original df:") 
display(df) 

df = df.rename(columns={"long_column_name": "long"})
print("renamed df:")
display(df)
original df:
long_column_name short
0 1 4
1 2 5
2 3 6
renamed df:
long short
0 1 4
1 2 5
2 3 6

We can also modify indices in this fashion.

print("original df:")
display(df)

df = df.rename(index={0: 10})
print("renamed df:")
display(df)
original df:
long short
0 1 4
1 2 5
2 3 6
renamed df:
long short
10 1 4
1 2 5
2 3 6