2.2.1 Data Consistency¶

In Section 2.1.4 we saw how to load data of different types into pandas. In this section we’ll look at the first steps you can take to explore the data, understand what it contains, and spot a few common data consistency issues, such as missing or mis-interpreted values.

Domain Knowledge¶

Before jumping into looking at and analysing the data you should check any information you’ve been given about it, so you know what to expect. In this section we’ll be using the Palmer penguins dataset, adapted from the version created by Allison Horst available here (doi:10.5281/zenodo.3960218).

We’ve made changes to the data to demonstrate the concepts we’re teaching, adding missing values and other common data issues. The (cleaner) original file is available here.

The dataset was originally collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network, and published in the PLOS ONE journal (doi:10.1371/journal.pone.0090081) in 2014.

It includes measurements of the bill size, flipper length and weight of three different species of penguin (AdĂ©lie, Chinstrap, Gentoo) on three different islands (Biscoe, Dream, Torgersen) in the Palmer Archipelago, Antarctica. The dataset homepage contains more information about the columns and data types we expect. To reiterate, it’s always important to check the documentation and associated literature first.

[Image: lter_penguins.png (illustrations of the three penguin species)]

Artwork by @allison_horst.

Having a First Look at the Data¶

The dataset is saved in data/penguins.csv and we can load it with pd.read_csv, as seen previously:

import pandas as pd
import numpy
df = pd.read_csv("data/penguins.csv")

Display the first ten rows of the data:

df.head(10)
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 47.8 15.0 215.0 5650.0 male 2007
1 P-306 Chinstrap Dream 40.9 16.6 187.0 3200.0 female 2008
2 P-247 Gentoo Biscoe 50.8 15.7 226.0 5200.0 male 2009
3 P-120 Adelie Torgersen 36.2 17.2 187.0 3150.0 female 2009
4 P-220 Gentoo Biscoe 43.5 14.2 220.0 4700.0 female 2008
5 P-150 Adelie Dream 36.0 17.1 187.0 3700.0 female 2009
6 P-348 Adelie Biscoe 36.4 18.1 193.0 285.0 female 2007
7 P-091 Adelie Dream 41.1 18.1 205.0 4300.0 male 2008
8 P-327 Chinstrap Dream 51.4 19.0 201.0 3950.0 male 2009
9 P-221 Gentoo Biscoe 50.7 15.0 223.0 5550.0 male 2008

We could also look at the last few rows of the data with df.tail(), or a random sample of rows with df.sample().
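
For example (a minimal sketch; the random_state argument is optional and only fixes the sample for reproducibility):

# last 5 rows of the data frame
df.tail()

# 5 randomly chosen rows (random_state makes the sample reproducible)
df.sample(5, random_state=42)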

To check the number of rows and columns we can use:

print(df.shape)
(351, 9)

Our data has 351 rows and 9 columns. It might also be useful to look at the column names (especially for larger datasets with many columns, where they may not all be displayed by df.head()):

print(df.columns)
Index(['Id', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')

A useful command that summarises much of this information is df.info():

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Id                 351 non-null    object 
 1   species            351 non-null    object 
 2   island             351 non-null    object 
 3   bill_length_mm     347 non-null    float64
 4   bill_depth_mm      349 non-null    object 
 5   flipper_length_mm  349 non-null    float64
 6   body_mass_g        349 non-null    float64
 7   sex                340 non-null    object 
 8   year               351 non-null    int64  
dtypes: float64(3), int64(1), object(5)
memory usage: 24.8+ KB

This gives us the number of rows (entries) and columns at the top, and then a table with the name, number of non-null values, and data type of each column. Finally, it gives the amount of memory the data frame is using. Pandas can use a lot of memory, which may cause problems when analysing large datasets. The Scaling to large datasets page in the Pandas documentation gives pointers for what you can try in that case.

Null Values¶

The data frame info shows we have 351 “non-null” values in the Id, species, island and year columns, but fewer in the other columns.

“Null values” is Pandas’ way of describing data that is missing. Under the hood, these are encoded as NumPy’s NaN (not a number) constant (see here), which has type float64 so numeric columns with NaN values still have a numeric type and can have numeric operations applied to them.
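
A minimal sketch of this behaviour, using a small throwaway series:

# NaN forces a float dtype, but numeric operations still work
s = pd.Series([1.0, numpy.nan, 3.0])
print(s.dtype)  # float64
print(s.sum())  # 4.0 -- NaN values are skipped by default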

To find missing values in a column we can use the isnull() function:

is_missing = df["bill_length_mm"].isnull()
print(is_missing)
0      False
1      False
2      False
3      False
4      False
       ...  
346    False
347    False
348    False
349    False
350    False
Name: bill_length_mm, Length: 351, dtype: bool

This returns a boolean series which is True if that row’s value is NaN, which can then be used to filter the data frame and show only the rows with missing data in the bill_length_mm column:

df[is_missing]
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
233 P-344 Chinstrap Dream NaN 19.2 197.0 4000.0 male 2008
286 P-003 Adelie Torgersen NaN NaN NaN NaN NaN 2007
307 P-271 Gentoo Biscoe NaN NaN NaN NaN NaN 2009
312 P-345 Adelie Torgersen NaN 18.0 193.0 43400.0 female 2009
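
To get a quick overview of missingness across the whole data frame, we can also chain isnull() with sum(), which counts the True values in each column (a small sketch):

# number of missing values in each column
print(df.isnull().sum())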

There are many reasons data could be missing, and how you choose to deal with missing values is an important part of any research project. We’ll revisit this later.

Unexpected Column Types¶

Looking at the first few rows of our data (the output of df.head() above), we expect the bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and year columns to have a numeric type. Comparing with the output of df.info() above, most of them do, having a dtype (data type) of either int64 or float64.

However, the bill_depth_mm column has a dtype of object, which usually means the column is being treated as strings/text data. This will generally be because there is at least one row in the column that Pandas was not able to parse as a number. Common reasons this might happen include:

  • Data entry errors and typos, for example “23/15” instead of “23.15”.

  • Encoding of missing values: The pd.read_csv() function checks for common string representations of missing values like “NA” or “NULL” and converts these to numpy.nan when loading the data. But many different conventions for missing data exist, such as more verbose representations like “UNKNOWN”, and Pandas will load these as strings by default. This can be customised with the na_values parameter of pd.read_csv() (see the sketch after this list).

  • Additional metadata incorrectly loaded into the data frame, such as CSV files with headers and footers (as seen in the Data Sources & Formats section previously).
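
As an example of the na_values parameter mentioned above, here is a minimal sketch telling Pandas to also treat the string “UNKNOWN” as missing when loading the file (the default markers like “NA” are still recognised unless keep_default_na is disabled; we don’t apply this to our df here):

# treat "UNKNOWN" as missing, in addition to the default markers like "NA"
df_custom = pd.read_csv("data/penguins.csv", na_values=["UNKNOWN"])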

To see what’s wrong with the bill_depth_mm column we can try to convert it to a numeric type with the pd.to_numeric function:

df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "14,2"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
    183 coerce_numeric = errors not in ("ignore", "raise")
    184 try:
--> 185     values, _ = lib.maybe_convert_numeric(
    186         values, set(), coerce_numeric=coerce_numeric
    187     )
    188 except (ValueError, TypeError):
    189     if errors == "raise":

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2411, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "14,2" at position 142

The error above tells us Pandas has encountered a value “14,2”, which it doesn’t know how to convert into a number. It also says the problem is at position 142 (which, with our default RangeIndex, is also the row label), so we can access the value directly to check:

df.loc[142, "bill_depth_mm"]
'14,2'

In this case it looks like a typo: the person entering the data probably meant to write 14.2. But we should check this first; there may be information in the data documentation, or you could ask the data provider.

Let’s say we’re confident it is a typo. We can fix it ourselves and then convert the column to a numeric type:

# set the incorrectly typed number to its intended value
df.loc[142, "bill_depth_mm"] = 14.2
# convert the column to a numeric type
df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Id                 351 non-null    object 
 1   species            351 non-null    object 
 2   island             351 non-null    object 
 3   bill_length_mm     347 non-null    float64
 4   bill_depth_mm      349 non-null    float64
 5   flipper_length_mm  349 non-null    float64
 6   body_mass_g        349 non-null    float64
 7   sex                340 non-null    object 
 8   year               351 non-null    int64  
dtypes: float64(4), int64(1), object(4)
memory usage: 24.8+ KB

The bill_depth_mm column now has dtype float64, as we originally expected.

This was a simple example with just one strange value; we’ll see more approaches for handling and sanitising strings later.
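
As an aside, if a column contained many unparseable values, fixing them one by one might not be practical. Passing errors="coerce" to pd.to_numeric converts anything it can’t parse into NaN instead of raising an error (a sketch; whether silently converting such values to missing is appropriate depends on your data):

# convert the column, turning unparseable strings into NaN instead of raising
df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"], errors="coerce")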

Sanity Checking Values¶

Numeric Columns¶

The pandas describe() function gives summary statistics for the numeric columns in our data (the mean, standard deviation, minimum and maximum value, and quartiles for each column):

df.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
count 347.000000 349.000000 349.000000 349.000000 351.000000
mean 43.923055 17.152722 200.088825 4305.329513 2008.022792
std 5.491795 1.967049 21.320100 2256.300048 0.820832
min 32.100000 13.100000 -99.000000 285.000000 2007.000000
25% 39.200000 15.600000 190.000000 3550.000000 2007.000000
50% 44.500000 17.300000 197.000000 4050.000000 2008.000000
75% 48.500000 18.700000 213.000000 4775.000000 2009.000000
max 59.600000 21.500000 231.000000 43400.000000 2009.000000

Even though bill_length_mm, and several of the other columns, have missing (NaN) values, Pandas is still able to compute statistics for them: when calculating these, Pandas ignores all NaN values by default. To change this behaviour, some functions have a skipna argument; for example, df["bill_length_mm"].mean(skipna=False) will return NaN if there are any NaN values in the column.

You should think carefully about which approach is more suitable for your data (for example, if a column only has a few non-null values, will the mean be representative?).
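
A minimal sketch contrasting the two behaviours on our data:

# NaN values ignored (the default): the mean of the 347 non-null values
print(df["bill_length_mm"].mean())

# returns NaN, because the column contains missing values
print(df["bill_length_mm"].mean(skipna=False))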

Looking at these values gives us a better idea of what our data contains, but also allows us to perform some sanity checks. For example, do the minimum and maximum values in each column make sense given what we know about the dataset?

There are two things that might stand out. First, the flipper_length_mm column has a minimum value of -99, but all the other values in the data are positive, as we’d expect for measurements of lengths, widths and weights. In some datasets missing data is represented with negative sentinel values like -99 (but this isn’t always the case, so, as always, make sure to check what such values mean in any data you’re using).

If we’re sure -99 is meant to be a missing value, we can replace those with numpy.nan so Pandas will treat them correctly:

df = df.replace(-99, numpy.nan)

With these values replaced, the “actual” minimum value of flipper_length_mm is 172 mm:

df["flipper_length_mm"].min()
172.0
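
Note that df.replace(-99, numpy.nan) above replaces -99 everywhere in the data frame. If -99 could be a legitimate value in some other column, a safer alternative (a sketch) is to target only the affected column:

# replace -99 only in flipper_length_mm, leaving other columns untouched
df["flipper_length_mm"] = df["flipper_length_mm"].replace(-99, numpy.nan)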

The second thing that may stand out is the minimum value of 285 grams in body_mass_g, which looks far smaller than the other values (e.g., the lower quartile (25th percentile) of body_mass_g is 3550g). Excluding the 285g value, the next lightest penguin weighs 2700g:

# True for each row with body_mass_g greater than the minimum value of 285g
heavier_penguins = df["body_mass_g"] > df["body_mass_g"].min()

# lightest penguin weight out of all rows with weights above 285g
df.loc[heavier_penguins, "body_mass_g"].min()
2700.0

Another way to see this would be to sort the data frame by body mass using the sort_values function:

df.sort_values(by="body_mass_g").head(10)
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
6 P-348 Adelie Biscoe 36.4 18.1 193.0 285.0 female 2007
236 P-314 Chinstrap Dream 46.9 16.6 192.0 2700.0 female 2008
248 P-058 Adelie Biscoe 36.5 16.6 181.0 2850.0 female 2008
311 P-064 Adelie Biscoe 36.4 17.1 184.0 2850.0 female 2008
349 P-098 Adelie Dream 33.1 16.1 178.0 2900.0 female 2008
227 P-298 Chinstrap Dream 43.2 16.6 187.0 2900.0 female 2007
270 P-116 Adelie Torgersen 38.6 17.0 188.0 2900.0 female 2009
17 P-054 Adelie Biscoe 34.5 18.1 187.0 2900.0 female 2008
137 P-104 Adelie Biscoe 37.9 18.6 193.0 2925.0 female 2009
337 P-047 Adelie Dream 37.5 18.9 179.0 2975.0 NaN 2007

By default sort_values sorts from smallest to largest; you can change that by setting ascending=False.

Again, we see the second smallest value in the column is only 2700g. This could be another data entry error (perhaps the weight was meant to be 2850g rather than 285g), or perhaps that penguin is a chick and the rest are adults. This type of issue is much more nuanced and difficult to spot in real-world scenarios. Visualising the data (and the distributions in the data) can be very helpful here, which is the focus of the next module.
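
As a quick preview of that (a sketch, assuming matplotlib is installed, which Pandas uses for plotting), a histogram makes the 285g outlier easy to spot:

import matplotlib.pyplot as plt

# histogram of penguin body mass; the 285g value sits far from the rest
df["body_mass_g"].plot(kind="hist", bins=30)
plt.xlabel("body_mass_g")
plt.show()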

Text and Categorical Columns¶

Note that the species, island, and sex columns do not appear when we use describe() above as they contain text. For both text and numeric columns, it can be helpful to know the number of unique values in each column:

df.nunique()
Id                   350
species                4
island                 3
bill_length_mm       164
bill_depth_mm         80
flipper_length_mm     55
body_mass_g           96
sex                    2
year                   3
dtype: int64

The measurement and Id columns have many unique values, whereas columns like island have only a few different unique values (categories). Looking closely, the species column has four different values, but from the dataset documentation we only expect there to be three penguin species.

The value_counts() function, applied to the species column, shows the number of occurrences of the different values in that column:

df["species"].value_counts()
Adelie       155
Gentoo       125
Chinstrap     70
UNKNOWN        1
Name: species, dtype: int64

The “AdĂ©lie”, “Chinstrap”, and “Gentoo” species described in the documentation all appear, but there’s also an “UNKNOWN” entry. This looks like it should have been treated as a missing value. To make Pandas correctly treat it as missing we can replace it with numpy.nan using the replace method:

df["species"] = df["species"].replace("UNKNOWN", numpy.nan)
df["species"].value_counts()
Adelie       155
Gentoo       125
Chinstrap     70
Name: species, dtype: int64

By default, value_counts() does not display the number of missing values in the column. To show that, you can use df["species"].value_counts(dropna=False) instead. You can also try df["species"].value_counts(normalize=True) to show the fraction of data with each value, rather than the count.
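
A minimal sketch of both options:

# include missing values in the counts
df["species"].value_counts(dropna=False)

# show fractions instead of counts
df["species"].value_counts(normalize=True)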

We’ll look at more approaches for manipulating strings and categories in Sections 2.2.4.2 and 2.2.4.3 of this module.

Finally, it may be interesting to look at how the measurements vary between the species. We can do that with the Pandas groupby function:

df.groupby("species").mean()
/tmp/ipykernel_2269/2880954085.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  df.groupby("species").mean()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
species
Adelie 38.757516 18.335714 189.993464 3934.805195 2008.006452
Chinstrap 48.800000 18.424286 195.785714 3733.571429 2007.957143
Gentoo 47.486290 14.975806 217.241935 5080.241935 2008.072000

df.groupby("species") splits the date frame into sub-groups with the same value in the “species” column. We then must specify a function we want to use to summarize the members of each group, in this case the mean. It looks like, on average, “Chinstrap” penguins have the largest bills, but “Gentoo” penguins have the largest flippers and body mass. For more information about using groupby see here.

Duplicate Data¶

In the output of df.nunique() above we see the Id column has 350 unique values, one fewer than the 351 rows in the dataset. We expect Id to be a unique identifier, so we’d expect 351 unique values (one for each row). What’s going on?

One explanation could be that there are duplicate rows in the data. The duplicated method of a data frame returns True for any rows that appear multiple times in the data (with an exact copy of all values):

df.duplicated()
0      False
1      False
2      False
3      False
4      False
       ...  
346    False
347    False
348    False
349    False
350    False
Length: 351, dtype: bool

We can use this to filter the data frame and show only the duplicated rows:

df[df.duplicated(keep=False)]
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
36 P-276 Chinstrap Dream 46.5 17.9 192.0 3500.0 female 2007
324 P-276 Chinstrap Dream 46.5 17.9 192.0 3500.0 female 2007

By default, the duplicated function only marks the second and subsequent instances of the same data as being duplicates. Setting keep=False marks the first instance as a duplicate as well.

We see there are two entries for a penguin with id P-276 in the data. Why might that be the case? It could be caused by a data entry/processing issue and be there by mistake, or be a genuine repeated measurement for this penguin, for example. It’s important to understand the context before taking any action.
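
Since Id is meant to be a unique identifier, it can also be worth checking for rows that share an Id even when the measurements differ, using the subset parameter of duplicated() (a sketch):

# rows whose Id appears more than once, regardless of the other columns
df[df.duplicated(subset="Id", keep=False)]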

In some cases it may be appropriate to delete the duplicate data. This can be done with the drop_duplicates method:

print("Length before removing duplicates:", len(df), "rows")
df = df.drop_duplicates()
print("Length after removing duplicates:", len(df), "rows")
Length before removing duplicates: 351 rows
Length after removing duplicates: 350 rows

Displaying Data Frames with Style đŸ˜ŽÂ¶

You can get fancy with how you display data frames by highlighting and formatting cells differently using a data frame’s style attribute. There are a few examples below; for more details see the Table Visualization page in the Pandas documentation.

Change the precision with which numbers are displayed:

df_top10 = df.head(10)  # just style the first 10 rows for demo purposes here

# round values to nearest integer (0 decimal places)
df_top10.style.format(precision=0)
  Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 48 15 215 5650 male 2007
1 P-306 Chinstrap Dream 41 17 187 3200 female 2008
2 P-247 Gentoo Biscoe 51 16 226 5200 male 2009
3 P-120 Adelie Torgersen 36 17 187 3150 female 2009
4 P-220 Gentoo Biscoe 44 14 220 4700 female 2008
5 P-150 Adelie Dream 36 17 187 3700 female 2009
6 P-348 Adelie Biscoe 36 18 193 285 female 2007
7 P-091 Adelie Dream 41 18 205 4300 male 2008
8 P-327 Chinstrap Dream 51 19 201 3950 male 2009
9 P-221 Gentoo Biscoe 51 15 223 5550 male 2008

Apply a colour gradient to each column based on each cell’s value:

df_top10.style.background_gradient()
  Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 47.800000 15.000000 215.000000 5650.000000 male 2007
1 P-306 Chinstrap Dream 40.900000 16.600000 187.000000 3200.000000 female 2008
2 P-247 Gentoo Biscoe 50.800000 15.700000 226.000000 5200.000000 male 2009
3 P-120 Adelie Torgersen 36.200000 17.200000 187.000000 3150.000000 female 2009
4 P-220 Gentoo Biscoe 43.500000 14.200000 220.000000 4700.000000 female 2008
5 P-150 Adelie Dream 36.000000 17.100000 187.000000 3700.000000 female 2009
6 P-348 Adelie Biscoe 36.400000 18.100000 193.000000 285.000000 female 2007
7 P-091 Adelie Dream 41.100000 18.100000 205.000000 4300.000000 male 2008
8 P-327 Chinstrap Dream 51.400000 19.000000 201.000000 3950.000000 male 2009
9 P-221 Gentoo Biscoe 50.700000 15.000000 223.000000 5550.000000 male 2008

Highlight the smallest value in each column:

df_top10.style.highlight_min()
  Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 47.800000 15.000000 215.000000 5650.000000 male 2007
1 P-306 Chinstrap Dream 40.900000 16.600000 187.000000 3200.000000 female 2008
2 P-247 Gentoo Biscoe 50.800000 15.700000 226.000000 5200.000000 male 2009
3 P-120 Adelie Torgersen 36.200000 17.200000 187.000000 3150.000000 female 2009
4 P-220 Gentoo Biscoe 43.500000 14.200000 220.000000 4700.000000 female 2008
5 P-150 Adelie Dream 36.000000 17.100000 187.000000 3700.000000 female 2009
6 P-348 Adelie Biscoe 36.400000 18.100000 193.000000 285.000000 female 2007
7 P-091 Adelie Dream 41.100000 18.100000 205.000000 4300.000000 male 2008
8 P-327 Chinstrap Dream 51.400000 19.000000 201.000000 3950.000000 male 2009
9 P-221 Gentoo Biscoe 50.700000 15.000000 223.000000 5550.000000 male 2008
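
Given this section’s focus on missing data, another option that may be useful (a sketch; the available styling options vary between Pandas versions) is highlighting null values:

# highlight cells containing missing values
df_top10.style.highlight_null()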