2.2.1 Data Consistency¶

In Section 2.1.4 we saw how to load data of different types into pandas. In this section we’ll look at the first steps you can take to explore the data, understand what it contains, and spot a few common data consistency issues, such as missing or mis-interpreted values.

Domain Knowledge¶

Before jumping into looking at and analysing the data you should check any information you’ve been given about it, so you know what to expect. In this section we’ll be using the Palmer penguins dataset, adapted from the version created by Allison Horst available here (doi:10.5281/zenodo.3960218).

We’ve made changes to the data to demonstrate the concepts we’re teaching, adding missing values and other common data issues. The (cleaner) original file is available here.

The dataset was originally collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network, and published in the PLOS ONE journal (doi:10.1371/journal.pone.0090081) in 2014.

It includes measurements of the bill size, flipper length and weight of three different species of penguin (AdĂ©lie, Chinstrap, Gentoo) on three different islands (Biscoe, Dream, Torgersen) in the Palmer Archipelago, Antarctica. The dataset homepage contains more information about the columns and data types we expect. To reiterate, it’s always important to check the documentation and associated literature first.

[Image: lter_penguins.png (illustrations of the three penguin species)]

Artwork by @allison_horst.

Having a First Look at the Data¶

The dataset is saved in data/penguins.csv and we can load it with pd.read_csv, as seen previously:

import pandas as pd
import numpy
df = pd.read_csv("data/penguins.csv")

Display the first ten rows of the data:

df.head(10)
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 47.8 15.0 215.0 5650.0 male 2007
1 P-306 Chinstrap Dream 40.9 16.6 187.0 3200.0 female 2008
2 P-247 Gentoo Biscoe 50.8 15.7 226.0 5200.0 male 2009
3 P-120 Adelie Torgersen 36.2 17.2 187.0 3150.0 female 2009
4 P-220 Gentoo Biscoe 43.5 14.2 220.0 4700.0 female 2008
5 P-150 Adelie Dream 36.0 17.1 187.0 3700.0 female 2009
6 P-348 Adelie Biscoe 36.4 18.1 193.0 285.0 female 2007
7 P-091 Adelie Dream 41.1 18.1 205.0 4300.0 male 2008
8 P-327 Chinstrap Dream 51.4 19.0 201.0 3950.0 male 2009
9 P-221 Gentoo Biscoe 50.7 15.0 223.0 5550.0 male 2008

We could also look at the last few rows of the data with df.tail(), or a random sample of rows with df.sample().
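
For example (a minimal sketch; the random_state argument is optional and only fixes the sample for reproducibility):

# last 5 rows of the data frame
df.tail()

# 5 randomly chosen rows (random_state makes the sample reproducible)
df.sample(5, random_state=42)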

To check the number of rows and columns we can use:

print(df.shape)
(351, 9)

Our data has 351 rows and 9 columns. It might also be useful to look at the column names (especially for larger datasets with many columns, where they may not all be displayed by df.head()):

print(df.columns)
Index(['Id', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')

A useful command that summarises much of this information is df.info():

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Id                 351 non-null    object 
 1   species            351 non-null    object 
 2   island             351 non-null    object 
 3   bill_length_mm     347 non-null    float64
 4   bill_depth_mm      349 non-null    object 
 5   flipper_length_mm  349 non-null    float64
 6   body_mass_g        349 non-null    float64
 7   sex                340 non-null    object 
 8   year               351 non-null    int64  
dtypes: float64(3), int64(1), object(5)
memory usage: 24.8+ KB

This gives us the number of rows (entries) and columns at the top, and then a table with the name, number of non-null values, and data type of each column. Finally, it gives the amount of memory the data frame is using. Pandas can use a lot of memory, which may cause problems when analysing large datasets. The Scaling to large datasets page in the Pandas documentation gives pointers for what you can try in that case.

Null Values¶

The data frame info shows we have 351 “non-null” values in the Id, species, island and year columns, but fewer in the other columns.

“Null values” is Pandas’ way of describing data that is missing. Under the hood, these are encoded as NumPy’s NaN (not a number) constant (see here), which has type float64 so numeric columns with NaN values still have a numeric type and can have numeric operations applied to them.
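
A minimal sketch of this behaviour, using a small throwaway series:

# NaN forces a float dtype, but numeric operations still work
s = pd.Series([1.0, numpy.nan, 3.0])
print(s.dtype)  # float64
print(s.sum())  # 4.0 -- NaN values are skipped by default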

To find missing values in a column we can use the isnull() function:

is_missing = df["bill_length_mm"].isnull()
print(is_missing)
0      False
1      False
2      False
3      False
4      False
       ...  
346    False
347    False
348    False
349    False
350    False
Name: bill_length_mm, Length: 351, dtype: bool

This returns a boolean series which is True if that row’s value is NaN, which can then be used to filter the data frame and show only the rows with missing data in the bill_length_mm column:

df[is_missing]
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
233 P-344 Chinstrap Dream NaN 19.2 197.0 4000.0 male 2008
286 P-003 Adelie Torgersen NaN NaN NaN NaN NaN 2007
307 P-271 Gentoo Biscoe NaN NaN NaN NaN NaN 2009
312 P-345 Adelie Torgersen NaN 18.0 193.0 43400.0 female 2009
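
To get a quick overview of missingness across the whole data frame, we can also chain isnull() with sum(), which counts the True values in each column (a small sketch):

# number of missing values in each column
print(df.isnull().sum())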

There are many reasons data could be missing, and how you choose to deal with missing values is an important part of any research project. We’ll revisit this later.

Unexpected Column Types¶

Looking at the first few rows of our data (the output of df.head() above), we expect the bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and year columns to have a numeric type. Comparing with the output of df.info() above, most of them do, having a dtype (data type) of either int64 or float64.

However, the bill_depth_mm column has a dtype of object, which usually means the column is being treated as strings/text data. This will generally be because there is at least one row in the column that Pandas was not able to parse as a number. Common reasons this might happen include:

  • Data entry errors and typos, for example “23/15” instead of “23.15”.

  • Encoding of missing values: The pd.read_csv() function checks for common string representations of missing values like “NA” or “NULL” and converts these to numpy.nan when loading the data. But many different conventions for missing data exist, such as more verbose representations like “UNKNOWN”, and Pandas will load these as strings by default. This can be customised with the na_values parameter of pd.read_csv() (see the sketch after this list).

  • Additional metadata incorrectly loaded into the data frame, such as CSV files with headers and footers (as seen in the Data Sources & Formats section previously).
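
As an example of the na_values parameter mentioned above, here is a minimal sketch telling Pandas to also treat the string “UNKNOWN” as missing when loading the file (the default markers like “NA” are still recognised unless keep_default_na is disabled; we don’t apply this to our df here):

# treat "UNKNOWN" as missing, in addition to the default markers like "NA"
df_custom = pd.read_csv("data/penguins.csv", na_values=["UNKNOWN"])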

To see what’s wrong with the bill_depth_mm column we can try to convert it to a numeric type with the pd.to_numeric function:

df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "14,2"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
    183 coerce_numeric = errors not in ("ignore", "raise")
    184 try:
--> 185     values, _ = lib.maybe_convert_numeric(
    186         values, set(), coerce_numeric=coerce_numeric
    187     )
    188 except (ValueError, TypeError):
    189     if errors == "raise":

File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2411, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "14,2" at position 142

The error above tells us Pandas has encountered a value “14,2”, which it doesn’t know how to convert into a number. It also says the problem is at position 142 (which, with our default RangeIndex, is also the row label), so we can access the value directly to check:

df.loc[142, "bill_depth_mm"]
'14,2'

In this case it looks like a typo: the person entering the data probably meant to write 14.2. But we should check this first; there may be information in the data documentation, or you could ask the data provider.

Let’s say we’re confident it is a typo. We can fix it ourselves and then convert the column to a numeric type:

# set the incorrectly typed number to its intended value
df.loc[142, "bill_depth_mm"] = 14.2
# convert the column to a numeric type
df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Id                 351 non-null    object 
 1   species            351 non-null    object 
 2   island             351 non-null    object 
 3   bill_length_mm     347 non-null    float64
 4   bill_depth_mm      349 non-null    float64
 5   flipper_length_mm  349 non-null    float64
 6   body_mass_g        349 non-null    float64
 7   sex                340 non-null    object 
 8   year               351 non-null    int64  
dtypes: float64(4), int64(1), object(4)
memory usage: 24.8+ KB

The bill_depth_mm column now has dtype float64, as we originally expected.

This was a simple example with just one strange value; we’ll see more approaches for handling and sanitising strings later.
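
As an aside, if a column contained many unparseable values, fixing them one by one might not be practical. Passing errors="coerce" to pd.to_numeric converts anything it can’t parse into NaN instead of raising an error (a sketch; whether silently converting such values to missing is appropriate depends on your data):

# convert the column, turning unparseable strings into NaN instead of raising
df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"], errors="coerce")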

Sanity Checking Values¶

Numeric Columns¶

The pandas describe() function gives summary statistics for the numeric columns in our data (the mean, standard deviation, minimum and maximum value, and quartiles for each column):

df.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
count 347.000000 349.000000 349.000000 349.000000 351.000000
mean 43.923055 17.152722 200.088825 4305.329513 2008.022792
std 5.491795 1.967049 21.320100 2256.300048 0.820832
min 32.100000 13.100000 -99.000000 285.000000 2007.000000
25% 39.200000 15.600000 190.000000 3550.000000 2007.000000
50% 44.500000 17.300000 197.000000 4050.000000 2008.000000
75% 48.500000 18.700000 213.000000 4775.000000 2009.000000
max 59.600000 21.500000 231.000000 43400.000000 2009.000000

Even though bill_length_mm, and several of the other columns, have missing (NaN) values, Pandas is still able to compute statistics for them: when calculating these, Pandas ignores all NaN values by default. To change this behaviour, some functions have a skipna argument; for example, df["bill_length_mm"].mean(skipna=False) will return NaN if there are any NaN values in the column.

You should think carefully about which approach is more suitable for your data (for example, if a column only has a few non-null values, will the mean be representative?).
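
A minimal sketch contrasting the two behaviours on our data:

# NaN values ignored (the default): the mean of the 347 non-null values
print(df["bill_length_mm"].mean())

# returns NaN, because the column contains missing values
print(df["bill_length_mm"].mean(skipna=False))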

Looking at these values gives us a better idea of what our data contains, but also allows us to perform some sanity checks. For example, do the minimum and maximum values in each column make sense given what we know about the dataset?

There are two things that might stand out. First, the flipper_length_mm column has a minimum value of -99, but all the other values in the data are positive, as we’d expect for measurements of lengths, widths and weights. In some datasets missing data is represented with negative sentinel values like -99 (but this isn’t always the case, so, as always, make sure to check what such values mean in any data you’re using).

If we’re sure -99 is meant to be a missing value, we can replace those with numpy.nan so Pandas will treat them correctly:

df = df.replace(-99, numpy.nan)

With these values replaced, the “actual” minimum value of flipper_length_mm is 172 mm:

df["flipper_length_mm"].min()
172.0
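
Note that df.replace(-99, numpy.nan) above replaces -99 everywhere in the data frame. If -99 could be a legitimate value in some other column, a safer alternative (a sketch) is to target only the affected column:

# replace -99 only in flipper_length_mm, leaving other columns untouched
df["flipper_length_mm"] = df["flipper_length_mm"].replace(-99, numpy.nan)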

The second thing that may stand out is the minimum value of 285 grams in body_mass_g, which looks far smaller than the other values (e.g., the lower quartile (25th percentile) of body_mass_g is 3550g). Excluding the 285g value, the next lightest penguin weighs 2700g:

# True for each row with body_mass_g greater than the minimum value of 285g
heavier_penguins = df["body_mass_g"] > df["body_mass_g"].min()

# lightest penguin weight out of all rows with weights above 285g
df.loc[heavier_penguins, "body_mass_g"].min()
2700.0

Another way to see this would be to sort the data frame by body mass using the sort_values function:

df.sort_values(by="body_mass_g").head(10)
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
6 P-348 Adelie Biscoe 36.4 18.1 193.0 285.0 female 2007
236 P-314 Chinstrap Dream 46.9 16.6 192.0 2700.0 female 2008
248 P-058 Adelie Biscoe 36.5 16.6 181.0 2850.0 female 2008
311 P-064 Adelie Biscoe 36.4 17.1 184.0 2850.0 female 2008
349 P-098 Adelie Dream 33.1 16.1 178.0 2900.0 female 2008
227 P-298 Chinstrap Dream 43.2 16.6 187.0 2900.0 female 2007
270 P-116 Adelie Torgersen 38.6 17.0 188.0 2900.0 female 2009
17 P-054 Adelie Biscoe 34.5 18.1 187.0 2900.0 female 2008
137 P-104 Adelie Biscoe 37.9 18.6 193.0 2925.0 female 2009
337 P-047 Adelie Dream 37.5 18.9 179.0 2975.0 NaN 2007

By default sort_values sorts from smallest to largest; you can change that by setting ascending=False.

Again, we see the second smallest value in the column is only 2700g. This could be another data entry error (perhaps the weight was meant to be 2850g rather than 285g), or perhaps that penguin is a chick and the rest are adults. This type of issue is much more nuanced and difficult to spot in real-world scenarios. Visualising the data (and the distributions in the data) can be very helpful here, which is the focus of the next module.
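
As a quick preview of that (a sketch, assuming matplotlib is installed, which Pandas uses for plotting), a histogram makes the 285g outlier easy to spot:

import matplotlib.pyplot as plt

# histogram of penguin body mass; the 285g value sits far from the rest
df["body_mass_g"].plot(kind="hist", bins=30)
plt.xlabel("body_mass_g")
plt.show()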

Text and Categorical Columns¶

Note that the species, island, and sex columns do not appear when we use describe() above as they contain text. For both text and numeric columns, it can be helpful to know the number of unique values in each column:

df.nunique()
Id                   350
species                4
island                 3
bill_length_mm       164
bill_depth_mm         80
flipper_length_mm     55
body_mass_g           96
sex                    2
year                   3
dtype: int64

The measurement and Id columns have many unique values, whereas columns like island have only a few different unique values (categories). Looking closely, the species column has four different values, but from the dataset documentation we only expect there to be three penguin species.

The value_counts() function, applied to the species column, shows the number of occurrences of the different values in that column:

df["species"].value_counts()
Adelie       155
Gentoo       125
Chinstrap     70
UNKNOWN        1
Name: species, dtype: int64

The “AdĂ©lie”, “Chinstrap”, and “Gentoo” species described in the documentation all appear, but there’s also an “UNKNOWN” entry. This looks like it should have been treated as a missing value. To make Pandas correctly treat it as missing we can replace it with numpy.nan using the replace method:

df["species"] = df["species"].replace("UNKNOWN", numpy.nan)
df["species"].value_counts()
Adelie       155
Gentoo       125
Chinstrap     70
Name: species, dtype: int64

By default, value_counts() does not display the number of missing values in the column. To show that, you can use df["species"].value_counts(dropna=False) instead. You can also try df["species"].value_counts(normalize=True) to show the fraction of data with each value, rather than the count.
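
A minimal sketch of both options:

# include missing values in the counts
df["species"].value_counts(dropna=False)

# show fractions instead of counts
df["species"].value_counts(normalize=True)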

We’ll look at more approaches for manipulating strings and categories in Sections 2.2.4.2 and 2.2.4.3 of this module.

Finally, it may be interesting to look at how the measurements vary between the species. We can do that with the Pandas groupby function:

df.groupby("species").mean()
/tmp/ipykernel_2269/2880954085.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  df.groupby("species").mean()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
species
Adelie 38.757516 18.335714 189.993464 3934.805195 2008.006452
Chinstrap 48.800000 18.424286 195.785714 3733.571429 2007.957143
Gentoo 47.486290 14.975806 217.241935 5080.241935 2008.072000

df.groupby("species") splits the date frame into sub-groups with the same value in the “species” column. We then must specify a function we want to use to summarize the members of each group, in this case the mean. It looks like, on average, “Chinstrap” penguins have the largest bills, but “Gentoo” penguins have the largest flippers and body mass. For more information about using groupby see here.

Duplicate Data¶

In the output of df.nunique() above we see the Id column has 350 unique values, one fewer than the 351 rows in the dataset. We expect Id to be a unique identifier, so we’d expect 351 unique values (one for each row). What’s going on?

One explanation could be that there are duplicate rows in the data. The duplicated method of a data frame returns True for any rows that appear multiple times in the data (with an exact copy of all values):

df.duplicated()
0      False
1      False
2      False
3      False
4      False
       ...  
346    False
347    False
348    False
349    False
350    False
Length: 351, dtype: bool

We can use this to filter the data frame and show only the duplicated rows:

df[df.duplicated(keep=False)]
Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
36 P-276 Chinstrap Dream 46.5 17.9 192.0 3500.0 female 2007
324 P-276 Chinstrap Dream 46.5 17.9 192.0 3500.0 female 2007

By default, the duplicated function only marks the second and subsequent instances of the same data as being duplicates. Setting keep=False marks the first instance as a duplicate as well.

We see there are two entries for a penguin with id P-276 in the data. Why might that be the case? It could be caused by a data entry/processing issue and be there by mistake, or be a genuine repeated measurement for this penguin, for example. It’s important to understand the context before taking any action.
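
Since Id is meant to be a unique identifier, it can also be worth checking for rows that share an Id even when the measurements differ, using the subset parameter of duplicated() (a sketch):

# rows whose Id appears more than once, regardless of the other columns
df[df.duplicated(subset="Id", keep=False)]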

In some cases it may be appropriate to delete the duplicate data. This can be done with the drop_duplicates method:

print("Length before removing duplicates:", len(df), "rows")
df = df.drop_duplicates()
print("Length after removing duplicates:", len(df), "rows")
Length before removing duplicates: 351 rows
Length after removing duplicates: 350 rows

Displaying Data Frames with Style đŸ˜ŽÂ¶

You can get fancy with how you display data frames by highlighting and formatting cells differently using a data frame’s style attribute. There are a few examples below; for more details see the Table Visualization page in the Pandas documentation.

Change the precision with which numbers are displayed:

df_top10 = df.head(10)  # just style the first 10 rows for demo purposes here

# round values to nearest integer (0 decimal places)
df_top10.style.format(precision=0)
  Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 48 15 215 5650 male 2007
1 P-306 Chinstrap Dream 41 17 187 3200 female 2008
2 P-247 Gentoo Biscoe 51 16 226 5200 male 2009
3 P-120 Adelie Torgersen 36 17 187 3150 female 2009
4 P-220 Gentoo Biscoe 44 14 220 4700 female 2008
5 P-150 Adelie Dream 36 17 187 3700 female 2009
6 P-348 Adelie Biscoe 36 18 193 285 female 2007
7 P-091 Adelie Dream 41 18 205 4300 male 2008
8 P-327 Chinstrap Dream 51 19 201 3950 male 2009
9 P-221 Gentoo Biscoe 51 15 223 5550 male 2008

Apply a colour gradient to each column based on each cell’s value:

df_top10.style.background_gradient()
  Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 47.800000 15.000000 215.000000 5650.000000 male 2007
1 P-306 Chinstrap Dream 40.900000 16.600000 187.000000 3200.000000 female 2008
2 P-247 Gentoo Biscoe 50.800000 15.700000 226.000000 5200.000000 male 2009
3 P-120 Adelie Torgersen 36.200000 17.200000 187.000000 3150.000000 female 2009
4 P-220 Gentoo Biscoe 43.500000 14.200000 220.000000 4700.000000 female 2008
5 P-150 Adelie Dream 36.000000 17.100000 187.000000 3700.000000 female 2009
6 P-348 Adelie Biscoe 36.400000 18.100000 193.000000 285.000000 female 2007
7 P-091 Adelie Dream 41.100000 18.100000 205.000000 4300.000000 male 2008
8 P-327 Chinstrap Dream 51.400000 19.000000 201.000000 3950.000000 male 2009
9 P-221 Gentoo Biscoe 50.700000 15.000000 223.000000 5550.000000 male 2008

Highlight the smallest value in each column:

df_top10.style.highlight_min()
  Id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 P-179 Gentoo Biscoe 47.800000 15.000000 215.000000 5650.000000 male 2007
1 P-306 Chinstrap Dream 40.900000 16.600000 187.000000 3200.000000 female 2008
2 P-247 Gentoo Biscoe 50.800000 15.700000 226.000000 5200.000000 male 2009
3 P-120 Adelie Torgersen 36.200000 17.200000 187.000000 3150.000000 female 2009
4 P-220 Gentoo Biscoe 43.500000 14.200000 220.000000 4700.000000 female 2008
5 P-150 Adelie Dream 36.000000 17.100000 187.000000 3700.000000 female 2009
6 P-348 Adelie Biscoe 36.400000 18.100000 193.000000 285.000000 female 2007
7 P-091 Adelie Dream 41.100000 18.100000 205.000000 4300.000000 male 2008
8 P-327 Chinstrap Dream 51.400000 19.000000 201.000000 3950.000000 male 2009
9 P-221 Gentoo Biscoe 50.700000 15.000000 223.000000 5550.000000 male 2008
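
Given this section’s focus on missing data, another option that may be useful (a sketch; the available styling options vary between Pandas versions) is highlighting null values:

# highlight cells containing missing values
df_top10.style.highlight_null()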