2.2.1 Data Consistency
In Section 2.1.4 we saw how to load data of different types into pandas. In this section we'll take a first look at the data to understand what it contains, and cover a few common data consistency issues that may come up, like missing or mis-interpreted values.
Domain Knowledge
Before jumping into looking at and analysing the data you should check any information you've been given about it, so you know what to expect. In this section we'll be using the Palmer penguins dataset, adapted from the version created by Allison Horst available here (doi:10.5281/zenodo.3960218).
We've made changes to the data to demonstrate the concepts we're teaching, adding missing values and other common data issues. The (cleaner) original file is available here.
The dataset was originally collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network, and published in the PLOS ONE journal (doi:10.1371/journal.pone.0090081) in 2014.
It includes measurements of the bill size, flipper length and weight of three different species of penguin (Adélie, Chinstrap, Gentoo) on three different islands (Biscoe, Dream, Torgersen) in the Palmer Archipelago, Antarctica. The dataset homepage contains more information about the columns and data types we expect. To reiterate, it's always important to check the documentation and associated literature first.
Artwork by @allison_horst.
Having a First Look at the Data
The dataset is saved in data/penguins.csv and we can load it with pd.read_csv, as seen previously:
import pandas as pd
import numpy
df = pd.read_csv("data/penguins.csv")
Display the first ten rows of the data:
df.head(10)
  | Id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
0 | P-179 | Gentoo | Biscoe | 47.8 | 15.0 | 215.0 | 5650.0 | male | 2007 |
1 | P-306 | Chinstrap | Dream | 40.9 | 16.6 | 187.0 | 3200.0 | female | 2008 |
2 | P-247 | Gentoo | Biscoe | 50.8 | 15.7 | 226.0 | 5200.0 | male | 2009 |
3 | P-120 | Adelie | Torgersen | 36.2 | 17.2 | 187.0 | 3150.0 | female | 2009 |
4 | P-220 | Gentoo | Biscoe | 43.5 | 14.2 | 220.0 | 4700.0 | female | 2008 |
5 | P-150 | Adelie | Dream | 36.0 | 17.1 | 187.0 | 3700.0 | female | 2009 |
6 | P-348 | Adelie | Biscoe | 36.4 | 18.1 | 193.0 | 285.0 | female | 2007 |
7 | P-091 | Adelie | Dream | 41.1 | 18.1 | 205.0 | 4300.0 | male | 2008 |
8 | P-327 | Chinstrap | Dream | 51.4 | 19.0 | 201.0 | 3950.0 | male | 2009 |
9 | P-221 | Gentoo | Biscoe | 50.7 | 15.0 | 223.0 | 5550.0 | male | 2008 |
We could also look at the last few rows of the data with df.tail(), or a random sample of rows with df.sample().
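For example (the sample sizes here are arbitrary):
df.tail()                      # the last 5 rows
df.sample(5)                   # 5 rows chosen at random
df.sample(5, random_state=0)   # pass random_state for a reproducible sample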
To check the number of rows and columns we can use:
print(df.shape)
(351, 9)
Our data has 351 rows and 9 columns. It might also be useful to look at the column names (especially for larger datasets with many columns, where they may not all be displayed by df.head()):
print(df.columns)
Index(['Id', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
dtype='object')
A useful command that summarises much of this information is df.info():
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 351 non-null object
1 species 351 non-null object
2 island 351 non-null object
3 bill_length_mm 347 non-null float64
4 bill_depth_mm 349 non-null object
5 flipper_length_mm 349 non-null float64
6 body_mass_g 349 non-null float64
7 sex 340 non-null object
8 year 351 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 24.8+ KB
This gives us the number of rows (entries) and columns at the top, and then a table with the name, number of non-null values, and data type of each column. Finally, it gives the amount of memory the data frame is using. Pandas can use a lot of memory, which may cause problems when analysing large datasets. The Scaling to large datasets page in the Pandas documentation gives pointers for what you can try in that case.
Null Values
The data frame info shows we have 351 "non-null" values in the Id, species, island and year columns, but fewer in the other columns.
"Null values" is Pandas' way of describing data that is missing. Under the hood, these are encoded as NumPy's NaN (not a number) constant (see here), which has type float64, so numeric columns with NaN values still have a numeric type and can have numeric operations applied to them.
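We can check this directly (a minimal illustration):
# NaN is a float value, so columns containing it stay numeric
print(type(numpy.nan))   # <class 'float'>
print(numpy.nan + 1.0)   # nan: arithmetic propagates missing values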
To find missing values in a column we can use the isnull() function:
is_missing = df["bill_length_mm"].isnull()
print(is_missing)
0 False
1 False
2 False
3 False
4 False
...
346 False
347 False
348 False
349 False
350 False
Name: bill_length_mm, Length: 351, dtype: bool
This returns a boolean series which is True wherever a row's value is NaN. It can then be used to filter the data frame and show only the rows with missing data in the bill_length_mm column:
df[is_missing]
  | Id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
233 | P-344 | Chinstrap | Dream | NaN | 19.2 | 197.0 | 4000.0 | male | 2008 |
286 | P-003 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
307 | P-271 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | 2009 |
312 | P-345 | Adelie | Torgersen | NaN | 18.0 | 193.0 | 43400.0 | female | 2009 |
There are many reasons data could be missing and how you choose to deal with them is an important part of any research project. We'll revisit this later.
Unexpected Column Types
Looking at the first few rows of our data (the output of df.head() above), we expect the bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and year columns to have a numeric type. Comparing with the output of df.info() above, most of them do, having a dtype (data type) of either int64 or float64.
However, the bill_depth_mm column has a dtype of object, which usually means the column is being treated as strings/text data. This will generally be because there is at least one row in the column that Pandas was not able to parse as a number. Common reasons this might happen include:
- Data entry errors and typos, for example "23/15" instead of "23.15".
- Encoding of missing values: the pd.read_csv() function checks for common string representations of missing values like "NA" or "NULL" and converts these to numpy.nan when loading the data. But many different conventions for missing data exist, such as more verbose representations like "UNKNOWN", and Pandas will load these as strings by default. This can be customised with the na_values parameter of pd.read_csv() (see the sketch after this list).
- Additional metadata incorrectly loaded into the data frame, such as CSV files with headers and footers (as seen in the Data Sources & Formats section previously).
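For example, if we had known in advance that this file used "UNKNOWN" to mark missing values, we could have handled it at load time (a sketch; by default na_values adds to Pandas' built-in missing-value markers rather than replacing them):
# treat "UNKNOWN" as missing, in addition to default markers like "NA"
df = pd.read_csv("data/penguins.csv", na_values=["UNKNOWN"])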
To see what's wrong with the bill_depth_mm column we can try to convert it to a numeric type with the pd.to_numeric function:
df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2369, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "14,2"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
183 coerce_numeric = errors not in ("ignore", "raise")
184 try:
--> 185 values, _ = lib.maybe_convert_numeric(
186 values, set(), coerce_numeric=coerce_numeric
187 )
188 except (ValueError, TypeError):
189 if errors == "raise":
File /opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2411, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "14,2" at position 142
The error above tells us Pandas has encountered a value "14,2", which it doesn't know how to convert into a number. It also says the problem is at position 142; since our data frame has the default RangeIndex this matches the row label, which we can access ourselves to check the value directly:
df.loc[142, "bill_depth_mm"]
'14,2'
In this case it looks like a typo: the person entering the data probably meant to write 14.2, but we should check this first. There may be information in the data documentation, or you could ask the data provider.
Let's say we're confident it is a typo. We can fix it ourselves and then convert the column to a numeric type:
# set the incorrectly typed number to its intended value
df.loc[142, "bill_depth_mm"] = 14.2
# convert the column to a numeric type
df["bill_depth_mm"] = pd.to_numeric(df["bill_depth_mm"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 351 non-null object
1 species 351 non-null object
2 island 351 non-null object
3 bill_length_mm 347 non-null float64
4 bill_depth_mm 349 non-null float64
5 flipper_length_mm 349 non-null float64
6 body_mass_g 349 non-null float64
7 sex 340 non-null object
8 year 351 non-null int64
dtypes: float64(4), int64(1), object(4)
memory usage: 24.8+ KB
The bill_depth_mm column now has type float64, as we originally expected.
This was a simple example with just one strange value; we'll see more approaches for handling and sanitising strings later.
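If a column contains several unparseable values, fixing them one at a time gets tedious. A common pattern (a sketch, applied to the column while it still holds strings) is pd.to_numeric with errors="coerce", which converts anything unparseable to NaN, and then comparing before and after to list the offending entries:
# unparseable entries become NaN instead of raising an error
coerced = pd.to_numeric(df["bill_depth_mm"], errors="coerce")
# values that were present before but NaN after coercion failed to parse
failed_to_parse = df["bill_depth_mm"].notnull() & coerced.isnull()
print(df.loc[failed_to_parse, "bill_depth_mm"])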
Sanity Checking Values
Numeric Columns
The pandas describe() function gives summary statistics for the numeric columns in our data (the mean, standard deviation, minimum and maximum value, and quartiles for each column):
df.describe()
  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
---|---|---|---|---|---|
count | 347.000000 | 349.000000 | 349.000000 | 349.000000 | 351.000000 |
mean | 43.923055 | 17.152722 | 200.088825 | 4305.329513 | 2008.022792 |
std | 5.491795 | 1.967049 | 21.320100 | 2256.300048 | 0.820832 |
min | 32.100000 | 13.100000 | -99.000000 | 285.000000 | 2007.000000 |
25% | 39.200000 | 15.600000 | 190.000000 | 3550.000000 | 2007.000000 |
50% | 44.500000 | 17.300000 | 197.000000 | 4050.000000 | 2008.000000 |
75% | 48.500000 | 18.700000 | 213.000000 | 4775.000000 | 2009.000000 |
max | 59.600000 | 21.500000 | 231.000000 | 43400.000000 | 2009.000000 |
Even though bill_length_mm, and several of the other columns, have missing (NaN) values, Pandas is able to compute statistics for them: it ignores all NaN values by default. To change this behaviour, some functions have a skipna argument; for example, df["bill_length_mm"].mean(skipna=False) will return NaN if there are any NaN values in the column.
You should think carefully about which approach is more suitable for your data (for example, if a column only has a few non-null values, will the mean be representative?).
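To make the difference concrete:
# default skipna=True: NaN values are ignored in the calculation
print(df["bill_length_mm"].mean())
# skipna=False: a single NaN makes the whole result NaN
print(df["bill_length_mm"].mean(skipna=False))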
Looking at these values gives us a better idea of what our data contains, but also allows us to perform some sanity checks. For example, do the minimum and maximum values in each column make sense given what we know about the dataset?
There are two things that might stand out. First, the flipper_length_mm column has a minimum value of -99, but all the other values in the data are positive, as we'd expect for measurements of lengths, widths and weights. In some datasets missing data is represented with negative values (but this may not always be the case so, as always, make sure to check what they mean in any data you're using).
If we're sure -99 is meant to be a missing value, we can replace those with numpy.nan so Pandas will treat them correctly:
df = df.replace(-99, numpy.nan)
With these values replaced, the "actual" minimum value of flipper_length_mm is 172 mm:
df["flipper_length_mm"].min()
172.0
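Note that df.replace(-99, numpy.nan) replaces -99 wherever it appears in the whole data frame. If the sentinel value is only documented for one column, it may be safer to target that column alone (a sketch):
# replace the missing-value code only in the column where we know it is used
df["flipper_length_mm"] = df["flipper_length_mm"].replace(-99, numpy.nan)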
The second thing that may stand out is the minimum value of 285 grams in body_mass_g, which looks far smaller than the other values (e.g., the 25% quartile of body_mass_g is only 3550 g). Excluding the 285 g value, the next lightest penguin weighs 2700 g:
# True for each row with body_mass_g greater than the minimum value of 285g
heavier_than_min = df["body_mass_g"] > df["body_mass_g"].min()
# lowest penguin weight out of all rows with weights above 285g
df.loc[heavier_than_min, "body_mass_g"].min()
2700.0
Another way to see this would be to sort the data frame by body mass using the sort_values function:
df.sort_values(by="body_mass_g").head(10)
  | Id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
6 | P-348 | Adelie | Biscoe | 36.4 | 18.1 | 193.0 | 285.0 | female | 2007 |
236 | P-314 | Chinstrap | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | female | 2008 |
248 | P-058 | Adelie | Biscoe | 36.5 | 16.6 | 181.0 | 2850.0 | female | 2008 |
311 | P-064 | Adelie | Biscoe | 36.4 | 17.1 | 184.0 | 2850.0 | female | 2008 |
349 | P-098 | Adelie | Dream | 33.1 | 16.1 | 178.0 | 2900.0 | female | 2008 |
227 | P-298 | Chinstrap | Dream | 43.2 | 16.6 | 187.0 | 2900.0 | female | 2007 |
270 | P-116 | Adelie | Torgersen | 38.6 | 17.0 | 188.0 | 2900.0 | female | 2009 |
17 | P-054 | Adelie | Biscoe | 34.5 | 18.1 | 187.0 | 2900.0 | female | 2008 |
137 | P-104 | Adelie | Biscoe | 37.9 | 18.6 | 193.0 | 2925.0 | female | 2009 |
337 | P-047 | Adelie | Dream | 37.5 | 18.9 | 179.0 | 2975.0 | NaN | 2007 |
By default sort_values sorts from smallest to largest; you can change that by setting ascending=False.
Again, we see the second smallest value in the column is only 2700 g. This could be another data entry error (perhaps the weight was meant to be 2850 g rather than 285 g), or perhaps that penguin is a chick and the rest are adults. This type of issue is much more nuanced and difficult to spot in real-world scenarios. Visualising the data (and the distributions in the data) can be very helpful here, which is the focus of the next module.
Text and Categorical Columns
Note that the species, island, and sex columns do not appear when we use describe() above, as they contain text. For both text and numeric columns, it can be helpful to know the number of unique values in each column:
df.nunique()
Id 350
species 4
island 3
bill_length_mm 164
bill_depth_mm 80
flipper_length_mm 55
body_mass_g 96
sex 2
year 3
dtype: int64
The measurement and Id columns have many unique values, whereas columns like island have only a few different unique values (categories). Looking closely, the species column has four different values, but from the dataset documentation we only expect there to be three penguin species.
The value_counts() function, applied to the species column, shows the number of occurrences of the different values in that column:
df["species"].value_counts()
Adelie 155
Gentoo 125
Chinstrap 70
UNKNOWN 1
Name: species, dtype: int64
The "Adélie", "Chinstrap", and "Gentoo" species described in the documentation all appear, but there's also an "UNKNOWN" entry. This looks like it should have been treated as a missing value. To make Pandas correctly treat it as missing we can replace it with numpy.nan using the replace method:
df["species"] = df["species"].replace("UNKNOWN", numpy.nan)
df["species"].value_counts()
Adelie 155
Gentoo 125
Chinstrap 70
Name: species, dtype: int64
By default, value_counts will not display the number of missing values in the column. To show them, you can use df["species"].value_counts(dropna=False) instead. You can also try df["species"].value_counts(normalize=True) to show the fraction of data with each value, rather than the count.
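For instance:
# include NaN in the counts, and show fractions instead of absolute counts
print(df["species"].value_counts(dropna=False))
print(df["species"].value_counts(normalize=True))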
We'll look at more approaches for manipulating strings and categories in Sections 2.2.4.2 and 2.2.4.3 of this module.
Finally, it may be interesting to look at how the measurements vary between the species. We can do that with the Pandas groupby function:
df.groupby("species").mean()
/tmp/ipykernel_2269/2880954085.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
df.groupby("species").mean()
  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
---|---|---|---|---|---|
species | | | | | |
Adelie | 38.757516 | 18.335714 | 189.993464 | 3934.805195 | 2008.006452 |
Chinstrap | 48.800000 | 18.424286 | 195.785714 | 3733.571429 | 2007.957143 |
Gentoo | 47.486290 | 14.975806 | 217.241935 | 5080.241935 | 2008.072000 |
df.groupby("species")
splits the date frame into sub-groups with the same value in the âspeciesâ column. We then must specify a function we want to use to summarize the members of each group, in this case the mean. It looks like, on average, âChinstrapâ penguins have the largest bills, but âGentooâ penguins have the largest flippers and body mass. For more information about using groupby
see here.
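The FutureWarning in the output above is about non-numeric columns being silently dropped before aggregating. With recent versions of pandas you can be explicit instead (a sketch, assuming pandas 1.5 or later):
# aggregate only the numeric columns, avoiding the warning
df.groupby("species").mean(numeric_only=True)
# or select the columns of interest before aggregating
df.groupby("species")[["bill_length_mm", "body_mass_g"]].mean()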
Duplicate Data
In the output of df.nunique() above we see the Id column has 350 unique values, one fewer than the 351 rows in the dataset. We expect Id to be a unique identifier, so to have 351 unique values (one for each row). What's going on?
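One way to see which identifier repeats (a quick check):
# count occurrences of each Id and keep those appearing more than once
id_counts = df["Id"].value_counts()
print(id_counts[id_counts > 1])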
One explanation could be that there are duplicate rows in the data. The duplicated method of a data frame returns True for any rows that appear multiple times in the data (with an exact copy of all values):
df.duplicated()
0 False
1 False
2 False
3 False
4 False
...
346 False
347 False
348 False
349 False
350 False
Length: 351, dtype: bool
We can use this to filter the data frame and show only the duplicated rows:
df[df.duplicated(keep=False)]
  | Id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
36 | P-276 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | female | 2007 |
324 | P-276 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | female | 2007 |
By default, the duplicated function only marks the second and subsequent instances of the same data as being duplicates. Setting keep=False marks the first instance as a duplicate as well.
We see there are two entries for a penguin with id P-276 in the data. Why might that be the case? It could be caused by a data entry/processing issue and be there by mistake, or be a genuine repeated measurement for this penguin, for example. It's important to understand the context before taking any action.
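duplicated also accepts a subset parameter, which is useful when rows share an Id but differ in other columns (for example, a re-entered measurement with a corrected value); a sketch:
# flag rows that share an Id, even if the other columns differ
df[df.duplicated(subset="Id", keep=False)]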
In some cases it may be appropriate to delete the duplicate data. This can be done with the drop_duplicates method:
print("Length before removing duplicates:", len(df), "rows")
df = df.drop_duplicates()
print("Length after removing duplicates:", len(df), "rows")
Length before removing duplicates: 351 rows
Length after removing duplicates: 350 rows
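Like duplicated, drop_duplicates accepts keep and subset parameters. For example (illustrative only; whether this is appropriate depends on the context discussed above):
# keep the last occurrence of each Id instead of the first
df_unique_ids = df.drop_duplicates(subset="Id", keep="last")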
Displaying Data Frames with Style
You can get fancy with how you display data frames by highlighting and formatting cells differently using the data frame's style attribute. There are a few examples below; for more details see the Table Visualization page in the Pandas documentation.
Change the precision with which numbers are displayed:
df_top10 = df.head(10) # just style the first 10 rows for demo purposes here
# round values to nearest integer (0 decimal places)
df_top10.style.format(precision=0)
  | Id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
0 | P-179 | Gentoo | Biscoe | 48 | 15 | 215 | 5650 | male | 2007 |
1 | P-306 | Chinstrap | Dream | 41 | 17 | 187 | 3200 | female | 2008 |
2 | P-247 | Gentoo | Biscoe | 51 | 16 | 226 | 5200 | male | 2009 |
3 | P-120 | Adelie | Torgersen | 36 | 17 | 187 | 3150 | female | 2009 |
4 | P-220 | Gentoo | Biscoe | 44 | 14 | 220 | 4700 | female | 2008 |
5 | P-150 | Adelie | Dream | 36 | 17 | 187 | 3700 | female | 2009 |
6 | P-348 | Adelie | Biscoe | 36 | 18 | 193 | 285 | female | 2007 |
7 | P-091 | Adelie | Dream | 41 | 18 | 205 | 4300 | male | 2008 |
8 | P-327 | Chinstrap | Dream | 51 | 19 | 201 | 3950 | male | 2009 |
9 | P-221 | Gentoo | Biscoe | 51 | 15 | 223 | 5550 | male | 2008 |
Apply a colour gradient to each column based on each cellâs value:
df_top10.style.background_gradient()
  | Id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
0 | P-179 | Gentoo | Biscoe | 47.800000 | 15.000000 | 215.000000 | 5650.000000 | male | 2007 |
1 | P-306 | Chinstrap | Dream | 40.900000 | 16.600000 | 187.000000 | 3200.000000 | female | 2008 |
2 | P-247 | Gentoo | Biscoe | 50.800000 | 15.700000 | 226.000000 | 5200.000000 | male | 2009 |
3 | P-120 | Adelie | Torgersen | 36.200000 | 17.200000 | 187.000000 | 3150.000000 | female | 2009 |
4 | P-220 | Gentoo | Biscoe | 43.500000 | 14.200000 | 220.000000 | 4700.000000 | female | 2008 |
5 | P-150 | Adelie | Dream | 36.000000 | 17.100000 | 187.000000 | 3700.000000 | female | 2009 |
6 | P-348 | Adelie | Biscoe | 36.400000 | 18.100000 | 193.000000 | 285.000000 | female | 2007 |
7 | P-091 | Adelie | Dream | 41.100000 | 18.100000 | 205.000000 | 4300.000000 | male | 2008 |
8 | P-327 | Chinstrap | Dream | 51.400000 | 19.000000 | 201.000000 | 3950.000000 | male | 2009 |
9 | P-221 | Gentoo | Biscoe | 50.700000 | 15.000000 | 223.000000 | 5550.000000 | male | 2008 |
Highlight the smallest value in each column:
df_top10.style.highlight_min()
  | Id | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|---|---|
0 | P-179 | Gentoo | Biscoe | 47.800000 | 15.000000 | 215.000000 | 5650.000000 | male | 2007 |
1 | P-306 | Chinstrap | Dream | 40.900000 | 16.600000 | 187.000000 | 3200.000000 | female | 2008 |
2 | P-247 | Gentoo | Biscoe | 50.800000 | 15.700000 | 226.000000 | 5200.000000 | male | 2009 |
3 | P-120 | Adelie | Torgersen | 36.200000 | 17.200000 | 187.000000 | 3150.000000 | female | 2009 |
4 | P-220 | Gentoo | Biscoe | 43.500000 | 14.200000 | 220.000000 | 4700.000000 | female | 2008 |
5 | P-150 | Adelie | Dream | 36.000000 | 17.100000 | 187.000000 | 3700.000000 | female | 2009 |
6 | P-348 | Adelie | Biscoe | 36.400000 | 18.100000 | 193.000000 | 285.000000 | female | 2007 |
7 | P-091 | Adelie | Dream | 41.100000 | 18.100000 | 205.000000 | 4300.000000 | male | 2008 |
8 | P-327 | Chinstrap | Dream | 51.400000 | 19.000000 | 201.000000 | 3950.000000 | male | 2009 |
9 | P-221 | Gentoo | Biscoe | 50.700000 | 15.000000 | 223.000000 | 5550.000000 | male | 2008 |