{
"cells": [
{
"cell_type": "markdown",
"id": "e2ccdc55-63ab-429c-b6e4-7126f75df160",
"metadata": {},
"source": [
"# 2.2.6 Linking Datasets\n",
"\n",
"Data linking is the process of joining datasets together. Datasets can\n",
"be useful on their own but bringing them together can unlock new\n",
"insights.\n",
"\n",
"Discussion\n",
"\n",
"- When might this be helpful? Can you think of any examples where joining datasets might help us unlock additional insights?\n",
"\n",
"## Impact on Privacy\n",
"\n",
"Before we link datasets together, we should consider the impact on\n",
"privacy of doing so. There can be an increased risk of identification of\n",
"a person/entity when two datasets are linked.\n",
"\n",
"## How to Link\n",
"\n",
"The most straightforward way to link datasets is by a deterministic,\n",
"rules-based linkage, where records are linked if a specific **set of\n",
"identifiers** match. When working with tabular datasets we will often\n",
"see this accomplished with a *join* operation.\n",
"\n",
"### Joining in Pandas\n",
"\n",
"Pandas allows for database-style joins. If you have experience of SQL,\n",
"you'll be familiar with the terminology of left/right/inner/outer etc.\n",
"joins. Pandas makes these join types (or similar) available in the\n",
"Pandas API via `pd.merge` and provides a comprehensive summary of joins\n",
"[in the\n",
"docs](https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging).\n",
"Here, for demonstration, we will give an example using a simple left\n",
"join, similar to a `LEFT OUTER` join in SQL.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a0c4e4be",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df1\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" some_id | \n",
" a | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" bar | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" zop | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" some_id a\n",
"0 foo 0\n",
"1 bar 1\n",
"2 zop 2"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"df2\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" some_id | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" bar | \n",
" 4 | \n",
"
\n",
" \n",
" 2 | \n",
" baz | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" some_id b\n",
"0 foo 3\n",
"1 bar 4\n",
"2 baz 5"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df1 = pd.DataFrame({'some_id': ['foo', 'bar', 'zop'], 'a': [0,1,2]})\n",
"df2 = pd.DataFrame({'some_id': ['foo', 'bar', 'baz'], 'b': [3, 4,5]})\n",
"\n",
"print(\"df1\")\n",
"display(df1)\n",
"\n",
"print(\"df2\")\n",
"display(df2)"
]
},
{
"cell_type": "markdown",
"id": "54962111",
"metadata": {},
"source": [
"\n",
"\n",
"note that `df1` contains the key `'zop'` that is not contained in `df2` and that `df2` contains the key `'baz'` that is not contained in `df1`.\n",
"The other two keys (`'foo', 'bar'`) are contained in both DataFrames.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "da981ec1",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" some_id | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" foo | \n",
" 0 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" bar | \n",
" 1 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 2 | \n",
" zop | \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" some_id a b\n",
"0 foo 0 3.0\n",
"1 bar 1 4.0\n",
"2 zop 2 NaN"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = pd.merge(left=df1, right=df2, how=\"left\", on=\"some_id\")\n",
"display(result)"
]
},
{
"cell_type": "markdown",
"id": "54348e72",
"metadata": {},
"source": [
"\n",
"the data for `'baz'` has been dropped after this join as it only\n",
"appeared on the 'right' side. The `how='left'` join only uses keys from\n",
"the left frame. Meanwhile, the entry for column `'b'` is null for key\n",
"`'zop'`.\n",
"\n",
"Different scenarios will require different join types, read the docs\n",
"carefully!\n",
"\n",
"### Probabilistic Matching\n",
"\n",
"If we don't have a set of common identifiers, we may wish to use\n",
"probabilistic matching.\n",
"\n",
"Probabilistic matching calculates a matching score between two records.\n",
"Typically, this is done by comparing several field values and assigning\n",
"a weight to each depending on how closely they match.\n",
"\n",
"Details of probabilistic matching approaches are, unfortunately, beyond\n",
"the scope of this course."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
},
"vscode": {
"interpreter": {
"hash": "bfb6be6a1e2928ff2d2ab7904db3c9037c4b5ca4f399bd5e7064365a54a3f091"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}