{
"cells": [
{
"cell_type": "markdown",
"id": "c8cffa89-e0b8-4477-8f23-57b06934aba8",
"metadata": {},
"source": [
"# 2.1.4 Data Sources and Formats\n",
"\n",
"Once you’ve found a dataset for your research question there are many formats it could be in - tabular data, databases, documents,\n",
"images or many more. In this section we give an overview of common data\n",
"types and how they can be loaded into Python.\n",
"\n",
"Several of the file/data formats here are also covered in our [Research\n",
"Software Engineering\n",
"course](https://alan-turing-institute.github.io/rsd-engineeringcourse/).\n",
"Here we show how to load data with Pandas, and also introduce some new\n",
"data types.\n",
"\n",
"## Tabular Data Files\n",
"\n",
"### Comma-separated values files (CSVs)\n",
"\n",
"Tabular data files, and particularly comma-separated values (CSV) files,\n",
"are likely to be the data format you encounter most often. These specify\n",
"a single table of rows and columns in the following format:\n",
"\n",
" \"Subject\",\"Teacher\",\"Day of Week\",\"Time Start\",\"Time End\"\n",
" \"Maths\",\"Mr F\",\"Monday\",1000,1200\n",
" \"English\",\"Ms P\",\"Tuesday\",1100,1300\n",
" \"Physics\",\"Mrs A\",\"Thursday\",1400,1600\n",
" \"French\",\"Mr F\",\"Friday\",1000,1300\n",
"\n",
"The first line (usually) specifies the table column names, with each\n",
"name separated by a comma. The subsequent lines specify the values of\n",
"those columns for each row in our data, with each column value separated\n",
"by a comma.\n",
"\n",
"As an example, we will use a dataset downloaded from The World Bank,\n",
"giving the percentage of people living in urban environments by country\n",
"worldwide since 1960. You can find the original source data\n",
"[here](https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS?view=chart).\n",
"\n",
"To load a CSV file to a Pandas data frame you can use the Pandas\n",
"[`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)\n",
"function:\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "24b32636",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country Name | \n",
" Country Code | \n",
" Indicator Name | \n",
" Indicator Code | \n",
" 1960 | \n",
" 1980 | \n",
" 2000 | \n",
" 2020 | \n",
" Unnamed: 8 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Aruba | \n",
" ABW | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 50.776000 | \n",
" 50.472000 | \n",
" 46.717000 | \n",
" 43.697000 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" Africa Eastern and Southern | \n",
" AFE | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 14.704688 | \n",
" 20.845000 | \n",
" 28.669286 | \n",
" 36.783306 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" Afghanistan | \n",
" AFG | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 8.401000 | \n",
" 15.995000 | \n",
" 22.078000 | \n",
" 26.026000 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" Africa Western and Central | \n",
" AFW | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 14.670329 | \n",
" 24.518577 | \n",
" 35.352981 | \n",
" 47.848625 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" Angola | \n",
" AGO | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 10.435000 | \n",
" 24.298000 | \n",
" 50.087000 | \n",
" 66.825000 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Country Name Country Code \\\n",
"0 Aruba ABW \n",
"1 Africa Eastern and Southern AFE \n",
"2 Afghanistan AFG \n",
"3 Africa Western and Central AFW \n",
"4 Angola AGO \n",
"\n",
" Indicator Name Indicator Code 1960 \\\n",
"0 Urban population (% of total population) SP.URB.TOTL.IN.ZS 50.776000 \n",
"1 Urban population (% of total population) SP.URB.TOTL.IN.ZS 14.704688 \n",
"2 Urban population (% of total population) SP.URB.TOTL.IN.ZS 8.401000 \n",
"3 Urban population (% of total population) SP.URB.TOTL.IN.ZS 14.670329 \n",
"4 Urban population (% of total population) SP.URB.TOTL.IN.ZS 10.435000 \n",
"\n",
" 1980 2000 2020 Unnamed: 8 \n",
"0 50.472000 46.717000 43.697000 NaN \n",
"1 20.845000 28.669286 36.783306 NaN \n",
"2 15.995000 22.078000 26.026000 NaN \n",
"3 24.518577 35.352981 47.848625 NaN \n",
"4 24.298000 50.087000 66.825000 NaN "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"data/urban_population.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "7e381f83",
"metadata": {},
"source": [
"Each row in the data corresponds to a country, with columns for the\n",
"country’s name, the urban population in 1960, 1980, 2000 and 2020, and\n",
"some other metadata.\n",
"\n",
"The `.info()` method of a DataFrame gives us a useful summary of the\n",
"columns it contains:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e4688d7d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 266 entries, 0 to 265\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Country Name 266 non-null object \n",
" 1 Country Code 266 non-null object \n",
" 2 Indicator Name 266 non-null object \n",
" 3 Indicator Code 266 non-null object \n",
" 4 1960 263 non-null float64\n",
" 5 1980 263 non-null float64\n",
" 6 2000 263 non-null float64\n",
" 7 2020 262 non-null float64\n",
" 8 Unnamed: 8 0 non-null float64\n",
"dtypes: float64(5), object(4)\n",
"memory usage: 18.8+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "94c1dd28",
"metadata": {},
"source": [
"\n",
"Each column in a Pandas DataFrame has a single type. The urban\n",
"population percentages in columns 1960, 1980, 2000 and 2020 all contain\n",
"floating point numbers (`float64`), for example. When columns contain a\n",
"mixture of data types or strings, Pandas will give the whole column the\n",
"generic `object` type. Sometimes quirks in the data may cause Pandas to\n",
"infer a different type to what you expect - we will revisit this in\n",
"[Section 2.2.1](2-02-01-DataConsistency).\n",
"\n",
"We can also see that some columns have missing values (the data has 266\n",
"rows, but some columns have fewer than 266 \"`non-null`\" values), and\n",
"have a strange additional column `Unnamed: 8` that has *only* missing\n",
"(null) values. This is another topic we’ll revisit later.\n",
"\n",
"### Customising pandas.read_csv\n",
"\n",
"The original file from the World Bank contains a few lines of metadata\n",
"at the top:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8a5a47b0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\"Data Source\",\"World Development Indicators\",\n",
"\n",
"\"Last Updated Date\",\"2021-06-30\",\n",
"Country Name,Country Code,Indicator Name,Indicator Code,1960,1980,2000,2020,\n",
"Aruba,ABW,Urban population (% of total population),SP.URB.TOTL.IN.ZS,50.776,50.472,46.717,43.697,\n",
"Africa Eastern and Southern,AFE,Urban population (% of total population),SP.URB.TOTL.IN.ZS,14.7046880270389,20.8449997980123,28.6692864525936,36.7833061490919,\n",
"Afghanistan,AFG,Urban population (% of total population),SP.URB.TOTL.IN.ZS,8.401,15.995,22.078,26.026,\n",
"Africa Western and Central,AFW,Urban population (% of total population),SP.URB.TOTL.IN.ZS,14.6703287907718,24.5185774336299,35.3529813285238,47.8486254352506,\n",
"Angola,AGO,Urban population (% of total population),SP.URB.TOTL.IN.ZS,10.435,24.298,50.087,66.825,\n",
"Albania,ALB,Urban population (% of total population),SP.URB.TOTL.IN.ZS,30.705,33.762,41.741,62.112,\n"
]
}
],
"source": [
"\n",
"!head data/urban_population_header.csv\n"
]
},
{
"cell_type": "markdown",
"id": "73299327",
"metadata": {},
"source": [
"\n",
"The column names start on line 4 of the file, and the previous lines\n",
"give metadata on where the file came from and when it was updated.\n",
"\n",
"Using `read_csv` on this file (with default arguments) gives an error:\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7f4122f2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Traceback (most recent call last):\n",
" File \"/var/folders/f4/m42_v8vj1fjc0wrwtb69xjt80000gr/T/ipykernel_20322/1010206246.py\", line 5, in \n",
" df = pd.read_csv(\"data/urban_population_header.csv\")\n",
" File \"/usr/local/Caskroom/miniforge/base/envs/playground/lib/python3.10/site-packages/pandas/util/_decorators.py\", line 311, in wrapper\n",
" return func(*args, **kwargs)\n",
" File \"/usr/local/Caskroom/miniforge/base/envs/playground/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 586, in read_csv\n",
" return _read(filepath_or_buffer, kwds)\n",
" File \"/usr/local/Caskroom/miniforge/base/envs/playground/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 488, in _read\n",
" return parser.read(nrows)\n",
" File \"/usr/local/Caskroom/miniforge/base/envs/playground/lib/python3.10/site-packages/pandas/io/parsers/readers.py\", line 1047, in read\n",
" index, columns, col_dict = self._engine.read(nrows)\n",
" File \"/usr/local/Caskroom/miniforge/base/envs/playground/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py\", line 224, in read\n",
" chunks = self._reader.read_low_memory(nrows)\n",
" File \"pandas/_libs/parsers.pyx\", line 801, in pandas._libs.parsers.TextReader.read_low_memory\n",
" File \"pandas/_libs/parsers.pyx\", line 857, in pandas._libs.parsers.TextReader._read_rows\n",
" File \"pandas/_libs/parsers.pyx\", line 843, in pandas._libs.parsers.TextReader._tokenize_rows\n",
" File \"pandas/_libs/parsers.pyx\", line 1925, in pandas._libs.parsers.raise_parser_error\n",
"pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 9\n",
"\n"
]
}
],
"source": [
"from lib2to3.pgen2.parse import ParseError\n",
"import traceback # library used to print error without breaking python\n",
"\n",
"try:\n",
" df = pd.read_csv(\"data/urban_population_header.csv\")\n",
"except Exception as e: \n",
" traceback.print_exc()\n"
]
},
{
"cell_type": "markdown",
"id": "42bb2076",
"metadata": {},
"source": [
"This is because pandas is trying to use the first line in the file to define the columns present in our data. To avoid this, we can use the `skiprows` argument to tell pandas our table starts on line 4 (skipping the first 3 lines):\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "cfbb1a96",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" \n",
" | \n",
" Country Name | \n",
" Country Code | \n",
" Indicator Name | \n",
" Indicator Code | \n",
" 1960 | \n",
" 1980 | \n",
" 2000 | \n",
" 2020 | \n",
" Unnamed: 8 | \n",
" \n",
" \n",
" \n",
" \n",
" 0 | \n",
" Aruba | \n",
" ABW | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 50.776000 | \n",
" 50.472000 | \n",
" 46.717000 | \n",
" 43.697000 | \n",
" NaN | \n",
" \n",
" \n",
" 1 | \n",
" Africa Eastern and Southern | \n",
" AFE | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 14.704688 | \n",
" 20.845000 | \n",
" 28.669286 | \n",
" 36.783306 | \n",
" NaN | \n",
" \n",
" \n",
" 2 | \n",
" Afghanistan | \n",
" AFG | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 8.401000 | \n",
" 15.995000 | \n",
" 22.078000 | \n",
" 26.026000 | \n",
" NaN | \n",
" \n",
" \n",
" 3 | \n",
" Africa Western and Central | \n",
" AFW | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 14.670329 | \n",
" 24.518577 | \n",
" 35.352981 | \n",
" 47.848625 | \n",
" NaN | \n",
" \n",
" \n",
" 4 | \n",
" Angola | \n",
" AGO | \n",
" Urban population (% of total population) | \n",
" SP.URB.TOTL.IN.ZS | \n",
" 10.435000 | \n",
" 24.298000 | \n",
" 50.087000 | \n",
" 66.825000 | \n",
" NaN | \n",
" \n",
" \n",
" \n",
" "
],
"text/plain": [
" Country Name Country Code \\\n",
"0 Aruba ABW \n",
"1 Africa Eastern and Southern AFE \n",
"2 Afghanistan AFG \n",
"3 Africa Western and Central AFW \n",
"4 Angola AGO \n",
"\n",
" Indicator Name Indicator Code 1960 \\\n",
"0 Urban population (% of total population) SP.URB.TOTL.IN.ZS 50.776000 \n",
"1 Urban population (% of total population) SP.URB.TOTL.IN.ZS 14.704688 \n",
"2 Urban population (% of total population) SP.URB.TOTL.IN.ZS 8.401000 \n",
"3 Urban population (% of total population) SP.URB.TOTL.IN.ZS 14.670329 \n",
"4 Urban population (% of total population) SP.URB.TOTL.IN.ZS 10.435000 \n",
"\n",
" 1980 2000 2020 Unnamed: 8 \n",
"0 50.472000 46.717000 43.697000 NaN \n",
"1 20.845000 28.669286 36.783306 NaN \n",
"2 15.995000 22.078000 26.026000 NaN \n",
"3 24.518577 35.352981 47.848625 NaN \n",
"4 24.298000 50.087000 66.825000 NaN "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"data/urban_population_header.csv\", skiprows=3)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "9ee7af26",
"metadata": {},
"source": [
"\n",
"There’s not a single CSV data \"standard\" that everyone follows, so it’s\n",
"common to need to tweak things to load properly. Other examples include:\n",
"\n",
"- Different \"delimeters\", e.g., Tab-separated values files (TSVs) with\n",
" tabs separating column values instead of commas.\n",
"- Footers: Lines with metadata/other information not part of the table\n",
" at the end of the file.\n",
"- Comments: Annotations anywhere in the file that shouldn’t be\n",
" included in the table.\n",
"\n",
"`pd.read_csv` can be customised to deal with all of these and more. For example:\n",
"\n",
"```python\n",
"df = pd.read_csv(\n",
" \"myfile.csv\",\n",
" delimeter=\"\\t\", # columns separated by tabs\n",
" skipfooter=3, # exclude last 3 lines\n",
" comment=\"#\", # exclude lines starting with \"#\"\n",
")\n",
"```\n",
"\n",
"### Exercise\n",
"\n",
"Later on we'll be using the \"Palmer Penguins\" dataset, which we have saved at the path `data/penguins.csv`. Load this dataset into a pandas data frame.\n",
"\n",
"- What type of data do you think it contains?\n",
"- What are the types of the columns and do they match what you'd expect? \n",
"\n",
"Solution:\n",
"\n",
"`df = pd.read_csv(\"data/penguins.csv\")` to load the data into pandas. We'll discuss the answer to the other questions later, but you may have noticed that some columns have the generic `object` type even though it looks like they contain numeric data.\n",
"\n",
"### Excel Spreadsheets\n",
"\n",
"As CSV files are a plaintext (human readable) format and don’t need\n",
"proprietry software to use and create they are always preferable to\n",
"Excel spreadsheets for raw data, if available. However, it is still\n",
"common to find data in Excel `.xls` or `.xlsx` files.\n",
"\n",
"If needed, Pandas also has a function\n",
"[`pandas.read_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)\n",
"to load data from Excel spreadsheets. The format is very similar to\n",
"`read_csv`:\n",
"\n",
"```python\n",
"df = pd.read_excel(\n",
" \"my_spreadsheet.xlsx\",\n",
" sheet_name=\"Sheet1\"\n",
")\n",
"```\n",
"\n",
"Note you must specify which worksheet you want to load as a data frame.\n",
"If you want to load tables from multiple worksheets you would need to\n",
"load them into multiple data frames.\n",
"\n",
"## Databases\n",
"\n",
"We can think of databases as containers for data. Typically, databases\n",
"allow for easy storage, retrieval, update, and analysis of data.\n",
"\n",
"In the case of tabular data, a database may have many tables. A database\n",
"can be flat, or it can be relational.\n",
"\n",
"Example:\n",
"\n",
"**Flat Database with single table:**\n",
"\n",
"| Subject | Teacher | Teacher Email | Teacher Phone Number | Day of Week | Time Start | Time End |\n",
"|---------|--------|------------|-----------------|-----------|----------|--------|\n",
"| Maths | Mr F | f@school.com | 07123123123 | Monday | 1000 | 1200 |\n",
"| English | Ms P | p@school.com | 07456456456 | Tuesday | 1100 | 1300 |\n",
"| Physics | Mrs A | a@school.com | 07789789789 | Thursday | 1400 | 1600 |\n",
"| French | Mr F | f@school.co | 07123123123 | Friday | 1000 | 1300 |\n",
"\n",
"**Relational database with two tables:**\n",
"\n",
"| Subject | Teacher | Day of Week | Time Start | Tie End |\n",
"|---------|---------|-------------|------------|---------|\n",
"| Maths | Mr F | Monday | 1000 | 1200 |\n",
"| English | Ms P | Tuesday | 1100 | 1300 |\n",
"| Physics | Mrs A | Thursday | 1400 | 1600 |\n",
"| French | Mr F | Friday | 1000 | 1300 |\n",
"\n",
"| Teacher | Teacher Email | Teacher Phone Number |\n",
"|---------|---------------|----------------------|\n",
"| Mr F | f@school.com | 07123123123 |\n",
"| Ms P | p@school.com | 07456456456 |\n",
"| Mrs A | a@school.com | 07789789789 |\n",
"\n",
"Flat databases can be simple but also inefficient. Relational databases,\n",
"with data stored in a series of interconnected tables, can be more\n",
"complicated but more computationally efficient.\n",
"\n",
"### SQL Databases & RDBMS\n",
"\n",
"Structured Query Lanage (SQL) is a standard language for storing,\n",
"manipulating and retrieving data in databases.\n",
"\n",
"An RDBMS (Relational Database Management System) is the basis for SQL\n",
"and relies on the [relational data\n",
"model](https://en.wikipedia.org/wiki/Relational_model). Many modern and\n",
"popular databases implement the ideas of an RDBMS. Some common\n",
"implementations are:\n",
"\n",
"- PostgreSQL\n",
"- SQLite\n",
"- MySQL\n",
"- Oracle\n",
"- MS Access\n",
"\n",
"We use SQL to interact with these databases. You can find some basic,\n",
"interactive, tutorials at [W3Schools](https://www.w3schools.com/sql/).\n",
"\n",
"### NoSQL Databases\n",
"\n",
"> NoSQL databases (aka \"not only SQL\") are non tabular, and store data\n",
"> differently than relational tables. NoSQL databases come in a variety\n",
"> of types based on their data model. The main types are document,\n",
"> key-value, wide-column, and graph. They provide flexible schemas and\n",
"> scale easily with large amounts of data and high user loads.\n",
"> *https://www.mongodb.com/nosql-explained*\n",
"\n",
"Some common NoSQL Databases include:\n",
"\n",
"- Elasticsearch (document)\n",
"- MongoDB (document)\n",
"- DynamoDB (key-value)\n",
"- Gaffer (graph)\n",
"\n",
"**Document example**\n",
"\n",
"```json\n",
"{\n",
" \"name\": \"Jane\",\n",
" \"dob\": \"2012-04-23T18:25:43.511Z\",\n",
" \"interests\": [\"running\", \"swimming\"],\n",
" \"counter\": 104\n",
"}\n",
"```\n",
"\n",
"**Key-Value example:**\n",
"\n",
"\n",
"\n",
"**Graph example:** \n",
"\n",
"Unfortunately, different types of databases will often have their own,\n",
"unique, query languages, beyond SQL - more to learn!\n",
"\n",
"### Where Is the Data?\n",
"\n",
"Databases can exist on your local machine or can be hosted elsewhere.\n",
"Typically, we’ll want to host a database somewhere that everyone who\n",
"needs access can reach.\n",
"\n",
"Many cloud providers offer database solutions with easy-to-use\n",
"interfaces, allowing us (for a cost!) to worry less about exactly how\n",
"the DBMS is working and focus, instead, on the data itself.\n",
"\n",
"### Which Database to Use\n",
"\n",
"Which type of database you choose should be driven by your data and how\n",
"you plan to use it. You may find that your data can be expressed in more\n",
"than one of the forms above or as a table. You then need to weigh up\n",
"different factors to decide which form and which database, if any, to\n",
"use.\n",
"\n",
"Some factors you may consider (unordered!):\n",
"\n",
"- Ease of use\n",
"- Cost\n",
"- Footprint on disk\n",
"- Footprint in memory\n",
"- Retrieval/update speed\n",
"\n",
"Many blogs and articles have been written on the choice of database, for\n",
"example [this xplenty\n",
"article](https://www.xplenty.com/blog/which-database/).\n",
"\n",
"## Application Programming Interfaces (APIs)\n",
"\n",
"Application programming interfaces (APIs) are services that allow you to\n",
"programmatically request information, including datasets, from a remote\n",
"server. Large, digitally mature organisations commonly have APIs\n",
"available to access their data, including\n",
"[GitHub](https://docs.github.com/en/rest/guides/getting-started-with-the-rest-api),\n",
"[Twitter](https://developer.twitter.com/en/docs/twitter-api) and the\n",
"[Office for National Statistics](https://developer.ons.gov.uk/), for\n",
"example. Advantages of APIs include:\n",
"\n",
"- It’s possible to make queries to request a tailored subset of a\n",
" dataset, so you only get what you’re interested in.\n",
"- It’s straightforward to automate making many similar queries (i.e.,\n",
" to request many different datasets) with different parameters.\n",
"- If a dataset is updating regularly (maybe you want to know the\n",
" current weather at a location, for example) they provide a robust\n",
" way to get the latest status of what you’re interested in.\n",
"- They usually should provide a reproducible way for others to\n",
" download the same data you’re using.\n",
"\n",
"Most APIs follow the REST style, which we briefly introduce here, but\n",
"you may also come across GraphQL APIs. [This\n",
"blog](https://www.smashingmagazine.com/2018/01/understanding-using-rest-api/)\n",
"has an overview of both types.\n",
"\n",
"### Datamuse API\n",
"\n",
"As an example, we’ll try the Datamuse API, which can be used to find\n",
"relationships between words and terms (similar words, words that rhyme\n",
"and so on). You can find documentation for it here:\n",
"https://www.datamuse.com/api/\n",
"\n",
"An API query has four main components: a base URL, an endpoint,\n",
"parameters, and headers:\n",
"\n",
"**Base URL:** The address of the server hosting the data we want to\n",
"access - all our API queries will start with this. For the Datamuse API\n",
"it’s: `https://api.datamuse.com`\n",
"\n",
"**Endpoint:** Each API might have multiple types of data we can query,\n",
"known as different \"endpoints\". The path to the endpoint is added to the\n",
"base URL. The Datamuse API has two endpoints:`/words` for searching for\n",
"relationships (of different types) between words, and `/sug`, which can\n",
"be used for auto-completion/spelling correction.\n",
"\n",
"**Parameters:** Define what data subset we’re interested in and are\n",
"provided in the format `name=value`. These are added to the query after\n",
"a `?` following the endpoint. To specify multiple parameters, separate\n",
"each parameter name and value pair with an `&`. The available parameters\n",
"should be listed in the API’s documentation\n",
"([here](https://www.datamuse.com/api/) for Datamuse).\n",
"\n",
"The parameter string `?rel_jjb=dog&max=5` defines two parameters and\n",
"values:\n",
"\n",
"- `rel_jjb=dog`: Find adjectives commonly used to describe the\n",
"noun \"dog\".\n",
"- `max=5`: Return 5 results.\n",
"\n",
"**Headers:** Provide additional context/metadata about your request, the\n",
"most common being authentication information. Authentication may be used\n",
"to only give you access to resources you have permission to see, limit\n",
"the number of requests you can make within a certain amount of time, or\n",
"for billing purposes (if the API isn’t free).\n",
"\n",
"The Datamuse API doesn’t require authentication, so we don’t need to add\n",
"any headers here. If the API you’re using requires it they should\n",
"provide documentation that describes what you need to do - [here is an\n",
"example from\n",
"GitHub](https://docs.github.com/en/rest/guides/getting-started-with-the-rest-api#authentication),\n",
"and [here is how to add headers in the Python requests\n",
"library](https://docs.python-requests.org/en/master/user/quickstart/#custom-headers).\n",
"\n",
"**Full query:** Adding together all the components we have the full API\n",
"query:\n",
"\n",
"- https://api.datamuse.com/words?rel_jjb=dog&max=5\n",
"\n",
"You can open this in a web browser and should see something like this:\n",
"\n",
"```json\n",
"[\n",
" {\"word\": \"little\", \"score\": 1001},\n",
" {\"word\": \"old\", \"score\": 1000},\n",
" {\"word\": \"hot\", \"score\": 999},\n",
" {\"word\": \"big\", \"score\": 998},\n",
" {\"word\": \"black\", \"score\": 997}\n",
"]\n",
"```\n",
"\n",
"We get the top 5 words describing \"dog\" and a score that indicates how\n",
"strong the association is.\n",
"\n",
"The most common format for data returned by an API is JSON (JavaScript\n",
"Object Notation) from the JavaScript language, which is commonly used in\n",
"web development. JSONs have a similar structure to Python dictionaries\n",
"or lists of Python dictionaries. Python has a built-in library `json`\n",
"for converting data to and from the JSON format (see\n",
"[here](https://docs.python.org/3/library/json.html)), and Pandas also\n",
"has a function for creating a DataFrame from a JSON file\n",
"([pandas.read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)).\n",
"\n",
"We can make the same API query in Python using the\n",
"[requests](https://docs.python-requests.org/en/master/user/quickstart/)\n",
"library, as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6075eae5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = \"https://api.datamuse.com/words\" \n",
"params = {\"rel_jjb\": \"dog\",\"max\": 5} \n",
"r = requests.get(url, params=params)"
]
},
{
"cell_type": "markdown",
"id": "d77fcf6d",
"metadata": {},
"source": [
"Note that we can define the parameters in a dictionary, which is much easier to read than the raw format in the query string seen earlier. To check whether the request worked you can check the status code:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "71adb6fc",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"200\n"
]
}
],
"source": [
"print(r.status_code)"
]
},
{
"cell_type": "markdown",
"id": "89f44b94",
"metadata": {},
"source": [
"\n",
"Codes in the 200s usually indicate a successful query, for the meanings\n",
"of other codes see\n",
"[here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), or\n",
"`print(r.content)` may give you more information about what happened.\n",
"\n",
"We can convert the result into a list of dictionaries as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9c1b507d",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'word': 'little', 'score': 1001}, {'word': 'old', 'score': 1000}, {'word': 'hot', 'score': 999}, {'word': 'big', 'score': 998}, {'word': 'black', 'score': 997}]\n"
]
}
],
"source": [
"result_list = r.json() \n",
"print(result_list)"
]
},
{
"cell_type": "markdown",
"id": "f7bbcc08",
"metadata": {},
"source": [
"\n",
"And we can interact with that list in the usual Python way:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d106c665",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"little\n"
]
}
],
"source": [
"print(result_list[0][\"word\"])"
]
},
{
"cell_type": "markdown",
"id": "b64823dd",
"metadata": {},
"source": [
"\n",
"You can also load an API query directly into a Pandas DataFrame:\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "91fc7ec2",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" \n",
" | \n",
" word | \n",
" score | \n",
" \n",
" \n",
" \n",
" \n",
" 0 | \n",
" little | \n",
" 1001 | \n",
" \n",
" \n",
" 1 | \n",
" old | \n",
" 1000 | \n",
" \n",
" \n",
" 2 | \n",
" hot | \n",
" 999 | \n",
" \n",
" \n",
" 3 | \n",
" big | \n",
" 998 | \n",
" \n",
" \n",
" 4 | \n",
" black | \n",
" 997 | \n",
" \n",
" \n",
" \n",
" "
],
"text/plain": [
" word score\n",
"0 little 1001\n",
"1 old 1000\n",
"2 hot 999\n",
"3 big 998\n",
"4 black 997"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_json(\"https://api.datamuse.com/words?rel_jjb=dog&max=5\") \n",
"df"
]
},
{
"cell_type": "markdown",
"id": "60c887a6",
"metadata": {},
"source": [
"But this may not work well if your query returns a more complex data structure - in that case it's best to start with the requests library.\n",
"\n",
"Bear in mind it's likely the service you're using will limit both the rate of queries you can make, and the amount of data returned per query. If you want more data than can be returned by one query, an API will usually provide a way to get the data with multiple queries - this is known as \"pagination\" (see [this blog post](https://nordicapis.com/everything-you-need-to-know-about-api-pagination/) for example). If you're making many queries you may need to limit their rate in your script to avoid breaching the usage rules of the API - a quick way to do that would be to use the [time.sleep](https://realpython.com/python-sleep/#adding-a-python-sleep-call-with-timesleep) function in Python.\n",
"\n",
"This \"public-apis\" repository on GitHub maintains a list of open APIs that should be good for learning purposes: https://github.com/public-apis/public-apis. The UK government also maintains a list of APIs providing data on many different areas of government & life in the UK: https://www.api.gov.uk/\n",
"\n",
"### Exercise\n",
"\n",
"Use the Datamuse API to find 3 words that rhyme with \"cat\" and load them into a Pandas data frame. You'll need to use the Datamuse documnetation to find which parameter will search for rhyming words.\n",
"\n",
"Solution:\n",
"\n",
"The relevant parameter is `rel_rhy`, and the query and pandas function needed is `pd.read_json(\"https://api.datamuse.com/words?rel_rhy=cat&max=3\")`\n",
"\n",
"## Image Data\n",
"\n",
"Images are commonly represented as a n-dimensional tensor of pixel\n",
"values.\n",
"\n",
"### Black and White\n",
"\n",
"A simple form of this can be shown with a black and white image. The\n",
"13x13 pixel smiley can be represented by a 2D tensor (width,height) of\n",
"0s and 1s. Black and white images can be thought of as a binary off/on\n",
"for each pixel.\n",
"\n",
"\n",
"\n",
"### Grayscale\n",
"\n",
"If we want to move to grayscale - allowing shades of gray between black\n",
"and white - we can do this with intermediate values between black and\n",
"white. Here we change black and white to 0 and 255 (8-bit\n",
"representation), shades of gray are everything in between.\n",
"\n",
"\n",
"\n",
"### RGB\n",
"\n",
"We can introduce colour using an RGB (Red Green Blue) representation.\n",
"Here, we store the red, green, and blue values separately - these\n",
"individual representations are known as channels. We now use a 3D tensor\n",
"to represent this image (width,height,channels).\n",
"\n",
"\n",
"\n",
"### Libraries\n",
"\n",
"Some popular libraries for processing and analysing image data in Python\n",
"include:\n",
"\n",
"- [opencv-python](https://pypi.org/project/opencv-python/):\n",
" [OpenCV](https://opencv.org/) (Open-source Computer Vision) packages\n",
" for Python. Contains hundreds of computer vision algorithms.\n",
"- [Pillow](https://pillow.readthedocs.io/en/stable/): PIL (Python\n",
" Imaging Library) fork. \"This library provides extensive file format\n",
" support, an efficient internal representation, and fairly powerful\n",
" image processing capabilities\".\n",
"- [torchvision](https://pytorch.org/vision/stable/index.html): Part of\n",
" the wider [PyTorch](http://pytorch.org/) project. \"The torchvision\n",
" package consists of popular datasets, model architectures, and\n",
" common image transformations for computer vision\".\n",
"- [tf.keras.preprocessing](https://www.tensorflow.org/tutorials/load_data/images#load_using_tfkeraspreprocessing):\n",
" Part of the wider tensorflow/keras ecosystem, the preprocessing\n",
" module provides preprocessing utilities for image data.\n",
"\n",
"## Text Data\n",
"\n",
"Text data is common in data science applications, anything from\n",
"analysing Tweets and social media to documents and novels. We’ll cover\n",
"some general techniques for handling text data in this module, but we\n",
"won’t delve into the details of the wider field of \"Natural Language\n",
"Processing\" (NLP). Some popular libraries and resources for processing\n",
"and analysing text data in Python include:\n",
"\n",
"- [NLTK](https://www.nltk.org/): Well-established natural language\n",
" processing toolkit for Python, offering a wide range of text\n",
" processing techniques in many different languages.\n",
"- [Spacy](https://spacy.io/): A more modern alternative to NLTK,\n",
" offering higher performance, state of the art algorithms, and better\n",
" integration with modelling frameworks in some cases.\n",
"- [Scikit-Learn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html):\n",
" Has common NLP preprocessing functions which can be integrated with\n",
" modelling pipelines using many different machine learning\n",
" algorithms.\n",
"- [Hugging Face](https://huggingface.co/): For implementations of\n",
" state-of-the-art deep learning models from research.\n",
"- [Textract](https://textract.readthedocs.io/en/stable/): Text data\n",
" may appear in many file formats other than plaintext files (`.txt`\n",
" or similar), such as in PDFs, word documents, or even within images.\n",
" Textract provides a single interface for extracting text data from\n",
" many different formats.\n",
"\n",
"## Other\n",
"\n",
"- Audio: commonly stored as `.wav` or `.mp3` and displayed as a\n",
" [waveform](https://en.wikipedia.org/wiki/Waveform) - digitised audio\n",
" can be used for things like training speech recognition models. Lots\n",
" of crossover with signal processing!\n",
" [Librosa](https://librosa.org/doc/latest/index.html) is a popular\n",
" python package for working with audio.\n",
"- Video: can be thought of as many images + audio!\n",
"- Geospatial: data relating to any location on the Earth’s surface.\n",
" The [geopandas](https://geopandas.org/) library combines the\n",
" capabilities of [shapely](https://shapely.readthedocs.io/) and\n",
" Pandas to make working with this datatype easier.\n",
"- Time Series: Data that can be expressed as observations over time,\n",
" e.g., stock price data. Honourable mention to\n",
" [sktime](https://github.com/alan-turing-institute/sktime) for\n",
" analysis.\n",
"- XML (Extensible Markup Language): We recommend using a library such\n",
" as\n",
" [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n",
" to help parse this heavily structured format.\n",
"- Web scraping: extracting structured data from web sites.\n",
" [Scrapy](https://scrapy.org/) is a popular library here."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
},
"vscode": {
"interpreter": {
"hash": "bfb6be6a1e2928ff2d2ab7904db3c9037c4b5ca4f399bd5e7064365a54a3f091"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
|