2.1.4 Data Sources and Formats
Once you’ve found a dataset for your research question, there are many formats it could be in: tabular data, databases, documents, images, and more. In this section we give an overview of common data types and how they can be loaded into Python.
Several of the file/data formats here are also covered in our Research Software Engineering course. Here we show how to load data with Pandas, and also introduce some new data types.
Tabular Data Files¶
Comma-separated values files (CSVs)¶
Tabular data files, and particularly comma-separated values (CSV) files, are likely to be the data format you encounter most often. These specify a single table of rows and columns in the following format:
"Subject","Teacher","Day of Week","Time Start","Time End"
"Maths","Mr F","Monday",1000,1200
"English","Ms P","Tuesday",1100,1300
"Physics","Mrs A","Thursday",1400,1600
"French","Mr F","Friday",1000,1300
The first line (usually) specifies the table column names, with each name separated by a comma. The subsequent lines specify the values of those columns for each row in our data, with each column value separated by a comma.
As an example, we will use a dataset downloaded from The World Bank, giving the percentage of people living in urban environments by country worldwide since 1960. You can find the original source data here.
To load a CSV file into a Pandas data frame you can use the Pandas read_csv function:
import pandas as pd
df = pd.read_csv("data/urban_population.csv")
df.head()
Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1980 | 2000 | 2020 | Unnamed: 8 | |
---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 50.776000 | 50.472000 | 46.717000 | 43.697000 | NaN |
1 | Africa Eastern and Southern | AFE | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 14.704688 | 20.845000 | 28.669286 | 36.783306 | NaN |
2 | Afghanistan | AFG | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 8.401000 | 15.995000 | 22.078000 | 26.026000 | NaN |
3 | Africa Western and Central | AFW | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 14.670329 | 24.518577 | 35.352981 | 47.848625 | NaN |
4 | Angola | AGO | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 10.435000 | 24.298000 | 50.087000 | 66.825000 | NaN |
Each row in the data corresponds to a country, with columns for the country’s name, the urban population in 1960, 1980, 2000 and 2020, and some other metadata.
The .info() method of a DataFrame gives us a useful summary of the columns it contains:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 266 non-null object
1 Country Code 266 non-null object
2 Indicator Name 266 non-null object
3 Indicator Code 266 non-null object
4 1960 263 non-null float64
5 1980 263 non-null float64
6 2000 263 non-null float64
7 2020 262 non-null float64
8 Unnamed: 8 0 non-null float64
dtypes: float64(5), object(4)
memory usage: 18.8+ KB
Each column in a Pandas DataFrame has a single type. The urban population percentages in columns 1960, 1980, 2000 and 2020 all contain floating point numbers (float64), for example. When columns contain a mixture of data types or strings, Pandas will give the whole column the generic object type. Sometimes quirks in the data may cause Pandas to infer a different type to what you expect - we will revisit this in Section 2.2.1.
We can also see that some columns have missing values (the data has 266 rows, but some columns have fewer than 266 “non-null” values), and that there is a strange additional column, Unnamed: 8, which contains only missing (null) values. This is another topic we’ll revisit later.
Customising pandas.read_csv¶
The original file from the World Bank contains a few lines of metadata at the top:
!head data/urban_population_header.csv
"Data Source","World Development Indicators",
"Last Updated Date","2021-06-30",
Country Name,Country Code,Indicator Name,Indicator Code,1960,1980,2000,2020,
Aruba,ABW,Urban population (% of total population),SP.URB.TOTL.IN.ZS,50.776,50.472,46.717,43.697,
Africa Eastern and Southern,AFE,Urban population (% of total population),SP.URB.TOTL.IN.ZS,14.7046880270389,20.8449997980123,28.6692864525936,36.7833061490919,
Afghanistan,AFG,Urban population (% of total population),SP.URB.TOTL.IN.ZS,8.401,15.995,22.078,26.026,
Africa Western and Central,AFW,Urban population (% of total population),SP.URB.TOTL.IN.ZS,14.6703287907718,24.5185774336299,35.3529813285238,47.8486254352506,
Angola,AGO,Urban population (% of total population),SP.URB.TOTL.IN.ZS,10.435,24.298,50.087,66.825,
Albania,ALB,Urban population (% of total population),SP.URB.TOTL.IN.ZS,30.705,33.762,41.741,62.112,
The column names start on line 4 of the file, and the previous lines give metadata on where the file came from and when it was updated.
Using read_csv on this file (with default arguments) gives an error:
import traceback  # used to print the error without stopping Python

try:
    df = pd.read_csv("data/urban_population_header.csv")
except Exception:
    traceback.print_exc()
Traceback (most recent call last):
File "/tmp/ipykernel_2210/1010206246.py", line 5, in <module>
df = pd.read_csv("data/urban_population_header.csv")
File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
return parser.read(nrows)
File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 9
This is because pandas is trying to use the first line in the file to define the columns present in our data. To avoid this, we can use the skiprows argument to tell pandas our table starts on line 4 (skipping the first 3 lines):
df = pd.read_csv("data/urban_population_header.csv", skiprows=3)
df.head()
Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1980 | 2000 | 2020 | Unnamed: 8 | |
---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 50.776000 | 50.472000 | 46.717000 | 43.697000 | NaN |
1 | Africa Eastern and Southern | AFE | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 14.704688 | 20.845000 | 28.669286 | 36.783306 | NaN |
2 | Afghanistan | AFG | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 8.401000 | 15.995000 | 22.078000 | 26.026000 | NaN |
3 | Africa Western and Central | AFW | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 14.670329 | 24.518577 | 35.352981 | 47.848625 | NaN |
4 | Angola | AGO | Urban population (% of total population) | SP.URB.TOTL.IN.ZS | 10.435000 | 24.298000 | 50.087000 | 66.825000 | NaN |
There’s no single CSV “standard” that everyone follows, so it’s common to need to tweak the loading options to read a file properly. Other examples include:
Different “delimiters”, e.g., tab-separated values (TSV) files with tabs separating column values instead of commas.
Footers: Lines with metadata/other information not part of the table at the end of the file.
Comments: Annotations anywhere in the file that shouldn’t be included in the table.
pd.read_csv can be customised to deal with all of these and more. For example:
df = pd.read_csv(
    "myfile.csv",
    delimiter="\t",   # columns separated by tabs
    skipfooter=3,     # exclude the last 3 lines
    comment="#",      # exclude lines starting with "#"
    engine="python",  # the skipfooter option needs the "python" parsing engine
)
Exercise¶
Later on we’ll be using the “Palmer Penguins” dataset, which we have saved at the path data/penguins.csv. Load this dataset into a pandas data frame.
What type of data do you think it contains?
What are the types of the columns and do they match what you’d expect?
Solution:
Use df = pd.read_csv("data/penguins.csv") to load the data into pandas. We’ll discuss the answers to the other questions later, but you may have noticed that some columns have the generic object type even though it looks like they contain numeric data.
Excel Spreadsheets¶
As CSV files are a plaintext (human-readable) format and don’t need proprietary software to create or use, they are preferable to Excel spreadsheets for raw data whenever available. However, it is still common to find data in Excel .xls or .xlsx files.
If needed, Pandas also has a function, pandas.read_excel, to load data from Excel spreadsheets. Its usage is very similar to read_csv:
df = pd.read_excel(
    "my_spreadsheet.xlsx",
    sheet_name="Sheet1",
)
Note that this loads a single worksheet into the data frame (selected with the sheet_name argument). If you want tables from multiple worksheets you can load them into multiple data frames, as in the sketch below.
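For example, a minimal sketch (with placeholder sheet names): passing a list of sheet names returns a dictionary of data frames, one per sheet.
import pandas as pd

# Load several worksheets in one call; "Sheet1" and "Sheet2" are placeholder names.
# The result is a dict of {sheet name: DataFrame}.
sheets = pd.read_excel("my_spreadsheet.xlsx", sheet_name=["Sheet1", "Sheet2"])
df_first = sheets["Sheet1"]
df_second = sheets["Sheet2"]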
Databases¶
We can think of databases as containers for data. Typically, databases allow for easy storage, retrieval, update, and analysis of data.
In the case of tabular data, a database may have many tables. A database can be flat, or it can be relational.
Example:
Flat database with a single table:

Subject | Teacher | Teacher Email | Teacher Phone Number | Day of Week | Time Start | Time End
---|---|---|---|---|---|---
Maths | Mr F | | 07123123123 | Monday | 1000 | 1200
English | Ms P | | 07456456456 | Tuesday | 1100 | 1300
Physics | Mrs A | | 07789789789 | Thursday | 1400 | 1600
French | Mr F | | 07123123123 | Friday | 1000 | 1300
Relational database with two tables:

Subject | Teacher | Day of Week | Time Start | Time End
---|---|---|---|---
Maths | Mr F | Monday | 1000 | 1200
English | Ms P | Tuesday | 1100 | 1300
Physics | Mrs A | Thursday | 1400 | 1600
French | Mr F | Friday | 1000 | 1300

Teacher | Teacher Email | Teacher Phone Number
---|---|---
Mr F | | 07123123123
Ms P | | 07456456456
Mrs A | | 07789789789
Flat databases can be simple but also inefficient. Relational databases, with data stored in a series of interconnected tables, can be more complicated but more computationally efficient.
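To make the relational idea concrete, here is a minimal sketch in Pandas (using a subset of the toy timetable above, with the email column left out): joining the two tables on their shared Teacher column recovers the flat, single-table form without storing each teacher's details more than once.
import pandas as pd

# The two relational tables above as data frames (a few rows for brevity)
timetable = pd.DataFrame({
    "Subject": ["Maths", "English", "French"],
    "Teacher": ["Mr F", "Ms P", "Mr F"],
    "Day of Week": ["Monday", "Tuesday", "Friday"],
})
teachers = pd.DataFrame({
    "Teacher": ["Mr F", "Ms P"],
    "Teacher Phone Number": ["07123123123", "07456456456"],
})

# Joining on the shared "Teacher" column reproduces the flat table
flat = timetable.merge(teachers, on="Teacher", how="left")
print(flat)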
SQL Databases & RDBMS¶
Structured Query Language (SQL) is a standard language for storing, manipulating and retrieving data in databases.
An RDBMS (Relational Database Management System) is a database system built on the relational data model, and SQL is the standard way to interact with one. Many modern and popular databases implement the ideas of an RDBMS. Some common implementations are:
PostgreSQL
SQLite
MySQL
Oracle
MS Access
We use SQL to interact with these databases. You can find some basic, interactive tutorials at W3Schools.
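As a minimal sketch of using SQL from Python - here with Python's built-in sqlite3 module and made-up table and column names - a query result can be loaded straight into a Pandas data frame:
import sqlite3
import pandas as pd

# Create a small throwaway SQLite database in memory
# (the table and column names are made up for illustration)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timetable (subject TEXT, teacher TEXT, day TEXT, time_start INTEGER, time_end INTEGER)"
)
conn.executemany(
    "INSERT INTO timetable VALUES (?, ?, ?, ?, ?)",
    [
        ("Maths", "Mr F", "Monday", 1000, 1200),
        ("English", "Ms P", "Tuesday", 1100, 1300),
    ],
)

# Run a SQL query and load the result into a Pandas data frame
df = pd.read_sql("SELECT subject, teacher FROM timetable WHERE day = 'Monday'", conn)
print(df)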
NoSQL Databases¶
NoSQL databases (aka “not only SQL”) are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model; the main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads (see https://www.mongodb.com/nosql-explained).
Some common NoSQL Databases include:
Elasticsearch (document)
MongoDB (document)
DynamoDB (key-value)
Gaffer (graph)
Document example:
{
    "name": "Jane",
    "dob": "2012-04-23T18:25:43.511Z",
    "interests": ["running", "swimming"],
    "counter": 104
}
Key-Value example:
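A rough sketch of the key-value model in Python terms (the keys and values below are made up): each record is just a unique key mapped to a value, much like a dictionary.
# Key-value data: each record is a unique key mapped to an arbitrary value
store = {
    "user:1042": {"name": "Jane", "interests": ["running", "swimming"]},
    "user:1043": {"name": "Sam", "interests": ["cycling"]},
}
print(store["user:1042"]["name"])  # look up a record directly by its key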
Graph example:
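A rough sketch of the graph model in Python terms (again with made-up data): data is stored as nodes and the edges, or relationships, between them.
# Graph data: nodes plus the edges (relationships) connecting them
nodes = ["Jane", "running", "swimming"]
edges = [("Jane", "likes", "running"), ("Jane", "likes", "swimming")]
for source, relation, target in edges:
    print(f"{source} --{relation}--> {target}")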
Unfortunately, different types of database often have their own query languages beyond SQL - more to learn!
Where Is the Data?¶
Databases can exist on your local machine or can be hosted elsewhere. Typically, we’ll want to host a database somewhere that everyone who needs access can reach.
Many cloud providers offer database solutions with easy-to-use interfaces, allowing us (for a cost!) to worry less about exactly how the DBMS is working and focus, instead, on the data itself.
Which Database to Use¶
Which type of database you choose should be driven by your data and how you plan to use it. You may find that your data can be expressed in more than one of the forms above or as a table. You then need to weigh up different factors to decide which form and which database, if any, to use.
Some factors you may consider (unordered!):
Ease of use
Cost
Footprint on disk
Footprint in memory
Retrieval/update speed
Many blogs and articles have been written on the choice of database, for example this xplenty article.
Application Programming Interfaces (APIs)¶
Application programming interfaces (APIs) are services that allow you to programmatically request information, including datasets, from a remote server. Large, digitally mature organisations commonly have APIs available to access their data, including GitHub, Twitter and the Office for National Statistics, for example. Advantages of APIs include:
It’s possible to make queries to request a tailored subset of a dataset, so you only get what you’re interested in.
It’s straightforward to automate making many similar queries (i.e., to request many different datasets) with different parameters.
If a dataset is updated regularly (maybe you want to know the current weather at a location, for example), APIs provide a robust way to get the latest status of what you’re interested in.
They usually should provide a reproducible way for others to download the same data you’re using.
Most APIs follow the REST style, which we briefly introduce here, but you may also come across GraphQL APIs. This blog has an overview of both types.
Datamuse API¶
As an example, we’ll try the Datamuse API, which can be used to find relationships between words and terms (similar words, words that rhyme and so on). You can find documentation for it here: https://www.datamuse.com/api/
An API query has four main components: a base URL, an endpoint, parameters, and headers:
Base URL: The address of the server hosting the data we want to access - all our API queries will start with this. For the Datamuse API it’s: https://api.datamuse.com
Endpoint: Each API might have multiple types of data we can query, known as different “endpoints”. The path to the endpoint is added to the base URL. The Datamuse API has two endpoints: /words, for searching for relationships (of different types) between words, and /sug, which can be used for auto-completion/spelling correction.
Parameters: Define what data subset we’re interested in and are provided in the format name=value. These are added to the query after a ? following the endpoint. To specify multiple parameters, separate each parameter name and value pair with an &. The available parameters should be listed in the API’s documentation (here for Datamuse). The parameter string ?rel_jjb=dog&max=5 defines two parameters and values:
rel_jjb=dog: Find adjectives commonly used to describe the noun “dog”.
max=5: Return 5 results.
Headers: Provide additional context/metadata about your request, the most common being authentication information. Authentication may be used to only give you access to resources you have permission to see, limit the number of requests you can make within a certain amount of time, or for billing purposes (if the API isn’t free).
The Datamuse API doesn’t require authentication, so we don’t need to add any headers here. If the API you’re using requires it they should provide documentation that describes what you need to do - here is an example from GitHub, and here is how to add headers in the Python requests library.
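For illustration only (the URL, token, and header name below are placeholders - the exact scheme depends on the API you’re using), adding an authentication header with the requests library might look like this:
import requests

# Placeholder URL and token: check your API's documentation for the
# header name and authentication scheme it actually expects
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}
r = requests.get("https://api.example.com/data", headers=headers)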
Full query: Adding together all the components, we have the full API query: https://api.datamuse.com/words?rel_jjb=dog&max=5
You can open this in a web browser and should see something like this:
[
{"word": "little", "score": 1001},
{"word": "old", "score": 1000},
{"word": "hot", "score": 999},
{"word": "big", "score": 998},
{"word": "black", "score": 997}
]
We get the top 5 words describing “dog” and a score that indicates how strong the association is.
The most common format for data returned by an API is JSON (JavaScript Object Notation), which comes from the JavaScript language and is widely used in web development. JSON data has a similar structure to Python dictionaries or lists of Python dictionaries. Python has a built-in library, json, for converting data to and from the JSON format (see here), and Pandas also has a function for creating a DataFrame from a JSON file (pandas.read_json).
We can make the same API query in Python using the requests library, as follows:
import requests

# Query the /words endpoint, passing the parameters as a dictionary
url = "https://api.datamuse.com/words"
params = {"rel_jjb": "dog", "max": 5}
r = requests.get(url, params=params)
Note that we can define the parameters in a dictionary, which is much easier to read than the raw format in the query string seen earlier. To check whether the request worked you can check the status code:
print(r.status_code)
200
Codes in the 200s usually indicate a successful query; for the meanings of other codes see here. Alternatively, print(r.content) may give you more information about what happened.
We can convert the result into a list of dictionaries as follows:
result_list = r.json()
print(result_list)
[{'word': 'little', 'score': 1001}, {'word': 'old', 'score': 1000}, {'word': 'hot', 'score': 999}, {'word': 'big', 'score': 998}, {'word': 'black', 'score': 997}]
And we can interact with that list in the usual Python way:
print(result_list[0]["word"])
little
You can also load an API query directly into a Pandas DataFrame:
import pandas as pd
df = pd.read_json("https://api.datamuse.com/words?rel_jjb=dog&max=5")
df
word | score | |
---|---|---|
0 | little | 1001 |
1 | old | 1000 |
2 | hot | 999 |
3 | big | 998 |
4 | black | 997 |
But this may not work well if your query returns a more complex data structure - in that case it’s best to start with the requests library.
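For instance, here is a sketch (with made-up nested records standing in for an API response) of how pandas.json_normalize can flatten nested JSON into columns:
import pandas as pd

# Made-up nested records, similar to what a more complex API might return
records = [
    {"word": "little", "score": 1001, "meta": {"source": "example", "rank": 1}},
    {"word": "old", "score": 1000, "meta": {"source": "example", "rank": 2}},
]

# json_normalize flattens nested fields into columns such as "meta.source"
df = pd.json_normalize(records)
print(df.columns.tolist())  # ['word', 'score', 'meta.source', 'meta.rank']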
Bear in mind it’s likely the service you’re using will limit both the rate of queries you can make, and the amount of data returned per query. If you want more data than can be returned by one query, an API will usually provide a way to get the data with multiple queries - this is known as “pagination” (see this blog post for example). If you’re making many queries you may need to limit their rate in your script to avoid breaching the usage rules of the API - a quick way to do that would be to use the time.sleep function in Python.
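As a rough sketch (the URL and the page/per_page parameter names are hypothetical, not a real API), looping over pages with a polite delay might look like this:
import time
import requests

# Hypothetical paginated API: the URL and parameter names are placeholders
url = "https://api.example.com/records"
all_records = []
for page in range(1, 4):  # request the first 3 pages
    r = requests.get(url, params={"page": page, "per_page": 100})
    if r.status_code != 200:
        break  # stop if a request fails
    all_records.extend(r.json())
    time.sleep(1)  # pause between queries to stay within rate limits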
This “public-apis” repository on GitHub maintains a list of open APIs that should be good for learning purposes: https://github.com/public-apis/public-apis. The UK government also maintains a list of APIs providing data on many different areas of government & life in the UK: https://www.api.gov.uk/
Exercise¶
Use the Datamuse API to find 3 words that rhyme with “cat” and load them into a Pandas data frame. You’ll need to use the Datamuse documentation to find which parameter will search for rhyming words.
Solution:
The relevant parameter is rel_rhy, and the query and pandas function needed is pd.read_json("https://api.datamuse.com/words?rel_rhy=cat&max=3").
Image Data¶
Images are commonly represented as an n-dimensional tensor of pixel values.
Black and White¶
A simple form of this can be shown with a black and white image. A 13x13 pixel smiley face, for example, can be represented by a 2D tensor (width, height) of 0s and 1s. Black and white images can be thought of as a binary off/on for each pixel, as in the sketch below.
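As a minimal sketch (a made-up image rather than the smiley itself), a tiny black and white image can be stored as a 2D NumPy array of 0s and 1s:
import numpy as np

# A tiny made-up black and white image: 0 = black pixel, 1 = white pixel
image = np.array([
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
])
print(image.shape)  # (rows, columns) - here 4 x 5 pixels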
Grayscale¶
If we want to move to grayscale - allowing shades of gray between black and white - we can do this with intermediate pixel values. Here black becomes 0 and white becomes 255 (an 8-bit representation), and shades of gray are everything in between.
RGB¶
We can introduce colour using an RGB (Red Green Blue) representation. Here, we store the red, green, and blue values separately - these individual representations are known as channels. We now use a 3D tensor to represent this image (width,height,channels).
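A minimal sketch with Pillow and NumPy (the file path is a placeholder) showing an RGB image as an array of pixel values:
import numpy as np
from PIL import Image

# Load an image (placeholder path) and convert it to an array of pixel values
img = Image.open("data/my_image.png").convert("RGB")
pixels = np.asarray(img)

print(pixels.shape)  # (height, width, 3): one channel each for red, green and blue
print(pixels.dtype)  # uint8: 8-bit values from 0 to 255
print(pixels[0, 0])  # the RGB values of the top-left pixel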
Libraries¶
Some popular libraries for processing and analysing image data in Python include:
opencv-python: OpenCV (Open-source Computer Vision) packages for Python. Contains hundreds of computer vision algorithms.
Pillow: PIL (Python Imaging Library) fork. “This library provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities”.
torchvision: Part of the wider PyTorch project. “The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision”.
tf.keras.preprocessing: Part of the wider tensorflow/keras ecosystem, the preprocessing module provides preprocessing utilities for image data.
Text Data¶
Text data is common in data science applications, anything from analysing Tweets and social media to documents and novels. We’ll cover some general techniques for handling text data in this module, but we won’t delve into the details of the wider field of “Natural Language Processing” (NLP). Some popular libraries and resources for processing and analysing text data in Python include:
NLTK: Well-established natural language processing toolkit for Python, offering a wide range of text processing techniques in many different languages.
Spacy: A more modern alternative to NLTK, offering higher performance, state of the art algorithms, and better integration with modelling frameworks in some cases.
Scikit-Learn: Has common NLP preprocessing functions which can be integrated with modelling pipelines using many different machine learning algorithms.
Hugging Face: For implementations of state-of-the-art deep learning models from research.
Textract: Text data may appear in many file formats other than plaintext files (.txt or similar), such as in PDFs, Word documents, or even within images. Textract provides a single interface for extracting text data from many different formats.
Other¶
Audio: commonly stored as .wav or .mp3 and displayed as a waveform - digitised audio can be used for things like training speech recognition models. Lots of crossover with signal processing! Librosa is a popular Python package for working with audio.
Video: can be thought of as many images + audio!
Geospatial: data relating to any location on the Earth’s surface. The geopandas library combines the capabilities of shapely and Pandas to make working with this datatype easier.
Time Series: Data that can be expressed as observations over time, e.g., stock price data. Honourable mention to sktime for analysis.
XML (Extensible Markup Language): We recommend using a library such as BeautifulSoup to help parse this heavily structured format.
Web scraping: extracting structured data from web sites. Scrapy is a popular library here.