2.1.1 Where to find data?

What is Open Data?

We can use the definition from The Turing Way:

Open data is freely available on the internet. Any user is permitted to download, copy, analyse, re-process, and re-use it for any other purpose with minimal financial, legal, and technical barriers.

The benefits of using an open dataset include (but are not limited to):

  • Access to the data is free and usually fast (without requiring a registration/approval process).

  • Other researchers may have used the same dataset and published guidelines, code and other details about the dataset that are also in the public domain, so we can benefit and draw inspiration from their prior work.

  • Transparency & reproducibility: If the dataset we’re using is open we can make all our research (data, code & publications) open as well, so others can easily reproduce and contribute to our work.

The Open Data Handbook has a longer discussion of the benefits of open data.

Even if a dataset is public, we must still evaluate whether it is ethical to use and check its licensing and legal requirements. Not all public datasets are completely “open”!

Sources of Open Data

There are many sources of data online. Below are a few ideas for places to look.

Tailored for Data Science

Some datasets have been heavily used for data science applications, and are available in easy-to-use formats that may already be cleaned/pre-processed for you. Many popular machine learning libraries like PyTorch have datasets built-in, for example. These are excellent sources for finding datasets to learn, test and benchmark algorithms or other analysis techniques, but are less likely to be rich sources for novel research projects.

Countries/Governments

Many governments commit to publishing data in the open for public interest and transparency. The datasets might be less “data science ready”, but they cover a broad range of topics.

Organisations

Large humanitarian organisations often make data available, such as:

General

General tools and repositories that contain data across many different domains:

  • Google Dataset Search

  • GitHub: Although large datasets can’t be stored on GitHub, there are many smaller datasets to be found in GitHub repositories. There are also community-maintained lists of interesting datasets, e.g., awesome-public-datasets.

  • Zenodo: Combined repository for open data, papers, and code.

  • FAIRsharing: A catalogue of databases across many different domains.

When Open Data Isn’t Available

Starting your project with an open dataset is always preferable when you can. Even if the perfect dataset is not openly available, it may be worth first prototyping with related data that is open. For example, it may be that an older version of the data you’re interested in has been made public. You can continue to explore options for getting the ideal data in parallel, but gaining data access is frequently an expensive or time-consuming process.

Two common reasons open data may not be available or appropriate are:

  1. The data is commercially sensitive or valuable.

  2. The data presents a privacy or security risk.

Access to detailed healthcare records, for example, is often heavily restricted, even after personal identifiers have been removed, because of the risk of re-identification. In August 2016 the Australian government openly published a de-identified medical billing dataset, but one month later researchers at the University of Melbourne demonstrated it was possible to re-identify individuals, and the data was taken offline.

Options for finding a non-open dataset include:

Ask!

Although open data may not be available for your project, a collaborator, someone else at your institute or company, or someone in the wider community could have something relevant. However, even if they’re willing to share it, you must check what the conditions are for access and usage of the data and get advice where necessary. Always err on the side of caution, especially if any of the data relates to living individuals. Alternatively, you may find someone else who’s interested in the same data, and you could join forces (or budgets) to get it.

Paywalled/Restricted Access

Getting access to a dataset that’s behind closed doors is likely to involve a registration or application process and may include a fee. Bear in mind that data can be expensive, and could easily cost thousands of pounds. If the application is approved, the resulting contract or research agreement may specify precisely which data you can have access to (down to the level of individual fields), who will have access, the duration of access, and exactly what you’re allowed to do with it.

As an example, this website describes the process for accessing the UK Biobank, a large biomedical database for health research.

Creating Your Own Dataset

Ultimately the data you need might not be available anywhere, in which case the only option could be to collect it yourself. Designing datasets is not the focus of this course, but if you’re making your own, remember you’ll be the one analysing it! Investing time in thinking about how your data will be structured, how you’ll deal with missing values, and the many other issues common in datasets will save a lot of time later. You must also carefully consider whether it is ethical to collect the data and whether you have approval from your organisation to do so.
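As a minimal sketch of planning structure up front, the Python snippet below defines a schema for a hypothetical survey dataset and makes the missing-value convention explicit. The field names (participant_id, age, response_date) and the choice to store missing values as None are illustrative assumptions, not recommendations from this course.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SurveyRecord:
    """One row of a hypothetical survey dataset (all field names are illustrative)."""
    participant_id: str            # required: every row must be identifiable
    age: Optional[int]             # None means "not recorded", never 0 or -1
    response_date: Optional[date]  # None means "not recorded"


def parse_row(raw: dict) -> SurveyRecord:
    """Convert a raw row of strings into a typed record.

    Empty strings are treated as missing values and stored as None.
    """
    participant_id = raw.get("participant_id", "").strip()
    if not participant_id:
        raise ValueError("participant_id is required")

    age_text = raw.get("age", "").strip()
    date_text = raw.get("response_date", "").strip()
    return SurveyRecord(
        participant_id=participant_id,
        age=int(age_text) if age_text else None,
        response_date=date.fromisoformat(date_text) if date_text else None,
    )


# Two made-up rows: one complete, one with missing values.
rows = [
    {"participant_id": "p001", "age": "34", "response_date": "2021-03-02"},
    {"participant_id": "p002", "age": "", "response_date": ""},
]
records = [parse_row(row) for row in rows]
missing_age = sum(record.age is None for record in records)
print(f"{missing_age} of {len(records)} records have no age recorded")
```

Deciding on one explicit representation for missing values before collection starts avoids having to untangle a mix of blanks, zeros and placeholder codes during analysis.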

Assessing Dataset Quality and Suitability

In Module 1 (Section 1.2, question 3 for scoping projects) we gave these overarching questions for evaluating a dataset:

  • Does the dataset contain what’s needed to solve the research question?

  • Can I legally and ethically use the data?

  • Is the data easily accessible?

  • Is the dataset well-understood and tested?

  • Are the data quality and quantity appropriate?

Another useful concept here is data readiness levels:

  • Band C: Accessibility

    • C4: You believe the data may exist, but haven’t seen/verified that it does.

    • C1: The data is ready to be loaded into an analysis: You have access, it’s in an appropriate format, and you have (both ethical and legal) permission to use it.

  • Band B: Faithfulness & Representation

    • B1: The data has been used in an exploratory analysis, and you have verified the contents match what you expected. You understand any limitations (e.g. how missing values were treated).

  • Band A: Data in Context

    • A1: The data has been prepared and is suitable to answer a specific research question.

In this module we cover many of the steps needed to take data from Band C to Band B: we start with data we know exists but don’t yet know how to analyse, and finish by beginning an exploratory analysis.
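The readiness levels can also be treated as a simple checklist. The sketch below is one hypothetical way to encode the four levels listed above in Python. It assumes the levels are strictly cumulative (B1 requires C1, and A1 requires B1) and collapses everything short of C1 into a single "below C1" result, which is a simplification rather than part of the readiness level definitions. The class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetStatus:
    """Answers to the readiness criteria listed above (a simplified checklist)."""
    have_access: bool = False
    appropriate_format: bool = False
    permission_to_use: bool = False      # both ethical and legal
    explored_and_verified: bool = False  # contents match expectations, limitations understood
    prepared_for_question: bool = False  # suitable for a specific research question


def readiness_level(status: DatasetStatus) -> str:
    """Return the highest listed level whose criteria are all met."""
    c1 = status.have_access and status.appropriate_format and status.permission_to_use
    b1 = c1 and status.explored_and_verified
    a1 = b1 and status.prepared_for_question
    if a1:
        return "A1"
    if b1:
        return "B1"
    if c1:
        return "C1"
    return "below C1"


# A dataset we can load and are allowed to use, but have not yet explored.
status = DatasetStatus(have_access=True, appropriate_format=True, permission_to_use=True)
print(readiness_level(status))  # prints "C1"
```

Recording these answers explicitly at the start of a project makes it easier to see which steps remain before the data can support a specific research question.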

For evaluating all of these, data documentation is essential. See The Turing Way and this Radboud University article for details of what good documentation should contain.

Sharing Data

If your project is working with, or has generated, a dataset, consider whether you can publish it with an open licence. The community can then benefit from all the advantages of open data we’ve talked about! It’s also becoming more common for funders and journals to require code and data to be published with papers.

We won’t discuss this here, but the Sharing and Archiving Data chapter in The Turing Way is a great place to start.