Welcome!

Welcome to an Introduction to Research Data Science, developed by The Alan Turing Institute’s Research Engineering Group.

Introduction

Data science methods and tools have become commonplace in research projects across academia, government and industry. Researchers increasingly need to collaborate with multi-disciplinary teams of data scientists, software engineers and other stakeholders.

The goal of this course is to introduce how you can use data science principles to tackle real, complex, and sometimes vaguely defined research data science projects. The course is not a handbook of data science methods. Rather, the focus is on how to begin using these methods in collaborative research projects, with an emphasis on awareness of ethics and diversity issues.

Who?

We are a group of data scientists and software engineers who work on a wide range of research problems.

You are someone interested in learning about, or using, data science methods in research. To follow the whole course you will need some basic programming experience; see Prerequisites for more information.

Course materials

This free and open course is primarily the Jupyter Book you’re reading. You can work through the material by yourself; see the Syllabus for an overview of the modules.

Some tips on how to use this course:

  • You will get a lot out of simply reading the online course book. However, the course is built from executable Jupyter notebooks that you can run yourself, and we encourage learners to try the hands-on sections, where we tackle a real research data science problem. Visit the README page to set up your computer to follow along.

  • There are some benefits to reading the course in order. The same dataset is used throughout the modules, especially in the hands-on sessions. However, much of the material is self-contained and can be read independently.

  • If you are a self-learner and have questions, comments, ideas or issues, please use: RDS-Course Issues

  • There is also a synchronous, taught version of the course, in which each module is split into a half-day taught session and a half-day hands-on session.

Syllabus

Module 1: Intro to Data Science

Taught session:

  • What data science and research data science are, with an overview of the variety of cultures within them.

  • Stages in a data science project and common issues when scoping a project.

  • Intro to EDI for data science.

  • How to work collaboratively in data science projects.

Hands-on session:

  • Scope a research data science project using a real-world, individual-level survey dataset, including discussing the research question and EDI issues, and setting up a collaborative GitHub repo.

Module 2: Handling data

Taught session:

  • Data wrangling, cleaning and provenance.

  • Handling missing data.

  • Data access: SQL, APIs.

  • Data privacy and security.

Hands-on session:

  • Explore, pre-process and clean the dataset from Module 1. Discuss and decode various complexities (e.g. missing/ambiguous values, bias in data collection, data privacy and sensitivity).

Module 3: Data visualisation

Taught session:

  • Figures gone wrong.

  • Rules of the data visualisation game.

  • Atlas of visualisations.

  • Storytelling with data visualisation.

  • Data visualisation for data exploration.

Hands-on session:

  • Build visualisations to understand the dataset from Modules 1 and 2 using material from the taught sessions, and explore the relationships between variables and their importance.

Module 4: Modeling

Taught session:

  • The what and why of Statistical Modeling.

  • Inside a Model.

  • Building a Model.

  • Evaluating and Validating Models.

Hands-on session:

  • Build your own model based on the knowledge acquired so far about the dataset and the techniques taught in this module. Improve upon the baseline, interpret the results to answer the research questions, and discuss limitations and alternative approaches.

Prerequisites

There is no code in Module 1. Students will get more out of Modules 2-4 if they:

This course complements the Turing’s Research Software Engineering with Python course.

Disclaimer

The work and materials here are developed by a group of [research] data scientists and software engineers from diverse backgrounds. Many of the topics, examples and work discussed here are biased towards our own experiences. As such, our definitions and understandings of certain words, phrases, or methodologies may differ from others’. We do not claim to be a definitive authority, and we welcome open discussion and feedback.

Acknowledgement

This work was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/W006022/1 & The Alan Turing Institute.