1.1 What is [research] data science?

We use the term Data Science to refer to a varied ensemble of practices, methodologies, and tools that may be used to learn from or about data.

Research Data Science is a broader term, which we sometimes use to describe the work done in the research engineering group at Turing. It includes the additional challenges encountered when applying data science in a collaborative research environment.

Data is critical to data science. In this module we discuss:

  • key considerations when faced with a dataset

  • the main components of data science and the role that data science plays in research.

Data

The availability of large-scale datasets, for example for training face recognition algorithms or language models, is a fairly new phenomenon. Creating, refining, and making data consumable takes substantial effort, and acquiring data remains complex (e.g., digitisation).

These processes should not be taken for granted. Two important considerations to keep in mind when faced with a dataset are how the data was created, and how representative the data is.

Data Creation

Data is not a natural resource; it is always the product of human decisions and actions. Data is often the outcome of an enormous amount of labour, resources, infrastructure, and time.

Example data creation tasks are:

  • collecting information or tracking historical information

  • organising information in specific categories

  • measuring and storing information as data on digital infrastructure

Whenever a dataset is enriched with metadata or specific labels, keep in mind that someone has provided them. Even if the labels you see were assigned automatically, the tool that assigned them will have been trained on manually assigned labels.

For example, the famous ImageNet dataset, a central component in the development of many well-known image recognition pipelines, relies on two pillars:

  • a taxonomy developed since 1985 as part of the lexical database WordNet, which provides a top-down hierarchical structure of concepts (“chair” is under artifact->furnishing->furniture->seat->chair)

  • an enormous amount of cheap workforce provided by Amazon Mechanical Turk.

[Image: ImageNet]

The above ImageNet image, taken from Crawford and Paglen's essay Excavating AI, is the result of a gigantic effort and reflects the specific representation of the world held by: 1) the people who designed WordNet, 2) the researchers who decided which WordNet categories are included in ImageNet and which are not, and 3) the many, many annotators who selected which images to associate with concepts like "brother", "boss", "debtor", "drug-addict" and "call girl", all included in both WordNet and ImageNet (at least until 2019).

Data Representativeness

The data we have available is just a sample and not the complete story. The question we should ask ourselves is:

  • “What can these data points tell us about the wider phenomenon that we’re really interested in?”

In an article by Anna Rogers (2021), the author considers the following argument:

“the size of the data is so large that, in fact, our training sets are not a sample at all, they are the entire data universe”.

Rogers replies that this argument would stand if the “data universe” used to train, for instance, a speech recognition system were the same as “the totality of human speech/writing”. It is not, and will hopefully never be, because collecting all speech is problematic for ethical, legal, and practical reasons (e.g., not all tweets are archived).

Anything less than the whole data universe is a sample. Given existing social structures, no matter how big that sample is, it is not representative, due to (amongst other things) unequal access to technology, unequal ability to defend one’s privacy and copyright, and limited access to the huge volumes of speech produced on “walled garden” platforms like Facebook.
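The point that sample size does not buy representativeness can be sketched with a small simulation. Everything in it is hypothetical: the population, the two groups ("online" and "offline"), their proportions, and the numeric trait being measured are invented for illustration and do not come from Rogers' article.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 30% "online" and 70% "offline" people, with a
# numeric trait whose average differs between the two groups.
population = (
    [("online", random.gauss(60, 10)) for _ in range(30_000)]
    + [("offline", random.gauss(40, 10)) for _ in range(70_000)]
)
true_mean = statistics.fmean(value for _, value in population)

# Large but biased sample: only people who are online can be collected.
online_values = [value for group, value in population if group == "online"]
biased_sample = random.sample(online_values, 20_000)

# Small but random sample drawn from the whole population.
random_sample = [value for _, value in random.sample(population, 200)]

biased_error = abs(statistics.fmean(biased_sample) - true_mean)
random_error = abs(statistics.fmean(random_sample) - true_mean)

print(f"population mean:     {true_mean:.1f}")
print(f"biased sample error: {biased_error:.1f} (n=20,000)")
print(f"random sample error: {random_error:.1f} (n=200)")
```

Under these assumptions, the sample that is one hundred times larger gives the worse estimate, because the way it was collected excludes most of the population.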

Data Science

Data science allows us to learn about data. However, terms commonly associated with data science - e.g., AI, deep learning - are only the tip of the iceberg. They rely on many steps that are briefly introduced here, and built upon throughout the course.

The data science hierarchy of needs

Monica Rogati's representation of the "Data Science hierarchy of needs".

Collection and Storage

Part of the work of a data scientist is knowing the challenges and hurdles involved in data collection and storage (e.g., our Living with Machines project).

It is essential that we know who owns the data, what restrictions apply, and how a resource should be stored for long-term preservation and made available to collaborators (e.g., Turing’s Data Safe Haven).

In small teams, data scientists often take on these responsibilities directly. These topics are explored more in Module 2: Getting and Loading Data.

Wrangling

Commonly, we need to transform a freshly obtained dataset into a more analysis-ready state. We might need to remove inconsistencies and inaccuracies (a process called data cleaning), combine data sources, or otherwise change the data into our desired format.

These tasks are collectively called data wrangling. Although wrangling is often the largest part of a data scientist’s work, it is frequently undervalued. We will explore this topic more in Module 2: Exploring and Wrangling.
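The cleaning and combining steps described above can be sketched in plain Python. The records, field names, station names, and site types below are all made up for illustration; real projects would more likely use a dedicated library, but the steps are the same.

```python
# Hypothetical raw records from one source: inconsistent casing, stray
# whitespace, a missing value and a duplicate row.
raw_readings = [
    {"station": " Station A", "no2": "41.5"},
    {"station": "station a ", "no2": "41.5"},  # duplicate after cleaning
    {"station": "STATION B", "no2": "n/a"},    # missing measurement
    {"station": "Station C", "no2": "28.0"},
]

# Hypothetical second source, keyed by normalised station name.
site_types = {"station a": "roadside", "station c": "background"}


def clean(records):
    """Normalise names, parse numbers, drop unusable and duplicate rows."""
    seen = set()
    cleaned = []
    for record in records:
        station = record["station"].strip().lower()
        try:
            no2 = float(record["no2"])
        except ValueError:
            continue  # skip rows whose measurement cannot be parsed
        if (station, no2) in seen:
            continue  # skip exact duplicates
        seen.add((station, no2))
        cleaned.append({"station": station, "no2": no2})
    return cleaned


def combine(records, lookup):
    """Attach the site type from the second source; None when unknown."""
    return [{**r, "site_type": lookup.get(r["station"])} for r in records]


tidy = combine(clean(raw_readings), site_types)
for row in tidy:
    print(row)
```

Note that each step embeds a decision (dropping unparseable rows, treating names case-insensitively) that should be recorded, since it shapes every later analysis.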

Exploring and Visualisation

For many disciplines the availability of large datasets is unprecedented. This creates opportunity, but defining new research questions or business goals is complex.

Understanding a dataset well can unlock opportunities for new insights and help address specific research questions.

  • For example, a dataset combining aggregate GPS activity and traffic monitoring data has been used successfully by two Turing projects: the London Air Quality project, and Project Odysseus, which monitored the level of activity during the pandemic.

A powerful tool to develop understanding is data visualisation. This will be the focus of Module 3.

Modelling

Modelling is the core activity of data scientists.

Though, as we have seen, we might spend more time on other activities, the goals of modelling are present throughout. We build models with a specific goal in mind (more in Module 4). This begins with the project scoping process (see 1.2 Research Data Science Project Lifecycle), specifying a research question, task, and measures of success.

When modelling, we are often comparing what works “best” (which can also mean most reliably) in a given setting, rather than improving over a given state-of-the-art (which may be the job of an expert researcher).
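Comparing candidate models against a measure of success fixed in advance can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic, and the two candidates (a mean baseline and a straight-line fit) and the metric (mean absolute error on held-out data) are choices made for this example only.

```python
import random
import statistics

random.seed(1)

# Synthetic data: a noisy linear relationship (invented for illustration).
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 1.0)) for x in xs]

# Hold out the last quarter of the data for evaluation.
train, held_out = data[:150], data[150:]


def fit_mean(points):
    """Baseline: always predict the mean of the training targets."""
    mean_y = statistics.fmean(y for _, y in points)
    return lambda x: mean_y


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(x for x, _ in points)
    mean_y = statistics.fmean(y for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_absolute_error(model, points):
    """Measure of success chosen before modelling: lower is better."""
    return statistics.fmean(abs(model(x) - y) for x, y in points)


candidates = {"mean baseline": fit_mean(train), "linear fit": fit_line(train)}
scores = {name: mean_absolute_error(m, held_out) for name, m in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: MAE = {score:.2f}")
print("best on held-out data:", best)
```

Including a trivial baseline is a deliberate choice: it shows whether a more complex model is earning its complexity in this particular setting.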

Data-Driven Science

The availability of data and the expanding toolbox of data science techniques are changing science.

Even as far back as 2009, it was suggested that a fourth scientific paradigm was emerging.

As Bell, Hey and Szalay (2009) said in a famous article in Science, for a long time "scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm..." They go on to say that "a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science".


And one author even claimed the end of theory!

Chris Anderson's 2008 Wired article contained provocative statements like "Petabytes allow us to say: ‘Correlation is enough.’ We can analyse the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all."

Here, and in similar articles (e.g., Kitchin, 2014), we see the following ideas:

  • Big Data can capture a whole domain and provide full resolution.

  • There is little need for a priori theory, models, or hypotheses.

  • Through the application of agnostic data analytics the data can speak for themselves free of human bias or framing, and any patterns and relationships within Big Data are inherently meaningful and truthful.

  • Meaning transcends context or domain-specific knowledge, and thus can be interpreted by anyone who can decode a statistic or data visualization.

Developing a Critical Mindset

In the intervening decade there has been a lot of discussion of the shortcomings of Big Data. Our perception of data science in society and research has changed. Our understanding of the biases embedded within data, which are passed on to trained models, has increased.
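One reason to doubt that patterns in large datasets are inherently meaningful is that searching many variables turns up strong correlations by chance alone. The sketch below is an illustration only: every number is independent random noise generated for this example, and the series lengths and threshold are arbitrary choices.

```python
import random

random.seed(2)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# A target series of 10 observations, and 1,000 unrelated candidate series.
# Every value is independent noise, so no real relationship exists.
target = [random.gauss(0, 1) for _ in range(10)]
candidates = [[random.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

correlations = [pearson(target, series) for series in candidates]
strongest = max(correlations, key=abs)
n_strong = sum(abs(r) > 0.6 for r in correlations)

print(f"strongest correlation found: {strongest:+.2f}")
print(f"series with |r| > 0.6: {n_strong} of {len(candidates)}")
```

Without a hypothesis or domain knowledge to say which of these "relationships" could be real, the data alone cannot distinguish them from noise.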

In this course we learn how to approach data, methods, and research questions with a critical mindset. We aim to produce findings that are reliable, reproducible, and context-aware.

In this Turing 'fireside chat', Professor Kate Crawford discusses her book, Atlas of AI, which aims to expose how artificial intelligence is shaping society.

Research data scientist

We have seen that there are many components to data science. As data scientists, we will often:

  • be the central element in the projects we are involved with, connecting data providers, domain experts and final users.

  • ask “why” people want to use data science approaches. We need to be aware of limitations and the societal context of the research.

  • contribute to shaping research directions and guaranteeing reproducibility.

For a brief deep-dive on the role of a data scientist, consider our large-scale Living with Machines project: a five-year study of the Industrial Revolution using a data-driven approach, with over twenty members, currently including five Turing Research Engineering Group members. Our responsibilities have involved:

References

Rogers, A. (2021). Changing the World by Changing the Data. arXiv preprint arXiv:2105.13947.

Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired magazine, 16(7), 16-07.

Bell, G., Hey, T., & Szalay, A. (2009). Beyond the data deluge. Science, 323(5919), 1297-1298.

Crawford, K. (2021). Atlas of AI. Yale University Press.

D’Ignazio, C., & Klein, L. F. (2020). Data Feminism. MIT Press.

Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1(1).

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(1), 1-23.