Tasks

What is a ‘task’?

The concept of a task is central to DeepSensor. It originates from the meta-learning literature in machine learning and has a specific meaning.

Readers unfamiliar with the notation and terminology of meta-learning may find the following primer helpful.

Sets of observations

A set of observations is a collection of \(M\) input-output pairs \(\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_M, \mathbf{y}_M)\}\). In DeepSensor, \(\mathbf{x}_i \in \mathbb{R}^2\) is a 2D spatial location (such as latitude-longitude) and \(\mathbf{y}_i \in \mathbb{R}^N\) is an \(N\)-dimensional observation at that location (such as temperature and precipitation). Observations may lie at scattered, off-grid locations (such as weather stations) or on a regular grid (such as reanalysis or satellite data). A set can be written compactly as \((\mathbf{X}, \mathbf{Y})\), where \(\mathbf{X} \in \mathbb{R}^{2\times M}\) and \(\mathbf{Y} \in \mathbb{R}^{N\times M}\).

Context sets

A context set is a set of observations used to make predictions for another set of observations. Following the notation above, we denote the \(j\)-th context set as \(C_j=(\mathbf{X}^{(c)}, \mathbf{Y}^{(c)})_j\). There may be multiple context sets, collectively denoted \(C = \{ (\mathbf{X}^{(c)}, \mathbf{Y}^{(c)})_j \}_{j=1}^{N_C}\).

Target sets

A target set is a set of observations that we wish to predict using the context sets. As with context sets, we denote the collection of all target sets as \(T = \{ (\mathbf{X}^{(t)}, \mathbf{Y}^{(t)})_j \}_{j=1}^{N_T}\). During training, the target observations \(\mathbf{y}_i\) are known, but at inference time they are unknown latent variables.

Tasks

A task is a collection of context sets and target sets. We denote a task as \(\mathcal{D} = (C, T)\). The modelling goal is to make probabilistic predictions for the target variables \(\mathbf{Y}^{(t)}_j\) given the context sets \(C\) and the target prediction locations \(\mathbf{X}^{(t)}_j\).
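The notation above can be made concrete with plain NumPy arrays. The sketch below is not DeepSensor code: it builds a toy task by hand, and the helper name `make_observation_set`, the set sizes, and the random values are all assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_observation_set(num_obs, num_dims):
    """Return (X, Y) with X of shape (2, M) and Y of shape (N, M)."""
    X = rng.uniform(0.0, 1.0, size=(2, num_obs))  # 2D locations x_i
    Y = rng.normal(size=(num_dims, num_obs))      # N-dimensional observations y_i
    return X, Y

# Two context sets C_1, C_2 and one target set T_1 (sizes are arbitrary).
context_sets = [make_observation_set(52, 3), make_observation_set(112, 1)]
target_sets = [make_observation_set(245, 2)]

# A task D = (C, T) pairs the context collection with the target collection.
toy_task = (context_sets, target_sets)

for j, (X, Y) in enumerate(context_sets, start=1):
    print(f"C_{j}: X {X.shape}, Y {Y.shape}")
for j, (X, Y) in enumerate(target_sets, start=1):
    print(f"T_{j}: X {X.shape}, Y {Y.shape}")
```

Each set may have its own number of observations \(M\) and its own observation dimensionality \(N\); only the 2D location axis is shared.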

The DeepSensor Task

In DeepSensor, a Task is a dict-like data structure that contains context sets, target sets, and other metadata. Before diving into the TaskLoader class which generates Task objects from xarray and pandas objects, we will first introduce the Task class itself.

First, we will generate a Task using DeepSensor. These code cells are kept hidden because they include features that are only covered later in the User Guide. Only expand them if you are curious!

In the code cell below, task is a Task object. Printing a Task will print each of its entries and replace numerical arrays with their shape for convenience.

print(task)

time: 2016-06-25 00:00:00
ops: []
X_c: [(2, 52), (2, 112)]
Y_c: [(3, 52), (1, 112)]
X_t: [(2, 245)]
Y_t: [(2, 245)]

Task structure

A Task typically contains at least the following entries:

"time": timestamp that was used for slicing the spatiotemporal data.
"ops": list of processing operations that have been applied to the data (more on this shortly).
"X_c" and "Y_c": length-\(N_C\) lists of context set observations \(\mathbf{X}^{(c)}_j \in \mathbb{R}^{2\times M}\) and \(\mathbf{Y}^{(c)}_j \in \mathbb{R}^{N\times M}\), where \(M\) and \(N\) may differ between context sets.
"X_t" and "Y_t": as above, but for the target sets. In the example above, the target observations are known, so this Task may be used for training.

Exercise:

For the task object above, use the "X_c", "Y_c", "X_t", and "Y_t" entries to work out the following (answers below):

The number of context sets
The number of observations in each context set
The dimensionality of each context set
The number of target sets
The number of observations in each target set
The dimensionality of each target set

Answers, respectively: 2 context sets, 52 and 112 context observations, 3 and 1 context dimensions, 1 target set, 245 target observations, 2 target dimensions.
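The same answers can be read off programmatically. The sketch below uses only the shape tuples copied from the printed Task above, so it does not need DeepSensor; the helper name `summarise` is hypothetical.

```python
# Shapes copied from the printed Task above.
X_c = [(2, 52), (2, 112)]
Y_c = [(3, 52), (1, 112)]
X_t = [(2, 245)]
Y_t = [(2, 245)]

def summarise(X_shapes, Y_shapes):
    """Return (number of sets, observations per set, dimensions per set)."""
    num_obs = [x_shape[-1] for x_shape in X_shapes]  # M is the last axis of X
    num_dims = [y_shape[0] for y_shape in Y_shapes]  # N is the first axis of Y
    return len(X_shapes), num_obs, num_dims

print("context:", summarise(X_c, Y_c))  # context: (2, [52, 112], [3, 1])
print("target:", summarise(X_t, Y_t))   # target: (1, [245], [2])
```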

Gridded data in Tasks

For convenience, data that lies on a regular grid is given a compact tuple representation for the "X" entries:

task_with_gridded_data = task_loader("2016-06-25", context_sampling=["all", "all"], target_sampling=245)

print(task_with_gridded_data)

time: 2016-06-25 00:00:00
ops: []
X_c: [((1, 141), (1, 221)), ((1, 140), (1, 220))]
Y_c: [(3, 141, 221), (1, 140, 220)]
X_t: [(2, 245)]
Y_t: [(2, 245)]

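In the above example, the first context set lies on a 141 x 221 grid, and the second context set lies on a 140 x 220 grid. Each gridded "X_c" entry is a tuple of two coordinate arrays, one per grid axis, and the corresponding "Y_c" entries have shapes (3, 141, 221) and (1, 140, 220).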

Task methods

The Task class also contains methods for applying processing operations to the data (such as removing NaNs or adding batch dimensions). These operations are recorded, in the order they were applied, in the "ops" entry of the Task. Operations can be chained together, for example:

print(task.add_batch_dim().convert_to_tensor())

time: 2016-06-25 00:00:00
ops: ['batch_dim', 'tensor']
X_c: [torch.Size([1, 2, 52]), torch.Size([1, 2, 112])]
Y_c: [torch.Size([1, 3, 52]), torch.Size([1, 1, 112])]
X_t: [torch.Size([1, 2, 245])]
Y_t: [torch.Size([1, 2, 245])]
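Chaining of this kind is possible when each method returns a task object. The sketch below mimics the pattern with a hypothetical `ToyTask` class built on NumPy. It is not DeepSensor's implementation: the class, the `_apply` helper, and the `cast_to_float32` method are assumptions for illustration, with a dtype cast standing in for the tensor conversion, and gridded (tuple) entries are not handled.

```python
import numpy as np

class ToyTask(dict):
    """Hypothetical stand-in for a Task: a dict that records applied operations."""

    def _apply(self, fn, op_name):
        new = ToyTask(self)  # shallow copy; the original task is left untouched
        for key in ("X_c", "Y_c", "X_t", "Y_t"):
            new[key] = [fn(arr) for arr in self[key]]
        new["ops"] = self["ops"] + [op_name]  # record the operation in order
        return new

    def add_batch_dim(self):
        # Prepend a length-1 batch axis to every array.
        return self._apply(lambda a: a[np.newaxis, ...], "batch_dim")

    def cast_to_float32(self):
        # Stand-in for a tensor conversion step.
        return self._apply(lambda a: a.astype(np.float32), "float32")

toy = ToyTask(
    ops=[],
    X_c=[np.zeros((2, 52)), np.zeros((2, 112))],
    Y_c=[np.zeros((3, 52)), np.zeros((1, 112))],
    X_t=[np.zeros((2, 245))],
    Y_t=[np.zeros((2, 245))],
)

out = toy.add_batch_dim().cast_to_float32()
print(out["ops"])                     # ['batch_dim', 'float32']
print([a.shape for a in out["X_c"]])  # [(1, 2, 52), (1, 2, 112)]
```

Returning a new object from each method, rather than mutating in place, is a design choice of this sketch; it keeps `toy` unchanged so the same task can be processed in different ways.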

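Gridded data in a Task can be flattened using the .flatten_gridded_data method. In the output below, notice how the gridded "X" entries are now 2D arrays of shape (2, M) rather than tuples of two coordinate arrays of shapes (1, N_1) and (1, N_2), where M = N_1 x N_2 (for example, 141 x 221 = 31161).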

print(task_with_gridded_data.flatten_gridded_data())

time: 2016-06-25 00:00:00
ops: ['gridded_data_flattened']
X_c: [(2, 31161), (2, 30800)]
Y_c: [(3, 31161), (1, 30800)]
X_t: [(2, 245)]
Y_t: [(2, 245)]
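The shape change can be reproduced with NumPy alone. The sketch below is a minimal illustration, not DeepSensor's code: the function name `flatten_grid`, the coordinate values, and the row-major ordering of the flattened points are assumptions, and the library's actual ordering may differ.

```python
import numpy as np

def flatten_grid(x1, x2, Y):
    """Flatten gridded data into off-grid form.

    x1: (1, N1) coordinates along the first grid axis
    x2: (1, N2) coordinates along the second grid axis
    Y:  (N, N1, N2) gridded observations
    Returns X of shape (2, N1 * N2) and Y of shape (N, N1 * N2).
    """
    # Expand the two coordinate vectors into full (N1, N2) coordinate grids.
    X1, X2 = np.meshgrid(x1.ravel(), x2.ravel(), indexing="ij")
    # Row-major flattening keeps X and Y aligned point by point.
    X_flat = np.stack([X1.ravel(), X2.ravel()], axis=0)
    Y_flat = Y.reshape(Y.shape[0], -1)
    return X_flat, Y_flat

# Grid sizes match the first context set in the example above.
x1 = np.linspace(0.0, 1.0, 141)[np.newaxis, :]
x2 = np.linspace(0.0, 1.0, 221)[np.newaxis, :]
Y = np.zeros((3, 141, 221))

X_flat, Y_flat = flatten_grid(x1, x2, Y)
print(X_flat.shape, Y_flat.shape)  # (2, 31161) (3, 31161)
```

After flattening, a gridded set has the same layout as an off-grid set, so downstream code can treat both kinds uniformly.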