Tasks#
What is a ‘task’?#
The concept of a task is central to DeepSensor. It originates from the meta-learning literature in machine learning and has a specific meaning.
Users unfamiliar with the notation and terminology of meta-learning may find the following primer useful.
Sets of observations
A set of observations is a collection of \(M\) input-output pairs \(\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_M, \mathbf{y}_M)\}\). In DeepSensor, \(\mathbf{x}_i \in \mathbb{R}^2\) is a 2D spatial location (such as latitude-longitude) and \(\mathbf{y}_i \in \mathbb{R}^N\) is an \(N\)-dimensional observation at that location (such as temperature and precipitation). Sets of observations may lie on scattered, off-grid locations (such as weather stations) or on a regular grid (such as reanalysis or satellite data). A set can be compactly written as \((\mathbf{X}, \mathbf{Y})\), where \(\mathbf{X} \in \mathbb{R}^{2\times M}\) and \(\mathbf{Y} \in \mathbb{R}^{N\times M}\).
Context sets
A context set is a set of observations that is used to make predictions for another set of observations. Following the notation above, we denote a context set as \(C_j=(\mathbf{X}^{(c)}, \mathbf{Y}^{(c)})_j\). We may have multiple context sets, denoted as \(C = \{ (\mathbf{X}^{(c)}, \mathbf{Y}^{(c)})_j \}_{j=1}^{N_C}\).
Target sets
A target set is a set of observations that we wish to predict using the context sets. Similarly to context sets, we denote the collection of all target sets as \(T = \{ (\mathbf{X}^{(t)}, \mathbf{Y}^{(t)})_j \}_{j=1}^{N_T}\). During training, the target observations \(\mathbf{y}_i\) are known, but at inference time they are unknown latent variables.
Tasks
A task is a collection of context sets and target sets. We denote a task as \(\mathcal{D} = (C, T)\). The modelling goal is to make probabilistic predictions for the target variables \(\mathbf{Y}^{(t)}_j\) given the context sets \(C\) and target prediction locations \(\mathbf{X}^{(t)}_j\).
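To make the notation concrete, the following is a minimal NumPy sketch of a task with two context sets and one target set. The set sizes, the random values, the helper `make_set`, and the plain dict layout are illustrative assumptions; this is not DeepSensor's own representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_set(M, N):
    """Return one set of observations (X, Y), X in R^{2 x M}, Y in R^{N x M}."""
    X = rng.uniform(0.0, 1.0, size=(2, M))  # M 2D spatial locations
    Y = rng.normal(size=(N, M))             # N-dimensional observation at each
    return X, Y

# Two context sets (N_C = 2) and one target set (N_T = 1); sizes are arbitrary
context_sets = [make_set(M=52, N=3), make_set(M=112, N=1)]
target_sets = [make_set(M=245, N=2)]

# A task D = (C, T) bundles the context sets and the target sets
toy_task = {"C": context_sets, "T": target_sets}

for name, sets in toy_task.items():
    for j, (X, Y) in enumerate(sets, start=1):
        print(f"{name}_{j}: X {X.shape}, Y {Y.shape}")
```

Each set may have a different number of observations \(M\) and a different observation dimension \(N\), which is why the sets are held in lists rather than stacked into a single array.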
The DeepSensor Task#
In DeepSensor, a Task is a dict-like data structure that contains context sets, target sets, and other metadata.
Before diving into the TaskLoader class, which generates Task objects from xarray and pandas objects, we will first introduce the Task class itself.
First, we will generate a Task using DeepSensor. The code cells that do this include features that are only covered later in the User Guide, so there is no need to study them in detail at this stage.
import logging
logging.captureWarnings(True)
import deepsensor.torch
from deepsensor.data import DataProcessor
from deepsensor.data.sources import (
    get_ghcnd_station_data,
    get_era5_reanalysis_data,
    get_earthenv_auxiliary_data,
    get_gldas_land_mask,
)
import matplotlib.pyplot as plt
# Using the same settings allows us to use pre-downloaded cached data
data_range = ("2016-06-25", "2016-06-30")
extent = "europe"
station_var_IDs = ["TAVG", "PRCP"]
era5_var_IDs = ["2m_temperature", "10m_u_component_of_wind", "10m_v_component_of_wind"]
auxiliary_var_IDs = ["elevation", "tpi"]
cache_dir = "../../.datacache"
station_raw_df = get_ghcnd_station_data(station_var_IDs, extent, date_range=data_range, cache=True, cache_dir=cache_dir)
era5_raw_ds = get_era5_reanalysis_data(era5_var_IDs, extent, date_range=data_range, cache=True, cache_dir=cache_dir)
auxiliary_raw_ds = get_earthenv_auxiliary_data(auxiliary_var_IDs, extent, "10KM", cache=True, cache_dir=cache_dir)
land_mask_raw_ds = get_gldas_land_mask(extent, cache=True, cache_dir=cache_dir)
data_processor = DataProcessor(x1_name="lat", x2_name="lon")
era5_ds = data_processor(era5_raw_ds)
aux_ds, land_mask_ds = data_processor([auxiliary_raw_ds, land_mask_raw_ds], method="min_max")
station_df = data_processor(station_raw_df)
from deepsensor.data import TaskLoader
task_loader = TaskLoader(context=[era5_ds, land_mask_ds], target=station_df)
task = task_loader("2016-06-25", context_sampling=[52, 112], target_sampling=245)
In the code cell below, task is a Task object.
Printing a Task will print each of its entries and replace numerical arrays with their shape for convenience.
print(task)
time: 2016-06-25 00:00:00
ops: []
X_c: [(2, 52), (2, 112)]
Y_c: [(3, 52), (1, 112)]
X_t: [(2, 245)]
Y_t: [(2, 245)]
Task structure#
A Task typically contains at least the following entries:
"time": timestamp that was used for slicing the spatiotemporal data.
"ops": list of processing operations that have been applied to the data (more on this shortly).
"X_c" and "Y_c": length-\(N_C\) lists of context set observations \(\mathbf{X}^{(c)}_i \in \mathbb{R}^{2\times M}\) and \(\mathbf{Y}^{(c)}_i \in \mathbb{R}^{N\times M}\).
"X_t" and "Y_t": as above, but for the target sets. In the example above, the target observations are known, so this Task may be used for training.
Exercise:
For the task object above, use the "X_c", "Y_c", "X_t", and "Y_t" entries to work out the following (answers are given below):
The number of context sets
The number of observations in each context set
The dimensionality of each context set
The number of target sets
The number of observations in each target set
The dimensionality of each target set
Answers, respectively: 2 context sets, 52 and 112 context observations, 3 and 1 context dimensions, 1 target set, 245 target observations, 2 target dimensions.
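The same answers can be read off programmatically. The sketch below works on a plain dict holding the shapes printed above rather than on a real Task, and the `summarise` helper is a hypothetical name introduced only for illustration.

```python
# Shapes copied from the printed task above (a stand-in for the real arrays)
shapes = {
    "X_c": [(2, 52), (2, 112)],
    "Y_c": [(3, 52), (1, 112)],
    "X_t": [(2, 245)],
    "Y_t": [(2, 245)],
}

def summarise(x_shapes, y_shapes):
    """Return (number of sets, observations per set, dimensionality per set)."""
    n_sets = len(x_shapes)
    n_obs = [x[-1] for x in x_shapes]   # M is the last axis of each X entry
    n_dims = [y[0] for y in y_shapes]   # N is the first axis of each Y entry
    return n_sets, n_obs, n_dims

context_summary = summarise(shapes["X_c"], shapes["Y_c"])
target_summary = summarise(shapes["X_t"], shapes["Y_t"])
print(context_summary)  # (2, [52, 112], [3, 1])
print(target_summary)   # (1, [245], [2])
```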
Gridded data in Tasks#
For convenience, data that lies on a regular grid is given a compact tuple representation for the "X" entries:
task_with_gridded_data = task_loader("2016-06-25", context_sampling=["all", "all"], target_sampling=245)
print(task_with_gridded_data)
time: 2016-06-25 00:00:00
ops: []
X_c: [((1, 141), (1, 221)), ((1, 140), (1, 220))]
Y_c: [(3, 141, 221), (1, 140, 220)]
X_t: [(2, 245)]
Y_t: [(2, 245)]
In the above example, the first context set lies on a 141 x 221 grid, and the second context set lies on a 140 x 220 grid.
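The tuple form is compact because every location on a regular grid is a combination of one coordinate from each axis, so only the two axes need to be stored. The following NumPy sketch illustrates this for a 141 x 221 grid; the evenly spaced coordinate values and the variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical normalised coordinate axes for a 141 x 221 grid
x1 = np.linspace(0.0, 1.0, 141).reshape(1, -1)  # shape (1, 141)
x2 = np.linspace(0.0, 1.0, 221).reshape(1, -1)  # shape (1, 221)
X_gridded = (x1, x2)

# Three variables observed at every grid point
Y_gridded = np.zeros((3, 141, 221))

n_stored = x1.size + x2.size   # coordinates held by the tuple
n_points = x1.size * x2.size   # grid points those coordinates describe
print(tuple(x.shape for x in X_gridded), Y_gridded.shape)
print(n_stored, n_points)  # 362 31161
```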
Task methods#
The Task class also contains methods for applying processing operations to the data (like removing NaNs, adding batch dimensions, etc.).
These operations are recorded, in the order they were applied, in the "ops" entry of the Task.
Operations can be chained together, for example:
print(task.add_batch_dim().convert_to_tensor())
time: 2016-06-25 00:00:00
ops: ['batch_dim', 'tensor']
X_c: [torch.Size([1, 2, 52]), torch.Size([1, 2, 112])]
Y_c: [torch.Size([1, 3, 52]), torch.Size([1, 1, 112])]
X_t: [torch.Size([1, 2, 245])]
Y_t: [torch.Size([1, 2, 245])]
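The pattern of chaining operations while logging them can be sketched with a toy dict subclass. ToyTask, its `_apply` helper, and the `cast_to_float32` operation are hypothetical and exist only for this illustration; they are not DeepSensor's implementation, and whether the real methods modify the Task in place is not something this sketch makes any claim about.

```python
import numpy as np

class ToyTask(dict):
    """Hypothetical stand-in for a Task: a dict that logs operations in "ops"."""

    def _apply(self, fn, op_name):
        # Apply fn to every array in the X/Y entries and record the operation
        for key in ("X_c", "Y_c", "X_t", "Y_t"):
            self[key] = [fn(arr) for arr in self[key]]
        self["ops"].append(op_name)
        return self  # returning self is what makes chaining possible

    def add_batch_dim(self):
        return self._apply(lambda arr: arr[np.newaxis, ...], "batch_dim")

    def cast_to_float32(self):
        return self._apply(lambda arr: arr.astype(np.float32), "float32")

toy = ToyTask(
    ops=[],
    X_c=[np.zeros((2, 52))], Y_c=[np.zeros((3, 52))],
    X_t=[np.zeros((2, 245))], Y_t=[np.zeros((2, 245))],
)
toy.add_batch_dim().cast_to_float32()
print(toy["ops"])           # ['batch_dim', 'float32']
print(toy["X_c"][0].shape)  # (1, 2, 52)
```

Keeping the log inside the object means downstream code can check which operations have already been applied before deciding whether to apply them again.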
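Gridded data in a Task can be flattened using the .flatten_gridded_data method.
Notice how the "X" entries are now 2D arrays of shape (2, M) rather than tuples of two coordinate arrays of shapes (1, N1) and (1, N2), where M = N1 x N2 (for example, 141 x 221 = 31161 for the first context set).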
print(task_with_gridded_data.flatten_gridded_data())
time: 2016-06-25 00:00:00
ops: ['gridded_data_flattened']
X_c: [(2, 31161), (2, 30800)]
Y_c: [(3, 31161), (1, 30800)]
X_t: [(2, 245)]
Y_t: [(2, 245)]
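Conceptually, flattening expands the two coordinate axes into one (x1, x2) pair per grid point and reshapes the observations to match. The NumPy sketch below shows one way to do this, assuming row-major ordering of the grid points; the ordering used by .flatten_gridded_data itself is not stated here, so treat this as an illustration of the shapes only.

```python
import numpy as np

# Hypothetical gridded context set on a 141 x 221 grid with 3 variables
x1 = np.linspace(0.0, 1.0, 141).reshape(1, -1)  # shape (1, 141)
x2 = np.linspace(0.0, 1.0, 221).reshape(1, -1)  # shape (1, 221)
Y_gridded = np.arange(3 * 141 * 221, dtype=float).reshape(3, 141, 221)

# Expand the two axes into one (x1, x2) pair per grid point
X1, X2 = np.meshgrid(x1.ravel(), x2.ravel(), indexing="ij")  # each (141, 221)
X_flat = np.stack([X1.ravel(), X2.ravel()], axis=0)          # (2, 31161)

# Flatten the two spatial axes of Y in the same (row-major) order
Y_flat = Y_gridded.reshape(Y_gridded.shape[0], -1)           # (3, 31161)

print(X_flat.shape, Y_flat.shape)  # (2, 31161) (3, 31161)
```

Because both arrays are flattened in the same order, column k of X_flat and column k of Y_flat still refer to the same grid point.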