Tasks

What is a ‘task’?

The concept of a task is central to DeepSensor. It originates from the meta-learning literature in machine learning and has a specific meaning.

If you are unfamiliar with the notation and terminology of meta-learning, we recommend expanding the section below.

The DeepSensor Task

In DeepSensor, a Task is a dict-like data structure that contains context sets, target sets, and other metadata. Before diving into the TaskLoader class, which generates Task objects from xarray and pandas objects, we will first introduce the Task class itself.

First, we will generate a Task using DeepSensor. These code cells are kept hidden because they include features that are only covered later in the User Guide. Only expand them if you are curious!

import logging

logging.captureWarnings(True)

import deepsensor.torch
from deepsensor.data import DataProcessor
from deepsensor.data.sources import get_ghcnd_station_data, get_era5_reanalysis_data, get_earthenv_auxiliary_data, get_gldas_land_mask

import matplotlib.pyplot as plt

# Using the same settings allows us to use pre-downloaded cached data
data_range = ("2016-06-25", "2016-06-30")
extent = "europe"
station_var_IDs = ["TAVG", "PRCP"]
era5_var_IDs = ["2m_temperature", "10m_u_component_of_wind", "10m_v_component_of_wind"]
auxiliary_var_IDs = ["elevation", "tpi"]
cache_dir = "../../.datacache"

station_raw_df = get_ghcnd_station_data(station_var_IDs, extent, date_range=data_range, cache=True, cache_dir=cache_dir)
era5_raw_ds = get_era5_reanalysis_data(era5_var_IDs, extent, date_range=data_range, cache=True, cache_dir=cache_dir)
auxiliary_raw_ds = get_earthenv_auxiliary_data(auxiliary_var_IDs, extent, "10KM", cache=True, cache_dir=cache_dir)
land_mask_raw_ds = get_gldas_land_mask(extent, cache=True, cache_dir=cache_dir)

data_processor = DataProcessor(x1_name="lat", x2_name="lon")
era5_ds = data_processor(era5_raw_ds)
aux_ds, land_mask_ds = data_processor([auxiliary_raw_ds, land_mask_raw_ds], method="min_max")
station_df = data_processor(station_raw_df)
from deepsensor.data import TaskLoader
task_loader = TaskLoader(context=[era5_ds, land_mask_ds], target=station_df)
task = task_loader("2016-06-25", context_sampling=[52, 112], target_sampling=245)

In the code cell below, task is a Task object. Printing a Task will print each of its entries and replace numerical arrays with their shape for convenience.

print(task)
time: 2016-06-25 00:00:00
ops: []
X_c: [(2, 52), (2, 112)]
Y_c: [(3, 52), (1, 112)]
X_t: [(2, 245)]
Y_t: [(2, 245)]
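
The shape-only printout can be imitated without DeepSensor. The snippet below is a minimal sketch, assuming a plain Python dict of NumPy arrays as a stand-in for a Task; the helper `summarise_shapes` and the `toy_task` dict are hypothetical names introduced for illustration and are not part of the DeepSensor API.

```python
import numpy as np


def summarise_shapes(entry):
    """Recursively replace arrays with their shapes; leave other values as-is."""
    if isinstance(entry, np.ndarray):
        return entry.shape
    if isinstance(entry, (list, tuple)):
        # Preserve the container type (list of sets, or tuple of grid coordinates)
        return type(entry)(summarise_shapes(e) for e in entry)
    return entry


# Hypothetical stand-in for a Task: random data with the shapes printed above
rng = np.random.default_rng(0)
toy_task = {
    "time": "2016-06-25",
    "ops": [],
    "X_c": [rng.random((2, 52)), rng.random((2, 112))],
    "Y_c": [rng.random((3, 52)), rng.random((1, 112))],
    "X_t": [rng.random((2, 245))],
    "Y_t": [rng.random((2, 245))],
}

for key, value in toy_task.items():
    print(f"{key}: {summarise_shapes(value)}")
```

Summarising by shape keeps the printout readable even when each array holds thousands of values.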

Task structure

A Task typically contains at least the following entries:

  • "time": timestamp that was used for slicing the spatiotemporal data.

  • "ops": list of processing operations that have been applied to the data (more on this shortly).

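  • "X_c" and "Y_c": length-\(N_C\) lists of context set observations \(\mathbf{X}^{(c)}_i \in \mathbb{R}^{2\times M}\) and \(\mathbf{Y}^{(c)}_i \in \mathbb{R}^{N\times M}\), where \(M\) is the number of observations and \(N\) is the number of variables in context set \(i\) (both may differ between sets).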

  • "X_t" and "Y_t": as above, but for the target sets. In the example above, the target observations are known, so this Task may be used for training.

Exercise:

For the task object above, use the "X_c", "Y_c", "X_t", and "Y_t" entries to work out the following (answer hidden below):

  • The number of context sets

  • The number of observations in each context set

  • The dimensionality of each context set

  • The number of target sets

  • The number of observations in each target set

  • The dimensionality of each target set
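
If you would like to check your answers programmatically, the quantities above can be read directly from the array shapes. The snippet below is a minimal sketch, assuming non-gridded sets stored as NumPy arrays with the variable dimension first and the observation dimension last. The helper `describe_sets` and the toy arrays are hypothetical, and their shapes deliberately differ from the task printed above.

```python
import numpy as np


def describe_sets(Y_sets):
    """Return (dimensionality, number of observations) for each non-gridded set."""
    return [(Y.shape[0], Y.shape[-1]) for Y in Y_sets]


# Hypothetical toy entries, deliberately different from the task printed above
toy_Y_c = [np.zeros((4, 10)), np.zeros((1, 30)), np.zeros((2, 7))]
toy_Y_t = [np.zeros((1, 50))]

# The number of sets is simply the length of each list
print("context sets:", len(toy_Y_c), describe_sets(toy_Y_c))
print("target sets:", len(toy_Y_t), describe_sets(toy_Y_t))
```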

Gridded data in Tasks

For convenience, data that lies on a regular grid is given a compact tuple representation for the "X" entries:

task_with_gridded_data = task_loader("2016-06-25", context_sampling=["all", "all"], target_sampling=245)
print(task_with_gridded_data)
time: 2016-06-25 00:00:00
ops: []
X_c: [((1, 141), (1, 221)), ((1, 140), (1, 220))]
Y_c: [(3, 141, 221), (1, 140, 220)]
X_t: [(2, 245)]
Y_t: [(2, 245)]

In the above example, the first context set lies on a 141 x 221 grid, and the second context set lies on a 140 x 220 grid.
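
To see why the tuple form is compact, the following minimal sketch builds a gridded entry by hand with NumPy. The coordinate values (evenly spaced between 0 and 1) and all variable names are assumptions chosen for illustration; only the shapes follow the printout above.

```python
import numpy as np

# One coordinate array per spatial axis, each with a leading singleton dimension
x1 = np.linspace(0.0, 1.0, 141).reshape(1, -1)  # shape (1, 141)
x2 = np.linspace(0.0, 1.0, 221).reshape(1, -1)  # shape (1, 221)
X_gridded = (x1, x2)

# Three variables observed at every grid point
Y_gridded = np.zeros((3, x1.shape[-1], x2.shape[-1]))  # shape (3, 141, 221)

# Coordinate values stored by the tuple form versus an explicit (2, M) array
n_tuple_coords = x1.size + x2.size      # 141 + 221 = 362
n_flat_coords = 2 * x1.size * x2.size   # 2 * 141 * 221 = 62322

print(tuple(x.shape for x in X_gridded), Y_gridded.shape)
print(n_tuple_coords, n_flat_coords)
```

The tuple stores each axis once, whereas an explicit coordinate array would repeat every axis value at every grid point.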

Task methods

The Task class also contains methods for applying processing operations to the data (like removing NaNs, adding batch dimensions, etc.). These operations are recorded, in the order they were applied, in the "ops" entry of the Task. Operations can be chained together, for example:

print(task.add_batch_dim().convert_to_tensor())
time: 2016-06-25 00:00:00
ops: ['batch_dim', 'tensor']
X_c: [torch.Size([1, 2, 52]), torch.Size([1, 2, 112])]
Y_c: [torch.Size([1, 3, 52]), torch.Size([1, 1, 112])]
X_t: [torch.Size([1, 2, 245])]
Y_t: [torch.Size([1, 2, 245])]
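
The chaining pattern can be illustrated with a toy class. This is a minimal sketch and not DeepSensor's implementation: `ToyTask`, its `_apply` helper, and the `cast_to_float32` operation are hypothetical, and a NumPy dtype cast stands in for tensor conversion so that the example runs without PyTorch.

```python
import numpy as np


class ToyTask(dict):
    """Hypothetical stand-in for a Task: a dict that logs applied operations."""

    def _apply(self, fn, op_name):
        # Apply fn to every array in the context/target entries and log the op
        new = ToyTask(self)
        for key in ("X_c", "Y_c", "X_t", "Y_t"):
            new[key] = [fn(arr) for arr in self[key]]
        new["ops"] = self["ops"] + [op_name]
        return new

    def add_batch_dim(self):
        # Prepend a batch dimension of size 1 to every array
        return self._apply(lambda arr: arr[np.newaxis, ...], "batch_dim")

    def cast_to_float32(self):
        # Stand-in for a tensor conversion step
        return self._apply(lambda arr: arr.astype(np.float32), "float32")


toy_task = ToyTask(
    time="2016-06-25",
    ops=[],
    X_c=[np.zeros((2, 52)), np.zeros((2, 112))],
    Y_c=[np.zeros((3, 52)), np.zeros((1, 112))],
    X_t=[np.zeros((2, 245))],
    Y_t=[np.zeros((2, 245))],
)

# Each method returns a new ToyTask, so calls can be chained
processed = toy_task.add_batch_dim().cast_to_float32()
print(processed["ops"])
print([arr.shape for arr in processed["X_c"]])
```

Because each method returns a task object, the calls compose left to right and the "ops" list doubles as a record of how the data were prepared.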

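Gridded data in a Task can be flattened using the .flatten_gridded_data method. Notice how the gridded "X_c" entries are now 2D arrays of shape (2, M) rather than tuples of two coordinate arrays of shape (1, N1) and (1, N2), where M = N1 × N2 (for example, 141 × 221 = 31161 for the first context set).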

print(task_with_gridded_data.flatten_gridded_data())
time: 2016-06-25 00:00:00
ops: ['gridded_data_flattened']
X_c: [(2, 31161), (2, 30800)]
Y_c: [(3, 31161), (1, 30800)]
X_t: [(2, 245)]
Y_t: [(2, 245)]
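
Conceptually, flattening expands the two coordinate axes into every grid point and reshapes the data to match. The snippet below is a minimal NumPy sketch of that idea, not DeepSensor's implementation. The helper `flatten_grid` is hypothetical, and the row-major ordering of the flattened points is an assumption that may differ from what the library produces.

```python
import numpy as np


def flatten_grid(X_gridded, Y_gridded):
    """Turn ((1, N1), (1, N2)) coordinates and (N, N1, N2) data into (2, M) and (N, M)."""
    x1, x2 = X_gridded
    # indexing="ij" so the first grid axis follows x1 and the second follows x2
    X1, X2 = np.meshgrid(x1.ravel(), x2.ravel(), indexing="ij")
    X_flat = np.stack([X1.ravel(), X2.ravel()], axis=0)  # (2, N1 * N2)
    Y_flat = Y_gridded.reshape(Y_gridded.shape[0], -1)   # (N, N1 * N2)
    return X_flat, Y_flat


# Toy gridded set with the same shapes as the first context set above
x1 = np.linspace(0.0, 1.0, 141).reshape(1, -1)
x2 = np.linspace(0.0, 1.0, 221).reshape(1, -1)
Y = np.random.default_rng(0).random((3, 141, 221))

X_flat, Y_flat = flatten_grid((x1, x2), Y)
print(X_flat.shape, Y_flat.shape)
```

Both arrays are flattened in the same order, so column k of `X_flat` holds the location of the observation in column k of `Y_flat`.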