deepsensor.data.loader


class TaskLoader(task_loader_ID=None, context=None, target=None, aux_at_contexts=None, aux_at_targets=None, links=None, context_delta_t=0, target_delta_t=0, time_freq='D', xarray_interp_method='linear', discrete_xarray_sampling=False, dtype='float32')

Generates Task objects for training, testing, and inference with DeepSensor models.

Provides a suite of sampling methods for generating Task objects for different kinds of predictions, such as: spatial interpolation, forecasting, downscaling, or some combination of these.

The behaviour is as follows:
  • If all data is passed as paths, the data is loaded and the paths are overwritten with the loaded data.

  • Either all data is passed as paths, or all data is passed as loaded data (otherwise a ValueError is raised).

  • If all data is passed as paths, the TaskLoader can be saved with the save method (using config).
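The all-paths-or-all-loaded rule above can be sketched in plain Python. This is only an illustration of the rule, not DeepSensor's implementation: the helper name `all_paths_or_all_loaded` is hypothetical, and path strings are assumed to be the only way a path is represented.

```python
from typing import Sequence


def all_paths_or_all_loaded(data: Sequence[object]) -> bool:
    """Return True if every item is a path string, False if none is.

    Hypothetical helper: a mixture of paths and loaded objects raises
    ValueError, mirroring the rule described in the list above.
    """
    n_paths = sum(isinstance(item, str) for item in data)
    if n_paths == len(data):
        return True  # all paths: data can be loaded, and the loader saved
    if n_paths == 0:
        return False  # all loaded objects: nothing to load from disk
    raise ValueError(
        "Data must be passed either all as paths or all as loaded objects."
    )
```

A caller would use the boolean to decide whether to load the data from disk and whether saving the loader's config is possible.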

Parameters:
  • task_loader_ID – If loading a TaskLoader from a config file, this is the folder the TaskLoader was saved in (using .save). If this argument is passed, all other arguments are ignored.

  • context (xarray.DataArray | xarray.Dataset | pandas.DataFrame | List[xarray.DataArray | xarray.Dataset | pandas.DataFrame]) – Context data. Can be a single xarray.DataArray, xarray.Dataset or pandas.DataFrame, or a list/tuple of these.

  • target (xarray.DataArray | xarray.Dataset | pandas.DataFrame | List[xarray.DataArray | xarray.Dataset | pandas.DataFrame]) – Target data. Can be a single xarray.DataArray, xarray.Dataset or pandas.DataFrame, or a list/tuple of these.

  • aux_at_contexts (Tuple[int, xarray.DataArray | xarray.Dataset], optional) – Auxiliary data at context locations. Tuple of two elements, where the first element is the index of the context set at whose locations the auxiliary data will be sampled, and the second element is the auxiliary data, which can be a single xarray.DataArray or xarray.Dataset. Default: None.

  • aux_at_targets (xarray.DataArray | xarray.Dataset, optional) – Auxiliary data at target locations. Can be a single xarray.DataArray or xarray.Dataset. Default: None.

  • links (Tuple[int, int] | List[Tuple[int, int]], optional) – Specifies links between context and target data. Each link is a tuple of two integers, where the first integer is the index of the context data and the second integer is the index of the target data. Can be a single tuple in the case of a single link. If None, no links are specified. Default: None.

  • context_delta_t (int | List[int], optional) – Time difference between context data and t=0 (task init time). Can be a single int (same for all context data) or a list/tuple of ints. Default is 0.

  • target_delta_t (int | List[int], optional) – Time difference between target data and t=0 (task init time). Can be a single int (same for all target data) or a list/tuple of ints. Default is 0.

  • time_freq (str, optional) – Time frequency of the data. Default: 'D' (daily).

  • xarray_interp_method (str, optional) – Interpolation method to use when interpolating xarray.DataArray. Default is 'linear'.

  • discrete_xarray_sampling (bool, optional) – When randomly sampling xarray variables, whether to sample at discrete points defined at grid cell centres, or at continuous points within the grid. Default is False.

  • dtype (object, optional) – Data type of the data. Used to cast the data to the specified dtype. Default: 'float32'.
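To illustrate how context_delta_t and target_delta_t relate each data set to the task init time, here is a minimal sketch. It assumes that one unit of delta_t equals one step of time_freq (one day for the default 'D'); the helper names `broadcast_delta_t` and `slice_dates` are hypothetical and are not part of DeepSensor.

```python
from datetime import date, timedelta
from typing import List, Sequence, Union


def broadcast_delta_t(delta_t: Union[int, Sequence[int]], n_sets: int) -> List[int]:
    """Expand a single int to one delta per data set, or validate a list."""
    if isinstance(delta_t, int):
        return [delta_t] * n_sets
    deltas = list(delta_t)
    if len(deltas) != n_sets:
        raise ValueError(f"Expected {n_sets} deltas, got {len(deltas)}.")
    return deltas


def slice_dates(
    init_date: date, delta_t: Union[int, Sequence[int]], n_sets: int
) -> List[date]:
    """Date at which each data set is sliced, assuming daily frequency."""
    return [
        init_date + timedelta(days=d) for d in broadcast_delta_t(delta_t, n_sets)
    ]


# Two context sets: the first lags the init time by one day.
context_dates = slice_dates(date(2020, 1, 10), [-1, 0], n_sets=2)
```

Under these assumptions, a forecasting setup would use a positive target_delta_t (targets after the init time) and zero or negative context_delta_t values.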

__call__(date, context_sampling='all', target_sampling=None, split_frac=0.5, datewise_deterministic=False, seed_override=None)

Generate a task for a given date (or a list of data.task.Task objects for a list of dates).

There are several sampling strategies available for the context and target data:

  • “all”: Sample all observations.

  • int: Sample N observations uniformly at random.

  • float: Sample a fraction of observations uniformly at random.

  • numpy.ndarray, shape (2, N):

    Sample N observations at the given x1, x2 coordinates. Coords are assumed to be normalised.

  • “split”: Split pandas observations into disjoint context and target sets.

    split_frac determines the fraction of observations to use for the context set. The remaining observations are used for the target set. The context set and target set must be linked through the TaskLoader links argument. Only valid for pandas data.

  • “gapfill”: Generates a training task for filling NaNs in xarray data.

    Randomly samples a missing data (NaN) mask from another timestamp and adds it to the context set (i.e. increases the number of NaNs). The target set is then the true values of the data at the added missing locations. The context set and target set must be linked through the TaskLoader links argument. Only valid for xarray data.
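The "all", int, float and "split" strategies above can be sketched as operations on observation indices. This is an illustrative sketch rather than DeepSensor's code: the function names are hypothetical, sampling is assumed to be without replacement, and fractional counts are assumed to be truncated.

```python
import random
from typing import List, Tuple, Union


def sample_indices(
    n_obs: int, strategy: Union[str, int, float], rng: random.Random
) -> List[int]:
    """Pick observation indices for the "all", int and float strategies."""
    if isinstance(strategy, str):
        if strategy == "all":
            return list(range(n_obs))
        raise ValueError(f"Unsupported strategy: {strategy!r}")
    if isinstance(strategy, float):
        if not 0.0 <= strategy <= 1.0:
            raise ValueError("A float strategy must lie in [0, 1].")
        n = int(strategy * n_obs)  # assumption: truncate fractional counts
    elif isinstance(strategy, int):
        n = strategy
    else:
        raise ValueError(f"Unsupported strategy type: {type(strategy)}")
    # Uniform sampling without replacement (an assumption of this sketch).
    return rng.sample(range(n_obs), n)


def split_indices(
    n_obs: int, split_frac: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Split indices into disjoint context and target sets ("split")."""
    shuffled = list(range(n_obs))
    rng.shuffle(shuffled)
    n_context = int(split_frac * n_obs)
    return shuffled[:n_context], shuffled[n_context:]
```

With split_frac=0.5 and ten observations, five indices go to the context set and the remaining five to the target set, with no overlap.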

Parameters:
  • date (pandas.Timestamp) – Date for which to generate the task.

  • context_sampling (str | int | float | numpy.ndarray | List[str | int | float | numpy.ndarray], optional) – Sampling strategy for the context data, either a list of sampling strategies for each context set, or a single strategy applied to all context sets. Default is "all".

  • target_sampling (str | int | float | numpy.ndarray | List[str | int | float | numpy.ndarray], optional) – Sampling strategy for the target data, either a list of sampling strategies for each target set, or a single strategy applied to all target sets. Default is None, meaning no target data is returned.

  • split_frac (float, optional) – The fraction of observations to use for the context set with the “split” sampling strategy for linked context and target set pairs. The remaining observations are used for the target set. Default is 0.5.

  • datewise_deterministic (bool, optional) – Whether random sampling is datewise deterministic based on the date. Default is False.

  • seed_override (Optional[int], optional) – Override the seed for random sampling. This can be used to apply the same random sampling at different dates. Default is None.
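One way to picture how datewise_deterministic and seed_override could interact is the sketch below. The seed derivation (the date formatted as YYYYMMDD) and the precedence of seed_override are assumptions made for illustration; the hypothetical `resolve_seed` helper is not taken from the library.

```python
import random
from datetime import date
from typing import Optional


def resolve_seed(
    task_date: date,
    datewise_deterministic: bool = False,
    seed_override: Optional[int] = None,
) -> Optional[int]:
    """Choose the seed for a task's random sampling (illustrative only)."""
    if seed_override is not None:
        # Assumption: an explicit override wins, so different dates can
        # share the same random sampling.
        return seed_override
    if datewise_deterministic:
        # Assumption: derive a reproducible seed from the date itself.
        return int(task_date.strftime("%Y%m%d"))
    return None  # non-deterministic: the generator seeds itself


# The same date always yields the same random stream.
rng_a = random.Random(resolve_seed(date(2020, 1, 2), datewise_deterministic=True))
rng_b = random.Random(resolve_seed(date(2020, 1, 2), datewise_deterministic=True))
```

Datewise-deterministic sampling of this kind is useful for validation sets, where the same task should be generated for a given date on every epoch.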

Returns:

Task | List[Task] – Task object or list of task objects for each date containing the context and target data.

config_fname = 'task_loader_config.json'

count_context_and_target_data_dims()

Count the number of data dimensions in the context and target data.

Returns:

tuple – context_dims, tuple of data dimensions in the context data.

tuple – target_dims, tuple of data dimensions in the target data.

Raises:

ValueError – If the context/target data is not a tuple/list of xarray.DataArray, xarray.Dataset or pandas.DataFrame.

infer_context_and_target_var_IDs()

Infer the variable IDs of the context and target data.

Returns:

tuple – context_var_IDs, tuple of variable IDs in the context data.

tuple – target_var_IDs, tuple of variable IDs in the target data.

Raises:

ValueError – If the context/target data is not a tuple/list of xarray.DataArray, xarray.Dataset or pandas.DataFrame.

load_dask()

Load any dask data into memory.

This function triggers the computation and loading of any data that is represented as dask arrays or datasets into memory.

Returns:

None

sample_da(da, sampling_strat, seed=None)

Sample a DataArray according to a given strategy.

Parameters:
  • da (xarray.DataArray | xarray.Dataset) – DataArray to sample, assumed to be sliced for the task already.

  • sampling_strat (str | int | float | numpy.ndarray) – Sampling strategy, either “all” or an integer for random grid cell sampling.

  • seed (int, optional) – Seed for random sampling. Default is None.

Returns:

Tuple[numpy.ndarray, numpy.ndarray] – Tuple of sampled target data and sampled context data.

Raises:

InvalidSamplingStrategyError – If the sampling strategy is not valid or if a numpy coordinate array is passed to sample an xarray object, but the coordinates are out of bounds.
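The difference between discrete and continuous random sampling on a grid (see discrete_xarray_sampling), and the out-of-bounds check for explicit coordinates, can be sketched as follows. This is a simplified illustration with hypothetical names: the grid is represented by lists of normalised cell-centre coordinates, the continuous range is assumed to span the outermost cell centres, and `OutOfBoundsError` is a local stand-in for the library's exception.

```python
import random
from typing import List, Sequence, Tuple


class OutOfBoundsError(ValueError):
    """Local stand-in for the library's InvalidSamplingStrategyError."""


def sample_grid_points(
    x1_centres: Sequence[float],
    x2_centres: Sequence[float],
    n: int,
    discrete: bool,
    rng: random.Random,
) -> List[Tuple[float, float]]:
    """Draw n random (x1, x2) locations from a grid."""
    points = []
    for _ in range(n):
        if discrete:
            # Discrete: only grid cell centres can be drawn.
            x1 = rng.choice(x1_centres)
            x2 = rng.choice(x2_centres)
        else:
            # Continuous: any point within the extent of the grid.
            x1 = rng.uniform(min(x1_centres), max(x1_centres))
            x2 = rng.uniform(min(x2_centres), max(x2_centres))
        points.append((x1, x2))
    return points


def check_in_bounds(
    coords: Tuple[Sequence[float], Sequence[float]],
    x1_centres: Sequence[float],
    x2_centres: Sequence[float],
) -> None:
    """Reject explicit (2, N)-style coordinates that fall outside the grid."""
    x1s, x2s = coords
    for x1, x2 in zip(x1s, x2s):
        inside_x1 = min(x1_centres) <= x1 <= max(x1_centres)
        inside_x2 = min(x2_centres) <= x2 <= max(x2_centres)
        if not (inside_x1 and inside_x2):
            raise OutOfBoundsError(f"Coordinate ({x1}, {x2}) is outside the grid.")
```

Continuous locations generally fall between cell centres, which is where the interpolation method set by xarray_interp_method would come into play.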

sample_df(df, sampling_strat, seed=None)

Sample a DataFrame according to a given strategy.

Parameters:
  • df (pandas.DataFrame | pandas.Series) – DataFrame to sample, assumed to be time-sliced for the task already.

  • sampling_strat (str | int | float | numpy.ndarray) – Sampling strategy, either “all” or an integer for random grid cell sampling.

  • seed (int, optional) – Seed for random sampling. Default is None.

Returns:

Tuple[X_c, Y_c] – Tuple of sampled target data and sampled context data.

Raises:

InvalidSamplingStrategyError – If the sampling strategy is not valid or if a numpy coordinate array is passed to sample a pandas object, but the DataFrame does not contain all the requested samples.
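The coordinate-array case for tabular data, where every requested location must exist in the table, might look like the sketch below. It uses a plain dict keyed by (x1, x2) as a stand-in for a time-sliced DataFrame and raises KeyError in place of the library's exception. The function name `select_rows_at` and the reading of the return pair as (coordinates, values) are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float]


def select_rows_at(
    records: Dict[Coord, float],
    x1s: Sequence[float],
    x2s: Sequence[float],
) -> Tuple[List[Coord], List[float]]:
    """Return the coordinates and values at the requested locations.

    Raises KeyError (a stand-in for InvalidSamplingStrategyError) if any
    requested location is absent from the table.
    """
    requested = list(zip(x1s, x2s))
    missing = [xy for xy in requested if xy not in records]
    if missing:
        raise KeyError(f"Table does not contain requested samples: {missing}")
    values = [records[xy] for xy in requested]
    return requested, values


# Two station observations, keyed by normalised (x1, x2) location.
stations = {(0.1, 0.2): 1.0, (0.3, 0.4): 2.0}
X_c, Y_c = select_rows_at(stations, [0.3], [0.4])
```

Unlike gridded data, a table cannot be interpolated to arbitrary locations, which is why the sketch fails instead of estimating a value at a missing location.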

sample_offgrid_aux(X_t, offgrid_aux)

Sample auxiliary data at off-grid locations.

Returns:

numpy.ndarray

save(folder)

Save the TaskLoader config to JSON in folder.
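As a rough picture of what saving a config to JSON in a folder involves, the sketch below writes a dictionary to a file named after config_fname and reads it back. The config keys shown are hypothetical; the actual contents written by save are not specified here.

```python
import json
import os
import tempfile

CONFIG_FNAME = "task_loader_config.json"  # matches config_fname above


def save_config(config: dict, folder: str) -> str:
    """Write config as JSON inside folder and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, CONFIG_FNAME)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return path


def load_config(folder: str) -> dict:
    """Read the config back from folder."""
    with open(os.path.join(folder, CONFIG_FNAME)) as f:
        return json.load(f)


# Hypothetical config: data given as paths, plus some constructor arguments.
example_config = {
    "context": ["context.nc"],
    "target": ["target.csv"],
    "links": [[0, 0]],
    "time_freq": "D",
}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_config(example_config, tmp)
    restored = load_config(tmp)
```

Storing paths rather than the data keeps the config small and JSON-serialisable, which is consistent with saving being available only when all data is passed as paths.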

task_generation(date, context_sampling='all', target_sampling=None, split_frac=0.5, datewise_deterministic=False, seed_override=None)

time_slice_variable(var, date, delta_t=0)

Slice a variable by a given time delta.

Parameters:
  • var – Variable to slice.

  • delta_t – Time delta to slice by.

Returns:

var (…) – Sliced variable.

Raises:

ValueError – If the variable is of an unknown type.
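The idea of slicing a variable at the task date plus a time delta can be sketched with plain Python containers. In this illustration a dict keyed by date stands in for a gridded variable and a list of (date, value) records stands in for a table. Both stand-ins, the hypothetical `time_slice` helper, and the daily unit of delta_t are assumptions of the sketch.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple, Union

Gridded = Dict[date, List[float]]
Tabular = List[Tuple[date, float]]


def time_slice(
    var: Union[Gridded, Tabular], task_date: date, delta_t: int = 0
) -> List[float]:
    """Select the data at task_date + delta_t days from var."""
    target = task_date + timedelta(days=delta_t)
    if isinstance(var, dict):
        # Gridded stand-in: one field per date.
        return var[target]
    if isinstance(var, list):
        # Tabular stand-in: keep the rows recorded on the target date.
        return [value for (row_date, value) in var if row_date == target]
    raise ValueError(f"Unknown variable type: {type(var)}")


grid = {date(2020, 1, 1): [1.0, 2.0], date(2020, 1, 2): [3.0, 4.0]}
table = [(date(2020, 1, 1), 10.0), (date(2020, 1, 2), 20.0)]

previous_day_field = time_slice(grid, date(2020, 1, 2), delta_t=-1)
same_day_rows = time_slice(table, date(2020, 1, 2))
```

The final ValueError branch mirrors the documented behaviour for variables of an unknown type.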