deepsensor.data.loader#
- class TaskLoader(task_loader_ID=None, context=None, target=None, aux_at_contexts=None, aux_at_targets=None, links=None, context_delta_t=0, target_delta_t=0, time_freq='D', xarray_interp_method='linear', discrete_xarray_sampling=False, dtype='float32')[source]#
Generates Task objects for training, testing, and inference with DeepSensor models.
Provides a suite of sampling methods for generating Task objects for different kinds of predictions, such as: spatial interpolation, forecasting, downscaling, or some combination of these.
The behaviour is the following:
- Either all data is passed as paths, or all data is passed as loaded data (else a ValueError is raised).
- If all data is passed as paths, the data is loaded and the paths are overwritten with the loaded data.
- If all data is passed as paths, the TaskLoader can be saved with the save method (using its config).
- Parameters:
task_loader_ID – If loading a TaskLoader from a config file, this is the folder the TaskLoader was saved in (using .save). If this argument is passed, all other arguments are ignored.
context (xarray.DataArray | xarray.Dataset | pandas.DataFrame | List[xarray.DataArray | xarray.Dataset | pandas.DataFrame]) – Context data. Can be a single xarray.DataArray, xarray.Dataset or pandas.DataFrame, or a list/tuple of these.
target (xarray.DataArray | xarray.Dataset | pandas.DataFrame | List[xarray.DataArray | xarray.Dataset | pandas.DataFrame]) – Target data. Can be a single xarray.DataArray, xarray.Dataset or pandas.DataFrame, or a list/tuple of these.
aux_at_contexts (Tuple[int, xarray.DataArray | xarray.Dataset], optional) – Auxiliary data at context locations. Tuple of two elements, where the first element is the index of the context set at which the auxiliary data will be sampled, and the second element is the auxiliary data, which can be a single xarray.DataArray or xarray.Dataset. Default: None.
aux_at_targets (xarray.DataArray | xarray.Dataset, optional) – Auxiliary data at target locations. Can be a single xarray.DataArray or xarray.Dataset. Default: None.
links (Tuple[int, int] | List[Tuple[int, int]], optional) – Specifies links between context and target data. Each link is a tuple of two integers, where the first integer is the index of the context data and the second integer is the index of the target data. Can be a single tuple in the case of a single link. If None, no links are specified. Default: None.
context_delta_t (int | List[int], optional) – Time difference between context data and t=0 (task init time). Can be a single int (same for all context data) or a list/tuple of ints. Default is 0.
target_delta_t (int | List[int], optional) – Time difference between target data and t=0 (task init time). Can be a single int (same for all target data) or a list/tuple of ints. Default is 0.
time_freq (str, optional) – Time frequency of the data. Default: 'D' (daily).
xarray_interp_method (str, optional) – Interpolation method to use when interpolating xarray.DataArray. Default is 'linear'.
discrete_xarray_sampling (bool, optional) – When randomly sampling xarray variables, whether to sample at discrete points defined at grid cell centres, or at continuous points within the grid. Default is False.
dtype (object, optional) – Data type of the data. Used to cast the data to the specified dtype. Default: 'float32'.
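A minimal sketch of constructing a TaskLoader for a spatial interpolation setup, with a gridded xarray context variable and off-grid pandas target observations. The synthetic data, variable names, and the "time"/"x1"/"x2" coordinate naming are illustrative assumptions, not part of this reference:

```python
import numpy as np
import pandas as pd
import xarray as xr
from deepsensor.data.loader import TaskLoader

rng = np.random.default_rng(42)

# Synthetic gridded context data on normalised coordinates.
times = pd.date_range("2020-01-01", "2020-01-10", freq="D")
x1 = np.linspace(0, 1, 30)
x2 = np.linspace(0, 1, 30)
grid = xr.DataArray(
    rng.random((len(times), len(x1), len(x2))),
    dims=("time", "x1", "x2"),
    coords={"time": times, "x1": x1, "x2": x2},
    name="temperature",
)

# Synthetic off-grid "station" observations, 10 per date, indexed by (time, x1, x2).
n_per_date = 10
station_df = pd.DataFrame(
    {
        "time": np.repeat(times, n_per_date),
        "x1": rng.random(len(times) * n_per_date),
        "x2": rng.random(len(times) * n_per_date),
        "temperature": rng.random(len(times) * n_per_date),
    }
).set_index(["time", "x1", "x2"])

# Gridded context, off-grid targets: a spatial interpolation setup.
task_loader = TaskLoader(
    context=grid,
    target=station_df,
    context_delta_t=0,
    target_delta_t=0,
    time_freq="D",
)
```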
- __call__(date, context_sampling='all', target_sampling=None, split_frac=0.5, datewise_deterministic=False, seed_override=None)[source]#
Generate a task for a given date (or a list of data.task.Task objects for a list of dates).
There are several sampling strategies available for the context and target data:
- "all": Sample all observations.
- int: Sample N observations uniformly at random.
- float: Sample a fraction of observations uniformly at random.
- numpy.ndarray, shape (2, N): Sample N observations at the given x1, x2 coordinates. Coordinates are assumed to be normalised.
- "split": Split pandas observations into disjoint context and target sets. split_frac determines the fraction of observations to use for the context set; the remaining observations are used for the target set. The context set and target set must be linked through the TaskLoader links argument. Only valid for pandas data.
- "gapfill": Generates a training task for filling NaNs in xarray data. Randomly samples a missing-data (NaN) mask from another timestamp and adds it to the context set (i.e. increases the number of NaNs). The target set is then the true values of the data at the added missing locations. The context set and target set must be linked through the TaskLoader links argument. Only valid for xarray data.
- Parameters:
date (pandas.Timestamp) – Date for which to generate the task.
context_sampling (str | int | float | numpy.ndarray | List[str | int | float | numpy.ndarray], optional) – Sampling strategy for the context data, either a list of sampling strategies for each context set, or a single strategy applied to all context sets. Default is "all".
target_sampling (str | int | float | numpy.ndarray | List[str | int | float | numpy.ndarray], optional) – Sampling strategy for the target data, either a list of sampling strategies for each target set, or a single strategy applied to all target sets. Default is None, meaning no target data is returned.
split_frac (float, optional) – The fraction of observations to use for the context set with the "split" sampling strategy for linked context and target set pairs. The remaining observations are used for the target set. Default is 0.5.
datewise_deterministic (bool, optional) – Whether random sampling is datewise deterministic based on the date. Default is False.
seed_override (Optional[int], optional) – Override the seed for random sampling. This can be used to use the same random sampling at different dates. Default is None.
- Returns:
Task | List[Task] – Task object or list of task objects for each date containing the context and target data.
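For example, given the task_loader from the constructor sketch above, tasks could be generated like this (a sketch; the dates and sampling choices are arbitrary):

```python
import pandas as pd

# A single Task for one date: 30 random context points, all target observations.
task = task_loader(
    pd.Timestamp("2020-01-05"),
    context_sampling=30,
    target_sampling="all",
)

# A list of Tasks with datewise-deterministic random sampling.
dates = list(pd.date_range("2020-01-01", "2020-01-05", freq="D"))
tasks = task_loader(
    dates,
    context_sampling="all",
    target_sampling=0.5,
    datewise_deterministic=True,
)
```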
- config_fname = 'task_loader_config.json'#
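This config file is what allows a TaskLoader constructed from file paths to be saved and restored. A sketch of the round trip, assuming save accepts the target folder and with the folder path as a placeholder:

```python
# Only valid when all data was passed to the constructor as paths (see class docstring).
task_loader.save("path/to/task_loader_folder")  # writes task_loader_config.json

# Later: restore from the config; all other constructor arguments are then ignored.
task_loader = TaskLoader(task_loader_ID="path/to/task_loader_folder")
```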
- count_context_and_target_data_dims()[source]#
Count the number of data dimensions in the context and target data.
- Returns:
tuple – context_dims, tuple of data dimensions in the context data.
tuple – target_dims, tuple of data dimensions in the target data.
- Raises:
ValueError – If the context/target data is not a tuple/list of xarray.DataArray, xarray.Dataset or pandas.DataFrame.
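A sketch of unpacking the returned tuples, assuming the single-variable task_loader from the constructor sketch above:

```python
context_dims, target_dims = task_loader.count_context_and_target_data_dims()
print(context_dims, target_dims)  # e.g. (1,) (1,) for one data dimension per set
```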
- infer_context_and_target_var_IDs()[source]#
Infer the variable IDs of the context and target data.
- Returns:
tuple – context_var_IDs, tuple of variable IDs in the context data.
tuple – target_var_IDs, tuple of variable IDs in the target data.
- Raises:
ValueError – If the context/target data is not a tuple/list of xarray.DataArray, xarray.Dataset or pandas.DataFrame.
- load_dask()[source]#
Load any dask data into memory.
This function triggers the computation and loading of any data that is represented as dask arrays or datasets into memory.
- Returns:
None
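For instance, if the context data were opened lazily with xarray's dask backend, the arrays can be forced into memory before task generation. A sketch; the file path and chunking are placeholders:

```python
import xarray as xr
from deepsensor.data.loader import TaskLoader

ds = xr.open_dataset("context_data.nc", chunks={"time": 10})  # lazy, dask-backed
task_loader = TaskLoader(context=ds, target=ds)
task_loader.load_dask()  # computes and loads the dask arrays into memory
```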
- sample_da(da, sampling_strat, seed=None)[source]#
Sample a DataArray according to a given strategy.
- Parameters:
da (xarray.DataArray | xarray.Dataset) – DataArray to sample, assumed to be sliced for the task already.
sampling_strat (str | int | float | numpy.ndarray) – Sampling strategy, either "all" or an integer for random grid cell sampling.
seed (int, optional) – Seed for random sampling. Default is None.
- Returns:
Tuple[numpy.ndarray, numpy.ndarray] – Tuple of sampled target data and sampled context data.
- Raises:
InvalidSamplingStrategyError – If the sampling strategy is not valid or if a numpy coordinate array is passed to sample an xarray object, but the coordinates are out of bounds.
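A sketch of sampling a time slice directly, assuming the grid and task_loader from the constructor sketch above:

```python
da_slice = grid.sel(time="2020-01-05")  # already sliced for the task
X, Y = task_loader.sample_da(da_slice, sampling_strat=50, seed=42)  # two arrays, per Returns
```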
- sample_df(df, sampling_strat, seed=None)[source]#
Sample a DataFrame according to a given strategy.
- Parameters:
df (pandas.DataFrame | pandas.Series) – DataFrame to sample, assumed to be time-sliced for the task already.
sampling_strat (str | int | float | numpy.ndarray) – Sampling strategy, either "all" or an integer for random grid cell sampling.
seed (int, optional) – Seed for random sampling. Default is None.
- Returns:
Tuple[X_c, Y_c] – Tuple of sampled target data and sampled context data.
- Raises:
InvalidSamplingStrategyError – If the sampling strategy is not valid or if a numpy coordinate array is passed to sample a pandas object, but the DataFrame does not contain all the requested samples.
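Similarly for pandas data, assuming the station_df and task_loader from the constructor sketch above; exactly how the time slice should be indexed is an assumption here:

```python
import pandas as pd

df_slice = station_df.xs(pd.Timestamp("2020-01-05"), level="time")  # time-sliced
X, Y = task_loader.sample_df(df_slice, sampling_strat=0.5, seed=42)
```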
- sample_offgrid_aux(X_t, offgrid_aux)[source]#
Sample auxiliary data at off-grid locations.
- Parameters:
X_t (numpy.ndarray | Tuple[numpy.ndarray, numpy.ndarray]) – Off-grid locations at which to sample the auxiliary data. Can be a tuple of two numpy arrays, or a single numpy array.
offgrid_aux (xarray.DataArray | xarray.Dataset) – Auxiliary data at off-grid locations.
- Returns:
numpy.ndarray – Auxiliary data sampled at the given off-grid locations.
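A sketch, assuming the grid and task_loader from the constructor sketch above, and reusing a time slice of grid as a stand-in auxiliary field:

```python
import numpy as np

X_t = np.random.rand(2, 5)  # normalised (x1, x2) coordinates, shape (2, N)
aux = task_loader.sample_offgrid_aux(X_t, grid.sel(time="2020-01-05"))
```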
- task_generation(date, context_sampling='all', target_sampling=None, split_frac=0.5, datewise_deterministic=False, seed_override=None)[source]#
- time_slice_variable(var, date, delta_t=0)[source]#
Slice a variable by a given time delta.
- Parameters:
var – Variable to slice.
date – Date of the task (t=0) relative to which delta_t is applied.
delta_t – Time delta to slice by. Default is 0.
- Returns:
var (…) – Sliced variable.
- Raises:
ValueError – If the variable is of an unknown type.
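A sketch, assuming the grid and task_loader from the constructor sketch above:

```python
import pandas as pd

# Grid slice one time step (one day at 'D' frequency) offset from the task init time;
# the sign convention of delta_t (negative = earlier) is an assumption here.
var_slice = task_loader.time_slice_variable(grid, pd.Timestamp("2020-01-05"), delta_t=-1)
```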