deepsensor.data.sources

get_earthenv_auxiliary_data(var_IDs=('elevation', 'tpi'), extent='global', resolution='1KM', verbose=False, cache=False, cache_dir='.datacache')[source]

Download global static auxiliary data from EarthEnv into an xarray DataArray. See: https://www.earthenv.org/topography

Note

Requires the rioxarray package to be installed. e.g. via pip install rioxarray. See the rioxarray pages for more installation options: https://corteva.github.io/rioxarray/stable/installation.html

Note

This method downloads the data from EarthEnv to disk, reads it into memory, and then deletes the file from disk. This is because EarthEnv does not support OPeNDAP, so the data cannot be read directly into memory.

Note

At 1KM resolution, the global data is ~3 GB per variable.

Note

Topographic Position Index (TPI) is a measure of the local topographic position of a cell relative to its surrounding landscape. It is calculated as the difference between the elevation of a cell and the mean elevation of its surrounding landscape. This highlights topographic features such as mountains (positive TPI) and valleys (negative TPI).
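
As a toy illustration of this definition (plain Python, no dependencies):

```python
# TPI of the centre cell of a 3x3 elevation grid: its elevation minus the
# mean elevation of its 8 surrounding cells.
grid = [
    [100, 100, 100],
    [100, 180, 100],
    [100, 100, 100],
]
centre = grid[1][1]
neighbours = [grid[r][c] for r in range(3) for c in range(3) if (r, c) != (1, 1)]
tpi = centre - sum(neighbours) / len(neighbours)
print(tpi)  # 80.0 -> positive TPI, i.e. a local peak
```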

Todo

support land cover data: https://www.earthenv.org/landcover

Warning

If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.

Parameters:
  • var_IDs – tuple List of variable IDs. Options are: “elevation”, “tpi”.

  • extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.

  • resolution – str Resolution of data. Options are: “1KM”, “5KM”, “10KM”, “50KM”, “100KM”.

  • verbose – bool Whether to print status messages. Default is False.

  • cache – bool Whether to cache the downloaded data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.

  • cache_dir – str Directory to store the cached data. Default is ".datacache".

Returns:

xarray.DataArray

Auxiliary data with dimensions lat, lon and variable var_ID.
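
A minimal usage sketch. The call itself needs deepsensor, rioxarray, and network access to EarthEnv, so it is wrapped in a function here rather than executed:

```python
def fetch_uk_topography():
    # Sketch only: requires deepsensor + rioxarray and network access to EarthEnv.
    from deepsensor.data.sources import get_earthenv_auxiliary_data

    return get_earthenv_auxiliary_data(
        var_IDs=("elevation", "tpi"),
        extent="uk",        # or a (lon_min, lon_max, lat_min, lat_max) tuple
        resolution="10KM",  # coarser than 1KM keeps the download small
        cache=True,         # repeat calls load from .datacache instead
    )
```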

get_era5_reanalysis_data(var_IDs=None, extent='global', date_range=None, freq='D', num_processes=1, verbose=False, cache=False, cache_dir='.datacache')[source]

Download ERA5 reanalysis data from Google Cloud Storage into an xarray Dataset. Source: https://cloud.google.com/storage/docs/public-datasets/era5

Supports parallelising downloads into monthly chunks across multiple CPUs. Supports caching the downloaded data locally to avoid re-downloading when calling the function again with the same arguments. The data is cached on a per-month basis, so if you call the function again with a different date range, data will only be downloaded if the new date range includes months that have not already been downloaded.

Note

The aggregation method for when freq = “D” is “mean” (which may not be appropriate for accumulated variables like precipitation).

Warning

If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.

Parameters:
  • var_IDs – list List of variable IDs to download. If None, all variables are downloaded. See the list of available variable IDs above.

  • extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.

  • date_range – tuple Tuple of (start_date, end_date) in format “YYYY-MM-DD”.

  • freq – str Frequency of data to download. Options are: “D” (daily) or “H” (hourly). If “D”, the data is downloaded from the 1-hourly dataset and then resampled to daily averages. If “H”, the 1-hourly data is returned as-is.

  • num_processes – int, optional Number of CPUs to use for downloading monthly chunks of ERA5 data in parallel. Defaults to 1 (i.e. no parallelisation). If set to None, will use 75% of all available CPUs or 8 CPUs, whichever is smaller.

  • verbose – bool Whether to print status messages. Default is False.

  • cache – bool Whether to cache the downloaded data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.

  • cache_dir – str Directory to store the cached data. Default is ".datacache".

Returns:

xarray.Dataset

ERA5 reanalysis data with dimensions time, lat, lon and variables var1, var2, etc.
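
A usage sketch, wrapped in a function since the call needs deepsensor and network access to the public Google Cloud Storage bucket. The "2m_temperature" variable ID is an assumption for illustration (the available IDs are not listed in this section):

```python
def fetch_era5_daily_t2m():
    # Sketch only: requires deepsensor and network access.
    from deepsensor.data.sources import get_era5_reanalysis_data

    return get_era5_reanalysis_data(
        var_IDs=["2m_temperature"],       # hypothetical variable ID
        extent=(-12.0, 3.0, 49.0, 61.0),  # lon_min, lon_max, lat_min, lat_max
        date_range=("2020-01-01", "2020-02-29"),
        freq="D",         # daily means resampled from the 1-hourly data
        num_processes=4,  # download monthly chunks in parallel
        cache=True,       # cached per month, so extending the date range
    )                     # later only downloads the new months
```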

get_ghcnd_station_data(var_IDs=None, extent='global', date_range=None, subsample_frac=1.0, num_processes=None, verbose=False, cache=False, cache_dir='.datacache')[source]

Download Global Historical Climatology Network Daily (GHCND) station data from NOAA into a pandas DataFrame. Source: https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily

Note

Requires the scotthosking/get-station-data repository to be installed manually in your Python environment with: pip install git+https://github.com/scott-hosking/get-station-data.git

Note

Example key variable IDs:

  • "TAVG": Average temperature (degrees Celsius)

  • "TMAX": Maximum temperature (degrees Celsius)

  • "TMIN": Minimum temperature (degrees Celsius)

  • "PRCP": Precipitation (mm)

  • "SNOW": Snowfall

  • "AWND": Average wind speed (m/s)

  • "AWDR": Average wind direction (degrees)

The full list of variable IDs can be found here: https://www.ncei.noaa.gov/pub/data/ghcn/daily/readme.txt

Warning

If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.

Parameters:
  • var_IDs – list List of variable IDs to download. If None, all variables are downloaded. See the list of available variable IDs above.

  • extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.

  • date_range – tuple[str, str] Tuple of (start_date, end_date) in format “YYYY-MM-DD”.

  • subsample_frac – float Fraction of available stations to download (useful for reducing download size). Default is 1.0 (download all stations).

  • num_processes – int, optional Number of CPUs to use for downloading station data in parallel. If not specified, will use 75% of all available CPUs.

  • verbose – bool Whether to print status messages. Default is False.

  • cache – bool Whether to cache the station metadata and data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.

  • cache_dir – str Directory to store the cached data. Default is ".datacache".

Returns:

pandas.DataFrame

Station data with indexes time, lat, lon, station and columns var1, var2, etc.
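
A usage sketch (requires deepsensor plus the get-station-data package noted above, so the call is wrapped in a function rather than executed). The variable IDs come from the example list above:

```python
def fetch_uk_station_data():
    # Sketch only: requires deepsensor and scotthosking/get-station-data.
    from deepsensor.data.sources import get_ghcnd_station_data

    return get_ghcnd_station_data(
        var_IDs=["TMAX", "PRCP"],
        extent="uk",
        date_range=("2015-01-01", "2015-12-31"),
        subsample_frac=0.1,  # 10% of stations, to keep the download small
        cache=True,
    )
```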

get_gldas_land_mask(extent='global', verbose=False, cache=False, cache_dir='.datacache')[source]

Get GLDAS land mask at 0.25 degree resolution. Source: https://ldas.gsfc.nasa.gov/gldas/vegetation-class-mask

Warning

If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.

Parameters:
  • extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.

  • verbose – bool Whether to print status messages. Default is False.

  • cache – bool Whether to cache the downloaded data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.

  • cache_dir – str Directory to store the cached data. Default is ".datacache".

Returns:

xarray.DataArray

Land mask (1 = land, 0 = water) with dimensions lat, lon.
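
A usage sketch, wrapped in a function since the call needs deepsensor and network access. The `where` step is one common way to apply such a mask to a gridded xarray dataset on the same lat/lon grid, shown here as an assumption rather than part of this API:

```python
def fetch_and_apply_land_mask(ds):
    # Sketch only: requires deepsensor and network access.
    from deepsensor.data.sources import get_gldas_land_mask

    land_mask = get_gldas_land_mask(extent="europe", cache=True)
    # Keep only land cells (1 = land, 0 = water); xarray aligns on lat/lon.
    return ds.where(land_mask == 1)
```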