deepsensor.data.sources#
- get_earthenv_auxiliary_data(var_IDs=('elevation', 'tpi'), extent='global', resolution='1KM', verbose=False, cache=False, cache_dir='.datacache')[source]#
Download global static auxiliary data from EarthEnv into an xarray DataArray. See: https://www.earthenv.org/topography.
Note
Requires the rioxarray package to be installed, e.g. via pip install rioxarray. See the rioxarray pages for more installation options: https://corteva.github.io/rioxarray/stable/installation.html
Note
This method downloads the data from EarthEnv to disk, then reads it into memory, and then deletes the file from disk. This is because EarthEnv does not support OpenDAP, so we cannot read the data directly into memory.
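The download-then-read-then-delete pattern described above can be sketched generically in Python. This is an illustration of the pattern only, not the library's actual implementation; the helper names are hypothetical:

```python
import os
import tempfile

def read_then_delete(download, parse):
    """Write a payload to a temporary file, parse it into memory,
    then remove the file from disk (mirroring the pattern above)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        download(path)               # stands in for the HTTP download to disk
        with open(path, "rb") as f:
            data = parse(f.read())   # read the file into memory
    finally:
        os.remove(path)              # delete the on-disk copy
    return data

def fake_download(path):
    # A stand-in for fetching a remote file
    with open(path, "wb") as f:
        f.write(b"\x01\x02\x03")

result = read_then_delete(fake_download, lambda raw: list(raw))
# result == [1, 2, 3], and the temporary file has been removed
```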
Note
At 1KM resolution, the global data is ~3 GB per variable.
Note
Topographic Position Index (TPI) is a measure of the local topographic position of a cell relative to its surrounding landscape. It is calculated as the difference between the elevation of a cell and the mean elevation of its surrounding landscape. This highlights topographic features such as mountains (positive TPI) and valleys (negative TPI).
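The TPI definition above (cell elevation minus the mean elevation of the surrounding landscape) can be illustrated on a toy grid. This is a minimal sketch using a fixed 3x3 neighbourhood, not EarthEnv's exact computation, which uses particular neighbourhood scales:

```python
import numpy as np

def tpi(elevation: np.ndarray) -> np.ndarray:
    """Toy TPI: each cell's elevation minus the mean of its 3x3
    neighbourhood (edges handled by padding with edge values)."""
    padded = np.pad(elevation.astype(float), 1, mode="edge")
    # Mean over the 3x3 window around each cell, including the cell itself
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    neighbourhood_mean = windows.mean(axis=(-2, -1))
    return elevation - neighbourhood_mean

# A single peak in a flat plain: positive TPI at the peak (a "mountain"),
# slightly negative TPI in the cells next to it (relative "valleys").
elev = np.zeros((5, 5))
elev[2, 2] = 9.0
t = tpi(elev)
# t[2, 2] == 8.0 (peak), t[1, 1] == -1.0 (neighbour), t[0, 0] == 0.0 (far away)
```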
Todo
support land cover data: https://www.earthenv.org/landcover
Warning
If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.
- Parameters:
var_IDs – tuple List of variable IDs. Options are: “elevation”, “tpi”.
extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.
resolution – str Resolution of data. Options are: “1KM”, “5KM”, “10KM”, “50KM”, “100KM”.
verbose – bool Whether to print status messages. Default is False.
cache – bool Whether to cache the data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.
cache_dir – str Directory to store the cached data. Default is ".datacache".
- Returns:
xarray.DataArray
Auxiliary data with dimensions lat, lon and variable var_ID.
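The extent argument crops the returned grid to a lat/lon bounding box. A minimal sketch of that cropping with plain numpy standing in for the DataArray's values (the coordinate handling and the example bounding box are assumptions for illustration; the real function works on xarray objects):

```python
import numpy as np

def crop_to_extent(data, lats, lons, extent):
    """Crop a (lat, lon) grid to (lon_min, lon_max, lat_min, lat_max)."""
    lon_min, lon_max, lat_min, lat_max = extent
    lat_mask = (lats >= lat_min) & (lats <= lat_max)
    lon_mask = (lons >= lon_min) & (lons <= lon_max)
    return data[np.ix_(lat_mask, lon_mask)], lats[lat_mask], lons[lon_mask]

lats = np.arange(-90, 91, 10.0)       # 19 latitudes on a coarse global grid
lons = np.arange(-180, 180, 10.0)     # 36 longitudes
grid = np.random.rand(lats.size, lons.size)

# Hypothetical UK-ish bounding box (illustrative numbers only)
cropped, clats, clons = crop_to_extent(grid, lats, lons, (-10, 5, 50, 60))
# cropped covers latitudes 50 and 60 and longitudes -10 and 0
```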
- get_era5_reanalysis_data(var_IDs=None, extent='global', date_range=None, freq='D', num_processes=1, verbose=False, cache=False, cache_dir='.datacache')[source]#
Download ERA5 reanalysis data from Google Cloud Storage into an xarray Dataset. Source: https://cloud.google.com/storage/docs/public-datasets/era5.
Supports parallelising downloads into monthly chunks across multiple CPUs. Supports caching the downloaded data locally to avoid re-downloading when calling the function again with the same arguments. The data is cached on a per-month basis, so if you call the function again with a different date range, data will only be downloaded if the new date range includes months that have not already been downloaded.
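The per-month chunking that drives both the parallel downloads and the cache can be sketched as follows. This is an illustrative reimplementation of the idea, not the library's internal code:

```python
import pandas as pd

def monthly_chunks(start, end):
    """Split an inclusive "YYYY-MM-DD" date range into
    (chunk_start, chunk_end) pairs, one per calendar month — the
    granularity at which downloads (and the cache) are organised."""
    chunks = []
    for m in pd.period_range(start, end, freq="M"):
        chunk_start = max(m.start_time, pd.Timestamp(start))
        chunk_end = min(m.end_time.normalize(), pd.Timestamp(end))
        chunks.append((chunk_start.date(), chunk_end.date()))
    return chunks

chunks = monthly_chunks("2020-01-15", "2020-03-10")
# Three chunks: the rest of January, all of February, the start of March.
# Extending the date range later only adds chunks for months not yet cached.
```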
Note
See the list of available variable IDs here: https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
Note
When freq = "D", hourly values are aggregated to daily values using the "mean" method, which may not be appropriate for accumulated variables like precipitation.
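The note above matters for accumulated variables: a daily mean of hourly precipitation is not a daily total. A small pandas sketch of the difference (toy data, not ERA5):

```python
import numpy as np
import pandas as pd

# 48 hours of toy hourly data
idx = pd.date_range("2020-01-01", periods=48, freq="h")
temp = pd.Series(np.tile(np.arange(24.0), 2), index=idx)  # instantaneous-like
prcp = pd.Series(1.0, index=idx)                          # 1 mm every hour

daily_temp = temp.resample("D").mean()        # sensible for temperature
daily_prcp_mean = prcp.resample("D").mean()   # what a "mean" aggregation gives
daily_prcp_sum = prcp.resample("D").sum()     # the actual daily accumulation
# daily_prcp_mean is 1.0 mm/hour, while the true daily total is 24.0 mm
```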
Warning
If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.
- Parameters:
var_IDs – list List of variable IDs to download. If None, all variables are downloaded. See the list of available variable IDs above.
extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.
date_range – tuple Tuple of (start_date, end_date) in format “YYYY-MM-DD”.
freq – str Frequency of data to download. Options are: “D” (daily) or “H” (hourly). If “D”, the data is downloaded from the 1-hourly dataset and then resampled to daily averages. If “H”, the 1-hourly data is returned as-is.
num_processes – int, optional Number of CPUs to use for downloading ERA5 data in parallel. Defaults to 1 (i.e. no parallelisation). If set to None, uses 75% of all available CPUs or 8 CPUs, whichever is smaller.
verbose – bool Whether to print status messages. Default is False.
cache – bool Whether to cache the data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.
cache_dir – str Directory to store the cached data. Default is ".datacache".
- Returns:
xarray.Dataset
ERA5 reanalysis data with dimensions time, lat, lon and variables var1, var2, etc.
- get_ghcnd_station_data(var_IDs=None, extent='global', date_range=None, subsample_frac=1.0, num_processes=None, verbose=False, cache=False, cache_dir='.datacache')[source]#
Download Global Historical Climatology Network Daily (GHCND) station data from NOAA into a pandas DataFrame. Source: https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily.
Note
Requires the scotthosking/get-station-data repository to be installed manually in your Python environment with:
pip install git+https://github.com/scott-hosking/get-station-data.git
Note
Example key variable IDs:
- "TAVG": Average temperature (degrees Celsius)
- "TMAX": Maximum temperature (degrees Celsius)
- "TMIN": Minimum temperature (degrees Celsius)
- "PRCP": Precipitation (mm)
- "SNOW": Snowfall
- "AWND": Average wind speed (m/s)
- "AWDR": Average wind direction (degrees)
The full list of variable IDs can be found here: https://www.ncei.noaa.gov/pub/data/ghcn/daily/readme.txt
Warning
If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.
- Parameters:
var_IDs – list List of variable IDs to download. If None, all variables are downloaded. See the list of available variable IDs above.
extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.
date_range – tuple[str, str] Tuple of (start_date, end_date) in format “YYYY-MM-DD”.
subsample_frac – float Fraction of available stations to download (useful for reducing download size). Default is 1.0 (download all stations).
num_processes – int, optional Number of CPUs to use for downloading station data in parallel. If not specified, will use 75% of all available CPUs.
verbose – bool Whether to print status messages. Default is False.
cache – bool Whether to cache the station metadata and data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.
cache_dir – str Directory to store the cached data. Default is ".datacache".
- Returns:
pandas.DataFrame
Station data with indexes time, lat, lon, station and columns var1, var2, etc.
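The subsample_frac argument keeps only a random fraction of stations. A sketch of that behaviour on a toy DataFrame in the shape described above (the helper is an illustrative reimplementation, not the library's code):

```python
import pandas as pd

# Toy station table: indexes (time, lat, lon, station), one column per var ID
idx = pd.MultiIndex.from_tuples(
    [("2020-01-01", 51.5, -0.1, "S1"),
     ("2020-01-01", 48.9, 2.3, "S2"),
     ("2020-01-01", 40.7, -74.0, "S3"),
     ("2020-01-01", 35.7, 139.7, "S4")],
    names=["time", "lat", "lon", "station"],
)
df = pd.DataFrame({"TAVG": [5.0, 6.0, 2.0, 8.0]}, index=idx)

def subsample_stations(df, frac, seed=0):
    """Keep a random fraction of stations (all rows for kept stations),
    mimicking what a subsample_frac-style argument does."""
    stations = df.index.get_level_values("station").unique()
    n_keep = max(1, round(frac * len(stations)))
    keep = pd.Series(stations).sample(n=n_keep, random_state=seed)
    return df[df.index.get_level_values("station").isin(set(keep))]

half = subsample_stations(df, 0.5)   # 2 of the 4 stations survive
```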
- get_gldas_land_mask(extent='global', verbose=False, cache=False, cache_dir='.datacache')[source]#
Get GLDAS land mask at 0.25 degree resolution. Source: https://ldas.gsfc.nasa.gov/gldas/vegetation-class-mask.
Warning
If this function is updated, the cache will be invalidated and the data will need to be re-downloaded. To avoid this risk, set cache=False and save the data to disk manually.
- Parameters:
extent – tuple[float, float, float, float] | str Tuple of (lon_min, lon_max, lat_min, lat_max) or string of region name. Options are: “global”, “north_america”, “uk”, “europe”.
verbose – bool Whether to print status messages. Default is False.
cache – bool Whether to cache the data locally. If True, calling the function again with the same arguments will load the data from the cache instead of downloading it again. Default is False.
cache_dir – str Directory to store the cached data. Default is ".datacache".
- Returns:
xarray.DataArray
Land mask (1 = land, 0 = water) with dimensions lat, lon.
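A common use of such a 0/1 mask is to screen out water cells before computing statistics. A minimal sketch with plain numpy standing in for the DataArray (with xarray you would typically write something like data.where(mask == 1) instead):

```python
import numpy as np

# Toy 0/1 land mask (1 = land, 0 = water) and a matching data grid
mask = np.array([[1, 0],
                 [1, 1]])
temps = np.array([[10.0, 11.0],
                  [12.0, 13.0]])

# Keep land cells, set water cells to NaN
land_only = np.where(mask == 1, temps, np.nan)

# Statistics over land cells only (NaN-aware mean ignores water)
land_mean = np.nanmean(land_only)   # mean of 10, 12 and 13
```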