Analysis pipeline guidance

This is a detailed guide to our analysis pipeline. See also the flowchart visualisation of the pipeline.

1 Prerequisites

Our bias correction methods rely on dedicated packages from both the Python and R ecosystems. Integrating the two languages lets us use the most up-to-date functionality implemented in each.

Given the dual-language nature of our analysis pipeline, we also provide preprocessing scripts written in both Python and R. To ensure a seamless experience, users need to set up both Python and R environments as detailed below.

1.1 Setting up your R environment

1.1.1 Download and Install R

Visit CRAN (The Comprehensive R Archive Network) to download the latest version of R compatible with your operating system. Then verify the installation from the command line:

R --version

1.1.2 Install Necessary R Packages

Our analysis uses several R packages. You can install them by starting R (type R in your command line and press Enter) and entering the following commands in the R console:

install.packages("package1")
install.packages("package2")
#... (continue for all necessary packages)

Replace "package1", "package2", etc., with the actual names of the required packages. A list of required R packages is usually at the top of each .R, .Rmd, some R specific .md files. Below is an example from WIP Comparing HADs grids:

library(terra)
library(sp)
library(exactextractr)
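
As a convenience, the same example packages can also be installed non-interactively from the shell (a sketch; the CRAN mirror URL is an assumption):

Rscript -e 'install.packages(c("terra", "sp", "exactextractr"), repos = "https://cloud.r-project.org")'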

1.2 Setting up your Python environment

For your Python environment, we provide a conda-lock.yml environment file for ease of use. Create the environment with:

conda-lock install --name clim-recal conda-lock.yml
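
Once the environment is created, activate it before running the Python scripts in this pipeline (the environment name matches the one used in the resampling step below):

conda activate clim-recal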

To parallelize the reprojection step, we make use of the GNU parallel shell tool. Verify that it is installed:

parallel --version
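
If parallel is not already installed, it is typically available from your system package manager; for example (package names may vary by platform):

sudo apt-get install parallel   # Debian/Ubuntu
brew install parallel           # macOS with Homebrew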

1.3 The cmethods library

This repository contains a Python script used to run debiasing on climate data using a fork of the original python-cmethods module written by Benjamin Thomas Schwertfeger, modified to work with the datasets used in the clim-recal project. The library is included as a submodule of this project, so you must run the following command to pull it:

git submodule update --init --recursive
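
You can confirm that the submodule was pulled correctly by listing its status (a standard git command; the printed commit hash and path reflect this repository's submodule configuration):

git submodule status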

1.4 Downloading the data

1.4.1 Climate data

This streamlined pipeline is designed for raw data provided by the Met Office, accessible through the CEDA archive. It uses UKCP control and scenario data at 2.2 km resolution, and HADs observational data. If you are unfamiliar with this data, refer to the dataset section.

To access the data, register at the CEDA archive and configure your FTP credentials under "My Account". Then use our ceda_ftp_download.py script to download the data.

# cpm data
python3 ceda_ftp_download.py --input /badc/ukcp18/data/land-cpm/uk/2.2km/rcp85/ --output 'output_dir' --username 'uuu' --psw 'ppp' --change_hierarchy

# hads data
python3 ceda_ftp_download.py --input /badc/ukmo-hadobs/data/insitu/MOHC/HadOBS/HadUK-Grid/v1.1.0.0/1km --output 'output_dir' --username 'uuu' --psw 'ppp'

You need to replace uuu and ppp with your CEDA username and FTP password respectively, and replace output_dir with the directory you want to write the data to.

The --change_hierarchy flag modifies the folder hierarchy to fit with the hierarchy in the Turing Azure file store. This flag only applies to the UKCP data and should not be used with HADs data. You can use the same script without the --change_hierarchy flag in order to download files without any changes to the hierarchy.

Tip

📢 If you are an internal collaborator, you can access the raw data as well as intermediate outputs through our Azure server. See the how-to guide for details.

1.4.2 Geospatial data

In addition to the climate data, we use geospatial data to divide the data into smaller chunks. Specifically, we use city boundary datasets for the spatial cropping step described below.

Warning

Sections below may need updating.

2 Reprojecting the data

The HADs data and the UKCP projections have different resolutions and coordinate systems; for example, the HADs dataset uses the British National Grid coordinate system.

The raw HADs and CPM coordinate structures are, respectively:
CRS.from_wkt('PROJCS["undefined",GEOGCS["undefined",DATUM["undefined",SPHEROID["undefined",6377563.396,299.324961266495]],PRIMEM["undefined",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",49],PARAMETER["central_meridian",-2],PARAMETER["scale_factor",0.9996012717],PARAMETER["false_easting",400000],PARAMETER["false_northing",-100000],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]')
CRS.from_wkt('GEOGCRS["undefined",BASEGEOGCRS["undefined",DATUM["undefined",ELLIPSOID["undefined",6371229,0,LENGTHUNIT["metre",1,ID["EPSG",9001]]]],PRIMEM["undefined",0,ANGLEUNIT["degree",0.0174532925199433,ID["EPSG",9122]]]],DERIVINGCONVERSION["Pole rotation (netCDF CF convention)",METHOD["Pole rotation (netCDF CF convention)"],PARAMETER["Grid north pole latitude (netCDF CF convention)",37.5,ANGLEUNIT["degree",0.0174532925199433,ID["EPSG",9122]]],PARAMETER["Grid north pole longitude (netCDF CF convention)",177.5,ANGLEUNIT["degree",0.0174532925199433,ID["EPSG",9122]]],PARAMETER["North pole grid longitude (netCDF CF convention)",0,ANGLEUNIT["degree",0.0174532925199433,ID["EPSG",9122]]]],CS[ellipsoidal,2],AXIS["longitude",east,ORDER[1],ANGLEUNIT["degree",0.0174532925199433,ID["EPSG",9122]]],AXIS["latitude",north,ORDER[2],ANGLEUNIT["degree",0.0174532925199433,ID["EPSG",9122]]]]')
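
To inspect these coordinate structures yourself, you can use GDAL's gdalsrsinfo utility (a sketch; the file name is a placeholder, and netCDF inputs may require GDAL's subdataset syntax such as NETCDF:"file.nc":variable):

gdalsrsinfo -o wkt path_to_netcdf_files/example_file.nc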

The first step in our analysis pipeline is to reproject the UKCP datasets to the British National Grid coordinate system. For this purpose, we utilize the Geospatial Data Abstraction Library (GDAL), designed for reading and writing raster and vector geospatial data formats.

Note that, to reproduce our exact pipeline, we switch environments here as explained in the requirements.

conda activate gdal_env

To execute the reprojection in parallel, run the reproject_all.sh script from your shell, replacing path_to_netcdf_files with the path to the raw netCDF files:

cd bash
sh reproject_all.sh path_to_netcdf_files
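
For reference, the script follows the usual GNU parallel pattern of mapping gdalwarp over the input files, roughly as below (an illustrative sketch only: the exact gdalwarp options and output naming are defined in reproject_all.sh; EPSG:27700 is the British National Grid):

# warp each netCDF file to British National Grid, one job per file
ls path_to_netcdf_files/*.nc | parallel gdalwarp -t_srs 'EPSG:27700' {} {.}_reprojected.tif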

3 Resampling the data

Resample the HADs UK dataset from its 1 km grid to a 2.2 km grid to match the reprojected UKCP grid. We run the resampling Python script, specifying the --input location of the reprojected files from the previous step, the UKCP --grid file, and the --output location for saving the resampled files.

# switch to main environment
conda activate clim-recal

# run resampling
cd ../python/resampling
python resampling_hads.py --input path_to_reprojected --grid path_to_grid_file --output path_to_resampled

4 Preparing the bias correction and assessment

4.1 Spatial cropping

Because the bias correction process is computationally intensive, handling large datasets can be challenging and time-consuming. To make the pipeline more manageable and efficient, we recommend splitting the data into smaller subsets.

For our example pipeline, we have opted to reduce the data to individual city boundaries. To crop, adjust the paths in the Cropping_Rasters_to_three_cities.R script to fit your own directory structure. The cropping is implemented in the cpm_read_crop and hads_read_crop functions.

Rscript Cropping_Rasters_to_three_cities.R

4.2 Calibration-validation data split

For the purpose of assessing our bias correction, we then split our data, both the projections and the ground-truth observations, by date. In this example we calibrate the bias correction on the years 1981 to 1983.

We then use data from the year 2010 to validate the quality of the bias correction. You need to replace path_to_cropped with the path where the data from the previous cropping step was saved, and path_to_preprocessed with your chosen output directory. You can leave the -v and -r flags as specified below, or choose another variable and run if you prefer.

cd ../debiasing
python preprocess_data.py --mod path_to_cropped --obs path_to_cropped -v tasmax -r '05' --out path_to_preprocessed --calib_dates 19810101-19831230 --valid_dates 20100101-20101230

The preprocess_data.py script also aligns the calendars of the historical simulation data and the observed data, ensuring that they have the same time dimension, and checks that the observed and simulated historical data have the same dimensions.

Note

preprocess_data.py makes use of our custom data loader functions. In data_loader.py we provide functions for loading and concatenating data into a single xarray object, which can be used for running the debiasing methods. Instructions on how to use these functions can be found in python/notebooks/load_data_python.ipynb.

5 Applying the bias correction

Note

As of March 2023, we have only implemented the python-cmethods library.

The run_cmethods.py script allows us to adjust biases in climate data using the python-cmethods library. It takes as input observation data (HADs data), control data (historical UKCP data), and scenario data (future UKCP data), and applies a correction method to the scenario data. The resulting output is saved as a .nc file to a specified directory.

The script will also produce a time-series plot and a map plot of the debiased data. To run it, replace path_to_validation_data with the output directories of the previous step and specify path_to_corrected_data as your output directory for the bias-corrected data. You can also specify your preferred bias_correction_method (e.g. quantile_delta_mapping).

python3 run_cmethods.py --input_data_folder path_to_validation_data --out path_to_corrected_data --method bias_correction_method --v 'tas'

The run_cmethods.py script loops over the time periods and applies debiasing in 10-year periods, performing the following steps:

  • Loads the scenario data for the current time period.
  • Applies the specified correction method to the scenario data.
  • Saves the resulting output to the specified directory.
  • Creates diagnostic figures of the output dataset (time series and time-dependent maps) and saves them to the specified directory.

For each 10-year time period it will produce a .nc output file with the adjusted data, plus a time-series plot and a time-dependent map plot of the adjusted data.
