Build TRE#

This task is the responsibility of the Trusted Research Environments Service Area (TRESA).

Prerequisites#

In order to deploy your TRE (Data Safe Haven SRE) to work with the prod4 Safe Haven Management Environment, you’ll need to have completed the Onboarding checklist.

Building the TRE#

Follow the Data Safe Haven (DSH) Secure Research Environment deployment guide, making sure you are reading the version of the docs (see the left-hand sidebar) that matches the DSH release you recorded in the TRE GitHub issue.

Important:

Ensure you have the correct release of the DSH codebase checked out. You should have recorded this in the TRE GitHub issue. Since you will have forked the codebase, it’s worth fetching the release tags from the upstream repo first (skip the first line if you have set the upstream remote previously):
```
git remote add upstream https://github.com/alan-turing-institute/data-safe-haven
git fetch --tags upstream
git checkout tags/vX.X.X
```
Use Start-Transcript as suggested in the guide to save a log of the deployment.
- Save it somewhere memorable, such as logs/<SRE ID>/deploy.txt
See Create TRE config file when you reach the SRE Configuration properties step.
If the TRE you are building is to be used in a Data Study Group (DSG), follow the Data Study Group (DSG) TRE setup.
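To confirm that the checkout described above landed on the release recorded in the TRE GitHub issue, a check along the following lines can help. This is a minimal sketch: `check_release` is a hypothetical helper name, not part of the DSH codebase, and it assumes you run it from inside your clone.

```shell
# Hypothetical helper: succeed only if HEAD is exactly at the expected tag.
# Usage (from inside your clone): check_release vX.X.X
check_release() {
    expected="$1"
    # `git tag --points-at HEAD` lists every tag that points at the current commit
    if git tag --points-at HEAD | grep -Fxq -- "$expected"; then
        echo "OK: HEAD is at release $expected"
    else
        echo "MISMATCH: HEAD is not at release $expected" >&2
        return 1
    fi
}
```

Running the check before you start the deployment avoids following one version of the docs while deploying another version of the code.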

Create TRE config file#

Once you reach the SRE Configuration properties step in the DSH deployment documentation, you should create the JSON configuration file from which the new environment is generated. This file is where you specify any TRE-specific requirements for the deployment.

To create the config, do the following:

Copy the template codeblock into a new comment on the GitHub issue for this TRE (this will allow other members of TRESA to have a copy of the config)
You should set some of the config fields to Turing-specific recommendations:
- shmId: the name of the currently active SHM (as of 21st August 2023 this is prod4: the SHM deployed using AzureAD prod4.turingsafehaven.ac.uk)
- sreId: You should have recorded this when completing Create TRE GitHub issue
- tier: This is recorded in the Project Initialisation form on Sharepoint
- subscriptionName: This will have been provided by RCP (see Request Azure Credits). Either the subscription name or the subscription ID works. The ID is probably safer, because it avoids any issues with special characters in the name (if the subscription cannot be found at deployment time, the deployment fails with the error Please provide a valid tenant or a valid subscription). Once RCP has created the subscription, you can find its ID in the Azure Portal under Subscriptions.
- computeVmImage.version: The version of the Data Safe Haven SRD image. You should also have recorded this when completing Create TRE GitHub issue.
- deploymentIpAddresses: This must be an IP address (or addresses) that the deployment team has access to. We suggest using 193.60.220.253, which is the IP address associated with the Turing VPN and/or the Turing Guest network.
- inboundAccessFrom: For Tier 3 use 193.60.220.240, which is the IP address associated with the Turing Secure network. For Tier 2 use 193.60.220.253, which is the IP address associated with the Turing VPN and/or the Turing Guest network. For Tier 0/1, simply use Internet to allow access from anywhere.
Save the config as a JSON file in the location described in the DSH docs. You’ll also need to save the prod4 SHM config file, which can be found here.
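The tier-dependent choice of inboundAccessFrom listed above can be summarised as a small lookup. This is only an illustrative sketch: `inbound_access_for_tier` is a hypothetical helper, and tiers that the list does not cover are deliberately treated as errors rather than guessed.

```shell
# Hypothetical helper: print the recommended inboundAccessFrom value for a tier.
# The values are the ones given in the list above; any other tier is rejected.
inbound_access_for_tier() {
    case "$1" in
        0|1) echo "Internet" ;;          # access from anywhere
        2)   echo "193.60.220.253" ;;    # Turing VPN and/or Turing Guest network
        3)   echo "193.60.220.240" ;;    # Turing Secure network
        *)   echo "No recommendation for tier: $1" >&2; return 1 ;;
    esac
}
```

For example, `inbound_access_for_tier 3` prints the Turing Secure network address, which you can then paste into the config.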

Data Study Group (DSG) TRE setup#

Our current approach for DSG deployments has four steps:

Initial setup#

For each DSG challenge, we create an SRE with a single Secure Research Desktop (SRD). In this step, we rely on the default values for VM size provided by the DSH scripts, which deploy a VM of the smallest size. This is sufficient to run the smoke tests and verify that the SRE is working correctly. This step can be started three weeks before the start of the DSG event week.

Deploying additional SRDs#

DSGs will need more compute power than a single small SRD can provide.

To minimise costs, we deploy additional SRDs 10 days before the start of the DSG event week. This gives us time to verify that the SRDs are working correctly and to request quota increases if necessary. For teams of 10 participants, we found that a total of 4 VMs of the following sizes satisfies the team’s compute requirements:

Dv5 and Dsv5-series (CPU only)
- One SRD of size Standard_D8_v5.
- Two SRDs of size Standard_D16_v5.
NCv3-series (GPU enabled)
- One SRD of size Standard_NC6s_v3.

If you deployed a single small SRD in the initial setup, you can resize it to the appropriate size and then deploy additional SRDs. The Data Safe Haven docs explain how to resize the VM of an existing SRD or deploy additional SRDs.

The SRD of size Standard_NC6s_v3 has a GPU, and most of the time we will need to contact Azure support for a quota increase before deploying it. See Adding a GPU-powered SRD below for details.
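When judging whether a quota increase is needed, it helps to add up the vCPUs that the four recommended VMs consume in each VM family. The arithmetic below is a sketch that takes the vCPU counts from the Azure size names (Standard_D8_v5 has 8, Standard_D16_v5 has 16, Standard_NC6s_v3 has 6); the variable names are purely illustrative.

```shell
# vCPUs needed per VM family for the recommended four-VM setup
cpu_family_vcpus=$(( 1 * 8 + 2 * 16 ))   # one Standard_D8_v5 + two Standard_D16_v5
gpu_family_vcpus=$(( 1 * 6 ))            # one Standard_NC6s_v3
total_vcpus=$(( cpu_family_vcpus + gpu_family_vcpus ))

echo "Dv5 family vCPUs needed:   $cpu_family_vcpus"
echo "NCSv3 family vCPUs needed: $gpu_family_vcpus"
echo "Total vCPUs needed:        $total_vcpus"
```

The per-family figures are the ones to compare against the family quotas of the subscription. Any other VMs in the same subscription and region count towards the regional total as well.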

Important

All VMs should be deployed in the UK South region to ensure that the data remains within the UK and there is no incompatibility with the rest of the SRE infrastructure. Additionally, all VMs should have Intel processors, not AMD.

Note

When resizing a VM in Azure, it sometimes helps to turn off the VM before resizing, especially if the error message says that the size is “not available in the current hardware cluster”.
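The stop-then-resize sequence in the note above can also be carried out with the Azure CLI. The sketch below is not the procedure from the DSH docs. It assumes that `az` is installed and logged in to the right subscription, and `resize_vm` is a hypothetical helper whose arguments you must supply yourself.

```shell
# Hypothetical helper: deallocate a VM, resize it, then start it again.
# Usage: resize_vm <resource-group> <vm-name> <new-size>
resize_vm() {
    rg="$1"; vm="$2"; size="$3"
    # Each step runs only if the previous one succeeded
    az vm deallocate --resource-group "$rg" --name "$vm" &&
    az vm resize --resource-group "$rg" --name "$vm" --size "$size" &&
    az vm start --resource-group "$rg" --name "$vm"
}
```

Deallocating first frees the VM from its current hardware cluster, which is why the resize is more likely to succeed afterwards.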

Adding a GPU-powered SRD#

For Data Study Groups at the Turing, it’s common for participants to request access to a GPU-enabled compute VM so that they can use tools such as CUDA in their research. Allocating a GPU-enabled VM that supports CUDA at short notice in Azure can prove tricky, but TRESA needs to be able to provide one quickly on request, so we’ve decided it’s prudent to set this up in advance. From experience, we have found that a single Standard_NC6s_v3 VM satisfies the GPU needs of most DSG teams.

When you try to deploy a new VM or resize an existing one, you might find that the VM family you want is not available. This is likely due to insufficient quota for that VM family. The solution is to request a quota increase for the VM family via the Azure Portal. Follow the instructions under the Tip in the DSH docs, making sure to choose the UK South region. We recommend requesting a quota for the Standard NCSv3 Family vCPUs with a vCPU quota of 6, which will give you access to a Standard_NC6s_v3 VM.
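Before filing a quota request, you can check the current usage and limit from the command line. The sketch below assumes the Azure CLI is installed and logged in to the subscription for this TRE. `show_ncsv3_quota` is a hypothetical helper, and the filter assumes that the internal family name contains the string NCSv3.

```shell
# Hypothetical helper: show current usage and limit for the NCSv3 family in UK South.
show_ncsv3_quota() {
    az vm list-usage --location uksouth \
        --query "[?contains(name.value, 'NCSv3')].{family:name.localizedValue, used:currentValue, limit:limit}" \
        --output table
}
```

A limit below 6 means that a Standard_NC6s_v3 VM cannot be deployed until the quota increase has been granted.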

Even with sufficient quota for the desired VM family, you might not be able to deploy a VM of that type because of high demand in the UK South region. In that case, we found that the solution is to contact Microsoft support and ask them to make that VM family available to you in the region you want. We also found that, even after Microsoft makes the VM family available, you sometimes cannot resize an existing VM to the desired size. The only solution then seems to be to deploy a new VM of the desired type by following the instructions in the DSH docs to add a new SRD.
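One way to see whether a size is restricted for your subscription is to list the matching SKUs for the region. This is a sketch under the same Azure CLI assumptions as elsewhere on this page, with `show_sku_restrictions` as a hypothetical helper. It reports subscription-level restrictions only, so a capacity shortage caused by high demand may still only show up when you deploy.

```shell
# Hypothetical helper: list a VM size in UK South, including restricted entries.
# Usage: show_sku_restrictions Standard_NC6s_v3
show_sku_restrictions() {
    # --all includes SKUs that are not available to the current subscription
    az vm list-skus --location uksouth --size "$1" --all --output table
}
```

If the Restrictions column of the output is not empty for the size you need, that is useful detail to include in the request to Microsoft support.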

When deploying a GPU-enabled VM, make sure to set the -ipLastOctet to something different from the CPU-enabled compute VMs; we recommend 180. Once the GPU machine has been deployed, you can log in to the SRD and verify that the GPU is visible to the OS by running nvidia-smi in a terminal window.

Shut down the VMs#

Leaving all VMs running is very expensive (especially for the GPU VM), so once deployed and tested they should be stopped and deallocated until a couple of days before the start of the DSG. Deallocation releases the VM’s compute resources back to Azure, so you are no longer charged for them (the disks are still billed as storage). You can stop and deallocate a VM by clicking the stop button in the Azure Portal. The easiest way to find all VMs is to go to the Virtual Machines section in the Azure Portal and filter by subscription. From experience, we found that starting a VM after it has been deallocated is relatively fast (a matter of minutes). To be on the safe side, though, you can start all VMs on the afternoon of the Friday before the DSG event week (or during the weekend if you are OK with that).
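If you prefer the command line to the Portal, the same stop and start steps can be sketched with the Azure CLI. The helper names below are hypothetical. The sketch assumes `az` is installed and logged in, and that every VM in the subscription belongs to this TRE.

```shell
# Hypothetical helpers; assume the Azure CLI (az) is installed and logged in.

# Print the resource ID of every VM in the given subscription (name or ID).
list_vm_ids() {
    az vm list --subscription "$1" --query "[].id" --output tsv
}

# Stop and deallocate every VM in the subscription.
deallocate_all_vms() {
    ids="$(list_vm_ids "$1")"
    [ -n "$ids" ] || { echo "No VMs found in subscription $1" >&2; return 1; }
    # $ids is left unquoted on purpose so that each ID becomes its own argument
    az vm deallocate --ids $ids
}

# Start every VM in the subscription again.
start_all_vms() {
    ids="$(list_vm_ids "$1")"
    [ -n "$ids" ] || { echo "No VMs found in subscription $1" >&2; return 1; }
    az vm start --ids $ids
}
```

Call `deallocate_all_vms <subscription ID>` once testing is finished and `start_all_vms <subscription ID>` shortly before the event week. Afterwards, it is still worth checking each VM's status in the Portal.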