Troubleshooting#

General information on how to manage a TRE can be found here: Manage the TRE.

This page explains what TRESA should do when things go wrong with the production instance of the Data Safe Haven platform, and who to ask for help and when. The headers on this page are numbered to reflect the order in which TRESA should consider these steps.

1) Search the documentation for troubleshooting tips#

Troubleshooting advice for common problems, for example users being unable to log in, is documented in the Data Safe Haven System Manager docs.

It’s always worth searching through these docs (as well as this TRESA docs site) to see if the problem you encounter has been experienced in the past and has a documented fix.

2) Consult the TRESA team#

More experienced members of TRESA may be familiar with the issue you are experiencing. Reach out to them, e.g. via the #tresa Slack channel on the Turing workspace.

If they are able to provide a solution that isn’t documented somewhere, you may wish to update the documentation for future reference. Think about whether the troubleshooting advice is generally applicable to the Data Safe Haven platform or a Turing-specific process and make a pull request on GitHub to either the Data Safe Haven docs or the TRESA docs.

If TRESA are unable to help, it’s possible there’s a bug in the Data Safe Haven codebase that needs to be reported.

3) Data Safe Haven bug reporting#

When any member of TRESA comes across a bug in the TRE they are using or deploying, they should raise an issue at https://github.com/alan-turing-institute/data-safe-haven/issues/new/choose using the relevant bug template; the DSH dev team will then address it. In general, it’s not the responsibility of TRESA to fix these issues. However, if you can think of a solution, fork the codebase and create a PR with your fix.
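To make bug reports quicker to file consistently, the report text can be drafted in a script and opened as a pre-filled issue form. The sketch below assumes GitHub's support for `title` and `body` query parameters on a repository's `issues/new` page. The helper names and the report fields (summary, steps, environment) are hypothetical and are not taken from the Data Safe Haven bug template, so the template linked above remains the reference for what a report should contain.

```python
from urllib.parse import urlencode

# Issue tracker named on this page. "/new" is used instead of "/new/choose"
# so that the query parameters can pre-fill the form (assumption: GitHub's
# documented "title" and "body" URL parameters).
ISSUES_NEW_URL = "https://github.com/alan-turing-institute/data-safe-haven/issues/new"


def format_bug_body(summary: str, steps: list[str], environment: str) -> str:
    """Assemble a plain-text report body.

    The section names here are illustrative only; they are not the
    project's actual bug template.
    """
    numbered_steps = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Summary\n{summary}\n\n"
        f"Steps to reproduce\n{numbered_steps}\n\n"
        f"Environment\n{environment}\n"
    )


def build_issue_url(title: str, body: str) -> str:
    """Return a URL that opens the new-issue form with title and body filled in."""
    return f"{ISSUES_NEW_URL}?{urlencode({'title': title, 'body': body})}"


if __name__ == "__main__":
    body = format_bug_body(
        summary="Users cannot log in to the SRE",
        steps=["Open the SRE login page", "Enter valid credentials"],
        environment="Production SHM, one SRE",
    )
    print(build_issue_url("Login fails", body))
```

The script only builds a link. The issue is still reviewed and submitted by hand in the browser, which keeps a person responsible for checking that no sensitive information ends up in a public issue.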

If the bug fix is deemed urgent, it may be prudent to ping the Data Safe Haven development team (all of whom are in REG) via Slack and work on a solution together.

4) Changes to a production Data Safe Haven#

Warning

Making a change to a production Safe Haven Management environment (SHM) should only be done under exceptional circumstances, with the approval of the TRESA and DSH teams, and may also need consultation with the Turing data protection team, IT or legal.

Bug fixes should be tested in a development environment (by the Data Safe Haven development team) before being applied to a production system. Liaise with the DSH dev team on what needs to be done to safely apply a bug fix to a deployed production SHM or SRE.

If the bug fix is for the SRE (rather than SHM), it may be simpler to tear down the SRE and re-build it with the patched release of the DSH code. Any data stored in a storage account will not be removed by a teardown, and any users who have set up their accounts on the SHM can be given access to the new SRE after it’s deployed (without having to recreate those accounts).

If the bug fix is for the SHM, liaise with the DSH development team on whether the fix can be made by running a script or requires a manual intervention in the Azure Portal.