Legend¶
In the following diagrams we use colours to indicate who has access to, or control over, particular resources. These are mapped to our roles:

- blue: owned by the TRE Operator Organisation
- green: owned by the FRIDGE Hosting Organisation
- orange: used by Job Submitters, owned by the TRE Operator Organisation
- pink: externally controlled resources, outside the scope of our roles
Arrows indicate the flow of permitted traffic. Solid lines indicate pushes, that is, they are triggered from the beginning of the arrow. Dotted lines indicate pulls, triggered from the end of the arrow.
Satellite TRE¶
Figure 1 demonstrates the high-level concept of a satellite TRE. It shows the connection of an existing TRE to a satellite TRE instance deployed on remote infrastructure.
The secure TRE tenancy boundary enables the extension of existing governance to the satellite TRE. On the remote infrastructure, a dashed line indicates the boundary of the TRE tenancy. All resources within the tenancy are within the governance domain of the Home TRE, through the governance boundary extension agreed in the shared responsibility model.
Figure 1: A schematic of the satellite TRE concept, showing the home TRE and TRE Tenancy.
FRIDGE and TRE Tenancy¶
Overview¶
Figure 2 gives an overview of the design of FRIDGE. Compared to Figure 1, Figure 2 reveals the structure of a FRIDGE satellite TRE in more detail. It shows how management traffic is isolated from research traffic, and how network isolation forms part of the definition of the TRE Tenancy.
Figure 2: A high-level overview of a FRIDGE instance, showing the home TRE and TRE Tenancy.
Requirements¶
Figure 2 represents a generic FRIDGE deployment, and specific details may vary between implementations. However, some requirements must be met by all implementations:

- No traffic is allowed between the Access Network and Isolated Network except for:
    - Kube Proxy to Kube API,
    - FRIDGE Proxy to FRIDGE API,
    - Container Runtime to Container Repository.
- No outbound traffic is allowed from the Isolated Network, except for that described above.
- No outbound traffic is allowed from the Access Network, except to selected container repositories.
- Both the Access Network and Isolated Network must be isolated from other networks on the FRIDGE Hosting Organisation's infrastructure.
- On a cloud-like system, the TRE Tenancy must be isolated from any other tenancies. For example, it must not be possible to share resources from the TRE Tenancy with other tenants.
Dual Network¶
The FRIDGE instance is split into two networks, each of which contains a K8s cluster. The Access Cluster is responsible for routing traffic from the Home TRE to the Isolated Cluster. The Isolated Cluster has access to sensitive data, and runs jobs on that data. Traffic between the two clusters is strongly restricted by a firewall, with only the connections shown in Figure 2 permitted. In addition, the Isolated Network has no outbound access, beyond the Container Runtime being able to pull container images from the container repository in the Access Network.
The dual-network design forms an important part of our approach to Defence in Depth, in addition to K8s-native network control. In the event of container breakout, or otherwise compromising the K8s nodes, there is still no route to exfiltrate sensitive data.
Connection from Home TRE¶
Bastion¶
To avoid publicly exposing the Kube API of the Access Cluster, some form of bastion (for example, a virtual machine running an SSH server, or WireGuard) should be used. The nature of this bastion may vary between implementations.
Router and Ingress¶
To correctly route traffic intended for the Access Cluster, a router or reverse proxy is used. This may route traffic based on port, hostname, prefix or some combination of these. The nature of this component may vary between implementations. In all cases, traffic must be directed to the Access Cluster, where a K8s Ingress Controller will route it to the correct service.
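As an illustration of the Ingress Controller's role described above, the sketch below shows hostname- and prefix-based routing using the standard Kubernetes Ingress resource. The host, ingress class, and service names are hypothetical; a real deployment will use its own.

```yaml
# Illustrative only: host, class, and service names are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fridge-access
spec:
  ingressClassName: nginx
  rules:
    - host: fridge.example.org
      http:
        paths:
          # Requests under /api are sent to the FRIDGE API proxy
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: fridge-api-proxy
                port:
                  number: 443
          # Requests under /registry are sent to the container repository
          - path: /registry
            pathType: Prefix
            backend:
              service:
                name: harbor
                port:
                  number: 443
```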
Proxies¶
For Job Submitters, the local API interface and FRIDGE proxy provide transparent access to the FRIDGE API. It will appear to them as a service in the network of their TRE workspace with endpoints for submitting and managing jobs dispatched to the FRIDGE instance. Similarly, TRE Administrators are able to manage the K8s components of their FRIDGE instance through their own API interface.
The proxies and Access Cluster's Kube API are distinct pods. Proxy pods run an SSH daemon and are used to pass requests through to the Isolated Cluster's Kube API or FRIDGE API via an SSH tunnel. Each API Interface at the Home TRE is required to generate an SSH key pair. Hence by installing the correct public key on each proxy, the TRE Operator Organisation can control who has access to the APIs in the Isolated Cluster. It would also be possible to further restrict traffic through network controls such as IP allowlists or exposing the Access Cluster only through a VPN.
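The key-pair and tunnelling flow described above might look like the following, where the host names and key file are purely illustrative:

```shell
# Hypothetical hosts and key names, for illustration only.
# 1. The API interface in the Home TRE generates an SSH key pair;
#    the public key is installed on the corresponding proxy pod.
ssh-keygen -t ed25519 -f fridge_proxy_key -N ""

# 2. Open a tunnel through the bastion to the proxy pod, forwarding
#    a local port to the FRIDGE API in the Isolated Cluster.
ssh -i fridge_proxy_key -J bastion.example.org \
    -L 8443:fridge-api.isolated.internal:443 \
    proxy@access.example.org
```

Requests to `localhost:8443` in the Home TRE are then passed transparently through to the FRIDGE API.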
FRIDGE internal¶
Figure 3: A diagram showing the key internal components of the FRIDGE Kubernetes clusters. Lines indicate access to private volumes.
Network Policy¶
Network traffic within the FRIDGE clusters is restricted. This is achieved using the Cilium CNI plugin, in addition to the isolation enforced at the network level.
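Cilium enforces the standard Kubernetes NetworkPolicy API as well as its own CRDs. A minimal sketch of a default-deny policy, assuming a hypothetical `fridge-jobs` namespace, would be:

```yaml
# Illustrative default-deny policy; the namespace name is hypothetical.
# Selects every pod in the namespace and, by declaring both policy
# types with no allow rules, blocks all ingress and egress traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: fridge-jobs
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

Specific allow rules for permitted flows (for example, jobs reading from object storage) would then be layered on top of this baseline.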
TLS¶
cert-manager will automatically provision and renew TLS certificates for services which can be reached over HTTPS, for example the container repository.
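A cert-manager Certificate resource for the container repository might look like the sketch below. The issuer, namespace, and DNS name are hypothetical placeholders.

```yaml
# Illustrative only: issuer, namespace and DNS name are hypothetical.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: harbor-tls
  namespace: registry
spec:
  # cert-manager stores the issued key pair in this Secret,
  # which the service then mounts to serve HTTPS.
  secretName: harbor-tls
  dnsNames:
    - harbor.fridge.internal
  issuerRef:
    name: fridge-ca-issuer
    kind: ClusterIssuer
```

cert-manager watches this resource and renews the certificate before expiry without manual intervention.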
Proxies¶
FRIDGE API¶
The FRIDGE API provides users with endpoints to manage data, and submit and monitor jobs. Writing a custom API separates Job Submitters from the underlying implementation, so that they may use a single FRIDGE interface regardless of the backend. The API is therefore resilient to changes in the FRIDGE Workflow Manager and storage. It will also enable the creation of user-focused FRIDGE tools such as CLIs or web interfaces for job submission and management.
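To illustrate the kind of interaction this enables, a Job Submitter's session might look like the following. The endpoint paths, port, and job-specification fields are hypothetical; the real API surface may differ.

```shell
# Hypothetical endpoints and fields, for illustration only.
# Submit a job specification through the tunnelled API interface:
curl -X POST https://localhost:8443/jobs \
    -H "Content-Type: application/json" \
    -d '{"image": "harbor.fridge.internal/project/analysis:1.0",
         "command": ["python", "run.py"]}'

# Monitor a previously submitted job:
curl https://localhost:8443/jobs/<job-id>
```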
Workflow Manager¶
The workflow manager receives job specifications from the FRIDGE API and launches jobs in the Job Namespace. The workflow manager is an instance of Argo Workflows.
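Since the workflow manager is Argo Workflows, a job specification submitted via the FRIDGE API ultimately becomes a Workflow resource in the Job Namespace. A minimal sketch, with an illustrative image and namespace:

```yaml
# Illustrative only: namespace and image are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fridge-job-
  namespace: fridge-jobs
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        # Image is pulled from the Container Repository in the
        # Access Network, not from a public registry.
        image: harbor.fridge.internal/project/analysis:1.0
        command: ["python", "run.py"]
```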
Job Namespace¶
To isolate Job Submitters' processes from the rest of the Isolated Cluster, including components which enforce security, jobs may only be run in a dedicated namespace. This namespace has no access to external resources, other than research data and container images, and jobs are restricted to run without privileges.
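One way to enforce the "no privileges" restriction described above is Kubernetes' built-in Pod Security Admission, applied as a label on the dedicated namespace. The namespace name here is hypothetical:

```yaml
# Illustrative only: the namespace name is hypothetical.
# The "restricted" Pod Security Standard rejects pods that request
# privileges, host access, or privilege escalation.
apiVersion: v1
kind: Namespace
metadata:
  name: fridge-jobs
  labels:
    pod-security.kubernetes.io/enforce: restricted
```

Network access for pods in this namespace would be limited separately, through the network policies described above.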
Container Repository¶
An instance of the Harbor container registry provides access to container images for the Isolated Cluster. It acts both as a read-through cache for allowed public registries (such as Docker Hub, Quay and GitHub Container Registry) and as a repository for Job Submitters' own container images. This allows Job Submitters to easily use custom software, by building a container image and pushing it to the repository.
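The build-and-push workflow for a Job Submitter might look like the following, where the registry host and project name are hypothetical:

```shell
# Hypothetical registry host and project, for illustration only.
# Build a custom image locally, tagged for the FRIDGE repository...
docker build -t harbor.fridge.internal/project/my-analysis:1.0 .
# ...then push it, making it available for jobs in the Isolated Cluster.
docker push harbor.fridge.internal/project/my-analysis:1.0
```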
Storage¶
Storage classes¶
FRIDGE defines two storage classes. One is for holding sensitive data, and the other for non-sensitive data. These storage classes need to be implemented for each target platform, as the appropriate CSI and options will vary.
For secure storage, if an available CSI supports encryption with keys provided by Kubernetes, that can be used. Otherwise, FRIDGE can deploy Longhorn, which will create Kubernetes volumes, backed by block storage, with data encrypted at rest.
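The Longhorn fallback can be expressed as a StorageClass along the lines of the sketch below, following the pattern in Longhorn's encrypted-volume documentation. The class and secret names are illustrative, and the secret holding the encryption passphrase must be created separately.

```yaml
# Illustrative only: names are hypothetical, and the "longhorn-crypto"
# Secret containing the passphrase must already exist.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fridge-secure
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  # Enable Longhorn's at-rest encryption for volumes of this class
  encrypted: "true"
  # Point each CSI stage at the Secret holding the encryption key
  csi.storage.k8s.io/provisioner-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/provisioner-secret-namespace: "longhorn-system"
  csi.storage.k8s.io/node-stage-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/node-stage-secret-namespace: "longhorn-system"
  csi.storage.k8s.io/node-publish-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/node-publish-secret-namespace: "longhorn-system"
```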
Object storage¶
An object storage system is used for managing data assets in the FRIDGE instance. This provides a convenient way to handle the ingress of inputs and egress of results.
Buckets are created for inputs and results. The inputs bucket is read-only to jobs, to prevent the corruption of input data.
The object storage is provided by an instance of MinIO and uses a volume of the secure storage class for its backend.
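MinIO supports AWS-style access policies, so the read-only restriction on the inputs bucket could be attached to job credentials with a policy like this sketch (bucket names taken from the description above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::inputs", "arn:aws:s3:::inputs/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::results/*"]
    }
  ]
}
```

Jobs can read inputs and write results, but cannot modify or delete input objects, preventing corruption of input data.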
Secure volumes¶
For higher performance than object storage, encrypted block devices can be accessed directly by jobs.
Insecure volumes¶
Unencrypted volumes are used by the Container Repository for caching container images.
Glossary¶
- Access Cluster
A Kubernetes cluster with services to manage the connection of the Home TRE to the Isolated Cluster, where sensitive-data workloads are run. It also hosts the Container Repository, which enables the Isolated Cluster to pull container images despite being isolated.
- Access Network
The FRIDGE network hosting the Access Cluster. This network acts as a bridge connecting the Home TRE to FRIDGE job execution components.
- Container Runtime
The container runtime is the component of a Kubernetes distribution which is responsible for running containers. Between distributions, the particular container runtime may differ, but all will communicate with Kubernetes through a standard interface.
In FRIDGE, it is important that the container runtime of the Isolated Cluster is configured to fetch container images from the container repository, as it will not be able to access public container registries.
- Home TRE
An existing TRE, complete with infrastructure, data governance and processes. Research questions are established in this TRE, before data and job specifications are dispatched to the satellite TRE for execution. The satellite TRE formally belongs within the governance boundary of the home TRE.
- Isolated Cluster
A Kubernetes cluster with services to run workloads on sensitive data, and manage inputs and results.
- Isolated Network
The FRIDGE network hosting the Isolated Cluster. This network creates a secure boundary around the FRIDGE components which run sensitive-data workloads. It is a key part of the security of a FRIDGE instance.