
TuringBench

A reproducible workflow for benchmarking data science software with containers

Turing projects are starting to produce large data science workflows in various domains, with modules for retrieving data, pre-processing it, analysing it, and producing output or visualisations. An application with a long pipeline may contain several different modules for retrieval, pre-processing, analysis and output. These projects represent major Turing outputs in terms of research and impact, so there is a need to determine appropriate ways for code to be made available in a modular fashion that can be deployed on many different computing systems and, where appropriate, architectures.

Routine containerisation is an appropriate mechanism for enabling cross-system deployment, and benchmarking the deployed applications and their individual modules then provides essential data about computational performance and statistical accuracy.

TuringBench has been developed as a loose set of instructions, or workflow, for carrying out benchmarking on multiple platforms using containerisation with Docker and Singularity.

Motivation for this site

This site has been set up to promote the use of containerisation for benchmarking by Turing researchers, and it offers several examples of how to follow the TuringBench workflow.

Each of these examples uses a Jupyter notebook to show how a researcher developing a piece of scientific software might follow the TuringBench workflow. Notebooks offer a convenient and familiar medium for following and documenting the workflow procedure and for saving benchmark results as tables or charts.

We hope to encourage researchers to submit benchmarks for software they have developed to this site, where the results can easily be viewed and updated for subsequent software versions, additional benchmarks or additional computing platforms. Any submissions will be listed on the Benchmarks page.

See below to get started with TuringBench and refer to the examples pages to see the workflow in action. The core workflow described here uses Docker containers, but the Elementary Guide to Platform Agnostic Development also provides information about running containers on HPC environments with Singularity.
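
For example, on an HPC system where Docker is not available, an image published to Docker Hub can usually be pulled and run with Singularity instead. A minimal sketch, assuming Singularity 3.x (which converts the Docker image into a .sif file) and the placeholder image name used in the Workflow section below:

     singularity pull docker://username/mybenchmarks:v2.1
     singularity run mybenchmarks_v2.1.sif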

Getting started

  1. Install Docker on each computing platform for which you wish to carry out benchmarking
  2. Familiarise yourself with the basics of how Dockerfile instructions work (a minimal Dockerfile is sketched after this list)
  3. Create an account on Docker Hub
  4. If you wish to use automated builds, ensure your software is maintained with a GitHub repository
  5. Create benchmarks that can be run to evaluate your software’s performance
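
As a rough illustration of steps 2 and 5, the Dockerfile below sketches what an image for a Python package might look like; the base image, requirements file and benchmark script names are placeholders for your own project:

     # Start from a base image that provides your language runtime (placeholder)
     FROM python:3.8-slim

     # Copy your software and benchmarks into the image
     COPY . /app
     WORKDIR /app

     # Install your software and its dependencies
     RUN pip install -r requirements.txt && pip install .

     # Run the benchmarking script when the container starts
     CMD ["python", "run_benchmarks.py"]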

Workflow

Follow the instructions below each time you release a new version of your data science software. Refer to the examples, which contain more detail on how this is done in practice.

  1. Write a Dockerfile that installs your software and its dependencies and runs your benchmarking script, then build a Docker image from it:

    Give it an appropriate name based on your Docker Hub username, with a version tag, e.g.

     docker build -t username/mybenchmarks:v2.1 .
    
  2. Push the image to Docker Hub (or set up an automatic build from GitHub)
     docker push username/mybenchmarks:v2.1
    
  3. Pull the image and run it on each of your computing platforms to get your benchmark results

     docker pull username/mybenchmarks:v2.1
    
     docker run username/mybenchmarks:v2.1
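
To keep the results from each platform, you might redirect the container's output to a suitably named file, assuming your benchmarking script writes its results to standard output; the file name here is just a placeholder:

     docker run username/mybenchmarks:v2.1 > results_platform_a.txt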
    

Contributing Benchmarks

To contribute benchmarks for a piece of software, simply follow these instructions:

  1. Create a Jupyter notebook in the style shown by the examples on this site

    This notebook should document the TuringBench workflow you carried out and contain a table or chart of the benchmarking results

  2. Run the following command to convert your notebook into markdown format:
      jupyter nbconvert --to markdown your_file.ipynb
    

    Note: you may wish instead to use RMarkdown or just create a markdown document from scratch

  3. Clone the GitHub repo for this project (see the command sketch after this list)
  4. Create a branch of the repo
  5. Add your markdown document to the _benchmarks directory

  6. Edit the markdown file by adding the following right at the top:
      ---
      layout: default
      title: Your notebook title
      ---
    

    Note: The title given will be used as the name of the link to your document on this site

  7. Push the branch and open a pull request
  8. Once the pull request has been merged, a link to your benchmark document, rendered as a web page, will appear on the Benchmarks page
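
For reference, steps 3 to 7 correspond roughly to the following commands; the repository URL, branch name and file name are placeholders rather than the project's actual details:

     git clone <turingbench-repo-url>
     cd <repo-directory>
     git checkout -b add-my-benchmarks
     cp /path/to/your_file.md _benchmarks/
     git add _benchmarks/your_file.md
     git commit -m "Add benchmarks for my software"
     git push origin add-my-benchmarks

The pull request itself can then be opened from the repository's page on GitHub.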

About this site

This site is developed and maintained by Ed Chalstrey and is based on the project work of Tomas Lazauskas, Anthony Lee and Ed Chalstrey at The Alan Turing Institute.

For more details about this project, please visit the GitHub repository or check out the project poster.