Warning: For this tutorial we provide a Unix-based example, so it should work on any macOS or Linux terminal. Windows users, apologies for the inconvenience; please raise an issue and we will try to expand this tutorial for you as soon as possible.
Goal. The goal of this tutorial is to make sure that you are able to package your code repository in a container in such a way that your future self, or external collaborators, can fully reproduce your work. Most importantly, you will do this in such a way that anyone will be able to reproduce the work without having to worry about differences in software versions, whether today or 100 years from now.
Knowledge requirements. This example is done entirely using free, open-source software. In order for this to work relatively seamlessly, we expect that you have good experience with the R language, the git version control system, and some basic knowledge of bash. If you do not have any of the above skills, this may be a bit of a stretch for you, but please feel free to still give it a go; we would love to hear your feedback on how to improve the content of this tutorial.
Major steps. In this tutorial you will:
- Pull an existing example git repository hosted on GitHub containing R code that runs a linear model on some dataset and produces a .pdf output bivariate plot;
- Learn how to build a Docker container from this repository, and run it locally;
- Push this container to Docker Hub;
- Use Binder to build and run the Docker container remotely, via an interactive RStudio session in your web browser. There you will have access to the full content of the code repository, as well as the correct OS and package versions. This approach will allow any user to run and fully reproduce the output generated by the original code in the repository.
But first things first.
One of the biggest challenges scientists face today is making sure that their research is fully reproducible. This reproducibility has three main pillars. It starts with a fully transparent plan of the experimental design and the hypothesis to be tested (see more here), moves on to the actual collection of the data (see more here), and ends with making sure that all analyses and outputs are fully reproducible from scratch using the collected data and computer code. Here we will address the latter form of reproducibility, based exclusively on open-source software and free-of-charge on-line platforms.
We will assume that you maintained a record of your original research intention, and that the data are fully collected and, most importantly, untouched. Raw data should always be kept as read-only. Any modification that needs to be applied to a dataset can be done via computer code, which helps you keep a fully transparent record of how the data were modified from the original version to the analysis version used to answer the research question.
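For example, here is a minimal sketch of this read-only philosophy in R (the file and column names here are hypothetical, not from the example repository):

```
# The raw file is read but never overwritten; the derived, analysis-ready
# version is written to a separate file, with the recipe kept in code.
raw <- read.csv("data/raw_measurements.csv")
clean <- raw[!is.na(raw$response), ]  # e.g. drop incomplete records
write.csv(clean, "data/clean_measurements.csv", row.names = FALSE)
```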
Once files are in place, we need to keep everything organised. Here we will follow a simple project directory structure suggested by the NiceRCode blog (NB: this is not the only approach to organising your code). So, for the sake of this particular tutorial, we will assume that you have also modularised your code into unit functions (see examples in the R language, but NB: modularising your code is a recommendation, not an obligation), and that you have been maintaining a full record of how the code has been modified through time using version control with git. Also, make sure you have your own free GitHub account.
This is where we start approaching the utility of Docker containers. Many of you may have experienced a situation where the code and project history are fully transparent and have been deposited in an on-line, free repository, but the work can no longer be reproduced because the software version and associated packages that you originally used are outdated. How can we solve this issue, such that our code will always yield the exact same output, even a million years from now?
As with anything in computer programming, every skill and technique comes packed with horrible lingo. When it comes to Docker containers you will probably see the words image and container being used a lot, so let's go ahead and get those definitions out of the way.
An image is a static (unchangeable) file that bundles code and all its dependencies such that your code repository runs reliably on, say, both your original MacBook and your colleague's Windows PC. It contains the necessary system libraries, code, runtime and system tools for this magic to happen. However, an image is just a snapshot which serves as a template to build a container. In other words, a container is a running image, and cannot exist without the image, whereas an image can exist without a container.
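To illustrate the distinction, here is a hedged sketch (`my-image` is a placeholder name; the `docker` commands used here are introduced properly later in the tutorial):

```
docker build -t my-image .   # builds an image: a static, reusable template
docker run my-image          # fires up container 1 from that image
docker run my-image          # fires up container 2; the image is unchanged
```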
Docker is a containerisation software which allows you to create lightweight, standalone, executable images from which you can create containers to run your code repository and fully reproduce the output. This essentially allows a scientist to isolate their code repository from its environment, addressing the third pillar of reproducibility described above.
First of all, you need to download and install Docker on your machine. This is basically the software that contains all the tricks for you to run your own containers. Once you have finished installing it, open the software, which contains OS-specific examples on how to build Docker images.
1. We will start by forking a public git repository from GitHub. Make sure you are logged into your own GitHub account. Then open the `open-AIMS/docker-example` repository page in your web browser. At the top right-hand corner, there is an option to "Fork" the repository. Forking makes a full copy of the `open-AIMS/docker-example` repository on your own account, allowing you to modify the content as much as needed without interfering with the history of the original `open-AIMS/docker-example`.
2. Now clone the forked repository from GitHub to a local folder of your choice on your machine (below referred to as `path_of_your_choice`). The forked repository will be named `username/docker-example`, where `username` corresponds to your own GitHub account name. Make sure to substitute the appropriate names in the example code below. On your Terminal:

```
cd path_of_your_choice
git clone https://github.com/username/docker-example.git
```
3. Now locally navigate to the cloned repository:

```
cd docker-example
```

Do not run anything just yet. Before we get to the Docker building part, we need to have a look at the file structure in this repository (sketched below). This code repository can be run by sourcing `analysis.R`. In brief, this file loads the ggplot2 package, sources R functions from the `R/functions.R` file, reads some data from the `data` folder, and generates a plot which is saved automatically to a folder named `output`. This code repository is structured following the NiceRCode guidelines. You should be able to simply `source("analysis.R")` in R, and inspect the generated output image.
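Based on the description above, the repository layout looks roughly like this (a sketch, not a byte-for-byte listing; check the repository itself for the exact contents):

```
docker-example/
├── analysis.R       # main script; sourcing it reproduces the output
├── R/
│   └── functions.R  # modular unit functions sourced by analysis.R
├── data/            # raw data read by analysis.R (kept read-only)
├── DESCRIPTION      # authors, license, and package dependencies
├── Dockerfile       # instructions Docker uses to build the image
└── .dockerignore    # files that should not be added to the image
# output/ is created by analysis.R when it runs
```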
4. We will use the files in this cloned repository to first build an image, and then run a container from this image. In terms of ensuring reproducibility, the key files are `DESCRIPTION`, `Dockerfile`, and `.dockerignore`.

The `DESCRIPTION` file contains a general description of what this code repository contains, information about the authors, and the license (see here why you should always include a license with your public repository). It also details the package dependencies needed to make the code run (in this example, just ggplot2).
The `Dockerfile` contains a set of instructions that Docker uses to build your image with the correct specifications (a minimal sketch is provided at the end of this step):

- The first three rows contain information on what version of the software you want (we use the `rocker/verse:3.6.3` image freely provided by the rocker team), as well as information about yourself (make sure to populate the fields with your own information accordingly).
- The FROM command points to rocker, which is itself an image with all the instructions to install R, RStudio and their system dependencies at a particular version (in this example, 3.6.3).
- The ARG command allows for additional user-specified arguments that can be passed to Docker on the command line when building the image.
- In this example, we add the argument `WHEN`, which we will use to specify a precise date from which to install all R package dependencies listed in the `DESCRIPTION`. This is possible by referring to the MRAN repository.
- The ENV / WORKDIR / RUN commands contain custom-built bash instructions that Docker uses while building the image. All you need to know at this stage is that we are telling Docker to create a folder `/home/rstudio` in which we will save all the files from this code repository.
- This "saving" step is accomplished by the COPY command, which tells Docker to copy all files/folders from the code repository to the image.
- The `Dockerfile` then tells Docker to install the packages listed within `DESCRIPTION`, and finally runs the CMD command, which executes tasks when we tell Docker to run a container from the image (see more on this below in step 11).
Please visit this link for a more in-depth understanding of what the `Dockerfile` is capable of. The `.dockerignore` plays essentially the same role as the `.gitignore` file in your version control system; it lists which files in the repository should not be added to the image. See more on why the `.dockerignore` file is important here.
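Putting the pieces above together, a minimal sketch of such a `Dockerfile` might look like the following (this is an illustration, not the repository's exact file; the maintainer label and the MRAN-pinned install command are assumptions on our part):

```
# Base image: R 3.6.3 + RStudio and system dependencies, from the rocker team
FROM rocker/verse:3.6.3

# Replace with your own information
LABEL maintainer="Your Name <[email protected]>"

# Date of the MRAN snapshot from which to install package dependencies;
# supplied at build time via --build-arg WHEN=YYYY-MM-DD
ARG WHEN
ENV WHEN=${WHEN}

# Create /home/rstudio and copy the code repository into it
WORKDIR /home/rstudio
COPY . .

# Install the dependencies listed in DESCRIPTION, pinned to the MRAN snapshot
RUN R -e "options(repos = list(CRAN = paste0('https://mran.microsoft.com/snapshot/', Sys.getenv('WHEN')))); \
          install.packages('remotes'); \
          remotes::install_deps()"

# Default task executed when a container is run from this image
CMD Rscript analysis.R
```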
5. Now that we're happy with the basic setup, make sure that Docker is open and running on your local machine:

```
docker --version
```

On my machine (as of 2020-10-21), this returns `Docker version 19.03.13, build 4484c46d9d`.
6. We're ready to build the image! Make sure you're in the `docker-example` path, and run on your Terminal (this took 8--10 minutes to run on my iMac):

```
docker build --build-arg WHEN=2020-03-31 -t docker-example .
```

NB: Do not forget the trailing `.`. The `--build-arg` flag allows us to pass a value to `WHEN` inside the `Dockerfile`. The `-t` flag (short for `--tag`) allows you to attribute a name to your image (in this case, docker-example). A full list of `build` arguments can be found here.
7. Although the image was built, no container has been run or created from this image yet. You can verify this:

```
# lists all existing images, including docker-example
docker images -a

# lists all existing containers
docker ps -a
```
8. If we want to inspect the image and its contents, we need to run it, i.e. essentially fire up a container. We can inspect the container, but for now we won't do anything with it. Navigate to your machine's home directory first (just to prove the point that the image is now independent of the code repository from which it was built), then run the code below, one line at a time, observing the outputs:

```
cd
docker run --rm -it --entrypoint=/bin/bash docker-example
ls -lG
R
```

and then, inside the R session that opens:

```
library(ggplot2)
packageVersion("ggplot2")
```
9. Notice that the R version is exactly as we specified (3.6.3), and the ggplot2 version is 3.3.0, which was the version available for R 3.6.3 on the specified MRAN date. Now quit R and then the container:

```
q()
exit
```
10. By default, Docker saves a version of the container to your local machine every time you run the image. This can clutter your disk space; adding the `--rm` flag ensures that this does not happen, i.e. the container gets deleted after you finish the run with `exit`. In other words, if you run

```
docker ps -a
```

again, you will see that no containers were saved to your machine. A container would have been saved were it not for the `--rm` flag. The `-it --entrypoint=/bin/bash` flags allow interactive standalone shell access to your container.
11. We can also run the container without the interactive mode. Based on the `CMD` line of our `Dockerfile`, Docker will navigate to `/home/rstudio` and run `Rscript analysis.R`:

```
docker run --rm docker-example
```
12. Notice that while the screen output `Saving 7 x 7 in image` indicates that our plot was produced, the output is not made locally available to us, and the container was deleted given the `--rm` flag. You can inspect a new container to check that the original image remains unaltered (i.e. no `output` folder exists):

```
docker run --rm -it --entrypoint=/bin/bash docker-example
ls -lG
exit
```
13. So, although the previous step was necessary for learning, it was of no use to us in practical terms. The practical solution is to run a new container attached to a local volume, so the output gets saved locally. To do that, we need to create a local directory, e.g. `outputdocker`, which will serve as the volume onto which the `output` folder inside the running container gets attached. The volume is indicated with the `-v` flag followed by `local_volume:container_directory`.

```
mkdir outputdocker
docker run --rm -v ~/outputdocker:/home/rstudio/output docker-example
```

NB: With the above code, Docker creates `/home/rstudio/output` automatically, so when Docker runs `analysis.R`, R will return a warning message stating that the folder `output` already exists, because `analysis.R` also tries to create a folder `output`; just ignore it.
14. After the above, you should see `myplot.pdf` also saved to your local folder `outputdocker`. Alternatively, everything can be run interactively attached to the local `outputdocker` volume at once, i.e. combining the above steps. Open your local file explorer, remove the output file `myplot.pdf`, and notice the changes as you run the code below (run it, but don't quit the interactive mode just yet):

```
docker run --rm -it --entrypoint=/bin/bash -v ~/outputdocker:/home/rstudio/output docker-example
Rscript analysis.R
```
15. You should now see `myplot.pdf` back in `outputdocker`. You can keep exploring and running anything you want in the container, including writing more code-generated files to `output`, and, by extension, `outputdocker`. For instance, remove `myplot.pdf` from the container:

```
rm output/myplot.pdf
```

In doing so, it also gets removed from `outputdocker`. The other way around also works; if you delete `myplot.pdf` from `outputdocker` on your machine, it will also be deleted from `/home/rstudio/output` in the container.

NB: while the container is running you won't be able to delete the `output` folder itself, because it is linked to `outputdocker`. Don't forget to exit the container:

```
exit
```
You can also have a more advanced, customised `Dockerfile` to, for example, run a container from an RStudio session via your web browser (see this great example by Drs Daniel Falster and Saras Windecker).
This is essentially the last part of our tutorial. You may be happy with building your Docker container locally, but you may also want to make it accessible to your colleagues who are less versed in these tools. You have three main options, in increasing (though not by much) order of time investment:
A) One option (harder for colleagues, easier for you) would be for them to also follow steps 2--13 above, as sketched below (they don't need to fork your GitHub repo as long as they don't try to push back to it; they won't have the permissions to do so unless you grant them).
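In a nutshell, this is a sketch of what your colleagues would run (`username` is a placeholder for your GitHub account name):

```
# Colleagues can clone your repository directly (no fork needed)
git clone https://github.com/username/docker-example.git
cd docker-example
# Build the image and run it attached to a local output volume (steps 6 and 13)
docker build --build-arg WHEN=2020-03-31 -t docker-example .
mkdir ~/outputdocker
docker run --rm -v ~/outputdocker:/home/rstudio/output docker-example
```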
B) Another option is to push the image to Docker Hub, similar to how one would push code to a repository on GitHub. To do so, first make sure you have created an account with Docker Hub, and that you are logged into this account locally in your Docker app. To keep things consistent and easy to remember, I would try to make the account name the same as your GitHub account name. Then, simply go back to the Terminal and type:

```
docker tag docker-example username/docker-example
docker push username/docker-example
```

Remember to replace `username` with your actual Docker Hub account name. You can check that the image is now hosted among your Docker Hub repositories. Your colleagues can then pull the Docker image locally onto their machines and simply run it (i.e. step 13 in the above section). That requires them to also have the Docker app installed on their machines. They have to pull the image you created and run it:

```
cd
docker pull username/docker-example
mkdir outputdocker
docker run --rm -v ~/outputdocker:/home/rstudio/output username/docker-example
```
NB: By default, the commands above will push a public repository to Docker Hub. Docker Hub only provides each user with one free private repository; you need to pay a fee to get access to multiple private repositories, should you need to share containers privately.
C) The final option would be for you to generate a Binder link to your image remotely hosted on Docker Hub. For that, you need to create a new `Dockerfile` in your GitHub repository, and save it in a directory called `.binder`. This new `Dockerfile` will point to your Docker Hub image via the FROM command. You can see an example `.binder/Dockerfile` on our GitHub repo, though if attempting this step do not forget to customise the first three lines of `.binder/Dockerfile` with your own Docker address, user and email information.
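For illustration, the `.binder/Dockerfile` might look roughly like this (a sketch; the image name, user and email are placeholders to substitute with your own, and the exact lines may differ from the example in our repo):

```
# Point FROM at the image you pushed to Docker Hub
FROM username/docker-example:latest

# Replace with your own information
LABEL maintainer="Your Name <[email protected]>"
```

Once this is done, navigate to mybinder.org, and paste the link of your GitHub repository.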
Binder will generate a launcher badge link that you can then add to your GitHub repo's `README.md` file, like this (the link below does not work; it serves just for illustration and guidance):
[![Launch Rstudio Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/username/docker-example/master?urlpath=rstudio)
Notice that the end part of the link, `?urlpath=rstudio`, is not originally provided by Binder, but adding it will make sure that an RStudio GUI is triggered in the web browser. All you need to do now is commit the change to your README and push it. By clicking on the launcher badge on your `README.md`, collaborators will have direct access to an on-line RStudio session containing your code, which can then be run interactively. NB: Binder may take a few minutes (in this example < 5 min) to fire up the RStudio session.
This last option is more laborious, but most definitely ideal if your collaborators only want to have access to your code.
NB: Binder is free and amazing, but limited in what it can offer. There are limits on usage and on the RAM available to run a container, so projects that require more system memory to install libraries or run analyses may not work. See more detailed info about Binder limitations here.
We would like to thank Drs Daniel Falster and Saras Windecker for providing us with a great example of how to implement this reproducibility framework. We also thank the rocker and Binder teams for making much of this possible.