-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path09-reproducible.Rmd
73 lines (50 loc) · 3.48 KB
/
09-reproducible.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
# Be Reproducible {#reproducible}
(ref:reproducible-intro)
**You Must** - (ref:reproducible-must)
**You Should** - (ref:reproducible-should)
**You Could** - (ref:reproducible-could)
|Related Areas: | [Demonstrably Correct](#demonstrably_correct) <br> [Documentation](#documentation) |
|--------------- |------------------------------------------------------------|
## Unambiguous Documentation for Reproducibility {#unambiguous_docs}
To be able to reproduce your analysis a colleague may need the following:
* The right copy of the code
* The right versions of any dependencies (i.e. libraries used in the code)
* The platform on which code is run
* operating system
* folder structure
* machine specifications
* The source data, or details of how to get it.
At the most basic level, documenting all of these will go a long way to making your analysis reproducible.
It might not make it _easy_ to reproduce however.
## Portability {#portability}
There are some simple thing you can do to improve the chance that your code runs on other computers:
* Use relative paths, not absolute paths. ([Wikipedia - Absolute and Relative Paths](https://en.wikipedia.org/wiki/Path_%28computing%29#Absolute_and_relative_paths)).
* Use a standard and consistent structure for organising your work. See [Projects and Environments](#projects) for more details.
## Project Structure {#projects}
Most languages offer tools and templates for a project based workflow.
Typically these include a way of organising the following components:
* Source Data
* Code
* Outputs
* Environment / Dependencies
* Documentation
By following a standard template for these components you can take advantage of workflow tools provided by your IDE which make it easier to:
* Version Control your work
* Organise your code and source data
* Refactoring and improving your code
* Producing documentation
* Control your environment and dependencies
All of these things are good for sharing or collaborating with others.
See [R at DHSC](#r_projects) and [Python at DHSC](#py_projects) for more information.
## Reproducible Analytical Pipelines {#rap_section}
There is a government community dedicated to the production of reproducible analysis.
See [Reproducible Analytical Pipelines](#rap) for more.
## Packages and Modules {#packages}
Most languages have a standard structure which is used to share code and documentation with other people. You will likely have used code in this structure (libraries / packages / modules) when performing your analysis. Typically these structures include documentation, information about dependencies, and tests.
There is no reason you can't use the same approach to sharing your analysis!
See [R at DHSC](#r_packages) and [Python at DHSC](#py_packages) for more information.
## Containers / Docker {#containers}
Containers allow you to manage the whole environment which a bit of code runs in. They are powerful but perhaps more technically involved than packaging your code or using project structures to manage your environment.
[Docker](https://www.docker.com/) is a [containerisation](https://en.wikipedia.org/wiki/OS-level_virtualisation) platform, which lets you reproduce environments with a wider scope than just the packages present.
With Docker you can manage the entire environment from the operating system and network up (including any packages).
You can use tools such as [docker-compose](https://docs.docker.com/compose/) and [Kubernetes](https://kubernetes.io/) to manage groups of containers relative to one another.