guide: Experiment Management #2146

Merged
merged 20 commits into from
Feb 19, 2021
20 commits
dac8b8b
guide: intro (empty) EM guide + basic links
jorgeorpinel Feb 3, 2021
352aaac
Merge branch 'master' into guide/experiments
jorgeorpinel Feb 3, 2021
b1e4c58
guide: intro, structure, and run-cache section in Exp Mgmt
jorgeorpinel Feb 3, 2021
92c3b1f
guide: finish Exp Mgmt intro, shorten run-cache section
jorgeorpinel Feb 4, 2021
70e4f91
guide: shorten Exp Mgmt intro and its run-cache section more, and
jorgeorpinel Feb 5, 2021
f73ae78
guide: finish explaining ephimeral exps
jorgeorpinel Feb 9, 2021
6b9b509
guide: improves to ephimeral exps, intro persisten exps in Exp Mgmt
jorgeorpinel Feb 9, 2021
203fa3c
guide: more info on ephimeral exps and wrap up persistent exps section
jorgeorpinel Feb 9, 2021
fd344d6
guide: begin Orging exps in Exp Mgmt
jorgeorpinel Feb 9, 2021
c7cc388
guide: copy edits Ephemeral experiments
jorgeorpinel Feb 9, 2021
2fabec3
Merge branch 'master' into guide/experiments
jorgeorpinel Feb 10, 2021
ea260f6
guide: remove 0. Tests and add add Checkpoints section
jorgeorpinel Feb 10, 2021
3099d1a
guide: 3 main exp forms in Exp Mgmt
jorgeorpinel Feb 10, 2021
d012334
guide: simplify Exp Mgmt tech details
jorgeorpinel Feb 10, 2021
006068c
guide: finish Checkpoints section in Exp Mgmt
jorgeorpinel Feb 11, 2021
520563b
guide: mention params, metrics, plots, and other copy edits to Exp Mgmt
jorgeorpinel Feb 11, 2021
b571f50
guide: wrap up comprehensive Ecx Mgmt doc, add checkpoint to dvc.yaml…
jorgeorpinel Feb 16, 2021
b540d4e
Merge branch 'master' into guide/experiments
jorgeorpinel Feb 16, 2021
f1bfda3
ref: move checkpoint field into dvc.yaml output spec
jorgeorpinel Feb 16, 2021
51fa327
glossary: migrate exp tooltip to new format (fontmatter)
jorgeorpinel Feb 16, 2021
7 changes: 4 additions & 3 deletions content/docs/index.md
@@ -1,8 +1,9 @@
# DVC Documentation

Data Version Control, or DVC, is a data and ML experiment management tool that
takes advantage of the existing engineering toolset that you're already familiar
with (Git, CI/CD, etc.).
Data Version Control, or DVC, is a data and ML
[experiment management](/doc/user-guide/experiment-management) tool that takes
advantage of the existing engineering toolset that you're already familiar with
(Git, CI/CD, etc.).

<cards>

1 change: 1 addition & 0 deletions content/docs/sidebar.json
@@ -127,6 +127,7 @@
"merge-conflicts"
]
},
"experiment-management",
"setup-google-drive-remote",
"large-dataset-optimization",
"external-dependencies",
3 changes: 3 additions & 0 deletions content/docs/start/experiments.md
@@ -14,6 +14,9 @@ Read on or watch our video to see how it's done!

https://youtu.be/iduHPtBncBk

> 📖 See [Experiment Management](/doc/user-guide/experiment-management) for more
> information on DVC's approach.

## Collecting metrics

First, let's see the mechanism used to capture values for these ML experiment
11 changes: 11 additions & 0 deletions content/docs/user-guide/basic-concepts/experiment.md
@@ -0,0 +1,11 @@
---
name: Experiment
match: [experiment, experiments]
tooltip: >-
An attempt to reach desired/better/interesting results during data pipelining
or ML model development. DVC is designed to help [manage
experiments](/doc/user-guide/experiment-management), having built-in
mechanisms like the
[run-cache](/doc/user-guide/project-structure/internal-files#run-cache) and
the `dvc experiments` commands (coming in DVC 2.0).
---
8 changes: 4 additions & 4 deletions content/docs/user-guide/basic-concepts/run-cache.md
@@ -2,10 +2,10 @@
name: 'Run-cache'
match: ['run-cache']
tooltip: >-
The DVC run-cache is a log of stages that have been run in the project. It's
comprised of `dvc.lock` file backups, identified as combinations of
dependencies, commands, and outputs that correspond to each other. `dvc repro`
and `dvc run` populate and reutilize the run-cache. See
A log of stages that have been run in the project. It's comprised of
`dvc.lock` file backups, identified as combinations of dependencies, commands,
and outputs that correspond to each other. `dvc repro` and `dvc run` populate
and reutilize the run-cache. See
[Run-cache](/doc/user-guide/project-structure/internal-files#run-cache) for
more details.
---
138 changes: 138 additions & 0 deletions content/docs/user-guide/experiment-management.md
@@ -0,0 +1,138 @@
# Experiment Management

Data science and ML are iterative processes that require a large number of
attempts to reach a certain level of a metric. Experimentation is part of the
development of data features, hyperparameter space exploration, deep learning
optimization, etc. DVC helps you codify and manage all of your
<abbr>experiments</abbr>, supporting these main approaches:

1. Create [experiments](#experiments) that derive from your latest project
version without having to track them manually. DVC does that automatically,
letting you list and compare them. The best ones can be promoted, and the
rest archived.
2. Place in-code [checkpoints](#checkpoints-in-source-code) that mark a series
of variations, forming an in-depth experiment. DVC helps you capture them at
runtime, and manage them in batches.
3. Apply experiments or checkpoints as [persistent](#persistent-experiments)
commits in your <abbr>repository</abbr>. Or create these versions from
scratch like typical project changes.

At this point you may also want to consider the different
[ways to organize](#organization-patterns) experiments in your project (as
Git branches, as folders, etc.).

DVC also provides specialized features to codify and analyze experiments.
[Parameters](/doc/command-reference/params) are simple values you can tweak in a
human-readable text file, which cause different behaviors in your code and
models. On the other end, [metrics](/doc/command-reference/metrics) (and
[plots](/doc/command-reference/plots)) let you define, visualize, and compare
meaningful measures for the experimental results.

## Experiments

⚠️ This feature is only available in DVC 2.0 ⚠️

`dvc exp` commands let you automatically track a variation to an established
[data pipeline](/doc/command-reference/dag). You can create multiple isolated
experiments this way, as well as review, compare, and restore them later, or
roll back to the baseline. The basic workflow goes like this:

- Modify <abbr>dependencies</abbr> (e.g. input data or source code),
<abbr>hyperparameters</abbr>, or commands (`cmd` field of `dvc.yaml`) of
committed stages.
- Use `dvc exp run` (instead of `repro`) to execute the pipeline. This puts the
experiment's results in your <abbr>workspace</abbr>, and tracks it under the
hood.
- Visualize experiment configurations and results with `dvc exp show`. Repeat.
- Use [metrics](/doc/command-reference/metrics) in your pipeline to identify the
best experiment(s), and promote them to persistent experiments (regular
commits) with `dvc exp apply`.
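
The workflow above can also be scripted. The following is only a minimal sketch, assuming `dvc` is on the `PATH` and the script runs inside a DVC project. The helper names (`workflow_commands`, `run_workflow`) and the experiment name `exp-1234` are hypothetical; only the `dvc exp` subcommands come from the steps listed above.

```python
import subprocess
from typing import List, Optional


def workflow_commands(best_experiment: Optional[str] = None) -> List[List[str]]:
    """Build the `dvc exp` commands for one iteration of the workflow."""
    commands = [
        ["dvc", "exp", "run"],   # execute the pipeline and track the result
        ["dvc", "exp", "show"],  # review configurations and results
    ]
    if best_experiment is not None:
        # Promote the chosen experiment (hypothetical name supplied by caller).
        commands.append(["dvc", "exp", "apply", best_experiment])
    return commands


def run_workflow(best_experiment: Optional[str] = None) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in workflow_commands(best_experiment):
        subprocess.run(command, check=True)
```

Calling `run_workflow()` repeats the run-and-review loop, while `run_workflow("exp-1234")` would additionally apply the experiment with that (made-up) name.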

<details>

### How does DVC track experiments?

DVC uses actual commits under custom
[Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
(found in `.git/refs/exps`) to keep track of experiments created with `dvc exp`.
Each commit has the repo `HEAD` as parent. These are not pushed to the Git
remote by default (see `dvc exp push`).

> References have a unique signature similar to the
> [entries in the run-cache](/doc/user-guide/project-structure/internal-files#run-cache).

</details>
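
To make the reference layout more concrete, here is a rough sketch that lists whatever is stored under `.git/refs/exps`. It assumes the references exist as loose files (refs packed by Git would not appear), and `list_experiment_refs` is a hypothetical helper, not part of DVC.

```python
from pathlib import Path
from typing import List


def list_experiment_refs(repo_root: str = ".") -> List[str]:
    """Return the names of loose Git refs stored under .git/refs/exps."""
    refs_dir = Path(repo_root) / ".git" / "refs" / "exps"
    if not refs_dir.is_dir():
        return []  # no experiments have been tracked yet
    return sorted(
        "refs/exps/" + path.relative_to(refs_dir).as_posix()
        for path in refs_dir.rglob("*")
        if path.is_file()
    )
```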

## Checkpoints in source code

⚠️ This feature is only available in DVC 2.0 ⚠️

To track successive steps in a longer experiment, you can write your code so it
registers checkpoints with DVC at runtime. This allows you, for example, to
track the progress in deep learning techniques such as evolving neural networks.

This kind of experiment is also derived from your latest project version, but it
tracks a series of variations (the checkpoints). You interact with them using
`dvc exp run`, `dvc exp resume`, and `dvc exp reset` (see also the `checkpoint`
field of `dvc.yaml` outputs).

<details>

### How are checkpoints captured by DVC?

When DVC runs a checkpoint-enabled pipeline, a custom Git branch (in
`.git/refs/exps`) is started off the repo `HEAD`. A new commit is appended each
time the code calls `dvc.api.make_checkpoint()` or writes a
`.dvc/tmp/DVC_CHECKPOINT` signal file. These are not pushed to the Git remote by
default (see `dvc exp push`).

</details>
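
As an illustration of the signal-file route, the sketch below creates `.dvc/tmp/DVC_CHECKPOINT` after each iteration of a toy training loop. The function names and the loop itself are hypothetical. How DVC consumes the file, and whether the code should wait for it to be handled, is not covered here, so real code may prefer the `dvc.api.make_checkpoint()` call mentioned above.

```python
from pathlib import Path


def signal_checkpoint(repo_root: str = ".") -> Path:
    """Create the checkpoint signal file and return its path."""
    signal_file = Path(repo_root) / ".dvc" / "tmp" / "DVC_CHECKPOINT"
    signal_file.parent.mkdir(parents=True, exist_ok=True)
    signal_file.touch()
    return signal_file


def train(epochs: int, repo_root: str = ".") -> int:
    """Toy training loop that signals a checkpoint after every epoch."""
    checkpoints = 0
    for _epoch in range(epochs):
        # ... update the model and write the checkpoint output here ...
        signal_checkpoint(repo_root)
        checkpoints += 1
    return checkpoints
```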

## Persistent experiments

When your experiments are good enough to save or share, you may want to store
them persistently as commits in your <abbr>repository</abbr>.

Whether the results were produced with `dvc repro` directly, or after a
`dvc exp` workflow (refer to previous sections), the `dvc.yaml` and `dvc.lock`
pair in the <abbr>workspace</abbr> will codify the experiment as a new project
version. The right <abbr>outputs</abbr> (including
[metrics](/doc/command-reference/metrics)) should also be present, or available
via `dvc checkout`.

> 👨‍💻 See [Get Started: Experiments](/doc/start/experiments) for a hands-on
> introduction to regular experiments.

### Organization patterns

DVC takes care of arranging `dvc exp` experiments and the data
<abbr>cache</abbr> under the hood. But when it comes to full-blown persistent
experiments, it's up to you to decide how to organize them in your project.
These are the main alternatives:

- **Git tags and branches** - use the repo's "time dimension" to distribute your
experiments. This makes the most sense for experiments that build on each
other. Helpful if the Git [revisions](https://git-scm.com/docs/revisions) can
be easily visualized, for example with tools
[like GitHub](https://docs.github.com/en/github/visualizing-repository-data-with-graphs/viewing-a-repositorys-network).
- **Directories** - the project's "space dimension" can be structured with
directories (folders) to organize experiments. Useful when you want to see all
your experiments at the same time (without switching versions) by just
exploring the file system.
- **Hybrid** - combining an intuitive directory structure with a good repo
branching strategy tends to be the best option for complex projects.
Completely independent experiments live in separate directories, while their
progress can be found in different branches.

## Automatic log of stage runs (run-cache)

Every time you `dvc repro` pipelines or `dvc exp run` experiments, DVC logs the
unique signature of each stage run (to `.dvc/cache/runs` by default). If a
signature has not been logged before, the stage command(s) are executed
normally. Every subsequent time a [stage](/doc/command-reference/run) runs under
the same conditions, the previous results are restored instantly, without
wasting time or computing resources.

✅ This built-in feature is called <abbr>run-cache</abbr> and it can
dramatically improve performance. It's enabled out-of-the-box (but can be
disabled with the `--no-run-cache` command option).
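
The idea behind the run-cache can be sketched in a few lines. This is a conceptual illustration only, not DVC's actual hashing scheme or storage format: the `stage_signature` function, the `RunCache` class, and the in-memory dictionary standing in for `.dvc/cache/runs` are all assumptions made for the example.

```python
import hashlib
import json
from typing import Callable, Dict


def stage_signature(cmd: str, dep_hashes: Dict[str, str]) -> str:
    """Hash the command together with the content hashes of its dependencies."""
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """In-memory stand-in for a log of stage runs keyed by signature."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.executions = 0  # how many times a stage really ran

    def run(self, cmd: str, dep_hashes: Dict[str, str],
            execute: Callable[[], str]) -> str:
        key = stage_signature(cmd, dep_hashes)
        if key not in self._results:
            # First time under these conditions: run the stage and log it.
            self._results[key] = execute()
            self.executions += 1
        # Otherwise the logged result is returned without re-running.
        return self._results[key]
```

Running the same command against unchanged dependency hashes a second time returns the stored result, while changing any dependency hash produces a new signature and triggers execution again.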
10 changes: 7 additions & 3 deletions content/docs/user-guide/project-structure/internal-files.md
@@ -131,9 +131,10 @@ That's how DVC knows that the other two cached files belong in the directory.
have been run in the project. It is found in the `runs/` directory inside the
cache (or [remote storage](/doc/command-reference/remote)).

Runs are identified as combinations of <abbr>dependencies</abbr>, commands, and
<abbr>outputs</abbr> that correspond to each other. These combinations are
hashed into special values that make up the file paths inside the run-cache dir.
Runs are identified as combinations of exact <abbr>dependency</abbr> contents
(or [parameter](/doc/command-reference/params) values), and the literal
command(s) to execute. These combinations are represented by special hashes that
translate to the file paths inside the run-cache dir:

```dvc
$ tree .dvc/cache/runs
@@ -151,3 +152,6 @@ run.

💡 `dvc push` and `dvc pull` (and `dvc fetch`) can upload and download the
run-cache to and from remote storage for sharing and/or as a backup.

> Note that the run-cache assumes that stage commands are deterministic (see
> **Avoiding unexpected behavior** in `dvc run`).
17 changes: 11 additions & 6 deletions content/docs/user-guide/project-structure/pipelines-files.md
@@ -44,6 +44,8 @@ changed to decide whether the stage requires re-execution (see `dvc status`).
If it writes files or dirs, they can be defined as <abbr>outputs</abbr>
(`outs`). DVC will track them going forward (similar to using `dvc add`).

> See the full stage entry [specification](#stage-entries).

### Parameter dependencies

[Parameters](/doc/command-reference/params) are a special type of stage
@@ -337,7 +339,9 @@ stages:
> Note that this feature is not compatible with [templating](#templating) at the
> moment.

## Specification
## Stage entries

These are the fields that are accepted in each stage:

| Field | Description |
| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -369,11 +373,12 @@ validation and auto-completion.
> Notice that these are a subset of those in `.dvc` file
> [output entries](/doc/user-guide/project-structure/dvc-files#output-entries).

| Field | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `cache` | Whether or not this file or directory is <abbr>cached</abbr> (`true` by default). See the `--no-commit` option of `dvc add`. |
| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts). |
| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. |
| Field | Description |
| ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cache` | Whether or not this file or directory is <abbr>cached</abbr> (`true` by default). See the `--no-commit` option of `dvc add`. |
| `persist`    | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts). |
| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. |
| `checkpoint` | Set to `true` to let DVC know that this output is associated with [in-code checkpoints](/doc/user-guide/experiment-management#checkpoints-in-source-code) (for `dvc experiments`). |

## dvc.lock file

3 changes: 3 additions & 0 deletions content/docs/user-guide/related-technologies.md
@@ -82,6 +82,9 @@ _Luigi_, etc.

## Experiment management software

> See also the [Experiment Management](/doc/user-guide/experiment-management)
> guide.

- DVC uses Git as the underlying version control layer for data, pipelines, and
experiments. Data versions exist as metadata in Git, as opposed to using
external databases or APIs, so no additional services are required.
9 changes: 5 additions & 4 deletions content/docs/user-guide/what-is-dvc.md
@@ -1,10 +1,11 @@
# What Is DVC?

**Data Version Control** is a new type of data versioning, workflow, and
experiment management software, that builds upon [Git](https://git-scm.com/)
(although it can work stand-alone). DVC reduces the gap between established
engineering tool sets and data science needs, allowing users to take advantage
of new [features](#core-features) while reusing existing skills and intuition.
[experiment management](/doc/user-guide/experiment-management) software, that
builds upon [Git](https://git-scm.com/) (although it can work stand-alone). DVC
reduces the gap between established engineering tool sets and data science
needs, allowing users to take advantage of new [features](#core-features) while
reusing existing skills and intuition.

![](/img/reproducibility.png) _DVC codifies data and ML experiments_
