guide: finish Exp Mgmt intro, shorten run-cache section
jorgeorpinel committed Feb 4, 2021
1 parent b1e4c58 commit 776012d
Showing 1 changed file with 37 additions and 19 deletions.
56 changes: 37 additions & 19 deletions content/docs/user-guide/experiment-management.md
@@ -2,38 +2,56 @@

Data science and ML are iterative processes that tend to require a large number
of attempts during their course, for example to develop data features,
hyperparameter space exploration, model accuracy optimization, etc. DVC is designed to
help you codify and manage all of your experiments.

Kinds of exps... With DVC, no variation of your code or data is left
hyperparameters
hyperparameter space exploration, deep learning optimization, etc. DVC is designed to help
you codify and manage all of your experiments.

DVC distinguishes several levels at which variants of your work count as
_experiments_:

0. Tests you do on your own without DVC knowing about them (we can't help with
that!)
1. An automatic log of every stage run through DVC is the entry point for these
features.
2. _Ephemeral experiments_ can be set up in virtual project branches. This is
where you can start **automating** their execution and generate reports
comparing many of them. At some point a few are selected/promoted, and the
rest can be abandoned.
3. _Persistent experiments_ can be picked up from previous levels, or they can
be registered manually by **committing** their results to Git. This is where
you may want to start thinking about the different ways to
[organize](#organizing-experiments) them in your project (branches,
folders, etc.).

## Automatic log of stage runs

DVC already caches every change to <abbr>outputs</abbr> when it can (see also
`dvc status`). Additionally, `dvc repro` and `dvc run` by default populate and
reuse a log of stages that have been run in the project, known as the
<abbr>run-cache</abbr>.
Every time you execute [stages](/doc/command-reference/run) with `dvc repro`, DVC
determines a unique identifier for each stage "run" (logged to `.dvc/cache/runs`
by default). If that run has never happened before, the stage command(s) are executed and
their <abbr>outputs</abbr> cached normally. Every subsequent time the stage runs
under the same conditions, those results can be restored instantly, without
wasting time or computing resources.
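
The execute-once, restore-afterwards behavior described above can be sketched as a small lookup table. This is only an illustration, not DVC's implementation: the `ToyRunCache` class, its in-memory dictionary, and the run identifiers such as `train-lr0.1` are all hypothetical.

```python
from typing import Callable, Dict


class ToyRunCache:
    """In-memory stand-in for a log of stage runs (hypothetical, not DVC's format)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.executions = 0  # counts how many times a command actually ran

    def run(self, run_id: str, command: Callable[[], str]) -> str:
        # This exact run was seen before: restore the stored result.
        if run_id in self._results:
            return self._results[run_id]
        # First time under these conditions: execute and record the output.
        self.executions += 1
        result = command()
        self._results[run_id] = result
        return result


cache = ToyRunCache()
first = cache.run("train-lr0.1", lambda: "model-a")
second = cache.run("train-lr0.1", lambda: "model-a")  # restored, not re-executed
third = cache.run("train-lr0.01", lambda: "model-b")  # new conditions, executed
```

Only two of the three calls execute their command; the repeated one returns the stored result. Like the real run-cache, the sketch is only safe if the command is deterministic.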

This means that every time you execute [stages](/doc/command-reference/run) with
DVC, the unique combination that identifies that "run" is saved internally (in
`.dvc/cache/runs` by default). The corresponding results (typically
<abbr>cached</abbr>) can later be retrieved in subsequent runs, even if you
didn't remember that the combination had been tried before!
This mechanism can dramatically improve performance, and it's a built-in
feature, enabled out-of-the-box (it can be disabled via the `--no-run-cache`
option).

When this happens, the results are restored instantly, without wasting time or
computing resources. This can dramatically improve performance, and it's a
built-in feature that just works out-of-the-box (it can be disabled via the
`--no-run-cache` option).
> Note that the run-cache assumes that stage commands are deterministic (see
> **Avoiding unexpected behavior** in `dvc run`).
## Ephemeral experiments

Unique stage runs can be identified by the combination of their dependencies
(including params) and the command(s) to execute.
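
As a rough sketch of how such a combination could be turned into a single identifier, the example below hashes the commands, dependency checksums, and params together. The function name `stage_run_id`, the use of SHA-256 over JSON, and the sample file names and checksums are assumptions made for illustration; the actual identifier DVC computes may be built differently.

```python
import hashlib
import json
from typing import Dict, List


def stage_run_id(
    commands: List[str],
    deps: Dict[str, str],      # dependency path -> content checksum (hypothetical values)
    params: Dict[str, object], # parameter name -> value
) -> str:
    # Serialize with sorted keys so the same combination always yields
    # the same text, regardless of dictionary insertion order.
    payload = json.dumps(
        {"cmd": commands, "deps": deps, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = stage_run_id(["python train.py"], {"data.csv": "abc123"}, {"lr": 0.1})
same = stage_run_id(["python train.py"], {"data.csv": "abc123"}, {"lr": 0.1})
changed = stage_run_id(["python train.py"], {"data.csv": "abc123"}, {"lr": 0.01})
```

Here `base` and `same` are equal because nothing changed, while `changed` differs because one param value is different, so it would count as a new run.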

Every run of a stage or pipeline can be considered an experiment. These are
identified by the exact combination of dependencies, , and

frequent, transient, brainstorming

## Persistent experiments

selected, committed

## Ways to organize experimentation
## Organizing experiments

Implicit vs. Git branches/tags vs. file structures
