guide: Managing Experiments #2752

Closed
wants to merge 53 commits into from
Changes from 52 commits
Commits
a67a3a3
initial structure and text from the ref added
iesahin Aug 4, 2021
03614f9
Added some pipelines info
iesahin Aug 5, 2021
a710bfb
fixed links, uniform stage names
iesahin Aug 6, 2021
b4ea144
added --interactive description
iesahin Aug 6, 2021
d10b404
added target and recursive search sections
iesahin Aug 6, 2021
ee770ee
added sidebar link
iesahin Aug 7, 2021
554f49b
parameters section
iesahin Aug 7, 2021
dc8ca5f
additions to parameters section
iesahin Aug 7, 2021
65cebd3
merged
iesahin Aug 9, 2021
abdc5cf
Rebased with origin/master
iesahin Aug 7, 2021
85bafd3
updates to parameters section
iesahin Aug 9, 2021
7b30610
parallel experiments added
iesahin Aug 9, 2021
a8df69b
temporary experiments
iesahin Aug 9, 2021
0fed151
correct
iesahin Aug 9, 2021
60ab835
some reorg
iesahin Aug 9, 2021
aaa166d
began writing checkpoints
iesahin Aug 10, 2021
cbe1ecf
added example for --temp
iesahin Aug 10, 2021
cda1ec8
added checkpoints explanations and julia example
iesahin Aug 10, 2021
de0a2fa
prettied
iesahin Aug 10, 2021
0cf27d0
WIP
iesahin Aug 11, 2021
6a118f7
r example done
iesahin Aug 11, 2021
0ae2f22
Merge branch 'iesahin/ug-exp-run-2675' of github.com:iterative/dvc.or…
iesahin Aug 11, 2021
1dbe4ce
fixes for the motivation part
iesahin Aug 16, 2021
856af67
Edits in the pipeline section
iesahin Aug 16, 2021
c7e12fc
updates to "running the pipeline" section
iesahin Aug 16, 2021
d941a83
removed `dvc exp run dir` because it doesn't work
iesahin Aug 16, 2021
1ebe108
running stages independently
iesahin Aug 16, 2021
f52793c
minor edits
iesahin Aug 16, 2021
68cab41
removed information about -R and --glob options
iesahin Aug 18, 2021
bb9b3a9
edits for simplification
iesahin Aug 18, 2021
a60f4e7
(re)moved some sections
iesahin Aug 18, 2021
af9469f
guide: move general exps motivation to index, focus on Running Exps
jorgeorpinel Aug 20, 2021
b88e34c
guide: simmarize pipeline concept intro in Running Exps
jorgeorpinel Aug 20, 2021
31f28fb
guide: copy edits to Pipeline/repro guide in Running Exps guide
jorgeorpinel Aug 20, 2021
fd1d1c5
guide: reduce note on `repro`
jorgeorpinel Aug 20, 2021
a496f74
guide: summarize parameters section of Running Exps
jorgeorpinel Aug 20, 2021
27a475c
guide: remove repetitive section + update set-param note
jorgeorpinel Aug 20, 2021
7c80f59
guide: fix up the Queue section of Running Exps
jorgeorpinel Aug 20, 2021
778e124
guide: remove Exp names section and create Managing Exps
jorgeorpinel Aug 21, 2021
5039adf
guide: remove checkpoint details from Running Exps
jorgeorpinel Aug 21, 2021
0f334ca
Merge branch 'master' into iesahin/ug-exp-run-2675
jorgeorpinel Aug 24, 2021
7a8f155
guide: clarify about parallel queued exps and run-cache
jorgeorpinel Aug 24, 2021
c07885c
guide: remove extra material for now
jorgeorpinel Aug 24, 2021
2639585
guide: begin Managing Experiments guide
jorgeorpinel Aug 24, 2021
d2382f6
guide: remove wrong note
jorgeorpinel Aug 25, 2021
85a38c9
guide: correct note about git-ignored files in queue/tmp exps
jorgeorpinel Aug 25, 2021
0d5148d
guide: dedupe and hide note about git staging files in exps
jorgeorpinel Aug 25, 2021
b1b2d8e
guide: remove broken link
jorgeorpinel Aug 25, 2021
52c22d1
guide: updates done to wrong branch oops
jorgeorpinel Aug 25, 2021
1523f04
Merge branch 'iesahin/ug-exp-run-2675' of github.com:iterative/dvc.or…
jorgeorpinel Aug 25, 2021
ec57601
Merge branch 'iesahin/ug-exp-run-2675' into guide/exps-mgmt
jorgeorpinel Aug 25, 2021
47b6c44
guide: begin Managing Experiments guide
jorgeorpinel Aug 25, 2021
ab728cf
Merge branch 'master' into guide/exps-mgmt
jorgeorpinel Aug 31, 2021
6 changes: 5 additions & 1 deletion content/docs/sidebar.json
@@ -142,7 +142,11 @@
  "label": "Experiment Management",
  "slug": "experiment-management",
  "source": "experiment-management/index.md",
- "children": ["sharing-experiments", "checkpoints"]
+ "children": [
+   "running-experiments",
+   "sharing-experiments",
+   "checkpoints"
+ ]
  },
  "setup-google-drive-remote",
  "large-dataset-optimization",
135 changes: 135 additions & 0 deletions content/docs/user-guide/experiment-management/checkpoints.md
@@ -1,5 +1,140 @@
# Checkpoints

<!--
## Checkpoints

To track successive steps in a longer or deeper <abbr>experiment</abbr>, you can
register checkpoints from your code. Each `dvc exp run` will resume from the
last checkpoint.

Checkpoints provide a way to train models iteratively, keeping the metrics,
models and other artifacts associated with each epoch.

### Adding checkpoints to the pipeline

There are various ways to add checkpoints to a project. All of them involve
marking a stage <abbr>output</abbr> with `checkpoint: true` in
`dvc.yaml`. This is needed so that the experiment can resume later, based on the
<abbr>cached</abbr> output(s).

If you are adding a new stage with `dvc stage add`, you can mark its output(s)
with the `--checkpoints` (`-c`) option. DVC will add `checkpoint: true` to the
stage output in `dvc.yaml`.

Otherwise, if you are adding a checkpoint to an already existing project, you
can edit `dvc.yaml` and add a `checkpoint: true` to the stage output as shown
below:

```yaml
stages:
  ...
  train:
    ...
    outs:
      - model.pt:
          checkpoint: true
    ...
```

### Adding checkpoints to Python code

DVC is agnostic when it comes to the language you use in your project.
Checkpoints are basically a mechanism to associate outputs of a pipeline with
its metrics. DVC does not handle reading the model from the previous iteration
or writing the new model file. DVC captures the signal produced by the
machine learning experimentation code and stores each successive checkpoint.

> 💡 DVC provides several automated ways to capture checkpoints for popular ML
> libraries in [DVClive](https://dvc.org/doc/dvclive). It may be more productive
> to use checkpoints via DVClive. Here we discuss adding checkpoints to a
> project manually.

If you are writing the project in Python, the easiest way to signal DVC to
capture a checkpoint is to call the `dvc.api.make_checkpoint()` function. It
creates a checkpoint and records all artifacts changed since the previous
checkpoint as another experiment.

The following snippet shows an example that uses a Keras custom callback class.
The callback signals DVC to create a checkpoint at the end of each epoch.

```python
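import dvc.api
import tensorflow as tf


class DVCCheckpointsCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Signal DVC to record a checkpoint at the end of every epoch.
        dvc.api.make_checkpoint()
...

history = model.fit(
    ...
    callbacks=[DVCCheckpointsCallback(), ...]
)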
```

A similar approach can be taken in PyTorch when using a loop to train a model:

```python
for epoch in range(1, EPOCHS+1):
    ...
    for x_batch, y_batch in train_loader:
        train(model, x_batch, y_batch)
    torch.save(model.state_dict(), "model.pt")
    # Evaluate and checkpoint.
    evaluate(model, x_test, y_test)
    dvc.api.make_checkpoint()
    ...
```

Even if you're not using these libraries, you can use checkpoints in your
project at each epoch/step by first recording all intermediate artifacts and
metrics, then calling `dvc.api.make_checkpoint()`.

### Adding checkpoints to non-Python code

If you use another language in your project, you can mimic the behavior of
`make_checkpoint`. In essence, `make_checkpoint` creates a special file named
`DVC_CHECKPOINT` inside `.dvc/tmp/` to signal DVC, and then waits for the file
to be removed.

```r

dvcroot <- Sys.getenv("DVC_ROOT")

if (dvcroot != "") {
  signalfilepath <- file.path(dvcroot, ".dvc", "tmp", "DVC_CHECKPOINT")
  file.create(signalfilepath)
  while (file.exists(signalfilepath)) {
    Sys.sleep(0.01)
  }
}

```

The following Julia snippet creates a signal file to create a checkpoint.

```julia

dvc_root = get(ENV, "DVC_ROOT", "")

if dvc_root != ""
    signal_file_path = joinpath(dvc_root, ".dvc", "tmp", "DVC_CHECKPOINT")
    # Create the signal file, then wait for DVC to remove it.
    open(signal_file_path, "w") do io
        write(io, "")
    end
    while isfile(signal_file_path)
        sleep(0.01)
    end
end
```
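
For comparison, the same signal-file handshake can be sketched in Python
without calling `dvc.api`. This is only an illustration of the mechanism
described above, not DVC's own implementation: the function name
`signal_checkpoint` and its `poll_interval` and `timeout` parameters are
hypothetical, and the sketch assumes that `.dvc/tmp/` already exists and that
`DVC_ROOT` is set when the script runs under `dvc exp run`.

```python
import os
import time
from pathlib import Path


def signal_checkpoint(dvc_root, poll_interval=0.01, timeout=None):
    """Create the signal file and wait until it is removed.

    Hypothetical helper: returns True once the file is gone, or False if
    `timeout` seconds pass first (the file is left in place in that case).
    """
    signal_file = Path(dvc_root) / ".dvc" / "tmp" / "DVC_CHECKPOINT"
    signal_file.touch()  # assumes .dvc/tmp/ already exists
    deadline = None if timeout is None else time.monotonic() + timeout
    while signal_file.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


if __name__ == "__main__":
    # Only signal when DVC_ROOT is set, as in the R and Julia snippets.
    root = os.environ.get("DVC_ROOT", "")
    if root:
        signal_checkpoint(root)
```

The optional timeout is an addition for illustration, so that a script run
outside of DVC does not wait forever; the R and Julia snippets above poll
without one.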

<details>

### How are checkpoints captured?

Instead of a single commit, checkpoint experiments have multiple commits under
the custom Git reference (in `.git/refs/exps`), similar to a branch.

</details>
-->

ML checkpoints are an important part of deep learning because ML engineers like
to save the model files at certain points during a training process.

13 changes: 13 additions & 0 deletions content/docs/user-guide/experiment-management/index.md
@@ -1,5 +1,18 @@
# Experiment Management
Comment from the PR author:

From #2690 (comment)

I guess I'm the reason for this new section? I'm okay with leaving this in Running experiments depending on what others think. It does seem a bit disjointed as is. I think we need to at least link here from Running experiments so it's clear how to see and compare experiments.


<!--
Machine Learning and Data Science projects usually involve experimentation.
These experiments' goals can range from finding good hyperparameters to testing
for data and concept drift. DVC 2.0 introduced a new set of commands to manage
experiments with minimal boilerplate. These commands let you run experiments
defined by pipelines, track their associated data and model files, set
parameters for each, push experiment parameters and code to Git remotes without
committing them, and create branches and persist them in Git.

Each experiment represents a project variation based on the changes in your
current <abbr>workspace</abbr>.
-->

_New in DVC 2.0_

Data science and ML are iterative processes that require a large number of
@@ -0,0 +1,68 @@
# Managing Experiments

Once you have defined and/or [run experiments] in your project, you can use
several features of DVC to see, compare, reproduce, share, or remove them.

[run experiments]: /doc/user-guide/experiment-management/running-experiments

## Experiment names

Experiments created with `dvc exp run` get an auto-generated name like
`exp-bfe64` by default. You can customize it with the `--name` (`-n`) option:

```dvc
$ dvc exp run --name cnn-512 --set-param model.conv_units=512
```

When you create an experiment, DVC generates a Git-like SHA-1 hash from its
contents. This is shown when you [queue experiments] with `--queue`:
Comment on lines +17 to +18
Comment from the PR author (@jorgeorpinel, Aug 25, 2021):

From #2690 (comment)

This is actually different from the hash generated after the experiment is committed, so I'm not sure it makes sense to mention it here.


[queue experiments]:
/doc/user-guide/experiment-management/running-experiments#the-experiments-queue

```dvc
$ dvc exp run --queue -S model.conv_units=32
Queued experiment '6518f17' for future execution.
```

After a queued experiment is run, DVC uses the regular name mentioned earlier.
Comment from the PR author:

From #2690 (comment)

The experiment name should show in dvc exp show even for queued experiments now.


> Note that you can set a queued experiment's name in advance:
>
> ```dvc
> $ dvc exp run --queue --name cnn-512 -S model.conv_units=512
> Queued experiment '86bd8f9' for future execution.
> ```

You can refer to experiments in `dvc exp apply` or `dvc exp branch` either with
regular experiment names or by their SHA hashes.

## Listing experiments

Use `dvc exp show` to see both run and queued experiments:

```dvc
$ dvc exp show --no-pager --no-timestamp \
--include-metrics loss --include-params model.conv_units
```

```dvctable
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ neutral:**Experiment** ┃ metric:**loss** ┃ param:**model.conv_units** ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ workspace │ 0.23534 │ 64 │
│ 3973b6b │ - │ 16 │
│ ├── aeaabb0 [exp-cb13f] │ 0.23534 │ 64 │
│ ├── d0ee7ce [exp-5dccf] │ 0.23818 │ 32 │
│ ├── 1533e4d [exp-88874] │ 0.24039 │ 128 │
│ ├── b1f41d3 [cnn-256] │ 0.23296 │ 256 │
│ ├── 07e927f [exp-6c06d] │ 0.23279 │ 24 │
│ ├── b2b8586 [exp-2a1d5] │ 0.25036 │ 16 │
│ └── *86bd8f9 │ - │ 512 │
└─────────────────────────┴─────────┴──────────────────┘
```

When an experiment has not been run yet, only the hash it was given when queued
is shown (marked with `*`).

<!-- WIP -->