guide: simplify Checkpoints (Exps) and other misc. related changes

Extracted from #3182
jorgeorpinel committed Jan 18, 2022
1 parent 062f8d5 commit 3e1a497
Showing 3 changed files with 56 additions and 41 deletions.
5 changes: 4 additions & 1 deletion content/docs/api-reference/make_checkpoint.md
@@ -1,11 +1,14 @@
 # dvc.api.make_checkpoint()
 
-Make an [in-code checkpoint](/doc/user-guide/experiment-management/checkpoints).
+Make an in-code [checkpoint].
 
 ```py
 def make_checkpoint()
 ```
 
+[checkpoint]:
+/doc/user-guide/experiment-management/running-experiments#checkpoint-experiments
+
 #### Usage:
 
 ```py
78 changes: 45 additions & 33 deletions content/docs/user-guide/experiment-management/checkpoints.md
@@ -2,24 +2,35 @@
 
 _New in DVC 2.0_
 
-To track successive steps in a longer experiment, you can register checkpoints
-from your code at runtime. This is especially helpful in machine learning, for
-example to track the progress in deep learning techniques such as evolving
-neural networks.
-
-_Checkpoint experiments_ track a series of variations (the checkpoints) and
-their execution can be stopped and resumed as needed. You interact with them
-using the `--rev` and `--reset` options of `dvc exp run` (see also the
-`checkpoint` field in `dvc.yaml` `outs`). They can help you
+To track successive steps in a longer machine learning experiment, you can
+register checkpoints from your code at runtime, for example to track the
+progress with deep learning techniques. They can help you
 
 - implement the best practice in deep learning to save your model weights as
 checkpoints.
 - track all code and data changes corresponding to the checkpoints.
 - see when metrics start diverging and revert to the optimal checkpoint.
 - automate the process of tracking every training epoch.
 
-> Experiments and checkpoints are [implemented](/blog/experiment-refs) with
-> hidden Git experiment commits branches.
+Checkpoint [execution] can be stopped and resumed as needed. You interact with
+them using the `--rev` and `--reset` options of `dvc exp run` (see also the
+`checkpoint` field in `dvc.yaml` `outs`).
+
+[execution]:
+/doc/user-guide/experiment-management/running-experiments#checkpoint-experiments
+
+<details>
+
+### ⚙️ How are checkpoints captured?
+
+Instead of a single reference like [regular experiments], checkpoint experiments
+have multiple commits under the custom Git reference (in `.git/refs/exps`),
+similar to a branch.
+
+[regular experiments]:
+/doc/user-guide/experiment-management/experiments-overview
+
+</details>
 
 Like with regular experiments, checkpoints can become persistent by
 [committing them to Git](#committing-checkpoints-to-git).
@@ -62,38 +73,36 @@ running:
 $ pip install -r requirements.txt
 ```
 
-This will download all of the packages you need to run the example. Now you have
-everything you need to get started with experiments and checkpoints.
+This will download all of the packages you need to run the example.
+
+To initialize this project as a <abbr>DVC repository</abbr>, use `dvc init`. Now
+you have everything you need to get started with experiments and checkpoints.
 
 </details>
 
 ## Setting up a DVC pipeline
 
-DVC versions data and it also can version the ML model weights file as
-checkpoints during the training process. To enable this, you will need to set up
-a DVC pipeline to train your model.
-
-Adding a DVC pipeline only takes a few commands. At the root of the project,
-run:
-
-```dvc
-$ dvc init
-```
+DVC can version data as well as the ML model weights file in checkpoints during
+the training process. To enable this, you will need to set up a
+[DVC pipeline](/doc/start/data-pipelines) to train your model.
 
-This sets up the files you need for your DVC pipeline to work.
-
-Now we need to add a stage for training our model within a DVC pipeline. We'll
-do that with `dvc stage add`, which we'll explain more later. For now, run the
-following command:
+Now we need to add a training stage to `dvc.yaml` including `checkpoint: true`
+in its <abbr>output</abbr>. This tells DVC which <abbr>cached</abbr> output(s)
+to use to resume the experiment later (a circular dependency). We'll do this
+with `dvc stage add`.
 
 ```dvc
-$ dvc stage add --name train --deps data/MNIST --deps train.py \
---checkpoints model.pt --plots-no-cache predictions.json \
---params seed,lr,weight_decay --live dvclive python train.py
+$ dvc stage add --name train \
+--deps data/MNIST --deps train.py \
+--params seed,lr,weight_decay \
+--checkpoints model.pt \
+--plots-no-cache predictions.json \
+--live dvclive \
+python train.py
 ```
 
-The `--live dvclive` option enables our special logger [DVCLive](/doc/dvclive),
-which helps you register checkpoints from your code.
+💡 The `--live dvclive` option enables our special logger
+[DVCLive](/doc/dvclive), which helps you register checkpoints from code.
 
 The checkpoints need to be enabled in DVC at the pipeline level. The
 `-c / --checkpoint` option of the `dvc stage add` command defines the checkpoint
@@ -132,6 +141,9 @@ stages:
 html: true
 ```
 
+⚠️ Note that enabling checkpoints in a `dvc.yaml` file makes it incompatible
+with `dvc repro`.
+
 Before we go any further, this is a great point to add these changes to your Git
 history. You can do that with the following commands:
 
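
The checkpoint workflow described in this guide (save the model weights at each step, register a checkpoint from code, and resume later from the last saved weights) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the guide's `train.py`: `train_with_checkpoints`, `train_one_epoch`, and `register_checkpoint` are hypothetical names, and the checkpoint callback is injected so the loop runs without DVC installed. In a DVC 2.x project, `dvc.api.make_checkpoint` (documented in the first file of this commit) would presumably be passed as `register_checkpoint`.

```python
from pathlib import Path
from typing import Callable, Optional


def train_with_checkpoints(
    epochs: int,
    model_path: Path,
    train_one_epoch: Callable[[int, Optional[bytes]], bytes],
    register_checkpoint: Callable[[], None],
) -> int:
    """Train for `epochs` steps and register one checkpoint per step.

    All parameter names are illustrative assumptions, not DVC API.
    Returns the number of checkpoints registered.
    """
    # Resume from the weights left behind by a previous run, if any.
    # This mirrors the "circular dependency" on the checkpoint output.
    weights: Optional[bytes] = (
        model_path.read_bytes() if model_path.exists() else None
    )

    registered = 0
    for epoch in range(epochs):
        weights = train_one_epoch(epoch, weights)
        # Write the checkpoint output first (e.g. `model.pt`) ...
        model_path.write_bytes(weights)
        # ... then signal that a checkpoint should be captured.
        register_checkpoint()
        registered += 1
    return registered
```

Writing the weights file before calling the callback matters in this sketch, since the callback is expected to capture whatever is on disk at that moment.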
14 changes: 7 additions & 7 deletions content/docs/user-guide/project-structure/pipelines-files.md
@@ -381,13 +381,13 @@ validation and auto-completion.
 > These include a subset of the fields in `.dvc` file
 > [output entries](/doc/user-guide/project-structure/dvc-files#output-entries).
 
-| Field | Description |
-| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `cache` | Whether or not this file or directory is <abbr>cached</abbr> (`true` by default). See the `--no-commit` option of `dvc add`. |
-| `remote` | (Optional) name of the remote to use for pushing/fetching. |
-| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts |
-| `checkpoint` | (Optional) Set to `true` to let DVC know that this output is associated with [in-code checkpoints](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. |
-| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. |
+| Field | Description |
+| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `cache` | Whether or not this file or directory is <abbr>cached</abbr> (`true` by default). See the `--no-commit` option of `dvc add`. |
+| `remote` | (Optional) name of the remote to use for pushing/fetching. |
+| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts |
+| `checkpoint` | (Optional) Set to `true` to let DVC know that this output is associated with [checkpoint experiments](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. |
+| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. |
 
 ⚠️ Note that using the `checkpoint` field in `dvc.yaml` is not compatible with
 `dvc repro`.
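
Because the `checkpoint` field makes a `dvc.yaml` incompatible with `dvc repro`, it can help to find which outputs set it. The sketch below is one possible way to do that on an already-parsed `dvc.yaml` mapping. `checkpoint_outputs` is a hypothetical helper, not part of DVC, and the `example` mapping is an illustrative stand-in for what the `train` stage above might produce (parsing the file itself, for instance with a YAML library, is assumed and left out).

```python
from __future__ import annotations

from typing import Any


def checkpoint_outputs(dvc_yaml: dict[str, Any]) -> dict[str, list[str]]:
    """Map each stage name to the output paths that set `checkpoint: true`.

    Hypothetical helper for illustration; it only inspects plain stages.
    """
    found: dict[str, list[str]] = {}
    for stage_name, stage in dvc_yaml.get("stages", {}).items():
        for out in stage.get("outs", []):
            # An output entry is either a plain path string or a one-key
            # mapping of path -> fields (cache, persist, checkpoint, ...).
            if not isinstance(out, dict):
                continue
            for path, fields in out.items():
                if (fields or {}).get("checkpoint") is True:
                    found.setdefault(stage_name, []).append(path)
    return found


# Illustrative data only: an assumed shape for the `train` stage.
example = {
    "stages": {
        "train": {
            "cmd": "python train.py",
            "deps": ["data/MNIST", "train.py"],
            "outs": [{"model.pt": {"checkpoint": True}}],
        }
    }
}

print(checkpoint_outputs(example))  # {'train': ['model.pt']}
```

An empty result suggests the file does not use checkpoint outputs, at least in its plain stages.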
