guide: simplify Checkpoints guide #3189

Merged
merged 1 commit into from
Jan 26, 2022
5 changes: 4 additions & 1 deletion content/docs/api-reference/make_checkpoint.md
@@ -1,11 +1,14 @@
# dvc.api.make_checkpoint()

Make an [in-code checkpoint](/doc/user-guide/experiment-management/checkpoints).
Make an in-code [checkpoint].

```py
def make_checkpoint()
```

[checkpoint]:
/doc/user-guide/experiment-management/running-experiments#checkpoint-experiments
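As a rough illustration of where this call might sit in training code, here is a minimal sketch. The `train` function, the JSON file standing in for model weights, and the no-op fallback are assumptions made for this example, not part of the DVC docs. It also assumes the checkpoint output should be written before `make_checkpoint()` is called.

```python
import json
from pathlib import Path

try:
    # Provided by DVC releases that ship checkpoint experiments.
    from dvc.api import make_checkpoint
except ImportError:
    # Assumption: a no-op stand-in so the sketch still runs without DVC.
    def make_checkpoint():
        pass


def train(epochs, out_path="model.json"):
    """Hypothetical loop: write the checkpoint output, then signal DVC."""
    state = {"epoch": 0, "weight": 0.0}
    for epoch in range(1, epochs + 1):
        # Stand-in for a real training step.
        state = {"epoch": epoch, "weight": state["weight"] + 0.5}
        # Save the output file first...
        Path(out_path).write_text(json.dumps(state))
        # ...then ask DVC to register a checkpoint for it.
        make_checkpoint()
    return state
```

Calling it once per epoch, after the output is saved, is one plausible way to get a checkpoint per epoch.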

#### Usage:

```py
78 changes: 45 additions & 33 deletions content/docs/user-guide/experiment-management/checkpoints.md
@@ -2,24 +2,35 @@

_New in DVC 2.0_

To track successive steps in a longer experiment, you can register checkpoints
from your code at runtime. This is especially helpful in machine learning, for
example to track the progress in deep learning techniques such as evolving
neural networks.

_Checkpoint experiments_ track a series of variations (the checkpoints) and
their execution can be stopped and resumed as needed. You interact with them
using the `--rev` and `--reset` options of `dvc exp run` (see also the
`checkpoint` field in `dvc.yaml` `outs`). They can help you
To track successive steps in a longer machine learning experiment, you can
register checkpoints from your code at runtime, for example to track the
progress with deep learning techniques. They can help you

- implement the best practice in deep learning to save your model weights as
checkpoints.
- track all code and data changes corresponding to the checkpoints.
- see when metrics start diverging and revert to the optimal checkpoint.
- automate the process of tracking every training epoch.

> Experiments and checkpoints are [implemented](/blog/experiment-refs) with
> hidden Git experiment commits branches.
Checkpoint [execution] can be stopped and resumed as needed. You interact with
them using the `--rev` and `--reset` options of `dvc exp run` (see also the
`checkpoint` field in `dvc.yaml` `outs`).

[execution]:
/doc/user-guide/experiment-management/running-experiments#checkpoint-experiments

<details>

### ⚙️ How are checkpoints captured?

Instead of a single reference like [regular experiments], checkpoint experiments
have multiple commits under the custom Git reference (in `.git/refs/exps`),
similar to a branch.

[regular experiments]:
/doc/user-guide/experiment-management/experiments-overview

</details>

Like with regular experiments, checkpoints can become persistent by
[committing them to Git](#committing-checkpoints-to-git).
@@ -62,38 +73,36 @@ running:
$ pip install -r requirements.txt
```

This will download all of the packages you need to run the example. Now you have
everything you need to get started with experiments and checkpoints.
This will download all of the packages you need to run the example.

To initialize this project as a <abbr>DVC repository</abbr>, use `dvc init`. Now
you have everything you need to get started with experiments and checkpoints.

</details>

## Setting up a DVC pipeline

DVC versions data and it also can version the ML model weights file as
checkpoints during the training process. To enable this, you will need to set up
a DVC pipeline to train your model.

Adding a DVC pipeline only takes a few commands. At the root of the project,
run:

```dvc
$ dvc init
```
DVC can version data as well as the ML model weights file in checkpoints during
the training process. To enable this, you will need to set up a
[DVC pipeline](/doc/start/data-pipelines) to train your model.

This sets up the files you need for your DVC pipeline to work.

Now we need to add a stage for training our model within a DVC pipeline. We'll
do that with `dvc stage add`, which we'll explain more later. For now, run the
following command:
Now we need to add a training stage to `dvc.yaml` including `checkpoint: true`
in its <abbr>output</abbr>. This tells DVC which <abbr>cached</abbr> output(s)
to use to resume the experiment later (a circular dependency). We'll do this
with `dvc stage add`.

```dvc
$ dvc stage add --name train --deps data/MNIST --deps train.py \
--checkpoints model.pt --plots-no-cache predictions.json \
--params seed,lr,weight_decay --live dvclive python train.py
$ dvc stage add --name train \
--deps data/MNIST --deps train.py \
--params seed,lr,weight_decay \
--checkpoints model.pt \
--plots-no-cache predictions.json \
--live dvclive \
python train.py
```

The `--live dvclive` option enables our special logger [DVCLive](/doc/dvclive),
which helps you register checkpoints from your code.
💡 The `--live dvclive` option enables our special logger
[DVCLive](/doc/dvclive), which helps you register checkpoints from code.
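The "circular dependency" mentioned above can be pictured with a small sketch. This is not the example project's `train.py`. The `load_state` and `run` helpers and the JSON file used in place of `model.pt` are hypothetical, chosen so the snippet needs only the standard library. The idea is that the stage reads its own checkpoint output when it exists, so a later run continues from it instead of starting over.

```python
import json
import os


def load_state(path):
    """Return the previously saved state, or a fresh one if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": 1.0}


def run(path, epochs):
    """Train for `epochs` more epochs, overwriting the checkpoint each time."""
    # The stage reads its own output here: this is the circular dependency.
    state = load_state(path)
    for _ in range(epochs):
        # Stand-in for a real training step.
        state = {"epoch": state["epoch"] + 1, "loss": state["loss"] / 2}
        with open(path, "w") as f:
            json.dump(state, f)
    return state
```

Running `run(path, 2)` twice against the same file resumes from the second epoch on the second call, which mirrors how a checkpoint output lets an experiment pick up where it stopped.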

The checkpoints need to be enabled in DVC at the pipeline level. The
`-c / --checkpoint` option of the `dvc stage add` command defines the checkpoint
@@ -132,6 +141,9 @@ stages:
html: true
```

⚠️ Note that enabling checkpoints in a `dvc.yaml` file makes it incompatible
with `dvc repro`.

Before we go any further, this is a great point to add these changes to your Git
history. You can do that with the following commands:

14 changes: 7 additions & 7 deletions content/docs/user-guide/project-structure/pipelines-files.md
@@ -381,13 +381,13 @@ validation and auto-completion.
> These include a subset of the fields in `.dvc` file
> [output entries](/doc/user-guide/project-structure/dvc-files#output-entries).

| Field | Description |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `cache` | Whether or not this file or directory is <abbr>cached</abbr> (`true` by default). See the `--no-commit` option of `dvc add`. |
| `remote` | (Optional) name of the remote to use for pushing/fetching. |
| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts |
| `checkpoint` | (Optional) Set to `true` to let DVC know that this output is associated with [in-code checkpoints](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. |
| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. |
| Field | Description |
| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cache` | Whether or not this file or directory is <abbr>cached</abbr> (`true` by default). See the `--no-commit` option of `dvc add`. |
| `remote` | (Optional) name of the remote to use for pushing/fetching. |
| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts |
| `checkpoint` | (Optional) Set to `true` to let DVC know that this output is associated with [checkpoint experiments](/doc/user-guide/experiment-management/checkpoints). These outputs are reverted to their last cached version at `dvc exp run` and also `persist` during the stage execution. |
| `desc` | (Optional) user description for this output. This doesn't affect any DVC operations. |

⚠️ Note that using the `checkpoint` field in `dvc.yaml` is not compatible with
`dvc repro`.