Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvclive: lightning log_model #4714

Merged
merged 5 commits into from
Jul 31, 2023
Merged

dvclive: lightning log_model #4714

merged 5 commits into from
Jul 31, 2023

Conversation

dberenbaum
Copy link
Contributor

@dberenbaum dberenbaum requested a review from daavoo July 21, 2023 20:45
@shcheklein shcheklein temporarily deployed to dvc-org-lightning-model-yorcas July 21, 2023 20:49 Inactive
@github-actions
Copy link
Contributor

github-actions bot commented Jul 21, 2023

Link Check Report

There were no links to check!

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments. Good stuff, thanks @dberenbaum .

@@ -69,6 +86,17 @@ checkpointing at all as described in the

## Examples

- Using `log_model` to save the checkpoints and annotate the best one for use in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we show an output / end result of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like actual dvc.yaml and directory structure?

created by
[`ModelCheckpoint`](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html)
and annotate the best checkpoint with `type=model` and `name=best` for use in
[Studio model registry]. DVCLive will <abbr>cache</abbr> the checkpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand the part about cache and being able to recover it. I think most of the ppl will be confused by this (I expect they don't know much about DVC mechanics). Two suggestions:

  • remove it
  • try to be more explicit: DVCLive saves checkpoints with DVC - it means they are cached in DVC cache, also could be pushed to remote storage, etc. It also means that DVCLive drops the directory (it's safe) when it runs a new experiments. Model weights for any experiment are stored in DVC and can be restored ...

Btw, what happens, if save_dvc_exp is False? How do I recover those?

- `log_model` - (`False` by default) - use
[`live.log_artifact()`](/doc/dvclive/live/log_artifact) to log checkpoints
created by
[`ModelCheckpoint`](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I need to use ModelCheckpoint along side? Ot it's being used internally?

directory and delete checkpoints from previous <abbr>experiments</abbr>, but
you can recover the checkpoints from any experiment using DVC.

- if `log_model == 'all'`, checkpoints are logged during training.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all checkpoints are logged during training?

also, what does it practically mean? does it create a record per each checkpoint in dvc.yaml?

@@ -61,6 +61,23 @@ checkpointing at all as described in the

- `prefix` - (`None` by default) - string that adds to each metric name.

- `log_model` - (`False` by default) - use
[`live.log_artifact()`](/doc/dvclive/live/log_artifact) to log checkpoints
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log checkpoints - does it means that we might have multiple records? per checkpoint?

@shcheklein shcheklein temporarily deployed to dvc-org-lightning-model-yorcas July 22, 2023 14:17 Inactive
@dberenbaum
Copy link
Contributor Author

Updated to address the comments, moving most of the details to the examples. Hopefully the docs answer the questions, so I will hold off on answering each question to see if you find the doc answers them sufficiently.

Comment on lines 90 to 92
`Live.log_artifact()`. At the end of training, DVCLive will annotate the
[`best_model_path`][`ModelCheckpoint`] with `type=model` and `name=best` for use
in [Studio model registry].
Copy link
Contributor Author

@dberenbaum dberenbaum Jul 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a good example of how to use the best model, but I think it's blocked by iterative/dvc#9100 (comment)

@dberenbaum dberenbaum requested review from shcheklein and daavoo July 25, 2023 20:31
@dberenbaum
Copy link
Contributor Author

Adding reminders here to update vs code and studio snippets (cc @mattseddon @yathomasi )

@shcheklein shcheklein temporarily deployed to dvc-org-lightning-model-yorcas July 25, 2023 21:27 Inactive
@@ -69,6 +82,52 @@ checkpointing at all as described in the

## Examples

### Log model checkpoints

Use `log_model` to save the checkpoints. DVCLive will first delete checkpoints
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still so complicated :(

Sorry, Dave. I know you put tons of effort into this. And may be that's the best we can do ... but nonetheless we should recognize that's quire complex.

Let's start with this. Can it be just:

Use log_model to save the checkpoints (it will use Live.log_artifact() internally to save those). At the end of training, DVCLive will also mark the best the [best_model_path][ModelCheckpoint] (e.g. for then to be consumed in the Model Registry or automation scenarios).

Even this one is still too complicated. Do ppl care about MR at this point? I don't know . Can it go after the code example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I don't have that strong an opinion here at this point. I'm fine to go with that suggestion for now. We can always revisit.

Uses `dvc add` to track `path` with DVC, generating a `{path}.dvc` file. When
combined with [`save_dvc_exp=True`](/doc/dvclive#initialize-dvclive), it will
ensure that `{path}.dvc` is included in the experiment.
Uses `dvc add` to [track] `path` with DVC, saving it to the DVC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh, I still feel that we go right away into implementation details. High level (to me) this command first of all creates an entry somewhere (in this case dvc.yaml) with some information about the model - that we now even know that there is a model in the repo.

dvc add can be there or not, we should support other storages and other ways to get data then eventually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh, I still feel that we go right away into implementation details.

I added an intro paragraph to introduce it at a high level.

High level (to me) this command first of all creates an entry somewhere (in this case dvc.yaml) with some information about the model - that we now even know that there is a model in the repo.

dvc add can be there or not, we should support other storages and other ways to get data then eventually.

I don't really agree here. Every other logger I know uses log_artifact to store the artifact data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't other loggers have any means to log a remote model?

in our case I think we should support all the ways to store the models. that's why it feels a bit low level and artificial (it can be even a file in Git after all)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should support other ways to store models (we are discussing in iterative/dvclive#551 and we already have `Live.log_artifact(cache=False) that goes a little bit in this direction), but the expected default behavior for logging artifacts is to save them to the logger's storage so that it can be recovered later. In the intro, I think we should emphasize it as the primary scenario for the method.

@shcheklein shcheklein temporarily deployed to dvc-org-lightning-model-yorcas July 31, 2023 01:00 Inactive
@dberenbaum dberenbaum requested a review from shcheklein July 31, 2023 01:01
@dberenbaum
Copy link
Contributor Author

@shcheklein Let me know if you feel there are any blockers left.

@shcheklein shcheklein temporarily deployed to dvc-org-lightning-model-yorcas July 31, 2023 11:29 Inactive
@dberenbaum dberenbaum merged commit dbc446e into main Jul 31, 2023
@dberenbaum dberenbaum deleted the lightning-model branch July 31, 2023 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants