-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dvclive: lightning log_model #4714
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. Good stuff, thanks @dberenbaum .
@@ -69,6 +86,17 @@ checkpointing at all as described in the | |||
|
|||
## Examples | |||
|
|||
- Using `log_model` to save the checkpoints and annotate the best one for use in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we show an output / end result of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like actual dvc.yaml and directory structure?
created by | ||
[`ModelCheckpoint`](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html) | ||
and annotate the best checkpoint with `type=model` and `name=best` for use in | ||
[Studio model registry]. DVCLive will <abbr>cache</abbr> the checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to understand the part about cache and being able to recover it. I think most of the ppl will be confused by this (I expect they don't know much about DVC mechanics). Two suggestions:
- remove it
- try to be more explicit: DVCLive saves checkpoints with DVC - it means they are cached in DVC cache, also could be pushed to remote storage, etc. It also means that DVCLive drops the directory (it's safe) when it runs a new experiments. Model weights for any experiment are stored in DVC and can be restored ...
Btw, what happens, if save_dvc_exp
is False? How do I recover those?
- `log_model` - (`False` by default) - use | ||
[`live.log_artifact()`](/doc/dvclive/live/log_artifact) to log checkpoints | ||
created by | ||
[`ModelCheckpoint`](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.ModelCheckpoint.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I need to use ModelCheckpoint
along side? Ot it's being used internally?
directory and delete checkpoints from previous <abbr>experiments</abbr>, but | ||
you can recover the checkpoints from any experiment using DVC. | ||
|
||
- if `log_model == 'all'`, checkpoints are logged during training. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all checkpoints are logged during training?
also, what does it practically mean? does it create a record per each checkpoint in dvc.yaml
?
@@ -61,6 +61,23 @@ checkpointing at all as described in the | |||
|
|||
- `prefix` - (`None` by default) - string that adds to each metric name. | |||
|
|||
- `log_model` - (`False` by default) - use | |||
[`live.log_artifact()`](/doc/dvclive/live/log_artifact) to log checkpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log checkpoints - does it means that we might have multiple records? per checkpoint?
Updated to address the comments, moving most of the details to the examples. Hopefully the docs answer the questions, so I will hold off on answering each question to see if you find the doc answers them sufficiently. |
`Live.log_artifact()`. At the end of training, DVCLive will annotate the | ||
[`best_model_path`][`ModelCheckpoint`] with `type=model` and `name=best` for use | ||
in [Studio model registry]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a good example of how to use the best model, but I think it's blocked by iterative/dvc#9100 (comment)
Adding reminders here to update vs code and studio snippets (cc @mattseddon @yathomasi ) |
@@ -69,6 +82,52 @@ checkpointing at all as described in the | |||
|
|||
## Examples | |||
|
|||
### Log model checkpoints | |||
|
|||
Use `log_model` to save the checkpoints. DVCLive will first delete checkpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still so complicated :(
Sorry, Dave. I know you put tons of effort into this. And may be that's the best we can do ... but nonetheless we should recognize that's quire complex.
Let's start with this. Can it be just:
Use
log_model
to save the checkpoints (it will useLive.log_artifact()
internally to save those). At the end of training, DVCLive will also mark thebest
the [best_model_path
][ModelCheckpoint
] (e.g. for then to be consumed in the Model Registry or automation scenarios).
Even this one is still too complicated. Do ppl care about MR at this point? I don't know . Can it go after the code example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I don't have that strong an opinion here at this point. I'm fine to go with that suggestion for now. We can always revisit.
Uses `dvc add` to track `path` with DVC, generating a `{path}.dvc` file. When | ||
combined with [`save_dvc_exp=True`](/doc/dvclive#initialize-dvclive), it will | ||
ensure that `{path}.dvc` is included in the experiment. | ||
Uses `dvc add` to [track] `path` with DVC, saving it to the DVC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh, I still feel that we go right away into implementation details. High level (to me) this command first of all creates an entry somewhere (in this case dvc.yaml) with some information about the model - that we now even know that there is a model in the repo.
dvc add
can be there or not, we should support other storages and other ways to get data then eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh, I still feel that we go right away into implementation details.
I added an intro paragraph to introduce it at a high level.
High level (to me) this command first of all creates an entry somewhere (in this case dvc.yaml) with some information about the model - that we now even know that there is a model in the repo.
dvc add
can be there or not, we should support other storages and other ways to get data then eventually.
I don't really agree here. Every other logger I know uses log_artifact
to store the artifact data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't other loggers have any means to log a remote model?
in our case I think we should support all the ways to store the models. that's why it feels a bit low level and artificial (it can be even a file in Git after all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should support other ways to store models (we are discussing in iterative/dvclive#551 and we already have `Live.log_artifact(cache=False) that goes a little bit in this direction), but the expected default behavior for logging artifacts is to save them to the logger's storage so that it can be recovered later. In the intro, I think we should emphasize it as the primary scenario for the method.
@shcheklein Let me know if you feel there are any blockers left. |
Docs for iterative/dvclive#613