
Decouple dvc add from log_artifact #572

Closed
shcheklein opened this issue May 20, 2023 · 8 comments · Fixed by #620
Labels
A: dvc (DVC integration) · A: log_artifact (Area: `live.log_artifact`) · enhancement

Comments

shcheklein (Member) commented May 20, 2023

There are a bunch of issues with the current implementation, most of them related to the workflow where one has to run log_artifact on each iteration (e.g. saving the best model via a callback).

  • It prints this stuff on each iteration (super noisy):

        To track the changes with git, run:
                git add dvclive/artifacts/best.pt.dvc
        To enable auto staging, run:
                dvc config core.autostage true

  • It saves the model into the cache (takes time + space).
  • It requires setting up a DVC remote, which should by all means be optional - there are other ways to handle storage.
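
To make the workflow concrete, here is a minimal sketch of the kind of loop being described (a hypothetical training script; `epochs`, `model`, and the helper functions are placeholders), where log_artifact fires on every iteration and triggers all of the above each time:

```python
from dvclive import Live

best_metric = float("-inf")

with Live() as live:
    for epoch in range(epochs):            # `epochs`, `model`, and the helpers
        train_one_epoch(model)             # below are assumed to exist in the
        metric = validate(model)           # surrounding training script
        live.log_metric("val/metric", metric)
        if metric > best_metric:
            best_metric = metric
            save_weights(model, "best.pt")
            # Today this runs `dvc add` under the hood on every call:
            live.log_artifact("best.pt", type="model")
        live.next_step()
```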

Related to #551

shcheklein added the `enhancement`, `A: dvc`, and `A: log_artifact` labels on May 20, 2023
daavoo (Contributor) commented May 21, 2023

workflow where one has to run log_artifact on each iteration (e.g. saving the best model via a callback)

I think we have discussed this elsewhere, but we went with the approach of not doing this in the integrations and instead documenting/explaining to users how to track the model once the training loop is completed, as in:

https://dvc.org/doc/start/experiments/experiment-tracking?tab=Pytorch-Lightning

This was under the assumption of using log_artifact for the model registry scenario.

Are we discussing a different scenario here - using log_artifact for recovery from interruption?

shcheklein (Member, Author) commented:

Scenarios include updating the best model from time to time (e.g. to be able to stop early, or just in general to have the best weights at the end) + recovery. Yes, it can also be done at the end with a copy flag, but it's unrealistic to expect users or callbacks to behave that way.

We could potentially run dvc add only at the end. That would solve at least some of the issues above, and even if training is interrupted, the model is still there.
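
For reference, a minimal sketch of the "only at the end" pattern being contrasted here (a hypothetical script; the copy flag copies the file under dvclive/artifacts before it is tracked):

```python
from dvclive import Live

with Live() as live:
    train(model)  # full training loop, assumed to write best.pt on its own

    # Track the model once, after training has finished, instead of on every
    # iteration; copy=True places best.pt under dvclive/artifacts/ first.
    live.log_artifact("best.pt", type="model", copy=True)
```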

dberenbaum (Collaborator) commented:

where one has to run log_artifact on each iteration (e.g. saving the best model via a callback)

Why does log_artifact have to get called repeatedly here? And what is the point of doing so if you only want to dvc add at the end?

I think doing dvc add only at the end makes some sense if we aren't currently focused on checkpointing, but I'm confused about how that connects to calling log_artifact multiple times or compatibility with recovery from interruption.

shcheklein (Member, Author) commented:

I think what I'm trying to cover here (and we can discuss the YOLO details separately) is the workflow where a framework has a callback - on_model_save or similar - and we just add log_artifact there without thinking too much. The scenarios are the same as why people generally save the best model periodically, and the last one, I think. I feel it's quite limiting to expect users to call log_artifact only at the end + explain the context, etc.

Re that specific YOLO repository: no strong reason, since it currently writes everything locally (in the run directory) and we are duplicating it for everything - plots, metrics, models. I was hoping to show a complete case where DVCLive can serve as that runs directory, but better.

dberenbaum (Collaborator) commented:

I think what I'm trying to cover here (and we can discuss the YOLO details separately) is the workflow where a framework has a callback - on_model_save or similar - and we just add log_artifact there without thinking too much. The scenarios are the same as why people generally save the best model periodically, and the last one, I think. I feel it's quite limiting to expect users to call log_artifact only at the end + explain the context, etc.

Makes sense, but I don't think it's typical for a callback to call log_artifact at each model savepoint by default. For example, looking at https://github.com/ultralytics/ultralytics/tree/main/ultralytics/yolo/utils/callbacks, it's only used by the internal Ultralytics Hub callback.

There are sometimes options for saving either at the end or at model checkpoints for other frameworks, but I don't see it enabled by default (see https://pytorch-lightning.readthedocs.io/en/1.6.1/extensions/logging.html, https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/callback, etc.).

In the cases where users opt to call log_artifact at each checkpoint, the only use case I see is that we track it with dvc and/or upload it so it can be recovered. Otherwise, I don't see why the user would want to call log_artifact repeatedly. I'd also like to compare how long this takes in mlflow, wandb, etc., since they will be uploading to a server each time this is called.
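
For comparison, a hypothetical PyTorch Lightning setup illustrating the point above: checkpointing is an explicit, opt-in callback the user configures, not something the basic logger does on every model save.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# The user opts into checkpointing explicitly...
checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=1)

# ...while the logger (DVCLive, wandb, etc.) handles metrics; tracking or
# uploading the model is a separate, deliberate step.
trainer = Trainer(callbacks=[checkpoint_cb], max_epochs=10)
trainer.fit(model)  # `model` is a LightningModule defined elsewhere
```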

shcheklein (Member, Author) commented:

That's the first one I checked:

https://github.com/ultralytics/ultralytics/blob/main/ultralytics/yolo/utils/callbacks/comet.py#L274

It logs the best model on each fit epoch end.

(They don't use on_model_save - maybe because it was added later?)

When I talk about frameworks I also include Keras et al. It would be great to just pass them our callback and be done with it. They are supposed to do things internally, and if they write models regularly we should probably do so as well. (I don't have that much experience here - this is based on my intuition, but I can take a look.)

dberenbaum (Collaborator) commented:

What do you want log_artifact to do at each call of on_model_save and what value would it add?

I see the method doing 3 things (roughly sketched below):

  1. Sometimes copying the file to the dvclive dir -- unclear why this shouldn't wait until the end, since you could still resume with the model at its original filepath.
  2. Adding the file to dvc.yaml:artifacts -- again unclear why this shouldn't wait until the end.
  3. Running dvc add to track the file -- again unclear why this shouldn't wait until the end.
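
A rough sketch of those three steps (not the actual DVCLive source; the helper and attribute names here are made up purely to keep the discussion concrete):

```python
# Illustrative pseudocode for the current log_artifact behavior as listed above.
def log_artifact_sketch(live, path, copy=False):
    if copy:
        path = copy_into_artifacts_dir(path, live.dir)  # 1. sometimes copy into the dvclive dir
    add_to_dvcyaml_artifacts(live, path)                # 2. record under dvc.yaml:artifacts
    run_dvc_add(path)                                   # 3. cache the file and print the
                                                        #    git/autostage hints
```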

In the future, we could do something like upload each checkpoint to studio to make recovery easier, but right now I don't see the point in calling log_artifact repeatedly.


That's the first one I checked:

I guess I missed that one, and AFAICT it only copies the file locally. Not sure what the user value is.

For the others, they are logging at the end of training if anywhere AFAICT:

When I talk about frameworks I also include Keras, et al. It would be great to just pass them our callback and be done with that.

That's why I included lightning and huggingface refs above (our two most popular frameworks). Check out how wandb handles these frameworks:

Each has its own bespoke logic with multiple options, and none include checkpointing by default with the basic logger. Instead, they focus on robust docs and examples to explain the options for how to do these things and make it easy to copy/paste.

dberenbaum mentioned this issue on May 25, 2023
shcheklein (Member, Author) commented:

Agreed on the YOLO. I'll change the logic there.

What about other callbacks - like Keras, etc.? Is it possible for us to rely on them to create the model in the right location (dvclive) and register it, so that people don't have to run log_artifact at the end?

Also, what about other storage options? Should we introduce a flag? Should we change the default? How do we pass those flags into callbacks (like don't do dvc add, don't create an experiment, etc. - for example, in the case of YOLO)?
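
As an illustration of the Keras question above, a hypothetical setup that points the framework's own checkpoint callback at the DVCLive directory (the path and wiring are assumptions, not an agreed design):

```python
from keras.callbacks import ModelCheckpoint
from dvclive.keras import DVCLiveCallback  # DVCLive's Keras callback

# `model`, `x_train`, and `y_train` are assumed to be defined elsewhere.
model.fit(
    x_train,
    y_train,
    epochs=10,
    callbacks=[
        # The framework writes the best weights directly where DVCLive keeps
        # artifacts, so no separate log_artifact call would be needed at the end.
        ModelCheckpoint(filepath="dvclive/artifacts/best.h5", save_best_only=True),
        DVCLiveCallback(),
    ],
)
```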
