
Decouple dvc add from log_artifact #572

Closed
shcheklein opened this issue May 20, 2023 · 8 comments · Fixed by #620
Labels
A: dvc (DVC integration) · A: log_artifact (Area: `live.log_artifact`) · enhancement

Comments

shcheklein (Member) commented May 20, 2023

There are a bunch of issues with the current implementation, most of them related to the workflow where one has to run log_artifact on each iteration (e.g. saving the best model via a callback).

  • It prints this stuff on each iteration (super noisy):

        To track the changes with git, run:
                git add dvclive/artifacts/best.pt.dvc
        To enable auto staging, run:
                dvc config core.autostage true

  • It saves the model into the cache (takes time + space).
  • It requires setting up a DVC remote, which should by all means be optional - there are other ways to handle storage.
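
To make the workflow concrete, here is a minimal sketch of the kind of loop being described (a hypothetical training script; `epochs`, `model`, and the helper functions are placeholders), where log_artifact fires on every iteration and triggers all of the above each time:

```python
from dvclive import Live

best_metric = float("-inf")

with Live() as live:
    for epoch in range(epochs):            # `epochs`, `model`, and the helpers
        train_one_epoch(model)             # below are assumed to exist in the
        metric = validate(model)           # surrounding training script
        live.log_metric("val/metric", metric)
        if metric > best_metric:
            best_metric = metric
            save_weights(model, "best.pt")
            # Today this runs `dvc add` under the hood on every call:
            live.log_artifact("best.pt", type="model")
        live.next_step()
```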

Related to #551

shcheklein added the `enhancement`, `A: dvc`, and `A: log_artifact` labels on May 20, 2023
daavoo (Contributor) commented May 21, 2023

workflow where one has to run log_artifact on each iteration (e.g. saving the best model via a callback)

I think we have discussed this elsewhere, but we went with the approach of not doing this in the integrations and instead documenting/explaining to users how to track the model once the training loop is completed, as in:

https://dvc.org/doc/start/experiments/experiment-tracking?tab=Pytorch-Lightning

This was under the assumption of using log_artifact for the model registry scenario.

Are we discussing a different scenario here - using log_artifact for recovery from interruption?

shcheklein (Member, Author) commented:

Scenarios include updating the best model from time to time (e.g. to be able to stop early, or just in general to have the best weights at the end) + recovery. Yes, it can also be done at the end with a copy flag, but it's unrealistic to expect users or callbacks to behave that way.

We could potentially run dvc add only at the end. That would solve at least some of the issues above, and even if training is interrupted, the model is still there.
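
For reference, a minimal sketch of the "only at the end" pattern being contrasted here (a hypothetical script; the copy flag copies the file under dvclive/artifacts before it is tracked):

```python
from dvclive import Live

with Live() as live:
    train(model)  # full training loop, assumed to write best.pt on its own

    # Track the model once, after training has finished, instead of on every
    # iteration; copy=True places best.pt under dvclive/artifacts/ first.
    live.log_artifact("best.pt", type="model", copy=True)
```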

dberenbaum (Collaborator) commented:

where one has to run log_artifact on each iteration (e.g. saving the best model via a callback)

Why does log_artifact have to get called repeatedly here? And what is the point of doing so if you only want to dvc add at the end?

I think doing dvc add only at the end makes some sense if we aren't currently focused on checkpointing, but I'm confused about how that connects to calling log_artifact multiple times or compatibility with recovery from interruption.

shcheklein (Member, Author) commented:

I think what I'm trying to cover here (and we can discuss the YOLO details separately) is the workflow where a framework has a callback - on_model_save or similar - and we just add log_artifact there without thinking too much. The scenarios are the same as why people generally save the best model periodically, and the last one, I think. I feel it's quite limiting to expect users to call log_artifact only at the end + explain the context, etc.

Re that specific YOLO repository: no strong reason, since it currently writes everything locally (in the run directory) and we are duplicating it for everything - plots, metrics, models. I was hoping to show a complete case where DVCLive can serve as that runs directory, but better.

dberenbaum (Collaborator) commented:

I think what I'm trying to cover here (and we can discuss the YOLO details separately) is the workflow where a framework has a callback - on_model_save or similar - and we just add log_artifact there without thinking too much. The scenarios are the same as why people generally save the best model periodically, and the last one, I think. I feel it's quite limiting to expect users to call log_artifact only at the end + explain the context, etc.

Makes sense, but I don't think it's typical for a callback to call log_artifact at each model savepoint by default. For example, looking at https://github.com/ultralytics/ultralytics/tree/main/ultralytics/yolo/utils/callbacks, it's only used by the internal Ultralytics Hub callback.

There are sometimes options for saving either at the end or at model checkpoints for other frameworks, but I don't see it enabled by default (see https://pytorch-lightning.readthedocs.io/en/1.6.1/extensions/logging.html, https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/callback, etc.).

In the cases where users opt to call log_artifact at each checkpoint, the only use case I see is that we track it with dvc and/or upload it so it can be recovered. Otherwise, I don't see why the user would want to call log_artifact repeatedly. I'd also like to compare how long this takes in mlflow, wandb, etc., since they will be uploading to a server each time this is called.
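
For comparison, a hypothetical PyTorch Lightning setup illustrating the point above: checkpointing is an explicit, opt-in callback the user configures, not something the basic logger does on every model save.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# The user opts into checkpointing explicitly...
checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=1)

# ...while the logger (DVCLive, wandb, etc.) handles metrics; tracking or
# uploading the model is a separate, deliberate step.
trainer = Trainer(callbacks=[checkpoint_cb], max_epochs=10)
trainer.fit(model)  # `model` is a LightningModule defined elsewhere
```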

shcheklein (Member, Author) commented:

That's the first one I checked:

https://github.com/ultralytics/ultralytics/blob/main/ultralytics/yolo/utils/callbacks/comet.py#L274

It logs the best model on each fit epoch end.

(They don't use on_model_save - maybe because it was added later?)

When I talk about frameworks I also include Keras et al. It would be great to just pass them our callback and be done with it. They are supposed to do things internally, and if they write models regularly we should probably do so as well. (I don't have that much experience here - this is based on my intuition, but I can take a look.)

dberenbaum (Collaborator) commented:

What do you want log_artifact to do at each call of on_model_save and what value would it add?

I see the method doing 3 things (roughly sketched below):

  1. Sometimes copying the file to the dvclive dir -- unclear why this shouldn't wait until the end, since you could still resume with the model at its original filepath.
  2. Adding the file to dvc.yaml:artifacts -- again unclear why this shouldn't wait until the end.
  3. Running dvc add to track the file -- again unclear why this shouldn't wait until the end.
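
A rough sketch of those three steps (not the actual DVCLive source; the helper and attribute names here are made up purely to keep the discussion concrete):

```python
# Illustrative pseudocode for the current log_artifact behavior as listed above.
def log_artifact_sketch(live, path, copy=False):
    if copy:
        path = copy_into_artifacts_dir(path, live.dir)  # 1. sometimes copy into the dvclive dir
    add_to_dvcyaml_artifacts(live, path)                # 2. record under dvc.yaml:artifacts
    run_dvc_add(path)                                   # 3. cache the file and print the
                                                        #    git/autostage hints
```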

In the future, we could do something like upload each checkpoint to studio to make recovery easier, but right now I don't see the point in calling log_artifact repeatedly.


That's the first one I checked:

I guess I missed that one, and AFAICT it only copies the file locally. Not sure what the user value is.

For the others, they are logging at the end of training if anywhere AFAICT:

When I talk about frameworks I also include Keras, et al. It would be great to just pass them our callback and be done with that.

That's why I included lightning and huggingface refs above (our two most popular frameworks). Check out how wandb handles these frameworks:

Each has its own bespoke logic with multiple options, and none include checkpointing by default with the basic logger. Instead, they focus on robust docs and examples to explain the options for how to do these things and make it easy to copy/paste.

dberenbaum mentioned this issue on May 25, 2023
shcheklein (Member, Author) commented:

Agreed on the YOLO. I'll change the logic there.

What about other callbacks - like Keras, etc.? Is it possible for us to rely on them to create the model in the right location (dvclive) and register it, so that people don't have to run log_artifact at the end?

Also, what about other storage options? Should we introduce a flag? Should we change the default? How do we pass those flags into callbacks (like don't do dvc add, don't create an experiment, etc. - for example, in the case of YOLO)?
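
As an illustration of the Keras question above, a hypothetical setup that points the framework's own checkpoint callback at the DVCLive directory (the path and wiring are assumptions, not an agreed design):

```python
from keras.callbacks import ModelCheckpoint
from dvclive.keras import DVCLiveCallback  # DVCLive's Keras callback

# `model`, `x_train`, and `y_train` are assumed to be defined elsewhere.
model.fit(
    x_train,
    y_train,
    epochs=10,
    callbacks=[
        # The framework writes the best weights directly where DVCLive keeps
        # artifacts, so no separate log_artifact call would be needed at the end.
        ModelCheckpoint(filepath="dvclive/artifacts/best.h5", save_best_only=True),
        DVCLiveCallback(),
    ],
)
```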
