Decouple dvc add from log_artifact #572
I think we have discussed this elsewhere, but we went with the approach of not doing this in the integrations and instead documenting/explaining to users how to track the model once the training loop is completed, like in: https://dvc.org/doc/start/experiments/experiment-tracking?tab=Pytorch-Lightning

This was under the assumption of tracking the model only once, after training. Are we discussing a different scenario here?
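For reference, the documented end-of-training pattern looks roughly like this (a minimal sketch assuming the dvclive `Live` API; the loss values and `model.pt` are placeholders for real training code):

```python
from pathlib import Path
from dvclive import Live

# Minimal sketch of the documented pattern: log metrics per step, then track
# the model once after the training loop finishes.
live = Live()
for epoch in range(3):
    loss = 1.0 / (epoch + 1)               # stand-in for a real training step
    live.log_metric("train/loss", loss)
    live.next_step()

Path("model.pt").write_text("weights")     # stand-in for torch.save(...)
# A single log_artifact call after training; this is where `dvc add` currently runs.
live.log_artifact("model.pt", type="model")
live.end()
```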
Scenarios include updating the best model from time to time (e.g., to be able to early stop, or just to have the best weights at the end in general) + recovery. Yes, it can be done at the end as well with a copy flag, but it's unrealistic to expect users or callbacks to behave that way. We can potentially run …
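To make that scenario concrete, here is a rough sketch of the kind of callback being described (the class, hook name, and path are hypothetical; only the `Live.log_artifact` call reflects the current behavior):

```python
from pathlib import Path
from dvclive import Live


class BestModelTracker:
    """Hypothetical callback: keep the best weights on disk up to date and
    track them with DVCLive every time they improve."""

    def __init__(self, live: Live, path: str = "best.pt"):
        self.live = live
        self.path = path
        self.best_loss = float("inf")

    def on_epoch_end(self, loss: float) -> None:
        if loss < self.best_loss:
            self.best_loss = loss
            Path(self.path).write_text("weights")  # stand-in for torch.save(...)
            # Today this triggers `dvc add` on every improvement, which is
            # the coupling this issue is about.
            self.live.log_artifact(self.path, type="model")
```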
Why does …? I think doing …
I think what I'm trying to cover here (and we can discuss the details about YOLO) is the workflow where a framework has a callback that saves the model periodically during training.

Re that specific YOLO repository: no strong reason, since right now it writes it locally (in …).
Makes sense, but I don't think it's typical to call log_artifact on each checkpoint. Other frameworks sometimes have options for saving either at the end or at model checkpoints, but I don't see it enabled by default (see https://pytorch-lightning.readthedocs.io/en/1.6.1/extensions/logging.html, https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/callback, etc.). In the cases where users opt to call it, …
That's the first one I checked: it logs the best model at the end of each fit epoch (they don't use model_save because it was added later?).

When I talk about frameworks I also include Keras et al. It would be great to just pass them our callback and be done with it. They are supposed to do things internally, and if they write models regularly, we should probably do so as well. (I don't have that much experience here and am going by intuition, but I can take a look.)
What do you want log_artifact to do? I see the method doing 3 things: …
In the future, we could do something like uploading each checkpoint to Studio to make recovery easier, but right now I don't see the point in calling …
I guess I missed that one, and AFAICT it only copies the file locally; not sure what the user value is. For the others, they are logging at the end of training if anywhere, AFAICT: …
That's why I included the Lightning and Hugging Face refs above (our two most popular frameworks). Check out how wandb handles these frameworks: …
Each has its own bespoke logic with multiple options, and none include checkpointing by default with the basic logger. Instead, they focus on robust docs and examples that explain the options for how to do these things and make them easy to copy/paste.
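As an illustration of that copy/paste pattern, here is a rough self-contained sketch for Lightning (the tiny model and data are placeholders, and it assumes dvclive's DVCLiveLogger exposes the underlying Live instance via the logger's `experiment` attribute, as Lightning loggers typically do):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from dvclive.lightning import DVCLiveLogger


class TinyModel(pl.LightningModule):
    """Placeholder LightningModule so the sketch is runnable."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Log at epoch level so ModelCheckpoint can monitor it.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


data = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
checkpoint = ModelCheckpoint(monitor="train_loss", save_top_k=1)
logger = DVCLiveLogger()
trainer = pl.Trainer(max_epochs=2, logger=logger, callbacks=[checkpoint])
trainer.fit(TinyModel(), data)

# Track only the best checkpoint, once, after training has finished, instead
# of logging the artifact on every epoch.
logger.experiment.log_artifact(checkpoint.best_model_path, type="model")
```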
Agreed on the YOLO; I'll change the logic there. What about other callbacks, like Keras, etc.? Is it possible for us to rely on them to create the model in the right location (…)?

Also, what about other storage options? Should we introduce a flag? Should we change the default? And how do we pass those flags into callbacks (like don't do dvc add, don't create an experiment, etc., for example in the case of YOLO)?
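To make those questions concrete, a purely hypothetical sketch of one shape the flag passing could take; `save_dvc_exp` and the copy flag are mentioned in this thread, while the `cache=` switch and the path are made up for illustration:

```python
from dvclive import Live

# Hypothetical sketch only: thread per-integration behavior through a single
# configured Live instance instead of hard-coding it in each callback.
live = Live(save_dvc_exp=False)  # e.g. "don't create an experiment" for YOLO

# `copy=True` is the copy flag discussed above; `cache=False` is NOT an
# existing option, it just illustrates a possible "don't run dvc add" switch,
# and "weights/best.pt" is a placeholder path.
live.log_artifact("weights/best.pt", type="model", copy=True, cache=False)
```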
There are a bunch of issues in the current implementation; most of them are related to the workflow where one has to run log_artifact on each iteration (e.g., saving the best model via a callback).

Related to #551