Clean Expired Messages before Experiments Start #10578

Open
nablabits opened this issue Oct 3, 2024 · 3 comments
Labels
A: experiments Related to dvc exp p2-medium Medium priority, should be done, but less important

Comments

@nablabits

This is a follow up from:

where a minimal fix was shipped. As discussed in this comment, it may be a good idea to explore the possibility and implications of cleaning the expired messages before an experiment starts.

@shcheklein shcheklein added p2-medium Medium priority, should be done, but less important A: experiments Related to dvc exp labels Oct 3, 2024
@nablabits
Author

Just a quick heads-up in case someone lands on this issue and thinks it's unattended: I'm working on it and will post an update soon 🐌. Thanks for your patience.

@nablabits
Author

Context

Let's recap the background to make sure this still makes sense.

  • As part of the previous issue we discovered that kombu will put messages in the processing directory located in .dvc/tmp/exps/celery/broker/in/ one second in the future (Source).
  • This may prevent some messages from being properly cleaned once the first worker has finished (Source).
  • We may not want to add an extra 2 seconds to that check, as that could cause one worker to remove the shutdown message before the other worker has consumed it.
  • It may be a good idea to clean the directory upfront to remove any messages that have remained because of the above.
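
To make the last bullet concrete, here is a minimal sketch of what an upfront sweep could look like. It is not DVC's or kombu's actual code: `clean_expired` is a hypothetical helper, and the on-disk layout it relies on (one JSON document per `*.msg` file, with an optional `expires` epoch timestamp under `headers`) is an assumption that would need checking against the real broker files.

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def clean_expired(directory: Path, now: Optional[float] = None) -> List[Path]:
    """Delete message files whose ``expires`` header is not in the future.

    Hypothetical helper. Assumes one JSON object per ``*.msg`` file with an
    optional ``headers.expires`` epoch timestamp. Files that cannot be read
    or parsed, and files without an expiry, are left untouched.
    """
    now = time.time() if now is None else now
    removed = []
    for path in sorted(directory.glob("*.msg")):
        try:
            message = json.loads(path.read_text())
            expires = (message.get("headers") or {}).get("expires")
        except (OSError, ValueError, AttributeError):
            continue  # unreadable or unexpected shape: keep the file
        if isinstance(expires, (int, float)) and expires <= now:
            path.unlink()
            removed.append(path)
    return removed
```

Something like `clean_expired(Path(".dvc/tmp/exps/celery/broker/in"))` would then run before the workers start. Note that the comparison uses no extra margin, so a message that expires one second in the future is kept, which matches the concern in the third bullet about removing messages too eagerly.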

What I Have Discovered

  • With the fix we shipped in the previous issue, I couldn't reproduce any scenario where a message is left uncleaned because of the 1-second offset, but who knows, there may be similar situations.
  • There's a scenario where a significant number of files may remain. It goes as follows:
    • There's a big imbalance between experiments, say one experiment takes 1 minute longer to complete than the other. For example --set-param "train.fine_tune_args.epochs=2,20"
    • The short experiment is placed in the first worker. This is not always the case as they seem to be picked at random
    • Because the --clean flag is attached to the first worker (source), it will start cleaning right after it shuts down, while the second worker is still running.
    • The second worker is effectively putting messages in .dvc/tmp/exps/celery/broker/in/, but they seem to be for the [email protected] queue which suggests a fanout. These are the messages that remain.
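
To check which queues the leftover files belong to, a small inspection sketch like the following may help. It assumes the broker file names follow a `<prefix>.<queue>.msg` pattern, where the prefix contains no dots; that naming is an assumption here, not something confirmed in this thread, and `leftover_by_queue` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def leftover_by_queue(directory: Path) -> Counter:
    """Count remaining ``*.msg`` files per queue name.

    Hypothetical helper. Assumes names shaped like ``<prefix>.<queue>.msg``
    where ``<prefix>`` has no dots; the queue name itself may contain dots.
    """
    counts = Counter()
    for path in directory.glob("*.msg"):
        stem = path.name[: -len(".msg")]
        # Everything after the first dot is treated as the queue name.
        _, sep, queue = stem.partition(".")
        counts[queue if sep else "<unknown>"] += 1
    return counts
```

Running `leftover_by_queue(Path(".dvc/tmp/exps/celery/broker/in"))` after both workers have finished would show whether the remaining files really cluster on a single fanout-style queue.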

Unknowns

  • It's not clear to me yet what impact these remaining files have on subsequent runs.
  • It's not clear either what the implications are of triggering clean on the second worker's messages. According to dvc queue status, both experiments were successful.

What do you think @shcheklein, is there a strong case to clean upfront?

@shcheklein
Member

The second worker is effectively putting messages in .dvc/tmp/exps/celery/broker/in/, but they seem to be for the [email protected] queue which suggests a fanout. These are the messages that remain.

@nablabits could you give a few more details, please?

From your description, the problem seems to be that the first worker is cleaning up too early, not that there are files left. Or am I missing something?

What do you think @shcheklein, is there a strong case to clean upfront?

Don't know yet; it may well be that we don't need it. It would be great to get a few more details.
