exp: properly handle failed to execute experiments #7126
Comments
This behavior means that as I keep running new experiments, the queue keeps growing with failed experiments. There is no easy way to see which ones will fail without running them first. It completely breaks the workflow. The only way out is to clean up all queued exps with
So this is also related to

The original reason for keeping the failed experiments in the queue is that currently if we remove them automatically, there will be no indication (in the

I think what we probably want is a proper way to keep a separate ref(s) with the failed exp queue/stash commits, and then display those in the table separately from the existing queue (and indicate that they failed). Ideally, we would also be able to retrieve the logs for the failures so that the user can see what went wrong as well (related: #7002). In this case, the user would still need to explicitly remove them (with some new flag for
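To make the idea concrete, here is a rough toy model of the bookkeeping described above. It is only a sketch and not how DVC stores anything: `QueueEntry`, `FailedEntry`, `ExperimentQueue`, `run_all` and `remove_failed` are hypothetical names, and plain Python lists stand in for the git refs/stash commits.

```python
from dataclasses import dataclass, field


@dataclass
class QueueEntry:
    name: str     # hypothetical label for a queued experiment
    params: dict  # parameter overrides the experiment was queued with


@dataclass
class FailedEntry:
    entry: QueueEntry
    log: str  # captured failure message, so the user can see what went wrong


@dataclass
class ExperimentQueue:
    pending: list = field(default_factory=list)
    done: list = field(default_factory=list)
    # Stand-in for the separate "failed" ref(s): failures leave `pending`
    # but stay visible here until the user removes them explicitly.
    failed: list = field(default_factory=list)

    def run_all(self, execute):
        """Run every pending entry; `execute(entry)` raises on failure."""
        entries, self.pending = self.pending, []
        for entry in entries:
            try:
                execute(entry)
            except Exception as exc:
                self.failed.append(FailedEntry(entry, str(exc)))
            else:
                self.done.append(entry)

    def remove_failed(self):
        """Explicit cleanup step; returns the entries that were dropped."""
        removed, self.failed = self.failed, []
        return removed
```

With this shape, the pending queue is always empty after `run_all`, failed entries can be listed separately together with their logs, and nothing is dropped silently.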
For the record, this is causing some weird behaviour in the VS Code extension. Could be related to iterative/vscode-dvc#828.
@shcheklein Should we close this one now? I think any issues at this point should be bugs that we can handle separately from this high-level product issue. What do you think?
Yep, agreed!
Bug Report
Description
It's easy to queue multiple experiments that then conflict with already existing ones. This usually happens if you run them with the same set of params. It's extremely easy to make this mistake when you use the tool extensively.
After that, the background queue mode breaks. It won't run the conflicting experiments and completes the regular ones, but it fails to clean the queue (even for those that don't fail).
Reproduce
Clone example-get-started
Run an experiment
Queue multiple experiments w/o changing params
Queue one with some different param
Run all
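The steps above could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands used for the report: the `dvc pull` step, the number of duplicate queue entries, and the `train.n_est=150` override are guesses added for illustration, and `repro_commands` / `run` are hypothetical helper names.

```python
import subprocess

REPO_URL = "https://github.com/iterative/example-get-started"


def repro_commands(duplicates=2, param="train.n_est=150"):
    """Build the command sequence for the reproduction steps."""
    cmds = [
        ["git", "clone", REPO_URL],
        ["dvc", "pull"],        # assumption: fetch the data before running
        ["dvc", "exp", "run"],  # run one experiment with the default params
    ]
    # Queue several experiments without changing any params.
    cmds += [["dvc", "exp", "run", "--queue"] for _ in range(duplicates)]
    # Queue one experiment with a different param (assumed param name).
    cmds.append(["dvc", "exp", "run", "--queue", "--set-param", param])
    # Run everything that is queued.
    cmds.append(["dvc", "exp", "run", "--run-all"])
    return cmds


def run(cmds, cwd="example-get-started"):
    """Execute the commands; everything after the clone runs inside the repo."""
    for cmd in cmds:
        workdir = None if cmd[:2] == ["git", "clone"] else cwd
        # check=False: keep going even if a command exits non-zero.
        subprocess.run(cmd, cwd=workdir, check=False)
```

Calling `run(repro_commands())` in an empty directory with `git` and `dvc` installed should walk through the same sequence.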
Expected
After run all has executed, there should:
Environment information
Output of dvc doctor:
Additional Information (if any):
Mind the experiments 9ab83fd and 1c6dacd
That's what it looks like in the logs: