exp: properly handle failed to execute experiments #7126

Closed
shcheklein opened this issue Dec 11, 2021 · 5 comments
Labels
A: experiments (Related to dvc exp), bug (Did we break something?), product: VSCode (Integration with VSCode extension)

Comments

@shcheklein
Member

shcheklein commented Dec 11, 2021

Bug Report

Description

It's easy to queue multiple experiments that then conflict with already existing ones. This usually happens if you queue them with the same set of params. It's extremely easy to make this mistake when you use the tool extensively.

After that, the background queue mode breaks. It won't run the conflicting experiments and it completes the regular ones, but it fails to clean the queue (even for the experiments that didn't fail).

Reproduce

  1. Clone example-get-started
  2. Run an experiment
  3. Queue multiple experiments w/o changing params
  4. Queue one with some different param
  5. Run all
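
The steps above could be scripted roughly as follows. This is a sketch rather than a verified reproduction: the repository URL, the `dvc pull` step, the number of duplicate queue entries, and the `featurize.max_features=200` override are assumptions that the report does not state. Commands are only printed unless `execute=True` is passed.

```python
import subprocess

# Assumed location of the example project; the report only names "example-get-started".
REPO_URL = "https://github.com/iterative/example-get-started"
CLONE_DIR = "example-get-started"


def repro_commands(duplicates=2, changed_param="featurize.max_features=200"):
    """Return (argv, run_inside_clone) pairs for the reproduction steps."""
    steps = [
        (["git", "clone", REPO_URL], False),   # 1. clone the project
        (["dvc", "pull"], True),               # assumed: fetch the tracked data first
        (["dvc", "exp", "run"], True),         # 2. run an experiment
    ]
    # 3. queue several experiments without changing params
    steps += [(["dvc", "exp", "run", "--queue"], True)] * duplicates
    # 4. queue one with a different param (parameter name is an assumption)
    steps.append((["dvc", "exp", "run", "--queue", "-S", changed_param], True))
    # 5. run everything in the queue
    steps.append((["dvc", "exp", "run", "--run-all"], True))
    return steps


def run(steps, execute=False):
    """Print each command; run it only when execute=True."""
    for argv, inside_clone in steps:
        print("$", " ".join(argv))
        if execute:
            # check=False so that a failing step does not abort the script
            subprocess.run(argv, cwd=CLONE_DIR if inside_clone else None, check=False)


if __name__ == "__main__":
    run(repro_commands())
```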

Expected

After run all has executed:

  • there should be a signal about conflicting exps (and an easy way to clean them up)
  • executed experiments should be cleaned up

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.9.2.dev7+g896bbf1f
---------------------------------
Platform: Python 3.9.8 on macOS-12.0-arm64-arm-64bit
Supports:
	azure (adlfs = 2021.10.0, knack = 0.9.0, azure-identity = 1.7.1),
	gdrive (pydrive2 = 1.10.0),
	gs (gcsfs = 2021.11.1),
	hdfs (fsspec = 2021.11.1, pyarrow = 6.0.1),
	webhdfs (fsspec = 2021.11.1),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2021.11.1, boto3 = 1.19.8),
	ssh (sshfs = 2021.11.2),
	oss (ossfs = 2021.8.0),
	webdav (webdav4 = 0.9.3),
	webdavs (webdav4 = 0.9.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Mind the experiments 9ab83fd and 1c6dacd:

(Screenshot: Screen Shot 2021-12-10 at 6 55 03 PM)

That's what it looks like in the logs:

$ dvc exp run --run-all

...

Stage 'evaluate' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'
ERROR: Experiment conflicts with existing experiment 'exp-237fe'. To overwrite the existing experiment run:

	dvc exp run -f ...

To run this experiment with a different name run:

	dvc exp run -n <new_name> ...

(.env) ?255 Projects/example-get-started %
@shcheklein added the A: experiments (Related to dvc exp) and bug (Did we break something?) labels Dec 11, 2021
@shcheklein
Member Author

This behavior means that as I keep running new experiments, the queue keeps growing with failed experiments. There is no easy way to see which ones will fail without running them first. It completely breaks the workflow. The only way out is to clean up all queued exps with dvc exp remove --queue.

@pmrowla
Contributor

pmrowla commented Dec 13, 2021

So this is also related to --queue/--run-all being mostly experimental and not an actual solution to #5615.

The original reason for keeping the failed experiments in the queue is that currently if we remove them automatically, there will be no indication (in the exp show table) afterwards that we attempted to run that experiment at all (and that the run failed).

I think what we probably want is a proper way to keep a separate ref(s) with the failed exp queue/stash commits, and then display those in the table separately from the existing queue (and indicate that they failed). Ideally, we would also be able to retrieve the logs for the failures so that the user can see what went wrong as well (related: #7002).

In this case, the user would still need to explicitly remove them (with some new flag for remove/gc or maybe via a cleanup command), but the failed exps would not be automatically retried the next time the user uses --run-all.
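
To make the proposal concrete, here is a minimal in-memory sketch of the intended behaviour: failed entries move out of the queue into a separate record (with their error text), are not retried on the next run, and stay visible until removed explicitly. Everything here is hypothetical and for illustration only. `QueueModel`, `run_all`, `remove_failed`, and the `exp-new` name are invented, and none of this reflects DVC's actual refs, stash handling, or API.

```python
from dataclasses import dataclass, field


@dataclass
class QueueModel:
    """Toy model of a queue with a separate failed list; not DVC's data structures."""
    queued: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)  # entry name -> captured error text
    done: list = field(default_factory=list)

    def run_all(self, execute):
        """Run every queued entry once; `execute` is expected to raise on failure."""
        pending, self.queued = self.queued, []
        for entry in pending:
            try:
                execute(entry)
            except Exception as exc:
                # Record the failure and its "log" instead of leaving it queued.
                self.failed[entry] = str(exc)
            else:
                self.done.append(entry)

    def remove_failed(self):
        """Explicit cleanup, standing in for a new remove/gc flag."""
        removed = list(self.failed)
        self.failed.clear()
        return removed


def demo():
    existing = {"exp-237fe"}  # name taken from the error message in the report

    def execute(name):
        if name in existing:
            raise RuntimeError(f"conflicts with existing experiment '{name}'")
        existing.add(name)

    model = QueueModel(queued=["exp-237fe", "exp-new"])
    model.run_all(execute)
    return model


if __name__ == "__main__":
    m = demo()
    print("queued:", m.queued)
    print("done:", m.done)
    print("failed:", m.failed)
```

In this model a second `run_all` call finds an empty queue, so the conflicting entry is not retried, while `failed` still carries enough information to show it in a table.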

@mattseddon added the product: VSCode (Integration with VSCode extension) label Dec 19, 2021
@mattseddon
Member

For the record, this is causing some weird behaviour in the VS Code extension. It could be related to iterative/vscode-dvc#828.

@dberenbaum
Collaborator

@shcheklein Should we close this one now? Any issues at this point I think should be bugs that we can handle separately from this high-level product issue. What do you think?

@shcheklein
Member Author

Yep, agreed!

4 participants