exp: properly handle failed to execute experiments #7126

shcheklein · 2021-12-11T03:00:57Z

Bug Report

Description

It's easy to queue multiple experiments that then conflict with already existing ones. This usually happens if you queue them with the same set of params. It's extremely easy to make this mistake when you use the tool extensively.

After that, the background queue mode breaks. It won't run the conflicting experiments and it completes the regular ones, but it fails to clean the queue (even for those that don't fail).

Reproduce

Clone example-get-started
Run an experiment
Queue multiple experiments w/o changing params
Queue one with some different param
Run all

Expected

After --run-all has executed:

there should be a signal about conflicting exps (and an easy way to clean them up)
executed experiments should be cleaned up

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.9.2.dev7+g896bbf1f
---------------------------------
Platform: Python 3.9.8 on macOS-12.0-arm64-arm-64bit
Supports:
	azure (adlfs = 2021.10.0, knack = 0.9.0, azure-identity = 1.7.1),
	gdrive (pydrive2 = 1.10.0),
	gs (gcsfs = 2021.11.1),
	hdfs (fsspec = 2021.11.1, pyarrow = 6.0.1),
	webhdfs (fsspec = 2021.11.1),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2021.11.1, boto3 = 1.19.8),
	ssh (sshfs = 2021.11.2),
	oss (ossfs = 2021.8.0),
	webdav (webdav4 = 0.9.3),
	webdavs (webdav4 = 0.9.3)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Mind the experiments 9ab83fd and 1c6dacd.

This is how it looks in the logs:

$ dvc exp run --run-all

...

Stage 'evaluate' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'
ERROR: Experiment conflicts with existing experiment 'exp-237fe'. To overwrite the existing experiment run:

	dvc exp run -f ...

To run this experiment with a different name run:

	dvc exp run -n <new_name> ...

(.env) ?255 Projects/example-get-started %


shcheklein · 2021-12-11T03:09:11Z

This behavior means that as I keep running new experiments, the queue keeps growing with failed experiments. There is no easy way to see which ones will fail w/o running them first. It completely breaks the workflow. The only way out is to clean up all queued exps with dvc exp remove --queue
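
To illustrate the kind of pre-check that is missing, here is a rough sketch (not DVC code) that flags queued entries whose params match an existing experiment or an earlier queued entry. It assumes params are available as flat dicts, and that identical params are what triggers the conflict. The function name, the queue entry names and the param values are all made up for illustration.

```python
from typing import Dict, Hashable


def find_duplicate_params(
    existing: Dict[str, Dict[str, Hashable]],
    queued: Dict[str, Dict[str, Hashable]],
) -> Dict[str, str]:
    """Map each conflicting queued entry to the entry it duplicates.

    Hypothetical helper: both arguments map a name to a flat params dict.
    """

    def key(params: Dict[str, Hashable]) -> tuple:
        # Order-independent fingerprint of a flat params dict.
        return tuple(sorted(params.items()))

    # Remember which entry first used each param combination.
    seen = {key(params): name for name, params in existing.items()}
    conflicts: Dict[str, str] = {}
    for name, params in queued.items():
        fingerprint = key(params)
        if fingerprint in seen:
            conflicts[name] = seen[fingerprint]
        else:
            seen[fingerprint] = name
    return conflicts


# Illustrative data only: one finished experiment and three queued ones.
existing = {"exp-237fe": {"train.n_est": 50}}
queued = {
    "q1": {"train.n_est": 50},
    "q2": {"train.n_est": 50},
    "q3": {"train.n_est": 100},
}
conflicts = find_duplicate_params(existing, queued)
```

With this data, q1 and q2 would be reported as conflicting with exp-237fe before anything is run, and q3 would pass.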

pmrowla · 2021-12-13T06:44:28Z

So this is also related to --queue/--run-all being mostly experimental and not an actual solution to #5615.

The original reason for keeping the failed experiments in the queue is that currently if we remove them automatically, there will be no indication (in the exp show table) afterwards that we attempted to run that experiment at all (and that the run failed).

I think what we probably want is a proper way to keep a separate ref(s) with the failed exp queue/stash commits, and then display those in the table separately from the existing queue (and indicate that they failed). Ideally, we would also be able to retrieve the logs for the failures so that the user can see what went wrong as well (related: #7002).

In this case, the user would still need to explicitly remove them (with some new flag for remove/gc or maybe via a cleanup command), but the failed exps would not be automatically retried the next time the user uses --run-all.
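
A toy model of that behaviour, just to make the idea concrete. This is not how DVC stores things (the real proposal is about Git refs and stash commits); the ExpQueue class and its method names are hypothetical. Failed entries move to a separate list, stay visible there, are not retried by the next run, and are only dropped by an explicit cleanup call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExpQueue:
    """Hypothetical in-memory stand-in for the experiment queue."""

    pending: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    succeeded: List[str] = field(default_factory=list)

    def run_all(self, run: Callable[[str], None]) -> None:
        # Take a snapshot and empty the queue, so nothing is left behind
        # regardless of which entries fail.
        to_run, self.pending = self.pending, []
        for exp in to_run:
            try:
                run(exp)
            except Exception:
                # Keep a record of the failure instead of re-queuing it.
                self.failed.append(exp)
            else:
                self.succeeded.append(exp)

    def clear_failed(self) -> List[str]:
        # Explicit cleanup, analogous to a new remove/gc flag.
        removed, self.failed = self.failed, []
        return removed


def fake_run(exp: str) -> None:
    # Pretend that entries named "dup-*" conflict with an existing experiment.
    if exp.startswith("dup"):
        raise RuntimeError("Experiment conflicts with existing experiment")


queue = ExpQueue(pending=["dup-1", "ok-1", "dup-2"])
queue.run_all(fake_run)
```

After the first run_all the pending list is empty, the two conflicting entries sit in failed, and a second run_all has nothing to retry until the user queues something new.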

mattseddon · 2021-12-19T21:29:06Z

For the record this is causing some weird behaviour in the VS Code extension. Could be related to iterative/vscode-dvc#828.

dberenbaum · 2022-07-29T14:55:15Z

@shcheklein Should we close this one now? Any issues at this point I think should be bugs that we can handle separately from this high-level product issue. What do you think?

shcheklein · 2022-07-29T23:39:47Z

Yep, agreed!

shcheklein added the A: experiments and bug labels Dec 11, 2021

mattseddon added the product: VSCode label Dec 19, 2021

mattseddon mentioned this issue Dec 19, 2021

exp --queue and --run-all can hang and cause future exp show runs to hang iterative/vscode-dvc#828

Closed

shcheklein closed this as completed Jul 29, 2022
