Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

queue: preserve checkpoints for failed experiments #8750

Closed
dberenbaum opened this issue Dec 30, 2022 · 5 comments · Fixed by #9271
Closed

queue: preserve checkpoints for failed experiments #8750

dberenbaum opened this issue Dec 30, 2022 · 5 comments · Fixed by #9271
Labels
A: task-queue Related to task queue. bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments

@dberenbaum
Copy link
Collaborator

    > > One problem remained for the `--queue` experiments, it returns results at failure but with only a failed task, while all of the checkpoints are lost.

@karajan1001 Is this still an issue?

This behavior is different from --temp in which we return completed checkpoint results even if the tasks failed. I think the --temp behavior is more reasonable.

Originally posted by @karajan1001 in #8668 (comment)

@dberenbaum dberenbaum added A: task-queue Related to task queue. p1-important Important, aka current backlog of things to do labels Dec 30, 2022
@dberenbaum
Copy link
Collaborator Author

If I gracefully stop an experiment (dvc queue kill), it shows that the experiment was a success, which is the same result I get from dvc exp run --temp:

──────────────────────────────────────────────────────────────────────────────────────>
  Experiment                 Created        train.loss   train.acc   test.loss   test.a>
 ──────────────────────────────────────────────────────────────────────────────────────>
  workspace                  -                  0.9731      0.7694     0.95962     0.77>
  main                       Dec 15, 2022       0.9731      0.7694     0.95962     0.77>
  │ ╓ 712de14 [worst-coho]   10:33 AM           1.0494     0.76165      1.0356     0.77>
  │ ╟ 71c453a                10:33 AM           1.1328     0.68682      1.1244      0.6>
  │ ╟ cd54bc5                10:33 AM           1.1393     0.74353      1.1267     0.75>
  │ ╟ dee160c                10:33 AM           1.2593     0.68985      1.2477     0.69>
  │ ╟ 336f87d                10:33 AM           1.2834      0.7272      1.2732     0.73>
  │ ╟ 9980f28                10:33 AM           1.4289     0.68332      1.4172     0.68>
  │ ╟ 903626d                10:32 AM           1.4766      0.6644      1.4685     0.66>
  │ ╟ afd7851                10:32 AM           1.6247      0.6674      1.6142     0.66>
  │ ╟ cb2cdd4                10:32 AM           1.6859     0.54373      1.6806     0.54>
  │ ╟ acd2f1d                10:32 AM           1.8042     0.65827      1.7964     0.66>
  │ ╟ a874e51                10:32 AM            1.956     0.49918      1.9518     0.50>
  │ ╟ 58a6699                10:32 AM           2.1043     0.40697       2.101     0.41>
  ├─╨ b48055c                10:32 AM            2.224     0.27073      2.2228     0.27>
 ──────────────────────────────────────────────────────────────────────────────────────>

If I forcefully stop an experiment (dvc queue kill -f), it shows a failure and I get separate entries for the failed task and the completed checkpoints:

 ──────────────────────────────────────────────────────────────────────────────────────>
  Experiment                 Created        State    train.loss   train.acc   test.loss>
 ──────────────────────────────────────────────────────────────────────────────────────>
  workspace                  -              -            0.9731      0.7694     0.95962>
  main                       Dec 15, 2022   -            0.9731      0.7694     0.95962>
  │ ╓ 05061e9 [rarer-flip]   10:38 AM       -            1.6247      0.6674      1.6142>
  │ ╟ 0d335ae                10:38 AM       -            1.6859     0.54373      1.6806>
  │ ╟ 487f6d0                10:38 AM       -            1.8042     0.65827      1.7964>
  │ ╟ 284a7e6                10:38 AM       -             1.956     0.49918      1.9518>
  │ ╟ 268b872                10:38 AM       -            2.1043     0.40697       2.101>
  ├─╨ 9d21a58                10:38 AM       -             2.224     0.27073      2.2228>
  └── 2491752                10:37 AM       Failed            -           -           ->
 ──────────────────────────────────────────────────────────────────────────────────────>

Although VS Code shows the failure under the same experiment:

Screenshot 2022-12-30 at 10 39 58 AM

This makes some sense to me except for the failed checkpoint showing up separate from the rest of the experiment. What do you think?

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Dec 30, 2022
@karajan1001
Copy link
Contributor

@dberenbaum, Yes, it looks like there is something wrong with my repository in the previous. After updating to the current main branch I get the same result, I think the current behavior is reasonable.

@mattseddon
Copy link
Member

I came across this issue whilst working through the checkboxes in iterative/vscode-dvc#3091

Generating the extra experiment record when killing an experiment running in the queue means that exp remove will throw an error if the user tries to remove the "failed record":

Screen.Recording.2023-01-27.at.12.57.04.pm.mov

provided error is ERROR: unexpected error - Invalid stash ref 'refs/exps/celery/failed@{0}'

Not a massive problem as the records get removed. More of a minor annoyance.

@dberenbaum
Copy link
Collaborator Author

Hmm, okay, raising this to p1 since removing any entry shown in the table shouldn't fail. Once we fix that problem raised by @mattseddon, I think we can close this one.

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Jan 27, 2023
@pmrowla
Copy link
Contributor

pmrowla commented Mar 27, 2023

dropping this to P2 while the discussion about whether to drop checkpoints support entirely is still ongoing

@pmrowla pmrowla added p2-medium Medium priority, should be done, but less important bug Did we break something? and removed p1-important Important, aka current backlog of things to do labels Mar 27, 2023
@dberenbaum dberenbaum closed this as not planned Won't fix, can't repro, duplicate, stale May 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: task-queue Related to task queue. bug Did we break something? p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants