guide: GH resuming workflow #207

casperdcl · 2022-03-19T11:04:16Z

Add a self-hosted long-running example to https://cml.dev/doc/cml-with-dvc (or somewhere else)

A GH action launches a "self-hosted" GCP/AWS runner using cml runner --reuse --labels=cml, and probably --cloud-spot.
The GH action runs the rest of the workflow on the "self-hosted" runner using runs-on: [self-hosted, cml] and timeout-minutes: 50400.
If the GH action is about to time out, CML will restart the workflow.

i.e. https://cml.dev/doc/self-hosted-runners?tab=GitHub#allocating-cloud-compute-resources-with-cml
The key is requesting GH's maximum timeout-minutes: 50400 (35 days); this signals CML to restart the workflow just before the timeout.
Write code to cache results so that the restarted workflow reuses previous results (e.g. use https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints and Feature exp run: Dryer resume within the CI dvc#6823).
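
A minimal sketch of the "cache results" idea, using only the Python standard library rather than DVC checkpoints. The helper name cached, the JSON file layout and the cache directory are all assumptions made for illustration; the cache directory would also have to survive the restart (for example on a reused runner's disk or in remote storage), which this sketch does not handle.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Callable


def cached(cache_dir: Path, name: str, compute: Callable[[], Any]) -> Any:
    """Return the stored result for `name`, computing and storing it if absent.

    Hypothetical helper: not part of CML or DVC.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.json"
    if target.exists():
        # A previous (possibly interrupted) workflow run already did this work.
        return json.loads(target.read_text())
    result = compute()
    # Write to a temporary file first, then rename it into place, so an
    # interrupted run never leaves a half-written result behind.
    fd, tmp = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(result, f)
    os.replace(tmp, target)
    return result
```

A restarted workflow that calls cached(Path(".cache"), "train", train_fn) again would read the stored JSON instead of calling train_fn a second time, provided the result is JSON-serialisable.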


casperdcl · 2022-04-01T15:51:37Z

more musings (for cml runner --cloud-spot):

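from pathlib import Path

import dvclive

# Model, X and Y are placeholders for the user's own model class and training data
live = dvclive.Live(resume=True)  # continue counting steps from the previous run
model = Model(load="model.pkl" if Path("model.pkl").exists() else None)
while (epoch := live.get_step()) < 100:
    history = model.fit(X, Y, epochs=1)
    if epoch % 10 == 0:  # at most 10 epochs are lost upon CML respawning a spot instance
        model.save("model.pkl")
    live.log("loss", history["loss"])
    live.next_step()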
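
One possible refinement, sketched without dvclive: store the epoch counter and the model state in the same checkpoint file, so that after a respawn the counter rewinds to the last saved epoch together with the weights. The train function, the dict-based stand-in "model" and the crash_at parameter are hypothetical and exist only to simulate a spot instance disappearing.

```python
import pickle
from pathlib import Path
from typing import Optional

CHECKPOINT_EVERY = 10  # same interval as in the musing above


def train(ckpt: Path, total_epochs: int, crash_at: Optional[int] = None) -> dict:
    """Train a stand-in model, resuming from `ckpt` when it exists.

    `crash_at` simulates a spot instance vanishing just before that epoch runs.
    """
    if ckpt.exists():
        state = pickle.loads(ckpt.read_bytes())
    else:
        state = {"epoch": 0, "weight": 0.0}
    while state["epoch"] < total_epochs:
        if crash_at is not None and state["epoch"] == crash_at:
            raise RuntimeError("simulated spot termination")
        state["weight"] += 1.0  # stand-in for one epoch of model.fit
        state["epoch"] += 1
        if state["epoch"] % CHECKPOINT_EVERY == 0:
            # Epoch counter and weights are saved together, so they cannot drift apart.
            ckpt.write_bytes(pickle.dumps(state))
    return state
```

With a crash before epoch 25, the next call resumes from the checkpoint written at epoch 20, so fewer than CHECKPOINT_EVERY epochs are repeated and the final state has completed every epoch exactly once.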

jorgeorpinel · 2022-09-29T18:44:33Z

Out of curiosity, what makes this p1? Perhaps there are lots of support cases that could be avoided by, or redirected to, this? Thanks

casperdcl · 2022-10-03T12:08:08Z

lots of support requests over YEARS; super overdue.

omesser · 2023-04-13T01:49:22Z

deprioritized and frozen. Removing from CML project board for now

casperdcl added the documentation (Markdown files) and p1-important (High priority) labels Mar 19, 2022

casperdcl mentioned this issue Apr 1, 2022

next_step() needs log() iterative/dvclive#232

Closed

casperdcl mentioned this issue Jul 27, 2022

clarify spot instance pricing #285

Open

casperdcl added the epic (Collection of sub-issues) label Jul 29, 2022

casperdcl assigned dacbd Nov 15, 2022

This was referenced Nov 24, 2022

Usage -> User Guide (initial) #382

Closed

Fix CML with DVC page #389

Open

dacbd removed their assignment Apr 14, 2023
