Allow checkpoint resume when recovering a workflow #486

andrewwdye · 2022-09-25T06:51:08Z

TL;DR

This change propagates task-level checkpoint info between failed and recovered node executions. Previously, intra-task checkpointing was only supported for task-level retries within a single node execution.

Related PR: flyteorg/flyteadmin#479

NOTE: cannot merge this until #467 is resolved

Type

Bug Fix
Feature
Plugin

Are all requirements met?

Complete description

This change

Saves the checkpoint path from successful or failed task nodes in TaskNodeMetadata
Sends the path to flyteadmin as part of the NodeExecutionEvent, to be stored in the db
When recovering a node execution, reads the checkpoint path from NodeExecution.Closure and stores it in the ExecutableNodeStatus (persisted to the CRD) so that it is available for later phase processing
Provides this previous checkpoint path to the task on attempt 0; on later attempts, continues passing the path from attempt N-1 in the current node execution
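
The selection rule in the last step can be sketched as below. This is only an illustration in Python, not the propeller implementation (which is in Go); the function name, parameters, and example paths are hypothetical and chosen for readability.

```python
from typing import Optional, Sequence


def select_previous_checkpoint(
    attempt: int,
    recovered_checkpoint: Optional[str],
    attempt_checkpoints: Sequence[str],
) -> Optional[str]:
    """Pick the "previous checkpoint" path handed to a task attempt.

    Hypothetical helper:
      attempt              -- zero-based attempt number in the current node execution
      recovered_checkpoint -- path read from the recovered node execution, if any
      attempt_checkpoints  -- checkpoint paths written by earlier attempts, by index
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if attempt == 0:
        # First attempt: the only possible prior checkpoint comes from the
        # node execution being recovered (None when this is not a recovery).
        return recovered_checkpoint
    # Retry: keep passing the path from attempt N-1 of this node execution.
    return attempt_checkpoints[attempt - 1]
```

For example, `select_previous_checkpoint(0, "s3://example/prev", [])` returns the recovered path, while `select_previous_checkpoint(2, "s3://example/prev", ["a", "b"])` returns `"b"`.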

Testing

Added various unit tests

Verified with a local version of Flyte, following setup steps here.

Run a task indefinitely that checkpoints every 10s (see code below)
Use kubectl to kill the pod (simulate infra failure)
Recover the workflow
On recovery, the task will find a checkpoint and exit successfully

import logging
import time

from flytekit import current_context, task, workflow

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__file__)


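@task
def t():
    cp = current_context().checkpoint
    # read() returns the previous checkpoint's bytes, or None if there is none.
    prev = cp.read()
    if prev:
        # A checkpoint from the failed execution was found: log it and exit successfully.
        logger.info(f"Recovering from iteration {prev.decode()}")
        return
    # No previous checkpoint: run indefinitely, checkpointing every 10s.
    iterations = 0
    while True:
        time.sleep(10)
        iterations += 1
        logger.info(iterations)
        cp.write(f"{iterations}".encode())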


@workflow
def wf():
    t()
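
The kill-and-recover flow above can also be reasoned about without a cluster. The sketch below is a plain-Python simulation, assuming a hypothetical in-memory stand-in for the checkpoint object (only `read()` and `write()` are modelled) and an exception in place of the `kubectl` pod kill; it does not use flytekit.

```python
from typing import Optional


class InMemoryCheckpoint:
    """Hypothetical stand-in for the task checkpoint: read() returns the
    previous checkpoint's bytes (or None), write() stores a new checkpoint."""

    def __init__(self, previous: Optional[bytes] = None) -> None:
        self._previous = previous
        self.latest: Optional[bytes] = None

    def read(self) -> Optional[bytes]:
        return self._previous

    def write(self, data: bytes) -> None:
        self.latest = data


class PodKilled(Exception):
    """Simulates the infra failure caused by deleting the pod."""


def run_task(cp: InMemoryCheckpoint, kill_after: int) -> str:
    prev = cp.read()
    if prev:
        # Same behaviour as the test task: exit once a checkpoint is found.
        return f"recovered from iteration {prev.decode()}"
    iterations = 0
    while True:
        iterations += 1
        cp.write(f"{iterations}".encode())
        if iterations >= kill_after:
            raise PodKilled()


def simulate(kill_after: int = 3) -> str:
    # Original execution: no previous checkpoint, killed mid-run.
    first = InMemoryCheckpoint()
    try:
        run_task(first, kill_after)
    except PodKilled:
        pass
    # Recovered execution: attempt 0 receives the failed execution's
    # last checkpoint as its previous checkpoint.
    recovered = InMemoryCheckpoint(previous=first.latest)
    return run_task(recovered, kill_after)
```

Calling `simulate(3)` walks through both executions and returns the message produced by the recovered run.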

Tracking Issue

flyteorg/flyte#2254

Follow-up issue

Uncovered a few issues along the way
flyteorg/flyte#2894
flyteorg/flytesnacks#894 (PR)
flyteorg/flytekit#1189 (PR)


codecov · 2022-09-25T06:57:16Z

Codecov Report

Merging #486 (dd86a60) into master (560bb1b) will decrease coverage by 0.03%.
The diff coverage is 53.33%.


* Update flyteidl version
Signed-off-by: Flyte-Bot <[email protected]>
* Update flyteidl version
Signed-off-by: Flyte-Bot <[email protected]>
* Fix build break
Signed-off-by: Haytham Abuelfutuh <[email protected]>
* Update flyteidl version
Signed-off-by: Flyte-Bot <[email protected]>
* Save/restore CheckpointUri from NodeExecution
Signed-off-by: Andrew Dye <[email protected]>
* Lints, generate
Signed-off-by: Andrew Dye <[email protected]>
* Fix log line
Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Flyte-Bot <[email protected]>
Signed-off-by: Haytham Abuelfutuh <[email protected]>
Signed-off-by: Andrew Dye <[email protected]>
Co-authored-by: flyte-bot <[email protected]>
Co-authored-by: Haytham Abuelfutuh <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>

flyte-bot and others added 7 commits September 9, 2022 19:46

Update flyteidl version

93311f0

Signed-off-by: Flyte-Bot <[email protected]>

Update flyteidl version

fdb475a

Signed-off-by: Flyte-Bot <[email protected]>

Fix build break

271e1ad

Signed-off-by: Haytham Abuelfutuh <[email protected]>

Merge branch 'flyte-bot-update-flyteidl' of github.com:flyteorg/flyte…

1fb0b2a

…propeller into flyte-bot-update-flyteidl

Update flyteidl version

f61946f

Signed-off-by: Flyte-Bot <[email protected]>

Merge branch 'flyte-bot-update-flyteidl' of github.com:flyteorg/flyte…

06e52aa

…propeller into flyte-bot-update-flyteidl

Save/restore CheckpointUri from NodeExecution

d34402f

Signed-off-by: Andrew Dye <[email protected]>

Lints, generate

9c6ffdc

Signed-off-by: Andrew Dye <[email protected]>

andrewwdye mentioned this pull request Sep 25, 2022

Save CheckpointUri in NodeExecution.Closure flyteorg/flyteadmin#479

Merged

8 tasks

andrewwdye added 2 commits October 4, 2022 23:02

Merge branch 'master' of https://github.com/andrewwdye/flytepropeller …

7186efc

…into node-execution-checkpoints Signed-off-by: Andrew Dye <[email protected]>

Fix log line

4e71330

Signed-off-by: Andrew Dye <[email protected]>

andrewwdye marked this pull request as ready for review October 5, 2022 18:18

andrewwdye requested review from kumare3, EngHabu and hamersaw as code owners October 5, 2022 18:18

hamersaw approved these changes Oct 5, 2022

Merge branch 'master' into node-execution-checkpoints

dd86a60

hamersaw merged commit a9b831b into flyteorg:master Oct 6, 2022
