This repository has been archived by the owner on Oct 9, 2023. It is now read-only.
Allow checkpoint resume when recovering a workflow #486
Merged
hamersaw merged 11 commits into flyteorg:master from andrewwdye:node-execution-checkpoints on Oct 6, 2022
Signed-off-by: Flyte-Bot <[email protected]>
Signed-off-by: Flyte-Bot <[email protected]>
Signed-off-by: Haytham Abuelfutuh <[email protected]>
…propeller into flyte-bot-update-flyteidl
Signed-off-by: Flyte-Bot <[email protected]>
…propeller into flyte-bot-update-flyteidl
Signed-off-by: Andrew Dye <[email protected]>
Signed-off-by: Andrew Dye <[email protected]>
…into node-execution-checkpoints Signed-off-by: Andrew Dye <[email protected]>
Signed-off-by: Andrew Dye <[email protected]>
hamersaw approved these changes on Oct 5, 2022
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request on Aug 9, 2023
* Update flyteidl version
  Signed-off-by: Flyte-Bot <[email protected]>
* Update flyteidl version
  Signed-off-by: Flyte-Bot <[email protected]>
* Fix build break
  Signed-off-by: Haytham Abuelfutuh <[email protected]>
* Update flyteidl version
  Signed-off-by: Flyte-Bot <[email protected]>
* Save/restore CheckpointUri from NodeExecution
  Signed-off-by: Andrew Dye <[email protected]>
* Lints, generate
  Signed-off-by: Andrew Dye <[email protected]>
* Fix log line
  Signed-off-by: Andrew Dye <[email protected]>

Signed-off-by: Flyte-Bot <[email protected]>
Signed-off-by: Haytham Abuelfutuh <[email protected]>
Signed-off-by: Andrew Dye <[email protected]>
Co-authored-by: flyte-bot <[email protected]>
Co-authored-by: Haytham Abuelfutuh <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
TL;DR
This change propagates task-level checkpoint info between failed and recovered node executions. Previously, intra-task checkpointing was only supported for task-level retries within a single node execution.
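As a rough illustration of the flow described above, here is a minimal Go sketch. Every type and function in it (`taskNodeMetadata`, `nodeExecutionEvent`, `nodeStatus`, `eventStore`, `recordEvent`, `recoverNodeStatus`) is a hypothetical, simplified stand-in, not the actual flyteidl or flytepropeller definition, and the checkpoint URI is a made-up value. It only shows the idea: the failed node execution records its checkpoint location, and the recovered node execution picks it up again.

```go
package main

import "fmt"

// NOTE: all types below are simplified, hypothetical stand-ins used for
// illustration only; they are not the real flyteidl/flytepropeller types.

// taskNodeMetadata mimics task metadata attached to a node execution event.
type taskNodeMetadata struct {
	CheckpointURI string
}

// nodeExecutionEvent mimics the event emitted when a node changes phase.
type nodeExecutionEvent struct {
	NodeID   string
	Metadata taskNodeMetadata
}

// nodeStatus mimics the per-node status persisted with the workflow.
type nodeStatus struct {
	CheckpointURI string
}

// eventStore stands in for the database of recorded node executions.
type eventStore map[string]nodeExecutionEvent

// recordEvent saves the event, including its checkpoint URI, keyed by node ID.
func (s eventStore) recordEvent(ev nodeExecutionEvent) {
	s[ev.NodeID] = ev
}

// recoverNodeStatus builds the status for a recovered node. If the earlier
// execution recorded a checkpoint URI, it is carried over so that the task
// can resume from it; otherwise the status starts empty.
func (s eventStore) recoverNodeStatus(nodeID string) nodeStatus {
	prev, ok := s[nodeID]
	if !ok {
		return nodeStatus{}
	}
	return nodeStatus{CheckpointURI: prev.Metadata.CheckpointURI}
}

func main() {
	store := eventStore{}

	// The original (later failed) node execution reports its checkpoint location.
	store.recordEvent(nodeExecutionEvent{
		NodeID:   "n0",
		Metadata: taskNodeMetadata{CheckpointURI: "s3://bucket/checkpoints/n0"},
	})

	// The recovered execution restores the checkpoint location for the same node.
	fmt.Println(store.recoverNodeStatus("n0").CheckpointURI)

	// A node with no recorded execution starts without a checkpoint.
	fmt.Println(store.recoverNodeStatus("n1").CheckpointURI == "")
}
```

In this sketch a node with no recorded event simply gets an empty status, so recovery of a task that never checkpointed behaves like a fresh start.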
Related PR: flyteorg/flyteadmin#479
NOTE: cannot merge this until #467 is resolved
Type
Are all requirements met?
Complete description
This change
* Saves the CheckpointUri in the TaskNodeMetadata of the NodeExecutionEvent, to be stored in the db NodeExecution.Closure
* Restores the CheckpointUri on recovery and stores it in the ExecutableNodeStatus (persisted to the CRD) so that it's available for later phase processing
Testing
Added various unit tests
Verified with a local version of Flyte, following setup steps here.
Tracking Issue
flyteorg/flyte#2254
Follow-up issue
Uncovered a few issues along the way
flyteorg/flyte#2894
flyteorg/flytesnacks#894 (PR)
flyteorg/flytekit#1189 (PR)