[BUG] Recovery fails to retry tasks that were SKIPPED in previous execution due to an upstream node failure in Flyte >=1.4.x #3578

Closed
2 tasks done
jeevb opened this issue Apr 7, 2023 · 1 comment · Fixed by flyteorg/flytepropeller#551
Labels
bug (Something isn't working), untriaged (This issue has not yet been looked at by the maintainers)

jeevb commented Apr 7, 2023

Describe the bug

When recovering an execution with tasks that were SKIPPED due to an upstream node failure, these tasks are not retried, and the execution is incorrectly marked as SUCCEEDED anyway.

Expected behavior

SKIPPED tasks should also be retried on recovery if the upstream node succeeds.

Additional context to reproduce

import random

from flytekit import task, workflow
from flytekit.core.workflow import WorkflowFailurePolicy


@task
def pass_through(input1: int) -> int:
    return input1


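@task
def fail(input1: int) -> int:
    # Fail non-deterministically so that repeated runs (and recoveries)
    # can land on either the failure path or the success path.
    if random.randint(0, 10) < 7:
        assert False
    return input1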


@workflow(failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE)
def wf(wf_input: int) -> tuple[int, int]:
    a = fail(input1=wf_input)
    b = pass_through(input1=wf_input)
    c = pass_through(input1=a)
    return b, c
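
As a rough guide to how many attempts the reproduction needs: `random.randint(0, 10)` is inclusive on both ends, so 7 of the 11 possible values satisfy `< 7`. The sketch below only does that arithmetic; the variable names are illustrative and it assumes each attempt is independent.

```python
from fractions import Fraction

# randint(0, 10) can return any of the 11 integers 0..10.
outcomes = range(0, 11)
failing = [n for n in outcomes if n < 7]

# Probability that a single run of the `fail` task raises.
p_fail = Fraction(len(failing), len(outcomes))
p_success = 1 - p_fail

# Expected number of independent attempts until the first failure
# (initial runs) and until the first success (recovery attempts).
expected_runs_until_failure = 1 / p_fail
expected_recoveries_until_success = 1 / p_success

print(p_fail, float(expected_runs_until_failure), float(expected_recoveries_until_success))
```

So the first node fails on most runs, and a few recovery attempts are typically needed before it succeeds.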

Execute against a flyte-sandbox like so:

pyflyte run --remote --image cr.flyte.org/flyteorg/flytekit:py3.10-latest test/recovery.py wf --wf_input 3

Because the fail task fails non-deterministically, re-run the workflow until the first node (a) fails. The last node (c), which depends on it, should then be marked as SKIPPED. Then recover that execution, repeating the recovery until the first node succeeds, and observe the behavior of the last node.
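
The expected outcome of these steps can be written down as a small, self-contained model. This is only an illustration of the behavior requested in this report, not flytepropeller's implementation: `Phase`, `UPSTREAM`, `recover`, and `workflow_succeeded` are hypothetical names, and the node names `a`, `b`, `c` follow the variables in the workflow above.

```python
from enum import Enum


class Phase(Enum):
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"


# Upstream dependencies of the workflow in this report, listed in
# topological order: c consumes the output of a; b is independent.
UPSTREAM = {"a": [], "b": [], "c": ["a"]}


def recover(previous, rerun_ok):
    """Model of the expected recovery semantics (an assumption, not Flyte code).

    previous: node -> Phase from the execution being recovered.
    rerun_ok: node -> bool, whether the node succeeds if it is run again.
    """
    final = {}
    for node, parents in UPSTREAM.items():
        if previous[node] is Phase.SUCCEEDED:
            # Reuse the result of a node that already succeeded.
            final[node] = Phase.SUCCEEDED
        elif any(final[p] is not Phase.SUCCEEDED for p in parents):
            # An upstream node is still failing, so this node cannot run.
            final[node] = Phase.SKIPPED
        elif rerun_ok[node]:
            # Previously FAILED or SKIPPED nodes are both executed again.
            final[node] = Phase.SUCCEEDED
        else:
            final[node] = Phase.FAILED
    return final


def workflow_succeeded(final):
    # The execution should only succeed if every node succeeded.
    return all(phase is Phase.SUCCEEDED for phase in final.values())


# State after the initial failure described above.
previous = {"a": Phase.FAILED, "b": Phase.SUCCEEDED, "c": Phase.SKIPPED}

recovered = recover(previous, rerun_ok={"a": True, "c": True})
still_failing = recover(previous, rerun_ok={"a": False, "c": True})

print(recovered["c"].value, workflow_succeeded(recovered))
print(still_failing["c"].value, workflow_succeeded(still_failing))
```

In this model a node left SKIPPED can never coexist with a SUCCEEDED execution, which is the mismatch the report describes for Flyte >=1.4.x.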

Screenshots

Initial failure:
Screenshot 2023-04-07 at 11 57 41 AM

After "successful" recovery:
Screenshot 2023-04-07 at 11 58 02 AM

How it SHOULD work (behavior in Flyte v1.3.0):
Screenshot 2023-04-07 at 12 03 29 PM

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

welcome bot commented Apr 7, 2023

Thank you for opening your first issue here! 🛠
