
[BUG] Incorrect output data when recovering partially completed dynamic workflows #3121

Closed
2 tasks done
hamersaw opened this issue Nov 30, 2022 · 2 comments · Fixed by flyteorg/flytepropeller#518
Labels: bug Something isn't working

Comments

@hamersaw
Contributor

Describe the bug

Dynamic workflows work by compiling a workflow closure that defines subnodes, which is then evaluated. During this process, Flyte assigns each subnode an incremental, unique node ID that is used to track status, etc. Because node IDs are assigned incrementally, in scenarios where the dynamic task has non-deterministic ordering, a node ID in one execution of the dynamic workflow may not represent the same subnode as in another execution.

This is problematic for partially completed dynamic workflows because the recovery mechanism relies on node IDs to reuse previously completed executions. Under certain circumstances, recovering a dynamic task will corrupt the result data.
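
As a rough illustration (a sketch only, not Flyte's compiler code): because IDs are handed out positionally, the same dict iterated in a different order produces a different node-ID-to-subnode assignment.

def assign_node_ids(keys):
    # Sketch only: subnode IDs are positional, mirroring how the compiled
    # closure numbers its subnodes (dn0, dn1, ...).
    return {f"dn{i}": key for i, key in enumerate(keys)}

print(assign_node_ids(["baz", "foo", "bar"]))  # {'dn0': 'baz', 'dn1': 'foo', 'dn2': 'bar'}
print(assign_node_ids(["foo", "bar", "baz"]))  # {'dn0': 'foo', 'dn1': 'bar', 'dn2': 'baz'}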

Expected behavior

Recovery should work in all scenarios for dynamic workflows.

One solution is to ensure that the recovered dynamic workflow reuses the previously computed workflow closure rather than recompiling it, so that node IDs mean the same thing in both the original execution and the recovered execution.
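
A minimal conceptual sketch of that approach, written in Python for illustration (flytepropeller itself is Go, and all names here are hypothetical): during recovery, prefer the closure stored by the original attempt and only compile a new one when none exists.

# Hypothetical names throughout; this is not flytepropeller code.
closure_store = {}  # stand-in for the durable store of compiled closures

def get_dynamic_closure(node_id, compile_fn, recovered_execution_id=None):
    # If we are recovering and the original attempt already compiled a closure
    # for this node, reuse it so subnode IDs keep their original meaning.
    key = (recovered_execution_id, node_id)
    if recovered_execution_id is not None and key in closure_store:
        return closure_store[key]
    return compile_fn()  # otherwise compile fresh, as happens today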

Additional context to reproduce

To reproduce, run the following workflow with a dict input:

import time

from flytekit import Resources, dynamic, task, workflow
from typing import Dict

@task(requests=Resources(cpu="500m", mem="400Mi"), limits=Resources(cpu="1", mem="600Mi"))
def echo(a: int) -> int:
    time.sleep(10)
    return a

@task(requests=Resources(cpu="500m", mem="400Mi"), limits=Resources(cpu="1", mem="600Mi"))
def sleep(a: int, b: Dict[str, int]) -> Dict[str, int]:
    time.sleep(a)
    return b

@dynamic(cache=True, cache_version="1.0", requests=Resources(cpu="500m", mem="400Mi"), limits=Resources(cpu="1", mem="600Mi"))
def dynamic_dict(a: Dict[str, int]) -> Dict[str, int]:
    b = {}
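    # Each key spawns an echo subnode; Flyte assigns the subnode IDs in this iteration order.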
    for x, y in a.items():
        b[x] = echo(a=y)

    return b

@workflow
def dynamic_dict_wf(a: Dict[str, int]) -> Dict[str, int]:
    b = dynamic_dict(a=a)
    c = sleep(a=10, b=b)
    return c

Then manually fail one of the dynamic subnodes by repeatedly deleting its pod in k8s before it completes, until Flyte exhausts its retry attempts and fails the workflow. Then use the 'Recover' button in the UI to start a workflow recovery. Inspect the output values of the dynamic task to see the incorrect data.

Screenshots

An example of the reproduction described above. The workflow was launched with the inputs:

{
   "foo": 1,
   "bar": 2,
   "baz": 4,
}

The n0-0-dn1 node was manually failed to fail the workflow; the result is below:
[screenshot: failed]
In this execution, nodes n0-0-dn0 and n0-0-dn2 correspond to the input values 4 and 2.
[screenshot: failed-n0]
[screenshot: failed-n2]
This means n0-0-dn1 should be responsible for 1, because the Python iteration over this Dict was ordered 4, 1, 2.

Then we recover the workflow, allow it to succeed, and see that the output values are:

{
   "bar": 2,
   "baz": 2,
   "foo": 4,
}

[screenshot: recovered]

This is because the dynamic task recompiled the workflow closure, which resulted in a different ordering of the Dict iteration. The results from n0-0-dn0 and n0-0-dn2 were reused, albeit applied to different keys (i.e. foo and baz rather than baz and bar), and n0-0-dn1 was executed.
[screenshot: recovered-n1]
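
To make the mixing concrete, here is a small standalone sketch (not Flyte code; the recovered iteration order is an assumption inferred from the outputs above) of how reusing outputs by node ID under a different iteration order reproduces the corrupted result:

inputs = {"foo": 1, "bar": 2, "baz": 4}

original_order = ["baz", "foo", "bar"]   # observed: dn0, dn1, dn2 handled 4, 1, 2
recovered_order = ["foo", "bar", "baz"]  # assumed order after recompilation

# Outputs recorded by the original attempt, keyed by node ID (dn1 never completed).
recorded = {"dn0": inputs[original_order[0]], "dn2": inputs[original_order[2]]}

result = {}
for i, key in enumerate(recovered_order):
    node_id = f"dn{i}"
    if node_id in recorded:
        result[key] = recorded[node_id]  # reused by node ID, but the key now differs
    else:
        result[key] = inputs[key]        # dn1 is actually re-executed, so it is correct

print(result)  # {'foo': 4, 'bar': 2, 'baz': 2} -- matches the corrupted output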

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@hamersaw hamersaw added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Nov 30, 2022
@eapolinario eapolinario added this to the 1.4.0 milestone Dec 2, 2022
@eapolinario eapolinario removed the untriaged This issues has not yet been looked at by the Maintainers label Dec 2, 2022
@cosmicBboy cosmicBboy removed this from the 1.4.0 milestone Jan 18, 2023
@hamersaw hamersaw self-assigned this Jan 31, 2023
@hamersaw hamersaw added this to the 1.4.0 milestone Jan 31, 2023
@cosmicBboy cosmicBboy modified the milestones: 1.4.0, 1.5.0 Mar 1, 2023
@hamersaw
Contributor Author

prematurely closed from PR merge

@hamersaw hamersaw reopened this Mar 10, 2023
@hamersaw
Contributor Author

completed now
