[BUG] Workflow Conditional Operator conflict with task generated promises #3512

zeryx · 2023-03-21T22:20:21Z

Describe the bug

When using the Conditional operator, passing a task-generated output into a task call inside a then() closure causes an error to be reported server side.

When this is run locally, this error is not reported and the output is successfully generated.

The error message when run on-cloud is the following:

failed at Node[n3-n0]. BindingResolutionError: Error binding Var [wf].[dataset], caused by: failed at Node[n0]. CausedByError: Failed to GetPrevious data from outputDir [s3://union-oc-production-demo/metadata/propeller/zeryx-demo-development-a9vk7kbf8ptkdcdwqrq6/n0/data/0/outputs.pb], caused by: path:s3://union-oc-production-demo/metadata/propeller/zeryx-demo-development-a9vk7kbf8ptkdcdwqrq6/n0/data/0/outputs.pb: not found

The node graph diagram displays the incorrect sequence of operations:

Expected behavior

Conditional Then closures within a Workflow should be able to accept tasks and task generated inputs as parameters for tasks.

Instead, only workflow-level inputs can be passed to a task within a Conditional then() closure without triggering an error.

Additional context to reproduce

1. Create a new pyflyte project.
2. Inspect the gist here: https://gist.github.com/zeryx/1a786f2041d840f08175ea19b741b68f
3. Copy broken_example into the workflows/example.py file.
4. Update the requirements.txt from the gist.
5. pip install the dependencies to run locally with pyflyte.
6. Run the following: pyflyte run workflows/example.py train_mnist_model --n_epoch 1
7. Then run the following, pointing to your remote cluster of choice: pyflyte --config ~/.uctl/config.yaml run -p your_project -d development --image zeryx1211/mnist_gpu:latest workflows/example.py train_mnist_model --n_epoch 1
8. View the failed workflow and execution.
9. Rerun the same with working_example from the gist, repeating steps 6-8.

Screenshots

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes


eapolinario · 2023-03-22T00:02:57Z

To help the investigation, a smaller repro:

from flytekit import task, workflow, conditional

@task
def say_hello(name: str) -> str:
    return f"hello {name}"

@task
def get_name1() -> str:
    return "world"

@task
def get_name2() -> str:
    return "flyte"

@workflow
def wf(enabled: bool) -> str:
    name1 = get_name1()
    name2 = get_name2()
    return conditional("c") \
    .if_(enabled.is_true()) \
    .then(say_hello(name=name1)) \
    .else_() \
    .then(say_hello(name=name2))

kumare3 · 2023-03-22T03:20:51Z

This is a runtime error or compiler? Cc @hamersaw

hamersaw · 2023-03-22T13:32:33Z

> This is a runtime error or compiler? Cc @hamersaw

Thanks @eapolinario for the smaller repro! Looks like a runtime error where propeller is looking for the output file of n0 in the wrong location. Will dive into this.

hamersaw · 2023-03-22T23:15:40Z

So what is happening here is that we override the upstream node dependencies when executing branch subnodes. This means that any other upstream dependencies are no longer maintained. In the minimal repro example:


from flytekit import task, workflow, conditional

@task
def say_hello(name: str) -> str:
    return f"hello {name}"

@task
def get_name1() -> str:
    return "world"

@task
def get_name2() -> str:
    return "flyte"

@workflow
def wf(enabled: bool) -> str:
    name1 = get_name1()
    name2 = get_name2()
    return conditional("c") \
    .if_(enabled.is_true()) \
    .then(say_hello(name=name1)) \
    .else_() \
    .then(say_hello(name=name2))

there will be three nodes: n0 (get_name1), n1 (get_name2), and n2 (the conditional). The branch node (n2) will start executing immediately on workflow startup, along with n0 and n1. Then, when the branch subnode n2-n0 (say_hello(name=name1)) executes, its only upstream dependency is n2. So the error message about the missing s3 file appears because node n0 has not yet completed and written its output.
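The race described above can be illustrated with a small toy model. This is a sketch only, not propeller code: the dictionaries and the `is_ready` helper are hypothetical names made up for illustration, and the node IDs simply follow the comment above.

```python
# Toy model of the dependency bookkeeping described above; it is NOT propeller code.
# Node IDs follow the comment: n0 = get_name1, n1 = get_name2, n2 = the conditional.

# Upstream edges the taken branch actually needs (hypothetical representation).
full_upstream = {
    "n0": set(),
    "n1": set(),
    "n2": set(),            # the branch node only reads the workflow input `enabled`
    "n2-n0": {"n2", "n0"},  # say_hello(name=name1) needs the output of n0
}

# What the reported behaviour amounts to: the subnode's upstream set is
# replaced by the branch node alone.
overridden_upstream = {**full_upstream, "n2-n0": {"n2"}}


def is_ready(node, upstream, completed):
    """A node may start once every one of its upstream nodes has completed."""
    return upstream[node] <= completed


# State shortly after startup: the branch node has been evaluated, n0 is still running.
completed = {"n2"}

print(is_ready("n2-n0", overridden_upstream, completed))  # True: starts before n0 has written outputs
print(is_ready("n2-n0", full_upstream, completed))        # False: waits for n0
```

With the overridden edges the subnode is considered runnable while n0 is still in flight, which matches the "not found" error on n0's outputs.pb.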

There are (at least) two ways we can solve this:

1. Make the branch node dependent on all nodes that the subnodes are dependent on. In the above example node n2 will be dependent on n0 and n1. This is a very easy add but there are issues with pyflyte run for existing workflows because admin throws WorkflowAlreadyExists since the version must be based on some kind of hash of the code and we can't register a workflow with a different structure and the same version. Additionally, the BranchNode execution will have to wait for all subNode dependencies, even if they're not necessary for the branch that is actually taken.
2. Maintain the DAG when executing branch node subnodes. This requires adding the DAG to the NodeExecutionContext so that upstream node IDs are available during branch node evaluation.

I have implemented the latter fix in flyteorg/flytepropeller#543 - if this isn't the route we want to go, submitting the former fix is trivial.
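To make the difference between the two options concrete, here is a minimal sketch under stated assumptions. The data model and both function names are hypothetical; they are not taken from propeller or from the linked PR, and only illustrate which nodes each approach would wait on.

```python
# Hypothetical data model for illustration only; propeller does not expose these names.
# Each branch subnode maps to the nodes whose outputs it consumes.
subnode_upstream = {
    "n2-n0": {"n0"},  # say_hello(name=name1)
    "n2-n1": {"n1"},  # say_hello(name=name2)
}


def option_1_branch_upstream(subnode_upstream):
    """Option 1: the branch node inherits every dependency of every subnode."""
    deps = set()
    for upstream in subnode_upstream.values():
        deps |= upstream
    return deps


def option_2_subnode_upstream(subnode, branch_node, subnode_upstream):
    """Option 2: a subnode keeps its own DAG edges in addition to the branch node."""
    return {branch_node} | subnode_upstream[subnode]


# Option 1: n2 itself waits on both n0 and n1, whichever branch is taken.
print(sorted(option_1_branch_upstream(subnode_upstream)))

# Option 2: only the taken subnode waits, and only on what it actually needs.
print(sorted(option_2_subnode_upstream("n2-n0", "n2", subnode_upstream)))
```

In this sketch, option 1 delays the whole conditional until n1 finishes even when `enabled` is true, whereas option 2 blocks n2-n0 on n0 alone.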

zeryx added the bug and untriaged labels Mar 21, 2023

hamersaw removed the untriaged label Mar 22, 2023

hamersaw self-assigned this Mar 22, 2023

hamersaw added this to the 1.6.0 milestone Mar 22, 2023

hamersaw mentioned this issue Mar 22, 2023

Including all upstream node deps on BranchNode subnode execution flyteorg/flytepropeller#543

Merged

8 tasks

hamersaw closed this as completed in flyteorg/flytepropeller#543 Mar 24, 2023
