[RFC] Config Override #3553

Merged
merged 3 commits into from
Jul 4, 2023

Conversation

ByronHsu
Contributor

Tracking issue

#475

Describe your changes

Add RFC

Note to reviewers

Need discussion on the UI part!

@bstadlbauer
Member

I've just read the linked issue (#475) and it seems like there is a lot of good conversation going on there.
From reading it, the trickiest part seems to be doing "per-node" configuration changes, mostly because node names are not trivially knowable before starting a workflow and especially for deeply nested graphs, those are hard to figure out.

Should we finish the discussion on how to best achieve that in #475 and once things are ready copy the result over to this RFC?

@bstadlbauer
Member

I've added my thoughts in #475 (comment)

@davidmirror-ops davidmirror-ops added the rfc A label for RFC issues label Mar 30, 2023
@davidmirror-ops
Contributor

03-30-2023 Meeting notes. Byron Hsu: not sure how to implement this in the UI
FG: It would be a very complicated string. We can mark the tasks as overrideable
BH: what if your task resides at a deeper level
Eduardo: how does this work with Dynamic?
FG: in the dynamic subworkflow, when I call a task it could be .task.override. Whenever a user registers something dynamically with runtime.override: foo, it could select runtime-level things like GPU
EA: basically naming overrides
FG: assign the name of the override at registration time
KU: create "override hooks"
FG: for dynamic it would be impossible
Greg Gydush: regarding the UX, (see #475 (comment)) I'd love to pass promises to the override config. That is the UX that I'd want
KU: we should not confuse them with inputs, it's config. Executions should take a config
HSU: we can parameterize it and pass it to override config
KU: with async tasks you should be able to do something like this
Bernhard: for nested workflows it could be complicated. If you run the same task twice you couldn't run it with different resources
KU: we could call it dynamic_config. In the dynamic case
Dan Rammer: scalability concerns. Checking for naming conflicts would be necessary but complicated
KU: what about making this an experimental feature

@fg91
Member

fg91 commented Mar 30, 2023

I agree with @bstadlbauer that when workflows become very nested/deep, this syntax will be very difficult to write for the user:

wf_overrides = WFOverride(
    n0=TaskOverride(
        container_image="repo/image:0.0.1",
        limits=Resource(cpu="2"),
    ),
    n1=WFOverride(  # subwf
        n0=TaskOverride(
            container_image="repo/image:0.0.3",
            limits=Resource(cpu="3"),
        )
    )
)

In addition, it will also be very complicated to come up with a way to show these overrides in the UI.


This is why I proposed in the contributors meeting to create "named overrides" in order to avoid replicating the DAG structure again in the overrides.

This could look e.g. like this:

@task 
def train(dataset_uri: str) -> str:
    pass

@task 
def evaluate(model_uri: str):
   ...

@workflow
def wf(...):
    model1 = train(...).with_runtime_overrides("model_1_resources")
    model2 = train(...).with_runtime_overrides("model_2_resources")
    
    evaluate(model_uri = model1)
    evaluate(model_uri = model2)

This effectively means that at registration time the author of the workflow can mark the configs of specific nodes overridable when starting an execution. By marking them this way in the workflow and not the respective task decorator, the same task, in this case train, can be executed with different sets of resources configured e.g. in FlyteConsole.

By naming the overrides, we also circumvent the nesting problem which would make it very simple to configure overrides in FlyteConsole:

[Image: Screenshot 2023-03-30 at 21 57 07]

In addition, the FlyteConsole would not be overloaded when pre-filling the overridable configs since this is only possible for selected tasks, not the entire (potentially giant) workflow.
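
As a rough illustration of the pre-filling point, here is a minimal sketch in plain Python (no flytekit) of how a console could list only the nodes that were marked overridable. `Node`, its `with_runtime_overrides` method, and `overridable_names` are hypothetical stand-ins for the proposal, not existing flytekit or FlyteConsole APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a workflow node; not a flytekit class."""

    task_name: str
    override_name: Optional[str] = None

    def with_runtime_overrides(self, name: str) -> "Node":
        # Mark this node as overridable under the given name.
        self.override_name = name
        return self


def overridable_names(nodes: List[Node]) -> List[str]:
    """Return the sorted, de-duplicated names a UI would need to pre-fill."""
    return sorted({n.override_name for n in nodes if n.override_name is not None})


# Mirrors the wf example: two train nodes are marked, the evaluate nodes are not.
nodes = [
    Node("train").with_runtime_overrides("model_1_resources"),
    Node("train").with_runtime_overrides("model_2_resources"),
    Node("evaluate"),
    Node("evaluate"),
]
names = overridable_names(nodes)
```

Only the two named entries would show up in the launch form, regardless of how many unmarked nodes the workflow contains.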


Considerations:

@task 
def train(dataset_uri: str) -> str:
    pass

@dynamic 
def subwf(model_uri: str):
    model = train(...).with_runtime_overrides("model_resources")
   ...

@workflow
def wf(...):
    subwf()

Let us consider a workflow where the runtime override is configured in a dynamic subworkflow. I don't know whether it would be possible when creating an execution of wf to already know which tasks the dynamic subwf will launch and whether "they are marked overridable".
If not, we would not be able to pre-fill the config in the FlyteConsole or the overrides section of the execution spec yaml.

However, I feel it would be absolutely acceptable if the user specifies such a named override manually in the FlyteConsole when launching the execution and then when the dynamic subworkflow registers a task with a named override that matches a name the user specified when creating the execution, the override is applied.
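
The late-binding behaviour described here could be sketched as follows, assuming the overrides typed in at launch are available as a plain mapping when the dynamic workflow registers its tasks. The function name `resolve_at_registration` and the dict shapes are assumptions for illustration only.

```python
from typing import Dict, Optional

# Overrides entered manually by the user when launching the execution
# (hypothetical shape: override name -> config fields).
execution_overrides: Dict[str, Dict[str, str]] = {"model_resources": {"cpu": "4"}}


def resolve_at_registration(
    override_name: Optional[str],
    defaults: Dict[str, str],
    overrides: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Return the config for a task registered by a dynamic workflow.

    The override is applied only if the task was marked with a name and the
    user supplied an entry with the same name; otherwise defaults are kept.
    """
    resolved = dict(defaults)
    if override_name is not None and override_name in overrides:
        resolved.update(overrides[override_name])
    return resolved


matched = resolve_at_registration(
    "model_resources", {"cpu": "1", "mem": "1Gi"}, execution_overrides
)
unmatched = resolve_at_registration(
    "other_resources", {"cpu": "1", "mem": "1Gi"}, execution_overrides
)
```

Nothing has to be known about the dynamic subworkflow up front: an entry that never matches a registered name is simply unused.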

@hamersaw raised another concern:

Let's say team 1 maintains this workflow:

@task 
def train_task_team1():
    ...

@workflow
def wf_team1():
    train_task_team1().with_runtime_overrides("train_resources")

And let's say team 2 maintains this workflow:

from team_1 import wf_team1

@task
def train_task_team2():
    ...

@workflow
def wf_team2():
    wf_team1()

    train_task_team2().with_runtime_overrides("train_resources")

An engineer in team 2 might start an execution in the FlyteConsole and override the resources for their train task but potentially unknowingly also for the task of team 1.

I would definitely say that the named overrides should only be valid within an execution.

I personally am not too worried about this scenario but this might be a bigger issue in large organisations.

In theory (without knowing at this point whether this is feasible) I could imagine something like this:

@workflow
def wf_team2():
    wf_team1().without_runtime_overrides()

    train_task_team2().with_runtime_overrides("train_resources")

However, this might be going too far.
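
Short of opting out, the collision itself could at least be detected when the workflow is registered. The sketch below assumes the registration step can produce a flat list of (declaring workflow, override name) pairs; that list and `find_conflicts` are hypothetical, not part of flytekit or flyteadmin.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical flattened view of the team 1 / team 2 example:
# (workflow that declares the name, override name).
declared: List[Tuple[str, str]] = [
    ("wf_team1", "train_resources"),
    ("wf_team2", "train_resources"),
    ("wf_team2", "eval_resources"),
]


def find_conflicts(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each override name declared by more than one workflow to its declarers."""
    owners: Dict[str, Set[str]] = defaultdict(set)
    for workflow_name, override_name in pairs:
        owners[override_name].add(workflow_name)
    return {name: sorted(wfs) for name, wfs in owners.items() if len(wfs) > 1}


conflicts = find_conflicts(declared)
```

A registration step could warn on a non-empty result instead of failing, which keeps intentional sharing of a name possible.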

@yubofredwang
Contributor

Hi @fg91, your approach looks pretty neat. Can you add more details on how model_1_resources is defined? Is that a different entity type other than task/workflow?

@bstadlbauer
Member

bstadlbauer commented Mar 31, 2023

@fg91 I like this, the UX is great!

Trying to take a step back here, it seems like this is pretty much a special case of "tagging" + matching on a tag (option 2 from #475 (comment)), right?

So in theory, if .with_runtime_overrides("train_resources") would add some sort of tag to the task, say {"runtime_override": "train_resources"} this could be one of many things to match on.

I.e. on the backend, we could represent an override as something like (pseudocode; names all TBD):

class NodeMatchCriteria:
    task_name: str
    node_name: Optional[str] = None
    tags: Dict[str, str] = {}
    function_arguments: Dict[str, Any] = {}
    ...

class ConfigOverride:
    match: NodeMatchCriteria
    resources: Resources
    ...

I think if we keep the matching a bit more general, we could set ourselves up for future use cases without too much additional implementation effort. One thing this would also allow (on the backend) is re-running a workflow with an override based on a node name (as the node name is known from the UI in the second run).
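
A runnable reading of the pseudocode above, to show how tag-based and node-name-based matching could share one code path. This is only a sketch: `NodeInfo` is a hypothetical description of a node, and unlike the pseudocode every criterion here is optional so that an unset field matches anything.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NodeInfo:
    """Hypothetical description of a node as the backend might see it."""

    task_name: str
    node_name: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class NodeMatchCriteria:
    # Assumption: every criterion is optional; None or empty matches anything.
    task_name: Optional[str] = None
    node_name: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)

    def matches(self, node: NodeInfo) -> bool:
        if self.task_name is not None and self.task_name != node.task_name:
            return False
        if self.node_name is not None and self.node_name != node.node_name:
            return False
        # Every requested tag must be present on the node with the same value.
        return all(node.tags.get(key) == value for key, value in self.tags.items())


node = NodeInfo("train", "n0", {"runtime_override": "train_resources"})
by_tag = NodeMatchCriteria(tags={"runtime_override": "train_resources"})
by_node_name = NodeMatchCriteria(node_name="n1")
```

With this shape, `.with_runtime_overrides("train_resources")` is just one way of producing a tag to match on.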

What do you think?

@ByronHsu
Contributor Author

ByronHsu commented Apr 17, 2023

My concern with the "named override" method is that users can only override the configs in the UI/CLI, but cannot override them inside the code. There are two kinds of override in the code.

  1. constant value (already supported)
@task
def t1():
  ...
  
@workflow
def wf():
  t1.with_override(cpu=1)
  2. promise value
@task
def t1():
  ...
  
@workflow
def wf(x):
  t1.with_override(cpu=x)

The second case would be extremely useful when reference workflows are shared across the team, but each team wants to tailor the config to their need.

@ref_launch_plan
def ref_wf(cpu:int):
  ...
# team1
@workflow
def wf():
   ref_wf(cpu=3)

# team2
@workflow
def wf():
   ref_wf(cpu=2)

@ByronHsu
Contributor Author

ByronHsu commented Apr 17, 2023

To resolve the second case above, one possible way is to define a special input parameter (task_config_override) with a special dataclass (TaskConfigOverride) that is parsed at runtime and overrides the taskTemplate.

@task(requests=Resources(cpu="3"), limits=Resources(cpu="4"))
def python_task(task_config_override: TaskConfigOverride):
    print("hello python task")
 
cfg = PythonTaskConfigOverride(
        requests=ResourcesOverride(cpu=requests_cpu),
        limits=ResourcesOverride(cpu=limits_cpu)
 )
 
@workflow
def python_wf():
    python_task(task_config_override=cfg)

@fg91
Member

fg91 commented Apr 23, 2023

I'm personally not a fan of passing task resources (or their overrides) as workflow/task arguments since this mixes task logic configuration with task resource config:

@task
def t1():
 ...
 
@workflow
def wf(x):
 t1.with_override(cpu=x)

When being able to modify the code and not having to rely on overrides at runtime, what is the reason for not using the existing .with_overrides()?

@ByronHsu
Contributor Author

https://docs.google.com/document/d/1gaWU3lsa66APG2aD95_BeL9hFEsEsFiNGGj7KcQmt8c/edit?usp=sharing

I listed our use cases to take a step back and think from the user's side.

@ByronHsu
Contributor Author

ByronHsu commented May 4, 2023

To resolve the concern that, with @fg91's idea, users cannot override configs inside the code, @pingsutw and I came up with a solution.
If they don't want to specify overrides in the UI, they can pass the override config through the workflow definition.

@workflow(
  override={
    "model_1_resources": <override obj 1>,
    "model_2_resources": <override obj 2>
  }
)
def my_wf():
  ...

There are two ways they can override a value:

  1. specify it in @workflow with an override obj, which becomes the default override value in the UI
  2. put the override obj directly in the UI

The idea basically combines @fg91 and @kumare3's ideas.

@fg91
Member

fg91 commented May 4, 2023

That looks like a very nice UX to me @ByronHsu 👏

I think override could also become an arg to LaunchPlan.get_or_create(...).

@ByronHsu
Contributor Author

ByronHsu commented May 5, 2023

yeah that would be nice @fg91

@kumare3
Contributor

kumare3 commented May 8, 2023

@ByronHsu can we update the rfc itself with the proposal?

@ByronHsu
Contributor Author

ByronHsu commented May 8, 2023

@goyalankit What do you think ^?

Member

@bstadlbauer bstadlbauer left a comment

Thanks @ByronHsu for being persistent here even though this took longer than anticipated! Let's make sure to get this across the finish line and start implementing soon 👍

I think the UX is great now! The part that is still unclear to me is how the TaskNodeConfigOverride is tied to a particular task on the backend.

Two options would come to mind:

  1. .with_runtime_override("task-yee") would fill a new field (something like string runtime_override_name) in the TaskNode proto. Then the WorkflowMetadata, the LaunchPlanSpec as well as the ExecutionSpec would have a new field map<string, TaskNodeConfigOverride> runtime_overrides.

  2. We add a new field map<string, string> tags to TaskNode. Then .with_runtime_override("task-yee") would set a "runtime_override_name": "task-yee" tag on the node. The TaskNodeConfigOverride would then need to know what it applies to (similar to NodeMatchCriteria in this comment).
    While the UX would be the same, this would also allow for other matches (e.g. based on node name or like here) in the future, without us needing to change the flyteidl API.

Also, should we add a sentence about precedence of these? E.g. something like UI (=Execution) > LaunchPlan > Workflow?
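
To make the precedence question concrete, here is a minimal sketch of Execution > LaunchPlan > Workflow resolution over plain dicts. It assumes overrides are merged field by field per name; replacing the whole override object at each level would be an equally plausible rule, and `resolve_overrides` is a hypothetical helper, not an existing API.

```python
from typing import Dict, Optional

# Hypothetical shape: override name -> config fields.
Overrides = Dict[str, Dict[str, str]]


def resolve_overrides(
    workflow: Optional[Overrides],
    launch_plan: Optional[Overrides],
    execution: Optional[Overrides],
) -> Overrides:
    """Merge the three levels; later (higher-precedence) levels win per field."""
    resolved: Overrides = {}
    for level in (workflow, launch_plan, execution):  # lowest to highest
        for name, config in (level or {}).items():
            resolved.setdefault(name, {}).update(config)
    return resolved


resolved = resolve_overrides(
    workflow={"task-yee": {"cpu": "1", "mem": "1Gi"}},
    launch_plan={"task-yee": {"cpu": "2"}},
    execution={"task-yee": {"cpu": "3"}},
)
```

Here the execution-level `cpu` wins while `mem` falls through from the workflow default.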

fg91
fg91 previously approved these changes May 11, 2023
Member

@fg91 fg91 left a comment

I very much like the proposed UX now 👏

with_runtime_overrides was the first name that came to my mind when I first had this idea. @kumare3 called it with_override_hook(?) in the contributors meeting. Which name should we write into the RFC in the end?

@davidmirror-ops
Contributor

2023-05-11: Agree on naming before merge, also address implementation concerns. Byron might need help to distribute implementation tasks. The suggestion is to work on backend first (propeller and flytekit first). TSC happy to help kickoff implementation.

@pingsutw
Member

Here are all the different ways to override the config.

  • wf2.py
from flytekit import task, workflow, Resources
@task
def t1():
    print("hello world")

@workflow
def sub_wf():
    t1()
  • wf1.py
from flytekit import task, workflow, Resources
from wf2 import sub_wf

@task
def t0():
    print("hello world")

@workflow()
def wf():
    t0()
    sub_wf()
  1. To override t0 at runtime
# wf1.py
@workflow()
def wf():
    t0().with_runtime_override("model_1_resources", runtime_override_config(cpu=1, mem="1Gi"))
    ...

cpu=1, mem="1Gi" is the default execution config; people can still update the cpu and mem in FlyteConsole.

  2. To override t1 at runtime, update wf2.py
# wf2.py
@workflow
def sub_wf():
    t1().with_runtime_override("model_2_resources", runtime_override_config(cpu=1, mem="1Gi"))
  3. To override the default execution config of t1 from wf1.py, add an override config to the @workflow decorator. The t1 task will then use 2 CPUs by default.
# wf1.py
@workflow(override={"model_2_resources": runtime_override_config(cpu=2, mem="2Gi")})
def wf():
    ...  

FlyteIDL:

  1. Add more config to TaskOverrides
  2. Add string runtime_override_name to TaskNode
  3. Add runtimeOverrideConfigs (map<string, TaskOverrides>) to ExecutionSpec
  4. Add runtimeOverrideConfigs (map<string, TaskOverrides>) to LaunchPlanSpec

FlyteAdmin:

  1. Iterate over runtimeOverrideConfigs. If the runtime_override_name in the TaskNode matches one of the entries in runtimeOverrideConfigs, update the task config (cpu, role, etc). https://github.com/flyteorg/flyteadmin/blob/610451d21ca2fa8f598ebc26b8ebb96e423e270a/pkg/manager/impl/execution_manager.go#L780-L783
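
The matching step can be sketched as below. FlyteAdmin itself is written in Go; this is Python only to stay consistent with the other snippets in this thread, and the `Node` class with nested `children` is a hypothetical model of task nodes and sub-workflow nodes, not the flyteidl types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical node: a task node, or a sub-workflow node with children."""

    node_id: str
    runtime_override_name: Optional[str] = None
    config: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def apply_runtime_overrides(
    nodes: List[Node], runtime_override_configs: Dict[str, Dict[str, str]]
) -> List[str]:
    """Update matching nodes in place and return the ids of the nodes touched."""
    touched: List[str] = []
    for node in nodes:
        name = node.runtime_override_name
        if name is not None and name in runtime_override_configs:
            node.config.update(runtime_override_configs[name])
            touched.append(node.node_id)
        # Recurse so that names declared inside nested sub-workflows are found too.
        touched.extend(apply_runtime_overrides(node.children, runtime_override_configs))
    return touched


# Mirrors wf1.py / wf2.py: t0 in wf, t1 inside sub_wf.
t0 = Node("n0", "model_1_resources", {"cpu": "1", "mem": "1Gi"})
t1 = Node("n1-n0", "model_2_resources", {"cpu": "1", "mem": "1Gi"})
sub_wf = Node("n1", children=[t1])
touched = apply_runtime_overrides(
    [t0, sub_wf], {"model_2_resources": {"cpu": "2", "mem": "2Gi"}}
)
```

Because lookup is by name rather than by position in the graph, the nesting depth of `t1` does not affect how the override is written.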

Flytekit:

  1. Add runtimeOverrideConfigs to LaunchPlanSpec
  2. Add runtimeOverrideConfigs to ExecutionSpec
  3. Update Flytekit remote and CLI (pyflyte run --remote --override "{'model_1_resources': {'cpu': 3}}" wf1.py wf).
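
The `--override` value in the example uses single quotes, so it is a Python literal rather than valid JSON. One way a CLI could parse it safely is `ast.literal_eval`, sketched below. The `--override` flag is only proposed in this thread and `parse_override_arg` is a hypothetical helper, not part of pyflyte.

```python
import ast
from typing import Any, Dict


def parse_override_arg(raw: str) -> Dict[str, Dict[str, Any]]:
    """Parse a string such as "{'model_1_resources': {'cpu': 3}}".

    ast.literal_eval only accepts Python literals, so arbitrary expressions
    in the argument are rejected instead of being executed.
    """
    value = ast.literal_eval(raw)
    if not isinstance(value, dict):
        raise ValueError("override must be a dict of name -> config")
    for name, config in value.items():
        if not isinstance(name, str) or not isinstance(config, dict):
            raise ValueError(f"invalid override entry: {name!r}")
    return value


parsed = parse_override_arg("{'model_1_resources': {'cpu': 3}}")
```

Accepting JSON instead would avoid shell-quoting surprises, at the cost of requiring double quotes inside the argument.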

@ByronHsu is going to update IDL. cc @fg91 @bstadlbauer would you like to help work on flytekit and admin?

@ByronHsu
Contributor Author

@fg91
Member

fg91 commented May 19, 2023

@pingsutw @ByronHsu what do you think of this:

# wf1.py
@workflow
def wf():
    t0()
    sub_wf().override={"model_2_resources": runtime_override_config(cpu=2, mem="2Gi")}

instead of

# wf1.py
@workflow(override={"model_2_resources": runtime_override_config(cpu=2, mem="2Gi")})
def wf():

?

@pingsutw do we also want to override on launchplan.get_or_create?

Yes, I think this would be very useful 👍

@pingsutw
Member

@fg91 I like that, too. It will be much easier to implement.

do we also want to override on launchplan.get_or_create?

yup, we should support it as well

@ByronHsu
Contributor Author

ByronHsu commented May 19, 2023

What if users want to override a task under the subworkflow's subworkflow, and want to do that inside the wf function?
I think it's more generic if users can override every overridable task from the top-level workflow.

# wf1.py
@workflow
def sub_wf():
   sub_sub_wf()
   ...

@workflow
def wf():
    t0()
    sub_wf()

@pingsutw
Member

@ByronHsu You still do it at the top-level workflow.

old one:

@workflow(override={"model_2_resources": runtime_override_config(cpu=2, mem="2Gi")})
def wf():
    t0()
    sub_wf()

Fabio suggested:

@workflow
def wf():
    t0()
    sub_wf().override={"model_2_resources": runtime_override_config(cpu=2, mem="2Gi")}

@ByronHsu
Contributor Author

ByronHsu commented May 19, 2023

Got it. So you mean that if model_2_resources belongs to a task in sub_wf's subworkflow, we can still override it.
Yeah, I agree @fg91's idea is cleaner because we don't have to clutter the @workflow arguments.
So for a task, they do

t1().with_runtime_override("model_2_resources", runtime_override_config(cpu=1, mem="1Gi"))

My idea for subworkflow is that they do

sub_wf().with_runtime_override("model_2_resources", runtime_override_config(cpu=2, mem="2Gi"), "model_1_resources", runtime_override_config(cpu=2, mem="2Gi")) 

The args are .(name1, override1, name2, override2). Or would passing a dict be better?
Also, do you think we can use the same method name .with_runtime_override for both task and subworkflow?

@pingsutw
Member

with_runtime_override can return a promise, so it could be

sub_wf().with_runtime_override(...).with_runtime_override(...).with_runtime_override(...) 

@ByronHsu
Contributor Author

ByronHsu commented May 20, 2023

We can provide 'with_runtime_override' to override a single task in a sub workflow and 'with_runtime_overrides' for multiple tasks in a sub workflow.

@bstadlbauer
Member

I'm in favor of chaining (e.g. sub_wf().with_runtime_override(...).with_runtime_override(...).with_runtime_override(...)) instead of two different methods.

@pingsutw Happy to help out!

@fg91
Member

fg91 commented May 22, 2023

Instead of chaining or two methods, one with s, we could also simply pass a dict?

sub_wf().with_runtime_override({"model_2_resources": runtime_override_config(cpu=1, mem="1Gi"), "model_3_resources": ...})
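
For comparison, chaining and the dict form need not be mutually exclusive. The sketch below shows one method accepting both call styles. `NodeHandle` is a hypothetical stand-in for whatever `sub_wf()` returns inside a workflow, and plain dicts stand in for `runtime_override_config(...)`.

```python
from typing import Dict, Mapping, Optional, Union

Config = Dict[str, str]


class NodeHandle:
    """Hypothetical stand-in for the object returned by sub_wf() in a workflow."""

    def __init__(self) -> None:
        self.overrides: Dict[str, Config] = {}

    def with_runtime_override(
        self,
        name_or_mapping: Union[str, Mapping[str, Config]],
        config: Optional[Config] = None,
    ) -> "NodeHandle":
        if isinstance(name_or_mapping, str):
            # Single-name form: with_runtime_override("name", config)
            if config is None:
                raise TypeError("config is required when a name is given")
            self.overrides[name_or_mapping] = config
        else:
            # Dict form: with_runtime_override({"name": config, ...})
            if config is not None:
                raise TypeError("config must be omitted when a mapping is given")
            self.overrides.update(name_or_mapping)
        # Returning self is what makes chaining possible.
        return self


chained = (
    NodeHandle()
    .with_runtime_override("model_2_resources", {"cpu": "1", "mem": "1Gi"})
    .with_runtime_override("model_3_resources", {"cpu": "2", "mem": "2Gi"})
)
from_dict = NodeHandle().with_runtime_override(
    {
        "model_2_resources": {"cpu": "1", "mem": "1Gi"},
        "model_3_resources": {"cpu": "2", "mem": "2Gi"},
    }
)
```

Both spellings end up with the same override mapping, so the choice is mostly about readability.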

Also happy to take a ticket @pingsutw :)

@bstadlbauer
Member

Also works of course 👍

pingsutw
pingsutw previously approved these changes May 23, 2023
@fg91
Member

fg91 commented May 25, 2023

To me the RFC looks ready to approve after the results of the latest discussions have been incorporated into the rfc doc itself 🚀

Contributor

@eapolinario eapolinario left a comment

This is shaping up to be a great feature!

@bstadlbauer
Member

@ByronHsu Also ok with this after all changes discussed there are incorporated in the RFC

@bstadlbauer
Member

Also, one question that came up during a first stab at implementing this - what would the UX for overriding task_config look like?

@ByronHsu
Contributor Author

ByronHsu commented Jun 7, 2023

@bstadlbauer how about passing a nested dict?
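
One possible reading of the nested-dict idea is a recursive merge of the override into the task's existing task_config, sketched below. The field names (`worker`, `chief`, `replicas`, `image`) are made up for illustration, and `deep_merge` is a hypothetical helper rather than anything in flytekit.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied recursively.

    Nested dicts are merged key by key; any other value in override
    replaces the corresponding value in base.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


task_config = {"worker": {"replicas": 2, "image": "a"}, "chief": {"replicas": 1}}
overridden = deep_merge(task_config, {"worker": {"replicas": 4}})
```

Only the keys present in the override change; everything else in task_config is carried over untouched.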

Signed-off-by: byhsu <[email protected]>
@ByronHsu ByronHsu dismissed stale reviews from pingsutw and fg91 via b62ec71 June 7, 2023 17:55
@ByronHsu ByronHsu force-pushed the byhsu/config-override branch from 5af675d to b62ec71 Compare June 7, 2023 18:02
@davidmirror-ops davidmirror-ops merged commit 4d7b656 into flyteorg:master Jul 4, 2023
Labels
rfc A label for RFC issues
Projects
Status: Adopted by a Working Group
8 participants