[Feature] Mechanism for injecting node affinities and tolerations to every Flyte task pod #435
Comments
@jeevb thank you for the excellent write-up and for explaining the problem. I completely understand that in a shared K8s cluster it is desirable to run batch-style jobs on certain nodes rather than others. I think we should support it (caveat). First let me dive into some concepts,
Wdyt?
I understand the design considerations around …
@jeevb awesome, and thank you for the quick response. I will create a ticket for 1, and let's keep this open; if we find a problem, you can revisit. We will use this ticket to write an example that shows how to use "sidecar" in dynamic. I forgot to mention we have users at Lyft doing that.
Part 1: #439
This can now be resolved. The way to add affinities, tolerations, etc. is to use flytekitplugins-pod. This is a first-party and highly usable plugin and will be maintained. The only reason to keep it as a separate lib is to keep the k8s Python dependencies in the core to a minimum.
Motivation: Why do you think this is important?
Please consider adding more control for specifying the pod spec of the Flyte task pods - namely node affinities and tolerations. This is critical on shared K8S clusters to target task pods to dedicated nodes that are better suited to run these tasks. These nodes are also typically tainted to block other workloads from scheduling on them.
Currently, via Flytekit, users only have the ability to specify the resource requirements for a given python or dynamic task. sidecar tasks provide more flexibility in terms of defining the spec for the pod that will run that task, allowing for the specification of both node affinities and tolerations. While most python tasks can be adapted to sidecar tasks to afford this flexibility, there is no way to do so for dynamic tasks.
Goal: What should the final outcome look like, ideally?
As a workflow/task author, I would like to be able to specify a set of node affinities and tolerations to target Flyte task pods to dedicated/tainted pipeline node pools on shared K8S clusters.
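For concreteness, the constraints requested here correspond to the tolerations and affinity.nodeAffinity fields of a Kubernetes pod spec. The sketch below builds such a fragment as plain Python dictionaries. It does not use any Flytekit API, and the helper name, the taint/label key flyte.example/dedicated, and the value pipelines are hypothetical, chosen only for illustration.

```python
import json
from typing import Any, Dict


def dedicated_pool_constraints(key: str, value: str) -> Dict[str, Any]:
    """Build the pod-spec fields that target a pod at a tainted node pool.

    Assumption for this sketch: nodes in the pool carry a label and a
    NoSchedule taint that share the same key and value.
    """
    return {
        # Lets the pod schedule onto nodes carrying the matching taint.
        "tolerations": [
            {"key": key, "operator": "Equal", "value": value, "effect": "NoSchedule"}
        ],
        # Requires the scheduler to place the pod only on matching nodes.
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {"key": key, "operator": "In", "values": [value]}
                            ]
                        }
                    ]
                }
            }
        },
    }


if __name__ == "__main__":
    fragment = dedicated_pool_constraints("flyte.example/dedicated", "pipelines")
    print(json.dumps(fragment, indent=2))
```

The toleration alone only permits scheduling on the tainted nodes; the node affinity is what restricts the pod to them, which is why both are requested together.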
Describe alternatives you've considered
We are currently considering using mutating admission controllers to patch Flyte task pods accordingly.
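As a rough illustration of this alternative, a mutating admission webhook would answer each pod admission request with a JSON Patch that adds the scheduling constraints. The sketch below shows only the patch-building and response-shaping logic, with no HTTP server or webhook registration. The toleration and affinity values, and the function names, are hypothetical; how Flyte task pods would be selected for patching is not addressed here.

```python
import base64
import json
from typing import Any, Dict, List

# Hypothetical constraints, for illustration only.
TOLERATION = {
    "key": "flyte.example/dedicated",
    "operator": "Equal",
    "value": "pipelines",
    "effect": "NoSchedule",
}
NODE_AFFINITY = {
    "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [
            {
                "matchExpressions": [
                    {"key": "flyte.example/dedicated", "operator": "In", "values": ["pipelines"]}
                ]
            }
        ]
    }
}


def build_patch(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return JSON Patch (RFC 6902) operations that add the constraints."""
    spec = pod.get("spec", {})
    ops: List[Dict[str, Any]] = []

    if spec.get("tolerations"):
        # Append to the existing list.
        ops.append({"op": "add", "path": "/spec/tolerations/-", "value": TOLERATION})
    else:
        # Create the list (or overwrite an empty one).
        ops.append({"op": "add", "path": "/spec/tolerations", "value": [TOLERATION]})

    affinity = spec.get("affinity")
    if not affinity:
        ops.append({"op": "add", "path": "/spec/affinity", "value": {"nodeAffinity": NODE_AFFINITY}})
    elif not affinity.get("nodeAffinity"):
        ops.append({"op": "add", "path": "/spec/affinity/nodeAffinity", "value": NODE_AFFINITY})
    # A pod that already declares nodeAffinity is left untouched in this sketch.
    return ops


def admission_response(request: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap the patch in an AdmissionReview response for the given request."""
    patch = build_patch(request["object"])
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

This keeps task definitions unchanged, at the cost of moving the scheduling policy out of the workflow code and into cluster configuration.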
Flyte component
[Optional] Propose: Link/Inline
I see 3 levels of implementations that can be relevant for this problem:
… pyflyte register workflows -c. All tasks registered this way will have these applied.
… sidecar tasks. This will afford the most flexibility.
Additional context
Add any other context or screenshots about the feature request here.
Is this a blocker for you to adopt Flyte
This is NOT a blocker.