Conversation
Codecov Report
@@ Coverage Diff @@
## master #339 +/- ##
=======================================
Coverage 73.12% 73.12%
=======================================
Files 18 18
Lines 1362 1362
=======================================
Hits 996 996
Misses 315 315
Partials 51 51
Flags with carried forward coverage won't be shown.
Looks great to me. One small question - I see from the docs that the DaskJob creates a number of Pods (more complex than other clusters, e.g. Spark / Ray). This is why we have separate configurable image fields here, which serve as a default overriding the container_image flytekit parameter. Presumably we can have finer control, e.g.:
@task(container_image="foo2", task_config=DaskConfig(
    image="foo1",
    # totally not correct syntax, but you get the idea
))
def bar():
    # random work
    ...
Is the main usage of this to allow a JobSpec image to have additional Python dependencies? I'm guessing the dask version needs to (or at least should) be the same between the cluster pods (i.e. scheduler and worker) and the JobSpec Pod, right?
Hopefully very soon we want to offer Flyte support for persistent clusters, so that rather than ephemerally creating / deleting a cluster for each task we can have a cluster (Dask / Spark / Ray / etc.) start at the beginning of a workflow and be used for multiple tasks within that workflow - or even across workflows. Just making sure none of this definition conflicts with that vision.
@hamersaw Thanks for looking at this!
I mostly wanted to give users flexibility in choosing the image if they want to, but would strongly advise against doing so (unless folks know the pros/cons). I've also tried making that clear in the plugin docs. One case I could see is when folks want to use GPU nodes as workers, which may or may not require a different image, especially as I might add support for additional worker groups to the plugin later on.

All that being said, I've also thought about getting rid of the possibility to override the image altogether.
Oh perfect. Can you say a little more about what configuring additional worker groups might look like for this config? IIUC for ephemeral clusters, since there will only be one job running on each cluster, this may not be necessary. I'm wondering if it makes sense to go another step further here and split the configuration into separate scheduler and worker settings.

Also, right now it looks like we're using the existing flytekit parameters for some settings and Dask-specific fields for others. The two task configurations will result in the same task, right? In the latter case we can use the existing flytekit configuration instead of including additional Dask configuration. I don't feel very strongly about this, just trying to make sure the API is ergonomic and relatively similar to other Flyte constructs.
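The two inline configurations referenced above were not preserved in this copy; purely as an illustration, an equivalent pair might look like the following (class names follow the proposal made later in this thread, so treat this as a sketch rather than the API under review):

from flytekit import Resources, task
from flytekitplugins.dask import Dask, WorkerGroup

# Variant 1: worker limits set through the Dask-specific config
@task(task_config=Dask(workers=WorkerGroup(limits=Resources(cpu="2", mem="4Gi"))))
def variant_a() -> None:
    ...

# Variant 2: the same limits expressed through the existing flytekit parameter;
# with the defaulting rules discussed below, the workers would inherit these limits
@task(task_config=Dask(), limits=Resources(cpu="2", mem="4Gi"))
def variant_b() -> None:
    ...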
Super glad to have this conversation! All in for making this more ergonomic and similar to other Flyte constructs 👍 cc @jacobtomlinson feel free to add any opinion to this thread.
So there are two more concepts that the operator supports (but which cannot currently be passed into a DaskJob): additional worker groups and an autoscaler. I was thinking that both of these would then fit into the Python task config roughly like this:
@task(
task_config=Dask(
cluster=DaskCluster(
...,
additional_worker_groups=[
WorkerGroup(
image=...,
requests=...,
limits=...,
)
],
autoscaler=Autoscaler(
min_workers=...,
max_workers=...
)
),
),
)
That seems reasonable 👍 Given the thoughts on worker groups and autoscaling, would you replace the current top-level image / resources fields with something like this?
@task(
task_config=Dask(
...,
# I've kept the `cluster=` in the example, but we could also remove it
cluster=DaskCluster(
scheduler=Scheduler(),
default_workers=WorkerGroup(
count=...,
image=...,
requests=...,
limits=...,
),
),
),
)
Ok, how about removing the `cluster=` level altogether? By default, all scheduler and worker settings would then fall back to what is set on the `@task` decorator. So combining all of the above, my new proposal would be:
@task(
task_config=Dask(
# All of these are optional and default to settings in `@task`
scheduler=Scheduler(
image=...,
limits=...,
requests=...,
),
# All of these are optional and default to settings in `@task`
workers=WorkerGroup(
count=...,
image=...,
requests=...,
limits=...,
),
),
image=...,
requests=...,
limits=...
)
I really like this ^^^. Hoping for a bit more input before moving forward - can we give this a day?
Sure thing, no rush on my end
A few thoughts.
@hamersaw This would be great. This is one of the use cases we had in mind for workflow engines when we designed the new Dask Kubernetes Operator. You could create a DaskCluster at the start of a workflow, point multiple tasks at it, and tear it down at the end.
@bstadlbauer This is interesting. I don't imagine that we will nest additional worker groups inside the DaskCluster spec on the operator side. Is there any limitation in Flyte that requires everything to be a single resource, or can you create multiple resources?
@hamersaw @bstadlbauer yup, I like the final syntax a lot, thank you both for all the iterating. Should we just drop the nested cluster level from the proto as well then? To answer the question from @jacobtomlinson, I'll defer to @hamersaw - I'm not sure what the vision for this plugin is on the backend. If this takes the shape of a custom plugin on the backend, it certainly would have the ability to control more than one resource, though admittedly it makes the code more complex.
@jacobtomlinson so right now the k8s plugins are designed to operate over a single resource. So basically running a task is creating a k8s resource and then monitoring its status, which is why having the status on the top-level CRD for the DaskJob is so useful.

The vision for enabling persistent clusters, while admittedly very early in design, is to create a new set of Flyte plugins that allow for managing clusters. So as you suggested, a workflow would begin by calling a task that creates the cluster, and later tasks would run against it.

I'm wondering about the advantages of supporting dynamically adding worker groups to a cluster. Is it much better than creating a cluster with two worker groups at the beginning of the workflow? Or creating two separate clusters at the beginning of the workflow? Not sure this has to be addressed in these PRs - maybe we should start a new issue for supporting "persistent clusters" where these conversations can be tracked.
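To make the persistent-cluster idea concrete, here is a purely hypothetical sketch - none of these cluster-lifecycle pieces exist in Flyte, and create_dask_cluster / the scheduler address are stand-ins for whatever such a plugin might expose:

from flytekit import task, workflow

# Hypothetical: a future "create cluster" task whose output (e.g. the scheduler
# address) is passed to downstream tasks so they reuse the same cluster.
@task
def create_dask_cluster() -> str:
    return "tcp://dask-scheduler.example:8786"  # placeholder address

@task
def heavy_computation(scheduler_address: str) -> int:
    # A future Dask plugin task would connect to the existing scheduler
    # instead of provisioning its own ephemeral cluster.
    return 42

@workflow
def wf() -> int:
    address = create_dask_cluster()
    return heavy_computation(scheduler_address=address)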
The final syntax LGTM too! One question about it: would the `@task`-level image and resources be used as the defaults whenever the Dask-specific fields are left unset?
@jacobtomlinson Thank you so much for having a look here!
@hamersaw already answered this, but yes, currently this is a Flyte limitation of one plugin task corresponding to one k8s CRD.
I would change up the protobuf structure to match the proposed one here, which would roughly entail splitting the configuration into a scheduler message and a (default) worker group message. Or would anyone prefer to keep the current protobuf (without the split)?
I think it would be sufficient when additional worker groups are created as part of the cluster creation in the Flyte use case. Creating two clusters would not work, as you'd want to build a single cluster so that all workers report to the same scheduler.

I was thinking the same.
Right that makes sense, I understand the motivation now.
For reference, the DaskWorkerGroup and DaskAutoscaler are separate resources in the operator that reference an existing DaskCluster.

It's less about being dynamic and more about being composable. Folks compose together a cluster out of a scheduler and one or more worker groups. We made the decision to nest one default worker group inside the DaskCluster spec because that covers the common case. Then we could push more in the direction of having some primitive Dask building blocks like DaskWorkerGroup and DaskAutoscaler. Users who want to be super composable can create manifests or helm charts with those resources directly. Is that what you had in mind @bstadlbauer?
Totally understand the breakdown of different components (i.e. scheduler, worker groups, autoscaler). Once we work on persistent clusters (i.e. starting at the beginning of a workflow and executing multiple tasks) we should prioritize the ability to manage multiple resources; I don't foresee this being an issue. Thanks @jacobtomlinson for the thorough explanation of the requirements to fully utilize the k8s Dask offerings - this model sounds like it should apply well to other cluster types as well.
Yeah, this is what I had in mind, because the additional worker groups could then just be composed as separate resources.
Just leaving a +1 here for the final proposed syntax.
Thank you all for the quick responses as well as the great feedback! I've made the changes according to this discussion, but happy to revisit if there are any further comments.

@hamersaw - exactly, yes!
protos/flyteidl/plugins/dask.proto
message Scheduler {
    // Optional image to use. If unset, will use the default image.
    string image = 1;

    // Resources assigned to the scheduler pod.
    core.Resources resources = 2;
}

message WorkerGroup {
    // Number of workers in the group.
    uint32 number_of_workers = 1;

    // Optional image to use for the pods of the worker group. If unset, will use the default image.
    string image = 2;

    // Resources assigned to all pods of the worker group.
    // As per https://kubernetes.dask.org/en/latest/kubecluster.html?highlight=limit#best-practices
    // it is advised to only set limits. If requests are not explicitly set, the plugin will make
    // sure to set requests==limits.
    // The plugin sets `--memory-limit` as well as `--nthreads` for the workers according to the limit.
    core.Resources resources = 3;
}
Can we either start prefixing all messages with the name of the plugin or create separate go packages for them?
In Python, these will be created under dask_pb2.Scheduler, which is fine, but in Go it'll be just plugins.Scheduler, which isn't very distinguishable.
Great point, thanks! I would vote for just prefixing each with Dask (e.g. DaskScheduler, DaskWorkerGroup), but could be convinced either way if anyone else feels strongly.
Awesome! Thank you
Congrats on merging your first pull request! 🎉
TL;DR
Adds the IDL for creating a DaskJob type.
Are all requirements met?
Complete description
Tracking Issue
https://github.com/flyteorg/flyte/issues/
Follow-up issue
NA