[SPARK-24432] Support dynamic allocation without external shuffle service #24083
Conversation
Allows dynamically scaling executors up and down without external shuffle service. Tracks shuffle locations to know if executors can be safely scaled down.
whoops, thanks @mcheah, sorry for the mis-tag!
ok to test
@mccheah it does feel to me like more discussion is required than is appropriate for a PR. Whether that means a separate JIRA ticket or a SPIP, I'm undecided.
I don't think this needs an SPIP (it's all internal changes), but I agree that it should be its own separate bug in JIRA. It would also be better to separate the k8s-side changes into a different bug/PR, since those fix a different problem than the other changes. SPARK-24432 is an umbrella bug, so you could just create sub-tasks for it.
Ping? I'm seeing two or three separate PRs here, with the ExecutorAllocationManager changes plus tracking of active stages being the most obvious. "Shuffle biased scheduling" builds on top of the active stage tracking, but feels separate from the dynamic allocation changes. And the k8s changes definitely should be a separate PR. Also, instead of adding YARN tests, I'd do that in core.
If there's not going to be any activity here, I'll just close this PR and let someone else handle this if they want...
Apologies, I was on vacation last week. I'll file separate sub-issues in JIRA for both the k8s changes and dynamic-allocation-without-ESS. I will close this PR and split it out as three separate PRs, in order: (1) k8s changes, (2) ExecutorAllocationManager changes / active stage tracking (with additional tests in core), (3) shuffle biased task scheduling.
@lwwmanning Hi, any progress on this work? Do you want some help? |
@lwwmanning and @vanzin, there are no updates here or on the JIRA issue. IMO this approach is worth considering; it may have its downsides, but having it as an option for Spark on Kubernetes seems reasonable to me. Shall I take this work forward from here?
I agree that this could be useful, and you're free to open a PR if you want. But I actually do have an implementation of this that is based on some code that is currently under review, and I plan to submit it when the PR it depends on is merged. |
Hi @vanzin, do you have an update? Can you please provide a link to the PR you mentioned here?
What changes were proposed in this pull request?
This PR adds a limited version of dynamic allocation that does not require the external shuffle service, and thus works on kubernetes (but does not support preemption).
The basic approach is to track which executors are holding shuffle files, and of those, which have shuffle files depended on by active stages. If an executor contains shuffle files depended on by an active stage, then we treat it as "active" (i.e., prevent the ExecutorAllocationManager from marking it as "idle"). If an executor contains only shuffle files that are not dependencies of active stages, then we treat those shuffle files similarly to cached data (i.e., configurable idle timeout that defaults to "infinity").
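The executor-removal rule described above can be sketched as follows. This is an illustrative sketch only, using hypothetical names rather than Spark's actual internal API; the shuffle idle timeout defaulting to "infinity" mirrors the cached-data timeout behavior described above.

```scala
// Hypothetical sketch (names are illustrative, not Spark's actual API):
// an executor is removable only if it is idle, holds no shuffle files
// depended on by an active stage, and any inactive shuffle files it holds
// have outlived a configurable idle timeout (default: effectively infinite).

case class ExecutorShuffleState(
    executorId: String,
    shuffleIds: Set[Int],        // shuffles whose map output lives on this executor
    idleSinceMs: Option[Long])   // None while the executor is running tasks

object ShuffleIdleTracker {
  /** Shuffle ids that some currently-active stage reads. */
  def activeShuffles(activeStageDeps: Map[Int, Set[Int]]): Set[Int] =
    activeStageDeps.values.flatten.toSet

  def canRemove(
      exec: ExecutorShuffleState,
      active: Set[Int],
      nowMs: Long,
      shuffleIdleTimeoutMs: Long = Long.MaxValue): Boolean =
    exec.idleSinceMs match {
      case None => false                                    // still running tasks
      case Some(since) =>
        if (exec.shuffleIds.exists(active.contains)) false  // needed by an active stage
        else if (exec.shuffleIds.isEmpty) true              // plain idle executor
        else nowMs - since >= shuffleIdleTimeoutMs          // only inactive shuffle files
    }
}
```

With the default timeout, an executor holding only inactive shuffle files is never reclaimed, which matches treating such files like cached data.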
We also introduce the concept of "shuffle biased task scheduling", a heuristic attempt to schedule tasks for maximal efficacy of dynamic allocation. We do this by attempting to minimize the number of executors that contain (active) shuffle files, by packing as many tasks as possible onto "active" executors first, followed by scheduling them on executors with only inactive shuffle files, and finally all remaining executors.
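The packing order behind shuffle biased task scheduling amounts to a three-tier preference over candidate executors. A minimal sketch, with hypothetical names (this is not the PR's actual scheduler code):

```scala
// Hypothetical sketch of the "shuffle biased" executor ordering:
// prefer executors already holding active shuffle files, then executors
// holding only inactive shuffle files, then everything else. Packing tasks
// this way keeps the set of shuffle-holding executors small, so more
// executors remain eligible for removal by dynamic allocation.

case class Candidate(
    executorId: String,
    hasActiveShuffle: Boolean,  // holds shuffle files an active stage depends on
    hasAnyShuffle: Boolean)     // holds any shuffle files at all

object ShuffleBiasedOrdering {
  private def tier(c: Candidate): Int =
    if (c.hasActiveShuffle) 0   // already "pinned" by active shuffles: pack here first
    else if (c.hasAnyShuffle) 1 // only inactive shuffle files
    else 2                      // no shuffle files: use last, so it can stay idle

  /** Orders candidates by tier; sortBy is stable, so ties keep their input order. */
  def order(candidates: Seq[Candidate]): Seq[Candidate] =
    candidates.sortBy(tier)
}
```

Because the sort is stable, any secondary preference (e.g., locality) encoded in the input order is preserved within each tier.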
This is a port of a series of PRs on Palantir's spark fork: palantir#427 palantir#445 palantir#446 palantir#447
Partially addresses https://issues.apache.org/jira/browse/SPARK-24432
cc: @rynorris @mccheah @robert3005
How was this patch tested?
We added tests as part of this PR and did additional manual testing on small YARN and k8s clusters (partially documented on palantir#446). We then successfully rolled this out for a small subset of workloads in production at Palantir, running entirely on kubernetes.
Please review http://spark.apache.org/contributing.html before opening a pull request.