[SPARK-24432] Support dynamic allocation without external shuffle service #24083
Conversation
Allows dynamically scaling executors up and down without external shuffle service. Tracks shuffle locations to know if executors can be safely scaled down.
whoops, thanks @mcheah, sorry for the mis-tag!
ok to test
@mccheah it does feel to me like more discussion is required than is appropriate for a PR. Whether that means a separate JIRA ticket or a SPIP, I'm undecided.
I don't think this needs an SPIP (it's all internal changes), but I agree that it should be its own separate bug in JIRA. It would also be better to separate the k8s-side changes into a different bug/PR, since those fix a different problem than the other changes. SPARK-24432 is an umbrella bug, so you could just create sub-tasks for it.
Ping? I'm seeing two or three separate PRs here, with the ExecutorAllocationManager changes plus tracking of active stages being the most obvious. "Shuffle biased scheduling" builds on top of the active stage tracking, but feels separate from the dynamic allocation changes. And the k8s changes definitely should be a separate PR. Also, instead of adding YARN tests, I'd do that in core.
If there's not going to be any activity here, I'll just close this PR and let someone else handle this if they want...
Apologies, I was on vacation last week. I'll file separate sub-issues in JIRA for both the k8s changes and dynamic-allocation-without-ESS. I will close this PR and split it out as three separate PRs, in order: (1) k8s changes, (2) ExecutorAllocationManager changes / active stage tracking (with additional tests in core), (3) shuffle biased task scheduling.
@lwwmanning Hi, any progress on this work? Do you want some help? |
@lwwmanning and @vanzin, there are no updates here or on the JIRA issue. IMO this approach is worth considering; it may have its downsides, but having it as an option for Spark on Kubernetes seems reasonable to me. Shall I take this work forward from here?
I agree that this could be useful, and you're free to open a PR if you want. But I actually do have an implementation of this that is based on some code that is currently under review, and I plan to submit it when the PR it depends on is merged. |
Hi @vanzin, do you have an update? Can you please provide a link to the PR you mentioned here?
What changes were proposed in this pull request?
This PR adds a limited version of dynamic allocation that does not require the external shuffle service, and thus works on kubernetes (but does not support preemption).
The basic approach is to track which executors are holding shuffle files, and of those, which have shuffle files depended on by active stages. If an executor contains shuffle files depended on by an active stage, then we treat it as "active" (i.e., prevent the ExecutorAllocationManager from marking it as "idle"). If an executor contains only shuffle files that are not dependencies of active stages, then we treat those shuffle files similarly to cached data (i.e., configurable idle timeout that defaults to "infinity").
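The executor-removal rule described above can be sketched as follows. This is an illustrative sketch only, using hypothetical names rather than Spark's actual internal API; the shuffle idle timeout defaulting to "infinity" mirrors the cached-data timeout behavior described above.

```scala
// Hypothetical sketch (names are illustrative, not Spark's actual API):
// an executor is removable only if it is idle, holds no shuffle files
// depended on by an active stage, and any inactive shuffle files it holds
// have outlived a configurable idle timeout (default: effectively infinite).

case class ExecutorShuffleState(
    executorId: String,
    shuffleIds: Set[Int],        // shuffles whose map output lives on this executor
    idleSinceMs: Option[Long])   // None while the executor is running tasks

object ShuffleIdleTracker {
  /** Shuffle ids that some currently-active stage reads. */
  def activeShuffles(activeStageDeps: Map[Int, Set[Int]]): Set[Int] =
    activeStageDeps.values.flatten.toSet

  def canRemove(
      exec: ExecutorShuffleState,
      active: Set[Int],
      nowMs: Long,
      shuffleIdleTimeoutMs: Long = Long.MaxValue): Boolean =
    exec.idleSinceMs match {
      case None => false                                    // still running tasks
      case Some(since) =>
        if (exec.shuffleIds.exists(active.contains)) false  // needed by an active stage
        else if (exec.shuffleIds.isEmpty) true              // plain idle executor
        else nowMs - since >= shuffleIdleTimeoutMs          // only inactive shuffle files
    }
}
```

With the default timeout, an executor holding only inactive shuffle files is never reclaimed, which matches treating such files like cached data.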
We also introduce the concept of "shuffle biased task scheduling", a heuristic attempt to schedule tasks for maximal efficacy of dynamic allocation. We do this by attempting to minimize the number of executors that contain (active) shuffle files, by packing as many tasks as possible onto "active" executors first, followed by scheduling them on executors with only inactive shuffle files, and finally all remaining executors.
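The packing order behind shuffle biased task scheduling amounts to a three-tier preference over candidate executors. A minimal sketch, with hypothetical names (this is not the PR's actual scheduler code):

```scala
// Hypothetical sketch of the "shuffle biased" executor ordering:
// prefer executors already holding active shuffle files, then executors
// holding only inactive shuffle files, then everything else. Packing tasks
// this way keeps the set of shuffle-holding executors small, so more
// executors remain eligible for removal by dynamic allocation.

case class Candidate(
    executorId: String,
    hasActiveShuffle: Boolean,  // holds shuffle files an active stage depends on
    hasAnyShuffle: Boolean)     // holds any shuffle files at all

object ShuffleBiasedOrdering {
  private def tier(c: Candidate): Int =
    if (c.hasActiveShuffle) 0   // already "pinned" by active shuffles: pack here first
    else if (c.hasAnyShuffle) 1 // only inactive shuffle files
    else 2                      // no shuffle files: use last, so it can stay idle

  /** Orders candidates by tier; sortBy is stable, so ties keep their input order. */
  def order(candidates: Seq[Candidate]): Seq[Candidate] =
    candidates.sortBy(tier)
}
```

Because the sort is stable, any secondary preference (e.g., locality) encoded in the input order is preserved within each tier.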
This is a port of a series of PRs on Palantir's spark fork: palantir#427 palantir#445 palantir#446 palantir#447
Partially addresses https://issues.apache.org/jira/browse/SPARK-24432
cc: @rynorris @mccheah @robert3005
How was this patch tested?
We added tests as part of this PR and did additional manual testing on small YARN and k8s clusters (partially documented on palantir#446). We then successfully rolled this out for a small subset of workloads in production at Palantir, running entirely on kubernetes.
Please review http://spark.apache.org/contributing.html before opening a pull request.