Turn on queuing by default #7213

Closed · 11 tasks done · Fixed by #7279

gjoseph92 opened this issue Oct 28, 2022 · 6 comments
gjoseph92 (Collaborator) commented Oct 28, 2022

We are currently intending to turn on scheduler-side queuing by default. This will give much better cluster memory use in most cases, and help users avoid the painful "running out of memory on simple workloads" experience (#2602).

It does slow things down, but the increased stability and better out-of-the-box user experience seem worthwhile.

Specifically, this will mean changing the default worker-saturation value from inf to something like 1.0 or 1.1. Advanced users who need the old behavior can get it back by changing this config, which will be documented in #7203.
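For advanced users, the override would look roughly like this (a sketch; the config key `distributed.scheduler.worker-saturation`, and the need to set it before the scheduler starts, are my assumptions pending the documentation in #7203):

```python
import dask
from dask.distributed import LocalCluster, Client

# Assumed key name; set before the scheduler is created so it takes effect.
# float("inf") would keep today's unqueued behavior.
dask.config.set({"distributed.scheduler.worker-saturation": float("inf")})

cluster = LocalCluster(n_workers=4)
client = Client(cluster)
```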

TODO:

gjoseph92 (Collaborator, Author) commented
A key question: what should the default worker-saturation value be?

Previously, we'd discussed using 1.1. Due to round-up, this means that small workers would usually have 1 or 2 extra root tasks in memory; big workers would have more. So we'd still have a little oversaturation, but with a bound on it. The motivation for this, instead of plain 1.0, was that for very fast tasks on slow networks, it would mitigate the scheduler<->worker delay.
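To make the round-up concrete, here's a rough sketch of the slot calculation as I understand it (the exact formula in the scheduler may differ):

```python
import math

def root_task_slots(nthreads: int, worker_saturation: float) -> int:
    # Rough model: each worker may hold ceil(nthreads * worker-saturation)
    # root tasks at once (at least 1).
    return max(math.ceil(nthreads * worker_saturation), 1)

for nthreads in (2, 4, 16, 64):
    print(nthreads, root_task_slots(nthreads, 1.0), root_task_slots(nthreads, 1.1))

# nthreads=2:  2 slots at 1.0 vs 3 at 1.1 (1 extra root task in memory)
# nthreads=64: 64 slots at 1.0 vs 71 at 1.1 (7 extra root tasks in memory)
```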

After re-running some benchmarks, I'm personally still leaning towards 1.0.

Comparing non-1.0* to 1.0, using a value >1.0 does not give a definitive decrease in runtime. In many tests, the difference is not statistically significant; it doesn't show up above the normal variation. In the cases where it is definitively faster, the improvement is small, in the 2-8% range.

[chart: runtime comparison between worker-saturation >1.0 and 1.0 across benchmarks]

But >1.0 unquestionably hurts memory usage: we see up to 30% higher memory usage than with 1.0, and every benchmark shows an increase.

[chart: memory usage comparison between worker-saturation >1.0 and 1.0 across benchmarks]

Comparing those two charts, memory use shows a strong signal that >1.0 is worse, but runtime does not show much signal that >1.0 is better.

Of course, these benchmarks are only run with one cluster configuration (10 workers, each with 2 CPUs and 8 GiB of memory). We haven't tested how more or larger workers would affect this.

The main motivation for >1.0 was: "what about super slow corporate networks, or workers in multiple regions, etc.?"

  1. Tiny, fast tasks are not good practice in Dask to begin with. If your task runtime is on the same order as network latency plus scheduler overhead, you should be making your tasks bigger (see the batching sketch after this list). Dask overhead will be a problem in these cases regardless of queuing, for everything but the most basic client.map.

    Yes, client.map with instantaneous tasks is 15% slower with 1.0 than with 1.1. But client.map with realistic tasks (that take ~1s) is only 1% slower.

  2. This doesn't seem like a particularly common workload, and it's a rather niche case to optimize for at the expense of novice users running typical workloads.
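For illustration, the usual fix for latency-bound tiny tasks is to batch them so each task does enough work to amortize the overhead. A minimal sketch (sizes are made up):

```python
from dask.distributed import Client

client = Client()  # local cluster just for illustration

def tiny_task(x):
    return x + 1

def batched(xs):
    # Amortize per-task overhead (network latency + scheduler bookkeeping)
    # over many tiny units of work.
    return [tiny_task(x) for x in xs]

data = list(range(100_000))
batches = [data[i:i + 1_000] for i in range(0, len(data), 1_000)]

futures = client.map(batched, batches)      # 100 tasks instead of 100,000
results = [y for chunk in client.gather(futures) for y in chunk]
```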

All this said: from a theoretical standpoint, it does make sense that 1.1 would be somewhat faster than 1.0, so I'm surprised that we don't see a clearer runtime difference.

* I used 1.2 in this case, which due to worker size was equivalent to 1.1—either one meant 1 extra task per worker.

crusaderky (Collaborator) commented
Do we have a measure of the scheduler<->worker RTT on coiled?
CC @ntabris

crusaderky (Collaborator) commented
> what about super slow corporate networks

This is wrongly posed. What about corporate networks that were set up ~5 years ago and do not receive the amount of constant attention that AWS does?

ntabris (Contributor) commented Oct 28, 2022

> Do we have a measure of the scheduler<->worker RTT on coiled?

How close does worker.latency come to giving you what you want? Maybe you'd want to look at the minimum latency over time, since latency is (I suspect) pretty sensitive to how responsive the scheduler is when it receives a worker heartbeat.
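For a quick spot check of those values, something like the following should work (a sketch; it assumes `Worker.latency` is the attribute being referred to and that `client.run` passes the worker instance to functions that accept a `dask_worker` argument):

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def get_latency(dask_worker):
    # The worker's most recent estimate of the scheduler<->worker round trip,
    # updated from heartbeats, in seconds.
    return dask_worker.latency

print(client.run(get_latency))  # {worker address: latency in seconds}
```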

Worker latency is captured by Prometheus, so we have lots of recent data there, e.g.:

[chart: worker latency distribution over time, from Prometheus]

I've also done a couple of tests with the scheduler and worker in different zones, and latencies looked like this:

[charts: latencies measured with the scheduler and worker in different availability zones]

(see https://github.com/coiled/platform/issues/165 for more details on that)

gjoseph92 (Collaborator, Author) commented
> What about corporate networks that were set up ~5 years ago and do not receive the amount of constant attention that AWS does?

True, it may not be that dramatic. My point is just that, as far as I know, it's already not good Dask practice to have tasks which are so fast that their runtime is the same order of magnitude as network latency (whatever that latency may be). You should probably increase task size in this case regardless, and if you do, it would make the worker-saturation effect pretty much moot.

All things being equal, I certainly want to improve performance in these higher-latency cases. It's just that the benchmark data makes it seem like that choice may come at a cost. And I wouldn't want to pick a default to help a less-common, bad-practice use case if it comes at the expense of a more common one.

crusaderky (Collaborator) commented
> How close does worker.latency come to giving you what you want? Maybe you'd want to look at the minimum latency over time, since latency is (I suspect) pretty sensitive to how responsive the scheduler is when it receives a worker heartbeat.

It actually would be pretty interesting to see the difference between kernel-level ping (dictated almost exclusively by the network) and application-level latency, where the GIL plays a major role. If we saw that the latter eclipses the former, it would tell us that even if the network became much worse we probably wouldn't notice, as long as ping remains well below the GIL-dominated RTT.
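One way to eyeball that difference, assuming ICMP is allowed on the network and that `Worker.latency` is the application-level number we care about (a sketch, not a rigorous benchmark):

```python
import re
import subprocess
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
scheduler_host = client.scheduler_info()["address"].split("://")[1].split(":")[0]

def compare(dask_worker, host=scheduler_host):
    # Kernel-level RTT: plain ICMP ping from the worker to the scheduler host.
    out = subprocess.run(
        ["ping", "-c", "5", "-q", host], capture_output=True, text=True
    ).stdout
    match = re.search(r"= [\d.]+/([\d.]+)/", out)  # avg of min/avg/max/mdev
    icmp_ms = float(match.group(1)) if match else None

    # Application-level RTT: the worker's heartbeat-derived latency estimate.
    app_ms = dask_worker.latency * 1000
    return {"icmp_ms": icmp_ms, "application_ms": app_ms}

print(client.run(compare))
```

If the application-level number dwarfs the ICMP one, GIL and event-loop overhead dominate, and a somewhat worse network would likely go unnoticed.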

~250ms for the 4th quintile in your first plot above is pretty horrid. I'd be surprised if AWS were to blame for it.
