Allow worker to refuse data requests with busy signal #2092
Conversation
distributed/worker.py
Outdated
@@ -1381,28 +1398,32 @@ def transition_dep_waiting_flight(self, dep, worker=None):
                pdb.set_trace()
            raise

-    def transition_dep_flight_waiting(self, dep, worker=None):
+    def transition_dep_flight_waiting(self, dep, worker=None, busy=False):
This keyword should be inverted and changed to something like remove, which probably makes more sense locally.

You mean workers on the same host?
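To make the naming question concrete, here is a rough, self-contained sketch of the two outcomes the keyword appears to select between; `who_has` and `pending` are illustrative stand-ins, not distributed's actual worker state:

```python
from collections import defaultdict

# Illustrative stand-ins for worker-side bookkeeping (hypothetical names).
who_has = defaultdict(set)   # dep key -> workers believed to hold it
pending = set()              # dep keys that still need to be fetched

def on_flight_to_waiting(dep, worker=None, busy=False):
    """Sketch only: roughly what a busy-aware flight -> waiting transition decides."""
    if busy:
        # The remote worker answered with a busy signal: keep it as a
        # possible source and simply retry the fetch later.
        pending.add(dep)
    else:
        # The fetch failed for another reason: stop treating this worker
        # as a source for dep, then retry elsewhere.
        who_has[dep].discard(worker)
        pending.add(dep)
```

Under that reading, inverting the flag to something like remove=True would make the second branch the explicitly named one.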
distributed/tests/test_worker.py
Outdated
    yield wait(futures)

    assert len(workers[0].outgoing_transfer_log) < 18
    assert sum(not not w.outgoing_transfer_log for w in workers) >= 3
not not is weird. Wouldn't bool(w.outgoing_transfer_log) work?
Or sum(1 for w in workers if len(w.outgoing_transfer_log) > 0); that might be even more explicit.
Fixed to be more explicit. I'm using len(... for ... if ...)
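For reference, the formulations discussed above all count the same thing; a quick check with stand-in worker objects (the SimpleNamespace workers below are only for illustration):

```python
from types import SimpleNamespace

# Stand-in workers; only the attribute under test matters here.
workers = [SimpleNamespace(outgoing_transfer_log=[{"bytes": 1}]),
           SimpleNamespace(outgoing_transfer_log=[]),
           SimpleNamespace(outgoing_transfer_log=[{"bytes": 2}, {"bytes": 3}])]

a = sum(not not w.outgoing_transfer_log for w in workers)        # original
b = sum(bool(w.outgoing_transfer_log) for w in workers)          # first suggestion
c = sum(1 for w in workers if len(w.outgoing_transfer_log) > 0)  # second suggestion
d = len([w for w in workers if w.outgoing_transfer_log])         # list-comprehension form
assert a == b == c == d == 2
```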
distributed/tests/test_worker.py
Outdated
    yield wait(futures)

    assert len(workers[0].outgoing_transfer_log) < 18
Where do the 18 and the 3 come from? Were they measured from empirical runs?
Would it be possible to increase the number of secondary worker-to-worker transfers by increasing the size of x while making x cheaper to allocate initially? For instance:

    x = c.submit(bytes, int(1e8), workers=[workers[0].address])
Two issues:

- We may start to run out of RAM on Travis with 1e8 * 20 bytes
- Compression will make the transfers too fast

I'm not very concerned about the cost of creating the random array the first time. I don't think that this will affect the number of secondary worker-to-worker transfers. However, I may not fully understand your meaning.
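To put rough numbers on those two concerns (an illustration only, scaled down to 1 MB payloads so it runs quickly): twenty 1e8-byte payloads is about 2 GB of resident data, and a bytes(n) payload is all zero bytes, which compresses to almost nothing, whereas random doubles stay essentially incompressible.

```python
import zlib
import numpy as np

# RAM concern: twenty 1e8-byte copies is ~2 GB resident on the test cluster.
print(20 * 1e8 / 1e9, "GB")                   # 2.0 GB

# Compression concern, scaled down to ~1 MB payloads:
zeros = bytes(10**6)                          # what bytes(int(1e8)) would look like
print(len(zlib.compress(zeros)))              # collapses to roughly a kilobyte

rand = np.random.random(125_000).tobytes()    # ~1 MB of random float64 data
print(len(zlib.compress(rand)))               # stays close to the original 1 MB
```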
LGTM, but I am really not familiar with the code so I trust you and the existing test suite. Did you run benchmarks to ensure that this does not cause any significant performance regression?
Feel free to upgrade the joblib connector as part of this PR to remove the explicit broadcasting in the auto-scatter thingy.
distributed/distributed.yaml
Outdated
@@ -19,6 +19,7 @@ distributed:
   worker:
     multiprocessing-method: forkserver
     use-file-locking: True
+    max-connections: 10  # maximum simultaneous outgoing connections
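For reference, a sketch of how one might inspect or override the new value through dask's config machinery, assuming it is exposed under the key path shown in the diff (the exact key and defaults here are assumptions, not a documented API guarantee):

```python
import dask

# Assumed key path, matching the distributed.yaml entry above.
current = dask.config.get("distributed.worker.max-connections", default=10)
print(current)

# Temporarily raise the limit for an experiment, e.g. before starting
# workers in-process for a stress test.
with dask.config.set({"distributed.worker.max-connections": 20}):
    pass  # start workers / run the IO-heavy workload here
```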
Did you have the opportunity to run some IO-intensive benchmark/stress test on a "real" cluster (e.g. on GCP) to measure the impact of that setting on the overall completion time of a data-bottlenecked set of tasks?
If you do, I would be curious to see the empirical arity of the tree structure of the resulting broadcast for different values of max-connections.
No, I haven't yet done any benchmarking. I'll play a bit on my local machine. I may get to trying this out on a larger cluster, but that's uncertain. Instead, I suspect that we will end up changing the default value over time.
Force-pushed from 93f6b80 to 592df43
This allows workers to say "I'm too busy right now" when presented with a request for data from another worker. That worker then waits a bit, queries the scheduler to see if anyone else has that data, and then tries again. The wait time is an exponential backoff.

Pragmatically this means that when single pieces of data are in high demand the cluster will informally do a tree scattering. Some workers will get the data directly while others wait on the busy signal. Then other workers will get it from them, and so on. We used to ask users to do this explicitly with the following:

    client.replicate(future)

or

    client.scatter(data, broadcast=True)

Now the replicate/broadcast step is no longer strictly necessary (though some scattering of local data still is).

Machines on the same host are given some preference, and so should be able to sneak in more easily.

Currently this has two issues:

1. We need to unify the configuration with the total_connections parameter (which does the same thing, but in the opposite direction)
2. We don't test the same-host behavior (this is hard because we're currently getting host information from the socket)
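The retry loop described above, as a minimal standalone sketch; who_has and get_data are placeholder callables standing in for however the worker re-queries the scheduler and fetches data, not the actual distributed API:

```python
import asyncio
import random

async def fetch_with_backoff(key, who_has, get_data, delay=0.05, max_delay=2.0):
    """Keep asking holders of `key` for its data, backing off exponentially
    whenever the chosen holder answers with a busy signal."""
    while True:
        holders = await who_has(key)             # re-query the scheduler
        worker = random.choice(list(holders))    # pick any current holder
        response = await get_data(worker, key)   # may be the data, or "busy"
        if response != "busy":
            return response
        await asyncio.sleep(delay)               # wait a bit before retrying...
        delay = min(delay * 2, max_delay)        # ...doubling up to a ceiling
```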
Force-pushed from 592df43 to c018823
    from dask_jobqueue import PBSCluster
    cluster = PBSCluster(processes=18)
    cluster.scale(20)  # results in 18 * 20 processes on 20 physical machines

    from dask.distributed import Client
    client = Client(cluster)
    client

    import numpy as np
    x = client.submit(np.random.random, 100000000, pure=False)
    workers = list(client.scheduler_info()['workers'])
    futures = [client.submit(len, x, pure=False, workers=[w])
               for w in workers]

One max connection (double for same node): around 25s of communication time (note that the communication starts after zero)

Ten max connections (double for same node): around 18s

100 max connections: around 35s (note that the dashboard starts before zero for some reason)

1000 max connections: 50-60s
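For scale, a back-of-the-envelope on the data volume behind those timings (using the 1e8 float64 values and 18 * 20 worker layout from the snippet above):

```python
n_workers = 18 * 20                  # processes per node * physical machines
bytes_per_copy = 100_000_000 * 8     # 1e8 float64 values, roughly 800 MB
total = n_workers * bytes_per_copy
print(total / 1e9, "GB")             # ~288 GB moved in total, since every
                                     # worker ends up pulling a copy of x
```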
The behavior here is as I would expect. I'm comfortable with this from a performance perspective, though I also think that we'll end up wanting to tune this default in the future by a factor of 2-3. There is still some administrative cleanup to do I think.
Thanks for the benchmarks, that seems to work fine :)
Force-pushed from 6eb9a96 to 40409d1