Worker addresses are treated as unique identifiers, but may not be #6392
Comments
I think the first step to make the hash a bit more reliable is easy (#6398); changing the usage of the address is a bit more elaborate, but I'm also in favor. I think we should rather use `Worker.id` in the entire code base. I don't think this is a very disruptive change, it is just a bit of work.
I suspect it's going to be a lot of legwork. But I agree it's the sane thing to do.
In #6585 I'm extending the hash to incorporate a unique counter. Apart from hash collisions, the hash function should now be sufficiently unique. We might even consider defining the hash as only the id. I'm not sure if there is anything else we should do. Replacing all usage of address vs. id/hash seems a bit excessive after giving it more thought, and would not protect us from state drift between the scheduler and an extension, as is the case in the stealing deadlock. As long as we can rely on the equality methods, I'm good. Thoughts?
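A minimal sketch of the counter idea described above (hypothetical code, not the actual #6585 patch): a class-level counter is folded into the hash at construction time, so two `WorkerState` objects created separately never share a hash even when their addresses coincide, and equality is by identity.

```python
import itertools

class WorkerState:
    # Hypothetical minimal shape; the real class has many more fields.
    _instances = itertools.count()

    def __init__(self, address: str):
        self.address = address
        # Fold a unique, monotonically increasing counter into the hash
        # so that same-address workers hash differently.
        self._hash = hash((address, next(WorkerState._instances)))

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other: object) -> bool:
        # Identity-based equality: two workers created separately are
        # never equal, regardless of address.
        return self is other

a = WorkerState("tcp://10.0.0.1:1234")
b = WorkerState("tcp://10.0.0.1:1234")  # same address, different worker
assert a != b
assert hash(a) != hash(b)
```

With this shape, a stale `WorkerState` can no longer masquerade as its same-address successor in sets or dict lookups keyed by the object itself.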
I think that we should eventually go through the legwork of cleaning up the use of
I'm not sure I'm following; wasn't the fix in #6585 quite literally replacing an address-based check with an equality-based (i.e., ID-based) one?
The `__hash__` of a `WorkerState` object is just its address (`distributed/distributed/scheduler.py`, line 480 at 33fc50c).
As is the equality check (#3321, #3483) (`distributed/distributed/scheduler.py`, lines 501 to 504 at 33fc50c).
And in general, there are a number of places where we store things in dicts keyed by worker address, and assume that if `ws.address in self.workers`, then `ws is self.workers[ws.address]`. (`stealing.py` is especially guilty; most of its logic is basically built around this.)

However, it's completely valid for a worker to disconnect, then for a new worker to connect from the same address. (Even with reconnection removed (#6361), a `Nanny` (#6387) or a user script could do this.) These are logically different workers, though they happen to have the same address.
This can cause:
- a `WorkerState` object to be updated which is no longer in `self.workers` (though its address is),
- a `TaskState` to be made to point at a `WorkerState` which has been removed,
- etc.

Outcomes:
- `WorkerState` objects should be uniquely identifiable.
- `WorkerState` objects referring to logically different `dask-worker` invocations must not be equal or have the same hash, even if they happen to have the same address.
- Anything that holds a reference to a `WorkerState` across an `await`, or stores some state in a dict to be used later, etc., must verify, each time it regains control, that the worker it's dealing with still exists in the cluster (not just that its address exists).

Alternatives:
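The third outcome, re-validating after an `await`, could look like the following sketch (all names here, `Scheduler.act_on_worker` included, are hypothetical illustrations, not the scheduler's real API): after regaining control, the coroutine checks object identity against the current dict entry rather than mere address membership.

```python
import asyncio

class WorkerState:
    # Hypothetical minimal shape for illustration only.
    def __init__(self, address: str):
        self.address = address

class Scheduler:
    def __init__(self):
        self.workers: dict[str, WorkerState] = {}

    async def act_on_worker(self, ws: WorkerState) -> str:
        await asyncio.sleep(0)  # stand-in for any await point
        # An address-only check (`ws.address in self.workers`) would wrongly
        # pass after a same-address reconnect; identity catches it.
        if self.workers.get(ws.address) is not ws:
            return "worker is gone"
        return "ok"

async def main() -> str:
    s = Scheduler()
    old = WorkerState("tcp://10.0.0.1:1234")
    s.workers[old.address] = old
    task = asyncio.ensure_future(s.act_on_worker(old))
    # A same-address reconnect happens while the coroutine is suspended:
    s.workers[old.address] = WorkerState("tcp://10.0.0.1:1234")
    return await task

assert asyncio.run(main()) == "worker is gone"
```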
Causes #6356, #3256, #6263, maybe #3892
cc @crusaderky @fjetter @bnaul