Worker addresses are treated as unique identifiers, but may not be #6392

gjoseph92 · 2022-05-20T02:01:46Z

The __hash__ of a WorkerState object is just its address:

Line 480 in 33fc50c

self._hash = hash(address)

As is the equality check (#3321 #3483):

Lines 501 to 504 in 33fc50c

    
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, WorkerState):
            return False
        return self.address == other.address

And in general, there are a number of places where we store things in dicts keyed by worker address, and assume that if ws.address in self.workers, then ws is self.workers[ws.address]. (stealing.py is especially guilty—most of its logic is basically built around this.)
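To make the assumption concrete, here is a minimal sketch using a simplified stand-in for WorkerState. Only the address-based hash and equality quoted above are modeled; the plain `workers` dict and the example address are hypothetical and not the scheduler's real state.

```python
class WorkerState:
    """Simplified stand-in: models only the address-based hash and equality."""

    def __init__(self, address: str) -> None:
        self.address = address
        self._hash = hash(address)

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, WorkerState):
            return False
        return self.address == other.address


# Hypothetical stand-in for the scheduler's address-keyed mapping.
workers = {}

old = WorkerState("tcp://10.0.0.1:40000")
workers[old.address] = old

# The worker disconnects, then a new worker registers from the same address.
del workers[old.address]
new = WorkerState("tcp://10.0.0.1:40000")
workers[new.address] = new

address_present = old.address in workers    # the address check still passes
same_object = workers[old.address] is old   # but it is not the same object
compare_equal = old == new                  # and the two compare equal anyway
distinct_in_set = len({old, new})           # a set collapses them into one entry
```

Here `address_present` and `compare_equal` are both true while `same_object` is false, which is exactly the case the address-keyed logic does not expect.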

However, it's completely valid for a worker to disconnect, then for a new worker to connect from the same address. (Even with reconnection removed #6361, a Nanny #6387 or a user script could do this.) These are logically different workers, though they happen to have the same address.

This can cause:

- bad decisions: a scheduling or work-stealing decision is made about the old worker at that address; when it's enacted, there's a different worker at that address and the decision may no longer be appropriate
- deadlocks: a WorkerState object is updated which is no longer in self.workers (though its address is), a TaskState is made to point at a WorkerState which has been removed, etc.

Outcomes:

- WorkerState objects should be uniquely identifiable. WorkerState objects referring to logically different dask-worker invocations must not be equal or have the same hash, even if they happen to have the same address.
- Any logic which gives up control flow (via await, or storing some state in a dict to be used later, etc.) must verify, each time it regains control, that the worker it's dealing with still exists in the cluster (not just that its address exists).
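A rough sketch of what the second outcome could look like, assuming a toy scheduler. `Scheduler`, `worker_is_current`, and `move_task` are hypothetical names for illustration, not existing distributed APIs. The point is that the check after the `await` compares object identity rather than address membership.

```python
import asyncio


class WorkerState:
    """Stand-in that uses Python's default identity-based hash and equality."""

    def __init__(self, address):
        self.address = address


class Scheduler:
    """Toy scheduler holding only an address-keyed dict of workers."""

    def __init__(self):
        self.workers = {}

    def worker_is_current(self, ws):
        # Identity check: the object registered at this address must be `ws`
        # itself, not merely some worker that shares its address.
        return self.workers.get(ws.address) is ws

    async def move_task(self, ws, log):
        await asyncio.sleep(0)  # control is given up here
        if not self.worker_is_current(ws):
            log.append("stale")  # the decision was made about a worker that is gone
            return
        log.append("enacted")


async def demo():
    scheduler = Scheduler()
    log = []
    old = WorkerState("tcp://10.0.0.1:40000")
    scheduler.workers[old.address] = old

    task = asyncio.ensure_future(scheduler.move_task(old, log))
    # Before the task runs, a different worker takes over the same address.
    scheduler.workers[old.address] = WorkerState(old.address)
    await task
    return log


result = asyncio.run(demo())
```

With an address-only check (`ws.address in self.workers`) the decision would have been enacted against the replacement worker; with the identity check it is dropped as stale.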

Alternatives:

If this is too much of a change to make, we could instead maintain a monotonically-increasing set of worker addresses, and prohibit address reuse. The scheduler would just reject a worker trying to connect if it had an address we'd already seen before. Of course, this would eliminate the possibility of worker reconnection (Add back worker reconnection #6391), and maybe break nannies too.
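As a sketch of this alternative only: `AddressRegistry` and `AddressReuseError` below are hypothetical names, and the real scheduler's registration path is not modeled. The set of seen addresses only ever grows, so a removed worker's address can never be registered again.

```python
class AddressReuseError(ValueError):
    """Raised when a worker tries to register from an address seen before."""


class AddressRegistry:
    def __init__(self):
        self._seen = set()   # grows monotonically; entries are never removed
        self.workers = {}    # currently connected workers, keyed by address

    def add_worker(self, address):
        if address in self._seen:
            raise AddressReuseError(f"address already used: {address}")
        self._seen.add(address)
        self.workers[address] = object()  # placeholder for a WorkerState

    def remove_worker(self, address):
        # The address deliberately stays in `_seen` after removal.
        self.workers.pop(address, None)


registry = AddressRegistry()
registry.add_worker("tcp://10.0.0.1:40000")
registry.remove_worker("tcp://10.0.0.1:40000")

try:
    registry.add_worker("tcp://10.0.0.1:40000")
    rejected = False
except AddressReuseError:
    rejected = True
```

The second registration is rejected, which is also why this approach rules out a worker reconnecting from its old address.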

Causes #6356, #3256, #6263, maybe #3892

cc @crusaderky @fjetter @bnaul


fjetter · 2022-05-20T10:05:00Z

I think the first step to make the hash a bit more reliable is easy: #6398

Changing the usage of the address is a bit more elaborate, but I'm also in favor. I think we should rather use Worker.id in the entire code base. I don't think this is a very disruptive change; it is just a bit of work.

crusaderky · 2022-05-20T14:46:09Z

I suspect it's going to be a lot of legwork. But I agree it's the sane thing to do.

fjetter · 2022-06-16T12:11:04Z

In #6585 I'm extending the hash to incorporate a unique counter. Apart from hash collisions, the hash function should now be sufficiently unique. We might even consider defining the hash as only the id.
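For illustration, a counter-based hash might look roughly like the sketch below. This is not the code from #6585; the `instance_id` attribute and the class-level counter are assumptions made for the example.

```python
import itertools


class WorkerState:
    """Stand-in whose identity combines the address with a per-instance counter."""

    # Hypothetical counter shared by all instances (not the actual #6585 code).
    _instance_counter = itertools.count()

    def __init__(self, address):
        self.address = address
        self.instance_id = next(WorkerState._instance_counter)
        self._hash = hash((address, self.instance_id))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, WorkerState):
            return False
        # Two instances are equal only if they are the same registration.
        return (
            self.address == other.address
            and self.instance_id == other.instance_id
        )


first = WorkerState("tcp://10.0.0.1:40000")
second = WorkerState("tcp://10.0.0.1:40000")  # same address, new registration

are_equal = first == second
distinct_in_set = len({first, second})
```

Two registrations from the same address now stay distinct in sets and as dict keys, although anything still keyed by the bare address string would need its own check.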

I'm not sure if there is anything else we should do. Replacing all usage of address vs id/hash seems to be a bit excessive after giving it a bit more thought and would not protect us from state drift between scheduler and an extension as it is the case in the stealing deadlock. As long as we can rely on the equal methods, I'm good. Thoughts?

hendrikmakait · 2022-08-24T06:12:27Z

> I'm not sure if there is anything else we should do. Replacing all usage of address vs id/hash seems to be a bit excessive after giving it a bit more thought

I think that we should eventually go through the legwork of cleaning up the use of address as an identifier. This will define the best practice of identifying workers. Diving into the codebase, I have been under the impression that this is fine given that workers are quite often identified by their addresses or stored in dicts keyed by them. I suspect we will also find a number of places where we carelessly check for addresses where an ID-based check would have been needed.

> and would not protect us from state drift between scheduler and an extension as it is the case in the stealing deadlock.

I'm not sure if I'm following, wasn't the fix in #6585 quite literally replacing an address-based check with an equality-based (i.e., ID-based) one?

fjetter mentioned this issue May 20, 2022

WorkerState are different for different addresses #6398

Merged

This was referenced May 24, 2022

Task stuck in "processing" on closed worker #6263

Closed

Deadlock: tasks stolen to old WorkerState instance of a reconnected worker #6356

Closed

This was referenced Aug 19, 2022

Withhold root tasks [no co assignment] #6614

Merged

Always return ws.address from _remove_from_processing #6884

Merged

hendrikmakait mentioned this issue Nov 24, 2022

Tasks with worker restrictions get stuck in no-worker when required worker is removed #7346

Open

gjoseph92 mentioned this issue Jan 28, 2023

Scheduler TaskState objects should be unique, not hashed by key #7510

Open

hendrikmakait mentioned this issue Apr 25, 2023

Data loss possible with P2P shuffle when worker returns with same address #7798

Closed
