Is it intended that any error from a handler makes `Server.handle_stream` close the comm? #5483

gjoseph92 · 2021-10-29T19:34:07Z

In #5480, an unexpected error (#5482) in a stream handler caused the entire stream to close. It turns out we didn't properly reconnect when the worker<->scheduler stream closed (#5481), so things got into a broken state and deadlocked.

I'm curious, though, whether it makes sense for handle_stream to close the whole stream in this case. You can see here that there are a couple of unhandled error cases that would cause the comm to close:

distributed/distributed/core.py

Lines 559 to 564 in 11c41b5

    handler = self.stream_handlers[op]
    if is_coroutine_function(handler):
        self.loop.add_callback(handler, **merge(extra, msg))
        await gen.sleep(0)
    else:
        handler(**merge(extra, msg))

1. `stream_handlers[op]` does not exist (message requesting an invalid op).
2. `handler(**merge(extra, msg))` raises an error (what happened here).

Interestingly, if handler is async, the stream will stay open even if handler fails whenever it runs in the future. Why the inconsistency with synchronous?
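To make the inconsistency concrete, here is a minimal sketch using plain `asyncio` rather than distributed's actual loop machinery. The names `HANDLERS`, `ok`, `sync_fail`, and `async_fail` are hypothetical, and this toy `handle_stream` only mimics the dispatch branch quoted above; it is not the real `Server.handle_stream`.

```python
import asyncio
import inspect


def ok(**kwargs):
    """Synchronous handler that succeeds."""


def sync_fail(**kwargs):
    raise ValueError("sync handler failed")


async def async_fail(**kwargs):
    raise ValueError("async handler failed")


# Hypothetical handler table standing in for self.stream_handlers.
HANDLERS = {"ok": ok, "sync-fail": sync_fail, "async-fail": async_fail}


async def handle_stream(messages, background_errors):
    """Toy dispatch loop; returns the ops dispatched before the loop stopped."""
    dispatched = []
    tasks = []
    try:
        for msg in messages:
            handler = HANDLERS[msg["op"]]  # KeyError on an unknown op
            if inspect.iscoroutinefunction(handler):
                # Scheduled to run on its own: a failure lands in the task,
                # not in this loop, so the loop keeps going.
                tasks.append(asyncio.ensure_future(handler()))
                await asyncio.sleep(0)
            else:
                handler()  # a failure propagates and ends the loop
            dispatched.append(msg["op"])
    except Exception as exc:
        # Stand-in for "the comm gets closed".
        dispatched.append(f"closed by {type(exc).__name__}")
    # Collect background failures without re-raising them.
    for result in await asyncio.gather(*tasks, return_exceptions=True):
        if isinstance(result, Exception):
            background_errors.append(result)
    return dispatched
```

Feeding it an `async-fail` message followed by `ok` and `sync-fail` shows the loop surviving the async failure and stopping at the synchronous one; an unknown op stops it immediately with a `KeyError`.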

I'm not quite sure what the protocol is meant to be here. Is closing the comm the only way we have to tell the sender that something went wrong, so we're using it as a signal? Or do we believe that after any failure, we can't trust subsequent messages to be valid, so we should give up and wait to restart the connection if desired?

fjetter · 2021-11-02T13:49:58Z

> Interestingly, if handler is async, the stream will stay open even if handler fails whenever it runs in the future. Why the inconsistency with synchronous?

Long-known problem. While trying to fix this I ran into a bazillion other smaller inconsistencies, and that escalated into an unmergeable PR; see #4734. Also very interesting in this space is #5443.

> I'm not quite sure what the protocol is meant to be here. Is closing the comm the only way we have to tell the sender that something went wrong, so we're using it as a signal? Or do we believe that after any failure, we can't trust subsequent messages to be valid, so we should give up and wait to restart the connection if desired?

I believe closing the stream is a radical but safe way to deal with this. I'm not sure how else you'd like to handle that exception, since you do not know what the exception is and therefore cannot implement a clean exception handler. However, what's even more important than whether or not we close the stream is how the re-raised exception is handled, since handle_stream is always wrapped:

distributed/distributed/worker.py

Lines 1236 to 1248 in 11c41b5

    try:
        await self.handle_stream(
            comm, every_cycle=[self.ensure_communicating, self.ensure_computing]
        )
    except Exception as e:
        logger.exception(e)
        raise
    finally:
        if self.reconnect and self.status in RUNNING:
            logger.info("Connection to scheduler broken.  Reconnecting...")
            self.loop.add_callback(self.heartbeat)
        else:
            await self.close(report=False)

distributed/distributed/scheduler.py

Lines 5296 to 5300 in 11c41b5

    try:
        await self.handle_stream(comm=comm, extra={"client": client})
    finally:
        self.remove_client(client=client)
        logger.debug("Finished handling client %s", client)

distributed/distributed/scheduler.py

Lines 5510 to 5515 in 11c41b5

    try:
        await self.handle_stream(comm=comm, extra={"worker": worker})
    finally:
        if worker in self.stream_comms:
            worker_comm.abort()
            await self.remove_worker(address=worker)

What they all have in common is that they trigger a removal of the involved server, which in turn may reconnect. I would argue the stream handling is fine. If any problems pop up, they are related to the reconnects.
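The shared pattern in those three call sites can be sketched in isolation. This is a simplified illustration under assumptions, not the scheduler or worker code: `handle_worker`, `registry`, `events`, and the `fail` flag are hypothetical names, and the toy `handle_stream` only stands in for a stream that either ends cleanly or raises.

```python
import asyncio


async def handle_stream(fail):
    """Toy stream loop: returns normally, or raises as a failing handler would."""
    if fail:
        raise RuntimeError("handler raised")


async def handle_worker(worker, registry, events, fail):
    """Wrap the stream so cleanup runs however the stream ends."""
    registry.add(worker)
    try:
        await handle_stream(fail)
    finally:
        # Runs on a clean close and on an exception alike; the exception
        # (if any) still propagates to the caller afterwards.
        registry.discard(worker)
        events.append(f"removed {worker}")
```

Because the cleanup sits in `finally`, the wrapper cannot tell a deliberate close from a handler failure unless it also inspects the exception, which is why the behaviour after removal (for example, whether a reconnect happens) matters more than the close itself.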

gjoseph92 changed the title ~~Should Server.handle_stream suppress exceptions from handlers?~~ Is it intended that any error from a handler makes Server.handle_stream close the comm? Oct 29, 2021

gjoseph92 mentioned this issue Oct 29, 2021

Worker reconnection deadlock #5480

Closed

4 tasks

fjetter mentioned this issue Jan 21, 2022

Conditions under which a TCP connection may fail / close? #5678

Closed

fjetter mentioned this issue Feb 9, 2022

who_has not set for task in state fetch #5751

Closed

fjetter mentioned this issue Mar 21, 2022

the "distributed.worker" logger gains ~thousand DequeHandler instances during a pytest run #5973

Closed

gjoseph92 mentioned this issue Apr 27, 2022

Add fail_hard decorator for worker methods #6210

Merged
