
Unblock event loop while waiting for ThreadpoolExecutor to shut down #5883

Merged

fjetter merged 4 commits into dask:main from unblock_shutdown_event_loop on Mar 8, 2022

Conversation

@fjetter fjetter (Member) commented Mar 1, 2022

I don't exactly understand what's happening, but this may explain why we're occasionally seeing a lot of "event loop was blocked ..." warnings in our test suites. It may also explain some other spurious timeout errors.

What's happening is that, by default, we close the workers after a test: gen_cluster closes them in end_cluster via Worker.close(report=False), i.e. the default values of timeout=30 and executor_wait=True apply.

I went through the test_semaphore cases because I noticed they were all very slow, and I figured I'd adjust a few parameters to speed the whole thing up. The test test_close_async was particularly interesting since it defines a timeout of 120s, which is required because a task is scheduled (fire_and_forget) that never finishes, since it waits on a locked semaphore. Therefore, the ThreadPool can never shut down gracefully. Adding a timeout to this call didn't actually help; instead, I saw this warning pop up.

It turns out that executor.shutdown blocks the event loop while waiting for a threading lock to be released:

[screenshot from the original PR]

The code I wrote resolves all of this but it feels weird...
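
For illustration, here is a minimal sketch of the general approach, not the exact code in this PR: run the blocking executor.shutdown(wait=True) call in another thread via loop.run_in_executor so the event loop stays responsive while the pool drains. The helper name and the timeout handling are assumptions made for this example.

import asyncio
from concurrent.futures import ThreadPoolExecutor


async def shutdown_executor_without_blocking(executor: ThreadPoolExecutor, timeout=None):
    # Hypothetical helper; the actual implementation in distributed may differ.
    loop = asyncio.get_running_loop()
    # executor.shutdown(wait=True) blocks on a threading lock until all running
    # tasks have finished, so run it in another thread instead of calling it
    # directly from a coroutine.
    shutdown = loop.run_in_executor(None, lambda: executor.shutdown(wait=True))
    try:
        await asyncio.wait_for(shutdown, timeout)
    except asyncio.TimeoutError:
        # Stop waiting; prevent new work from being submitted and let the
        # already-running threads finish in the background.
        executor.shutdown(wait=False)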

cc @graingert @gjoseph92 @crusaderky

@fjetter fjetter (Member, Author) commented Mar 1, 2022

Particularly concerning is that the default value for Worker.close(timeout=30) is identical to our test timeout. See also #5791, where we're discussing setting the connect timeout to the same value. We might want to be more deliberate in picking these values, since there is clearly a hierarchy involved that we should respect.
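
To make that hierarchy concrete, here is a toy sketch; the constant names and values are assumptions for the example, not distributed's actual settings. The point is simply that each inner timeout should be strictly smaller than the one wrapping it, so the inner operation can fail and report before the outer one fires.

# Hypothetical values illustrating the nesting discussed above.
TEST_TIMEOUT = 30          # outermost: the test itself (e.g. gen_cluster)
WORKER_CLOSE_TIMEOUT = 20  # Worker.close must give up before the test does
CONNECT_TIMEOUT = 10       # connect attempts must give up before close does

assert CONNECT_TIMEOUT < WORKER_CLOSE_TIMEOUT < TEST_TIMEOUT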

@fjetter fjetter requested review from graingert and crusaderky March 1, 2022 14:29
@fjetter fjetter force-pushed the unblock_shutdown_event_loop branch 4 times, most recently from 2eb9c4d to 223114c on March 1, 2022 16:20
Review threads (outdated, resolved):

  • distributed/compatibility.py (×2)
  • distributed/worker.py
  • distributed/tests/test_worker.py (×3)
@github-actions github-actions bot (Contributor) commented Mar 1, 2022

Unit Test Results

    12 files ±0        12 suites ±0       6h 50m 48s ⏱️ (−9m 53s)
 2 624 tests +1     2 539 ✔️ −2     81 💤 +1     4 ❌ +3
15 668 runs +6     14 802 ✔️ +4    861 💤 ±0    5 ❌ +3

For more details on these failures, see this check.

Results for commit c608315. Comparison against base commit 39c5e88.

♻️ This comment has been updated with latest results.

@graingert graingert (Member) commented Mar 2, 2022

I'm wondering if it's better to use `executor = w.executors["default"]` directly; then you don't have to worry about pickling a threading.Event():

import asyncio
import threading
from time import sleep

from distributed import Worker
from distributed.utils_test import gen_cluster


@gen_cluster(nthreads=[])
async def test_do_not_block_event_loop_during_shutdown(s):
    loop = asyncio.get_running_loop()
    called_handler = threading.Event()
    block_handler = threading.Event()

    w = await Worker(s.address)
    executor = w.executors["default"]

    async def block():
        def fn():
            called_handler.set()
            assert block_handler.wait(10)

        await loop.run_in_executor(executor, fn)

    async def set_future():
        while True:
            try:
                await loop.run_in_executor(executor, sleep, 0.1)
            except RuntimeError:  # executor has started shutting down
                block_handler.set()
                return

    async def close():
        called_handler.wait()
        # executor_wait is True by default but we want to be explicit here
        await w.close(executor_wait=True)

    await asyncio.gather(block(), close(), set_future())

@fjetter fjetter mentioned this pull request Mar 4, 2022
3 tasks
@fjetter fjetter self-assigned this Mar 4, 2022
@fjetter fjetter force-pushed the unblock_shutdown_event_loop branch from 0077270 to c608315 on March 7, 2022 16:20
@fjetter fjetter (Member, Author) commented Mar 8, 2022

The failing tests on ubu are stuck while closing a worker, so I suspect this is related to #5910:

  • ubu py3.8 test_worker_stream_died_during_comm
  • ubu py3.9 test_missing_data_errant_worker
  • OSX py3.9 ci1 test_missing_data_errant_worker

The other test failures are unrelated as far as I can tell:

  • OSX py3.8 test_reconnect, a known offender
  • OSX py3.9 (not ci) test_dashboard_non_standard_ports: connection error while trying to connect a client; I assume this is unrelated.

@fjetter fjetter merged commit de94b40 into dask:main Mar 8, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Mar 18, 2022
crusaderky added a commit that referenced this pull request Mar 18, 2022
@fjetter fjetter deleted the unblock_shutdown_event_loop branch April 29, 2022 12:26
mrocklin pushed a commit that referenced this pull request Apr 29, 2022
…6091)

This reinstates #5883
which was reverted in #5961 / #5932

I could confirm the flakiness of `test_missing_data_errant_worker` after this change and am reasonably certain it is caused by #5910, which causes a closing worker to be restarted such that, even after `Worker.close` is done, the worker still appears to be partially up.

The only reason I can see for why this change promotes that behaviour is that, since we no longer block the event loop while the threadpool is closing, there is a much larger window for incoming requests to come in and be processed while close is running.

Closes #6239
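
To illustrate the window described above, here is a toy sketch, not distributed's actual Worker code: once close() starts awaiting a non-blocking executor shutdown, other coroutines, including request handlers, get a chance to run, so a handler may want to check a closing flag and refuse new work.

import asyncio


class ToyWorker:
    # Minimal stand-in used only to illustrate the race window; all names
    # here are hypothetical.
    def __init__(self):
        self.closing = False

    async def handle_request(self, msg):
        if self.closing:
            # Requests arriving while close() awaits the executor shutdown
            # land here instead of being processed as normal work.
            return {"status": "closing"}
        return {"status": "OK", "echo": msg}

    async def close(self):
        self.closing = True
        # Stand-in for awaiting the threadpool shutdown without blocking
        # the event loop; other handlers can run during this await.
        await asyncio.sleep(0.1)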