Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProcessPoolExecutor shutdown hangs after future cancel was requested #94440

Closed
yonatanp opened this issue Jun 30, 2022 · 13 comments
Closed

ProcessPoolExecutor shutdown hangs after future cancel was requested #94440

yonatanp opened this issue Jun 30, 2022 · 13 comments
Assignees
Labels
topic-multiprocessing type-bug An unexpected behavior, bug, or error

Comments

@yonatanp
Copy link
Contributor

yonatanp commented Jun 30, 2022

Bug report

With a ProcessPoolExecutor, after submitting and quickly canceling a future, a call to shutdown(wait=True) would hang indefinitely.
This happens pretty much on all platforms and all recent Python versions.

Here is a minimal reproduction:

import concurrent.futures

ppe = concurrent.futures.ProcessPoolExecutor(1)
ppe.submit(int).result()
ppe.submit(int).cancel()
ppe.shutdown(wait=True)

The first submission gets the executor going and creates its internal queue_management_thread.
The second submission appears to get that thread to loop, enter a wait state, and never receive a wakeup event.

Introducing a tiny sleep between the second submit and its cancel request makes the issue disappear. From my initial observation it looks like something in the way the queue_management_worker internal loop is structured doesn't handle this edge case well.

Shutting down with wait=False would return immediately as expected, but the queue_management_thread would then die with an unhandled OSError: handle is closed exception.

Environment

  • Discovered on macOS-12.2.1 with cpython 3.8.5.
  • Reproduced in Ubuntu and Windows (x64) as well, and in cpython versions 3.7 to 3.11.0-beta.3.
  • Reproduced in pypy3.8 as well, but not consistently. Seen for example in Ubuntu with Python 3.8.13 (PyPy 7.3.9).

Additional info

When tested with pytest-timeout under Ubuntu and cpython 3.8.13, these are the tracebacks at the moment of timing out:

_____________________________________ test _____________________________________
    @pytest.mark.timeout(10)
    def test():
        ppe = concurrent.futures.ProcessPoolExecutor(1)
        ppe.submit(int).result()
        ppe.submit(int).cancel()
>       ppe.shutdown(wait=True)
test_reproduce_python_bug.py:14:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/concurrent/futures/process.py:686: in shutdown
    self._queue_management_thread.join()
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py:1011: in join
    self._wait_for_tstate_lock()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Thread(QueueManagerThread, started daemon 140003176535808)>
block = True, timeout = -1
    def _wait_for_tstate_lock(self, block=True, timeout=-1):
        # Issue #18808: wait for the thread state to be gone.
        # At the end of the thread's life, after all knowledge of the thread
        # is removed from C data structures, C code releases our _tstate_lock.
        # This method passes its arguments to _tstate_lock.acquire().
        # If the lock is acquired, the C code is done, and self._stop() is
        # called.  That sets ._is_stopped to True, and ._tstate_lock to None.
        lock = self._tstate_lock
        if lock is None:  # already determined that the C code is done
            assert self._is_stopped
>       elif lock.acquire(block, timeout):
E       Failed: Timeout >10.0s
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py:1027: Failed
----------------------------- Captured stderr call -----------------------------
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
~~~~~~~~~~~~~~~~~ Stack of QueueFeederThread (140003159754496) ~~~~~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/queues.py", line 227, in _feed
    nwait()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
~~~~~~~~~~~~~~~~ Stack of QueueManagerThread (140003176535808) ~~~~~~~~~~~~~~~~~
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/concurrent/futures/process.py", line 362, in _queue_management_worker
    ready = mp.connection.wait(readers + worker_sentinels)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++

Tracebacks in PyPy are similar on the concurrent.futures.process level. Tracebacks in Windows are different in the lower-level areas, but again similar on the concurrent.futures.process level.

Linked PRs:

Linked PRs

@yonatanp yonatanp added the type-bug An unexpected behavior, bug, or error label Jun 30, 2022
yonatanp added a commit to yonatanp/cpython that referenced this issue Jun 30, 2022
yonatanp added a commit to yonatanp/cpython that referenced this issue Jun 30, 2022
@yonatanp
Copy link
Contributor Author

yonatanp commented Jul 1, 2022

Submitted a PR with test and fix. It can be easily backported, and I’ll be happy to get on it too once we merge on the main line :)

It took some time to pin point the issue, but once done, the fix is pretty surgical. The hang scenario happens when the shutdown sequence begins when the pending work items contains futures, and the all get canceled before the next wait. The addition makes sure there are always some running futures when we wait post shutdown, or none at all, in which case the children stopping kicks off.

@yonatanp
Copy link
Contributor Author

@AlexWaygood do you know which expert in multiprocessing I could tag here? I'm eager to see this resolved in the next versions :)

@yonatanp
Copy link
Contributor Author

Hi @brianquinlan,
I'd love your feedback on this issue and my suggested fix #94468 :)

@b-kamphorst
Copy link

I experience the same issue (Python 3.8.16). It would be great if a fix (presumably #94468) got merged and backported! @brianquinlan, with the understanding that maintainers are busy and appreciated, would you have time to consider the PR?

@AlexWaygood
Copy link
Member

AlexWaygood commented Mar 14, 2023

I experience the same issue (Python 3.8.16). It would be great if a fix (presumably #94468) got merged and backported! @brianquinlan, with the understanding that maintainers are busy and appreciated, would you have time to consider the PR?

FYI, 3.8 is end of life (3.9 is also) — it now only receives fixes for security-related issues, and this isn't security-related. If you want bugfixes, you'll have to upgrade to at least Python 3.10, which is the oldest Python version that still has bugfixes backported to it.

@brianquinlan
Copy link
Contributor

I'm not active on Python right now so I would not use my commit powers in any case :-(

@yonatanp
Copy link
Contributor Author

That’s good to know, thanks anyway!
I was able to reproduce the issue on 3.10 and 3.11 at the time of opening the issue.
It would be really great to merge the PR. The fix is delicate but is pretty straightforward.

What if the way forward to get this merged? @AlexWaygood do you know maybe?

@AlexWaygood
Copy link
Member

I'm sorry you've had to wait such a long time for a review :( unfortunately we don't really have any multiprocessing experts on the core dev team at the moment, and it's a complex module, so few core devs feel confident reviewing PRs relating to multiprocessing.

I'll ask around the core dev team and see if anybody could take a look.

@yonatanp
Copy link
Contributor Author

yonatanp commented Mar 14, 2023

Thanks! Would be great to get this moving. The bug has been hiding there for a very long time, but when it gets triggered, it is quite painful.

@b-kamphorst
Copy link

b-kamphorst commented Mar 14, 2023

I experience the same issue (Python 3.8.16). It would be great if a fix (presumably #94468) got merged and backported! @brianquinlan, with the understanding that maintainers are busy and appreciated, would you have time to consider the PR?

FYI, 3.8 is end of life (3.9 is also) — it now only receives fixes for security-related issues, and this isn't security-related. If you want bugfixes, you'll have to upgrade to at least Python 3.10, which is the oldest Python version that still has bugfixes backported to it.

That is quite valid. Our packages are intended to work for python 3.7-3.11, so the PR still solves part of the problem. I did not experience issues with python>=3.9 (where I can call shutdown(wait=False, cancel_futures=True)), but I don't know if that call is bug-free or whether I was just lucky. At least I now understand why shutdown(wait=True) hangs -- I've added a time.sleep(0.001) just before shutdown to prevent the hang and this seems to be a functional work-around for python<3.9.

@yonatanp
Copy link
Contributor Author

I've added a time.sleep(0.001) just before shutdown to prevent the hang and this seems to be a functional work-around for python<3.9.

Yeah, I’ve been doing exactly the same.

@gpshead gpshead self-assigned this Mar 16, 2023
gpshead pushed a commit that referenced this issue Mar 16, 2023
Fix an issue of concurrent.futures ProcessPoolExecutor shutdown hanging.

Co-authored-by: Alex Waygood <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 16, 2023
…thonGH-94468)

Fix an issue of concurrent.futures ProcessPoolExecutor shutdown hanging.

(cherry picked from commit 2dc9463)

Co-authored-by: yonatanp <[email protected]>
Co-authored-by: Alex Waygood <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 16, 2023
…thonGH-94468)

Fix an issue of concurrent.futures ProcessPoolExecutor shutdown hanging.

(cherry picked from commit 2dc9463)

Co-authored-by: yonatanp <[email protected]>
Co-authored-by: Alex Waygood <[email protected]>
@gpshead
Copy link
Member

gpshead commented Mar 16, 2023

Thanks for the fix! 3.9 and earlier only receive security fixes at this point.

@gpshead gpshead closed this as completed Mar 16, 2023
miss-islington added a commit that referenced this issue Mar 16, 2023
Fix an issue of concurrent.futures ProcessPoolExecutor shutdown hanging.

(cherry picked from commit 2dc9463)

Co-authored-by: yonatanp <[email protected]>
Co-authored-by: Alex Waygood <[email protected]>
miss-islington added a commit that referenced this issue Mar 16, 2023
Fix an issue of concurrent.futures ProcessPoolExecutor shutdown hanging.

(cherry picked from commit 2dc9463)

Co-authored-by: yonatanp <[email protected]>
Co-authored-by: Alex Waygood <[email protected]>
carljm added a commit to carljm/cpython that referenced this issue Mar 17, 2023
* main: (34 commits)
  pythongh-102701: Fix overflow in dictobject.c (pythonGH-102750)
  pythonGH-78530: add support for generators in `asyncio.wait` (python#102761)
  Increase stack reserve size for Windows debug builds to avoid test crashes (pythonGH-102764)
  pythongh-102755: Add PyErr_DisplayException(exc) (python#102756)
  Fix outdated note about 'int' rounding or truncating (python#102736)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives (python#102760)
  pythongh-99726: Improves correctness of stat results for Windows, and uses faster API when available (pythonGH-102149)
  pythongh-102192: remove redundant exception fields from ssl module socket (python#102466)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives (python#102743)
  pythongh-102737: Un-ignore ceval.c in the CI globals check (pythongh-102745)
  pythonGH-102748: remove legacy support for generator based coroutines from `asyncio.iscoroutine` (python#102749)
  pythongh-102721: Improve coverage of `_collections_abc._CallableGenericAlias` (python#102722)
  pythonGH-102653: Make recipe docstring show the correct distribution (python#102742)
  Add comments to `{typing,_collections_abc}._type_repr` about each other (python#102752)
  pythongh-102594: PyErr_SetObject adds note to exception raised on normalization error (python#102675)
  pythongh-94440: Fix issue of ProcessPoolExecutor shutdown hanging (python#94468)
  pythonGH-100112:  avoid using iterable coroutines in asyncio internally (python#100128)
  pythongh-102690: Use Edge as fallback in webbrowser instead of IE (python#102691)
  pythongh-102660: Fix Refleaks in import.c (python#102744)
  pythongh-102738: remove from cases generator the code related to register instructions (python#102739)
  ...
Fidget-Spinner pushed a commit to Fidget-Spinner/cpython that referenced this issue Mar 27, 2023
…thon#94468)

Fix an issue of concurrent.futures ProcessPoolExecutor shutdown hanging.

Co-authored-by: Alex Waygood <[email protected]>
warsaw pushed a commit to warsaw/cpython that referenced this issue Apr 11, 2023
…thon#94468)

Fix an issue of concurrent.futures ProcessPoolExecutor shutdown hanging.

Co-authored-by: Alex Waygood <[email protected]>
@maximilianmordig
Copy link

I think it is still not working on Python3.10.11.
Pressing Ctrl-C after 1-2 seconds, it still runs all submitted jobs:

import concurrent.futures
import tqdm

import time
# fun = lambda x: time.sleep(x)
def fun(x):
    print("Sleeping", x)
    time.sleep(2*x)
    return x
args_list = list(range(10))
max_workers=2


try:
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        with tqdm.tqdm(total=len(args_list)) as pbar:
            for args in args_list:
                futures.append(executor.submit(fun, args))
            results = []
            for future in concurrent.futures.as_completed(futures):
                results.append(future.result())
                pbar.update(1)
    # also not working
    # with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
    #     results = list(tqdm.tqdm(executor.map(fun, args_list), total=len(args_list)))
except KeyboardInterrupt:
    # executor.shutdown(wait=False)
    print("Cancelling")
    time.sleep(0.1)
    executor.shutdown(wait=False, cancel_futures=True)
    print("Cancelled")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-multiprocessing type-bug An unexpected behavior, bug, or error
Projects
Development

No branches or pull requests

6 participants