Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leaked semaphore objects in test_concurrent_futures #104090

Open
furkanonder opened this issue May 2, 2023 · 11 comments
Open

Leaked semaphore objects in test_concurrent_futures #104090

furkanonder opened this issue May 2, 2023 · 11 comments
Labels
tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@furkanonder
Copy link
Contributor

furkanonder commented May 2, 2023

Bug report

$ ./python -m test test_concurrent_futures
0:00:00 load avg: 5.30 Run tests sequentially
0:00:00 load avg: 5.30 [1/1] test_concurrent_futures
/workspace/cpython/Lib/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=150426) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
/workspace/cpython/Lib/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Traceback (most recent call last):
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/cpython/Lib/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/cpython/Lib/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/cpython/Lib/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/cpython/Lib/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/cpython/Lib/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/workspace/cpython/Lib/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/cpython/Lib/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
test_concurrent_futures passed in 2 min 17 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 2 min 17 sec
Tests result: SUCCESS

Your environment

  • CPython versions tested on: CPython 3.12.0a7+
  • Operating system and architecture: Ubuntu 22.04.2 LTS | Linux-5.15.0-47-generic-x86_64-with-glibc2.35 little-endian

Linked PRs

@furkanonder furkanonder added the type-bug An unexpected behavior, bug, or error label May 2, 2023
@arhadthedev arhadthedev added the tests Tests in the Lib/test dir label May 3, 2023
@sunmy2019
Copy link
Member

The leakage is from FailingInitializerMixin. Don't know how to fix though.

@sobolevn
Copy link
Member

sobolevn commented May 3, 2023

Two failing tests:

  • test.test_concurrent_futures.ProcessPoolForkserverFailingInitializerTest.test_initializer
  • test.test_concurrent_futures.ProcessPoolSpawnFailingInitializerTest.test_initializer

@bityob
Copy link
Contributor

bityob commented Jul 14, 2023

  1. Found out using git bisect that the first failed commit is 6883007, bpo-4080: unittest durations #12271
  2. I splited the commit to multiple new commits each per file, so I can use git bisect again to find the specific problematic file
  3. The file is unittest/result.py and this is the git diff
diff --git a/Lib/unittest/result.py b/Lib/unittest/result.py
index 5ca4c23238..fa9bea47c8 100644
--- a/Lib/unittest/result.py
+++ b/Lib/unittest/result.py
@@ -43,6 +43,7 @@ def __init__(self, stream=None, descriptions=None, verbosity=None):
       self.skipped = []
       self.expectedFailures = []
       self.unexpectedSuccesses = []
+        self.collectedDurations = []
       self.shouldStop = False
       self.buffer = False
       self.tb_locals = False
@@ -157,6 +158,12 @@ def addUnexpectedSuccess(self, test):
       """Called when a test was expected to fail, but succeed."""
       self.unexpectedSuccesses.append(test)

+    def addDuration(self, test, elapsed):
+        """Called when a test finished to run, regardless of its outcome."""
+        # support for a TextTestRunner using an old TestResult class
+        if hasattr(self, "collectedDurations"):
+            self.collectedDurations.append((test, elapsed))
+
   def wasSuccessful(self):
       """Tells whether or not this result was a success."""
       # The hasattr check is for test_result's OldResult test.  That
  1. Indeed when I put a return before self.collectedDurations.append((test, elapsed)) , all is OK
$ ./python -m unittest test.test_concurrent_futures.ProcessPoolSpawnFailingInitializerTest.test_initializer 2>&1
.
----------------------------------------------------------------------
Ran 1 test in 0.398s

OK
  1. I found out with some debugging and print that on the failed case (collectedDurations.append) the resource_tracker is called before the sem_unlink which will try to unlink them later and fail, since the resource tracker already did it. Only 3 semaphore objects are failing here, trying to unlink after addDuration function.
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-4lu9pdgs
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-61ynpjfw
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-wrenmqfa
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-e01eptc5
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-sssjxme9
0.27s run addDuration
Lib/multiprocessing/resource_tracker.py:236: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
call _CLEANUP_FUNCS semaphore function for /mp-u8k6h3v7 file
call _CLEANUP_FUNCS semaphore function for /mp-ihqrt4f2 file
call _CLEANUP_FUNCS semaphore function for /mp-n4h31gcc file
suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-n4h31gcc
FileNotFoundError: [Errno 2] No such file or directory
suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-u8k6h3v7
FileNotFoundError: [Errno 2] No such file or directory
suite.py(132)run() -> self._handleModuleTearDown(result) -> run sem_unlink(name) on /mp-ihqrt4f2
FileNotFoundError: [Errno 2] No such file or directory

(reduced the output so it will be clearer)
5. In other hand, when I ignore this line, the sem_ulink for those 3 semaphore objects are called suceesfully after the addDuration function but from another trace

case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-ch65p5qm
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-z2m0y_8x
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-mi_764q8
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-ntvllres
case.py(638)run() -> self._callTearDown() -> run sem_unlink(name) on /mp-f5553813
0.27s run addDuration
.suite.py(122)test(result) -> suite.py(84)return self.run(*args, **kwds) -> run sem_unlink(name) on /mp-jvota4jy
suite.py(122)test(result) -> suite.py(84)return self.run(*args, **kwds) -> run sem_unlink(name) on /mp-hxf6fzql
suite.py(122)test(result) -> suite.py(84)return self.run(*args, **kwds) -> run sem_unlink(name) on /mp-fusub384
  1. I'm trying now to understand what caused the resource_tracker to trigger befure the sem_unlink. For now I think that for some reason the sem_unlink from TestSuite line 122 is not reached, and thefore the resouce tracker is trying to release the resources. And later the call to sem_unlink from TestSuite line 132 is trying to run and fails
    image
  2. The key is to understand why when the self.collectedDurations is not empty, it blocks us from cleaning as usual those 3 semaphore objects

@bityob
Copy link
Contributor

bityob commented Jul 14, 2023

Ok, found more -
The leaked semaphore objects are from FailingInitializerMixin (as @sunmy2019 mentioned before, but now it's clearer for me).

image

This line creates a new queue with 3 semaphore related and when the collectedDurations is not empty the self.log_queue is not released (still in used I guess)

################
FailingInitializerMixin.setUp: Calling self.mp_context.Queue()
Queue __init__: Set self._rlock = ctx.Lock()
Add semlock to register and finalize self._semlock.name='/mp-lnmmngl0', self.__class__.__name__='Lock'
Queue __init__: Set self._wlock = ctx.Lock()
Add semlock to register and finalize self._semlock.name='/mp-5fhxbfsv', self.__class__.__name__='Lock'
Queue __init__: Set ctx.BoundedSemaphore(maxsize)
Add semlock to register and finalize self._semlock.name='/mp-8mzh94sd', self.__class__.__name__='BoundedSemaphore'
################

Still investigating

@shailshouryya
Copy link
Contributor

For some additional context, I also ran git bisect and landed on the same commit identified above. I did not get as far as the post above me did in terms of identifying the exact reasons why the 6 leaked semaphore objects were not cleaned up before shutdown, but I also narrowed down the problem and found the 6 leaked semaphore objects warning goes away when the following block (also mentioned above):

    def addDuration(self, test, elapsed):
        """Called when a test finished to run, regardless of its outcome."""
        # support for a TextTestRunner using an old TestResult class
        if hasattr(self, "collectedDurations"):
            self.collectedDurations.append((test, elapsed))

is completely commented out:

#    def addDuration(self, test, elapsed):
#        """Called when a test finished to run, regardless of its outcome."""
#        # support for a TextTestRunner using an old TestResult class
#        if hasattr(self, "collectedDurations"):
#            self.collectedDurations.append((test, elapsed))

Another thing that I found interesting was the leaked semaphore objects look like TemporaryFiles:

/mp-ch65p5qm
/mp-z2m0y_8x
/mp-mi_764q8
/mp-ntvllres
/mp-f5553813

However, I'm not sure about this, since it does not look like TemporaryFileis called anywhere in test_concurrent_futures.py except for

from tempfile import TemporaryFile
with TemporaryFile(mode="w+") as f:
but this does not look relevant for the problem here (but I'm adding this observation here in case it is relevant).

@bityob
Copy link
Contributor

bityob commented Jul 14, 2023

@shailshouryya

Not related to TemporaryFile, those are virtual files used by each semaphore object as the system locking mechanisem

Accessing named semaphores via the file system
On Linux, named semaphores are created in a virtual file system, normally mounted under /dev/shm

Source: https://linux.die.net/man/7/sem_overview

For example, if you stop the process in the middle

image

you will be able to see those virtual files here

image

which will be deleted on _multiprocessing.sem_unlink method

@bityob
Copy link
Contributor

bityob commented Jul 14, 2023

Ok, found how to fix the issue! 🚀

Full details

The current code calls self.collectedDurations.append((test, elapsed)) on addDuration method (as wrote in previous comment).

This causes the 3 sempahors of FailingInitializerMixin.log_queue to not being released before the resource tracker is stopped and clear by itself all left resources.

Example 1 (without .collectedDurations.append), the ResourceTracker is stopped after we clear the 3 semaphors -

image

Example 2 (current code), the ResourceTracker is stopped before we clear the 3 semaphors -

image

This happens since we call _run_finalizers() after resource_tracker._resource_tracker._stop()
image

So, optional solution No.1 will be to change the order of the calls, like this -

image

And it solves the issue perfectly
image

@vstinner -
I saw that you were the one who wrote this part,
Can we do such change?
If yes, I'll make a PR for it.

@iritkatriel -
I'll happy to here what you think here, thanks.

Still open question for me, is why the objects are not finalize normally (by the gc I assume) like it was before.
I'm still investigating on this

@bityob
Copy link
Contributor

bityob commented Jul 15, 2023

Hey,

Found better solution and also what's happening here.

The reason those 3 semaphores were still alive and not deleted, it's because there weren't garabage collected yet, since they have a reference.

How?

Those 3 semaphors are related to FailingInitializerMixin.log_queue object (they are created as part of this queue).

This mixin is used to create the test class ProcessPoolSpawnFailingInitializerTest which was used on the test result private object collectedDurations, and since the result object continue to live after the test object ended, the test object can't be collected and therefore the semaphors too...

Why we need the test object there?

For displaying it on unittest.runner._printDurations method.
image

So how to fix it?

Very simple, instead of passing the test object, we just need here the test repr only!

        if hasattr(self, "collectedDurations"):
            self.collectedDurations.append((str(test), elapsed))

And this works perfect! 🚀

I'll create a PR for this
@iritkatriel

@iritkatriel
Copy link
Member

Good find!

bityob added a commit to bityob/cpython that referenced this issue Jul 15, 2023
…rent_futures-debug-session' into pythongh-104090-fix-leaked-semaphors-on-test_concurrent_futures
bityob added a commit to bityob/cpython that referenced this issue Jul 15, 2023
bityob added a commit to bityob/cpython that referenced this issue Jul 15, 2023
…-leaked-semaphors-on-test_concurrent_futures
bityob added a commit to bityob/cpython that referenced this issue Jul 16, 2023
…-leaked-semaphors-on-test_concurrent_futures
bityob added a commit to bityob/cpython that referenced this issue Jul 16, 2023
…_futures' of github.com:bityob/cpython into pythongh-104090-fix-leaked-semaphors-on-test_concurrent_futures
bityob added a commit to bityob/cpython that referenced this issue Jul 16, 2023
bityob added a commit to bityob/cpython that referenced this issue Jul 17, 2023
bityob added a commit to bityob/cpython that referenced this issue Jul 18, 2023
bityob added a commit to bityob/cpython that referenced this issue Jul 18, 2023
bityob added a commit to bityob/cpython that referenced this issue Jul 18, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 19, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 19, 2023
iritkatriel pushed a commit that referenced this issue Jul 19, 2023
…106795) (#106888)

gh-104090: Fix unittest collectedDurations resources leak (GH-106795)
(cherry picked from commit 70b961e)

Co-authored-by: Yonatan Bitton <[email protected]>
bityob added a commit to bityob/cpython that referenced this issue Jul 19, 2023
bityob added a commit to bityob/cpython that referenced this issue Jul 19, 2023
…_futures-v2' of github.com:bityob/cpython into pythongh-104090-fix-leaked-semaphors-on-test_concurrent_futures-v2
bityob added a commit to bityob/cpython that referenced this issue Jul 20, 2023
bityob added a commit to bityob/cpython that referenced this issue Jul 23, 2023
bityob added a commit to bityob/cpython that referenced this issue Aug 31, 2023
…_futures-v2' of github.com:bityob/cpython into pythongh-104090-fix-leaked-semaphors-on-test_concurrent_futures-v2
pitrou added a commit to bityob/cpython that referenced this issue Sep 19, 2023
encukou added a commit to bityob/cpython that referenced this issue Feb 5, 2024
encukou added a commit that referenced this issue Feb 21, 2024
This builds on #106807, which adds
a return code to ResourceTracker, to make future debugging easier.
Testing this “in situ” proved difficult, since the global ResourceTracker is
involved in test infrastructure. So, the tests here create a new instance and
feed it fake data.

---------

Co-authored-by: Yonatan Bitton <[email protected]>
Co-authored-by: Yonatan Bitton <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
@encukou
Copy link
Member

encukou commented Feb 22, 2024

That PR was a helper to make failures like this easier to diagnose.
The issue itself is not fixed; it still appears in refleak tests (./python -m test test_concurrent_futures -R:).

woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
…thonGH-115410)

This builds on python#106807, which adds
a return code to ResourceTracker, to make future debugging easier.
Testing this “in situ” proved difficult, since the global ResourceTracker is
involved in test infrastructure. So, the tests here create a new instance and
feed it fake data.

---------

Co-authored-by: Yonatan Bitton <[email protected]>
Co-authored-by: Yonatan Bitton <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
@zuliani99
Copy link

I get the same error in my case. I'm doing distributed training using spawn multiprocessing and DDP, also at the end of running my application I get a specific number of leaked semaphores.

This is the link of the issue.

I'm wondering if this issue is solved or not.

diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…thonGH-115410)

This builds on python#106807, which adds
a return code to ResourceTracker, to make future debugging easier.
Testing this “in situ” proved difficult, since the global ResourceTracker is
involved in test infrastructure. So, the tests here create a new instance and
feed it fake data.

---------

Co-authored-by: Yonatan Bitton <[email protected]>
Co-authored-by: Yonatan Bitton <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
Projects
Status: No status
Development

No branches or pull requests

9 participants