-
-
Notifications
You must be signed in to change notification settings - Fork 30.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refleak test failure in test_exceptions #89058
Comments
For background see https://bugs.python.org/issue33930#msg399403 iritkatriel@Irits-MBP cpython % repeat 10 ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory == Tests result: SUCCESS == 1 test OK. Total duration: 5.9 sec == Tests result: SUCCESS == 1 test OK. Total duration: 5.8 sec == Tests result: SUCCESS == 1 test OK. Total duration: 5.8 sec == Tests result: FAILURE == 1 test failed: 1 re-run test: Total duration: 5.9 sec == Tests result: FAILURE == 1 test failed: 1 re-run test: Total duration: 5.8 sec == Tests result: FAILURE == 1 test failed: 1 re-run test: Total duration: 6.0 sec == Tests result: FAILURE == 1 test failed: 1 re-run test: Total duration: 6.0 sec == Tests result: SUCCESS == 1 test OK. Total duration: 6.3 sec == Tests result: FAILURE == 1 test failed: 1 re-run test: Total duration: 6.1 sec |
I fail to reproduce the issue on Linux on the 3.10 branch. I interrupted the job after 20 minutes: $ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1
(...)
0:21:38 load avg: 2.01 [134] test_exceptions passed
beginning 6 repetitions
123456
...... I tested the commit 5d44443. |
I see it on Mac but not windows. Running the test with -v or adding prints makes the leak go away. I’m suspecting that one of the recursion tests fails to clear an exception or something like that. Removing any one of the tests makes it stop failing. |
Victor, if the …cycle2 test is hanging then you need the change in PR27626. |
I am not more lucky in the main branch. I ran it for 30 minutes and it didn't fail. $ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1
(...)
0:31:15 load avg: 1.88 [184] test_exceptions passed
beginning 6 repetitions
123456
......
^C
Kill <TestWorkerProcess #1 running test=test_exceptions pid=173796 time=4.2 sec> process group == Tests result: INTERRUPTED == 184 tests OK. Total duration: 31 min 19 sec I tested commit 8ac0886. |
I don't know why you need to kill it. It's not a long test. |
Aha, maybe there is something specific on Mac which triggers an exact code path to trigger to bug. I already saw bugs like that. You can try to play with gc.set_threshold() to change how garbage collection is triggered. |
Ah I see, -F. |
The problem disappears if I add a gc.collect() loop at the beginning of the new test: diff --git a/Lib/test/test_exceptions.py b/Lib/test/test_exceptions.py
index 79798ecf05..e0aeac9d10 100644
--- a/Lib/test/test_exceptions.py
+++ b/Lib/test/test_exceptions.py
@@ -1014,6 +1014,7 @@ def cycle():
def test_no_hang_on_context_chain_cycle2(self):
# See issue 25782. Cycle at head of context chain.
+ while gc.collect(): pass
class A(Exception):
pass |
This fails (test_recursion_normalizing_with_no_memory is not needed): ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler |
My smallest failing example so far: def test_recursion_in_except_handler(self):
self.test_no_hang_on_context_chain_cycle2()
def set_relative_recursion_limit(n):
depth = 1
while True:
try:
sys.setrecursionlimit(depth)
except RecursionError:
depth += 1
else:
break
sys.setrecursionlimit(depth+n)
def recurse_in_body_and_except():
try:
recurse_in_body_and_except()
except:
recurse_in_body_and_except()
recursionlimit = sys.getrecursionlimit()
try:
set_relative_recursion_limit(10)
try:
recurse_in_body_and_except()
except RecursionError:
pass
else:
self.fail("Should have raised a RecursionError")
finally:
sys.setrecursionlimit(recursionlimit) |
Out of curiosity, is the failure deterministic in environments where it fails? If not, what is the source of the indeterminism -- some kind of race condition or something else? |
It’s not deterministic. |
My hunch is it could be timing of GC. But it's hard to say at this point. |
I merged the workaround to get refleak buildbots unstuck for the weekend. I bumped the issue stage back to "needs patch" as the available one isn't a solution. |
Chris:
The GC uses counters and thresholds to decide which collection and when they should be collected. I tried to play with gc.set_threshold(), but I failed to reproduce the issue on Linux. Maybe it's a very precise threshold which triggers the issue. Between Linux and macOS, the site module executes different code paths which produce different GC counters. Sometimes, the GC must happen in a very precise line, one line later is too late. See bpo-44422 for a case of threading.enumerate() (different kind of bug, but it's related to the GC). See bpo-44184 for another example of GC bug which only occurred on Windows and only if you type an exact command line. |
I have tried to reproduce this on my Linux and Mac machines and I have been unable to do it for the time being in either of them. |
The script below hangs on my mac (it's an extract from test_recursion_in_except_handler). ----------- import sys
count = 0
def main():
def f():
global count
count += 1
try:
f()
except RecursionError:
f()
sys.setrecursionlimit(30) try: main()
print(count) When I kill it the traceback shows it alternating between the two recursive calls, but not in a regular pattern: ... [snipped a lot] File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
[Previous line repeated 1 more time]
RecursionError: maximum recursion depth exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/iritkatriel/src/cpython/tt.py", line 22, in <module>
main()
^^^^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 18, in main
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
[Previous line repeated 10 more times]
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 11, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 13, in f
f()
^^^
File "/Users/iritkatriel/src/cpython/tt.py", line 7, in f
def f():
KeyboardInterrupt |
Adding Mark since he worked on recursion recently. |
From my macOS, I was not able to reproduce the issue also as like as Victor and Pablo. repeat 10 ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory However hang was able to reproduce. |
I meant msg399539 |
Oops sorry. Without PR 27746, I was able to reproduce on my macOS environment. test_exceptions leaked [6, 6, 6] references, sum=18 |
Regarding the hang in msg399539, do we need some special handling of the case were a RecursionError is being raised and its context is another RecursionError? Like a FatalError? |
But this makes it not leak again, so now we know it's the traceback:
|
I am still amazed that I cannot reproduce the leaks on my machine. Maybe I am missing some steps. Could you outline how are you reproducing the leaks? |
The traceback is a very heavy gc object because it contains the frames, and in the locals of the frame is the traceback itself via the exception. This is a well known heavy cycle, but the gc should take care of it if the cycle becomes isolated |
This doesn't leak:
So this isn't about the traceback being part of the cycle. The B() needs to be there too. |
refleak.py seems to run forever for me. Got bored after 6000 iterations. Which O/S and configure options are needed for refleak.py to fail? |
I'm using a mac. It typically fails within at most 10-20 iterations. |
Sorry I missed this. Just run the script: iritkatriel@Irits-MBP cpython-1 % ./python.exe refleak.py |
There is some timing-related race condition at hand because I tried in a Linux machine and a MacBook Pro (intel) laptop and I wasn't able to reproduce after 20 min of the script running :( |
Irit: do you still reproduce the issue using gc.disable()? Or you can try different values to call gc.set_threshold(). You may also try different values in ns.huntrleaks, like: ns.huntrleaks = (3, 20, 'tt_out.py'). With 20 iterations, do you still reproduce the leak? To make the issue more likely, you can change check_rc_deltas() in Lib/test/libregrtest/refleak.py. Try: def check_rc_deltas(deltas):
return any(delta >= 1 for delta in deltas) With this change, the script stops immediately: test leaked [-4, 4, -4] references, sum=-4 Using 20 iterations, I get: test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] references, sum=0 |
If I disable the GC, the script no longer detects "leaks". Same using a GC threshold of 800. Using a GC threshold between 100 and 700, I reproduce the [-4, 4, -4] references leak. Funny bug. |
You can also play with the clear() method of frame objects and traceback.clear_frames() (which expects a traceback object). |
I think the leak happens when we invoke GC while the recursion limit is exceeded. It goes way if make this change: def recurse_in_body_and_except():
try:
recurse_in_body_and_except()
except RecursionError as e:
gc.disable()
recurse_in_body_and_except()
gc.enable() I also added a __del__ to B and saw it being called when recursion limit is small (though adding the __del__ makes the leak go away. Actually just adding "def f(): pass" is enough to make it go away). |
A variant of this problem: bpo-36560 with a leak 1 memory block per iteration. |
Victor - I am also getting a feeling that this may not be a real leak but rather a bug in the refleaks test bookkeeping. It could be a problem with the way _Py_RefTotal is incremented/decremented. |
It can be a cache which is kept alive and then clear the following iteration. See my previous comment: test leaked [-4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4, -4, 4] references, sum=0 libregrtest/refleak.py does its best to clear *all* caches, type caches, free lists, etc. A single cache which is not cleared can confused libregrtest. |
Oh, I see now what you meant. I think you're right that it's cleanup. I changed dash_R_cleanup to put the sys._clear_type_cache() after clear_caches() and the leak is gone: @@ -166,9 +166,6 @@ def dash_R_cleanup(fs, ps, pic, zdc, abcs):
zipimport._zip_directory_cache.clear()
zipimport._zip_directory_cache.update(zdc)
- # clear type cache
- sys._clear_type_cache()
-
# Clear ABC registries, restoring previously saved ABC registries.
abs_classes = [getattr(collections.abc, a) for a in collections.abc.__all__]
abs_classes = filter(isabstract, abs_classes)
@@ -181,6 +178,12 @@ def dash_R_cleanup(fs, ps, pic, zdc, abcs):
clear_caches()
+ # clear type cache
+ sys._clear_type_cache()
+
+ support.gc_collect()
+ I did this because I suspected it's something related to types, because it only happen when we define the exception classes (A and B) inside the test function. So maybe the type objects of these exceptions are not cleared from the type cache because they have refcount > 0 but by the time clear_caches() and gc.collect() finish they are cleared? Note this doesn't mean the change I made is always right - maybe the whole body of dash_R_cleanup should be in a loop until nothing changes? |
Aha! The order of function calls in dash_R_cleanup() matters. Calling sys._clear_type_cache() at the end of dash_R_cleanup() perfectly makes sense. If you do that, please add a comment to explain that the order matters. For example, the code about ABC classes is likely to modify classes. About the gc_collect() call, I dislike calling it in clear_caches() and then again in dash_R_cleanup(). I would even prefer to *remove* it from both functions and call it explicitly at the call site: in dash_R() and in runtest.py. |
With your patch I don't see a leak for refleak.py, but I still see one for ./python.exe -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory -F -j1 (once I remove the @Skip on test_no_hang_on_context_chain_cycle2). However, if I make this change it looks better: --- a/Lib/test/libregrtest/refleak.py
+++ b/Lib/test/libregrtest/refleak.py
@@ -90,8 +90,9 @@ def get_pooled_int(value):
for i in rep_range:
test_func()
- dash_R_cleanup(fs, ps, pic, zdc, abcs)
- support.gc_collect()
+ for i in range(3):
+ dash_R_cleanup(fs, ps, pic, zdc, abcs)
+ support.gc_collect() |
Oh. I forgot that the test is still skipped!
If I apply attached strict_refleak_check.patch and I remove @unittest.skip("See bpo-44895") of test_no_hang_on_context_chain_cycle2(), I get: $ ./python -m test -R 3:3 test_exceptions -m test_no_hang_on_context_chain_cycle2 -m test_recursion_normalizing_infinite_exception -m test_recursion_in_except_handler -m test_recursion_normalizing_with_no_memory (...) |
I changed it to + for i in range(2): (loop twice instead of 3 times) and that seems fine too. |
I was unable to reproduce this locally now. Making a PR to unskip the test, we'll see what the buildbots say. |
Test has been re-enabled. Let's close this in the spirit of optimism - if the refleak returns we can reopen. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
The text was updated successfully, but these errors were encountered: