Random Access Violations on shutdown #1977
Comments
Can you please give the exact example that produces the error for you (Python and C# code)? |
We're using pythonnet in many different scenarios, and this issue seems to be happening randomly. This is the simplest example I can find that (occasionally) reproduces the failure
Note that we are not able to consistently repro the issue. We are running a series of unit tests using pytest, and on every run we get different tests failing with this same stack trace. Is there something specific we could be doing in our code that may cause this crash in the new release? I ask because we're getting this issue when upgrading to pythonnet 3.0.0.post1; we're currently on pythonnet 2.5.2 and haven't had this issue. |
One major difference is that older versions of Python.NET didn't even try to properly clean up in the end. How are your tests set up that you get test failures due to shutdowns in them? Are they spinning up Python sub-processes? Can the issue be reproduced by just taking the example that you gave and running it a few thousand times or is this just an example from your test suite? |
This is an example from the test suite; I'll try running it a few thousand times to see if I can repro. We're just using pytest; AFAIK, it creates one Python process, runs a set of tests sequentially, then terminates the Python process. What's interesting is that the tests themselves pass, but when the Python process is shutting down, that's where we start getting these random access violations. Do we need to have certain cleanup code in between each test that runs in the process? |
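One way to approach the "run it a few thousand times" suggestion is a small driver that launches the script in a fresh interpreter each time and tallies the exit codes, since the crash is reported at process shutdown rather than inside a test. This is only a minimal sketch and not something posted in the thread: `tally_exit_codes` is a hypothetical helper name, and the `-c "pass"` command is a stand-in for the real reproduction script.

```python
import subprocess
import sys
from collections import Counter


def tally_exit_codes(args, runs):
    """Run `python <args>` in a fresh process `runs` times and count exit codes.

    Hypothetical helper for illustration only; it is not part of pythonnet or pytest.
    """
    codes = Counter()
    for _ in range(runs):
        # Each run gets its own interpreter, so interpreter shutdown
        # (where the access violation was reported) happens once per run.
        result = subprocess.run(
            [sys.executable, *args],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes[result.returncode] += 1
    return codes


if __name__ == "__main__":
    # Stand-in command; replace with the path of the real reproduction script.
    print(tally_exit_codes(["-c", "pass"], runs=5))
```

Any key other than 0 in the returned counter marks a run that did not shut down cleanly, which makes an intermittent failure rate visible without relying on pytest's own reporting.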
Still working on a repro outside of our unit tests. Is there anything from the stack trace that may help you identify whether something we're doing may cause this? I confirmed that pytest runs all the tests in one Python process, so the shutdown should only be happening once, and it shouldn't be spinning up Python sub-processes. |
Pythonnet version: 3.0.1
I encountered a crash with the same traceback as the first post when running unit tests with Python Development Mode activated ( I was able to reproduce this crash outside of my tests with the following sample:

import clr
clr.AddReference("System.Windows.Forms")
from System.Windows.Forms import Form
def example():
    form = Form()
    form.Show()
example()

Save the above as issue1977.py and run with
In my case I am not using System.Windows.Forms but a different assembly, and once I construct a class from that assembly after importing it, I get this issue. Python 3.7 doesn't experience the same problem. |
For me, the above snippet is deterministic: it results in the System.AccessViolationException every time. |
The actual issue is not the dev mode as such, but that it uses the
|
Ah, yes my traceback had the call to ReadInt32 rather than PyObject_TYPE from the first post. Thanks for narrowing down my specific case and providing an alternative to still get the benefits I was aiming for with the dev mode. |
@filmor I tried disabling the PYTHONMALLOC default but still get it (see below) |
@filmor OK, I've realized that setting PYTHONMALLOC=malloc made the error disappear. So maybe this should be used instead of the default? |
None of these should be necessary, but running Python.NET with anything but the default allocation is undertested (as in, not at all). My guess is that somewhere in one of these debug hooks, some .NET code is evaluated and throws (across FFI boundaries). I'll have some time to debug this this weekend. |
@filmor just wondering if there's something we should do in the meantime to mitigate this issue? |
I confirm I am also experiencing this error:
Unfortunately I cannot provide you with the C# code that triggers the problem because I did not write it myself. My Python code that calls that code is very much similar to what @donno posted above. I am running python 3.9.16 and pythonnet 3.0.1. The error does not manifest itself in version 2.5.2. |
@filmor I was wondering if there's any update on this issue or if there's anything we can do to work around this issue. Would you suggest we force using the default allocator? |
Should we maintain any hopes that this issue will be addressed anytime soon? |
Unfortunately it did not work:
|
The error above was preceded by this one:
I guess I am stuck to version 2.5.2? |
@jgmarcel If you can't provide a reproducible example, we can't debug. It's as easy as that. I could reproduce the example from the start of the thread and figured out that it had something to do with allocator, using the |
@filmor Sure thing. The C# code I am calling from Python is distributed in a DLL shipped with a commercial product, so I am afraid I cannot include it in my report. My Python code, on the other hand, is very simple and just calls the C# code as instructed by the software vendor. Is there anything else (an output, some profiling results, etc.) I could provide you with that would be useful? |
This is a use-after-free, and switching the allocator only changes the likelihood of the problem resulting in a crash. I've debugged the case with
Due to the use of a debug allocator, this overwrites the ob_type field of the freed object with a After that,
With the default allocator (without -Xdev; i.e. pymalloc), I've also observed cases where Switching to PYTHONMALLOC=malloc just means the memory is more likely to stay around long enough to avoid the crash, making the use-after-free somewhat harmless. But with the default pymalloc allocator, it's likely to cause real crashes; and with the debug allocator it's basically guaranteed to crash. |
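The allocator comparison described above can be scripted. The sketch below relies only on the documented `PYTHONMALLOC` environment variable (values such as `default`, `malloc`, and `debug`); `run_with_allocator` is a hypothetical helper name, and the `-c "pass"` command is a placeholder for an actual reproduction script.

```python
import os
import subprocess
import sys


def run_with_allocator(args, allocator):
    """Run `python <args>` with PYTHONMALLOC set and return the exit code.

    Hypothetical helper for illustration; it is not part of pythonnet.
    """
    env = dict(os.environ)
    env["PYTHONMALLOC"] = allocator
    return subprocess.run([sys.executable, *args], env=env).returncode


if __name__ == "__main__":
    # Placeholder command; substitute the real reproduction script here.
    for allocator in ("default", "malloc", "debug"):
        code = run_with_allocator(["-c", "pass"], allocator)
        print(f"PYTHONMALLOC={allocator}: exit code {code}")
```

In line with the analysis above, a clean exit under `malloc` would not show that the use-after-free is gone, only that the freed memory happened to stay readable; the `debug` setting is the one most likely to expose it.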
Would this be an appropriate fix?
diff --git a/src/runtime/Finalizer.cs b/src/runtime/Finalizer.cs
index 713564f..d6bbc3b 100644
--- a/src/runtime/Finalizer.cs
+++ b/src/runtime/Finalizer.cs
@@ -229,6 +229,7 @@ namespace Python.Runtime
IntPtr copyForException = obj.PyObj;
Runtime.XDecref(StolenReference.Take(ref obj.PyObj));
+ CLRObject.reflectedObjects.Remove(copyForException);
collected++;
try
{
@@ -236,7 +237,7 @@ namespace Python.Runtime
}
catch (Exception e)
{
- HandleFinalizationException(obj.PyObj, e);
+ HandleFinalizationException(copyForException, e);
}
}
The fix is simple: remove a possible reference to the currently disposed object from the I am not entirely sure why a variable |
@siegfriedpammer could you create a PR with this fix? We have found a scenario on python 3.11 that hits this crash deterministically and could test that this fix resolves this issue. |
I think, the above fix does only work if a But I hope this is fixed at some point by someone more knowledgeable about pythonnet so I don't have to patch the library again if I ever need to update. Thanks! |
I see the same issue when I am running robot framework tests that use 3.0.1 version of pythonnet. When I downgrade I don't see the issue anymore. Will this be fixed soon? |
So this is not a trivial issue, and we need a reproducible example to debug on our side. The thing is: if The only exception from this rule I am aware of is Python types that are derived from .NET types. Under certain circumstances they change the GC handle type to weak reference, which allows it to be collected and placed into finalization queue. See
There's a force break loops bit in |
The example from #1977 (comment) is perfectly reproducible for us when using the debug allocator (-Xdev). |
Yes, please add an option to disable the shutdown code completely. I have had cases where simply |
|
Hi @filmor, I had a look for the And just to clarify, would I be right to understand that when those tickets are resolved, we should expect to have an option to disable |
@filmor I have the same problem. This code does NOT create any error messages and always works perfectly fine:
but this code, which is just a loop, always produces the error:
(German) error message: |
To avoid the above exception, I have removed the lines below from the Runtime.cs file:
if (!HostedInPython && !ProcessIsTerminating)
|
We get a similar issue in our application as well. Here are more details of our use case:
I spent a lot of time going over the pythonnet code, adding debugging info to pythonnet, and stepping through the execution in Visual Studio. I think I now have a better understanding of the pythonnet implementation (which is complicated but elegant) and may have found something relevant (at least to our use case).
The code below illustrates that the allocated memory is reused later. If the code is complicated enough to cause the heap to shrink, we can expect the exception to happen.

import gc
from ctypes import c_long
from sys import getrefcount

import numpy as np

# MyCSPoint (.NET) and MyPyPoint (pure Python) are the poster's own point classes.

def get_refcnt(pt):
    # Three views of the count: GC referrers, sys.getrefcount, and the raw ob_refcnt field.
    return f'({pt.X:.3f}, {pt.Y:.3f}), {len(gc.get_referrers(pt))}, {getrefcount(pt)}, {c_long.from_address(id(pt))}'

# 10 random .NET points
pts = MyCSPoint.TenRandomPoints
cnts = [get_refcnt(pt) for pt in pts]
ids = [id(pt) for pt in pts]
print(".Net Results")
print("\n".join(cnts))
# Read the counts again by address, without touching the objects themselves.
print([c_long.from_address(addr) for addr in ids])

# 10 random Python points
pts2 = [MyPyPoint(x, y) for (x, y) in zip(np.random.rand(10), np.random.rand(10))]
cnts2 = [get_refcnt(pt) for pt in pts2]
ids2 = [id(pt) for pt in pts2]
print("Python Results")
print("\n".join(cnts2))
print([c_long.from_address(addr) for addr in ids2])

Here is the output. I am not sure what the correct way is to measure the ref count of a Python object created by .NET code, so I tried all 3 ways I am aware of. They return different results, but we can see the pattern here: while the objects are still in use, the .NET ref count is 1 less than the Python ref count; when the objects are no longer in use, the .NET ref count is more or less random (which indicates that the memory has been reused).
Possible fix? I tried to increase the ref count by 1 after the object is created.
var py = Runtime.PyType_GenericAlloc(tp, 0);
Runtime.XIncref(py.Borrow());
It seems to work (no more exceptions), but it also means that these allocated objects will stay in the heap forever. Experts, any suggestion for a more robust fix? |
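For readers comparing the ways of reading a reference count mentioned above, here is a small pure-Python sketch that does not involve pythonnet at all. It assumes a standard (non-free-threaded) CPython build in which `id(obj)` is the object's address and `ob_refcnt` is the first field of the object header; that layout is an implementation detail and may differ between CPython versions. `Point` and `raw_refcount` are hypothetical names introduced for the example. `c_ssize_t` is used rather than `c_long` because `ob_refcnt` is declared as `Py_ssize_t`, and `c_long` is only 32 bits wide on 64-bit Windows.

```python
import sys
from ctypes import c_ssize_t


def raw_refcount(obj):
    """Read ob_refcnt directly from the object header.

    Assumes a standard CPython build where id(obj) is the object's address
    and ob_refcnt is the first, Py_ssize_t-sized field of the header.
    """
    return c_ssize_t.from_address(id(obj)).value


class Point:
    """Plain Python stand-in for the point classes in the comment above."""

    def __init__(self, x, y):
        self.x, self.y = x, y


pt = Point(0.25, 0.75)
before = raw_refcount(pt)
holders = [pt, pt]  # two extra strong references held by the list
after = raw_refcount(pt)

print("raw delta:", after - before)
# sys.getrefcount may include a temporary reference for its own argument,
# so its absolute value is not directly comparable to the raw field.
print("sys.getrefcount:", sys.getrefcount(pt))
```

Comparing differences before and after an operation tends to be more reliable than comparing absolute values, because each measurement method adds its own temporary references.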
I am also struggling with this issue and am trying to dig into it. @duyang76 has the right idea that the problem has something to do with the reference count, but the solution
var py = Runtime.PyType_GenericAlloc(tp, 0);
Runtime.XIncref(py.Borrow());
is not correct because However, in the method The EDIT: Basically related to this comment: #1977 (comment) @lostmsu But the crux is that the exception is thrown while trying to null the GC Handles. Why is this done after the reference count is decremented to zero? |
When nulling the GC handles on shutdown, the reference counts of all objects pointed to by the IntPtr values in `reflectedObjects` are zero. This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference count is zero, so there is no guarantee that the Python object is still there and that its memory has not been reclaimed. The solution presented treats the pointers in `reflectedObjects` as strong references: the respective reference count is incremented when a pointer is added to the set and decremented when it is removed or when the entire set is cleared (the latter is already done after nulling the GC handles).
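The difference between a registry of bare addresses and a registry that owns a reference can be illustrated with a toy model. This is not pythonnet or CPython code: `ToyHeap`, `weak_registry_demo`, and `strong_registry_demo` are hypothetical names, and the "heap" is just a dictionary that reuses freed addresses, which is the behaviour a use-after-free depends on.

```python
class ToyHeap:
    """Tiny stand-in for a refcounting allocator (illustration only)."""

    def __init__(self):
        self.slots = {}       # address -> [type_name, refcount]
        self.free_list = []   # freed addresses available for reuse
        self.next_addr = 0x1000

    def alloc(self, type_name):
        if self.free_list:
            addr = self.free_list.pop()   # reuse a freed address
        else:
            addr = self.next_addr
            self.next_addr += 0x10
        self.slots[addr] = [type_name, 1]
        return addr

    def incref(self, addr):
        self.slots[addr][1] += 1

    def decref(self, addr):
        slot = self.slots[addr]
        slot[1] -= 1
        if slot[1] == 0:                  # last reference gone: free the slot
            del self.slots[addr]
            self.free_list.append(addr)

    def type_of(self, addr):
        return self.slots[addr][0] if addr in self.slots else None


def weak_registry_demo():
    heap, registry = ToyHeap(), set()
    obj = heap.alloc("Form")
    registry.add(obj)        # registry stores the address but owns nothing
    heap.decref(obj)         # refcount hits zero, slot is freed
    heap.alloc("str")        # the same address is handed out again
    return [heap.type_of(a) for a in registry]


def strong_registry_demo():
    heap, registry = ToyHeap(), set()
    obj = heap.alloc("Form")
    heap.incref(obj)         # registry takes its own reference
    registry.add(obj)
    heap.decref(obj)         # caller's reference gone, registry's remains
    heap.alloc("str")        # gets a fresh address instead
    seen = [heap.type_of(a) for a in registry]
    for addr in list(registry):
        registry.discard(addr)
        heap.decref(addr)    # release the registry's reference on removal
    return seen


if __name__ == "__main__":
    print(weak_registry_demo())    # ['str']: the stale address now names another object
    print(strong_registry_demo())  # ['Form']: the object stayed alive while registered
```

In the first demo the registry ends up inspecting whatever object now occupies the old address, which in a real process can be freed or overwritten memory. In the second, the increment on insertion and decrement on removal keep the address valid for as long as it is registered.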
When nulling the GC handles on shutdown, the reference counts of all objects pointed to by the IntPtr values in `CLRObject.reflectedObjects` are zero. This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference count is zero. After `TypeManager.RemoveTypes();` is called in the `Runtime.Shutdown()` method, decrementing a reference count to zero no longer invokes `ClassBase.tp_clear` for managed objects, which is normally responsible for removing references from `CLRObject.reflectedObjects`. Collecting objects referenced in `CLRObject.reflectedObjects` only after that point leads to an unstable state in which the reference count for these object addresses is zero while the addresses are still kept for further pseudo-cleanup. In the meantime, the memory could already have been reclaimed, which leads to the exception.
This issue can be closed, right? I tested 3.0.4 again. |
I still get: Exit code is -1073741819 (Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. In my tests of my C# web API. I initialize my python in my program.cs And my OnShutDown(){ Each integration test seems to spin up a program.cs on its own (I didn't write the test suite). Before, I had my Initialize and Shutdown in my service class, which seemed to work fine but was rather slow, since the Python engine had to be spun up on each request. |
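As a side note on the exit code quoted above: reinterpreting the signed value as an unsigned 32-bit number gives the Windows status code for an access violation, which matches the exception text. A minimal sketch, where `to_ntstatus` is a hypothetical helper name:

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS value."""
    return exit_code & 0xFFFFFFFF


STATUS_ACCESS_VIOLATION = 0xC0000005  # Windows NTSTATUS code for an access violation

code = to_ntstatus(-1073741819)
print(hex(code))                        # 0xc0000005
print(code == STATUS_ACCESS_VIOLATION)  # True
```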
That callstack is on initialization. I would suggest a separate issue. |
Environment
Details
Describe what you were trying to get done.
I am getting random access violations during shutdown in pythonnet. The issue can be reproduced with a C# function as simple as
public static int Add(int a, int b) => a + b;
I get the access violation; however, it is not deterministic at all. Sometimes it succeeds as expected; other times I get this. Note that I am upgrading from pythonnet 2.5.2 to pythonnet 3.0.0.post1, so this code used to work. All I am doing is invoking this code from Python. I am running this Python code that uses pythonnet in pytest. I haven't seen this when we were using pythonnet 2.5.2.