Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random Access Violations on shutdown #1977

Open
rohanjain101 opened this issue Oct 18, 2022 · 40 comments
Open

Random Access Violations on shutdown #1977

rohanjain101 opened this issue Oct 18, 2022 · 40 comments

Comments

@rohanjain101
Copy link

rohanjain101 commented Oct 18, 2022

Environment

  • Pythonnet version: 3.0.0.post1
  • Python version: 3.8.10
  • Operating System: Windows 11
  • .NET Runtime: .NET Framework 472

Details

  • Describe what you were trying to get done.

    I am getting random access violations during shutdown in pythonnet. The issue can be re-produced with a C# function as simple as

public static int Add(int a, int b) => a + b;

I get the access violation; however, it is not deterministic at all. Some times, it succeeds as expected, other times I get this. Note that I am upgrading from pythonnet 2.5.2 to pythonnet 3.0.0.post1 so this code used to work. All I am doing is invoking this code from python. I am running this python code that uses pythonnet in pytest. I haven't seen this when we were using pythonnet 2.5.2.

  • If there was a crash, please include the traceback here.
Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
@filmor
Copy link
Member

filmor commented Oct 18, 2022

Can you please give the exact example that produces the error for you (Python and C# code)?

@rohanjain101
Copy link
Author

rohanjain101 commented Oct 18, 2022

We're using pythonnet in many different scenarios, and this issue seems to be happening randomly. This is the simplest example I can find that (occasionally) reproduces the failure

using System;
using System.Linq;

namespace Test.Local
{
    public static partial class Operations
    {
        public static decimal? explicit_cast_float128_Float128_Scalar(decimal value) =>
            (decimal?)value;
    }
}

# Python Code:
import clr
clr.AddReference(r"Test.Local.dll")
from Test.Local import Opertations
from System import Decimal
Operations.explicit_cast_float128_Float128_Scalar(Decimal(123))

Note that we are not able to consistently repro the issue. We are running a series of unit tests using pytest, and every run, we get different tests failing with this same stack trace. Is there something specific we could be doing in our code that may cause this crash in the new release? I ask because we're getting this issue when upgrading to pythonnet3.0.0post1, we're currently on pythonnet2.5.2 and haven't had this issue.

@filmor
Copy link
Member

filmor commented Oct 18, 2022

One major difference is that older versions of Python.NET didn't even try to properly clean up in the end. How are your tests set up that you get test failures due to shutdowns in them? Are they spinning up Python sub-processes? Can the issue be reproduced by just taking the example that you gave and running it a few thousand times or is this just an example from your test suite?

@rohanjain101
Copy link
Author

This is an example from the test suite, I'll try running it a few thousand times to see if I can repro. We're just using pytest, AFAIK, it creates one python process, runs a set of tests sequentially, then terminates the python process. What's interesting is that the tests themselves pass, but when the python process is shutting down, that's where we start getting these random access violations.

Do we need to have certain cleanup code in between each test that runs in the process?

@rohanjain101
Copy link
Author

Still working on a repro outside of our unit tests, is there anything from the stack trace that may help you identify if there's something we're doing that may cause this? Pytest runs all the tests in one python process I confirmed that, so the shutdown should only be happening once, shouldn't be spinning up python sub-processes.

@donno
Copy link

donno commented Dec 1, 2022

Pythonnet version: 3.0.1
Python: 3.8 and 3.11.

I encountered a crash with the same traceback as the first post when running unittests with Python Development Mode activated (-X dev)

I was able to reproduce this crash outside of my tests with the following sample:

import clr
clr.AddReference("System.Windows.Forms")
from System.Windows.Forms import Form

def example():  
  form = Form()
  form.Show()

example()

Save the above as issue1977.py and run with python -X dev issue1977.py on Windows 10 results in an access violation except.

In my case I am not using System.Windows.Form but a different assembly and once I construct a class from that assembly after importing then I get this issue.

Python 3.7 doesn't have experience the same problem.

@rohanjain101
Copy link
Author

Thanks @donno . Are you able to repro this crash deterministically or do you have the same issue as us where it seems to be happening randomly? @filmor is there anything we can do to avoid this crash?

@donno
Copy link

donno commented Dec 2, 2022

For me, the above snippet is deterministic., it results in the System.AccessViolationException every time.

@filmor
Copy link
Member

filmor commented Dec 2, 2022

The actual issue is not the dev mode as such, but that it uses the debug mode for the allocator. If you just want to use all of the other warnings, you can explicitly force the non-debug one by setting the PYTHONMALLOC=default environment variable. I don't see the exact same backtrace, though, so my hunch is that this is a different bug:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Runtime.InteropServices.Marshal.ReadInt32(IntPtr ptr, Int32 ofs)
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

@donno
Copy link

donno commented Dec 3, 2022

Ah, yes my traceback had the call to ReadInt32 rather than PyObject_TYPE from the first post.

Thanks for narrowing down my specific case and providing an alternative to still get the benefits I was aiming for with the dev mode.

@maxcapodi78
Copy link

@filmor I tried disabling the PYTHONMALLOC default but still get it (see below)
Should I do something before the fix is released to avoid the issue?

image
image

@maxcapodi78
Copy link

@filmor Ok I've realized that settings PYTHONMALLOC=malloc made the error disappear? So, maybe should be used this instead of default?

@filmor
Copy link
Member

filmor commented Dec 6, 2022

None of these should be necessary, but running Python.NET with anything but the default allocation is undertested (as in, not at all). My guess is that somewhere in one of these debug hooks, some .NET code is evaluated and throws (across FFI boundaries). I'll have some time to debug this this weekend.

@rohanjain101
Copy link
Author

@filmor just wondering if there's something we should do in the mean time to mitigate this issue?

@jgmarcel
Copy link

jgmarcel commented Feb 7, 2023

I confirm I am also experiencing this error:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

Unfortunately I cannot provide you with the C# code that triggers the problem because I did not write it myself. My Python code that calls that code is very much similar to what @donno posted above.

I am running python 3.9.16 and pythonnet 3.0.1. The error does not manifest itself in version 2.5.2.

@rohanjain101
Copy link
Author

@filmor I was wondering if there's any update on this issue or if there's anything we can do to work around this issue. Would you suggest we force using the default allocator?

@jgmarcel
Copy link

jgmarcel commented May 5, 2023

Should we maintain any hopes that this issue will be addressed anytime soon?

@lostmsu
Copy link
Member

lostmsu commented May 5, 2023

@jgmarcel we can not fix the issue we can not reproduce. Does your scenario prohibit use of malloc allocator as suggested above?

@jgmarcel
Copy link

jgmarcel commented May 5, 2023

@jgmarcel we can not fix the issue we can not reproduce. Does your scenario prohibit use of malloc allocator as suggested above?

Unfortunately it did not work:

Failed to shutdown pythonnet: System.NullReferenceException: Object reference not set to an instance of an object.
   at Python.Runtime.BorrowedReference.DangerousGetAddress()
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
   at Python.Runtime.BorrowedReference.DangerousGetAddress()
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\prodbwgi\Anaconda3\envs\bwgi\lib\site-packages\pythonnet\__init__.py", line 158, in unload
    raise RuntimeError("Failed to call Python.NET shutdown")
RuntimeError: Failed to call Python.NET shutdown

@jgmarcel
Copy link

jgmarcel commented May 5, 2023

The error above was preceded by this one:

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()
   at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

I guess I am stuck to version 2.5.2?

@filmor
Copy link
Member

filmor commented May 6, 2023

@jgmarcel If you can't provide a reproducible example, we can't debug. It's as easy as that. I could reproduce the example from the start of the thread and figured out that it had something to do with allocator, using the malloc allocator made /that/ problem go away. It's entirely possible that your problem, while exhibiting a similar backtrace, is different.

@jgmarcel
Copy link

jgmarcel commented May 8, 2023

@filmor Sure thing. The C# code I am calling from Python is distributed in a DLL shipped with a commercial product, so I am afraid I cannot include it in my report. My Python code, on the other hand, is very simple and just calls the C# code as instructed by the software vendor. Is there anything else (an output, some profiling results, etc.) I could provide you with that would be useful?

@dgrunwald
Copy link

This is a use-after-free; and switching the allocator only changes the likelyhood of the problem resulting in a crash.

I've debugged the case with -Xdev a bit, and what's happening is that a Python object is getting freed because its last reference is released by:

Python.Runtime.Runtime.XDecref(Python.Runtime.StolenReference)
Python.Runtime.Finalizer.DisposeAll()
Python.Runtime.Runtime.TryCollectingGarbage(Int32, Boolean)
Python.Runtime.Runtime.Shutdown()
Python.Runtime.PythonEngine.Shutdown()

Due to the use of a debug allocator, this overwrites the ob_type field of the freed object with a 0xdddddddd bit pattern.

After that, NullGCHandles(CLRObject.reflectedObjects); is called. But the reflectedObjects still contain a pointer to the deleted PyObject, so PyObject_TYPE is a use-after-free and retrieves the invalid PyTypeObject* 0xdddddddd. Attempting to dereference that to get at the tp_clr_inst field results in the crash:

   at System.Runtime.InteropServices.Marshal.ReadInt32(IntPtr ptr, Int32 ofs)
   at Python.Runtime.ManagedType.TryFreeGCHandle(BorrowedReference reflectedClrObject, BorrowedReference type)
   at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
   at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
   at Python.Runtime.Runtime.Shutdown()
   at Python.Runtime.PythonEngine.Shutdown()

With the default allocator (without -Xdev; i.e. pymalloc), I've also observed cases where PyObject_TYPE crashes directly, but those are more tricky to reproduce. That likely happens because the freed object was the last in its arena, so pymalloc returned the memory to the system. But it is likely that the same bug is the cause for both.

Switching to PYTHONMALLOC=malloc just means the memory is more likely to stay around long enough to avoid the crash, making the use-after-free somewhat harmless. But with the default pymalloc allocator, it's likely to cause real crashes; and with the debug allocator it's basically guaranteed to crash.

@siegfriedpammer
Copy link
Contributor

Would this be an appropriate fix?

diff --git a/src/runtime/Finalizer.cs b/src/runtime/Finalizer.cs
index 713564f..d6bbc3b 100644
--- a/src/runtime/Finalizer.cs
+++ b/src/runtime/Finalizer.cs
@@ -229,6 +229,7 @@ namespace Python.Runtime

                         IntPtr copyForException = obj.PyObj;
                         Runtime.XDecref(StolenReference.Take(ref obj.PyObj));
+                        CLRObject.reflectedObjects.Remove(copyForException);
                         collected++;
                         try
                         {
@@ -236,7 +237,7 @@ namespace Python.Runtime
                         }
                         catch (Exception e)
                         {
-                            HandleFinalizationException(obj.PyObj, e);
+                            HandleFinalizationException(copyForException, e);
                         }
                     }

The fix is simple: remove a possible reference to the currently disposed object from the reflectedObjects set. If the item does not exist the Remove is a no-op.

I am not entirely sure why a variable copyForException is not used in the catch block.

@rohanjain101
Copy link
Author

@siegfriedpammer could you create a PR with this fix? We have found a scenario on python 3.11 that hits this crash deterministically and could test that this fix resolves this issue.

@siegfriedpammer
Copy link
Contributor

I think, the above fix does only work if a PyObject only has one reference left. For my own use of the library I have just removed the offending call to NullGCHandles because the process using the library is cleaned up by the OS anyway. The .NET runtime dies when the Python runtime dies when the process dies.

But I hope this is fixed at some point by someone more knowledgeable about pythonnet so I don't have to patch the library again if I ever need to update. Thanks!

@johnpp143
Copy link

I see the same issue when I am running robot framework tests that use 3.0.1 version of pythonnet. When I downgrade I don't see the issue anymore. Will this be fixed soon?

@lostmsu
Copy link
Member

lostmsu commented Jun 2, 2023

But the reflectedObjects still contain a pointer to the deleted PyObject

So this is not a trivial issue, and we need a reproducible example to debug on our side.

The thing is: if PyObject is alive and is referenced from .NET, it is present in reflectedObjects and also it has a GC handle pointing back to .NET, which means that the .NET object can not end up in the finalization queue, because there is a GC handle that references it.

The only exception from this rule I am aware of is Python types that are derived from .NET types. Under certain circumstances they change the GC handle type to weak reference, which allows it to be collected and placed into finalization queue. See

GCHandle gc = GCHandle.Alloc(self, GCHandleType.Weak);

There's a force break loops bit in Shutdown, but when it nulls GC handles it also cleans the reflectedObjects collection.

@dgrunwald
Copy link

The example from #1977 (comment) is perfectly reproducible for us when using the debug allocator (-Xdev).

@rohanjain101
Copy link
Author

rohanjain101 commented Jun 3, 2023

@lostmsu @filmor is there any reason for a manual shutdown/cleanup in the first place instead of relying on the OS as older versions of pythonnet did?

@siegfriedpammer
Copy link
Contributor

Yes, please add an option to disable the shutdown code completely. I have had cases where simply Ctrl+Cing the process caused a .NET stack trace to be printed, which should not be the case. I have tried monkey-patching the pythonnet.unload function, but it made everything worse, that is, even more annoying stack traces printed to the console.

@filmor
Copy link
Member

filmor commented Jun 4, 2023

  1. PRs are welcome
  2. The problem with not cleaning up explicitly is that Python and .NET still have references on each other and it's completely random whether destructors/finalizers are actually run properly
  3. The slow down is not due to the cleanup as such but because of the serialization that is run to support AppDomain switching, there is a ticket to make it (and thus appdomain support) optional I think and it should also be completely deactivated in case of hosting in Python

@agoodwin
Copy link

agoodwin commented Sep 6, 2023

Hi @filmor, I had a look for the AppDomain switching ticket you were referring to in your third point, would I be right to understand that it's these ones: #1781 and #2008?

And just to clarify, would I be right to understand that when those tickets are resolved, we should expect to have an option to disable AppDomain support (or, it would just always be disabled in the use-case of using Python.NET to invoke .NET code from Python), and if this was to be done then we should then expect this "random access violation on shutdown" behaviour to simply disappear as a nice side-benefit?

@vreindelESZ
Copy link

vreindelESZ commented Sep 25, 2023

@filmor I have the same Problem. This code does NOT create any error messages and always works perfectly fine:

def cascade_function_sdata_filepaths(ordered_list_data_paths):
    check_number_of_ports_is_equal = []
    for sdata in ordered_list_data_paths:
        sdata = _Generic.LoadSParamData[_UncNumber](sdata)
        nports = sdata.NPorts
        check_number_of_ports_is_equal.append(nports)
    print(check_number_of_ports_is_equal)
    # Überprüfe, ob alle nports gleich sind
    if all(x == check_number_of_ports_is_equal[0] for x in check_number_of_ports_is_equal):
        nports = check_number_of_ports_is_equal[0]
        print("Alle nports sind gleich. Wert: ", nports)
        if len(ordered_list_data_paths) == 1:
            data_path_one = ordered_list_data_paths[0]
            s1data = _Generic.LoadSParamData[_UncNumber](data_path_one)
            print(s1data)

#from here different code below#######################################

        if len(ordered_list_data_paths) == 2:
            data_path_one = ordered_list_data_paths[0]
            s1data = _Generic.LoadSParamData[_UncNumber](data_path_one)
            print(s1data)

            data_path_two = ordered_list_data_paths[1]
            s2data = _Generic.LoadSParamData[_UncNumber](data_path_two)
            print(s2data)
            result = SParamTools.Cascade[_UncNumber](s1data, s2data)
            return(result)
        
        if len(ordered_list_data_paths) == 3:
            data_path_one = ordered_list_data_paths[0]
            s1data = _Generic.LoadSParamData[_UncNumber](data_path_one)
            print(s1data)

            data_path_two = ordered_list_data_paths[1]
            s2data = _Generic.LoadSParamData[_UncNumber](data_path_two)
            print(s2data)
            result = SParamTools.Cascade[_UncNumber](s1data, s2data)

            data_path_three = ordered_list_data_paths[2]
            s3data = _Generic.LoadSParamData[_UncNumber](data_path_three)
            result = SParamTools.Cascade[_UncNumber](result, s3data)
            return(result)

but this code which is just a loop always produces the error:

def cascade_function_sdata_filepaths(ordered_list_data_paths):
    check_number_of_ports_is_equal = []
    for sdata in ordered_list_data_paths:
        sdata = _Generic.LoadSParamData[_UncNumber](sdata)
        nports = sdata.NPorts
        check_number_of_ports_is_equal.append(nports)
    print(check_number_of_ports_is_equal)
    # Überprüfe, ob alle nports gleich sind
    if all(x == check_number_of_ports_is_equal[0] for x in check_number_of_ports_is_equal):
        nports = check_number_of_ports_is_equal[0]
        print(f"Alle nports sind gleich. Wert: {nports}")
        if len(ordered_list_data_paths) == 1:
            data_path_one = ordered_list_data_paths[0]
            s1data = _Generic.LoadSParamData[_UncNumber](data_path_one)
            print(s1data)

#from here different code below#######################################

        result = None
        for data_path in ordered_list_data_paths:
            sdata = _Generic.LoadSParamData[_UncNumber](data_path)
            print(sdata)
            if result is None:
                result = sdata
            else:
                result = SParamTools.Cascade[_UncNumber](result, sdata)
        result_data = create_dataarray_from_loaded_sdatb(result)
        print(result_data)
        return result

(German) errormessage:
Unbehandelte Ausnahme: System.AccessViolationException: Es wurde versucht, im geschützten Speicher zu lesen oder zu schreiben. Dies ist häufig ein Hinweis darauf, dass anderer Speicher beschädigt ist.
bei Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
bei Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
bei Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
bei Python.Runtime.Runtime.Shutdown()
bei Python.Runtime.PythonEngine.Shutdown()
bei Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)

@doctormachine
Copy link

doctormachine commented Oct 31, 2023

To avoid the above exception error, I have removed the below lines from the Runtime.cs file:

if (!HostedInPython && !ProcessIsTerminating)
{
// avoid saving dead objects
TryCollectingGarbage(runs: 3);

            RuntimeData.Stash();
        }

        AssemblyManager.Shutdown();
        OperatorMethod.Shutdown();
        ImportHook.Shutdown();

        ClearClrModules();
        RemoveClrRootModule();

        NullGCHandles(ExtensionType.loadedExtensions);
        ClassManager.RemoveClasses();
        TypeManager.RemoveTypes();
        _typesInitialized = false;

        MetaType.Release();
        PyCLRMetaType.Dispose();
        PyCLRMetaType = null!;

        Exceptions.Shutdown();
        PythonEngine.InteropConfiguration.Dispose();
        DisposeLazyObject(clrInterop);
        DisposeLazyObject(inspect);
        DisposeLazyObject(hexCallable);
        PyObjectConversions.Reset();

        PyGC_Collect();
        bool everythingSeemsCollected = TryCollectingGarbage(MaxCollectRetriesOnShutdown,
                                                             forceBreakLoops: true);
        Debug.Assert(everythingSeemsCollected);

        Finalizer.Shutdown();
        InternString.Shutdown();

        ResetPyMembers();

@duyang76
Copy link

duyang76 commented Mar 6, 2024

We get similar issue in our application as well. Here are more details of our use case:

  1. We have a very heavy .net layer to deal with data and business logic
  2. We use pythonnet in our python application to interop with our .net layer
  3. We frequent get AccessViolationException when the python application shutdown. It is not fully deterministic, but we are able to reproduce the exception every time with a reasonably simple python code. The stack trace is always the same (as below).

Python.Runtime.dll!Python.Runtime.Runtime.PyObject_TYPE(Python.Runtime.BorrowedReference op) Line 913.
Python.Runtime.dll!Python.Runtime.Runtime.NullGCHandles(System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<System.IntPtr, string>> objects) Line 479.
Python.Runtime.dll!Python.Runtime.Runtime.TryCollectingGarbage(int runs, bool forceBreakLoops) Line 356.
Python.Runtime.dll!Python.Runtime.Runtime.Shutdown() Line 300.
Python.Runtime.dll!Python.Runtime.PythonEngine.Shutdown() Line 395.
Python.Runtime.dll!Python.Runtime.Loader.Shutdown(System.IntPtr data, int size) Line 49.

I spent a lot of time to go over the pythonnet code, and debugging info into pythonnet, step through the execution in Visual Studio. I think I have a better understanding of the pythonnet implementation (which is complicated but elegant) and may find something relevant (at least to our use case).

  1. When a .net object is used on the python side (say iterating through an IEnumerable), CLRObject.Create() is called (see the stack trace below)

Python.Runtime.dll!Python.Runtime.CLRObject.Create(object ob, Python.Runtime.BorrowedReference tp) Line 28.
Python.Runtime.dll!Python.Runtime.CLRObject.GetReference(object ob, System.Type type) Line 51.
Python.Runtime.dll!Python.Runtime.Converter.ToPython(object value, System.Type type) Line 184.
Python.Runtime.dll!Python.Runtime.Iterator.tp_iternext(Python.Runtime.BorrowedReference ob) Line 46.

  1. To create the python object, CLRObject.Create() calls Runtime.PyType_GenericAlloc(), which I think does not increase the refcount of the new object.
  2. After the python object is used, the refcount decreases to 0. The memory space allocated in step 2 may be reused or reclaimed.
  3. When pythonnet shuts down (see the stack trace at the beginning), ManagedType.TryFreeGCHandle() calls Runtime.PyObject_TYPE(), which tries to access the reclaimed memory space and causes the AccessViolationException.

The code below illustrates that the allocated memories are reused later. If the code is complicated enough to cause the heap to shrink, we can expect the exception to happen.

    def get_refcnt(pt):
        return f'({pt.X:.3f}, {pt.Y:.3f}), {len(gc.get_referrers(pt))}, {getrefcount(pt)}, {c_long.from_address(id(pt))}'

    # 10 random .net points
    pts = MyCSPoint.TenRandomPoints
    cnts = [get_refcnt(pt) for pt in pts]
    ids = [id(pt) for pt in pts]
    print(".Net Results")
    print("\n".join(cnts))
    print([c_long.from_address(id) for id in ids])

    # 10 random python points
    pts2 = [MyPyPoint(x,y) for (x,y) in zip(np.random.rand(10), np.random.rand(10))]
    cnts2 = [get_refcnt(pt) for pt in pts2]
    ids2 = [id(pt) for pt in pts2]
    print("Python Results")
    print("\n".join(cnts2))
    print([c_long.from_address(id) for id in ids2])

Here is the output. I am not sure what is the correct way to measure ref count of a python object created by the .net code, so I tried all 3 ways I am aware of. They return different results, but we can see the pattern here: when the objects are still in use, the .net ref count is 1 less than python ref count; when the objects are not in use, the .net ref count is kind of random (which indicates that the memory is reused).

.Net Results
...
(0.700, 0.924), 0, 4, c_long(3)
(0.459, 0.048), 0, 4, c_long(3)
(0.108, 0.517), 0, 4, c_long(3)
(0.683, 0.056), 0, 4, c_long(3)
[c_long(0), c_long(7), c_long(0), c_long(7), c_long(0), c_long(0), c_long(0), c_long(0), c_long(0), c_long(7)]

Python Results
...
(0.572, 0.850), 1, 5, c_long(4)
(0.130, 0.236), 1, 5, c_long(4)
(0.764, 0.736), 1, 5, c_long(4)
(0.406, 0.526), 1, 5, c_long(4)
[c_long(1), c_long(1), c_long(1), c_long(1), c_long(1), c_long(1), c_long(1), c_long(1), c_long(1), c_long(1)]

Possible fix? I tried to increase the ref count by 1 after the object is created.

    var py = Runtime.PyType_GenericAlloc(tp, 0);
    Runtime.XIncref(py.Borrow());

It seems to work (no more exceptions), but it also means that these allocated objects will stay in the heap forever. Experts, any suggestion for a more robust fix?

@Frawak
Copy link
Contributor

Frawak commented May 7, 2024

I am also struggling with this issue. I am trying to dig into it. @duyang76 has the right idea that the problem is something in regards to the reference count but the solution

    var py = Runtime.PyType_GenericAlloc(tp, 0);
    Runtime.XIncref(py.Borrow());

is not correct because var py = Runtime.PyType_GenericAlloc(tp, 0); already increases the reference count for the object under tp.

However, in the method NullGCHandles(CLRObject.reflectedObjects) within the Shutdown (when the exception occurs) all references in CLRObject.reflectedObjects have a reference count of zero.
The question is: Why?

The NewReference which are created with that var py = Runtime.PyType_GenericAlloc(tp, 0); and whose IntPtr are added to CLRObject.reflectedObjects must have been disposed (along with all other references to the object) while these pointers remain.
And (this is my intepretation) because of the zero reference count the memory is opened up to be reclaimed.

EDIT: Basically related to this comment: #1977 (comment)

@lostmsu But the crux is that the exception is thrown while trying to null the GC Handles. Why is this done after the reference count is decremented to zero?

Frawak added a commit to Frawak/pythonnet that referenced this issue May 8, 2024
When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the `reflectedObjects` are zero.
This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference counter is at zero, hence, not being guaranteed the Python object is still there and the memory not reclaimed.
The solution presented is treating the pointer in `reflectedObjects` as strong references - incrementing the respective ref count adding to the set and decrementing removing from it or clearing the entire set (the latter is already done after nulling the GC Handles).
Frawak added a commit to Frawak/pythonnet that referenced this issue May 8, 2024
When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the `reflectedObjects` are zero.
This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference counter is at zero, hence, not being guaranteed the Python object is still there and the memory not reclaimed.
The solution presented is treating the pointer in `reflectedObjects` as strong references - incrementing the respective ref count adding to the set and decrementing removing from it or clearing the entire set (the latter is already done after nulling the GC Handles).
Frawak added a commit to Frawak/pythonnet that referenced this issue May 8, 2024
When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the `reflectedObjects` are zero.
This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference counter is at zero, hence, not being guaranteed the Python object is still there and the memory not reclaimed.
The solution presented is treating the pointer in `reflectedObjects` as strong references - incrementing the respective ref count adding to the set and decrementing removing from it or clearing the entire set (the latter is already done after nulling the GC Handles).
Frawak added a commit to Frawak/pythonnet that referenced this issue May 13, 2024
When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the `CLRObject.reflectedObjects` are zero.
This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference counter is at zero.

After `TypeManager.RemoveTypes();` is called in the `Runtime.Shutdown()` method, reference count decrements to zero do not invoke `ClassBase.tp_clear` for managed objects anymore which normally is responsible for removing references from `CLRObject.reflectedObjects`. Collecting objects referenced in `CLRObject.reflectedObjects` only after leads to an unstable state in which the reference count for these object addresses is zero while still maintaining them to be used for further pseudo-cleanup. In that time, the memory could have been reclaimed already which leads to the exception.
Frawak added a commit to Frawak/pythonnet that referenced this issue May 13, 2024
When nulling the GC handles on shutdown the reference count of all objects pointed to by the IntPtr in the `CLRObject.reflectedObjects` are zero.
This caused an exception in some scenarios because `Runtime.PyObject_TYPE(reflectedClrObject)` is called while the reference counter is at zero.

After `TypeManager.RemoveTypes();` is called in the `Runtime.Shutdown()` method, reference count decrements to zero do not invoke `ClassBase.tp_clear` for managed objects anymore which normally is responsible for removing references from `CLRObject.reflectedObjects`. Collecting objects referenced in `CLRObject.reflectedObjects` only after leads to an unstable state in which the reference count for these object addresses is zero while still maintaining them to be used for further pseudo-cleanup. In that time, the memory could have been reclaimed already which leads to the exception.
@Frawak
Copy link
Contributor

Frawak commented Oct 11, 2024

This issue can be closed, right?

I tested 3.0.4 again.

@Panzerfury
Copy link

I still get:

Exit code is -1073741819 (Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Python.Runtime.Runtime.PyErr_Clear()
at Python.Runtime.Exceptions.Clear()
at Python.Runtime.PythonEngine.Initialize(System.Collections.Generic.IEnumerable1<System.String>, Boolean, Boolean) at Python.Runtime.PythonEngine.Initialize(Boolean, Boolean) at Python.Runtime.PythonEngine.Initialize() at Program.<Main>$(System.String[]) at DynamicClass.InvokeStub_Program.<Main>$(System.Object, System.Span1<System.Object>)
at System.Reflection.MethodBaseInvoker.InvokeWithOneArg(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
at System.Reflection.MethodBase.Invoke(System.Object, System.Object[])
at Microsoft.Extensions.Hosting.HostFactoryResolver+HostingListener.b__10_0())

In my tests of my C# web API.

I initialize my python in my program.cs
var path = builder.Configuration.GetSection("PythonDllLocation").Value;
Runtime.PythonDLL = path;
PythonEngine.Initialize();
PythonEngine.BeginAllowThreads();

And my OnShutDown(){
AppContext.SetSwitch("System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization", true);
PythonEngine.Shutdown();
AppContext.SetSwitch("System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization", false);
}

Each integrationtest seems to spin up a program.cs on it's own (i didn't write the test suite).

Before I had my Initialize and shutdown in my service class. Which seemed to work fine. But it was rather slow, having to spin up the python engine on each request.

@Frawak
Copy link
Contributor

Frawak commented Oct 21, 2024

That callstack is on initialization. I would suggest a separate issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests