Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python: ReferenceError: weakly-referenced object no longer exists #2671

Closed
grondo opened this issue Jan 24, 2020 · 6 comments
Closed

python: ReferenceError: weakly-referenced object no longer exists #2671

grondo opened this issue Jan 24, 2020 · 6 comments
Assignees

Comments

@grondo
Copy link
Contributor

grondo commented Jan 24, 2020

I'm seeing this very strange error with Python 3.6.9 on an Ubuntu system. I'm really baffled by this behavior, but hopefully it is easily explained.

I've boiled the reproducer down to the following script:

import flux

def dummy(foo):
    pass

h = flux.Flux()
h.rpc ("job-info.job-stats", "{}").get()

This script succeeds, but prints a warning and backtrace

grondo@asp:~/git/f.git$ python3 t.py 
{'job_states': {'depend': 0, 'sched': 0, 'run': 0, 'cleanup': 0, 'inactive': 1, 'total': 1}}
Exception ignored in: <bound method Wrapper.__del__ of <flux.core.handle.Flux object at 0x7faf9e9e89e8>>
Traceback (most recent call last):
  File "/home/grondo/git/f.git/src/bindings/python/flux/wrapper.py", line 363, in __del__
  File "/home/grondo/git/f.git/src/bindings/python/flux/wrapper.py", line 339, in _clear
  File "/home/grondo/git/f.git/src/bindings/python/flux/wrapper.py", line 166, in __call__
ReferenceError: weakly-referenced object no longer exists

The baffling thing is if I comment out the dummy function that wasn't even called, I don't get a warning:

grondo@asp:~/git/f.git$ cp t.py v.py
grondo@asp:~/git/f.git$ vim v.py
grondo@asp:~/git/f.git$ diff -u t.py v.py
--- t.py	2020-01-24 02:47:27.796516585 +0000
+++ v.py	2020-01-24 02:49:03.413078616 +0000
@@ -2,8 +2,8 @@
 # -*- coding: utf-8 -*-
 import flux
 
-def dummy(foo):
-    pass
+#def dummy(foo):
+#    pass
 
 h = flux.Flux()
 print(h.rpc ("job-info.job-stats", "{}").get())
grondo@asp:~/git/f.git$ python3 v.py
{'job_states': {'depend': 0, 'sched': 0, 'run': 0, 'cleanup': 0, 'inactive': 1, 'total': 1}}

This occurs for me whenever there is a global function defined in the script, whether my code calls it or not.

Anyone know what is going on here?

@SteVwonder
Copy link
Member

😱 That is absolutely horrifying. I suspect the weakref that I inserted to fix #2549 is somehow implicated in this.

I'll stick this on my TODO list, but I probably won't be able to look at it in earnest until Monday 😞

@grondo
Copy link
Contributor Author

grondo commented Mar 29, 2020

Hitting this with a similar script when python rpc gets an error (e.g. EPERM)

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/flux/util.py", line 37, in func_wrapper
    return func(calling_obj, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/flux/wrapper.py", line 220, in __call__
    "{}: {}".format(errno.errorcode[errnum], os.strerror(errnum)),
PermissionError: [Errno 1] EPERM: Operation not permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./rpc.py", line 6, in <module>
    print(h.rpc (sys.argv[1], "{}").get())
  File "/usr/lib/python3.6/site-packages/flux/rpc.py", line 50, in get
    resp_str = self.get_str()
  File "/usr/lib/python3.6/site-packages/flux/rpc.py", line 44, in get_str
    self.pimpl.flux_rpc_get(payload_str)
  File "/usr/lib/python3.6/site-packages/flux/util.py", line 51, in func_wrapper
    raise EnvironmentError(error.errno, errmsg.decode("utf-8"))
PermissionError: [Errno 1] Operation not permitted
Exception ignored in: <bound method Wrapper.__del__ of <flux.core.handle.Flux object at 0x7f2c44666400>>                                                                         
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/flux/wrapper.py", line 364, in __del__
  File "/usr/lib/python3.6/site-packages/flux/wrapper.py", line 340, in _clear
  File "/usr/lib/python3.6/site-packages/flux/wrapper.py", line 166, in __call__
ReferenceError: weakly-referenced object no longer exists
Exception ignored in: <bound method Wrapper.__del__ of <flux.future.Future.InnerWrapper object at 0x7f2c4466f908>>                                                               
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/flux/wrapper.py", line 364, in __del__
  File "/usr/lib/python3.6/site-packages/flux/wrapper.py", line 340, in _clear
  File "/usr/lib/python3.6/site-packages/flux/wrapper.py", line 166, in __call__
ReferenceError: weakly-referenced object no longer exists

@SteVwonder
Copy link
Member

Thanks @grondo for the reproducer! I took a poke at this, and I think now I get what is happening but not necessarily how to solve it. Here is my best summary of the situation:

  • instances of the Wrapper class automatically call the C destructors for the instance's handle (if the handle and destructor are both set).
    • relevant to this error: the Flux object is an instance of Wrapper and Flux has it's handle destructor set to raw.flux_close.
  • raw is acutally a Core object which is itself a wrapper instance, so raw.flux_close is a method of the raw object that should point to a function in the cffi module.
  • this is where the weak reference comes in. After wrapper searches the auto-generated bindings at runtime for a matching function, it binds a method to the object (via setattr) so that subsequent calls to the same function are "cached" and cheaper. To avoid a circular reference when creating the bound method, we use a weak.proxy for the reference to self. Here is the relevant code if that helps provide more context.
  • the weakly referenced error is occurring after the program has exited because when the Flux object's __del__ method get's called, it goes to call raw.flux_close, but the weak reference to self inside the bound method of raw.flux_close is no longer valid, even though raw's __del__ method has not yet been called.

My hypothesis for why reference inside the weak.proxy becomes None/invalid is that the weak.proxy object is being destroyed before the various Flux objects. Not sure if this is because weak.proxy registers an atexit handler to destroy itself that get's called before other destructors or what.

I'm scrapping the bottom of the barrel for ideas as to how to solve this or debug further. We can avoid using raw for all destructors, but that seems prone to regressions and doesn't necessarily seem like the cleanest way to handle this (see the commented out line in the diff below for an example of what I mean). We could always revert the weakproxy change, but that means we have to rely on the GC to run and find circular references before we exhaust our pool of futures (#2549). Another thought is that @trws was considering replacing the Wrapper class with a different auto-generation method; not sure if that would suffer the same problem though.

For reference on how I got to these conclusions:
I used @grondo's reproducer basically verbatim, just threw a print("Exiting") at the bottom. I also tossed these changes into the bindings:

diff --git a/src/bindings/python/flux/core/handle.py b/src/bindings/python/flux/core/handle.py
index 44810831b..96bba97aa 100644
--- a/src/bindings/python/flux/core/handle.py
+++ b/src/bindings/python/flux/core/handle.py
@@ -36,7 +36,11 @@ class Flux(Wrapper):
             prefixes=["flux_", "FLUX_"],
             destructor=raw.flux_close,
         )
-
+        #self.destructor = lib.flux_close
+        print(
+            "Creating new Flux instance, with destructor {}. raw.flux_close == {}".format(
+                self.destructor, raw.flux_close)
+        )
         if handle is None:
             try:
                 self.handle = raw.flux_open(url, flags)
diff --git a/src/bindings/python/flux/core/inner.py b/src/bindings/python/flux/core/inner.py
index e729f7842..ab1fcdf0f 100644
--- a/src/bindings/python/flux/core/inner.py
+++ b/src/bindings/python/flux/core/inner.py
@@ -21,7 +21,6 @@ class Core(Wrapper):
         """Set up the wrapper interface for functions prefixed with flux_"""
         super(Core, self).__init__(ffi, lib, prefixes=["flux_", "FLUX_"])
 
-
 # keeping this for compatibility
 # pylint: disable=invalid-name
 raw = Core()
diff --git a/src/bindings/python/flux/wrapper.py b/src/bindings/python/flux/wrapper.py
index d981c2a87..46ad25a31 100644
--- a/src/bindings/python/flux/wrapper.py
+++ b/src/bindings/python/flux/wrapper.py
@@ -337,6 +337,7 @@ class Wrapper(WrapperBase):
             return
         if self.destructor is None:
             return
+        print("Calling destructor ({}) on {}".format(self.destructor, self))
         self.destructor(handle)
         # ensure we don't double destruct
         self._handle = None
@@ -361,6 +362,7 @@ class Wrapper(WrapperBase):
         self._handle = h
 
     def __del__(self):
+        print("Calling __del__ on {}".format(self))
         self._clear()
 
     def __enter__(self):

That produces the following output:

❯ ./weakref.py
Creating new Flux instance, with destructor <bound method ? of <weakproxy at 0x7efe4e85f5e8 to Core at 0x7efe51d5e860>>. raw.flux_close == <bound method ? of <weakproxy at 0x7efe4e85f5e8 to Core at 0x7efe51d5e860>>
Calling __del__ on <flux.future.Future.InnerWrapper object at 0x7efe52026710>
Calling destructor (<bound method ? of <weakproxy at 0x7efe4e85f5e8 to Core at 0x7efe51d5e860>>) on <flux.future.Future.InnerWrapper object at 0x7efe52026710>
Exiting
Calling __del__ on <flux.core.handle.Flux object at 0x7efe523885f8>
Calling destructor (<bound method ? of <weakproxy at 0x7efe4e85f5e8 to NoneType at 0x9d4380>>) on <flux.core.handle.Flux object at 0x7efe523885f8>
Exception ignored in: <bound method Wrapper.__del__ of <flux.core.handle.Flux object at 0x7efe523885f8>>
Traceback (most recent call last):
  File "/home/herbein1/Repositories/flux-framework/flux-core/src/bindings/python/flux/wrapper.py", line 366, in __del__
  File "/home/herbein1/Repositories/flux-framework/flux-core/src/bindings/python/flux/wrapper.py", line 341, in _clear
  File "/home/herbein1/Repositories/flux-framework/flux-core/src/bindings/python/flux/wrapper.py", line 166, in __call__
ReferenceError: weakly-referenced object no longer exists
Calling __del__ on <flux.core.inner.Core object at 0x7efe51d5e860>

@trws
Copy link
Member

trws commented Mar 30, 2020

Oof... ok, this one might take some tweaking...

I think I might have a solution though, short of going to the generated eplicit bindings option. What we should have been doing is adding the generated method to the class not the instance anyway, but not to Wrapper, specifically to the derived class that is being used by the object __getattr__ is being called on. Now that we are in python3 land rather than interop land, we may be able to just do that, which should both solve this problem and make the whole thing faster at the same time. The trick is that we need a way to get ahold of the most specific class of self... Much easier now. Anyway, if we replace this:

        new_method = six.create_bound_method(new_fun, weakref.proxy(self))

        # Store the wrapper function into the instance
        # to prevent a second lookup
        setattr(self, name, new_method)

with something kinda like this:

        # no method conversion needed, functions in classes are methods
        # new_method = six.create_bound_method(new_fun, weakref.proxy(self))

        # Store the wrapper function into the class
        # to prevent a second lookup
        setattr(type(self), name, new_fun)

Also we can't return the thing directly anymore, we need to wrap it and do the bound method thing for the return value here to get the correct behavior, but that should be reaped immediately and not hang around to cause us grief. Something like what we used to have:

return types.MethodType(new_fun, self)

It should get rid of the weakref entirely, give us per-derived-class method lookups and hopefully fix up some other issues.

@SteVwonder
Copy link
Member

Thanks @trws for taking a look at this!

I tried out your suggested changes. It looks like the binding of the method to the class doesn't work without further modification due to RPC and Future sharing the same InnerWrapper class, whose instances match entirely different sets of function prefixes (i.e., flux_future_ vs flux_rpc). For an easy reproducer of this, ./job-manager/bulk-state.py hangs with the method bound to the class. There may be some other deeper issues there too (e.g., the pymod test fails with a seg fault when binding to the class).

After walking that change back, it looks like the explicit binding of the method is still required too. Without the method explicitly being bound, the self argument to the FunctionWrapper __call__ method is never passed, so all the arguments are off-by-1. Note that the below error occurs on the second call the to raw.rpc method since the first call uses the bound MethodType returned by the first call to __getattr__:

Traceback (most recent call last):
File "/home/herbein1/Repositories/flux-framework/flux-core/t/schedutil/req_and_unload.py", line 44, in <module>
main()
File "/home/herbein1/Repositories/flux-framework/flux-core/t/schedutil/req_and_unload.py", line 29, in main
free = h.rpc("sched.free", json.dumps({"id": 0}))
File "/home/herbein1/Repositories/flux-framework/flux-core/src/bindings/python/flux/core/handle.py", line 120, in rpc
return RPC(self, topic, payload, nodeid, flags)
File "/home/herbein1/Repositories/flux-framework/flux-core/src/bindings/python/flux/rpc.py", line 41, in __init__
future_handle = raw.flux_rpc(flux_handle, topic, payload, nodeid, flags)
File "/home/herbein1/Repositories/flux-framework/flux-core/src/bindings/python/flux/wrapper.py", line 166, in __call__
calling_object.ffi.errno = 0
AttributeError: cdata 'struct flux_handle_struct *' points to an opaque type: cannot read fields

So walking those two changes back, and I think we are back where we were in #2549, minus the need for six. Caveat: I haven't actually run a 500 job stress test to confirm.

trws added a commit to trws/flux-core that referenced this issue Mar 30, 2020
Rather than caching in the instance, this should remove dangling
reference issues like in flux-framework#2671, and hopefully also be somewhat faster in
that the lookup happens once per *type* rather than once per *instance*.
trws added a commit to trws/flux-core that referenced this issue Mar 30, 2020
Rather than caching in the instance, this should remove dangling
reference issues like in flux-framework#2671, and hopefully also be somewhat faster in
that the lookup happens once per *type* rather than once per *instance*.
trws added a commit to trws/flux-core that referenced this issue Mar 30, 2020
Rather than caching in the instance, this should remove dangling
reference issues like in flux-framework#2671, and hopefully also be somewhat faster in
that the lookup happens once per *type* rather than once per *instance*.
@SteVwonder
Copy link
Member

I believe this was fixed with the merging of #2878. We can re-open if not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants