
bindings/python: destroy future after use #2553

Closed

Conversation

@andre-merzky (Contributor)

Following a suggestion by @grondo, the submit_get_id-future sees
delayed / no garbage collection and causes a significant slowdowns due to
apparent limits in the future layer. This commit adds explicit
destruction after future completion, thus avoiding that issue.

I am not familiar with your PR policies: what branch to point to, what additional information to provide, etc. Please let me know if this is off...

This closes #2549
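For orientation, here is a toy model of the pattern the commit enables: consume the future's result, then destroy the future explicitly rather than relying on garbage collection. This is a self-contained sketch with stand-in classes, not flux bindings code:

    class ToyFuture:
        """Stand-in for a C-backed future; not flux API."""

        def __init__(self, result):
            self._result = result
            self._destroyed = False

        def get(self):
            return self._result

        def destroy(self):
            # release the (imagined) C-side resources exactly once
            if not self._destroyed:
                self._destroyed = True

    def submit_async(n):
        return ToyFuture(result=n)  # pretend the result is a job id

    for n in range(500):
        future = submit_async(n)
        jid = future.get()
        future.destroy()  # explicit destruction after completion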

@garlick (Member) left a comment

Looks good and thanks for the PR!

We run Python code through the "black" formatter in Travis CI, and it has called out a need for this change:

--- src/bindings/python/flux/job.py	2019-11-22 10:18:42.824332 +0000
+++ src/bindings/python/flux/job.py	2019-11-22 10:18:56.523516 +0000
@@ -53,6 +53,5 @@
     future = submit_async(flux_handle, jobspec, priority, flags)
     jid = submit_get_id(future)
     future.destroy()
     return jid
 
-

Couple other minor nits if you don't mind:

  • fix misspelling of "slowdowns" in commit message
  • add "bindings/python:" prefix to commit title

For future reference, and for what it's worth, we have put up documents describing flux project practices.

@andre-merzky force-pushed the fix/issue_321 branch 2 times, most recently from 85a2d42 to ae6ea79, on November 22, 2019 at 14:15
@grondo (Contributor) commented Nov 22, 2019

Awesome! Thanks!

The amended commit message:

Following a suggestion by @grondo, the `submit_get_id`-future sees
delayed / no garbage collection and causes a significant slowdown
due to apparent limits in the future layer.  This commit adds explicit
destruction after future completion, thus avoiding that issue.
@andre-merzky changed the title from "destroy future after use" to "bindings/python: destroy future after use" on Nov 22, 2019
@andre-merzky requested a review from garlick on November 22, 2019 at 14:33
@garlick (Member) left a comment

Perfect, thanks again! I'll set the merge-when-passing label and then mergify will take care of the merge.

@garlick (Member) commented Nov 22, 2019

Uh oh, looks like one of the builders failed in t2206-job-manager-bulk-state.t with this message and a long backtrace:

*** Error in `python': corrupted size vs. prev_size: 0x0000000001e87360 ***

That runs a Python script, t/job-manager/bulk-state.py, which uses job.submit().

So we'd better run that down before this goes in (not putting that on you, @andre-merzky).
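The thread does not pin down the root cause at this point, but "corrupted size vs. prev_size" is a glibc heap-corruption abort, and one common trigger for that family of errors is freeing the same allocation twice (for instance, an explicit destroy plus a second destroy from __del__). A minimal, flux-free illustration of the hazard via ctypes:

    import ctypes
    import ctypes.util

    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.argtypes = [ctypes.c_void_p]

    buf = libc.malloc(64)
    libc.free(buf)      # first free: fine
    # libc.free(buf)    # second free: undefined behavior; glibc typically
    #                   # aborts with a heap-corruption message like the one above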

@garlick (Member) commented Nov 22, 2019

Out of my depth here, but a66ead4 relates to the lifetimes of futures in the Python bindings. Maybe @SteVwonder will have some thoughts on this PR?

Edit: fixed commit ref

@SteVwonder (Member) left a comment

Thanks @andre-merzky for posting this fix. Some inline comments below. For the one suggested change to work, the __clear method in wrapper.py will need to be renamed to _clear (single underscore).

Using the _clear method should avoid the memory corruption we are seeing in Travis since the _clear method has some extra protections to ensure the C destructor is only called once.
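A sketch of the kind of protection being described, assuming (without copying wrapper.py) that the guard works by nulling the handle so the C destructor can run at most once:

    class ToyWrapper:
        """Stand-in for a wrapper over a C handle; not flux code."""

        def __init__(self, handle):
            self._handle = handle

        def _clear(self):
            # swap the handle out first so repeat calls find None
            # and become no-ops
            handle, self._handle = self._handle, None
            if handle is not None:
                pass  # the real code would call the C destructor here

        def __del__(self):
            self._clear()

    w = ToyWrapper(handle=object())
    w._clear()  # explicit cleanup
    w._clear()  # safe no-op; the later __del__ is also a no-op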

Comment on lines +118 to +120:

    def destroy(self):
        self.pimpl.destroy()

@SteVwonder (Member) commented:

I'd prefer to not include this public method. I previously attempted to make this work, but exposing it publicly opens up all sorts of edge cases. I'd prefer to leverage the _clear method in the Wrapper class.

@@ -51,4 +51,6 @@ def submit_get_id(future):
 
 def submit(flux_handle, jobspec, priority=lib.FLUX_JOB_PRIORITY_DEFAULT, flags=0):
     future = submit_async(flux_handle, jobspec, priority, flags)
-    return submit_get_id(future)
+    jid = submit_get_id(future)
+    future.destroy()
@SteVwonder (Member) commented:

Suggested change:

-    future.destroy()
+    future.pimpl._clear()

@SteVwonder (Member)

Rename patch:

diff --git a/src/bindings/python/flux/wrapper.py b/src/bindings/python/flux/wrapper.py
index 4a58e7f12..a74d090f5 100644
--- a/src/bindings/python/flux/wrapper.py
+++ b/src/bindings/python/flux/wrapper.py
@@ -326,7 +326,7 @@ class Wrapper(WrapperBase):
         setattr(self, name, new_method)
         return new_method

-    def __clear(self):
+    def _clear(self):
         # avoid recursion
         if hasattr(self, "_handle") and self._handle is not None:
             handle = self._handle
@@ -354,14 +354,14 @@ class Wrapper(WrapperBase):
                     )
                 )
         if self._handle is not None:
-            self.__clear()
+            self._clear()
         self._handle = h

     def __del__(self):
-        self.__clear()
+        self._clear()

     def __enter__(self):
         return self

     def __exit__(self, type_arg, value, unused):
-        self.__clear()
+        self._clear()
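For readers unfamiliar with why the rename is needed, here is a self-contained demonstration of Python's name mangling (generic code, not the flux Wrapper): a double-underscore method is private to its class via mangling, while a single underscore is only a convention and stays callable from outside:

    class Demo:
        def __clear(self):   # mangled to _Demo__clear
            return "cleared"

        def _clear(self):    # single underscore: reachable from outside
            return "cleared"

    d = Demo()
    print(d._clear())        # works
    try:
        d.__clear()          # AttributeError: the mangled name is hidden
    except AttributeError as err:
        print(err)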

@garlick (Member) commented Nov 25, 2019

I gave @SteVwonder's suggested modification to this PR a try this morning, and it seems to both get past the 500-job hang and pass make check, at least in limited testing on my desktop.

Edit: although this pylint error pops up in Travis:

src/bindings/python/flux/job.py:55:4: W0212: Access to a protected member _clear of a client class (protected-access)

@SteVwonder (Member)

Thanks @garlick for giving it a spin. Adding # pylint: disable=protected-access right above the offending line should make the pylint error go away for now.
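A self-contained illustration of silencing W0212 for one protected-member access (the class names are stand-ins, not the flux bindings):

    class Impl:
        def _clear(self):
            pass

    class Future:
        def __init__(self):
            self.pimpl = Impl()

    future = Future()
    # pylint: disable=protected-access
    future.pimpl._clear()

Note that a standalone disable comment applies from that line to the end of the enclosing scope; appending it to the offending line itself (future.pimpl._clear()  # pylint: disable=protected-access) scopes the suppression to that one line.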

@garlick (Member) commented Nov 27, 2019

> Thanks @garlick for giving it a spin. Adding # pylint: disable=protected-access right above the offending line should make the pylint error go away for now.

Oh cool. Is this the way to go or should we try to make your weakref thing in #2549 work?

@SteVwonder (Member)

I think ultimately we need to fix the circular reference so that futures exposed to users aren’t leaked, but if this issue is blocking @andre-merzky, then I have no issue with this patch for now.
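To make the circular-reference point concrete (a toy model, not the actual binding code): when an object participates in a reference cycle, dropping the last external reference does not reclaim it immediately; it lingers until the cyclic garbage collector runs, which is exactly the "delayed / no garbage collection" behavior described in the commit message:

    import gc

    class ToyFuture:
        def __init__(self):
            # then_cb refers back to the future itself: a reference cycle
            self.then_cb = lambda: self

    f = ToyFuture()
    del f                    # refcount never hits zero due to the cycle
    print(gc.collect() > 0)  # True: only the cycle collector reclaims it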

@andre-merzky (Contributor, Author)

> I think ultimately we need to fix the circular reference so that futures exposed to users aren’t leaked, but if this issue is blocking @andre-merzky, then I have no issue with this patch for now.

Thanks for that! But I am still prototyping and quite a bit away from experiments or production, and don't mind waiting. I am using a version of this PR right now, so I am not stuck.

garlick pushed a commit to garlick/flux-core that referenced this pull request Dec 2, 2019
Allow future.pimpl._clear() to be called from other parts
of the binding by dropping an underscore from its prefix.

The _clear method has some extra protections to ensure the C
destructor is only called once, thus future.pimpl._clear()
is the preferred way to explicitly dispose of a future.

This is a prereq for explicit cleanup of futures in the
job() class.

Originally proposed in flux-framework#2553.
garlick added a commit to garlick/flux-core that referenced this pull request Dec 2, 2019
Problem: job.submit() and job.wait() both seem to
leak futures.

The futures in these "synchronous" methods go out
of scope and thus should be automatically destroyed,
but due to a circular reference alluded to by
@SteVwonder in flux-framework#2549, they persist.

As a workaround, explicitly call _clear() on the
futures in these methods.

Credit goes to @andre-merzky for proposing the first
version of this patch in flux-framework#2553, based on a suggestion
by @grondo, with changes proposed by @SteVwonder.
Group effort!

Fixes flux-framework#2549.
@garlick (Member) commented Dec 2, 2019

#2563 just went in and included bits gathered from this PR, so closing this one. Thanks @andre-merzky for running this down and proposing the initial fix!

@garlick closed this on Dec 2, 2019
@andre-merzky deleted the fix/issue_321 branch on December 2, 2019 at 22:51
Linked issue: Hang when submitting many jobs via Python (#2549)