-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle dynamic modules globals modification #201
Handle dynamic modules globals modification #201
Conversation
Codecov Report
@@ Coverage Diff @@
## master #201 +/- ##
=========================================
+ Coverage 84.95% 85.06% +0.1%
=========================================
Files 1 1
Lines 565 569 +4
Branches 109 110 +1
=========================================
+ Hits 480 484 +4
Misses 62 62
Partials 23 23
Continue to review full report at Codecov.
|
@pierreglaser could you please rebase on top of the current master now that #198 has been merged? |
ae1079b
to
f022c41
Compare
@ogrisel done. |
Thanks. The current code is problematic however: it can cause a memory leak in applications that generate a large number of transient dynamic modules with different names. Instead of storing the globals vars in the >>> import weakref
>>> import imp
>>> import textwrap
>>> import cloudpickle
>>> mod = imp.new_module('mod')
>>> code = '''
... x = 1
... def func_that_relies_on_dynamic_module():
... global x
... return x
... '''
...
>>> exec(textwrap.dedent(code), mod.__dict__)
>>> _dynamic_modules = weakref.WeakValueDictionary()
>>> _dynamic_modules[mod.__name__] = mod
>>> _dynamic_modules
<WeakValueDictionary at 0x7f73d3f84320>
>>> list(_dynamic_modules.keys())
['mod']
>>> del mod
>>> list(_dynamic_modules.keys())
[] |
Sounds good. Should there be a test that effectively verifies if |
You can check that if you unpickle the dynamic module function in a local function run a subprocess, |
cloudpickle/cloudpickle.py
Outdated
"""class used to instanciate base_globals, in order to be weakly refrenced | ||
into _dynamic_modules_globals | ||
""" | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you keep a reference to the module object itself as I suggested in #201 (comment) ?
That would make the code much more homogeneous, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, a module was never properly created in the previous code, simply a dict
object was created. But we can change this by using imp.new_module
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No my point would be to register the dynamic modules themselves in _dynamic_modules
when we unpickle them (in dynamic_subimport
I think) and then use their name as value for base_globals
exactly as we do for regular modules. Falling back to empty dict might still be useful for weird modules that are neither found in sys.modules
nor registered in _dynamic_modules
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case where we pickle only a function from a dynamic module, and not the dynamic module itself, it is not possible, because only mod.f
(a function) is unpickled, so a dynamic module is never unpickled.
Creating a new module as I said does not work though, because it will only be referenced in _dynamic_modules_globals, however, since it is a WeakValueDictionnary
, the module will be garbage collected as soon as we get out of the function where the module is created.
By using a dictionary instead of a module, things worked because f.__globals__
is a reference to a value in _dynamic_modules_globals
, so the item in this dict was not garbage collected as long as the function was not deleted, which is not the case anymore if we use a module, as f
does not have a reference to the module itself.
So in this particular use case, it seems hard to keep a reference to the dynamic module where an unpickled function lived.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, that's a good point.
tests/cloudpickle_test.py
Outdated
@@ -440,6 +442,48 @@ def method(self, x): | |||
mod1, mod2 = pickle_depickle([mod, mod]) | |||
self.assertEqual(id(mod1), id(mod2)) | |||
|
|||
def test_dynamic_modules_cache(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a cache. A cache has specific semantics which are not the semantics of this hold. It's more like a list of reference to the known active dynamic modules.
tests/cloudpickle_test.py
Outdated
code = ''' | ||
x = 1 | ||
def f(): | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe test_function_from_dynamic_module_with_globals_modifications
is already testing this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. I can merge the two test, but unsure if it is better.
for now only a function of a dynamic module is unpickled, to show a case where creating a dynamic module is hard.
I changed the test to show the case where creating a dynamic module is challenging. Also, shouldn't the |
tests/cloudpickle_test.py
Outdated
return | ||
''' | ||
exec(textwrap.dedent(code), mod.__dict__) | ||
|
||
pickled_module_path = 'mod.pkl' | ||
print(mod.func.__module__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spurious print
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a first batch of comments, I still need to review the second test.
cloudpickle/cloudpickle.py
Outdated
|
||
class DynamicDict(dict): | ||
"""class used to instanciate base_globals, in order to be weakly refrenced | ||
into _dynamic_modules_globals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: please follow PEP 257 for docstring: oneliner short summary, then optional blank line + paragraph with more details that could not fit in the summary. Following PEP 257 is helpful to make code editors documentation tooltips work better (as well as sphinx API doc for public function / classes).
Also I think we should make the this class private explicitly with a leading underscore.
class _DynamicModuleFuncGlobals(dict):
"""Global variables referenced by a function defined in a dynamic module
To avoid leaking references we store such context in a WeakValueDictionary instance.
However instances of python builtin types such as dict cannot be used directly as values
in such a construct, hence the need for a derived class.
"""
pass
cloudpickle/cloudpickle.py
Outdated
base_globals = _dynamic_modules_globals.get( | ||
base_globals, DynamicDict()) | ||
# base_globals is not a string anymore, using base_globals_name | ||
# instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is unnecessary in my opinion. The base_globals_name
variable name is explicit enough.
cloudpickle/cloudpickle.py
Outdated
# lived in | ||
base_globals = vars(sys.modules[base_globals]) | ||
else: | ||
base_globals = {} | ||
base_globals = _dynamic_modules_globals.get( | ||
base_globals, DynamicDict()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small optim:
base_globals = _dynamic_modules_globals.get(base_globals_name, None)
if base_globals is None:
base_globals = _DynamicModuleFuncGlobals()
to spare the creation of _DynamicModuleFuncGlobals
when not necessary.
tests/cloudpickle_test.py
Outdated
@@ -42,7 +43,8 @@ | |||
tornado = None | |||
|
|||
import cloudpickle | |||
from cloudpickle.cloudpickle import _find_module, _make_empty_cell, cell_set | |||
from cloudpickle.cloudpickle import _find_module, _make_empty_cell, cell_set,\ | |||
_dynamic_modules_globals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use a new from .... import line
instead of multiline imports: I find it more readable and it makes it easier to avoid merge conflict.
tests/cloudpickle_test.py
Outdated
@@ -440,6 +442,58 @@ def method(self, x): | |||
mod1, mod2 = pickle_depickle([mod, mod]) | |||
self.assertEqual(id(mod1), id(mod2)) | |||
|
|||
def test_dynamic_modules_globals(self): | |||
# _dynamic_modules_globals is a WeakValueDictionary, so if this dynamic | |||
# module has no other reference than in this # dict (whether on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#
leftover
tests/cloudpickle_test.py
Outdated
# _dynamic_modules_globals is a WeakValueDictionary, so if this dynamic | ||
# module has no other reference than in this # dict (whether on the | ||
# child process or the parent process), this module will be garbage | ||
# collected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of this test, _dynamic_modules_globals
is only used in the child process.
Also it's not the module that gets garbage collected by the function globals dict.
tests/cloudpickle_test.py
Outdated
|
||
# A dictionnary storing the globals of the newly unpickled function | ||
# should have been created | ||
assert list(_dynamic_modules_globals.keys())==['mod'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: whitespaces around ==
.
tests/cloudpickle_test.py
Outdated
# There is no reference to the globals of func since func has been | ||
# deleted and _dynamic_modules_globals is a WeakValueDictionary, | ||
# so _dynamic_modules_globals should now be empty | ||
assert list(_dynamic_modules_globals.keys())==[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem
For some reason I could not directly push to your branch so I took over in #205. |
I created this PR to have us take a stand about the behavior with regards to dynamic modules caching.
cloudpickle bevaves inconstently when a function defined in a dynamic module that mutates globals variables of that module gets pickled and then unpickled several times: new globals are created each time and erase the previous values.
As as example, let
mod
be a dynamic module:this will run fine
The question is: is this what we want, or would we rather cache the globals of the dynamic module?
The potential risk is if the globals are large objects, and that we keep a reference to it, the size of the cache my increase, leading to a potential memory leak.
pinging @ogrisel @tomMoral on this.