Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle dynamic modules globals modification #201

Conversation

pierreglaser
Copy link
Member

@pierreglaser pierreglaser commented Sep 11, 2018

I created this PR to have us take a stand about the behavior with regards to dynamic modules caching.

cloudpickle bevaves inconstently when a function defined in a dynamic module that mutates globals variables of that module gets pickled and then unpickled several times: new globals are created each time and erase the previous values.

As as example, let mod be a dynamic module:

import imp
import textwrap
import cloudpickle
import pickle

mod = imp.new_module('mod')
code = '''
x = 1
def func_that_relies_on_dynamic_module():
    global x
    return x
'''
exec(textwrap.dedent(code), mod.__dict__)

this will run fine

first_f = pickle.loads(
    cloudpickle.dumps(mod.func_that_relies_on_dynamic_module)
)
mod.x = 2
new_f = pickle.loads(cloudpickle.dumps(mod.func_that_relies_on_dynamic_module))
assert first_f() == 1
assert new_f() == 2

The question is: is this what we want, or would we rather cache the globals of the dynamic module?
The potential risk is if the globals are large objects, and that we keep a reference to it, the size of the cache my increase, leading to a potential memory leak.

pinging @ogrisel @tomMoral on this.

@codecov-io
Copy link

codecov-io commented Sep 12, 2018

Codecov Report

Merging #201 into master will increase coverage by 0.1%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #201     +/-   ##
=========================================
+ Coverage   84.95%   85.06%   +0.1%     
=========================================
  Files           1        1             
  Lines         565      569      +4     
  Branches      109      110      +1     
=========================================
+ Hits          480      484      +4     
  Misses         62       62             
  Partials       23       23
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 85.06% <100%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81a7a8c...ae1079b. Read the comment docs.

@ogrisel
Copy link
Contributor

ogrisel commented Sep 13, 2018

@pierreglaser could you please rebase on top of the current master now that #198 has been merged?

@pierreglaser pierreglaser force-pushed the handle-dynamic-modules-globals-modification branch from ae1079b to f022c41 Compare September 13, 2018 12:15
@pierreglaser
Copy link
Member Author

@ogrisel done.

@ogrisel
Copy link
Contributor

ogrisel commented Sep 13, 2018

Thanks. The current code is problematic however: it can cause a memory leak in applications that generate a large number of transient dynamic modules with different names.

Instead of storing the globals vars in the _dynamic_modules_globals dict, it should be possible to refactor your code to use a weakref.WeakValueDictionary instance named _dynamic_modules to keep track of the dynamic module themselves without causing a memory leak when there are no longer used. See below:

>>> import weakref
>>> import imp
>>> import textwrap
>>> import cloudpickle
>>> mod = imp.new_module('mod')
>>> code = '''
... x = 1
... def func_that_relies_on_dynamic_module():
...     global x
...     return x
... '''
...
>>> exec(textwrap.dedent(code), mod.__dict__)
>>> _dynamic_modules = weakref.WeakValueDictionary()
>>> _dynamic_modules[mod.__name__] = mod
>>> _dynamic_modules
<WeakValueDictionary at 0x7f73d3f84320>
>>> list(_dynamic_modules.keys())
['mod']
>>> del mod
>>> list(_dynamic_modules.keys())
[]

@pierreglaser
Copy link
Member Author

Sounds good. Should there be a test that effectively verifies if _ dynamic_modules does not increase too much in size when several modules are created and deleted?

@ogrisel
Copy link
Contributor

ogrisel commented Sep 14, 2018

You can check that if you unpickle the dynamic module function in a local function run a subprocess, list(_dynamic_modules.keys()) is back to empty list after the function call.

"""class used to instanciate base_globals, in order to be weakly refrenced
into _dynamic_modules_globals
"""
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you keep a reference to the module object itself as I suggested in #201 (comment) ?

That would make the code much more homogeneous, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, a module was never properly created in the previous code, simply a dict object was created. But we can change this by using imp.new_module instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No my point would be to register the dynamic modules themselves in _dynamic_modules when we unpickle them (in dynamic_subimport I think) and then use their name as value for base_globals exactly as we do for regular modules. Falling back to empty dict might still be useful for weird modules that are neither found in sys.modules nor registered in _dynamic_modules.

Copy link
Member Author

@pierreglaser pierreglaser Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where we pickle only a function from a dynamic module, and not the dynamic module itself, it is not possible, because only mod.f (a function) is unpickled, so a dynamic module is never unpickled.
Creating a new module as I said does not work though, because it will only be referenced in _dynamic_modules_globals, however, since it is a WeakValueDictionnary , the module will be garbage collected as soon as we get out of the function where the module is created.

By using a dictionary instead of a module, things worked because f.__globals__ is a reference to a value in _dynamic_modules_globals, so the item in this dict was not garbage collected as long as the function was not deleted, which is not the case anymore if we use a module, as f does not have a reference to the module itself.

So in this particular use case, it seems hard to keep a reference to the dynamic module where an unpickled function lived.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, that's a good point.

@@ -440,6 +442,48 @@ def method(self, x):
mod1, mod2 = pickle_depickle([mod, mod])
self.assertEqual(id(mod1), id(mod2))

def test_dynamic_modules_cache(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a cache. A cache has specific semantics which are not the semantics of this hold. It's more like a list of reference to the known active dynamic modules.

code = '''
x = 1
def f():
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need the function to reference and mutate the global variable as we did in #188 and #198. Otherwise we cannot make assertions on the correct handling of the global variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe test_function_from_dynamic_module_with_globals_modifications is already testing this behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. I can merge the two test, but unsure if it is better.

for now only a function of a dynamic module is unpickled,
to show a case where creating a dynamic module is hard.
@pierreglaser
Copy link
Member Author

I changed the test to show the case where creating a dynamic module is challenging. Also, shouldn't the _dynamic_modules_globals.keys() be done in the subprocess itself, since it is here where _dynamic_modules_globals gets populated? (I also re-worked this in the new commit)

return
'''
exec(textwrap.dedent(code), mod.__dict__)

pickled_module_path = 'mod.pkl'
print(mod.func.__module__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious print

Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a first batch of comments, I still need to review the second test.


class DynamicDict(dict):
"""class used to instanciate base_globals, in order to be weakly refrenced
into _dynamic_modules_globals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: please follow PEP 257 for docstring: oneliner short summary, then optional blank line + paragraph with more details that could not fit in the summary. Following PEP 257 is helpful to make code editors documentation tooltips work better (as well as sphinx API doc for public function / classes).

Also I think we should make the this class private explicitly with a leading underscore.

class _DynamicModuleFuncGlobals(dict):
    """Global variables referenced by a function defined in a dynamic module

    To avoid leaking references we store such context in a WeakValueDictionary instance.
    However instances of python builtin types such as dict cannot be used directly as values
    in such a construct, hence the need for a derived class.
    """
    pass

base_globals = _dynamic_modules_globals.get(
base_globals, DynamicDict())
# base_globals is not a string anymore, using base_globals_name
# instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is unnecessary in my opinion. The base_globals_name variable name is explicit enough.

# lived in
base_globals = vars(sys.modules[base_globals])
else:
base_globals = {}
base_globals = _dynamic_modules_globals.get(
base_globals, DynamicDict())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small optim:

             base_globals = _dynamic_modules_globals.get(base_globals_name, None)
             if base_globals is None:
                 base_globals = _DynamicModuleFuncGlobals()

to spare the creation of _DynamicModuleFuncGlobals when not necessary.

@@ -42,7 +43,8 @@
tornado = None

import cloudpickle
from cloudpickle.cloudpickle import _find_module, _make_empty_cell, cell_set
from cloudpickle.cloudpickle import _find_module, _make_empty_cell, cell_set,\
_dynamic_modules_globals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a new from .... import line instead of multiline imports: I find it more readable and it makes it easier to avoid merge conflict.

@@ -440,6 +442,58 @@ def method(self, x):
mod1, mod2 = pickle_depickle([mod, mod])
self.assertEqual(id(mod1), id(mod2))

def test_dynamic_modules_globals(self):
# _dynamic_modules_globals is a WeakValueDictionary, so if this dynamic
# module has no other reference than in this # dict (whether on the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# leftover

# _dynamic_modules_globals is a WeakValueDictionary, so if this dynamic
# module has no other reference than in this # dict (whether on the
# child process or the parent process), this module will be garbage
# collected.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the context of this test, _dynamic_modules_globals is only used in the child process.

Also it's not the module that gets garbage collected by the function globals dict.


# A dictionnary storing the globals of the newly unpickled function
# should have been created
assert list(_dynamic_modules_globals.keys())==['mod']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: whitespaces around ==.

# There is no reference to the globals of func since func has been
# deleted and _dynamic_modules_globals is a WeakValueDictionary,
# so _dynamic_modules_globals should now be empty
assert list(_dynamic_modules_globals.keys())==[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

@ogrisel
Copy link
Contributor

ogrisel commented Sep 18, 2018

For some reason I could not directly push to your branch so I took over in #205.

@ogrisel ogrisel closed this Sep 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants