Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reloading a session into the same module keeps the previous session #498

Closed
svartkanin opened this issue Jun 3, 2022 · 3 comments · Fixed by #507
Closed

Reloading a session into the same module keeps the previous session #498

svartkanin opened this issue Jun 3, 2022 · 3 comments · Fixed by #507
Labels
Milestone

Comments

@svartkanin
Copy link

I'm generating 2 session dumps with

import dill
a = 1234
dill.dump_session('test1.session')

and

import dill
b = 1234
dill.dump_session('test2.session')

Then loading them back into a module sequentially

import types
import dill

mod = types.ModuleType('test')
print(mod.__dict__.get('a', None))
print(mod.__dict__.get('b', None))

dill.load_session('test1.session', main=mod)
print(mod.__dict__.get('a', None))
print(mod.__dict__.get('b', None))


mod = types.ModuleType('test')
print(mod.__dict__.get('a', None))
print(mod.__dict__.get('b', None))

dill.load_session('test2.session', main=mod)
print(mod.__dict__.get('a', None))
print(mod.__dict__.get('b', None))

I get the following output

None   # a
None   # b
1234   # a
None   # b
None   # a
None   # b
1234   # a
1234   # b

Before the second session file is read, it clearly shows that the mod variable has no a nor b set. But then when reading in the test2.session file it suddenly sets both a and b. But only b was actually dumped into that file.

Is this an expected behavior?
I'd have expected that after the test2.session was loaded, only the b variable to be present

@leogama
Copy link
Contributor

leogama commented Jun 4, 2022

Thank you for reporting, @svartkanin. This is a complicated issue and, at the same time I'm explaining this to you, I'm reasoning with myself before we make any irreversible changes. If you want to add to the discussion with ideas or questions, you are more than welcome.

Some history

First things first, if you read carefully dill's documentation about session saving, it was created and implemented—this is important— for users to be able to save an interactive session and restore it later (in the same interpreter instance or not). Note that the documentation doesn't mention, except in the description of the parameter main, the use case of saving the state of a module other than __main__, to the point that this use is kind of a (supported) concession. In practice, the functions work as if they were dump_module() and load_module(). But things can evolve...

The issue

Your odd results came from an actual bug mixed with a usage that was not foreseen, that is: restoring the state of a dumped module in a different one (and one that have a different name, and that is not importable, etc.). In this regard, if you could tell what kind of problem you wanted to solve and how you were trying to do it when this came up, it would help us in making a couple of decisions.

Namespace hell

Python is a extremely dynamical language and, because of this, simple things like finding out if a class belongs to a module must be done at runtime. To deal with it, both Python and dill rely on some assumptions about a module's attributes, like __name__ and __spec__, if it is importable and if it is registered in the sys.modules dictionary. They also inspect some attributes of classes and functions, like __module__ and __qualname__. Therefore, it is a bad idea in general to mess with these things in an inconsistent manner, as this could give rise to hard to detect bugs in completely unrelated code.

If you want to have a glimpse of this issue's complexity, take a look at the lengths the Standard Library module pickle goes to in order to know if it's safe to save an object by reference:

The first part of the method save_global()...
https://github.com/python/cpython/blob/3d647e70cf4fd0e7cef68ed6662de3cb2cb0d63d/Lib/pickle.py#L1060-L1078
...that calls the function whichmodule().
https://github.com/python/cpython/blob/3d647e70cf4fd0e7cef68ed6662de3cb2cb0d63d/Lib/pickle.py#L335-L352

Should load_session() support the main parameter?

Because of all this, @mmckerns and I are converging to the opinion that this usage (restoring the state of a module in a different one) should not be supported in load_session(). The bug I've actually found in the code has a trivial fix, however even without it the function would have unpredictable side effects if used in that way. And there are no straightforward remedies. On the other hand, dill should support saving and restoring, flawlessly, modules created at runtime, like your test example.

Don't tell me what I can't do

Even though the Zen of Python enunciates "There should be one-- and preferably only one --obvious way to do it.", the language doesn't grab programmers by their hands and say what they can or cannot do. Much like the C language, in which its reference interpreter is implemented, Python has a philosophy of granting total freedom to the programmer. Want to relentlessly cast pointers back and forward? Go ahead. Want to replace built-in functions? Please, do! They will sometimes recommend against and warn you, but never forbid.

It should be possible —currently it is not for the module __main__, but that's another issue— to reproduce your example but using dill.dump(vars(module), file) and vars(module).update(dill.load(file)) if you really wants to. Just beware of the mentioned side effects.

Proposed changes

These are the operations that dill.load_session() should support, beyond the already documented:

Save and restore the state of an importable module other than __main__.

import module
dill.dump_session(main=module)
# --- in a different session --- #
module = dill.load_session()  # the module is imported, its state restored and it is returned
# or
import module
dill.load_session(main=module)  # the argument to 'main' has no effect, but is accepted for backward compatibility

Save and restore a module created at runtime, that may be not importable.

module = types.ModuleType('runtime')
module.var = 'spam'
dill.dump_session(main=module)
# --- in a different session --- #
module = types.ModuleType('runtime')
dill.load_session(main=module)  # the module's state is restored (no import is attempted)
# or
module = dill.load_session()  # a module named 'runtime' is *created*, its state restored and it is returned

In the last example, the module runtime is not added to the sys.modules dictionary, which is the expected behavior. But the user can easily do this in the same line:

runtime = sys.modules['runtime'] = dill.load_session()
# or
module = dill.load_session()
sys.modules[module.__name__] = module

Not supported: Save a module state and try to restore it in a different one.

import foo
dill.dump_session(main=foo)
# --- in a different session --- #
import bar  # or bar = types.ModuleType('whatever')
dill.load_session(main=bar)  # raises an UnpicklingException

Note: If the OP or anyone wants to follow the full discussion, please read what has already been raised here: https://stackoverflow.com/questions/72486294/dill-keeps-previous-loaded-session-when-loading-new-session#tab-top, and here: #463 (comment)

@svartkanin
Copy link
Author

Yes it seemed that I'm attempting something that is probably not the right way of doing things :)
For my use case, I'm currently working on a editor that allows to write python code, I'd like to run the code in the background in a separate process, e.g.

code = '...'

proc = await asyncio.create_subprocess_exec(
        'python', '-c', code,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
)

stdout, stderr = await proc.communicate()

and then analyze the Python objects after the execution, so NOT the stdout but the actual objects of the code that was run.
Since I can't retrieve the objects from the output I append the code a simple 2 liner

import dill
dill.dump_session('session.dump')

When the async execution is completed I attempt to read the resulting session back in and start analyzing the objects.
I do suspect this is not the best way of doing it but I could not find any working alternative to it.

@leogama
Copy link
Contributor

leogama commented Jun 9, 2022

Thank you again for complementing your report. The usage you are attempting is actually interesting: I can envision someone having to compare multiple versions of an object saved in different sessions by loading them to modules created at runtime. Maybe this should indeed be allowed if the target module is not in sys.modules.values() and the names match, possibly with a warning.


About what you are trying to do, there's a serious risk with the load_session() approach because, in practice, there's no way to sandbox pickled objects. Even just unpickling an object, without ever touching it afterwards, may execute arbitrary code and mess with the interpreter session, eventually crashing the whole process in the worst case.

I don't know if your code editor is an exercise/toy project or a serious one, but the best thing you could do to isolate the user code (but still be able to inspect it) is to use something like an IPython kernel in a subprocess to execute the program and then pass commands for it to inspect the results. The kernel would return things always as text, for example, the names of the objects in __main__ as a string of identifiers separated by spaces, which you would just .split(), or the value of str(obj) or repr(obj) for you to print.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants