-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pickle.loads
will crash with self-references inside a custom hash function
#124937
Comments
Your stack trace shows the code throws an exception before it ever gets to pickle. Do you have a different example that crashes in pickle code? |
ah yea, sorry import pickle
class Foo:
def __init__(self):
self.x = None
def __hash__(self):
return hash(self.x)
foo = Foo()
foo.x = {foo}
print(pickle.loads(pickle.dumps(foo))) => Traceback (most recent call last):
File "/home/charles/vyper/foo.py", line 13, in <module>
print(pickle.loads(pickle.dumps(foo)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/charles/vyper/foo.py", line 8, in __hash__
return hash(self.x)
^^^^^^
AttributeError: 'Foo' object has no attribute 'x' |
Hmm, this might be a wontfix. In your reproducer,
This is pretty much expected, because your hash function is recursive. The calls are There's not much we can do to prevent you from writing recursive hash functions. But, maybe there's a better solution here. How are you doing this in practice? |
Nevermind, it seems we don't support hashing sets at all: |
i don't think it really matters if the hash function works or not. for example if i replace the cycle using a set with a frozenset, which is hashable, i can get the same error: import pickle
class Foo:
def __init__(self):
self.x = None#{1}
def __hash__(self):
return hash(self.x)
foo = Foo()
foo.x = frozenset([foo])
print(hash(foo)) # valid
print(pickle.loads(pickle.dumps(foo)).x) |
the issue is still there even if the hash function is not recursive. here is another variant to demonstrate: import pickle
class Foo:
def __init__(self):
self.x = set()
self.y = 5 # pretend it is immutable
def __hash__(self):
return hash(self.y)
foo = Foo()
foo.x.add(foo)
print(hash(foo)) # valid
print(pickle.loads(pickle.dumps(foo)).x) =>
|
It's very difficult to serialize circular references. For example, >>> import json
>>> x = {1: 2}
>>> x["x"] = x
>>> json.dumps(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.12/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/json/encoder.py", line 200, in encode
chunks = self.iterencode(o, _one_shot=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/json/encoder.py", line 258, in iterencode
return _iterencode(o, 0)
^^^^^^^^^^^^^^^^^
ValueError: Circular reference detected The hashing is still technically recursive in your reproducer. While the But maybe this is a bug. I'm not 100% certain that we don't support circular references with pickle, so if we're supposed to, then it is an issue. It seems like you aren't the first person to encounter this, see this SO question. Interestingly, this does seem to work when the hash function isn't custom: >>> class Foo:
... def __init__(self):
... self.x = set()
...
>>> foo = Foo()
>>> foo.x.add(foo)
>>> pickle.loads(pickle.dumps(foo)).x
{<__main__.Foo object at 0x7a504f893a40>} In any event, it's probably worth making the error nicer if we really want to not support this. |
pickle absolutely serializes circular references. as i pointed out in the issue description, a list works fine here:
pickle's ability to handle cycles is documented, part of the pickle format design, and clear from the implementation, e.g. Lines 987 to 999 in 1f9025a
i've spent some time today triaging the issue, and it seems that the issue is that during deserialization of dicts, sets and frozensets, the hash function is called before the child items are totally reconstructed, so apparently, the technique that works for lists and tuples (i.e. putting not-fully reconstructed pointers into the list), does not work here: Lines 1820 to 1844 in 1f9025a
yes i am aware, i included it in the issue description. |
Oh, my bad. I'm guilty of only reading the reproducer from time to time 😆 I guess this is a bug, thank you for the report! A quick search shows that this issue was actually reported at bpo-1761028 over 15 years ago, but it was decided there that it was too complicated to fix, so I doubt anyone else will want to work on this -- would you like to author the PR? A fair warning though: this might be impossible to fix without breaking backwards compatibility (because we might need to move where |
i have a partial fix here (against v3.11.10 tag 0c47759) diff --git a/Lib/pickle.py b/Lib/pickle.py
index f760bcdcba9..062a0e0ff6c 100644
--- a/Lib/pickle.py
+++ b/Lib/pickle.py
@@ -1204,6 +1204,9 @@ def load(self):
self.proto = 0
read = self.read
dispatch = self.dispatch
+
+ self.deferred = []
+
try:
while True:
key = read(1)
@@ -1212,6 +1215,8 @@ def load(self):
assert isinstance(key, bytes_types)
dispatch[key[0]](self)
except _Stop as stopinst:
+ for f in self.deferred:
+ f()
return stopinst.value
# Return a list of items pushed in the stack after last MARK instruction.
@@ -1701,12 +1706,15 @@ def load_setitems(self):
def load_additems(self):
items = self.pop_mark()
set_obj = self.stack[-1]
- if isinstance(set_obj, set):
- set_obj.update(items)
- else:
- add = set_obj.add
- for item in items:
- add(item)
+
+ def finalize():
+ if isinstance(set_obj, set):
+ set_obj.update(items)
+ else:
+ add = set_obj.add
+ for item in items:
+ add(item)
+ self.deferred.append(finalize)
dispatch[ADDITEMS[0]] = load_additems
def load_build(self):
@@ -1774,6 +1782,7 @@ def _loads(s, /, *, fix_imports=True, encoding="ASCII", errors="strict",
# Use the faster _pickle if possible
try:
+ raise ImportError() # testing
from _pickle import (
PickleError,
PicklingError, but it only works for sets, and doesn't appear to scale to other data structures. for example, when i extend the technique to |
FWIW, we can't fix this on 3.11 -- that's security-only. I would write the fix towards against the main branch, and then depending on the complexity of the fix, we can backport it to 3.12 at the lowest. |
I don't really see room for a fix here. When pickle deserializes an object, it first creates it with no attributes (without calling You can fix that for sets specifically with a fix like the one above, but it will still fail for other kinds of references. In any case, your code puts a mutable object in a set, and the hash changes later. That's questionable anyway; it may mean that you can't find back the object later. I would recommend to not make any changes here. |
Pickle has nothing to do with it. The exception is raised earlier. |
The exception occurs during the deserialization process, but yeah it would be a complicated (or possibly impossible) fix. |
I see, there are other examples. Anyway, the bug is in the user code. |
The variant I posted above demonstrates the issue does not have to do with whether the hash changes or not. It only has to do with the fact that hash is called before the object is fully instantiated. |
And moving that would probably break things :( |
I disagree. Because of how dict deserialization works, hash is called on the objects before they are properly instantiated. |
I see, there are more examples. And even without recursive hash. You can solve this problem by implementing either def __new__(cls):
self = super().__new__(cls)
self.y = 5
return self or def __reduce__(self):
return Foo, () |
yea, these are interesting. I shrank the examples for this bug report, so I'm not sure if the btw, here is the full example which is causing the issue for me: charles-cooper/vyper@2a5facf the cycle is being raised when reconstructing this object: https://github.com/charles-cooper/vyper/blob/2a5facf43ffb08b66f76d3d2b16b8f72f22d0625/vyper/semantics/analysis/base.py#L229-L232 |
@serhiy-storchaka the in the meantime, i see no reason this cannot work for any dataclass, so i'll submit a PR for that. |
workaround for gh python#124937, following the recommended approach at python#124937 (comment).
Indeed, looking on your example I now wonder why cc @ericvsmith |
i would say it's both :). none of the snippets i posted in this thread use dataclasses, but apparently dataclasses have enough information to work around the issue anyway. |
fix a bug where unpickling `annotated_vyper_module` would lead to a crash: ``` AttributeError: 'VarAccess' object has no attribute 'variable' ``` this is a blocker for tooling, for instance, titanoboa relies on pickle/unpickle to cache `CompilerData` objects: https://github.com/vyperlang/titanoboa/blob/86df8936654db2068641/boa/util/disk_cache.py#L65-L66 the underlying issue is that `pickle.loads()` calls `obj.__hash__()` for objects that are keys in a hashed data structure - namely, dicts, sets and frozensets. this causes a crash when there is a cycle in the object graph, because the object is not fully instantiated at the time that `__hash__()` is called. this is a cpython issue, reported at python/cpython#124937. @serhiy-storchaka suggested the approach taken in this PR, which breaks the loop before pickling: python/cpython#124937 (comment) note that the implementation of `__reduce__()` in this PR is safe, since there is no cycle in the hash function itself, since the recursion breaks in `VarInfo.__hash__()`. in other words, there is no possibility of `VarAccess.__hash__()` changing mid-way through reconstructing the object.
Bug report
Bug description:
here is a reproduction of the issue:
running this will result in the following exception:
a workaround to the issue has been described at https://stackoverflow.com/a/44888113. however, i consider this a bug in the cpython implementation, because pickle theoretically handles object cycles (e.g., replacing line 5 with
self.x = [self]
poses no problem to the unpickler).i suspect that cpython rehashes all items when reconstructing a dict or set, which makes the issue even more problematic, e.g. if the hash function has any side-effects, they will be executed by the unpickler.
build info:
CPython versions tested on:
3.11
Operating systems tested on:
Linux
The text was updated successfully, but these errors were encountered: