
pickle.loads will crash with self-references inside a custom hash function #124937

Open
charles-cooper opened this issue Oct 3, 2024 · 24 comments
Labels
extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir topic-dataclasses type-bug An unexpected behavior, bug, or error

Comments

@charles-cooper

charles-cooper commented Oct 3, 2024

Bug report

Bug description:

here is a reproduction of the issue:

import pickle

class Foo:
    def __init__(self):
        self.x: object = {self}

    def __hash__(self):
        return hash(self.x)

foo = Foo()

print(pickle.loads(pickle.dumps(foo)))

running this will result in the following exception:

Traceback (most recent call last):
  File "/home/charles/vyper/foo.py", line 10, in <module>
    foo = Foo()
          ^^^^^
  File "/home/charles/vyper/foo.py", line 5, in __init__
    self.x: object = {self}
                     ^^^^^^
  File "/home/charles/vyper/foo.py", line 8, in __hash__
    return hash(self.x)
                ^^^^^^
AttributeError: 'Foo' object has no attribute 'x'

a workaround has been described at https://stackoverflow.com/a/44888113. however, i consider this a bug in the cpython implementation, because pickle is designed to handle object cycles (e.g., replacing line 5 with self.x = [self] poses no problem to the unpickler).
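For comparison, here is a minimal sketch (same shape as the reproducer, but with a list instead of a set) showing that pickle rebuilds list cycles fine, since no hashing is involved:

```python
import pickle

class Foo:
    def __init__(self):
        self.x = [self]   # cycle through a list: no hashing needed

foo = Foo()
restored = pickle.loads(pickle.dumps(foo))
print(restored.x[0] is restored)  # True: pickle reconstructs the cycle
```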

i suspect that cpython rehashes all items when reconstructing a dict or set, which makes the issue even more problematic: if the hash function has any side effects, they will be executed by the unpickler.
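A minimal sketch (the Noisy class is hypothetical) that makes the rehashing observable by counting __hash__ invocations:

```python
import pickle

calls = {"n": 0}

class Noisy:
    def __init__(self, n):
        self.n = n

    def __hash__(self):
        calls["n"] += 1          # observable side effect
        return hash(self.n)

    def __eq__(self, other):
        return isinstance(other, Noisy) and self.n == other.n

s = {Noisy(1)}                   # hashing happens here, at construction
data = pickle.dumps(s)
before_load = calls["n"]
pickle.loads(data)               # the element is rehashed while loading
print("rehash calls during load:", calls["n"] - before_load)
```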

build info:

$ python
Python 3.11.10 (main, Sep  7 2024, 18:35:41) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

CPython versions tested on:

3.11

Operating systems tested on:

Linux

@charles-cooper charles-cooper added the type-bug An unexpected behavior, bug, or error label Oct 3, 2024
@JelleZijlstra
Member

Your stack trace shows the code throws an exception before it ever gets to pickle. Do you have a different example that crashes in pickle code?

@charles-cooper
Author

charles-cooper commented Oct 3, 2024

ah yeah, sorry. here is a corrected reproducer:

import pickle

class Foo:
    def __init__(self):
        self.x = None

    def __hash__(self):
        return hash(self.x)

foo = Foo()
foo.x = {foo}

print(pickle.loads(pickle.dumps(foo)))

=>

Traceback (most recent call last):
  File "/home/charles/vyper/foo.py", line 13, in <module>
    print(pickle.loads(pickle.dumps(foo)))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/charles/vyper/foo.py", line 8, in __hash__
    return hash(self.x)
                ^^^^^^
AttributeError: 'Foo' object has no attribute 'x'

@ZeroIntensity ZeroIntensity added the stdlib Python modules in the Lib dir label Oct 3, 2024
@ZeroIntensity
Member

ZeroIntensity commented Oct 3, 2024

Hmm, this might be a wontfix. In your reproducer, foo itself is not hashable:

>>> hash(foo)
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    hash(foo)
    ~~~~^^^^^
  File "<python-input-0>", line 8, in __hash__
    return hash(self.x)
TypeError: unhashable type: 'set'

This is pretty much expected, because your hash function is recursive. The calls are foo.__hash__ -> set.__hash__ -> foo.__hash__ again, and so on!

There's not much we can do to prevent you from writing recursive hash functions. But, maybe there's a better solution here. How are you doing this in practice?

@ZeroIntensity
Member

Nevermind, it seems we don't support hashing sets at all: hash({1}) results in a TypeError. I wouldn't expect pickle to be able to handle something unsupported any better.
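The asymmetry is easy to check directly in a standalone sketch: sets are unhashable, frozensets are not.

```python
# sets cannot be hashed, so hash({1}) raises; frozensets are fine
try:
    hash({1})
    raised = False
except TypeError as exc:
    raised = True
    print(exc)                     # unhashable type: 'set'

fs_hash = hash(frozenset({1}))     # works: frozenset is immutable
print(raised, isinstance(fs_hash, int))
```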

@charles-cooper
Author

charles-cooper commented Oct 3, 2024

Nevermind, it seems we don't support hashing sets at all: hash({1}) results in a TypeError. I wouldn't expect pickle to be able to handle something unsupported any better.

i don't think it matters whether the hash function works or not. for example, if i replace the set in the cycle with a frozenset, which is hashable, i get the same error:

import pickle

class Foo:
    def __init__(self):
        self.x = None

    def __hash__(self):
        return hash(self.x)

foo = Foo()
foo.x = frozenset([foo])

print(hash(foo))  # valid

print(pickle.loads(pickle.dumps(foo)).x)

@charles-cooper
Author

charles-cooper commented Oct 3, 2024

This is pretty much expected, because your hash function is recursive. The calls are foo.__hash__ -> set.__hash__ -> foo.__hash__ again, and so on!

the issue is still there even if the hash function is not recursive. here is another variant to demonstrate:

import pickle

class Foo:
    def __init__(self):
        self.x = set()
        self.y = 5  # pretend it is immutable

    def __hash__(self):
        return hash(self.y)

foo = Foo()
foo.x.add(foo)

print(hash(foo))  # valid

print(pickle.loads(pickle.dumps(foo)).x)

=>

  File "/home/charles/vyper/foo.py", line 11, in __hash__
    return hash(self.y)
                ^^^^^^
AttributeError: 'Foo' object has no attribute 'y'

@ZeroIntensity
Member

It's very difficult to serialize circular references. For example, json has protection against this:

>>> import json
>>> x = {1: 2}
>>> x["x"] = x
>>> json.dumps(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
ValueError: Circular reference detected

The hashing is still effectively recursive in your reproducer. While __hash__ itself is not recursive, pickle has to serialize every attribute, and hashing foo.x recurses back into foo. (To be fair, I'm not sure the hash function itself matters; the point is that pickle doesn't have a good way to serialize it.)

But maybe this is a bug. I'm not 100% certain that we don't support circular references with pickle, so if we're supposed to, then it is an issue. It seems like you aren't the first person to encounter this, see this SO question.

Interestingly, this does seem to work when the hash function isn't custom:

>>> class Foo:
...     def __init__(self):
...         self.x = set()
... 
>>> foo = Foo()
>>> foo.x.add(foo)
>>> pickle.loads(pickle.dumps(foo)).x
{<__main__.Foo object at 0x7a504f893a40>}

In any event, it's probably worth making the error nicer if we really don't want to support this.

@ZeroIntensity ZeroIntensity added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Oct 3, 2024
@charles-cooper
Author

charles-cooper commented Oct 3, 2024

It's very difficult to serialize circular references. For example, json has protection against this:

pickle absolutely serializes circular references. as i pointed out in the issue description, a list works fine here:

replacing line 5 with self.x = [self] poses no problem to the unpickler.

pickle's ability to handle cycles is documented, part of the pickle format design, and clear from the implementation, e.g.

cpython/Lib/pickle.py

Lines 987 to 999 in 1f9025a

if id(obj) in memo:
    # Subtle. d was not in memo when we entered save_tuple(), so
    # the process of saving the tuple's elements must have saved
    # the tuple itself: the tuple is recursive. The proper action
    # now is to throw away everything we put on the stack, and
    # simply GET the tuple (it's already constructed). This check
    # could have been done in the "for element" loop instead, but
    # recursive tuples are a rare thing.
    get = self.get(memo[id(obj)][0])
    if self.bin:
        write(POP_MARK + get)
    else:   # proto 0 -- POP_MARK not available
        write(POP * (n+1) + get)

i've spent some time today triaging the issue. it seems that during deserialization of dicts, sets and frozensets, the hash function is called before the child items are fully reconstructed, so the technique that works for lists and tuples (i.e. inserting pointers to not-yet-fully-reconstructed objects into the container) does not work here:

cpython/Lib/pickle.py

Lines 1820 to 1844 in 1f9025a

def load_setitem(self):
    stack = self.stack
    value = stack.pop()
    key = stack.pop()
    dict = stack[-1]
    dict[key] = value
dispatch[SETITEM[0]] = load_setitem

def load_setitems(self):
    items = self.pop_mark()
    dict = self.stack[-1]
    for i in range(0, len(items), 2):
        dict[items[i]] = items[i + 1]
dispatch[SETITEMS[0]] = load_setitems

def load_additems(self):
    items = self.pop_mark()
    set_obj = self.stack[-1]
    if isinstance(set_obj, set):
        set_obj.update(items)
    else:
        add = set_obj.add
        for item in items:
            add(item)
dispatch[ADDITEMS[0]] = load_additems

It seems like you aren't the first person to encounter this, see this SO question.

yes i am aware, i included it in the issue description.
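The ordering can be made visible with a small sketch (the Probe class is illustrative) that instruments __hash__ and __setstate__, showing that the unpickler hashes the object before its state is applied:

```python
import pickle

calls = []

class Probe:
    def __init__(self):
        self.x = set()
        self.y = 5

    def __hash__(self):
        calls.append("hash")
        # tolerate the half-built state so we can observe the ordering
        return hash(getattr(self, "y", 0))

    def __setstate__(self, state):
        calls.append("setstate")
        self.__dict__.update(state)

p = Probe()
p.x.add(p)
calls.clear()                     # only record what happens during loads
pickle.loads(pickle.dumps(p))
print(calls)                      # __hash__ runs before the state arrives
```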

@ZeroIntensity
Member

Oh, my bad. I'm guilty of only reading the reproducer from time to time 😆 I guess this is a bug, thank you for the report!

A quick search shows that this issue was actually reported at bpo-1761028 over 15 years ago, but it was decided there that it was too complicated to fix, so I doubt anyone else will want to work on this -- would you like to author the PR? A fair warning though: this might be impossible to fix without breaking backwards compatibility (because we might need to move where __hash__ is called).

@picnixz
Contributor

picnixz commented Oct 3, 2024

cc @serhiy-storchaka

@picnixz picnixz added the extension-modules C modules in the Modules dir label Oct 3, 2024
@charles-cooper
Author

charles-cooper commented Oct 3, 2024

would you like to author the PR? A fair warning though: this might be impossible to fix without breaking backwards compatibility (because we might need to move where __hash__ is called).

i have a partial fix here (against v3.11.10 tag 0c47759)

diff --git a/Lib/pickle.py b/Lib/pickle.py
index f760bcdcba9..062a0e0ff6c 100644
--- a/Lib/pickle.py
+++ b/Lib/pickle.py
@@ -1204,6 +1204,9 @@ def load(self):
         self.proto = 0
         read = self.read
         dispatch = self.dispatch
+
+        self.deferred = []
+
         try:
             while True:
                 key = read(1)
@@ -1212,6 +1215,8 @@ def load(self):
                 assert isinstance(key, bytes_types)
                 dispatch[key[0]](self)
         except _Stop as stopinst:
+            for f in self.deferred:
+                f()
             return stopinst.value
 
     # Return a list of items pushed in the stack after last MARK instruction.
@@ -1701,12 +1706,15 @@ def load_setitems(self):
     def load_additems(self):
         items = self.pop_mark()
         set_obj = self.stack[-1]
-        if isinstance(set_obj, set):
-            set_obj.update(items)
-        else:
-            add = set_obj.add
-            for item in items:
-                add(item)
+
+        def finalize():
+            if isinstance(set_obj, set):
+                set_obj.update(items)
+            else:
+                add = set_obj.add
+                for item in items:
+                    add(item)
+        self.deferred.append(finalize)
     dispatch[ADDITEMS[0]] = load_additems
 
     def load_build(self):
@@ -1774,6 +1782,7 @@ def _loads(s, /, *, fix_imports=True, encoding="ASCII", errors="strict",
 
 # Use the faster _pickle if possible
 try:
+    raise ImportError()  # testing
     from _pickle import (
         PickleError,
         PicklingError,

but it only works for sets, and doesn't appear to scale to other data structures. for example, when i extended the technique to load_setitems, the cycle still occurred during execution of the deferred functions. i suspect setitems is being invoked during object reconstruction somewhere in my example, but i'm not sure how much time i have to keep looking into this.

@ZeroIntensity
Member

FWIW, we can't fix this on 3.11 -- that's security-only. I would write the fix against the main branch, and then depending on the complexity of the fix, we can backport it to 3.12 at the lowest.

@JelleZijlstra
Member

I don't really see room for a fix here. When pickle deserializes an object, it first creates it with no attributes (without calling __init__), then deserializes all the attributes and sets them on the instance. But here, to create the attributes, we need to put the object itself in a set. Initializing a set means hashing the members, and this object's __hash__() doesn't work yet when there are no attributes yet.

You can fix that for sets specifically with a fix like the one above, but it will still fail for other kinds of references.

In any case, your code puts a mutable object in a set, and the hash changes later. That's questionable anyway; it may mean that you can't find the object again later.

I would recommend to not make any changes here.
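The reconstruction order described above can be mimicked by hand; a minimal sketch, assuming the default reduce protocol (the Foo class is illustrative, not pickle internals):

```python
class Foo:
    def __init__(self):
        self.y = 5

    def __hash__(self):
        return hash(self.y)

obj = Foo.__new__(Foo)           # step 1: instance created, __init__ skipped
try:
    hash(obj)                    # rebuilding a set here would do exactly this
    failed = False
except AttributeError as exc:
    failed = True
    print(exc)                   # 'Foo' object has no attribute 'y'

obj.__dict__.update({"y": 5})    # step 2: pickle applies the state afterwards
print(failed, hash(obj) == hash(5))
```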

@serhiy-storchaka
Member

Pickle has nothing to do with it. The exception is raised earlier.

@ZeroIntensity
Member

ZeroIntensity commented Oct 3, 2024

The exception occurs during the deserialization process, but yeah it would be a complicated (or possibly impossible) fix.

@serhiy-storchaka
Member

I see, there are other examples. Anyway, the bug is in the user code.

@ZeroIntensity ZeroIntensity added the pending The issue will be closed if no feedback is provided label Oct 3, 2024
@charles-cooper
Author

In any case, your code puts a mutable object in a set, and the hash changes later. That's questionable anyway; it may mean that you can't find back the object later.

The variant I posted above demonstrates the issue does not have to do with whether the hash changes or not. It only has to do with the fact that hash is called before the object is fully instantiated.

#124937 (comment)

@ZeroIntensity
Member

It only has to do with the fact that hash is called before the object is fully instantiated.

And moving that would probably break things :(

@ZeroIntensity ZeroIntensity removed 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Oct 3, 2024
@charles-cooper
Author

I see, there are other examples. Anyway, the bug is in the user code.

I disagree. Because of how dict deserialization works, hash is called on the objects before they are properly instantiated.

@serhiy-storchaka
Member

I see, there are more examples. And even without recursive hash.

You can solve this problem by implementing either __new__(), so you would have a functioning object before adding it to a set,

def __new__(cls):
    self = super().__new__(cls)
    self.y = 5
    return self

or __reduce__() that returns a function which creates the object and the set in one step, without creating a loop in pickle.

def __reduce__(self):
    return Foo, ()

charles-cooper added a commit to charles-cooper/vyper that referenced this issue Oct 3, 2024
@charles-cooper
Author

charles-cooper commented Oct 3, 2024

I see, there are more examples. And even without recursive hash.

You can solve this problem by implementing either __new__(), so you would have a functioning object before adding it to a set,

def __new__(cls):
    self = super().__new__(cls)
    self.y = 5
    return self

or __reduce__() that returns a function which creates the object and the set in one step, without creating a loop in pickle.

def __reduce__(self):
    return Foo, ()

yeah, these are interesting. I shrank the examples for this bug report, so I'm not sure the __new__ approach will work: the real code does not set self.y to a constant, it depends on input. the __reduce__ approach might work, let me check. shouldn't the pickle implementation do something like this internally to break the loop in the deserializer?

btw, here is the full example which is causing the issue for me: charles-cooper/vyper@2a5facf

the cycle is being raised when reconstructing this object: https://github.com/charles-cooper/vyper/blob/2a5facf43ffb08b66f76d3d2b16b8f72f22d0625/vyper/semantics/analysis/base.py#L229-L232

@charles-cooper
Author

@serhiy-storchaka the __reduce__ approach seemed to work. can't pickle do something similar internally to break the cycle during serialization?

in the meantime, i see no reason this cannot work for any dataclass, so i'll submit a PR for that.

@serhiy-storchaka
Member

Indeed, looking at your example, I now wonder why dataclass(frozen=True) does not do this by default. This is not a general pickle issue, but it may be a dataclasses issue.

cc @ericvsmith

@charles-cooper
Author

Indeed, looking at your example, I now wonder why dataclass(frozen=True) does not do this by default. This is not a general pickle issue, but it may be a dataclasses issue.

cc @ericvsmith

i would say it's both :). none of the snippets i posted in this thread use dataclasses, but apparently dataclasses have enough information to work around the issue anyway.

charles-cooper added a commit to vyperlang/vyper that referenced this issue Oct 7, 2024
fix a bug where unpickling `annotated_vyper_module` would lead to
a crash:

```
AttributeError: 'VarAccess' object has no attribute 'variable'
```

this is a blocker for tooling, for instance, titanoboa
relies on pickle/unpickle to cache `CompilerData` objects:
https://github.com/vyperlang/titanoboa/blob/86df8936654db2068641/boa/util/disk_cache.py#L65-L66

the underlying issue is that `pickle.loads()` calls `obj.__hash__()`
for objects that are keys in a hashed data structure - namely, dicts,
sets and frozensets. this causes a crash when there is a cycle in
the object graph, because the object is not fully instantiated at the
time that `__hash__()` is called. this is a cpython issue, reported at
python/cpython#124937.

@serhiy-storchaka suggested the approach taken in this PR, which breaks
the loop before pickling:
python/cpython#124937 (comment)

note that the implementation of `__reduce__()` in this PR is safe: there is no
cycle in the hash function itself, because the recursion breaks in
`VarInfo.__hash__()`. in other words, there is no possibility of
`VarAccess.__hash__()` changing midway through reconstructing the object.
@picnixz picnixz removed the pending The issue will be closed if no feedback is provided label Dec 2, 2024
@picnixz picnixz moved this to In Progress in Pickle and copy issues 🥒 Dec 2, 2024
@picnixz picnixz moved this from In Progress to Todo in Pickle and copy issues 🥒 Dec 2, 2024
Projects
Status: Todo
Development

No branches or pull requests

5 participants