
cache_hash breaks copy and pickle for non-slots classes #613

Closed
pganssle opened this issue Jan 13, 2020 · 3 comments · Fixed by #620


pganssle commented Jan 13, 2020

After "fixing" #611, I realized that my test in #612 was incomplete - I was not asserting that the copy.deepcopy worked, and it turns out it did not. The test, modified as below, fails because b.x is never set (the attribute doesn't even exist in the copied object):

    def test_copy_roundtrip(self):
        @attr.s(frozen=True, cache_hash=True)
        class C(object):
            x = attr.ib()

        a = C(1)
        b = copy.deepcopy(C(1))

        assert a == b

I may be missing something, but it seems like #489 actually broke serialization entirely for any class with cache_hash:

>>> import attr
>>> import copy
>>> @attr.s(hash=True, cache_hash=True)
... class SomeClass:
...     x = attr.ib()
...
>>> SomeClass(1)
SomeClass(x=1)
>>> copy.deepcopy(SomeClass(1))
SomeClass(x=NOTHING)

I believe the reason for this is that copy and pickle skip their default state-restoration behavior whenever __setstate__ is defined - they just create a new object and call __setstate__ with the captured state. So when __setstate__ doesn't actually restore the state, the object remains uninitialized.
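A minimal standalone sketch of this failure mode (this is hypothetical illustration code, not the attrs-generated methods; `_hash_cache` is an assumed name): a class whose __setstate__ only resets a cache field causes copy to hand the state to __setstate__ and do nothing else, so the copy never gets its attributes back.

```python
import copy

# Hypothetical minimal class (not the attrs-generated code): __setstate__
# only resets a cache field and ignores the captured state entirely.
class BadSetstate:
    def __init__(self, x):
        self.x = x
        self._hash_cache = None  # assumed name for the cache slot

    def __setstate__(self, state):
        # Because this method exists, copy/pickle call it INSTEAD of their
        # default "update __dict__" behavior -- and it drops `state`.
        self._hash_cache = None

a = BadSetstate(1)
b = copy.deepcopy(a)
print(hasattr(b, "x"))  # False: the copy was never initialized
```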

I assume this went unnoticed because @gabbard (who hit the problem in the first place) is using a slots class, which doesn't have this problem (slots classes define a working __setstate__).

I think there are two options here:

  1. Define a custom __getstate__ and __setstate__ for classes with cache_hash=True to duplicate what pickle and copy were doing anyway.
  2. Define a default __reduce__ method that removes the hash cache.

I don't like the first option very much, because it means that we have to re-implement copy's and pickle's default behavior (which may even diverge from one another)! I like the second one a lot more, particularly because this would just be the default __reduce__. People implementing their own custom __reduce__ could then choose whether or not to include the cached hash (though I'm not sure there's a public variable anywhere they can access to tell which member it would be - maybe exposing such a public member should be part of this?).
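A hedged sketch of what option 2 could look like, under assumed names (`_hash_cache` is illustrative, not necessarily the attribute attrs generates): the default __reduce__ reconstructs the object as usual but hands over a state dict with the cached hash blanked out.

```python
import copy

# Sketch of option 2 with hypothetical names -- not the attrs implementation.
class CachedHash:
    def __init__(self, x):
        self.x = x
        self._hash_cache = None  # assumed name for the cache field

    def __hash__(self):
        if self._hash_cache is None:
            self._hash_cache = hash((self.__class__, self.x))
        return self._hash_cache

    def __reduce__(self):
        # Default-style reduce, except the cached hash is dropped from the
        # state so copies/pickles start with a cold cache.
        state = self.__dict__.copy()
        state["_hash_cache"] = None
        return (self.__class__, (self.x,), state)

a = CachedHash(1)
hash(a)                    # warm the cache on the original
b = copy.deepcopy(a)
print(b.x, b._hash_cache)  # 1 None
```

Note that copy.deepcopy goes through the same __reduce__ protocol as pickle, so a single hook covers both.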


gabbard commented Jan 13, 2020

@pganssle : Thanks for catching this bug! You are correct that we use serialization with hash caching extensively in our codebase, but all the classes are slots=True, so in my workplace's own use we would not have caught this.

Looking at the PR, I actually did attempt to test that slots and non-slots classes had the right __setstate__ behavior, but I screwed it up: the check that the cache was cleared on de-serialization masked exactly the problem you are seeing. I should have added a distinct test that simple serialization works at all.

I agree that the second proposed solution seems cleaner. I am not sure, though, that we even want to expose the identity of the hash cache field. It seems to me it is always wrong to serialize it: it is entirely an implementation detail, not part of the semantics of the object, and it's really easy to shoot yourself in the foot if you serialize it. On the other hand, if for some reason a user's __reduce__ method did not assume it knew all the object's fields in advance, and therefore needed to know the hash cache field in order to exclude it, I could see some value. That seems like a very odd case, though.

(@hynek : while checking into this, I discovered another problem in this test and created #614 to fix it)

pganssle added a commit to pganssle/attrs that referenced this issue Jan 13, 2020
This fixes GH issue python-attrs#613 and python-attrs#494. It turns out that the hash
cache-clearing implementation for non-slots classes was flawed and never
quite worked properly. This switches away from using __setstate__ and
instead adds a custom __reduce__ that removes the cached hash value from
the default serialized output.

This commit also refactors some of the tests a bit, to try and more
cleanly organize the tests related to this issue.
@pganssle

@gabbard Well, the point of exposing a variable pointing to that name would be to allow people with custom __reduce__ implementations to choose not to serialize it. If you do the naive thing like object does, your implementation of __reduce__ will just pass the object's __dict__ as its state, and for non-slots classes that will include the cached hash value.

I have prepared a PR fixing this using the __reduce__ method, #615 (I also refactored the pickle round-trip code, so if that gets merged #614 won't be necessary), but I did not bother trying to expose a variable including the cache location. We can save that for a future feature request if it's really desired.

That said, I'm a bit puzzled by this:

It seems to me it is always wrong to serialize it - it is entirely an implementation detail and not part of the semantics of an object, and it's really easy to shoot yourself in the foot if you serialize it.

I agree that it's an implementation detail and not part of the semantics of the object, but in trying to write a test for this, I couldn't actually think of a legitimate use case where serializing it would make a big difference. For any immutable class composed of values with deterministic hashes, the hash will always be the same. It will make the serialized form slightly bigger, but probably by a rounding error on the total size if these objects are big enough that you want to cache their hashes.

You shouldn't be able to hash dictionaries or sets at all, and frozenset seems to have a deterministic hash, so the only scenario I could imagine causing a problem is pickling something on one version of Python and unpickling it on another version where the hash implementation has changed - which is not how pickle is supposed to be used, and you can expect all kinds of things to break if you use it that way (the same goes for different versions of libraries). Maybe some of my assumptions are wrong? What is an example of a way you can shoot yourself in the foot when doing this?

@gabbard

gabbard commented Jan 13, 2020

@pganssle : the biggest problem is that there are values in Python that the user might naively expect to have deterministic hash codes but which do not - most notably, strings. For example:

 ~/projects/attrs/attrs > python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x="hello"
>>> hash(x)
-4850999967032345441
[Ctrl-D]
 ~/projects/attrs/attrs > python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x="hello"
>>> hash(x)
4063278766481041368

There are other cases where the hash code might be deterministic in practice but this fact is not documented or guaranteed.
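To make the point above reproducible: since hash randomization became the default in Python 3.3, str hashes depend on a per-process seed, which is what the two shell sessions above show. A small sketch forcing different seeds (via the PYTHONHASHSEED environment variable) in fresh interpreters:

```python
import os
import subprocess
import sys

def hash_hello(seed):
    """Compute hash("hello") in a fresh interpreter with an explicit hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", 'print(hash("hello"))'],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

h1 = hash_hello(1)
h2 = hash_hello(2)
# Different seeds yield different string hashes (with overwhelming probability).
print(h1 != h2)
```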

@wsanchez wsanchez added the Bug label Jan 13, 2020
pganssle added a commit to pganssle/attrs that referenced this issue Jan 27, 2020
Rather than attempting to remove the hash cache from the object state on
deserialization or serialization, instead we store the hash cache in an
object that reduces to None, thus clearing itself when pickled or
copied.

This fixes GH python-attrs#494 and python-attrs#613.

Co-authored-by: Matt Wozniski <[email protected]>
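The trick this commit message describes can be sketched as follows (a simplified stand-in for the attrs internals; the class and attribute names here are illustrative): the cached hash is stored in an int subclass whose __reduce__ reconstructs to None, so any pickle or copy of the enclosing object automatically comes back with an empty cache.

```python
import copy

class CacheHashWrapper(int):
    """An int that turns back into None when pickled or copied."""
    def __reduce__(self, _none_constructor=type(None), _args=()):
        # NoneType() returns None, so deserializing this value yields None.
        return _none_constructor, _args

# Illustrative container class -- not attrs-generated code.
class SomeClass:
    def __init__(self, x):
        self.x = x
        self._hash_cache = None

    def __hash__(self):
        if self._hash_cache is None:
            self._hash_cache = CacheHashWrapper(hash((self.__class__, self.x)))
        return self._hash_cache

a = SomeClass(1)
hash(a)                    # warm the cache
b = copy.deepcopy(a)
print(b.x, b._hash_cache)  # 1 None -- the copy's cache cleared itself
```

The nice property is that the containing class needs no custom __reduce__, __getstate__, or __setstate__ of its own; the wrapper clears itself wherever it ends up in the serialized state.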
hynek pushed a commit that referenced this issue Feb 10, 2020
* Use a self-clearing subclass to store hash cache

Rather than attempting to remove the hash cache from the object state on
deserialization or serialization, instead we store the hash cache in an
object that reduces to None, thus clearing itself when pickled or
copied.

This fixes GH #494 and #613.

Co-authored-by: Matt Wozniski <[email protected]>

* Add test for two-argument __reduce__

I couldn't think of any way to make a useful and meaningful class that
has no state and also has no custom __reduce__ method, so I went
minimalist with it.

* Improve test for hash clearing behavior.

Previously, there was some minuscule risk of hash collision, and it also relied on an implementation detail of `pickle` (the assumption that `hash()` is never called as part of `pickle.loads`).

* Add improved testing around cache_hash

* Update src/attr/_make.py

Co-Authored-By: Ryan Gabbard <[email protected]>

* Update comment in slots_setstate

Since the cached hash value is not actually serialized in __getstate__,
__setstate__ is not actually "clearing" it on deserialization - it's
initializing the value to None.

* Add changelog entry

* Remove changelog for #611

This change was overshadowed by a more fundamental change in #620.

Co-authored-by: Matt Wozniski <[email protected]>
Co-authored-by: Ryan Gabbard <[email protected]>