
cache_hash breaks copy and pickle for non-slots classes #613

Closed
pganssle opened this issue Jan 13, 2020 · 3 comments · Fixed by #620


pganssle commented Jan 13, 2020

After "fixing" #611, I realized that my test in #612 was incomplete - I was not asserting that the copy.deepcopy worked, and it turns out it did not. The test, modified as below, fails because b.x is never set (the attribute doesn't even exist in the copied object):

    def test_copy_roundtrip(self):
        @attr.s(frozen=True, cache_hash=True)
        class C(object):
            x = attr.ib()

        a = C(1)
        b = copy.deepcopy(C(1))

        assert a == b

I may be missing something, but it seems like #489 actually broke serialization entirely for any class with cache_hash:

>>> import attr
>>> import copy
>>> @attr.s(hash=True, cache_hash=True)
... class SomeClass:
...     x = attr.ib()
...
>>> SomeClass(1)
SomeClass(x=1)
>>> copy.deepcopy(SomeClass(1))
SomeClass(x=NOTHING)

I believe the reason for this is that copy and pickle skip their default state-restoration behavior whenever __setstate__ is defined - they just create a new object and call __setstate__ with the captured state. So when __setstate__ doesn't actually restore the state, the object remains uninitialized.
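A minimal standalone sketch of this failure mode (this is hypothetical illustration code, not the attrs-generated methods; `_hash_cache` is an assumed name): a class whose __setstate__ only resets a cache field causes copy to hand the state to __setstate__ and do nothing else, so the copy never gets its attributes back.

```python
import copy

# Hypothetical minimal class (not the attrs-generated code): __setstate__
# only resets a cache field and ignores the captured state entirely.
class BadSetstate:
    def __init__(self, x):
        self.x = x
        self._hash_cache = None  # assumed name for the cache slot

    def __setstate__(self, state):
        # Because this method exists, copy/pickle call it INSTEAD of their
        # default "update __dict__" behavior -- and it drops `state`.
        self._hash_cache = None

a = BadSetstate(1)
b = copy.deepcopy(a)
print(hasattr(b, "x"))  # False: the copy was never initialized
```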

I assume this went unnoticed because @gabbard (who hit the problem in the first place) is using a slots class, which doesn't have this problem (slots classes define a working __setstate__).

I think there are two options here:

  1. Define a custom __getstate__ and __setstate__ for classes with cache_hash=True to duplicate what pickle and copy were doing anyway.
  2. Define a default __reduce__ method that removes the hash cache.

I don't like the first option very much, because it means that we have to re-implement copy's and pickle's default behavior (which may even diverge from one another)! I like the second one a lot more, particularly because this would just be the default __reduce__. People implementing their own custom __reduce__ could then choose whether or not to include the cached hash (though I'm not sure there's a public variable anywhere they can access to tell which member it would be - maybe exposing such a public member should be part of this?).
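A hedged sketch of what option 2 could look like, under assumed names (`_hash_cache` is illustrative, not necessarily the attribute attrs generates): the default __reduce__ reconstructs the object as usual but hands over a state dict with the cached hash blanked out.

```python
import copy

# Sketch of option 2 with hypothetical names -- not the attrs implementation.
class CachedHash:
    def __init__(self, x):
        self.x = x
        self._hash_cache = None  # assumed name for the cache field

    def __hash__(self):
        if self._hash_cache is None:
            self._hash_cache = hash((self.__class__, self.x))
        return self._hash_cache

    def __reduce__(self):
        # Default-style reduce, except the cached hash is dropped from the
        # state so copies/pickles start with a cold cache.
        state = self.__dict__.copy()
        state["_hash_cache"] = None
        return (self.__class__, (self.x,), state)

a = CachedHash(1)
hash(a)                    # warm the cache on the original
b = copy.deepcopy(a)
print(b.x, b._hash_cache)  # 1 None
```

Note that copy.deepcopy goes through the same __reduce__ protocol as pickle, so a single hook covers both.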


gabbard commented Jan 13, 2020

@pganssle : Thanks for catching this bug! You are correct that we use serialization with hash caching extensively in our codebase, but all the classes are slots=True, so in my workplace's own use we would not have caught this.

Looking at the PR, I actually did attempt to test that slots and non-slots classes had the right __setstate__ behavior, but I screwed it up: the check that the cache was cleared on de-serialization masked exactly the problem you are seeing. I should have added a distinct test that simple serialization works at all.

I agree that the second proposed solution seems cleaner. I am not sure, though, that we even want to expose the identity of the hash cache field. It seems to me it is always wrong to serialize it: it is entirely an implementation detail, not part of the semantics of the object, and it's really easy to shoot yourself in the foot if you serialize it. On the other hand, if for some reason a user's __reduce__ method did not assume it knew all the object's fields in advance, and therefore needed to know the hash cache field in order to exclude it, I could see some value. That seems like a very odd case, though.

(@hynek : while checking into this, I discovered another problem in this test and created #614 to fix it)

pganssle added a commit to pganssle/attrs that referenced this issue Jan 13, 2020
This fixes GH issue python-attrs#613 and python-attrs#494. It turns out that the hash
cache-clearing implementation for non-slots classes was flawed and never
quite worked properly. This switches away from using __setstate__ and
instead adds a custom __reduce__ that removes the cached hash value from
the default serialized output.

This commit also refactors some of the tests a bit, to try and more
cleanly organize the tests related to this issue.
@pganssle

@gabbard Well, the point of exposing a variable pointing to that name would be to allow people with custom __reduce__ implementations to choose not to serialize it. If you do the naive thing like object does, your implementation of __reduce__ will just pass the object's __dict__ as its state, and for non-slots classes that will include the cached hash value.

I have prepared a PR fixing this using the __reduce__ method, #615 (I also refactored the pickle round-trip code, so if that gets merged #614 won't be necessary), but I did not bother trying to expose a variable including the cache location. We can save that for a future feature request if it's really desired.

That said, I'm a bit puzzled by this:

It seems to me it is always wrong to serialize it - it is entirely an implementation detail and not part of the semantics of an object, and it's really easy to shoot yourself in the foot if you serialize it.

I agree that it's an implementation detail and not part of the semantics of the object, but in trying to write a test for this, I couldn't actually think of a legitimate use case where serializing it would make a big difference. For any immutable class composed of values with deterministic hashes, the hash will always be the same. It will make the serialized form slightly bigger, but probably by a rounding error on the total size if these objects are big enough that you want to cache their hashes.

You shouldn't be able to hash dictionaries or sets at all, and frozenset seems to have a deterministic hash, so the only scenario I could imagine causing a problem is pickling something on one version of Python and unpickling it on another version where the hash implementation has changed - which is not how pickle is supposed to be used, and you can expect all kinds of things to break if you use it that way (the same goes for different versions of libraries). Maybe some of my assumptions are wrong? What is an example of a way you can shoot yourself in the foot when doing this?

@gabbard

gabbard commented Jan 13, 2020

@pganssle : the biggest problem is that there are values in Python that the user might naively expect to have deterministic hash codes but which do not - most notably, strings. For example:

 ~/projects/attrs/attrs > python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x="hello"
>>> hash(x)
-4850999967032345441
[Ctrl-D]
 ~/projects/attrs/attrs > python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x="hello"
>>> hash(x)
4063278766481041368

There are other cases where the hash code might be deterministic in practice but this fact is not documented or guaranteed.
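To make the point above reproducible: since hash randomization became the default in Python 3.3, str hashes depend on a per-process seed, which is what the two shell sessions above show. A small sketch forcing different seeds (via the PYTHONHASHSEED environment variable) in fresh interpreters:

```python
import os
import subprocess
import sys

def hash_hello(seed):
    """Compute hash("hello") in a fresh interpreter with an explicit hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", 'print(hash("hello"))'],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

h1 = hash_hello(1)
h2 = hash_hello(2)
# Different seeds yield different string hashes (with overwhelming probability).
print(h1 != h2)
```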

@wsanchez wsanchez added the Bug label Jan 13, 2020
pganssle added a commit to pganssle/attrs that referenced this issue Jan 27, 2020
Rather than attempting to remove the hash cache from the object state on
deserialization or serialization, instead we store the hash cache in an
object that reduces to None, thus clearing itself when pickled or
copied.

This fixes GH python-attrs#494 and python-attrs#613.

Co-authored-by: Matt Wozniski <[email protected]>
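The trick this commit message describes can be sketched as follows (a simplified stand-in for the attrs internals; the class and attribute names here are illustrative): the cached hash is stored in an int subclass whose __reduce__ reconstructs to None, so any pickle or copy of the enclosing object automatically comes back with an empty cache.

```python
import copy

class CacheHashWrapper(int):
    """An int that turns back into None when pickled or copied."""
    def __reduce__(self, _none_constructor=type(None), _args=()):
        # NoneType() returns None, so deserializing this value yields None.
        return _none_constructor, _args

# Illustrative container class -- not attrs-generated code.
class SomeClass:
    def __init__(self, x):
        self.x = x
        self._hash_cache = None

    def __hash__(self):
        if self._hash_cache is None:
            self._hash_cache = CacheHashWrapper(hash((self.__class__, self.x)))
        return self._hash_cache

a = SomeClass(1)
hash(a)                    # warm the cache
b = copy.deepcopy(a)
print(b.x, b._hash_cache)  # 1 None -- the copy's cache cleared itself
```

The nice property is that the containing class needs no custom __reduce__, __getstate__, or __setstate__ of its own; the wrapper clears itself wherever it ends up in the serialized state.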
hynek pushed a commit that referenced this issue Feb 10, 2020
* Use a self-clearing subclass to store hash cache

Rather than attempting to remove the hash cache from the object state on
deserialization or serialization, instead we store the hash cache in an
object that reduces to None, thus clearing itself when pickled or
copied.

This fixes GH #494 and #613.

Co-authored-by: Matt Wozniski <[email protected]>

* Add test for two-argument __reduce__

I couldn't think of any way to make a useful and meaningful class that
has no state and also has no custom __reduce__ method, so I went
minimalist with it.

* Improve test for hash clearing behavior.

Previously, there was some minuscule risk of hash collision, and it also relied on an implementation detail of `pickle` (the assumption that `hash()` is never called as part of `pickle.loads`).

* Add improved testing around cache_hash

* Update src/attr/_make.py

Co-Authored-By: Ryan Gabbard <[email protected]>

* Update comment in slots_setstate

Since the cached hash value is not actually serialized in __getstate__,
__setstate__ is not actually "clearing" it on deserialization - it's
initializing the value to None.

* Add changelog entry

* Remove changelog for #611

This change was overshadowed by a more fundamental change in #620.

Co-authored-by: Matt Wozniski <[email protected]>
Co-authored-by: Ryan Gabbard <[email protected]>