cache_hash breaks copy and pickle for non-slots classes #613
Comments
@pganssle: Thanks for catching this bug! You are correct that we use serialization with hash caching extensively in our codebase, but all the classes are slots classes. Looking at the PR, I actually did attempt to test for slots and non-slots classes having different behavior.

I agree that the second proposed solution seems cleaner. I am not sure, though, that we even want to expose the identity of the hash cache field name. It seems to me it is always wrong to serialize it: it is entirely an implementation detail, not part of the semantics of an object, and it's really easy to shoot yourself in the foot if you serialize it. On the other hand, if for some reason a user's `__reduce__` method did not assume it knew all the object's fields in advance and therefore needed to know the hash cache field in order to exclude it, I could see some value. That seems like a very odd case, though.

(@hynek: while checking into this, I discovered another problem in this test and created #614 to fix it.)
This fixes GH issue python-attrs#613 and python-attrs#494. It turns out that the hash cache-clearing implementation for non-slots classes was flawed and never quite worked properly. This switches away from using `__setstate__` and instead adds a custom `__reduce__` that removes the cached hash value from the default serialized output. This commit also refactors some of the tests a bit, to try and more cleanly organize the tests related to this issue.
@gabbard: Well, the point of exposing a variable pointing to that name would be to allow people with custom `__reduce__` methods to exclude it. I have prepared a PR fixing this using the `__reduce__` approach. That said, I'm a bit puzzled by this:
I agree that it's an implementation detail and not part of the semantics of the object, but in trying to write a test for this, I couldn't actually think of a legitimate use case where serializing it would make a big difference. For any immutable classes composed of values with deterministic hashes, the hash will always be the same. It will make the serialized form slightly bigger, but probably a rounding error on the total size if these objects are big enough that you want to cache the hashes. You shouldn't be able to hash dictionaries or sets anyway.
@pganssle: The biggest problem is that there are values in Python the user might naively expect to have deterministic hash codes which do not, most notably strings, whose hashes are randomized per interpreter process by default.
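A quick way to see this is to hash the same string in fresh interpreter processes and compare the results (a standard-library sketch; the helper name here is mine, not from the thread):

```python
import os
import subprocess
import sys

def hash_in_fresh_process(seed: str) -> str:
    """Hash 'hello' in a new interpreter with the given PYTHONHASHSEED."""
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('hello'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# Different seeds simulate different interpreter runs: a cached
# string-based hash from one process is wrong in another.
assert hash_in_fresh_process("1") != hash_in_fresh_process("2")
# A fixed seed makes the hash reproducible, which is why test suites
# run with PYTHONHASHSEED=0 can mask this class of bug.
assert hash_in_fresh_process("0") == hash_in_fresh_process("0")
```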
There are other cases where the hash code might be deterministic in practice, but this fact is not documented or guaranteed.
Rather than attempting to remove the hash cache from the object state on deserialization or serialization, instead we store the hash cache in an object that reduces to None, thus clearing itself when pickled or copied. This fixes GH python-attrs#494 and python-attrs#613. Co-authored-by: Matt Wozniski <[email protected]>
* Use a self-clearing subclass to store hash cache

  Rather than attempting to remove the hash cache from the object state on deserialization or serialization, we instead store the hash cache in an object that reduces to None, thus clearing itself when pickled or copied. This fixes GH #494 and #613.

  Co-authored-by: Matt Wozniski <[email protected]>

* Add test for two-argument __reduce__

  I couldn't think of any way to make a useful and meaningful class that has no state and also has no custom __reduce__ method, so I went minimalist with it.

* Improve test for hash clearing behavior

  Previously, there was some minuscule risk of hash collision, and it was also relying on implementation details of `pickle` (the assumption that `hash()` is never called as part of `pickle.loads`).

* Add improved testing around cache_hash

* Update src/attr/_make.py

  Co-Authored-By: Ryan Gabbard <[email protected]>

* Update comment in slots_setstate

  Since the cached hash value is not actually serialized in __getstate__, __setstate__ is not actually "clearing" it on deserialization; it's initializing the value to None.

* Add changelog entry

* Remove changelog for #611

  This change was overshadowed by a more fundamental change in #620.

Co-authored-by: Matt Wozniski <[email protected]>
Co-authored-by: Ryan Gabbard <[email protected]>
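The self-clearing trick can be sketched in isolation (a minimal stand-in for the wrapper the commit describes; the class name here is illustrative): an `int` subclass whose `__reduce__` reconstructs it as `None`, so any pickle or copy silently drops the cached value:

```python
import copy
import pickle

class CacheHashWrapper(int):
    """Behaves like an int, but pickles/copies to None.

    Both pickle and copy consult __reduce__, and type(None)() returns
    None, so any round-trip clears the cached hash automatically.
    """
    def __reduce__(self):
        return (type(None), ())

cached = CacheHashWrapper(123456789)
assert cached == 123456789            # usable as a normal hash value
assert pickle.loads(pickle.dumps(cached)) is None
assert copy.deepcopy(cached) is None  # copy goes through __reduce__ too
```

This sidesteps `__getstate__`/`__setstate__` entirely: the surrounding object serializes normally, and only the cache field "dissolves" in transit.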
After "fixing" #611, I realized that my test in #612 was incomplete: I was not asserting that `copy.deepcopy` worked, and it turns out it did not. The test, modified as below, fails because `b.x` is never set (the attribute doesn't even exist in the copied object).

I may be missing something, but it seems like #489 actually broke serialization entirely for any class with `cache_hash`.

I believe the reason for this is that `copy` and `pickle` don't do whatever their default behavior is if `__setstate__` is set; they just create a new object and then call `__setstate__`, which means that when `__setstate__` doesn't actually initialize the object, the object remains uninitialized. I am assuming this went unnoticed because @gabbard (who had the problem in the first place) is using a slots class, which doesn't have this problem (`slots` defines a `__setstate__`).
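This failure mode can be reproduced without attrs (a minimal sketch; the class and attribute names are hypothetical): once a class defines `__setstate__`, pickle no longer restores `__dict__` itself; it hands the state to `__setstate__` and trusts it to do everything:

```python
import pickle

class BrokenSetstate:
    def __init__(self, x):
        self.x = x

    def __setstate__(self, state):
        # Mimics the flawed cache-clearing: touches only the cache slot
        # and ignores the rest of the state it was handed.
        self._hash_cache = None

obj = BrokenSetstate(42)
restored = pickle.loads(pickle.dumps(obj))
assert restored._hash_cache is None
assert not hasattr(restored, "x")  # the real attributes were never restored
```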
I think there are two options here:

1. Add `__getstate__` and `__setstate__` for classes with `cache_hash=True` to duplicate what `pickle` and `copy` were doing anyway.
2. Add a default `__reduce__` method that removes the hash cache.

I don't like the first option very much, because it means that we have to re-implement `copy` and `pickle`'s default behavior (which may even diverge from one another)! I like the second one a lot more, particularly because this will just be the default `__reduce__`. People implementing their own custom `__reduce__` can choose to include or not include the cached hash (though I'm not sure if there's a public variable anywhere they can access to tell what member it would be; maybe exposing such a public member should be part of this?).
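The second option can be sketched without attrs (all names here are illustrative, not attrs' actual internals): a `__reduce__` that rebuilds the object from its real fields, so the cached hash simply never enters the serialized form and is recomputed lazily in the new process:

```python
import copy
import pickle

HASH_CACHE_FIELD = "_cached_hash"  # hypothetical name, not attrs' real one

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
        setattr(self, HASH_CACHE_FIELD, None)

    def __hash__(self):
        cached = getattr(self, HASH_CACHE_FIELD)
        if cached is None:
            cached = hash((self.x, self.y))
            setattr(self, HASH_CACHE_FIELD, cached)
        return cached

    def __reduce__(self):
        # Rebuild via the constructor: the cached hash is not part of
        # the serialized output, so a fresh process recomputes it
        # instead of trusting a possibly stale value.
        return (type(self), (self.x, self.y))

p = Point(1, 2)
hash(p)  # populate the cache
q = pickle.loads(pickle.dumps(p))
assert (q.x, q.y) == (1, 2)
assert getattr(q, HASH_CACHE_FIELD) is None  # cache was dropped in transit
r = copy.deepcopy(p)
assert (r.x, r.y) == (1, 2) and getattr(r, HASH_CACHE_FIELD) is None
```

Because `copy` also falls back to `__reduce__`, one method covers both pickling and copying, with no need to re-implement either library's defaults.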