WIP: Implement __reduce__ with cache_hash #615

Closed
wants to merge 5 commits

Conversation

pganssle
Member

@pganssle pganssle commented Jan 13, 2020

It turns out that the hash cache-clearing implementation for non-slots classes was flawed and never quite worked properly. This switches away from using __setstate__ and instead adds a custom __reduce__ that removes the cached hash value from the default serialized output.

This change also refactors some of the tests to organize the ones related to this issue more cleanly.

Fixes #613.
Fixes #494.
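
For reference, here is a minimal sketch of that idea, assuming a dict-based (non-slots) class; the attribute name "_cached_hash" and the helper are illustrative, not the actual attrs code:

```python
import copy

# A minimal sketch of the idea, NOT the actual attrs implementation.
# The attribute name "_cached_hash" is illustrative.
_HASH_CACHE_FIELD = "_cached_hash"


def _reduce_without_cached_hash(self):
    # Start from the default reduction for a plain dict-based class...
    rv = object.__reduce_ex__(self, 2)
    state = rv[2] if len(rv) > 2 else None
    # ...and strip the cached hash from the state so the unpickled instance
    # recomputes its hash lazily instead of trusting a stale value.
    if isinstance(state, dict) and _HASH_CACHE_FIELD in state:
        state = copy.copy(state)
        del state[_HASH_CACHE_FIELD]
        rv = rv[:2] + (state,) + rv[3:]
    return rv
```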

Pull Request Check List

  • Added tests for changed code.
  • Updated documentation for changed code.
  • Documentation in .rst files is written using semantic newlines.
  • Changes (and possible deprecations) have news fragments in changelog.d.

@pganssle
Member Author

I am not sure I understand what's going on with the PyPy target; I cannot reproduce the failure locally.

@pganssle pganssle changed the title from "Implement __reduce__ with cache_hash" to "WIP: Implement __reduce__ with cache_hash" on Jan 14, 2020
Review thread on src/attr/_make.py (outdated, resolved)
Commit messages:

  • This fixes GH issues python-attrs#613 and python-attrs#494. It turns out that the hash cache-clearing implementation for non-slots classes was flawed and never quite worked properly. This switches away from using __setstate__ and instead adds a custom __reduce__ that removes the cached hash value from the default serialized output.
  • This is a rough attempt at building a compatibility function that allows us to detect whether an object defines its own __reduce__ or __reduce_ex__ on any interpreter. Some versions of PyPy 2.7 don't seem to return True with "is" comparisons and instead require ==.
  • I have arbitrarily chosen int as a builtin that derives from object and doesn't define its own __reduce__. I think this is fairly safe to do in 2.7-only code, since it is stable now.
  • I couldn't think of any way to make a useful and meaningful class that has no state and also has no custom __reduce__ method, so I went minimalist with it.
  • Previously, there was some minuscule risk of hash collision, and it also relied on an implementation detail of `pickle` (the assumption that `hash()` is never called as part of `pickle.loads`).
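
As a rough illustration of the compatibility helper those commit messages describe (the name _method_eq matches the diff further down, but the body here is only a guess at its shape, not the real code):

```python
def _method_eq(meth_a, meth_b):
    # On most interpreters the default __reduce__/__reduce_ex__ looked up on a
    # class is the very same object as the one on `object`, so identity works;
    # some PyPy 2.7 builds only compare equal with ==, hence the fallback.
    return meth_a is meth_b or meth_a == meth_b
```

The intent is that _method_eq(cls.__reduce__, object.__reduce__) is true only when cls still inherits the default __reduce__.
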
Member Author

@pganssle pganssle left a comment


Once the general approach is approved, I will go through and modify the documentation to clarify some things about this.

One problem here is that if you write your own __reduce__, there's no public attribute or function that allows you to clear out the hash cache. In practice I don't think this will be a major problem, because my impression is that when you write a custom __reduce__, you tend to do it for the purpose of whitelisting attributes. Spot-checking a few custom-defined __reduce__ implementations on GitHub seems to bear this out.

So I think the options are:

  1. Use this method of having cache_hash generate a default __reduce__ with no warning and just a mention in the documentation and the changelog.
  2. Use this method and call any existing __reduce__, then do the modification if and only if:
    a.) we see that the third element of the returned tuple is self.__dict__ - this will probably catch most cases, but it will also make the bug even harder to detect (since changing from return x, y, self.__dict__ to return x, y, copy.copy(self.__dict__) would introduce a subtle pickling bug).
    b.) we see that the third element is a dictionary that contains _hash_cache_field (a sketch follows below). This is probably a very good heuristic, but we don't actually know what this dictionary is used for in any given custom __reduce__, since its meaning is defined in relation to the first two elements, so I'm somewhat wary of messing with it.
  3. Throw an exception if someone has a custom __reduce__, telling them they can't use cache_hash - this would be a breaking change and seems unnecessary anyway.
  4. Warn if someone has a custom __reduce__ with cache_hash=True. This is less intrusive, but the warning will mostly be noise - most __reduce__ implementations won't have this problem.
  5. Basically do option 1, but also either expose _hash_cache_attr or add a public function like remove_hash_cache_from_dict, so people can use it in their custom __reduce__ implementations.

I think the best options are probably 1 or 2b, with a fallback to 5 if it turns out that this is causing real problems for end users.
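
For concreteness, option 2b could look roughly like the following wrapper; this is an illustrative sketch only, and _HASH_CACHE_FIELD stands in for whatever attribute name attrs actually uses for the cache:

```python
import copy

_HASH_CACHE_FIELD = "_cached_hash"  # placeholder, not the real attrs name


def _wrap_reduce_for_cache_hash(original_reduce):
    # Call the user's existing __reduce__ and only touch the result when the
    # third element is clearly a state dict containing the cache field.
    def __reduce__(self):
        rv = original_reduce(self)
        if (
            isinstance(rv, tuple)
            and len(rv) >= 3
            and isinstance(rv[2], dict)
            and _HASH_CACHE_FIELD in rv[2]
        ):
            state = copy.copy(rv[2])
            del state[_HASH_CACHE_FIELD]
            rv = rv[:2] + (state,) + rv[3:]
        return rv

    return __reduce__
```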

@@ -483,6 +513,13 @@ def __init__(
            self._cls_dict["__setattr__"] = _frozen_setattrs
            self._cls_dict["__delattr__"] = _frozen_delattrs

        if (
            cache_hash
            and _method_eq(cls.__reduce__, object.__reduce__)
Member Author


Note to reviewers: This is probably the most controversial part - the __reduce__ is generated only if you haven't defined your own __reduce__ or __reduce_ex__.

I have done this because it seems mildly inappropriate to squash someone's existing __reduce__, and I am not confident that we can count on the third element of a custom __reduce__'s return value being the object's __dict__.
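
In other words, the check is presumably along these lines (a hypothetical sketch reusing the _method_eq idea from the earlier sketch; _should_generate_reduce is not a real attrs function):

```python
def _should_generate_reduce(cls, cache_hash):
    # Only generate a __reduce__ when cache_hash is enabled AND the class
    # still inherits both defaults, so a user-supplied __reduce__ or
    # __reduce_ex__ is never squashed.
    return bool(
        cache_hash
        and _method_eq(cls.__reduce__, object.__reduce__)
        and _method_eq(cls.__reduce_ex__, object.__reduce_ex__)
    )
```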

@hynek
Member

hynek commented Feb 1, 2020

Closing in favor of #620.

@hynek hynek closed this Feb 1, 2020