Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constant hash value for None to aid reproducibility #99540

Closed
yonillasky opened this issue Nov 16, 2022 · 18 comments
Closed

Constant hash value for None to aid reproducibility #99540

yonillasky opened this issue Nov 16, 2022 · 18 comments
Labels
type-feature A feature request or enhancement

Comments

@yonillasky
Copy link
Contributor

yonillasky commented Nov 16, 2022

Feature or enhancement

Fix hash(None) to a constant value.

Pitch

(Updated 2022.11.18)

  • Under current behavior, the runtime leaks the ASLR offset, since the original address of the None singleton is fixed and _Py_HashPointerRaw is reversible. Admittedly, there are other similar objects, like NotImplemented or Ellipsis that also have this problem, and need to be similarly fixed.

  • Because of ASLR, hash(None) changes every run; that consequently means the hash of many useful "key" types changes every run, particularly tuples, NamedTuples and frozen dataclasses that have Optional fields.

  • The other source of hash value instability across runs in common "key" types like str or Enum, can be fixed using the PYTHONHASHSEED environment var.

  • other singletons commonly used as (or as part of) mapping keys, True and False already have fixed hash values.

CPython's builtin set classes, as do all other non-concurrent hash-tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the result set will be in the same observable state every time: not only have the same items (correctness), but also they would be retrieved from the set in the same order when iterated.

This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, if based on some mathematical properties of the input, we have computed a set of N valid choices, they are given integer scores, then we pick the first choice that has maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.

There is a counterargument that we should simply just offer StableSet and StableFrozenSet that guarantee a specific order, the same way that dict does.
A few things to note about that:

  • I've written such set classes as an adapter over dict[T, None], there is a substantial perf overhead to that
  • Is it worth the extra "weight" in code inside the core? That's suspect - why hasn't it been added all those years?
  • In a large codebase, it requires automated code inspection and editing tools to enforce this. It's all too easy, and natural, to add a seemingly harmless set comprehension somewhere and defeat the whole effort
  • The insertion-order-as-iteration-order guarantee is stronger than what we actually require, in order to have the "reproducability" property I've described, so we're paying extra for something we don't really need.

My PR makes a small change to CPython, in objects.c, that sets the tp_hash descriptor of NoneType to a function that simply returns a constant value.

Admittedly, determinism between runs isn't a concern that most users/programs care about. It is rather niche. However, I argue that still, there is no externalized cost to this change.

Previous discussion

https://discuss.python.org/t/constant-hash-for-none/21110

Linked PRs

@rhettinger
Copy link
Contributor

Thanks for the suggestion but this doesn't make sense. The default hash for every object is its object id. There is nothing special about None in this regard. Also, hash randomization was added intentionally for strings and bytes — we're definitely not in business of trying to make hashes constant and we don't want people to come to rely on a particular hash order or value.

@rhettinger rhettinger closed this as not planned Won't fix, can't repro, duplicate, stale Nov 19, 2022
@yonillasky
Copy link
Contributor Author

If you're definitely not in the business of making hashes constant, why is the hash of True and False constant? It didn't have to be. You could use object hash for those too.

the PYTHONHASHSEED env var exists solely so it may be set to a known value. If you wanted "not make hashes constant" Python would randomize it internally and wouldn't allow users to set it to a value.

The reality is that there is no grand design behind the current behavior. It just happens that None uses a non-constant hash, to the detriment of anyone wishing their code to behave in a reproducible manner. No explanation was given what the benefit of that actually is.

I can't imagine you've spent more than a few minutes thinking about this. Can I appeal your decision somewhere?

@yonillasky
Copy link
Contributor Author

Come to think of it, we could have the hash of None be equal to that of an empty string. Then it would stay unstable in normal use, but still allow the user to make the hashing stable by setting PYTHONHASHSEED to a known value. What do you think about that version of the idea?

@pochmann
Copy link
Contributor

we could have the hash of None be equal to that of an empty string. Then it would stay unstable in normal use

Would it? Isn't that (in CPython) always 0?

@yonillasky
Copy link
Contributor Author

yonillasky commented Nov 19, 2022

To my surprise it is. I was sure string hash calculations were always dependent on the hashing secret, but turns out the empty string isn't. It makes sense - the code explains it is done to avoid leaking information about the hash secret.

So what we can do instead, is to hash some constant bytes, once, upon setting the hash secret, then cache that result and return it from None's hash function. Then, it is deterministic if PYTHONHASHSEED is set, but otherwise it's not, and it should be fine security-wise.

@sweeneyde
Copy link
Member

If you're definitely not in the business of making hashes constant, why is the hash of True and False constant? It didn't have to be. You could use object hash for those too.

That's not true. The contract is that a == b should imply that hash(a) == hash(b). Since 0 == 0.0 == fractions.Fraction(0, 1) == False, we want hash(0) == hash(0.0) == hash(fractions.Fraction(0,1)) == hash(False). Likewise, we should have hash(True) == hash(1) == hash(1.0) == hash(fractions.Fraction(1, 1)).

That's not to say that making None's hash constant would ruin anything, just that there are differences that make the argument from consistency insufficient.

@yonillasky
Copy link
Contributor Author

yonillasky commented Nov 20, 2022

Funny, I used Python for so many years and never knew True == 1 and False == 0 (rather, I knew only of bool(1) is True and bool(0) is False)

TIL.

Anyway - I concur that it shoots down my argument from consistency entirely.

I think the root cause of what I'm trying to fix is that, at some point, we started using None inside what amounts to POD types (conceptually at least), ... as an instance of the "unit" type; inadvertantly injecting hash non-determinism into data types where it doesn't really belong. In practice, the problem only became "observable" when ASLR was introduced. And, unlike the str/Enum case, there is no possibility for the user to remove the non-determinism even if they want to do so.

One could argue a cleaner fix is to come up with an actual Unit type singleton that hashes to a constant, and use X | Unit instead of Optional. Unfortunately, it would be a large effort to sanitize away all the Nones in an entire codebase, and keep it that way. It's used by convention practically everywhere.

I don't know, if no one else thinks it's really a problem, the issue can stay closed

@gryznar
Copy link

gryznar commented Nov 21, 2022

I'm not sure, but won't immortal objects fix this?

@gpshead
Copy link
Member

gpshead commented Nov 27, 2022

won't immortal objects fix this?

No. These are all implementation details. Code depending on them does so at its peril.

Related to the OP: Python sets do not guarantee any form of stability. Code depending on that is already depending on something it should not because it never explicitly asked for whatever stability it depends on. Sets are fundamentally unordered. https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset

@yonillasky
Copy link
Contributor Author

The scenario does not involve code that depends on the specific iteration order for its correctness
Instead it involves running the same code on the same input multiple times and expecting it to behave the same

It's useful for testing, debugging and research purposes.

Again, has nothing to do with assumptions in the code. The code merely assumes that it will traverse the items in the set in some order. No further assumptions are needed for correctness.

@gpshead
Copy link
Member

gpshead commented Nov 27, 2022

You cannot run the same code on the same input multiple times and expect it to behave the same unless everything you've done in the code provides that guarantee. Python set types explicitly do not provide this guarantee.

Every invocation of code interating over a set will produce each value once in "some order". That order may or may not be the same as "some order" in another invocation.

@yonillasky
Copy link
Contributor Author

You just told me what the requirements are. Not what the actual behavior is.

Can you bring a set s into a state where if you do:

x1 = tuple(s)
x2 = tuple(s)

you end up with x1 != x2?
In CPython, or any other known version of Python?

@yonillasky
Copy link
Contributor Author

Another related question

can you find a series of operations on a set, starting with its creation, that involves fixed data with fixed hashes, and ends with converting the set into a tuple, that will return a different result every time?

Not hypothetically, but actual code that does this
in any version of Python

@gpshead
Copy link
Member

gpshead commented Nov 27, 2022

This issue is closed and you are no longer discussing anything related to it. Please take it up on a discuss.python.org thread.

@yonillasky
Copy link
Contributor Author

The fact you think my questions are unrelated to the change is a strong indicator that you do not understand it.
I will leave it at that.
I gave up on making the change - we can drop it here also.

@ethanfurman ethanfurman changed the title Constant hash value for None Constant hash value for None to aid reproducibility Dec 10, 2022
@ethanfurman
Copy link
Member

Raymond Hettinger:
The default hash for every object is its object id.

However, many objects have a specific and unchanging hash, the most obvious example being integers.

Raymond Hettinger:
Also, hash randomization was added intentionally for strings and bytes

But it is possible to set the seed for that randomization so that multiple runs can produce the same results, which can be vital for debugging.

Oscar Benjamin:
... this also demonstrates a significant reason why None is special: it's a singleton that only compares equal to itself. The reason for using id for hash in other cases is to make different instances have different hashes but there is only ever one instance of None.

Indeed, None, like True and False, is a special object in Python, and should have either a constant hash, or one that can be set by specifying the random hash seed.

@ethanfurman ethanfurman reopened this Dec 10, 2022
@yonillasky
Copy link
Contributor Author

We still have to decide whether to use a PYTHONHASHSEED-derived value or a constant.

I've updated my PR with an implementation that does make it depend on PYTHONHASHSEED so we can weigh the increased size of the implementation.

Personally, I will gain a tiny bit more from the PYTHONHASHSEED dependency, as it allows me to "fuzz" my program slightly better.

But we have to consider the effect on all users. Some might be currently using None in their composite key types on platforms that do not perform ASLR by default, and then the PYTHONHASHSEED dependency will inject non-determinism into their programs by default. Arguments can be made both ways, whether that's a good or bad thing.

@ethanfurman
Copy link
Member

Let's do a constant hash for now.

ethanfurman pushed a commit that referenced this issue Dec 16, 2022
carljm added a commit to carljm/cpython that referenced this issue Dec 16, 2022
* main:
  pythongh-99540: Constant hash for _PyNone_Type to aid reproducibility (pythonGH-99541)
  pythongh-100039: enhance __signature__ to work with str and callables (pythonGH-100168)
  pythongh-99830: asyncio: Document returns of remove_{reader,writer} (python#100302)
  "Compound statement" docs: Fix with-statement step indexing (python#100286)
  pythonGH-90043: Handle NaNs in COMPARE_OP_FLOAT_JUMP (pythonGH-100278)
shihai1991 added a commit to shihai1991/cpython that referenced this issue Dec 18, 2022
* origin/main: (1306 commits)
  Correct CVE-2020-10735 documentation (python#100306)
  pythongh-100272: Fix JSON serialization of OrderedDict (pythonGH-100273)
  pythongh-93649: Split tracemalloc tests from _testcapimodule.c (python#99551)
  Docs: Use `PY_VERSION_HEX` for version comparison (python#100179)
  pythongh-97909: Fix markup for `PyMethodDef` members (python#100089)
  pythongh-99240: Reset pointer to NULL when the pointed memory is freed in argument parsing (python#99890)
  pythongh-99240: Reset pointer to NULL when the pointed memory is freed in argument parsing (python#99890)
  pythonGH-98831: Add DECREF_INPUTS(), expanding to DECREF() each stack input (python#100205)
  pythongh-78707: deprecate passing >1 argument to `PurePath.[is_]relative_to()` (pythonGH-94469)
  pythongh-99540: Constant hash for _PyNone_Type to aid reproducibility (pythonGH-99541)
  pythongh-100039: enhance __signature__ to work with str and callables (pythonGH-100168)
  pythongh-99830: asyncio: Document returns of remove_{reader,writer} (python#100302)
  "Compound statement" docs: Fix with-statement step indexing (python#100286)
  pythonGH-90043: Handle NaNs in COMPARE_OP_FLOAT_JUMP (pythonGH-100278)
  Improve stats presentation for calls. (pythonGH-100274)
  Better stats for `LOAD_ATTR` and `STORE_ATTR` (pythonGH-100295)
  pythongh-81057: Move the Cached Parser Dummy Name to _PyRuntimeState (python#100277)
  Document that zipfile's pwd parameter is a `bytes` object (python#100209)
  pythongh-99767: mark `PyTypeObject.tp_watched` as internal use only in table (python#100271)
  Fix typo in introduction.rst (python#100266)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

7 participants