Make digestof and hash() return USize instead of U64 and add hash64() #2615

Praetonus · 2018-03-28T19:42:34Z

Having hash values be 64 bits wide on both 32-bit and 64-bit platforms was a bit odd. With USize, the hash width matches the machine word width on every platform.

mfelsche · 2018-03-30T11:01:05Z

At first sight, this makes total sense.

Afaik this change means that hashing and identity equality will behave differently on 64-bit and 32-bit machines. Does this introduce a problem for having these two kinds of systems speak to each other in a future distributed Pony? Or even for exchanging serialized data?

Praetonus · 2018-03-30T13:54:58Z

Serialisation already produces different results on different data layouts because the size of pointers and numeric types is different. The communication of 32-bit and 64-bit systems in distributed Pony is a larger problem that will have to be resolved eventually (not necessarily in the initial implementation of distributed Pony). Of course, the simplest way to resolve it would be to state that only systems with the same data layout can communicate natively, and that different data layouts require the programmer to handle the communication manually.

jemc · 2018-03-31T14:53:18Z

I think we discussed this idea on a previous sync call, and decided not to do it. I'm going to try to find some record of that decision and bring it back to this thread.

jemc · 2018-03-31T15:08:10Z

Here's the record of that decision. Using the date on which it occurred, you can go back into the sync call archives and listen to it if you want to hear more.

#774 (comment)

Basically, the rationale is that 32-bit hashing is a lot more likely to have collisions. Here's an illustration of the problem, taken from this blog post:

In certain applications — such as when using hash values as IDs — it can be very important to avoid collisions. That’s why the most interesting probabilities are the small ones.

Assuming your hash values are 32-bit, 64-bit or 160-bit, the following table contains a range of small probabilities. If you know the number of hash values, simply find the nearest matching row. To help put the numbers in perspective, I’ve included a few real-world probabilities scraped from the web, like the odds of winning the lottery.

In data structures like hash tables, a collision just means a performance loss. But in situations where you're using hashes as unique IDs, it can be much more problematic. I personally am working on a CRDT-based application where hashed values are used as replica IDs, and uniqueness cannot be verified across the set of replicas because the system must be coordination-free by design. In situations like this, hash collisions will compromise the correctness of the algorithm, and I don't feel comfortable using 32-bit hashes for this.
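
To put rough numbers on the collision argument above, here is a minimal sketch using the standard birthday-bound approximation, p ≈ 1 − exp(−n(n−1) / (2·2^b)), for n uniformly random b-bit hash values. The function name `collision_probability` and the sample values of n are illustrative assumptions, not figures taken from the cited blog post.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate probability that at least two of n uniformly random
    `bits`-wide hash values collide (birthday-bound approximation)."""
    space = 2 ** bits
    # expm1 keeps precision when the resulting probability is tiny.
    return -math.expm1(-n * (n - 1) / (2 * space))


# Compare 32-bit and 64-bit hashes for a few illustrative set sizes.
for n in (1_000, 100_000, 10_000_000):
    print(n, collision_probability(n, 32), collision_probability(n, 64))
```

Under this approximation, roughly 77,000 values are enough for about a 50% chance of at least one collision among 32-bit hashes, while the same number of 64-bit hashes gives a probability on the order of 1e-10.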

mfelsche · 2018-03-31T17:04:59Z

Given the rationale @jemc outlined above and the cited issue discussion, does this mean that #2607 is essentially the wrong thing to do as well, since afaik it already produces 32-bit hashes on 32-bit machines?

jemc · 2018-04-04T20:24:44Z

Discussed on the sync call.

We discussed the possibility of having two hash functions: hash(): USize and hash64(): U64, with the former being oriented toward reducing memory overhead, and the latter being oriented toward reducing collisions. We'd have two HashFunction interfaces as well. I said this would meet my goals.

@Praetonus then followed up with a question about whether the low-collision hash should be 128-bit instead of 64-bit. I'll have to think a bit more about this, but it sounds like it might be a good solution as well.

Praetonus · 2018-04-11T20:19:35Z

Discussed again during sync. We agreed that the second hash function will be 64-bit. I'll update the PR.

Praetonus · 2018-04-11T21:18:35Z

@jemc Besides Hashable, do you think the HashFunction interface and its standard implementations HashEq and HashIs should have a 64 bit version in the standard library?

jemc · 2018-04-12T23:11:02Z

@Praetonus - yes, I believe so.

Praetonus · 2018-04-13T13:45:11Z

I've updated the PR with the discussed changes. I've also added manual changelog entries so I've removed the changelog label from the PR.

Default hash values will now match the platform machine word width for performance. `hash64()` can be used if a low collision rate is needed.
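
The split between a word-width default hash and a fixed 64-bit hash can be sketched as follows. This is only an illustration of the idea, not Pony's implementation: FNV-1a is swapped in as a simple stand-in hash function, and the names `hash_usize` and `WORD_BITS`, as well as the xor-folding step, are hypothetical choices made for this example.

```python
import struct

# FNV-1a 64-bit parameters (stand-in hash, not what Pony uses).
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF

# Pointer width of the running interpreter, standing in for USize.
WORD_BITS = struct.calcsize("P") * 8


def hash64(data: bytes) -> int:
    """Always returns a 64-bit value, regardless of platform."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def hash_usize(data: bytes, word_bits: int = WORD_BITS) -> int:
    """Returns a value that fits in one machine word (hypothetical helper)."""
    h = hash64(data)
    if word_bits >= 64:
        return h
    # Fold the high half into the low half, then truncate to the word width.
    return (h ^ (h >> 32)) & ((1 << word_bits) - 1)


print(hex(hash64(b"pony")), hex(hash_usize(b"pony", 32)))
```

A hash table would index with the word-width value, while code that uses hashes as IDs would call the 64-bit variant so that the result has the same width on 32-bit and 64-bit machines.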

Praetonus added the "changelog - changed" label Mar 28, 2018

mfelsche added the needs discussion during sync label Apr 4, 2018

Praetonus removed the needs discussion during sync label Apr 11, 2018

Praetonus force-pushed the hash-usize branch from 75e191f to 1bc2c55 Compare April 13, 2018 13:43

Praetonus changed the title ~~Make digestof and hash() return USize instead of U64~~ Make digestof and hash() return USize instead of U64 and add hash64() Apr 13, 2018

Praetonus removed the "changelog - changed" label Apr 13, 2018

jemc approved these changes Apr 13, 2018

Make digestof and hash() return USize instead of U64 and add hash64()

a30f31a

Default hash values will now match the platform machine word width for performance. `hash64()` can be used if a low collision rate is needed.

Praetonus force-pushed the hash-usize branch from 1bc2c55 to a30f31a Compare April 13, 2018 16:16

Praetonus merged commit 04f0d04 into ponylang:master Apr 13, 2018

Praetonus deleted the hash-usize branch April 13, 2018 17:51

EpicEric mentioned this pull request Apr 15, 2018

Build error due to fall through in digestof case #2653

Closed

Praetonus mentioned this pull request May 5, 2018

gen_digestof_box uses wrong type for phi expression #2683

Closed

Praetonus mentioned this pull request May 19, 2018

Release 0.22.0 #2517

Closed
