Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make digestof and hash() return USize instead of U64 and add hash64() #2615

Merged
merged 1 commit into from
Apr 13, 2018

Conversation

Praetonus
Copy link
Member

Having hash values being 64-bit wide on both 32 and 64-bit platforms was a bit odd. With USize, that width will match the machine word width on every platform.

@Praetonus Praetonus added the changelog - changed Automatically add "Changed" CHANGELOG entry on merge label Mar 28, 2018
@mfelsche
Copy link
Contributor

On first sight, this makes total sense.

Afaik this change means that hashing behaviour and identity equality behave differently on 64 bit and 32 bit machines. Does this introduce a problem for having these two kinds of systems speak to each other in a future distributed pony? Or even for exchanging serialized stuff?

@Praetonus
Copy link
Member Author

Serialisation already produces different results on different data layouts because the size of pointers and numeric types is different. The communication of 32-bit and 64-bit systems in distributed Pony is a larger problem that will have to be resolved eventually (not necessarily in the initial implementation of distributed Pony). Of course, the simplest way to resolve it would be to state that only systems with the same data layout can communicate natively, and that different data layouts require the programmer to handle the communication manually.

@jemc
Copy link
Member

jemc commented Mar 31, 2018

I think we discussed this idea on a previous sync call, and decided not to do it. I'm going to try to find some record of that decision and bring it back to this thread.

@jemc
Copy link
Member

jemc commented Mar 31, 2018

Here's the record of that decision, and by the date on which it occurred you could choose to go back into the archives and listen to it if you want to hear more.

#774 (comment)

Basically, the rationale is that 32-bit hashing is a lot more likely to have collisions. Here's an illustration of the problem, taken from this blog post:

In certain applications — such as when using hash values as IDs — it can be very important to avoid collisions. That’s why the most interesting probabilities are the small ones.

Assuming your hash values are 32-bit, 64-bit or 160-bit, the following table contains a range of small probabilities. If you know the number of hash values, simply find the nearest matching row. To help put the numbers in perspective, I’ve included a few real-world probabilities scraped from the web, like the odds of winning the lottery.

hash-probabilities


In data structures like hash tables, a collision just means a performance loss. But in situations where you're using hashes as unique IDs, it can be much more problematic. I personally am working on a CRDT-based application where hashed values are used as replica IDs, and coordination cannot be done to verify uniqueness across the set of replicas because it must be coordination-free by design - in situations like this, hash collisions will compromise the correctness of the algorithm, and I don't feel comfortable using 32-bit hashes for this.

@mfelsche
Copy link
Contributor

mfelsche commented Mar 31, 2018

Given the rationale @jemc outlined above and from the cited issue discussion, does this also mean that #2607 is also essentially the wrong thing to do as it already produces 32 bit hashes on 32 bit machines afaik?

@jemc
Copy link
Member

jemc commented Apr 4, 2018

Discussed on the sync call.

We discussed the possibility of having two hash functions: hash(): USize and hash64(): U64, with the former being oriented toward reducing memory overhead, and the latter being oriented toward reducing collisions. We'd have two HashFunction interfaces as well. I said this would meet my goals.

@Praetonus then followed up with a question about whether the low-collision hash should be 128-bit instead of 64-bit. I'll have to think a bit more about this, but it sounds like it might be a good solution as well.

@Praetonus
Copy link
Member Author

Discussed again during sync. We agreed on the second hash function being 64 bit. I'll update the PR.

@Praetonus
Copy link
Member Author

@jemc Besides Hashable, do you think the HashFunction interface and its standard implementations HashEq and HashIs should have a 64 bit version in the standard library?

@jemc
Copy link
Member

jemc commented Apr 12, 2018

@Praetonus - yes, I believe so.

@Praetonus Praetonus changed the title Make digestof and hash() return USize instead of U64 Make digestof and hash() return USize instead of U64 and add hash64() Apr 13, 2018
@Praetonus Praetonus removed the changelog - changed Automatically add "Changed" CHANGELOG entry on merge label Apr 13, 2018
@Praetonus
Copy link
Member Author

I've updated the PR with the discussed changes. I've also added manual changelog entries so I've removed the changelog label from the PR.

Default hash values will now match the platform machine word width for
performance. `hash64()` can be used if a low collision rate is needed.
@Praetonus Praetonus merged commit 04f0d04 into ponylang:master Apr 13, 2018
@Praetonus Praetonus deleted the hash-usize branch April 13, 2018 17:51
@Praetonus Praetonus mentioned this pull request May 19, 2018
dipinhora pushed a commit to dipinhora/ponyc that referenced this pull request Jun 5, 2018
…ponylang#2615)

Default hash values will now match the platform machine word width for
performance. `hash64()` can be used if a low collision rate is needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants