
Do not use the lowermost bit when forcing hash to not match known values #22

Merged
VSadov merged 1 commit into main from consHash on May 16, 2023

Conversation

VSadov (Owner) commented May 13, 2023

We need to make sure that real hashes never match one of the special values. That requires wasting a few bits of the hash, but we have a choice of which bits to sacrifice.

The lowermost bit might not be the best choice: it is plausible that some hashing schemes produce consecutive/dense hashes, and `| 1` would then result in a collision in every adjacent pair.

Fixes: #21 (Consecutive hashes create conflicts)
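
To illustrate the concern, here is a minimal sketch (not the code changed in this PR; the helper names and the particular replacement bit are assumptions made for the example):

```csharp
// Illustrative only: ForceLowBit/ForceHighBit and the choice of bit 30 are
// assumptions for this sketch, not the library's actual implementation.
using System;

static int ForceLowBit(int h)  => h | 1;          // old scheme: sacrifice bit 0
static int ForceHighBit(int h) => h | (1 << 30);  // alternative: sacrifice a high bit

// Dense, consecutive hashes (e.g. Int32 keys hashing to themselves) show the
// problem: forcing bit 0 maps every adjacent pair {2k, 2k + 1} onto 2k + 1,
// while forcing a high bit leaves the low-order, bucket-selecting bits intact.
for (int h = 4; h < 8; h++)
    Console.WriteLine($"{h}: |1 => {ForceLowBit(h)}, |(1<<30) => {ForceHighBit(h)}");
// 4: |1 => 5, |(1<<30) => 1073741828
// 5: |1 => 5, |(1<<30) => 1073741829
// 6: |1 => 7, |(1<<30) => 1073741830
// 7: |1 => 7, |(1<<30) => 1073741831
```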

ENikS (Contributor) commented May 13, 2023

Perhaps checking for the changed max size is in order too?

VSadov (Owner, Author) commented May 13, 2023

> Perhaps checking for the changed max size is in order too?

Right, with this change there is no point in having tables larger than 1 << 29. While previously we could use the extra space for collision resolution, that would not work now.
It is probably acceptable as a tradeoff; half a billion items is still a very large hashtable.
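
For scale, this is just arithmetic, not library code:

```csharp
using System;

Console.WriteLine(1 << 29);   // 536870912  -- roughly half a billion slots
Console.WriteLine(1 << 30);   // 1073741824 -- about a billion
```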

ENikS (Contributor) commented May 13, 2023

For large tables (larger than 2^29), idx could be shifted left, effectively creating bins, but I personally don't think it is worth the effort to implement.
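
Roughly what I have in mind, as a hypothetical sketch (the names and numbers are made up for the example; this is not proposed code):

```csharp
// Hypothetical sketch of the "bins" idea: with 2^29 distinct forced-hash
// values but a 2^30 table, shifting the index left by one bit addresses
// 2-slot bins, so the upper half of the table stays reachable without
// changing the hash itself.
using System;

const int tableSize = 1 << 30;   // power-of-two table
const int binShift  = 1;         // 2 slots per bin

int BinStart(int hash) => (hash << binShift) & (tableSize - 1);

Console.WriteLine(BinStart(3));  // 6
Console.WriteLine(BinStart(4));  // 8 -- consecutive hashes land 2 slots apart
```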

VSadov (Owner, Author) commented May 14, 2023

I have played with this a bit to see what happens when the hashtable gets close to the limits of its capacity. It turns out that allowing table sizes > 1 << 29 can still be useful.

When we need a table this large, things are obviously not fast. However, when we can allocate a 1 << 30 table, we can still utilize the extra elements: reprobing may need quite a few extra hops, but because it is quadratic, it can still reach the other half of the table.
On the other hand, if we can only have a 1 << 29 table and really need something bigger, we get into a churning mode where the hashtable is usable but holds more elements than a single table can fit. In that mode the dictionary continuously tries to rehash, hoping that some keys will be dropped and everything will fit into the same table size.
That mode is a lot slower than having to reprobe a few extra times.
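
For reference, this is the general shape of a quadratic (triangular) probe over a power-of-two table; a generic sketch, not necessarily the exact reprobe sequence used here:

```csharp
// Generic triangular probing over a tiny power-of-two table, just to show
// how the probe sequence leaves the initial neighborhood and eventually
// reaches distant slots; not necessarily this library's exact scheme.
using System;

const int size = 1 << 4;         // tiny table for illustration
const int mask = size - 1;

int idx = 5 & mask;              // initial bucket from the (forced) hash
for (int probe = 1; probe <= 8; probe++)
{
    Console.Write(idx + " ");    // prints: 5 6 8 11 15 4 10 1
    idx = (idx + probe) & mask;  // triangular steps: +1, +2, +3, ...
}
Console.WriteLine();
```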

I think the aggressiveness of rehashing could be tuned down a bit when the table is large, to better accommodate the "more items than can fit" scenario, but ultimately it will always perform suboptimally.

Allowing a 1 << 30 element table keeps some edge-case scenarios usable, compared to a 1 << 29 limit.
At smaller dictionary sizes the limit has little effect either way, so I think there is no reason to make it smaller.

VSadov (Owner, Author) commented May 16, 2023

Thanks!!

VSadov merged commit fab343b into main on May 16, 2023
VSadov deleted the consHash branch on May 16, 2023 at 21:03