
Do not use the lowermost bit when forcing hash to not match known values #22

Merged
VSadov merged 1 commit into main from consHash on May 16, 2023

Conversation

VSadov (Owner) commented May 13, 2023

We need to make sure that real hashes never match one of the special values. That requires wasting a few bits of the hash, but we have a choice of which bits to sacrifice.

The lowermost bit might not be the best choice: it is plausible that some hashing schemes produce consecutive/dense hashes, and `| 1` would then result in a collision in every adjacent pair.

Fixes: #21 (Consecutive hashes create conflicts)
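
To illustrate the concern, here is a minimal sketch (not the code changed in this PR; the helper names and the particular replacement bit are assumptions made for the example):

```csharp
// Illustrative only: ForceLowBit/ForceHighBit and the choice of bit 30 are
// assumptions for this sketch, not the library's actual implementation.
using System;

static int ForceLowBit(int h)  => h | 1;          // old scheme: sacrifice bit 0
static int ForceHighBit(int h) => h | (1 << 30);  // alternative: sacrifice a high bit

// Dense, consecutive hashes (e.g. Int32 keys hashing to themselves) show the
// problem: forcing bit 0 maps every adjacent pair {2k, 2k + 1} onto 2k + 1,
// while forcing a high bit leaves the low-order, bucket-selecting bits intact.
for (int h = 4; h < 8; h++)
    Console.WriteLine($"{h}: |1 => {ForceLowBit(h)}, |(1<<30) => {ForceHighBit(h)}");
// 4: |1 => 5, |(1<<30) => 1073741828
// 5: |1 => 5, |(1<<30) => 1073741829
// 6: |1 => 7, |(1<<30) => 1073741830
// 7: |1 => 7, |(1<<30) => 1073741831
```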

ENikS (Contributor) commented May 13, 2023

Perhaps checking for the changed max size is in order too?

VSadov (Owner, Author) commented May 13, 2023

> Perhaps checking for the changed max size is in order too?

Right, with this change there is no point in having tables larger than 1 << 29. While previously we could use the extra space for collision resolution, that would not work now.
It is probably acceptable as a tradeoff; half a billion items is still a very large hashtable.
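
For scale, this is just arithmetic, not library code:

```csharp
using System;

Console.WriteLine(1 << 29);   // 536870912  -- roughly half a billion slots
Console.WriteLine(1 << 30);   // 1073741824 -- about a billion
```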

ENikS (Contributor) commented May 13, 2023

For large tables (larger than 2^29), idx could be shifted left, effectively creating bins, but I personally don't think it is worth the effort to implement.
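
Roughly what I have in mind, as a hypothetical sketch (the names and numbers are made up for the example; this is not proposed code):

```csharp
// Hypothetical sketch of the "bins" idea: with 2^29 distinct forced-hash
// values but a 2^30 table, shifting the index left by one bit addresses
// 2-slot bins, so the upper half of the table stays reachable without
// changing the hash itself.
using System;

const int tableSize = 1 << 30;   // power-of-two table
const int binShift  = 1;         // 2 slots per bin

int BinStart(int hash) => (hash << binShift) & (tableSize - 1);

Console.WriteLine(BinStart(3));  // 6
Console.WriteLine(BinStart(4));  // 8 -- consecutive hashes land 2 slots apart
```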

VSadov (Owner, Author) commented May 14, 2023

I have played with this a bit to see what happens when the hashtable gets close to the limits of its capacity. It turns out that allowing table sizes > 1 << 29 can still be useful.

When we need a table this large, things are obviously not fast. However, when we can allocate a 1 << 30 table, we can still utilize the extra elements: reprobing may need quite a few extra hops, but because it is quadratic, it can still reach the other half of the table.
On the other hand, if we can only have a 1 << 29 table and really need something bigger, we get into a churning mode where the hashtable is usable but holds more elements than a single table can fit. In that mode the dictionary continuously tries to rehash, hoping that some keys will be dropped and everything will fit into the same table size.
That mode is a lot slower than having to reprobe a few extra times.
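
For reference, this is the general shape of a quadratic (triangular) probe over a power-of-two table; a generic sketch, not necessarily the exact reprobe sequence used here:

```csharp
// Generic triangular probing over a tiny power-of-two table, just to show
// how the probe sequence leaves the initial neighborhood and eventually
// reaches distant slots; not necessarily this library's exact scheme.
using System;

const int size = 1 << 4;         // tiny table for illustration
const int mask = size - 1;

int idx = 5 & mask;              // initial bucket from the (forced) hash
for (int probe = 1; probe <= 8; probe++)
{
    Console.Write(idx + " ");    // prints: 5 6 8 11 15 4 10 1
    idx = (idx + probe) & mask;  // triangular steps: +1, +2, +3, ...
}
Console.WriteLine();
```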

I think the aggressiveness of rehashing could be tuned down a bit when the table is large, to better accommodate the "more items than can fit" scenario, but ultimately it will always perform suboptimally.

Allowing a 1 << 30 element table keeps some edge-case scenarios usable, compared to a 1 << 29 limit.
At smaller dictionary sizes the limit has little effect either way, so I think there is no reason to make it smaller.

VSadov (Owner, Author) commented May 16, 2023

Thanks!!

VSadov merged commit fab343b into main on May 16, 2023
VSadov deleted the consHash branch on May 16, 2023 at 21:03