High clustering of hash function #20
@ENikS Are you planning to contribute a change for this? If not, I was thinking about looking into it too :)
Hmm, I was thinking along the lines of matching the hashing behavior to what the standard library does.
The method you are referencing uses prime-sized arrays and MOD to calculate buckets. Normally that approach is less sensitive to inconsistent hashing, but slower than power-of-two tables. With FastMod it might be different, though. I would be interested to see benchmarks.
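For reference, a sketch of the two bucket-selection strategies being compared: prime-sized tables with MOD (optionally via Lemire's FastMod trick, which is what .NET's Dictionary uses on 64-bit targets) versus power-of-two tables with a mask. This is illustrative code, not this repo's implementation:

```csharp
// Prime-sized table: classic modulo. Robust against weak hashes, but
// integer division is comparatively slow.
static int BucketByMod(int hash, int primeSize)
    => (int)((uint)hash % (uint)primeSize);

// Power-of-two table: a single AND. Fast, but only the low bits of the
// hash participate, so a weak hash clusters badly.
static int BucketByMask(int hash, int pow2Size)
    => hash & (pow2Size - 1);

// Lemire's "fastmod": replaces the division for prime-sized tables with
// multiplies, as in .NET's HashHelpers.FastMod.
static uint FastMod(uint hash, uint divisor, ulong multiplier)
    => (uint)(((((multiplier * hash) >> 32) + 1) * divisor) >> 32);

// The multiplier is precomputed once per table size:
static ulong GetFastModMultiplier(uint divisor)
    => ulong.MaxValue / divisor + 1;
```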
I've run a few benchmarks:

```
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1635/22H2/2022Update/SunValley2)
12th Gen Intel Core i7-1260P, 1 CPU, 16 logical and 12 physical cores
.NET SDK=8.0.100-preview.2.23157.25
  [Host]     : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  Job-LBSDVQ : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2

InvocationCount=1  UnrollFactor=1
```
Some clustering is a bit of an intentional trade-off, as hinted in the comments in the source.

Some background on this: in theory, a good hash function maps any given set of keys to a uniformly distributed set of buckets. Also, the hashtable is not really in the business of hashing the keys; ideally it could assume that the keys' own hash codes already take care of uniformity. In practice that is not guaranteed.

There is also one common scenario where the keys are consecutive integers - 42, 43, 44, 45, ... Someone will certainly use the dictionary as a lock-free arraylist. The keys could also be phone numbers or postal codes, memory addresses, or some other numbers. In such cases, preserving some clustering is useful, and keys with "good" hashes are not hurt by it.

Right now the shuffle will preserve short runs of consecutive hashes, up to 8, if I recall correctly. After that it should scatter the hashes. The picture that the shader map shows is expected.
I assume this happens only when the dictionary is small and the keys are integers. Once the dictionary gets much bigger than 8, the fact that we preserve runs of 8 in the shuffle will no longer matter. So far I think the trade-offs are reasonable, but let me think more about the alternatives.
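To illustrate the run-preserving idea, here is a toy stand-in (not the library's actual shuffle): a mixer can pass the low 3 bits of the hash through untouched, so up to 8 consecutive hashes land in adjacent buckets, while everything above gets scattered:

```csharp
// Toy run-preserving mixer (illustrative only, not this repo's code).
// The low 3 bits pass through untouched, so runs of up to 8 consecutive
// hashes stay in adjacent buckets; the bits above them are scattered
// with a Fibonacci-style multiply.
static uint RunPreservingMix(uint h)
{
    const uint Fib = 2654435769u;          // ~2^32 / golden ratio
    uint scattered = (h >> 3) * Fib;       // scramble everything above bit 2
    return (scattered << 3) | (h & 7u);    // re-attach the preserved low bits
}
```

With integer keys 0 to N, each aligned block of 8 consecutive keys stays collision-free in adjacent buckets, while the blocks themselves are spread across the table.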
I started researching this implementation because I see it as a promising algorithm for Unity Container's new storage method. I trust Dr. Click that it is faster than MOD-based hashing and scales better. Unfortunately, all the benchmarks I am running are quite disappointing: in theory it should be much faster, but as the benchmarks show, that is not the case.
I've removed the mixing method altogether and used the worst-case scenario for the hash values: consecutive hash codes from 0 to N. Considering this issue, there is a hash conflict on every other insertion. These are the benchmarks:

```
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
12th Gen Intel Core i7-1260P, 1 CPU, 16 logical and 12 physical cores
.NET SDK=7.0.300-preview.23179.2
  [Host]     : .NET 6.0.16 (6.0.1623.17311), X64 RyuJIT AVX2
  Job-XVWKVQ : .NET 6.0.16 (6.0.1623.17311), X64 RyuJIT AVX2

InvocationCount=1  UnrollFactor=1
```
The point I am attempting to make is this: mixing should probably either be improved or removed as redundant. As it stands, it adds overhead without improving performance. Personally, I'd vote for removing it and enjoying an instant bump in performance.
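A minimal BenchmarkDotNet sketch of the scenario described above (consecutive hash codes from 0 to N) might look like the following. The key type is hypothetical, and the standard ConcurrentDictionary is only a placeholder; substitute the dictionary implementation under test:

```csharp
using System;
using System.Collections.Concurrent;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public readonly struct SequentialKey : IEquatable<SequentialKey>
{
    public readonly int Value;
    public SequentialKey(int value) => Value = value;

    // Identity hash: consecutive keys yield consecutive hash codes.
    public override int GetHashCode() => Value;
    public bool Equals(SequentialKey other) => Value == other.Value;
    public override bool Equals(object obj) => obj is SequentialKey k && Equals(k);
}

public class SequentialInsertBenchmark
{
    [Params(1_000, 1_000_000)]
    public int N;

    [Benchmark]
    public object InsertSequentialKeys()
    {
        // Placeholder: swap in the dictionary implementation under test.
        var dict = new ConcurrentDictionary<SequentialKey, int>();
        for (int i = 0; i < N; i++)
            dict[new SequentialKey(i)] = i;
        return dict;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<SequentialInsertBenchmark>();
}
```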
Sorry for not getting to this earlier; it has been a pretty busy time. My current thinking is that we still need the mixing. It is correct that mixing is redundant when keys are already random or otherwise well-behaved. However, a general-purpose hashtable must also avoid perf traps when keys are not well-behaved - for example, keys that are all multiples of a large power of two. I understand the desire to be able to say "my keys are well-behaved, please do not mix" through some API or flag when the hashtable is created and enjoy the performance boost, but I can't think of an efficient way to support that.
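The power-of-two perf trap mentioned above is easy to demonstrate with a few lines of standalone C#: without mixing, keys that are all multiples of a large power of two share identical low bits, so in a power-of-two table they collapse into a single bucket:

```csharp
using System;

class PowerOfTwoTrap
{
    static void Main()
    {
        const int tableSize = 1024;        // power-of-two bucket count
        const int mask = tableSize - 1;

        for (int i = 1; i <= 5; i++)
        {
            int key = i * 4096;            // keys are multiples of 2^12
            Console.WriteLine($"key {key,6} -> bucket {key & mask}");
        }
        // Without mixing, every one of these keys lands in bucket 0.
    }
}
```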
I've noticed that the dictionary has an unusually high rate of resizes even when the load factor is still only around 50-55%. Digging deeper, I found that the mixing algorithm used in the implementation is to blame.
I've attached two screenshots of GLSL shaders driven by the hash functions. The first screenshot shows the original Wang/Jenkins hash used by Dr. Cliff Click; the second shows the hash used by the C# implementation.
As you can clearly see, there is a lot of clustering and, as a result, erroneous resizing of even moderately empty tables.
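Since the shader images are not reproduced here, a rough console stand-in can paint the same kind of map. `CandidateMix` is a placeholder for whichever mix function is being inspected, not the implementation's actual mixer:

```csharp
using System;

class HashHeatmap
{
    // Placeholder mixer to inspect; substitute the function in question.
    static uint CandidateMix(uint h) => h * 2654435769u;

    static void Main()
    {
        const int width = 64, height = 32;
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                uint h = CandidateMix((uint)(y * width + x));
                // Shade each cell by a single output bit: a well-mixing
                // function looks like noise, a clustering one shows bands.
                Console.Write((h & 0x8000u) != 0 ? '#' : '.');
            }
            Console.WriteLine();
        }
    }
}
```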