-
Notifications
You must be signed in to change notification settings - Fork 26
Avalanche Tests (Part 2)
For this part, I'll be focusing on various iterations of simple hash functions that I've been experimenting with. This shows how a hash function design might evolve.
function cyb0a(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h += h << 1;
h += key[i];
}
return h >>> 0;
}
This is our base function, a bare minimum hash. It's a terrible hash, but it introduces us two important concepts:
- We initialize the hash to a non-zero value, so any zeroes in the input still change the output.
- We add a left shifted version of the hash back into the hash. This establishes position-dependence and serves as a minimal mixing operation. Adding or XORing a transformed version of the hash, back into the hash, is a common approach for achieving mixing, and is often extended to multiplication as well.
function cyb0b(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h += key[i];
h += h << 1;
}
return h >>> 0;
}
With a simple change of the operation order, we achieve a very slight improvement. Generally, we want to incorporate our input as early as possible, so that dispersive operations have more entropy to work with.
function cyb0c(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h += h << 1;
}
return h >>> 0;
}
XORing the input byte here instead of adding has an interesting effect. In some cases I've found that it reduces collisions. In some cases, the reverse is true, however.
function cyb0d(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h += h << 7;
}
return h >>> 0;
}
The next obvious improvement is to change the shift amount from 1 to 7. This appears to be the best possible value. It starts to get worse if you go beyond 7 for some reason. What it effectively does, is multiply by 128.
Also, this is where using XOR appears to beat ADD. At least in some particular tests, Add results in more collisions. So we're sticking with XOR for now.
function cyb1a(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h += h << 7;
}
h += h >>> 14;
return h >>> 0;
}
The next obvious step is to add what is called a finalizer. This is an operation that occurs after the input is fully integrated into the hash, to do further scrambling of the bits.
In this case, we add a single addshift
operation, which appears to result in the best distribution out of other combinations. However, a single step is quite weak, and typically cannot scramble the result sufficiently, so we will need to do more.
function cyb1b(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h += h << 7;
}
h += h << 17;
h ^= h >>> 12;
return h >>> 0;
}
With two mixing steps instead of 1, the results are noticeably improved. These are very likely the most optimal combination of operations as well. To do any better, we have to do more operations.
function cyb1c(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h += h << 7;
}
h += h << 17;
h ^= h >>> 12;
h += h << 4;
return h >>> 0;
}
As you can see, as we add more operations, the mixing improves. Interestingly, both Marsaglia's Xorshift pseudorandom number generator and Jenkin's One-at-a-time hash use a 3-step mixer. I'm not entirely sure what the best combination is, but 17 12 4
appears better than Marsaglia and Jenkins' constants.
function cyb1d(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h += h << 7;
}
h ^= h << 17;
h += h >>> 12;
h ^= h << 4;
h += h >>> 7;
h ^= h << 5;
h += h >>> 13;
h ^= h << 15;
return h >>> 0;
}
Taking this approach to the extreme, we start to see what a more complete avalanche looks like - most pixels should be black. But there's still detectable bias. Notice some patterns still exist in the grey pixels, as well as a few lone blue pixels.
function jenkins(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h += key[i];
h += h << 10;
h ^= h >>> 6;
}
h += h << 3;
h ^= h >>> 11;
h += h << 15;
return h >>> 0;
}
Here's Jenkins' well known hash (init h = 1
), its quite similar to the other hashes so far, except it does two steps for every byte instead of one. This allows it to mix slightly better with its 3-step final mixer.
Oddly, my final mixer doesn't work as well in this case, nor can I seem to find any better constants for this construction. It also appears to do better with h += key[i]
instead of h ^= key[i]
. Very curious.
However, it still requires up to 6-7 final mixing steps to achieve avalanche (it fully avalanches with the 7 final steps of cyb1d
). So it's probably not worth the extra op per byte IMO.
function cyb2a(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h += h << 7;
}
h += h >>> 15;
h ^= Math.imul(h, h);
h += h >>> 15;
return h >>> 0;
}
Up until now we've been limited to left and right shifts,intentionally avoiding multiplication. In the past, multiplication was slow, and shifting was fast, so it made sense to use the fast option. But multiplication allows for far better mixing, as we can see here.
In fact, here we're using 'xorsquare', which has some interesting properties. However, constant multipliers can also be used in fewer steps than shifts.
function cyb2b(key, h = 1) {
for(let i = 0; i < key.length; i++) {
h ^= key[i];
h = Math.imul(h, 1828025797);
}
h ^= h >>> 15;
h ^= Math.imul(h, h);
h ^= h >>> 15;
return h >>> 0;
}
Using Math.imul
in the loop, with a large 32-bit multiplier, can skyrocket us into avalanche territory. Not bad at all.