
The "theoretical bias limit (bias = ~0.021)" for 32-bit hash functions isn't correct. #12

Open · camel-cdr opened this issue Apr 15, 2021 · 7 comments

camel-cdr commented Apr 15, 2021

Introduction

I've been doing a lot of testing with the bias calculation method used in this repo lately, and I wanted to calculate the best possible bias (the bias that would occur if flipping one input bit changed every output bit with a probability of 50%).

Your blog post mentioned the value bias = ~0.021 in the context of 32-bit hashes, but since I'm testing other powers of two, I needed to calculate the bias myself.

I came to the conclusion that the optimal bias for 32-bit hashes is ~0.006, and, because I needed to modify the bias calculation code along the way, that triple32 has a bias of ~0.0083 under the corrected measure.

I validated this result by implementing a hash function that returns a random number independent of the input value (sketched below); the results confirm my calculations:

original method:
    calculated: ~0.021
    tested:      0.015610715938394663
my method:
    calculated: 0.006087376104758189
    tested:     0.006108424031481263
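
For reference, here is a minimal sketch of such a validation hash (assuming a SplitMix64-style generator; the actual harness may differ). Because the output is independent of the input, any bias the estimator reports for it is the estimator's own sampling floor:

#include <stdint.h>

/* Returns a fresh pseudorandom value regardless of the input. */
static uint64_t state;

uint32_t random_oracle32(uint32_t x)
{
    (void)x;  /* deliberately ignore the input */
    uint64_t z = (state += 0x9e3779b97f4a7c15ULL);  /* SplitMix64 step */
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return (uint32_t)((z ^ (z >> 31)) >> 32);       /* top 32 bits */
}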

How I calculated the bias

I was initially skeptical of the normalization method because of the "FIXME" comment in the implementation of estimate_bias32, and thus I took the liberty of changing how the bias is calculated.

The theoretical bias limit is equal to the mean of the probability distribution of obtaining a specific bias.
Here each counter follows a binomial distribution: with 50% probability either a one or a zero is added to the counter on each sample, so the mean is n/2. For convenience we center the distribution (which is approximately normal for large n) by subtracting the mean (n/2), and divide by n to normalize the bias. Then we take the absolute value, since we are interested in the average error and need to get rid of the sign to compute it.

Now we have a half-normal distribution (a normal distribution where the absolute value of the result is taken), whose mean is sigma*sqrt(2/pi), where sigma is the standard deviation of the original normal distribution before the absolute value is taken. That standard deviation is sqrt(n*0.5*0.5), since we are dealing with a binomial distribution with p=0.5 and q=0.5.
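
In symbols (my condensation of the argument above, where X is a single counter after n samples):

\sigma = \sqrt{n \cdot 0.5 \cdot 0.5} = \frac{\sqrt{n}}{2},
\qquad
\mathrm{E}\!\left[\frac{|X - n/2|}{n}\right]
\approx \frac{\sigma}{n}\sqrt{\frac{2}{\pi}}
= \frac{1}{\sqrt{2\pi n}}

For n = 2^32, scaling by 1000 gives 1000/sqrt(2*pi*2^32) ≈ 0.006087, matching the Maxima result below.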

Here is the complete formula to calculate the theoretical bias limit, using Maxima:

n: 2**32;
stddev: sqrt(n*(0.5)*(0.5));
halfNormalMean: stddev*sqrt(2/%pi);
scaledBias: halfNormalMean/n*1000, numer;
print(scaledBias);
-> 0.006087376104758189

This changed the way the code needs to calculate the mean a bit. We divide by 4294967296.0 instead of 2147483648.0, although that is just a cosmetic constant-factor change. More importantly, we take the absolute value of each difference instead of squaring it and calculating the square root later. You'd assume this would have the same result, since abs(a) == sqrt(a*a), but sqrt doesn't distribute over sums: sqrt(a1*a1 + a2*a2) != sqrt(a1*a1) + sqrt(a2*a2) (for example, sqrt(3*3 + 4*4) = 5, while abs(3) + abs(4) = 7). The root of the mean of squares and the mean of absolute values are genuinely different statistics, so this change is vital.

The original code:

double mean = 0.0;
for (int j = 0; j < 32; j++) {
    for (int k = 0; k < 32; k++) {
        double diff = (bins[j][k] - 2147483648L) / 2147483648.0;
        mean += (diff * diff) / (32 * 32);  /* mean of squared biases */
    }
}
return sqrt(mean) * 1000.0;                 /* root-mean-square, scaled */

My version:

double mean = 0.0;
for (int j = 0; j < 32; j++) {
    for (int k = 0; k < 32; k++) {
        double diff = (bins[j][k] - 2147483648L) / 4294967296.0;  /* normalize by n */
        mean += fabs(diff);                 /* absolute error, not squared */
    }
}
return mean * 1000.0 / (32 * 32);           /* mean absolute bias, scaled */
skeeto (Owner) commented Apr 24, 2021 via email

TheIronBorn commented Apr 19, 2022

> Unless I'm mistaken, I don't think the correction changes the overall ranking of hash functions

This isn't true.

Throwing some random uint32_t[32 * 32] at it quickly generates different orderings:

skeeto_a = 583.0685451747753
skeeto_b = 585.1873324547721
camel_a = 255.15427878349328
camel_b = 253.92012490920024

gist for the full input values: https://gist.github.com/TheIronBorn/62e118c615de4272ee0f5a0a8cd0b4a8
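
To make the ordering comparison concrete, the two metrics can be written as functions over the same bin counts (a sketch assuming the 32×32 layout from the code above; the function names are mine):

#include <math.h>
#include <stdint.h>

/* Original metric: root-mean-square of the per-bit-pair biases. */
double bias_rms(const uint32_t bins[32][32])
{
    double mean = 0.0;
    for (int j = 0; j < 32; j++)
        for (int k = 0; k < 32; k++) {
            double diff = ((double)bins[j][k] - 2147483648.0) / 2147483648.0;
            mean += diff * diff / (32 * 32);
        }
    return sqrt(mean) * 1000.0;
}

/* This issue's metric: mean of the absolute biases, normalized by n. */
double bias_abs(const uint32_t bins[32][32])
{
    double mean = 0.0;
    for (int j = 0; j < 32; j++)
        for (int k = 0; k < 32; k++)
            mean += fabs(((double)bins[j][k] - 2147483648.0) / 4294967296.0);
    return mean * 1000.0 / (32 * 32);
}

Feeding the arrays from the gist through each function should reproduce the flipped orderings above.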

TheIronBorn commented Apr 21, 2022

The best known (#19) is no longer a local optimum if we use this math (using the same neighborhood as #19):

#19 best: 0.03668391400424298
[16 21f0aaad 15 f35a2d97 15] = 0.036331523915578146

It isn't a large difference, though, so the global bests found via the two methods might not differ much in quality.

TheIronBorn commented
Turns out my search in #19 wasn't quite exhaustive enough, and [16 21f0aaad 15 f35a2d97 15] is actually a new best known under both measures:
[16 21f0aaad 15 f35a2d97 15] = 0.10734781817103507
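
In hash-prospector's notation those parameters describe a two-round xorshift/multiply hash; a sketch of the corresponding function (transcribed from the pattern, name mine):

#include <stdint.h>

/* [16 21f0aaad 15 f35a2d97 15]: xorshift, multiply, xorshift, multiply, xorshift */
uint32_t hash32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x21f0aaadU;
    x ^= x >> 15;
    x *= 0xf35a2d97U;
    x ^= x >> 15;
    return x;
}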

TheIronBorn commented May 21, 2022

By the way, I calculated the exact theoretical bias limit for the current math with Mathematica:

bins = 32; (* 32^2 is too slow *)
vars = Array[Unique[x] &, bins];
dists = Distributed[#, BinomialDistribution[n, 1/2]] & /@ vars;
dist = TransformedDistribution[Total[((vars - n/2)/(n/2))^2], dists];
1000 Sqrt[Mean[dist]/bins]
-> 1000/Sqrt[n]

1000/Sqrt[2^32] = 0.0152587890625

Which, as noted earlier, is lower than the 0.021 estimate. Perhaps there's still room for improvement in 3-round hashes.

It also produces this issue's limit:

TransformedDistribution[Total[Abs[(vars - n/2) / n]], dists]

And after applying Stirling's approximation to the binomial coefficient we get exactly 1/sqrt(2 * Pi * n).
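
A sketch of that step (my reconstruction, assuming n even, starting from the exact mean absolute deviation of a Binomial(n, 1/2) variable):

\mathrm{E}\left|X - \tfrac{n}{2}\right|
= \frac{n}{2}\binom{n}{n/2}2^{-n}
\approx \frac{n}{2}\sqrt{\frac{2}{\pi n}}
= \sqrt{\frac{n}{2\pi}}
\quad\Longrightarrow\quad
\frac{\mathrm{E}|X - n/2|}{n} \approx \frac{1}{\sqrt{2\pi n}}

which matches the ~0.006087 figure from earlier once scaled by 1000.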

Note that though these limits differ only by a constant multiplier, the variances might differ by more, so I don't know how well we can translate between them. (Mathematica can't handle Variance[camelDist].)

TheIronBorn commented
This should help normalize the bias to bin size.

I hoped it would also help normalize it to sample size, but even the z-score steadily increases with sample size.

funny-falcon commented
Testing with a cryptographic-quality function (SipHash) shows that 0.021 is a valid estimate of the bias limit.

See #19 (comment)
