-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The "theoretical bias limit (bias = ~0.021)" for 32-bit hash functions isn't correct. #12
Comments
My bias values have no absolute meaning or units. They are really just a
relative measure for comparing hash quality during an automated search. I
could never get those values normalized properly probably because of my
errors you've pointed out here.
My comment about "theoretical bias limit" is hand-waving since I didn't
really know what to call it. I wanted to find a lower bound on bias since
it was quickly obvious it wasn't going to be zero. So I set up a giant
(16GB), 32-bit S-box, shuffled it thoroughly (/dev/urandom), then used it
as an integer hash function in an "exact" bias measurement. This is the
ideal case, a truly random permutation with no mathematical relationship
between input and output, uncompressible, and no hash function I could
devise could possibly beat this.
I ran this a few times (it's very slow!), and they all ended up around
that ~0.021 bias (by my flawed measure). This was the lower bound, the
limit. Since my best 3-round xorshift-multiply hashes hit this limit,
there's no reason to try 4-round hashes. They're just as biased as an
ideal permutation, so there's no room for improvement.
Your idea of accomplishing the same by replacing the hash function with a
PRNG is interesting, though I'm skeptical that it accurately identifies
the lower bound. It doesn't have the constraints of a permutation.
Since I got what I needed, I didn't keep that 32-bit S-box code. My new
16-bit prospector can test S-boxes very quickly and easily since they're
so small, so I added a special mode for it. It takes only a few seconds to
(roughly) find this same lower bound for 16-bit hashes:
./hp16 -S >/dev/null
Just like with 32-bit xorshift-multiply, three rounds are sufficient to
hit this lower bound (hash16_xm3). Though two carefully-chosen rounds can
come pretty close (hash16_xm2).
I think you're right about my math being wrong. I'm hesitant to change it
at this point since I've got a bunch of data and momentum with the old
measurement, and I'd at least identify the corrected measurement by a
different name. Unless I'm mistaken, I don't think the correction changes
the overall ranking of hash functions, i.e. under the old measurement A is
better than B, but under the new measurement B is better than A. The
prospector only needs relative rankings to function properly.
|
This isn't true. Throwing some random
gist for the full input values: https://gist.github.com/TheIronBorn/62e118c615de4272ee0f5a0a8cd0b4a8 |
it isn't a large difference though so the global best via the two methods might not be that different in quality. |
Turns out my search in #19 wasn't quite exhaustive enough and |
By the way I calculated the exact theoretical bias limit for the current math with Mathematica: bins = 32; (* 32^2 is too slow *)
vars = Array[Unique[x]&, bins];
dists = Distributed[#, BinomialDistribution[n, 1/2]] &/@ vars;
dist = TransformedDistribution[Total[((vars - n/2) / (n/2))^2], dists];
Mean[dist] / bins
It also produces this issue's limit: TransformedDistribution[Total[Abs[(vars - n/2) / n]], dists] And after applying Stirling's approximation for the binomial coefficient we get exactly Note that though these limits differ only by a constant multiplier the variances might differ by more so I don't know how well we can translate them. (Mathematica can't handle |
This should help normalize the bias to bin size. I hoped it would also help normalize it to sample size but even the z-score steadily increases with sample size |
Test with cryptographic quality function (SipHash) show 0.021 is valid estimation of bias limit. See #19 (comment) |
Introduction
I've been doing a lot of testing using the bias calculation method used in the repo lately, and I wanted to calculate the best possible bias (the bias that would occur if swapping one bit would change every bit with a probability of 50%).
Your blog post mentioned the value
bias = ~0.021
in the context of 32-bit hashes, but since I'm testing other powes-of-two, I needed to calculate the bias by my self.I came to the conclusion that the optimal bias for 32-bit hashes is
~0.006
and that, because I needed to modify the bias calculation code,triple32
has a bias of~0.0083
.I validated this result by implementing a hash function that returns random number independent of the value to hash and the results confirm my calculations:
How I calculated the bias
I was initially skeptical towards the normalization method, because of the "FIXME" comment in the implementation of
estimate_bias32
, and thus I took the liberty to change how the bias is calculated.The theoretical bias limit is equal to the mean of the probability distribution of obtaining a specific bias.
Here that is a binomial distribution where with a 50% probability either one or zero is added to the counter, thus the mean is at
n/2
. For an easier time we can now center the normal distribution by subtracting the mean (n/2
) and divide by then
to normalize the bias. Then we take the absolute value since we are interested in the average error, and we need to get rid of the sign to calculate that.Now we have a half-normal distribution (a normal distribution where the absolute value of the result is taken) from which the mean can be calculated using
variance*sqrt(2/pi)
, where thevariance
is the variance of the original normal distribution where the absolute value hasn't been taken yet. The original variance can be calculated using the formulasqrt(n*0.5*0.5)
, since we are dealing with a binomial distribution withp=0.5
andq=0.5
.Here is the complete formula to calculate the theoretical bias limit using maxima:
This changed the way the code needs to calculate the mean a bit. So we divide by
4294967296.0
instead of2147483648.0
, although this is just a cosmetic change. And we also take the absolute value of the difference instead of squaring it and calculating the square root later. You'd assume that this would have the same result sinceabs(a) == sqrt(a*a)
, butsqrt((b*b) / b) != sqrt(a*a)/b
andsqrt(a1*a1 + a2*a2) != sqrt(a1*a1) + sqrt(a2*a2)
, so this change is vital.hash-prospector/prospector.c
Lines 659 to 666 in ea7ee99
The text was updated successfully, but these errors were encountered: