-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
change computation of hash value. #4675
Conversation
Fresh PR in place of #4621 |
src/big/big_float.cr
Outdated
@@ -17,6 +17,18 @@ struct BigFloat < Float | |||
LibGMP.mpf_init_set_str(out @mpf, str, 10) | |||
end | |||
|
|||
def initialize(num : BigInt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request is much broader than simply changing the hashing calculation. But these updates are really important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest builds of #4653 shows unstable behavior of constructors (Compiler can prefer another one under some conditions).
Please remove 'em and use to_big_X
.
src/float.cr
Outdated
def hash_normalize | ||
float_normalize_wrap do | ||
{% if flag?(:x86) || flag?(:x86_64) || flag(:arm) || flag(:aarch64) %} | ||
# it should work on every architecture where endianess of Float32 and Int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you hint which way?
I relied on crystal tool format
, but it doesn't help much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just apply proper indentation on this and the following lines. This line should be indented by 8 spaces for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@funny-falcon indent by two spaces.
def hash_normalize
float_normalize_wrap do
{% if flag?(:x86) || flag?(:x86_64) || flag(:arm) || flag(:aarch64) %}
# it should work on every architecture where endianess of Float32 and Int32
no tabs please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see: i used tabs unintentionally instead of spaces.
Fixed that, and reduced relying on macroses.
src/float.cr
Outdated
{% if flag?(:x86) || flag?(:x86_64) || flag(:arm) || flag(:aarch64) %} | ||
# it should work on every architecture where endianess of Float32 and Int32 | ||
# matches and float is IEEE754. | ||
unsafe_int = unsafe_as(Int32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afaik tool format
don't touch macro ({% %}
) lines, so they should be formatted manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed usage of macros here in following commits.
src/float.cr
Outdated
{% else %} | ||
float_normalize_reference | ||
{% end %} | ||
if FLOAT64_IS_IEEE754 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this no longer a macro as in the first commit? Can this constant change at runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A constant can't change at runtime so that's not possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sry, I meant if it can differ based on the system where the code is executed or is it fixed for every target architecture? If fixed, the appropriate branch should be chosen according to target arch at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, you are right: it is tested in runtime even in release build.
Lets think a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like crystal compiler does very strange things with constants:
even if it knows constant value at compile time and can use it in macros, it still inserts checks for constant's initializer in final code, and doesn't propagate them as real constants to llvm optimizer.
I add private def hash_bits
and private def hash_modulus
to workaround this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @asterite
src/number/hash_normalize.cr
Outdated
@@ -0,0 +1,115 @@ | |||
module Number::HashNormalize | |||
# Idea by Akzhan Abdulin @akzhan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop this line, thanks.
src/number/hash_normalize.cr
Outdated
# exponentiation algorithm. For reduce(2**e) it's even better: since | ||
# P is of the form 2**n-1, reduce(2**e) is 2**(e mod n), and multiplication | ||
# by 2**(e mod n) modulo 2**n-1 just amounts to a rotation of bits. | ||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this empty comment too.
This is great 😍 |
One question: does anyone uses Crystal on architectures with restricted unaligned read? Problem line is https://github.com/crystal-lang/crystal/pull/4675/files?diff=unified#diff-b88fe795012d5923ca24e66769201422R158 - permuting slice of bytes. Can any one with access to arm with crystal run following, please: str = "1234567"
buf = str.to_slice
ptr = buf.to_unsafe
4.times do |i|
p (ptr+i).as(Pointer(UInt32)).value.to_s(16)
end |
@funny-falcon We probably do unaligned reads all over the stdlib. As long as there's a spec which will break and alert us to the problem on any new architectures with this limitation it's fine to do unaligned reads for now. |
Antirez has a nice writeup about ARM and unaligned accesses: ARMv5 doesn't allow unaligned accesses. ARMv6 allows word-sized unaligned accesses but fails on multiple words. The Linux kernel will rescue and fix unaligned accesses, so unaligned accesses do work on Linux on ARM, but with slow performance. As stated by @RX14 we probably have unaligned accesses in the core/stdlib, thought that doesn't mean we shouldn't care. I tried to a qemu VM but the ARM emulator allows unaligned accesses, so I can't check what happens in |
Note that LLVM can be smart and fix unaligned accesses at compile time when it's obvious. |
I've tested equivalent C program on Raspberry Pi2, and it handles unaligned read well, so I think no issue there. |
src/stdhasher.cr
Outdated
# end | ||
|
||
# StdHasher used as standard hasher in `Object#hash` | ||
# It have to provide defenense against HashDos, and be reasonably fast. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defenense > defense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
src/number/hash_normalize.cr
Outdated
int_to_hashnorm 314159 | ||
end | ||
|
||
# This function is for reference implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should note that this method is used under some conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I improved comment.
I mentioned, that currently hash for BigFloat is not precise for numbers with big fractional part (ie it doesn't distinguish '1.0000000000000001' and '1.0000000000000002').
It happens, cause BigFloat#hash_normalize
falls back to to_f.hash_normalize
for such numbers.
It is possible to improve, using this float_normalize_reference
and overloading Math.frexp
. (or introducing BigFloat#frexp
and using it).
Should it be done? (ie, is hashing of such numbers is common task?)
If yes, should it be done in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to merge #4653 :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But no, I think that's hashing of Big integers/floats is very uncommon task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those implementation doesn' help, cause they still returns {Float64, Int}, so they strictly equal to Math.frexp v.to_f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation suitable for exact hashing should look like:
module Math
def frexp(value : BigFloat)
LibGMP.mpf_get_d_2exp(out exp, value)
frac = BigFloat.new { |mpf|
if exp >= 0
LibGMP.mpf_div_2exp(mpf, self, exp)
else
LibGMP.mpf_mul_2exp(mpf, self, -exp)
end
}
{frac, exp}
end
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But no, I think that's hashing of Big integers/floats is very uncommon task.
It is good.
Note, that BigInteger still hashed exactly. Also, BigFloat with big integer part and small fractional part are also hashed exactly.
Stop... I think, my last sentence is not true. 61bit Mersen Prime is larger than Float64 precision (53bit), so precision will be lost with conversion to_f.hash_normalize
:-(
I should think about carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like, exact hashing for BigFloat is a single reliable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code is readable, effective and solves the task of protecting against Hash DoS attacks.
src/big/big_rational.cr
Outdated
@@ -41,6 +41,22 @@ struct BigRational < Number | |||
initialize(num, 1) | |||
end | |||
|
|||
# Creates a exact representation of float as rational. | |||
# | |||
# It sures that `BigRational.new(f) == f` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sures -> ensures
src/big/big_int.cr
Outdated
def hash | ||
to_u64 | ||
def hash_normalize | ||
# remainder(hash_modulus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, implementation description.
src/big/big_rational.cr
Outdated
def hash | ||
to_f64.hash | ||
def hash_normalize | ||
# self.remainder(hash_modulus).to_f.hash_normalize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is implementation description
src/hash.cr
Outdated
@@ -710,14 +710,19 @@ class Hash(K, V) | |||
# | |||
# ``` | |||
# foo = {"foo" => "bar"} | |||
# foo.hash # => 3247054 | |||
# foo.hash # => 3247054 (not exactly) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not exactly -> approximation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is "example". In fact, will be different on every process run.
src/stdhasher.cr
Outdated
require "crystal/system/random" | ||
|
||
# Hasher usable for `def hash(hasher)` should satisfy protocol: | ||
# class MyHasher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, use triple backticks (```) for code blocks
src/stdhasher.cr
Outdated
high = (v >> 32).to_u32 | ||
# This condition here cause of some 32bit issue in LLVM binding, | ||
# so compiler_spec doesn't pass without it. | ||
# Fill free to comment and debug. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fill -> feel
src/stdhasher.cr
Outdated
# It have to provide defense against HashDos, and be reasonably fast. | ||
# To protect against HashDos, it is seeded with secure random, and have | ||
# permutation that hard to forge without knowing seed and seeing hash digest. | ||
struct StdHasher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just Hasher
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone suggested, I don't remember exactly who. Why not StdHasher
?
There is obviously could be different implementations, still usable with suggested protocol, so it is just 'standard hasher'.
But if you really think it should be Hasher
, I'll rename it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get your drift, yet this Std
abbreviation looks IMO odd, it would be a precedent in stdlib. Maybe someone have a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hash::Hasher
?
Hasher
looks like too generic name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed to Hash::Hasher
.
Btw, you can exec bin/crystal docs your files to preview generated documentation. Or make doc afair |
|
src/hash.cr
Outdated
# foo.hash # => 3247054 (not exactly) | ||
# ``` | ||
# Protocol method for generic hashing. | ||
# Note: it should be independent of iteration order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note -> NOTE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@funny-falcon That means your compiler version is too old. |
@RX14 how could be |
Prepare hash infrastructor to future change of hashing algrorithm to protect against Hash DoS. Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside. As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` is implemented as unseeded cause they are used before `StdHasher @@seed` is initialized. Hash::Hasher is default hasher that uses `hash(hasher)` and it is used as default seeded hasher. Also, number normalization for hashing introduced, ie rule 'equality forces hash equality' is forced (`a == b` => `a.hash == b.hash`). Normalization idea is borrowed from Python implementation. It fixes several issues with BigInt and BigFloat on 32bit platform, but not all issues. Fixes crystal-lang#4578 Fixes crystal-lang#3932 Prerequisite for crystal-lang#4557 Replaces crystal-lang#4581 Correlates with crystal-lang#4653
cause StringPool is used in json decoding, it is important to have it safe.
b916ac1
to
93c97f1
Compare
I've rebased. Hope this saga will end :-) |
# Type for Hash::Hasher#digest | ||
alias Value = UInt32 | ||
|
||
@@seed = uninitialized StaticArray(UInt32, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just make this a random U32? If we use a constant with initializer (SEED = Random::System.rand(UInt32::MIN..UInt32::MAX)
), shouldn't the constant be initialized on-demand, avoiding the need for special "unseeded" methods below?
buf = pointerof(@@seed).as(Pointer(UInt8)) | ||
Crystal::System::Random.random_bytes(buf.to_slice(sizeof(typeof(@@seed)))) | ||
|
||
protected getter a : UInt32 = 0_u32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if this was renamed to something a bit more descriptive.
end | ||
|
||
# Calculate hashsum for value | ||
def self.hashit(value) : Value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think i'd just name this hash
.
# Mix nil to state | ||
def <<(v : Nil) : Nil | ||
permute_nil() | ||
nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can leave these trailing nil
s off if we have the : Nil
restriction, iirc.
end | ||
|
||
# Mix nil to state | ||
def <<(v : Nil) : Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only function of these methods seem to be calling permute
. Why not move the permute body into this function?
protected def permute_nil | ||
# LFSR | ||
mx = (@a.to_i32 >> 31).to_u32 & 0xa8888eef_u32 | ||
@a = (@a << 1) ^ mx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the current implementation of this essentially @a *= 31
not a LFSR?
c = normalize_byte(c) | ||
h = 31 * h + c | ||
def hash(hasher) | ||
hasher.raw(bytesize.to_u32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why raw here? I think that the correct way to use hasher should always be <<
without anyone having to think about raw
except people implementing hash(hasher)
on numbers. Shouldn't raw
be :nodoc:
?
if entry | ||
return entry | ||
mask = (@capacity - 1).to_u32 | ||
index, d = hash & mask, 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please split this into two assignments for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops
Chris, most of your last remarks are meaningless, given it is just template for stronger algorithm you will have to implement soon after merging this. Renaming I will not waste my time on this PR anymore. |
And why do you think I use class variable instead of constant? To avoid "initialization on demand" inlined into every call to |
@funny-falcon seed is only used when initializing the hasher, so it's relatively off the hot path compared to say the inner loop of a hash function. Furthermore it's random so it doesn't help at all with further optimizations. Where is there another I think that i've made some relatively minor requests for a cleanup which are relevant both to this algorithm and to the implemntation of any future algorithm. I see no reason why siphash-1-2 can't be implemented in |
What if hot loop calls Isn't Will you inline siphash implementation into every
|
I think it's likely to be a very minor performance hi, more than worth it for the advantage of removing the complications of having to deal with initialization order and the We're talking about a class method, I'm not an expert in siphash, but wouldn't you would implement siphash in terms of one input length and the other Thanks, I hadn't thought of that. It appears the only differece (currently) between |
You arguments are out of my sense of sanity. |
Class vars and non-simple constants are always lazily initialized because you never know when one constant will need the value of another constant. It's a way to avoid depending on the order of initialization. Probably seed can be moved to a constant (that will always be lazily initialized, and checked on every access, but in any case I don't think this should affect performance much) |
Friends, I don't want to discus it any more. You may change anything after merging. Instead you pushing on me in this small details. I think code will be worse if I apply your last suggestions. Therefore I will not apply them. I'm too tired of this PR. If you are not going to merge it in its current shape, I will close this PR and remove code after tomorrow. |
@funny-falcon You are right. Open source can be a PITA, I feel the same every day. I will take care of this issue, don't worry. Thank you for this contribution! I will probably copy some pieces of code, if that's OK with you. |
@asterite , thank you. |
s/can/may/ |
@funny-falcon Thanks for your time invested here, Sometimes Open Source software is very hard to manage. I need to quote this:
|
I understand you, This PR is the most commented until now 😅 I think 216 comments in this PR is a record for this project. stay tuned of #4946 |
As declared by Crystal language reference, 1i32.hash should equal to 1f64.hash. Extracted from crystal-lang#4675, also replaces crystal-lang#4581.
As declared by Crystal language reference, 1i32.hash should equal to 1f64.hash. Extracted from crystal-lang#4675, also replaces crystal-lang#4581.
As declared by Crystal language reference, 1i32.hash should equal to 1f64.hash. Extracted from crystal-lang#4675, also replaces crystal-lang#4581.
For now all great parts of #4675 now merged into Thanks again to all of us. |
To protect against Hash DoS, change the way hash value is computed.
Class|Struct should define method
def hash(hasher)
and callhasher << @ivar
inside.As an option, for speed, and for backward compatibility,
def hash
still could be implemented. It will be used for Hash of matched type.
Thread#hash
andSignal#hash
is implemented as unseeded cause they areused before
StdHasher @@seed
is initialized.But it is better to implement
def hash(hasher)
.StdHasher is default hasher that uses
hash(hasher)
and it is used as defaultseeded hasher. It also implements
unseeded
forEnums
.Also, number normalization for hashing introduced, ie rule 'equality
forces hash equality' is forced (
a == b
=>a.hash == b.hash
).Normalization idea is borrowed from Python implementation.
(idea by Akzhan Abdulin @akzhan)
Fixes #4578
Prerequisite for #4557
Replaces #4581