change computation of hash value. #4675

funny-falcon · 2017-07-05T08:12:58Z

To protect against Hash DoS, change the way the hash value is computed.
A Class or Struct should define the method `def hash(hasher)` and call
`hasher << @ivar` inside it.

As an option, for speed and for backward compatibility, `def hash`
can still be implemented. It will be used for a Hash of the matching type.
`Thread#hash` and `Signal#hash` are implemented as unseeded because they are
used before the StdHasher `@@seed` is initialized.

But it is better to implement def hash(hasher).
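A minimal sketch of the `def hash(hasher)` protocol described above. `ToyHasher`, `Point` and `toy_digest` are hypothetical names used only for illustration, and the mixing step is a plain multiply-and-add, not the permutation from this PR; only the shape of the protocol (`hasher << @ivar`) is taken from the description.

```crystal
# Hypothetical toy hasher: it only demonstrates the `<<` protocol shape.
# It is a class so that the state mutated inside `hash(hasher)` is visible
# to the caller.
class ToyHasher
  getter state : UInt32

  def initialize(seed : UInt32)
    @state = seed
  end

  # Mix an integer into the state (wrapping arithmetic, no overflow errors).
  def <<(value : Int32) : Nil
    @state = (@state &* 31_u32) &+ value.to_u32!
  end

  # Mix every byte of a string into the state.
  def <<(value : String) : Nil
    value.each_byte { |byte| @state = (@state &* 31_u32) &+ byte }
  end
end

struct Point
  def initialize(@x : Int32, @y : Int32, @label : String)
  end

  # Protocol method: feed every instance variable to the hasher.
  def hash(hasher : ToyHasher) : Nil
    hasher << @x
    hasher << @y
    hasher << @label
  end
end

# Hypothetical helper: hash a value with a given seed and return the state.
def toy_digest(value, seed : UInt32) : UInt32
  hasher = ToyHasher.new(seed)
  value.hash(hasher)
  hasher.state
end

puts toy_digest(Point.new(1, 2, "a"), 7_u32)
```

Because the seed is part of the hasher and not of the hashed type, the same `Point` produces different digests under different seeds, which is the property a seeded hasher relies on.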

StdHasher is the default hasher that uses `hash(hasher)`, and it is used as the default
seeded hasher. It also implements unseeded hashing for Enums.

Also, number normalization for hashing is introduced, i.e. the rule 'equality
forces hash equality' is enforced (a == b => a.hash == b.hash).
The normalization idea is borrowed from the Python implementation.
(idea by Akzhan Abdulin @akzhan)
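A rough sketch of the Python-style normalization mentioned above, assuming reduction modulo the Mersenne prime 2**61 - 1 and processing the mantissa 28 bits at a time. The names `HASH_BITS`, `HASH_MODULUS` and `hash_normalize_sketch` are hypothetical, the NaN and infinity values are assumptions (314159 follows the constant visible in the diff), and this is not the code from this PR.

```crystal
HASH_BITS    = 61
HASH_MODULUS = (1_i64 << HASH_BITS) - 1 # Mersenne prime 2**61 - 1

# Integers: remainder keeps the sign of the value, giving -(|v| mod P) for negatives.
def hash_normalize_sketch(value : Int32 | Int64) : Int64
  value.to_i64.remainder(HASH_MODULUS)
end

# Floats: reduce mantissa * 2**exponent modulo P without leaving Int64.
def hash_normalize_sketch(value : Float64) : Int64
  return 0_i64 if value.nan? # assumption: NaN maps to 0
  return (value > 0 ? 314159_i64 : -314159_i64) if value.infinite?

  mantissa, exponent = Math.frexp(value)
  negative = mantissa < 0
  mantissa = -mantissa if negative

  x = 0_i64
  while mantissa != 0.0
    # Rotate x left by 28 bits within 61 bits (multiplication by 2**28 mod P).
    x = ((x << 28) & HASH_MODULUS) | (x >> (HASH_BITS - 28))
    mantissa *= 268435456.0 # 2**28
    exponent -= 28
    chunk = mantissa.to_i64 # next 28 bits of the mantissa
    mantissa -= chunk
    x += chunk
    x -= HASH_MODULUS if x >= HASH_MODULUS
  end

  # Multiplying by 2**exponent modulo P is a rotation by (exponent mod 61) bits.
  shift = exponent % HASH_BITS
  x = ((x << shift) & HASH_MODULUS) | (x >> (HASH_BITS - shift))
  negative ? -x : x
end

puts hash_normalize_sketch(1) == hash_normalize_sketch(1.0)
```

With both overloads reducing to the same residue, an integer and the float equal to it feed the same number into the hasher, which is what makes `a == b => a.hash == b.hash` hold across numeric types.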

Fixes #4578
Prerequisite for #4557
Replaces #4581

funny-falcon · 2017-07-05T08:15:33Z

Fresh PR in place of #4621

akzhan · 2017-07-05T08:20:04Z

src/big/big_float.cr

@@ -17,6 +17,18 @@ struct BigFloat < Float
    LibGMP.mpf_init_set_str(out @mpf, str, 10)
  end

+  def initialize(num : BigInt)


This pull request is much broader than simply changing the hashing calculation. But these updates are really important.

The latest builds of #4653 show unstable behavior of constructors (the compiler can prefer another one under some conditions).

Please remove them and use to_big_X.

straight-shoota · 2017-07-05T08:39:21Z

src/float.cr

+  def hash_normalize
+    float_normalize_wrap do
+      {% if flag?(:x86) || flag?(:x86_64) || flag(:arm) || flag(:aarch64) %}
+	# it should work on every architecture where endianess of Float32 and Int32


This needs formatting.

Can you hint which way?
I relied on crystal tool format, but it doesn't help much.

Just apply proper indentation on this and the following lines. This line should be indented by 8 spaces for example.

@funny-falcon indent by two spaces.

  def hash_normalize
    float_normalize_wrap do
      {% if flag?(:x86) || flag?(:x86_64) || flag(:arm) || flag(:aarch64) %}
        # it should work on every architecture where endianess of Float32 and Int32

no tabs please.

I see: I unintentionally used tabs instead of spaces.
Fixed that, and reduced the reliance on macros.

konovod · 2017-07-05T08:45:41Z

src/float.cr

+      {% if flag?(:x86) || flag?(:x86_64) || flag(:arm) || flag(:aarch64) %}
+	# it should work on every architecture where endianess of Float32 and Int32
+	# matches and float is IEEE754.
+	unsafe_int = unsafe_as(Int32)


AFAIK `crystal tool format` doesn't touch macro ({% %}) lines, so they have to be formatted manually.

I removed the usage of macros here in the following commits.

straight-shoota · 2017-07-05T10:57:10Z

src/float.cr

-      {% else %}
-	float_normalize_reference
-      {% end %}
+      if FLOAT64_IS_IEEE754


Why is this no longer a macro as in the first commit? Can this constant change at runtime?

A constant can't change at runtime so that's not possible

Sry, I meant if it can differ based on the system where the code is executed or is it fixed for every target architecture? If fixed, the appropriate branch should be chosen according to target arch at compile time.

Unfortunately, you are right: it is tested at runtime even in a release build.
Let's think a bit more.

It looks like the Crystal compiler does very strange things with constants:
even if it knows a constant's value at compile time and can use it in macros, it still inserts checks for the constant's initializer in the final code, and doesn't propagate it as a real constant to the LLVM optimizer.

I added private def hash_bits and private def hash_modulus to work around this issue.
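The workaround can be sketched roughly as follows. The module name `HashParamsSketch` and the constant `HASH_MODULUS_CONST` are hypothetical, and the 61-bit width is taken from the Mersenne prime mentioned later in this thread; whether the compiler actually emits an initialization check for the constant is as reported in the comment above, not something this snippet demonstrates.

```crystal
module HashParamsSketch
  # Constant with a computed initializer: per the comment above, access to
  # such a constant may go through an initialization check.
  HASH_MODULUS_CONST = (1_u64 << 61) - 1

  # Plain methods returning literals: nothing to initialize, so the
  # optimizer sees ordinary inlinable code.
  def self.hash_bits : Int32
    61
  end

  def self.hash_modulus : UInt64
    (1_u64 << hash_bits) - 1
  end
end

puts HashParamsSketch.hash_modulus
```

Both forms yield the same value; the difference discussed here is only in how the access is compiled.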

/cc @asterite

akzhan · 2017-07-05T13:10:42Z

src/number/hash_normalize.cr

@@ -0,0 +1,115 @@
+module Number::HashNormalize
+  # Idea by Akzhan Abdulin @akzhan


Please drop this line, thanks.

akzhan · 2017-07-05T13:11:20Z

src/number/hash_normalize.cr

+  # exponentiation algorithm.  For reduce(2**e) it's even better: since
+  # P is of the form 2**n-1, reduce(2**e) is 2**(e mod n), and multiplication
+  # by 2**(e mod n) modulo 2**n-1 just amounts to a rotation of bits.
+  #


And this empty comment too.

sdogruyol · 2017-07-05T13:13:37Z

This is great 😍

funny-falcon · 2017-07-05T14:03:29Z

One question: does anyone use Crystal on architectures with restricted unaligned reads?
x86 and x86_64 certainly allow unaligned access.
arm and aarch64 allow unaligned reads under certain conditions, but I'm not sure whether a program compiled with Crystal meets those conditions.

The problematic line is https://github.com/crystal-lang/crystal/pull/4675/files?diff=unified#diff-b88fe795012d5923ca24e66769201422R158 - permuting a slice of bytes.

Can anyone with access to ARM with Crystal run the following, please:

   str = "1234567"
   buf = str.to_slice
   ptr = buf.to_unsafe
   4.times do |i|
     p (ptr+i).as(Pointer(UInt32)).value.to_s(16)
   end
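For comparison, here is a sketch of an alignment-independent way to read the same 32-bit words, assembling each value from single bytes in little-endian order. `read_u32_le` is a hypothetical helper and not part of this PR; it trades the single wide load for four byte loads.

```crystal
# Reads a little-endian UInt32 starting at `offset`, one byte at a time,
# so no unaligned 32-bit load is ever issued.
def read_u32_le(bytes : Bytes, offset : Int32) : UInt32
  b0 = bytes[offset].to_u32
  b1 = bytes[offset + 1].to_u32
  b2 = bytes[offset + 2].to_u32
  b3 = bytes[offset + 3].to_u32
  b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)
end

buf = "1234567".to_slice
4.times do |i|
  puts read_u32_le(buf, i).to_s(16)
end
```

On a little-endian machine that tolerates unaligned reads, the pointer-cast version above should print the same four values.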

RX14 · 2017-07-05T14:19:31Z

@funny-falcon We probably do unaligned reads all over the stdlib. As long as there's a spec which will break and alert us to the problem on any new architectures with this limitation it's fine to do unaligned reads for now.

ysbaddaden · 2017-07-05T15:03:41Z

Antirez has a nice writeup about ARM and unaligned accesses:
http://antirez.com/news/111

ARMv5 doesn't allow unaligned accesses. ARMv6 allows word-sized unaligned accesses but fails on multiple words. The Linux kernel will rescue and fix unaligned accesses, so unaligned accesses do work on Linux on ARM, but with slow performance. As stated by @RX14 we probably have unaligned accesses in the core/stdlib, though that doesn't mean we shouldn't care.

I tried a qemu VM, but the ARM emulator allows unaligned accesses, so I can't check what happens in /proc/cpu/alignment.

ysbaddaden · 2017-07-05T15:09:55Z

Note that LLVM can be smart and fix unaligned accesses at compile time when it's obvious.

funny-falcon · 2017-07-05T17:55:15Z

I've tested an equivalent C program on a Raspberry Pi 2, and it handles unaligned reads well, so I think there is no issue there.

akzhan · 2017-07-06T08:49:44Z

src/stdhasher.cr

+#   end
+
+# StdHasher used as standard hasher in `Object#hash`
+# It have to provide defenense against HashDos, and be reasonably fast.


defenense > defense

akzhan · 2017-07-06T09:03:08Z

src/number/hash_normalize.cr

+    int_to_hashnorm 314159
+  end
+
+  # This function is for reference implementation.


You should note that this method is used under some conditions.

I improved the comment.
I mentioned that currently the hash for BigFloat is not precise for numbers with a big fractional part (i.e. it doesn't distinguish '1.0000000000000001' and '1.0000000000000002').
This happens because BigFloat#hash_normalize falls back to to_f.hash_normalize for such numbers.
It is possible to improve this by using float_normalize_reference and overloading Math.frexp (or by introducing BigFloat#frexp and using it).

Should it be done? (i.e., is hashing of such numbers a common task?)
If yes, should it be done in this PR?

Feel free to merge #4653 :-)

But no, I think that hashing of Big integers/floats is a very uncommon task.

Those implementations don't help, because they still return {Float64, Int}, so they are strictly equal to Math.frexp v.to_f

Implementation suitable for exact hashing should look like:

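   module Math
     def frexp(value : BigFloat)
       # Only the exponent is needed here; the Float64 fraction is discarded.
       LibGMP.mpf_get_d_2exp(out exp, value)
       # Scale `value` (not `self`: a method in `module Math` has no BigFloat receiver).
       frac = BigFloat.new { |mpf|
         if exp >= 0
           LibGMP.mpf_div_2exp(mpf, value, exp)
         else
           LibGMP.mpf_mul_2exp(mpf, value, -exp)
         end
       }
       {frac, exp}
     end
   end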

But no, I think that hashing of Big integers/floats is a very uncommon task.

It is good.
Note that BigInt is still hashed exactly. Also, a BigFloat with a big integer part and a small fractional part is hashed exactly.
Stop... I think my last sentence is not true. The 61-bit Mersenne prime is larger than the Float64 precision (53 bits), so precision will be lost in the conversion to_f.hash_normalize :-(
I should think about it carefully.

It looks like exact hashing for BigFloat is the only reliable way.
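The precision problem can be seen with plain integers. This is only an illustrative sketch (`survives_float64?` is a hypothetical helper): any integer above 2**53 that needs more than 53 significant bits changes when it passes through Float64, and the residues modulo the 61-bit prime can be that large.

```crystal
# True when the integer comes back unchanged after a Float64 round trip.
def survives_float64?(n : Int64) : Bool
  n.to_f64.to_i64 == n
end

puts survives_float64?(1_i64 << 53)       # 2**53 is exactly representable
puts survives_float64?((1_i64 << 53) + 1) # needs 54 bits of mantissa
```

Two distinct values that collapse to the same Float64 would then be normalized to the same number, even though they are not equal.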

akzhan

Code is readable, effective and solves the task of protecting against Hash DoS attacks.

Sija · 2017-07-06T11:00:50Z

src/big/big_rational.cr

@@ -41,6 +41,22 @@ struct BigRational < Number
    initialize(num, 1)
  end

+  # Creates a exact representation of float as rational.
+  #
+  # It sures that `BigRational.new(f) == f`


sures -> ensures

Sija · 2017-07-06T11:01:08Z

src/big/big_int.cr

-  def hash
-    to_u64
+  def hash_normalize
+    # remainder(hash_modulus)


debug leftover?

No, implementation description.

Sija · 2017-07-06T11:01:16Z

src/big/big_rational.cr

-  def hash
-    to_f64.hash
+  def hash_normalize
+    # self.remainder(hash_modulus).to_f.hash_normalize


It is implementation description

Sija · 2017-07-06T11:02:33Z

src/hash.cr

@@ -710,14 +710,19 @@ class Hash(K, V)
  #
  # ```
  # foo = {"foo" => "bar"}
-  # foo.hash # => 3247054
+  # foo.hash # => 3247054 (not exactly)


not exactly -> approximation

It is an "example". In fact, it will be different on every process run.

Sija · 2017-07-06T11:03:43Z

src/stdhasher.cr

+require "crystal/system/random"
+
+# Hasher usable for `def hash(hasher)` should satisfy protocol:
+#   class MyHasher


Please, use triple backticks (```) for code blocks

Sija · 2017-07-06T11:10:32Z

src/stdhasher.cr

+    high = (v >> 32).to_u32
+    # This condition here cause of some 32bit issue in LLVM binding,
+    # so compiler_spec doesn't pass without it.
+    # Fill free to comment and debug.


fill -> feel

Sija · 2017-07-06T11:11:47Z

src/stdhasher.cr

+# It have to provide defense against HashDos, and be reasonably fast.
+# To protect against HashDos, it is seeded with secure random, and have
+# permutation that hard to forge without knowing seed and seeing hash digest.
+struct StdHasher


Why not just Hasher?

Someone suggested it, I don't remember exactly who. Why not StdHasher?
There obviously could be different implementations that are still usable with the suggested protocol, so this one is just the 'standard hasher'.
But if you really think it should be Hasher, I'll rename it.

I get your drift, yet this Std abbreviation looks odd IMO; it would set a precedent in the stdlib. Maybe someone has a better name?

Hash::Hasher?

Hasher looks like too generic a name.

I renamed it to Hash::Hasher.

akzhan · 2017-07-06T12:35:12Z

Btw, you can run bin/crystal docs on your files to preview the generated documentation. Or make doc, AFAIR.

funny-falcon · 2017-07-06T12:48:34Z

@akzhan

$ bin/crystal doc
Using compiled compiler at `.build/crystal'
Error in line 1: while requiring "./src/**"

in src/crystal/system/unix/arc4random.cr:1: expanding macro

{% skip_file unless flag?(:openbsd) %}
^

in src/crystal/system/unix/arc4random.cr:1: undefined macro variable 'skip_file'

{% skip_file unless flag?(:openbsd) %}
   ^~~~~~~~~

Sija · 2017-07-06T13:00:29Z

src/hash.cr

-  # foo.hash # => 3247054 (not exactly)
-  # ```
+  # Protocol method for generic hashing.
+  # Note: it should be independent of iteration order.


Note -> NOTE

RX14 · 2017-07-06T13:45:53Z

@funny-falcon That means your compiler version is too old.

funny-falcon · 2017-07-06T13:49:40Z

@RX14 how could bin/crystal be too old? I thought it just runs the compiled binary.

Prepare hash infrastructure for a future change of the hashing algorithm to protect against Hash DoS.

Class|Struct should define method `def hash(hasher)` and call `hasher << @ivar` inside.

As an option, for speed, and for backward compatibility, `def hash` still could be implemented. It will be used for Hash of matched type. `Thread#hash` and `Signal#hash` are implemented as unseeded because they are used before `StdHasher @@seed` is initialized.

Hash::Hasher is the default hasher that uses `hash(hasher)` and it is used as the default seeded hasher.

Also, number normalization for hashing is introduced, i.e. the rule 'equality forces hash equality' is enforced (`a == b` => `a.hash == b.hash`). The normalization idea is borrowed from the Python implementation. It fixes several issues with BigInt and BigFloat on 32-bit platforms, but not all issues.

Fixes crystal-lang#4578
Fixes crystal-lang#3932
Prerequisite for crystal-lang#4557
Replaces crystal-lang#4581
Correlates with crystal-lang#4653

cause StringPool is used in json decoding, it is important to have it safe.

funny-falcon · 2017-09-10T11:00:51Z

I've rebased. Hope this saga will end :-)

RX14 · 2017-09-10T11:21:20Z

src/hash/hasher.cr

+    # Type for Hash::Hasher#digest
+    alias Value = UInt32
+
+    @@seed = uninitialized StaticArray(UInt32, 1)


Why not just make this a random U32? If we use a constant with initializer (SEED = Random::System.rand(UInt32::MIN..UInt32::MAX)), shouldn't the constant be initialized on-demand, avoiding the need for special "unseeded" methods below?

RX14 · 2017-09-10T11:21:55Z

src/hash/hasher.cr

+    buf = pointerof(@@seed).as(Pointer(UInt8))
+    Crystal::System::Random.random_bytes(buf.to_slice(sizeof(typeof(@@seed))))
+
+    protected getter a : UInt32 = 0_u32


Would be nice if this was renamed to something a bit more descriptive.

RX14 · 2017-09-10T11:22:21Z

src/hash/hasher.cr

+    end
+
+    # Calculate hashsum for value
+    def self.hashit(value) : Value


I think i'd just name this hash.

RX14 · 2017-09-10T11:23:08Z

src/hash/hasher.cr

+    # Mix nil to state
+    def <<(v : Nil) : Nil
+      permute_nil()
+      nil


You can leave these trailing nils off if we have the : Nil restriction, iirc.

RX14 · 2017-09-10T11:27:20Z

src/hash/hasher.cr

+    end
+
+    # Mix nil to state
+    def <<(v : Nil) : Nil


The only function of these methods seems to be calling permute. Why not move the permute body into this function?

RX14 · 2017-09-10T11:30:27Z

src/hash/hasher.cr

+    protected def permute_nil
+      # LFSR
+      mx = (@a.to_i32 >> 31).to_u32 & 0xa8888eef_u32
+      @a = (@a << 1) ^ mx


Isn't the current implementation of this essentially @a *= 31 not a LFSR?
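For reference, the shift-and-conditional-XOR step from the diff can be written as a standalone function. This is a sketch assuming the same mask `0xa8888eef` as in the hunk above; `lfsr_step` is a hypothetical name, and the explicit top-bit test replaces the sign-extension trick (`to_i32 >> 31`) used in the diff.

```crystal
# One step: shift the state left by one bit and, if the bit shifted out
# was set, XOR the mask into the result.
def lfsr_step(state : UInt32) : UInt32
  mask = (state >> 31) == 1 ? 0xa8888eef_u32 : 0_u32
  (state << 1) ^ mask
end

puts lfsr_step(0x80000000_u32).to_s(16)
```

Unlike a multiplication by 31, the step only ever shifts by one bit, and the mask is applied only when the top bit of the previous state was set.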

RX14 · 2017-09-10T11:37:26Z

src/http/headers.cr

-        c = normalize_byte(c)
-        h = 31 * h + c
+    def hash(hasher)
+      hasher.raw(bytesize.to_u32)


Why raw here? I think that the correct way to use hasher should always be << without anyone having to think about raw except people implementing hash(hasher) on numbers. Shouldn't raw be :nodoc:?

RX14 · 2017-09-10T11:41:40Z

src/string_pool.cr

-      if entry
-        return entry
+    mask = (@capacity - 1).to_u32
+    index, d = hash & mask, 1


please split this into two assignments for readability.

RX14

oops

funny-falcon · 2017-09-10T12:49:44Z

Chris, most of your last remarks are meaningless, given that this is just a template for the stronger algorithm you will have to implement soon after merging this.

Renaming hashit to hash is quite ridiculous: having the same name+arity combination for different semantic actions is error-prone.

I will not waste my time on this PR anymore.

funny-falcon · 2017-09-10T13:07:08Z

And why do you think I use class variable instead of constant? To avoid "initialization on demand" inlined into every call to hash.

RX14 · 2017-09-10T13:46:49Z

@funny-falcon seed is only used when initializing the hasher, so it's relatively off the hot path compared to say the inner loop of a hash function. Furthermore it's random so it doesn't help at all with further optimizations.

Where is there another self.hash method on Hasher?

I think that I've made some relatively minor requests for a cleanup which are relevant both to this algorithm and to the implementation of any future algorithm. I see no reason why siphash-1-2 can't be implemented in << instead of permute.

funny-falcon · 2017-09-10T14:31:56Z

What if a hot loop calls val.hash on an integer, and val.hash is not inlined, so the SEED initialization cannot be moved out of the loop? What about code bloat, because the SEED initialization is inlined in many places?

Isn't Hasher also an Object that already has Object#hash(h)?

Will you inline the siphash implementation into every <<?

raw is needed because there is a difference between hashing an int as a number that needs normalization, and as an opaque part of an opaque structure, where normalization is certainly not needed.

RX14 · 2017-09-10T15:56:54Z

@funny-falcon

I think it's likely to be a very minor performance hit, more than worth it for the advantage of removing the complications of having to deal with initialization order and the unseeded method. @asterite would likely have more knowledge of the performance of class vars vs constants here.

We're talking about a class method, self.hash, which is entirely separate from #hash on the instance.

I'm not an expert in siphash, but wouldn't you implement siphash in terms of one input length, with the other << overloads calling the << overload which implements siphash?

Thanks, I hadn't thought of that. It appears the only difference (currently) between raw and << is for 64-bit signed and unsigned integers; would that change, for example, with siphash? Why can't we just always mix in the higher then lower bits and avoid this?

funny-falcon · 2017-09-10T16:14:11Z

Your arguments are beyond my sense of sanity.
Perhaps you are right, but I can't get it.
I will not change this PR any more.

asterite · 2017-09-10T16:18:20Z

Class vars and non-simple constants are always lazily initialized because you never know when one constant will need the value of another constant. It's a way to avoid depending on the order of initialization.

Probably seed can be moved to a constant (that will always be lazily initialized, and checked on every access, but in any case I don't think this should affect performance much)

funny-falcon · 2017-09-10T17:02:30Z

Friends, I don't want to discuss it any more. You may change anything after merging. Instead, you are pushing me on these small details. I think the code will be worse if I apply your last suggestions. Therefore I will not apply them.

I'm too tired of this PR. If you are not going to merge it in its current shape, I will close this PR and remove the code the day after tomorrow.

asterite · 2017-09-10T18:44:44Z

@funny-falcon You are right. Open source can be a PITA, I feel the same every day.

I will take care of this issue, don't worry. Thank you for this contribution! I will probably copy some pieces of code, if that's OK with you.

funny-falcon · 2017-09-10T19:02:36Z

@asterite , thank you.
Yes, you (and anybody else) can copy code from this PR as much as you want.

funny-falcon · 2017-09-10T19:13:54Z

s/can/may/

faustinoaq · 2017-09-10T20:40:24Z

@funny-falcon Thanks for your time invested here, Sometimes Open Source software is very hard to manage.

I need to quote this:

Ary: ... The community part is a whole world of its own, tackling issues and PRs, and I'd say it's even more challenging (but also rewarding!) than developing the language itself :-)

faustinoaq · 2017-09-10T22:26:44Z

I'm too tired of this PR

I understand you. This PR is the most commented one so far 😅

I think 216 comments in this PR is a record for this project.

Stay tuned to #4946

* Introduces real Number normalization for Crystal::Hasher. As declared by Crystal language reference, 1i32.hash should equal to 1f64.hash. Extracted from crystal-lang#4675, also replaces crystal-lang#4581.
* hash specializations for BigInt, BigFloat, BigRational.

akzhan · 2017-11-24T17:54:57Z

By now, all the great parts of #4675 have been merged into the master branch in some way.

Thanks again to all of us.

funny-falcon mentioned this pull request Jul 5, 2017: change computation of hash value. #4621 (Closed)

akzhan reviewed Jul 5, 2017

straight-shoota reviewed Jul 5, 2017

akzhan mentioned this pull request Jul 5, 2017: Float.hash is based on the reduction of float modulo the prime #4581 (Closed)

konovod reviewed Jul 5, 2017

straight-shoota reviewed Jul 5, 2017

akzhan reviewed Jul 5, 2017

akzhan reviewed Jul 6, 2017

akzhan approved these changes Jul 6, 2017

Sija reviewed Jul 6, 2017

funny-falcon added 2 commits September 10, 2017 13:53: use Hash::Hasher and openaddressing in StringPool (93c97f1)

funny-falcon force-pushed the hasher1 branch from b916ac1 to 93c97f1 September 10, 2017 10:59

RX14 approved these changes Sep 10, 2017

RX14 requested changes Sep 10, 2017

asterite closed this Sep 10, 2017

asterite mentioned this pull request Sep 10, 2017: Introduce Crystal::Hasher for computing a hash value #4946 (Merged)

akzhan mentioned this pull request Sep 18, 2017: use Crystal::Hasher and openaddressing in StringPool #5000 (Merged)

akzhan mentioned this pull request Oct 8, 2017: Implement BigDecimal #4876 (Merged)

akzhan mentioned this pull request Nov 11, 2017: Number normalization for Crystal::Hasher #5276 (Merged)

akzhan mentioned this pull request Nov 24, 2017: Change Hash implementation #5256 (Closed)
