Use minimal perfect hashing for lookups #37

Merged (8 commits) on Apr 16, 2019

Conversation

@raphlinus (Contributor)

This patch moves many lookups from large match statements to a custom approach based on minimal perfect hashing.

This should improve #29 considerably: cargo build --release goes from 6.28s to 2.11s on my machine. Code size also shrinks considerably (1,432,576 to 858,112 bytes for the benchmark executable). Runtime speed is basically unchanged.

This also moves the generation script to Python 3. Note that the Unicode version is still 9.0.
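
For readers unfamiliar with the technique, here is a rough sketch of the two-level lookup such a scheme compiles down to. The names, table layout, hash constants, and the simplified mph_lookup signature below are illustrative, not the exact code generated by this patch:

// Sketch of a two-level minimal-perfect-hash lookup: a first hash with
// salt 0 selects a per-bucket salt, a second hash with that salt selects
// the final slot, and a single key comparison confirms the hit.
fn mph_hash(key: u32, salt: u32, n: usize) -> usize {
    // Mix the key with the salt, then map the 32-bit result into 0..n.
    let y = key.wrapping_add(salt).wrapping_mul(0x9E37_79B9);
    let y = y ^ key.wrapping_mul(0x3141_5926);
    ((y as u64 * n as u64) >> 32) as usize
}

fn mph_lookup(key: u32, salts: &[u16], kv: &[(u32, u8)], default: u8) -> u8 {
    let salt = salts[mph_hash(key, 0, salts.len())] as u32;
    let (k, v) = kv[mph_hash(key, salt, kv.len())];
    if k == key { v } else { default }
}

The point is that every lookup costs two hashes, two table reads, and one comparison regardless of table size, which is what replaces the large match statements.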

@raphlinus (Contributor, Author)

More detail on the benchmarks. Here's the before:

test bench_is_nfc_ascii                      ... bench:          23 ns/iter (+/- 5)
test bench_is_nfc_normalized                 ... bench:          36 ns/iter (+/- 3)
test bench_is_nfc_not_normalized             ... bench:         452 ns/iter (+/- 163)
test bench_is_nfc_stream_safe_ascii          ... bench:          22 ns/iter (+/- 2)
test bench_is_nfc_stream_safe_normalized     ... bench:          46 ns/iter (+/- 6)
test bench_is_nfc_stream_safe_not_normalized ... bench:         528 ns/iter (+/- 225)
test bench_is_nfd_ascii                      ... bench:          21 ns/iter (+/- 4)
test bench_is_nfd_normalized                 ... bench:          45 ns/iter (+/- 3)
test bench_is_nfd_not_normalized             ... bench:          16 ns/iter (+/- 3)
test bench_is_nfd_stream_safe_ascii          ... bench:          24 ns/iter (+/- 4)
test bench_is_nfd_stream_safe_normalized     ... bench:          55 ns/iter (+/- 5)
test bench_is_nfd_stream_safe_not_normalized ... bench:          17 ns/iter (+/- 3)
test bench_nfc_ascii                         ... bench:         661 ns/iter (+/- 113)
test bench_nfc_long                          ... bench:     234,811 ns/iter (+/- 44,577)
test bench_nfd_ascii                         ... bench:         308 ns/iter (+/- 51)
test bench_nfd_long                          ... bench:     127,452 ns/iter (+/- 11,391)
test bench_nfkc_ascii                        ... bench:         599 ns/iter (+/- 49)
test bench_nfkc_long                         ... bench:     236,973 ns/iter (+/- 19,020)
test bench_nfkd_ascii                        ... bench:         316 ns/iter (+/- 21)
test bench_nfkd_long                         ... bench:     141,850 ns/iter (+/- 22,229)
test bench_streamsafe_adversarial            ... bench:         507 ns/iter (+/- 26)
test bench_streamsafe_ascii                  ... bench:          75 ns/iter (+/- 5)

And here's the after:

test bench_is_nfc_ascii                      ... bench:          22 ns/iter (+/- 1)
test bench_is_nfc_normalized                 ... bench:          35 ns/iter (+/- 4)
test bench_is_nfc_not_normalized             ... bench:         419 ns/iter (+/- 119)
test bench_is_nfc_stream_safe_ascii          ... bench:          26 ns/iter (+/- 7)
test bench_is_nfc_stream_safe_normalized     ... bench:          45 ns/iter (+/- 8)
test bench_is_nfc_stream_safe_not_normalized ... bench:         447 ns/iter (+/- 49)
test bench_is_nfd_ascii                      ... bench:          22 ns/iter (+/- 1)
test bench_is_nfd_normalized                 ... bench:          46 ns/iter (+/- 6)
test bench_is_nfd_not_normalized             ... bench:          16 ns/iter (+/- 5)
test bench_is_nfd_stream_safe_ascii          ... bench:          22 ns/iter (+/- 2)
test bench_is_nfd_stream_safe_normalized     ... bench:          61 ns/iter (+/- 8)
test bench_is_nfd_stream_safe_not_normalized ... bench:          16 ns/iter (+/- 4)
test bench_nfc_ascii                         ... bench:         620 ns/iter (+/- 376)
test bench_nfc_long                          ... bench:     195,177 ns/iter (+/- 21,275)
test bench_nfd_ascii                         ... bench:         392 ns/iter (+/- 42)
test bench_nfd_long                          ... bench:     146,535 ns/iter (+/- 9,473)
test bench_nfkc_ascii                        ... bench:         550 ns/iter (+/- 41)
test bench_nfkc_long                         ... bench:     212,233 ns/iter (+/- 16,049)
test bench_nfkd_ascii                        ... bench:         384 ns/iter (+/- 27)
test bench_nfkd_long                         ... bench:     155,408 ns/iter (+/- 12,506)
test bench_streamsafe_adversarial            ... bench:         458 ns/iter (+/- 24)
test bench_streamsafe_ascii                  ... bench:          77 ns/iter (+/- 6)

More commentary: I also tested the singleton-bucket "optimization" described in Steve Hanov's blog post on minimal perfect hashing, and it was about 50% slower on the long tests. It saves rehashing work, but the cost of the extra branching outweighs that. Leaving it out makes table generation a bit slower and also less robust (it would not be too difficult to construct an adversarial example that overflows the salt).
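
For context, the singleton-bucket variant stores, for any bucket containing a single key, the destination slot directly in the intermediate table (encoded as a negative value) instead of a salt, so the lookup needs an extra sign test. A rough sketch of that lookup path, not the code that was actually benchmarked:

fn hash(key: u32, salt: u32, n: usize) -> usize {
    // Same style of mixing hash as elsewhere; constants are illustrative.
    let y = key.wrapping_add(salt).wrapping_mul(0x9E37_79B9) ^ key.wrapping_mul(0x3141_5926);
    ((y as u64 * n as u64) >> 32) as usize
}

fn lookup_with_singletons(key: u32, inter: &[i32], kv: &[(u32, u8)], default: u8) -> u8 {
    let d = inter[hash(key, 0, inter.len())];
    let slot = if d < 0 {
        (-d - 1) as usize             // singleton bucket: slot stored directly
    } else {
        hash(key, d as u32, kv.len()) // regular bucket: rehash with the salt
    };
    let (k, v) = kv[slot];
    if k == key { v } else { default }
}

The if d < 0 test is the extra branch referred to above.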

I wouldn't be surprised if there were a better hash function. Using a single multiplication doesn't work; there are too many collisions. I also tried a variant of the Jenkins one-at-a-time hash function, and it was slower. Several other proposals were mentioned in a Twitter thread, but I don't think anything will be faster.
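
For reference, the classic Jenkins one-at-a-time hash, which the variant mentioned above would be based on, looks roughly like this in Rust; the per-byte mixing plus the final avalanche is noticeably more work per lookup than a single multiply-and-xor, which is consistent with it benchmarking slower here:

fn one_at_a_time(key: u32) -> u32 {
    // Hash the four bytes of the key, one at a time.
    let mut h: u32 = 0;
    let bytes = key.to_le_bytes();
    for &b in bytes.iter() {
        h = h.wrapping_add(b as u32);
        h = h.wrapping_add(h << 10);
        h ^= h >> 6;
    }
    // Final avalanche.
    h = h.wrapping_add(h << 3);
    h ^= h >> 11;
    h.wrapping_add(h << 15)
}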

@trishume

Another approach that might lead to even better compile times is to output the tables in a simple packed binary format, include them with the include_bytes! macro, and then index into the byte arrays to extract what you need. That would avoid generating a 0.5-megabyte Rust file. I'm not sure how much compile time it would save for the effort it would take, though.
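
A rough sketch of what that could look like; the file name and the five-byte record layout here are made up for illustration:

// Hypothetical packed layout: little-endian records of (key: u32, value: u8),
// five bytes each, written by the generation script and pulled in at compile time.
static PACKED_KV: &[u8] = include_bytes!("tables.bin");

fn packed_entry(i: usize) -> (u32, u8) {
    let off = i * 5;
    let key = u32::from_le_bytes([
        PACKED_KV[off],
        PACKED_KV[off + 1],
        PACKED_KV[off + 2],
        PACKED_KV[off + 3],
    ]);
    (key, PACKED_KV[off + 4])
}

include_bytes! embeds the file into the binary at compile time, so the savings would come from rustc not having to parse and type-check a huge generated source file.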

@raphlinus (Contributor, Author)

@trishume That's well worth considering. One factor against it is that this crate has strictly no unsafe code, so deserialization from the packed format would at least need checks for the conversion into char. But it's probably a good idea to investigate.
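
The check mentioned here would amount to something like the following sketch (names illustrative): char::from_u32 rejects surrogates and out-of-range values, so no unsafe transmute is needed, but the branch is paid on every decode:

fn decode_char(raw: u32) -> char {
    // Checked conversion; returns None for surrogates and values above 0x10FFFF.
    std::char::from_u32(raw).expect("packed table contains an invalid code point")
}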

@Manishearth (Member) left a comment:

I'd slightly prefer it if the generated code and the generated tables lived separately: have the script generate the DECOMPOSITION_KEYS and DECOMPOSITION_SALTS tables, and keep the actual mph_lookup calls outside of tables.rs, so that tables.rs is just tables and no actual code.
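
Concretely, the requested split might look something like this; the table names are taken from the comment above, and the element types and contents are placeholders:

// tables.rs: generated data only, no functions.
pub const DECOMPOSITION_SALTS: &[u16] = &[/* generated by the script */];
pub const DECOMPOSITION_KEYS: &[(u32, u32)] = &[/* generated by the script */];

// A hand-written module elsewhere owns the logic and makes the calls, e.g.
// mph_lookup(c as u32, DECOMPOSITION_SALTS, DECOMPOSITION_KEYS, ...).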

return (y * n) >> 32

# Compute minimal perfect hash function, d can be either a dict or list of keys.
def minimal_perfect_hash(d, singleton_buckets = False):
@Manishearth (Member):

I'd prefer if this function had more comments

@@ -432,13 +436,61 @@ def gen_tests(tests, out):

out.write("];\n")

def my_hash(x, salt, n):
@Manishearth (Member):

probably should have a comment saying "guaranteed to be less than n"
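
The guarantee follows from the return (y * n) >> 32 line above: assuming y has been reduced to a 32-bit value by the preceding steps, y * n < 2^32 * n, and therefore (y * n) >> 32 < n. A tiny worst-case check:

fn main() {
    let y: u64 = 0xffff_ffff; // largest possible 32-bit intermediate value
    let n: u64 = 1_000;       // table size
    assert_eq!((y * n) >> 32, n - 1); // even the maximum y maps into 0..n
}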

for (bucket_size, h) in bsorted:
if bucket_size == 0:
break
elif singleton_buckets and bucket_size == 1:
@Manishearth (Member):

Do we use the singleton_buckets case at all?

@raphlinus (Contributor, Author):

No, I can remove it, especially as it seems to perform worse in benchmarks. The main reason I left it in is that it's more robust; without it there's a much greater probability that the hashing will fail.

else:
for salt in range(1, 32768):
rehashes = [my_hash(key, salt, n) for key in buckets[h]]
if all(not claimed[hash] for hash in rehashes):
@Manishearth (Member):

Is there a guarantee that we won't have a collision amongst the rehashes, or is it just really unlikely? (I suspect it's the latter, but I want to confirm.)

@raphlinus (Contributor, Author):

Yes, if a suitable salt is found, that comes with a guarantee that the rehash won't collide (this is what the claimed bool array keeps track of). On the other hand, it's possible that no satisfying salt can be found, but I believe that to be quite unlikely. There are things that can be done to make it more robust; I'll try to add a comment outlining them in case somebody does run into this with a data update.

@Manishearth (Member):

Oh, wait, the set check deals with this; I'd forgotten it was there 😄. To be clear, I was specifically worried about cases where a single run of rehashes collides with itself, which claimed won't catch, since we only update it later.

(worth leaving a comment saying that)
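
In other words, the acceptance test for a salt has two parts: none of the bucket's rehashed slots may already be claimed by an earlier bucket, and the rehashes within the bucket must be pairwise distinct. An equivalent check written out in Rust (the generator itself is Python):

use std::collections::HashSet;

fn salt_is_acceptable(rehashes: &[usize], claimed: &[bool]) -> bool {
    let mut seen = HashSet::new();
    // Every slot must be free, and `insert` returning false flags a
    // within-bucket collision that the claimed array alone would miss.
    rehashes.iter().all(|&slot| !claimed[slot] && seen.insert(slot))
}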

out.write("pub fn composition_table(c1: char, c2: char) -> Option<char> {\n")
out.write(" match (c1, c2) {\n")
out.write(" if c1 < '\\u{10000}' && c2 < '\\u{10000}' {\n")
out.write(" mph_lookup((c1 as u32) << 16 | (c2 as u32), &[\n")
@Manishearth (Member):

Could the code that outputs the mph_lookup calls be factored out into a function?

The code has been moved out of the tables module into perfect_hash, and there is a bit more explanation in comments.
@raphlinus (Contributor, Author)

@Manishearth does this address your concerns? It's a bit denser (less cut-and-paste of generated code), but hopefully the organization is reasonably clear and the comments help.

@Manishearth (Member) left a comment:

LGTM, minor issue

/// Look up the canonical combining class for a codepoint.
///
/// The value returned is as defined in the Unicode Character Database.
pub fn canonical_combining_class(c: char) -> u8 {
@Manishearth (Member):

These functions should live elsewhere

@raphlinus (Contributor, Author):

Their own module? That's what I did in 40f9ba6.

@Manishearth (Member)

Thanks!
