Use minimal perfect hashing for lookups #37

Merged (8 commits) on Apr 16, 2019

Conversation

@raphlinus (Contributor)

This patch moves many lookups from large match statements to a custom approach based on minimal perfect hashing.

This should improve #29 considerably: cargo build --release goes from 6.28s to 2.11s on my machine. Code size also shrinks considerably (1,432,576 to 858,112 bytes for the benchmark executable). Runtime speed is basically unchanged.

This also moves the generation script to Python 3. Note that the Unicode version is still 9.0.
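
For readers unfamiliar with the technique, here is a rough sketch of the two-level lookup such a scheme compiles down to. The names, table layout, hash constants, and the simplified mph_lookup signature below are illustrative, not the exact code generated by this patch:

// Sketch of a two-level minimal-perfect-hash lookup: a first hash with
// salt 0 selects a per-bucket salt, a second hash with that salt selects
// the final slot, and a single key comparison confirms the hit.
fn mph_hash(key: u32, salt: u32, n: usize) -> usize {
    // Mix the key with the salt, then map the 32-bit result into 0..n.
    let y = key.wrapping_add(salt).wrapping_mul(0x9E37_79B9);
    let y = y ^ key.wrapping_mul(0x3141_5926);
    ((y as u64 * n as u64) >> 32) as usize
}

fn mph_lookup(key: u32, salts: &[u16], kv: &[(u32, u8)], default: u8) -> u8 {
    let salt = salts[mph_hash(key, 0, salts.len())] as u32;
    let (k, v) = kv[mph_hash(key, salt, kv.len())];
    if k == key { v } else { default }
}

The point is that every lookup costs two hashes, two table reads, and one comparison regardless of table size, which is what replaces the large match statements.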

@raphlinus (Contributor, Author)

More detail on the benchmarks. Here's the before:

test bench_is_nfc_ascii                      ... bench:          23 ns/iter (+/- 5)
test bench_is_nfc_normalized                 ... bench:          36 ns/iter (+/- 3)
test bench_is_nfc_not_normalized             ... bench:         452 ns/iter (+/- 163)
test bench_is_nfc_stream_safe_ascii          ... bench:          22 ns/iter (+/- 2)
test bench_is_nfc_stream_safe_normalized     ... bench:          46 ns/iter (+/- 6)
test bench_is_nfc_stream_safe_not_normalized ... bench:         528 ns/iter (+/- 225)
test bench_is_nfd_ascii                      ... bench:          21 ns/iter (+/- 4)
test bench_is_nfd_normalized                 ... bench:          45 ns/iter (+/- 3)
test bench_is_nfd_not_normalized             ... bench:          16 ns/iter (+/- 3)
test bench_is_nfd_stream_safe_ascii          ... bench:          24 ns/iter (+/- 4)
test bench_is_nfd_stream_safe_normalized     ... bench:          55 ns/iter (+/- 5)
test bench_is_nfd_stream_safe_not_normalized ... bench:          17 ns/iter (+/- 3)
test bench_nfc_ascii                         ... bench:         661 ns/iter (+/- 113)
test bench_nfc_long                          ... bench:     234,811 ns/iter (+/- 44,577)
test bench_nfd_ascii                         ... bench:         308 ns/iter (+/- 51)
test bench_nfd_long                          ... bench:     127,452 ns/iter (+/- 11,391)
test bench_nfkc_ascii                        ... bench:         599 ns/iter (+/- 49)
test bench_nfkc_long                         ... bench:     236,973 ns/iter (+/- 19,020)
test bench_nfkd_ascii                        ... bench:         316 ns/iter (+/- 21)
test bench_nfkd_long                         ... bench:     141,850 ns/iter (+/- 22,229)
test bench_streamsafe_adversarial            ... bench:         507 ns/iter (+/- 26)
test bench_streamsafe_ascii                  ... bench:          75 ns/iter (+/- 5)

And here's the after:

test bench_is_nfc_ascii                      ... bench:          22 ns/iter (+/- 1)
test bench_is_nfc_normalized                 ... bench:          35 ns/iter (+/- 4)
test bench_is_nfc_not_normalized             ... bench:         419 ns/iter (+/- 119)
test bench_is_nfc_stream_safe_ascii          ... bench:          26 ns/iter (+/- 7)
test bench_is_nfc_stream_safe_normalized     ... bench:          45 ns/iter (+/- 8)
test bench_is_nfc_stream_safe_not_normalized ... bench:         447 ns/iter (+/- 49)
test bench_is_nfd_ascii                      ... bench:          22 ns/iter (+/- 1)
test bench_is_nfd_normalized                 ... bench:          46 ns/iter (+/- 6)
test bench_is_nfd_not_normalized             ... bench:          16 ns/iter (+/- 5)
test bench_is_nfd_stream_safe_ascii          ... bench:          22 ns/iter (+/- 2)
test bench_is_nfd_stream_safe_normalized     ... bench:          61 ns/iter (+/- 8)
test bench_is_nfd_stream_safe_not_normalized ... bench:          16 ns/iter (+/- 4)
test bench_nfc_ascii                         ... bench:         620 ns/iter (+/- 376)
test bench_nfc_long                          ... bench:     195,177 ns/iter (+/- 21,275)
test bench_nfd_ascii                         ... bench:         392 ns/iter (+/- 42)
test bench_nfd_long                          ... bench:     146,535 ns/iter (+/- 9,473)
test bench_nfkc_ascii                        ... bench:         550 ns/iter (+/- 41)
test bench_nfkc_long                         ... bench:     212,233 ns/iter (+/- 16,049)
test bench_nfkd_ascii                        ... bench:         384 ns/iter (+/- 27)
test bench_nfkd_long                         ... bench:     155,408 ns/iter (+/- 12,506)
test bench_streamsafe_adversarial            ... bench:         458 ns/iter (+/- 24)
test bench_streamsafe_ascii                  ... bench:          77 ns/iter (+/- 6)

More commentary: I also tested the singleton-bucket "optimization" described in Steve Hanov's blog post on minimal perfect hashing, and it was about 50% slower on the long tests. It saves rehashing work, but the cost of the extra branching outweighs that. Leaving it out makes table generation a bit slower and also less robust (it would not be too difficult to construct an adversarial example that overflows the salt).
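
For context, the singleton-bucket variant stores, for any bucket containing a single key, the destination slot directly in the intermediate table (encoded as a negative value) instead of a salt, so the lookup needs an extra sign test. A rough sketch of that lookup path, not the code that was actually benchmarked:

fn hash(key: u32, salt: u32, n: usize) -> usize {
    // Same style of mixing hash as elsewhere; constants are illustrative.
    let y = key.wrapping_add(salt).wrapping_mul(0x9E37_79B9) ^ key.wrapping_mul(0x3141_5926);
    ((y as u64 * n as u64) >> 32) as usize
}

fn lookup_with_singletons(key: u32, inter: &[i32], kv: &[(u32, u8)], default: u8) -> u8 {
    let d = inter[hash(key, 0, inter.len())];
    let slot = if d < 0 {
        (-d - 1) as usize             // singleton bucket: slot stored directly
    } else {
        hash(key, d as u32, kv.len()) // regular bucket: rehash with the salt
    };
    let (k, v) = kv[slot];
    if k == key { v } else { default }
}

The if d < 0 test is the extra branch referred to above.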

I wouldn't be surprised if there were a better hash function. Using a single multiplication doesn't work; there are too many collisions. I also tried a variant of the Jenkins one-at-a-time hash function, and it was slower. Several other proposals were mentioned in a Twitter thread, but I don't think anything will be faster.
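
For reference, the classic Jenkins one-at-a-time hash, which the variant mentioned above would be based on, looks roughly like this in Rust; the per-byte mixing plus the final avalanche is noticeably more work per lookup than a single multiply-and-xor, which is consistent with it benchmarking slower here:

fn one_at_a_time(key: u32) -> u32 {
    // Hash the four bytes of the key, one at a time.
    let mut h: u32 = 0;
    let bytes = key.to_le_bytes();
    for &b in bytes.iter() {
        h = h.wrapping_add(b as u32);
        h = h.wrapping_add(h << 10);
        h ^= h >> 6;
    }
    // Final avalanche.
    h = h.wrapping_add(h << 3);
    h ^= h >> 11;
    h.wrapping_add(h << 15)
}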

@trishume

Another approach that might lead to even better compile times is to output the tables in a simple packed binary format, include them with the include_bytes! macro, and then index into the byte arrays to extract what you need. That would avoid generating a 0.5-megabyte Rust file. I'm not sure how much compile time it would save for the effort it would take, though.
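
A rough sketch of what that could look like; the file name and the five-byte record layout here are made up for illustration:

// Hypothetical packed layout: little-endian records of (key: u32, value: u8),
// five bytes each, written by the generation script and pulled in at compile time.
static PACKED_KV: &[u8] = include_bytes!("tables.bin");

fn packed_entry(i: usize) -> (u32, u8) {
    let off = i * 5;
    let key = u32::from_le_bytes([
        PACKED_KV[off],
        PACKED_KV[off + 1],
        PACKED_KV[off + 2],
        PACKED_KV[off + 3],
    ]);
    (key, PACKED_KV[off + 4])
}

include_bytes! embeds the file into the binary at compile time, so the savings would come from rustc not having to parse and type-check a huge generated source file.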

@raphlinus (Contributor, Author)

@trishume That's well worth considering. One factor against it is that this crate has strictly no unsafe code, so deserialization from the packed format would at least need checks for the conversion into char. But it's probably a good idea to investigate.
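
The check mentioned here would amount to something like the following sketch (names illustrative): char::from_u32 rejects surrogates and out-of-range values, so no unsafe transmute is needed, but the branch is paid on every decode:

fn decode_char(raw: u32) -> char {
    // Checked conversion; returns None for surrogates and values above 0x10FFFF.
    std::char::from_u32(raw).expect("packed table contains an invalid code point")
}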

@Manishearth (Member) left a comment:

I'd slightly prefer it if the generated code and the generated tables lived separately: have the script generate the DECOMPOSITION_KEYS and DECOMPOSITION_SALTS tables, and keep the actual mph_lookup calls outside of tables.rs, so that tables.rs is just tables and no actual code.
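
Concretely, the requested split might look something like this; the table names are taken from the comment above, and the element types and contents are placeholders:

// tables.rs: generated data only, no functions.
pub const DECOMPOSITION_SALTS: &[u16] = &[/* generated by the script */];
pub const DECOMPOSITION_KEYS: &[(u32, u32)] = &[/* generated by the script */];

// A hand-written module elsewhere owns the logic and makes the calls, e.g.
// mph_lookup(c as u32, DECOMPOSITION_SALTS, DECOMPOSITION_KEYS, ...).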

return (y * n) >> 32

# Compute minimal perfect hash function, d can be either a dict or list of keys.
def minimal_perfect_hash(d, singleton_buckets = False):
@Manishearth (Member):

I'd prefer if this function had more comments

@@ -432,13 +436,61 @@ def gen_tests(tests, out):

out.write("];\n")

def my_hash(x, salt, n):
@Manishearth (Member):

probably should have a comment saying "guaranteed to be less than n"
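
The guarantee follows from the return (y * n) >> 32 line above: assuming y has been reduced to a 32-bit value by the preceding steps, y * n < 2^32 * n, and therefore (y * n) >> 32 < n. A tiny worst-case check:

fn main() {
    let y: u64 = 0xffff_ffff; // largest possible 32-bit intermediate value
    let n: u64 = 1_000;       // table size
    assert_eq!((y * n) >> 32, n - 1); // even the maximum y maps into 0..n
}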

for (bucket_size, h) in bsorted:
if bucket_size == 0:
break
elif singleton_buckets and bucket_size == 1:
@Manishearth (Member):

Do we use the singleton_buckets case at all?

@raphlinus (Contributor, Author):

No, I can remove it, especially as it seems to perform worse in benchmarks. The main reason I left it in is that it's more robust; without it there's a much greater probability that the hashing will fail.

else:
for salt in range(1, 32768):
rehashes = [my_hash(key, salt, n) for key in buckets[h]]
if all(not claimed[hash] for hash in rehashes):
@Manishearth (Member):

Is there a guarantee that we won't have a collision amongst the rehashes, or is it just really unlikely? (I suspect it's the latter, but I want to confirm.)

@raphlinus (Contributor, Author):

Yes, if a suitable salt is found, that comes with a guarantee that the rehash won't collide (this is what the claimed bool array keeps track of). On the other hand, it's possible that no satisfying salt can be found, but I believe that to be quite unlikely. There are things that can be done to make it more robust; I'll try to add a comment outlining them in case somebody does run into this with a data update.

@Manishearth (Member):

Oh, wait, the set check deals with this; I'd forgotten it was there 😄. To be clear, I was specifically worried about cases where a single run of rehashes collides with itself, which claimed won't catch, since we only update it later.

(worth leaving a comment saying that)
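
In other words, the acceptance test for a salt has two parts: none of the bucket's rehashed slots may already be claimed by an earlier bucket, and the rehashes within the bucket must be pairwise distinct. An equivalent check written out in Rust (the generator itself is Python):

use std::collections::HashSet;

fn salt_is_acceptable(rehashes: &[usize], claimed: &[bool]) -> bool {
    let mut seen = HashSet::new();
    // Every slot must be free, and `insert` returning false flags a
    // within-bucket collision that the claimed array alone would miss.
    rehashes.iter().all(|&slot| !claimed[slot] && seen.insert(slot))
}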

out.write("pub fn composition_table(c1: char, c2: char) -> Option<char> {\n")
out.write(" match (c1, c2) {\n")
out.write(" if c1 < '\\u{10000}' && c2 < '\\u{10000}' {\n")
out.write(" mph_lookup((c1 as u32) << 16 | (c2 as u32), &[\n")
@Manishearth (Member):

Could the code that outputs the mph_lookup calls be factored out into a function?

The code has been moved out of the tables module into perfect_hash, and there is a bit more explanation in comments.
@raphlinus (Contributor, Author)

@Manishearth does this address your concerns? It's a bit denser (less cut-and-paste of generated code), but hopefully the organization is reasonably clear and the comments help.

@Manishearth (Member) left a comment:

LGTM, minor issue

/// Look up the canonical combining class for a codepoint.
///
/// The value returned is as defined in the Unicode Character Database.
pub fn canonical_combining_class(c: char) -> u8 {
@Manishearth (Member):

These functions should live elsewhere

@raphlinus (Contributor, Author):

Their own module? That's what I did in 40f9ba6.

@Manishearth (Member)

Thanks!
