Use minimal perfect hashing for lookups #37
Conversation
More detail on the benchmarks. Here's the before:
And here's the after:
More commentary. I also tested the singleton bucket "optimization" described in Steve Hanov's blog post on minimal perfect hashing, and it was about 50% slower on the long tests. It saves rehashing work, but the cost of the extra branching outweighs that. Not doing it makes table generation a bit slower and also less robust (it would not be too difficult to construct an adversarial example that would overflow the salt). I wouldn't be surprised if there were a better hash function; using a single multiplication doesn't work, as there are too many collisions. I also tried a variant of the Jenkins one-at-a-time hash function, and it was slower. Several other proposals were mentioned in a Twitter thread, but I don't think anything will be faster.
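For reference, here's a minimal Python sketch of the kind of salted multiply-and-mix hash this discussion refers to; the mixing constants are illustrative, not a claim about the exact code in the patch:

```python
def my_hash(x, salt, n):
    # A single multiplication alone gives too many collisions, so mix the
    # salted key with a multiply and an extra xor/multiply step.
    y = ((x + salt) * 2654435769) & 0xffffffff
    y ^= (x * 0x31415926) & 0xffffffff
    # Range reduction: y fits in 32 bits, so (y * n) >> 32 is always < n.
    return (y * n) >> 32
```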
Another approach that might lead to even better compile times is to output the tables in some simple packed binary format and include them with the …
@trishume That's well worth considering. One factor against it is that this crate has strictly no unsafe code, and the deserialization from the packed format would at least have checks for the conversion into …
Slightly would prefer if the generated code and the generated tables lived separately -- so you have the functions generate DECOMPOSITION_KEYS and DECOMPOSITION_SALTS tables, and the actual mph_lookup calls live outside of tables.rs, so that tables.rs is just tables and no actual code.
scripts/unicode.py
Outdated
    return (y * n) >> 32

# Compute minimal perfect hash function, d can be either a dict or list of keys.
def minimal_perfect_hash(d, singleton_buckets = False):
I'd prefer if this function had more comments
@@ -432,13 +436,61 @@ def gen_tests(tests, out):

    out.write("];\n")

def my_hash(x, salt, n):
probably should have a comment saying "guaranteed to be less than n"
scripts/unicode.py
Outdated
    for (bucket_size, h) in bsorted:
        if bucket_size == 0:
            break
        elif singleton_buckets and bucket_size == 1:
Do we use the singleton_buckets case at all?
No, I can remove it, especially as it seems to perform worse in benchmarks. The main reason I left it in is that it's more robust; without it there's a much greater probability the hashing will fail.
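For anyone following along, a rough sketch of what the body of that singleton-bucket branch looks like in Hanov-style constructions; this is illustrative, free_slots is a hypothetical list of still-unclaimed slots, and the negative encoding is what adds the extra branch on the lookup path mentioned earlier:

```python
        elif singleton_buckets and bucket_size == 1:
            # A one-key bucket needs no salt search: place the key in any
            # free slot and record that slot directly, encoded as a negative
            # value so the lookup can tell it apart from a salt. This saves
            # rehashing work during generation but costs a branch per lookup.
            slot = free_slots.pop()
            salts[h] = -(slot + 1)
            keys[slot] = buckets[h][0]
            claimed[slot] = True
```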
        else:
            for salt in range(1, 32768):
                rehashes = [my_hash(key, salt, n) for key in buckets[h]]
                if all(not claimed[hash] for hash in rehashes):
Is there a guarantee that we won't have a collision amongst the rehashes? Is it just really unlikely? (I suspect it's the latter but want to confirm)
Yes, if it finds a suitable salt, that comes with a guarantee the rehash won't have a collision (this is what the claimed bool-array keeps track of). On the other hand, it's possible that no salt can be found that satisfies that, but I believe that to be quite a low probability. There are things that can be done to make it more robust. I'll try to add a comment outlining that in case somebody does run into it with a data update.
Oh, wait, the set check deals with this, I'd forgotten it was there 😄. To be clear, I was specifically worried about cases where a single run of rehashes has collisions, which claimed won't catch since we update it later.
(worth leaving a comment saying that)
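To make the two checks being discussed concrete, here's a hedged sketch of the salt-search step. It follows the variable names visible in the diff (buckets, claimed, my_hash, bucket_size), while salts and keys stand in for whatever output tables the real function builds:

```python
            for salt in range(1, 32768):
                rehashes = [my_hash(key, salt, n) for key in buckets[h]]
                # Check 1: the bucket's own rehashes must not collide with
                #          each other; claimed can't catch this, since it is
                #          only updated below.
                # Check 2: none of the rehashed slots may already be claimed
                #          by a previously placed (larger) bucket.
                if len(set(rehashes)) == bucket_size and all(not claimed[hash] for hash in rehashes):
                    salts[h] = salt
                    for key in buckets[h]:
                        slot = my_hash(key, salt, n)
                        claimed[slot] = True
                        keys[slot] = key
                    break
            else:
                # No salt in range worked; this is the low-probability
                # failure mode mentioned above.
                raise Exception("failed to find a salt for bucket %d" % h)
```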
scripts/unicode.py
Outdated
out.write("pub fn composition_table(c1: char, c2: char) -> Option<char> {\n") | ||
out.write(" match (c1, c2) {\n") | ||
out.write(" if c1 < '\\u{10000}' && c2 < '\\u{10000}' {\n") | ||
out.write(" mph_lookup((c1 as u32) << 16 | (c2 as u32), &[\n") |
Could the code outputting mph_lookup calls be factored out into a function?
The code has been moved out of the tables module into perfect_hash, and there is a bit more explanation in comments.
@Manishearth does this address your concerns? It's a bit denser (less cut'n'paste of generated code), but hopefully reasonably clear in organization and with comments.
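For illustration, the factored-out emitter could look roughly like the sketch below. The function name, parameters, and emitted constant names are hypothetical, and it assumes minimal_perfect_hash returns the salts plus the reordered keys; it is not a claim about exactly what the patch does:

```python
def gen_mph_data(out, name, d, kv_type, kv_callback):
    # Build the perfect hash once, then emit data only: a salts table and a
    # key/value table. The mph_lookup code itself lives in the perfect_hash
    # module, so tables.rs stays pure data.
    (salts, keys) = minimal_perfect_hash(d)
    out.write("pub const %s_SALT: &[u16] = &[\n" % name.upper())
    for s in salts:
        out.write("    0x%x,\n" % s)
    out.write("];\n")
    out.write("pub const %s_KV: &[%s] = &[\n" % (name.upper(), kv_type))
    for k in keys:
        out.write("    %s,\n" % kv_callback(k))
    out.write("];\n\n")
```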
LGTM, minor issue
src/perfect_hash.rs
Outdated
/// Look up the canonical combining class for a codepoint.
///
/// The value returned is as defined in the Unicode Character Database.
pub fn canonical_combining_class(c: char) -> u8 {
These functions should live elsewhere
Their own module? That's what I did in 40f9ba6.
Thanks!
This patch moves many lookups from large match statements to a custom approach based on minimal perfect hashing.
Should improve #29 considerably - cargo build --release goes from 6.28s to 2.11s on my machine. In addition, code size is considerably improved (1432576 to 858112 bytes for the benchmark executable). Speed is basically the same. Also moves the generation script to Python 3. Note, the Unicode version is still 9.0.
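For readers unfamiliar with the technique, here's a minimal Python sketch of the two-probe lookup such tables support; the real lookup lives in Rust (mph_lookup in the perfect_hash module), so names and details here are illustrative only:

```python
def mph_lookup(x, salts, keys, values, default):
    n = len(salts)
    # First probe with salt 0 picks a bucket; that bucket's stored salt was
    # chosen at generation time so the second probe sends every key in the
    # bucket to its own unique slot.
    s = salts[my_hash(x, 0, n)]
    idx = my_hash(x, s, n)
    # A final key comparison rejects inputs that were never in the table.
    return values[idx] if keys[idx] == x else default
```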