Make short string hashing 30% faster by splitting Hash::hash_end from Hash::hash #29139

Closed
sorear wants to merge 1 commit

sorear
Contributor

@sorear sorear commented Oct 18, 2015

Why?

Since hash functions are already designed to prevent collisions between a string and its prefixes, it's somewhat inelegant that we append a sentinel byte to strings for hashing. I was looking at #25237 a few days ago and realized that if we distinguish hashing contexts which are at the end of the key from those that aren't, we can suppress the sentinel byte (and also vector lengths) in the cases where they aren't needed. In addition to saving a byte of hashing, this saves a call to update and the associated buffer-management overhead.
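
To make the sentinel concrete, here is a minimal sketch that records the bytes a key feeds to a hasher instead of mixing them. `Recorder` and `encoding` are hypothetical helpers, not part of this patch. The sketch assumes the standard library's `str` implementation writes the string bytes followed by a `0xFF` byte, as described above; the exact byte stream is an implementation detail of std and may differ between versions.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical helper: stores every byte written instead of hashing it.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, data: &[u8]) {
        self.bytes.extend_from_slice(data);
    }

    fn finish(&self) -> u64 {
        0 // never used; only `bytes` is inspected
    }
}

/// Returns the byte stream that `value` feeds to a hasher.
fn encoding<T: Hash + ?Sized>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash(&mut recorder);
    recorder.bytes
}

fn main() {
    // Without a terminator, both keys would feed the same bytes "abc".
    let left = encoding(&("ab", "c"));
    let right = encoding(&("a", "bc"));
    println!("{:02x?}", left);
    println!("{:02x?}", right);
    assert_ne!(left, right);
}
```

The terminator is what keeps `("ab", "c")` and `("a", "bc")` apart. When a string is the whole key, or the last part of one, nothing follows it and the terminator does no work.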

This attacks the same problem as #28044.

How?

This adds a new method hash_end to the Hash trait, which behaves exactly as hash except that it need not produce a prefix-free encoding. It is always legal for hash_end to be the same as hash, and as such this is the default implementation. There are specialized implementations for strings and slices which remove the end/length markers.
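
A rough sketch of that shape, written against a stand-in trait so that it compiles on an unpatched compiler. `SketchHash`, `Recorder`, `full_bytes` and `end_bytes` are hypothetical names for illustration only; the actual patch adds `hash_end` to `std::hash::Hash` itself, and its real implementations may differ in detail.

```rust
use std::hash::Hasher;

/// Hypothetical stand-in for the proposed trait shape; not std's `Hash`.
trait SketchHash {
    /// Prefix-free encoding, safe to follow with more data.
    fn hash<H: Hasher>(&self, state: &mut H);

    /// Encoding for the last item of a key. Falling back to `hash` is always legal.
    fn hash_end<H: Hasher>(&self, state: &mut H) {
        self.hash(state)
    }
}

impl SketchHash for str {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes());
        state.write_u8(0xff); // terminator keeps the encoding prefix-free
    }

    fn hash_end<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes()); // nothing follows, so no terminator
    }
}

impl SketchHash for String {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.as_str().hash(state)
    }

    fn hash_end<H: Hasher>(&self, state: &mut H) {
        self.as_str().hash_end(state)
    }
}

impl SketchHash for (String, String) {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state);
        self.1.hash(state);
    }

    fn hash_end<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state); // a middle element still needs its terminator
        self.1.hash_end(state); // the last element does not
    }
}

/// Hypothetical helper: stores every byte written instead of hashing it.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, data: &[u8]) {
        self.bytes.extend_from_slice(data);
    }

    fn finish(&self) -> u64 {
        0 // never used; only `bytes` is inspected
    }
}

fn full_bytes<T: SketchHash + ?Sized>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash(&mut recorder);
    recorder.bytes
}

fn end_bytes<T: SketchHash + ?Sized>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash_end(&mut recorder);
    recorder.bytes
}

fn main() {
    let key = ("ab".to_string(), "c".to_string());
    println!("hash:     {:02x?}", full_bytes(&key));
    println!("hash_end: {:02x?}", end_bytes(&key));
}
```

In this sketch the pair keeps the `0xFF` between its two strings in both methods and drops only the trailing one in `hash_end`.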

How much?

Here's a small benchmark script:

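use std::hash::{Hasher,Hash,SipHasher};
use std::env;

fn main() {
    // The first command-line argument selects the benchmark mode.
    let args : Vec<String> = env::args().collect();
    let mut acc = 0u64;
    match &*args[1] {
        // (0) Control: format the strings without hashing them.
        "0" => {
            for i in 1 .. 10_000_000 {
                acc += format!("{}", i).len() as u64; // not doing hashing
            }
        },
        // (1) Hash each string with the new hash_end (patched compiler only).
        "1" => {
            for i in 1 .. 10_000_000 {
                let mut _h = SipHasher::new();
                format!("{}", i).hash_end(&mut _h);
                // Wrapping add: the sum of hashes would overflow a u64 in a debug build.
                acc = acc.wrapping_add(_h.finish());
            }
        },
        // (2) Hash each string with the existing hash.
        "2" => {
            for i in 1 .. 10_000_000 {
                let mut _h = SipHasher::new();
                format!("{}", i).hash(&mut _h);
                acc = acc.wrapping_add(_h.finish());
            }
        },
        // (3) HashSet insertions.
        "3" => {
            let mut s = std::collections::HashSet::new();
            for i in 1 .. 10_000_000 {
                s.insert(format!("{}", i));
            }
            acc = s.len() as u64;
        },
        // (4) HashSet queries against a small set; most lookups miss.
        "4" => {
            let mut s = std::collections::HashSet::new();
            for i in 1 .. 100_000 {
                s.insert(format!("{}", i));
            }
            for i in 1 .. 10_000_000 {
                if s.contains(&format!("{}", i)) { acc += 1; }
            }
        }
        _ => {},
    }
    // Print the accumulator so the optimizer cannot discard the work.
    println!("{}", acc);
}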

I ran it in each mode on the patched and baseline rust compilers (with -O, on x86_64 OS X), taking the median of 27 runs each time, and got the following timings:

            (0)   (1)   (2)   (3)   (4)
PATCHED   0.864 1.133 1.276 5.564 1.619
BASELINE  0.852 ----- 1.298 5.654 1.736

Subtracting out the mode (0) case, which just allocates and frees strings, it looks like a 34% improvement on short string hashing, 13% on hashset queries, and 2% on hashset insertions. Uncertainty for the medians seems to be around 10ms.
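
The percentages can be approximately reproduced from the table. The sketch below is one reading of the arithmetic, not necessarily the exact calculation used above: `saving` is a hypothetical helper, and the choice of which columns to compare is an assumption (patched mode (2) against patched mode (1) for short strings, baseline against patched for the HashSet modes).

```rust
/// Hypothetical helper: fraction of the time spent beyond the mode (0)
/// overhead that is saved when going from `slow` to `fast`.
fn saving(slow: f64, fast: f64, overhead: f64) -> f64 {
    (slow - fast) / (slow - overhead)
}

fn main() {
    // Timings copied from the table above.
    let patched = [0.864, 1.133, 1.276, 5.564, 1.619];
    let baseline_0 = 0.852;
    let baseline_3 = 5.654;
    let baseline_4 = 1.736;

    // Short strings: hash (2) versus hash_end (1), both on the patched compiler.
    let short = saving(patched[2], patched[1], patched[0]);
    // HashSet insertions and queries: baseline versus patched compiler.
    let inserts = saving(baseline_3, patched[3], baseline_0);
    let queries = saving(baseline_4, patched[4], baseline_0);

    println!("short strings: {:.1}%", short * 100.0);
    println!("queries:       {:.1}%", queries * 100.0);
    println!("insertions:    {:.1}%", inserts * 100.0);
}
```

Under these assumptions the three figures come out at roughly 34.7%, 13.2% and 1.9%, close to the numbers quoted.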

What's the catch?

  • Naturally this changes hash values. It's not clear how big a deal that is, especially with respect to semver.
  • More importantly: anybody who needs two types to have the same hash values (in particular Borrow implementers) can no longer generally do so by forwarding the hash method; hash_end must be forwarded as well. This situation exists exactly once in the compiler. #[derive(Hash)] has been modified to forward hash_end, so newtype-ish wrappers will just work outside of the compiler. Inside the compiler, hash_end has to be implemented by hand wherever it's needed for Borrow, because we can't rely on the stage0 compiler to derive it.
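
The forwarding problem in the second point can be sketched as follows. This uses a hypothetical stand-in trait, `SketchHash`, rather than the patched `std::hash::Hash`, and `OldName`, `NewName`, `Recorder` and `end_bytes` are likewise invented for illustration. `OldName` forwards only `hash`, as existing code would; `NewName` forwards both methods.

```rust
use std::hash::Hasher;

/// Hypothetical stand-in for the proposed trait shape; not std's `Hash`.
trait SketchHash {
    fn hash<H: Hasher>(&self, state: &mut H);

    /// Defaults to `hash`, as in the proposal.
    fn hash_end<H: Hasher>(&self, state: &mut H) {
        self.hash(state)
    }
}

impl SketchHash for str {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes());
        state.write_u8(0xff);
    }

    fn hash_end<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes()); // specialized: no terminator
    }
}

/// Wrapper written before `hash_end` existed: forwards `hash` only.
struct OldName(String);

impl SketchHash for OldName {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.as_str().hash(state)
    }
    // `hash_end` falls back to the default and keeps the 0xFF terminator.
}

/// Wrapper updated for the proposal: forwards both methods.
struct NewName(String);

impl SketchHash for NewName {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.as_str().hash(state)
    }

    fn hash_end<H: Hasher>(&self, state: &mut H) {
        self.0.as_str().hash_end(state)
    }
}

/// Hypothetical helper: stores every byte written instead of hashing it.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, data: &[u8]) {
        self.bytes.extend_from_slice(data);
    }

    fn finish(&self) -> u64 {
        0 // never used; only `bytes` is inspected
    }
}

fn end_bytes<T: SketchHash + ?Sized>(value: &T) -> Vec<u8> {
    let mut recorder = Recorder::default();
    value.hash_end(&mut recorder);
    recorder.bytes
}

fn main() {
    // The old wrapper no longer agrees with the str it borrows as.
    assert_ne!(end_bytes(&OldName("ab".to_string())), end_bytes("ab"));
    // Forwarding both methods restores the agreement.
    assert_eq!(end_bytes(&NewName("ab".to_string())), end_bytes("ab"));
}
```

If a map hashed its keys with `hash_end`, a key stored as `OldName` and looked up through its borrowed `str` form would be hashed from different bytes and the lookup would miss.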

Wait!

hash_end should probably be feature gated. It's not in this version of the patch, because when I tried feature-gating it, deriving broke; I'm not sure how to tell rustc to ignore feature gates in deriving-generated code.

I'm not sure whether this should go through the RFC process.

By distinguishing the end hash operations from middle hash operations, we can
avoid hashing unnecessary sentinels.  For instance, (String, String) only needs
a 0xFF in the middle, not at the end.
@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@bluss
Member

bluss commented Oct 18, 2015

More importantly: anybody who needs two types to have the same hash values (in particular Borrow implementers) can no longer generally do so by forwarding the hash method; hash_end must be forwarded as well.

I'm a bit worried this will run into the same problem @gankro found in the first attempt: this is a breaking change for any external impls of Hash, especially those that try to hash the same as str or slices by forwarding the method. They would now need to define this method too.

I'm absolutely interested in any approach to solve or improve short input hashing and accommodating other hash algorithms.

I offer one argument in favour of the approach in my PR: whether to care about prefix-freeness or not, and how to solve it, should be a property of the Hasher, not of the value to be hashed (the Hash trait). I also think it has much lower backward-compatibility risk.

@alexcrichton
Member

I think that with this and #28044 we may be at the point where we should hold off for an RFC to work through the design space here. I'm personally a little unsure about what the constraints are and, e.g., where it falls down today.

@sorear
Contributor Author

sorear commented Oct 21, 2015

@alexcrichton What would the path forward for that be? Shall I reformat my version of the proposal as an RFC and take it there?

@alexcrichton
Member

@sorear Yeah, I think that may be the best path forward. You may want to work with @bluss on the RFC and at least mention #28044 in the alternatives section.

@bors
Contributor

bors commented Oct 25, 2015

☔ The latest upstream changes (presumably #29254) made this pull request unmergeable. Please resolve the merge conflicts.

@bstrie
Contributor

bstrie commented Oct 27, 2015

@sorear Always happy to see yet another DCSS developer here. :P

@brson
Contributor

brson commented Nov 23, 2015

Seems like an RFC was desired. Closing.
