Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for custom hashers in HashMap #27713

Closed
alexcrichton opened this issue Aug 12, 2015 · 47 comments
Closed

Tracking issue for custom hashers in HashMap #27713

alexcrichton opened this issue Aug 12, 2015 · 47 comments
Labels
A-collections Area: `std::collection` B-unstable Blocker: Implemented in the nightly compiler and unstable. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Copy link
Member

This is a tracking issue for the unstable hashmap_hasher feature in the standard library. This provides the ability to create a HashMap with a custom hashing implementation that abides by the HashState trait. This has already been used quite a bit in the compiler itself as well as in Servo I believe.

Some notable points to consider:

  • Is an extra HashState trait really necessary?
  • Is the naming of HashState correct?
  • Is the aggressive use of Default appropriate here?
  • Can the new constructor be leveraged to create hash maps that use a hasher implementing Default? Right now the new constructor only works with RandomState.
  • How ergonomic is it to create a hash map using a custom Hasher implementation? In theory it should be quite ergonomic.

cc @gankro

@alexcrichton alexcrichton added A-collections Area: `std::collection` T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Aug 12, 2015
@tomaka
Copy link
Contributor

tomaka commented Aug 12, 2015

I don't know much about hashing functions, but I just want to say that this one is very important for performances.

The SipHasher is very slow, and always eats up between 5% and 20% of the CPU time of each of my projects. By comparison the FnvHasher is around 6 times faster on my machine according to some quick benchmarking.

The purpose of the SipHasher is to protect against DDoSes, but the vast majority of the applications written in Rust never even open a socket.

@bluss
Copy link
Member

bluss commented Aug 12, 2015

SipHash looks good on long enough data, but it is slow on small values like integers. Its fixed four round finalization at the end is an example of overhead that is comparatively larger when hashing small values.

@Gankra
Copy link
Contributor

Gankra commented Aug 12, 2015

See: https://github.com/shepmaster/twox-hash/blob/master/README.md for scaling comparison.

Based on these results (which I haven't vetted for quality), one can conclude that SipHash is actually a pretty good "general purpose" hasher. Fnv doesn't scale as well to longer strings (> 32 bytes), and XXHash doesn't scale well to smaller strings (< 32 bytes).

@Gankra
Copy link
Contributor

Gankra commented Aug 12, 2015

@bluss
Copy link
Member

bluss commented Aug 12, 2015

(XXhash was constructed to be a fast checksum for gigabytes of lz4 data, so that's what it is good at)

@alexcrichton
Copy link
Member Author

If HashState is stabilized then we should also ensure that SipHasher isn't exposed as the default hash algorithm on hash maps. Right now the associated type Hasher in RandomState's HashState implementation exposes this and we'll want to wrap that up in something like DefaultHasher instead.

@Gankra
Copy link
Contributor

Gankra commented Aug 23, 2015

Just noting some discussion that was tangentially related to this:

The current hasher infra heavily penalizes Farmhash (and cityhash, and murmurhash), which wants to branch on the size of the input to choose the "optimal" algorithm. Community's current solution is to just buffer in a Vec until finish is called. This appears to be overkill, at least in the case of Farmhash.

However more generally it would be nice (even for SipHash) to be able to to indicate to the hasher "there is only one thing to hash, here it is, now give me the final hash". Sketch of design:

add:

pub trait Hasher {
    ...
    /// Hash only the given bytes, and immediately finalize.
    /// Results are unspecified if other bytes were previously written to this hasher
    fn write_only(&mut self, bytes: &[u8]) -> u64 {
        self.write(bytes);
        self.finish()
    }
}

pub trait Hash {
    ...
    /// Hashes only this value
    fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
        self.hash(state);
        state.finish()
    }
}

So far, so pointless. But back-compat!

But now hashers can override write_only to something optimized, and Hash impls can specialize in the following way:

// #[derive]d impl
impl Hash {
   fn hash() { ... } // same-old

   #[if(only_has_one_field)]  // magic
   fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
      self.only_field.hash_one_shot(state)
   }
}

And key types like slice, str, primitives, etc can manually impl hash_one_shot:

fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
    state.write_only(self.as_slice())
}

However we must be wary of #5257!

In particular, slices and strs must mix in some "bonus" value so that ("aa", "a") does not hash the same as ("a", "aa"). My proposed solution to this is simple, but sketchy: have has_one_shot and hash do different things. one_shot knows it's the only thing in existence, and so doesn't need to mix in the bonus data to prevent concatenation collisions. hash doesn't, so it has to conservatively mix in the bonus data.

This seems really sketchy, but I can't think of a situation where this would actually break things. You will get potentially curious behaviour, where &["hello"] doesn't hash to the same thing as "hello" or ("hello",), but this doesn't seem like a serious problem. They're fundamentally different types! Anyone relying on them hashing the same is in a pretty dubious position. Alternatively, in the case of slices, you could actually have them branch at runtime on their len to determine whether to call hash or hash_one_shot on their elements.

I've been flying around the country and haven't slept though.

@Gankra
Copy link
Contributor

Gankra commented Aug 23, 2015

Incidentally, this proposal soft-fixes #27108 (&[u8] and &str would indeed hash_one_shot to equal values).

@ranma42
Copy link
Contributor

ranma42 commented Aug 24, 2015

To me, the naming of Hasher and HashState is quite unintuitive. I expected the hash state (in many other libs, see openssl and crypto++, named "context") to be updated and finalised (using .write() and .finish()), but they actually belong to the Hasher trait.

@bluss
Copy link
Member

bluss commented Aug 27, 2015

@gankro I'd like to explore a simpler model, where the Hash trait remains the same (for backwards compatibility and sanity), and the Hasher is extended. Changing str and slice hashes by default too. A Wip is up at #28044

@ranma42
Copy link
Contributor

ranma42 commented Aug 27, 2015

@gankro I'm exploring a slightly different design here: https://github.com/ranma42/rust-hash/tree/master/src

Besides from the naming, which can obviously be made more consistent with that currently used in Rust, the main difference is that the stream-oriented part of the code is aggressively inlined. Ideally I would want the implementation of one-shot digest to be automatically generated by the compiler, which should be able to get rid of the streaming overhead simply with constant folding and dead code elimination.

This actually happens in the branch I posted, namely the digest method for [u8] does not contain the tail and ntail variables nor the buffered tail flushing loop. Even better, for u8 the length variable gets optimised away, too.

The usual structure of hash functions makes me wonder whether there is much to gain by having write_u8, write_u16, etc. My main objection is that most hash functions do not consume a single byte, but a block of bytes every "step". This means that some kind of buffering and re-alignment is going to be needed in most cases, probably removing any opportunity for effective optimisations.
I believe that a much more effective approach is to have digest_u8, digest_u16 and so on (in the equivalent of my HashFunction trait, which I believe currently is HashState), because this would ensure that the initialisation and finalisation of the hash context can be folded into the computation and that no re-alignment is needed. Additionally for the primitive types {u,i}_{size,8,16,32,64} the fixed length means that the code for most hash functions would be very easy to automatically optimise by the compiler.
I wonder if it might even make sense to add a dispatch for [u8] slices that optimises a small number of fixed sizes (basically arrays up to a certain size)... I will know after some more experimentation ;)

@malbarbo
Copy link
Contributor

An important point is to allow setting the seed for the HashState (or the equivalent). Some applications (like simulators) must have reproducible executions, so its necessary to specify the same seed for the hasher to get the same iteration order on HashSet and HashMap.

@Gankra
Copy link
Contributor

Gankra commented Oct 28, 2015

@malbarbo Yes, this is well-handled by the with_hash_state method. You make your own state with the desired seed, and then pass it to the HashMap.

@aldanor
Copy link

aldanor commented Oct 29, 2015

@gankro the whole problem here is that with_hash_state is unstable...

@alexcrichton
Copy link
Member Author

Nominating for 1.6 discussion

@Gankra
Copy link
Contributor

Gankra commented Nov 4, 2015

CC @shepmaster

As one of the hash-thing maintainers, are you happy with this API?

@alexcrichton
Copy link
Member Author

🔔 This issue is now entering its cycle-long final comment period for stabilization 🔔

The libs team decided that this API is likely ready for stabilization pending a final audit of the ergonomics and usage. If anything arises it will be bumped out of FCP for a later cycle.

@alexcrichton alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Nov 5, 2015
@shepmaster
Copy link
Member

From the point-of-view of a consumer of the API, I've found it to be pleasing enough to use, especially because of the Default implementations. Basically I've just added the hasher's type parameter to my hashmap's declaration and made sure to use Default::default() to construct it instead of HashMap::new():

use std::collections::HashMap;
use twox_hash::RandomXxHashState;

let mut hash: HashMap<_, _, RandomXxHashState> = Default::default();
hash.insert(42, "the answer");
assert_eq!(hash.get(&42), Some(&"the answer"));

From an implementor point-of-view, I'll softly agree with @ranma42's point:

To me, the naming of Hasher and HashState is quite unintuitive.

In my mind, the state of the hash is the internal bits-and-bobs that change every time you add more data to be hashed. In the usage of twox-hash, it's really more like the "state" is a seed and the "hasher" contains the real state.

However, looking back from a user POV, neither "state" nor "seed" sound great when you write the declaration (HashMap<_, _, RandomXxHashState>).

Naming is hard!

Falling into Java naming land, the parameter to the HashMap is really a factory. XxHashFactory is OK, but then we add the complication of the randomness. RandomXxHashFactory sounds like it creates a randomly-seeded XxHash each time you call it, but in reality it generates the random number once when the factory is constructed. Doing the former doesn't work so well, ask me how I know 😇 .

From a pragmatic POV, there will probably never be more than ~100 implementations of hash algorithms, so it's a very small number of people that will have to struggle with this. That may mean it's not worth spending enormous amounts of time on this naming.

There will be many more people who use a custom hash algorithm, so as long as the implementations have a nice bit of example of "use it like this", it will probably be just fine.

@alexcrichton
Copy link
Member Author

Thanks for weighing in @shepmaster! I agree that the naming here perhaps isn't the best, although I am also at a bit of a loss of what would fit as a good name.

For construction of a custom hash map, though, you can even use HashMap::default() as the method name instead of Default::default but I suspect that'll have more of an impact on aesthetics than anything because the hash algorithm otherwise still needs to be annotated (requiring a HashMap elsewhere).

@ranma42
Copy link
Contributor

ranma42 commented Nov 10, 2015

@alexcrichton I mentioned that context (HashContext) might be a better name for the "state" (at least more in line with the standard naming in OpenSSL and crypto++), i.e. the object in which you accumulate a stream/message (usually through an update operation, not a write) and which is finalised to a message digest.

In the stub library I wrote, I named HashFunction the trait which exposes an operation to convert a byte sequence (&[u8]) to a message digest (something like hash_one_shot, which in a way is the usual concept of "hash function", something that maps byte seqs to constant-sized message digests) and an additional operation to extract a new HashContext. In a way, HashFunction roughly corresponds to HashState (except that it adds hash_one_shot), while HashContext corresponds to Hasher in std.

Would it still be possible to rename these traits (and maybe their methods)? Although the names I proposed may not be much easier to understand, they would align with the existing nomenclature. This would at least make them easier to use for people which is already using them in other languages (and it would provide nice consistency for anybody writing library bindings/wrappers).

@alexcrichton
Copy link
Member Author

Hm yeah HashContext may make more sense, although what's stored in HashMap is actually distinct from the object which you feed data into. The actual hash algorithm's state may require more storage than what's needed to create a fresh new state. For example FNV hashing requires no "startup state" whereas SipHash requires 128 bits. In both cases, though, the intermediate state during hashing is larger.

Regardless I think it's still possible to tweak some naming here. If larger changes happen, however, then I think we'll have to punt on this until another cycle.

@ranma42
Copy link
Contributor

ranma42 commented Nov 11, 2015

@alexcrichton Yes, in the HashMap you would be storing a HashFunction object, not a HashContext, as you want something that can map keys to digest values, not an object in which you accumulate messages to obtain a single digest upon finalisation.

In my stub library, the object hierarchy is designed so that the additional state (seeds, initialization vectors, number of ruounds/passes, any parameters) of the hashing algorithm belongs to the object implementing the HashFunction trait. Hash functions are thus seen as a family of functions parametrised on seed & co.

This has the convenient advantage that in most cases the initialisation data does not need to be stored in the context (for example, the SipContext here does not contain k0 or k1) https://github.com/ranma42/rust-hash/blob/master/src/sip.rs#L25 . The tradeoff is that SipContext is unable to reset its own state, you need to get a new one from a SipHashFunction object.
The naming I used is mostly based on the idea that a HashFunction is meant to be a function from &[u8] to the digest type, so any additional parameters of the hash algorithm can be used to define which specific hash function you are looking for.

I did some more work on this, but I did not clean it up and push it to the public repo, but since there seem to be some interest, I will try to get around to it this weekend.

@shepmaster
Copy link
Member

the builder pattern in Rust already has a concept [...] which this isn't quite the same as

I guess I'd disagree. For example, I could see some code like:

trait BuildHasher {
    type Hasher;

    fn build_hasher(&self) -> Self::Hasher;
}

struct MyCoolHash {
    seed: u32,
    stream: u32,
}

struct MyCoolHashBuilder {
    seed: u32,
    stream: u32,
}

impl MyCoolHashBuilder {
    fn new() -> Self {
        MyCoolHashBuilder {
            seed: 0, // make a random seed
            stream: 1, // make a random stream
        }
    }

    fn seed(self, seed: u32) -> Self {
        MyCoolHashBuilder {
            seed: seed,
            ..self
        }
    }

    fn stream(self, stream: u32) -> Self {
        MyCoolHashBuilder {
            stream: stream,
            ..self
        }
    }
}

impl BuildHasher for MyCoolHashBuilder {
    type Hasher = MyCoolHash;

    fn build_hasher(&self) -> Self::Hasher {
        MyCoolHash {
            seed: self.seed,
            stream: self.stream,
        }
    }
}

trait BuildHasher {
    type Hasher;

    fn build_hasher(&self) -> Self::Hasher;
}

struct MyCoolHash {
    seed: u32,
    stream: u32,
}

struct MyCoolHashBuilder {
    seed: u32,
    stream: u32,
}

impl MyCoolHashBuilder {
    fn new() -> Self {
        MyCoolHashBuilder {
            seed: 0, // make a random seed
            stream: 1, // make a random stream
        }
    }

    fn seed(self, seed: u32) -> Self {
        MyCoolHashBuilder {
            seed: seed,
            ..self
        }
    }

    fn stream(self, stream: u32) -> Self {
        MyCoolHashBuilder {
            stream: stream,
            ..self
        }
    }
}

impl BuildHasher for MyCoolHashBuilder {
    type Hasher = MyCoolHash;

    fn build_hasher(&self) -> Self::Hasher {
        MyCoolHash {
            seed: self.seed,
            stream: self.stream,
        }
    }
}

fn main() {
    use std::collections::HashMap;

    let x = MyCoolHashBuilder::new().seed(42).stream(1);
    HashMap::with_hasher(x);
}

There's two parts to the builder pattern - the interesting and unique configuration and the actual building. The HashMap only cares about the building aspect, but there's no reason that the configuration couldn't follow the familiar configuration part.

The biggest downside I see is that I'd expect that most hashing is going to have a small number of knobs to tweak, which means the number of configuration methods would be small (or zero!).

@alexcrichton
Copy link
Member Author

Oh interesting! That's actually a good point that the trait could be considered the "final step" rather than the configuration up front, and along those lines I'd be pretty cool with BuildHasher as a trait anme and with_hash_builder as the method names perhaps?

@aturon
Copy link
Member

aturon commented Dec 4, 2015

@alexcrichton I'd still argue for the proposed MakeHasher, because the trait doesn't cover any builder API, it just covers construction. I think it's perfectly clear.

@alexcrichton
Copy link
Member Author

I suppose I personally prefer the name "hash builder" or BuildHasher, I just needed a way to justify naming it that :)

We don't have that many instances of "make" in libstd I think, the ones I could find are:

  • make_ascii_{upper,lower}case - but this is unstable (and I'm not a huge fan of the name)
  • make_place - also unstable and likely to be renamed

I feel that "build", however, at least does show up throughout libstd

@aturon
Copy link
Member

aturon commented Dec 4, 2015

I agree that build is more common -- but specifically in the context of a builder API, which this is at best a degenerate version of.

(That said, I won't stand strongly in the way of BuildHasher).

@bluss
Copy link
Member

bluss commented Dec 4, 2015

We're avoiding factory for no reason at all, that's fine, builder can be our own factory. I think BuildHasher is better than MakeHasher. However, no reason to make a builder style API for it just because of the name. Utility first.

@alexcrichton
Copy link
Member Author

🔔 This issue is now entering its final comment period for stabilization 🔔

The final piece we believe warrants discussion is the naming of the trait itself, of which the two leading candidates seem to be MakeHasher and BuildHasher. The libs team is currently leaning a little bit more towards BuildHasher.

@Gankra
Copy link
Contributor

Gankra commented Dec 18, 2015

Is it too late to submit Womb as Rust's version of a Factory? I think this would definitely resolve any and all confusion with regards to the semantics.

Definitely.

@jminer
Copy link

jminer commented Dec 25, 2015

I feel like BuildHasher fits in with the naming in the standard library better than MakeHasher. Using "build" as a verb is appropriately different from using "builder" that I wouldn't worry about this trait not having configuration methods. Also, renaming with_hasher to something like with_build_hasher would be a big improvement.

@pczarn
Copy link
Contributor

pczarn commented Jan 6, 2016

I think the name RandomState should be hidden, because the state doesn't have to be completely random, forever. To make an analogy to DefaultHasher, it can be called DefaultState.

@alexcrichton
Copy link
Member Author

@jminer could you elaborate on why you're feeling with_build_hasher is an improvement over with_hasher? In terms of what I would say out loud the latter seems better, but I agree that conventionally the former may fit better.

@pczarn Good point!

@Gankra
Copy link
Contributor

Gankra commented Jan 12, 2016

I think it does have to be random forever, at least to some extent. Changing the default would undermine the safety of applications that use HashMap. If you accept that, guaranteeing it in the name seems kind've reasonable, as a warning.

@Gankra
Copy link
Contributor

Gankra commented Jan 12, 2016

Also: I maintain that from_iter (which takes an IntoIterator, and not an Iterator) is sufficient precedent to eschew mentioning build. This is easier to find and grok intuitively, and reads nicer.

@alexcrichton
Copy link
Member Author

The libs team discussed this in triage recently and the decision was to stabilize with the names BuildHasher and with_hasher. We also didn't feel a strong enough need to rename RandomState as well.

@jminer
Copy link

jminer commented Jan 16, 2016

@alexcrichton I had two reasons for preferring with_build_hasher. One was convention as you mentioned (which is important), and the other was that it gives a stronger indication of what you pass to the method. My first thought would be to assume that you pass an object that could hash stuff and could be reset to more stuff. with_build_hasher would help me learn more quilckly that it takes a factory. Of course the parameter type says the same thing, but it takes more effort to read a generic bound and you only see the method name when reading code.

Then again, I agree that with_hasher sounds a lot better, so that might make it better.

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 26, 2016
This commit implements the stabilization of the custom hasher support intended
for 1.7 but left out due to some last-minute questions that needed some
decisions. A summary of the actions done in this PR are:

Stable

* `std::hash::BuildHasher`
* `BuildHasher::Hasher`
* `BuildHasher::build_hasher`
* `std::hash::BuildHasherDefault`
* `HashMap::with_hasher`
* `HashMap::with_capacity_and_hasher`
* `HashSet::with_hasher`
* `HashSet::with_capacity_and_hasher`
* `std::collections::hash_map::RandomState`
* `RandomState::new`

Deprecated

* `std::collections::hash_state`
* `std::collections::hash_state::HashState` - this trait was also moved into
  `std::hash` with a reexport here to ensure that we can have a blanket impl to
  prevent immediate breakage on nightly. Note that this is unstable in both
  location.
* `HashMap::with_hash_state` - renamed
* `HashMap::with_capacity_and_hash_state` - renamed
* `HashSet::with_hash_state` - renamed
* `HashSet::with_capacity_and_hash_state` - renamed

Closes rust-lang#27713
bors added a commit that referenced this issue Jan 26, 2016
This commit implements the stabilization of the custom hasher support intended
for 1.7 but left out due to some last-minute questions that needed some
decisions. A summary of the actions done in this PR are:

Stable

* `std::hash::BuildHasher`
* `BuildHasher::Hasher`
* `BuildHasher::build_hasher`
* `std::hash::BuildHasherDefault`
* `HashMap::with_hasher`
* `HashMap::with_capacity_and_hasher`
* `HashSet::with_hasher`
* `HashSet::with_capacity_and_hasher`
* `std::collections::hash_map::RandomState`
* `RandomState::new`

Deprecated

* `std::collections::hash_state`
* `std::collections::hash_state::HashState` - this trait was also moved into
  `std::hash` with a reexport here to ensure that we can have a blanket impl to
  prevent immediate breakage on nightly. Note that this is unstable in both
  location.
* `HashMap::with_hash_state` - renamed
* `HashMap::with_capacity_and_hash_state` - renamed
* `HashSet::with_hash_state` - renamed
* `HashSet::with_capacity_and_hash_state` - renamed

Closes #27713
alexcrichton added a commit to alexcrichton/rust that referenced this issue Feb 3, 2016
This commit implements the stabilization of the custom hasher support intended
for 1.7 but left out due to some last-minute questions that needed some
decisions. A summary of the actions done in this PR are:

Stable

* `std::hash::BuildHasher`
* `BuildHasher::Hasher`
* `BuildHasher::build_hasher`
* `std::hash::BuildHasherDefault`
* `HashMap::with_hasher`
* `HashMap::with_capacity_and_hasher`
* `HashSet::with_hasher`
* `HashSet::with_capacity_and_hasher`
* `std::collections::hash_map::RandomState`
* `RandomState::new`

Deprecated

* `std::collections::hash_state`
* `std::collections::hash_state::HashState` - this trait was also moved into
  `std::hash` with a reexport here to ensure that we can have a blanket impl to
  prevent immediate breakage on nightly. Note that this is unstable in both
  location.
* `HashMap::with_hash_state` - renamed
* `HashMap::with_capacity_and_hash_state` - renamed
* `HashSet::with_hash_state` - renamed
* `HashSet::with_capacity_and_hash_state` - renamed

Closes rust-lang#27713
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-collections Area: `std::collection` B-unstable Blocker: Implemented in the nightly compiler and unstable. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests