Tracking issue for custom hashers in HashMap #27713

alexcrichton · 2015-08-12T17:48:08Z

This is a tracking issue for the unstable hashmap_hasher feature in the standard library. This provides the ability to create a HashMap with a custom hashing implementation that abides by the HashState trait. This has already been used quite a bit in the compiler itself as well as in Servo I believe.

Some notable points to consider:

Is an extra HashState trait really necessary?
Is the naming of HashState correct?
Is the aggressive use of Default appropriate here?
Can the new constructor be leveraged to create hash maps that use a hasher implementing Default? Right now the new constructor only works with RandomState.
How ergonomic is it to create a hash map using a custom Hasher implementation? In theory it should be quite ergonomic.

cc @gankro

The text was updated successfully, but these errors were encountered:

tomaka · 2015-08-12T20:55:20Z

I don't know much about hashing functions, but I just want to say that this one is very important for performances.

The SipHasher is very slow, and always eats up between 5% and 20% of the CPU time of each of my projects. By comparison the FnvHasher is around 6 times faster on my machine according to some quick benchmarking.

The purpose of the SipHasher is to protect against DDoSes, but the vast majority of the applications written in Rust never even open a socket.

bluss · 2015-08-12T21:02:43Z

SipHash looks good on long enough data, but it is slow on small values like integers. Its fixed four round finalization at the end is an example of overhead that is comparatively larger when hashing small values.

Gankra · 2015-08-12T21:17:59Z

See: https://github.com/shepmaster/twox-hash/blob/master/README.md for scaling comparison.

Based on these results (which I haven't vetted for quality), one can conclude that SipHash is actually a pretty good "general purpose" hasher. Fnv doesn't scale as well to longer strings (> 32 bytes), and XXHash doesn't scale well to smaller strings (< 32 bytes).

Gankra · 2015-08-12T21:19:03Z

CC @frankmcsherry @pczarn @cgaebel @shepmaster

bluss · 2015-08-12T21:27:21Z

(XXhash was constructed to be a fast checksum for gigabytes of lz4 data, so that's what it is good at)

alexcrichton · 2015-08-21T21:32:01Z

If HashState is stabilized then we should also ensure that SipHasher isn't exposed as the default hash algorithm on hash maps. Right now the associated type Hasher in RandomState's HashState implementation exposes this and we'll want to wrap that up in something like DefaultHasher instead.

Gankra · 2015-08-23T13:33:54Z

Just noting some discussion that was tangentially related to this:

The current hasher infra heavily penalizes Farmhash (and cityhash, and murmurhash), which wants to branch on the size of the input to choose the "optimal" algorithm. Community's current solution is to just buffer in a Vec until finish is called. This appears to be overkill, at least in the case of Farmhash.

However more generally it would be nice (even for SipHash) to be able to to indicate to the hasher "there is only one thing to hash, here it is, now give me the final hash". Sketch of design:

add:

pub trait Hasher {
    ...
    /// Hash only the given bytes, and immediately finalize.
    /// Results are unspecified if other bytes were previously written to this hasher
    fn write_only(&mut self, bytes: &[u8]) -> u64 {
        self.write(bytes);
        self.finish()
    }
}

pub trait Hash {
    ...
    /// Hashes only this value
    fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
        self.hash(state);
        state.finish()
    }
}

So far, so pointless. But back-compat!

But now hashers can override write_only to something optimized, and Hash impls can specialize in the following way:

// #[derive]d impl
impl Hash {
   fn hash() { ... } // same-old

   #[if(only_has_one_field)]  // magic
   fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
      self.only_field.hash_one_shot(state)
   }
}

And key types like slice, str, primitives, etc can manually impl hash_one_shot:

fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
    state.write_only(self.as_slice())
}

However we must be wary of #5257!

In particular, slices and strs must mix in some "bonus" value so that ("aa", "a") does not hash the same as ("a", "aa"). My proposed solution to this is simple, but sketchy: have has_one_shot and hash do different things. one_shot knows it's the only thing in existence, and so doesn't need to mix in the bonus data to prevent concatenation collisions. hash doesn't, so it has to conservatively mix in the bonus data.

This seems really sketchy, but I can't think of a situation where this would actually break things. You will get potentially curious behaviour, where &["hello"] doesn't hash to the same thing as "hello" or ("hello",), but this doesn't seem like a serious problem. They're fundamentally different types! Anyone relying on them hashing the same is in a pretty dubious position. Alternatively, in the case of slices, you could actually have them branch at runtime on their len to determine whether to call hash or hash_one_shot on their elements.

I've been flying around the country and haven't slept though.

Gankra · 2015-08-23T13:38:26Z

Incidentally, this proposal soft-fixes #27108 (&[u8] and &str would indeed hash_one_shot to equal values).

ranma42 · 2015-08-24T12:36:13Z

To me, the naming of Hasher and HashState is quite unintuitive. I expected the hash state (in many other libs, see openssl and crypto++, named "context") to be updated and finalised (using .write() and .finish()), but they actually belong to the Hasher trait.

bluss · 2015-08-27T16:58:27Z

@gankro I'd like to explore a simpler model, where the Hash trait remains the same (for backwards compatibility and sanity), and the Hasher is extended. Changing str and slice hashes by default too. A Wip is up at #28044

ranma42 · 2015-08-27T19:12:19Z

@gankro I'm exploring a slightly different design here: https://github.com/ranma42/rust-hash/tree/master/src

Besides from the naming, which can obviously be made more consistent with that currently used in Rust, the main difference is that the stream-oriented part of the code is aggressively inlined. Ideally I would want the implementation of one-shot digest to be automatically generated by the compiler, which should be able to get rid of the streaming overhead simply with constant folding and dead code elimination.

This actually happens in the branch I posted, namely the digest method for [u8] does not contain the tail and ntail variables nor the buffered tail flushing loop. Even better, for u8 the length variable gets optimised away, too.

The usual structure of hash functions makes me wonder whether there is much to gain by having write_u8, write_u16, etc. My main objection is that most hash functions do not consume a single byte, but a block of bytes every "step". This means that some kind of buffering and re-alignment is going to be needed in most cases, probably removing any opportunity for effective optimisations.
I believe that a much more effective approach is to have digest_u8, digest_u16 and so on (in the equivalent of my HashFunction trait, which I believe currently is HashState), because this would ensure that the initialisation and finalisation of the hash context can be folded into the computation and that no re-alignment is needed. Additionally for the primitive types {u,i}_{size,8,16,32,64} the fixed length means that the code for most hash functions would be very easy to automatically optimise by the compiler.
I wonder if it might even make sense to add a dispatch for [u8] slices that optimises a small number of fixed sizes (basically arrays up to a certain size)... I will know after some more experimentation ;)

malbarbo · 2015-10-28T23:48:04Z

An important point is to allow setting the seed for the HashState (or the equivalent). Some applications (like simulators) must have reproducible executions, so its necessary to specify the same seed for the hasher to get the same iteration order on HashSet and HashMap.

Gankra · 2015-10-28T23:51:48Z

@malbarbo Yes, this is well-handled by the with_hash_state method. You make your own state with the desired seed, and then pass it to the HashMap.

aldanor · 2015-10-29T12:19:05Z

@gankro the whole problem here is that with_hash_state is unstable...

alexcrichton · 2015-11-04T01:48:22Z

Nominating for 1.6 discussion

Gankra · 2015-11-04T23:26:07Z

CC @shepmaster

As one of the hash-thing maintainers, are you happy with this API?

alexcrichton · 2015-11-05T17:46:48Z

🔔 This issue is now entering its cycle-long final comment period for stabilization 🔔

The libs team decided that this API is likely ready for stabilization pending a final audit of the ergonomics and usage. If anything arises it will be bumped out of FCP for a later cycle.

shepmaster · 2015-11-07T16:17:02Z

From the point-of-view of a consumer of the API, I've found it to be pleasing enough to use, especially because of the Default implementations. Basically I've just added the hasher's type parameter to my hashmap's declaration and made sure to use Default::default() to construct it instead of HashMap::new():

use std::collections::HashMap;
use twox_hash::RandomXxHashState;

let mut hash: HashMap<_, _, RandomXxHashState> = Default::default();
hash.insert(42, "the answer");
assert_eq!(hash.get(&42), Some(&"the answer"));

From an implementor point-of-view, I'll softly agree with @ranma42's point:

To me, the naming of Hasher and HashState is quite unintuitive.

In my mind, the state of the hash is the internal bits-and-bobs that change every time you add more data to be hashed. In the usage of twox-hash, it's really more like the "state" is a seed and the "hasher" contains the real state.

However, looking back from a user POV, neither "state" nor "seed" sound great when you write the declaration (HashMap<_, _, RandomXxHashState>).

Naming is hard!

Falling into Java naming land, the parameter to the HashMap is really a factory. XxHashFactory is OK, but then we add the complication of the randomness. RandomXxHashFactory sounds like it creates a randomly-seeded XxHash each time you call it, but in reality it generates the random number once when the factory is constructed. Doing the former doesn't work so well, ask me how I know 😇 .

From a pragmatic POV, there will probably never be more than ~100 implementations of hash algorithms, so it's a very small number of people that will have to struggle with this. That may mean it's not worth spending enormous amounts of time on this naming.

There will be many more people who use a custom hash algorithm, so as long as the implementations have a nice bit of example of "use it like this", it will probably be just fine.

alexcrichton · 2015-11-09T16:44:04Z

Thanks for weighing in @shepmaster! I agree that the naming here perhaps isn't the best, although I am also at a bit of a loss of what would fit as a good name.

For construction of a custom hash map, though, you can even use HashMap::default() as the method name instead of Default::default but I suspect that'll have more of an impact on aesthetics than anything because the hash algorithm otherwise still needs to be annotated (requiring a HashMap elsewhere).

ranma42 · 2015-11-10T06:32:15Z

@alexcrichton I mentioned that context (HashContext) might be a better name for the "state" (at least more in line with the standard naming in OpenSSL and crypto++), i.e. the object in which you accumulate a stream/message (usually through an update operation, not a write) and which is finalised to a message digest.

In the stub library I wrote, I named HashFunction the trait which exposes an operation to convert a byte sequence (&[u8]) to a message digest (something like hash_one_shot, which in a way is the usual concept of "hash function", something that maps byte seqs to constant-sized message digests) and an additional operation to extract a new HashContext. In a way, HashFunction roughly corresponds to HashState (except that it adds hash_one_shot), while HashContext corresponds to Hasher in std.

Would it still be possible to rename these traits (and maybe their methods)? Although the names I proposed may not be much easier to understand, they would align with the existing nomenclature. This would at least make them easier to use for people which is already using them in other languages (and it would provide nice consistency for anybody writing library bindings/wrappers).

alexcrichton · 2015-11-10T17:31:47Z

Hm yeah HashContext may make more sense, although what's stored in HashMap is actually distinct from the object which you feed data into. The actual hash algorithm's state may require more storage than what's needed to create a fresh new state. For example FNV hashing requires no "startup state" whereas SipHash requires 128 bits. In both cases, though, the intermediate state during hashing is larger.

Regardless I think it's still possible to tweak some naming here. If larger changes happen, however, then I think we'll have to punt on this until another cycle.

ranma42 · 2015-11-11T08:36:24Z

@alexcrichton Yes, in the HashMap you would be storing a HashFunction object, not a HashContext, as you want something that can map keys to digest values, not an object in which you accumulate messages to obtain a single digest upon finalisation.

In my stub library, the object hierarchy is designed so that the additional state (seeds, initialization vectors, number of ruounds/passes, any parameters) of the hashing algorithm belongs to the object implementing the HashFunction trait. Hash functions are thus seen as a family of functions parametrised on seed & co.

This has the convenient advantage that in most cases the initialisation data does not need to be stored in the context (for example, the SipContext here does not contain k0 or k1) https://github.com/ranma42/rust-hash/blob/master/src/sip.rs#L25 . The tradeoff is that SipContext is unable to reset its own state, you need to get a new one from a SipHashFunction object.
The naming I used is mostly based on the idea that a HashFunction is meant to be a function from &[u8] to the digest type, so any additional parameters of the hash algorithm can be used to define which specific hash function you are looking for.

I did some more work on this, but I did not clean it up and push it to the public repo, but since there seem to be some interest, I will try to get around to it this weekend.

shepmaster · 2015-12-04T14:18:40Z

the builder pattern in Rust already has a concept [...] which this isn't quite the same as

I guess I'd disagree. For example, I could see some code like:

trait BuildHasher {
    type Hasher;

    fn build_hasher(&self) -> Self::Hasher;
}

struct MyCoolHash {
    seed: u32,
    stream: u32,
}

struct MyCoolHashBuilder {
    seed: u32,
    stream: u32,
}

impl MyCoolHashBuilder {
    fn new() -> Self {
        MyCoolHashBuilder {
            seed: 0, // make a random seed
            stream: 1, // make a random stream
        }
    }

    fn seed(self, seed: u32) -> Self {
        MyCoolHashBuilder {
            seed: seed,
            ..self
        }
    }

    fn stream(self, stream: u32) -> Self {
        MyCoolHashBuilder {
            stream: stream,
            ..self
        }
    }
}

impl BuildHasher for MyCoolHashBuilder {
    type Hasher = MyCoolHash;

    fn build_hasher(&self) -> Self::Hasher {
        MyCoolHash {
            seed: self.seed,
            stream: self.stream,
        }
    }
}

trait BuildHasher {
    type Hasher;

    fn build_hasher(&self) -> Self::Hasher;
}

struct MyCoolHash {
    seed: u32,
    stream: u32,
}

struct MyCoolHashBuilder {
    seed: u32,
    stream: u32,
}

impl MyCoolHashBuilder {
    fn new() -> Self {
        MyCoolHashBuilder {
            seed: 0, // make a random seed
            stream: 1, // make a random stream
        }
    }

    fn seed(self, seed: u32) -> Self {
        MyCoolHashBuilder {
            seed: seed,
            ..self
        }
    }

    fn stream(self, stream: u32) -> Self {
        MyCoolHashBuilder {
            stream: stream,
            ..self
        }
    }
}

impl BuildHasher for MyCoolHashBuilder {
    type Hasher = MyCoolHash;

    fn build_hasher(&self) -> Self::Hasher {
        MyCoolHash {
            seed: self.seed,
            stream: self.stream,
        }
    }
}

fn main() {
    use std::collections::HashMap;

    let x = MyCoolHashBuilder::new().seed(42).stream(1);
    HashMap::with_hasher(x);
}

There's two parts to the builder pattern - the interesting and unique configuration and the actual building. The HashMap only cares about the building aspect, but there's no reason that the configuration couldn't follow the familiar configuration part.

The biggest downside I see is that I'd expect that most hashing is going to have a small number of knobs to tweak, which means the number of configuration methods would be small (or zero!).

alexcrichton · 2015-12-04T17:30:20Z

Oh interesting! That's actually a good point that the trait could be considered the "final step" rather than the configuration up front, and along those lines I'd be pretty cool with BuildHasher as a trait anme and with_hash_builder as the method names perhaps?

aturon · 2015-12-04T17:32:15Z

@alexcrichton I'd still argue for the proposed MakeHasher, because the trait doesn't cover any builder API, it just covers construction. I think it's perfectly clear.

alexcrichton · 2015-12-04T17:34:47Z

I suppose I personally prefer the name "hash builder" or BuildHasher, I just needed a way to justify naming it that :)

We don't have that many instances of "make" in libstd I think, the ones I could find are:

make_ascii_{upper,lower}case - but this is unstable (and I'm not a huge fan of the name)
make_place - also unstable and likely to be renamed

I feel that "build", however, at least does show up throughout libstd

aturon · 2015-12-04T17:38:45Z

I agree that build is more common -- but specifically in the context of a builder API, which this is at best a degenerate version of.

(That said, I won't stand strongly in the way of BuildHasher).

bluss · 2015-12-04T17:58:28Z

We're avoiding factory for no reason at all, that's fine, builder can be our own factory. I think BuildHasher is better than MakeHasher. However, no reason to make a builder style API for it just because of the name. Utility first.

alexcrichton · 2015-12-17T22:26:53Z

🔔 This issue is now entering its final comment period for stabilization 🔔

The final piece we believe warrants discussion is the naming of the trait itself, of which the two leading candidates seem to be MakeHasher and BuildHasher. The libs team is currently leaning a little bit more towards BuildHasher.

Gankra · 2015-12-18T06:36:33Z

Is it too late to submit Womb as Rust's version of a Factory? I think this would definitely resolve any and all confusion with regards to the semantics.

Definitely.

jminer · 2015-12-25T22:50:34Z

I feel like BuildHasher fits in with the naming in the standard library better than MakeHasher. Using "build" as a verb is appropriately different from using "builder" that I wouldn't worry about this trait not having configuration methods. Also, renaming with_hasher to something like with_build_hasher would be a big improvement.

pczarn · 2016-01-06T21:45:10Z

I think the name RandomState should be hidden, because the state doesn't have to be completely random, forever. To make an analogy to DefaultHasher, it can be called DefaultState.

alexcrichton · 2016-01-11T21:17:38Z

@jminer could you elaborate on why you're feeling with_build_hasher is an improvement over with_hasher? In terms of what I would say out loud the latter seems better, but I agree that conventionally the former may fit better.

@pczarn Good point!

Gankra · 2016-01-12T19:54:48Z

I think it does have to be random forever, at least to some extent. Changing the default would undermine the safety of applications that use HashMap. If you accept that, guaranteeing it in the name seems kind've reasonable, as a warning.

Gankra · 2016-01-12T19:59:59Z

Also: I maintain that from_iter (which takes an IntoIterator, and not an Iterator) is sufficient precedent to eschew mentioning build. This is easier to find and grok intuitively, and reads nicer.

alexcrichton · 2016-01-15T17:48:03Z

The libs team discussed this in triage recently and the decision was to stabilize with the names BuildHasher and with_hasher. We also didn't feel a strong enough need to rename RandomState as well.

jminer · 2016-01-16T22:19:57Z

@alexcrichton I had two reasons for preferring with_build_hasher. One was convention as you mentioned (which is important), and the other was that it gives a stronger indication of what you pass to the method. My first thought would be to assume that you pass an object that could hash stuff and could be reset to more stuff. with_build_hasher would help me learn more quilckly that it takes a factory. Of course the parameter type says the same thing, but it takes more effort to read a generic bound and you only see the method name when reading code.

Then again, I agree that with_hasher sounds a lot better, so that might make it better.

This commit implements the stabilization of the custom hasher support intended for 1.7 but left out due to some last-minute questions that needed some decisions. A summary of the actions done in this PR are: Stable * `std::hash::BuildHasher` * `BuildHasher::Hasher` * `BuildHasher::build_hasher` * `std::hash::BuildHasherDefault` * `HashMap::with_hasher` * `HashMap::with_capacity_and_hasher` * `HashSet::with_hasher` * `HashSet::with_capacity_and_hasher` * `std::collections::hash_map::RandomState` * `RandomState::new` Deprecated * `std::collections::hash_state` * `std::collections::hash_state::HashState` - this trait was also moved into `std::hash` with a reexport here to ensure that we can have a blanket impl to prevent immediate breakage on nightly. Note that this is unstable in both location. * `HashMap::with_hash_state` - renamed * `HashMap::with_capacity_and_hash_state` - renamed * `HashSet::with_hash_state` - renamed * `HashSet::with_capacity_and_hash_state` - renamed Closes rust-lang#27713

This commit implements the stabilization of the custom hasher support intended for 1.7 but left out due to some last-minute questions that needed some decisions. A summary of the actions done in this PR are: Stable * `std::hash::BuildHasher` * `BuildHasher::Hasher` * `BuildHasher::build_hasher` * `std::hash::BuildHasherDefault` * `HashMap::with_hasher` * `HashMap::with_capacity_and_hasher` * `HashSet::with_hasher` * `HashSet::with_capacity_and_hasher` * `std::collections::hash_map::RandomState` * `RandomState::new` Deprecated * `std::collections::hash_state` * `std::collections::hash_state::HashState` - this trait was also moved into `std::hash` with a reexport here to ensure that we can have a blanket impl to prevent immediate breakage on nightly. Note that this is unstable in both location. * `HashMap::with_hash_state` - renamed * `HashMap::with_capacity_and_hash_state` - renamed * `HashSet::with_hash_state` - renamed * `HashSet::with_capacity_and_hash_state` - renamed Closes #27713

This commit implements the stabilization of the custom hasher support intended for 1.7 but left out due to some last-minute questions that needed some decisions. A summary of the actions done in this PR are: Stable * `std::hash::BuildHasher` * `BuildHasher::Hasher` * `BuildHasher::build_hasher` * `std::hash::BuildHasherDefault` * `HashMap::with_hasher` * `HashMap::with_capacity_and_hasher` * `HashSet::with_hasher` * `HashSet::with_capacity_and_hasher` * `std::collections::hash_map::RandomState` * `RandomState::new` Deprecated * `std::collections::hash_state` * `std::collections::hash_state::HashState` - this trait was also moved into `std::hash` with a reexport here to ensure that we can have a blanket impl to prevent immediate breakage on nightly. Note that this is unstable in both location. * `HashMap::with_hash_state` - renamed * `HashMap::with_capacity_and_hash_state` - renamed * `HashSet::with_hash_state` - renamed * `HashSet::with_capacity_and_hash_state` - renamed Closes rust-lang#27713

alexcrichton added A-collections Area: `std::collection` T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Aug 12, 2015

Ms2ger mentioned this issue Aug 16, 2015

Tracking: Unstable Rust feature gates used by Servo servo/servo#5286

Closed

99 tasks

apasel422 mentioned this issue Oct 1, 2015

Consider make it build with stable rustc contain-rs/linked-hash-map#28

Closed

alexcrichton added the I-nominated label Nov 4, 2015

alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Nov 5, 2015

alexcrichton added the I-nominated label Dec 16, 2015

alexcrichton removed the I-nominated label Dec 17, 2015

alexcrichton mentioned this issue Jan 15, 2016

std: Stabilize APIs for the 1.7 release #30943

Merged

alexcrichton mentioned this issue Jan 21, 2016

std: Stabilize custom hasher support in HashMap #31081

Merged

dpc mentioned this issue Jan 22, 2016

Move to rustc stable. dpc/rhex#6

Closed

bors closed this as completed in #31081 Jan 26, 2016

cardoe mentioned this issue Apr 14, 2016

Docopt can cause a process to block when used before random pool is initialized docopt/docopt.rs#178

Closed

tosti007 mentioned this issue Mar 29, 2023

Use HashMap::with_capacity_and_hasher instead of using base #109214

Merged

Tracking issue for custom hashers in HashMap #27713

Tracking issue for custom hashers in HashMap #27713

Comments

alexcrichton commented Aug 12, 2015

tomaka commented Aug 12, 2015

bluss commented Aug 12, 2015

Gankra commented Aug 12, 2015

Gankra commented Aug 12, 2015

bluss commented Aug 12, 2015

alexcrichton commented Aug 21, 2015

Gankra commented Aug 23, 2015

Gankra commented Aug 23, 2015

ranma42 commented Aug 24, 2015

bluss commented Aug 27, 2015

ranma42 commented Aug 27, 2015

malbarbo commented Oct 28, 2015

Gankra commented Oct 28, 2015

aldanor commented Oct 29, 2015

alexcrichton commented Nov 4, 2015

Gankra commented Nov 4, 2015

alexcrichton commented Nov 5, 2015

shepmaster commented Nov 7, 2015

alexcrichton commented Nov 9, 2015

ranma42 commented Nov 10, 2015

alexcrichton commented Nov 10, 2015

ranma42 commented Nov 11, 2015

shepmaster commented Dec 4, 2015

alexcrichton commented Dec 4, 2015

aturon commented Dec 4, 2015

alexcrichton commented Dec 4, 2015

aturon commented Dec 4, 2015

bluss commented Dec 4, 2015

alexcrichton commented Dec 17, 2015

Gankra commented Dec 18, 2015

jminer commented Dec 25, 2015

pczarn commented Jan 6, 2016

alexcrichton commented Jan 11, 2016

Gankra commented Jan 12, 2016

Gankra commented Jan 12, 2016

alexcrichton commented Jan 15, 2016

jminer commented Jan 16, 2016