std: Stabilize the std::hash module #19673

alexcrichton · 2014-12-09T22:39:59Z

This commit aims to stabilize the std::hash module by standardizing on its
hashing interface while rationalizing the current usage with the HashMap and
HashSet types. The primary goal of this slight redesign is to separate the
concepts of a hasher's state from a hashing algorithm itself.

The primary change of this commit is to separate the Hasher trait into a
Hasher and a HashState trait. Conceptually the old Hasher trait was
actually just a factory for various states, but hashing had very little control
over how these states were used. Additionally the old Hasher trait was
actually fairly unrelated to hashing.

This commit redesigns the existing Hasher trait to match what the notion of a
Hasher normally implies with the following definition:

trait Hasher: Writer {
    type Output;

    fn reset(&mut self);
    fn finish(&self) -> Self::Output;
}

Note that the Output associated type is currently a type parameter due to
associated types not being fully implemented yet. This new Hasher trait
emphasizes that all hashers are sinks for bytes, and hashing algorithms may
produce outputs other than a u64, so the output type is made generic.
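
A minimal sketch of this shape in present-day Rust syntax may help. The `Writer` trait mirrors the one in this PR; `Fnv32` is a toy FNV-1a hasher picked only because it is short and has a non-u64 output. It is an illustration, not something this PR adds.

```rust
/// Byte sink, mirroring the `Writer` trait in this PR.
trait Writer {
    fn write(&mut self, bytes: &[u8]);
}

/// The proposed `Hasher`: a byte sink that can be reset and finished.
trait Hasher: Writer {
    type Output;

    fn reset(&mut self);
    fn finish(&self) -> Self::Output;
}

/// Toy 32-bit FNV-1a hasher (illustrative only, not part of the PR).
struct Fnv32 {
    state: u32,
}

const FNV32_OFFSET: u32 = 0x811c_9dc5;
const FNV32_PRIME: u32 = 0x0100_0193;

impl Fnv32 {
    fn new() -> Fnv32 {
        Fnv32 { state: FNV32_OFFSET }
    }
}

impl Writer for Fnv32 {
    fn write(&mut self, bytes: &[u8]) {
        // FNV-1a: xor the byte in, then multiply by the prime.
        for &b in bytes {
            self.state ^= b as u32;
            self.state = self.state.wrapping_mul(FNV32_PRIME);
        }
    }
}

impl Hasher for Fnv32 {
    type Output = u32;

    fn reset(&mut self) {
        self.state = FNV32_OFFSET;
    }

    fn finish(&self) -> u32 {
        self.state
    }
}
```

Because `finish` takes `&self`, the same hasher can keep absorbing bytes afterwards, and `reset` returns it to its initial state without building a new value.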

With this definition, the old Hasher trait is realized as a new HashState
trait in the collections::hash_state module as an experimental addition for
now. The current definition looks like:

trait HashState {
    type H: Hasher;
    fn hasher(&self) -> Self::H;
}

Note that the H associated type (along with its O output) are both type
parameters on the HashState trait due to the current limitations of associated
types. The purpose of this trait is to emphasize that the one piece of
functionality for implementors is that new instances of Hasher can be created.
This conceptually represents the two keys from which more instances of a
SipHasher can be created, and a HashState is what's stored in a HashMap,
not a Hasher.
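
To make the factory idea concrete, here is a hedged sketch using the trait of the same shape that present-day std exposes as `std::hash::BuildHasher`. `KeyedState` and `KeyedFnv` are hypothetical names, and the keyed FNV variant is a toy stand-in for SipHash with no DoS resistance.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

/// Toy keyed hasher (illustrative only; not SipHash, not DoS-resistant).
struct KeyedFnv {
    state: u64,
}

impl Hasher for KeyedFnv {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.state ^= b as u64;
            self.state = self.state.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// The "state": two keys from which fresh hashers are created.
#[derive(Clone, Copy)]
struct KeyedState {
    k0: u64,
    k1: u64,
}

impl BuildHasher for KeyedState {
    type Hasher = KeyedFnv;

    fn build_hasher(&self) -> KeyedFnv {
        // Every new hasher starts from the same key-derived state.
        KeyedFnv {
            state: 0xcbf2_9ce4_8422_2325 ^ self.k0 ^ self.k1.rotate_left(32),
        }
    }
}

/// The map stores the `KeyedState` and asks it for a hasher when needed.
fn demo_map() -> HashMap<&'static str, u32, KeyedState> {
    let mut map = HashMap::with_hasher(KeyedState { k0: 1, k1: 2 });
    map.insert("a", 1);
    map.insert("b", 2);
    map
}
```

The map never holds a `KeyedFnv` between operations; it only keeps the two keys and builds a hasher on demand, which is the separation described above.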

Implementors of custom hash algorithms should implement the Hasher trait, and
only hash algorithms intended for use in hash maps need to implement or worry
about the HashState trait.

Some other stability decisions made for the std::hash module are:

* The name of the module, hash, is #![stable].
* The Hash and Hasher traits are #[unstable] due to type parameters that
  want to be associated types.
* The Writer trait remains #[experimental] as it's intended to be replaced
  with an io::Writer (more details soon).
* The top-level hash function is #[unstable] as it is intended to be generic
  over the hashing algorithm instead of hardwired to SipHasher.
* The inner sip module is now private as its one export, SipHasher, is
  reexported in the hash module.

There are many breaking changes outlined above, and as a result this commit is
a:

[breaking-change]

Gankra · 2014-12-09T22:47:42Z

CC @pczarn @cgaebel @thestinger

thestinger · 2014-12-09T22:53:30Z

Hashes should have the option of single pass implementations rather than ones with state and this doesn't seem like it's enough to provide that.

tbu- · 2014-12-10T01:03:50Z

Does single-pass mean that the hasher gets to see e.g. the whole array of bytes instead of getting a look at them one at a time?

alexcrichton · 2014-12-10T03:51:47Z

@thestinger I'm not quite sure what you mean by "single pass", could you elaborate? Implementations of Hash are still allowed to specialize on the kind of Hasher used, so they can call methods beyond what Writer provides (see the tests for an example).

thestinger · 2014-12-10T20:00:01Z

@tbu-: Yes, it means it doesn't have to be a state machine.

thestinger · 2014-12-10T20:08:29Z

@alexcrichton: The fact that the default hash implementations don't have single pass implementations for contiguous blocks of memory is a major performance issue and should be addressed in the design. I don't see how it's ready to be stabilized if these issues haven't even been considered.

alexcrichton · 2014-12-10T22:40:34Z

I'm trying to drill down into something more specific because what you've said so far sounds like this is already possible with the PR. I would like to make your concerns more concrete because the initial design of the Hash module had many performance considerations taken into account, and this proposed tweaking did not give up on any of that generality.

Specifically:

* If you literally have one large block of memory, is write() not sufficient?
* If you have a specific hashing optimization, is an implementation of Hash<SpecificHasher> not sufficient?

It would be helpful for you to be a little more concrete in your concerns to make sure that we can address them!
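
A rough sketch of both options, in present-day Rust syntax. `HashWith<S>` is a hypothetical stand-in for the PR's `Hash<S>`, and `Fnv64` and `Blob` are illustrative types that do not appear in the PR.

```rust
/// Byte sink, mirroring the `Writer` trait in this PR.
trait Writer {
    fn write(&mut self, bytes: &[u8]);
}

/// Hypothetical stand-in for the PR's `Hash<S>`: hashing is parameterized
/// over the hasher type, so an impl can target one specific hasher.
trait HashWith<S> {
    fn hash_with(&self, state: &mut S);
}

/// Toy 64-bit FNV-1a hasher (illustrative only).
struct Fnv64(u64);

impl Fnv64 {
    fn new() -> Fnv64 {
        Fnv64(0xcbf2_9ce4_8422_2325)
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

impl Writer for Fnv64 {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
}

/// Generic path: works for any `Writer`, one byte per `write` call.
impl<S: Writer> HashWith<S> for u8 {
    fn hash_with(&self, state: &mut S) {
        state.write(&[*self]);
    }
}

/// A contiguous block of bytes.
struct Blob(Vec<u8>);

/// Specialized path: for this one hasher, hand over the whole block at once.
impl HashWith<Fnv64> for Blob {
    fn hash_with(&self, state: &mut Fnv64) {
        state.write(&self.0);
    }
}

/// Hashes the bytes one element at a time through the generic impl.
fn hash_bytewise(bytes: &[u8]) -> u64 {
    let mut state = Fnv64::new();
    for b in bytes {
        b.hash_with(&mut state);
    }
    state.finish()
}

/// Hashes the block through the specialized impl: a single `write` call.
fn hash_blob(blob: &Blob) -> u64 {
    let mut state = Fnv64::new();
    blob.hash_with(&mut state);
    state.finish()
}
```

For a streaming hasher like this toy one, both paths produce the same value; they differ only in how many `write` calls are made.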

pczarn · 2014-12-11T22:15:49Z

src/libcollections/hash/mod.rs

 pub trait Writer {
    fn write(&mut self, bytes: &[u8]);
 }

+/// Hash a value with the default SipHasher algorithm (two initial keys of 0).
+///
+/// The speified value will be hashed with this hasher and then the resulting


speified -> specified

tbu- · 2014-12-11T23:21:39Z

@alexcrichton I believe that this is not possible due to the current standard implementation for any slice that just hashes the members separately.

thestinger · 2014-12-12T00:09:17Z

@alexcrichton: A single-pass implementation is more efficient than a stateful one even if it's only using a single call to write. The infrastructure to do this for collections like slices is also missing.

alexcrichton · 2014-12-12T17:12:03Z

@tbu- I agree that we can't generically hash a slice of bytes as just a call to write, as this would require specialization. Without giving up the Hash implementation for slices in general, though, I'm not sure that there's a way around this.

@thestinger is this what you are referring to? I still don't understand your concerns about state because with the specialization of the hasher type parameter on the Hash trait you can always bypass all internal state via bytes anyway. Can you also elaborate more on what infrastructure you're talking about? Do you think that hashers should take a slice of T to hash in bulk?

pczarn · 2014-12-15T10:24:14Z

> A single-pass implementation is more efficient than a stateful one even if it's only using a single call to write.

Calls to write, however many, must be followed by a call to finish. This overhead is constant. Is there some other fundamental difference I'm missing?

I'm assuming the problem is that LLVM doesn't understand the semantics of write enough to optimize (inlined) invocations. Further, hash is called individually on each value. However, contiguous slices can be merged before they are passed to write:

pub trait MyHash<S: Writer = SipHasher> for Sized? {
    fn myhash<'a>(&'a self, state: &mut S, msg: &mut Option<&'a [u8]>);
}

pub fn myhash<T: MyHash<SipHasher>>(value: &T) -> u64 {
    let mut state = SipState::new();
    let mut msg = None;
    value.myhash(&mut state, &mut msg);
    if let Some(tail_msg) = msg {
        state.write(tail_msg);
    };
    state.finish()
}

// for primitive integer types
// impl<S: Writer> MyHash<S> for $ty
fn myhash<'a>(&'a self, state: &mut S, msg: &mut Option<&'a [u8]>) {
    if (*self as $ty).to_le() == *self as $ty { unsafe {
        let a = slice::from_raw_buf(mem::transmute(&self), mem::size_of::<$ty>());
        let a_ptr = a.as_ptr();

        match msg {
            &Some(ref mut m) if a_ptr == m.as_ptr().offset(m.len() as int) => {
                // Extend the pending slice; the result must be wrapped in Some.
                *msg = Some(slice::from_raw_buf(mem::transmute(&m.as_ptr()),
                                                m.len() + a.len()));
            }
            _ => {
                if let Some(msg_part) = *msg { state.write(msg_part); }
                *msg = Some(a);
            }
        }
    } } else {
        let a: [u8, ..::core::$ty::BYTES] = unsafe {
            mem::transmute((*self as $ty).to_le() as $ty)
        };
        if let Some(msg_part) = *msg { state.write(msg_part); }
        state.write(a.as_slice());
        *msg = None;
    }
}

I can only imagine it working efficiently with generous inlining and optimizations. Currently, the check for a.as_ptr() == msg_end isn't optimized out in code that hashes a slice, but slices become contiguous nonetheless. Maybe aliasing information could help here?
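
The merging step on its own can be sketched in present-day Rust roughly as follows. `merge_adjacent` and `write_coalesced` are hypothetical helpers, not part of this PR, and the sketch assumes every slice is borrowed from a single allocation, which an address comparison alone cannot prove.

```rust
use std::slice;

/// Returns `a` followed by `b` as one slice if `b` starts exactly where `a`
/// ends, and `None` otherwise.
///
/// # Safety
///
/// Hypothetical helper. The caller must guarantee that `a` and `b` were
/// borrowed from the same allocation (for example, two sub-slices of one
/// array); matching addresses alone do not prove that.
unsafe fn merge_adjacent<'a>(a: &'a [u8], b: &'a [u8]) -> Option<&'a [u8]> {
    let a_end = a.as_ptr().wrapping_add(a.len());
    if a_end == b.as_ptr() {
        // SAFETY: per the contract above, both slices lie in one allocation
        // and are contiguous, so the combined range is valid for reads.
        Some(unsafe { slice::from_raw_parts(a.as_ptr(), a.len() + b.len()) })
    } else {
        None
    }
}

/// Feeds `parts` to `write`, coalescing neighbouring slices into one call.
///
/// # Safety
///
/// Same contract as `merge_adjacent`: all parts come from one allocation.
unsafe fn write_coalesced<'a>(parts: &[&'a [u8]], mut write: impl FnMut(&[u8])) {
    let mut pending: Option<&'a [u8]> = None;
    for &part in parts {
        pending = match pending {
            None => Some(part),
            Some(prev) => match unsafe { merge_adjacent(prev, part) } {
                Some(merged) => Some(merged),
                None => {
                    write(prev); // flush the run that just ended
                    Some(part)
                }
            },
        };
    }
    if let Some(tail) = pending {
        write(tail); // flush whatever is still pending
    }
}
```

The pointer comparison still runs once per part at run time, which is the check described above as not being optimized out.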

drewm1980 · 2014-12-15T19:33:19Z

One piece of infrastructure that might be relevant is a function for combining adjacent slices that I am working on. It is here in the playpen for now; I intend to write a PR.
http://is.gd/zDpaJd

pczarn · 2014-12-16T22:45:30Z

src/libcollections/hash/mod.rs

+/// The specified value will be hashed with this hasher and then the resulting
+/// hash will be returned.
+#[unstable = "the hashing algorithm used will likely become generic soon"]
+pub fn hash<T: Hash<SipHasher>>(value: &T) -> u64 {


Why not Sized??

alexcrichton · 2014-12-19T22:25:39Z

I've looked into some other languages, and it seems that the common trend is for a definition that looks something like:

trait Hash {
    fn hash(&self) -> uint;
}

The drawback of this, however, is that containers like HashMap cannot be parametric over the hash algorithm used. This trait can also be encoded via the definitions proposed in this PR:

impl Hash<uint> for MyType {
    fn hash(&self, state: &mut uint) {
        *state = self.my_hash();
    }
}
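
For comparison, a similar encoding can be sketched against the `Hash`/`Hasher` traits in present-day std, where the hasher is only a slot for a precomputed value. `PassThrough` is a hypothetical type, and the body of `my_hash` is an arbitrary placeholder.

```rust
use std::hash::{Hash, Hasher};

/// Hasher that only passes through a single precomputed `u64`.
#[derive(Default)]
struct PassThrough(u64);

impl Hasher for PassThrough {
    fn write(&mut self, _bytes: &[u8]) {
        // This sketch only supports values that hash themselves up front.
        panic!("PassThrough only accepts write_u64");
    }

    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

struct MyType {
    id: u64,
}

impl MyType {
    /// Stand-in for a type-specific, non-incremental hash function.
    fn my_hash(&self) -> u64 {
        self.id.wrapping_mul(0x9e37_79b9_7f4a_7c15)
    }
}

impl Hash for MyType {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hand the finished hash over in one call instead of feeding bytes.
        state.write_u64(self.my_hash());
    }
}
```

With `PassThrough` as the hasher, `finish` returns exactly what `my_hash` computed, so the type keeps full control over its hash value.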

When it comes to optimizing hashing or trying to make an implementation that doesn't work incrementally (which it sounds like @pczarn and @thestinger are alluding to), one of the problems that jumps out is how to deal with aggregate structures with #[deriving(Hash)]. When it comes to deriving, unless we hard-code one specific algorithm, I don't think we have any option other than incrementally updating a hash (e.g. just calling .hash(state) on all members).

From what I can tell, one can always specialize hashing to perform optimally for any one particular type (minimizing write calls), but it would be very difficult to do so generically. For example the &[u8] type likely cannot hash well due to the lack of specialization, whereas the &str type can likely hash very well because hashing it is actually one call to .write().
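
One way to observe the number of write calls is a hasher that only counts them. The sketch below uses the `Hash`/`Hasher` traits and `#[derive(Hash)]` from present-day std rather than the definitions in this PR; `CountingHasher`, `Point`, and `write_calls` are hypothetical names.

```rust
use std::hash::{Hash, Hasher};

/// Hasher that records how `write` is called instead of hashing anything.
#[derive(Default)]
struct CountingHasher {
    calls: usize,
    bytes: usize,
}

impl Hasher for CountingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.calls += 1;
        self.bytes += bytes.len();
    }

    fn finish(&self) -> u64 {
        0 // not a real hash; only the counters matter here
    }
}

/// Aggregate with a derived, field-by-field (incremental) `Hash` impl.
#[derive(Hash)]
struct Point {
    x: u32,
    y: u32,
    z: u32,
}

/// Number of `write` calls made while hashing `value`.
fn write_calls<T: Hash + ?Sized>(value: &T) -> usize {
    let mut hasher = CountingHasher::default();
    value.hash(&mut hasher);
    hasher.calls
}
```

With this counter, the derived impl for `Point` issues one write per field, while a long string is handed over in a small, constant number of calls regardless of its length. As far as I can tell, current std also hashes byte slices in bulk, so the `&[u8]` concern above describes the state of affairs at the time of this PR.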

@pczarn I'm sorry I didn't quite follow your comment, but are you basically saying that requiring .write() + .finish() is imposing overhead, even in the &str case for example? Are you also saying that this is not inherent to the SipHasher implementation, but rather inherent to the design of the hashing traits?

emberian · 2015-01-05T06:43:32Z

Bump.

alexcrichton · 2015-01-06T02:36:24Z

I'd like to rebase this with a rewrite with true associated types, but I'm hitting a number of errors in the compiler which are preventing the usage of associated types. @nikomatsakis has a fix though, and I'll rebase as soon as we have a snapshot with those fixes.

pczarn reviewed Dec 11, 2014

alexcrichton force-pushed the stabilize-hash branch from 0e649fa to d4b85f3 (December 12, 2014 17:09)

alexcrichton force-pushed the stabilize-hash branch from d4b85f3 to c47f2ed (December 12, 2014 18:02)

alexcrichton force-pushed the stabilize-hash branch from c47f2ed to 085838c (December 14, 2014 02:26)

aturon mentioned this pull request Dec 16, 2014

Stabilization metabug: 1.0-alpha #19260

Closed

pczarn reviewed Dec 16, 2014

alexcrichton closed this Jan 6, 2015

alexcrichton mentioned this pull request Jan 6, 2015

std: Stabilize the std::hash module #20654

Merged

alexcrichton mentioned this pull request Feb 11, 2015

RFC: Simplify std::hash rust-lang/rfcs#823

Merged

alexcrichton mentioned this pull request Jul 29, 2015

std: Stabilize a number of small APIs #27370

Merged
