std: Stabilize the std::hash module #19673

Closed
wants to merge 1 commit into from

alexcrichton
Member

This commit aims to stabilize the std::hash module by standardizing on its
hashing interface while rationalizing the current usage with the HashMap and
HashSet types. The primary goal of this slight redesign is to separate the
concepts of a hasher's state from a hashing algorithm itself.

The primary change of this commit is to separate the Hasher trait into a
Hasher and a HashState trait. Conceptually the old Hasher trait was
actually just a factory for various states, but hashing had very little control
over how these states were used. Additionally the old Hasher trait was
actually fairly unrelated to hashing.

This commit redesigns the existing Hasher trait to match what the notion of a
Hasher normally implies with the following definition:

trait Hasher: Writer {
    type Output;

    fn reset(&mut self);
    fn finish(&self) -> Self::Output;
}

Note that the Output associated type is currently a type parameter due to
associated types not being fully implemented yet. This new Hasher trait
emphasizes that all hashers are sinks for bytes, and hashing algorithms may
produce outputs other than a u64, so the output type is made generic.
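
As a rough illustration of the "sink for bytes with a generic output" idea, here is a minimal sketch in present-day Rust. The `Writer` and `Hasher` traits are local stand-ins that mirror the definition above, not the standard library's traits, and `Fnv32` is a hypothetical toy hasher (FNV-1a) chosen only because its output is a `u32` rather than a `u64`.

```rust
/// A byte sink, standing in for the PR's `Writer` trait (local to this sketch).
trait Writer {
    fn write(&mut self, bytes: &[u8]);
}

/// The redesigned hasher: a resettable byte sink with a generic output type.
trait Hasher: Writer {
    type Output;

    fn reset(&mut self);
    fn finish(&self) -> Self::Output;
}

const FNV32_OFFSET: u32 = 0x811c_9dc5;
const FNV32_PRIME: u32 = 0x0100_0193;

/// Hypothetical toy hasher: 32-bit FNV-1a. Not part of the PR.
struct Fnv32(u32);

impl Fnv32 {
    fn new() -> Self {
        Fnv32(FNV32_OFFSET)
    }
}

impl Writer for Fnv32 {
    fn write(&mut self, bytes: &[u8]) {
        // FNV-1a: xor each byte into the state, then multiply by the prime.
        for &b in bytes {
            self.0 ^= u32::from(b);
            self.0 = self.0.wrapping_mul(FNV32_PRIME);
        }
    }
}

impl Hasher for Fnv32 {
    type Output = u32;

    fn reset(&mut self) {
        self.0 = FNV32_OFFSET;
    }

    fn finish(&self) -> u32 {
        self.0
    }
}

fn main() {
    let mut h = Fnv32::new();
    h.write(b"a");
    println!("{:#010x}", h.finish());

    // After a reset the hasher can be reused for an unrelated input.
    h.reset();
    h.write(b"b");
    println!("{:#010x}", h.finish());
}
```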

With this definition, the old Hasher trait is realized as a new HashState
trait in the collections::hash_state module as an experimental addition for
now. The current definition looks like:

trait HashState {
    type H: Hasher;
    fn hasher(&self) -> Self::H;
}

Note that the H associated type (along with its O output) are both type
parameters on the HashState trait due to the current limitations of associated
types. The purpose of this trait is to emphasize that the one piece of
functionality for implementors is that new instances of Hasher can be created.
This conceptually represents the two keys from which more instances of a
SipHasher can be created, and a HashState is what's stored in a HashMap,
not a Hasher.

Implementors of custom hash algorithms should implement the Hasher trait, and
only hash algorithms intended for use in hash maps need to implement or worry
about the HashState trait.
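
The split between a stored state and the hashers it creates can be sketched as follows. This is an illustration under assumptions, not the PR's code: the traits are local stand-ins, `KeyedState` and `KeyedHasher` are hypothetical names, and the mixing function is FNV-1a seeded from two keys rather than SipHash.

```rust
/// Local stand-in for the PR's `Hasher` (the `Writer` supertrait is folded in).
trait Hasher {
    type Output;
    fn write(&mut self, bytes: &[u8]);
    fn finish(&self) -> Self::Output;
}

/// Local stand-in for the PR's `HashState`: a factory for fresh hashers.
trait HashState {
    type H: Hasher;
    fn hasher(&self) -> Self::H;
}

/// Hypothetical keyed hasher: FNV-1a mixing over a seeded state. Not SipHash.
struct KeyedHasher(u64);

impl Hasher for KeyedHasher {
    type Output = u64;

    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// What a map would store: only the two keys, never a half-used hasher.
struct KeyedState {
    k0: u64,
    k1: u64,
}

impl HashState for KeyedState {
    type H = KeyedHasher;

    fn hasher(&self) -> KeyedHasher {
        // Every call yields a hasher in the same initial state.
        KeyedHasher(self.k0 ^ self.k1.rotate_left(32))
    }
}

/// Hash one key the way a map might: a fresh hasher per operation.
fn hash_key<S: HashState>(state: &S, key: &[u8]) -> <S::H as Hasher>::Output {
    let mut h = state.hasher();
    h.write(key);
    h.finish()
}

fn main() {
    let state = KeyedState { k0: 1, k1: 2 };
    let other = KeyedState { k0: 3, k1: 4 };
    println!("same state agrees: {}", hash_key(&state, b"key") == hash_key(&state, b"key"));
    println!("other keys differ: {}", hash_key(&state, b"key") != hash_key(&other, b"key"));
}
```

In today's standard library this factory role is played by `std::hash::BuildHasher`.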

Some other stability decisions made for the std::hash module are:

  • The name of the module, hash, is #![stable]
  • The Hash and Hasher traits are #[unstable] due to type parameters that
    want to be associated types.
  • The Writer trait remains #[experimental] as it's intended to be replaced
    with an io::Writer (more details soon).
  • The top-level hash function is #[unstable] as it is intended to be generic
    over the hashing algorithm instead of hardwired to SipHasher
  • The inner sip module is now private as its one export, SipHasher, is
    reexported in the hash module.

There are many breaking changes outlined above, and as a result this commit is
a:

[breaking-change]

@Gankra
Contributor

Gankra commented Dec 9, 2014

CC @pczarn @cgaebel @thestinger

@thestinger
Contributor

Hashes should have the option of single-pass implementations rather than ones with state, and this doesn't seem like it's enough to provide that.

@tbu-
Contributor

tbu- commented Dec 10, 2014

Does single-pass mean that the hasher gets to see e.g. the whole array of bytes instead of getting a look at them one at a time?

@alexcrichton
Member Author

@thestinger I'm not quite sure what you mean by "single pass", could you elaborate? Implementations of Hash can still specialize on the concrete Hasher type in order to use methods beyond what Writer provides (see the tests for an example).

@thestinger
Contributor

@tbu-: Yes, it means it doesn't have to be a state machine.

@thestinger
Contributor

@alexcrichton: The fact that the default hash implementations don't have single pass implementations for contiguous blocks of memory is a major performance issue and should be addressed in the design. I don't see how it's ready to be stabilized if these issues haven't even been considered.

@alexcrichton
Member Author

I'm trying to drill down into something more specific because what you've said so far sounds like this is already possible with the PR. I would like to make your concerns more concrete because the initial design of the Hash module had many performance considerations taken into account, and this proposed tweaking did not give up on any of that generality.

Specifically:

  • If you literally have one large block of memory, is write() not sufficient?
  • If you have a specific hashing optimization, is an implementation of Hash<SpecificHasher> not sufficient?

It would be helpful for you to be a little more concrete in your concerns to make sure that we can address them!

pub trait Writer {
    fn write(&mut self, bytes: &[u8]);
}

/// Hash a value with the default SipHasher algorithm (two initial keys of 0).
///
/// The speified value will be hashed with this hasher and then the resulting
Contributor

speified -> specified

@tbu-
Contributor

tbu- commented Dec 11, 2014

@alexcrichton I believe that this is not possible due to the current standard implementation for any slice that just hashes the members separately.

@thestinger
Contributor

@alexcrichton: A single-pass implementation is more efficient than a stateful one even if it's only using a single call to write. The infrastructure to do this for collections like slices is also missing.

@alexcrichton
Member Author

@tbu- I agree that we can't generically hash a slice of bytes as just a call to write, as this would require specialization. Without giving up the Hash implementation for slices in general, though, I'm not sure that there's a way around this.
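
The difference being debated, one `write` per element versus one `write` for the whole block, can be sketched in present-day Rust. This is only an illustration: `CountingFnv` is a hypothetical FNV-1a hasher that counts its `write` calls, and the two helper functions drive it by hand instead of going through any `Hash` implementation.

```rust
use std::hash::Hasher;

/// Hypothetical FNV-1a hasher that also counts how often `write` is called.
struct CountingFnv {
    state: u64,
    writes: usize,
}

impl CountingFnv {
    fn new() -> Self {
        CountingFnv { state: 0xcbf2_9ce4_8422_2325, writes: 0 }
    }
}

impl Hasher for CountingFnv {
    fn write(&mut self, bytes: &[u8]) {
        self.writes += 1;
        for &b in bytes {
            self.state ^= u64::from(b);
            self.state = self.state.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// Feeds the hasher one element at a time, as a generic slice impl would.
fn hash_per_element(bytes: &[u8]) -> (u64, usize) {
    let mut h = CountingFnv::new();
    for b in bytes {
        h.write(&[*b]);
    }
    (h.finish(), h.writes)
}

/// Feeds the hasher the whole contiguous block in a single call.
fn hash_bulk(bytes: &[u8]) -> (u64, usize) {
    let mut h = CountingFnv::new();
    h.write(bytes);
    (h.finish(), h.writes)
}

fn main() {
    let data = b"contiguous block";
    let (per_element, n) = hash_per_element(data);
    let (bulk, m) = hash_bulk(data);
    println!("same hash: {}, writes: {} vs {}", per_element == bulk, n, m);
}
```

For a byte-at-a-time algorithm like this one the result is identical either way; only the number of calls changes. Algorithms that process input in larger blocks are where the bulk path has the most room to pay off.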

@thestinger is this what you are referring to? I still don't understand your concerns about state because with the specialization of the hasher type parameter on the Hash trait you can always bypass all internal state via bytes anyway. Can you also elaborate more on what infrastructure you're talking about? Do you think that hashers should take a slice of T to hash in bulk?

@pczarn
Contributor

pczarn commented Dec 15, 2014

> A single-pass implementation is more efficient than a stateful one even if it's only using a single call to write.

Calls to write, however many, must be followed by a call to finish. This overhead is constant. Is there some other fundamental difference I'm missing?

I'm assuming the problem is that LLVM doesn't understand the semantics of write enough to optimize (inlined) invocations. Further, hash is called individually on each value. However, contiguous slices can be merged before they are passed to write:

pub trait MyHash<S: Writer = SipHasher> for Sized? {
    fn myhash<'a>(&'a self, state: &mut S, msg: &mut Option<&'a [u8]>);
}

pub fn myhash<T: MyHash<SipHasher>>(value: &T) -> u64 {
    let mut state = SipState::new();
    let mut msg = None;
    value.myhash(&mut state, &mut msg);
    if let Some(tail_msg) = msg {
        state.write(tail_msg);
    };
    state.finish()
}

// for primitive integer types
// impl<S: Writer> MyHash<S> for $ty
fn myhash<'a>(&'a self, state: &mut S, msg: &mut Option<&'a [u8]>) {
    if (*self as $ty).to_le() == *self as $ty { unsafe {
        let a = slice::from_raw_buf(mem::transmute(&self), mem::size_of::<$ty>());
        let a_ptr = a.as_ptr();

        match msg {
            &Some(ref mut m) if a_ptr == m.as_ptr().offset(m.len() as int) => {
                *msg = Some(slice::from_raw_buf(mem::transmute(&m.as_ptr()),
                                                m.len() + a.len()));
            }
            _ => {
                if let Some(msg_part) = *msg { state.write(msg_part); }
                *msg = Some(a);
            }
        }
    } } else {
        let a: [u8, ..::core::$ty::BYTES] = unsafe {
            mem::transmute((*self as $ty).to_le() as $ty)
        };
        if let Some(msg_part) = *msg { state.write(msg_part); }
        state.write(a.as_slice());
        *msg = None;
    }
}

I can only imagine it working efficiently with generous inlining and optimizations. Currently, the check for a.as_ptr() == msg_end isn't optimized out in code that hashes a slice, but slices become contiguous nonetheless. Maybe aliasing information could help here?
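
The core of the merging step, extending a pending slice when the next piece starts exactly where it ends, might look like this in present-day Rust. This is a hedged sketch rather than the code above: `join_adjacent` is a hypothetical helper, and it is only sound under the stated assumption that both slices come from the same allocation, which is why it is marked `unsafe`.

```rust
use std::slice;

/// Joins `a` and `b` into one slice if `b` starts exactly where `a` ends.
///
/// # Safety
/// Both slices must be borrowed from the same allocation. Two unrelated
/// allocations that merely happen to be adjacent must not be joined.
unsafe fn join_adjacent<'a>(a: &'a [u8], b: &'a [u8]) -> Option<&'a [u8]> {
    // `wrapping_add` keeps the address computation itself free of UB.
    let a_end = a.as_ptr().wrapping_add(a.len());
    if a_end == b.as_ptr() {
        // Caller guarantees the combined range lies within one allocation.
        Some(unsafe { slice::from_raw_parts(a.as_ptr(), a.len() + b.len()) })
    } else {
        None
    }
}

fn main() {
    let buf = [1u8, 2, 3, 4, 5, 6];
    let (left, right) = buf.split_at(2);

    // Adjacent halves of one array can be merged back into a single slice.
    let joined = unsafe { join_adjacent(left, right) };
    println!("{:?}", joined);

    // Slices with a gap between them cannot.
    let gap = unsafe { join_adjacent(&buf[..2], &buf[3..]) };
    println!("{:?}", gap);
}
```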

@drewm1980
Contributor

One piece of infrastructure that might be relevant is a function for combining adjacent slices that I am working on. It is here in the playpen for now; I intend to write a PR.
http://is.gd/zDpaJd

/// The specified value will be hashed with this hasher and then the resulting
/// hash will be returned.
#[unstable = "the hashing algorithm used will likely become generic soon"]
pub fn hash<T: Hash<SipHasher>>(value: &T) -> u64 {
Contributor

Why not Sized??

@alexcrichton
Member Author

I've looked into some other languages, and it seems that the common trend is for a definition that looks something like:

trait Hash {
    fn hash(&self) -> uint;
}

The drawback of this, however, is that containers like HashMap cannot be parametric over the hash algorithm used. This trait can also be encoded via the definitions proposed in this PR:

impl Hash<uint> for MyType {
    fn hash(&self, state: &mut uint) {
        *state = self.my_hash();
    }
}

When it comes to optimizing hashing or trying to make an implementation that doesn't work incrementally (which it sounds like @pczarn, @thestinger are alluding to), one of the problems that jumps out is how to deal with aggregate structures with #[deriving(Hash)]. When it comes to deriving, unless we hard-code one specific algorithm, I don't think we have any option other than incrementally updating a hash (e.g. just calling .hash(state) on all members).
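
To make the incremental behaviour of derived implementations concrete, here is a small sketch in present-day Rust, where the attribute is spelled `#[derive(Hash)]`. `Recorder` is a hypothetical hasher that hashes nothing and only records what it is fed, and `Point` is an example type invented for this illustration.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical hasher that records how it is driven instead of hashing.
#[derive(Default)]
struct Recorder {
    writes: usize,
    bytes: usize,
}

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.writes += 1;
        self.bytes += bytes.len();
    }

    fn finish(&self) -> u64 {
        0 // no real hash is computed in this sketch
    }
}

/// Example aggregate: the derive hashes each field in turn.
#[derive(Hash)]
struct Point {
    x: u32,
    y: u8,
}

fn main() {
    let mut r = Recorder::default();
    Point { x: 1, y: 2 }.hash(&mut r);
    // One write per field: the derive cannot know the fields form one block.
    println!("writes = {}, bytes = {}", r.writes, r.bytes);
}
```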

From what I can tell, one can always specialize hashing to perform optimally for any one particular type (minimizing write calls), but it would be very difficult to do so generically. For example the &[u8] type likely cannot hash well due to the lack of specialization implementations, whereas the &str type can likely hash very well due to actually being one call to .write().

@pczarn I'm sorry I didn't quite follow your comment, but are you basically saying that requiring .write() + .finish() is imposing overhead, even in the &str case for example? Are you also saying that this is not inherent to the SipHasher implementation, but rather inherent to the design of the hashing traits?

@emberian
Member

emberian commented Jan 5, 2015

Bump.

@alexcrichton
Member Author

I'd like to rebase this with a rewrite with true associated types, but I'm hitting a number of errors in the compiler which are preventing the usage of associated types. @nikomatsakis has a fix though, and I'll rebase as soon as we have a snapshot with those fixes.
