-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std: Stabilize the std::hash module #19673
Conversation
Hashes should have the option of single pass implementations rather than ones with state and this doesn't seem like it's enough to provide that. |
Does single-pass mean that the hasher gets to see e.g. the whole array of bytes instead of getting a look at them one at a time? |
@thestinger I'm not quite sure what you mean by "single pass", could you elaborate? Implementations of |
@tbu-: Yes, it means it doesn't have to be a state machine. |
@alexcrichton: The fact that the default hash implementations don't have single pass implementations for contiguous blocks of memory is a major performance issue and should be addressed in the design. I don't see how it's ready to be stabilized if these issues haven't even been considered. |
I'm trying to drill down into something more specific because what you've said so far sounds like this is already possible with the PR. I would like to make your concerns more concrete because the initial design of the Specifically:
It would be helpful for you to be a little more concrete in your concerns to make sure that we can address them! |
pub trait Writer { | ||
fn write(&mut self, bytes: &[u8]); | ||
} | ||
|
||
/// Hash a value with the default SipHasher algorithm (two initial keys of 0). | ||
/// | ||
/// The speified value will be hashed with this hasher and then the resulting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
speified -> specified
@alexcrichton I believe that this is not possible due to the current standard implementation for any slice that just hashes the members separately. |
@alexcrichton: A single-pass implementation is more efficient than a stateful one even if it's only using a single call to |
0e649fa
to
d4b85f3
Compare
@tbu- I agree that we can't generically hash a slice of bytes as just a call to @thestinger is this what you are referring to? I still don't understand your concerns about state because with the specialization of the hasher type parameter on the |
d4b85f3
to
c47f2ed
Compare
This commit aims to stabilize the `std::hash` module by standardizing on its hashing interface while rationalizing the current usage with the `HashMap` and `HashSet` types. The primary goal of this slight redesign is to separate the concepts of a hasher's state from a hashing algorithm itself. The primary change of this commit is to separate the `Hasher` trait into a `Hasher` and a `HashState` trait. Conceptually the old `Hasher` trait was actually just a factory for various states, but hashing had very little control over how these states were used. Additionally the old `Hasher` trait was actually fairly unrelated to hashing. This commit redesigns the existing `Hasher` trait to match what the notion of a `Hasher` normally implies with the following definition: trait Hasher: Writer { type Output; fn reset(&mut self); fn finish(&self) -> Output; } Note that the `Output` associated type is currently a type parameter due to associated types not being fully implemented yet. This new `Hasher` trait emphasizes that all hashers are sinks for bytes, and hashing algorithms may produce outputs other than a `u64`, so a the output type is made generic. With this definition, the old `Hasher` trait is realized as a new `HashState` trait in the `collections::hash_state` module as an experimental addition for now. The current definition looks like: trait HashState { type H: Hasher; fn hasher(&self) -> H; } Note that the `H` associated type (along with its `O` output) are both type parameters on the `HashState` trait due to the current limitations of associated types. The purpose of this trait is to emphasize that the one piece of functionality for implementors is that new instances of `Hasher` can be created. This conceptually represents the two keys from which more instances of a `SipHasher` can be created, and a `HashState` is what's stored in a `HashMap`, not a `Hasher`. Implementors of custom hash algorithms should implement the `Hasher` trait, and only hash algorithms intended for use in hash maps need to implement or worry about the `HashState` trait. Some other stability decision made for the `std::hash` module are: * The name of the module, hash, is `#![stable]` * The `Hash` and `Hasher` traits are `#[unstable]` due to type parameters that want to be associated types. * The `Writer` trait remains `#[experimental]` as it's intended to be replaced with an `io::Writer` (more details soon). * The top-level `hash` function is `#[unstable]` as it is intended to be generic over the hashing algorithm instead of hardwired to `SipHasher` * The inner `sip` module is now private as its one export, `SipHasher` is reexported in the `hash` module. There are many breaking changes outlined above, and as a result this commit is a: [breaking-change]
c47f2ed
to
085838c
Compare
Calls to I'm assuming the problem is that LLVM doesn't understand the semantics of pub trait MyHash<S: Writer = SipHasher> for Sized? {
fn myhash<'a>(&'a self, state: &mut S, msg: &mut Option<&'a [u8]>);
}
pub fn myhash<T: MyHash<SipHasher>>(value: &T) -> u64 {
let mut state = SipState::new();
let mut msg = None;
value.myhash(&mut state, &mut msg);
if let Some(tail_msg) = msg {
state.write(tail_msg);
};
state.finish()
}
// for primitive integer types
// impl<S: Writer> MyHash<S> for $ty
fn myhash<'a>(&'a self, state: &mut S, msg: &mut Option<&'a [u8]>) {
if (*self as $ty).to_le() == *self as $ty { unsafe {
let a = slice::from_raw_buf(mem::transmute(&self), mem::size_of::<$ty>());
let a_ptr = a.as_ptr();
match msg {
&Some(ref mut m) if a_ptr == m.as_ptr().offset(m.len() as int) => {
*msg = slice::from_raw_buf(mem::transmute(&m.as_ptr()),
m.len() + a.len());
}
_ => {
if let Some(msg_part) = *msg { state.write(msg_part); }
*msg = Some(a);
}
}
} } else {
let a: [u8, ..::core::$ty::BYTES] = unsafe {
mem::transmute((*self as $ty).to_le() as $ty)
};
if let Some(msg_part) = *msg { state.write(msg_part); }
state.write(a.as_slice());
*msg = None;
}
} I can only imagine it working efficiently with generous inlining and optimizations. Currently, the check for |
One piece of infrastructure that might be relevant is a function for combining adjacent slices that I am working on. It is here in the playpen for now; I intend to write a PR. |
/// The specified value will be hashed with this hasher and then the resulting | ||
/// hash will be returned. | ||
#[unstable = "the hashing algorithm used will likely become generic soon"] | ||
pub fn hash<T: Hash<SipHasher>>(value: &T) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not Sized?
?
I've looked into some other languages, and it seems that the common trend is for a definition that looks something like: trait Hash {
fn hash(&self) -> uint;
} The drawback of this, however, is that containers like impl Hash<uint> for MyType {
fn hash(&self, state: &mut uint) {
*state = self.my_hash();
}
} When it comes to optimizing hashing or trying to make an implementation that doesn't work incrementally (which it sounds like @pczarn, @thestinger are alluding to), one of the problems that jumps out is how to deal with aggregate structures with From what I can tell, one can always specialize hashing to perform optimally for any one particular type (minimizing @pczarn I'm sorry I didn't quite follow your comment, but are you basically saying that requiring |
Bump. |
I'd like to rebase this with a rewrite with true associated types, but I'm hitting a number of errors in the compiler which are preventing the usage of associated types. @nikomatsakis has a fix though, and I'll rebase as soon as we have a snapshot with those fixes. |
This commit aims to stabilize the
std::hash
module by standardizing on itshashing interface while rationalizing the current usage with the
HashMap
andHashSet
types. The primary goal of this slight redesign is to separate theconcepts of a hasher's state from a hashing algorithm itself.
The primary change of this commit is to separate the
Hasher
trait into aHasher
and aHashState
trait. Conceptually the oldHasher
trait wasactually just a factory for various states, but hashing had very little control
over how these states were used. Additionally the old
Hasher
trait wasactually fairly unrelated to hashing.
This commit redesigns the existing
Hasher
trait to match what the notion of aHasher
normally implies with the following definition:Note that the
Output
associated type is currently a type parameter due toassociated types not being fully implemented yet. This new
Hasher
traitemphasizes that all hashers are sinks for bytes, and hashing algorithms may
produce outputs other than a
u64
, so a the output type is made generic.With this definition, the old
Hasher
trait is realized as a newHashState
trait in the
collections::hash_state
module as an experimental addition fornow. The current definition looks like:
Note that the
H
associated type (along with itsO
output) are both typeparameters on the
HashState
trait due to the current limitations of associatedtypes. The purpose of this trait is to emphasize that the one piece of
functionality for implementors is that new instances of
Hasher
can be created.This conceptually represents the two keys from which more instances of a
SipHasher
can be created, and aHashState
is what's stored in aHashMap
,not a
Hasher
.Implementors of custom hash algorithms should implement the
Hasher
trait, andonly hash algorithms intended for use in hash maps need to implement or worry
about the
HashState
trait.Some other stability decision made for the
std::hash
module are:#![stable]
Hash
andHasher
traits are#[unstable]
due to type parameters thatwant to be associated types.
Writer
trait remains#[experimental]
as it's intended to be replacedwith an
io::Writer
(more details soon).hash
function is#[unstable]
as it is intended to be genericover the hashing algorithm instead of hardwired to
SipHasher
sip
module is now private as its one export,SipHasher
isreexported in the
hash
module.There are many breaking changes outlined above, and as a result this commit is
a:
[breaking-change]