Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ustring): ustring hash collision protection #4350

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lgritz
Copy link
Collaborator

@lgritz lgritz commented Jul 21, 2024

The gist is that the ustring::strhash(str) function is modified to strip out the MSB from Strutil::strhash. The rep entry is filed in the ustring table based on this hash. So effectively, the computed hash is 63 bits, not 64.

But rep->hashed field consists of the lower 63 bits being the computed hash, and the MSB indicates whether this is the 2nd (or more) entry in the table that had the same 63 bit hash.

ustring::hash() then is modified as follows: If the MSB is 0, the computed hash is the hash. If the MSB is 1, though, we DON'T use that hash, and instead we use the pointer to the unique characters, but with the MSB set (that's an invalid address by itself). Note that the computed hashes never have MSB set, and the char*+MSB always have MSB set, so therefore ustring::hash() will never have the same value for two different ustrings.

But -- please note! -- that ustring::strhash(str) and ustring(str).hash() will only match (and also be the same value on every execution) if the ustring is the first to receive that hash, which should be approximately always. Probably always, in practice.

But in the very improbable case of a hash collision, one of them (the second to be turned into a ustring) will be using the alternate hash based on the character address, which is both not the same as ustring::strhash(chars), nor is it expected to be the same constant on every program execution.

The gist is that the ustring::strhash(str) function is modified to
strip out the MSB from Strutil::strhash.  The rep entry is filed in
the ustring table based on this hash.  So effectively, the computed
hash is 63 bits, not 64.

But rep->hashed field consists of the lower 63 bits being the computed
hash, and the MSB indicates whether this is the 2nd (or more) entry in
the table that had the same 63 bit hash.

ustring::hash() then is modified as follows: If the MSB is 0, the
computed hash is the hash. If the MSB is 1, though, we DON'T use that
hash, and instead we use the pointer to the unique characters, but
with the MSB set (that's an invalid address by itself). Note that the
computed hashes never have MSB set, and the char*+MSB always have MSB
set, so therefore ustring::hash() will never have the same value for
two different ustrings.

But -- please note! -- that ustring::strhash(str) and
ustring(str).hash() will only match (and also be the same value on
every execution) if the ustring is the first to receive that hash,
which should be approximately always. Probably always, in practice.

But in the very improbable case of a hash collision, one of them (the
second to be turned into a ustring) will be using the alternate hash
based on the character address, which is both not the same as
ustring::strhash(chars), nor is it expected to be the same constant on
every program execution.

Signed-off-by: Larry Gritz <[email protected]>
@lgritz
Copy link
Collaborator Author

lgritz commented Aug 7, 2024

Comments @cmstein @sfriedmapixar @etheory ?

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 7, 2024

Maybe @chellmuth too?

Copy link

@sfriedmapixar sfriedmapixar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this locally with the MSB polarity flipped from what you've presented. I did this it's easier to see (in a debugger) pointer-based perfect hashes are pointers, it doesn't require bit manip to deref them, and so and if it happens, you can just dive into those in a debugger without having to mask off the high bit. Semantically I think it's also cleaner to actually use that pointer value as the hashing value into the map, and store it in the tablerep. Really, just think of it as a way to create a perfect hash out of an imperfect one -- with the caveat that it assumes you're on a system with user-space pointers that can never have the MSB set. You could do the same thing with the LSB instead since the tablerep always comes out of a pool, by guaranteeing the alignment of those allocations wouldn't have the low bit set for pointers and that would work everywhere. If you just think of the collision resolution as part of the "hashing" function, then it makes sense to store that hash in the tablerep for it in the "hashed" member.

static OIIO_HOSTDEVICE constexpr hash_t strhash(string_view str)
{
return Strutil::strhash(str) & hash_mask;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our implementation, we do an OR instead of an and, so that "pointer hashes" look like pointers in a debugger, and it saves clearing the bit when we want to use it like a pointer, and "hash-hashes" are always negative when looking at them in a debugger.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I think I was trying to preserve the "0 means empty string" property without needing extra conditionals, but maybe it's easier than I thought. I will noodle with this and see if I can make it work just as simply with the reversed polarity, because I do see your point about how it's nice for the pointers to be the unchanged values.

hash_t h = rep()->hashed;
return OIIO_LIKELY((h & duplicate_bit) == 0)
? h
: hash_t(m_chars) | duplicate_bit;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a "pointer hash", you can actually just store that "hash" the "hashed" member on TableRep creation, then you don't need to do any of this hashing or checking at all, you just return the value like before.

@@ -757,10 +776,29 @@ class OIIO_UTIL_API ustring {
TableRep(string_view strref, hash_t hash);
~TableRep();
const char* c_str() const noexcept { return (const char*)(this + 1); }
constexpr bool unique_hash() const
{
return (hashed & duplicate_bit) == 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you flip the polarity of the indicator bit, this needs to flip.

@@ -71,6 +105,10 @@ template<unsigned BASE_CAPACITY, unsigned POOL_SIZE> struct TableRepMap {

const char* lookup(string_view str, uint64_t hash)
{
if (OIIO_UNLIKELY(hash & ustring::duplicate_bit)) {
// duplicate bit is set -- the hash is related to the chars!
return OIIO::bitcast<const char*>(hash & ustring::hash_mask);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you flip the polarity of the indicator bit, the extra & is uneeded.

@@ -98,6 +135,10 @@ template<unsigned BASE_CAPACITY, unsigned POOL_SIZE> struct TableRepMap {
// the hash.
const char* lookup(uint64_t hash)
{
if (OIIO_UNLIKELY(hash & ustring::duplicate_bit)) {
// duplicate bit is set -- the hash is related to the chars!
return OIIO::bitcast<const char*>(hash & ustring::hash_mask);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you flip the polarity of the indicator bit, the extra & is uneeded.

++dist;
pos = (pos + dist) & mask; // quadratic probing
}
}

const char* insert(string_view str, uint64_t hash)
{
OIIO_ASSERT((hash & ustring::duplicate_bit) == 0); // can't happen?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you flip the polarity of the indicator bit, flip this assert.

// If we encountered another ustring with the same hash (if one
// exists, it would have hashed to the same address so we would have
// seen it), set the duplicate bit in the rep's hashed field.
if (duplicate_hash) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the polarity of the indicator bit, you can pass "duplicate_hash" into make_rep, and then it can just store the pointer from the pool alloc in the "hashed" member, and all you have to do here is inc the collisions counter (or the inc and dbug print could be moved into make_rep as well. -- if you do that, you need to do another pobe for a valid "pos" here based on the pointer bits rather than the original hash bits.

// ensure low bit clear
OIIO_DASSERT((size_t(rep->c_str()) & 1) == 0);
// ensure low bit clear
OIIO_DASSERT((size_t(rep->c_str()) & ustring::duplicate_bit) == 0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the polarity of the indicator bit, you need to flip this assert.

hash_t hash = Strutil::strhash64(strref);
// This line, if uncommented, lets you force lots of hash collisions:
// hash &= ~hash_t(0xffffff);
hash_t hash = ustring::strhash(strref);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a similar "force a lot of collisions" to to locally test that the collision code is all correct, now that it's much more complicated with the ptr vs hash perfect hashing.

collisions->emplace_back(ustring::from_unique(c.first));
return all_hash_collisions.size();
if (collisions) {
// Disabled for now

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to keep a collection of all hash collisions if you put the pointer into the "hashed" member on collision...then you can just iterate through all the strings in the system, and print out those who's hash is actually a pointer.

@etheory
Copy link
Contributor

etheory commented Aug 8, 2024

I'm OK with any of these options as long as collisions are handled.
I do agree with @sfriedmapixar about the simplicity of debugging if the top bit is 1 for a hash and 0 for a string though.

Also remember that on every current system, the bottom few bits are zero due to alignment, and the top few are zero due to memory allocation guarantees: https://stackoverflow.com/questions/46509152/why-in-x86-64-the-virtual-address-are-4-bits-shorter-than-physical-48-bits-vs which means you can often rely on the top 12 to 16bits also being zero. Keep that in mind when you do any of these tricks. We use these bits in glimpse for storage of data, so it's quite useful.

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 8, 2024

Something that has been nagging at my mind... I know that the MSB will always be 0 for a real address on x86, but do we feel confident that this will also be true for ARM, RISC V, or others that may come along?

I also considered using the LSB for this, also safe because the allocations are always aligned, the characters will never start on an odd address, so if we "|1" the hashes to set the low bit, then that's also a simple way to distinguish pointers from hashes and will be true of any future architecture.

Thoughts?

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 8, 2024

The reason I really like empty string to be all 0 bits (and thus thought that must mean that it's pointers, not hashes, with the extra bit set) is so we can memset to 0 a region and know that we've made valid (empty) ustrings or ustringhashes.

@ThiagoIze
Copy link
Collaborator

I also considered using the LSB for this, also safe because the allocations are always aligned, the characters will never start on an odd address, so if we "|1" the hashes to set the low bit, then that's also a simple way to distinguish pointers from hashes and will be true of any future architecture.

Setting the LSB hash will reduce the hash quality in a risky way. Suppose for instance someone did: my_array[hash%my_array.length()] where the length is 8 (it could be any size, but this is simple to show). The hash%8 will result in just the three lowest signed bits being used. But note that your proposal has the LSB at 1, so now we just have 2 bits in play and so only half the array will be referenceable.

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 8, 2024

Right, gotcha. That's a good reason to stick with MSB.

It's ok, we'll have asserts that no real address has the high bit set. We'll have faith that we'll catch it right away if anybody ports to a hardware platform that might use that part of the address space.

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 9, 2024

OK, in my tree (not yet pushed here) I recoded it all to make the MSB 1 for "original" hashes, and 0 for rehashes (which are replaced by the char ptr as the return value for ustring.hash()). I believe it's all working, but there is a tradeoff involved that I'm not entirely happy with. Let me review the issues:

In the table (invisible to the user), each entry has a hashed field, and the chars themselves. The MSB of the hashed field reveals if this is the first entry with that hash (used to be 0, now is 1 in my new draft), or if it is a subsequently encountered string that had the same 31 bit hash. The public hash() function returns rep->hashed if it's a first string entry having that hash, or the char address if it's a duplicate hash.

A property we've always had that I think is very important to retain is that both ustring and ustringhash are represented by a true 0 value for empty strings. In other words, it's special -- the ptr is null and the hash should be 0. This is what lets us zero out a block of memory and know that it's initializing any ustring or ustringhash therein to be valid, and represent an empty string.

Now then, the way I did it first in this PR trivially preserved that property, because MSB was 1 only for duplicate hashes, so the empty string is canonical and hashes to 0. The alternate way, where it would "want" to be a MSB 1 for canonical hashes, requires special handling of the empty string case in a variety of places. I think this is a little confusing and error-prone, and certainly has a runtime cost for the extra testing and branching.

@ssteinbach and others pointed out that it's nice for the pointers to have the MSB=0 rather than =1, because then you can just cast it to a pointer to get to the characters, especially helpful when debugging.

But here's where I want to push back a little: Is it, really? The only case where you'd get a hash that's a pointer is for a 2nd string that had the same hash. This should be exceptionally rare, so it's not something that would come up... almost never... while in the debugger looking at a ustringhash. Usually you'd just see the hash, so there's not a direct way to get to the chars, so it doesn't matter if MSB is 0 or 1 for that case. (And if we set MSB on the pointer, that just means to get to the chars, you'd need to mask it... awkward, but how often does that really come up? You could also just CALL -- yes, in the debugger -- ustringhash.c_str() to get to the chars.)

So here's the crux of it:

MSB=1 for duplicate hash:

  • PRO: no special handling of empty string case to make the hash 0
  • CON: "pointer substituted for duplicate hash" needs to have MSB masked off to get to the chars, which is painful in the debugger but almost never happens.

MSB=0 for duplicate hash:

  • PRO: duplicate hashes trivially castble to pointers, easy in the debugger, but this almost never happens.
  • CON: extra operations and potentially a branch to ensure ustringhash("") == 0, in several places that actually execute frequently.

Thoughts? Are people willing to pay (slightly) in runtime perf to have this property of trivial cast of duplicate hashes to pointers in the debuger, that you'll almost never encounter in real life?

@etheory
Copy link
Contributor

etheory commented Aug 10, 2024

You know what, add a static function to do the cast in the debugger and you solve all the issues. Should be straight forward enough. Then you can just call the correct function in the debugger directly. Just document it clearly and that should resolve the issue no?

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 10, 2024

You know what, add a static function to do the cast in the debugger and you solve all the issues.

Isn't that just the existing ustringhash.c_str()?

@etheory
Copy link
Contributor

etheory commented Aug 12, 2024

You know what, add a static function to do the cast in the debugger and you solve all the issues.

Isn't that just the existing ustringhash.c_str()?

If you only have the pointer for some reason, you cannot call the function on it (though I guess you could cast it....).
My point is that if you provide a free function to convert the bits for display/conversion, then you can just call it at any time in a debugger in any situation. Whether you have the object reference or not. If you can just cast the address and call the function, then sure, it's fine as is.

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 12, 2024

If you have the pointer, you already have the pointer.

If you have a hash value h as a ustringhash, you get to the chars with h.c_str().

If you have a hash value v as a uint64, you get to the chars with ustringhash(v).c_str().

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 15, 2024

Does my last answer assuage concerns, or do we need an additional utility function?

@etheory
Copy link
Contributor

etheory commented Aug 17, 2024

Yep all good thanks @lgritz - sorry for the slow turn around, I was on leave.

@lgritz
Copy link
Collaborator Author

lgritz commented Aug 25, 2024

@sfriedmapixar you had asked about the stab I took at swapping the bit to the other way as you originally suggested. If you're curious, it's here: https://github.com/lgritz/OpenImageIO/tree/lg-ustringhash2 (just one additional commit on top of what I have here in this PR)

I like this PR here better, as it avoids some extra logic (both potentially perf impacting, but maybe more importantly, error prone and easy to forget) that is necessary to special case empty string handing.

Let me know if you feel really strongly about doing it the other way. If you're ok with this, though, I'd like to merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants