Implement adaptive hashing using specialization #5
base: master
Conversation
```diff
@@ -54,6 +59,8 @@ use table::BucketState::{
     Full,
 };
+
+pub use adaptive_map::HashMapInterface;
```
Does this need to be public?
Yes, this exists temporarily. As long as specialization doesn't work for inherent impls, users of adaptive hashing must import this trait.
It seems like the inherent methods on HashMap could defer to HashMapInterface, no?
I'll try that, it's a great idea!
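For illustration, a minimal sketch of what that deferral could look like (the storage and method names below are invented, and it assumes nightly's `specialization` feature, which the crate already relies on):

```rust
#![feature(specialization)] // nightly-only; the PR is built on specialization

pub trait HashMapInterface<K, V> {
    fn get_adaptive(&self, k: &K) -> Option<&V>;
}

pub struct HashMap<K, V> {
    entries: Vec<(K, V)>, // stand-in for the real raw table
}

impl<K: PartialEq, V> HashMap<K, V> {
    // Inherent method defers to the trait, so callers never import the trait.
    pub fn get(&self, k: &K) -> Option<&V> {
        HashMapInterface::get_adaptive(self, k)
    }
}

// Default impl; a specialized impl for "simple" key types could override it.
impl<K: PartialEq, V> HashMapInterface<K, V> for HashMap<K, V> {
    default fn get_adaptive(&self, k: &K) -> Option<&V> {
        self.entries.iter().find(|(ek, _)| ek == k).map(|(_, v)| v)
    }
}
```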
Now, some benchmarks: Adaptive vs. SipHash-2-4 (outdated: the 2-4 variant is no longer used). [benchmark table omitted]
```diff
@@ -317,7 +323,7 @@ fn test_resize_policy() {
 /// }
 /// ```
 #[derive(Clone)]
-pub struct HashMap<K, V, S = RandomState> {
+pub struct HashMap<K, V, S = AdaptiveState> {
```
Changing this would be a breaking change at the API level, right?
Yes, this change can't be added to the std library. Is this library a drop-in replacement for std's HashMap?
I think the purpose of this repo is to iterate outside std, but with the ultimate goal of incorporating any worthwhile changes back into std.
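As a toy illustration of what the default-parameter change means (all types below are stand-ins, not the crate's):

```rust
use std::marker::PhantomData;

struct RandomState;   // stand-ins for the two hasher states
struct AdaptiveState;

// Before: struct HashMap<K, V, S = RandomState> { ... }
// After:  struct HashMap<K, V, S = AdaptiveState> { ... }
struct HashMap<K, V, S = AdaptiveState> {
    _marker: PhantomData<(K, V, S)>,
}

impl<K, V, S> HashMap<K, V, S> {
    fn with_hasher(_s: S) -> Self {
        HashMap { _marker: PhantomData }
    }
}

fn main() {
    // Fine before and after the change: the hasher type is spelled out.
    let _explicit: HashMap<u32, u32, RandomState> = HashMap::with_hasher(RandomState);

    // Changes meaning across the two versions: the bare `HashMap<u32, u32>`
    // shorthand now denotes HashMap<u32, u32, AdaptiveState>.
    let _default: HashMap<u32, u32> = HashMap::with_hasher(AdaptiveState);
}
```

Code that already spells out its hasher type keeps compiling; only the meaning of the bare `HashMap<K, V>` shorthand changes, which is exactly the API-level break discussed above.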
Two problems remain.
Second, the DoS safeguard ignores the effect of fn robin_hood. The safeguard is simple; consider a situation where the user does an insertion followed by searches. The insertion of an element ensures that future searches for that element won't take more than a limited number of iterations (a threshold of 128). However, between the insertion of that element and later searches for it, other unrelated insertions may displace it. It seems this should be closely reconsidered and documented in the form of a proof.
Updated and rebased. /cc @contain-rs/publishers @bill-myers @pnkfelix @arthurprs @Jurily @gereeter @ticki @divyekapoor
The difference is brutal. Does this apply to &str/String as well?
Not yet, because this requires explicit one-shot hashing with a hasher such as FarmHash, which probably needs an RFC.
Wonderful, @pczarn. Great work!
The difference is now smaller, because SipHash-1-3 is used by default.
I'm currently writing an RFC.
We should investigate using the integers as the hash in the Adaptive path.
@arthurprs I'm sure there's no good way of using integers as the hash. The adaptive path allows us to use a non-cryptographic hash, but the hash should still be statistically good. However, we could use hashes that are cheaply converted back to integer keys. That is, a reversible function for hashing. That would save some space at the expense of slower access to integer keys. Unfortunately, HashMap exposes iterators that must borrow keys stored somewhere inside the structure. It's a dead end. There's more to be gained from using 32-bit hashes on 32-bit platforms. Perhaps 48-bit hashes on 64-bit platforms, too.
@pczarn If the collision threshold is N, what's to stop an attacker from attempting to collide many different buckets N-1 times in order to DoS the server without triggering the collision detector?
@bstrie How much different would that be from just spamming keys randomly to fill the hashtable up? Clearly, there is a limit to how much you can delay any particular key.
@ticki Very different, as the expected number of probes per lookup for random keys is constant - something like 3, AFAIK, depending on the load factor - no matter how big the hash table gets, while the worst case before hitting the limit is 127 probes. If the u64 hash values are different (only equal modulo the table size), that's just 127 u64 comparisons, which is a bit slower but no big deal. If they're the same, though, that could be 127 comparisons of really long strings. In theory (as is mentioned in one of the todo comments) this could be a problem even with a single chain that gets repeatedly looked up - lookups aren't much cheaper than insertions, so AFAICS there's no real need to blow up the table...

This should be solvable by specifically checking for equal hashes. Set a much lower chain length limit for the fast path, like 16 or so - which should still be very uncommon. Every insertion that exceeds that chain length triggers a separate function that scans the chain to see if many keys (say, also 16) have equal hashes. If this is true, perform the switch, and of course still do it if the chain length exceeds 128 even with unequal hashes. If the hasher truly behaves randomly (as it really should for non-attack scenarios), the chance of even one 64-bit collision should be rather low, and every additional collision required divides the probability by on the order of 2^64.

Well, technically not just attack scenarios: it's possible to end up with a non-malicious but systemic source of full collisions, such as if someone's custom Hash implementation hashes only part of the object. That usually means the input to the hasher is the same, so with enough collisions the hash table is doomed to failure no matter what hash it uses (or if anything, switching to SipHash may be beneficial). But if there are only a few objects with equal hasher input, my proposal would make a useless switch more likely. Since this would already be a serious bug and SipHash is not that slow, I don't think that's a big deal.
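A rough sketch of that two-level check (the 16 and 128 thresholds are the ones from the comment; the function and its inputs are invented for illustration):

```rust
const FAST_PATH_LIMIT: usize = 16;   // probe-length limit for the cheap path
const HARD_LIMIT: usize = 128;       // absolute probe-length limit
const EQUAL_HASH_LIMIT: usize = 16;  // equal 64-bit hashes that count as an attack

/// Returns true if an over-long chain should trigger the switch to SipHash.
fn should_switch(probe_len: usize, chain_hashes: &[u64], new_hash: u64) -> bool {
    if probe_len < FAST_PATH_LIMIT {
        return false; // common case: nothing suspicious, no extra work
    }
    // Rare path: scan the chain and count full 64-bit hash collisions.
    let equal = chain_hashes.iter().filter(|&&h| h == new_hash).count();
    equal >= EQUAL_HASH_LIMIT || probe_len >= HARD_LIMIT
}
```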
That's not really my point. Say you want to attack a particular key. The only advantage over the uninformed attack is that you partially get to choose which keys to slow down, but even then you can only slow them down by a bounded amount.

An idea: when reallocating the table, it should check the highest probe length and conditionally switch back to the fast hash function.
I'm fine with having that.

Edit: My calculation was off.
Sadly strong chain length bounds only apply with the double-probing variant of Robin Hood hashing. Chains regularly get to length ~46 for million-long maps with the linear probing variant. Insertions have much more worrisome behaviour, since they can end up shifting 1k elements even with purely random elements in the worst case.
@Veedrac Yeah, clustering can have a real, negative effect. One possible solution is to use quadratic + Robin Hood. That should make it much less likely to happen.
@ticki Slowing down isn't all or nothing. If you insert N keys with the same hash-modulo-table-size, the first takes 0 probes, the second takes 1, ..., the Nth takes N-1; the total number of probes is N(N-1)/2. That's for each chain; the current E doesn't really matter, since you can just repeat this an arbitrary number of times (as long as you know where in the table there's free space) and let the table be grown, ideally having the hash also be equal modulo the new table size - in which case each resize repeats all the work done so far, asymptotically doubling the total number of probes. The average number of probes per insertion (which we understand as a cost to the attacker) is thus roughly (N-1)/2 without growth - for N=128, that's 63.5 - while inserting keys randomly keeps it under 3. Double both numbers to factor in growth.

Anyway, that's the worst case for the attacker, if (a) they can only perform an arbitrary number of insertions, not lookups, and (b) the inserted keys are required to be unique (e.g. the program refuses the request if an insertion finds an existing key). If they're able to either keep looking up the same key or keep re-inserting it, they can just hammer the last key in the chain, N probes for each operation. In this case the only incentive to create multiple chains is to pessimize cache behavior.

@Veedrac Hrm. If my suggestion to have an intermediate step of checking for fully equal keys is followed, having a few chains >= 16 but < 128 is not the end of the world. I guess it depends how cheap the check can be made... But if an alternate probing scheme can avoid the code bloat of adding that logic while improving performance in general, it sounds like a good idea even at the expense of some code bloat of its own. Pure double hashing has bad cache behavior, but what about starting with linear probing and switching to double hashing after some low iteration count (like 4)? That is, the probe locations would be h1, h1+1, h1+2, h1+3, h1+h2, h1+2h2, etc. Or maybe ..., h1+3, h1+h2, h1+h2+1, h1+h2+2, h1+h2+3, h1+2h2, ...
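A small sketch of that hybrid probe sequence, assuming a power-of-two table (the helper and its signature are invented):

```rust
// Probe i for a power-of-two table (`mask = table_size - 1`): linear for the
// first few slots, then double hashing. Forcing h2 odd keeps the stride
// coprime with the table size, so every slot is eventually reachable.
fn probe(h1: usize, h2: usize, i: usize, mask: usize) -> usize {
    let offset = if i < 4 {
        i                              // h1, h1+1, h1+2, h1+3
    } else {
        (i - 3).wrapping_mul(h2 | 1)   // then h1+h2, h1+2*h2, ...
    };
    h1.wrapping_add(offset) & mask
}
```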
@comex You forget that collisions alone are not enough. You have to deal with probe length as well, i.e. you need to keep the probe lengths in check.
The code doesn't cover strings yet. The algorithm will eventually work just fine for strings and slices.
I can't think of any such way, despite lots of consideration. This proposal is already reasonably simple. If someone can come up with a simplification, I would be impressed and grateful. I still have to write an RFC.
When a chain turns out not to have many equal hashes, you need to resume insertion. The hard part is implementing that resumption in a way that won't harm code generation. A recursive call is not ideal for the generated code, unless it gets tail call optimization. Should I be more concerned about equal hashes? I think applications should at least ensure their inputs are not too long. Web servers certainly check the length of query keys. A typical DDoS is devastating for many hashmaps, e.g. in Java, because every key comparison requires reading a heap-allocated object, which loads a cache line. With large strings, we're reading contiguous memory.
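One conceivable shape for that rare path, sketched here with invented names rather than the PR's actual code: outline the check into a cold, never-inlined helper and let the displacement loop continue with its own local state, so nothing has to recurse:

```rust
#[cold]
#[inline(never)]
fn long_chain_needs_switch(probe_len: usize) -> bool {
    probe_len >= 128 // stand-in decision; the real check would scan the chain
}

fn insert_displacement_loop(start_probe: usize, free_slot_at: usize) {
    let mut probe_len = start_probe;
    loop {
        if probe_len > 16 && long_chain_needs_switch(probe_len) {
            // switch to the safe hasher here, then keep displacing below;
            // no recursive call back into the insertion routine is needed
        }
        // ... robin hood displacement step would go here ...
        probe_len += 1;
        if probe_len >= free_slot_at {
            break; // stand-in termination: reached the free bucket
        }
    }
}
```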
I think it's not necessary. Why implement switching back, when the algorithm is meant to never switch in practice?
@ticki I don't know what you mean. It should be possible to fill up the hash table without gaps, up to the maximum number of entries before resize - like this, imagining N were 4:
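A minimal runnable sketch of that idea (assuming plain linear probing on a small power-of-two table; the code is illustrative, not from the thread):

```rust
// With N = 4, every key in a group hashes to the first slot of its group, so
// inserts cost 0,1,2,3 probes per group under linear probing, yet the table
// still fills up without gaps.
fn main() {
    let (n, table_size) = (4usize, 16usize);
    let mut table: Vec<Option<usize>> = vec![None; table_size];
    let mut total_probes = 0;
    for group in 0..table_size / n {
        for key in 0..n {
            let mut slot = group * n;            // all keys in the group collide here
            while table[slot].is_some() {        // linear probing
                slot = (slot + 1) % table_size;
                total_probes += 1;
            }
            table[slot] = Some(group * n + key);
        }
    }
    assert!(table.iter().all(|e| e.is_some())); // no gaps anywhere
    println!("total probes: {}", total_probes); // 4 groups * (0+1+2+3) = 24
}
```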
@comex. Good point.
Well, let's consider a hypothetical scenario: you use the hashtable for a long-running KV store (say, a server), and some attack generates a lot of collisions and inserts them into the hash table. Now, the hash table will move on to a secure hash function. When the attackers realize that their approach is inefficient, they might discontinue the attack. If the hash table keeps growing, and then reallocates, nothing is lost by switching back to the old hash function. Of course, this gives a new attack vector: do the collision attack, let the table reallocate another time, switching back to the old function, and repeat indefinitely. Although, it is worth noting that getting a table to reallocate is not exactly trivial, since it requires a high (exponentially growing) number of insertions. Generally, I think the best way of modeling the security in this case is to compare the attack to the naïve one in which a lot of random keys are spammed; given that this technique is what it takes to revert the table to an insecure hash function, it is only marginally worse than a "never-go-back" approach. The question is this: how big is the gain, and is it worth it? Unfortunately, that cannot be measured by micro benchmarks.
I changed the way the safeguard works. Previously, it allowed for a huge cost of map creation, only disallowing costly lookups. Now, the safeguard is part of insertion instead of every lookup. The code is a bit less complex. Two tasks remain.
Very cool. It can be seen as an adaptive load factor now, which is great. Maybe it should take into account resizing policy though.
It is quite easy to add a seed to this mix function: https://en.wikipedia.org/wiki/Xor-encrypt-xor
So if we have a 128-bit seed, then we can simply do: self.hash = mix(msg_data ^ seed.k0) ^ seed.k1; This construction would be strong enough that no fallback to SipHash is needed at all (for simple keys). (Note: the Even-Mansour scheme relies on a "strong pseudorandom permutation".)
Then there would be no need for adaptive hashing for simple keys.
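A minimal sketch of that construction for 64-bit keys (the `mix` permutation and `Seed` type below are stand-ins, not the crate's actual code):

```rust
struct Seed {
    k0: u64,
    k1: u64, // k0 and k1 together form the 128-bit seed
}

// Stand-in mixing permutation: a SplitMix64-style finalizer, an invertible
// 64-bit mixer (xorshifts and odd multiplications are bijective).
fn mix(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

// hash = mix(msg ^ k0) ^ k1, the xor-encrypt-xor shape suggested above.
fn seeded_hash(msg_data: u64, seed: &Seed) -> u64 {
    mix(msg_data ^ seed.k0) ^ seed.k1
}
```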
Adaptive hashmap implementation

All credits to @pczarn who wrote rust-lang/rfcs#1796 and contain-rs/hashmap2#5

**Background**

The Rust std lib hashmap puts a strong emphasis on security. We did some improvements in #37470, but in some very specific cases and for non-default hashers it's still vulnerable (see #36481). This is a simplified version of the rust-lang/rfcs#1796 proposal, sans switching hashers on the fly and other things that require an RFC process and further decisions. I think this part has great potential by itself.

**Proposal**

This PR adds code checking for extra-long probe and shift lengths (see code comments and rust-lang/rfcs#1796 for details); when those are encountered, the hashmap will grow (even if the capacity limit is not reached yet), _greatly_ attenuating the degenerate performance case. We need a lower bound on the minimum occupancy that may trigger the early resize, otherwise in extreme cases it's possible to turn the CPU attack into a memory attack. The PR code puts that lower bound at half of the max occupancy (defined by ResizePolicy). This reduces the protection (it could potentially be exploited between 0-50% occupancy) but makes it completely safe.

**Drawbacks**

* May interact badly with poor hashers. Maps using those may not use the desired capacity.
* It adds 2-3 branches to the common insert path; luckily those are highly predictable, and there's room to shave some in future patches.
* May complicate exposure of ResizePolicy in the future, as the constants are a function of the fill factor.

**Example**

Example code that exploits the exposure of iteration order and a weak hasher:

```
const MERGE: usize = 10_000usize;
#[bench]
fn merge_dos(b: &mut Bencher) {
    let first_map: $hashmap<usize, usize, FnvBuilder> = (0..MERGE).map(|i| (i, i)).collect();
    let second_map: $hashmap<usize, usize, FnvBuilder> = (MERGE..MERGE * 2).map(|i| (i, i)).collect();
    b.iter(|| {
        let mut merged = first_map.clone();
        for (&k, &v) in &second_map {
            merged.insert(k, v);
        }
        ::test::black_box(merged);
    });
}
```

_91 is stdlib and _ad is patched (the end capacity in both cases is the same):

```
running 2 tests
test _91::merge_dos ... bench:  47,311,843 ns/iter (+/- 2,040,302)
test _ad::merge_dos ... bench:     599,099 ns/iter (+/- 83,270)
```
Adaptive hashing provides fast and complexity-safe hashing for hashmaps with simple key types. The user doesn't need to change any code to get speedups from adaptive hashing, in contrast to the use of FnvHasher.