Adaptive hashmap implementation #38368
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
The code is in very rough shape; I wanted to collect feedback on the idea first.
r? @bluss cc @pczarn, @apasel422
```diff
 let mut old_table = replace(&mut self.table, RawTable::new(new_raw_cap));
 let old_size = old_table.size();

-if old_table.capacity() == 0 || old_table.size() == 0 {
+if old_table.size() == 0 {
```
why was the capacity conditional removed here?
This doesn't need to be part of the PR. The capacity check is redundant though, right?
Right, it's the existence of the check in the first place that is puzzling: if capacity is 0, size is surely already 0.
I can remove that change. But the capacity check is redundant, right?
src/libstd/collections/hash/map.rs
```rust
NoElem(bucket) => bucket.put(self.hash, self.key, value).into_mut_refs().1,
NeqElem(bucket, disp) => {
    let (shift, v_ref) = robin_hood(bucket, disp, self.hash, self.key, value);
    if disp >= 128 || shift >= 512 {
```
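For readers skimming the diff, here is a minimal sketch of the decision these two thresholds feed into, with illustrative names (`DISPLACEMENT_THRESHOLD`, `FORWARD_SHIFT_THRESHOLD`, and `looks_degenerate` are assumptions for this sketch, not the identifiers that landed):

```rust
// A minimal sketch of the adaptive-resize decision, not the actual
// libstd code; the constant names and free function are illustrative.
const DISPLACEMENT_THRESHOLD: usize = 128; // `disp >= 128` in the diff above
const FORWARD_SHIFT_THRESHOLD: usize = 512; // `shift >= 512` in the diff above

/// Returns true when an insertion displaced entries so far from their
/// ideal buckets that the table should grow early.
fn looks_degenerate(disp: usize, shift: usize, len: usize, raw_capacity: usize) -> bool {
    // Only resize early once the table is at least half full; below
    // that, attacker-chosen keys could turn the CPU attack into a
    // memory attack by forcing repeated growth.
    let at_least_half_full = len * 2 >= raw_capacity;
    at_least_half_full && (disp >= DISPLACEMENT_THRESHOLD || shift >= FORWARD_SHIFT_THRESHOLD)
}

fn main() {
    // A displacement of 200 in a 75%-full table triggers the early resize...
    assert!(looks_degenerate(200, 0, 768, 1024));
    // ...but the same displacement in a nearly empty table does not.
    assert!(!looks_degenerate(200, 0, 10, 1024));
}
```

The half-full guard is the lower bound discussed later in the thread: without it, forcing repeated early growth would let an attacker trade the CPU attack for a memory attack.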
These, of course, will be moved into well-commented constants.
I wonder if we can get away with only checking probe length. It's still possible to abuse long shifts without hitting the probe-length limit, but that's a lot harder.
Looks remarkably simple for what it does. That's good.
Obviously the constants involved need proper constant names, tuning, and comments. I think we can make an argument that, for example, a displacement of 128 slots from its best position is always a bad case and should never occur in a healthy hash table, no matter its size?
The good thing is that the math behind this is independent of the map size; it's only a function of the fill factor and the hasher being good. The first is fine, as the constants work for fill factors smaller than the one they were calculated for. Interacting badly with bad hashers could be problematic in practice, as the hashmap may never reach the maximum fill factor (the check for half-filled is useful here so it doesn't blow up).
This sounds like a good idea, but it means it only counters the n=2 case (i.e., merging two maps, rather than, say, the first nth of n maps). That's definitely an improvement, though.
@Veedrac what do you mean by the n=2 case?
Putting it in more generic terms, you mean that it can still be abused while between 0% and 50% filled?
Yes, basically. I'll try to cook up some examples later, to give a more concrete demo.
Trying to resume the conversation... I think the obvious open question here is the interaction with less-than-good hashers; hashmaps using those may not use the desired capacity.
The libs team discussed this briefly at triage the other day, and we were wondering if we could perhaps land this ahead of the RFC? The changes to probing here are universally better, even if we don't do the hasher changes yet, right? If so, perhaps the PR title/description could be cleaned up to the current state and we could look to merge?
I'll update the PR/description to provide a clearer picture.
I wouldn't say universally, but mostly.
Ah ok, thanks for the clarification. Want to ping me when updated and we can look to merge?
I should have elaborated on that. It's not strictly better because the interaction with poor hashers isn't great; with those, it's possible that the hashmap resizes early even on non-rogue input. I'll ping when I update it.
PR updated; now there are two constants and lots of comment lines.
Thanks @arthurprs! Out of curiosity, would it be at all possible to add a test for this?
I think so. It's possible to observe the early resizes from the public API, and it's somewhat easy to trigger by merging two maps with the same hash seed (like the example in the first post). I'll write something tomorrow.
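A hedged sketch of what such a test could look like (the fixed-seed FNV-1a hasher below stands in for the `FnvBuilder` used in the first post; this is not the test that landed, and it only passes on a std with the adaptive behavior):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Fixed-seed FNV-1a, standing in for the FnvBuilder from the first post.
struct Fnv(u64);
impl Default for Fnv {
    fn default() -> Fnv { Fnv(0xcbf29ce484222325) }
}
impl Hasher for Fnv {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
}
type FnvBuilder = BuildHasherDefault<Fnv>;

#[test]
fn early_resize_is_observable() {
    const N: usize = 10_000;
    // Same seed in both maps, so they expose the same iteration order.
    let first: HashMap<usize, usize, FnvBuilder> = (0..N).map(|i| (i, i)).collect();
    let second: HashMap<usize, usize, FnvBuilder> = (N..N * 2).map(|i| (i, i)).collect();

    let mut merged = first.clone();
    let baseline = merged.capacity();
    for (&k, &v) in &second {
        merged.insert(k, v);
        // `capacity()` normally grows only after `len` has exhausted it,
        // so a capacity jump while `len` is still below the old capacity
        // must come from the adaptive path.
        if merged.capacity() > baseline && merged.len() <= baseline {
            return; // early resize observed
        }
    }
    panic!("no early resize observed");
}
```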
I rebased and squashed the commits.
@bors: r+ Thanks again and for being patient @arthurprs!
📌 Commit 57940d0 has been approved by
Adaptive hashmap implementation

All credits to @pczarn who wrote rust-lang/rfcs#1796 and contain-rs/hashmap2#5.

**Background**

The Rust std lib hashmap puts a strong emphasis on security. We made some improvements in #37470, but in some very specific cases and for non-default hashers it's still vulnerable (see #36481).

This is a simplified version of the rust-lang/rfcs#1796 proposal, sans switching hashers on the fly and other things that require an RFC process and further decisions. I think this part has great potential by itself.

**Proposal**

This PR adds code checking for extra-long probe and shift lengths (see the code comments and rust-lang/rfcs#1796 for details); when those are encountered, the hashmap will grow (even if the capacity limit is not reached yet), _greatly_ attenuating the degenerate performance case.

We need a lower bound on the minimum occupancy that may trigger the early resize, otherwise in extreme cases it's possible to turn the CPU attack into a memory attack. The PR code puts that lower bound at half of the max occupancy (defined by ResizePolicy). This reduces the protection (it could potentially be exploited between 0% and 50% occupancy) but makes it completely safe.

**Drawbacks**

* May interact badly with poor hashers; maps using those may not use the desired capacity.
* It adds 2-3 branches to the common insert path; luckily those are highly predictable, and there's room to shave some off in future patches.
* May complicate exposure of ResizePolicy in the future, as the constants are a function of the fill factor.

**Example**

Example code that exploits the exposure of iteration order and a weak hasher:

```
const MERGE: usize = 10_000usize;

#[bench]
fn merge_dos(b: &mut Bencher) {
    let first_map: $hashmap<usize, usize, FnvBuilder> = (0..MERGE).map(|i| (i, i)).collect();
    let second_map: $hashmap<usize, usize, FnvBuilder> = (MERGE..MERGE * 2).map(|i| (i, i)).collect();
    b.iter(|| {
        let mut merged = first_map.clone();
        for (&k, &v) in &second_map {
            merged.insert(k, v);
        }
        ::test::black_box(merged);
    });
}
```

`_91` is stdlib and `_ad` is patched (the end capacity in both cases is the same):

```
running 2 tests
test _91::merge_dos ... bench: 47,311,843 ns/iter (+/- 2,040,302)
test _ad::merge_dos ... bench:    599,099 ns/iter (+/- 83,270)
```
☀️ Test successful - status-appveyor, status-travis
@istankovic Please make a PR 😃
@arthurprs Nah, it was just something I noticed so I made the comments, but it doesn't bother me enough to make a PR, sorry...
Fix spelling in hashmap comments: Fixing my bad English from rust-lang#38368. Note to self: triple-check spelling/grammar.
The shift-length math is broken. It turns out that checking for the shift length is complicated. Using simulations, it's possible to see that a value of 2000 will only get probabilities down to ~1e-7 when the hashmap load factor is 90% (Rust goes up to 90.9% as of today). That's probably not good enough to go into the stdlib with pluggable hashers. See rust-lang/rfcs#1796 (comment) and rust-lang/rfcs#1796 (comment). I suggest taking that part out and keeping only the displacement check, which is much safer and very useful by itself.
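For context, a rough Monte Carlo sketch of this kind of simulation (a simplified stand-in, not the scripts behind the linked comments): fill a Robin Hood table at 90% load with pseudo-random hashes and record the worst displacement seen. Runs of this kind are what motivate the claim earlier in the thread that a large displacement should essentially never occur with a good hasher.

```rust
// Build a Robin Hood table from pseudo-random hashes and report the
// largest displacement seen; the table layout and RNG are simplified
// stand-ins, not the libstd implementation.
fn worst_displacement(capacity: usize, load: f64, state: &mut u64) -> usize {
    let mask = capacity - 1; // capacity must be a power of two
    let mut slots: Vec<isize> = vec![-1; capacity]; // per-slot displacement, -1 = empty
    let mut worst = 0usize;
    for _ in 0..(capacity as f64 * load) as usize {
        // xorshift64* standing in for a good hasher
        *state ^= *state >> 12;
        *state ^= *state << 25;
        *state ^= *state >> 27;
        let hash = state.wrapping_mul(0x2545F4914F6CDD1D) as usize;
        let (mut pos, mut disp) = (hash & mask, 0isize);
        loop {
            if slots[pos] < 0 {
                slots[pos] = disp; // empty slot: take it
                worst = worst.max(disp as usize);
                break;
            }
            if slots[pos] < disp {
                // Robin Hood: evict the "richer" resident and keep probing with it.
                std::mem::swap(&mut slots[pos], &mut disp);
                worst = worst.max(slots[pos] as usize);
            }
            pos = (pos + 1) & mask;
            disp += 1;
        }
    }
    worst
}

fn main() {
    let mut state = 0x9E3779B97F4A7C15u64;
    for run in 0..5 {
        println!("run {}: worst displacement = {}", run, worst_displacement(1 << 16, 0.90, &mut state));
    }
}
```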
I agree. This issue also indicates that the hashmap load factor may be too high. Thanks for the help running these simulations.
Simplify/fix adaptive hashmap: Please see rust-lang#38368 (comment) for context. The shift-length math is broken. It turns out that checking for the shift length is complicated. Using simulations, it's possible to see that a value of 2000 will only get probabilities down to ~1e-7 when the hashmap load factor is 90% (Rust goes up to 90.9% as of today). That's probably not good enough to go into the stdlib with pluggable hashers. So this PR simplifies the adaptive behavior to only consider displacement, which is much safer and very useful by itself. There are two commits because one of them is already being tested for merging by bors.
Because of alignment(?), this one … (We caught this in Servo because we have unit tests that check …)
Could the extra bit be packed in …?
Yes, of course. The code is going to be messy, though. If we're able to restrict adaptive hashing to maps with the default hasher, I'd prefer to have the extra bit in …
It's just a matter of finding out how to use the bit with reasonable code. I'd argue against making it RandomState-only; the selling point was supporting all hashmaps. Edit: I also think that making it RandomState-only will require even more code.
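For illustration, a generic sketch of the bit-packing idea being discussed (the field and names here are hypothetical, not the layout that #40042 settled on): steal the top bit of an existing `usize` field so the flag costs no extra space.

```rust
// Hypothetical layout: stash the one-bit adaptive flag in the top bit
// of an existing usize field so `size_of::<Table>()` is unchanged.
const FLAG_BIT: usize = 1 << (usize::BITS - 1);

struct Table {
    // Low bits hold the real capacity mask; the top bit is the flag.
    // (A realistic capacity mask never uses the top bit.)
    masked_capacity: usize,
}

impl Table {
    fn capacity_mask(&self) -> usize { self.masked_capacity & !FLAG_BIT }
    fn adaptive_flag(&self) -> bool { self.masked_capacity & FLAG_BIT != 0 }
    fn set_adaptive_flag(&mut self) { self.masked_capacity |= FLAG_BIT; }
}

fn main() {
    let mut t = Table { masked_capacity: 1023 };
    t.set_adaptive_flag();
    assert_eq!(t.capacity_mask(), 1023); // the mask is unaffected by the flag
    assert!(t.adaptive_flag());
}
```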
Let’s discuss in #40042.