
Fixed worst-case miri performance with lossy string decoding #98592

Closed
wants to merge 1 commit into from

Conversation

Kixiron
Member

@Kixiron Kixiron commented Jun 27, 2022

Addresses miri/#2273 (I'm not sure I'd call it a fix since this situation could conceivably happen in other code, so it's more of a bandaid)

Reduces the runtime of the test case from miri/#2273 (comment) from ~forever to a little over 5 seconds on my machine (windows, could possibly be faster on linux)

cc @RalfJung
r? @saethlin

@rustbot rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Jun 27, 2022
@RalfJung
Member

Why does this help Miri performance?
Are you sure this does not hurt runtime performance?

@Kixiron
Member Author

Kixiron commented Jun 27, 2022

@saethlin knows the "why" better than I do, but could you start a perf run to see if there's any performance impact?

@saethlin
Member

Phew, that's a bit bigger of a diff than I was hoping for. Since this is basically optimizing the debug codegen (but an even more dramatic version of that), I think it would be best if this PR were very small and had a clear comment explaining exactly why we want to write this code in a particular way. For example, I wonder if simply removing the len method call from the loop and storing the length in a local makes a huge dent in the runtime.

But the speedup is valuable as confirmation that we correctly identified the cause of the problem. I pointed out this pattern:

        while i < self.source.len() {
            let byte = unsafe { *self.source.get_unchecked(i) };

I have long suspected that something like this could cause Stacked Borrows blowup, but I have so far failed to build a tight reproducer of it. Perhaps one can be extracted based on this. I think this ends up with a large allocation where the stack-merging code totally fails because each byte has a different pattern of tags. Since this is a slice, each call to len and get_unchecked inserts at least one tag into every stack (it could be way more; at one point I was just printing borrow stacks and I recall some simple operations added a surprising number of items), but the granting tag is way at the bottom. So if I'm right, the linear search is truly worst-case behavior here.
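
For reference, a minimal standalone sketch of the two loop shapes under discussion (a plain byte-counting function with made-up names, not the actual Utf8LossyChunksIter code). The first goes through the slice reference on every iteration; the second hoists the length into a local, which is the small change suggested above (get_unchecked still reborrows each iteration, only the len calls go away):

    fn count_ascii_per_iter_len(source: &[u8]) -> usize {
        let mut count = 0;
        let mut i = 0;
        // Every `source.len()` and `source.get_unchecked(i)` call reborrows the
        // slice, so under Stacked Borrows each iteration pushes fresh tags onto
        // the borrow stack of every byte the slice covers.
        while i < source.len() {
            // SAFETY: `i < source.len()` per the loop condition.
            let byte = unsafe { *source.get_unchecked(i) };
            if byte < 128 {
                count += 1;
            }
            i += 1;
        }
        count
    }

    fn count_ascii_hoisted_len(source: &[u8]) -> usize {
        let mut count = 0;
        let mut i = 0;
        let len = source.len(); // hoisted: the bound is read through the reference only once
        while i < len {
            // SAFETY: `i < len == source.len()` per the loop condition.
            let byte = unsafe { *source.get_unchecked(i) };
            if byte < 128 {
                count += 1;
            }
            i += 1;
        }
        count
    }

    fn main() {
        let data = b"hello \xFF world";
        assert_eq!(count_ascii_per_iter_len(data), count_ascii_hoisted_len(data));
    }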

@RalfJung
Member

could you start a perf run to see if there's any performance impact?

Sure can, but that will just test perf of rustc itself, which I doubt is a heavy user of this code.
This will need more targeted micro-benchmarking. I am not sure if we have existing benchmarks for that.
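
Something along these lines would be a more targeted micro-benchmark (a sketch only; the bench name and input are made up, using the nightly `test` harness that the in-tree library benches use):

    #![feature(test)]
    extern crate test;

    use test::{black_box, Bencher};

    #[bench]
    fn from_utf8_lossy_mixed(b: &mut Bencher) {
        // Mostly-ASCII input with a few invalid bytes at the end, so both the
        // ASCII fast path and the replacement path get exercised.
        let mut data = b"Lorem ipsum dolor sit amet ".repeat(64);
        data.extend_from_slice(&[0xF0, 0x90, 0x80, 0xFF]);
        b.iter(|| String::from_utf8_lossy(black_box(&data)));
    }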

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 27, 2022
@bors
Contributor

bors commented Jun 27, 2022

⌛ Trying commit 0633c81 with merge dc2897de64ccc546d1d9a10d8d93ac2bd641ef30...

@JakobDegen
Contributor

JakobDegen commented Jun 27, 2022

I don't think this will show on a compiler perf run. Imo we should be benchmarking the implementation directly. Edit: Agh, didn't see Ralf's message

@Kixiron
Member Author

Kixiron commented Jun 27, 2022

String's from_utf8_lossy() benches should cover this, I think; it uses Utf8LossyChunksIter under the hood
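
For context, what the lossy path does (a quick illustration, not taken from the benches themselves): invalid byte sequences are replaced with U+FFFD while valid UTF-8 runs pass through unchanged.

    fn main() {
        let bytes = b"foo\xFF\xFEbar";
        let decoded = String::from_utf8_lossy(bytes);
        assert_eq!(decoded, "foo\u{FFFD}\u{FFFD}bar");
    }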

@RalfJung
Member

I think this ends up with a large allocation where the stack-merging code totally fails because each byte has a different pattern of tags.

Ah, that makes a lot of sense.
We could save memory by merging common tails of stacks in something like a linked list but that won't help with the time complexity.
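
A rough sketch of that "merge common tails" idea (purely illustrative, not Miri's actual borrow-stack representation): each per-byte stack keeps only its unique top entries plus an Rc link to a shared tail, which deduplicates memory, but a search for a granting tag still walks the whole chain, so the time complexity stays linear in stack depth.

    use std::rc::Rc;

    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    struct Tag(u64);

    enum Stack {
        Empty,
        Node { tag: Tag, rest: Rc<Stack> },
    }

    impl Stack {
        // Pushing never copies the tail; it is shared between stacks.
        fn push(rest: &Rc<Stack>, tag: Tag) -> Rc<Stack> {
            Rc::new(Stack::Node { tag, rest: Rc::clone(rest) })
        }

        // Linear search from the top: still O(depth), which is the real problem.
        fn grants(&self, needle: Tag) -> bool {
            let mut cur = self;
            while let Stack::Node { tag, rest } = cur {
                if *tag == needle {
                    return true;
                }
                cur = &**rest;
            }
            false
        }
    }

    fn main() {
        let base = Stack::push(&Rc::new(Stack::Empty), Tag(0)); // shared tail
        let a = Stack::push(&base, Tag(1));
        let b = Stack::push(&base, Tag(2)); // shares `base` with `a` instead of copying it
        assert!(a.grants(Tag(0)) && b.grants(Tag(0)));
        assert!(!b.grants(Tag(1)));
    }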

@Kixiron
Member Author

Kixiron commented Jun 27, 2022

I ran some benches and it looks like there's little to no difference between the two

Current:

test str::str_from_lossy_emoji  ... bench:       2,679 ns/iter (+/- 91) = 2114 MB/s
test str::str_from_lossy_en     ... bench:       1,409 ns/iter (+/- 47) = 3821 MB/s
test str::str_from_lossy_ru     ... bench:       2,693 ns/iter (+/- 85) = 1886 MB/s
test str::str_from_lossy_zh     ... bench:       3,393 ns/iter (+/- 235) = 1391 MB/s

This PR:

test str::str_from_lossy_emoji  ... bench:       2,623 ns/iter (+/- 123)
test str::str_from_lossy_en     ... bench:       1,423 ns/iter (+/- 96)
test str::str_from_lossy_ru     ... bench:       1,929 ns/iter (+/- 38)
test str::str_from_lossy_zh     ... bench:       3,367 ns/iter (+/- 59)

@saethlin
Member

Did you try running the benchmarks in alloc? `x bench library/alloc --test-args=lossy` is what I ran, and I got a different set of benchmarks.

@bors
Contributor

bors commented Jun 27, 2022

☀️ Try build successful - checks-actions
Build commit: dc2897de64ccc546d1d9a10d8d93ac2bd641ef30

@rust-timer
Collaborator

Queued dc2897de64ccc546d1d9a10d8d93ac2bd641ef30 with parent 8e52fa8, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (dc2897de64ccc546d1d9a10d8d93ac2bd641ef30): comparison url.

Instruction count

  • Primary benchmarks: no relevant changes found
  • Secondary benchmarks: 😿 relevant regression found

                              mean¹    max     count²
  Regressions 😿 (primary)     N/A      N/A     0
  Regressions 😿 (secondary)   0.4%     0.4%    1
  Improvements 🎉 (primary)    N/A      N/A     0
  Improvements 🎉 (secondary)  N/A      N/A     0
  All 😿🎉 (primary)           N/A      N/A     0

Max RSS (memory usage)

  • Primary benchmarks: no relevant changes found
  • Secondary benchmarks: 😿 relevant regressions found

                              mean¹    max     count²
  Regressions 😿 (primary)     N/A      N/A     0
  Regressions 😿 (secondary)   4.6%     6.0%    3
  Improvements 🎉 (primary)    N/A      N/A     0
  Improvements 🎉 (secondary)  N/A      N/A     0
  All 😿🎉 (primary)           N/A      N/A     0

Cycles

This benchmark run did not return any relevant results for this metric.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jun 28, 2022
@Kixiron
Member Author

Kixiron commented Jun 28, 2022

Ran the alloc from_utf8_lossy benches
Current:

test string::from_utf8_lossy_100_ascii                   ... bench:          40 ns/iter (+/- 2)
test string::from_utf8_lossy_100_invalid                 ... bench:       1,459 ns/iter (+/- 54)
test string::from_utf8_lossy_100_multibyte               ... bench:          78 ns/iter (+/- 5)
test string::from_utf8_lossy_invalid                     ... bench:         199 ns/iter (+/- 5)

This PR:

test string::from_utf8_lossy_100_ascii                   ... bench:          36 ns/iter (+/- 2)
test string::from_utf8_lossy_100_invalid                 ... bench:       1,273 ns/iter (+/- 26)
test string::from_utf8_lossy_100_multibyte               ... bench:          55 ns/iter (+/- 5)
test string::from_utf8_lossy_invalid                     ... bench:         181 ns/iter (+/- 5)

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 24, 2022
@Dylan-DPC
Member

@bors delegate=saethlin

@bors
Contributor

bors commented Jul 24, 2022

✌️ @saethlin can now approve this pull request

Comment on lines -60 to -65
-        while i < self.source.len() {
-            // SAFETY: `i < self.source.len()` per previous line.
-            // For some reason the following are both significantly slower:
-            // while let Some(&byte) = self.source.get(i) {
-            // while let Some(byte) = self.source.get(i).copied() {
-            let byte = unsafe { *self.source.get_unchecked(i) };
+        let length = self.source.len();
+        let mut current = self.source.as_ptr();
+        // SAFETY: current + length is one past the end of the allocation
+        let (start, end, mut valid_up_to) = unsafe { (current, current.add(length), current) };
+
+        while current < end {
+            // SAFETY: `current < end` per previous line.

This PR as-written adds a lot of unsafe. I do not think that we should do that without sufficient justification.

How much of a perf impact do we see on Miri backtrace printing and the standard library benchmarks from only adjusting these lines? The specific pattern we want to avoid is this: https://github.com/rust-lang/miri/blob/8fdb720329d7674a878a8252fe4b79ef93d6ffec/bench-cargo-miri/slice-get-unchecked/src/main.rs#L8-L9

    while i < x.len() {
        let _element = unsafe { *x.get_unchecked(i) };

I'm not completely opposed to the pervasive sort of changes that you've implemented in the rest of this function, but we need benchmarking that makes the case for those changes, as opposed to just tweaking the ASCII fast path.

@Mark-Simulacrum Mark-Simulacrum removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 24, 2022
@camelid camelid added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. A-miri Area: The miri tool labels Aug 2, 2022
@JohnCSimon
Member

@Kixiron
Ping from triage: I'm closing this due to inactivity. Please reopen when you are ready to continue with it.
Note: if you do, please reopen the PR BEFORE you push to it, or else you won't be able to reopen it; this is a quirk of GitHub.
Thanks for your contribution.

@rustbot label: +S-inactive

@JohnCSimon JohnCSimon closed this Jan 29, 2023
@rustbot rustbot added the S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. label Jan 29, 2023