
Miri: Skip over GlobalAllocs when sweeping #118080

Closed
wants to merge 1 commit

Conversation

saethlin (Member) commented Nov 20, 2023

Global allocations in the interpreter are never deallocated, so their AllocId is always in-use and cannot be removed by the GC. So when we are checking whether an AllocId can be removed, we really want to fast-path out if we know that the AllocId refers to a Global allocation.

In the interpreter we have a few HashMaps that map AllocId to something else, so in each of those maps I've stuck a bool alongside the value as a local cache of whether the AllocId is global. LiveAllocs::is_live now also takes a &mut bool so that it can set the flag if it deduces that the AllocId is in use because it is global.
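A minimal sketch of that caching pattern, with hypothetical names standing in for the real machine state and for tcx.try_get_global_alloc:

```rust
use std::collections::HashSet;

type AllocId = u64;

/// Stand-in for the GC's view: AllocIds referenced in the machine,
/// plus a stand-in for the tcx's set of global allocations.
struct LiveAllocs {
    machine_refs: HashSet<AllocId>,
    tcx_globals: HashSet<AllocId>,
}

impl LiveAllocs {
    /// Returns whether `id` is still in use. If liveness follows from `id`
    /// being a global allocation, the caller's cached flag is set so that
    /// future GC runs can skip both lookups.
    fn is_live(&self, id: AllocId, is_global: &mut bool) -> bool {
        if *is_global {
            return true; // cached fast path: globals are never deallocated
        }
        if self.machine_refs.contains(&id) {
            return true;
        }
        if self.tcx_globals.contains(&id) {
            *is_global = true; // populate the local cache
            return true;
        }
        false
    }
}
```

The flag starts out false for every map entry and only ever transitions to true, which is what makes caching it sound: a global allocation never stops being global.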

InterpCx::is_alloc_live now returns a Liveness enum which is alive, dead, or global, so that we can populate the local bool if we ever look up whether a global allocation is live.
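A sketch of what that return type might look like; the names are taken from the description above, and the two sets are stand-ins for the real machine and tcx state:

```rust
use std::collections::HashSet;

type AllocId = u64;

/// Liveness of an AllocId as seen by the GC.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Liveness {
    Alive,
    Dead,
    Global,
}

struct InterpCx {
    alloc_map: HashSet<AllocId>,     // allocations live in the machine
    global_allocs: HashSet<AllocId>, // stand-in for the tcx side
}

impl InterpCx {
    fn is_alloc_live(&self, id: AllocId) -> Liveness {
        if self.alloc_map.contains(&id) {
            Liveness::Alive
        } else if self.global_allocs.contains(&id) {
            // The caller can use this variant to populate its cached bool.
            Liveness::Global
        } else {
            Liveness::Dead
        }
    }
}
```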

r? RalfJung


Ideally we could set this flag upon insertion into these maps, but eagerly setting the value for base_ptr_tag doesn't seem possible. It's also tempting to pack the bit we need here into AllocId, but users of these maps don't know if they are looking up a GlobalAlloc so we would need Hash+PartialEq that ignores the bit which seems cursed.
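For illustration, the "cursed" bit-packing variant would need something like the following hypothetical wrapper, whose Hash and PartialEq both ignore the flag bit so that tagged and untagged ids find the same map entry (real AllocIds are NonZero indices; this is only a sketch):

```rust
use std::hash::{Hash, Hasher};

/// AllocId with an "is global" flag packed into the low bit.
#[derive(Debug, Clone, Copy, Eq)]
struct TaggedAllocId(u64);

impl TaggedAllocId {
    fn id(self) -> u64 {
        self.0 >> 1
    }
    fn is_global(self) -> bool {
        self.0 & 1 == 1
    }
}

// Equality and hashing must ignore the flag; otherwise a lookup with an
// untagged id would miss an entry that was inserted with the tagged one.
impl PartialEq for TaggedAllocId {
    fn eq(&self, other: &Self) -> bool {
        self.id() == other.id()
    }
}

impl Hash for TaggedAllocId {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.id().hash(state);
    }
}
```

The cursedness is that two values compare equal while differing in an observable bit, which silently violates the usual expectation that equal keys are interchangeable.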

@rustbot rustbot added A-testsuite Area: The testsuite used to check the correctness of rustc S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. labels Nov 20, 2023
@saethlin saethlin changed the title from "Miri: Skip over GlobalAllocs when GC'ing base_addr" to "Miri: Skip over GlobalAllocs when collecting" Nov 20, 2023
@rustbot's comment was marked as off-topic.

@saethlin saethlin added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 20, 2023
@saethlin saethlin changed the title from "Miri: Skip over GlobalAllocs when collecting" to "Miri: Skip over GlobalAllocs when sweeping" Nov 20, 2023
@rustbot's comment was marked as abuse.

@saethlin saethlin removed A-testsuite Area: The testsuite used to check the correctness of rustc T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. labels Nov 20, 2023
bors (Contributor) commented Nov 21, 2023

☔ The latest upstream changes (presumably #118134) made this pull request unmergeable. Please resolve the merge conflicts.

@saethlin saethlin marked this pull request as ready for review November 23, 2023 22:56
rustbot (Collaborator) commented Nov 23, 2023

Some changes occurred to the CTFE / Miri engine

cc @rust-lang/miri

The Miri subtree was changed

cc @rust-lang/miri

saethlin (Member, Author) commented:

Oh 🤦 I just remembered why I was keeping this as a draft. I wanted to post benchmarks. I'll do that in a few hours.

pub fn is_alloc_live(&self, id: AllocId) -> Liveness {
if self.memory.alloc_map.contains_key_ref(&id) {
return Liveness::Live;
}
Review comment (Member):

Note that these might be global. On their first read or write, globals get copied into the alloc_map, which is needed because we need to convert them from the global provenance type to that of Miri.

So what even is your definition of "global"? "AllocId is in tcx" or "AllocId is not in memory"? Those two are not equivalent.

@@ -648,32 +648,27 @@ trait EvalContextPrivExt<'mir: 'ecx, 'tcx: 'mir, 'ecx>: crate::MiriInterpCxExt<'
};

let (_size, _align, alloc_kind) = this.get_alloc_info(alloc_id);
Review comment (Member):

This does not care about size and align so it should call is_alloc_live, not get_alloc_info.
(We probably have more of these across Miri.)

RalfJung (Member) commented:

In the interpreter we have a few HashMaps that map AllocId to something else, so in all those maps I've stuck a bool alongside the value

Hm, my first reaction is that I really don't like such global changes. Most places shouldn't care whether something is global or not, and so this pollutes the actually relevant logic with some irrelevant book-keeping. I would strongly prefer if that can be avoided, or at least factored away so as to be completely invisible inside intptrcast and borrow tracking.

So the point of this is that tcx.try_get_global_alloc(id).is_some() is so expensive that we want to avoid doing it on each GC run?

saethlin (Member, Author) commented:

So the point of this is that tcx.try_get_global_alloc(id).is_some() is so expensive that we want to avoid doing it on each GC run?

For allocations which will never be deallocated but currently are not referenced in the Machine, the current implementation does two lookups (one into the map of AllocIds referenced in the Machine, then try_get_global_alloc) before we conclude that the GC should not clean up that AllocId. That's the kind of allocation I want to identify here: those which we know will never be deallocated, and for which the GC therefore can never clean up entries in these maps.

The perf implications of this PR are insignificant for normal operation, so if you don't think we can get the complexity cost down, I think it might make sense to not do this optimization. Running the GC infrequently is very effective for pushing down overheads like this into the noise.

But just to point out the value, I derived this optimization originally by studying execution of the 0weak_memory_consistency test, which uses the flags -Zmiri-disable-stacked-borrows -Zmiri-provenance-gc=1 which is basically the worst case for the AllocId-collecting parts of the GC. With those flags set in the benchmark suite, we see this before this PR:

Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/backtraces/Cargo.toml
  Time (mean ± σ):     36.691 s ±  0.704 s    [User: 36.507 s, System: 0.080 s]
  Range (min … max):   35.720 s … 37.411 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/invalidate/Cargo.toml
  Time (mean ± σ):     11.261 s ±  0.109 s    [User: 11.160 s, System: 0.068 s]
  Range (min … max):   11.098 s … 11.392 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/mse/Cargo.toml
  Time (mean ± σ):      1.046 s ±  0.008 s    [User: 0.980 s, System: 0.061 s]
  Range (min … max):    1.038 s …  1.059 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/serde1/Cargo.toml
  Time (mean ± σ):      3.947 s ±  0.046 s    [User: 3.862 s, System: 0.071 s]
  Range (min … max):    3.874 s …  3.997 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/serde2/Cargo.toml
  Time (mean ± σ):     11.978 s ±  0.214 s    [User: 11.867 s, System: 0.076 s]
  Range (min … max):   11.848 s … 12.357 s    5 runs
  
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/slice-get-unchecked/Cargo.toml
  Time (mean ± σ):     697.6 ms ±   1.6 ms    [User: 626.7 ms, System: 66.9 ms]
  Range (min … max):   694.8 ms … 698.6 ms    5 runs
  
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/unicode/Cargo.toml
  Time (mean ± σ):      6.600 s ±  0.017 s    [User: 6.513 s, System: 0.065 s]
  Range (min … max):    6.582 s …  6.625 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/zip-equal/Cargo.toml
  Time (mean ± σ):      2.746 s ±  0.010 s    [User: 2.677 s, System: 0.059 s]
  Range (min … max):    2.736 s …  2.757 s    5 runs

And with this PR, we see:

Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/backtraces/Cargo.toml
  Time (mean ± σ):     25.631 s ±  0.422 s    [User: 25.506 s, System: 0.081 s]
  Range (min … max):   25.232 s … 26.200 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/invalidate/Cargo.toml
  Time (mean ± σ):     10.580 s ±  0.088 s    [User: 10.444 s, System: 0.067 s]
  Range (min … max):   10.441 s … 10.686 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/mse/Cargo.toml
  Time (mean ± σ):     981.4 ms ±  12.4 ms    [User: 910.6 ms, System: 66.1 ms]
  Range (min … max):   960.2 ms … 993.0 ms    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/serde1/Cargo.toml
  Time (mean ± σ):      3.780 s ±  0.041 s    [User: 3.685 s, System: 0.076 s]
  Range (min … max):    3.751 s …  3.851 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/serde2/Cargo.toml
  Time (mean ± σ):     11.514 s ±  0.396 s    [User: 11.414 s, System: 0.080 s]
  Range (min … max):   11.266 s … 12.214 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/slice-get-unchecked/Cargo.toml
  Time (mean ± σ):     649.1 ms ±  10.9 ms    [User: 580.6 ms, System: 63.9 ms]
  Range (min … max):   630.1 ms … 657.4 ms    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/unicode/Cargo.toml
  Time (mean ± σ):      2.702 s ±  0.049 s    [User: 2.608 s, System: 0.061 s]
  Range (min … max):    2.657 s …  2.780 s    5 runs
 
Benchmark 1: cargo +stage2 miri run --manifest-path=bench-cargo-miri/zip-equal/Cargo.toml
  Time (mean ± σ):      2.612 s ±  0.040 s    [User: 2.511 s, System: 0.067 s]
  Range (min … max):    2.578 s …  2.677 s    5 runs

The 0weak_memory_consistency test shows the same improvement as the unicode benchmark.

RalfJung (Member) commented:

Those are some pretty impressive wins indeed, albeit for a rather artificial benchmark. How far do you have to increase the GC interval to make the improvement disappear in the noise?

For allocations which will never be deallocated but currently are not referenced in the Machine, the current implementation does two lookups (one into the map of AllocIds referenced in the Machine, then try_get_global_alloc) before we conclude that the GC should not clean up that AllocId.

This could be reduced to one lookup by swapping the order in which they are checked. But I suppose that's worse elsewhere?
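The order swap suggested here could look like the following sketch (names are assumptions, standing in for the machine's referenced-id set and the tcx globals):

```rust
use std::collections::HashSet;

type AllocId = u64;

/// Checking the tcx globals first makes an unreferenced global cost one
/// lookup per GC run, at the price of an extra lookup for every live
/// non-global allocation, which is the trade-off being questioned above.
fn should_keep(
    tcx_globals: &HashSet<AllocId>,
    machine_refs: &HashSet<AllocId>,
    id: AllocId,
) -> bool {
    tcx_globals.contains(&id) || machine_refs.contains(&id)
}
```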

bors (Contributor) commented Nov 25, 2023

☔ The latest upstream changes (presumably #118284) made this pull request unmergeable. Please resolve the merge conflicts.

saethlin (Member, Author) commented:

Obviated by #118336

@saethlin saethlin closed this Nov 27, 2023
@saethlin saethlin deleted the skip-global-allocs branch November 27, 2023 02:09