Implement parallel sweeping of stack pools #55643
Use a round robin to return stacks only one thread at a time, to avoid contention on munmap syscalls.
Running the GC benchmarks, they all seem unchanged except mergesort_parallel, which sees some improvement in wall-clock time. In absolute sweep time, though, it goes from a median of around 250 ms on master to 35 ms in this PR, which is quite nice as well.
While you are at it, could you add a metric to measure time spent on sweeping GC stacks to the |
LGTM.
Please address the latest comment about metrics and run `mergesort_parallel.jl` to confirm that this metric does decrease after the changes in the latest commit.
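For readers following along, here is a rough sketch of what such a metric could look like: a running total of nanoseconds spent sweeping stack pools, fed from a monotonic clock. The names `stack_pool_sweep_ns`, `now_ns` and `record_stack_sweep` are hypothetical and are not the runtime's actual metrics fields or helpers; the use of POSIX `clock_gettime` is likewise an assumption made for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical accumulator: total time spent sweeping stack pools, in ns.
   Not the runtime's real metrics field. */
static uint64_t stack_pool_sweep_ns;

/* Read a monotonic clock so the measurement is unaffected by wall-clock
   adjustments. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Add one sweep interval to the running total. */
static void record_stack_sweep(uint64_t start_ns, uint64_t end_ns)
{
    stack_pool_sweep_ns += end_ns - start_ns;
}
```

The collector would take `now_ns()` before and after the stack sweep and pass both values to `record_stack_sweep`; comparing the total between master and the PR branch on `mergesort_parallel.jl` is then a matter of reading one counter.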
Master vs. PR (posted results not preserved)
This is not really a GC-limited task, but the GC was limited by these sweeps. One thing that might also be happening here is that the threads might not go to sleep between mark and sweep now.
Also use a round robin to only return stacks one thread at a time, to avoid contention on munmap syscalls.

Using https://github.com/gbaraldi/cilkbench_julia/blob/main/cilk5julia/nqueens.jl as a benchmark, it's about 12% faster in wall time. This benchmark has other weird behaviours, especially single-threaded: it calls `wait` thousands of times per second, and when single-threaded every one of those does a `jl_process_events` call, which is a syscall plus preemption, so it looks like a hang. With threads the issue isn't there.

The idea behind the round robin is twofold. First, we are just freeing too much, and after talking with @vtjnash we may want some less aggressive behaviour. Second, munmap takes a lock in most OSs, so doing it in parallel has severe negative scaling.
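A minimal sketch of the round-robin idea, not the code in this PR: a counter advances once per GC cycle, and only the thread it selects is allowed to return cached stacks to the OS in that cycle. The names `sweep_cycle`, `pick_release_thread` and `may_release_stacks` are hypothetical, and the assumption that the other threads' pools are still swept but keep their free stacks cached is mine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter, advanced once per GC cycle. */
static atomic_uint sweep_cycle;

/* Called once at the start of a sweep: returns the id of the single thread
   whose cached stacks may be unmapped during this cycle. */
static unsigned pick_release_thread(unsigned nthreads)
{
    assert(nthreads > 0);
    unsigned cycle = atomic_fetch_add(&sweep_cycle, 1);
    return cycle % nthreads;
}

/* Assumption: all pools are swept every cycle, but only the selected
   thread's pool issues munmap calls, so those calls never contend. */
static bool may_release_stacks(unsigned tid, unsigned release_tid)
{
    return tid == release_tid;
}
```

With four threads, successive cycles select threads 0, 1, 2, 3, 0, and so on, so each pool is trimmed once every four collections. That addresses both points above: less memory is freed per cycle, and at most one thread is inside munmap at a time.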