Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement parallel sweeping of stack pools #55643

Merged
merged 9 commits into from
Oct 16, 2024
Merged

Conversation

gbaraldi
Copy link
Member

@gbaraldi gbaraldi commented Aug 30, 2024

Also use a round robin to only return stacks one thread at a time to avoid contention on munmap syscalls.
Using
https://github.com/gbaraldi/cilkbench_julia/blob/main/cilk5julia/nqueens.jl
as a benchmark it's about 12% faster wall time. This benchmark has other weird behaviours specially single threaded. Where if calls wait thousandas of times per second, and if single threaded every single one does a jl_process_events call which is a syscall + preemption. So it looks like a hang. With threads the issue isn't there

The idea behind the round robin is twofold. One we are just freeing too much and talking with @vtjnash we maybe want some less agressive behaviour, the second is that munmap takes a lock in most OSs. So doing it in parallel has severe negative scaling.

use a round robin to only return stacks one thread at a time to avoid contention on munmap syscalls
@gbaraldi
Copy link
Member Author

Running the GC benchmarks they seem to all be the same except the mergesort_parallel one that sees some improvements in wall clock. But in absolute sweep time it goes from a median of around 250ms in master to 35ms in this PR. Which is quite nice as well

@giordano giordano added performance Must go faster GC Garbage collector labels Aug 30, 2024
src/gc-stacks.c Outdated Show resolved Hide resolved
src/gc-stock.c Outdated Show resolved Hide resolved
src/gc-stacks.c Show resolved Hide resolved
src/gc-stock.c Outdated Show resolved Hide resolved
src/gc-stock.c Outdated Show resolved Hide resolved
src/gc-stock.c Show resolved Hide resolved
src/gc-stacks.c Show resolved Hide resolved
src/gc-stock.c Outdated Show resolved Hide resolved
src/gc-stock.c Show resolved Hide resolved
src/gc-tls.h Outdated Show resolved Hide resolved
src/gc-stacks.c Outdated Show resolved Hide resolved
@d-netto
Copy link
Member

d-netto commented Sep 11, 2024

While you are at it, could you add a metric to measure time spent on sweeping GC stacks to the GC_Num struct?

@d-netto
Copy link
Member

d-netto commented Sep 11, 2024

Based on the experience with #48600 and #51282, there is a chance we will find a workload in the future in which this PR regresses performance a bit.

It will be good to have proper instrumentation to diagnose whether we are having negative scaling, etc.

Copy link
Member

@d-netto d-netto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Please address the latest comment about metrics and run mergesort_parallel.jl to confirm this metric indeed reduces after the changes from the latest commit.

@gbaraldi
Copy link
Member Author

Master

┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ stack sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │               ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       1959 │     110 │        37 │         69 │               40 │           40 │                72 │      671 │          5 │
│  median │       2012 │     151 │        40 │        103 │               67 │           50 │               194 │      724 │          8 │
│ maximum │       2218 │     204 │        70 │        159 │              117 │           70 │              4047 │      758 │          9 │
│   stdev │         84 │      30 │        11 │         27 │               23 │            9 │              1289 │       28 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR

┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ stack sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │               ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       1967 │      89 │        35 │         53 │               21 │           29 │               462 │      697 │          5 │
│  median │       2009 │      97 │        39 │         58 │               24 │           32 │              1236 │      739 │          5 │
│ maximum │       2073 │     108 │        42 │         69 │               35 │           34 │              4030 │      757 │          5 │
│   stdev │         31 │       6 │         2 │          5 │                5 │            2 │              1065 │       17 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────────┴──────────────┴───────────────────┴──────────┴────────────┘

This not a really GC limited task but the GC was limited by these sweeps. One thing that might also be happening here is that the threads might not go to sleep between mark and sweep now.

@gbaraldi gbaraldi added the merge me PR is reviewed. Merge when all tests are passing label Oct 15, 2024
@giordano giordano merged commit df5f437 into master Oct 16, 2024
8 checks passed
@giordano giordano deleted the gb/parallel-stack-pools branch October 16, 2024 19:52
@giordano giordano removed the merge me PR is reviewed. Merge when all tests are passing label Oct 16, 2024
KristofferC pushed a commit that referenced this pull request Oct 21, 2024
Also use a round robin to only return stacks one thread at a time to
avoid contention on munmap syscalls.
Using

https://github.com/gbaraldi/cilkbench_julia/blob/main/cilk5julia/nqueens.jl
as a benchmark it's about 12% faster wall time. This benchmark has other
weird behaviours specially single threaded. Where if calls `wait`
thousandas of times per second, and if single threaded every single one
does a `jl_process_events` call which is a syscall + preemption. So it
looks like a hang. With threads the issue isn't there

The idea behind the round robin is twofold. One we are just freeing too
much and talking with vtjnash we maybe want some less agressive
behaviour, the second is that munmap takes a lock in most OSs. So doing
it in parallel has severe negative scaling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GC Garbage collector performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants