Implement parallel sweeping of stack pools #55643

gbaraldi · 2024-08-30T18:35:51Z

Also use a round robin to return stacks from only one thread at a time, to avoid contention on `munmap` syscalls.
Using https://github.com/gbaraldi/cilkbench_julia/blob/main/cilk5julia/nqueens.jl as a benchmark, it's about 12% faster in wall time. This benchmark has other weird behaviours, especially single-threaded: it calls `wait` thousands of times per second, and when single-threaded every one of those calls does a `jl_process_events` call, which is a syscall plus preemption, so it looks like a hang. With threads the issue isn't there.

The idea behind the round robin is twofold. First, we are just freeing too much, and after talking with @vtjnash we may want less aggressive behaviour. Second, `munmap` takes a lock in most OSs, so doing it in parallel has severe negative scaling.
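
The round-robin idea can be sketched roughly as follows. This is a minimal, hypothetical model in C and not the code from this PR: `stack_pool_t`, `sweep_worker`, `release_turn`, and `N_POOLS` are illustrative names, and the `munmap` call is replaced by a counter so the sketch stays self-contained. Workers claim pools through an atomic cursor so they can sweep in parallel, while only the pool whose index matches the current turn gives memory back during a given cycle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_POOLS 4 /* hypothetical: one stack pool per thread */

typedef struct {
    size_t cached;   /* free stacks currently held in the pool */
    size_t released; /* stacks handed back to the OS so far */
    size_t sweeps;   /* how many times this pool has been swept */
} stack_pool_t;

static stack_pool_t pools[N_POOLS];
static atomic_size_t pool_cursor; /* next unclaimed pool in this cycle */
static size_t release_turn;       /* the only pool allowed to release this cycle */

/* Reset the shared cursor before the sweeper threads start. */
static void sweep_cycle_begin(void)
{
    atomic_store(&pool_cursor, 0);
}

/* Run by every sweeper thread: claim pools until none are left. */
static void sweep_worker(void)
{
    assert(release_turn < N_POOLS);
    for (;;) {
        size_t i = atomic_fetch_add(&pool_cursor, 1);
        if (i >= N_POOLS)
            break;
        pools[i].sweeps++;
        if (i == release_turn) {
            /* Stand-in for munmap: only one pool pays this cost per cycle. */
            pools[i].released += pools[i].cached;
            pools[i].cached = 0;
        }
    }
}

/* Advance the turn once all sweepers have finished. */
static void sweep_cycle_end(void)
{
    release_turn = (release_turn + 1) % N_POOLS;
}
```

In this model each pool is claimed by exactly one worker per cycle, so the release branch never runs concurrently, and a pool's cached stacks are returned at most once every `N_POOLS` cycles, which also makes freeing less aggressive.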

gbaraldi · 2024-08-30T19:48:40Z

Running the GC benchmarks, they all seem to be the same except `mergesort_parallel`, which sees some improvement in wall-clock time. In absolute sweep time, though, it goes from a median of around 250ms on master to 35ms in this PR, which is quite nice as well.

d-netto · 2024-09-11T22:15:15Z

While you are at it, could you add a metric to measure time spent on sweeping GC stacks to the GC_Num struct?
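
A metric of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual `GC_Num` change: `gc_stats_t`, `stack_pool_sweep_ns`, `timed_stack_sweep`, and `noop_sweep` are hypothetical names, and the timer is the POSIX `clock_gettime` monotonic clock rather than whatever timing helper the runtime uses.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stats record; the field names are illustrative only. */
typedef struct {
    uint64_t stack_pool_sweep_ns; /* accumulated time spent sweeping stack pools */
    uint64_t n_sweeps;            /* number of timed sweeps */
} gc_stats_t;

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Placeholder for the real sweep so the sketch compiles on its own. */
static void noop_sweep(void)
{
}

/* Wrap one sweep call and add its duration to the stats record. */
static void timed_stack_sweep(gc_stats_t *stats, void (*sweep)(void))
{
    assert(stats != NULL && sweep != NULL);
    uint64_t t0 = now_ns();
    sweep();
    stats->stack_pool_sweep_ns += now_ns() - t0;
    stats->n_sweeps++;
}
```

Accumulating the total alongside a count lets a benchmark harness report both the overall stack sweep time and a per-collection average.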

d-netto · 2024-09-11T22:17:52Z

Based on the experience with #48600 and #51282, there is a chance we will find a workload in the future in which this PR regresses performance a bit.

It will be good to have proper instrumentation to diagnose whether we are having negative scaling, etc.

d-netto

LGTM.

Please address the latest comment about metrics and run `mergesort_parallel.jl` to confirm that this metric is indeed reduced after the changes from the latest commit.

gbaraldi · 2024-09-25T13:32:21Z

Master

┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ stack sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │               ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       1959 │     110 │        37 │         69 │               40 │           40 │                72 │      671 │          5 │
│  median │       2012 │     151 │        40 │        103 │               67 │           50 │               194 │      724 │          8 │
│ maximum │       2218 │     204 │        70 │        159 │              117 │           70 │              4047 │      758 │          9 │
│   stdev │         84 │      30 │        11 │         27 │               23 │            9 │              1289 │       28 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR

┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ stack sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │               ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       1967 │      89 │        35 │         53 │               21 │           29 │               462 │      697 │          5 │
│  median │       2009 │      97 │        39 │         58 │               24 │           32 │              1236 │      739 │          5 │
│ maximum │       2073 │     108 │        42 │         69 │               35 │           34 │              4030 │      757 │          5 │
│   stdev │         31 │       6 │         2 │          5 │                5 │            2 │              1065 │       17 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────────┴──────────────┴───────────────────┴──────────┴────────────┘

This is not really a GC-limited task, but the GC was limited by these sweeps. One other thing that might be happening here is that the threads might now not go to sleep between mark and sweep.

Implement parallel sweeping of stack pools +

aef2ba0

use a round robin to only return stacks one thread at a time to avoid contention on munmap syscalls

gbaraldi requested review from d-netto and vtjnash August 30, 2024 18:36

giordano added the performance (Must go faster) and GC (Garbage collector) labels Aug 30, 2024

Merge branch 'master' into gb/parallel-stack-pools

015cf96

d-netto reviewed Sep 6, 2024

gbaraldi force-pushed the gb/parallel-stack-pools branch from 2260ddf to 4a07749 Compare September 6, 2024 16:42

Apply suggestions from code review

b7e8386

gbaraldi force-pushed the gb/parallel-stack-pools branch from 4a07749 to b7e8386 Compare September 6, 2024 16:43

Make analyzegc happier

ebb39c1

d-netto reviewed Sep 9, 2024

gbaraldi added 2 commits September 9, 2024 12:18

Address suggestions from code review

31b5c0b

Move assertion to correct place.

a6d0391

d-netto mentioned this pull request Sep 11, 2024

Backports JuliaLang#55643 for internal testing. RelationalAI/julia#178

Closed

3 tasks

d-netto reviewed Sep 11, 2024

d-netto reviewed Sep 11, 2024

Address suggestions from code review

ad14f6d

d-netto approved these changes Sep 12, 2024

gbaraldi added 2 commits September 25, 2024 10:36

Add statistic for sweeping of stack pools

065d982

Merge branch 'master' into gb/parallel-stack-pools

1cdca13

gbaraldi added the merge me (PR is reviewed. Merge when all tests are passing) label Oct 15, 2024

giordano merged commit df5f437 into master Oct 16, 2024
8 checks passed

giordano deleted the gb/parallel-stack-pools branch October 16, 2024 19:52

giordano removed the merge me (PR is reviewed. Merge when all tests are passing) label Oct 16, 2024
