Add fallback if we have make a weird GC decision. #50682

Merged: oscardssmith merged 12 commits into JuliaLang:master from gc-fix3 on Aug 5, 2023

Conversation

gbaraldi
Member

@gbaraldi gbaraldi commented Jul 26, 2023

If something odd happens during GC (for example, the PC goes to sleep), or there is a very big transient, the heuristics might make a bad decision. This PR adds a fallback: if we try to make our target more than double the one we had before, we fall back to a more conservative method. This fixes the new issue @vtjnash found in #40644 for me.
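
As a rough illustration of the fallback described above, here is a minimal sketch in C. It is not the code from this PR: the name `clamp_heap_target`, its parameters, and the choice of conservative value (capping growth at exactly double the previous target) are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical helper for illustration; not a function in the Julia runtime.
 * Returns the heap target to use for the next cycle. */
static uint64_t clamp_heap_target(uint64_t previous_target, uint64_t proposed_target)
{
    /* With no previous target there is nothing to compare against. */
    if (previous_target == 0)
        return proposed_target;

    /* "More than double" is tested as (proposed - previous) > previous,
     * which avoids overflow in previous_target * 2. */
    if (proposed_target > previous_target &&
        proposed_target - previous_target > previous_target) {
        /* Assumed conservative fallback: grow to at most double. */
        return previous_target * 2;
    }
    return proposed_target;
}
```

With this sketch, a heuristic that proposes a huge target after a long pause would be limited to twice the previous target, while ordinary growth or shrinkage passes through unchanged.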

@gbaraldi gbaraldi requested a review from vtjnash July 27, 2023 14:59
@gbaraldi
Member Author

gbaraldi commented Aug 1, 2023

I will do a follow-up PR to refactor the GC num stuff, because there is currently an accounting bug somewhere and live_bytes gets weird results, e.g. oscar-system/Oscar.jl#2441 (comment)

@gbaraldi gbaraldi added the merge me PR is reviewed. Merge when all tests are passing label Aug 1, 2023
Contributor

@benlorenz benlorenz left a comment

The currently unreliable live_bytes metric seems to be still in use for deciding when to do a full sweep (among other criteria):

    if (live_bytes > max_total_memory) {
        sweep_full = 1;
    }

https://github.com/JuliaLang/julia/pull/50682/files#diff-76b4d6ef0f7d9c2a6855e1ef7b0f7c94ddc68fcca2cdcd69e6d2b9007c279fdeR3273-R3275

Edit: Without a heap-size-hint, a cgroup restriction, or the change I suggested below, this would currently never trigger, since the max memory is 2PB by default. But with my change this might cause the GC to always do full sweeps when live_bytes is unreasonably big due to the miscount.
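
To make the concern concrete, here is a small sketch of the quoted check as a standalone function. The name `forces_full_sweep` and the constant `ASSUMED_DEFAULT_MAX_MEMORY` (2 PiB, taken as 2^51 bytes) are assumptions for illustration; the real check lives inside the collector and reads global state.

```c
#include <stdint.h>

/* Assumed value: 2 PiB (2^51 bytes), standing in for the default limit. */
#define ASSUMED_DEFAULT_MAX_MEMORY ((uint64_t)1 << 51)

/* Hypothetical stand-in for the quoted check: returns 1 when a full sweep
 * would be forced because live_bytes exceeds the limit, 0 otherwise. */
static int forces_full_sweep(uint64_t live_bytes, uint64_t max_total_memory)
{
    return live_bytes > max_total_memory ? 1 : 0;
}
```

Under these assumptions, a miscounted live_bytes of 20 GiB never exceeds the 2 PiB default, but it does exceed a limit of 16 GiB minus 250 MiB, so every collection would then be a full sweep.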

@@ -3612,10 +3619,10 @@ void jl_gc_init(void)
total_mem = uv_get_total_memory();
uint64_t constrained_mem = uv_get_constrained_memory();
if (constrained_mem > 0 && constrained_mem < total_mem)
total_mem = constrained_mem;
jl_gc_set_max_memory(constrained_mem - 250*1024*1024); // LLVM + other libraries need some amount of memory
Contributor

@benlorenz benlorenz Aug 2, 2023

Wouldn't it make sense to set the max memory based on total_mem as well, when it is non-zero and constrained_mem is not set? Maybe like this:

Suggested change
jl_gc_set_max_memory(constrained_mem - 250*1024*1024); // LLVM + other libraries need some amount of memory
total_mem = constrained_mem;
if (total_mem > 0)
jl_gc_set_max_memory(total_mem - 250*1024*1024); // LLVM + other libraries need some amount of memory

Currently the maximum without a heap hint is very large:

julia> Base.format_bytes(@ccall jl_gc_get_max_memory()::UInt64)
"2.000 PiB"

(since uv_get_constrained_memory() returns something like "8192.000 PiB" without any cgroup restrictions)
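
The selection logic suggested in this comment can be sketched as a pure function. This is only an illustration under assumptions: `pick_max_memory` and `RESERVED_BYTES` are hypothetical names, a return value of 0 is used here to mean "leave the default limit alone", and the guard compares against the reserved amount rather than 0 so that the subtraction cannot wrap around.

```c
#include <stdint.h>

/* Headroom for LLVM and other libraries, as in the diff above (250 MiB). */
#define RESERVED_BYTES ((uint64_t)250 * 1024 * 1024)

/* Hypothetical helper: returns the limit to apply, or 0 to keep the default. */
static uint64_t pick_max_memory(uint64_t total_mem, uint64_t constrained_mem)
{
    /* Prefer the constrained value only when it is set and actually smaller. */
    if (constrained_mem > 0 && constrained_mem < total_mem)
        total_mem = constrained_mem;

    /* Only subtract the headroom when the result stays positive. */
    if (total_mem > RESERVED_BYTES)
        return total_mem - RESERVED_BYTES;
    return 0;
}
```

With no cgroup restriction, a constrained value as large as 8192 PiB is ignored by the first test, so the limit is derived from the physical total instead.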

Member Author

That's the idea: if you don't set an explicit hint, we will just follow the heap resizing algorithm.

@oscardssmith
Member

is this ready to merge?

@gbaraldi
Member Author

gbaraldi commented Aug 4, 2023

I believe finally yes!

@oscardssmith oscardssmith removed the merge me PR is reviewed. Merge when all tests are passing label Aug 5, 2023
@oscardssmith oscardssmith merged commit ab94fad into JuliaLang:master Aug 5, 2023
2 checks passed
 #endif
     if (jl_options.heap_size_hint)
-        jl_gc_set_max_memory(jl_options.heap_size_hint);
+        jl_gc_set_max_memory(jl_options.heap_size_hint - 250*1024*1024);
Member

Do we need to protect this from underflow?

Member Author

@gbaraldi gbaraldi Aug 5, 2023

jl_gc_set_max_memory protects from underflow, but it doesn't warn the user. Maybe it should?

Member

No need to warn then either
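
For readers wondering what the underflow in question looks like: the subtraction is on an unsigned 64-bit value, so a hint smaller than 250 MiB would wrap around to a very large number. The thread states that jl_gc_set_max_memory already guards against this, without showing how. The sketch below is therefore not a description of that function; it shows one alternative, an explicit saturating subtraction at the call site, using hypothetical helper names.

```c
#include <stdint.h>

/* Hypothetical helper: computes a - b, clamping at 0 instead of wrapping. */
static uint64_t saturating_sub_u64(uint64_t a, uint64_t b)
{
    return a > b ? a - b : 0;
}

/* Hypothetical wrapper: heap hint minus the 250 MiB headroom used in the diff. */
static uint64_t hint_minus_headroom(uint64_t heap_size_hint)
{
    const uint64_t headroom = (uint64_t)250 * 1024 * 1024;
    return saturating_sub_u64(heap_size_hint, headroom);
}
```

A caller could then treat a result of 0 as "hint too small to honor" and decide separately whether to warn.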

KristofferC pushed a commit that referenced this pull request Aug 10, 2023

(cherry picked from commit ab94fad)
@gbaraldi gbaraldi deleted the gc-fix3 branch August 14, 2023 16:15
KristofferC added a commit that referenced this pull request Aug 16, 2023
Backported PRs:
- [x] #50637 <!-- Remove SparseArrays legacy code -->
- [x] #50665 <!-- print `@time` msg into print buffer -->
- [x] #50523 <!-- Avoid generic call in most cases for getproperty -->
- [x] #50635 <!-- `versioninfo()`: include build info and unofficial
warning -->
- [x] #50670 <!-- Make reinterpret specialize fully. -->
- [x] #50666 <!-- include `--pkgimage=no` caches for stdlibs -->
- [x] #50765 
- [x] #50764
- [x] #50768
- [x] #50767
- [x] #50618 <!-- inference: continue const-prop' when concrete-eval
returns non-inlineable -->
- [x] #50689 <!-- Attach `tanpi` docstring to method -->
- [x] #50671 <!-- Fix rdiv of complex lhs by real factorizations -->
- [x] #50598 <!-- only limit types in stack traces in the REPL -->
- [x] #50766 <!-- Don't partition alwaysinline functions -->
- [x] #50771 <!-- re-allow non-string values in ENV `get!` -->
- [x] #50682 <!-- Add fallback if we have make a weird GC decision. -->
- [x] #50781 <!-- fix `bit_map!` with aliasing -->
- [x] #50172 <!-- print feature flags used for matching pkgimage -->
- [x] #50844 <!-- Bump OpenBLAS binaries to use the new GEMM
multithreading threshold -->
- [x] #50826 <!-- Update dependency builds -->
- [x] #50845 <!-- fix #50438, use default pool for at-threads -->
- [x] #50568 <!-- `Array(::AbstractRange)` should return an `Array` -->
- [x] #50655 <!-- fix hashing regression. -->
- [x] #50779 <!-- Minor refactor to image generation -->
- [x] #50791 <!-- Make symbols internal in jl_create_native, and only
externalize them when partitioning -->
- [x] #50724 <!-- Merge opaque closure modules with the rest of the
workqueue -->
- [x] #50738 <!-- Add alignment to constant globals -->
- [x] #50871 <!-- macOS: Don't inspect dead threadtls during exception
handling. -->

Need manual backport:

Contains multiple commits, manual intervention needed:

Non-merged PRs with backport label:
- [ ] #50850 <!-- Remove weird Rational dispatch and add pi functions to
list -->
- [ ] #50823 <!-- Make ranges more robust with unsigned indexes. -->
- [ ] #50809 <!-- Limit type-printing in MethodError -->
- [ ] #50663 <!-- Fix Expr(:loopinfo) codegen -->
- [ ] #50594 <!-- Disallow non-index Integer types in isassigned -->
- [ ] #50385 <!-- Precompile pidlocks: add to NEWS and docs -->
- [ ] #49805 <!-- Limit TimeType subtraction to AbstractDateTime -->
@KristofferC KristofferC removed the backport 1.10 Change should be backported to the 1.10 release label Aug 18, 2023
d-netto pushed a commit to RelationalAI/julia that referenced this pull request Oct 4, 2023
d-netto pushed a commit that referenced this pull request Mar 15, 2024
Labels
GC Garbage collector
7 participants