proposal: runtime: GC Callback to handle CPU deathspirals with GOMEMLIMIT #59324

Jorropo · 2023-03-30T04:55:33Z

I have been experimenting with using GOMEMLIMIT to increase how much my mostly IO-bound service is able to handle.
It works great, and it also works great together with a very high GOGC to reduce CPU utilisation (if I allocate 32GiB to a service, I don't care that it is only using 4 and thus runs the GC once it reaches 8; I might as well wait until the 32 have been used and run the GC less often).
Sadly, if I am about to OOM, instead of crashing it will try to run the GC almost continuously (see #58106), which is undesirable.
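
For readers who want to reproduce this setup from code rather than from environment variables, here is a minimal sketch using runtime/debug. The helper names configureGC and currentLimit, the 32 GiB limit and the GOGC value of 1000 are illustrative assumptions, not values taken from my service.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// configureGC (hypothetical helper) sets a soft memory limit and a high
// GC percent, the programmatic equivalent of GOMEMLIMIT and GOGC.
// It returns the previous settings so they can be restored.
func configureGC(limitBytes int64, gcPercent int) (prevLimit int64, prevPercent int) {
	prevLimit = debug.SetMemoryLimit(limitBytes)
	prevPercent = debug.SetGCPercent(gcPercent)
	return prevLimit, prevPercent
}

// currentLimit (hypothetical helper) reads the memory limit without
// changing it: a negative argument to SetMemoryLimit only queries.
func currentLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	const limit = 32 << 30 // 32 GiB, illustrative
	configureGC(limit, 1000) // 1000 is an arbitrary "very high" GOGC
	fmt.Println("memory limit in bytes:", currentLimit())
}
```

With these settings the heap-growth trigger rarely fires, so in practice the memory limit is what ends up driving most GC cycles.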

I think the right solution is very application dependent, which is why I think a function like this would be helpful:

// SetGcCallback registers a callback that MAY be called after any GC run.
// Any future garbage collection will block on the callback, that means no
// garbage collection will be executed while the callback is executing.
// However this does not stop the program, that means callbacks must be
// careful not to run for too long, at the risk of allowing the program to allocate
// too much memory before a GC is triggered.
// If a callback is already registered, a new call to SetGcCallback replaces the
// callback that will be called.
// This does not guarantee that a GC is used, or that it will call back into this function.
// However, if the callback is ever called after running GC, then it will always be called
// after running GC.
func SetGcCallback(func(sizeBeforeGc, sizeAfterGc uintptr))

Rationale behind the different points:

registers a callback that MAY be called after any GC run

This does not guarantee that a GC is used, or that it will call back into this function.
However, if the callback is ever called after running GC, then it will always be called after running GC.

This allows this to be an optional feature that may or may not be implemented by the various Go implementations.
However, this forces an implementation to either implement this or not; it can't be inconsistent about this (within the lifetime of one program).

Any future garbage collection will block on the callback, that means no
garbage collection will be executed while the callback is executing.
However this does not stop the program, that means callbacks must be
careful not to run for too long, at the risk of allowing the program to allocate
too much memory before a GC is triggered.

The goal of this is to allow programs to handle the CPU death spiral.
Let's assume my solution is that if sizeAfterGc is above 90% of the GOMEMLIMIT I have set, I panic (which is most likely what I would want in my case).
If another GC were triggered, followed by a stop-the-world event, my panic attempt could be stopped by the next GC; blocking further GCs on the callback prevents that.
But this is also flexible enough to allow for more options: maybe instead of panicking I time.Sleep for 3 seconds. If I run out of RAM, well, too bad, the program crashes, but maybe this is the peak of a transient event that I can live through if whatever IO is in flight is allowed to progress.
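
To make the 90% policy concrete, here is a rough sketch of what such a callback could look like. SetGcCallback does not exist in the Go runtime; it is only the function proposed above. The names nearLimit and onGC, and the 1 GiB limit, are made up for illustration.

```go
package main

import "fmt"

// memLimit is a stand-in for the configured GOMEMLIMIT (1 GiB here,
// purely illustrative).
const memLimit uintptr = 1 << 30

// nearLimit reports whether the heap size left after a GC cycle is at
// or above the given fraction of the limit.
func nearLimit(sizeAfterGc, limit uintptr, fraction float64) bool {
	return float64(sizeAfterGc) >= fraction*float64(limit)
}

// onGC has the signature the proposed SetGcCallback would accept.
// It gives up as soon as a GC cycle fails to get under 90% of the limit.
func onGC(sizeBeforeGc, sizeAfterGc uintptr) {
	if nearLimit(sizeAfterGc, memLimit, 0.9) {
		panic(fmt.Sprintf("heap is %d bytes after GC (was %d), limit is %d",
			sizeAfterGc, sizeBeforeGc, memLimit))
	}
}

func main() {
	// With the proposal this would be: runtime.SetGcCallback(onGC).
	// Here the callback is invoked by hand with sample sizes.
	onGC(800<<20, 512<<20) // 512 MiB after GC: under 90%, returns normally
	fmt.Println(nearLimit(950<<20, memLimit, 0.9))
}
```

Swapping the panic for a time.Sleep gives the "wait out a transient spike" variant described above.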

If a callback is already registered, a new call to SetGcCallback replaces the callback that will be called.

It's probably unhealthy if multiple callbacks are registered as they may fight each other trying to slow down the runtime.

It may look like this allows users to implement their own GC logic, by purposely setting an extremely low GOMEMLIMIT (like 1 byte) and then blocking in the callback. And yes, it does in fact allow this, but I don't think this is something new.
You can already use finalizers to get a callback from the GC, and by combining that with an extremely high GOGC and calling runtime.GC in your code you can implement your own custom GC logic. However, unlike finalizers, a synchronous callback allows you to effectively handle the CPU death spiral.
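
As an illustration of the finalizer trick, here is a small sketch that uses a throwaway "sentinel" object to get a best-effort notification after a GC cycle. The names sentinel, arm and observeOneGC are made up for this example, and the runtime does not guarantee that finalizers run at all, so this is only an approximation of a GC callback.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sentinel is a throwaway heap object; once it becomes unreachable,
// a later GC cycle may run its finalizer.
type sentinel struct {
	ch chan<- struct{}
}

// arm allocates an unreachable sentinel whose finalizer sends a
// non-blocking notification on ch and then re-arms itself with a
// fresh sentinel for the next cycle.
func arm(ch chan<- struct{}) {
	runtime.SetFinalizer(&sentinel{ch: ch}, func(s *sentinel) {
		select {
		case s.ch <- struct{}{}:
		default: // receiver is busy; drop this notification
		}
		arm(s.ch)
	})
}

// observeOneGC forces a collection and reports whether a notification
// arrived within a generous (arbitrary) timeout.
func observeOneGC() bool {
	ch := make(chan struct{}, 1)
	arm(ch)
	runtime.GC()
	select {
	case <-ch:
		return true
	case <-time.After(5 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("saw a GC notification:", observeOneGC())
}
```

The notification arrives asynchronously on the finalizer goroutine, some time after the cycle ends, which is the difference from the synchronous callback proposed here.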

proposal: runtime: add way for applications to respond to GC backpressure #29696
... and probably more; I somehow can never find what I want with the GitHub search.


prattmic · 2023-03-31T13:59:38Z

cc @mknyszek @golang/runtime

mknyszek · 2023-03-31T16:36:01Z

You can already use finalizers to get a callback from the GC and combined with an extremely high GOGC and then calling runtime.GC in your code you can implement your own custom GC logic

Because there's no guarantee finalizers will run, I don't think this is actually true. Also, they tend to trigger too slowly in practice to be useful for this. On the other hand your proposal introduces a way to impact and observe GC behavior much more immediately and I would like to avoid opening that door. Specifically, I think this part is problematic:

// Any future garbage collection will block on the callback, that means no
// garbage collection will be executed while the callback is executing.

That being said, the runtime does already have a mechanism to limit GC CPU time close to the memory limit. The CPU time limit is set to 50% over a 1 second window. I would like to collect more evidence and data on program behavior near the limit before we try to add an API to work around it. It may be that the 50% limit is just not an aggressive enough limit, or the time window is too small or too large to be a good default. It's true that this will be application dependent in general, but we may just have a poor default here. The Go runtime should be successfully evading a "true" death spiral with room for the application to still fall over.

However, it's definitely more difficult to deal with the case where memory goes high and then stays there for a long time (e.g. in the case of a slow memory leak), but I don't think there's a good general solution there without also losing the ability to handle transient spikes in memory use effectively, or making the GC behavior of applications even more complex and likely more fragile to code changes. In sum, I think one of these three things has to give:
(1) Can't deal with transient spikes well (fail too quickly).
(2) Can't deal well with memory staying high for a long time.
(3) The GC policy becomes complex and fragile.

There may be a way out, but I don't see it yet.

Also, a feedback mechanism is theoretically useful, but I must point out that when we've experimented with a feedback mechanism in the past, nobody used it in practice. Instead, where feedback mechanisms are appropriate, applications tended to use a mix of other signals, such as the memory metrics obtained from runtime/metrics plus application-specific metrics. Being fixed to the GC cycle didn't matter that much (you could just have a goroutine run periodically and that was fine).

For instance, if your use-case would be to panic inside the callback passed to SetGcCallback under certain conditions, you could theoretically already do that by sampling runtime/metrics, checking the result, and panicking in a specific case without much of a difference in the end result.
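
A rough sketch of that approach, for illustration only: a goroutine periodically samples runtime/metrics and panics once the memory counted against the limit crosses a chosen fraction of it. The helper names (limitedMemory, overThreshold, watch), the 1 second interval and the 0.9 fraction are assumptions for this example. The two metric names are the ones the runtime/debug documentation gives for how the memory limit is accounted.

```go
package main

import (
	"fmt"
	"math"
	"runtime/debug"
	"runtime/metrics"
	"time"
)

const (
	totalMetric    = "/memory/classes/total:bytes"
	releasedMetric = "/memory/classes/heap/released:bytes"
)

// limitedMemory returns the bytes counted against the memory limit
// (total mapped memory minus memory released to the OS), or 0 if
// either metric is unsupported by this Go version.
func limitedMemory() uint64 {
	samples := []metrics.Sample{{Name: totalMetric}, {Name: releasedMetric}}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() != metrics.KindUint64 {
			return 0
		}
	}
	return samples[0].Value.Uint64() - samples[1].Value.Uint64()
}

// overThreshold reports whether used is at or above fraction of limit.
// A limit of math.MaxInt64 means no limit is set, so it never trips.
func overThreshold(used uint64, limit int64, fraction float64) bool {
	if limit == math.MaxInt64 {
		return false
	}
	return float64(used) >= fraction*float64(limit)
}

// watch polls at the given interval and panics when memory use gets
// too close to the limit. It never returns.
func watch(interval time.Duration, fraction float64) {
	for range time.Tick(interval) {
		limit := debug.SetMemoryLimit(-1) // negative input only reads the limit
		if used := limitedMemory(); overThreshold(used, limit, fraction) {
			panic(fmt.Sprintf("memory use %d is within %.0f%% of limit %d", used, fraction*100, limit))
		}
	}
}

func main() {
	go watch(time.Second, 0.9)
	fmt.Println("bytes counted against the limit:", limitedMemory())
}
```

Unlike a per-GC callback, this samples total memory on a timer rather than the heap size right after a collection, so a real policy would likely add smoothing or require several consecutive samples over the threshold.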

I think in #58106 the original poster mentioned tracking runtime/metrics for a more complex policy, and that's fine too. The tools are there and by Go 1 compatibility, will be there for the foreseeable future. For where we are today, that doesn't seem like an unreasonable path forward to me for dealing with specific situations. It's a way of paying for that complexity and fragility yourself on a case-by-case basis.

Jorropo · 2023-03-31T16:43:36Z

That being said, the runtime does already have a mechanism to limit GC CPU time close to the memory limit. The CPU time limit is set to 50% over a 1 second window. I would like to collect more evidence and data on program behavior near the limit before we try to add an API to work around it. It may be that the 50% limit is just not an aggressive enough limit, or the time window is too small or too large to be a good default. It's true that this will be application dependent in general, but we may just have a poor default here. The Go runtime should be successfully evading a "true" death spiral with room for the application to still fall over.

Interesting, thanks. It sounds like I have an issue in my testing methodology.

I'll try to implement my 90% of GOMEMLIMIT panic using runtime/metrics and finalizers and see how it goes.

Jorropo · 2023-05-02T23:30:25Z

After retrying, the runtime does indeed limit itself and does not run the GC continuously.

Jorropo added the Proposal label Mar 30, 2023

gopherbot added this to the Proposal milestone Mar 30, 2023

bcmills added the compiler/runtime label Mar 31, 2023

ianlancetaylor added this to Proposals Apr 4, 2023

ianlancetaylor moved this to Incoming in Proposals Apr 4, 2023

mknyszek added this to Go Compiler / Runtime Apr 5, 2023

Jorropo closed this as not planned May 2, 2023

github-project-automation bot moved this to Done in Go Compiler / Runtime May 2, 2023

mknyszek removed this from Go Compiler / Runtime Oct 25, 2023

Jorropo mentioned this issue Mar 28, 2024

How often does Go GC after the memory exceeds the GOMEMLIMIT limit? #66569

Closed

golang locked and limited conversation to collaborators May 1, 2024

gopherbot added the FrozenDueToAge label May 1, 2024

rsc removed this from Proposals May 8, 2024
