proposal: runtime: GC Callback to handle CPU deathspirals with GOMEMLIMIT #59324

Closed
Jorropo opened this issue Mar 30, 2023 · 4 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge Proposal

@Jorropo
Member

Jorropo commented Mar 30, 2023

I have been experimenting with using GOMEMLIMIT to increase how much load my mostly IO-bound service can handle.
It works great, and it also works great together with a very high GOGC to reduce CPU utilisation (if I allocate 32 GiB to a service that is only using 4 GiB, I don't want the GC to run once the heap reaches 8 GiB; I might as well wait until all 32 GiB have been used and run the GC less often).
Sadly, if I am about to OOM, instead of crashing the runtime will try to run the GC almost continuously (see #58106), which is undesirable.
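For readers who want to reproduce this setup, here is a minimal sketch of the configuration described above, done programmatically through `runtime/debug` rather than through the `GOMEMLIMIT` and `GOGC` environment variables. The helper names (`configureGC`, `currentLimit`) and the GOGC value of 800 are illustrative assumptions, not values taken from this issue.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// configureGC sets the soft memory limit and the GOGC percentage and returns
// the previous settings. It should be roughly equivalent to starting the
// process with GOMEMLIMIT and GOGC set in the environment.
func configureGC(limitBytes int64, gcPercent int) (prevLimit int64, prevPercent int) {
	prevLimit = debug.SetMemoryLimit(limitBytes)
	prevPercent = debug.SetGCPercent(gcPercent)
	return prevLimit, prevPercent
}

// currentLimit reads the memory limit without changing it: a negative
// argument to debug.SetMemoryLimit leaves the limit untouched.
func currentLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	// 32 GiB limit with a high GOGC (800 is an arbitrary illustrative value).
	configureGC(32<<30, 800)
	fmt.Println("memory limit in bytes:", currentLimit())
}
```

With this combination the GOGC-based heap target stays far above the live heap, so the memory limit is what ends up triggering most collections once usage gets close to it.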

I think the right solution is very application dependent, which is why I think a function like this would be helpful:

// SetGcCallback registers a callback that MAY be called after any GC run.
// Any future garbage collection will block on the callback, which means no
// garbage collection will be executed while the callback is executing.
// However, this does not stop the program, so callbacks must be
// careful not to run for too long, at the risk of allowing the program to allocate
// too much memory before a GC is triggered.
// If a callback is already registered, a new call to SetGcCallback replaces the
// callback that will be called.
// This does not guarantee that a GC is used, or that it will call back into this function.
// However, if the callback is ever called after a GC run, then it will always be called
// after every GC run.
func SetGcCallback(func(sizeBeforeGc, sizeAfterGc uintptr))

Rationale behind the different points:

register a callback that MAY be called after any GC run

This does not guarantee that a GC is used, or that it will call back into this function.
However, if the callback is ever called after a GC run, then it will always be called after every GC run.

This allows it to be an optional feature that may or may not be implemented by the various Go implementations.
However, an implementation must either support it or not: it cannot be inconsistent about this within the lifetime of one program.

Any future garbage collection will block on the callback, that means no
garbage collection will be executed while the callback is executing.
However this does not stop the program, that means callbacks must be
careful not to run for too long, at the risk of allowing the program to allocate
too much memory before a GC is triggered.

The goal of this is to allow programs to handle the CPU death spiral.
Let's assume my solution is to panic when sizeAfterGc reaches 90% of the GOMEMLIMIT I have set (which is most likely what I would want in my case).
If another GC were triggered, followed by a stop-the-world event, my panic attempt could be stalled by the next GC, which is why future collections block on the callback.
But this is also flexible enough to allow for more options: maybe instead of panicking I time.Sleep for 3 seconds. If I run out of RAM, too bad, the program crashes, but maybe this is the peak of a transient event that I can live through if whatever IO is in flight is allowed to progress.
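To make the 90% policy concrete, here is a small sketch of what such a callback could look like. `SetGcCallback` is only proposed here and does not exist in the runtime, so the example defines a local `gcCallback` type with the same signature and invokes it by hand; `panicNearLimit` and `tripped` are hypothetical helper names, and the 1 GiB limit is an arbitrary value chosen to fit in a `uintptr` on any platform.

```go
package main

import "fmt"

// gcCallback mirrors the callback signature from the proposal. It is a local
// stand-in: the runtime has no SetGcCallback to register it with.
type gcCallback func(sizeBeforeGc, sizeAfterGc uintptr)

// panicNearLimit returns a callback that panics when the heap size measured
// after a GC is at or above the given fraction of limit.
func panicNearLimit(limit uintptr, fraction float64) gcCallback {
	threshold := uintptr(float64(limit) * fraction)
	return func(_, sizeAfterGc uintptr) {
		if sizeAfterGc >= threshold {
			panic(fmt.Sprintf("heap is %d bytes after GC, threshold is %d bytes",
				sizeAfterGc, threshold))
		}
	}
}

// tripped invokes cb with the given sizes and reports whether it panicked.
func tripped(cb gcCallback, before, after uintptr) (panicked bool) {
	defer func() { panicked = recover() != nil }()
	cb(before, after)
	return false
}

func main() {
	cb := panicNearLimit(1<<30, 0.9) // hypothetical 1 GiB limit, 90% threshold

	// Simulated GC results: 512 MiB left after GC, then 1000 MiB left after GC.
	fmt.Println("512 MiB after GC trips the policy:", tripped(cb, 900<<20, 512<<20))
	fmt.Println("1000 MiB after GC trips the policy:", tripped(cb, 1010<<20, 1000<<20))
}
```

The sleeping variant mentioned above would only differ in the body of the returned closure, replacing the panic with a call to time.Sleep.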

If a callback is already registered, a new call to SetGcCallback replaces the callback that will be called.

It's probably unhealthy if multiple callbacks are registered as they may fight each other trying to slow down the runtime.


It may look like this would allow users to implement their own GC logic, by purposely setting an extremely low GOMEMLIMIT (like 1 byte) and then blocking in the callback. And yes, it does in fact allow this, but I don't think this is something new.
You can already use finalizers to get a callback from the GC; combined with an extremely high GOGC and explicit calls to runtime.GC in your code, you can implement your own custom GC logic. However, unlike finalizers, a synchronous callback allows you to handle the CPU death spiral effectively.
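The finalizer trick referred to here is usually written along the following lines. This is a sketch with hypothetical names (`sentinel`, `notifyOnGC`, `gcObserved`): a throwaway object carries a finalizer that notifies the application and then arms a fresh object for the next cycle. The notification runs asynchronously on the finalizer goroutine some time after the collection, and the runtime makes no promise that finalizers run at all, so it is best-effort only.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sentinel is deliberately larger than 16 bytes: very small pointer-free
// objects may be batched together by the allocator, in which case their
// finalizers are not guaranteed to run.
type sentinel struct{ pad [64]byte }

// notifyOnGC arranges for notify to be called on the finalizer goroutine
// after a GC cycle finds the current sentinel unreachable, then re-arms
// itself with a new sentinel for the following cycle.
func notifyOnGC(notify func()) {
	var rearm func(*sentinel)
	rearm = func(*sentinel) {
		notify()
		runtime.SetFinalizer(&sentinel{}, rearm)
	}
	runtime.SetFinalizer(&sentinel{}, rearm)
}

// gcObserved forces up to attempts collections and reports whether a
// notification arrived after any of them.
func gcObserved(attempts int) bool {
	seen := make(chan struct{}, 1)
	notifyOnGC(func() {
		select {
		case seen <- struct{}{}:
		default: // a notification is already pending
		}
	})
	for i := 0; i < attempts; i++ {
		runtime.GC()
		select {
		case <-seen:
			return true
		case <-time.After(100 * time.Millisecond):
		}
	}
	return false
}

func main() {
	fmt.Println("GC notification observed:", gcObserved(10))
}
```

Because the notification is delivered after the fact and on another goroutine, it cannot hold back the next collection, which is the difference from the synchronous callback proposed in this issue.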


Related:

@gopherbot gopherbot added this to the Proposal milestone Mar 30, 2023
@prattmic
Member

cc @mknyszek @golang/runtime

@bcmills bcmills added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 31, 2023
@mknyszek
Contributor

mknyszek commented Mar 31, 2023

You can already use finalizers to get a callback from the GC; combined with an extremely high GOGC and explicit calls to runtime.GC in your code, you can implement your own custom GC logic

Because there's no guarantee finalizers will run, I don't think this is actually true. Also, they tend to trigger too slowly in practice to be useful for this. On the other hand your proposal introduces a way to impact and observe GC behavior much more immediately and I would like to avoid opening that door. Specifically, I think this part is problematic:

// Any future garbage collection will block on the callback, that means no
// garbage collection will be executed while the callback is executing.

That being said, the runtime does already have a mechanism to limit GC CPU time close to the memory limit. The CPU time limit is set to 50% over a 1 second window. I would like to collect more evidence and data on program behavior near the limit before we try to add an API to work around it. It may be that the 50% limit is just not an aggressive enough limit, or the time window is too small or too large to be a good default. It's true that this will be application dependent in general, but we may just have a poor default here. The Go runtime should be successfully evading a "true" death spiral with room for the application to still fall over.
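One way to gather that kind of evidence from a running program is the `runtime/metrics` package. The sketch below assumes Go 1.19 or later, where, to my knowledge, the metric `/gc/limiter/last-enabled:gc-cycle` reports the GC cycle in which the CPU limiter was last enabled (0 if it never was). `readUint64` is a hypothetical helper that returns false when the running toolchain does not support a metric.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

const (
	// GC cycle in which the GC CPU limiter was last enabled; 0 if never.
	limiterMetric = "/gc/limiter/last-enabled:gc-cycle"
	// Total number of completed GC cycles.
	cyclesMetric = "/gc/cycles/total:gc-cycles"
)

// readUint64 samples a single uint64 metric. The second result is false when
// the metric is unknown to this Go version or is not a uint64.
func readUint64(name string) (uint64, bool) {
	sample := []metrics.Sample{{Name: name}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return sample[0].Value.Uint64(), true
}

func main() {
	cycles, _ := readUint64(cyclesMetric)
	last, ok := readUint64(limiterMetric)
	switch {
	case !ok:
		fmt.Println("limiter metric not supported by this Go version")
	case last == 0:
		fmt.Printf("limiter never enabled (%d GC cycles so far)\n", cycles)
	default:
		fmt.Printf("limiter last enabled in cycle %d of %d\n", last, cycles)
	}
}
```

Logging these two values periodically while the program runs near its memory limit should show whether the limiter is actually kicking in.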

However, it's definitely more difficult to deal with the case where memory goes high and then stays there for a long time (e.g. in the case of a slow memory leak), but I don't think there's a good general solution there without also losing the ability to handle transient spikes in memory use effectively, or making the GC behavior of applications even more complex and likely more fragile to code changes. In sum, I think one of these three things has to give:
(1) Can't deal with transient spikes well (fail too quickly).
(2) Can't deal well with memory staying high for a long time.
(3) The GC policy becomes complex and fragile.

There may be a way out, but I don't see it yet.

Also, a feedback mechanism is theoretically useful, but I must point out that when we've experimented with a feedback mechanism in the past, nobody used it in practice. Instead, where feedback mechanisms are appropriate, applications tended to use a mix of other signals, such as the memory metrics obtained from runtime/metrics plus application-specific metrics. Being fixed to the GC cycle didn't matter that much (you could just have a goroutine run periodically and that was fine).

For instance, if your use-case would be to panic inside the callback passed to SetGcCallback under certain conditions, you could theoretically already do that by sampling runtime/metrics, checking the result, and panicking in a specific case without much of a difference in the end result.
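A rough sketch of that approach follows. It relies on the `debug.SetMemoryLimit` documentation, which describes the limited quantity as `/memory/classes/total:bytes` minus `/memory/classes/heap/released:bytes`, and on the fact that a negative argument reads the limit without changing it (Go 1.19+). The names `memoryInUse`, `overThreshold`, and `watch`, the one-second interval, and the 1 GiB limit are illustrative assumptions.

```go
package main

import (
	"fmt"
	"math"
	"runtime/debug"
	"runtime/metrics"
	"time"
)

// memoryInUse returns the memory mapped by the Go runtime minus the heap
// memory already released to the OS, which is the quantity the memory limit
// is documented to govern.
func memoryInUse() uint64 {
	s := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},
		{Name: "/memory/classes/heap/released:bytes"},
	}
	metrics.Read(s)
	return s[0].Value.Uint64() - s[1].Value.Uint64()
}

// overThreshold reports whether used is at or above fraction of limit.
// A limit of math.MaxInt64 means no limit is set, so it never trips.
func overThreshold(used uint64, limit int64, fraction float64) bool {
	if limit == math.MaxInt64 {
		return false
	}
	return float64(used) >= float64(limit)*fraction
}

// watch samples memory use every interval and panics once it crosses the
// threshold. A panic in this goroutine terminates the whole program.
func watch(interval time.Duration, fraction float64) {
	for range time.Tick(interval) {
		limit := debug.SetMemoryLimit(-1) // negative input only reads the limit
		if used := memoryInUse(); overThreshold(used, limit, fraction) {
			panic(fmt.Sprintf("memory use %d bytes reached %.0f%% of limit %d bytes",
				used, fraction*100, limit))
		}
	}
}

func main() {
	debug.SetMemoryLimit(1 << 30) // illustrative 1 GiB limit
	go watch(time.Second, 0.9)

	// The application's real work would go here.
	fmt.Println("bytes currently counted against the limit:", memoryInUse())
}
```

Unlike the proposed callback, this samples on a timer rather than after each GC cycle, so the value it sees may include garbage that the next collection would reclaim.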

I think in #58106 the original poster mentioned tracking runtime/metrics for a more complex policy, and that's fine too. The tools are there and by Go 1 compatibility, will be there for the foreseeable future. For where we are today, that doesn't seem like an unreasonable path forward to me for dealing with specific situations. It's a way of paying for that complexity and fragility yourself on a case-by-case basis.

@Jorropo
Member Author

Jorropo commented Mar 31, 2023

That being said, the runtime does already have a mechanism to limit GC CPU time close to the memory limit. The CPU time limit is set to 50% over a 1 second window. I would like to collect more evidence and data on program behavior near the limit before we try to add an API to work around it. It may be that the 50% limit is just not an aggressive enough limit, or the time window is too small or too large to be a good default. It's true that this will be application dependent in general, but we may just have a poor default here. The Go runtime should be successfully evading a "true" death spiral with room for the application to still fall over.

Interesting, thanks. It sounds like I have an issue in my testing methodology.


I'll try to implement my "panic at 90% of GOMEMLIMIT" policy using runtime/metrics and finalizers, and see how it goes.

@Jorropo
Member Author

Jorropo commented May 2, 2023

After retrying, the runtime does indeed limit itself and does not run the GC continuously.

@Jorropo Jorropo closed this as not planned May 2, 2023
@golang golang locked and limited conversation to collaborators May 1, 2024
@rsc rsc removed this from Proposals May 8, 2024

5 participants