Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time,runtime: too many concurrent timer firings for short time.Ticker #69880

Closed
mknyszek opened this issue Oct 14, 2024 · 13 comments
Closed

time,runtime: too many concurrent timer firings for short time.Ticker #69880

mknyszek opened this issue Oct 14, 2024 · 13 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mknyszek
Copy link
Contributor

We've observed the "too many concurrent timer firings" panic, introduced in go.dev/cl/611496 for #69312, in production. We suspect they're due to relatively short time.Tickers.

This bug tracks figuring out why it happens and mitigating it. We expected this to be rare, and it still is relatively rare, but it does happen. go.dev/cl/620136 is a mitigation for time.Ticker that already landed on tip.

CC @ianlancetaylor

@gabyhelp
Copy link

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 14, 2024
@mknyszek
Copy link
Contributor Author

@gopherbot Please open a backport issue for Go 1.23.

This is a serious issue that affects users with no reasonable workaround. This issues does not affect Go 1.22.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #69881 (for 1.22), #69882 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@ianlancetaylor
Copy link
Contributor

Fixed on tip, we think, by https://go.dev/cl/620136. Will reopen if it continues to occur.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/620137 mentions this issue: [release-branch.go1.23] runtime: don't frob isSending for tickers

@dmitshur dmitshur added this to the Go1.24 milestone Oct 14, 2024
@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 14, 2024
@jeromefroe
Copy link

Hi, my team has also observed occasional "too many concurrent timer firings" panics after upgrading to Go 1.23.2. In our case, the panics occur in a service that runs a gRPC server and my suspicion is that it may be because the gRPC server creates three new timers for each new connection:

idleTimer := time.NewTimer(t.kp.MaxConnectionIdle)
ageTimer := time.NewTimer(t.kp.MaxConnectionAge)
kpTimer := time.NewTimer(t.kp.Time)

In one case I saw that a replica that panicked had about 1000 active connections at the time it panicked. We initially suspected that this was a thundering herd problem in which too many clients try to connect to the service at once and thereby cause too many concurrent timers to be created, so we tried increasing the jitter the clients use when connecting to the service. Even with that change though we continued to observed panics in the service (I suspect this might be because the server resets the timers it creates so just due to random timing effects one might end up with many timers expiring at about the same time).

I'm still working on a reproducible example that I can share, but was curious if the linked CL would also effect timers or if it was only relevant to tickers?

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Oct 15, 2024

CL 620136 will only affect Tickers, not Timers.

Note that this crash is not the result of having many concurrent timers. It is the result of a single timer being run multiple times concurrently. Creating lots of timers can't cause this problem.

What can cause this problem is having a single timer that is reset multiple times very quickly to times that are in the past or the very near future. Each such timer reset can potentially lead to an immediate run of the timer. If those timer runs somehow happens 8 times concurrently, this error can occur. That said, I haven't been able to recreate this problem using timers. A test case would certainly be helpful.

@jeromefroe
Copy link

Thank you, that's incredibly helpful context. I had misunderstood the panic message, but this gives me a lead I can use to find the problematic goroutine in the stack traces emitted after the panics. Once I find the issue I'll see if I can turn it into a simple reproducible example.

@matbits
Copy link

matbits commented Oct 16, 2024

So yesterday I also encountered this panic. It happened after 2-3 Hours and I found in my code the responsible code pattern:

func cleaner(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Millisecond):
			cleanup()
		}
	}
}

I know this is a bad pattern, I wanted to test the new golang 1.23 feature, that automically GCs unreferenced unstopped tickers. I refactored this code pattern to use only one Ticker:

func cleaner(ctx context.Context) {
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cleanup()
		}
	}
}

@ianlancetaylor
Copy link
Contributor

@matbits Thanks. I don't see how the original cleaner function could cause the "too many concurrent timer firings" error. Can you describe why you think that caused the problem?

@matbits
Copy link

matbits commented Oct 18, 2024

Good point, I thought the bug happened in my code and this is the only time I am using the time package with a ticker in my code. But maybe the panic happened inside of a third party library. I need to investigate this further

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/621616 mentions this issue: runtime,time: use atomic.Int32 for isSending

gopherbot pushed a commit that referenced this issue Oct 21, 2024
This change switches isSending to be an atomic.Int32 instead of an
atomic.Uint8. The Int32 version is managed as a counter, which is
something that we couldn't do with Uint8 without adding a new intrinsic
which may not be available on all architectures.

That is, instead of only being able to support 8 concurrent timer
firings on the same timer because we only have 8 independent bits to set
for each concurrent timer firing, we can now have 2^31-1 concurrent
timer firings before running into any issues. Like the fact that each
bit-set was matched with a clear, here we match increments with
decrements to indicate that we're in the "sending on a channel" critical
section in the timer code, so we can report the correct result back on
Stop or Reset.

We choose an Int32 instead of a Uint32 because it's easier to check for
obviously bad values (negative values are always bad) and 2^31-1
concurrent timer firings should be enough for anyone.

Previously, we avoided anything bigger than a Uint8 because we could
pack it into some padding in the runtime.timer struct. But it turns out
that the type that actually matters, runtime.timeTimer, is exactly 96
bytes in size. This means its in the next size class up in the 112 byte
size class because of an allocation header. We thus have some free space
to work with. This change increases the size of this struct from 96
bytes to 104 bytes.

(I'm not sure if runtime.timer is often allocated directly, but if it
is, we get lucky in the same way too. It's exactly 80 bytes in size,
which means its in the 96-byte size class, leaving us with some space to
work with.)

Fixes #69969.
Related to #69880 and #69312.

Change-Id: I9fd59cb6a69365c62971d1f225490a65c58f3e77
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest
Reviewed-on: https://go-review.googlesource.com/c/go/+/621616
Reviewed-by: Ian Lance Taylor <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
gopherbot pushed a commit that referenced this issue Oct 23, 2024
The Ticker Stop and Reset methods don't report a value,
so we don't need to track whether they are interrupting a send.

This includes a test that used to fail about 2% of the time on
my laptop when run under x/tools/cmd/stress.

For #69880
Fixes #69882

Change-Id: Ic6d14b344594149dd3c24b37bbe4e42e83f9a9ad
Reviewed-on: https://go-review.googlesource.com/c/go/+/620136
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
(cherry picked from commit 48849e0)
Reviewed-on: https://go-review.googlesource.com/c/go/+/620137
Reviewed-by: Dmitri Shuralyov <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/621856 mentions this issue: [release-branch.go1.23] runtime,time: use atomic.Int32 for isSending

gopherbot pushed a commit that referenced this issue Oct 25, 2024
This change switches isSending to be an atomic.Int32 instead of an
atomic.Uint8. The Int32 version is managed as a counter, which is
something that we couldn't do with Uint8 without adding a new intrinsic
which may not be available on all architectures.

That is, instead of only being able to support 8 concurrent timer
firings on the same timer because we only have 8 independent bits to set
for each concurrent timer firing, we can now have 2^31-1 concurrent
timer firings before running into any issues. Like the fact that each
bit-set was matched with a clear, here we match increments with
decrements to indicate that we're in the "sending on a channel" critical
section in the timer code, so we can report the correct result back on
Stop or Reset.

We choose an Int32 instead of a Uint32 because it's easier to check for
obviously bad values (negative values are always bad) and 2^31-1
concurrent timer firings should be enough for anyone.

Previously, we avoided anything bigger than a Uint8 because we could
pack it into some padding in the runtime.timer struct. But it turns out
that the type that actually matters, runtime.timeTimer, is exactly 96
bytes in size. This means its in the next size class up in the 112 byte
size class because of an allocation header. We thus have some free space
to work with. This change increases the size of this struct from 96
bytes to 104 bytes.

(I'm not sure if runtime.timer is often allocated directly, but if it
is, we get lucky in the same way too. It's exactly 80 bytes in size,
which means its in the 96-byte size class, leaving us with some space to
work with.)

Fixes #69978
For #69969.
Related to #69880 and #69312 and #69882.

Change-Id: I9fd59cb6a69365c62971d1f225490a65c58f3e77
Cq-Include-Trybots: luci.golang.try:go1.23-linux-amd64-longtest
Reviewed-on: https://go-review.googlesource.com/c/go/+/621616
Reviewed-by: Ian Lance Taylor <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
(cherry picked from commit 6a49f81)
Reviewed-on: https://go-review.googlesource.com/c/go/+/621856
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants