Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

timeutil: prevent illegal sharing of Timers #119595

Merged
merged 2 commits into from
Feb 27, 2024

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Feb 23, 2024

timeutil: prevent illegal sharing of Timers

Currently, whenever Timer.Stop is called, we unconditionally put that timer into timerPool. This works well assuming that the contract of Stop is satisfied - namely that the stopped timer can no longer be used.

However, we have at least one case where this contract is violated (fixed in the following commit), and in such a scenario it is possible for multiple users to access the same timer object which can lead to an undefined behavior (we've seen a deadlock on a test server shutdown).

This commit hardens the timer code to prevent a class of such issues by putting the timer into the pool on Stop only if the timer originally came from the pool.

Release note: None

stmtdiagnostics: fix incorrect usage of timeutil.Timer

The contract of timeutil.Timer.Stop is such that the timer can no longer be used, and it was violated in Registry.poll. This now fixed by allocating a fresh timer after each Stop call.

Fixes: #119593.

Release note: None

@yuzefovich yuzefovich requested review from nvanbenschoten, a team and mgartner and removed request for a team February 23, 2024 19:42
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice debugging! The fix in the second commit here :lgtm:. I left a question about the first patch though.

Reviewed 2 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @yuzefovich)


pkg/util/timeutil/timer.go line 110 at r1 (raw file):

		}
	}
	if !t.fromPool {

I don't entirely follow why this provides protection from misuse if the contract is still that "a Timer must never be used again after calls to Stop". Is the idea that stack allocated timers (which I assume still escape to the heap and would always benefit from just using NewTimer) can now be re-used after Stop is called? In other words, is this allowing us to change the contract? Or just making a common misuse of the contract less impactful, with the assumption that stack allocation is often paired with violating this contract?

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner and @nvanbenschoten)


pkg/util/timeutil/timer.go line 110 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't entirely follow why this provides protection from misuse if the contract is still that "a Timer must never be used again after calls to Stop". Is the idea that stack allocated timers (which I assume still escape to the heap and would always benefit from just using NewTimer) can now be re-used after Stop is called? In other words, is this allowing us to change the contract? Or just making a common misuse of the contract less impactful, with the assumption that stack allocation is often paired with violating this contract?

This is a defensive mechanism that prevents problems from one class of misuse (like the one the second commit fixes). In particular, now with the "stack-allocated" (that, indeed, should actually escape) timer it won't ever be possible to share the timer by mistake. It is still possible for this problem to occur if the timer is created via NewTimer and the contract of Stop not being honored.

I'm somewhat on the fence about the first commit, but I'm leaning towards being defensive and merging it due to:

  • if the timer originally didn't come from the pool, then it seems somewhat strange to put it into the pool on Stop
  • this mechanism does protect from misuse - with low overhead - at least in some cases, and even very experienced engineers can make mistakes like this (the bug fixed in the second commit was introduced by Andrew W in 4ef67a0).

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten and @yuzefovich)


-- commits line 15 at r1:
nit: remove one of the two "only"'s


pkg/util/timeutil/timer.go line 110 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This is a defensive mechanism that prevents problems from one class of misuse (like the one the second commit fixes). In particular, now with the "stack-allocated" (that, indeed, should actually escape) timer it won't ever be possible to share the timer by mistake. It is still possible for this problem to occur if the timer is created via NewTimer and the contract of Stop not being honored.

I'm somewhat on the fence about the first commit, but I'm leaning towards being defensive and merging it due to:

  • if the timer originally didn't come from the pool, then it seems somewhat strange to put it into the pool on Stop
  • this mechanism does protect from misuse - with low overhead - at least in some cases, and even very experienced engineers can make mistakes like this (the bug fixed in the second commit was introduced by Andrew W in 4ef67a0).

Is there any reasonable use case for a stack allocated timer? If not, why not prevent stack allocated timers by making Timer a point to an internal type struct:

type Timer = *t

type t struct {
	timer *time.Timer
	// C is a local "copy" of timer.C that can be used in a select case before
	// the timer has been initialized (via Reset).
	C    <-chan time.Time
	Read bool
}

Then trying to use an uninitialized var t Timer will result in a NPE, which should force all timers to be allocated via the pool.

Currently, whenever `Timer.Stop` is called, we unconditionally put that
timer into `timerPool`. This works well assuming that the contract of
`Stop` is satisfied - namely that the stopped timer can no longer be
used.

However, we have at least one case where this contract is violated
(fixed in the following commit), and in such a scenario it is possible
for multiple users to access the same timer object which can lead to an
undefined behavior (we've seen a deadlock on a test server shutdown).

This commit hardens the timer code to prevent a class of such issues by
putting the timer into the pool on `Stop` only if the timer originally
came from the pool.

Release note: None
The contract of `timeutil.Timer.Stop` is such that the timer can no
longer be used, and it was violated in `Registry.poll`. This is now
fixed by allocating a fresh timer after each `Stop` call.

Release note: None
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner and @nvanbenschoten)


pkg/util/timeutil/timer.go line 110 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Is there any reasonable use case for a stack allocated timer? If not, why not prevent stack allocated timers by making Timer a point to an internal type struct:

type Timer = *t

type t struct {
	timer *time.Timer
	// C is a local "copy" of timer.C that can be used in a select case before
	// the timer has been initialized (via Reset).
	C    <-chan time.Time
	Read bool
}

Then trying to use an uninitialized var t Timer will result in a NPE, which should force all timers to be allocated via the pool.

On a quick search there are like 20 places where the timers are allocated on the "stack" and don't use the pool. For example, replicaScanner from server/scanner.go - it seems reasonable to me that we embed the timer into the parent struct to avoid accessing timerPool (although, later, on Reset we'll still access the shared timeTimerPool). To me this seems like a reasonable use case since we avoid an extra pointer to track / extra access to timerPool - WDYT?

@mgartner
Copy link
Collaborator

pkg/util/timeutil/timer.go line 110 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

On a quick search there are like 20 places where the timers are allocated on the "stack" and don't use the pool. For example, replicaScanner from server/scanner.go - it seems reasonable to me that we embed the timer into the parent struct to avoid accessing timerPool (although, later, on Reset we'll still access the shared timeTimerPool). To me this seems like a reasonable use case since we avoid an extra pointer to track / extra access to timerPool - WDYT?

I see, makes sense. No need to disallow stack allocation here then.

@yuzefovich
Copy link
Member Author

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 27, 2024

Build succeeded:

@craig craig bot merged commit 2ae2550 into cockroachdb:master Feb 27, 2024
15 checks passed
@yuzefovich yuzefovich deleted the timer-fix branch February 27, 2024 20:55
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/util/timeutil/timer.go line 110 at r1 (raw file):

Is the idea that stack allocated timers (which I assume still escape to the heap and would always benefit from just using NewTimer) can now be re-used after Stop is called? In other words, is this allowing us to change the contract?

So can the contact be relaxed now for timers that didn't come from the pool? Seems like code actually does want to reuse timers after Stop.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Mar 6, 2024
119901: timeutil: stack-allocate Timer, remove pooling r=nvanbenschoten a=nvanbenschoten

Informs #119593.
Closes #38055.

This PR removes the pooling of `timeutil.Timer` structs and, in doing so, permits the structs to be stack allocated so that no pooling is necessary. This superfluous (and in hindsight, harmful) memory pooling was introduced in f11ec1c, which also added very necessary pooling for the internal time.Timer structs.

The pooling was harmful because it mandated a contract where Timer structs could not be used after their Stop method was called. This was surprising (time.Timer has no such limitation) and led to subtle use-after-free bugs over time (#61373 and #119595). It was also unnecessary because the outer Timer structs can be stack allocated. Ironically, the only thing that causes them to escape to the heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

----

The PR then improves the memory pooling of the inner `time.Timer` so that it is always recycled. This was originally identified by `@andreimatei` in #13466 (review).

Doing so has a positive impact on the microbenchmark introduced in the first commit, demonstrating that timers can be stack-allocated and require zero heap allocations:
```
name      old time/op    new time/op    delta
Timer-10     152µs ± 1%     153µs ± 1%      ~     (p=0.133 n=9+10)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%        0B       -100.00%  (p=0.000 n=10+10)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
```

----

cc. `@andreimatei` `@ajwerner`

Epic: None
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stmtdiagnostics: seemingly a deadlock in spanlatch.(*Manager).wait
5 participants