Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stmtdiagnostics: seemingly a deadlock in spanlatch.(*Manager).wait #119593

Closed
yuzefovich opened this issue Feb 23, 2024 · 1 comment · Fixed by #119595
Closed

stmtdiagnostics: seemingly a deadlock in spanlatch.(*Manager).wait #119593

yuzefovich opened this issue Feb 23, 2024 · 1 comment · Fixed by #119595
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). T-sql-queries SQL Queries Team

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Feb 23, 2024

Extracted from this failure of TestDiagnosticsRequest:

* INFO: slow quiesce. stack traces:
...
    test_server_shim.go:157: automatically injected a shared process virtual cluster under test; see comment at top of test_server_shim.go for details.
...
goroutine 3547 [chan receive (nil chan), 4 minutes]:
github.com/cockroachdb/cockroach/pkg/util/timeutil.(*Timer).Reset(0xc0062988d0, 0xc002ad16f0?)
	github.com/cockroachdb/cockroach/pkg/util/timeutil/timer.go:83 +0x5a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanlatch.(*Manager).wait(0xc004499b00?, {0x65c13b8, 0xc00163f290}, 0xc003f40400, {{{{0x0, 0x0}, {0x0, 0x0}}, {{0x0, 0x0}, ...}}})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanlatch/manager.go:499 +0x8f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanlatch.(*Manager).Acquire(0x4322203a2265756c?, {0x65c13b8, 0xc00163f290}, 0x51c9a46?, 0x22ee95b?, {0x6584680?, 0xc003c19320?})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/spanlatch/manager.go:251 +0x1a5
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*latchManagerImpl).Acquire(0xc00449f2c0?, {0x65c13b8?, 0xc00163f290?}, {0x0, {0x17b68ddcf017e2ae, 0x0}, 0x0, 0x0, 0x0, 0x0, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/latch_manager.go:29 +0x45

The test failed during the server shutdown. To me it seems like a deadlock - the timer in question uses 15 seconds.

I think we have a problematic pattern of using timeutil.Timer at least in stmtdiagnostics.Registry.poll - namely we allocate the timer locally but then calling Stop on it puts it into the sync.Pool, yet we later will keep on using the same timer which can result in concurrent access to the same timer object, which is invalid.

Jira issue: CRDB-36234

@yuzefovich yuzefovich added C-test-failure Broken test (automatically or manually discovered). T-sql-queries SQL Queries Team labels Feb 23, 2024
@yuzefovich yuzefovich self-assigned this Feb 23, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Feb 23, 2024
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Feb 23, 2024
@yuzefovich
Copy link
Member Author

We've seen this previously #118524 (comment).

@craig craig bot closed this as completed in 2ae2550 Feb 27, 2024
@github-project-automation github-project-automation bot moved this from Active to Done in SQL Queries Feb 27, 2024
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 5, 2024
Informs cockroachdb#119593.
Closes cockroachdb#38055.

This commit removes the pooling of timeutil.Timer structs and, in doing so,
permits the structs to be stack allocated so that no pooling is necessary. This
superfluous (and in hindsight, harmful) memory pooling was introduced in
f11ec1c, which also added very necessary pooling for the internal time.Timer
structs.

The pooling was harmful because it mandated a contract where Timer structs could
not be used after their Stop method was called. This was surprising (time.Timer
has no such limitation) and led to subtle use-after-free bugs over time (cockroachdb#61373
and cockroachdb#119595). It was also unnecessary because the outer Timer structs can be
stack allocated. Ironically, the only thing that causes them to escape to the
heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

Epic: None
Release note: None
craig bot pushed a commit that referenced this issue Mar 6, 2024
119901: timeutil: stack-allocate Timer, remove pooling r=nvanbenschoten a=nvanbenschoten

Informs #119593.
Closes #38055.

This PR removes the pooling of `timeutil.Timer` structs and, in doing so, permits the structs to be stack allocated so that no pooling is necessary. This superfluous (and in hindsight, harmful) memory pooling was introduced in f11ec1c, which also added very necessary pooling for the internal time.Timer structs.

The pooling was harmful because it mandated a contract where Timer structs could not be used after their Stop method was called. This was surprising (time.Timer has no such limitation) and led to subtle use-after-free bugs over time (#61373 and #119595). It was also unnecessary because the outer Timer structs can be stack allocated. Ironically, the only thing that causes them to escape to the heap was the pooling mechanism itself. Removing pooling solves the issue.

```
name      old time/op    new time/op    delta
Timer-10     153µs ± 1%     152µs ± 1%   ~     (p=0.589 n=10+9)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%      200B ± 0%   ~     (all equal)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      3.00 ± 0%   ~     (all equal)
```

----

The PR then improves the memory pooling of the inner `time.Timer` so that it is always recycled. This was originally identified by `@andreimatei` in #13466 (review).

Doing so has a positive impact on the microbenchmark introduced in the first commit, demonstrating that timers can be stack-allocated and require zero heap allocations:
```
name      old time/op    new time/op    delta
Timer-10     152µs ± 1%     153µs ± 1%      ~     (p=0.133 n=9+10)

name      old alloc/op   new alloc/op   delta
Timer-10      200B ± 0%        0B       -100.00%  (p=0.000 n=10+10)

name      old allocs/op  new allocs/op  delta
Timer-10      3.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
```

----

cc. `@andreimatei` `@ajwerner`

Epic: None
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant