
sqlstats: PersistedSQLStats.Stop() blocking server drain #102574

Closed
erikgrinaker opened this issue Apr 28, 2023 · 4 comments · Fixed by #102590
Labels
A-sql-observability (Related to observability of the SQL layer) · branch-master (Failures and bugs on the master branch) · branch-release-23.1 (Used to mark GA and release blockers, technical advisories, and bugs for 23.1) · C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior) · GA-blocker · O-testcluster (Issues found or occurred on a test cluster, i.e. a long-running internal cluster)

Comments


erikgrinaker commented Apr 28, 2023

Seen on a 23.1 test cluster, drain stalls for several minutes until it times out:

roachprod run grinaker-231:7 './cockroach node drain --insecure --self'
node is draining... 
ERROR: drain timeout, consider adjusting --drain-wait, especially under custom server.shutdown.{drain,query,connection,lease_transfer}_wait cluster settings
Failed running "node drain"

Goroutine dumps show it blocked here:

goroutine 1670852493 [semacquire, 6 minutes]:
sync.runtime_Semacquire(0xc004a828c8?)
	GOROOT/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0x0?)
	GOROOT/src/sync/waitgroup.go:139 +0x52
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*PersistedSQLStats).Stop(0xc004a82840, {0x6fc1f58, 0xc00950fc40})
	github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/pkg/sql/sqlstats/persistedsqlstats/provider.go:134 +0x125
github.com/cockroachdb/cockroach/pkg/server.(*drainServer).drainClients(0xc004a51700, {0x6fc2000?, 0xc08f2bc060?}, 0x6fc2000?)
	github.com/cockroachdb/cockroach/pkg/server/drain.go:416 +0x332
github.com/cockroachdb/cockroach/pkg/server.(*drainServer).drainInner(0xc004a51700, {0x6fc2000, 0xc08f2bc060}, 0xc02fcb92d0, 0x60?)
	github.com/cockroachdb/cockroach/pkg/server/drain.go:334 +0xed
github.com/cockroachdb/cockroach/pkg/server.(*drainServer).runDrain(0xc001d374b0?, {0x6fc2000, 0xc08f2bc060}, 0x0?)
	github.com/cockroachdb/cockroach/pkg/server/drain.go:302 +0x19a
github.com/cockroachdb/cockroach/pkg/server.(*drainServer).handleDrain(0xc004a829a0?, {0x6fc2000, 0xc08f2bc060}, 0xc018827de0, {0x6ffd550, 0xc0241e5a20})
	github.com/cockroachdb/cockroach/pkg/server/drain.go:152 +0x134
github.com/cockroachdb/cockroach/pkg/server.(*adminServer).Drain(0xc004a829a0, 0xc018827de0, {0x6ffd550, 0xc0241e5a20})
	github.com/cockroachdb/cockroach/pkg/server/drain.go:98 +0x11c
github.com/cockroachdb/cockroach/pkg/server/serverpb._Admin_Drain_Handler({0x59df7c0?, 0xc004a33a70}, {0x6ff78a8, 0xc018827dc0})
	github.com/cockroachdb/cockroach/pkg/server/serverpb/bazel-out/k8-opt/bin/pkg/server/serverpb/serverpb_go_proto_/github.com/cockroachdb/cockroach/pkg/server/serverpb/admin.pb.go:6238 +0xd0
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.StreamServerInterceptor.func1({0x59df7c0, 0xc004a33a70}, {0x6ff7fb0?, 0xc027d0f7a0?}, 0xc0945aa318, 0x5de6af8)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:164 +0x6c4
google.golang.org/grpc.chainStreamInterceptors.func1.1({0x59df7c0?, 0xc004a33a70?}, {0x6ff7fb0?, 0xc027d0f7a0?})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1482 +0x5b
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func4({0x59df7c0, 0xc004a33a70}, {0x6ff7fb0, 0xc027d0f7a0}, 0x11c79f0?, 0xc00950fb80)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:285 +0x74
google.golang.org/grpc.chainStreamInterceptors.func1.1({0x59df7c0?, 0xc004a33a70?}, {0x6ff7fb0?, 0xc027d0f7a0?})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1485 +0x83
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2.1({0x0?, 0x0?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:252 +0x2d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc00091c980, {0x6fc2000, 0xc018923f50}, {0xc003fb8a98?, 0x47203f?}, 0xc003fb8a50)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:322 +0xd1
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func2({0x59df7c0, 0xc004a33a70}, {0x6ff7fb0?, 0xc027d0f7a0?}, 0xc0945aa318, 0xc00950fb80)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:251 +0xef
google.golang.org/grpc.chainStreamInterceptors.func1.1({0x59df7c0?, 0xc004a33a70?}, {0x6ff7fb0?, 0xc027d0f7a0?})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1485 +0x83
google.golang.org/grpc.chainStreamInterceptors.func1({0x59df7c0, 0xc004a33a70}, {0x6ff7fb0, 0xc027d0f7a0}, 0xc0945aa318, 0x5de6af8)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1487 +0x12b
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000052780, {0x700f780, 0xc010b8f380}, 0xc007cd3b00, 0xc000ed7e30, 0xa0fe480, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1636 +0x11c6
google.golang.org/grpc.(*Server).handleStream(0xc000052780, {0x700f780, 0xc010b8f380}, 0xc007cd3b00, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1717 +0x9ea
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:965 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:963 +0x28a

Jira issue: CRDB-27539

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-observability Related to observability of the SQL layer O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster labels Apr 28, 2023

erikgrinaker commented Apr 28, 2023

Other goroutines in persistedsqlstats:

goroutine 685 [chan send, 2999 minutes]:
github.com/cockroachdb/cockroach/pkg/sql.(*sqlActivityUpdateJob).Resume.func2()
	github.com/cockroachdb/cockroach/pkg/sql/sql_activity_update_job.go:98 +0x25
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*PersistedSQLStats).startSQLStatsFlushLoop.func1.2(0xc004a82840?)
	github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/pkg/sql/sqlstats/persistedsqlstats/provider.go:193 +0x91
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*PersistedSQLStats).startSQLStatsFlushLoop.func1({0x6fc2000, 0xc0079ea2a0})
	github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/pkg/sql/sqlstats/persistedsqlstats/provider.go:195 +0x319
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:461 +0x43b

goroutine 369276 [semacquire, 2815 minutes]:
sync.runtime_SemacquireMutex(0x3b?, 0x7c?, 0xc0355c0ec0?)
	GOROOT/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc004a82860)
	GOROOT/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
	GOROOT/src/sync/mutex.go:90
github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*PersistedSQLStats).SetFlushDoneCallback(0xc0503194c8?, 0xc019f672e0?)
	github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats/pkg/sql/sqlstats/persistedsqlstats/provider.go:138 +0x59
github.com/cockroachdb/cockroach/pkg/sql.(*sqlActivityUpdateJob).Resume(0xc00ba9c270, {0x6fc2000, 0xc0108964e0}, {0x575cb60?, 0xc00ba9c298})
	github.com/cockroachdb/cockroach/pkg/sql/sql_activity_update_job.go:97 +0x237
github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2(0xc017d4c9c0?, 0x0?, {0x6f960b0?, 0xc00ba9c270?}, {0x6fc2000?, 0xc0108964e0?}, {0x575cb60?, 0xc00ba9c298?}, 0xc08b989850)
	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1624 +0xcc
github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine(0xc0006fca00, {0x6fc2000, 0xc00d300db0}, {0x575cb60, 0xc00ba9c298}, {0x6f960b0, 0xc00ba9c270}, 0xc017d4c9c0, {0xc012cf2680, 0x7}, ...)
	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1625 +0x8df
github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob(0xc0006fca00, {0x6fc1f58, 0xc00c063240}, {0x6f960b0, 0xc00ba9c270}, 0xc017d4c9c0, {0xc012cf2680, 0x7}, {0xc00584de20, 0x7})
	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:474 +0x637
github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func2({0x6fc2000?, 0xc00d300b40?})
	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:388 +0x146
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x146
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:461 +0x43b

@erikgrinaker

I see @knz was in this area recently: #100476.

Mind having a look?

@erikgrinaker

The fact that we've been waiting 2815 minutes to acquire a mutex in SetFlushDoneCallback() seems bad.

@erikgrinaker

Full goroutine dump: goroutines.txt

@knz knz added branch-master Failures and bugs on the master branch. GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.1.0 labels Apr 28, 2023
knz added a commit to knz/cockroach that referenced this issue Apr 28, 2023
Prior to this change, the coordination between the stats flusher
task (an async stopper task) and the activity flusher job was
performed using a two-step process:

- the stats persistence task offered to call a callback _function_
  every time a flush would complete.
- the job would _reconfigure the callback function_ on each iteration.
- the function was writing to a channel that was subsequently
  read by the job iteration body.

This approach was defective in 3 ways:

1. if the job iteration body exited (e.g. due to a server drain)
   *after* it installed the callback fn, but *before* the stats flusher
   read and called the callback fn, there was a window of time
   where a deadlock could occur:
   - the stats flusher retrieves the pointer to the caller fn
     but doesn't call it yet.
   - the job loop exits. From then on it will not read from the
     channel any more.
   - the stats flusher attempts to write to the channel.
     A deadlock occurs.

   (This was seen during testing. See cockroachdb#102574)

   The fix here is to always jointly `select` the write to the channel
   and also a read from the drain/stopper signals, to abort the
   channel operation if a shutdown is requested.

2. the stats flusher task was holding the mutex locked while
   performing the channel write. This is generally bad code hygiene
   as it forces the code maintainer to double-check whether the lock
   and channel operations don't mutually interlock.

   The fix is to use the mutex to retrieve the channel reference, and
   then write to the channel while the mutex is not held any more.

3. the API between the two was defining a *callback function* where
   really just a notification channel was needed.

   The fix here is to simplify the API.

Release note: None
craig bot pushed a commit that referenced this issue Apr 28, 2023
100181: kv: Use strict types for common fields r=erikgrinaker a=andrewbaptist

This PR introduces 3 new typed fields in mvcc.go:
RaftTerm, RaftIndex and LeaseSequence. These fields were previously just uint64 throughout the code, which made the code harder to read and risked incorrect conversions.

Epic: none

Release note: None

102407: kvserver: check PROSCRIBED lease status over UNUSABLE r=erikgrinaker,tbg a=pavelkalinnikov

The PROSCRIBED lease status, just like EXPIRED, puts a lease into a definitively invalid state. The UNUSABLE state (when the request timestamp is in the stasis period) is less clear-cut: we still own the lease, but callers may use it or not depending on context.

For example, the closed timestamp side-transport ignores the UNUSABLE state (because we still own the lease) and takes the lease as usable for its purposes. Because of the order in which the checks were made, this led to a bug: a PROSCRIBED lease is reported as UNUSABLE during stasis periods, the closed timestamp side-transport then considers it usable, and it updates closed timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to "softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes #98698
Fixes #99931
Fixes #100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its side-transport. Previously, during asymmetric partitions, a node that transfers a lease away, and misses a liveness heartbeat, could then erroneously update the closed timestamp during the stasis period of its liveness. This could lead to closed timestamp invariant violation, and node crashes; in extreme cases this could lead to inconsistencies in read-only queries.

102503: concurrency: do not partition locks in the lock table by span scope r=nvanbenschoten a=arulajmani

This patch is entirely a refactor and does not change any functionality. This is done in preparation for introducing `LockSpanSets` to track lock spans, which do not make a distinction between global and local keys (unlike `SpanSets`, which do).

The main changes here are in `lockTableImpl`, which actually stores locks, and `lockTableGuardImpl`, which snapshots the lock table. We no longer make a distinction between locks on Local and Global keys when storing them. The majority of this diff is test file churn caused by the printing changes to the lock table.

Informs #102008

Release note: None

102590: sql,persistedsqlstats: prevent a deadlock during shutdown r=j82w a=knz

Fixes #102574.


Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in #102590 Apr 28, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
cameronnunez pushed a commit to cameronnunez/cockroach that referenced this issue May 2, 2023