roachtest: acceptance/bank/node-restart flake #57798

Closed
RaduBerinde opened this issue Dec 10, 2020 · 6 comments · Fixed by #58722
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest

Comments

@RaduBerinde
Member

https://teamcity.cockroachdb.com/viewLog.html?buildId=2510921&buildTypeId=Cockroach_UnitTests

test artifacts and logs in: artifacts/acceptance/bank/node-restart/run_1
	bank.go:354,bank.go:469,acceptance.go:104,test_runner.go:760: after 33.7s: pq: query execution canceled due to statement timeout
		(1) attached stack trace
		  -- stack trace:
		  | main.(*bankClient).transferMoney
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:74
		  | main.(*bankState).transferMoney
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:158
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) after 33.7s
		Wraps: (3) pq: query execution canceled due to statement timeout
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *pq.Error
@RaduBerinde RaduBerinde added C-test-failure Broken test (automatically or manually discovered). O-roachtest labels Dec 10, 2020
@tbg
Member

tbg commented Dec 16, 2020

Timed out here:

// If this statement gets stuck, the test harness will get stuck. Run with a
// statement timeout, which unfortunately precludes the use of prepared
// statements.
q := fmt.Sprintf(`
SET statement_timeout = '30s';
UPDATE bank.accounts
SET balance = CASE id WHEN %[1]d THEN balance-%[3]d WHEN %[2]d THEN balance+%[3]d END
WHERE id IN (%[1]d, %[2]d) AND (SELECT balance >= %[3]d FROM bank.accounts WHERE id = %[1]d);
`, from, to, amount)

Unfortunately this will be annoying to track down. For some reason this also just happened on release-20.2: #57982

@ajwerner
Contributor

ajwerner commented Jan 4, 2021

@tbg
Member

tbg commented Jan 5, 2021

Ah, that one is helpful, thanks @ajwerner! Look at this:

goroutine 13 [select, 9 minutes]:
runtime.gopark(0x4ea8df0, 0x0, 0x1809, 0x1)
	/usr/local/go/src/runtime/proc.go:306 +0xe5 fp=0xc0000bce08 sp=0xc0000bcde8 pc=0x47f3e5
[...]

goroutine 29 [select]:
runtime.gopark(0x4ea8df0, 0x0, 0x1809, 0x1)
	/usr/local/go/src/runtime/proc.go:306 +0xe5 fp=0xc0005e67a0 sp=0xc0005e6780 pc=0x47f3e5
runtime.selectgo(0xc0005e6970, 0xc0005e691c, 0x3, 0x0, 0x3fc3333333333333)
	/usr/local/go/src/runtime/select.go:338 +0xcef fp=0xc0005e68c8 sp=0xc0005e67a0 pc=0x48f54f
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc0005e7018, 0x54e3720)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:128 +0x187 fp=0xc0005e69f8 sp=0xc0005e68c8 pc=0x1501dc7
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000b80400, 0x54e3720, 0xc00116a5d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1465 +0x258 fp=0xc0005e7d88 sp=0xc0005e69f8 pc=0x19c13b8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000b80400, 0x54e3720, 0xc00116a5d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1143 +0x187f fp=0xc0005e8368 sp=0xc0005e7d88 pc=0x19c093f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000b80400, 0x54e3720, 0xc000fdc6c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:782 +0x5bb fp=0xc0005e8678 sp=0xc0005e8368 pc=0x19bd21b
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc000e83b08, 0x54e3720, 0xc000fdc6c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:220 +0x9f fp=0xc0005e8760 sp=0xc0005e8678 pc=0x160eb7f
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc000e83ab0, 0x54e3720, 0xc000fdc6c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:796 +0x13c fp=0xc0005e8838 sp=0xc0005e8760 pc=0x161207c
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:778
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send-fm(0x54e3720, 0xc000fdc6c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:775 +0xf3 fp=0xc0005e8990 sp=0xc0005e8838 pc=0x161f293
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0x54e3720, 0xc000fdc6c0, 0xc001818b08, 0xc001e90500, 0x8670, 0x8670)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x107 fp=0xc0005e8ad8 sp=0xc0005e8990 pc=0x16119c7
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0xc000e83ab0, 0x54e3720, 0xc000fdc6c0, 0xc001e90500, 0xc001e98400, 0xc000630201)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:729 +0x9c fp=0xc0005e8b28 sp=0xc0005e8ad8 pc=0x1611b9c
github.com/cockroachdb/cockroach/pkg/kv.(*DB).PutInline(0xc000e83ab0, 0x54e3720, 0xc000fdc6c0, 0x44a86e0, 0xc000f7e140, 0x460a280, 0xc001e98400, 0x2, 0x200000014)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:370 +0x9e fp=0xc0005e8b70 sp=0xc0005e8b28 pc=0x160f7fe
github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).WriteNodeStatus(0xc001201500, 0x54e3720, 0xc000fdc6c0, 0xc000e83ab0, 0x1, 0x47c56b8, 0x3, 0xc000fe9840, 0xf, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:523 +0x805 fp=0xc0005e8ce0 sp=0xc0005e8b70 pc=0x37abba5
github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1(0x54e3720, 0xc000fdc6c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:781 +0x20a fp=0xc0005e93a0 sp=0xc0005e8ce0 pc=0x38934aa
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc0000e8120, 0x54e3720, 0xc000fdc6c0, 0x481add2, 0x1a, 0xc0005e9468, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:309 +0xf1 fp=0xc0005e9418 sp=0xc0005e93a0 pc=0x143ad51
github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus(0xc000f8b600, 0x54e3720, 0xc000fdc6c0, 0x0, 0x0, 0x0, 0x54e3720)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:752 +0xbd fp=0xc0005e94a0 sp=0xc0005e9418 pc=0x385a1bd
github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus(0xc000f8b600, 0x2540be400, 0xc0005d1e60, 0xc0000e8120)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:714 +0xbf fp=0xc0005e9500 sp=0xc0005e94a0 pc=0x3859fff
github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart(0xc0011d6e00, 0x54e3720, 0xc000f89fb0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1625 +0x2965 fp=0xc0005eb2f0 sp=0xc0005e9500 pc=0x3867465
github.com/cockroachdb/cockroach/pkg/cli.runStart.func4.2(0xc0000e8120, 0xc0006a6438, 0xc000fe88b0, 0x54e3720, 0xc000f89fb0, 0x0, 0x2b6beafd, 0xed7858522, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:573 +0xfa fp=0xc0005ebed0 sp=0xc0005eb2f0 pc=0x3a666fa
github.com/cockroachdb/cockroach/pkg/cli.runStart.func4(0xc0006a6438, 0x54e3720, 0xc000f89fb0, 0xc000fb7480, 0xc0000e8120, 0xc000fe88b0, 0x0, 0x2b6beafd, 0xed7858522, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:696 +0x12a fp=0xc0005ebf88 sp=0xc0005ebed0 pc=0x3a6850a
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0005ebf90 sp=0xc0005ebf88 pc=0x4b57c1
created by github.com/cockroachdb/cockroach/pkg/cli.runStart
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:529 +0x9b7

which is from the latest log of n1. The first stack proves that the binary was started nine minutes ago. The second one shouldn't be there nine minutes in: it's a one-off synchronous write of the node status in PreStart. If we're hanging there, it's because that write is not succeeding. My immediate suspicion is that we have lost quorum (all nodes are stuck there, and quorum will only be restored once they get past that point). However, the call to startWriteNodeStatus is intentionally late in the startup sequence, so this shouldn't be it. But I see this same stack on all four nodes, so the general theory that something is getting wedged at startup seems reasonable.
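
The "nine minutes" reading comes from the goroutine header: the Go runtime appends `, N minutes` to the wait state once a goroutine has been blocked for at least a minute. A rough sketch of pulling that out of a dump, to find long-blocked goroutines quickly, might look like this. The type and function names are made up for illustration, and the regular expression assumes the header shapes seen in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// goroutineHeader holds the fields of one "goroutine N [state, M minutes]:" line.
type goroutineHeader struct {
	ID      int
	State   string
	Minutes int // 0 when the header carries no wait duration
}

// headerRE matches headers such as "goroutine 13 [select, 9 minutes]:" and
// "goroutine 29 [select]:". Extra annotations after the state are tolerated.
var headerRE = regexp.MustCompile(`^goroutine (\d+) \[([^,\]]+)(?:, (\d+) minutes)?[^\]]*\]:$`)

// parseHeaders scans a stack dump and returns every goroutine header it finds.
func parseHeaders(dump string) []goroutineHeader {
	var out []goroutineHeader
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		m := headerRE.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue // a frame line, not a header
		}
		id, _ := strconv.Atoi(m[1])
		mins := 0
		if m[3] != "" {
			mins, _ = strconv.Atoi(m[3])
		}
		out = append(out, goroutineHeader{ID: id, State: m[2], Minutes: mins})
	}
	return out
}

func main() {
	dump := "goroutine 13 [select, 9 minutes]:\nruntime.gopark(...)\n\ngoroutine 29 [select]:\nruntime.gopark(...)\n"
	for _, h := range parseHeaders(dump) {
		fmt.Printf("goroutine %d: state=%s waited>=%dm\n", h.ID, h.State, h.Minutes)
	}
}
```

A missing duration (as on goroutine 29) only means the current wait is under a minute. That fits a retry loop that keeps re-entering `select`, so it does not by itself say how long the enclosing call has been running.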

Throughout these nine minutes, the main logs show a number of "have been waiting" messages (the stacks were in the stderr logs; I think roachtest sends a SIGSEGV on timeouts so that stacks are dumped):

W210104 21:53:40.718809 30 kv/kvclient/kvcoord/dist_sender.go:1512 ⋮ [n4,summaries] slow range RPC: have been waiting 61.08s (64 attempts) for RPC Put [‹/System/StatusNode/4›,‹/Min›) to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1LEARNER, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0]; resp: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] lease acquisition attempt lost to another lease, which has expired in the meantime; r3: replica (n2,s2):4VOTER_INCOMING not lease holder; lease holder unknown›

However, I never see any such messages from the raft level. Instead, what I can see in the goroutines is concurrency control.

So, I think a narrative starts to form:

  1. somehow, all of the initial node status summary calls manage to contend on each other (???)
  2. due to that, we never get out of (*Server).PreStart
  3. so we never open the SQL listener
  4. so we see the above failure mode.

@nvanbenschoten this could benefit from an extra pair of eyes (more precisely your eyes :-) )

Full artifacts here. The stacks are here:

reporter tries to scan meta2, sits in distsender while pushing a lock

goroutine 342 [select]:
runtime.gopark(0x4ea8df0, 0x0, 0x1809, 0x1)
	/usr/local/go/src/runtime/proc.go:306 +0xe5 fp=0xc0022beb70 sp=0xc0022beb50 pc=0x47f3e5
runtime.selectgo(0xc0022bed40, 0xc0022becec, 0x3, 0x0, 0x3fc3333333333333)
	/usr/local/go/src/runtime/select.go:338 +0xcef fp=0xc0022bec98 sp=0xc0022beb70 pc=0x48f54f
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc0022bf3e8, 0x54e3720)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:128 +0x187 fp=0xc0022bedc8 sp=0xc0022bec98 pc=0x1501dc7
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000510c00, 0x54e3720, 0xc001bbf890, 0x1657260a43d14608, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1465 +0x258 fp=0xc0022c0158 sp=0xc0022bedc8 pc=0x19c13b8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000510c00, 0x54e3720, 0xc001bbf890, 0x1657260a43d14608, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1143 +0x187f fp=0xc0022c0738 sp=0xc0022c0158 pc=0x19c093f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000510c00, 0x54e3720, 0xc001bbf890, 0x1657260a43d14608, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:782 +0x5bb fp=0xc0022c0a48 sp=0xc0022c0738 pc=0x19bd21b
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc0002faec8, 0x54e3720, 0xc001bbf890, 0x1657260a43d14608, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:220 +0x9f fp=0xc0022c0b30 sp=0xc0022c0a48 pc=0x160eb7f
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc0002fae70, 0x54e3720, 0xc001bbf890, 0x1657260a43d14608, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:796 +0x13c fp=0xc0022c0c08 sp=0xc0022c0b30 pc=0x161207c
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:778
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send-fm(0x54e3720, 0xc001bbf890, 0x1657260a43d14608, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:775 +0xf3 fp=0xc0022c0d60 sp=0xc0022c0c08 pc=0x161f293
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0x54e3720, 0xc001bbf890, 0xc0017a8ed8, 0xc0027e4a00, 0x0, 0xc0017a93d0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x107 fp=0xc0022c0ea8 sp=0xc0022c0d60 pc=0x16119c7
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0xc0002fae70, 0x54e3720, 0xc001bbf890, 0xc0027e4a00, 0xc0017a9138, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:729 +0x9c fp=0xc0022c0ef8 sp=0xc0022c0ea8 pc=0x1611b9c
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).MaybePushTransactions(0xc00127b040, 0x54e3720, 0xc001bbf890, 0xc0017a93d0, 0x1657260a3dd69762, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:373 +0x745 fp=0xc0022c1310 sp=0xc0022c0ef8 pc=0x1ac9b85
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).PushTransaction(0xc00127b040, 0x54e3720, 0xc001bbf890, 0xc000cce718, 0x1657260a3dd69762, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:283 +0x14e fp=0xc0022c14e0 sp=0xc0022c1310 pc=0x1ac92ce
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).pushLockTxn(0xc002881280, 0x54e3720, 0xc001bbf890, 0xc00143c900, 0x1657260a3dd69762, 0x0, 0x0, 0x0, 0xc00127c100, 0x1, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:471 +0x13c fp=0xc0022c18f8 sp=0xc0022c14e0 pc=0x1adb6bc
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).WaitOn(0xc002881280, 0x54e3720, 0xc001bbf890, 0xc00143c900, 0x1657260a3dd69762, 0x0, 0x0, 0x0, 0xc00127c100, 0x1, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:370 +0x2f2 fp=0xc0022c1d40 sp=0xc0022c18f8 pc=0x1ada7b2
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc001155c80, 0x54e3720, 0xc001bbf890, 0xc00018e620, 0xc00143c900, 0x1657260a3dd69762, 0x0, 0x0, 0x0, 0xc00127c100, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:172 +0x41f fp=0xc0022c1ed0 sp=0xc0022c1d40 pc=0x1acf07f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0xc001155c80, 0x54e3720, 0xc001bbf890, 0x0, 0xc00143c900, 0x1657260a3dd69762, 0x0, 0x0, 0x0, 0xc00127c100, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:123 +0xf8 fp=0xc0022c1ff0 sp=0xc0022c1ed0 pc=0x1acea38
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc001187000, 0x54e3720, 0xc001bbf890, 0xc002713710, 0x4ea0d78, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:328 +0x3b6 fp=0xc0022c2338 sp=0xc0022c1ff0 pc=0x1d4f5b6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc001187000, 0x54e3720, 0xc001bbf860, 0x1, 0xc002713710, 0x0, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:93 +0x274 fp=0xc0022c2550 sp=0xc0022c2338 pc=0x1d4e0d4
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:36
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc001180000, 0x54e3720, 0xc001bbf830, 0x1657260a3dd69762, 0x0, 0x300000003, 0x2, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5ec fp=0xc0022c2dd0 sp=0xc0022c2550 pc=0x1d8e66c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc0002fb730, 0x54e3720, 0xc001bbf830, 0x0, 0x0, 0x300000003, 0x2, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0x118 fp=0xc0022c2f70 sp=0xc0022c2dd0 pc=0x1d99a98
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x54e3720, 0xc001bbf830, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:874 +0x225 fp=0xc0022c3180 sp=0xc0022c2f70 pc=0x3894045
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0006b1710, 0x54e3720, 0xc001bbf830, 0x47ef56d, 0x10, 0xc0017ab250, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0xf1 fp=0xc0022c31f8 sp=0xc0022c3180 pc=0x143aef1
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000ca4100, 0x54e3720, 0xc001bbf830, 0xc002713680, 0xc001bbf830, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:862 +0x15b fp=0xc0022c3280 sp=0xc0022c31f8 pc=0x385a5fb
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000ca4100, 0x54e3720, 0xc001bbf800, 0xc002713680, 0xc001bbf740, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:900 +0xa5 fp=0xc0022c3320 sp=0xc0022c3280 pc=0x385a705
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:462
github.com/cockroachdb/cockroach/pkg/rpc.(*internalClientAdapter).Batch(0xc0006b7710, 0x54e3720, 0xc001bbf800, 0xc002713680, 0x0, 0x0, 0x0, 0x10, 0xc002618ae0, 0xc0017ab410)
	<autogenerated>:1 +0x63 fp=0xc0022c3368 sp=0xc0022c3320 pc=0x18ecdc3
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc0016c9340, 0x54e3720, 0xc001bbf800, 0x3, 0x552d9e0, 0xc0006b7710, 0x0, 0x0, 0x300000003, 0x2, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:193 +0x176 fp=0xc0022c3420 sp=0xc0022c3368 pc=0x19cc0b6
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc0016c9340, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x300000003, 0x2, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:175 +0x198 fp=0xc0022c3518 sp=0xc0022c3420 pc=0x19cbed8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc000510c00, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1885 +0x83f fp=0xc0022c3fb0 sp=0xc0022c3518 pc=0x19c4f7f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000510c00, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1505 +0x36a fp=0xc0022c5340 sp=0xc0022c3fb0 pc=0x19c14ca
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000510c00, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1143 +0x187f fp=0xc0022c5920 sp=0xc0022c5340 pc=0x19c093f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000510c00, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:782 +0x5bb fp=0xc0022c5c30 sp=0xc0022c5920 pc=0x19bd21b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc0004ae438, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:86 +0x135 fp=0xc0022c5db0 sp=0xc0022c5c30 pc=0x19de555
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0xc0004ae400, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x9e fp=0xc0022c5e70 sp=0xc0022c5db0 pc=0x19d745e
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0xc0004ae3d0, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:126 +0x8db fp=0xc0022c60a8 sp=0xc0022c5e70 pc=0x19d465b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc0004ae330, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:267 +0xb5 fp=0xc0022c64d8 sp=0xc0022c60a8 pc=0x19dc095
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc0004ae330, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:202 +0x2c5 fp=0xc0022c6778 sp=0xc0022c64d8 pc=0x19db685
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc0004ae270, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:253 +0x1b5 fp=0xc0022c68c8 sp=0xc0022c6778 pc=0x19d7b55
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc0004ae250, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x215 fp=0xc0022c69e8 sp=0xc0022c68c8 pc=0x19db015
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0xc0004ae1b8, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:171 +0x23f fp=0xc0022c6af8 sp=0xc0022c69e8 pc=0x19d5cff
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc0004ae000, 0x54e3720, 0xc001bbf740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:499 +0x415 fp=0xc0022c6c98 sp=0xc0022c6af8 pc=0x19ce875
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc0002fae70, 0x54e3720, 0xc0004281e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:796 +0x13c fp=0xc0022c6d70 sp=0xc0022c6c98 pc=0x161207c
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc0027135f0, 0x54e3720, 0xc0004281e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:919 +0x12e fp=0xc0022c6ed8 sp=0xc0022c6d70 pc=0x161aaae
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send-fm(0x54e3720, 0xc0004281e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:903 +0x85 fp=0xc0022c6f98 sp=0xc0022c6ed8 pc=0x161f365
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0x54e3720, 0xc0004281e0, 0xc0030b9118, 0xc000da7400, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x107 fp=0xc0022c70e0 sp=0xc0022c6f98 pc=0x16119c7
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run(0xc0027135f0, 0x54e3720, 0xc0004281e0, 0xc000da7400, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:590 +0xe7 fp=0xc0022c7140 sp=0xc0022c70e0 pc=0x16188c7
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*meta2RangeIter).readBatch(0xc001020140, 0x54e3720, 0xc0004281e0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:727 +0x1f8 fp=0xc0022c71f8 sp=0xc0022c7140 pc=0x37d9278
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*meta2RangeIter).Next(0xc001020140, 0x54e3720, 0xc0004281e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:680 +0x1e9 fp=0xc0022c73a0 sp=0xc0022c71f8 pc=0x37d8b09
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.visitRanges(0x54e3720, 0xc0004281e0, 0x5490420, 0xc001020140, 0xc001168280, 0xc0022c7750, 0x3, 0x3, 0xc00072efa0, 0xc00072efb0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:570 +0x18f fp=0xc0022c7598 sp=0xc0022c73a0 pc=0x37d804f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*Reporter).update(0xc0005010e0, 0x54e3720, 0xc0004281e0, 0xc0022c7e88, 0xc0022c7e38, 0xc0022c7e60, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:221 +0x8b7 fp=0xc0022c7d90 sp=0xc0022c7598 pc=0x37d6357
github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports.(*Reporter).Start.func2(0x54e3720, 0xc0004281b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/reports/reporter.go:140 +0x34e fp=0xc0022c7f48 sp=0xc0022c7d90 pc=0x37db70e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc00174f320, 0xc0006b1710, 0xc001755be0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:222 +0xe4 fp=0xc0022c7fc8 sp=0xc0022c7f48 pc=0x143c984
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0022c7fd0 sp=0xc0022c7fc8 pc=0x4b57c1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:215 +0xa8

There are two more stacks containing (*Node).Batch:

one in WaitOn

goroutine 16014 [select]:
runtime.gopark(0x4ea8df0, 0x0, 0x1809, 0x1)
	/usr/local/go/src/runtime/proc.go:306 +0xe5 fp=0xc001be3cf0 sp=0xc001be3cd0 pc=0x47f3e5
runtime.selectgo(0xc001be4060, 0xc001be3ef0, 0x4, 0xc000421900, 0x5)
	/usr/local/go/src/runtime/select.go:338 +0xcef fp=0xc001be3e18 sp=0xc001be3cf0 pc=0x48f54f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).WaitOn(0xc002881280, 0x54e3720, 0xc00288d500, 0x0, 0x1657267bc332da59, 0x0, 0x0, 0x0, 0xc0028f6850, 0x1, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:156 +0x22a fp=0xc001be4260 sp=0xc001be3e18 pc=0x1ada6ea
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc001155c80, 0x54e3720, 0xc00288d500, 0xc000cc60e0, 0x0, 0x1657267bc332da59, 0x0, 0x0, 0x0, 0xc0028f6850, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:172 +0x41f fp=0xc001be43f0 sp=0xc001be4260 pc=0x1acf07f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0xc001155c80, 0x54e3720, 0xc00288d500, 0x0, 0x0, 0x1657267bc332da59, 0x0, 0x0, 0x0, 0xc0028f6850, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:123 +0xf8 fp=0xc001be4510 sp=0xc001be43f0 pc=0x1acea38
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc001187000, 0x54e3720, 0xc00288d500, 0xc000e0f710, 0x4ea0d78, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:328 +0x3b6 fp=0xc001be4858 sp=0xc001be4510 pc=0x1d4f5b6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc001187000, 0x54e3720, 0xc00288d4d0, 0x1, 0xc000e0f710, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:93 +0x274 fp=0xc001be4a70 sp=0xc001be4858 pc=0x1d4e0d4
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:36
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc001180000, 0x54e3720, 0xc00288d4a0, 0x1657267bc332da59, 0x0, 0x300000003, 0x2, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5ec fp=0xc001be52f0 sp=0xc001be4a70 pc=0x1d8e66c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc0002fb730, 0x54e3720, 0xc00288d4a0, 0x0, 0x0, 0x300000003, 0x2, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0x118 fp=0xc001be5490 sp=0xc001be52f0 pc=0x1d99a98
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x54e3720, 0xc00288d470, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:874 +0x225 fp=0xc001be56a0 sp=0xc001be5490 pc=0x3894045
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0006b1710, 0x54e3720, 0xc00288d470, 0x47ef56d, 0x10, 0xc001be5770, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0xf1 fp=0xc001be5718 sp=0xc001be56a0 pc=0x143aef1
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000ca4100, 0x54e3720, 0xc00288d470, 0xc000e0f680, 0xc00288d470, 0x42ac380, 0x4750ca0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:862 +0x15b fp=0xc001be57a0 sp=0xc001be5718 pc=0x385a5fb
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000ca4100, 0x54e3720, 0xc00288d410, 0xc000e0f680, 0xc000ca4100, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:900 +0xa5 fp=0xc001be5840 sp=0xc001be57a0 pc=0x385a705
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler.func1(0x54e3720, 0xc00288d410, 0x4793b00, 0xc000e0f680, 0x0, 0x0, 0x803bc01, 0x803bc20)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9076 +0x89 fp=0xc001be5888 sp=0xc001be5840 pc=0x13d1f29
github.com/cockroachdb/cockroach/pkg/util/tracing.ServerInterceptor.func1(0x54e3720, 0xc00288d410, 0x4793b00, 0xc000e0f680, 0xc002709420, 0xc002709440, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:125 +0x669 fp=0xc001be5a08 sp=0xc001be5888 pc=0xe04169
google.golang.org/grpc.getChainUnaryHandler.func1(0x54e3720, 0xc00288d410, 0x4793b00, 0xc000e0f680, 0xc001be5ac8, 0xc7c6c8, 0x44dd960, 0xc001743700)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:921 +0xe7 fp=0xc001be5a88 sp=0xc001be5a08 pc=0xc8f687
github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func1(0x54e3720, 0xc00288d410, 0x4793b00, 0xc000e0f680, 0xc002709420, 0xc001743700, 0xc001743700, 0x20, 0x424e5a0, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:176 +0xa8 fp=0xc001be5ad8 sp=0xc001be5a88 pc=0x18e7d08
google.golang.org/grpc.chainUnaryServerInterceptors.func1(0x54e3720, 0xc00288d410, 0x4793b00, 0xc000e0f680, 0xc002709420, 0xc002709440, 0xc001c3fba0, 0x6507a6, 0x44f96a0, 0xc00288d410)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:907 +0xd0 fp=0xc001be5b40 sp=0xc001be5ad8 pc=0xc8f530
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler(0x4750ca0, 0xc000ca4100, 0x54e3720, 0xc00288d410, 0xc002e69500, 0xc0012a78e0, 0x54e3720, 0xc00288d410, 0xc001749f80, 0x31)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9078 +0x150 fp=0xc001be5bb0 sp=0xc001be5b40 pc=0x126d4d0
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000e22000, 0x554a720, 0xc000f1cd80, 0xc001469100, 0xc0011bdf80, 0x75a2e80, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1082 +0x522 fp=0xc001be5e40 sp=0xc001be5bb0 pc=0xc7cca2
google.golang.org/grpc.(*Server).handleStream(0xc000e22000, 0x554a720, 0xc000f1cd80, 0xc001469100, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1405 +0xcc5 fp=0xc001be5f68 sp=0xc001be5e40 pc=0xc80da5
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc001c3d2f0, 0xc000e22000, 0x554a720, 0xc000f1cd80, 0xc001469100)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:746 +0xa5 fp=0xc001be5fb8 sp=0xc001be5f68 pc=0xc8f1e5
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc001be5fc0 sp=0xc001be5fb8 pc=0x4b57c1
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:744 +0xa5

and another in WaitOn, pushing a transaction (also sitting in DistSender):

goroutine 15829 [select]:
runtime.gopark(0x4ea8df0, 0x0, 0x1809, 0x1)
	/usr/local/go/src/runtime/proc.go:306 +0xe5 fp=0xc001fdf090 sp=0xc001fdf070 pc=0x47f3e5
runtime.selectgo(0xc001fdf260, 0xc001fdf20c, 0x3, 0x0, 0x3fc3333333333333)
	/usr/local/go/src/runtime/select.go:338 +0xcef fp=0xc001fdf1b8 sp=0xc001fdf090 pc=0x48f54f
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc001fdf908, 0x54e3720)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:128 +0x187 fp=0xc001fdf2e8 sp=0xc001fdf1b8 pc=0x1501dc7
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc000510c00, 0x54e3720, 0xc00289da70, 0x1657267bc6248024, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1465 +0x258 fp=0xc001fe0678 sp=0xc001fdf2e8 pc=0x19c13b8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc000510c00, 0x54e3720, 0xc00289da70, 0x1657267bc6248024, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1143 +0x187f fp=0xc001fe0c58 sp=0xc001fe0678 pc=0x19c093f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc000510c00, 0x54e3720, 0xc00289da70, 0x1657267bc6248024, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:782 +0x5bb fp=0xc001fe0f68 sp=0xc001fe0c58 pc=0x19bd21b
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc0002faec8, 0x54e3720, 0xc00289da70, 0x1657267bc6248024, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:220 +0x9f fp=0xc001fe1050 sp=0xc001fe0f68 pc=0x160eb7f
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc0002fae70, 0x54e3720, 0xc00289da70, 0x1657267bc6248024, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:796 +0x13c fp=0xc001fe1128 sp=0xc001fe1050 pc=0x161207c
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:778
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send-fm(0x54e3720, 0xc00289da70, 0x1657267bc6248024, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:775 +0xf3 fp=0xc001fe1280 sp=0xc001fe1128 pc=0x161f293
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0x54e3720, 0xc00289da70, 0xc0025df3f8, 0xc001f0b900, 0x0, 0xc0025df8f0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x107 fp=0xc001fe13c8 sp=0xc001fe1280 pc=0x16119c7
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0xc0002fae70, 0x54e3720, 0xc00289da70, 0xc001f0b900, 0xc0025df658, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:729 +0x9c fp=0xc001fe1418 sp=0xc001fe13c8 pc=0x1611b9c
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).MaybePushTransactions(0xc00127b040, 0x54e3720, 0xc00289da70, 0xc0025df8f0, 0x165726797a6df257, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:373 +0x745 fp=0xc001fe1830 sp=0xc001fe1418 pc=0x1ac9b85
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).PushTransaction(0xc00127b040, 0x54e3720, 0xc00289da70, 0xc000cce718, 0x165726797a6df257, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:283 +0x14e fp=0xc001fe1a00 sp=0xc001fe1830 pc=0x1ac92ce
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).pushLockTxn(0xc002881280, 0x54e3720, 0xc00289da70, 0x0, 0x165726797a6df257, 0x0, 0x0, 0x0, 0xc0006b62f0, 0x1, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:471 +0x13c fp=0xc001fe1e18 sp=0xc001fe1a00 pc=0x1adb6bc
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).WaitOn(0xc002881280, 0x54e3720, 0xc00289da70, 0x0, 0x165726797a6df257, 0x0, 0x0, 0x0, 0xc0006b62f0, 0x1, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:370 +0x2f2 fp=0xc001fe2260 sp=0xc001fe1e18 pc=0x1ada7b2
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc001155c80, 0x54e3720, 0xc00289da70, 0xc000cce000, 0x0, 0x165726797a6df257, 0x0, 0x0, 0x0, 0xc0006b62f0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:172 +0x41f fp=0xc001fe23f0 sp=0xc001fe2260 pc=0x1acf07f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0xc001155c80, 0x54e3720, 0xc00289da70, 0x0, 0x0, 0x165726797a6df257, 0x0, 0x0, 0x0, 0xc0006b62f0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:123 +0xf8 fp=0xc001fe2510 sp=0xc001fe23f0 pc=0x1acea38
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc001187000, 0x54e3720, 0xc00289da70, 0xc0011e2b40, 0x4ea0d78, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:328 +0x3b6 fp=0xc001fe2858 sp=0xc001fe2510 pc=0x1d4f5b6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc001187000, 0x54e3720, 0xc00289da40, 0x1, 0xc0011e2b40, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:93 +0x274 fp=0xc001fe2a70 sp=0xc001fe2858 pc=0x1d4e0d4
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:36
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc001180000, 0x54e3720, 0xc00289da10, 0x165726797a6df257, 0x0, 0x300000003, 0x2, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:194 +0x5ec fp=0xc001fe32f0 sp=0xc001fe2a70 pc=0x1d8e66c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc0002fb730, 0x54e3720, 0xc00289da10, 0x0, 0x0, 0x300000003, 0x2, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0x118 fp=0xc001fe3490 sp=0xc001fe32f0 pc=0x1d99a98
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x54e3720, 0xc00289d9e0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:874 +0x225 fp=0xc001fe36a0 sp=0xc001fe3490 pc=0x3894045
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc0006b1710, 0x54e3720, 0xc00289d9e0, 0x47ef56d, 0x10, 0xc0025e1770, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0xf1 fp=0xc001fe3718 sp=0xc001fe36a0 pc=0x143aef1
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000ca4100, 0x54e3720, 0xc00289d9e0, 0xc0011e2ab0, 0xc00289d9e0, 0x42ac380, 0x4750ca0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:862 +0x15b fp=0xc001fe37a0 sp=0xc001fe3718 pc=0x385a5fb
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000ca4100, 0x54e3720, 0xc00289d980, 0xc0011e2ab0, 0xc000ca4100, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:900 +0xa5 fp=0xc001fe3840 sp=0xc001fe37a0 pc=0x385a705
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler.func1(0x54e3720, 0xc00289d980, 0x4793b00, 0xc0011e2ab0, 0x0, 0x0, 0x803bc01, 0x803bc20)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9076 +0x89 fp=0xc001fe3888 sp=0xc001fe3840 pc=0x13d1f29
github.com/cockroachdb/cockroach/pkg/util/tracing.ServerInterceptor.func1(0x54e3720, 0xc00289d980, 0x4793b00, 0xc0011e2ab0, 0xc0025f7320, 0xc0025f7340, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:125 +0x669 fp=0xc001fe3a08 sp=0xc001fe3888 pc=0xe04169
google.golang.org/grpc.getChainUnaryHandler.func1(0x54e3720, 0xc00289d980, 0x4793b00, 0xc0011e2ab0, 0xc0025e1ac8, 0xc7c6c8, 0x44dd960, 0xc001820bc0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:921 +0xe7 fp=0xc001fe3a88 sp=0xc001fe3a08 pc=0xc8f687
github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func1(0x54e3720, 0xc00289d980, 0x4793b00, 0xc0011e2ab0, 0xc0025f7320, 0xc001820bc0, 0xc001820bc0, 0x20, 0x424e5a0, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:176 +0xa8 fp=0xc001fe3ad8 sp=0xc001fe3a88 pc=0x18e7d08
google.golang.org/grpc.chainUnaryServerInterceptors.func1(0x54e3720, 0xc00289d980, 0x4793b00, 0xc0011e2ab0, 0xc0025f7320, 0xc0025f7340, 0xc00149bba0, 0x6507a6, 0x44f96a0, 0xc00289d980)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:907 +0xd0 fp=0xc001fe3b40 sp=0xc001fe3ad8 pc=0xc8f530
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_Batch_Handler(0x4750ca0, 0xc000ca4100, 0x54e3720, 0xc00289d980, 0xc002b75080, 0xc0012a78e0, 0x54e3720, 0xc00289d980, 0xc001e2b000, 0x31)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9078 +0x150 fp=0xc001fe3bb0 sp=0xc001fe3b40 pc=0x126d4d0
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000e22000, 0x554a720, 0xc000f1c900, 0xc001897800, 0xc0011bdf80, 0x75a2e80, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1082 +0x522 fp=0xc001fe3e40 sp=0xc001fe3bb0 pc=0xc7cca2
google.golang.org/grpc.(*Server).handleStream(0xc000e22000, 0x554a720, 0xc000f1c900, 0xc001897800, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1405 +0xcc5 fp=0xc001fe3f68 sp=0xc001fe3e40 pc=0xc80da5
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc001c3d1f0, 0xc000e22000, 0x554a720, 0xc000f1c900, 0xc001897800)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:746 +0xa5 fp=0xc001fe3fb8 sp=0xc001fe3f68 pc=0xc8f1e5
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc001fe3fc0 sp=0xc001fe3fb8 pc=0x4b57c1
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:744 +0xa5

I sure have some trouble here. Everyone's stuck in DistSender (btw, all of these stacks are on n3; didn't find anything in the other logs other than trouble writing the initial summary); otherwise, I should be seeing someone in MaybeWaitForPush (right?)

Oh, hmm, I am seeing this:

W210104 21:54:37.372618 342 kv/kvclient/kvcoord/dist_sender.go:1512 ⋮ [n3,s3,r1/2:‹/{Min-System/NodeL…}›] slow range RPC: have been waiting 60.03s (64 attempts) for RPC PushTxn(d1da681b->6006f32c) [‹/Local/Range/System/NodeLivenessMax/RangeDescriptor›,‹/Min›) to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1LEARNER, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0]; resp: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] refusing to acquire lease on follower; r3: replica (n4,s4):2 not lease holder; current lease is repl=(n2,s2):4VOTER_INCOMING seq=0 start=0,0 exp=<nil>›

At the same time the replicaGCQueue is regularly timing out:

W210104 21:55:34.989605 7151 kv/kvclient/kvcoord/dist_sender.go:1512 ⋮ [n2,replicaGC,s2,r3/4:‹/System/{NodeLive…-tsd}›] slow range RPC: have been waiting 60.00s (1 attempts) for RPC Scan [‹/Meta2/System/NodeLivenessMax/NULL›,‹/System/""›) to r1:‹/{Min-System/NodeLiveness}› [(n1,s1):1, (n3,s3):2, (n4,s4):3, next=4, gen=4]; resp: aborted during DistSender.Send: context deadline exceeded

@pbardea pbardea added branch-release-20.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 5, 2021
@tbg
Member

tbg commented Jan 6, 2021

I didn't get around to writing this down yesterday, but the issue is known now.

We don't allow VOTER_{INCOMING,OUTGOING,DEMOTING} replicas to acquire a lease, but we did add code that prevents replicas that are not the Raft leader from acquiring the lease (to fix #37906). So we end up in a place where n4 is VOTER_INCOMING and is Raft leader, and n2 is asked to get a lease, and it refuses (redirecting requests to n4), but n4 cannot actually get a lease, ad infinitum.
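To make the cycle concrete, here is a minimal sketch of the two rules interacting. It is a toy model, not the kvserver code: the `replica` struct, `tryAcquireLease` and `anyoneCanAcquire` are hypothetical names, and the replica types are stand-ins for the roachpb ones.

```go
package main

import "fmt"

// ReplicaType is an illustrative stand-in for the replica states named in
// this thread; it is not the roachpb definition.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
	VoterDemoting
	Learner
)

// replica is a hypothetical, simplified view of one replica of a range.
type replica struct {
	node     string
	typ      ReplicaType
	isLeader bool
}

// tryAcquireLease applies the two rules described above and reports whether
// the replica would propose a lease for itself, plus the reason if it refuses.
func tryAcquireLease(r replica) (bool, string) {
	if !r.isLeader {
		// Rule added for #37906: non-leaders redirect to the leader.
		return false, "not Raft leader, redirecting to leader"
	}
	if r.typ != VoterFull {
		// Older defensive rule: only full voters may hold the lease.
		return false, "not VOTER_FULL, cannot hold the lease"
	}
	return true, ""
}

// anyoneCanAcquire reports whether at least one replica would take the lease.
func anyoneCanAcquire(replicas []replica) bool {
	for _, r := range replicas {
		if ok, _ := tryAcquireLease(r); ok {
			return true
		}
	}
	return false
}

func main() {
	// A range in a joint configuration whose Raft leader is VOTER_INCOMING.
	rng := []replica{
		{node: "a", typ: VoterFull},
		{node: "b", typ: VoterFull},
		{node: "c", typ: VoterIncoming, isLeader: true},
	}
	for _, r := range rng {
		ok, why := tryAcquireLease(r)
		fmt.Printf("%s: acquire=%t %s\n", r.node, ok, why)
	}
	fmt.Println("someone can acquire:", anyoneCanAcquire(rng))
}
```

Every replica refuses for a locally reasonable cause, so requests bounce between the followers and the leader until the configuration changes.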

For VOTER_INCOMING, this is just because we were defensive at the time of its introduction; we can let those get the lease. But the same bug exists for any other state that's not VOTER_FULL. For VOTER_OUTGOING, we currently need to avoid giving it a lease, for a transition out of the joint config would necessarily fail - it amounts to a removal of the leaseholder, and we block that:

replID := r.ReplicaID()
for _, rDesc := range crt.Removed() {
	if rDesc.ReplicaID == replID {
		err := errors.Mark(errors.Newf("received invalid ChangeReplicasTrigger %s to remove self (leaseholder)", crt),
			errMarkInvalidReplicationChange)
		log.Errorf(p.ctx, "%v", err)
		return 0, roachpb.NewError(err)
	}
}

So if VOTER_OUTGOING is allowed to get a lease, transitioning out of the joint state becomes more complicated: we need to first transfer the lease to a voter and then that voter has to complete the replication change. This would be fine if the replicate queue were the only actor here, but everyone who needs to mutate the descriptor needs to have the ability to transition out of the joint config (at least that's how we set it up), via

// maybeLeaveAtomicChangeReplicas transitions out of the joint configuration if
// the descriptor indicates one. This involves running a distributed transaction
// updating said descriptor, the result of which will be returned. The
// descriptor returned from this method will contain replicas of type LEARNER
// and VOTER_FULL only.
func maybeLeaveAtomicChangeReplicas(
	ctx context.Context, s *Store, desc *roachpb.RangeDescriptor,
) (*roachpb.RangeDescriptor, error) {
	// We want execChangeReplicasTxn to be able to make sure it's only tasked
	// with leaving a joint state when it's in one, so make sure we don't call
	// it if we're not.
	if !desc.Replicas().InAtomicReplicationChange() {
		return desc, nil
	}
	// NB: this is matched on in TestMergeQueueSeesLearner.
	log.Eventf(ctx, "transitioning out of joint configuration %s", desc)
	// NB: reason and detail won't be used because no range log event will be
	// emitted.
	//
	// TODO(tbg): reconsider this.
	return execChangeReplicasTxn(
		ctx, desc, kvserverpb.ReasonUnknown /* unused */, "", nil, /* iChgs */
		changeReplicasTxnArgs{
			db:                                   s.DB(),
			liveAndDeadReplicas:                  s.allocator.storePool.liveAndDeadReplicas,
			logChange:                            s.logChange,
			testForceJointConfig:                 s.TestingKnobs().ReplicationAlwaysUseJointConfig,
			testAllowDangerousReplicationChanges: s.TestingKnobs().AllowDangerousReplicationChanges,
		})
}

"Luckily", we don't really use VOTER_OUTGOING, due to unrelated issues with Raft. Instead, we always use VOTER_DEMOTING, which transitions the voter into a learner (and then it is removed separately). This is "better" - if we allow VOTER_DEMOTING to get the lease, we could in principle allow the transition out of the joint config, and would end up with a learner that is the leaseholder. That is not something we support today, and certainly there will be some edge cases to work out, and it will not be a performant state (since the Raft leader will never be on a learner, plus we might end up leaving Raft leaderless for a while, or worse, run into uncharted territory with etcd/raft).

One prudent option to fix the bug without getting into too much of this mess is to allow lease acquisition if the raft leader looks like a VOTER_OUTGOING or VOTER_DEMOTING. If we don't have the leader in the descriptor (under its replicaID) then we're definitely behind - don't try to get a lease; send to the leader which in fact is very likely to be a full voter. If we do have it, and it's in either of these states, get the lease ourselves (or try). We could, in principle, be far behind and so this regresses partially on #37906, but we're not likely to hit this as readily as the original issues, if at all.
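A rough sketch of that decision, from the point of view of a follower that has been asked to acquire the lease. The names (`descriptor`, `shouldProposeOwnLease`) are made up for illustration and the descriptor is reduced to a map from replica ID to type. Treating a VOTER_INCOMING leader as eligible is an assumption here; it relies on the separate relaxation mentioned earlier in this comment.

```go
package main

import "fmt"

// ReplicaType is an illustrative stand-in for the roachpb replica types.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
	VoterDemoting
	Learner
)

// descriptor is a hypothetical, minimal range descriptor: replica ID -> type.
type descriptor map[int]ReplicaType

// shouldProposeOwnLease decides whether a follower should try to get the
// lease itself instead of redirecting the request to the Raft leader.
func shouldProposeOwnLease(desc descriptor, leaderID int) bool {
	typ, known := desc[leaderID]
	if !known {
		// The leader is not in our descriptor, so we are behind. Redirect to
		// the leader, which is very likely a full voter.
		return false
	}
	switch typ {
	case VoterOutgoing, VoterDemoting:
		// The leader would refuse the lease; try to acquire it ourselves.
		return true
	default:
		// Assumption: VOTER_FULL (and, once relaxed, VOTER_INCOMING) leaders
		// can take the lease, so keep redirecting to them.
		return false
	}
}

func main() {
	desc := descriptor{1: VoterFull, 2: VoterDemoting, 3: VoterIncoming}
	for _, leader := range []int{1, 2, 3, 9} {
		fmt.Printf("leader=%d propose own lease: %t\n", leader, shouldProposeOwnLease(desc, leader))
	}
}
```

The unknown-leader branch is where this partially regresses on #37906: a follower that is far behind may hold a stale descriptor and make the wrong call until it catches up.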

Assigning to @andreimatei who will work on a fix.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 6, 2021
This reverts commit 8f98ade.

This is a revert of the main commit in cockroachdb#57789 (which was a backport of cockroachdb#55148).
It turns out that the commit in question introduced a deadlock: if
replicas that are not the leader refuse to take the lease (because they're
not the leader) and the leader is a VOTER_INCOMING, VOTER_DEMOTING or
VOTER_OUTGOING replica which also refuses to take the lease because it's
not a VOTER_FULL [0] => deadlock.
This deadlock was found in cockroachdb#57798.

The patch will return with some massaging.

This revert is original work on the 20.2 branch; I'm not reverting it on
master in the hope of just fixing the issue.

Touches cockroachdb#37906

Release note: This voids the previous release note reading "A bug
causing queries sent to a freshly-restarted node to sometimes hang for a
long time while the node catches up with replication has been fixed."
@pbardea pbardea removed branch-release-20.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 6, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 11, 2021
This patch backpedals a little bit on the logic introduced in cockroachdb#55148.
That patch said that, if a leader is known, every other replica refuses
to propose a lease acquisition. Instead, the replica in question
redirects whomever was triggering the lease acquisition to the leader,
thinking that the leader should take the lease.
That patch introduced a deadlock: some replicas refuse to take the lease
because they are not VOTER_FULL (see CheckCanReceiveLease()). To fix the
deadlock, this patch incorporates that check in the proposal buffer's
decision about whether or not to reject a proposal: if the leader is
believed to refuse to take the lease, then we again forward our own
lease request.

An edge case to the edge case is when the leader is not even part of the
proposer's range descriptor. This can happen if the proposer is far
behind. In this case, we assume that the leader is eligible. If it
isn't, the deadlock will resolve once the proposer catches up.

A future patch will relax the conditions under which a replica agrees to
take the lease. VOTER_INCOMING replicas should take the lease.
VOTER_DEMOTING are more controversial.

Fixes cockroachdb#57798

Release note: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 14, 2021
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 22, 2021
craig bot pushed a commit that referenced this issue Jan 25, 2021
58722: kvserver: don't refuse to fwd lease proposals in some edge cases r=andreimatei a=andreimatei

This patch backpedals a little bit on the logic introduced in #55148.
That patch said that, if a leader is known, every other replica refuses
to propose a lease acquisition. Instead, the replica in question
redirects whomever was triggering the lease acquisition to the leader,
thinking that the leader should take the lease.
That patch introduced a deadlock: some replicas refuse to take the lease
because they are not VOTER_FULL (see CheckCanReceiveLease()). To fix the
deadlock, this patch incorporates that check in the proposal buffer's
decision about whether or not to reject a proposal: if the leader is
believed to refuse to take the lease, then we again forward our own
lease request.

An edge case to the edge case is when the leader is not even part of the
proposer's range descriptor. This can happen if the proposer is far
behind. In this case, we assume that the leader is eligible. If it
isn't, the deadlock will resolve once the proposer catches up.

A future patch will relax the conditions under which a replica agrees to
take the lease. VOTER_INCOMING replicas should take the lease.
VOTER_DEMOTING are more controversial.

Fixes #57798

Release note: None

59087: util/log: new output format 'crdb-v2' r=itsbilal a=knz

Fixes  #50166. 

This new format intends to address all the known shortcomings with `crdb-v1` while remaining compatible with entry parsers designed for the previous version.
See the user-facing release note below for a summary of changes; and the included reference documentation for details.


Example TTY output with colors:
![image](https://user-images.githubusercontent.com/642886/104824568-261e9380-5853-11eb-9ad9-e5936f0890fd.png)


Example for a single-line unstructured entry.

Before:
```
I210116 22:17:03.736236 57 cli/start.go:681 ⋮ node startup completed:
```

After:
```
I210116 22:17:03.736236 57 cli/start.go:681 ⋮ [-] 12  node startup completed:
              tag field now always included   ^^^
          entry counter now always included       ^^^
```

Example for a multi-line unstructured entry.

Before:
```
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74  gossip status (ok, 1 node‹›)
gossip client (0/3 cur/max conns)
gossip connectivity
  n1 [sentinel];
```

(subsequent lines lack a log entry prefix; hard to determine where
entries start and end)

After:
```
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74  gossip status (ok, 1 node‹›)
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 +gossip client (0/3 cur/max conns)
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 +gossip connectivity
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 +  n1 [sentinel];
  ^^^ common prefix repeated for each msg line
       same entry counter for every subsequent line    ^^^
               continuation marker "+" on subsequent lines ^^
```

Example for a structured entry.

Before:
```
I210116 22:14:38.175829 469 util/log/event_log.go:32 ⋮ [n1] Structured entry: {...}
```

After:
```
I210116 22:14:38.175829 469 util/log/event_log.go:32 ⋮ [n1] 21 ={...}
                            entry counter always present    ^^
                      equal sign "=" denotes structured entry ^^
```

Release note (cli change): The default output format for `file-group`
and `stderr` sinks has been changed to `crdb-v2`.

This new format is non-ambiguous and makes it possible to reliably
parse log files. Refer to the format's documentation for
details. Additionally, it prevents single log lines from exceeding a
large size; this problem is inherent to the `crdb-v1` format and can
prevent `cockroach debug zip` from retrieving v1 log files.

The new format has also been designed so that existing log file
analyzers for the `crdb-v1` format can read entries written in the new
format. However, this conversion may be imperfect. Again, refer to
the new format's documentation for details.

In case of incompatibility, users can force the previous format by
using `format: crdb-v1` in their logging configuration.
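
As an illustration of why the marker column helps parsers, here is a small sketch that classifies a line using only the layout visible in the examples above (a space for the first line of an entry, `+` for a continuation line, `=` for a structured entry). The regular expression and the names `entryRE` and `parseLine` are assumptions made for this sketch; the format's reference documentation is authoritative and may define more cases.

```go
package main

import (
	"fmt"
	"regexp"
)

// entryRE matches only the layout shown in the examples above:
// severity+date, time, goroutine ID, file:line, "⋮", [tags], counter, marker.
// It is an assumption for illustration, not the official grammar.
var entryRE = regexp.MustCompile(
	`^([A-Z])(\d{6} \d{2}:\d{2}:\d{2}\.\d{6}) (\d+) (\S+) ⋮ \[([^\]]*)\] (\d+) ([ +=])(.*)$`)

type lineKind int

const (
	firstLine    lineKind = iota // marker " ": first line of an unstructured entry
	continuation                 // marker "+": continuation of the previous line
	structured                   // marker "=": structured payload
)

// parsedLine holds the fields this sketch extracts from one log line.
type parsedLine struct {
	Counter string
	Kind    lineKind
	Message string
}

// parseLine returns false when the line does not match the assumed layout,
// for example a crdb-v1 line without tags or an entry counter.
func parseLine(line string) (parsedLine, bool) {
	m := entryRE.FindStringSubmatch(line)
	if m == nil {
		return parsedLine{}, false
	}
	kind := firstLine
	switch m[7] {
	case "+":
		kind = continuation
	case "=":
		kind = structured
	}
	return parsedLine{Counter: m[6], Kind: kind, Message: m[8]}, true
}

func main() {
	lines := []string{
		"I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74  gossip status (ok, 1 node‹›)",
		"I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 +gossip client (0/3 cur/max conns)",
		"I210116 22:14:38.175829 469 util/log/event_log.go:32 ⋮ [n1] 21 ={...}",
	}
	for _, l := range lines {
		p, ok := parseLine(l)
		fmt.Printf("ok=%t counter=%s kind=%d msg=%q\n", ok, p.Counter, p.Kind, p.Message)
	}
}
```

Lines that share an entry counter and carry a `+` marker can then be stitched back into one logical entry.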

59141: ui: upgrade admin-ui-components to new dep r=dhartunian a=dhartunian

We renamed the `admin-ui-components` package
to `cluster-ui`.

Release note: None

59388: build,bazel: remove references to `gofmt` in bazel build r=rickystewart a=rickystewart

This was cargo-culted from the `Makefile`, but isn't necessary to get
the build to succeed, and interferes with hermiticity because it
requires `gofmt` to be globally installed. It's simpler to just remove
these references entirely.

Release note: None

Co-authored-by: Andrei Matei
Co-authored-by: Raphael 'kena' Poss
Co-authored-by: David Hartunian
Co-authored-by: Ricky Stewart
@craig craig bot closed this as completed in #58722 Jan 25, 2021
@craig craig bot closed this as completed in a767cdd Jan 25, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this issue Feb 3, 2021
Fixes cockroachdb#57342. This looks to have been the same thing as cockroachdb#57798, and was
fixed by cockroachdb#58722.

Release note: None
craig bot pushed a commit that referenced this issue Feb 3, 2021
59762: roachtest: unskip acceptance/bank/cluster-recovery r=irfansharif a=irfansharif

Fixes #57342. This looks to have been the same thing as #57798, and was
fixed by #58722.

Release note: None

Co-authored-by: irfan sharif
@ajwerner
Contributor

@andreimatei
Contributor

This test doesn't seem to have failed in a while. And I did 20 passing runs. Closing...

andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 22, 2021