storage: more quota pool deadlocks #17826

Closed
benesch opened this issue Aug 22, 2017 · 11 comments

@benesch
Contributor

benesch commented Aug 22, 2017

I'm able to fairly reliably reproduce a deadlock in the quota pool while performing a ten-node, 2TB restore. Looks very much like #17524, except that #17796 doesn't seem to fix the problem. On the node coordinating the restore, we see a hung goroutine:

goroutine 3312 [select, 22 minutes]:
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc4204c0000, 0x7fb93f5f7438, 0xc4242309c0, 0xc4204c0048, 0x3f1, 0xc426483f80, 0x4, 0x4, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1153 +0x1417
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc4204c0000, 0x7fb93f5f7438, 0xc4242309c0, 0x3f1, 0xc426483f80, 0x4, 0x4, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:391 +0x2db
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:455 +0x17b
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:939 +0x447
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:804 +0xb46
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:606 +0x344
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc420630000, 0x7fb93f6504d0, 0xc45309e780, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:435 +0x1f1
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).send(0xc420415b40, 0x7fb93f5f7438, 0xc421e2bbc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:554 +0x1ff
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).(github.com/cockroachdb/cockroach/pkg/internal/client.send)-fm(0x7fb93f5f7438, 0xc421e2bbc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:491 +0x83
github.com/cockroachdb/cockroach/pkg/internal/client.sendAndFill(0x7fb93f5f7438, 0xc421e2bbc0, 0xc422d217d0, 0xc428eb7800, 0xc424230300, 0xc428ed8320)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:463 +0x103
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Run(0xc420415b40, 0x7fb93f5f7438, 0xc421e2bbc0, 0xc428eb7800, 0xc428ebc900, 0xc422d219e8)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:491 +0x9d
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).AdminSplit(0xc420415b40, 0x7fb93f5f7438, 0xc421e2bbc0, 0x1bbbca0, 0xc428ebc8e0, 0x1bbbca0, 0xc428ebc900, 0x2, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:383 +0x98
github.com/cockroachdb/cockroach/pkg/ccl/sqlccl.restore.func2(0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlccl/restore.go:735 +0x56b
github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1(0xc421e2bc00, 0xc421e31730)
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:58 +0x57
created by github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:66 +0x66

On the node processing that split, there are a whole bunch of hung goroutines, plus a whole lot more stuck in quotaPool.acquire: https://gist.github.com/benesch/c747301192eb984686fb60fccf24c506

I'm running with a relatively recent version of #17796, though not exactly the version that merged. https://github.com/benesch/cockroach/tree/quota-pool-deadlock

@benesch
Contributor Author

benesch commented Aug 22, 2017

/cc @nvanbenschoten @tschottdorf

@nvanbenschoten
Member

Unfortunately, this doesn't look like the same issue that #17796 was addressing. There are certainly goroutines hung in quotaPool.acquire, but here they're properly waiting to acquire quota. Did the "have been waiting %s attempting to acquire quota" log fire?

/cc @irfansharif

@benesch
Contributor Author

benesch commented Aug 22, 2017

Yeah, there were a couple lines like "have been waiting 15m0s attempting to acquire quota" in the logs. Didn't your timer PR fix things so that the selects are periodically interrupted by the slowTimer?

I dumped the goroutines several times; occasionally, you'd see one listed as goroutine 37312 [select, 1 minute], but by the next refresh it would read goroutine 37312 [select].
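
For reference, here is a minimal sketch of the pattern being discussed, using assumed names and a standard-library timer rather than the real quotaPool code: a select that is periodically interrupted by a slow-request timer. Every time the timer fires the select is re-entered, which resets the age Go reports in goroutine dumps, so "[select, 22 minutes]" becomes a plain "[select]".

package main

import (
	"context"
	"log"
	"time"
)

// waitWithSlowLog waits for quota to become available, periodically waking up
// to log a warning about the slow acquisition. The names and threshold here
// are illustrative, not the actual CockroachDB implementation.
func waitWithSlowLog(ctx context.Context, acquired <-chan struct{}) error {
	const slowRequestThreshold = time.Minute
	start := time.Now()
	timer := time.NewTimer(slowRequestThreshold)
	defer timer.Stop()
	for {
		select {
		case <-acquired:
			return nil
		case <-timer.C:
			// Re-entering the select here is what resets the "[select, x minutes]"
			// age in goroutine dumps.
			log.Printf("have been waiting %s attempting to acquire quota", time.Since(start))
			timer.Reset(slowRequestThreshold)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}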

@tbg
Member

tbg commented Aug 22, 2017

Any down nodes? Any slow nodes?

@nvanbenschoten there's a good argument here against the periodic logging we just introduced -- it'll break the nice property that we previously had, which is that we get the [select, x minutes] output. I think that alone is enough reason to revert that.

@benesch
Contributor Author

benesch commented Aug 22, 2017

Nope, all nodes were up. What do you mean by slow nodes? The deadlocked node was pretty unresponsive by the end but wasn't slow before that.

@benesch
Contributor Author

benesch commented Aug 22, 2017

@irfansharif has kindly volunteered to look into this further. Logs and goroutine stacks from a node exhibiting this deadlock captured here: https://gist.github.com/benesch/c747301192eb984686fb60fccf24c506.

@irfansharif, to repro, run

CLOUDSDK_CORE_PROJECT=cockroach-shared COCKROACH_EXTRA_FLAGS='-tf.keep-cluster=failed' build/teamcity-nightly-acceptance.sh BenchmarkRestore2TB

on my quota-pool-deadlock branch. This'll create 10 VMs on GCE, so best not to leave the cluster lying around too long. To force-delete all locally-created terraform clusters:

cd pkg/acceptance/terraform/gce && for tfs in *.tfstate; do terraform destroy -force -state $tfs; done

irfansharif added this to the 1.1 milestone Aug 23, 2017
@irfansharif
Contributor

First pass at this, running TestBackupRestoreAddSSTable: I observed that we acquire quota equal to the size of the AddSSTable command (see here), given that it goes through Raft. This is definitely not something we want; the maximum quota for a given range is ~1MB, and from what little I know these AddSSTable commands are presumably much larger.

We do, however, cap the acquired quota at a pre-specified maximum value (see here), so other than being horridly slow this shouldn't have caused a problem. I'll try with larger commands that cap out quota acquisitions.
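
A minimal sketch of the capping behavior described above, with assumed names (defaultProposalQuota and quotaToAcquire are illustrative, not the actual Replica code):

package main

// defaultProposalQuota stands in for the ~1MB per-range maximum mentioned above.
const defaultProposalQuota = 1 << 20

// quotaToAcquire clamps the requested quota to the pool's maximum so that a
// very large command (e.g. an AddSSTable proposal) never waits forever for
// more quota than the pool can ever hold.
func quotaToAcquire(commandSize int64) int64 {
	if commandSize > defaultProposalQuota {
		return defaultProposalQuota
	}
	return commandSize
}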

@tbg
Member

tbg commented Aug 23, 2017

btw, @nvanbenschoten and I were wondering about something else recently - if a node goes down, will we (pretty much) immediately lock down the remaining replicas as well since they'll run out of quota? Presumably that would be the case until we transfer the replica off the dead node, but that may take a very long time. Am I forgetting about a mechanism that alleviates this?

@tbg
Member

tbg commented Aug 23, 2017

I have a faded memory that we skip quota accounting for a node when the RPC connection to it isn't healthy. Mind pointing me at it, if that's real?

@irfansharif
Contributor

you're looking for this:

for _, rep := range r.mu.state.Desc.Replicas {

Assuming we start at steady state with three nodes: when a node goes down, the remaining two will make forward progress, writing out up to defaultProposalQuota bytes, after which subsequent writes are blocked. Eventually we schedule the processReady goroutine and, on reaching updateProposalQuotaRaftMuLocked, we determine (by looking at RPC connection health) that a node is down. Following this we "re-add" the delta between the downed node and the second node back to the quota pool.
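
A rough sketch of that mechanism under assumed names (the real logic lives in updateProposalQuotaRaftMuLocked; everything below is illustrative): the index that gates quota release is computed only over followers whose RPC connections are healthy, so a downed node stops holding quota back.

package main

import "math"

// quotaToRelease returns how many bytes of quota can be handed back to the
// pool, ignoring replicas whose RPC connections are unhealthy.
func quotaToRelease(appliedIndex map[int]uint64, healthy map[int]bool, lastReleasedIndex uint64, bytesPerIndex int64) int64 {
	minIndex := uint64(math.MaxUint64)
	for replicaID, idx := range appliedIndex {
		if !healthy[replicaID] {
			// A downed replica no longer gates quota release; the delta it was
			// holding back is effectively "re-added" to the pool.
			continue
		}
		if idx < minIndex {
			minIndex = idx
		}
	}
	if minIndex == math.MaxUint64 || minIndex <= lastReleasedIndex {
		return 0
	}
	return int64(minIndex-lastReleasedIndex) * bytesPerIndex
}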

@benesch
Contributor Author

benesch commented Aug 23, 2017

Spitballing here, but this TODO seems ominous:

// TODO(irfansharif): There is still a potential problem here that leaves
// clients hanging if the replica gets destroyed but this code path is
// never taken. Moving quota pool draining to every point where a
// replica can get destroyed is an option, alternatively we can clear
// our leader status and close the proposalQuota whenever the replica is
// destroyed.
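
A toy sketch of the alternative that TODO mentions, closing the proposalQuota when the replica is destroyed, with invented names (this is not the CockroachDB implementation): closing the pool unblocks every waiter instead of leaving clients hanging.

package main

import (
	"context"
	"errors"
)

var errPoolClosed = errors.New("quota pool closed")

type quotaPool struct {
	quota chan struct{} // one token per unit of quota
	done  chan struct{} // closed when the replica is destroyed
}

func newQuotaPool(capacity int) *quotaPool {
	qp := &quotaPool{quota: make(chan struct{}, capacity), done: make(chan struct{})}
	for i := 0; i < capacity; i++ {
		qp.quota <- struct{}{}
	}
	return qp
}

// acquire blocks until quota is available, the pool is closed, or the context
// is canceled.
func (qp *quotaPool) acquire(ctx context.Context) error {
	select {
	case <-qp.quota:
		return nil
	case <-qp.done:
		return errPoolClosed
	case <-ctx.Done():
		return ctx.Err()
	}
}

// close is called when the replica is destroyed; it wakes up all waiters.
func (qp *quotaPool) close() { close(qp.done) }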

irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 23, 2017
This was an oversight introduced in cockroachdb#15802: `undoQuotaAcquisition` was intended to be used for all proposal errors, but we seem to have skipped the very first one. It appears to be a relic of the earlier structure in which, in the event of a proposal error, the acquired quota was released within `Replica.propose` itself rather than left to the caller.

Fixes (hopefully) cockroachdb#17826.
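
A minimal sketch of the invariant that commit restores, using golang.org/x/sync/semaphore as a stand-in for the real quota pool (all names below are assumptions, not the actual Replica code): every error path after the acquisition, including the very first proposal error, must release the quota.

package main

import (
	"context"

	"golang.org/x/sync/semaphore"
)

const maxProposalQuota = 1 << 20 // stand-in for the per-range quota cap

var proposalQuota = semaphore.NewWeighted(maxProposalQuota)

// propose acquires quota for a command, submits it, and releases the quota on
// every error path. The oversight described above skipped the release on the
// first proposal error, leaking quota until later acquirers stalled forever.
func propose(ctx context.Context, cmdSize int64, submit func() error) error {
	if err := proposalQuota.Acquire(ctx, cmdSize); err != nil {
		return err
	}
	if err := submit(); err != nil {
		proposalQuota.Release(cmdSize)
		return err
	}
	// On success the quota stays held; in the real system it is released once
	// the proposal has been applied (not shown here).
	return nil
}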