storage: more quota pool deadlocks #17826

Closed
benesch opened this issue Aug 22, 2017 · 11 comments

@benesch
Contributor

benesch commented Aug 22, 2017

I'm able to fairly reliably reproduce a deadlock in the quota pool while performing a ten-node, 2TB restore. Looks very much like #17524, except that #17796 doesn't seem to fix the problem. On the node coordinating the restore, we see a hung goroutine:

goroutine 3312 [select, 22 minutes]:
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc4204c0000, 0x7fb93f5f7438, 0xc4242309c0, 0xc4204c0048, 0x3f1, 0xc426483f80, 0x4, 0x4, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1153 +0x1417
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc4204c0000, 0x7fb93f5f7438, 0xc4242309c0, 0x3f1, 0xc426483f80, 0x4, 0x4, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:391 +0x2db
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:455 +0x17b
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:939 +0x447
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:804 +0xb46
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc4204c0000, 0x7fb93f6504d0, 0xc45309e7b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:606 +0x344
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc420630000, 0x7fb93f6504d0, 0xc45309e780, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:435 +0x1f1
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).send(0xc420415b40, 0x7fb93f5f7438, 0xc421e2bbc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:554 +0x1ff
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).(github.com/cockroachdb/cockroach/pkg/internal/client.send)-fm(0x7fb93f5f7438, 0xc421e2bbc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:491 +0x83
github.com/cockroachdb/cockroach/pkg/internal/client.sendAndFill(0x7fb93f5f7438, 0xc421e2bbc0, 0xc422d217d0, 0xc428eb7800, 0xc424230300, 0xc428ed8320)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:463 +0x103
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Run(0xc420415b40, 0x7fb93f5f7438, 0xc421e2bbc0, 0xc428eb7800, 0xc428ebc900, 0xc422d219e8)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:491 +0x9d
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).AdminSplit(0xc420415b40, 0x7fb93f5f7438, 0xc421e2bbc0, 0x1bbbca0, 0xc428ebc8e0, 0x1bbbca0, 0xc428ebc900, 0x2, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:383 +0x98
github.com/cockroachdb/cockroach/pkg/ccl/sqlccl.restore.func2(0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlccl/restore.go:735 +0x56b
github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1(0xc421e2bc00, 0xc421e31730)
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:58 +0x57
created by github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:66 +0x66

On the node processing that split, there are a whole bunch of hung goroutines, plus a whole lot more stuck in quotaPool.acquire: https://gist.github.com/benesch/c747301192eb984686fb60fccf24c506

I'm running with a relatively recent version of #17796, though not exactly the version that merged. https://github.com/benesch/cockroach/tree/quota-pool-deadlock

@benesch
Contributor Author

benesch commented Aug 22, 2017

/cc @nvanbenschoten @tschottdorf

@nvanbenschoten
Member

Unfortunately, this doesn't look like the same issue that #17796 was addressing. There are certainly goroutines hung in quotaPool.acquire, but here they're properly waiting to acquire quota. Did the "have been waiting %s attempting to acquire quota" log fire?

/cc @irfansharif

@benesch
Contributor Author

benesch commented Aug 22, 2017

Yeah, there were a couple lines like "have been waiting 15m0s attempting to acquire quota" in the logs. Didn't your timer PR fix things so that the selects are periodically interrupted by the slowTimer?

I dumped the goroutines several times; occasionally, you'd see one listed as goroutine 37312 [select, 1 minute], but by the next refresh it would read goroutine 37312 [select].
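
For reference, here is a minimal sketch of the pattern being discussed, using assumed names and a standard-library timer rather than the real quotaPool code: a select that is periodically interrupted by a slow-request timer. Every time the timer fires the select is re-entered, which resets the age Go reports in goroutine dumps, so "[select, 22 minutes]" becomes a plain "[select]".

package main

import (
	"context"
	"log"
	"time"
)

// waitWithSlowLog waits for quota to become available, periodically waking up
// to log a warning about the slow acquisition. The names and threshold here
// are illustrative, not the actual CockroachDB implementation.
func waitWithSlowLog(ctx context.Context, acquired <-chan struct{}) error {
	const slowRequestThreshold = time.Minute
	start := time.Now()
	timer := time.NewTimer(slowRequestThreshold)
	defer timer.Stop()
	for {
		select {
		case <-acquired:
			return nil
		case <-timer.C:
			// Re-entering the select here is what resets the "[select, x minutes]"
			// age in goroutine dumps.
			log.Printf("have been waiting %s attempting to acquire quota", time.Since(start))
			timer.Reset(slowRequestThreshold)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}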

@tbg
Member

tbg commented Aug 22, 2017

Any down nodes? Any slow nodes?

@nvanbenschoten there's a good argument here against the periodic logging we just introduced -- it'll break the nice property that we previously had, which is that we get the [select, x minutes] output. I think that alone is enough reason to revert that.

@benesch
Contributor Author

benesch commented Aug 22, 2017

Nope, all nodes were up. What do you mean by slow nodes? The deadlocked node was pretty unresponsive by the end but wasn't slow before that.

@benesch
Contributor Author

benesch commented Aug 22, 2017

@irfansharif has kindly volunteered to look into this further. Logs and goroutine stacks from a node exhibiting this deadlock captured here: https://gist.github.com/benesch/c747301192eb984686fb60fccf24c506.

@irfansharif, to repro, run

CLOUDSDK_CORE_PROJECT=cockroach-shared COCKROACH_EXTRA_FLAGS='-tf.keep-cluster=failed' build/teamcity-nightly-acceptance.sh BenchmarkRestore2TB

on my quota-pool-deadlock branch. This'll create 10 VMs on GCE, so best not to leave the cluster lying around too long. To force-delete all locally-created terraform clusters:

cd pkg/acceptance/terraform/gce && for tfs in *.tfstate; do terraform destroy -force -state $tfs; done

irfansharif added this to the 1.1 milestone Aug 23, 2017
@irfansharif
Contributor

First pass at this, running TestBackupRestoreAddSSTable: I observed that we acquire quota equal to the size of the AddSSTable command (see here), given that it goes through Raft. This is definitely not something we want; the maximum quota for a given range is ~1MB, and from what little I know these AddSSTable commands are presumably much larger.

We do, however, cap the acquired quota at a pre-specified maximum value (see here), so other than being horridly slow this shouldn't have caused a problem. I'll try with larger commands that cap out quota acquisitions.
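
A minimal sketch of the capping behavior described above, with assumed names (defaultProposalQuota and quotaToAcquire are illustrative, not the actual Replica code):

package main

// defaultProposalQuota stands in for the ~1MB per-range maximum mentioned above.
const defaultProposalQuota = 1 << 20

// quotaToAcquire clamps the requested quota to the pool's maximum so that a
// very large command (e.g. an AddSSTable proposal) never waits forever for
// more quota than the pool can ever hold.
func quotaToAcquire(commandSize int64) int64 {
	if commandSize > defaultProposalQuota {
		return defaultProposalQuota
	}
	return commandSize
}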

@tbg
Member

tbg commented Aug 23, 2017

btw, @nvanbenschoten and I were wondering about something else recently - if a node goes down, will we (pretty much) immediately lock down the remaining replicas as well since they'll run out of quota? Presumably that would be the case until we transfer the replica off the dead node, but that may take a very long time. Am I forgetting about a mechanism that alleviates this?

@tbg
Member

tbg commented Aug 23, 2017

I have a faded memory that we skip quota accounting for a node when the RPC connection to it isn't healthy. Mind pointing me at it, if that's real?

@irfansharif
Contributor

you're looking for this:

for _, rep := range r.mu.state.Desc.Replicas {

Assuming we start at steady state with three nodes: when a node goes down, the remaining two will make forward progress, writing out up to defaultProposalQuota bytes, after which subsequent writes are blocked. Eventually we schedule the processReady goroutine and, on reaching updateProposalQuotaRaftMuLocked, we determine (by looking at RPC connection health) that a node is down. Following this we "re-add" the delta between the downed node and the second node back to the quota pool.
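
A rough sketch of that mechanism under assumed names (the real logic lives in updateProposalQuotaRaftMuLocked; everything below is illustrative): the index that gates quota release is computed only over followers whose RPC connections are healthy, so a downed node stops holding quota back.

package main

import "math"

// quotaToRelease returns how many bytes of quota can be handed back to the
// pool, ignoring replicas whose RPC connections are unhealthy.
func quotaToRelease(appliedIndex map[int]uint64, healthy map[int]bool, lastReleasedIndex uint64, bytesPerIndex int64) int64 {
	minIndex := uint64(math.MaxUint64)
	for replicaID, idx := range appliedIndex {
		if !healthy[replicaID] {
			// A downed replica no longer gates quota release; the delta it was
			// holding back is effectively "re-added" to the pool.
			continue
		}
		if idx < minIndex {
			minIndex = idx
		}
	}
	if minIndex == math.MaxUint64 || minIndex <= lastReleasedIndex {
		return 0
	}
	return int64(minIndex-lastReleasedIndex) * bytesPerIndex
}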

@benesch
Contributor Author

benesch commented Aug 23, 2017

Spitballing here, but this TODO seems ominous:

// TODO(irfansharif): There is still a potential problem here that leaves
// clients hanging if the replica gets destroyed but this code path is
// never taken. Moving quota pool draining to every point where a
// replica can get destroyed is an option, alternatively we can clear
// our leader status and close the proposalQuota whenever the replica is
// destroyed.
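
A toy sketch of the alternative that TODO mentions, closing the proposalQuota when the replica is destroyed, with invented names (this is not the CockroachDB implementation): closing the pool unblocks every waiter instead of leaving clients hanging.

package main

import (
	"context"
	"errors"
)

var errPoolClosed = errors.New("quota pool closed")

type quotaPool struct {
	quota chan struct{} // one token per unit of quota
	done  chan struct{} // closed when the replica is destroyed
}

func newQuotaPool(capacity int) *quotaPool {
	qp := &quotaPool{quota: make(chan struct{}, capacity), done: make(chan struct{})}
	for i := 0; i < capacity; i++ {
		qp.quota <- struct{}{}
	}
	return qp
}

// acquire blocks until quota is available, the pool is closed, or the context
// is canceled.
func (qp *quotaPool) acquire(ctx context.Context) error {
	select {
	case <-qp.quota:
		return nil
	case <-qp.done:
		return errPoolClosed
	case <-ctx.Done():
		return ctx.Err()
	}
}

// close is called when the replica is destroyed; it wakes up all waiters.
func (qp *quotaPool) close() { close(qp.done) }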

irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 23, 2017
This was an oversight introduced in cockroachdb#15802: `undoQuotaAcquisition` was intended to be used for all proposal errors, but we seem to have skipped the very first one. It appears to be a relic of the earlier structure in which, in the event of a proposal error, the acquired quota was released within `Replica.propose` itself rather than left to the caller.

Fixes (hopefully) cockroachdb#17826.
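
A minimal sketch of the invariant that commit restores, using golang.org/x/sync/semaphore as a stand-in for the real quota pool (all names below are assumptions, not the actual Replica code): every error path after the acquisition, including the very first proposal error, must release the quota.

package main

import (
	"context"

	"golang.org/x/sync/semaphore"
)

const maxProposalQuota = 1 << 20 // stand-in for the per-range quota cap

var proposalQuota = semaphore.NewWeighted(maxProposalQuota)

// propose acquires quota for a command, submits it, and releases the quota on
// every error path. The oversight described above skipped the release on the
// first proposal error, leaking quota until later acquirers stalled forever.
func propose(ctx context.Context, cmdSize int64, submit func() error) error {
	if err := proposalQuota.Acquire(ctx, cmdSize); err != nil {
		return err
	}
	if err := submit(); err != nil {
		proposalQuota.Release(cmdSize)
		return err
	}
	// On success the quota stays held; in the real system it is released once
	// the proposal has been applied (not shown here).
	return nil
}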