sqlccl: rate limit Export and Import requests sent #14820

Merged 1 commit into cockroachdb:master on Apr 13, 2017

Conversation

@danhhz (Contributor) commented Apr 11, 2017

We're already limiting these on the server side, but the BACKUP/RESTORE
gateway would fill up its distsender/grpc/something and cause all sorts
of badness (node liveness timeouts leading to mass leaseholder
transfers, poor performance on SQL workloads, etc) as well as log spam
about slow distsender requests.

There is likely some better fix post-1.0; this is being tracked in #14798.

For #14792.
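
For context on the mechanism: the client-side limiting this PR adds amounts to a counting semaphore around the Export/Import requests the gateway sends. The sketch below shows that pattern in isolation, assuming hypothetical names (`span`, `sendExportRequest`, `exportAllSpans`, and the `sqlcclsketch` package) rather than the PR's actual code.

```go
package sqlcclsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// span is a placeholder for the key-span type BACKUP actually iterates over.
type span struct{ startKey, endKey []byte }

// exportAllSpans is a sketch of the gateway-side limiting described above,
// not the PR's actual code: a buffered channel acts as a counting semaphore
// so at most maxConcurrentExports Export requests are in flight at once.
// sendExportRequest is a hypothetical stand-in for issuing one request.
func exportAllSpans(
	ctx context.Context,
	spans []span,
	maxConcurrentExports int,
	sendExportRequest func(context.Context, span) error,
) error {
	sem := make(chan struct{}, maxConcurrentExports)
	g, gCtx := errgroup.WithContext(ctx)
	for _, sp := range spans {
		sp := sp // capture the loop variable for the goroutine below
		g.Go(func() error {
			select {
			case sem <- struct{}{}: // acquire a slot
			case <-gCtx.Done():
				return gCtx.Err()
			}
			defer func() { <-sem }() // release the slot when the request returns
			return sendExportRequest(gCtx, sp)
		})
	}
	return g.Wait()
}
```

Sizing that semaphore is what the rest of the thread debates: too high and the gateway floods its DistSender/gRPC layers, too low and the cluster's Export capacity sits idle.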

@danhhz requested review from maddyblue and benesch on April 11, 2017

@maddyblue (Contributor) commented:

Does it make sense to DRY this stuff up at all? Lots of duplicated code. Maybe it's more annoying to do that than just duplicate it.


Review status: 0 of 3 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

			}
		}
		maxConcurrentExports = maxReplicas * storageccl.ParallelRequestsLimit

Can you add a comment describing why this was chosen?


@benesch (Contributor) commented Apr 12, 2017

I think a func maxReplicas(ranges []rangeType) int function would be worth extracting. I'm neutral on the semaphore duplication.
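
For illustration, the suggested extraction might look roughly like the following; `rangeType` and its `Replicas` field are hypothetical stand-ins for the real range-descriptor type, and this reads `maxReplicas` as the largest replica count across the given ranges.

```go
package sqlcclsketch

// rangeType is a hypothetical stand-in for the range descriptors the
// BACKUP code already iterates over; only the replica slice matters here.
type rangeType struct {
	Replicas []struct{ NodeID, StoreID int }
}

// maxReplicas returns the largest replica count seen across ranges.
func maxReplicas(ranges []rangeType) int {
	max := 0
	for _, r := range ranges {
		if n := len(r.Replicas); n > max {
			max = n
		}
	}
	return max
}
```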

How effective was this at restoring cluster traffic?


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

Can you add a comment describing why this was chosen?

👍


@danhhz (Contributor, Author) commented Apr 12, 2017

While mulling this over last night, I realized we really wanted the number of nodes, which we can get very quickly from gossip, so I changed that. RFAL

The new version is a little more DRY. I'm inclined against extracting the semaphore, though; I find it more obvious what's going on when it's inlined.

> How effective was this at restoring cluster traffic?

RESTORE is still choppy, but this change seems to cut down on the log spam and (more importantly) on interference with node liveness.


Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

👍

Done.


@maddyblue (Contributor) commented:

Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Done.

I think this logic here will severely limit the speed at which restores and backups occur.

In the ideal scenario there are nodeCount * parallelReqLimit requests out, evenly distributed among the nodes. Even in that case, though, the nodes will be waiting between the time when a request finishes and the next one is issued. I think it would be better if, in that case, there were a few requests already queued up and waiting to proceed, so that the nodes won't have to wait for additional requests before filling up their work queues.

However, I'm doubtful that will ever happen, because the order in which the requests are sent out will only rarely be perfectly distributed that way. Instead we will have some multiple of 5 requests at a single node while other nodes sit idle. In the worst case, a single node would have all of the outstanding requests and every other node would be doing nothing, because their range leases come later in the spans.

Did I miss something? Is this a known limitation? Are we ok with it because it's still better than what we have today? I think it'll be really hard to hit our perf goals with this rate limiter.


@danhhz (Contributor, Author) commented Apr 12, 2017

Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

tl;dr yes, it's better than what we have today. This could keep track of outstanding requests per StoreID/NodeID (I considered doing that), but this is considerably simpler. I don't think this issue is resolved (hence #14798), but we need to get something stable ASAP so we can have a couple weeks of testing before shipping 1.0.

> I think it'll be really hard to hit our perf goals with this rate limiter.

At this point, I'm way more concerned about stability (and that things finish without bringing down the cluster) than performance. The worst case you mention could happen, but anecdotally BACKUP is still going pretty fast on lapis. I'd be interested to see how long the 2TB test takes with this change once it merges.
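
For illustration of the per-StoreID/NodeID bookkeeping mentioned above (the alternative that was considered but not implemented), a rough sketch with every name hypothetical:

```go
package sqlcclsketch

import "sync"

// nodeLimiter sketches the considered-but-not-implemented alternative:
// cap outstanding requests per target node instead of using one
// cluster-wide semaphore. All names here are hypothetical.
type nodeLimiter struct {
	mu         sync.Mutex
	cond       *sync.Cond
	perNode    map[int]int // NodeID -> outstanding request count
	maxPerNode int         // e.g. the server-side per-node limit
}

func newNodeLimiter(maxPerNode int) *nodeLimiter {
	l := &nodeLimiter{perNode: make(map[int]int), maxPerNode: maxPerNode}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// acquire blocks until nodeID has a free slot.
func (l *nodeLimiter) acquire(nodeID int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.perNode[nodeID] >= l.maxPerNode {
		l.cond.Wait()
	}
	l.perNode[nodeID]++
}

// release frees a slot on nodeID and wakes any waiters.
func (l *nodeLimiter) release(nodeID int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.perNode[nodeID]--
	l.cond.Broadcast()
}
```

Relative to the single cluster-wide semaphore, this would keep the worst case described above (all outstanding requests piling onto one node) from starving the rest, at the cost of extra bookkeeping per request.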


@danhhz (Contributor, Author) commented Apr 12, 2017

RFAL


Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions.


pkg/ccl/sqlccl/restore.go, line 704 at r2 (raw file):

	//
	// TODO(dan): See if there's some better solution than rate-limiting #14798.
	maxConcurrentImports := clusterNodeCount(p.ExecCfg().Gossip)

The old version wasn't working on lapis. I made it even more conservative for now. I'm really interested in seeing if we can get the 2TB backup to finish at all.
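
As a usage sketch of the more conservative r2 sizing: the Import semaphore holds roughly one in-flight request per node, versus the earlier replicas-times-per-node-limit cap on the Export side. Here `nodeCount` stands in for `clusterNodeCount(p.ExecCfg().Gossip)`; `importSpan`, `sendImportRequest`, `runImports`, and the `restoresketch` package are all hypothetical.

```go
package restoresketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// importSpan is a placeholder for whatever unit RESTORE imports per request.
type importSpan struct{ startKey, endKey []byte }

// runImports sketches the r2 approach: size the semaphore by cluster node
// count so roughly one Import request is outstanding per node. This is a
// sketch of the idea, not the PR's actual code.
func runImports(
	ctx context.Context,
	spans []importSpan,
	nodeCount int,
	sendImportRequest func(context.Context, importSpan) error,
) error {
	maxConcurrentImports := nodeCount // stands in for clusterNodeCount(...)
	sem := make(chan struct{}, maxConcurrentImports)
	g, gCtx := errgroup.WithContext(ctx)
	for _, sp := range spans {
		sp := sp // capture the loop variable for the goroutine below
		g.Go(func() error {
			select {
			case sem <- struct{}{}: // acquire a slot
			case <-gCtx.Done():
				return gCtx.Err()
			}
			defer func() { <-sem }() // release the slot
			return sendImportRequest(gCtx, sp)
		})
	}
	return g.Wait()
}
```

If the server-side per-node limit is the 5 that the earlier comment alludes to, a single RESTORE would at best occupy about one of those slots per node, which matches the stability-over-throughput trade-off described above.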


@danhhz (Contributor, Author) commented Apr 13, 2017

TFTR!

@danhhz merged commit 8d5587f into cockroachdb:master on Apr 13, 2017
@danhhz deleted the limit branch on April 2, 2018