sqlccl: rate limit Export and Import requests sent #14820

Merged 1 commit into cockroachdb:master on Apr 13, 2017

Conversation

@danhhz (Contributor) commented Apr 11, 2017

We're already limiting these on the server side, but the BACKUP/RESTORE
gateway would fill up its distsender/grpc/something and cause all sorts
of badness (node liveness timeouts leading to mass leaseholder
transfers, poor performance on SQL workloads, etc) as well as log spam
about slow distsender requests.

There is likely some better fix post-1.0; this is being tracked in #14798.

For #14792.
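
For context on the mechanism: the client-side limiting this PR adds amounts to a counting semaphore around the Export/Import requests the gateway sends. The sketch below shows that pattern in isolation, assuming hypothetical names (`span`, `sendExportRequest`, `exportAllSpans`, and the `sqlcclsketch` package) rather than the PR's actual code.

```go
package sqlcclsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// span is a placeholder for the key-span type BACKUP actually iterates over.
type span struct{ startKey, endKey []byte }

// exportAllSpans is a sketch of the gateway-side limiting described above,
// not the PR's actual code: a buffered channel acts as a counting semaphore
// so at most maxConcurrentExports Export requests are in flight at once.
// sendExportRequest is a hypothetical stand-in for issuing one request.
func exportAllSpans(
	ctx context.Context,
	spans []span,
	maxConcurrentExports int,
	sendExportRequest func(context.Context, span) error,
) error {
	sem := make(chan struct{}, maxConcurrentExports)
	g, gCtx := errgroup.WithContext(ctx)
	for _, sp := range spans {
		sp := sp // capture the loop variable for the goroutine below
		g.Go(func() error {
			select {
			case sem <- struct{}{}: // acquire a slot
			case <-gCtx.Done():
				return gCtx.Err()
			}
			defer func() { <-sem }() // release the slot when the request returns
			return sendExportRequest(gCtx, sp)
		})
	}
	return g.Wait()
}
```

Sizing that semaphore is what the rest of the thread debates: too high and the gateway floods its DistSender/gRPC layers, too low and the cluster's Export capacity sits idle.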

@danhhz requested review from maddyblue and benesch on April 11, 2017

@maddyblue (Contributor) commented:

Does it make sense to DRY this stuff up at all? Lots of duplicated code. Maybe it's more annoying to do that than just duplicate it.


Review status: 0 of 3 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

			}
		}
		maxConcurrentExports = maxReplicas * storageccl.ParallelRequestsLimit

Can you add a comment describing why this was chosen?


@benesch (Contributor) commented Apr 12, 2017

I think a func maxReplicas(ranges []rangeType) int function would be worth extracting. I'm neutral on the semaphore duplication.
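
For illustration, the suggested extraction might look roughly like the following; `rangeType` and its `Replicas` field are hypothetical stand-ins for the real range-descriptor type, and this reads `maxReplicas` as the largest replica count across the given ranges.

```go
package sqlcclsketch

// rangeType is a hypothetical stand-in for the range descriptors the
// BACKUP code already iterates over; only the replica slice matters here.
type rangeType struct {
	Replicas []struct{ NodeID, StoreID int }
}

// maxReplicas returns the largest replica count seen across ranges.
func maxReplicas(ranges []rangeType) int {
	max := 0
	for _, r := range ranges {
		if n := len(r.Replicas); n > max {
			max = n
		}
	}
	return max
}
```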

How effective was this at restoring cluster traffic?


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

Can you add a comment describing why this was chosen?

👍


@danhhz (Contributor, Author) commented Apr 12, 2017

While mulling this over last night, I realized we really wanted the number of nodes, which we can get very quickly from gossip, so I changed that. RFAL

The new version is a little more DRY. I'm inclined against extracting the semaphore, though; I find it more obvious what's going on when it's inlined.

> How effective was this at restoring cluster traffic?

RESTORE is still choppy, but this change seems to cut down on the log spam and (more importantly) on interference with node liveness.


Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

👍

Done.


@maddyblue (Contributor) commented:

Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Done.

I think this logic here will severely limit the speed at which restores and backups occur.

In the ideal scenario there are nodeCount * parallelReqLimit requests out, evenly distributed among the nodes. Even in that case, though, the nodes will be waiting between the time when a request finishes and the next one is issued. I think it would be better if, in that case, there were a few requests already queued up and waiting to proceed, so that the nodes won't have to wait for additional requests before filling up their work queues.

However, I'm doubtful that will ever happen, because the order in which the requests are sent out will only rarely be perfectly distributed that way. Instead we will have some multiple of 5 requests at a single node while other nodes sit idle. In the worst case, a single node would have all of the outstanding requests and every other node would be doing nothing, because their range leases come later in the spans.

Did I miss something? Is this a known limitation? Are we ok with it because it's still better than what we have today? I think it'll be really hard to hit our perf goals with this rate limiter.


@danhhz (Contributor, Author) commented Apr 12, 2017

Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):

tl;dr yes, it's better than what we have today. This could keep track of outstanding requests per StoreID/NodeID (I considered doing that), but this is considerably simpler. I don't think this issue is resolved (hence #14798), but we need to get something stable ASAP so we can have a couple weeks of testing before shipping 1.0.

> I think it'll be really hard to hit our perf goals with this rate limiter.

At this point, I'm way more concerned about stability (and that things finish without bringing down the cluster) than performance. The worst case you mention could happen, but anecdotally BACKUP is still going pretty fast on lapis. I'd be interested to see how long the 2TB test takes with this change once it merges.
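
For illustration of the per-StoreID/NodeID bookkeeping mentioned above (the alternative that was considered but not implemented), a rough sketch with every name hypothetical:

```go
package sqlcclsketch

import "sync"

// nodeLimiter sketches the considered-but-not-implemented alternative:
// cap outstanding requests per target node instead of using one
// cluster-wide semaphore. All names here are hypothetical.
type nodeLimiter struct {
	mu         sync.Mutex
	cond       *sync.Cond
	perNode    map[int]int // NodeID -> outstanding request count
	maxPerNode int         // e.g. the server-side per-node limit
}

func newNodeLimiter(maxPerNode int) *nodeLimiter {
	l := &nodeLimiter{perNode: make(map[int]int), maxPerNode: maxPerNode}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// acquire blocks until nodeID has a free slot.
func (l *nodeLimiter) acquire(nodeID int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.perNode[nodeID] >= l.maxPerNode {
		l.cond.Wait()
	}
	l.perNode[nodeID]++
}

// release frees a slot on nodeID and wakes any waiters.
func (l *nodeLimiter) release(nodeID int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.perNode[nodeID]--
	l.cond.Broadcast()
}
```

Relative to the single cluster-wide semaphore, this would keep the worst case described above (all outstanding requests piling onto one node) from starving the rest, at the cost of extra bookkeeping per request.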


@danhhz (Contributor, Author) commented Apr 12, 2017

RFAL


Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions.


pkg/ccl/sqlccl/restore.go, line 704 at r2 (raw file):

	//
	// TODO(dan): See if there's some better solution than rate-limiting #14798.
	maxConcurrentImports := clusterNodeCount(p.ExecCfg().Gossip)

The old version wasn't working on lapis. I made it even more conservative for now. I'm really interested in seeing if we can get the 2TB backup to finish at all.
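
As a usage sketch of the more conservative r2 sizing: the Import semaphore holds roughly one in-flight request per node, versus the earlier replicas-times-per-node-limit cap on the Export side. Here `nodeCount` stands in for `clusterNodeCount(p.ExecCfg().Gossip)`; `importSpan`, `sendImportRequest`, `runImports`, and the `restoresketch` package are all hypothetical.

```go
package restoresketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// importSpan is a placeholder for whatever unit RESTORE imports per request.
type importSpan struct{ startKey, endKey []byte }

// runImports sketches the r2 approach: size the semaphore by cluster node
// count so roughly one Import request is outstanding per node. This is a
// sketch of the idea, not the PR's actual code.
func runImports(
	ctx context.Context,
	spans []importSpan,
	nodeCount int,
	sendImportRequest func(context.Context, importSpan) error,
) error {
	maxConcurrentImports := nodeCount // stands in for clusterNodeCount(...)
	sem := make(chan struct{}, maxConcurrentImports)
	g, gCtx := errgroup.WithContext(ctx)
	for _, sp := range spans {
		sp := sp // capture the loop variable for the goroutine below
		g.Go(func() error {
			select {
			case sem <- struct{}{}: // acquire a slot
			case <-gCtx.Done():
				return gCtx.Err()
			}
			defer func() { <-sem }() // release the slot
			return sendImportRequest(gCtx, sp)
		})
	}
	return g.Wait()
}
```

If the server-side per-node limit is the 5 that the earlier comment alludes to, a single RESTORE would at best occupy about one of those slots per node, which matches the stability-over-throughput trade-off described above.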


@danhhz (Contributor, Author) commented Apr 13, 2017

TFTR!

@danhhz merged commit 8d5587f into cockroachdb:master on Apr 13, 2017
@danhhz deleted the limit branch on April 2, 2018