sqlccl: rate limit Export and Import requests sent #14820
Conversation
Does it make sense to DRY this stuff up at all? Lots of duplicated code. Maybe it's more annoying to do that than just duplicate it.
Review status: 0 of 3 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.
pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):
Can you add a comment describing why this was chosen?
I think a … How effective was this at restoring cluster traffic?
Reviewed 3 of 3 files at r1.
pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):
Previously, mjibson (Matt Jibson) wrote…
👍
While mulling this over last night, I realized we really wanted the number of nodes, which we can get very quickly from gossip, so I changed that. RFAL. The new version is a little more DRY. I'm inclined against extracting the semaphore, though; I find it more obvious what's going on when it's inlined.
RESTORE is still choppy, but this seems to cut down on the log spam and (more importantly) on interfering with node liveness.
Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion.
pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
Done.
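For context, a minimal sketch of the inlined buffered-channel semaphore pattern being discussed, assuming a hypothetical node count obtained from gossip and a hypothetical exportsPerNode budget; the actual backup.go/restore.go code may differ:

```go
// Rough sketch (not the PR's actual code): cap the number of Export/Import
// requests a single gateway keeps in flight, scaled by the cluster's node
// count. nodeCount, exportsPerNode, and the spans are illustrative only.
package main

import (
	"fmt"
	"sync"
)

const exportsPerNode = 5 // hypothetical per-node concurrency budget

func runExports(nodeCount int, spans []string) {
	// Buffered channel used as a counting semaphore: at most
	// nodeCount*exportsPerNode requests outstanding from this gateway.
	sem := make(chan struct{}, nodeCount*exportsPerNode)

	var wg sync.WaitGroup
	for _, span := range spans {
		wg.Add(1)
		go func(span string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when the request returns
			fmt.Println("sending Export for span", span) // stand-in for the real RPC
		}(span)
	}
	wg.Wait()
}

func main() {
	// nodeCount would come from gossip in the real code; hard-coded here.
	runExports(3, []string{"/Table/51/1-2", "/Table/51/2-3", "/Table/51/3-4"})
}
```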
Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.
pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I think this logic here will severely limit the speed at which restores and backups occur. In the most ideal scenario there are … However, I'm doubtful that will ever happen, because the order in which the requests are sent out will only rarely be perfectly distributed that way. Instead we will have some multiple of 5 requests at a single node while other nodes sit idle. In the worst case, a single node would have all outstanding requests and every other node would be doing nothing, because their range leases come later on in …
Did I miss something? Is this a known limitation? Are we ok with it because it's still better than what we have today? I think it'll be really hard to hit our perf goals with this rate limiter.
Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.
pkg/ccl/sqlccl/backup.go, line 361 at r1 (raw file):
At this point, I'm way more concerned about stability (and that things finish without bringing down the cluster) than performance. The worst case you mention could happen, but, anecdotally, BACKUP is still going pretty fast on lapis. I'd be interested to see how long the 2TB test takes with this change once it merges.
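As an illustration of the distinction raised above: a single gateway-wide semaphore bounds the total number of in-flight requests but not where they land, whereas a semaphore keyed by leaseholder node would bound the load on any one node directly. The sketch below is a hypothetical alternative for comparison only; perNodeSem, perNodeLimit, and acquire are invented names, not code from this PR.

```go
// Hypothetical alternative (not what this PR implements): limit in-flight
// requests per leaseholder node rather than per gateway, so a skewed lease
// distribution can't pile every outstanding request onto one node.
package main

import "sync"

const perNodeLimit = 5 // hypothetical per-node budget

type perNodeSem struct {
	mu   sync.Mutex
	sems map[int]chan struct{} // counting semaphore per leaseholder node ID
}

// acquire blocks until the target node has a free slot and returns a release func.
func (p *perNodeSem) acquire(nodeID int) (release func()) {
	p.mu.Lock()
	sem, ok := p.sems[nodeID]
	if !ok {
		sem = make(chan struct{}, perNodeLimit)
		p.sems[nodeID] = sem
	}
	p.mu.Unlock()
	sem <- struct{}{}
	return func() { <-sem }
}

func main() {
	p := &perNodeSem{sems: make(map[int]chan struct{})}
	release := p.acquire(1) // at most perNodeLimit requests in flight on node 1
	defer release()
}
```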
We're already limiting these on the server side, but the BACKUP/RESTORE gateway would fill up its distsender/grpc/something and cause all sorts of badness (node liveness timeouts leading to mass leaseholder transfers, poor performance on SQL workloads, etc.) as well as log spam about slow distsender requests. There is likely some better fix post-1.0; this is being tracked in cockroachdb#14798. For cockroachdb#14792.
RFAL
Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions.
pkg/ccl/sqlccl/restore.go, line 704 at r2 (raw file):
The old version wasn't working on lapis. I made it even more conservative for now. I'm really interested in seeing if we can get the 2TB backup to finish at all.
TFTR!