restore: 2tb dataset doesn't finish #14792
Comments
The first thing that happens is splits. The ~16k splits work fine on a 1-node cluster, but seem to lock up on a 3-node cluster. Rate limiting them to 100/second seems to work locally with a 3-node roachdemo cluster. I'd like to know why dumping them all at once doesn't work, but I've seen similar badness from dumping all the Export commands in backup and am thinking about rate limiting that, too. In the interest of making progress, I'm going to merge the various rate limits and file issues to look into what is breaking.
cc @mjibson
@cuongdo has seen similar badness with large numbers of splits; where's that issue? I can't seem to find it now.
On clusters of more than one node, dumping all 16,000 splits at once (more or less) would consistently get the cluster stuck in some state it never got out of. Given that our normal codepaths probably weren't designed with this sort of abuse in mind, spread the splits out a bit. For cockroachdb#14792.
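For illustration, a minimal sketch of what pacing splits client-side could look like, using golang.org/x/time/rate and the ~100/second figure mentioned above. `paceSplits` and `sendSplit` are hypothetical names, not the actual RESTORE code, which issues AdminSplit requests through its own plumbing.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// paceSplits sends one split request per key, but never faster than the
// limiter allows, so the cluster isn't hit with ~16k splits at once.
func paceSplits(ctx context.Context, splitKeys []string, sendSplit func(context.Context, string) error) error {
	limiter := rate.NewLimiter(rate.Limit(100), 10) // ~100 splits/sec, small burst
	for _, key := range splitKeys {
		// Wait blocks until the limiter grants a token or ctx is canceled.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := sendSplit(ctx, key); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	keys := []string{"/Table/51/1/100", "/Table/51/1/200", "/Table/51/1/300"}
	err := paceSplits(context.Background(), keys, func(ctx context.Context, key string) error {
		fmt.Println("split at", key) // stand-in for an AdminSplit request
		return nil
	})
	fmt.Println("done, err =", err)
}
```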
We're already limiting these on the server side, but the BACKUP/RESTORE gateway would fill up its distsender/grpc/something and cause all sorts of badness (node liveness timeouts leading to mass leaseholder transfers, poor performance on SQL workloads, etc.) as well as log spam about slow distsender requests. There is likely some better fix post-1.0; this is being tracked in cockroachdb#14798. For cockroachdb#14792.
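A rough sketch of the general pattern of bounding in-flight requests from a single gateway with a buffered-channel semaphore. `sendWithLimit`, `maxInFlight`, and `sendReq` are illustrative placeholders, not the actual CockroachDB API.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// sendWithLimit keeps at most maxInFlight requests outstanding at once,
// using a buffered channel as a counting semaphore, so the gateway's
// DistSender/gRPC layers don't get flooded with thousands of requests.
func sendWithLimit(ctx context.Context, reqs []string, maxInFlight int, sendReq func(context.Context, string) error) error {
	sem := make(chan struct{}, maxInFlight)
	g, ctx := errgroup.WithContext(ctx)
	for _, r := range reqs {
		r := r
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-ctx.Done(): // an earlier request failed; stop issuing new ones
			return g.Wait()
		}
		g.Go(func() error {
			defer func() { <-sem }() // release the slot
			return sendReq(ctx, r)
		})
	}
	return g.Wait()
}

func main() {
	reqs := []string{"export-1", "export-2", "export-3", "export-4"}
	err := sendWithLimit(context.Background(), reqs, 2, func(ctx context.Context, r string) error {
		fmt.Println("sending", r) // stand-in for an Export/Import RPC
		return nil
	})
	fmt.Println("done, err =", err)
}
```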
When testing with the Import and Export requests rate limited client-side, it seems we then run into #14776 and nodes start crashing.
We used to run 5, but this was overloading disks, which caused contention in RocksDB, which slowed down heartbeats, which caused mass lease transfers. This worked much better in our large-scale tests and doesn't seem to slow things down much (10-15%). A 2TB restore finished with only a handful of missed heartbeats. A follow-up will smooth out the WriteBatch work further, which helps even more. For cockroachdb#14792.
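A sketch of the fixed-size worker-pool pattern the commit describes, i.e. capping how many WriteBatches are applied concurrently. The names `concurrentWriteBatches` (set to an assumed lower value here) and `applyBatches` are hypothetical; the actual numbers and plumbing in the commit differ.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// concurrentWriteBatches is an assumed cap; the commit above describes
// lowering the old value of 5 because it overloaded disks.
const concurrentWriteBatches = 2

// applyBatches fans work out to a fixed number of workers so that at most
// concurrentWriteBatches batches are being written at any one time.
func applyBatches(batches []string, apply func(string) error) error {
	work := make(chan string)
	var (
		wg       sync.WaitGroup
		errOnce  sync.Once
		firstErr error
		failed   atomic.Bool
	)

	for i := 0; i < concurrentWriteBatches; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range work {
				if failed.Load() {
					continue // an earlier batch failed; just drain the queue
				}
				if err := apply(b); err != nil {
					errOnce.Do(func() { firstErr = err })
					failed.Store(true)
				}
			}
		}()
	}

	for _, b := range batches {
		work <- b
	}
	close(work)
	wg.Wait()
	return firstErr
}

func main() {
	err := applyBatches([]string{"batch-1", "batch-2", "batch-3"}, func(b string) error {
		fmt.Println("writing", b) // stand-in for applying a WriteBatch
		return nil
	})
	fmt.Println("done, err =", err)
}
```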
Running
RESTORE datablocks.* FROM 'redacted-2tb-backup'
gets stuck and never finishes.