restore: 2tb dataset doesn't finish #14792
Comments
The first thing that happens is splits. The ~16k splits work fine on a 1-node cluster, but seem to lock up on a 3-node cluster. Rate limiting them to 100/second seems to work locally with a 3-node roachdemo cluster. I'd like to know why dumping them all at once doesn't work, but I've seen similar badness from dumping all the Export commands in backup and am thinking about rate limiting that, too. In the interest of making progress, I'm going to merge the various rate limits and file issues to look into what is breaking.
cc @mjibson
@cuongdo has seen similar badness with large numbers of splits; where's that issue? I can't seem to find it now.
On clusters of more than one node, dumping all 16,000 splits at once (more or less) would consistently get the cluster stuck in some state it never got out of. Given that our normal codepaths probably weren't designed with this sort of abuse in mind, spread the splits out a bit. For cockroachdb#14792.
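For illustration, a minimal sketch of what pacing splits client-side could look like, using golang.org/x/time/rate and the ~100/second figure mentioned above. `paceSplits` and `sendSplit` are hypothetical names, not the actual RESTORE code, which issues AdminSplit requests through its own plumbing.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// paceSplits sends one split request per key, but never faster than the
// limiter allows, so the cluster isn't hit with ~16k splits at once.
func paceSplits(ctx context.Context, splitKeys []string, sendSplit func(context.Context, string) error) error {
	limiter := rate.NewLimiter(rate.Limit(100), 10) // ~100 splits/sec, small burst
	for _, key := range splitKeys {
		// Wait blocks until the limiter grants a token or ctx is canceled.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := sendSplit(ctx, key); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	keys := []string{"/Table/51/1/100", "/Table/51/1/200", "/Table/51/1/300"}
	err := paceSplits(context.Background(), keys, func(ctx context.Context, key string) error {
		fmt.Println("split at", key) // stand-in for an AdminSplit request
		return nil
	})
	fmt.Println("done, err =", err)
}
```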
We're already limiting these on the server side, but the BACKUP/RESTORE gateway would fill up its distsender/grpc/something and cause all sorts of badness (node liveness timeouts leading to mass leaseholder transfers, poor performance on SQL workloads, etc.) as well as log spam about slow distsender requests. There is likely some better fix post-1.0; this is being tracked in cockroachdb#14798. For cockroachdb#14792.
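A rough sketch of the general pattern of bounding in-flight requests from a single gateway with a buffered-channel semaphore. `sendWithLimit`, `maxInFlight`, and `sendReq` are illustrative placeholders, not the actual CockroachDB API.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// sendWithLimit keeps at most maxInFlight requests outstanding at once,
// using a buffered channel as a counting semaphore, so the gateway's
// DistSender/gRPC layers don't get flooded with thousands of requests.
func sendWithLimit(ctx context.Context, reqs []string, maxInFlight int, sendReq func(context.Context, string) error) error {
	sem := make(chan struct{}, maxInFlight)
	g, ctx := errgroup.WithContext(ctx)
	for _, r := range reqs {
		r := r
		select {
		case sem <- struct{}{}: // acquire a slot
		case <-ctx.Done(): // an earlier request failed; stop issuing new ones
			return g.Wait()
		}
		g.Go(func() error {
			defer func() { <-sem }() // release the slot
			return sendReq(ctx, r)
		})
	}
	return g.Wait()
}

func main() {
	reqs := []string{"export-1", "export-2", "export-3", "export-4"}
	err := sendWithLimit(context.Background(), reqs, 2, func(ctx context.Context, r string) error {
		fmt.Println("sending", r) // stand-in for an Export/Import RPC
		return nil
	})
	fmt.Println("done, err =", err)
}
```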
When testing with the Import and Export requests rate limited client-side, it seems we then run into #14776 and nodes start crashing.
We used to run 5, but this was overloading disks, which caused contention in RocksDB, which slowed down heartbeats, which caused mass lease transfers. This worked much better in our large-scale tests and doesn't seem to slow things down much (10-15%). A 2TB restore finished with only a handful of missed heartbeats. A follow-up will smooth out the WriteBatch work further, which helps even more. For cockroachdb#14792.
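A sketch of the fixed-size worker-pool pattern the commit describes, i.e. capping how many WriteBatches are applied concurrently. The names `concurrentWriteBatches` (set to an assumed lower value here) and `applyBatches` are hypothetical; the actual numbers and plumbing in the commit differ.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// concurrentWriteBatches is an assumed cap; the commit above describes
// lowering the old value of 5 because it overloaded disks.
const concurrentWriteBatches = 2

// applyBatches fans work out to a fixed number of workers so that at most
// concurrentWriteBatches batches are being written at any one time.
func applyBatches(batches []string, apply func(string) error) error {
	work := make(chan string)
	var (
		wg       sync.WaitGroup
		errOnce  sync.Once
		firstErr error
		failed   atomic.Bool
	)

	for i := 0; i < concurrentWriteBatches; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range work {
				if failed.Load() {
					continue // an earlier batch failed; just drain the queue
				}
				if err := apply(b); err != nil {
					errOnce.Do(func() { firstErr = err })
					failed.Store(true)
				}
			}
		}()
	}

	for _, b := range batches {
		work <- b
	}
	close(work)
	wg.Wait()
	return firstErr
}

func main() {
	err := applyBatches([]string{"batch-1", "batch-2", "batch-3"}, func(b string) error {
		fmt.Println("writing", b) // stand-in for applying a WriteBatch
		return nil
	})
	fmt.Println("done, err =", err)
}
```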
Running
RESTORE datablocks.* FROM 'redacted-2tb-backup'
gets stuck and never finishes.