restore: 2tb dataset doesn't finish #14792

Closed
danhhz opened this issue Apr 11, 2017 · 5 comments

danhhz commented Apr 11, 2017

Running RESTORE datablocks.* FROM 'redacted-2tb-backup' gets stuck and never finishes.

danhhz added this to the 1.0 milestone Apr 11, 2017
danhhz self-assigned this Apr 11, 2017

danhhz commented Apr 11, 2017

The first thing that happens is splits. The ~16k splits work fine on a 1-node cluster, but seem to lock up on a 3-node cluster. Rate limiting them to 100/second seems to work locally with a 3-node roachdemo cluster.

I'd like to know why dumping them all at once doesn't work, but I've seen similar badness from dumping all the Export commands in backup, and I'm thinking about rate limiting that, too. In the interest of making progress, I'm going to merge the various rate limits and file issues to look into what is breaking.
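
Not the real plumbing, but roughly what I have in mind for pacing the splits: a minimal sketch using golang.org/x/time/rate, where splitKeys and sendSplit are hypothetical stand-ins for however the actual split requests get issued.

```go
// Rough sketch of pacing split requests to ~100/second. splitKeys and
// sendSplit are hypothetical stand-ins, not the actual RESTORE code.
package restore

import (
	"context"

	"golang.org/x/time/rate"
)

func issueSplits(ctx context.Context, splitKeys [][]byte, sendSplit func(context.Context, []byte) error) error {
	limiter := rate.NewLimiter(rate.Limit(100), 1) // ~100 splits/second
	for _, key := range splitKeys {
		// Wait blocks until the limiter allows another request (or ctx is done).
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := sendSplit(ctx, key); err != nil {
			return err
		}
	}
	return nil
}
```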


danhhz commented Apr 11, 2017

cc @mjibson

@andreimatei

@cuongdo has seen similar badness with large numbers of splits; where's that issue? I can't seem to find it now.

danhhz added a commit to danhhz/cockroach that referenced this issue Apr 11, 2017
On clusters of more than 1 node, dumping all 16000 splits at once (more
or less) would consistently get the cluster stuck in some state it never
got out of. Given that our normal codepaths probably weren't designed
with this sort of abuse in mind, spread the splits out a bit.

For cockroachdb#14792.
danhhz added a commit to danhhz/cockroach that referenced this issue Apr 11, 2017
We're already limiting these on the server side, but the BACKUP/RESTORE
gateway would fill up its distsender/grpc/something and cause all sorts
of badness (node liveness timeouts leading to mass leaseholder
transfers, poor performance on SQL workloads, etc.) as well as log spam
about slow distsender requests.

There is likely some better fix post-1.0; this is being tracked in cockroachdb#14798.

For cockroachdb#14792.
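
A minimal sketch of the kind of client-side cap described in the commit message above, using a buffered channel as a semaphore; maxOutstanding, runLimited, and the request closures are hypothetical, not the actual BACKUP/RESTORE gateway code.

```go
// Minimal sketch of bounding in-flight requests from a single gateway with a
// buffered channel used as a semaphore. maxOutstanding and the request
// closures are hypothetical stand-ins.
package restore

import (
	"context"

	"golang.org/x/sync/errgroup"
)

const maxOutstanding = 10 // hypothetical cap on concurrent Import/Export requests

func runLimited(ctx context.Context, reqs []func(context.Context) error) error {
	sem := make(chan struct{}, maxOutstanding)
	g, gctx := errgroup.WithContext(ctx)
	for _, req := range reqs {
		req := req        // capture loop variable for the goroutine
		sem <- struct{}{} // acquire a slot; blocks when maxOutstanding are in flight
		g.Go(func() error {
			defer func() { <-sem }() // release the slot when the request finishes
			return req(gctx)
		})
	}
	return g.Wait()
}
```

Acquiring the slot before spawning each goroutine keeps the gateway from queueing an unbounded number of outstanding RPCs in distsender/grpc.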

danhhz commented Apr 12, 2017

When testing with rate limiting of the Import and Export requests client-side, it seems we then run into #14776 and nodes start crashing.

danhhz added a commit to danhhz/cockroach that referenced this issue Apr 27, 2017
We used to run 5 at a time, but this was overloading disks, which caused
contention in RocksDB, which slowed down heartbeats, which caused mass
lease transfers. This worked much better in our large-scale tests and
doesn't seem to slow it down much (10-15%). A 2TB restore finished with
a handful of missed heartbeats. A followup will further smooth out the
WriteBatch work, which helps even more.

For cockroachdb#14792.
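
For reference, a minimal sketch of the worker-pool shape the commit message describes, where lowering the worker count bounds disk and RocksDB pressure; workItem, applyAll, and apply are hypothetical stand-ins, not the restore code itself.

```go
// Sketch of a fixed-size worker pool draining WriteBatch-style work; lowering
// numWorkers (e.g. from 5 to something smaller) trades some restore throughput
// for less disk pressure. workItem and apply are hypothetical stand-ins.
package restore

import (
	"context"
	"sync"
)

type workItem struct{ payload []byte } // hypothetical unit of WriteBatch work

func applyAll(ctx context.Context, numWorkers int, work <-chan workItem, apply func(context.Context, workItem) error) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range work {
				if err := apply(ctx, item); err != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
					return
				}
			}
		}()
	}
	wg.Wait()
	return firstErr
}
```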

danhhz commented Apr 28, 2017

#15415 seems to have addressed this. There's still some room for improvement in not affecting liveness, but I'll leave that to be tracked in #15341.

danhhz closed this as completed Apr 28, 2017