
backupccl: hundreds of unavailable ranges after dist restore on v2.0-alpha.20171218 #21291

Closed
nstewart opened this issue Jan 6, 2018 · 7 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Milestone

Comments

@nstewart
Contributor

nstewart commented Jan 6, 2018

(screenshot, 2018-01-06 3:13 PM)

I ran a distributed restore of about 18GiB on a 5 node cluster of t2.larges on AWS running v2.0-alpha.20171218. Afterwards, hundreds of ranges were unavailable and the cluster was unusable.
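For reference, a distributed restore of this kind is issued as a single SQL statement; the database name, bucket path, and credentials below are hypothetical placeholders, not the actual ones used in this run:

```sql
-- Restore a previously taken backup from cloud storage.
-- The work is fanned out across all nodes in the cluster.
RESTORE DATABASE bank
FROM 's3://my-bucket/backups/2018-01-06?AWS_ACCESS_KEY_ID=...&AWS_SECRET_ACCESS_KEY=...';

-- Monitor the running job's progress (fraction_completed).
SHOW JOBS;
```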

Note that during the restore some of the nodes periodically died and were restarted; judging by the uptimes, it looks like we lost quorum for a period of time.

Are these nodes not big enough to run a dist restore? According to our production docs, "Each node should have sufficient CPU, RAM, network, and storage capacity to handle your workload." Is 8GiB/node not large enough for a 20GiB distributed restore? cc @jseldess

I took a debug zip immediately, so I can share with anyone handling this ticket.

It's not clear whether this is the same problem as #21102, since in that case the cluster was actually usable after the restore.

@nstewart
Contributor Author

nstewart commented Jan 6, 2018

Still getting node restarts and unavailable ranges on m4.xlarges

@bdarnell
Contributor

bdarnell commented Jan 8, 2018

Note that all AWS t2 instances are "burstable": you don't get your full CPU allocation, and we believe this causes problems for CRDB. We're in the process of adding this to the docs: cockroachdb/docs#2181

Can you post more details from the run on m4.xlarge?

@nstewart
Contributor Author

nstewart commented Jan 8, 2018


With m4.xlarge nodes I saw the following while running a restore:
- 78%: ran perfectly until I lost a node
- 82.5%: progress slowed down; more than one node had been restarted
- 86.6%: stuck here for a long time
- 90.7%: no unavailable ranges
- 99.96%: nodes died again, with unavailable ranges
I left for a couple hours and when I came back I saw everything was running with no unavailable ranges.

I have a debug zip if that's useful

@a-robinson
Contributor

It'd be much easier to look into the unavailable and underreplicated range issue if there weren't also a bunch of node failures along the way muddying the waters, especially since the node restarts may be causing them to some degree.

@danhhz when will someone on bulkio be able to look into the restore node failures that @nstewart keeps hitting? I will say it's pretty cool that the restores still finish successfully despite all the failures, though 👍

@danhhz
Contributor

danhhz commented Jan 8, 2018

18GiB is tiny; I doubt the size of the machines is the problem. The latest nightly 2TB restore test just took 4h (the usual is 55m), and most of that time was spent stalled, which could be related. @mjibson or @dt, can you look at this please?

@maddyblue maddyblue added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 26, 2018
@knz knz changed the title Hundreds of unavailable ranges after dist restore on v2.0-alpha.20171218 backupccl: hundreds of unavailable ranges after dist restore on v2.0-alpha.20171218 Jul 23, 2018
@knz knz added this to the 2.1 milestone Jul 23, 2018
@nstewart nstewart added the S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. label Sep 18, 2018
@rolandcrosby

@nstewart Have you seen any behavior like this recently? @awoods187 you do a lot of restores in your testing, is this something you've encountered lately?

@tbg
Member

tbg commented Dec 5, 2018

I'd expect this to have been fixed by #32594.

@tbg tbg closed this as completed Dec 5, 2018