
backupccl: hundreds of unavailable ranges after dist restore on v2.0-alpha.20171218 #21291

Closed
nstewart opened this issue Jan 6, 2018 · 7 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Milestone

Comments

@nstewart
Contributor

nstewart commented Jan 6, 2018

(screenshot, 2018-01-06 3:13 PM)

I ran a distributed restore of about 18GiB on a 5 node cluster of t2.larges on AWS running v2.0-alpha.20171218. Afterwards, hundreds of ranges were unavailable and the cluster was unusable.
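For reference, a distributed restore of this kind is issued as a single SQL statement; the database name, bucket path, and credentials below are hypothetical placeholders, not the actual ones used in this run:

```sql
-- Restore a previously taken backup from cloud storage.
-- The work is fanned out across all nodes in the cluster.
RESTORE DATABASE bank
FROM 's3://my-bucket/backups/2018-01-06?AWS_ACCESS_KEY_ID=...&AWS_SECRET_ACCESS_KEY=...';

-- Monitor the running job's progress (fraction_completed).
SHOW JOBS;
```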

Note that during the restore some of the nodes periodically died and were restarted; judging by the uptimes, it looks like we lost quorum for a period of time.

Are these nodes not big enough to run a dist restore? According to our production docs, "Each node should have sufficient CPU, RAM, network, and storage capacity to handle your workload." Is 8GiB/node not large enough for a 20GiB distributed restore? cc @jseldess

I took a debug zip immediately, so I can share with anyone handling this ticket.

It's not clear whether this is the same problem as #21102, since in that case the cluster was actually usable after the restore.

@nstewart
Contributor Author

nstewart commented Jan 6, 2018

Still getting node restarts and unavailable ranges on m4.xlarges

@bdarnell
Contributor

bdarnell commented Jan 8, 2018

Note that all AWS t2 instances are "burstable": you don't get your full CPU allocation, and we believe this causes problems for CRDB. We're in the process of adding this to the docs: cockroachdb/docs#2181

Can you post more details from the run on m4.xlarge?

@nstewart
Contributor Author

nstewart commented Jan 8, 2018


With m4.xlarge nodes I saw the following while running a restore:
- 78%: ran perfectly until I lost a node
- 82.5%: progress slowed down; more than one node had been restarted
- 86.6%: stuck here for a long time
- 90.7%: no unavailable ranges
- 99.96%: nodes died again, with unavailable ranges
I left for a couple hours and when I came back I saw everything was running with no unavailable ranges.

I have a debug zip if that's useful

@a-robinson
Contributor

It'd be much easier to look into the unavailable and underreplicated range issue if there weren't also a bunch of node failures along the way muddying the waters, especially since the node restarts may be causing them to some degree.

@danhhz when will someone on bulkio be able to look into the restore node failures that @nstewart keeps hitting? I will say it's pretty cool that the restores still finish successfully despite all the failures, though 👍

@danhhz
Contributor

danhhz commented Jan 8, 2018

18GiB is tiny; I doubt the size of the machines is the problem. The latest nightly 2TB restore test just took 4h (the usual is 55m), and most of that time was spent stalled, which could be related. @mjibson or @dt, can you look at this please?

@maddyblue maddyblue added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 26, 2018
@knz knz changed the title Hundreds of unavailable ranges after dist restore on v2.0-alpha.20171218 backupccl: hundreds of unavailable ranges after dist restore on v2.0-alpha.20171218 Jul 23, 2018
@knz knz added this to the 2.1 milestone Jul 23, 2018
@nstewart nstewart added the S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. label Sep 18, 2018
@rolandcrosby

@nstewart Have you seen any behavior like this recently? @awoods187 you do a lot of restores in your testing, is this something you've encountered lately?

@tbg
Member

tbg commented Dec 5, 2018

I'd expect this to have been fixed by #32594.

@tbg tbg closed this as completed Dec 5, 2018