backupccl: hundreds of unavailable ranges after dist restore on v2.0-alpha.20171218 #21291
Comments
Still getting node restarts and unavailable ranges on m4.xlarges.
Note that all […]
Can you post more details from the run on […]?
With m4.xlarge nodes I saw the following when running a restore: […] I have a debug zip if that's useful.
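(For anyone reproducing this, one rough way to watch the unavailable and under-replicated range counts per node is each node's Prometheus endpoint; the node address and port below are placeholders, not values from this run:)

```sh
# Sketch: poll a node's Prometheus metrics for range-health counters.
# <node-address> is a placeholder; 8080 is the default HTTP port.
curl -s http://<node-address>:8080/_status/vars \
  | grep -E 'ranges_(unavailable|underreplicated)'
```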
It'd be much easier to look into the unavailable and underreplicated range issue if there weren't also a bunch of node failures along the way muddying the waters, especially since the node restarts may be causing them to some degree. @danhhz when will someone on bulkio be able to look into the restore node failures that @nstewart keeps hitting? I will say it's pretty cool that the restores still finish successfully despite all the failures, though 👍
@nstewart Have you seen any behavior like this recently? @awoods187 you do a lot of restores in your testing; is this something you've encountered lately?
I'd expect this to have been fixed by #32594. |
I ran a distributed restore of about 18GiB on a 5-node cluster of t2.larges on AWS running v2.0-alpha.20171218. Afterwards, hundreds of ranges were unavailable and the cluster was unusable.
Note that during the restore some of the nodes periodically died and were restarted; judging by the uptimes, it looks like we lost quorum for a period of time.
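(A restore like this is kicked off as a single SQL statement from any node; a minimal sketch follows, where the database name, bucket path, and credentials are placeholders rather than the ones used in this run:)

```sh
# Sketch: start a distributed RESTORE from one node; the work is then
# fanned out across the cluster. All identifiers and URIs are hypothetical.
cockroach sql --insecure --host=<node-address> -e \
  "RESTORE DATABASE bank FROM 's3://<bucket>/<backup-path>?AWS_ACCESS_KEY_ID=<key>&AWS_SECRET_ACCESS_KEY=<secret>';"
```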
Are these nodes not big enough to run a dist restore? According to our production docs, "Each node should have sufficient CPU, RAM, network, and storage capacity to handle your workload." Is 8GiB/node not large enough for a 20GiB distributed restore? cc @jseldess
I took a debug zip immediately, so I can share it with anyone handling this ticket.
Not clear if this is the same problem as #21102, since in that one the cluster was actually usable after the restore.
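(Roughly how such a debug zip is captured; the host and output path here are illustrative:)

```sh
# Sketch: collect logs, settings, and range info from the cluster into a zip.
cockroach debug zip ./debug.zip --insecure --host=<node-address>
```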