stability: freeze-cluster never terminates #7038

Closed
bdarnell opened this issue Jun 3, 2016 · 14 comments

@bdarnell (Contributor) commented Jun 3, 2016

In #6985, we found that freeze-cluster would hang without completing. We need to make this process reliable before we can recommend it as a part of the regular upgrade process.

In the meantime, we should be trying freeze-cluster as a part of all of our upgrades for testing purposes even though we're not yet recommending that regular users do so.

@tbg (Member) commented Jun 5, 2016

Here's an acceptance test where this happens. We know that deadlocks can occur when there are GC'able replicas - we should tackle that first, but also add logging that makes the situation debuggable in the wild.

https://circleci.com/gh/cockroachdb/cockroach/18635

@mberhault (Contributor) commented Jun 8, 2016

Attempting freeze on beta (I'm trying to incorporate it into my upgrade process).
The cluster was otherwise healthy, but the freeze timed out.

I called freeze on the first node: 104.196.97.139

{
  "Error": "timed out waiting for 1 store to report freeze",
  "Code": 2
}

A repeated call to freeze resulted in:

{
  "ranges_affected": "0"
}

@tbg (Member) commented Jun 8, 2016

Thanks for the report, let's see how this behaves once #7090 is in. I suspect that there's a GC'able range in the system (since all addressable ranges are apparently frozen).

@tbg (Member) commented Jun 8, 2016

#7090 and #7104 have landed; from now on we should get more actionable information when this happens again.

@tbg (Member) commented Jun 8, 2016

@bdarnell, any ideas on how to deal with GC'able replicas in practice? We won't be able to run the GC queue (since consistent KV lookups are no longer available at that point).
I could remove unfrozen replicas which have no outstanding Raft commands (appliedIndex == lastIndex), under the assumption that they would otherwise have been affected by the scans, but that's a big hammer.
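
For concreteness, here is a minimal Go sketch of that heuristic; replicaState and its fields are hypothetical stand-ins for the real per-replica state, not the actual storage types:

```go
package main

import "fmt"

// replicaState is a hypothetical stand-in for the per-replica Raft state
// referenced above; the real fields live elsewhere in the storage layer.
type replicaState struct {
	frozen       bool
	appliedIndex uint64 // highest Raft log index applied to the state machine
	lastIndex    uint64 // highest Raft log index written to the log
}

// removableDuringFreeze reports whether an unfrozen replica could be dropped
// under the proposed heuristic: appliedIndex == lastIndex means there are no
// outstanding Raft commands, so any command that mattered (including the
// freeze itself) would already have been applied.
func removableDuringFreeze(r replicaState) bool {
	return !r.frozen && r.appliedIndex == r.lastIndex
}

func main() {
	idle := replicaState{frozen: false, appliedIndex: 42, lastIndex: 42}
	busy := replicaState{frozen: false, appliedIndex: 40, lastIndex: 42}
	fmt.Println(removableDuringFreeze(idle), removableDuringFreeze(busy)) // true false
}
```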

@bdarnell (Contributor Author) commented

Consistent KV lookups (of the data needed for the GC queue) are still available - that's why we freeze the meta ranges last. I think that when we get a PollFrozen request we can submit all our unfrozen (or unthawed) replicas to the GC queue.
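
Roughly what I have in mind, as a sketch only (store, replica, and gcQueue below are simplified placeholders, not the real types):

```go
package main

import "fmt"

// replica, gcQueue, and store are simplified placeholders used only to
// illustrate the flow; they are not the real storage types.
type replica struct {
	rangeID int64
	frozen  bool
}

type gcQueue struct{ pending []int64 }

func (q *gcQueue) maybeAdd(r *replica) { q.pending = append(q.pending, r.rangeID) }

type store struct {
	replicas []*replica
	gc       gcQueue
}

// pollFrozen counts frozen replicas and hands every replica that is still
// unfrozen (or unthawed) to the GC queue. This only helps while the meta
// ranges are still thawed, because the GC queue needs consistent range
// lookups to decide whether the replica is garbage.
func (s *store) pollFrozen() (frozen, unfrozen int) {
	for _, r := range s.replicas {
		if r.frozen {
			frozen++
			continue
		}
		unfrozen++
		s.gc.maybeAdd(r)
	}
	return frozen, unfrozen
}

func main() {
	s := &store{replicas: []*replica{
		{rangeID: 1, frozen: true},
		{rangeID: 7, frozen: false}, // possibly a GC'able leftover
	}}
	f, u := s.pollFrozen()
	fmt.Println(f, u, s.gc.pending) // 1 1 [7]
}
```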

@tbg (Member) commented Jun 13, 2016

But that's not how it works - we freeze everything first, and then we poll the stores. What you're suggesting is freezing all non-meta ranges first, but then the polling mechanism must run at that point and must ignore all meta ranges (i.e. stores must deliberately exclude their meta ranges from the response when polled), which adds more complications. Still, that is likely the best (and only) way to go, right?

@bdarnell (Contributor Author) commented

Ah, right. I think adding a polling cycle in between each phase of the freeze is likely to be the cleanest solution here.
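
Something along these lines - a sketch under the assumption that user and meta ranges can be frozen separately; freezeSpan and pollStoresUntilFrozen are hypothetical helpers, not the real admin RPCs:

```go
package main

import (
	"context"
	"fmt"
)

// span names a set of ranges to operate on; it and the helpers below are
// hypothetical, standing in for the real freeze and polling RPCs.
type span struct{ name string }

var (
	userRanges = span{"non-meta ranges"}
	metaRanges = span{"meta2 ranges, then range 1"}
)

func freezeSpan(ctx context.Context, s span) error {
	fmt.Println("freeze:", s.name)
	return nil
}

func pollStoresUntilFrozen(ctx context.Context, s span) error {
	fmt.Println("poll stores until frozen:", s.name)
	return nil
}

// freezeClusterInPhases freezes the user ranges first and polls before
// touching the meta ranges, so the replica GC queue (which needs consistent
// range lookups against the still-thawed meta ranges) can clean up any
// GC'able replicas that would otherwise block the poll forever.
func freezeClusterInPhases(ctx context.Context) error {
	if err := freezeSpan(ctx, userRanges); err != nil {
		return err
	}
	if err := pollStoresUntilFrozen(ctx, userRanges); err != nil {
		return err
	}
	if err := freezeSpan(ctx, metaRanges); err != nil {
		return err
	}
	return pollStoresUntilFrozen(ctx, metaRanges)
}

func main() {
	_ = freezeClusterInPhases(context.Background())
}
```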

@tbg (Member) commented Jun 13, 2016

But how does that solve the problem when one of the GC'able ranges is a meta range?

@bdarnell (Contributor Author) commented

Hmm. We can GC a meta2 range as long as range 1 (which contains all of meta1) is unfrozen, but range 1 is tricky. We might be able to use the gossiped range descriptor for this, as long as we keep gossiping the descriptor while frozen. Gossip is not consistent, but the monotonicity of NextReplicaID might be enough for us to tell whether we're a current member of the range or not.

Maybe we could just allow consistent RangeLookups while frozen?
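
To illustrate the NextReplicaID idea, a sketch of the membership check its monotonicity would enable (rangeDescriptor below is a simplified stand-in for the real descriptor proto):

```go
package main

import "fmt"

// rangeDescriptor is a simplified stand-in for the real range descriptor;
// only the fields needed for the check are modeled here.
type rangeDescriptor struct {
	replicaIDs    []int32 // replica IDs that are currently members
	nextReplicaID int32   // monotonically increasing; IDs are never reused
}

// definitelyRemoved reports whether the local replica (ourID) is provably no
// longer a member, given a possibly stale gossiped descriptor. Because
// NextReplicaID only grows, a descriptor with nextReplicaID > ourID is at
// least as new as our own creation, so our absence from it proves removal.
// If nextReplicaID <= ourID, the descriptor may predate us and is
// inconclusive.
func definitelyRemoved(desc rangeDescriptor, ourID int32) bool {
	if desc.nextReplicaID <= ourID {
		return false // descriptor too old to be conclusive
	}
	for _, id := range desc.replicaIDs {
		if id == ourID {
			return false // still a member
		}
	}
	return true
}

func main() {
	desc := rangeDescriptor{replicaIDs: []int32{2, 4, 5}, nextReplicaID: 6}
	fmt.Println(definitelyRemoved(desc, 3)) // true: ID 3 is absent and the descriptor is new enough
	fmt.Println(definitelyRemoved(desc, 7)) // false: descriptor may predate replica 7
}
```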

@petermattis added this to the Q3 milestone Jul 11, 2016
@bdarnell (Contributor Author) commented

As of 30175ad, freeze-cluster is still not working. Is there any more data we need to collect, or do we know enough to understand that this is related to GC'able replicas that can't proceed because of failing range lookups?

I'm not getting the new messages to stdout that were added in #7090. cockroach freeze-cluster is just blocking without printing anything.

@tbg (Member) commented Aug 3, 2016

Sorry, I missed those last two posts. I'll get back to them.

@petermattis (Collaborator) commented

It will be interesting to see whether this is still happening now that we're GC'ing replicas more eagerly.

@tbg removed their assignment Oct 11, 2016
@tbg (Member) commented Oct 11, 2016

We should make this standard procedure again to see how broken this currently is.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 11, 2017
This command was unfinished, and now it never will be, as freezing the
cluster is no longer an acceptable upgrade process.

Closes cockroachdb#7238
Closes cockroachdb#7038
Closes cockroachdb#7928