stability: freeze-cluster never terminates #7038
Comments
Here's an acceptance test where this happens. We know that deadlocks can occur when there are GC'able replicas. We should tackle that first, but also add logging that allows debugging the situation in the wild.
Attempting freeze on beta (I'm trying to incorporate it into my upgrade). I called freeze on the first node:
A repeated call to freeze resulted in:
Thanks for the report, let's see how this behaves once [...]
-- Tobias
@bdarnell, any ideas on how to deal with GC'able replicas in practice? We won't be able to run the GC queue, since consistent KV lookups aren't available anymore at that point.
Consistent KV lookups (of the data needed for the GC queue) are still available - that's why we freeze the meta ranges last. I think that when we get a PollFrozen request we can submit all our unfrozen (or unthawed) replicas to the GC queue.
But that's not how it works - we freeze everything first, and then we poll the stores. What you're suggesting is freezing all non-meta ranges first, but that means the polling mechanism must run at that point and must ignore all meta ranges (i.e. stores must deliberately leave their meta ranges out of the poll response), which adds more complications. That is likely still the best (only) way to go, right?
Ah, right. I think adding a polling cycle in between each phase of the freeze is likely to be the cleanest solution here.
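For illustration, the phased freeze with a polling cycle between phases could be sketched roughly as follows. This is not the actual freeze-cluster code: `Cluster`, `Phase`, `FreezeInPhases`, `ErrStuck`, and the split into three phases (non-meta ranges, then meta2, then range 1) are hypothetical names and assumptions made for this sketch. The poll is bounded so that a stuck replica produces an error instead of a hang.

```go
package main

import "errors"

// Phase is a hypothetical grouping of ranges, frozen in this order.
type Phase int

const (
	NonMeta Phase = iota // all ordinary ranges
	Meta2                // meta2 ranges
	Meta1                // range 1, frozen last
)

// Cluster is a hypothetical stand-in for the RPCs sent to the stores.
type Cluster struct {
	// Freeze asks every store to freeze its replicas in phase p.
	Freeze func(p Phase) error
	// PollUnfrozen reports how many replicas in phase p are still
	// unfrozen. In the scheme discussed above, stores would also submit
	// those replicas to the GC queue when they receive the poll.
	PollUnfrozen func(p Phase) (int, error)
}

// ErrStuck is returned when a phase does not converge within maxPolls.
var ErrStuck = errors.New("freeze: replicas still unfrozen after polling")

// FreezeInPhases freezes one phase at a time and only moves on once
// polling reports that no unfrozen replicas remain in that phase.
func FreezeInPhases(c Cluster, maxPolls int) error {
	for _, p := range []Phase{NonMeta, Meta2, Meta1} {
		if err := c.Freeze(p); err != nil {
			return err
		}
		converged := false
		for i := 0; i < maxPolls; i++ {
			n, err := c.PollUnfrozen(p)
			if err != nil {
				return err
			}
			if n == 0 {
				converged = true
				break
			}
		}
		if !converged {
			return ErrStuck
		}
	}
	return nil
}
```

A real implementation would presumably back off between polls and log which replicas are still unfrozen; both are left out here to keep the sketch short.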
But how does that solve the problem when one of the GC'able ranges is a [...]
-- Tobias
Hmm. We can GC a meta2 range as long as range 1 (which contains all of meta1) is unfrozen, but range 1 is tricky. We might be able to use the gossiped range descriptor for this, as long as we keep gossiping the descriptor while frozen. Gossip is not consistent, but the monotonicity of NextReplicaID might be enough for us to tell whether we're a current member of the range or not. Maybe we could just allow consistent RangeLookups while frozen?
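To make the NextReplicaID idea concrete, here is a minimal sketch of how a replica might classify itself against a possibly stale gossiped descriptor. The types are simplified stand-ins that model only the fields used here, and `Classify` and the `Membership` values are hypothetical names for this sketch. It assumes replica IDs are handed out from NextReplicaID and never reused.

```go
package main

// ReplicaDescriptor is a simplified stand-in; only the fields used
// below are modeled.
type ReplicaDescriptor struct {
	StoreID   int
	ReplicaID int
}

// RangeDescriptor is a simplified stand-in for the gossiped descriptor.
type RangeDescriptor struct {
	Replicas      []ReplicaDescriptor
	NextReplicaID int // assumed to only ever increase
}

// Membership is the (hypothetical) outcome of the check.
type Membership int

const (
	Unknown Membership = iota // descriptor predates our replica ID
	Member                    // listed in the descriptor we saw
	Removed                   // ID was allocated, but we are not listed
)

// Classify compares our replica ID against a gossiped descriptor.
func Classify(gossiped RangeDescriptor, ourReplicaID int) Membership {
	for _, r := range gossiped.Replicas {
		if r.ReplicaID == ourReplicaID {
			return Member
		}
	}
	if ourReplicaID < gossiped.NextReplicaID {
		// The descriptor is at least as new as the allocation of our
		// ID, and IDs are assumed never to be reused, so our absence
		// means we were removed and can be GC'ed.
		return Removed
	}
	// The descriptor is older than our replica; it tells us nothing.
	return Unknown
}
```

Under these assumptions only `Removed` is a safe conclusion from inconsistent gossip. `Member` just means "a member as of the descriptor we happened to see", since a newer descriptor may already have dropped the replica.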
As of 30175ad, I'm not getting the new messages to stdout that were added in #7090.
Sorry, I missed those two last posts. I'll get back to them.
Will be interesting to see if this is still happening now that we're more eagerly GC'ing replicas.
We should make this standard procedure again to see how broken this currently is.
This command was unfinished, and now it never will be, as freezing the cluster is no longer an acceptable upgrade process.
Closes cockroachdb#7238
Closes cockroachdb#7038
Closes cockroachdb#7928
In #6985, we found that freeze-cluster would hang without completing. We need to make this process reliable before we can recommend it as a part of the regular upgrade process.
In the meantime, we should be trying freeze-cluster as a part of all of our upgrades for testing purposes even though we're not yet recommending that regular users do so.