stability: freeze-cluster never terminates #7038

Closed
bdarnell opened this issue Jun 3, 2016 · 14 comments

@bdarnell (Contributor) commented Jun 3, 2016

In #6985, we found that freeze-cluster would hang without completing. We need to make this process reliable before we can recommend it as a part of the regular upgrade process.

In the meantime, we should be trying freeze-cluster as a part of all of our upgrades for testing purposes even though we're not yet recommending that regular users do so.

@tbg (Member) commented Jun 5, 2016

Here's an acceptance test where this happens. We know that deadlocks can occur when there are GC'able replicas - we should tackle that first, but also add logging that makes the situation debuggable in the wild.

https://circleci.com/gh/cockroachdb/cockroach/18635

@mberhault (Contributor) commented Jun 8, 2016

Attempting freeze on beta (I'm trying to incorporate it into my upgrade process).
The cluster was otherwise healthy, but the freeze timed out.

I called freeze on the first node: 104.196.97.139

{
  "Error": "timed out waiting for 1 store to report freeze",
  "Code": 2
}

A repeated call to freeze resulted in:

{
  "ranges_affected": "0"
}

@tbg (Member) commented Jun 8, 2016

Thanks for the report, let's see how this behaves once #7090 is in. I suspect that there's a GC'able range in the system (since all addressable ranges are apparently frozen).

@tbg (Member) commented Jun 8, 2016

#7090 and #7104 have landed; from now on we should get more actionable information when this happens again.

@tbg (Member) commented Jun 8, 2016

@bdarnell, any ideas on how to deal with GC'able replicas in practice? We won't be able to run the GC queue (since consistent KV lookups are no longer available at that point).
I could remove unfrozen replicas which have no outstanding Raft commands (appliedIndex == lastIndex), under the assumption that they would otherwise have been affected by the scans, but that's a big hammer.
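
For concreteness, here is a minimal Go sketch of that heuristic; replicaState and its fields are hypothetical stand-ins for the real per-replica state, not the actual storage types:

```go
package main

import "fmt"

// replicaState is a hypothetical stand-in for the per-replica Raft state
// referenced above; the real fields live elsewhere in the storage layer.
type replicaState struct {
	frozen       bool
	appliedIndex uint64 // highest Raft log index applied to the state machine
	lastIndex    uint64 // highest Raft log index written to the log
}

// removableDuringFreeze reports whether an unfrozen replica could be dropped
// under the proposed heuristic: appliedIndex == lastIndex means there are no
// outstanding Raft commands, so any command that mattered (including the
// freeze itself) would already have been applied.
func removableDuringFreeze(r replicaState) bool {
	return !r.frozen && r.appliedIndex == r.lastIndex
}

func main() {
	idle := replicaState{frozen: false, appliedIndex: 42, lastIndex: 42}
	busy := replicaState{frozen: false, appliedIndex: 40, lastIndex: 42}
	fmt.Println(removableDuringFreeze(idle), removableDuringFreeze(busy)) // true false
}
```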

@bdarnell (Contributor Author) commented

Consistent KV lookups (of the data needed for the GC queue) are still available - that's why we freeze the meta ranges last. I think that when we get a PollFrozen request we can submit all our unfrozen (or unthawed) replicas to the GC queue.
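
Roughly what I have in mind, as a sketch only (store, replica, and gcQueue below are simplified placeholders, not the real types):

```go
package main

import "fmt"

// replica, gcQueue, and store are simplified placeholders used only to
// illustrate the flow; they are not the real storage types.
type replica struct {
	rangeID int64
	frozen  bool
}

type gcQueue struct{ pending []int64 }

func (q *gcQueue) maybeAdd(r *replica) { q.pending = append(q.pending, r.rangeID) }

type store struct {
	replicas []*replica
	gc       gcQueue
}

// pollFrozen counts frozen replicas and hands every replica that is still
// unfrozen (or unthawed) to the GC queue. This only helps while the meta
// ranges are still thawed, because the GC queue needs consistent range
// lookups to decide whether the replica is garbage.
func (s *store) pollFrozen() (frozen, unfrozen int) {
	for _, r := range s.replicas {
		if r.frozen {
			frozen++
			continue
		}
		unfrozen++
		s.gc.maybeAdd(r)
	}
	return frozen, unfrozen
}

func main() {
	s := &store{replicas: []*replica{
		{rangeID: 1, frozen: true},
		{rangeID: 7, frozen: false}, // possibly a GC'able leftover
	}}
	f, u := s.pollFrozen()
	fmt.Println(f, u, s.gc.pending) // 1 1 [7]
}
```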

@tbg (Member) commented Jun 13, 2016

But that's not how it works - we freeze everything first, and then we poll the stores. What you're suggesting is freezing all non-meta ranges first, but then the polling mechanism must run at that point and must ignore all meta ranges (i.e. stores must deliberately exclude their meta ranges from the response when polled), which adds more complications. Still, that is likely the best (and only) way to go, right?

@bdarnell (Contributor Author) commented

Ah, right. I think adding a polling cycle in between each phase of the freeze is likely to be the cleanest solution here.
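
Something along these lines - a sketch under the assumption that user and meta ranges can be frozen separately; freezeSpan and pollStoresUntilFrozen are hypothetical helpers, not the real admin RPCs:

```go
package main

import (
	"context"
	"fmt"
)

// span names a set of ranges to operate on; it and the helpers below are
// hypothetical, standing in for the real freeze and polling RPCs.
type span struct{ name string }

var (
	userRanges = span{"non-meta ranges"}
	metaRanges = span{"meta2 ranges, then range 1"}
)

func freezeSpan(ctx context.Context, s span) error {
	fmt.Println("freeze:", s.name)
	return nil
}

func pollStoresUntilFrozen(ctx context.Context, s span) error {
	fmt.Println("poll stores until frozen:", s.name)
	return nil
}

// freezeClusterInPhases freezes the user ranges first and polls before
// touching the meta ranges, so the replica GC queue (which needs consistent
// range lookups against the still-thawed meta ranges) can clean up any
// GC'able replicas that would otherwise block the poll forever.
func freezeClusterInPhases(ctx context.Context) error {
	if err := freezeSpan(ctx, userRanges); err != nil {
		return err
	}
	if err := pollStoresUntilFrozen(ctx, userRanges); err != nil {
		return err
	}
	if err := freezeSpan(ctx, metaRanges); err != nil {
		return err
	}
	return pollStoresUntilFrozen(ctx, metaRanges)
}

func main() {
	_ = freezeClusterInPhases(context.Background())
}
```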

@tbg (Member) commented Jun 13, 2016

But how does that solve the problem when one of the GC'able ranges is a meta range?

@bdarnell (Contributor Author) commented

Hmm. We can GC a meta2 range as long as range 1 (which contains all of meta1) is unfrozen, but range 1 is tricky. We might be able to use the gossiped range descriptor for this, as long as we keep gossiping the descriptor while frozen. Gossip is not consistent, but the monotonicity of NextReplicaID might be enough for us to tell whether we're a current member of the range or not.

Maybe we could just allow consistent RangeLookups while frozen?
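
To illustrate the NextReplicaID idea, a sketch of the membership check its monotonicity would enable (rangeDescriptor below is a simplified stand-in for the real descriptor proto):

```go
package main

import "fmt"

// rangeDescriptor is a simplified stand-in for the real range descriptor;
// only the fields needed for the check are modeled here.
type rangeDescriptor struct {
	replicaIDs    []int32 // replica IDs that are currently members
	nextReplicaID int32   // monotonically increasing; IDs are never reused
}

// definitelyRemoved reports whether the local replica (ourID) is provably no
// longer a member, given a possibly stale gossiped descriptor. Because
// NextReplicaID only grows, a descriptor with nextReplicaID > ourID is at
// least as new as our own creation, so our absence from it proves removal.
// If nextReplicaID <= ourID, the descriptor may predate us and is
// inconclusive.
func definitelyRemoved(desc rangeDescriptor, ourID int32) bool {
	if desc.nextReplicaID <= ourID {
		return false // descriptor too old to be conclusive
	}
	for _, id := range desc.replicaIDs {
		if id == ourID {
			return false // still a member
		}
	}
	return true
}

func main() {
	desc := rangeDescriptor{replicaIDs: []int32{2, 4, 5}, nextReplicaID: 6}
	fmt.Println(definitelyRemoved(desc, 3)) // true: ID 3 is absent and the descriptor is new enough
	fmt.Println(definitelyRemoved(desc, 7)) // false: descriptor may predate replica 7
}
```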

@petermattis added this to the Q3 milestone Jul 11, 2016
@bdarnell (Contributor Author) commented

As of 30175ad, freeze-cluster is still not working. Is there any more data we need to collect, or do we know enough to understand that this is related to GC'able replicas that can't proceed because of failing range lookups?

I'm not getting the new messages to stdout that were added in #7090. cockroach freeze-cluster is just blocking without printing anything.

@tbg (Member) commented Aug 3, 2016

Sorry, I missed those last two posts. I'll get back to them.

@petermattis (Collaborator) commented

It will be interesting to see whether this is still happening now that we're GC'ing replicas more eagerly.

@tbg removed their assignment Oct 11, 2016
@tbg (Member) commented Oct 11, 2016

We should make this standard procedure again to see how broken this currently is.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 11, 2017
This command was unfinished, and now it never will be, as freezing the
cluster is no longer an acceptable upgrade process.

Closes cockroachdb#7238
Closes cockroachdb#7038
Closes cockroachdb#7928