
RFC: quiesce ranges [postponed] #8811

Merged Sep 2, 2016 (1 commit into cockroachdb:develop)
Conversation

@dt (Member) commented Aug 24, 2016

This change is Reviewable

@jordanlewis (Member)

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


docs/RFCS/quiesce_ranges.md, line 16 [r1] (raw file):

As clusters grow to potentially very large numbers of ranges, if access patterns
are such that some of those ranges see no traffic, maintaining the raft groups
and other state for them potentially wastes substantial resources for little

Consider dropping this second "potentially"


docs/RFCS/quiesce_ranges.md, line 59 [r1] (raw file):

Once quiesced, any replica receiving a request should restart raft and trigger
an election to be able to acquire the lease. Theortically, the lease holder

theoretically


docs/RFCS/quiesce_ranges.md, line 62 [r1] (raw file):

could serve reads for the remainder of its lease but if a request is received,
it is reasonable to resume raft immediately as the range is no longer inactive
and thus should no longer be quiesced.

What happens on the raft layer when a killed node is restarted? How does it figure out that a range is quiesced, and how does this interact with the lazy replica initialization? I think it could be useful to include these scenarios in the explanation.
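For concreteness, here is a minimal sketch of the wake-up path the quoted text describes. The `Replica` shape and `quiesced` flag are hypothetical illustration; `RawNode.Campaign` is real etcd/raft API:

```go
package quiesce

import (
	"sync"

	"github.com/coreos/etcd/raft" // etcd/raft import path of this era
)

// Replica is a pared-down, hypothetical stand-in for the real type.
type Replica struct {
	mu       sync.Mutex
	quiesced bool
	rawNode  *raft.RawNode
}

// maybeUnquiesce sketches the quoted behavior: the first request against
// a quiesced replica resumes raft and campaigns so that some replica can
// acquire the lease. Campaign moves the node to candidate state, which
// bumps the term and starts an election.
func (r *Replica) maybeUnquiesce() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.quiesced {
		return nil
	}
	r.quiesced = false
	return r.rawNode.Campaign()
}
```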



docs/RFCS/quiesce_ranges.md, line 58 [r1] (raw file):

more than the election timeout, it was going to call the election anyway, thus
ending the term that quiesced.

Once quiesced, any replica receiving a request should restart raft and trigger

@tschottdorf (Member) commented:

s/receiving a request/asked to propose a command/

@tbg (Member) commented Aug 25, 2016

Consider also adding some discussion on the urgency of this proposal - I think it's fairly low compared to some other, more immediate problems we have with Raft performance: we bottleneck when many replicas make substantial progress simultaneously, in particular when snapshots are involved (#8638).
In that context, a back-of-the-envelope computation could also be interesting: if we have (say) a block writer table with single-point inserts going randomly all over the place, how often (on average) is each range hit by an insert in the long run? When does that time get long enough for quiescing to make sense (given that the next insert will pay double)? At approximately what magnitude of number of replicas do we expect the simple housekeeping (for example, ticking them all) to be measurable? I think these are interesting questions - my hunch is that with coalesced heartbeats in, quiescence is really only useful or needed for very large data sets (which are correspondingly rare, at least in the foreseeable future), and that we need to allow for a lot of active Raft groups at a time in the common case because of the penalty associated with recreation.
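To make the back-of-the-envelope concrete, here is a sketch with entirely made-up numbers (both constants are assumptions, not measurements):

```go
package main

import "fmt"

func main() {
	// Hypothetical block_writer-style workload: uniformly random
	// single-point inserts spread across all ranges.
	const (
		numRanges     = 1e6 // assumed range count for a very large cluster
		insertsPerSec = 1e4 // assumed aggregate insert rate
	)
	// With uniform access, each range is hit on average once every
	// numRanges/insertsPerSec seconds.
	meanGap := numRanges / insertsPerSec
	fmt.Printf("mean gap between inserts per range: %.0fs\n", meanGap)
	// Quiescing only pays off when this gap is long relative to the
	// cost of reviving the group (the "next insert pays double" penalty).
}
```

With these numbers each range sits idle for minutes between inserts, but the answer moves by an order of magnitude with either constant, so it would be worth redoing with measured rates.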


docs/RFCS/quiesce_ranges.md, line 70 [r1] (raw file):

This almost certainly needs to be implemented at least partially upstream in
raft -- inspecting the follower state to determine when it is safe to stop
heartbeating involves inspecting internal raft state.

@tschottdorf (Member) commented:

Could add here that this feature may not be received enthusiastically in upstream Raft, as we're the only multi-range use case and the cost of letting a single Raft group live is marginal.
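A sketch of the check the quoted text says needs upstream support. `Status`, `Progress`, `Match`, and `StateLeader` are real etcd/raft names; the function itself and the `lastIndex` plumbing are assumptions:

```go
package quiesce

import "github.com/coreos/etcd/raft" // etcd/raft import path of this era

// canQuiesce reports whether it appears safe to stop heartbeating: we
// are the leader and every follower has acknowledged our last log index.
// Only the leader has Progress populated, which is exactly the internal
// raft state the quoted text says would need to be exposed.
func canQuiesce(st raft.Status, lastIndex uint64) bool {
	if st.RaftState != raft.StateLeader {
		return false
	}
	for id, pr := range st.Progress {
		if id == st.ID {
			continue // our own progress entry
		}
		if pr.Match < lastIndex {
			return false // a follower is still catching up
		}
	}
	return true
}
```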

@petermattis (Collaborator)

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


docs/RFCS/quiesce_ranges.md, line 65 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

And this complexity becomes moot when/if we implement node leases (#4288). Should we push forward with that plan instead of the stopgap solution of quiescent ranges?

Another alternative is to stop calling `RawNode.Tick()` and take over responsibility for replica related timers. @bdarnell I think you mentioned this somewhere else and my initial exploration of the raft code suggests it would be feasible. This approach would need to be folded into some sort of coalesced heartbeat or node lease proposal to see any benefit.
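A minimal sketch of this alternative, assuming hypothetical Store and Replica shapes; `RawNode.Tick` is real etcd/raft API:

```go
package quiesce

import (
	"sync"
	"time"

	"github.com/coreos/etcd/raft" // etcd/raft import path of this era
)

type replica struct {
	quiesced bool
	rawNode  *raft.RawNode
}

type store struct {
	mu       sync.Mutex
	replicas map[uint64]*replica // keyed by range ID
}

// raftTickLoop sketches the idea: the store owns the replica timers and
// simply skips Tick for quiesced replicas, suppressing their election
// and heartbeat timeouts without tearing the raft group down.
func (s *store) raftTickLoop(t *time.Ticker) {
	for range t.C {
		s.mu.Lock()
		for _, r := range s.replicas {
			if r.quiesced {
				continue
			}
			r.rawNode.Tick()
		}
		s.mu.Unlock()
	}
}
```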


@a-robinson (Contributor)

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


docs/RFCS/quiesce_ranges.md, line 29 [r1] (raw file):

Additionally, some operations may be able to use the fact a range is in a
quiesced state. For example, some possible implementations of bulk ingestion of
*new* ranges could be simplified (by being able assume the raft state is

being able to assume



@dt (Member, Author) commented Aug 30, 2016

Context for people who missed the discussion with @mjibson and @tschottdorf last week:
While this was originally proposed (in #357) purely for the network and resource savings, the ability to freeze raft state for a range is potentially useful e.g. for bulk ingestion.

So while the potential resource usage savings haven't yet seemed like enough on their own to make this a priority, its potential as a building block for other things is why we wanted to explore it further now (though someone recently mentioning an idle cluster burning hundreds of gigs of network suggests those savings might be worth something on their own too).


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/quiesce_ranges.md, line 16 [r1] (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Consider dropping this second "potentially"

Done.

docs/RFCS/quiesce_ranges.md, line 29 [r1] (raw file):

Previously, a-robinson (Alex Robinson) wrote…

being able to assume

Done.

docs/RFCS/quiesce_ranges.md, line 58 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/receiving a request/asked to propose a command/

Done.

docs/RFCS/quiesce_ranges.md, line 59 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/Theort/Theoret/

Done.

docs/RFCS/quiesce_ranges.md, line 59 [r1] (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

theoretically

Done.

docs/RFCS/quiesce_ranges.md, line 60 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This is already handled implicitly by the fact that we extend the lease well before it expires, so I don't think anything needs to be done here.

Done.

docs/RFCS/quiesce_ranges.md, line 62 [r1] (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

What happens on the raft layer when a killed node is restarted? How does it figure out that a range is quiesced, and how does this interact with the lazy replica initialization? I think it could be useful to include these scenarios in the explanation.

It'll just wake everyone up and then they'll all go back to sleep. Added a note to the drawbacks section about spurious wake-ups being OK.

docs/RFCS/quiesce_ranges.md, line 64 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Another drawback: This proposal makes assumptions about access patterns, namely that a cluster's access patterns are such that a substantial fraction of ranges could become quiescent. Some clusters will not see any benefit from this change.

Done.

docs/RFCS/quiesce_ranges.md, line 65 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Another alternative is to stop calling RawNode.Tick() and take over responsibility for replica related timers. @bdarnell I think you mentioned this somewhere else and my initial exploration of the raft code suggests it would be feasible. This approach would need to be folded in to some sort of coalesced heartbeat or node lease proposal to see any benefit.

Quiescing is also motivated by being able to freeze a range's raft state (e.g. for bulk ingestion), not just by resource savings.

Whether we discard the raft group or just stop ticking it, we still need to coordinate doing so across nodes at the same-ish time so that nodes don't think their peers have died.
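To make the coordination problem concrete, here is a hypothetical message shape (not the RFC's actual proposal): the leader asks followers to go quiet at a specific state, and any follower that is behind refuses, so the leader keeps heartbeating:

```go
package quiesce

import "sync"

// QuiesceRequest is a hypothetical message the leader would send once
// it believes all followers are caught up.
type QuiesceRequest struct {
	RangeID uint64
	Term    uint64 // term in which the group goes quiet
	Commit  uint64 // leader's commit index at quiesce time
}

type replica struct {
	mu          sync.Mutex
	quiesced    bool
	lastApplied uint64
}

// onQuiesce sketches the follower side: only go quiet if fully caught
// up; otherwise refuse, so the leader keeps ticking and this follower
// never mistakes the ensuing silence for a dead peer.
func (r *replica) onQuiesce(req QuiesceRequest) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.lastApplied < req.Commit {
		return false
	}
	r.quiesced = true
	return true
}
```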


docs/RFCS/quiesce_ranges.md, line 70 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Could add here that this feature may not be received enthusiastically in upstream Raft, as we're the only multi-range use case and the cost of letting a single Raft group live is marginal.

Done.

docs/RFCS/quiesce_ranges.md, line 72 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

@spencerkimball has suggested handling quiescing in a different way, in which the former leader can optimistically attempt to revive the group with the same term and leader as before (if a node other than the former leader does the reviving, it will need to increment the term and have a full election).

Added under possible follow-on optimizations.


@bdarnell (Contributor)

The document is written from the perspective that this is an optimization; if the primary motivation is now bulk ingestion, that should be reflected in the doc.

Quiescing-as-optimization and quiescing-for-bulk-ingestion have rather different needs. As discussed below, raft-level quiescing is fragile since there's no great way to synchronize it across all nodes and the slightest breeze will bring the raft group back. That's fine if it's an optimization, but not if you're semantically relying on it. I think for bulk ingestion it would be better to base things on the Freeze command and avoid low-level raft tinkering.


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.



@dt (Member, Author) commented Aug 31, 2016

To some degree I was hoping quiescing could be used both by bulk ingestion and as a resource usage optimization -- the machinery discussed herein for how to quiesce a range is hopefully general enough that it could be used either by a "looks like this range is inactive, so I can save some cycles" routine or by an explicit "I need this range to freeze its raft state so I can do something" bulk ingestion path.

The current freeze doesn't quite work for some of the current proposals for the latter, as they involve simple snapshots that happen to correspond to an applied index in the future, and thus need to be sure raft is inactive and that the index stays in the future. All of that, I believe, is covered in more detail in the forthcoming bulk-ingestion-of-ranges RFC that @paperstreet is writing -- this just got broken out into a standalone RFC that it could reference, and so we could gauge scope, particularly since it might have utility outside bulk ingestion as well.


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.



@tbg (Member) commented Aug 31, 2016

Having discussed this a fair bit, I think that quiescing for restore purposes is a long shot (or at least so complicated that we're better off pondering alternatives), so I think that we're best off focusing here on the merit quiescing could have outside of that context.


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.



@dt (Member, Author) commented Aug 31, 2016

Ok, sounds like we've gathered more or less what we wanted to know for backup/restore -- quiescing active ranges seems tricky enough and the other benefits questionable enough that we might be better off exploring other ways for bulk ingestion first.

Should I change status to rejected and merge, or just close unmerged?

@bdarnell (Contributor) commented Sep 1, 2016

Merge as rejected, or maybe as "postponed" or "deferred".


@dt changed the title from "RFC: quiesce ranges [first draft]" to "RFC: quiesce ranges [postponed]" on Sep 1, 2016
@dt (Member, Author) commented Sep 1, 2016

Done.

@dt merged commit a4e2b2f into cockroachdb:develop on Sep 2, 2016
@dt deleted the quiesce branch on September 2, 2016