
RFC: quiesce ranges [postponed] #8811

Merged Sep 2, 2016 (1 commit into cockroachdb:develop)
Conversation

@dt (Member) commented Aug 24, 2016

This change is Reviewable

@jordanlewis (Member)

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


docs/RFCS/quiesce_ranges.md, line 16 [r1] (raw file):

As clusters grow to potentially very large numbers of ranges, if access patterns
are such that some of those ranges see no traffic, maintaining the raft groups
and other state for them potentially wastes substantial resources for little

Consider dropping this second "potentially"


docs/RFCS/quiesce_ranges.md, line 59 [r1] (raw file):

Once quiesced, any replica receiving a request should restart raft and trigger
an election to be able to acquire the lease. Theortically, the lease holder

theoretically


docs/RFCS/quiesce_ranges.md, line 62 [r1] (raw file):

could serve reads for the remainder of its lease but if a request is received,
it is reasonable to resume raft immediately as the range is no longer inactive
and thus should no longer be quiesced.

What happens on the raft layer when a killed node is restarted? How does it figure out that a range is quiesced, and how does this interact with the lazy replica initialization? I think it could be useful to include these scenarios in the explanation.
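For concreteness, here is a minimal sketch of the wake-up path the quoted text describes. The `Replica` shape and `quiesced` flag are hypothetical illustration; `RawNode.Campaign` is real etcd/raft API:

```go
package quiesce

import (
	"sync"

	"github.com/coreos/etcd/raft" // etcd/raft import path of this era
)

// Replica is a pared-down, hypothetical stand-in for the real type.
type Replica struct {
	mu       sync.Mutex
	quiesced bool
	rawNode  *raft.RawNode
}

// maybeUnquiesce sketches the quoted behavior: the first request against
// a quiesced replica resumes raft and campaigns so that some replica can
// acquire the lease. Campaign moves the node to candidate state, which
// bumps the term and starts an election.
func (r *Replica) maybeUnquiesce() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.quiesced {
		return nil
	}
	r.quiesced = false
	return r.rawNode.Campaign()
}
```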



docs/RFCS/quiesce_ranges.md, line 58 [r1] (raw file):

more than the election timeout, it was going to call the election anyway, thus
ending the term that quiesced.

Once quiesced, any replica receiving a request should restart raft and trigger

@tschottdorf (Member) commented:

s/receiving a request/asked to propose a command/

@tbg (Member) commented Aug 25, 2016

Consider also adding some discussion on the urgency of this proposal - I think it's fairly low compared to some other, more immediate problems we have with Raft performance: we bottleneck when many replicas make substantial progress simultaneously, in particular when snapshots are involved (#8638).
In that context, a back-of-the-envelope computation could also be interesting: if we have (say) a block writer table with single-point inserts going randomly all over the place, how often (on average) is each range hit by an insert in the long run? When does that time get long enough for quiescing to make sense (given that the next insert will pay double)? At approximately what magnitude of number of replicas do we expect the simple housekeeping (for example, ticking them all) to be measurable? I think these are interesting questions - my hunch is that with coalesced heartbeats in, quiescence is really only useful or needed for very large data sets (which are correspondingly rare, at least in the foreseeable future), and that we need to allow for a lot of active Raft groups at a time in the common case because of the penalty associated with recreation.
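To make the back-of-the-envelope concrete, here is a sketch with entirely made-up numbers (both constants are assumptions, not measurements):

```go
package main

import "fmt"

func main() {
	// Hypothetical block_writer-style workload: uniformly random
	// single-point inserts spread across all ranges.
	const (
		numRanges     = 1e6 // assumed range count for a very large cluster
		insertsPerSec = 1e4 // assumed aggregate insert rate
	)
	// With uniform access, each range is hit on average once every
	// numRanges/insertsPerSec seconds.
	meanGap := numRanges / insertsPerSec
	fmt.Printf("mean gap between inserts per range: %.0fs\n", meanGap)
	// Quiescing only pays off when this gap is long relative to the
	// cost of reviving the group (the "next insert pays double" penalty).
}
```

With these numbers each range sits idle for minutes between inserts, but the answer moves by an order of magnitude with either constant, so it would be worth redoing with measured rates.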


docs/RFCS/quiesce_ranges.md, line 70 [r1] (raw file):

This almost certainly needs to be implemented at least partially upstream in
raft -- inspecting the follower state to determine when it is safe to stop
heartbeating involves inspecting internal raft state.

@tschottdorf (Member) commented:

Could add here that this feature may not be received enthusiastically in upstream Raft, as we're the only multi-range use case and the cost of letting a single Raft group live is marginal.
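A sketch of the check the quoted text says needs upstream support. `Status`, `Progress`, `Match`, and `StateLeader` are real etcd/raft names; the function itself and the `lastIndex` plumbing are assumptions:

```go
package quiesce

import "github.com/coreos/etcd/raft" // etcd/raft import path of this era

// canQuiesce reports whether it appears safe to stop heartbeating: we
// are the leader and every follower has acknowledged our last log index.
// Only the leader has Progress populated, which is exactly the internal
// raft state the quoted text says would need to be exposed.
func canQuiesce(st raft.Status, lastIndex uint64) bool {
	if st.RaftState != raft.StateLeader {
		return false
	}
	for id, pr := range st.Progress {
		if id == st.ID {
			continue // our own progress entry
		}
		if pr.Match < lastIndex {
			return false // a follower is still catching up
		}
	}
	return true
}
```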

@petermattis (Collaborator)

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


docs/RFCS/quiesce_ranges.md, line 65 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

And this complexity becomes moot when/if we implement node leases (#4288). Should we push forward with that plan instead of the stopgap solution of quiescent ranges?

Another alternative is to stop calling `RawNode.Tick()` and take over responsibility for replica related timers. @bdarnell I think you mentioned this somewhere else and my initial exploration of the raft code suggests it would be feasible. This approach would need to be folded into some sort of coalesced heartbeat or node lease proposal to see any benefit.
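A minimal sketch of this alternative, assuming hypothetical Store and Replica shapes; `RawNode.Tick` is real etcd/raft API:

```go
package quiesce

import (
	"sync"
	"time"

	"github.com/coreos/etcd/raft" // etcd/raft import path of this era
)

type replica struct {
	quiesced bool
	rawNode  *raft.RawNode
}

type store struct {
	mu       sync.Mutex
	replicas map[uint64]*replica // keyed by range ID
}

// raftTickLoop sketches the idea: the store owns the replica timers and
// simply skips Tick for quiesced replicas, suppressing their election
// and heartbeat timeouts without tearing the raft group down.
func (s *store) raftTickLoop(t *time.Ticker) {
	for range t.C {
		s.mu.Lock()
		for _, r := range s.replicas {
			if r.quiesced {
				continue
			}
			r.rawNode.Tick()
		}
		s.mu.Unlock()
	}
}
```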


@a-robinson (Contributor)

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


docs/RFCS/quiesce_ranges.md, line 29 [r1] (raw file):

Additionally, some operations may be able to use the fact a range is in a
quiesced state. For example, some possible implementations of bulk ingestion of
*new* ranges could be simplified (by being able assume the raft state is

being able to assume



@dt (Member, Author) commented Aug 30, 2016

Context for people who missed the discussion with @mjibson and @tschottdorf last week:
While this was originally proposed (in #357) purely for the network and resource savings, the ability to freeze raft state for a range is potentially useful e.g. for bulk ingestion.

So while the potential resource usage savings haven't yet seemed like enough on their own to make this a priority, its potential as a building block for other things is why we wanted to explore it further now (though someone recently mentioning an idle cluster burning hundreds of gigs of network suggests those savings might be worth something on their own too).


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/quiesce_ranges.md, line 16 [r1] (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Consider dropping this second "potentially"

Done.

docs/RFCS/quiesce_ranges.md, line 29 [r1] (raw file):

Previously, a-robinson (Alex Robinson) wrote…

being able to assume

Done.

docs/RFCS/quiesce_ranges.md, line 58 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/receiving a request/asked to propose a command/

Done.

docs/RFCS/quiesce_ranges.md, line 59 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/Theort/Theoret/

Done.

docs/RFCS/quiesce_ranges.md, line 59 [r1] (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

theoretically

Done.

docs/RFCS/quiesce_ranges.md, line 60 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This is already handled implicitly by the fact that we extend the lease well before it expires, so I don't think anything needs to be done here.

Done.

docs/RFCS/quiesce_ranges.md, line 62 [r1] (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

What happens on the raft layer when a killed node is restarted? How does it figure out that a range is quiesced, and how does this interact with the lazy replica initialization? I think it could be useful to include these scenarios in the explanation.

It'll just wake everyone up and then they'll all go back to sleep. Added a note to the drawbacks section about spurious wake-ups being OK.

docs/RFCS/quiesce_ranges.md, line 64 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Another drawback: This proposal makes assumptions about access patterns, namely that a cluster's access patterns are such that a substantial fraction of ranges could become quiescent. Some clusters will not see any benefit from this change.

Done.

docs/RFCS/quiesce_ranges.md, line 65 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Another alternative is to stop calling RawNode.Tick() and take over responsibility for replica related timers. @bdarnell I think you mentioned this somewhere else and my initial exploration of the raft code suggests it would be feasible. This approach would need to be folded in to some sort of coalesced heartbeat or node lease proposal to see any benefit.

Quiescing is also motivated by being able to freeze a range's raft state (e.g. for bulk ingestion), not just by resource savings.

Whether we discard the raft group or just stop ticking it, we still need to coordinate doing so across nodes at the same-ish time so that nodes don't think their peers have died.
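To make the coordination problem concrete, here is a hypothetical message shape (not the RFC's actual proposal): the leader asks followers to go quiet at a specific state, and any follower that is behind refuses, so the leader keeps heartbeating:

```go
package quiesce

import "sync"

// QuiesceRequest is a hypothetical message the leader would send once
// it believes all followers are caught up.
type QuiesceRequest struct {
	RangeID uint64
	Term    uint64 // term in which the group goes quiet
	Commit  uint64 // leader's commit index at quiesce time
}

type replica struct {
	mu          sync.Mutex
	quiesced    bool
	lastApplied uint64
}

// onQuiesce sketches the follower side: only go quiet if fully caught
// up; otherwise refuse, so the leader keeps ticking and this follower
// never mistakes the ensuing silence for a dead peer.
func (r *replica) onQuiesce(req QuiesceRequest) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.lastApplied < req.Commit {
		return false
	}
	r.quiesced = true
	return true
}
```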


docs/RFCS/quiesce_ranges.md, line 70 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Could add here that this feature may not be received enthusiastically in upstream Raft, as we're the only multi-range use case and the cost of letting a single Raft group live is marginal.

Done.

docs/RFCS/quiesce_ranges.md, line 72 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

@spencerkimball has suggested handling quiescing in a different way, in which the former leader can optimistically attempt to revive the group with the same term and leader as before (if a node other than the former leader does the reviving, it will need to increment the term and have a full election).

Added under possible follow-on optimizations.


@bdarnell (Contributor)

The document is written from the perspective that this is an optimization; if the primary motivation is now bulk ingestion, that should be reflected in the doc.

Quiescing-as-optimization and quiescing-for-bulk-ingestion have rather different needs. As discussed below, raft-level quiescing is fragile since there's no great way to synchronize it across all nodes and the slightest breeze will bring the raft group back. That's fine if it's an optimization, but not if you're semantically relying on it. I think for bulk ingestion it would be better to base things on the Freeze command and avoid low-level raft tinkering.


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.



@dt (Member, Author) commented Aug 31, 2016

To some degree I was hoping quiescing could be used both by bulk ingestion and as a resource usage optimization -- the machinery discussed herein for how to quiesce a range is hopefully general enough that it could be used either by a "looks like this range is inactive, so I can save some cycles" routine or by an explicit "I need this range to freeze its raft state so I can do something" bulk ingestion path.

The current freeze doesn't quite work for some of the current proposals for the latter, as they involve simple snapshots that happen to correspond to an applied index in the future, and thus need to be sure raft is inactive and that the index stays in the future. All of that, I believe, is covered in more detail in the forthcoming bulk-ingestion-of-ranges RFC that @paperstreet is writing -- this just got broken out into a standalone RFC that it could reference, and so we could gauge scope, particularly since it might have utility outside bulk ingestion as well.


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.



@tbg (Member) commented Aug 31, 2016

Having discussed this a fair bit, I think that quiescing for restore purposes is a long shot (or at least so complicated that we're better off pondering alternatives), so I think that we're best off focusing here on the merit quiescing could have outside of that context.


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.



@dt (Member, Author) commented Aug 31, 2016

Ok, sounds like we've gathered more or less what we wanted to know for backup/restore -- quiescing active ranges seems tricky enough and the other benefits questionable enough that we might be better off exploring other ways for bulk ingestion first.

Should I change status to rejected and merge, or just close unmerged?

@bdarnell (Contributor) commented Sep 1, 2016

Merge as rejected, or maybe as "postponed" or "deferred".


@dt changed the title from "RFC: quiesce ranges [first draft]" to "RFC: quiesce ranges [postponed]" on Sep 1, 2016
@dt (Member, Author) commented Sep 1, 2016

Done.

@dt merged commit a4e2b2f into cockroachdb:develop on Sep 2, 2016
@dt deleted the quiesce branch on September 2, 2016