-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zen2] Introduce ElectionScheduler #32846
[Zen2] Introduce ElectionScheduler #32846
Conversation
The ElectionScheduler runs while there is no known elected master and is responsible for scheduling elections randomly, backing off on failure, to balance the desire to elect a master quickly with the desire to avoid more than one node starting an election at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some smaller comments. Main change looks good though.
@Nullable | ||
private volatile Object currentScheduler; // only care about its identity; null if stopped | ||
|
||
ElectionScheduler(Settings settings, Random random, ThreadPool threadPool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this public please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
|
||
public class ElectionScheduler extends AbstractComponent { | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this as Javadoc to the class level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
/** | ||
* @param upperBound exclusive upper bound | ||
*/ | ||
private long randomPositiveLongLessThan(long upperBound) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a static method (random instance as parameter) so that it can be unittested in isolation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. And I added tests.
return nonNegative(random.nextLong()) % (upperBound - 1) + 1; | ||
} | ||
|
||
private long backOffCurrentMaxDelay(long currentMaxDelayMillis) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call this boundDelayByMaxTimeout
? In any case, please add a comment what this method does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, more docs.
logger.debug("{} not starting election", ElectionScheduler.this); | ||
return; | ||
} | ||
logger.debug("{} starting pre-voting", ElectionScheduler.this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a more generic message here? Something like executing scheduled election
. The Prevoting class can output it's own log line that prevoting has started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
|
||
private class ElectionScheduler implements Releasable { | ||
private AtomicLong currentMaxDelayMillis = new AtomicLong(minTimeout.millis()); | ||
private AtomicBoolean isRunning = new AtomicBoolean(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, change it to isClosed
(or just closed
), so that we don't have to write == false
everywhere :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
private AtomicBoolean isRunning = new AtomicBoolean(true); | ||
|
||
void scheduleNextElection(final TimeValue gracePeriod, final Runnable scheduledRunnable) { | ||
final long delay; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while declare it here? Just declare where it's defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hangover from an earlier implementation, now fixed.
|
||
private DeterministicTaskQueue deterministicTaskQueue; | ||
private ElectionSchedulerFactory electionSchedulerFactory; | ||
private boolean electionStarted = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to live at the class level? can we move this to the test method? Same for electionSchedulerFactory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, only used by one test now, so can be local to that (with suitable parameters)
assertThat(electionDelay, greaterThanOrEqualTo(initialGracePeriod.millis())); | ||
|
||
// Check upper bound | ||
assertThat(electionDelay, lessThanOrEqualTo(ELECTION_MIN_TIMEOUT_SETTING.get(Settings.EMPTY).millis() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we inject random settings with min timeout/ max timeout/backoff to test this on different settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. This was a little painful because putting a TimeValue
into a Settings.Builder
converts it (lossily) to a string and then back again, and even rejects things with a fraction like 1.1m
. Putting a specific number of milliseconds in as a string is the answer.
|
||
// bounds on the time between election attempts | ||
private static final String ELECTION_MIN_TIMEOUT_SETTING_KEY = "cluster.election.min_timeout"; | ||
private static final String ELECTION_BACK_OFF_TIME_SETTING_KEY = "cluster.election.back_off_time"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you document how the backoff works, i.e. the initial interval, then the subsequent intervals we consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some more comments, mostly very minor stuff, and left a few suggestions which I would like to get your opinion on.
* The first election is scheduled to occur a random number of milliseconds after the scheduler is started, where the random number of | ||
* milliseconds is chosen uniformly from | ||
* | ||
* (0, min(ELECTION_INITIAL_TIMEOUT_SETTING, ELECTION_MAX_TIMEOUT_SETTING_KEY)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove _KEY
* For `n > 1`, the `n`th election is scheduled to occur a random number of milliseconds after the `n - 1`th election, where the random | ||
* number of milliseconds is chosen uniformly from | ||
* | ||
* (0, min(ELECTION_INITIAL_TIMEOUT_SETTING + (n-1) * ELECTION_BACK_OFF_TIME_SETTING, ELECTION_MAX_TIMEOUT_SETTING_KEY)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove _KEY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops.
* @param scheduledRunnable The action to run each time an election should be attempted. | ||
*/ | ||
public Releasable startElectionScheduler(TimeValue gracePeriod, Runnable scheduledRunnable) { | ||
final ElectionScheduler currentScheduler = new ElectionScheduler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just call this scheduler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
|
||
/** | ||
* @param randomSupplier supplier of randomly-chosen longs | ||
* @param upperBound exclusive upper bound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also document @return
here, to denote what's returned? This will go nicely with your explanation above about the uniformly chosen value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😐 it returns a random positive long that's less than upperBound
.
(actually neater to make it an inclusive upper bound now - this is on it way)
* The current maximum timeout: the next election is scheduled randomly no later than this number of milliseconds in the future. On | ||
* each election attempt this value is increased by `backoffTime`, up to the `maxTimeout`, to adapt to higher-than-expected latency. | ||
*/ | ||
private final AtomicLong currentMaxTimeoutMillis = new AtomicLong(Math.min(initialTimeout.millis(), maxTimeout.millis())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not enforce the condition on maxTimeout to be >= initialTimeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If maxTimeout < initialTimeout
then nothing particularly bad happens. I added this call to Math.min
because the spec is more uniform with it there than not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, nothing bad happens, but it might still be a user configuration error (user configured initalTimeout, but forgot to set maxTimeout). I think it's nicer to just throw an exception and report it.
public boolean isForceExecution() { | ||
// There are very few of these scheduled, and they back off, but it's important that they're not rejected as | ||
// this could prevent a cluster from ever forming. | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not sure this is needed. Have you checked this? If not, let's add a TODO here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ThreadPool#schedule()
wraps this up in a ThreadedRunnable
, but I think that's what we want. When the ThreadedRunnable
finally runs, it runs this AbstractRunnable
directly using the executor, so rejections seem to be handled right.
As far as I can see there's nothing special about whether the generic threadpool does rejections, but perhaps I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generic threadpool has an unbounded queue, so rejections should not happen unless the executor is shutdown. In that case, there's no need for this code to still execute.
|
||
@Override | ||
public void close() { | ||
boolean isClosedChanged = isClosed.compareAndSet(false, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this previouslyNotClosed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
* For `n > 1`, the `n`th election is scheduled to occur a random number of milliseconds after the `n - 1`th election, where the random | ||
* number of milliseconds is chosen uniformly from | ||
* | ||
* (0, min(ELECTION_INITIAL_TIMEOUT_SETTING + (n-1) * ELECTION_BACK_OFF_TIME_SETTING, ELECTION_MAX_TIMEOUT_SETTING_KEY)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this formula. I wonder if we should try to implement ElectionScheduler exactly this way. We would "just" need to keep track of the round n (which might be useful anyhow for debugging purposes), and there would be no need for currentMaxTimeoutMillis, which feels more awkward. The only arithmetic calculation we would need to do then would be just a straight-forward implementation of this formula, which would preferably be done in a single static method at the top-level here. This would also allow us to get rid of the methods randomPositiveLongLessThan
and backOffCurrentMaxTimeout
, both of which could be folded into this static method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I did this, but there's now potential for integer overflow, so perhaps we do need that upper bound on the settings?
Also I'm unconvinced at the need for a static method to do this calculation. We test it via the actual timeouts chosen, so why extract it?
Also randomPositiveLongAtMost
is still needed, and you wanted a test for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On reflection, there was already potential for integer overflow without an upper bound on the settings and this change makes it no worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were two remaining discussion points in #32709 - I brought them over here.
public static final Setting<TimeValue> ELECTION_BACK_OFF_TIME_SETTING = Setting.timeSetting(ELECTION_BACK_OFF_TIME_SETTING_KEY, | ||
TimeValue.timeValueMillis(100), TimeValue.timeValueMillis(1), Property.NodeScope); | ||
|
||
public static final Setting<TimeValue> ELECTION_MAX_TIMEOUT_SETTING = Setting.timeSetting(ELECTION_MAX_TIMEOUT_SETTING_KEY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch you asked for a Setting.timeSetting()
that allowed us to specify an upper bound for these settings as well as a lower bound. I don't think we need this - the consequences of setting ELECTION_MAX_TIMEOUT_SETTING
too high do not seem terribly surprising, and documenting this seems sufficient.
} | ||
|
||
@Override | ||
public boolean isForceExecution() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch I didn't totally understand what action you wanted here (see
#32709 (comment)).
public boolean isForceExecution() { | ||
// There are very few of these scheduled, and they back off, but it's important that they're not rejected as | ||
// this could prevent a cluster from ever forming. | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generic threadpool has an unbounded queue, so rejections should not happen unless the executor is shutdown. In that case, there's no need for this code to still execute.
} | ||
|
||
final long thisAttempt = attempt.getAndIncrement(); | ||
final long maxDelayMillis = Math.min(maxTimeout.millis(), initialTimeout.millis() + thisAttempt * backoffTime.millis()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned the overflow that could happen here. Can you please add a safeguard? Otherwise I might not be able to sleep anymore 😨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It's a bit daft because now that there are upper limits on the settings it would take solidly more than a million years to overflow, but there you go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you put it like that, you can undo c08cd8b
* The current maximum timeout: the next election is scheduled randomly no later than this number of milliseconds in the future. On | ||
* each election attempt this value is increased by `backoffTime`, up to the `maxTimeout`, to adapt to higher-than-expected latency. | ||
*/ | ||
private final AtomicLong currentMaxTimeoutMillis = new AtomicLong(Math.min(initialTimeout.millis(), maxTimeout.millis())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, nothing bad happens, but it might still be a user configuration error (user configured initalTimeout, but forgot to set maxTimeout). I think it's nicer to just throw an exception and report it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (two nits)
The ElectionScheduler runs while there is no known elected master and is
responsible for scheduling elections randomly, backing off on failure, to
balance the desire to elect a master quickly with the desire to avoid more than
one node starting an election at once.