Drain master task queue when stabilising #42504

DaveCTurner
Contributor

Today the default stabilisation time is calculated on the assumption that the
elected master has no pending tasks to process when it is elected, but this is
not a safe assumption to make. This can result in a cluster reaching the end of
its stabilisation time without having stabilised. Furthermore, in #36943 we
increased the probability that each step in `runRandomly()` enqueues another
task, vastly increasing the chance that we hit such a situation.

This change extends the stabilisation process to allow time for all pending
tasks, plus a task that might currently be in flight.

Fixes #41967, in which the master entered the stabilisation phase with over 800
tasks to process.
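
The calculation described above can be sketched roughly as follows. This is not the code from the merged commit: the class name, the method name, and both constants are hypothetical and chosen only for illustration. The sketch shows the idea of adding an allowance for every pending task, plus one more for a task that may already be in flight, on top of a default stabilisation time.

```java
// A minimal sketch, assuming a fixed per-task allowance.
// All names and values here are hypothetical, not taken from the Elasticsearch test code.
final class StabilisationSketch {

    // Hypothetical default stabilisation time, which assumes an empty task queue.
    static final long BASE_STABILISATION_MILLIS = 10_000L;

    // Hypothetical upper bound on the time the master needs to process one task.
    static final long PER_TASK_MILLIS = 200L;

    private StabilisationSketch() {
    }

    // Returns the default time extended by an allowance for each pending task,
    // plus one extra allowance for a task that might currently be in flight.
    static long stabilisationTimeMillis(int pendingTasks) {
        if (pendingTasks < 0) {
            throw new IllegalArgumentException("pendingTasks must be non-negative");
        }
        return BASE_STABILISATION_MILLIS + (pendingTasks + 1L) * PER_TASK_MILLIS;
    }
}
```

Under these assumed values, a master entering stabilisation with 800 pending tasks would be allowed 801 task slots beyond the default, rather than none.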

@DaveCTurner added the >test, :Distributed Coordination/Cluster Coordination, v8.0.0 and v7.3.0 labels May 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@ywelsch ywelsch left a comment

LGTM

Contributor

@henningandersen henningandersen left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 3907a6d into elastic:master May 24, 2019
@DaveCTurner DaveCTurner deleted the 2019-05-24-drain-master-task-queue-in-stabilisation branch May 24, 2019 13:17
DaveCTurner added a commit that referenced this pull request May 24, 2019
DaveCTurner added a commit that referenced this pull request May 24, 2019
DaveCTurner added a commit that referenced this pull request May 24, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
henningandersen pushed a commit that referenced this pull request Jun 10, 2019
Labels
:Distributed Coordination/Cluster Coordination, >test, v7.2.0, v7.3.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

CoordinatorTests.testDiscoveryUsesNodesFromLastClusterState test failure
5 participants