Defer reroute when starting shards #44433
Conversation
Today we reroute the cluster as part of the process of starting a shard, which runs at `URGENT` priority. In large clusters, rerouting may take some time to complete, and this means that a mere trickle of shard-started events can cause starvation for other, lower-priority, tasks that are pending on the master. However, it isn't really necessary to perform a reroute when starting a shard, as long as one occurs eventually. This commit removes the inline reroute from the process of starting a shard and replaces it with a deferred one that runs at `NORMAL` priority, avoiding starvation of higher-priority tasks. This may improve some of the situations related to elastic#42738 and elastic#42105.
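The starvation mechanism described above can be sketched as a toy model of the master's pending-task queue, which executes tasks strictly in priority order. The class, task names, and `Priority` enum below are illustrative stand-ins, not Elasticsearch classes: they only show why moving the follow-up reroute from `URGENT` to `NORMAL` lets higher-priority pending work run first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of the master's pending-task queue: tasks run strictly in
// priority order, so an URGENT inline reroute always preempts HIGH work.
public class RerouteQueueSketch {
    enum Priority { URGENT, HIGH, NORMAL } // ordinal = execution order

    record Task(Priority priority, String name, long seq) {}

    // Drain the queue, breaking priority ties by submission order.
    public static List<String> run(List<Task> submitted) {
        PriorityQueue<Task> queue = new PriorityQueue<>(
            (a, b) -> a.priority() != b.priority()
                ? a.priority().compareTo(b.priority())
                : Long.compare(a.seq(), b.seq()));
        queue.addAll(submitted);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().name());
        }
        return order;
    }

    public static void main(String[] args) {
        // Before this PR: shard-started and its reroute both run at URGENT,
        // so a pending HIGH-priority task waits behind the reroute.
        List<String> before = run(List.of(
            new Task(Priority.URGENT, "shard-started", 0),
            new Task(Priority.URGENT, "inline-reroute", 1),
            new Task(Priority.HIGH, "put-mapping", 2)));
        // After: the deferred reroute runs at NORMAL, behind HIGH tasks.
        List<String> after = run(List.of(
            new Task(Priority.URGENT, "shard-started", 0),
            new Task(Priority.NORMAL, "deferred-reroute", 1),
            new Task(Priority.HIGH, "put-mapping", 2)));
        System.out.println(before); // reroute precedes put-mapping
        System.out.println(after);  // put-mapping precedes reroute
    }
}
```

With a trickle of shard-started events, the "before" ordering repeats indefinitely, which is the starvation the PR description refers to.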
Pinging @elastic/es-distributed
Note to reviewers: don't let the |
@DaveCTurner seems there's a problem with the setting change on CI and it's probably just missing a warning exclusion in the rest tests:
We cannot set the priority in all InternalTestClusters because the deprecation warning makes some tests unhappy. This commit adds a specific test instead.
Bit more complicated than that - I thought I could get away with something simple but got bitten by the deprecation logger, so had to add a specific test in 5fc2e7e.
Jenkins run elasticsearch-ci/2 (just the xpack license test failing)
Code looks good, just one question.
Also, to me the change to `NORMAL` priority/not-inline-execution makes sense here but maybe Yannick should confirm as well :)
I wonder though if we really should hide this change as much as it's hidden now. Couldn't we add a note on this change stating that this might be a problem if your cluster is already badly configured (master overloaded + lots of shard fluctuation?) but is an optimization for a healthy setup? That way users affected negatively by this might be more likely to either understand/fix the problem or report back in a more meaningful way?
server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java
Outdated
Sure. There's no way to express changes to the migration docs in a PR against

```diff
diff --git a/docs/reference/migration/migrate_7_4.asciidoc b/docs/reference/migration/migrate_7_4.asciidoc
index ebfca7d25c1..63315f9fb65 100644
--- a/docs/reference/migration/migrate_7_4.asciidoc
+++ b/docs/reference/migration/migrate_7_4.asciidoc
@@ -67,4 +67,20 @@ unsupported on buckets created after September 30th 2020.
 Starting in version 7.4, a `+` in a URL will be encoded as `%2B` by all REST API functionality. Prior versions handled a `+` as a single space.
 If your application requires handling `+` as a single space you can return to the old behaviour by setting the system property
 `es.rest.url_plus_as_space` to `true`. Note that this behaviour is deprecated and setting this system property to `true` will cease
-to be supported in version 8.
\ No newline at end of file
+to be supported in version 8.
+
+[float]
+[[breaking_74_cluster_changes]]
+=== Cluster changes
+
+[float]
+==== Rerouting after starting a shard runs at lower priority
+
+After starting each shard the elected master node must perform a reroute to
+search for other shards that could be allocated. In particular, when creating
+an index it is this task that allocates the replicas once the primaries have
+started. In versions prior to 7.4 this task runs at priority `URGENT`, but
+starting in version 7.4 its priority is reduced to `NORMAL`. In a
+well-configured cluster this reduces the amount of work the master must do, but
+means that a cluster with a master that is overloaded with other tasks at
+`HIGH` or `URGENT` priority may take longer to allocate all replicas.
```
LGTM :) thanks @DaveCTurner
I've left a few small comments, looking good o.w.
server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java
server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java
Outdated
LGTM
* Defer reroute when starting shards

  Today we reroute the cluster as part of the process of starting a shard, which runs at `URGENT` priority. In large clusters, rerouting may take some time to complete, and this means that a mere trickle of shard-started events can cause starvation for other, lower-priority, tasks that are pending on the master. However, it isn't really necessary to perform a reroute when starting a shard, as long as one occurs eventually. This commit removes the inline reroute from the process of starting a shard and replaces it with a deferred one that runs at `NORMAL` priority, avoiding starvation of higher-priority tasks. This may improve some of the situations related to elastic#42738 and elastic#42105.

* Specific test case for followup priority setting

  We cannot set the priority in all InternalTestClusters because the deprecation warning makes some tests unhappy. This commit adds a specific test instead.

* Checkstyle

* Cluster state always changed here

* Assert consistency of routing nodes

* Restrict setting only to reasonable priorities
The change in elastic#44433 introduces a state in which the cluster has no relocating shards but still has a pending reroute task which might start a shard relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because it did not wait for this pending task to complete too, reporting more active shards than expected:

```
2> java.lang.AssertionError: Expected: <9> but: was <10>
   at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0)
   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
   at org.junit.Assert.assertThat(Assert.java:956)
   at org.junit.Assert.assertThat(Assert.java:923)
   at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97)
```

This commit addresses this failure by waiting until there are neither pending tasks nor shard relocations in progress.
Today we reroute the cluster as part of the process of starting a shard, which runs at `URGENT` priority. In large clusters, rerouting may take some time to complete, and this means that a mere trickle of shard-started events can cause starvation for other, lower-priority, tasks that are pending on the master. However, it isn't really necessary to perform a reroute when starting a shard, as long as one occurs eventually. This commit removes the inline reroute from the process of starting a shard and replaces it with a deferred one that runs at `NORMAL` priority, avoiding starvation of higher-priority tasks. Backport of #44433 and #44543.
In elastic#44433 we introduced a temporary (immediately deprecated) escape-hatch setting to control the priority of the reroute scheduled after starting a batch of shards. This commit removes this setting in `master`, fixing the followup reroute's priority at `NORMAL`.
Adds a `waitForEvents(Priority.LANGUID)` to the cluster health request in `ESIntegTestCase#waitForRelocation()` to deal with the case that this health request returns successfully despite the fact that there is a pending reroute task which will relocate another shard. Relates elastic#44433 Fixes elastic#45003
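The fix above hinges on the wait condition covering both pending tasks and relocations: a health check that only samples relocating shards can pass just before a pending reroute starts a new relocation. A generic sketch of that quiescence wait, using hypothetical supplier stand-ins rather than the real cluster-health API:

```java
import java.util.function.BooleanSupplier;

// Sketch of the "wait for quiescence" fix: poll until, in the same
// observation, there are no pending master tasks AND no relocating shards.
// The suppliers are illustrative stand-ins for cluster-health checks.
public class WaitForRelocationSketch {
    public static boolean waitUntilQuiet(BooleanSupplier noPendingTasks,
                                         BooleanSupplier noRelocations,
                                         long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            // Both conditions must hold together: checking relocations alone
            // races with a pending reroute that is about to start one.
            if (noPendingTasks.getAsBoolean() && noRelocations.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10); // back off briefly between polls
        }
        return false; // timed out before the cluster went quiet
    }
}
```

In the real test fix, waiting for a `LANGUID`-priority marker event plays the "no pending tasks" role, since it cannot complete until every higher-priority pending task (including the deferred `NORMAL` reroute) has run.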
This adds the `IndexBalanceTests` ES tests based on elastic/elasticsearch@5dda2b0 because the more recent version requires a not yet backport patch (elastic/elasticsearch#44433).