Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for pending tasks in TransportSearchFailuresIT #44543

Conversation

DaveCTurner
Copy link
Contributor

The change in #44433 introduces a state in which the cluster has no relocating
shards but still has a pending reroute task which might start a shard
relocation. TransportSearchFailuresIT failed on a PR build seemingly because
it did not wait for this pending task to complete too, reporting more active shards
than expected:

2> java.lang.AssertionError:
  Expected: <9>
       but: was <10>
      at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0)
      at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
      at org.junit.Assert.assertThat(Assert.java:956)
      at org.junit.Assert.assertThat(Assert.java:923)
      at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97)

This commit addresses this failure by waiting until there are neither pending
tasks nor shard relocations in progress.

The change in elastic#44433 introduces a state in which the cluster has no relocating
shards but still has a pending reroute task which might start a shard
relocation. `TransportSearchFailuresIT` failed on a PR build seemingly because
it did not wait for this pending task to complete too, reporting more active shards
than expected:

    2> java.lang.AssertionError:
      Expected: <9>
           but: was <10>
          at __randomizedtesting.SeedInfo.seed([4057CA4301FE95FA:207EC88573747235]:0)
          at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
          at org.junit.Assert.assertThat(Assert.java:956)
          at org.junit.Assert.assertThat(Assert.java:923)
          at org.elasticsearch.search.basic.TransportSearchFailuresIT.testFailedSearchWithWrongQuery(TransportSearchFailuresIT.java:97)

This commit addresses this failure by waiting until there are neither pending
tasks nor shard relocations in progress.
@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 labels Jul 18, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

@DaveCTurner DaveCTurner merged commit fd5b885 into elastic:master Jul 18, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-18-fix-TransportSearchFailuresIT branch July 18, 2019 10:13
DaveCTurner added a commit that referenced this pull request Jul 18, 2019
Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

Backport of #44433 and #44543.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test Issues or PRs that are addressing/adding tests v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants