
Replace health request with a state observer. #88641

Merged · 6 commits · Jul 21, 2022

Conversation

idegtiarenko (Contributor):

This was using a health request, assuming a yellow index is one with all primaries
initialized. This is not true, as a newly created index remains yellow while
primary allocation is throttled. This was discovered while working on a new
shards allocator.
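To illustrate why "yellow" is a weaker signal than "all primaries started", here is a simplified, hypothetical model (names and rules are illustrative, not the Elasticsearch implementation) of the health rule that reports an inactive primary of a freshly created index as YELLOW rather than RED. Two very different cluster states produce the same YELLOW answer, which is why observing the shard routing state directly is more reliable:

```java
// Simplified model of index health vs. primary shard state.
// Names and rules are illustrative; the real logic lives in
// Elasticsearch's cluster health computation.
import java.util.List;

public class YellowIsNotStarted {

    enum Health { GREEN, YELLOW, RED }

    enum UnassignedReason { INDEX_CREATED, NODE_LEFT }

    // One primary/replica pair of a single-shard index.
    record ShardPair(boolean primaryActive, boolean replicaActive, UnassignedReason primaryUnassignedReason) {}

    // Health rule sketch: an inactive primary normally means RED, but a
    // just-created index (whose allocation may merely be throttled) reports YELLOW.
    static Health indexHealth(ShardPair s) {
        if (!s.primaryActive()) {
            return s.primaryUnassignedReason() == UnassignedReason.INDEX_CREATED ? Health.YELLOW : Health.RED;
        }
        return s.replicaActive() ? Health.GREEN : Health.YELLOW;
    }

    // What a state observer can check instead of health status.
    static boolean allPrimariesActive(List<ShardPair> shards) {
        return shards.stream().allMatch(ShardPair::primaryActive);
    }

    public static void main(String[] args) {
        // New index, primary allocation throttled: YELLOW, primary NOT started.
        ShardPair throttled = new ShardPair(false, false, UnassignedReason.INDEX_CREATED);
        // Primary started, replica still unassigned: also YELLOW.
        ShardPair started = new ShardPair(true, false, null);

        System.out.println(indexHealth(throttled)); // YELLOW
        System.out.println(indexHealth(started));   // YELLOW
        System.out.println(allPrimariesActive(List.of(throttled))); // false
        System.out.println(allPrimariesActive(List.of(started)));   // true
    }
}
```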

@idegtiarenko added labels >bug, :Distributed Coordination/Allocation, Team:Distributed (Obsolete), v8.4.0 on Jul 20, 2022
@idegtiarenko idegtiarenko requested a review from DaveCTurner July 20, 2022 11:05
elasticsearchmachine (Collaborator):

Hi @idegtiarenko, I've created a changelog YAML for you.

DaveCTurner (Contributor) left a comment:

LGTM but I'd like Tim to take a look too.

@DaveCTurner DaveCTurner requested a review from Tim-Brooks July 20, 2022 11:28
DaveCTurner (Contributor):

Actually I think we should have a test that exposes this change. It looks to me like GetGlobalCheckpointsActionIT#testWaitOnPrimaryShardsReady tries to work around the problem we encountered by creating a red index and then moving it to yellow later. With this change it should be possible to remove that I think?

Tim-Brooks (Contributor) left a comment:

The changes look fine to me. Agree with David's comment.

elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed (Team:Distributed)

```java
            Settings.builder().putNull(CLUSTER_ROUTING_ALLOCATION_NODE_INITIAL_PRIMARIES_RECOVERIES_SETTING.getKey()).build()
        )
        .get();
}
```
idegtiarenko (Contributor, Author):

The inline diff is totally confusing; consider reviewing in split mode.

In short, this test:

  • sets cluster.routing.allocation.node_initial_primaries_recoveries=0 so that every created index is throttled
  • creates an index
  • executes GetGlobalCheckpointsAction with a timeout
  • asserts that it complains about unavailable shards rather than anything else
  • reverts the setting to its default value
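For reference, the first step corresponds to a cluster settings update such as the following body sent to `PUT _cluster/settings` (0 allows no concurrent initial primary recoveries per node, so every new primary allocation is throttled); the last step reverts it by setting the value to `null`:

```json
{
  "persistent": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 0
  }
}
```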

idegtiarenko (Contributor, Author):

I think I could also rewrite it to revert the setting after executing GetGlobalCheckpointsAction and assert that it gets a result back, but I am not sure that would make the test better.

DaveCTurner (Contributor):

Yes I think we want that. As it stands the test suite still passes on master today (i.e. reverting the production-code changes in this PR). I think we have to check the success case here to see the value in this change.

@idegtiarenko idegtiarenko requested a review from DaveCTurner July 21, 2022 07:11
```java
    .cluster()
    .prepareUpdateSettings()
    .setPersistentSettings(
        Settings.builder().put(CLUSTER_ROUTING_ALLOCATION_NODE_INITIAL_PRIMARIES_RECOVERIES_SETTING.getKey(), 0).build()
```
DaveCTurner (Contributor):

Hmm it seems like a bug that we even permit 0 here. I don't see a good reason to do this in production and it would be pretty harmful to do this accidentally. Ok for now, but if we fixed this bug we'd need to find some other way to delay allocation.
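The reviewer's point about not permitting 0 could be enforced with a lower bound at parse time. A hypothetical, self-contained sketch (the real setting would declare its minimum through Elasticsearch's settings infrastructure rather than a hand-rolled check; the method name here is illustrative):

```java
public class BoundedSetting {
    // Hypothetical validator: reject values that would disable initial
    // primary recoveries entirely (anything below 1).
    static int parseInitialPrimariesRecoveries(int value) {
        if (value < 1) {
            throw new IllegalArgumentException("value [" + value + "] must be >= 1");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseInitialPrimariesRecoveries(4)); // accepted
        try {
            parseInitialPrimariesRecoveries(0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With such a bound in place, a test that needs delayed allocation would have to find another mechanism, as the comment above notes.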

```java
}

public void testWaitOnPrimaryShardsReady() throws Exception {
```
DaveCTurner (Contributor):

I think we don't want to remove this test. It's still valid isn't it?

I was expecting us to strengthen this test to verify that creating the index concurrently with running the action still works. It turns out we already have that test, so we can leave this one as-is IMO.

@idegtiarenko idegtiarenko requested a review from DaveCTurner July 21, 2022 09:07
DaveCTurner (Contributor) left a comment:

LGTM

@idegtiarenko idegtiarenko merged commit 3e85195 into elastic:master Jul 21, 2022
@idegtiarenko idegtiarenko deleted the fix_testWaitOnIndexCreated branch July 21, 2022 13:36
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 22, 2022