Optimize log cluster health performance. #87723

Merged
merged 12 commits into from
Jul 1, 2022

Contributor

@howardhuanghua howardhuanghua commented Jun 16, 2022

Cluster health is checked in every allocation task. Because constructing the cluster state health iterates over all shards, logClusterHealthStateChange is costly when the cluster has a huge number of shards.

70.4% (352ms out of 500ms) cpu usage by thread 'elasticsearch[1592969635000032535][transport_worker][T#32]'
4/10 snapshots sharing following 62 elements
java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1041)
java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1040)
java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1041)
java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1040)
org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:155)
org.elasticsearch.cluster.health.ClusterShardHealth.<init>(ClusterShardHealth.java:90)
org.elasticsearch.cluster.health.ClusterIndexHealth.<init>(ClusterIndexHealth.java:121)
org.elasticsearch.cluster.health.ClusterStateHealth.<init>(ClusterStateHealth.java:78)

With this commit, we can skip the redundant shard iteration and determine the cluster status quickly:

  1. If there are no unassigned or inactive shards, return GREEN directly.
  2. If there are no unassigned or inactive primary shards, return YELLOW directly.
  3. Otherwise, iterate over the indices and return RED as soon as one RED index is found.
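
The three checks above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `QuickHealthSketch`, `IndexSummary`, and `quickStatus` are hypothetical names, the shard counters are assumed to be available up front, and the fallback to YELLOW when no RED index is found is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: simplified stand-ins, not Elasticsearch classes. */
final class QuickHealthSketch {

    enum Status { GREEN, YELLOW, RED }

    /** Hypothetical per-index summary; red means the index health is RED. */
    static final class IndexSummary {
        final String name;
        final boolean red;

        IndexSummary(String name, boolean red) {
            this.name = name;
            this.red = red;
        }
    }

    static Status quickStatus(int inactiveShards, int inactivePrimaries, List<IndexSummary> indices) {
        // 1. Nothing is unassigned or inactive: GREEN without touching any shard.
        if (inactiveShards == 0) {
            return Status.GREEN;
        }
        // 2. Only replicas are affected: YELLOW, again without iterating.
        if (inactivePrimaries == 0) {
            return Status.YELLOW;
        }
        // 3. Some primary is not active: scan the indices and stop at the first RED one.
        for (IndexSummary index : indices) {
            if (index.red) {
                return Status.RED;
            }
        }
        // Assumption of this sketch: no RED index found means the cluster is YELLOW.
        return Status.YELLOW;
    }

    public static void main(String[] args) {
        List<IndexSummary> healthy = Arrays.asList(new IndexSummary("logs", false));
        List<IndexSummary> broken = Arrays.asList(
            new IndexSummary("logs", false),
            new IndexSummary("metrics", true));
        System.out.println(quickStatus(0, 0, healthy)); // no inactive shards
        System.out.println(quickStatus(3, 0, healthy)); // only replicas inactive
        System.out.println(quickStatus(3, 1, broken));  // one index is RED
    }
}
```

Only the third branch does any iteration, and it works per index rather than per shard, which is where the saving on clusters with many shards would come from.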

@elasticsearchmachine
Collaborator

@howardhuanghua please enable the option "Allow edits and access to secrets by maintainers" on your PR. For more information, see the documentation.

@elasticsearchmachine elasticsearchmachine added v8.4.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jun 16, 2022
@pquentin pquentin added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Jun 16, 2022
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jun 16, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Member

@original-brownbear original-brownbear left a comment

Thanks @howardhuanghua this is a really nice idea + speedup.
See my comments inline: I'm a little hesitant about using RoutingNodes here and suggested a smaller change for a first step. Let me know what you think.

Member

@original-brownbear original-brownbear left a comment

Thanks @howardhuanghua looks really nice now! Just one question/request left.

@original-brownbear
Member

Jenkins test this

@howardhuanghua
Contributor Author

Fixed changelog area value issue.

@original-brownbear
Member

Jenkins test this

Member

@original-brownbear original-brownbear left a comment

Tests found a small bug in this, see comment inline.

ClusterHealthStatus computeStatus = ClusterHealthStatus.GREEN;
for (String index : clusterState.metadata().getConcreteAllIndices()) {
    IndexRoutingTable indexRoutingTable = clusterState.routingTable().index(index);
    if (indexRoutingTable.allShardsActive()) {
Member

We're currently failing a test here if indexRoutingTable turns out to be null (which can happen and is not a bug). We have to short-circuit on indexRoutingTable == null here as well; that should fix things.
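
A rough sketch of what such a guard could look like, using simplified stand-in types rather than the real routing classes. `NullSafeIndexScanSketch`, `IndexRouting`, and the map-based lookup are hypothetical. Skipping an index that has no routing entry is an assumption about the intended behavior, and treating any inactive primary as RED simplifies the real health rules.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: simplified stand-ins, not Elasticsearch classes. */
final class NullSafeIndexScanSketch {

    enum Status { GREEN, YELLOW, RED }

    /** Hypothetical stand-in for an index routing table. */
    static final class IndexRouting {
        final boolean allShardsActive;
        final boolean allPrimariesActive;

        IndexRouting(boolean allShardsActive, boolean allPrimariesActive) {
            this.allShardsActive = allShardsActive;
            this.allPrimariesActive = allPrimariesActive;
        }
    }

    static Status computeStatus(List<String> indexNames, Map<String, IndexRouting> routingTable) {
        Status computed = Status.GREEN;
        for (String name : indexNames) {
            IndexRouting routing = routingTable.get(name);
            // An index may be listed without a routing entry; skip it
            // instead of dereferencing null (assumed intended behavior).
            if (routing == null || routing.allShardsActive) {
                continue;
            }
            // Simplification: any inactive primary makes the index RED.
            if (!routing.allPrimariesActive) {
                return Status.RED;
            }
            // Only replicas are inactive for this index.
            computed = Status.YELLOW;
        }
        return computed;
    }

    public static void main(String[] args) {
        Map<String, IndexRouting> table = new HashMap<>();
        table.put("logs", new IndexRouting(true, true));
        // "missing" has no routing entry and is skipped without an exception.
        System.out.println(computeStatus(Arrays.asList("logs", "missing"), table));
    }
}
```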

Contributor Author

Thank you, I have fixed it.

@original-brownbear
Member

Jenkins test this

Member

@original-brownbear original-brownbear left a comment

LGTM thanks @howardhuanghua this is a nice speedup in our many-shards benchmark run.

@original-brownbear original-brownbear merged commit 97df136 into elastic:master Jul 1, 2022
@howardhuanghua
Contributor Author

Thank you @original-brownbear.

@kaiwang-aviatrix

In the current version of Elasticsearch (8.12), this log message is gone, even after I restart Elasticsearch. Does anyone know the reason? I used to see a log line like [2024-02-26T22:15:53,372][INFO ][o.e.c.r.a.AllocationService] [ip-172-31-16-116] current.health="YELLOW" message="Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.ds-.logs-d Is there any way to enable it?

Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.4.0
6 participants