Optimize log cluster health performance. #87723

howardhuanghua · 2022-06-16T09:07:48Z

Cluster health is checked in every allocation task. Because constructing the cluster state health iterates over all shards, logClusterHealthStateChange becomes costly when the cluster has a huge number of shards.

70.4% (352ms out of 500ms) cpu usage by thread 'elasticsearch[1592969635000032535][transport_worker][T#32]'
4/10 snapshots sharing following 62 elements
java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1041)
java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1040)
java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1041)
java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1040)
org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:155)
org.elasticsearch.cluster.health.ClusterShardHealth.<init>(ClusterShardHealth.java:90)
org.elasticsearch.cluster.health.ClusterIndexHealth.<init>(ClusterIndexHealth.java:121)
org.elasticsearch.cluster.health.ClusterStateHealth.<init>(ClusterStateHealth.java:78)

With this commit, we can skip the redundant shard iteration and determine the cluster status quickly:

If there are no unassigned or inactive shards, return GREEN directly.
If there are no unassigned or inactive primary shards, return YELLOW directly.
Otherwise, iterate over the indices and return RED as soon as one RED index is found.
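
The three checks above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class QuickHealthSketch, the IndexSummary type, and the two counter parameters are hypothetical stand-ins for whatever the cluster state actually exposes, and the final fallback to YELLOW is an assumption.

```java
import java.util.List;

final class QuickHealthSketch {
    enum Status { GREEN, YELLOW, RED }

    /** Hypothetical per-index summary; not an Elasticsearch type. */
    static final class IndexSummary {
        final String name;
        final boolean allPrimariesActive;

        IndexSummary(String name, boolean allPrimariesActive) {
            this.name = name;
            this.allPrimariesActive = allPrimariesActive;
        }
    }

    /**
     * Derives a status from two cluster-wide counters (assumed to be cheap to read)
     * and only falls back to a per-index scan when a primary is unavailable.
     */
    static Status quickStatus(int inactiveShards, int inactivePrimaries, List<IndexSummary> indices) {
        if (inactiveShards == 0) {
            return Status.GREEN;   // step 1: every shard is assigned and active
        }
        if (inactivePrimaries == 0) {
            return Status.YELLOW;  // step 2: only replicas are missing
        }
        for (IndexSummary index : indices) {
            if (!index.allPrimariesActive) {
                return Status.RED; // step 3: stop at the first RED index
            }
        }
        return Status.YELLOW;      // assumption: no RED index found, so report YELLOW
    }
}
```

The point of the ordering is that the common healthy case returns before any per-shard or per-index work is done.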

elasticsearchmachine · 2022-06-16T09:08:10Z

@howardhuanghua please enable the option "Allow edits and access to secrets by maintainers" on your PR. For more information, see the documentation.

elasticmachine · 2022-06-16T10:26:44Z

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear

Thanks @howardhuanghua this is a really nice idea + speedup.
See my comments inline: I'm a little hesitant about using RoutingNodes here and suggested a smaller change for a first step. Let me know what you think.

server/src/main/java/org/elasticsearch/cluster/routing/IndexRoutingTable.java

server/src/main/java/org/elasticsearch/cluster/health/ClusterStateHealth.java

original-brownbear

Thanks @howardhuanghua looks really nice now! Just one question/request left.

server/src/main/java/org/elasticsearch/cluster/health/ClusterStateHealth.java

original-brownbear · 2022-06-29T15:50:53Z

Jenkins test this

howardhuanghua · 2022-06-30T00:41:02Z

Fixed changelog area value issue.

original-brownbear · 2022-06-30T07:20:12Z

Jenkins test this

original-brownbear

Tests found a small bug in this, see comment inline.

original-brownbear · 2022-06-30T11:22:49Z

server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java

+        ClusterHealthStatus computeStatus = ClusterHealthStatus.GREEN;
+        for (String index : clusterState.metadata().getConcreteAllIndices()) {
+            IndexRoutingTable indexRoutingTable = clusterState.routingTable().index(index);
+            if (indexRoutingTable.allShardsActive()) {


We're currently failing a test here when this turns out to be null (which can happen and is not a bug). We have to short-circuit on indexRoutingTable == null here as well; that should fix things.

Thank you, I have fixed it.
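
For readers following the thread, one plausible shape of the null guard discussed above is sketched here. It is not the merged code: the Routing class and the map lookup are hypothetical stand-ins for IndexRoutingTable and clusterState.routingTable().index(index), and skipping an index without a routing table is an assumption about what "short-circuit" means.

```java
import java.util.List;
import java.util.Map;

final class NullGuardSketch {
    enum Status { GREEN, YELLOW, RED }

    /** Hypothetical stand-in for an index routing table. */
    static final class Routing {
        final boolean allShardsActive;
        final boolean allPrimaryShardsActive;

        Routing(boolean allShardsActive, boolean allPrimaryShardsActive) {
            this.allShardsActive = allShardsActive;
            this.allPrimaryShardsActive = allPrimaryShardsActive;
        }
    }

    static Status computeStatus(List<String> indexNames, Map<String, Routing> routingByIndex) {
        Status status = Status.GREEN;
        for (String name : indexNames) {
            Routing routing = routingByIndex.get(name);
            if (routing == null) {
                continue; // assumption: an index without a routing table is skipped
            }
            if (routing.allShardsActive) {
                continue; // this index is GREEN, nothing to downgrade
            }
            if (routing.allPrimaryShardsActive) {
                status = Status.YELLOW; // replicas missing; keep scanning for a RED index
                continue;
            }
            return Status.RED; // a primary is unavailable, no need to look further
        }
        return status;
    }
}
```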

original-brownbear · 2022-06-30T11:28:21Z

Jenkins test this

original-brownbear

LGTM thanks @howardhuanghua this is a nice speedup in our many-shards benchmark run.

howardhuanghua · 2022-07-01T22:49:11Z

Thank you @original-brownbear.

kaiwang-aviatrix · 2024-02-27T01:09:03Z

As of the current Elasticsearch version (8.12), this log line is gone, even after I restart Elasticsearch. Does anyone know the reason? We used to get a log line like:
[2024-02-26T22:15:53,372][INFO ][o.e.c.r.a.AllocationService] [ip-172-31-16-116] current.health="YELLOW" message="Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.ds-.logs-d
Is there any way to enable it?

howardhuanghua added 2 commits June 16, 2022 16:36

Optimize log cluster health performance.

e702916

fix unused import

d38ab3c

elasticsearchmachine added the v8.4.0 and external-contributor labels Jun 16, 2022

Add change log.

65ffd63

pquentin added the :Distributed Coordination/Allocation label Jun 16, 2022

elasticmachine added the Team:Distributed (Obsolete) label Jun 16, 2022

original-brownbear self-requested a review June 16, 2022 10:35

original-brownbear self-assigned this Jun 16, 2022

original-brownbear added the >enhancement label Jun 20, 2022

original-brownbear reviewed Jun 20, 2022

howardhuanghua added 3 commits June 20, 2022 19:12

Merge remote-tracking branch 'origin' into opt_log_health

68fb7ee

remote routing nodes logic in log health

5fa2ff4

Merge remote-tracking branch 'origin' into opt_log_health

9436e50

original-brownbear reviewed Jun 29, 2022

server/src/main/java/org/elasticsearch/cluster/health/ClusterStateHealth.java

howardhuanghua added 2 commits June 29, 2022 23:24

Merge remote-tracking branch 'origin' into opt_log_health

f7014ad

move getHealthStatus to AllocationService

8af1d28

original-brownbear closed this Jun 29, 2022

original-brownbear reopened this Jun 29, 2022

howardhuanghua added 2 commits June 30, 2022 08:38

Update changelog area value.

9d0eecf

Merge remote-tracking branch 'origin' into opt_log_health

2e65575

original-brownbear requested changes Jun 30, 2022

howardhuanghua added 2 commits June 30, 2022 19:27

Fix NPE issue.

5408aec

Merge remote-tracking branch 'origin' into opt_log_health

52c01d0

howardhuanghua requested a review from original-brownbear July 1, 2022 09:50

original-brownbear approved these changes Jul 1, 2022

original-brownbear merged commit 97df136 into elastic:master Jul 1, 2022
